Feature #8651

Feature #5734: Monitor servers

Configure the receiving side of the monitoring notifications

Added by intrigeri over 4 years ago. Updated over 3 years ago.

Status: Resolved
Priority: Normal
Assignee: -
Category: Infrastructure
Target version:
Start date: 01/09/2015
Due date:
% Done: 100%
Feature Branch:
Type of work: Sysadmin
Blueprint:
Starter:
Affected tool:

Description

This implies discussing with the rest of the sysadmin team how/when we want to get notifications.


Related issues

Blocks Tails - Feature #9484: Deploy the monitoring setup to production Resolved 01/09/2015

History

#2 Updated by intrigeri over 4 years ago

  • Blocks Feature #9484: Deploy the monitoring setup to production added

#3 Updated by bertagaz almost 4 years ago

  • Target version changed from Tails_1.8 to Tails_2.0

Postponing

#4 Updated by bertagaz over 3 years ago

  • Target version changed from Tails_2.0 to Tails_2.2

Postponing this part of the monitoring setup, as it is unlikely to be done by the previously planned deadline.

#6 Updated by bertagaz over 3 years ago

  • Target version changed from Tails_2.2 to Tails_2.3

#7 Updated by bertagaz over 3 years ago

  • Status changed from Confirmed to In Progress
  • % Done changed from 0 to 10

I've enabled notifications for me with commits referencing this ticket on puppet-tails.

For the moment they are sent only to me, and are enabled for all hosts, the HTTP checks, the whisperback check, and the disk checks.

They are sent when the type of event is Problem or Recovery.

#8 Updated by bertagaz over 3 years ago

  • Assignee changed from bertagaz to intrigeri
  • QA Check set to Info Needed

I've run some tests to understand how Icinga2 sends notifications.

At the moment, the notifications are configured to be sent when a service is "OK", "WARNING" or "CRITICAL", and when the event type is "Problem", "Recovery", "Acknowledgement", "DowntimeStart" or "DowntimeEnd".

It means:

We'll get notified if a Downtime starts or ends, or if someone acknowledges a problem.
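A minimal sketch of how these two filters combine, assuming an event must match both the state list and the type list to produce a notification. The state and type names are the ones quoted above; `passes_filters` is a hypothetical helper for illustration, not Icinga2 code.

```python
# States and event types as listed in this comment.
NOTIFY_STATES = {"OK", "WARNING", "CRITICAL"}
NOTIFY_TYPES = {
    "Problem",
    "Recovery",
    "Acknowledgement",
    "DowntimeStart",
    "DowntimeEnd",
}


def passes_filters(state: str, event_type: str) -> bool:
    """Return True only when both the state and the event type are allowed.

    Simplified model (assumption): a notification goes out only if the
    service state AND the event type are both in the configured sets.
    """
    return state in NOTIFY_STATES and event_type in NOTIFY_TYPES


if __name__ == "__main__":
    for state, event_type in [
        ("CRITICAL", "Problem"),
        ("OK", "Recovery"),
        ("UNKNOWN", "Problem"),
        ("CRITICAL", "FlappingStart"),
    ]:
        print(state, event_type, passes_filters(state, event_type))
```

Under this model, an "UNKNOWN" state or a flapping event would not trigger an email with the current settings.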

When a service check starts to fail, Icinga2 retries x times (configured with max_attempts, which is 5 at the moment), waiting retry_interval between attempts. If the check is still failing after that, it sends a "Problem" notification.

It will then keep checking the service every check_interval, and re-send the notification every interval (30 minutes at the moment). Once an acknowledgement has been made, it stops sending "Problem" notifications, but still sends the "Recovery" one once the check succeeds again.
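The timeline described above can be sketched roughly as follows. This is a simplified model, not Icinga2's scheduler: the first failed check counts as attempt 1, the `retry_interval` default of 1 minute is an assumption (the actual value is not stated here), and `problem_notification_times` is a hypothetical helper.

```python
from datetime import timedelta


def problem_notification_times(
    outage,
    max_attempts=5,                          # value quoted above
    retry_interval=timedelta(minutes=1),     # assumed, not from the ticket
    renotify_interval=timedelta(minutes=30), # value quoted above
    acknowledged_after=None,
):
    """Return offsets (from the first failed check) of "Problem" mails.

    The first mail goes out once max_attempts checks have failed in a row;
    further mails follow every renotify_interval until the service recovers
    (end of `outage`) or the problem is acknowledged.
    """
    # Attempt 1 happens at offset 0, so 4 retries are needed to reach 5.
    t = (max_attempts - 1) * retry_interval
    times = []
    while t < outage:
        if acknowledged_after is not None and t >= acknowledged_after:
            break
        times.append(t)
        t += renotify_interval
    return times


if __name__ == "__main__":
    for offset in problem_notification_times(timedelta(hours=2)):
        print(offset)
```

With these assumed values, a failure that clears before the fifth attempt produces no mail at all, and an acknowledgement cuts the series short.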

That sounds quite fair to me. The only problem I see is that we'll get spam-bombed every 30 minutes if a service fails continuously and no one acknowledges the problem. I think we should set interval to 1 day, so that we get notified less often in this case.
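To put rough numbers on that difference, here is a back-of-the-envelope sketch. It assumes one initial mail plus one per full interval elapsed, ignores the retry delay before the first mail, and uses a made-up outage length of two and a half days; `count_problem_mails` is a hypothetical helper.

```python
from datetime import timedelta


def count_problem_mails(outage, interval):
    """One initial mail plus one per full interval elapsed in the outage."""
    # timedelta // timedelta yields an integer number of whole intervals.
    return 1 + outage // interval


if __name__ == "__main__":
    outage = timedelta(days=2, hours=12)  # hypothetical unacknowledged outage
    print(count_problem_mails(outage, timedelta(minutes=30)))  # 121
    print(count_problem_mails(outage, timedelta(days=1)))      # 3
```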

What's your opinion on this?

#9 Updated by bertagaz over 3 years ago

  • % Done changed from 10 to 40

#10 Updated by intrigeri over 3 years ago

  • Assignee changed from intrigeri to bertagaz
  • QA Check changed from Info Needed to Dev Needed

That sounds quite fair to me. The only problem I see is that we'll get spam-bombed every 30 minutes if a service fails continuously and no one acknowledges the problem.

That sounds too much, considering the kind of availability we can reasonably expect from the on-duty sysadmin.

I think we should set interval to 1 day, so that we get notified less often in this case.

Yes.

I'm curious to know how many emails you've received over N days, but perhaps fix some actual problems and check robustness issues before computing these stats.

#11 Updated by bertagaz over 3 years ago

  • Assignee changed from bertagaz to intrigeri
  • % Done changed from 40 to 60
  • QA Check changed from Dev Needed to Info Needed

intrigeri wrote:

That sounds quite fair to me. The only problem I see is that we'll get spam-bombed every 30 minutes if a service fails continuously and no one acknowledges the problem.

That sounds too much, considering the kind of availability we can reasonably expect from the on-duty sysadmin.

I think we should set interval to 1 day, so that we get notified less often in this case.

Yes.

Done in commits puppet-tails:b7c4915 and puppet-tails:cd501eb

I'm curious to know how many emails you've received over N days, but perhaps fix some actual problems and check robustness issues before computing these stats.

Since April 20, I've received 153 emails, of which approx. 50 were due to the apt-snapshots-disk check spamming me every 30 minutes. I had it resolved, but it came back again. All the rest are due to the whisperback check, which is flapping a lot.

We'll see now that I've set the notification interval to 1 day, and worked a bit on the checks. Shall we use this ticket to track this evaluation, or use #8652 for that?

#12 Updated by bertagaz over 3 years ago

  • Target version changed from Tails_2.3 to Tails_2.4

#14 Updated by intrigeri over 3 years ago

  • Assignee changed from intrigeri to bertagaz
  • QA Check deleted (Info Needed)

bertagaz wrote:

Since April 20, I've received 153 emails, of which approx. 50 were due to the apt-snapshots-disk check spamming me every 30 minutes. I had it resolved, but it came back again. All the rest are due to the whisperback check, which is flapping a lot.

We'll see now that I've set the notification interval to 1 day, and worked a bit on the checks. Shall we use this ticket to track this evaluation, or use #8652 for that?

IMO making sure that these notifications are useful is part of this ticket (it's actually the hardest part of it I bet, since just enabling email notifications was not too hard I guess).

So, please reassign to me for QA once you're happy with the current notifications (as in: you actually manage to stay on top of them in practice, plus they give you information you were not aware of, and that's actionable), and you deem it ready to be directed to all sysadmins (instead of you only). I guess we're not far from it :)

#15 Updated by bertagaz over 3 years ago

  • Assignee changed from bertagaz to intrigeri
  • QA Check set to Info Needed

Ok, so here are some new stats since the 1d remailing interval change on April 26:

I received 14 emails in total (not counting the whisperback ones, though that check was less spammy too), of which:

  • 7 are acknowledgments of a problem
  • 2 are downtime start and end notifications.

No new emails since April 29 in the evening, as there was no real change in the situation.

All these emails were legitimate: there really were (and still are) some problems, and they were worked on.

I'm wondering if it wouldn't be a good time to point the notifications to our sysadmins list, so that you can have a look at what it looks like. I don't think it would be very risky with the current settings, and I think it's working quite well at the moment. We could still roll back to emailing just me if that doesn't work well for you.

Some URLs that might help you have a look:

https://icingaweb2.tails.boum.org/monitoring/alertsummary/index?interval=1w

https://icingaweb2.tails.boum.org/monitoring/list/notifications?limit=100

What do you think?

#16 Updated by intrigeri over 3 years ago

  • Assignee changed from intrigeri to bertagaz
  • QA Check changed from Info Needed to Dev Needed

Ok, so here are some new stats since the 1d remailing interval change on April 26:

Thanks, good to hear!

I'm wondering if it wouldn't be a good time to point the notifications to our sysadmins list, so that you can have a look at what it looks like.

OK, let's try this!

#17 Updated by bertagaz over 3 years ago

  • Assignee changed from bertagaz to intrigeri
  • QA Check changed from Dev Needed to Ready for QA

intrigeri wrote:

OK, let's try this!

Done. I'm assigning this ticket to you and setting it to RfQA, as a reminder to check if the notification rate meets our design.

#18 Updated by intrigeri over 3 years ago

  • Assignee changed from intrigeri to bertagaz

I'm assigning this ticket to you and setting it to RfQA, as a reminder to check if the notification rate meets our design.

I think we have a ticket to deal with the consequences of the initial deployment, so feel free to close this one.

#19 Updated by bertagaz over 3 years ago

  • Status changed from In Progress to Resolved
  • Assignee deleted (bertagaz)
  • % Done changed from 60 to 100
  • QA Check changed from Ready for QA to Pass

intrigeri wrote:

I think we have a ticket to deal with the consequences of the initial deployment, so feel free to close this one.

\o/
