Feature #11358

Feature #5734: Monitor servers

Feature #9482: Create a monitoring setup prototype

Set relevant check_interval and retry_interval for hosts and services

Added by bertagaz over 3 years ago. Updated over 3 years ago.

Status:
Resolved
Priority:
Normal
Assignee:
-
Category:
Infrastructure
Target version:
Start date:
04/20/2016
Due date:
% Done:
100%

Feature Branch:
Type of work:
Sysadmin
Blueprint:
Starter:
Affected tool:

Description

So far we've used import generic-service and import generic-host, which set the same check_interval and retry_interval for every service and host. These intervals are a bit too short and not always relevant. We should set them to something that makes more sense.
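
For context, here is roughly what the stock Icinga2 generic templates look like (a sketch only, not our actual manifests; the 1m / 30s values match what the history below says we started from):

  // Stock Icinga2 templates: every host and service that imports them
  // gets the same, rather aggressive, check intervals.
  template Host "generic-host" {
    max_check_attempts = 3
    check_interval = 1m
    retry_interval = 30s
  }

  template Service "generic-service" {
    check_interval = 1m
    retry_interval = 30s
  }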

History

#1 Updated by bertagaz over 3 years ago

  • % Done changed from 0 to 20

Set the check_interval to 5 minutes, and the retry_interval to 2 minutes for every host in commit puppet-tails:44a5a30. As max_check_attempts is set to 3 by default, we should be notified after 6 minutes if a host is down (max_check_attempts * retry_interval). Previously the settings were 1 minute and 30 seconds respectively, which sounded a bit low and intense to me.
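
A minimal sketch of what this host-level override could look like (hypothetical template name, not the actual content of puppet-tails:44a5a30):

  // Check every host every 5 minutes; once a check fails, retry every
  // 2 minutes. With the default max_check_attempts = 3, the notification
  // fires roughly 3 * 2m = 6m after the host goes down.
  template Host "tails-host" {
    import "generic-host"
    check_interval = 5m
    retry_interval = 2m
  }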

#2 Updated by bertagaz over 3 years ago

  • Assignee changed from bertagaz to intrigeri
  • % Done changed from 20 to 70
  • QA Check set to Ready for QA

I've set up the check_interval (c below) and retry_interval (r below) for the various services:

  • disks: c=12h and r=5m
  • apt: c=6h and r=5m
  • http: c=15m and r=5m
  • memory: c=10m and r=2m
  • torbrowser_archive: c=10m and r=2m
  • rsync: c=10m and r=2m
  • ssh and sftp accounts: c=10m and r=2m
  • whisperback: c=10m and r=2m

What do you think about that?
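
In case it helps the review, here is a rough sketch of how two of these could be expressed in Icinga2 (the check commands and assign rules are illustrative; the actual puppet-tails manifests may be organized differently):

  // Disk usage moves slowly: check every 12 hours, retry every 5 minutes.
  apply Service "disks" {
    import "generic-service"
    check_command = "disk"
    check_interval = 12h
    retry_interval = 5m
    assign where host.address
  }

  // HTTP checks as proposed above: every 15 minutes, retry every 5 minutes.
  apply Service "http" {
    import "generic-service"
    check_command = "http"
    check_interval = 15m
    retry_interval = 5m
    assign where host.vars.http_vhosts
  }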

Also, I think it will probably help the HTTP checks be a bit more stable, given they will be checked less often.

#3 Updated by bertagaz over 3 years ago

  • Blocks Feature #9484: Deploy the monitoring setup to production added

#4 Updated by bertagaz over 3 years ago

bertagaz wrote:

Also, I think it will probably help the HTTP checks be a bit more stable, given they will be checked less often.

This means we'll have to check how the HTTP checks are behaving with these changes (see #8650#note-25).

#5 Updated by intrigeri over 3 years ago

  • Assignee changed from intrigeri to bertagaz
  • QA Check changed from Ready for QA to Dev Needed

I've set up the check_interval (c below) and retry_interval (r below) for the various services:

  • disks: c=12h and r=5m
  • apt: c=6h and r=5m

Sounds good.

  • http: c=15m and r=5m

So, we'll learn after 15 (best case) to 30 (worst case) minutes if one of our HTTP services is down. A lot of our stuff (e.g. CI, image building, additional software packages feature) depends on our various HTTP services to be up, so this feels too relaxed to me. I would say c=5m and r=100s so that the notification is triggered between 5 and 10 minutes after the outage starts.

  • memory: c=10m and r=2m

I see value in checking this more often, as problematic memory usage peaks can be very short lived. Just set it back to the (somewhat crazy) defaults?

  • torbrowser_archive: c=10m and r=2m
  • rsync: c=10m and r=2m
  • ssh and sftp accounts: c=10m and r=2m
  • whisperback: c=10m and r=2m

OK.
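
To spell out the arithmetic behind the suggested http values (same delay model as in note 1, and assuming max_check_attempts = 3 for this service as well):

  \mathrm{best\ case} = \mathrm{max\_check\_attempts} \times \mathrm{retry\_interval} = 3 \times 100\,\mathrm{s} = 5\,\mathrm{min}

  \mathrm{worst\ case} = \mathrm{check\_interval} + \mathrm{max\_check\_attempts} \times \mathrm{retry\_interval} = 5\,\mathrm{min} + 5\,\mathrm{min} = 10\,\mathrm{min}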

#6 Updated by bertagaz over 3 years ago

  • Target version changed from Tails_2.3 to Tails_2.4

#7 Updated by bertagaz over 3 years ago

  • Assignee changed from bertagaz to intrigeri
  • QA Check changed from Dev Needed to Ready for QA

intrigeri wrote:

  • http: c=15m and r=5m

So, we'll learn after 15 (best case) to 30 (worst case) minutes if one of our HTTP services is down. A lot of our stuff (e.g. CI, image building, additional software packages feature) depends on our various HTTP services to be up, so this feels too relaxed to me. I would say c=5m and r=100s so that the notification is triggered between 5 and 10 minutes after the outage starts.

OK, I've implemented that in commit puppet-tails:625fd30. Let's see how it behaves.

  • memory: c=10m and r=2m

I see value in checking this more often, as problematic memory usage peaks can be very short lived. Just set it back to the (somewhat crazy) defaults?

True, let's try that. commit puppet-tails:6a66282
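
A minimal sketch of what the two adjusted service definitions might look like after these changes (hypothetical; the real changes live in puppet-tails:625fd30 and puppet-tails:6a66282):

  // HTTP checks: notification between roughly 5 and 10 minutes after an
  // outage starts.
  apply Service "http" {
    import "generic-service"
    check_command = "http"
    check_interval = 5m
    retry_interval = 100s
    assign where host.vars.http_vhosts
  }

  // Memory checks: back to the (somewhat crazy) template defaults of
  // 1m / 30s, simply by no longer overriding the intervals.
  apply Service "memory" {
    import "generic-service"
    check_command = "memory"   // illustrative command name
    assign where host.address
  }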

#8 Updated by intrigeri over 3 years ago

  • Status changed from In Progress to Resolved
  • Assignee deleted (intrigeri)
  • Target version changed from Tails_2.4 to Tails_2.3
  • % Done changed from 70 to 0
  • QA Check changed from Ready for QA to Pass

OK, great. Let's handle as subtasks of #8652 any issue we identify once we start really using the thing.

#9 Updated by intrigeri over 3 years ago

  • Blocks deleted (Feature #9484: Deploy the monitoring setup to production)

#10 Updated by intrigeri over 3 years ago

  • Parent task changed from #5734 to #9482

#11 Updated by bertagaz over 3 years ago

  • % Done changed from 0 to 100
