Feature #8649

Feature #5734: Monitor servers

Specify our monitoring needs and build an inventory of the services that need monitoring

Added by intrigeri almost 5 years ago. Updated over 4 years ago.

Status: Resolved
Priority: Normal
Assignee: -
Category: Infrastructure
Target version:
Start date: 01/09/2015
Due date:
% Done: 100%
Feature Branch:
Type of work: Research
Starter:
Affected tool:

Description

... with a per-{service,test} priority level.


Related issues

Blocks Tails - Feature #8645: Research and decide what monitoring solution to use Resolved 01/09/2015 09/28/2015
Blocks Tails - Feature #8646: Research and decide where to host the monitoring software Resolved 01/09/2015

History

#1 Updated by intrigeri almost 5 years ago

  • Blocks Feature #8650: Configure monitoring for the most critical services added

#2 Updated by intrigeri almost 5 years ago

  • the WhisperBack SMTP relay must accept email from us (we've had too many issues about it in the past); no need to actually send the email, as we won't be able to check that it actually arrived, and we don't want to bother frontdesk with such email anyway
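
A minimal sketch of such a check, assuming the relay speaks plain SMTP and that stopping after `RCPT TO` counts as "accepts email from us". The function names are hypothetical, and any TLS, Tor or authentication setup the real relay needs is left out. It walks the SMTP envelope and resets the transaction before `DATA`, so no message is ever delivered:

```python
import smtplib


def relay_accepts_mail(smtp, sender, recipient):
    """Return True if the relay accepts the envelope, without sending a message.

    `smtp` is an already connected object with the smtplib.SMTP interface
    (ehlo, mail, rcpt, rset).
    """
    code, _ = smtp.ehlo()
    if not 200 <= code < 300:
        return False
    code, _ = smtp.mail(sender)
    if not 200 <= code < 300:
        return False
    code, _ = smtp.rcpt(recipient)
    accepted = 200 <= code < 300
    # Abort the transaction before DATA, so nothing reaches a mailbox.
    smtp.rset()
    return accepted


def check_relay(host, port, sender, recipient, timeout=30):
    """Connect to the relay (hypothetical host and port) and run the check."""
    with smtplib.SMTP(host, port, timeout=timeout) as smtp:
        return relay_accepts_mail(smtp, sender, recipient)
```

Keeping the envelope logic separate from the connection makes it possible to exercise the check against a stub instead of the production relay.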

#3 Updated by intrigeri over 4 years ago

  • Subject changed from Build an inventory of the services that need monitoring to Specify our monitoring needs and build an inventory of the services that need monitoring

#4 Updated by intrigeri over 4 years ago

  • Status changed from Confirmed to In Progress
  • Assignee changed from bertagaz to intrigeri
  • Target version changed from Tails_1.8 to Tails_1.4.1
  • % Done changed from 0 to 10

We've worked on this. I'll dump the results in a blueprint.

#6 Updated by intrigeri over 4 years ago

  • Blocks Feature #8645: Research and decide what monitoring solution to use added

#7 Updated by intrigeri over 4 years ago

  • Blocks Feature #8646: Research and decide where to host the monitoring software added

#8 Updated by intrigeri over 4 years ago

  • Blocks deleted (Feature #8650: Configure monitoring for the most critical services)

#9 Updated by intrigeri over 4 years ago

  • Assignee changed from intrigeri to Dr_Whax
  • % Done changed from 10 to 50
  • QA Check set to Ready for QA
  • Blueprint set to https://tails.boum.org/blueprint/monitor_servers/

DrWhax: please carefully read the blueprint, and then read it again. Then, please tell us if something is not clear enough, badly specified, wrong, or simply missing. If that's the case (which feels very likely), reassign to bertagaz and flag "Info Needed". Once happy, close as resolved.

bertagaz: I've elaborated a little bit on some aspects, beyond what we had discussed together, so you may want to have another look. Note that some service checks priority levels were changed, because we have made some commitments about them. So beware not to adjust these without checking with me first.

#10 Updated by Dr_Whax over 4 years ago

  • Assignee changed from Dr_Whax to bertagaz
  • QA Check changed from Ready for QA to Info Needed

I reviewed the document and have a few questions in order to understand things right. In addition, I've found some typos, which I have corrected in branch: drwhax/feature-8649-specify-monitoring-needs

I've broken down the questions per section of the document.

  1. Compromised monitored machine

Is there any encryption needed between the agent and the server? Is a new key or SNMP password needed per monitored machine?

  1. Network Attacker

Point 3 mentions "it MUST NOT be a big deal if it leaks into the hands of an adversary". What is considered a big deal here?

  1. Configuration

Point 2 mentions "SHOULD allow humans to easily review the service checks configuration." It seems there isn't a preference for either through a web interface or the commandline. Is this the case?

Is there a need for keeping logs of when servers went down and for how long, if any?

What is the "shared puppet module"? What is being referred to?

  1. Hosting of the monitoring machine

Point 4 mentions "MUST be reactive and easy to get in touch with." Is there a turn-around that is important? <24h? <48h?

#11 Updated by intrigeri over 4 years ago

  • Assignee changed from bertagaz to Dr_Whax
  • QA Check changed from Info Needed to Ready for QA

I reviewed the document and have a few questions in order to understand things right.

Excellent!

In addition, I've found some typos, which I have corrected in branch: drwhax/feature-8649-specify-monitoring-needs

Thanks, merged locally => will push once I'm back online.

I've broken down the questions per section of the document.

Cool. Note that Redmine is using textile markup, not Markdown.

  1. Compromised monitored machine

Regarding these two questions: note that this blueprint provides a specification and a threat model, and your task is to find solutions that address them :) That's why we did not spend any time on implementation details, and also why I don't always have good and precise answers to some questions below.

Is there any encryption needed between the agent and the server?

It seems to me that this is already somewhat covered by the specs found in the "Network attacker" section, no?

Is a new key or SNMP password needed per monitored machine?

I'm not sure I've understood what you mean (e.g. is it about global keys rollover when one monitored machine is compromised?)

If the question is about the need, in general, for per-monitored machine authentication, then it seems that the answer is clearly yes, because I don't see how this part of the spec can be satisfied without this kind of authentication: "It MUST NOT result in a compromise of the network traffic between other monitored machines and the monitoring machine (e.g. if that traffic is encrypted, the monitored machines MUST NOT use the same private key)."
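
To make the "MUST NOT use the same private key" part concrete, here is a small sketch of how one could audit it, assuming we can collect each monitored machine's public key as bytes (a shared public key implies a shared private key). The host names, key material and function names are made up for illustration:

```python
import hashlib
from collections import defaultdict


def fingerprint(public_key_bytes):
    """SHA-256 fingerprint of a public key, as a hex string."""
    return hashlib.sha256(public_key_bytes).hexdigest()


def shared_keys(host_keys):
    """Map fingerprint -> sorted hosts, for every key used by more than one host."""
    hosts_by_fingerprint = defaultdict(list)
    for host, key in host_keys.items():
        hosts_by_fingerprint[fingerprint(key)].append(host)
    return {
        fp: sorted(hosts)
        for fp, hosts in hosts_by_fingerprint.items()
        if len(hosts) > 1
    }


# Hypothetical inventory: two machines wrongly share one key pair.
example = {"host-a": b"key-1", "host-b": b"key-2", "host-c": b"key-1"}
violations = shared_keys(example)
```

An empty result means every machine has its own key pair, so compromising one machine does not expose the traffic of the others.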

  1. Network Attacker

Point 3 mentions "it MUST NOT be a big deal if it leaks into the hands of an adversary". What is considered a big deal here?

Good question :) I personally have no prior first-hand experience with monitoring, so I don't really know what can potentially transit on the wire, so I've a hard time finding examples. It would help me if you asked more specifically "is $this OK?" for various values of $this.

Still, let's try => for example: disk space usage doesn't seem to be a big deal. Passwords used by some "agent" to connect to services may be a big deal, depending on the exact situation.

  1. Configuration

Point 2 mentions "SHOULD allow humans to easily review the service checks configuration." It seems there isn't a preference for either through a web interface or the commandline. Is this the case?

I personally would prefer very much reviewing such changes with Git. But I could live with reviewing them in a web interface, as long as it gives me information akin to git log -p (I mean, having to go through the entire web config interface and compare the current status with what I remember it was before simply won't work). Ideally, such config changes could be proposed and reviewed before being applied on the production systems.
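
For illustration, the kind of output I mean by "akin to git log -p" is a per-change unified diff. A rough sketch with Python's `difflib`, where the file name and check syntax are invented and not taken from any real monitoring tool:

```python
import difflib


def review_diff(old_text, new_text, path="checks.cfg"):
    """Return a unified diff between two versions of a checks configuration."""
    old_lines = old_text.splitlines(keepends=True)
    new_lines = new_text.splitlines(keepends=True)
    return "".join(
        difflib.unified_diff(
            old_lines, new_lines, fromfile="a/" + path, tofile="b/" + path
        )
    )


# Hypothetical change: the disk usage warning threshold is raised.
old = "check_disk warn=80\n"
new = "check_disk warn=90\n"
print(review_diff(old, new), end="")
```

A web interface that shows each proposed change in this form, before it is applied, would be reviewable; one that only shows the current state would not.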

Is there a need for keeping logs of when servers went down and for how long, if any?

I'll assume you mean "services" here. Indeed, we didn't think of this. That's a good question! For me, it would be a weak SHOULD or a strong MAY, and a year sounds like a good sample of data, e.g. to evaluate services robustness and focus future efforts on what we think breaks too often (I don't believe in optimization without profiling or benchmarking).
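
As a sketch of what such a log could be used for, assuming outages are recorded as (service, start, end) tuples and kept for one year; the record format and the sample data are assumptions, not something the blueprint specifies:

```python
from datetime import datetime, timedelta

RETENTION = timedelta(days=365)  # "a year sounds like a good sample of data"


def prune(outages, now):
    """Keep only the outages that ended within the retention period."""
    return [outage for outage in outages if outage[2] >= now - RETENTION]


def downtime_per_service(outages):
    """Sum the outage durations per service."""
    totals = {}
    for service, start, end in outages:
        totals[service] = totals.get(service, timedelta()) + (end - start)
    return totals


# Made-up sample data, for illustration only.
now = datetime(2015, 6, 1)
outages = [
    ("smtp", datetime(2015, 5, 1, 10), datetime(2015, 5, 1, 12)),
    ("smtp", datetime(2015, 5, 20, 0), datetime(2015, 5, 20, 1)),
    ("web", datetime(2014, 1, 1), datetime(2014, 1, 2)),
]
totals = downtime_per_service(prune(outages, now))
```

Sorting `totals` by duration would show which services break most often or for longest, which is the profiling data needed before deciding where to focus robustness work.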

What is the "shared puppet module"? What is being referred to?

We're using lots of modules that come from this project (to which we're contributing a bit):

  1. Hosting of the monitoring machine

Point 4 mentions "MUST be reactive and easy to get in touch with." Is there a turn-around that is important? <24h? <48h?

I'd say that getting an initial answer within "a few" (=~ 5) days, when something goes wrong, is a MUST. We've added this requirement because we have had experience with hosting projects that sometimes take weeks or months to reply (when they reply at all) -- but I think we don't really need something like a service contract with "<24h" written on it. Mutual trust, mutual aid, and an understanding of the human aspects sound more important :)

Yay!

#12 Updated by Dr_Whax over 4 years ago

  • Assignee changed from Dr_Whax to intrigeri
  • QA Check changed from Ready for QA to Info Needed

intrigeri wrote:

In addition, I've found some typos, which I have corrected in branch: drwhax/feature-8649-specify-monitoring-needs

Thanks, merged locally => will push once I'm back online.

Thx!

I've broken down the questions per section of the document.

Cool. Note that Redmine is using textile markup, not Markdown.

Fair enough :)

  1. Compromised monitored machine

Regarding these two questions: note that this blueprint provides a specification and a threat model, and your task is to find solutions that address them :) That's why we did not spend any time on implementation details, and also why I don't always have good and precise answers to some questions below.

ACK.

Is there any encryption needed between the agent and the server?

It seems to me that this is already somewhat covered by the specs found in the "Network attacker" section, no?

Sure, it is. See below.

Is a new key or SNMP password needed per monitored machine?

I'm not sure I've understood what you mean (e.g. is it about global keys rollover when one monitored machine is compromised?)

Of course, it doesn't have to be a global key rollover if we have a different password and/or SSL key per service we monitor.

If the question is about the need, in general, for per-monitored machine authentication, then it seems that the answer is clearly yes, because I don't see how this part of the spec can be satisfied without this kind of authentication: "It MUST NOT result in a compromise of the network traffic between other monitored machines and the monitoring machine (e.g. if that traffic is encrypted, the monitored machines MUST NOT use the same private key)."

ACK, it's clear now!

  1. Network Attacker

Point 3 mentions "it MUST NOT be a big deal if it leaks into the hands of an adversary". What is considered a big deal here?

Good question :) I personally have no prior first-hand experience with monitoring, so I don't really know what can potentially transit on the wire, so I've a hard time finding examples. It would help me if you asked more specifically "is $this OK?" for various values of $this.

I think I can answer my own question, it's considered a big deal when one of the items outside of our threat model happens?

Still, let's try => for example: disk space usage doesn't seem to be a big deal. Passwords used by some "agent" to connect to services may be a big deal, depending on the exact situation.

  1. Configuration

Point 2 mentions "SHOULD allow humans to easily review the service checks configuration." It seems there isn't a preference for either through a web interface or the commandline. Is this the case?

I personally would prefer very much reviewing such changes with Git. But I could live with reviewing them in a web interface, as long as it gives me information akin to git log -p (I mean, having to go through the entire web config interface and compare the current status with what I remember it was before simply won't work). Ideally, such config changes could be proposed and reviewed before being applied on the production systems.

ACK. We prefer our monitoring configuration in Git, and changes would be proposed over a determined channel. Would that be the `tails-sysadmins` mailing list, or something else? Do we want to declare that on the blueprint or is that considered to be outside of the blueprint scope?

Is there a need for keeping logs of when servers went down and for how long, if any?

I'll assume you mean "services" here. Indeed, we didn't think of this. That's a good question! For me, it would be a weak SHOULD or a strong MAY, and a year sounds like a good sample of data, e.g. to evaluate services robustness and focus future efforts on what we think breaks too often (I don't believe in optimization without profiling or benchmarking).

Sure, services! Alright, I'll fix that up on the blueprint.

What is the "shared puppet module"? What is being referred to?

We're using lots of modules that come from this project (to which we're contributing a bit):

Ack, I joined the IRC channel and will look around once I have finished killing tickets :)

  1. Hosting of the monitoring machine

Point 4 mentions "MUST be reactive and easy to get in touch with." Is there a turn-around that is important? <24h? <48h?

I'd say that getting an initial answer within "a few" (=~ 5) days, when something goes wrong, is a MUST. We've added this requirement because we have had experience with hosting projects that sometimes take weeks or months to reply (when they reply at all) -- but I think we don't really need something like a service contract with "<24h" written on it. Mutual trust, mutual aid, and an understanding of the human aspects sound more important :)

OK, ack. One more thing, which might not be something we want to discuss here, or even with me, who knows: do we want to only host with a collective? Would hosting with a hosting company we trust be ok? Should the monitoring machine be hosted in the western hemisphere? Say, North America/Europe? Is there an absolute, no we don't want to host in this or that country in North America/Europe? This makes it easier for me to narrow down some propositions I want to make.

Thanks!

#13 Updated by intrigeri over 4 years ago

  • Assignee changed from intrigeri to Dr_Whax
  • QA Check changed from Info Needed to Ready for QA

  1. Network Attacker

Point 3 mentions "it MUST NOT be a big deal if it leaks into the hands of an adversary". What is considered a big deal here?

Good question :) I personally have no prior first-hand experience with monitoring, so I don't really know what can potentially transit on the wire, so I've a hard time finding examples. It would help me if you asked more specifically "is $this OK?" for various values of $this.

I think I can answer my own question, it's considered a big deal when one of the items outside of our threat model happens?

Yes, "a big deal" includes this, but probably more things that the rest of our threat model doesn't cover.

ACK. We prefer our monitoring configuration in Git, and changes would be proposed over a determined channel. Would that be the `tails-sysadmins` mailing list, or something else?

By default, the already documented ways to contribute changes to our infrastructure without being root apply as a fallback. If that's not good enough, then we'll refine it later.

Do we want to declare that on the blueprint or is that considered to be outside of the blueprint scope?

How exactly we'll be handling pull requests seems out of scope now :)

Is there a need for keeping logs of when servers went down and for how long, if any?

I'll assume you mean "services" here. Indeed, we didn't think of this. That's a good question! For me, it would be a weak SHOULD or a strong MAY, and a year sounds like a good sample of data, e.g. to evaluate services robustness and focus future efforts on what we think breaks too often (I don't believe in optimization without profiling or benchmarking).

Sure, services! Alright, I'll fix that up on the blueprint.

Cool, thanks. Please do so for everything that was unclear and you had to ask more info about too, by the way. Having specs only as comments on Redmine won't cut it :)

Do we want to only host with a collective? Would hosting with a hosting company we trust be ok?

As long as the "Hosting of the monitoring machine" spec's requirements are satisfied, the exact business model of the hosting organization doesn't matter much to me. I'm pretty sure our potential list of trusted orgs contains way more autonomous collectives than companies, but it doesn't mean that it's a MUST.

Should the monitoring machine be hosted in the western hemisphere? Say, North America/Europe?
Is there an absolute, no we don't want to host in this or that country in North America/Europe?

I don't think we care. But perhaps I'm missing your underlying point?

This makes it easier for me to narrow down some propositions I want to make.

:)

#14 Updated by bertagaz over 4 years ago

intrigeri wrote:

bertagaz: I've elaborated a little bit on some aspects, beyond what we had discussed together, so you may want to have another look. Note that some service checks priority levels were changed, because we have made some commitments about them. So beware not to adjust these without checking with me first.

Sorry for the lag, just reviewed it. Small typo fixes, and a bit of rephrasing (ab1d216 deserves a quick look). Frankly, I don't remember our discussion that well now, so it's not easy to see what was elaborated further without an initial commit of it, but it sounds good to me.

#15 Updated by intrigeri over 4 years ago

DrWhax: ping? What's your current ETA?

#16 Updated by BitingBird over 4 years ago

  • Target version changed from Tails_1.4.1 to Tails_1.5

#17 Updated by intrigeri over 4 years ago

intrigeri wrote:

DrWhax: ping? What's your current ETA?

A month later: ping?

#18 Updated by intrigeri over 4 years ago

  • Status changed from In Progress to Resolved
  • Assignee deleted (Dr_Whax)
  • % Done changed from 50 to 100
  • QA Check changed from Ready for QA to Pass

Two months + the work DrWhax did on other aspects of this project should be enough to spot serious mistakes, and I don't want to block forever on that QA => calling this done.
