Feature #8648

Feature #5734: Monitor servers

Feature #9482: Create a monitoring setup prototype

Initial set up of the monitoring software

Added by intrigeri almost 5 years ago. Updated over 3 years ago.

Status: Resolved
Priority: Normal
Assignee: -
Category: Infrastructure
Target version: -
Start date: 03/07/2016
Due date: -
% Done: 100%
Feature Branch: -
Type of work: Sysadmin
Blueprint: -
Starter: -
Affected tool: -

Subtasks

Feature #11194: Install a web interface to our monitoring software Resolved


Related issues

Blocks Tails - Feature #8650: Configure monitoring for the most critical services Resolved 01/09/2015

History

#1 Updated by intrigeri almost 5 years ago

  • Blocked by Feature #8647: Install an OS on the machine that will host the production monitoring setup added

#2 Updated by intrigeri almost 5 years ago

  • Blocks Feature #8650: Configure monitoring for the most critical services added

#3 Updated by Dr_Whax over 4 years ago

  • Assignee changed from bertagaz to Dr_Whax

#4 Updated by Dr_Whax over 4 years ago

  • Assignee changed from Dr_Whax to bertagaz

#6 Updated by intrigeri over 4 years ago

  • Blocked by deleted (Feature #8647: Install an OS on the machine that will host the production monitoring setup)

#7 Updated by intrigeri over 4 years ago

  • Assignee changed from bertagaz to Dr_Whax
  • Target version changed from Tails_1.8 to Tails_1.5
  • Parent task changed from #5734 to #9482

#8 Updated by intrigeri over 4 years ago

  • Blocks deleted (Feature #8650: Configure monitoring for the most critical services)

#9 Updated by intrigeri over 4 years ago

  • Blocks Feature #8650: Configure monitoring for the most critical services added

#10 Updated by intrigeri over 4 years ago

  • Target version changed from Tails_1.5 to Tails_1.6

#11 Updated by Dr_Whax over 4 years ago

  • Target version changed from Tails_1.6 to Tails_1.5

#12 Updated by intrigeri about 4 years ago

  • Target version changed from Tails_1.5 to Tails_1.6

#13 Updated by bertagaz about 4 years ago

  • Target version changed from Tails_1.6 to Tails_1.7

#14 Updated by intrigeri about 4 years ago

  • Due date set to 09/28/2015

#15 Updated by Dr_Whax about 4 years ago

  • Status changed from Confirmed to In Progress
  • % Done changed from 0 to 10

The VM got reset and I'll be doing the initial setup.

#16 Updated by intrigeri almost 4 years ago

  • Due date deleted (09/28/2015)
  • Assignee changed from Dr_Whax to bertagaz
  • Target version changed from Tails_1.7 to Tails_2.0

#17 Updated by bertagaz almost 4 years ago

  • Target version changed from Tails_2.0 to Tails_2.2

#18 Updated by bertagaz over 3 years ago

  • % Done changed from 10 to 30

I've added a ::tails::monitoring::base define that is now included in our ::tails::base manifest, so that all of our systems have a basic icinga2 install and the service running.

That's on the puppet-tails master branch; all commits reference this ticket with Refs:.

Next step: set up the PKI part, yay!
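For illustration, such a base component might look roughly like this. This is a sketch, not the actual code from puppet-tails: the name tails::monitoring::base comes from the comment above, but the resource bodies are assumptions.

```puppet
# Hypothetical sketch only: install icinga2 and keep its service
# running on every system that includes this define.
define tails::monitoring::base () {
  package { 'icinga2':
    ensure => installed,
  }

  service { 'icinga2':
    ensure  => running,
    enable  => true,
    require => Package['icinga2'],
  }
}
```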

#19 Updated by intrigeri over 3 years ago

bertagaz wrote:

I've added a ::tails::monitoring::base define that is now included in our ::tails::base manifest, so that all of our systems have a basic icinga2 install and the service running.

  • Why are tails::monitoring::* defines, and not classes?
  • This path looks weird: templates/monitoring/2.4/. It's not obvious what "2.4" refers to. Insert a path component that clarifies it, just above "2.4"?
  • Trailing whitespace in templates/monitoring/zone.conf.erb.

#20 Updated by intrigeri over 3 years ago

The handling of $ip_address in tails::monitoring::host is unnecessarily complicated => just do $ip_address = $::ipaddress in the params definition, it should work.
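That suggestion, sketched; the parameter and define names come from this ticket, while the body around the parameter is assumed:

```puppet
# Sketch only: default the parameter to the node's ipaddress fact,
# instead of computing it in a more complicated way.
define tails::monitoring::host (
  $ip_address = $::ipaddress,
) {
  # ... host/zone resources would go here ...
}
```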

#21 Updated by intrigeri over 3 years ago

Maybe the $nodename parameter could default to $::fqdn in tails::monitoring::master? I see our only usage of that class sets it like this, which makes it a good default probably.
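Sketched the same way; the class and parameter names are from this ticket, the rest is an assumption:

```puppet
# Sketch only: default nodename to the fqdn fact, matching how our
# only usage of the class sets it.
class tails::monitoring::master (
  $nodename = $::fqdn,
) {
  # ... master zone and endpoint configuration ...
}
```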

#22 Updated by bertagaz over 3 years ago

  • % Done changed from 30 to 40

It still needs some polishing, but a first skeleton for host, zone and certificate management is deployed. Now I need to test it with another host, probably monitor.li.

#23 Updated by bertagaz over 3 years ago

  • Assignee changed from bertagaz to intrigeri
  • QA Check set to Ready for QA

Ok, I think I've stabilized the manifests for this ticket, after a bunch of refactorings.

It's not really perfect and would probably benefit from some polishing, but for now it does the job, and seems abstracted enough to be easy to improve later.

I'll let you judge and see if it fits. To test it, you can add tails::monitoring::agent to some Lizard VM manifest and try to deploy it. Beware: it uses exported resources, and thus needs several Puppet runs on different systems. That's documented in the sysadmin repo. In the end, you should have a configured agent connected to the satellite on monitor.lizard.

Now I'm focusing on installing the icingaweb2 interface, which is tracked by ticket #11194.

#24 Updated by intrigeri over 3 years ago

  • Target version changed from Tails_2.2 to Tails_2.3

#25 Updated by intrigeri over 3 years ago

  • Assignee changed from intrigeri to bertagaz
  • QA Check changed from Ready for QA to Dev Needed

bertagaz wrote:

It's not really perfect and would probably benefit from some polishing, but for now it does the job, and seems abstracted enough to be easy to improve later.

I did a quick code review. I don't understand the problem space well enough to judge the global design, so I'll trust you on that one. I could wonder why we have a zone per VM, but at this point I have no idea what a "zone" is in this context, so I'll shut up and see if it gives me what I need with my sysadmin hat on.

Generally it looks like pretty good Puppet code to me! The only bits I find problematic enough to mention are:

  • I see execs like install_ido_database that assume "a bit" too much about what's in the user-defined strings before passing them to a shell; in particular, passwords can very well contain interesting special chars.
  • Isn't /etc/nginx/sites-enabled/default shipped by a Debian package? If it is, then the way we try to ensure it's not there is not reliable: it'll work except between an upgrade and the next Puppet run, and then it'll be confusing.
  • I think it's best practice not to group multiple resources of the same type in a single block (file { X: ; Y: ;}).
  • I'm not sure what you mean with this strange bit of regexp: (.+)?, so I'm confused. Do you mean (.*), or what?
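For the third point, the discouraged grouped form and the preferred one-resource-per-block style look like this (file names are made up; pick one form, they're shown together only for comparison):

```puppet
# Discouraged: several file resources grouped in a single block.
file {
  '/etc/icinga2/example-a.conf': ensure => file;
  '/etc/icinga2/example-b.conf': ensure => file;
}

# Preferred: one resource per block.
file { '/etc/icinga2/example-a.conf':
  ensure => file,
}
file { '/etc/icinga2/example-b.conf':
  ensure => file,
}
```

(As for the last bullet: (.+)? and (.*) match the same strings; in most engines they differ only in whether an empty match leaves the capture group unset or set to the empty string.)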

To test it, you can add tails::monitoring::agent to some Lizard VM manifest and try to deploy it. Beware: it uses exported resources, and thus needs several Puppet runs on different systems. That's documented in the sysadmin repo.

Tried it on apt.lizard:

  • Pushed one fix + some nitpicking to the doc.
  • This doc instructs me to create keys and move them somewhere, and then I realize that those keys already exist in that somewhere. I'm not sure what's the deal and what I should do, i.e. the doc apparently assumes I know more than I do (and way more than I will remember in 6 months).
  • The part about puppet-tails:/files/monitoring/public_keys/$ZONE/$NODENAME seems to be wrong, so I've skipped it. Not sure what I should have done instead (e.g. if the keys had not pre-existed). I assume the pubkey should have been copied somewhere relevant?
  • Nothing points to this doc from the one about setting up new systems (except for lizard VMs).

In the end, you should have a configured agent connected to the satellite on monitor.lizard.

I can't confirm that. I see an icinga2 process listening on TCP port 5665 on apt.lizard, but no such connection.

#26 Updated by bertagaz over 3 years ago

  • Assignee changed from bertagaz to intrigeri
  • QA Check changed from Dev Needed to Ready for QA

intrigeri wrote:

bertagaz wrote:
I did a quick code review. I don't understand the problem space well enough to judge the global design, so I'll trust you on that one. I could wonder why we have a zone per VM, but at this point I've no idea what is a "zone" in this context, so I'll shut up and see if it gives me what I need with my sysadmin hat.

Yeah, that's an icinga2 zone. It seems each node has to have its own zone for it to function, so I added a dedicated zone for each.
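For illustration, a per-node zone in Icinga2 configuration looks roughly like this (host names here are examples, not our actual config):

```
// Each node gets its own endpoint plus a dedicated zone whose
// parent is the satellite's zone.
object Endpoint "apt.lizard" {
}

object Zone "apt.lizard" {
  endpoints = [ "apt.lizard" ]
  parent = "monitor.lizard"
}
```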

Generally it looks pretty good Puppet code to me!

Woo, glad you liked it. Means something. :)

The only bits I find problematic enough to mention them are:

Well, these ones are more web interface/icingaweb2 related (so #11194), but I'll reply here anyway:

  • I see execs like install_ido_database that assume "a bit" too much about what's in the user-defined strings before passing them to a shell; in particular, passwords can very well contain interesting special chars.

This exec has disappeared since then, and the only remaining one quotes the different arguments. Note that this class is not supposed to be included directly, but through tails::monitoring::master, which validates the content of these arguments. Do you mean I should be stricter in the validate_ functions I'm using on these arguments?

  • Isn't /etc/nginx/sites-enabled/default shipped by a Debian package? If it is, then the way we try to ensure it's not there is not reliable: it'll work except between an upgrade and the next Puppet run, and then it'll be confusing.
  • I think it's best practice not to group multiple resources of the same type in a single block (file { X: ; Y: ;}).

Ok, I've seen that in some of your manifests I think. I won't do that again, I promise. :)

  • I'm not sure what you mean with this strange bit of regexp: (.+)?, so I'm confused. Do you mean (.*), or what?

Not sure either. Changed and tested. Works.

Tried it on apt.lizard:

  • Pushed one fix + some nitpicking to the doc.

Thanks!

  • This doc instructs me to create keys and move them somewhere, and then I realize that those keys already exist in that somewhere. I'm not sure what's the deal and what I should do, i.e. the doc apparently assumes I know more than I do (and way more than I will remember in 6 months).

Yeah, it's been written a bit for the `future`, when one deploys a really new system, not one that already exists, because I've been a bit bold and created all the certificates for the nodes that already exist. :)
I won't mind correcting that; it will fix itself in a few days.

  • The part about puppet-tails:/files/monitoring/public_keys/$ZONE/$NODENAME seems to be wrong, so I've skipped it. Not sure what I should have done instead (e.g. if the keys had not pre-existed). I assume the pubkey should have been copied somewhere relevant?

Yes, I've modified their hosting after the live discussion we had about them. They are now in modules/site_tails_monitoring/public_keys/$ZONE/$NODENAME.crt. Doc fixed.

  • Nothing points to this doc from the one about setting up new systems (except for lizard VMs).

Added a reference, in the base system doc. For other lizard-specific nodes, it's already in the dedicated notes.

In the end, you should have a configured agent connected to the satellite on monitor.lizard.

I can't confirm that. I see an icinga2 process listening on TCP port 5665 on apt.lizard, but no such connection.

Now that #11194 has moved forward, you should also see your newly installed monitoring agent pop up in the web interface.

Also, it seems the connections between agents and the satellite are not permanent. At the moment, you can only see pings from the satellite host to the agent hosts, as that's the only remote monitoring check configured.

#27 Updated by bertagaz over 3 years ago

Forgot this one:

intrigeri wrote:

  • Isn't /etc/nginx/sites-enabled/default shipped by a Debian package? If it is, then the way we try to ensure it's not there is not reliable: it'll work except between an upgrade and the next Puppet run, and then it'll be confusing.

Fixed, although it doesn't change anything in practice: we only accept port 443 in ecours' firewall, and IPv6 is disabled.

#28 Updated by intrigeri over 3 years ago

bertagaz: the concat that generates /etc/icinga2/zones.conf seems to have an unstable output or something: running puppet agent --test many times on apt.lizard today changed the ordering of the content of this file a few times, for no good reason I can think of. It decreases the s/n ratio of Puppet's output quite a bit, and breaks one of Puppet's axioms, namely that a system's state should eventually converge. You can check what I am referring to for yourself on that VM, thanks to etckeeper :)
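Explicit ordering on the fragments is the usual remedy for unstable concat output; a hedged sketch with the puppetlabs-concat module (fragment titles are illustrative, and the template path is an assumption, though zone.conf.erb is mentioned earlier in this ticket):

```puppet
# Sketch: pin each fragment's position so the assembled file is
# stable across Puppet runs, regardless of evaluation order.
concat { '/etc/icinga2/zones.conf':
  owner => 'root',
  group => 'root',
  mode  => '0644',
}

concat::fragment { 'zones-header':
  target  => '/etc/icinga2/zones.conf',
  content => "// File managed by Puppet\n",
  order   => '01',
}

concat::fragment { 'zone-apt.lizard':
  target  => '/etc/icinga2/zones.conf',
  content => template('tails/monitoring/zone.conf.erb'),
  order   => '10',
}
```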

#29 Updated by intrigeri over 3 years ago

  • Assignee changed from intrigeri to bertagaz
  • QA Check changed from Ready for QA to Dev Needed

FTR I've moved the discussion about passwords vs. the shell to #11194 since indeed I was wrong to comment about it here.

  • I think it's best practice not to group multiple resources of the same type in a single block (file { X: ; Y: ;}).

Ok, I've seen that in some of your manifests I think. I won't do that again, I promise. :)

I've been learning just as you are. Old code of mine can totally contain mistakes I would not make anymore. New code of mine can contain new mistakes, or old mistakes I've forgotten about. If I were you, I would trust the Puppet best practices doc more than old sample code of mine :)

  • I'm not sure what you mean with this strange bit of regexp: (.+)?, so I'm confused. Do you mean (.*), or what?

Not sure either. Changed and tested. Works.

Cool. If you have a todo list about technologies you should learn about, please make sure that regexps are on top of it. Nothing fancy, but it would be nice if you got the basic bits fully clear in your mind at some point. Yeah, I know, time and all..

So here we're mostly good: only the concat issue, and the #11194 subtask, are remaining. Yeah :)

#30 Updated by intrigeri over 3 years ago

I see you've scheduled downtime today (thanks!), but I can't find how to tell Icinga it's over without going through each monitored host one by one. Did I miss something? Maybe "host groups" would help, e.g. we could pack together all systems that are in the same physical location, and perform actions such as scheduling/ending downtime on all of them in one go?

#31 Updated by bertagaz over 3 years ago

intrigeri wrote:

I see you've scheduled downtime today (thanks!), but I can't find how to tell Icinga it's over without going through each monitored host one by one. Did I miss something? Maybe "host groups" would help, e.g. we could pack together all systems that are in the same physical location, and perform actions such as scheduling/ending downtime on all of them in one go?

I didn't document that, my fault: you can Ctrl+click on each host so that they all get selected. Then the change you apply with the panel that opens on the right of the screen will be applied to every selected host.

#32 Updated by bertagaz over 3 years ago

  • Assignee changed from bertagaz to intrigeri
  • QA Check changed from Dev Needed to Ready for QA

intrigeri wrote:

I've been learning just as you are. Old code of mine can totally contain mistakes I would not make anymore. New code of mine can contain new mistakes, or old mistakes I've forgotten about. If I were you, I would trust the Puppet best practices doc more than old sample code of mine :)

ACK

Cool. If you have a todo list about technologies you should learn about, please make sure that regexps are on top of it. Nothing fancy, but it would be nice if you got the basic bits fully clear in your mind at some point. Yeah, I know, time and all..

That's something that has always frightened me; I wonder why... :)

So here we're mostly good: only the concat issue

I've pushed two commits referencing this ticket that add basic ordering using the order parameter of concat::fragment (like in shorewall, actually). It's deployed and seems to work so far. The zones.conf file now has the same hashes on all nodes and doesn't seem to be randomly modified anymore. We'll see.

and the #11194 subtask, are remaining. Yeah :)

I'm still putting this ticket in RfQA, given that all related code is done. Closing it can still wait for #11194 to be fixed.

#33 Updated by bertagaz over 3 years ago

bertagaz wrote:

intrigeri wrote:

So here we're mostly good: only the concat issue

I've pushed two commits referencing this ticket that add basic ordering using the order parameter of concat::fragment (like in shorewall, actually). It's deployed and seems to work so far. The zones.conf file now has the same hashes on all nodes and doesn't seem to be randomly modified anymore. We'll see.

My bad, it seems the bug is still here: adding explicit ordering didn't change the game. So I've opened #11242 to track that, because I really believe the puppet-concat module version we're using is sorting things quite erratically. We get annoying spurious commits, but the generated config is almost the same (only the sorting differs in some cases) and working, so this probably doesn't block this very ticket.

#34 Updated by bertagaz over 3 years ago

  • QA Check changed from Ready for QA to Info Needed

While going forward, I stumbled on a problem: it seems Icinga2 from Wheezy (2.1) is not that compatible with the one from Jessie (2.4). The "remote client" feature seems to have been added in Icinga2 2.2... When I did the compatibility test on my systems, I did it the other way around: the 2.1 Icinga2 instance was the master... But deployment plans have changed since then.

So at the moment, the Icinga2 client on lizard.t.b.o is not reporting its checks.

Good news is that Icinga2 from testing is simple to backport to Wheezy. So I propose to do that backport and upload it in a suite in our APT repo, so that we can install it on lizard. I'm not sure it deserves a fully official backport. What do you think?

#35 Updated by intrigeri over 3 years ago

  • Assignee changed from intrigeri to bertagaz
  • QA Check changed from Info Needed to Dev Needed

I didn't document that, my fault: you can Ctrl+click on each host so that they all get selected. Then the change you apply with the panel that opens on the right of the screen will be applied to every selected host.

Thank you. This feels cumbersome when one has 20+ VMs on a single physical machine, so again: isn't that the kind of thing that "host groups" are for? Perhaps it would be trivial to group *.lizard together? Feel free to postpone to a new, dedicated ticket, of course.

While going forward, I stumbled on a trouble: it seems Icinga2 from wheezy (2.1) is not that compatible with the one from Jessie (2.4)

Bad news, sorry about that!

Do you mean "jessie-backports" instead of "Jessie", and "wheezy-backports" instead of "Wheezy"?

Good news is that Icinga2 from testing is simple to backport to Wheezy.

In practice it's rather "Icinga2 from jessie-backports is simple to backport to wheezy-backports-sloppy" I guess, but I'm nitpicking :)

So I propose to do that backport and upload it in a suite in our APT repo, so that we can install it on lizard. I'm not sure it deserves a fully official backport. What do you think?

I'm fine with a short-lived "private" backport; to make sure it's short-lived, #11186 and #11178 need to be dealt with quickly (so IMO they should become Deliverable for = SponsorS_Internal, and you should become responsible for them; I'm fine with keeping them on my plate for the next 2 weeks and to help with them though).

#36 Updated by bertagaz over 3 years ago

intrigeri wrote:

I didn't document that, my fault: you can Ctrl+click on each host so that they all get selected. Then the change you apply with the panel that opens on the right of the screen will be applied to every selected host.

Thank you. This feels cumbersome when one has 20+ VMs on a single physical machine, so again: isn't that the kind of thing that "host groups" are for? Perhaps it would be trivial to group *.lizard together? Feel free to postpone to a new, dedicated ticket, of course.

Yep, that was one of the tasks I thought might be handy too. Created #11277.

While going forward, I stumbled on a trouble: it seems Icinga2 from wheezy (2.1) is not that compatible with the one from Jessie (2.4)

Bad news, sorry about that!

Do you mean "jessie-backports" instead of "Jessie", and "wheezy-backports" instead of "Wheezy"?

Good news is that Icinga2 from testing is simple to backport to Wheezy.

In practice it's rather "Icinga2 from jessie-backports is simple to backport to wheezy-backports-sloppy" I guess, but I'm nitpicking :)

Ack, thx for the nitpicking, precisely an error I would have made. :)

So I propose to do that backport and upload it in a suite in our APT repo, so that we can install it on lizard. I'm not sure it deserves a fully official backport. What do you think?

I'm fine with a short-lived "private" backport; to make sure it's short-lived, #11186 and #11178 need to be dealt with quickly (so IMO they should become Deliverable for = SponsorS_Internal, and you should become responsible for them; I'm fine with keeping them on my plate for the next 2 weeks and to help with them though).

Re-scheduled these tickets to SponsorS_Internal as requested, left them assigned to you, but will own them later when I'm a bit more relaxed about the monitoring setup.

#37 Updated by bertagaz over 3 years ago

  • Assignee changed from bertagaz to intrigeri
  • QA Check changed from Dev Needed to Ready for QA

bertagaz wrote:

intrigeri wrote:

So I propose to do that backport and upload it in a suite in our APT repo, so that we can install it on lizard. I'm not sure it deserves a fully official backport. What do you think?

I'm fine with a short-lived "private" backport; to make sure it's short-lived, #11186 and #11178 need to be dealt with quickly (so IMO they should become Deliverable for = SponsorS_Internal, and you should become responsible for them; I'm fine with keeping them on my plate for the next 2 weeks and to help with them though).

Uploaded and deployed a version of the jessie-backports Icinga2 packages backported to Wheezy. It works: after that, both lizard and whisperback run their checks fine. So we're good here, I think.

#38 Updated by intrigeri over 3 years ago

  • Status changed from In Progress to Resolved
  • Assignee deleted (intrigeri)
  • QA Check changed from Ready for QA to Pass

Yes!
