Bug #12429

Some lizard VMs sometimes lack memory to run Puppet

Added by bertagaz over 2 years ago. Updated 28 days ago.

Status: Resolved
Priority: Normal
Assignee:
Category: Infrastructure
Target version: -
Start date: 04/06/2017
Due date:
% Done: 10%
Feature Branch:
Type of work: Sysadmin
Blueprint:
Starter:
Affected tool:

Description

Lately, running puppet agent --test in some VMs on lizard has been failing: they don't have enough memory to fork. It's a bit erratic, so it seems to depend on the system load. Here's a list of the affected VMs:

  • bridge
  • bittorrent
  • apt-proxy
  • misc
  • puppet-git
  • whisperback
  • www

Maybe adding 128M of memory to each of them is something to consider. Assigning to intrigeri if he wants to handle that as part of his shift.
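
For reference, a rough sketch of what such a bump could look like with virsh on the host. The VM name and sizes below are purely illustrative, and if the guests' memory allocation is managed in our Puppet/libvirt manifests, the change belongs there instead:

    # Illustrative only: grow one guest by 128M, assuming it currently has 512M.
    # A maxmem change only takes effect once the guest has been restarted.
    virsh setmaxmem apt-proxy 640M --config
    virsh setmem apt-proxy 640M --config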

History

#1 Updated by intrigeri over 2 years ago

  • Assignee changed from intrigeri to bertagaz
  • QA Check set to Info Needed

Lately, running puppet agent --test in some VMs on lizard has been failing: they don't have enough memory to fork.

A while ago I fixed all these problems wrt. Puppet run by cron, and AFAIK we don't see them anymore, so to me it looks like our VMs have enough RAM to run one instance of Puppet.

Now, I have also observed symptoms similar to what you're describing (basically forever, not only lately) when running Puppet by hand. Most of the times I've seen this, I was worried, so I looked closer, and it was due to one of these two root causes:

  • Another Puppet was already being run by cron (and apparently, when using --splay, the locking feature of Puppet that avoids concurrent runs is not effective during the random waiting time): fixed by either waiting for that one to do its job, or by killing it manually if I really wanted to run Puppet by hand. I don't think we should spend RAM on trying to support 2 concurrent runs of Puppet, so I say let's ignore this one. If we want to improve the sysadmin UX, we could get ourselves a wrapper to run the puppet agent manually that would check whether another run is scheduled already, or something (see the first sketch after this list); this might prevent this topic from resurfacing later, by providing a clearer explanation of the observed behavior.
  • icinga2 was eating tons of RAM: fixed by restarting that service. We could surely work around this by giving VMs more RAM, but it's not clear to me that this would really fix the problem: if icinga2 has memory leaks (and apparently it has), chances are that they're not capped, and eventually we'll be back in the exact same situation, only a bit less often / after more uptime. A better way to fix that would be to either restart icinga2 regularly via cron (ugly, but quick) or to set up resource limits with systemd and ensure the service gets automatically restarted when it's killed (this should be 2-3 lines in a systemd drop-in override, see the second sketch after this list; nicer, but someone has to learn how this works to do it, which could actually be a very useful investment for tackling similar tasks in the future).
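
Regarding the wrapper idea above, here's a minimal sketch, assuming we simply refuse to start when another agent process is already active. The script name is hypothetical, and the check is deliberately crude (pgrep on the command line) rather than relying on Puppet's lock file, whose path differs between Puppet versions:

    #!/bin/sh
    # puppet-run-now (hypothetical name): run the Puppet agent interactively,
    # unless another agent process is already active.
    set -eu
    if pgrep -f 'puppet agent' >/dev/null; then
        echo "Another Puppet agent seems to be running already; try again later." >&2
        exit 1
    fi
    exec puppet agent --test "$@"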
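
And a sketch of the systemd drop-in option, with made-up numbers that would need tuning against icinga2's actual working set; MemoryLimit= is the cgroup-v1 directive, MemoryMax= being its equivalent on newer, unified-cgroup systems:

    # Cap icinga2's memory and restart it automatically when it gets killed.
    mkdir -p /etc/systemd/system/icinga2.service.d
    cat > /etc/systemd/system/icinga2.service.d/memory.conf <<'EOF'
    [Service]
    MemoryAccounting=true
    MemoryLimit=256M
    Restart=on-failure
    RestartSec=10
    EOF
    systemctl daemon-reload
    systemctl restart icinga2.service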

So, two questions:

  • Did you see the behavior you describe in other situations? If you don't know, please keep this in mind and investigate next time you see this problem.
  • If not, what do you think about the solutions I'm proposing? Better options? How high on our todo list do you think implementing these solutions should be, compared to our other tasks? (So far I've handled this with very low priority, and didn't even bother filing tickets as the workaround was straightforward, but I can reconsider if you feel this is an important problem.)

#2 Updated by intrigeri over 2 years ago

For example, I've just seen a monitoring warning about memory usage on puppet-git.lizard (84%). Restarting icinga2.service got it back down to 36%.

#4 Updated by intrigeri about 2 years ago

  • Subject changed from Some lizard VMs lack memory to run puppet agent interactively to Some lizard VMs sometimes lack memory to run Puppet

(This actually happens for automatic Puppet runs as well.)

#5 Updated by intrigeri about 2 years ago

  • Assignee changed from bertagaz to intrigeri
  • Target version set to Tails_3.2
  • QA Check deleted (Info Needed)

Since puppet-git.lizard was upgraded to Stretch, when memory is getting full, restarting icinga2.service does not change anything: RAM is eaten primarily by passenger, mysqld and the occasional git process on tails.git. But restarting apache2.service immediately saves lots of RAM, even after running puppet agent by hand to ensure the puppetmasterd passenger app is loaded. So at least on that VM we can dismiss the "icinga2 is leaking memory with painful side effects" hypothesis => we should instead have a look at the passenger/puppetmasterd config to see if there's something we can do to save RAM, and/or restart apache2 automatically when it eats too much memory, and/or allocate more RAM to that VM. I'll take care of this shortly (as a follow-up to the Stretch upgrade, and because the current situation is rather painful), and then will reassign to bertagaz for the pending investigation on other VMs.
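
For the record, the kind of check involved here is nothing fancy, roughly:

    # Which processes are actually holding the RAM, biggest first?
    ps aux --sort=-rss | head -n 15
    free -m
    # Does restarting the suspected service give the memory back?
    systemctl restart apache2.service
    free -m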

#6 Updated by intrigeri about 2 years ago

  • Status changed from Confirmed to In Progress
  • Assignee changed from intrigeri to bertagaz
  • Target version deleted (Tails_3.2)
  • % Done changed from 0 to 10

intrigeri wrote:

we should instead have a look at the passenger/puppetmasterd config to see if there's something we can do to save RAM,

That was easy once I had learned some basics about configuring Passenger: it was set up in a way that was creating a memory hog. I've lowered PassengerMaxPoolSize, which should be enough, I think. If not, then that means puppetmaster is leaking memory, and we can work around that by tuning https://www.phusionpassenger.com/library/config/apache/reference/#passengermaxrequests and/or https://www.phusionpassenger.com/library/config/apache/reference/#passengerpoolidletime.
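
For illustration, such tuning boils down to a few Passenger directives in the Apache configuration; the file name and values below are made up, not the actual lizard settings:

    # Hypothetical Debian-style configuration snippet:
    cat > /etc/apache2/conf-available/passenger-tuning.conf <<'EOF'
    # Cap the number of concurrent application processes (the main memory cost):
    PassengerMaxPoolSize 2
    # Recycle an application process after this many requests, in case it leaks:
    PassengerMaxRequests 1000
    # Shut down application processes that have been idle for 5 minutes:
    PassengerPoolIdleTime 300
    EOF
    a2enconf passenger-tuning
    systemctl reload apache2.service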

will reassign to bertagaz for the pending investigation on other VMs.

Here we go.

#7 Updated by intrigeri over 1 year ago

This problem still happens occasionally, but I've not noticed obvious patterns. E.g. I've not recently noticed Icinga2 leaking memory as much as it used to; perhaps the upgrade to Stretch helped => dropping the "Deliverable for" as it's not a follow-up to the monitoring setup.

I've tweaked the Puppet splay config to lower the chances of 2+ instances of the Puppet agent running concurrently. But if you folks ever notice it again, please reassign to me (ideally with the relevant excerpts of the Journal + ps $your_preferred_options) and I'll consider running the Puppet agent as a standalone service instead of via cron.
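
For reference, the relevant knobs are the agent's splay settings; the values below are illustrative, not what was actually deployed:

    # Illustrative only: keep splay enabled but cap the random delay, which
    # shrinks the window during which a manual run can collide with the
    # cron-driven one. On agents without "puppet config set", put the same
    # settings into the [agent] section of puppet.conf by hand.
    puppet config set splay true --section agent
    puppet config set splaylimit 10m --section agent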

Meanwhile, as part of your (plural) sysadmin duty shifts, if you notice that again, please try to identify a root cause so we can fix it :)

#8 Updated by intrigeri 28 days ago

  • Status changed from In Progress to Resolved

I've not noticed this happening in a way that causes real trouble recently. Please reopen if you did :)
