Project

General

Profile

Bug #14948

zombie processes when running the test suite

Added by groente about 2 years ago. Updated about 1 month ago.

Status:
Resolved
Priority:
Normal
Assignee:
-
Category:
Test suite
Target version:
Start date:
11/10/2017
Due date:
% Done:

100%

Feature Branch:
bugfix/14948-avoid-zombies-in-testsuite-runs+force-all-tests
Type of work:
Code
Blueprint:
Starter:
Affected tool:

Description

the build process leaves quite a few zombie processes (mostly avconv and grep) on the isotesters.


Related issues

Related to Tails - Bug #17216: Make the test suite clean up after itself even in most tricky failure modes Confirmed
Blocks Tails - Feature #16209: Core work: Foundations Team Confirmed

Associated revisions

Revision b1d5e33f (diff)
Added by CyrilBrulebois 10 months ago

Avoid zombies by reaping all child processes (refs: #14948).

IO.popen makes it possible to start a command as a child process but it
won't automatically reap the subprocess when it exits, leading to many
zombies when running the automated test suite.

Fix this in several ways depending on the call site:

  • Create an intermediary variable to hold the subprocess returned by
    IO.popen, making it possible to read its standard output, but also
    allowing to call its close method afterwards. This is less
    straightforward than a IO.popen().gets call, but allows for
    appropriate clean-up.
  • When the subprocess object itself (p) is available, add a "p.close"
    call after killing it through "Process.kill(signal, p.pid)".
  • Otherwise, when only the pid of the subprocess is available, call
    "Process.wait(pid)" after killing it.

Revision f7976eb0 (diff)
Added by CyrilBrulebois 10 months ago

Avoid zombies by reaping all child processes (refs: #14948).

IO.popen makes it possible to start a command as a child process but it
won't automatically reap the subprocess when it exits, leading to many
zombies when running the automated test suite.

Fix this in several ways depending on the call site:

  • Create an intermediary variable to hold the subprocess returned by
    IO.popen, making it possible to read its standard output, but also
    allowing to call its close method afterwards. This is less
    straightforward than a IO.popen().gets call, but allows for
    appropriate clean-up.
  • When the subprocess object itself (p) is available, add a "p.close"
    call after killing it through "Process.kill(signal, p.pid)".
  • Otherwise, when only the pid of the subprocess is available, call
    "Process.wait(pid)" after killing it.

Revision cc86a75f (diff)
Added by CyrilBrulebois 10 months ago

Avoid zombies by reaping all child processes (refs: #14948).

IO.popen makes it possible to start a command as a child process but it
won't automatically reap the subprocess when it exits, leading to many
zombies when running the automated test suite.

Fix this in several ways depending on the call site:

  • Create an intermediary variable to hold the subprocess returned by
    IO.popen, making it possible to read its standard output, but also
    allowing to call its close method afterwards. This is less
    straightforward than a IO.popen().gets call, but allows for
    appropriate clean-up.
  • When the subprocess object itself (p) is available, add a "p.close"
    call after killing it through "Process.kill(signal, p.pid)".
  • Otherwise, when only the pid of the subprocess is available, call
    "Process.wait(pid)" after killing it.

Revision 70456889 (diff)
Added by intrigeri 9 months ago

Test suite: avoid zombies by waiting for killed child processes to exit (refs: #14948).

This is part of cc86a75fe6fdbb5ac54714d62f4b8dfe5831eb80,
authored by Cyril Brulebois <>.

Revision 0d3dc2a0 (diff)
Added by intrigeri 3 months ago

Test suite: don't leave redir(1) processes behind (refs: #14948)

By default, redir(1) runs "in the background, detached like a proper UNIX
daemon" (quoting its manpage). So, killing the PID we've recorded when we
started it is not terribly useful: I regularly find a leftover redir(1) process
running after a test suite run.

Let's instead delegate to systemd the task of tracking these processes and
killing them.

Note that I have plans to run the entire test suite under systemd-run in the
future, so we can drop all subprocess management code from our Cucumber code,
and the clean up happens reliably even if when our Cucumber hooks are not run
for some reason (OOM, CTRL-C, etc.).

Revision 5e7ff8e9 (diff)
Added by intrigeri about 2 months ago

Test suite: close communication channels with child processes before killing them (refs: #14948)

The "close" method (https://ruby-doc.org/core-2.2.0/IO.html#method-i-close)
"Closes ios and flushes any pending writes to the operating system". This could
increase the chances that the child process we kill manages to exit (instead of
being left as a zombie), by closing any communication channels we may have left
open with it. But it's more likely that we get this expected benefit if we run
.close before we kill that child process, to increase the success chances of
whatever signal handler that child process has setup.

Revision c04c13ed (diff)
Added by intrigeri about 1 month ago

Test suite: don't call "close" method that hangs forever (refs: #14948)

When proc.close is called once the scenario has completed, for some reason it
hangs forever. I don't remember seeing this specific child process left
as a zombie after the test suite completed, and on #14948 we did not
manage to reach a clear understanding of why calling proc.close
would help, so let's remove it.

This reverts 5e7ff8e9e644374bea8120096706855bc3c1b5b2 and part
of cc86a75fe6fdbb5ac54714d62f4b8dfe5831eb80.

Revision 22522410
Added by intrigeri about 1 month ago

Merge branch 'bugfix/14948-avoid-zombies-in-testsuite-runs+force-all-tests' into stable (Closes: #14948)

History

#1 Updated by intrigeri about 2 years ago

  • Subject changed from zombie processes on isotesters to zombie processes when running the test suite
  • Category changed from Continuous Integration to Test suite
  • Target version deleted (Tails_3.5)
  • Type of work changed from Sysadmin to Code

(I think we have plenty of other stuff that deserves being worked on in the next 2 months more than this one => dropping the milestone. And our CI is not affected to this bug in practice, it can only cause problems for developers running the test suite manually => setting metadata accordingly.)

#2 Updated by CyrilBrulebois 10 months ago

  • Status changed from Confirmed to In Progress

#3 Updated by CyrilBrulebois 10 months ago

  • Status changed from In Progress to Confirmed
  • QA Check set to Ready for QA
  • Feature Branch set to bugfix/14948-avoid-zombies-in-testsuite-runs

I've worked on that and here's a patch that helped back in January. Should probably be considered for other branches?

#4 Updated by CyrilBrulebois 10 months ago

  • Status changed from Confirmed to In Progress

#5 Updated by CyrilBrulebois 10 months ago

Oops, initially posted a branch with WIP commits. Force-pushed seconds ago. And in any case: only the last commit is what matters…

#6 Updated by intrigeri 10 months ago

#7 Updated by intrigeri 10 months ago

  • Feature Branch changed from bugfix/14948-avoid-zombies-in-testsuite-runs to bugfix/14948-avoid-zombies-in-testsuite-runs+force-all-tests

I've merged feature/buster into the branch so it hopefully builds again on Jenkins; and I've added the +force-all-tests suffix so we get more data from Jenkins about it, which should inform the "shall we merge this?" decision :)

#8 Updated by intrigeri 10 months ago

Our test suite is not good enough on Buster yet for Jenkins to give us sufficient insight & confidence wrt. whether this branch breaks anything. So I've transplanted the topic branch onto stable and forced pushed.

#9 Updated by intrigeri 9 months ago

  • Assignee changed from anonym to intrigeri

(As agreed elsewhere.)

#10 Updated by intrigeri 9 months ago

The Process.wait() bits LGTM and should indeed fix part of the problem this ticket is about. Jenkins is happy about this branch so I'll cherry-pick them right away.

Wrt. the .close bits, I have a doubt. I think they do no harm so I'd be fine with merging them, but it seems there's a misunderstanding, quite possibly on my side, so first I'd like to ensure I understand the intent better. What https://ruby-doc.org/core-2.2.0/IO.html#method-i-close does is "Closes ios and flushes any pending writes to the operating system". In itself, I understand it does not reap the child process started with IO.popen. But the commit message suggests that we run .close in order to reap child processes, so I'm confused. I guess that the idea in this branch is to increase the chances that the child process we kill manages to exit (instead of being left as a zombie), by closing any communication channels we may have left open with it — right? But then, shouldn't we run .close before we kill that child process, to increase the success chances of whatever signal handler that process has setup?

Finally, when we kill Chutney, shouldn't we run Process.wait, to ensure the Chutney processes have actually exited before we move on? I see you're adding $chutney_onionservice_job.close but as said above, I don't think that's sufficient to avoid leaving zombie processes around.

#11 Updated by intrigeri 9 months ago

  • Assignee changed from intrigeri to CyrilBrulebois
  • QA Check changed from Ready for QA to Info Needed

intrigeri wrote:

The Process.wait() bits LGTM and should indeed fix part of the problem this ticket is about. Jenkins is happy about this branch so I'll cherry-pick them right away.

Done!

#12 Updated by intrigeri 8 months ago

  • QA Check deleted (Info Needed)

(Preparing to drop the "QA Check" field as per "[Tails-dev] Proposal: Redmine workflow change".)

#13 Updated by intrigeri 3 months ago

  • Related to Bug #17216: Make the test suite clean up after itself even in most tricky failure modes added

#14 Updated by CyrilBrulebois 2 months ago

TBH I don't recall the exact chain of thoughts/internet searches that led me to believe those .close additions were the right thing to do. I wish I had explained things better in the commit message… I think I tried to gather all code paths that could lead to external processes (either by observation or by grepping through the code), and I suppose I stackoverflow'd some answer mentioning .close for IO.popen cases, but I'm not finding that again right now, and the API indeed doesn't seem to suggest this should be what we need to do to chase zombies…

#15 Updated by intrigeri about 2 months ago

  • Status changed from In Progress to Needs Validation
  • Assignee changed from CyrilBrulebois to intrigeri

TBH I don't recall the exact chain of thoughts/internet searches that led me to believe those .close additions were the right thing to do. I wish I had explained things better in the commit message… I think I tried to gather all code paths that could lead to external processes (either by observation or by grepping through the code), and I suppose I stackoverflow'd some answer mentioning .close for IO.popen cases, but I'm not finding that again right now, and the API indeed doesn't seem to suggest this should be what we need to do to chase zombies…

Thanks for looking into this again!

I've merged stable into the topic branch, resolved merge conflicts (since then I have fixed one of the problems this branch was about via 0d3dc2a0), and what's left in the remaining diff is:

  • Adding the call to .close after killing the child process → as per above analysis, I've pushed 5e7ff8e9e644374bea8120096706855bc3c1b5b2 that moves this call to .close to before we send the TERM signal.
  • Store the return value of IO.popen(['grep', …]) in a variable and run .close on it. I'm not sure I understand how this one can help but I'm comfortable with taking this change anyway.

Let's see what Jenkins thinks. If it does not spot regressions, I'll merge and we'll be done for this iteration.

IMO, longer-term we need a more robust and generic solution. I've described one on #17216 :)

#16 Updated by intrigeri about 2 months ago

  • Status changed from Needs Validation to In Progress

#17 Updated by intrigeri about 1 month ago

  • Target version set to Tails_4.2

(I'm close to wrapping this up and I'd like to reduce our amount of WIP.)

#18 Updated by intrigeri about 1 month ago

  • Status changed from In Progress to Resolved
  • % Done changed from 0 to 100

#19 Updated by intrigeri about 1 month ago

  • Assignee deleted (intrigeri)

OK, we're done here for at least redir and grep, in the best case scenario when the test suite manages to run its cleanup hooks. For the worst case scenarios, see you on #17216!

Also available in: Atom PDF