Project

General

Profile

Bug #14948

zombie processes when running the test suite

Added by groente almost 2 years ago. Updated 3 months ago.

Status:
In Progress
Priority:
Normal
Category:
Test suite
Target version:
-
Start date:
11/10/2017
Due date:
% Done:

0%

Feature Branch:
bugfix/14948-avoid-zombies-in-testsuite-runs+force-all-tests
Type of work:
Code
Blueprint:
Starter:
Affected tool:

Description

the build process leaves quite a few zombie processes (mostly avconv and grep) on the isotesters.


Related issues

Blocks Tails - Feature #16209: Core work: Foundations Team Confirmed 03/22/2019

Associated revisions

Revision b1d5e33f (diff)
Added by CyrilBrulebois 5 months ago

Avoid zombies by reaping all child processes (refs: #14948).

IO.popen makes it possible to start a command as a child process but it
won't automatically reap the subprocess when it exits, leading to many
zombies when running the automated test suite.

Fix this in several ways depending on the call site:

  • Create an intermediary variable to hold the subprocess returned by
    IO.popen, making it possible to read its standard output, but also
    allowing to call its close method afterwards. This is less
    straightforward than a IO.popen().gets call, but allows for
    appropriate clean-up.
  • When the subprocess object itself (p) is available, add a "p.close"
    call after killing it through "Process.kill(signal, p.pid)".
  • Otherwise, when only the pid of the subprocess is available, call
    "Process.wait(pid)" after killing it.

Revision f7976eb0 (diff)
Added by CyrilBrulebois 5 months ago

Avoid zombies by reaping all child processes (refs: #14948).

IO.popen makes it possible to start a command as a child process but it
won't automatically reap the subprocess when it exits, leading to many
zombies when running the automated test suite.

Fix this in several ways depending on the call site:

  • Create an intermediary variable to hold the subprocess returned by
    IO.popen, making it possible to read its standard output, but also
    allowing to call its close method afterwards. This is less
    straightforward than a IO.popen().gets call, but allows for
    appropriate clean-up.
  • When the subprocess object itself (p) is available, add a "p.close"
    call after killing it through "Process.kill(signal, p.pid)".
  • Otherwise, when only the pid of the subprocess is available, call
    "Process.wait(pid)" after killing it.

Revision cc86a75f (diff)
Added by CyrilBrulebois 4 months ago

Avoid zombies by reaping all child processes (refs: #14948).

IO.popen makes it possible to start a command as a child process but it
won't automatically reap the subprocess when it exits, leading to many
zombies when running the automated test suite.

Fix this in several ways depending on the call site:

  • Create an intermediary variable to hold the subprocess returned by
    IO.popen, making it possible to read its standard output, but also
    allowing to call its close method afterwards. This is less
    straightforward than a IO.popen().gets call, but allows for
    appropriate clean-up.
  • When the subprocess object itself (p) is available, add a "p.close"
    call after killing it through "Process.kill(signal, p.pid)".
  • Otherwise, when only the pid of the subprocess is available, call
    "Process.wait(pid)" after killing it.

Revision 70456889 (diff)
Added by intrigeri 4 months ago

Test suite: avoid zombies by waiting for killed child processes to exit (refs: #14948).

This is part of cc86a75fe6fdbb5ac54714d62f4b8dfe5831eb80,
authored by Cyril Brulebois <>.

History

#1 Updated by intrigeri almost 2 years ago

  • Subject changed from zombie processes on isotesters to zombie processes when running the test suite
  • Category changed from Continuous Integration to Test suite
  • Target version deleted (Tails_3.5)
  • Type of work changed from Sysadmin to Code

(I think we have plenty of other stuff that deserves being worked on in the next 2 months more than this one => dropping the milestone. And our CI is not affected to this bug in practice, it can only cause problems for developers running the test suite manually => setting metadata accordingly.)

#2 Updated by CyrilBrulebois 5 months ago

  • Status changed from Confirmed to In Progress

#3 Updated by CyrilBrulebois 5 months ago

  • Status changed from In Progress to Confirmed
  • QA Check set to Ready for QA
  • Feature Branch set to bugfix/14948-avoid-zombies-in-testsuite-runs

I've worked on that and here's a patch that helped back in January. Should probably be considered for other branches?

#4 Updated by CyrilBrulebois 5 months ago

  • Status changed from Confirmed to In Progress

#5 Updated by CyrilBrulebois 5 months ago

Oops, initially posted a branch with WIP commits. Force-pushed seconds ago. And in any case: only the last commit is what matters…

#6 Updated by intrigeri 5 months ago

#7 Updated by intrigeri 5 months ago

  • Feature Branch changed from bugfix/14948-avoid-zombies-in-testsuite-runs to bugfix/14948-avoid-zombies-in-testsuite-runs+force-all-tests

I've merged feature/buster into the branch so it hopefully builds again on Jenkins; and I've added the +force-all-tests suffix so we get more data from Jenkins about it, which should inform the "shall we merge this?" decision :)

#8 Updated by intrigeri 4 months ago

Our test suite is not good enough on Buster yet for Jenkins to give us sufficient insight & confidence wrt. whether this branch breaks anything. So I've transplanted the topic branch onto stable and forced pushed.

#9 Updated by intrigeri 4 months ago

  • Assignee changed from anonym to intrigeri

(As agreed elsewhere.)

#10 Updated by intrigeri 4 months ago

The Process.wait() bits LGTM and should indeed fix part of the problem this ticket is about. Jenkins is happy about this branch so I'll cherry-pick them right away.

Wrt. the .close bits, I have a doubt. I think they do no harm so I'd be fine with merging them, but it seems there's a misunderstanding, quite possibly on my side, so first I'd like to ensure I understand the intent better. What https://ruby-doc.org/core-2.2.0/IO.html#method-i-close does is "Closes ios and flushes any pending writes to the operating system". In itself, I understand it does not reap the child process started with IO.popen. But the commit message suggests that we run .close in order to reap child processes, so I'm confused. I guess that the idea in this branch is to increase the chances that the child process we kill manages to exit (instead of being left as a zombie), by closing any communication channels we may have left open with it — right? But then, shouldn't we run .close before we kill that child process, to increase the success chances of whatever signal handler that process has setup?

Finally, when we kill Chutney, shouldn't we run Process.wait, to ensure the Chutney processes have actually exited before we move on? I see you're adding $chutney_onionservice_job.close but as said above, I don't think that's sufficient to avoid leaving zombie processes around.

#11 Updated by intrigeri 4 months ago

  • Assignee changed from intrigeri to CyrilBrulebois
  • QA Check changed from Ready for QA to Info Needed

intrigeri wrote:

The Process.wait() bits LGTM and should indeed fix part of the problem this ticket is about. Jenkins is happy about this branch so I'll cherry-pick them right away.

Done!

#12 Updated by intrigeri 3 months ago

  • QA Check deleted (Info Needed)

(Preparing to drop the "QA Check" field as per "[Tails-dev] Proposal: Redmine workflow change".)

Also available in: Atom PDF