Bug #12009

Jenkins ISO builders are highly unreliable

Added by intrigeri almost 3 years ago. Updated almost 3 years ago.

Status: Resolved
Priority: High
Assignee: -
Category: Continuous Integration
Target version: -
Start date: 12/01/2016
Due date: -
% Done: 100%
Feature Branch: -
Type of work: Sysadmin
Blueprint: -
Starter: -
Affected tool: -
Description

For the past couple of weeks, lots of ISO build jobs have been failing due to Jenkins crashing (OOM killer?).

The first affected builds are:

The only relevant changes I see immediately before this date are allocating a bit more huge pages on lizard, and giving each ISO builder VM a bit more RAM (8912896 KiB → 9437184 KiB, i.e. 8.5 GiB → 9 GiB). Then, on November 25, bertagaz raised them to 11461632 KiB (~10.9 GiB).

This should be fixed or worked around by #11972, but depending on the timing of that work, we may need to fix the problem earlier, and in a different way.


Related issues

Related to Tails - Bug #11858: Monitor if isobuilders systems are running fine (Resolved, 10/03/2016)
Related to Tails - Feature #11972: Switch our Jenkins ISO build system to vagrant-libvirt (Resolved, 12/04/2016)

Associated revisions

Revision 881d4b2c (diff)
Added by intrigeri almost 3 years ago

Have build-tails print some possibly useful debugging info (refs: #12009).

Revision dd54f72f (diff)
Added by intrigeri almost 3 years ago

Have build-tails print some possibly useful debugging info (refs: #12009).

Revision 0abe7d55 (diff)
Added by intrigeri almost 3 years ago

Avoid failing the build when there's no open file in the tmpfs (refs: #12009).

Revision 34553a80 (diff)
Added by intrigeri almost 3 years ago

Avoid failing the build when there's no open file in the tmpfs (refs: #12009).

Revision b9ec485f (diff)
Added by intrigeri almost 3 years ago

build-tails: drop the -l (lazy) option from the umount call (refs: #12009).

If the build tmpfs cannot be unmounted in practice, then the space it uses
remains as occupied cached memory, which fails the next build. So let's try
to really unmount it.

Revision c680ada3 (diff)
Added by intrigeri almost 3 years ago

build-tails: make the umount call verbose (refs: #12009).

… in the hope that it tells us why the filesystem cannot be unmounted,
if this ever happens.

Revision 3c89ca0a (diff)
Added by intrigeri almost 3 years ago

build-tails: fix lsof|grep pattern (refs: #12009).

Revision 0c3b0369 (diff)
Added by intrigeri almost 3 years ago

build-tails: drop the -l (lazy) option from the umount call (refs: #12009).

If the build tmpfs cannot be unmounted in practice, then the space it uses
remains as occupied cached memory, which fails the next build. So let's try
to really unmount it.

Revision 897fd754 (diff)
Added by intrigeri almost 3 years ago

build-tails: make the umount call verbose (refs: #12009).

… in the hope that it tells us why the filesystem cannot be unmounted,
if this ever happens.

Revision 415075d6 (diff)
Added by intrigeri almost 3 years ago

build-tails: kill processes that block the tmpfs umount operation (refs: #12009).

History

#1 Updated by intrigeri almost 3 years ago

  • Related to Bug #11858: Monitor if isobuilders systems are running fine added

#2 Updated by intrigeri almost 3 years ago

  • Related to Feature #11972: Switch our Jenkins ISO build system to vagrant-libvirt added

#3 Updated by intrigeri almost 3 years ago

The process killed by the OOM killer is:

[ pid ]   uid  tgid total_vm      rss nr_ptes swapents oom_score_adj name
[  713]   109   713  1281356    27548     160        0             0 java

#4 Updated by intrigeri almost 3 years ago

  • Assignee changed from intrigeri to bertagaz
  • QA Check set to Info Needed

bertagaz: I don't understand anymore (sorry if I forgot!) why #11972 would fix this breakage. Granted, the memory used by the build system itself will be restricted to whatever we allocate for the Vagrant build VM, but jenkins-slave will still be running on the isobuilderN level-1 guest, and plausibly it would behave just the same as it currently does, and would end up being killed by the OOM killer anyway, no?

#5 Updated by intrigeri almost 3 years ago

  • Target version set to Tails_2.9.1

#6 Updated by intrigeri almost 3 years ago

("Whatever the reason, this bug simply doesn't happen on isobuilder1 with vagrant-libvirt" might be a valid answer :)

#7 Updated by anonym almost 3 years ago

  • Target version changed from Tails_2.9.1 to Tails 2.10

#8 Updated by intrigeri almost 3 years ago

  • Assignee changed from bertagaz to intrigeri

#9 Updated by intrigeri almost 3 years ago

intrigeri wrote:

The only relevant changes I see immediately before this date are allocating a bit more huge pages on lizard, and giving each ISO builder VM a bit more RAM. I don't see how giving ISO builders more memory can trigger this problem.

vagrant/provision/assets/build-tails mounts a tmpfs with size=100%, so giving ISO builders more memory can actually affect this.
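As a sketch of why this matters (the mount command and mount point below are assumptions based on the size=100% option mentioned above, not the actual build-tails code): with size=100%, the tmpfs ceiling tracks the VM's total RAM.

```shell
#!/bin/sh
# size=100% lets the tmpfs grow to the whole of RAM, so the mount
# presumably looks something like this (mount point is a placeholder):
#
#   sudo mount -t tmpfs -o size=100% tmpfs /tmp/tails-build.XXXXXX
#
# The ceiling therefore tracks MemTotal; compute it for this machine:
mem_total_kib=$(awk '/^MemTotal:/ { print $2 }' /proc/meminfo)
echo "tmpfs ceiling with size=100%: ${mem_total_kib} KiB"
```

So raising a builder VM from 8912896 KiB to 11461632 KiB of RAM also raises how much memory a single build's tmpfs may pin.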

#10 Updated by intrigeri almost 3 years ago

  • Description updated (diff)
  • QA Check deleted (Info Needed)

#11 Updated by intrigeri almost 3 years ago

I've noticed that the "cached" value in the output of free sometimes stays very high after a build has succeeded. During a build, this value grows with the amount of space used on the tmpfs mounted on /tmp/tails-build.*. So it looks like the way we umount this tmpfs (as part of trap cleanup EXIT, with the -l, i.e. lazy, option) might sometimes not work, which leaves the memory used by that build occupied, which in turn breaks the next builds.
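The fix eventually applied (see the associated revisions) kills whatever still holds the tmpfs busy and drops the lazy flag. A minimal sketch of such a cleanup, assuming lsof is available; the function name and structure are illustrative, not the actual build-tails code:

```shell
#!/bin/sh
# Sketch only: kill the processes that block a real (non-lazy) umount of
# the build tmpfs, then umount it for real. The function name and the
# lsof-based PID listing are assumptions, not the actual build-tails code.
cleanup_tmpfs() {
    mountpoint="$1"
    # Given a mount point, lsof lists all files open on that filesystem;
    # -t prints bare PIDs. It exits non-zero when nothing matches, so
    # tolerate that.
    for pid in $(lsof -t "$mountpoint" 2>/dev/null || true); do
        kill -9 "$pid" 2>/dev/null || true
    done
    # Deliberately no -l: a lazy detach removes the entry from /proc/mounts
    # but keeps the tmpfs pages allocated (counted as "cached") until the
    # last open file descriptor is closed. A plain, verbose umount either
    # frees the memory or fails loudly.
    umount -v "$mountpoint"
}
# Example (needs root, on a real build mount):
#   cleanup_tmpfs /tmp/tails-build.EvA6qUle
```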

Detailed analysis, on isobuilder2:

  • Dec 18, https://jenkins.tails.boum.org/job/build_Tails_ISO_feature-stretch/101/
    • 15:10:37: build has started, tmpfs not mounted yet, 379 MB cached memory
    • 16:08:45: sudo umount -f -l /tmp/tails-build.EvA6qUle
    • 16:08:47: the tmpfs is not listed in /proc/mounts anymore, but then there's still 7590 MB cached memory
    • 16:10:23: [WS-CLEANUP] Deleting project workspace...[WS-CLEANUP] done
    • 16:10:25: Finished: SUCCESS
    • 16:10:28: cached memory drops to 6386 MB
    • tl;dr: the tmpfs was not successfully unmounted, and the corresponding memory is never freed
  • Dec 18, https://jenkins.tails.boum.org/job/build_Tails_ISO_feature-stretch-unfrozen/50/
    • 21:19:00: build starts, still more than 6 GB of cached memory
    • 21:46:06: 7179 MB cached memory, of which /tmp/tails-build.vhvkbSG0 uses 385 MB
    • 21:46:10: dpkg fails which fails the build
    • 21:46:10: perl -ni -E say $mountpoint if (($mountpoint) = ($_ =~ m{^(?:aufs|tmpfs|devpts-live|proc-live|sysfs-live) on (/tmp/tails-build(?:-tmpfs)?\.[/[:alnum:]]+)}))
    • 21:46:10: sudo umount -f -l /tmp/tails-build.vhvkbSG0
    • 21:46:11: the tmpfs is not listed in /proc/mounts anymore, there's 6661 MB cached memory
    • 21:46:13: Finished: FAILURE
    • tl;dr: the tmpfs was successfully unmounted, and the corresponding memory freed
  • then a couple of builds fail before having mounted their tmpfs, and cached memory doesn't change in any significant way
  • Dec 19, https://jenkins.tails.boum.org/job/build_Tails_ISO_feature-5684-screen-locker/19/console
    • 04:00:05: 10147 MB cached memory, of which /tmp/tails-build.pCpXxptd uses 3.8 GB
    • 04:00:09: oom killer kills the (Jenkins) java process, FATAL: hudson.remoting.RequestAbortedException: java.io.EOFException
    • 04:00:15: 10302 MB cached memory, of which /tmp/tails-build.pCpXxptd uses 4.1 GB
    • 04:00:20: the tmpfs is not listed in /proc/mounts anymore, cached memory dropped to 6168 MB
    • tl;dr: the tmpfs was successfully unmounted, and the corresponding memory freed

I see the same pattern on:

#12 Updated by intrigeri almost 3 years ago

intrigeri wrote:

bertagaz: I don't understand anymore (sorry if I forgot!) why #11972 would fix this breakage.

If the analysis I've posted in my previous comment is correct, then #11972 will indeed fix this: the memory used by the tmpfs will be freed by shutting down the Vagrant level-2 VM. Still, we've been suffering from this breakage for more than a month, and there's no ETA that I'm aware of for #11972, so I'll try to find a quick fix without waiting for it. I'll try not to spend too much time on it, though.

#13 Updated by intrigeri almost 3 years ago

  • Status changed from Confirmed to In Progress
  • % Done changed from 0 to 10

#14 Updated by intrigeri almost 3 years ago

  • % Done changed from 10 to 20

#15 Updated by intrigeri almost 3 years ago

  • % Done changed from 20 to 50
  • QA Check set to Ready for QA

My workaround seems to work. Let's wait a bit more before calling it a success, and then I'll ask anonym to review the set of commits I did to make it work.

#16 Updated by intrigeri almost 3 years ago

  • Assignee changed from intrigeri to anonym

Seems to be fixed! Now, some branches based on Stretch might still not have the relevant changes merged in (I did most of them, though). If they break isobuilders again, the solution is to merge feature/stretch into them.

Please review: git diff c2f09ce2a87d8962964ce7c820be32088c814027..d70a227c2c87d6af95f770189b0eb14054dd8d2e -- vagrant

#17 Updated by anonym almost 3 years ago

  • Status changed from In Progress to Resolved
  • Assignee deleted (anonym)
  • % Done changed from 50 to 100
  • QA Check changed from Ready for QA to Pass

Pass!
