Bug #16177

Building ISO in RAM often fails: mksquashfs reaped by OOM killer

Added by intrigeri 8 months ago. Updated 4 months ago.

Status:
Resolved
Priority:
Normal
Assignee:
-
Category:
Build system
Target version:
Start date:
12/02/2018
Due date:
% Done:
100%

Feature Branch:
bugfix/16177-limit-mksquashfs-memory-usage
Type of work:
Code
Blueprint:
Starter:
No
Affected tool:

Description

My local builds with TAILS_BUILD_OPTIONS=ram, which use gzip SquashFS compression by default, often fail like this:

kernel: Out of memory: Kill process 12575 (mksquashfs) score 145 or sacrifice child
kernel: Killed process 12575 (mksquashfs) total-vm:3580684kB, anon-rss:1992044kB, file-rss:4kB, shmem-rss:2288kB
kernel: oom_reaper: reaped process 12575 (mksquashfs), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB

xz SquashFS compression does not seem to be affected: this problem does not happen on Jenkins, which uses the default xz compression.

When building in RAM we mount a tmpfs with size=100%, which can trigger this kind of issue when that filesystem is nearly full. Bumping the amount of RAM given to the VM (via VM_MEMORY_BASE) should fix it; this has already been done on feature/buster and our CI was adjusted accordingly. What's strange is that the problem only happens ~20-40% of the time, so perhaps a system event eats precious RAM at exactly the wrong time; if so, we could fix the problem by disabling that event or ensuring it runs before the build starts.
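
For reference, a size=100% tmpfs mount looks roughly like this; the mount point is illustrative, the build system generates its own:

# Illustrative only: a tmpfs allowed to grow to 100% of physical RAM,
# similar to what the "ram" build option sets up.
sudo mount -t tmpfs -o size=100% tmpfs /tmp/tails-build-tmpfs.XXXXXX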


Related issues

Related to Tails - Feature #9788: Take benefit from squashfs-tools 4.3 for faster test builds Resolved 07/20/2015
Related to Tails - Bug #16457: Building feature/buster in RAM sometimes fails: mksquashfs reaped by OOM killer Resolved 02/11/2019
Blocks Tails - Feature #15507: Core work 2019Q1: Foundations Team Resolved 04/08/2018

Associated revisions

Revision 6ce0bf43 (diff)
Added by intrigeri 8 months ago

Disable irrelevant recurring jobs in Vagrant build box (refs: #16177)

They are useless and might explain some FTBFS where mksquashfs is reaped
by the OOM killer.

Revision e0ec60c0
Added by intrigeri 6 months ago

Merge branch 'bugfix/16177-ftbfs-mksquashfs-oom-killed' into devel

Refs: #16177

Revision a98c0318 (diff)
Added by intrigeri 6 months ago

Limit the memory used by mksquashfs to 768M (refs: #16177)

By default mksquashfs will use 25% of the physical memory. So when we use the
"ram" build option and build in a VM with 13GB of RAM, of which up to 12G is
supposed to be used by the build tmpfs, mksquashfs will try to use 13/4 = 3.25G
of memory. It will then get reaped by the OOM killer more or less often,
depending on how much space is really used in the build tmpfs and how much
memory the rest of the system is using. So let's limit the memory used by
mksquashfs to 75% of the memory we allocate to the build VM, excluding the part
of it that we expect tmpfs data to fill.

In passing, the fact mksquashfs does not get killed every time suggests that our
current BUILD_SPACE_REQUIREMENT value exceeds the real needs of a build: a value
around 10 or 11G should be enough. But that will be for another commit.
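
For reference, a hedged sketch of the arithmetic behind the 768M figure (variable
names are illustrative, not necessarily those used by the Rakefile):

# Memory budget described in the commit message above:
vm_total_mem_mb=$((13 * 1024))     # total RAM of the build VM
tmpfs_mem_mb=$((12 * 1024))        # RAM we expect the build tmpfs to fill
# 75% of whatever is left over goes to mksquashfs:
mksquashfs_mem_mb=$(( (vm_total_mem_mb - tmpfs_mem_mb) * 75 / 100 ))
echo "-mem ${mksquashfs_mem_mb}M"  # prints "-mem 768M"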

Revision 69939005 (diff)
Added by intrigeri 6 months ago

Limit the memory used by mksquashfs to 512M (refs: #16177)

By default mksquashfs will use 25% of the physical memory. So when we use the
"ram" build option and build in a VM with 13GB of RAM, of which up to 12G is
supposed to be used by the build tmpfs, mksquashfs will try to use 13/4 = 3.25G
of memory. It will then get reaped by the OOM killer more or less often,
depending on how much space is really used in the build tmpfs and how much
memory the rest of the system is using. So let's limit the memory used by
mksquashfs to 50% of the memory we allocate to the build VM, excluding the part
of it that we expect tmpfs data to fill.

In passing, the fact mksquashfs does not get killed every time suggests that our
current BUILD_SPACE_REQUIREMENT value exceeds the real needs of a build: a value
around 10 or 11G should be enough. But that will be for another commit.

Revision 0892bff8 (diff)
Added by intrigeri 6 months ago

Limit the memory used by mksquashfs to 512M (refs: #16177)

By default mksquashfs will use 25% of the physical memory. So when we use the
"ram" build option and build in a VM with 13GB of RAM, of which up to 12G is
supposed to be used by the build tmpfs, mksquashfs will try to use 13/4 = 3.25G
of memory. It will then get reaped by the OOM killer more or less often,
depending on how much space is really used in the build tmpfs and how much
memory the rest of the system is using. So let's limit the memory used by
mksquashfs to 50% of the memory we allocate to the build VM, excluding the part
of it that we expect tmpfs data to fill.

In passing, the fact mksquashfs does not get killed every time suggests that our
current BUILD_SPACE_REQUIREMENT value exceeds the real needs of a build: a value
around 10 or 11G should be enough. But that will be for another commit.

Revision 3fb3a793 (diff)
Added by intrigeri 6 months ago

Vagrant: use xz with default settings to compress non-release SquashFS (refs: #16177)

squashfs-tools 1:4.3-11, used to build feature/buster, does not consistently
honor the value passed to -mem: the xz compressor does but at least the gzip and
lzo ones don't. This makes the build often fail because mksquashfs gets reaped
by the OOM-killer. Our only other option is currently to bump the build VM
memory a lot, which is going to be painful on developers' systems and might
not be an option on Jenkins.

So let's fall back to xz with default settings (not the crazy slow but
efficient settings we use at release time) when building non-release images.
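
For illustration, the difference boils down to something like the following option
strings (the exact values in the build scripts may differ; the release-time settings
are the ones also shown in the debugging example further down this ticket):

# Release images: slow, maximum-effort xz compression
MKSQUASHFS_OPTIONS='-comp xz -Xbcj x86 -b 1024K -Xdict-size 1024K'
# Non-release images after this change: xz with default settings
MKSQUASHFS_OPTIONS='-comp xz'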

Revision 1435cb75 (diff)
Added by intrigeri 5 months ago

Limit the memory used by mksquashfs to 512M (refs: #16177)

By default mksquashfs will use 25% of the physical memory. So when we use the
"ram" build option and build in a VM with 13GB of RAM, of which up to 12G is
supposed to be used by the build tmpfs, mksquashfs will try to use 13/4 = 3.25G
of memory. It will then get reaped by the OOM killer more or less often,
depending on how much space is really used in the build tmpfs and how much
memory the rest of the system is using. So let's limit the memory used by
mksquashfs to 50% of the memory we allocate to the build VM, excluding the part
of it that we expect tmpfs data to fill.

In passing, the fact mksquashfs does not get killed every time suggests that our
current BUILD_SPACE_REQUIREMENT value exceeds the real needs of a build: a value
around 10 or 11G should be enough. But that will be for another commit.

Revision 58395bbb
Added by intrigeri 5 months ago

Merge branch 'bugfix/16177-limit-mksquashfs-memory-usage' into stable (Fix-committed: #16177)

History

#1 Updated by intrigeri 8 months ago

#2 Updated by intrigeri 8 months ago

segfault, kibi: do you build in RAM (check if the Vagrant VM has ~1GB of RAM or 13+GB)? If yes, do you see this problem too?

#3 Updated by intrigeri 8 months ago

Peak memory usage inside the build VM (sampled at 0.5s granularity) during two successful RAM+gzip builds. In each pair of lines below, the first is the output of free -m and the second is the "Avail" column of df -BM /tmp/tails*:

Mem:          13026        2076         262       10192       10687         435
2834M
Mem:          13026        1450         725       10355       10851         898
2680M
Mem:          13026        1631         475       10437       10920         635
2598M
Mem:          13026        1637         434       10472       10954         594
2564M
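
A minimal sketch of how such a measurement can be taken inside the build VM (not
necessarily the exact commands I used):

# Sample free memory and the tmpfs headroom every 0.5s while the build runs:
while true; do
    free -m | grep '^Mem:'
    df -BM /tmp/tails* | awk 'NR > 1 {print $4}'
    sleep 0.5
done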

#4 Updated by segfault 8 months ago

intrigeri wrote:

segfault, kibi: do you build in RAM (check if the Vagrant VM has ~1GB of RAM or 13+GB)? If yes, do you see this problem too?

My Vagrant VM has ~1GB of RAM and I have never seen this problem.

#5 Updated by intrigeri 8 months ago

  • Status changed from Confirmed to In Progress
  • % Done changed from 0 to 10
  • Feature Branch set to bugfix/16177-ftbfs-mksquashfs-oom-killed
  • Starter set to No

intrigeri wrote:

So perhaps a system event eats precious RAM at the exact wrong time and we could fix the problem by disabling it or ensuring it's done before the build starts.

Before bumping the VM memory, I'd like to give this a try → done on the topic branch. Let's merge this and if I still see the problem, let's cherry-pick 09233368a326ea739c3a1b6927c6cb59c4b10c5e.

#6 Updated by intrigeri 8 months ago

My Vagrant VM has ~1GB of RAM and I have never seen this problem.

Thanks. I've never seen it either before I switched back to building in RAM a couple weeks ago, so it seems that my "Building ISO in RAM" classification is correct.

#7 Updated by intrigeri 8 months ago

  • Assignee changed from intrigeri to segfault
  • QA Check set to Ready for QA

intrigeri wrote:

intrigeri wrote:

So perhaps a system event eats precious RAM at the exact wrong time and we could fix the problem by disabling it or ensuring it's done before the build starts.

Before bumping the VM memory, I'd like to give this a try → done on the topic branch. Let's merge this and if I still see the problem, let's cherry-pick 09233368a326ea739c3a1b6927c6cb59c4b10c5e.

Please review and if happy, merge into master → stable → devel and then reassign to me: I'm not convinced that will suffice to fix the problem I'm seeing and I'll keep an eye on it.

#8 Updated by intrigeri 7 months ago

  • Target version changed from Tails_3.11 to Tails_3.12

Too late for 3.11.

#9 Updated by intrigeri 7 months ago

#10 Updated by intrigeri 7 months ago

#11 Updated by intrigeri 7 months ago

  • Assignee deleted (segfault)

#12 Updated by intrigeri 6 months ago

  • Assignee set to hefee

#13 Updated by hefee 6 months ago

I ran into this issue too, with 14336 MB of RAM attached to the VM and TAILS_BUILD_OPTIONS="ram gzipcomp":

path relative to the source directories.  IgnoringKilled
P: Begin unmounting filesystems...

real    19m23.041s
user    17m37.388s
sys     3m13.152s
+ kill -9 14954
+ cleanup
+ [ -n /tmp/tails-build.nAPsvka5 ]
+ cd /
+ remove_build_dirs
+ old_build_dirs
+ + mount
tac
+ perl -ni -E say $mountpoint if (($mountpoint) = ($_ =~ m{^(?:aufs|tmpfs|devpts-live|proc-live|sysfs-live) on (/tmp/tails-build(?:-tmpfs)?\.[/[:alnum:]]+)}))
-i used with no filenames on the command line, reading from STDIN.
+ tries=0
+ sudo lsof
+ grep --fixed-strings /tmp/tails-build.nAPsvka5
+ true
+ sudo umount -f --verbose /tmp/tails-build.nAPsvka5
umount: /tmp/tails-build.nAPsvka5 (tmpfs) unmounted
+ sudo rm -rf /tmp/tails-build.nAPsvka5
+ sudo rm -rf /tmp/tails-build.nAPsvka5
rake aborted!
VagrantCommandError: 'vagrant ["ssh", "-c", "MKSQUASHFS_OPTIONS='-comp gzip -Xcompression-level 1' ... command failed with exit status 137
/home/hefee/tails/foundation/tails/Rakefile:112:in `rescue in run_vagrant'
/home/hefee/tails/foundation/tails/Rakefile:109:in `run_vagrant'
/home/hefee/tails/foundation/tails/Rakefile:127:in `call'
/home/hefee/tails/foundation/tails/Rakefile:127:in `block (2 levels) in <top (required)>'
/home/hefee/tails/foundation/tails/Rakefile:413:in `block in <top (required)>'

Caused by:
CommandError: command ["vagrant", "ssh", "-c", "MKSQUASHFS_OPTIONS='-comp gzip -Xcompression-level 1' ... failed with exit status 137
/home/hefee/tails/foundation/tails/Rakefile:72:in `run_command'
/home/hefee/tails/foundation/tails/Rakefile:110:in `run_vagrant'
/home/hefee/tails/foundation/tails/Rakefile:127:in `call'
/home/hefee/tails/foundation/tails/Rakefile:127:in `block (2 levels) in <top (required)>'
/home/hefee/tails/foundation/tails/Rakefile:413:in `block in <top (required)>'
Tasks: TOP => build
(See full trace by running task with --trace)
rake build  7,60s user 2,14s system 0% cpu 29:59,58 total

Merging bugfix/16177-ftbfs-mksquashfs-oom-killed does not help to get rid of my issue.

#14 Updated by hefee 6 months ago

  • Assignee changed from hefee to intrigeri
  • QA Check changed from Ready for QA to Pass

branch looks fine to me. Review passed.

#15 Updated by intrigeri 6 months ago

  • Assignee deleted (intrigeri)
  • Target version changed from Tails_3.12 to Tails_3.13
  • QA Check deleted (Pass)
  • Feature Branch deleted (bugfix/16177-ftbfs-mksquashfs-oom-killed)

Merged bugfix/16177-ftbfs-mksquashfs-oom-killed, which is an improvement, but that does not fix the problem in all cases.

#16 Updated by intrigeri 6 months ago

See 50fbec1225bfb4439bd1d0f2895bcfe43da90fbb, which I assume is about this problem.

#17 Updated by intrigeri 6 months ago

I wonder if this is a function of the number of cores or hyperthreads of the host system (which determines the number of cores of the build VM): mksquashfs parallelizes work over all available CPUs, so it might use more memory when more cores are present. E.g. my system has 8 hyperthreads => 8 vCPUs in the build VM, and when building feature/buster I see mksquashfs use more than 3.1g (top's RES column) of memory.
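
If that hypothesis holds, one hedged way to test it would be to cap mksquashfs'
parallelism explicitly with its -processors option (illustrative invocation, not
something our build system currently does):

# Run the compression on fewer CPUs to see whether peak memory usage drops:
mksquashfs chroot filesystem.squashfs -comp gzip -processors 2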

#18 Updated by intrigeri 6 months ago

I think it's even worse: when building feature/buster with VM_MEMORY_BASE = 3.5*1024, BUILD_SPACE_REQUIREMENT = 12*1024, and the ram defaultcomp build options, I've seen mksquashfs eat basically as much memory as it can (up to 4.2g, i.e. everything but what's used by the tmpfs) until it gets killed. So it looks like mksquashfs might adapt its memory usage to the memory it sees as available, while counting space used by the tmpfs as available.

I think debian/patches/0002-fix_phys_mem_calculation.patch points to the relevant code: indeed, the memory mksquashfs will use is a fraction of the physical memory.

Note that mksquashfs supports a -mem option. According to the README:

-mem <size>             Use <size> physical memory.  Currently set to 1922M
                        Optionally a suffix of K, M or G can be given to specify
                        Kbytes, Mbytes or Gbytes respectively

… but on my sid system with 4.3-8.0tails1, mksquashfs --help instead says "Currently set to 3968M" (i.e. ~25% of my total RAM).

So we should probably pass a fixed value to the -mem option that works in all configurations we care about. Otherwise, no matter how much we bump the build VM's memory, mksquashfs will simply use more and more of it.
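
Concretely, the idea is to make the build pass something along these lines; the
exact wiring through MKSQUASHFS_OPTIONS may differ:

# In the build environment the options are passed via MKSQUASHFS_OPTIONS
# (as seen in the build logs above); adding a fixed cap would look like:
MKSQUASHFS_OPTIONS='-comp gzip -Xcompression-level 1 -mem 512M'

# Or, when testing by hand inside the build VM:
mksquashfs chroot filesystem.squashfs -comp gzip -mem 512M -no-progress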

#19 Updated by intrigeri 6 months ago

  • Assignee set to intrigeri

#20 Updated by intrigeri 6 months ago

  • Feature Branch set to bugfix/16177-limit-mksquashfs-memory-usage, bugfix/16177-limit-mksquashfs-memory-usage-buster

#21 Updated by intrigeri 6 months ago

  • Subject changed from Building ISO in RAM with gzip compression often fails: mksquashfs reaped by OOM killer to Building ISO in RAM often fails: mksquashfs reaped by OOM killer

#22 Updated by intrigeri 6 months ago

  • Related to Feature #9788: Take benefit from squashfs-tools 4.3 for faster test builds added

#23 Updated by intrigeri 5 months ago

  • Priority changed from Normal to Elevated

#24 Updated by intrigeri 5 months ago

  • Priority changed from Elevated to Normal

Test @anonym, are you notified?

#25 Updated by intrigeri 5 months ago

Test results on Tails 3.x

Test results on bugfix/16177-limit-mksquashfs-memory-usage (based on devel, VM_MEMORY_BASE = 1024, BUILD_SPACE_REQUIREMENT = 12*1024, -mem 512M):

  • noram gzipcomp: works
  • noram defaultcomp: works
  • ram gzipcomp: works
  • ram defaultcomp: works

⇒ For Stretch we're good.

Next steps for Tails 3.x

I'll propose this branch for 3.13 and I'm confident it will improve developer experience :)

Test results on feature/buster

On feature/buster it's a different story. Test results on bugfix/16177-limit-mksquashfs-memory-usage-buster (based on feature/buster) with smaller -mem values:

  • noram gzipcomp (still using gzip):
    • VM_MEMORY_BASE = 1024, BUILD_SPACE_REQUIREMENT = 12*1024 + 256: fails with -mem 512M
    • VM_MEMORY_BASE = 1.5*1024, BUILD_SPACE_REQUIREMENT = 12*1024 + 256: fails with -mem 512M (used at least 1.3g)
    • VM_MEMORY_BASE = 1.5*1024, BUILD_SPACE_REQUIREMENT = 12*1024 + 256: fails with -mem 256M
    • VM_MEMORY_BASE = 1.5*1024, BUILD_SPACE_REQUIREMENT = 12*1024 + 256: fails with -mem 64M (used at least 1.2g)
  • ram defaultcomp:
    • VM_MEMORY_BASE = 1024, BUILD_SPACE_REQUIREMENT = 12*1024 + 256: fails with -mem 512M
    • VM_MEMORY_BASE = 1.5*1024, BUILD_SPACE_REQUIREMENT = 12*1024 + 256: fails with -mem 512M
    • VM_MEMORY_BASE = 2*1024, BUILD_SPACE_REQUIREMENT = 12*1024 + 256: fails with -mem 512M
  • noram defaultcomp: not tested

In all cases, very little memory is used by the rest of the system. mksquashfs gets reaped by the OOM killer because it attempts to use way more memory than what we ask it to.

Conclusion: squashfs-tools 1:4.3-11, used to build feature/buster, does not always honor the value passed to -mem: at least the gzip and lzo compressors ignore it; I've seen the xz compressor honor it with its default settings, but I need to do some more tests with our custom settings. OTOH squashfs-tools 1:4.3-3.0tails4, used to build Tails 3.x, consistently honors that argument; note that this package has a pile of non-upstream patches (collected by lynxis) applied that 4.3-11 does not include.

So here are results on an updated bugfix/16177-limit-mksquashfs-memory-usage-buster, based on feature/buster, that sets -mem 512M, VM_MEMORY_BASE = 2*1024, BUILD_SPACE_REQUIREMENT = 12*1024 + 256, and whose "fastcomp" build option now uses xz with non-crazy compression settings:

  • noram fastcomp: works
  • ram fastcomp: works
  • noram defaultcomp: fails (OOM-killed)
  • ram defaultcomp: works, but we're very lucky as the mksquashfs bug is still there:
    • mksquashfs uses up to 2.8G despite -mem 512M (see the memory-watching sketch below); very little free RAM was left near the end of the compression process; had mksquashfs needed the 25% it would use if it ignored our custom -mem option, it would have been reaped by the OOM killer
    • FTR 10G were used on the tmpfs, so our BUILD_SPACE_REQUIREMENT value may be overestimated
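
A hedged sketch of how mksquashfs' resident memory can be watched during such a
build (top's RES column gives the same information; the process name match may
need tweaking):

# Print mksquashfs' RSS (in KB) once per second while it runs:
watch -n 1 'ps -C mksquashfs -o rss=,args='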

I've merged that branch into feature/buster anyway, so that it builds again on Jenkins (ram defaultcomp), whereas previously mksquashfs would be reaped by the OOM killer. We'll see how it goes.

Next steps for Buster

  1. Move all the Buster-specific bits to a dedicated ticket.
  2. Verify that squashfs-tools 4.3-3.0tails4 does honor -mem consistently (I think I did already but let's be sure).
  3. Find out why squashfs-tools 4.3-11 does not honor -mem consistently, e.g.:
    1. Try to reproduce with pristine 4.3-3 changes
    2. Try reverting the 4.3-3..4.3-11 changes
    3. Try applying the patchset collected by lynxis on top of 4.3-11
  4. Report the bug, ideally with a patch attached.

How to debug this sort of thing

FTR, to debug this kind of thing: build with the rescue build option, run rake vm:ssh after the failure, chroot into /tmp/tails-build*/chroot, and then run something like this:

mksquashfs chroot filesystem.squashfs \
  -comp xz -Xbcj x86 -b 1024K -Xdict-size 1024K \
  -no-exports -mem 512M -wildcards \
  -ef chroot/usr/share/tails/build/mksquashfs-excludes \
  -no-progress -info -sort squashfs.sort

#26 Updated by intrigeri 5 months ago

  • Related to Bug #16457: Building feature/buster in RAM sometimes fails: mksquashfs reaped by OOM killer added

#27 Updated by intrigeri 5 months ago

Move all the Buster-specific bits to a dedicated ticket.

Done: #16457

#28 Updated by intrigeri 5 months ago

  • Feature Branch changed from bugfix/16177-limit-mksquashfs-memory-usage, bugfix/16177-limit-mksquashfs-memory-usage-buster to bugfix/16177-limit-mksquashfs-memory-usage

#29 Updated by intrigeri 5 months ago

  • Assignee changed from intrigeri to hefee
  • % Done changed from 10 to 50
  • QA Check set to Ready for QA

@hefee, do you want to review this? Given the trouble you're having with our build system at the moment, it seems you'll have to dive into related parts of our code base anyway, so this could be a good opportunity to learn and kill 2 drones with one stone :)

#30 Updated by anonym 5 months ago

In order not to bloat #16337, I'll respond here to your question from #16337#note-6:

intrigeri wrote:

@anonym, I'm working on #16177 and I'm curious about what exact problem aca28cec16287befd7dbadbfee7834c153936b04 solved for you. If that was about mksquashfs getting killed:

Yes, mksquashfs was OOM killed

  • gzip compression or xz?

gzip

  • Building in RAM or not?

RAM

  • Please try to revert this commit and retry with bugfix/16177-limit-mksquashfs-memory-usage merged into your branch.

Sure, I'll report back here!

#31 Updated by anonym 5 months ago

anonym wrote:

  • Please try to revert this commit and retry with bugfix/16177-limit-mksquashfs-memory-usage merged into your branch.

Sure, I'll report back here!

The build was successful! Looks promising! :)

#32 Updated by intrigeri 5 months ago

The build was successful! Looks promising! :)

Great!

#33 Updated by hefee 5 months ago

  • QA Check changed from Ready for QA to Pass

Works for me. I merged this into hefee/bugfix/16349-tor-0.3.5+force-all-tests and had no issues building that branch.

#34 Updated by hefee 5 months ago

  • Assignee changed from hefee to intrigeri

#35 Updated by intrigeri 5 months ago

  • Status changed from In Progress to Fix committed
  • % Done changed from 50 to 100

#36 Updated by intrigeri 5 months ago

  • Assignee deleted (intrigeri)

Thanks hefee for the review and anonym for testing!

#37 Updated by CyrilBrulebois 4 months ago

  • Status changed from Fix committed to Resolved
