Bug #17104

"Erasure of memory freed by killed userspace processes" test scenario regression caused by the ugprade to Linux 5.2.0-3

Added by intrigeri about 2 months ago. Updated 26 days ago.

Status:
Resolved
Priority:
Elevated
Assignee:
-
Category:
-
Target version:
Start date:
Due date:
% Done:

100%

Feature Branch:
bugfix/17104-memory-erasure-scenario-fragile
Type of work:
Code
Blueprint:
Starter:
Affected tool:

Description

See #17100.


Related issues

Related to Tails - Bug #17100: devel branch FTBFS: linux-image-5.2.0-2-amd64 not available Resolved
Related to Tails - Bug #17117: Upgrade to Linux 5.3 Resolved
Blocks Tails - Feature #16209: Core work: Foundations Team Confirmed

Associated revisions

Revision 61bed214 (diff)
Added by segfault about 1 month ago

Increase memory reserved for root processes and PF_MEMALLOC allocations (refs: #17104)

Revision c6838f36 (diff)
Added by anonym about 1 month ago

Test suite: only partially fill memory for userspace processes.

Filling all memory often leads to the VM crashing. So why fill all
memory? Back when Tails was 32-bit (or mixed with 64-bit) there was
some uncertainty around the 4 GiB limit, which motivated us to fill
a region larger than 4 GiB, so without much thought we might have
just gone ahead and filled all of it. But, AFAICT, if we went with the
approach I describe above and allocated >4 GiB then the test would
still have done what we wanted.

In today's fully 64-bit world we have no reason for allocating >4 GiB,
so we could go with 128 MiB like the other tests arbitrarily (?) do.

Will-fix: #17104

Revision 43cce79b
Added by anonym about 1 month ago

Merge remote-tracking branch 'origin/bugfix/17104-memory-erasure-scenario-fragile' into testing

Fix-committed: #17104

History

#1 Updated by intrigeri about 2 months ago

#2 Updated by intrigeri about 2 months ago

  • Related to Bug #17100: devel branch FTBFS: linux-image-5.2.0-2-amd64 not available added

#3 Updated by intrigeri about 1 month ago

  • Related to Bug #17117: Upgrade to Linux 5.3 added

#4 Updated by intrigeri about 1 month ago

  • Assignee set to intrigeri

Will investigate via #17117 if more recent kernels are affected.

#5 Updated by intrigeri about 1 month ago

In the last 8 test suite runs on devel, this problem happened 5 times.
In the first 2 test suite runs with Linux 5.3-rc5, this problem did not happen.
This makes me hopeful, but that's not enough data points to draw conclusions yet ⇒ I'll keep my fingers crossed and keep an eye on this while working on #17117.

#6 Updated by intrigeri about 1 month ago

  • Assignee deleted (intrigeri)

Unfortunately, #17117 does not fix that problem: it happens on most recent test suite runs for that branch on Jenkins.

#7 Updated by segfault about 1 month ago

  • Status changed from Confirmed to In Progress

#8 Updated by intrigeri about 1 month ago

  • Feature Branch set to bugfix/17104-memory-erasure-scenario-fragile

@segfault, oh oh, interesting! :)

#9 Updated by segfault about 1 month ago

intrigeri wrote:

@segfault, oh oh, interesting! :)

Not really, I just increased the memory constants as a wild guess. It didn't work: in both test runs for this commit, the scenario failed again. I will delete the commit from the feature branch.

And I still don't have a working setup to run the test suite locally, so I think I'm not in a good position to work on this.

#10 Updated by anonym about 1 month ago

  • Assignee set to anonym

Taking a look...

#11 Updated by anonym about 1 month ago

  • Assignee deleted (anonym)

So the problem reproduces in my setup with a frequency similar to Jenkins, which is good (historically this has not been the case for me, which made debugging a nightmare)!

I tried playing with the values we set in the "I prepare Tails for memory erasure tests" step:
  • Dropping all of them ⇒ same crash
  • Bumping {kernel,admin}_mem_reserved_k by various amounts ⇒ same crash
  • Trying things we used to do before, like vm.overcommit_memory = 2 with various values of vm.overcommit_ratio ⇒ the remote shell server crashes, failing to allocate memory or to start a subprocess

So no luck. I also tried bumping the timeout for fillram to an hour, just to see if the deadlock would resolve eventually, but it didn't. So the VM seems to have crashed completely.
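
For reference, the tweaks above boil down to something like the following, applied inside the VM via the remote shell. This is only a sketch: the mapping of {kernel,admin}_mem_reserved_k onto the vm.* reserve sysctls, the $vm.execute helper usage, and all values are assumptions for illustration, not the actual step definitions.

  # Rough sketch, not the actual branch: bump the memory reserves, then
  # try the old strict-overcommit settings. All values are made up.
  kernel_mem_reserved_k = 256 * 1024  # KiB, made-up value
  admin_mem_reserved_k = 128 * 1024   # KiB, made-up value
  $vm.execute("sysctl vm.min_free_kbytes=#{kernel_mem_reserved_k}")
  $vm.execute("sysctl vm.admin_reserve_kbytes=#{admin_mem_reserved_k}")
  $vm.execute("sysctl vm.overcommit_memory=2")
  $vm.execute("sysctl vm.overcommit_ratio=80")  # made-up value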


Taking a step back, do we really have to attempt to fill all RAM and thus involve OOM? It seems to me that it would be enough to fill a portion of free RAM, leaving a good safety margin to avoid these crashes (which we've had problems with before). In fact, this is effectively what the filesystem cache tests do. Thoughts?

#12 Updated by anonym about 1 month ago

I just noticed that this scenario lacks an "anti-test", and I can see how it would be hard to implement: the point at which you want to dump the memory and verify that the patterns are there is right before OOM kills fillram, which is pretty hard to pinpoint, to say the least. :) Still, it would inspire confidence in our testing method to have such a test!

Luckily, if we do a partial fill, like I suggest above, we retain full control: we can easily make the partial filler signal when it is done (e.g. create a file that we wait for via the remote shell), so we know when to dump and verify that the pattern is present; then we manually kill it, after which we dump again and verify that the pattern is gone.
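
Something along these lines, as a rough sketch only (the filler command, the marker file, and the helper names are assumptions for illustration, not actual test suite code):

  # Hypothetical partial-fill flow; fill_ram, the marker file and the
  # memory-dump steps are placeholders.
  fill_mib = 128
  marker = "/tmp/fill_ram_done"
  # Fill fill_mib MiB with a known pattern, signal completion, then wait
  # around so that we decide when the process dies.
  $vm.spawn("fill_ram #{fill_mib} && touch #{marker} && sleep infinity")
  try_for(60) { $vm.file_exist?(marker) }
  # 1. dump guest memory, assert the pattern is present
  # 2. kill the filler
  # 3. dump again, assert the pattern is gone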

So I am just wondering if I am missing something important about filling specifically all of RAM. Back when Tails was 32-bit (or mixed with 64-bit) there was some uncertainty around the 4 GiB limit, which motivated us to fill a region larger than 4 GiB, so without much thought we might have just gone ahead and filled all of it. But, AFAICT, if we went with the approach I describe above and allocated >4 GiB then the test would still have done what we wanted.

Also note that in today's 64-bit world we have no reason for allocating >4 GiB, so we could go with 128 MiB like the other tests arbitrarily (?) do.

#13 Updated by intrigeri about 1 month ago

Hi @anonym,

I just noticed that this scenario lacks an "anti-test", and I can see how it would be hard to implement: the point at which you want to dump the memory and verify that the patterns are there is right before OOM kills fillram, which is pretty hard to pinpoint, to say the least. :) Still, it would inspire confidence in our testing method to have such a test!

Right, there's a possibility that the fillram instances fail for unrelated reasons, and then we're testing exactly nothing here.

Luckily, if we do a partial fill, like I suggest above, we retain full control: we can easily make the partial filler signal when it is done (e.g. create a file that we wait for via the remote shell), so we know when to dump and verify that the pattern is present; then we manually kill it, after which we dump again and verify that the pattern is gone.

So I am just wondering if I am missing something important about filling specifically all of RAM. Back when Tails was 32-bit (or mixed with 64-bit) there was some uncertainty around the 4 GiB limit, which motivated us to fill a region larger than 4 GiB, so without much thought we might have just gone ahead and filled all of it. But, AFAICT, if we went with the approach I describe above and allocated >4 GiB then the test would still have done what we wanted.

Also note that in today's 64-bit world we have no reason for allocating >4 GiB, so we could go with 128 MiB like the other tests arbitrarily (?) do.

This proposal looks entirely good enough to me. And of course, it should solve this ticket :)

#14 Updated by anonym about 1 month ago

  • Assignee set to anonym

Then I'll implement my proposal!

#15 Updated by anonym about 1 month ago

  • Status changed from In Progress to Needs Validation
  • % Done changed from 0 to 50

Let's see what Jenkins thinks!

#16 Updated by anonym about 1 month ago

  • Assignee deleted (anonym)
  • Type of work changed from Research to Code

Out of six runs, four were failure-free, and the other two had only unrelated failures. Looks good to me!

#17 Updated by intrigeri about 1 month ago

  • Assignee set to intrigeri

Out of six runs, four were failure-free, and the other two had only unrelated failures. Looks good to me!

Ooh yeah. I'll try to look into it today. Failing that, it'll be on Thursday or Friday, so if someone wants to take this review on Tue or Wed, that's totally fine by me.

#18 Updated by intrigeri about 1 month ago

  • Assignee changed from intrigeri to anonym

Neat!

Just one nit: any good reason to use pgrep instead of $vm.has_process??
If there's one, please merge into testing → devel yourself :)

#19 Updated by anonym about 1 month ago

  • Status changed from Needs Validation to Fix committed
  • Assignee deleted (anonym)
  • % Done changed from 50 to 100

intrigeri wrote:

Just one nit: any good reason to use pgrep instead of $vm.has_process??

I tried $vm.has_process?, which is based on pidof -x, but it doesn't match against the arguments at all, i.e. the script name used to identify the process. I didn't dare switch the implementation of $vm.has_process?.
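
For reference, the difference is roughly this: pidof matches process (or, with -x, script) names, while pgrep -f matches the full command line, arguments included. A sketch of the kind of check involved, where the script name is a placeholder and $vm.execute is assumed to be the usual remote shell helper:

  # Placeholder script name; $vm.execute runs a command in the guest and
  # the returned object responds to success?.
  script = "fill_ram_partially"
  # Roughly what $vm.has_process? relies on: matching by name.
  $vm.execute("pidof -x #{script}").success?
  # Matching against the full command line, arguments included:
  $vm.execute("pgrep --full #{script}").success?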

I considered actually making the remote shell record the PID of the last executed command and store it in the returned RemoteShell::ShellCommand object (like $! in Bash), but went with this simpler solution in the end.

If there's one, please merge into testing → devel yourself :)

Ack, I'll merge this and close this ticket, but if you think it's worth spending time on any of the above, let me know and I'll file a ticket.

#20 Updated by intrigeri about 1 month ago

Hi @anonym,

I tried $vm.has_process?, which is based on pidof -x, but it doesn't match against the arguments at all, i.e. the script name used to identify the process. I didn't dare switch the implementation of $vm.has_process?.

OK!

I considered actually making the remote shell record the PID of the last executed command and store it in the returned RemoteShell::ShellCommand object (like $! in Bash), but went with this simpler solution in the end.

In passing, mostly for your curiosity: I played a bit with systemd-run a few weeks ago for other reasons (finding a way to avoid having to reboot isotesters between test suite runs, which would be desirable for a number of reasons). I think it would provide exactly what we want here: a way to reliably monitor and manage a process's life cycle (we can give it a unique identifier with --unit; --collect might be useful too). It's probably not worth the effort in this specific case, where we already have something simple that works well, but now that I've re-discovered this tool, I'm pretty sure I'll find cases where it's exactly the one we need :)
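
A sketch of what that could look like, assuming the same $vm.execute helper; the unit name is made up:

  # Hypothetical: run the filler as a transient unit with a known name,
  # so its life cycle can be queried and controlled reliably.
  unit = "fillram-test"
  $vm.execute("systemd-run --unit=#{unit} --collect fillram")
  # Monitor and stop it by name:
  $vm.execute("systemctl is-active #{unit}").success?
  $vm.execute("systemctl stop #{unit}")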

if you think it's worth spending time on any of the above, let me know and I'll file a ticket.

Let's not bother IMO. Good enough is good enough!

#21 Updated by intrigeri 26 days ago

  • Status changed from Fix committed to Resolved
