
Bug #9707

Jessie: System sometimes does not poweroff after memory erasure

Added by intrigeri about 4 years ago. Updated almost 3 years ago.

Status:
Rejected
Priority:
High
Assignee:
-
Category:
Test suite
Target version:
Start date:
07/08/2015
Due date:
% Done:

0%

Feature Branch:
test/9707-power-off-after-memory-erasure
Type of work:
Research
Blueprint:
Starter:
Affected tool:

Description

Reproduced in VMs and via the "I shutdown Tails and wait for the computer to power off" test suite step: I see a bunch of stars (sdmem progress), then it stops and nothing happens anymore.


Related issues

Related to Tails - Bug #10487: Improve VM and OOM settings for erasing memory Resolved 11/05/2015
Related to Tails - Bug #11786: System often crashes during/after memory wipe since Linux 4.6 Duplicate 09/08/2016
Blocked by Tails - Bug #10733: Run our initramfs memory erasure hook earlier Resolved 12/09/2015

Associated revisions

Revision ca9cf70a (diff)
Added by intrigeri almost 4 years ago

Don't forcefully overcommit memory when erasing memory on shutdown.

Overcommitting memory seems too likely to trigger kernel freezes.

Refs: #9707, #10487.

Revision bfa09055 (diff)
Added by intrigeri almost 4 years ago

Don't kill the allocating task when erasing memory on shutdown.

The kernel's default (0) that we enforce here is to kill the hoggers, which will
be our sdmem instances.

Killing the allocating task may have been sound in the past, but it feels too
risky these days, given early boot (initramfs) sees increasing concurrency.

Refs: #9707, #10487.

Revision 552bab73 (diff)
Added by intrigeri almost 4 years ago

Make sure the kernel doesn't starve from memory during memory erasure.

We want to set it to the lowest possible value, to maximize the coverage
of memory erasure. But Documentation/sysctl/vm.txt for
vm.min_free_kbytes says: "if you set this to lower than 1024KB, your
system will become subtly broken, and prone to deadlock under high
loads". With 2048KB I've seen freezes, so let's not modify the kernel
default (8192KB on x86_64 currently).

Refs: #9707, #10487.
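Taken together, the three commits above amount to leaving the kernel's VM knobs at their defaults during memory erasure. As a sysctl.d-style summary (the values are the defaults cited in the commit messages for x86_64 at the time, not settings Tails ships):

```ini
# Defaults deliberately kept by the commits above (x86_64, at the time):
vm.overcommit_memory = 0          # heuristic overcommit; forcing overcommit risked kernel freezes
vm.oom_kill_allocating_task = 0   # OOM-kill the biggest hogger, i.e. an sdmem instance
vm.min_free_kbytes = 8192         # lowering this (e.g. to 2048) caused freezes
```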

Revision a3c19d42 (diff)
Added by Tails developers almost 4 years ago

Make parallel sdmem faster and more robust.

Initial description:

We avoid running too many parallel instances (which adds overhead) by
only running one instance per 2048 MiB of RAM (which is below the x86
per-process limitation). This also seems to play better with
VirtualBox, which often hangs when running too many sdmem in parallel.
Furthermore we kill all sdmem instances as soon as one of them has
been oom killed; after all, that implies that the memory has been
completely filled, so we're done. If we don't kill the others they may
continue running unnecessarily, filling the freed memory of the killed
instance, and so on.
Another consequence of this is cleaner output so that the debug
message (that memory dumping can ensue) will be easier to spot when
running with boot option debug=wipemem. That will be a helpful cue
for the automated test suite.

Three years later, when revisiting this idea, we realized that the
reasoning above is not 100%-correct on 32-bit: in theory there could
always be one process that runs ahead of all the others, reaches the
per-process limit, and is killed. Then another one does, and so on, and in
the end all sdmem processes have been killed due to the per-process
limit, not because the memory was filled. However:

  • In practice, this will-fix: #9707, while still erasing almost all
    memory according to our test suite.
  • 32-bit systems that have more than 2GB of RAM are pretty rare anyway.
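The per-2048-MiB sizing rule described above can be sketched as a small shell helper (the function name is made up for illustration; the real Tails initramfs hook is structured differently):

```shell
#!/bin/sh
# Sketch of the sizing rule above: run one sdmem instance per
# 2048 MiB of RAM, but always at least one instance.
sdmem_instances() {
    ram_mib="$1"
    n=$((ram_mib / 2048))
    if [ "$n" -lt 1 ]; then
        n=1
    fi
    echo "$n"
}

sdmem_instances 1024   # 1 GiB of RAM: still one instance
sdmem_instances 8192   # 8 GiB of RAM: four instances
```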

Revision cd53d12c
Added by anonym over 3 years ago

Merge remote-tracking branch 'origin/bugfix/10487-improve-VM-and-OOM-settings-for-erasing-memory' into devel

Fix-committed: #9707, #10487

Revision 388355bd (diff)
Added by intrigeri over 3 years ago

Test suite: reserve 64 MiB for the kernel before the memory is wiped.

We hope that this might help fix refs: #9707.

Revision 5b7351df (diff)
Added by intrigeri over 3 years ago

Test suite: reserve some more memory for the core privileged processes before wiping memory.

This is a follow-up on 388355b, that missed vm.admin_reserve_kbytes for
some reason.

refs: #9707

Revision fa990829
Added by anonym almost 3 years ago

Merge remote-tracking branch 'origin/test/9707-power-off-after-memory-erasure' into stable

Note that this merge does not close #9707.

Refs: #9707

History

#1 Updated by romeopapa about 4 years ago

I have failed to reproduce this in VirtualBox 4.3.18_Debian.

#2 Updated by intrigeri about 4 years ago

I have failed to reproduce this in VirtualBox 4.3.18_Debian.

With an ISO built from which branch/commit, exactly?

#3 Updated by romeopapa about 4 years ago

intrigeri wrote:

With an ISO built from which branch/commit, exactly?

Sorry, hadn't seen your reply.

7 days ago, when I first replied to this ticket, I had tested on a VM with my very own jessie build (must have been f9530ea316716d117e7660045b694428be9700c9 of 2015-08-09 22:33:10).

I have tested again last night with nightly jessie 20150816T0408Z-e01d609 (yesterday's): on a VBox VM, and on 2 laptops (an asus without EFI, a samsung ultrabook with EFI).

intrigeri, do you still see occurrences of this bug? Does it happen systematically? I haven't seen it once since I've been looking for it. I might have said on IRC that I had already seen this bug in the wild, but now I'm wondering if I perhaps dreamed it...

#4 Updated by intrigeri about 4 years ago

  • Status changed from Confirmed to In Progress
  • % Done changed from 0 to 20

I've not seen it recently. It might have been a bug in my QEMU/libvirt/seabios/whatever. If nobody reproduces it within 1-2 months, I'll close this ticket. Thanks!

#5 Updated by intrigeri almost 4 years ago

  • Status changed from In Progress to Resolved
  • Assignee deleted (intrigeri)
  • % Done changed from 20 to 100

Not seen it recently, closing.

#6 Updated by intrigeri almost 4 years ago

  • Category set to Test suite
  • Status changed from Resolved to Confirmed
  • Assignee set to intrigeri
  • % Done changed from 100 to 0

Sadly I have to reopen this ticket, as I've seen it regularly when running the test suite against feature/jessie ISO images. When it happens I see "VM is still running after 600 seconds". QEMU is eating a CPU core. The system under test's screen is full of stars, but there is no progress.

IMO what's really blocking 2.0 is the part that blocks the test suite update (#7563): we can live with this bug, so a test-suite-specific workaround would be good enough, hence the category I'm setting. Now, it might be that a real fix is easier to find & implement than an ugly workaround :)

#7 Updated by intrigeri almost 4 years ago

  • Blocks Feature #7563: Update the automated test suite for Jessie ISO images added

#8 Updated by intrigeri almost 4 years ago

Note that the test suite goes on (at least when it was rebooting after creating a persistent volume) if I kill the QEMU process at that point.

#9 Updated by intrigeri almost 4 years ago

  • Related to Bug #10487: Improve VM and OOM settings for erasing memory added

#10 Updated by intrigeri almost 4 years ago

The (_) cursor that's after the last star char is still blinking... but very slowly (cycle period is ~ 7 seconds). Ideas to better understand what's going on:

  • check if the never submitted stuff from feature/faster_sdmem_on_shutdown might help
  • raise the timeout (currently = 10 minutes): the display stays just the same (including the blinking cursor), just longer
  • ask libvirt/QEMU to send some keybinding to ask via sysrq what's going on to the guest kernel
  • try the QEMU console
  • dump the memory to see how much was erased
  • remove "the running Tails is installed on USB drive" from the usb-install-tails-greeter checkpoint: saves ~2 minutes on all scenarios that install Tails to USB, which is totally welcome when running them repeatedly without keeping snapshots; but doesn't help with the problem at hand
  • check that our hook runs at the right time (init-premount) compared to e.g. udev
  • bypass memory erasure once we've kexec'ed, unless specifically testing memory erasure; and add a failing test case reproducing this bug; at least this removes this one blocker for running all other USB tests
  • 2 vcpus for the guest doesn't help
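The sysrq idea above could look something like this (a sketch: the domain name "TailsToaster" is an assumption based on the test suite's usual naming, sysrq must be enabled in the guest, and DRY_RUN=1 only prints the commands instead of sending the keys):

```shell
#!/bin/sh
# Sketch: have libvirt inject Alt+SysRq+<key> into the guest, so its
# kernel reports state on the console / serial log.
DOMAIN="${DOMAIN:-TailsToaster}"   # assumed libvirt domain name
DRY_RUN="${DRY_RUN:-1}"            # set to 0 to actually send the keys

sysrq() {
    set -- virsh send-key "$DOMAIN" KEY_LEFTALT KEY_SYSRQ "KEY_$1"
    if [ "$DRY_RUN" = 1 ]; then echo "$@"; else "$@"; fi
}

sysrq T   # sysrq-t: dump all tasks and their state
sysrq M   # sysrq-m: dump memory usage
```

The guest kernel's answer appears on its own console, not in the caller's output.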

#11 Updated by intrigeri almost 4 years ago

FWIW, the machine where I can reproduce this is running Jessie, and has TailsToasterStorage on ext4 / dm-crypt / SSD. I can't reproduce this on another machine that's running sid, with storage in a tmpfs. (scratch that: I've eventually seen one such failure on that other — and faster — machine when the system was under heavy load and swapping)

#12 Updated by anonym almost 4 years ago

I have seen it when running the test suite on a baremetal i7-3520M system, also using "ext4 / dm-crypt / SSD" and Debian Sid.

#13 Updated by intrigeri almost 4 years ago

  • Related to Bug #10733: Run our initramfs memory erasure hook earlier added

#14 Updated by intrigeri almost 4 years ago

  • Status changed from Confirmed to In Progress
  • % Done changed from 0 to 20

Status: fine-tuning stuff for #10487, which fixes this problem.

#15 Updated by intrigeri almost 4 years ago

  • Assignee changed from intrigeri to anonym
  • % Done changed from 20 to 50
  • QA Check set to Ready for QA
  • Feature Branch set to bugfix/10487-improve-VM-and-OOM-settings-for-erasing-memory

Please review, test and merge into feature/jessie.

#16 Updated by intrigeri almost 4 years ago

  • Priority changed from Elevated to High

#17 Updated by anonym over 3 years ago

  • Status changed from In Progress to Fix committed
  • % Done changed from 50 to 100

#18 Updated by intrigeri over 3 years ago

  • Assignee deleted (anonym)
  • QA Check changed from Ready for QA to Pass

#19 Updated by anonym over 3 years ago

  • Status changed from Fix committed to Confirmed
  • Assignee set to intrigeri
  • % Done changed from 100 to 0
  • Feature Branch deleted (bugfix/10487-improve-VM-and-OOM-settings-for-erasing-memory)

I've seen this again now, while running the automated test suite. I suppose the previous branch (bugfix/10487-improve-VM-and-OOM-settings-for-erasing-memory) at least improved the situation, or it was just coincidental that I saw no occurrences of it for quite a while during my tests of the branch.

To be very specific, what I see now is a screen filled with sdmem's asterisks, the cursor is blinking (although veeeeery slowly), and the CPU is used at 100%. This reminds me slightly of running fillram in erase_memory.feature without some of the kernel tweaks -- when the memory is essentially 100% filled up, the system becomes unusable, with the VM's CPU at 100%. Possibly the memory has become so scarce that the kernel cannot function properly? Perhaps vm.admin_reserve_kbytes and/or vm.min_free_kbytes needs to be bumped above their defaults like we currently do in erase_memory.feature?

#20 Updated by intrigeri over 3 years ago

  • Status changed from Confirmed to In Progress
  • % Done changed from 0 to 10
  • QA Check changed from Pass to Dev Needed

(Please... at least on my test hardware, the branch for #10487 made a huge difference :)

OK, back to the drawing board!

#21 Updated by intrigeri over 3 years ago

To be very specific, what I see now is a screen filled with sdmem's asterisks, the cursor is blinking (although veeeeery slowly), and the CPU is used at 100%.

This perfectly matches what I saw myself (#9707#note-6, #9707#note-10).

When you say "the CPU is used at 100%", I guess you mean that QEMU is using a full CPU core, right? (Given we have no means at that point to know what's happening inside the VM.)

Did you see that on the 32-bit or 64-bit kernel? (Not sure it's relevant, but these tests are slow so if I need to run them hundreds of times to get stats, better focus on the cases that are known to expose the bug.)

Meta: I still stand behind what I wrote a month ago, that is "IMO what's really blocking 2.0 is the part that blocks the test suite update (#7563): we can live with this bug, so a test-suite-specific workaround would be good enough, hence the category I'm setting. Now, it might be that a real fix is easier to find & implement than an ugly workaround :)"

Possibly the memory has become so scarce that the kernel cannot function properly? Perhaps vm.admin_reserve_kbytes and/or vm.min_free_kbytes needs to be bumped above their defaults like we currently do in erase_memory.feature?

I would guess the same. Now, I don't know what's best:

  • optimizing for erasing more memory, with the increased risk that occasionally the system ends up freezing and never powers off (this is the current situation on the devel branch);
  • optimizing for clean shutdown after memory erasure, with the increased risk of leaving important data unerased in memory.

A potentially relevant data point is that on most EFI systems I've seen, we currently don't shut down properly after erasing memory. So on current hardware, as long as that other bug isn't fixed, we may as well optimize for erasing memory more completely, no?

Thoughts, opinions?

Meanwhile, I'm going to focus on the test suite side of things:

  1. try and reproduce the bug, so I have baseline data to compare candidate changes with; I'm assuming I will finally manage to reproduce it at all, and I hope I'm right;
  2. have the test suite raise vm.admin_reserve_kbytes and vm.min_free_kbytes, in the kexec'ed system, the exact same way it does in the regular system, and measure if it makes any difference wrt. this robustness problem. I dunno if it's acceptable that the test suite tests something slightly different than what happens normally, though -- it feels risky. Perhaps it would be OK to tweak things this way in most cases, as long as we add one (fragile) test case for the real, unmodified memory erasure process.

#22 Updated by anonym over 3 years ago

intrigeri wrote:

When you say "the CPU is used at 100%", I guess you mean that QEMU is using a full CPU core, right? (Given we have no means at that point to know what's happening inside the VM.)

Yes!

Did you see that on the 32-bit or 64-bit kernel? (Not sure it's relevant, but these tests are slow so if I need to run them hundreds of times to get stats, better focus on the cases that are known to expose the bug.)

This was with the default (in the test suite) 64-bit kernel (during the "Cat:ing a Tails isohybrid ..." scenario, at the "I shutdown Tails and wait for the computer to power off" step).

Meta: I still stand behind what I wrote a month ago, that is "IMO what's really blocking 2.0 is the part that blocks the test suite update (#7563): we can live with this bug, so a test-suite-specific workaround would be good enough, hence the category I'm setting. Now, it might be that a real fix is easier to find & implement than an ugly workaround :)"

Sure. I hope frontdesk won't suffer too much from it.

Possibly the memory has become so scarce that the kernel cannot function properly? Perhaps vm.admin_reserve_kbytes and/or vm.min_free_kbytes needs to be bumped above their defaults like we currently do in erase_memory.feature?

I would guess the same. Now, I don't know what's best:

  • optimizing for erasing more memory, with the increased risk that occasionally the system ends up freezing and never powers off (this is the current situation on the devel branch);
  • optimizing for clean shutdown after memory erasure, with the increased risk of leaving important data unerased in memory.

I suppose we should optimize for maximal memory erasure.

A potentially relevant data point is that on most EFI systems I've seen, we currently don't shut down properly after erasing memory. So on current hardware, as long as that other bug isn't fixed, we may as well optimize for erasing memory more completely, no?

True.

Thoughts, opinions?

Meanwhile, I'm going to focus on the test suite side of things:

  1. try and reproduce the bug, so I have baseline data to compare candidate changes with; I'm assuming I will finally manage to reproduce it at all, and I hope I'm right;

This minimal scenario should be enough to trigger it:

@product
Feature: foo
  Scenario: bar
    Given I have started Tails from DVD without network and logged in
    When I shutdown Tails and wait for the computer to power off

But who knows, maybe some extra conditions are necessary? In my most recent occurrence, for instance, Tails had booted from an (isohybrid) USB drive. I doubt it matters, though.

  2. have the test suite raise vm.admin_reserve_kbytes and vm.min_free_kbytes, in the kexec'ed system, the exact same way it does in the regular system, and measure if it makes any difference wrt. this robustness problem. I dunno if it's acceptable that the test suite tests something slightly different than what happens normally, though -- it feels risky. Perhaps it would be OK to tweak things this way in most cases, as long as we add one (fragile) test case for the real, unmodified memory erasure process.

To do it the "exact" same way requires starting the remote shell after kexec:ing, of course. And possibly some way to prevent the wiping from starting until you have applied the necessary kernel tweaks via the remote shell. Also, I would suggest --pause-on-fail -- on the off chance that the remote shell remains responsive during the "freeze", it would be extremely useful to interactively debug the situation with vm-execute.

#23 Updated by intrigeri over 3 years ago

To do it the "exact" same way requires starting the remote shell after kexec:ing, of course.

I thought I would instead piggy-back on the debug=wipemem codepaths.

I don't want to bother putting the remote shell and all its deps into the initramfs.

#24 Updated by intrigeri over 3 years ago

I've run that:

  Scenario: Memory erasure on a modern computer
    Given a computer
    And the computer is a modern 64-bit system
    And the computer has 8 GiB of RAM
    And I set Tails to boot with options "debug=wipemem" 
    And I start Tails from DVD with network unplugged and I login
    Then the PAE kernel is running
    And at least 8 GiB of RAM was detected
    And process "memlockd" is running
    And process "udev-watchdog" is running
    And udev-watchdog is monitoring the correct device
    And I shutdown and wait for Tails to finish wiping the memory

... 500 times and didn't see this bug. I guess I'll retry with a scenario that involves booting from USB.
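Repeated runs like this can be scripted as below (a sketch: the `count_failures` helper and the `./run_test_suite` invocation in the usage comment are illustrative assumptions, not the exact commands used here):

```shell
#!/bin/sh
# Sketch: run a command N times and report how many runs failed,
# as in the 500-run experiment above.
count_failures() {
    runs="$1"; shift
    failures=0
    i=1
    while [ "$i" -le "$runs" ]; do
        "$@" >/dev/null 2>&1 || failures=$((failures + 1))
        i=$((i + 1))
    done
    echo "$failures"
}

# e.g.: count_failures 500 ./run_test_suite features/erase_memory.feature
count_failures 5 true   # a command that always succeeds: 0 failures
```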

#25 Updated by intrigeri over 3 years ago

And now I've run:

@product
Feature: System memory erasure on shutdown

  Scenario: Test 1
    Given I have started Tails without network from a USB drive without a persistent partition and stopped at Tails Greeter's login screen  
    And I log in to a new session
    When I shutdown Tails and wait for the computer to power off

... 100 times, and no problem either :(

#26 Updated by anonym over 3 years ago

I saw it again now, when creating the

I have started Tails without network from a USB drive without a persistent partition and stopped at Tails Greeter's login screen

checkpoint -- so, again, when running from USB.

I also noticed that if I take over the VM in virt-viewer, then the cursor starts blinking with a normal frequency again (it was slow in the vnc viewer window provided by --view). CPU is still stuck at 100% usage. I have no idea if this is relevant.

#27 Updated by anonym over 3 years ago

BTW, these 500 + 100 runs you did, was that with a special IAO? Or were you just trying to reproduce the bug?

#28 Updated by intrigeri over 3 years ago

BTW, these 500 + 100 runs you did, was that with a special IAO?

IAO?

Or were you just trying to reproduce the bug?

Yes.

#29 Updated by anonym over 3 years ago

intrigeri wrote:

BTW, these 500 + 100 runs you did, was that with a special IAO?

IAO?

ISO. Sorry! :)

Or were you just trying to reproduce the bug?

Yes.

Ok. If you ever prepare a branch with the proposed debugging stuff, let me know and I'll run it in parallel with you. Perhaps my setup makes it more likely to trigger, for some reason.

#30 Updated by intrigeri over 3 years ago

  • Related to Bug #10776: Step "I shutdown and wait for Tails to finish wiping the memory" fails when memory wiping causes a freeze added

#31 Updated by intrigeri over 3 years ago

  • Feature Branch set to test/9707-power-off-after-memory-erasure

#32 Updated by intrigeri over 3 years ago

  • Assignee changed from intrigeri to anonym
  • % Done changed from 10 to 20
  • QA Check changed from Dev Needed to Ready for QA

Would this help?

#33 Updated by anonym over 3 years ago

  • Assignee changed from anonym to intrigeri
  • QA Check changed from Ready for QA to Dev Needed

I started a run with the scenario from #9707#note-25 repeated two hundred times in a single feature, and already during the sixth run the bug appeared. Then I did a second run, and all 200 scenarios passed. Also, I haven't seen this issue since I last reported it in this thread (19 days ago).

I am not sure what to make of this. But the problem remains, however more, equally, or less frequent it may now be.

#34 Updated by intrigeri over 3 years ago

  • Target version changed from Tails_2.0 to Tails_2.2

#35 Updated by intrigeri over 3 years ago

  • Blocks deleted (Feature #7563: Update the automated test suite for Jessie ISO images)

#36 Updated by intrigeri over 3 years ago

  • Target version changed from Tails_2.2 to Tails_2.3

#37 Updated by intrigeri over 3 years ago

  • Target version changed from Tails_2.3 to Tails_2.4

#38 Updated by intrigeri over 3 years ago

  • Target version changed from Tails_2.4 to Tails_2.5

#39 Updated by intrigeri over 3 years ago

  • Assignee changed from intrigeri to anonym
  • QA Check changed from Dev Needed to Info Needed

intrigeri wrote:

Meanwhile, I'm going to focus on the test suite side of things:

  1. try and reproduce the bug, so I have baseline data to compare candidate changes with; I'm assuming I will finally manage to reproduce it at all, and I hope I'm right;

Last time I've tried, I could not reproduce the bug, but anonym could easily do it. So it feels safer to do it this way: anonym, can you reproduce with the updated branch?

  2. have the test suite raise vm.admin_reserve_kbytes and vm.min_free_kbytes, in the kexec'ed system

Done on the topic branch. Pushed, but the two affected scenarios are fragile so Jenkins won't run them (and besides, I don't recall seeing this issue on Jenkins anyway).

#40 Updated by intrigeri about 3 years ago

  • Target version changed from Tails_2.5 to Tails_2.7

#41 Updated by intrigeri about 3 years ago

I say let's do #10733 first.

#42 Updated by intrigeri about 3 years ago

Merged stable and pushed, so that we have Jenkins data once we resume working on it.

#43 Updated by intrigeri about 3 years ago

  • Related to deleted (Bug #10733: Run our initramfs memory erasure hook earlier)

#44 Updated by intrigeri about 3 years ago

  • Blocked by Bug #10733: Run our initramfs memory erasure hook earlier added

#45 Updated by anonym almost 3 years ago

So the problems I had with this branch were from before Tails 2.0 was released, 9 months ago. I haven't had these problems for a long time, so I guess we are good, and can just drop the new commit on this branch. Or perhaps it is somehow related to #11786 (memory wipe fails more often since Linux 4.6) -- just a new way for it to manifest? That one I can easily reproduce, so let me see if I can with an image built with this new commit.

#46 Updated by anonym almost 3 years ago

  • Assignee changed from anonym to intrigeri

I ran this scenario 100 times both for an image built from this branch, and for Tails 2.6:

  Scenario: bar
    Given I have started Tails from DVD without network and logged in
    When I execute "poweroff" as root
    Then the computer shuts down within 60 seconds

and I got 11 and 13 failures, respectively (all of them stopping at the BusyBox shell). So, IMHO, let's skip this branch since it doesn't seem to improve things, and then close this ticket in favor of #11786. Agreed?

#47 Updated by anonym almost 3 years ago

  • Related to Bug #11786: System often crashes during/after memory wipe since Linux 4.6 added

#48 Updated by intrigeri almost 3 years ago

So the problems I had with this branch were from before Tails 2.0 was released, 9 months ago. I haven't had these problems for a long time, so I guess we are good,

Great!

and can just drop the new commit on this branch.

Well, I'm unsure: what this branch does is basically synchronizing our memory erasure
settings with the ones we use in the test suite while filling memory. I think that
still makes sense, even though it does not immediately fix a visible problem. What do
you think?

#49 Updated by intrigeri almost 3 years ago

  • Assignee changed from intrigeri to anonym
  • Target version changed from Tails_2.7 to Tails_2.9.1
  • QA Check changed from Info Needed to Ready for QA

I ran this scenario 100 times both for an image built from this branch, and for Tails 2.6:
[...]
and I got 11 and 13 failures, respectively (all of them stopping at the BusyBox shell).
So, IMHO, let's skip this branch since it doesn't seem to improve things, and then
close this ticket in favor of #11786. Agreed?

Thanks for testing.

The fact that this branch does not fix what looks like a different problem (#11786,
which has different symptoms, i.e. the poweroff command fails after erasing memory,
as opposed to a blinking cursor while apparently still erasing memory) is not
enough to convince me that it's entirely irrelevant. As said in my previous comment,
I think this branch still makes sense.

So, I propose that we:

  • merge this branch for 2.8, on the grounds that the changes it brings make sense
  • close this ticket, on the grounds that we can't reproduce the bug anymore, but not because the branch would fix it (nothing proves that)

What do you think?

#50 Updated by anonym almost 3 years ago

  • Related to deleted (Bug #10776: Step "I shutdown and wait for Tails to finish wiping the memory" fails when memory wiping causes a freeze)

#51 Updated by anonym almost 3 years ago

  • Status changed from In Progress to Rejected
  • Assignee deleted (anonym)
  • % Done changed from 20 to 0
  • QA Check deleted (Ready for QA)

Ok, I agree, and have merged the branch.

So, to be clear: I'm rejecting this ticket on the grounds that we cannot reproduce the original problem, and #11786 (which still is a problem) is considered unrelated.
