Feature #10503: Run erase_memory.feature first to optimize test suite performance

Parent task: Bug #10288: Fix newly identified issues to make our test suite more robust and faster

Added by intrigeri about 4 years ago. Updated over 3 years ago.

Status: Resolved
Priority: Normal
Assignee: -
Category: Test suite
Target version: -
Start date: 11/06/2015
Due date: -
% Done: 100%
Feature Branch: test/10503-reorder-feature-execution
Type of work: Code
Blueprint: -
Starter: -
Affected tool: -

Description

According to #9401#note-4, the test suite's peak disk usage in /tmp/TailsToaster should be around 15GB these days. This matters because we mount an ext4 filesystem on that directory and count on the Linux kernel's caching to write as few of those GB to disk as possible. At the time of peak usage, the system under test is allocated 8GB of RAM, so to avoid any writes to disk we would need at the very least 15+8GB of RAM on each isotesterN. We "only" give them 20GB each, so inevitably a bunch of GBs have to be written to disk at some point while running erase_memory.feature (and possibly re-read from disk into cache soon after). This adds I/O load to lizard that we could happily do without (we run a bunch of other services there), and on less powerful systems reading/writing that many GBs can have a significant impact on the test suite's runtime.

According to #9401#note-4 again, peak disk usage could be lowered to 8GB just by running erase_memory.feature first, which would magically fix all the aforementioned problems. It seems to be such a low-hanging fruit with such obvious advantages that, exceptionally, perhaps we can allow ourselves some premature optimization and just do it. It could be good enough to verify that this change indeed has the intended effect on peak disk usage and doesn't make things worse (the test suite's run time should be a good enough metric).
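For illustration only (a minimal sketch; the paths and the idea of building an explicit feature list are assumptions, not the actual Tails test suite code), the change boils down to putting erase_memory.feature at the front of the otherwise lexical feature order:

# Minimal sketch, not the actual test suite code: build the feature list
# with erase_memory.feature first, so its ~8GB memory dump hits
# /tmp/TailsToaster before any disk snapshots have been created.
run_first    = ['features/erase_memory.feature']
all_features = Dir.glob('features/*.feature').sort  # the current, lexical order
ordered      = run_first + (all_features - run_first)
puts ordered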

Moreover, for #9264 (deadline = end of 2015) we may want to run the test suite with absolutely everything in RAM, so decreasing max RAM usage for such use cases by 30% could impact the hardware specs quite a bit.


Related issues

Related to Tails - Feature #9264: Consider buying more server hardware to run our automated test suite Resolved 12/15/2015
Related to Tails - Bug #11582: Some upgrade test scenarios fail due to lack of disk space on Jenkins Resolved 07/21/2016

Associated revisions

Revision e6aefe92 (diff)
Added by anonym almost 4 years ago

Reorder the execution of features to decrease peak disk usage.

The current (lexical) order seems to result in the highest peak disk
usage possible; by the time we run the features that use the most
disk space, erase_memory.feature and (to a lesser degree)
untrusted_partitions.feature, we have already created all
non-temporary snapshots. At the moment, none of these use snapshots
themselves, so running them first seems like an easy optimization.

Will-fix: #10503

Revision 90226bc9 (diff)
Added by anonym almost 4 years ago

Also run features using temporary snapshots first.

At the moment, none of the features using temporary snapshots use
persistence, which creates the most space-hungry snapshots. By running
them first we further decrease the peak disk space usage.

Refs: #10503

Revision f35824c9 (diff)
Added by anonym almost 4 years ago

Also reorder usb_install.feature, which will be the disk usage peak.

Refs: #10503

Revision f3b9856d (diff)
Added by intrigeri almost 4 years ago

Fix typo.

Refs: #10503

Revision c701b57d
Added by anonym almost 4 years ago

Merge remote-tracking branch 'origin/test/10503-reorder-feature-execution' into devel

Fix-committed: #10503

History

#2 Updated by anonym about 4 years ago

  • Target version set to Tails_1.8

#3 Updated by intrigeri almost 4 years ago

  • Related to Feature #9264: Consider buying more server hardware to run our automated test suite added

#4 Updated by anonym almost 4 years ago

  • Target version changed from Tails_1.8 to Tails_2.0

#5 Updated by intrigeri almost 4 years ago

  • Target version changed from Tails_2.0 to Tails_2.2

#6 Updated by intrigeri almost 4 years ago

FYI we're in the process of updating/refining the hardware config used to run the test suite via Jenkins, so it would be helpful to reach a conclusion here.

#7 Updated by anonym almost 4 years ago

  • Status changed from Confirmed to In Progress

#8 Updated by anonym almost 4 years ago

  • Assignee changed from anonym to intrigeri
  • % Done changed from 0 to 50
  • QA Check set to Ready for QA
  • Feature Branch set to test/10503-reorder-feature-execution

I believe the current situation yields the highest possible peak disk usage, and that the feature branch brings it down to the smallest possible. :)

#9 Updated by anonym almost 4 years ago

I ran a custom feature that runs all the non-temporary snapshot restoring steps, and it came in at around 7.5 GiB. So, when only considering the obvious offenders of disk space usage (excluding logs, videos, screenshots and such optional artifacts) the peak disk usage is:

  max($biggest_ram_dump, $biggest_non_snapshot_disk, $internal_snapshot_disk)
= max(8 GiB, $tails_iso_size, 7.5 GiB)
= 8 GiB.

(Note: for $biggest_non_snapshot_disk = $tails_iso_size, see the "Tails can boot from live systems stored on hard drives" scenario in untrusted_partitions.feature.)

Earlier I think it must have been

8 GiB + $internal_snapshot_disk = 8 + 7.5 GiB = 15.5 GiB,

so this is quite a nice improvement. :)
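Restating that arithmetic as a quick sketch (numbers taken from this note; the ISO-sized disk is left out since it is smaller than either of the other terms):

# Numbers from this note: 8 GiB RAM dump, ~7.5 GiB of non-temporary snapshots.
biggest_ram_dump       = 8.0   # GiB
internal_snapshot_disk = 7.5   # GiB

new_peak = [biggest_ram_dump, internal_snapshot_disk].max  # => 8.0 GiB
old_peak = biggest_ram_dump + internal_snapshot_disk       # => 15.5 GiB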

#10 Updated by anonym almost 4 years ago

Ah, actually, usb_install.feature will both need all the heavy snapshots while also creating some big disks, so the above reasoning is probably wrong.

#11 Updated by intrigeri almost 4 years ago

anonym wrote:

Ah, actually, usb_install.feature will both need all the heavy snapshots while also creating some big disks, so the above reasoning is probably wrong.

The reasoning in the ticket description, or the latest one you posted?

#12 Updated by anonym almost 4 years ago

anonym wrote:

Ah, actually, usb_install.feature will both need all the heavy snapshots while also creating some big disks, so the above reasoning is probably wrong.

Running usb_install.feature while all other snapshots exist brings us to 8.6 GiB. Pushed f35824c to deal with this.

#13 Updated by anonym almost 4 years ago

intrigeri wrote:

Ah, actually, usb_install.feature will both need all the heavy snapshots while also creating some big disks, so the above reasoning is probably wrong.

The reasoning in the ticket description, or the latest one you posted?

The latest one. So much has changed regarding snapshots that the old reasoning isn't completely valid any more, so I'm mostly disregarding it.

#14 Updated by intrigeri almost 4 years ago

  • Assignee changed from intrigeri to anonym
  • QA Check changed from Ready for QA to Info Needed

Code review passes, but I wonder why we need the "if not intersection.empty?" conditional: the 3 statements it guards would be no-ops if the condition is false, no?
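For illustration, here is a hedged sketch of the pattern under discussion; the variable names and the guarded statements are assumptions, not the actual code from test/10503-reorder-feature-execution. It shows why such a guard is functionally redundant: with an empty intersection, the reordering leaves the list unchanged.

# Illustrative only -- names and guarded statements are assumptions,
# not the code from the feature branch.
requested = ['features/a.feature', 'features/b.feature']
to_front  = ['features/erase_memory.feature']

intersection = requested & to_front          # => [] here
unless intersection.empty?
  # With an empty intersection, these two lines would leave
  # `requested` unchanged anyway; the guard mainly documents intent.
  requested -= intersection
  requested  = intersection + requested
end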

I'll now run some tests. Do you already have some measurements for the peak disk usage during a complete run? (Don't bother if you don't have this info handy.)

#15 Updated by intrigeri almost 4 years ago

I see peak disk usage is 12553240 here.

#16 Updated by intrigeri almost 4 years ago

intrigeri wrote:

I see peak disk usage is 12553240 here.

On #10396#note-10 we measured that it was "almost 12G" on lizard early in December. I expected that the reordering would lower this number. Maybe the @fragile tagging on lizard explains why we don't save much (because I'm comparing a full run today against a partial run on lizard 2 months ago).

Anyway: regardless of peak disk space usage, just running memory erasure first should decrease the maximum of (RAM allocated to the system under test + temporary disk space we would like to keep in the memory cache), which is what this ticket is actually about IIRC. We haven't measured that yet IIRC, but if I'm happy with this branch and merge it, I'll measure how much RAM an isotester needs these days to run the test suite without hitting disk. I'll need this for #11011 anyway.

#17 Updated by intrigeri almost 4 years ago

  • % Done changed from 50 to 90
  • QA Check changed from Info Needed to Pass

Tests pass. Feel free to merge with or without the minor style change I suggested above.

#18 Updated by intrigeri almost 4 years ago

  • Subject changed from "Consider running erase_memory.feature first to optimize test suite performance" to "Run erase_memory.feature first to optimize test suite performance"
  • Type of work changed from Discuss to Code

#19 Updated by anonym almost 4 years ago

  • Status changed from In Progress to Fix committed
  • % Done changed from 90 to 100

#20 Updated by anonym almost 4 years ago

  • Assignee changed from anonym to intrigeri
  • QA Check changed from Pass to Info Needed

intrigeri wrote:

Code review passes, but I wonder why we need the "if not intersection.empty?" conditional: the 3 statements it guards would be no-ops if the condition is false, no?

That is what I initially did, but I found it a bit clearer and less "clever" (in the bad sense) to do it that way. I think I'll just leave it as-is.

I'll now run some tests. Do you already have some measurements for the peak disk usage during a complete run? (Don't bother if you don't have this info handy.)

Yeah, I did a full run which I apparently forgot to post here, sorry. I ran this:

while true; do
  stat -c%s "${TMP}"/*.memstate "${TMP}"/TailsToasterStorage/* | \
    sum-column >> /tmp/sizes-monitor
  sleep 5
done

where sum-column is just a simple script that does what you'd expect.
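(A hypothetical stand-in for sum-column, assuming it simply totals the byte counts that stat prints one per line, could look like this:)

#!/usr/bin/env ruby
# Hypothetical stand-in for the "sum-column" helper mentioned above:
# read one byte count per line on stdin and print the total in MiB.
total_bytes = ARGF.sum { |line| line.to_f }
printf("%.1f MiB\n", total_bytes / (1024.0 ** 2))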

So I only monitored the storage and snapshots and I got a peak of 9104.2 MiB, but the other artifacts shouldn't add up to much (<20 MiB at any point, typically).

I see peak disk usage is 12553240 here.

I'm surprised you see over 3 GiB more than me. Did you run with --keep-snapshots or something? In that case the temporary snapshots are kept, and there are five of them, which I believe would add up to roughly that difference. Since we don't run the automated test suite with --keep-snapshots on Jenkins, your test runs shouldn't use it either, if that is what happened.

Anyway: regardless of peak disk space usage, just running memory erasure first should decrease the maximum of (RAM allocated to the system under test + temporary disk space we would like to keep in the memory cache), which is what this ticket is actually about IIRC. We haven't measured that yet IIRC, but if I'm happy with this branch and merge it, I'll measure how much RAM an isotester needs these days to run the test suite without hitting disk. I'll need this for #11011 anyway.

ACK, so I merged it into devel so you're not blocked any more. I'm leaving the ticket as Info needed, though, until you've answered my question above (I'd like to open another ticket to investigate this if you did not run with --keep-snapshots). Please close this ticket properly once you've answered!

#21 Updated by intrigeri almost 4 years ago

That is what I initially did, but I found it a bit clearer and less "clever" (in the bad sense) to do it that way. I think I'll just leave it as-is.

OK.

Yeah, I did a full run which I apparently forgot to post here, sorry. I ran this:
[...]
So I only monitored the storage and snapshots and I got a peak of 9104.2 MiB, but the other artifacts shouldn't add up to much (<20 MiB at any point, typically).

I see peak disk usage is 12553240 here.

I'm surprised you see over 3 GiB more than me.

So am I :)

Did you run with --keep-snapshots or something?

I didn't. But I had a test.feature lying around, not sure what was in there, nor if it matters.
I'll re-run in a clean environment (because again, I need these results for #11011).

Please close this ticket properly once you've answered!

It's fix committed and will stay this way until 2.2 is out, so I think I should not close it.
Anyway, I'll follow up on #11011 or #11113 about it.

#22 Updated by intrigeri almost 4 years ago

  • Assignee deleted (intrigeri)
  • QA Check changed from Info Needed to Pass

#23 Updated by anonym over 3 years ago

  • Status changed from Fix committed to Resolved

#24 Updated by intrigeri over 3 years ago

  • Related to Bug #11582: Some upgrade test scenarios fail due to lack of disk space on Jenkins added
