Feature #9264: Consider buying more server hardware to run our automated test suite
Try running more isotester:s on lizard
The Munin data we have suggests that we are substantially underusing lizard v2. So far, #10971 shows that neither giving more vcpus to each isotester, nor giving more vcpus to the (TailsToaster) system under test, improves our usage rate. ISO test jobs regularly wait for an available isotester, and we plan to run more such jobs, so let's try another way to fix this bottleneck: parallelize over more isotesters.
#1 Updated by intrigeri over 3 years ago
As said elsewhere:
I'd like to try this setup:
- keep 2 isotesters with our standard config (3 vcpus, 20GiB of RAM for disk cache)
- slightly downgrade 2 isotesters, to 3 vcpus and 10 GiB of RAM or so
- add 2 more isotesters, with 3 vcpus and 10GiB of RAM or so
I hope that something like 6-8 ISO testers would help us get the maximum we can out of our current hardware, if giving each of them more vcpus is not a winning strategy.
#3 Updated by intrigeri over 3 years ago
- Status changed from Confirmed to In Progress
- % Done changed from 0 to 10
We now have 2 more isotesters (isotester5 and isotester6), each with 12267520 KiB of RAM. To sum up:
- 2 isotesters (1, 4) with 20 GiB of RAM
- 4 isotesters (2, 3, 5, 6) with 12267520 KiB of RAM
And I've set up a job (https://jenkins.tails.boum.org/job/manual_test_Tails_ISO_testing/) that I can run manually (1) without waiting for an ISO to be built and (2) concurrently on multiple isotesters. With this, hopefully I can get more interesting data.
#5 Updated by intrigeri over 3 years ago
So, I've run 6 concurrent instances of the test suite on the testing branch. They all took around 100 minutes to complete. This is noticeably slower than the 60-90 minutes we usually see with 4 isotesters, but we're not optimizing purely for latency: the main goal here is to optimize for throughput under heavy load. I ran it three times:
- 100 minutes each => 3.6 runs/hour
- 90 minutes each => 4 runs/hour (the box was under heavy I/O load for other reasons)
- 72 minutes each => 5 runs/hour
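The runs/hour figures above follow from dividing the number of concurrent runs by the wall-clock time of a batch. A minimal sketch of that arithmetic, assuming all 6 runs in a batch start together and finish after the same duration (the function name `runs_per_hour` is made up for illustration):

```python
def runs_per_hour(concurrent_runs: int, minutes_per_run: float) -> float:
    """Throughput when `concurrent_runs` test suite runs each take
    `minutes_per_run` minutes and all run in parallel."""
    return concurrent_runs * 60 / minutes_per_run


# The three batches of 6 concurrent runs reported above.
for minutes in (100, 90, 72):
    print(f"{minutes} minutes each => {runs_per_hour(6, minutes):.1f} runs/hour")
```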
Also, note that our usual 60-90 minutes is with 20 GiB of RAM on each isotester, which reduces I/O load and presumably makes everything run faster. Right now, two thirds of our isotesters only have ~12 GiB of RAM, and during the experiment I saw serious iowait spikes. These are probably bigger than in less artificial conditions: the test suite runs started more or less at the same time, so the most I/O-intensive part of the test suite, namely memory erasure, ran at the same time on all of them. Ideally, each of these 6 isotesters would have at least 13 GiB (temporary directory) + 8 GiB (max RAM allocated to TailsToaster) + something like 2 GiB (OS, QEMU, Cucumber, etc.) = 23 GiB of RAM. Total = 6 * 23 = 138 GiB of RAM. We currently allocate a total of 89 GiB to our isotesters, so we would need 138 - 89 = 49 GiB more RAM. And if we want to go up to 8 isotesters, we need 8 * 23 - 89 = 95 GiB more.
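The RAM budget above can be re-derived for any number of isotesters. This is only a sketch of the arithmetic: the per-isotester components and the 89 GiB currently allocated are the figures from this comment, while the constant and function names are invented for illustration.

```python
# Per-isotester RAM components, in GiB (figures from the comment above).
TMPDIR_GIB = 13            # temporary directory
TAILSTOASTER_MAX_GIB = 8   # max RAM allocated to TailsToaster
OVERHEAD_GIB = 2           # rough estimate: OS, QEMU, Cucumber, etc.

# Total RAM currently allocated to all isotesters, as stated above.
CURRENTLY_ALLOCATED_GIB = 89


def ram_per_isotester_gib() -> int:
    """Ideal amount of RAM for a single isotester."""
    return TMPDIR_GIB + TAILSTOASTER_MAX_GIB + OVERHEAD_GIB


def extra_ram_needed_gib(isotesters: int) -> int:
    """Additional RAM needed to give every isotester its ideal amount."""
    return isotesters * ram_per_isotester_gib() - CURRENTLY_ALLOCATED_GIB


for n in (6, 8):
    total = n * ram_per_isotester_gib()
    print(f"{n} isotesters: {total} GiB total, {extra_ram_needed_gib(n)} GiB more")
```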
Then, I measured throughput under heavy load with fewer isotesters, i.e. 4 concurrent runs on 4 isotesters, with similar memory available to them (3 isotesters with 12 GiB of RAM, 1 isotester with 20 GiB of RAM):
- 3 * 71 + 1 * 62 minutes => 3.5 runs/hour
- 65 minutes each => 3.7 runs/hour
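When isotesters do not all finish in the same time, as in the first 4-isotester batch, the aggregate throughput is the sum of each isotester's own rate. A rough sketch, assuming each isotester would keep running back-to-back test suites at the duration measured here (the helper name `batch_throughput` is hypothetical):

```python
def batch_throughput(durations_min: list[float]) -> float:
    """Aggregate runs/hour for isotesters working in parallel, where each
    entry is the duration in minutes of one isotester's test suite run."""
    return sum(60 / d for d in durations_min)


# The two batches measured with 4 isotesters.
print(f"{batch_throughput([71, 71, 71, 62]):.1f} runs/hour")
print(f"{batch_throughput([65] * 4):.1f} runs/hour")

# For comparison, the fastest batch measured with 6 isotesters (72 minutes each).
print(f"{batch_throughput([72] * 6):.1f} runs/hour")
```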
#7 Updated by intrigeri over 3 years ago
- % Done changed from 10 to 50
The above results finished convincing me that we should indeed parallelize our ISO testing workload over more isotesters: even with memory-deprived isotesters, and thus crazy I/O load, we get better throughput with 6 isotesters than with 4. But with the memory we currently have, we should not be pushing the machine this way, since it impacts all the other workloads we run. So we need more RAM.
I'll close this research ticket, will sum up my plans on the blueprint, revert the changes (turn off the 2 additional isotesters, give the 4 remaining ones 20 GiB of RAM), and will file tickets for the next steps.
Reviews, opinions and comments welcome.
#8 Updated by intrigeri over 3 years ago
- Status changed from In Progress to Resolved
- Assignee deleted
- % Done changed from 50 to 100
- Blueprint set to https://tails.boum.org/blueprint/hardware_for_automated_tests_take2/