Bug #10288: Fix newly identified issues to make our test suite more robust and faster
Sometimes fails to boot from USB on Jenkins with I/O errors
While working on #10720 I noticed a few I/O errors that blocked the boot. Let's start compiling them here and we'll see what can be done about this. So far I've seen such issues only when booting from USB. I'm curious if the same root cause can trigger more subtle issues, i.e. not blocking the boot but causing false positives later on (I'm thinking e.g. of all the scenarios in which the system under test seems frozen in Tails Greeter after clicking "log in").
- The second (and last) "I start Tails from USB drive "isohybrid" with network unplugged and I login" step in "Cat:ing a Tails isohybrid to a USB drive and booting it, then trying to upgrading it but ending up having to do a fresh installation, which boots" fails: the test suite options are added to the kernel command line, and then, while the syslinux menu is still displayed and there's no trace of Linux booting:
  - 7min30 later: "CHS: Error 0c00 reading sector 2247939 (140/14/6)" and "EDD: Error 0c00 reading sector 2249987"
  - another minute later: "CHS: Error 0c00 reading sector 2251621 (140/72/34)" and "EDD: Error 0c00 reading sector 2253669"
  - the test suite times out before anything else happens
- (seen 2 times) "I start Tails from USB drive "old" with network unplugged and I login" fails with very similar CHS/EDD errors as above, but at some point Linux starts spitting output and there's a kernel panic ("Failed to execute /init")
- I've seen at least two Tails cat:ed from ISO fail to boot with SquashFS errors.
- "I start Tails from USB drive "__internal" with network unplugged and I login with persistence enabled" in "Watching MP4 videos stored on the persistent volume should work as expected given our AppArmor confinement" fails with similar CHS/EDD errors as above; at some point Linux starts spitting output and there's a kernel panic ("Failed to execute /init")
- "I start Tails from USB drive "__internal" with network unplugged and I login with read-only persistence enabled" in "I start Tails from USB drive "__internal" with network unplugged and I login with read-only persistence enabled" fails with similar CHS/EDD errors as above
- "I start Tails from USB drive "old" with network unplugged and I login" in "Creating a persistent partition with the old Tails USB installation": kernel panic
- "I start Tails from USB drive "old" with network unplugged and I login with persistence enabled" in "Writing files to a read/write-enabled persistent partition with the old Tails USB installation": CHS/EDD errors
- "I start Tails from USB drive "to_upgrade" with network unplugged and I login with persistence enabled" in "Booting a USB drive upgraded from ISO with persistence enabled" is stuck at "syslinux 6.03 EDD" and never displays the bootloader menu (see 02_39_57_Booting_a_USB_drive_upgraded_from_ISO_with_persistence_enabled.mkv attached)
I've never seen that outside of Jenkins, so I suspect a problem with the platform.
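To help compile these errors, here is a minimal Python sketch that extracts the CHS/EDD lines from captured console text and converts each CHS triple to a linear sector number. The geometry (255 heads, 63 sectors per track) is an assumption, not something recorded in this ticket, and the names `parse_errors` and `chs_to_lba` are made up for illustration.

```python
import re

# Assumed translated BIOS geometry; not stated anywhere in the ticket.
HEADS = 255
SECTORS_PER_TRACK = 63

# Matches the two error formats quoted above; the (C/H/S) part is optional
# because only the CHS lines carry it.
ERROR_RE = re.compile(
    r"(?P<mode>CHS|EDD): Error (?P<code>[0-9a-f]{4}) reading sector (?P<sector>\d+)"
    r"(?: \((?P<c>\d+)/(?P<h>\d+)/(?P<s>\d+)\))?"
)


def chs_to_lba(c, h, s, heads=HEADS, spt=SECTORS_PER_TRACK):
    """Standard CHS-to-LBA conversion; the sector component is 1-based."""
    return (c * heads + h) * spt + (s - 1)


def parse_errors(console_text):
    """Return one dict per CHS/EDD error line found in the console text."""
    errors = []
    for m in ERROR_RE.finditer(console_text):
        entry = {"mode": m["mode"], "code": m["code"], "sector": int(m["sector"])}
        if m["c"] is not None:
            entry["chs_as_lba"] = chs_to_lba(int(m["c"]), int(m["h"]), int(m["s"]))
        errors.append(entry)
    return errors


# The four lines reported in the first failure above.
SAMPLE = """\
CHS: Error 0c00 reading sector 2247939 (140/14/6)
EDD: Error 0c00 reading sector 2249987
CHS: Error 0c00 reading sector 2251621 (140/72/34)
EDD: Error 0c00 reading sector 2253669
"""

if __name__ == "__main__":
    for error in parse_errors(SAMPLE):
        print(error)
```

Under that assumed geometry, both CHS triples translate to exactly the sector numbers shown on the corresponding EDD lines, and the sector number printed on each CHS line is 2048 lower. That could simply reflect a 2048-sector partition offset, but this is a guess; it would at least suggest that both messages refer to the same on-disk location.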
Random debugging ideas:
- upgrade isotesters' kernel to Linux 4.6: done between 2016-07-23 10:31 UTC and 11:00 UTC
- upgrade isotesters' QEMU to 2.5 from jessie-backports: done on 2016-07-27 around 08:00 UTC
- check if the isotesters' Journal has anything interesting around the time of the failure: nothing special in there
- check if isotesters I/O load is as we expect it to be while running the test suite (including USB scenarios), i.e. most of our temporary data should stay in memory cache, and should never be flushed out to disk; the most recent work we've done in this area can serve as reference: #11175: I/O load is as expected (most action happens on tmpfs so isotesters don't do much disk I/O)
- check if there's anything interesting on Munin around the time of the failures: WIP; nothing I could notice; only a potential correlation with check-mirrors runs might be worth looking closer into
- give the system under testing a USB3 (nec-xhci) controller: WIP (499c630, https://jenkins.tails.boum.org/view/Tails_ISO/job/test_Tails_ISO_test-11588-usb-on-jenkins/)
- crashes during memory erasure on shutdown, but with #10733 merged on top it seems to be fine: https://jenkins.tails.boum.org/view/Tails_ISO/job/test_Tails_ISO_test-11588-usb-on-jenkins-10733/
- not seen any I/O error on these branches yet, there's hope!
- upgrade the host system's QEMU to 2.5 from jessie-backports
- check virtual USB disk settings, e.g. the "cache" attribute
- check how we're managing snapshots vs. disks in the scenarios that sometimes fail
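For the "check virtual USB disk settings" idea, a small Python sketch of how one could list the cache mode of USB-attached disks and the USB controller models in a libvirt domain definition. The XML below is a hypothetical sample, not the test suite's actual domain, and the helper names are invented; in practice the input would be something like the output of `virsh dumpxml` for the system under testing.

```python
import xml.etree.ElementTree as ET

# Hypothetical domain definition, for illustration only.
SAMPLE_DOMAIN_XML = """
<domain type='kvm'>
  <devices>
    <controller type='usb' index='0' model='ich9-ehci1'/>
    <disk type='file' device='disk'>
      <driver name='qemu' type='qcow2' cache='writeback'/>
      <source file='/tmp/example/old.qcow2'/>
      <target dev='sda' bus='usb'/>
    </disk>
    <disk type='file' device='cdrom'>
      <driver name='qemu' type='raw'/>
      <target dev='hdc' bus='ide'/>
    </disk>
  </devices>
</domain>
"""


def usb_disk_settings(domain_xml):
    """Return the target device and cache mode of every disk on the USB bus."""
    root = ET.fromstring(domain_xml)
    report = []
    for disk in root.findall("./devices/disk"):
        target = disk.find("target")
        if target is None or target.get("bus") != "usb":
            continue
        driver = disk.find("driver")
        cache = driver.get("cache") if driver is not None else None
        # No explicit cache attribute means the hypervisor default applies.
        report.append({"dev": target.get("dev"),
                       "cache": cache or "(hypervisor default)"})
    return report


def usb_controller_models(domain_xml):
    """Return the model attribute of every USB controller in the domain."""
    root = ET.fromstring(domain_xml)
    return [c.get("model")
            for c in root.findall("./devices/controller[@type='usb']")]


if __name__ == "__main__":
    print(usb_disk_settings(SAMPLE_DOMAIN_XML))
    print(usb_controller_models(SAMPLE_DOMAIN_XML))
```

Comparing this output between a failing Jenkins run and a local run that works might show whether the two environments really use the same disk and controller settings.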
Let's keep in mind that we have other options, such as finally giving up on nested KVM for running our test suite on Jenkins, and instead getting a dedicated machine. Infrastructure-wise, IMO we are now ready to handle more machines (we have the VPN & Puppet setup in place for that). The additional engineering effort (support running multiple instances of our test suite concurrently on the same system) is certainly non-trivial, but it may still be cheaper than fixing this very ticket and all other bugs we only see on Jenkins. So let's not spend too much time on this here.
Test suite: use more recent virtual hardware, i.e. USB 3.0 (nec-xhci) on a pc-i440fx-2.5 machine.
I have some vague hope that switching USB controllers might help with problems
we see on Jenkins when booting from USB (refs: #11588). This change requires
upgrading the machine type as well, QEMU otherwise won't boot from this USB
3.0 controller. And while we're at it, let's migrate from IDE to SATA,
in order to better reflect the hardware Tails is being used with.
Also, there are chances that more recent virtual hardware sees more testing
these days, so it sounds potentially useful to "upgrade".
Note that I've initially tried the more modern pc-q35-2.5 machine type,
which worked fine when running the test suite on a sid host, but when
running on Jessie the VM under testing crashed when logging into GNOME.
I'll file a ticket about trying this again once Stretch is out.
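As a rough illustration of what this commit message describes, the following Python sketch rewrites a libvirt domain definition to use the pc-i440fx-2.5 machine type, a nec-xhci USB controller, and SATA instead of IDE. It is not the actual change from the branch: the input XML is a hypothetical, simplified domain with a single USB controller, and `modernize` is an invented helper name.

```python
import xml.etree.ElementTree as ET

# Hypothetical "before" domain, for illustration only.
OLD_DOMAIN_XML = """<domain type='kvm'>
  <os><type arch='x86_64' machine='pc-0.15'>hvm</type></os>
  <devices>
    <controller type='usb' index='0' model='ich9-ehci1'/>
    <disk type='file' device='cdrom'>
      <target dev='hdc' bus='ide'/>
    </disk>
  </devices>
</domain>"""


def modernize(domain_xml, machine="pc-i440fx-2.5", usb_model="nec-xhci"):
    """Return the domain XML with newer virtual hardware selected."""
    root = ET.fromstring(domain_xml)

    # Machine type lives on <os><type machine='...'>.
    os_type = root.find("./os/type")
    if os_type is not None:
        os_type.set("machine", machine)

    # Simplification: assumes a single USB controller with no companions.
    for ctrl in root.findall("./devices/controller[@type='usb']"):
        ctrl.set("model", usb_model)

    # Move IDE disks to SATA; the dev name is only an ordering hint,
    # but sd* is the conventional prefix for SATA targets.
    for target in root.findall("./devices/disk/target[@bus='ide']"):
        target.set("bus", "sata")
        dev = target.get("dev", "")
        if dev.startswith("hd"):
            target.set("dev", "sd" + dev[2:])

    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(modernize(OLD_DOMAIN_XML))
```

Passing `machine="pc-q35-2.5"` instead would correspond to the variant that is reported above to crash on a Jessie host.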
#8 Updated by intrigeri over 3 years ago
FWIW it seems that failures occur much more frequently when the system is under heavy load (e.g. running multiple instances of the test suite at the same time): since I've stopped working on such things and triggering builds+tests on branches that have the USB tests enabled, https://jenkins.tails.boum.org/job/test_Tails_ISO_bugfix-10720-installer-freezes-on-jenkins/ is quite robust again.
#14 Updated by intrigeri over 3 years ago
- File 02_39_57_Booting_a_USB_drive_upgraded_from_ISO_with_persistence_enabled.mkv added
- Description updated (diff)
#27 Updated by intrigeri over 3 years ago
- Status changed from Confirmed to In Progress
- Assignee set to intrigeri
- Target version set to Tails_2.6
- % Done changed from 0 to 10
- Feature Branch changed from test/11588-usb-on-jenkins to test/11588-usb-on-jenkins+10733
Status update: looks like I've got something robust enough, see https://jenkins.tails.boum.org/view/Tails_ISO/job/test_Tails_ISO_test-11588-usb-on-jenkins-10733/ and https://jenkins.tails.boum.org/view/Tails_ISO/job/test_Tails_ISO_test-10971-more-cpus-for-tailstoaster-11588-10733/. I'll let it run a couple more weeks on Jenkins and we'll see. I hope we'll be able to merge this (and re-enable most USB tests) during the 2.6 cycle, fingers crossed.
#32 Updated by intrigeri over 3 years ago
I'd like to ease reviewing for the 2.6 RM, and to get automated tests running on the combination of all these changes ASAP in the 2.6 dev cycle. So, I've merged this work, along with the other major branches I'm proposing for 2.6, into the feature/from-intrigeri-for-2.6 integration branch (Jenkins builds and tests).