Bug #11588

Bug #10288: Fix newly identified issues to make our test suite more robust and faster

Sometimes fails to boot from USB on Jenkins with I/O errors

Added by intrigeri about 3 years ago. Updated about 3 years ago.

Status: Resolved
Priority: Normal
Assignee: -
Category: Test suite
Target version:
Start date: 07/22/2016
Due date:
% Done: 100%
Feature Branch: test/11588-usb-on-jenkins+10733
Type of work: Sysadmin
Blueprint:
Starter:
Affected tool:

Description

While working on #10720 I noticed a few I/O errors that blocked the boot. Let's start compiling them here and we'll see what can be done about this. So far I've seen such issues only when booting from USB. I'm curious if the same root cause can trigger more subtle issues, i.e. not blocking the boot but causing false positives later on (I'm thinking e.g. of all the scenarios in which the system under test seems frozen in Tails Greeter after clicking "log in").

  • The second (and last) "I start Tails from USB drive "isohybrid" with network unplugged and I login" step in "Cat:ing a Tails isohybrid to a USB drive and booting it, then trying to upgrading it but ending up having to do a fresh installation, which boots" fails: the test suite options are added to the kernel command line, and then, while the syslinux menu is still displayed and there's no trace of Linux booting:
    • 7min30 later: CHS: Error 0c00 reading sector 2247939 (140/14/6) and EDD: Error 0c00 reading sector 2249987
    • another minute later: CHS: Error 0c00 reading sector 2251621 (140/72/34) and EDD: Error 0c00 reading sector 2253669
    • the test suite times out before anything else happens
  • (seen 2 times) "I start Tails from USB drive "old" with network unplugged and I login" fails with CHS/EDD errors very similar to the ones above, but at some point Linux starts spitting output and there's a kernel panic ("Failed to execute /init")
  • I've seen at least two Tails cat:ed from ISO fail to boot with SquashFS errors.
  • "I start Tails from USB drive "__internal" with network unplugged and I login with persistence enabled" in "Watching MP4 videos stored on the persistent volume should work as expected given our AppArmor confinement" fails with similar CHS/EDD errors as above; at some point Linux starts spitting output and there's a kernel panic ("Failed to execute /init")
  • "I start Tails from USB drive "__internal" with network unplugged and I login with read-only persistence enabled" in "I start Tails from USB drive "__internal" with network unplugged and I login with read-only persistence enabled" fails with similar CHS/EDD errors as above
  • "I start Tails from USB drive "old" with network unplugged and I login" in "Creating a persistent partition with the old Tails USB installation": kernel panic
  • "I start Tails from USB drive "old" with network unplugged and I login with persistence enabled" in "Writing files to a read/write-enabled persistent partition with the old Tails USB installation": CHS/EDD errors
  • "I start Tails from USB drive "to_upgrade" with network unplugged and I login with persistence enabled" in "Booting a USB drive upgraded from ISO with persistence enabled" is stuck at "syslinux 6.03 EDD" and never displays the bootloader menu (see 02_39_57_Booting_a_USB_drive_upgraded_from_ISO_with_persistence_enabled.mkv attached)
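
A side note on reading the CHS/EDD pairs quoted above: the arithmetic below is a minimal sketch, assuming the common BIOS translation geometry of 255 heads and 63 sectors per track (the ticket does not state the geometry). Under that assumption, the CHS triple on each "CHS:" line converts to exactly the sector number on the matching "EDD:" line, and both pairs differ by 2048 sectors. That would be consistent with the two messages referring to the same on-disk location, one relative to a partition starting at sector 2048 and one absolute, but this reading is an inference, not something the ticket confirms.

```python
# Assumed BIOS geometry (not stated in the ticket).
HEADS_PER_CYLINDER = 255
SECTORS_PER_TRACK = 63


def chs_to_lba(cylinder, head, sector,
               heads=HEADS_PER_CYLINDER, spt=SECTORS_PER_TRACK):
    """Convert a cylinder/head/sector triple to a linear sector number.

    Sectors are 1-based in CHS addressing, hence the `sector - 1`.
    """
    return (cylinder * heads + head) * spt + (sector - 1)


# Values copied from the error messages in the description:
# (sector on the "CHS:" line, CHS triple, sector on the "EDD:" line)
ERRORS = [
    (2247939, (140, 14, 6), 2249987),
    (2251621, (140, 72, 34), 2253669),
]

for chs_sector, chs, edd_sector in ERRORS:
    lba = chs_to_lba(*chs)
    # Does the converted triple match the EDD sector, and how far apart
    # are the two reported sector numbers?
    print(chs, "->", lba, lba == edd_sector, edd_sector - chs_sector)
```

If the assumption holds, each pair is one failed read reported twice (syslinux falling back from one BIOS read method to the other), not two distinct bad sectors.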

I've never seen that outside of Jenkins, so I suspect a problem with the platform.

Random debugging ideas:

  • upgrade isotesters' kernel to Linux 4.6: done between 2016-07-23 10:31 UTC and 11:00 UTC
  • upgrade isotesters' QEMU to 2.5 from jessie-backports: done on 2016-07-27 around 08:00 UTC
  • check if the isotesters' Journal has anything interesting around the time of the failure: nothing special in there
  • check whether the isotesters' I/O load is as we expect while running the test suite (including USB scenarios), i.e. most of our temporary data should stay in the memory cache and never be flushed out to disk; the most recent work we've done in this area (#11175) can serve as reference: the I/O load is as expected (most action happens on tmpfs, so isotesters don't do much disk I/O)
  • check if there's anything interesting on Munin around the time of the failures: WIP; nothing I could notice; only a potential correlation with check-mirrors runs might be worth looking closer into
  • give the system under testing a USB3 (nec-xhci) controller: WIP (499c630, https://jenkins.tails.boum.org/view/Tails_ISO/job/test_Tails_ISO_test-11588-usb-on-jenkins/)
  • upgrade the host system's QEMU to 2.5 from jessie-backports
  • check virtual USB disk settings, e.g. the "cache" attribute
  • check how we're managing snapshots vs. disks in the scenarios that sometimes fail
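
For the "cache" attribute idea above, here is a minimal sketch of how one could list the cache mode of every USB-attached disk in a libvirt domain definition. The sample XML, the file path in it and the function name usb_disk_cache_modes are hypothetical and only illustrate the libvirt domain XML layout; they are not taken from the test suite's actual domain definition.

```python
import xml.etree.ElementTree as ET

# Hypothetical domain definition, for illustration only.
SAMPLE_DOMAIN_XML = """
<domain type='kvm'>
  <devices>
    <disk type='file' device='disk'>
      <driver name='qemu' type='qcow2' cache='writeback'/>
      <source file='/tmp/example/old.qcow2'/>
      <target dev='sda' bus='usb'/>
    </disk>
    <disk type='file' device='cdrom'>
      <driver name='qemu' type='raw'/>
      <target dev='sdb' bus='sata'/>
    </disk>
  </devices>
</domain>
"""


def usb_disk_cache_modes(domain_xml):
    """Map each USB disk's target device name to its cache mode.

    Disks without an explicit cache attribute are reported as
    "hypervisor default".
    """
    root = ET.fromstring(domain_xml)
    modes = {}
    for disk in root.findall("./devices/disk"):
        target = disk.find("target")
        if target is None or target.get("bus") != "usb":
            continue  # only USB-attached disks are of interest here
        driver = disk.find("driver")
        cache = driver.get("cache") if driver is not None else None
        modes[target.get("dev")] = cache or "hypervisor default"
    return modes


print(usb_disk_cache_modes(SAMPLE_DOMAIN_XML))
```

On a live host, the XML to feed in could come from `virsh dumpxml` for the domain under test.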

Let's keep in mind that we have other options, such as finally giving up on nested KVM for running our test suite on Jenkins, and instead getting a dedicated machine. Infrastructure-wise, IMO we are now ready to handle more machines (we have the VPN & Puppet setup in place for that). The additional engineering effort (support running multiple instances of our test suite concurrently on the same system) is certainly non-trivial, but it may still be cheaper than fixing this very ticket and all other bugs we only see on Jenkins. So let's not spend too much time on this here.


Related issues

Related to Tails - Bug #12142: The nec-xhci virtual USB controller + tails-persistence-setup causes a VM freeze on Debian Stretch or newer hosts Rejected 01/13/2017
Blocks Tails - Bug #11583: UEFI boot tests fail on Jenkins Resolved 07/21/2016
Blocked by Tails - Bug #11590: Improve Tails Installer robustness for 2.6 Resolved 07/22/2016
Blocked by Tails - Bug #10733: Run our initramfs memory erasure hook earlier Resolved 12/09/2015

Associated revisions

Revision 499c6309 (diff)
Added by intrigeri about 3 years ago

Test suite: use more recent virtual hardware, i.e. USB 3.0 (nec-xhci) on a pc-i440fx-2.5 machine.

I have some vague hope that switching USB controllers might help with problems
we see on Jenkins when booting from USB (refs: #11588). This change requires
upgrading the machine type as well, QEMU otherwise won't boot from this USB
3.0 controller. And while we're at it, let's migrate from IDE to SATA,
in order to reflect better hardware Tails is being used with.

Also, there are chances that more recent virtual hardware sees more testing
these days, so it sounds potentially useful to "upgrade".

Note that I've initially tried the more modern pc-q35-2.5 machine type,
which worked fine when running the test suite on a sid host, but when
running on Jessie the VM under testing crashed when logging into GNOME.
I'll file a ticket about trying this again once Stretch is out.
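
To illustrate the kind of virtual hardware change this commit message describes, here is a rough sketch that builds the corresponding libvirt domain XML fragments. It is not the actual diff of 499c6309: the function name, the x86_64 architecture and the sda device name are assumptions made for the example.

```python
import xml.etree.ElementTree as ET


def virtual_hardware_fragments():
    """Build illustrative libvirt XML fragments for the described setup."""
    # Machine type: pc-i440fx-2.5 (the architecture is an assumption).
    os_el = ET.Element("os")
    type_el = ET.SubElement(os_el, "type",
                            arch="x86_64", machine="pc-i440fx-2.5")
    type_el.text = "hvm"

    # USB 3.0 controller using the nec-xhci model.
    usb_el = ET.Element("controller", type="usb", model="nec-xhci")

    # A disk attached over SATA instead of IDE (device name is an assumption).
    disk_el = ET.Element("disk", type="file", device="disk")
    ET.SubElement(disk_el, "target", dev="sda", bus="sata")

    return [ET.tostring(el, encoding="unicode")
            for el in (os_el, usb_el, disk_el)]


for fragment in virtual_hardware_fragments():
    print(fragment)
```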

Revision bd961bd8
Added by anonym about 3 years ago

Merge remote-tracking branch 'origin/feature/from-intrigeri-for-2.6' into devel

Fix-committed: #5650, #6729, #6850, #8485, #10190, #10298, #10733, #10733, #11281, #11588, #11582, #11590

History

#1 Updated by intrigeri about 3 years ago

  • Description updated (diff)

#2 Updated by intrigeri about 3 years ago

  • Description updated (diff)

#3 Updated by intrigeri about 3 years ago

  • Description updated (diff)

#4 Updated by intrigeri about 3 years ago

  • Description updated (diff)

#5 Updated by intrigeri about 3 years ago

  • Description updated (diff)

#6 Updated by intrigeri about 3 years ago

  • Description updated (diff)

#7 Updated by intrigeri about 3 years ago

  • Description updated (diff)

#8 Updated by intrigeri about 3 years ago

FWIW it seems that failures occur much more frequently when the system is under heavy load (e.g. running multiple instances of the test suite at the same time): since I've stopped working on such things and triggering builds+tests on branches that have the USB tests enabled, https://jenkins.tails.boum.org/job/test_Tails_ISO_bugfix-10720-installer-freezes-on-jenkins/ is quite robust again.

#9 Updated by intrigeri about 3 years ago

  • Description updated (diff)

#10 Updated by intrigeri about 3 years ago

  • Description updated (diff)

#11 Updated by intrigeri about 3 years ago

  • Description updated (diff)

#12 Updated by intrigeri about 3 years ago

  • Description updated (diff)

#13 Updated by intrigeri about 3 years ago

  • Description updated (diff)

#15 Updated by intrigeri about 3 years ago

  • Description updated (diff)

#16 Updated by intrigeri about 3 years ago

  • Description updated (diff)

#17 Updated by intrigeri about 3 years ago

  • Description updated (diff)

#18 Updated by intrigeri about 3 years ago

  • Description updated (diff)

#19 Updated by intrigeri about 3 years ago

  • Description updated (diff)

#20 Updated by intrigeri about 3 years ago

  • Blocks Bug #11583: UEFI boot tests fail on Jenkins added

#21 Updated by intrigeri about 3 years ago

  • Blocked by Bug #10720: Tails Installer freezes when calling system_partition.call_set_name_sync in partition_device added

#22 Updated by intrigeri about 3 years ago

  • Feature Branch set to test/11588-usb-on-jenkins

#23 Updated by intrigeri about 3 years ago

  • Blocked by deleted (Bug #10720: Tails Installer freezes when calling system_partition.call_set_name_sync in partition_device)

#24 Updated by intrigeri about 3 years ago

  • Blocked by Bug #11590: Improve Tails Installer robustness for 2.6 added

#25 Updated by intrigeri about 3 years ago

  • Description updated (diff)

#26 Updated by intrigeri about 3 years ago

  • Description updated (diff)

#27 Updated by intrigeri about 3 years ago

  • Status changed from Confirmed to In Progress
  • Assignee set to intrigeri
  • Target version set to Tails_2.6
  • % Done changed from 0 to 10
  • Feature Branch changed from test/11588-usb-on-jenkins to test/11588-usb-on-jenkins+10733

Status update: looks like I've got something robust enough, see https://jenkins.tails.boum.org/view/Tails_ISO/job/test_Tails_ISO_test-11588-usb-on-jenkins-10733/ and https://jenkins.tails.boum.org/view/Tails_ISO/job/test_Tails_ISO_test-10971-more-cpus-for-tailstoaster-11588-10733/. I'll let it run a couple more weeks on Jenkins and we'll see. I hope we'll be able to merge this (and re-enable most USB tests) during the 2.6 cycle, fingers crossed.

#28 Updated by intrigeri about 3 years ago

  • Blocked by Bug #10733: Run our initramfs memory erasure hook earlier added

#31 Updated by intrigeri about 3 years ago

  • Assignee changed from intrigeri to anonym
  • % Done changed from 10 to 20
  • QA Check set to Ready for QA

This seems to be rock solid on Jenkins.

#32 Updated by intrigeri about 3 years ago

I'd like to ease reviewing for the 2.6 RM, and to get automated tests running on the combination of all these changes ASAP in the 2.6 dev cycle. So, I've merged this work, along with the other major branches I'm proposing for 2.6, into the feature/from-intrigeri-for-2.6 integration branch (Jenkins builds and tests).

#33 Updated by anonym about 3 years ago

  • Status changed from In Progress to Fix committed
  • Assignee deleted (anonym)
  • % Done changed from 20 to 100
  • QA Check changed from Ready for QA to Pass

#34 Updated by anonym about 3 years ago

  • Status changed from Fix committed to Resolved

#35 Updated by anonym almost 3 years ago

  • Related to Bug #12142: The nec-xhci virtual USB controller + tails-persistence-setup causes a VM freeze on Debian Stretch or newer hosts added
