Bug #9157

ISO testers (level-1) VMs crash when running the test suite with Jessie's kernel

Added by intrigeri over 4 years ago. Updated over 3 years ago.

Status:
Resolved
Priority:
Normal
Assignee:
-
Category:
Infrastructure
Target version:
Start date:
04/04/2015
Due date:
% Done:

100%

Feature Branch:
Type of work:
Sysadmin
Blueprint:
Starter:
No
Affected tool:

Description

... but works fine when booted with Wheezy's kernel. In both cases, the system is running Jessie.

Symptoms: the system is unresponsive, even to ping.

I can't reproduce this bug (even with nested virtualization) in another Jessie VM that runs on different hardware with a very similar setup. I can't reproduce it on bare metal either.

More information I could gather:

[  152.976056] INFO: rcu_sched detected stalls on CPUs/tasks: { 0 2} (detected by 1, t=5254 jiffies, g=7943, c=7942, q=6728)
[  360.096135] INFO: task kworker/u6:0:6 blocked for more than 120 seconds.
[  360.097693]       Not tainted 3.16.0-4-amd64 #1
[  360.098687] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  360.100691] INFO: task jbd2/dm-0-8:139 blocked for more than 120 seconds.
[  360.102189]       Not tainted 3.16.0-4-amd64 #1
[  360.103178] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  360.105015] INFO: task jbd2/vdb-8:328 blocked for more than 120 seconds.
[  360.106475]       Not tainted 3.16.0-4-amd64 #1
[  360.107462] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  360.109328] INFO: task rs:main Q:Reg:700 blocked for more than 120 seconds.
[  360.110740]       Not tainted 3.16.0-4-amd64 #1
[  360.111654] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  360.113449] INFO: task master:787 blocked for more than 120 seconds.
[  360.114731]       Not tainted 3.16.0-4-amd64 #1
[  360.115645] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  360.117405] INFO: task pickup:798 blocked for more than 120 seconds.
[  360.118682]       Not tainted 3.16.0-4-amd64 #1
[  360.119595] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  360.121365] INFO: task cucumber:2205 blocked for more than 120 seconds.
[  360.122598]       Not tainted 3.16.0-4-amd64 #1
[  360.123446] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  360.125100] INFO: task qemu-system-x86:2308 blocked for more than 120 seconds.
[  360.126514]       Not tainted 3.16.0-4-amd64 #1
[  360.127370] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.

and:

BUG: soft lockup - CPU#2 stuck for 22s! [kworker/2:1:34]
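For triage, the hung-task and soft-lockup reports above can be summarized with a small shell sketch (the sample log lines are copied from this ticket's traces; the file path is illustrative):

```shell
# Summarize which tasks the kernel reports as hung and which CPUs lock up.
# The sample excerpt is copied from the traces attached to this ticket.
cat > /tmp/dmesg_sample.txt <<'EOF'
[  360.096135] INFO: task kworker/u6:0:6 blocked for more than 120 seconds.
[  360.100691] INFO: task jbd2/dm-0-8:139 blocked for more than 120 seconds.
[  256.064003] BUG: soft lockup - CPU#1 stuck for 22s! [qemu-system-x86:1221]
[  256.080003] BUG: soft lockup - CPU#2 stuck for 22s! [kworker/2:2:328]
EOF

# Hung task names (PID suffix stripped):
grep 'blocked for more than' /tmp/dmesg_sample.txt \
  | sed -E 's/.*INFO: task (.*):[0-9]+ blocked.*/\1/' | sort -u

# CPUs hit by soft lockups:
grep -oE 'CPU#[0-9]+' /tmp/dmesg_sample.txt | sort -u
```

On a real machine one would feed `journalctl -k` or a serial-console capture into the same pipeline instead of the inlined sample.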

trace1 (3.33 KB) intrigeri, 04/17/2015 03:57 PM

trace2 (22.7 KB) intrigeri, 04/17/2015 03:57 PM

trace3 (3.5 KB) bertagaz, 09/20/2015 04:00 AM

trace_no_hpet_no_acpi (4.41 KB) bertagaz, 09/20/2015 04:00 AM

trace_no_disk_cache_L1 (10.5 KB) bertagaz, 09/20/2015 04:00 AM

trace_qemu2.4 (7.18 KB) bertagaz, 09/22/2015 02:39 AM

trace_qemu2.4_cpus1 (5.78 KB) bertagaz, 09/22/2015 02:47 AM


Related issues

Related to Tails - Feature #9264: Consider buying more server hardware to run our automated test suite Resolved 12/15/2015
Related to Tails - Bug #10717: Concerning amount of test suite runs aborted on Jenkins due to timeout Rejected 12/06/2015
Blocked by Tails - Feature #11178: Upgrade lizard host system to Jessie Resolved 02/29/2016

History

#1 Updated by intrigeri over 4 years ago

Note that I've tried removing irqbalance, as it seemed to be a workaround for people affected by similar problems, but it didn't help.

Next steps are to try and reproduce in various settings, e.g.:

  • starting the same ISO with QEMU, outside of the test suite environment and of libvirt
  • starting the same ISO with libvirt, outside of the test suite environment
  • changing the QEMU command line a bit
  • allocating HugePages statically
  • disabling automatic allocation of AnonHugePages
  • more ideas welcome!
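The last two HugePages-related items could be sketched as follows (run as root on the level-0 host; the page count is a guess, not a tuned value):

```shell
# Illustrative only; run as root. 640 pages is a guess sized for a 1280 MiB guest.
# Statically allocate 2 MiB HugePages instead of relying on transparent ones:
echo 640 > /proc/sys/vm/nr_hugepages        # 640 * 2 MiB = 1280 MiB

# Disable automatic allocation of anonymous transparent hugepages:
echo never > /sys/kernel/mm/transparent_hugepage/enabled
echo never > /sys/kernel/mm/transparent_hugepage/defrag
```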

#2 Updated by intrigeri over 4 years ago

Next step is probably to retry with linux (3.16.7-ckt9-1) first.

#3 Updated by intrigeri over 4 years ago

With linux 3.16.7-ckt9-2, I see basically the same:

[  231.044003] INFO: rcu_sched self-detected stall on CPU { 2}  (t=5250 jiffies g=3872 c=3871 q=50)
[  256.064003] BUG: soft lockup - CPU#1 stuck for 22s! [qemu-system-x86:1221]
[  256.080003] BUG: soft lockup - CPU#2 stuck for 22s! [kworker/2:2:328]
[  284.064002] BUG: soft lockup - CPU#1 stuck for 22s! [qemu-system-x86:1221]
[  284.080003] BUG: soft lockup - CPU#2 stuck for 22s! [kworker/2:2:328]
[  294.056003] INFO: rcu_sched self-detected stall on CPU { 2}  (t=21003 jiffies g=3872 c=3871 q=50)
[  320.064002] BUG: soft lockup - CPU#1 stuck for 23s! [qemu-system-x86:1221]
[  320.080003] BUG: soft lockup - CPU#2 stuck for 23s! [kworker/2:2:328]
[  348.064003] BUG: soft lockup - CPU#1 stuck for 23s! [qemu-system-x86:1221]
[  348.080003] BUG: soft lockup - CPU#2 stuck for 23s! [kworker/2:2:328]
[  357.068003] INFO: rcu_sched self-detected stall on CPU { 2}  (t=36756 jiffies g=3872 c=3871 q=50)
[  384.064002] BUG: soft lockup - CPU#1 stuck for 22s! [qemu-system-x86:1221]
[  384.080003] BUG: soft lockup - CPU#2 stuck for 22s! [kworker/2:2:328]
[  412.064002] BUG: soft lockup - CPU#1 stuck for 22s! [qemu-system-x86:1221]
[  412.080003] BUG: soft lockup - CPU#2 stuck for 22s! [kworker/2:2:328]
[  420.080003] INFO: rcu_sched self-detected stall on CPU { 2}  (t=52509 jiffies g=3872 c=3871 q=50)
[  448.064004] BUG: soft lockup - CPU#1 stuck for 23s! [qemu-system-x86:1221]
[  448.080003] BUG: soft lockup - CPU#2 stuck for 23s! [kworker/2:2:328]
[  476.064006] BUG: soft lockup - CPU#1 stuck for 22s! [qemu-system-x86:1221]
[  476.080004] BUG: soft lockup - CPU#2 stuck for 22s! [kworker/2:2:328]

#4 Updated by intrigeri over 4 years ago

With Linux 3.19.3-1~exp1:

[   76.076003] NMI watchdog: BUG: soft lockup - CPU#2 stuck for 22s! [kworker/2:0:19]
[  104.076004] NMI watchdog: BUG: soft lockup - CPU#2 stuck for 22s! [kworker/2:0:19]
[  132.076004] NMI watchdog: BUG: soft lockup - CPU#2 stuck for 23s! [kworker/2:0:19]
[  160.076003] NMI watchdog: BUG: soft lockup - CPU#2 stuck for 23s! [kworker/2:0:19]
[  188.076003] NMI watchdog: BUG: soft lockup - CPU#2 stuck for 23s! [kworker/2:0:19]
[  216.076003] NMI watchdog: BUG: soft lockup - CPU#2 stuck for 23s! [kworker/2:0:19]
[  244.076003] NMI watchdog: BUG: soft lockup - CPU#2 stuck for 23s! [kworker/2:0:19]
[  272.076003] NMI watchdog: BUG: soft lockup - CPU#2 stuck for 22s! [kworker/2:0:19]
[  300.076003] NMI watchdog: BUG: soft lockup - CPU#2 stuck for 22s! [kworker/2:0:19]
[  328.076003] NMI watchdog: BUG: soft lockup - CPU#2 stuck for 22s! [kworker/2:0:19]
[  356.076003] NMI watchdog: BUG: soft lockup - CPU#2 stuck for 22s! [kworker/2:0:19]
[  384.076003] NMI watchdog: BUG: soft lockup - CPU#2 stuck for 22s! [kworker/2:0:19]
[  412.076003] NMI watchdog: BUG: soft lockup - CPU#2 stuck for 22s! [kworker/2:0:19]
[  440.076004] NMI watchdog: BUG: soft lockup - CPU#2 stuck for 22s! [kworker/2:0:19]
[  468.076004] NMI watchdog: BUG: soft lockup - CPU#2 stuck for 22s! [kworker/2:0:19]

#5 Updated by intrigeri over 4 years ago

  • On current Jessie kernel, I can't reproduce the bug with noapic disableapic. But then, of course, only one CPU is used, so that's not a valid workaround. This leads me to think that there's a deadlock between CPUs.
  • Booting with notsc nohpet noapictimer doesn't help with current Jessie kernel.
  • sudo qemu-system-x86_64 -cdrom /srv/tails-i386-1.3.2.iso -m 1280 -nographic seems to work just fine. As soon as I add -machine pc-0.15,accel=kvm, it doesn't work anymore. Removing usb=off, adding mem-merge=off, using -machine pc-q35-2.1 doesn't help.
  • Both attached traces were produced by current Jessie kernel, with sudo qemu-system-x86_64 -cdrom /srv/tails-i386-1.3.2.iso -m 1280 -nographic -machine pc-0.15,accel=kvm
  • Disabling transparent hugepages (echo never >/sys/kernel/mm/transparent_hugepage/enabled) doesn't help.
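Side by side, the working and failing invocations described above differ only in the `-machine` option:

```shell
# Works (no KVM acceleration):
sudo qemu-system-x86_64 -cdrom /srv/tails-i386-1.3.2.iso -m 1280 -nographic

# Hangs (same command, KVM acceleration and an old machine type); this is
# the invocation that produced the attached traces:
sudo qemu-system-x86_64 -cdrom /srv/tails-i386-1.3.2.iso -m 1280 -nographic \
    -machine pc-0.15,accel=kvm
```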

#6 Updated by intrigeri over 4 years ago

Reported on the Debian BTS: https://bugs.debian.org/782774

#7 Updated by hybridwipe about 4 years ago

How is the first KVM instance being started (i.e., what QEMU command line), and what image/ISO is it running? I tried booting a testing image to boot Tails 1.3.2 (to minimize changes), but KVM wouldn't even load inside KVM instance 1.

#8 Updated by intrigeri about 4 years ago

How is the first KVM instance being started (i.e., what QEMU command line), and what image/ISO is it running? I tried booting a testing image to boot Tails 1.3.2 (to minimize changes), but KVM wouldn't even load inside KVM instance 1.

We're using libvirt, both on the host and (via the test suite) in the level 1 guest.
I suggest taking this question to our support channels.

#9 Updated by intrigeri about 4 years ago

Reproduced with today's Jessie kernel (3.16.7-ckt11-1).

#10 Updated by bertagaz almost 4 years ago

Reproduced with the jessie-backports kernel (4.1.0-0.bpo.2, version 4.1.6-1~bpo8+1, 2015-09-09).

Tried to boot the L1 VM with nmi_watchdog=0, and also pinned the CPU frequency on the host, as I read this may be caused by frequency scaling, but none of that has worked so far.
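Pinning the host CPU frequency, as mentioned above, could look roughly like this (hypothetical sketch; assumes the cpufreq sysfs interface and is not necessarily what was actually done):

```shell
# Run as root on the level-0 host; illustrative only.
# Switch every CPU to the "performance" governor so it stays at the
# highest frequency, ruling out frequency-scaling side effects:
for policy in /sys/devices/system/cpu/cpu*/cpufreq; do
    echo performance > "$policy/scaling_governor"
done

# The NMI watchdog is disabled in the L1 guest by booting it with
# nmi_watchdog=0 on the kernel command line.
```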

I've tried to set the L1 VM cpu in host-passthrough mode, and then saw an error I haven't seen before:

[  231.456005] INFO: rcu_sched detected stalls on CPUs/tasks: { 2} (detected by 1, t=36762 jiffies, g=432, c=431, q=92)
[  231.456005] rcu_sched kthread starved for 36762 jiffies!
[  240.168087] INFO: task jbd2/dm-0-8:139 blocked for more than 120 seconds.
[  240.169158]       Tainted: G             L  4.1.0-0.bpo.2-amd64 #1
[  240.170057] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  256.076002] NMI watchdog: BUG: soft lockup - CPU#1 stuck for 22s! [kworker/1:3:329]

#11 Updated by bertagaz almost 4 years ago

Ran some other tests with this backports kernel.

trace3 is the basic trace.

trace_no_hpet_no_acpi is from starting the L2 guest's qemu-system-x86_64 with the -no-hpet and -no-acpi options.

trace_no_disk_cache_L1 is from starting the L1 guest with cache='none' for all its attached disks.

I also failed to find a way to set enable_apicv to Y for the L1 guest's kvm_intel module.

I'll later post links to some pages that may interest us.

There's definitely a problem with interrupts or the APIC. I tried disabling X2APIC access for the L1 guest, but that didn't change anything.
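Inspecting and changing the kvm_intel parameters mentioned above could look like this (illustrative sketch; these parameters are read-only at runtime, so changing them requires reloading the module):

```shell
# Inside the L1 guest: inspect the current kvm_intel module parameters.
cat /sys/module/kvm_intel/parameters/enable_apicv
cat /sys/module/kvm_intel/parameters/nested

# To change them, reload the module with options (no VM may be running),
# or make it persistent via a file in /etc/modprobe.d/:
modprobe -r kvm_intel
modprobe kvm_intel enable_apicv=1 nested=1
```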

#12 Updated by bertagaz almost 4 years ago

These tests were done with qemu-system-x86 from Jessie (2.1, package version 2.1+dfsg-12+deb8u4). I've tried the version from Stretch (2.4, 2.4+dfsg-3) and got a somewhat different trace.

#13 Updated by bertagaz almost 4 years ago

And this other trace is from adding -smp cpus=1 to the usual qemu-system-x86_64 -cdrom $ISO -m 1280 -machine accel=kvm used for all this testing.

#14 Updated by bertagaz almost 4 years ago

Note that all these tests were done with a QEMU from Wheezy on the host, which is a bit old (1.1.2). Meanwhile, there have been changes in QEMU related to the APIC and such. There's a wheezy-backports version available, which is 2.1, so a next step could be to upgrade to it on the host and do some more testing.

#15 Updated by intrigeri almost 4 years ago

  • Assignee changed from intrigeri to bertagaz
  • QA Check set to Info Needed

bertagaz wrote:

I've tried to set the L1 VM cpu in host-passthrough mode, and then saw an error I haven't seen before:

[...]

I don't understand what you mean here: this is essentially the same error message as the one that's in the ticket description.

#16 Updated by intrigeri almost 4 years ago

bertagaz wrote:

Note that all these tests were done with a QEMU from Wheezy on the host, which is a bit old (1.1.2). Meanwhile, there have been changes in QEMU related to the APIC and such. There's a wheezy-backports version available, which is 2.1, so a next step could be to upgrade to it on the host and do some more testing.

Good idea.

I suggest we hold on, and block on #9264 though: if we're going to stop doing nested virtualization soonish, then it's not worth investing lots of time into fixing it now (fixing it 6 months ago would have been different). What do you think?

#17 Updated by bertagaz almost 4 years ago

  • Assignee changed from bertagaz to intrigeri

intrigeri wrote:

bertagaz wrote:

Note that all these tests were done with a QEMU from Wheezy on the host, which is a bit old (1.1.2). Meanwhile, there have been changes in QEMU related to the APIC and such. There's a wheezy-backports version available, which is 2.1, so a next step could be to upgrade to it on the host and do some more testing.

Good idea.

I suggest we hold on, and block on #9264 though: if we're going to stop doing nested virtualization soonish, then it's not worth investing lots of time into fixing it now (fixing it 6 months ago would have been different). What do you think?

Well, I'm tempted to agree, but I'm also tempted to give this upgrade a try. There are quite a lot of aborted test jobs in Jenkins, and I wonder whether this might help with that. Let's say it's not mandatory, but something that can be played with while waiting to see how #9264 goes?

#18 Updated by intrigeri almost 4 years ago

  • Assignee changed from intrigeri to bertagaz

Well, I'm tempted to agree, but I'm also tempted to give this upgrade a try.

I certainly won't prevent you from spending your spare time on this, if you so wish :)

There are quite a lot of aborted test jobs in Jenkins, and I wonder whether this might help with that.

This is another problem. It looks like a severe one, that needs a ticket.

Let's say it's not mandatory, but something that can be played with while waiting to see how #9264 goes?

There are lots of things wrt. #9264 and follow-up tasks that can be done without "waiting", e.g. make sure that we'll have more info when we come back to it, that we can host another box at the colo, that our contacts there will be available to purchase and plug the new box, etc. If all this is well handled and you have spare time, well, again, feel free to use it as you wish :)

#19 Updated by intrigeri over 3 years ago

I think we should now do the following, in this order, stopping on success:

  1. upgrade the host (lizard) to Jessie: we have to do it soonish anyway, and it'll give us the QEMU upgrade bertagaz wanted to test :)
  2. reproduce the bug on another isotesterN if not done yet
  3. test with Linux 4.3 on isotesterN
  4. make sure #9264 saves the day

#20 Updated by intrigeri over 3 years ago

  • Related to Feature #9264: Consider buying more server hardware to run our automated test suite added

#21 Updated by bertagaz over 3 years ago

  • Related to Bug #10717: Concerning amount of test suite runs aborted on Jenkins due to timeout added

#22 Updated by kytv over 3 years ago

intrigeri wrote:

  1. test with Linux 4.3 on isotesterN

I still need to try 4.3 (and 4.4-rc), but here's a datapoint that may be helpful even though my hardware is wildly different.

With 4.2.x my kernel reliably OOPSes during the UEFI Tails-Installer test. I reverted to a self-compiled 4.1.13 in the meantime.

#23 Updated by kytv over 3 years ago

kytv wrote:

intrigeri wrote:

  1. test with Linux 4.3 on isotesterN

I still need to try 4.3 (and 4.4-rc), but here's a datapoint that may be helpful even though my hardware is wildly different.

The oops can be reproduced with 4.4.0-rc4, but the screen is still responsive, unlike with 4.2.x.

Perhaps isotesterN has better luck with it than I do.

#24 Updated by intrigeri over 3 years ago

  • Blocked by Feature #11178: Upgrade lizard host system to Jessie added

#25 Updated by intrigeri over 3 years ago

FWIW:

  • I've set cpu mode=host-passthrough on all isotesters, as suggested by the OpenStack doc when doing nested KVM.
  • Since 8544f14485289653b8dde9a29dc496f79e768866 (introduced on feature/jessie, part of Tails 2.0+) the test suite is run with cpu mode='host-model'.

It might be that one of these changes fixes the problem this ticket is about => worth retrying with Linux 3.16 from Jessie, and Linux 4.4 from sid, maybe. I might play with this one of these days when bored.
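For reference, the two CPU modes mentioned above correspond to the `<cpu>` element in the libvirt domain XML; a sketch (the domain name is hypothetical):

```shell
# Hypothetical sketch; "isotester1" is an assumed domain name.
# Level-1 isotesters on lizard, set via `virsh edit isotester1`:
#     <cpu mode='host-passthrough'/>
# Level-2 guests created by the test suite (since Tails 2.0):
#     <cpu mode='host-model'/>

# Check what a running domain actually uses:
virsh dumpxml isotester1 | grep -A1 "<cpu"
```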

Also, wrt. the QEMU backport for Wheezy: it seems to be unmaintained and lacks a big pile of security fixes, so I don't think it's an option.

After the conclusions I've reached on #11011, I think we're back to needing to fix this problem before Wheezy's Linux 3.2 stops being maintained, that is, before the end of wheezy-lts, since we can't count on a 2nd testing machine to magically solve this problem for us => bertagaz, can you set an appropriate target version?

#26 Updated by intrigeri over 3 years ago

  • Target version set to Tails_2.3

Reproduced on an isotester running the current Jessie kernel (3.16.7-ckt20-1+deb8u3), and likewise with the one from Stretch (4.3.0-1-amd64, version 4.3.5-1). I wonder if we should bisect the Linux 3.2..3.16 packages to see when the problem appeared.

Another option would be to try a newer kernel on the host.

I'm giving up until we've done #11178; setting the same tentative milestone as that one.

#27 Updated by anonym over 3 years ago

I think I've stumbled into this issue too now: #6354#note-16.

intrigeri wrote:

Another option would be to try a newer kernel on the host.

On my level-0 host I am running Debian Unstable, with kernel version 4.3.0-1-amd64 (and I've tried with the same backported kernels in the other two VMs), so I don't think this will help. :/

#28 Updated by intrigeri over 3 years ago

I think I've stumbled into this issue too now: #6354#note-16.

intrigeri wrote:

Another option would be to try a newer kernel on the host.

On my level-0 host I am running Debian Unstable, with kernel version 4.3.0-1-amd64 (and I've tried with the same backported kernels in the other two VMs), so I don't think this will help. :/

I'd like to confirm this is really the same problem we're seeing, first, so: is it fixed for you if the level-1 VM (where Vagrant will be run) runs Wheezy's kernel + Jessie's userspace?

#29 Updated by anonym over 3 years ago

intrigeri wrote:

On my level-0 host I am running Debian Unstable, with kernel version 4.3.0-1-amd64 (and I've tried with the same backported kernels in the other two VMs), so I don't think this will help. :/

I'd like to confirm this is really the same problem we're seeing, first, so: is it fixed for you if the level-1 VM (where Vagrant will be run) runs Wheezy's kernel + Jessie's userspace?

As mentioned on #tails-dev, this doesn't fix the problem. Specifically I was running:

  • level 0: 4.4.0-1-amd64 (Sid kernel/user space)
  • level 1: 3.2.0-4-amd64 (Wheezy kernel, Jessie user space)
  • level 2: 3.16.0-4-amd64 (Jessie kernel/user space)

I tried twice, and the first time even level-0 crashed! The second time I got this in level-2:

kernel:[ 1092.132037] BUG: soft lockup - CPU#3 stuck for 22s! [apt-get:25441] 

which actually isn't exactly the same as I reported in #6354#note-16. I couldn't see what happened in level-1 since it crashed, and the journal was lost.

I wonder if it matters that level-0 runs kernel 4.4.0. I guess lizard runs the Jessie (or Wheezy?) kernel?

#30 Updated by intrigeri over 3 years ago

I wonder if it matters that level-0 runs kernel 4.4.0. I guess lizard runs the Jessie (or Wheezy?) kernel?

lizard level-0 host runs 3.16 from wheezy-backports.

#31 Updated by anonym over 3 years ago

I do not have the issue when both level-0 and level-1 are running Debian sid, see #6354#note-18. So if my kernel issue is somehow related to the kernel issue this ticket is about, perhaps everything will be fine once Debian Stretch is out and lizard and the isobuilders migrate to it.

An interesting experiment we could do now would be installing the 4.4.0-1 kernel from jessie-backports on both lizard and the isotesters (and isobuilders?). What do you think? Perhaps the userland also matters.

#32 Updated by intrigeri over 3 years ago

anonym wrote:

I do not have the issue when both level-0 and level-1 are running Debian sid, see #6354#note-18. So if my kernel issue is somehow related to the kernel issue this ticket is about, perhaps everything will be fine once Debian Stretch is out and lizard and the isobuilders migrate to it.

An interesting experiment we could do now would be installing the 4.4.0-1 kernel from jessie-backports on both lizard and the isotesters (and isobuilders?). What do you think? Perhaps the userland also matters.

Tried it, but sadly we can't do that as long as we're running Wheezy on the host system: "linux-image-4.3.0-0.bpo.1-amd64 : Breaks: udev (< 208-8~) but 175-7.2 is to be installed". So we're back to blocking on #11178, as previously.

#33 Updated by intrigeri over 3 years ago

  • Subject changed from isotester1.lizard crashes when running the test suite with Jessie's kernel to ISO testers (level-1) VMs crash when running the test suite with Jessie's kernel
  • QA Check deleted (Info Needed)

#34 Updated by intrigeri over 3 years ago

After upgrading the level-0 host to Jessie, this combination (quite expectedly) still fails:

  • level-0: Jessie userspace + kernel
  • level-1: Jessie userspace, 4.4 kernel from jessie-backports
  • level-2: Jessie userspace + kernel

But this works:

  • level-0: Jessie userspace, 4.4 kernel from jessie-backports
  • level-1: Jessie userspace, 4.4 kernel from jessie-backports
  • level-2: Jessie userspace + kernel

Thanks anonym for the lead!

#35 Updated by intrigeri over 3 years ago

  • Status changed from Confirmed to Resolved
  • Assignee deleted (bertagaz)
  • % Done changed from 0 to 100
