Bug #10288: Fix newly identified issues to make our test suite more robust and faster
The check_tor_leaks hook is fragile
https://jenkins.tails.boum.org/job/test_Tails_ISO_feature-5650-rngd/8/ failed (pcap attached) with:
calling as root: systemctl is-system-running
call returned: [0, "running\n", ""]
And Tor is ready # features/step_definitions/common_steps.rb:373
calling as root: date --rfc-2822
call returned: [0, "Fri, 10 Jun 2016 18:18:47 +0000\n", ""]
Then Tails clock is less than 5 minutes incorrect # features/step_definitions/time_syncing.rb:40
Time was 0.436531902 seconds off
Unexpected connections were made:
  #<OpenStruct mac_saddr="50:54:00:f3:87:dc", mac_daddr="52:54:00:53:3c:df", protocol="tcp", saddr="10.2.1.92", daddr="10.2.1.1", sport=55150, dport=5024>.
<false> is not true. (Test::Unit::AssertionFailedError)
/usr/lib/ruby/vendor_ruby/test/unit/assertions.rb:55:in `block in assert_block'
/usr/lib/ruby/vendor_ruby/test/unit/assertions.rb:1588:in `call'
/usr/lib/ruby/vendor_ruby/test/unit/assertions.rb:1588:in `_wrap_assertion'
/usr/lib/ruby/vendor_ruby/test/unit/assertions.rb:53:in `assert_block'
/usr/lib/ruby/vendor_ruby/test/unit/assertions.rb:163:in `block in assert'
/usr/lib/ruby/vendor_ruby/test/unit/assertions.rb:1593:in `call'
/usr/lib/ruby/vendor_ruby/test/unit/assertions.rb:1593:in `_wrap_assertion'
/usr/lib/ruby/vendor_ruby/test/unit/assertions.rb:130:in `assert'
/var/lib/jenkins/workspace/test_Tails_ISO_feature-5650-rngd/features/support/helpers/firewall_helper.rb:68:in `assert_all_connections'
/var/lib/jenkins/workspace/test_Tails_ISO_feature-5650-rngd/features/support/hooks.rb:274:in `After'
The pcap shows one initial packet (at time = 0), which is precisely the one that violates the assertion, then a DHCP transaction 53 seconds later, and then what looks like a Tor bootstrap with chutney 30 seconds after that. So my theory is that the initial packet that triggers the failure is a leftover from the previous scenario, or something similar, and has nothing to do with the current scenario.
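For context, the failing assertion comes from assert_all_connections in features/support/helpers/firewall_helper.rb, which checks the connections captured by the sniffer against the ones the scenario expects. Here is a simplified, hypothetical sketch of that kind of check using the PacketFu gem; the names and logic are illustrative, not the real helper's code:

    require 'packetfu'

    # Hypothetical leak check: every IP packet captured during the scenario
    # must be addressed to one of the expected destinations (e.g. the
    # router / Tor entry point). A single stray packet fails the scenario.
    def assert_no_unexpected_connections(pcap_path, allowed_daddrs)
      PacketFu::PcapFile.new.file_to_array(filename: pcap_path).each do |bytes|
        pkt = PacketFu::Packet.parse(bytes)
        next unless pkt.is_ip?
        unless allowed_daddrs.include?(pkt.ip_daddr)
          raise "Unexpected connection to #{pkt.ip_daddr}"
        end
      end
    end

Under that model, a single leftover packet from a previous scenario, captured at time = 0, is enough to fail the whole run.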
Test suite: power off the system under test after each scenario.
Until now we have relied on either one of the generated "snapshot restore" steps or the "[Given] a computer" step to implicitly stop the old VM when we move on to a new scenario. That meant the old VM was still running during the new scenario's Before hooks. If the new scenario is tagged @check_tor_leaks, that means we start its sniffer while the old VM is still running, possibly sending packets that then affect the new scenario. That would explain some mysterious "Unexpected connections were made" failures we have seen (refs: #11521).
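To make the race concrete, here is a hedged sketch of the hook ordering; Before is a standard Cucumber tagged hook, while start_sniffer and the exact flow are illustrative names, not necessarily the real test-suite API:

    # Scenario N ends: no After hook powers off the VM, so it keeps running.
    #
    # Scenario N+1 begins:
    Before('@check_tor_leaks') do
      # The sniffer starts capturing here, while the *previous* scenario's
      # VM may still be running and emitting packets...
      @sniffer = start_sniffer   # illustrative helper
    end
    # ...and only a later "snapshot restore" or "a computer" step replaces
    # the old VM. Anything it sent in the meantime ends up in our pcap.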
#3 Updated by bertagaz over 3 years ago
It has happened 4 times in the last 50 days on Jenkins.
That's mentioned on #11087#note-9. Attaching the pcap just in case.
#4 Updated by intrigeri over 3 years ago
Here's the pcap for another similar failure.
- Target version set to Tails_3.15
- % Done changed from 0 to 40
- Feature Branch set to test/11521-always-power-off-vm
First of all, I grepped for "Unexpected connections were made" over all debug.log files for the past three months, and it occurred 56 times. I randomly picked three of them, and all three had this bad first packet, so this bug is still affecting us.
> So, my theory is that the initial packet that triggers the failure is a leftover from the previous scenario, or something similar, and has nothing to do with the current scenario.
It is a great theory! :)
Note that this first packet is sent by the Tails under testing, not by the router. Looking into our hooks (features/support/hooks.rb) I see that in none of the After hooks do we shut down the VM; instead we seem to rely on either one of the generated "snapshot restore" steps or the "[Given] a computer" step to implicitly get rid of the old VM... in the next scenario. That means the old VM is still running during the new scenario's Before hooks, where we start the sniffer. This finding is pretty conclusive: there is a window where the previous scenario's network traffic leaks into the next scenario's sniffer.
So, let's just shut down the VM in the last After hook.
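A minimal sketch of that fix, assuming a $vm object with a power_off method (the actual code in features/support/hooks.rb may differ):

    # Power off the system under test at the end of every scenario, so no
    # leftover VM is still emitting packets when the next scenario's
    # Before hooks start their sniffer.
    After do
      $vm.power_off
    end

Note that Cucumber runs After hooks in reverse order of definition, so for this one to run last it has to be registered before the other After hooks.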
- Status changed from In Progress to Needs Validation
- Assignee deleted
Out of 3 runs I only saw unrelated failures, so this branch doesn't seem to introduce any regression (and I don't see why @fragile scenarios would behave any differently, so I'm ignoring the fact that these were not full runs).
I propose we just merge this branch and close this ticket, and reopen (or file a new one) if the problem reappears.
- Status changed from Needs Validation to In Progress
- Assignee changed from intrigeri to anonym
LGTM, but just one thing: the only other time in this hook that we use $vm, we guard it with if $vm; the rationale you provided in 1c387ab37779308980c870157c8d6d0aa752294b seems to apply here as well, so maybe add if $vm here too?
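Concretely, the suggested guard would look something like this (a sketch; the actual hook body may differ):

    After do
      # Scenarios that never created a VM leave $vm as nil; calling
      # power_off on it would raise NoMethodError, hence the guard.
      $vm.power_off if $vm
    end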