Feature #7821

Feature #6298: Write more automated tests

Automatic tests for Tor

Added by intrigeri over 5 years ago. Updated almost 5 years ago.

Status:
Resolved
Priority:
Normal
Assignee:
-
Category:
Test suite
Target version:
Start date:
08/26/2014
Due date:
% Done:

90%

Feature Branch:
test/7821-tor
Type of work:
Code
Blueprint:
Starter:
Affected tool:

Description

This covers everything in the Tor and stream isolation sections of our manual test suite.


Related issues

Related to Tails - Feature #5644: Test suite: always check for firewall leaks Resolved
Related to Tails - Bug #8941: The remote shell server is racing against Tails Greeter Resolved 02/22/2015 03/06/2015
Blocks Tails - Feature #6305: Write tests for Tor bridges Resolved 09/26/2013
Blocks Tails - Feature #8702: Write regression test for Unsafe Browser update checks Resolved 01/14/2015

Associated revisions

Revision ceee0062 (diff)
Added by Tails developers almost 5 years ago

Remove manual Tor tests that have now been automated (Will-fix: #7821).

  • We leave the "The version of Tor should be the latest stable one"
    test, but it probably should be part of the release process instead.
  • We don't remove the bridge tests since that's a different ticket
    (#6305).
  • We move the stream isolation test for Claws Mail to its section and
    do not bother automating it, since we are soon going to migrate to
    Icedove. That test would be complicated, and depends on #6301, which
    hasn't been finished yet.

Revision 5feef88f
Added by Tails developers almost 5 years ago

Merge branch 'test/7821-tor' into testing

Fix-committed: #7821, #5644

History

#1 Updated by intrigeri over 5 years ago

#2 Updated by intrigeri over 5 years ago

  • Parent task set to #6298

#4 Updated by anonym about 5 years ago

I've been researching a bit the best way to reliably do the Tor stream isolation tests. I.e. we'd like to monitor the traffic on a PID-basis, preferably with tools already installed inside Tails (since we are gonna look at loopback traffic, we have to do it from inside Tails). Some conclusions:

  • Good ol' tcpdump doesn't log PIDs. However in Mac OS X it's patched to accept the -k NP option which includes PIDs. That would be ideal for a simple solution...
  • The auditd framework, which can be used to log syscalls, like socket(AF_INET, SOCK_STREAM, blah) or connect(blah). I'm not really sure if we can get the port from the audit trails, or if it would cover all the possibilities an application can use to do networking.
  • SystemTap surely could achieve what we want, but it has some heavy dependencies (not to install, but to actually run stuff), like the Linux headers.
  • One would imagine that this could be done with some simple firewall LOG rule, but AFAICT it's oblivious to PIDs (e.g. --pid-owner was removed in Linux 2.6.14).
  • tracedump seems like another perfect solution, but it isn't packaged in Debian, and I'm unsure if we'd want it in Tails just for this purpose.
  • Something based on Linux's network namespaces (e.g. we run the tested application inside one and dump everything from its virtual interface). It seems a bit complex and intrusive (e.g. messing with firewall NAT rules).
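One of the options sketched above (mapping traffic to PIDs from inside the system) can be done with nothing but /proc: /proc/net/tcp lists each socket with its inode, and /proc/&lt;pid&gt;/fd links reveal which process owns that inode. A minimal, purely illustrative Ruby sketch; the function names are invented here, not part of the Tails test suite:

```ruby
# Parse one /proc/net/tcp entry into [local_port, inode].
# Returns nil for the header line (or anything else that doesn't parse).
def parse_proc_net_tcp_line(line)
  fields = line.split
  return nil unless fields[1] =~ /\A[0-9A-F]+:([0-9A-F]+)\z/i
  local_port = $1.to_i(16)   # port is hex-encoded in /proc/net/tcp
  inode = fields[9].to_i     # field 10 is the socket inode
  [local_port, inode]
end

# Find the PID whose open file descriptors include socket:[inode],
# by scanning the /proc/<pid>/fd symlinks. Requires root to see all
# processes; returns nil if no owner is found.
def pid_for_socket_inode(inode)
  Dir.glob('/proc/[0-9]*/fd/*').each do |fd|
    begin
      return fd.split('/')[2].to_i if File.readlink(fd) == "socket:[#{inode}]"
    rescue SystemCallError
      next # fd vanished mid-scan, or not readable by us
    end
  end
  nil
end
```

This is roughly what tools like `netstat -p` do internally, so it shares their race: a short-lived socket can open and close between scans.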

#5 Updated by intrigeri about 5 years ago

I've been researching a bit the best way to reliably do the Tor stream isolation
tests. I.e. we'd like to monitor the traffic on a PID-basis, preferably with tools
already installed inside Tails (since we are gonna look at loopback traffic, we have
to do it from inside Tails). Some conclusions:

Ouch, looks harder than I would have imagined.

https://trac.torproject.org/projects/tor/ticket/3455#comment:64 led me to believe that maybe we could directly ask Tor what kind of connection request it receives. Perhaps stem (python-stem is in wheezy-backports) would help parsing stuff.

#6 Updated by anonym about 5 years ago

intrigeri wrote:

https://trac.torproject.org/projects/tor/ticket/3455#comment:64 led me to believe that maybe we could directly ask Tor what kind of connection request it receives.

On the face of it, it seems like a great idea. However, after having a look at the control protocol specification it seems that the info for stream events contains nothing about which SocksPort is being used. It does provide the id of the circuit it belongs to, but the info one can query about circuits similarly has no notion of which SocksPort it was spawned for. Shame.

Perhaps stem (python-stem is in wheezy-backports) would help parsing stuff.

After a quick look at the API, it seems that (as expected) it has no extension that would solve the above-mentioned problem.

#7 Updated by anonym about 5 years ago

anonym wrote:

  • Something based on Linux's network namespaces (e.g. we run the tested application inside one and dump everything from its virtual interface). It seems a bit complex and intrusive (e.g. messing with firewall NAT rules).

After looking a bit deeper, this won't really work. The new namespace will have its own loopback, so it will not see the SocksPorts etc. listening on the "outer" loopback. Adding a NAT rule that redirects traffic destined for loopback inside the namespace to the outer one (if that's even possible) kind of defeats the purpose of the stream isolation tests. What remains is to restart tor and polipo inside the new namespace, but then we truly start to diverge from the setup we actually want to test.

#8 Updated by intrigeri about 5 years ago

I'm starting to think that the easiest and most robust solution could be to:

  1. Disable all SOCKS ports that are not supposed to be used for the next test
  2. Tell Tor to reload its configuration
  3. Do the test: if it manages to connect, then it must be that it's using the right SOCKS port

A bit ugly, perhaps...
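The three steps above could be sketched roughly as follows. This is a hypothetical helper, not code from the branch: the torrc format handling is simplified, and Tor would still need to be told to reload (e.g. via SIGHUP) after the rewritten file is installed:

```ruby
# Comment out every SocksPort line in a torrc except the one under test,
# so that a successful connection proves the right port was used.
# Handles both "SocksPort 9050" and "SocksPort 127.0.0.1:9062 ..." forms.
def disable_other_socks_ports(torrc, keep_port)
  torrc.lines.map do |line|
    if line =~ /\ASocksPort\s+(?:[\d.]+:)?(\d+)/ && $1.to_i != keep_port
      "# #{line}"
    else
      line
    end
  end.join
end
```

After writing the result back and reloading Tor, any application still able to connect must be using the surviving port. The ugliness, as noted, is that this changes the very configuration being tested.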

#9 Updated by anonym about 5 years ago

  • Target version changed from Tails_1.2.2 to Tails_1.2.3

#10 Updated by anonym about 5 years ago

  • Target version changed from Tails_1.2.3 to Tails_1.3

#11 Updated by anonym almost 5 years ago

  • Status changed from Confirmed to In Progress
  • % Done changed from 0 to 20
  • QA Check set to Dev Needed
  • Feature Branch set to test/7821-tor

#12 Updated by Tails almost 5 years ago

Applied in changeset commit:51cebe3199b1e6449249ff38c542d668f0b2f53e.

#13 Updated by anonym almost 5 years ago

  • % Done changed from 20 to 40

#14 Updated by anonym almost 5 years ago

intrigeri wrote:

I'm starting to think that the easiest and most robust solution could be to:

  1. Disable all SOCKS ports that are not supposed to be used for the next test
  2. Tell Tor to reload its configuration
  3. Do the test: if it manages to connect, then it must be that it's using the right SOCKS port

A bit ugly, perhaps...

Too ugly, IMHO. I definitely think we must avoid messing with the configuration that we are actually testing, i.e. no changes to the firewall rules or Tor's configuration. I ended up using the same method as the manual tests, i.e. netstat spamming, so we'll get something that's at least as good as that (minus the human intelligence that perhaps could spot something truly unexpected).

Of course, that approach is not robust for catching all connections, since one quite reasonably could happen (socket open to close) between two calls of netstat. In fact, if an application sometimes leaks (i.e. sometimes it uses the configured SocksPort and sometimes it doesn't), the leaked connection would be dropped by the firewall so fast that it's quite likely to happen between two netstat calls and hence be invisible. However, I think we can ignore the potential existence of such leaks as long as we verify that the application succeeded in doing whatever it was supposed to do (=> any such leaks were non-essential), and that we check for Tor leaks (=> any such leaks were harmless). Otherwise, if we care about this, I suppose we could check the firewall's log for any dropped packets...
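The netstat-spamming check described above could look roughly like this. A sketch only: the expected port, the `netstat -tn` output format, and the function names are assumptions, not the actual Tails step definitions:

```ruby
EXPECTED_SOCKS_PORT = 9050 # assumption: the SocksPort configured for the app

# Extract [destination_ip, destination_port] pairs for every ESTABLISHED
# connection in `netstat -tn`-style output.
def established_connections(netstat_output)
  netstat_output.lines.filter_map do |line|
    fields = line.split
    next unless fields[5] == 'ESTABLISHED'
    ip, _sep, port = fields[4].rpartition(':') # foreign address is "ip:port"
    [ip, port.to_i]
  end
end

# Raise unless every sampled connection goes to the expected SocksPort
# on loopback. Would be called repeatedly while the application runs.
def assert_only_expected_socks_port(netstat_output)
  established_connections(netstat_output).each do |ip, port|
    unless ip == '127.0.0.1' && port == EXPECTED_SOCKS_PORT
      raise "Unexpected connection to #{ip}:#{port}"
    end
  end
end
```

As the comment notes, this only catches connections that live long enough to appear in at least one sample.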

#15 Updated by intrigeri almost 5 years ago

Too ugly, IMHO.

Fair enough.

Of course, that approach is not robust for catching all connections, since one quite reasonably could happen (socket open to close) between two calls of netstat. [...] However, I think we can ignore the potential existence of such leaks as long as we verify that the application succeeded in doing whatever it was supposed to do (=> any such leaks were non-essential), and that we check for Tor leaks (=> any such leaks were harmless).

Agreed.

#16 Updated by anonym almost 5 years ago

  • Assignee changed from anonym to kytv
  • % Done changed from 40 to 50
  • QA Check changed from Dev Needed to Ready for QA

Note that some commits may seem unrelated. For instance, for

a8c2537 Hook transparent firewall leak checking via @check_tor_leaks tag.

we need:
366e7a2 Split VM networking into its own class, VMNet.

which in turn needed the fixes:
f8b116b Decouple VMStorage class from VM class.
fd61a7d Fix forgotten instance of VM class' s/update_domain/update/.
f9ce357 Improve encapsulation.

This will import most of the useful stuff from the good ol' test/reorg and test/firewall-check-tag branches.

Also, because of

1a55905 Create the environment for remote shell calls more realisitcally.

the image tested must be built from this branch!

#17 Updated by anonym almost 5 years ago

  • Related to Feature #5644: Test suite: always check for firewall leaks added

#18 Updated by kytv almost 5 years ago

anonym wrote:

Also, because of
[...]
the image tested must be built from this branch!

ISO built and testing has commenced.

#19 Updated by intrigeri almost 5 years ago

Is it on purpose that commit:425e622c7a1cc9a7527deae3152be011349db387 hasn't been merged into experimental?

#20 Updated by anonym almost 5 years ago

intrigeri wrote:

Is it on purpose that commit:425e622c7a1cc9a7527deae3152be011349db387 hasn't been merged into experimental?

No, that was a mistake, which I've now fixed.

#21 Updated by intrigeri almost 5 years ago

  • Assignee changed from kytv to anonym
  • QA Check changed from Ready for QA to Dev Needed
  • The iptables output parser scares me a bit. I'd rather see a grammar-based parser generator, but oh well, now the work has been done, let's say it's too late.
  • I'm not sure what's the value of "the firewall's NAT rules only redirect traffic for Tor's TransPort and DNSPort" without checking that the redirection to TransPort is limited to .onion mapped addresses, since we don't want transparent proxying in general. It's a bit confusing to read that step definition IMO. Not sure how to improve this without too much work, though.
  • in "the firewall is configured to block all IPv6 traffic", there's an error message about "The NAT table" while we're actually looking at the filter table.
  • it feels weird to have "the firewall is configured to block all IPv6 traffic" checked in a feature that's called "Tor is configured properly". Maybe rename this feature torification.feature and adjust its title to match what we're actually testing?
  • I think that the title of the "untorified network connections to (\S+) fails" step should make it clear that it's about HTTP only.
  • My Ruby is no good, but in assert(!status.success? && status.stderr[expected_stderr], we're checking that status.stderr includes expected_stderr, right? If that's the case, then I suggest renaming the expected_stderr variable to expected_in_stderr so that it's clearer what we're actually checking.
  • Instead of testing "untorified network connections to 1.2.3.4", how about testing connections to an IP that has a web server listening on it?
  • Any reason not to call the destroy method in VMNet.clean_up instead of duplicating its code entirely?
  • some code uses custom_sniffer that isn't defined anywhere, and then it fails
  • typo in "ESTABLISHED statate"
  • in I re-run htpdate, maybe use && instead of ; between commands, so that it makes more sense to use execute_successfully?
  • regarding I do a whois-lookup of domain, I've actually implemented a similar check in the torified_misc feature (introduced by my branch for #5525) => we should merge these together at some point.
  • now that the remote shell sets $PATH correctly, maybe we should stop running stuff in /usr/local/*bin with absolute paths? This way, we would also test $PATH implicitly.
  • commit:f8b116b3aba199d0b4a00ffa8808bfaad24995ff adds a :mac attribute reader, which doesn't seem to be used; buggy cherry-pick?

#22 Updated by intrigeri almost 5 years ago

  • Back when we were on test/firewall-check-tag, to my request you introduced commit:894f78216861bb79a1ea36f1b5e5ebf2bbb2ac1d, and now it's been left aside.
  • Same for commit:942a030?
  • My suggestion sent on 2013-08-19 in the "Please review'n'merge test/firewall-check-tag [was: Please review'n'merge test/reorg]" was apparently ignored.
  • In the "Please review'n'merge test/reorg" thread I've asked for documentation for the clean_up and destroy methods.
  • In commit:bdd57a05da7e30c33d8f0d47efde3d809b89b18b you had done some changes to follow-up on my review, and this seems to have been lost along the way.
  • On 2013-08-19 in the "Please review'n'merge test/reorg" thread, I wrote "So please name it differently, else I might not be the only one to get confused :)" about the net variable. This hasn't been addressed apparently.

#23 Updated by intrigeri almost 5 years ago

devel branch should be merged into this one, and tests it has added must be updated to use the @check_tor_leaks tag instead of the "all Internet traffic has only flowed through Tor" step.

#24 Updated by anonym almost 5 years ago

  • Assignee changed from anonym to intrigeri
  • QA Check changed from Dev Needed to Info Needed

intrigeri wrote:

  • The iptables output parser scares me a bit. I'd rather see a grammar-based parser generator, but oh well, now the work has been done, let's say it's too late.

I looked at a few of Ruby's parser libraries, e.g. TreeTop (which is in Debian) and citrus (not in Debian), but my assessment was that it would be a pretty large initial investment (+ more code) to make a parser that's potentially more robust than a simple state machine + heavy use of regexes over full lines.

  • I'm not sure what's the value of "the firewall's NAT rules only redirect traffic for Tor's TransPort and DNSPort" without checking that the redirection to TransPort is limited to .onion mapped addresses, since we don't want transparent proxying in general. It's a bit confusing to read that step definition IMO. Not sure how to improve this without too much work, though.

Ah, checking the destination was indeed forgotten. What about commit 10327cc? The step name is perhaps not the best still. What I wanted to do is the same as the (vague) manual test of looking at the output of iptables -t nat -L -n -v, and IMHO this step now does that adequately. What do you think?

  • in "the firewall is configured to block all IPv6 traffic", there's an error message about "The NAT table" while we're actually looking at the filter table.

Whoops, that's a copy-paste error. Fixed in commit 67aab00.

  • it feels weird to have "the firewall is configured to block all IPv6 traffic" checked in a feature that's called "Tor is configured properly". Maybe rename this feature torification.feature and adjust its title to match what we're actually testing?

Well, it was in the "Tor" section of the manual test suite. :) Maybe tor.feature should be split into tor_stream_isolation.feature (all the "... SocksPort" scenarios) and tor_enforcement.feature (the remaining scenarios)? Quite possibly it would make sense to merge firewall_leaks.feature into tor_enforcement.feature since they're just anti-tests.

  • My Ruby is no good, but in assert(!status.success? && status.stderr[expected_stderr], we're checking that status.stderr includes expected_stderr, right?

Correct. I actually think I learned that trick when reviewing your work some time ago. :)

If that's the case, then I suggest renaming the expected_stderr variable to expected_in_stderr so that it's clearer what we're actually checking.

Fixed in commit 34a5a22.

  • I think that the title of the "untorified network connections to (\S+) fails" step should make it clear that it's about HTTP only.

Sure, but I guess what I want the step to actually do is "untorified TCP connections ... fail" since it's not about HTTP particularly, so I might as well use netcat or whatever. So...

  • Instead of testing "untorified network connections to 1.2.3.4", how about testing connections to an IP that has a web server listening on it?

If so I guess we'd have to verify that a web server actually is listening on it, right? While that's doable with a separate torified curl call it seems clunky.

What if we check that the firewall logged the dropped connection instead? OTOH that would be harder when a domain name is given instead. On second thought, though, the usage of a domain name vs an IP address seems completely orthogonal to what I want to test. I think this approach is better, so I've completely reworked this test in commit 5e6acef, and as a bonus we now also test that UDP (commit 30ae3bf) and ICMP (commit a32617a) packets are dropped.
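Checking the kernel log for a dropped packet could be sketched like this. The log prefix ("Dropped outbound packet") and the key=value fields follow the usual iptables LOG format, but the exact prefix used by the Tails firewall is an assumption here:

```ruby
# Scan iptables-LOG-style kernel log output for a dropped packet matching
# the given protocol, destination host, and (optionally) destination port.
# ICMP has no port, so port may be nil.
def firewall_dropped?(kernel_log, proto, host, port = nil)
  kernel_log.lines.any? do |line|
    next false unless line.include?('Dropped outbound packet') # assumed prefix
    fields = Hash[line.scan(/(\w+)=(\S*)/)] # "KEY=value" pairs from the LOG line
    fields['PROTO'] == proto.upcase &&
      fields['DST'] == host &&
      (port.nil? || fields['DPT'] == port.to_s)
  end
end
```

A step definition could then first assert the packet was *not* already logged, trigger the untorified connection attempt, and assert the drop appears.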

  • some code uses custom_sniffer that isn't defined anywhere, and then it fails

Fixed in commit 33a52a4.

  • typo in "ESTABLISHED statate"

Fixed in commit e02e06d.

  • in I re-run htpdate, maybe use && instead of ; between commands, so that it makes more sense to use execute_successfully?

Fixed in commit a1812ed.

  • regarding I do a whois-lookup of domain, I've actually implemented a similar check in the torified_misc feature (introduced by my branch for #5525) => we should merge these together at some point.

Done in commit 3cc44d9.

  • now that the remote shell sets $PATH correctly, maybe we should stop running stuff in /usr/local/*bin with absolute paths? This way, we would also test $PATH implicitly.

Fixed in commit e998859.

  • commit:f8b116b3aba199d0b4a00ffa8808bfaad24995ff adds a :mac attribute reader, which doesn't seem to be used; buggy cherry-pick?

Possibly, yeah. I had an idea for future use, but if so we'll re-introduce them. Removed in commit 687b1b0. Their current names didn't fit in that class anyway.

  • Back when we were on test/firewall-check-tag, to my request you introduced commit:894f78216861bb79a1ea36f1b5e5ebf2bbb2ac1d, and now it's been left aside.

I've reworked this in another branch (Tor bridges, take 2) that I haven't published yet. Will you take my word that this will be fixed there?

  • Same for commit:942a030?

It's already in commit 366e7a2.

  • My suggestion sent on 2013-08-19 in the "Please review'n'merge test/firewall-check-tag [was: Please review'n'merge test/reorg]" was apparently ignored.

As far as I could tell, none of them were relevant still.

  • Any reason not to call the destroy method in VMNet.clean_up instead of duplicating its code entirely?

destroy uses @net, which hasn't been set before the first call to clean_up (that's how it can be used to clean up instances remaining from previous test suite runs). I suppose destroy could call clean_up instead, which is funny since I just marked it private. OTOH, destroy is perhaps a poor choice of name since it easily can be confused with the destroy action in libvirt nomenclature. Perhaps destroy_and_undefine will make it self-documented (see below for your other concern)?

Hmm. Perhaps the most sensible thing is to remove clean_up and only have destroy/destroy_and_undefine (as a public method) which does what clean_up did (i.e. lookup the domain by name)? What do you think?

  • In the "Please review'n'merge test/reorg" thread I've asked for documentation for the clean_up and destroy methods.
  • In commit:bdd57a05da7e30c33d8f0d47efde3d809b89b18b you had done some changes to follow-up on my review, and this seems to have been lost along the way.

Fixed for clean_up in commit e7dca50. As for destroy, would renaming it to destroy_and_undefine as suggested above be good enough? Otherwise, what is it that you want to be documented?

  • On 2013-08-19 in the "Please review'n'merge test/reorg" thread, I wrote "So please name it differently, else I might not be the only one to get confused :)" about the net variable. This hasn't been addressed apparently.

Sorry, now fixed in commit e9b7b1a.

#25 Updated by anonym almost 5 years ago

intrigeri wrote:

devel branch should be merged into this one, and tests it has added must be updated to use the @check_tor_leaks tag instead of the "all Internet traffic has only flowed through Tor" step.

Done in commit 78721c9.

#26 Updated by intrigeri almost 5 years ago

Impressive work!

(Snipping everything you've fixed since then. Thanks.)

anonym wrote:

intrigeri wrote:

  • I'm not sure what's the value of "the firewall's NAT rules only redirect traffic for Tor's TransPort and DNSPort" without checking that the redirection to TransPort is limited to .onion mapped addresses, since we don't want transparent proxying in general. It's a bit confusing to read that step definition IMO. Not sure how to improve this without too much work, though.

Ah, checking the destination was indeed forgotten. What about commit 10327cc?

Good!

The step name is perhaps not the best still. What I wanted to do is the same as the (vague) manual test of looking at the output of iptables -t nat -L -n -v, and IMHO this step now does that adequately. What do you think?

It's good enough IMO.

  • it feels weird to have "the firewall is configured to block all IPv6 traffic" checked in a feature that's called "Tor is configured properly". Maybe rename this feature torification.feature and adjust its title to match what we're actually testing?

Well, it was in the "Tor" section of the manual test suite. :) Maybe tor.feature should be split into tor_stream_isolation.feature (all the "... SocksPort" scenarios) and tor_enforcement.feature (the remaining scenarios)? Quite possibly it would make sense to merge firewall_leaks.feature into tor_enforcement.feature since they're just anti-tests.

I agree with this proposal, please go ahead. And then, perhaps torified_misc.feature that I've introduced can be merged into tor_enforcement.feature?

  • I think that the title of the "untorified network connections to (\S+) fails" step should make it clear that it's about HTTP only.

Sure, but I guess what I want the step to actually do is "untorified TCP connections ... fail" since it's not about HTTP particularly, so I might as well use netcat or whatever. So...

OK. HTTP was a special case when we had Polipo, but now it's not anymore, and you made the test use netcat anyway, so I'm happy.

  • Instead of testing "untorified network connections to 1.2.3.4", how about testing connections to an IP that has a web server listening on it?

If so I guess we'd have to verify that a web server actually is listening on it, right? While that's doable with a separate torified curl call it seems clunky.

What if we check that the firewall logged the dropped connection instead? OTOH that would be harder when a domain name is given instead. On second thought, though, the usage of a domain name vs an IP address seems completely orthogonal to what I want to test. I think this approach is better, so I've completely reworked this test in commit 5e6acef, and as a bonus we now also test that UDP (commit 30ae3bf) and ICMP (commit a32617a) packets are dropped.

Awesome!

  • Back when we were on test/firewall-check-tag, to my request you introduced commit:894f78216861bb79a1ea36f1b5e5ebf2bbb2ac1d, and now it's been left aside.

I've reworked this in another branch (Tor bridges, take 2) that I haven't published yet. Will you take my word that this will be fixed there?

Fair enough.

  • Same for commit:942a030?

It's already in commit 366e7a2.

OK. I missed it since some bits are still not there, but perhaps there's a good reason for it, so I'll point them out and you can explain:

  • the VMNet.{ip,mac,bridge_name} accessors have been left aside, and bridge_name is still an attribute of the VMNet class, where we copy data on update, instead of being delegated to the libvirt network object with an accessor; IMO that's worse separation of concerns than what commit:942a030 implemented
  • the VM class still has :ip and :mac listed in attr_reader. Are they useful there?
  • My suggestion sent on 2013-08-19 in the "Please review'n'merge test/firewall-check-tag [was: Please review'n'merge test/reorg]" was apparently ignored.

As far as I could tell, none of them were relevant still.

You're right.

OTOH, destroy is perhaps a poor choice of name since it easily can be confused with the destroy action in libvirt nomenclature. Perhaps destroy_and_undefine will make it self-documented (see below for your other concern)?

Good idea.

Hmm. Perhaps the most sensible thing is to remove clean_up and only have destroy/destroy_and_undefine (as a public method) which does what clean_up did (i.e. lookup the domain by name)? What do you think?

Good idea.

As for destroy, would renaming it to destroy_and_undefine as suggested above be good enough?

Yes.

Then I took a look at your recent commits, and I see:

  assert(!firewall_has_dropped_packet_to?(proto, host, port),
         "A #{proto} packet to #{host}:#{port} has already been dropped by " \
         "the firewall")

=> the error message won't work for ICMP, since port is nil in that case, right?
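One way the message could accommodate a nil port is to format the destination without the ":port" suffix when no port is given. A sketch only; the actual fix is whatever landed on the branch, and these helper names are invented:

```ruby
# Format "host:port", or just "host" when there is no port (e.g. for ICMP).
def pretty_dest(host, port = nil)
  port.nil? ? host : "#{host}:#{port}"
end

# Build the assertion message so it reads sensibly for both TCP/UDP and ICMP.
def already_dropped_message(proto, host, port = nil)
  "A #{proto} packet to #{pretty_dest(host, port)} has already been " \
  "dropped by the firewall"
end
```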

So, it seems that in the end, only a little bit of refactoring is left to do. Yay!

#27 Updated by intrigeri almost 5 years ago

  • Assignee changed from intrigeri to anonym
  • % Done changed from 50 to 70
  • QA Check changed from Info Needed to Dev Needed

#28 Updated by anonym almost 5 years ago

  • Assignee changed from anonym to intrigeri
  • QA Check changed from Dev Needed to Ready for QA

intrigeri wrote:

Impressive work!

Thanks, and thanks for the very thorough reviews!

  • it feels weird to have "the firewall is configured to block all IPv6 traffic" checked in a feature that's called "Tor is configured properly". Maybe rename this feature torification.feature and adjust its title to match what we're actually testing?

Well, it was in the "Tor" section of the manual test suite. :) Maybe tor.feature should be split into tor_stream_isolation.feature (all the "... SocksPort" scenarios") and tor_enforcement.feature (remaining scenarios)? Quite possibly it would make sense to merge the firewall_leaks.feature into tor_enforcement.feature since they're just anti-tests.

I agree with this proposal, please go ahead.

Done in commits 83b5b29 and e19bce8.

And then, perhaps torified_misc.feature that I've introduced can be merged into tor_enforcement.feature?

IMHO no. The tor_enforcement.feature should check that we cannot escape Tor, while tests that check that certain applications work when using Tor should go in some torified_*.feature, like torified_{browsing,gnupg}.feature. Does that make sense? Otherwise, why not merge the latter two into tor_enforcement.feature as well?

  • Same for commit:942a030?

It's already in commit 366e7a2.

OK. I missed it since some bits are still not there, but perhaps there's a good reason for it, so I'll point them out and you can explain:

  • the VMNet.{ip,mac,bridge_name} accessors have been left aside, and bridge_name is still an attribute of the VMNet class, where we copy data on update, instead of being delegated to the libvirt network object with an accessor; IMO that's worse separation of concerns than what commit:942a030 implemented

The VMNet.{ip,mac} accessors were already removed in commit 687b1b0. As for VMNet.bridge_name, now fixed in commit 37533f6, I hope.

  • the VM class still has :ip and :mac listed in attr_reader. Are they useful there?

Oops, those were forgotten. Removed in commit a510880.

OTOH, destroy is perhaps a poor choice of name since it easily can be confused with the destroy action in libvirt nomenclature. Perhaps destroy_and_undefine will make it self-documented (see below for your other concern)?

Good idea.

Hmm. Perhaps the most sensible thing is to remove clean_up and only have destroy/destroy_and_undefine (as a public method) which does what clean_up did (i.e. lookup the domain by name)? What do you think?

Good idea.

As for destroy, would renaming it to destroy_and_undefine as suggested above be good enough?

Yes.

All of these done in commit db1993b.

Then I took a look at your recent commits, and I see:

[...]

=> the error message won't work for ICMP, since port is nil in that case, right?

You're right, good catch! Fixed in commit 1498d14.

#29 Updated by intrigeri almost 5 years ago

  • Assignee changed from intrigeri to anonym
  • % Done changed from 70 to 90
  • QA Check changed from Ready for QA to Info Needed

Code review passes. Full test suite passes except expected failures and:

  @keep_volumes
  Scenario: Upgrading an old Tails USB installation from an ISO image, running on the old version # features/usb_install.feature:223
    Given a computer                                                                              # features/step_definitions/common_steps.rb:64
    And I clone USB drive "old" to a new USB drive "to_upgrade"                                   # features/step_definitions/usb.rb:51
    And I setup a filesystem share containing the Tails ISO                                       # features/step_definitions/usb.rb:139
    When I start Tails from USB drive "old" with network unplugged and I login                    # features/step_definitions/common_steps.rb:165
      Remote shell seems to be down (RuntimeError)
      ./features/support/helpers/exec_helper.rb:19:in `rescue in wait_until_remote_shell_is_up'
      ./features/support/helpers/exec_helper.rb:14:in `wait_until_remote_shell_is_up'
      ./features/support/helpers/vm_helper.rb:357:in `wait_until_remote_shell_is_up'
      ./features/step_definitions/common_steps.rb:242:in `/^the computer (re)?boots Tails$/'
      ./features/step_definitions/common_steps.rb:175:in `/^I start Tails from (.+?) drive "(.+?)"(| with network unplugged) and I login(| with(| read-only) persistence password "([^"]+)")$/'
      features/usb_install.feature:227:in `When I start Tails from USB drive "old" with network unplugged and I login'

And on the second try (only running usb_install.feature this time), Scenario: Installing an old version of Tails to a pristine USB drive freezes after completing the installation.

I've not run the test suite on that box for a while, so I don't remember if such problems usually happen there. I didn't check if that was a regression either. anonym, any chance this is related to some of the VM class refactoring you've done on that branch? If not, I'll just go ahead and merge.

#30 Updated by intrigeri almost 5 years ago

intrigeri wrote:

And on the second try (only running usb_install.feature this time), Scenario: Installing an old version of Tails to a pristine USB drive freezes after completing the installation.

I'll take this back, it seemed to be frozen but in the end it completed just fine.

#31 Updated by anonym almost 5 years ago

  • Assignee changed from anonym to intrigeri

intrigeri wrote:

Code review passes. Full test suite passes except expected failures and:

[...]
Remote shell seems to be down (RuntimeError)
[...]

I've not run the test suite on that box for a while, so I don't remember if such problems usually happen there. I didn't check if that was a regression either. anonym, any chance this is related to some of the VM class refactoring you've done on that branch? If not, I'll just go ahead and merge.

I doubt the VM class refactoring affects this, since it should only change things between scenarios and features, but this happens deep inside a scenario, at the 'the computer boots Tails' step.

However, perhaps my changes to the remote shell server have something to do with it?

So the exact situation where we got this failure is (in the 'the computer boots Tails' step): once we've seen TailsGreeter.png, we run VMCommand.wait_until_remote_shell_is_up, which sends a true (but it could really be anything, even an invalid command) to the remote shell and simply waits for an answer for up to 30 seconds. I used to see such errors often years ago (that's #5491), but I haven't seen it for about a year.
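For illustration, the polling described above amounts to something like the following Ruby sketch (the method and error names mirror the backtraces quoted earlier, but the vm.execute call and the overall shape are assumptions, not the actual test suite code):

```ruby
require 'timeout'

# Hypothetical error type; the real suite raises a RuntimeError with the
# message "Remote shell seems to be down".
class RemoteShellError < StandardError; end

# Send any command ("true" here, but an invalid command would do just as
# well) to the remote shell, and wait up to `timeout` seconds for a reply.
def wait_until_remote_shell_is_up(vm, timeout = 30)
  Timeout.timeout(timeout) do
    vm.execute('true')
  end
rescue Timeout::Error
  raise RemoteShellError, 'Remote shell seems to be down'
end
```

The point is that any reply within the window counts as "up"; only a full 30-second silence produces the error seen in the backtraces above.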

However, it seems suspicious that you and Alan (#5491#note-6) suddenly see this issue again after I've modified the remote shell server, per commit 1a55905. The only change that should be relevant is that each remote shell call now results in an extra shell command being run server-side before the requested command, namely 'su -c env ' + user, which is used to create an appropriate environment. That extra call also happens in VMCommand.wait_until_remote_shell_is_up, so if there's a problem there, it's affected too. But what could it be? Could 'su -c env ' + user somehow fail because we're too early in the boot process, i.e. something isn't guaranteed to be set up by our boot process at this time, i.e. when Tails Greeter has just started? Any ideas?

I'm at a loss here. If this branch makes #5491 worse, merging it is a bad idea. I suggest you re-run the full test suite multiple times and see if you get this error again.

#32 Updated by intrigeri almost 5 years ago

If this branch makes #5491 worse, merging it is a bad idea. I suggest you re-run the full test suite multiple times and see if you get this error again.

Sure, it's running.

#33 Updated by intrigeri almost 5 years ago

Just seen it again:

  @keep_volumes
  Scenario: Booting Tails from a USB drive upgraded from USB with persistence enabled                                 # features/usb_install.feature:213
    Given a computer                                                                                                  # features/step_definitions/common_steps.rb:64
    And I start Tails from USB drive "to_upgrade" with network unplugged and I login with persistence password "asdf" # features/step_definitions/common_steps.rb:165
      Remote shell seems to be down (RuntimeError)
      ./features/support/helpers/exec_helper.rb:19:in `rescue in wait_until_remote_shell_is_up'
      ./features/support/helpers/exec_helper.rb:14:in `wait_until_remote_shell_is_up'
      ./features/support/helpers/vm_helper.rb:357:in `wait_until_remote_shell_is_up'
      ./features/step_definitions/common_steps.rb:242:in `/^the computer (re)?boots Tails$/'
      ./features/step_definitions/common_steps.rb:175:in `/^I start Tails from (.+?) drive "(.+?)"(| with network unplugged) and I login(| with(| read-only) persistence password "([^"]+)")$/'
      features/usb_install.feature:215:in `And I start Tails from USB drive "to_upgrade" with network unplugged and I login with persistence password "asdf"'

(bare metal, --view, no --capture)

#34 Updated by anonym almost 5 years ago

  • Related to deleted (Feature #6305: Write tests for Tor bridges)

#35 Updated by anonym almost 5 years ago

#36 Updated by anonym almost 5 years ago

  • QA Check changed from Info Needed to Ready for QA

We've discussed this off list, and hopefully commit dc77fae fixed intrigeri's issue.

#37 Updated by intrigeri almost 5 years ago

  • QA Check changed from Ready for QA to Info Needed

anonym wrote:

We've discussed this off list, and hopefully commit dc77fae fixed intrigeri's issue.

Unfortunately not: I've seen Remote shell seems to be down (RuntimeError) again today while running the test suite from testing + test/7821-tor (at commit:dc77fae) merged into it. Re-running it with the --debug flag to see whether that commit sometimes helps or not. I had never seen this problem before, so it seems clear to me that this branch still makes the remote shell thing worse :(

#38 Updated by kytv almost 5 years ago

  • Blocks Feature #8702: Write regression test for Unsafe Browser update checks added

#39 Updated by intrigeri almost 5 years ago

intrigeri wrote:

Re-running it with the --debug flag to see whether that commit sometimes helps or not.

... and of course, the debug run didn't trigger this error :) Running it again.

#40 Updated by anonym almost 5 years ago

intrigeri wrote:

anonym wrote:

We've discussed this off list, and hopefully commit dc77fae fixed intrigeri's issue.

Unfortunately not: I've seen Remote shell seems to be down (RuntimeError) again today while running the test suite from testing + test/7821-tor (at commit:dc77fae) merged into it.

Argh! OK. Assuming it's the change I made to the remote shell server (commit 1a55905, which is my best bet here) that somehow introduced this regression, it's actually not at all a critical part of this branch. What if you revert all these commits:

dc77fae Retry sending the command in wait_until_remote_shell_is_up.
e998859 Don't provide absolute paths to commands.
12ca94c Drop explicit assignment of LANG.
1a55905 Create the environment for remote shell calls more realisitcally.

and build a new image from the branch (actually, 1.3~rc1 should do), and then run the test suite from that branch against it?

#41 Updated by intrigeri almost 5 years ago

  • Assignee changed from intrigeri to anonym

anonym wrote:

Argh! OK. Assuming it's the change I made to the remote shell server (commit 1a55905, which is my best bet here) that somehow introduced this regression, it's actually not at all a critical part of this branch. What if you revert all these commits
[...]
and build a new image from the branch (actually, 1.3~rc1 should do), and then run the test suite from that branch against it?

I can do that, but my 2nd --debug run of usb_install.feature passed again, so I can't efficiently evaluate how much reverting these commits helps (I'd have to run usb_install.feature many times, which would take days). Instead, I'd rather use the specific test feature you gave me a few days ago, first to evaluate how often the bug arises with these commits, and then again with them reverted. Unfortunately the pastebin has expired => could you please send me this feature again?

#42 Updated by intrigeri almost 5 years ago

  • Assignee changed from anonym to intrigeri

OK, I wrote a test feature that installs Tails on a virtual USB drive and boots from it 20 times (since I've mostly seen the bug when booting off emulated USB), and then I could reproduce it again. I see calling as root: echo 'hello?' in the debug log, and this call doesn't return anything I can see. I've taken control of the TailsToaster VM and verified that the autotest_remote_shell.py process is running. Sadly, the way we're starting it doesn't log anything, and since we're running SysV init, I can't check whether it was started before GDM; anything that autotest_remote_shell.py may print is probably entirely lost.

What's clear, though, is that /etc/rc2.d/S05rc.local is meant to start after /etc/rc2.d/S03gdm3, so I'm not surprised that we have a race condition here: sometimes the remote shell isn't up yet when we see our Greeter welcome screen.

I'm now going to try and revert the commits that slightly change the remote shell behaviour (and might be enough to introduce a regression). Reverting these commits might be good enough for 1.3, but that's not a good enough mid-term solution IMO (in some settings, it seems obvious that we will see this bug again): what we really need is to make sure we start the remote shell soon enough. I think that should be addressed in 1.3.1, and probably needs a dedicated ticket. That should be relatively easy. We should at the very least:

  • migrate the remote shell startup to a dedicated initscript, that acts as a no-op unless autotest_never_use_this_option is passed on the kernel command-line
  • set X-Start-Before: $x-display-manager gdm gdm3 in the remote shell initscript

And also, ideally:

  • in autotest_remote_shell.py, touch some file after port.open() has succeeded
  • patch the gdm3 initscript to wait for that file to appear, when autotest_never_use_this_option is passed on the kernel command-line, before it actually starts GDM.

(That last part is ugly but can beneficially be replaced by systemd's daemon activation feedback protocol, in Tails/Jessie.)
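A dedicated initscript along the lines of the first two bullet points above might look roughly like this (a hypothetical /etc/init.d/autotest-remote-shell fragment; the script name, path and runlevel defaults are assumptions, not the eventual implementation):

```shell
#!/bin/sh
### BEGIN INIT INFO
# Provides:          autotest-remote-shell
# Required-Start:    $local_fs
# Required-Stop:
# X-Start-Before:    $x-display-manager gdm gdm3
# Default-Start:     2 3 4 5
# Default-Stop:
# Short-Description: Remote shell server for the automated test suite
### END INIT INFO

# Act as a no-op unless the magic option is on the kernel command-line.
grep -qw autotest_never_use_this_option /proc/cmdline || exit 0

case "$1" in
  start)
    # Hypothetical path to the server script; started in the background.
    python /usr/local/lib/autotest_remote_shell.py &
    ;;
esac
```

The X-Start-Before header is what makes insserv order this script ahead of the display manager, which is exactly the ordering that /etc/rc2.d/S05rc.local currently gets wrong.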

#43 Updated by intrigeri almost 5 years ago

intrigeri wrote:

I'm now going to try and revert the commits that slightly change the remote shell behaviour (and might be enough to introduce a regression).

Done locally (had to resolve a bunch of conflicts, oh well) and now running my test feature.

#44 Updated by intrigeri almost 5 years ago

intrigeri wrote:

intrigeri wrote:

I'm now going to try and revert the commits that slightly change the remote shell behaviour (and might be enough to introduce a regression).

Done locally (had to resolve a bunch of conflicts, oh well) and now running my test feature.

I've seen the bug again with these commits reverted, so now I'm confused => running the same test feature on 1.3~rc1 from testing branch. This should be a good starting point to pinpoint if there's a regression at all, and if yes, if it is in the code and/or in the test suite.

#45 Updated by anonym almost 5 years ago

intrigeri wrote:

I've seen the bug again with these commits reverted, so now I'm confused => running the same test feature on 1.3~rc1 from testing branch. This should be a good starting point to pinpoint if there's a regression at all, and if yes, if it is in the code and/or in the test suite.

That the ISO under testing does not contain commit 1a55905 ("Create the environment for remote shell calls more realisitcally" [sic!]) is the important thing, at least that's my assertion. It's not clear to me whether that was the case when you say you saw the bug "again".

#46 Updated by intrigeri almost 5 years ago

That the ISO under testing does not contain commit 1a55905
("Create the environment for remote shell calls more realisitcally"
[sic!]) is the important thing

Indeed, I was confused. Running a bunch of new tests to pinpoint the problem (if any).

#47 Updated by intrigeri almost 5 years ago

intrigeri wrote:

running the same test feature on 1.3~rc1 from testing branch

The following tests are all run on the same system. All ISO images are compressed with the default (efficient) xz settings, to eliminate a possible cause of variations.

  • 1.3~rc1 ISO, test suite run from testing branch: passes 120/120 runs
  • testing + commit:1a55905e5d51d0f578b5ab3732a8ac0453ac6363 ("Create the environment for remote shell calls more realisitcally.") cherry-picked ISO, test suite run from testing branch: out of 100 runs, 2 "Remote shell seems to be down" failures

#48 Updated by anonym almost 5 years ago

  • Related to Bug #8941: The remote shell server is racing against Tails Greeter added

#49 Updated by intrigeri almost 5 years ago

  • QA Check changed from Info Needed to Ready for QA

#50 Updated by Tails almost 5 years ago

  • Status changed from In Progress to 11
  • % Done changed from 90 to 100

Applied in changeset commit:40d4c1bf9fc5cc2f251e0f546afd82f79c91169f.

#51 Updated by intrigeri almost 5 years ago

  • Assignee deleted (intrigeri)
  • % Done changed from 100 to 90
  • QA Check changed from Ready for QA to Pass

#52 Updated by BitingBird almost 5 years ago

  • Status changed from 11 to Resolved
