Bug #16792

Upgrade our Chutney fork

Added by intrigeri 5 months ago. Updated 8 days ago.

Status:
Needs Validation
Priority:
Normal
Assignee:
-
Category:
Test suite
Target version:
Start date:
Due date:
% Done:

20%

Feature Branch:
feature/16792-only-update-chutney+force-all-tests
Type of work:
Code
Blueprint:
Starter:
Affected tool:

Description

We haven't updated our Chutney fork since 2016. Current diffstat is: 139 files changed, 3401 insertions(+), 712 deletions(-).
Using configuration and code that was developed for a somewhat antique version of tor will likely cause trouble at some point.

I've noticed this because our suite sets deprecated options in torrc for Chutney usage:

Jun 10 19:38:59 amnesia tor[8575]: Jun 10 19:38:59.126 [warn] The configuration option 'TestingBridgeDownloadSchedule' is deprecated; use 'TestingBridgeDownloadInitialDelay' instead.
Jun 10 19:38:59 amnesia tor[8575]: Jun 10 19:38:59.126 [warn] The configuration option 'TestingClientConsensusDownloadSchedule' is deprecated; use 'TestingClientConsensusDownloadInitialDelay' instead.
Jun 10 19:38:59 amnesia tor[8575]: Jun 10 19:38:59.126 [warn] The configuration option 'TestingClientDownloadSchedule' is deprecated; use 'TestingClientDownloadInitialDelay' instead.

Upstream Chutney has removed all *DownloadSchedule torrc options from its templates, so in this case it's no big deal.
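The renames tor warns about follow a mechanical pattern. As an illustration only (the mapping is copied from tor's deprecation warnings above; the `modernize_torrc_line` helper is hypothetical, not part of our suite):

```ruby
# Deprecated *Schedule torrc options and their replacements, as stated
# by tor's deprecation warnings quoted above.
TORRC_RENAMES = {
  'TestingBridgeDownloadSchedule'          => 'TestingBridgeDownloadInitialDelay',
  'TestingClientConsensusDownloadSchedule' => 'TestingClientConsensusDownloadInitialDelay',
  'TestingClientDownloadSchedule'          => 'TestingClientDownloadInitialDelay',
}.freeze

# Hypothetical helper: rewrite a single torrc line if it uses one of the
# deprecated option names; leave any other line untouched.
def modernize_torrc_line(line)
  option = line.strip.split(/\s+/, 2).first
  return line unless TORRC_RENAMES.key?(option)
  line.sub(option, TORRC_RENAMES[option])
end
```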


Related issues

Related to Tails - Bug #17163: "SSH is using the default SocksPort" test suite scenario is fragile Resolved
Blocks Tails - Feature #16209: Core work: Foundations Team Confirmed
Blocks Tails - Feature #15211: Reduce our Chutney network In Progress 01/22/2018
Blocks Tails - Bug #16471: Drop time synchronization hacks that tor 0.3.5 and 0.4.x made obsolete In Progress 02/17/2019
Blocks Tails - Bug #11589: Time syncing over bridge is fragile Confirmed 07/22/2016

Associated revisions

Revision 3ef18965 (diff)
Added by anonym 5 months ago

Chutney: update to current upstream master (Will-fix: #16792).

Revision c09577e4 (diff)
Added by anonym 5 months ago

Chutney: use normal consensus cycle interval (Will-fix: #16792).

Revision 068dcb91 (diff)
Added by anonym 3 months ago

Revert "Chutney: use normal consensus cycle interval (Will-fix: #16792)."

This reverts commit c09577e4fd9ee8f0cfe12aea3dc5b072e556bcda.

Revision b5439e0c (diff)
Added by anonym 3 months ago

Chutney: use normal consensus cycle interval (Will-fix: #16792).

Revision b4109f93 (diff)
Added by anonym 2 months ago

Chutney: update to current upstream master (Will-fix: #16792).

Revision 00da865a (diff)
Added by anonym 2 months ago

Chutney: use normal consensus cycle interval (Will-fix: #16792).

Revision 3addffea (diff)
Added by anonym 2 months ago

Chutney: update to current upstream master (Will-fix: #16792).

Revision ae6e0753 (diff)
Added by anonym 2 months ago

Chutney: use normal consensus cycle interval (Will-fix: #16792).

Revision 8bb166dd (diff)
Added by intrigeri 2 months ago

Test suite: fix Chutney exit relays tagging (refs: #16792)

In all of Chutney's networks/*, exit relays are tagged "r" (AFAICT only the first
letter of the tag is taken into account, so "relay" should be equivalent).
No tag="e" or tag="exit" can be found in Chutney's Git history, so I believe that
our usage of tag="exit" is incorrect. I don't know what the practical impact
of using an unsupported tag is, though.

Revision 73e5ae37 (diff)
Added by intrigeri 2 months ago

Test suite: fix Chutney bridge authorities tagging (refs: #16792)

In all Chutney's networks/*, bridge authorities are tagged "ba".
Let's do the same to avoid having to wonder whether using anything
else can cause trouble.

Revision b5ba531c (diff)
Added by intrigeri 2 months ago

Test suite: set client=1 for the Chutney clients (refs: #16792)

This was added to Chutney's networks/*. Let's do the same to avoid having to
wonder whether not doing so can cause trouble.

Revision 79f86a31 (diff)
Added by intrigeri 2 months ago

WIP: Test suite: save Tor and Chutney logs on any scenario failure (refs: #16792)

This commit should not be merged into our base branches.

Goal: investigate the "nc: connection failed, SOCKSv5 error: General SOCKS
server failure" problem we see with SSH since we updated Chutney.

Revision 70a1712c (diff)
Added by intrigeri about 2 months ago

Test suite: wait for all Chutney nodes to have bootstrapped before assuming the simulated Tor network is ready.

refs: #16792

Revision 05ce3ed4 (diff)
Added by intrigeri about 2 months ago

Revert "WIP: Test suite: save Tor and Chutney logs on any scenario failure (refs: #16792)"

This reverts commit 79f86a310a3568a5ef2f4646b216b1be1fab7144.

I've seen failures to archive build artifacts on Jenkins since
I made this change, which makes it harder to investigate
issues, even though the change was meant to make that easier.

Revision 4e326584 (diff)
Added by intrigeri about 2 months ago

Revert "WIP: Test suite: save Tor and Chutney logs on any scenario failure (refs: #16792)"

This reverts commit 79f86a310a3568a5ef2f4646b216b1be1fab7144, which
causes failures to save artifacts:

Then I see that SSH is properly stream isolated
  1. features/step_definitions/tor.rb:289
    cannot handle socket (RuntimeError)
    /usr/lib/ruby/2.3.0/fileutils.rb:1379:in `copy'
    /usr/lib/ruby/2.3.0/fileutils.rb:472:in `block in copy_entry'
    /usr/lib/ruby/2.3.0/fileutils.rb:1498:in `wrap_traverse'
    /usr/lib/ruby/2.3.0/fileutils.rb:1501:in `block in wrap_traverse'
    /usr/lib/ruby/2.3.0/fileutils.rb:1500:in `each'
    /usr/lib/ruby/2.3.0/fileutils.rb:1500:in `wrap_traverse'
    /usr/lib/ruby/2.3.0/fileutils.rb:1501:in `block in wrap_traverse'
    /usr/lib/ruby/2.3.0/fileutils.rb:1500:in `each'
    /usr/lib/ruby/2.3.0/fileutils.rb:1500:in `wrap_traverse'
    /usr/lib/ruby/2.3.0/fileutils.rb:1501:in `block in wrap_traverse'
    /usr/lib/ruby/2.3.0/fileutils.rb:1500:in `each'
    /usr/lib/ruby/2.3.0/fileutils.rb:1500:in `wrap_traverse'
    /usr/lib/ruby/2.3.0/fileutils.rb:469:in `After'
    Scenario failed at time 03:39:50

Revision 4b35d25b (diff)
Added by intrigeri about 2 months ago

Test suite: only save Chutney logs on failure (refs: #16792)

commit 79f86a310a3568a5ef2f4646b216b1be1fab7144 causes failures to save
artifacts because log.tor does not always exist.

Revision 6401cada (diff)
Added by intrigeri about 2 months ago

Test suite: don't try to save tor control sockets as artifacts (refs: #16792).

This fixes the exception that happens whenever we try to copy all Chutney
data to the artifacts directory:

cannot handle socket (RuntimeError)
/usr/lib/ruby/2.3.0/fileutils.rb:1379:in `copy'
/usr/lib/ruby/2.3.0/fileutils.rb:472:in `block in copy_entry'
/usr/lib/ruby/2.3.0/fileutils.rb:1498:in `wrap_traverse'
/usr/lib/ruby/2.3.0/fileutils.rb:1501:in `block in wrap_traverse'
/usr/lib/ruby/2.3.0/fileutils.rb:1500:in `each'
/usr/lib/ruby/2.3.0/fileutils.rb:1500:in `wrap_traverse'
/usr/lib/ruby/2.3.0/fileutils.rb:1501:in `block in wrap_traverse'
/usr/lib/ruby/2.3.0/fileutils.rb:1500:in `each'
/usr/lib/ruby/2.3.0/fileutils.rb:1500:in `wrap_traverse'
/usr/lib/ruby/2.3.0/fileutils.rb:1501:in `block in wrap_traverse'
/usr/lib/ruby/2.3.0/fileutils.rb:1500:in `each'
/usr/lib/ruby/2.3.0/fileutils.rb:1500:in `wrap_traverse'
/usr/lib/ruby/2.3.0/fileutils.rb:469:in `After'
Scenario failed at time 04:31:14
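The fix boils down to walking the tree ourselves and skipping anything that is a socket, instead of letting FileUtils recurse blindly. A minimal sketch (the `copy_without_sockets` helper is hypothetical; the real step definitions may differ):

```ruby
require 'fileutils'
require 'find'

# Hypothetical sketch: copy a Chutney data directory into the artifacts
# directory while skipping UNIX sockets such as tor control sockets,
# which FileUtils.copy_entry chokes on ("cannot handle socket").
def copy_without_sockets(src, dst)
  Find.find(src) do |path|
    rel    = path.sub(/\A#{Regexp.escape(src)}\/?/, '')
    target = File.join(dst, rel)
    if File.socket?(path)
      next                          # e.g. the tor control sockets
    elsif File.directory?(path)
      FileUtils.mkdir_p(target)
    else
      FileUtils.cp(path, target)
    end
  end
end
```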

Revision 350ccbcd (diff)
Added by intrigeri about 1 month ago

Test suite: give the Chutney network more time to bootstrap (refs: #16792)

Even with a 180s timeout, in the last 7 test suite runs on Jenkins,
Chutney failed to fully bootstrap its network twice.

Revision de8da55a (diff)
Added by intrigeri about 1 month ago

Chutney: update to current upstream master (refs: #16792).

Revision 16ac509f (diff)
Added by intrigeri 21 days ago

Test suite: give the Chutney network more time to bootstrap (refs: #16792)

Even with a 240s timeout, in the last 16 test suite runs on Jenkins,
Chutney failed to fully bootstrap its network 3 times.

Revision 33a827cc (diff)
Added by intrigeri 8 days ago

Revert "WIP: Test suite: save Tor and Chutney logs on any scenario failure (refs: #16792)"

This reverts commit 79f86a310a3568a5ef2f4646b216b1be1fab7144.

History

#1 Updated by intrigeri 5 months ago

#2 Updated by intrigeri 5 months ago

  • Blocks Bug #16471: Drop time synchronization hacks that tor 0.3.5 and 0.4.x made obsolete added

#3 Updated by anonym 5 months ago

  • Assignee set to anonym

#4 Updated by anonym 5 months ago

  • Subject changed from Upgrade our Chutney fork to Upgrade our Chutney fork and make configuratoin more similar to the real Tor network

#5 Updated by hefee 5 months ago

Here is another patch for Chutney to make it more like the current Tor network, so that the tests that skew the time can pass:
https://salsa.debian.org/hefee/chutney/merge_requests/1

#6 Updated by anonym 5 months ago

  • Subject changed from Upgrade our Chutney fork and make configuratoin more similar to the real Tor network to Upgrade our Chutney fork and make configuration more similar to the real Tor network

#7 Updated by anonym 5 months ago

  • Status changed from Confirmed to In Progress

#8 Updated by anonym 5 months ago

  • % Done changed from 0 to 10
  • Feature Branch set to feature/16792-update-chutney+force-all-tests

The current branch updates Chutney, and I've at least verified that tor bootstraps. I saw some initial failures with htpdate (restarting the services fixed it); I'm not sure why that is, or whether it was a transient thing.

Anyway, let's see what a full run looks like.

#9 Updated by anonym 5 months ago

  • % Done changed from 10 to 20

The full test runs look like any other +force-all-tests run, so I see no apparent regressions, yay!

Next step, make the configuration more similar to the real Tor network.

#10 Updated by anonym 5 months ago

We used to get consensuses looking like this:

valid-after 2019-06-26 10:12:00
fresh-until 2019-06-26 10:15:00
valid-until 2019-06-26 10:18:00
voting-delay 4 4

i.e. really short intervals/delays. In the real Tor network it looks like this:

valid-after 2019-06-26 09:00:00
fresh-until 2019-06-26 10:00:00
valid-until 2019-06-26 12:00:00
voting-delay 300 300

I've adjusted the *V3Auth* settings so they now have the same interval lengths etc. as the real Tor network:

valid-after 2019-06-26 11:31:00
fresh-until 2019-06-26 12:31:00
valid-until 2019-06-26 14:31:00
voting-delay 300 300

That should be all we need for #16471, but there are other discrepancies too (e.g. the real Tor network uses consensus-method 28 but we get consensus-method 25 for some reason; this is just an example -- I don't think we care about this one).
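The relationship between these timestamps can be sketched as follows, assuming (as on the real Tor network consensus quoted above) a 1-hour voting interval and a consensus that stays valid for 3 intervals; `consensus_window` is an illustrative helper, not tor code:

```ruby
# Illustrative sketch: given valid-after and the voting interval, derive
# the fresh-until and valid-until timestamps seen in a consensus.
# On the real Tor network the interval is 1h and the consensus stays
# valid for 3 intervals.
def consensus_window(valid_after, interval_secs)
  {
    'valid-after' => valid_after,
    'fresh-until' => valid_after + interval_secs,      # next consensus is due
    'valid-until' => valid_after + 3 * interval_secs,  # clients distrust it after this
  }
end

w = consensus_window(Time.utc(2019, 6, 26, 9, 0, 0), 3600)
# Matches the real-network consensus quoted above:
# fresh-until 10:00:00, valid-until 12:00:00.
```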

#11 Updated by intrigeri 3 months ago

  • Target version set to Tails_3.16

anonym, could finishing this be your next focus point once you're done with 4.0~beta1? Or would you prefer someone else to finish this work?

In any case, I've refreshed the branch so that it builds on Jenkins again.

Getting this done would allow us to make progress on #16471 :)

Setting target version to the same as #16471, for consistency. But postponing both would be fine.

#12 Updated by intrigeri 3 months ago

#13 Updated by anonym 3 months ago

intrigeri wrote:

anonym, could finishing this be your next focus point once you're done with 4.0~beta1?

Yes!

In any case, I've refreshed the branch so that it builds on Jenkins again.

Thanks! Sadly we lost all results except last week's, so now we have much less data to rely on. I guess there is nothing we can do about this? Or could they still exist on some backup that isn't too awkward (say, at most 15 minutes of work) to restore from?

After looking briefly at Jenkins runs 1-5 and comparing them with devel's results from the same time frame, I see mostly the same network-using branches failing, but not similar enough that it's a no-brainer to tell whether things are worse/the same/better. So I feel tempted to spam-test this branch a bit on Jenkins. I started 3 runs now, and hope to start a few more during the weekend (feel free to help me if you happen to see some Jenkins slaves idling!), and hopefully I'll have more convincing data on Monday. That leaves two weeks to get this and #16471 ready for Tails 3.16, which seems doable.

#14 Updated by intrigeri 3 months ago

Yo,

anonym wrote:

intrigeri wrote:

anonym, could finishing this be your next focus point once you're done with 4.0~beta1?

Yes!

Excellent news :)

Thanks! Sadly we lost all results except last week's, so now we have much less data to rely on. I guess there is nothing we can do about this? Or could they still exist on some backup that isn't too awkward (say, at most 15 minutes of work) to restore from?

Nope: we don't back up the output of Jenkins jobs (that would take huge amounts of disk, bandwidth, and CPU, and degrade the performance of all our services).

After looking briefly at Jenkins runs 1-5 and comparing them with devel's results from the same time frame, I see mostly the same network-using branches failing, but not similar enough that it's a no-brainer to tell whether things are worse/the same/better. So I feel tempted to spam-test this branch a bit on Jenkins. I started 3 runs now, and hope to start a few more during the weekend (feel free to help me if you happen to see some Jenkins slaves idling!), and hopefully I'll have more convincing data on Monday. That leaves two weeks to get this and #16471 ready for Tails 3.16, which seems doable.

Sounds like a great plan! Let's take it easy wrt. spamming Jenkins: these days it happens regularly that it has a large backlog (I've seen the feedback loop take 20+ hours a few times recently).

I'll run a few jobs on my local Jenkins too, if its Jenkins workers and I both have some spare time for it.

#15 Updated by intrigeri 3 months ago

comparing them with devel's results from the same time frame

Note that comparing with the base branch of this one (stable) instead would probably be a tiny bit more accurate :)

#16 Updated by intrigeri 3 months ago

I'll run a few jobs on my local Jenkins too, if its Jenkins workers and I both have some spare time for it.

We both had! So I should be able to report about 12+ full local runs tomorrow :)

#17 Updated by intrigeri 3 months ago

We both had! So I should be able to report about 12+ full local runs tomorrow :)

Here we go. I got 12 full runs on my local Jenkins.

Apart from the usual suspects (mostly #11592 these days, see my comment there), I see:

  • lots of TimeSyncingError, in the time sync and bridges scenarios, including in "Clock with host's time"
  • some TorBootstrapFailure, in the time sync scenarios, including in "Clock with host's time"
  • lots of "available upgrades have been checked" failures after "Tor is ready" succeeded, in the bridges scenarios
  • one "Tails clock is less than 5 minutes incorrect" failure in "Clock with host's time": clock was off by 435 seconds; interestingly, the clock was correct before htpdate "fixed" it, so I dunno what the heck is going on here
  • lots of GnuPG key fetching failures; that might be caused by instability of my Internet connection, but I have doubts, since I've not seen this fail as much while running tests for other branches recently; or it might be that pool.sks-keyservers.net was having trouble at that point, which is entirely possible
  • some weird failures for SSH and Git tests

All in all, that's many more failures than I usually see in the same environment (which tends to expose fewer robustness issues than our shared Jenkins).

Once you've analysed the runs on our shared Jenkins, if there are important problems that I'm seeing and you did not see, and if you think it would help, I could share a tarball of the debug logs, tor logs, and Journal dumps with you.

#18 Updated by anonym 3 months ago

intrigeri wrote:

  • lots of TimeSyncingError, in the time sync and bridges scenarios, including in "Clock with host's time"

Same.

  • some TorBootstrapFailure, in the time sync scenarios, including in "Clock with host's time"

I saw this once, for "Clock with host's time".

  • lots of "available upgrades have been checked" failures after "Tor is ready" succeeded, in the bridges scenarios

None of these. I find it odd that you see "lots" of these while I see none, but I suppose it could be your internet connection? IIRC these tests are especially sensitive to that?

  • one "Tails clock is less than 5 minutes incorrect" failure in "Clock with host's time": clock was off by 435 seconds; interestingly, the clock was correct before htpdate "fixed" it, so I dunno what the heck is going on here

Huh. Will keep an eye open for htpdate strangeness.

  • lots of GnuPG key fetching failures; that might be caused by instability of my Internet connection, but I have doubts, since I've not seen this fail as much while running tests for other branches recently; or it might be that pool.sks-keyservers.net was having trouble at that point, which is entirely possible
  • some weird failures for SSH and Git tests

Same for these.

All in all, that's many more failures than I usually see in the same environment (which tends to expose fewer robustness issues than our shared Jenkins).

On Jenkins there are also noticeably more failures on the feature branch compared to stable. However, it seems that most failures were introduced in a few runs only, with the other runs looking pretty much like stable. Hmm. One could almost suspect that Chutney generated a "buggy" Tor network for those bad runs.

Let's note that this branch added two things:

  • Chutney: use normal consensus cycle interval
  • Chutney: update to current upstream master

We should test these separately and see which one causes all the trouble, so I created these poorly named branches:

  • feature/16792-only-update-chutney+force-all-tests
  • feature/16792-normal-consensus+force-all-tests

It looks increasingly unlikely we'll have this done for Tails 3.16. Or what do you think, @intrigeri?

#19 Updated by intrigeri 3 months ago

Hi,

  • lots of "available upgrades have been checked" failures after "Tor is ready" succeeded, in the bridges scenarios

None of these. I find it odd that you see "lots" of these while I see none, but I suppose it could be your internet connection? IIRC these tests are especially sensitive to that?

I'm not used to seeing lots of those here, but it could indeed be my Internet connection. If you feel it would be useful, I could trigger a few more runs to see if this happens often again.

Let's note that this branch added two things: […]
We should test these separately and see which one causes all the trouble

Excellent idea :)

It looks increasingly unlikely we'll have this done for Tails 3.16.

I agree it looks less easy than we had hoped, but there are still ~10 days left, which is plenty. We'll see! If it makes it into 3.16, great; if it gets merged a bit later, fine.

#20 Updated by anonym 3 months ago

intrigeri wrote:

  • lots of "available upgrades have been checked" failures after "Tor is ready" succeeded, in the bridges scenarios

None of these. I find it odd that you see "lots" of these while I see none, but I suppose it could be your internet connection? IIRC these tests are especially sensitive to that?

I'm not used to seeing lots of those here, but it could indeed be my Internet connection. If you feel it would be useful, I could trigger a few more runs to see if this happens often again.

Yes, it would be very useful ...

Let's note that this branch added two things: […]
We should test these separately and see which one causes all the trouble

Excellent idea :)

... if you could do a few runs of the two new branches!

#21 Updated by intrigeri 3 months ago

Yes, it would be very useful ...
[…]
... if you could do a few runs of the two new branches!

OK, I shall do this once I'm done with #12092, which currently keeps my local Jenkins busy.

#22 Updated by intrigeri 3 months ago

  • Target version changed from Tails_3.16 to Tails_3.17

intrigeri wrote:

Yes, it would be very useful ...
[…]
... if you could do a few runs of the two new branches!

OK, I shall do this once I'm done with #12092, which currently keeps my local Jenkins busy.

Unfortunately, this has been the case almost continuously since I wrote that, so I could not run anything on the new branches yet.

#23 Updated by intrigeri 3 months ago

I've started 3 runs of each of these 2 new branches locally. Then I noticed that the 2 branches are based on stable. That of course made sense back when they were prepared: we were targeting 3.16. But since then, lots of test suite robustness improvements went to devel and not to stable, so it'll be more painful than necessary to analyze the results. So if you don't mind, I'll merge devel into these branches.

#24 Updated by intrigeri 2 months ago

  • Assignee deleted (anonym)

#25 Updated by intrigeri 2 months ago

  • Target version changed from Tails_3.17 to Tails_4.0

intrigeri wrote:

So if you don't mind, I'll merge devel into these branches.

I've rebased the 3 branches on top of current devel.

#26 Updated by intrigeri 2 months ago

  • Assignee set to intrigeri

I'll take care of analyzing the test results, in the hope that we can at least merge the "upgrade Chutney" part soonish :)

#27 Updated by intrigeri 2 months ago

I feel I need much more data to draw any conclusion. I'll update the info in this comment in a week or so.

#28 Updated by intrigeri 2 months ago

One week later, I've updated my analysis of test results (see below) and I now feel I have enough data to draw preliminary conclusions:

  • In the current state of things, neither of the 2 branches is ready for merging. Both have issues that need further investigation.
  • Regarding the "only update Chutney" branch: I'll come back to it in a week or so to see if the tweaks that I've just pushed help.
  • I'm starting to suspect that we begin using the simulated tor network before it's fully ready: in Chutney nodes' logs I quite often see a bootstrap status below 100%. It seems we should use chutney wait_for_bootstrap in our ensure_chutney_is_running function: see the "Waiting for the network" section in Chutney's README for details.

Test results:

  • only update Chutney
    • runs 18-30 of https://jenkins.tails.boum.org/view/Tails_ISO/job/test_Tails_ISO_feature-16792-only-update-chutney-force-all-tests/:
      • "Cloning a Git repository ǂ Cloning git repository over SSH" failed 8/13 times, compared to 0/21 times on devel. git clone immediately fails with "nc: connection failed, SOCKSv5 error: General SOCKS server failure".
      • "Tor stream isolation is effective ǂ SSH is using the default SocksPort" failed 10/13 times, compared to 3/21 times on devel. This looks like #17013 but the increased failure rate is concerning. Just like in the Git over SSH scenario, SSH immediately fails with "nc: connection failed, SOCKSv5 error: General SOCKS server failure".
      • Apart from these 2 concerning issues, it's business as usual: no regression spotted there.
    • runs 4-7 on my local Jenkins:
      • 3/4 fully passed
      • "Cloning a Git repository ǂ Cloning git repository over SSH" failed 1/4 times, in exactly the same way as on lizard
      • "Tor stream isolation is effective ǂ SSH is using the default SocksPort" failed 1/4 times, in exactly the same way as on lizard
    • Wrt. "nc: connection failed, SOCKSv5 error: General SOCKS server failure", since then I've applied a few tweaks that might help (unlikely) and added 79f86a310a3568a5ef2f4646b216b1be1fab7144 to get more debugging info.
  • normal consensus

Wrt. the aforementioned Tor bootstrap failures, tor got a valid consensus:

Sep 07 10:25:51 amnesia time[8106]: Waiting for a Tor consensus file to contain a valid time interval
Sep 07 10:25:51 amnesia time[8109]: A Tor consensus file now contains a valid time interval.
Sep 07 10:25:51 amnesia time[8110]: Waiting for the chosen Tor consensus file to contain a valid time interval...
Sep 07 10:25:51 amnesia time[8112]: The chosen Tor consensus now contains a valid time interval, let's use it.
Sep 07 10:25:51 amnesia time[8116]: Tor: valid-after=2019-09-07 10:13:00 | valid-until=2019-09-07 13:13:00
Sep 07 10:25:51 amnesia time[8119]: Current time is 2019-09-07 10:25:51
Sep 07 10:25:51 amnesia time[8124]: Current time is in valid Tor range
Sep 07 10:25:51 amnesia time[8125]: Waiting for Tor to be working...

… but it can't build a circuit:

Sep 07 10:25:50.000 [notice] Bootstrapped 0% (starting): Starting
Sep 07 10:25:50.000 [notice] Starting with guard context "default" 
Sep 07 10:25:50.000 [notice] Signaled readiness to systemd
Sep 07 10:25:50.000 [notice] New control connection opened from 127.0.0.1.
Sep 07 10:25:51.000 [notice] New control connection opened from 127.0.0.1.
Sep 07 10:25:51.000 [notice] Opening Control listener on /run/tor/control
Sep 07 10:25:51.000 [notice] Opened Control listener on /run/tor/control
Sep 07 10:25:51.000 [notice] Bootstrapped 5% (conn): Connecting to a relay
Sep 07 10:25:51.000 [notice] Bootstrapped 10% (conn_done): Connected to a relay
Sep 07 10:25:51.000 [notice] Bootstrapped 14% (handshake): Handshaking with a relay
Sep 07 10:25:51.000 [notice] Bootstrapped 15% (handshake_done): Handshake with a relay done
Sep 07 10:25:51.000 [notice] Bootstrapped 20% (onehop_create): Establishing an encrypted directory connection
Sep 07 10:25:51.000 [notice] Bootstrapped 25% (requesting_status): Asking for networkstatus consensus
Sep 07 10:25:51.000 [notice] New control connection opened.
Sep 07 10:25:51.000 [notice] Bootstrapped 30% (loading_status): Loading networkstatus consensus
Sep 07 10:25:51.000 [notice] I learned some more directory information, but not enough to build a circuit: We have no usable consensus.
Sep 07 10:25:51.000 [notice] Bootstrapped 40% (loading_keys): Loading authority key certs
Sep 07 10:25:51.000 [notice] The current consensus has no exit nodes. Tor can only build internal paths, such as paths to onion services.
Sep 07 10:25:51.000 [notice] Bootstrapped 45% (requesting_descriptors): Asking for relay descriptors
Sep 07 10:25:51.000 [notice] I learned some more directory information, but not enough to build a circuit: We need more microdescriptors: we have 0/2, and can only build 0% of likely paths. (We have 0% of guards bw, 0% of midpoint bw, and 0% of end bw (no exits in consensus, using mid) = 0% of path bw.)
Sep 07 10:25:51.000 [notice] Bootstrapped 50% (loading_descriptors): Loading relay descriptors
Sep 07 10:25:51.000 [notice] The current consensus contains exit nodes. Tor can build exit and internal paths.
Sep 07 10:25:51.000 [notice] Bootstrapped 75% (enough_dirinfo): Loaded enough directory info to build circuits
Sep 07 10:25:51.000 [notice] New control connection opened from 127.0.0.1.
Sep 07 10:25:51.000 [notice] New control connection opened from 127.0.0.1.
Sep 07 10:25:52.000 [notice] New control connection opened from 127.0.0.1.
Sep 07 10:25:52.000 [warn] No available nodes when trying to choose node. Failing.
Sep 07 10:25:52.000 [warn] No available nodes when trying to choose node. Failing.
Sep 07 10:25:52.000 [warn] Failed to find node for hop #2 of our path. Discarding this circuit.
Sep 07 10:25:52.000 [notice] Our circuit 0 (id: 3) died due to an invalid selected path, purpose General-purpose client. This may be a torrc configuration issue, or a bug.
Sep 07 10:25:52.000 [notice] New control connection opened.
Sep 07 10:25:53.000 [notice] New control connection opened from 127.0.0.1.
Sep 07 10:25:53.000 [warn] No available nodes when trying to choose node. Failing.
Sep 07 10:25:53.000 [warn] No available nodes when trying to choose node. Failing.
Sep 07 10:25:53.000 [warn] Failed to find node for hop #2 of our path. Discarding this circuit.
Sep 07 10:25:53.000 [notice] New control connection opened.
Sep 07 10:25:53.000 [notice] New control connection opened from 127.0.0.1.
Sep 07 10:25:54.000 [notice] New control connection opened from 127.0.0.1.
Sep 07 10:25:54.000 [warn] No available nodes when trying to choose node. Failing.
Sep 07 10:25:54.000 [warn] No available nodes when trying to choose node. Failing.
Sep 07 10:25:54.000 [warn] Failed to find node for hop #2 of our path. Discarding this circuit.

… and then the last 3 lines repeat endlessly.

Note that we currently use 20 non-exit relays.
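This failure mode has an easy-to-match log signature. A hedged sketch of a probe for it (the helper name and threshold are mine, not from our suite):

```ruby
# Hypothetical probe: detect the endless path-selection loop shown above
# by counting occurrences of the tell-tale warning in a tor log excerpt.
def stuck_on_path_selection?(log, threshold: 3)
  log.scan(/Failed to find node for hop #\d+ of our path/).size >= threshold
end
```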

#29 Updated by intrigeri 2 months ago

  • Blocks deleted (Bug #16471: Drop time synchronization hacks that tor 0.3.5 and 0.4.x made obsolete)

#30 Updated by intrigeri about 2 months ago

  • Subject changed from Upgrade our Chutney fork and make configuration more similar to the real Tor network to Upgrade our Chutney fork
  • Feature Branch changed from feature/16792-update-chutney+force-all-tests to feature/16792-only-update-chutney+force-all-tests

Narrowing scope to something I feel can be completed soon.

I've renamed to wip/ the branches that "make configuration more similar to the real Tor network": wip/feature/16792-update-chutney+force-all-tests and wip/feature/16792-normal-consensus+force-all-tests. That'll be for another ticket whenever we need that.

#31 Updated by intrigeri about 2 months ago

  • I'm starting to suspect that we begin using the simulated tor network before it's fully ready: in Chutney nodes' logs I quite often see a bootstrap status below 100%. It seems we should use chutney wait_for_bootstrap in our ensure_chutney_is_running function: see the "Waiting for the network" section in Chutney's README for details.

I've implemented this on feature/16792-only-update-chutney+force-all-tests. I have high hopes that it will eliminate some kinds of test suite robustness issues, but unfortunately it did not magically solve the "nc: connection failed, SOCKSv5 error: General SOCKS server failure" problem, which thus remains to be investigated.
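The shape of that change can be sketched as a poll-until-deadline loop around a bootstrap check. The `yield`-ed probe below stands in for whatever actually asks Chutney whether all nodes report 100%; it is an assumption, not our real helper:

```ruby
# Hypothetical sketch of waiting for the simulated Tor network: poll a
# bootstrap probe until it succeeds or a hard deadline passes, mirroring
# what `chutney wait_for_bootstrap` gives us.
def wait_until_chutney_bootstrapped(timeout: 240, poll: 5)
  deadline = Time.now + timeout
  until yield
    if Time.now >= deadline
      raise "Chutney network failed to bootstrap within #{timeout}s"
    end
    sleep poll
  end
end
```

In the timeline of this ticket, `timeout:` started at 180s and was later bumped to 240s.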

#32 Updated by intrigeri about 2 months ago

intrigeri wrote:

it did not magically solve the "nc: connection failed, SOCKSv5 error: General SOCKS server failure" problem, which thus remains to be investigated.

I've seen this problem happen a lot on devel-based branches in the last few days. I suspect this is caused by memory starvation on isotesters (#17088). So it might be that this problem is actually caused by limitations in our CI environment, and has nothing to do with Chutney itself. I've just given isotesters more RAM (for #17088), let's see if it helps here too!

#33 Updated by intrigeri about 1 month ago

  • Target version changed from Tails_4.0 to Tails_4.1

intrigeri:

  • I'm starting to suspect that we begin using the simulated tor network before it's fully ready: in Chutney nodes' logs I quite often see a bootstrap status below 100%. It seems we should use chutney wait_for_bootstrap in our ensure_chutney_is_running function: see the "Waiting for the network" section in Chutney's README for details.

I've implemented this on feature/16792-only-update-chutney+force-all-tests. I have high hopes that it will eliminate some kinds of test suite robustness issues, but unfortunately it did not magically solve the "nc: connection failed, SOCKSv5 error: General SOCKS server failure" problem, which thus remains to be investigated.

Since I gave the isotesters more RAM for unrelated reasons, apart from the scenarios that are super fragile on devel too these days, runs 54-60 on lizard yielded this:

  • chutney wait_for_bootstrap failed 2/7 times. Note that this aborts the test suite run before anything else runs, which is why the following stats cover only the 5 other test suite runs. FWIW, the nodes that had not bootstrapped yet were:
    • test035bridge: (80, None, 'Connecting to the Tor network internally')
    • test039obfs4: (85, None, 'Finishing handshake with first hop of internal circuit')
    • test004ba: (85, None, 'Finishing handshake with first hop of internal circuit')
    • test035bridge: (80, None, 'Connecting to the Tor network internally')
    • test036bridge: (80, None, 'Connecting to the Tor network internally')
  • "SSH is using the default SocksPort" failed with the now common "nc: connection failed […]" error 5/5 times (vs. 3 times in the last 10 test suite runs on the devel branch)
  • "Chatting with some friend over XMPP" failed 1/5 times vs. "Chatting with some friend over XMPP and with OTR" failing 1/10 times on devel with similar symptoms ⇒ I won't lose any sleep over this one.

Apart from this, everything looks just as good/bad as on the devel branch.

So IMO, the only blockers here are now:

  • Increase the chutney wait_for_bootstrap timeout until it's robust enough. If this requires an unreasonably large timeout, or if we notice that Chutney sometimes simply never bootstraps at all, go back to the drawing board: it could be a bug in Chutney. I've just bumped the timeout from 180s to 240s.
  • Find out what's going on with "SSH is using the default SocksPort". Idea: try with a different host that's outside of the *.tails.boum.org network (which our isotesters are part of).

I've also merged upstream Chutney changes again on this branch, because 4 months have passed since anonym last did so, and I'd hate to be blocked here by issues that upstream has already fixed.

In passing: we're still running Chutney under Python 2, even though it now supports Python 3, and it seems that Tor folks now run it that way. We should change this at some point.
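A quick way to check what such a switch would involve: verify that a Python 3 interpreter is available and behaves as expected before pointing Chutney at it. The PYTHON variable below is an assumption about how Chutney's launcher might be told which interpreter to use; check the actual wrapper script before relying on it.

```shell
#!/bin/sh
# Sketch only: pick a Python 3 interpreter and sanity-check it.
# PYTHON as a launcher knob is an assumption, not confirmed Chutney API.
PYTHON="${PYTHON:-python3}"
"$PYTHON" -c 'import sys; assert sys.version_info[0] == 3, sys.version'

# Then, hypothetically, one would run Chutney with it:
#   PYTHON="$PYTHON" ./chutney start networks/basic
```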

I probably won't have time to come back to this ticket in time for 4.0, and anyway I'll need to let a week or two pass until there's enough new data to analyze ⇒ postponing.

#34 Updated by intrigeri 29 days ago

  • Related to Bug #17163: "SSH is using the default SocksPort" test suite scenario is fragile added

#35 Updated by intrigeri 21 days ago

I've just looked at runs 61-76.

So IMO, the only blockers here are now:

  • Increase the chutney wait_for_bootstrap timeout until it's robust enough. If this requires an unreasonably large timeout, or if we notice that Chutney sometimes simply never bootstraps at all, go back to the drawing board: it could be a bug in Chutney. I've just bumped the timeout from 180s to 240s.

This problem happened in 3 of the last 16 runs. I've bumped the timeout again, this time to 600 seconds.

If that's still not enough, I see two ways to approach this, whenever I or someone else has time for this:

  • This might be related to our V3AuthVotingInterval 180 custom setting. Next steps:
    • Research why we did that
    • Try reverting to Chutney's default (V3AuthVotingInterval 20)
  • Worst case, retry (i.e. restart Chutney) when chutney wait_for_bootstrap fails.
  • Find out what's going on with "SSH is using the default SocksPort". Idea: try with a different host, that's outside of the *.tails.boum.org network (which our isotesters are part of).

This was solved via #17163 :)
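The "worst case" fallback listed above (restart Chutney and retry when chutney wait_for_bootstrap fails) could look roughly like this. It is a generic sketch: the function names and the restart hook are illustrative, not existing Tails or Chutney code.

```shell
#!/bin/sh
# Illustrative retry loop: re-run a bootstrap command up to N times,
# optionally restarting the simulated network between attempts.
# "$@" stands in for the real chutney invocation.
retry_bootstrap() {
    tries="$1"; shift
    i=0
    while [ "$i" -lt "$tries" ]; do
        if "$@"; then
            return 0
        fi
        i=$((i + 1))
        # restart_network   # hypothetical hook: stop and restart Chutney here
    done
    return 1
}

# Hypothetical usage:
#   retry_bootstrap 3 ./chutney wait_for_bootstrap networks/basic
```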

Apart from that, the test suite on this branch is as robust as on devel.
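For reference, the two voting-interval values being compared would look like this in a torrc template. This is just a config fragment reproducing the numbers from the comment above, not a tested change:

```
## Tails' current custom setting (origin to be researched):
V3AuthVotingInterval 180

## Chutney's default, to try reverting to:
#V3AuthVotingInterval 20
```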

#36 Updated by intrigeri 8 days ago

  • Increase the chutney wait_for_bootstrap timeout until it's robust enough. If this requires an unreasonably large timeout, or if we notice that Chutney sometimes simply never bootstraps at all, go back to the drawing board: it could be a bug in Chutney. I've just bumped the timeout from 180s to 240s.

This problem happened in 3 of the last 16 runs. I've bumped the timeout again, this time to 600 seconds.

This problem did not happen a single time during the last 15 runs (77-91), so I'm confident this did the trick.

I've looked closer at the last 5 runs and the failures are all well-known robustness problems.

#37 Updated by intrigeri 8 days ago

  • Status changed from In Progress to Needs Validation
  • Assignee deleted (intrigeri)

#38 Updated by intrigeri 8 days ago

  • Blocked by Bug #16471: Drop time synchronization hacks that tor 0.3.5 and 0.4.x made obsolete added

#39 Updated by intrigeri 8 days ago

  • Blocked by deleted (Bug #16471: Drop time synchronization hacks that tor 0.3.5 and 0.4.x made obsolete)

#40 Updated by intrigeri 8 days ago

  • Blocks Bug #16471: Drop time synchronization hacks that tor 0.3.5 and 0.4.x made obsolete added

#41 Updated by intrigeri 8 days ago

  • Blocks Bug #11589: Time syncing over bridge is fragile added
