Project

General

Profile

Bug #11846

The remote shell can mix up responses after an execution has been aborted

Added by anonym over 3 years ago. Updated over 3 years ago.

Status:
Resolved
Priority:
Normal
Assignee:
-
Category:
Test suite
Target version:
Start date:
09/27/2016
Due date:
% Done:

100%

Feature Branch:
test/11846-remote-shell-request-ids
Type of work:
Code
Blueprint:
Starter:
Affected tool:

Description

From #11832#note-3:

[...]
calling as amnesia: gpg --batch --list-keys '10CC5BC7'
Tor operation failed (Try 1 of 10) with:
Timeout::Error: try_for() timeout expired
Last ignored exception was: ExecutionFailedInVM: Command failed: gpg --batch --list-keys '10CC5BC7'
error code: 2
stderr: gpg: error reading key: public key not found
.
<false> is not true.
Forcing new Tor circuit...
calling as root: . /usr/local/lib/tails-shell-library/tor.sh && tor_control_send "signal NEWNYM" 
call returned: [2, "", "gpg: error reading key: public key not found\n"]
    When I fetch the "10CC5BC7" OpenPGP key using Seahorse                         # features/step_definitions/torified_gnupg.rb:154
      Command failed: . /usr/local/lib/tails-shell-library/tor.sh && tor_control_send "signal NEWNYM" 
      error code: 2
      stderr: gpg: error reading key: public key not found
      .
      <false> is not true. (ExecutionFailedInVM)
      ./features/support/helpers/vm_helper.rb:442:in `rescue in execute_successfully'
      ./features/support/helpers/vm_helper.rb:439:in `execute_successfully'
      ./features/support/helpers/misc_helpers.rb:94:in `force_new_tor_circuit'
      ./features/support/helpers/misc_helpers.rb:104:in `block in retry_tor'
      ./features/support/helpers/misc_helpers.rb:133:in `call'
      ./features/support/helpers/misc_helpers.rb:133:in `rescue in block in retry_action'
      ./features/support/helpers/misc_helpers.rb:125:in `block in retry_action'
      ./features/support/helpers/misc_helpers.rb:124:in `loop'
      ./features/support/helpers/misc_helpers.rb:124:in `retry_action'
      ./features/support/helpers/misc_helpers.rb:108:in `retry_tor'
      ./features/step_definitions/torified_gnupg.rb:172:in `/^I fetch the "([^"]+)" OpenPGP key using Seahorse( via the OpenPGP Applet)?$/'
      features/torified_gnupg.feature:22:in `When I fetch the "10CC5BC7" OpenPGP key using Seahorse'


and look closer at:
calling as root: . /usr/local/lib/tails-shell-library/tor.sh && tor_control_send "signal NEWNYM" 
call returned: [2, "", "gpg: error reading key: public key not found\n"]

I.e. the remote shell has mixup up the response from the gpg call earlier. That gpg call was aborted (by a try_for timeout) while it was executing on the guest, before writing back the response. So our remote shell is too simplistic here -- we need an ID that ties a response to a specific call.

Simple solution: when the client sends a request to the remote shell, except the command line it also sends an ID. The remote shell will put that ID on the response. Whenever the client reads a reponse it will drop those with unmatching ID (since they must have been aborted).


Related issues

Related to Tails - Bug #10502: The test suite sometimes cannot connect to the remote shell Resolved 11/06/2015
Related to Tails - Bug #12040: Test suite cannot sometimes connect to the remote shell: "Dropped out-of-order remote shell response: got id but expected id NNNN" Confirmed 12/19/2016

Associated revisions

Revision 4e341a2f (diff)
Added by anonym over 3 years ago

Introduce a request ID for remote shell calls.

When a remote shell call is aborted while the command is executing, the
next remote shell call will get as response the response meant for the
aborted call. Using this unique ID we can make sure that requests will
only get the intended responses.

Will-fix: #11846

Revision 86357945
Added by bertagaz over 3 years ago

Merge branch 'test/11846-remote-shell-request-ids' into stable

Fix-committed: #11846

History

#1 Updated by anonym over 3 years ago

  • Status changed from Confirmed to In Progress
  • % Done changed from 0 to 30
  • QA Check set to Dev Needed
  • Feature Branch set to test/11846-remote-shell-request-ids

I've pushed an untested branch. Let's hope it doesn't break anything jenkins -- I think that is enough criteria for merging, since this error is very rare.

#2 Updated by anonym over 3 years ago

  • Assignee changed from anonym to bertagaz
  • Target version changed from Tails_2.9.1 to Tails_2.7
  • % Done changed from 30 to 50
  • QA Check changed from Dev Needed to Ready for QA

Note: I force-pushed a fix for a syntax error.

This branch seems to work fine, with two runs on Jenkins with only unrelated problems (e.g. that one of our keys will expire in <3 months). I guess we can even try to get this bugfix into Tails 2.7.

#3 Updated by anonym over 3 years ago

  • Assignee changed from bertagaz to anonym
  • QA Check changed from Ready for QA to Dev Needed

While we're working on remote shell issues, I think we should consider trying this:

--- a/features/support/helpers/exec_helper.rb
+++ b/features/support/helpers/exec_helper.rb
@@ -37,16 +37,7 @@ class VMCommand
     socket.puts(JSON.dump([id, type, options[:user], cmd]))
     loop do
       s = socket.readline(sep = "\0").chomp("\0")
-      begin
-        response_id, *rest = JSON.load(s)
-      rescue JSON::ParserError
-        # The server often returns something unparsable for the very
-        # first execute() command issued after a VM start/restore
-        # (generally from wait_until_remote_shell_is_up()) presumably
-        # because the TCP -> serial link isn't properly setup yet. All
-        # will be well after that initial hickup, so we just retry.
-        return VMCommand.execute(vm, cmd, options)
-      end
+      response_id, *rest = JSON.load(s)
       if response_id == id
         debug_log("#{type} returned: #{s}") if not(options[:spawn])
         return rest

That part seems like a hack to me now. I've never seen (which would manifest it self with two+ debug_log() calls) the retrying part happen for any real use of the remote shell, only for the first one during VMCommand.wait_until_remote_shell_is_up(), and it does its own retrying. Combined with the response ID thing guaranteeing that the client will only get a response for the actual command issued, I think it's safe to remove this code and any problem it may cause. So, I went ahead and pushed a commit doing that. Let's see what Jenkins thinks.

#4 Updated by anonym over 3 years ago

  • Assignee changed from anonym to bertagaz
  • QA Check changed from Dev Needed to Ready for QA

Still passes on Jenkins so let's go for it!

#5 Updated by bertagaz over 3 years ago

  • Status changed from In Progress to 11
  • Assignee deleted (bertagaz)
  • % Done changed from 50 to 100
  • QA Check changed from Ready for QA to Pass

Yay, love it. Saw it rescuing precisely what it was meant to, so merged it.

I wonder if we shouldn't track when this unrelated previous remote shell responses happen, as that's a sign something may not be that good there.

#6 Updated by anonym over 3 years ago

  • Assignee set to bertagaz
  • QA Check changed from Pass to Info Needed

bertagaz wrote:

Yay, love it. Saw it rescuing precisely what it was meant to, so merged it.

Wow, if you can link me to the debug.log where this happened, I would be delighted!

I wonder if we shouldn't track when this unrelated previous remote shell responses happen, as that's a sign something may not be that good there.

It would be interesting, indeed. I sometimes wish I had access to the system where the artifacts are stored, and that I could do something like grep $SOMETHING */debug.log over all artifacts...

Otherwise, what did you have in mind with "tracking", specifically?

#7 Updated by intrigeri over 3 years ago

I sometimes wish I had access to the system where the artifacts are stored, and that I could do something like grep $SOMETHING */debug.log over all artifacts...

Please file a ticket about it and we'll discuss the implications.

#8 Updated by bertagaz over 3 years ago

  • Assignee changed from bertagaz to anonym

anonym wrote:

bertagaz wrote:
Wow, if you can link me to the debug.log where this happened, I would be delighted!

calling as root: /usr/local/sbin/tor-has-bootstrapped
call returned: [238, 3, "", ""]
calling as root: /usr/local/sbin/tor-has-bootstrapped
calling as root: journalctl SYSLOG_IDENTIFIER=restart-tor
Dropped out-of-order remote shell response: got id 239 but expected id 240
call returned: [240, 0, "-- Logs begin at Mon 2016-10-03 14:08:40 UTC, end at Mon 2016-10-03 14:14:42 UTC. --\n \
Oct 03 14:10:46 amnesia restart-tor[4265]: Tor seems to have stalled while bootstrapping. Restarting Tor.\n \ 
Oct 03 14:11:08 amnesia restart-tor[5280]: Tor seems to have stalled while bootstrapping. Restarting Tor.\n \
Oct 03 14:11:31 amnesia restart-tor[6276]: Tor seems to have stalled while bootstrapping. Restarting Tor.\n \
Oct 03 14:11:54 amnesia restart-tor[7325]: Tor seems to have stalled while bootstrapping. Restarting Tor.\n \
Oct 03 14:12:16 amnesia restart-tor[8351]: Tor seems to have stalled while bootstrapping. Restarting Tor.\n \
Oct 03 14:12:39 amnesia restart-tor[9354]: Tor seems to have stalled while bootstrapping. Restarting Tor.\n \
Oct 03 14:13:02 amnesia restart-tor[10384]: Tor seems to have stalled while bootstrapping. Restarting Tor.\n \
Oct 03 14:13:24 amnesia restart-tor[11416]: Tor seems to have stalled while bootstrapping. Restarting Tor.\n \
Oct 03 14:13:47 amnesia restart-tor[12443]: Tor seems to have stalled while bootstrapping. Restarting Tor.\n \
Oct 03 14:14:10 amnesia restart-tor[13470]: Tor seems to have stalled while bootstrapping. Restarting Tor.\n \
Oct 03 14:14:32 amnesia restart-tor[14502]: Tor seems to have stalled while bootstrapping. Restarting Tor.\n", ""]
From the journal:
  -- Logs begin at Mon 2016-10-03 14:08:40 UTC, end at Mon 2016-10-03 14:14:42 UTC. --
Oct 03 14:10:46 amnesia restart-tor[4265]: Tor seems to have stalled while bootstrapping. Restarting Tor.
Oct 03 14:11:08 amnesia restart-tor[5280]: Tor seems to have stalled while bootstrapping. Restarting Tor.
Oct 03 14:11:31 amnesia restart-tor[6276]: Tor seems to have stalled while bootstrapping. Restarting Tor.
Oct 03 14:11:54 amnesia restart-tor[7325]: Tor seems to have stalled while bootstrapping. Restarting Tor.
Oct 03 14:12:16 amnesia restart-tor[8351]: Tor seems to have stalled while bootstrapping. Restarting Tor.
Oct 03 14:12:39 amnesia restart-tor[9354]: Tor seems to have stalled while bootstrapping. Restarting Tor.
Oct 03 14:13:02 amnesia restart-tor[10384]: Tor seems to have stalled while bootstrapping. Restarting Tor.
Oct 03 14:13:24 amnesia restart-tor[11416]: Tor seems to have stalled while bootstrapping. Restarting Tor.
Oct 03 14:13:47 amnesia restart-tor[12443]: Tor seems to have stalled while bootstrapping. Restarting Tor.
Oct 03 14:14:10 amnesia restart-tor[13470]: Tor seems to have stalled while bootstrapping. Restarting Tor.
Oct 03 14:14:32 amnesia restart-tor[14502]: Tor seems to have stalled while bootstrapping. Restarting Tor.
    Step failed while creating checkpoint: Tor is ready

Otherwise, what did you have in mind with "tracking", specifically?

I don't know exactly, in any cases a way to collect which steps are used to trigger this (maybe one of the mission of the "Deal with XX 2016 false positive" task?), and probably a ticket to follow on on it?

#9 Updated by anonym over 3 years ago

  • Assignee deleted (anonym)
  • QA Check changed from Info Needed to Pass

bertagaz wrote:

anonym wrote:

bertagaz wrote:
Wow, if you can link me to the debug.log where this happened, I would be delighted!

[...]

Cool! So this happened in a similar instance (try_for timing out in the middle of a call to the remote shell).

Otherwise, what did you have in mind with "tracking", specifically?

I don't know exactly, in any cases a way to collect which steps are used to trigger this (maybe one of the mission of the "Deal with XX 2016 false positive" task?), and probably a ticket to follow on on it?

On second thought, I don't think tracking this will be very interesting. I'm very sure that all instances will be the same, i.e. cause by try_for timing out in the middle of a call to the remote shell. So I say, let's not bother. Please let me know if you think I am wrong! :)

#10 Updated by intrigeri over 3 years ago

  • Related to Bug #10502: The test suite sometimes cannot connect to the remote shell added

#11 Updated by intrigeri over 3 years ago

As reported on #10502#note-32, this seems to break some tests on the devel branch quite often. Is this tracked elsewhere, or shall we reopen this ticket?

#12 Updated by anonym over 3 years ago

intrigeri wrote:

As reported on #10502#note-32, this seems to break some tests on the devel branch quite often. Is this tracked elsewhere, or shall we reopen this ticket?

The reason is because those tests were run with an --old-iso that doesn't have the necessary changes to the remote shell -- if you look at the actual errors, you can see that all of them are results of this. I think someone has realized this, because devel's current test jobs has changed from using the Tails 2.6 ISO to a recent devel build with the required changes. Any way, I think all is good.

#13 Updated by intrigeri over 3 years ago

Any way, I think all is good.

Cool!

#14 Updated by bertagaz over 3 years ago

  • Status changed from 11 to Resolved

#15 Updated by intrigeri about 3 years ago

  • Related to Bug #12040: Test suite cannot sometimes connect to the remote shell: "Dropped out-of-order remote shell response: got id but expected id NNNN" added

Also available in: Atom PDF