Feature #11888

Improve the remote shell's performance by switching to a virtio channel

Added by anonym about 3 years ago. Updated about 1 month ago.

Status:
In Progress
Priority:
Normal
Assignee:
Category:
Test suite
Target version:
-
Start date:
10/25/2016
Due date:
% Done:

10%

Feature Branch:
wip/11888-remote-shell-virtio-transport
Type of work:
Code
Blueprint:
Starter:
Affected tool:

Description

Currently the serial link-based transfer of data between host and guest is so slow (400 Kbaud) that I think we've consciously avoided adding some useful features that depend on fast file IO, e.g. dumping the output of tails-debugging-info as a scenario failure artifact. Also, the data transfer delay of the thousands of remote shell calls we do each full test suite run add up, so I expect that run time will decrease measurably.
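For background, libvirt exposes a virtio channel as a UNIX socket on the host and as a character device under /dev/virtio-ports/ inside the guest, so both ends can treat it as an ordinary stream. A minimal sketch of how the two ends could be opened, with a hypothetical channel name and socket path (the actual branch may wire this up differently):

require 'socket'

# Host side (test suite): libvirt binds the channel to a UNIX socket,
# so the client simply connects to it.  The path is a placeholder.
host_end = UNIXSocket.new('/tmp/tails-remote-shell.socket')
host_end.sync = true  # write through immediately, no Ruby-side buffering

# Guest side (remote shell server): the channel shows up as a
# virtio-serial port named after the channel definition.  The name is
# a placeholder too.
guest_end = File.open('/dev/virtio-ports/org.tails.remote_shell.0', 'r+')
guest_end.sync = true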


Related issues

Related to Tails - Feature #11294: Replace remote shell with python-negotiator Rejected 03/31/2016
Blocked by Tails - Feature #11887: Make the remote shell's file operations robust Resolved 10/25/2016

Associated revisions

Revision 9fb70d9c (diff)
Added by anonym almost 3 years ago

Revert back to serial link for the remote shell transport.

We are experiencing lockups with the virtio channel that will need
further investigation.

Refs: #11887, #11888

Revision 6a519859 (diff)
Added by anonym about 1 month ago

Test suite: log whenever remote_shell_is_up?() returns false.

Making this clearer can help when debugging issues around the remote
shell, like for refs: #11888.

Revision d85813c4 (diff)
Added by anonym about 1 month ago

Test suite: retry restoring snapshots if the remote shell fails.

This is for debugging refs: #11888.

Revision 8405ba21 (diff)
Added by anonym about 1 month ago

Test suite: wait for the remote shell a bit longer after restoring a snapshot.

Refs: #11888

History

#1 Updated by intrigeri about 3 years ago

  • Related to Feature #11294: Replace remote shell with python-negotiator added

#2 Updated by intrigeri about 3 years ago

Cool! It would be nice to clarify the relationship of this ticket with #11294.

#3 Updated by intrigeri about 3 years ago

  • Related to Feature #11887: Make the remote shell's file operations robust added

#4 Updated by anonym about 3 years ago

I've seen the test suite get stuck like this:

    Checkpoint: I have started Tails from DVD and logged in with an administration password and the network is connected
      Given I have started Tails from DVD without network and logged in with an administration password
      And the network is plugged
calling as root: /usr/local/sbin/tor-has-bootstrapped
call returned: [1, "", ""]
calling as root: /usr/local/sbin/tor-has-bootstrapped
call returned: [1, "", ""]
calling as root: /usr/local/sbin/tor-has-bootstrapped
call returned: [1, "", ""]
calling as root: /usr/local/sbin/tor-has-bootstrapped
call returned: [1, "", ""]
calling as root: /usr/local/sbin/tor-has-bootstrapped
calling as root: journalctl SYSLOG_IDENTIFIER=restart-tor

Presumably both sides ended up waiting for an answer => deadlock.

It happens in wait_until_tor_is_working():

def wait_until_tor_is_working
  try_for(270) { $vm.execute('/usr/local/sbin/tor-has-bootstrapped').success? }
rescue Timeout::Error => e
  c = $vm.execute("journalctl SYSLOG_IDENTIFIER=restart-tor")
[...]

So I can see how a try_for that times out could create a situation where the last tor-has-bootstrapped call is aborted while we are listening for the response, but before it arrives; we then end up calling journalctl, where we should receive both responses but discard the first one (which is logged). However, I only see five runs of the try_for block, so only ~five seconds had passed. So this is still a mystery to me.

Just a thought: this might be some sort of race that affects the slow non-virtio remote shell as well, it's just that it happens more often with this awesome speed.
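To make the suspected failure mode concrete: if the timeout fires after the request has been written but before the reply has been read, that reply stays queued in the channel and is the first thing the next caller reads. A self-contained sketch using a socket pair as a stand-in for the virtio channel (illustrative only, not the actual remote shell code):

require 'socket'
require 'timeout'

client, server = UNIXSocket.pair  # stand-in for the virtio channel

# Fake server: replies to each request, but slowly.
Thread.new do
  while (cmd = server.gets)
    sleep 2
    server.puts("reply to #{cmd.strip}")
  end
end

# One round trip: send a request, then block waiting for the reply.
def execute_remotely(sock, command)
  sock.puts(command)  # request is written...
  # ...if Timeout fires right here, the reply is never read...
  sock.gets           # ...and stays queued for the next caller
end

begin
  # try_for-style wrapper: Timeout can interrupt between the write and
  # the read above, leaving a stale reply behind.
  Timeout.timeout(1) { loop { execute_remotely(client, 'tor-has-bootstrapped') } }
rescue Timeout::Error
  # The next call (e.g. journalctl) will first see the stale
  # 'reply to tor-has-bootstrapped' and must recognize and discard it.
end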

#5 Updated by anonym about 3 years ago

And now I've seen these two deadlocks:

# just restored from a snapshot
calling as root: echo 'hello?'
call returned: [0, "hello?\n", ""]
calling as root: /sbin/ifconfig eth0 | grep -q 'inet addr'
call returned: [1, "", ""]
calling as root: date -s '@1477525308'
# deadlock
# first boot
calling as root: echo 'hello?'
call returned: [0, "hello?\n", ""]
calling as root: service tor status
call returned: [3, "● tor.service - Anonymizing overlay network for TCP (multi-instance-master)\n   Loaded: loaded (/lib/systemd/system/tor.service; disabled)\n   Active: inactive (dead)\n", ""]
opening file /etc/tor/torrc in 'append' mode
append complete
[log] CLICK on (642,449)
calling as root: loginctl
call returned: [0, "   SESSION        UID USER             SEAT            \n        c1        114 Debian-gdm       seat0           \n\n1 sessions listed.\n", ""]
calling as root: loginctl
call returned: [0, "   SESSION        UID USER             SEAT            \n        c1        114 Debian-gdm       seat0           \n         1       1000 amnesia          seat0           \n        c5          0 root                             \n\n3 sessions listed.\n", ""]
calling as amnesia: gsettings set org.gnome.desktop.session idle-delay 0
call returned: [0, "", ""]
calling as amnesia: gsettings set org.gnome.desktop.interface toolkit-accessibility true
# deadlock

Add to this that when I was developing this branch, I very frequently hit the issue that the remote shell got deadlocked within the first ten calls after it started for the first time, but I thought I had fixed that by making sure both sides flush the socket they use for the virtio channel. Anyway, there seems to be an increased risk of these deadlocks "early", for some definition of "early".

I guess I need to add some sort of deep debug logging to understand what happens. The only problem is that normally we'd use the remote shell to obtain those logs from the guest. Perhaps we can revive the serial link just for that purpose. :)
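For reference, the flushing fix mentioned above boils down to disabling Ruby's userspace IO buffering and flushing after every write, on both ends of the channel; a sketch with placeholder names, not the actual branch code:

require 'socket'

channel = UNIXSocket.new('/tmp/tails-remote-shell.socket')  # placeholder path
channel.sync = true      # write(2) right away instead of buffering in Ruby
channel.puts('request')
channel.flush            # force out anything that is still buffered

# The guest-side server needs the same treatment on its end of the
# virtio port, otherwise a short message can sit in a local buffer
# while the other side blocks waiting for it -- i.e. a deadlock.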

#6 Updated by bertagaz about 3 years ago

anonym wrote:

Just a thought: this might be some sort of race that affects the slow non-virtio remote shell as well, it's just that it happens more often with this awesome speed.

It is. I thought I'd seen it while working on #10777 and mentioned it there, but it doesn't seem so. I'm pretty sure I noticed it anyway, and that it was one of the reasons the remote shell sometimes failed.

#7 Updated by bertagaz about 3 years ago

bertagaz wrote:

It is. I thought I'd seen it while working on #10777 and mentioned it there, but it doesn't seem so. I'm pretty sure I noticed it anyway, and that it was one of the reasons the remote shell sometimes failed.

Found it, that was in #11846#note-8

#8 Updated by anonym about 3 years ago

bertagaz wrote:

bertagaz wrote:

It is. I thought I'd seen it while working on #10777 and mentioned it there, but it doesn't seem so. I'm pretty sure I noticed it anyway, and that it was one of the reasons the remote shell sometimes failed.

Found it, that was in #11846#note-8

This is not the same issue, AFAICT. Dropping out-of-order responses is expected when VM.execute() is called inside a try_for() block (since we might abort after sending the request to the remote shell, but before we get the response). In fact, I'm shocked we haven't experienced more issues before, when we did not handle this. I guess it's because we mostly fail the run when a try_for() fails, so the remote shell is not used any more and we are safe -- if we did try to use it, we'd get the response from the aborted command, which could cause all sorts of trouble.

The deadlocks I'm talking about happen even if there's no try_for(), and in fact no exceptions at all are thrown. I suspect this is a low-level issue, perhaps related to IO buffering on the socket (buffering should be disabled and the socket flush()ed, on both client and server). And it's pretty rare; our test suite makes around 5000 calls to the remote shell each run, and if you look at the Jenkins results for the branch, there's about one deadlock per run, at random places. I have never seen more than one happen, and only a few runs where it doesn't happen at all.

This is a tricky one, and I'm currently trying to figure out how to debug it. The problem is that we can only easily debug one side (the client, in the test suite) since we cannot easily access the log of the remote shell server -- the easy way would be to use the remote shell itself! :)
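The out-of-order handling described above can be pictured as tagging every request with an id and dropping any reply whose id does not match the current request; a hypothetical sketch, not the actual test suite protocol:

require 'socket'
require 'json'

class RemoteShellClient
  def initialize(socket)
    @socket = socket
    @socket.sync = true
    @next_id = 0
  end

  # Send a command and wait for *its* reply; replies from earlier,
  # aborted calls carry an older id and are discarded (and logged).
  def execute(command)
    id = (@next_id += 1)
    @socket.puts(JSON.generate('id' => id, 'cmd' => command))
    loop do
      reply = JSON.parse(@socket.gets)
      return reply if reply['id'] == id
      warn "Discarding stale reply ##{reply['id']}"
    end
  end
end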

#9 Updated by anonym almost 3 years ago

  • Target version changed from Tails_2.9.1 to Tails 2.10

Given the problems I'm seeing, I think I'll split out the virtio socket work into its own branch, so the other improvements can be included faster. Soon!

#10 Updated by anonym almost 3 years ago

  • Status changed from Confirmed to In Progress
  • % Done changed from 0 to 10
  • Feature Branch deleted (test/remote-shell-file-operations)

To continue work on this, revert 9fb70d9c6e90ff18e4e2be97c6360fd8212aa0dd (in the test/11887-remote-shell-file-operations branch, if it hasn't been merged yet).

#11 Updated by anonym almost 3 years ago

  • Related to deleted (Feature #11887: Make the remote shell's file operations robust)

#12 Updated by anonym almost 3 years ago

  • Blocked by Feature #11887: Make the remote shell's file operations robust added

#13 Updated by anonym almost 3 years ago

  • Target version deleted (Tails 2.10)

#14 Updated by anonym almost 3 years ago

  • Feature Branch set to feature/11888-remote-shell-virtio-transport

I've revived the code now (based on test/12059-dogtail-optimizations (which in turn is based on test/11887-remote-shell-file-operations)). Let's see how it behaves on Jenkins.

#15 Updated by anonym about 1 month ago

anonym wrote:

I've revived the code now (based on test/12059-dogtail-optimizations (which in turn is based on test/11887-remote-shell-file-operations)). Let's see how it behaves on Jenkins.

That was two years ago, and I don't know what the outcome was. :/ Anyway, I'm curious whether anything has changed since then, so I've refreshed this branch yet again.

#16 Updated by anonym about 1 month ago

So after 5 runs of +force-all-tests and 3 runs of the "normal" branch, one of the +force-all-tests runs had a failure like this:

05:00:49.055818907: Restoring snapshot 'with-network-logged-in'...
05:00:51.131680384: Remote shell: calling as root: echo 'hello?'
05:00:52.300526024: Remote shell: calling as root: echo 'hello?'
[...]
05:02:20.000511903: Remote shell: calling as root: echo 'hello?'
05:02:21.017870387: Remote shell: calling as root: echo 'hello?'
    Generated snapshot step failed with exception:
      Timeout::Error: Remote shell seems to be down

So, again, there is a problem "early" after restoring, although now it fails immediately after restoring, whereas 3 years ago it could happen a few calls later.

Anyway, I added some debugging to help debug (and possibly work around) this.

#17 Updated by anonym about 1 month ago

anonym wrote:

debugging to help debug

:D

#18 Updated by anonym about 1 month ago

I tried bumping the timeout used for waiting for the remote shell after restoring a snapshot from 3 to 30 seconds, but it didn't help.

What does help is simply retrying the snapshot restore. It seems like an effective workaround, as it has rescued all instances so far, but our serial-based transport works fine without any workarounds; it's just slow. It seems this branch could lower the total run time from around 5h40m to 5h30m (more if there are tons of failures, since saving the journal over the slow serial link adds up), which does not seem worth it.
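The workaround amounts to something like the following around the restore step. This is only a sketch: it leans on the test suite's try_for and $vm helpers, and the restore_snapshot and remote_shell_is_up? method names are assumptions about the exact API, not the actual branch code.

# Retry the restore if the remote shell does not come up in time,
# instead of failing the scenario immediately.
def restore_snapshot_with_retry(name, attempts: 2)
  attempts.times do |i|
    $vm.restore_snapshot(name)                # assumed helper name
    begin
      try_for(30) { $vm.remote_shell_is_up? } # assumed predicate name
      return
    rescue Timeout::Error
      raise if i == attempts - 1
      puts 'Remote shell seems to be down, retrying the restore...'
    end
  end
end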

I guess I'll return in two years and see if things have improved. Unless we get a real need for this branch and want to invest time reporting this upstream and debugging it much further.

#19 Updated by intrigeri about 1 month ago

I guess I'll return in two years and see if things have improved. Unless we get a real need for this branch and want to invest time reporting this upstream and debugging it much further.

This plan works for me, until we see a real, immediate need for it.

#20 Updated by intrigeri about 1 month ago

Hi @anonym,

I guess I'll return in two years and see if things have improved.

Then, would you mind renaming the branches to wip/ to free some Jenkins cycles?

#21 Updated by anonym about 1 month ago

  • Feature Branch changed from feature/11888-remote-shell-virtio-transport to wip/11888-remote-shell-virtio-transport

intrigeri wrote:

Then, would you mind renaming the branches to wip/ to free some Jenkins cycles?

Didn't think about that! Thanks, done!
