Feature #11888

Improve the remote shell's performance by switching to a virtio channel

Added by anonym almost 3 years ago. Updated over 2 years ago.

Status: In Progress
Priority: Normal
Assignee:
Category: Test suite
Target version: -
Start date: 10/25/2016
Due date:
% Done: 10%
Feature Branch: feature/11888-remote-shell-virtio-transport
Type of work: Code
Blueprint:
Starter:
Affected tool:

Description

Currently the serial link-based transfer of data between host and guest is so slow (400 Kbaud) that I think we've consciously avoided adding some useful features that depend on fast file IO, e.g. dumping the output of tails-debugging-info as a scenario failure artifact. Also, the data transfer delay of the thousands of remote shell calls we make during each full test suite run adds up, so I expect run time to decrease measurably.
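
As a rough illustration of the intended transport (a minimal sketch, not the actual Tails implementation: the socket path and the line-based JSON protocol are assumptions), the host side of a libvirt virtio channel shows up as a Unix socket that the test suite could talk to like this:

require 'socket'
require 'json'

# Hypothetical host-side path of the Unix socket backing the virtio channel.
VIRTIO_SOCKET = '/tmp/remote-shell-virtio.socket'

def execute_via_virtio(command, user: 'root')
  socket = UNIXSocket.new(VIRTIO_SOCKET)
  socket.sync = true  # write through, no Ruby-side buffering
  socket.puts(JSON.generate({ 'user' => user, 'cmd' => command }))
  JSON.parse(socket.gets)  # e.g. [0, "hello?\n", ""]
ensure
  socket.close if socket
end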


Related issues

Related to Tails - Feature #11294: Replace remote shell with python-negotiator Rejected 03/31/2016
Blocked by Tails - Feature #11887: Make the remote shell's file operations robust Resolved 10/25/2016

Associated revisions

Revision 9fb70d9c (diff)
Added by anonym almost 3 years ago

Revert back to serial link for the remote shell transport.

We are experiencing lockups with the virtio channel that will need
further investigation.

Refs: #11887, #11888

History

#1 Updated by intrigeri almost 3 years ago

  • Related to Feature #11294: Replace remote shell with python-negotiator added

#2 Updated by intrigeri almost 3 years ago

Cool! It would be nice to clarify the relationship of this ticket with #11294.

#3 Updated by intrigeri almost 3 years ago

  • Related to Feature #11887: Make the remote shell's file operations robust added

#4 Updated by anonym almost 3 years ago

I've seen the test suite get stuck like this:

    Checkpoint: I have started Tails from DVD and logged in with an administration password and the network is connected
      Given I have started Tails from DVD without network and logged in with an administration password
      And the network is plugged
calling as root: /usr/local/sbin/tor-has-bootstrapped
call returned: [1, "", ""]
calling as root: /usr/local/sbin/tor-has-bootstrapped
call returned: [1, "", ""]
calling as root: /usr/local/sbin/tor-has-bootstrapped
call returned: [1, "", ""]
calling as root: /usr/local/sbin/tor-has-bootstrapped
call returned: [1, "", ""]
calling as root: /usr/local/sbin/tor-has-bootstrapped
calling as root: journalctl SYSLOG_IDENTIFIER=restart-tor

Presumably both sides ended up waiting for an answer => deadlock.

It happens in wait_until_tor_is_working():

def wait_until_tor_is_working
  try_for(270) { $vm.execute('/usr/local/sbin/tor-has-bootstrapped').success? }
rescue Timeout::Error => e
  c = $vm.execute("journalctl SYSLOG_IDENTIFIER=restart-tor")
[...]

So I can see how a try_for that times out could create this situation: the last tor-has-bootstrapped call is aborted while we wait for its response, but before it arrives, and when we then call journalctl we should receive both responses but discard the first one (which is logged). However, I only see five runs of the try_for block, so only about five seconds had passed out of the 270-second timeout. So this is still a mystery to me.
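
For comparison, this is roughly how a client could tag each request and drop stale replies (a hypothetical sketch; the real remote shell's wire format may differ):

require 'json'

# Hypothetical client that tags every request with an id, so a reply left
# over from a call aborted by a try_for timeout can be recognized and dropped.
class RemoteShellClient
  def initialize(socket)
    @socket = socket
    @next_id = 0
  end

  def execute(cmd, user: 'root')
    id = (@next_id += 1)
    @socket.puts(JSON.generate({ 'id' => id, 'user' => user, 'cmd' => cmd }))
    loop do
      response = JSON.parse(@socket.gets)
      return response if response['id'] == id
      # Response to an earlier, aborted call: log it and keep reading.
      warn "Discarding out-of-order response: #{response.inspect}"
    end
  end
end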

Just a thought: this might be some sort of race that affects the slow non-virtio remote shell as well; it just happens more often at this awesome speed.

#5 Updated by anonym almost 3 years ago

And now I've seen these two deadlocks:

# just restored from a snapshot
calling as root: echo 'hello?'
call returned: [0, "hello?\n", ""]
calling as root: /sbin/ifconfig eth0 | grep -q 'inet addr'
call returned: [1, "", ""]
calling as root: date -s '@1477525308'
# deadlock
# first boot
calling as root: echo 'hello?'
call returned: [0, "hello?\n", ""]
calling as root: service tor status
call returned: [3, "● tor.service - Anonymizing overlay network for TCP (multi-instance-master)\n   Loaded: loaded (/lib/systemd/system/tor.service; disabled)\n   Active: inactive (dead)\n", ""]
opening file /etc/tor/torrc in 'append' mode
append complete
[log] CLICK on (642,449)
calling as root: loginctl
call returned: [0, "   SESSION        UID USER             SEAT            \n        c1        114 Debian-gdm       seat0           \n\n1 sessions listed.\n", ""]
calling as root: loginctl
call returned: [0, "   SESSION        UID USER             SEAT            \n        c1        114 Debian-gdm       seat0           \n         1       1000 amnesia          seat0           \n        c5          0 root                             \n\n3 sessions listed.\n", ""]
calling as amnesia: gsettings set org.gnome.desktop.session idle-delay 0
call returned: [0, "", ""]
calling as amnesia: gsettings set org.gnome.desktop.interface toolkit-accessibility true
# deadlock

Add to this that while I was developing this branch, I very frequently hit the issue that the remote shell got deadlocked within the first ten calls after it started for the first time, but I thought I had fixed that by making sure both sides flush the socket they use for the virtio channel. Anyway, there seems to be an increased risk of these deadlocks "early", for some definition of "early".

I guess I need to add some sort of deep debug logging to understand what happens. The only problem is that normally we'd use the remote shell to obtain those logs from the guest. Perhaps we can revive the serial link just for that purpose. :)
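
One way to revive the serial link purely for logging (a sketch under the assumption that the guest exposes /dev/ttyS0 and that we are free to write the server's debug output there; none of this is taken from the Tails code base):

# Hypothetical guest-side helper: write remote shell debug lines to the
# serial device, so the host can collect them from the character device
# libvirt attaches to that port, without going through the remote shell.
def serial_debug(message)
  File.open('/dev/ttyS0', 'a') do |serial|
    serial.puts("remote-shell: #{message}")
  end
rescue Errno::ENOENT, Errno::EACCES
  # No serial device or no permission: silently drop the debug line.
end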

#6 Updated by bertagaz almost 3 years ago

anonym wrote:

Just a thought: this might be some sort of race that affects the slow non-virtio remote shell as well; it just happens more often at this awesome speed.

It is. I thought I had seen it while working on #10777 and mentioned it there, but it doesn't seem so. I'm pretty sure I noticed it anyway, and that it was one of the reasons the remote shell sometimes failed.

#7 Updated by bertagaz almost 3 years ago

bertagaz wrote:

It is. I thought I had seen it while working on #10777 and mentioned it there, but it doesn't seem so. I'm pretty sure I noticed it anyway, and that it was one of the reasons the remote shell sometimes failed.

Found it, that was in #11846#note-8

#8 Updated by anonym almost 3 years ago

bertagaz wrote:

bertagaz wrote:

It is. I thought I had seen it while working on #10777 and mentioned it there, but it doesn't seem so. I'm pretty sure I noticed it anyway, and that it was one of the reasons the remote shell sometimes failed.

Found it, that was in #11846#note-8

This is not the same issue, AFAICT. Dropping out-of-order responses is expected when VM.execute() is called inside a try_for() block (since we might abort after sending the request to the remote shell, but before we get the response). In fact, I'm shocked we haven't experienced more issues before, when we did not handle this. I guess it's because we mostly fail the run when a try_for() fails, so the remote shell is not used any more and we are safe -- if we did try to use it, we'd get the response from the aborted command, which could cause all sorts of trouble.

The deadlocks I'm talking about happen even if there's no try_for(), and in fact no exceptions at all are thrown. I suspect this is a low-level issue, perhaps related to IO buffering on the socket (buffering should be disabled and the socket @flush()@ed, on both the client and the server). And it's pretty rare; our test suite makes around 5000 calls to the remote shell each run, and if you look at the Jenkins results for the branch, there's about one deadlock per run, at random places. I have never seen more than one happen in a run, and only a few runs where it doesn't happen at all.
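
A minimal sketch of the flushing idea (the socket object and helper are hypothetical, not the remote shell's actual code):

# Make the write path as direct as possible: disable Ruby's internal write
# buffering and flush explicitly before blocking on the reply, so a request
# cannot sit in a userspace buffer while both ends wait on each other.
def send_request(socket, payload)
  socket.sync = true
  socket.write(payload + "\n")
  socket.flush
  socket.gets  # block until the server answers
end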

This is a tricky one, and I'm currently trying to figure out how to debug it. The problem is that we can only easily debug one side (the client, in the test suite) since we cannot easily access the log of the remote shell server -- the easy way would be to use the remote shell itself! :)

#9 Updated by anonym almost 3 years ago

  • Target version changed from Tails_2.9.1 to Tails 2.10

Given the problems I'm seeing, I think I'll split out the virtio socket work into its own branch, so the other improvements can be included faster. Soon!

#10 Updated by anonym almost 3 years ago

  • Status changed from Confirmed to In Progress
  • % Done changed from 0 to 10
  • Feature Branch deleted (test/remote-shell-file-operations)

To continue work on this, revert 9fb70d9c6e90ff18e4e2be97c6360fd8212aa0dd (in the test/11887-remote-shell-file-operations branch, if it hasn't been merged yet).

#11 Updated by anonym almost 3 years ago

  • Related to deleted (Feature #11887: Make the remote shell's file operations robust)

#12 Updated by anonym almost 3 years ago

  • Blocked by Feature #11887: Make the remote shell's file operations robust added

#13 Updated by anonym over 2 years ago

  • Target version deleted (Tails 2.10)

#14 Updated by anonym over 2 years ago

  • Feature Branch set to feature/11888-remote-shell-virtio-transport

I've revived the code now (based on test/12059-dogtail-optimizations (which in turn is based on test/11887-remote-shell-file-operations)). Let's see how it behaves on Jenkins.
