Project

General

Profile

Feature #17387

Bug #16960: Make our CI feedback loop shorter

Consider disabling CPU vulnerabilities mitigation features in our CI builder/tester VMs

Added by intrigeri 3 months ago. Updated about 14 hours ago.

Status:
In Progress
Priority:
Normal
Assignee:
Category:
Continuous Integration
Target version:
Start date:
Due date:
% Done:

0%

Feature Branch:
Type of work:
Sysadmin
Blueprint:
Starter:
Affected tool:

Description

Recent kernels (including 4.9 LTS that we run on our CI builders/testers) offer a mitigations=off command line option, that disables all mitigation features for known CPU vulnerabilities. On my local Jenkins instance, I've measured that setting this option both on the l0 virtualization host and on the l1 builder/tester VMs gives me a 15% test suite performance boost; this cuts down the feedback loop by about ½ hour, which makes a significant difference to my developer experience. On lizard, I don't think we can reasonably disable these mitigations on the l0 virtualization host, so this proposal is only about the l1 builder/tester VMs. We won't know what kind of performance improvement we would get in this context without trying.

These l1 VMs are mostly used to run l2 VMs:

  • isobuilder:
    • The l2 VM is a Vagrant box, that's been built by the isobuilder itself; that l2 VM is trusted (otherwise we can't trust the Tails images we publish)
    • Given the kind of input data the l2 VM handles (mostly Debian packages) and how it handles it (mostly APT), it seems very unlikely that vulns such as Spectre and Meltdown can be exploited in there, towards escaping to l1 or gaining information about other processes running in the l1 VM.
    • The l1 VM runs essentially nothing else than this l2 VM, so a cross-process info leak would have no harmful consequences.
  • isotester:
    • Tails is running in the l2 VM.
    • The l2 VM mostly handles trusted input data (e.g. it loads our website in a web browser). Here as well, the way it handles untrusted input data (e.g. gpg --recv-key) does not leave much room for exploiting vulns such as Spectre or Meltdown.
    • The l1 VM runs essentially nothing else than this l2 VM, so a cross-process info leak would have no harmful consequences.

If we're concerned that mitigations=off increases the risk of l2 escape to l1, and thus increases the risk of lateral progression to more sensitive VMs hosted on the same l0 virt host, then we could sandbox the QEMU process that runs l2 in l1 more strictly, by enabling AppArmor for libvirtd in l1.
Given the above, I think this risk increase is too small to bother, but I could change my mind :)


Related issues

Related to Tails - Bug #17386: Consider disabling CPU vulnerabilities mitigation features in our Vagrant build box Resolved
Blocks Tails - Feature #13284: Core work: Sysadmin (Adapt our infrastructure) Confirmed 06/30/2017

History

#1 Updated by intrigeri 3 months ago

  • Parent task set to #16960

#2 Updated by intrigeri 3 months ago

  • Blocks Feature #13284: Core work: Sysadmin (Adapt our infrastructure) added

#3 Updated by intrigeri 3 months ago

Added to the agenda for upcoming sysadmin team meeting(s).

#4 Updated by intrigeri 2 months ago

  • Target version changed from Tails_4.3 to Tails_4.4

We had no time to discuss this during our last meeting, and the next one won't happen before 4.3.

#5 Updated by intrigeri about 1 month ago

  • Status changed from Confirmed to In Progress
  • Assignee changed from Sysadmins to intrigeri
  • Type of work changed from Discuss to Sysadmin

groente says benchmarking makes sense ⇒ if no significant perf benefits, we can skip the complex & painstaking security analysis.

I've set mitigations=off on all lizard isobuilders & isotesters. I'll come back to it in a few weeks to check if this improves performance measurably.

#6 Updated by intrigeri about 1 month ago

  • Related to Bug #17386: Consider disabling CPU vulnerabilities mitigation features in our Vagrant build box added

#7 Updated by intrigeri about 1 month ago

There's been a bunch of other recent changes that taint the first batch of build results. Here's the timeline:

  • Feb 20: set mitigations=off on all lizard isobuilders & isotesters
  • Feb 21: #17439 is merged
  • Feb 29: #17386 is merged

So I think I should now:

  1. analyze impact of mitigations=off on test suite performance (that data is not tainted by the above)
  2. wait another few weeks, to gather enough build perf data with mitigations=off
  3. remove mitigations=off
  4. wait a few weeks, to gather baseline build perf data without mitigations=off
  5. compare build perf results from steps 2 and 4

Hopefully no other change that significantly affects build performance will be merged in the meantime.

#8 Updated by intrigeri about 1 month ago

intrigeri wrote:

So I think I should now:

  1. analyze impact of mitigations=off on test suite performance (that data is not tainted by the above)

I'll use https://jenkins.tails.boum.org/job/test_Tails_ISO_stable/buildTimeTrend as my data source.
I'm only taking successful test suite runs into account, because failures skew the total run time.
Average test suite run time:

  • Feb 5-19 without mitigations=off, i.e. baseline: 307 minutes
  • Feb 21-29, with mitigations=off: 292 minutes

⇒ At the "one single test suite job run" zoom level, setting mitigations=off on our isotesters yields a 15 minutes (4.9%) test suite run time decrease, which is significant in itself.

And at the "our CI system" zoom level, when the system is heavily loaded and the queue constantly has job runs waiting, such improvements quickly accumulate with a snowball effect. For example:

  • The next job, that was waiting in the queue, will start 15 min earlier and last 15 min less, resulting in a 30 min CI feedback loop total decrease.
  • The 3rd job will start 30 min earlier, last 15 min less, resulting in a 45 min CI feedback loop total decrease.
  • If extra performance improvements are observed for image builds, these improvements will also add up to the test suite perf improvements, wrt. the total CI feedback loop duration, which is what matters.

I'll now wait for the build benchmark results.

#9 Updated by intrigeri about 1 month ago

  • Target version changed from Tails_4.4 to Tails_4.5

#10 Updated by intrigeri 20 days ago

intrigeri wrote:

  1. analyze impact of mitigations=off on test suite performance (that data is not tainted by the above)
  2. wait another few weeks, to gather enough build perf data with mitigations=off
  3. remove mitigations=off

Since the beginning of the month, on the stable branch, ignoring the minority of builds that had to rebuild the website instead of using the cached version: mean build time is 42.875 minutes.

I've now removed mitigations=off on isobuilder*.lizard and will come back to it in a few weeks to compare results.

#11 Updated by intrigeri about 14 hours ago

intrigeri wrote:

Since the beginning of the month, on the stable branch, ignoring the minority of builds that had to rebuild the website instead of using the cached version: mean build time is 42.875 minutes.

I've now removed mitigations=off on isobuilder*.lizard and will come back to it in a few weeks to compare results.

On the stable branch, since I removed mitigations=off, ignoring the minority of builds that had to rebuild the website instead of using the cached version: mean build time is 45.667 minutes.

So, to sum up, mitigations=off lowers the runtime of our jobs like this:

  • image builds: 2.8 minutes (6.1%)
  • image tests: 15 minutes (4.9%)
  • total for one build+test feedback loop: 17.8 minutes.

I've described on #17387#note-8 how these savings are cumulative in the situations that matter most, due to queue management and bottlenecks.

groente says benchmarking makes sense ⇒ if no significant perf benefits, we can skip the complex & painstaking security analysis.

So, it turns out that mitigations=off on {isobuilder,isotester}*.lizard does yield significant performance improvements.

I think we should now discuss the next steps, and how they fit in our various priorities.

Also available in: Atom PDF