
Feature #11009: Improve ISO building and testing throughput and latency

Parallelize our ISO building workload on more builders

Added by intrigeri over 3 years ago. Updated over 3 years ago.

Status: Resolved
Priority: Normal
Assignee: -
Category: Infrastructure
Target version:
Start date: 01/26/2016
Due date:
% Done: 100%
Feature Branch:
Type of work: Sysadmin
Starter:
Affected tool:

Description

The problem described in #8072 is back: quite often, ISO builds triggered by Jenkins queue up, and the latency between a developer pushing to a branch and the resulting ISO being ready to download and test automatically keeps increasing. This is explained by changes that made the build substantially slower: the move to Jessie, a new language added to the website, and the Installation Assistant. We need to cope with it, somehow.

First of all, let's note that we initially planned to give 4 vcpus to each isobuilder, while we currently give them 8 each. IIRC we did that because we had no better use for our vcpus back then; nowadays we do.

In my book, these 4 bonus vcpus should only speed up the part of the build that parallelizes well, i.e. the SquashFS compression, which takes around 11 minutes these days. So:

  1. on our current hardware, it would be wasteful to try to improve our ISO building latency by making each individual isobuilder faster; parallelizing this workload over more VMs should work much better;
  2. in theory, if we give only 4 vcpus to each isobuilder, and as a result mksquashfs is twice as slow, it would only make the build last about 12.5% longer (see the sketch below), which feels acceptable if it allows us to double the number of ISO builders we run, and in turn to solve the congestion problem we have.
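
For the record, a minimal sketch of the arithmetic behind point 2, with illustrative numbers only (a ~44-minute build containing ~11 minutes of SquashFS compression, i.e. this ticket's ballpark figures):

    # Toy model: only the SquashFS step is assumed to scale with vcpus,
    # so halving the vcpus stretches just that step.  The 44/11 minute
    # figures are illustrative, not measurements.

    def relative_increase(total_min, squashfs_min, slowdown):
        """Relative growth of the total build time when only the
        SquashFS step becomes `slowdown` times slower."""
        new_total = (total_min - squashfs_min) + squashfs_min * slowdown
        return new_total / total_min - 1

    for slowdown in (1.5, 2.0):
        pct = relative_increase(44.0, 11.0, slowdown) * 100
        print(f"mksquashfs {slowdown:.1f}x slower -> build ~{pct:.1f}% longer")

On these numbers a full 2x slowdown would cost ~25%; the 12.5% figure corresponds to mksquashfs slowing down by about 1.5x, consistent with it not scaling linearly across 8 vcpus in the first place.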

At first glance, I think we should run 3 or 4 ISO builders, with 4 vcpus each. Let's see how doable that is:

  • vcpus: as explained above, we can simply reclaim some of the bonus vcpus allocated a year ago to our current isobuilders; that is, if we're not ready to try overallocating vcpus (most of the time, not all isobuilders are in use at the same time, so overallocation would make sense);
  • RAM: #11010 gave us enough RAM for 1 or 2 more builders;
  • disk space: for 2 additional ISO builders, we need 2×10 GiB; we have some slack in our storage plan for this year, and we can still reclaim some space here and there, so we should be good on this side (a rough tally is sketched below).
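
For completeness, a tiny tally of what 3 or 4 builders would cost in resources, using the figures from this list (the per-builder RAM value is a placeholder, since #11010 tracks the actual amount):

    # Resource tally for N isobuilders; 4 vcpus and 10 GiB of disk per
    # builder come from this ticket, the RAM figure is hypothetical.
    VCPUS_PER_BUILDER = 4
    DISK_GIB_PER_BUILDER = 10
    RAM_GIB_PER_BUILDER = 8  # placeholder, see #11010

    for n in (3, 4):
        print(f"{n} builders: {n * VCPUS_PER_BUILDER} vcpus, "
              f"{n * RAM_GIB_PER_BUILDER} GiB RAM (assumed), "
              f"{n * DISK_GIB_PER_BUILDER} GiB disk")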

Related issues

Related to Tails - Feature #8072: Set up a second Jenkins slave to build ISO images Resolved 10/11/2014
Related to Tails - Feature #9264: Consider buying more server hardware to run our automated test suite Resolved 12/15/2015
Related to Tails - Feature #10996: Try running more isotester:s on lizard Resolved 01/25/2016
Blocked by Tails - Feature #11010: Give lizard v2 more RAM Resolved 01/27/2016

History

#1 Updated by intrigeri over 3 years ago

  • Related to Feature #8072: Set up a second Jenkins slave to build ISO images added

#2 Updated by intrigeri over 3 years ago

  • Related to Feature #9264: Consider buying more server hardware to run our automated test suite added

#3 Updated by intrigeri over 3 years ago

  • Related to Feature #10996: Try running more isotester:s on lizard added

#4 Updated by intrigeri over 3 years ago

  • Blueprint set to https://tails.boum.org/blueprint/hardware_for_automated_tests_take2/

The blueprint takes this problem into account.

#5 Updated by intrigeri over 3 years ago

  • Parent task set to #11009

#6 Updated by intrigeri over 3 years ago

Meta: I'm setting 2.2 as the target version since this will be very easy once we have more RAM (#11010); but if it takes 1-2 more months to set up the additional ISO builders, no big deal.

#7 Updated by intrigeri over 3 years ago

#8 Updated by intrigeri over 3 years ago

  • Description updated (diff)

#9 Updated by intrigeri over 3 years ago

intrigeri wrote:

vcpus: as explained above, we can simply reclaim some of the bonus vcpus allocated a year ago to our current isobuilders;

Done, isobuilder{1,2} are now down to 4 vcpus.

#10 Updated by intrigeri over 3 years ago

  • Status changed from Confirmed to In Progress
  • % Done changed from 0 to 30

... and set up isobuilder{3,4}.

#11 Updated by intrigeri over 3 years ago

  • Target version changed from Tails_2.2 to Tails_2.3

https://jenkins.tails.boum.org/plugin/cluster-stats/ says the average wait time in queue is:

  • isobuilder2 2 hr 3 min
  • isobuilder1 1 hr 9 min

It's unclear whether this covers only the last 7 days, or the whole time we have been gathering stats with this plugin (1 month and 5 days). I could pin that down from the CSV this plugin provides, but whatever: I'll just come back to it in a month and see whether adding isobuilders changes something measurable in terms of wait time.

Regarding build duration: that web page gives me very low average values for our isobuilders (24 and 28 minutes), because it includes builds that fail very early, as well as some other, non-ISO-build jobs. So I don't think I can draw very useful conclusions from these values. The raw CSV data the same plugin gives me doesn't say whether a job run succeeded, so I can't use it to filter out failed and aborted builds. Nor can I pick a minimal duration below which I'd assume a build failed, because in practice builds apparently fail at any point. So, for build duration stats, I'll instead use the data I can get in XML from the Jenkins Global Build Stats plugin => adapted https://git-tails.immerda.ch/puppet-tails/tree/files/jenkins/master/successful-ISO-builds to output the average duration of successful ISO build runs (a sketch of this kind of aggregation closes this comment), and got:

  • 2015-11: 38.3 minutes
  • 2015-12: 39.1 minutes
  • 2016-01: 44.8 minutes
  • 2016-02: 46.8 minutes

... and so I'll have something to compare with in a month or so. Of course, running more ISO builds & tests in parallel is likely to raise the average duration; the question is how much, and where the latency/throughput sweet spot lies for our workload. Note that we've taken some action to reduce the website build time (which grew a lot recently), which will influence our numbers a bit, but rest assured that in the meantime we'll find other ways to increase build time.

Too bad we don't have a single source of raw data that gives us both the info we need for analyzing queue congestion, and the info we need to evaluate per-build performance, but whatever.
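
For future reference, here's a rough sketch of this kind of per-month aggregation. It uses the standard Jenkins JSON remote API rather than the Global Build Stats plugin's XML that the script above actually consumes, and the job URL is made up for the example:

    # Average monthly duration of successful builds, via the standard
    # Jenkins JSON remote API (not the Global Build Stats XML used by
    # the real script).  The job URL is a made-up example.
    import json
    from collections import defaultdict
    from datetime import datetime, timezone
    from urllib.request import urlopen

    JOB = "https://jenkins.tails.boum.org/job/build_Tails_ISO_devel"  # hypothetical

    # Note: by default Jenkins only returns the most recent builds here.
    url = JOB + "/api/json?tree=builds[result,duration,timestamp]"
    with urlopen(url) as response:
        builds = json.load(response)["builds"]

    totals = defaultdict(lambda: [0.0, 0])  # month -> [sum of minutes, count]
    for build in builds:
        if build["result"] != "SUCCESS":  # skip failed and aborted builds
            continue
        when = datetime.fromtimestamp(build["timestamp"] / 1000, tz=timezone.utc)
        month = when.strftime("%Y-%m")
        totals[month][0] += build["duration"] / 60000  # duration is in ms
        totals[month][1] += 1

    for month in sorted(totals):
        minutes, count = totals[month]
        print(f"{month}: {minutes / count:.1f} minutes over {count} successful builds")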

#12 Updated by intrigeri over 3 years ago

  • Target version changed from Tails_2.3 to Tails_2.4
  • % Done changed from 30 to 70

intrigeri wrote:

https://jenkins.tails.boum.org/plugin/cluster-stats/ says the average wait time in queue is:

  • isobuilder1 1 hr 9 min
  • isobuilder2 2 hr 3 min

It's unclear whether this covers only the last 7 days, or the whole time we have been gathering stats with this plugin (1 month and 5 days). I could pin that down from the CSV this plugin provides, but whatever: I'll just come back to it in a month and see whether adding isobuilders changes something measurable in terms of wait time.

Success! I now see:

  • isobuilder1 51 min
  • isobuilder2 1 hr 33 min
  • isobuilder3 5 min 50 sec
  • isobuilder4 1 hr 30 min

Regarding build duration: that web page gives me very low average values for our isobuilders (24 and 28 minutes), because it includes builds that fail very early, as well as some other, non-ISO-build jobs. So I don't think I can draw very useful conclusions from these values. The raw CSV data the same plugin gives me doesn't say whether a job run succeeded, so I can't use it to filter out failed and aborted builds. Nor can I pick a minimal duration below which I'd assume a build failed, because in practice builds apparently fail at any point. So, for build duration stats, I'll instead use the data I can get in XML from the Jenkins Global Build Stats plugin => adapted https://git-tails.immerda.ch/puppet-tails/tree/files/jenkins/master/successful-ISO-builds to output the average duration of successful ISO build runs, and got:

  • 2015-11: 38.3 minutes
  • 2015-12: 39.1 minutes
  • 2016-01: 44.8 minutes
  • 2016-02: 46.8 minutes

... and so I'll have something to compare with in a month or so. Of course, running more ISO builds & tests in parallel is likely to raise the average duration; the question is how much, and where the latency/throughput sweet spot lies for our workload.

Average build duration grew to 56 minutes in March (expected: caused by lowering the number of vcpus per builder, plus allowing Jenkins to run twice as many builds in parallel, which loads the system more).

Note that we already built 10% more ISOs than in February, so extra congestion probably accounts for part of the increase too.

I think a 10-minute hit on the build time is acceptable, given the improvements we got on the waiting time in queue (a quick end-to-end comparison is sketched below). So I'm tentatively calling this done, but I'll come back to it in a month to double-check.
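
As a naive sanity check of that trade-off, here's the comparison of average end-to-end latency (queue wait + build) before and after, using the numbers quoted in this ticket; averaging per-builder wait times is crude, since the builders aren't equally loaded:

    # End-to-end latency (queue wait + build) before vs. after adding
    # builders, using the averages quoted in this ticket.
    def minutes(h=0, m=0, s=0):
        return h * 60 + m + s / 60

    before_wait = [minutes(1, 9), minutes(2, 3)]  # isobuilder1, isobuilder2
    after_wait = [minutes(0, 51), minutes(1, 33), minutes(0, 5, 50), minutes(1, 30)]

    before = sum(before_wait) / len(before_wait) + 46.8  # February build time
    after = sum(after_wait) / len(after_wait) + 56.0     # March build time
    print(f"before: ~{before:.0f} min, after: ~{after:.0f} min end-to-end")

So even with the 10-minute hit on build time, on these numbers the average push-to-ISO latency dropped by roughly half an hour.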

#13 Updated by intrigeri over 3 years ago

  • Status changed from In Progress to Resolved
  • Assignee deleted (intrigeri)
  • % Done changed from 70 to 100

We're down to 53.5 minutes in April, which is probably explained by the fact that we built a bit fewer ISO images. Calling it done.
