Feature #9430

Make our build system more robust vs. apt-get transient errors

Added by anonym about 4 years ago. Updated about 3 years ago.

Status: Resolved
Priority: Normal
Assignee: -
Category: Continuous Integration
Target version:
Start date: 05/19/2015
Due date:
% Done: 100%
Feature Branch: feature/9430-build-system-vs-apt-transient-errors
Type of work: Research
Blueprint:
Starter:
Affected tool:

Description

For instance, for build failures like

W: Failed to fetch http://ftp.us.debian.org/debian/dists/experimental/main/binary-i386/Packages  Hash Sum mismatch

E: Some index files failed to download. They have been ignored, or old ones used instead.
Fetched 43.1 MB in 19s (2224 kB/s)
P: Begin unmounting filesystems...

we should teach Jenkins to detect them and trigger a rebuild, instead of notifying the responsible party of this (very often) meaningless error.

It would probably be even better to make our build system retry a few times (+ change mirrors, if any?) on such errors before giving up for real.
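
To make the "retry a few times, then change mirrors" idea concrete, here is a minimal sketch in Python. It is not what our build system does: `run_with_retries`, the `fetch` callback and the mirror list are all hypothetical names, and the caller decides what "fetch" actually means (e.g. running `apt-get update` against a given mirror).

```python
from typing import Callable, Optional, Sequence


def run_with_retries(
    fetch: Callable[[str], bool],
    mirrors: Sequence[str],
    attempts_per_mirror: int = 3,
) -> Optional[str]:
    """Try each mirror in order, up to attempts_per_mirror times each.

    `fetch` is a hypothetical callback that returns True on success.
    Returns the mirror that worked, or None if every attempt failed.
    """
    for mirror in mirrors:
        for _ in range(attempts_per_mirror):
            if fetch(mirror):
                return mirror
    # Every mirror was exhausted: give up for real.
    return None
```

A real implementation would probably also want a short delay between attempts, so that a mirror that is mid-sync gets a chance to settle.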

Associated revisions

Revision 16c6f5c6 (diff)
Added by intrigeri about 4 years ago

APT: retry downloads 3 times before giving up.

Refs: #9430

History

#1 Updated by anonym about 4 years ago

  • Assignee set to intrigeri
  • Target version set to Tails_1.4.1

Please change the milestone as you see fit.

#2 Updated by intrigeri about 4 years ago

For now, just pasting what I wrote on -dev@:

I guess that's somehow possible with Jenkins only, but it most likely
requires twisting its semantics quite a bit. I'm happy to give
a closer look at it one of these days, in case there's a neat solution
to this problem => please give me a research ticket :)

However, long-term I think we'll have to use something like Zuul,
that's dedicated to orchestrating jobs and to mediating between our
Jenkins job needs, their result, and whatever action should be taken.
IIRC that's how the OpenStack project CI handles this kind
of problem.

In the short term, perhaps teaching APT or our build system to retry
such operations on failures would be a good enough, and muuuch
simpler, workaround.

#3 Updated by intrigeri about 4 years ago

  • Subject changed from Retry jenkins builds for transient errors to Have our build system more resistant to apt-get transient errors
  • Status changed from Confirmed to In Progress
  • % Done changed from 0 to 10

Looking into "teaching APT or our build system to retry such operations on failures" first.

First, in the APT configuration, for the Acquire group I see:

  • Retries: Number of retries to perform. If this is non-zero APT will retry failed files the given number of times.
  • ForceIPv4 might be useful: our networking config doesn't support IPv6, which might cause issues
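
As a rough illustration of the two options above, both can be passed to `apt-get` on the command line with `-o`. The helper below is a sketch only: `apt_get_command` is a hypothetical name, it merely builds the argument list without running anything, and the default of 3 retries is taken from the commit referenced on this ticket.

```python
def apt_get_command(action, retries=3, force_ipv4=True):
    """Build an apt-get argument list with the Acquire options set.

    Hypothetical helper for illustration; it does not run apt-get.
    """
    cmd = ["apt-get", "-o", f"Acquire::Retries={retries}"]
    if force_ipv4:
        # Only relevant where the network setup lacks IPv6 support.
        cmd += ["-o", "Acquire::ForceIPv4=true"]
    cmd.append(action)
    return cmd
```

The resulting list can be handed to `subprocess.run(..., check=True)`. The same settings could instead live in an APT configuration file, which is likely the better place for them in a build chroot.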

One should also look into acng's configuration options.

#4 Updated by intrigeri about 4 years ago

  • Feature Branch set to feature/9430-build-system-vs-apt-transient-errors

#5 Updated by intrigeri about 4 years ago

  • Subject changed from Have our build system more resistant to apt-get transient errors to Make our build system more resistant to apt-get transient errors

#6 Updated by intrigeri about 4 years ago

  • % Done changed from 10 to 20

Merged that into experimental, we'll see if transient errors still happen on Jenkins for that branch.

#7 Updated by intrigeri about 4 years ago

Also note that once #5926 is done, we won't have problems with hitting differently sync'd mirrors and hashsum mismatches anymore.

#8 Updated by intrigeri about 4 years ago

  • Subject changed from Make our build system more resistant to apt-get transient errors to Make our build system more robust to apt-get transient errors

#9 Updated by intrigeri about 4 years ago

  • Subject changed from Make our build system more robust to apt-get transient errors to Make our build system more robust vs. apt-get transient errors

#10 Updated by intrigeri almost 4 years ago

  • Target version changed from Tails_1.4.1 to Tails_1.5

Let's give it a few more weeks to see if the changes merged into experimental make a difference at all.

#11 Updated by intrigeri almost 4 years ago

  • Target version changed from Tails_1.5 to Tails_1.6

I need to lighten my plate.

#12 Updated by intrigeri almost 4 years ago

Look for "network problems" in bin/reproducible_maintenance.sh in https://anonscm.debian.org/gitweb/?p=qa/jenkins.debian.net.git -- it reschedules builds that failed due to network problems.

#13 Updated by intrigeri almost 4 years ago

We could try configuring acng on apt.lizard to use httpredir.debian.org, which could help.

#14 Updated by intrigeri almost 4 years ago

intrigeri wrote:

We could try configuring acng on apt.lizard to use httpredir.debian.org, which could help.

Done, let's see how it goes.

#15 Updated by bertagaz almost 4 years ago

intrigeri wrote:

Look for "network problems" in bin/reproducible_maintenance.sh in https://anonscm.debian.org/gitweb/?p=qa/jenkins.debian.net.git -- it reschedules builds that failed due to network problems.

Hmmm well, that's a serious bit of scripts. They seem to create their own sqlite database to gather data and act on them. We may prefer other options.

#16 Updated by intrigeri almost 4 years ago

Hmmm well, that's a serious bit of scripts. They seem to create their own sqlite database to gather data and act on them. We may prefer other options.

The idea I meant to point to is: grep failed build logs for 'E: Failed to fetch.*(Connection failed|Size mismatch|Cannot initiate the connection to|Bad Gateway)', and restart those that match.
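
A minimal sketch of that log check in Python, using the pattern above verbatim. `is_transient_failure` is a hypothetical name, and how the build gets restarted afterwards is left out entirely.

```python
import re

# Pattern quoted from the comment above, unchanged.
TRANSIENT_APT_ERROR = re.compile(
    r"E: Failed to fetch.*"
    r"(Connection failed|Size mismatch|Cannot initiate the connection to|Bad Gateway)"
)


def is_transient_failure(log_text: str) -> bool:
    """Return True if any line of the build log matches the pattern."""
    return any(TRANSIENT_APT_ERROR.search(line) for line in log_text.splitlines())
```

Note that this pattern, as written, would not match the "W: Failed to fetch ... Hash Sum mismatch" line from the description, so it would likely need extending to cover that case.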

#17 Updated by intrigeri almost 4 years ago

intrigeri wrote:

intrigeri wrote:

We could try configuring acng on apt.lizard to use httpredir.debian.org, which could help.

Done, let's see how it goes.

It made things worse, so I reverted it, but I kept the upgraded acng to see whether that is what's causing the issues (rather than httpredir).

#19 Updated by intrigeri almost 4 years ago

I should retry httpredir on apt-proxy.lizard: I'm told that some mirrors were broken precisely during the days when I tested it, so it might be that the failures we've seen were _not_ caused by httpredir.

#20 Updated by intrigeri almost 4 years ago

  • Target version changed from Tails_1.6 to Tails_1.7

#21 Updated by intrigeri over 3 years ago

#22 Updated by intrigeri over 3 years ago

  • Target version changed from Tails_1.7 to Tails_2.3

#5926 will magically solve 99% of this problem, so IMO I should not waste time trying to fix it differently here.

#23 Updated by intrigeri about 3 years ago

  • Target version changed from Tails_2.3 to Tails_2.4

#24 Updated by intrigeri about 3 years ago

  • Target version changed from Tails_2.4 to Tails_2.5

It's only during the next cycle that we can confirm that the freezable APT repo has indeed improved things in this respect (even if I really can't see how it could not be the case).

#25 Updated by intrigeri about 3 years ago

  • Status changed from In Progress to Fix committed
  • Assignee deleted (intrigeri)
  • Target version changed from Tails_2.5 to Tails_2.4
  • % Done changed from 20 to 100

I've thought about it more, and I can't see how our freezable APT repo would not solve this.

#26 Updated by anonym about 3 years ago

We have enough of #5926 solved so that the blocker can be removed, and this ticket resolved.

#27 Updated by anonym about 3 years ago

#28 Updated by anonym about 3 years ago

  • Status changed from Fix committed to Resolved
