Bug #10601

Feature #5288: Run the test suite automatically on autobuilt ISOs

isotesterN:s are sometimes put offline and never back online

Added by intrigeri almost 4 years ago. Updated 22 days ago.

Status: In Progress
Priority: Normal
Assignee:
Category: Continuous Integration
Target version:
Start date: 11/23/2015
Due date:
% Done: 10%
Feature Branch:
Type of work: Sysadmin
Blueprint:
Starter:
Affected tool:

Description

I see that isotester2 and isotester4 are currently offline from Jenkins' PoV. Their page (https://jenkins.tails.boum.org/computer/isotester2/) reads "Disconnected by anonymous : Taken_down_for_reboot". They are both actually running fine.

For isotester2, the last message I see in the logs is:

Nov 21, 2015 6:09:58 PM jenkins.slaves.JnlpSlaveAgentProtocol$Handler$1 onClosed
WARNING: NioChannelHub keys=6 gen=72376715: Computer.threadPoolForRemoting [#1] for + isotester2 terminated
java.io.IOException: Connection aborted: org.jenkinsci.remoting.nio.NioChannelHub$MonoNioTransport@2bbbdba1[name=isotester2]

... which seems to be normal when we take a slave down (I've compared with another isotesterN that was successfully taken offline and then brought back online). This means that it was taken offline 1 day and 14 hours ago, which happens to match its uptime, so it did reboot just fine.

My conclusion is that our way of putting a node back online while it's rebooting is not working reliably.

I'm curious how often this happens. I don't remember seeing this recently, but it might be that someone else has been dealing with it manually.


Related issues

Related to Tails - Feature #9486: Support running multiple instances of the test suite in parallel Resolved 06/25/2015
Related to Tails - Bug #11295: Test jobs sometimes get scheduled on a busy isotester while there are available ones In Progress 03/31/2016
Related to Tails - Bug #16959: Gather usability data about our current CI In Progress
Blocked by Tails - Bug #10068: Upgrade to Jenkins 2.x, using upstream packages Resolved 01/08/2018

History

#1 Updated by intrigeri almost 4 years ago

I've manually put both nodes back online and jobs were immediately scheduled on them. We'll see what happens next.

#2 Updated by intrigeri almost 4 years ago

  • Related to Feature #9486: Support running multiple instances of the test suite in parallel added

#3 Updated by intrigeri almost 4 years ago

  • Subject changed from isotesterN sometimes don't come back up online after rebooting to isotesterN:s sometimes disappear
  • Status changed from Confirmed to In Progress
  • Priority changed from Normal to High
  • % Done changed from 0 to 10

I've looked a bit closer at what happened when isotester2 got offline and never back.

https://jenkins.tails.boum.org/job/wrap_test_Tails_ISO_feature-jessie/64/ was scheduled on isotester2 on Nov 21, 2015 6:09:47 PM and triggered https://jenkins.tails.boum.org/job/reboot_job/1412/, which was scheduled on Nov 21, 2015 6:09:49 PM and took 6.1 seconds. The HTTP request that put the node back online completed at 18:09:56. This is confirmed by the Jenkins log:

Nov 21, 2015 6:09:56 PM hudson.model.Run execute
INFO: reboot_job #1412 main build action completed: SUCCESS
Nov 21, 2015 6:09:56 PM hudson.model.Run execute
INFO: wrap_test_Tails_ISO_feature-jessie #64 main build action completed: SUCCESS

https://jenkins.tails.boum.org/job/wrap_test_Tails_ISO_experimental/56/ was scheduled on isotester2 on Nov 21, 2015 6:09:56 PM. It failed (https://jenkins.tails.boum.org/job/wrap_test_Tails_ISO_experimental/56/console) because isotester2 disappeared at 18:09:58, which is not surprising given we asked it to reboot at 6:09:49 PM.

So I think we have two problems, and the former causes the latter:

  1. our offline + reboot + online dance is racy enough (we already knew that in theory) to allow, in practice, another wrap_test_Tails_ISO_* job to be scheduled at the exact time when it should not;
  2. when the 2nd wrapper job fails due to the lost connection, Jenkins marks the slave as offline and it never comes back online from Jenkins' PoV.

If we fix the first problem, then the 2nd one can't happen anymore and we're good. Let's do that.

The main problem, then, is that we put the slave back online at (shutdown request + 5 seconds), while we should only do that after the jenkins-slave service has been stopped. We have already discussed this at length, and we have already agreed on a plan for if/when this happens: #9486#note-49. I'm thus going to raise the sleep time (as you can guess, I still feel the right way to order this deterministically is a simple systemd unit file, but I can still live with the pragmatic approach for now).
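For reference, here is a minimal sketch of this dance as I understand it, in illustrative Python rather than the real job definitions: the node name, the reboot command, the endpoint usage and the sleep value are assumptions, not what reboot_job actually runs.

  #!/usr/bin/env python3
  # Illustrative sketch only -- NOT the actual reboot_job implementation.
  # Node name, SSH target, authentication and sleep value are assumptions.
  import subprocess
  import time
  import requests

  JENKINS = "https://jenkins.tails.boum.org"
  NODE = "isotester2"   # hypothetical: whichever isotester the wrapper ran on
  SLEEP_SECONDS = 5     # the value this note is about raising

  def toggle_offline(message=""):
      # Jenkins' /computer/<node>/toggleOffline endpoint flips the node's
      # temporarily-offline flag (a real setup may also need auth/a crumb).
      r = requests.post(f"{JENKINS}/computer/{NODE}/toggleOffline",
                        data={"offlineMessage": message})
      r.raise_for_status()

  # 1. Mark the node temporarily offline so no new job gets scheduled on it.
  toggle_offline("Taken_down_for_reboot")

  # 2. Ask the node to reboot (purely illustrative command).
  subprocess.run(["ssh", f"root@{NODE}", "systemctl", "reboot"], check=True)

  # 3. Hope the jenkins-slave service has stopped by the time we wake up:
  #    this is the racy part -- if we flip the flag back before the agent
  #    connection actually drops, another wrap_test_Tails_ISO_* job can be
  #    scheduled on the node right before it goes down.
  time.sleep(SLEEP_SECONDS)

  # 4. Mark the node online again; Jenkins reconnects the agent once the
  #    machine is back up.
  toggle_offline()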

You said such problems would be "handled with top-priority", so bumping priority.

#4 Updated by intrigeri almost 4 years ago

  • Assignee changed from bertagaz to intrigeri
  • % Done changed from 10 to 20

intrigeri wrote:

I'm thus going to raise the sleep time

Done, up to 30 seconds. While bertagaz is away I'll monitor this situation.

#5 Updated by intrigeri almost 4 years ago

  • Assignee changed from intrigeri to bertagaz

(This should ideally be tracked in a dedicated ticket, but it seems it might be strongly related to the changes I did to address this one, so I'm keeping the discussion here until we figure it out.)

So, I've not seen isotesters disappearing anymore, but now I've seen 2 different test jobs (https://jenkins.tails.boum.org/job/test_Tails_ISO_feature-8415-overlayfs/16/ and https://jenkins.tails.boum.org/job/test_Tails_ISO_bugfix-9713-electrum-2.5/7/) scheduled more or less at the same time on isotester1, and the latter is waiting in queue for isotester1... while isotester4 is idling.

Is this a known issue I wasn't aware of?

Or is it happening because I bumped the sleep to 30 seconds? If that's the case, then it looks like the reasoning behind preventing raciness with the infamous sleep isn't quite right.

Debugging info:

And logs:

Nov 26, 2015 9:48:59 AM hudson.model.Run execute
INFO: build_Tails_ISO_feature-8415-overlayfs #23 main build action completed: SUCCESS
Nov 26, 2015 9:59:57 AM hudson.model.Run execute
INFO: build_Tails_ISO_bugfix-9713-electrum-2.5 #10 main build action completed: SUCCESS
Nov 26, 2015 10:42:18 AM hudson.model.Run execute
INFO: test_Tails_ISO_feature-5926-freezable-apt-repository #19 main build action completed: FAILURE

Completion of the last test job above frees isotester1, and then:

Nov 26, 2015 10:42:52 AM hudson.model.Run execute
INFO: reboot_job #1495 main build action completed: SUCCESS
Nov 26, 2015 10:42:52 AM hudson.model.Run execute
INFO: wrap_test_Tails_ISO_feature-8415-overlayfs #16 main build action completed: SUCCESS
Nov 26, 2015 10:43:22 AM hudson.model.Run execute
INFO: reboot_job #1496 main build action completed: SUCCESS
Nov 26, 2015 10:43:22 AM hudson.model.Run execute
INFO: wrap_test_Tails_ISO_bugfix-9713-electrum-2.5 #7 main build action completed: SUCCESS

I've tried to understand what happens. At 10:42:52, we tell the Jenkins master that isotester1 is back online. So far, so good. And then, why was https://jenkins.tails.boum.org/job/wrap_test_Tails_ISO_bugfix-9713-electrum-2.5/7/console run on it, instead of https://jenkins.tails.boum.org/job/test_Tails_ISO_feature-8415-overlayfs/16/consoleFull (which should in theory already have been waiting in the queue for isotester1)? I believe that's the main problem here: as a result, the overlayfs job waits for one more reboot, and more importantly the electrum one will be blocked for hours, while it could have run immediately on a different slave. https://tails.boum.org/blueprint/automated_builds_and_tests/jenkins/ says that priorities are set in a way that avoids such a situation, so it seems that this does not work as intended.

On https://wiki.jenkins-ci.org/display/JENKINS/Priority+Sorter+Plugin I read "Please note that lower number means higher priority", but that documents version 3.4 while we're running 1.3, and on our Jenkins job configuration page I read "A job with higher priority is ran before jobs with lower priorities".

bertagaz, I'll let you assess how much of a problem this is, and if it's a serious one, I'll let you think about how to fix it. If help is needed to write systemd unit files to replace calls to sleep, let me know.
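To illustrate what I mean by replacing the sleep with something deterministic (whether implemented as a systemd unit or otherwise), here is a rough Python sketch that waits for the jenkins-slave service to actually be gone before the node is flipped back online; the service name, SSH access and polling interval are assumptions about our setup, not existing code.

  #!/usr/bin/env python3
  # Illustration only: wait for the jenkins-slave service to be stopped
  # instead of sleeping a fixed amount of time. Service name, SSH target
  # and polling interval are assumptions.
  import subprocess
  import time

  NODE = "isotester1"        # hypothetical node name
  SERVICE = "jenkins-slave"  # the service whose stop we want to order against

  def slave_service_running() -> bool:
      # `systemctl is-active --quiet` exits 0 only when the unit is active;
      # a failed SSH connection (host already rebooting) counts as stopped.
      result = subprocess.run(
          ["ssh", "-o", "ConnectTimeout=5", f"root@{NODE}",
           "systemctl", "is-active", "--quiet", SERVICE])
      return result.returncode == 0

  # Poll until the service is really gone...
  while slave_service_running():
      time.sleep(5)

  # ... and only then would we tell the Jenkins master to put the node
  # back online, closing the race window described in note #3.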

#6 Updated by bertagaz almost 4 years ago

I wonder how this could be related to the OOM we've seen lately that made some of our VMs shut down. Let's see if this issue happens again.

#7 Updated by intrigeri almost 4 years ago

I wonder how this could be related to the OOM we've seen lately that made some of our VMs shut down.

I don't see how it can possibly relate to the problem at hand: we're talking about well-understood reboots here. I think it'll be clear to you if you carefully re-read the ticket description and my analysis.

Let's see if this issue happens again.

Right.

Shall I file another ticket for the follow-up bug I reported and analyzed on #10601#note-5, or do you plan to answer it here?

#8 Updated by bertagaz almost 4 years ago

  • Target version changed from Tails_1.8 to Tails_2.0

Postponing

#9 Updated by intrigeri almost 4 years ago

We'll keep an eye on it and see if it happens again. Whenever it does, drop here any useful info that can help future debugging. It's not expected that we'll fully solve this for 2.0 (the problem this ticket is about may actually be fixed already).

It would be good to see the other problem(s?) discovered on this ticket tracked appropriately elsewhere.

#10 Updated by intrigeri almost 4 years ago

  • Blocks Feature #5288: Run the test suite automatically on autobuilt ISOs added

#11 Updated by bertagaz over 3 years ago

  • Target version changed from Tails_2.0 to Tails_2.2

#12 Updated by intrigeri over 3 years ago

  • Related to deleted (Feature #9486: Support running multiple instances of the test suite in parallel)

#13 Updated by intrigeri over 3 years ago

  • Blocks deleted (Feature #5288: Run the test suite automatically on autobuilt ISOs)

#15 Updated by intrigeri over 3 years ago

  • Parent task set to #5288

#16 Updated by intrigeri over 3 years ago

  • Related to Feature #9486: Support running multiple instances of the test suite in parallel added

#17 Updated by bertagaz over 3 years ago

  • Target version changed from Tails_2.2 to Tails_2.3

#18 Updated by bertagaz over 3 years ago

  • Assignee changed from bertagaz to intrigeri
  • % Done changed from 20 to 60
  • QA Check set to Ready for QA

I did not look closely every day to see if the problem described in this ticket title still happens, but I don't remember having seen this lately. So I think that for this one we're good. I've created #11295 to track the other problem you found, about test jobs getting their isotester stolen by other ones.

#19 Updated by intrigeri over 3 years ago

  • Status changed from In Progress to Resolved
  • Assignee deleted (intrigeri)
  • % Done changed from 60 to 100
  • QA Check changed from Ready for QA to Pass

bertagaz wrote:

I did not look closely every day to see if the problem described in this ticket title still happens, but I don't remember having seen this lately.

Neither have I! Closing, then.

#20 Updated by intrigeri over 1 year ago

  • Status changed from Resolved to In Progress
  • Assignee set to bertagaz
  • Target version deleted (Tails_2.3)
  • % Done changed from 100 to 10
  • QA Check deleted (Pass)

I see this happen regularly these days: at least once every two weeks I find such an isotester "taken down for reboot" for hours if not days. Every time, I switch it back online and it's good to go again.

See my initial debugging (#10601#note-5) and the related #11295.

#21 Updated by intrigeri over 1 year ago

  • Priority changed from High to Normal

#22 Updated by intrigeri over 1 year ago

  • Related to Bug #11295: Test jobs sometimes get scheduled on a busy isotester while there are available ones added

#24 Updated by u 7 months ago

  • Blocked by Bug #10068: Upgrade to Jenkins 2.x, using upstream packages added

#25 Updated by u 7 months ago

If I understand correctly, then upgrading the Priority Sorter plugin will magically fix that. bertagaz will probably have to do this upgrade for #10068 → so "blocked by".

#26 Updated by u 7 months ago

  • Assignee changed from bertagaz to intrigeri
  • Target version set to Tails_3.16

Once bertagaz has done the Jenkins update (#10068), intrigeri will check, ~ before releasing 3.16 (July-Aug 2019), whether this issue was indeed magically corrected by the update.

#27 Updated by intrigeri 7 months ago

#28 Updated by intrigeri 7 months ago

FTR this problem happened twice in the last few days, which have been heavy on Jenkins (lots of activity on WIP branches before the freeze for 3.13, probably explained in great part by the student syndrome).

#29 Updated by intrigeri 2 months ago

  • Target version changed from Tails_3.16 to Tails_3.17

u wrote:

Once bertagaz has done the Jenkins update (#10068), intrigeri will check, ~ before releasing 3.16 (July-Aug 2019), whether this issue was indeed magically corrected by the update.

bertagaz told me that he has made progress on #10068 but it is not done yet, so I'll postpone this by a month too.

#30 Updated by intrigeri 2 months ago

  • Subject changed from isotesterN:s sometimes disappear to isotesterN:s are sometimes put offline and never back online

#31 Updated by intrigeri about 2 months ago

  • Related to Bug #16959: Gather usability data about our current CI added

#32 Updated by intrigeri about 1 month ago

  • Target version changed from Tails_3.17 to Tails_4.0

#33 Updated by intrigeri 26 days ago

  • Status changed from In Progress to Needs Validation

#34 Updated by intrigeri 25 days ago

  • Type of work changed from Research to Sysadmin

#35 Updated by intrigeri 22 days ago

  • Status changed from Needs Validation to In Progress
