Feature #5288: Run the test suite automatically on autobuilt ISOs
isotesterN:s are sometimes put offline and never back online
I see that isotester2 and isotester4 are currently offline from Jenkins' PoV. They page (https://jenkins.tails.boum.org/computer/isotester2/) reads "Disconnected by anonymous : Taken_down_for_reboot". They both are running fine.
For isotester2, the last message I see in the logs is:
Nov 21, 2015 6:09:58 PM jenkins.slaves.JnlpSlaveAgentProtocol$Handler$1 onClosed WARNING: NioChannelHub keys=6 gen=72376715: Computer.threadPoolForRemoting [#1] for + isotester2 terminated java.io.IOException: Connection aborted: org.jenkinsci.remoting.nio.NioChannelHub$MonoNioTransport@2bbbdba1[name=isotester2]
... which seems to be normal when we take a slave down (I've compared with another isotesterN that was successfully taken offline then back on). This means that it was taken offline 1 day and 14 hours ago, which happens to be its uptime, so it did reboot just fine.
My conclusion is that our way of putting a node back online while it's rebooting is not working reliably.
I'm curious how often this happens. I don't remember seeing this recently, but it might be that someone else has been dealing with it manually.
#3 Updated by intrigeri about 4 years ago
- Subject changed from isotesterN sometimes don't come back up online after rebooting to isotesterN:s sometimes disappear
- Status changed from Confirmed to In Progress
- Priority changed from Normal to High
- % Done changed from 0 to 10
I've looked a bit closer at what happened when isotester2 got offline and never back.
https://jenkins.tails.boum.org/job/wrap_test_Tails_ISO_feature-jessie/64/ was scheduled on isotester2 on Nov 21, 2015 6:09:47 PM, triggerred https://jenkins.tails.boum.org/job/reboot_job/1412/, scheduled on Nov 21, 2015 6:09:49 PM and took 6.1 seconds. The back online HTTP request ended at 18:09:56. This is confirmed by the Jenkins log:
Nov 21, 2015 6:09:56 PM hudson.model.Run execute INFO: reboot_job #1412 main build action completed: SUCCESS Nov 21, 2015 6:09:56 PM hudson.model.Run execute INFO: wrap_test_Tails_ISO_feature-jessie #64 main build action completed: SUCCESS
https://jenkins.tails.boum.org/job/wrap_test_Tails_ISO_experimental/56/ was scheduled on isotester2 on Nov 21, 2015 6:09:56 PM. It failed (https://jenkins.tails.boum.org/job/wrap_test_Tails_ISO_experimental/56/console) because isotester2 disappeared at 18:09:58, which is not surprising given we asked it to reboot at 6:09:49 PM.
So I think we have two problems, and the former causes the latter:
- our offline + reboot + online dance is racy (we knew that already in theory) enough to allow, in practice, another
wrap_test_Tails_ISO_*job to be scheduled at the exact time when we it should not;
- when the 2nd wrapper job fails due to loss of connection, Jenkins marks the slave as offline and it never gets back up from Jenkins' PoV.
If we fix the first problem, then the 2nd one can't happen anymore and we're good. Let's do that.
The main problem is then: we put the slave back online at (shutdown request + 5 seconds), while we should do that after the
jenkins-slave service has been stopped. We have already discussed this at length, and we have already agreed on a plan for if/when this happens: #9486#note-49. I'm thus going to raise the sleep time (as you can guess I still feel the right way to order this deterministically is a simple systemd unit file, but I still can live with the pragmatic approach for now).
You said such problems would be "handled with top-priority", so bumping priority.
#5 Updated by intrigeri about 4 years ago
- Assignee changed from intrigeri to bertagaz
(This should ideally be tracked in a dedicated ticket, but it seems it might be strongly related to the changes I did to address this one, so I'm keeping the discussion here until we figure it out.)
So, I've not seen isotesters disappearing anymore, but now I've seen 2 different test jobs (https://jenkins.tails.boum.org/job/test_Tails_ISO_feature-8415-overlayfs/16/ and https://jenkins.tails.boum.org/job/test_Tails_ISO_bugfix-9713-electrum-2.5/7/) scheduled more or less at the same time on isotester1, and the latter is waiting in queue for isotester1... while isotester4 is idling.
Is this a known issue I wasn't aware of?
Or is it happening because I bumped the sleep to 30 seconds? If that's the case, then it looks like the reasoning behind preventing raciness with the infamous
sleep isn't quite right.
Nov 26, 2015 9:48:59 AM hudson.model.Run execute INFO: build_Tails_ISO_feature-8415-overlayfs #23 main build action completed: SUCCESS Nov 26, 2015 9:59:57 AM hudson.model.Run execute INFO: build_Tails_ISO_bugfix-9713-electrum-2.5 #10 main build action completed: SUCCESS Nov 26, 2015 10:42:18 AM hudson.model.Run execute INFO: test_Tails_ISO_feature-5926-freezable-apt-repository #19 main build action completed: FAILURE
Completion of the last test job above frees isotester1, and then:
Nov 26, 2015 10:42:52 AM hudson.model.Run execute INFO: reboot_job #1495 main build action completed: SUCCESS Nov 26, 2015 10:42:52 AM hudson.model.Run execute INFO: wrap_test_Tails_ISO_feature-8415-overlayfs #16 main build action completed: SUCCESS Nov 26, 2015 10:43:22 AM hudson.model.Run execute INFO: reboot_job #1496 main build action completed: SUCCESS Nov 26, 2015 10:43:22 AM hudson.model.Run execute INFO: wrap_test_Tails_ISO_bugfix-9713-electrum-2.5 #7 main build action completed: SUCCESS
I've tried to understand what happens. At 10:42:52, we tell Jenkins master that isotester1 is back online. So far, so good. And then, why is
https://jenkins.tails.boum.org/job/wrap_test_Tails_ISO_bugfix-9713-electrum-2.5/7/console run on it, instead of https://jenkins.tails.boum.org/job/test_Tails_ISO_feature-8415-overlayfs/16/consoleFull (that should in theory be waiting in queue for isotester1 already)? I believe that's the main problem here, because as a result the overlayfs job waits for one more reboot, and more importantly the electrum one will be blocked for hours, while it could have run immediately on a different slave. https://tails.boum.org/blueprint/automated_builds_and_tests/jenkins/ says that priorities are set in a way that avoids such a situation, so it seems that this does not work as intended.
On https://wiki.jenkins-ci.org/display/JENKINS/Priority+Sorter+Plugin I read "Please note that lower number means higher priority", but that's about 3.4 while we're running 1.3, and on our Jenkins job configuration page I read "A job with higher priority is ran before jobs with lower priorities".
bertagaz, I'll let you assess how much of a problem this is, and if it's a serious one, I'll let you think about how to fix it. If help is needed to write systemd unit files to replace calls to
sleep, let me know.
#7 Updated by intrigeri about 4 years ago
I wonder how this could be related to the OOM we've seen lately that made some of our VM being shut down.
I don't see how it can possibly relate to the problem at hand: we're talking of well-understood reboots here. I think it'll be clear to you if you re-read carefully the ticket description and my analysis.
Let see if we see this issue happening again.
Shall I file another ticket for the follow-up bug I reported and analyzed on #10601#note-5, or do you plan to answer it here?
#9 Updated by intrigeri about 4 years ago
We'll keep an eye on it and see if it happens again. Whenever it does, drop here any useful info that can help future debugging. It's not expected that we'll fully solve that for 2.0 (as actually the problem this ticket is about may be fixed already).
It would be good to see the other problem(s?) discovered on this ticket tracked appropriately elsewhere.
#18 Updated by bertagaz almost 4 years ago
- Assignee changed from bertagaz to intrigeri
- % Done changed from 20 to 60
- QA Check set to Ready for QA
I did not look closely every day to see if the problem described in this ticket title still happens, but I don't remember having seen this lately. So I think that for this one we're good. I've created #11295 to track the other problem you found about test job getting their isotester stolen by another ones.
#19 Updated by intrigeri almost 4 years ago
- Status changed from In Progress to Resolved
- Assignee deleted (
- % Done changed from 60 to 100
- QA Check changed from Ready for QA to Pass
I did not look closely every day to see if the problem described in this ticket title still happens, but I don't remember having seen this lately.
Neither have I! Closing, then.
#20 Updated by intrigeri almost 2 years ago
- Status changed from Resolved to In Progress
- Assignee set to bertagaz
- Target version deleted (
- % Done changed from 100 to 10
- QA Check deleted (
- Target version changed from Tails_3.16 to Tails_3.17
In passing, I've replaced the calls to
toggleOffline (context-dependent, so inherently fragile and harder to reason about) with the
disable_node setters, that are nowadays available via the Jenkins API. This should avoid putting an isotester offline while we wanted to bring it back online (I've no idea if that problem happened in practice, but I prefer to not have to reason about it).