Test jobs sometimes get scheduled on a busy isotester while there are available ones
While investigating #10601, we discovered that sometimes, after a reboot_job completed, rather than starting the test job that triggered it for this isotester, Jenkins assigns that same isotester to another test job, so the first test job waits for hours for the other one to finish. See #10601#note-5 for details.
#1 Updated by intrigeri over 3 years ago
- Description updated (diff)
I suggest first setting up a very simple test case to confirm how job priority actually behaves, and whether our current configuration is based on a correct understanding of how the Priority Sorter plugin works (#10601#note-5 has more precise pointers about where this doubt of mine comes from).
Rationale: even if the bug isn't obvious in our current setup for some reason, I'd rather not keep configuration designed on erroneous assumptions: if that's the case, it will be confusing next time I have to debug weird race conditions.
#4 Updated by intrigeri about 3 years ago
- Subject changed from Test jobs sometimes get their isotester stolen by another one. to Test jobs sometimes get their isotester stolen by another one
I've just seen something similar happen again: https://jenkins.tails.boum.org/view/Tails_ISO/job/test_Tails_ISO_test-11588-usb-on-jenkins-10733/15/ is "(pending—Waiting for next available executor on isotester2) UPSTREAMJOB_BUILD_NUMBER=15" while https://jenkins.tails.boum.org/view/Tails_ISO/job/test_Tails_ISO_test-11588-usb-on-jenkins-10733/14/ is running on isotester2. Five other isotesters are available, so it's a shame that job 15 was scheduled on isotester2 as well and now has to wait for 3 hours before it's run.
Job 14 was run on Jul 31, 2016 9:18:49 PM by https://jenkins.tails.boum.org/job/wrap_test_Tails_ISO_test-11588-usb-on-jenkins-10733/14/, which also ran https://jenkins.tails.boum.org/job/reboot_job/8542/ with parameter RESTART_NODE=isotester2. The wrap job had
Job 15 was run on Jul 31, 2016 9:19:19 PM by https://jenkins.tails.boum.org/job/wrap_test_Tails_ISO_test-11588-usb-on-jenkins-10733/15/, which also ran https://jenkins.tails.boum.org/job/reboot_job/8543/ with parameter RESTART_NODE=isotester2. The wrap job had
As said in the ticket description, I already investigated such a problem 8 months ago (#10601#note-5), so the next debugging steps should be easy, provided they are done before the corresponding system logs and Jenkins artifacts expire.
I believe this clearly answers the "We first need to see if this still happens or not" part of this ticket: something is wrong with our job priority setup.
#5 Updated by intrigeri about 3 years ago
- Subject changed from Test jobs sometimes get their isotester stolen by another one to Test jobs sometimes get scheduled on a busy isotester while there are available ones
Same thing as we speak, between https://jenkins.tails.boum.org/view/Tails_ISO/job/test_Tails_ISO_feature-from-intrigeri-for-2.6/7/ and job 9 on the same project: here again, 2 isotesters are free but job 9 is waiting for isotester1 to be available, while job 7 is running there.
(This problem generally arises when our CI is overloaded, which often happens in the last few days before a release. So once #10068 is done by end of June, I should notice the problem before the 3.15 or 3.16 release; if I don't, I'll be happy to call this fixed.)
#32 Updated by intrigeri about 2 months ago
- Target version changed from Tails_3.16 to Tails_3.17
#35 Updated by intrigeri about 11 hours ago
- Status changed from Confirmed to Needs Validation
I've adapted our config to be compatible with the latest version of the Priority Sorter plugin, and made the configured priorities correctly implement the documented design: https://git.tails.boum.org/jenkins-jobs/commit/?id=d5586f9642fa4cd63f49016cbe090fa863149db8 has the details. I'll keep an eye on this problem and will report back if I see it again.
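For illustration only, here is a hypothetical Jenkins Job Builder fragment, assuming the priority-sorter job property; the job names and priority numbers are invented and are not our actual configuration. With the Priority Sorter plugin, a lower number means a higher priority, so the idea is that test jobs outrank everything else competing for a freshly rebooted isotester:

```yaml
# Hypothetical sketch, not the actual Tails jenkins-jobs configuration.
# Priority Sorter convention: lower number = higher priority.
- job:
    name: test_Tails_ISO_example-branch
    properties:
      - priority-sorter:
          priority: 1    # highest: should win the freshly rebooted isotester
- job:
    name: build_Tails_ISO_example-branch
    properties:
      - priority-sorter:
          priority: 3    # lower priority than test jobs
```

Note that this only affects the ordering of jobs already competing in the queue; it does nothing about a job that grabs the node before the intended one is even queued.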
Note, however, that I don't think that fixing the priorities is sufficient to fully fix the race condition: between "the time when reboot_job puts the node back online" and "the time when test_Tails_ISO_ actually starts", on my local Jenkins I see that 9 seconds have passed. So there's still a chance that another job, that was already in the queue, and that is allowed to run on that node, starts there in the meantime. Such a job can be, for example, another wrap_test_*. At this point I can think of no easy way to fix this within the current design of our pipeline. Given the uncertainty around the future of our Jenkins setup, I won't spend much more time on it.
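To make that window concrete, here is a minimal Python sketch of the race (all node names, job names, timings, and the first-come-first-served scheduling rule are invented for illustration, not taken from Jenkins internals): the node comes back online, the intended test job only enters the queue a few seconds later, and an older queued job that is allowed on that node steals the executor in the meantime.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    name: str
    online_at: float          # when reboot_job put the node back online
    busy_until: float = 0.0   # when the currently running job will finish

@dataclass(order=True)
class QueuedJob:
    queued_at: float                      # queue insertion time (sort key)
    name: str = field(compare=False)
    duration: float = field(compare=False)

def schedule(node: Node, queue: list, now: float):
    """Give an idle, online node to the OLDEST queued job allowed there.
    Priorities play no role in this sketch: this is exactly the failure
    mode where another job steals the freshly rebooted isotester."""
    if now < node.online_at or now < node.busy_until or not queue:
        return None
    job = min(queue)          # oldest job wins, not the intended one
    queue.remove(job)
    node.busy_until = now + job.duration
    return job.name

# isotester2 comes back online at t=100; the intended test job only
# enters the queue 9 s later, at t=109 (the window observed above).
node = Node("isotester2", online_at=100.0)
queue = [
    QueuedJob(queued_at=95.0, name="wrap_test_other_branch", duration=3 * 3600),
    QueuedJob(queued_at=109.0, name="test_Tails_ISO_intended", duration=2 * 3600),
]
print(schedule(node, queue, now=109.0))   # the older job steals the node
```

Raising the intended job's priority shrinks but cannot close this window, since the stealing job may be dispatched before the intended one is queued at all; that is why I think priorities alone can't fully fix the race.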
I suspect that our time would be better invested in making the test suite jobs clean up properly after themselves when stuff crashes, like we did for the build jobs when we switched them to Vagrant. This way, we could drop the whole "reboot before running the test suite" dance. Most likely, the outcome of this effort would still be valuable even if we move to GitLab CI or whatnot, which is why I think it could be a more worthwhile investment than fixing this part of our Jenkins setup 4 years after the problem was spotted. Also, making the test suite jobs clean up after themselves reliably is needed if we want to have nodes that can run both builds & tests, which would provide great performance improvements to our feedback loop.