Bug #11295

Test jobs sometimes get scheduled on a busy isotester while there are available ones

Added by bertagaz over 3 years ago. Updated about 11 hours ago.

Status: Needs Validation
Priority: Normal
Assignee:
Category: Continuous Integration
Target version:
Start date: 03/31/2016
Due date:
% Done: 0%
Feature Branch:
Type of work: Research
Blueprint:
Starter: No
Affected tool:

Description

While investigating #10601, we discovered that sometimes, after a reboot_job completes, rather than starting the test job that triggered it on that isotester, Jenkins assigns the same isotester to another test job, so the first test job ends up waiting for hours until the other one is over. See #10601#note-5 for details.


Related issues

Related to Tails - Bug #10215: Suboptimal advance booking of Jenkins slaves for testing ISOs Resolved 09/17/2015
Related to Tails - Bug #10601: isotesterN:s are sometimes put offline and never back online Needs Validation 11/23/2015
Related to Tails - Bug #16959: Gather usability data about our current CI In Progress
Blocked by Tails - Bug #10068: Upgrade to Jenkins 2.x, using upstream packages In Progress 01/08/2018

History

#1 Updated by intrigeri over 3 years ago

  • Description updated (diff)

I suggest first setting up a very simple test case to confirm what the deal is with job priority, and whether our current configuration is based on a correct understanding of how the Priority Sorter plugin works (#10601#note-5 has more precise pointers about where this doubt of mine comes from).

Rationale: even if for some reason the bug isn't obvious in our current setup, I'd rather not keep configuration designed on erroneous assumptions: if that's the case, it will only confuse me the next time I have to debug weird race conditions.
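
Such a probe could be two throwaway jobs in our jenkins-jobs (Jenkins Job Builder) YAML that differ only in the priority they declare and are both restricted to the same node. This is only a sketch, assuming JJB's priority-sorter property matches the plugin version we run; the job names, node name, priority values and sleep builder are made up for illustration:

    - job:
        name: priority-probe-a
        node: isotester1
        properties:
          # Priority Sorter plugin: which of these two values "wins"
          # is exactly what this probe is meant to settle.
          - priority-sorter:
              priority: 50
        builders:
          - shell: 'sleep 300'

    - job:
        name: priority-probe-b
        node: isotester1
        properties:
          - priority-sorter:
              priority: 150
        builders:
          - shell: 'sleep 300'

Queue both probes while a third long-running job occupies isotester1; whichever probe starts first tells us whether lower or higher numbers take precedence with the plugin version we actually run.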

#2 Updated by bertagaz over 3 years ago

  • Target version changed from Tails_2.4 to Tails_2.5

#3 Updated by bertagaz over 3 years ago

  • Target version changed from Tails_2.5 to Tails_2.6

Probably won't have time to work on it before that.

#4 Updated by intrigeri about 3 years ago

  • Subject changed from Test jobs sometimes get their isotester stolen by another one. to Test jobs sometimes get their isotester stolen by another one

I've just seen something similar happen again: https://jenkins.tails.boum.org/view/Tails_ISO/job/test_Tails_ISO_test-11588-usb-on-jenkins-10733/15/ is "(pending—Waiting for next available executor on isotester2) UPSTREAMJOB_BUILD_NUMBER=15" while https://jenkins.tails.boum.org/view/Tails_ISO/job/test_Tails_ISO_test-11588-usb-on-jenkins-10733/14/ is running on isotester2. Five other isotesters are available, so it's a shame that job 15 was scheduled on isotester2 as well and now has to wait for 3 hours before it's run.

Job 14 was run on Jul 31, 2016 9:18:49 PM by https://jenkins.tails.boum.org/job/wrap_test_Tails_ISO_test-11588-usb-on-jenkins-10733/14/, which also ran https://jenkins.tails.boum.org/job/reboot_job/8542/ with parameter RESTART_NODE=isotester2. The wrap job had NODE_NAME=isotester2.

Job 15 was run on Jul 31, 2016 9:19:19 PM by https://jenkins.tails.boum.org/job/wrap_test_Tails_ISO_test-11588-usb-on-jenkins-10733/15/, which also ran https://jenkins.tails.boum.org/job/reboot_job/8543/ with parameter RESTART_NODE=isotester2. The wrap job had NODE_NAME=isotester2.

As said in the ticket description, I already investigated a similar problem 8 months ago (#10601#note-5), so the next debugging steps should be easy, if done before the corresponding system logs and Jenkins artifacts expire.

I believe this clearly answers the "We first need to see if this still happens or not" part of this ticket: something is wrong with our job priority setup.

#5 Updated by intrigeri about 3 years ago

  • Subject changed from Test jobs sometimes get their isotester stolen by another one to Test jobs sometimes get scheduled on a busy isotester while there are available ones

Same thing as we speak, between https://jenkins.tails.boum.org/view/Tails_ISO/job/test_Tails_ISO_feature-from-intrigeri-for-2.6/7/ and job 9 on the same project: here again, 2 isotesters are free but job 9 is waiting for isotester1 to be available, while job 7 is running there.

#6 Updated by intrigeri about 3 years ago

  • Related to Bug #10215: Suboptimal advance booking of Jenkins slaves for testing ISOs added

#7 Updated by anonym about 3 years ago

  • Target version changed from Tails_2.6 to Tails_2.7

#8 Updated by bertagaz almost 3 years ago

  • Target version changed from Tails_2.7 to Tails_2.9.1

#9 Updated by anonym almost 3 years ago

  • Target version changed from Tails_2.9.1 to Tails 2.10

#10 Updated by intrigeri almost 3 years ago

  • Target version changed from Tails 2.10 to Tails_2.11

#11 Updated by bertagaz over 2 years ago

  • Target version changed from Tails_2.11 to Tails_2.12

#12 Updated by bertagaz over 2 years ago

  • Target version changed from Tails_2.12 to Tails_3.0

#13 Updated by bertagaz over 2 years ago

  • Target version changed from Tails_3.0 to Tails_3.1

#14 Updated by bertagaz over 2 years ago

  • Target version changed from Tails_3.1 to Tails_3.2

#15 Updated by bertagaz about 2 years ago

  • Target version changed from Tails_3.2 to Tails_3.3

#16 Updated by bertagaz almost 2 years ago

  • Target version changed from Tails_3.3 to Tails_3.5

Realistically reschedule for 3.4.

#17 Updated by bertagaz almost 2 years ago

  • Target version changed from Tails_3.5 to Tails_3.6

#18 Updated by bertagaz over 1 year ago

  • Target version changed from Tails_3.6 to Tails_3.7

#19 Updated by intrigeri over 1 year ago

  • Description updated (diff)

#20 Updated by intrigeri over 1 year ago

  • Related to Bug #10601: isotesterN:s are sometimes put offline and never back online added

#22 Updated by bertagaz over 1 year ago

  • Target version changed from Tails_3.7 to Tails_3.8

#23 Updated by intrigeri about 1 year ago

  • Target version changed from Tails_3.8 to Tails_3.9

#24 Updated by intrigeri about 1 year ago

  • Target version changed from Tails_3.9 to Tails_3.10.1

#25 Updated by intrigeri 11 months ago

  • Target version changed from Tails_3.10.1 to Tails_3.11

#26 Updated by CyrilBrulebois 9 months ago

  • Target version changed from Tails_3.11 to Tails_3.12

#27 Updated by anonym 8 months ago

  • Target version changed from Tails_3.12 to Tails_3.13

#28 Updated by u 6 months ago

  • Blocked by Bug #10068: Upgrade to Jenkins 2.x, using upstream packages added

#29 Updated by u 6 months ago

If I understand correctly, upgrading the Priority Sorter plugin will magically fix this. bertagaz will probably have to do this upgrade as part of #10068, hence the "blocked by" relation.

#30 Updated by u 6 months ago

  • Assignee changed from bertagaz to intrigeri
  • Target version changed from Tails_3.13 to Tails_3.16

Once bertagaz has done the Jenkins update (#10068), intrigeri will check, around the 3.16 release (July-Aug 2019), whether this issue was indeed magically corrected by the update.

#31 Updated by intrigeri 6 months ago

(This problem generally arises when our CI is overloaded, which often happens in the last few days before a release. So once #10068 is done by the end of June, I should notice the problem before the 3.15 or 3.16 release if it is still there. If I don't, I'll be happy to call this fixed.)

#32 Updated by intrigeri about 2 months ago

  • Target version changed from Tails_3.16 to Tails_3.17

u wrote:

Once bertagaz has done the Jenkins update (#10068), intrigeri will check, around the 3.16 release (July-Aug 2019), whether this issue was indeed magically corrected by the update.

bertagaz told me that he has made progress on #10068 but it is not done yet, so I'll postpone this by a month too.

#33 Updated by intrigeri 21 days ago

  • Related to Bug #16959: Gather usability data about our current CI added

#34 Updated by intrigeri 10 days ago

  • Target version changed from Tails_3.17 to Tails_4.0

#35 Updated by intrigeri about 11 hours ago

  • Status changed from Confirmed to Needs Validation

I've adapted our config to be compatible with the latest version of the Priority Sorter plugin, and made the configured priorities correctly implement the documented design: https://git.tails.boum.org/jenkins-jobs/commit/?id=d5586f9642fa4cd63f49016cbe090fa863149db8 might help with this issue. I'll keep an eye on this problem and will report back if I see it again.
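
To give an idea of what that looks like in our jenkins-jobs YAML, without repeating the commit: each job template declares a Priority Sorter priority. The sketch below uses hypothetical values and a hypothetical relative ordering, and assumes the current plugin's convention that lower values are served first; the real numbers, and the project definitions that instantiate these templates, are in the commit linked above:

    - job-template:
        name: 'test_Tails_ISO_{name}'
        properties:
          - priority-sorter:
              # hypothetical: give test jobs a high priority so that, once
              # reboot_job puts an isotester back online, the queued test
              # job is first in line for it
              priority: 1

    - job-template:
        name: 'wrap_test_Tails_ISO_{name}'
        properties:
          - priority-sorter:
              # hypothetical: wrapper jobs must not outrank queued test jobs
              priority: 3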

Note, however, that I don't think that fixing the priorities is sufficient to fully fix the race condition: between "the time when reboot_job puts the node back online and wrap_test_* triggers test_Tails_ISO_" and "the time when test_Tails_ISO_ actually starts", on my local Jenkins I see that 9 seconds pass. So there's still a chance that another job, that was already in the queue and that is allowed to run on that node, starts there in the meantime. Such a job can be, for example, another wrap_test_*. At this point I can think of no easy way to fix this within the current design of our pipeline. Given the uncertainty around the future of our Jenkins setup, I won't spend much more time on it.

I suspect that our time would be better invested in making the test suite jobs clean up properly after themselves when stuff crashes, like we did for the build jobs when we switched them to Vagrant. This way, we could drop the whole "reboot before running the test suite" dance. Most likely, the outcome of this effort would still be valuable even if we move to GitLab CI or whatnot, which is why I think it could be a more worthwhile investment than fixing this part of our Jenkins setup 4 years after the problem was spotted. Also, making the test suite jobs clean up after themselves reliably is needed if we want nodes that can run both builds and tests, which would greatly speed up our feedback loop.
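
To make that idea a bit more concrete, here is a rough sketch of the kind of self-cleanup a test job could do, written as a Jenkins Job Builder shell builder. It is an illustration only: the TailsToaster domain-name prefix, the libvirt URI, the run_test_suite invocation and the surrounding template are assumptions about our setup, not something taken from this ticket or from the commit above:

    - job-template:
        name: 'test_Tails_ISO_{name}'
        builders:
          - shell: |
              #!/bin/sh
              set -eu
              # Destroy any test-suite VM left behind by a previous run that
              # crashed, so the node can be reused without a full reboot.
              cleanup() {
                  for dom in $(virsh --connect qemu:///system list --all --name \
                               | grep '^TailsToaster' || true); do
                      virsh --connect qemu:///system destroy "$dom" || true
                      virsh --connect qemu:///system undefine "$dom" || true
                  done
              }
              cleanup                     # leftovers from an earlier, crashed run
              trap cleanup EXIT INT TERM  # best effort after this run too
              ./run_test_suite            # actual options omitted

If something like this proved reliable enough, the reboot_job / wrap_test_* choreography, and with it this whole class of scheduling races, could go away.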
