Bug #11295

Test jobs sometimes get scheduled on a busy isotester while there are available ones

Added by bertagaz over 3 years ago. Updated 5 days ago.

Status:
Needs Validation
Priority:
Normal
Assignee:
Category:
Continuous Integration
Target version:
Start date:
03/31/2016
Due date:
% Done:

0%

Feature Branch:
jenkins-jobs:SponsorS-leftovers, puppet-tails:SponsorS-leftovers
Type of work:
Sysadmin
Blueprint:
Starter:
No
Affected tool:

Description

While investigating #10601, we discovered that sometimes after a reboot_job completed, rather than starting the test job that triggered it for this isotester, Jenkins assigns this same isotester to another test job, resulting in the first test job waiting for hours for the other one to be over. See #10601#note-5 for details.


Related issues

Related to Tails - Bug #10215: Suboptimal advance booking of Jenkins slaves for testing ISOs Resolved 09/17/2015
Related to Tails - Bug #10601: isotesterN:s are sometimes put offline and never back online Needs Validation 11/23/2015
Related to Tails - Bug #16959: Gather usability data about our current CI In Progress
Related to Tails - Feature #9486: Support running multiple instances of the test suite in parallel Resolved 06/25/2015
Related to Tails - Bug #17216: Make the test suite clean up after itself even in most failure modes Confirmed
Blocked by Tails - Bug #10068: Upgrade to Jenkins 2.x, using upstream packages Resolved 01/08/2018

History

#1 Updated by intrigeri over 3 years ago

  • Description updated (diff)

I suggest first setting up a very simple test case to confirm what's going on with job priority, and whether our current configuration is based on a correct understanding of how the Priority Sorter plugin works (#10601#note-5 has more precise pointers about where this doubt of mine comes from).

Rationale: even if the bug isn't obvious in our current setup for some reason, I'd rather not keep configuration designed on erroneous assumptions: if that's the case, it will be confusing next time I have to debug weird race conditions.

#2 Updated by bertagaz over 3 years ago

  • Target version changed from Tails_2.4 to Tails_2.5

#3 Updated by bertagaz over 3 years ago

  • Target version changed from Tails_2.5 to Tails_2.6

Probably won't have time to work on it before that.

#4 Updated by intrigeri over 3 years ago

  • Subject changed from Test jobs sometimes get their isotester stolen by another one. to Test jobs sometimes get their isotester stolen by another one

I've just seen something similar happen again: https://jenkins.tails.boum.org/view/Tails_ISO/job/test_Tails_ISO_test-11588-usb-on-jenkins-10733/15/ is "(pending—Waiting for next available executor on isotester2) UPSTREAMJOB_BUILD_NUMBER=15" while https://jenkins.tails.boum.org/view/Tails_ISO/job/test_Tails_ISO_test-11588-usb-on-jenkins-10733/14/ is running on isotester2. Five other isotesters are available, so it's a shame that job 15 was scheduled on isotester2 as well and now has to wait 3 hours before it runs.

Job 14 was run on Jul 31, 2016 9:18:49 PM by https://jenkins.tails.boum.org/job/wrap_test_Tails_ISO_test-11588-usb-on-jenkins-10733/14/, which also ran https://jenkins.tails.boum.org/job/reboot_job/8542/ with parameter RESTART_NODE=isotester2. The wrap job had NODE_NAME=isotester2.

Job 15 was run on Jul 31, 2016 9:19:19 PM by https://jenkins.tails.boum.org/job/wrap_test_Tails_ISO_test-11588-usb-on-jenkins-10733/15/, which also ran https://jenkins.tails.boum.org/job/reboot_job/8543/ with parameter RESTART_NODE=isotester2. The wrap job had NODE_NAME=isotester2.

As said in the ticket description, I've already investigated such a problem 8 months ago (#10601#note-5), so the next debugging steps should be easy, if done before the corresponding system logs and Jenkins artifacts expire.

I believe this clearly answers the "We first need to see if this still happens or not" part of this ticket: something is wrong with our job priority setup.

#5 Updated by intrigeri over 3 years ago

  • Subject changed from Test jobs sometimes get their isotester stolen by another one to Test jobs sometimes get scheduled on a busy isotester while there are available ones

The same thing is happening as we speak, between https://jenkins.tails.boum.org/view/Tails_ISO/job/test_Tails_ISO_feature-from-intrigeri-for-2.6/7/ and job 9 on the same project: here again, 2 isotesters are free, but job 9 is waiting for isotester1 to become available while job 7 is running there.

#6 Updated by intrigeri over 3 years ago

  • Related to Bug #10215: Suboptimal advance booking of Jenkins slaves for testing ISOs added

#7 Updated by anonym about 3 years ago

  • Target version changed from Tails_2.6 to Tails_2.7

#8 Updated by bertagaz about 3 years ago

  • Target version changed from Tails_2.7 to Tails_2.9.1

#9 Updated by anonym almost 3 years ago

  • Target version changed from Tails_2.9.1 to Tails 2.10

#10 Updated by intrigeri almost 3 years ago

  • Target version changed from Tails 2.10 to Tails_2.11

#11 Updated by bertagaz over 2 years ago

  • Target version changed from Tails_2.11 to Tails_2.12

#12 Updated by bertagaz over 2 years ago

  • Target version changed from Tails_2.12 to Tails_3.0

#13 Updated by bertagaz over 2 years ago

  • Target version changed from Tails_3.0 to Tails_3.1

#14 Updated by bertagaz over 2 years ago

  • Target version changed from Tails_3.1 to Tails_3.2

#15 Updated by bertagaz over 2 years ago

  • Target version changed from Tails_3.2 to Tails_3.3

#16 Updated by bertagaz about 2 years ago

  • Target version changed from Tails_3.3 to Tails_3.5

Realistically rescheduling: this won't happen in time for 3.4.

#17 Updated by bertagaz almost 2 years ago

  • Target version changed from Tails_3.5 to Tails_3.6

#18 Updated by bertagaz over 1 year ago

  • Target version changed from Tails_3.6 to Tails_3.7

#19 Updated by intrigeri over 1 year ago

  • Description updated (diff)

#20 Updated by intrigeri over 1 year ago

  • Related to Bug #10601: isotesterN:s are sometimes put offline and never back online added

#22 Updated by bertagaz over 1 year ago

  • Target version changed from Tails_3.7 to Tails_3.8

#23 Updated by intrigeri over 1 year ago

  • Target version changed from Tails_3.8 to Tails_3.9

#24 Updated by intrigeri about 1 year ago

  • Target version changed from Tails_3.9 to Tails_3.10.1

#25 Updated by intrigeri about 1 year ago

  • Target version changed from Tails_3.10.1 to Tails_3.11

#26 Updated by CyrilBrulebois 11 months ago

  • Target version changed from Tails_3.11 to Tails_3.12

#27 Updated by anonym 10 months ago

  • Target version changed from Tails_3.12 to Tails_3.13

#28 Updated by u 8 months ago

  • Blocked by Bug #10068: Upgrade to Jenkins 2.x, using upstream packages added

#29 Updated by u 8 months ago

If I understand correctly, upgrading the Priority Sorter plugin will magically fix this. bertagaz will probably have to do this upgrade for #10068 → hence "blocked by".

#30 Updated by u 8 months ago

  • Assignee changed from bertagaz to intrigeri
  • Target version changed from Tails_3.13 to Tails_3.16

Once bertagaz has done the Jenkins update (#10068), intrigeri will check, shortly before releasing 3.16 (July-Aug 2019), whether this issue was indeed magically fixed by the update.

#31 Updated by intrigeri 8 months ago

(This problem generally arises when our CI is overloaded, which often happens in the last few days before a release. So once #10068 is done by the end of June, I should notice the problem before either the 3.15 or the 3.16 release. If I don't, I'll be happy to call this fixed.)

#32 Updated by intrigeri 3 months ago

  • Target version changed from Tails_3.16 to Tails_3.17

u wrote:

Once bertagaz has done the Jenkins update (#10068), intrigeri will check, shortly before releasing 3.16 (July-Aug 2019), whether this issue was indeed magically fixed by the update.

bertagaz told me that he has made progress on #10068, but it is not done yet, so I'll postpone this by a month too.

#33 Updated by intrigeri 2 months ago

  • Related to Bug #16959: Gather usability data about our current CI added

#34 Updated by intrigeri 2 months ago

  • Target version changed from Tails_3.17 to Tails_4.0

#35 Updated by intrigeri about 2 months ago

  • Status changed from Confirmed to Needs Validation

I've adapted our configuration to be compatible with the latest version of the Priority Sorter plugin, and made the configured priorities correctly implement the documented design; see https://git.tails.boum.org/jenkins-jobs/commit/?id=d5586f9642fa4cd63f49016cbe090fa863149db8. I'll keep an eye on this problem and will report back if I see it again.

Note, however, that I don't think fixing the priorities is sufficient to fully fix the race condition: between "the time when reboot_job puts the node back online and wrap_test_* triggers test_Tails_ISO_" and "the time when test_Tails_ISO_ actually starts", on my local Jenkins I see that 9 seconds pass. So there's still a chance that another job that was already in the queue, and that is allowed to run on that node, starts there in the meantime. Such a job can be, for example, another wrap_test_*. At this point I can think of no easy way to fix this within the current design of our pipeline, and given the uncertainty around the future of our Jenkins setup, I won't spend much more time on it.

I suspect that our time would be better invested in making the test suite jobs clean up properly after themselves when stuff crashes, like we did for the build jobs when we switched them to Vagrant. This way, we could drop the whole "reboot before running the test suite" dance. Most likely, the outcome of this effort would still be valuable even if we move to GitLab CI or whatnot, which is why I think it could be a more worthwhile investment than fixing this part of our Jenkins setup 4 years after the problem was spotted. Also, making the test suite jobs clean up after themselves reliably is needed if we want to have nodes that can run both builds and tests, which would provide great performance improvements to our feedback loop.

#36 Updated by intrigeri about 2 months ago

  • Type of work changed from Research to Sysadmin

#37 Updated by intrigeri about 2 months ago

  • Related to Feature #9486: Support running multiple instances of the test suite in parallel added

#38 Updated by intrigeri about 2 months ago

  • Status changed from Needs Validation to In Progress

Note, however, that I don't think fixing the priorities is sufficient to fully fix the race condition: between "the time when reboot_job puts the node back online and wrap_test_* triggers test_Tails_ISO_" and "the time when test_Tails_ISO_ actually starts", on my local Jenkins I see that 9 seconds pass. So there's still a chance that another job that was already in the queue, and that is allowed to run on that node, starts there in the meantime. Such a job can be, for example, another wrap_test_*. At this point I can think of no easy way to fix this within the current design of our pipeline, and given the uncertainty around the future of our Jenkins setup, I won't spend much more time on it.

I've thought about this some more and tested some other options. I'm now convinced that job priorities alone (even fixed as done a few days ago) can't solve this problem. At the very least, we would need wrap_test_Tails_ISO_* to wait before it frees the current node, to ensure the test_Tails_ISO_ it triggers has actually reached the queue.
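To make that concrete, here's a rough sketch (hypothetical code, not what our jobs currently do) of how wrap_test_* could poll the Jenkins queue API until the job it triggered shows up; JENKINS_URL, the job name, the 5-second polling interval, and the timeout are all illustrative assumptions:

```shell
# Wait until a job with the given name appears in the Jenkins build queue,
# polling every 5 seconds, for up to $2 seconds (default: 60).
# Returns 0 once the job is queued, 1 on timeout.
wait_for_queued_job() {
    job_name="$1"
    timeout="${2:-60}"
    waited=0
    while [ "$waited" -lt "$timeout" ]; do
        # The Jenkins queue API lists all currently queued items as JSON.
        if curl -s "$JENKINS_URL/queue/api/json" | grep -q "$job_name"; then
            return 0
        fi
        sleep 5
        waited=$((waited + 5))
    done
    return 1
}
```

wrap_test_* would call this right after triggering test_Tails_ISO_, and only free the node once it returns; on timeout, bumping the polling budget is the same kind of "well understood consequences" knob as the sleep durations discussed below.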

The thing is, I understood that last part while testing a totally different approach (based on unfinished work I started, and that bertagaz later improved, in feature/reboot-after_post_build_trigger)… which I think is pretty robust, and easier to maintain and fix if needed:

  • It includes lots of comments that explain why every workaround is needed and why it works.
  • Every time there's a call to sleep, if the sleep duration turns out not to work reliably, we can simply increase it, with well-understood consequences (which boil down to: every test suite run takes a little more time; given that such a run takes hours, adding a few dozen seconds here and there is not a problem).

I don't think my time would be well spent transplanting all this work onto the current design (I don't even know if that's possible) so what I'll do is:

  1. Keep stress-testing this new design on my local Jenkins.
  2. Identify whatever corner cases I've not thought of yet. Handle them.
  3. After a while, propose it for merging. I will ask zen to review it: he now understands quite a bit about our CI setup.

This should solve #11295 and #10601 for good (modulo possible sleep duration bumps later).

Nevertheless, that's still not really good on the long term, because we're still blocking some CI performance improvements and potential future options:

I suspect that our time would be better invested in making the test suite jobs clean up properly after themselves when stuff crashes, like we did for the build jobs when we switched them to Vagrant. This way, we could drop the whole "reboot before running the test suite" dance. Most likely, the outcome of this effort would still be valuable even if we move to GitLab CI or whatnot, which is why I think it could be a more worthwhile investment than fixing this part of our Jenkins setup 4 years after the problem was spotted. Also, making the test suite jobs clean up after themselves reliably is needed if we want to have nodes that can run both builds and tests, which would provide great performance improvements to our feedback loop.

I think I have a good lead for this. If I work on this further or if this ticket gets closed, I should move this to a dedicated ticket, but for now let's keep tracking our various options here.

The idea is this:

1. Whenever the test suite completes (success, failure, or abort for unexpected reasons), that is, whenever the wrap_test_suite process exits, ensure that all its child processes are killed (even if they were started in the background). This matters for things like Xvfb, unclutter, the tor processes started by Chutney, avconv, local services run by cucumber, etc.

I've got a simple PoC for this that uses systemd-run, which creates a new CGroup (currently all test suite processes are in the /system.slice/jenkins-slave.service CGroup, which we can't kill because the Jenkins slave process is also part of it).

Applying this recipe to wrap_test_suite looks like this:

systemd-run --user \
    --collect \
    --service-type=oneshot \
    --unit=tails-wrap-test-suite.service \
    --setenv=USE_LAST_RELEASE_AS_OLD_ISO=yes \
    /usr/local/bin/wrap_test_suite \
        https://iso-history.tails.boum.org \
        /tmp/TailsToaster

We'll need some trickery to send more environment variables over to this transient service, for example those defined in tmp/tails-build-env.list. This seems totally feasible.
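One way that trickery could look (an untested sketch; build_setenv_args is a name I made up, and the comment/blank-line handling is an assumption about the file's format): turn each VAR=value line of the env file into one --setenv=VAR=value argument.

```shell
# Emit one --setenv=VAR=value argument per VAR=value line of an env file,
# skipping blank lines and comments, for use with systemd-run.
build_setenv_args() {
    env_file="$1"
    while IFS= read -r line; do
        case "$line" in
            ''|'#'*) continue ;;
        esac
        printf -- '--setenv=%s\n' "$line"
    done < "$env_file"
}

# Usage sketch:
#   systemd-run --user --collect --service-type=oneshot \
#       $(build_setenv_args tmp/tails-build-env.list) \
#       /usr/local/bin/wrap_test_suite ...
```

Note that word splitting on the command substitution would break values containing whitespace; if that matters in practice, an array (in Bash) or xargs would be safer.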

A problem here is process ownership: wrap_test_suite runs all its child processes as root, with sudo, so I doubt that the systemd --user instance this would run under is allowed to kill them. I suspect this can easily be solved by running wrap_test_suite itself as root instead of using sudo in this script; then we can remove --user from the systemd-run command line.

Developers could use a similar technique when running the test suite locally, which should fix the problems with zombie processes that we have there.

2. Clean up at the beginning of any new job started on a Jenkins node:

  • Forcibly kill the TailsToaster libvirt guest if it's still running and, for good measure, restart libvirtd.
  • Empty /tmp/TailsToaster. Currently we do some of this in post_test_cleanup but it would be more reliable to do it before the test suite starts: there are probably cases when a test suite run aborts in such a broken way that post_test_cleanup is not run.
  • Ensure tails-wrap-test-suite.service is stopped and all the processes that are part of it are killed.

We can do this as another Jenkins builder step.
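Such a builder step could be sketched as follows (hypothetical shell, assuming the guest, unit, and directory names mentioned above; the exact commands would need validating on a real node):

```shell
# Pre-test-suite cleanup builder step (sketch): free leftovers from any
# previous, possibly crashed, test suite run on this node.
cleanup_test_suite_leftovers() {
    # Forcibly kill the TailsToaster libvirt guest if it's still running,
    # and restart libvirtd for good measure.
    if virsh domstate TailsToaster 2>/dev/null | grep -q running; then
        virsh destroy TailsToaster
    fi
    systemctl restart libvirtd.service

    # Empty the test suite's temporary directory.
    rm -rf /tmp/TailsToaster
    mkdir -p /tmp/TailsToaster

    # Ensure the transient unit and every process in it are gone;
    # it's fine if the unit doesn't exist anymore.
    systemctl stop tails-wrap-test-suite.service 2>/dev/null || true
}
```

Since stopping a systemd unit kills every process in its cgroup, this last step is what makes the systemd-run wrapper from step (1) pay off.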

My current understanding is that this should free all the temporary resources that a given test suite run creates outside the scope of the transient unit created by systemd-run. There might be more such resources that I've forgotten, but I'm confident they can be freed in the same way.

To start with we would do this only when we start a test suite run. But eventually, we should do this when we start any job, so we can use all our nodes for any kind of job.

Note, in passing, that this is already the strategy we picked for build jobs when we started using Vagrant and libvirt there: we have pre-build cleanup steps, remove_libvirt_volumes and cleanup_build_jobs_leftovers, and it works pretty well.

3. Don't reboot Jenkins nodes between test suite jobs.

At this point we should not need to reboot nodes anymore between test suite runs.

In terms of strategy, at first glance all this can be done incrementally:

  1. We can start with step (1) alone, which would already solve some problems for developers who run the test suite locally outside of Jenkins: we all struggle with lingering zombie processes.
  2. Then we can do step (2), which should be a no-op as long as we keep rebooting nodes between test suite runs.
  3. Once we're confident (1) and (2) work well, we can do step (3).

I can (stress-)test all this on my local Jenkins before we consider deploying to production.

#39 Updated by intrigeri 27 days ago

intrigeri wrote:

I don't think my time would be well spent transplanting all this work onto the current design (I don't even know if that's possible) so what I'll do is:

  1. Keep stress-testing this new design on my local Jenkins.
  2. Identify whatever corner cases I've not thought of yet. Handle them.
  3. After a while, propose it for merging. I will ask zen to review it: he now understands quite a bit about our CI setup.

So far so good locally for steps 1 & 2.

Corner cases that I've seen being correctly handled:

  • A test suite job is deleted (by our job generation logic) while it's running → the node gets rebooted as expected.
  • A build is aborted before starting to execute builders, that is, in practice, while cloning the Git repo → the node gets rebooted as expected.

After 4.0 is out, whenever I find some time for it, I'll prepare branches, submit this for review, and move my systemd-run idea to a dedicated ticket.

#40 Updated by intrigeri 27 days ago

  • Target version changed from Tails_4.0 to Tails_4.1

#41 Updated by intrigeri 5 days ago

intrigeri wrote:

  1. After a while, propose it for merging. I will ask zen to review it: he now understands quite a bit about our CI setup.

I've asked zen if he wants to spend 2h with me to review this.

#42 Updated by intrigeri 5 days ago

  • Feature Branch set to jenkins-jobs:SponsorS-leftovers

#43 Updated by intrigeri 5 days ago

  • Feature Branch changed from jenkins-jobs:SponsorS-leftovers to jenkins-jobs:SponsorS-leftovers, puppet-tails:SponsorS-leftovers

#44 Updated by intrigeri 5 days ago

  • Related to Bug #17216: Make the test suite clean up after itself even in most failure modes added

#45 Updated by intrigeri 5 days ago

intrigeri wrote:

After 4.0 is out, whenever I find some time for it, I'll […] move my systemd-run idea to a dedicated ticket.

Done: #17216

#46 Updated by intrigeri 5 days ago

  • Status changed from In Progress to Needs Validation
