Slow networking for release management
(Opening as suggested by intrigeri.)
I'm regularly getting very slow transfers when performing release management duties, e.g. sync-ing stuff with git-annex right now:
```
kibi@armor:~/work/clients/tails/release/isos.git$ git annex get tails-amd64-4.*
get tails-amd64-4.0~beta1/tails-amd64-4.0~beta1.apt-sources (from origin...) SHA256E-s455--8759b3d834ba480061a814178fcaba653183e063ec554d2792da7cab31244d1d 455 100% 444.34kB/s 0:00:00 (xfr#1, to-chk=0/1) (checksum...) ok
get tails-amd64-4.0~beta1/tails-amd64-4.0~beta1.build-manifest (from origin...) SHA256E-s115862--f1ad4992232b8f026e612affecfa5d44c28061736d3fd82d35fc34029b3c743e 115,862 100% 367.36kB/s 0:00:00 (xfr#1, to-chk=0/1) (checksum...) ok
get tails-amd64-4.0~beta1/tails-amd64-4.0~beta1.buildlog (from origin...) SHA256E-s1167089--21d7a15c8065b52e49662273c6a7a0d734ea4bf7c69f623266774767ecc767ec 1,167,089 100% 2.32MB/s 0:00:00 (xfr#1, to-chk=0/1) (checksum...) ok
get tails-amd64-4.0~beta1/tails-amd64-4.0~beta1.img (from origin...) SHA256E-s1136656384--384db4d74da56c31a4e50bf093526c962d0eb3dee19de3e127fd5acccc063f9b.img 1,136,656,384 100% 6.09MB/s 0:02:58 (xfr#1, to-chk=0/1) (checksum...) ok
get tails-amd64-4.0~beta1/tails-amd64-4.0~beta1.img.sig (from origin...) SHA256E-s833--b8d1f3e4843d9586b811c16272df34759a6fab8c89c26839e0c0373f091ab343.img.sig 833 100% 813.48kB/s 0:00:00 (xfr#1, to-chk=0/1) (checksum...) ok
get tails-amd64-4.0~beta1/tails-amd64-4.0~beta1.iso (from origin...) SHA256E-s1126864896--c4c25d9689d8c5927f8ce1569454503fc92494ce53af236532ddb0d6fb34cff3.iso 1,126,864,896 100% 3.93MB/s 0:04:33 (xfr#1, to-chk=0/1) (checksum...) ok
get tails-amd64-4.0~beta1/tails-amd64-4.0~beta1.iso.sig (from origin...) SHA256E-s833--924ed01855ed92471526c9d68db40d692d6b3dfe970554333f52f3297d6f4f1a.iso.sig 833 100% 813.48kB/s 0:00:00 (xfr#1, to-chk=0/1) (checksum...) ok
get tails-amd64-4.0~beta1/tails-amd64-4.0~beta1.packages (from origin...) SHA256E-s46699--7d7ce059b09604f08676561e2a62c6142ff3226878257b695f20269f4eab6e8d 46,699 100% 44.54MB/s 0:00:00 (xfr#1, to-chk=0/1) (checksum...) ok
get tails-amd64-4.0~beta2/tails-amd64-4.0~beta2.build-manifest (from origin...) SHA256E-s115583--95117b4429829ce7e0f29fa887789e1c29c37d6ff9414a2a8e898d88099f426e 115,583 100% 714.39kB/s 0:00:00 (xfr#1, to-chk=0/1) (checksum...) ok
get tails-amd64-4.0~beta2/tails-amd64-4.0~beta2.buildlog (from origin...) SHA256E-s1165670--352f0e02919afccacaf763a9831497678a9175211f423aae0fa63465018f79bd 1,165,670 100% 3.32MB/s 0:00:00 (xfr#1, to-chk=0/1) (checksum...) ok
get tails-amd64-4.0~beta2/tails-amd64-4.0~beta2.img (from origin...) SHA256E-s1136656384--920e48fb7b8ab07573f6ad334749dd965c453794b6d33766d545c943c21296ad.img 1,136,656,384 100% 4.05MB/s 0:04:27 (xfr#1, to-chk=0/1) (checksum...) ok
get tails-amd64-4.0~beta2/tails-amd64-4.0~beta2.img.sig (from origin...) SHA256E-s228--212807a88aca27d186eaebe66e7045b88888ab7f8e806e77ca780a8a8646697e.img.sig 228 100% 222.66kB/s 0:00:00 (xfr#1, to-chk=0/1)
```
Meanwhile, I'm easily reaching 20+ MB/s when downloading some big files from other servers, so that's definitely not a bandwidth issue on my side.
FWIW this is with a git-annex that uses a direct connection to git.puppet.tails.boum.org, defined as an SSH alias to lizard, so there's no Tor involved.
At other times, I can get up to 15 MB/s from the iso-history HTTPS access, or over SSH. But that usually only happens when I'm double-checking the numbers after mentioning this issue to my fellow RMs, while I hit the slowness precisely when such transfers are on the critical path to a release.
Please let me know what you need from me to help you help me. Thanks!
#1 Updated by CyrilBrulebois 3 months ago
And now pushing to ISO history (that particular repository) is also on the critical path, since one needs to upload the built images there so as to be able to build IUKs on Jenkins. Currently seeing this:
```
121,602,048  10%  954.03kB/s  0:18:10
```
and I've got ~ 2.4 GB to upload.
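For context, a back-of-the-envelope conversion shows how far these rates are from the link's nominal 1000 Mbit/s capacity; `to_mbps` is just a hypothetical helper, and the figures assume rsync's "MB/s" are decimal megabytes:

```shell
# Convert an observed transfer rate (MB/s) to Mbit/s for comparison
# against the nominal gigabit link.
to_mbps() { awk -v r="$1" 'BEGIN { printf "%.0f\n", r * 8 }'; }

to_mbps 0.954   # the 954.03 kB/s upload above: ~8 Mbit/s
to_mbps 6.09    # best rate in the earlier download: ~49 Mbit/s
```

Even the best rate seen so far is well under a tenth of what a gigabit link should sustain.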
#2 Updated by CyrilBrulebois 3 months ago
To be a tad more complete:
```
copy tails-amd64-4.2.1/tails-amd64-4.2.1.img (checking origin...) (to origin...) SHA256E-s1161822208--19f20ad2dc3d28c695162479e5b1d527381baadddf125f3c66ece34c17c154d1.1.img 1,161,822,208 100% 1.87MB/s 0:09:52 (xfr#1, to-chk=0/1) ok
copy tails-amd64-4.2.1/tails-amd64-4.2.1.iso (checking origin...) (to origin...) SHA256E-s1151539200--4fcc4f2d0877f4ac7fdd5867467842eeed49094e37473e347a84798978064533.1.iso 1,151,539,200 100% 1.71MB/s 0:10:40 (xfr#1, to-chk=0/1) ok
```
while I could upload those elsewhere ~ 10 times faster.
(18:45:54) intrigeri: taggart: fwiw, we're seeing <100 Mbps transfer rates for lizard (did not check if it's "always" or "only when we need it to be fast")
(18:46:09) intrigeri: taggart: which is not consistent with the fact it's supposed to be plugged on a gigabit switch now
(19:45:12) taggart: intrigeri: can you run ethtool and confirm it's got a gig link
(19:45:27) taggart: intrigeri: it's plugged straight into a gig port on our router
(19:46:11) taggart: intrigeri: do you have a bandwidth graph somewhere? also we might need to do some traceroutes to see how it's routing
(19:46:34) intrigeri: taggart: it says Speed: 1000Mb/s
(19:46:43) intrigeri: taggart: yeah, we are on your munin
(19:47:13) intrigeri: I'll report more details later, busy now, I just wanted to check if there was a known issue, sorry I made you context switch!
(19:54:18) taggart: intrigeri: found it https://munin.riseup.net/riseup.net/wren.riseup.net/if_eth7.html
(20:15:49) intrigeri: taggart: ok, so it does go slightly above 100Mb/s, but I do remember that it went closer to gigabit after it was plugged in.
[…]
(21:22:48) taggart: intrigeri: this is what I use for speed testing https://github.com/richb-hanover/OpenWrtScripts
(21:22:59) taggart: the betterspeedtest.sh script
(21:23:20) taggart: I think it needs netperf or flent installed (but the error will be clear)
@intrigeri, did you run any test that shows that the problem is not on our side?
I ran the speedtest available in Debian a few times in a row, and it consistently gave around the following values:

```
$ speedtest --secure --simple
Ping: 26.323 ms
Download: 716.46 Mbit/s
Upload: 240.98 Mbit/s
```
I'll look for a way to run the proposed custom script so our numbers are more consistent with what the provider expects. Other suggestions for measuring bandwidth are also welcome.
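One other option (a sketch, under two assumptions: a sysadmin can temporarily run `iperf3 -s -1` on lizard, and its default port 5201 is reachable): iperf3 measures raw TCP throughput between two specific endpoints, which takes ssh and git-annex out of the picture entirely. `measure_link` is a hypothetical helper:

```shell
# Measure raw TCP throughput to a host running an iperf3 server.
measure_link() {   # $1 = host running "iperf3 -s -1"
  iperf3 -c "$1" -t 20        # client -> server (the slow upload direction)
  iperf3 -c "$1" -t 20 -R     # reverse: server -> client
}
# e.g.: measure_link git.puppet.tails.boum.org
```

Comparing the two directions against the speedtest numbers above would show whether the bottleneck is on the path to lizard specifically.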
> intrigeri, did you run any test that shows that the problem is not on our side?
I did not.
> I ran the speedtest available in Debian a few times in a row, and it consistently gave around the following values:
>
> $ speedtest --secure --simple
> Ping: 26.323 ms
> Download: 716.46 Mbit/s
> Upload: 240.98 Mbit/s
Hmmm. 240.98 Mbit/s is better than what kibi has seen, but still very far from download rates, let alone from maxing out a gigabit link. This suggests there's indeed a problem somewhere.
#9 Updated by intrigeri about 2 months ago
FWIW, I have a hunch that the problem may not be caused by networking problems, but rather by puppet-git.lizard being resource-constrained and slow. This hunch comes from the fact that the problem seems to happen mostly (only?) with git-annex operations, and not with operations that connect to another VM.
To confirm this, next time this sort of trouble happens, the RM could:
- Share the exact timestamp so we can look into our Munin graphs and see what was going on around that time. Or, even better, if a sysadmin is around when the problem occurs, they would be able to check current resources usage directly on puppet-git.lizard, which would give us finer-grained data.
- Try downloading a large file over HTTPS from lizard at the same time as the slow upload is ongoing, and see if that one is slow too.
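The second check above can be scripted so both pieces of data land in one place; `measure_download` is a hypothetical helper, and its argument would be the URL of any large file lizard serves over HTTPS:

```shell
# While the slow git-annex transfer is running: print an exact UTC
# timestamp (for correlating with Munin), then time one independent
# HTTPS download and report its average speed.
measure_download() {
  date -u +'%Y-%m-%dT%H:%M:%SZ'
  curl -sS -o /dev/null -w 'avg download: %{speed_download} bytes/s\n' "$1"
}
# e.g.: measure_download "https://.../some-large-file-on-lizard"
```

If that download is fast while the git-annex upload crawls, the bottleneck is unlikely to be raw bandwidth.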
… and sysadmins could check:
- Is rsync.lizard uploading tons of data to mirrors? This could explain why there's less bandwidth available for other needs.
- Is puppet-git.lizard bottlenecked by CPU, I/O, or anything else?
- Global lizard bandwidth usage
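For the resource-contention questions, a dependency-free first look could be taken directly on puppet-git.lizard while a transfer is slow (a sketch; it only reads /proc, so nothing needs installing):

```shell
# Is the VM starved? Load vs. CPU count, processes blocked on I/O,
# and pressure-stall information where the kernel provides it.
cat /proc/loadavg
grep -E 'procs_(running|blocked)' /proc/stat   # blocked procs hint at I/O wait
head -2 /proc/pressure/io 2>/dev/null || true  # PSI, needs Linux >= 4.20
```

A persistently non-zero procs_blocked, or high "some"/"full" PSI values, would point at I/O rather than networking.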
I could try to be around next Monday, at the time anonym will go through potentially affected steps of the release process, in order to check things live.
FWIW, yesterday between 3:00 and 5:00 PDT (presumably while mirrors were sync'ing from rsync.lizard), lizard pushed up to 800 Mbps of traffic, and was above 400 Mbps half of the time.
So it seems to me that network bandwidth alone does not explain the problem.
This is consistent with my hunch that the problem is specific to puppet-git.lizard.
#11 Updated by CyrilBrulebois 7 days ago
I don't see anything matching wiki/src/blueprint/GitLab.mdwn (which I was kind of expecting), so the upcoming switch to GitLab at Immerda will probably not change anything here?
I'll try and remember to download a big file from there next time I notice a slow push, to double-check the bandwidth aspect (which I didn't remember to do this time because of, let's say, suboptimal working conditions).
> I don't see anything matching wiki/src/blueprint/GitLab.mdwn (which I was kind of expecting), so the upcoming switch to GitLab at Immerda will probably not change anything here?
It won't change anything directly: migrating git-annex repos to GitLab is out of scope (besides, AFAIK GitLab supports git-lfs, but not git-annex).
It may improve things indirectly, if the problem is merely "puppet-git.lizard is overloaded", by migrating a little bit of the load away from that machine.
> I'll try and remember to download a big file from there next time I'm noticing a slow push, to double check the bandwidth aspect
If one of our sysadmins happens to be around at the time, it would be nice if you asked them to take a look at what seems to be the limiting factor for the server-side git-annex processes.
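For instance (a sketch; it assumes the server-side processes show up under the names git-annex and rsync), a sysadmin could sample them a few times during a slow transfer:

```shell
# Sample the transfer processes: a "D" in the STAT column means
# uninterruptible sleep (stuck waiting on I/O), while a high %CPU with
# R/S states points at CPU. "|| true" because there may be nothing to
# match when no transfer is running.
for i in 1 2 3; do
  ps -C git-annex,rsync -o pid,stat,%cpu,%mem,time,args || true
  sleep 2
done
```

A couple of such snapshots, together with the timestamp, would tell us whether those processes are the limiting factor.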