Bug #17414

Slow networking for release management

Added by CyrilBrulebois 3 months ago. Updated 7 days ago.

Status:
Confirmed
Priority:
Normal
Assignee:
Category:
Infrastructure
Target version:
-
Start date:
Due date:
% Done:

0%

Feature Branch:
Type of work:
Sysadmin
Blueprint:
Starter:
Affected tool:

Description

(Opening as suggested by intrigeri.)

I'm regularly getting very slow transfers when performing release management duties, e.g. sync-ing stuff with git-annex right now:

kibi@armor:~/work/clients/tails/release/isos.git$ git annex get tails-amd64-4.*
get tails-amd64-4.0~beta1/tails-amd64-4.0~beta1.apt-sources (from origin...) 
SHA256E-s455--8759b3d834ba480061a814178fcaba653183e063ec554d2792da7cab31244d1d
            455 100%  444.34kB/s    0:00:00 (xfr#1, to-chk=0/1)
(checksum...) ok
get tails-amd64-4.0~beta1/tails-amd64-4.0~beta1.build-manifest (from origin...) 
SHA256E-s115862--f1ad4992232b8f026e612affecfa5d44c28061736d3fd82d35fc34029b3c743e
        115,862 100%  367.36kB/s    0:00:00 (xfr#1, to-chk=0/1)
(checksum...) ok
get tails-amd64-4.0~beta1/tails-amd64-4.0~beta1.buildlog (from origin...) 
SHA256E-s1167089--21d7a15c8065b52e49662273c6a7a0d734ea4bf7c69f623266774767ecc767ec
      1,167,089 100%    2.32MB/s    0:00:00 (xfr#1, to-chk=0/1)
(checksum...) ok
get tails-amd64-4.0~beta1/tails-amd64-4.0~beta1.img (from origin...) 
SHA256E-s1136656384--384db4d74da56c31a4e50bf093526c962d0eb3dee19de3e127fd5acccc063f9b.img
  1,136,656,384 100%    6.09MB/s    0:02:58 (xfr#1, to-chk=0/1)
(checksum...) ok
get tails-amd64-4.0~beta1/tails-amd64-4.0~beta1.img.sig (from origin...) 
SHA256E-s833--b8d1f3e4843d9586b811c16272df34759a6fab8c89c26839e0c0373f091ab343.img.sig
            833 100%  813.48kB/s    0:00:00 (xfr#1, to-chk=0/1)
(checksum...) ok
get tails-amd64-4.0~beta1/tails-amd64-4.0~beta1.iso (from origin...) 
SHA256E-s1126864896--c4c25d9689d8c5927f8ce1569454503fc92494ce53af236532ddb0d6fb34cff3.iso
  1,126,864,896 100%    3.93MB/s    0:04:33 (xfr#1, to-chk=0/1)
(checksum...) ok
get tails-amd64-4.0~beta1/tails-amd64-4.0~beta1.iso.sig (from origin...) 
SHA256E-s833--924ed01855ed92471526c9d68db40d692d6b3dfe970554333f52f3297d6f4f1a.iso.sig
            833 100%  813.48kB/s    0:00:00 (xfr#1, to-chk=0/1)
(checksum...) ok
get tails-amd64-4.0~beta1/tails-amd64-4.0~beta1.packages (from origin...) 
SHA256E-s46699--7d7ce059b09604f08676561e2a62c6142ff3226878257b695f20269f4eab6e8d
         46,699 100%   44.54MB/s    0:00:00 (xfr#1, to-chk=0/1)
(checksum...) ok
get tails-amd64-4.0~beta2/tails-amd64-4.0~beta2.build-manifest (from origin...) 
SHA256E-s115583--95117b4429829ce7e0f29fa887789e1c29c37d6ff9414a2a8e898d88099f426e
        115,583 100%  714.39kB/s    0:00:00 (xfr#1, to-chk=0/1)
(checksum...) ok
get tails-amd64-4.0~beta2/tails-amd64-4.0~beta2.buildlog (from origin...) 
SHA256E-s1165670--352f0e02919afccacaf763a9831497678a9175211f423aae0fa63465018f79bd
      1,165,670 100%    3.32MB/s    0:00:00 (xfr#1, to-chk=0/1)
(checksum...) ok
get tails-amd64-4.0~beta2/tails-amd64-4.0~beta2.img (from origin...) 
SHA256E-s1136656384--920e48fb7b8ab07573f6ad334749dd965c453794b6d33766d545c943c21296ad.img
  1,136,656,384 100%    4.05MB/s    0:04:27 (xfr#1, to-chk=0/1)
(checksum...) ok
get tails-amd64-4.0~beta2/tails-amd64-4.0~beta2.img.sig (from origin...) 
SHA256E-s228--212807a88aca27d186eaebe66e7045b88888ab7f8e806e77ca780a8a8646697e.img.sig
            228 100%  222.66kB/s    0:00:00 (xfr#1, to-chk=0/1)

Meanwhile, I'm easily reaching 20+ MB/s when downloading some big files from other servers, so that's definitely not a bandwidth issue on my side.

FWIW this is with a git-annex that leverages a direct connection to git.puppet.tails.boum.org, defined as an SSH alias to lizard, so there's no Tor involved.

At other times, I can get up to 15 MB/s from the iso history HTTPS access, or over SSH. But that usually only happens when I'm double-checking the numbers after mentioning this issue to my fellow RMs, while the slowness hits precisely when such transfers are on the critical path to a release.

Please let me know what you need from me to help you help me. Thanks!
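As a sanity check on the numbers in the log above, converting the three large transfers from bytes and elapsed time to Mbit/s (sizes and durations taken verbatim from the rsync lines) puts them in the 33–51 Mbit/s range, far below what a gigabit link allows:

```python
# Effective rates from the transfer log above (sizes in bytes and durations
# copied from the rsync progress lines), converted to Mbit/s.
transfers = {
    "tails-amd64-4.0~beta1.img": (1_136_656_384, 2 * 60 + 58),  # 0:02:58
    "tails-amd64-4.0~beta1.iso": (1_126_864_896, 4 * 60 + 33),  # 0:04:33
    "tails-amd64-4.0~beta2.img": (1_136_656_384, 4 * 60 + 27),  # 0:04:27
}
for name, (size_bytes, seconds) in transfers.items():
    mbit_per_s = size_bytes * 8 / seconds / 1e6
    print(f"{name}: {mbit_per_s:.1f} Mbit/s")
```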


Related issues

Related to Tails - Bug #17361: Streamline our release process Confirmed

History

#1 Updated by CyrilBrulebois 3 months ago

Pushing to ISO history (that particular repository) is now also on the critical path, since one needs to upload the built images there to be able to build IUKs on Jenkins. Currently seeing this:

    121,602,048  10%  954.03kB/s    0:18:10  

and I've got ~ 2.4 GB to upload.

#2 Updated by CyrilBrulebois 3 months ago

To be a tad more complete:

copy tails-amd64-4.2.1/tails-amd64-4.2.1.img (checking origin...) (to origin...) 
SHA256E-s1161822208--19f20ad2dc3d28c695162479e5b1d527381baadddf125f3c66ece34c17c154d1.1.img
  1,161,822,208 100%    1.87MB/s    0:09:52 (xfr#1, to-chk=0/1)
ok
copy tails-amd64-4.2.1/tails-amd64-4.2.1.iso (checking origin...) (to origin...) 
SHA256E-s1151539200--4fcc4f2d0877f4ac7fdd5867467842eeed49094e37473e347a84798978064533.1.iso
  1,151,539,200 100%    1.71MB/s    0:10:40 (xfr#1, to-chk=0/1)
ok

while I could upload those elsewhere ~ 10 times faster.

#3 Updated by intrigeri 3 months ago

(18:45:54) intrigeri: taggart: fwiw, we're seeing <100 Mbps transfer rates for lizard (did not check if it's "always" or "only when we need it to be fast",
(18:46:09) intrigeri: taggart: which is not consistent with the fact it's supposed to be plugged on a gigabit switch now
(19:45:12) taggart: intrigeri: can you run ethtool and confirm it's got a gig link
(19:45:27) taggart: intrigeri: it's plugged straight into a gig port on our router
(19:46:11) taggart: intrigeri: do you have a bandwidth graph somewhere? also we might need to do some traceroutes to see how it's routing
(19:46:34) intrigeri: taggart: it says Speed: 1000Mb/s
(19:46:43) intrigeri: taggart: yeah, we are on your munin
(19:47:13) intrigeri: I'll report more details later, busy now, I just wanted to check if there was a known issue, sorry I made you context switch!
(19:54:18) taggart: intrigeri: found it https://munin.riseup.net/riseup.net/wren.riseup.net/if_eth7.html
(20:15:49) intrigeri: taggart: ok, so it does go slightly above 100Mb/s, but I do remember that it went closer to gigabit after it was plugged in.
[…]
(21:22:48) taggart: intrigeri: this is what I use for speed testing https://github.com/richb-hanover/OpenWrtScripts
(21:22:59) taggart: the betterspeedtest.sh script
(21:23:20) taggart: I think it needs netperf or flent installed (but the error will be clear)

#4 Updated by intrigeri 3 months ago

  • Related to Bug #17361: Streamline our release process added

#5 Updated by intrigeri 3 months ago

  • Category set to Infrastructure
  • Status changed from New to Confirmed

#6 Updated by zen 3 months ago

@intrigeri, did you run any test that shows that the problem is not on our side?

I ran the speedtest available in Debian a few times in a row, and it consistently gave values around the following:

$ speedtest --secure --simple
Ping: 26.323 ms
Download: 716.46 Mbit/s
Upload: 240.98 Mbit/s

I'll look for a way to run the proposed custom script so our numbers are more comparable with what the provider expects. Other suggestions for measuring bandwidth are also welcome.
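Since other suggestions for measuring bandwidth were requested: besides the betterspeedtest.sh script mentioned above, a minimal throughput probe can be sketched in Python. All hosts and ports here are illustrative; as written it measures loopback only, and a real measurement would need the sender running on lizard (proper tools like netperf or flent remain preferable):

```python
# Minimal TCP throughput probe (a sketch, not a replacement for netperf/flent):
# stream a fixed payload over a socket and time the receiving side.
import socket
import threading
import time

CHUNK = 64 * 1024
TOTAL = 50 * 1024 * 1024  # 50 MiB test payload

def sender(listener):
    # Accept one connection and push TOTAL bytes of zeros through it.
    conn, _ = listener.accept()
    with conn:
        buf = b"\0" * CHUNK
        sent = 0
        while sent < TOTAL:
            conn.sendall(buf)
            sent += CHUNK

def measure(host, port):
    """Receive TOTAL bytes and return the observed rate in Mbit/s."""
    start = time.monotonic()
    received = 0
    with socket.create_connection((host, port)) as conn:
        while received < TOTAL:
            data = conn.recv(CHUNK)
            if not data:
                break
            received += len(data)
    elapsed = max(time.monotonic() - start, 1e-6)
    return received * 8 / elapsed / 1e6

listener = socket.socket()
listener.bind(("127.0.0.1", 0))  # illustrative: loopback, ephemeral port
listener.listen(1)
threading.Thread(target=sender, args=(listener,), daemon=True).start()
rate_mbit = measure("127.0.0.1", listener.getsockname()[1])
print(f"loopback throughput: {rate_mbit:.0f} Mbit/s")
```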

#7 Updated by zen 3 months ago

  • Assignee changed from Sysadmins to zen

#8 Updated by intrigeri 2 months ago

> @intrigeri, did you run any test that shows that the problem is not on our side?

I did not.

> I ran the speedtest available in Debian a few times in a row, and it consistently gave values around the following:

> $ speedtest --secure --simple
> Ping: 26.323 ms
> Download: 716.46 Mbit/s
> Upload: 240.98 Mbit/s
> 

Hmmm. 240.98 Mbit/s is better than what kibi has seen, but still very far from download rates, let alone from maxing out a gigabit link. This suggests there's indeed a problem somewhere.

#9 Updated by intrigeri about 2 months ago

FWIW, I have a hunch that the problem may not be caused by networking issues, but rather by puppet-git.lizard being resource-constrained and slow. This hunch comes from the fact that the problem seems to happen mostly (only?) with git-annex operations, and not with operations that connect to another VM.

To confirm this, next time this sort of trouble happens, the RM could:

  • Share the exact timestamp so we can look into our Munin graphs and see what was going on around that time. Or, even better, if a sysadmin is around when the problem occurs, they would be able to check current resources usage directly on puppet-git.lizard, which would give us finer-grained data.
  • Try downloading a large file over HTTPS from lizard at the same time as the slow upload is ongoing, and see if that one is slow too.

… and sysadmins could check:

  • Was rsync.lizard uploading tons of data to mirrors? This could explain why there's less bandwidth available for other needs.
  • Was puppet-git.lizard bottlenecked by CPU, I/O, or anything else?
  • Global lizard bandwidth usage
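The "download a large file over HTTPS" check above could be scripted roughly as follows. The lizard/ISO-history URL is hypothetical, so this sketch fetches a local file:// URL to stay self-contained; substitute the real HTTPS URL when actually testing:

```python
# Sketch of the suggested check: time a download and print the effective rate.
import os
import tempfile
import time
import urllib.request

# 1 MiB stand-in for a ~1 GiB image, so the snippet runs as-is.
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"\0" * (1 << 20))
    path = f.name

# Substitute e.g. the real ISO-history HTTPS URL on lizard here.
url = "file://" + path
start = time.monotonic()
with urllib.request.urlopen(url) as resp:
    data = resp.read()
elapsed = max(time.monotonic() - start, 1e-6)  # guard against a 0 s read
os.unlink(path)
print(f"fetched {len(data):,} bytes at {len(data) * 8 / elapsed / 1e6:.1f} Mbit/s")
```

Running this during a slow git-annex push would show whether HTTPS downloads from the same host are affected too, which helps separate a global bandwidth problem from a puppet-git.lizard-specific one.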

I could try to be around next Monday, at the time anonym will go through potentially affected steps of the release process, in order to check things live.

#10 Updated by intrigeri 8 days ago

FWIW, yesterday between 3:00 and 5:00 PDT (presumably while mirrors were sync'ing from rsync.lizard), lizard pushed up to 800 Mbps of traffic, and was above 400 Mbps half of the time:

https://munin.riseup.net/static/dynazoom.html?plugin_name=tails.boum.org%2Flizard.tails.boum.org%2Fif_eth1&start_iso8601=2020-03-27T09%3A34%3A30-0000&stop_iso8601=2020-03-27T12%3A23%3A15-0000&start_epoch=1585301670&stop_epoch=1585311795&lower_limit=&upper_limit=&size_x=800&size_y=400&cgiurl_graph=%2Fmunin-cgi%2Fmunin-cgi-graph

So it seems to me that network bandwidth alone does not explain the problem.
This is consistent with my hunch that the problem is specific to git-annex and puppet-git.lizard.

#11 Updated by CyrilBrulebois 7 days ago

I don't see anything matching /annex/i in wiki/src/blueprint/GitLab.mdwn (which I was kind of expecting), so the upcoming switch to GitLab at Immerda will probably not change anything here?

I'll try and remember to download a big file from there next time I notice a slow push, to double-check the bandwidth aspect (which I didn't remember to do because of, let's say, suboptimal working conditions).

#12 Updated by intrigeri 7 days ago

> I don't see anything matching /annex/i in wiki/src/blueprint/GitLab.mdwn (which I was kind of expecting), so the upcoming switch to GitLab at Immerda will probably not change anything here?

It won't change anything directly: migrating git-annex repos to GitLab is out of scope (besides, AFAIK GitLab supports git-lfs, but not git-annex).

It may improve things indirectly, if the problem is merely "puppet-git.lizard is overloaded", by migrating a little bit of the load away from that machine.

> I'll try and remember to download a big file from there next time I'm noticing a slow push, to double check the bandwidth aspect

Great!

If one of our sysadmins happens to be around at the time, it would be nice if you asked them to take a look at what seems to be the limiting factor for the server-side git-annex processes.
