Bug #16805

Time between Greeter and desktop multiplied by 2

Added by sajolida 3 months ago. Updated 26 days ago.

Status: Resolved
Priority: Elevated
Assignee:
Category:
-
Target version:
Start date:
Due date:
% Done: 100%

Feature Branch: bugfix/16805-slow-login+force-all-tests
Type of work: Code
Blueprint:
Starter:
Affected tool:

Description

Here are some measurements comparing the boot time between 3.14 and bc022ef71e on an X201:

  • Between syslinux and Tails Greeter: 94 s (3.14) vs 72 s (bc022ef71e): awesome!
  • Between Tails Greeter and the desktop: 16 s (3.14) vs 26 s (bc022ef71e): how come it's almost double?

Looking for low-hanging fruit to improve this, without revamping our whole MAC spoofing design & implementation, is Foundations Team (FT) work.


Related issues

Related to Tails - Bug #9012: Network is sometimes not unblocked post-Greeter in Jessie Resolved 03/04/2015
Blocks Tails - Feature #16209: Core work: Foundations Team Confirmed

Associated revisions

Revision 1a9e1a84 (diff)
Added by intrigeri about 1 month ago

tails-unblock-network: only sleep until all-net-blacklist.conf is gone (refs: #16805)

Sleeping 5 seconds unconditionally harms UX.

The assumption here is that:

- #9012 was caused by an aufs bug that somehow affects how udev (and the
kernel?) monitor /etc/modprobe.d/, and make them need time until they notice
that all-net-blacklist.conf was deleted.
- The same bug would also affect the "-e" test done by the shell this script
runs under. That is, it would affect essentially any process that accesses
/etc/modprobe.d/.
- So for example, this bug can't be "the inode number of /etc/modprobe.d
changed between the time udev started monitoring it, and the time we trigger
a replay of the kernel 'add' events". According to the aufs documentation,
inode numbers can change when using the noxino mount option, which we do,
and actually that's been one of my primary suspects when investigating
#9012.

Revision eb55281c (diff)
Added by intrigeri about 1 month ago

tails-unblock-network: have udev reload the databases it uses (refs: #16805)

Revision 1527d3c0 (diff)
Added by intrigeri about 1 month ago

tails-unblock-network: only sleep until all-net-blacklist.conf is gone (refs: #16805)

Sleeping 5 seconds unconditionally harms UX.

The assumption here is that:

- #9012 was caused by an aufs bug that somehow affects how udev (and the
kernel?) monitor /etc/modprobe.d/, and make them need time until they notice
that all-net-blacklist.conf was deleted.
- The same bug would also affect the "-e" test done by the shell this script
runs under. That is, it would affect essentially any process that accesses
/etc/modprobe.d/.
- So for example, this bug can't be "the inode number of /etc/modprobe.d
changed between the time udev started monitoring it, and the time we trigger
a replay of the kernel 'add' events". According to the aufs documentation,
inode numbers can change when using the noxino mount option, which we do,
and actually that's been one of my primary suspects when investigating
#9012.

Revision 2e230972 (diff)
Added by intrigeri about 1 month ago

tails-unblock-network: have udev reload the databases it uses (refs: #16805)

Revision 9f57027e
Added by intrigeri 27 days ago

Merge branch 'bugfix/16805-slow-login+force-all-tests' into devel (Closes: #16805)

History

#1 Updated by intrigeri 3 months ago

  • Status changed from New to Rejected
  • Between Tails Greeter and the desktop: 16 s (3.14) vs 26 s: how come it's almost the double?

In 3.x I've been bold and we've started running the "unblock network adapters + spoof MAC address" logic in an asynchronous manner, in parallel with the rest of the login process, while in 2.x this was blocking the login process. But this bold move causes trouble on Buster (#16620) so I've reverted it in feature/buster (commit 9d51c6771a3f2d0af387b405d8e6987ee25f270f in greeter.git).

Also, I'm quite convinced that doing this stuff asynchronously is the root cause for a serious security issue we have in 3.x (#16560).

So I'm afraid we'll have to bite the bullet on this one. Thankfully, despite this Tails 4.x still starts faster than 3.x overall.

If you think we should do something about it, let me know.

#2 Updated by sajolida 3 months ago

Too bad but it makes sense :)

#3 Updated by sajolida about 1 month ago

I'll dump more ideas here as I'm afraid that this issue will generate a lot of noise for 4.0. It's also a shame to have such a significant loss in performance (though I understand that we gain in reliability).

I'm not sure I understand what you mean by "in an asynchronous manner, in parallel with the rest of the login process" but could we do the MAC spoofing dance proactively before starting the session, i.e. in the background while the user is interacting with Tails Greeter?

And then revert to the original MAC if the user decides to disable MAC spoofing in Tails Greeter?

This would also make it easier to solve Problem K from https://tails.boum.org/blueprint/network_connection/ by prompting the user about the failure from within Tails Greeter. I know that we discarded solving Problem K because it is pretty rare, but if we get to know about the failure before starting the session, it might become easy to fix.

#4 Updated by intrigeri about 1 month ago

  • Status changed from Rejected to Confirmed

Hi!

I'm afraid that this issue will generate a lot of noise for 4.0.

Thank you for this input. I'm therefore reopening this ticket to give it more visibility (I don't expect anyone but you and me to notice discussion on a rejected ticket). I'm not treating this as a release blocker at this point, but it would be nice if we at least managed to look for a cheap mitigation (I would start by profiling the login process with systemd-analyze plot and look for low-hanging fruit).

It's also a shame to have such a significant loss in performance

I'd like to remind the reader that according to sajolida's report, the total time between syslinux and the desktop, between 3.14 and Buster, went down by 11%.
This being said, I understand that having another long waiting step in the boot process has problematic UX consequences, even if the total waiting time goes down.
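As a quick check of the 11% figure, here is a minimal sketch that redoes the arithmetic using only the four timings from the ticket description (X201, in seconds). The function and variable names are illustrative and are not part of any Tails code.

```python
# Timings in seconds, copied from the ticket description (X201).
boot_to_greeter = {"3.14": 94, "bc022ef71e": 72}
greeter_to_desktop = {"3.14": 16, "bc022ef71e": 26}


def total(version):
    """Time from syslinux to the desktop for the given build."""
    return boot_to_greeter[version] + greeter_to_desktop[version]


def percent_change(old, new):
    """Relative change from old to new, in percent (negative means faster)."""
    return (new - old) / old * 100


old_total = total("3.14")        # 94 + 16 = 110
new_total = total("bc022ef71e")  # 72 + 26 = 98

print(f"total: {old_total} s -> {new_total} s "
      f"({percent_change(old_total, new_total):+.1f}%)")
print(f"Greeter to desktop: {percent_change(16, 26):+.1f}%")
```

The total drops from 110 s to 98 s, i.e. by roughly 11%, while the Greeter-to-desktop step alone grows by more than 60%, which is why the perceived regression and the measured overall improvement can both be true.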

could we do the MAC spoofing dance proactively before starting the session, ie. in the background while the user is interacting with Tails Greeter?
And then revert to the original MAC if the user decides to disable MAC spoofing in Tails Greeter?

tl;dr: yes, if we change our design goals a bit. The amount of work (design goals discussion, UX, coding) to make this happen seems non-trivial. Not terribly huge to the point that there would be no chance this gets done in time for 4.0, but it would definitely divert quite a lot of our resources away from other problems. It's not clear to me whether it's worth it right now.

First, one question: were there specific reasons why you suggested to do this in the Greeter, as opposed to earlier?

There are three aspects I think we ought to consider here: design goals, implementation, and UX. Any option we choose for any of these 3 aspects impacts the 2 other ones.

Wrt. design goals, we've discussed this option and rejected it in the past, because this might cause a spoofed MAC address to be leaked before the user can decide to disable MAC address spoofing in order to avoid raising alarms on the network (AvoidIdMacSpoof). It's unclear whether this kind of leak is more than a theoretical possibility: we have no idea what the proprietary firmware of network interfaces does, and we did not audit every Linux network interface driver to ensure it does not do that. One could argue that:

  • Such firmware can also very well use its permanent (non-spoofed) MAC address for whatever network action it wants, even after reporting to the OS that MAC spoofing was successful. Using the (often flawed) "if we can't protect against X, let's not bother protecting against Y" logic, it would follow that we can stop bothering about drivers/firmware leaking the MAC address before we explicitly trigger network operations ourselves by starting NetworkManager.
  • Even in our current implementation, there is a short window of time during which the network interface is enabled but we have not spoofed its MAC address yet. AFAICT that's unavoidable. So presumably, network interface drivers/firmware could leak the permanent (non-spoofed) MAC address while the user wants Tails to spoof it. Rebuttal: that window of time is immensely shorter than the one we would open up if we spoofed the MAC address before the user is given a choice.

At this point, I have no strong opinion about whether it would be reasonable to change our design goals here. I'd like to discuss it with folks like segfault and anonym if we decide we would like to invest more into this idea.

Wrt. implementation: yes, I think that's feasible. Having to deal with the revert aspect would make this (already complex) system even more complex, harder to reason about, and harder to debug, but it may be OK. I expect it'll take quite some time to implement this and get it right due to this complexity, adapting existing automated tests, and probably writing new ones.

Finally, wrt. UX, I don't know how much of the expected benefits we would get in practice. It depends:

  • If we initiate MAC spoofing in the Greeter, we probably save a few seconds per non-standard option the user chooses there, and then:
    • I believe this can make a real difference for users who manually change several of them, e.g. a non-English language, unlocking persistence, and setting an administration password. But most of this gain vanishes once we have added support to make most settings persistent.
    • It does not make much of a difference for users who login with the default settings or very few non-default settings.
  • If we initiate MAC spoofing earlier in the boot process, we probably save 10 seconds or so (on X201-class hardware) for everybody except those who will disable MAC spoofing. But we increase the risk of leaking the spoofed MAC address before the user had a chance to disable MAC spoofing.

This would also make it easier to solve Problem K from https://tails.boum.org/blueprint/network_connection/ by prompting the user about the failure from within Tails Greeter.

This only works if we block on MAC spoofing before we allow the user to log in. So if we start MAC spoofing in the Greeter, in practice it moves the waiting time from the login process to the Greeter, forcing most users to wait in the Greeter before they can log in, which is probably even worse UX than the current Tails/Buster state of things. I don't think problem K is worth making things worse for everyone else. So I would advocate against blocking on this in the Greeter. But that could be another argument in favour of spoofing the MAC address earlier in the boot process, instead of initiating this process at Greeter time.

#5 Updated by intrigeri about 1 month ago

#6 Updated by intrigeri about 1 month ago

  • Description updated (diff)

#7 Updated by sajolida about 1 month ago

I'd like to remind the reader that according to sajolida's report, the total time between syslinux and the desktop, between 3.14 and Buster, went down by 11%.

s/performance/perceived performance/ then.

As Michael said on tails-testers and as I felt the first time I started
4.0~beta1, "it looks broken".

Michael described this UX regression very well on tails-testers@:

« I think the problem with the long time between Greeter and Desktop
could confuse especially Windows users. We are spoiled from Windows
because there is always some bar or symbol which is moving. If its only
a few seconds then it is no problem but nearly 25 seconds is pretty
much. Especially because "it looks broken". Its not just a black screen
or any information window. On my screen i can see a few colours and the
pointer but nothing more. It looks like the user broke something. Thats
one of these small UX things which you maybe only really recognize if
you mostly use Windows and especially new user could feel "lonely"
during these moments. »

First, one question: were there specific reasons why you suggested to do this in the Greeter, as opposed to earlier?

Not really. I thought that, since the user will have to perform manual
steps in Tails Greeter, we might as well use this waiting time to
perform some computing in the background.

I'd like to discuss it with folks like segfault and anonym if we decide we would like to invest more into this idea.

Seeing the big can of worms that is being opened, forget about the MAC
spoofing idea and look for other ways of solving this UX regression.

Would it be simpler to add some visual feedback while the GNOME session
is starting or to provide some feedback in Tails Greeter after clicking
"Start Tails" and while the MAC spoofing dance is happening?

#8 Updated by sajolida about 1 month ago

I wanted to test how long it took to open GNOME with MAC spoofing disabled in Tails Greeter. It took 26 seconds as well but I think it's because MAC spoofing disabling is broken → #16988.

#9 Updated by intrigeri about 1 month ago

Seeing the big can of worms that is being opened, forget about the MAC spoofing idea and look for other ways of solving this UX regression.

OK.

Would it be simpler to add some visual feedback while the GNOME session is starting or to provide some feedback in Tails Greeter after clicking "Start Tails" and while the MAC spoofing dance is happening?

We never managed to have such feedback reliably (as in: is reliably displayed and does not break other stuff) in the past so I'm not very hopeful.

#10 Updated by intrigeri about 1 month ago

I wanted to test how long it took to open GNOME with MAC spoofing disabled in Tails Greeter. It took 26 seconds as well but I think it's because MAC spoofing disabling is broken → #16988.

FWIW I doubt that disabling MAC spoofing will change much there: my hunch is that what takes time is udev waiting for devices to settle, and starting NetworkManager, more than the actual MAC spoofing operation. I can of course be wrong on this one.

#11 Updated by sajolida about 1 month ago

Could we wait for udev and start NetworkManager in another screen
of Tails Greeter, before handing over to GNOME and having no possibility
of reliable feedback?

What other operations could we do at the very end of Tails Greeter,
while we're still in control of the display?

#12 Updated by intrigeri about 1 month ago

Hi!

Meta:

I'm sorry if my replies made you feel I did not care much or that there would be no progress here unless you push it and come up with all the ideas.

I think I understand why you care deeply about this topic. Perceived performance matters and we've put lots of energy into making Additional Software not block the login process, so it's rather sad to make it longer again for everybody (even those who don't use persistence).

As you know, guided by our "make it easier to switch between Tails and another OS" goal, I've put a lot of effort upstream on my volunteer time into making the boot faster, and I was really happy I had succeeded for 4.0… until we reach the Greeter and click "Start Tails", and then right, my efforts are almost canceled (based on measurements) or negated (based on user perception) due to this bug. So I also have something at stake here.

This ticket, however, is not my top priority for 4.0 at the moment: I'm working on #12092 so we have reliable test suite results and don't have to bump the memory requirements (which matters for UX too: I assume you'll agree that "I can't use Tails anymore because my computer is old" feels just as important as "it feels like Tails takes N more seconds to start" — even though it actually starts faster).

So I could not find time, so far, to work on it, apart from discussing the various suggestions you have made in the last week. It was fine and rather exciting, to be honest, I'm not complaining at all. I'm very aware that very often, your fresh perspective on things (i.e. without looking at the code) yields out-of-the-box ideas that someone who's merely looking at the code would never think about. It's great. But at some point I'd like to take time to eventually look at the problem from a different perspective, that is looking at the logs and the code to understand what's going on, and whether we can make the whole thing faster: not all problems have their best solution in clever UX design, sometimes we can also fix bugs in our code :)

Could we wait for udev and start NetworkManager in another screen of Tails Greeter, before handing over to GNOME and having no possibility of reliable feedback?

Sure, worst case we could do that.

What other operations could we do at the very end of Tails Greeter, while we're still in control of the display?

I understand you're asking this because if we insert another screen in Tails Greeter, in order to provide feedback, while we're blocked anyway, we could as well do more work. This makes total sense to me. At first glance, I can think of nothing at all, unfortunately: at that point, we've done everything we could do before starting the desktop environment, which is our preferred place (currently, de facto) to do stuff that might require user input or notifications, e.g. everything we have in config/chroot_local-includes/usr/lib/systemd/user/. I'm happy to think more about it, and about any other feedback/design-based mitigations, once I (or someone else) had time to try and find a solution to this problem.

Note to myself:

  • In my test VM, tails-unblock-network blocks the login process for 7 seconds, among which 5 are used by… the sleep 5 call in tails-unblock-network. Uh oh!
  • These 5 seconds explain half of the additional 10 seconds between Tails Greeter and the desktop that sajolida reported. If we can get rid of this, that might be good enough.
  • Our best (but still not exactly great) explanation about this call to sleep, that was added due to #9012, is that an aufs bug makes udev not notice that the blacklist file has just been removed. If that's indeed the case, another solution would be to wait, with repeated but much shorter calls to sleep, until we can confirm that the blacklist is gone, before we trigger loading of network device drivers.
  • If #9012 was indeed caused by a bug in aufs, then the reason why we have to sleep there will vanish once we switch to overlayfs, which is due for March 2020.
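The "repeated but much shorter calls to sleep" idea from the notes above can be sketched as follows. This is only an illustration of the polling logic in Python; the real tails-unblock-network is a shell script, and the helper name, the 0.1 s poll interval, and the 5 s upper bound are assumptions made for this example rather than values taken from the topic branch.

```python
import os
import time


def wait_until_gone(path, poll_interval=0.1, timeout=5.0):
    """Return True as soon as `path` no longer exists, False on timeout.

    Hypothetical helper: it replaces an unconditional long sleep with short
    polls, so the caller only waits as long as the file is still visible.
    """
    deadline = time.monotonic() + timeout
    while os.path.exists(path):
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_interval)
    return True


# Path assembled from the directory and file name mentioned in this ticket.
BLACKLIST = "/etc/modprobe.d/all-net-blacklist.conf"

if __name__ == "__main__":
    if wait_until_gone(BLACKLIST):
        print("blacklist gone, safe to trigger driver loading")
    else:
        print("timed out waiting for the blacklist to disappear")
```

In the common case the file is already gone on the first check and the function returns immediately, so the worst case stays bounded by the timeout while the typical case costs next to nothing.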

#13 Updated by intrigeri about 1 month ago

As Michael said on tails-testers and as I felt the first time I started 4.0~beta1, "it looks broken".
Michael described this UX regression very well on tails-testers@:

Thanks for pointing me to it: I had completely missed this because this text was never posted to the mailing list in non-quoted form. I've skipped the quoted text, assuming I had read it already. But apparently it was quoted from some other discussion I was not part of (like replying to you personally and not on the mailing list, I guess :)

#14 Updated by intrigeri about 1 month ago

  • Related to Bug #9012: Network is sometimes not unblocked post-Greeter in Jessie added

#15 Updated by intrigeri about 1 month ago

  • Status changed from Confirmed to In Progress

#16 Updated by intrigeri about 1 month ago

  • Feature Branch set to bugfix/16805-slow-login+force-all-tests

#17 Updated by intrigeri about 1 month ago

  • Assignee set to intrigeri

I've measured how long it takes between pressing Enter in the Greeter and the Desktop icons being displayed (i.e. the Desktop looks fully loaded) on 3 machines × 3 images:

Image \ Machine   Elitebook 840G1   X200   VM
3.15              14                23     10
4.0~beta1         19                32     14
topic branch      14                20     9

(All times in seconds.)

Notes:

  • The 2 laptops were started from a fast USB stick (I can write at a sustained 25MB/s rate on it).
  • The VM has 4 vCPUs and 4GB of RAM and its storage is backed by a fast NVMe drive.
  • Depending on the hardware, how much "it looks broken" varies: I've seen either a rather intriguing black screen, or merely the Greeter blue background and top bar remaining displayed longer.

So, unsurprisingly, dropping the aforementioned sleep 5 brings login time back down to its 3.x level. I'm a bit surprised that it even makes the login process a bit faster than 3.15, on 2 of the 3 test machines, but I guess nobody will complain. I guess that's thanks to the migration from desktop icons displayed by Nautilus to desktop icons displayed by GNOME Shell itself.
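For readers who want to check these conclusions against the measurements, here is a small sketch that computes the per-machine differences. The numbers are copied from the table in this note and are assumed to be in seconds; the dictionary layout and function name are illustrative only.

```python
# Time between pressing Enter in the Greeter and the desktop icons appearing,
# copied from the table in this note (assumed to be in seconds).
timings = {
    "3.15":         {"Elitebook 840G1": 14, "X200": 23, "VM": 10},
    "4.0~beta1":    {"Elitebook 840G1": 19, "X200": 32, "VM": 14},
    "topic branch": {"Elitebook 840G1": 14, "X200": 20, "VM": 9},
}


def delta(image, baseline):
    """Per-machine difference image - baseline (negative means faster)."""
    return {
        machine: timings[image][machine] - timings[baseline][machine]
        for machine in timings[image]
    }


print("topic branch vs 4.0~beta1:", delta("topic branch", "4.0~beta1"))
print("topic branch vs 3.15:     ", delta("topic branch", "3.15"))

faster = [m for m, d in delta("topic branch", "3.15").items() if d < 0]
print("faster than 3.15 on:", faster)
```

The topic branch saves 5 to 12 seconds compared to 4.0~beta1 on every machine, and beats 3.15 on the X200 and the VM while matching it on the Elitebook, consistent with the "2 of the 3 test machines" remark.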

I've started an image built from the topic branch 5 times on each of the machines listed above, and #9012 never happened. But given how rare and hard to reproduce #9012 was, this does not mean much, really. If we want to give this solution a try, I don't see any better way than shipping it in 4.0~beta2, mentioning it in the call for testing, and hoping that at least some folks who would be affected try it out and report the bug (if any).

Next steps:

  • Wait for Jenkins test results.
  • Decide whether we want to risk getting #9012 back. I think it's worth it. Information & arguments that may help decide:
    • On the topic branch I've added some mitigations that might help avoid #9012. I'm not too hopeful though: we've never really understood #9012.
    • When #9012 happens, there's one way to fix it, but it requires an admin password and using a Terminal: sudo /usr/local/lib/tails-unblock-network. We could document that.
    • #9012 happened so rarely to anonym and me, and could never be reproduced consistently, so IMO it's unlikely that a user/laptop is affected consistently; that is, if it happens, chances are that restarting Tails fixes it and you won't see the problem again for days/weeks/months. I suspect that restarting Tails is exactly what most Windows users would do anyway, rather than reading the doc ;) Note that it could still be particularly problematic for the first boot of a first-time user, who could conclude that Tails does not support their hardware instead of retrying, but I hope nobody gets so unlucky.

#18 Updated by intrigeri about 1 month ago

  • Priority changed from Normal to Elevated

#19 Updated by intrigeri 29 days ago

intrigeri wrote:

Next steps:

  • Wait for Jenkins test results.

Looks good. I'll also test on all bare metal hardware I have here.

  • Decide whether we want to risk getting #9012 back.

I'll let the reviewer decide once I'm done with my tests.

#20 Updated by intrigeri 29 days ago

  • Status changed from In Progress to Needs Validation
  • Assignee deleted (intrigeri)

intrigeri wrote:

I'll also test on all bare metal hardware I have here.

Ah, I had forgotten that I had done this already ("I've started an image built from the topic branch 5 times on each of the machines listed above, and #9012 never happened").

Dear reviewer, please read #16805#note-17 and following to get enough context to make a decision here. Thanks in advance!

#21 Updated by segfault 28 days ago

  • Assignee set to segfault

#22 Updated by segfault 28 days ago

  • Assignee changed from segfault to intrigeri

LGTM. I think the UX improvement makes it worth a try, we can still revert it if we get any reports of #9012.

I pushed two small fixup commits to the branch, so assigning to intrigeri to review and merge.

#23 Updated by intrigeri 27 days ago

  • Status changed from Needs Validation to Resolved
  • % Done changed from 0 to 100

#24 Updated by sajolida 26 days ago

I'm sorry if my replies made you feel I did not care much or that there would be no progress here unless you push it and come up with all the ideas.

I'm seeing now that you solved this problem, yay! You rock!

Reacting to your meta-reflection:

I didn't feel like that and I hope that you didn't feel pressured.

My updates on this ticket always came with either:

  • New information, for example #note-7, when I realized that the problem
    was not a performance problem but a feedback problem, partly thanks to
    Michael's report.
  • Ideas (#note-3) or clarification when I felt that my idea was not well
    understood (#note-11).

I felt a bit bad when you replied in great detail in #note-4. I won't take it badly if you discard my out-of-the-box ideas quicker, don't provide a complete technical justification, or take more time to react :)

I also really trust you to prioritize your work in the context of all the rest. My job is to help you understand the impact but I don't know the cost so I can't do the prioritization myself (though I'm happy to help if needed).

Depending on the hardware, how much "it looks broken" varies: I've seen either a rather intriguing black screen, or merely the Greeter blue background and top bar remaining displayed longer.

I consistently get the blue background and the top bar and that's what "looks broken" to me. In VirtualBox I get the black background and it doesn't look so bad.

#25 Updated by intrigeri 26 days ago

Reacting to your meta-reflection:

Thanks, much appreciated ♥ :)
