Project

General

Profile

Bug #12092

The Greeter keeps eating lots of memory after logging in

Added by intrigeri over 2 years ago. Updated 22 days ago.

Status:
Resolved
Priority:
High
Assignee:
-
Category:
Hardware support
Target version:
Start date:
12/27/2016
Due date:
% Done:

0%

Feature Branch:
bugfix/12092-kill-gdm-session-after-login+force-all-tests
Type of work:
Code
Blueprint:
Starter:
Affected tool:
Greeter

Description

The test suite started failing on feature/stretch in Jenkins (#12088) once we've merged the new Greeter in there. I've noticed that a full GNOME session is kept running as the Debian-gdm user once one has logged in. This seems like a big waste of memory. Can we somehow free that RAM, to keep supporting older hardware?

As part of this, revert 895387ece8cec86831cd617ee08dbdfc49cdc144.


Related issues

Related to Tails - Bug #12088: Test suite fails for feature/stretch on Jenkins due to lack of space in /tmp/TailsToaster Resolved 12/26/2016
Related to Tails - Bug #16305: GDM's GNOME Shell floods the Journal with XFIXES/cursor issues on Buster Resolved 01/06/2019
Related to Tails - Bug #17011: Disable GDM debug logs In Progress
Duplicated by Tails - Bug #12999: Stopping session-c1.scope after starting tails Duplicate 06/23/2017
Blocks Tails - Bug #16281: Update the test suite for Buster Resolved 01/05/2019
Blocks Tails - Feature #16209: Core work: Foundations Team Confirmed

Associated revisions

Revision 110a60d0 (diff)
Added by intrigeri about 1 month ago

Terminate the GDM session after the user logs in, to free memory (refs: #12092)

Revision 89cc641f (diff)
Added by intrigeri about 1 month ago

Terminate the GDM session after the user logs in, to free memory (refs: #12092)

We kill it as part of desktop.target, i.e. during the "Applications" phase of
the initialization of the GNOME session. We cannot do this earlier reliably:

- basic.target is started by "systemd --user" for almost every command run as
the amnesia user and may thus be triggered too early, at a time when we still
need GDM.
- If we do this as part of basic.target, it sometimes happens before amnesia's
X.Org has started, and sometimes after that, which causes racy behaviour,
weird bugs, and amnesia's $DISPLAY can be either :0 or :1, which breaks our
code that relies on that value to be always the same.

We're in no rush to kill GDM super early anyway.

I've heard rumors that we can drop this hack when we switch to Wayland (#12213).
We'll see :)

Revision 2e6edc54 (diff)
Added by intrigeri about 1 month ago

Don't start new instances of gdm-x-session when we've killed it on purpose (refs: #12092)

Revision 107efb89 (diff)
Added by intrigeri about 1 month ago

Ensure GDM does not take control of the display back when we kill its processes (refs: #12092)

Also, revert "Don't start new instances of gdm-x-session when we've killed it on
purpose" aka. commit 2e6edc540a58eddd23f495f50a526137e0c78b18, which is not
needed anymore: we kill the gdm.service in a way that should guarantee our
gdm-x-session wrapper is not started again once we've killed it along with the
rest of the GDM session.

Revision 962c5fac (diff)
Added by intrigeri 27 days ago

Terminate GDM's GNOME session after the amnesia user logs in, in order to free memory (refs: #12092)

I've heard rumors that we can drop this hack when we switch to Wayland (#12213).
We'll see :)

We kill it as part of desktop.target, i.e. during the "Applications" phase of
the initialization of the GNOME session. We cannot do this earlier reliably:

- basic.target is started by "systemd --user" for almost every command run as
the amnesia user and may thus be triggered too early, at a time when we still
need GDM's processes.
- If we do this as part of basic.target, it sometimes happens before amnesia's
X.Org has started, and sometimes after that, which causes racy behaviour,
weird bugs, and amnesia's $DISPLAY can be either :0 or :1, which breaks our
code that relies on that value to be always the same.

We're in no rush to kill GDM's GNOME session super early anyway.

Note that we keep GDM running while we kill its GNOME session,
otherwise, the amnesia user can't unlock the screen:

Failed to open reauthentication channel: Gio:DBusError:
GDBus.Error:org.freedesktop.DBus.Error.ServiceUnknown: The name
org.gnome.DisplayManager was not provided by any .service files

Revision 0cda3a16 (diff)
Added by intrigeri 27 days ago

Terminate GDM's GNOME session after the amnesia user logs in, in order to free memory (refs: #12092)

I've heard rumors that we can drop this hack when we switch to Wayland (#12213).
We'll see :)

We kill it as part of desktop.target, i.e. during the "Applications" phase of
the initialization of the GNOME session. We cannot do this earlier reliably:

- basic.target is started by "systemd --user" for almost every command run as
the amnesia user and may thus be triggered too early, at a time when we still
need GDM's processes.
- If we do this as part of basic.target, it sometimes happens before amnesia's
X.Org has started, and sometimes after that, which causes racy behaviour,
weird bugs, and amnesia's $DISPLAY can be either :0 or :1, which breaks our
code that relies on that value to be always the same.

We're in no rush to kill GDM's GNOME session super early anyway.

Note that we keep GDM running while we kill its GNOME session,
otherwise, the amnesia user can't unlock the screen:

Failed to open reauthentication channel: Gio:DBusError:
GDBus.Error:org.freedesktop.DBus.Error.ServiceUnknown: The name
org.gnome.DisplayManager was not provided by any .service files

Revision 10f82bb2 (diff)
Added by intrigeri 27 days ago

Terminate GDM's GNOME session after the amnesia user logs in, in order to free memory (refs: #12092)

I've heard rumors that we can drop this hack when we switch to Wayland (#12213).
We'll see :)

We kill it as part of desktop.target, i.e. during the "Applications" phase of
the initialization of the GNOME session. We cannot do this earlier reliably:

- basic.target is started by "systemd --user" for almost every command run as
the amnesia user and may thus be triggered too early, at a time when we still
need GDM's processes.
- If we do this as part of basic.target, it sometimes happens before amnesia's
X.Org has started, and sometimes after that, which causes racy behaviour,
weird bugs, and amnesia's $DISPLAY can be either :0 or :1, which breaks our
code that relies on that value to be always the same.

We're in no rush to kill GDM's GNOME session super early anyway.

Note that we keep GDM running while we kill its GNOME session,
otherwise, the amnesia user can't unlock the screen:

Failed to open reauthentication channel: Gio:DBusError:
GDBus.Error:org.freedesktop.DBus.Error.ServiceUnknown: The name
org.gnome.DisplayManager was not provided by any .service files

Also, we ensure gdm-session-worker does not start new sessions once the amnesia
user has logged in, which should hopefully prevent GDM from activating
such a session while we want the amnesia's user session to remain active.

Revision b8776c87 (diff)
Added by intrigeri 24 days ago

Terminate GDM's GNOME session after the amnesia user logs in, in order to free memory (refs: #12092)

I've heard rumors that we can drop this hack when we switch to Wayland (#12213).
We'll see :)

We kill it as part of desktop.target, i.e. during the "Applications" phase of
the initialization of the GNOME session. We cannot do this earlier reliably:

- basic.target is started by "systemd --user" for almost every command run as
the amnesia user and may thus be triggered too early, at a time when we still
need GDM's processes.
- If we do this as part of basic.target, it sometimes happens before amnesia's
X.Org has started, and sometimes after that, which causes racy behaviour,
weird bugs, and amnesia's $DISPLAY can be either :0 or :1, which breaks our
code that relies on that value to be always the same.

We're in no rush to kill GDM's GNOME session super early anyway.

Note that we keep GDM running while we kill its GNOME session,
otherwise, the amnesia user can't unlock the screen:

Failed to open reauthentication channel: Gio:DBusError:
GDBus.Error:org.freedesktop.DBus.Error.ServiceUnknown: The name
org.gnome.DisplayManager was not provided by any .service files

Also, we ensure gdm-session-worker does not start new sessions once the amnesia
user has logged in, which should hopefully prevent GDM from activating
such a session while we want the amnesia's user session to remain active.

Revision 9e6df451 (diff)
Added by intrigeri 22 days ago

Terminate GDM's GNOME session after the amnesia user logs in, in order to free memory (refs: #12092)

I've heard rumors that we can drop this hack when we switch to Wayland (#12213).
We'll see :)

We kill it as part of desktop.target, i.e. during the "Applications" phase of
the initialization of the GNOME session. We cannot do this earlier reliably:

- basic.target is started by "systemd --user" for almost every command run as
the amnesia user and may thus be triggered too early, at a time when we still
need GDM's processes.
- If we do this as part of basic.target, it sometimes happens before amnesia's
X.Org has started, and sometimes after that, which causes racy behaviour,
weird bugs, and amnesia's $DISPLAY can be either :0 or :1, which breaks our
code that relies on that value to be always the same.

We're in no rush to kill GDM's GNOME session super early anyway.

Note that we keep GDM running while we kill its GNOME session,
otherwise, the amnesia user can't unlock the screen:

Failed to open reauthentication channel: Gio:DBusError:
GDBus.Error:org.freedesktop.DBus.Error.ServiceUnknown: The name
org.gnome.DisplayManager was not provided by any .service files

Also, we ensure gdm-session-worker does not start new sessions once the amnesia
user has logged in, which should hopefully prevent GDM from activating
such a session while we want the amnesia's user session to remain active.

Revision 0358ae20 (diff)
Added by intrigeri 22 days ago

Enable GDM debug logs (refs: #12092).

In case the branch for #12092 introduces regressions, we'll need debug logs to
understand when/why GDM switches VTs or $DISPLAY, and when it activates its own
logind session.

This will be reverted in refs: #17011.

Revision edc71c6f (diff)
Added by intrigeri 22 days ago

Improve comments.

I.e. integrate as code comments all the relevant explanations
that were in commit messages, in the previous non-squashed branch
for refs: #12092.

Revision 85b14c91
Added by intrigeri 22 days ago

Merge branch 'bugfix/12092-kill-gdm-session-after-login-squashed' into devel (Fix-committed: #12092)

Revision 9fab9914 (diff)
Added by segfault 4 days ago

Revert "Enable GDM debug logs (refs: #12092)." (refs: #17011)

This reverts commit 0358ae208acb150766e8110c225025c4bb3ed5f1.

History

#1 Updated by intrigeri over 2 years ago

  • Related to Bug #12088: Test suite fails for feature/stretch on Jenkins due to lack of space in /tmp/TailsToaster added

#2 Updated by alant over 2 years ago

On a stretch system using X session for both the greeter and the user session, I can reproduce this behaviour: after logging in, I have:

        ├─gdm3─┬─gdm-session-wor─┬─gdm-x-session(Debian-gdm)─┬─Xorg───{InputThread}
        │      │                 │                           ├─gnome-session-b─┬─gnome-settings-─┬─{dconf worker}
        │      │                 │                           │                 │                 ├─{gdbus}
        │      │                 │                           │                 │                 ├─{gmain}
        │      │                 │                           │                 │                 └─{pool}
        │      │                 │                           │                 ├─gnome-shell─┬─{JS GC Helper}
        │      │                 │                           │                 │             ├─{JS Sour~ Thread}
        │      │                 │                           │                 │             ├─{dconf worker}
        │      │                 │                           │                 │             ├─{gdbus}
        │      │                 │                           │                 │             ├─{gmain}
        │      │                 │                           │                 │             └─{threaded-ml}
        │      │                 │                           │                 ├─xbrlapi
        │      │                 │                           │                 ├─{dconf worker}
        │      │                 │                           │                 ├─{gdbus}
        │      │                 │                           │                 └─{gmain}
        │      │                 │                           ├─{gdbus}
        │      │                 │                           └─{gmain}
        │      │                 ├─{gdbus}
        │      │                 └─{gmain}
        │      ├─gdm-session-wor─┬─gdm-x-session(user)─┬─Xorg───{InputThread}
        │      │                 │                     ├─gnome-session-b─┬─evolution-alarm─┬─{dconf worker}
        │      │                 │                     │                 │                 ├─3*[{evolution-alarm}]
        │      │                 │                     │                 │                 ├─{gdbus}
        │      │                 │                     │                 │                 └─{gmain}

However, with a sid system using wayland sessions for both the greeter and the user, there are no Debian-gdm processes anymore after login.

We can try to find a way to (have these) processes killed, but I'm not sure it could be easy...

#3 Updated by intrigeri over 2 years ago

We can try to find a way to (have these) processes killed, but I'm not sure it could be easy...

I'm open to an ugly hack to keep hardware requirements as low as possible, on the grounds that we can drop the hack once we use Wayland.

#4 Updated by intrigeri over 2 years ago

  • Priority changed from Normal to Elevated

#5 Updated by alant over 2 years ago

Let's first evaluate the impact on memory usage, then decide what to do.

#6 Updated by alant over 2 years ago

  • Assignee changed from alant to intrigeri

I won't handle that before May 15.

Let's first evaluate the impact on memory usage, then decide what to do.

Please assign it again to me if you decide it's an issue and that deadline is 3.0 final.

#7 Updated by intrigeri over 2 years ago

  • Assignee changed from intrigeri to alant

Please assign it again to me if you decide it's an issue and that deadline is 3.0 final.

Yes, please try to get this done (possibly with an ugly hack we can drop once we go the Wayland way) by the end of May.

#8 Updated by intrigeri over 2 years ago

  • Priority changed from Elevated to Normal

Please focus on #12364 first, but still, it would be nice to have this one fixed.

#9 Updated by intrigeri over 2 years ago

  • Target version deleted (Tails_3.0)
  • Parent task changed from #8230 to #11643

#10 Updated by intrigeri over 2 years ago

  • Duplicated by Bug #12999: Stopping session-c1.scope after starting tails added

#11 Updated by intrigeri over 2 years ago

#12999 has a potential solution to this problem.

#12 Updated by intrigeri over 1 year ago

https://siliconislandblog.wordpress.com/2018/05/16/gnome-performance-hackfest/ suggests it's expected behaviour (but patches are WIP) unless one uses auto-login.

#13 Updated by intrigeri 9 months ago

  • Related to Bug #16305: GDM's GNOME Shell floods the Journal with XFIXES/cursor issues on Buster added

#14 Updated by intrigeri 9 months ago

(Coming here from #16305, since fixing this old issue would probably also fix that new one.)

On Buster, loginctl terminate-session c1 (run from a root terminal in the "amnesia" session) successfully kills the Debian-gdm session and its processes without breaking anything obvious at first glance. We could try to run that with sudo as a systemd --user/ service. Not super nice but we can drop it once we've switched to Wayland (in Tails 5.0, according to our current plans).

#15 Updated by sajolida 8 months ago

  • Parent task deleted (#11643)

#16 Updated by anonym about 1 month ago

  • Priority changed from Normal to Elevated
  • Target version set to Tails_4.0

This ticket is now a release blocker for Tails 4.0.

Rationale: in our test suite we have seen that 2 GB of memory (which is our advertised minimum) is dangerously close to low-memory situations (that essentially freezes the system) under very normal usage (e.g. install a package with a-s-p and then opening Tor Browser). More details:

#17 Updated by anonym about 1 month ago

  • Blocks Bug #16281: Update the test suite for Buster added

#18 Updated by anonym about 1 month ago

#19 Updated by anonym about 1 month ago

  • Assignee deleted (alant)

I don't want to force a blocker on you, @alant, so I'll put this on the FT plate for now. Let us know if you want to work on this!

#20 Updated by intrigeri about 1 month ago

  • Description updated (diff)

#21 Updated by intrigeri about 1 month ago

  • Priority changed from Elevated to High
  • Type of work changed from Research to Code

This will fix a bunch of totally broken test cases: #16281#note-35.

#22 Updated by intrigeri about 1 month ago

  • Blocks Bug #16004: Make our automated test suite take into account USB image updates wrt. supported installation & update methods added

#23 Updated by intrigeri about 1 month ago

  • Status changed from Confirmed to In Progress
  • Assignee set to intrigeri
  • Feature Branch set to bugfix/12092-kill-gdm-session-after-login+force-all-tests

Giving it a very basic try.

#24 Updated by intrigeri about 1 month ago

This breaks ASP starting the Persistence wizard. I see the firewall blocks connections from the amnesia user to TCP port 6001 (x11-1). It looks like that's because the ASP code sets DISPLAY to :1. Setting it to :0 seems to fix the problem (only tested manually so far) but I wonder if there was a good reason for :1 and if changing it will break other stuff; we'll see!

#25 Updated by intrigeri about 1 month ago

intrigeri wrote:

This breaks ASP starting the Persistence wizard. I see the firewall blocks connections from the amnesia user to TCP port 6001 (x11-1). It looks like that's because the ASP code sets DISPLAY to :1. Setting it to :0 seems to fix the problem

Scratch this, it depends on when exactly we kill the Greeter's session: if we kill it before amnesia's X.Org is started (which we should not do), :0 is free and amnesia's X.Org takes it. 05aadd63b0d4d3d19b7a49e3029ff0c8ad5af6c0 should fix the root cause that caused :0 to be used, which broke some ASP stuff.

#26 Updated by intrigeri about 1 month ago

Current code runs in a very robust manner on my local Jenkins: in 5 full test suite runs, only failures are #11592 and some failed GnuPG key fetches. But on our shared Jenkins I see lots of failures. Analyzing the Journal seems to tell us that the whole thing is racing against GDM, that starts a new gdm-x-session immediately after we've killed the first one, which prevents the amnesia's session from loading (or being displayed, or something, and in any case we won't get the memory savings we're looking for when that happens).

One way to fix this would be to have tails-kill-gdm-session.service disable gdm.service's OnFailure=tails-gdm-failed-to-start.service, then kill the gdm process. But I'm not sure how the system will behave then: it might be that we're sent back to plymouth or something, while we'd like to remain on the VT where amnesia's X.Org is running.

Or, we could tweak our config/chroot_local-includes/usr/lib/gdm3/gdm-x-session.tails wrapper: if some flag file created by tails-kill-gdm-session.service exists, do nothing, keep running, so that GDM believes it has successfully started its X session. That's arguably a bit ugly but if it works, it might be the easiest way to get what we want reliably.

#27 Updated by intrigeri about 1 month ago

intrigeri wrote:

Or, we could tweak our config/chroot_local-includes/usr/lib/gdm3/gdm-x-session.tails wrapper: if some flag file created by tails-kill-gdm-session.service exists, do nothing, keep running, so that GDM believes it has successfully started its X session. That's arguably a bit ugly but if it works, it might be the easiest way to get what we want reliably.

Instead I made our gdm-x-session wrapper exit 0 when it's called after we've killed the GDM session. At first glance, this seems to work:

/usr/lib/gdm3/gdm-x-session[2920]: (II) Server terminated successfully (0). Closing log file.
systemd[1]: session-c1.scope: Succeeded.
systemd[1]: Stopped Session c1 of user Debian-gdm.
systemd-logind[2085]: Removed session c1.

Let's see what these to Jenkins think.

#28 Updated by intrigeri about 1 month ago

It seems to work a bit better on Jenkins (one run worked pretty fine, which did not happen before), but I still see some failures. After we click Login in the Greeter, ~15s later the screen turns black, at the time when amnesia's X.Org is started; nothing fishy in the logs that claim GNOME Shell started just fine, except that 10s after amnesia's X.Org starts, as soon as we've successfully killed GDM's session, a new one briefly starts (without starting X.Org though, thanks to 2e6edc540a58eddd23f495f50a526137e0c78b18):

Aug 22 20:54:10 amnesia systemd[1]: session-c1.scope: Succeeded.
Aug 22 20:54:10 amnesia systemd[1]: Stopped Session c1 of user Debian-gdm.
Aug 22 20:54:10 amnesia systemd-logind[2466]: Removed session c1.
Aug 22 20:54:10 amnesia gdm-launch-environment][7270]: pam_unix(gdm-launch-environment:session): session opened for user Debian-gdm by (uid=0)
Aug 22 20:54:10 amnesia systemd-logind[2466]: New session c10 of user Debian-gdm.
Aug 22 20:54:10 amnesia systemd[1]: Started Session c10 of user Debian-gdm.
Aug 22 20:54:10 amnesia gdm-launch-environment][7270]: pam_unix(gdm-launch-environment:session): session closed for user Debian-gdm
Aug 22 20:54:10 amnesia gdm3[3345]: Child process -7297 was already dead.
Aug 22 20:54:10 amnesia systemd[1]: session-c10.scope: Succeeded.
Aug 22 20:54:10 amnesia systemd-logind[2466]: Session c10 logged out. Waiting for processes to exit.
Aug 22 20:54:10 amnesia systemd-logind[2466]: Removed session c10.

I suspect that GDM switches VTs when doing so, which would explain why the screen remains black. I think the only robust solution, to take control on the screen away from GDM in a non-racy way, would be to:

  • Set KillSignal=SIGKILL for gdm.service. It may be that KillMode=control-group is needed too.
  • Have tails-kill-gdm-session.service run systemctl stop gdm before (or instead of) loginctl terminate-session c1.

This way, GDM should be left no chance to react to its first gdm-x-session aborting. Then 2e6edc540a58eddd23f495f50a526137e0c78b18 should not be needed anymore as we have a guarantee that our gdm-x-session wrapper is not started again once we've killed it along with the rest of the GDM session.

#29 Updated by intrigeri about 1 month ago

After we click Login in the Greeter, ~15s later the screen turns black, at the time when amnesia's X.Org is started; nothing fishy in the logs that claim GNOME Shell started just fine

I still see this occasionally at 107efb89771ad96cf74b025b9bad1e1f55c825e2 :/
But I have another lead for a possible fix :)

#30 Updated by intrigeri 28 days ago

OK, this finally seems robust enough on Jenkins. I'll test this on bare metal before I unleash this monstrosity to some poor soul who'll review this. The good news is that presumably, all this can go away once we're on Wayland.

#31 Updated by intrigeri 28 days ago

@segfault, if/once you're done with the other reviews, you might want to do a first code review here, in the hope that this allows us to merge this earlier, therefore unblocking various other things. Otherwise, no big deal, of course :)

#32 Updated by intrigeri 27 days ago

I'll test this on bare metal […]

I've successfully started Tails and logged in 3 times for each combination among:

  • machine: ThinkPad X200 / Elitebook 840G1
  • USB stick: very fast / rather slow
  • administration password: set / unset

Additionally:

  • I tested the same thing in a VM started from a virtual USB drive: 3 times with an admin password + 3 times without any.
  • The full test suite has been run on the current state of this branch at least 5 times on each of the 2 Jenkins I'm using.

In none of these dozens of tests could I reproduce any of the bugs that were very common on Jenkins in earlier iterations of this work. So either there's still a race condition bug somewhere but it's very hard to trigger, or the current state of this branch has effectively got rid of all such issues. I'm rather confident it's the latter but it's of course hard to prove the absence of something.

Except that at (what looked like) the last minute, while doing these tests, I've just realized that this branch breaks unlocking the screen: gnome-shell: Failed to open reauthentication channel: Gio:DBusError: GDBus.Error:org.freedesktop.DBus.Error.ServiceUnknown: The name org.gnome.DisplayManager was not provided by any .service files. Ahem. So I'm back to the drawing board ⇒ will try to find some way to keep GDM running just enough to support the lock/unlock use case, but without eating too much memory.

FTR the goal here is to:

  • fix our automated test suite that currently often fails due to lack of "disk" space (memory snapshots get smaller)
  • avoid giving more RAM to our test VM, and in turn: avoid bumping the documented RAM requirements in 4.0
  • lower the effective RAM requirements compared to 3.x; we're not going to lower the documented ones, but still :)

#33 Updated by intrigeri 27 days ago

@segfault, if/once you're done with the other reviews, you might want to do a first code review here, in the hope that this allows us to merge this earlier, therefore unblocking various other things. Otherwise, no big deal, of course :)

I hope you did not waste time on the implementation that was on the branch back when I wrote that. Hold on! That implementation had a serious flaw (on top of being overly complex) and I'm now testing a much simpler one.

#34 Updated by segfault 26 days ago

I hope you did not waste time on the implementation that was on the branch back when I wrote that. Hold on! That implementation had a serious flaw (on top of being overly complex) and I'm now testing a much simpler one.

No worries, I didn't look at it yet.

#35 Updated by intrigeri 24 days ago

  • Blocks deleted (Bug #16004: Make our automated test suite take into account USB image updates wrt. supported installation & update methods)

#36 Updated by segfault 24 days ago

I read the comments and think I somewhat understand the problem and I took a look at the code and understand what you're trying to do. But I feel like this needs a more extensive review which I don't have the time for right now :/

#37 Updated by intrigeri 24 days ago

I read the comments and think I somewhat understand the problem and I took a look at the code and understand what you're trying to do. But I feel like this needs a more extensive review which I don't have the time for right now :/

This sounds fair. Besides, one of my 5 test suite runs on Jenkins (against the state of the branch that you looked at) identified one remaining problem with that version of the implementation, so I had to change it (again × N) quite a bit ⇒ I'll have to run a bunch of tests again and will have results late today, earliest. Even if these tests indicate that all robustness problems are gone, I don't know if there's a chance I find someone to review the branch by the end of the week, so I'm not confident anymore that this will get fixed in time for 4.0~beta2. That's OKish, even though I would really have liked to ship this sort of risky changes in a beta, before including it in 4.0 final. Worst case we'll have to release 4.0~rc1 early October, which could be a good idea anyway.

#38 Updated by intrigeri 23 days ago

  • Subject changed from The new Greeter keeps eating lots of memory after logging in to The Greeter keeps eating lots of memory after logging in

#39 Updated by intrigeri 23 days ago

Oooookay, it looks like this branch is finally robust enough for my taste:

  • For each previous iteration of this branch, I've run the full test suite 5 times on my local Jenkins + 5 times on our shared Jenkins, and the latter always exposed a race condition during at least one of these run; the former sometimes did too.
  • At e540df45b13286b98597a7efb649fcd36d835ff3, I did the same 2×5 runs, and for the first time there's been no regression vs. devel.

I'll now test this on various bare metal machines, with fast and slow USB drives, and also from DVD, in the "hope" that any remaining race condition is triggered.

#40 Updated by intrigeri 22 days ago

  • Status changed from In Progress to Needs Validation

I'll now test this on various bare metal machines, with fast and slow USB drives, and also from DVD, in the "hope" that any remaining race condition is triggered.

Done: 18 boots in total, in various combinations of slow/fast USB stick / DVD, default Greeter settings / persistence with Additional Software / French + disabled network; I did all these tests on a ThinkPad X200 and a HP Elitebook 840G1. Login was successful in all cases.

Also, in the last few days, thanks to GDM debug logging + reading lots of its source code, I've reached a point when I think I know reasonably well what I'm doing here. This contrasts with my initially semi-random fiddling, even though some finer points are still not 100% clear to me.

Finally, AFAIK we should be able to drop all these kludges in 5.0, as they should not be needed on Wayland anymore: in a sid GNOME VM, Debian-gdm's GNOME+Wayland session vanishes once I log into a GNOME+Wayland session. Knowing that we're going to carry this pile of hacks only during the 4.x era makes me more comfortable proposing this branch.

So I'm hereby sending this for review. Note that the branch should not be merged as-is! See below for more about this.

FTR the goal here is to:

  • Help fix our automated test suite that currently often fails due to lack of "disk" space (this branch should make memory snapshots smaller)
  • Avoid giving more RAM to our test VM, and in turn: avoid bumping the documented RAM requirements in 4.0
  • Lower the effective RAM requirements compared to 3.x; we're not going to lower the documented ones, but still :)

I've left a bunch of "squash!" commits for now:

  • I think their commit message may help the reviewer understand what I'm doing, how, and why.
  • My plan is that once we reach an agreement, I'll squash all this, move lots of commit messages text to comments in the code, verify that the diff between the reviewed+ack'ed branch and the rewritten one consists purely of updating comments, and finally merge.

I'm not sure if I should leave GDM debug logging enabled. Ideally, I'd like to:

  • Ship this in 4.0~betaN, or more likely in ~rc1, with debug logging enabled ⇒ if there's any problem, I'll have the info I need in WhisperBack reports.
  • Disable debug logging in 4.0 final. File a ticket about it so we don't forget.

#41 Updated by intrigeri 22 days ago

  • Assignee deleted (intrigeri)

#42 Updated by segfault 22 days ago

  • Assignee set to intrigeri

I did another review and I agree that it would be good to have this in the beta release. So I propose you clean up the branch and merge into devel, if it's not too late for 4.0~beta2.

Please also review the commit I pushed on top, which adds some quotes to parameter expansions and command substitutions.

#43 Updated by intrigeri 22 days ago

  • Related to Bug #17011: Disable GDM debug logs added

#44 Updated by intrigeri 22 days ago

  • Status changed from Needs Validation to In Progress

#45 Updated by intrigeri 22 days ago

  • My plan is that once we reach an agreement, I'll squash all this, move lots of commit messages text to comments in the code, verify that the diff between the reviewed+ack'ed branch and the rewritten one consists purely of updating comments, and finally merge.

I've merged devel into the bugfix/12092-kill-gdm-session-after-login+force-all-tests branch → 5931565a7f2c945beb4bb17e3e5ae463b0eb9830.
I've done what I'm quoting in the bugfix/12092-kill-gdm-session-after-login-squashed branch → edc71c6ff8a7577fc75a1d447fd954b7187ebd37.
The only diff is edc71c6ff8a7577fc75a1d447fd954b7187ebd37, which only changes comments.
So I'm going to merge!

#46 Updated by intrigeri 22 days ago

  • Status changed from In Progress to Fix committed
  • % Done changed from 0 to 100

#47 Updated by intrigeri 22 days ago

  • Status changed from Fix committed to Resolved
  • Assignee deleted (intrigeri)
  • % Done changed from 100 to 0

Also available in: Atom PDF