Bug #12589

Feature #12160: Upgrade all systems to Stretch

Enabling LUKS-backed PVs on lizard takes ages in the initramfs

Added by intrigeri over 2 years ago. Updated almost 2 years ago.

Status: Resolved
Priority: Normal
Assignee:
Category: Infrastructure
Target version:
Start date: 05/24/2017
Due date:
% Done: 100%
Feature Branch:
Type of work: Sysadmin
Blueprint:
Starter:
Affected tool:

Description

Noticed this today after upgrading to Stretch: the initramfs stalled for several minutes after each PV was unlocked. Each time, 4 pvscan processes were running for a while.

Random ideas:

  • Maybe that's because the initramfs is looking for the nested PVs (PV-on-LV) we have? They're painful anyway, maybe we should get rid of them.
  • Maybe there's a timeout somewhere that we should lower.

Related issues

Blocks Tails - Feature #13284: Core work: Sysadmin (Adapt our infrastructure) Confirmed 06/30/2017

History

#1 Updated by intrigeri about 2 years ago

  • Status changed from Confirmed to In Progress
  • % Done changed from 0 to 10

intrigeri wrote:

  • Maybe that's because the initramfs is looking for the nested PVs (PV-on-LV) we have? They're painful anyway, maybe we should get rid of them.

Indeed, the 4 pvscan instances are looking for LVs whose VG is built on top of a PV that's stored as a LV. These commands are run by lib/udev/rules.d/69-lvm-metad.rules. I've verified that the affected block devices are hard-coded nowhere in our initramfs.

We should do one of:

  • get rid of this overly complex setup; we don't have enough spare space to migrate the bitcoin-data one so we would have to download the entire blockchain again, but that's no big deal;
  • teach our initramfs not to wait for these LVs to appear: see filter and global_filter in lvm.conf, and perhaps use_lvmetad = 0.

The first option is more likely to work out of the box, and it's easy to predict the (limited) amount of time it'll take. The second option may take a few retries (i.e. reboots), entails the risk of a non-booting machine, and I'm not even sure it'll work in the end. Both options will cause some limited downtime. So I'll go for the first option.

#2 Updated by intrigeri about 2 years ago

tl;dr: VG-on-PV-on-partition-inside-LV is OK; VG-on-PV-on-LV is not.

Migration procedure, for each of the four affected VG-on-PV-on-LV (as said above, bitcoin-data will require a different procedure though):

  1. in the VM that uses the VG:
    1. comment out every line that relies on this VG in fstab
    2. update the initramfs
    3. power off the VM
  2. on the host system:
    1. create a new LV with the same size as the problematic one
    2. dd the filesystem hosted by the old LV-on-VG-on-PV-on-LV to the new LV
    3. deactivate the VG-on-PV-on-LV
    4. deactivate the old PV-on-LV
    5. delete the old LV
    6. give the new LV the name the old one had
    7. ensure the new LV is backed by an appropriate PV, pvmove if needed
  3. update the VM accordingly:
    1. start the VM
    2. enter the VM
    3. update fstab to point to the new backing storage location
    4. update the initramfs
  4. reboot the VM and check that everything is up

And finally, reboot lizard to confirm the problem is gone.
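
For illustration only, the host-side part of the procedure (step 2) could be written down as a dry-run plan like the Python sketch below. It only builds and prints command lines, it runs nothing. All VG/LV names and the size are hypothetical placeholders, the lvchange line is my assumed reading of "deactivate the old PV-on-LV", and the optional pvmove step is left out because it depends on which PVs are involved. Treat it as a checklist to review, not as a tested migration script.

```python
import shlex


def host_side_plan(host_vg, old_lv, new_lv, size, guest_vg, guest_lv):
    """Return the host-side commands (step 2) as argv lists, without running them."""
    return [
        # 2.1: create a new LV with the same size as the problematic one
        ["lvcreate", "-L", size, "-n", new_lv, host_vg],
        # 2.2: copy the filesystem from the nested LV to the new LV
        ["dd", f"if=/dev/{guest_vg}/{guest_lv}",
         f"of=/dev/{host_vg}/{new_lv}", "bs=4M"],
        # 2.3: deactivate the nested VG
        ["vgchange", "-an", guest_vg],
        # 2.4: deactivate the LV holding the nested PV (assumed reading of the step)
        ["lvchange", "-an", f"{host_vg}/{old_lv}"],
        # 2.5: delete the old LV
        ["lvremove", f"{host_vg}/{old_lv}"],
        # 2.6: rename the new LV so it takes over the old name
        ["lvrename", host_vg, new_lv, old_lv],
        # 2.7 (pvmove, if needed) is intentionally not generated here
    ]


if __name__ == "__main__":
    # Hypothetical names and size, for illustration only.
    for argv in host_side_plan("hostvg", "data", "data-new", "10G",
                               "guestvg", "data"):
        print(shlex.join(argv))
```

Printing the plan first makes it easier to double-check each device path before typing anything destructive such as dd or lvremove.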

Note: all our VMs have another VG-over-LV, that hosts their root filesystem. I think these didn't cause problems because these VGs are backed by a PV that's a partition inside the LV lizard can see, so the LV lizard can see is not a PV and our initramfs doesn't care about it.

#3 Updated by intrigeri about 2 years ago

Reminder: bertagaz, you might hit this problem next time you reboot lizard. Last time I had to kill the faulty (see above) pvscan processes by hand a few times.

#4 Updated by intrigeri about 2 years ago

  • Blocks Feature #13233: Core work 2017Q3: Sysadmin (Maintain our already existing services) added

#5 Updated by intrigeri about 2 years ago

bertagaz, did you notice this problem during the reboot you did 9 days ago? If yes, how did you solve or work around it?

#6 Updated by bertagaz about 2 years ago

intrigeri wrote:

bertagaz, did you notice this problem during the reboot you did 9 days ago? If yes, how did you solve or work around it?

Yes, both times I rebooted lizard I hit the problem described in this ticket, and had to apply the same fix (killing the pvscan processes).

#7 Updated by intrigeri about 2 years ago

Another option could be to use filter in /etc/lvm/lvm.conf to make the host ignore guest LVM VGs, e.g. https://lists.debian.org/debian-devel/2017/07/msg00221.html.
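
To make it easier to reason about candidate filter values before rebooting, here is a small Python sketch of how I understand lvm.conf filters to be evaluated: entries are tried in order, the first one whose regex matches decides (a = accept, r = reject), and a device matching no entry is accepted. This is an approximation: it uses Python's re module rather than LVM's own regex engine, it checks a single path rather than all aliases of a device, and the device paths and VG name are made up for the example.

```python
import re


def parse_entry(entry):
    """Split a filter entry such as 'r|^/dev/foo$|' into (action, compiled regex)."""
    if len(entry) < 3 or entry[0] not in ("a", "r") or entry[-1] != entry[1]:
        raise ValueError(f"malformed filter entry: {entry!r}")
    # entry[0] is the action, entry[1] the delimiter, the rest is the regex.
    return entry[0], re.compile(entry[2:-1])


def device_accepted(path, filter_entries):
    """Return True if the device at `path` would be scanned under this filter."""
    for entry in filter_entries:
        action, regex = parse_entry(entry)
        # Patterns are unanchored unless they use ^ or $, hence search().
        if regex.search(path):
            return action == "a"
    # No entry matched: the device is accepted.
    return True


# Hypothetical filter: reject one guest-owned LV, accept everything else.
EXAMPLE_FILTER = ["r|^/dev/hostvg/guest-data$|", "a|.*|"]

if __name__ == "__main__":
    for dev in ("/dev/hostvg/guest-data", "/dev/sda1"):
        verdict = "accepted" if device_accepted(dev, EXAMPLE_FILTER) else "rejected"
        print(dev, verdict)
```

Because the first match wins, the reject entries for guest-owned LVs have to come before any catch-all accept entry.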

#8 Updated by intrigeri almost 2 years ago

  • Blocks deleted (Feature #13233: Core work 2017Q3: Sysadmin (Maintain our already existing services))

#9 Updated by intrigeri almost 2 years ago

  • Blocks Feature #13284: Core work: Sysadmin (Adapt our infrastructure) added

#10 Updated by intrigeri almost 2 years ago

  • % Done changed from 10 to 20

Hopefully fixed with commit 718e98cbcdfb9ae3a3fcaf75ac2af3abbde1e0c7 in our manifests repo. Refreshed initramfs, let's see how it goes on next reboot.

#11 Updated by intrigeri almost 2 years ago

Sadly, that's not enough to fix the problem. I think that's because lvmetad is not running in the initramfs so we might need to use filter instead of global_filter. Tried this (commit dac6c4c).

#12 Updated by intrigeri almost 2 years ago

Still not enough => tried harder (commit c4f2fc7).

#13 Updated by intrigeri almost 2 years ago

If that next try doesn't work either, I'll document the workaround one must apply at boot time and will give up: we can apply this workaround for many, many reboots, across many years, before it costs more than the more involved fix I've mentioned earlier on this thread.

So: groente & bertagaz, if you handle the next reboot, please pay attention and report back here wrt. whether the problem is solved or not.

#14 Updated by intrigeri almost 2 years ago

  • Blocks deleted (Feature #13284: Core work: Sysadmin (Adapt our infrastructure))

#15 Updated by intrigeri almost 2 years ago

  • Status changed from In Progress to Resolved
  • % Done changed from 20 to 100
  • Parent task deleted (#12160)

Documented the workaround with instructions to report back here => calling this done.

#16 Updated by intrigeri almost 2 years ago

  • Parent task set to #12160

#17 Updated by intrigeri almost 2 years ago

  • Subject changed from Enabling LUKS-backed PVs takes ages in the initramfs to Enabling LUKS-backed PVs on lizard takes ages in the initramfs

#18 Updated by intrigeri almost 2 years ago

  • Blocks Feature #13284: Core work: Sysadmin (Adapt our infrastructure) added

#19 Updated by groente almost 2 years ago

alas, the problem is not solved :(
