Bug #16389

Some USB sticks become unbootable in legacy BIOS mode after first boot

Added by intrigeri 9 months ago. Updated 3 months ago.

Status: Resolved
Priority: High
Assignee: -
Category: Installation
Target version:
Start date: 01/25/2019
Due date:
% Done: 100%
Estimated time: 16.00 h
Feature Branch: bugfix/16389-recompute-chs
Type of work: Code
Blueprint:
Starter:
Affected tool:

Description

As per "[Tails-testers] tails 3.12rc1 becomes unbootable on bios after first use" and "[Tails-testers] Testing the 3.12 image". OPs are and . The latter "fixed" the problem by opening the drive in gdisk and rebuilding the protective MBR; they also shared an image of the MBR + GPT header of the broken stick + the diff between that one and the fixed one.

Reports of this issue:

... where recomputing CHS fixed the problem:
  1. https://lists.autistici.org/message/20190123.145208.8cf06fce.en.html
... where recomputing CHS did not fix the problem:
  1. https://lists.autistici.org/message/20190302.084144.948cedd2.en.html (they wrote about their attempt to fix this in an email to u and me, segfault):

I've tried the work-around : https://tails.boum.org/news/version_3.12/index.en.html#index2h1 ?
root # sgdisk --recompute-chs /dev/bilibop
It won't work neither. (by stand-alone boot on 3.12.1)
The symptom remains the same.

... where we don't know whether recomputing CHS would have fixed the problem:
  1. https://lists.autistici.org/message/20190121.044948.0f6116d3.en.html

Recent reports of other boot issues which do not seem to be the same issue:

  1. https://lists.autistici.org/message/20190328.021008.03aa2bc3.en.html
  2. https://lists.autistici.org/message/20190401.074103.bc76dd9e.en.html
  3. https://lists.autistici.org/message/20190419.203140.cd5d1b8a.en.html
  4. https://lists.autistici.org/message/20190522.061704.60c15ee0.en.html
  5. https://lists.autistici.org/message/20190523.135640.7f7eed6c.en.html
Unclear:
  1. https://lists.autistici.org/message/20190327.004600.b5ca1460.en.html

gpt.diff (150 Bytes) intrigeri, 01/25/2019 08:30 AM

gpt.img (17 KB) intrigeri, 01/25/2019 08:30 AM


Related issues

Related to Tails - Feature #16397: Write release notes for 3.12 Resolved 01/29/2019
Related to Tails - Bug #15987: Check the system partition on every boot and grow it if needed Rejected 09/28/2018
Related to Tails - Bug #16731: Partitioning on boot sometimes aborts because of partprobe failing Resolved

Associated revisions

Revision 6d15ad06 (diff)
Added by intrigeri 9 months ago

Recompute CHS values for the hybrid MBR after first-boot repartitioning (refs: #16389)

Some legacy BIOS systems won't boot otherwise.

Revision 80af70ce (diff)
Added by intrigeri 8 months ago

Recompute CHS values for the hybrid MBR after first-boot repartitioning (refs: #16389)

Some legacy BIOS systems won't boot otherwise.

Revision b9c09447 (diff)
Added by anonym 5 months ago

Only probe for partitions on the boot device.

Without arguments partprobe will scan all devices, and if it
encounters a device it doesn't support (e.g. fake raid-0 arrays) it
will return non-zero, thus aborting Tails' partitioning script,
resulting in an unbootable install.

Will-fix: #16389

Revision 5dfc6624 (diff)
Added by anonym 5 months ago

Only probe for partitions on the boot device.

Without arguments partprobe will scan all devices, and if it
encounters a device it doesn't support (e.g. fake raid-0 arrays) it
will return non-zero, thus aborting Tails' partitioning script,
resulting in an unbootable install.

Will-fix: #16389

Revision fef06567 (diff)
Added by segfault 4 months ago

Combine sgdisk commands in partitioning script (refs: #16389)

Revision ce2a6d5f
Added by intrigeri 3 months ago

Merge remote-tracking branch 'origin/bugfix/16389-recompute-chs' into stable

Fix-committed: #16389

History

#1 Updated by intrigeri 9 months ago

Attaching the aforementioned MBR+GPT header image + diff.

#2 Updated by intrigeri 9 months ago

Dear segfault,

Context: we need a fix ready, tested, reviewed and merged by the end of the week. That's going to be intense, especially if none of us manage to reproduce the problem (I guess our MBR+GPT headers are affected just as well as the OP's, but for some reason they boot anyway on all the hardware we've tested). I'm confident the OPs will be happy to test a tentative fix, but that'll likely take some time, so the earlier we have a tentative fix, the better.

I can make time for this over the week-end, be it to work on a fix alone, sprinting towards a fix with you, or reviewing. I'd like to know by tomorrow morning to what extent I need to change my week-end plans though. Thanks in advance!

#3 Updated by intrigeri 9 months ago

  • Assignee changed from segfault to intrigeri

I'll start working on it now. If you can help or take over, cool: then let's talk on XMPP :)

#4 Updated by intrigeri 9 months ago

The bytes affected by the diff sent by the OP are all about the partition entry for the system partition in the hybrid MBR. It's not 100% clear to me whether it's about the partition type or the CHS of the last sector: the diff and the gpt.img don't match, I suspect some byte ordering / endianness mismatch.

If it's about the CHS of the last sector, then the size of the system partition matters, and thus the size of the USB stick matters. So I'll be testing with a 32GB USB stick, that has more chances than my other (8GB) ones to display the problem.

I've compared the 1st sector of the same USB stick a) installed with Tails Installer; b) installed with the USB image in GNOME Disks and then booted once; the MBR partition entry for the 2 resulting system partitions is identical. So I would assume that a given legacy BIOS system should equally succeed, or fail, to boot both. In this sense, the problem this ticket is about might not be a regression.
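
The comparison described above can be reproduced by dumping the MBR partition table from the first sector. The following is a minimal sketch, assuming the standard MBR layout (four 16-byte partition entries starting at byte offset 446); the function name dump_mbr_entries is hypothetical and not part of Tails.

```shell
#!/bin/sh
# Hypothetical helper: print the four 16-byte MBR partition entries
# (bytes 446..509 of the first sector) of a device or image file,
# one entry per line, as hex bytes.
dump_mbr_entries() {
    # $1: block device or image file, e.g. /dev/bilibop or gpt.img
    dd if="$1" bs=1 skip=446 count=64 2>/dev/null | od -A n -t x1 -v
}
```

Each output line is one partition entry: byte 0 is the boot flag, bytes 1-3 the CHS of the first sector, byte 4 the partition type, bytes 5-7 the CHS of the last sector, and the remaining 8 bytes the starting LBA and the sector count. Saving the output for two sticks (or for gpt.img, assuming that image starts at sector 0) and comparing the files with diff would show whether their entries differ.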

#5 Updated by intrigeri 9 months ago

  • Status changed from Confirmed to In Progress

#6 Updated by intrigeri 9 months ago

  • % Done changed from 0 to 10
  • Feature Branch set to bugfix/16389-recompute-chs

The code added in this branch will apply, on first boot, part of the changes present in the provided gpt.diff, but not all of them. Hopefully that will be sufficient to fix boot on their system.

I'm starting to hope that the fix I'm preparing will not only fix this problem, but more generally fix a bunch of "Tails installed with Tails Installer does not start in legacy BIOS mode on $computer XYZ" (we have quite a few of these known issues documented), which could actually be instances of this exact problem. And my main fear is that this fix breaks stuff for systems that previously booted fine…

#7 Updated by intrigeri 9 months ago

Asked the 2 OPs to test the fix and on top of that, sent a broader call for testing to check whether this fix breaks Tails startup on other computers.

#8 Updated by intrigeri 9 months ago

  • % Done changed from 10 to 20

(Partial) test suite passed on our shared Jenkins and on my local one. USB sticks installed from a USB image built from this branch, onto a 32GB USB stick, boot fine twice on the two spare laptops I have around.

I'll wait until some point tomorrow for reports from my call for testing and then I'll decide if this is worth the risk: if I have a confirmation that this does fix the problem for the OPs, I might be tempted to say it's worth it, even if I get few confirmations that this does not break anything (on top of my own testing). We'll see.

#9 Updated by anonym 9 months ago

I have tested an image built from this branch using four different USB sticks on four laptops (all fairly different from each other except they all use Intel CPUs), and saw no regressions.

#10 Updated by intrigeri 9 months ago

  • Assignee changed from intrigeri to anonym
  • QA Check set to Ready for QA

The 2 OPs did not reply yet, so I can't say for sure that this fixes the problem they reported; I bet it does. Very few people tested the new image but they all confirmed it does not break stuff for them. So I'm really unsure what's best: we can merge this, and risk breaking systems that booted just fine during the USB image call for testing and/or on 3.12~rc1; or we can postpone, and risk discovering that the problem reported by these 2 people actually affects way more users (whose usual installation method is not supported/documented anymore).

I'm leaning towards merging, so here is a PR. But I would understand if you decide to take the other risk over this one.

#11 Updated by segfault 9 months ago

Hi,

I'm unsure about merging this. IIUC, the man page of sgdisk says that --recompute-chs sets a CHS value that violates GPT specification. This sounds like it could break booting on some systems. Could we maybe do another call for testing and then merge this in 3.13? Only two of many testers reported this problem, so I'm quite confident not too many users will be affected.

@intrigeri: I wrote you an email explaining why I went AWOL. Sorry again and thanks so much for stepping in!

#12 Updated by intrigeri 9 months ago

I'm unsure about merging this. IIUC, the man page of sgdisk says that --recompute-chs sets a CHS value that violates GPT specification. This sounds like it could break booting on some systems. Could we maybe do another call for testing and then merge this in 3.13? Only two of many testers reported this problem, so I'm quite confident not too many users will be affected.

Wrt. "Only two of many testers reported this problem", it could be because most people who answered our USB image call for testing started it only once, which would not expose the bug. And then only few people installed 3.12~rc1 from scratch to use it in production (and thus more than once): I expect most people who are ready to use a RC in production upgraded an existing stick, which again would not expose the bug. But yeah, I hear your concerns.

My best bet/guess at this point, given #16389#note-4, is that without this branch merged, the users who will face a regression are those who satisfy these two conditions:

  • Their hardware is affected by the problem. Risk: probably low, because I currently think that a Tails installed by Tails Installer would have the exact same problem, so if it was this widespread, I hope we would have noticed earlier.
  • They were previously using an installation method that is not affected, i.e. users who've been creating an "intermediate" Tails (from macOS, Windows, all Linux except Debian) and run it forever, without following the Installation Assistant last steps that instruct to create a "final" Tails with Tails Installer. I don't know how widespread such practice is.

So yeah, it feels relatively safe to skip merging this.

Either way, we should be ready to deal with the fallout if things go wrong, that is if we don't merge:

  • Explain to help desk & technical writers what are the symptoms they should notice, and what they should tell affected users, e.g.:
    • How to manually fix their broken 3.12 USB stick.
    • How to try the proposed branch and check if it fixes the bug for them.
  • If the problem affects a significant number of users:
    • Update the call for testing I've sent and give it a higher profile (blog post, Twitter?).
    • Be ready to release an emergency Tails 3.12.1 a couple weeks max after 3.12 (i.e. ASAP after collecting data to gain confidence the candidate fix does not break more than it fixes).

And if we do merge, the list of work items is similar but not quite the same.

I'm not able to lead this work (I'll be AFK for a week after 3.12). It should probably be a USB image team effort.

#13 Updated by intrigeri 9 months ago

  • Assignee changed from anonym to segfault
  • QA Check changed from Ready for QA to Dev Needed

Please provide the bits our tech writers (for release notes) and help desk will need. anonym suggests the easiest way for affected users is to reinstall, boot once, set an admin password, and then run the magic command (assuming that works from inside a running Tails, this needs to be tested).

#14 Updated by segfault 9 months ago

intrigeri wrote:

Please provide the bits our tech writers (for release notes) and help desk will need. anonym suggests the easiest way for affected users is to reinstall, boot once, set an admin password, and then run the magic command (assuming that works from inside a running Tails, this needs to be tested).

OK, I will try to do that tonight or tomorrow

#15 Updated by u 9 months ago

Hi segfault: as the release is supposed to be today, it would be super cool if you could get to that as early as possible :)

#16 Updated by u 9 months ago

Made release writers and help desk aware of the issue by email.

#17 Updated by u 9 months ago

#18 Updated by u 9 months ago

#19 Updated by u 9 months ago

#20 Updated by u 9 months ago

anonym suggests the easiest way for affected users is to reinstall, boot once, set an admin password, and then run the magic command (assuming that works from inside a running Tails, this needs to be tested).

magic command is `sudo sgdisk --recompute-chs /dev/sda`

#21 Updated by intrigeri 9 months ago

Better use sudo sgdisk --recompute-chs /dev/bilibop, that avoids the need to find out the name of the boot device: /dev/bilibop should always be a symlink to it.
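
Before running the workaround, it can be reassuring to see which device /dev/bilibop actually points to. Here is a minimal sketch of that check wrapped around the command from this ticket; both function names are hypothetical, and it assumes GNU readlink (for the -f option) and that sudo and sgdisk are available in the running system.

```shell
#!/bin/sh
# Hypothetical helper: resolve a symlink such as /dev/bilibop to the
# device node it points to.
resolve_boot_device() {
    readlink -f "${1:-/dev/bilibop}"
}

# Hypothetical wrapper around the workaround from this ticket: refuse to
# continue unless the resolved path is a block device, show it, then
# recompute the CHS values in its MBR.
recompute_chs_workaround() {
    link="${1:-/dev/bilibop}"
    device="$(resolve_boot_device "$link")" || return 1
    if [ ! -b "$device" ]; then
        echo "$device is not a block device, aborting" >&2
        return 1
    fi
    echo "About to run sgdisk --recompute-chs on $device"
    sudo sgdisk --recompute-chs "$device"
}
```

Calling recompute_chs_workaround without arguments would operate on /dev/bilibop; the block-device check is only a guard against typos and is not something the original workaround requires.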

#22 Updated by anonym 9 months ago

  • Target version changed from Tails_3.12 to Tails_3.13

#23 Updated by intrigeri 8 months ago

One of the OPs (Phredo) reported that sgdisk --recompute-chs /dev/bilibop fixed the problem for them.

#24 Updated by mercedes508 8 months ago

For the record, since 3.12, we haven't received any report about this bug.

#25 Updated by u 8 months ago

@segfault I think that fixing this bug once and for all is possibly part of the "post-release bugfixing" budget. That we did not receive bug reports since 3.12 might be due to the fact that lots of users seem to struggle with our new setup, as reported by emmapeel/frontdesk on tails-dev recently. As there is already a branch, and users report that it works for them, we might want to merge this branch. What do you think?

#26 Updated by intrigeri 8 months ago

As there is already a branch, and users report that it works for them, we might want to merge this branch.

Devil's advocate speaking: we need to balance the benefits of fixing this bug vs. the risk of regressions; wrt. the latter, we might not have enough reports that this branch won't break stuff for users for whom our current code works just fine. I would suggest pinging the relevant call for testing thread and maybe even using Twitter to get enough test results.

#27 Updated by u 8 months ago

  • Parent task changed from #15992 to #15292

#28 Updated by u 8 months ago

  • Estimated time set to 16.00 h

#29 Updated by u 8 months ago

  • Related to Bug #15987: Check the system partition on every boot and grow it if needed added

#30 Updated by u 8 months ago

intrigeri wrote:

As there is already a branch, and users report that it works for them, we might want to merge this branch.

Devil's advocate speaking: we need to balance the benefits of fixing this bug vs. the risk of regressions; wrt. the latter, we might not have enough reports that this branch won't break stuff for users for whom our current code works just fine. I would suggest pinging the relevant call for testing thread and maybe even using Twitter to get enough test results.

Ack.

#31 Updated by u 8 months ago

#32 Updated by intrigeri 8 months ago

I posted this on Twitter: https://twitter.com/Tails_live/status/1103295258576781312

Nice :)

I wonder if there's a misunderstanding though: you're gathering data about the part I find the least concerning (whether the workaround fixes the bug for affected users), not about the aspect where I mentioned earlier here that we lack info (suggesting Twitter to gather it). Anyway, any additional info is good to have :)

#33 Updated by u 8 months ago

intrigeri wrote:

I posted this on Twitter: https://twitter.com/Tails_live/status/1103295258576781312

Nice :)

I wonder if there's a misunderstanding though: you're gathering data about the part I find the least concerning (whether the workaround fixes the bug for affected users), not about the aspect where I mentioned earlier here that we lack info (suggesting Twitter to gather it). Anyway, any additional info is good to have :)

Ah then I might have misunderstood the issue. Please reformulate it: apparently I was unable to guess what you are talking about.

#34 Updated by intrigeri 8 months ago

Let's try again. We need to balance:

  • benefits of merging the proposed branch to fix this bug: it seems to me that (almost?) nobody affected by the bug complained that the workaround does not work for them (and you've asked for more confirmations on Twitter). Given the branch basically applies that workaround automatically, I'm quite confident it will fix the bug as well. If we want to increase our confidence even further, then we need to ask affected folks to try a nightly image built from the proposed branch.
  • the risk of regressions i.e. if we merge the proposed branch, will this break boot for users who are not affected by the bug this ticket is about? AFAIK we have extremely little data here, which is why we did not merge the proposed branch yet. The relevant call for testing might need updating but it is still current wrt. what info we lack.

#35 Updated by u 8 months ago

intrigeri wrote:

Let's try again. We need to balance:

  • benefits of merging the proposed branch to fix this bug: it seems to me that (almost?) nobody affected by the bug complained that the workaround does not work for them (and you've asked for more confirmations on Twitter). Given the branch basically applies that workaround automatically, I'm quite confident it will fix the bug as well. If we want to increase our confidence even further, then we need to ask affected folks to try a nightly image built from the proposed branch.
  • the risk of regressions i.e. if we merge the proposed branch, will this break boot for users who are not affected by the bug this ticket is about? AFAIK we have extremely little data here, which is why we did not merge the proposed branch yet. The relevant call for testing might need updating but it is still current wrt. what info we lack.

Great.

Next time please add the URL to whatever you're talking about. We have issued many different calls for testing on the list over the past 2 months afaik, and I've been spending quite some time trying to find out what you were talking about - I don't have the entire problem space and emails sent out by everyone else in mind all the time.

#36 Updated by u 8 months ago

I updated the email text, sent it again to the tester list, because it got possibly lost there since 5 weeks and retweeted a new call for testing this branch.

#37 Updated by intrigeri 7 months ago

Next time please add the URL to whatever you're talking about.

Right, point gladly taken.

Post-mortem: I've sent only one call for testing about this topic, that I've mentioned on this ticket (#16389#note-7), but never shared the URL, so any further reference to it was indeed ambiguous, and it took me some minutes to find it myself when you asked me to clarify. Had I provided the URL back in January, it would have been cheap for me to reference it again 2 days ago, and then hopefully it would have cleared the misunderstanding about "we might not have enough reports that this branch won't break stuff for users for whom our current code works just fine".

I've been spending quite some time trying to find out what you were talking about - I don't have the entire problem space and emails sent out by everyone else in mind all the time.

Absolutely.

FTR, taking a step back, #16389#note-26 was primarily meant for segfault. You had just pinged him, presumably with your team manager hat on, clarified that this was part of his post-release bugfixing work, and asked his input. So:

  • I expected segfault to do the exact same (painful) work I had to do to find the link to the aforementioned call for testing. I did not do this work myself because I've already done much more than I had committed to on this ticket (dealing with the problem at the last minute before the 3.12 release, while segfault was unexpectedly AFK). This is obviously not very nice of me and I'm not exactly proud of it. I see this as a direct consequence of the (poor) way we've been dealing with assignees & budget for sponsor deliverables. Thankfully the Leipzig Plan will fix that and make it easier to help each other :)
  • I didn't have you in mind here, in terms of target audience, as the person who'll do the next steps of the work. I had you in mind as my team manager who had just pinged my team-mate, suggesting a course of action based on a reasoning that, I thought, was missing a critical piece of information. I did not expect you to have all the info in mind, nor to guess what I meant, nor to do the research to find that call for testing, nor to act on my comment yourself. Had I expected you would take over the next steps of this work, I hope I would have either a) phrased things differently, taking into account the specifics of your situation on this ticket (although I realize I might very well have failed to do that); or b) realized that it would have been easier to do the work myself than to gather all the info you needed to do it.

This being said, as you know I'm not a fan of a super-rigid tasks / individual assignee structure, and I'm super happy we're working on this together and you've switched hats! In the future, to make the magic hat swap trick smoother, I would find it helpful if you let me know when you're switching hats from "team manager pinging workers" to "I'll tweet something in 48h and I need all the info now to do it right", so I understand better who my target audience is and I can try to adjust my communication style accordingly :)

And finally, I know that you've been handling quite some of the user feedback on tails-testers yourself. Thanks! This should perhaps have tipped me off about the ongoing magic hat swap trick.

#38 Updated by intrigeri 7 months ago

I updated the email text, sent it again to the tester list, because it got possibly lost there since 5 weeks and retweeted a new call for testing this branch.

Great! For future reference, the tweet is https://twitter.com/Tails_live/status/1103326873789046786 and the updated call for testing is https://lists.autistici.org/message/20190306.160900.47a129bc.en.html (been burnt once, won't do it again, at least not right away).

segfault, will you be in a position to process the user feedback from this new call for testing? (Note that some folks have started replying on Twitter, gah; presumably because the list archives don't link back to the place where one shall give feedback.) If you can't, then I offer to do it; I just need to know I'm needed here.

#39 Updated by segfault 7 months ago

segfault, will you be in a position to process the user feedback from this new call for testing?

Yes, at least every few days. But I don't see any feedback yet, except for one user who posted a screenshot on twitter. The email thread doesn't have any replies yet.

#40 Updated by u 7 months ago

We received one report (in private) that the branch fixes the problem.
As a follow up on http://lists.autistici.org/message/20190302.084144.948cedd2.en.html
where another problem is reported, i.e. when booting with two Tails keys, both seem to become unbootable. Not sure what to make of it.


When I boot only 3.11, it works fine.

I've tried the work-around : https://tails.boum.org/news/version_3.12/index.en.html#index2h1 ?
root # sgdisk --recompute-chs /dev/bilibop
It won't work neither. (by stand-alone boot on 3.12.1)
The symptom remains the same.

I've tried the experimental image : tails-amd64-bugfix_16389-recompute-chs-3.13-20190308T1708Z-af998d063a+stable@6bc35d7ef9.iso
It woks on the second boot, on stand-alone boot.

#41 Updated by u 7 months ago

And one more report from the same person:


reporting additional symptom concerning the different coming-out between ".img" and ".iso" 

In the meantime, I've tried the experimental image with ".img" extension to create a 3.12.1 bootable USB-key.
-> tails-amd64-bugfix_16389-recompute-chs-3.13-20190308T1708Z-af998d063a+stable@6bc35d7ef9.img

It won't boot for the second boot.

It might be interesting finding why with ".iso" version works, and ".img" version won't.
It seems to have some relation with, because since 3.12.x, you publish the stable bootable image with ".img" for the stable version.

#42 Updated by u 7 months ago

We still have no data to evaluate this fix.
I would like to close this ticket in order to have closure on the USB image project.
So today I've asked help desk by email to report every user that faces the issue and to point them to https://lists.autistici.org/message/20190306.160900.47a129bc.en.html after the release of 3.13.
I suggest closing this ticket if there are no more reports or actionable feedback when we release 3.14.

#43 Updated by u 7 months ago

  • Parent task deleted (#15292)

#44 Updated by u 7 months ago

  • Target version changed from Tails_3.13 to Tails_3.14

#45 Updated by u 7 months ago

  • Parent task set to #15292

#47 Updated by mercedes508 7 months ago

22099509.4091554104463649.JavaMail.defaultUser

#49 Updated by segfault 7 months ago

u wrote:

More user reports:
- http://lists.autistici.org/message/20190327.004600.b5ca1460.en.html

Not sure that's the second boot issue, because they say that "some first boots failed as well". Anyway, let's see if we get more information from this user (they didn't reply yet to your request to test our fix).

- http://lists.autistici.org/message/20190326.173127.23351d6b.en.html

We already clarified this via email, but for the record: This does not seem to be the second boot issue but a hardware specific issue.

#50 Updated by segfault 7 months ago

mercedes508 wrote:

22099509.4091554104463649.JavaMail.defaultUser

I guess you wanted to post a link to this new report about this issue: http://lists.autistici.org/message/20190401.074103.bc76dd9e.en.html

#51 Updated by u 7 months ago

  • Parent task deleted (#15292)

unparenting.

#52 Updated by segfault 7 months ago

segfault wrote:

mercedes508 wrote:

22099509.4091554104463649.JavaMail.defaultUser

I guess you wanted to post a link to this new report about this issue: http://lists.autistici.org/message/20190401.074103.bc76dd9e.en.html

This also doesn't seem to be the second boot issue, see:
https://lists.autistici.org/message/20190402.193400.1e6b73f7.en.html

#53 Updated by segfault 7 months ago

  • Description updated (diff)

It's getting hard to keep track of the different reports on this ticket, so I added them to the description.

#54 Updated by goupille 6 months ago

Bug report: 55355b65c069131f6802e26ed17a2d51

#55 Updated by segfault 6 months ago

goupille wrote:

Bug report: 55355b65c069131f6802e26ed17a2d51

@goupille: I can't find this bug report, can you forward it to me please?

#56 Updated by segfault 6 months ago

  • Description updated (diff)

In https://lists.autistici.org/message/20190419.203140.cd5d1b8a.en.html a user reports that running sgdisk --recompute-chs /dev/bilibop did fix the issue for them.

#57 Updated by u 6 months ago

@segfault @intrigeri: I believe we need to talk about what to do about this issue. There seem to be very few bug reports about it. And still not enough information. I previously suggested to close this ticket if there are no more reports or actionable feedback when we release 3.14. I would like to have your advice on this proposal.

#58 Updated by intrigeri 6 months ago

Hi!

segfault intrigeri: I believe we need to talk about what to do about this issue. There seem to be very few bug reports about it. And still not enough information. I previously suggested to close this ticket if there are no more reports or actionable feedback when we release 3.14. I would like to have your advice on this proposal.

Thanks for raising this again.

At this point, there's no doubt that this bug exists in the wild and its risk × impact is pretty high:

  • risk of occurrence: we got enough reports to know that this affects quite a few users; not tons of them, but still; keep in mind that failures to boot are harder to report as one cannot use WhisperBack
  • impact:
    • This makes Tails unusable for some people, except if they learn about the workaround and apply it (I bet many of them don't).
    • The workaround requires running code as root in a terminal, which is outside of the comfort zone of our target user base.
    • That's a regression: Tails used to work for some people and does not anymore.

So I'm not comfortable with the idea of leaving this bug unfixed, especially given we have a proposed solution that works, merely because nobody bothered testing whether that solution would break stuff for other people.

In the context of #16389#note-34:

  • wrt. the benefits of merging: it seems clear that the proposed branch does fix the problem for those who are affected by this bug
  • wrt. the risk of regressions: I see no new feedback wrt. whether the proposed branch breaks stuff for anyone who is not affected by this bug

So unfortunately, we're basically in the same situation as 3 months ago when balancing the risks/benefits ratio of the proposed solution.

I propose we start by gathering as much of the missing info (wrt. risk of regressions) as we can ourselves:

  1. Boot twice on a nightly build of the proposed fix, on as many computers as we can; I did this on the computers I have handy already
  2. Nag the Tails community so they do the same (XMPP, -dev@, maybe even -summit@); I can do that.

If we get enough info and no regression ⇒ merge. If regressions, give up and ensure the workaround gets documented in a better place than in each release notes.

If we don't get enough info from the Tails community ⇒ merge for one release and see what happens. If the fix brings regressions, revert, give up, and ensure the workaround gets documented in a better place than in each release notes. Else, close as resolved.

#59 Updated by intrigeri 6 months ago

segfault wrote:

I'm unsure about merging this. IIUC, the man page of sgdisk says that --recompute-chs sets a CHS value that violates GPT specification.

On sid it reads:

       -C, --recompute-chs
              Recompute  CHS  values  in protective or hybrid MBR. This option can
              sometimes help if a disk utility, OS, or BIOS doesn't like  the  CHS
              values  used  by  the partitions in the protective or hybrid MBR. In
              particular, the GPT specification requires a CHS value  of  0xFFFFFF
              for  over-8GiB  partitions, but this value is technically illegal by
              the usual standards. Some BIOSes hang if they encounter this  value.
              This  option  will recompute a more normal CHS value -- 0xFEFFFF for
              over-8GiB partitions, enabling these BIOSes to boot.

So yes, --recompute-chs sets a CHS value that violates GPT specification in order to fix legacy BIOS boot with some BIOSes. But according to that same manpage, the value we currently set is "technically illegal by the usual standards". So it sounds like we have to choose between violating the GPT specification or violating the "usual standards" (whatever that means).

In doubt, I'm leaning slightly towards violating a specification that I've seen no computer strictly follow, instead of violating the "usual standards" that we know some real-world BIOSes do strictly follow.

Note that UEFI boot does not use CHS, so the only possible regression is for legacy BIOS boot, with a BIOS that would zealously interpret the GPT specification and reject the 0xFEFFFF value, and instead require 0xFFFFFF, which some other BIOSes reject.
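
To make the two values concrete, the sketch below decodes a 3-byte MBR CHS tuple and computes the largest size addressable with CHS. It assumes that the man page's 0xFFFFFF and 0xFEFFFF denote the three bytes in on-disk order (head byte first), and it uses the conventional limits of 1024 cylinders, 255 heads and 63 sectors per track with 512-byte sectors; the function names are hypothetical.

```shell
#!/bin/sh
# Decode a 3-byte MBR CHS tuple given as three hex bytes.
# Layout: byte 0 = head; byte 1 bits 0-5 = sector, bits 6-7 = cylinder
# bits 8-9; byte 2 = cylinder bits 0-7.
decode_chs() {
    b0=$((0x$1)); b1=$((0x$2)); b2=$((0x$3))
    head=$b0
    sector=$((b1 & 0x3f))
    cylinder=$(( ((b1 & 0xc0) << 2) | b2 ))
    echo "C=$cylinder H=$head S=$sector"
}

# Largest size in bytes addressable via CHS, assuming 1024 cylinders,
# 255 heads, 63 sectors per track and 512-byte sectors.
chs_max_bytes() {
    echo $((1024 * 255 * 63 * 512))
}
```

Under these assumptions, decode_chs ff ff ff prints C=1023 H=255 S=63 and decode_chs fe ff ff prints C=1023 H=254 S=63, so the two values differ only in the head number, and head 255 falls outside the usual 0-254 range. chs_max_bytes prints 8422686720 (about 7.8 GiB), which is why partitions ending beyond that point get a saturated CHS value rather than a real one.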

#60 Updated by segfault 5 months ago

  • Description updated (diff)

intrigeri wrote:

segfault intrigeri: I believe we need to talk about what to do about this issue. There seem to be very few bug reports about it. And still not enough information. I previously suggested to close this ticket if there are no more reports or actionable feedback when we release 3.14. I would like to have your advice on this proposal.

Thanks for raising this again.

At this point, there's no doubt that this bug exists in the wild and its risk × impact is pretty high:

  • risk of occurrence: we got enough reports to know that this affects quite a few users; not tons of them, but still

Agreed.

keep in mind that failures to boot are harder to report as one cannot use WhisperBack

Yes, they are harder to report, but 1. we asked for those reports in the release notes of the last 5 releases, and 2. not being able to boot (when the user was able to boot Tails before) has the biggest impact on usability, and I think that the worse the impact on usability, the better the chances that the user wants to report the bug. And they are still able to report it by reinstalling Tails or using it on another machine (which they will have to do anyway, unless they just stop using Tails altogether). So I actually expect that an affected user is more likely to report this bug than most other bugs.

  • impact:
    • This makes Tails unusable for some people, except if they learn about the workaround and apply it (I bet many of them don't).
    • The workaround requires running code as root in a terminal, which is outside of the comfort zone of our target user base.
    • That's a regression: Tails used to work for some people and does not anymore.

So I'm not comfortable with the idea of leaving this bug unfixed, especially given we have a proposed solution that works, merely because nobody bothered testing whether that solution would break stuff for other people.

Agreed.

In the context of #16389#note-34:

  • wrt. the benefits of merging: it seems clear that the proposed branch does fix the problem for those who are affected by this bug

I don't think that's clear yet. Of the four people who reported this problem, only three tried the fix and it only worked for two. I updated the description to include this information.

  • wrt. the risk of regressions: I see no new feedback wrt. whether the proposed branch breaks stuff for anyone who is not affected by this bug

So unfortunately, we're basically in the same situation as 3 months ago when balancing the risks/benefits ratio of the proposed solution.

I propose we start by gathering as much of the missing info (wrt. risk of regressions) as we can ourselves:

  1. Boot twice on a nightly build of the proposed fix, on as many computers as we can; I did this on the computers I have handy already

I think it should be enough to boot once, with a Tails device that was already booted once on any machine, right?

  1. Nag the Tails community so they do the same (XMPP, -dev@, maybe even -summit@); I can do that.

If we get enough info and no regression ⇒ merge. If regressions, give up and ensure the workaround gets documented in a better place than in each release notes.

If we don't get enough info from the Tails community ⇒ merge for one release and see what happens. If the fix brings regressions, revert, give up, and ensure the workaround gets documented in a better place than in each release notes. Else, close as resolved.

I'm OK with that procedure.

#61 Updated by segfault 5 months ago

  • Description updated (diff)

segfault wrote:

intrigeri wrote:

  • wrt. the benefits of merging: it seems clear that the proposed branch does fix the problem for those who are affected by this bug

I don't think that's clear yet. Of the four people who reported this problem, only three tried the fix and it only worked for two. I updated the description to include this information.

I just noticed that I didn't update the description after the last email from , who first reported that they could fix their boot issue by recalculating the CHS, but in their second email they write that they actually installed this Tails device via the ISO method, so they were not affected by this issue - or the issue does not only affect the USB image, in which case it would not be a regression. That only leaves a single user affected by this issue for whom recalculating the CHS fixed it.

#62 Updated by u 5 months ago

intrigeri wrote:

I propose we start by gathering as much of the missing info (wrt. risk of regressions) as we can ourselves:

  1. Boot twice on a nightly build of the proposed fix, on as many computers as we can; I did this on the computers I have handy already

Would one of you be able (if not the case yet) to merge the current stable branch into this build?

Then: please post a link to the corresponding nightly build here, so that everyone knows which image to test.

  1. Nag the Tails community so they do the same (XMPP, -dev@, maybe even -summit@); I can do that.

I'd be happy to ask on Twitter too, once I know that the build and branch are actually up-to-date.

If we get enough info and no regression ⇒ merge. If regressions, give up and ensure the workaround gets documented in a better place than in each release notes.
If we don't get enough info from the Tails community ⇒ merge for one release and see what happens. If the fix brings regressions, revert, give up, and ensure the workaround gets documented in a better place than in each release notes. Else, close as resolved.

Ack.

#63 Updated by segfault 5 months ago

u wrote:

Would one of you be able (if not the case yet) to merge the current stable branch into this build?

Done.

Then: please post a link to the corresponding nightly build here, so that everyone knows which image to test.

This should be the link: https://nightly.tails.boum.org/build_Tails_ISO_bugfix-16389-recompute-chs/builds/lastSuccessfulBuild/archive/latest.img

But we have to wait for the current build to finish to have the latest commits I just merged from stable.

#64 Updated by intrigeri 5 months ago

  1. Boot twice on a nightly build of the proposed fix, on as many computers as we can; I did this on the computers I have handy already

I think it should be enough to boot once, with a Tails device that was already booted once on any machine, right?

Yes.

#65 Updated by intrigeri 5 months ago

That only leaves a single user affected by this issue for whom recalculating the CHS fixed it.

Note that users are even less likely to report that the workaround was successful than to report the problem in the first place.

#66 Updated by intrigeri 5 months ago

This should be the link: https://nightly.tails.boum.org/build_Tails_ISO_bugfix-16389-recompute-chs/builds/lastSuccessfulBuild/archive/latest.img

FWIW, please don't send users to latest.img as it makes it harder for them to tell us which exact image they've tested. Experience has shown that (somewhat unfortunately) we gather more useful feedback if we ask people to test the .img file found in https://nightly.tails.boum.org/build_Tails_ISO_bugfix-16389-recompute-chs/builds/lastSuccessfulBuild/archive/build-artifacts/, since its filename encodes the relevant info :)

#67 Updated by u 5 months ago

  • Blocks Bug #16724: Write a notice about USB image and ISC to monthly report added

#68 Updated by u 5 months ago

#69 Updated by u 5 months ago

I tested the img on a x230 and a simple Toshiba USB stick of 8GB: booted fine several times, the partition was resized correctly on first boot.

#70 Updated by u 5 months ago

Successfully installed to a 16GB Kingston DataTraveler flash drive using dd under Arch Linux, booted and re-booted on all three of the systems available to me. None of these systems previously experienced the second boot failure:

HP Pavilion dm4 laptop
Lenovo Thinkpad T410 laptop
Mintbox Mini AMD APC desktop L

#71 Updated by anonym 5 months ago

  • % Done changed from 20 to 30
  • Feature Branch changed from bugfix/16389-recompute-chs to bugfix/16389-explicit-partprobe

A very helpful user on the Tails XMPP chat room reported that for them the problem is that partprobe (in config/chroot_local-includes/usr/share/initramfs-tools/scripts/init-premount/partitioning) returns non-zero, so the script bails out because of set -e. Their system has a fakeraid RAID-0 array, and if partprobe scans it, it exits with "Error: Can't have a partition outside the disk!". The fix is to call partprobe "${PARENT_DEVICE}" instead, so that only the Tails device is probed.

I might be wrong, but I would expect such raid setups to be pretty rare so I'm doubtful this is the main cause for this ticket. Luckily we can easily verify if this is the problem by instructing users to run just partprobe; echo $? and check the return value (and any errors printed): if non-zero then the fix above is probably the solution; if zero and they still cannot boot, then there is some other problem unrelated to partprobe.
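For users who want to run that check, here is a small sketch of it. report_status is a hypothetical helper name, not something from the Tails scripts; it only runs the given command and prints the exit status, without aborting even if the calling shell uses set -e.

```shell
#!/bin/sh
# Hypothetical helper: run a command, let its own error messages
# through, and print its exit status. The "|| status=$?" form keeps
# a non-zero status from aborting a shell running under "set -e".
report_status() {
    status=0
    "$@" || status=$?
    echo "exit status: ${status}"
}

# Examples (need root; /dev/sdX is a placeholder for the Tails device):
#   report_status partprobe             # scans all block devices
#   report_status partprobe /dev/sdX    # scans only the given device
```

If the first form reports a non-zero status while the second reports 0, that would be consistent with the failure described above, where scanning an unrelated device makes partprobe fail.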

#72 Updated by segfault 5 months ago

  • Feature Branch changed from bugfix/16389-explicit-partprobe to bugfix/16389-recompute-chs

Restoring old feature branch, see #16731#note-5.

#73 Updated by intrigeri 5 months ago

  • Related to Bug #16731: Partitioning on boot sometimes aborts because of partprobe failing added

#74 Updated by segfault 5 months ago

  • Description updated (diff)

#75 Updated by CyrilBrulebois 5 months ago

  • Target version changed from Tails_3.14 to Tails_3.15

#77 Updated by u 5 months ago

u wrote:

Sent call for testing to tails-summit, tails-dev, tails-testers and Twitter. https://lists.autistici.org/message/20190516.122045.95d80d05.en.html // https://twitter.com/Tails_live/status/1128999794305904641

There was only one person who tested this and reported that it worked for them (alienpup).

#78 Updated by segfault 4 months ago

We got another test report from muri, he had no issues booting the recompute CHS image on two different machines (HP Elitebook 2560p and Lenovo Thinkpad X230).

I also tested the image on two machines without issues.

#79 Updated by u 4 months ago

gagz also reported that it worked for them, and so did intrigeri.

#80 Updated by u 4 months ago

  • Blocks deleted (Bug #16724: Write a notice about USB image and ISC to monthly report)

#81 Updated by u 4 months ago

@segfault @intrigeri I would like to finalize this asap. Judging from our tests, it seems like this will not break functionality for people for whom it currently works, i.e. we could probably merge this and release the next ISO/IMG using bugfix/16389-recompute-chs. Do we agree here? If yes, could you give me an ETA for when this would be done?
Thanks!

#82 Updated by intrigeri 4 months ago

we could probably merge this and release the next ISO/IMG using bugfix/16389-recompute-chs. Do we agree here?

I agree.

#83 Updated by segfault 4 months ago

intrigeri wrote:

we could probably merge this and release the next ISO/IMG using bugfix/16389-recompute-chs. Do we agree here?

I agree.

Fine with me. But looking at the code again, I think that we could merge the two consecutive sgdisk commands into one:

# Recompute CHS values for the hybrid MBR (see #16389) and set the 
# following attributes on the system partition (we have to set them 
# after running fatresize, because fatresize resets them):
#   0: system partition
#   2: legacy BIOS bootable
#   60: read-only
#   62: hidden
#   63: do not automount
sgdisk \
    --recompute-chs \
    --attributes=1:set:0  \
    --attributes=1:set:2  \
    --attributes=1:set:60 \
    --attributes=1:set:62 \
    --attributes=1:set:63 \
    "${PARENT_DEVICE}" 

This would save >1 second of boot time during the first boot (takes 1 second in my VM with the image lying on an SSD, probably takes multiple seconds when booted from USB).

@intrigeri, if you agree with that, I can push a commit and merge into stable.

#84 Updated by intrigeri 4 months ago

But looking at the code again, I think that we could merge the two consecutive sgdisk commands into one:

OK!

Now, sgdisk(8) reads "sgdisk interprets options in the order in which they're entered, so effects can vary depending on order" so to increase the chances this merged command has the same effect as the previous 2 commands, I suggest we pass --recompute-chs last. In any case, the comment will need merging too. Feel free to merge this into stable once 3.14.2 is out and your updated branch has passed our test suite.

In passing, --recompute-chs would fit more logically in the sgdisk command that actually resizes partitions and stuff, but we've struggled so much to get even a few test results here that I'd rather not invalidate them by merging something that's too different from what has been tested (that we can't think of a reason why this would matter does not imply it doesn't). Merging these 2 commands as you're suggesting is fine by me but anything more involved would make me worry.
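For illustration, the merged command with --recompute-chs passed last could look like the sketch below. It reuses only the options from the command proposed in note 83; the function name and the SGDISK override variable are hypothetical additions that allow a dry run, and this is not necessarily what was eventually committed.

```shell
#!/bin/sh
# Sketch only: same options as the command proposed in note 83, with
# --recompute-chs moved to the end since sgdisk interprets options in
# the order given. SGDISK is a hypothetical override for dry runs,
# e.g. SGDISK=echo prints the arguments instead of running sgdisk.
set_system_partition_flags() {
    "${SGDISK:-sgdisk}" \
        --attributes=1:set:0 \
        --attributes=1:set:2 \
        --attributes=1:set:60 \
        --attributes=1:set:62 \
        --attributes=1:set:63 \
        --recompute-chs \
        "$1"
}
```

In the partitioning script the argument would be "${PARENT_DEVICE}". Setting SGDISK=echo first prints the arguments in the order sgdisk would process them, without touching any disk.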

#85 Updated by sajolida 4 months ago

Here is something weird that happened to me while I was working on #16837 and has similar symptoms.

I was testing different versions of the splash screen by replacing syslinux/splash.png in the system partition of a Tails USB stick from Debian and testing the resulting USB stick on a second laptop.

The USB stick was freshly installed and got mounted on Debian as /media/$USER/TAILS (in all caps).

I started the USB stick (it was its first boot) and forced a shutdown right after the boot menu, as I was only interested in seeing the boot splash screen. Then I plugged it into Debian again to try a different splash screen. This time it got mounted on Debian as /media/$USER/Tails (only 1 cap) and the file system was empty. I couldn't boot a second time from this same USB stick as the partition was empty.

I'm not sure it's relevant here but there might be something in the process of the first boot that leads to a broken USB stick if interrupted at a bad moment.

#86 Updated by segfault 4 months ago

intrigeri wrote:

But looking at the code again, I think that we could merge the two consecutive sgdisk commands into one:

OK!

Now, sgdisk(8) reads "sgdisk interprets options in the order in which they're entered, so effects can vary depending on order" so to increase the chances this merged command has the same effect as the previous 2 commands, I suggest we pass --recompute-chs last.

OK.

In any case, the comment will need merging too.

I don't understand, the comment in my post above already mentions both recompute-chs and the attributes.

Feel free to merge this into stable once 3.14.2 is out and your updated branch has passed our test suite.

Ack.

In passing, --recompute-chs would fit more logically in the sgdisk command that actually resizes partitions and stuff, but we've struggled so much to get even a few test results here that I'd rather not invalidate them by merging something that's too different from what has been tested (that we can't think of a reason why this would matter does not imply it doesn't). Merging these 2 commands as you're suggesting is fine by me but anything more involved would make me worry.

Same, that's why I proposed to add it to the second sgdisk command.

#87 Updated by segfault 4 months ago

I pushed the commit to the feature branch, please merge if the tests pass.

#88 Updated by intrigeri 4 months ago

In any case, the comment will need merging too.

I don't understand, the comment in my post above already mentions both recompute-chs and the attributes.

I missed that, sorry!

#89 Updated by intrigeri 3 months ago

  • Status changed from In Progress to Needs Validation
  • Assignee changed from segfault to intrigeri

segfault wrote:

I pushed the commit to the feature branch, please merge if the tests pass.

Oops, I missed that. Adjusting metadata accordingly.

#90 Updated by intrigeri 3 months ago

sajolida wrote:

there might be something in the process of the first boot that leads to a broken USB stick if interrupted at a bad moment.

Indeed the partitioning process is not a single atomic operation. Some steps, e.g. resizing the FAT filesystem, are likely to yield a broken filesystem or non-bootable USB stick if interrupted. I'm afraid this limitation is here to stay so the best we could do is mitigations (e.g. via UI).

#91 Updated by intrigeri 3 months ago

  • Status changed from Needs Validation to Fix committed
  • % Done changed from 30 to 100

#92 Updated by intrigeri 3 months ago

  • Assignee deleted (intrigeri)

#93 Updated by CyrilBrulebois 3 months ago

  • Status changed from Fix committed to Resolved
