Bug #16389

Some USB sticks become unbootable in legacy BIOS mode after first boot

Added by intrigeri 3 months ago. Updated 1 day ago.

Status: In Progress
Priority: High
Assignee:
Category: Installation
Target version:
Start date: 01/25/2019
Due date:
% Done: 20%
Estimated time: 16.00 h
QA Check: Dev Needed
Feature Branch: bugfix/16389-recompute-chs
Type of work: Code
Blueprint:
Starter:
Affected tool:

Description

As per "[Tails-testers] tails 3.12rc1 becomes unbootable on bios after first use" and "[Tails-testers] Testing the 3.12 image". OPs are and . The latter "fixed" the problem by opening the drive in gdisk and rebuilding the protective MBR; they also shared an image of the MBR + GPT header of the broken stick + the diff between that one and the fixed one.

Reports of this issue:
  1. https://lists.autistici.org/message/20190121.044948.0f6116d3.en.html
  2. https://lists.autistici.org/message/20190123.145208.8cf06fce.en.html
  3. https://lists.autistici.org/message/20190302.084144.948cedd2.en.html
  4. https://lists.autistici.org/message/20190419.203140.cd5d1b8a.en.html

Recent reports of other boot issues which do not seem to be the same issue:
  1. https://lists.autistici.org/message/20190328.021008.03aa2bc3.en.html
  2. https://lists.autistici.org/message/20190401.074103.bc76dd9e.en.html

Unclear:
  1. https://lists.autistici.org/message/20190327.004600.b5ca1460.en.html

gpt.diff (150 Bytes) intrigeri, 01/25/2019 08:30 AM

gpt.img (17 KB) intrigeri, 01/25/2019 08:30 AM


Related issues

Related to Tails - Feature #16397: Write release notes for 3.12 Resolved 01/29/2019
Related to Tails - Bug #15987: Check the system partition on every boot and grow it if needed Confirmed 09/28/2018

Associated revisions

Revision 6d15ad06 (diff)
Added by intrigeri 3 months ago

Recompute CHS values for the hybrid MBR after first-boot repartitioning (refs: #16389)

Some legacy BIOS systems won't boot otherwise.

Revision 80af70ce (diff)
Added by intrigeri 2 months ago

Recompute CHS values for the hybrid MBR after first-boot repartitioning (refs: #16389)

Some legacy BIOS systems won't boot otherwise.

History

#1 Updated by intrigeri 3 months ago

Attaching the aforementioned MBR+GPT header image + diff.

#2 Updated by intrigeri 3 months ago

Dear segfault,

Context: we need a fix ready, tested, reviewed and merged by the end of the week. That's going to be intense, especially if none of us manage to reproduce the problem (I guess our MBR+GPT headers are affected just as well as the OP's, but for some reason they boot anyway on all the hardware we've tested). I'm confident the OPs will be happy to test a tentative fix, but that'll likely take some time, so the earlier we have a tentative fix, the better.

I can make time for this over the week-end, be it to work on a fix alone, sprinting towards a fix with you, or reviewing. I'd like to know by tomorrow morning to what extent I need to change my week-end plans though. Thanks in advance!

#3 Updated by intrigeri 3 months ago

  • Assignee changed from segfault to intrigeri

I'll start working on it now. If you can help or take over, cool: then let's talk on XMPP :)

#4 Updated by intrigeri 3 months ago

The bytes affected by the diff sent by the OP are all about the partition entry for the system partition in the hybrid MBR. It's not 100% clear to me whether it's about the partition type or the CHS of the last sector: the diff and the gpt.img don't match, I suspect some byte ordering / endianness mismatch.

If it's about the CHS of the last sector, then the size of the system partition matters, and thus the size of the USB stick matters. So I'll be testing with a 32GB USB stick, which is more likely than my other (8GB) ones to exhibit the problem.
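
To illustrate why the partition size matters for the CHS of the last sector, here is a minimal Python sketch of the classic LBA-to-CHS conversion. It assumes the conventional geometry of 255 heads and 63 sectors per track, and it clamps to the largest representable tuple when the cylinder number overflows; whether sgdisk uses exactly these values is an assumption, not something verified on this ticket. The function names are made up for illustration.

```python
# Assumed geometry (conventional values, not confirmed on this ticket).
HEADS = 255
SECTORS_PER_TRACK = 63
MAX_CYLINDER = 1023  # the cylinder field of an MBR entry is only 10 bits wide


def lba_to_chs(lba):
    """Convert a 0-based LBA sector number to a (cylinder, head, sector) tuple."""
    cylinder, remainder = divmod(lba, HEADS * SECTORS_PER_TRACK)
    if cylinder > MAX_CYLINDER:
        # Not representable: clamp to the largest legal CHS tuple.
        return (MAX_CYLINDER, HEADS - 1, SECTORS_PER_TRACK)
    head, sector = divmod(remainder, SECTORS_PER_TRACK)
    return (cylinder, head, sector + 1)  # CHS sector numbers start at 1


def encode_chs(cylinder, head, sector):
    """Pack a CHS tuple into the 3-byte layout used by MBR partition entries."""
    return bytes([
        head,
        (sector & 0x3F) | ((cylinder >> 2) & 0xC0),  # sector + cylinder bits 8-9
        cylinder & 0xFF,                             # cylinder bits 0-7
    ])
```

Under these assumptions, CHS can address at most 1024 × 255 × 63 = 16450560 sectors, i.e. about 8.4 GB with 512-byte sectors. A partition whose last sector lies beyond that point can only get the clamped value (bytes FE FF FF), while a smaller one gets a value that depends on where exactly it ends.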

I've compared the 1st sector of the same USB stick a) installed with Tails Installer; b) installed with the USB image in GNOME Disks and then booted once; the MBR partition entry for the 2 resulting system partitions is identical. So I would assume that a given legacy BIOS system should equally succeed, or fail, to boot both. In this sense, the problem this ticket is about might not be a regression.
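
For reference, a comparison like the one above can be sketched in Python as follows. This is only an illustration based on the standard MBR layout (4 entries of 16 bytes at offset 446, signature 55 AA at offset 510); it is not the tool that was actually used here, and the function names are hypothetical.

```python
import struct

PARTITION_TABLE_OFFSET = 446  # standard MBR layout
ENTRY_SIZE = 16


def decode_chs(raw):
    """Unpack the 3-byte CHS field of an MBR entry into (cylinder, head, sector)."""
    head = raw[0]
    sector = raw[1] & 0x3F
    cylinder = ((raw[1] & 0xC0) << 2) | raw[2]
    return (cylinder, head, sector)


def parse_mbr_entries(sector0):
    """Parse the 4 partition entries found in the first sector of a disk."""
    if len(sector0) < 512 or sector0[510:512] != b"\x55\xaa":
        raise ValueError("not a valid MBR sector")
    entries = []
    for index in range(4):
        start = PARTITION_TABLE_OFFSET + index * ENTRY_SIZE
        status, chs_first, ptype, chs_last, lba_first, count = struct.unpack(
            "<B3sB3sII", sector0[start:start + ENTRY_SIZE]
        )
        entries.append({
            "status": status,
            "type": ptype,
            "chs_first": decode_chs(chs_first),
            "chs_last": decode_chs(chs_last),
            "lba_first": lba_first,
            "sector_count": count,
        })
    return entries


def differing_entries(sector_a, sector_b):
    """Return the indexes of the partition entries that differ between two MBRs."""
    pairs = zip(parse_mbr_entries(sector_a), parse_mbr_entries(sector_b))
    return [index for index, (a, b) in enumerate(pairs) if a != b]
```

The first sector of each stick would have to be read separately (for example the first 512 bytes of the device, which typically requires root) and passed to `differing_entries()`; an empty list means the MBR partition entries are identical.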

#5 Updated by intrigeri 3 months ago

  • Status changed from Confirmed to In Progress

#6 Updated by intrigeri 3 months ago

  • % Done changed from 0 to 10
  • Feature Branch set to bugfix/16389-recompute-chs

The code added in this branch will apply, on first boot, part of the changes present in the provided gpt.diff, but not all of them. Hopefully that will be sufficient to fix boot on their system.

I'm starting to hope that the fix I'm preparing will not only fix this problem, but more generally fix a bunch of "Tails installed with Tails Installer does not start in legacy BIOS mode on $computer XYZ" (we have quite a few of these known issues documented), which could actually be instances of this exact problem. And my main fear is that this fix breaks stuff for systems that previously booted fine…

#7 Updated by intrigeri 3 months ago

Asked the 2 OPs to test the fix and on top of that, sent a broader call for testing to check whether this fix breaks Tails startup on other computers.

#8 Updated by intrigeri 3 months ago

  • % Done changed from 10 to 20

(Partial) test suite passed on our shared Jenkins and on my local one. USB sticks installed from a USB image built from this branch, onto a 32GB USB stick, boot fine twice on the two spare laptops I have around.

I'll wait until some point tomorrow for reports from my call for testing and then I'll decide if this is worth the risk: if I have a confirmation that this does fix the problem for the OPs, I might be tempted to say it's worth it, even if I get few confirmations that this does not break anything (on top of my own testing). We'll see.

#9 Updated by anonym 3 months ago

I have tested an image built from this branch using four different USB sticks on four laptops (all fairly different from each other except they all use Intel CPUs), and saw no regressions.

#10 Updated by intrigeri 3 months ago

  • Assignee changed from intrigeri to anonym
  • QA Check set to Ready for QA

The 2 OPs did not reply yet, so I can't say for sure that this fixes the problem they reported; I bet it does. Very few people tested the new image but they all confirmed it does not break stuff for them. So I'm really unsure what's best: we can merge this, and risk breaking systems that booted just fine during the USB image call for testing and/or on 3.12~rc1; or we can postpone, and risk discovering that the problem reported by these 2 people actually affects way more users (whose usual installation method is not supported/documented anymore).

I'm leaning towards merging, so here is a PR. But I would understand if you decide to take the other risk over this one.

#11 Updated by segfault 3 months ago

Hi,

I'm unsure about merging this. IIUC, the man page of sgdisk says that --recompute-chs sets a CHS value that violates GPT specification. This sounds like it could break booting on some systems. Could we maybe do another call for testing and then merge this in 3.13? Only two of many testers reported this problem, so I'm quite confident not too many users will be affected.

@intrigeri: I wrote you an email explaining why I went AWOL. Sorry again and thanks so much for stepping in!

#12 Updated by intrigeri 3 months ago

I'm unsure about merging this. IIUC, the man page of sgdisk says that --recompute-chs sets a CHS value that violates GPT specification. This sounds like it could break booting on some systems. Could we maybe do another call for testing and then merge this in 3.13? Only two of many testers reported this problem, so I'm quite confident not too many users will be affected.

Wrt. "Only two of many testers reported this problem", it could be because most people who answered our USB image call for testing started it only once, which would not expose the bug. And then only a few people installed 3.12~rc1 from scratch to use it in production (and thus more than once): I expect most people who are ready to use a RC in production upgraded an existing stick, which again would not expose the bug. But yeah, I hear your concerns.

My best bet/guess at this point, given #16389#note-4, is that without this branch merged, the users who will face a regression are those who satisfy these two conditions:

  • Their hardware is affected by the problem. Risk: probably low, because I currently think that a Tails installed by Tails Installer would have the exact same problem, so if it was this widespread, I hope we would have noticed earlier.
  • They were previously using an installation method that is not affected, i.e. users who've been creating an "intermediate" Tails (from macOS, Windows, all Linux except Debian) and run it forever, without following the Installation Assistant last steps that instruct to create a "final" Tails with Tails Installer. I don't know how widespread such practice is.

So yeah, it feels relatively safe to skip merging this.

Either way, we should be ready to deal with the fallout if things go wrong, that is if we don't merge:

  • Explain to help desk & technical writers what are the symptoms they should notice, and what they should tell affected users, e.g.:
    • How to manually fix their broken 3.12 USB stick.
    • How to try the proposed branch and check if it fixes the bug for them.
  • If the problem affects a significant number of users:
    • Update the call for testing I've sent and give it a higher profile (blog post, Twitter?).
    • Be ready to release an emergency Tails 3.12.1 a couple weeks max after 3.12 (i.e. ASAP after collecting data to gain confidence the candidate fix does not break more than it fixes).

And if we do merge, the list of work items is similar but not quite the same.

I'm not able to lead this work (I'll be AFK for a week after 3.12). It should probably be a USB image team effort.

#13 Updated by intrigeri 3 months ago

  • Assignee changed from anonym to segfault
  • QA Check changed from Ready for QA to Dev Needed

Please provide the bits our tech writers (for release notes) and help desk will need. anonym suggests the easiest way for affected users is to reinstall, boot once, set an admin password, and then run the magic command (assuming that works from inside a running Tails, this needs to be tested).

#14 Updated by segfault 3 months ago

intrigeri wrote:

Please provide the bits our tech writers (for release notes) and help desk will need. anonym suggests the easiest way for affected users is to reinstall, boot once, set an admin password, and then run the magic command (assuming that works from inside a running Tails, this needs to be tested).

OK, I will try to do that tonight or tomorrow

#15 Updated by u 3 months ago

Hi segfault: as the release is supposed to be today, it would be super cool if you could get to that as early as possible :)

#16 Updated by u 3 months ago

Made release writers and help desk aware of the issue by email.

#20 Updated by u 3 months ago

anonym suggests the easiest way for affected users is to reinstall, boot once, set an admin password, and then run the magic command (assuming that works from inside a running Tails, this needs to be tested).

magic command is `sudo sgdisk --recompute-chs /dev/sda`

#21 Updated by intrigeri 3 months ago

Better use `sudo sgdisk --recompute-chs /dev/bilibop`: that avoids the need to find out the name of the boot device, since /dev/bilibop should always be a symlink to it.

#22 Updated by anonym 3 months ago

  • Target version changed from Tails_3.12 to Tails_3.13

#23 Updated by intrigeri 3 months ago

One of the OPs (Phredo) reported that `sgdisk --recompute-chs /dev/bilibop` fixed the problem for them.

#24 Updated by mercedes508 2 months ago

For the record, since 3.12, we haven't received any report about this bug.

#25 Updated by u about 2 months ago

@segfault I think that fixing this bug once and for all is possibly part of the "post-release bugfixing" budget. That we did not receive bug reports since 3.12 might be due to the fact that lots of users seem to struggle with our new setup, as reported by emmapeel/frontdesk on tails-dev recently. As there is already a branch, and users report that it works for them, we might want to merge this branch. What do you think?

#26 Updated by intrigeri about 2 months ago

As there is already a branch, and users report that it works for them, we might want to merge this branch.

Devil's advocate speaking: we need to balance the benefits of fixing this bug vs. the risk of regressions; wrt. the latter, we might not have enough reports that this branch won't break stuff for users for whom our current code works just fine. I would suggest pinging the relevant call for testing thread and maybe even using Twitter to get enough test results.

#27 Updated by u about 2 months ago

  • Parent task changed from #15992 to #15292

#28 Updated by u about 2 months ago

  • Estimated time set to 16.00 h

#29 Updated by u about 2 months ago

  • Related to Bug #15987: Check the system partition on every boot and grow it if needed added

#30 Updated by u about 2 months ago

intrigeri wrote:

As there is already a branch, and users report that it works for them, we might want to merge this branch.

Devil's advocate speaking: we need to balance the benefits of fixing this bug vs. the risk of regressions; wrt. the latter, we might not have enough reports that this branch won't break stuff for users for whom our current code works just fine. I would suggest pinging the relevant call for testing thread and maybe even using Twitter to get enough test results.

Ack.

#32 Updated by intrigeri about 2 months ago

I posted this on Twitter: https://twitter.com/Tails_live/status/1103295258576781312

Nice :)

I wonder if there's a misunderstanding though: you're gathering data about the part I find the least concerning (whether the workaround fixes the bug for affected users), not about the aspect where I mentioned earlier here that we lack info (suggesting Twitter to gather it). Anyway, any additional info is good to have :)

#33 Updated by u about 2 months ago

intrigeri wrote:

I posted this on Twitter: https://twitter.com/Tails_live/status/1103295258576781312

Nice :)

I wonder if there's a misunderstanding though: you're gathering data about the part I find the least concerning (whether the workaround fixes the bug for affected users), not about the aspect where I mentioned earlier here that we lack info (suggesting Twitter to gather it). Anyway, any additional info is good to have :)

Ah then I might have misunderstood the issue. Please reformulate it: apparently I was unable to guess what you are talking about.

#34 Updated by intrigeri about 2 months ago

Let's try again. We need to balance:

  • benefits of merging the proposed branch to fix this bug: it seems to me that (almost?) nobody affected by the bug complained that the workaround does not work for them (and you've asked for more confirmations on Twitter). Given the branch basically applies that workaround automatically, I'm quite confident it will fix the bug as well. If we want to increase our confidence even further, then we need to ask affected folks to try a nightly image built from the proposed branch.
  • the risk of regressions i.e. if we merge the proposed branch, will this break boot for users who are not affected by the bug this ticket is about? AFAIK we have extremely little data here, which is why we did not merge the proposed branch yet. The relevant call for testing might need updating but it is still current wrt. what info we lack.

#35 Updated by u about 2 months ago

intrigeri wrote:

Let's try again. We need to balance:

  • benefits of merging the proposed branch to fix this bug: it seems to me that (almost?) nobody affected by the bug complained that the workaround does not work for them (and you've asked for more confirmations on Twitter). Given the branch basically applies that workaround automatically, I'm quite confident it will fix the bug as well. If we want to increase our confidence even further, then we need to ask affected folks to try a nightly image built from the proposed branch.
  • the risk of regressions i.e. if we merge the proposed branch, will this break boot for users who are not affected by the bug this ticket is about? AFAIK we have extremely little data here, which is why we did not merge the proposed branch yet. The relevant call for testing might need updating but it is still current wrt. what info we lack.

Great.

Next time please add the URL to whatever you're talking about. We have issued many different calls for testing on the list in the last 2 months afaik and I've been spending quite some time trying to find out what you were talking about - I don't have the entire problem space and emails sent out by everyone else in mind all the time.

#36 Updated by u about 2 months ago

I updated the email text and sent it again to the tester list, because it possibly got lost there over the last 5 weeks, and retweeted a new call for testing this branch.

#37 Updated by intrigeri about 2 months ago

Next time please add the URL to whatever you're talking about.

Right, point gladly taken.

Post-mortem: I've sent only one call for testing about this topic, that I've mentioned on this ticket (#16389#note-7), but never shared the URL, so any further reference to it was indeed ambiguous, and it took me some minutes to find it myself when you asked me to clarify. Had I provided the URL back in January, it would have been cheap for me to reference it again 2 days ago, and then hopefully it would have cleared the misunderstanding about "we might not have enough reports that this branch won't break stuff for users for whom our current code works just fine".

I've been spending quite some time trying to find out what you were talking about - I don't have the entire problem space and emails sent out by everyone else in mind all the time.

Absolutely.

FTR, taking a step back, #16389#note-26 was primarily meant for segfault. You had just pinged him, presumably with your team manager hat on, clarified that this was part of his post-release bugfixing work, and asked for his input. So:

  • I expected segfault to do the exact same (painful) work I had to do to find the link to the aforementioned call for testing. I did not do this work myself because I've already done much more than I had committed to on this ticket (dealing with the problem at the last minute before the 3.12 release, while segfault was unexpectedly AFK). This is obviously not very nice of me and I'm not exactly proud of it. I see this as a direct consequence of the (poor) way we've been dealing with assignees & budget for sponsor deliverables. Thankfully the Leipzig Plan will fix that and make it easier to help each other :)
  • I didn't have you in mind here, in terms of target audience, as the person who'll do the next steps of the work. I had you in mind as my team manager who had just pinged my team-mate, suggesting a course of action based on a reasoning that, I thought, was missing a critical piece of information. I did not expect you to have all the info in mind, nor to guess what I meant, nor to do the research to find that call for testing, nor to act on my comment yourself. Had I expected you would take over the next steps of this work, I hope I would have either a) phrased things differently, taking into account the specifics of your situation on this ticket (although I realize I might very well have failed to do that); or b) realized that it would have been easier to do the work myself than to gather all the info you needed to do it.

This being said, as you know I'm not a fan of a super-rigid tasks / individual assignee structure, and I'm super happy we're working on this together and you've switched hats! In the future, to make the magic hat swap trick smoother, I would find it helpful if you let me know when you're switching hats from "team manager pinging workers" to "I'll tweet something in 48h and I need all the info now to do it right", so I understand better who my target audience is and I can try to adjust my communication style accordingly :)

And finally, I know that you've been handling quite some of the user feedback on tails-testers yourself. Thanks! This should perhaps have given me a hint about the ongoing magic hat swap trick.

#38 Updated by intrigeri about 2 months ago

I updated the email text and sent it again to the tester list, because it possibly got lost there over the last 5 weeks, and retweeted a new call for testing this branch.

Great! For future reference, the tweet is https://twitter.com/Tails_live/status/1103326873789046786 and the updated call for testing is https://lists.autistici.org/message/20190306.160900.47a129bc.en.html (been burnt once, won't do it again, at least not right away).

segfault, will you be in a position to process the user feedback from this new call for testing? (Note that some folks have started replying on Twitter, gah; presumably because the list archives don't link back to the place where one shall give feedback.) If you can't, then I offer to do it; I just need to know I'm needed here.

#39 Updated by segfault about 1 month ago

segfault, will you be in a position to process the user feedback from this new call for testing?

Yes, at least every few days. But I don't see any feedback yet, except for one user who posted a screenshot on twitter. The email thread doesn't have any replies yet.

#40 Updated by u about 1 month ago

We received one report (in private) that the branch fixes the problem.
As a follow up on http://lists.autistici.org/message/20190302.084144.948cedd2.en.html
where another problem is reported, i.e. when booting with two Tails keys, both seem to become unbootable. Not sure what to make of it.


When I boot only 3.11, it works fine.

I've tried the workaround: https://tails.boum.org/news/version_3.12/index.en.html#index2h1
root # sgdisk --recompute-chs /dev/bilibop
It doesn't work either (on a stand-alone boot of 3.12.1).
The symptom remains the same.

I've tried the experimental image: tails-amd64-bugfix_16389-recompute-chs-3.13-20190308T1708Z-af998d063a+stable@6bc35d7ef9.iso
It works on the second boot, on a stand-alone boot.

#41 Updated by u about 1 month ago

And one more report from the same person:


Reporting an additional symptom: the outcome differs between ".img" and ".iso".

In the meantime, I've tried the experimental image with the ".img" extension to create a 3.12.1 bootable USB key.
-> tails-amd64-bugfix_16389-recompute-chs-3.13-20190308T1708Z-af998d063a+stable@6bc35d7ef9.img

It won't boot the second time.

It might be interesting to find out why the ".iso" version works and the ".img" version doesn't.
It seems to be related, because since 3.12.x you publish the stable bootable image as ".img".

#42 Updated by u about 1 month ago

We still have no data to evaluate this fix.
I would like to close this ticket in order to have closure on the USB image project.
So today I've asked help desk by email to report every user that faces the issue and to point them to https://lists.autistici.org/message/20190306.160900.47a129bc.en.html after the release of 3.13.
I suggest closing this ticket if there are no more reports or actionable feedback when we release 3.14.

#43 Updated by u about 1 month ago

  • Parent task deleted (#15292)

#44 Updated by u about 1 month ago

  • Target version changed from Tails_3.13 to Tails_3.14

#45 Updated by u about 1 month ago

  • Parent task set to #15292

#47 Updated by mercedes508 20 days ago

22099509.4091554104463649.JavaMail.defaultUser

#49 Updated by segfault 19 days ago

u wrote:

More user reports:
- http://lists.autistici.org/message/20190327.004600.b5ca1460.en.html

Not sure that's the second boot issue, because they say that "some first boots failed as well". Anyway, let's see if we get more information from this user (they didn't reply yet to your request to test our fix).

- http://lists.autistici.org/message/20190326.173127.23351d6b.en.html

We already clarified this via email, but for the record: This does not seem to be the second boot issue but a hardware specific issue.

#50 Updated by segfault 19 days ago

mercedes508 wrote:

22099509.4091554104463649.JavaMail.defaultUser

I guess you wanted to post a link to this new report about this issue: http://lists.autistici.org/message/20190401.074103.bc76dd9e.en.html

#51 Updated by u 18 days ago

  • Parent task deleted (#15292)

unparenting.

#52 Updated by segfault 18 days ago

segfault wrote:

mercedes508 wrote:

22099509.4091554104463649.JavaMail.defaultUser

I guess you wanted to post a link to this new report about this issue: http://lists.autistici.org/message/20190401.074103.bc76dd9e.en.html

This also doesn't seem to be the second boot issue, see:
https://lists.autistici.org/message/20190402.193400.1e6b73f7.en.html

#53 Updated by segfault 18 days ago

  • Description updated (diff)

It's getting hard to keep track of the different reports on this ticket, so I added them to the description.

#54 Updated by goupille 5 days ago

Bug report: 55355b65c069131f6802e26ed17a2d51

#55 Updated by segfault 1 day ago

goupille wrote:

Bug report: 55355b65c069131f6802e26ed17a2d51

@goupille: I can't find this bug report, can you forward it to me please?

#56 Updated by segfault 1 day ago

  • Description updated (diff)

In https://lists.autistici.org/message/20190419.203140.cd5d1b8a.en.html a user reports that running `sgdisk --recompute-chs /dev/bilibop` did fix the issue for them.
