Project

General

Profile

Bug #17359

lint_po: don't use pre-existing rply cache file

Added by CyrilBrulebois 4 months ago. Updated 13 days ago.

Status:
Resolved
Priority:
Normal
Assignee:
Category:
-
Target version:
Start date:
Due date:
% Done:

100%

Feature Branch:
bugfix/17359-rply-cache
Type of work:
Code
Blueprint:
Starter:
Affected tool:

Description

Spotted by the test suite for the 4.1.1 release, this failing check:

14:32:44 cucumber features/po.feature:7 # Scenario: check all PO files

(Unless I took a wrong turn) I didn't follow any part of the release process that updated PO files except for the release announcement (which happens after the image build anyway, in 6139fe3cb9de1ba3b465472194d7167c9a60e8ad) so I'm not sure I'm responsible for this; I also don't think build check should prevent us from releasing 4.1.1, that's why I'm opening this placeholder ticket, until a developer or a translator figures out where the problem lies.

The trace is overly long so I won't be pasting it here, but here's an example for a random file. This seems to happen on various pages, various languages, nothing absolutely obvious (to me, at this hour, at least):

      ERROR: wiki/src/home.fr.po:

          i18nspector exited with 1 - stderr:

          Traceback (most recent call last):
            File "/usr/bin/i18nspector", line 85, in <module>
              cli.main()
            File "/usr/share/i18nspector/lib/cli.py", line 225, in main
              check_all(files, options=options)
            File "/usr/share/i18nspector/lib/cli.py", line 137, in check_all
              check_file(path, options=options)
            File "/usr/share/i18nspector/lib/cli.py", line 120, in check_file
              return check_regular_file(path, options=options)
            File "/usr/share/i18nspector/lib/cli.py", line 71, in check_regular_file
              checker_instance.check()
            File "/usr/share/i18nspector/lib/check/__init__.py", line 190, in check
              self.check_plurals(ctx)
            File "/usr/share/i18nspector/lib/check/__init__.py", line 407, in check_plurals
              (n, expr, ljunk, rjunk) = gettext.parse_plural_forms(plural_forms, strict=False)
            File "/usr/share/i18nspector/lib/gettext.py", line 104, in parse_plural_forms
              expr = parse_plural_expression(match.group(2))
            File "/usr/share/i18nspector/lib/gettext.py", line 89, in parse_plural_expression
              parser = intexpr.Parser()
            File "/usr/share/i18nspector/lib/intexpr.py", line 148, in __init__
              self._parser = create_parser(self._lexer)
            File "/usr/share/i18nspector/lib/intexpr.py", line 142, in create_parser
              return pg.build()
            File "/usr/lib/python3/dist-packages/rply/parsergenerator.py", line 189, in build
              data = json.load(f)
            File "/usr/lib/python3.5/json/__init__.py", line 268, in load
              parse_constant=parse_constant, object_pairs_hook=object_pairs_hook, **kw)
            File "/usr/lib/python3.5/json/__init__.py", line 319, in loads
              return _default_decoder.decode(s)
            File "/usr/lib/python3.5/json/decoder.py", line 342, in decode
              raise JSONDecodeError("Extra data", s, end)
          json.decoder.JSONDecodeError: Extra data: line 1 column 3387 (char 3386)

0001-attempt1.patch View (2.16 KB) CyrilBrulebois, 03/18/2020 05:28 PM

0001-attempt-2.patch View (1.78 KB) CyrilBrulebois, 03/18/2020 05:50 PM


Related issues

Blocks Tails - Feature #16209: Core work: Foundations Team Confirmed

Associated revisions

Revision 0e5dbe7b (diff)
Added by Sandro Knauß 4 months ago

update jenkins-tools to ignore pre-existing rply cache file (refs: #17359)

Revision eaf618ae (diff)
Added by Sandro Knauß 4 months ago

update jenkins-tools to ignore pre-existing rply cache file (refs: #17359)

Revision 7b225237
Added by intrigeri 4 months ago

Merge branch 'bugfix/17359-rply-cache' into stable (Closes: #17359)

Revision 5b09c855
Added by intrigeri 17 days ago

Merge remote-tracking branch 'origin/bugfix/17359-rply-cache' into stable (Closes: #17359)

History

#1 Updated by intrigeri 4 months ago

Interesting!

I can't reproduce this locally on my sid system (tried on current master and stable branches).
I've not seen it happen on Jenkins (Stretch) either:

(both failed recently due to a buggy translation in wiki/src/about/contact.pt.po, that I just fixed)

So I'm wondering if there's anything special on your system, that can possibly make i18nspector or the Python JSON library behave differently.

Did you see this happen only in features/po.feature or can you also reproduce this by running ./submodules/jenkins-tools/slaves/check_po (with the jenkins-tools submodule at the commit tracked by current master or stable) directly?

What does i18nspector --version say?

Random guess: maybe the current locale affects how the underlying code reads PO files. Might be worth retrying with LC_ALL=C or en_US.UTF-8?

Other than that, I must admit I'm quite clueless :/

#2 Updated by CyrilBrulebois 4 months ago

Going back to the 4.1.1 tag on my system:

kibi@armor:~/work/clients/tails/release/release-checkout$ ./submodules/jenkins-tools/slaves/check_po
ERROR: wiki/src/about/contact.pt.po:
    i18nspector is not happy:
        E: inconsistent-trailing-newlines msgid '[[!toc levels=2]]\n'
checked files are not clean.

with:

kibi@armor:~/work/clients/tails/release/release-checkout$ dpkg -l i18nspector
Desired=Unknown/Install/Remove/Purge/Hold
| Status=Not/Inst/Conf-files/Unpacked/halF-conf/Half-inst/trig-aWait/Trig-pend
|/ Err?=(none)/Reinst-required (Status,Err: uppercase=bad)
||/ Name                   Version          Architecture     Description
+++-======================-================-================-==================================================
ii  i18nspector            0.25.2-1         all              checking tool for gettext POT, PO and MO files

kibi@armor:~/work/clients/tails/release/release-checkout$ i18nspector --version
i18nspector 0.25.2
+ Python 3.5.3
+ polib 1.0.8
+ rply 0.7.4

I'm quite surprised to only see that during a test suite run, which I've done locally without checking the results in Jenkins because I wasn't trusted its results anyway… that's why I've been re-running all non-fragile tests locally before taking any actions.

My system is configured with LANG=en_GB.UTF-8 (globally, so the account has that too), and I've been running the test suite this way:

sudo TMPDIR=~/TT ./run_test_suite --view --capture --iso ../isos/tails-amd64-4.1.1/tails-amd64-4.1.1.iso -- features/po.feature

I'm quite reassured that the PO files are not as broken as the local test suite run led me to fear in the beginning (even if I couldn't see why that would be the case).

I've rerun the test suite from the same 4.1.1 tag, and it fails in the same way. So yes it's reproducible at least here, and I'll investigate a little bit more.

#3 Updated by CyrilBrulebois 4 months ago

And strace saved the day yet another time:

kibi@armor:~/work/clients/tails/release/release-checkout$ sudo ls -l /root/.cache/rply/i18nspector-intexpr-1-931a22c3200c0cb35da6baca76b3973f7517595a.json
-rw-r--r-- 1 root root 3393 Sep  4 19:42 /root/.cache/rply/i18nspector-intexpr-1-931a22c3200c0cb35da6baca76b3973f7517595a.json

so for some reasons, my system had a cache file dated from the previous release, with broken contents, that is interfering with proper operation?

How to fix? Clean-up the whole cache (might be harsh/interfere with other RPLY users)? See if the cache can be located elsewhere?

Quick looks suggests i18nspector sets its own cache_id (as opposed to letting RPLY set one up randomly if unset), and that RPLY uses this:

        cache_dir = AppDirs("rply").user_cache_dir
        cache_file = os.path.join(
            cache_dir,
            "%s-%s-%s.json" % (
                self.cache_id, self.VERSION, self.compute_grammar_hash(g)
            )
        )

Looking at the appdirs Python module, it seemed that adding something like env={'XDG_CACHE_HOME': '/tmp/shouldberandom'} to the subprocess.run() call might help, and indeed, I'm only seeing the inconsistent-trailing-newlines error in wiki/src/about/contact.pt.po now.

I'm therefore tempted to suggest creating a temporary directory in slaves/lint_po (to which check_po is a symlink), using it for each i18nspector run, and removing it afterwards. This would avoid having to deal with whatever was possibly cached/left over in the calling user's cache directory.

#4 Updated by intrigeri 4 months ago

> kibi@armor:~/work/clients/tails/release/release-checkout$ sudo ls -l /root/.cache/rply/i18nspector-intexpr-1-931a22c3200c0cb35da6baca76b3973f7517595a.json
> -rw-r--r-- 1 root root 3393 Sep  4 19:42 /root/.cache/rply/i18nspector-intexpr-1-931a22c3200c0cb35da6baca76b3973f7517595a.json
> 

so for some reasons, my system had a cache file dated from the previous release, with broken contents, that is interfering with proper operation?

FTR, I have that file too. Mine is 3386 bytes large and its mtime is 2015-11-20. I think it's not a coincidence that the json.decoder.JSONDecodeError you saw triggers in column 3387 (char 3386).

I'm therefore tempted to suggest creating a temporary directory in slaves/lint_po (to which check_po is a symlink), using it for each i18nspector run, and removing it afterwards. This would avoid having to deal with whatever was cached/left over in the calling user's cache directory.

Sounds good to me.

#5 Updated by intrigeri 4 months ago

  • Subject changed from ./submodules/jenkins-tools/slaves/check_po failures in test suite for 4.1.1 to lint_po: don't use pre-existing rply cache file
  • Status changed from New to Confirmed
  • Type of work changed from Research to Code

hefee, fixing this in @lint_po should be rather straightforward:

  • before running i18nspector: create a temporary directory
  • while running i18nspector: point the XDG_CACHE_HOME environment variable to that temporary directory
  • clean up the temporary directory before exiting

I think you would be more efficient than me to do this. Interested?

(No big hurry as AFAICT, we've seen this problem only once in many years.)

#6 Updated by intrigeri 4 months ago

#7 Updated by hefee 4 months ago

  • Assignee set to hefee

#8 Updated by Anonymous 4 months ago

  • Status changed from Confirmed to In Progress

#9 Updated by hefee 4 months ago

  • Status changed from In Progress to Needs Validation
  • Assignee deleted (hefee)
  • Feature Branch set to bugfix/17359-rply-cache

#10 Updated by intrigeri 4 months ago

  • Assignee set to intrigeri

#11 Updated by intrigeri 4 months ago

  • Target version set to Tails_4.2

#12 Updated by Anonymous 4 months ago

  • Status changed from Needs Validation to In Progress

#13 Updated by intrigeri 4 months ago

  • Status changed from In Progress to Needs Validation

#14 Updated by intrigeri 4 months ago

  • Status changed from Needs Validation to In Progress
  • Assignee changed from intrigeri to hefee

Hi @hefee!

So, this mostly works for me: I've reproduced the behavior kibi saw with a corrupted rply cache file and this branch works around this problem as intended.

But it introduces a race condition that leads to a new failure mode: I've seen one instance of i18nspector fail with FileExistsError when rply/parsergenerator.py creates $XDG_CACHE_HOME/rply. I can't reproduce this consistently but I'm not surprised that this race condition exists: we're spawning multiple instances of i18nspector in parallel and each of them will try to create the exact same directory, and then the exact same file in that directory, which is bound to occasionally fail.

I'm not sure what's the best way to fix this. A dirty workaround would be to first run one i18nspector, wait for its completion, then run all the others in parallel. It's a bit ugly but I could live with it. I'm confident you'll have a better idea though :)

#15 Updated by hefee 4 months ago

  • Status changed from In Progress to Needs Validation
  • Assignee changed from hefee to intrigeri

Fixed the race condition and do some more cleanup.

I still have uploaded to git.tails.boum.org:jenkins-tools, as this was the only source I found I can write to:

$ git push gitolite@git-tails.immerda.ch:jenkins-tools
gitolite@git-tails.immerda.ch: Permission denied (publickey).
fatal: Could not read from remote repository.

Please make sure you have the correct access rights
and the repository exists.

nor

$ git push gitolite@d53ykjpeekuikgoq.onion:jenkins-tools.git
perl: warning: Setting locale failed.
perl: warning: Please check that your locale settings:
        LANGUAGE = (unset),
        LC_ALL = (unset),
        LANG = "de_DE.UTF-8" 
    are supported and installed on your system.
perl: warning: Falling back to the standard locale ("C").
FATAL: W any jenkins-tools hefee DENIED by fallthru
(or you mis-spelled the reponame)
fatal: Could not read from remote repository.

Please make sure you have the correct access rights
and the repository exists.

#16 Updated by intrigeri 4 months ago

  • Status changed from Needs Validation to Resolved
  • % Done changed from 0 to 100

#17 Updated by intrigeri 4 months ago

Hi @hefee,

Fixed the race condition and do some more cleanup.

LGTM, thanks!

I still have uploaded to git.tails.boum.org:jenkins-tools, as this was the only source I found I can write to:

Yeah, I think that's actually the only repo we have for jenkins-tools.

#18 Updated by CyrilBrulebois 3 months ago

  • Status changed from Resolved to Confirmed
  • Target version changed from Tails_4.2 to Tails_4.3

Reopening, targetting 4.3.

The test suite failed again on Jenkins, while testing 4.2 → https://jenkins.tails.boum.org/view/RM/job/test_Tails_ISO_stable/2147/consoleText

#19 Updated by intrigeri 3 months ago

  • Assignee changed from intrigeri to hefee
  • Feature Branch deleted (bugfix/17359-rply-cache)

This happened again on Jenkins during the 4.2.2 release process. The error message is json.decoder.JSONDecodeError: Extra data: line 1 column 3391 (char 3390). This does not match the size of /root/.cache/rply/*.json on the isotester where the test suite was running, which seems to confirm that it's ignored and our custom XDG_CACHE_HOME is honored.

Interestingly, there's no trace of this error in the set of logs we still have for the check_PO_master Jenkins job.

In total, I've found 5 other test_Tails_* job runs that had this problem since we deployed the XDG_CACHE_HOME workaround; the exact error varies:

  • json.decoder.JSONDecodeError: Extra data: line 1 column 3384 (char 3383)
  • json.decoder.JSONDecodeError: Extra data: line 1 column 3383 (char 3382)
  • json.decoder.JSONDecodeError: Extra data: line 1 column 3387 (char 3386)

hefee, could you please take a look?

#20 Updated by intrigeri 2 months ago

hefee, IIRC you suggested a change that would help gather debugging info, but I can't find it anymore. Any hints?

#21 Updated by anonym about 2 months ago

  • Target version changed from Tails_4.3 to Tails_4.4

#22 Updated by CyrilBrulebois 24 days ago

  • Target version changed from Tails_4.4 to Tails_4.5

#23 Updated by CyrilBrulebois 17 days ago

  • Assignee changed from hefee to CyrilBrulebois

I'm “stealing” this as I'd like to get that resolved in time for the next release: we're hitting that from time to time in Jenkins and it sometimes hinders the release process.

Early findings:

  • We do have an XDG_CACHE_HOME set up with a different path each time check_po runs, which means we don't reuse possibly old data from one check_po run to another.
  • The cache file is exactly the same, independently of the file being processed at a given time.
  • The file is sometimes empty when it gets loaded, leading to an obvious failure:
    json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
  • Since the file is not written atomatically, I suspect this explains why we sometimes run into issues that aren't happening at the very beginning of the cache file:
    json.decoder.JSONDecodeError: Extra data: line 1 column 3382 (char 3381)

See this excerpt from /usr/lib/python3/dist-packages/rply/parsergenerator.py's build function (empathic comments are mine):

        cache_dir = AppDirs("rply").user_cache_dir
        cache_file = os.path.join(
            cache_dir,
            "%s-%s-%s.json" % (
                self.cache_id, self.VERSION, self.compute_grammar_hash(g)
            )
        )

        table = None
        if os.path.exists(cache_file):                     # <=== can exist and be empty
            with open(cache_file) as f:
                data = json.load(f)
            if self.data_is_valid(g, data):
                table = LRTable.from_cache(g, data)
        if table is None:
            table = LRTable.from_grammar(g)
            if not os.path.exists(cache_dir):
                os.makedirs(cache_dir, mode=0o0700)

            with open(cache_file, "w") as f:               # <=== not atomic
                json.dump(self.serialize_table(table), f)

#24 Updated by CyrilBrulebois 17 days ago

OK, this is a highly entertaining bug.

Different PO files trigger different code paths in i18nspector. This can be checked using strace and/or by adding some instrumenting in rply/parsergenerator.py:

$ rm -rf ~/.cache/rply && strace -v -o /tmp/fr -s 400 -- i18nspector --language fr wiki/src/getting_started.fr.po && find ~/.cache/rply
I: wiki/src/getting_started.fr.po: boilerplate-in-initial-comments 'Copyright (C) YEAR Free Software Foundation, Inc.'
I: wiki/src/getting_started.fr.po: boilerplate-in-initial-comments 'This file is distributed under the same license as the PACKAGE package.'
I: wiki/src/getting_started.fr.po: boilerplate-in-initial-comments 'FIRST AUTHOR <EMAIL@ADDRESS>, YEAR.'
P: wiki/src/getting_started.fr.po: no-version-in-project-id-version Tails
W: wiki/src/getting_started.fr.po: no-report-msgid-bugs-to-header-field
/usr/bin/find: ‘/home/kibi/.cache/rply’: No such file or directory

$ rm -rf ~/.cache/rply && strace -v -o /tmp/pt -s 400 -- i18nspector --language fr wiki/src/home.pt.po && find ~/.cache/rply
I: wiki/src/home.pt.po: boilerplate-in-initial-comments 'Copyright (C) YEAR Free Software Foundation, Inc.'
I: wiki/src/home.pt.po: boilerplate-in-initial-comments 'This file is distributed under the same license as the PACKAGE package.'
I: wiki/src/home.pt.po: boilerplate-in-initial-comments 'FIRST AUTHOR <EMAIL@ADDRESS>, YEAR.'
W: wiki/src/home.pt.po: language-disparity fr (command-line) != pt (Language header field)
building a parser
considering cache_file: /home/kibi/.cache/rply/i18nspector-intexpr-1-931a22c3200c0cb35da6baca76b3973f7517595a.json
W: wiki/src/home.pt.po: boilerplate-in-project-id-version 'PACKAGE VERSION'
W: wiki/src/home.pt.po: boilerplate-in-language-team 'LANGUAGE <LL@li.org>'
/home/kibi/.cache/rply
/home/kibi/.cache/rply/i18nspector-intexpr-1-931a22c3200c0cb35da6baca76b3973f7517595a.json

So the attempt at splitting the pool into a single “head” and a “long tail” (see below) was a nice idea, but that didn't ensure the “head” did warm the rply cache: it depends whether the first considered file needs that particular parser (and the relevant cache). On my system, the first file seems to always be the aforementioned (and aptly-named) wiki/src/getting_started.fr.po, which doesn't need that cache.

I think that pretty much explains why there's at best an empty rply directory after this file has been processed, and why errors vary between the “existing-but-empty JSON file” and the “being-written JSON file” error cases: the first elements of the “long tail” are racing.

What I call the “head” vs. “long tail” split is (empathic comments are mine):

            chain = [(_check_po_file(next(iterator)),),                # <=== 1 element, supposedly warming the cache
                     pool.imap_unordered(_check_po_file,iterator, 10)] # <=== all others

Instead of hardcoding a “known good” file as the first element of the list, I'm tempted to proceed as follows: consider elements one at a time until a file appears in the rply cache. At which time, consider the “slightly longer head” finished, and go for the “slightly less long tail” approach: multiprocess all others through pool.imap_unordered().

While processing elements in turn, if a hard limit of say 50 files is reached, break so that people don't waste too much time because everything is being processed sequentially: make it obvious that something is going wrong and that one needs to look at this test case (again).

I'll draft something up shortly.

#25 Updated by CyrilBrulebois 17 days ago

Patch attached, without a proper commit message, because I think my last comment on the ticket says it all… plus I'm not convinced this is what we should implement. That relies on an implementation detail that might end up being too fragile. :/

#26 Updated by CyrilBrulebois 17 days ago

#27 Updated by CyrilBrulebois 17 days ago

I suppose it might be better to just give each i18nspector child its own directory to play with. That would mean no chances to step on someone else's toes, and that would make the head/tail distinction moot.

Kind of “brute-force” but likely simpler code anyway.

As a first draft, I'm using the current PID to have distinct directories. We seem to have between 4k and 5k PO files meaning we might not have a chance to get full circle. For added safety, it might be best to remove the cache after each child has run, to make sure we don't leave something that could be the next “same PID” process unhappy, if that happens in the same check_po run.

Without the removal, I'm seeing a < 3 minutes runtime on an 8-core machine; given the cache is ~3k, that's a few MiB wasted for each run (removed because a tree below a TemporaryDirectory. I'm not sure I'm going to find a more convincing approach.

#28 Updated by CyrilBrulebois 17 days ago

  • Status changed from Confirmed to Needs Validation
  • Assignee changed from CyrilBrulebois to intrigeri
  • Feature Branch set to bugfix/17359-rply-cache

bugfix/17359-rply-cache branched from stable in tails.git, bumping the jenkins-tools submodule by 4 commits (same branch in the other git repository: bugfix/17359-rply-cache).

Hopefully they should be self-documenting, but I'm happy to take questions/remarks.

Assigning to @intrigeri who expressed interest.

[ FWIW, I've capped this at 4 hours, which seems a little high for “just a cache bug”, while still being lower than the >6 effective hours. But I hope we're no longer going to have to deal with that particular topic, wasting RM's time during release cycles with random Jenkins failures, but also at release time; “famous last words”, I know… ]

#29 Updated by intrigeri 17 days ago

Wow, amazing detective work!

Code LGTM. Performance on my system is in the same ballpark area as previously.

I will merge this once I've seen it pass the corresponding test on Jenkins (I'm not worried it won't pass, but better safe than sorry :)

In passing, I believe d7ca5acd76c18d2183efaded23174b8750f02843 was perhaps racy: once the with tempfile.TemporaryDirectory context manager exits, tmpdir is removed (asynchronously?), but we used it in poFile.i18nspector(my_env) outside of the context manager. Anyway, it does not matter as this problem disappeared in your follow-up commits :)

[ FWIW, I've capped this at 4 hours, which seems a little high for “just a cache bug”, while still being lower than the >6 effective hours. But I hope we're no longer going to have to deal with that particular topic, wasting RM's time during release cycles with random Jenkins failures, but also at release time; “famous last words”, I know… ]

I'd rather see you report the total effective hours: fixing such "holes in the roof" is IMO worth it.

#30 Updated by CyrilBrulebois 17 days ago

intrigeri wrote:

Wow, amazing detective work!

Code LGTM. Performance on my system is in the same ballpark area as previously.

I will merge this once I've seen it pass the corresponding test on Jenkins (I'm not worried it won't pass, but better safe than sorry :)

Thanks, and yes, good call. :)

In passing, I believe d7ca5acd76c18d2183efaded23174b8750f02843 was perhaps racy: once the with tempfile.TemporaryDirectory context manager exits, tmpdir is removed (asynchronously?), but we used it in poFile.i18nspector(my_env) outside of the context manager. Anyway, it does not matter as this problem disappeared in your follow-up commits :)

Possibly, yes. But tests I ran led me to end up with lots of temporary directories (several thousands), so I've committed what I tested… before cleaning everything up. :)

[ FWIW, I've capped this at 4 hours, which seems a little high for “just a cache bug”, while still being lower than the >6 effective hours. But I hope we're no longer going to have to deal with that particular topic, wasting RM's time during release cycles with random Jenkins failures, but also at release time; “famous last words”, I know… ]

I'd rather see you report the total effective hours: fixing such "holes in the roof" is IMO worth it.

Fine with me, ta.

#31 Updated by intrigeri 17 days ago

  • Status changed from Needs Validation to Resolved

#32 Updated by CyrilBrulebois 17 days ago

Oh, I understand what I was seeing… that's embarrassing.

Indeed, with the context manager, the temporary directory was created, its named stored, then removed. But its name was then re-used by the rply stack to (re)create the same directory again right after its removal. And since that wasn't handled by Python as a temporary directory, it would be kept around until the removal of the parent directory (which itself is a temporary directory for the whole runtime)…

So much for trying not to trigger reindentation of long blocks right before cleaning up… Woops!

#33 Updated by CyrilBrulebois 13 days ago

  • Target version changed from Tails_4.5 to Tails_4.4.1

Also available in: Atom PDF