Feature #6891

Monitor external broken links on our website

Added by intrigeri over 5 years ago. Updated 9 months ago.

Status:
In Progress
Priority:
Normal
Assignee:
-
Category:
Infrastructure
Target version:
-
Start date:
03/10/2014
Due date:
% Done:

0%

Feature Branch:
Type of work:
Website
Blueprint:
Starter:
Yes
Affected tool:

Description

It would be great if someone prepared whatever is needed (scripts, cronjob line, email output) to monitor outgoing broken links on the Tails website (e.g. links on our website pointing to a third-party resource that does not exist anymore), and regularly sent useful reports to some email address.

Ideally, it would be good to cache old results so that newly broken links and links that were already broken last time are reported separately (a bit like apticron does).

Once the basics are ready, we will want to turn the whole thing into a Puppet module and deploy it on our infrastructure; but as a first step, preparing things without Puppet would be enough, as long as there is some setup documentation.

It's important to avoid the Not Invented Here syndrome, as we don't want to maintain a big new chunk of software forever. Most likely, existing tools can be reused extensively. It might even be that Puppet modules to do the whole thing can be found.
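
For illustration, here is a minimal sketch of the kind of setup this describes, assuming linkchecker ends up being the checker; the schedule, the recipient address, and the exact options are made-up placeholders:

# hypothetical cronjob line: crawl the site every Monday night and mail the report
0 3 * * 1 linkchecker --check-extern --no-warnings https://tails.boum.org/ 2>&1 | mail -s "broken links on tails.boum.org" sysadmins@example.org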

:sajolida:

Associated revisions

Revision 1c8deda2
Added by intrigeri 9 months ago

Merge remote-tracking branch 'origin/web/6891-broken-links' (refs: #6891)

History

#1 Updated by geb over 5 years ago

Hi,

Awstats does it by default. It uses log file parsing and builds HTML reports that include a section on 404s, with URLs, access counts, and referrers.

Example:
http://noc.actux.eu.org/awstats/actux.eu.org/awstats.actux.eu.org.html
http://noc.actux.eu.org/awstats/actux.eu.org/awstats.actux.eu.org.errors404.html

Best,

#2 Updated by intrigeri over 5 years ago

geb wrote:

Awstats does it by default.

This is for incoming broken links (e.g. to pages of ours that don't exist anymore). What we need is to detect outgoing broken links (e.g. links on our website pointing to a third-party resource that does not exist anymore).

#3 Updated by intrigeri over 5 years ago

  • Description updated (diff)

Clarified the description to avoid similar confusion in the future.

#4 Updated by spriver over 4 years ago

intrigeri wrote:

geb wrote:

Awstats does it by default.

This is for incoming broken links (e.g. to pages of ours that don't exist anymore). What we need is to detect outgoing broken links (e.g. links on our website pointing to a third-party resource that does not exist anymore).

I experimented a bit with wget's --spider mode (with the recursive option activated); it detects errors like 404 or 301. Maybe this would be a start. A simple bash script would be able to do the job.
Shall I go ahead and create such a script, or should we use another approach?
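
For reference, a sketch of the kind of invocation meant here; the exact option set is a guess, and note that by default wget's recursion stays on the starting host, so external targets may not all be checked:

# crawl the site without downloading anything, logging everything to a file
wget --spider --recursive --no-verbose --output-file=wget.log https://tails.boum.org/
# wget flags dead targets as "broken link" in its log; -B1 shows the URL above each match
grep -B1 'broken link' wget.log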

#5 Updated by intrigeri over 4 years ago

I experimented a bit with wget's --spider mode (with the recursive option
activated); it detects errors like 404 or 301. Maybe this would be a start. A simple
bash script would be able to do the job.
Shall I go ahead and create such a script, or should we use another approach?

Thanks for working on this!

However, this does not address caching old results by itself. Aren't there more specialized tools (than wget) that satisfy these requirements without any need for us to write and maintain custom code?

#6 Updated by spriver over 4 years ago

intrigeri wrote:

I experimented a bit with wget's --spider mode (with the recursive option
activated); it detects errors like 404 or 301. Maybe this would be a start. A simple
bash script would be able to do the job.
Shall I go ahead and create such a script, or should we use another approach?

Thanks for working on this!

However, this does not address caching old results by itself. Aren't there more specialized tools (than wget) that satisfy these requirements without any need for us to write and maintain custom code?

How about: http://wummel.github.io/linkchecker/ ?
I am trying out some configuration options. What exactly do you mean by address caching? Can you explain it?

#7 Updated by sajolida over 4 years ago

  • Assignee set to spriver

I searched very quickly for similar tools in the Debian archive and
found three:

  • htcheck - Utility for checking web site for dead/external links
  • linkchecker - check websites and HTML documents for broken links
  • webcheck - website link and structure checker

Could you maybe have a look at at least those three and compare how
they would fare at solving our particular problem?

I'm assigning this ticket to you since you started working on it. Feel
free to deassign it if you give up on it at some point. That's no problem!

#8 Updated by spriver over 4 years ago

  • Assignee deleted (spriver)

I will check them all out (currently checking out linkchecker intensively). What type of errors do we want to gather? Just "404 not found"? Or also ones like "moved permanently"?

#9 Updated by spriver over 4 years ago

  • Assignee set to spriver

#10 Updated by intrigeri over 4 years ago

What exactly do you mean by address caching? Can you explain it?

See the ticket description, where I have explained it already :)

#11 Updated by sajolida over 4 years ago

I would be interested in both cases, because "moved permanently" pages
are a bit more likely to end up "not found" at some point.

#12 Updated by intrigeri over 4 years ago

Any news?

#13 Updated by spriver over 4 years ago

intrigeri wrote:

Any news?

I'm still on it, testing out all the tools. Caching is not really common...

#14 Updated by spriver over 4 years ago

Hi,
I tested out htcheck, linkchecker and webcheck. None of them provides caching. How extensive should the caching be? Maybe diffing the result files would be sufficient?

#15 Updated by BitingBird over 4 years ago

Well, if none of them does caching, maybe we should give up on that and verify manually whether the pages are really gone.

#16 Updated by intrigeri over 4 years ago

Maybe diffing the result files would be sufficient?

Yes, possibly. Let's try it that way and we'll see :)
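
For what it's worth, a minimal sketch of what such diffing could look like (the filenames, and sorting the lists first, are assumptions):

# current.txt holds this run's broken URLs, one per line;
# previous.sorted is the cache from last run (create it empty on the first run)
sort -u current.txt > current.sorted
comm -13 previous.sorted current.sorted # newly broken links
comm -12 previous.sorted current.sorted # links that were already broken last time
mv current.sorted previous.sorted # cache this run's results for the next run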

#17 Updated by spriver over 4 years ago

Yes, possibly. Let's try it that way and we'll see :)

I will now have a look at the tools' best output methods and at how to prevent duplicated links (in my testing there sometimes were some)

#18 Updated by elouann over 3 years ago

spriver, may I ask whether you have made some progress on this?

sajolida wrote:

I searched very quickly for similar tools in the Debian archive and
found three:

  • htcheck - Utility for checking web site for dead/external links
  • linkchecker - check websites and HTML documents for broken links
  • webcheck - website link and structure checker

webcheck has not been updated since 2010: http://arthurdejong.org/webcheck/

#19 Updated by spriver over 3 years ago

elouann wrote:

spriver, may I ask whether you have made some progress on this?

I can work on this again, but feel free to assign this ticket to yourself if you want!

#20 Updated by BitingBird about 3 years ago

It seems that LinkChecker was broken in Debian and has now been repaired (http://anarc.at/blog/2016-05-19-free-software-activities-may-2016/). Maybe worth a second look?

#21 Updated by sajolida over 2 years ago

  • Subject changed from Monitor broken links on our website to Monitor external broken links on our website

#22 Updated by u over 1 year ago

linkchecker is actively maintained in Debian indeed.

#23 Updated by u 11 months ago

I actually find it horrible to have to use an entire package to do that kind of thing :( So I looked around and found this: https://www.createdbypete.com/articles/simple-way-to-find-broken-links-with-wget/ It basically crawls a page with wget, can also find broken image links, and logs all the output. That's the downside: the output then needs to be processed by looking for 404s and 500s → but I guess it can't be too hard to turn this into a script.
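
A rough sketch of that approach, loosely following the linked article (the option set and the grep pattern are assumptions about wget's spider-mode log output):

# crawl one level deep, following links to external hosts too, without saving anything
wget --spider --recursive --level=1 --span-hosts --no-directories --no-verbose --output-file=run.log https://tails.boum.org/
# wget marks dead targets with "broken link!!!"; -B1 shows the URL above each match
grep -B1 'broken link!' run.log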

#24 Updated by intrigeri 11 months ago

I actually find it horrible to have to use an entire package to do that kind of thing

I'm curious why.

FWIW I'm more concerned about the NIH syndrome that leads to writing the good old "simple script" that remains simple for about 1 hour and then becomes an abomination once it's made suitable for the real world, than about using software that's already written specifically for this purpose.

https://www.createdbypete.com/articles/simple-way-to-find-broken-links-with-wget/ It basically crawls a page with wget, can also find broken image links, and logs all the output.

I'm not sure that this checks external links.

That's the downside: the output then needs to be processed by looking for 404s and 500s → but I guess it can't be too hard to turn this into a script.

Parsing non-machine-readable output makes red lights blink in my brain.

Anyway, this being said, </control freak mode>: I'll be happy with any solution chosen by whoever decides to implement this as long as they're ready to maintain it :)

#25 Updated by u 11 months ago

Ack! Maybe I was bitten by NIH :)

#26 Updated by sajolida 10 months ago

  • Assignee changed from spriver to sajolida
  • Priority changed from Low to Normal
  • Target version set to Tails_3.10.1
  • Type of work changed from Sysadmin to Website

I'm taking this one over after many years of inactivity.

Running linkchecker https://tails.boum.org/ seems to work. I started that on a server and will check the output tomorrow.

#27 Updated by sajolida 10 months ago

  • Blocks Feature #15411: Core work 2018Q2 → 2018Q3: Technical writing added

#28 Updated by sajolida 10 months ago

Better version:

linkchecker --file-output=csv/tails.csv --no-warnings --check-extern --no-follow-url="https://tails.boum.org/blueprint/.*" https://tails.boum.org/
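
If I read linkchecker's manual correctly, --check-extern makes it validate outgoing links as well, and --no-follow-url makes it check the matching URLs without recursing into them, so blueprint pages are still verified as link targets while the links they contain are skipped.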

#29 Updated by sajolida 10 months ago

  • Blocks deleted (Feature #15411: Core work 2018Q2 → 2018Q3: Technical writing)

#30 Updated by sajolida 10 months ago

  • Blocks Feature #15941: Core work 2018Q4 → 2019Q2: Technical writing added

#31 Updated by sajolida 10 months ago

  • Blocks deleted (Feature #15941: Core work 2018Q4 → 2019Q2: Technical writing)

#32 Updated by sajolida 10 months ago

  • Feature Branch set to web/6891-broken-links

I started fixing a bunch of broken links on web/6891-broken-links. We have a lot, and most of them affect not our documentation but /news, /contribute, and /blueprint. So I won't do that on our Technical Writing budget (that would be too much work), nor will I do it alone.

#33 Updated by intrigeri 9 months ago

sajolida wrote:

I started fixing a bunch of broken links on web/6891-broken-links.

Do you want someone to review & merge this branch?

#34 Updated by sajolida 9 months ago

  • Assignee deleted (sajolida)
  • QA Check set to Ready for QA

Indeed. I started working on this some weeks ago because I had more time, but that's not the case anymore, so it would be good to have this reviewed already.

#35 Updated by intrigeri 9 months ago

  • Status changed from Confirmed to In Progress
  • Assignee set to intrigeri

#36 Updated by intrigeri 9 months ago

  • Assignee deleted (intrigeri)
  • Target version deleted (Tails_3.10.1)
  • QA Check deleted (Ready for QA)
  • Feature Branch deleted (web/6891-broken-links)

sajolida wrote:

I started fixing a bunch of broken links on web/6891-broken-links.

Looks good, merging!

We have a lot, and most of them affect not our documentation but /news, /contribute, and /blueprint.

I think we should teach whatever broken link tool we use to ignore /blueprint and older blog posts (see the sketch after this list):

  • broken links in older /news entries don't matter much because it's pretty hard to find a link pointing to them and they're of mostly historical interest anyway; but broken links in recent blog posts should be reported (probably except links to nightly.t.b.o and dl.a.b.o).
  • blueprints are working tools for contributors and don't affect the vast majority of our website's audience

=> and then we can focus first on broken links in sections that matter more :)
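
For instance, extending the command from note 28 (whether --no-follow-url can be given several times is an assumption, and excluding all of /news is cruder than the "recent posts" cut-off suggested above):

linkchecker --file-output=csv/tails.csv --no-warnings --check-extern --no-follow-url="https://tails.boum.org/blueprint/.*" --no-follow-url="https://tails.boum.org/news/.*" https://tails.boum.org/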

So I won't do that on our Technical Writing budget (that would be too much work), nor will I do it alone.

Makes sense.

#37 Updated by sajolida 9 months ago

  • Description updated (diff)
