Checking for link rot

My collection of photo galleries from events I like has been growing for years, with close to 10,000 links. But the Internet is ever-changing, and some of those links from years ago may no longer be valid. I’ve designed a system to check for this “link rot” and save users of this collection from the frustration of clicking dead links.

Goals

  • Test each link weekly
    • Not working one week? Maybe it’s a hiccup. Probation.
    • Not working for two weeks? It’s dead. Hide it.
  • Don’t spam image hosts with unnecessary requests
  • Minimize the amount of manual work I have to do

Implementation

The gallery aggregator is written in PHP, with a MySQL database. I plan to run an automated job each day; over the course of a week, it checks everything in the DB. So each day I check every seventh gallery, moving the offset based on the day of the week.
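
Roughly, picking the day’s slice looks like this. It’s a minimal sketch: the PDO credentials and the galleries table with its id and url columns are illustrative stand-ins, not my exact schema.

    <?php
    // Bucket galleries by id modulo 7; the day of the week picks the
    // bucket, so every gallery comes up exactly once per week.
    $db = new PDO('mysql:host=localhost;dbname=aggregator', 'user', 'pass');
    $dayOfWeek = (int) date('w'); // 0 (Sunday) through 6 (Saturday)

    $stmt = $db->prepare('SELECT id, url FROM galleries WHERE id % 7 = ?');
    $stmt->execute([$dayOfWeek]);
    $slice = $stmt->fetchAll(PDO::FETCH_ASSOC);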

I added two columns to the DB.

  • last checked: timestamp of the last time I tried to access this URL
  • last accessed: timestamp of the last time this URL returned HTTP 200 OK
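
In MySQL terms that’s one small migration (same illustrative table name as above; both columns start out NULL, so every existing gallery begins unchecked):

    ALTER TABLE galleries
      ADD COLUMN last_checked  TIMESTAMP NULL DEFAULT NULL,
      ADD COLUMN last_accessed TIMESTAMP NULL DEFAULT NULL;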

I skip any gallery with a “last checked” value within the past day. Then I try to fopen() each gallery’s URL. I don’t need to download these webpages (which contain many large images); I just want to know whether they exist. If the URL returns HTTP 200 OK, I add that gallery ID to a list of “fresh” galleries. When I’ve tried to visit all the URLs, I go through the list of “fresh” galleries and set their “last accessed” time to now. The “last accessed” time for URLs that return errors will stay the same, and time will march away from it. The “last checked” time is always updated to now, regardless of the HTTP response code.
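
The probe itself can stay tiny. Here’s a sketch that continues from the slice above, using a HEAD request so no page body (let alone any images) is transferred; isAlive() is a made-up helper name:

    <?php
    // Probe one URL without downloading the page body.
    function isAlive(string $url): bool
    {
        $context = stream_context_create([
            'http' => [
                'method'          => 'HEAD', // headers only, no body
                'timeout'         => 10,
                'follow_location' => 0,      // a redirect is not a 200
                'ignore_errors'   => true,   // keep the stream open on 4xx/5xx
            ],
        ]);
        $handle = @fopen($url, 'r', false, $context);
        if ($handle === false) {
            return false; // DNS failure, timeout, refused connection, ...
        }
        fclose($handle);
        // PHP's http:// wrapper fills $http_response_header in this scope;
        // element 0 is the status line, e.g. "HTTP/1.1 200 OK".
        return isset($http_response_header[0])
            && preg_match('{^HTTP/\S+\s+200\b}', $http_response_header[0]) === 1;
    }

    $fresh = [];
    foreach ($slice as $gallery) {
        if (isAlive($gallery['url'])) {
            $fresh[] = $gallery['id'];
        }
    }

    // Only the galleries that answered 200 get a new "last accessed".
    if ($fresh !== []) {
        $in = implode(',', array_fill(0, count($fresh), '?'));
        $db->prepare("UPDATE galleries SET last_accessed = NOW() WHERE id IN ($in)")
           ->execute($fresh);
    }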

Even dividing the task up across the week leaves more than 1000 URLs to check per day. I don’t want my PHP job to time out, and I don’t want to trigger DoS protections on the image hosts I’m pinging. So I paginate the job with MySQL’s LIMIT keyword. Each time I process a batch of URLs, I update the “last checked” field, which means those URLs won’t be selected in the next batch. Eventually there won’t be any URLs left to check, so running the script extra times does nothing instead of firing off unnecessary DB updates or HTTP requests. I know to stop when I don’t have any more input to process.
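
With batching, the day’s query grows a freshness filter and a LIMIT. Another sketch along the same lines (the batch size of 50 is an arbitrary guess, not a measured number):

    <?php
    // One batch: up to 50 rows from today's slice that haven't been
    // checked within the past day. Because last_checked is bumped after
    // each batch, re-running the script selects fewer and fewer rows
    // until none are left.
    $stmt = $db->prepare(
        'SELECT id, url FROM galleries
          WHERE id % 7 = ?
            AND (last_checked IS NULL
                 OR last_checked < NOW() - INTERVAL 1 DAY)
          LIMIT 50'
    );
    $stmt->execute([$dayOfWeek]);
    $batch = $stmt->fetchAll(PDO::FETCH_ASSOC);

    if ($batch === []) {
        exit; // today's slice is done; extra runs are harmless no-ops
    }

    // ... probe $batch with isAlive() as above, then mark the whole
    // batch as checked, whether or not each URL answered.
    $in = implode(',', array_fill(0, count($batch), '?'));
    $db->prepare("UPDATE galleries SET last_checked = NOW() WHERE id IN ($in)")
       ->execute(array_column($batch, 'id'));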

The webpage that displays galleries to the public now takes the “last accessed” time into account. If that time is more than 2 weeks in the past, the gallery is hidden. If “last accessed” is between 1 and 2 weeks in the past, an icon and tooltip appear next to the link to that gallery:

  • 🚧 This gallery might be offline.

Galleries that are “fresher” than 1 week are displayed normally.
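
On the display side the decision reduces to comparing “last accessed” against two cutoffs. A sketch, with galleryStatus() as a hypothetical helper and the timestamp arriving as a MySQL DATETIME string:

    <?php
    // Map a gallery's last_accessed value to how it should be shown.
    function galleryStatus(?string $lastAccessed): string
    {
        if ($lastAccessed === null) {
            return 'show'; // never checked yet; benefit of the doubt
        }
        $age = time() - strtotime($lastAccessed);
        if ($age > 14 * 86400) {
            return 'hide'; // dead for two weeks: hidden from the page
        }
        if ($age > 7 * 86400) {
            return 'warn'; // probation: shown with the 🚧 icon and tooltip
        }
        return 'show';     // fresh: displayed normally
    }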

Results

Since I give each gallery a few chances, this first week of automatic testing won’t remove any galleries from the list. You may see some 🚧 icons appear over the next week, and in the new year, dead links will start to disappear. It’s a long-term project, but I intend to maintain this list for years, so I don’t mind waiting.