{"id":557,"date":"2019-12-21T20:01:47","date_gmt":"2019-12-21T20:01:47","guid":{"rendered":"http:\/\/cliffnordman.com\/blog\/?p=557"},"modified":"2019-12-21T20:01:47","modified_gmt":"2019-12-21T20:01:47","slug":"checking-for-link-rot","status":"publish","type":"post","link":"https:\/\/cliffnordman.com\/blog\/2019\/12\/21\/checking-for-link-rot\/","title":{"rendered":"Checking for link rot"},"content":{"rendered":"<p>My collection of <a href=\"https:\/\/cliffnordman.com\/galleries\/\">photo galleries<\/a> from events I like has been growing for years, with close to 10,000 links. But the Internet is ever-changing, and some of those links from years ago may no longer be valid. I&#8217;ve designed a system to check for this &#8220;link rot&#8221; and save users of this collection from the frustration of clicking dead links.<\/p>\n<h3>Goals<\/h3>\n<ul>\n<li>Test each link weekly\n<ul>\n<li>not working one week, maybe it&#8217;s a hiccup. Probation<\/li>\n<li>not working for two weeks. It&#8217;s dead. Hide it.<\/li>\n<\/ul>\n<\/li>\n<li>Don&#8217;t spam image hosts with unnecessary requests<\/li>\n<li>Minimize the amount of manual work I have to do<\/li>\n<\/ul>\n<h3>Implementation<\/h3>\n<p>The gallery aggregator is written in PHP, with a MySQL database. I plan to run an automated job each day that will check everything in the DB over the course of each week. So each day, I check every seventh gallery, and move the offset based on the day of the week.<\/p>\n<p>I added two columns to the DB.<\/p>\n<ul>\n<li>last checked: timestamp of the last time I tried to access this URL<\/li>\n<li>last accessed: timestamp of the last time this URL returned HTTP 200 OK<\/li>\n<\/ul>\n<p>I skip any gallery with a &#8220;last checked&#8221; value within the past day. Then I try to <strong>fopen()<\/strong> each gallery. I don&#8217;t need to download these webpages (which contain many large images); I just want to know if they exist. 
If the URL returns <strong>HTTP 200 OK<\/strong>, I add that gallery ID to a list of &#8220;fresh&#8221; galleries.\u00a0 When I&#8217;ve tried to visit all the URLs, I go through the list of &#8220;fresh&#8221; galleries and set their &#8220;last accessed&#8221; time to now.\u00a0 The &#8220;last accessed&#8221; time for URLs that return errors will stay the same, and time will march away from it. The &#8216;last checked&#8217; time is always updated to now regardless of the HTTP response code.<\/p>\n<p>Even dividing the task up across the week leaves more than 1000 URLs to check per day. I don&#8217;t want my PHP job to time out, and I don&#8217;t want to trigger DoS protections on the image hosts I&#8217;m pinging. So I paginate the job with MySQL&#8217;s <strong>LIMIT<\/strong> keyword. Each time I process a batch of URLs, I update the &#8220;last checked&#8221; field, which means those URLs won&#8217;t be selected in the next batch. Eventually there won&#8217;t be any URLs left to check, so running the script too many times does nothing instead of performing unnecessary DB updates or HTTP requests. I know to stop when I don&#8217;t have any more input to process.<\/p>\n<p>The webpage that displays galleries to the public now takes the &#8220;last accessed&#8221; time into account. If that time is more than 2 weeks in the past, the gallery is hidden. If &#8220;last accessed&#8221; is between 1 and 2 weeks in the past, an icon and tooltip appear next to the link to that gallery:<\/p>\n<ul>\n<li>\ud83d\udea7 This gallery might be offline.<\/li>\n<\/ul>\n<p>Galleries that are &#8220;fresher&#8221; than 1 week are displayed normally.<\/p>\n<h3>Results<\/h3>\n<p>Since I give each gallery a few chances, this first week of automatic testing won&#8217;t remove any galleries from the list.\u00a0 You may see some \ud83d\udea7 icons appear over the next week, and in the new year, dead links will start to disappear. 
It&#8217;s a long-term project, but I intend to maintain this list for years, so I don&#8217;t mind waiting.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>My collection of photo galleries from events I like has been growing for years, with close to 10,000 links. But the Internet is ever-changing, and some of those links from years ago may no longer be valid. I&#8217;ve designed a system to check for this &#8220;link rot&#8221; and save users of this collection from the &hellip; <\/p>\n<p class=\"link-more\"><a href=\"https:\/\/cliffnordman.com\/blog\/2019\/12\/21\/checking-for-link-rot\/\" class=\"more-link\">Continue reading<span class=\"screen-reader-text\"> &#8220;Checking for link rot&#8221;<\/span><\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"closed","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[8],"tags":[],"class_list":["post-557","post","type-post","status-publish","format-standard","hentry","category-programming"],"_links":{"self":[{"href":"https:\/\/cliffnordman.com\/blog\/wp-json\/wp\/v2\/posts\/557"}],"collection":[{"href":"https:\/\/cliffnordman.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/cliffnordman.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/cliffnordman.com\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/cliffnordman.com\/blog\/wp-json\/wp\/v2\/comments?post=557"}],"version-history":[{"count":1,"href":"https:\/\/cliffnordman.com\/blog\/wp-json\/wp\/v2\/posts\/557\/revisions"}],"predecessor-version":[{"id":558,"href":"https:\/\/cliffnordman.com\/blog\/wp-json\/wp\/v2\/posts\/557\/revisions\/558"}],"wp:attachment":[{"href":"https:\/\/cliffnordman.com\/blog\/wp-json\/wp\/v2\/media?parent=557"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/cliffnordman.com\/blog\/wp-json\/wp\/v2\/categories?post=557"},{"taxonomy":"post_tag","
embeddable":true,"href":"https:\/\/cliffnordman.com\/blog\/wp-json\/wp\/v2\/tags?post=557"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}