wget and robots.txt

A web application I was hired to fix was throwing a number of unhandled exceptions. After adding error logging to capture them for review, I noticed that one exception was triggered every time a web crawler hit the site. To cut down on the noise, I added a robots.txt file that blocked all user agents. That would only stop well-behaved indexing crawlers, but at least it would de-clutter the error logs.
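A robots.txt that blocks all user agents is just two lines, placed at the site root:

    User-agent: *
    Disallow: /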

Blocking all user agents had a side effect: it also blocked wget. The application I was working on needed to support client systems that retrieved software updates via wget, and since wget honors robots.txt by default, the new file caused the automated updates to fail. It's actually a good thing that wget respects robots.txt, because it can easily be scripted to slurp down an entire site and severely impact site performance in the process.
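To illustrate the risk, a single recursive command is enough to walk a whole site (example.com is just a placeholder here):

    # recursively downloads every page and file the crawl can reach
    wget --mirror https://example.com/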

There are several ways to permit wget while still blocking other user agents. First, if you control the client, you can use a wget switch to turn off the robots.txt check (see the wget docs for details). Alternatively, you can modify your robots.txt file to permit the wget user agent while explicitly blocking the agents you don't want.
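On the client side, the switch is -e, which runs a .wgetrc command for just that invocation; robots=off disables the robots.txt check (the URL below is only a placeholder):

    # one-off: ignore robots.txt for this recursive fetch
    wget -e robots=off --mirror https://example.com/updates/

    # or make it permanent for that client by adding this line to ~/.wgetrc
    robots = off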
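On the server side, robots.txt can carve out an exception for wget. Wget identifies itself with a "Wget" user agent and, as I understand it, matches that token case-insensitively against robots.txt records, so a record with an empty Disallow lets it through while the catch-all record keeps blocking everyone else:

    User-agent: Wget
    Disallow:

    User-agent: *
    Disallow: /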