Google Reader ignores robots.txt, so are feed readers ‘bots’?
Wednesday, April 25th, 2007
Today, a Feed Digest user reported that his digests using IceRocket were no longer working. I looked into it, and it seems IceRocket had banned our proxy. I rigged up an alternative proxy and it worked for about 50 requests, and then that was banned too. Clearly the ban was automated, and probably reflects a new rule / policy from IceRocket.
I took a look at their robots.txt to see what the deal was, and it turns out they block ALL user agents from their /search directory, which means most of their RSS feeds can’t be used by, er.. anything. A feed reader is an automated client, much like Feed Digest is, so we’re not technically allowed to retrieve their feeds except manually with our browsers.
Of course, this all depends on the definition of a ‘bot’.. more on that later.
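To make the rule concrete, here is a minimal sketch of how a well-behaved client would check such a policy, using Python's standard urllib.robotparser. The robots.txt text below is an assumption written to match the description above (all user agents blocked from /search), not a copy of IceRocket's actual file, and the agent names and the is_allowed helper are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# ASSUMPTION: rules reconstructed from the description in this post,
# not copied from IceRocket's real robots.txt.
ASSUMED_ROBOTS_TXT = """\
User-agent: *
Disallow: /search
"""

# Feed URL from this post, plus the site root for comparison.
FEED_URL = "http://www.icerocket.com/search?tab=blog&q=robots&rss=1"
HOME_URL = "http://www.icerocket.com/"


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # Hypothetical agent names; the wildcard rule applies to both alike.
    for agent in ("ExampleFeedReader/1.0", "ExampleCrawler/1.0"):
        print(agent, "feed:", is_allowed(ASSUMED_ROBOTS_TXT, agent, FEED_URL))
        print(agent, "home:", is_allowed(ASSUMED_ROBOTS_TXT, agent, HOME_URL))
```

Under these assumed rules, any client that consults robots.txt would have to skip the search feeds no matter what it calls itself, which is why the definition of a ‘bot’ matters so much here.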
I decided to put Google Reader to the test to see if they respect robots.txt rules, and.. no! I could subscribe successfully to an IceRocket feed ( http://www.icerocket.com/search?tab=blog&q=robots&rss=1 ) from Google Reader, despite IceRocket’s robots.txt file denying it. So, at least Feed Digest isn’t alone in mostly ignoring robots.txt policy (although barely any feeds are usually covered by them since otherwise they’d be made useless) and Google Reader doesn’t follow the rules either. Difference is, Google’s a big guy and doesn’t get banned and Feed Digest is small and does. Perhaps we’ll work it out with IceRocket in a nice fashion, but the point remains and this could easily be an issue with 101 other feed providers out there in the future.
However, the remaining point is.. is a feed reader a ‘bot’? Finding a definitive answer to this isn’t easy. The original “robot exclusion” standard says:
WWW Robots (also called wanderers or spiders) are programs that traverse many pages in the World Wide Web by recursively retrieving linked pages.
In theory this means almost no feed reader is actually a “robot”, since a reader fetches a fixed list of feeds rather than recursively following links. Feed Digest appears to be treated as one anyway, and the definition itself seems riddled with potential loopholes.
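The distinction in that definition can be sketched in a few lines. This is only an illustration under stated assumptions: the LINKS table is a hypothetical in-memory stand-in for the web, and crawl and poll are made-up names, not how any real spider or feed reader is implemented.

```python
# ASSUMPTION: a tiny hypothetical "web" where each URL maps to the URLs it links to.
LINKS = {
    "site/index": ["site/a", "site/b"],
    "site/a": ["site/c"],
    "site/b": [],
    "site/c": ["site/index"],
    "site/feed.xml": ["site/a", "site/b"],
}


def crawl(start, links):
    """Robot-style retrieval: fetch a page, then keep following its links."""
    fetched = []
    seen = set()
    pending = [start]
    while pending:
        url = pending.pop()
        if url in seen:
            continue
        seen.add(url)
        fetched.append(url)
        # The recursive step: every discovered link is queued for retrieval.
        pending.extend(links.get(url, []))
    return fetched


def poll(subscriptions, links):
    """Feed-reader-style retrieval: fetch only the URLs the user subscribed to."""
    # No links are followed, even though the feed mentions other pages.
    return [url for url in subscriptions if url in links]


if __name__ == "__main__":
    print(sorted(crawl("site/index", LINKS)))
    print(poll(["site/feed.xml"], LINKS))
```

By the quoted definition only crawl behaves like a robot; poll touches exactly what it was told to, however often it runs.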
What’s the actual policy here? Are proxies, feed readers, feed “crawlers” (but not recursive ones) and so forth “robots”, “spiders”, or not? Furthermore, would an application that trawled through linked OPML files be a “robot” because it recursively retrieves OPML files? It’s a toughie, but I’m thinking there needs to be some policy set on this by the higher-ups.

