Archive for the 'Programming' Category

Google Reader ignores robots.txt, so are feed readers ‘bots’?

Wednesday, April 25th, 2007

Today, a Feed Digest user reported that his digests using IceRocket were no longer working. I looked into it, and it seems IceRocket had banned our proxy. I rigged up an alternative proxy and it worked for about 50 requests, and then that was banned too. Clearly the ban was automated, and probably reflects a new rule / policy from IceRocket.

I took a look at their robots.txt to see what the deal was, and it turns out they block ALL useragents from their /search directory, which means most of their RSS feeds can’t be used by, er.. anything. A feed reader is an automated client, much like Feed Digest is, so we’re not technically allowed to retrieve their feeds except manually with our browsers ;-) Of course, this all depends on the definition of a ‘bot’.. more on that later.

I decided to put Google Reader to the test to see if they respect robots.txt rules, and.. no! I could subscribe successfully to an IceRocket feed ( http://www.icerocket.com/search?tab=blog&q=robots&rss=1 ) from Google Reader, despite IceRocket’s robots.txt file denying it. So at least Feed Digest isn’t alone in mostly ignoring robots.txt policy, since Google Reader doesn’t follow the rules either (and barely any feeds are usually covered by robots.txt anyway, since otherwise they’d be made useless). The difference is that Google’s a big guy and doesn’t get banned, while Feed Digest is small and does. Perhaps we’ll work it out with IceRocket in a nice fashion, but the point remains, and this could easily become an issue with 101 other feed providers in the future.
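For anyone wanting to check this sort of thing themselves, here is a rough sketch using Python's standard robots.txt parser. The robots.txt content is an assumption reconstructed from the description above (all user agents blocked from /search), not a copy of IceRocket's real file, and the "FeedDigest" user agent string is made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Assumed rules, based only on the description above: every user agent
# is disallowed from /search. This is NOT IceRocket's actual file.
ASSUMED_ROBOTS_TXT = """\
User-agent: *
Disallow: /search
"""

# The feed URL tested from Google Reader in the post.
FEED_URL = "http://www.icerocket.com/search?tab=blog&q=robots&rss=1"


def is_allowed(robots_txt, user_agent, url):
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# "FeedDigest" is a hypothetical user agent name used for this sketch.
feed_allowed = is_allowed(ASSUMED_ROBOTS_TXT, "FeedDigest", FEED_URL)
home_allowed = is_allowed(ASSUMED_ROBOTS_TXT, "FeedDigest", "http://www.icerocket.com/")
```

Under those assumed rules, a client that honours robots.txt would skip the search feed but could still fetch the home page, whatever it calls itself.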

However, the remaining point is.. is a feed reader a ‘bot’? Finding a definitive answer to this isn’t easy. The original “robot exclusion” standard says:

WWW Robots (also called wanderers or spiders) are programs that traverse many pages in the World Wide Web by recursively retrieving linked pages.

In theory, this means almost no feed reader is actually a “robot”, yet Feed Digest appears to be treated as one. That said, this definition of “robot” seems riddled with potential loopholes.

What’s the actual policy here? Are proxies, feed readers, feed “crawlers” (but not recursive ones) and so forth “robots”, “spiders”, or not? Furthermore, would an application that trawled through linked OPML files be a “robot” because it recursively retrieves OPML files? It’s a toughie, but I’m thinking there needs to be some policy set on this by the higher-ups :)

Installing Ruby 1.8.6 from Source on Ubuntu Feisty Fawn

Tuesday, April 24th, 2007

Installing Ruby from source is my preferred method, although in Ubuntu Feisty you can supposedly now install it with apt-get install ruby. Here are the essential packages needed to get a source build working right, along with the process I just went through:

sudo apt-get install build-essential
sudo apt-get install libreadline-dev
# zlib headers: necessary for RubyGems to install, amongst other things
sudo apt-get install libz-dev
wget ftp://ftp.ruby-lang.org/pub/ruby/ruby-1.8.6.tar.gz
tar xzvf ruby-1.8.6.tar.gz
cd ruby-1.8.6
./configure
make
sudo make install

And to install RubyGems..

wget http://rubyforge.org/frs/download.php/17190/rubygems-0.9.2.tgz
tar xzvf rubygems-0.9.2.tgz
cd rubygems-0.9.2
sudo ruby setup.rb

Various Ideas

Monday, April 2nd, 2007

I don’t have time to do all / any of these, but a few ideas that have been floating around in my head lately:

mod_mongrel: An Apache module to make configuration of Mongrel / Rails apps easier. It starts up the instances, manages the cluster, chooses available ports, does the proxying automatically, etc. Deploying Mongrel/Rails apps at the moment is too “sysadminny”.

RubyScript: A Firefox plugin to allow Ruby to be used in a JavaScript-esque fashion in Web pages. Would be good for off-line / intranet / specialist use. This kinda exists in the Microsoft world.

Single text box journalling, notes, etc: A single page with a single text box. You type, it stores. If you type with a ? at the start, it then automatically searches for items matching your query and shows them to you. The ultimate simple note taking system. Just a single text box. Like Twitter, but not public. I’ve 90% developed this already but haven’t been bothered to finish it yet.

Web RAD tool with open source runtime environment: Think along the lines of Coghead, but with an open source runtime environment that anyone can use to run their apps. Imagine Delphi or Visual Basic, but simplified, and browser-based with an open source runtime.
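To make the single text box idea a bit more concrete, here is a minimal sketch of the core logic, assuming an in-memory list for storage and a plain case-insensitive substring match for search. The NoteBox class and its submit method are names invented for this sketch, not anything from the 90% finished version.

```python
class NoteBox:
    """One input, two behaviours: store a note, or search if it starts with '?'."""

    def __init__(self):
        # Assumption: notes live in memory; a real version would persist them.
        self.notes = []

    def submit(self, text):
        """Store text as a note, or return matching notes for a '?' query."""
        text = text.strip()
        if text.startswith("?"):
            query = text[1:].strip().lower()
            # Assumption: a simple substring match stands in for real search.
            return [note for note in self.notes if query in note.lower()]
        self.notes.append(text)
        return []
```

Typing "Buy milk" stores a note, and typing "?milk" afterwards returns the notes containing "milk", all through the same submit call.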

There’s probably more, but these are the ones that I keep thinking about for five minutes each day.. so I figured I should note them down.

Beginning Ruby is released!

Monday, March 26th, 2007

My book, Beginning Ruby, was published today. Learn more about it (and how to get a copy!) in this short article I’ve written.

List of public DNS servers you can use

Friday, March 2nd, 2007

Your ISP’s DNS servers acting crappy and not resolving for you? Or do you just want to check whether your DNS changes are propagating properly? These DNS resolvers should be usable from almost anywhere..

67.138.54.100 = provided by ScrubIt
207.225.209.66 = provided by ScrubIt
208.67.222.222 = provided by OpenDNS
208.67.220.220 = provided by OpenDNS
4.2.2.1 = vnsc-pri.sys.gtei.net
4.2.2.2 = vnsc-bak.sys.gtei.net
65.74.140.3 = noc.arpa.org
206.111.255.123 = nic.arpa.org
216.185.111.10 = ns1.servermatrix.com
69.56.222.10 = ns2.servermatrix.com
67.19.0.10 = ns3.servermatrix.com
67.19.1.10 = ns4.servermatrix.com
70.84.160.11 = ns5.servermatrix.com
4.2.2.3
4.2.2.4
4.2.2.5
4.2.2.6

Thanks go to Zeeshan Muhammad and Max Powers for several of these.
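If you want to check propagation from a script rather than by hand, here is a rough sketch that asks one specific resolver for a hostname's A records over UDP, using only the Python standard library. The function names are invented for this sketch, it ignores truncated responses, retries and non-A records, and whether any given server in the list still answers is not something the code can promise.

```python
import random
import socket
import struct


def build_query(hostname, query_id):
    """Build a minimal DNS query packet asking for the A record of hostname."""
    # Header: id, flags (0x0100 = recursion desired), 1 question, no other records.
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    # Question name: each label is prefixed with its length, ending in a zero byte.
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in hostname.rstrip(".").split(".")
    ) + b"\x00"
    # QTYPE 1 = A record, QCLASS 1 = Internet.
    return header + qname + struct.pack("!HH", 1, 1)


def skip_name(packet, offset):
    """Return the offset just past a (possibly compressed) name."""
    while True:
        length = packet[offset]
        if length == 0:
            return offset + 1
        if length & 0xC0 == 0xC0:  # a compression pointer is two bytes long
            return offset + 2
        offset += 1 + length


def parse_a_records(packet):
    """Extract IPv4 addresses from the answer section of a DNS response."""
    _, _, qdcount, ancount, _, _ = struct.unpack("!HHHHHH", packet[:12])
    offset = 12
    for _ in range(qdcount):
        offset = skip_name(packet, offset) + 4  # also skip QTYPE and QCLASS
    addresses = []
    for _ in range(ancount):
        offset = skip_name(packet, offset)
        rtype, _, _, rdlength = struct.unpack("!HHIH", packet[offset:offset + 10])
        offset += 10
        if rtype == 1 and rdlength == 4:
            addresses.append(socket.inet_ntoa(packet[offset:offset + 4]))
        offset += rdlength
    return addresses


def resolve_with(server, hostname, timeout=3.0):
    """Ask one specific DNS server for the A records of hostname."""
    query = build_query(hostname, random.randint(0, 0xFFFF))
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        sock.sendto(query, (server, 53))
        response, _ = sock.recvfrom(512)
    return parse_a_records(response)
```

Calling resolve_with("208.67.222.222", "example.com") and then the same hostname against another server from the list lets you compare the answers; if they differ, the change probably hasn't reached everyone yet.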

Blogger outputting bad Atom feeds with invalid MIME types

Monday, February 26th, 2007

This is annoying me enough that I have to post. Mostly so I can rank for the terms related to this problem, because I’ve tried searching for references to it and no-one else seems to have noticed the problem! At Feed Digest, however, it’s impossible to avoid as customers are complaining their feeds aren’t being processed properly.. but the reason is that Blogger.com has fscked up a lot of its customers’ feeds.

The problem seems to be that they’re throwing random crap into the “type” attribute, which is meant to be used for MIME types.. like so:

<link rel='related' type='How to setup a 301 Redirect' href='http://www.dailyblogtips.com/how-to-setup-a-301-redirect/'></link>

“How to setup a 301 Redirect” is not a valid MIME type, so it’s not a valid Atom feed.

Another problem is that they’re not encoding apostrophes in many places, so the code is becoming totally invalid in the eyes of XML parsers. Check this out:

<link rel='related' type='53 CSS-Techniques You Couldn't Live Without | Smashing Magazine' href='http://www.smashingmagazine.com/2007/01/19/53-css-techniques-you-couldnt-live-without/'></link>

Blogger have decided to use single quotes for text encapsulation, which would be okay if they didn’t also allow apostrophes in the attribute data unescaped! The apostrophe on “Couldn’t” totally freaks out XML parsers.
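Both problems are easy to reproduce. Here is a small sketch using Python's built-in XML parser on the two link elements above, assuming the quotes in the real feed are plain ASCII single quotes. The MIME type pattern is a deliberately simplified "type/subtype" check that ignores parameters such as charset, so treat it as an illustration rather than a full validator; the function names are made up for this sketch.

```python
import re
import xml.etree.ElementTree as ET

# Simplified assumption: a MIME type is "type/subtype" with no spaces.
# Parameters such as "; charset=utf-8" are not handled here.
MIME_TYPE = re.compile(
    r"^[A-Za-z0-9][A-Za-z0-9!#$&^_.+-]*/[A-Za-z0-9][A-Za-z0-9!#$&^_.+-]*$"
)

BAD_TYPE = (
    "<link rel='related' type='How to setup a 301 Redirect' "
    "href='http://www.dailyblogtips.com/how-to-setup-a-301-redirect/'></link>"
)
BAD_QUOTES = (
    "<link rel='related' "
    "type='53 CSS-Techniques You Couldn't Live Without | Smashing Magazine' "
    "href='http://www.smashingmagazine.com/2007/01/19/"
    "53-css-techniques-you-couldnt-live-without/'></link>"
)


def is_well_formed(fragment):
    """Return True if the fragment parses as XML at all."""
    try:
        ET.fromstring(fragment)
        return True
    except ET.ParseError:
        return False


def has_valid_type(fragment):
    """Return True if the element's type attribute looks like a MIME type."""
    link = ET.fromstring(fragment)
    return bool(MIME_TYPE.match(link.get("type", "")))
```

The first fragment parses fine but fails the type check; the second never gets that far, because the parser gives up at the stray apostrophe.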

(Update.. they’re also mixing single and double quotes..

<link rel='alternate" type="text/html" href="http://www.cocc-blogs.com/2007/01/tutorial-on-installing-gaim.html"></link>

Check out the rel attribute.)

(Update 2.. I have word from Google that they’re looking into the problem. Result!)

Code Snippets sold (so don’t use @bigbold.com anymore)

Saturday, February 10th, 2007

I briefly added it as an update to my latest post, but in case you didn’t see it.. I sold Code Snippets (and the entire bigbold.com domain). I sold it to the amazing DZone, who are best known for Javalobby and DZone.com (Digg for developers, as I call it). DZone owner, Rick Ross, wrote a little bit about the acquisition.

I’m really pleased DZone has it because I know they’ll be great caretakers and developers for the site. They have an absolutely massive developer community around their various Web properties and could blow Snippets up to an entirely new level that little old me wouldn’t be able to reach alone. I also have the option of working alongside DZone wherever I can to help them with the site, ideas, and so forth, so even though I’ve given away my baby, I still have the option to ‘go visit’ if I want.

If this were a really slow news week, I guess you could now say a British Web 2.0 property has been sold, but I couldn’t be so pretentious. Still, for something cobbled up mostly over 2 days back in 2005 (though with significant work much later on to make it look nice), I am pleased with the outcome. All I need to do is sell another nine such sites, and I could afford a house! Of course, I’d rather build up and sell FeedDigest in a year or two instead, and that’s the next thing on the books.

Note: Most people who read this blog don’t e-mail me on @bigbold.com addresses anyway.. but if you do, please don’t anymore! Use my name @ petercooper.co.uk instead.

Microsoft attempts to patent feed processing technology

Saturday, December 23rd, 2006

I can’t believe it. Dave Winer reports that Microsoft are quite specifically attempting to patent a system that acts and sounds a lot like what Feed Digest does. All of these excerpts from the patent application are almost word for word descriptions of significant aspects of what Feed Digest does or how it operates. It also covers significant aspects of applications such as FeedBurner.

The ability of a central system to receive feeds and allow others to retrieve data related to those feeds:

[…] the platform can acquire and organize web content, and make such content available for consumption by many different types of applications. These applications may or may not necessarily understand the particular syndication format. Thus, in the implementation example, applications that do not understand the RSS format can nonetheless, through the platform, acquire and consume content, such as enclosures, acquired by the platform through an RSS feed […]

There are cases, however, when an application that uses the platform does not wish to be subscribed to a particular feed. Rather, the application just wants to use the functionality of the platform to access data from a feed. In this case, in this particular embodiment, subscriptions object 202 supports a method that allows a feed to be downloaded without subscribing to the feed. In this particular example, the application calls the method and provides it with a URL associated with the feed. The platform then utilizes the URL to fetch the data of interest to the application. In this manner, the application can acquire data associated with a feed in an adhoc fashion without ever having to subscribe to the feed.

The ability to tailor data within the system for each feed:

On the other hand, there is data that is treated as read/write data, such as the name of a particular feed. That is, the user may wish to personalize a particular feed for their particular user interface. In this case, the object model has properties that are read/write. For example, a user may wish to change the name of a feed from “New York Times” to “NYT”. In this situation, the name property may be readable and writable.

Centralized synchronization:

In the illustrated and described embodiment, feed synchronization engine 108 (FIG. 1) is responsible for downloading RSS feeds from a source. A source can comprise any suitable source for a feed, such as a web site, a feed publishing site and the like. In at least one embodiment, any suitable valid URL or resource identifier can comprise the source of a feed. The synchronization engine receives feeds and processes the various feed formats, takes care of scheduling, handles content and enclosure downloads, as well as organizes archiving activities.

Feed normalization:

In the illustrated and described embodiment, feeds are capable of being received in a number of different feed formats. By way of example and not limitation, these feed formats can include RSS 1.0, 1.1, 0.9x, 2.0, Atom 0.3, and so on. The synchronization engine, via the feed format module, receives these feeds in the various formats, parses the format and transforms the format into a normalized format referred to as the common format.
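To show how ordinary this kind of normalization is, here is a minimal sketch that turns RSS 2.0 and Atom 1.0 documents into the same list of title/link dictionaries. It is an illustration only, not how Feed Digest or the platform in the application actually works; the normalize function and the shape of the "common format" are assumptions, and the older formats the excerpt lists (RSS 1.0, Atom 0.3) are not handled.

```python
import xml.etree.ElementTree as ET

# Namespace for Atom 1.0; Atom 0.3 used a different one and is not covered.
ATOM_NS = "{http://www.w3.org/2005/Atom}"


def normalize(feed_xml):
    """Parse an RSS 2.0 or Atom 1.0 document into a list of common-format items."""
    root = ET.fromstring(feed_xml)
    items = []
    if root.tag == "rss":
        for item in root.iter("item"):
            items.append({
                "title": item.findtext("title", default=""),
                "link": item.findtext("link", default=""),
            })
    elif root.tag == ATOM_NS + "feed":
        for entry in root.iter(ATOM_NS + "entry"):
            # Atom keeps the URL in the href attribute of the link element.
            link = entry.find(ATOM_NS + "link")
            items.append({
                "title": entry.findtext(ATOM_NS + "title", default=""),
                "link": link.get("href", "") if link is not None else "",
            })
    else:
        raise ValueError("unrecognized feed format: " + root.tag)
    return items
```

Once both formats come out as the same structure, everything downstream only has to understand one shape of data.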

To Amar S. Gandhi, Edward J. Praitis, Jane T. Kim, Sean O. Lyndersay, Walter V. von Koch, William Gould, Bruce A. Morgan, and Cindy Kwan.. did you really collectively invent all of this stuff? Shame on those backing this pathetic attempt to trample over technology that has, so far, not necessitated the use of software patents.

Retro Programming Books

Wednesday, December 20th, 2006

I’m not really a collector, but if there’s one thing I could be accused of collecting then it’s cool programming books. I don’t have many retro ones though, so it was with much joy I found this site full of cool old programming books (mostly for the Atari though). Check out Basic Computer Games and the Bowling game. Puts the Wii to shame it does.

How eBay Works: An Amazing PDF

Friday, December 8th, 2006

I’ve found this amazing PDF presentation that shows how eBay’s architecture fits together and gives some mindblowing traffic numbers (1 billion pageviews per day!). There’s lots of juicy information about how their database system works too. They perform no client-side transactions, intriguingly.