
Saturday, July 29, 2006

Deep Linking from RSS

One of FeedJournal's more distinctive, and perhaps controversial, features is that it can extract the meat of an article published on the web.

How does it accomplish this? FeedJournal has four ways of retrieving the actual content for the next issue.

Actual Content
In the trivial case, a site (like this blog, for example) includes the full article text within its RSS feed. FeedJournal simply publishes the content; no surprises here. By the way, this is how all standard RSS aggregators work. The problem arises when a site publishes only summaries or teasers of the full article text. FeedJournal needs to deal with this because it is an offline RSS reader: users cannot click on their printed newspaper to read the full article.

Linked Content
The <link> tag inside the RSS feed specifies the URL of the full article. When the feed only includes summaries of the full articles, FeedJournal retrieves the text from this URL.
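
To make this concrete, here is a rough Python sketch of following an item's <link> when the feed carries only summaries. The feed URL is a placeholder, and this is just an illustration of the idea, not FeedJournal's actual code:

    import urllib.request
    import xml.etree.ElementTree as ET

    FEED_URL = "http://example.com/rss.xml"  # placeholder feed URL

    # Parse the RSS document and walk its items.
    with urllib.request.urlopen(FEED_URL) as resp:
        tree = ET.parse(resp)

    for item in tree.iter("item"):
        link = item.findtext("link")
        if link:
            # Follow the <link> to fetch the full article page.
            with urllib.request.urlopen(link) as page:
                html = page.read().decode("utf-8", errors="replace")
            print(link, len(html), "bytes")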

Rewritten Link
In most cases, just following this link is not a good solution. The web page typically includes lots of irrelevant content, like a navigation menu, a blogroll, or other articles. FeedJournal lets the user write a regular expression for each feed, automatically rewriting the article's URL to the URL of the printer-friendly version. As an example, the URL of a full article in the International Herald Tribune is http://www.iht.com/articles/2006/07/28/news/mideast.php while the link to the printer-friendly version is http://www.iht.com/bin/print_ipub.php?file=/articles/2006/07/28/news/mideast.php By inserting bin/print_ipub.php?file=/ in the middle of the URL we reach the printer-friendly article. This article is much more suitable for publishing in FeedJournal, because it more or less contains only the meat of the article.
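
The rewrite itself boils down to a regular-expression substitution. Here is a sketch in Python using the IHT example above; the pattern and replacement are illustrative, and the exact regex syntax FeedJournal accepts may differ:

    import re

    # Illustrative pattern: capture the host and the article path separately.
    pattern = r"^(http://www\.iht\.com)(/articles/.*)$"
    replacement = r"\1/bin/print_ipub.php?file=\2"

    url = "http://www.iht.com/articles/2006/07/28/news/mideast.php"
    print(re.sub(pattern, replacement, url))
    # http://www.iht.com/bin/print_ipub.php?file=/articles/2006/07/28/news/mideast.php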

Filtered Content
“More or less”, I said in the last sentence. There are usually some unwanted elements left in the printer-friendly version, like a header and a footer. These can be filtered out by letting FeedJournal begin the article after a specified substring in the HTML document source. Likewise, another substring can be selected to mark the end of the relevant content.
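
In code, this comes down to plain substring searches. A minimal sketch; the marker strings below are made up, and in FeedJournal the user picks them per feed:

    def filter_content(html, start_marker, end_marker):
        # Start the article right after the first occurrence of start_marker.
        start = html.find(start_marker)
        start = start + len(start_marker) if start != -1 else 0
        # End it at the first occurrence of end_marker after that point.
        end = html.find(end_marker, start)
        if end == -1:
            end = len(html)
        return html[start:end]

    # Hypothetical markers for some printer-friendly page:
    article = filter_content(html, '<div class="article">', '<p class="footer">')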

By applying these functions it is possible to scoop, or extract, the meat of almost any web-published article. Of course, this setup is only necessary once for every feed. To my knowledge, FeedJournal is the only aggregator that has the functionality described in the last three sections.
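
Chained together, the whole per-item pipeline looks roughly like this, reusing filter_content from the previous sketch. Again, this is just an illustration of the steps described above, not FeedJournal's actual code:

    import re
    import urllib.request

    def extract_article(link, url_pattern, url_replacement,
                        start_marker, end_marker):
        # 1. Rewrite the feed's <link> to the printer-friendly URL.
        print_url = re.sub(url_pattern, url_replacement, link)
        # 2. Fetch the printer-friendly page.
        with urllib.request.urlopen(print_url) as page:
            html = page.read().decode("utf-8", errors="replace")
        # 3. Trim the remaining header and footer.
        return filter_content(html, start_marker, end_marker)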

Is this legal, you ask? Wouldn't a site owner want each user to actually visit the web site to read the content and click on all those fancy ads sprinkled all over? Well, my stance is that if the content is freely available on the web, I am free to do whatever I want with it for my own purposes. Keep in mind that we are not actually republishing the site's content; we are only filtering it for our own use. Essentially, I think of this as a pop-up or ad blocker running in your browser.

What is interesting to note is that some web sites have tried to include in their copyright notice a paragraph limiting the usage of their content. Digg.com, for example, initially had a clause in its copyright notice effectively prohibiting RSS aggregators from using its RSS feeds! Today, that clause has been removed.

As long as FeedJournal is used personally, and the issues are not sold or made publicly available, I do not see any legal problems with this deep linking.
