James Caws

Generating RSS feeds from static HTML web pages

Jul 9th 2008


Sadly, even some mainstream, well-used websites still do not offer updates via RSS. More fool them, I say: they are potentially losing people who would otherwise be repeat visitors. There are ways, though, to automatically and easily generate your own feed(s) based on one or more of their pages.

About a year ago I came across a job listing page that I wanted to monitor automatically for new additions and alterations without actually having to view the web page. I could have added it to my list of Firefox homepage bookmarks, but I already have enough, and I have found that over time it is easy to become complacent and give up checking altogether. The quickest and most rudimentary solution I could think of at the time was a quick and dirty Unix script that retrieved the content of the page I was interested in and stored its total byte size in a file. On subsequent checks (the script was scheduled to run periodically from cron) it compared the new page size with the previously stored size and, if there was a change, sent an email with a link to the page. It worked so well that I adapted it to monitor a number of other pages I simply couldn’t be bothered to check manually, including various DMOZ listings.
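The original script is not reproduced here, so what follows is only a minimal Python sketch of the same idea, not the Unix script itself. The function names and the state file are made up for illustration, and a `print` stands in for the email.

```python
import urllib.request
from pathlib import Path


def fetch_size(url: str) -> int:
    """Download the page and return its size in bytes."""
    with urllib.request.urlopen(url, timeout=30) as response:
        return len(response.read())


def size_changed(new_size: int, state_file: Path) -> bool:
    """Compare new_size with the size recorded by the previous run.

    The stored value is always overwritten with new_size. The very first
    run (no state file yet) only records a baseline and reports no change.
    """
    previous = state_file.read_text().strip() if state_file.exists() else None
    state_file.write_text(str(new_size))
    return previous is not None and int(previous) != new_size


def check(url: str, state_file: Path) -> None:
    """One periodic check, suitable for calling from a cron job."""
    if size_changed(fetch_size(url), state_file):
        # The script described above sent an email here (assumption:
        # printing is enough for a sketch).
        print(f"Page changed: {url}")
```

Scheduling is left to cron, as in the original setup: each run calls `check()` once with the page URL and a per-page state file.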

The above method works OK, but there are plenty of reasons why you wouldn’t want to use it: some pages change dynamically and very regularly, so the size differs on almost every check, and in some cases it can be hard to spot where the changes are.
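One way around the "where did it change?" problem, which is not something my script did, is to keep the previous copy of the page and diff it against the new one. A rough sketch using Python's standard `difflib` module follows; the function name is hypothetical.

```python
import difflib
from itertools import islice


def changed_lines(old_html: str, new_html: str) -> list[str]:
    """Return only the removed ("-") and added ("+") lines between snapshots."""
    diff = difflib.unified_diff(
        old_html.splitlines(), new_html.splitlines(), lineterm="", n=0
    )
    # The first two lines of a unified diff are the "---"/"+++" file headers;
    # skip them, then drop the "@@" hunk markers by keeping only +/- lines.
    return [line for line in islice(diff, 2, None) if line.startswith(("+", "-"))]
```

This still flags every dynamic change on a busy page, but at least the email could say which lines moved.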

Yesterday I came across yet another page I wanted to monitor. I simply couldn’t bring myself to add it to the Unix script, and as I have almost fallen in love with the convenience of RSS subscriptions, I figured that at least one person out there must have produced a website that automatically generates RSS feeds from a URL.

I was after a simple solution, nothing too complicated – simply monitoring a page for new links shouldn’t require a degree in Internet Technology. I got searching and discovered three websites offering the service, which is sometimes referred to as ‘feed scraping’. I believe that is incorrect terminology, given that no feeds are being scraped; ‘HTML scraping’ or ‘web page scraping’ is closer to the mark if you ask me.

In no particular order, and with my brief opinion of each based on the few minutes’ trial I was willing to give it, here are the sites I discovered.

  • FeedYes – With FeedYes you cannot save feeds without being registered and logged in. I provided a sample URL (which is possible before registering) and it generated a preview of the links that would be included in a feed based on it. Continuing on, you can specify which links are static and would therefore be present after every scrape; these are then ignored in future so that only newsworthy links are included.
  • Feed43 – No registration required, and it has the potential to be a very accurate RSS feed generator, because after retrieving the static page you define your own extraction rules. Some learning may be involved in setting up these pattern matches.
  • feedmaker (from yoktu.com) – The most straightforward service I found and probably a good choice for someone who doesn’t have a geeky web background. It has a Google-like interface where you enter the URL of the target page and, after submitting it, you are given an RSS link. It doesn’t provide any kind of link exclusion, so the initial feed results will include common static links, but after that they should no longer be displayed. I gave this service a go with a couple of URLs and sadly it didn’t work for all of them, whereas most if not all URLs worked with the others.
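I don’t know how any of these services work internally, but the general idea of turning a page’s links into a feed and ignoring links that have been seen before can be sketched in a few lines of Python using only the standard library. Everything named here (`LinkCollector`, `new_links`, `build_rss`) is my own invention for illustration, not any provider’s API.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin
import xml.etree.ElementTree as ET


class LinkCollector(HTMLParser):
    """Collect (absolute URL, link text) pairs from <a href="..."> tags."""

    def __init__(self, base_url: str):
        super().__init__()
        self.base_url = base_url
        self.links: list[tuple[str, str]] = []
        self._href = None
        self._text: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self._href = urljoin(self.base_url, href)
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            title = "".join(self._text).strip() or self._href
            self.links.append((self._href, title))
            self._href = None


def new_links(html: str, base_url: str, seen: set[str]) -> list[tuple[str, str]]:
    """Return links whose URL is not in `seen` (static or already reported)."""
    parser = LinkCollector(base_url)
    parser.feed(html)
    parser.close()
    fresh, known = [], set(seen)
    for url, title in parser.links:
        if url not in known:
            known.add(url)  # also drops duplicates within the same page
            fresh.append((url, title))
    return fresh


def build_rss(feed_title: str, page_url: str, links: list[tuple[str, str]]) -> str:
    """Build a minimal RSS 2.0 document with one <item> per link."""
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = feed_title
    ET.SubElement(channel, "link").text = page_url
    ET.SubElement(channel, "description").text = f"New links found on {page_url}"
    for url, title in links:
        item = ET.SubElement(channel, "item")
        ET.SubElement(item, "title").text = title
        ET.SubElement(item, "link").text = url
    return ET.tostring(rss, encoding="unicode")
```

Passing an empty `seen` set on the first scrape behaves like feedmaker’s first run: every link on the page, static ones included, ends up in the feed. Feeding those URLs back in as `seen` on later scrapes leaves only the new ones, which is roughly the effect FeedYes gets from marking static links up front.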

This is not an exhaustive list, and I am sure there are other sites offering a similar service, perhaps even better than all of the above. But as I said, I was after a quick solution, and that included being able to find it ‘quickly’.



One Response

  1. [...] under a year ago I wrote about three websites that allowed users to generate RSS feeds from static web pages. The most promising provider looked like Feed43, however I never really gave it a thorough test. [...]
