Screen Scraping Your Way Into RSS


Dennis Pallett

Introduction

RSS is one of the hottest technologies at the moment, and even big web publishers such as the New York Times are getting into it. However, there are still a lot of websites that do not have RSS feeds.

If you still want to be able to check those websites in your favourite aggregator, you need to create your own RSS feed for them. This can be done automatically with PHP, using a method called screen scraping. Screen scraping is usually frowned upon, as it's mostly used to steal content from other websites.

I personally believe that in this case, to automatically generate an RSS feed, screen scraping is not a bad thing. Now, on to the code!

Getting the content

For this article, we'll use PHPit as an example, despite the fact that PHPit already has RSS feeds (http://www.phpit.net/syndication/).

We'll want to generate an RSS feed from the content listed on the front page (http://www.phpit.net). The first step in screen scraping is getting the complete page. In PHP this can be done very easily, by using implode("", file("[the url here]")); if your web host allows it. If you can't use file() you'll have to use a different method of getting the page, e.g. using the cURL library (http://www.php.net/curl).
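To make the two options concrete, here is a minimal sketch of a fetch helper that tries PHP's file functions first and falls back to cURL. The function name fetch_page() and the 10-second timeout are my own assumptions for illustration; they are not part of the demo script.

```php
<?php
// Hypothetical helper (not from the demo script): fetch a URL as a string.
function fetch_page(string $url): string
{
    // First choice: PHP's URL-aware file functions, if the host enables them.
    if (ini_get('allow_url_fopen')) {
        $body = @file_get_contents($url);
        if ($body !== false) {
            return $body;
        }
    }

    // Fallback: the cURL extension, if it is installed.
    if (function_exists('curl_init')) {
        $ch = curl_init($url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body instead of printing it
        curl_setopt($ch, CURLOPT_TIMEOUT, 10);          // assumed timeout, in seconds
        $body = curl_exec($ch);
        if ($body !== false) {
            return $body;
        }
    }

    return ''; // nothing could be fetched
}
```

Calling fetch_page("http://www.phpit.net/") would then give you the page as one string, the same shape of data that implode("", file($url)) produces.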

Now that we have the page available, we can parse it for the content using some regular expressions. The key to screen scraping is looking for patterns that match the content, e.g. are all the content items wrapped in <div>s or something else? If you can successfully discover a pattern, then you can use preg_match_all() to get all the content items.

For PHPit, the pattern that matches the content is <div class="contentitem">[Content Here]</div>. You can verify this yourself by going to the main page of PHPit, and viewing the source.
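As a rough illustration of that pattern at work, the sketch below runs preg_match_all() over a small piece of made-up markup. The function name extract_items() and the sample HTML are assumptions of mine; the real front page will look different.

```php
<?php
// Return the inner HTML of every <div class="contentitem"> block in $html.
function extract_items(string $html): array
{
    // [^`]*? lazily matches any run of characters except a backtick
    // (newlines included), so each match stops at the first closing </div>.
    preg_match_all('/<div class="contentitem">([^`]*?)<\/div>/', $html, $matches);
    return $matches[1];
}

// Hypothetical sample markup standing in for the real front page.
$sample = '<div class="contentitem"><h3><a href="/one/">One</a></h3></div>' . "\n"
        . '<div class="contentitem"><h3><a href="/two/">Two</a></h3></div>';

print_r(extract_items($sample));
```

Note that a pattern like this stops at the first </div> it meets, so it only works when the content items do not contain nested <div>s.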

Now that we have a match we can get all the content items. The next step is to retrieve the individual information, i.e. url, title, author, text. This can be done by using some more regular expressions and str_replace() on each content item.

By now we have the following code:
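To show what "retrieving the individual information" means for a single item, here is a small sketch that applies one regular expression per field. The function name parse_item() and the item markup it expects (a linked <h3> title, a <p> of text and a byline <span>) are assumptions for illustration only.

```php
<?php
// Pull the four fields out of one content item; missing fields come back as ''.
function parse_item(string $item): array
{
    $patterns = [
        'title'  => '/">([^`]*?)<\/a><\/h3>/',
        'url'    => '/<a href="([^`]*?)">/',
        'text'   => '/<p>([^`]*?)<span class="byline">/',
        'author' => '/<span class="byline">By ([^`]*?)<\/span>/',
    ];

    $fields = [];
    foreach ($patterns as $name => $pattern) {
        // preg_match() returns 1 when the pattern matched; $temp[1] is the captured group.
        $found = preg_match($pattern, $item, $temp);
        $fields[$name] = $found ? trim(strip_tags($temp[1])) : '';
    }
    return $fields;
}
```

Returning an empty string for a missing field keeps one oddly formatted item from breaking the rest of the feed.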

<?php

// Get page
$url = "http://www.phpit.net/";
$data = implode("", file($url));

// Get content items (the parentheses capture the inner HTML of each item)
preg_match_all("/<div class=\"contentitem\">([^`]*?)<\/div>/", $data, $matches);

As I said, the next step is to retrieve the individual information, but first let's make a start on our feed, by setting the appropriate header (text/xml) and printing the channel information, etc.

 
// Begin feed
header("Content-Type: text/xml; charset=ISO-8859-1");
echo "<?xml version=\"1.0\" encoding=\"ISO-8859-1\" ?>\n";
?>
<rss version="2.0"
  xmlns:dc="http://purl.org/dc/elements/1.1/"
  xmlns:content="http://purl.org/rss/1.0/modules/content/"
  xmlns:admin="http://webns.net/mvcb/"
  xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
	<channel>
		<title>PHPit Latest Content</title>
		<description>The latest content from PHPit http://www.phpit.net, screen scraped!</description>
		<link>http://www.phpit.net</link>
		<language>en-us</language>



Now it's time to loop through the items, and print their RSS XML. We first loop through each item, and get all the information we need, by using more regular expressions and preg_match(). After that the RSS for the item is printed.

<?php
// Loop through each content item
foreach ($matches[0] as $match) {
	// First, get title
	preg_match("/\">([^`]*?)<\/a><\/h3>/", $match, $temp);
	$title = $temp[1];
	$title = strip_tags($title);
	$title = trim($title);

	// Second, get url
	preg_match("/<a href=\"([^`]*?)\">/", $match, $temp);
	$url = $temp[1];
	$url = trim($url);

	// Third, get text
	preg_match("/<p>([^`]*?)<span class=\"byline\">/", $match, $temp);
	$text = $temp[1];
	$text = trim($text);

	// Fourth, and finally, get author
	preg_match("/<span class=\"byline\">By ([^`]*?)<\/span>/", $match, $temp);
	$author = $temp[1];
	$author = trim($author);

	// Echo RSS XML
	echo "		<item>\n";
	echo "			<title>" . strip_tags($title) . "</title>\n";
	echo "			<link>http://www.phpit.net" . strip_tags($url) . "</link>\n";
	echo "			<description>" . strip_tags($text) . "</description>\n";
	echo "			<content:encoded><![CDATA[ \n";
	echo $text . "\n";
	echo " ]]></content:encoded>\n";
	echo "			<dc:creator>" . strip_tags($author) . "</dc:creator>\n";
	echo "		</item>\n";
}
?>
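One thing the loop does not do is escape characters such as & or < in the extracted text, which can leave the feed malformed. A possible way to handle that, sketched here with two hypothetical helpers of my own (xml_text() and render_item()) and assuming the fields are plain text without HTML entities, is to run each value through htmlspecialchars() before printing it:

```php
<?php
// Escape a plain-text value for use inside an XML element.
function xml_text(string $value): string
{
    return htmlspecialchars($value, ENT_XML1 | ENT_COMPAT, 'ISO-8859-1');
}

// Hypothetical helper: build one <item> element from already-extracted fields.
function render_item(string $title, string $url, string $text, string $author): string
{
    return "\t\t<item>\n"
         . "\t\t\t<title>" . xml_text($title) . "</title>\n"
         . "\t\t\t<link>http://www.phpit.net" . xml_text($url) . "</link>\n"
         . "\t\t\t<description>" . xml_text($text) . "</description>\n"
         . "\t\t\t<dc:creator>" . xml_text($author) . "</dc:creator>\n"
         . "\t\t</item>\n";
}
```

Inside the loop you could then write echo render_item($title, $url, strip_tags($text), $author); instead of the separate echo statements. If the scraped text already contains entities such as &amp;, they would be escaped a second time, so this is a starting point rather than a complete solution.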

And finally, the RSS file is closed off.

</channel>
</rss>

That's all. If you put all the code together, like in the demo script, then you'll have a perfect RSS feed.

Conclusion

In this tutorial I have shown you how to create an RSS feed from a website that does not have an RSS feed of its own yet. Though the regular expressions are different for each website, the principle is exactly the same.

One thing I should mention is that you shouldn't immediately screen scrape a website's content. E-mail them first about an RSS feed. Who knows, they might set one up themselves, and that would be even better.

Download sample script at http://www.phpit.net/viewsource.php?url=/demo/screenscrape%20rss/example.php





About The Author

Dennis Pallett is a young tech writer, with much experience in ASP, PHP and other web technologies. He enjoys writing, and has written several articles and tutorials. To find more of his work, look at his websites at http://www.phpit.net, http://www.aspit.net and http://www.ezfaqs.com


