How to Write a Web Crawler in C#

A few months ago I drastically changed how the urls on my site were built. I moved to using the ASP.NET 2.0 virtual path provider to make more friendly urls. See the discussions in April 2007 if you’re interested. There were several posts that month about it. One problem with a change like this is that it can wreak havoc on your urls, especially your relative ones. Using the url rewriting features built into ASP.NET 2.0 I redirected all the old urls to the new ones, but that didn’t fix the relative urls in the blog posts, because there were now more subdirectories that needed to be navigated. I have finally gotten around to building something to check to make sure all my urls are good: a web crawler.

Just in case you don’t know what a web crawler is: it is a program that fetches a page, extracts all the links and various other pieces of data from that page, then fetches every page those links reference, extracts their data, and so on. This is how search engines, for example, gather their data. They write crawlers.

And that is exactly what I needed: something to crawl my site to make sure all my links were good. So I decided to write one, and I’m sharing it with you here. You can download it at the end. Between here and there is a discussion of some of the more interesting features and pieces of code in the crawler.

Disclaimer

First, I’m not sharing this because I think it is the best crawler ever. My quality bar for this one was "will it meet the needs for which I developed it?". The answer to that is "yes". It may not meet yours. If not, change it, use the code as a starting point for your own, or run away cursing my insufficient code, ruing the day that I was brought into this cold, hard world. Second, I have only tested this on a few of my own personal sites. It seems to work fine on all of them. If it doesn’t work completely on yours, see the first point. Third, this was not optimized for speed. If you want to crawl the entire web with this thing, you’ll probably find that it is not fast enough. Sorry, but see the first point. Fourth, I did not build robots.txt support into the crawler, because I only wanted it for my own sites. If you’re going to use this on other people’s sites, please add that support. It is the nice thing to do. Don’t be evil.

Overview

Here are some notes on the basics of the crawler.

  1. It is a console app - It doesn’t need a rich interface, so I figured a console application would do. The output is done as an html file and the input (what site to view) is done through the app.config. Making a windows app out of this seemed like overkill.
  2. The crawler is designed to only crawl the site it originally targets. It would be easy to change that if you want to crawl more than just a single site, but that is the goal of this little application.
  3. Originally the crawler was just written to find bad links. Just for fun I also had it collect information on page and viewstate sizes. It will also list all non-html files and external urls, just in case you care to see them.
  4. The results are shown in a rather minimalistic html report. This report is automatically opened in Internet Explorer when the crawl is finished.
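
The loop at the heart of a crawler like this can be sketched as follows. This is not the downloadable code itself; Crawl and its getLinks callback are hypothetical names, but the idea (a queue of pending urls, a set of already-visited urls, and a check that only links on the original host get followed) matches the single-site behavior described above.

```csharp
using System;
using System.Collections.Generic;

class CrawlSketch
{
    // Breadth-first crawl restricted to the starting site.
    // getLinks stands in for "fetch the page and extract its links".
    public static List<string> Crawl(Uri start, Func<Uri, IEnumerable<Uri>> getLinks)
    {
        Dictionary<string, bool> visited = new Dictionary<string, bool>();
        Queue<Uri> pending = new Queue<Uri>();
        pending.Enqueue(start);

        while (pending.Count > 0)
        {
            Uri current = pending.Dequeue();
            if (visited.ContainsKey(current.AbsoluteUri))
                continue;               // each url is fetched only once
            visited[current.AbsoluteUri] = true;

            foreach (Uri link in getLinks(current))
            {
                // Only follow links that stay on the original site.
                if (link.Host == start.Host && !visited.ContainsKey(link.AbsoluteUri))
                    pending.Enqueue(link);
            }
        }
        return new List<string>(visited.Keys);
    }
}
```

External urls still get seen (so they can be reported), but they are never enqueued, which is what keeps the crawl on a single site.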

Getting the Text from an Html Page

The first crucial piece of building a crawler is the mechanism for going out and fetching the html off of the web (or off your local machine, if you have the site running locally). Like so much else, .NET has classes built into the framework for doing this very thing.


private static string GetWebText(string url)
{
    HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);
    request.UserAgent = "A .NET Web Crawler";

    // Dispose the response, stream, and reader so the connection is
    // released even if reading fails.
    using (WebResponse response = request.GetResponse())
    using (Stream stream = response.GetResponseStream())
    using (StreamReader reader = new StreamReader(stream))
    {
        return reader.ReadToEnd();
    }
}

The HttpWebRequest class can be used to request any page from the internet. The response (retrieved through a call to GetResponse()) holds the data you want. Get the response stream, throw it in a StreamReader, and read the text to get your html.
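
Once you have the html text, the next step is pulling the urls out of it. The sample code has its own implementation; as a rough sketch (ExtractLinks is a hypothetical helper, and a regex like this handles common markup but not every html edge case), it could look something like this:

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class LinkExtractor
{
    // Matches href="..." or href='...' attributes in the html.
    static readonly Regex HrefPattern = new Regex(
        @"href\s*=\s*[""']([^""']+)[""']",
        RegexOptions.IgnoreCase);

    public static List<string> ExtractLinks(string html)
    {
        List<string> links = new List<string>();
        foreach (Match match in HrefPattern.Matches(html))
            links.Add(match.Groups[1].Value);  // the captured url
        return links;
    }
}
```

Each extracted url would then be resolved against the current page's address (relative urls are exactly where my broken links came from) before being queued up for the crawler to visit.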

You can download the sample code here.

You might also find arachnode.net interesting, a web crawler written in C#.