One thing that has always fascinated me is how search engines work, and in particular how they collect information about all the webpages they search through. To that end I created my own search engine Web Bot, which you can use and modify.
To try it out download the code from:
Parses Robots.txt and Sitemaps to correctly determine what to crawl through
Uses multithreaded searching (via the ThreadPool) and Async Web Requests for lower CPU load.
How it works:
When a new domain is encountered, e.g. http://www.Microsoft.com, it is checked for a Robots.txt file. If one exists, it is parsed along with any Sitemap XML files referenced within it. Any pages referenced are added to the Web Crawler task queue for that particular domain.
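The robots.txt and sitemap step can be sketched roughly as follows. This is a minimal illustration in Python using only the standard library, not the Web Bot's own code. The function names (`parse_robots`, `sitemap_urls`, `seed_tasks`) are hypothetical, and it assumes the robots.txt text and sitemap XML have already been downloaded.

```python
import xml.etree.ElementTree as ET
from urllib.robotparser import RobotFileParser

# Namespace used by standard sitemap files (sitemaps.org protocol).
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def parse_robots(robots_text, robots_url):
    """Parse robots.txt text; return the parser and any Sitemap URLs it lists."""
    parser = RobotFileParser(robots_url)
    parser.parse(robots_text.splitlines())
    return parser, parser.site_maps() or []  # site_maps() needs Python 3.8+


def sitemap_urls(sitemap_xml):
    """Return every <loc> entry found in a sitemap document."""
    root = ET.fromstring(sitemap_xml)
    return [loc.text.strip() for loc in root.iter(SITEMAP_NS + "loc") if loc.text]


def seed_tasks(robots_text, robots_url, sitemaps, user_agent="*"):
    """Build the initial task list for one domain.

    `sitemaps` is a hypothetical dict mapping sitemap URL -> downloaded XML text.
    Pages disallowed by robots.txt are skipped.
    """
    parser, sitemap_links = parse_robots(robots_text, robots_url)
    tasks = []
    for link in sitemap_links:
        for page in sitemap_urls(sitemaps.get(link, "<urlset/>")):
            if parser.can_fetch(user_agent, page):
                tasks.append(page)
    return tasks
```

A real crawler would fetch the files over HTTP and also handle sitemap index files, which this sketch leaves out.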
To avoid choking on any particular domain, much like the Web Crawler example in the Windows Mobile 6 SDK, this one orders tasks in a round-robin style between domains, so every domain has a chance to add tasks to its queue and have them processed.
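One way to picture the round-robin ordering is a queue per domain, with the scheduler taking one task from each domain in turn. The Python sketch below is an assumption about how such a scheduler could look, not the Web Bot's actual implementation. The class name `RoundRobinScheduler` and its methods are hypothetical.

```python
from collections import OrderedDict, deque
from urllib.parse import urlsplit


class RoundRobinScheduler:
    """Hands out one task per domain in turn so no single domain dominates."""

    def __init__(self):
        self._queues = OrderedDict()  # domain -> deque of pending URLs

    def add(self, url):
        """Queue a URL under its own domain."""
        domain = urlsplit(url).netloc.lower()
        self._queues.setdefault(domain, deque()).append(url)

    def next_task(self):
        """Return the next URL, rotating between domains; None when empty."""
        if not self._queues:
            return None
        domain, queue = next(iter(self._queues.items()))
        url = queue.popleft()
        if queue:
            # Domain still has work: send it to the back of the rotation.
            self._queues.move_to_end(domain)
        else:
            # Nothing left for this domain: drop its queue.
            del self._queues[domain]
        return url
```

With three tasks queued for one domain and one each for two others, the first three calls to `next_task()` return one task from each domain before the busy domain gets a second turn.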
Although processing is queued in the ThreadPool, tasks are, as mentioned, first queued per domain, which allows the current state to be serialised to XML so that work can be continued later.
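To illustrate the idea of saving the per-domain queues as XML and picking them up again later, here is a small Python sketch. The element names (`crawler`, `domain`, `task`) and the functions `save_state` and `load_state` are made up for this example. The Web Bot's real XML layout may well be different.

```python
import xml.etree.ElementTree as ET


def save_state(queues):
    """Serialise {domain: [pending URLs]} to an XML string."""
    root = ET.Element("crawler")
    for domain, urls in queues.items():
        node = ET.SubElement(root, "domain", name=domain)
        for url in urls:
            # ElementTree escapes characters such as & in the URL text.
            ET.SubElement(node, "task").text = url
    return ET.tostring(root, encoding="unicode")


def load_state(xml_text):
    """Rebuild the {domain: [pending URLs]} mapping from saved XML."""
    root = ET.fromstring(xml_text)
    return {
        domain.get("name"): [task.text for task in domain.findall("task")]
        for domain in root.findall("domain")
    }
```

Because the pending work lives in plain per-domain lists rather than inside the thread pool, a round trip through `save_state` and `load_state` is enough to resume where the crawl stopped.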
Currently very little is extracted from a webpage: only links, found via regular expressions, for further processing. This gives the user free rein over how the search engine should catalogue pages.
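Link extraction with a regular expression might look something like the Python sketch below. The pattern and the `extract_links` function are illustrative assumptions rather than the expression the Web Bot actually uses, and a regex like this will miss unquoted or unusual markup that a proper HTML parser would handle.

```python
import re
from urllib.parse import urljoin

# Matches the quoted href value of an <a> tag, stopping before any #fragment.
HREF_RE = re.compile(r'<a\s[^>]*?href\s*=\s*["\']([^"\'#]+)', re.IGNORECASE)


def extract_links(html, base_url):
    """Return absolute http(s) links found in the page, without duplicates."""
    links = []
    for href in HREF_RE.findall(html):
        absolute = urljoin(base_url, href.strip())
        # Skip mailto:, javascript: and other non-web schemes.
        if absolute.startswith(("http://", "https://")) and absolute not in links:
            links.append(absolute)
    return links
```

Each link returned could then be handed to the task queue for its domain, which is where any further cataloguing logic would hook in.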