
Lucene.net Web Spider

Updated Version 2.0 - 11 March 2007
Compiled against .NET 2.0 with Visual Studio 2005. Improved filtering and indexing of HTML files.

What is it?

I’ve been looking at Lucene.Net recently, which is a great search API for .NET. The only problem is that the API only works against a Lucene index that already exists; it does not build one from a website for you. So I’ve created this project to crawl a website and create a Lucene index that I can then search.

This project is based on the C# spider created by Jeff Heaton and later modified by Dan Bartels. I’ve added some new functions and reworked the existing functionality into something that I can use in my own project.

Lucene Spider GUI Screenshot


Features

  • Create a new index based on a website – Spider a complete website and create a Lucene index for it.
  • Update an existing index for a website – Deletes web pages that no longer exist and updates pages that have changed since the last crawl.
  • Remove duplicate pages – Web pages with the same content are only indexed once. If you are indexing a CMS website such as DotNetNuke, where multiple URLs can lead to the same content, you won’t end up with duplicate content in the index.
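
One common way to detect pages with the same content is to compare a hash of each page's text. The sketch below is in Python rather than the project's C#, and it is only an illustration of the idea, not the code shipped in the download; the names `content_fingerprint` and `drop_duplicates` are made up for this example.

```python
import hashlib


def content_fingerprint(text: str) -> str:
    """Hash the page text after collapsing whitespace, so that
    trivial spacing differences do not produce a different hash."""
    normalized = " ".join(text.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def drop_duplicates(pages):
    """Keep the first URL seen for each distinct page body.

    `pages` is an iterable of (url, text) pairs (hypothetical input shape).
    Returns a list of the (url, text) pairs that survive.
    """
    seen = set()
    unique = []
    for url, text in pages:
        digest = content_fingerprint(text)
        if digest in seen:
            continue  # same content already kept under another URL
        seen.add(digest)
        unique.append((url, text))
    return unique
```

Comparing hashes only catches pages whose extracted text is identical after whitespace is collapsed; pages that differ by a timestamp or a per-request token would still count as distinct.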



In Use
I’ve created two projects in Visual Studio 2005: one is the spider/indexer, and the other is a simple GUI that you can use to test drive it. You could also write your own GUI or driver program and use the indexer from your own project.

The indexer is multi-threaded, and you can set the number of threads it uses. In my own tests 3-4 threads seemed to work best.
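
To show what a configurable thread count looks like in a crawler, here is a minimal Python sketch of a fixed pool of worker threads sharing one queue of URLs. It is not the project's C# code. `crawl` and `fetch_links` are hypothetical names, and `fetch_links(url)` stands in for whatever downloads a page and returns the URLs it links to.

```python
import queue
import threading


def crawl(start_url, fetch_links, num_threads=3):
    """Visit every URL reachable from start_url with a fixed number of threads.

    fetch_links(url) is supplied by the caller (an assumption of this sketch)
    and returns an iterable of URLs found on that page.
    Returns the set of URLs that were queued for visiting.
    """
    pending = queue.Queue()
    seen = {start_url}
    seen_lock = threading.Lock()
    pending.put(start_url)

    def worker():
        while True:
            url = pending.get()
            if url is None:  # sentinel: no more work
                pending.task_done()
                break
            try:
                for link in fetch_links(url):
                    with seen_lock:
                        if link in seen:
                            continue
                        seen.add(link)
                    pending.put(link)
            except Exception:
                pass  # a page that fails to download should not stop the crawl
            finally:
                pending.task_done()

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    pending.join()  # returns once every queued URL has been processed
    for _ in threads:
        pending.put(None)  # one sentinel per worker
    for t in threads:
        t.join()
    return seen
```

For example, `crawl("http://example.com/", my_fetch_links, num_threads=4)` would run four workers. More threads are not automatically faster, since the target server and the network are usually the limit.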

I’ve also written a DotNetNuke specific version that will not index the standard controls more than once. By standard controls I mean the login, terms & conditions and the privacy policy.

As with all my projects, I’ve included all the source code, with lots of comments to help you out should you want to expand on it.


Areas for Improvement
At the moment the indexer will only index HTML files. However, as you’ll see in the code, I’ve added the basic plumbing you need to implement other content filtering classes, so it should be fairly easy to parse and index PDF and Word documents.
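
As a rough picture of how pluggable content filters can be arranged, the Python sketch below maps a MIME type to an object that turns raw bytes into indexable text. The class and function names (`HtmlFilter`, `FILTERS`, `extract_text`) are invented for this example and are not the names used in the project's source; the HTML handling is a deliberately crude regex, not a real parser.

```python
import re
from html import unescape


class HtmlFilter:
    """Extracts visible text from HTML by stripping tags with regexes."""

    def extract_text(self, raw: bytes) -> str:
        html = raw.decode("utf-8", errors="replace")
        # Drop script and style blocks entirely, then any remaining tags.
        html = re.sub(r"(?is)<(script|style)\b.*?</\1>", " ", html)
        text = re.sub(r"(?s)<[^>]+>", " ", html)
        return " ".join(unescape(text).split())


# Registry of filters keyed by MIME type. A PDF or Word filter would be
# another class with the same extract_text method, registered here.
FILTERS = {"text/html": HtmlFilter()}


def extract_text(content_type: str, raw: bytes):
    """Return indexable text, or None when no filter handles the type."""
    mime = content_type.split(";")[0].strip().lower()
    handler = FILTERS.get(mime)
    return handler.extract_text(raw) if handler else None
```

With this shape, supporting a new file type means adding one class and one registry entry, and the crawler itself does not change.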

If you are indexing a DotNetNuke site, you'll see that the same content gets added multiple times, since it can often be reached through different URLs. The duplicate documents are then removed from the index at the end, which is fairly quick. It would be better, though, to stop these duplicates from being indexed in the first place. Some type of regex filter could make the spider smarter about the URLs that DotNetNuke uses, so that it can identify unique content before fetching it.
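
A possible starting point for such a filter is to reduce each URL to a key before crawling it and skip any URL whose key has already been seen. The Python sketch below assumes that a DotNetNuke page is identified by its `tabid`, which may appear either in the query string (`/Default.aspx?tabid=36`) or in the path (`/Home/tabid/36/Default.aspx`). That assumption, and the name `canonical_key`, are mine for the purpose of illustration and would need checking against the site being indexed.

```python
import re
from urllib.parse import urlsplit, parse_qs

# Matches ".../tabid/36/..." anywhere in the path (assumed friendly-URL form).
TABID_IN_PATH = re.compile(r"/tabid/(\d+)(?:/|$)", re.IGNORECASE)


def canonical_key(url: str) -> str:
    """Reduce a URL to a key that is equal for URLs assumed to show the same page."""
    parts = urlsplit(url)
    host = parts.netloc.lower()

    match = TABID_IN_PATH.search(parts.path)
    if match:
        return f"{host}|tabid={match.group(1)}"

    query = {k.lower(): v for k, v in parse_qs(parts.query).items()}
    if "tabid" in query:
        return f"{host}|tabid={query['tabid'][0]}"

    # Not recognisable as a tab URL: fall back to path plus query string.
    return f"{host}|{parts.path.lower()}?{parts.query}"
```

This sketch ignores every parameter other than `tabid`, so URLs that select a different control or module on the same tab would collapse into one key. A real filter would have to keep whichever parameters actually change the content.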

Downloads
Lucene Spider Version 2.0 - Source & Documentation

Copyright © 2005 - 2008 BiteTheBullet.co.uk