Tech Time Pass: Installing Nutch

This article is courtesy of Abhilash LL. Thanks, Abhilash. :)

Setting up Nutch and performing a simple crawl

Download Tomcat, Nutch and Ant ( preferably the release versions ):

   Nutch    Ant    Tomcat

Extract all 3 tarballs, i.e. Apache Tomcat, Ant and Nutch, into /home/<user>/ . Then go into the Nutch folder ( <NUTCH_HOME> ).
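The extraction step can be sketched as a small helper. This is only an illustration: the function name extract_all and the archive names in the usage comment are assumptions, since the exact tarball names depend on the versions you downloaded.

```shell
# Extract every given .tar.gz archive into the destination directory.
# $1: destination directory, remaining arguments: archive paths
extract_all() {
    dest=$1
    shift
    for archive in "$@"; do
        tar -xzf "$archive" -C "$dest" || return 1
    done
}

# Usage (hypothetical archive names, adjust to your downloads):
#   extract_all "$HOME" apache-tomcat.tar.gz apache-ant.tar.gz nutch.tar.gz
```

The helper stops at the first archive that fails to extract, so a corrupt download is noticed before you move on.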

Go to <NUTCH_HOME>/conf and rename the existing nutch-site.xml to nutch-site.xml.original. Now copy nutch-default.xml and paste it as nutch-site.xml. Both xml files are now the same; from here on we work only on nutch-site.xml.

In nutch-site.xml, search for http.agent.name and put <crawler Name> in the <value></value> tag of that property.

eg: <value>MyCrawler</value>

Then search for searcher.dir and put the path of the crawl folder, /home/<user>/Nutch/crawl, in its <value> tag. Save and close the file.

eg: <value>/home/username/Nutch_foldername/crawl</value>

p.s. The folder named crawl should not exist before the crawl.
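To make the shape of the two edited entries concrete, here is a minimal sketch that prints them. The helper print_property is hypothetical and only writes to the console; it does not touch nutch-site.xml. The entries in the real file also carry a description element, which is omitted here.

```shell
# Print one <property> element in the layout used by nutch-site.xml.
# $1: property name, $2: value
print_property() {
    printf '<property>\n  <name>%s</name>\n  <value>%s</value>\n</property>\n' "$1" "$2"
}

# The two properties changed in this walkthrough (example values)
print_property http.agent.name MyCrawler
print_property searcher.dir /home/username/Nutch_foldername/crawl
```

Compare the printed elements with what you see in your editor to confirm that the value sits inside the right property.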

Create a folder named urls in the <NUTCH_HOME> folder.

Inside urls, create a file named seedurls and put the list of urls to be crawled in it. If multiple seeds are to be provided, give each url on its own line.

eg: http://www.iiitb.ac.in
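The same step from the console, as a short sketch. It assumes your current directory is <NUTCH_HOME> unless the NUTCH_HOME variable is already set.

```shell
# Assumes the current directory is <NUTCH_HOME> unless NUTCH_HOME is set
NUTCH_HOME="${NUTCH_HOME:-$PWD}"
mkdir -p "$NUTCH_HOME/urls"

# One seed URL per line; pass more arguments to printf for more seeds
printf '%s\n' 'http://www.iiitb.ac.in' > "$NUTCH_HOME/urls/seedurls"

# Show what was written
cat "$NUTCH_HOME/urls/seedurls"
```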

Since it's preferable to restrict the crawl to a particular domain, modify crawl-urlfilter.txt in <NUTCH_HOME>/conf: find the regular expression below the comment "# accept hosts in MY.DOMAIN.NAME" and replace MY.DOMAIN.NAME in it with the domain you want to crawl.

eg: iiitb.ac.in
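The substitution can be tried out on a copy of the rule before editing the file. In this sketch the template line is an assumption about how the rule looks in the stock crawl-urlfilter.txt, so check it against your own copy. The match test uses grep -E, which only approximates the Java regular expressions Nutch itself uses.

```shell
# Rule line as assumed to appear in the stock conf/crawl-urlfilter.txt
template='+^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/'

# Substitute the domain to crawl for the placeholder
rule=$(printf '%s\n' "$template" | sed 's/MY\.DOMAIN\.NAME/iiitb.ac.in/')
printf '%s\n' "$rule"

# Drop the leading "+" (the accept marker) and try the regex on sample URLs
pattern=${rule#?}
for url in http://www.iiitb.ac.in/ http://www.example.com/; do
    if printf '%s\n' "$url" | grep -Eq "$pattern"; then
        echo "accepted: $url"
    else
        echo "not matched: $url"
    fi
done
```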

We need to set JAVA_HOME and NUTCH_JAVA_HOME. In the console, type ( assuming Java is installed )

   echo $JAVA_HOME

If the output is empty, you need to set it. First, in the console, type

   java -version

to find out which Java version is installed. Now we need to find its install path and assign it to the variables mentioned above. In the console,

   which java

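This prints the path of the java executable, which is often a symlink such as /usr/bin/java. JAVA_HOME must be the Java install directory itself, i.e. the directory that contains bin/java. Suppose that directory is /usr/lib/j2sdk1.5-sun. In the console,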

   export JAVA_HOME=/usr/lib/j2sdk1.5-sun

   export NUTCH_JAVA_HOME=/usr/lib/j2sdk1.5-sun

If echo $JAVA_HOME was not empty, you only need to set NUTCH_JAVA_HOME to the same value.

You can verify by echoing both variables.
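The lookup above can be sketched as one snippet. It is a sketch under assumptions: the helper java_home_from_bin is hypothetical, readlink -f is the GNU coreutils form, and the result is only right when java lives in <install dir>/bin/java. If JAVA_HOME is already set, it is left untouched.

```shell
# Print the Java install directory for a given path to the java executable,
# e.g. /usr/lib/j2sdk1.5-sun/bin/java -> /usr/lib/j2sdk1.5-sun
java_home_from_bin() {
    dirname "$(dirname "$1")"
}

# Only derive JAVA_HOME when it is empty and java is on the PATH
if [ -z "${JAVA_HOME:-}" ] && java_bin=$(command -v java); then
    # readlink -f follows symlinks such as /usr/bin/java (GNU coreutils assumed)
    JAVA_HOME=$(java_home_from_bin "$(readlink -f "$java_bin")")
    export JAVA_HOME
fi

# Give NUTCH_JAVA_HOME the same value as JAVA_HOME
export NUTCH_JAVA_HOME="${JAVA_HOME:-}"
echo "JAVA_HOME=${JAVA_HOME:-} NUTCH_JAVA_HOME=$NUTCH_JAVA_HOME"
```

The exports last only for the current console session; run the snippet again (or add the two export lines to your shell profile) when you open a new one.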

Now we are ready to crawl

cd into <NUTCH_HOME> and type bin/nutch in the console to see the list of commands Nutch can take. We use crawl here.

To start a crawl, make sure your internet connection is active and run this command.

   bin/nutch crawl urls/seedurls -dir /home/<user>/Nutch/crawl -depth 3 -topN 20

p.s : your pwd should be <NUTCH_HOME>
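Two of the earlier requirements are easy to forget: the seed file must exist, and the crawl folder must not. A minimal pre-flight sketch that checks both before you start is given here; the function name crawl_preflight is hypothetical and not part of Nutch.

```shell
# Return 0 when the crawl can start, 1 otherwise.
# $1: Nutch directory (<NUTCH_HOME>), $2: crawl output directory
crawl_preflight() {
    nutch_home=$1
    crawl_dir=$2
    if [ ! -s "$nutch_home/urls/seedurls" ]; then
        echo "missing or empty seed file: $nutch_home/urls/seedurls" >&2
        return 1
    fi
    if [ -e "$crawl_dir" ]; then
        echo "crawl directory already exists: $crawl_dir" >&2
        return 1
    fi
    return 0
}

# Usage, run from <NUTCH_HOME> with your own crawl path:
#   crawl_preflight "$PWD" "$HOME/Nutch/crawl" && echo "ready to crawl"
```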

Wait for a while to let nutch complete the crawl

To verify the crawled content, run

   bin/nutch org.apache.nutch.searcher.NutchBean iiitb

You can see the hits for iiitb in your crawl.

To search the crawled information through a browser, we need to build Nutch using Ant.

Edit build.xml in <NUTCH_HOME> and comment out the whole <touch> tag.

The <touch> tag is at line 61; building without this change fails with an error at that line.

Now build nutch by

   /home/<user>/apache-ant-1.7.0/bin/ant

The above command builds Nutch using build.xml in <NUTCH_HOME>. We call ant by its full path inside the extracted Ant directory. ps: pwd is <NUTCH_HOME>

Now we build the war file to be put into tomcat

   /home/<user>/apache-ant-1.7.0/bin/ant war

ps: pwd is <NUTCH_HOME>

Copy the war file ( nutch-0.9.war ) from <NUTCH_HOME>/build into the webapps folder of Tomcat.

Restart Tomcat by running ./shutdown.sh and then ./startup.sh in the <TOMCAT_HOME>/bin folder.

Open a browser and point it to "http://localhost:8080/nutch-0.9"

This brings up the search page of Nutch. You can now go ahead and search for information in the crawled data.
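The copy and restart steps can be sketched as two small helpers. The function names deploy_war and restart_tomcat are hypothetical, and the paths in the usage comment stand for your own <NUTCH_HOME> and <TOMCAT_HOME>.

```shell
# Copy a war file into Tomcat's webapps folder.
# $1: path to the war file, $2: Tomcat directory (<TOMCAT_HOME>)
deploy_war() {
    war=$1
    tomcat_home=$2
    [ -f "$war" ] || { echo "war not found: $war" >&2; return 1; }
    [ -d "$tomcat_home/webapps" ] || { echo "no webapps folder in $tomcat_home" >&2; return 1; }
    cp "$war" "$tomcat_home/webapps/"
}

# Stop and start Tomcat with the scripts in its bin folder.
# $1: Tomcat directory (<TOMCAT_HOME>)
restart_tomcat() {
    "$1/bin/shutdown.sh"
    "$1/bin/startup.sh"
}

# Usage (set NUTCH_HOME and TOMCAT_HOME to your own directories first):
#   deploy_war "$NUTCH_HOME/build/nutch-0.9.war" "$TOMCAT_HOME" && restart_tomcat "$TOMCAT_HOME"
```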

Have fun!

Tuesday, April 28, 2009
