Tuesday, April 28, 2009

Installing Nutch

This article is courtesy of Abhilash LL. Thanks, Abhilash. :)
Setting up Nutch and performing a simple crawl
  • Download the distributions of tomcat, nutch and ant (preferably the release versions)
   Nutch    Ant    Tomcat 
  • Extract all 3 tarballs, i.e. apache tomcat, ant and nutch, into /home/<user>/. Go into the nutch folder (<NUTCH_HOME>).
  • Go to <NUTCH_HOME>/conf and rename the existing nutch-site.xml to nutch-site.xml.original. Now copy nutch-default.xml and paste it as nutch-site.xml. Both xml files are now the same. We shall be working on nutch-site.xml.
  • In nutch-site.xml, search for http.agent.name and put <crawler Name> in the <value></value> tag within that <property> element.

eg : <value>'MyCrawler'</value>

  • Then search for searcher.dir and put /home/<user>/Nutch/crawl in its <value> tag. Save and close the file.

eg: <value> /home/username/Nutch_foldername/crawl </value>

p.s. the folder named crawl should not exist yet; the crawl creates it.
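As a rough sketch of how the two edited properties fit together, the snippet below writes a trimmed-down nutch-site.xml containing only those two entries. The crawler name and crawl path are placeholder values, the values are written without surrounding quotes or spaces (a choice made here), and the file goes to a scratch folder so that a real conf folder is left alone. Keeping only two properties instead of the full copy of nutch-default.xml is a simplification made for this example.

```shell
#!/bin/sh
# Hypothetical values: replace with your own crawler name and crawl path.
AGENT_NAME="MyCrawler"
CRAWL_DIR="/home/username/Nutch/crawl"

# Write to a scratch folder so nothing in a real conf folder is overwritten.
scratch=$(mktemp -d)
site_xml="$scratch/nutch-site.xml"

# Each setting is a <property> with a <name> and a <value>.
cat > "$site_xml" <<EOF
<?xml version="1.0"?>
<configuration>
  <property>
    <name>http.agent.name</name>
    <value>$AGENT_NAME</value>
  </property>
  <property>
    <name>searcher.dir</name>
    <value>$CRAWL_DIR</value>
  </property>
</configuration>
EOF

# Show the result for inspection.
cat "$site_xml"
```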

  • Create a folder named urls in the <NUTCH_HOME> folder.
  • Inside it, create a file named seedurls and put in it the list of urls to be crawled. If multiple seeds are to be provided, give each url on its own line.

eg: http://www.iiitb.ac.in
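The two steps above could be done from the console roughly like this. The sketch falls back to a scratch folder when NUTCH_HOME is not set in the shell (that fallback is an assumption made here so it can be tried safely); note that it overwrites an existing seedurls file.

```shell
#!/bin/sh
# Use the real nutch folder if NUTCH_HOME is set, otherwise a scratch folder.
NUTCH_HOME=${NUTCH_HOME:-$(mktemp -d)}

# Create the urls folder and a seed file with one url per line.
mkdir -p "$NUTCH_HOME/urls"
printf '%s\n' 'http://www.iiitb.ac.in' > "$NUTCH_HOME/urls/seedurls"

# Show what will be crawled.
cat "$NUTCH_HOME/urls/seedurls"
```

More seeds can be added by passing further arguments to printf, one url each.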

  • Since it's preferable to restrict ourselves to a particular domain, modify crawl-urlfilter.txt in <NUTCH_HOME>/conf: find the comment "# accept hosts in MY.DOMAIN.NAME" and, in the regular expression on the line below it, replace MY.DOMAIN.NAME with the domain you want to crawl.

eg: iiitb.ac.in
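To make the filter change concrete, here is a small sketch that performs the replacement with sed on a scratch copy. The two sample lines are what the stock crawl-urlfilter.txt is assumed to contain; check them against your own file before applying the same sed command to it.

```shell
#!/bin/sh
# Work on a scratch copy so the real conf file is not touched.
scratch=$(mktemp -d)
filter="$scratch/crawl-urlfilter.txt"

# Assumed stock lines from crawl-urlfilter.txt.
cat > "$filter" <<'EOF'
# accept hosts in MY.DOMAIN.NAME
+^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/
EOF

DOMAIN="iiitb.ac.in"

# Replace the placeholder only on the rule line (the one starting with +),
# leaving the comment line as it is.
sed "/^+/s/MY\.DOMAIN\.NAME/$DOMAIN/" "$filter" > "$filter.new"
mv "$filter.new" "$filter"

cat "$filter"
```

Since the rule is a regular expression, the dots in the domain match any character; writing iiitb\.ac\.in would be stricter.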

  • We need to set JAVA_HOME and NUTCH_JAVA_HOME. In the console type ( Assuming java is installed )
   echo $JAVA_HOME 

If the output is empty, you need to set it. In the console, type

   java -version 

This tells you which java version is installed. Now we need to find the installation path and assign it to the variables mentioned above. In the console,

   which java  

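Note that which java prints the path of the java executable (often a symlink), not the JDK folder itself; JAVA_HOME is the folder that contains bin/java. Suppose the JDK lives in /usr/lib/j2sdk1.5-sun. In the console,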

   export JAVA_HOME=/usr/lib/j2sdk1.5-sun
   export NUTCH_JAVA_HOME=/usr/lib/j2sdk1.5-sun 

If echo $JAVA_HOME was not empty, you only need to set NUTCH_JAVA_HOME to the same value.

You can verify by echoing both variables.
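The lookup can also be scripted. The sketch below assumes a Linux system where readlink -f is available and where the java found on the PATH sits in a bin folder directly under the installation folder; jdk_home_from_bin is a helper name made up for this example. An already set JAVA_HOME is left untouched.

```shell
#!/bin/sh
# Hypothetical helper: strip the trailing /bin/java from an executable path.
jdk_home_from_bin() {
  dirname "$(dirname "$1")"
}

if java_bin=$(command -v java); then
  # Follow symlinks such as /usr/bin/java to the real executable.
  resolved=$(readlink -f "$java_bin")
  # Keep an existing JAVA_HOME; otherwise derive it from the executable.
  : "${JAVA_HOME:=$(jdk_home_from_bin "$resolved")}"
  export JAVA_HOME
  export NUTCH_JAVA_HOME="$JAVA_HOME"
  echo "JAVA_HOME=$JAVA_HOME"
  echo "NUTCH_JAVA_HOME=$NUTCH_JAVA_HOME"
else
  echo "java not found on PATH; install it first" >&2
fi
```

The exports only last for the current console session; to keep them, the two export lines can be added to your shell startup file.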

Now we are ready to crawl

  • cd into <NUTCH_HOME>. In the console, type bin/nutch to see the list of commands nutch can take. We use crawl here.
  • To start a crawl, make sure your internet connection is active and run this command.
   bin/nutch crawl urls/seedurls -dir /home/<user>/Nutch/crawl -depth 3 -topN 20 

p.s.: your pwd should be <NUTCH_HOME>.

Wait a while for nutch to complete the crawl.
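For a feel of what the two numeric options mean: as I understand them, -depth is the number of link-following rounds and -topN caps the pages fetched in each round, so their product is an upper bound on the pages fetched. The sketch below prints that bound and does a couple of sanity checks before starting the crawl. The crawl path is a placeholder and max_pages is a helper name made up for this example.

```shell
#!/bin/sh
# Hypothetical helper: upper bound on fetched pages = rounds * pages per round.
max_pages() {
  echo $(( $1 * $2 ))
}

DEPTH=3
TOPN=20
SEEDS="urls/seedurls"
CRAWL_DIR="/home/username/Nutch/crawl"   # placeholder path, adjust to yours

echo "This crawl fetches at most $(max_pages "$DEPTH" "$TOPN") pages"

if [ -e "$CRAWL_DIR" ]; then
  # The crawl folder must not exist before the crawl starts.
  echo "$CRAWL_DIR already exists; remove it or pick another folder" >&2
elif [ ! -x bin/nutch ]; then
  echo "bin/nutch not found; cd into <NUTCH_HOME> first" >&2
else
  bin/nutch crawl "$SEEDS" -dir "$CRAWL_DIR" -depth "$DEPTH" -topN "$TOPN"
fi
```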

  • To verify the crawled content
   bin/nutch org.apache.nutch.searcher.NutchBean iiitb 

You can see the hits for iiitb in your crawl.

To search the crawled information through a browser, we need to build nutch using ant.

  • Edit build.xml in <NUTCH_HOME> and comment out the whole <touch> tag.

The <touch> tag is at line 61; if we try to build without this change, we get an error at that same line.

  • Now build nutch by
   /home/<user>/apache-ant-1.7.0/bin/ant 

The above command builds nutch using build.xml in <NUTCH_HOME>. We give the full path to the ant command in the extracted ant directory. ps: pwd is <NUTCH_HOME>

  • Now we build the war file to be put into tomcat
   /home/<user>/apache-ant-1.7.0/bin/ant war 

ps: pwd is <NUTCH_HOME>

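  • Copy the war file from <NUTCH_HOME>/build (nutch-0.9.war) into the webapps folder of tomcat.
  • Restart tomcat by running ./shutdown.sh and then ./startup.sh in the <TOMCAT_HOME>/bin folder.

Once tomcat is back up, the deployed application serves the search page of Nutch (with tomcat's default port and the war name above, this would typically be http://localhost:8080/nutch-0.9/). You can now go ahead and search for information in the crawled data.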
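The copy and restart steps could be wrapped in a small function, sketched below. deploy_war is a name made up for this example, and the restart is only attempted when both tomcat scripts are found and executable.

```shell
#!/bin/sh
# Hypothetical helper: copy a war into tomcat's webapps folder and restart.
# Usage: deploy_war <path-to-war> <tomcat-home>
deploy_war() {
  war=$1
  tomcat_home=$2

  [ -f "$war" ] || { echo "war file not found: $war" >&2; return 1; }
  [ -d "$tomcat_home/webapps" ] || { echo "no webapps folder in $tomcat_home" >&2; return 1; }

  cp "$war" "$tomcat_home/webapps/"

  # Restart only if both scripts are present and executable.
  if [ -x "$tomcat_home/bin/shutdown.sh" ] && [ -x "$tomcat_home/bin/startup.sh" ]; then
    "$tomcat_home/bin/shutdown.sh" || true   # tomcat may not be running yet
    "$tomcat_home/bin/startup.sh"
  fi
}

# Example call, with NUTCH_HOME and TOMCAT_HOME pointing at your extracted folders:
# deploy_war "$NUTCH_HOME/build/nutch-0.9.war" "$TOMCAT_HOME"
```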

Have fun!
