Tuesday, April 28, 2009

Installing Nutch

This article is courtesy of Abhilash LL. Thanks Abhilash. :)
Setting up Nutch and performing a simple crawl
  • Download the Tomcat, Nutch and Ant distributions ( preferably the release versions )
   Nutch    Ant    Tomcat 
  • Extract all 3 tarballs, i.e. apache tomcat, ant and nutch, into /home/<user>/ . Go into the nutch folder ( <NUTCH_HOME> )
  • Go to <NUTCH_HOME>/conf and rename the existing nutch-site.xml to nutch-site.xml.original. Now copy nutch-default.xml and paste it as nutch-site.xml, so that both xml files are the same. We shall be working on nutch-site.xml
  • In nutch-site.xml, search for the http.agent.name property and put <crawler name> inside its <value></value> tag.

eg : <value>'MyCrawler'</value>

  • Then search for the searcher.dir property and put /home/<user>/Nutch/crawl in its <value> tag. Save and close the file.

eg: <value> /home/username/Nutch_foldername/crawl </value> P.S.: a folder named crawl should not exist yet.

  • Create folder named urls in <NUTCH_HOME> folder.
  • Inside urls, create a file named seedurls and put in it the list of urls to be crawled. If there are multiple seeds, put each url on its own line

eg:http://www.iiitb.ac.in

  • Since it's preferable to restrict ourselves to a particular domain, modify crawl-urlfilter.txt in <NUTCH_HOME>/conf: find the comment "# accept hosts in MY.DOMAIN.NAME" and, in the regular expression below it, replace MY.DOMAIN.NAME with the domain you want to crawl.

eg: iiitb.ac.in
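
The seed file and URL filter steps above can be sketched in shell roughly as follows. This is only an illustration: it assumes NUTCH_HOME points at the extracted Nutch directory, and when that variable is unset it falls back to a scratch directory with a stand-in filter file (two lines mimicking the placeholder the default file is expected to contain), so it can be tried without touching a real install.

```shell
# Assumption: NUTCH_HOME points at the extracted Nutch directory. If it is
# unset, use a scratch directory with a stand-in crawl-urlfilter.txt.
if [ -z "${NUTCH_HOME:-}" ]; then
    NUTCH_HOME=$(mktemp -d)
    mkdir -p "$NUTCH_HOME/conf"
    printf '%s\n' '# accept hosts in MY.DOMAIN.NAME' \
        '+^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/' \
        > "$NUTCH_HOME/conf/crawl-urlfilter.txt"
fi

# One seed URL per line in urls/seedurls.
mkdir -p "$NUTCH_HOME/urls"
printf '%s\n' 'http://www.iiitb.ac.in' > "$NUTCH_HOME/urls/seedurls"

# Swap the placeholder for the domain to crawl, keeping a .bak backup.
filter="$NUTCH_HOME/conf/crawl-urlfilter.txt"
cp "$filter" "$filter.bak"
sed 's/MY\.DOMAIN\.NAME/iiitb.ac.in/' "$filter.bak" > "$filter"

# Show the resulting lines that mention the domain.
grep 'iiitb' "$filter"
```

The backup copy makes it easy to restore the original filter if the substitution is not what you wanted.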

  • We need to set JAVA_HOME and NUTCH_JAVA_HOME. In the console type ( Assuming java is installed )
   echo $JAVA_HOME 

if empty, you need to set it. In the console, type in

   java -version 

This tells you which version you have. Now we need to find the installation path and assign it to the variables mentioned above. In the console,

   which java  

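Note that which prints the path of the java executable (possibly a symlink); the value we need is the installation directory that contains bin/java. Suppose that directory is /usr/lib/j2sdk1.5-sun. In the console,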

   export JAVA_HOME=/usr/lib/j2sdk1.5-sun
   export NUTCH_JAVA_HOME=/usr/lib/j2sdk1.5-sun 

If echo $JAVA_HOME was not empty, you still need to set NUTCH_JAVA_HOME to the same value.

You can verify by echoing both the variables
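
As a rough sketch of the steps above, the installation directory can be derived from the java executable instead of being typed by hand. The helper name java_home_from_bin is made up for this example, and it assumes a readlink that supports -f (as GNU coreutils does); on some systems the result may point at a JRE subdirectory rather than the JDK root, so check the echoed values.

```shell
# Derive an installation directory from the path of a java executable.
# (java_home_from_bin is a hypothetical helper, not part of Nutch.)
java_home_from_bin() {
    # Resolve symlinks such as /usr/bin/java; assumes readlink -f exists.
    real_bin=$(readlink -f "$1") || return 1
    # Strip the trailing /bin/java to get the installation directory.
    dirname "$(dirname "$real_bin")"
}

# Only touch the variables when java is actually on the PATH.
if java_bin=$(command -v java); then
    # Keep an existing JAVA_HOME; otherwise derive one from the binary.
    : "${JAVA_HOME:=$(java_home_from_bin "$java_bin")}"
    export JAVA_HOME
    export NUTCH_JAVA_HOME="$JAVA_HOME"
    echo "JAVA_HOME=$JAVA_HOME"
    echo "NUTCH_JAVA_HOME=$NUTCH_JAVA_HOME"
fi
```

Exports made this way last only for the current shell session; add them to your shell profile if you want them to persist.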

Now we are ready to crawl

  • cd into <NUTCH_HOME>. In the console type in bin/nutch. You can see a list of commands nutch can take. We use crawl here.
  • To start a crawl, make sure your internet connection is active and run this command.
   bin/nutch crawl urls/seedurls -dir /home/<user>/Nutch/crawl -depth 3 -topN 20 

P.S.: your working directory should be <NUTCH_HOME>

Wait for a while to let nutch complete the crawl
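
The preconditions mentioned so far (a non-empty seed file, and a crawl folder that does not exist yet) can be checked before starting the crawl. The sketch below is one possible way to do that; crawl_ready is a made-up helper name, and the crawl itself only runs if bin/nutch is present in the current directory.

```shell
# Return 0 when the preconditions from the steps above hold.
# $1: Nutch directory, $2: crawl output directory.
# (crawl_ready is a hypothetical helper, not a Nutch command.)
crawl_ready() {
    [ -s "$1/urls/seedurls" ] || {
        echo "no seed URLs in $1/urls/seedurls" >&2
        return 1
    }
    [ ! -e "$2" ] || {
        echo "$2 already exists; move it away first" >&2
        return 1
    }
}

# Run from <NUTCH_HOME>; the crawl only starts when the checks pass.
crawl_dir="$HOME/Nutch/crawl"
if [ -x bin/nutch ] && crawl_ready . "$crawl_dir"; then
    bin/nutch crawl urls/seedurls -dir "$crawl_dir" -depth 3 -topN 20
fi
```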

  • To verify the crawled content
   bin/nutch org.apache.nutch.searcher.NutchBean iiitb 

You can see the hits for iiitb in your crawl.

To search the crawled information through a browser, we need to build nutch using ant

  • Edit build.xml in <NUTCH_HOME> and comment out the whole <touch> tag.

The <touch> tag is at line 61; the build fails with an error at that line if we try to build without making this change

  • Now build nutch by
   /home/<user>/apache-ant-1.7.0/bin/ant 

The above command builds nutch using build.xml in <NUTCH_HOME>. We give the full path to the ant command inside the extracted ant directory. P.S.: the working directory is <NUTCH_HOME>

  • Now we build the war file to be put into tomcat
   /home/<user>/apache-ant-1.7.0/bin/ant war 

ps: pwd is <NUTCH_HOME>

  • Copy the war file in <NUTCH_HOME>/build ( nutch-0.9.war ) into webapps folder of tomcat
  • Restart tomcat by ./shutdown.sh and then ./startup.sh in <TOMCAT_HOME>/bin folder

This brings up the search page of Nutch. You can now go ahead and search for information in the crawled data
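
The copy and restart steps can be wrapped in a small function, sketched below. The name deploy_war and the default locations for NUTCH_HOME and TOMCAT_HOME are assumptions made for this example; adjust them to wherever you extracted the tarballs.

```shell
# Copy a war into Tomcat's webapps folder and restart Tomcat if its
# control scripts are present. $1: war file, $2: Tomcat directory.
# (deploy_war is a hypothetical helper, not part of Nutch or Tomcat.)
deploy_war() {
    cp "$1" "$2/webapps/" || return 1
    if [ -x "$2/bin/shutdown.sh" ] && [ -x "$2/bin/startup.sh" ]; then
        "$2/bin/shutdown.sh" || true   # fine if Tomcat was not running
        "$2/bin/startup.sh"
    fi
}

# Assumed locations (hypothetical; adjust to your layout).
NUTCH_HOME=${NUTCH_HOME:-$HOME/Nutch}
TOMCAT_HOME=${TOMCAT_HOME:-$HOME/apache-tomcat}
if [ -f "$NUTCH_HOME/build/nutch-0.9.war" ]; then
    deploy_war "$NUTCH_HOME/build/nutch-0.9.war" "$TOMCAT_HOME"
fi
```

Tomcat normally serves a war under a path named after the file, so with an unchanged default port the search page would typically be at http://localhost:8080/nutch-0.9/ (an assumption about your Tomcat configuration).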

Have fun!

Extending Nutch

Introduction

Apart from returning search results for the given keywords, Nutch offers a few extra views of every page, such as cached, explain and anchors. The explain page gives more technical details about the page and is intended mainly for developers. By default it shows only some basic details. This section describes how to change the explain page so that it displays customized information about the page.

Nutch provides several plug-ins for different functionality. The plug-in we need in order to customize the explain page is "index-more".


Steps

The following steps make the desired changes. As a prerequisite, the Nutch crawler should be fully functional on the system. If it is not, please refer to the Installing Nutch page by Abhilash. Once Nutch is verified to be running successfully, we are ready to make the changes.

  • Open the file <nutch_home>/src/plugin/index-more/src/java/org/apache/nutch/indexer/more/MoreIndexingFilter.java
  • Add a method with the following signature.
   private Document addMyFields(Document doc, ParseData data, String url, Inlinks inlinks) 
  • Locate the method called "filter". This is the method responsible for adding any additional fields to the explain page. The following arguments of the filter method will be used to accomplish our task.
    • Document doc: This is the document object which is displayed, i.e. the explain page. We will use this object to add fields to the explain page.
    • Parse parse: This object contains the data obtained by parsing a web page. We will use it to get the values of the fields we want to add to the explain page.
    • Inlinks inlinks: This is an object of the Inlinks class. It should hold information about the incoming links to the web page. (Note: when we actually tried, no incoming links could be retrieved from it.)
    • Text url: URL of the web page.
  • Add a call to the new method inside the filter method as follows.
   addMyFields(doc, parse.getData(), url_s, inlinks); 
  • Inside the addMyFields method, use the following method call to add fields to the explain page.

doc.add(new Field(<title of field>, <value of field>, <to be displayed>, <to be tokenized>));

    • Here, <title of field> is the string to be displayed on the explain page as the field title.
    • <value of field> is the string to be displayed on the explain page as the value of the field.
    • The <to be displayed> flag sets whether the added field is displayed on the explain page or not.
    • The <to be tokenized> flag sets whether the added field is tokenized or not.
    • Please refer to the important methods listed below to get the <title of field> and <value of field> of different fields.
  • Add one doc.add() method call for each field to be added to the explain page.
  • Now, open nutch-site.xml file from <nutch_home>/conf.
  • Search for a property name "plugin.include".
  • In the value tag of the "plugin.include" property, add "|index-more". The value of this tag tells nutch which plugins are to be enabled.
  • Now run the following command, with <nutch_home> as the current working directory, to build the changes made.
   ../apache-ant-1.7.0/bin/ant 
  • After the command has terminated successfully, replace "plugin" folder in <nutch_home> directory with "<nutch_home>/build/plugin" folder.
  • Finally, we are set to crawl the web again with changes in the crawler. So, crawl the web again, create a war file, place it in webapps folder of tomcat server and restart it.
  • Open the nutch home page on localhost and fire a query for a keyword.
  • Click on "explain" link of any resulting link on the result page. And... all the fields we added are displayed over here...

Important Methods

Parse

  1. ParseData getData() - Returns an object of ParseData class, containing parsed data of a web page.

ParseData

  1. Outlink[] getOutlinks() - Returns an array of Outlink objects, each containing details about an outgoing link from the web page.
  2. Metadata getContentMeta() - Returns a Metadata object having the original Metadata retrieved from content.
  3. Metadata getParseMeta() - Returns a Metadata object having other content properties.

Inlinks

  1. Iterator iterator() - Returns an Iterator object containing objects of Inlink class.

Inlink

  1. String getFromUrl() - Returns the URL of the incoming link.
  2. String getAnchor() - Returns the anchor text on the web page for the incoming link.

Outlink

  1. String getToUrl() - Returns the URL of the outgoing link.
  2. String getAnchor() - Returns the anchor text on the web page for the outgoing link.

Metadata

  1. String[] names() - Returns the array of Strings of all the names of the fields in the Metadata object.
  2. boolean isMultiValued(String name) - Returns true if the field "name" is multivalued.
  3. String get(String name) - Get the value associated to a metadata name.
  4. String[] getValues(String name) - Get the values associated to a metadata name.
  5. int size() - Returns the number of metadata names in this metadata.


Some things worth knowing...

  • Whenever you run ../<apache-ant_home>/bin/ant from <nutch_home>, with or without any option, the command only affects files in the <nutch_home>/build folder. E.g. after making changes in the index-more plugin, running ant creates the latest jar file with the changes in the plugin folder inside the build folder. So, for the changes to come into effect, that folder has to be moved to the <nutch_home> folder.
  • After making a change in any plugin, we need to enable the plugin by adding its name to the value tag of the "plugin.include" property in the nutch-site.xml file. How do we know the exact name of the plugin? For every plugin, there is a plugin.xml file in the <nutch_home>/src/plugin/<plugin> folder. The id of the plugin tag in plugin.xml gives you the exact name which should be added to nutch-site.xml.
  • To get the javadoc of modified or existing source, you can use the javadoc option of the ant command. This is to be run from the <nutch_home> directory. The new javadoc will be generated in <nutch_home>/build/doc only.
   ../apache-ant-1.7.0/bin/ant javadoc
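
For the plugin-name lookup described above, a small shell helper can pull the id out of a plugin.xml file. This is only a sketch: plugin_id is a made-up name, and it assumes the id attribute is the first id="..." that appears on or after the line where the <plugin> tag opens.

```shell
# Print the id attribute of the <plugin> element in a plugin.xml file.
# (plugin_id is a hypothetical helper, not part of Nutch.)
plugin_id() {
    awk '
        # Start looking once the opening <plugin tag has been seen.
        /<plugin([ \t>]|$)/ { in_tag = 1 }
        # Print the text between the quotes of the first id="..." found.
        in_tag && match($0, /id="[^"]*"/) {
            print substr($0, RSTART + 4, RLENGTH - 5)
            exit
        }
    ' "$1"
}

# Example (run from <nutch_home>): list the ids of all plugins in the source tree.
for f in src/plugin/*/plugin.xml; do
    [ -f "$f" ] && plugin_id "$f"
done
```

The printed id is the string to append to the "plugin.include" value.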

DirectX 11

DirectX 11 is the latest addition by Microsoft to the DirectX family. Though it is still in pre-release, the SDK is available for developers to start building applications on it. The most exciting feature in Dx11 is the Compute Shader.