Tuesday, April 28, 2009

Extending Nutch

Introduction

Besides returning search results for a query, Nutch offers a few extra links for every result, such as cached, explain and anchors. The explain page shows more technical details about a page and is intended mainly for developers. Out of the box, Nutch displays only some basic details there. This post describes how to change the explain page so that it displays customized information about a page.

Nutch ships with several plug-ins, each providing a different piece of functionality. The plug-in we are interested in for customizing the explain page is "index-more".


Steps

The steps below make the desired changes. As a prerequisite, the Nutch crawler should be fully functional on the system; if it is not, please refer to the Installing Nutch page by Abhilash. Once Nutch is verified to be running successfully, we are ready to make the changes.

  • Open the file <nutch_home>/src/plugin/index-more/src/java/org/apache/nutch/indexer/more/MoreIndexingFilter.java
  • Add a method with the following signature.
   private Document addMyFields(Document doc, ParseData data, String url, Inlinks inlinks) 
  • Locate the method called "filter". This is the method responsible for adding any additional fields to the explain page. The following arguments of the filter method will be used to accomplish our task.
    • Document doc: The document object that is displayed, i.e. the explain page. We will use this object to add fields to the explain page.
    • Parse parse: The object containing the data obtained by parsing a web page. We will use this object to get the values of the fields we want to add to the explain page.
    • Inlinks inlinks: An object of the Inlinks class, which should hold information about the incoming links to the web page. (P.S.: When we actually tried to retrieve them, no incoming links could be found.)
    • Text url: The URL of the web page.
  • Add a call to this method in the filter method as follows.
   addMyFields(doc, parse.getData(), url_s, inlinks); 
  • Inside the addMyFields method, use the following method call to add fields to the explain page.

doc.add(new Field(<title of field>, <value of field>, <to be displayed>, <to be tokenized>));

    • Here, <title of field> is the string displayed on the explain page as the field title.
    • <value of field> is the string displayed on the explain page as the value of the field.
    • The <to be displayed> flag sets whether or not the added field is displayed on the explain page.
    • The <to be tokenized> flag sets whether or not the added field is tokenized.
    • Please refer to the important methods listed below to get the <title of field> and <value of field> of different fields.
  • Add one doc.add() call for each field to be shown on the explain page.
  • Now open the nutch-site.xml file in <nutch_home>/conf.
  • Search for the property named "plugin.include".
  • In the value tag of the "plugin.include" property, add "|index-more". The value of this tag tells Nutch which plug-ins are to be enabled.
  • Now run the following command, with <nutch_home> as the current working directory, to build the changes made.
   ../apache-ant-1.7.0/bin/ant 
  • After the command has completed successfully, replace the "plugin" folder in the <nutch_home> directory with the "<nutch_home>/build/plugin" folder.
  • Finally, we are set to crawl the web again with the changed crawler. So crawl the web again, create a war file, place it in the webapps folder of the Tomcat server and restart the server.
  • Open the Nutch home page on localhost and run a query for a keyword.
  • Click the "explain" link of any result on the results page. All the fields we added are displayed there.
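
The shape of addMyFields can be sketched without Nutch on the classpath. The following is only an illustration under stated assumptions: SketchField and SketchDocument are hypothetical stand-ins for the Lucene Field and Document classes, the parse metadata is modelled as a plain Map, and the field titles "myUrl" and "meta." are made up for the example. In the real plugin the method would take the Document, ParseData, String and Inlinks arguments shown in the signature above.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class AddMyFieldsSketch {

    /** Hypothetical stand-in for a field: title, value and the two flags. */
    static final class SketchField {
        final String title;
        final String value;
        final boolean displayed;
        final boolean tokenized;

        SketchField(String title, String value, boolean displayed, boolean tokenized) {
            this.title = title;
            this.value = value;
            this.displayed = displayed;
            this.tokenized = tokenized;
        }
    }

    /** Hypothetical stand-in for the document shown on the explain page. */
    static final class SketchDocument {
        final List<SketchField> fields = new ArrayList<SketchField>();

        void add(SketchField field) {
            fields.add(field);
        }
    }

    /** Mirrors the idea of addMyFields: one doc.add() call per field. */
    static SketchDocument addMyFields(SketchDocument doc, Map<String, String> parseMeta, String url) {
        // A field holding the page URL (title chosen for this example only).
        doc.add(new SketchField("myUrl", url, true, false));
        // One field per metadata entry, prefixed so the origin is visible.
        for (Map.Entry<String, String> entry : parseMeta.entrySet()) {
            doc.add(new SketchField("meta." + entry.getKey(), entry.getValue(), true, false));
        }
        return doc;
    }

    public static void main(String[] args) {
        Map<String, String> parseMeta = new LinkedHashMap<String, String>();
        parseMeta.put("description", "sample value");
        SketchDocument doc = addMyFields(new SketchDocument(), parseMeta, "http://example.com/");
        for (SketchField field : doc.fields) {
            System.out.println(field.title + " = " + field.value);
        }
    }
}
```

Returning the document from the method keeps the call in filter a one-liner, which matches the signature used in the steps.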

Important Methods

Parse

  1. ParseData getData() - Returns an object of ParseData class, containing parsed data of a web page.

ParseData

  1. Outlink[] getOutlinks() - Returns an array of Outlink objects, each containing details about an outgoing link from the web page.
  2. Metadata getContentMeta() - Returns a Metadata object having the original Metadata retrieved from content.
  3. Metadata getParseMeta() - Returns a Metadata object having other content properties.

Inlinks

  1. Iterator iterator() - Returns an Iterator over objects of the Inlink class.

Inlink

  1. String getFromUrl() - Returns the URL of the incoming link.
  2. String getAnchor() - Returns the anchor text on the web page for the incoming link.

Outlink

  1. String getToUrl() - Returns the URL of the outgoing link.
  2. String getAnchor() - Returns the anchor text on the web page for the outgoing link.
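
As a rough illustration of how the two Outlink accessors could feed a field value, the sketch below turns an array of links into readable lines. SketchOutlink is a hypothetical stand-in that only mimics getToUrl() and getAnchor(); the "(no anchor)" label and the describe helper are assumptions made for this example, not part of Nutch.

```java
import java.util.ArrayList;
import java.util.List;

public class OutlinkSummarySketch {

    /** Hypothetical stand-in exposing the two Outlink accessors listed above. */
    static final class SketchOutlink {
        private final String toUrl;
        private final String anchor;

        SketchOutlink(String toUrl, String anchor) {
            this.toUrl = toUrl;
            this.anchor = anchor;
        }

        String getToUrl() {
            return toUrl;
        }

        String getAnchor() {
            return anchor;
        }
    }

    /** Builds one "anchor -> url" line per outgoing link. */
    static List<String> describe(SketchOutlink[] outlinks) {
        List<String> lines = new ArrayList<String>();
        for (SketchOutlink link : outlinks) {
            String anchor = link.getAnchor();
            // Fall back to a placeholder when the link has no usable anchor text.
            String label = (anchor == null || anchor.trim().isEmpty()) ? "(no anchor)" : anchor.trim();
            lines.add(label + " -> " + link.getToUrl());
        }
        return lines;
    }

    public static void main(String[] args) {
        SketchOutlink[] outlinks = {
            new SketchOutlink("http://example.com/a", "Home"),
            new SketchOutlink("http://example.com/b", null)
        };
        for (String line : describe(outlinks)) {
            System.out.println(line);
        }
    }
}
```

Each resulting line could then be passed as the value of a field in a doc.add() call.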

Metadata

  1. String[] names() - Returns the array of Strings of all the names of the fields in the Metadata object.
  2. boolean isMultiValued(String name) - Returns true if the field "name" is multivalued.
  3. String get(String name) - Returns the value associated with a metadata name.
  4. String[] getValues(String name) - Returns all the values associated with a metadata name.
  5. int size() - Returns the number of metadata names in this metadata.
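
The Metadata methods above combine naturally into a loop: walk names(), then pick get() or getValues() depending on isMultiValued(). The sketch below shows that pattern. SketchMetadata is a hypothetical stand-in backed by a Map that only mimics the listed methods, and the flatten helper and the ", " separator are assumptions for this example.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class MetadataDumpSketch {

    /** Hypothetical stand-in offering the Metadata methods listed above. */
    static final class SketchMetadata {
        private final Map<String, List<String>> entries = new LinkedHashMap<String, List<String>>();

        /** Helper for filling the stand-in with sample data. */
        void add(String name, String value) {
            List<String> values = entries.get(name);
            if (values == null) {
                values = new ArrayList<String>();
                entries.put(name, values);
            }
            values.add(value);
        }

        String[] names() {
            return entries.keySet().toArray(new String[0]);
        }

        boolean isMultiValued(String name) {
            List<String> values = entries.get(name);
            return values != null && values.size() > 1;
        }

        String get(String name) {
            List<String> values = entries.get(name);
            return values == null ? null : values.get(0);
        }

        String[] getValues(String name) {
            List<String> values = entries.get(name);
            return values == null ? new String[0] : values.toArray(new String[0]);
        }

        int size() {
            return entries.size();
        }
    }

    /** Turns every metadata name into one title/value pair for the explain page. */
    static Map<String, String> flatten(SketchMetadata metadata) {
        Map<String, String> pairs = new LinkedHashMap<String, String>();
        for (String name : metadata.names()) {
            // Multi-valued names are joined into one string; others use get().
            String value = metadata.isMultiValued(name)
                    ? String.join(", ", metadata.getValues(name))
                    : metadata.get(name);
            pairs.put(name, value);
        }
        return pairs;
    }

    public static void main(String[] args) {
        SketchMetadata metadata = new SketchMetadata();
        metadata.add("keyword", "search");
        metadata.add("keyword", "crawler");
        metadata.add("author", "someone");
        for (Map.Entry<String, String> pair : flatten(metadata).entrySet()) {
            System.out.println(pair.getKey() + " = " + pair.getValue());
        }
    }
}
```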


Some things worth knowing...

  • Whenever you run ../<apache-ant_home>/bin/ant from <nutch_home>, with or without options, the command only affects files in the <nutch_home>/build folder. For example, after changing the index-more plugin and running ant, the updated jar file is created in the plugin folder inside the build folder. So, for the changes to take effect, that folder has to be moved to the <nutch_home> folder.
  • After changing any plugin, we need to enable it by adding its name to the value tag of the "plugin.include" property in the nutch-site.xml file. How do we find the exact name of the plugin? Every plugin has a plugin.xml file in the <nutch_home>/src/plugin/<plugin> folder. The id attribute of the plugin tag in that file gives the exact name to add to nutch-site.xml.
  • To generate the javadoc for modified or existing source, use the javadoc option of the ant command. Run it from the <nutch_home> directory. The new javadoc is generated in <nutch_home>/build/doc only.
   ../apache-ant-1.7.0/bin/ant javadoc
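
The plugin-name lookup described above can also be checked with a few lines of plain Java. This is a minimal sketch, not Nutch code: the inline plugin.xml string is a trimmed illustrative sample (a real file carries more attributes and child elements), the class and method names are hypothetical, and isIncluded assumes that the "plugin.include" value is treated as a regular expression matched against the whole plugin id.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.regex.Pattern;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

public class PluginIdSketch {

    /** Returns the id attribute of the root plugin tag in the given XML text. */
    static String readPluginId(String pluginXml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document xml = builder.parse(
                    new ByteArrayInputStream(pluginXml.getBytes(StandardCharsets.UTF_8)));
            return xml.getDocumentElement().getAttribute("id");
        } catch (Exception e) {
            // Wrap parser errors so callers do not need to handle checked exceptions.
            throw new IllegalStateException("Could not parse plugin.xml content", e);
        }
    }

    /** Assumption: the plugin.include value is a regex matched against the full id. */
    static boolean isIncluded(String pluginInclude, String pluginId) {
        return Pattern.compile(pluginInclude).matcher(pluginId).matches();
    }

    public static void main(String[] args) {
        // Trimmed, illustrative sample of a plugin.xml file.
        String pluginXml = "<plugin id=\"index-more\"></plugin>";
        String id = readPluginId(pluginXml);
        // Sample plugin.include value, made up for this example.
        String include = "protocol-http|index-basic|index-more";
        System.out.println(id + " enabled: " + isIncluded(include, id));
    }
}
```

Running such a check before a crawl is a quick way to confirm that the name added to nutch-site.xml really matches the id declared by the plugin.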
