- Download tomcat, nutch and ant ( preferably the release versions )
- Extract all 3 tarballs ( apache tomcat, ant and nutch ) into /home/<user>/ . Go into the nutch folder ( <NUTCH_HOME> )
- Go to <NUTCH_HOME>/conf and rename the existing nutch-site.xml to nutch-site.xml.original. Now copy nutch-default.xml and save the copy as nutch-site.xml, so that both xml files are the same. We shall be working on nutch-site.xml
- In nutch-site.xml, search for http.agent.name and put your crawler's name in the <value></value> tag of that property.
eg: <value>MyCrawler</value>
- Then search for searcher.dir and put /home/<user>/Nutch/crawl in its <value> tag. Save and close the file.
eg: <value>/home/username/Nutch_foldername/crawl</value> p.s. the folder named crawl should not exist yet; the crawl creates it.
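To double-check the two edits from the console, something like the following could be used. It is only a sketch: show_property is a made-up helper name, and it assumes each <value> sits on the line right after its <name>, as in the stock nutch-default.xml.

```shell
# Hypothetical helper: print a property's <name> line plus the line after it,
# which is assumed to hold the matching <value>.
# Usage: show_property <config-file> <property-name>
show_property() {
  grep -F -A 1 "<name>$2</name>" "$1"
}
```

From <NUTCH_HOME>, running show_property conf/nutch-site.xml http.agent.name should then print the name line followed by the <value> line you edited. The same works for searcher.dir.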
- Create a folder named urls in the <NUTCH_HOME> folder.
- Inside urls, create a file named seedurls and put the list of urls to be crawled in it. If there are multiple seeds, give each url on its own line
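From the console, the seed list could be created roughly like this. The two URLs are placeholders picked for illustration, so replace them with the pages you actually want to start from.

```shell
# Run from <NUTCH_HOME>. Both URLs are placeholders; use your own seeds.
mkdir -p urls
printf '%s\n' 'http://www.iiitb.ac.in/' 'http://www.example.com/' > urls/seedurls
# Show the result: one seed URL per line.
cat urls/seedurls
```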
- Since it's preferable to restrict ourselves to a particular domain, modify crawl-urlfilter.txt in <NUTCH_HOME>/conf: find the comment "# accept hosts in MY.DOMAIN.NAME" and, in the regular expression on the line below it, replace MY.DOMAIN.NAME by the domain you want to crawl.
eg: iiitb.ac.in
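If you would rather not edit the file by hand, a small sed wrapper along these lines should do the same substitution. set_crawl_domain is a name made up for this sketch, and it assumes the placeholder MY.DOMAIN.NAME is still present in the file.

```shell
# Hypothetical helper: swap the MY.DOMAIN.NAME placeholder for a real domain.
# Usage: set_crawl_domain <filter-file> <domain>
set_crawl_domain() {
  filter_file=$1
  domain=$2
  # Write to a temporary file first, then move it over the original,
  # so this works with any POSIX sed (no in-place flag needed).
  sed "s/MY\.DOMAIN\.NAME/$domain/g" "$filter_file" > "$filter_file.tmp" &&
    mv "$filter_file.tmp" "$filter_file"
}
```

From <NUTCH_HOME> this would be called as set_crawl_domain conf/crawl-urlfilter.txt iiitb.ac.in. Note that it also rewrites the domain in the comment line, which is harmless.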
- We need to set JAVA_HOME and NUTCH_JAVA_HOME. In the console type ( Assuming java is installed )
echo $JAVA_HOME
If it is empty, you need to set it. In the console, type in
java -version
This shows your java version. Now we need to find the path and add it to the above mentioned variables. In the console,
which java
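This prints the path of the java binary. JAVA_HOME should be the java installation folder, i.e. the one containing bin/java. Suppose that is /usr/lib/j2sdk1.5-sun. In the console,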
export JAVA_HOME=/usr/lib/j2sdk1.5-sun
export NUTCH_JAVA_HOME=/usr/lib/j2sdk1.5-sun
If echo $JAVA_HOME was non-empty, you only need to set NUTCH_JAVA_HOME to the same value.
You can verify by echoing both the variables.
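The lookup can also be scripted. The snippet below is a sketch that assumes a Linux system where readlink -f is available and where the real java binary lives in <java folder>/bin/java; if your layout differs, set the path by hand as shown above.

```shell
# Only guess JAVA_HOME when it is not already set and java is on the PATH.
if [ -z "${JAVA_HOME:-}" ] && command -v java >/dev/null 2>&1; then
  # Follow symlinks to the real binary, then strip the trailing /bin/java.
  java_bin=$(readlink -f "$(command -v java)")
  JAVA_HOME=${java_bin%/bin/java}
  export JAVA_HOME
fi
# Nutch reads NUTCH_JAVA_HOME, so keep it identical to JAVA_HOME.
NUTCH_JAVA_HOME=${JAVA_HOME:-}
export NUTCH_JAVA_HOME
echo "JAVA_HOME=${JAVA_HOME:-}"
echo "NUTCH_JAVA_HOME=$NUTCH_JAVA_HOME"
```

The exports only last for the current console session; add them to your shell startup file if you want them to persist.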
Now we are ready to crawl
- cd into <NUTCH_HOME>. In the console type in bin/nutch. You can see a list of commands nutch can take. We use crawl here.
- To start a crawl, make sure your internet connection is active and run this command.
bin/nutch crawl urls/seedurls -dir /home/<user>/Nutch/crawl -depth 3 -topN 20
p.s : your pwd should be <NUTCH_HOME>
Wait for a while to let nutch complete the crawl
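Most failed first crawls come from one of the prerequisites above being missed, so a quick check before starting can save a wait. The function below is a sketch of such a check; crawl_preflight is a made-up name and is not part of nutch.

```shell
# Hypothetical pre-crawl check based on the steps above.
# Usage: crawl_preflight <seed-file> <crawl-dir>
crawl_preflight() {
  seeds=$1
  crawl_dir=$2
  # The seed file must exist and contain at least one URL.
  if [ ! -s "$seeds" ]; then
    echo "seed file $seeds is missing or empty" >&2
    return 1
  fi
  # The crawl folder must not exist before the crawl starts.
  if [ -e "$crawl_dir" ]; then
    echo "$crawl_dir already exists; remove it or choose another -dir" >&2
    return 1
  fi
  # NUTCH_JAVA_HOME is needed by bin/nutch; only warn so the other checks still count.
  if [ -z "${NUTCH_JAVA_HOME:-}" ]; then
    echo "warning: NUTCH_JAVA_HOME is not set" >&2
  fi
  return 0
}
```

From <NUTCH_HOME> it would be run as crawl_preflight urls/seedurls /home/<user>/Nutch/crawl, followed by the bin/nutch crawl command if it reports no error.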
- To verify the crawled content
bin/nutch org.apache.nutch.searcher.NutchBean iiitb
You can see the hits for iiitb in your crawl.
To search the crawled information through a browser, we need to build nutch using ant
- Edit build.xml in <NUTCH_HOME> and comment out the whole <touch> tag.
The <touch> tag is at line 61; if we try to build without this change, we get an error at that line.
- Now build nutch by
/home/<user>/apache-ant-1.7.0/bin/ant
The above command builds nutch using build.xml in <NUTCH_HOME>. We give the full path to the ant command inside the extracted ant directory. ps: pwd is <NUTCH_HOME>
- Now we build the war file to be put into tomcat
/home/<user>/apache-ant-1.7.0/bin/ant war
ps: pwd is <NUTCH_HOME>
- Copy the war file ( nutch-0.9.war ) from <NUTCH_HOME>/build into the webapps folder of tomcat
- Restart tomcat by running ./shutdown.sh and then ./startup.sh in the <TOMCAT_HOME>/bin folder
- Open a browser and point it to "http://localhost:8080/nutch-0.9"
This brings up the search page of Nutch. You can now go ahead and search for information in the crawled data
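If you rebuild often, the copy and restart steps can be wrapped in one function. This is a sketch only: deploy_nutch_war is a made-up name, and the two arguments stand for your own <NUTCH_HOME> and <TOMCAT_HOME> paths.

```shell
# Hypothetical helper: copy the war into tomcat and restart it.
# Usage: deploy_nutch_war <nutch-home> <tomcat-home>
deploy_nutch_war() {
  nutch_home=$1
  tomcat_home=$2
  # Stop here if the war has not been built yet.
  cp "$nutch_home/build/nutch-0.9.war" "$tomcat_home/webapps/" || return 1
  # shutdown.sh reports an error when tomcat is not running; that is ignored.
  sh "$tomcat_home/bin/shutdown.sh" || true
  sh "$tomcat_home/bin/startup.sh"
}
```

For example: deploy_nutch_war /home/<user>/<nutch folder> /home/<user>/<tomcat folder>. Tomcat may need a few seconds after startup before the search page responds.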
Have fun!