Can we force a blog to be popular??? Here is one attempt...
Popular Blog
Nutch Ant Tomcat
eg : <value>'MyCrawler'</value>
e.g.: <value>/home/username/Nutch_foldername/crawl</value> P.S.: the folder named crawl must not already exist.
eg: iiitb.ac.in
echo $JAVA_HOME
If the output is empty, you need to set it. In the console, type
java -version
This shows which Java version you have. Now we need to find the installation path and assign it to JAVA_HOME and NUTCH_JAVA_HOME. In the console, type
which java
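This prints the path of the java executable; JAVA_HOME must be the JDK directory that contains bin/java. Suppose that directory is /usr/lib/j2sdk1.5-sun. In the console, type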
export JAVA_HOME=/usr/lib/j2sdk1.5-sun
export NUTCH_JAVA_HOME=/usr/lib/j2sdk1.5-sun
If echo $JAVA_HOME is not empty, you only need to set NUTCH_JAVA_HOME to the same value.
You can verify the result by echoing both variables.
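The setup above can be sketched as one small shell snippet. It is only a sketch under assumptions: the helper name jdk_home_from_binary is hypothetical, and readlink -f (GNU coreutils) is assumed to be available for resolving symlinks such as /usr/bin/java.

```shell
#!/bin/sh
# Hypothetical helper: given the full path of a java executable,
# print the directory two levels up (the one that contains bin/java).
jdk_home_from_binary() {
  dirname "$(dirname "$1")"
}

# Keep JAVA_HOME if it is already set; otherwise try to derive it
# from the java found on PATH (assumes readlink -f exists).
if [ -z "${JAVA_HOME:-}" ] && command -v java >/dev/null 2>&1; then
  JAVA_HOME=$(jdk_home_from_binary "$(readlink -f "$(command -v java)")")
  export JAVA_HOME
fi

# Nutch reads NUTCH_JAVA_HOME, so give it the same value.
NUTCH_JAVA_HOME=${JAVA_HOME:-}
export NUTCH_JAVA_HOME

# Verify by echoing both variables.
echo "JAVA_HOME=${JAVA_HOME:-}"
echo "NUTCH_JAVA_HOME=$NUTCH_JAVA_HOME"
```

If both lines print the same non-empty directory, the environment is ready. The exports last only for the current shell session unless you add them to your shell profile.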
Now we are ready to crawl
bin/nutch crawl urls/seedurls -dir /home/<user>/Nutch/crawl -depth 3 -topN 20
P.S.: your working directory (pwd) should be <NUTCH_HOME>.
Wait a while for Nutch to complete the crawl.
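Since the crawl folder must not exist beforehand and the command has to be run from <NUTCH_HOME>, a small guard around the crawl command can catch both mistakes early. This is a minimal sketch: the function name crawl_dir_is_free and the CRAWL_DIR default are hypothetical; the crawl command itself is the one shown above.

```shell
#!/bin/sh
# Hypothetical helper: succeeds only if the given path does not exist yet.
crawl_dir_is_free() {
  [ ! -e "$1" ]
}

# Assumed default location; override by setting CRAWL_DIR beforehand.
CRAWL_DIR=${CRAWL_DIR:-${HOME:-.}/Nutch/crawl}

if [ ! -x bin/nutch ]; then
  # bin/nutch is only visible when the working directory is <NUTCH_HOME>.
  echo "bin/nutch not found; run this from <NUTCH_HOME>" >&2
elif crawl_dir_is_free "$CRAWL_DIR"; then
  bin/nutch crawl urls/seedurls -dir "$CRAWL_DIR" -depth 3 -topN 20
else
  echo "$CRAWL_DIR already exists; remove or rename it first" >&2
fi
```

The guard only reports the problem and never deletes an existing crawl folder, so an earlier crawl cannot be lost by accident.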
bin/nutch org.apache.nutch.searcher.NutchBean iiitb
You can see the hits for iiitb in your crawl.
To search the crawled information through a browser, we need to build Nutch using Ant.
The <touch> tags are at line 61 of build.xml; if we try to build without making these changes, we get an error at that same line.
/home/<user>/apache-ant-1.7.0/bin/ant
The above command builds Nutch using the build.xml in <NUTCH_HOME>. We call ant by its full path inside the extracted Ant directory. P.S.: the working directory is <NUTCH_HOME>.
/home/<user>/apache-ant-1.7.0/bin/ant war
P.S.: the working directory is <NUTCH_HOME>.
This brings up the search page of Nutch. You can now go ahead and search for information in the crawled data
Have fun!
Apart from returning search results for the given keywords, Nutch provides a few extra views for every page, such as cached, explain, anchors and more. The explain page gives more technical details about the page and is intended mainly for developers. Out of the box, it shows only some basic details. The main concern of this topic is how to change the explain page so that it displays customized information about a page as per our needs.
Nutch provides several plug-ins for different functionality. The plug-in we are interested in for customizing the explain page is "index-more".
The following steps make the desired changes. As a prerequisite, the Nutch crawler should be fully functional on the system. If it is not, please refer to the Installing Nutch page by Abhilash. Once Nutch is verified to be running successfully, we are ready to make the changes.
private Document addMyFields(Document doc, ParseData data, String url, Inlinks inlinks)
addMyFields(doc, parse.getData(), url_s, inlinks);
doc.add(new Field(
../apache-ant-1.7.0/bin/ant
Important Methods
Parse
ParseData
Inlinks
Inlink
Outlink
Metadata
Some things worth knowing...
../apache-ant-1.7.0/bin/ant javadoc