Hi,
I'm new to Nutch and am developing a search-based website.
How can I use Nutch to search for parameterized URLs?
For example, I want to search for an item called "xyz". The information on this item
is available on http://www.somesite.com/somepage.jsp?id=someId
where someId is the databaseId (generated by the host application) for item
"xyz".
I know that item "xyz" shows up with the above URL when I search using
Google, but it doesn't appear when I search for it using the sample web
application provided with Nutch.
*Configuration:*
I have configured crawl-urlfilter.txt as follows:
# accept hosts in MY.DOMAIN.NAME
+^http://([a-z0-9]*\.)*somesite.com/
My *urls* folder contains a text file with the following line:
http://www.somesite.com
and I executed the command: bin/nutch crawl urls -dir crawldir -depth 3
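One thing that may be worth checking with this setup: the stock crawl-urlfilter.txt usually also carries a rule that skips URLs containing query characters (-[?*!@=]), and the filter applies the first rule that matches. The sketch below is only a rough Python model of that first-match behaviour, not Nutch's own code. The accept line is the one quoted above; the "-[?*!@=]" and "-." lines are assumptions about what may still be in the file, and the accepts() helper is hypothetical.

```python
import re

# Hypothetical rule list. Only the "+" line comes from the post; the
# "-[?*!@=]" and "-." lines are ASSUMED leftovers from the stock file.
RULES = [
    r"-[?*!@=]",                              # assumed: skip probable query URLs
    r"+^http://([a-z0-9]*\.)*somesite.com/",  # from the post: accept this host
    r"-.",                                    # assumed: skip everything else
]

def accepts(url, rules=RULES):
    """The first rule whose pattern is found in the URL decides:
    '+' keeps the URL, '-' drops it."""
    for rule in rules:
        sign, pattern = rule[0], rule[1:]
        if re.search(pattern, url):
            return sign == "+"
    return False  # no rule matched: treat the URL as rejected

print(accepts("http://www.somesite.com/somepage.jsp"))            # True
print(accepts("http://www.somesite.com/somepage.jsp?id=someId"))  # False
```

If the real file does contain such a rule ahead of the accept line, the "?" and "=" in somepage.jsp?id=someId would be enough for the page to be skipped during the crawl, whatever the host rule says.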
How can I get http://www.somesite.com/somepage.jsp?id=someId to appear when I
search for "xyz", the same way it does in a Google search?
Your help would be much appreciated,
Rohit