I'm looking at setting up a Lucene index fronted by Solr, returning
JSON responses to a Python/Django app running a search UI. I have
about 10,000 URLs I need to crawl, and that number is supposed to rise
to about 200,000 over the next year. In crawling these URLs, I
will need to go 5 levels deep and stay within the domain. I will
need to keep the index fresh, secure, and fast, and I need to be able
to scale the system up, probably to tens of thousands of searches a
minute.
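To make the crawl constraints concrete, here is a minimal sketch of a
breadth-first crawl that stays on the starting host and stops following
links past a depth limit. It assumes the start page counts as level 0,
and `get_links` is a hypothetical caller-supplied function that fetches
a page and returns its href values; politeness (robots.txt, rate
limiting) is left out.

```python
from collections import deque
from urllib.parse import urldefrag, urljoin, urlparse


def crawl(start_url, get_links, max_depth=5):
    """Return the URLs reachable from start_url on the same host,
    at most max_depth links away, in breadth-first order.

    get_links(url) is a hypothetical helper supplied by the caller;
    fetching and HTML parsing would live there.
    """
    host = urlparse(start_url).netloc
    seen = {start_url}
    queue = deque([(start_url, 0)])
    visited = []
    while queue:
        url, depth = queue.popleft()
        visited.append(url)
        if depth == max_depth:
            continue  # at the limit: index the page, but do not follow its links
        for href in get_links(url):
            # Resolve relative links against the current page and drop #fragments.
            link, _fragment = urldefrag(urljoin(url, href))
            if urlparse(link).netloc != host:
                continue  # off-domain (also skips mailto: and similar)
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return visited
```

Comparing `netloc` exactly treats `www.example.com` and `example.com` as
different hosts, so "stay within the domain" may need a looser match
depending on the sites involved.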
I've ordered Lucene in Action, and I'm frantically bookmarking all the
wiki and FAQ pages I can find on the subject.
Here are some options I've come up with. Can anyone comment?
1) Use Nutch to build an index through Solr - requires some low-level config
2) Build a simple crawler in Python and post XML packets to Solr to
build the index - simple, but may be too simple
3) Use wget to get all the pages, and then use ?? to index the pages
locally (probably a Python script) - a hack
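For option 2, the "XML packets" would be Solr's `<add><doc>...</doc></add>`
update messages. A rough sketch of building and posting one with only
the standard library follows. The field names (`id`, `title`, `text`)
and the update URL are assumptions; they depend on the schema and on how
Solr is deployed.

```python
import urllib.request
import xml.etree.ElementTree as ET


def build_add_xml(docs):
    """Build a Solr <add> message from a list of {field_name: value} dicts.

    Field names must match the Solr schema; the ones used in the
    example below are placeholders.
    """
    add = ET.Element("add")
    for doc in docs:
        doc_el = ET.SubElement(add, "doc")
        for name, value in doc.items():
            field = ET.SubElement(doc_el, "field", name=name)
            field.text = value  # ElementTree escapes &, <, > on output
    return ET.tostring(add, encoding="utf-8")


def post_to_solr(xml_bytes, update_url="http://localhost:8983/solr/update"):
    """POST an update message to Solr. The default URL is an assumption."""
    req = urllib.request.Request(
        update_url,
        data=xml_bytes,
        headers={"Content-Type": "text/xml; charset=utf-8"},
    )
    with urllib.request.urlopen(req) as resp:
        return resp.status


if __name__ == "__main__":
    page = {"id": "http://example.com/", "title": "Home", "text": "page body"}
    post_to_solr(build_add_xml([page]))
    post_to_solr(b"<commit/>")  # make the added documents searchable
```

Batching many `<doc>` elements into one `<add>` and committing once per
batch, rather than once per page, should matter as the URL count grows.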
I'm not sure I like any of these ideas, but I'm leaning toward 2) as it
seems easy. I can always get this project going quickly, agile-style,
and then refactor to use Nutch down the road. That's assuming
there isn't something about Nutch that you think I'll need
immediately.
Thoughts?