Java Mailing List Archive

http://www.java2.5341.com/

Home » java-user.lucene »

constructing a mini-index with just the number of hits for a term

Sven

2008-11-13

Replies: Find Java Web Hosting

Author LoginPost Reply
First - I apologize for the double post on my earlier email. The first
time I sent it I received an error message from support@(protected)
saying that I should instead send email to java-user@(protected)
thought it did not go through.

My question is this - is there a way to use the Lucene/Solr
infrastructure to create a mini-index that simply contains a lookup
table of terms and the number of times they have appeared?

I do not need to record which documents have them nor do I need to know
where in the documents they appear. There could be (and probably will
be) more than 2^32 terms, however. I'm not sure if that makes a
difference to the Lucene backend, but thought it might be relevant.

This question coincides with my earlier question about counting the
times a given term is associated with another term. I figure that this
would be more easily accomplished by making the mini-index described
above alongside the normal index when a document is indexed. For
example, when scanning:

Bravely bold Sir Robin, brought forth from Camelot. He was not afraid
to die! Oh, brave Sir Robin!

In addition to the normal indexing function of Lucene, I would like to
write something on the backend to also index:

bravely|bold
bravely|sir
bravely|robin
bravely|brought
bravely|forth
bold|sir
bold|robin
bold|brought
bold|forth
bold|camelot ("from" being a stop word)
...and so on

I only need to keep a running total of each "bravely|bold" term,
however, since the number of terms will be quite large and keeping track
of the document/termpositions would translate to a lot of wasted HD space.

If such a thing is not already in place, could someone let me know if
there are some tutorials, documentation, or presentations that describe
the inner workings of Lucene and the theories/implementation at work for
the actual file formats, structures, data manipulations, etc? (The
javadocs don't go into this kind of detail.) I'm sure I can sift
through the code and eventually make sense of it, but if there is
documentation out there, I'd prefer to peruse that first. My thought
being that I can simply generate my own kind of hash for each combined
term and write it out to a custom file structure similar to Lucene - but
the specifics of how to (optimally) do so are not plain to me yet.

Thanks again!
-Sven


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@(protected)
For additional commands, e-mail: java-user-help@(protected)

©2008 java2.5341.com - Jax Systems, LLC, U.S.A.