Clustering Lucene's results


2004-10-07       - By Dawid Weiss



Hi William,

Ok, here is some demo code I've put together that shows how you can
achieve clustering of Lucene's results. I hope this will get you started
on your projects. If you have questions, please don't hesitate to ask --
cross posts to carrot2-developers would be a good idea too.

The code (plus the binaries so that you don't have to check out all of
Carrot2 ;) are at:
http://www.cs.put.poznan.pl/dweiss/tmp/carrot2-lucene.zip

Take a look at Demo.java -- it is the main link between Lucene and
Carrot2. Play with the parameters; I used 100 as the number of search
results to be clustered, so adjust it to your needs.

    int start = 0;
    int requiredHits = 100;
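
As a rough illustration of what these two values control, here is a
minimal sketch of picking the window of hits that gets passed on to
clustering. HitWindow and select are hypothetical names made up for this
example; they are not taken from Demo.java, and a plain list stands in
for the Lucene search result.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper (not part of Demo.java): selects the window
// [start, start + requiredHits) from an already ranked list of hits.
final class HitWindow {

    static <T> List<T> select(List<T> rankedHits, int start, int requiredHits) {
        if (start < 0 || requiredHits < 0) {
            throw new IllegalArgumentException(
                "start and requiredHits must be non-negative");
        }
        // Clamp both ends so that asking for more hits than exist is harmless.
        int from = Math.min(start, rankedHits.size());
        int to = from + Math.min(requiredHits, rankedHits.size() - from);
        return new ArrayList<T>(rankedHits.subList(from, to));
    }

    public static void main(String[] args) {
        List<String> hits = Arrays.asList("a.html", "b.html", "c.html");
        System.out.println(select(hits, 0, 100)); // [a.html, b.html, c.html]
        System.out.println(select(hits, 1, 1));   // [b.html]
    }
}
```

Asking for 100 hits when only a few exist simply returns what is there,
which is why a generous requiredHits is a safe starting point.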

I hope the code will be self-explanatory.

Good luck,
Dawid

From the readme file:

An example of using Carrot2 components to cluster search
results from Lucene.
===========================================================


Prerequisites
-------------

You must have an index created with Lucene and containing
documents with the following fields: url, title, summary.

The Lucene demo works with exactly these fields -- I just indexed
all of Lucene's source code and documentation using the following line:

mkdir index
java -Djava.ext.dirs=build org.apache.lucene.demo.IndexHTML -create -index index .

The index is now in the 'index' folder.

Remember that the quality of snippets and titles heavily influences the
output of the clustering; in fact, the above example index of Lucene's
API is not very good, because most queries will return nonsensical
cluster labels (see below).

Building Carrot2-Lucene demo
----------------------------

Basically, you should have all of the Carrot2 source code checked out
and issue the build command:

ant -Dcopy.dependencies=true

All of the required libraries and Carrot2 components will end up
in the 'tmp/dist/deps-carrot2-lucene-example-jar' folder.

You can also spare yourself some time and download precompiled binaries
I've put at:

http://www.cs.put.poznan.pl/dweiss/tmp/carrot2-lucene.zip

Now, once you have the compiled binaries, issue the following command
(all on one line of course):

java -Djava.ext.dirs=tmp\dist;tmp\dist\deps-carrot2-lucene-example-jar \
  com.dawidweiss.carrot.lucene.Demo index query

The first argument is the location of the Lucene index created before.
The second argument is a query. The output should list the clusters,
with at most three documents from each cluster:

Results for: query
Timings: index opened in: 0,181s, search: 0,13s, clustering: 0,721s
 :> Search Lucene Rc1 Dev API
    - F:/Repositories/cvs.apache.org/jakarta-lucene/build/docs/api/org/apache/lucene/search/class-use/Query.html
      Uses of Class org.apache.lucene.search.Query (Lucene 1.5-rc1-dev API)
    - F:/Repositories/cvs.apache.org/jakarta-lucene/build/docs/api/org/apache/lucene/search/package-summary.html
      org.apache.lucene.search (Lucene 1.5-rc1-dev API)
    - F:/Repositories/cvs.apache.org/jakarta-lucene/build/docs/api/org/apache/lucene/search/package-use.html
      Uses of Package org.apache.lucene.search (Lucene 1.5-rc1-dev API)
      (and 19 more)

 :> Jakarta Lucene
    - F:/Repositories/cvs.apache.org/jakarta-lucene/src/java/overview.html
      Jakarta Lucene API
    - F:/Repositories/cvs.apache.org/jakarta-lucene/docs/whoweare.html
      Jakarta Lucene - Who We Are - Jakarta Lucene
    - F:/Repositories/cvs.apache.org/jakarta-lucene/docs/index.html
      Jakarta Lucene - Overview - Jakarta Lucene
      (and 12 more)

If you look at the source code of Demo.java, there are plenty of things
apt for customization -- the number of results from each cluster and the
number of displayed clusters (I would cut it to some reasonable number,
say 10 or 15 -- the further a cluster is from the "top", the less likely
it is to be important). Also keep in mind that some Carrot2 components
produce hierarchical clusters. This demonstration works with the "flat"
version of the Lingo algorithm, so you don't need to worry about it.
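
To make the two display limits concrete, here is a minimal sketch of
rendering flat clusters in the same shape as the output above. It is not
taken from Demo.java: ClusterView, render, maxClusters and
maxDocsPerCluster are hypothetical names for this example, and a cluster
is reduced to a label plus a list of document titles.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for a flat cluster: a label and its document titles.
// This is not a Carrot2 type.
final class ClusterView {
    final String label;
    final List<String> titles;

    ClusterView(String label, List<String> titles) {
        this.label = label;
        this.titles = new ArrayList<String>(titles);
    }

    // Renders at most maxClusters clusters (assumed to be ordered from the
    // "top" down) and at most maxDocsPerCluster titles from each of them.
    static String render(List<ClusterView> clusters,
                         int maxClusters, int maxDocsPerCluster) {
        StringBuilder out = new StringBuilder();
        int shownClusters = Math.min(maxClusters, clusters.size());
        for (int i = 0; i < shownClusters; i++) {
            ClusterView c = clusters.get(i);
            out.append(" :> ").append(c.label).append('\n');
            int shownDocs = Math.min(maxDocsPerCluster, c.titles.size());
            for (int j = 0; j < shownDocs; j++) {
                out.append("    - ").append(c.titles.get(j)).append('\n');
            }
            int remaining = c.titles.size() - shownDocs;
            if (remaining > 0) {
                out.append("      (and ").append(remaining).append(" more)\n");
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        List<ClusterView> clusters = Arrays.asList(
            new ClusterView("Jakarta Lucene",
                Arrays.asList("Jakarta Lucene API", "Who We Are", "Overview", "Resources")),
            new ClusterView("Other", Arrays.asList("Misc")));
        // Show only the first cluster, with three documents from it.
        System.out.print(render(clusters, 1, 3));
    }
}
```

Cutting the list by position only makes sense because the clusters are
assumed to arrive ordered by importance, as described above.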

Hope this gets you started with using Carrot2 and Lucene.
Please let me know about any successes or failures.

Dawid

---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@(protected)
For additional commands, e-mail: lucene-user-help@(protected)

