
Normalizing host names (e.g. www1|www2 => www)

ogjunk-nutch

2008-04-25

Hello,

How are people avoiding page duplication where the URLs are similar but the
content is identical? I know there is page content fingerprinting and
shingling (MD5Signature and TextProfileSignature), but that assumes you have
already fetched the content. I am wondering if it's possible to detect such
duplicates earlier than that, even if it's not 100% reliable.

Concretely, imagine the following URLs:
  http://example.com
  http://www.example.com
  http://www1.example.com
  http://www2.example.com

They are all, very likely, pointing to the same page. One person may link to
www.example.com and another may link to just example.com, so we end up with
2 different URLs when ideally we'd want a single URL for each page. Similarly,
the example site may have multiple web servers (e.g. for load balancing), each
with a slightly different name (e.g. www1, www2), all serving the same site.

What's the best way to treat www1 and www2 as just www?
Are people using regex-normalize.xml for that?
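
To make the idea concrete, here is a minimal sketch in plain Java of the kind
of rewrite I have in mind. It is not Nutch code: the class name HostNormalizer
and the method normalize are made up for illustration, and it only handles the
www1/www2 case. It deliberately leaves bare hosts such as example.com alone,
since blindly prepending "www." would also rewrite unrelated subdomains.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of Nutch: collapses "www<digits>." host
// prefixes to "www." before a URL is fetched.
final class HostNormalizer {

    // Group 1 keeps the scheme. The lookahead requires another dot in the
    // remaining host, so a host like "www123.com" is left untouched.
    private static final Pattern NUMBERED_WWW = Pattern.compile(
            "^(https?://)www\\d+\\.(?=[^/]*\\.)",
            Pattern.CASE_INSENSITIVE);

    private HostNormalizer() {
    }

    // Returns the URL with a leading "www<digits>." rewritten to "www.";
    // any URL that does not match is returned unchanged.
    static String normalize(String url) {
        return NUMBERED_WWW.matcher(url).replaceFirst("$1www.");
    }
}
```

With this sketch, http://www1.example.com and http://www2.example.com/page
become http://www.example.com and http://www.example.com/page, while
http://example.com and http://www.example.com come back unchanged. Presumably
the same pattern and substitution could be written as a rule in
regex-normalize.xml, but I have not verified that.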

Thanks,
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

