List: nutch-user.lucene

Help getting the entire <a> element in the anchor field, instead of just the anchor text, for a fetched page.

Ismael

2008-07-07


Hello. I need to get the links that Nutch followed to reach a page: something
like the anchors, but containing the full markup of each link instead of
only the link text.

I don't know whether this can be done by building a plugin, or whether I must
modify the Nutch code to get this information. I have gone through the Nutch
code and still haven't found where this information is collected, but I am
working on it.


As an example, what I need is that, given the following link:

<a href="/main.html" title="Title"><img src="/src.gif" border=0
style="background-position:bottom;"> </a>

when I read the anchor field of the fetched "/main.html" page in the
Nutch index, the value should be the entire <a href...></a> element.


I really only need the <img> tag, so if that is easier to get, that
solution would also help me.
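
To make the expected result concrete, here is a rough standalone sketch in
plain Java (JDK only, no Nutch classes) of the extraction I am after. The
class and method names are made up for illustration, and the regular
expressions are only an assumption that holds for simple markup like the
example above; inside Nutch this would presumably have to work on the parsed
page rather than on the raw string.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Nutch: it only illustrates the desired output.
class LinkMarkupSketch {

    // Whole <a ...>...</a> element with a quoted href; group 1 is the href value.
    // Assumption: anchors are not nested and the href is quoted.
    private static final Pattern ANCHOR = Pattern.compile(
            "<a\\b[^>]*\\bhref\\s*=\\s*[\"']([^\"']*)[\"'][^>]*>.*?</a\\s*>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // A single <img ...> tag, possibly spanning several lines.
    private static final Pattern IMG =
            Pattern.compile("<img\\b[^>]*>", Pattern.CASE_INSENSITIVE);

    // Maps each link target (href) to the full markup of every link pointing at it.
    static Map<String, List<String>> anchorMarkupByTarget(String html) {
        Map<String, List<String>> result = new LinkedHashMap<>();
        Matcher m = ANCHOR.matcher(html);
        while (m.find()) {
            result.computeIfAbsent(m.group(1), k -> new ArrayList<>()).add(m.group());
        }
        return result;
    }

    // Returns only the <img> tags found inside one link's markup.
    static List<String> imgTags(String anchorMarkup) {
        List<String> tags = new ArrayList<>();
        Matcher m = IMG.matcher(anchorMarkup);
        while (m.find()) {
            tags.add(m.group());
        }
        return tags;
    }

    public static void main(String[] args) {
        String page = "<p>intro</p>\n"
                + "<a href=\"/main.html\" title=\"Title\"><img src=\"/src.gif\" border=0\n"
                + "style=\"background-position:bottom;\"> </a>";
        for (Map.Entry<String, List<String>> e : anchorMarkupByTarget(page).entrySet()) {
            for (String markup : e.getValue()) {
                System.out.println(e.getKey() + " -> " + markup);
                System.out.println("  img only: " + imgTags(markup));
            }
        }
    }
}
```

So for "/main.html" I would like the index to hold either the first value
(the whole link) or, failing that, just the second one (the <img> tag).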

Any help would be appreciated; thanks for reading.