Java Mailing List Archive

http://www.java2.5341.com/


Stripping Carriage Returns & Line Feeds?

Nick Tkach

2008-06-09


Is there an out-of-the-box way to get Nutch to remove carriage returns and/or line feeds from content as it parses? In a recent crawl of one of our sites, I'm finding \n characters in places where they don't belong, and I'd like to cut them out. In particular, when there's a \n in the middle of quoted text (such as "Some \n String"), the " characters come out in a browser as ?. As far as I can tell, the cause is that the source content is formatted strangely. I'm guessing this is a common problem and I'm just missing something?
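To make the desired transformation concrete, here is a minimal sketch in plain Java of the kind of cleanup being asked about. It is not a Nutch API or configuration option: the class name LineBreakStripper and the method stripLineBreaks are hypothetical, and where such a helper would be called during parsing is left open. It replaces each run of carriage returns and line feeds, together with the whitespace around it, with a single space.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of Nutch: collapses line breaks in extracted text.
public final class LineBreakStripper {

    // One or more CR/LF characters, plus any spaces or tabs before them
    // and any whitespace after them.
    private static final Pattern LINE_BREAKS =
            Pattern.compile("[ \\t]*[\\r\\n]+\\s*");

    private LineBreakStripper() {
        // Utility class; no instances.
    }

    // Returns the text with every run of line breaks replaced by one space.
    public static String stripLineBreaks(String text) {
        if (text == null) {
            return null;
        }
        return LINE_BREAKS.matcher(text).replaceAll(" ");
    }

    public static void main(String[] args) {
        // The example from the question: a \n inside quoted text.
        System.out.println(stripLineBreaks("\"Some \n String\""));
        // prints "Some String" (including the quotes)
    }
}
```

Replacing with a single space rather than an empty string keeps words on either side of a line break from running together. Note that this only removes the line breaks; it does not address how the " characters are encoded or displayed.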