We have a collection of blogs that we've crawled with Nutch, and I'd like to copy them to the UNIX filesystem. I'm using nutch readseg to copy each segment, but it sometimes dies with an OutOfMemoryError (below). The segment it dies on is about 500MB, whereas most of the other segments are 100-200MB. I've increased the max heap size on the slaves to 1500MB, but that hasn't helped. The slaves only have 500MB of physical RAM, so I'm going to get a lot of swapping if I push the heap size up any further.
Should I keep increasing the heap size until I can load the segment, or is there something else I can do? Surely readseg doesn't need to hold the whole segment in memory at once? We're using a Nutch snapshot from 2008-01-25.