We have a program that does the following:
It uses multiple threads to fetch web documents (by URL) from the internet, then:
1. Puts each document in documentdb.
2. Indexes the documents in batches (count = 100) and puts the results in indexdb.
3. Repeats 1 and 2.
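To make the setup concrete, here is a minimal sketch of the loop described above. It is not our actual code: the class name CrawlPipelineSketch and the methods fetch, storeDocument and indexBatch are hypothetical stand-ins for the real fetcher, documentdb and indexdb, and documents are represented as plain strings.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

class CrawlPipelineSketch {
    static final int BATCH_SIZE = 100; // batch count from the description

    final AtomicInteger indexedDocs = new AtomicInteger();
    final AtomicInteger indexedBatches = new AtomicInteger();
    private final List<String> pending = new ArrayList<>();

    // Hypothetical stand-in for the real HTTP fetch.
    String fetch(String url) { return "content of " + url; }

    // Hypothetical stand-in for the write to documentdb.
    void storeDocument(String url, String doc) { }

    // Hypothetical stand-in for the write to indexdb; only counts here.
    void indexBatch(List<String> docs) {
        indexedDocs.addAndGet(docs.size());
        indexedBatches.incrementAndGet();
    }

    // Steps 1 and 2 for a single URL.
    void handle(String url) {
        String doc = fetch(url);
        storeDocument(url, doc);
        List<String> full = null;
        synchronized (pending) {
            pending.add(doc);
            if (pending.size() >= BATCH_SIZE) {
                // Copy and clear so the shared list never exceeds one batch.
                full = new ArrayList<>(pending);
                pending.clear();
            }
        }
        if (full != null) {
            indexBatch(full); // index outside the lock
        }
    }

    // Step 3: repeat for every URL, using a fixed number of fetch threads.
    void crawl(List<String> urls, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (String url : urls) {
            pool.submit(() -> handle(url));
        }
        pool.shutdown();
        try {
            pool.awaitTermination(1, TimeUnit.MINUTES);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        // Index whatever is left over in the last, partial batch.
        List<String> rest;
        synchronized (pending) {
            rest = new ArrayList<>(pending);
            pending.clear();
        }
        if (!rest.isEmpty()) {
            indexBatch(rest);
        }
    }

    public static void main(String[] args) {
        List<String> urls = new ArrayList<>();
        for (int i = 0; i < 250; i++) {
            urls.add("http://example.com/page" + i);
        }
        CrawlPipelineSketch sketch = new CrawlPipelineSketch();
        sketch.crawl(urls, 10);
        System.out.println(sketch.indexedDocs.get() + " documents in "
                + sketch.indexedBatches.get() + " batches");
    }
}
```

In this sketch the pending list never holds more than 100 documents at a time; whether the real program keeps that property is exactly what I am unsure about.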
Usually I assign 10 threads to do the fetching and then give the program a website to crawl. For a small website, it works well.
For a big website, however, the threads are interrupted by the well-known
Java heap space OutOfMemoryError. I have tried raising the heap limit with -Xmx, but that does not seem to solve the problem.
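For reference, this is roughly how the heap limit can be checked from inside the JVM, to confirm that the -Xmx value really reaches the process that runs the crawler. The class name HeapCheck and its two helper methods are made up for illustration; Runtime.maxMemory(), totalMemory() and freeMemory() are standard JDK calls.

```java
class HeapCheck {
    // Maximum heap the JVM will attempt to use, in megabytes.
    static long maxHeapMb() {
        return Runtime.getRuntime().maxMemory() / (1024 * 1024);
    }

    // Heap currently in use (allocated minus free), in megabytes.
    static long usedHeapMb() {
        Runtime rt = Runtime.getRuntime();
        return (rt.totalMemory() - rt.freeMemory()) / (1024 * 1024);
    }

    public static void main(String[] args) {
        System.out.println("max heap:  " + maxHeapMb() + " MB");
        System.out.println("used heap: " + usedHeapMb() + " MB");
    }
}
```

Running it as, for example, java -Xmx1g HeapCheck should report a maximum close to the requested size. Logging usedHeapMb() periodically from the crawler would show whether usage grows steadily with the number of pages fetched.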
Any suggestions?
Thanks!