The problem does not lie in the
Java options, nor does it have anything to do with the fact that you use many string operations - you can safely do that. The problem lies in the design. If you have a good development environment, try to run the program in a debugger and see what happens. To give an example:
readLine() reads arbitrary input
lines, but the program treats them as if they were words. Actually, it is far worse: the program treats them as if they were regular expressions. If the program reads e.g. a line containing only an "e", it will try to match "e" against every input line. And
every "e" will match even if it is in the middle of a
word. That's a lot of matches. Now imagine how many matches will occur if the program reads a blank line.
The explosive number of matches, combined with the fact that the program stores every match in the dictionary, causes the OutOfMemoryError.
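To see how explosive this gets, here is a minimal sketch (the class name and the sample line are mine, not from the original program) that counts how often a single letter, and then a blank line, match when compiled as regular expressions:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MatchDemo {
    // count how many times the regex matches inside one line
    static int countMatches(String regex, String line) {
        Matcher m = Pattern.compile(regex).matcher(line);
        int count = 0;
        while (m.find()) count++;
        return count;
    }

    public static void main(String[] args) {
        String line = "The elephant entered";
        // "e" as a regex matches anywhere, even in the middle of a word
        System.out.println(countMatches("e", line)); // 6 matches in this one line
        // a blank line compiled as a regex matches at every character boundary
        System.out.println(countMatches("", line));  // 21 matches for a 20-character line
    }
}
```

Multiply that by every line in a multi-megabyte file and the memory use is no surprise.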
To put it another way, what the program does is this:

1) For each line l in the file, do this:
   a) Store the line in the dictionary.
   b) For each previously read line m in the dictionary, do this:
      x) If m matches (~= contains) l, record that l was seen at the current file position.

What I guess it should do is this:
1) For each word w in the file, do this:
   a) Record that w was seen at the current file position.

With this simpler structure, the program will be shorter and faster, and will no doubt be able to organize files of several megabytes without needing special Java options.
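The simpler structure above could look something like this (names are mine; in the real program the lines would come from readLine(), but a String stands in here so the sketch is self-contained):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;

public class WordIndexer {
    // Map each word to the positions (here: line numbers) where it was seen.
    static Map<String, List<Integer>> index(String text) {
        Map<String, List<Integer>> index = new HashMap<>();
        int lineNumber = 0;
        for (String line : text.split("\n")) {
            lineNumber++;
            // break the line into words instead of treating it as a regex
            StringTokenizer tokens = new StringTokenizer(line);
            while (tokens.hasMoreTokens()) {
                String word = tokens.nextToken();
                index.computeIfAbsent(word, k -> new ArrayList<>()).add(lineNumber);
            }
        }
        return index;
    }
}
```

Memory use is now proportional to the number of distinct words plus the number of occurrences, not to the number of regex matches between all pairs of lines.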
I assume the first step you need to take is to find a clean way to break the input into words. You can use a StringTokenizer for this purpose. You can also use regexps, but if you choose to do that, you should only need one generic regular expression that will match any word.
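A sketch of that regexp route, assuming "\w+" (letters, digits, underscore) is an acceptable definition of "word" for your input:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class WordSplitter {
    // one generic pattern that matches any word; "\\w+" is an assumed
    // definition of "word" here - adjust it to your input
    private static final Pattern WORD = Pattern.compile("\\w+");

    static List<String> words(String line) {
        List<String> result = new ArrayList<>();
        Matcher m = WORD.matcher(line);
        while (m.find()) {
            result.add(m.group());
        }
        return result;
    }
}
```

Note the direction is reversed compared to the original program: one fixed pattern is matched against the input, rather than the input being compiled into patterns.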
Hope this helps!
[ April 21, 2004: Message edited by: Nicky Bodentien ]