
collections

 
rahul Delhi
Greenhorn
Posts: 2
Hi friends,
I have a file containing millions of words, one per line, and I have to find the duplicate words in it along with their occurrence counts.
I am using a TreeSet collection, but after storing around 500,000 words it fails with an error: running out of heap space.

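import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.List;
import java.util.TreeSet;

class WordType implements Comparable<WordType> {
    String word = null;
    int no_Of_Occur = 0;
    List list = null;

    // Order words alphabetically, ignoring case.
    public int compareTo(WordType obj) {
        return word.compareToIgnoreCase(obj.word);
    }
}

class DuplicateWords {

    TreeSet<WordType> wordTreeSet = new TreeSet<WordType>();

    // Reads one word per line and stores each distinct word in the tree set.
    void readWords(String fileName) throws IOException {
        BufferedReader reader = new BufferedReader(new FileReader(fileName));
        String line;
        while ((line = reader.readLine()) != null) {
            WordType obj = new WordType();
            obj.word = line;
            System.out.println("Word read:" + obj.word);
            wordTreeSet.add(obj);
        }
        reader.close();
    }

    public static void main(String[] args) throws Exception {
        DuplicateWords dwObj = new DuplicateWords();
        dwObj.readWords(args[0]); // input file path from the command line
    }
}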

Please help with this.
 
Barry Gaunt
Ranch Hand
Posts: 7729
"rahul kk k kkk", please read our Java Ranch Naming Policy and change your displayed name to comply with it.

This is not a topic specific to SCJP, so I am moving it to our Java In General (Intermediate) forum...
 
Stan James
(instanceof Sidekick)
Ranch Hand
Posts: 8791
I'd cheat and use several programs, some of which you probably already have with the OS or your favorite toolkit. Split the file into a new file or pipe stream with one word per line, sort them into another file or stream, then count dupes as they come through together.

input file | word splitter | sort | dupe counter

This loses the location information unless your splitter can put that on the line with each word.
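The last stage of that pipeline could be sketched roughly as below. This is only an illustration, not a tested tool: the class name DupeCounter, the method countDupes, and the tab-separated output format are all made up for the example, and it assumes the earlier stages have already produced sorted input with one word per line. Because equal words arrive on adjacent lines, only the current word and its running count need to be held in memory.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.PrintWriter;
import java.io.Reader;
import java.io.Writer;

// Hypothetical "dupe counter" stage. Assumes the input is already sorted,
// one word per line, so equal words sit on adjacent lines.
public class DupeCounter {

    // Writes "word<TAB>count" for every word that occurs more than once.
    static void countDupes(Reader in, Writer out) throws IOException {
        BufferedReader reader = new BufferedReader(in);
        PrintWriter writer = new PrintWriter(out);
        String previous = null;
        int count = 0;
        String line;
        while ((line = reader.readLine()) != null) {
            if (line.equals(previous)) {
                count++;                       // same word as the line before
            } else {
                if (count > 1) {
                    writer.println(previous + "\t" + count);
                }
                previous = line;               // start a new run
                count = 1;
            }
        }
        if (count > 1) {                       // report the final run
            writer.println(previous + "\t" + count);
        }
        writer.flush();
    }

    public static void main(String[] args) throws IOException {
        countDupes(new InputStreamReader(System.in), new PrintWriter(System.out));
    }
}
```

If the file really is one word per line, something like "sort words.txt | java DupeCounter" would be enough on a system with a sort utility (words.txt is a placeholder name). The comparison here is case sensitive; to ignore case as the original compareTo does, the words would need to be lowercased, or sorted case-insensitively and compared with equalsIgnoreCase.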

----

Editing to add a reference to an old favorite of mine. A Ternary Search Tree is a very fast way to store words, and only stores the differences between them. That is, PART and PARTICLE would share the PART. That just might save enough memory to run your first test file, but still blow up on a bigger one later. Or it might take more memory from the get go for all those one-letter nodes.
[ October 06, 2006: Message edited by: Stan James ]
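For anyone curious what the ternary search tree idea looks like in Java, here is a minimal sketch that keeps a count at the node where each word ends. The class name TernaryWordCounter and its add and count methods are invented for this example, and it stores only counts, not locations. It is a bare-bones version with no balancing, so treat it as a starting point rather than a finished implementation.

```java
// Minimal ternary search tree sketch that counts occurrences per word.
// Class and method names are illustrative only.
public class TernaryWordCounter {

    private static final class Node {
        final char ch;
        Node lo, eq, hi;   // smaller char, next char of the word, larger char
        int count;         // greater than 0 only where a stored word ends
        Node(char ch) { this.ch = ch; }
    }

    private Node root;

    // Adds one occurrence of word and returns its new count.
    public int add(String word) {
        if (word.isEmpty()) {
            throw new IllegalArgumentException("empty word");
        }
        if (root == null) root = new Node(word.charAt(0));
        Node node = root;
        int i = 0;
        while (true) {
            char c = word.charAt(i);
            if (c < node.ch) {
                if (node.lo == null) node.lo = new Node(c);
                node = node.lo;
            } else if (c > node.ch) {
                if (node.hi == null) node.hi = new Node(c);
                node = node.hi;
            } else if (i == word.length() - 1) {
                return ++node.count;           // last character: word ends here
            } else {
                i++;                           // matched: move to the next character
                if (node.eq == null) node.eq = new Node(word.charAt(i));
                node = node.eq;
            }
        }
    }

    // Returns how many times word has been added, or 0 if never.
    public int count(String word) {
        Node node = root;
        int i = 0;
        while (node != null && !word.isEmpty()) {
            char c = word.charAt(i);
            if (c < node.ch) node = node.lo;
            else if (c > node.ch) node = node.hi;
            else if (i == word.length() - 1) return node.count;
            else { i++; node = node.eq; }
        }
        return 0;
    }
}
```

With this layout PART and PARTICLE share the four nodes for P, A, R and T. Each node is still a full Java object holding three references, a char and an int, so whether it actually beats the TreeSet on memory would have to be measured against the real file.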
 