
Parsing a huge File

 
Holger Prause
Ranch Hand
Posts: 47
Hi,

I have to parse a huge CSV file and strip every ' character from it. The file is 70 MB and I have 176 MB of free memory, but I get a java.lang.OutOfMemoryError. I must be doing something wrong.
The following shows what I am doing:
<pre>
public void parse(File file) throws IOException {
    String line;
    BufferedReader reader = new BufferedReader(new FileReader(file));
    BufferedWriter writer = new BufferedWriter(new FileWriter("/somepath/somefile"));
    while ((line = reader.readLine()) != null) {
        // strip all ' characters from the line before writing it out
        writer.write(replace(line, "'", ""));
        writer.newLine();
    }
    reader.close();
    writer.flush();
    writer.close();
}

public String replace(String original, String searchFor, String replaceWith) {
    String orig = original;
    StringBuffer changed = new StringBuffer("");
    int indexof;
    while ((indexof = original.lastIndexOf(searchFor)) != -1) {
        changed.append(orig.substring(0, indexof))
               .append(replaceWith)
               .append(orig.substring(indexof + searchFor.length()));
    }
    return changed.toString();
}
</pre>
I think my replace method is the problem.
Now I have the idea that I don't need the replace method at all; instead I could use a reader that reads characters and, whenever the specified character occurs, simply not write it out.
Which Reader should I use, and is my idea right?
Can you please post a code example?
Thanks a lot,
Holger
 
Jim Yingst
Wanderer
Sheriff
Posts: 18671
I think the problem is that the while loop in your replace() method is infinite. Since the String referenced by "original" never changes, original.lastIndexOf(searchFor) just keeps finding the same index every time and never returns -1 - and the loop appends more text to the StringBuffer on each pass, eventually leading to the OutOfMemoryError.

I suggest you work on the replace() method by itself, not called through parse(). Just write a main() method that calls replace("sample a", "a", "i") and prints the result (it should be "simple i"). Then you can focus on the one method without worrying about the other - if it throws OutOfMemoryError, you have a much better idea where the problem is. Then add some print statements inside the while loop:
<code><pre> System.out.println("indexof: " + indexof);
System.out.println("changed: " + changed);</pre></code>
This will give you a better idea of just what your loop is doing each time through it. As you work on the loop, you may also want to consider the String methods indexOf(String, int) and lastIndexOf(String, int), so that you don't just find the same substring each time, as well as the StringBuffer method replace(int, int, String), which will do some of the work for you. Good luck.
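If it helps, here is a rough sketch of a corrected replace() along those lines (just an illustration using indexOf(String, int) to advance past each match - not tested against a 70 MB file):
<code><pre>
// Sketch: keep a start index and move it past each match, so the loop always terminates.
public static String replace(String original, String searchFor, String replaceWith) {
    StringBuffer changed = new StringBuffer();
    int start = 0;
    int indexof;
    while ((indexof = original.indexOf(searchFor, start)) != -1) {
        changed.append(original.substring(start, indexof)); // text before the match
        changed.append(replaceWith);                        // the replacement
        start = indexof + searchFor.length();               // continue after the match
    }
    changed.append(original.substring(start));              // the remaining tail
    return changed.toString();
}
</pre></code>
Calling replace("sample a", "a", "i") on this should print "simple i", which is a quick way to check that the loop really terminates.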
[This message has been edited by Jim Yingst (edited January 13, 2001).]
 
Peter Tran
Bartender
Posts: 783
Holger,
Try this on your large CSV file.

Let me know how fast it runs. It can use some more tweaking to make it run faster, but try this solution first.
-Peter
[This message has been edited by Peter Tran (edited January 14, 2001).]
 
Holger Prause
Ranch Hand
Posts: 47
Hey, thanks to both of you for helping. It's working now. The time it took me to parse the file is 22 seconds, so it's OK now. Thank you very much.

But there's one question left: I have to find out when the 5000th line is reached, and then I have to create a new output file.
But I'm now reading the data in with a char array. How do I find out which line I am at?
Thanks again,
Holger
 
Peter Tran
Bartender
Posts: 783
Holger,
Which solution are you using? You can keep a count of a character that appears exactly once per line - for example, there should be only one newline character per line. Once you hit 5000, you can close the current file and create a new one.
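For example, a rough sketch of that idea (illustrative only - the class and method names here are made up, and this isn't the code posted earlier) might look like:
<code><pre>
import java.io.*;

// Sketch: copy a Reader into numbered output files, dropping every '
// character and starting a new file after every 5000 newline characters.
public class SplitCopy {
    public static void split(Reader in, String prefix) throws IOException {
        final int MAX_LINES = 5000;    // lines per output file
        char[] buf = new char[8192];
        int lineCount = 0, fileIndex = 0, n;
        Writer out = new BufferedWriter(new FileWriter(prefix + fileIndex + ".csv"));
        while ((n = in.read(buf)) != -1) {
            for (int i = 0; i < n; i++) {
                char c = buf[i];
                if (c == '\'') {
                    continue;          // skip the character we want to strip
                }
                out.write(c);
                if (c == '\n' && ++lineCount == MAX_LINES) {
                    out.close();       // roll over to the next output file
                    fileIndex++;
                    out = new BufferedWriter(new FileWriter(prefix + fileIndex + ".csv"));
                    lineCount = 0;
                }
            }
        }
        out.close();
    }
}
</pre></code>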
-Peter
 
Peter Tran
Bartender
Posts: 783
Can you zip up your input file and send it to me (if it doesn't contain confidential information)? I would like to try some different solutions and see how they perform.
Thanks,
-Peter
P.S. I will post my results if you send me your input. Remember to zip it up, because a 76 MB file is pretty large to send over email.
 
Holger Prause
Ranch Hand
Posts: 47
I am using Peter's solution for my huge CSV file.
But thanks also to Jim - he showed me that my replace method was absolute nonsense.
Thanks.
Bye,

Holger
 
Jim Yingst
Wanderer
Sheriff
Posts: 18671
I'd be inclined to take Peter's filter() method and adapt it into a new FilterReader class, CommaFilter. Then you can do something like
<code><pre>BufferedReader reader = new BufferedReader(new CommaFilter(new FileReader(file)));</pre></code>
...and then take advantage of BufferedReader's readLine() method to count lines. This way you get a nice clean separation between the character-filtering functionality and the line-counting functionality. I imagine it will be a little slower than Peter's version (since it would create a String object for each line), but probably not by much - file I/O is the main delay here, I expect; the other parts of the system are probably fast enough to keep up without complaint. You can also experiment with different orderings of the Readers, or with additional buffers (should there be a BufferedReader between the FileReader and the CommaFilter for speed?).
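For what it's worth, a bare-bones sketch of such a filter (assuming it simply drops the ' characters - this is not Peter's actual filter() code) might look like:
<code><pre>
import java.io.*;

// Sketch of a FilterReader that drops every ' character from the stream.
public class CommaFilter extends FilterReader {

    public CommaFilter(Reader in) {
        super(in);
    }

    // Single-character read: skip over any ' characters.
    public int read() throws IOException {
        int c;
        do {
            c = super.read();
        } while (c == '\'');
        return c;
    }

    // Block read: read a chunk from the underlying Reader, then compact the
    // buffer so the ' characters disappear. Loops if an entire chunk was filtered out.
    public int read(char[] cbuf, int off, int len) throws IOException {
        int kept = 0;
        while (kept == 0) {
            int n = super.read(cbuf, off, len);
            if (n == -1) {
                return -1;             // end of the underlying stream
            }
            for (int i = 0; i < n; i++) {
                if (cbuf[off + i] != '\'') {
                    cbuf[off + kept++] = cbuf[off + i];
                }
            }
        }
        return kept;
    }
}
</pre></code>
Whether an extra BufferedReader between the FileReader and the CommaFilter actually helps is exactly the kind of thing that would need benchmarking.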
 
Peter Tran
Bartender
Posts: 783
Jim,
I'll take your suggestion into consideration when I try some tweaks to get some benchmarks. It just takes so much time to get accurate benchmarks. *sigh*
Thanks for the suggestion.
-Peter
 
Jim Yingst
Wanderer
Sheriff
Posts: 18671
"It just takes so much time to get accurate benchmarks."
Sure. That's why I'm just kibitzing from the sidelines, not volunteering to find out myself. Let me know what you find out though.
 
Thomas Paul
mister krabs
Ranch Hand
Posts: 13974
I go along with Jim. My first thought was that this is crying out for a FilterReader.
 