
Find Hex chars in a file

 
Srinivasa Raghavan
Ranch Hand
Posts: 1228
Hi all,
I have a very big ".txt" file that contains a few hex characters. I want to remove these hex characters from the file, so this is what I'm doing:
1. Read the file character by character.
2. Check the hex value of each char.
3. Append it to a StringBuffer if it is a valid one.

But this takes a very long time. Is there any way to do this faster?
 
Layne Lund
Ranch Hand
Posts: 3061
I don't know if this would be any quicker, but you might try reading in a whole line from the text file and using regular expressions to find the hex chars that you want. Check out the java.util.regex.Pattern and java.util.regex.Matcher classes for more information about regular expressions.
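A minimal sketch of the regex idea, for illustration only. The thread never says exactly which characters count as "hex characters", so this assumes they are the control characters below 0x20 other than tab, line feed and carriage return. The class name RegexStrip and the method strip() are made up for this example.

```java
import java.util.regex.Pattern;

// Hypothetical helper: removes unwanted control characters from one line.
final class RegexStrip {
    // Assumption: "hex characters" means control chars below 0x20,
    // excluding tab (0x09), line feed (0x0A) and carriage return (0x0D).
    // The pattern is compiled once and reused for every line.
    private static final Pattern UNWANTED =
            Pattern.compile("[\\x00-\\x08\\x0B\\x0C\\x0E-\\x1F]");

    static String strip(String line) {
        // replaceAll builds the result in a single pass over the line.
        return UNWANTED.matcher(line).replaceAll("");
    }
}
```

You would call RegexStrip.strip(line) on each line returned by BufferedReader.readLine() and write the result straight to the output file. Compiling the Pattern once, outside the loop, matters more than the choice between regex and a hand-written loop.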

Layne
 
Srinivasa Raghavan
Ranch Hand
Posts: 1228
Thanks Lund
 
David Patterson
Ranch Hand
Posts: 65
Another approach would be to read the whole file (or a line at a time) into a StringBuffer. Then loop using charAt() to get each char. Use a switch block to select the ones you want to fix and default to doing nothing. The ones you want to fix can be replaced with setCharAt().
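Roughly, the charAt()/switch/setCharAt() loop could look like the sketch below. The characters in the case labels are only an assumed example (curly quotes mapped to plain ASCII quotes); substitute whatever characters you actually need to fix. SwitchFixup and fixInPlace() are hypothetical names, and StringBuilder is used here as the unsynchronized equivalent of StringBuffer.

```java
// Hypothetical helper: replaces selected characters in place.
final class SwitchFixup {
    static void fixInPlace(StringBuilder sb) {
        for (int i = 0; i < sb.length(); i++) {
            switch (sb.charAt(i)) {
                // Assumed example mapping: curly single quotes to an apostrophe.
                case '\u2018':
                case '\u2019':
                    sb.setCharAt(i, '\'');
                    break;
                // Assumed example mapping: curly double quotes to a plain quote.
                case '\u201C':
                case '\u201D':
                    sb.setCharAt(i, '"');
                    break;
                default:
                    // Every other character is left untouched.
                    break;
            }
        }
    }
}
```

Because setCharAt() overwrites one position without shifting anything, this loop stays linear in the length of the buffer. It only works for one-for-one replacements, not for removing characters.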

By the way, here is a tool that was written to do a specific instance of this kind of fixup -- to correct bad HTML generated by MS Office tools.

http://www.fourmilab.ch/webtools/demoroniser/

Dave Patterson
 
Srinivasa Raghavan
Ranch Hand
Posts: 1228
Thanks Dave,
I followed the same approach: read line by line, process each line, and store it in a new file.

I want to remove some chars (very few in number), so I used deleteCharAt() in StringBuffer. The total file size is 15159087. I started this process 6 hours ago and it's still running. Is there a simple way, like replacing everything in one shot, so that the process completes within an hour?
 
Layne Lund
Ranch Hand
Posts: 3061
No, I don't think there is any faster way to do such a search. Even if you can hold the whole file in memory, you have to check each character one at a time. This algorithm is linear in nature. That is, the amount of time it takes is proportional to the number of characters in the file. Assuming that the program examines one character every second, it will take an hour to examine 3600 characters. Again, assuming that each character uses two bytes, that's a maximum file size of 7200 bytes (just over 7 KB). Of course, this would be a very slow computer by today's standards, but the issue still remains. If you have a large file, it takes a long time to process it. The overhead of reading one character at a time also increases the amount of time needed. That's why I suggested reading a line at a time using BufferedReader.

At the moment, I cannot think of an algorithm that will perform the search any faster. I'm quite certain that it can't be done any faster even with a regex. I suspect that using a regex might actually be slower.

Layne
 
Tommy Becker
Greenhorn
Posts: 8
This may be the wrong place to make this suggestion, but am I the only one who's thinking that Java may not be a particularly good tool for this job? You could throw together an equivalent script in Perl or even sed a lot faster and I bet the performance would be better.
 
Ernest Friedman-Hill
author and iconoclast
Sheriff
Posts: 24217
The algorithm you're using (with deleteCharAt()) isn't linear -- it's quadratic. Every time you delete a character, the StringBuffer has to move all the later characters up one space; this takes time proportional to the length of the String. The number of hex characters is also proportional to the String length; so you've got a quadratic algorithm.
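To make the difference concrete, here is a small sketch comparing the two approaches. It assumes the unwanted characters are control characters below 0x20 other than tab, line feed and carriage return; DeleteVsCopy and its method names are invented for this example, and StringBuilder stands in for StringBuffer.

```java
// Hypothetical comparison of the quadratic and the linear approach.
final class DeleteVsCopy {
    // Assumption: unwanted means a control char other than tab, LF or CR.
    static boolean isUnwanted(char c) {
        return c < 0x20 && c != '\t' && c != '\n' && c != '\r';
    }

    // Quadratic in the worst case: each deleteCharAt shifts every
    // later character left by one position.
    static String removeWithDelete(String text) {
        StringBuilder sb = new StringBuilder(text);
        for (int i = sb.length() - 1; i >= 0; i--) {
            if (isUnwanted(sb.charAt(i))) {
                sb.deleteCharAt(i);
            }
        }
        return sb.toString();
    }

    // Linear: each character is examined once and appended at most once.
    static String removeWithCopy(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (!isUnwanted(c)) {
                out.append(c);
            }
        }
        return out.toString();
    }
}
```

Both methods return the same string; only the amount of copying differs. On a short line the difference is negligible, but on a buffer of millions of characters the shifting in removeWithDelete() dominates the run time.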

The technique you described in your first message is probably much better, if the desired end result is that the whole corrected file ends up in memory. In that case, just be sure to preallocate a large-enough StringBuffer to avoid lots of reallocations (use the StringBuffer constructor that takes an int as a constructor argument, and make one as large as the original file) and also be sure to use buffered file input; i.e., use BufferedReader and then you could just use the character-at-a-time read() method. Finally, make sure your character test is fast -- i.e., test if the char is less than 0x20, for example, rather than looping over all 32 possible control characters.
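A minimal sketch of that recipe, under the same assumption that the unwanted characters are control characters below 0x20 other than tab, line feed and carriage return. ControlCharFilter, isUnwanted() and the expectedLength parameter are hypothetical names, and StringBuilder is used as the unsynchronized equivalent of StringBuffer.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;

// Hypothetical helper: filters a character stream into memory.
final class ControlCharFilter {
    // Fast test: one range comparison instead of looping over a list.
    // Assumption: tab, LF and CR are kept, other control chars are dropped.
    static boolean isUnwanted(int c) {
        return c < 0x20 && c != '\t' && c != '\n' && c != '\r';
    }

    // expectedLength should be about the size of the input (for example
    // the file length) so the builder does not have to grow repeatedly.
    static String filter(Reader in, int expectedLength) throws IOException {
        StringBuilder out = new StringBuilder(expectedLength);
        BufferedReader reader = new BufferedReader(in);
        int c;
        // read() returns -1 at end of stream; buffering makes the
        // character-at-a-time calls cheap.
        while ((c = reader.read()) != -1) {
            if (!isUnwanted(c)) {
                out.append((char) c);
            }
        }
        return out.toString();
    }
}
```

For a real file you would pass something like a FileReader for the input and the file length as expectedLength. If the cleaned text only needs to end up in another file, writing each kept character to a BufferedWriter instead of appending to a builder avoids holding the whole file in memory.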
 
Srinivasa Raghavan
Ranch Hand
Posts: 1228
Thanks again for your inputs EFH.
At one point I ended up with a java.lang.OutOfMemoryError. To overcome this I changed the algorithm a little bit and called System.gc().
 