
RegEx, CharBuffer vs String performance

 
Jonathan Gerrish
Greenhorn
Posts: 24
Hi,

I am building a module to extract data from text files using the java.util.regex package. I have to process thousands of text documents, so performance is critical.

I define regular expressions to match regions of the file, and further regular expressions to match fields within those regions.

Since the regex package works against the CharSequence interface, I essentially have the choice of using either the CharBuffer or the String implementation of that interface.

Since I will be passing successful matches for regions to other regular expressions to extract subfields, I have to do one of the following:

1) Using Strings:

private static String getMatch(String patternStr, String input)
{
    String result = null;
    Pattern pattern = getPattern(patternStr);
    Matcher matcher = pattern.matcher(input);
    if (matcher.find())
    {
        result = input.substring(matcher.start(), matcher.end());
    }
    return result;
}

2) Using CharBuffer:

private static CharSequence getMatch(String patternStr, CharSequence input)
{
    CharSequence result = null;
    Pattern pattern = getPattern(patternStr);
    Matcher matcher = pattern.matcher(input);
    if (matcher.find())
    {
        result = input.subSequence(matcher.start(), matcher.end());
    }
    return result;
}
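
For reference, getPattern() in both versions is the helper where I compile and cache the expressions. A minimal sketch of what such a helper can look like (the HashMap cache here is illustrative, not my exact code):

// Sketch of a compile-and-cache helper; needs java.util.HashMap,
// java.util.Map and java.util.regex.Pattern.
private static final Map<String, Pattern> patternCache = new HashMap<String, Pattern>();

private static synchronized Pattern getPattern(String patternStr)
{
    Pattern pattern = patternCache.get(patternStr);
    if (pattern == null)
    {
        pattern = Pattern.compile(patternStr);
        patternCache.put(patternStr, pattern);
    }
    return pattern;
}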

I would have thought that CharBuffer would be the obvious choice, since each CharSequence created by CharBuffer.subSequence() is a new object backed by the same buffer, whereas String.substring() supposedly creates a new String object and, because Strings are immutable, makes a new copy of the data.

However, when I perform a simple test, reading a file as follows and parsing it with several levels of regular expressions over 10,000 iterations, the String version always comes out faster. In the test I use the following method, shamelessly lifted from the Java Almanac; the regular expressions I compile and cache. Does anyone have any thoughts on what I could be doing wrong here? It just seems wrong that creating new Strings could be faster in this situation...

Thanks in advance, Jonathan

public static CharSequence fromFile(String filename) throws IOException {
    FileInputStream fis = new FileInputStream(filename);
    try {
        FileChannel fc = fis.getChannel();

        // Map the file read-only and decode it into a CharBuffer
        ByteBuffer bbuf = fc.map(FileChannel.MapMode.READ_ONLY, 0, (int) fc.size());
        CharBuffer cbuf = Charset.forName("8859_1").newDecoder().decode(bbuf);
        return cbuf;
    } finally {
        fis.close(); // release the underlying stream and channel
    }
}
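
The timing loop itself boils down to something like this (a sketch, not the exact harness; the pattern string and file name are placeholders, and it assumes the getMatch() overloads and fromFile() above are in scope):

public static void main(String[] args) throws IOException {
    CharSequence cbuf = fromFile("spool.txt");  // placeholder file name
    String str = cbuf.toString();

    long start = System.currentTimeMillis();
    for (int i = 0; i < 10000; i++) {
        getMatch("REGION.*?END", str);          // String overload
    }
    System.out.println("String:     " + (System.currentTimeMillis() - start) + " ms");

    start = System.currentTimeMillis();
    for (int i = 0; i < 10000; i++) {
        getMatch("REGION.*?END", cbuf);         // CharSequence overload
    }
    System.out.println("CharBuffer: " + (System.currentTimeMillis() - start) + " ms");
}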
 
Jonathan Gerrish
Greenhorn
Posts: 24
This article describes how new String objects created using the String.substring() method actually share the same underlying char[] array. It looks like substring() functions in the same way as CharBuffer.subSequence(), and that the memory won't be released until the last referring String object has been garbage collected.

http://www.cs.technion.ac.il/~genadyb/strings/strings.html
 
Jim Yingst
Wanderer
Sheriff
Posts: 18671
Based on the fact that String ends up reusing its internal character array, I would expect CharBuffers and Strings to be somewhat comparable in performance. CharBuffer can be faster in certain circumstances, but it's by no means guaranteed. In your test, I suspect the main reason the NIO solution is slower is that you're performing a map(). The documentation for that method tells us it is relatively inefficient for small files (e.g. "a few tens of kilobytes" or less) and is generally only worthwhile for large files. So if you're testing with a small file, there's your problem. Try running the test with a 10 MB text file and see what happens.

The map() method is relatively simple to use, but there are other ways to get a ByteBuffer; e.g. you could read() into it. I suspect that for small files, the best performance you'll see is pretty close to what you'd get with normal IO and Strings. NIO tends to work best for high-volume operations (or highly concurrent ones). For many "normal" jobs, it's just not worth bringing NIO into the picture.
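
For example, a read()-based variant of the fromFile() method above might look like this (a sketch with minimal error handling):

public static CharSequence fromFileRead(String filename) throws IOException {
    FileInputStream fis = new FileInputStream(filename);
    try {
        FileChannel fc = fis.getChannel();

        // Read the whole file into a heap buffer instead of mapping it
        ByteBuffer bbuf = ByteBuffer.allocate((int) fc.size());
        while (bbuf.hasRemaining() && fc.read(bbuf) != -1) {
            // loop until the buffer is full or we hit end-of-file
        }
        bbuf.flip(); // switch from writing into the buffer to reading from it

        return Charset.forName("8859_1").newDecoder().decode(bbuf);
    } finally {
        fis.close();
    }
}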
 
Jonathan Gerrish
Greenhorn
Posts: 24
Thanks for the insight, Jim. What I actually want to achieve is a parsing engine for printer spools, driven by regular expressions. The input is a very large text file, possibly 10 MB+, full of repeating records, like invoices in human-readable text format, one after the other. My idea is to have an "iterator" regular expression that defines the start and end of a record; you would specify a buffer size you know is bigger than the biggest single record on the spool, probably 2-3 KB max. Once one record has been matched, I would apply further expressions to extract section and field data. I am thinking of firing all of these as SAX events, which can later be transformed to a standard XML format, digitally signed, and so on. A rough sketch of the iteration idea is below.
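
Something like the following, where RECORD_PATTERN and handleRecord() are placeholder names, not code I have yet:

// Needs java.util.regex.Matcher and java.util.regex.Pattern.
static final Pattern RECORD_PATTERN =
        Pattern.compile("(?s)START OF INVOICE.*?END OF INVOICE");

static void iterateRecords(CharSequence buffer)
{
    Matcher matcher = RECORD_PATTERN.matcher(buffer);
    int lastEnd = 0;
    while (matcher.find())
    {
        CharSequence record = buffer.subSequence(matcher.start(), matcher.end());
        handleRecord(record); // apply section/field expressions, fire SAX events
        lastEnd = matcher.end();
    }
    // Anything after lastEnd is a partial record: carry it over and prepend
    // it to the next buffer-full read from the stream.
}

static void handleRecord(CharSequence record)
{
    // placeholder: extract sections/fields here and emit SAX events
}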

I'd be grateful for any comments on my overall design, and for any hints on the best way to read data from a stream into a buffer for processing by regular expressions in this way.

Thanks in advance, Jonathan.
 