Getting an index from a parser
Is it possible to get the index of a tag when you break on a START_ELEMENT event using StAX (or some other parser)?

For instance, if I had a file that was 1000 characters long and the first <record> tag began at character 12, is there a way for StAX (or some other parser) to tell me that? I can't use DOM because my files are too big, and right now I'm having to manually read the file a byte at a time so that I can get these indexes.


Yes, that's what a Locator is for. org.xml.sax.Locator
Cool, but can I use that with a StAX parser, or do I have to use SAX?
On second glance that won't work for me. I need the index, not the line number and column number.

Any other ideas?
XMLStreamReader.getLocation() returns a javax.xml.stream.Location. Use its getCharacterOffset() to get the current offset in the stream (a byte offset for a byte stream, a character offset for a character source).

Also, XMLEventReader.nextEvent() returns XMLEvent, which has a getLocation() to get a Location object.
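
A minimal sketch of that approach, in case it helps. The class name StartElementOffsets and the method offsetsOf are made up for illustration, and it reads from a String only to stay self-contained. Note that whether the reported offset points at the start of the tag or just past it seems to depend on the StAX implementation, so treat the exact values as something to verify against your own files.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper: collects the offset reported by the parser
// each time it stops on a START_ELEMENT with the given local name.
class StartElementOffsets {

    static List<Integer> offsetsOf(String xml, String tagName) throws XMLStreamException {
        // Uses whatever StAX implementation the factory lookup finds.
        XMLInputFactory factory = XMLInputFactory.newInstance();
        XMLStreamReader reader = factory.createXMLStreamReader(new StringReader(xml));
        List<Integer> offsets = new ArrayList<>();
        try {
            while (reader.hasNext()) {
                int event = reader.next();
                if (event == XMLStreamConstants.START_ELEMENT
                        && tagName.equals(reader.getLocalName())) {
                    // The source here is a Reader, so this is a character offset.
                    // The exact position within the tag is implementation-dependent.
                    offsets.add(reader.getLocation().getCharacterOffset());
                }
            }
        } finally {
            reader.close();
        }
        return offsets;
    }

    public static void main(String[] args) throws XMLStreamException {
        String xml = "<root><record>a</record><record>b</record></root>";
        System.out.println(offsetsOf(xml, "record"));
    }
}
```

For a real file you would pass a FileInputStream (or a Reader over it) to createXMLStreamReader instead of the StringReader.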
So why not count the elements as the events come through to get your "index"?

I found that after my last post but I can't seem to get the correct index from it.

With my byte-by-byte parser I get this as my first eleven offsets:
[3235, 6467, 9699, 12931, 16163, 19395, 22627, 25859, 29091, 32323, 35555]

The location object gives me this. They aren't even in order and I know for a fact that there are none before 3235, so why are there numbers lower than that?
[3235, 6467, 1513, 4745, 7977, 3025, 6257, 1302, 4534, 7766, 2810]

Here is my code:

Are you using a BufferedInputStream around a FileInputStream?

Karthik Shiraly wrote: Are you using a BufferedInputStream around a FileInputStream?

My location object is an MXParser, and the correct number can be obtained by adding the bufAbsoluteStart value to the number that is returned.

How do I get the Location instance to return THAT number?
Was browsing the MXParser source code. You're right, it's maintaining its own temporary character buffer with a size limit, and the offset returned is into that buffer. I feel this isn't a correct implementation by MXParser. Guess you'll have to do a dirty workaround of subclassing MXParser and adding bufAbsoluteStart to super.getCharacterOffset().
Sun's implementation works fine.
What is Sun's implementation of this?

Thanks for your help btw
You're welcome!
The Sun JRE comes with a default implementation of StAX. I get the correct file positions in ascending order with getCharacterOffset() for a 14 MB XML file.
Your app launcher must be overriding Sun's implementation with the codehaus one by setting -Djavax.xml.stream.XMLInputFactory.
I'm using Groovy; I guess I should have said that up front. That is why I'm getting the codehaus implementation.
What implementation of XMLStreamReader do you get? Which Location implementation?

I tried using Woodstox instead of the codehaus StAX implementation and it seems to be working correctly.
I get "com.sun.org.apache.xerces.internal.impl.XMLStreamReaderImpl". So Sun's implementation is Xerces-based.
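
If anyone wants to check which implementation their own runtime picks up, something like the following should do it. This is only a sketch; the class name StaxImplementationCheck and the method describe are invented for the example, and the names printed will depend on your classpath and on any javax.xml.stream.XMLInputFactory system property that is set.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical diagnostic: reports the concrete classes behind the
// StAX factory, stream reader and Location that the runtime provides.
class StaxImplementationCheck {

    static String[] describe() throws XMLStreamException {
        // newInstance() follows the standard lookup (system property,
        // service providers on the classpath, then the platform default).
        XMLInputFactory factory = XMLInputFactory.newInstance();
        XMLStreamReader reader = factory.createXMLStreamReader(new StringReader("<a/>"));
        try {
            return new String[] {
                factory.getClass().getName(),
                reader.getClass().getName(),
                reader.getLocation().getClass().getName()
            };
        } finally {
            reader.close();
        }
    }

    public static void main(String[] args) throws XMLStreamException {
        for (String name : describe()) {
            System.out.println(name);
        }
    }
}
```

On Java 9 and later, XMLInputFactory.newDefaultFactory() should bypass the lookup and return the JDK's built-in implementation, which may be a simpler way to avoid an override coming from the launcher.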