
Getting an index from a parser

 
Shane Burgel
Ranch Hand
Posts: 47
Is it possible to get the index of a tag when you break on a START_ELEMENT event using StAX (or some other parser)?

For instance, if I had a file that was 1000 characters long and the first <record> tag began at character 12, is there a way for StAX (or some other parser) to tell me that? I can't use DOM because my files are too big, and right now I'm having to manually read the file a byte at a time so that I can get these indexes.

Thanks

Shane
 
Paul Clapham
Sheriff
Posts: 21416
Yes, that's what a Locator is for. org.xml.sax.Locator
 
Shane Burgel
Ranch Hand
Posts: 47
Cool, but can I use that with a StAX parser, or do I have to use SAX?
 
Shane Burgel
Ranch Hand
Posts: 47
On second glance that won't work for me. I need the index, not the line number and column number.

Any other ideas?
 
Karthik Shiraly
Bartender
Posts: 1210
XMLStreamReader.getLocation() returns a javax.xml.stream.Location. Use its getCharacterOffset() to get the current offset in the stream.

Also, XMLEventReader.nextEvent() returns XMLEvent, which has a getLocation() to get a Location object.
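
As a rough illustration of the suggestion above, here is a minimal sketch that records getCharacterOffset() at each matching START_ELEMENT. The class name RecordOffsets, the method offsetsOf and the sample XML are made up for this example. Note that, per the Location Javadoc, the value is a byte offset for byte-stream input and a character offset for character input, and -1 if unavailable. Which point of the tag the offset refers to may differ between StAX implementations, so no expected numbers are shown.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper class, not part of any library.
class RecordOffsets {

    // Returns Location.getCharacterOffset() for every START_ELEMENT
    // whose local name equals tagName, in document order.
    static List<Integer> offsetsOf(String xml, String tagName) {
        List<Integer> offsets = new ArrayList<>();
        try {
            // Uses whichever StAX implementation the factory lookup finds.
            XMLInputFactory factory = XMLInputFactory.newInstance();
            XMLStreamReader reader =
                    factory.createXMLStreamReader(new StringReader(xml));
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.START_ELEMENT
                        && tagName.equals(reader.getLocalName())) {
                    offsets.add(reader.getLocation().getCharacterOffset());
                }
            }
            reader.close();
        } catch (XMLStreamException e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
        return offsets;
    }

    public static void main(String[] args) {
        String xml = "<root><record>a</record><record>b</record></root>";
        // Prints one offset per <record>; exact values depend on the parser.
        System.out.println(offsetsOf(xml, "record"));
    }
}
```

For a large file you would pass a FileInputStream instead of a StringReader, in which case the offsets should be byte offsets.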
 
William Brogden
Author and all-around good cowpoke
Rancher
Posts: 13073
So why not count the elements as the events come through to get your "index"?
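
If an ordinal index (first record, second record, ...) is enough rather than a file offset, counting could look roughly like this. This is only a sketch; RecordCounter and countElements are names invented for the example.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper class, not part of any library.
class RecordCounter {

    // Counts START_ELEMENT events whose local name equals tagName.
    static int countElements(String xml, String tagName) {
        int count = 0;
        try {
            XMLStreamReader reader = XMLInputFactory.newInstance()
                    .createXMLStreamReader(new StringReader(xml));
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.START_ELEMENT
                        && tagName.equals(reader.getLocalName())) {
                    // The value of count before the increment is the
                    // zero-based ordinal index of this element.
                    count++;
                }
            }
            reader.close();
        } catch (XMLStreamException e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
        return count;
    }
}
```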

Bill
 
Shane Burgel
Ranch Hand
Posts: 47
I found that after my last post but I can't seem to get the correct index from it.

With my byte-by-byte parser I get these as my first offsets:
[3235, 6467, 9699, 12931, 16163, 19395, 22627, 25859, 29091, 32323, 35555]

The location object gives me this. They aren't even in order and I know for a fact that there are none before 3235, so why are there numbers lower than that?
[3235, 6467, 1513, 4745, 7977, 3025, 6257, 1302, 4534, 7766, 2810]

Here is my code:

 
Karthik Shiraly
Bartender
Posts: 1210
Are you using a BufferedInputStream around a FileInputStream?
 
Shane Burgel
Ranch Hand
Posts: 47
Karthik Shiraly wrote:Are you using a BufferedInputStream around a FileInputStream?


 
Shane Burgel
Ranch Hand
Posts: 47
My Location object is an MXParser, and the correct number can be obtained by adding the bufAbsoluteStart value to the number that is returned.

How do I get the Location instance to return THAT number?
 
Karthik Shiraly
Bartender
Posts: 1210
Was browsing the MXParser source code. You're right, it maintains its own temporary character buffer with a size limit, and the offset returned is an offset into that buffer. I feel this isn't a correct implementation by MXParser. Guess you'll have to do a dirty workaround: subclass MXParser and add bufAbsoluteStart to super.getCharacterOffset().
Sun's implementation works fine.
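
The buffer-relative behaviour can be checked against the two lists of offsets posted earlier in the thread. The sketch below simply subtracts the Location values from the byte-by-byte values; the differences are the buffer starts implied by those numbers, not values read from MXParser itself. BufferStartCheck and impliedBufferStarts are names invented for this example.

```java
import java.util.Arrays;

// Hypothetical helper class; the data is copied from the thread.
class BufferStartCheck {

    // Offsets found by the byte-by-byte parser (absolute file positions).
    static final int[] ABSOLUTE = {
        3235, 6467, 9699, 12931, 16163, 19395, 22627, 25859, 29091, 32323, 35555
    };

    // Offsets reported by the MXParser Location (relative to its buffer).
    static final int[] REPORTED = {
        3235, 6467, 1513, 4745, 7977, 3025, 6257, 1302, 4534, 7766, 2810
    };

    // absolute = bufferStart + reported, so bufferStart = absolute - reported.
    static int[] impliedBufferStarts() {
        int[] starts = new int[ABSOLUTE.length];
        for (int i = 0; i < ABSOLUTE.length; i++) {
            starts[i] = ABSOLUTE[i] - REPORTED[i];
        }
        return starts;
    }

    public static void main(String[] args) {
        // prints [0, 0, 8186, 8186, 8186, 16370, 16370, 24557, 24557, 24557, 32745]
        System.out.println(Arrays.toString(impliedBufferStarts()));
    }
}
```

The implied start only ever grows, in steps of roughly 8 KB, which is consistent with a sliding buffer whose start position has to be added back to get an absolute offset.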
 
Shane Burgel
Ranch Hand
Posts: 47
What is Sun's implementation of this?

Thanks for your help btw
 
Karthik Shiraly
Bartender
Posts: 1210
You're welcome!
The Sun JRE comes with a default implementation of StAX. I get the correct file positions in ascending order with getCharacterOffset() for a 14 MB XML file.
Your app launcher must be overriding Sun's implementation with the Codehaus one by setting -Djavax.xml.stream.XMLInputFactory.
 
Shane Burgel
Ranch Hand
Posts: 47
I'm using Groovy; I guess I should have said that up front. That is why I'm getting the Codehaus implementation.
 
Shane Burgel
Ranch Hand
Posts: 47
What implementation of XMLStreamReader do you get? Which Location implementation?

 
Shane Burgel
Ranch Hand
Posts: 47
I tried using Woodstox as the StAX implementation instead of the MXParser-based one, and it seems to be working correctly.
 
Karthik Shiraly
Bartender
Posts: 1210
I get "com.sun.org.apache.xerces.internal.impl.XMLStreamReaderImpl". So Sun's implementation is Xerces-based.
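
For anyone wanting to check this in their own setup, a small sketch like the following prints the classes actually in use. StaxImplementationCheck and its method names are made up for the example, and the output will vary with the JRE, the classpath and any javax.xml.stream.XMLInputFactory system property.

```java
import java.io.StringReader;
import javax.xml.stream.Location;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper class, not part of any library.
class StaxImplementationCheck {

    // Class name of the factory chosen by the standard lookup.
    static String factoryClassName() {
        return XMLInputFactory.newInstance().getClass().getName();
    }

    // Class names of the reader and of the Location it hands out.
    static String[] readerAndLocationClassNames() {
        try {
            XMLStreamReader reader = XMLInputFactory.newInstance()
                    .createXMLStreamReader(new StringReader("<a/>"));
            Location location = reader.getLocation();
            String[] names = {
                reader.getClass().getName(),
                location.getClass().getName()
            };
            reader.close();
            return names;
        } catch (XMLStreamException e) {
            throw new IllegalStateException("Could not create reader", e);
        }
    }

    public static void main(String[] args) {
        // null here means no override was set on the command line.
        System.out.println("override property: "
                + System.getProperty("javax.xml.stream.XMLInputFactory"));
        System.out.println("factory:  " + factoryClassName());
        String[] names = readerAndLocationClassNames();
        System.out.println("reader:   " + names[0]);
        System.out.println("location: " + names[1]);
    }
}
```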
 