I have to process a series of fairly small XML messages which I receive via a socket over a receive-only network connection.
My app needs to be able to cope with messages coming pretty fast over the connection and I CAN'T afford to miss any of them. It will also have to cope with incomplete message segments, and segments that span multiple messages. I MUST also be able to process a complete message as soon as it arrives.
I have used the nice and easy JDOM with Xerces in the past, but I am not convinced that it will work here as I can't see anything that says if JDOM can cope with incomplete documents from the input stream, and I have read that Xerces will try to close the socket if data is delayed, and may also prematurely close the socket which I need to keep open. StAX and SAX sound like interesting options... Any recommendations or experience would be very welcome.
As for specific parsers, choose SAX or StAX depending on whether you want the parser to crawl through the document, telling you where it is (SAX), or whether you want to crawl through the document, telling it where you think you are (StAX).
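To make the push/pull difference concrete, here is a minimal sketch that counts the elements of the same document both ways, using only the JAXP classes that ship with the JDK. The class name PushVsPull and its method names are made up for illustration; this is not a recommendation of a particular parser implementation.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class: same task done with SAX (push) and StAX (pull).
class PushVsPull {

    // SAX: the parser drives and calls back into our handler for every start tag.
    static int countWithSax(String xml) throws Exception {
        final int[] count = {0};
        SAXParserFactory.newInstance().newSAXParser().parse(
            new InputSource(new StringReader(xml)),
            new DefaultHandler() {
                @Override
                public void startElement(String uri, String localName,
                                         String qName, Attributes atts) {
                    count[0]++;
                }
            });
        return count[0];
    }

    // StAX: we drive, asking the parser for the next event when we are ready.
    static int countWithStax(String xml) throws Exception {
        XMLStreamReader reader =
            XMLInputFactory.newInstance().createXMLStreamReader(new StringReader(xml));
        int count = 0;
        while (reader.hasNext()) {
            if (reader.next() == XMLStreamConstants.START_ELEMENT) {
                count++;
            }
        }
        reader.close();
        return count;
    }
}
```

With StAX the loop is yours, so it is easy to stop, hand off a finished message, and carry on; with SAX you would do that work inside the callbacks.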
It does sound like I need to use SAX or StAX, although I still don't know if they can cope with incomplete documents, or segments spanning documents. BTW these are APIs not parsers, so my question of which parser to use stands.
You are certainly not in JDOM territory here.
Like Paul said, abstract away the various sources of XML fragments and hand the parser an InputStream. For one thing this will be SOOOOO much easier to test - you can create a set of test case documents and test the parsing side without ever getting tangled up in message connections.
On the other side you can test grabbing messages and creating a stream without ever getting hung up in XML parsing issues.
Well, you're going to have to work on that. All XML parsers are designed to parse exactly one XML document, no more, no less. So you have a stream of XML documents coming in, chopped up into random chunks? If I understand your word "segment" then that's what you have. I can't say I'm impressed by that design but it is what it is.
Originally posted by Alasdair Jones:
Thanks, I would if I could. Unfortunately I have no control over the data I will receive and as I said, it will come in segments. Therefore I will have to parse the data before I can determine the message boundaries.
I'd say that ByteArrayInputStream and SequenceInputStream could be useful tools. If you have the possibility that a segment could contain part of two different documents then maybe PushbackInputStream too. But I would say this part of the problem is what you have to work on. Choosing an XML parser is just trivia.
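As a rough sketch of the ByteArrayInputStream plus SequenceInputStream idea: each received segment is wrapped in its own ByteArrayInputStream and the segments are chained so the parser sees one continuous stream, even when a tag is cut in half between two segments. The class name ChunkGlue and its methods are invented for this example, and it assumes all segments of one document are already in hand. Where one document ends and the next begins (the PushbackInputStream part) is not handled here.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.io.SequenceInputStream;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper: glue received segments into one stream for the parser.
class ChunkGlue {

    // Chains the segments, in order, into a single InputStream.
    static InputStream join(List<byte[]> chunks) {
        List<InputStream> parts = new ArrayList<>();
        for (byte[] chunk : chunks) {
            parts.add(new ByteArrayInputStream(chunk));
        }
        return new SequenceInputStream(Collections.enumeration(parts));
    }

    // Parses the whole stream and returns the element names in document order.
    static String elementNames(InputStream in) throws Exception {
        XMLStreamReader reader = XMLInputFactory.newInstance().createXMLStreamReader(in);
        StringBuilder names = new StringBuilder();
        while (reader.hasNext()) {
            if (reader.next() == XMLStreamConstants.START_ELEMENT) {
                if (names.length() > 0) {
                    names.append(',');
                }
                names.append(reader.getLocalName());
            }
        }
        reader.close();
        return names.toString();
    }
}
```

The parser never learns that the input arrived in pieces, which is also what makes the parsing side easy to test without a socket.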
Alasdair: It also will have to cope with incomplete message segments, and segments that span multiple messages.
I am not sure what you mean by the above. Are you referring to a chunked HTTP request kind of protocol?
Is it possible that different messages will arrive on the same connection, in different parts and in a different order? If that is the case then it will be a nightmare to actually aggregate the message parts. If only one message comes over one connection, but in different chunks (parts), then it does make some sense.
If this is the case you can look at the PipedInputStream and PipedOutputStream pair. You can use them for a producer/consumer exchange between two threads: give the PipedInputStream to the XML parser and fill the PipedOutputStream from the thread reading data from the socket. I am not sure how parsers handle a long delay in read() on the input stream, but you can manage that from your side: once you close the PipedOutputStream, read() on the connected PipedInputStream returns -1 (end of stream) after any buffered data has been consumed, so the parser stops waiting. So you can decide on a timeout period between two message chunks and close the PipedOutputStream after that time.
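A minimal sketch of that producer/consumer arrangement, assuming a single document arrives in several chunks. The class name PipedFeed and the method parseViaPipe are hypothetical, and the chunk array stands in for whatever the socket-reading thread actually receives.

```java
import java.io.IOException;
import java.io.PipedInputStream;
import java.io.PipedOutputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

// Hypothetical demo: one thread writes chunks into a pipe, the caller parses from it.
class PipedFeed {

    // Returns the number of elements seen once the whole document has been parsed.
    static int parseViaPipe(String[] chunks) throws Exception {
        PipedOutputStream out = new PipedOutputStream();
        PipedInputStream in = new PipedInputStream(out);

        // Producer: in the real application this would be the socket-reading thread.
        Thread producer = new Thread(() -> {
            try (PipedOutputStream o = out) {
                for (String chunk : chunks) {
                    o.write(chunk.getBytes(StandardCharsets.UTF_8));
                    o.flush();
                }
            } catch (IOException e) {
                // The reading side went away; nothing more to do in this sketch.
            }
            // Leaving the try block closes the pipe, which ends the reader's stream.
        });
        producer.start();

        // Consumer: the parser blocks in read() until the producer supplies more bytes.
        XMLStreamReader reader = XMLInputFactory.newInstance().createXMLStreamReader(in);
        int elements = 0;
        while (reader.hasNext()) {
            if (reader.next() == XMLStreamConstants.START_ELEMENT) {
                elements++;
            }
        }
        reader.close();
        producer.join();
        return elements;
    }
}
```

The reader and writer must live in different threads; using both ends of a pipe from one thread can deadlock once the pipe's internal buffer fills up.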
I'll have to do some low-level 'parsing' of my own to repackage the XML segments into whole documents, and only then can I send to an XML parser.
NO - not whole documents if you need to combine the data. You just want compatible XML that can be spliced together to make the parser THINK it is looking at a single document - the parsing you have to do could be as simple as removing the XML declaration and providing your own root element. Parsing in the StAX or SAX style can proceed indefinitely as long as you keep shoving valid fragments of XML text into the pipeline.
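Here is one way the splice could look, as a sketch only. It strips a leading XML declaration from each message, wraps everything in a made-up root element called stream, and treats every direct child of that root as one complete message at the moment its end tag is seen. The class and method names are hypothetical, and the messages are plain strings here to keep the example short; on a live connection the closing root tag would only be supplied when the socket closes.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

// Hypothetical demo: make several documents look like one to the parser.
class SplicedStream {

    // Removes a leading <?xml ...?> declaration, if present.
    // Simplification: any processing instruction starting with "<?xml" is treated as one.
    static String stripDeclaration(String doc) {
        String trimmed = doc.trim();
        if (trimmed.startsWith("<?xml")) {
            int end = trimmed.indexOf("?>");
            if (end >= 0) {
                return trimmed.substring(end + 2);
            }
        }
        return trimmed;
    }

    // Returns the root element name of each message, in the order they complete.
    static List<String> messageNames(List<String> docs) throws Exception {
        StringBuilder spliced = new StringBuilder("<stream>"); // synthetic root
        for (String doc : docs) {
            spliced.append(stripDeclaration(doc));
        }
        spliced.append("</stream>");

        XMLStreamReader reader = XMLInputFactory.newInstance()
            .createXMLStreamReader(new StringReader(spliced.toString()));
        List<String> names = new ArrayList<>();
        int depth = 0;
        while (reader.hasNext()) {
            int event = reader.next();
            if (event == XMLStreamConstants.START_ELEMENT) {
                depth++;
            } else if (event == XMLStreamConstants.END_ELEMENT) {
                if (depth == 2) {
                    // A direct child of the synthetic root just closed: one whole message.
                    names.add(reader.getLocalName());
                }
                depth--;
            }
        }
        reader.close();
        return names;
    }
}
```

The depth == 2 check is the point where a real application would hand the finished message on for processing.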
I once took this pass at the problem of combining XML fragments while providing for tracking the real location of errors back to the responsible fragment.
Single socket connection between 2 applications. This will be opened at initialisation and will have to remain open for the duration of the data exchange.
The destination app which I am writing is the socket server.
The source app is the socket client. I have no control over how this sends messages, and can only receive data. Each message will be sent separately but the source app can't guarantee that these will be sent in a continuous stream and may be split into several segments, although these will be in the correct order. Also, the messages will be sent with no separator/header so the stream I receive could contain multiple messages and/or message segments.
Just to get going I've built a proto with a test socket client which sends a series of messages as a continuous stream. Using JDOM it unsurprisingly throws a parse exception when it reaches the start of the new message "<?xml..." not expecting another in what it believes is the same document. It did however, cope with parsing the data in segments. And of course, this way, I will not be able to get at the data that has been passed...
I'm going to try with SAX/StAX now and then the SequenceInputStream/ByteArrayInputStream...
Originally posted by Alasdair Jones:
Using JDOM it unsurprisingly throws a parse exception when it reaches the start of the new message "<?xml..." not expecting another in what it believes is the same document.
Yeah, that will be the case. So you will mostly have to sniff the data coming in on the socket to work out the message boundaries. (I am not sure how you will do that, though!)
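One possible way to sniff the boundaries, offered as a rough sketch under a big assumption: every message starts with an XML declaration, as the "<?xml..." in the prototype suggests, and no other processing instruction in the data starts with "<?xml". It also assumes an ASCII-compatible encoding such as UTF-8. The class name DeclarationFramer and its methods are invented for this example.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical framer: buffers raw segments and cuts them at each "<?xml" marker.
class DeclarationFramer {
    private static final String MARK = "<?xml";

    // ISO-8859-1 maps every byte to exactly one char, so the buffer round-trips
    // the raw bytes unchanged even if a segment ends mid-character.
    private final StringBuilder buffer = new StringBuilder();

    // Adds the first len bytes of a segment; returns the messages now known to be complete.
    List<byte[]> feed(byte[] segment, int len) {
        buffer.append(new String(segment, 0, len, StandardCharsets.ISO_8859_1));
        List<byte[]> complete = new ArrayList<>();
        int next;
        // Search from index 1 so the declaration that opens the current message is skipped.
        while ((next = buffer.indexOf(MARK, 1)) > 0) {
            complete.add(buffer.substring(0, next).getBytes(StandardCharsets.ISO_8859_1));
            buffer.delete(0, next);
        }
        return complete;
    }

    // Call when the connection closes: whatever is left is the final message.
    byte[] flush() {
        byte[] rest = buffer.toString().getBytes(StandardCharsets.ISO_8859_1);
        buffer.setLength(0);
        return rest;
    }
}
```

The weakness is that a message is only known to be complete when the next declaration shows up or the connection closes, so the last message in a burst waits. If that delay is not acceptable, the framer would have to track the close of the root element instead.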