Jordan (and anyone else interested),
A few years ago I was working on a robust feed parser. Anyone who has done real-world feed parsing knows that following the specs is one thing, but parsing the mess of illegal characters and misused tags out there, even on some of the biggest sites, is another.
I went from SAX to XPath and eventually to pull parsing, which was the fastest way to parse (faster than SAX in the exhaustive testing that the Sun team did for the Pull Parser RI).
Working on the parser month after month, I kept adding more and more abstractions as I noticed recurring pieces of logic that could be pulled out. It eventually resulted in my creating a brand new parser:
SJXP (Apache 2 license)
SJXP combines the ease of XPath (minus dynamic expressions) with the raw speed and low overhead of XML pull parsing. To see the raw performance, you can check out the benchmarks (they are included directly in the bundle if you want to try them yourself).
This is not a re-implementation of a core parsing library, but rather a VERY thin abstraction layer on top of one of the fastest XML parsers out there: XPP. For those who don't know, XPP is the backing implementation for XML parsing on the Android platform, so on Android SJXP works out of the box with no extra dependencies, and on any other platform you just need to include the single XPP JAR.
Usage of SJXP is all based around defining "paths" that point at the elements or attributes you want parsed. You give these rules to the parser along with callbacks, and the parser calls your code every time a rule is matched, giving you an opportunity to do something with the information.
For example, if I wanted the title out of the snippet example you gave, I would define a rule whose path points at the title element, then give that rule to an XMLParser instance.
Now when you use the parser instance to parse content like that from any input source, your rule gets called for every title element. The actual character data (the title of the book) is contained in "text", and userObject is an optional reference to a user object passed through from the parser if you want to use it. For example, this might be a database DAO or some other storage class to hold the parsed value (even a List<String> would be fine if that is all you wanted).
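To make the rule-and-callback idea concrete without an SJXP dependency, here is a minimal sketch built on the JDK's own StAX pull parser. It is not SJXP's API: PathRuleSketch, CharacterRule, the parse method, and the sample book XML are all hypothetical names I made up for illustration, and it only handles character data of leaf elements.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.function.BiConsumer;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Illustration only: these types are hypothetical stand-ins, not SJXP classes.
class PathRuleSketch {

    /** A location path plus the callback to run when that path is matched. */
    static final class CharacterRule<T> {
        final String path;
        final BiConsumer<String, T> callback;

        CharacterRule(String path, BiConsumer<String, T> callback) {
            this.path = path;
            this.callback = callback;
        }
    }

    /** Pull-parses xml and fires the rule for every element located at rule.path. */
    static <T> void parse(String xml, CharacterRule<T> rule, T userObject) {
        Deque<String> open = new ArrayDeque<>();   // names of the currently open elements
        StringBuilder text = new StringBuilder();  // character data of the innermost element
        try {
            XMLStreamReader reader =
                XMLInputFactory.newInstance().createXMLStreamReader(new StringReader(xml));
            while (reader.hasNext()) {
                switch (reader.next()) {
                    case XMLStreamConstants.START_ELEMENT:
                        open.addLast(reader.getLocalName());
                        text.setLength(0);
                        break;
                    case XMLStreamConstants.CHARACTERS:
                    case XMLStreamConstants.CDATA:
                        text.append(reader.getText());
                        break;
                    case XMLStreamConstants.END_ELEMENT:
                        // Compare the current location against the rule's path.
                        if (rule.path.equals("/" + String.join("/", open))) {
                            rule.callback.accept(text.toString().trim(), userObject);
                        }
                        open.removeLast();
                        text.setLength(0);
                        break;
                    default:
                        break;
                }
            }
            reader.close();
        } catch (XMLStreamException e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        // Made-up sample document; the user object is just a List<String>.
        String xml = "<library><book><title>Dune</title></book>"
                   + "<book><title>Emma</title></book></library>";
        List<String> titles = new ArrayList<>();
        CharacterRule<List<String>> titleRule =
            new CharacterRule<>("/library/book/title", (text, list) -> list.add(text));
        parse(xml, titleRule, titles);
        System.out.println(titles); // [Dune, Emma]
    }
}
```

The callback receives the character data and the user object, which mirrors the "text" and userObject roles described above; everything else about the sketch is an assumption.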
The same applies if you want an attribute value: just change the type of the rule and override the other default method (handleParsedAttribute) instead.
One of the biggest boons I think SJXP adds is how easy it is to parse elements and attributes that are qualified by a namespace. SJXP does this with [] notation. More specifically: assume you were parsing an RSS feed like TechCrunch's (http://techcrunch.com/feed/) and you wanted the author name out of EACH post.
If you open up the feed, you see that the author name is stored in the <item> elements in a <dc:creator> subelement. If you are familiar with prefixed elements, you know that "dc" must be a prefix defined somewhere up in the root element of the feed.
So you scroll up and look at the root element and see that "dc" is the prefix for the Dublin Core specification of tags defined by the URI "http://purl.org/dc/elements/1.1/".
So, given that, the rule you would define in SJXP to parse out the author information uses a path whose final element is qualified with that URI.
You just include the full URI in []-notation before the element name to fully qualify it. The same even goes for prefixed attributes!
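As a rough illustration of why qualifying by URI rather than by prefix is handy, here is a small sketch that builds []-qualified paths on top of the JDK's StAX pull parser. Again, this is not SJXP code: NamespacePathSketch, segment, collect, and the sample feed with its author name are hypothetical, and the path assumes the usual rss/channel/item nesting of an RSS feed.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Illustration only: hypothetical helper, not part of SJXP.
class NamespacePathSketch {

    /** Builds one path segment, putting the namespace URI in [] when there is one. */
    static String segment(String namespaceUri, String localName) {
        boolean qualified = namespaceUri != null && !namespaceUri.isEmpty();
        return qualified ? "[" + namespaceUri + "]" + localName : localName;
    }

    /** Returns the character data of every element whose qualified path equals path. */
    static List<String> collect(String xml, String path) {
        List<String> matches = new ArrayList<>();
        Deque<String> open = new ArrayDeque<>();   // qualified segments of open elements
        StringBuilder text = new StringBuilder();
        try {
            XMLStreamReader reader =
                XMLInputFactory.newInstance().createXMLStreamReader(new StringReader(xml));
            while (reader.hasNext()) {
                switch (reader.next()) {
                    case XMLStreamConstants.START_ELEMENT:
                        // The prefix is ignored; only the URI it resolves to is used.
                        open.addLast(segment(reader.getNamespaceURI(), reader.getLocalName()));
                        text.setLength(0);
                        break;
                    case XMLStreamConstants.CHARACTERS:
                    case XMLStreamConstants.CDATA:
                        text.append(reader.getText());
                        break;
                    case XMLStreamConstants.END_ELEMENT:
                        if (path.equals("/" + String.join("/", open))) {
                            matches.add(text.toString().trim());
                        }
                        open.removeLast();
                        text.setLength(0);
                        break;
                    default:
                        break;
                }
            }
            reader.close();
        } catch (XMLStreamException e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
        return matches;
    }

    public static void main(String[] args) {
        // Made-up feed fragment; "Jane Doe" is a placeholder author.
        String feed = "<rss xmlns:dc=\"http://purl.org/dc/elements/1.1/\"><channel><item>"
                    + "<title>Post</title><dc:creator>Jane Doe</dc:creator>"
                    + "</item></channel></rss>";
        String authorPath = "/rss/channel/item/[http://purl.org/dc/elements/1.1/]creator";
        System.out.println(collect(feed, authorPath)); // [Jane Doe]
    }
}
```

Because the match is made on the URI, a feed that binds the same Dublin Core namespace to a different prefix would still hit the same path.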
I've really tried to make the API as simple as possible, and I think it might offer you what you want in a much easier-to-use package with some of the fastest runtime performance and lowest memory overhead out there.
If anyone is interested, you can read through the closed bugs on GitHub to get an idea of the optimizations that have brought the overhead of SJXP on top of XPP down to something like 1-2k of memory. It is really tiny because of things like matching paths by their hash codes instead of by String comparisons, and caching the locations in the doc as it's parsed so the hash codes aren't recalculated all the time. Since XML is a structured language, this was a huge win.
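The hash-code idea can be sketched roughly like this. It is only one way to do it, not a copy of SJXP's internals: PathHashSketch and its methods are hypothetical names. It leans on the documented String.hashCode() recurrence (h = 31 * h + c), so the hash of a child location can be extended from the cached hash of its parent without rebuilding the path string.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Illustration only: a hypothetical location tracker, not SJXP's implementation.
class PathHashSketch {

    // One cached hash per open element; the bottom entry is the hash of "".
    private final Deque<Integer> hashes = new ArrayDeque<>();

    PathHashSketch() {
        hashes.push("".hashCode());
    }

    /** Called on a start tag: extend the parent's hash with "/" + name. */
    void enter(String name) {
        int h = 31 * hashes.peek() + '/';
        for (int i = 0; i < name.length(); i++) {
            h = 31 * h + name.charAt(i);
        }
        hashes.push(h);
    }

    /** Called on an end tag: the parent's hash is already cached, so just pop. */
    void leave() {
        if (hashes.size() > 1) {
            hashes.pop();
        }
    }

    /** Hash of the current location, equal to the hashCode() of its path string. */
    int current() {
        return hashes.peek();
    }

    /** Integer comparison against a rule's precomputed path hash. */
    boolean matches(int ruleHash) {
        return current() == ruleHash;
    }

    public static void main(String[] args) {
        int titleRule = "/library/book/title".hashCode(); // computed once, up front
        PathHashSketch location = new PathHashSketch();
        location.enter("library");
        location.enter("book");
        location.enter("title");
        System.out.println(location.matches(titleRule)); // true
        location.leave();
        System.out.println(location.matches(titleRule));
    }
}
```

Two different paths can share a hash code, and this sketch ignores that; a real implementation would need to decide whether that risk is acceptable or confirm matches another way.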
I would encourage anyone skeptical to run HPROF on the Benchmark class to see SJXP in action and look at the memory and CPU allocations directly from the VM to see how tight the library is.
The project has been getting more pickup in the Android community and has gotten a lot of good feedback. If any of you give it a try and have comments, please let me know!