XML pull parsing means parsing XML as a stream that the application reads on demand, rather than building a tree (DOM) or pushing events out to client code (SAX). The main goal of XML pull parsers is to optimize tasks where all elements in a document must be parsed and processed. Typical examples include SOAP processors, switching/routing applications like the CBR service in XQ, and higher-level parsers (SAX/DOM processors or specialized variations of the two). The approach is called pull parsing because the parser parses only what the application asks for, rather than passing every event up to the client application. This results in a very small memory footprint (no document state has to be maintained) and very fast processing (fewer unnecessary event callbacks).
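To make the "parser only parses what is asked for" idea concrete, here is a minimal sketch. It assumes the JDK's built-in StAX API (javax.xml.stream) as a stand-in pull parser, since it needs no extra libraries; it is not the XPP or xmlpull.org API discussed in this article, and the class and method names (PullSketch, firstText) are hypothetical.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical demo class; not part of XPP or the xmlpull.org API.
class PullSketch {

    // Returns the text of the first element named `name`, or null if absent.
    // The application drives the loop: it asks for the next event only
    // while it still needs one, and stops as soon as it has its answer.
    static String firstText(String xml, String name) {
        try {
            XMLStreamReader reader = XMLInputFactory.newInstance()
                    .createXMLStreamReader(new StringReader(xml));
            try {
                while (reader.hasNext()) {
                    int event = reader.next(); // pull one event
                    if (event == XMLStreamConstants.START_ELEMENT
                            && name.equals(reader.getLocalName())) {
                        // Text-only content of the current element.
                        return reader.getElementText();
                    }
                }
                return null;
            } finally {
                reader.close();
            }
        } catch (XMLStreamException e) {
            throw new IllegalArgumentException("not well-formed XML", e);
        }
    }

    public static void main(String[] args) {
        String message = "<envelope><header><to>queue-a</to></header>"
                       + "<body><order id=\"7\"/></body></envelope>";
        // A routing decision needs only the header; no events are
        // requested for the body.
        System.out.println(firstText(message, "to")); // prints "queue-a"
    }
}
```

With a push API such as SAX, the handler would receive callbacks for the body as well, whether or not it cared about them. Here the application simply stops asking.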
The best known of the XML pull parsers is XPP, from the Grid Web Services group in the Extreme! Computing Lab at Indiana University. XPP2 is the stable release. The same group also runs an initiative to develop a standard XML pull parsing API, which has a site at http://www.xmlpull.org (see resources below). XPP3, a rewrite of XPP2, is their implementation of that API.
There are now three implementations listed on the xmlpull.org site: XNI2XmlPull (based on the Xerces 2 XNI parser), the XPP3/MXP1 parser (XPP2 doesn't implement this interface), and kXML2, a J2ME-optimized implementation. They offer varying degrees of support for the XML standard, with only the XNI2XmlPull parser focusing on full XML 1.0 compatibility.
An alternative to the xmlpull.org API, with different requirements, is the Apache Xerces Native Interface (XNI - http://xml.apache.org/xerces2-j/xni.html). This API arose from a need to modularize the internals of the Xerces processor, but has been made "public" to allow fine-grained control of parsing behavior. The goal was to allow pipelining of XML streams up to the higher-level APIs. For example, you could replace the default XML scanner with an HTML scanner, which would then expose the HTML tree through a SAX or DOM API. Other examples on the site include an XML preprocessor to handle XInclude-style references.
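The pipelining idea can be illustrated without XNI itself. The sketch below is only a loose analogy: it uses the SAX2 XMLFilter mechanism that ships with the JDK, which sits at a higher level than XNI's components, and it shows none of the XNI interfaces. The class names (PipelineSketch, UpperCaseNames) and the upper-casing transformation are hypothetical and chosen purely for illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.XMLFilterImpl;

// Hypothetical demo class; an analogy for XNI pipelines, not XNI code.
class PipelineSketch {

    // One pipeline stage: rewrites element names before passing
    // the event on to whatever handler sits downstream.
    static class UpperCaseNames extends XMLFilterImpl {
        UpperCaseNames(XMLReader parent) {
            super(parent);
        }

        @Override
        public void startElement(String uri, String local, String qName,
                                 Attributes atts) throws SAXException {
            super.startElement(uri, local.toUpperCase(), qName.toUpperCase(), atts);
        }

        @Override
        public void endElement(String uri, String local, String qName)
                throws SAXException {
            super.endElement(uri, local.toUpperCase(), qName.toUpperCase());
        }
    }

    // Runs parser -> UpperCaseNames -> collecting handler and returns
    // the element names the final handler saw, separated by spaces.
    static String elementNames(String xml) {
        StringBuilder seen = new StringBuilder();
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            XMLReader stage = new UpperCaseNames(factory.newSAXParser().getXMLReader());
            stage.setContentHandler(new DefaultHandler() {
                @Override
                public void startElement(String uri, String local, String qName,
                                         Attributes atts) {
                    seen.append(qName).append(' ');
                }
            });
            stage.parse(new InputSource(new StringReader(xml)));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("could not parse input", e);
        }
        return seen.toString().trim();
    }

    public static void main(String[] args) {
        // prints "DOC ITEM ITEM"
        System.out.println(elementNames("<doc><item/><item/></doc>"));
    }
}
```

The downstream handler never knows a stage was inserted. XNI applies the same principle further down the stack, which is what makes swapping the scanner itself possible.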
The main performance challenger to XPP3 (in the general case - see Resources for benchmarks) is the Piccolo parser, a non-validating, SAX-only parser. Its performance numbers are quite impressive, and better than those of the XPP3 parser in its current iteration. What's very interesting about Piccolo is that its authors claim it is the only parser that uses automatic parser generator tools to generate the low-level XML stream parser, and they attribute some of their performance edge to this alone. They use JFlex and BYACC/J to build the parser (see resources below).
The main thing I took away from looking at the various parsers is that all of their performance gains come at some loss of functionality. Either they don't support all of the XML 1.0 features (validation being #1 on that list), or they impose certain limitations on your application (for example, it has to know the schema of the incoming document). So, while you gain a great deal of performance, you have to give up some functionality. In addition, the performance of the Xerces2 deferred DOM is reasonably good, though at the cost of significantly higher memory usage. See the "XML documents on the run, part 3" article below for a good overview of the performance tradeoffs between Piccolo, XPP3, and the others.
All of these seem a little dated (early 2002).
Again, these seem dated (most from April 2002). Some of the articles above have performance numbers as well. These are supplied by the projects themselves. I hope to run these benchmarks myself and will add those links in here.
Written by Sujal Shah