PullXML: an XML Pull Parser for PHP 5
By Nico on Friday, April 27 2007, 22:36
You've got a big, big XML file to parse. You need to do it in PHP. Well, you're in deep trouble. Unless I've missed something, here are the alternatives I've seen, none of which satisfied me:
- Use SimpleXML. Not possible here because your big, big XML file won't fit into memory.
- Use DOM or DOM XML. Same problem, the file won't fit into memory, PLUS this time you get a notoriously crappy API.
- Use XMLReader. No memory problem this time. However, the API is awkward. Maybe it gets better with a combination of calls to XMLReader::expand() and XMLReader::next(), but then again it's back to the crappy DOM API.
- Use the SAX-like streaming XML parser. This works, no memory problem, but then again it's pretty awkward: you have to implement a stack-based machine to do anything remotely useful if the XML document is a little bit complicated.
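To give an idea of the stack-based bookkeeping the SAX-like parser asks for, here is a minimal sketch that collects the text of every title element found under /library/book. The document, the element names and the variable names are all made up for illustration, and the closures assume PHP 5.3 or later.

```php
<?php
// Illustrative input; the layout and element names are assumptions.
$xml = '<library><book><title>A</title></book>'
     . '<book><title>B</title></book></library>';

$stack  = array(); // names of the currently open elements
$titles = array(); // collected results
$buffer = '';      // character data of the title being read

$parser = xml_parser_create();
// Keep element names as written instead of upper-casing them.
xml_parser_set_option($parser, XML_OPTION_CASE_FOLDING, false);

xml_set_element_handler(
    $parser,
    // Start tag: push the name, reset the buffer when a title opens.
    function ($p, $name, $attrs) use (&$stack, &$buffer) {
        $stack[] = $name;
        if (implode('/', $stack) === 'library/book/title') {
            $buffer = '';
        }
    },
    // End tag: store the buffer when a title closes, then pop the name.
    function ($p, $name) use (&$stack, &$buffer, &$titles) {
        if (implode('/', $stack) === 'library/book/title') {
            $titles[] = $buffer;
        }
        array_pop($stack);
    }
);
xml_set_character_data_handler(
    $parser,
    // Text may arrive in several pieces, so it is appended, not assigned.
    function ($p, $data) use (&$stack, &$buffer) {
        if (implode('/', $stack) === 'library/book/title') {
            $buffer .= $data;
        }
    }
);

xml_parse($parser, $xml, true);
print_r($titles);
```

Three handlers and three shared variables just to read one kind of element: every extra element you care about adds more state to track by hand.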
Well, after messing around with PHP for a few hours (man, this language is soooo weird! I miss Python...), I came up with PullXML, an XML Pull Parser. It works in PHP 5. It could be ported to PHP 4, but frankly this is not something I look forward to doing :).
PullXML is implemented with the SAX-like streaming parser. It builds objects that look like SimpleXML objects, but instead of loading the whole document in memory, it builds them chunk by chunk and calls a callback you provide when a chunk is ready.
The chunks are delimited by the pivot, which is a simplified XPath expression that gives the path that each chunk must match. For example, if the pivot is /foo/bar, then PullXML will call your callback for each bar element that is in a foo element, including the content of the bar element, of course (otherwise this would be quite useless). But the best way to see how it works is to have a look at the source code and experiment with the example at the end.
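The pivot idea can be sketched in a few dozen lines on top of the SAX-like parser. This is not PullXML's actual source or API: the function name pull_chunks, the stdClass node layout (name, attributes, text, children) and the sample document are assumptions made for illustration, and the closures assume PHP 5.3 or later.

```php
<?php
// Hypothetical helper, not PullXML's API: reads XML from $stream and calls
// $callback once for each element whose path equals $pivot ("/foo/bar").
function pull_chunks($stream, $pivot, $callback)
{
    $path = array(); // names of the currently open elements
    $open = array(); // chunk nodes under construction, innermost last

    $parser = xml_parser_create();
    xml_parser_set_option($parser, XML_OPTION_CASE_FOLDING, false);

    xml_set_element_handler(
        $parser,
        function ($p, $name, $attrs) use (&$path, &$open, $pivot) {
            $path[] = $name;
            $inChunk = count($open) > 0;
            if (!$inChunk && '/' . implode('/', $path) !== $pivot) {
                return; // outside any chunk: nothing is kept in memory
            }
            $node = new stdClass();
            $node->name = $name;
            $node->attributes = $attrs;
            $node->text = '';
            $node->children = array();
            if ($inChunk) {
                $open[count($open) - 1]->children[] = $node;
            }
            $open[] = $node;
        },
        function ($p, $name) use (&$path, &$open, $callback) {
            array_pop($path);
            if (count($open) === 0) {
                return;
            }
            $node = array_pop($open);
            if (count($open) === 0) {
                // The pivot element just closed: hand the chunk over.
                call_user_func($callback, $node);
            }
        }
    );
    xml_set_character_data_handler(
        $parser,
        function ($p, $data) use (&$open) {
            if (count($open) > 0) {
                $open[count($open) - 1]->text .= $data;
            }
        }
    );

    // Feed the parser block by block so the whole file is never loaded.
    do {
        $data  = (string) fread($stream, 8192);
        $final = feof($stream);
        if (!xml_parse($parser, $data, $final)) {
            throw new RuntimeException(
                xml_error_string(xml_get_error_code($parser))
            );
        }
    } while (!$final);
}

// Usage with an in-memory stream standing in for the big file.
$stream = fopen('php://memory', 'r+');
fwrite($stream, '<foo><bar id="1"><baz>x</baz></bar><qux/><bar id="2">y</bar></foo>');
rewind($stream);

$seen = array();
pull_chunks($stream, '/foo/bar', function ($bar) use (&$seen) {
    $seen[] = $bar->attributes['id'];
});
fclose($stream);
print_r($seen);
```

Only the chunk currently being built lives in memory. Once the callback returns, nothing references the chunk anymore and PHP can reclaim it, and elements outside the pivot (qux here) are never stored at all.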
I'm no expert in PHP and I haven't used SimpleXML much, so this is probably quite buggy for the moment. Yet it already does the job as expected on one of my projects. If you have any remarks about the source code or any suggestions for better compatibility with SimpleXML, feel free to leave me a comment.

