SimpleXMLParse is a library for parsing XML into Python data structures instead of XML document object models (DOM) or parsing events (SAX). The idea is to make it easier to build applications that use simple XML formats. To build a parser, you supply an XML template that looks very similar to an example document. The parser will then be capable of parsing documents that match the template into Python objects. It can handle nearly any XML vocabulary. Soon it will also be able to generate XML documents from Python objects.
Download SimpleXMLParse: simplexmlparse.tar.bz2
Atom is an XML format for publishing news and updates for feed readers. SimpleXMLParse handles most of the Atom format, but this simplified example looks only at the entry element, which contains information about a chunk of content. Here is a simple valid Atom entry from the developer documentation:
    <entry xmlns="http://www.w3.org/2005/Atom">
      <title>Atom-Powered Robots Run Amok</title>
      <link href="http://example.org/2003/12/13/atom03"/>
      <id>urn:uuid:1225c695-cfb8-4ebb-aaaa-80da344efa6a</id>
      <updated>2003-12-13T18:30:02Z</updated>
      <summary>Some text.</summary>
    </entry>
SimpleXMLParse can parse the above example document like this:
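    import simplexmlparse

    # TEMPLATE and DOCUMENT are strings containing the template and the document XML.
    parser = simplexmlparse.SimpleXMLParser( TEMPLATE )
    entry = parser.parse( DOCUMENT )

    print "title =", entry.title._text
    print "id =", entry.id._text
    print "updated =", entry.updated._text
    # Optional elements are only printed when present.
    if entry.summary:
        print "summary =", entry.summary._text
    if entry.link:
        print "link href =", entry.link.href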
Notice that we can access all the values and attributes via normal Python attributes. This will output the following:
    title = Atom-Powered Robots Run Amok
    id = urn:uuid:1225c695-cfb8-4ebb-aaaa-80da344efa6a
    updated = 2003-12-13T18:30:02Z
    summary = Some text.
    link href = http://example.org/2003/12/13/atom03
For comparison, here is the equivalent code using the XML DOM:
    from xml.dom import minidom

    dom = minidom.parseString( GOOD_DOCUMENT )

    print "title =", dom.firstChild.getElementsByTagName( 'title' )[0].firstChild.data
    print "id =", dom.firstChild.getElementsByTagName( 'id' )[0].firstChild.data
    print "updated =", dom.firstChild.getElementsByTagName( 'updated' )[0].firstChild.data

    # Optional elements must be checked by hand.
    summaries = dom.firstChild.getElementsByTagName( 'summary' )
    if len( summaries ) > 0:
        print "summary =", summaries[0].firstChild.data
    links = dom.firstChild.getElementsByTagName( 'link' )
    if len( links ) > 0:
        print "link href =", links[0].getAttribute( 'href' )
SimpleXMLParse creates strict parsers. It will not accept documents that do not match the template, unlike the DOM example. If we attempt to pass in a document that is missing a required element, we get the following exception:
ParseError: Element 'http://www.w3.org/2005/Atom:entry' is missing a required element: 'http://www.w3.org/2005/Atom:id' at line 6 column 6
This message tells us exactly what is wrong, and gives us a hint as to where the problem is in the document. Compare this to the following output from the DOM version:
IndexError: list index out of range
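The DOM failure mode can be reproduced with the standard library alone. The following is a minimal sketch written for Python 3 (unlike the Python 2 samples above); BAD_DOCUMENT and read_id are names made up for this illustration, and the document is the Atom entry with its id element deliberately removed.

```python
from xml.dom import minidom

# Illustrative document: the Atom entry with the required <id> element left out.
BAD_DOCUMENT = """<entry xmlns="http://www.w3.org/2005/Atom">
  <title>Atom-Powered Robots Run Amok</title>
  <updated>2003-12-13T18:30:02Z</updated>
</entry>"""


def read_id(document):
    """Return the text of the first <id> element, DOM style (hypothetical helper)."""
    dom = minidom.parseString(document)
    # getElementsByTagName returns an empty list when nothing matches,
    # so indexing [0] fails with a bare IndexError.
    return dom.firstChild.getElementsByTagName("id")[0].firstChild.data


try:
    read_id(BAD_DOCUMENT)
    error_message = None
except IndexError as exc:
    error_message = "IndexError: %s" % exc

print(error_message)
```

The exception carries no element name and no position in the document, which is the gap the ParseError message is meant to close.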
Create a SimpleXMLParser instance using the template string:

    parser = simplexmlparse.SimpleXMLParser(template)

Call the parse method to create a Python object from the document string:

    docObj = parser.parse(document)

The simplexmlparse.printObjectTree function will display the object, along with types and values.

You can also test your templates at the command line. The module provides a main routine that will parse the template and document, then print the object using printObjectTree:

    simplexmlparse.py [template file] [document file]
SimpleXMLParse uses a template to build Python objects for XML elements. To create a template, you take an example document and annotate it to describe which elements and attributes are required and which are optional. The template is very similar to the example document:
    <entry xmlns="http://www.w3.org/2005/Atom"
           xmlns:simplexmlparse="http://evanjones.ca/simplexmlparse">
      <title simplexmlparse:count="1">Atom-Powered Robots Run Amok</title>
      <link href="required URI"/>
      <id simplexmlparse:count="1">urn:uuid:1225c695-cfb8-4ebb-aaaa-80da344efa6a</id>
      <updated simplexmlparse:count="1">2003-12-13T18:30:02Z</updated>
      <summary>Some text.</summary>
    </entry>
Each element in the template defines a Python type. Each child element or attribute is turned into an attribute on the Python object. If the element contains any non-whitespace text in the template, it is permitted to have text in the document, and all of that text is stored in the _text attribute on the Python object. By default, attributes and elements are optional. To make an attribute required, its value must begin with the string "required". Since attributes and elements both become attributes on the Python object, their names must be unique. Additionally, their names must be ASCII text and cannot begin with underscores (_) or contain colons (:). Because an element defines a type, SimpleXMLParse does not permit element names to be reused, unless they are defined to be the same type using references.
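The naming rules can be checked before a template is written. This is a minimal Python 3 sketch of those rules as stated above; is_valid_name is a hypothetical helper for illustration and is not part of SimpleXMLParse.

```python
def is_valid_name(name):
    """Check a name against the stated rules (hypothetical helper, not library code)."""
    try:
        # Names must be ASCII text.
        name.encode("ascii")
    except UnicodeEncodeError:
        return False
    # Names must be non-empty, must not begin with an underscore,
    # and must not contain a colon.
    return bool(name) and not name.startswith("_") and ":" not in name


for candidate in ["title", "_text", "dc:creator", "r\u00e9sum\u00e9"]:
    print(candidate, is_valid_name(candidate))
```

The leading-underscore rule presumably keeps element names from colliding with reserved attributes such as _text.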
By default, elements are optional. To change the number of permitted elements, specify the simplexmlparse:count attribute. It takes the following values:
Value | Meaning |
---|---|
? | (default) Zero or one elements. |
1 | Exactly one element. |
* | Zero or more elements (any number). |
+ | One or more elements. |
If the number of elements in the document does not match the number specified by the count attribute, a ParseError exception will be raised.
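The count values in the table amount to a minimum and maximum number of occurrences. The following Python 3 sketch shows one way such a check could work; COUNT_BOUNDS, check_count and the local ParseError class are assumptions made for this illustration and do not reflect how SimpleXMLParse is implemented.

```python
# Hypothetical mapping from a count value to (minimum, maximum) occurrences.
# A maximum of None means "no upper limit".
COUNT_BOUNDS = {
    "?": (0, 1),     # zero or one (the default)
    "1": (1, 1),     # exactly one
    "*": (0, None),  # zero or more
    "+": (1, None),  # one or more
}


class ParseError(Exception):
    """Stand-in for the library's exception, defined here only for the sketch."""


def check_count(name, count, found):
    """Raise ParseError if `found` occurrences of `name` violate `count`."""
    minimum, maximum = COUNT_BOUNDS[count]
    if found < minimum:
        raise ParseError("too few '%s' elements: %d" % (name, found))
    if maximum is not None and found > maximum:
        raise ParseError("too many '%s' elements: %d" % (name, found))


check_count("link", "*", 3)    # accepted: any number is allowed
check_count("title", "1", 1)   # accepted: exactly one
try:
    check_count("id", "1", 0)  # rejected: a required element is missing
except ParseError as exc:
    print("ParseError:", exc)
```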
An element can be the same type as a previously defined element in the document. This allows a definition to be reused. In this case, both the elements will have the same required attributes and child elements. To specify that an element uses a previously defined type, specify the simplexmlparse:ref attribute on the element, with a value equal to the previous element's name.
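As an illustration, a template fragment using a reference might look like the one embedded below. The author and contributor elements and their children are assumptions chosen for the example, not taken from the SimpleXMLParse distribution. The Python 3 code does not call SimpleXMLParse; it only uses the standard library to confirm the fragment is well-formed and to list which elements carry a simplexmlparse:ref attribute.

```python
import xml.etree.ElementTree as ET

# Namespace URI used for the simplexmlparse: prefix in the templates.
SXP_NS = "http://evanjones.ca/simplexmlparse"

# Hypothetical template: contributor reuses the type defined by author.
TEMPLATE = """<entry xmlns="http://www.w3.org/2005/Atom"
       xmlns:simplexmlparse="http://evanjones.ca/simplexmlparse">
  <author>
    <name simplexmlparse:count="1">John Doe</name>
    <email>johndoe@example.com</email>
  </author>
  <contributor simplexmlparse:ref="author"/>
</entry>"""

root = ET.fromstring(TEMPLATE)

# Map each referencing element's local name to the name it refers to.
refs = {}
for element in root.iter():
    target = element.get("{%s}ref" % SXP_NS)
    if target is not None:
        local_name = element.tag.split("}", 1)[-1]
        refs[local_name] = target

print(refs)
```

Under this reading, a contributor element in a document would need a name child and could carry an optional email child, exactly as author does.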
content
element. It should be possible to collect this XML into an attribute on the parent object.