AccessMyLibrary provides FREE access to millions of articles from top publications available through your library.
XML has begun to work its way into the business computing infrastructure and underlying protocols such as the Simple Object Access Protocol (SOAP) and Web services. In the performance-critical setting of business computing, however, the flexibility of XML becomes a liability due to the potentially significant performance penalty.
XML processing is conceptually a multitiered task, an attribute it inherits from the multiple layers of specifications that govern its use: XML, (1,2) XML Namespaces, (3) XML Information Set (Infoset), (4) and XML Schema. (5) Traditional XML processor implementations reflect these specification layers directly. Bytes, read off the "wire" or from disk, are converted to some known form (often UTF-16 characters) and tokenized (UTF stands for Universal Text Format). Attribute values and end-of-line sequences are normalized. Namespace declarations and prefixes are resolved, and the tokens are then transformed into some representation of the document Infoset; at the same time, checking for well-formedness (1,2) is performed. The Infoset is optionally checked against an XML Schema grammar (XML schema, schema) for validity and rendered to the user through some interface, such as Simple API for XML (SAX) or Document Object Model (DOM) (API stands for application programming interface).
In practice, these tasks are combined to some extent. Typically a generic parser handles scanning, XML normalization, namespaces, and well-formedness checking, as required by the XML specification. Validation is then grafted on as a separate process, operating as a filter on the output of the generic parser. Because validation is an add-on in such a design, it has a strictly detrimental effect on parser performance. Validation is, therefore, typically used exclusively for debugging, if at all, and is disabled during production.
Although schema validation is expensive in the above scenario, there is no fundamental reason why validation needs to be expensive. Indeed, grammars have long been used to generate optimized special-purpose parsers that operate much more efficiently than their generic counterparts, while performing validation checking. (6) The XML specifications were designed to enable the compilation of an XML Schema grammar to a special-purpose parser (see Section 2.4 in Reference 5).
Techniques that apply standard grammar-based parser generation to XML Schema grammars have been used to demonstrate that compilation of schemas can produce high-performance special(7) purpose parsers. (7) Traditional parser-generation schemes are not, however, particularly well-suited to XML parsing and have difficulty representing some XML Schema constructs that are not found in traditional parsing situations. At the same time, the syntax of XML and the constraints defined in an XML schema are not sufficiently complex to require the full power of traditional parser-generation methods.
Previous efforts in this area that built on conventional intermediate representations have, in general, supported fewer features of XML (8) or delivered less efficient solutions. (9)
Rather than adapting a conventional intermediate representation to the forms of XML and XML Schema, we propose a compilation technique that deals directly with the abstract schema components of the XML Schema Recommendation. (5) By tying the code-generation scheme directly to the schema components, we are able to take advantage of the simple lexical structure of XML and the determinism assurances built into XML Schema grammars.
The generated validating parser drives the optimized scanning process. Two complementary optimization strategies, specialization and optimistic scanning, are used to speed scanning and validation. Specialization focuses on the use of specialized context-sensitive scanning primitives that can scan and validate the input efficiently. Optimistic scanning speeds the scanning of the common cases, such as simple data without comments or entity references. The resulting generated parser is shown to be significantly faster than some widely available parsers, both validating and nonvalidating.
In the next section, we describe the challenges involved in generating parsers through compilation of XML schemas. Following that, we propose an architecture for direct schema compilation, highlighting the breadth of support for schema features and the targeted optimizations that minimize the cost of parsing. Then we provide performance measurements of sample generated parsers and compare those to performance measurements of commonly used parsers. Finally, we describe related work from the technical literature and conclude with final comments.
CHALLENGES OF XML SCHEMA COMPILATION
XML Schema, and the specifications on which it depends, present several challenges to schema-based parser generation: XML namespaces and the dynamic typing features of XML Schema complicate the scanning of markup. As a result, the schema grammar and the lexical production of XML are not easily combined with traditional grammar compilation techniques. Additionally, XML Schema provides support for content models that are difficult to represent in traditional automaton models, making the traditional models inefficient as intermediate representations of the schema.
Throughout the XML processing stack, markup and meta-markup (such as namespace declarations and xsi:type attributes) assert scoped properties and declarations for the containing element and all of its attributes, as well as its content. In the case of XML namespaces and the dynamic typing mechanism used in XML Schema, this pattern presents some difficulties for naive parser implementations.
Namespaces qualify XML element and attribute names by using a namespace-prefix declaration. The declaration takes the form of a special attribute with a reserved prefix (xmlns) followed by the prefix to be declared. The value of this attribute is the declared namespace. The scope of the namespace declaration includes the enclosing element, all of the sibling attributes, and the element's content. This arrangement, although natural, presents some difficulties for XML processors.
Beyond the syntactic complexities of the namespace declaration, the XML Namespaces Recommendation augments the well-formedness constraints of XML to forbid the repetition of a qualified attribute regardless of its lexical representation in the tag. Thus, in the examples below, the first my-elt-1 is well-formed, but my-el t-2 is not, because both attributes resolve to the same qualified name.
During processing, namespace declarations prevent the qualified names of the element and its attributes from being conclusively known until the end of the tag. This means that scanning of qualified names in XML requires infinite look-ahead to fully resolve names. In the preceding examples, the second a:my-elt-1 appears to be the same as the first until the last attribute is scanned.
The pattern of declaration used in XML namespaces is typical throughout the XML processing stack. In the XML layer, for example, the predefined attributes xml:lang and xml:space, which may be used to indicate natural language and desired white-space handling, use this pattern. The values of these attributes, however, do not affect validation, and therefore do not complicate scanning or validation.
Dynamic typing in XML Schema
XML Schema includes a mechanism for dynamic typing of instance elements. By using the xsi:type attribute, an instance element may assert its XML Schema type. The declared type must be validly derived from the type that would otherwise have been used to validate the element, with respect to the constraints on type derivation. This declared type may have a significantly different content model from the default type that is otherwise expected.
The syntax of xsi:type is similar to that of namespace declarations and poses the same kinds of processing hurdles. In particular, the possibility of an xxi:type attribute prevents an XML processor from conclusively determining the type declaration to use for validation until the entire tag has been scanned. Furthermore, because the element type declaration governs the type declarations used to validate the attributes, the processor cannot conclusively determine the types--and therefore the validity--of the attributes until the entire tag has been read. In the example below, the element will be invalid if the dynamic type restricts the attributes to be of type xsd:integer:
XML Schema content models
Like the Document Type Definition (DTD) grammar used in XML, XML Schema can specify an element's content model as a regular expression over its contained element. In contrast to the grammars that can be specified with an XML DTD, however, XML Schema supports a wider range of operators in the composition of content models. In particular, the arbitrary finite occurrence constraints and xsd:all groups of XML Schema pose challenges to automaton-based approaches to compilation. Arbitrary finite occurrence constraints can lead to an explosive growth in the number of states for simple automaton approaches. In the standard implementation, an element declaration with a maximum occurrence constraint of 5000 will result in 5000 states corresponding to each possible occurrence in the range. Models of xsd:all group content are not represented in any standard regular expression syntax and require significant augmentation of the automaton model. If translated directly into a standard automaton model, xsd:all groups result in an expansion of states that is combinatorial in the number of members of the group.
Because the xsd:all compositor is not well represented in traditional models, much of the previous work on XML schema compilation has treated xsd:all groups as a "corner case" (that is, unimportant). In practice, however, xsd:all groups are quite important, as the xsd:all group is considered to be a natural translation of a data structure with named fields, such as a C struct or a Java class, where the members are identified by name, rather than by position. In practice, we have seen that XML schemas for data stored in relational databases often have a plethora of xsd:all groups. These scenarios are quite common for Web services.
XML character data