If you find a mistake, or something is unclear, please email firstname.lastname@example.org so I can fix the text.
We have earlier looked at the values (nodes, primitives, and sequences) that XQuery works with. In this article we will look more deeply into the XQuery/XPath data model and type system. On the way we will touch on a fair bit of background material, including XML Schemas and XML infosets.
The XQuery data model is based on the XML Information Set standard (W3C Recommendation 24 October 2001, http://www.w3.org/TR/xml-infoset). It rather abstractly defines the information content of an XML document as a document item that contains nested element items, which in turn contain namespace, attribute, character and other items. This is a conceptual standard: It does not define any file formats or programming interfaces, but rather it defines the interpretation of an XML file. It is intended to be useful for defining other XML-related standards, including the XQuery/XPath data model.
XML files that are different at the character level but that have the same information set or infoset are for most practical purposes equivalent. For example:
<a b='upcase' ><![CDATA[Hello!]]></a>
<a b="upcase" >Hello!</a>
have the same information sets.
The "Canonical XML" recommendation is a related standard in that it specifies a unique ("canonical") way to convert an XML infoset (back) into an XML document. Two XML documents that are "logically equivalent" (i.e. have the same infoset) translate to the same canonical XML representation. The Canonical XML for the above example is:
<a b="upcase">Hello!</a>
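As a rough illustration outside XQuery, the following Python sketch uses the standard library function xml.etree.ElementTree.canonicalize (Python 3.8 or later, which implements Canonical XML 2.0) to check that the two documents above canonicalize to the same text. The variable names are made up for this example.

```python
from xml.etree.ElementTree import canonicalize

# Two documents that differ at the character level (quote style, CDATA section).
doc_cdata = "<a b='upcase' ><![CDATA[Hello!]]></a>"
doc_plain = '<a b="upcase" >Hello!</a>'

# canonicalize() parses its input and returns the canonical serialization.
canon_cdata = canonicalize(xml_data=doc_cdata)
canon_plain = canonicalize(xml_data=doc_plain)

print(canon_cdata)
print(canon_cdata == canon_plain)
```

Comparing canonical forms is one practical way to test whether two files carry the same information, without caring about quoting or CDATA sections.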
A parsed XML file results in an infoset, but there can also be synthetic infosets that are constructed from other sources, such as a database, or created by a program that manipulates a DOM (Document Object Model). A DOM is a popular data structure API used to encode and manipulate XML data - i.e. infosets.
The XQuery language also allows you to create infoset items, using element constructor expressions or pre-defined functions.
Node values represent the parts of an XML document, or more generally an XML infoset. Nodes are also used to represent document fragments - i.e. stand-alone nodes that are not part of a document (for example, those that might be generated by an element constructor expression).
These are the kinds of nodes, most of which are as you would expect:
Document nodes represent a complete XML document.
Element nodes represent XML elements.
Attribute nodes represent the attributes of an element. Note that namespace definitions are represented by namespace nodes instead.
Namespace nodes represent the in-scope namespaces of an element. (You cannot actually "get at" a namespace node in XQuery, though you can get at them in XPath using the namespace axis, which has been deprecated in XPath 2.0.)
Processing instruction nodes represent embedded XML processing instructions.
Comment nodes represent XML comments.
Text nodes represent character data. Note that an infoset consists of single-character items, but in the node representation multiple contiguous character items are represented in a single text node.
The XQuery 1.0 and XPath 2.0 Data Model specification (http://www.w3.org/TR/query-datamodel/) goes into details about the different kinds of nodes. It also defines a number of functions on nodes, using the prefix dm. For example, the function dm:node-kind takes a Node and returns a string value that represents the node's kind, one of "document", "element", "attribute", "namespace", "processing-instruction", "comment", or "text".
Note that these functions only serve to explain the data model: You cannot call them from an XQuery program, and dm isn't actually bound to any namespace. However, in some cases there may be a function in the Functions and Operators document with the same name and behavior. Those are available to your XQuery programs. For example there is a fn:node-kind function which you can use, and which is defined to return the same result as dm:node-kind.
Nodes in XQuery are immutable, which means you cannot change any part of a node once it has been created. This makes sense, since XQuery is a pure side-effect-free expression language. However, nodes do have identity: two nodes that were created using different expressions are distinct nodes, even if they contain the same data. You can compare the latter using the standard function fn:deep-equal. The following examples are all true:
fn:deep-equal(<a>test</a>, <a>test</a>)
<a>test</a> isnot <a>test</a>
let $x := <a><b></b></a> return $x/b is $x/*
Because nodes have identity, you can talk about making a copy of a node, because you can distinguish the original from the copy. However, atomic values do not have identity: There is no way to copy the string "xyzzy" because there is no way to distinguish the copy from the original. That is the difference between a string atomic value and a text node, which you can copy.
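The distinction between identity and equal content can be sketched in Python, as an analogy only. The deep_equal helper below is a hypothetical, simplified stand-in for fn:deep-equal, and ElementTree elements are mutable, unlike XQuery nodes.

```python
import xml.etree.ElementTree as ET


def deep_equal(x, y):
    """Compare two elements by name, attributes, text, and children (recursively)."""
    return (
        x.tag == y.tag
        and x.attrib == y.attrib
        and (x.text or "") == (y.text or "")
        and (x.tail or "") == (y.tail or "")
        and len(x) == len(y)
        and all(deep_equal(cx, cy) for cx, cy in zip(x, y))
    )


first = ET.fromstring("<a>test</a>")
second = ET.fromstring("<a>test</a>")

same_data = deep_equal(first, second)   # True: the content is the same
same_node = first is second             # False: two distinct node objects

x = ET.fromstring("<a><b></b></a>")
same_child = x.find("b") is list(x)[0]  # True: both expressions reach one node

# Atomic values compare by value; there is no useful notion of "the copy".
strings_equal = "xyzzy" == "".join(["xy", "zzy"])
```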
A node may have children, which are the nodes below it in the tree hierarchy, not including attribute or namespace nodes. For example, the children of an element node are the text nodes and nested elements (and occasionally other nodes) that are its contents. The function dm:children takes a node and returns the sequence of its children. (This function only exists in the data model; to get the children of a node N in XQuery use the expression N/node().) Only document and element nodes can have children, so dm:children returns the empty sequence for other node types.
For each node X that is a child of Y, the node X has a parent property that is the node Y. Using the data model, you can get at Y using the dm:parent function; in an XQuery program you have to use the path expression X/.. (the parent step) instead.
These properties have some surprising consequences. Because nodes are immutable, you have to specify the children of an element or document when you create it. However, those children have to have their parent property set to the new node - but you can't modify them, as they are immutable. This chicken-and-egg problem is solved by creating new copies of the children, with the parent property of the new nodes set to the new parent. Any children of the children also have to be copied.
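The copy-on-construction rule can be modelled with a toy Python class. This is only a sketch of the idea: the Node class and its _copy_under helper are invented for illustration, and immutability is a convention here rather than something the code enforces.

```python
class Node:
    """A toy element node; nothing is modified after the constructor returns."""

    def __init__(self, name, children=(), parent=None):
        self.name = name
        self.parent = parent
        # Children are copied so that each copy can point at the new parent.
        self.children = tuple(child._copy_under(self) for child in children)

    def _copy_under(self, new_parent):
        # Copying a node also copies its whole subtree.
        return Node(self.name, self.children, parent=new_parent)


c = Node("c")
b = Node("b", [c])
a = Node("a", [b])

copied_b = a.children[0]
copied_c = copied_b.children[0]
```

The original b and c keep an empty parent property, while the copies reachable from a point back up the new tree.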
Note, however, that this copying of nodes is part of the specification, but an implementation is free to optimize away the copying if it doesn't change the result. For example, consider an element constructor for <a> whose content consists of newly constructed <c> elements. The specification says that the <c> nodes are created, and then copied when <a> is created. But since there is no way to access the old <c> nodes, an implementation is free to just re-use the old nodes without copying them, or it can create them in-place at the same time it creates <a>.
This is an example of the important difference between specification and (valid) implementation. The lack of side effects in XQuery gives the implementation extra flexibility in choosing how to implement things. A possible disadvantage is that it makes it hard to estimate how much work is done for an XQuery program, unless you are very familiar with your implementation. On the other hand, you usually don't need to know.
More generally, an implementation is free to represent nodes in any way compatible with the specification. An obvious choice is to use the standard Node type specified in the W3C's Document Object Model (DOM) (http://www.w3.org/DOM). However, though DOM is a flexible and convenient API, it is quite space-inefficient. As an example of an alternative representation, the Qexo implementation (http://www.gnu.org/software/qexo/) uses a single TreeList for an entire document. The TreeList contains an internal array, and node objects are identified by indexes into that array. (The Apache Xalan XSLT processor uses a similar Document Array Model representation.) In fact, an implementation may in some cases not create actual node objects at all. Consider that the ultimate result of evaluating an XQuery expression is often written out to a file as a new XML document. In that case the XQuery processor can write out the nodes on-the-fly directly to the output file, without ever creating any nodes. More generally, the XQuery processor can "write" the output to a SAX DocumentHandler or a similar event-driven interface.
Sometimes it is useful to take a node and convert it to a string value. The function fn:string does that. The string value of a text node is the characters in the node. The string value of an element or document node is the concatenation of the text node descendants of the node in document order. The string value of an attribute node is the attribute value.
There is also the typed value of an element, attribute, or text node, which you can extract using the fn:data function. This is the value of a node as a sequence of atomic values, as the result of Schema validation. If an element node has a complex type, then the typed value is undefined.
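To make the two notions concrete, here is a small Python sketch. The string value is computed the way the text describes (descendant text in document order); the "typed value" part is only a crude approximation that assumes the content is a space-separated list of integers, since no schema validation takes place.

```python
import xml.etree.ElementTree as ET

para = ET.fromstring("<p>Hello <b>big</b> world</p>")

# String value: concatenate all descendant text in document order.
string_value = "".join(para.itertext())

# A crude stand-in for a typed value: split list content and convert each item.
throws = ET.fromstring("<throws>5 2 2 2 1</throws>")
typed_value = [int(token) for token in "".join(throws.itertext()).split()]
```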
The XQuery and XPath languages are typed expression (functional) languages. This means that programs are made from expressions (which may in turn contain sub-expressions), and that evaluating an expression results in a value, which has a type.
Informally, a type is a set of values: those values that are instances of or belong to the type. The type system of a programming language is the collection (vocabulary) of types that the language definition distinguishes, including the rules for determining whether a value is an instance of a type, and for how to create complex types from simple types.
A type error occurs when the operands of an operation have types that are not allowed for that operation. For example, in XQuery you can add two numbers using the + operator, but you can't add two nodes, even if the nodes contain integer values. If your program tries to add two nodes, the XQuery processor should give you an error message instead.
It is useful to distinguish between the dynamic types and static types:
The type of a value is a dynamic type. Dynamic types exist during evaluation (at run-time). Dynamic types are sets of values that are instances of the type. A type specifies the meaning or interpretation of a value.
The type of an expression (a program fragment) is a static type. Static types are the types of declarations and program fragments as specified by the programmer or inferred by a compiler. If an expression has a (static) type and you evaluate the expression without a run-time error, then the result is guaranteed to be an instance of the corresponding static type.
A dynamically typed language is one that doesn't have static types. Another way to say the same thing is that there is only a single type, which contains all values. In those languages, all type errors are run-time errors. The goal of a static type system is to detect type errors at compile time, before actual execution. This is a process called type checking, and in some languages (including XQuery) is a fairly complicated process.
Static type checking lets you detect and fix errors earlier. This is especially valuable for infrequently executed parts of a program, since they are less likely to get much testing. As a side benefit, if the compiler can determine the type of an expression, it may be able to generate more efficient code, and so the query may execute faster.
The XQuery and XPath languages specify both dynamic types and static types. The static type checking is optional, both for implementors and users: An XQuery implementation need not implement the static typing feature, and implementations that do implement static typing will have an option to disable it.
We will discuss static typing later, but first we will study dynamic typing, including the kinds of values that XQuery and XPath deal with. The data model is part of dynamic typing.
The values worked on by an XQuery program are sequences of items. An item is either an atomic value (for example an integer or a string) or a node (for example an element or an attribute).
A sequence is a collection of zero or more items. The most important idea to note is that not only are all sequences values, but also all values are sequences, because a sequence of just a single value is in all respects the same as the single value. It follows from this that you cannot nest sequences - you cannot have sequences of sequences, only flat single-level sequences.
If you have experience with arrays or lists in other programming languages, you might think it is a strange and limiting restriction that you can't nest sequences. Actually, it isn't really a limitation, because you can always use nested elements if you need nested data. For example, to represent a two-dimensional array you can use nested elements like this:
<list>
  <list>11 12</list>
  <list>21 22</list>
</list>
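The flattening behaviour of sequences can be imitated in Python with a hypothetical seq helper that splices nested lists into one flat list. The analogy is imperfect: Python still distinguishes the value 5 from the one-element list [5], whereas XQuery does not.

```python
def seq(*items):
    """Build a flat sequence; nested sequences are spliced in, never nested."""
    flat = []
    for item in items:
        if isinstance(item, list):
            flat.extend(seq(*item))
        else:
            flat.append(item)
    return flat


nested_attempt = seq(1, seq(2, 3), seq(), seq(seq(4)))
singleton = seq(seq(5))
empty = seq()
```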
A major difference between XPath 1 and XPath 2 is that the latter has sequences, while the former does not. Instead, XPath 1 has node sets, which are like sequences, but without duplicates, and in unspecified order. XPath 1 path expressions evaluate to node sets, while in XPath 2 (and XQuery) path expressions evaluate to node sequences. However, the latter sequences are defined to be sorted in document order and with duplicates removed. (These are actually equivalent, in that you can map a set into a sequence that is ordered and without duplicates, and back again, without information loss. Furthermore, any valid XPath 1 expression will behave the same under either model.)
The XQuery/XPath primitive types are the same as in XML Schema, which is a standard for specifying element structure of XML data, and associating types with XML data.
Atomic values include numbers, strings, and booleans. There are two kinds of atomic type:
A primitive type is not defined in terms of some other type.
A derived type is based on some other type, its base type.
A derived atomic type is a restriction of its base type, because it is a restriction (sub-set) of the set of atomic values that belong to the base type.
Following is a complete list of the built-in types defined by XML Schema. We will only list them briefly; for more information see the W3C Recommendation (02 May 2001) of XML Schema Part 2: Datatypes (http://www.w3.org/TR/xmlschema-2/). This specifies for each type its value space (the abstract values that belong to the type), its lexical space (the text representation of values using printable characters), and its facets (properties of the type itself).
XML Schema defines the following builtin types:
boolean is one of the two truth values true and false.
string is zero or more Unicode characters.
There are various sub-types of string. A normalizedString is a string that does not have any whitespace characters except for space. A token is a normalizedString that has no leading or trailing spaces, and does not have two or more spaces in a row. A language is a token used to specify a natural (human) language. An NMTOKEN is a token consisting of one or more NameChar characters, as defined in the XML standard. A Name is a token used to represent XML names, such as xhtml:body. An NCName is a plain Name without a colon, such as body. The types ID, IDREF, and ENTITY are sub-types of NCName used for special kinds of attribute values as specified in the XML standard. The types IDREFS, ENTITIES, and NMTOKENS are used for space-separated sequences of the corresponding singular types. NOTATION is used for attributes that specify the notation (encoding) of an element. However, NOTATION is not a sub-type of string.
anyURI represents a Uniform Resource Identifier Reference, such as http://www.w3.org/.
QName represents an XML qualified name, which is a pair of a namespace name and a local part. Note that in the value space a namespace name is an anyURI (such as http://www.w3.org/1999/xhtml), while the lexical representation uses namespace prefixes (as in xhtml:body). Therefore mapping between the two requires a context that contains the needed namespace declaration.
decimal is an arbitrary-precision real number, in base 10. An integer is a decimal without a fractional part. The types nonPositiveInteger, negativeInteger, nonNegativeInteger, and positiveInteger are the obvious sub-types of integer. The types long, int, short, byte, unsignedLong, unsignedInt, unsignedShort, and unsignedByte are sub-types of integer that can be encoded in binary using respectively 64, 32, 16, or 8 bits.
float and double correspond to 32-bit and 64-bit IEEE binary floating-point real numbers. The standard lexical representation uses decimal format, with an optional exponent, such as 1.25e-10, even though these types are not sub-types of decimal.
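The practical difference between decimal and double shows up as soon as you do arithmetic. The following Python sketch uses the standard decimal module as a stand-in for the decimal type and Python's float (a 64-bit IEEE binary value) as a stand-in for double.

```python
from decimal import Decimal

# decimal: exact base-10 arithmetic.
decimal_sum = Decimal("0.1") + Decimal("0.2")

# double: 64-bit IEEE binary floating point, so 0.1 and 0.2 are only approximated.
double_sum = 0.1 + 0.2

# The lexical form 1.25e-10 is decimal notation, but the stored value is binary.
tiny = float("1.25e-10")

print(decimal_sum == Decimal("0.3"))
print(double_sum == 0.3)
```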
There are a number of time-related types: A date is a calendar date, such as May 31, 1999 (written 1999-05-31). A time is an instant that occurs every day, like 1:20pm (written 13:20:00). A dateTime is a specific instant of time, like 1:20pm on May 31 1999 (written as 1999-05-31T13:20:00). Any of these may have an optional timezone specified. A gYear is a specific year in the Gregorian calendar, while a gYearMonth is a specific year and month. A gMonthDay is a month and day that recurs every year, a gMonth is a month that recurs every year, and a gDay is a day that recurs every month. A duration is a duration of time, like 2 days and 1 hour (written as P2DT1H). The XQuery/XPath committee has added two sub-types of duration, which may get added to future Schema revisions: xdt:yearMonthDuration (a duration of some number of years and months) and xdt:dayTimeDuration (a duration of some number of days, hours, minutes, and seconds).
hexBinary and base64Binary are used to encode arbitrary binary data. The value of either is zero or more octets (8-bit bytes). A hexBinary uses two hexadecimal digits for each octet, so 0FB7 encodes the 16-bit integer 4023. A base64Binary uses the Base64 MIME Content-Transfer-Encoding.
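The two encodings are easy to try out with the Python standard library. This sketch decodes the hexBinary example from the text and re-encodes the same two octets as Base64; the variable names are illustrative.

```python
import base64

octets = bytes.fromhex("0FB7")              # hexBinary lexical form -> two octets
as_integer = int.from_bytes(octets, "big")  # read the octets as a 16-bit integer
as_base64 = base64.b64encode(octets).decode("ascii")
round_trip = base64.b64decode(as_base64)    # back to the original octets

print(as_integer, as_base64)
```

Both lexical forms denote the same value, a sequence of two octets; only the text representation differs.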
The union of all primitive types is anySimpleType.
All of these standard type names are in the http://www.w3.org/2001/XMLSchema namespace, conventionally written using the predefined namespace prefix xs, as in xs:integer.
The XQuery specification adds four types: the duration types xdt:yearMonthDuration and xdt:dayTimeDuration are mentioned above; xdt:anyAtomicType includes all the atomic values; xdt:untypedAtomic is a type used for untyped data, such as text that has not been validated. All are subtypes of anySimpleType.
The word schema comes from the database community, and means a description of the structure, types, and relations of a database. In the XML world a schema is a description of the syntax and meaning (types) of a class of XML documents. A schema language is a formalism for specifying the types of documents as schemas.
The earliest XML schema language is DTD (Document Type Definition), which appears in the original XML specification from 1998, and goes back to the SGML roots of XML. DTD is a simple language that lets you express simple structural constraints. For example, the following:
<!ELEMENT tr (td*)>
means that a <tr> element consists of zero or more <td> elements.
DTD does not have any mechanism for specifying semantic or type information, except in a very few cases. Other schema definition languages allow you to define and specify types.
XML Schema (http://www.w3.org/XML/Schema) is a 2001 specification from W3C that can be used to specify structural constraints and associate type information with XML documents. While there are other schema languages in use, this is the one with most usage and visibility, partly because it is a W3C standard. The type semantics of XQuery/XPath2 are defined in terms of XML Schema.
As an example we will use the record of a series of dice throws. Perhaps you want to verify the dice are fair, or you want a source of random numbers, or you want to search for mystical patterns.
<?xml version="1.0"?>
<die-tests>
  <die-test>
    <who>Nathan</who>
    <when>whenever</when>
    <throws>5 2 2 2 1 3 6 6 2 6</throws>
  </die-test>
  <die-test>
    <who>Per</who>
    <when>2002-10-09T09:07</when>
    <throws>6 2 5 2 2 3 3 3 4 1</throws>
  </die-test>
</die-tests>
The Schema for this might look like the following:
<?xml version="1.0"?>
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <xsd:element name="die-tests">
    <xsd:complexType>
      <xsd:sequence>
        <xsd:element name="die-test" type="die-test-type" minOccurs="0" maxOccurs="unbounded"/>
      </xsd:sequence>
    </xsd:complexType>
  </xsd:element>
  <xsd:complexType name="die-test-type">
    <xsd:sequence>
      <xsd:element name="who" type="xsd:string"/>
      <xsd:element name="when" type="xsd:dateTime"/>
      <xsd:element name="throws">
        <xsd:simpleType>
          <xsd:list itemType="die6-result"/>
        </xsd:simpleType>
      </xsd:element>
    </xsd:sequence>
  </xsd:complexType>
  <xsd:simpleType name="die6-result">
    <xsd:restriction base="xsd:integer">
      <xsd:minInclusive value="1"/>
      <xsd:maxInclusive value="6"/>
    </xsd:restriction>
  </xsd:simpleType>
</xsd:schema>
This is verbose, and may be a bit intimidating, but it is relatively straightforward. It contains a top-level element declaration for the root element die-tests, as well as type definitions for types named die-test-type and die6-result. A simple type (such as die6-result) can only be expressed as character data. A complex type (such as die-test-type) can consist of attribute specifications, and either sub-elements (element content), character data (simple content), or a mixture of these (mixed content). The type of an attribute can only be a simple type, while the type of an element can be either a simple type (if it only contains text data), or it can be a complex type. All pre-defined types (such as xsd:dateTime) are simple types.
The top-level element declaration for die-tests says that any element with the die-tests tag has the structure and type specified: It is a complex type consisting of a sequence of 0 or more elements that have the tag die-test, and the content of each such die-test element has the type with the name die-test-type. (It is possible to specify that a given element tag can have different types in different contexts, but we'll ignore that possibility.)
The definition of the complex type die-test-type specifies that any element declared to have that type (in our case die-test) consists of a sequence of a <who> element, a <when> element, and a <throws> element. The latter is a space-separated list of die6-result values. The definition for the simple type die6-result says that a die6-result is an integer in the range 1 through 6.
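The constraint expressed by die6-result can be checked by hand, as in the following Python sketch. This is not a schema validator: it only approximates the rule for the <throws> list, the helper names are invented, and the second die-test (by "Somebody", with an out-of-range throw) is made-up data added to show a failure.

```python
import xml.etree.ElementTree as ET

DOCUMENT = """
<die-tests>
  <die-test>
    <who>Per</who>
    <throws>6 2 5 2 2 3 3 3 4 1</throws>
  </die-test>
  <die-test>
    <who>Somebody</who>
    <throws>6 7 1</throws>
  </die-test>
</die-tests>
"""


def is_die6_result(token):
    """die6-result: an integer in the range 1 through 6."""
    try:
        return 1 <= int(token) <= 6
    except ValueError:
        return False


def check_throws(die_test):
    """Return the tokens in <throws> that are not valid die6-result values."""
    tokens = (die_test.findtext("throws") or "").split()
    return [t for t in tokens if not is_die6_result(t)]


root = ET.fromstring(DOCUMENT)
problems = {t.findtext("who"): check_throws(t) for t in root.findall("die-test")}
print(problems)
```

A real schema processor would additionally check element order, the dateTime content of <when>, and everything else the schema declares.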
To validate an XML document against a schema means to scan the document, verifying that the document satisfies the constraints specified in the schema. The result is a post-schema validation infoset (PSVI), which is an infoset (as defined earlier) with additional type annotations. A type annotation is the QName of a type named in a schema.
An XQuery processor may optionally implement the Schema import feature. If it does, it must be able to import definitions from external schemas and validate node trees.
Each element or attribute in XQuery has a type annotation, which is its dynamic type. If an element has not been validated, or otherwise been given a type annotation, then it has the default type annotation xs:anyType. The corresponding default for an attribute node is the type xs:anySimpleType.
Atomic (non-node) values can also have type annotations. The annotation xdt:untypedAtomic indicates that the type is unknown, typically raw text from a schema-less XML file. Operations that take atomic values may convert xdt:untypedAtomic to a more specific type, such as xs:double, but if the atomic value is of the wrong kind (a string where a number is required, as in the operands of +) then a run-time error may be signaled.
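The conversion of untyped data on demand can be imitated in Python. In this sketch the UntypedAtomic class and the add function are hypothetical stand-ins: untyped operands are converted to a floating-point number before the addition, and text that is not numeric causes a run-time error.

```python
class UntypedAtomic(str):
    """Raw text whose type is not yet known (a stand-in for untyped data)."""


def add(left, right):
    """Add two atomic values, converting untyped operands to double first."""
    def to_number(value):
        if isinstance(value, UntypedAtomic):
            return float(value)  # raises ValueError if the text is not numeric
        if isinstance(value, (int, float)) and not isinstance(value, bool):
            return value
        raise TypeError(f"cannot add a value of type {type(value).__name__}")
    return to_number(left) + to_number(right)


total = add(UntypedAtomic("2.5"), 4)

try:
    add(UntypedAtomic("whenever"), 1)
    conversion_failed = False
except ValueError:
    conversion_failed = True
```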
An XQuery application can use a validate expression:
validate ( EXPR )
This takes a sequence of elements, strips off any existing type annotations, and adds type annotations as specified by the in-context schema definitions. The latter are all the schema element declarations and type definitions that are imported by schema import declarations. (You can optionally specify a SchemaContext that can be used with context-dependent schema types.)
A schema import declaration appears in the Query Prolog of an XQuery program. For example:
import schema "http://www.w3.org/1999/xhtml" at "xhtml.xsd"
This tells the XQuery processor to look at the location specified (in this case by the relative URL "xhtml.xsd") and add any schema components in the specified namespace (http://www.w3.org/1999/xhtml) to the set of visible schema components. These now become available for validation.
Note that Schema validation and type annotation are conceptually dynamic (run-time) type operations. A type annotation is a QName that is associated with a value, not associated with a static (compile-time) expression. Next we will look at static type-checking.
The XQuery language provides operations to check whether a value belongs to a type, as well as mechanisms to declare that a variable or parameter has a specific type. In an XQuery program (static) types are instances of SequenceType. We won't go into detail about SequenceType, but here are some examples:
text() - Matches any text node.
element() - Matches any element node.
element(xhtml:td,*) - Matches any element node whose tag has the local part td and has the same namespace URI that xhtml is bound to. (It does not have to have xhtml as the actual namespace prefix.)
element(*, die6-result) - Matches any element of any tag that has a type annotation of die6-result.
element(xhtml:title)? - Matches an optional xhtml:title element, i.e. zero or one items that match element xhtml:title, and whose type annotation matches that declared for xhtml:title in an imported schema.
node()* - Matches a sequence of zero or more nodes.
item()+ - Matches any non-empty sequence.
attribute(@ID, *) - Matches any attribute node whose name is ID (in the empty namespace).
xs:integer - Matches any integer type or any type derived from it, such as xs:nonNegativeInteger, assuming this is in scope of a namespace declaration that binds xs to http://www.w3.org/2001/XMLSchema, which is normally the case.
These types can be used for XQuery's type-checking and -conversion operators. Here is a very brief summary; see the specification or other chapters for details and examples.
expr instance of type - Returns true if the value of expr matches (is an instance of) type.
cast as type (expr) - Convert the value of expr to a given type, using certain standard conversions.
treat as type (expr) - Treat the expr as having static type type. At run-time, a dynamic error is signaled if the value of expr is not an instance of type.
typeswitch (expr) case type1 return expr1 ... default return exprd - Select the first case whose type matches the value of the expr, and evaluate the corresponding return expression.
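The behaviour of these operators can be approximated with Python's run-time type tests. The functions below are hypothetical analogues, not XQuery: Python classes stand in for sequence types, and there is no static typing involved.

```python
def instance_of(value, type_):
    """Analogue of 'expr instance of type'."""
    return isinstance(value, type_)


def treat_as(value, type_):
    """Return value unchanged, or raise if it is not an instance of type_."""
    if not isinstance(value, type_):
        raise TypeError(f"{value!r} is not an instance of {type_.__name__}")
    return value


def typeswitch(value, cases, default):
    """Evaluate the branch of the first case whose type matches value."""
    for type_, branch in cases:
        if isinstance(value, type_):
            return branch(value)
    return default(value)


def describe(value):
    return typeswitch(
        value,
        [(int, lambda v: "integer"), (str, lambda v: "string")],
        default=lambda v: "something else",
    )


converted = int("42")  # a rough analogue of casting a string to an integer
```

Note that treat_as never converts anything: it either passes the value through or fails, which is the distinction the summary above draws between treat and cast.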
An XQuery implementation may optionally implement the Static Typing Feature. This means that the implementation is required to detect static type errors at analysis (compile) time. At the time of this writing, the specification has a number of unresolved issues, and I don't know of any implementation that actually does implement static typing. (However, some of the precursor languages that inspired XQuery do implement static typing.) For these reasons, plus the fact that the specification of static typing is big and formal, I won't go beyond mentioning a few of the concepts.
The static type system defined in the XQuery formal semantics (http://www.w3.org/TR/query-semantics/) goes far beyond what you can express as a SequenceType. It includes most of the type specification concepts of XML Schema. The formal semantics defines extra declarations define type, define element, and define attribute. These are not in the XQuery source language (i.e. you can't write them directly), but are a formalism used in the formal semantics to express types imported from schemas. The idea is that an XQuery program is translated to core XQuery, which is simpler and more regular (but less convenient) than the actual XQuery program. Part of this translation is that Schema import declarations are translated to define type, define element, and define attribute declarations. These internal declarations, as well as the whole concept of core XQuery, are purely part of the formal specification of XQuery: There is no requirement that any implementation implement the translation to core XQuery, only that it acts as if it does.
Static type checking is done at the level of core XQuery at analysis (or compile) time. There are a whole slew of rules that say things like: if the type of expr1 is xsd:boolean, the type of expr2 is type2, and the type of expr3 is type3, then the type of
if (expr1) then expr2 else expr3
is (type2|type3). Here (type2|type3) is a type expression in the formal semantics, which you cannot write directly in the actual XQuery language.
Copyright 2002, 2003 (C) Per Bothner.
Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, version 1.1.