About the namespace riddle

NOTE: This is a restored version from this archive.

WARNING : This document will soon be deprecated. The ‘Back to Basics’ series of articles will replace it as soon as it is completed.

The riddle

RDDL was created, according to its authors, in response to the following question (or riddle, hence the name) :

Some namespace names are URLs. What should I get when I type such a URL into my browser ?

As this question ignores namespace names that are not URLs, let’s rewrite it this way :

What should I obtain when resolving a namespace name ?

Resolving non-URL thingies is feasible, through the indirection of a catalog. So what should those catalogs return when resolving a namespace name ?

Well, if we agree that names are handles for objects, resolving a namespace name should give you the namespace itself.
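A minimal sketch of that indirection, assuming a catalog is nothing more than a lookup table. The namespace names, the `CATALOG` table and the `resolve` function are all made up for illustration :

```python
# Hypothetical catalog: maps a namespace name (URL or not) to the
# namespace itself, modelled here as a plain collection of names.
CATALOG = {
    "http://example.org/ns/invoice": frozenset({"invoice", "line", "total"}),
    "urn:example:address": frozenset({"address", "street", "city"}),
}


def resolve(namespace_name):
    """Return the collection of names identified by namespace_name."""
    try:
        return CATALOG[namespace_name]
    except KeyError:
        raise LookupError(f"no catalog entry for {namespace_name!r}") from None


# A URN cannot be typed into a browser, but the catalog resolves it
# exactly like a URL.
print(sorted(resolve("urn:example:address")))
```

The point of the sketch is only that the lookup key does not need to be a location : a URN and a URL are treated the same way.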

What is a namespace ?

So what is a namespace ? A namespace is a namespace is a namespace. It is a mathematical object. As the W3C says :

From http://www.w3.org/TR/REC-xml-names/:

[Definition:] An XML namespace is a collection of names, identified by a URI reference [RFC2396], which are used in XML documents as element types and attribute names. XML namespaces differ from the “namespaces” conventionally used in computing disciplines in that the XML version has internal structure and is not, mathematically speaking, a set. These issues are discussed in “A. The Internal Structure of XML Namespaces”.

So a namespace is a collection of names.

First attempt at answering the riddle

What should I obtain when resolving a namespace name ? A collection of names.

What should I obtain when resolving a namespace name ? NOT human-readable documentation. An XML namespace is NOT human-readable documentation, it is a collection of names.

What can I do with a collection of names ? Not much. I can enumerate the collection, and I can check if a name belongs to the collection or not. That’s a bit useful, though, because I can check if a name is valid when encountering it, and signal invalid names (possibly typos).
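Here is a small sketch of that check, assuming the collection of names has already been obtained somehow. The namespace name, the names in it and the `unknown_names` helper are hypothetical :

```python
import xml.etree.ElementTree as ET

# Assumed namespace name and names, for illustration only.
NS = "http://example.org/ns/invoice"
NAMES = frozenset({"invoice", "line", "total"})


def unknown_names(xml_text, namespace_name, names):
    """List element names that claim to be in the namespace but are not in the collection."""
    prefix = "{" + namespace_name + "}"
    unknown = []
    for element in ET.fromstring(xml_text).iter():
        tag = element.tag
        if isinstance(tag, str) and tag.startswith(prefix):
            local_name = tag[len(prefix):]
            if local_name not in names:
                unknown.append(local_name)
    return unknown


DOCUMENT = """
<invoice xmlns="http://example.org/ns/invoice">
  <line>2 widgets</line>
  <totla>20</totla>
</invoice>
"""

print(unknown_names(DOCUMENT, NS, NAMES))  # reports the typo 'totla'
```

Note that this only checks membership. It says nothing about where a name may appear, which is the business of a schema.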

Maybe we could add a few useful things to this collection of names. Let’s think about it from the start.

An interpretation of XML and namespaces

What is an XML document ? It is a particular serialisation of a labeled tree whose labels are names that belong to namespaces and whose leaves are text. The XML 1.0 specification describes the rules for serialising such a tree. When a document correctly follows those rules, we say it is well-formed.
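To make this view concrete, here is a sketch that rebuilds the labeled tree from a serialisation and lists each label as a (namespace name, local name) pair. The document and its `urn:example:` namespace names are made up, and `labels` is a hypothetical helper :

```python
import xml.etree.ElementTree as ET


def labels(element, depth=0):
    """Yield (depth, namespace name, local name, text) for each node of the tree."""
    if element.tag.startswith("{"):
        namespace_name, local_name = element.tag[1:].split("}", 1)
    else:
        namespace_name, local_name = None, element.tag  # no namespace declared
    text = (element.text or "").strip()
    yield depth, namespace_name, local_name, text
    for child in element:
        yield from labels(child, depth + 1)


# Hypothetical document mixing two made-up namespaces.
DOCUMENT = (
    '<a:order xmlns:a="urn:example:a" xmlns:b="urn:example:b">'
    "<b:item>widget</b:item><note>fragile</note>"
    "</a:order>"
)

for depth, namespace_name, local_name, text in labels(ET.fromstring(DOCUMENT)):
    print("  " * depth, namespace_name, local_name, repr(text))

# A serialisation that breaks the XML 1.0 rules cannot be turned into a tree.
try:
    ET.fromstring("<a><b></a>")
except ET.ParseError as error:
    print("not well-formed:", error)
```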

There are some names in some XML documents that don’t seem to belong to any namespace, either because they were created before the namespace concept appeared, or because their author did not find it important to create a namespace for them. Let us say that they are in a special namespace.

What should I obtain when resolving this special namespace name ? The collection of all names that are not explicitly defined in a namespace. How big is that collection ? I don’t know. Who maintains that collection ? Nobody, or everybody, depending on how you see things. But does this collection exist ? Yes, but I’d rather never have to enumerate it. I can check whether a name belongs to it or not, but that is not of great help, since it does not allow me to validate names.

This collection is fundamentally open : it is a mathematical object, but it is not interesting for data processing. What is this namespace’s name ? It has none. Well, it had better not have one, for fear that I would have to answer the riddle for it : “What should I obtain when resolving this special namespace name ?”.

An interpretation of schemas

Definition of a schema

What is a schema ? A schema is a set of constraints on the structure of a labeled tree. Well-formedness makes sure that the format of an XML document follows the rules of XML 1.0, so that I can build a labeled tree from it. Then, this labeled tree is said to be valid according to a particular schema if it follows the constraints of the schema. Well-formedness is a set of constraints on the serialised form, while validity is a set of constraints on the labeled tree.

What kind of constraints do we have in a schema ? I can see structural constraints, that enforce a particular structure of the tree. Such constraints force certain names to appear or not depending on the presence of other names as parent or siblings. I can also see content constraints, that enforce a particular format of some leaves of the labeled tree that are text nodes.
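As a rough sketch, a schema can be pictured as a set of predicates over the labeled tree. The element names (`invoice`, `line`, `amount`), the amount format and the function names below are assumptions chosen for the example, not part of any real schema language :

```python
import re
import xml.etree.ElementTree as ET

AMOUNT_FORMAT = re.compile(r"\d+(\.\d{2})?")


def structure_is_valid(root):
    """Structural constraint: an 'invoice' holds one or more 'line' elements,
    and each 'line' holds exactly one 'amount'."""
    if root.tag != "invoice":
        return False
    lines = list(root)
    if not lines:
        return False
    return all(
        line.tag == "line" and [child.tag for child in line] == ["amount"]
        for line in lines
    )


def content_is_valid(root):
    """Content constraint: the text of every 'amount' leaf looks like 12 or 12.50."""
    return all(
        AMOUNT_FORMAT.fullmatch((amount.text or "").strip())
        for amount in root.iter("amount")
    )


def is_valid(xml_text):
    root = ET.fromstring(xml_text)  # raises ParseError if not well-formed
    return structure_is_valid(root) and content_is_valid(root)


print(is_valid("<invoice><line><amount>12.50</amount></line></invoice>"))
print(is_valid("<invoice><line><amount>twelve</amount></line></invoice>"))
print(is_valid("<invoice><amount>12.50</amount></invoice>"))
```

All three documents are well-formed. Only the first one is valid : the second breaks the content constraint and the third breaks the structural one.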

Is there a relation between namespaces and schemas ? Not really. The structural constraints of a schema are not bound to a particular namespace. I can say that after a ‘foo’ from namespace A must come a ‘bar’ from namespace B. I can even say that after a ‘foo’ from namespace A comes a ‘foo’ from namespace B, or from the namespace-with-no-name. Some schemas can leverage the fact that a namespace is a collection of names to say that after a ‘foo’ from namespace A must come any name from namespace B (one should avoid using the namespace-with-no-name here, since its collection of names is open).
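The two kinds of rule described above can be sketched as follows. The namespace names `urn:example:a` and `urn:example:b` and the `follows_rule` function are invented for the example :

```python
import xml.etree.ElementTree as ET

NS_A = "urn:example:a"  # made-up namespace names
NS_B = "urn:example:b"


def follows_rule(parent, any_name_from_b=False):
    """Check that every {A}foo child is immediately followed by {B}bar
    (or, if any_name_from_b is set, by any element from namespace B)."""
    children = list(parent)
    for index, child in enumerate(children):
        if child.tag != f"{{{NS_A}}}foo":
            continue
        if index + 1 == len(children):
            return False  # nothing follows the 'foo'
        next_tag = children[index + 1].tag
        if any_name_from_b:
            if not next_tag.startswith(f"{{{NS_B}}}"):
                return False
        elif next_tag != f"{{{NS_B}}}bar":
            return False
    return True


HEADER = '<r xmlns:a="urn:example:a" xmlns:b="urn:example:b">'
with_bar = ET.fromstring(HEADER + "<a:foo/><b:bar/></r>")
with_baz = ET.fromstring(HEADER + "<a:foo/><b:baz/></r>")

print(follows_rule(with_bar))                        # strict rule holds
print(follows_rule(with_baz))                        # strict rule fails
print(follows_rule(with_baz, any_name_from_b=True))  # relaxed rule holds
```

The constraint lives entirely in the checking code. Neither namespace needs to know that the other exists.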

Nothing forces me to have a schema which only manipulates names from the same namespace. Indeed, namespaces were built to allow authors to mix names from different namespaces in the same document.

The purposes of schemas

What is the purpose of a schema ? An obvious purpose is the validation of documents. But why is it important to validate documents ? By enforcing structural constraints, schemas create patterns of names within the documents. In human languages, specific patterns of words create semantics that mean more than those individual words. Likewise, name patterns are a way to create new semantics that are more than the set of names they contain. A schema makes sure that a labeled tree contains those patterns, so validating a document makes sure that the document does have the expected semantics.

At this point, one must be careful. A document cannot have intrinsic semantics. English words in a book are nothing more than ink patterns on a sheet of paper. Even a computer can build a series of words from those ink patterns, and still this series of words doesn’t have any meaning for the computer. The series of words itself has no meaning. It is the human brain, fed by the series of words, that makes sense of the book.

There is no direct semantic communication between the writer and the reader of a book. The writer produces an image of his or her ideas, in a series of words. The reader reads those words and produces a mental image of what the writer wants to say. Even if the transmission of the series of words is 100% accurate, there is no guarantee that the mental image of the reader will be the same as the mental image of the writer, quite the contrary in fact. This is true for any communication between humans, since all means of communication require a “serialisation” of thoughts. Until we discover telepathy, I mean :).

Likewise, an XML document has no intrinsic semantics. It is a way for two systems to exchange some information. The semantics are known on both sides, but the document itself has no semantics. It is the code that runs on both sides that creates the semantics out of the document. Therefore, validation is a way to make sure that the information coming in or going out is correctly structured. It is a safety check on the semantics of the document.

Operational schemas

There is another kind of schema which is less obvious. When writing a program that processes an XML document, one has to follow the same element patterns that were used to convey the original semantics through the document. If the document author built some specific element patterns, one cannot process the XML document at the element level or with different element patterns. Well, nothing prevents you from doing this, but in that case, either :

you’re building a tool, like a gateway, that does not need to extract semantics from the document, and just does some general processing before passing the document to another program. The document can go from gateway to gateway until it eventually reaches a program for which the semantics matter.
or you’re going to obtain semantics that have nothing to do with the original ones. This is a guarantee that your program will fail, or worse, produce bad results. Preventing code from producing bad results is the main objective of validation. Validation can’t prevent programs from failing ; indeed, it will provoke a program failure if some unexpected data is encountered.

If we’re not in the tool scenario, and we want to process the document following the semantics it was built to convey, we’ll have to follow the element patterns that the semantic context has defined. Even if there is no available schema, for example because of an informal agreement between the producer and the consumer of the document, the patterns are written inside the code that processes the document. This set of hard-wired patterns defines an implicit schema that we could call an operational schema.

Operational schemas and validation schemas are not necessarily identical. Some programs may work on a part of the document only, not the entire document. The operational schema then contains only the patterns that are needed, not the full set of patterns that are defined in the validation schema. Of course, there is also the aforementioned case where there is no validation schema. An operational schema exists as soon as some code is written. A validation schema has to be explicitly built.
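A small sketch of what an operational schema looks like in practice. The order document and the `total_quantity` function are invented ; the point is that the element pattern is hard-wired in the code and covers only a part of the document :

```python
import xml.etree.ElementTree as ET

# Hypothetical document; no validation schema is assumed to exist for it.
ORDER = """
<order>
  <customer><name>Ada</name><email>ada@example.org</email></customer>
  <items>
    <item><sku>A1</sku><quantity>2</quantity></item>
    <item><sku>B7</sku><quantity>1</quantity></item>
  </items>
</order>
"""


def total_quantity(xml_text):
    """Sum the quantities of an order.

    The path below is the operational schema: this code silently assumes
    that 'quantity' sits inside 'item' inside 'items' under the root,
    and it knows nothing about 'customer'.
    """
    root = ET.fromstring(xml_text)
    return sum(int(q.text) for q in root.findall("./items/item/quantity"))


print(total_quantity(ORDER))

# A well-formed document built with a different pattern does not make the
# program fail: it quietly yields a wrong result instead.
print(total_quantity("<order><item><quantity>5</quantity></item></order>"))
```

The second call is the "bad results" case : without validation, nothing signals that the document does not follow the pattern the code expects.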

Relations between schemas and namespaces

We have seen that a schema is a way to specify the structure of elements within an XML document. Those elements have names that can reside in different namespaces. Thus, a schema is linked to the set of namespaces it uses.

Conversely, should I associate namespaces with the set of schemas that use them ? That is a bad idea. There is no sound relationship possible between a collection of names and a set of patterns built with these names. The schema does depend on one or more namespaces because it defines structural constraints based on the names they contain. But the namespaces do not depend on the schema. Adding such a relation would be a burden for the owner of the namespace, because it would force him or her to keep an eye on every possible usage of names from that namespace.

Should I use the namespace name as a way to find a schema ? That too is a bad idea. The semantics of a document come from the schema, not from the namespaces of the elements it uses, so one should never use the namespace to identify a particular set of semantics. One should use another method.

Obtaining meta-data from a document

When I receive an XML document, how do I know its type, which schemas I can use for validation, etc. ?

Current mechanisms for DTDs and XML Schema

XML 1.0 has a mechanism by which a document can be associated with a schema (in DTD format) : the DOCTYPE declaration. There are two types of DOCTYPE declaration : PUBLIC and SYSTEM. In both declarations, a (possibly relative) URI is provided so that the DTD can be fetched directly by the validator. The PUBLIC declaration also provides a public identifier, a location-independent name which can serve as a key to locate the DTD in a local cache.
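A sketch of how a program might use both identifiers, assuming a local cache keyed by public identifier. The `LOCAL_DTD_CACHE` table, the local path in it and the `locate_dtd` function are hypothetical ; `publicId` and `systemId` are the attributes that Python’s `xml.dom.minidom` exposes on the document type node :

```python
from xml.dom import minidom

# Hypothetical local cache keyed by public identifier; the path is made up.
LOCAL_DTD_CACHE = {
    "-//W3C//DTD XHTML 1.0 Strict//EN": "/usr/share/dtd/xhtml1-strict.dtd",
}


def locate_dtd(xml_text):
    """Return where a validator could read the DTD from, or None if there is no DOCTYPE."""
    doctype = minidom.parseString(xml_text).doctype
    if doctype is None:
        return None
    # Prefer the local copy when the public identifier is known,
    # otherwise fall back to the system identifier (a URI).
    return LOCAL_DTD_CACHE.get(doctype.publicId, doctype.systemId)


PUBLIC_DOC = (
    '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" '
    '"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd"><html/>'
)
SYSTEM_DOC = '<!DOCTYPE order SYSTEM "order.dtd"><order/>'

print(locate_dtd(PUBLIC_DOC))  # found in the local cache
print(locate_dtd(SYSTEM_DOC))  # only the system identifier is available
```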

The XML Schema specification describes another system. The root element of the document can carry a special attribute whose name is defined in the XML Schema instance namespace. The value of this attribute contains a URI which is used to fetch the schema. Note that there is no public identifier, so that if caching occurs, it directly uses the URI as a key (thus, XML Schema caching is quite like standard HTTP caching).
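As an illustration, here is a sketch that reads those hints from the root element. The attributes `schemaLocation` and `noNamespaceSchemaLocation` in the `http://www.w3.org/2001/XMLSchema-instance` namespace come from the XML Schema specification ; the sample document, its URIs and the `schema_hints` function are made up :

```python
import xml.etree.ElementTree as ET

XSI = "http://www.w3.org/2001/XMLSchema-instance"


def schema_hints(xml_text):
    """Return the schema URIs announced on the root element."""
    root = ET.fromstring(xml_text)
    uris = []
    no_namespace = root.get(f"{{{XSI}}}noNamespaceSchemaLocation")
    if no_namespace:
        uris.append(no_namespace)
    # schemaLocation holds whitespace-separated pairs: namespace name, schema URI.
    pairs = root.get(f"{{{XSI}}}schemaLocation", "").split()
    uris.extend(pairs[1::2])
    return uris


DOCUMENT = (
    '<order xmlns="urn:example:order" '
    'xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" '
    'xsi:schemaLocation="urn:example:order http://example.org/order.xsd"/>'
)

print(schema_hints(DOCUMENT))
```

The only key available for caching here is the schema URI itself, which is the point made above.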

Lastly, there is another mechanism used to bind resources to an XML document. It is used for associating a stylesheet with an XML document. This is done by inserting a Processing Instruction (PI) at the beginning of the document.
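A sketch of how a consumer can find such a PI, using Python’s `xml.dom.minidom`. The sample document, the `order.xsl` stylesheet name and the `stylesheet_instructions` function are invented for the example :

```python
from xml.dom import minidom


def stylesheet_instructions(xml_text):
    """Return the data of every xml-stylesheet PI found at the top level of the document."""
    document = minidom.parseString(xml_text)
    return [
        node.data
        for node in document.childNodes
        if node.nodeType == node.PROCESSING_INSTRUCTION_NODE
        and node.target == "xml-stylesheet"
    ]


DOCUMENT = (
    '<?xml-stylesheet type="text/xsl" href="order.xsl"?>'
    "<order/>"
)

print(stylesheet_instructions(DOCUMENT))
```

The PI data is an opaque string as far as XML 1.0 is concerned. Extracting the `href` from it is left to the application.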

The problem with those mechanisms

readability of the documents : the beginning of the document (up to its root element) is more and more crowded with references to meta-data.
maintenance : what if I want to add a schema written in a new schema language ? Do I have to edit all the documents I produced according to the desiderata of the schema language (adding PIs or special attributes) ? What if I have a new stylesheet that can transform all documents into another format ? Do I have to specify the existence of this stylesheet in each and every document ?

The meta-data resource directory solution

{description of the purpose of a meta-data resource directory}

{where should the meta-data resource directory reside ?}

{how to associate an XML document to a given meta-data RD ? Using a PI ? }

Document types and element types

{What is a document type ?}

{What is an element type ?}

{How should I associate meta-data to document types and element types ?}

A proposal for an answer to the riddle : the Namespace Description Language

What should I obtain when resolving a namespace name ? A Namespace Description Language document. This is NOT a collection of names, but a program can build a collection of names from it.

Apart from the different names appearing in the namespace, what information could the NDL document provide ? Well, I’m not sure yet, but I think that for each name which corresponds to an element name, there should be the definition of the corresponding element type URI. This URI would enable us to fetch a meta-data RD and get some associated resources, like human-readable documentation of the general meaning (i.e. outside of any pattern) of the tag.
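Since the NDL format is not defined here, the following is only a guess at the information it would carry, written as an in-memory structure rather than as a document syntax. Every class, field and URI below is hypothetical :

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class NameEntry:
    """One name of the namespace (hypothetical structure)."""
    local_name: str
    element_type_uri: Optional[str] = None  # only for names used as element names


@dataclass
class NamespaceDescription:
    """In-memory picture of an NDL document (hypothetical structure)."""
    namespace_name: str
    entries: list = field(default_factory=list)

    def names(self):
        """Build the collection of names from the description."""
        return frozenset(entry.local_name for entry in self.entries)

    def element_type_uri(self, local_name):
        """Return the element type URI of a name, or None if it has none."""
        for entry in self.entries:
            if entry.local_name == local_name:
                return entry.element_type_uri
        return None


description = NamespaceDescription(
    "urn:example:order",
    [
        NameEntry("order", "http://example.org/types/order"),
        NameEntry("currency"),  # e.g. an attribute name: no element type
    ],
)

print(sorted(description.names()))
print(description.element_type_uri("order"))
```

The description is not the collection of names, but `names()` shows how a program could derive the collection from it.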

What about the human-readable documentation of the meaning of a tag inside a pattern ? It has to be specified along with the schema that defines the pattern, as a meta-data resource of the schema.

{…}

Acknowledgements

This is the result of my participation in a series of threads on the XML-DEV mailing list. Thanks go to all the people on this list who defended their position and animated the debate. Amongst those people, I’d like to thank : Jonathan Borden and Tim Bray (authors of the RDDL specification), Len Bullard, Leigh Dodds, Dare Obasanjo, Simon St.Laurent and Paul Tchistopolskii.