xml

Appendix A: XML

A1. What is XML

XML, the Extensible Markup Language, is an Internet-friendly format for data and documents, invented by the World Wide Web Consortium (W3C).
Markup denotes a way of expressing the structure of a document within the document itself.
XML has its roots in a markup language called SGML (Standard Generalized Markup Language), which is used in publishing; XML shares this heritage with HTML.
XML was created to do for machine-readable documents on the Web what HTML did for human-readable documents—that is, provide a commonly agreed-upon syntax so that processing the underlying format becomes a commodity and documents are made accessible to all users.

XML is a subset of SGML that is used to define other markup languages, including SVG.
HTML is an SGML application.
SVG is an XML application.

A2. Anatomy of an XML Document

A2.0 Example

Example of a simple XML authors document:

    <?xml version="1.0" encoding="us-ascii"?>
    <authors>
        <person id="lear">
            <name>Edward Lear</name>
            <nationality>British</nationality>
        </person>
        <person id="asimov">
            <name>Isaac Asimov</name>
            <nationality>American</nationality>
        </person>
        <person id="mysteryperson"/>
    </authors>

The first line in an XML document is the XML declaration, for example:
- example 1: <?xml version="1.0" encoding="us-ascii"?>
- example 2: <?xml version="1.0" encoding="utf-8"?>

Elements and Attributes

An element consists of a start tag, its content, and a matching end tag: <tag>content</tag>
An XML document must contain exactly one root element, which contains all other content within the document.
The name of the root element defines the type of the XML document.
Elements that contain both text and other elements simultaneously are classified as mixed content.
The SVG <text> element is such an element; it can contain text and <tspan> elements.
The sample document uses the XML shortcut syntax for an empty element:
<person id="mysteryperson"/>
The equivalent long form is:
<person id="mysteryperson"></person>
The sample authors document uses elements named person to describe the authors themselves.
Each person element has an attribute named id.
- Attributes can contain only textual content.
- Their values must be surrounded by quotes.
  Either single quotes (') or double quotes (") may be used, as long as you use the same kind of closing quote as the opening one.
- It does not matter in what order attributes are presented in the element start tag.
Within XML documents, attributes are frequently used for metadata (i.e., data about data), describing properties of the element's contents.
This is the case in our example, where id contains a unique identifier for the person being described.
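
The points above about the root element, attributes, and the empty-element shortcut can be checked with a short script. This is a minimal sketch using Python's standard xml.etree.ElementTree module; the variable names are illustrative, and the XML declaration is omitted from the sample only to keep the string literal simple.

```python
import xml.etree.ElementTree as ET

# The sample authors document from this section (XML declaration omitted).
AUTHORS_XML = """<authors>
    <person id="lear">
        <name>Edward Lear</name>
        <nationality>British</nationality>
    </person>
    <person id="asimov">
        <name>Isaac Asimov</name>
        <nationality>American</nationality>
    </person>
    <person id="mysteryperson"/>
</authors>"""

root = ET.fromstring(AUTHORS_XML)

# The single root element contains everything else in the document.
root_name = root.tag

# Attribute values are always plain text; here we collect every id.
person_ids = [person.get("id") for person in root.findall("person")]

# The shortcut form and the explicit start/end pair describe the same element.
short_form = ET.fromstring('<person id="mysteryperson"/>')
long_form = ET.fromstring('<person id="mysteryperson"></person>')
same_element = (
    short_form.tag == long_form.tag
    and short_form.attrib == long_form.attrib
    and len(short_form) == len(long_form) == 0
)

print(root_name, person_ids, same_element)
```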

A2.1 Name Syntax

XML 1.0 has certain rules about element and attribute names. In particular:

- Names are case-sensitive: e.g., <person/> is not the same as <Person/>.
- Names beginning with xml (in any permutation of uppercase or lowercase) are reserved for use by XML 1.0 and its companion specifications.
- A name must start with a letter or an underscore, not a digit, and may continue with any letter, digit, underscore, hyphen, or period.

A precise description of names can be found in section 2.3 of the XML 1.0 specification. The rules for names in XML 1.1 are slightly different, primarily with regard to Unicode characters. SVG uses the XML 1.0 rules.
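
As a rough illustration of these rules, the sketch below checks names restricted to ASCII. It is a deliberate simplification: the real XML 1.0 grammar also admits many Unicode letters and the colon (used for namespace prefixes), and the function names here are hypothetical.

```python
import re

# Simplified, ASCII-only approximation of an XML 1.0 name:
# first character is a letter or underscore, the rest may also
# include digits, hyphens, and periods.
_ASCII_NAME = re.compile(r"[A-Za-z_][A-Za-z0-9_.\-]*")


def is_plain_ascii_name(name):
    """Return True if name fits the simplified ASCII-only name rule."""
    return _ASCII_NAME.fullmatch(name) is not None


def is_reserved_name(name):
    """Return True if name starts with 'xml' in any mix of upper and lower case."""
    return name.lower().startswith("xml")


print(is_plain_ascii_name("person"), is_plain_ascii_name("2nd"), is_reserved_name("XmlThing"))
```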

A2.2 Well-Formed

An XML document that conforms to the rules of XML syntax is known as well-formed. At its most basic level, well-formedness means that elements should be properly matched, and all opened elements should be closed.
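
One simple way to test well-formedness is to hand the text to a parser and see whether it reports a syntax error. The following is a minimal sketch with Python's standard library; is_well_formed is a hypothetical helper name, not part of any API.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text):
    """Return True if text parses as XML, False if the parser reports a syntax error."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        return False
    return True


print(is_well_formed("<a><b/></a>"))   # properly nested
print(is_well_formed("<a><b></a>"))    # <b> is never closed
```

Note that this checks syntax only; it says nothing about whether the document uses the element names a particular application expects.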

A2.3 Comments

Comments in XML are similar to those in HTML. A comment begins with <!-- and ends with -->.
The <desc> and <title> elements in SVG obviate much of the need for comments.

A2.4 Entity References

Another feature of XML that is occasionally useful when writing SVG documents is the mechanism for escaping characters. Because some characters have special significance in XML, there needs to be a way to represent them. For example, in some cases the < symbol might really be intended to mean less than rather than to signal the start of an element name.

Literal character    Entity reference
<                    &lt;
>                    &gt;
'                    &apos;
"                    &quot;
&                    &amp;
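
In practice, escaping is usually left to a library rather than done by hand. The sketch below uses escape and unescape from Python's standard xml.sax.saxutils module; by default they handle only &, <, and >, so the quote mappings are passed in explicitly. The helper names escape_all and unescape_all are illustrative.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, unescape

# escape() covers &, < and > on its own; quotes need an explicit mapping.
QUOTE_ENTITIES = {'"': "&quot;", "'": "&apos;"}
QUOTE_CHARS = {"&quot;": '"', "&apos;": "'"}


def escape_all(text):
    """Replace all five special characters with their entity references."""
    return escape(text, QUOTE_ENTITIES)


def unescape_all(text):
    """Turn the five predefined entity references back into literal characters."""
    return unescape(text, QUOTE_CHARS)


escaped = escape_all('a < b & "c"')
round_trip = unescape_all(escaped)

# A parser resolves entity references automatically when reading content.
parsed_text = ET.fromstring("<expr>1 &lt; 2</expr>").text

print(escaped, round_trip, parsed_text)
```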

A2.5 Character References

You are likely to find character references in the context of SVG documents.
Character references allow you to denote a character by its numeric position in the Unicode character set
(this position is known as its code point). The table below contains a few examples that illustrate the syntax.

Actual character    Character reference
1                   &#49; (decimal)
A                   &#65; (decimal)
Ñ                   &#xD1; (hexadecimal)
®                   &#xAE; (hexadecimal)

Note that the code point can be expressed in decimal or, with the use of x as a prefix, in hexadecimal.
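
The relationship between a character, its code point, and the two reference forms can be sketched in a few lines of Python. The helper name char_refs is hypothetical; ord and the parser behavior are standard.

```python
import xml.etree.ElementTree as ET


def char_refs(ch):
    """Return the decimal and hexadecimal character references for one character."""
    code_point = ord(ch)
    return f"&#{code_point};", f"&#x{code_point:X};"


# U+00D1 is the letter N with tilde used in the table above.
decimal_ref, hex_ref = char_refs("\u00d1")

# A parser turns character references back into the characters themselves.
decoded = ET.fromstring("<t>&#49;&#65;&#xD1;&#xAE;</t>").text

print(decimal_ref, hex_ref, decoded)
```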

A3. Character Encoding

XML, designed to be an Internet-friendly syntax for information exchange, has internationalization at its very core. One of the basic requirements for XML processors is that they support the Unicode standard character encoding. Unicode attempts to include the requirements of all the world's languages within one character set. Consequently, it is very large!

A3.1 Unicode Encoding Schemes

Unicode 3.0 has over 57,700 code points, each of which corresponds to a character (as of 2026, the current version is Unicode 17).
If one were to express a Unicode string by using the position of each character in the character set as its encoding (in the same way as ASCII does), expressing the whole range of characters would require four octets for each character. Clearly, if a document is written in 100 percent American English, it will be four times larger than required, because all the characters in ASCII fit into a 7-bit representation. This places a strain both on storage space and on memory requirements for processing applications.

Fortunately, two encoding schemes for Unicode alleviate this problem: UTF-8 and UTF-16. As you might guess from their names, applications can process documents in these encodings in 8- or 16-bit segments at a time. When code points are required in a document that cannot be represented by one chunk, a bit pattern is used that indicates that the following chunk is required to calculate the desired code point. In UTF-8, this is denoted by the most significant bit of the first octet being set to 1.

This scheme means that UTF-8 is a highly efficient encoding for representing languages using Latin alphabets, such as English.
All of the ASCII character set is represented natively in UTF-8: an ASCII-only document and its equivalent in UTF-8 are byte-for-byte identical.

This knowledge will also help you debug encoding errors. One frequent error arises because of the fact that ASCII is a proper subset of UTF-8. Programmers get used to this fact and produce UTF-8 documents, but use them as if they were ASCII. Things start to go awry when the XML parser processes a document containing, for example, characters such as Á. Because this character cannot be represented using only one octet in UTF-8, this produces a two-octet sequence in the output document; in a non-Unicode viewer or text editor, it looks like a couple of characters of garbage.
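
The size and garbling effects described in this section are easy to reproduce. The sketch below uses only Python's built-in string encoding; Latin-1 stands in for "a non-Unicode viewer", which is an assumption about what such a viewer might do with the bytes.

```python
ascii_text = "Edward Lear"

# An ASCII-only string has exactly the same bytes in ASCII and in UTF-8.
same_bytes = ascii_text.encode("ascii") == ascii_text.encode("utf-8")

# A fixed four-octet encoding (UTF-32) makes the same text four times larger.
utf32_size = len(ascii_text.encode("utf-32-be"))
utf8_size = len(ascii_text.encode("utf-8"))

# U+00C1 (capital A with acute) does not fit in one octet in UTF-8.
accented = "\u00c1"
utf8_bytes = accented.encode("utf-8")

# The most significant bit of the first octet is set for multi-octet sequences.
lead_bit_set = bool(utf8_bytes[0] & 0x80)

# Reading those two octets as a single-byte encoding yields two garbage characters.
garbled = utf8_bytes.decode("latin-1")

print(same_bytes, utf8_size, utf32_size, utf8_bytes, lead_bit_set, len(garbled))
```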

A3.2 Other Character Encodings

Unicode, in the context of computing history, is a relatively new invention. Native operating system support for Unicode is by no means universal. For instance, older systems like Windows 95 and 98 do not have it.

XML 1.0 allows a document to be encoded in any character set registered with the Internet Assigned Numbers Authority (IANA). European documents are commonly encoded in one of the ISO Latin character sets, such as ISO-8859-1. Japanese documents commonly use Shift-JIS, and Chinese documents use GB2312 and Big 5.

A full list of registered character sets is maintained by IANA.

XML processors are not required by the XML 1.0 specification to support any more than UTF-8 and UTF-16, but most commonly support other encodings, such as US-ASCII and ISO-8859-1. Although most SVG transactions are currently conducted in ASCII (or the ASCII subset of UTF-8), there is nothing to stop SVG documents from containing, say, Korean text. You will, however, probably have to dig into the encoding support of your computing platform to find out if it is possible for you to use alternative encodings.

A4. Validity

A4.1 Why Another Level of Verification

In addition to well-formedness, XML 1.0 offers another level of verification, called validity. To explain why validity is important, let's take a simple example. Imagine you invented a simple XML format for your friends’ telephone numbers:

<phonebook>
    <person>
        <name>Albert Smith</name>
        <number>123-456-7890</number>
    </person>
    <person>
        <name>Bertrand Jones</name>
        <number>456-123-9876</number>
    </person>
</phonebook>

Based on your format, you also construct a program to display and search your phone numbers. This program turns out to be so useful, you share it with your friends. However, your friends aren't so hot on detail as you are, and try to feed your program this phone book file with a <phone> element instead of a <number> element:

<phonebook>
    <person>
        <name>Melanie Green</name>
        <phone>123-456-7893</phone>
    </person>
</phonebook>

Note that, although this file is perfectly well-formed, it doesn't fit the format you prescribed for the phone book, and you find you need to change your program to cope with this situation. If your friends had used number as you did to denote the phone number, and not phone, there wouldn't have been a problem. However, as it is, this second file is not a valid phone book document.

For validity to be a useful general concept, you need a machine-readable way of saying what a valid document is; that is, which elements and attributes must be present and in what order.
XML 1.0 achieves this by introducing document type definitions (DTDs).
For the purposes of SVG, you don't need to know much about DTDs. Rest assured that SVG does have a DTD, and it spells out in detail exactly which combinations of elements and attributes make up a valid document.
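
To make the phone book rule concrete, the sketch below spells it out by hand in Python. This is not DTD validation (Python's standard library does not include a validating parser); it is a hand-written stand-in for the kind of constraint a DTD states declaratively, and is_valid_phonebook is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

GOOD_BOOK = """<phonebook>
    <person>
        <name>Albert Smith</name>
        <number>123-456-7890</number>
    </person>
</phonebook>"""

BAD_BOOK = """<phonebook>
    <person>
        <name>Melanie Green</name>
        <phone>123-456-7893</phone>
    </person>
</phonebook>"""


def is_valid_phonebook(text):
    """Check that the root is phonebook and each person holds name then number."""
    root = ET.fromstring(text)  # raises ParseError if not well-formed
    if root.tag != "phonebook":
        return False
    for person in root:
        if person.tag != "person":
            return False
        if [child.tag for child in person] != ["name", "number"]:
            return False
    return True


print(is_valid_phonebook(GOOD_BOOK), is_valid_phonebook(BAD_BOOK))
```

Both sample documents parse without error, which shows that well-formedness alone does not catch the <phone> mistake.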

A4.2 Document Type Definitions (DTDs)

The purpose of a DTD is to express the allowed elements and attributes in a certain document type and to constrain the order in which they must appear within that document type. A DTD contains declarations defining the element types and attribute lists.
A DTD may span more than one file, and the SVG 1.1 specification uses a modularized DTD spread over more than a dozen files.
However, the mechanism for including one file inside another—parameter entities—is outside the scope of this book.

It is common to mistakenly conflate elements and element types.
The distinction is that an element is the actual instance of the structure as found in an XML document, whereas the element type is the kind of element that the instance is.

A4.3 Putting It Together

What is important to you is knowing how to link a document to its defining DTD. This is done with a document type declaration <!DOCTYPE …>, inserted at the beginning of the XML document, after the XML declaration in our fictitious example:

<?xml version="1.0" encoding="us-ascii"?>
<!DOCTYPE authors SYSTEM "http://example.com/authors.dtd">
<authors>
    <person id="lear">
        <name>Edward Lear</name>
        <nationality>British</nationality>
    </person>
    <person id="asimov">
        <name>Isaac Asimov</name>
        <nationality>American</nationality>
    </person>
    <person id="mysteryperson"/>
</authors>

This example assumes the DTD file has been placed on a web server at example.com.
Note that the document type declaration specifies the root element of the document, not the DTD itself. You could use the same DTD to define person, name, or nationality as the root element of a valid document. Certain DTDs, such as the DocBook DTD for technical documentation, use this feature to good effect, allowing you to provide the same DTD for multiple document types.

A validating XML processor is obliged to check the input document against its DTD. If it does not validate, the document is rejected. To return to the phone book example, if your application validated its input files against a phone book DTD, you would have been spared the problems of debugging your program and correcting your friend's XML because your application would have rejected the document as being invalid. Some of the programs that read SVG files have a validating XML processor built into them to ensure they have valid input (and to keep you honest!).
The kinds of XML processors that are available are discussed in “Tools for Processing XML” in section A6.

A5. Namespaces

XML 1.0 lets developers create their own elements and attributes, but this leaves open the potential for overlapping names. For example, <title> may mean the name of a book in one context, but it may mean the prefix for a person's name (Ms., Dr., etc.) in a different context.

The Namespaces in XML specification provides a mechanism developers can use to identify particular vocabularies using Uniform Resource Identifiers (URIs).

SVG uses the URI http://www.w3.org/2000/svg for its namespace. The URI is just an identifier—opening that page in a web browser reveals some links to the SVG, XML 1.0, and Namespaces in XML specifications. Programs processing documents with multiple vocabularies can use the namespaces to figure out which vocabulary they are handling at any given point in a document.

SVG applies the namespace in the root element of SVG documents:

<svg xmlns="http://www.w3.org/2000/svg" width="100" height="100">
    ....
</svg>

The xmlns attribute, which defines the namespace, is actually provided as a default value by the SVG DTD. However, some browsers will not render an SVG document if you don't use the namespace explicitly. (If the namespace does appear, it must have the exact value shown earlier.) The namespace declaration applies to all of the elements contained by the element in which the declaration appears, including the containing element. This means that the element named svg is in the namespace http://www.w3.org/2000/svg.

SVG uses the “default namespace” for its content, using the SVG element names without any prefix. Namespaces can also be applied using prefixes, as shown here:

<svgns:svg xmlns:svgns="http://www.w3.org/2000/svg"
width="100" height="100">
    ....
</svgns:svg>

In this case, the namespace URI http://www.w3.org/2000/svg would apply to all elements using a prefix of svgns. Such documents won't validate against the SVG 1.0 DTD.
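
A namespace-aware parser treats the default and prefixed forms as the same vocabulary. The sketch below shows this with Python's standard xml.etree.ElementTree, which reports each element name as {namespace-URI}local-name. The nested rect element is added here purely for illustration.

```python
import xml.etree.ElementTree as ET

SVG_NS = "http://www.w3.org/2000/svg"

# Default namespace: element names carry no prefix.
default_form = ET.fromstring(
    f'<svg xmlns="{SVG_NS}" width="100" height="100"><rect/></svg>'
)

# Prefixed form: every SVG element is written with the svgns prefix.
prefixed_form = ET.fromstring(
    f'<svgns:svg xmlns:svgns="{SVG_NS}" width="100" height="100">'
    "<svgns:rect/></svgns:svg>"
)

# Both roots, and both children, resolve to the same qualified names.
root_names = (default_form.tag, prefixed_form.tag)
child_names = (default_form[0].tag, prefixed_form[0].tag)

# Unprefixed attributes are not placed in the default namespace.
width_value = default_form.get("width")

print(root_names, child_names, width_value)
```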

Namespaces are very simple on the surface but are a well-known field of combat in XML arcana. For more information on namespaces, see XML in a Nutshell or Learning XML (both O'Reilly).

A6. Tools for Processing XML

Many parsers exist for using XML with many different programming languages. Most are freely available, the majority being open source.

A6.1 Selecting a Parser

An XML parser typically takes the form of a library of code that you interface with your own program.
The SVG program hands the XML over to the parser, and it hands back information about the contents of the XML document.
Typically, parsers do this either via events or via a Document Object Model (DOM).

With event-based parsing, the parser calls a function in your program whenever a parse event is encountered. Parse events include things like finding the start of an element, the end of an element, or a comment. Most Java event-based parsers follow a standard API called SAX, which is also implemented for other languages such as Python and Perl.
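
Python's standard xml.sax module follows the SAX model, so it can illustrate the callback style described above. In this minimal sketch the handler class name ElementLogger and the tiny sample document are illustrative; the parser calls startElement and endElement as it meets each tag.

```python
import xml.sax


class ElementLogger(xml.sax.ContentHandler):
    """Record a start event and an end event for every element encountered."""

    def __init__(self):
        super().__init__()
        self.events = []

    def startElement(self, name, attrs):
        # attrs behaves like a read-only mapping of attribute names to values.
        self.events.append(("start", name, dict(attrs.items())))

    def endElement(self, name):
        self.events.append(("end", name))


handler = ElementLogger()
xml.sax.parseString(b'<authors><person id="lear"/></authors>', handler)

for event in handler.events:
    print(event)
```

The program never holds the whole document in memory; it only reacts to events as they arrive, which is why this style suits very large inputs.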

DOM-based parsers work in a markedly different way. They consume the entire XML input document and hand back a tree-like data structure that the SVG software can interrogate and alter. The DOM is a W3C standard with its own documentation.
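
For comparison, here is the same kind of document handled through a DOM. This sketch uses Python's standard xml.dom.minidom, a lightweight DOM implementation; the sample document and variable names are illustrative.

```python
from xml.dom import minidom

# The parser reads the whole document and returns a tree.
doc = minidom.parseString(
    '<authors><person id="lear"/><person id="asimov"/></authors>'
)

# Interrogate the tree: collect the id of every person element.
ids = [p.getAttribute("id") for p in doc.getElementsByTagName("person")]

# Alter the tree: append a new, empty person element to the root.
new_person = doc.createElement("person")
new_person.setAttribute("id", "mysteryperson")
doc.documentElement.appendChild(new_person)

person_count = len(doc.getElementsByTagName("person"))
serialized = doc.documentElement.toxml()

print(ids, person_count)
print(serialized)
```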

As XML matures, hybrid techniques that give the best of both worlds are emerging. If you're interested in finding out what's available and what's new for your favorite programming language, keep an eye on the following online sources:

XML.com Resource Guide: https://www.xml.com/pub/a/w3j/s3.walsh.html
Free XML Tools Guide: https://www.garshol.priv.no/download/xmltools/

Many XML applications involve transforming one XML document into another XML document or into HTML. The W3C has defined a special language called XSLT for doing transformations. XSLT processors are becoming available for all major programming platforms.

XSLT works by using a stylesheet, which contains templates that describe how to transform elements from an XML document. These templates typically specify what XML to output in response to a particular element or attribute. Using a W3C technology called XPath gives you the flexibility to say not only “do this for every person element,” but to give instructions as complex as “do this for the third person element whose name attribute is Fred.”
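
Python's standard library has no XSLT processor, but xml.etree.ElementTree supports a limited subset of XPath that is enough to illustrate the two selections just mentioned. In this sketch the sample document is invented, and the "third Fred" is picked by list index in Python rather than by a positional XPath predicate, because the ElementTree subset handles positions differently from full XPath.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data; the id attributes exist only to tell the elements apart.
people = ET.fromstring(
    """<people>
    <person name="Fred" id="a"/>
    <person name="Anna" id="b"/>
    <person name="Fred" id="c"/>
    <person name="Fred" id="d"/>
</people>"""
)

# "Every person element"
every_person = people.findall("person")

# "Person elements whose name attribute is Fred"
freds = people.findall("person[@name='Fred']")

# "The third person element whose name attribute is Fred"
third_fred = freds[2]

print(len(every_person), len(freds), third_fred.get("id"))
```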

Because of this flexibility, some applications have sprung up for XSLT that aren't really transformation applications at all, but take advantage of the ability to trigger actions on certain element patterns and sequences. Combined with XSLT's ability to execute custom code via extension functions, the XPath language has enabled applications such as document indexing to be driven by an XSLT processor. You can see a brief introduction to XSLT in Chapter 15.

The W3C specifications for XSLT can be found at https://www.w3.org/TR/xslt/
The W3C specifications for XPath can be found at https://www.w3.org/TR/xpath/