Comments, Characters, Entities, Notations

XML, Part 4

Version 1.2 (see note 1)

Comments

<!-- This is an XML comment.
     Use comments in XML as you would in any programming language:
     to make something clearer, or to "comment out" something.
-->
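To see what "commenting out" means to a parser, here is a minimal sketch using Python's standard xml.etree.ElementTree module. The <list> and <item> element names are made up for illustration. By default this parser simply discards comments, so the commented-out element never shows up.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: the second <item> is "commented out".
doc = """<list>
  <!-- This is an XML comment. -->
  <item>kept</item>
  <!-- <item>commented out</item> -->
</list>"""

root = ET.fromstring(doc)

# Only real child elements are visible; the comments are dropped.
items = [child.text for child in root]
print(items)
```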

XML Processors

Remember that an XML application is a language defined within the XML framework. Examples of XML applications include SVG, XHTML, and MathML. But the textbook's author sometimes uses the term to mean an application program which uses XML, such as Inkscape, a web browser visiting an XHTML page, or an equation editor. Such a program is technically called an XML processor, and I will use that term in these notes.

Character Data

We need to consider separately:

  1. Character encodings
  2. Ways to "input" special characters
  3. Unparsed character data (CDATA sections)

Character Encodings

The XML declaration may declare a character encoding. The default encoding is UTF-8, which is a way of encoding the Unicode character set:

<?xml version="1.0" encoding="UTF-8"?>

The text mentions (p. 73) UTF-16 as another option, but erroneously implies that it supports more characters. UTF-8 and UTF-16 both support the same character set. I believe other encodings can also be declared, though the text states that "XML applications" (meaning XML processors) are not required to support other encodings.
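A quick way to convince yourself that UTF-8 and UTF-16 cover the same character set is to encode the same string both ways and decode it again. This is only a sketch in Python; the sample string is my own choice, and only the byte sequences (and their lengths) differ between the two encodings.

```python
# Sample characters, including one outside the Basic Multilingual Plane.
sample = "←€Я✓🁊"

utf8_bytes = sample.encode("utf-8")
utf16_bytes = sample.encode("utf-16")

# Both encodings represent every character; decoding gives back the original.
round_trip_ok = (
    utf8_bytes.decode("utf-8") == sample
    and utf16_bytes.decode("utf-16") == sample
)
print(round_trip_ok)

# The encoded sizes differ, but the character set does not.
print(len(utf8_bytes), len(utf16_bytes))
```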

Special Characters

You know, those unusual characters that you might not have a key for on your computer keyboard, like ← (leftwards arrow, U+2190), € (Euro sign, U+20AC), Я (Cyrillic capital letter Ya, U+042F), ✓ (Check mark, U+2713), and 🁊 (Domino tile horizontal-03-04, U+1F04A). (See note 2.)

If your XML document has a Unicode encoding, you can probably find a way of inputting the character directly.

And if you have the right fonts installed, you will see the character as intended, so that will be much easier to read.

Otherwise, you can use an XML character reference such as &#x2190; (hexadecimal) or &#8592; (decimal), both of which stand for the leftwards arrow, U+2190.

This can make your XML source code ugly and unintelligible, so you may want to define an entity to make it look better and make more sense.
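If you want to work out the character reference for a given character, or check what a reference turns into, a few lines of Python will do. This is a sketch using the standard xml.etree.ElementTree module; the <p> element is just a placeholder.

```python
import xml.etree.ElementTree as ET

arrow = "\u2190"  # leftwards arrow

# Build the hexadecimal and decimal character references from the code point.
hex_ref = f"&#x{ord(arrow):X};"
dec_ref = f"&#{ord(arrow)};"

# Parse a tiny document containing both references.
root = ET.fromstring(f"<p>{hex_ref} {dec_ref}</p>")

# The parser replaces each reference with the character itself.
print(hex_ref, dec_ref, root.text)
```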

CDATA Sections

Remember that in XML there can be parsed character data (#PCDATA in a DTD) and unparsed character data. The word parsed means that an XML processor treats parsed character data as follows:

  1. Certain characters, including < and &, have special significance.
  2. Extra whitespace may be collapsed when the text is displayed. (Strictly speaking, the XML specification requires the parser to pass all whitespace through; any collapsing is done by the application, such as a web browser.)

In unparsed character data, therefore, no characters have special significance, and the text is passed through exactly as written.

The purpose of a CDATA section is to contain unparsed character data. You begin a CDATA section with

<![CDATA[

and you end it with

]]>

For example:

<![CDATA[
   Every n < 10 is better than any y > 0,
   & I can tell you that Pete is     well, he got lost
   in    s        
             p
                  a
                         c
                              e
]]>

It looks bizarre, but it works.

I rarely, or perhaps never, have felt the need to use a CDATA section, but your needs may vary.
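Here is a small sketch of the difference a CDATA section makes, again using Python's xml.etree.ElementTree. The <note> element and the shortened text are my own stand-ins for the example above. Inside the CDATA section the < and & characters and the runs of spaces come through untouched; without it, the same text is not well-formed and the parser rejects it.

```python
import xml.etree.ElementTree as ET

# Text containing <, >, & and runs of spaces.
raw = "Every n < 10 is better than any y > 0, & Pete got lost in    s  p  a  c  e"

# Wrapped in a CDATA section, the text is passed through unchanged.
wrapped = ET.fromstring(f"<note><![CDATA[{raw}]]></note>")
print(wrapped.text == raw)

# Without the CDATA section, the bare < and & make the document ill-formed.
try:
    ET.fromstring(f"<note>{raw}</note>")
    bare_ok = True
except ET.ParseError:
    bare_ok = False
print(bare_ok)
```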

Entities

Entity is a philosophical word that means an existing thing. When computer scientists use philosophical words, there's sure to be trouble!

The textbook describes entities as "building blocks" or "special units of storage" for XML documents.
I don't understand what this means.

You can think of entities as being variables with string values. Whenever you write such a variable in an expression, the value of the variable is substituted in its place.

For example, in Python:

a = "the boy"
b = " sees "
c = "the girl"
print(a + b + c)

The output is the boy sees the girl

With entities, it's much the same, only you have to use special notation to reference the entity (that is, to evaluate it) and you don't need + for concatenation. If we declared XML entities equivalent to a, b, and c, we could write

&a;&b;&c;

and the result is

the boy sees the girl

Kinds of Entities

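We can classify entities according to their role: general entities, parameter entities, and unparsed entities.

We can also classify entities according to their location: internal entities and external entities.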

General Entities

General entities are declared in a DTD (remember, this can be an internal or external DTD), like this:

<!ENTITY fspt "frogs, snails, and puppy dog tails">
<!ENTITY ssan "sugar, spice, and all things nice">
<!ENTITY leftArrow "&#x2190;">

They are referenced like this:

&fspt;
&leftArrow;

Used in context:

Little boys &leftArrow; are made of &fspt;.  
Little girls &leftArrow; are not made 
of &fspt; but of &ssan;.

When this character data is parsed, the values are substituted for the referenced entities, resulting in

Little boys ← are made of frogs, snails, and puppy dog tails. Little girls ← are not made of frogs, snails, and puppy dog tails but of sugar, spice, and all things nice.
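You can watch this substitution happen with a short Python sketch. It assumes the entities are declared in an internal DTD, and the <rhyme> root element is a name I made up for the illustration. The standard xml.etree.ElementTree parser expands internal general entities as it reads the character data.

```python
import xml.etree.ElementTree as ET

# Internal DTD declaring the three general entities, followed by the document.
# The <rhyme> element name is hypothetical.
doc = """<!DOCTYPE rhyme [
  <!ENTITY fspt "frogs, snails, and puppy dog tails">
  <!ENTITY ssan "sugar, spice, and all things nice">
  <!ENTITY leftArrow "&#x2190;">
]>
<rhyme>Little boys &leftArrow; are made of &fspt;. Little girls &leftArrow; are not made of &fspt; but of &ssan;.</rhyme>"""

root = ET.fromstring(doc)

# The entity references have been replaced by their values.
print(root.text)
```

Note that this parser does not fetch external entities, so the sketch only applies to internal ones like these.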

Parameter Entities

These are described on pp. 79–80, but unless you're writing (or trying to read) a very complex DTD, they're not worth bothering about. If a DTD has a lot of repeated parts, they can make it shorter, but also harder to understand. We'll skip them.

Unparsed Entities

These are typically binary data, for example a JPEG image file. We'll skip these too. (Details on p. 80.)

External Entities

The examples above are internal entities: the entity value is part of the declaration, so it is inside the DTD. With external entities, the entity value comes from an external source, such as a file or URL. This can be useful for character data—for example, if you want to just "include" the contents of a text (or XML) file—as well as binary data. Details on pp. 80–81; again, we'll skip it.

Notations

Notations are intended to suggest a "helper application" for XML processors that don't know how to "display" a particular kind of entity (pp. 81–83). Notice the assumption here is that the role of the XML processor is to "display" the XML document, much like a web browser.

But we can imagine that the XML processor has some other task to do, such as querying the XML document (like a database), or transforming it into some other format (for example, from Microsoft's Office Open XML to OpenOffice.org's Open Document Format). In that case, the problem of "displaying" the entity doesn't necessarily come up.

In any case, notations are not necessarily required to solve the problem. Look at the "Catalist Radio" example on pp. 83–86 (Listing 4.1). This references some external data (MP3 files), which are certainly binary and therefore not parsed XML content; but there is no entity here, and no notation. The MP3 files are played by a Flash program. How does the Flash program know which files to play? The file names are simply the contents of <file> elements.

So, clearly, it is possible for XML documents to reference external, binary data, without the need for notations or even entities. Once again, therefore, we'll skip the details.

Exercises

Page 88: #2.


  1. Version history:
    • Version 1.2, 2012 Mar 16. Split parts 4 and 5 into separate files.
    • Version 1.1, 2011 Apr 8. Moved distinction between XML application and XML processor from footnote to section. Changed string concatenation example from Java to Python. Reformatted result of substitution in "little boys and girls" and added arrows.
    • Version 1.0, 2011 Apr 6. Added entities, notations, and namespaces.
    • Version 0.1, 2011 Apr 6. Incomplete draft, covering only comments and character data.
  2. The notation U+nnnn specifies the Unicode code point, or character code number, of the character in hexadecimal notation.