Version 1.21
<!-- This is an XML comment.
Use comments in XML as you would in any programming language:
to make something clearer, or to "comment out" something.
-->
--. This makes only half sense, doesn't it? You can change -- to -/-.
Remember that an XML application is a language defined within the XML framework. Examples of XML applications include SVG, XHTML, and MathML. But the author sometimes uses the term to mean an application program which uses XML, such as Inkscape, a web browser visiting an XHTML page, or an equation editor. Such a program is technically called an XML processor, and I will use that term in these notes.
We need to consider separately:
The XML declaration may declare a character encoding. The default encoding is UTF-8, which is a way of encoding the Unicode character set:
<?xml version="1.0" encoding="UTF-8"?>
The text mentions (p. 73) UTF-16 as another option, but erroneously implies that it supports more characters. UTF-8 and UTF-16 both support the same character set. I believe other encodings can also be declared, though the text states that "XML applications" (meaning XML processors) are not required to support other encodings.
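To check the claim that UTF-8 and UTF-16 cover the same character set, here is a small Python sketch. It is only an illustration: the helper name encoded_lengths is my own, and the sample characters are the ones used later in these notes.

```python
def encoded_lengths(ch):
    """Return (UTF-8 byte count, UTF-16 byte count) for one character."""
    utf8 = ch.encode("utf-8")
    utf16 = ch.encode("utf-16-be")  # big-endian, without a byte-order mark
    # Decoding either byte sequence must give back the same character.
    assert utf8.decode("utf-8") == ch
    assert utf16.decode("utf-16-be") == ch
    return len(utf8), len(utf16)


for ch in ["←", "€", "Я", "✓", "🁊"]:
    n8, n16 = encoded_lengths(ch)
    print(f"U+{ord(ch):04X}: UTF-8 uses {n8} bytes, UTF-16 uses {n16} bytes")
```

Every character round-trips through both encodings; only the number of bytes differs. The domino tile, for instance, takes four bytes either way.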
You know, those unusual characters that you might not have a key for, on your computer keyboard, like ← (leftwards arrow, U+2190), € (Euro sign, U+20AC), Я (Cyrillic capital letter Ya, U+042F), ✓ (Check mark, U+2713), and 🁊 (Domino tile horizontal-03-04, U+1F04A). [2]
If your XML document has a Unicode encoding, you can probably find a way of inputting the character:
And if you have the right fonts installed, you will see the character as intended, so that will be much easier to read.
Otherwise, you can use an XML character reference such as
&#1071; (decimal Unicode number) for Я
&#x42F; (x for hexadecimal Unicode number) for Я
This can make your XML source code ugly and unintelligible, so you may want to define an entity to make it look better and make more sense.
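A quick way to convince yourself that the decimal and hexadecimal forms name the same character is to let an XML processor resolve them. The sketch below uses Python's standard xml.etree.ElementTree module; the element name p is just a placeholder I chose.

```python
import xml.etree.ElementTree as ET

# Decimal and hexadecimal character references for Я (U+042F).
doc = "<p>&#1071; &#x42F;</p>"
text = ET.fromstring(doc).text
print(text)  # Я Я

# The two numbers are the same code point written in different bases.
print(ord("Я"), hex(ord("Я")))  # 1071 0x42f
```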
Remember that in XML there can be parsed character data (#PCDATA in a DTD) and unparsed character data. The word parsed means that an XML processor treats parsed character data as follows:
Certain characters, such as < and &, have special significance.
In unparsed character data, therefore, no characters have special significance, and extra whitespace is left as it is.
The purpose of a CDATA section is to contain unparsed character data. You begin a CDATA section with
<![CDATA[
and you end it with
]]>
For example:
<![CDATA[
Every n < 10 is better than any y > 0,
& I can tell you that Pete is well, he got lost
in s
p
a
c
e
]]>
It looks bizarre, but it works.
I rarely, or perhaps never, have felt the need to use a CDATA section, but your needs may vary.
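If you want to see the difference for yourself, the following Python sketch feeds the same kind of text to an XML processor twice, once inside a CDATA section and once as ordinary parsed character data. The element name note is my own invention, and the text is shortened from the example above.

```python
import xml.etree.ElementTree as ET

# Inside a CDATA section, < and & are ordinary characters.
doc = """<note><![CDATA[
Every n < 10 is better than any y > 0,
& I can tell you that Pete is well
]]></note>"""
text = ET.fromstring(doc).text
print(text)

# The same characters in parsed character data make the document ill-formed.
rejected = False
try:
    ET.fromstring("<note>Every n < 10</note>")
except ET.ParseError as err:
    rejected = True
    print("not well-formed:", err)
```

The first parse returns the text exactly as written, line breaks included; the second is refused by the parser.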
Entity is a philosophical word that means an existing thing. When computer scientists use philosophical words, there's sure to be trouble!
The textbook describes entities as "building blocks" or "special units of storage" for XML documents.
I don't understand what this means.
You can think of entities as being variables with string values. Whenever you write such a variable in an expression, the value of the variable is substituted in its place.
For example, in Python:
a = "the boy"
b = " sees "
c = "the girl"
print(a + b + c)
The output is the boy sees the girl
With entities, it's much the same, only you have to use special notation to reference the entity (that is, to evaluate it) and you don't need + for concatenation. If we declared XML entities equivalent to a, b, and c, we could write
&a;&b;&c;
and the result is
the boy sees the girl
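To make the comparison concrete, here is a sketch of the same sentence built from XML entities and resolved by Python's xml.etree.ElementTree. The root element name sentence is an assumption of mine; the entity names a, b, and c mirror the Python variables above.

```python
import xml.etree.ElementTree as ET

# An internal DTD declaring three entities, then a document that references them.
doc = """<!DOCTYPE sentence [
  <!ENTITY a "the boy">
  <!ENTITY b " sees ">
  <!ENTITY c "the girl">
]>
<sentence>&a;&b;&c;</sentence>"""

sentence = ET.fromstring(doc).text
print(sentence)  # the boy sees the girl
```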
We can classify entities according to their role:
We can also classify entities according to their location:
General entities are declared in a DTD (remember, this can be an internal or external DTD), like this:
<!ENTITY fspt "frogs, snails, and puppy dog tails">
<!ENTITY ssan "sugar, spice, and all things nice">
<!ENTITY leftArrow "←">
They are referenced like this:
&fspt;
&leftArrow;
Used in context:
Little boys &leftArrow; are made of &fspt;.
Little girls &leftArrow; are not made
of &fspt; but of &ssan;.
When this character data is parsed, the values are substituted for the referenced entities, resulting in
Little boys ← are made of frogs, snails, and puppy dog tails. Little girls ← are not made of frogs, snails, and puppy dog tails but of sugar, spice, and all things nice.
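The substitution can be reproduced with a short Python sketch. Two things here are my own choices rather than anything from the textbook: the root element name rhyme, and declaring leftArrow with a character reference (&#x2190;) instead of the literal arrow. The last line collapses line breaks so the result prints as one paragraph.

```python
import xml.etree.ElementTree as ET

doc = """<!DOCTYPE rhyme [
  <!ENTITY fspt "frogs, snails, and puppy dog tails">
  <!ENTITY ssan "sugar, spice, and all things nice">
  <!ENTITY leftArrow "&#x2190;">
]>
<rhyme>Little boys &leftArrow; are made of &fspt;.
Little girls &leftArrow; are not made
of &fspt; but of &ssan;.</rhyme>"""

raw = ET.fromstring(doc).text
# Collapse line breaks and runs of spaces into single spaces.
result = " ".join(raw.split())
print(result)
```

Note that &fspt; and &leftArrow; are each referenced twice but declared only once, which is the whole point of using an entity.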
Parameter entities are described on pp. 79–80, but unless you're writing (or trying to read) a very complex DTD, they're not worth bothering about. If a DTD has a lot of repeated parts, they can make it shorter but harder to understand. We'll skip them.
Unparsed entities are typically binary data, for example a JPEG image file. We'll skip these too. (Details on p. 80.)
The examples above are internal entities: the entity value is part of the declaration, so it is inside the DTD. With external entities, the entity value comes from an external source, such as a file or URL. This can be useful for character data—for example, if you want to just "include" the contents of a text (or XML) file—as well as binary data. Details on pp. 80–81; again, we'll skip it.
Notations are intended to suggest a "helper application" for XML processors that don't know how to "display" a particular kind of entity (pp. 81–83). Notice the assumption here is that the role of the XML processor is to "display" the XML document, much like a web browser.
But we can imagine that the XML processor has some other task to do, such as querying the XML document (like a database), or transforming it into some other format (for example, from Microsoft's Office Open XML to OpenOffice.org's Open Document Format). In that case, the problem of "displaying" the entity doesn't necessarily come up.
In any case, notations are not necessarily required to solve the problem. Look at the "Catalist Radio" example on pp. 83–86 (Listing 4.1). This references some external data (MP3 files), which are certainly binary and therefore not parsed XML content; but there is no entity here, and no notation. The MP3 files are played by a Flash program. How does the Flash program know which files to play? The file names are simply the contents of <file> elements.
So, clearly, it is possible for XML documents to reference external, binary data, without the need for notations or even entities. Once again, therefore, we'll skip the details.
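As a rough illustration of that approach, here is how a program might pull file names out of <file> elements. This is not Listing 4.1: the playlist structure, the track and title elements, and the file names are all made up, and only the <file> element name comes from the textbook example.

```python
import xml.etree.ElementTree as ET

# Hypothetical playlist; only the <file> element name is taken from the example.
doc = """<playlist>
  <track><title>First</title><file>first.mp3</file></track>
  <track><title>Second</title><file>second.mp3</file></track>
</playlist>"""

root = ET.fromstring(doc)
# The file names are ordinary element content, so no entity or notation is needed.
files = [f.text for f in root.iter("file")]
print(files)  # ['first.mp3', 'second.mp3']
```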
Page 88: #2.
The notation U+nnnn specifies the Unicode codepoint, or character code number, for the character in hexadecimal notation.