Chapter 3 XML Primer

Table of Contents
3.1 Overview
3.2 Elements, Tags, and Attributes
3.3 The DOCTYPE Declaration
3.4 Escaping Back to SGML
3.5 Comments
3.6 Entities
3.7 Using Entities to Include Files
3.8 Marked Sections
3.9 Conclusion

The majority of FDP documentation is written in applications of XML. This chapter explains exactly what that means, how to read and understand the source to the documentation, and the sort of XML tricks you will see used in the documentation.

Portions of this section were inspired by Mark Galassi's Get Going With DocBook.

3.1 Overview

Way back when, electronic text was simple to deal with. Admittedly, you had to know which character set your document was written in (ASCII, EBCDIC, or one of a number of others) but that was about it. Text was text, and what you saw really was what you got. No frills, no formatting, no intelligence.

Inevitably, this was not enough. Once you have text in a machine-usable format, you expect machines to be able to use it and manipulate it intelligently. You would like to indicate that certain phrases should be emphasized, or added to a glossary, or be hyperlinks. You might want filenames to be shown in a “typewriter” style font for viewing on screen, but as “italics” when printed, or any of a myriad of other options for presentation.

It was once hoped that Artificial Intelligence (AI) would make this easy. Your computer would read in the document and automatically identify key phrases, filenames, text that the reader should type in, examples, and more. Unfortunately, real life has not happened quite like that, and our computers require some assistance before they can meaningfully process our text.

More precisely, they need help identifying what is what. Let's look at this text:

To remove /tmp/foo use rm(1).

% rm /tmp/foo

It is easy to see which parts are filenames, which are commands to be typed in, which parts are references to manual pages, and so on. But the computer processing the document cannot. For this we need markup.

“Markup” is commonly used to describe “adding value” or “increasing cost”. The term takes on both these meanings when applied to text. Markup is additional text included in the document, distinguished from the document's content in some way, so that programs that process the document can read the markup and use it when making decisions about the document. Editors can hide the markup from the user, so the user is not distracted by it.

The extra information stored in the markup adds value to the document. Adding the markup to the document must typically be done by a person—after all, if computers could recognize the text sufficiently well to add the markup then there would be no need to add it in the first place. This increases the cost (i.e., the effort required) to create the document.

The previous example is actually represented in this document like this:


<para>To remove <filename>/tmp/foo</filename> use &man.rm.1;.</para>

<screen>&prompt.user; <userinput>rm /tmp/foo</userinput></screen>

As you can see, the markup is clearly separate from the content.

Obviously, if you are going to use markup you need to define what your markup means, and how it should be interpreted. You will need a markup language that you can follow when marking up your documents.

Of course, one markup language might not be enough. A markup language for technical documentation has very different requirements than a markup language that was to be used for cookery recipes. This, in turn, would be very different from a markup language used to describe poetry. What you really need is a first language that you use to write these other markup languages. A meta markup language.

This is exactly what the eXtensible Markup Language (XML) is. Many markup languages have been written in XML, including the two most used by the FDP, XHTML and DocBook.

Each language definition is more properly called a grammar, vocabulary, schema or Document Type Definition (DTD). There are various languages to specify an XML grammar, for example, DTD (yes, it also means the specification language itself), XML Schema (XSD) or RELANG NG. The schema specifies the name of the elements that can be used, what order they appear in (and whether some markup can be used inside other markup) and related information.

A schema is a complete specification of all the elements that are allowed to appear, the order in which they should appear, which elements are mandatory, which are optional, and so forth. This makes it possible to write an XML parser which reads in both the schema and a document which claims to conform to the schema. The parser can then confirm whether or not all the elements required by the vocabulary are in the document in the right order, and whether there are any errors in the markup. This is normally referred to as “validating the document”.

Note: This processing simply confirms that the choice of elements, their ordering, and so on, conforms to that listed in the grammar. It does not check that you have used appropriate markup for the content. If you tried to mark up all the filenames in your document as function names, the parser would not flag this as an error (assuming, of course, that your schema defines elements for filenames and functions, and that they are allowed to appear in the same place).

It is likely that most of your contributions to the Documentation Project will consist of content marked up in either XHTML or DocBook, rather than alterations to the schemas. For this reason this book will not touch on how to write a vocabulary.