DocBook XML Documents (with Eclipse)

Sirisha Mantravadi

Qualcomm Inc.

Jean-Francois Chamberland

Texas A&M University

Gregory Huff

Texas A&M University

Table of Contents

Extensible Markup Language (XML)
XML Documents and Syntax
Document Type Definition (DTD)
Displaying XML with XSLT
DocBook
DocBook XML in Eclipse
Creating a DocBook XML Source File
Using Xalan with Ant to convert DocBook XML to HTML output
Using FOP with Ant to convert DocBook XML to PDF output
Useful Resources

Extensible Markup Language (XML)

The extensible markup language (XML) is a W3C-endorsed standard markup language for text documents containing structured information. It was originally designed to meet the challenges of large-scale electronic publishing. Over time, XML has also become a software-independent, hardware-independent tool for storing and transporting self-describing data. Information is included in XML documents as strings of text. The XML specification defines the exact syntax for using markup and tags to describe the data. This markup language is an important tool in web development because it simplifies data interchange. For instance, the well-structured format of an XML document makes it possible to write scripts or applications that process files without human intervention.

XML Documents and Syntax

An XML document is written in plain text, and it never contains raw binary data. The XML specification defines a grammar for these documents. A parser analyzes the markup and rejects any file that violates the basic XML rules, that is, any file that is not well-formed. A well-formed XML document produces a tree-like structure that starts at the root and branches to the leaves. A sample file appears next.

<?xml version="1.0" encoding="iso-8859-1" standalone="no"?>
<PersonalInformation>
  <FullName>
    <FirstName>Lawrence Sullivan</FirstName>
    <LastName>Ross</LastName>
  </FullName>
  <BirthDate>
    <Month>September</Month>
    <Date>27</Date>
    <Year>1838</Year>
  </BirthDate>
  <Address>
    <City>College Station</City>
    <Country>USA</Country>
    <PostalCode>77843</PostalCode>
  </Address>
</PersonalInformation>

This example illustrates the most basic syntax rules in writing XML documents. Collectively, these rules are simple and logical, as seen through the following list.

  1. The fundamental unit of data and markup in XML is an element. A single entity, called the root element, contains all the text and any other elements in the XML document.

  2. All XML elements must have a closing tag (or use the empty-element form, e.g., <br/>), and XML tags are case-sensitive. XML elements must be properly nested; overlapping elements are not allowed.

  3. Some characters have a special meaning in XML. To include them as literal text, use one of the five predefined entity references.

    Table 1. Predefined Entity References

    Entity    Name
    &lt;      < (less than)
    &gt;      > (greater than)
    &amp;     & (ampersand)
    &apos;    ' (apostrophe)
    &quot;    " (quotation mark)


  4. The syntax for writing comments in XML is straightforward: <!-- This is a comment. -->

  5. An XML element can have attributes, which provide additional information about this element.

    <employee id="1">
      <firstname>Robert</firstname>
      <lastname>Gates</lastname>
    </employee>
    

    Attributes must have values and these values must be enclosed within quotation marks.

  6. Most XML documents start with an XML declaration that provides basic information about the document to the parser. An XML declaration is recommended, but not required. If present, it must appear at the very beginning of the document, before any other content, including whitespace and comments.

    <?xml version="1.0" encoding="iso-8859-1" standalone="no"?>
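
The rules listed above can be seen working together in one small document. The following is only an illustrative sketch: the employee element with its id attribute comes from the example in rule 5, whereas the names staff, department, remark and retired are hypothetical and chosen purely for this illustration.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Rule 6: the declaration above is the very first thing in the file. -->
<!-- Rule 1: one root element (hypothetical name: staff) encloses everything. -->
<staff>
  <!-- Rule 5: attribute values are always enclosed in quotation marks. -->
  <employee id="1">
    <!-- Rule 2: every element is closed, and elements nest without overlap. -->
    <firstname>Robert</firstname>
    <lastname>Gates</lastname>
    <!-- Rule 3: predefined entity references stand in for reserved characters. -->
    <department>Research &amp; Development</department>
    <remark>Use one &lt;employee&gt; element per person.</remark>
    <!-- Rule 2: an element without content may use the empty-element form. -->
    <retired/>
  </employee>
</staff>
```

A parser would reject this file if, for instance, </firstname> were written as </FirstName>, because tag names are case-sensitive and the opening tag would then have no matching close.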
    

Document Type Definition (DTD)

A document type definition is a set of markup declarations that define a document type. It uses a terse formal syntax to declare where elements and references can be placed in a document, and in what context they can be employed. A well-formed document conforms to the XML syntax rules; a document that also satisfies the constraints of its document type definition is said to be valid. In XML vocabularies like DocBook, a typical document type declaration, which points to the DTD, looks as follows.

<?xml version='1.0'?>
<!DOCTYPE book PUBLIC "-//OASIS//DTD DocBook XML V4.5//EN"
    "http://www.oasis-open.org/docbook/xml/4.5/docbookx.dtd">

When defining attributes, it is possible to declare which ones are required, list their admissible values, and specify default values. Oftentimes, an author or programmer will leverage existing DTDs and, as such, understanding their roles may be more important than being able to construct new ones. It is worth mentioning that the W3C supports an alternative to DTDs called XML Schema. This is a complementary and powerful way to specify the structure of valid XML documents.
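
As a small illustration of these attribute declarations, the sketch below embeds a DTD directly in a document as an internal subset. It reuses the employee example from earlier; the status attribute and its values active and retired are assumptions introduced only to show an enumerated list with a default value.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE employee [
  <!-- An employee contains exactly one firstname followed by one lastname. -->
  <!ELEMENT employee (firstname, lastname)>
  <!ELEMENT firstname (#PCDATA)>
  <!ELEMENT lastname (#PCDATA)>
  <!-- id is required; status (hypothetical) is limited to two values
       and defaults to "active" when omitted. -->
  <!ATTLIST employee
    id     CDATA               #REQUIRED
    status (active | retired)  "active">
]>
<employee id="1">
  <firstname>Robert</firstname>
  <lastname>Gates</lastname>
</employee>
```

A validating parser would report an error if the id attribute were missing or if status were given a value outside the declared list, even though the document would remain well-formed in both cases.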

Displaying XML with XSLT

The success of XML stems, partly, from the separation of content and presentation. The appearance of a processed document is not specified in the structured file itself; rather, it is governed by a stylesheet language called XSL. Because XML does not use predefined tags, a processor has no built-in notion of how each element should be rendered. XSL resolves this issue and enables appropriate styling by dictating how the XML document should be displayed. XSL encompasses three related components: XSLT, XPath, and XSL-FO. The general idea behind this framework is that an author writes an XML document, then obtains an XSLT transform to convert the document into formatting objects, which are subsequently used to generate a desired output such as a PDF file. In the process, XPath is employed to navigate through the XML source.
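
To make the roles of XSLT and XPath more concrete, here is a minimal sketch of a stylesheet that could turn the PersonalInformation sample from earlier into a small HTML page. It is written for illustration only; the choice of HTML output and the page layout are assumptions, and the expressions in the match and select attributes are XPath.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="html" indent="yes"/>

  <!-- The XPath pattern selects the root element of the sample file. -->
  <xsl:template match="/PersonalInformation">
    <html>
      <body>
        <h1>
          <xsl:value-of select="FullName/FirstName"/>
          <xsl:text> </xsl:text>
          <xsl:value-of select="FullName/LastName"/>
        </h1>
        <p>
          <xsl:text>Born </xsl:text>
          <xsl:value-of select="BirthDate/Month"/>
          <xsl:text> </xsl:text>
          <xsl:value-of select="BirthDate/Date"/>
          <xsl:text>, </xsl:text>
          <xsl:value-of select="BirthDate/Year"/>
        </p>
      </body>
    </html>
  </xsl:template>
</xsl:stylesheet>
```

The source document says nothing about headings or paragraphs; all presentation decisions live in the stylesheet, which is the separation of content and presentation described above.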

DocBook

DocBook XML is a semantic markup language tailored for writing technical documentation. Since DocBook is XML, it can be transformed into many different output formats including HTML, PDF and JavaHelp. Thus, DocBook XML is a natural choice for writing documents once and generating them in various formats. To create DocBook files and convert them into other formats, three items are necessary.

  • The DocBook DTD identifies the semantics of the DocBook document.
  • XSLT stylesheets dictate how to convert this DocBook into alternate formats.
  • An XSLT processor performs the transformation according to prescribed guidelines.

From a practical perspective, DocBook is a gateway into writing technical documentation using a semantic markup language without having to worry about creating a DTD or XSL stylesheets. Provided that an author adheres to the elements and tags already present in the corresponding DocBook DTD, standard workflows can be leveraged to generate the desired output. Some effort may be required for proper styling, yet this remains minimal compared to the monumental effort necessary to establish a new workflow.

DocBook offers two main document classes, book and article. The element article is used to write a self-contained technical article. A more comprehensive exposition is initiated with the element book. Books can include instances of article, section, chapter and part. Inside higher-level elements, there can be other elements for actual contents such as paragraphs, lists, code samples, etc. Some of the popular ones are given in the table below.

Table 2. Common Elements in DocBook

Element        Description
para           A paragraph or block of text
title          A title associated with an element
orderedlist    A list of automatically numbered items
itemizedlist   A list of items that optionally has bullet points
table          A table of text or data
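
The elements in the table above can be combined as in the following sketch of a DocBook 4.5 section. The titles and text are placeholders invented for this illustration; the nested listitem, tgroup, thead, tbody, row and entry elements are the standard DocBook containers that lists and tables require.

```xml
<section>
  <title>Sample Section</title>
  <para>A paragraph or block of text.</para>

  <!-- Items are numbered automatically when the document is rendered. -->
  <orderedlist>
    <listitem><para>Write the source file.</para></listitem>
    <listitem><para>Transform it into the desired output.</para></listitem>
  </orderedlist>

  <!-- Items are typically rendered with bullet points. -->
  <itemizedlist>
    <listitem><para>HTML output</para></listitem>
    <listitem><para>PDF output</para></listitem>
  </itemizedlist>

  <!-- A formal table carries a title and declares its column count. -->
  <table>
    <title>Sample Table</title>
    <tgroup cols="2">
      <thead>
        <row><entry>Element</entry><entry>Description</entry></row>
      </thead>
      <tbody>
        <row><entry>para</entry><entry>A paragraph or block of text</entry></row>
      </tbody>
    </tgroup>
  </table>
</section>
```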


DocBook XML in Eclipse

We shall use DocBook XML to create documentation that can be transformed into HTML and PDF formats from a single source file. We will employ the Web Tools Platform (WTP) extension to Eclipse as an XML editor and validator, Xalan-Java as an XSLT processor and Apache Ant for the XSLT transformation. This combination of tools seems to be one of the simplest, most accessible workflows available. In addition, we make most of the software available in our Subversion repository, thereby enabling a (somewhat) seamless author experience.

First and foremost, you must install an Eclipse package (e.g., Classic), preferably a recent version that comes with the Web Tools Platform. Apache Ant is integrated into Eclipse and consequently no additional installation is required for this command-line tool. If Eclipse is already present on the computer, then WTP can be added by going into Help -> Install New Software... and entering the software repository URL; for the Juno release, this is http://download.eclipse.org/webtools/repository/juno/. Make sure to use the URL that corresponds to your version of Eclipse. Select the latest WTP version for download and then follow the steps of the installation wizard.

The easiest, most straightforward way to use DocBook XML in Eclipse is to first get Subversion to interface with Eclipse and then check out a working copy of https://lemur.tamu.edu/svn/InnovationLab/XML-DocBook/. This directory contains all the necessary tools to process DocBook XML source files, along with a proper folder hierarchy and an Ant build script. To authenticate, use your own credentials; alternatively, you can obtain read-only access with the anonymous username and password.

  • User Name: anonymous
  • Password: anonymous

The essential building blocks contained in the XML-DocBook directory are:

  1. XML-DocBook/docbook-xml-4.5: DocBook XML V4.5 document type definitions,

  2. XML-DocBook/docbook-xsl: DocBook XSL stylesheets,

  3. XML-DocBook/lib/xalan: Apache Xalan-Java XSLT Processor (JAR files),

  4. XML-DocBook/lib/fop: Apache FOP (JAR files).

Once the necessary software components are installed, you can create and start editing source files. You can also write or modify the Ant script that will transform your DocBook XML. Please follow the steps given below in order to manage and process your DocBook XML files in Eclipse.

Creating a DocBook XML Source File

In the folder XML-DocBook/input, create a file named <username>.xml. The path ../docbook-xml-4.5/docbookx.dtd in our example refers to the directory structure introduced earlier in this tutorial.

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!DOCTYPE article PUBLIC "-//OASIS//DTD DocBook XML V4.5//EN" "../docbook-xml-4.5/docbookx.dtd">
<article>
  <articleinfo>
    <title>Memoirs of a Secretary of Defense</title>
    <author>
      <firstname>Robert</firstname>
      <surname>Gates</surname>
    </author>
  </articleinfo>
  <section label="1.0">
    <title>Secretary of Defense Robert M. Gates, West Point, NY, Friday, February 25, 2011</title>
    <para> 
      "One thing I have learned from decades of leading large public organizations is that it is important to
      really focus on the top 20 percent of your people and, though it may be politically incorrect to say so,
      the bottom 20 percent as well.  The former to elevate and give more responsibility and opportunity, the
      latter to transition out, albeit with consideration and respect for the service they have rendered.
      Failure to do this risks frustrating, demoralizing and ultimately losing the leaders we will most need
      for the future."
    </para>
  </section>
</article>

The XML Editor in Eclipse offers two views of the source file: Design and Source. The XML code given above is written in the Source view of the Eclipse XML Editor. The Design view, on the other hand, displays the elements of the file in a grid. Under the latter view, you can add new elements to the file by right-clicking an existing node and choosing Add Child, Add After, or Add Before. When the file references a proper DTD, the menu options only allow valid elements to be added, thereby reducing the likelihood of errors.

Validation of the DocBook XML source file can be done by right-clicking the file in the Project Explorer, then clicking Validate. The validation errors, if any, will appear in the Problems view of the Eclipse IDE.

Using Xalan with Ant to convert DocBook XML to HTML output

Apache Ant is a Java-based build tool that reads an XML script and can perform many tasks defined in the script. The steps involved in using Xalan with Ant are outlined below.

  1. Locate the ant-build.xml file contained in the XML-DocBook directory of this project.
  2. Run the ant-build.xml file (right click, Run as -> Ant Build).
  3. Refresh the project, then check the output directory. You will find the HTML files in the output directory.

Messages from the Ant build script, if any, will appear in the Console view.
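
For readers who want to see what such a script may contain, the following is a minimal sketch of an Ant build file that runs Xalan on a DocBook source. It is not the ant-build.xml shipped in the repository. The project name, the property names and the file name username.xml are assumptions, and the paths assume the folder hierarchy described earlier with the script placed in the XML-DocBook directory. The stylesheet html/docbook.xsl is the single-page HTML stylesheet of the DocBook XSL distribution.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<project name="docbook-html-sketch" default="build-html" basedir=".">

  <!-- Hypothetical property names; adjust to the actual layout. -->
  <property name="input.dir"       value="input"/>
  <property name="output.dir"      value="output"/>
  <property name="docbook.xsl.dir" value="docbook-xsl"/>

  <!-- Xalan-Java JAR files, assumed to sit directly in lib/xalan. -->
  <path id="xalan.classpath">
    <fileset dir="lib/xalan" includes="*.jar"/>
  </path>

  <target name="build-html">
    <mkdir dir="${output.dir}"/>
    <!-- Apply the DocBook HTML stylesheet to one source file. -->
    <xslt in="${input.dir}/username.xml"
          out="${output.dir}/username.html"
          style="${docbook.xsl.dir}/html/docbook.xsl">
      <classpath refid="xalan.classpath"/>
      <!-- Ask Ant to use Xalan rather than the default XSLT processor. -->
      <factory name="org.apache.xalan.processor.TransformerFactoryImpl"/>
    </xslt>
  </target>

</project>
```

Naming build-html as the default target means that running the script without arguments produces the HTML output.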

Using FOP with Ant to convert DocBook XML to PDF output

PDF is a format well suited to print media. The conversion of .xml files to .pdf files takes place in two steps. First, the DocBook XML source must be converted into an XSL Formatting Objects (XSL-FO) file, with the .fo extension, via the DocBook XSL stylesheets. The XSL-FO file is then translated into PDF via the Apache FOP library. Only the .jar files from the FOP distribution are required. Since you have already acquired the lib folder from the repository, you should be able to see the FOP JAR files located in the ../lib/fop/build/ and ../lib/fop/lib/ directories. Follow these steps to transform a DocBook XML source file into a PDF output.

  1. Locate the ant-build.xml file contained in the XML-DocBook directory of this project.
  2. In the ant-build.xml file, note that the default target is build-html. At the end of the file, a new target with name="all" is defined that includes both build-html and build-pdf. The build-pdf target invokes the fop task. The FOP task is defined in the Ant build script before it is called in the build-pdf target. Editing the Ant run configuration changes which target is executed.
  3. Run the ant-build.xml file with the desired target selected (build-pdf or all). Refresh your project through the Explorer window, then check the output directory. You should find the PDF and HTML files in the output directory.
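
The two-step conversion and the target layout described above can be sketched as follows. Again, this is an illustration under stated assumptions rather than the repository's actual ant-build.xml: the project name, the file name username.xml and the relative paths are hypothetical, and the XSLT processor configuration is omitted for brevity, so Ant falls back on its default processor. The class org.apache.fop.tools.anttasks.Fop is the Ant task provided by Apache FOP, and fo/docbook.xsl is the XSL-FO stylesheet of the DocBook XSL distribution.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<project name="docbook-pdf-sketch" default="build-html" basedir=".">

  <!-- FOP JAR files, assumed to be in lib/fop/build and lib/fop/lib. -->
  <path id="fop.classpath">
    <fileset dir="lib/fop/build" includes="*.jar"/>
    <fileset dir="lib/fop/lib"   includes="*.jar"/>
  </path>

  <!-- The fop task must be defined before any target calls it. -->
  <taskdef name="fop" classname="org.apache.fop.tools.anttasks.Fop">
    <classpath refid="fop.classpath"/>
  </taskdef>

  <target name="build-html">
    <xslt in="input/username.xml"
          out="output/username.html"
          style="docbook-xsl/html/docbook.xsl"/>
  </target>

  <target name="build-pdf">
    <!-- Step 1: DocBook XML to XSL-FO via the DocBook stylesheets. -->
    <xslt in="input/username.xml"
          out="output/username.fo"
          style="docbook-xsl/fo/docbook.xsl"/>
    <!-- Step 2: XSL-FO to PDF via Apache FOP. -->
    <fop format="application/pdf"
         fofile="output/username.fo"
         outfile="output/username.pdf"/>
  </target>

  <!-- Builds both outputs in one run. -->
  <target name="all" depends="build-html, build-pdf"/>

</project>
```

With this layout, selecting the all target in the Ant run configuration produces both files, while the default run still generates HTML only.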

Useful Resources