The Extensible Markup Language (XML) is a W3C-endorsed standard markup language for text documents containing structured information. It was originally designed to meet the challenges of large-scale electronic publishing. Over time, XML has evolved to also act as a software-independent, hardware-independent tool for storing and transporting self-describing data. Information is included in XML documents as strings of text. The XML specification defines the exact syntax for using markup and tags to describe the data. This markup language is an important tool in web development because it simplifies data interchange. For instance, the well-structured format of an XML document makes it possible to write scripts or applications that can process files without human intervention.
An XML document is written in plain text, and it never contains binary data. The XML specification defines a grammar for these documents. A parser analyzes the markup and rejects files that violate the basic XML rules. A well-formed XML document produces a tree-like structure that starts at the root and branches to the leaves. A sample file appears next.
<?xml version="1.0" encoding="iso-8859-1" standalone="no"?>
<PersonalInformation>
  <FullName>
    <FirstName>Lawrence Sullivan</FirstName>
    <LastName>Ross</LastName>
  </FullName>
  <BirthDate>
    <Month>September</Month>
    <Date>27</Date>
    <Year>1838</Year>
  </BirthDate>
  <Address>
    <City>College Station</City>
    <Country>USA</Country>
    <PostalCode>77843</PostalCode>
  </Address>
</PersonalInformation>
This example illustrates the most basic syntax rules in writing XML documents. Collectively, these rules are simple and logical, as seen through the following list.
The fundamental unit of data and markup in XML is an element. A single element, called the root element, contains all the text and any other elements in the XML document.
All XML elements must have a closing tag, and XML tags are case-sensitive. XML elements must be properly nested; overlapping elements are not allowed.
Some characters have a special meaning in XML, with five predefined entity references.
Table 1. Predefined Entity References
|&lt;|<|less than|
|&gt;|>|greater than|
|&amp;|&|ampersand|
|&apos;|'|apostrophe|
|&quot;|"|quotation mark|
The syntax for writing comments in XML is straightforward.
<!-- This is a comment. -->
An XML element can have attributes, which provide additional information about this element.
<employee id="1">
  <firstname>Robert</firstname>
  <lastname>Gates</lastname>
</employee>
Attributes must have values and these values must be enclosed within quotation marks.
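The rules listed above can be seen together in one small document. The following is a minimal sketch rather than part of the original example set: the element names note, formula, and quote, as well as the attributes, are hypothetical and chosen only for illustration. It shows properly nested elements, quoted attribute values, and the predefined entity references used wherever a special character would otherwise be read as markup.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Hypothetical element and attribute names, used only for illustration. -->
<note id="n1" author="R&amp;D team">
  <!-- The less-than sign and the ampersand must be escaped in character data. -->
  <formula>x &lt; y &amp;&amp; y &gt; 0</formula>
  <!-- Quotation marks must be escaped inside an attribute value that uses the same delimiter. -->
  <quote text="She said &quot;hello&quot;">It&apos;s quoted above.</quote>
</note>
```

By contrast, a parser would reject overlapping tags such as <b><i>text</b></i>, a closing tag whose case differs from its opening tag, or an attribute value written without quotation marks.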
Most XML documents start with an XML declaration that provides basic information about the document to the parser. An XML declaration is recommended, but not required. If present, it must appear at the very beginning of the document, before any other content.
<?xml version="1.0" encoding="iso-8859-1" standalone="no"?>
A document type definition (DTD) is a set of markup declarations that define a document type. It uses a terse formal syntax to declare where elements and references can be placed in a document, and in what context they can be employed. A well-formed document conforms to the XML syntax rules; a valid document is well-formed and also conforms to its document type definition. In XML vocabularies like DocBook, a typical document type declaration looks as follows.
<?xml version='1.0'?>
<!DOCTYPE book PUBLIC "-//OASIS//DTD DocBook XML V4.5//EN"
  "http://www.oasis-open.org/docbook/xml/4.5/docbookx.dtd">
While defining attributes, it is possible to declare which ones are required, list their admissible values, and specify default values. Oftentimes, an author or programmer will leverage existing DTDs and, as such, understanding their roles may be more important than being able to construct new ones. It is worth mentioning that the W3C supports an alternative to DTDs called XML Schema. This is a complementary and powerful way to specify the structure of valid XML documents.
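To make the attribute declarations concrete, here is a minimal sketch of a DTD written as an internal subset for the employee element shown earlier. It is an illustration, not a DTD taken from any existing vocabulary, and the status attribute is a hypothetical addition. The sketch declares one required attribute and one attribute with a fixed list of admissible values and a default.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE employee [
  <!-- An employee holds exactly one firstname followed by one lastname. -->
  <!ELEMENT employee (firstname, lastname)>
  <!ELEMENT firstname (#PCDATA)>
  <!ELEMENT lastname (#PCDATA)>
  <!-- id is required; status (hypothetical) accepts two values and defaults to "active". -->
  <!ATTLIST employee
            id     CDATA              #REQUIRED
            status (active | retired) "active">
]>
<employee id="1">
  <firstname>Robert</firstname>
  <lastname>Gates</lastname>
</employee>
```

A validating parser would report an error if the id attribute were missing or if status carried a value outside the declared list, while a non-validating parser would only check that the document is well-formed.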
The success of XML stems, in part, from the separation of content and presentation. The appearance of a processed document is not specified in the structured file itself; rather, it is governed by a stylesheet language called XSL. XML does not use predefined tags, so the file alone does not say how each element should be rendered. To resolve this issue and enable appropriate styling, XSL dictates how the XML document should be displayed. XSL encompasses three related components: XSLT, XPath, and XSL-FO. The general idea behind this framework is that an author writes an XML document and then applies an XSLT transform to convert the document into formatting objects, which are subsequently used to generate a desired output such as a PDF file. In the process, XPath is employed to navigate through the XML source.
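As an illustration of how XSLT and XPath work together, the following sketch is a small stylesheet that could be applied to the PersonalInformation sample given earlier. It is a simplified example under stated assumptions: for brevity it emits HTML rather than XSL-FO, and the choice of HTML elements is arbitrary. The match and select attributes contain the XPath expressions that navigate the source tree.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="html" indent="yes"/>

  <!-- The XPath pattern matches the root element of the sample file. -->
  <xsl:template match="/PersonalInformation">
    <html>
      <body>
        <h1>
          <!-- Relative XPath expressions select children of the matched element. -->
          <xsl:value-of select="FullName/FirstName"/>
          <xsl:text> </xsl:text>
          <xsl:value-of select="FullName/LastName"/>
        </h1>
        <p>
          Born <xsl:value-of select="BirthDate/Month"/>
          <xsl:text> </xsl:text>
          <xsl:value-of select="BirthDate/Date"/>,
          <xsl:value-of select="BirthDate/Year"/>
        </p>
      </body>
    </html>
  </xsl:template>
</xsl:stylesheet>
```

The source document contains no presentation markup at all; swapping in a different stylesheet would change the output format without touching the content.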
DocBook XML is a semantic markup language tailored for writing technical documentation. Since DocBook is XML, it can be transformed into many different output formats, including HTML, PDF, and Java help pages. Thus, DocBook XML is a natural choice for writing documents once and generating them in various formats. To create DocBook files and convert them into other formats, three items are necessary: the DocBook DTD, the DocBook XSL stylesheets, and an XSLT processor.
From a practical perspective, DocBook is a gateway into writing technical documentation using a semantic markup language without having to worry about creating a DTD or XSL stylesheets. Provided that an author adheres to the elements and tags already present in the corresponding DocBook DTD, standard workflows can be leveraged to generate the desired output. Some effort may be required for proper styling, yet this remains minimal compared to the effort necessary to establish a new workflow.
DocBook offers two main document classes, article and book.
article is used to write a self-contained technical article.
A more comprehensive exposition is initiated with the element book.
Books can include instances of chapter and article, among other components.
Inside higher-level elements, there can be other elements for actual contents such as paragraphs, lists, code samples, etc.
Some of the popular ones are given in the table below.
Table 2. Common Elements in DocBook
|para|A paragraph or block of text|
|title|A title associated with an element|
|orderedlist|A list of automatically numbered items|
|itemizedlist|A list of items that optionally has bullet points|
|table|A table of text or data|
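The elements in Table 2 can be combined inside a higher-level element as in the following sketch. The titles and text are placeholders invented for illustration, and the structure is only one of many that the DocBook V4.5 DTD permits.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE book PUBLIC "-//OASIS//DTD DocBook XML V4.5//EN"
  "http://www.oasis-open.org/docbook/xml/4.5/docbookx.dtd">
<book>
  <title>Sample Book</title>
  <chapter>
    <title>First Chapter</title>
    <para>A paragraph of text.</para>
    <!-- Items are numbered automatically by the stylesheet. -->
    <orderedlist>
      <listitem><para>First numbered step</para></listitem>
      <listitem><para>Second numbered step</para></listitem>
    </orderedlist>
    <!-- Items are typically rendered with bullet points. -->
    <itemizedlist>
      <listitem><para>A bulleted item</para></listitem>
    </itemizedlist>
    <table>
      <title>Sample Table</title>
      <tgroup cols="2">
        <thead>
          <row><entry>Element</entry><entry>Purpose</entry></row>
        </thead>
        <tbody>
          <row><entry>para</entry><entry>Paragraph</entry></row>
        </tbody>
      </tgroup>
    </table>
  </chapter>
</book>
```

Note that list items wrap their text in para, and that a table needs a tgroup with a cols attribute; a validating editor will flag either omission.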
We shall use DocBook XML to create documentation that can be transformed into HTML and PDF formats from a single source file. We will employ the Web Tools Platform (WTP) extension to Eclipse as an XML editor and validator, Xalan-Java as an XSLT processor, and Apache Ant to run the XSLT transformation. This combination of tools seems to be one of the simplest, most accessible workflows available. In addition, we make most of the software available in our Subversion repository, thereby enabling a (somewhat) seamless author experience.
First and foremost, you must install an Eclipse package (e.g., Classic), preferably a recent version that comes with the Web Tools Platform. Apache Ant is integrated into Eclipse and consequently no additional installation is required for this command-line tool. If Eclipse is already present on the computer, then WTP can be added by going into Help -> Install New Software... Enter the software repository URL (Juno) http://download.eclipse.org/webtools/repository/juno/. Make sure to type the URL that corresponds to your version of Eclipse. Select the latest WTP version for download and then follow through the installation wizard steps.
The easiest, most straightforward way to use DocBook XML in Eclipse is to first get Subversion to interface with Eclipse and then check out a working copy of https://lemur.tamu.edu/svn/InnovationLab/XML-DocBook/. This directory contains all the necessary tools to process DocBook XML source files, along with a proper folder hierarchy and an ant-build script. To authenticate, use your own credentials or, alternatively, you can obtain read-only access with the anonymous username and password.
The essential building blocks contained in the XML-DocBook directory are:
XML-DocBook/docbook-xml-4.5: DocBook XML V4.5 document type definitions,
XML-DocBook/docbook-xsl: DocBook XSL stylesheets,
XML-DocBook/lib/xalan: Apache Xalan-Java XSLT Processor (JAR files),
XML-DocBook/lib/fop: Apache FOP (JAR files).
Once the necessary software components are installed, you can create and start editing source files. You can also write or modify the Ant script that will transform your DocBook XML. Please follow the steps given below in order to manage and process your DocBook XML files in Eclipse.
In the folder XML-DocBook/input, create a file named
The relative path ../docbook-xml-4.5/docbookx.dtd in our example refers to the directory structure introduced earlier in this tutorial.
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!DOCTYPE article PUBLIC "-//OASIS//DTD DocBook XML V4.5//EN"
  "../docbook-xml-4.5/docbookx.dtd">
<article>
  <articleinfo>
    <title>Memoirs of a Secretary of Defense</title>
    <author>
      <firstname>Robert</firstname>
      <surname>Gates</surname>
    </author>
  </articleinfo>
  <section label="1.0">
    <title>Secretary of Defense Robert M. Gates, West Point, NY, Friday, February 25, 2011</title>
    <para> "One thing I have learned from decades of leading large public organizations is that it is important to really focus on the top 20 percent of your people and, though it may be politically incorrect to say so, the bottom 20 percent as well. The former to elevate and give more responsibility and opportunity, the latter to transition out, albeit with consideration and respect for the service they have rendered. Failure to do this risks frustrating, demoralizing and ultimately losing the leaders we will most need for the future." </para>
  </section>
</article>
The XML Editor in Eclipse offers two views of the source file: Design and Source.
The XML code given above is written in the Source view of the Eclipse XML Editor.
On the other hand, the Design view displays the elements of the file in a grid.
Under the latter view, you can add new elements to the file by right-clicking the source and choosing
Add After, or
When the file references a proper DTD, the menu options allow for adding only valid elements, thereby reducing the likelihood of errors.
Validation of the DocBook XML source file can be done by right-clicking the file in the Project Explorer, then clicking Validate. The validation errors, if any, will appear in the Problems view of the Eclipse IDE.
Apache Ant is a Java-based build tool that reads an XML script and can perform many tasks defined in the script. The steps involved in using Xalan with Ant are outlined below.
Open the ant-build.xml file contained in the XML-DocBook directory of this project.
Run the ant-build.xml file (right click, Run as -> Ant Build).
Messages from the Ant build script, if any, will appear in the Console view.
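For readers who want to see what such a script involves, the following is a minimal sketch of an Ant build file that runs Xalan through Ant's xslt task. It is not the ant-build.xml file from the repository. The target name build-html and the folders input, lib/xalan, and docbook-xsl follow this tutorial, whereas the project name, the file name sample.xml, and the output folder are assumptions made for illustration.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Sketch only: the file names, property names, and output folder are assumptions. -->
<project name="docbook-sketch" default="build-html" basedir=".">

  <property name="input.file" value="input/sample.xml"/>
  <property name="html.style" value="docbook-xsl/html/docbook.xsl"/>
  <property name="output.dir" value="output"/>

  <!-- Put the Xalan JAR files on the classpath used by the xslt task. -->
  <path id="xalan.classpath">
    <fileset dir="lib/xalan" includes="*.jar"/>
  </path>

  <target name="build-html" description="Transform DocBook XML into HTML">
    <mkdir dir="${output.dir}"/>
    <xslt in="${input.file}"
          out="${output.dir}/sample.html"
          style="${html.style}"
          classpathref="xalan.classpath">
      <!-- Ask Ant to use the Xalan TransformerFactory for the transformation. -->
      <factory name="org.apache.xalan.processor.TransformerFactoryImpl"/>
    </xslt>
  </target>
</project>
```

Keeping the paths in properties makes it easier to point the same target at a different source file or stylesheet without editing the target itself.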
XSL-FO is optimized for print media.
The workflow of converting .xml files to .pdf files takes place in two steps.
First, the DocBook XML source must be converted into an XSL Formatting Objects file (XSL-FO, with the .fo extension) via the DocBook XSL stylesheets.
The XSL-FO file is then translated into PDF via the Apache FOP library.
Only the .jar files from the FOP distribution are required.
Since you have already acquired the lib folder from the repository, you should be able to see the FOP JAR files located in the
Follow these steps to transform a DocBook XML source file into a PDF output.
Open the ant-build.xml file contained in the XML-DocBook directory of this project.
In the ant-build.xml file, note that the default target is build-html. At the end of the file, a new target name = "all" is defined that includes both build-html and build-pdf. A build-pdf value is introduced to set up the fop target. The FOP task is defined in the Ant build script before it is called in the build-pdf target. Editing the Ant configuration setting will modify the intended target.
Run ant-build.xml with the right attributes. Refresh your project through the Explorer window, then check the output directory. You should find the PDF and HTML files there.
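The structure described in these steps might look roughly like the following sketch. It is an approximation written for illustration, not the actual ant-build.xml from the repository. The target names build-html, build-pdf, and all and the lib/fop folder come from this tutorial; the project name, the file name sample.xml, and the output folder are assumptions. The sketch defines the fop task first, then chains the two conversion steps inside build-pdf.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Sketch only: the file names and output folder are assumptions. -->
<project name="docbook-pdf-sketch" default="build-html" basedir=".">

  <path id="fop.classpath">
    <fileset dir="lib/fop" includes="*.jar"/>
  </path>

  <!-- Define the fop task before any target calls it. -->
  <taskdef name="fop" classname="org.apache.fop.tools.anttasks.Fop"
           classpathref="fop.classpath"/>

  <target name="build-html">
    <xslt in="input/sample.xml" out="output/sample.html"
          style="docbook-xsl/html/docbook.xsl"/>
  </target>

  <target name="build-pdf">
    <!-- Step 1: DocBook XML to XSL-FO via the DocBook fo stylesheet. -->
    <xslt in="input/sample.xml" out="output/sample.fo"
          style="docbook-xsl/fo/docbook.xsl"/>
    <!-- Step 2: XSL-FO to PDF via Apache FOP. -->
    <fop format="application/pdf"
         fofile="output/sample.fo"
         outfile="output/sample.pdf"/>
  </target>

  <!-- Selecting this target builds both outputs. -->
  <target name="all" depends="build-html, build-pdf"/>
</project>
```

With a layout like this, choosing the all target in the Eclipse Ant launch configuration, instead of the default build-html, is what produces the HTML and PDF files in one run.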
For more information on DocBook XML, visit DocBook: The Definitive Guide.
A very useful tutorial on DocBook XML can be found at DocBook Tutorial.
Introduction to DocBook XSL - DocBook XSL
For an introduction to working with Eclipse Platform, see Eclipse Resources.
Additional material on Eclipse - Eclipse Tutorials