Chapter 3: Validating XML with the Document Type Definition (DTD)
In This Chapter
Document Type Definitions
Some Simple DTD Examples
Structure of a Document Type Definition
DTD Drawbacks and Alternatives
XML is a meta-markup language that is fully extensible. As long as it is well
formed, XML authors can create any XML structure they desire in order to
describe their data. However, an XML author cannot be sure that the structure he
poured so much time and effort into creating won't be changed by another
XML author or for that matter an application. There needs to be a way to ensure
that the XML structure cannot be changed at random. This type of assurance for
XML document structure is vital for e-commerce applications and
business-to-business processing, among other things. This is where the Document
Type Definition (DTD) steps in. A DTD provides a roadmap for describing and
documenting the structure that makes up an XML document. A DTD can be used to
determine the validity of an XML document.
In this chapter we will start with several examples and a brief overview of
the DTD and what it does. Then we will break down the different items that make
up the structure of the DTD. The coverage of the DTD structure will begin with a
discussion of the Document Type Declaration. Then we will move on to the
functional items that make up the DTD. The DTD includes element definitions,
attribute definitions, and entity definitions. Finally, before closing the chapter, we will
explore some of the drawbacks of DTDs and emerging alternatives for validation.
Now, let's start by defining the Document Type Definition.
Document Type Definitions
DTD stands for Document Type Definition. A Document Type Definition
allows the XML author to define a set of rules for an XML document to make it
valid. An XML document is considered "well formed" if that document is
syntactically correct according to the syntax rules of XML 1.0. However, that
does not mean the document is necessarily valid. In order to be considered
valid, an XML document must be validated, or verified, against a DTD. The DTD
will define the elements required by an XML document, the elements that are
optional, the number of times an element should (could) occur, and the order in
which elements should be nested. DTD markup also defines the type of data that
will occur in an XML element and the attributes that may be associated with
those elements. A document, even if well formed, is not considered valid if it
does not follow the rules defined in the DTD.
Note DTDs are part of the W3C's
XML 1.0 recommendation. This recommendation may be found at
http://www.w3.org/TR/REC-xml.
When an XML document is validated against a DTD by a validating XML parser,
the XML document will be checked to ensure that all required elements are
present and that no undeclared elements have been added. The hierarchical
structure of elements defined in the DTD must be maintained. The values of all
attributes will be checked to ensure that they fall within defined guidelines.
No undeclared attributes will be allowed and no required attributes may be
omitted. In short, every last detail of the XML document from top to bottom will
be defined and validated by the DTD.
Although validation is optional, if an XML author is publishing an XML
document for which maintaining the structure is vital, the author can reference
a DTD from the XML document and use a validating XML parser during processing.
Requiring that an XML document be validated against a DTD ensures the integrity
of the data structure. XML documents may be parsed and validated before they
are ever loaded by an application. That way, XML data that is not valid can be
flagged as "invalid" before it ever gets processed by the application
(thus saving a lot of the headaches that corrupt or incomplete data can
cause).
Imagine a scenario where data is being exchanged in an XML format between
multiple organizations. The integrity of business-to-business data is vital for
the smooth functioning of commerce. There needs to be a way to ensure that the
structure of the XML data does not change from organization to organization
(thus rendering the data corrupt and useless). A DTD can ensure this.
An extra advantage of using DTDs in this situation is that a single DTD could
be referenced by all of the organizations' applications. The defined structure
of the data would be in a centralized resource, which means that any changes to
the data structure definition would only need to be implemented in one place.
All the applications that referenced the DTD would automatically use the new,
updated structure.
A DTD can be internal, residing within the body of a single XML document. It
can also be external, referenced by the XML document. A single XML document
could even have both a portion (or subset) of its DTD that is internal and a
portion that is external. As mentioned in the previous paragraph, a single
external DTD can be referenced by many XML documents. Because an external DTD
may be referenced by many documents, it is a good repository for global types of
definitions (definitions that apply to all documents). An internal DTD is good
to use for rules that only apply to that specific document. If a document has
both internal and external DTD subsets, the internal subset is read first, so its
entity and attribute-list declarations take precedence over external declarations
of the same name. An element, however, may be declared only once across both
subsets.
Given this brief overview, you can quickly see why a DTD would be important
to applications that exchange data in an XML format. Before diving into the
actual coverage of the structure of DTDs, take a look at a couple of quick
examples. This will give you a better impression of what we are talking about as
we go forward.
Some Simple DTD Examples
Let's take a quick look at two DTDs: one internal and one external.
Listing 3.1 shows an internal DTD.
Listing 3.1 An Internal DTD
<?xml version="1.0"?>
<!DOCTYPE message [
<!ELEMENT message (#PCDATA)>
]>
<message>
Let the good times roll!
</message>
In Listing 3.1, the internal DTD is contained within the Document Type
Declaration, which begins with <!DOCTYPE and ends with
]>. The Document Type Declaration will appear between the XML
declaration and the start of the document itself (the document or root element)
and identify that section of the XML document as containing a Document Type
Definition. Following the DOCTYPE keyword, the root
element of the XML document is named (in this case, message). The DTD
tells us that this document will have a single element, message, that
will contain parsed character data (#PCDATA).
Note The Document Type Declaration
should not be confused with the Document Type Definition. These are two
distinct items. Also note that the acronym DTD is only ever used in
reference to the Document Type Definition. The Document Type Declaration is the
area of the XML document after the XML declaration that begins with
<!DOCTYPE and ends with ]>. It actually encompasses the
Document Type Definition. The Document Type Definition will be contained within
an opening bracket ([) and a closing bracket (]).
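To make the distinction concrete, here is Listing 3.1 again with comments marking where each piece begins and ends. This is only an annotated sketch of the same document; the comments are our own additions and are not required by XML.

```xml
<?xml version="1.0"?>
<!-- The Document Type Declaration starts on the next line... -->
<!DOCTYPE message [
  <!-- ...and everything between [ and ] is the Document Type Definition. -->
  <!ELEMENT message (#PCDATA)>
]>
<!-- The Document Type Declaration ended with ]> above. -->
<message>Let the good times roll!</message>
```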
Now, let's take a look at Listing 3.2 and see how this same DTD and XML
document would be joined if the DTD were external.
Listing 3.2 An External DTD
<?xml version="1.0"?>
<!DOCTYPE message SYSTEM "message.dtd">
<message>
Let the good times roll!
</message>
In Listing 3.2 the DTD is contained in a separate file, message.dtd.
The contents of message.dtd are assumed to be the same as the contents
of the DTD in Listing 3.1. The keyword SYSTEM in the Document Type
Declaration lets us know that the DTD is going to be found in a separate file. A
URL could have been used to define the location of the DTD. For example, rather
than message.dtd, the Document Type Declaration could have specified
something like ../DTD/message.dtd.
Note The keyword SYSTEM used
in a Document Type Declaration will always be indicative of the Document Type
Definition being contained in an external file.
Both of these examples show us a well-formed XML document. Additionally,
because both XML documents contain a single element, message, which
contains only parsed character data, both adhere to the DTD. Therefore, they are
both also valid XML documents.
A document that looks like what's shown in Listing 3.3 would not be
valid according to the DTD in these examples.
Listing 3.3 Document Not Valid According to Defined DTD
<?xml version="1.0"?>
<!DOCTYPE message SYSTEM "message.dtd">
<message>
<text>
Let the good times roll!
</text>
</message>
Even though this is a well-formed XML document, it is not valid. When this
document is validated against message.dtd, a flag will be raised
because message.dtd does not define an element named text.
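One way to make the document in Listing 3.3 valid is to change the DTD so that it declares the text element. The sketch below does this with an internal DTD so that it fits in one file; it is an illustration of the idea, not a revision of message.dtd that appears elsewhere in this chapter.

```xml
<?xml version="1.0"?>
<!DOCTYPE message [
  <!-- message must now contain exactly one text child -->
  <!ELEMENT message (text)>
  <!ELEMENT text (#PCDATA)>
]>
<message>
  <text>
    Let the good times roll!
  </text>
</message>
```

Note that under this revised DTD the documents in Listings 3.1 and 3.2 would no longer be valid, because their message element contains character data rather than a text element.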
Don't worry if you do not completely understand what is going on at this
point. As long as you get the gist, everything will become very clear in the
sections that follow.
Structure of a Document Type Definition
The structure of a DTD consists of a Document Type Declaration, elements,
attributes, entities, and several other minor keywords. We will take a look at
each of these topics, in that order. As we progress from topic to topic, we will
follow a mini case study about the use of XML to store employee records by the
Human Resources department of a fictitious company.
Our coverage of the DTD structure shall begin with the Document Type
Declaration.
The Document Type Declaration
In order to reference a DTD from an XML document, a Document Type Declaration
must be included in the XML document. Listings 3.1, 3.2, and 3.3 gave some
examples and brief explanations of using a Document Type Declaration to
reference a DTD. There may be at most one Document Type Declaration per XML document.
The syntax is as follows:
<!DOCTYPE rootelement SYSTEM | PUBLIC DTDlocation [ internalDTDelements ] >
The exclamation mark (!) is used to signify the beginning of the
declaration.
DOCTYPE is the keyword used to denote this as a Document Type
Declaration.
rootelement is the name of the root element or document element
of the XML document.
SYSTEM and PUBLIC are keywords used to designate that
the DTD is contained in an external document. Although the use of these keywords
is optional, to reference an external DTD you would have to use one or the
other. The SYSTEM keyword is used in tandem with a URL to locate the
DTD. The PUBLIC keyword supplies a public identifier, a well-known name that
an application may map to its own copy of the DTD; it is followed by a URL that
serves as a fallback location.
internalDTDelements are internal DTD declarations. These
declarations will always be placed within opening ([) and closing
(]) brackets.
Note This book typically uses the more
common SYSTEM keyword when referencing external DTDs.
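The two forms can be compared side by side. In the sketch below, the first declaration uses a made-up URL on example.com purely for illustration; the second is the Document Type Declaration used by XHTML 1.0 Strict documents, which shows a public identifier followed by a URL. These are two alternatives, not one document.

```xml
<!-- SYSTEM: one quoted URL that locates the DTD -->
<!DOCTYPE message SYSTEM "http://www.example.com/dtds/message.dtd">

<!-- PUBLIC: a public identifier, then a URL used as the fallback location -->
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
```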
It is possible for a Document Type Declaration to contain both an external
DTD subset and an internal DTD subset. In this situation, the internal
subset is read first, and its declarations take precedence for entities and
attribute lists: if both subsets declare the same entity or attribute, the
internal declaration is the one used. An element may be declared only once, so
the two subsets should not both declare the same element. Consider the Document
Type Declaration fragment shown in Listing 3.4.
Listing 3.4 Internal and External DTDs
<!DOCTYPE rootelement SYSTEM "http://www.myserver.com/mydtd.dtd"
[
<!ELEMENT element1 (element2,element3)>
<!ELEMENT element2 (#PCDATA)>
<!ELEMENT element3 (#PCDATA)>
]>
Here in Listing 3.4, we see that the Document Type Declaration references an
external DTD. There is also an internal subset of the DTD contained in the
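Document Type Declaration. Because an element may be declared only once, the
external DTD must not declare element1, element2, or element3 again. Entity and
attribute-list declarations in the internal subset would, however, take
precedence over external declarations of the same name.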
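Entity declarations are one case where the precedence between the two subsets is well defined. The following sketch assumes a hypothetical external file, memo.dtd, whose contents are shown in the comment; entities themselves are covered later, so treat this only as a preview of how the internal subset wins.

```xml
<?xml version="1.0"?>
<!-- Assumed contents of the hypothetical external subset, memo.dtd:
       <!ELEMENT memo (#PCDATA)>
       <!ENTITY company "Default Company Name">
-->
<!DOCTYPE memo SYSTEM "memo.dtd" [
  <!-- The internal subset is read first, so this declaration is the binding one. -->
  <!ENTITY company "Zippy Delivery Service">
]>
<memo>Welcome to &company;!</memo>
```

Because the first declaration of an entity is the one that counts, a parser that reads both subsets replaces &company; with the internal value, Zippy Delivery Service, and ignores the later external declaration.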
Note You will also notice in Listing
3.4 that the Document Type Declaration is spread out over several lines.
Whitespace is unimportant in Document Type Declarations as long as there is no
whitespace on either side of the ! symbol. Multiple lines are used for
clarity.
Now that you have seen how to reference a DTD from an XML document, we will
begin our coverage of the items that make up the declarations in DTDs.
DTD Elements
All elements in a valid XML document are defined with an element declaration
in the DTD. An element declaration defines the name and all allowed contents of
an element. Element names must start with a letter or an underscore and may
contain any combination of letters, numbers, underscores, dashes, and periods.
Element names should not start with the string "xml" in any combination of upper and lower case, because such names are reserved. Colons should
not be used in element names because they are normally used to reference
namespaces.
Each element in the DTD should be defined with the following syntax:
<!ELEMENT elementname rule >
ELEMENT is the keyword that specifies that this is an element
definition.
elementname is the name of the element.
rule is the definition to which the element's data content
must conform.
The order in which element declarations appear in a DTD does not affect
validation. What a validating XML parser checks is the rule in each
declaration: the child elements of an element must appear in the XML document in
the order given by their parent's rule. If the children of an element do not
match the order listed in that rule, the XML document
will not be considered valid by a validating parser.
Listing 3.5 demonstrates a DTD, contactlist.dtd, that defines the
ordering of elements for referencing XML documents.
Listing 3.5 contactlist.dtd
<!ELEMENT contactlist (fullname, address, phone, email) >
<!ELEMENT fullname (#PCDATA)>
<!ELEMENT address (addressline1, addressline2)>
<!ELEMENT addressline1 (#PCDATA)>
<!ELEMENT addressline2 (#PCDATA)>
<!ELEMENT phone (#PCDATA)>
<!ELEMENT email (#PCDATA)>
The first element in the DTD, contactlist, is the document element.
The rule for this element is that it contains (is the parent element of) the
fullname, address, phone, and email
elements. The rule for the fullname element, the phone
element, and the email element is that each contains parsed character
data (#PCDATA). This means that the elements will contain text that the XML
parser reads and checks for markup, but no child elements. The address element
has two child elements: addressline1 and addressline2. These
two children elements contain #PCDATA. This DTD defines an XML
structure that is nested two levels deep. The root element,
contactlist, has four child elements. The address element is,
in turn, parent to two more elements. In order for an XML document that
references this DTD to be valid, it must be laid out in the same order, and it
must have the same depth of nesting.
The XML document in Listing 3.6 is a valid document because it follows the
rules laid out in Listing 3.5 for contactlist.dtd.
Listing 3.6 contactlist.xml
<?xml version="1.0"?>
<!DOCTYPE contactlist SYSTEM "contactlist.dtd">
<contactlist>
<fullname>Bobby Soninlaw</fullname>
<address>
<addressline1>101 South Street</addressline1>
<addressline2>Apartment #2</addressline2>
</address>
<phone>(405) 555-1234</phone>
<email>bs@mail.com</email>
</contactlist>
The second line of this XML document is the Document Type Declaration that
references contactlist.dtd. This is a valid XML document because it is
well formed and complies with the structural definition laid out in the DTD.
Note In Listing 3.6, the element name
listed in the Document Type Declaration matches the name of the root element of
the XML document. If the element name listed in the Document Type Declaration
did not match the root element of the XML document, the XML document would
be deemed invalid by a validating XML parser.
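For contrast, here is a variation of Listing 3.6 that is well formed but not valid against contactlist.dtd. It is a sketch of our own: the only change is that the phone element has been moved ahead of address, which breaks the order given in the rule for contactlist.

```xml
<?xml version="1.0"?>
<!DOCTYPE contactlist SYSTEM "contactlist.dtd">
<contactlist>
  <fullname>Bobby Soninlaw</fullname>
  <!-- Not valid: the rule for contactlist requires address before phone -->
  <phone>(405) 555-1234</phone>
  <address>
    <addressline1>101 South Street</addressline1>
    <addressline2>Apartment #2</addressline2>
  </address>
  <email>bs@mail.com</email>
</contactlist>
```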
The element rules govern the types of data that may appear in an element.
DTD Element Rules
All data contained in an element must follow a set rule. As stated
previously, the rule is the definition to which the element's data content
must conform. There are two basic types of rules that elements must fall into.
The first type of rule deals with content. The second type of rule deals with
structure. First, we will look at element rules that deal with content.
Content Rules
The content rules for elements deal with the actual data that defined
elements may contain. These rules include the ANY rule, the
EMPTY rule, and the #PCDATA rule.
The ANY Rule
An element may be defined using the ANY rule. This rule is just
what it sounds like: The element may contain other declared elements and/or normal
character data (just about anything as long as it is well formed). An element
using the ANY rule would appear as follows:
<!ELEMENT elementname ANY>
The drawback to this rule is that it is so wide open that it defeats the
purpose of validation. A document checked against a DTD that defines all its
elements using the ANY rule will always be valid as long as the XML is well
formed and uses only declared elements. This
really precludes any effective validation. The XML fragments as shown in Listing
3.7 are all valid given the definition of elementname, provided that the child
elements they use are themselves declared.
Listing 3.7 XML Fragments Using the ANY Rule
<elementname>
This is valid content
</elementname>
<elementname>
<anotherelement>
This is more valid content
</anotherelement>
This is still valid content
</elementname>
<elementname>
<emptyelement />
<yetanotherelement>
This is still valid content!
</yetanotherelement>
Here is more valid content
</elementname>
You should see from this listing why it is not always a great idea to use the
ANY rule. All three fragments containing the element
elementname are valid. There is, in effect, no validation for this
element. Use of the ANY rule should probably be limited to instances
where the XML data will be freeform text or other types of data that will be
highly variable and have difficulty conforming to a set structure.
The EMPTY Rule
This rule is the exact opposite of the ANY rule. An element that is
defined with this rule will contain no data. However, an element with the
EMPTY rule could still contain attributes (more on attributes in a
bit). The following element is an example of the EMPTY rule:
<!ELEMENT elementname EMPTY>
This concept is seen a lot in HTML. There are many tags such as the break tag
(<br />) and the horizontal rule tag (<hr />) that follow
this rule. Neither one of these tags contains any data, but both are very
important in HTML documents. The best example of an empty tag used in HTML is
the image tag (<img>). Even though the image tag does not contain
any data, it does have attributes that describe the location and display of an
image for a Web browser.
In XML, the EMPTY rule might be used to define empty elements that
contain diagnostic information for the processing of data. Empty elements could
also be created to hold metadata describing the contents of the XML document for
indexing purposes. Empty elements could even be used to provide clues for
applications that will render the data for viewing (such as an empty
"gender" tag, which designates an XML record as "male" or
"female"; male records could be rendered in blue, and female
records could be rendered in pink).
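Here is a small, self-contained sketch of the EMPTY rule in use. The element names (report, title, body, and pagebreak) are invented for this illustration and are not part of any listing in this chapter.

```xml
<?xml version="1.0"?>
<!DOCTYPE report [
  <!ELEMENT report (title, pagebreak, body)>
  <!ELEMENT title (#PCDATA)>
  <!ELEMENT body (#PCDATA)>
  <!-- pagebreak may carry no text and no child elements -->
  <!ELEMENT pagebreak EMPTY>
]>
<report>
  <title>Quarterly Summary</title>
  <pagebreak/>
  <body>Deliveries rose this quarter.</body>
</report>
```

Writing the element as <pagebreak></pagebreak> with nothing between the tags is also valid. Placing any content between the tags, even a single space, would make the document invalid.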
The #PCDATA Rule
The #PCDATA rule indicates that parsed character data will be
contained in the element. Parsed character data is text that the XML parser
scans for markup: entity references in it are expanded, and characters such as
< and & must be escaped. An element declared with only #PCDATA may not contain
child elements. The following element demonstrates the #PCDATA rule:
<!ELEMENT elementname (#PCDATA)>
An element in an XML document that adheres to the #PCDATA rule might
appear as follows:
<data>
This is some parsed character data
</data>
It is possible in an element using the #PCDATA rule to use a
CDATA section to prevent the character data from being parsed. You can
see an example of this in Listing 3.8.
Listing 3.8 CDATA
<sample>
<data>
<![CDATA[<tag>This will not be parsed</tag>]]>
</data>
</sample>
All the data between <![CDATA[ and ]]> will be
passed through by the parser as normal characters; any markup inside it is not interpreted.
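The difference between parsed character data and a CDATA section is easiest to see when the text contains characters that XML treats as markup. The following sketch, which reuses the sample and data element names from Listing 3.8 with a DTD of our own, stores the same text both ways.

```xml
<?xml version="1.0"?>
<!DOCTYPE sample [
  <!ELEMENT sample (data, data)>
  <!ELEMENT data (#PCDATA)>
]>
<sample>
  <!-- Parsed: the references are replaced, giving the text  Fish & chips < $5 -->
  <data>Fish &amp; chips &lt; $5</data>
  <!-- CDATA section: the same characters can be typed literally -->
  <data><![CDATA[Fish & chips < $5]]></data>
</sample>
```

An application reading this document receives the same character data from both data elements; the only difference is how the author had to type it.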
Structure Rules
Whereas the content rules deal with the actual content of the data contained
in defined elements, structure rules deal with how that data may be organized.
There are two types of structure rules we will look at here. The first is the
"element only" rule. The second rule is the "mixed"
rule.
The "Element Only" Rule
The "element only" rule specifies that only elements may appear as
children of the current element. The child element sequences should be separated
by commas and listed in the order they should appear. If there are to be options
for which elements will appear, the listed elements should be separated by the
pipe symbol (|). The following element definition demonstrates the
"element only" rule:
<!ELEMENT elementname (element1, element2, element3)>
You can see here that a list of elements is expected to appear as child
elements of elementname when the referencing XML document is parsed.
All these child elements must be present and in the specified order. Here is how
an element that is listing a series of options will appear:
<!ELEMENT elementname (element1 | element2)>
The element defined here will have a single child element: either
element1 or element2.
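Sequences and options can also be combined in a single rule by nesting a second set of parentheses. The following is a minimal sketch using element names we made up for the purpose (order, item, quantity, cash, and card): an order must contain an item, then a quantity, and then exactly one of the two payment elements.

```xml
<?xml version="1.0"?>
<!DOCTYPE order [
  <!-- A sequence whose last item is a choice between cash and card -->
  <!ELEMENT order (item, quantity, (cash | card))>
  <!ELEMENT item (#PCDATA)>
  <!ELEMENT quantity (#PCDATA)>
  <!ELEMENT cash EMPTY>
  <!ELEMENT card (#PCDATA)>
]>
<order>
  <item>Packing tape</item>
  <quantity>3</quantity>
  <card>Company account</card>
</order>
```

Replacing the card element with <cash/> would also be valid. Including both payment elements, or neither, would not.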
The "Mixed" Rule
The "mixed" rule is used to help define elements that may have both
character data (#PCDATA) and child elements in the data they contain. The
rule is a list of options enclosed by parentheses and separated by the pipe
symbol (|): #PCDATA must come first, followed by the names of the allowed
child elements, and the group must end with an asterisk (*). Sequential lists
separated by commas cannot be combined with #PCDATA. The following element is
an example of the
"mixed" rule:
<!ELEMENT elementname (#PCDATA | childelement1 | childelement2)*>
In this example, the element may contain a mixture of character data and
child elements. The pipe symbol is used here to indicate that there is a choice
between #PCDATA and each of the child elements. However, the asterisk
symbol (*) is added here to indicate that each of the items within the
parentheses may appear zero or more times (we will cover the use of element
symbols in the next section). This can be useful for describing data sets that
have optional values.
Note The asterisk symbol used in these
examples indicates that an item may occur zero or more times. Element symbols
are covered in detail in Table 3.1.
Consider the following element definition:
<!ELEMENT Son (#PCDATA | Name | Age)*>
This definition defines an element, Son, for which there may be
character data, elements, or both. A man might have a son, but he might not. If
there is no son, then normal character data (such as "N/A") could be
used to describe this condition. Alternatively, the man might have an adopted
son and would like to indicate this. Consider the XML fragments shown in Listing
3.9 in relation to the definition for the element Son.
Listing 3.9 The "Mixed" Rule
<Son>
N/A
</Son>
<Son>
Adopted Son
<Name>Bobby</Name>
<Age>12</Age>
</Son>
The first fragment contains only character data. The second fragment contains
a mixture of character data and the two defined child elements. Both fragments
conform to the definition and are valid.
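Mixed content is most often seen in running text, where a few words inside a sentence are marked up. The sketch below uses invented element names (note, emph, and term) to show a complete document that follows the "mixed" rule.

```xml
<?xml version="1.0"?>
<!DOCTYPE note [
  <!-- #PCDATA comes first, options are separated by |, and the group ends with * -->
  <!ELEMENT note (#PCDATA | emph | term)*>
  <!ELEMENT emph (#PCDATA)>
  <!ELEMENT term (#PCDATA)>
]>
<note>Deliver <emph>before noon</emph>; the <term>manifest</term> is attached.</note>
```

Because of the asterisk, the child elements may appear any number of times and in any order, interleaved with text.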
Element Symbols
In addition to the normal rules that apply to element definitions, element
symbols can be used to control the occurrence of data. Table 3.1 shows the
symbols that are available for use in DTDs.
Table 3.1 Element Symbols
Symbol: Asterisk (*)
Definition: The data will appear zero or more times (0, 1, 2, ...). Here's an
example: <!ELEMENT children (name*)> In this example, the element
children could have zero or more occurrences of the child element name. This
type of rule would be useful on a form asking a person about his or her
children. It is possible that the person could have no children or many children.
Symbol: Comma (,)
Definition: Provides separation of elements in a sequence. Here's an example:
<!ELEMENT address (street, city, state, zip)> In this example,
the element address will have four child elements: street, city, state, and zip.
Each of the child elements must appear in the defined order in the XML
document.
Symbol: Parentheses ( )
Definition: The parentheses are used to contain the rule for an element. Parentheses may
also be used to group a sequence, subsequence, or a set of alternatives in a
rule. Here's an example: <!ELEMENT address (street, city, (state |
province), zip)> In this example, the parentheses enclose a sequence.
Additionally, a subsequence is nested within the sequence by a second set of
parentheses. The subsequence indicates that there will be either a state or a
province element in that spot in the main sequence.
Symbol: Pipe (|)
Definition: Separates choices in a set of options. Here's an example:
<!ELEMENT dessert (cake | pie)> The element dessert will have one
child element: either cake or pie.
Symbol: Plus sign (+)
Definition: Signifies that the data must appear one or more times (1, 2, 3, ...).
Here's an example: <!ELEMENT appliances (refrigerator+)> The
appliances element will have one or more refrigerator child elements. This
assumes that every household has at least one refrigerator.
Symbol: Question mark (?)
Definition: Data will appear either zero times or one time in the element. Here's an
example: <!ELEMENT employment (company?)> The element employment
will have either zero occurrences or one occurrence of the child element
company.
Symbol: No symbol
Definition: When no symbol is used (other than parentheses), this signifies that the data
must appear once in the XML file. Here's an example:
<!ELEMENT contact (name)> The element
contact will have one child element: name.
Element symbols can be added to element definitions for another
level of control over the XML documents that are being validated against the DTD.
Consider the DTD in Listing 3.10, which makes very limited use of element
symbols.
Listing 3.10 Limited Use of Symbols
<!ELEMENT contactlist (contact) >
<!ELEMENT contact (name, age, sex, address, city, state, zip, children) >
<!ELEMENT name (#PCDATA) >
<!ELEMENT age (#PCDATA) >
<!ELEMENT sex (#PCDATA) >
<!ELEMENT address (#PCDATA) >
<!ELEMENT city (#PCDATA) >
<!ELEMENT state (#PCDATA) >
<!ELEMENT zip (#PCDATA) >
<!ELEMENT children (child) >
<!ELEMENT child (childname, childage, childsex) >
<!ELEMENT childname (#PCDATA) >
<!ELEMENT childage (#PCDATA) >
<!ELEMENT childsex (#PCDATA) >
You can see in Listing 3.10 that a contact record for a contactlist file is
being laid out. It is very straightforward and includes the basic address
information you would expect to see in this type of file. Information on the
contact's children is also included. This looks like a well-laid-out,
easy-to-use file format. However, there are several problems. What if you are
not sure about a contact's address? What if the contact does not have
children? What if you do not know a contact's age? The way
that this DTD is laid out, it will be very difficult for a referencing XML
document to be deemed valid if any of this information is unknown.
Using element symbols, you can create a more flexible DTD that will take into
account the possibility that you might not always know all of a contact's
personal information. Take a look at a similar DTD laid out in Listing 3.11.
Listing 3.11 Broader Use of Symbols
<!ELEMENT contactlist (contact+) >
<!ELEMENT contact (name, age?, sex, address?, city?, state?, zip?, children?) >
<!ELEMENT name (#PCDATA) >
<!ELEMENT age (#PCDATA) >
<!ELEMENT sex (#PCDATA) >
<!ELEMENT address (#PCDATA) >
<!ELEMENT city (#PCDATA) >
<!ELEMENT state (#PCDATA) >
<!ELEMENT zip (#PCDATA) >
<!ELEMENT children (child*) >
<!ELEMENT child (childname, childage?, childsex) >
<!ELEMENT childname (#PCDATA) >
<!ELEMENT childage (#PCDATA) >
<!ELEMENT childsex (#PCDATA) >
Listing 3.11 is much more flexible than Listing 3.10. There is still a single
root element, contactlist, which will contain one or more instances
(+) of the element contact. Under each contact
element is a list of child elements that make up the description of the contact
record. It is assumed here that the name and sex of the contact will be known.
However, the definition indicates that there will be zero or one occurrence
(?) of the age, address, city,
state, zip, and children elements. These elements are
set for zero or one occurrence because the definition is taking into account
that this information might not be known. Looking further down the listing, you
see that the children element is marked to have zero or more instances
(*) of the child element. This is because a person might have
no children or many children (or we might not know how many children the person
has). Under the child element, it is assumed that childname
and childsex information will be known (if there is at least one
child element). However, the childage element is marked as
zero or one (?), just in case it is unknown how old the child is.
You can easily see how Listing 3.11 is more flexible than Listing 3.10.
Listing 3.11 takes into account that much of the contact data could be missing
or unknown. An XML document being validated against the DTD in Listing 3.11
could still be validated and accepted by a validating parser even though it
might not have all the contact's personal data. However, an XML document
being validated against the DTD in Listing 3.10 would be rejected as invalid if
it did not include the children element.
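To see the difference in practice, consider the following document. It is a sketch that assumes the DTD from Listing 3.11 has been saved under the hypothetical file name contactlist2.dtd; the names and values are invented.

```xml
<?xml version="1.0"?>
<!DOCTYPE contactlist SYSTEM "contactlist2.dtd">
<contactlist>
  <!-- Only the two required elements, name and sex, are known -->
  <contact>
    <name>Pat Example</name>
    <sex>Female</sex>
  </contact>
  <!-- Optional elements that are present must still follow the defined order -->
  <contact>
    <name>Lee Example</name>
    <age>41</age>
    <sex>Male</sex>
    <city>Tulsa</city>
    <children/>
  </contact>
</contactlist>
```

This document is valid against the DTD in Listing 3.11. Against the DTD in Listing 3.10 it would fail on several counts: there are two contact elements where exactly one is allowed, and both records leave out required elements.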
Now that you have seen how DTDs define element declarations, let's take
a look at how these declarations are put to use in a mini case study.
Zippy Human Resources: XML for Employee Records, Part I
Now that you have seen how elements are defined in a DTD, you have enough
tools to follow along with a mini case study that shows how a company could
use XML in its Human Resources department.
The Human Resources department for a small but growing company, Zippy
Delivery Service, has decided that in order to make their employee data flexible
across all the applications used by the company, the employee data should be
stored in XML. The Zippy Human Resources department's first task is to
decide on the fields to be included in the XML structure:
Employee Name
Position
Age
Sex
Race
Marital Status
Address Line 1
Address Line 2
City
State
Zip Code
Phone Number
E-Mail Address
After determining which elements are needed, they decide to put together a DTD
in order to ensure that the structure of the employee records in the XML data
file never changes. Additionally, the decision is made that multiple employee
records should be stored in a single file. Because this is the case, they need
to declare a document (root) element to hold employee records and a parent
element for the elements making up each individual employee record. The Human
Resources department also realizes that some of the data might not be applicable
to all employees. Therefore, they need to use element symbols to account for
varying occurrences of data. They've come up with the following DTD
structure as the first draft:
employees1.dtd
<!ELEMENT employees (employee+) >
<!ELEMENT employee (name, position, age, sex,
race, m_status, address1,
address2?, city, state, zip, phone?, email?) >
<!ELEMENT name (#PCDATA) >
<!ELEMENT position (#PCDATA) >
<!ELEMENT age (#PCDATA) >
<!ELEMENT sex (#PCDATA) >
<!ELEMENT race (#PCDATA) >
<!ELEMENT m_status (#PCDATA) >
<!ELEMENT address1 (#PCDATA) >
<!ELEMENT address2 (#PCDATA) >
<!ELEMENT city (#PCDATA) >
<!ELEMENT state (#PCDATA) >
<!ELEMENT zip (#PCDATA) >
<!ELEMENT phone (#PCDATA) >
<!ELEMENT email (#PCDATA) >
The Human Resources department has decided that the document
element employees is required to have one or more (+) child
elements (employee). The employee element would be the
container element for each individual employee's data. Out of the elements
comprising the employee data, the Human Resources department knows that not all
employees have a second line to their street address. Also, some employees do
not have home telephone numbers or e-mail addresses. Therefore, the elements
address2, phone, and email are marked to appear zero
or one time in each record (?). The new DTD structure is saved in a
file named employees1.dtd (which, by the way, you can download from the
Sams Web site).
The first several employee records are then entered into an XML document,
called Employees1.xml:
<?xml version="1.0"?>
<!DOCTYPE employees SYSTEM "employees1.dtd">
<employees>
<employee>
<name>Bob Jones</name>
<position>Dispatcher</position>
<age>37</age>
<sex>Male</sex>
<race>African American</race>
<m_status>Married</m_status>
<address1>202 Carolina St.</address1>
<city>Oklahoma City</city>
<state>OK</state>
<zip>73114</zip>
<phone>4055554321</phone>
<email>bobjones@mail.com</email>
</employee>
<employee>
<name>Mary Parks</name>
<position>Delivery Person</position>
<age>19</age>
<sex>Female</sex>
<race>Caucasian</race>
<m_status>Single</m_status>
<address1>1015 Empire Blvd.</address1>
<address2>Apt. D3</address2>
<city>Oklahoma City</city>
<state>OK</state>
<zip>73107</zip>
<phone>4055559876</phone>
<email>maryparks@mail.com</email>
</employee>
<employee>
<name>Jimmy Griffin</name>
<position>Delivery Person</position>
<age>23</age>
<sex>Male</sex>
<race>African American</race>
<m_status>Single</m_status>
<address1>1720 Maple St.</address1>
<city>Oklahoma City</city>
<state>OK</state>
<zip>73107</zip>
<phone>4055556633</phone>
</employee>
</employees>
The XML document Employees1.xml (also available for
download from the Sams Web site) initially has three employee records entered
into it. The Document Type Declaration is entered after the XML declaration and
before the document element, employees, and it uses the SYSTEM
keyword to denote that it is referencing the DTD, employees1.dtd,
externally.
The Human Resources department at Zippy Delivery Service feels that they are
off to a good start. They have defined a DTD, employees1.dtd, for their
XML data structure and have created an XML document, Employees1.xml
(containing three employee records), that is valid according to the DTD.
However, you'll find out during the course of this chapter that the Human
Resources department's DTD can be improved.
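To make concrete what the DTD enforces, here is a sketch of a record that a validating parser should reject when checked against employees1.dtd. The employee shown and the file name EmployeesInvalid.xml are hypothetical and are not part of the case study; the comments mark the two violations.

```xml
<?xml version="1.0"?>
<!-- EmployeesInvalid.xml: hypothetical file, well formed but NOT valid -->
<!DOCTYPE employees SYSTEM "employees1.dtd">
<employees>
<employee>
<name>Pat Doe</name>
<position>Delivery Person</position>
<age>41</age>
<sex>Female</sex>
<race>Caucasian</race>
<!-- Violation 1: the required m_status element is missing -->
<address1>77 Elm St.</address1>
<city>Oklahoma City</city>
<state>OK</state>
<!-- Violation 2: phone appears before zip, but the content model fixes the order -->
<phone>4055550000</phone>
<zip>73107</zip>
</employee>
</employees>
```

The document is still well formed, so a non-validating parser would accept it. Only validation against the DTD catches the missing element and the wrong order.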
DTD Attributes
So far you have seen that it is possible to use intricate combinations of
elements and symbols to create complex element definitions. Now let's take
a look at how XML attribute definitions can be added into this mix.
XML attributes are name/value pairs that are used as metadata to describe XML
elements. XML attributes are very similar to HTML attributes. In HTML,
src is an attribute of the img tag, as shown in the following
example:
<img src="images/imagename.gif" width="10" height="20">
In this example, width and height are also attributes of
the img tag. This is very similar to the markup in Listing 3.12, which
demonstrates how an image element might be structured in XML.
Listing 3.12 Attribute Use in XML
<image src="images/" width="10" height="20">
imagename.gif
</image>
In Listing 3.12, src, width, and height are
presented as attributes of the XML element image. This is very similar
to the way that these attributes are used in HTML. The only difference is that
the src attribute merely contains the relative path of the image's
directory and not the actual name of the image file.
In Listing 3.12, the attributes width, height, and
src are used as metadata to describe certain aspects of the content of
the image element. This is consistent with the normal use of
attributes. Attributes can also be used to provide additional information to
further identify or index an element or even give formatting information.
Attributes are also defined in DTDs. Attribute definitions are declared using
the ATTLIST declaration. An ATTLIST declaration will define
one or more attributes for the element that it is referencing.
Note Attribute definitions do not
follow the same "top-down" rule that element definitions do. However,
it is still a good coding practice to list the attributes in the order you would
like them to appear in the XML document. Usually this means listing the
attributes directly after the element to which they refer.
Attribute list declarations in a DTD will have the following syntax:
<!ATTLIST elementname attributename type defaultbehavior defaultvalue>
ATTLIST is the tag name that specifies that this definition will
be for an attribute list.
elementname is the name of the element that the attribute will
be attached to.
attributename is the actual name of the attribute.
type indicates which of the 10 valid kinds of attributes this
attribute definition will be.
defaultbehavior dictates whether the attribute will be required,
optional, or fixed in value. This setting determines how a validating parser
should relate to this attribute.
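defaultvalue is the value of the attribute if no value is
explicitly set. It is supplied either on its own, in place of a defaultbehavior
keyword, or after #FIXED; it is never combined with #REQUIRED or #IMPLIED.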
Take a look at Listing 3.13 for an example of how this declaration may be
used.
Listing 3.13 ATTLIST Declaration
<!ATTLIST name
sex CDATA #REQUIRED
age CDATA #IMPLIED
race CDATA #IMPLIED >
In Listing 3.13, an attribute list is declared. The name element is
being referenced by the declaration. Three attributes are defined: sex,
age, and race. The three attributes are character data
(CDATA). Only one of the attributes, sex, is required
(#REQUIRED). The other two attributes, age and race,
are optional (#IMPLIED). An XML element using the attribute list
declared here would appear as follows:
<name sex="male" age="30" race="Caucasian">Michael Qualls</name>
The name element contains the value "Michael Qualls". It
also carries three attributes that describe Michael Qualls: sex, age, and race.
The attributes in Listing 3.13 are all character data (CDATA). However,
attributes actually have 10 possible data types.
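For reference, the attribute list from Listing 3.13 can be combined with an element declaration in an internal DTD to form a complete document. This is only a minimal sketch; using name as the document element is an assumption made to keep the example short.

```xml
<?xml version="1.0"?>
<!DOCTYPE name [
<!ELEMENT name (#PCDATA) >
<!ATTLIST name
  sex  CDATA #REQUIRED
  age  CDATA #IMPLIED
  race CDATA #IMPLIED >
]>
<!-- sex is required; age and race are optional and omitted here -->
<name sex="male">Michael Qualls</name>
```

Removing the sex attribute from the name element would leave the document well formed but no longer valid.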
Attribute Types
Before going over a more detailed example of using attributes in your DTDs,
let's first review Table 3.2, which presents the 10 valid types of
attributes that may be used in a DTD. Then we will look at Table 3.3, which
shows the default values for attributes.
Table 3.2 Attribute Types
| Type | Definition |
| CDATA | Character data only. The attribute will contain no markup. Here's an example: <!ATTLIST box height CDATA "0"> In this example, an attribute, height, has been defined for the element box. This attribute will contain character data and have a default value of "0". |
| ENTITY | The name of an unparsed general entity that is declared in the DTD but refers to some external data (such as an image file). Here's an example: <!ATTLIST img src ENTITY #REQUIRED> The src attribute is an ENTITY type that refers to some external image file. |
| ENTITIES | This is the same as the ENTITY type but represents multiple values listed in sequential order, separated by whitespace. Here's an example: <!ATTLIST imgs srcs ENTITIES #REQUIRED> The value of the srcs attribute on an imgs element would be something like img1 img2 img3. This is a list of names of unparsed entities, separated by whitespace, each of which is declared in the DTD and refers to an image file. |
| ID | An attribute that uniquely identifies the element. The value for this type of attribute must be unique within the XML document. Each element may only have a single ID attribute, and the value of the ID attribute must be a valid XML name, meaning that it may not start with a numeric digit (which precludes the use of a simple numbering system for IDs). Here's an example: <!ATTLIST cog serial ID #REQUIRED> Each cog element in the XML document will have a required attribute, serial, that uniquely identifies it. |
| IDREF | This is the value of an ID attribute of another element in the document. It's used to establish a relationship with other tags when there is not necessarily a parent/child relationship. Here's an example: <!ATTLIST person cousin IDREF #IMPLIED> Each person element could have a cousin attribute that references the value of the ID attribute of another element. |
| IDREFS | This is the same as IDREF; however, it represents multiple values listed in sequential order, separated by whitespace. Here's an example: <!ATTLIST person cousins IDREFS #IMPLIED> Each person element could have a cousins attribute that contains references to the values of multiple ID attributes of other elements. |
| NMTOKEN | Restricts the value of the attribute to a valid XML name token: a string made up only of name characters (letters, digits, periods, hyphens, underscores, and colons). Unlike an XML name, a name token may begin with a digit. Here's an example: <!ATTLIST address country NMTOKEN "usa"> Each address element will have a country attribute with a default value of "usa". |
| NMTOKENS | This is the same as NMTOKEN; however, it represents multiple values listed in sequential order, separated by whitespace. Here's an example: <!ATTLIST region states NMTOKENS "KS OK" > Each region element will have a states attribute with a default value of "KS OK". |
| NOTATION | This type refers to the name of a notation declared in the DTD (more on notations later). It is used to identify the format of non-XML data. An example would be using the NOTATION type to refer to an external application that will interact with the document. Here's an example: <!ATTLIST music play NOTATION (mplayer2.exe) "mplayer2.exe"> In this example, the element music has an attribute, play, that will hold the name of a notation that determines the type of music player to use. The parenthesized list names the notations the attribute may refer to, and the default value (notation) is "mplayer2.exe". |
| Enumerated | This type is not an actual keyword the way the other types are. It is actually a listing of possible values for the attribute separated by pipe symbols (|). Here's an example: <!ATTLIST college grad (1|0) "1"> The element college has an attribute, grad, that will have a value of either "1" or "0" (with the default value being "1"). |
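The ID, IDREF, and IDREFS types are easiest to understand when seen working together. The following is a minimal sketch; the family and person elements and the pid attribute name are assumptions made for illustration only.

```xml
<?xml version="1.0"?>
<!DOCTYPE family [
<!ELEMENT family (person+) >
<!ELEMENT person (#PCDATA) >
<!ATTLIST person
  pid     ID     #REQUIRED
  cousin  IDREF  #IMPLIED
  cousins IDREFS #IMPLIED >
]>
<family>
<!-- pid values are XML names, so they start with a letter, not a digit -->
<person pid="p1" cousins="p2 p3">Ann</person>
<person pid="p2" cousin="p1">Ben</person>
<person pid="p3" cousin="p1">Cal</person>
</family>
```

Every value used in cousin or cousins must match the pid of some person in the same document. A reference such as cousin="p9" would make this document invalid, because no element carries that ID.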
You saw during the coverage of the 10 valid attribute types
that we used two preset default behavior settings: #REQUIRED and
#IMPLIED. There are four different default types that may be used in an
attribute definition, as detailed in Table 3.3.
Table 3.3 Default Value Types
| Type | Definition |
| #REQUIRED | Indicates that the value of the attribute must be specified. Here's an example: <!ATTLIST season year CDATA #REQUIRED > In this example, the element season has a character data attribute, year, that is required. |
| #IMPLIED | Indicates that the value of the attribute is optional. Here's an example: <!ATTLIST field size CDATA #IMPLIED > In this case, each field element may have a size attribute, but it is not required. |
| #FIXED | Indicates that the attribute is optional, but if it is present, it must have a specified set value that cannot be changed. Here's an example: <!ATTLIST bcc hidden CDATA #FIXED "true" > Each bcc element has an attribute, hidden, that has a fixed value of "true". |
| Default | This is not an actual default behavior type. The value of the default is supplied in the DTD. Here's an example: <!ATTLIST children number CDATA "0"> This represents that the children element has a number attribute with a default value of "0". |
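The four default behaviors can be seen side by side in one small document. This is a sketch only; the message element and its attribute names are hypothetical.

```xml
<?xml version="1.0"?>
<!DOCTYPE message [
<!ELEMENT message (#PCDATA) >
<!ATTLIST message
  year     CDATA #REQUIRED
  size     CDATA #IMPLIED
  hidden   CDATA #FIXED "true"
  priority (low|normal|high) "normal" >
]>
<!-- As written, only the required year attribute is supplied -->
<message year="2000">Status report</message>
```

A parser that reads these declarations hands the message element to the application as though hidden="true" and priority="normal" had been typed in, while size is simply absent. Writing hidden="false" in the document would be a validity error, because a #FIXED attribute may only carry its declared value.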
So far you have element (ELEMENT) declarations and
attribute (ATTLIST) declarations under your belt. You have seen that
you can create some very complex hierarchical structures using elements and
attributes. Next, we will take a look at a way to save some time when building
DTDs. DTD entities offer a way to store repetitive or large chunks of data for
quick reference. First, however, we are going to revisit our mini case
study.
Zippy Human Resources: XML for Employee Records, Part II
This is the second part of our mini case study on the use of XML in the
Human Resources department at Zippy Delivery Service. You saw in Part I
that the Human Resources department was able to put together a DTD
(Employees1.dtd) and an XML document with some employee records (Employees1.xml).
The DTD was referenced from the XML file for purposes of validation.
Upon review of their DTD, the members of the Human Resources department have
decided that they are not quite satisfied. They feel that they have two types of
information about each employee: personal information and contact information.
They've decided that the personal information would be better stored as
attributes of the employee name element rather than as individual
elements. Additionally, they've decided that they need an ID type of
attribute for each employee element in order to be able to quickly
search the XML document. The DTD, therefore, has been amended as follows (you
can download the DTD Employees2.dtd from the Sams Web site):
<!ELEMENT employees (employee+) >
<!ELEMENT employee (name, position,
address1, address2?, city, state,
zip, phone?, email?) >
<!ATTLIST employee serial ID #REQUIRED >
<!ELEMENT name (#PCDATA) >
<!ATTLIST name
age CDATA #REQUIRED
sex CDATA #REQUIRED
race CDATA #IMPLIED
m_status CDATA #REQUIRED >
<!ELEMENT position (#PCDATA) >
<!ELEMENT address1 (#PCDATA) >
<!ELEMENT address2 (#PCDATA) >
<!ELEMENT city (#PCDATA) >
<!ELEMENT state (#PCDATA) >
<!ELEMENT zip (#PCDATA) >
<!ELEMENT phone (#PCDATA) >
<!ELEMENT email (#PCDATA) >
You can see that a new ID attribute, serial, has been
added for the employee element. The serial attribute is marked
as required (#REQUIRED). The age, sex, race,
and m_status elements have been removed and changed to attributes of
the name element. Each of these attributes is character data
(CDATA). Also, the race attribute has been deemed optional
(#IMPLIED).
The XML document has also been updated to reflect the new requirements of the
changed DTD (you can download XML document Employees2.xml from the Sams
Web site):
<?xml version="1.0"?>
<!DOCTYPE employees SYSTEM "employees2.dtd">
<employees>
<employee serial="emp1">
<name age="37" sex="Male" race="African American" m_status="Married">
Bob Jones
</name>
<position>Dispatcher</position>
<address1>202 Carolina St.</address1>
<city>Oklahoma City</city>
<state>OK</state>
<zip>73114</zip>
<phone>4055554321</phone>
<email>bobjones@mail.com</email>
</employee>
<employee serial="emp2">
<name age="19" sex="Female" race="Caucasian" m_status="Single">
Mary Parks
</name>
<position>Delivery Person</position>
<address1>1015 Empire Blvd.</address1>
<address2>Apt. D3</address2>
<city>Oklahoma City</city>
<state>OK</state>
<zip>73107</zip>
<phone>4055559876</phone>
<email>maryparks@mail.com</email>
</employee>
<employee serial="emp3">
<name age="23" sex="Male" race="African American" m_status="Single">
Jimmy Griffin
</name>
<position>Delivery Person</position>
<address1>1720 Maple St.</address1>
<city>Oklahoma City</city>
<state>OK</state>
<zip>73107</zip>
<phone>4055556633</phone>
</employee>
</employees>
In order for the XML document to remain valid according to the
DTD, a serial attribute has been added for each employee
element. Each serial attribute is set to a unique value. The
age, sex, race, and m_status elements have
been removed and added as attributes of the name element.
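It can help to see what the new attribute rules rule out. The fragment below is a hypothetical sketch, not part of the case study files: it is well formed, but a validating parser should reject it against employees2.dtd for the two reasons marked in the comments.

```xml
<?xml version="1.0"?>
<!-- Hypothetical document: well formed, but not valid against employees2.dtd -->
<!DOCTYPE employees SYSTEM "employees2.dtd">
<employees>
<employee serial="emp1">
<name age="37" sex="Male" m_status="Married">Bob Jones</name>
<position>Dispatcher</position>
<address1>202 Carolina St.</address1>
<city>Oklahoma City</city>
<state>OK</state>
<zip>73114</zip>
</employee>
<!-- Violation 1: serial="emp1" repeats an ID value that is already in use -->
<employee serial="emp1">
<!-- Violation 2: the required age attribute is missing from name -->
<name sex="Female" race="Caucasian" m_status="Single">Mary Parks</name>
<position>Delivery Person</position>
<address1>1015 Empire Blvd.</address1>
<city>Oklahoma City</city>
<state>OK</state>
<zip>73107</zip>
</employee>
</employees>
```

Note that the first record omits race without any problem, because that attribute is declared as #IMPLIED.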
The Zippy Human Resources department now feels that they are getting pretty
close to having the DTD and XML structure they need in order to have an
effective solution for storing their employee records. However, as you'll
see in Part III, there is still a bit more tweaking that can be done with the
addition of entities.
DTD Entities
Entities in DTDs are storage units. They can also be considered placeholders.
Entities are special markups that contain content for insertion into the XML
document. Usually this will be some type of information that is bulky or
repetitive. Entities make this type of information more easily handled because
the DTD author can use them to indicate where the information should be inserted
in the XML document. This is much better than having to retype the same
information over and over.
An entity's content could be well-formed XML, normal text, binary data,
a database record, and so on. The main purpose of an entity is to hold content,
and there is virtually no limit on the type of content an entity can hold.
The general syntax of an entity is as follows:
<!ENTITY entityname [SYSTEM | PUBLIC] entitycontent>
ENTITY is the tag name that specifies that this definition will
be for an entity.
entityname is the name by which the entity will be referred in
the XML document.
entitycontent is the actual contents of the entity, that is, the data
for which the entity is serving as a placeholder.
SYSTEM and PUBLIC are optional keywords. Either one can
be added to the definition of an entity to indicate that the entity refers to
external content.
Note The keyword SYSTEM or
PUBLIC used in an entity declaration will always be indicative of the
contents of the entity being contained in an external file. Think of this as
something like a pointer in C and C++. The entity is used as a reference to an
external source of data.
Note Entity declarations do not follow
the same "top-down" rule that element definitions do. They may be
listed anywhere in the body of the DTD. However, it is good practice to list
them first in the DTD as they may be referenced later in the document.
Entities may either point to internal data or external data. Internal
entities represent data that is contained completely within the DTD. External
entities point to content in another location via a URL. External data could be
anything from normal parsed text in another file, to a graphics or audio file,
to an Excel spreadsheet. The type of data to which an external entity can refer
is virtually unlimited.
An entity is referenced in an XML document by inserting the name of the
entity prefixed by & and suffixed by ;. When referenced in
this manner, the content of the entity will be placed into the XML document when
the document is parsed and validated. Let's take a look at an example of
how this works (see Listing 3.14).
Listing 3.14 Using Internal Entities
<?xml version="1.0"?>
<!DOCTYPE library [
<!ENTITY cpy "Copyright 2000">
<!ELEMENT library (book+)>
<!ELEMENT book (title,author,copyright)>
<!ELEMENT title (#PCDATA)>
<!ELEMENT author (#PCDATA)>
<!ELEMENT copyright (#PCDATA)>
]>
<library>
<book>
<title>How to Win Friends</title>
<author>Joe Charisma</author>
<copyright>&cpy;</copyright>
</book>
<book>
<title>Make Money Fast</title>
<author>Jimmy QuickBuck</author>
<copyright>&cpy;</copyright>
</book>
</library>
Listing 3.14 uses an internal DTD. In the DTD, an entity called cpy
is declared that contains the content "Copyright 2000". In the
copyright element of the XML document, this entity is referenced by
using &cpy;. When this document is parsed, &cpy; will
be replaced with "Copyright 2000" in each instance in which it is
used. Using the entity &cpy; saves the XML document author from
having to type in "Copyright 2000" over and over. This is a fairly
simple example, but imagine if the entity contained a string of data that was
several hundred characters long. It is much more convenient (and easier on the
fingers) to be able to reference a three- or four-character entity in an XML
document than to type in all that content.
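Entity content is not limited to plain text. As a sketch of the same idea taken one step further, the variation below stores markup in an entity and builds one entity from another. The yr entity and the publisher name "Example Press" are assumptions made for illustration.

```xml
<?xml version="1.0"?>
<!DOCTYPE library [
<!ENTITY yr "2000">
<!-- An entity may reference another entity and may contain markup -->
<!ENTITY cpy "<copyright>Copyright &yr; Example Press</copyright>">
<!ELEMENT library (book+)>
<!ELEMENT book (title,author,copyright)>
<!ELEMENT title (#PCDATA)>
<!ELEMENT author (#PCDATA)>
<!ELEMENT copyright (#PCDATA)>
]>
<library>
<book>
<title>How to Win Friends</title>
<author>Joe Charisma</author>
&cpy;
</book>
</library>
```

Here the reference &cpy; supplies the whole copyright element, so changing the year means editing a single declaration.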
Predefined Entities
There are five predefined entities, as shown in Table 3.4. These entities do
not have to be declared in the DTD. When an XML parser encounters these
entities (unless they are contained in a CDATA section), they will
automatically be replaced with the content they represent.
Table 3.4 Predefined Entities
| Entity | Content |
| &amp; | & |
| &lt; | < |
| &gt; | > |
| &quot; | " |
| &apos; | ' |
The XML fragment in Listing 3.15 demonstrates the use of a
predefined entity.
Listing 3.15 Using Predefined Entities
<icecream>
<flavor>Cherry Garcia</flavor>
<vendor>Ben &amp; Jerry's</vendor>
</icecream>
In this listing, the ampersand in "Ben & Jerry's" is
replaced with the predefined entity for an ampersand (&amp;).
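The remaining predefined entities work the same way. The following well-formed fragment is a small sketch (the element names are hypothetical) that uses all five and contrasts them with a CDATA section, where no replacement takes place.

```xml
<?xml version="1.0"?>
<snippets>
<!-- Markup characters in text must be escaped with predefined entities -->
<code>if (a &lt; b &amp;&amp; b &gt; c)</code>
<!-- Quotes need escaping only inside an attribute value that uses the same quote -->
<quote text="She said &quot;hello&quot;" source='Ben &amp; Jerry&apos;s'/>
<!-- Inside a CDATA section nothing is expanded: &amp; stays as five characters -->
<raw><![CDATA[a < b && &amp; is left alone]]></raw>
</snippets>
```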
External Entities
External entities are used to reference external content. As stated
previously, external entities get their content by referencing it via a URL
placed in the entitycontent portion of the entity declaration. Either
the SYSTEM keyword or the PUBLIC keyword is used here to let
the XML parser know that the content is external.
XML is incredibly flexible. External entities can contain references to
almost any type of data, even other XML documents. One well-formed XML
document can contain another well-formed XML document through the use of an
external entity reference. Taking this a step further, it can be easily
extrapolated that a single XML document can be made up of references to many
small XML documents. When the document is parsed, the XML parser will gather all
the small XML documents, merging them into a whole. The end-user application
will only see one document and never know the difference. One useful way to
apply the principle of combining XML documents through the use of external
entities would be in an employee-tracking application, like the one shown in
Listing 3.16.
Listing 3.16 Using External Entities
<?xml version="1.0"?>
<!DOCTYPE employees [
<!ENTITY bob SYSTEM "http://srvr/emps/bob.xml">
<!ENTITY nancy SYSTEM "http://srvr/emps/nancy.xml">
<!ELEMENT employees (clerk+)>
<!ELEMENT clerk (#PCDATA)>
]>
<employees>
<clerk>&bob;</clerk>
<clerk>&nancy;</clerk>
</employees>
In this listing, two external entity references are used to refer to XML
documents outside the current document that contain the employee data on
"bob" (bob.xml) and "nancy" (nancy.xml).
The SYSTEM keyword is used here to let the XML parser know that this is
external content. In order to insert the external content into the XML document,
the entities &bob; and &nancy; are used. It is useful
to be able to contain the employee information in a separate file and
"import" it using an entity reference. This is because this same
information could be easily referenced by other XML documents, such as an
employee directory and a payroll application. Defining logical units of data and
separating them into multiple documents, as in this example, makes the data more
extensible and reduces the need to reproduce redundant data from document to
document.
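The chapter does not show what the referenced files contain, so the following is only a sketch under stated assumptions. Suppose bob.xml holds two elements, name and position (both hypothetical), preceded by an optional text declaration:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<name>Bob</name>
<position>Clerk</position>
```

A document that pulls this file in must then declare clerk with element content rather than #PCDATA, so that the imported markup is valid where it lands:

```xml
<?xml version="1.0"?>
<!DOCTYPE employees [
<!ENTITY bob SYSTEM "http://srvr/emps/bob.xml">
<!ELEMENT employees (clerk+)>
<!ELEMENT clerk (name, position)>
<!ELEMENT name (#PCDATA)>
<!ELEMENT position (#PCDATA)>
]>
<employees>
<clerk>&bob;</clerk>
</employees>
```

After the parser replaces &bob; with the file's content, the application sees a clerk element that contains a name and a position, exactly as if they had been typed in place.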
Caution Use discretion when splitting up
your XML data into multiple documents. Splitting up employee records into 100
different XML documents so that they will have increased extensibility across
multiple applications might be a good idea. Taking the orders table from your
order tracking database and splitting it into 100,000 documents would be a
horrible idea. External entities are parsed at runtime. Could you imagine
parsing thousands of entities that point to XML documents at runtime?
Applications would suddenly be forced to search through 100,000 separate
documents to find what they need instead of a single indexed table. Performance
would be destroyed. So, keep in mind that although the approach mentioned here
does have very applicable uses, it should not represent an overall data storage
solution.
Non-Text External Entities and Notations
Some external entities will contain non-text data, such as an image file. We
do not want the XML parser to attempt to parse these types of files. In order to
stop the XML parser, we use the NDATA keyword. Take a look at the
following declaration:
<!ENTITY myimage SYSTEM "myimage.gif" NDATA gif>
The NDATA keyword is used to alert the parser that the entity
content should not be parsed; the parser simply reports the entity and its
notation to the application.
The final part of the declaration, gif, is a reference to a
notation. A notation is a special declaration that identifies the format
of non-text external data so that the XML application will know how to handle the
data. Any time an external reference to non-text data is used, a notation
identifying the data must be included and referenced. Notations are declared in
the body of the DTD and have the following syntax:
<!NOTATION notationname [SYSTEM | PUBLIC ] dataformat>
NOTATION is the tag name that specifies that this definition will
be for a notation.
notationname is the name by which the notation will be referred
in the XML document.
SYSTEM is a keyword that is added to the definition of the
notation to indicate that the format of external data is being defined. You
could also use the keyword PUBLIC here instead of SYSTEM.
However, using PUBLIC requires you to provide a public identifier for the
data format, optionally followed by a URL to its definition.
dataformat is a reference to a MIME type, ISO standard, or some
other location that can provide a definition of the data being
referenced.
Note Notation declarations do not
follow the same "top-down" rule that element definitions do. They may
be listed anywhere in the body of the DTD. However, it is good practice to list
them after the entity that references them in order to increase clarity.
Listing 3.17 is an example of using notation declarations for non-text
external entities.
Listing 3.17 Using External Non-Text Entities
<!NOTATION gif SYSTEM "image/gif" >
<!ENTITY employeephoto SYSTEM "images/employees/MichaelQ.gif" NDATA gif >
<!ELEMENT employee (name, sex, title, years) >
<!ATTLIST employee pic ENTITY #IMPLIED >
...
<employee pic="employeephoto">
...
</employee>
In this example, an ENTITY type of attribute, pic, is
defined for the element employee. In the XML document, the pic
attribute is given the value employeephoto, which is an external entity
that serves as a placeholder for the GIF file MichaelQ.gif. To help the
application process and display the GIF file, the external entity (using
the NDATA keyword) references the notation gif, which points
to the MIME type for GIF files.
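Listing 3.17 is a fragment, so here is a sketch of how the same declarations might sit in a complete document, extended with an ENTITIES attribute. The second image file (MichaelQ-badge.gif), the entity names photo1 and photo2, and the pics attribute are hypothetical additions.

```xml
<?xml version="1.0"?>
<!DOCTYPE employee [
<!NOTATION gif SYSTEM "image/gif" >
<!ENTITY photo1 SYSTEM "images/employees/MichaelQ.gif" NDATA gif >
<!ENTITY photo2 SYSTEM "images/employees/MichaelQ-badge.gif" NDATA gif >
<!ELEMENT employee (name) >
<!ELEMENT name (#PCDATA) >
<!ATTLIST employee
  pic  ENTITY   #IMPLIED
  pics ENTITIES #IMPLIED >
]>
<!-- Attribute values are entity names, not file names, and take no & or ; -->
<employee pic="photo1" pics="photo1 photo2">
<name>Michael Qualls</name>
</employee>
```

Unparsed entities can only be named in ENTITY or ENTITIES attributes like these. A reference such as &photo1; in element content would be an error.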
Parameter Entities
The final type of entity we will look at is the parameter entity, which is
very similar to the internal entity. The main difference between an internal
entity and a parameter entity is that a parameter entity may only be referenced
inside the DTD. Parameter entities are in effect entities specifically for
DTDs.
Parameter entities can be useful when you have to use a lot of repetitive or
lengthy text in a DTD. Use the following syntax for parameter entities:
<!ENTITY % entityname entitycontent>
The syntax for a parameter entity is almost identical to the syntax for a
normal, internal entity. However, notice that in the syntax, after the
ENTITY keyword, there is a space, a percent sign, and another space before
entityname. This alerts the XML parser that this is a parameter entity
and will be used only in the DTD. These types of entities, when referenced,
should begin with % and end with ;. Listing 3.18 shows an
example of this.
Listing 3.18 Using Parameter Entities
<!ENTITY % pc "(#PCDATA)">
<!ELEMENT name %pc;>
<!ELEMENT age %pc;>
<!ELEMENT weight %pc;>
In this listing, pc is used as a parameter entity to reference
(#PCDATA). All elements in the DTD that hold parsed character data use
the entity reference %pc;. This saves the DTD author from having to
type #PCDATA over and over. This particular example is somewhat
trivial, but you can see where this can be extrapolated out to a situation where
you have a long character string that you do not want to have to retype.
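A less trivial use is sharing part of a content model between elements. The sketch below assumes a hypothetical external DTD file named contacts.dtd with hypothetical directory and customer elements. Keep in mind that a parameter entity reference inside a declaration, as used here and in Listing 3.18, is only permitted in an external DTD, not in an internal one.

```xml
<!-- contacts.dtd: hypothetical external DTD file -->
<!ENTITY % pc "(#PCDATA)">
<!ENTITY % contact "address1, address2?, city, state, zip, phone?, email?">

<!ELEMENT directory (employee | customer)+ >
<!-- Both content models reuse the same list of contact elements -->
<!ELEMENT employee (name, position, %contact;) >
<!ELEMENT customer (name, %contact;) >

<!ELEMENT name %pc;>
<!ELEMENT position %pc;>
<!ELEMENT address1 %pc;>
<!ELEMENT address2 %pc;>
<!ELEMENT city %pc;>
<!ELEMENT state %pc;>
<!ELEMENT zip %pc;>
<!ELEMENT phone %pc;>
<!ELEMENT email %pc;>
```

Adding a new contact element, such as a fax number, would then require a change to the contact entity only, plus one new element declaration.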
We are almost finished. Having covered the use of element, attribute, and
entity declarations in DTDs, we have just a few more loose ends to tie up. In
the next section, we will look at the use of the IGNORE and
INCLUDE directives. Then we will discuss the use of comments in DTDs.
In the final part of the chapter, we will look at the future of DTDs, some
possible shortcomings of DTDs, and a possible alternative for DTD validation.
Before moving on though, let's pay one more quick visit to the Zippy Human
Resources department in our mini case study.
Zippy Human Resources: XML for Employee Records, Part III
This is the final part of the mini case study on the use of XML in the Human
Resources department at Zippy Delivery Service. In Part II, the Human Resources
department decided to change the structure of their DTD by moving the employees'
personal data into attributes. This created a separation between personal
data and contact data (which remained stored in elements).
At this point, the Human Resources department felt pretty satisfied with
their work. Now, however, there are just a couple more minor areas where they
feel the DTD (Employees2.dtd) could be improved. They've decided
that they need to add several entities in order to speed the entry process
for new records and to cut down on having to retype redundant information.
First, they've added an entity for "Delivery Person". This
makes sense to them because all but a few of the employees of Zippy Delivery
Service are delivery people, and this will save them from having to type it
over and over. The second entity they've decided to add is a parameter
entity to give them a shortcut for entering #PCDATA type elements.
Here's the updated DTD (you can download Employees3.dtd from
the Sams Web site):
<!ENTITY dp "Delivery Person">
<!ENTITY % pc "#PCDATA">
<!ELEMENT employees (employee+) >
<!ELEMENT employee (name, position, address1,
address2?, city, state,
zip, phone?, email?) >
<!ATTLIST employee serial ID #REQUIRED >
<!ELEMENT name (%pc;) >
<!ATTLIST name
age CDATA #REQUIRED
sex CDATA #REQUIRED
race CDATA #IMPLIED
m_status CDATA #REQUIRED >
<!ELEMENT position (%pc;) >
<!ELEMENT address1 (%pc;) >
<!ELEMENT address2 (%pc;) >
<!ELEMENT city (%pc;) >
<!ELEMENT state (%pc;) >
<!ELEMENT zip (%pc;) >
<!ELEMENT phone (%pc;) >
<!ELEMENT email (%pc;) >
In the new DTD, the entity dp is declared first. This entity is used
to insert the value "Delivery Person" into the XML document when it
is referenced. Next, the entity pc is declared. This is a parameter
entity that holds the value "#PCDATA" for insertion into the DTD when
referenced.
The XML document Employees2.xml has been updated to reflect the
addition of the dp entity (the whole XML document is not listed because
only a few lines actually changed; data not shown here should be assumed to
be the same as in Parts I and II of this case study). Here's the code
for Employees3.xml (which you can download from the Sams Web site):
<?xml version="1.0"?>
<!DOCTYPE employees SYSTEM "employees3.dtd">
<employees>
<employee serial="emp1">
<name age="37" sex="Male" race="African American" m_status="Married">
Bob Jones
</name>
<position>Dispatcher</position>
...
</employee>
<employee serial="emp2">
<name age="19" sex="Female" race="Caucasian" m_status="Single">
Mary Parks
</name>
<position>&dp;</position>
...
</employee>
<employee serial="emp3">
<name age="23" sex="Male" race="African American" m_status="Single">
Jimmy Griffin
</name>
<position>&dp;</position>
...
</employee>
</employees>
For the first employee, Bob Jones, the dp entity was not used for his
position value because he is the company's dispatcher. However,
for Mary Parks and Jimmy Griffin, the entity reference &dp; was
inserted as the value for their position elements because they are
both delivery people. This entity reference would also be used for any new employees
added to the XML document that are delivery people.
The DTD for Zippy Delivery Service's Human Resources department is now
complete. The DTD contains all the information required. It accounts
for information that might not be applicable. The employees' personal
and contact information has been logically separated between attributes and
elements. Also, entities have been added to serve as timesaving devices for
future additions to the XML document. The Zippy Human Resources department
has built a DTD that will serve to validate their XML employee records effectively
and efficiently.
More DTD Directives
Just a few more DTD keywords are left to cover. These are keywords that do
not neatly fit into any particular topic, so they're lumped together here.
These keywords are INCLUDE and IGNORE, and they do just what
their names suggest: they indicate pieces of markup that should either be
included in the validation process or ignored.
The IGNORE Keyword
When developing or updating a DTD, you may need to comment out parts of the
DTD that are not yet reflected in the XML documents that use the DTD. You could
use a normal comment directive (which will be covered in the next section), or
you can use an IGNORE directive. The syntax for IGNORE is
shown in Listing 3.19.
Listing 3.19 Using IGNORE Directives
<![ IGNORE [
This is the part of the DTD ignored
]]>
You can choose to ignore elements, entities, or attributes. However, you must
ignore entire declarations. You may not attempt to ignore a part of a
declaration. For example, the following would be invalid:
<!ELEMENT Employee <![ IGNORE [ (#PCDATA) ]]> (Name, Address, Phone) >
In this example, the DTD author has attempted to ignore the rule
#PCDATA in the middle of an element declaration. This is invalid and
would trigger an error.
The INCLUDE Keyword
The INCLUDE directive marks declarations to be included when the DTD is
processed. It might seem odd that this keyword exists at all, because not
using an INCLUDE directive is the same as using it! In the absence of
the INCLUDE directive, all declarations (unless they are commented out
or enclosed in an IGNORE directive) will be included anyway. The syntax
for INCLUDE, as shown in Listing 3.20, is very similar to the syntax
for the IGNORE directive.
Listing 3.20 Using INCLUDE Directives
<![ INCLUDE [
This is the part of the DTD included
]]>
The INCLUDE directive follows the same basic rules as the
IGNORE directive. It may enclose entire declarations but not pieces of
declarations. The INCLUDE directive can be useful when you're in
the process of developing a new DTD or adding to an existing DTD. Sections of
the DTD can be toggled between the INCLUDE directive and the
IGNORE directive in order to make it clear which sections are currently
being used and which are not. This can make the process of developing a new DTD
easier, because you are able to quickly "turn on" or "turn
off" different sections of the DTD.
Note If an INCLUDE directive
is enclosed by an IGNORE directive, the INCLUDE directive and
its declarations will be ignored.
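One common way to do this kind of toggling is to put a parameter entity where the keyword would go, so that a section can be switched on or off by editing a single line. The sketch below assumes an external DTD file; the entity name draft and the Notes element are hypothetical.

```xml
<!-- Change "IGNORE" to "INCLUDE" to switch the section on -->
<!ENTITY % draft "IGNORE">

<![ %draft; [
  <!ELEMENT Notes (#PCDATA)>
]]>
```

The parser replaces %draft; with its value before deciding whether to include or ignore the section, so several sections that share the same entity can be toggled together.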
Comments Within a DTD
Comments can also be added to DTDs. Comments within a DTD are just like
comments in HTML and take the following syntax:
<!-- Everything between the opening tag and closing tag is a comment -->
As in HTML, comments in a DTD may not be nested. Comments may, however, span
multiple lines. Generally comments in a DTD are used to demarcate different
sections of the DTD or to help human readers understand different abbreviations
used in the declarations. Comments will be ignored by the XML parser during
processing. Listing 3.21 shows how to insert comments into a DTD.
Listing 3.21 Using Comments
<!-- This is a comment -->
<!ELEMENT rootelement (element1, element2)>
<!ELEMENT element1 (#PCDATA)>
<!-- This is another comment -->
<!ELEMENT element2 (#PCDATA)>
<!-- This is a comment
that spans multiple lines -->
Comments provide a useful way to explain the meaning of different elements,
attribute lists, and entities within the DTD. They can also be used to demarcate
the beginning and end of different sections in the DTD.
The DTD is a powerful tool for defining rules for XML documents to follow.
DTDs have had and will continue to have an important place in the XML world for
some time to come. However, DTDs are not perfect. As XML has expanded beyond a
simple document markup language, these limitations have become more apparent.
XML is quickly becoming the language of choice for describing more abstract
types of data. DTDs are hard-pressed to keep up. We will now take a look at some
of the drawbacks to DTDs and what future alternatives will be available.
DTD Drawbacks and Alternatives
Throughout this book, we will continue to document new growths, changes, and
permutations to XML as a technology to enhance data exchange, data structuring,
e-commerce, the Internet, and so on. As newer uses for XML come into being, the
needs for validation expand. XML is being used to describe the data structure of
video files, audio files, and Braille devices, among other things, not to
mention the ever-growing plethora of alternative data devices such as cellular
phones, handheld computers, televisions, and even appliances. There are several
drawbacks that limit the ability of DTDs to meet these growing and changing
validation needs.
First and foremost, DTDs are composed of non-XML syntax. Given that one of
the central tenets of XML is that it be totally extensible, it may not seem to
make a lot of sense that this is the case for DTDs. However, you must consider
that XML is a child of SGML, and in SGML, DTDs are the method used to validate
documents. Therefore, XML inherited DTDs from its parent. Although DTDs are
effective at defining the structure for document markup, as XML evolves, the
fact that DTDs are not formed of XML syntax and are nonextensible becomes
constraining.
Additionally, there can only be a single DTD per document. It is true that
there can be internal and external subsets of DTDs, but there can only be a
single DTD referenced. In the modern programming world, we are used to being
able to draw the programming constructs we use from different modules or
classes. If we applied this idea to DTDs, we might expect to be able to use a
DTD for customers, a separate DTD for inventory, and a separate DTD for orders.
However, this is not the case. All aspects of an XML document must be within a
single DTD. This limitation is similar to what programmers faced back in the
days of monolithic applications before object-oriented programming became a
normal standard for application development. This leads into the next
limitation.
DTDs are not object oriented. There is no inheritance in DTDs. As
programmers, we have gotten used to describing new objects based on the
characteristics of existing objects. One classic example is having Porsche,
Ford, and Chevrolet classes that inherit some characteristics from a base car
class. DTDs have no capability to do this.
DTDs do not support namespaces very well. For a namespace to be used, the
entire namespace must be defined within the DTD. If there is more than one
namespace, each of them must be defined within the DTD. This totally defeats
the purpose of namespaces: being able to draw on multiple namespaces from
many different external sources.
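To make the namespace limitation concrete, here is a minimal sketch. The inv prefix, the element name, and the example.com URI are all hypothetical. A DTD has no notion of namespaces, so it matches the literal name inv:item, prefix included.

```xml
<!-- The DTD sees "inv:item" as an ordinary name that happens to contain a colon -->
<!ELEMENT inv:item (#PCDATA)>

<!-- To the DTD, the namespace declaration is just another attribute -->
<!ATTLIST inv:item
  xmlns:inv CDATA #FIXED "http://example.com/inventory">
```

A document that binds the same URI to a different prefix, such as stock:item, is equivalent as far as namespaces are concerned, but it would fail validation against these declarations.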
Additionally, DTDs have weak data typing and no support for the XML DOM. DTDs
basically have one data type: the text string. There are a few constraints, such
as the element rules and attribute types covered in this chapter, but these are
pretty weak considering the types of applications for which XML is now being
used (especially in e-commerce). The Document Object Model has become a powerful
tool to manipulate XML data; however, the DTD is totally cut off from the reach
of the DOM.
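The weak typing is easy to see in a small example. The following sketch uses hypothetical Payroll and Salary elements. Because #PCDATA is the only option for text content, the DTD cannot require that a salary be a number.

```xml
<?xml version="1.0"?>
<!DOCTYPE Payroll [
  <!ELEMENT Payroll (Salary+)>
  <!ELEMENT Salary (#PCDATA)>
]>
<Payroll>
  <Salary>42000</Salary>
  <!-- Still valid: the DTD cannot demand a numeric value here -->
  <Salary>plenty</Salary>
</Payroll>
```

Both Salary elements pass validation, so any check that the content is actually numeric has to be done by the application.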
Finally, and possibly most important from a security standpoint, is the
ability of the internal DTD subset to override the external DTD subset. A
company could spend a great deal of time and effort crafting a DTD to validate
the XML data in its e-commerce transactions only to have the settings in the DTD
overridden by declarations in the internal subset of a document. The implications
of this from a transaction security standpoint should be fairly clear.
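The sketch below shows one form this overriding can take, using an entity declaration. The file name order.dtd, the currency entity, and the element names are hypothetical. The internal subset is read before the external subset, and the first declaration of an entity is the one that counts.

```xml
<?xml version="1.0"?>
<!-- Assume order.dtd contains: <!ENTITY currency "USD"> -->
<!DOCTYPE Order SYSTEM "order.dtd" [
  <!-- Read first, so this declaration wins over the one in order.dtd -->
  <!ENTITY currency "ZZZ">
]>
<Order>
  <Total>100 &currency;</Total>
</Order>
```

Under these assumptions, the document author has changed the meaning of the entity without touching the shared DTD.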
So, what is to be done about the DTD? The DTD is still an effective mechanism
for validating XML documents and will be so for a long time to come. It just
does not "scale" to meet the needs of the expanding XML world. At the
time of this writing, the W3C organization has just recently finished the final
touches on the recommendation for the XML Schema, which is a new validation
mechanism for XML that addresses many of the shortcomings of DTDs. XML Schema is a
powerful and important technology for the future of XML. The next chapter of
this book will be devoted to covering the XML Schema.
Note The W3C organization's Web
resources page for the XML Schema may be viewed at
http://www.w3.org/XML/Schema.
Summary
In this chapter, we have covered the Document Type Definition (DTD) and how
it is used to validate XML documents. Well-formed XML documents are documents
that are syntactically correct according to the syntax rules of XML. However, to
be considered valid, an XML document must also be validated against a DTD by a
validating XML parser. A DTD serves as a roadmap for defining what structure a
valid XML document should have.
We covered the following items in relation to using DTDs:
A DTD may be internal to the XML document or external and referenced by
the XML document.
A DTD is attached to an XML document through a Document Type Declaration.
A Document Type Declaration appears after the XML declaration and before the
root element of the XML document. The Document Type Declaration may include a
reference to an external DTD, encompass an internal DTD, or both.
XML elements are declared and defined within the DTD. Child elements in the
XML document must appear in the order given by their parent's content
model in the DTD. Element declarations have a specific set of rules
and symbols that may be used in their definitions.
XML attributes are declared and defined within the DTD. Attributes are
not processed in a top-down fashion, but it is good programming practice to
insert them after the element they reference. Attribute declarations have a
specific set of types that may be used in their definitions.
Entities are used in DTDs as storage spaces or placeholders for data.
Normally entities are used to store repetitive or bulky data for easy reference.
There are four types of entities: internal, predefined, external, and parameter.
Notations are used as references to help define the format of the external
data.
The IGNORE directive is used to indicate blocks of declarations
that should not be included when the document is processed.
The INCLUDE directive is used to indicate blocks of declarations
that should be included when the document is processed. This directive is
totally unnecessary to the successful processing of a DTD.
Comments may be included in DTDs. Comments in DTDs are used in exactly
the same way they are used in HTML.
The DTD has several drawbacks that limit its scalability with respect to
new and future XML applications. The XML Schema is the new recommendation from
the W3C organization for XML validation. The XML Schema will be covered in
detail in the next chapter.
Throughout the chapter, we followed a mini case study in which the Human
Resources department for Zippy Delivery Service used XML to store employee
records. The Human Resources department required a DTD to ensure that all XML
records were of a uniform structure. To start, they built a simple DTD that was
functional and worked. However, they were able to expand upon and improve their
DTD to coincide with the introduction of new DTD topics in this chapter.
Ultimately, they produced a DTD that effectively defined all the needs of the
Human Resources department and enabled them to build a good roadmap for a valid
XML document containing employee records.
© Copyright Pearson Education. All rights reserved.