Instructional Module X51

Document Type Definitions for XML


Overview

Document Type Definitions are the original way of specifying how an XML-based language works. Our goal in this module is to give you an idea of how they're put together, and help you to interpret them.

 


Document Type Definitions for XML

In a Nutshell:

Document Type Definitions are formal definitions of the elements and attributes found in a document type. An XML document points to its DTD with a DOCTYPE statement.

The purpose of this module is to make it possible to understand DTDs; other learning resources are available for learning to create DTDs.

Referring to DTDs

DTDs can occur either within an XML document or in a separate document, on their own. Except in textbook examples, though, document type definitions need to be public documents available through the Internet for all to use, so almost all XML documents based on DTDs refer to them in their prolog area. This module does not discuss the use of DTD statements within an XML document.

To get from the XML document to the DTD, there has to be a reference, known as a Document Type Declaration. Here's what the declaration for XHTML looks like:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">

Here's what it means:

<!DOCTYPE opens the declaration (<! is the opening delimiter for all DTD declarations)

html is the root element of the document

PUBLIC means that a public identifier follows, signaling the intention that the DTD be available for everyone to use. The alternative is SYSTEM, which gives only a location and is typically used when the DTD is for some organization's internal use.

-//W3C//DTD XHTML 1.0 Transitional//EN is the formal public identifier, or FPI. Its four fields are separated by //:
  • - indicates that the owner named in the next field is not a formally registered organization (a + would indicate a registered one)
  • W3C is the organization responsible for the DTD
  • DTD XHTML 1.0 Transitional is the kind of document (a DTD) followed by its name and version
  • EN is the language in which the DTD is written (EN = English)

http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd is the URL of the DTD we're referring to.

> is the closing delimiter of the declaration
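
As a rough illustration of the parts of the declaration, the sketch below uses Python's standard xml.dom.minidom module to read the declaration and then splits the formal public identifier on its // separators. The tiny XHTML document and the names sample and fields are made up for this example; the parser reads the identifiers only and does not download the DTD.

```python
from xml.dom import minidom

# A made-up minimal XHTML document carrying the declaration discussed above.
sample = (
    '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" '
    '"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">'
    '<html><head><title>t</title></head><body/></html>'
)

doctype = minidom.parseString(sample).doctype

# Root element name, public identifier (FPI), and system identifier (URL).
print(doctype.name)
print(doctype.publicId)
print(doctype.systemId)

# The four fields of the FPI are separated by double slashes.
fields = doctype.publicId.split("//")
print(fields)
```

The last line prints the owner flag, the organization, the document type name, and the language as four separate strings.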

Elements Defined


The overall structure of DTDs is a series of element and attribute definitions. The order doesn't matter, and attributes can be listed separately, because they are linked to their elements by name. Here is a brief example:

<!ELEMENT example (#PCDATA)>

ELEMENT (in upper-case) is the keyword introducing the definition of an element-type.

example (case-sensitive) is the name of the element being defined; in this case it is also the root element of this type of document

(#PCDATA) is the type of data accepted in this element. PCDATA is the most common data type: plain text, including entities - the character sequences that can be rendered as other characters or sequences, like &eacute; rendered as é.

Because nesting elements is such an important part of XML, element definitions include any nested elements they might (or might not) contain:

<!ELEMENT example (title, description, date)>

Here, element example must contain one title element, one description element, and one date element.

What if one or more contained elements are optional?

<!ELEMENT example (title*, description, date*)>

The * after the element name means the parent element may have zero or more of that element. In this case, there need not be a title or a date, but there could be any number of them; and there must be exactly one description.

Some of the other possibilities: + means one or more, and ? means zero or one. Instead of a comma between child elements, you can put a vertical bar character | which means there's a choice of elements.

<!ELEMENT example (title*, description+, date?, (author | source))>

Ask yourself: What does this mean?
<!ELEMENT example (title*, description+, date?, (author | source))>
Answer: the example element can have:
zero or many title elements
at least one, but potentially many description elements
date is optional, but no more than one can occur
and either one author or one source element must be present
The choice has its own parentheses because commas and vertical bars cannot be mixed within a single group.
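
One way to get a feel for these symbols is to notice that they work much like the same symbols in regular expressions. The sketch below is only an analogy under that assumption, not a DTD validator: it hand-translates the content model into a Python regular expression and tests a few made-up lists of child element names. The names CONTENT_MODEL and matches_model are invented for this example.

```python
import re

# Hand-written regular expression mirroring the content model
# (title*, description+, date?, (author | source)).
# Each child element is written as its name followed by one space.
CONTENT_MODEL = re.compile(r"(title )*(description )+(date )?(author |source )")

def matches_model(children):
    """Return True if the sequence of child element names fits the model."""
    flattened = "".join(name + " " for name in children)
    return CONTENT_MODEL.fullmatch(flattened) is not None

print(matches_model(["description", "author"]))                            # True
print(matches_model(["title", "title", "description", "date", "source"]))  # True
print(matches_model(["title", "author"]))                                  # False: no description
print(matches_model(["description", "author", "source"]))                  # False: author and source together
```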

The next example shows two more features of DTDs: first, it's possible to mix text and elements; second, you can create groups that repeat.

<!ELEMENT furthermore (#PCDATA | heading)*>

Element furthermore can have either text or a heading element, and there can be any number of them repeated in any order. Because * means zero or more, the element may also be empty.

Attributes Defined


Attributes

Recall that an attribute is a modifier of an element, contained in the first tag of the element, like type in the <furthermore> element:

<furthermore type="informative">Rainshine</furthermore>

Here's how a DTD would set this up:

<!ELEMENT furthermore (#PCDATA)>
<!ATTLIST furthermore
   type CDATA #IMPLIED>

Notice these points:

  • ATTLIST must be in capitals
  • furthermore is the name of the element whose attributes we're defining. It is case-sensitive, and must refer to an element defined somewhere in the DTD, though not necessarily directly above
  • type is the name of the attribute, and is also case-sensitive
  • CDATA is the kind of plain text used in attributes rather than #PCDATA (see note below); unlike #PCDATA, it is written without a # sign
  • #IMPLIED marks the attribute as optional (explained in the next section)
Attributes Required?

DTDs offer several options for indicating whether an attribute is required, not required, or has a default value.

  • Required attributes are followed by #REQUIRED like:
    <!ATTLIST furthermore
       type CDATA #REQUIRED>
  • Optional attributes are followed by the technical term for "optional", which is #IMPLIED (always in capitals):
    <!ATTLIST furthermore
       type CDATA #IMPLIED>
  • If an optional attribute should have a default value, the quoted value takes the place of #IMPLIED:
    <!ATTLIST furthermore
       type CDATA "helpful">

    so that in a case like this:
    <furthermore>Sit down</furthermore>
    the attribute value helpful can be filled in by software that reads the file, and when creating an XML file, the type attribute only needs to be written if its value is not helpful.
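
To see a default value being filled in, here is a minimal sketch that uses the expat parser from Python's standard library. It assumes, purely for illustration, that the declarations sit inside the document itself (in square brackets after DOCTYPE) so the example is self-contained; the names document, seen, and start_element are invented for this sketch.

```python
from xml.parsers import expat

# Hypothetical document: the DTD statements give "type" a default value,
# and the element itself does not write the attribute at all.
document = """<!DOCTYPE furthermore [
<!ELEMENT furthermore (#PCDATA)>
<!ATTLIST furthermore type CDATA "helpful">
]>
<furthermore>Sit down</furthermore>"""

seen = {}

def start_element(name, attributes):
    # By default expat reports defaulted attributes together with
    # the ones actually written in the start tag.
    seen[name] = dict(attributes)

parser = expat.ParserCreate()
parser.StartElementHandler = start_element
parser.Parse(document, True)

print(seen)
```

Not every XML tool reports defaulted attributes this way, so it is worth checking the behavior of whatever software will read your files.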
Attribute Data Types
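  • CDATA is character data, the usual type for attribute values. The value is plain text: it cannot contain tags, although entity references in it are still expanded.
  • PCDATA is parsed character data. It is used for the content of elements, not for attributes. A processor is expected to look through it for tags and entities.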
Entities Defined
What is an Entity? What is it For?

An entity in XML (and before that, in SGML) is a type of abbreviation or short string that can be used for a number of purposes:

  • To make it possible to show characters in the text if they are also used as delimiters in the code. For example, the angle-brackets used to delimit tags, < and >, may need to appear in the document's text (as they just did!), and can be coded as entities &lt; and &gt; . Since the & symbol is used as the opening delimiter for entities, it also requires an entity when it is to appear in the text. That entity is &amp; . This use of entities is necessary, and is defined as part of XML. In all, five character entities are predefined for XML:
    Symbol   Entity   Meaning
    <        &lt;     "less than"
    >        &gt;     "greater than"
    &        &amp;    "ampersand"
    "        &quot;   "quote"
    '        &apos;   "apostrophe"
  • To make it easier to read XML documents by coding abbreviations. The namespace qualifiers used in an XML document can (optionally) be encoded in DTDs using entities; any other useful abbreviations can be encoded as entities as well. This use of entities is not necessary, but can be helpful in some situations.
  • To simplify coding DTDs themselves - entities can be used within DTDs themselves.
  • To specify objects that aren't text, or that the XML-handling software should not try to "understand" (parse), such as multimedia files.
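
The first use in this list, the predefined character entities, can be tried out with Python's standard library. This is a small sketch: escape() from xml.sax.saxutils replaces the three characters that must not appear raw in element text, and parsing the result restores them. The element name note and the sample sentence are made up.

```python
from xml.sax.saxutils import escape
import xml.etree.ElementTree as ET

raw_text = "Use <title> & </title> around the heading"

# escape() replaces &, < and > with &amp;, &lt; and &gt;.
escaped = escape(raw_text)
print(escaped)

# Parsing the escaped text turns the entities back into the original characters.
element = ET.fromstring("<note>" + escaped + "</note>")
print(element.text)
```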
How Entities are Defined

Let's define an entity that expands to "film studio, production company, or sponsoring organization" for our movies ontology.

<!ENTITY fpso "film studio, production company, or sponsoring
organization">

We can then use the entity fpso this way in an XML document:

<dc:publisher>Dreamworks (&fpso;)</dc:publisher>

The XML software that reads this file will interpret it as:

<dc:publisher>Dreamworks (film studio, production company, or
sponsoring organization)</dc:publisher>
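
The expansion can be observed with Python's standard xml.etree.ElementTree module. This sketch makes two simplifying assumptions that are not part of the movies example: the entity is declared inside the document itself rather than in a separate DTD file, and the element is called publisher without the dc: prefix so that no namespace declaration is needed.

```python
import xml.etree.ElementTree as ET

# Simplified, self-contained document: the entity is declared in square
# brackets after DOCTYPE and then used in the element's text.
document = """<!DOCTYPE publisher [
<!ENTITY fpso "film studio, production company, or sponsoring organization">
]>
<publisher>Dreamworks (&fpso;)</publisher>"""

publisher = ET.fromstring(document)

# The parser has already replaced &fpso; with its full text.
print(publisher.text)
```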

Unparsed Entities

Suppose we want to include movie trailers in our movie XML file. The trailer is in a video format such as Quicktime or AVI, and XML software doesn't (normally) handle this type of data. Since XML software does not try to parse ("understand") these files, there is a special way to include media files, including graphic images of all kinds, and sound files as well as video. This uses a type of entity called unparsed entities.

We could define an element and attribute to refer to trailers, and we would also have to indicate what medium it is or what application can handle it - for example, a Quicktime player. This could be declared in a DTD file:

<!ELEMENT m:trailer EMPTY>
<!NOTATION qt SYSTEM "video/quicktime">
<!ATTLIST m:trailer
source ENTITY #REQUIRED
description CDATA #REQUIRED>

Notes about these declarations:

  • The trailer element is declared EMPTY because all the information needed will be in its two attributes, source and description. The EMPTY keyword makes this a self-closing element.
  • NOTATION is the DTD keyword to associate an abbreviation - in this case qt - with a data type or application to handle it.
  • The keyword SYSTEM introduces a system identifier: something defined more-or-less locally that the processing application is expected to understand.
  • I chose to use the MIME type video/quicktime. This allows the OS of the computer on which the XML file is processed to select an application that will decode the file. MIME types are an IETF standard, and are widely used by operating systems and browsers as a way of associating media files with programs to handle them.
    However, another option here would be to use the name of the program directly, on the assumption that the computer our file will be decoded on has that specific program. That option would look like this:
    <!NOTATION qt SYSTEM "QuickTimePlayer.exe">
  • Attribute source has the type ENTITY: its value is the name of an unparsed entity, and that entity holds the URL of the movie trailer.
  • Attribute description will contain some text to help users identify the trailer.

The entity that contains the media object reference can be declared in the DTD file, but since a separate entity is needed for each media object, it makes more sense to declare it in the XML file where it will be used.

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE m:movies SYSTEM
"http://poggin.wccnet.edu/xml/movies.dtd" [
<!ENTITY narnia_trailer
SYSTEM "http://www.apple.com/trailers/disney/
thechroniclesofnarnia/clips.html
?movie=trailer&size=QTsmall&clip_title=Theatrical%20trailer"
NDATA qt>
]>

<m:movies xmlns:m="http://poggin.wccnet.edu/xml/movies.dtd"
xmlns:dc="http://purl.org/dc/elements/1.1/">
<m:movie>
<dc:title>The Chronicles of Narnia: The Lion, The Witch, and The Wardrobe</dc:title>
<m:trailer description="Small theatrical trailer"
source="narnia_trailer"/>
</m:movie>
</m:movies>

Notes about this code:

  • Several lines have been broken to fit the screen, where they should not normally be broken, such as in the middle of a long URL.
  • In the DOCTYPE declaration, I've added an entity definition for narnia_trailer. Notice how the definition part is enclosed in [square brackets].
  • The definition includes:
    • Keyword ENTITY
    • The entity name narnia_trailer
    • The SYSTEM keyword
    • The (very long) URL of the media
    • Keyword NDATA which identifies this as an unparsed data entity
    • The abbreviation qt which we defined in the DTD file's NOTATION definition to refer to any Quicktime media file.
  • In the XML file, the movie element contains the child element trailer, whose source attribute names the entity narnia_trailer.
  • This combination provides any XML software with the information it needs to handle the file. It is no guarantee that the trailer will actually be shown, because:
    1. The XML-handling software may not have incorporated any method for handling non-text objects.
    2. Even if the XML-handling software knows how to make use of this information, the computer and OS on which it's attempting to show the file may not have the necessary resources. For example, the user may not have downloaded and installed the Quicktime player.
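
To show what "the information it needs" amounts to, the sketch below uses the expat parser from Python's standard library to collect the notation and the unparsed entity, then follows the chain from attribute value to entity to notation. It is a simplified stand-in for the movie file: the element names have no prefixes, all declarations are inside the document, and the trailer URL http://example.org/narnia.mov is made up. The handler and variable names are also invented for this sketch.

```python
from xml.parsers import expat

# Hypothetical, simplified version of the movie file.
document = """<!DOCTYPE movies [
<!NOTATION qt SYSTEM "video/quicktime">
<!ENTITY narnia_trailer SYSTEM "http://example.org/narnia.mov" NDATA qt>
<!ELEMENT movies (trailer)>
<!ELEMENT trailer EMPTY>
<!ATTLIST trailer source ENTITY #REQUIRED>
]>
<movies><trailer source="narnia_trailer"/></movies>"""

notations = {}   # notation name -> system identifier
unparsed = {}    # entity name -> (system identifier, notation name)
trailers = []    # entity names found in source attributes

def on_notation(name, base, system_id, public_id):
    notations[name] = system_id

def on_entity(name, is_parameter, value, base, system_id, public_id, notation):
    # Only unparsed entities carry a notation name.
    if notation is not None:
        unparsed[name] = (system_id, notation)

def on_start(name, attributes):
    if name == "trailer":
        trailers.append(attributes["source"])

parser = expat.ParserCreate()
parser.NotationDeclHandler = on_notation
parser.EntityDeclHandler = on_entity
parser.StartElementHandler = on_start
parser.Parse(document, True)

# Follow the chain: attribute value -> entity -> notation.
for entity_name in trailers:
    url, notation = unparsed[entity_name]
    print(entity_name, url, notations[notation])
```

The parser only reports these facts; deciding what to do with a video/quicktime resource is left entirely to the application, which is why showing the trailer is not guaranteed.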
How are Entities Used Within DTDs?

Entities can be used to store part - or all - of a DTD file. This can be useful for:

  • Complicated code that needs to be used often in a DTD;
  • "Borrowing" whole DTDs from the Web;
  • Putting together DTDs or portions of DTDs stored in different files.

When entities are used this way, they are known as parameter entities. A couple of brief examples will illustrate what this looks like:

File: further.dtd

<!ELEMENT further (#PCDATA)>
<!ATTLIST further
   howfar CDATA #IMPLIED>

File: more.dtd

<!ELEMENT more (#PCDATA)>
<!ATTLIST more
   howmuch CDATA #IMPLIED>

File: furthermore.dtd

<!ENTITY % further SYSTEM "further.dtd">
<!ENTITY % more SYSTEM "more.dtd">
%further;
%more;

In effect, file furthermore.dtd will look like this:

<!ELEMENT further (#PCDATA)>
<!ATTLIST further howfar CDATA #IMPLIED>

<!ELEMENT more (#PCDATA)>
<!ATTLIST more howmuch CDATA #IMPLIED>

Notes about parameter entities:

  • The % symbol indicates parameter entities being defined. Mnemonic: percent and parameter both begin with the same letter.
  • Compare the delimiters for parameter entity references to other ("general") entity references (notice both end with ; ):
    &general_entity_reference;
    %parameter_entity_reference;
  • Looking at this example, you may not think it's worthwhile, since we hardly saved any typing by using the parameter entity; but in real life, the first two DTD files would be far bigger.
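
The idea of "in effect, the file will look like this" can be imitated with plain text substitution. The sketch below is a toy, not a real DTD processor: it assumes the two small files are available as strings in a dictionary, and it only understands parameter entities declared with SYSTEM and a quoted file name. The names files, combined, expand, and result are invented for this example.

```python
import re

# In-memory stand-ins for the two small DTD files.
files = {
    "further.dtd": '<!ELEMENT further (#PCDATA)>\n'
                   '<!ATTLIST further howfar CDATA #IMPLIED>\n',
    "more.dtd": '<!ELEMENT more (#PCDATA)>\n'
                '<!ATTLIST more howmuch CDATA #IMPLIED>\n',
}

combined = """<!ENTITY % further SYSTEM "further.dtd">
<!ENTITY % more SYSTEM "more.dtd">
%further;
%more;
"""

DECLARATION = re.compile(r'<!ENTITY\s+%\s+(\w+)\s+SYSTEM\s+"([^"]+)"\s*>\s*')
REFERENCE = re.compile(r"%(\w+);")

def expand(dtd_text, files):
    """Toy substitution: replace each %name; with the text of the file it names."""
    targets = dict(DECLARATION.findall(dtd_text))   # entity name -> file name
    body = DECLARATION.sub("", dtd_text)            # drop the declarations themselves
    return REFERENCE.sub(lambda ref: files[targets[ref.group(1)]], body)

result = expand(combined, files)
print(result)
```

A real XML processor does this work itself when it reads the DTD, and handles many cases this toy ignores.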

So as you can see, entities are used in several different ways.

More about DTDs


This is just a quick overview of DTDs. Get more information! Consult these references:

Tutorials
References


Interpreting DTDs

Now it's time to practice what you know about DTDs and try interpreting some. The general idea is to read a fairly simple DTD, and explain in English what the elements, attributes, and entities are for.

There are three tasks in this section; the first two are fairly simple, and the third a little more challenging, but not too difficult for starters.

Task 1


We'll start with a simple DTD from a presentation to the Reuters News Service coders, at http://www.fisd.net/presentations/Reuters500/tsld004.htm.

Questions to answer:

  1. What is the root element?
  2. Which elements within the root are required, and which are optional?
  3. What attributes, if any, are required? Which optional?
  4. What entities are defined, and how are they used?

Task 2


Let's take a look at another simple example and work through it. Thanks to Elizabeth Castro's Cookwood Press Web site and her book, XML for the World Wide Web (Peachpit Press, 2001).

Browse to the Endangered Species DTD at http://www.cookwood.com/xml/examples/dtd_creating/end_species.dtd. Study the DTD to find answers to these questions:

  1. What is the root element?
  2. What element(s) is/are allowed directly under the root? How many are required or allowed?
  3. In element animal, which child elements are required? Which are optional and/or repeatable?
  4. Which elements that are children of animal have children of their own? Which are required and which are optional?
  5. Which elements have attributes, and which of them are required vs. optional?
  6. Are any entities defined? If so, what are they used for?

Task 3


You may be already familiar with the Dublin Core Metadata Initiative (DCMI) and its 15 elements, if you used them in completing assignment module X65h. Though DCMI does not use its DTD as the primary definition of its elements, it makes one available for information purposes.

Browse to the DCMI DTD at http://dublincore.org/documents/2002/07/31/dcmes-xml/dcmes-xml-dtd.dtd and answer the following questions:

  1. What is the purpose of entities rdfns and dcns? Why are they needed?
  2. What type of entity (general or parameter) are rdfnsdecl and dcnsdecl? How are they used in the declaration of the "wrapper element"?
  3. The entity dcmes is declared early in the document. Where is it used?
  4. Element rdf:Description can contain a number of child elements and one attribute. What are its children? Are they required? How many of each may/must there be?
  5. Elements from the Dublin Core Metadata Element Set (DCMES) are listed with their attributes and introduced by a brief comment each. What are the attributes most of them can have; required or optional? Are any elements different, and if so what attributes are they given?

About This Document
Audience

This module is for people who need to understand DTDs.

Objectives

On successful completion of this module, you will be able to recognize and interpret XML Document Type Definitions (DTDs) so as to be able to correctly analyze documents based on those DTDs.

Module X51: Document Type Definitions for XML
This document is part of a modular instruction series in Computer Instruction. For more information, see the overview or the list of modules in this series, X: XML, XHTML, DHTML, CSS. This document has been used in the following classes: CIS 179.
History
Original: 7 November 2006, by Laurence J. Krieg
Last modification: Monday, August 31, 2009
Copyright
Copyright © 2007, Laurence J. Krieg, Washtenaw Community College
Instructors: You may point to this file in your Web-based materials; however, its location may change without notice.
Students: You are welcome to make a copy for your personal use.
All other uses: Please contact the author, Laurence J. Krieg, for permission: krieg@ieee.org.
Background: X50c | Related modules | Module Home | Next reading: X52
