Instructional Module X10a

How to Code XML

Background: X02c | Related modules | Module Home | Next reading: X11a

Overview

XML is governed by a number of rules; to the extent that we create XML according to the rules, it can be used by software anywhere in the world. But if we fail to implement the rules, our XML code will be handicapped or useless.

The purpose of this module is to help you understand the general rules governing XML. The overall objective is to enable you to identify the core components and rules of an XML document including syntax and essential elements. When you complete this module, you should be able to:

  1. Explain and differentiate the concepts of data and metadata
  2. Explain the concept of well-formedness
  3. Explain the concept of validity
  4. Identify elements and attributes
  5. Identify and correctly use delimiters
  6. Identify the prolog of an XML document and explain its purpose
  7. Identify the root element and its role in an XML document
  8. Correctly use balanced opening, closing, and self-closing tags
  9. Assemble the fundamental components of XML into a well-formed, valid XML document

 

 

Data And Metadata

In a Nutshell:

  • Data: the facts
  • Metadata: facts about the facts

An Example of "Facts"

Here's a common set of "facts":

Chiang Liu
1234 Main St.
Hales Ford VA 23456

Ask yourself:

  • What do you know about each "fact"?
  • How do you know?
  • How likely is it for a computer to know what these "facts" represent?

Here's another example:

2122
NEC187 1700
NEC196 -
NEC217 -
NEC228 1731

Ask yourself:

  • What do you know about each "fact"?
  • How likely is it for a computer to know what these "facts" represent?

Discussion and more questions

In the first example, most people can identify what each "fact" represents. That's because we're accustomed to seeing names and addresses in this arrangement and order. Even so, some questions remain:

  • Is "Chiang" a family name or a personal name?
  • Is this person a "he" or a "she"?

Metadata filled in for you:

The second example represents a portion of a train schedule.

Metadata               Data
Train number           2122
Station at milepost    NEC187
Departure time         1700
Station at milepost    NEC196
Departure time         -
Station at milepost    NEC217
Departure time         -
Station at milepost    NEC228
Departure time         1731

So what?

Humans need metadata sometimes.

Computers need metadata always.

The point: XML is a system for providing metadata, mainly for computers, but understandable for humans.
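As a rough sketch of that point, the bare train-schedule facts can be wrapped in metadata with a few lines of Python and the standard library's xml.etree.ElementTree module. The element and attribute names used here (train, stop, station, departure) are assumptions made up for this illustration; they do not come from any standard.

```python
import xml.etree.ElementTree as ET

# The bare "facts" from the second example: (milepost, departure time).
# A "-" means no departure time is listed for that station.
facts = [("NEC187", "1700"), ("NEC196", "-"), ("NEC217", "-"), ("NEC228", "1731")]

# Hypothetical element and attribute names, chosen only for this sketch.
train = ET.Element("train", number="2122")
for milepost, departure in facts:
    stop = ET.SubElement(train, "stop", station=milepost)
    if departure != "-":
        stop.set("departure", departure)

# Serialize the tree to text: every fact now carries a label.
schedule_xml = ET.tostring(train, encoding="unicode")
print(schedule_xml)
```

Once the labels are in place, a program no longer has to guess that 1700 is a departure time rather than, say, a price.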

 

Well-Formed XML

In a Nutshell:

Well-formed XML conforms to these four rules, determined by the World Wide Web Consortium (W3C):

  1. An XML document consists of a root element, optionally preceded by a prolog and followed by miscellaneous content (comments, processing instructions, and white space);
  2. XML elements have either
    • An opening tag, data, and a closing tag; or
    • A single self-closing tag.
  3. XML tags begin and end with angle-brackets < > and have a name.
    • Opening tags may have attributes;
    • Closing tags have a forward slash immediately after the first angle-bracket </;
    • Self-closing tags have a forward slash immediately before the second angle-bracket />
  4. Elements can contain other elements, but they cannot overlap. The first element to open is the last to close.
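One way to try these rules out is to hand small samples to an XML parser and see which ones it accepts. The sketch below uses Python's standard xml.etree.ElementTree module; the helper name is_well_formed and the sample strings are assumptions made for this illustration.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text):
    """Return True if the parser accepts text as well-formed XML."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

# Hypothetical samples, one per rule.
samples = {
    "nested correctly": "<a><b>data</b></a>",
    "self-closing root": "<a/>",
    "overlapping elements": "<a><b>data</a></b>",
    "missing closing tag": "<a><b>data</b>",
    "two root elements": "<a/><b/>",
}

for label, text in samples.items():
    print(label, "->", is_well_formed(text))
```

Only the first two samples pass. Note that this checks well-formedness only; it says nothing about validity.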

 

Check yourself!

 

Validity

In a Nutshell:

In addition to being well-formed, XML files must be valid: all the elements (tags and their contents) conform to the formal definition of the XML-based language being used.

  • Each standard XML language has an official, on-line, machine-readable definition, either:
    • A Document Type Definition (DTD) based on SGML; or
    • A Schema based on XML.
  • W3C (and other organizations) provide validators: software that checks an XML document against the on-line definition.
  • Documents that are not valid cannot be processed reliably by software that reads and interprets XML, so it's important to validate all XML documents.

What's the Difference between "Well-Formed" and "Valid"?

To be valid, a document must first be well-formed. But there's more...

No document is "just" XML, because XML isn't really a language; it's a set of rules for how to create a language. All "XML" documents contain some specific XML-based language. Anybody can invent an XML-based language - assuming they know the rules! There are thousands of standard ones; here are some that are better known:

  • InkML: "an XML data format for representing digital ink data that is input with an electronic pen or stylus as part of a multimodal system."
  • MathML: "a low-level specification for describing mathematics as a basis for machine to machine communication."
  • RDF: "The Resource Description Framework (RDF) is a general-purpose language for representing information in the Web."
  • RSS: "Really Simple Syndication" for news feeds, blog updates, and anything you want to keep informed about.
  • SMIL: "The Synchronized Multimedia Integration Language (SMIL, pronounced 'smile') enables simple authoring of interactive audiovisual presentations."
  • SVG: "Scalable Vector Graphics (SVG) ... is a language for describing two-dimensional graphics and graphical applications in XML."
  • WML: Wireless Markup Language "is a markup language ... intended for use in specifying content and user interface for narrowband devices, including cellular phones and pagers."
  • XHTML: The XML-compliant version of HTML for Web page display.
  • XLink: "XML Linking Language ... allows elements to be inserted into XML documents in order to create and describe links between resources."
  • XML Schemas "express shared vocabularies and allow machines to carry out rules made by people. They provide a means for defining the structure, content and semantics of XML documents in more detail."

These XML-based languages, and all widely-used ones, have on-line definitions. These definitions:

  • Are available publicly, world-wide, on the Internet;
  • Can be read and understood by XML software;
  • Provide a formal definition for each XML-based language;
  • Are used by XML-handling software to validate, process, display, and transform XML-based data documents.

So in order to be valid, an XML document must conform to the rules of one (or more) specific XML language's official, public definition, as well as being well-formed.

Ask yourself: what's the advantage to having publicly available XML language definitions on the Internet? Answers: wide diffusion, increased use, less chance of misunderstanding ... and others.

A Word about DTDs and Schemas

There are two ways you can formally define an XML-based language:

  • A Document Type Definition (DTD) based on SGML; or
  • A Schema based on XML.

These are the documents that provide a formal, on-line, machine-readable definition of the XML-based language. If you go into any detail at all with XML, you'll need to know a lot about DTDs and schemas. Here, we'll just point out a couple of important facts:

Schemas are the best way to define a new XML language. Why? Because schemas use XML itself, and so are consistent with everything else connected to XML. They also provide many more options and finer control over data formats than DTDs. Most new standards are defined using schemas.

DTDs are still important to know about, because they were the only way to define XML languages at first, so many of the original XML languages are defined using DTDs. They're also used in XML's predecessor, SGML (Standard Generalized Markup Language).

Where to Find Validators

The best way to validate a document is using W3C's validator, http://validator.w3.org/.

Many XML software tools have validators built-in. These are fine to use, but don't have the authoritative weight of the W3C.

Learn more about validators

Why Validity is Important

Sometimes, we can get away with XML-based documents that are well-formed but not valid. (Rarely, we can get away with documents that aren't even well-formed!) So if we can get away with it, why bother with formal validity?

The answer is that XML's strength lies in being both open and standard. Invalid documents are weak, because they deceive others into thinking they're open and standard.

  • Open: XML's whole philosophy is one of openness - being both public on the Internet, and designed by an organization (W3C) committed to taking input from all interested parties. When an XML DTD or schema is published on the Internet, it is, in effect, a pledge that all documents claiming to be written in that language conform to that definition. Any deviation from that definition is therefore deceptive.
  • Standard: When a standard like XML and its derived languages is created, people spend countless hours and thousands (or millions!) of dollars building software and systems that depend on that standard. When a document claims to follow a particular standard, but doesn't, the software is likely to break, the system is likely to fail, money and time will be wasted ... lives may even be lost, as we depend more on software systems - and XML - for our health and safety.

Creating valid documents - especially coding by hand - can be frustrating, as the validators don't usually give very helpful messages. Depending on your situation, there are a couple of things you can do:

  • If things just aren't working right, keep trying! Get to know the validator and its (frustrating!) messages.
  • If the XML standard you're using doesn't have a good way to express what you need, look around: chances are, there's a standard out there that does meet your needs. And if not, develop your own XML extension - that's a big part of how XML has grown!
Check yourself...

 

Elements and Attributes

In a Nutshell:

  • Elements are the fundamental building blocks from which XML documents are built.
    • Elements consist of tags, and very often, the data enclosed in them.
    • Some elements consist of just one tag, which is self-closing
    • Other elements contain data and/or other elements
  • Attributes are properties of elements that give information about a particular instance of an element.
    • Attributes are contained in the first tag of an element
    • Each attribute has a name, specific to the element it is part of, and a value.

Elements

Elements are the basic unit of XML documents. Think of them as the atoms from which the "chemistry" of XML is derived. Let's look at an example of a simple XML document:

<train number="353">
  <conductor name="Sylvester Ardmore" empNum="3785221"/>
  <engineer name="Eric Olsen" empNum="3993625"/>
  <locomotive type="P42">26</locomotive>
  <consist>
    <car type="baggage">133765</car>
    <car type="coach">100236</car>
    <car type="coach">100389</car>
    <car type="coach">100725</car>
    <car type="lounge/snack">110853</car>
  </consist>
</train>

This illustrates each of the types of elements:

  • Elements with no data, but with other elements inside them:
    <train>
    <consist>
  • Elements with text data inside them:
    <locomotive>
    <car>
  • Elements with no data ("empty" elements)
    <conductor>
    <engineer>

Note also that elements that are not empty (the ones with data or other elements inside them) have a closing tag. The closing tag has a forward slash right after the opening angle bracket </ and no attributes.

Ask yourself: Is there another logical possibility for what could be in an element? Answer: Yes, you could have an element containing both text data and other elements. These are called mixed elements and are skipped here for simplicity.
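The kinds of element described above can also be told apart by a program. This is a minimal sketch using Python's standard xml.etree.ElementTree module on an abridged copy of the train document; the function name kind and its labels are assumptions made for this illustration.

```python
import xml.etree.ElementTree as ET

# Abridged copy of the train document from this section.
doc = """<train number="353">
  <conductor name="Sylvester Ardmore" empNum="3785221"/>
  <locomotive type="P42">26</locomotive>
  <consist>
    <car type="baggage">133765</car>
    <car type="coach">100236</car>
  </consist>
</train>"""

def kind(element):
    """Classify an element by what it contains (white space is ignored)."""
    has_children = len(element) > 0
    # Text may sit before the first child or after any child (its "tail").
    has_text = bool((element.text or "").strip()) or any(
        (child.tail or "").strip() for child in element
    )
    if has_children and has_text:
        return "mixed"
    if has_children:
        return "contains elements"
    if has_text:
        return "contains text"
    return "empty"

kinds = {el.tag: kind(el) for el in ET.fromstring(doc).iter()}
for tag, label in kinds.items():
    print(tag, "->", label)
```

Here train and consist come out as containers of other elements, locomotive and car as text elements, and conductor as empty.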
Learn more about elements in the XML Specifications...  

What's the Difference between Empty and Non-empty Elements?

An empty element is one that has no content. It may have attributes, but nothing appears between its opening and closing tags, and usually everything is contained in one tag. Here are some more examples:

  • Element with content:
    <soda>Mountain Dew</soda>
  • Empty element (two ways to write it):
    <soda name="Mountain Dew"></soda>
    <soda name="Mountain Dew"/>
    The second example is "self-closing": the forward-slash just before the closing angle bracket /> signals that the element is closed, without needing the closing tag </soda> shown in the first example.
Ask yourself: Is the flexibility illustrated in these examples more helpful, or more confusing? You're right. (Whatever you said!) But I hope you'll find it helpful.
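To see that the two spellings of an empty element really are equivalent, both can be run through a parser. The sketch below uses Python's standard xml.etree.ElementTree module; the variable names are assumptions made for this illustration.

```python
import xml.etree.ElementTree as ET

long_form = ET.fromstring('<soda name="Mountain Dew"></soda>')
short_form = ET.fromstring('<soda name="Mountain Dew"/>')

# Once parsed, both spellings give the same element: same name, same
# attributes, no text, and no child elements.
for soda in (long_form, short_form):
    print(soda.tag, soda.attrib, soda.text, len(soda))

# When writing XML back out, the caller chooses which spelling to use.
as_one_tag = ET.tostring(short_form, encoding="unicode", short_empty_elements=True)
as_two_tags = ET.tostring(short_form, encoding="unicode", short_empty_elements=False)
print(as_one_tag)
print(as_two_tags)
```

Software that reads XML treats the two forms alike, so the choice is mostly a matter of readability.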

Attributes

Attributes are properties of elements that are listed inside the element's main tag.

Where to put attributes:
  • Attributes are placed in the main tag, which is either the first tag:
    <car type="coach">100389</car>
    or the only tag:
    <conductor name="Sylvester Ardmore" empNum="3785221"/>
  • If an element has a closing tag, attributes cannot be repeated in the closing tag:
    <locomotive type="P42">26</locomotive>
What attributes look like:
  • Each attribute starts with its name:
    type="coach"
    name="Sylvester Ardmore"
    empNum="3785221"
    The name is a kind of metadata.
  • After the name comes the equal-sign =
  • Each attribute has one (and only one) value. The value is always in quotes, which can be either single ' or double "
    number="353"
    empNum='3785221'
    name="Eric Olsen"
    The value is a kind of data.
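Here is a small sketch of how a program reads attributes, using Python's standard xml.etree.ElementTree module. The attribute name shift and the second sample element are hypothetical, added only to show what happens with a missing or repeated attribute.

```python
import xml.etree.ElementTree as ET

# Single and double quotes are mixed on purpose; the parser treats them alike.
engineer = ET.fromstring("""<engineer name="Eric Olsen" empNum='3993625'/>""")

print(engineer.attrib)        # every attribute as a name -> value mapping
print(engineer.get("name"))   # one value, looked up by its attribute name
print(engineer.get("shift"))  # hypothetical attribute that is absent: None

# "One (and only one) value": repeating a name in the same tag is an error.
try:
    ET.fromstring('<car type="coach" type="lounge">100389</car>')
    duplicate_accepted = True
except ET.ParseError:
    duplicate_accepted = False
print(duplicate_accepted)
```

Notice that the quotation marks themselves are not part of the value; they only mark where it starts and stops.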
When to Use Attributes:

Attribute values and the text content of elements are both examples of data, as opposed to metadata. So you may wonder how to decide whether to put data in an attribute or in an element's content.

There are no hard-and-fast rules, but here are some considerations:

  • If CSS (Cascading Style Sheets) is used to format an XML document's output, newer browsers will be able to display the data in an element just like an HTML Web page. But data in attributes cannot be shown this way: instead, using XSLT (eXtensible Stylesheet Language Transformations) would be necessary. CSS is more likely to be familiar to Web designers, and is somewhat less complex, so in general, data that is intended to be displayed should be in an element rather than an attribute.
  • Because attributes are part of tags, most people are more comfortable with short attribute values. Data in elements can comfortably be made longer.
  • When an element has many different types of data associated with it, it is less bulky to put the data in attributes - especially if the data itself isn't very long. That's because attributes don't need opening and closing tags for their values - just quotation marks around them.

 

Ask yourself: What might be some other considerations in deciding whether a data item should be coded as an attribute value or element value? Answer: There are lots of considerations, and no one correct answer!
Learn more about attributes in the XML specifications...
Check yourself!

 

 

XML Delimiters

In a Nutshell:

Delimiters are boundary markers. XML uses three main delimiter-pairs:

  • Angle brackets < > also known as less-than and greater-than signs: these mark the boundaries of tags.
  • Quotation marks " " and ' ' are used to mark the boundaries of string values within tags and data.
  • Entity delimiters & ; are used to mark the boundaries of codes representing special "entities" - usually characters that could cause problems if used directly.

 

Angle Bracket Delimiters

Angle brackets are used to mark the boundaries of tags.

<building number="37">Haven Hall</building>

The example shows a typical XML non-empty element, which begins and ends with the tags <building number="37"> and </building>. While the tags delimit (mark the beginning and end of) the element, the tags themselves are delimited with angle brackets < and >

To be specific about the angle brackets:

                 Left-Angle   Right-Angle
Alternate name   less-than    greater-than
Appearance       <            >
Entity code      &lt;         &gt;
Decimal code     60           62
Octal code       74           76
Hex code         3C           3E

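When one of these two symbols needs to appear as part of the data in an element, it should be represented by its entity code. (XML always requires this for <; for > it is required only in rare cases, but escaping it everywhere is the safe habit.)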

Ask yourself: Why are delimiters necessary for tags in XML? Answer: Computers can recognize the tags much more quickly if they are clearly marked.
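When data containing angle brackets is written into an element by a program, the entity codes are usually produced automatically rather than typed by hand. A minimal sketch using the escape function from Python's standard xml.sax.saxutils module follows; the element name code and the sample text are assumptions made for this illustration.

```python
from xml.sax.saxutils import escape
import xml.etree.ElementTree as ET

raw_text = "if (x < y || a > b)"

# escape() replaces &, < and > with &amp;, &lt; and &gt;.
safe_text = escape(raw_text)
element_source = "<code>" + safe_text + "</code>"
print(element_source)

# Parsing turns the entity codes back into the original characters.
round_trip = ET.fromstring(element_source).text
print(round_trip == raw_text)
```

The escaped form is what is stored in the file; the original characters are what the reading software sees.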

Quotation Mark Delimiters

Quotation marks are familiar to just about everyone, and work in XML much as they do in programming languages, or even literature.

Here's an explanation, just to make sure everything is clear:

  • Quotation marks are used to delimit the value of an attribute.
  • Quotation marks can also be used as part of normal text in string data.
  • The custom in XML documents has been to use double-quotes first (outermost), but this is not required, and either single or double can be used as the outermost.
  • Do not use curly quotes “ ” ‘ ’ (decimal codes 8220, 8221, 8216, and 8217 respectively) or grave accent marks ` (hex code 60). None of these are considered to be XML delimiters. Be careful if you use a word processor: they often substitute curly quotes for straight quotes, unless you turn off that feature.

Here are some examples of quotes used in XML:

<tree genus="Quercus" species="virginiana">liveoak</tree>
<tree genus='Quercus' species='alba'>white oak</tree>

<citation author="Frazer, J. G." title='The Golden Bough' location="vol.1 p.111">
"Positive magic says, 'Do this in order that so and so may happen.'
Negative magic or taboo says 'Do not do this lest so and so should happen.'" </citation>

In the first two examples, quotes (double and single) are used to delimit values of attributes. In the third, they are used that way, and also in the element's data.

Here are the technical details about quotation marks:

                 Single-quote   Double-quote
Alternate name   apostrophe     quotation mark
Appearance       '              "
Entity code      &apos;         &quot;
Decimal code     39             34
Octal code       47             42
Hex code         27             22

 

Instead of beginning and ending delimiters being different from one another, the same delimiter is used at the beginning and end. Ask yourself: Why do you think quotation marks are used differently than angle brackets? Answer: Possibly because most people are already familiar with using quotes this way. (There are many possible reasons!)
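Choosing the outermost quotation mark gets awkward when the value itself contains quotes. As a sketch of one way to handle it, the quoteattr function in Python's standard xml.sax.saxutils module picks the delimiter for you. The second and third titles below are hypothetical variations invented to exercise each case.

```python
from xml.sax.saxutils import quoteattr
import xml.etree.ElementTree as ET

titles = [
    "The Golden Bough",           # no quotes inside: double quotes outside
    'The "Golden" Bough',         # double quotes inside: single quotes outside
    "Frazer's \"Golden Bough\"",  # both kinds inside: &quot; is needed
]

for title in titles:
    quoted = quoteattr(title)
    # Build a one-tag element around the quoted value and read it back.
    parsed = ET.fromstring("<citation title=" + quoted + "/>")
    print(quoted, "->", parsed.get("title") == title)
```

In every case the value read back by the parser matches the original, whichever delimiter was used.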

Entity Delimiters

Entities in XML are codes that can be used as abbreviations or ways of entering repetitive or difficult data. They are discussed in another module (see below).

The most commonly used entities stand for characters that either:

  • are difficult to enter, because they aren't found on standard keyboards - like the characters of many languages other than English; or
  • could be misinterpreted as delimiters when they are intended to represent data - specifically, the characters we're talking about here: < > " ' &

Examples:

XML Code                                                            Intended Output
<railroad>B&amp;O</railroad>                                        B&O
<railroad>AT&amp;SF</railroad>                                      AT&SF
<code lang="C++">if (x &lt; y || a &gt; b) cout &lt;&lt; a;</code>  if (x < y || a > b) cout << a;

Here are the technical details:

                 Ampersand    Semi-colon
Appearance       &            ;
Entity code      &amp;        &#x3B; *
Decimal code     38           59
Octal code       46           73
Hex code         26           3B

* Note: The code for the semi-colon hardly ever needs to be used. The semi-colon is only treated as a delimiter when it is at the end of a properly formed entity code.

Ask yourself: Why do you think the characters & and ; were chosen as delimiters? Answer: Your guess is as good as mine!
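The examples above can be checked by letting a parser do the decoding. The following sketch uses Python's standard xml.etree.ElementTree module; the note element is hypothetical and is there only to show numeric character codes.

```python
import xml.etree.ElementTree as ET

# The entity code &amp; is decoded to a plain ampersand.
railroad = ET.fromstring("<railroad>AT&amp;SF</railroad>")
print(railroad.text)

# A bare ampersand looks like the start of an entity code that never
# ends, so the document is rejected.
try:
    ET.fromstring("<railroad>AT&SF</railroad>")
    bare_ampersand_accepted = True
except ET.ParseError:
    bare_ampersand_accepted = False
print(bare_ampersand_accepted)

# Characters can also be given by number: decimal (&#38;) or hex (&#x3B;).
note = ET.fromstring("<note>&#38; &#x3B;</note>")
print(note.text)
```

The last line prints an ampersand and a semi-colon, matching the decimal and hex codes in the table.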
Learn more about entities...
Check yourself!

 

The XML Prolog Section

In a Nutshell:

The Prolog is the optional part of the document that comes before the root element. Its purpose is to help XML software to process the file correctly by giving background information, like the version of XML and the character encoding.


What is a Prolog?

In XML, the Prolog is the optional part of the document that comes before the root element.

In this mini-example, we have a prolog:

<?xml version="1.0" encoding="utf-8"?>
<train number="353">
  ...
</train>

What is the Prolog for?

Its purpose is to help XML software to process the file correctly by giving background information, like the version of XML, and the character encoding.

It's possible for software to process XML files without a prolog, but only if the software already knows all about the file. Information in the prolog gives software the ability to verify the file type and encoding, and to make adjustments if either of these is unexpected. With this information, XML files have the potential to be used much more widely.

If the XML language uses the older DTD - Document Type Definition - the doctype statement is part of the prolog, too. (We'll get to DTDs later.)

Ask yourself: The prolog is optional; is it worth the trouble of including it? Answer: Yes, because it helps ensure the document is processed correctly.
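To make the benefit concrete, here is a sketch of what can happen when the encoding information is missing. It uses Python's standard xml.etree.ElementTree module; the item element and the choice of the ISO-8859-1 encoding are assumptions made for this illustration.

```python
import xml.etree.ElementTree as ET

body = "<item>Entrée</item>"

# Store the same text as ISO-8859-1 bytes, with and without a prolog.
with_prolog = ('<?xml version="1.0" encoding="iso-8859-1"?>' + body).encode("iso-8859-1")
without_prolog = body.encode("iso-8859-1")

# With the prolog, the parser knows how to decode the accented letter.
print(ET.fromstring(with_prolog).text)

# Without it, the parser assumes UTF-8, and these bytes are not valid UTF-8.
try:
    ET.fromstring(without_prolog)
    parsed_without_prolog = True
except ET.ParseError:
    parsed_without_prolog = False
print(parsed_without_prolog)
```

The bytes in the two files are identical apart from the prolog, yet only the first can be read.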
Learn more about the Prolog...
Learn more about...
  • Versions: how XML is advancing
  • Character encoding: how letters and symbols are represented in computers
  • DTDs: Document Type Definitions (coming soon)
Check yourself!

 

The XML Root Element

In a Nutshell:

The root is the first element in an XML document, and it contains all the other elements of the document.


 

What is a Root Element?

The root element is the starting point of an XML document.

  • The root must be the first element in an XML document;
  • There can only be one root in a document;
  • The root contains all the other elements of the document inside it;
  • The root cannot be inside any other element.

Here's a mini-document:

<?xml version="1.0" encoding="utf-8"?>
<train number="353">
  <locomotive type="P42">26</locomotive>
  <consist>
    <car type="baggage">133765</car>
    <car type="coach">100236</car>
    <car type="coach">100389</car>
    <car type="coach">100725</car>
    <car type="lounge/snack">110853</car>
  </consist>
</train>
Ask yourself: What is the root element in this example? Answer: <train>

Why is it called a "root"?

It's called a "root" because of the way XML is designed to be processed. As soon as an XML file is opened by software that processes it, the software creates a data "tree". This is a very efficient way for computers to store and process data internally.

In fact, one of the reasons XML has become such a widely-used method for storing and transmitting data is that it is designed to be processed using this efficient internal structure.

All "tree" structures need to have a starting point, which is known as the "root" of the tree. From there, they branch out, with each element forming a limb or a leaf of the tree.
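The tree can be made visible with a short program. This sketch uses Python's standard xml.etree.ElementTree module on an abridged copy of the mini-document; the function name outline is an assumption made for this illustration.

```python
import xml.etree.ElementTree as ET

# Abridged copy of the mini-document (prolog omitted for brevity).
doc = """<train number="353">
  <locomotive type="P42">26</locomotive>
  <consist>
    <car type="baggage">133765</car>
    <car type="coach">100236</car>
  </consist>
</train>"""

def outline(element, depth=0):
    """Return one indented line per element, starting from the given one."""
    lines = ["  " * depth + element.tag]
    for child in element:
        lines.extend(outline(child, depth + 1))
    return lines

root = ET.fromstring(doc)  # parsing hands back the root of the tree
print("\n".join(outline(root)))
```

Every other element is reached by starting at the root and following branches, which is why a document can have only one root.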

Learn more about tree structures...
  • In HTML, which is typical of SGML and XML-based languages: module X20d.
  • In the Document Object Model (coming soon)
Check yourself!

 

The XML Tag

In a Nutshell:

XML tags are the markers for the beginning and end of each element.

  • XML tags are delimited by angle-brackets (or less-than and greater-than signs): < >
  • If an element has content (is non-empty), its opening tag must be balanced by a closing tag, which has the same name as the opening tag, but begins with </ and has no attributes.
  • If the element has no content (is empty), it has only one tag which is closed with the characters />

 

Anatomy of a Tag

The basic structure of all XML tags is simple. You've seen them! They look like the tags in this simple element:

<instrument>ruby laser</instrument>

The components:

  1. The opening delimiter <
  2. The name of the element
  3. The closing delimiter >

Beyond the basics, what's in the tag depends on what role it serves. There are three roles:

  • Opening tag: <instrument>
  • Closing tag: </instrument>
  • Self-closing (empty-element) tag: <instrument name="laser"/>

 

Opening Tags

The opening tag marks the beginning of an element that has content; after the content will come a closing tag.

<instrument>

Opening tags can also have any number of attributes:

<instrument type="ruby" partNumber="RL56721">

Learn more about attributes...

Closing Tags

Closing tags, at the end of an element, are simple: they just have a slash right before the name of the element:

</instrument>

Tip: the closing tag never has attributes, even if the opening tag does.
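A parser enforces these rules strictly, as the following sketch shows. It uses Python's standard xml.etree.ElementTree module. The sample strings are assumptions made for this illustration, and the third one relies on a fact not covered above: XML names are case-sensitive.

```python
import xml.etree.ElementTree as ET

candidates = [
    '<instrument type="ruby">laser</instrument>',              # balanced
    '<instrument type="ruby">laser</instrument type="ruby">',  # attribute in closing tag
    '<instrument type="ruby">laser</Instrument>',              # names differ in case
    '<instrument type="ruby">laser',                           # closing tag missing
]

results = []
for source in candidates:
    try:
        ET.fromstring(source)
        results.append("ok")
    except ET.ParseError as error:
        results.append("error: " + str(error))

for source, result in zip(candidates, results):
    print(source, "=>", result)
```

Only the first candidate is accepted; the error messages for the others are terse, but they do give a line and column number to start from.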

Self-Closing Tags

Elements that have no content are known as empty elements. They can have attributes inside the element's tag, but they have nothing after the tag.

To show processing software that it need not look for content, an empty element written as a single tag is required to have a slash before the closing delimiter:

<tree name="white oak" genus="Quercus" species="alba" qty="5"/>

This kind of tag is called "self-closing" because it doesn't need to be followed by a closing tag.

Check yourself!

 

XML Document Examples

In a Nutshell:

In this section, there are two examples of XML documents. You can use these to get an idea of what XML documents look like, or to review the basics of XML document structure.


 

Example 1

<?xml version="1.0" encoding="utf-8"?>
<mountains>
    <mountain name="Brokeoff Mountain"> 
      <elevation unit="feet">9144</elevation>
      <latitude>N40:26.717</latitude>
      <longitude>W121:33.605</longitude>
      <locality>Lassen Volcanic Wilderness
                Tehama County
                California
                United States
      </locality>
    </mountain>
    <mountain name="Shoshone Point">
      <elevation unit="feet">5672</elevation>
      <latitude>N40:41.045</latitude>
      <longitude>W116:32.353</longitude>
      <locality>Eureka County 
                Nevada 
                United States
      </locality>
    </mountain>
</mountains>
Ask yourself: Which line is the prolog? Answer: <?xml version="1.0" encoding="utf-8"?>
Ask yourself: What is the root element? Answer: <mountains>
Can you find any "empty" elements? Answer: this document has no empty elements
Can you find an element with an attribute? Answer: mountain and elevation have attributes
Ask yourself: What (if any) are the attributes? Answer: "name" and "unit"
Ask yourself: What (if any) are the attribute values? Answer: "Brokeoff Mountain", "feet" (in two places), and "Shoshone Point"
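The same questions can be put to a program. Here is a sketch using Python's standard xml.etree.ElementTree module on an abridged copy of Example 1 (only the mountain and elevation elements are kept); the variable names are assumptions made for this illustration.

```python
import xml.etree.ElementTree as ET

# Abridged copy of Example 1.
example_1 = """<mountains>
    <mountain name="Brokeoff Mountain">
      <elevation unit="feet">9144</elevation>
    </mountain>
    <mountain name="Shoshone Point">
      <elevation unit="feet">5672</elevation>
    </mountain>
</mountains>"""

root = ET.fromstring(example_1)
elements = list(root.iter())  # the root and everything inside it, in order

with_attributes = sorted({el.tag for el in elements if el.attrib})
attribute_names = sorted({name for el in elements for name in el.attrib})
attribute_values = [value for el in elements for value in el.attrib.values()]
empty = [el.tag for el in elements if len(el) == 0 and not (el.text or "").strip()]

print("root element:", root.tag)
print("elements with attributes:", with_attributes)
print("attribute names:", attribute_names)
print("attribute values:", attribute_values)
print("empty elements:", empty)
```

The program's answers agree with the ones given above, including the fact that "feet" turns up twice.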

Example 2

<?xml version="1.0" encoding="utf-8"?>
<rigs>
    <rig>
        <layout type="trac+trail"/>
        <tractor id="CK200511">
            <vin>KW044HH8693779450</vin>
            <manufacturer>Kenworth</manufacturer>
            <model>Heavy Hauler</model>
            <modelYear>2002</modelYear>
            <yearAcquired>2005</yearAcquired>
            <horsepower>650</horsepower>
            <weight unit="pounds">12000</weight>
            <axles>3</axles>
            <tires>10</tires>
        </tractor>
        <trailer id="LE199904">
            <manufacturer>East</manufacturer>
            <model>Rear-dump 30</model>
            <yearAcquired>1999</yearAcquired>
            <modelYear>1999</modelYear>
            <weight unit="pounds">11000</weight>
            <capacity unit="tons">30</capacity>
            <axles>2</axles>
            <tires>8</tires>
        </trailer>
        <loadedWeight unit="pounds">83000</loadedWeight>
    </rig>
</rigs>
Ask yourself: What is the root element? Answer: <rigs>
Can you find any elements with an attribute? Answer: layout, tractor, trailer, capacity, weight, and loadedWeight have attributes
Ask yourself: What (if any) are the attributes? Answer: "type", "id" and "unit"
Can you find any "empty" elements? Answer: layout is an empty element
Check yourself!

 

Putting XML Together (Introductory)

In this section, you'll put simple XML files together. We'll do it in two steps:

  1. In the main part, we'll give you the ideas we're trying to represent, and the code to implement them - but you'll need to put it together in the right order.
  2. In "Check yourself" we'll give you the ideas and let you write the code yourself.

Here's the Idea

You're setting up a simple restaurant menu database, and you want to use XML. Here's what you'll need to represent:

  • The fact that this is a menu
  • Type of item
    • Appetizer
    • Entrée
    • À la carte
    • Dessert
    • Beverage
  • Name of item (like Romaine, Boeuf bourguignon, Boysenberry sorbet, Harp lager...)
  • Price

Here's the Code to Put Together

This part has the lines of code you'll need. Your job is to put them together in the right order. Use your mouse to copy the code, paste it into a text editor, and drag each line to the right place. (Hint: there are three menu items entered...)

</item>
</item>
</item>
</menu>
<?xml version="1.0" encoding="utf-8"?>
<item>
<item>
<item>
<menu>
<name>Crab tempura</name>
<name>French-fried Onion Flower</name>
<name>Pinot Noir 2004 (bottle)</name>
<price>15.95</price>
<price>18.95</price>
<price>4.50</price>
<type>Appetizer</type>
<type>Beverage</type>
<type>Entrée</type>
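Once you have arranged the lines, you may want a quick way to check your work. The following is only a sketch of such a checker, written with Python's standard xml.etree.ElementTree module. The function name check_menu is hypothetical, and the rule that every item holds exactly one type, one name, and one price is an assumption based on the pieces listed above. The two sample attempts are deliberately wrong, so they don't give the answer away.

```python
import xml.etree.ElementTree as ET

def check_menu(text):
    """Return a list of problems found in an assembled menu document."""
    try:
        root = ET.fromstring(text)
    except ET.ParseError as error:
        return ["not well-formed: " + str(error)]
    problems = []
    if root.tag != "menu":
        problems.append("root element is <" + root.tag + ">, expected <menu>")
    for position, item in enumerate(root, start=1):
        if item.tag != "item":
            problems.append("child " + str(position) + " is not an <item>")
            continue
        # Assumed rule: each item has exactly one of each of these.
        for required in ("type", "name", "price"):
            if len(item.findall(required)) != 1:
                problems.append(
                    "item " + str(position) + " needs exactly one <" + required + ">"
                )
    return problems

# First attempt: <name> and <item> overlap, so it is not well-formed.
first_attempt = "<menu><item><type>Beverage</type><name>Harp lager</item></name></menu>"
# Second attempt: well-formed, but the item has no <price>.
second_attempt = "<menu><item><type>Beverage</type><name>Harp lager</name></item></menu>"

print(check_menu(first_attempt))
print(check_menu(second_attempt))
```

An empty list means no problems were found. Paste your own assembled document in place of the sample attempts to test it.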

 


Check yourself!

About This Document
Link to Review
Click here for review questions related to this module's objectives.
Audience

This module is for people who ...

Objectives

On successful completion of this module, you will be able to identify the core components and rules of an XML document including syntax and essential elements.

Module X10a: How to Code XML
This document is part of a modular instruction series in Computer Instruction. For more information, see the overview or the list of modules in this series, X: XML, XHTML, DHTML, CSS. This document has been used in the following classes: CIS 179.
History
Original: 15 September 2006, by Laurence J. Krieg
Last modification:
Copyright
Copyright © 2006, Laurence J. Krieg, Washtenaw Community College
Instructors: You may point to this file in your Web-based materials; however, its location may change without notice.
Students: You are welcome to make a copy for your personal use.
All other uses: Please contact the author, Laurence J. Krieg, for permission: krieg@ieee.org.