Coding XHTML



Whys and Wherefores
How XHTML evolved from HTML and XML


HTML was Developed From SGML

SGML, the Standard Generalized Markup Language, was designed as a way to build information-formatting languages. HTML, designed by Tim Berners-Lee in 1989-90, is one of the languages derived from SGML.

XML is a "Meta-Language"

HTML and the World Wide Web showed the potential for sharing information among millions of computers around the world. XML was derived from SGML as a simplified set of rules for designing information mark-up languages. XML isn't used directly; instead, it is used to create new languages for specific information-sharing purposes. Because it's a language for creating languages, it is called a "meta-language".

XML works by having each language derived from it defined in a formal description, such as a Document Type Definition (DTD). This is a file that declares which tags and attributes the language allows and how the tags may be nested inside one another. Information about how each tag appears is kept separately, in a Stylesheet.

XHTML adapts HTML to XML

With XML providing more possibilities for information-sharing, a new version of HTML was created to take advantage of the new features. That's XHTML.

But the new capabilities come at a price: Certain things have to be done differently, so for people who already know HTML, there are old habits to break and new ways to adopt.

One of the main differences between HTML and XHTML is the need to separate structure from format...

Separating Structure from Format


What are "Structure" and "Format"?

Structure is the outline of the information - the skeleton. It consists of things like:

  • major headings
  • less important headings
  • lists
  • tables

Format is how the information is presented. It could include things like:

  • font
  • color
  • size of text
  • style of voice output

Reasons for Separating Them

The main reason for separating structure and format is that there are so many ways information can be presented. The Web, and more generally the Internet, is used to transmit and present information in a stunning variety of media:

  • Computer screens (the usual way - for now!)
  • Hand-held computers
  • Cell phones
  • Text readers for the blind
  • Paper printers
  • Braille printers

And more interestingly, computerized agents ("bots") are evolving to become information-search tools with the potential to leave today's search engines far behind.

For all but computer screens, HTML is a hassle to use, because it is very difficult or impossible to tell which tags can be ignored by (for example) a text reader, and which tags have information that helps make the meaning and structure clear.

How the Separation Works

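XHTML asks us to avoid tags that are purely formatting devices, such as <font>, <b>, <i>. (The Strict version of XHTML 1.0 drops <font> altogether; the Transitional version used in this module still accepts it, but it is best avoided.) Instead, we use "styles" - Cascading Style Sheets (CSS) that work somewhat like the styles in Microsoft Word.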

Styles allow us to define the formatting for each of the structural tags - the headings, paragraphs, lists, table parts, "strong" and "emphasis" tags. By changing the definition of the formatting for each style, one document can be used in a wide variety of ways.
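
As a minimal sketch of the idea, a style block like the one below could sit in the Head area of a page. The fonts, colors, and sizes are assumptions chosen only for illustration; they do not come from this module.

```html
<style type="text/css">
  /* Formatting for structural tags; all values are illustrative */
  h1     { font-family: Arial, sans-serif; font-size: 150%; color: navy; }
  p      { font-family: Georgia, serif; }
  strong { font-weight: bold; }
  em     { font-style: italic; }

  /* The same document, restyled when it is sent to a paper printer */
  @media print {
    h1 { color: black; }
  }
</style>
```

Changing one rule here changes every heading or paragraph in the document, without touching the tags in the body.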


 
Nuts and Bolts
From the Top


What in the World is This?

When a browser - or more generally a User Agent (UA) - receives a file to display, the first thing it needs to know is what kind of file it is. "What in the world does this file contain?" A closely related second question is, "What am I supposed to do to present it to humans on this device?"

When HTML was the main language used on the Web, browsers had only to look at the <HTML> tag at the beginning, and use their own programmed instructions to read the file.

With XML and its many derivative languages, the job is not so simple. Instead of being programmed only to handle HTML, software that reads the file has to work out which XML-based language it contains. Since XML-based languages are constantly being developed and improved, the file needs to point to an on-line definition of the variant it uses. (Validators read that definition in full; most browsers simply recognize the reference and rely on their built-in knowledge of XHTML.)

So: the first thing in a document needs to be a reference to the language being used, and where to find its definition - the Document Type Definition, or DTD.

In the version of XHTML we're using now, this means including a DOCTYPE statement at the beginning of the file:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">

By using this, all files that use XHTML refer to the same XML definition, located at the World Wide Web Consortium (W3C). W3C is the organization that defines HTML, XML, and XHTML - plus many other derivatives of XML.

If you're curious about XML, you can look at the DTD yourself.

XHTML Namespace Declaration

In addition to the DTD, the browser uses a Namespace Declaration. This is a more complete form of the <HTML> tag used in older HTML documents:

<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">

This contains a reference (for human reading) to the documents that explain XHTML, and also defines the human language of the document - in this case "en" = English.

Character Set Definition

There are many types of characters out there, and you can't be too careful. ;-) Actually, the "characters" referred to are letters and symbols. If the World Wide Web is to be truly world-wide, it must accommodate the letters and symbols of all the major languages, including Greek, Hebrew, Arabic, Russian, Korean, Chinese, Japanese and many others - maybe even Elvish.

Before a document can truly claim to be a citizen of the World Wide Web, it needs a "passport" defining what character set (alphabet) it uses. To specify the Western European (Latin-1) character set, which covers English, put this line in the Head area of your file:

<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1" />
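
Put together, the DOCTYPE statement, the Namespace Declaration, and the Character Set Definition might frame a page as in the sketch below. The title and paragraph text are placeholders, not part of the module.

```html
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>
<title>My Page Title</title>
<!-- Character Set Definition: Latin-1, as described above -->
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1" />
</head>
<body>
<p>My first XHTML paragraph.</p>
</body>
</html>
```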

Internal Differences


Once you've got those lines in the file, you can write normal HTML code, except for a few little differences...

The Case for Case

XHTML tags are all lower-case. Although most browsers understand the tags if you write them in upper-case or with initial caps, they aren't really defined that way. HTML has always had some elements that are case-sensitive: character codes for accented letters like &Eacute; which produces É, as opposed to &eacute; which displays é. With XHTML we need to keep case consistent by using all lower-case tags.

Tags to Avoid

Since HTML was originally designed to have structure and format tags all mixed up together, we now have to avoid tags that simply format the text. The main ones to avoid are:

HTML                        XHTML
<font ...>                  use styles instead
<b>                         use <strong> instead
<i>                         use <em> instead
<center>                    use styles instead
<td> for all table cells    use <th> in the table header row, and <td> for the rest

It's also better to use styles instead of putting align=x and valign=y in your headings, paragraphs, and table elements.
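
Here is a sketch of how those substitutions might look in practice. The two fragments belong in different parts of the file, as the comments indicate; the class name "note" and the style values are hypothetical.

```html
<!-- In the Head area: the formatting lives in styles -->
<style type="text/css">
  p.note { color: maroon; text-align: center; }
  th     { vertical-align: top; }
</style>

<!-- In the Body: only structural tags remain -->
<p class="note">This is <strong>important</strong>, and this is <em>emphasized</em>.</p>

<table>
  <tr><th>Town</th><th>County</th></tr>
  <tr><td>Saline</td><td>Washtenaw</td></tr>
</table>
```

The style rules replace what <center>, <font color="maroon">, and valign="top" would have done in older HTML.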

Close the @*#*%*& Tag Behind You!

The final difference is the need to close all tags. For every tag you start, you must have a closing tag as well - or put a forward-slash at the end of the tag.

In HTML, the closing tag for some tags, like <p> and <li>, was optional. Not in XHTML. Here's a list of tags you need to watch out for:

HTML:
<P>My paragraph.
XHTML:
<p>My paragraph.</p>

HTML:
<UL>
<LI>My first list item
<LI>My second list item
</ul>
XHTML:
<ul>
<li>My first list item</li>
<li>My second list item</li>
</ul>

HTML:
My line break <BR>
goes here
XHTML:
My line break <br />
goes here

HTML:
My horizontal rule:
<HR>
XHTML:
My horizontal rule:
<hr />

HTML:
My photo:
<IMG SRC="me.gif">
XHTML:
My photo:
<img src="me.gif" alt="Photo of me" />

HTML:
My anchor tag:
<A name="Top">
XHTML:
My anchor tag:
<a name="Top">&nbsp;</a>

 

How You Can Tell


How can you tell if you got it right?

Most browsers still operate on the theory that people don't like to see error messages. When they run into a coding error, they just quietly ignore whatever they don't understand, and do the best they can. That may be OK for amateur Web designers, but not for professionals. Professionals need to get it right - not just for their reputation and self-esteem, but because correct, valid code is more likely to work on multiple browsers and is easier to maintain.

So, we validate our code.

Various Validators

The W3C maintains an on-line validator which you can access at their site: http://validator.w3.org/.

In addition to W3C, you can validate at the Web Design Group: http://www.htmlhelp.com/tools/validator/.

At both places, you enter the URL of a Web page, or at W3C you can give it the path to a local file on your disk. The software goes over your page with a fine-tooth comb. In a few moments, it returns a list of errors - or if you're lucky or persistent, the good news that your file is valid.

Another way to access the validators is to use bookmarklets or favelets - bookmarks or favorites that automatically ask the browser to take the page you're looking at to a validator. One place where you can get them is at Gazingus: http://www.gazingus.org/js/?id=102.
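
A related trick is to put a link to the validator in the page itself. The sketch below assumes the W3C validator's check?uri=referer address, which asks the validator to check whichever page the link was followed from. The link text is only a suggestion, and it is worth confirming the address on the validator site before relying on it.

```html
<p>
  <a href="http://validator.w3.org/check?uri=referer">Validate this page</a>
</p>
```

Because the validator has to fetch the page itself, a link like this is only useful once the page is on a public Web server; for files on your own disk, upload the file to the validator instead.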

Try It...


Play with HTML, XHTML, and Validators

The best way to understand XHTML and validators is to play with them. Here's something to play with...

  1. Copy the HTML code from the colored area below.
  2. Paste it into a totally empty file in a text- or code-editor.
  3. Look carefully at each line of code, and see if you can identify what isn't good XHTML. Don't change the bad code yet!
  4. Save the file with a .htm extension and submit it to the W3C validator. You can use the Choose button to find the file on your disk, then click Validate File.
  5. Look at the error report - what is it checking for? Is it checking for XHTML, or what? See if you can understand the error message.
  6. In your editor, go back and add the DOCTYPE statement and the Namespace Declaration at the top of the file, removing the simple <HTML> tag. In the Head area, just below the <TITLE> tag, add the Character Set Definition - but don't make any other changes.
  7. Resubmit the file to the validator. What is it checking for now? Are the error messages different? Can you understand them? Why are there so many for such a small file?
  8. Now change the code in your editor to good XHTML.
  9. Resubmit the code to the validator, and if there are still errors, keep correcting and resubmitting until you get the message that your file is valid XHTML.

<HTML>
<HEAD>
<title>Towns of Washtenaw County</title>
</head>

<BODY>
<H1>Towns of Washtenaw County</h1>
<P>This is a list of the main cities and towns in Washtenaw County, Michigan.

<OL>
<LI>Ann Arbor
<LI>Ypsilanti
<LI>Saline
<LI>Dexter
<LI>Chelsea
<LI>Manchester
</ol>

<b>Milan</b> is partly in Washtenaw County and partly in <i>Monroe County</i>.

</body>
</html>

When you have validated your code for this example, try the same with a file of your own. Choose a small file you've created for an earlier class or small project.


About This Document
Review

Click here for review questions.

Audience


This module is for people who are familiar with HTML and want to learn how to code in XHTML. A knowledge of HTML is expected at least up through simple tables (module W24h).

 

Objectives

On successful completion of this module, you will be able to:

  1. Discuss the evolution of XHTML from HTML and XML;
  2. Explain why XHTML is preferable to HTML;
  3. Discuss the policy of separating format from structure;
  4. Describe the XHTML DOCTYPE statement;
  5. Describe the XHTML Namespace declaration;
  6. List the tags that can remain open in HTML but must be closed in XHTML;
  7. List the HTML tags that are to be avoided in XHTML, and the alternatives to using them;
  8. Discuss the need to use lowercase in all tags.

Module X10d: Coding XHTML

This document is part of a modular instruction series in Computer Instruction. For more information, see the overview or the list of modules in this series, X: "XML, XHTML, DHTML, and CSS". This document has been used in the following classes: INP 270.
History:
Original: 20 January 2003
Last modification: Monday, 31-Aug-2009 11:48:07 EDT
Copyright
Copyright © 2003, Laurence J. Krieg, Washtenaw Community College
Instructors: You may point to this file in your Web-based materials; however, its location may change without notice.
Students: You are welcome to make a copy for your personal use.
All other uses: Please contact the author, Laurence J. Krieg, for permission: krieg@ieee.org.