Instructional Module X02c

Markup Languages


Overview

Let's take a look at the World Wide Web's "markup languages" and see what's involved. To begin with, exactly what is a "markup language"? How does computer markup work? Where did HTML and XHTML come from? These are all questions we'll address in this module.


Markup Languages
What Is Markup?

Markup is a way of giving instructions on how a text should look. In the days when people wrote drafts of papers and books with a typewriter (or even by hand!), an editor would "mark it up" for the person who set the type for printing. There was a standard set of proofreaders' and editors' marks for this, so any typesetter would know what any editor meant. Here's an example of this kind of markup, using Abraham Lincoln's "Gettysburg Address":

[Figure: Draft with Markup, beside Final Printed Copy. From Webster's New Collegiate Dictionary (1959), p. 1159]

This is where the term "markup" came from. As you can see, the draft in the illustration above has been thoroughly marked up!

Computers and Text

Computers were not originally designed to store text. As their name implies, most were intended to compute things: numbers. It wasn't until computers were nearly 30 years old that it became practical to use them for storing text and printing it out in anything but the plainest, typewriter-like format.

As computer storage devices became able to hold more, output devices were developed in the mid to late 1960s that could create printing almost as good as that of printing presses. It became necessary to change the markup system from something human typesetters and printers could understand, to something computer typesetting machines could understand. This is what led to computer markup.


Computer Markup
Specific Markup

Corporate pioneers in computer text processing and typesetting tended to work alone. This is pretty normal in the early stages of developing a new technology: everyone wants to be first to market with the new systems.

The result of this was that each organization developed its own specific way to "mark up" text for its computer system, like MIT's runoff system and IBM's page-1. Human editors and typesetters shared one standard set of marks, but computerized text processing and printing had no standard markup language.

So text input into one brand of computer and marked up for printing couldn't be transferred to any other brand of computer. This was inconvenient and very expensive, especially as newer technology came out that was incompatible with the older markup language.

Generalized Markup

The expense and inconvenience of specific computer markup languages drove people to consider creating a generalized markup language that could be used on many systems. According to Charles Goldfarb's talk "The Roots of SGML -- A Personal Recollection" (http://www.sgmlsource.com/history/roots.htm), the first person to campaign for a generalized markup language was William Tunnicliffe, chairman of the Graphic Communications Association, in 1967.

Goldfarb himself, working at IBM with Ed Mosher and Ray Lorie, created the Generalized Markup Language (GML) in 1969. GML was meant to let markup serve two purposes:

  • Mark the function of different parts of a text for information retrieval; and
  • Allow formatting of the different parts of text for printing.

Here's how Goldfarb himself put it, in "Design Considerations for Integrated Text Processing Systems", IBM Cambridge Scientific Center Technical Report G320-2094, May 1973 (but written in 1971):

This analysis of the markup process suggests that it should be possible to design a generalized markup language so that markup would be useful for more than one application or computer system. Such a language would restrict markup within the document to identification of the document's structure and other attributes. This could be done, for example, with mnemonic "tags". The designation of a component as being of a particular type would mean only that it will be processed identically to other components of that type. The actual processing commands, however, would not be included in the text, since these could vary from one application to another, and from one processing system to another. (Quoted in Goldfarb's The Roots of SGML -- A Personal Recollection)
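Goldfarb's idea can be sketched in a few lines of Python. This is an illustration only, not GML itself: XML-style syntax stands in for GML's own notation, and the tag names (`memo`, `title`, `para`) and both rule tables are invented for this module. The document says only what each part is; the processing commands live outside the text and differ per application.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: the tags say what each part IS, not how it should look.
DOCUMENT = """<memo>
<title>Budget Meeting</title>
<para>The meeting starts at noon.</para>
</memo>"""

# Processing commands are kept outside the text, one table per application.
# Both tables are invented for this sketch.
PRINT_RULES = {"title": str.upper, "para": lambda text: "    " + text}
INDEX_RULES = {"title": lambda text: "TITLE: " + text}


def process(document, rules):
    """Apply the rule for each component type; skip types that have no rule."""
    root = ET.fromstring(document)
    output = []
    for element in root:
        rule = rules.get(element.tag)
        if rule is not None:
            output.append(rule(element.text))
    return output


printed = process(DOCUMENT, PRINT_RULES)  # formatting for print
indexed = process(DOCUMENT, INDEX_RULES)  # information retrieval
```

The same marked-up text serves both purposes: `printed` holds the formatted lines and `indexed` holds only the title entry, without any change to the document.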
Meta-Languages

The idea of a generalized markup language caught on rapidly, but there were problems with GML:

  • It was owned by IBM, so it couldn't be developed by any other company or group;
  • It was difficult to extend to meet new needs.

Because of these problems, people from the information and text processing community got together to create something new. These people included Goldfarb, Tunnicliffe, and Brian Reid, a Carnegie Mellon University researcher who had developed a generalized markup language called Scribe. What they needed was a language that was...

  • An open standard, rather than proprietary;
  • Flexible enough to extend to lots of new needs.

These researchers formed the nucleus of a group of hundreds of people from around the world who worked on this effort for eight years. The result was the Standard Generalized Markup Language, SGML, published as ISO 8879 in 1986. (Like all ISO standards, this is available for sale, but not for free viewing.)

SGML is known as a meta-language because it is not used directly in text markup, but is used to create generalized markup languages.
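One way to picture a meta-language is as a tool for declaring which elements a new markup language allows. The Python sketch below is only an analogy: real SGML declares a language in a document type definition (DTD), while here a plain dictionary plays that role. The "recipe" language, its element names, and the `conforms` helper are all invented for illustration.

```python
import xml.etree.ElementTree as ET

# Invented stand-in for a document type definition: each element name maps to
# the set of child elements it may contain. Real SGML expresses this in a DTD.
RECIPE_LANGUAGE = {
    "recipe": {"title", "step"},
    "title": set(),
    "step": set(),
}


def conforms(document, language, root_name):
    """Return True if every element and its children are allowed by `language`."""
    root = ET.fromstring(document)
    if root.tag != root_name:
        return False
    pending = [root]
    while pending:
        element = pending.pop()
        if element.tag not in language:
            return False
        for child in element:
            if child.tag not in language[element.tag]:
                return False
            pending.append(child)
    return True


good = "<recipe><title>Toast</title><step>Toast the bread.</step></recipe>"
bad = "<recipe><title>Toast</title><oven>hot</oven></recipe>"

good_ok = conforms(good, RECIPE_LANGUAGE, "recipe")  # True
bad_ok = conforms(bad, RECIPE_LANGUAGE, "recipe")    # False: <oven> is not declared
```

Swapping in a different dictionary defines a different markup language without touching the checking code, which is the sense in which the declaration mechanism sits one level above any single language.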

Evolution of SGML

SGML went through several working drafts before it was published as an ISO standard in 1986. Since then, the standard has stayed stable and continues to be widely used. SGML is the basis of:

  • HTML
  • XML
  • XHTML
  • and scores of other markup languages.

We turn now to some of the best-known descendants of SGML: those used on the World Wide Web.

HTML and the World Wide Web
Codes on the World Wide Web

The World Wide Web has been based on two "languages":

  • HyperText Markup Language (HTML) for marking up text to be displayed on the Web; and
  • HyperText Transfer Protocol (HTTP) for transferring information between Web clients (primarily browsers) and Web servers.

We'll be looking in more detail at HTML here.

HTML's SGML Origin

Tim Berners-Lee put the concept of hypertext together with the capabilities of the Internet in 1989, primarily for the benefit of the scientists with whom he worked at the European Particle Physics Lab, "Conseil Européen pour la Recherche Nucléaire" (CERN), near Geneva, Switzerland.

The great success of the Web was in part the result of the purpose for which Berners-Lee created it:

The Web was for people who were absorbed in scientific experimentation, but wanted to share their exciting findings with colleagues around the world.

This resulted in a markup language based on SGML that was...

  • Simple:
    Berners-Lee used a very small subset of SGML, which meant that preoccupied scientists didn't have a complex system to learn, and browsers could be relatively small and fast.
  • Flexible:
    As the Web became increasingly popular, people were able to do many things with it besides mark up scientific reports.
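To give a feel for how small that subset is, here is a sketch that reads a tiny HTML page with Python's standard `html.parser` module. The page content, the file name `details.html`, and the `TagCollector` class are made up for this example; they are not taken from any early CERN document.

```python
from html.parser import HTMLParser

# Hypothetical page: a handful of tags is enough for a title, a heading,
# a paragraph, and a hypertext link.
PAGE = """<html>
<head><title>Beam Test Results</title></head>
<body>
<h1>Beam Test Results</h1>
<p>See the <a href="details.html">detailed report</a>.</p>
</body>
</html>"""


class TagCollector(HTMLParser):
    """Record every start tag and the target of every link."""

    def __init__(self):
        super().__init__()
        self.tags = []
        self.links = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)
        if tag == "a":
            self.links.extend(value for name, value in attrs if name == "href")


collector = TagCollector()
collector.feed(PAGE)
# collector.tags  -> ['html', 'head', 'title', 'body', 'h1', 'p', 'a']
# collector.links -> ['details.html']
```

Seven element types are all this page needs, which suggests why a busy scientist could pick the language up quickly.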
Evolution of HTML

It seems now that there is no facet of information exchange that isn't handled on the World Wide Web. Millions of people create Web pages for thousands of purposes. The Web has become an art form as well as a source of information.

The Web also became an arena in which giant corporations wrestled with each other for market dominance. One of the tricks these giants used to gain dominant positions for their products was adding new features to the markup language their browsers could display. This led to helter-skelter growth of HTML, frustration for Web developers, and confusion for Web users.

By 1994, the situation was difficult, and by 1996 it was intolerable. Tim Berners-Lee left CERN in 1994 to found the World Wide Web Consortium (W3C), his idea being to get the major Web stakeholders together and convince them to develop standards that would allow the Web to grow in a less confusing way. His effort has largely succeeded, owing to his vision, combined with the efforts of hundreds of members and staff at W3C. The result was increasing acceptance of standard versions of HTML, up through version 4.01.

We'll take a look at some of the fruits of this effort next.

 
XML and World Wide Computing
From SGML to XML

One of the first and most important efforts of W3C members beyond HTML was to develop an eXtensible Markup Language (XML). The vision for XML involved...

  • Taking features of SGML that were relevant to the Internet, and...
  • Developing features that would facilitate exchange of data as well as text.

Like SGML, XML was designed as a meta-language - not used directly for markup, but for creating other markup languages.

XML does not depart far from SGML: it uses the same concepts, and even many of the same markers and entities.
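The "exchange of data" goal can be sketched with Python's standard `xml.etree.ElementTree` module. The record, its field names, and the element name `reading` are hypothetical; the point is only that one program can write a record as XML text and another can read it back without either knowing how the other stores data internally.

```python
import xml.etree.ElementTree as ET

# Hypothetical record to pass between two programs.
reading = {"station": "Ann Arbor", "celsius": "21.5"}

# Sender: build an XML document from the record and turn it into text.
root = ET.Element("reading")
for name, value in reading.items():
    ET.SubElement(root, name).text = value
wire_text = ET.tostring(root, encoding="unicode")

# Receiver: parse the text back into a record.
received = {child.tag: child.text for child in ET.fromstring(wire_text)}
```

After the round trip, `received` equals the original `reading`. All values are kept as strings here because XML itself carries only text; deciding that `celsius` is a number is left to the receiving program.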

The first version of XML was agreed on and officially published in 1998. The standard is maintained by W3C, and the details are freely visible on W3C's XML site, http://www.w3.org/XML/.

Local pride note: The XML Core Working Group is co-chaired by Paul Grosso, of Arbortext. Arbortext is an Ann Arbor company specializing in XML-based publishing. You can see it just north of I-94 near State Street.

Leading XML-based Languages

Many markup languages have been developed from XML. Most of them are not centrally registered with W3C, since anyone who makes the effort can use XML to create their own markup language.

Some of the major efforts are based at W3C though. Here is a partial listing:

  • InkML: "an XML data format for representing digital ink data that is input with an electronic pen or stylus as part of a multimodal system."
  • MathML: "a low-level specification for describing mathematics as a basis for machine to machine communication. It provides a much needed foundation for the inclusion of mathematical expressions in Web pages".
  • RDF: "The Resource Description Framework (RDF) is a general-purpose language for representing information in the Web."
  • SMIL: "The Synchronized Multimedia Integration Language (SMIL, pronounced 'smile') enables simple authoring of interactive audiovisual presentations."
  • SVG: "Scalable Vector Graphics (SVG) ... is a language for describing two-dimensional graphics and graphical applications in XML."
  • The "W3C Voice Browser Working Group ... is defining a suite of markup languages covering dialog, speech synthesis, speech recognition, call control and other aspects of interactive voice response applications. Specifications such as the Speech Synthesis Markup Language, Speech Recognition Grammar Specification, and Call Control XML are core technologies for describing speech synthesis, recognition grammars, and call control constructs respectively. VoiceXML is a dialog markup language that leverages the other specifications for creating dialogs that feature synthesized speech, digitized audio, recognition of spoken and DTMF key (touch tone) input, recording of spoken input, telephony, and mixed initiative conversations."
  • WML: Wireless Markup Language "is a markup language based on XML, and is intended for use in specifying content and user interface for narrowband devices, including cellular phones and pagers."
  • XLink: "XML Linking Language ... allows elements to be inserted into XML documents in order to create and describe links between resources."
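As a small illustration of one language from this list, the sketch below builds a minimal SVG document with Python's standard `xml.etree.ElementTree` module. The drawing itself (a red circle on a 100 by 100 canvas) is invented for this example; the namespace address is the one SVG documents use.

```python
import xml.etree.ElementTree as ET

# The SVG namespace; registering it with an empty prefix keeps the output
# free of "ns0:" prefixes.
SVG_NS = "http://www.w3.org/2000/svg"
ET.register_namespace("", SVG_NS)

# Hypothetical drawing: one red circle centered on a 100 x 100 canvas.
svg = ET.Element(f"{{{SVG_NS}}}svg", {"width": "100", "height": "100"})
ET.SubElement(
    svg,
    f"{{{SVG_NS}}}circle",
    {"cx": "50", "cy": "50", "r": "40", "fill": "red"},
)
svg_text = ET.tostring(svg, encoding="unicode")
```

Because SVG is an XML-based language, an ordinary XML library is enough to produce it; saving `svg_text` to a file ending in `.svg` should give a picture that most current browsers can display.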
XHTML

"The Extensible HyperText Markup Language (XHTML™) is a family of current and future document types and modules that reproduce, subset, and extend HTML, reformulated in XML. XHTML Family document types are all XML-based, and ultimately are designed to work in conjunction with XML-based user agents. XHTML is the successor of HTML, and a series of specifications has been developed for XHTML." (Quoted from W3C's "HyperText Markup Language (HTML) Home Page", http://www.w3.org/MarkUp/)

As you can see from the listing in the previous section, XHTML is just one of many XML-based markup languages. But it is likely to be the most widely known of those languages, just as HTML is the most widely known SGML-based language. That's why we devote an entire series of classes (Web Coding I - IV) to it!
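One practical consequence of "reformulated in XML" is that XHTML has to follow XML's stricter rules, such as closing every element. The sketch below uses Python's standard XML parser to compare two invented fragments; the helper name `is_well_formed` is ours, and the check covers only XML well-formedness, not full XHTML validity.

```python
import xml.etree.ElementTree as ET

# Two hypothetical fragments that a browser would display the same way.
html_style = "<p>First line<br>Second line</p>"      # <br> is never closed
xhtml_style = "<p>First line<br />Second line</p>"   # <br /> closes itself


def is_well_formed(fragment):
    """Return True if an XML parser accepts the fragment."""
    try:
        ET.fromstring(fragment)
        return True
    except ET.ParseError:
        return False


html_ok = is_well_formed(html_style)    # False
xhtml_ok = is_well_formed(xhtml_style)  # True
```

The HTML-style fragment is rejected because the parser reaches `</p>` while `<br>` is still open; the XHTML-style fragment passes.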


About This Document
Audience

This module is for people who are interested in the World Wide Web and would like to know about the origin of the markup languages used on the Web and elsewhere.

Objectives

On successful completion of this module, you will be able to:

  1. Define SGML, HTML, XHTML and XML
  2. Identify key differences between SGML, HTML, XHTML and XML
  3. Discuss the history of the W3C and the move towards standards compliance in browsers

Module X02c: Markup Languages
This document is part of a modular instruction series in Computer Instruction. For more information, see the overview or the list of modules in this series, X: XML, etc. This document has been used in the following classes: INP 150.
History
Original: 5 September 2003, by Laurence J. Krieg
Last modification: Monday, 31-Aug-2009 11:48:07 EDT
Copyright
Copyright © 2003, Laurence J. Krieg, Washtenaw Community College
Instructors: You may point to this file in your Web-based materials; however, its location may change without notice.
Students: You are welcome to make a copy for your personal use.
All other uses: Please contact the author, Laurence J. Krieg, for permission: krieg@ieee.org.
