How to Convert Microsoft Word Documents into Web Pages

Module W36c

Overview

MS Word as a Web Page Editor

MS Word is not primarily a Web page editor: Microsoft has other tools, notably FrontPage, intended to create Web pages. However, lots of organizations have a wealth of information stored as MS Word documents, and it's often desirable to convert these for use on the Web.

Since Microsoft is very committed to using the Internet and the Web, it's not surprising that they have put some effort into getting Word to convert documents into HTML.

There's a basic problem, though: HTML has a much more restricted set of capabilities than any modern word processor, especially one with the broad capabilities of MS Word. So unless the document is a very simple one, there will be features that don't convert to HTML. Word's HTML conversion also tends to distort spacing and fonts somewhat.

As a result, you'll just about always want to bring the output of Word's HTML converter into a Web page editor to see if you can restore some of the intended appearance. Often, you'll have to use substitutes, or re-think the design to make it more Web-oriented. But that's a good idea anyway: good paper-based designs often make poor Web-based designs!

I've put in a large table, based on MS Word's Help, that discusses all the major features of Word and what happens to them in HTML. But first, let's look at basic Word document conversion.

Basic Conversion



Converting a Word document to HTML is easy: File...Save as HTML
  1. On the File menu, select Save As HTML.
  2. Choose a name and directory for the file, and click OK.
Even with a simple document, it's a good idea to look at the resulting Web page in a browser and see how it came out. You may well want to modify the spacing, and possibly the fonts, using a Web page editor. To do that, you'll need to close the HTML version of the document in Word, since you can only open a file in one editing program at a time.

Let's look at some of the conversion details, first...

Converting MS Word Special Features

The following paragraph and table are found in MS Word Help under the title, "Learn what happens when you save a Word 97 document as a Web page." I have added my comments and suggestions to several of the topics.

From MS Word Help:

When you save a Word document as a Web page, Word closes the document and then reopens it in HTML format. Word displays the Web page similar to the way it will appear in a Web browser. Formatting and other items that aren't supported by HTML or the Web page authoring environment are removed from the file. This table shows the elements that Word changes or removes upon conversion.
Element Word to HTML Notes and Details  
...with comments added by Larry Krieg.
Comments See note Comments you insert with the Comments command on the Insert menu are removed. After saving the document in HTML format, however, you can enter comments and apply the Comments style. The comments will not appear when the Web page is displayed by a Web browser. 
If you want comments from a Word document to transfer to a Web page as comments, change them (one-by-one) to a different Word style - anything will do. When the document gets converted to HTML, go back and convert them (one-by-one) to HTML comments. 
Font sizes See note Fonts are mapped to the closest HTML size available, which ranges from size 1 to 7. These numbers are not point sizes but are used as instructions for font sizes by Web browsers. Word displays the fonts in sizes ranging from 9 to 36.
I have lots more to say about this in the section on fonts and styles below!
Emboss, shadow, engrave, all caps, small caps, double strikethrough, and outline text effects (Format menu, Font command, Font tab) No These character formats are lost, but the text is retained.
 
To get special font effects, the only reliable technique is to use a photo manipulation program to create the font effects, and save the file in a Web-acceptable graphics format, such as GIF, JPEG, or PNG.
Bold, strikethrough, italic, and underline effects Yes Some special underline effects, such as dotted underlines, are converted to a single underline, and some underline effects aren't converted.
Animated text
(Format menu, Font command, Animation tab)
See note Animations are lost, but the text is retained. For an animated effect, insert scrolling text into your page in the Web page authoring environment. 
 
How to add scrolling text to a Web page:
From the Insert menu, select Scrolling Text. Type or paste the text into the dialog window's text box, and if you like, specify a background color. But only MS Internet Explorer displays scrolling text created this way! A more useful way to create special text effects is to use JavaScript (but we don't get into JavaScript in this module!).
Graphics See note Graphics, such as pictures and clip art, are converted to GIF (.gif) format, unless the graphics are already in JPEG (.jpg) format. Drawing objects, such as text boxes and shapes, are not converted. Lines are converted to horizontal lines.
 
See the Graphics section in this module for more detail.
Tabs Yes Tabs are converted to the HTML tab character, represented in HTML source as &#9;. Tabs may appear as spaces in some Web browsers, so you may want to use indents or a table instead.
 
Tabs are never effective in Web pages. Indents work well for simple effects; otherwise, tables have to be used to line up information nicely.
Fields See note Field results are converted to text; field codes are removed. For instance, if you insert a DATE field, the text of the date converts, but the date will not continue to update.
Tables of contents, tables of authorities, and indexes See note The information is converted, but indexes and tables of contents, figures, and authorities can't be updated automatically after conversion because they are based on field codes. The table of contents displays asterisks in place of the page numbers; these asterisks are hyperlinks that the reader can click to navigate through the Web page. You can replace the asterisks with text that you want to have displayed for the hyperlinks.
Drop caps No Drop caps are removed. In the Web page authoring environment, you can increase the size of one letter by selecting it and then clicking Increase Font Size. Or, if you have a graphic image of a letter, you can insert it in front of the text.
Drawing objects, such as AutoShapes, text effects, text boxes, and shadows No Drawing objects are not retained. You can use drawing tools in the Web page authoring environment by inserting Word Picture Objects. The object is converted to GIF format.
 
See the Graphics section in this module for more detail.
Equations, charts, and other OLE objects  See note These items are converted to GIF images. The appearance is retained, but you won't be able to update these items.
Tables Yes Tables are converted, although settings that aren't supported in the Web page authoring environment are lost. Colored and variable width borders are not retained. 
Table widths See note By default, tables are converted with a fixed width. To convert a table with percentage width (so that the table is sized relative to the browser window), set the option PercentageTableWidth=1 in the following Windows 95 Registry location: HKEY_LOCAL_MACHINE\Software\Microsoft\Shared Tools\Text Converters\Export\HTML\Options 
 
Only edit your Windows 95 Registry if you are very sure you know what you are doing! You can cause serious problems if you make mistakes.
Highlighting No Highlighting is lost.
 
You can simulate highlighting with borderless, colored tables, but this technique doesn't let you highlight text in the same line as non-highlighted text.
Revision marks No Changes entered with the track changes feature are retained, but the revision marks are removed.
Page numbering No Because an HTML document is considered a single Web page, regardless of its length, page numbering is removed.
Margins No To control the layout of your page, you can use a table.
Borders around paragraphs and words No You can place borders around a table, and you can use horizontal lines to help emphasize or separate parts of your Web page.
Page borders No There isn't an HTML equivalent for a page border. You can make your pages more attractive by adding a background using the Background command on the Format menu. You can also place borders around a table, and you can use horizontal lines to help emphasize or separate parts of your Web page.
Headers and footers No There aren't equivalents for headers and footers in HTML. 
Footnotes and endnotes No
You can hyperlink your text to notes placed either elsewhere in the same document or in a different document, so your readers can quickly check them. For those who really want to put technology to work, JavaScript can be used to create notes that pop up when the reader's mouse moves over linked text! (But we're not doing JavaScript in this module, remember...:-)
Newspaper columns No For a multicolumn effect, use tables.
Styles See note User-defined styles are converted to direct formatting, provided the formatting is supported in HTML. For instance, if you convert a style that includes bold and shadow formatting, the bold formatting is retained as a direct formatting, but the shadow formatting is lost.
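
The advice on tabs in the table above (use a table to line things up) can be automated for simple cases. Here's a minimal sketch in Python, which is not a tool covered in this module; the function name tabs_to_table and the sample text are made up for illustration. It turns tab-separated lines into a borderless HTML table that you could paste into a Web page editor.

```python
from html import escape


def tabs_to_table(text):
    """Turn tab-separated lines into a simple borderless HTML table."""
    rows = [line.split("\t") for line in text.splitlines() if line.strip()]
    if not rows:
        return ""
    width = max(len(row) for row in rows)
    out = ['<TABLE BORDER="0">']
    for row in rows:
        # Pad short rows so every row has the same number of cells.
        cells = row + [""] * (width - len(row))
        tds = "".join("<TD>" + escape(cell.strip()) + "</TD>" for cell in cells)
        out.append("<TR>" + tds + "</TR>")
    out.append("</TABLE>")
    return "\n".join(out)


# Made-up sample: two columns separated by tab characters.
sample = "Apples\t3\nPears\t12"
print(tabs_to_table(sample))
```

The cell text is escaped so that characters like < and & in the original document don't get mistaken for HTML markup.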


MS Word Styles and What Happens to Them in HTML

[Image: New HTML template]

When you create an MS Word document, you can choose from several templates. One of them is listed as "Blank Web Page" under the Web Pages tab of the New dialog box. The actual template is normally stored in

C:\Program Files\Microsoft Office\Office\HTML.DOT
[Image: MS Word Styles available for New HTML documents]
[Image: MS Word Styles for HTML, scrolled down]
When you create or edit a Web page in Word, you get a choice of several HTML-related styles. One set of choices looks obvious but isn't: the styles Heading 1 through Heading 6 appear to be the same as Netscape Composer's Heading 1 through Heading 6, which translate into HTML <H1> through <H6> tags. In MS Word, these actually do not produce the expected HTML tags! Instead, they produce custom-formatted text using various fonts and sizes. These are not bad, but they aren't standard either, and they depend on the availability of the font used, which is often Arial. In order to get genuine <H1> through <H6> tags in Word, you should use the MS Word styles H1 through H6, which are so far down in the list box that you can't see them without scrolling down. This is illustrated in the following examples:

Microsoft HTML Template H1

Microsoft HTML Template Heading 1

Microsoft HTML Template H2

Microsoft HTML Template Heading 2

Microsoft HTML Template H3

Microsoft HTML Template Heading 3

Microsoft HTML Template H4

Microsoft HTML Template Heading 4
Microsoft HTML Template H5
Microsoft HTML Template Heading 5
Microsoft HTML Template H6
Microsoft HTML Template Heading 6
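
If you want to check whether a converted page really contains <H1> through <H6> tags, you can look at the HTML source, or let a short script count them for you. The following is a minimal sketch in Python (not part of this module); the names HeadingCounter and count_headings are made up, and the two test fragments are hand-written imitations, not actual Word output.

```python
from html.parser import HTMLParser


class HeadingCounter(HTMLParser):
    """Count genuine <H1>..<H6> tags in an HTML document."""

    def __init__(self):
        super().__init__()
        self.counts = {}

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names, so <H1> arrives as "h1".
        if tag in ("h1", "h2", "h3", "h4", "h5", "h6"):
            self.counts[tag] = self.counts.get(tag, 0) + 1


def count_headings(html_text):
    """Return a dict such as {"h1": 1, "h2": 3} for the given HTML text."""
    parser = HeadingCounter()
    parser.feed(html_text)
    parser.close()
    return parser.counts


# Hand-written fragments: a real heading tag versus font formatting only.
real = "<H1>Overview</H1><P>Text</P>"
fake = '<P><B><FONT FACE="Arial" SIZE=5>Overview</FONT></B></P>'
print(count_headings(real))  # {'h1': 1}
print(count_headings(fake))  # {}
```

An empty result for a page that is supposed to have headings suggests the headings were written as custom-formatted text rather than as heading tags.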


Graphics and Other Objects

Word accepts many types of "objects" and displays them: images, drawings, "Word Art," sounds, video clips, and imports from many other programs that use Microsoft's "Object Linking and Embedding" (OLE) standard. Only a few of these are converted when a document is translated into HTML. Here are details, together with some advice on how to get a few of them into HTML...
 

What Word Does with Pictures


Pictures in the Word document are put into the Web page this way:
  • GIF and JPEG images are transferred as they are.
  • Other picture formats are converted to GIF using Word's GIF Filter.
  • Drawing objects (diagrams created with the Word drawing toolbar) are not converted.

Graphics Interchange Format (GIF) filter


From MS Word Help:
"The Graphics Interchange Format filter (Gifimp32.flt) supports file format versions GIF87a (including interlacing) and GIF89a (including interlacing and transparency). The GIF filter works with the Portable Network Graphics filter (Png32.flt) to import GIF files into Word. The GIF filter is also used by the HTML converter to export pictures in a Word document to .gif images linked to an HTML page.
The GIF filter has the following limitation: Only the first image of a multiimage [animated] GIF is imported."
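
The two versions named in the Help text, GIF87a and GIF89a, are recorded in the first six bytes of every GIF file, so it's easy to find out which one you have. Here's a minimal sketch in Python (not a tool used in this module); the function names are made up for this example.

```python
def gif_version(data):
    """Return 'GIF87a' or 'GIF89a' based on the first six bytes, else None."""
    signature = data[:6]
    if signature in (b"GIF87a", b"GIF89a"):
        return signature.decode("ascii")
    return None


def gif_version_of_file(path):
    """Read just the six-byte signature from a file and identify it."""
    with open(path, "rb") as handle:
        return gif_version(handle.read(6))


print(gif_version(b"GIF89a\x01\x00\x01\x00"))  # GIF89a
print(gif_version(b"BM\x00\x00"))              # None (not a GIF)
```

This only reads the signature; it doesn't tell you whether the file is interlaced, transparent, or animated.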

Drawings and Word Art



[Image: MS Draw]

Drawings are objects created by using the Microsoft Draw toolbar, or the older Microsoft Draw editor. They differ from the usual picture formats in that they are "vector graphics": a collection of shape and color objects in which each object is defined by the coefficients of the equation that describes its shape. As you might guess, translating these to HTML is not straightforward! But MS Draw is a very useful way of creating charts and diagrams, so it's worth knowing how to bring its drawings onto the Web.

[Image: Word Art version 1]
[Image: Word Art version 2]

Word Art (versions 1 and 2) is a proprietary Microsoft format that allows you to make pleasing and decorative headlines by distorting the shapes of words and giving them three dimensions, shadows, and colors. Word Art is a good, quick way of making interesting headlines for your Web pages, so how do you get them onto the Web?

The simplest way to get these images to the Web is to use copy-and-paste. Some (but not all) graphics programs know how to accept Word Art and Microsoft Drawings. The general idea is:

  1. In the Word document, select the graphic object you want. If it's a drawing, it helps to "group" the individual objects by selecting them all (hold shift and click) and grouping them together (from the Draw toolbar, Draw...Group). Copy the graphic (Edit...Copy or <Ctrl>C ).
  2. Open a graphics program that can accept a Microsoft Draw or Word Art object. Paste the object in as a new image (Edit...Paste as New Image or <Ctrl>V ).
  3. Save the image as GIF. (GIF usually works better than JPEG with letters and diagrams; it is also more compact than PNG.)
  4. Edit the HTML file in a Web editor or Word and insert the GIF file in the proper place.

Graphics Programs

Take a look at module W47c for more information about graphics programs. If you have a full-service graphics editor like Photoshop or Paint Shop Pro, you can use the copy-and-paste method described in the preceding section. If you don't have these tools, here's a "work-around" for converting Word Art and Drawing objects in Word to Web graphics using only tools from MS Windows 95 and MS Office Professional.
  1. Open MS Paint (Windows Start...Programs...Accessories...Paint) and paste the drawing there.
  2. Save the file (it will be a .BMP file, which is useless on the Web).
  3. Open the .BMP file with MS Photo Editor.
  4. Save it as a .GIF file.

Sounds and Video Clips



The wide variety of multimedia files available can all be linked to Web pages and played either by the browsers themselves or by "plug-ins". This module doesn't cover plug-ins, but the good news for Web page creators is that all you have to do is create a link to the multimedia file and FTP the file to your Web server. The browser will do the rest, including telling users if they need to install a plug-in and, often, where to download it.
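
The link itself is an ordinary hyperlink whose target happens to be a sound or video file. As a rough sketch in Python (not part of this module), the function below builds such a link; the name media_link, the file name, and the link text are all made up for illustration.

```python
from html import escape
from urllib.parse import quote


def media_link(filename, label):
    """Build an ordinary hyperlink to a multimedia file on the same server."""
    # Percent-encode spaces and similar characters in the file name.
    href = quote(filename)
    return '<A HREF="%s">%s</A>' % (escape(href, quote=True), escape(label))


print(media_link("welcome message.wav", "Hear the welcome message"))
# <A HREF="welcome%20message.wav">Hear the welcome message</A>
```

The resulting line can be pasted into the page's HTML source; the multimedia file still has to be uploaded to the same directory on the Web server.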


About this document...

Audience:

This is for people who are familiar with MS Word and Web editors, and want to take existing Word documents and convert them to Web documents without losing any more of the formatting than is necessary.

Objectives:

When you successfully complete this lesson, you will be able to...
  • Explain how to convert simple Word documents into Web pages;
  • Discuss formats that could cause problems when converted;
  • Explain how to convert diagrams drawn with MS Draw tools in Word.

Module W36c:

This document is part of a modular instruction series in Computer Information Systems. For more information, see the overview or the list of modules in this series, W: World Wide Web. This document has been used in the following classes: CIS 260

Author:

Laurence J. Krieg

Institution:

Department of Computer Information Systems, Washtenaw Community College
History: Original: 29 Nov 1998
Last modification: Wednesday, 07-Nov-2001 13:03:36 EST
Copyright: Copyright © 1999, Laurence J. Krieg.
Instructors: You may point to this file in your Web-based materials.
Students: you may make a copy for your personal use.
All other uses: contact the author, Laurence J. Krieg for permission. Email krieg@ieee.org