SGML and Microsoft Office 2000 – MS Office meets the Web[1]

Rolf Marvin Bøe Lindgren

Det Norske Veritas

Rolf Lindgren graduated from the University of Oslo in 1996, where he was trained as a clinical psychologist. He works as a research engineer for Det Norske Veritas. Some of the projects he has worked on are the Knowledge Management initiatives and the Networked DNV projects, investigating how styles of work and information technology may be integrated. While a student, he hacked the Balise scripts that are used to convert the SGML sources for the student guides and course books to LaTeX.


One of the hopes of most members of the SGML community is that SGML editing will become mainstream. Anybody who needs to write documents for posterity, as it were, might benefit from SGML. The SGML enthusiast happily hand-tags documents or uses Author/Editor or FrameMaker+SGML; the electronics engineer or the grammarian, however, might want to reap the benefits of SGML without the additional burden of hand-coding tags and dealing with counter-intuitive user interfaces. This is especially true when the manager of a project that uses SGML for documentation needs to hire help. Gaining widespread acceptance of SGML would be much easier if one could show managers SGML-based editors that deliver these benefits without requiring users to spend much time understanding the mechanics behind SGML, or learning obscure editors or complicated editing practices.

A user-friendly XML editor? Not as such.

Microsoft’s core competence is to make good ideas accessible. When Microsoft announced that Office 2000 was to include support for XML, one of my first thoughts was that this might provide a gateway to XML for ordinary users. If anybody has the resources to develop a word processor that can present the advantages of SGML in a user-friendly fashion, it’s Microsoft.

If Office 2000 is to be used as a general XML editor, one would assume that Microsoft would support nested paragraph styles, some way of editing attributes to styles, and some fancy way of displaying the hierarchy of the current paragraph. Having tested Office 2000 for some time, however, I can only conclude that these or similar features will not be present in Office 2000 and that Microsoft’s commitment to XML represents something entirely different. Microsoft Office should be viewed as a target application for a particular XML DTD. This XML DTD maps directly to RTF, Microsoft’s ASCII-based document format, and supports nothing more than what can already be done with RTF: map formatting to default or user-added paragraph and character styles.

How XML appears to the user in Office 2000

To the user, Office 2000’s XML is a web page format, and that is the whole point of XML in Office 2000. Office 97 supports saving to HTML, but the functionality seems very ad hoc: Word documents lose a lot of formatting and the resulting HTML is relatively messy. Office 2000 changes all that. The user keeps almost complete control over the layout of Office documents, even after exporting to Web format.

The user does not need to know anything about XML. In fact, as far as I can see, nothing in Office 2000 tells the user that documents can be saved as XML. Office 2000 exports files to XML quite easily through the File/Save As… or the File/Save As Web Page… dialogs, but never mentions XML by name. To the user, XML is a web page format or an advanced HTML format.

An Office 2000 XML document contains a great deal of formatting information, embedded as XML style specifications. Using style mechanisms partly from CSS and XSL, and partly from XML Schemas, layout is specified in a manner that is portable, at least in theory. Achieving layouts that look the same in both Internet Explorer and Netscape Navigator is unproblematic.

Using the LINK element, it is possible to specify external documents containing layout information specified using CSS, XSL or XML Schema. This makes it possible to reformat an entire web in one go. A script editor that seems extremely powerful makes it possible to edit style specifications in a WYSIWYG-like manner.

To the SGML user, however, Office 2000’s XML still leaves a bit to be desired, to wit:

  • not all attribute values are quoted:

    <meta http-equiv=Content-Type content="text/html; charset=windows-1252">

  • XML is embedded in the document, but contexts are redirected using syntax constructs hidden in comments:

    <!--[if VML]><![if !VMLRender]>
    <object id=VMLRender classid="CLSID:10072CEC-8CC1-11D1-986E-00A0C955B42E"
    width=0 height=0>

  • CSS using proprietary style properties


This format cannot be parsed using any SGML parser I know of, including Balise and nsgmls. In fact, the XML DTD will apparently not be available as plain text, but only as an ActiveX component that can be used from Microsoft’s programming environments and other programming environments that support ActiveX.

I tested, just for fun, how Office 2000 would represent a character style element. Character styles are like the familiar paragraph styles, except that they do not span an entire paragraph and can only contain character formatting information. The advantages of using character styles are at least twofold: they can be used to identify classes of information, and they are much more easily updated than ordinary font changes. Also, unlike ordinary font changes, they are not lost when styles are reapplied to paragraphs.

  1. This is how Office 2000 represents an italic style change:

    <p class=MsoNormal>
    <span lang=EN-US>This is how Office 2000 represents an <i>italic style change</i>:</span></p>

  2. This is how Office 2000 represents a character style element:

    <p class=MsoNormal>
    <span lang=EN-US>This is how Office 2000 represents a </span><span class=code>
    <span lang=EN-US style='font-size:10.0pt;mso-bidi-font-size:12.0pt;
    font-family:"Courier New";mso-bidi-font-family:"Times New Roman"'>
    character style element</span></span>
    <span lang=EN-US>:</span></p>

  3. This is how Office 2000 represents a character style element:

    <p class=MsoNormal>
    <span lang=EN-US>This is how Office 2000 represents a
    <span class=em>character style element</span>:</span></p>

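Notice the difference between 2 and 3. Both use character style names that were created in exactly the same way, yet their representations differ markedly: the code style is wrapped around a span with an explicit inline style attribute, while the em style appears as a bare class name. Presumably Office 2000 treats the two style names differently, but the reason is not apparent from the output, since both em and code are also names of HTML elements.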
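Although the markup differs, the style name ends up in a class attribute in both cases, so it can be recovered mechanically. The following is a minimal sketch, assuming Python’s standard html.parser module, which is a lenient HTML tokenizer and not an SGML or XML parser. The class and function names are hypothetical, and the sketch only looks at span elements like those in the samples above.

```python
from html.parser import HTMLParser


class StyleRunCollector(HTMLParser):
    """Collect (style_name, text) pairs for text inside <span class=...>."""

    def __init__(self):
        super().__init__()
        self.open_spans = []  # class value (or None) of each open span
        self.runs = []        # collected (style_name, text) pairs

    def handle_starttag(self, tag, attrs):
        if tag == "span":
            # html.parser accepts unquoted values such as class=code.
            self.open_spans.append(dict(attrs).get("class"))

    def handle_endtag(self, tag):
        if tag == "span" and self.open_spans:
            self.open_spans.pop()

    def handle_data(self, data):
        text = " ".join(data.split())  # normalize whitespace
        if not text:
            return
        # Use the innermost enclosing span that carries a class name.
        style = next((c for c in reversed(self.open_spans) if c), None)
        if style:
            self.runs.append((style, text))


def character_style_runs(markup):
    """Return the character-style runs found in a fragment of Office output."""
    collector = StyleRunCollector()
    collector.feed(markup)
    collector.close()
    return collector.runs
```

Applied to samples 2 and 3, the function yields the same kind of pair for both, with the style name (code or em) and the styled text, which suggests that the difference between the two representations matters less to a consumer that only wants the style names.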

The bottom line

A company that looks to use XML for markup should look further than Office 2000. It is possible, however, to view Office 2000 as a target application for XML, allowing interoperability from other XML-oriented applications to Office 2000. Office 2000 currently uses a very idiosyncratic interpretation of XML that cannot be parsed by any parser I know of. Given a suitable pre-processor, it might be possible to convert Office 2000’s XML to something that can be parsed in a standard fashion, but that would require an SGML parser that allows one to look inside comments, and until quite recently such an idea would have seemed quite extraordinary.
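The pre-processing idea can be sketched in a few lines. The following is a minimal Python sketch under stated assumptions, not a tested converter: it quotes bare attribute values and strips the conditional markers so that their contents become ordinary markup. The function names are hypothetical. The closing markers `<![endif]>` and `<![endif]-->` are assumed, since only the opening markers appear in the samples above. Proprietary CSS properties are left untouched.

```python
import re

# A start tag, assuming no "<" or ">" occurs inside attribute values.
TAG = re.compile(r"<[A-Za-z][^<>]*>")

# name=value, where the value is double-quoted, single-quoted, or bare.
# Quoted values are matched whole, so text such as charset=windows-1252
# inside a quoted value is never mistaken for an attribute.
ATTR = re.compile(r"""([\w:.-]+)\s*=\s*("[^"]*"|'[^']*'|[^\s"'>]+)""")

# Conditional markers: <!--[if X]>, <![if X]>, <![endif]>, <![endif]-->
# (the endif forms are an assumption; the source shows only the openers).
CONDITIONAL = re.compile(r"<!(?:--)?\[(?:if [^\]]*|endif)\](?:--)?>")


def _quote_attr(match):
    name, value = match.group(1), match.group(2)
    if value[0] in "\"'":
        return match.group(0)       # already quoted: leave as-is
    return f'{name}="{value}"'      # bare value: wrap in double quotes


def quote_attributes(markup):
    """Quote every unquoted attribute value inside start tags."""
    return TAG.sub(lambda m: ATTR.sub(_quote_attr, m.group(0)), markup)


def strip_conditionals(markup):
    """Remove conditional markers, exposing everything they enclose.

    This is crude: all branches become visible, so alternative
    renderings of the same content may both end up in the output.
    """
    return CONDITIONAL.sub("", markup)


def preprocess(markup):
    """Apply both clean-up steps to a fragment of Office 2000 output."""
    return quote_attributes(strip_conditionals(markup))
```

For example, `quote_attributes` turns the meta element shown earlier into one whose `http-equiv` value is quoted, while leaving the already quoted `content` value alone. The result is still not guaranteed to be well-formed XML: elements that Office leaves unclosed stay unclosed, so a further clean-up step would be needed before handing the text to a strict parser.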

A company that already uses Microsoft Office 2000 as its standard tool for creating documents has, however, an incredibly powerful tool for generating standard layouts and uniform-looking web sites. Combined with Microsoft FrontPage 2000, Office 2000 forms a very capable web site creation and maintenance environment for any enterprise or family that does not require batch-oriented HTML generation.

[1] Thanks to Lars Marius Garshol for insightful comments. All errors are mine.