SGML and Microsoft Office 2000 – MS Office meets the Web[1]

Rolf Marvin Bøe Lindgren

Det Norske Veritas

Rolf Lindgren graduated from the University of Oslo in 1996, where he was trained as a clinical psychologist. He works as a research engineer for Det Norske Veritas. Some of the projects he has worked on are the Knowledge Management initiatives and the Networked DNV projects, investigating how styles of work and information technology may be integrated. While a student, he hacked the Balise scripts that are used to convert the SGML sources for the student guides and course books to LaTeX.


One hope shared by most members of the SGML community is that SGML editing will become mainstream. Anybody who needs to write documents for posterity, as it were, might benefit from SGML. The SGML enthusiast happily hand-tags documents or uses Author/Editor or FrameMaker+SGML. The electronics engineer or the grammarian, by contrast, might want to reap the benefits of SGML without the additional burden of hand-coding tags and dealing with counter-intuitive user interfaces. This is especially true when the manager of a project that uses SGML for documentation needs to hire help. SGML would gain widespread acceptance much more easily if one could tell managers that there are SGML-based editors that let users reap those benefits without spending much time on the mechanics behind SGML, obscure editors or complicated editing practices.

A user-friendly XML editor? Not as such.

Microsoft’s core competence is making good ideas accessible. When Microsoft announced that Office 2000 was to include support for XML, one of my first thoughts was that this might provide a gateway to XML for ordinary users. If anybody has the resources to develop a word processor that presents the advantages of SGML in a user-friendly fashion, it is Microsoft.

If Office 2000 were to be used as a general XML editor, one would expect Microsoft to support nested paragraph styles, some way of editing the attributes of styles, and some way of displaying where the current paragraph sits in the hierarchy. Having tested Office 2000 for some time, however, I can only conclude that neither these nor similar features will be present in Office 2000, and that Microsoft’s commitment to XML represents something entirely different. Microsoft Office should be viewed as a target application for one particular XML DTD. This DTD maps directly to RTF, Microsoft’s ASCII-based document format, and supports nothing more than what can already be done with RTF: mapping formatting to default or user-added paragraph and character styles.

How XML appears to the user in Office 2000

To the user, Office 2000’s XML is a web page format, and that is the whole point of XML in Office 2000. Office 97 supports saving to HTML, but the functionality seems very ad hoc: Word documents lose a lot of formatting and the resulting HTML is relatively messy. Office 2000 changes all that. The user retains almost complete control over the layout of Office documents, even after exporting to Web format.

The user does not need to know anything about XML. As far as I can see, nothing in Office 2000 tells the user that documents can be saved as XML, even though exporting is easy through the File/Save As… or File/Save As Web Page… dialogs. To the user, XML is simply a web page format or an advanced HTML format.

An Office 2000 XML document contains a lot of formatting information, embedded as XML style specifications. Layout is specified using style mechanisms taken partly from CSS and XSL and partly from XML Schemas, in a manner that is portable at least in theory. Layouts that look the same in both Internet Explorer and Netscape Navigator are unproblematic to achieve.

Using the LINK element, it is possible to refer to external documents containing layout information specified in CSS, XSL or XML Schema. This makes it possible to reformat an entire web site in one go. A script editor that seems extremely powerful makes it possible to edit style specifications in a WYSIWYG-like manner.

To the SGML user, however, Office 2000’s XML still leaves a bit to be desired, to wit:

  • not all attribute values are quoted:

    <meta http-equiv=Content-Type content="text/html; charset=windows-1252">

  • XML is embedded in the document, but contexts are redirected using syntax constructs hidden in comments:

    <!--[if VML]><![if !VMLRender]>
    <object id=VMLRender classid="CLSID:10072CEC-8CC1-11D1-986E-00A0C955B42E"
    width=0 height=0>

  • CSS using proprietary style properties


This format cannot be parsed by any SGML parser I know of, including Balise and nsgmls. In fact, the XML DTD will apparently not be available in plain text format, but only as an ActiveX component that can be used from Microsoft’s programming environments and other programming environments that support ActiveX.
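The quoting problem is easy to reproduce with a strict XML parser. The following is a minimal sketch and not a test against Balise or nsgmls: it uses the XML parser bundled with Python as a stand-in. Both fragments are modelled on the meta element quoted above, and I have added a closing slash to each (an assumption for illustration) so that the unquoted attribute value is the only difference between them.

```python
import xml.etree.ElementTree as ET

# Modelled on the Office 2000 output quoted above: the value of http-equiv
# is not quoted. The trailing "/" is added for illustration only.
office_fragment = (
    '<meta http-equiv=Content-Type '
    'content="text/html; charset=windows-1252"/>'
)

# The same element with every attribute value quoted.
quoted_fragment = (
    '<meta http-equiv="Content-Type" '
    'content="text/html; charset=windows-1252"/>'
)


def is_well_formed(fragment: str) -> bool:
    """Return True if the fragment parses as well-formed XML."""
    try:
        ET.fromstring(fragment)
    except ET.ParseError:
        return False
    return True


print(is_well_formed(office_fragment))  # False
print(is_well_formed(quoted_fragment))  # True
```

A strict parser stops at the first unquoted value, so nothing after that point in the document is ever reached.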

Just for fun, I tested how Office 2000 would represent a character style element. Character styles are like the familiar paragraph styles, except that they do not span an entire paragraph and can only contain character-level formatting. The advantages of using character styles are at least twofold. First, they can be used to identify classes of information and are much more easily updated than normal font changes. Second, they are not lost when styles are reapplied to paragraphs, as normal font changes often are.

  1. This is how Office 2000 represents an italic style change:

    <p class=MsoNormal>
    <span lang=EN-US>This is how Office 2000 represents an <i>italic style change</i>:</span></p>

  2. This is how Office 2000 represents a character style element:

    <p class=MsoNormal>
    <span lang=EN-US>This is how Office 2000 represents a </span><span class=code>
    <span lang=EN-US style='font-size:10.0pt;mso-bidi-font-size:12.0pt;
    font-family:"Courier New";mso-bidi-font-family:"Times New Roman"'>
    character style element</span></span>
    <span lang=EN-US>:</span></p>

  3. This is how Office 2000 represents a character style element:

    <p class=MsoNormal>
    <span lang=EN-US>This is how Office 2000 represents a
    <span class=em>character style element</span>:</span></p>

Notice the difference between 2. and 3. Both use character styles that were created in exactly the same way, yet their XML representations are quite different. One might guess that this is because em is an HTML element name while code is not, but code is an HTML element name as well, so the reason for the difference is unclear.
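The difference between the two representations can be made concrete with a tolerant HTML parser, which accepts the unquoted attribute values. The sketch below uses Python's html.parser module. The class SpanStyleCollector and the function summarize are names I made up for illustration, and the two strings are examples 2 and 3 from the list above. It reports which class names appear on span elements and how many spans carry an inline style attribute.

```python
from html.parser import HTMLParser

EXAMPLE_2 = """<p class=MsoNormal>
<span lang=EN-US>This is how Office 2000 represents a </span><span class=code>
<span lang=EN-US style='font-size:10.0pt;mso-bidi-font-size:12.0pt;
font-family:"Courier New";mso-bidi-font-family:"Times New Roman"'>
character style element</span></span>
<span lang=EN-US>:</span></p>"""

EXAMPLE_3 = """<p class=MsoNormal>
<span lang=EN-US>This is how Office 2000 represents a
<span class=em>character style element</span>:</span></p>"""


class SpanStyleCollector(HTMLParser):
    """Record class names and inline styles found on span elements."""

    def __init__(self):
        super().__init__()
        self.class_names = []
        self.inline_styles = 0

    def handle_starttag(self, tag, attrs):
        if tag != "span":
            return
        for name, value in attrs:
            if name == "class":
                self.class_names.append(value)
            elif name == "style":
                self.inline_styles += 1


def summarize(markup):
    """Return (span class names, number of spans with an inline style)."""
    collector = SpanStyleCollector()
    collector.feed(markup)
    collector.close()
    return collector.class_names, collector.inline_styles


print(summarize(EXAMPLE_2))  # (['code'], 1)
print(summarize(EXAMPLE_3))  # (['em'], 0)
```

In both cases the character style name survives as a class attribute, so the classification is recoverable. Only example 2 repeats the formatting inline.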

The bottom line

A company that wants to use XML for markup should look further than Office 2000. However, it is possible to view Office 2000 as a target application for XML, allowing interoperability from other XML-oriented applications to Office 2000. Office 2000 currently uses a very idiosyncratic interpretation of XML that cannot be parsed by any parser I know of. Given a suitable pre-processor, it might be possible to convert Office 2000’s XML to something that can be parsed in a standard fashion. That would require an SGML parser that allows one to look inside comments, however, and until quite recently such an idea would have seemed quite extraordinary.
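What such a pre-processor might look like can be sketched in a few lines. This is only an illustration under stated assumptions, not a tool that handles real Office 2000 output. It uses Python's tolerant html.parser to read the markup and writes it back with every attribute value quoted. It drops comments instead of looking inside them, and it assumes every element has an explicit end tag, which holds for the paragraph examples above but not for elements such as meta. The names XmlNormalizer and normalize are hypothetical.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser
from xml.sax.saxutils import escape, quoteattr

SAMPLE = """<p class=MsoNormal>
<span lang=EN-US>This is how Office 2000 represents a
<span class=em>character style element</span>:</span></p>"""


class XmlNormalizer(HTMLParser):
    """Re-serialize tolerant HTML input with quoted attribute values."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # quoteattr adds the surrounding quotes and escapes the value.
        rendered = "".join(
            f" {name}={quoteattr(value or '')}" for name, value in attrs
        )
        self.parts.append(f"<{tag}{rendered}>")

    def handle_endtag(self, tag):
        self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        self.parts.append(escape(data))

    # handle_comment is deliberately not overridden, so comments are dropped.


def normalize(markup: str) -> str:
    """Return the markup with all attribute values quoted."""
    normalizer = XmlNormalizer()
    normalizer.feed(markup)
    normalizer.close()
    return "".join(normalizer.parts)


root = ET.fromstring(normalize(SAMPLE))
print(root.tag, root.get("class"))        # p MsoNormal
print(root.find("span/span").get("class"))  # em
```

After this step a standard XML parser accepts the paragraph. Anything Office hides inside conditional comments is lost, which is exactly the part a real pre-processor would have to deal with.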

A company that already uses Microsoft Office 2000 as its standard tool for creating documents has, however, an incredibly powerful tool for generating standard layouts and uniform-looking web sites. Combined with Microsoft FrontPage 2000, Office 2000 gives any enterprise or family that does not require batch-oriented HTML generation an incredibly powerful environment for creating and maintaining web sites.

[1] Thanks to Lars Marius Garshol for insightful comments. All errors are mine.