
Office 2007 document formats explained, part 1

An in-depth look at the new Office 2007 document formats.

THE IMPLICATIONS OF OFFICE XML FORMAT

Last week in Office for Mere Mortals, our companion newsletter for beginners, we talked about the basic difference between Word documents and the XML documents you’ve probably heard about. If all this talk about Office and XML has lost you, that’s the place to start.

To follow up that primer, in this Office Watch we’ll explain a bit more about Office and XML documents. An XML format exists now for Word 2003 and Excel 2003, but for the next version of Office it will be improved and made the default file format. As we mentioned in Office for Mere Mortals, most of the coverage echoed Microsoft’s announcements and little more. Worse, it assumed that most people know about XML and that everyone already knows Office XML is a good thing – hence the Office for Mere Mortals starter pack.

Some of you might think that we’re making too much of this, but we don’t think so. Unlike most of the ‘exciting’ announcements from Microsoft, this one REALLY is important. Even if you swear you’ll never buy Office 2007 or Longhorn, Office XML documents will come your way.

And it is a MAJOR change. Microsoft is right to announce it well ahead of release. We have our reservations and concerns about the new Office 2007 XML formats but generally it looks like an excellent move with benefits for all users.

THE SITUATION TODAY

In Office 2003 you have the choice to save Word or Excel documents in an XML format instead of the binary format. The XML formats are called WordprocessingML (aka WordML) for Word and SpreadsheetML for Excel.

The Office 2003 XML schemas for these formats are available from Microsoft and if you’re having trouble sleeping, then those documents will send you to the land of nod.

These XML schemas are really an interim step. There are problems with XML generally that are not addressed by the current Office XML schemas.

XML is primarily a text based system. When you start adding non-text items like pictures it doesn’t work very well. There is a standard proposal to address that problem but it has yet to be ratified.

XML documents can get very big – even larger than normal Word documents at times. All that fussing we have to do with compressing documents before emailing is the result.

There’s another problem which is not XML related but is an issue for Microsoft. Many people use Word documents, but they are large and they don’t always display reliably on different Windows computers (let alone on Macs, Linux etc). The result is that people often convert Word documents to Adobe’s excellent PDF format. Acrobat has grown from being just a stable document viewing format to include commenting and other document sharing capabilities – which is MS Office turf. The problem for Microsoft is that they don’t own the PDF format and would much prefer us to use a format that they have some control over.

Hence the related move to the ‘Metro’ technology – a newly announced Longhorn technology that, despite protestations otherwise, seems to be a move to overlap if not eventually replace the Adobe PDF format.

That’s not necessarily a bad thing. If Microsoft can create an open, royalty-free format which combines Word’s flexibility with Acrobat’s fidelity across platforms, then we customers will benefit.

OFFICE WITH XML = OFFICEML

Microsoft has announced that the default document format for Office 2007 (the next Office for Windows due in 2006) will be XML based. For Word, Excel and (for the first time) PowerPoint the current formats will be replaced by XML incarnations. The next Office for Mac will also support the new formats.

That’s been the headline news – but the real news is the changes in the Office XML formats that will come with Office 2007.

The new version of the XML formats will address some of the problems with the current OfficeML versions.

COMPRESSED DOCUMENT FORMAT

Elements within the XML document will be compressed using the well-known ZIP method. This should reduce the size of documents on your hard drive and, importantly, when emailing them.

This compression should be unseen by users of the program – Office will handle the compression and decompression automatically. It is NOT like the current situation where you have to separately compress a document to reduce its size before sending.
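To give a feel for why compression matters with a verbose format like XML, here is a minimal sketch in Python. It is not Microsoft’s implementation: the XML snippet and its tag names are made up for illustration, and `zlib` is used only because it provides DEFLATE, the compression method that ZIP tools commonly use.

```python
import zlib

# Hypothetical, repetitive markup of the kind a word processor might emit.
# The tag names are illustrative only, not taken from any published schema.
paragraph = '<w:p><w:r><w:t>Hello, Office XML.</w:t></w:r></w:p>'
xml_text = '<w:body>' + paragraph * 200 + '</w:body>'

raw = xml_text.encode('utf-8')
packed = zlib.compress(raw)         # DEFLATE-compress the XML bytes
restored = zlib.decompress(packed)  # round-trip back to the original bytes

print('uncompressed bytes:', len(raw))
print('compressed bytes:  ', len(packed))
```

Because XML repeats the same tags over and over, the compressed version is a small fraction of the original, and decompressing returns exactly the bytes that went in.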

It is not the entire XML document that will be compressed, rather sections of it. There will be plain text XML tags in between blobs of compressed data.

This compression option is probably a good idea but it is not part of the XML standard as it stands. While Microsoft talks about an ‘open’ Office XML format – the fact remains that existing XML tools will not be able to look inside the compressed section of an Office 2007 document. All those tools will see is a blob of undecipherable data between XML tags. Presumably Microsoft will release programming tools to handle their Office tweaks of XML.

SEPARATION OF PARTS

The OfficeXML document will be broken up, internally, into separate parts for the purposes of compression and recovery if something goes wrong.

Separating the document makes it easier to fix if the document is corrupted. The corruption can be isolated to a section of the document (eg the comments) leaving the rest untouched and capable of recovery. The current Office binary formats can be a bit like a house of cards – if one part goes wrong the entire thing falls apart.

This compartmentalization could also help programmers. They can identify and extract only the sections needed by their code leaving the rest intact. This is safer and probably more efficient than opening the whole thing.
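The idea of pulling out one part and leaving the rest alone can be sketched in Python as follows. This is an illustration under stated assumptions, not the Office 2007 layout: it assumes the parts are stored as separate entries in an ordinary ZIP container, and the part names `document.xml` and `comments.xml` and the helper functions are hypothetical.

```python
import io
import zipfile

def build_package():
    """Build a tiny in-memory package with two separate parts (hypothetical names)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, 'w', zipfile.ZIP_DEFLATED) as pkg:
        pkg.writestr('document.xml', '<doc><p>Body text</p></doc>')
        pkg.writestr('comments.xml', '<comments><c>Check this</c></comments>')
    return buf.getvalue()

def read_part(package_bytes, part_name):
    """Decompress and return a single named part; other parts are never touched."""
    with zipfile.ZipFile(io.BytesIO(package_bytes)) as pkg:
        return pkg.read(part_name).decode('utf-8')

package = build_package()
comments = read_part(package, 'comments.xml')
print(comments)
```

A program that only cares about comments never has to parse the body text, and a damaged body part would not stop the comments from being read.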

NEW DOCUMENT EXTENSIONS

In Office 2003 the WordML and SpreadsheetML formats are already available. If you use one of these formats the file will be saved with a .XML extension. The problem with that extension is that other programs also use it. A Word 2003 XML document can be accidentally opened in other XML programs.

To stop that happening and properly distinguish old-style binary documents from the new format we have a new set of document extensions. These will be used by the Office 2007 XML documents.

WordML .docx

SpreadsheetML .xlsx

Powerpoint ML .pptx

Note that the use of a four-letter extension will break any programs still using 8.3 file names. Programmers should check their code to make sure there’s no in-built assumption that all files have 3 letter extensions.
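The kind of in-built assumption to look for can be sketched in Python. Both function names are hypothetical; the first shows the fragile habit of taking the last three characters, the second uses the standard library’s `os.path.splitext`, which splits at the last dot whatever the length.

```python
import os

def naive_extension(filename):
    # Fragile: assumes every extension is exactly three letters long.
    return filename[-3:].lower()

def safe_extension(filename):
    # Splits at the last dot, so extensions of any length are handled.
    return os.path.splitext(filename)[1].lstrip('.').lower()

for name in ('REPORT.DOC', 'report.docx', 'budget.xlsx'):
    print(name, naive_extension(name), safe_extension(name))
```

With the naive version a file called report.docx appears to have the extension ‘ocx’, which is exactly the sort of quiet bug worth hunting down before the new formats arrive.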

‘Long’ file names can currently have extensions beyond 3 characters but they are rare (a combination of habit and caution). Office 2007 XML will be one of the first times a longer extension is used in a widely deployed program.

COMPATIBILITY

While there will be some encouragement to take up the new format, it will be optional.

The most obvious encouragement will be that ‘out of the box’ Office 2007 programs will use the XML document format by default. At least that’s the current plan – I would not be surprised to see that change – perhaps users will be given the choice during installation?

Office 2007 will be able to read and write in the older document formats – so you can create documents that can be read by people who have Office 2003 or before. In addition, Microsoft has announced that Office 2003 and earlier versions will have a filter available for download that will allow you to read and perhaps write documents in the new format.

Past experience shows that the compatibility will not be 100%. It is quite possible that a document saved from Office 2007 in the older non-XML format will not keep all the features that the same document saved as OfficeML will (I exclude here features that require XML itself).

Even more likely, the compatibility add-ins for Office 2003 and before may not fully support the new OfficeML format.

These types of gaps have always been there. Sometimes the formats are incompatible and on other occasions it is not practical to devote the programming resources to relatively minor matters. What is important is that Microsoft openly discloses any compatibility issues in their filters so that customers can be forewarned.
