Office 2007 document formats explained, part 2


More on the new document formats in Office 2007

MORE ON THE NEW OFFICE 2007 FORMAT

Back in Office Watch 10.18 we talked about the new document format coming in the next version of Office, due sometime in 2006. In this issue we’ll continue with more details and some updates.

As news of Office 2007 comes out we’ll cover the important features in special issues of Office Watch, keeping the regular issues devoted to practical matters related to the currently released versions of Office.

We’re going into more details for a few reasons. Some more expert users and developers will need to know details of this more accessible format, and there are some notable benefits for network administrators and some possible tweaks available to the more cunning users.

Even novice users should be aware of some major benefits, although the new format should make no difference to their daily use of Office.

And finally we’d like to squash some of the sillier and ill-founded rumors about the new format that are already circulating. We’ve been amused by contradictory reports of ‘facts’ from readers; even people who attended the briefings at the recent Tech-Ed came away with different ‘facts’.

We’re greatly indebted to some Microsoft experts in the Office development team who have taken the time to clarify some points for us. Some of the Microsoft documents are a bit misleading to those of us looking for more than a broad overview.



SEPARATION OF PARTS

As we explained in part 1, each Office document is made up of parts (eg comments, charts, slides etc). In current binary formats those parts are saved in a ‘stream’ with one part following another in a software ‘conga line’. The problem is that if something gets corrupted, the ‘conga line’ is broken and it is hard to recover the broken pieces.

The Office 2007 XML formats change that by converting each element of a document into a separate component with links between each one and an overall reference table. Then all those components are bundled together and compressed using the familiar ZIP file format.

But the resultant file is not given the usual ZIP extension – instead it’s called DOCX for a Word 12 document, XLSX for Excel and PPTX for PowerPoint (plus other X extensions for templates etc).



DOCX OR XLSX OR PPTX = ZIP

You read that right – Office 2007 documents are actually ZIP files in disguise.

Take a DOCX file, rename it to .ZIP and you can open it in any compression tool – WinZip, WinRAR or many, many others.

Inside the ZIP file are a series of small XML files that comprise the Office 2007 document. Since those ‘files’ make up an Office 2007 document we call them components to distinguish them from normal files.

For example, in a PowerPoint 12 PPTX document each slide will actually be a separate XML component within the PPTX file.
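
As a minimal sketch of the idea, the Python below builds a tiny stand-in package in memory and lists its components with the standard zipfile module. The component names (slide1.xml and so on) and the helper functions are made up for illustration; they are not the real Office 2007 component names. The same ZipFile call should accept the path of a real DOCX, XLSX or PPTX file in place of the in-memory buffer.

```python
import io
import zipfile


def make_demo_package():
    """Build a small ZIP in memory that stands in for a PPTX file.

    The component names are hypothetical, chosen only for this demo.
    """
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.writestr("presentation.xml", "<presentation/>")
        zf.writestr("slides/slide1.xml", "<slide>Welcome</slide>")
        zf.writestr("slides/slide2.xml", "<slide>Agenda</slide>")
    buf.seek(0)
    return buf


def list_components(package):
    """Return (name, uncompressed size) for every component in the package."""
    with zipfile.ZipFile(package) as zf:
        return [(info.filename, info.file_size) for info in zf.infolist()]


package = make_demo_package()
# is_zipfile() looks at the file contents, not at the file extension.
print(zipfile.is_zipfile(package))
for name, size in list_components(package):
    print(name, size)
```

Each slide shows up as its own entry, which is all that ‘separate component’ means at the ZIP level.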

All this compression and separate XML components will be done in the background with no special work needed by the user. Office 2007 will save and open these new formats and you won’t have to do or worry about anything.

That is how it should be – when you save an Excel document at the moment you don’t concern yourself with the binary format, streaming or other programming details. It should be the same in Office 2007 – save the document and the program handles the details.

There is precedent for this – we already use file formats that are compressed by their nature. JPG and GIF image files are already compressed formats while MP3 and WMA are compressed audio files.



WHY ZIP?

Some people might wonder about using the ZIP format but it seems like a good compromise. Despite its relative age the ZIP standard does a good job at making files smaller without too much delay. There are other options (like our personal favorite for daily use, RAR) but none are as widely accepted as ZIP. And there are probably licensing issues as well.

By using the ZIP format, Microsoft and its customers can make use of the many ZIP tools out there. File recovery is a good example. While Office 2007 will have tools to recover data from corrupted documents, the new formats mean you have the option, in extreme cases, to use one of the many other ZIP management tools available.

For developers it means they can use any of the ZIP compatible tools to work with Office 2007 documents. That’s important because one of the other benefits of the XML formats is the ability to read and write Office 2007 files without using Office 2007 programs.
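
To give a feel for what ‘without using Office 2007 programs’ could look like, here is a rough Python sketch that writes a stand-in document package and reads its text back using only the standard library. The component name document.xml, the simple document and p elements and both helper functions are assumptions for this example; the real Office 2007 markup will be more involved.

```python
import io
import xml.etree.ElementTree as ET
import zipfile


def write_package(paragraphs):
    """Write a stand-in document package and return it as bytes.

    'document.xml' and the <document>/<p> elements are hypothetical.
    """
    root = ET.Element("document")
    for text in paragraphs:
        ET.SubElement(root, "p").text = text
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.writestr("document.xml", ET.tostring(root, encoding="utf-8"))
    return buf.getvalue()


def read_paragraphs(raw):
    """Pull the paragraph text back out of the package, no Office needed."""
    with zipfile.ZipFile(io.BytesIO(raw)) as zf:
        root = ET.fromstring(zf.read("document.xml"))
    return [p.text for p in root.findall("p")]


raw = write_package(["First paragraph", "Second paragraph"])
print(read_paragraphs(raw))
```

Any language with a ZIP library and an XML parser could do the same job, which is the point of the open format.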

Another side benefit of using an overall compression system is data integrity. The ZIP standard stores a checksum (a CRC-32 value) for each component when it is compressed. If a component’s data is corrupted, the problem will be detected when the document is opened and that component is read, because the checksum will no longer match. Because the check is made for each component rather than only for the DOCX as a whole, any corruption can be isolated to one or more components, leaving the rest of the document recoverable.
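
The per-component check can be tried with any ZIP library. The Python sketch below is only an illustration: it builds a two-component package (the component names and helper functions are hypothetical), damages the data of one component, and uses the standard zipfile module’s testzip() call to find which component fails its CRC. The components are stored uncompressed purely so the demo can alter the data predictably.

```python
import io
import zipfile


def build_package(parts):
    """Store each part uncompressed so the demo can damage one predictably."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_STORED) as zf:
        for name, data in parts.items():
            zf.writestr(name, data)
    return buf.getvalue()


def first_damaged_component(raw):
    """Return the name of the first component whose CRC fails, or None."""
    with zipfile.ZipFile(io.BytesIO(raw)) as zf:
        return zf.testzip()


parts = {
    "document.xml": b"<document>Quarterly report</document>",
    "comments.xml": b"<comments>Check the totals</comments>",
}
good = build_package(parts)
# Simulate disk corruption: alter one byte of a component's data in place.
bad = good.replace(b"totals", b"tXtals")

print(first_damaged_component(good))
print(first_damaged_component(bad))

# The undamaged component can still be read from the damaged package.
with zipfile.ZipFile(io.BytesIO(bad)) as zf:
    print(zf.read("document.xml"))
```

Only the damaged component is reported, and the other one still reads back intact.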



COMPRESSION ISSUES

In the ZIP format there are different levels of compression. All compression methods work basically by finding duplication in the files and replacing the duplicates with placeholders. How that’s done is the subject of some fearsome maths and programming skill which, thankfully, we mortals don’t have to worry about.

There’s a trade-off in any compression between achieving a smaller file size vs the time taken to shrink the data. You may get a slightly smaller file size by spending more computer time to find duplications – but in most cases the extra time is too long compared with the very small file size benefits. (Mind you, we’re normally talking about a second or two extra at most – not minutes.)

Usually the compression works with a default middle-ground setting – a trade-off between a slightly larger file size and faster completion. In Office 2007 there’s some smarts that decides automatically on the optimal compression ratio.

How much smaller will the files be? That depends mostly on the document – some files compress better than others. Microsoft is saying up to 50% smaller but I would not budget on that kind of reduction across the board. The reduction in file size (compared to a binary document in say Office 2003) depends on the nature of the data being compressed and will vary for each document.

We did a quick test of a random selection of Peter’s Word 2003 documents, comparing them with ZIP’d versions of the same documents. The compressed versions were all smaller, but some were only 4% smaller than the originals while others were a massive 90% smaller. That’s an indication of the wide variation in disk space saving you might get.
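
The dependence on the data is easy to demonstrate. This Python sketch is not our actual test; it uses made-up sample data and a hypothetical helper, zip_saving_percent, to compare highly repetitive text with random bytes, which behave much like an already compressed file.

```python
import io
import os
import zipfile


def zip_saving_percent(data):
    """Percentage by which default ZIP (Deflate) compression shrinks data."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.writestr("sample", data)
        info = zf.getinfo("sample")
    return 100.0 * (info.file_size - info.compress_size) / info.file_size


# Repetitive text gives the compressor plenty of duplication to remove.
repetitive = b"The quick brown fox. " * 5000
# Random bytes have no duplication to find, like a JPG or MP3.
incompressible = os.urandom(100_000)

print(f"repetitive text: {zip_saving_percent(repetitive):.1f}% smaller")
print(f"random bytes:    {zip_saving_percent(incompressible):.1f}% smaller")
```

Real documents fall somewhere between these two extremes, depending largely on how much already compressed material (such as embedded pictures) they contain.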



DIFFERENT COMPRESSION OPTIONS

There are various compression choices that ZIP offers. Normally the software you use will work from a default middle-ground setting where you get good compression without taking up too much computer time to do it.

But there are choices. At one extreme there is the ‘Store’ or ‘No Compression’ option where the files are stored within a ZIP file in their raw form with no compression at all. Normally you’d only use that for files that are already compressed like JPG, WMA or MP3.

At the other extreme there’s ‘Best’ or ‘Maximum Compression’ where more computer time is devoted to compressing the files into the smallest possible size.

As an indication of the small difference between default ZIP compression and the extreme ‘Best’ option, we compressed the same group of documents with each setting. The difference between the default and Best compression was a lousy 7KB (3,447KB vs 3,440KB). A ZIP ‘No compression’ of the same documents was 5,585KB.
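
For the figures just quoted, the default setting saved about 38% against ‘No compression’, and the extra 7KB from Best is roughly 0.2% of the compressed total. The three settings can be compared on any data with a few lines of Python. In this sketch the sample text and the helper archive_size are invented for the example, and the compresslevel argument needs Python 3.7 or later.

```python
import io
import zipfile


def archive_size(data, method, level=None):
    """Size in bytes of a one-component ZIP written with the given settings."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", compression=method, compresslevel=level) as zf:
        zf.writestr("document.xml", data)
    return len(buf.getvalue())


# Made-up sample content with the kind of repetition XML markup tends to have.
sample = "".join(
    f"<p>Paragraph {i} of the quarterly sales report.</p>" for i in range(5000)
).encode("utf-8")

stored = archive_size(sample, zipfile.ZIP_STORED)               # 'No compression'
default = archive_size(sample, zipfile.ZIP_DEFLATED)            # default setting
best = archive_size(sample, zipfile.ZIP_DEFLATED, level=9)      # 'Best'

print("stored :", stored)
print("default:", default)
print("best   :", best)
```

On most data the gap between stored and default is large, while the gap between default and best is small, which matches what we saw with our own documents.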

Office 2007 will analyze each document and make some decisions about the type of compression used. Naturally this happens in the background and there’s no need for a user to worry about this.

But the nature of the compression may be of interest to developers and network administrators as we’ll discuss later.



CHANGEOVER WOES

Long-time Word users will recall the troubles when Microsoft has changed the document structure. Office 97 was a classic example of how not to implement a format change, for it alienated many customers in a short time.

Changing the file structure can cause all manner of nightmares, especially in large companies where there are thousands of existing documents held by many users, some in the old format and some in the new format. This is complicated by the file extension staying as ‘doc’, which gives no immediate clue as to what version of the Word format the document is in.

Microsoft argues, with some justification, that changes in the document format have been necessary to accommodate changes in technology. After all when Word v1 for Windows was launched few of us had heard of the Internet, email or even simple networks.

The sad fact is that Microsoft has never really deployed changes in the document format very well. Microsoft tried to slip a change into Word 97 in the foolish hope that customers would not notice. It annoyed many large companies by not disclosing the change until very late in the piece. For Office 97, Microsoft management had to do a public, though belated, mea culpa when they realized they were losing millions of dollars of sales once IT departments discovered the change.

The company has learnt that lesson and this time they have announced document format changes well ahead of time. ‘Filters’ will be released to let past versions of Office (2003, XP and 2000 only) read and write to the new format. If you choose File | Save As in Word and scroll down the list of document formats you’ll find one or more entries to save to earlier versions of the Word format.

There will also be a ‘bulk’ converter to change groups of documents to the new formats.



SO WHY CHANGE?

Why bother changing?

For starters, you won’t have to. Even though the new formats will be the default upon installation, you can choose to save documents in the old binary formats if you prefer. There’s already such an option in Office to let you save documents to different default formats (in Word, Tools | Options | Save).

If you do switch to the new formats the main benefit for most people will be a saving of disk space. An Office 2007 XML document should be smaller than the same document saved in the older binary format.

Users with Office 2003, XP (2002) and 2000 will be able to download tools to let them read and write the new formats. As we noted in the last issue, the compatibility of these tools may not be 100% perfect – time will tell.

Smaller file sizes will not only mean less disk space but also bandwidth savings. For most people that means documents they email will be smaller and there’s no need for manually compressing before sending. For network administrators it means less money to spend on storage and backup media plus a reduction in the amount of traffic across large and complex networks.



BEYOND WORD …

In this article we’ve talked about Word because that’s the document format that concerns people the most. But the same applies to Excel and PowerPoint too.

PowerPoint 2003 has only a binary format but there will be an XML format by default in Office 2007.

There’s no word on what will happen with Visio and Publisher.

It would be nice if email and other Outlook objects could be imported and exported via an XML format.



VIRUSES AND OTHER NASTIES

Virus protection is a big deal and almost all the readers who’ve emailed us in the last few weeks have expressed concern about the vulnerability of the new Office documents to virus infection.

Microsoft is assuring the world that the new format will have protection against ‘untrustworthy’ code. We certainly hope that is the case, but the devil will be in the detail. Microsoft’s responses and focus seem to be on code integrity and not the possibility of the document being hacked.

Some readers have noted the possibility of documents being hacked more easily once they are in XML format. Since the document format is openly available it is conceivable that an unauthorized program could change text in documents without needing to open the Office program. Currently changes to Office binary format documents are, for practical purposes, done by infecting the programs via macros etc. But with an open format this may not be necessary.

For that reason it would be good to see some form of data integrity built into the new formats, so that users could be notified of changes made to a document outside the usual Office programs.
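
To show the kind of check we have in mind, here is a hypothetical Python sketch; it is not a feature Microsoft has announced. It records a SHA-256 fingerprint of each component and later reports which components have changed. All the function and component names are invented for the example, and a real scheme would also need to keep the recorded fingerprints somewhere an intruder could not rewrite them (or sign them).

```python
import hashlib
import io
import zipfile


def build_package(parts):
    """Write the given components into an in-memory ZIP and return its bytes."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        for name, data in parts.items():
            zf.writestr(name, data)
    return buf.getvalue()


def component_digests(raw):
    """Map each component name to the SHA-256 hex digest of its contents."""
    with zipfile.ZipFile(io.BytesIO(raw)) as zf:
        return {n: hashlib.sha256(zf.read(n)).hexdigest() for n in zf.namelist()}


def changed_components(before, after):
    """Names added, removed or altered between two digest maps."""
    names = before.keys() | after.keys()
    return sorted(n for n in names if before.get(n) != after.get(n))


original = build_package({
    "document.xml": b"<document>Pay the supplier 1,000</document>",
    "comments.xml": b"<comments>Approved</comments>",
})
# An outside program rewrites one component; the ZIP itself remains valid.
tampered = build_package({
    "document.xml": b"<document>Pay the supplier 9,000</document>",
    "comments.xml": b"<comments>Approved</comments>",
})

recorded = component_digests(original)
print(changed_components(recorded, component_digests(tampered)))
```

Note that the ZIP checksum alone would not catch this: a program that rewrites a component also writes a fresh, matching CRC, so the package still opens cleanly.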

At present the only option to protect the integrity of a document, as opposed to code, is to password protect the document.

The new formats give Microsoft a chance to start from a somewhat ‘clean slate’ – a new document format where the mistakes of the past can be repaired. That assumes that Microsoft can acknowledge the mistakes that have been made in the past and therefore learn from them. Nowhere is this more important than in the area of document security.

In the next installment of this feature we’ll look at how ‘open’ this new open document format will really be. We’ll also cover password protection and encryption, plus some interesting possibilities opened up by this announcement. Finally we’ll give our interim opinion on this important development.

