An introduction to XML as a primer for understanding the new Office document formats.
THE BIG DEAL ABOUT OFFICEML
All the ’tissues’ have been full of the news that the next Office release (known for the moment as Office 12) will have a new and improved document format – known as OfficeML.
What is this all about? Does it matter? Why is Microsoft doing this? A lot of the coverage seems to assume that the readers know or even care about XML, document formats etc. In this issue we’ll go back to basics. It may seem dry and irrelevant but it is important – we don’t often have special issues of Office Watch devoted to a single topic and never without a good reason.
In this case it is a double-header. Here in Office for Mere Mortals we’ll cover the basics of XML as it applies to Word and Excel – this lays the groundwork for our Office Watch coverage of the major change in document formats coming in Office 12, the next version of Office.
Apologies to those of you for whom this is old news – please sit quietly at the back of the class for a moment.
Every file on your computer has a particular format – a structure to define how information is presented. For Word documents there’s all sorts of information to be stored – not just the words you type but their position, font, size, page size, margins plus all the extras like comments, footnotes, embedded or linked images, charts and macros. All that information has to be saved in a way that Word can read correctly at a later time.
It seems obvious but it can be extremely complicated. You have the choice to save the document in other formats like RTF, which work fine but which can omit much of the more Word-specific information.
WORD ‘BINARY’ DOCUMENTS
The Word binary format is owned by Microsoft – this is where the dreaded ‘proprietary’ comes into the conversation. You can’t design a program to create a Word document without Microsoft’s permission. You need Microsoft to reveal the inner workings of the binary document format if you want to try reading the document without Word. Anti-virus companies are a good example of software makers that need to know the innards of the Word document format so they can scan your documents for nasties.
Ever since Word was created the default method of saving a file has been the ‘binary’ format with a .doc extension. That format has changed a lot over the years even though the extension is unchanged.
OTHER FORMATS INCLUDING HTML
There are many other document formats just for word processors. There is a WordPerfect format, Rich Text Format (RTF), Microsoft Works has its own document format, a special one for the Pocket PC version of MS Word plus many others that have mostly fallen by the wayside. Even plain text files have a ‘format’ however simple. Given the market share, the MS Word document format is far and away the most common.
If you’ve ever made a web page you’ll have seen another example of a document structure – HTML is a plain text document with tags to tell a web browser where to place words and images. HTML is open in that anyone can read or write a web page (even Notepad can do it, if necessary), the structure is publicly known and anyone can use it without charge.
In the last few versions of Office, you’ve had the option to save documents in HTML format instead of the binary format. This works fairly well as you can open the Word-created document in any HTML editor but you’ll find it full of strange proprietary and unfathomable sections. Such documents are also very large, often larger than the same information saved as a normal Word document.
XML is HTML’s big brother. It is an open standard for saving data of many different types. There are ‘schema’ or structures defined (usually by mutual agreement) for many different situations.
Since XML structures are open to all, anyone smart enough can make a program to read or write to a particular XML format. That’s one of the beauties of XML – the program you use to make the XML document does not matter (well, it should not matter).
Here’s an example. Banks and other lending institutions have agents offering loans. These ‘free agents’ work with the borrower to find the best deal from many sources. In order to find the best deal, or even to learn whether the loan proposal would be accepted, the agent (pre XML) had to fill out many different forms (paper or online) with the same information about the borrower (name, social security, employment, assets etc) with a different form for each bank. It’s the same information about the borrower but re-entered in different formats for each bank.
These days there’s an XML format to arrange all that data – all the agent has to do is fill out one online form, then an XML document is created and sent off to each bank. The bank’s computer reads that information and acts accordingly.
Everyone wins – the agent only completes one form online. The banks get the info they need in a known format which can be processed quickly.
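For the curious, here is a rough sketch in Python of the 'one form, many banks' idea. Everything in it is made up for illustration: the element names (loanApplication, name, employer and so on), the borrower and the bank names are assumptions, not part of any real lending schema. It only uses Python's built-in XML library.

```python
import xml.etree.ElementTree as ET


def build_loan_application(borrower: dict) -> str:
    """Return one XML document describing a borrower.

    The element names are hypothetical; a real lending schema
    would define its own agreed names and structure.
    """
    root = ET.Element("loanApplication")
    for field, value in borrower.items():
        ET.SubElement(root, field).text = str(value)
    return ET.tostring(root, encoding="unicode")


# The agent enters the borrower's details once.
borrower = {
    "name": "Jane Example",
    "employer": "Example Pty Ltd",
    "annualIncome": 85000,
    "loanAmount": 250000,
}

application_xml = build_loan_application(borrower)

# The same document goes to every lender. Each one reads it with
# whatever software it likes, as long as the structure is agreed.
for bank in ["First Bank", "Second Bank", "Third Bank"]:
    received = ET.fromstring(application_xml)
    print(bank, "sees a request for", received.findtext("loanAmount"))
```

The point to notice is that the banks never need to know which program produced the text, only which structure it follows.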
CHOOSE YOUR WEAPON
The important thing here is that it doesn’t matter what software the agent or bank uses – as long as they both use the same agreed XML structure the programs making or reading the info are irrelevant to the other side.
The agent probably has a special program to create and send the loan application but he/she can make it in Notepad if desperate. The bank almost certainly has custom mainframe programs to handle the XML traffic.
That doesn’t apply to Word binary formats – for most practical purposes you need Word to create Word documents. If you email a document to someone you have to check what program they use as a word-processor. Only the massive market share of Microsoft Office makes that question unnecessary in many instances.
What Microsoft has done, and is proposing to extend, is to change Word documents from a closed ‘binary’ format to an open XML format that, in theory, any program can read or write to.
EXTRACTING DATA FROM DOCUMENTS
XML isn’t just about the openness of the document structure. With an open structure you can label and identify special information in a document. This is mostly useful for companies who want to organize the wide-ranging information stored in documents.
Imagine a situation where quotes by salesmen are written as Word documents. A sales manager knows there’s all this information about customers and what they want stored in those documents – but extracting that information from Word binary documents means opening each document in Word and copying the information out.
With XML formats there’s an alternative. Each quote is still saved in Word but in XML format – the sales people might not even notice the difference. In addition to the Word OfficeML structure you add your own data structure (schema in XML speak) which defines things like salesman, customer name, address, phone, product, quote price etc.
When a quote document is saved that information is tagged accordingly. If you open the document in Notepad you can see the information something like this (buried among all the Word document formatting information):
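A made-up sample of such tagging is sketched here in Python so it can be checked by a parser. The tag names (quote, salesman, customer, phone, product, price) are assumptions based on the list of fields above, not part of any Microsoft schema, and all of Word's own formatting markup is left out.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily simplified quote fragment. A real Word XML
# file would surround this with a great deal of Word's own markup.
quote_fragment = """\
<quote>
  <salesman>Pat Seller</salesman>
  <customer>Acme Widgets</customer>
  <phone>555-0100</phone>
  <product>Deluxe Widget Polisher</product>
  <price>1250.00</price>
</quote>"""

# Parsing only succeeds if the fragment is well-formed XML.
root = ET.fromstring(quote_fragment)

# List each label alongside the information it surrounds.
for child in root:
    print(child.tag, "=", child.text)
```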
The tagging isn’t printed in the quote document but merely surrounds or labels the specific pieces of information so that other systems know what is what. Those tags are created separately from anything Microsoft defines and Word ignores these additional tags for most purposes.
Now the sales manager can read all the quote documents automatically, extract all the information into a spreadsheet for analysis or into Word to make personalized follow-up letters. All the program has to do is look for the information inside certain tags.
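As a rough idea of how little such a program needs to do, here is a minimal Python sketch that pulls tagged values out of a couple of quotes and lays them out as spreadsheet-ready CSV. The tag names and the two sample quotes are invented for illustration, and the quotes are stripped of the Word markup a real file would carry.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical tag names the company chose for its own quote schema.
FIELDS = ["salesman", "customer", "product", "price"]

# Stand-ins for the XML text of two saved quote documents.
quote_documents = [
    "<quote><salesman>Pat Seller</salesman><customer>Acme Widgets</customer>"
    "<product>Widget Polisher</product><price>1250.00</price></quote>",
    "<quote><salesman>Lee Dealer</salesman><customer>Bolt Bros</customer>"
    "<product>Bolt Sorter</product><price>480.50</price></quote>",
]


def extract_fields(xml_text: str) -> dict:
    """Return the text found inside each tag of interest."""
    root = ET.fromstring(xml_text)
    row = {}
    for name in FIELDS:
        # iter() searches the whole tree, so a tag buried deep inside
        # other markup is still found. Missing tags give an empty string.
        element = next(root.iter(name), None)
        row[name] = (element.text or "") if element is not None else ""
    return row


rows = [extract_fields(doc) for doc in quote_documents]

# Write the rows as CSV text, which Excel can open directly.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=FIELDS)
writer.writeheader()
writer.writerows(rows)
print(buffer.getvalue())
```

In practice the program would read the documents from disk and would have to cope with whatever wrapping Word adds around the custom tags, which this sketch ignores.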
The programs to do that don’t have to be Microsoft products – though obviously it’s in the company’s interests to encourage you to buy only from them.
And the data can be written back the other way too. Documents could be searched and text within tags could be changed. Since the document format is open to all any program could, in theory, do it.
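Going the other way is just as short. This Python sketch changes the text inside one tag and writes the document back out as XML. Again the tag names, the phone numbers and the helper function replace_tag_text are hypothetical, and the sample leaves out Word's own markup.

```python
import xml.etree.ElementTree as ET

# Stand-in for the XML text of a saved quote document.
quote_xml = (
    "<quote><customer>Acme Widgets</customer>"
    "<phone>555-0100</phone><price>1250.00</price></quote>"
)


def replace_tag_text(xml_text: str, tag: str, old: str, new: str) -> str:
    """Replace the text of every matching tag whose content equals old."""
    root = ET.fromstring(xml_text)
    for element in root.iter(tag):
        if element.text == old:
            element.text = new
    # Serialise the modified tree back to XML text.
    return ET.tostring(root, encoding="unicode")


# The customer has a new phone number; everything else is untouched.
updated = replace_tag_text(quote_xml, "phone", "555-0100", "555-0199")
print(updated)
```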
Sure, there are other ways to achieve the same result (eg quotes could be generated via a database) but the XML path gives you more options than other methods. Not the least of these options is to combine programs from different sources (the sales people use Word while the back office extracts data with a specially written program). There are programming tools available to work with XML schema easily.
The data exchange benefit doesn’t mean much, if anything, to many home or small business users. For companies this feature could be huge if they have the wit to make use of it.
CONTINUED IN OFFICE WATCH
That’s the background of XML and how it applies to Office. Over in our parent newsletter ‘Office Watch’ we’ll explain the implications of Microsoft’s announcement: what OfficeML is, why Microsoft says it is making the change, why it is really making the change and what’s in it for the majority of Office users.
You can subscribe to Office Watch for free. Like all our newsletters you can join or leave at any time and we always treat your email address like it was our first-born. Hop over right now to join Office Watch.
All this is just the beginning. I’ve only touched on schemas (DTD vs XML Schema) and not even mentioned transforming data (XSL) among many other acronyms mostly beginning with X.
If you want to understand the XML basics in more detail then get a copy of Beginning XML for probably more information than you’ll need but in an easily understood format.