On reading the OOXML specification

As part of my shallow dive into OOXML, the new document format that Microsoft has proposed be made into an international standard, I decided to see if I could locate a copy of the specification and try to wrap my arms around it, to get a feel for what the community was being asked to deal with. I also wanted to see if I could do this solely using Ubuntu Linux.

First, I had to find the specification documents. It didn’t take long to find Wikipedia’s Office Open XML. It begins by noting that “the neutrality and factual accuracy of this document is disputed.” That wasn’t a surprise, but I assumed the location of the specification itself wouldn’t be controversial, and that led me to Standard ECMA-376 Office Open XML File Formats (December 2006). When I visited that site I found the specification comes in five parts, and each part is available in two formats: DOCX and PDF.

I knew about PDF, but wasn’t sure about DOCX. A web search revealed that DOCX was the format used by Microsoft Office itself, at least in some of its versions. See for example, Microsoft’s Introducing the Office (2007) Open XML File Formats, which says in part:

Learn the benefits of the Office Open XML Formats. Users can exchange data between Office applications and enterprise systems using XML and ZIP technologies. Documents are universally accessible. And, you reduce the risk of damaged files.

To open a Word 2007 XML file

  1. Create a temporary folder in which to store the file and its parts.
  2. Save a Word 2007 document, containing text, pictures, and other elements, as a .docx file.
  3. Add a .zip extension to the end of the file name.
  4. Double-click the file. It will open in the ZIP application. You can see the parts that comprise the file.
  5. Extract the parts to the folder that you created previously.

That looked promising. Microsoft says that DOCX documents are “universally accessible.” Well, I reside in the universe, so I should be able to access them.

Since I’m limiting myself to Ubuntu to read the specification, I tried to download a copy of Microsoft Office for Ubuntu but was unable to find it. That wasn’t too surprising, but perhaps it is on the way. Towards that end, here’s an open letter to Microsoft:

Steve Ballmer,
CEO, Microsoft
Redmond, Washington

Dear Steve:

I’m trying to read the OOXML specification and see it is available in DOCX format. Indeed, your company’s web site says that DOCX documents are “universally accessible.” I’m in the universe and I would like to access one of these DOCX documents, the OOXML specification.

But I’m using Ubuntu and don’t yet have a copy of Microsoft Office that I can use to read the files.

In order to make the specification accessible to me, could you please ask your team to prepare an implementation of Office for Ubuntu? Please send it first to the Debian folks, and don’t forget to include a copy of the OSI-approved open-source license you’ll be using when you send the code out. Once they have scrubbed it and made sure it meets their high standards, I’m sure the Ubuntu folks will incorporate it into their distribution.

Take your time. I’m a patient guy, and I appreciate it is better to get it right than to rush something out in haste.

thanks,
dave

Since MS Office is not yet part of Ubuntu, I couldn’t use the DOCX variants, so I deleted them, leaving only the PDF files.

It turned out the file names had blanks in them, so I found it necessary to use the find command to delete the DOCX downloads while I waited for Steve’s team to finish their work.

$ find . -name '*(DOCX).zip' -exec rm {} \;

I then moved the remaining PDF files to a working directory, renaming them to just be 1.pdf, 2.pdf, and so forth on the way.
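The move-and-rename step can be scripted, blanks and all. What follows is only a minimal sketch, not the commands I actually typed: the function name renumber_pdfs and both directory arguments are made up for illustration, and it assumes the file names contain no newlines and that the destination is a different directory from the source.

```shell
#!/bin/sh
# Sketch: move every PDF in a source directory into a working directory,
# naming them 1.pdf, 2.pdf, ... in the shell's sorted glob order.
# renumber_pdfs is a hypothetical helper, not part of any package.
renumber_pdfs() {
    src=$1
    dest=$2
    mkdir -p "$dest"
    n=1
    for f in "$src"/*.pdf; do
        [ -e "$f" ] || continue      # skip when the glob matched nothing
        mv -- "$f" "$dest/$n.pdf"    # quoting keeps blanks in names intact
        n=$((n + 1))
    done
}
```

Calling it as renumber_pdfs . work would leave the numbered copies in a directory named work. Quoting "$f" is the whole trick for names with blanks.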

I finally had five files that were easy to work with. First, to get an estimate of what I had to deal with I found the total size:

$ cat *.pdf >pdf.all

$ ls -l pdf.all

And found they totalled almost 52MB. Wowsers, that’s a big spec indeed! There probably are thousands of pages inside.
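The same total can be had without leaving a 52MB pdf.all file lying around. Here is a minimal sketch, assuming a POSIX shell; total_bytes is a name invented for this example, and it reports bytes rather than megabytes.

```shell
#!/bin/sh
# Sketch: print the combined size in bytes of all PDFs in a directory
# by piping them through wc instead of writing a concatenated file.
# total_bytes is a hypothetical helper name.
total_bytes() {
    cat -- "$1"/*.pdf | wc -c | tr -d ' '   # tr strips padding some wc versions add
}
```

Running total_bytes . in the working directory prints a single number that should match the size ls -l reports for pdf.all.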

Ubuntu provides several programs to view PDF files. I played around with them, and eventually settled on kpdf (“sudo apt-get install kpdf”). Here are the page totals I came up with using the “kpdf” package:

Part  Pages
1     178
2     131
3     473
4     5220
5     43

That gives 6045 pages, confirming there are really over six thousand pages!
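For anyone who wants to check my addition, here is a minimal sketch that sums the five page counts from the table. The count_pages function at the end is an assumption on my part: it relies on the pdfinfo tool from poppler-utils being installed, which is not how I got the numbers above (I read them off the viewer).

```shell
#!/bin/sh
# Page counts per part, copied from the table above.
pages="178 131 473 5220 43"

total=0
for p in $pages; do
    total=$((total + p))
done
echo "Total pages: $total"

# Assumption: poppler-utils is installed. pdfinfo prints a "Pages:" line,
# so a count could be read from a file rather than from the viewer.
# count_pages is a hypothetical helper and is not called here.
count_pages() {
    pdfinfo "$1" | awk '/^Pages:/ { print $2 }'
}
```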

So much for wrapping my arms around the spec. I would need a couple of new cartridges for my laser printer and over twelve reams of paper just to print it out, and to boot would require a forklift to deal with that printout.
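The paper estimate is simple division. A minimal sketch, assuming a standard 500-sheet ream and single-sided printing (both assumptions of mine, not anything the specification says):

```shell
#!/bin/sh
total_pages=6045        # page total from the table above
sheets_per_ream=500     # assumption: 500-sheet ream, one page per sheet

full_reams=$((total_pages / sheets_per_ream))
leftover=$((total_pages % sheets_per_ream))
echo "$full_reams full reams plus $leftover sheets"
```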

Working with OOXML is a constant source of surprise.

For example, the specification itself is over six thousand pages in length, and is available in only two formats, DOCX and PDF, each a proprietary format controlled by a commercial software vendor (Microsoft and Adobe, respectively).

And I thought this was an open standard…

I then moved on to accessing the PDF documents. PDF itself is worth a post, one that will soon follow.

One Comment

  1. Posted December 28, 2007 at 08:44

    Thanks for this post, and wish you good luck!