ASPAlliance: Articles, reviews, and samples for .NET Developers
URL:
http://aspalliance.com/articleViewer.aspx?aId=1252&pId=-1
XML Paper Specification (XPS) of a Word 2003 Document
by Jayaram Krishnaswamy

Introduction

Microsoft breathed new life into legacy Office documents by opening an XML window (Office Open XML) onto its Office products, and with its royalty-free XPS specification. XPS stands for XML Paper Specification; it defines a cross-platform, open-standard document representation that can be used for generating, sharing, printing and archiving paginated documents. Its virtues, in Microsoft’s own words, are: "With XPS, documents print better, can be shared easier, be archived with confidence, and are more secure."

The Microsoft Word 2007 fileName.docx format is a full-fidelity XML file format (one that describes everything related to the document completely). It is the default save format, which was a binary format up to Word 2003. A document as envisaged in Word 2007 consists of document parts (a hierarchy of folders and files), each of which describes a part of the document, with logical relationships (a logical hierarchy) between the constituent parts. This makes it easy to touch and modify only those parts that need to be modified without modifying the others. In essence, it is no longer necessary to work with the whole document; the needed part can be modified surgically. In a manner similar to the *.docx extension for Word 2007, there are *.xlsx and *.pptx extensions for Excel 2007 and PowerPoint 2007, respectively. Although this tutorial deals with a Word 2003 document saved in the new format, similar considerations apply to Microsoft Excel and Microsoft PowerPoint files as well.

A document created in Word 2003 can be saved with the new extension. This article describes the details of such a document. The document used in this tutorial is very simple, with very little content, as shown in Figure 1.

Figure 1

Anatomy of a Word 2003 document

Saving the Word 2003 document

Although the default save type in Word 2003 is a file with the doc extension, a binary format, a document can also be saved as a file with the docx extension, the Word 2007 default file format. Consider the example of a Word document that contains a single line, "This is a test", followed immediately by an image (arrow.jpg, 855 bytes), as shown in Figure 1. When saved with the doc extension, the file size is 19.5 KB, including the 855-byte image. When the same document is saved with the docx extension, the file size is 11.7 KB, a reduction of about 40 percent.
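The size reduction can be checked from the two figures quoted above. The following is only a small sketch; the function name is illustrative and not part of any Office tooling.

```python
def percent_reduction(original_kb: float, new_kb: float) -> float:
    """Return the size reduction as a percentage of the original size."""
    return (original_kb - new_kb) / original_kb * 100.0


# Sizes quoted in the text: 19.5 KB for the .doc file, 11.7 KB for the .docx file.
print(round(percent_reduction(19.5, 11.7), 1))
```

For the sizes in this example the result rounds to 40.0, that is, the docx file is roughly 40 percent smaller than the doc file.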

Office Open XML (OOXML) Structure of the Word 2003 document

In order to see the components of this document based on Office Open XML, an Ecma-standardized format (ECMA-376), it is only necessary to change the file extension to ZIP. ZIP is a data compression and archival format that began with the PKZIP and PKUNZIP suite of utilities, and it is an ideal format for putting folders and files together and compressing them for archival purposes. An Office Open XML document is a ZIP package whose parts are organized into several related folders.

OriginalTest.doc (Word 2003), containing a single line of text and an image, was saved as OriginalTest.docx, and its extension was then changed to give OriginalTest.zip. If the ZIP archive is unzipped into a folder (here, Backup of OriginalTestZIP), the contents will look like those shown in Figure 2.
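The same inspection can be sketched in code, since a docx file is a ZIP archive whatever its extension. The example below uses Python's standard zipfile module; the helper names and the tiny in-memory package with placeholder content are assumptions made for illustration, not the actual OriginalTest.docx.

```python
import io
import zipfile


def list_package_parts(package) -> list[str]:
    """Return the sorted part names stored in an OOXML package (a ZIP archive).

    `package` may be a file path or a binary file-like object.
    """
    with zipfile.ZipFile(package) as archive:
        return sorted(archive.namelist())


def build_demo_package() -> io.BytesIO:
    """Build a tiny stand-in package in memory (placeholder content only)."""
    buffer = io.BytesIO()
    names = ("[Content_Types].xml", "word/document.xml", "word/media/image1.jpeg")
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        for name in names:
            archive.writestr(name, b"placeholder")
    buffer.seek(0)
    return buffer


for part in list_package_parts(build_demo_package()):
    print(part)
```

With a real file, calling list_package_parts("OriginalTest.docx") should work without renaming the file first, because the zipfile module does not depend on the file extension.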

Figure 2

The [Content_Types].xml file basically lists the content "types" encountered in the OriginalTest document. The XML file is shown in Listing 1. This file is like a manifest for the contained elements (folders and files).

Listing 1: [Content_Types].xml

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<Types xmlns="http://schemas.openxmlformats.org/package/2006/content-types">
<Default Extension="jpeg" ContentType="image/jpeg"/>
<Default Extension="rels" ContentType="application/vnd.openxmlformats-package.relationships+xml"/>
<Default Extension="xml" ContentType="application/xml"/>
<Override PartName="/word/document.xml"
 ContentType="application/vnd.openxmlformats-officedocument.wordprocessingml.document.main+xml"/>
<Override PartName="/word/styles.xml"
 ContentType="application/vnd.openxmlformats-officedocument.wordprocessingml.styles+xml"/>
<Override PartName="/docProps/app.xml"
 ContentType="application/vnd.openxmlformats-officedocument.extended-properties+xml"/>
<Override PartName="/word/settings.xml"
 ContentType="application/vnd.openxmlformats-officedocument.wordprocessingml.settings+xml"/>
<Override PartName="/word/theme/theme1.xml"
 ContentType="application/vnd.openxmlformats-officedocument.theme+xml"/>
<Override PartName="/word/fontTable.xml"
 ContentType="application/vnd.openxmlformats-officedocument.wordprocessingml.fontTable+xml"/>
<Override PartName="/word/webSettings.xml"
 ContentType="application/vnd.openxmlformats-officedocument.wordprocessingml.webSettings+xml"/>
<Override PartName="/docProps/core.xml"
 ContentType="application/vnd.openxmlformats-package.core-properties+xml"/>
</Types>
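To illustrate how this manifest is used, the sketch below looks up the content type of a part: an Override entry for the exact part name takes precedence, and otherwise the Default entry for the file extension applies. The function name and the abridged sample (a subset of Listing 1) are illustrative assumptions.

```python
import xml.etree.ElementTree as ET

CT_NS = "http://schemas.openxmlformats.org/package/2006/content-types"

# Abridged from Listing 1 to keep the example short.
SAMPLE_TYPES = f"""<Types xmlns="{CT_NS}">
<Default Extension="jpeg" ContentType="image/jpeg"/>
<Default Extension="xml" ContentType="application/xml"/>
<Override PartName="/word/document.xml"
 ContentType="application/vnd.openxmlformats-officedocument.wordprocessingml.document.main+xml"/>
</Types>"""


def content_type_for(types_xml: str, part_name: str):
    """Return the content type of a part, or None if nothing matches."""
    root = ET.fromstring(types_xml)
    # An Override for the exact part name wins.
    for override in root.findall(f"{{{CT_NS}}}Override"):
        if override.get("PartName") == part_name:
            return override.get("ContentType")
    # Otherwise fall back to the Default registered for the extension.
    extension = part_name.rsplit(".", 1)[-1].lower()
    for default in root.findall(f"{{{CT_NS}}}Default"):
        if default.get("Extension", "").lower() == extension:
            return default.get("ContentType")
    return None


print(content_type_for(SAMPLE_TYPES, "/word/media/image1.jpeg"))
print(content_type_for(SAMPLE_TYPES, "/word/document.xml"))
```

This explains why the image needs no entry of its own in Listing 1: it is covered by the Default entry for the jpeg extension.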

The "word" folder in the tree contains the following files: styles.xml, settings.xml, fontTable.xml, webSettings.xml, and document.xml. Together, these can be seen as the resources needed to display the document.

The document.xml shown in Listing 2 contains the body of the document.

Listing 2: document.xml

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<w:document
 xmlns:ve="http://schemas.openxmlformats.org/markup-compatibility/2006"
 xmlns:o="urn:schemas-microsoft-com:office:office"
 xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships"
 xmlns:m="http://schemas.openxmlformats.org/officeDocument/2006/math"
 xmlns:v="urn:schemas-microsoft-com:vml"
 xmlns:wp="http://schemas.openxmlformats.org/drawingml/2006/wordprocessingDrawing"
 xmlns:w10="urn:schemas-microsoft-com:office:word"
 xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main"
 xmlns:wne="http://schemas.microsoft.com/office/word/2006/wordml">
<w:body><w:p w:rsidR="00A605EA" w:rsidRDefault="00A605EA">
<w:r>
<w:t>This is a test</w:t>
</w:r>
</w:p>
<w:p w:rsidR="00A605EA" w:rsidRPr="00943E24" w:rsidRDefault="00A605EA"><w:r w:rsidRPr="00943E24">
<w:pict><v:shapetype id="_x0000_t75" coordsize="21600,21600"
 o:spt="75" o:preferrelative="t"
 path="m@4@5l@4@11@9@11@9@5xe" filled="f"
 stroked="f">
<v:stroke joinstyle="miter"/>
<v:formulas>
<v:f eqn="if lineDrawn pixelLineWidth 0"/>
<v:f eqn="sum @0 1 0"/>
<v:f eqn="sum 0 0 @1"/>
<v:f eqn="prod @2 1 2"/>
<v:f eqn="prod @3 21600 pixelWidth"/>
<v:f eqn="prod @3 21600 pixelHeight"/>
<v:f eqn="sum @0 0 1"/><v:f eqn="prod @6 1 2"/>
<v:f eqn="prod @7 21600 pixelWidth"/>
<v:f eqn="sum @8 21600 0"/>
<v:f eqn="prod @7 21600 pixelHeight"/>
<v:f eqn="sum @10 21600 0"/></v:formulas>
<v:path o:extrusionok="f"
 gradientshapeok="t" o:connecttype="rect"/><o:lock v:ext="edit" aspectratio="t"/>
</v:shapetype>
<v:shape id="_x0000_i1025"
 type="#_x0000_t75" style="width:45pt;height:45pt">
<v:imagedata r:id="rId4" o:title=""/>
</v:shape>
</w:pict>
</w:r>
</w:p>
<w:p w:rsidR="00A605EA"
 w:rsidRPr="002C481F" w:rsidRDefault="00A605EA"/>
<w:sectPr w:rsidR="00A605EA"
 w:rsidRPr="002C481F" w:rsidSect="00A605EA"><w:pgSz w:w="11906" w:h="16838"/>
<w:pgMar w:top="1440"
 w:right="1800" w:bottom="1440" w:left="1800"
 w:header="720" w:footer="720" w:gutter="0"/>
<w:cols w:space="720"/>
<w:docGrid w:type="lines"
 w:linePitch="360"/>
</w:sectPr>
</w:body>
</w:document>
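Because the body is plain XML, the text of the document can be read with any XML parser. The following is a minimal sketch that collects the text of each w:p paragraph from its w:t elements; the function name and the shortened sample markup are assumptions for illustration, and the sketch ignores complications such as paragraphs nested inside tables or text boxes.

```python
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

# A shortened stand-in for Listing 2: one paragraph with text, one empty paragraph.
SAMPLE_DOCUMENT = (
    f'<w:document xmlns:w="{W_NS}"><w:body>'
    "<w:p><w:r><w:t>This is a test</w:t></w:r></w:p>"
    "<w:p/>"
    "</w:body></w:document>"
)


def paragraph_texts(document_xml: str) -> list[str]:
    """Return the text of each w:p paragraph, joining the w:t runs it contains."""
    root = ET.fromstring(document_xml)
    paragraphs = []
    for paragraph in root.iter(f"{{{W_NS}}}p"):
        runs = paragraph.iter(f"{{{W_NS}}}t")
        paragraphs.append("".join(run.text or "" for run in runs))
    return paragraphs


print(paragraph_texts(SAMPLE_DOCUMENT))
```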

Similarly, the other XML files describe in detail the applicable settings, fonts, styles, themes and web settings for the document. These are not listed in this article, but they are very easy to generate.

The media folder is where media files such as pictures reside. In the present case, arrow.jpg is stored in the media folder as image1.jpeg.

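The _rels folder has a single file called document.xml.rels, which is also an XML file, shown in Listing 3. When you review each of the relationships, you will find that this file relates the constituent parts of the document to one another. For example, the image reference r:id="rId4" in Listing 2 corresponds to the relationship with Id="rId4", whose target is media/image1.jpeg.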

Listing 3: document.xml.rels

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<Relationships 
xmlns="http://schemas.openxmlformats.org/package/2006/relationships">
<Relationship Id="rId3"
 Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/webSettings"
 Target="webSettings.xml"/>
<Relationship Id="rId2"
 Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/settings"
 Target="settings.xml"/>
<Relationship Id="rId1"
 Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/styles"
 Target="styles.xml"/>
<Relationship Id="rId6"
 Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/theme"
 Target="theme/theme1.xml"/>
<Relationship Id="rId5"
 Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/fontTable"
 Target="fontTable.xml"/>
<Relationship Id="rId4"
 Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/image"
 Target="media/image1.jpeg"/>
</Relationships>
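Following a relationship can be sketched as a lookup of the Id followed by resolving the Target relative to the folder of the source part. The function name, the default source part and the abridged sample below are illustrative assumptions; the sketch handles only internal, relative targets such as those in Listing 3.

```python
import posixpath
import xml.etree.ElementTree as ET

REL_NS = "http://schemas.openxmlformats.org/package/2006/relationships"

# Abridged from Listing 3.
SAMPLE_RELS = f"""<Relationships xmlns="{REL_NS}">
<Relationship Id="rId1"
 Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/styles"
 Target="styles.xml"/>
<Relationship Id="rId4"
 Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/image"
 Target="media/image1.jpeg"/>
</Relationships>"""


def resolve_relationship(rels_xml: str, rel_id: str,
                         source_part: str = "word/document.xml"):
    """Return the package path a relationship Id points to, or None if unknown."""
    root = ET.fromstring(rels_xml)
    for relationship in root.findall(f"{{{REL_NS}}}Relationship"):
        if relationship.get("Id") == rel_id:
            # Targets are relative to the folder that holds the source part.
            base = posixpath.dirname(source_part)
            return posixpath.normpath(posixpath.join(base, relationship.get("Target")))
    return None


print(resolve_relationship(SAMPLE_RELS, "rId4"))
```

For rId4 this yields word/media/image1.jpeg, which is the image referenced by the v:imagedata element in Listing 2.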

The docProps folder consists of two files, app.xml and core.xml. Listing 4 shows app.xml, which describes document-wide details such as the template file used, security information, etc.

Listing 4: app.xml

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<Properties
 xmlns="http://schemas.openxmlformats.org/officeDocument/2006/extended-properties"
 xmlns:vt="http://schemas.openxmlformats.org/officeDocument/2006/docPropsVTypes">
<Template>Normal_Wordconv.dotm</Template>
<TotalTime>1</TotalTime>
<Pages>1</Pages>
<Words>2</Words>
<Characters>16</Characters>
<Application>Microsoft Office Outlook</Application>
<DocSecurity>0</DocSecurity>
<Lines>0</Lines>
<Paragraphs>0</Paragraphs>
<ScaleCrop>false</ScaleCrop>
<Company> Hodentek</Company>
<LinksUpToDate>false</LinksUpToDate>
<CharactersWithSpaces>0</CharactersWithSpaces>
<SharedDoc>false</SharedDoc>
<HyperlinksChanged>false</HyperlinksChanged>
<AppVersion>12.0000</AppVersion>
</Properties>
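Properties like these can be read into a simple dictionary. The sketch below assumes a flat list of child elements in the extended-properties namespace, as in Listing 4; the function name and the abridged sample are illustrative only.

```python
import xml.etree.ElementTree as ET

EP_NS = "http://schemas.openxmlformats.org/officeDocument/2006/extended-properties"

# Abridged from Listing 4.
SAMPLE_APP = f"""<Properties xmlns="{EP_NS}">
<Template>Normal_Wordconv.dotm</Template>
<Pages>1</Pages>
<Words>2</Words>
<Characters>16</Characters>
</Properties>"""


def read_app_properties(app_xml: str) -> dict[str, str]:
    """Map each property's local element name to its text (empty string if none)."""
    root = ET.fromstring(app_xml)
    prefix = f"{{{EP_NS}}}"
    return {
        child.tag[len(prefix):]: (child.text or "")
        for child in root
        if child.tag.startswith(prefix)
    }


properties = read_app_properties(SAMPLE_APP)
print(properties["Template"], int(properties["Pages"]), int(properties["Words"]))
```

All values come back as strings, so numeric properties such as Pages need an explicit conversion before use.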

The core.xml file describes the creator, the revision and other information about the document, using either provided or default values, as shown in Listing 5.

Listing 5: core.xml

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<cp:coreProperties 
xmlns:cp="http://schemas.openxmlformats.org/package/2006/metadata/core-properties" 
xmlns:dc="http://purl.org/dc/elements/1.1/" 
xmlns:dcterms="http://purl.org/dc/terms/" 
xmlns:dcmitype="http://purl.org/dc/dcmitype/" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<dc:title>This is a test</dc:title>
<dc:subject>
</dc:subject>
<dc:creator>HP Authorized Customer</dc:creator>
<cp:keywords></cp:keywords>
<dc:description></dc:description>
<cp:lastModifiedBy>HP Authorized Customer</cp:lastModifiedBy>
<cp:revision>2</cp:revision>
<dcterms:created xsi:type="dcterms:W3CDTF">2007-04-10T14:49:00Z</dcterms:created>
<dcterms:modified xsi:type="dcterms:W3CDTF">2007-04-10T14:49:00Z</dcterms:modified>
</cp:coreProperties>

Saving the *.docx file to a *.xps file

A printed document rarely looks exactly like the original seen on a screen or in a browser. Considering the number of printer types available on the market, this is not an unusual experience. The XPS format, through its underpinning software and standards, escapes the WYSINWYG (what you see is not what you get) dilemma that many face, especially when it comes to printing. This is the reason that Microsoft's WPF (Windows Presentation Foundation) will have an immediate audience and a large following.

XPS is an electronic fixed-layout format that guarantees that the printed version is exactly the same as the file version (high fidelity), preserving the document formatting. The file cannot be modified as is. There are two ways this file can be generated from a Microsoft Word document.

Using the Microsoft XPS Document Writer

Open the docx file in Word 2007 and select Print to open the Print window. Make sure that the Printer Name is Microsoft XPS Document Writer and that the Print to file option is selected, as shown in Figure 3. Accept the other defaults. Microsoft XPS Document Writer must be present in the Printers folder, as shown in Figure 3.

Figure 3

Figure 4

Click OK, which opens the Print to File window (not shown) with a default folder in which to save the document and a default file type, Printer Files (*.prn). In the Save as type drop-down list, choose All Files (*.*). For the file name, provide a name with the extension XPS. The file will be saved with this extension to the specified location. By double-clicking the file with the XPS extension, the document may be viewed as shown in Figure 5.

Figure 5

Using 2007 Microsoft Office Add-in: Microsoft Save as PDF or XPS

In order to use this option, the add-in SaveAsPDFandXPS.exe must be downloaded from the Microsoft site. This add-in allows eight programs in the Office 2007 suite to save their documents in both the XPS and PDF formats. After you download and install it, Word 2007 shows an extra item under the Save As menu, as shown in Figure 6. This functionality is not available in Word 2003. Of course, you can save a Word 2003 document in the Word 2007 format and then use this functionality.

Figure 6

When you click this item, the Publish as PDF or XPS window opens.

Figure 7

The default publish type is PDF, but you can choose to publish the document as an XPS document. After choosing this option, if you click the Publish button, the explorer opens as shown in Figure 8.

Figure 8

Working with the XPS documents

Microsoft’s Windows Presentation Foundation, with its support in .NET 3.0, will be able to handle all the necessary procedures, such as programmatic creation of XPS documents, navigating, storing and archiving, and providing digital signature security features, along with excellent packaging support that keeps the whole document in a text-based format. The logical XPS hierarchy keeps the document, the various pages in the document and their associated resources together. Motivated readers should read this Microsoft article on XPS.

There are also third-party products, such as NiXPS (beta version), which provide similar support through user-friendly interfaces for documents on both Mac and Windows. The NiXPS v1.0 beta 1.exe may be downloaded from the NiXPS site. Figure 9 is a screenshot of one of the Word 2003 documents converted to the XPS format and probed using the NiXPS application. The picture shows the document in the left pane and a page in the right pane. Since there is only one page in the document, only that page is shown. NiXPS can look through the document details (fonts, styles and so on), merge documents, etc.

Figure 9

The window with the title Inspector: imageChanged.xps looks at the package as a whole in a tree arrangement. The same tree can be seen by renaming the XPS file with the ZIP extension (imageChanged.xps renamed to imageChanged_XPS.zip) and unzipping its content into a folder. The package has additional elements such as FixedDocumentSequence.fdseq, which takes care of managing multiple documents in a package.
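The role of the document sequence can be sketched by listing the documents it refers to. The sample markup below is hypothetical and is not taken from imageChanged.xps: it assumes that the sequence part holds DocumentReference elements with a Source attribute, and the namespace shown is likewise an assumption. To stay tolerant of namespace differences, the sketch matches elements by local name only.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample; the namespace and the Source path are assumptions.
SAMPLE_FDSEQ = (
    '<FixedDocumentSequence xmlns="http://schemas.microsoft.com/xps/2005/06">'
    '<DocumentReference Source="/Documents/1/FixedDocument.fdoc"/>'
    "</FixedDocumentSequence>"
)


def document_references(fdseq_xml: str) -> list[str]:
    """Return the Source of every DocumentReference element, in document order."""
    root = ET.fromstring(fdseq_xml)
    sources = []
    for element in root.iter():
        # Strip any "{namespace}" prefix and compare the local name only.
        local_name = element.tag.rsplit("}", 1)[-1]
        if local_name == "DocumentReference":
            sources.append(element.get("Source"))
    return sources


print(document_references(SAMPLE_FDSEQ))
```

A package holding several documents would simply list several DocumentReference entries, one per fixed document.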

Figure 10 shows the folder contents of imageChanged_xps_zip. The metadata folder contains the printer-targeted information in XML.

Figure 10

Wrap up of different documents discussed in this article

Figure 11 summarizes the several document extensions discussed when you start off with a test.doc document. Figure 11 also shows how to go from one to the other.

Figure 11

Summary

This article dealt with the structure of a Word document according to the Office Open XML format and XPS. The docx file extension provides an easy route to get at the constituent parts of the document in conformity with the specification. In order to work with the constituent parts, it will be necessary to use the application programming interface provided by Microsoft [Microsoft® Windows® Software Development Kit (SDK) for Beta 2 of Windows Vista and WinFX Runtime Components], which can be used on both Windows XP SP2 and Windows Vista. The article also described saving Office 2007 formats to XPS files, and applications that can modify the files. This backward compatibility will be a boon to zillions of Word documents whose contents can now be kept in a more portable format.

Although XPS is welcomed by Microsoft customers, it has also attracted criticism for some of its shortcomings. Since Microsoft has positioned XPS as an alternative to PDF by strategically providing a utility that can compare the two formats in situ, a lot of discussion will center on Microsoft and Adobe, and it will get stronger as Vista gets a wider audience.

