XML Paper Specification (XPS) of a Word 2003 Document
by Jayaram Krishnaswamy

Anatomy of a Word 2003 document

Saving the Word 2003 document

Although Word 2003 saves documents by default in the binary .doc format, it can also save them with the .docx extension, the default file format of Word 2007. Consider a Word document that contains a single line, "This is a test", followed immediately by an image (arrow.jpg, 855 bytes), as shown in Figure 1. Saved with the .doc extension, the file is 19.5 KB, including the 855-byte image. Saved with the .docx extension, the same document is 11.7 KB, a reduction of about 40 percent.

Office Open XML (OOXML) Structure of the Word 2003 document

To see the components of this document in Office Open XML, a format standardized by Ecma (ECMA-376), it is only necessary to change the file extension to .zip. ZIP is a data compression and archival format that originated with the PKZIP and PKUNZIP utilities, and it is well suited to bundling folders and files together and compressing them for archival purposes. An Office Open XML document is a ZIP package whose content is split into parts stored in several related folders.

OriginalTest.doc (Word 2003), containing a single line of text and an image, was saved as OriginalTest.docx, and the file was then renamed to OriginalTest.zip. Unzipping the archive into a folder (here, Backup of OriginalTestZIP) produces contents like those shown in Figure 2.

Figure 2
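
The same inspection can be done without renaming the file, because a .docx file is already a ZIP archive. The following is a minimal Python sketch under stated assumptions: it builds a small stand-in package in memory so that the example is self-contained, using part names that appear in this article's listings and placeholder bytes instead of the real part contents. The function names are hypothetical.

```python
import io
import zipfile

# Part names taken from the listings in this article. The byte contents
# written below are placeholders, not the real parts.
PART_NAMES = [
    "[Content_Types].xml",
    "docProps/app.xml",
    "docProps/core.xml",
    "word/document.xml",
    "word/_rels/document.xml.rels",
    "word/media/image1.jpeg",
]


def build_stand_in_package() -> bytes:
    """Create a small ZIP in memory that mimics the package layout."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as package:
        for name in PART_NAMES:
            package.writestr(name, b"placeholder")
    return buffer.getvalue()


def list_parts(package_bytes: bytes) -> list[str]:
    """Return the names of all parts stored in a ZIP-based package."""
    with zipfile.ZipFile(io.BytesIO(package_bytes)) as package:
        return package.namelist()


def top_level_folders(names: list[str]) -> list[str]:
    """Return the sorted set of top-level folders found in the part names."""
    return sorted({name.split("/")[0] for name in names if "/" in name})


if __name__ == "__main__":
    names = list_parts(build_stand_in_package())
    for name in names:
        print(name)
    print(top_level_folders(names))
```

For a real file, passing the path directly (for example `zipfile.ZipFile("OriginalTest.docx")`) should work the same way, since the standard `zipfile` module does not depend on the file extension.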

The [Content_Types].xml file lists the content types of the parts encountered in the OriginalTest document. The file, shown in Listing 1, acts as a manifest for the contained elements (folders and files).

Listing 1: [Content_Types].xml

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<Types xmlns="http://schemas.openxmlformats.org/package/2006/content-types">
<Default Extension="jpeg" ContentType="image/jpeg"/>
<Default Extension="rels" ContentType="application/vnd.openxmlformats-package.relationships+xml"/>
<Default Extension="xml" ContentType="application/xml"/>
<Override PartName="/word/document.xml"
 ContentType="application/vnd.openxmlformats-officedocument.wordprocessingml.document.main+xml"/>
<Override PartName="/word/styles.xml"
 ContentType="application/vnd.openxmlformats-officedocument.wordprocessingml.styles+xml"/>
<Override PartName="/docProps/app.xml"
 ContentType="application/vnd.openxmlformats-officedocument.extended-properties+xml"/>
<Override PartName="/word/settings.xml"
 ContentType="application/vnd.openxmlformats-officedocument.wordprocessingml.settings+xml"/>
<Override PartName="/word/theme/theme1.xml"
 ContentType="application/vnd.openxmlformats-officedocument.theme+xml"/>
<Override PartName="/word/fontTable.xml"
 ContentType="application/vnd.openxmlformats-officedocument.wordprocessingml.fontTable+xml"/>
<Override PartName="/word/webSettings.xml"
 ContentType="application/vnd.openxmlformats-officedocument.wordprocessingml.webSettings+xml"/>
<Override PartName="/docProps/core.xml"
 ContentType="application/vnd.openxmlformats-package.core-properties+xml"/>
</Types>
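
The manifest in Listing 1 can be read as a lookup table: an Override entry names one specific part, while a Default entry applies to every part with a given file extension. The sketch below illustrates that lookup in Python on an abbreviated copy of the listing. The function name `content_type_for` is hypothetical, and the case-insensitive comparison is a simplification assumed for this example.

```python
import xml.etree.ElementTree as ET
from typing import Optional

CT_NS = "http://schemas.openxmlformats.org/package/2006/content-types"

# Abbreviated copy of Listing 1 (only three entries are kept).
CONTENT_TYPES_XML = b"""<Types
 xmlns="http://schemas.openxmlformats.org/package/2006/content-types">
<Default Extension="jpeg" ContentType="image/jpeg"/>
<Default Extension="xml" ContentType="application/xml"/>
<Override PartName="/word/document.xml"
 ContentType="application/vnd.openxmlformats-officedocument.wordprocessingml.document.main+xml"/>
</Types>"""


def content_type_for(part_name: str, types_xml: bytes) -> Optional[str]:
    """Look up the content type of a part, or None if nothing matches."""
    root = ET.fromstring(types_xml)
    # An Override names one specific part and wins over any Default.
    for override in root.findall(f"{{{CT_NS}}}Override"):
        if override.get("PartName", "").lower() == part_name.lower():
            return override.get("ContentType")
    # Otherwise fall back to the Default registered for the file extension.
    extension = part_name.rsplit(".", 1)[-1].lower()
    for default in root.findall(f"{{{CT_NS}}}Default"):
        if default.get("Extension", "").lower() == extension:
            return default.get("ContentType")
    return None


if __name__ == "__main__":
    for name in ("/word/document.xml", "/word/media/image1.jpeg"):
        print(name, "->", content_type_for(name, CONTENT_TYPES_XML))
```

This explains why the image needs no Override of its own in Listing 1: the Default entry for the jpeg extension already covers it.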

The word folder in the tree contains the following files: styles.xml, settings.xml, fontTable.xml, webSettings.xml, and document.xml. Together they provide the resources needed to display the document.

The document.xml shown in Listing 2 contains the body of the document.

Listing 2: document.xml

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<w:document
 xmlns:ve="http://schemas.openxmlformats.org/markup-compatibility/2006"
 xmlns:o="urn:schemas-microsoft-com:office:office"
 xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships"
 xmlns:m="http://schemas.openxmlformats.org/officeDocument/2006/math"
 xmlns:v="urn:schemas-microsoft-com:vml"
 xmlns:wp="http://schemas.openxmlformats.org/drawingml/2006/wordprocessingDrawing"
 xmlns:w10="urn:schemas-microsoft-com:office:word"
 xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main"
 xmlns:wne="http://schemas.microsoft.com/office/word/2006/wordml">
<w:body>
<w:p w:rsidR="00A605EA" w:rsidRDefault="00A605EA">
<w:r>
<w:t>This is a test</w:t>
</w:r>
</w:p>
<w:p w:rsidR="00A605EA" w:rsidRPr="00943E24" w:rsidRDefault="00A605EA">
<w:r w:rsidRPr="00943E24">
<w:pict>
<v:shapetype id="_x0000_t75" coordsize="21600,21600"
 o:spt="75" o:preferrelative="t"
 path="m@4@5l@4@11@9@11@9@5xe" filled="f"
 stroked="f">
<v:stroke joinstyle="miter"/>
<v:formulas>
<v:f eqn="if lineDrawn pixelLineWidth 0"/>
<v:f eqn="sum @0 1 0"/>
<v:f eqn="sum 0 0 @1"/>
<v:f eqn="prod @2 1 2"/>
<v:f eqn="prod @3 21600 pixelWidth"/>
<v:f eqn="prod @3 21600 pixelHeight"/>
<v:f eqn="sum @0 0 1"/>
<v:f eqn="prod @6 1 2"/>
<v:f eqn="prod @7 21600 pixelWidth"/>
<v:f eqn="sum @8 21600 0"/>
<v:f eqn="prod @7 21600 pixelHeight"/>
<v:f eqn="sum @10 21600 0"/>
</v:formulas>
<v:path o:extrusionok="f"
 gradientshapeok="t" o:connecttype="rect"/>
<o:lock v:ext="edit" aspectratio="t"/>
</v:shapetype>
<v:shape id="_x0000_i1025"
 type="#_x0000_t75" style="width:45pt;height:45pt">
<v:imagedata r:id="rId4" o:title=""/>
</v:shape>
</w:pict>
</w:r>
</w:p>
<w:p w:rsidR="00A605EA"
 w:rsidRPr="002C481F" w:rsidRDefault="00A605EA"/>
<w:sectPr w:rsidR="00A605EA"
 w:rsidRPr="002C481F" w:rsidSect="00A605EA">
<w:pgSz w:w="11906" w:h="16838"/>
<w:pgMar w:top="1440"
 w:right="1800" w:bottom="1440" w:left="1800"
 w:header="720" w:footer="720" w:gutter="0"/>
<w:cols w:space="720"/>
<w:docGrid w:type="lines"
 w:linePitch="360"/>
</w:sectPr>
</w:body>
</w:document>
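
To make the structure of Listing 2 concrete, the following Python sketch pulls the paragraph text and the image reference out of a trimmed copy of the listing. It is an illustration under stated assumptions: the shape formulas, revision ids, and section properties are omitted from the sample, and the function names `paragraph_texts` and `image_relationship_ids` are hypothetical.

```python
import xml.etree.ElementTree as ET

# Namespace URIs copied from the w:document element in Listing 2.
NS = {
    "w": "http://schemas.openxmlformats.org/wordprocessingml/2006/main",
    "v": "urn:schemas-microsoft-com:vml",
    "r": "http://schemas.openxmlformats.org/officeDocument/2006/relationships",
}

# Trimmed copy of Listing 2 (formulas, rsid attributes and sectPr omitted).
DOCUMENT_XML = b"""<w:document
 xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main"
 xmlns:v="urn:schemas-microsoft-com:vml"
 xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships">
<w:body>
<w:p><w:r><w:t>This is a test</w:t></w:r></w:p>
<w:p><w:r><w:pict>
<v:shape id="_x0000_i1025" style="width:45pt;height:45pt">
<v:imagedata r:id="rId4"/>
</v:shape>
</w:pict></w:r></w:p>
<w:p/>
</w:body>
</w:document>"""


def paragraph_texts(document_xml: bytes) -> list[str]:
    """Return the text of each paragraph (w:p), joining its w:t elements."""
    root = ET.fromstring(document_xml)
    texts = []
    for paragraph in root.iterfind(".//w:p", NS):
        text_nodes = paragraph.iterfind(".//w:t", NS)
        texts.append("".join(node.text or "" for node in text_nodes))
    return texts


def image_relationship_ids(document_xml: bytes) -> list[str]:
    """Return the r:id values of all v:imagedata elements."""
    root = ET.fromstring(document_xml)
    r_id = f"{{{NS['r']}}}id"  # attribute name in {namespace}local form
    images = root.iterfind(".//v:imagedata", NS)
    return [img.get(r_id) for img in images if img.get(r_id)]


if __name__ == "__main__":
    print(paragraph_texts(DOCUMENT_XML))
    print(image_relationship_ids(DOCUMENT_XML))
```

Note that the image itself is not embedded in document.xml. The body only carries a relationship id, and the picture is stored as a separate part of the package.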

Similarly, the other XML files describe the applicable settings, fonts, styles, themes, and web settings for the document. They are not listed in this article, but they are easy to generate in the same way.

The media folder holds media files such as pictures. In the present case, arrow.jpg is stored in the media folder as image1.jpeg.

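The _rels folder contains a single file, document.xml.rels, which is also an XML file and is shown in Listing 3. Reviewing each relationship shows that this file relates the constituent parts of the document to one another. For example, the r:id="rId4" attribute on the v:imagedata element in Listing 2 matches the rId4 relationship in Listing 3, whose target is media/image1.jpeg.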

Listing 3: document.xml.rels

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<Relationships 
xmlns="http://schemas.openxmlformats.org/package/2006/relationships">
<Relationship Id="rId3"
 Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/webSettings"
 Target="webSettings.xml"/>
<Relationship Id="rId2"
 Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/settings"
 Target="settings.xml"/>
<Relationship Id="rId1"
 Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/styles"
 Target="styles.xml"/>
<Relationship Id="rId6"
 Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/theme"
 Target="theme/theme1.xml"/>
<Relationship Id="rId5"
 Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/fontTable"
 Target="fontTable.xml"/>
<Relationship Id="rId4"
 Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/image"
 Target="media/image1.jpeg"/>
</Relationships>
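
The relationship lookup can be sketched in a few lines of Python. The example below uses two entries copied from Listing 3 and resolves a relationship id to a path inside the package, assuming that each Target is relative to the folder of the source part (here taken to be word/document.xml). The function names are hypothetical, and external targets are not handled.

```python
import posixpath
import xml.etree.ElementTree as ET

REL_NS = "http://schemas.openxmlformats.org/package/2006/relationships"

# Two entries copied from Listing 3.
RELS_XML = b"""<Relationships
 xmlns="http://schemas.openxmlformats.org/package/2006/relationships">
<Relationship Id="rId1"
 Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/styles"
 Target="styles.xml"/>
<Relationship Id="rId4"
 Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/image"
 Target="media/image1.jpeg"/>
</Relationships>"""


def load_relationships(rels_xml: bytes) -> dict[str, tuple[str, str]]:
    """Map each relationship Id to (short type name, Target)."""
    root = ET.fromstring(rels_xml)
    table = {}
    for rel in root.findall(f"{{{REL_NS}}}Relationship"):
        # Keep only the last segment of the Type URI, e.g. "image".
        short_type = rel.get("Type").rsplit("/", 1)[-1]
        table[rel.get("Id")] = (short_type, rel.get("Target"))
    return table


def resolve_target(source_part: str, target: str) -> str:
    """Resolve a relative Target against the folder of the source part."""
    base_folder = posixpath.dirname(source_part)
    return posixpath.normpath(posixpath.join(base_folder, target))


if __name__ == "__main__":
    relationships = load_relationships(RELS_XML)
    kind, target = relationships["rId4"]
    print(kind, resolve_target("word/document.xml", target))
```

The `posixpath` module is used instead of `os.path` so that the result uses forward slashes on every platform, matching the part names inside the ZIP package.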

The docProps folder consists of two files, app.xml and core.xml. Listing 4 shows app.xml, which describes document-wide details such as the template file used and security information.

Listing 4: app.xml

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<Properties
 xmlns="http://schemas.openxmlformats.org/officeDocument/2006/extended-properties"
 xmlns:vt="http://schemas.openxmlformats.org/officeDocument/2006/docPropsVTypes">
<Template>Normal_Wordconv.dotm</Template>
<TotalTime>1</TotalTime>
<Pages>1</Pages>
<Words>2</Words>
<Characters>16</Characters>
<Application>Microsoft Office Outlook</Application>
<DocSecurity>0</DocSecurity>
<Lines>0</Lines>
<Paragraphs>0</Paragraphs>
<ScaleCrop>false</ScaleCrop>
<Company> Hodentek</Company>
<LinksUpToDate>false</LinksUpToDate>
<CharactersWithSpaces>0</CharactersWithSpaces>
<SharedDoc>false</SharedDoc>
<HyperlinksChanged>false</HyperlinksChanged>
<AppVersion>12.0000</AppVersion>
</Properties>
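
Because app.xml is a flat list of elements in a single default namespace, it can be turned into a simple dictionary. The Python sketch below does this for a shortened copy of Listing 4. The function name `read_extended_properties` is hypothetical, and all values are kept as strings instead of being converted to numbers or booleans.

```python
import xml.etree.ElementTree as ET

# Shortened copy of Listing 4 (five of the properties are kept).
APP_XML = b"""<Properties
 xmlns="http://schemas.openxmlformats.org/officeDocument/2006/extended-properties">
<Template>Normal_Wordconv.dotm</Template>
<Pages>1</Pages>
<Application>Microsoft Office Outlook</Application>
<DocSecurity>0</DocSecurity>
<AppVersion>12.0000</AppVersion>
</Properties>"""


def read_extended_properties(app_xml: bytes) -> dict[str, str]:
    """Map each child element's local name to its text content."""
    root = ET.fromstring(app_xml)
    properties = {}
    for child in root:
        # ElementTree tags look like "{namespace}Name"; keep only "Name".
        local_name = child.tag.split("}", 1)[-1]
        properties[local_name] = (child.text or "").strip()
    return properties


if __name__ == "__main__":
    for name, value in read_extended_properties(APP_XML).items():
        print(f"{name}: {value}")
```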

The core.xml file, shown in Listing 5, describes the creator, the revision number, and other information about the document, using either supplied or default values.

Listing 5: core.xml

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<cp:coreProperties 
xmlns:cp="http://schemas.openxmlformats.org/package/2006/metadata/core-properties" 
xmlns:dc="http://purl.org/dc/elements/1.1/" 
xmlns:dcterms="http://purl.org/dc/terms/" 
xmlns:dcmitype="http://purl.org/dc/dcmitype/" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<dc:title>This is a test</dc:title>
<dc:subject>
</dc:subject>
<dc:creator>HP Authorized Customer</dc:creator>
<cp:keywords></cp:keywords>
<dc:description></dc:description>
<cp:lastModifiedBy>HP Authorized Customer</cp:lastModifiedBy>
<cp:revision>2</cp:revision>
<dcterms:created xsi:type="dcterms:W3CDTF">2007-04-10T14:49:00Z</dcterms:created>
<dcterms:modified xsi:type="dcterms:W3CDTF">2007-04-10T14:49:00Z</dcterms:modified>
</cp:coreProperties>
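
Unlike app.xml, core.xml mixes several namespaces (cp, dc, and dcterms), so a reader has to supply a prefix map when querying it. The following Python sketch reads a few fields from a shortened copy of Listing 5. The function name `read_core_properties` is hypothetical, and the date parsing assumes the exact timestamp layout seen in the listing (UTC with a trailing Z).

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone

# Namespace URIs copied from the cp:coreProperties element in Listing 5.
NS = {
    "cp": "http://schemas.openxmlformats.org/package/2006/metadata/core-properties",
    "dc": "http://purl.org/dc/elements/1.1/",
    "dcterms": "http://purl.org/dc/terms/",
}

# Shortened copy of Listing 5.
CORE_XML = b"""<cp:coreProperties
 xmlns:cp="http://schemas.openxmlformats.org/package/2006/metadata/core-properties"
 xmlns:dc="http://purl.org/dc/elements/1.1/"
 xmlns:dcterms="http://purl.org/dc/terms/"
 xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<dc:title>This is a test</dc:title>
<dc:creator>HP Authorized Customer</dc:creator>
<cp:revision>2</cp:revision>
<dcterms:created xsi:type="dcterms:W3CDTF">2007-04-10T14:49:00Z</dcterms:created>
</cp:coreProperties>"""


def read_core_properties(core_xml: bytes) -> dict:
    """Return title, creator, revision and creation time from core.xml."""
    root = ET.fromstring(core_xml)

    def text_of(path: str):
        node = root.find(path, NS)
        return node.text if node is not None else None

    created = text_of("dcterms:created")
    return {
        "title": text_of("dc:title"),
        "creator": text_of("dc:creator"),
        "revision": int(text_of("cp:revision") or 0),
        # Assumes the "YYYY-MM-DDThh:mm:ssZ" form used in Listing 5.
        "created": (
            datetime.strptime(created, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)
            if created
            else None
        ),
    }


if __name__ == "__main__":
    print(read_core_properties(CORE_XML))
```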

©Copyright 1998-2024 ASPAlliance.com