Betsy Rolland
LIS 600 Independent Study

Transforming XML to Microsoft XML
Terry Brooks
Winter 2006

Harvesting all text from a Word document into a plain text file

The steps below extract all of the text from a Word document's XML representation and write it to a plain text file.

One of the most powerful uses of WordprocessingML is automatically harvesting information from an existing Word document. This is especially useful for highly structured documents whose defined sections can then be represented as XML documents with schemas, but it can also be useful for other documents. The XML file below comes from a Word document saved as XML.

Documents:

Steps:

  1. Open your Word document and save it as XML.

  2. Create a new XSLT stylesheet that looks like this, or use GetAllText.xslt (taken from Office 2003 XML by Evan Lenz, Mary McRae & Simon St. Laurent (2004)):

  3. Transform the XML document with the stylesheet, then open the resulting text file in Notepad or another text editor.
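
The same harvesting idea can be sketched outside XSLT. The example below is a minimal Python sketch, not the GetAllText.xslt stylesheet from the book. It assumes the document was saved by Word 2003, whose WordprocessingML elements use the namespace http://schemas.microsoft.com/office/word/2003/wordml. The function name harvest_text and the SAMPLE document are hypothetical and exist only for illustration. The sketch joins the text of every w:t element and puts each w:p paragraph on its own line.

```python
import xml.etree.ElementTree as ET

# Namespace of Word 2003 WordprocessingML (assumed; other Word versions differ).
W_NS = "http://schemas.microsoft.com/office/word/2003/wordml"


def harvest_text(xml_source: str) -> str:
    """Return the document text, one line per w:p paragraph (hypothetical helper)."""
    root = ET.fromstring(xml_source)
    lines = []
    # Visit every paragraph element anywhere in the document.
    for para in root.iter(f"{{{W_NS}}}p"):
        # A paragraph's text is split across runs; each run holds w:t elements.
        pieces = [t.text or "" for t in para.iter(f"{{{W_NS}}}t")]
        lines.append("".join(pieces))
    return "\n".join(lines)


# Tiny made-up document standing in for a file saved from Word as XML.
SAMPLE = """<w:wordDocument xmlns:w="http://schemas.microsoft.com/office/word/2003/wordml">
  <w:body>
    <w:p><w:r><w:t>Hello, </w:t></w:r><w:r><w:t>world.</w:t></w:r></w:p>
    <w:p><w:r><w:t>Second paragraph.</w:t></w:r></w:p>
  </w:body>
</w:wordDocument>"""

if __name__ == "__main__":
    print(harvest_text(SAMPLE))
```

For a real document, pass the contents of the saved XML file to harvest_text and write the returned string to a .txt file, which can then be opened in any text editor.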