Using Groovy scriptlets inside a *.docx document

Eugene Polyhaev, May 29th, 2012 (Last Updated: October 26th, 2012)

Introduction

One of my recent projects required automated generation of contracts for customers. A contract is a legal document about 10 pages long. One contract form can be applied to many customers, so the document is a template with customer information placed in certain spots.

In this article I am going to show you how I solved this problem.

Requirements

This is the initial version of the formalized requirements:

Specified data must be placed in marked places of a complex DOC/DOCX file

The requirements were subsequently refined and expanded:

Specified data must be placed in marked places of a complex DOCX file.
Output markup must be scriptlet-like: ${}, <%%>, <%=%>.
Output data may include not only strings but also hashes and objects, with field access available.
Output language must be brief and script-friendly: Groovy, JavaScript.
It must be possible to display a list of objects in a table, with each cell displaying a field.

Background

It turned out that the existing products in the field (I’m talking about the Java world) do not meet the initial requirements.

A brief overview of the products:

Jasper reports

JasperReports uses *.jrxml files as templates. A template file in combination with input data (an SQL result set or a Map of params) is given to a processor, which produces any of these formats: PDF, XML, HTML, CSV, XLS, RTF, TXT.

Did not fit in:

It is not WYSIWYG, even with the help of iReport, a visual tool for creating jrxml templates.
The JasperReports API must be learned well to create and style a complex template.
JR does not output in a suitable format. PDF might be okay, but the ability to edit the result by hand is preferable.

Docx4j

Docx4j is a Java library for creating and manipulating Microsoft Open XML (Word docx, Powerpoint pptx, and Excel xlsx) files.

Did not fit in:

There is no case meeting my requirements in the docx4j documentation. A brief note about the XmlUtils.unmarshallFromTemplate functionality is present, but it only does the simplest substitutions.
Repeated output is done with prepared XML sources and XPath.

Apache POI

Apache POI is a Java tool for creating and manipulating parts of *.doc, *.ppt, *.xls documents. A major use of the Apache POI API is for text extraction applications such as web spiders, index builders, and content management systems.

Did not fit in:

Does not have any options that meet my requirements.

Word Content Control Toolkit

Word Content Control Toolkit is a stand-alone, light-weight tool that opens any Word Open XML document and lists all of the content controls inside of it.

After I developed my own solution with scriptlets, I heard of a solution based on a combination of this tool and XSLT transformations. It may work for somebody, but I did not bother digging into it because my solution simply takes fewer steps to use.

Solution of the problem

It was fun!

1. The document text content is stored as an Open XML file inside a zip archive. The standard JDK 6 zip classes do not accept an explicit encoding parameter, so a broken docx file may be produced with them. I had to use AntBuilder, Groovy's wrapper around Ant, for zipping, because it does have an encoding parameter.

2. Any text you enter in MS Word may be arbitrarily broken into parts wrapped in XML tags. So I had to clean out the XML that Word generates inside the scriptlets of the template. I used regular expressions for this task. I did not try XSLT or anything else because I thought regular expressions would be faster.

3. I decided to use Groovy as the scripting language because of its simplicity, its Java nature, and its built-in template processor. I found an interesting issue related to the processor. It turned out that even in a small 10-page document one can easily run into a restriction on the length of a string between two scriptlets.
I had to substitute the text between each pair of scriptlets with a UUID string, run the Groovy template processor on the modified text, and finally switch those UUID placeholders back to the initial text fragments.
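
The three steps above can be sketched in plain Java roughly as follows. This is only an illustration under stated assumptions, not the Scriptlet4docx source: the class DocxStepsSketch and all of its method names are hypothetical, the java.util.zip constructors that take a Charset (Java 7 and later) stand in for AntBuilder, the cleanup handles only ${...} scriptlets, and the scriptlet delimiters are assumed to appear literally in the text, whereas a real document.xml stores <% as &lt;%.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.UUID;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

// Hypothetical helper for illustration; not part of Scriptlet4docx.
public class DocxStepsSketch {

    // ${...} scriptlets only, including ones that Word split across XML runs.
    private static final Pattern DOLLAR_SCRIPTLET =
            Pattern.compile("\\$\\{.*?\\}", Pattern.DOTALL);
    // Any scriptlet, assuming the delimiters appear literally in the text.
    private static final Pattern ANY_SCRIPTLET =
            Pattern.compile("\\$\\{.*?\\}|<%.*?%>", Pattern.DOTALL);

    // Step 1: write one zip entry. The Charset applies to entry names and comments;
    // the content bytes are encoded separately.
    static byte[] zipEntry(String name, String content) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(bytes, StandardCharsets.UTF_8)) {
            zip.putNextEntry(new ZipEntry(name));
            zip.write(content.getBytes(StandardCharsets.UTF_8));
            zip.closeEntry();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    // Reads the first entry back, again stating the encoding explicitly.
    static String unzipFirstEntry(byte[] archive) {
        try (ZipInputStream zip = new ZipInputStream(
                new ByteArrayInputStream(archive), StandardCharsets.UTF_8)) {
            zip.getNextEntry();
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buffer = new byte[4096];
            for (int n; (n = zip.read(buffer)) != -1; ) {
                out.write(buffer, 0, n);
            }
            return new String(out.toByteArray(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Step 2: drop the XML tags that ended up inside a ${...} scriptlet.
    static String cleanScriptlets(String xml) {
        Matcher m = DOLLAR_SCRIPTLET.matcher(xml);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String cleaned = m.group().replaceAll("<[^>]+>", "");
            m.appendReplacement(out, Matcher.quoteReplacement(cleaned));
        }
        m.appendTail(out);
        return out.toString();
    }

    // Step 3a: replace the literal text between scriptlets with short UUID keys.
    static String protect(String template, Map<String, String> store) {
        Matcher m = ANY_SCRIPTLET.matcher(template);
        StringBuilder out = new StringBuilder();
        int last = 0;
        while (m.find()) {
            out.append(stash(template.substring(last, m.start()), store)).append(m.group());
            last = m.end();
        }
        return out.append(stash(template.substring(last), store)).toString();
    }

    // Stores one text fragment under a fresh UUID and returns the key.
    private static String stash(String text, Map<String, String> store) {
        if (text.isEmpty()) {
            return "";
        }
        String key = UUID.randomUUID().toString();
        store.put(key, text);
        return key;
    }

    // Step 3b: after the template engine has run, swap the keys back for the text.
    static String restore(String processed, Map<String, String> store) {
        for (Map.Entry<String, String> e : store.entrySet()) {
            processed = processed.replace(e.getKey(), e.getValue());
        }
        return processed;
    }
}
```

In this sketch the template engine would run between protect and restore, so it only ever sees scriptlets separated by 36-character keys instead of long stretches of document XML.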

After overcoming these difficulties, I tried out the project in real life. It turned out well!

I created a project website and published it.

Project address: snowindy.github.com/scriptlet4docx/

Code example

// Bindings that the scriptlets in the template refer to by name
HashMap<String, Object> params = new HashMap<String, Object>();
params.put("name","John");
params.put("sirname","Smith");

// Read the template, evaluate its scriptlets, and write the resulting document
DocxTemplater docxTemplater = new DocxTemplater(new File("path_to_docx_template/template.docx"));
docxTemplater.process(new File("path_to_result_docx/result.docx"), params);

Scriptlet types explanation

${ data }
Equivalent to out.print(data)

<%= data %>
Equivalent to out.print(data)

<% any_code %>
Evaluates the contained code without printing anything. May be used for conditions split across several scriptlets:

<% if (cond) { %>
This text block will be printed in case of "cond == true"
<% } else { %>
This text block will be printed otherwise.
<% } %>

$[ @listVar.field ]

This is a custom Scriptlet4docx scriptlet type designed to output a collection of objects to docx tables. It must be used inside a table cell.

Say, we have a list of person objects. Each has two fields: ‘name’ and ‘address’. We want to output them to a two-column table.

Create a binding with key ‘personList’ referencing that collection.
Create a table with two columns and one row inside the template docx document.
$[@person.name] goes to the first column cell; $[@person.address] goes to the second.
Voila, the whole collection will be printed to the table.
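
On the Java side, the binding for such a table might be prepared along the following lines. This is a minimal sketch with assumptions: the class PersonListBindingSketch, the Person row type, and the sample values are made up for illustration, and it assumes that the 'name' and 'address' fields can be read through ordinary getters.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;

// Hypothetical example class; only the 'personList' key comes from the text above.
public class PersonListBindingSketch {

    // Hypothetical row type exposing the 'name' and 'address' fields via getters.
    public static class Person {
        private final String name;
        private final String address;

        public Person(String name, String address) {
            this.name = name;
            this.address = address;
        }

        public String getName() {
            return name;
        }

        public String getAddress() {
            return address;
        }
    }

    // Builds the bindings map with the collection stored under 'personList'.
    public static HashMap<String, Object> buildParams() {
        List<Person> persons = new ArrayList<Person>();
        persons.add(new Person("John Smith", "1 Example Street")); // made-up sample data
        persons.add(new Person("Jane Doe", "2 Sample Avenue"));    // made-up sample data

        HashMap<String, Object> params = new HashMap<String, Object>();
        params.put("personList", persons);
        return params;
    }
}
```

The resulting map would then be passed to docxTemplater.process(...) in the same way as the params map in the code example above.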

Live template example

You can see all of the scriptlets mentioned above in use in a demonstration template.

Project future

If I have actually developed a new approach to processing docx templates, it would be nice to popularize it.

Project TODOs: