In the last post on office format conversion, we discussed the design of the format conversion function. In the end, we decided to convert office documents into HTML and to implement two versions: one that embeds images in the HTML and one that extracts them into separate files.
In this post, we discuss the implementation details and show runnable code examples.
Library Chosen
After some searching, we found that Apache POI has a relatively large number of questions on Stack Overflow, which suggests that its user community is relatively large and active. So we decided to use Apache POI to assist with the conversion.
In addition, we found xdocreport, which builds on POI, adds more utility classes, and offers more convenient and powerful interfaces for this job, so we included this library as well.
One thing to note is that we could not get things to compile with POI 3.14, so we had to use the newer version, 3.15.
maven dependency
doc vs docx
As stated previously, doc and docx are two different formats, so the POI library uses two different abstractions and two sets of interfaces to manipulate them. The HWPF abstraction is for doc and the XWPF abstraction is for docx.
Whilst HWPF and XWPF provide similar features, there is not a common interface across the two of them at this time.
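Since there is no shared interface, the calling code has to decide up front which of the two paths to take. A minimal sketch of that decision, assuming the file extension is a good enough signal (the `WordFormatSketch` class and its names are hypothetical and not part of POI):

```java
import java.util.Locale;

final class WordFormatSketch {
    enum WordFormat { DOC, DOCX }

    // Picks the conversion path from the file extension (hypothetical helper).
    static WordFormat fromFileName(String fileName) {
        String lower = fileName.toLowerCase(Locale.ROOT);
        if (lower.endsWith(".docx")) {
            return WordFormat.DOCX; // would be handled through XWPF
        }
        if (lower.endsWith(".doc")) {
            return WordFormat.DOC; // would be handled through HWPF
        }
        throw new IllegalArgumentException("Not a Word file: " + fileName);
    }
}
```

The `.docx` check comes first only for readability; the two suffixes do not overlap, because `.docx` does not end with `.doc`.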
Convert Doc to Html
To convert doc to HTML, we first read the document in:
HWPFDocument wordDocument = new HWPFDocument(in);
Then we use a converter, configured with the options we need, to convert it, and wrap the resulting DOM:
DOMSource dom = new DOMSource(converter.getDocument());
Finally, we output the DOM as HTML:
serializer.transform(dom, new StreamResult(outFile(outDir, fileName)));
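The snippet above does not show how `serializer` is created. One plausible setup, assuming it is a plain JDK `Transformer` switched to HTML output, is sketched below. The hand-built document only stands in for `converter.getDocument()`, and the result is written to a string instead of the post's `outFile(outDir, fileName)` helper, which is not shown here.

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

final class DomToHtmlSketch {
    // Serializes a DOM tree using the "html" output method.
    static String serialize(Document doc) {
        try {
            Transformer serializer = TransformerFactory.newInstance().newTransformer();
            serializer.setOutputProperty(OutputKeys.METHOD, "html");
            serializer.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
            StringWriter out = new StringWriter();
            serializer.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException("Could not serialize the DOM", e);
        }
    }

    // Builds a tiny stand-in for the DOM that a converter would produce.
    static Document sampleDocument() {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
            Element html = doc.createElement("html");
            Element body = doc.createElement("body");
            Element p = doc.createElement("p");
            p.setTextContent("Hello");
            body.appendChild(p);
            html.appendChild(body);
            doc.appendChild(html);
            return doc;
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException("Could not create a DOM document", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(serialize(sampleDocument()));
    }
}
```

Choosing the `html` method rather than the default `xml` matters for browsers: it avoids the XML declaration and writes void elements such as `img` in HTML style.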
Extract Image
To extract the images, we have to register a callback that the converter invokes whenever it handles an image in the doc. The converter library provides an interface for this, PicturesManager, which we implement:
converter.setPicturesManager(new HtmlPicturesManager(outDir.toString()));
public class HtmlPicturesManager implements PicturesManager {
    // ...
    HtmlPicturesManager(String baseDir) {
        this.baseDir = baseDir;
    }

    @Override
    public String savePicture(byte[] content, PictureType pictureType, String name, float widthInches, float heightInches) {
        // ...
        return name;
    }
}
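The body of `savePicture` is elided above. As a rough idea of what such a callback typically does, here is a JDK-only sketch that writes the bytes under a base directory and returns the name to use in the `img` tag. The `PictureSaverSketch` class and its `save` method are hypothetical stand-ins, and the flat directory layout is an assumption.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

final class PictureSaverSketch {
    private final Path baseDir;

    PictureSaverSketch(String baseDir) {
        this.baseDir = Paths.get(baseDir);
    }

    // Writes the bytes to baseDir/<file name> and returns the relative name for img/@src.
    String save(byte[] content, String name) {
        try {
            Files.createDirectories(baseDir);
            // Keep only the last path segment so the suggested name stays inside baseDir.
            String fileName = Paths.get(name).getFileName().toString();
            Files.write(baseDir.resolve(fileName), content);
            return fileName;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Returning a relative name keeps the generated HTML portable, as long as the HTML file and the images end up in the same directory.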
Embed Image
By contrast, to embed the images we can subclass the original converter and override its image handling:
public class EmbedImgHtmlConverter extends WordToHtmlConverter {
    EmbedImgHtmlConverter() throws ParserConfigurationException {
        // ...
    }
    protected void processImageWithoutPicturesManager(Element currentBlock, boolean inlined, Picture picture) {
        Element imgNode = currentBlock.getOwnerDocument().createElement("img"); // create the img tag
        StringBuilder sb = new StringBuilder(picture.getSize() + PoiEmbedImgHtmlConverter.EMBED_IMG_SRC_PREFIEX.length());
        // ...
        imgNode.setAttribute("src", sb.toString());
    }
}
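The lines that fill the `StringBuilder` are not shown above. Embedding an image in an `img` tag is usually done with a data URI, so a reasonable guess is that the prefix constant holds something like `data:image/png;base64,` and the picture bytes are appended Base64-encoded. The sketch below illustrates that idea with the JDK only; the prefix format and the `DataUriSketch` name are assumptions, not the post's actual code.

```java
import java.util.Base64;

final class DataUriSketch {
    // Builds "data:<mime>;base64,<payload>" for use as an img src value.
    static String toDataUri(String mimeType, byte[] imageBytes) {
        String prefix = "data:" + mimeType + ";base64,"; // assumed prefix format
        String encoded = Base64.getEncoder().encodeToString(imageBytes);
        return new StringBuilder(prefix.length() + encoded.length())
                .append(prefix)
                .append(encoded)
                .toString();
    }

    public static void main(String[] args) {
        byte[] fakeImage = {104, 105}; // placeholder bytes, not a real image
        System.out.println(toDataUri("image/png", fakeImage));
    }
}
```

Note that Base64 output is about a third larger than the raw bytes, so sizing the builder from the encoded length avoids a resize.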
Convert Docx to Html
The interface for converting docx is cleaner.
We first read the file in:
XWPFDocument document = new XWPFDocument(in);
Then we create the output options, which can be customized:
XHTMLOptions options = XHTMLOptions.create();
Finally, we convert the file:
XHTMLConverter.getInstance().convert(document, out, options);
Extract Image
To extract the images, we configure an image extractor through our options.
xdocreport provides a utility class, FileImageExtractor, to assist with this, although it fixes the location of the output image files. We want to change that location, so we also have to set an image URI resolver, as the following library source code shows:
// img/@src
String src = pictureData.getFileName();
if ( StringUtils.isNotEmpty( src ) )
{
    src = resolver.resolve( WORD_MEDIA + src );
    attributes = SAXHelper.addAttrValue( attributes, SRC_ATTR, src );
}
Our implementation:
public class ExtractorAndResolver extends FileImageExtractor implements IURIResolver {
    public ExtractorAndResolver(File baseDir) {
        super(baseDir);
    }
    public void extract(String imagePath, byte[] imageData) throws IOException {
        super.extract(Paths.get(imagePath).getFileName().toString(), imageData);
    }
    public String resolve(String uri) {
        return Paths.get(uri).getFileName().toString();
    }
}
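Both methods strip the directory part in the same way, which is what keeps the written file and the `src` attribute in sync. A small JDK-only illustration, assuming the library passes paths of the form `word/media/image1.png` (the exact value of `WORD_MEDIA` is an assumption here, and `FlattenPathSketch` is a hypothetical name):

```java
import java.nio.file.Paths;

final class FlattenPathSketch {
    // Reduces a path such as "word/media/image1.png" to its last segment.
    static String fileNameOnly(String path) {
        return Paths.get(path).getFileName().toString();
    }

    public static void main(String[] args) {
        String writtenAs = fileNameOnly("word/media/image1.png"); // name used when saving
        String srcValue = fileNameOnly("word/media/image1.png");  // name used in img/@src
        System.out.println(writtenAs + " / " + srcValue);
    }
}
```

If only one of the two methods flattened the path, the HTML would point at a file that was saved somewhere else.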
Embed Image
Because the original implementation separates extracting the image from setting the img src attribute in the HTML, we have to combine the two steps to embed the image.
Original source code:
IImageExtractor extractor = getImageExtractor();
if ( extractor != null )
{
    XWPFPictureData pictureData = getPictureData( picture );
    if ( pictureData != null )
    {
        extractor.extract( WORD_MEDIA + pictureData.getFileName(), pictureData.getData() );
    }
}
// visit the picture and set image attributes
visitPicture( picture, offsetX, relativeFromH, offsetY, relativeFromV, wrapText,
              parentContainer );
Our implementation:
public class EmbedImgResolver extends FileImageExtractor implements IURIResolver {
    // ...
    public void extract(String imagePath, byte[] imageData) throws IOException {
        this.picture = imageData;
    }
    public String resolve(String uri) {
        StringBuilder sb = new StringBuilder(picture.length + PoiEmbedImgHtmlConverter.EMBED_IMG_SRC_PREFIEX.length());
        // ...
        return sb.toString();
    }
}
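The trick here is a handoff between two callbacks: `extract` remembers the bytes instead of writing a file, and the following `resolve` call turns them into the `src` value. The sketch below models that with plain methods and no library types. It assumes, as the library source quoted above suggests, that `extract` is called before `resolve` for each picture; the `EmbedHandoffSketch` name and the PNG data-URI prefix are assumptions.

```java
import java.util.Base64;

final class EmbedHandoffSketch {
    private byte[] picture; // bytes captured by the most recent extract call

    // Stand-in for the extractor callback: keep the bytes in memory.
    void extract(String imagePath, byte[] imageData) {
        this.picture = imageData;
    }

    // Stand-in for the resolver callback: ignore the path and return a data URI.
    String resolve(String uri) {
        if (picture == null) {
            return uri; // nothing captured yet, fall back to the original URI
        }
        return "data:image/png;base64," + Base64.getEncoder().encodeToString(picture);
    }
}
```

Because the bytes live in a single field, an instance like this is only safe when pictures are processed one at a time, which is worth keeping in mind before sharing it across threads.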
A complete code example can be found here.
Because I spent a lot of time figuring out how to implement these utility classes, I decided to add some classes to the original source code and open some pull requests. We can then simply use them in a future release, as the repo's wiki suggests. If you need them now, you can find the source here.