Office Format Conversion (2): Implementation

In the previous post on office format conversion, we discussed the design of the format conversion feature. We decided to convert office documents to HTML and to implement two versions: one that embeds images in the HTML and one that extracts them into separate files.

In this post, we go through the implementation details and show runnable code examples.

Library Choice

After some searching, we found that Apache POI has a relatively large number of questions on Stack Overflow, which suggests that its user community is relatively large and active. So we decided to use Apache POI for the conversion.
In addition, we found xdocreport, which builds on POI with extra utility classes and offers more convenient and powerful interfaces for this job, so we include this library as well.

Pitfalls

One thing to note is that we could not get the build to compile with POI 3.14, so we had to use the newer version 3.15.

maven dependency

`doc` vs `docx`

As stated previously, doc and docx are two different formats, so the POI library uses two different abstractions, each with its own set of interfaces, to manipulate them.
HWPF is the abstraction for doc and XWPF is the one for docx.

Whilst HWPF and XWPF provide similar features, there is not a common interface across the two of them at this time.
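
Since there is no common interface, the calling code has to pick the right abstraction itself. Below is a minimal sketch of that dispatch, assuming the format can be trusted from the file extension. `WordFormat` and `fromFileName` are hypothetical names for illustration, not part of POI.

```java
import java.util.Locale;

/** Hypothetical helper: decides which POI abstraction applies to a file name. */
enum WordFormat {
    DOC,   // binary format, handled by HWPF
    DOCX;  // OOXML format, handled by XWPF

    /** Maps a file name to its format based on the extension (case-insensitive). */
    static WordFormat fromFileName(String fileName) {
        String lower = fileName.toLowerCase(Locale.ROOT);
        if (lower.endsWith(".docx")) {
            return DOCX;
        }
        if (lower.endsWith(".doc")) {
            return DOC;
        }
        throw new IllegalArgumentException("Not a Word document: " + fileName);
    }
}
```

A caller could switch on the result and open the input stream with HWPFDocument or XWPFDocument accordingly.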

Convert Doc to Html

To convert doc to HTML, we first read the document in:

HWPFDocument wordDocument = new HWPFDocument(in);

Then we use a converter, configured with options, to convert it:

converter.processDocument(wordDocument);
DOMSource dom = new DOMSource(converter.getDocument());

Finally, we write the DOM out as HTML:

serializer.transform(dom, new StreamResult(outFile(outDir, fileName)));
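
The creation of `serializer` is not shown above. One way to obtain such an object is the JDK's identity Transformer with its output method set to html. The sketch below is an assumption about that setup, not the code from this project. `DomToHtml` and its methods are hypothetical names, and the class works on any DOM document, so it can be tried without POI.

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;

/** Hypothetical helper: serializes a DOM tree as HTML using only the JDK. */
final class DomToHtml {

    /** Creates an empty DOM document, the kind a converter builds its output into. */
    static Document newDocument() throws ParserConfigurationException {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
    }

    /** Writes the DOM tree to a string with the "html" output method. */
    static String serialize(Document document) throws TransformerException {
        // A Transformer created without a stylesheet copies its input to its output
        Transformer serializer = TransformerFactory.newInstance().newTransformer();
        serializer.setOutputProperty(OutputKeys.METHOD, "html");
        serializer.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
        StringWriter out = new StringWriter();
        serializer.transform(new DOMSource(document), new StreamResult(out));
        return out.toString();
    }
}
```

To write to a file instead of a string, pass a StreamResult that wraps a File or an OutputStream, as in the call above.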

Extract Image

To extract the images, we have to register a callback that the converter invokes whenever it handles an image in the doc. The converter library provides an interface for this, PicturesManager:

converter.setPicturesManager(new HtmlPicturesManager(outDir.toString()));


public class HtmlPicturesManager implements PicturesManager {
    // ...

    HtmlPicturesManager(String baseDir) {
        this.baseDir = baseDir;
    }

    @Override
    public String savePicture(byte[] content, PictureType pictureType, String name, float widthInches, float heightInches) {
        // ...
        return name;
    }

}
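
The body of savePicture is elided above. As a rough sketch of what such a callback typically has to do, the class below writes the bytes under a base directory and returns the name to use in the img src attribute. It uses only the JDK so that it runs without POI. `PictureFileStore` and `save` are hypothetical names, and wrapping IOException in an unchecked exception is an assumption on our side, chosen because the savePicture signature shown above declares no checked exceptions.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

/** Hypothetical helper: stores picture bytes next to the generated HTML file. */
final class PictureFileStore {
    private final Path baseDir;

    PictureFileStore(String baseDir) {
        this.baseDir = Paths.get(baseDir);
    }

    /** Writes content to baseDir/name and returns the relative name for the img src. */
    String save(byte[] content, String name) {
        try {
            Files.createDirectories(baseDir);
            // Keep only the last path segment so the name cannot point outside baseDir
            Path fileName = Paths.get(name).getFileName();
            Files.write(baseDir.resolve(fileName), content);
            return fileName.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Returning a name relative to the output directory keeps the HTML portable, as long as the HTML file and the pictures are moved together.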

Embed Image

By contrast, to embed images we can override a method of the original converter:

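import java.util.Base64;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.apache.poi.hwpf.converter.WordToHtmlConverter;
import org.apache.poi.hwpf.usermodel.Picture;
import org.w3c.dom.Element;

public class EmbedImgHtmlConverter extends WordToHtmlConverter {

    EmbedImgHtmlConverter() throws ParserConfigurationException {
        // WordToHtmlConverter has no no-arg constructor; pass it the DOM document to build into
        super(DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument());
    }

    @Override
    protected void processImageWithoutPicturesManager(Element currentBlock, boolean inlined, Picture picture) {
        // Create the img tag
        Element imgNode = currentBlock.getOwnerDocument().createElement("img");
        // Build the src value: the prefix followed by the Base64-encoded picture bytes
        StringBuilder sb = new StringBuilder(picture.getSize() + PoiEmbedImgHtmlConverter.EMBED_IMG_SRC_PREFIEX.length())
                .append(PoiEmbedImgHtmlConverter.EMBED_IMG_SRC_PREFIEX)
                .append(Base64.getEncoder().encodeToString(picture.getRawContent()));
        imgNode.setAttribute("src", sb.toString());
        currentBlock.appendChild(imgNode);
    }
}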
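
The value of EMBED_IMG_SRC_PREFIEX is not shown in this post; presumably it is a data: URI prefix such as `data:image/png;base64,`. Under that assumption, here is a minimal JDK-only sketch of how the src value is built. `DataUri` is a hypothetical helper, not part of POI.

```java
import java.util.Base64;

/** Hypothetical helper: builds a data: URI that can be used as an img src value. */
final class DataUri {

    /** Returns "data:<mimeType>;base64,<encoded content>". */
    static String of(String mimeType, byte[] content) {
        String prefix = "data:" + mimeType + ";base64,";
        String encoded = Base64.getEncoder().encodeToString(content);
        // Size the builder from the encoded length, which is larger than the raw byte count
        return new StringBuilder(prefix.length() + encoded.length())
                .append(prefix)
                .append(encoded)
                .toString();
    }
}
```

For example, `DataUri.of("image/png", new byte[] {1, 2, 3})` returns `data:image/png;base64,AQID`. Base64 output is roughly one third larger than the raw bytes, so a builder sized from the raw picture size will have to grow once.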

Convert Docx to Html

The interface for converting docx is cleaner.
We first read the file in:

XWPFDocument document = new XWPFDocument(in);

Then we customize the output options:

XHTMLOptions options = XHTMLOptions.create();

Finally, we convert the file:

XHTMLConverter.getInstance().convert(document, out, options);

Extract Image

To extract images, we set an extractor on the options:

options.setExtractor(extractor);

The converter library (xdocreport) provides a utility class, FileImageExtractor, to assist with this, but it fixes the location of the output image files. We want to change where the image files go, so we also have to set an image URI resolver, as the following source code from the library shows:

// img/@src
String src = pictureData.getFileName();
if ( StringUtils.isNotEmpty( src ) )
{
    src = resolver.resolve( WORD_MEDIA + src );
    attributes = SAXHelper.addAttrValue( attributes, SRC_ATTR, src );
}

Our implementation:

options.URIResolver(extractor);

public class ExtractorAndResolver extends FileImageExtractor implements IURIResolver {
    public ExtractorAndResolver(File baseDir) {
        super(baseDir);
    }

    @Override
    public void extract(String imagePath, byte[] imageData) throws IOException {
        super.extract(Paths.get(imagePath).getFileName().toString(), imageData);
    }

    @Override
    public String resolve(String uri) {
        return Paths.get(uri).getFileName().toString();
    }
}
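
Both overrides apply the same mapping: they drop the directory part of the path (the prefix the library adds via WORD_MEDIA, presumably something like `word/media/`) and keep only the file name, so the file written to disk and the src attribute in the HTML stay in sync. A small JDK-only sketch of that mapping follows; `MediaPaths.flatten` is a hypothetical name.

```java
import java.nio.file.Paths;

/** Hypothetical helper: the path mapping shared by extract and resolve. */
final class MediaPaths {

    /** Strips any directory part and returns only the last path segment. */
    static String flatten(String uri) {
        return Paths.get(uri).getFileName().toString();
    }
}
```

For example, `MediaPaths.flatten("word/media/image1.png")` returns `image1.png`. If only one of the two methods flattened the path, the HTML would point at a location where no image was written.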

Embed Image

Because the original implementation separates extracting the image from setting the img src attribute in the HTML, we have to combine the two steps to embed the image:

Original source code:

IImageExtractor extractor = getImageExtractor();
if ( extractor != null )
{
    XWPFPictureData pictureData = getPictureData( picture );
    if ( pictureData != null )
    {
        extractor.extract( WORD_MEDIA + pictureData.getFileName(), pictureData.getData() );
    }
}
// visit the picture and set image attributes
visitPicture( picture, offsetX, relativeFromH, offsetY, relativeFromV, wrapText,
              parentContainer );

Our implementation:

public class EmbedImgResolver extends FileImageExtractor implements IURIResolver {
    // ...
    @Override
    public void extract(String imagePath, byte[] imageData) throws IOException {
        this.picture = imageData;
    }

    @Override
    public String resolve(String uri) {
        StringBuilder sb = new StringBuilder(picture.length + PoiEmbedImgHtmlConverter.EMBED_IMG_SRC_PREFIEX.length())
                .append(PoiEmbedImgHtmlConverter.EMBED_IMG_SRC_PREFIEX)
                .append(Base64.getEncoder().encodeToString(picture));
        return sb.toString();
    }
}
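
This approach relies on the call order in the library code shown above: extract is called first, and the src attribute is resolved afterwards while the picture is visited. The JDK-only sketch below mimics that hand-off with a hypothetical stand-in class (`LastImageHolder`, not part of POI or xdocreport) and adds a guard for the case where resolve is called without a preceding extract. The src prefix is passed in because its actual value is not shown in this post.

```java
import java.util.Base64;

/** Hypothetical stand-in for a combined extractor/resolver that embeds images. */
final class LastImageHolder {
    private final String srcPrefix; // e.g. "data:image/png;base64," (assumed value)
    private byte[] picture;         // bytes handed over by the most recent extract call

    LastImageHolder(String srcPrefix) {
        this.srcPrefix = srcPrefix;
    }

    /** Remembers the image bytes instead of writing them to disk. */
    void extract(String imagePath, byte[] imageData) {
        this.picture = imageData;
    }

    /** Turns the remembered bytes into the src value; the uri itself is ignored. */
    String resolve(String uri) {
        if (picture == null) {
            throw new IllegalStateException("resolve called before extract for " + uri);
        }
        return srcPrefix + Base64.getEncoder().encodeToString(picture);
    }
}
```

Because the holder keeps only the most recent image, one instance should not be shared between conversions that run concurrently.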

The complete code example can be found here.

Because I spent a lot of time figuring out how to implement these utility classes, I decided to add some classes to the original source code and open some pull requests. Once they are merged, we can simply use them in a future release, as the wiki of the repo suggests. If you need them now, you can find the source here.
