
Office Format Conversion (2): Implementation

In the previous post on office format conversion, we discussed the design of the conversion feature. We decided to convert office documents to HTML and to implement two versions: one that embeds images in the HTML and one that extracts them into separate files.

In this post, we go through the implementation details and show runnable code examples.

Library Chosen

After some searching, we found that Apache POI has a relatively large number of questions on Stack Overflow, which suggests that its user community is large and active. So we decided to use Apache POI for the conversion.
Besides POI, we found xdocreport, which builds on POI and adds utility classes with more convenient and powerful interfaces for this job, so we include that library as well.

Pitfalls

One thing to note: our code did not compile against POI 3.14, so we had to use the newer 3.15.

maven dependency

doc vs docx

As stated previously, doc and docx are two different formats, so the POI library uses two different abstractions and two sets of interfaces to manipulate them.
The HWPF abstraction is for doc and XWPF is for docx.

Whilst HWPF and XWPF provide similar features, there is not a common interface across the two of them at this time.
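
Since the two formats share no common interface, the calling code has to pick HWPF or XWPF itself. Here is a minimal sketch of one way to decide, assuming we sniff the first bytes of the file rather than trust its extension: legacy doc files are OLE2 compound documents, while docx files are ZIP packages. The names `WordFormatSketch` and `detect` are hypothetical and not part of POI. The check only identifies the container, so an xls or xlsx file would match as well.

```java
final class WordFormatSketch {

    enum Container { OLE2, OOXML_ZIP, UNKNOWN }

    // Signature of an OLE2 compound document (doc, xls, ppt).
    private static final byte[] OLE2_MAGIC = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };

    // Local file header signature of a ZIP archive (docx, xlsx, pptx): "PK\3\4".
    private static final byte[] ZIP_MAGIC = {0x50, 0x4B, 0x03, 0x04};

    // Classifies a file by the first bytes read from it.
    static Container detect(byte[] header) {
        if (startsWith(header, OLE2_MAGIC)) {
            return Container.OLE2;       // candidate for HWPFDocument
        }
        if (startsWith(header, ZIP_MAGIC)) {
            return Container.OOXML_ZIP;  // candidate for XWPFDocument
        }
        return Container.UNKNOWN;
    }

    private static boolean startsWith(byte[] data, byte[] prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(detect(new byte[] {0x50, 0x4B, 0x03, 0x04}));
    }
}
```

A caller would read the first eight bytes of the upload, call `detect`, and then route the stream to the doc or docx conversion described in the following sections.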

Convert Doc to Html

To convert a doc file to HTML, we first read the document in:

HWPFDocument wordDocument = new HWPFDocument(in);

Then we use a converter, configured with options, to do the conversion:

converter.processDocument(wordDocument);
DOMSource dom = new DOMSource(converter.getDocument());

Finally, we write the DOM out as HTML:

serializer.transform(dom, new StreamResult(outFile(outDir, fileName)));
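
The snippets above do not show how `serializer` is created. A minimal sketch of that last step, assuming it is a plain JDK identity `Transformer` configured for HTML output: the hand-built DOM in `sampleDocument` is only a stand-in for what `converter.getDocument()` returns, and the class and method names are hypothetical.

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

final class DomToHtmlSketch {

    // Builds a tiny DOM tree that stands in for the converter's output document.
    static Document sampleDocument() {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
            Element html = doc.createElement("html");
            Element body = doc.createElement("body");
            Element p = doc.createElement("p");
            p.setTextContent("Hello from a converted document");
            body.appendChild(p);
            html.appendChild(body);
            doc.appendChild(html);
            return doc;
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    // Serializes the DOM as HTML text with an identity transformer.
    static String toHtml(Document doc) {
        try {
            Transformer serializer = TransformerFactory.newInstance().newTransformer();
            serializer.setOutputProperty(OutputKeys.METHOD, "html");
            serializer.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
            StringWriter out = new StringWriter();
            serializer.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(toHtml(sampleDocument()));
    }
}
```

Writing to a file instead of a string only changes the `StreamResult` target, as in the `outFile(outDir, fileName)` call above.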

Extract Image

To extract the images, we have to register a callback that the converter invokes whenever it handles an image in the doc. The converter library provides the PicturesManager interface for this:

converter.setPicturesManager(new HtmlPicturesManager(outDir.toString()));


public class HtmlPicturesManager implements PicturesManager {
    // ...

    HtmlPicturesManager(String baseDir) {
        this.baseDir = baseDir;
    }

    @Override
    public String savePicture(byte[] content, PictureType pictureType, String name, float widthInches, float heightInches) {
        // ...
        return name;
    }

}
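
The body of `savePicture` is elided above. As an illustration only, here is a JDK-only sketch of what such a callback typically has to do: write the bytes under the base directory and return the string that ends up in the img src attribute. `PictureSaverSketch` and `save` are hypothetical names, and this is not the author's implementation.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

final class PictureSaverSketch {

    private final Path baseDir;

    PictureSaverSketch(String baseDir) {
        this.baseDir = Paths.get(baseDir);
    }

    // Mirrors the role of savePicture: persist the image, return the value for img/@src.
    String save(byte[] content, String name) {
        try {
            Files.createDirectories(baseDir);
            // Keep only the last path segment so the file always lands inside baseDir.
            Path target = baseDir.resolve(Paths.get(name).getFileName());
            Files.write(target, content);
            // A bare file name is a valid relative src when the HTML is written to baseDir too.
            return target.getFileName().toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        PictureSaverSketch saver = new PictureSaverSketch(System.getProperty("java.io.tmpdir"));
        System.out.println(saver.save(new byte[] {1, 2, 3}, "0.png"));
    }
}
```

Returning a name relative to the output directory keeps the generated HTML portable, as long as the HTML file and the images are moved together.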

Embed Image

By contrast, to embed the images we can subclass the original converter and override its image handling:

public class EmbedImgHtmlConverter extends WordToHtmlConverter {

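    EmbedImgHtmlConverter() throws ParserConfigurationException {
        // WordToHtmlConverter has no no-arg constructor; give it an empty DOM document to fill
        super(DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument());
    }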

    @Override
    protected void processImageWithoutPicturesManager(Element currentBlock, boolean inlined, Picture picture) {
        Element imgNode = currentBlock.getOwnerDocument().createElement("img"); // create the img tag
        StringBuilder sb = new StringBuilder(picture.getSize() + PoiEmbedImgHtmlConverter.EMBED_IMG_SRC_PREFIEX.length())
                .append(PoiEmbedImgHtmlConverter.EMBED_IMG_SRC_PREFIEX)
                .append(Base64.getEncoder().encodeToString(
                (picture.getRawContent())));
        imgNode.setAttribute("src", sb.toString());
        currentBlock.appendChild(imgNode);
    }
}
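
The value of `EMBED_IMG_SRC_PREFIEX` is not shown in this post. A minimal sketch of the embedding idea, assuming the prefix is a standard data URI header such as `data:image/png;base64,`: the helper below (a hypothetical name) builds the full src value from a MIME type and the raw bytes using only the JDK.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

final class DataUriSketch {

    // Builds "data:<mimeType>;base64,<payload>" for use as an img src value.
    static String toDataUri(String mimeType, byte[] content) {
        String prefix = "data:" + mimeType + ";base64,";
        // Base64 emits 4 characters for every 3 input bytes, rounded up.
        int encodedLength = 4 * ((content.length + 2) / 3);
        return new StringBuilder(prefix.length() + encodedLength)
                .append(prefix)
                .append(Base64.getEncoder().encodeToString(content))
                .toString();
    }

    public static void main(String[] args) {
        // prints data:text/plain;base64,aGk=
        System.out.println(toDataUri("text/plain", "hi".getBytes(StandardCharsets.UTF_8)));
    }
}
```

Note that the encoded text is about a third longer than the raw bytes, so sizing the `StringBuilder` from the raw picture size alone undershoots. That only costs a buffer resize and does not affect correctness.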

Convert Docx to Html

The interface for converting docx is cleaner.
We first read the file in:

XWPFDocument document = new XWPFDocument(in);

Then we create the output options, which can be customized:

XHTMLOptions options = XHTMLOptions.create();

Finally, we convert the file:

XHTMLConverter.getInstance().convert(document, out, options);

Extract Image

To extract the images, we set an extractor on the options:

options.setExtractor(extractor);

The converter library provides a utility class, FileImageExtractor, to assist with this, but it fixes the location of the output image files. We want to change where the image files go, so we also have to set an image URI resolver, because the converter builds the img src attribute as the following source code shows:

// img/@src
String src = pictureData.getFileName();
if ( StringUtils.isNotEmpty( src ) )
{
    src = resolver.resolve( WORD_MEDIA + src );
    attributes = SAXHelper.addAttrValue( attributes, SRC_ATTR, src );
}

Our implementation:

options.URIResolver(extractor);

public class ExtractorAndResolver extends FileImageExtractor implements IURIResolver {
    public ExtractorAndResolver(File baseDir) {
        super(baseDir);
    }

    @Override
    public void extract(String imagePath, byte[] imageData) throws IOException {
        super.extract(Paths.get(imagePath).getFileName().toString(), imageData);
    }

    @Override
    public String resolve(String uri) {
        return Paths.get(uri).getFileName().toString();
    }
}
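
Both overrides apply the same path flattening, which is what keeps the file on disk and the src attribute in agreement. A small JDK-only sketch of that logic, assuming `WORD_MEDIA` is `"word/media/"` (its value is not shown in this post); the class and method names are hypothetical.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

final class MediaPathSketch {

    // Assumed value of the converter's WORD_MEDIA constant.
    static final String WORD_MEDIA = "word/media/";

    // Where the overridden extract(...) writes: baseDir/<file name>, without word/media.
    static Path extractTarget(Path baseDir, String imagePath) {
        return baseDir.resolve(Paths.get(imagePath).getFileName());
    }

    // What the overridden resolve(...) returns for img/@src: just the file name.
    static String resolvedSrc(String uri) {
        return Paths.get(uri).getFileName().toString();
    }

    public static void main(String[] args) {
        String imagePath = WORD_MEDIA + "image1.png";
        System.out.println(extractTarget(Paths.get("out"), imagePath));
        System.out.println(resolvedSrc(imagePath));
    }
}
```

If only one of the two methods were overridden, the HTML would point at a path where no image was written.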

Embed Image

Because the original implementation separates extracting the image from setting the img src attribute in the HTML, we have to combine the two steps in order to embed the image:

Original source code

IImageExtractor extractor = getImageExtractor();
if ( extractor != null )
{
    XWPFPictureData pictureData = getPictureData( picture );
    if ( pictureData != null )
    {
        extractor.extract( WORD_MEDIA + pictureData.getFileName(), pictureData.getData() );
    }
}
// visit the picture and set image attributes
visitPicture( picture, offsetX, relativeFromH, offsetY, relativeFromV, wrapText,
              parentContainer );

Our implementation:

public class EmbedImgResolver extends FileImageExtractor implements IURIResolver {
    // ...
    @Override
    public void extract(String imagePath, byte[] imageData) throws IOException {
        this.picture = imageData;
    }

    @Override
    public String resolve(String uri) {
        StringBuilder sb = new StringBuilder(picture.length + PoiEmbedImgHtmlConverter.EMBED_IMG_SRC_PREFIEX.length())
                .append(PoiEmbedImgHtmlConverter.EMBED_IMG_SRC_PREFIEX)
                .append(Base64.getEncoder().encodeToString(picture));
        return sb.toString();
    }
}
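
The class above keeps a single `picture` field, so it relies on `resolve` being called for the same image that was just passed to `extract`. A hedged variant, not the author's code, keys the bytes by path instead: the quoted library source passes `WORD_MEDIA + fileName` to both calls, so the same string can serve as the lookup key. The sketch uses only the JDK and hypothetical names, mirrors just the method shapes of the two interfaces, and assumes a PNG data URI prefix.

```java
import java.util.Base64;
import java.util.HashMap;
import java.util.Map;

final class KeyedEmbedSketch {

    // Hypothetical prefix; the post's EMBED_IMG_SRC_PREFIEX value is not shown.
    private static final String PREFIX = "data:image/png;base64,";

    private final Map<String, byte[]> pictures = new HashMap<>();

    // Same shape as the extractor callback: remember the bytes instead of writing a file.
    void extract(String imagePath, byte[] imageData) {
        pictures.put(imagePath, imageData);
    }

    // Same shape as the resolver callback: look the bytes up by the same path.
    String resolve(String uri) {
        byte[] data = pictures.get(uri);
        if (data == null) {
            return uri; // nothing was extracted under this path, leave the src untouched
        }
        return PREFIX + Base64.getEncoder().encodeToString(data);
    }

    public static void main(String[] args) {
        KeyedEmbedSketch embed = new KeyedEmbedSketch();
        embed.extract("word/media/a.png", new byte[] {1, 2, 3});
        System.out.println(embed.resolve("word/media/a.png"));
    }
}
```

The trade-off is memory: every image stays in the map until the converter instance is discarded, whereas the single-field version holds only the most recent one.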

The complete code example can be found here.

More

Because I spent a lot of time figuring out how to implement these utility classes, I decided to add some classes to the original source code and open pull requests. Once they are released, we can use them directly, as the repo's wiki suggests. If you need them now, you can find the source here.

