Office Format Conversion (1): Design

Motivation

Our application supports uploading office files to the server and downloading them. Because office formats are relatively popular, we decided to explore ways to support previewing them.

The first thing to do was, of course, to search. We found a question showing that office files can be previewed through the Office 365 viewer or Google Docs: those sites fetch the URL we give them and convert the document on the fly for display. Because of privacy and security concerns, we decided not to take this route.

Browsers have no native support for rendering office files on the front end, so we have to convert them to another format, such as pdf or html.

Challenges

To perform the conversion we need to handle several challenges. The first thing to notice is that the office format has two major versions.

Version

The early office format (‘doc’) is binary: if you open it in an ordinary text editor, the content is unreadable and may be displayed with errors (some of the binary content cannot be represented in the editor’s encoding, so the editor fails to display it).
The newer format (‘docx’), on the other hand, is based on xml and is more open.

This means we need two code paths, one per format, whether they come from a library or are written by hand.
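Choosing between the two code paths requires knowing which format a file really is, and the file extension can be wrong. The following is a minimal sketch, not part of the original design, that inspects the first bytes of the file instead. The class and method names (`OfficeFormatSniffer`, `detect`) are hypothetical. It relies on two well-known signatures: legacy binary office files start with the OLE2 compound file header, and the newer xml-based files are ZIP packages.

```java
final class OfficeFormatSniffer {
    enum Kind { LEGACY_BINARY, OOXML_ZIP, UNKNOWN }

    // OLE2 compound file signature found at the start of legacy .doc files.
    private static final byte[] OLE2 = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };

    // ZIP local file header signature "PK\3\4"; a .docx file is a ZIP package.
    private static final byte[] ZIP = { 0x50, 0x4B, 0x03, 0x04 };

    // head: the first bytes of the uploaded file (8 bytes are enough).
    static Kind detect(byte[] head) {
        if (startsWith(head, OLE2)) return Kind.LEGACY_BINARY;
        // Note: this only proves the file is a ZIP archive, not that it is a
        // valid docx; a stricter check would also inspect the entries.
        if (startsWith(head, ZIP)) return Kind.OOXML_ZIP;
        return Kind.UNKNOWN;
    }

    private static boolean startsWith(byte[] data, byte[] prefix) {
        if (data.length < prefix.length) return false;
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) return false;
        }
        return true;
    }
}
```

A caller would read the first 8 bytes of the upload, pass them to `detect`, and dispatch to the matching converter.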

Performance

Preview performance is important because it directly affects the user experience of our application, and some documents may be relatively large. Performance is influenced by three main parts:

conversion
transport of file
render

The output format can be:

pdf: converted file is too large; needs a specific js lib to render
image: converted file is large; rendering is easy
html: may be smaller than the original file; rendering is easy, and loading is easy to optimize (e.g. lazy-loading images)

Weighing each format’s advantages and disadvantages, we decided to convert documents into html and display that.

Functionality

A document file is actually a collection of parts: it contains text, images, tables, etc. Depending on how images are handled, we considered the following options.

Embed Image

We can embed images into the html page using base64:

simple, but the images are not easy to manipulate
no need to track the mapping between a document and its images
the file becomes larger [1]
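As an illustration of the embedding approach, here is a minimal sketch that turns raw image bytes into an `<img>` tag with a `data:` URI. The class and method names (`ImageEmbedder`, `toImgTag`) are hypothetical, and the sketch assumes the caller already knows the image's MIME type. It uses the basic encoder from `java.util.Base64`, which emits no line breaks.

```java
import java.util.Base64;

final class ImageEmbedder {
    // Builds an <img> tag whose src carries the image itself as Base64 text,
    // so the html page needs no separate image file.
    static String toImgTag(byte[] imageBytes, String mimeType) {
        String encoded = Base64.getEncoder().encodeToString(imageBytes);
        return "<img src=\"data:" + mimeType + ";base64," + encoded + "\">";
    }
}
```

For a png extracted from a document, the call would look like `ImageEmbedder.toImgTag(pngBytes, "image/png")`, where `pngBytes` is whatever the converter produced for that image.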

Base64

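With plain Base64, an encoded image is about 1.33 times (4/3) the size of the original picture; once MIME line separators are added, the ratio is about 1.37. The encoded length is computed as follows: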

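// srclen is the input length in bytes; len is the encoded length in characters
int len = 0;
if (doPadding) {
    // padded: every started 3-byte group yields 4 characters
    len = 4 * ((srclen + 2) / 3);
} else {
    // unpadded: n leftover bytes yield n + 1 characters
    int n = srclen % 3;
    len = 4 * (srclen / 3) + (n == 0 ? 0 : n + 1);
}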

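We will use the MIME base64 encoder, whose output is slightly larger still because it limits the line length and inserts a line separator between lines (in java.util.Base64, at most 76 characters per line, separated by CRLF).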
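The size overhead can be checked directly with `java.util.Base64`. The sketch below measures the encoded length for the padded, unpadded, and MIME encoders. The class and method names (`Base64SizeDemo`, `basicLength`, and so on) are hypothetical helpers introduced only for this illustration.

```java
import java.util.Base64;

final class Base64SizeDemo {
    // Encoded length with padding: 4 characters per started 3-byte group.
    static int basicLength(int srclen) {
        return Base64.getEncoder().encode(new byte[srclen]).length;
    }

    // Encoded length without the trailing '=' padding characters.
    static int unpaddedLength(int srclen) {
        return Base64.getEncoder().withoutPadding().encode(new byte[srclen]).length;
    }

    // MIME encoder: same characters plus a CRLF between 76-character lines.
    static int mimeLength(int srclen) {
        return Base64.getMimeEncoder().encode(new byte[srclen]).length;
    }

    public static void main(String[] args) {
        int srclen = 3000;
        System.out.println("basic:    " + basicLength(srclen));
        System.out.println("unpadded: " + unpaddedLength(srclen));
        System.out.println("mime:     " + mimeLength(srclen));
    }
}
```

For a 3000-byte input, the basic encoder produces 4000 bytes (ratio 1.33) and the MIME encoder produces 4104 bytes (ratio about 1.37), the extra 104 bytes being 52 CRLF separators.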

Extract Image

The other way is to extract the images, which is more flexible:

load the html and images separately, and lazy-load images to speed up the preview
compress images if necessary
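For the newer format, extraction can be sketched with the standard library alone, because a docx file is a ZIP package that stores its pictures as entries under `word/media/`. The following is a minimal sketch under that assumption, not the converter used here; the class and method names (`DocxImageExtractor`, `extractImages`) are hypothetical, and it does not handle the legacy binary format.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

final class DocxImageExtractor {
    // Returns the images of a docx package, keyed by entry name
    // (e.g. "word/media/image1.png"), in the order they appear in the archive.
    static Map<String, byte[]> extractImages(InputStream docx) throws IOException {
        Map<String, byte[]> images = new LinkedHashMap<>();
        try (ZipInputStream zip = new ZipInputStream(docx)) {
            byte[] buf = new byte[8192];
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                // Skip everything that is not an embedded media file.
                if (entry.isDirectory() || !entry.getName().startsWith("word/media/")) {
                    continue;
                }
                ByteArrayOutputStream out = new ByteArrayOutputStream();
                int n;
                while ((n = zip.read(buf)) != -1) {
                    out.write(buf, 0, n);
                }
                images.put(entry.getName(), out.toByteArray());
            }
        }
        return images;
    }
}
```

The returned map gives the server what it needs to store each image separately and serve it on demand, which is what makes lazy loading and later compression possible.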

Conclusion

Taking these aspects into account, we decided to:

convert documents into html
implement two versions:
- version one: keep the html and images separate, i.e. do not embed images into the html
- version two: embed the images, which is easy to store and load

Ref

[1] Base64 encodes every 6 bits into one character that occupies 8 bits, so the output takes 8/6 the space of the input. For example, “Man” is three characters occupying 3*8 = 24 bits; each group of 6 bits becomes one character, which results in “TWFu”. This also means the input is processed in groups of 3 bytes, and padding is added when its length is not a multiple of 3.
