Motivation
Our application supports uploading Office files to the server and downloading them. Because Office formats are relatively popular, we decided to experiment with previewing them in the application.
The first thing to do was, of course, to search. We found a question showing that Office files can be previewed through the Office 365 viewer or Google Docs: those sites fetch the URL we feed them and convert the document on the fly for display. Because of privacy and security concerns, we decided not to use them.
There is no native support for rendering Office files in the front end, i.e. in the browser, so we have to convert them to another format, such as PDF or HTML.
Challenges
To perform the conversion, we need to handle several challenges. The first thing to notice is that the Office format has two major versions.
Version
The early Office format (‘doc’) is binary, which means that if you open it in a text editor, the content is not readable and may be displayed with errors (because some of the binary content cannot be represented in the chosen encoding).
The newer format (‘docx’), on the other hand, is based on XML (packaged in a ZIP archive) and is more open.
This means we need two code paths, whether provided by a library or written by hand.
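Before picking a code path, the server has to know which of the two formats it received, and the file extension is not always trustworthy. The following is a minimal sketch, not our actual implementation: the class `OfficeFormatSniffer` and its `Format` enum are hypothetical names. It relies on two well-known signatures: legacy binary files start with the OLE2 compound file header, and docx files start with the ZIP local file header. Note that a ZIP signature only proves the file is a ZIP archive, so a real check would also look inside the package.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

/** Hypothetical helper: guesses the container format from the leading bytes. */
final class OfficeFormatSniffer {

    enum Format { LEGACY_BINARY, OPEN_XML, UNKNOWN }

    // OLE2 compound file signature, shared by legacy doc/xls/ppt files.
    private static final byte[] OLE2_MAGIC = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };

    // ZIP local file header signature; docx/xlsx/pptx are ZIP archives.
    private static final byte[] ZIP_MAGIC = { 0x50, 0x4B, 0x03, 0x04 };

    /** Classifies a file from its first bytes (8 bytes are enough). */
    static Format sniff(byte[] head) {
        if (startsWith(head, OLE2_MAGIC)) return Format.LEGACY_BINARY;
        if (startsWith(head, ZIP_MAGIC)) return Format.OPEN_XML;
        return Format.UNKNOWN;
    }

    /** Convenience overload that reads the first 8 bytes of a file (Java 9+). */
    static Format sniff(Path file) throws IOException {
        byte[] head = new byte[8];
        try (InputStream in = Files.newInputStream(file)) {
            int n = in.readNBytes(head, 0, head.length);
            return sniff(Arrays.copyOf(head, n));
        }
    }

    private static boolean startsWith(byte[] data, byte[] prefix) {
        if (data.length < prefix.length) return false;
        return Arrays.equals(Arrays.copyOf(data, prefix.length), prefix);
    }
}
```

The result can then be used to dispatch to the doc or docx converter, and `UNKNOWN` can be rejected before any conversion work starts.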
Performance
Preview performance is important because it directly affects the user experience of our application, and some documents may be relatively large. Performance is influenced by three main parts:
- conversion
- file transfer
- rendering
The output format can be:
- PDF: the converted file is too large; needs a dedicated JS library to render
- image: the converted file is large; rendering is easy
- HTML: may be smaller than the original file; rendering is easy, and loading is easy to optimize (e.g. lazy-loading images)
Weighing the advantages and disadvantages of each format, we decided to convert documents into HTML and display that.
Functionality
A document file is actually a collection of parts: it contains text, images, tables, etc. Depending on how images are handled, we have the following options.
Embed Image
We can embed images into the HTML page using Base64:
- simple, but the images are not easy to manipulate
- no need to remember the mapping between a document and its images
- the file becomes larger [1]
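As an illustration of the embedding approach, here is a minimal sketch that turns image bytes into an `<img>` tag with a data URI. The class `ImageInliner` and its method are hypothetical names, and the MIME type is assumed to be known by the caller. The sketch uses the basic encoder so that the URI stays on a single line; that is a choice made for this example only.

```java
import java.util.Base64;

/** Hypothetical helper: inlines an image into HTML as a Base64 data URI. */
final class ImageInliner {

    /**
     * Builds an <img> tag whose src carries the image bytes themselves,
     * so the HTML page needs no separate image request.
     */
    static String toImgTag(byte[] imageBytes, String mimeType) {
        // Basic encoder: no line separators are inserted into the output.
        String encoded = Base64.getEncoder().encodeToString(imageBytes);
        return "<img src=\"data:" + mimeType + ";base64," + encoded + "\">";
    }
}
```

For example, the three bytes of "Man" with the MIME type `image/png` produce `<img src="data:image/png;base64,TWFu">`.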
Base64
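An embedded image becomes about 1.33 times larger than the original (about 1.37 times once MIME line breaks are included), because the encoded length is computed like this: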
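int len = 0;
if (doPadding) {
    // With padding: every started 3-byte group yields 4 characters.
    len = 4 * ((srclen + 2) / 3);
} else {
    // Without padding: a trailing group of n bytes yields only n + 1 characters.
    int n = srclen % 3;
    len = 4 * (srclen / 3) + (n == 0 ? 0 : n + 1);
}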
We use the MIME Base64 encoder, whose output is larger still because it limits each line to 76 characters and inserts a line separator between lines.
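To make the overhead concrete, the sketch below measures the encoded size of a buffer with both the basic and the MIME encoder from `java.util.Base64`. The class name `Base64Overhead` and the 3000-byte sample size are assumptions made for this example only.

```java
import java.util.Base64;

/** Hypothetical helper: measures Base64 output size for a given input size. */
final class Base64Overhead {

    /** Encoded length with the basic encoder (padding, no line breaks). */
    static int basicLength(int srcLen) {
        return Base64.getEncoder().encode(new byte[srcLen]).length;
    }

    /** Encoded length with the MIME encoder (76-char lines joined by CRLF). */
    static int mimeLength(int srcLen) {
        return Base64.getMimeEncoder().encode(new byte[srcLen]).length;
    }

    public static void main(String[] args) {
        int srcLen = 3000; // assumed sample size
        int basic = basicLength(srcLen);
        int mime = mimeLength(srcLen);
        System.out.println("basic: " + basic + " bytes, ratio " + (double) basic / srcLen);
        System.out.println("mime:  " + mime + " bytes, ratio " + (double) mime / srcLen);
    }
}
```

For 3000 input bytes the basic encoder produces 4000 bytes (ratio 4/3), and the MIME encoder produces 4104 bytes: the same 4000 characters plus 52 two-byte line separators, which gives the ratio of about 1.37.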
Extract Image
The other way is to extract the images, which is more flexible:
- load the HTML and the images separately, and lazy-load images to speed up the preview
- compress images if necessary
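For the docx case, extraction can be sketched with the standard library alone, because a docx file is a ZIP archive and Word typically stores embedded images under `word/media/`. The class `DocxImageExtractor` is a hypothetical name, and this sketch does not cover the legacy binary doc format, which needs a dedicated parser.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

/** Hypothetical helper: pulls embedded images out of a docx package. */
final class DocxImageExtractor {

    // Assumption: images live under this folder, as Word typically writes them.
    private static final String MEDIA_PREFIX = "word/media/";

    /** Returns image file name -> image bytes, in archive order. */
    static Map<String, byte[]> extract(InputStream docx) throws IOException {
        Map<String, byte[]> images = new LinkedHashMap<>();
        try (ZipInputStream zip = new ZipInputStream(docx)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                String name = entry.getName();
                if (entry.isDirectory() || !name.startsWith(MEDIA_PREFIX)) {
                    continue; // skip document XML, styles, relationships, etc.
                }
                // Reads until the end of the current entry, not of the archive.
                ByteArrayOutputStream out = new ByteArrayOutputStream();
                byte[] buf = new byte[8192];
                int n;
                while ((n = zip.read(buf)) != -1) {
                    out.write(buf, 0, n);
                }
                images.put(name.substring(MEDIA_PREFIX.length()), out.toByteArray());
            }
        }
        return images;
    }
}
```

The returned map can be written to storage next to the converted HTML, and each image can be compressed or served lazily on its own.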
Conclusion
Taking these aspects into account, we decided to:
- convert documents into HTML
- implement two versions:
  - version one: keep the HTML and images separate, i.e. do not embed images in the HTML
  - version two: embed the images, which makes the result easy to store and load
Ref
[1] Base64 encodes every 6 bits of input as one character that occupies 8 bits, so the output grows by a factor of 8/6. For example, “Man” is three bytes (3 × 8 = 24 bits); each group of 6 bits becomes one character, which results in “TWFu”. This also means the input is processed in 3-byte groups, and padding is added when its length is not a multiple of 3.