Office Format Conversion (1): Design

Motivation

Our application supports uploading office files to the server and downloading them again. Office formats are relatively popular, so we also wanted to support previewing them, and we decided to run some trials.

The first thing to do is, of course, searching. We found a question showing that office files can be previewed through the Office365 viewer or Google Docs: these websites fetch the url we feed them and convert the document on the fly to display it. Considering the privacy and security problems of handing our documents to a third party, we decided not to do this.

We know there is no native support for rendering office files in the front end, i.e. in the browser, so we have to convert them to other formats, like pdf or html.

Challenges

To perform the conversion, we need to handle several challenges. The first thing to notice is that the office format has two major versions.

Version

The early office format (‘doc’) is binary, which means that if you open it in an ordinary text editor, the content is not readable and may come with errors (some of the binary content can’t be decoded in the editor’s encoding, so the editor fails to display it).
The newer format (‘docx’), on the other hand, is based on xml (packaged in a zip archive) and is more open.

This means we need two code paths, one per format, either taken from a library or written by hand.
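Before choosing a code path, the two formats can usually be told apart by their first bytes. The following is a minimal sketch, not the detection we actually shipped: the class and method names are made up for illustration, and it relies on the well-known OLE2 compound file signature (doc) and the zip local file header signature (docx). A zip signature only proves the file is a zip archive, so a real check would also look inside the package.

```java
import java.util.Arrays;

/** Sketch: tell legacy binary office files from zip-based ones by their leading bytes. */
class OfficeFormatSniffer {
    enum Kind { LEGACY_BINARY, OOXML_ZIP, UNKNOWN }

    // OLE2 compound file signature, used by doc/xls/ppt.
    private static final byte[] OLE2_MAGIC = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };

    // Zip local file header signature ("PK\3\4"), used by docx/xlsx/pptx.
    private static final byte[] ZIP_MAGIC = { 0x50, 0x4B, 0x03, 0x04 };

    /** Classifies a file from its first few bytes (hypothetical helper). */
    static Kind sniff(byte[] head) {
        if (startsWith(head, OLE2_MAGIC)) return Kind.LEGACY_BINARY;
        if (startsWith(head, ZIP_MAGIC)) return Kind.OOXML_ZIP;
        return Kind.UNKNOWN;
    }

    private static boolean startsWith(byte[] data, byte[] prefix) {
        return data.length >= prefix.length
            && Arrays.equals(Arrays.copyOf(data, prefix.length), prefix);
    }

    public static void main(String[] args) {
        byte[] zipHead = { 0x50, 0x4B, 0x03, 0x04, 0x14, 0x00 };
        System.out.println(sniff(zipHead));
    }
}
```

Sniffing the content instead of trusting the file extension protects against renamed uploads, which is why the sketch ignores the file name entirely.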

Performance

Preview performance is important because it directly affects the user experience of our application, and some documents may be relatively large. Performance is influenced by three main parts:

  • conversion
  • transport of the file
  • rendering

The output format can be:

  • pdf: the converted file is too large; needs a specific js lib to render
  • image: the converted file is large; rendering is easy
  • html: may be smaller than the original file; rendering html is easy, and so is optimizing the loading (e.g. lazy loading images)

Considering each format’s advantages and disadvantages, we decided to convert documents into html and show that.
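As an illustration of the lazy loading optimization mentioned for html output, the converted page could be post-processed so that every image carries the browser's native `loading="lazy"` attribute. This is only a rough sketch under assumptions: `LazyImages` and `addLazyLoading` are hypothetical names, and the regex is a crude stand-in for a real html parser (it can be fooled by unusual attribute values).

```java
/** Sketch: mark every <img> in converted html for native lazy loading. */
class LazyImages {
    // Matches "<img" only when the tag has no loading attribute yet.
    private static final String IMG_WITHOUT_LOADING = "(?i)<img\\b(?![^>]*\\bloading\\s*=)";

    /** Adds loading="lazy" right after the tag name, keeping the original casing. */
    static String addLazyLoading(String html) {
        return html.replaceAll(IMG_WITHOUT_LOADING, "$0 loading=\"lazy\"");
    }

    public static void main(String[] args) {
        System.out.println(addLazyLoading("<p><img src=\"a.png\"></p>"));
    }
}
```

Tags that already declare a `loading` attribute are left alone, so the step can be run more than once on the same page.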

Functionality

A document file is actually a collection of files: it contains text, images, tables, etc. The options below differ in how they handle images.

Embed Image

We can embed images into the html page using base64:

  • simple, but the images are not easy to manipulate
  • no need to remember the mapping between a document and its images
  • the file becomes larger [1]
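
Embedding boils down to putting the encoded bytes into a `data:` url. Here is a minimal sketch: `InlineImage` and `toImgTag` are names invented for this example, the mime type is passed in by the caller, and the basic `java.util.Base64` encoder is used only to keep the attribute on one line.

```java
import java.util.Base64;

/** Sketch: turn raw image bytes into an <img> tag with an inline data url. */
class InlineImage {
    /** Builds the tag; mimeType is supplied by the caller, e.g. "image/png". */
    static String toImgTag(byte[] imageBytes, String mimeType) {
        String encoded = Base64.getEncoder().encodeToString(imageBytes);
        return "<img src=\"data:" + mimeType + ";base64," + encoded + "\">";
    }

    public static void main(String[] args) {
        // Three placeholder bytes stand in for real image data.
        byte[] placeholder = { 'M', 'a', 'n' };
        System.out.println(toImgTag(placeholder, "image/png"));
        // prints <img src="data:image/png;base64,TWFu">
    }
}
```

Because the image travels inside the html string, there is nothing else to store or look up, which is the "no mapping" advantage from the list above.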

Base64

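A base64 encoded image is 4/3 (about 1.33) times the size of the original picture, before any line separators are counted, because the encoded length `len` is computed from the source length `srclen` like this: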

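int len = 0;
if (doPadding) {
    // with '=' padding: every started group of 3 bytes yields 4 chars
    len = 4 * ((srclen + 2) / 3);
} else {
    // without padding: a trailing group of n bytes yields only n + 1 chars
    int n = srclen % 3;
    len = 4 * (srclen / 3) + (n == 0 ? 0 : n + 1);
}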

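We will use the MIME base64 encoder, whose output is slightly larger still: it limits lines to 76 characters and inserts a line separator between them, which brings the total to about 1.37 times the original size.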
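The overhead is easy to measure with the two encoders in `java.util.Base64`. This is a small sketch for checking the numbers rather than part of the converter; the class name and the 3000-byte input are arbitrary choices for the example.

```java
import java.util.Base64;

/** Sketch: compare the output size of the basic and the MIME base64 encoders. */
class Base64Overhead {
    /** Encoded size without line separators. */
    static int basicLength(byte[] src) {
        return Base64.getEncoder().encode(src).length;
    }

    /** Encoded size with a CRLF between 76-character lines. */
    static int mimeLength(byte[] src) {
        return Base64.getMimeEncoder().encode(src).length;
    }

    public static void main(String[] args) {
        byte[] src = new byte[3000]; // arbitrary payload, content does not matter
        int basic = basicLength(src);
        int mime = mimeLength(src);
        System.out.println("basic: " + basic + " bytes, ratio " + (double) basic / src.length);
        System.out.println("mime:  " + mime + " bytes, ratio " + (double) mime / src.length);
    }
}
```

For the 3000-byte input, the basic encoder produces 4000 bytes and the MIME encoder 4104 bytes, i.e. ratios of roughly 1.33 and 1.37.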

Extract Image

The other way is to extract the images, which is more flexible:

  • load html and images separately, and lazy load images to speed up the preview
  • compress images if necessary
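For docx, extraction can be sketched with nothing but `java.util.zip`, because the package is a zip archive and embedded pictures are conventionally stored under `word/media/`. Treat this as an assumption-laden sketch: `DocxImageExtractor` is a hypothetical class, it handles neither the binary doc format nor images stored elsewhere in the package, and a conversion library may already do this step for you.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

/** Sketch: pull the embedded images out of a docx package. */
class DocxImageExtractor {
    // Conventional location of embedded media inside a docx archive.
    static final String MEDIA_PREFIX = "word/media/";

    /** Returns image file name -> image bytes, in archive order. */
    static Map<String, byte[]> extractImages(InputStream docx) throws IOException {
        Map<String, byte[]> images = new LinkedHashMap<>();
        try (ZipInputStream zip = new ZipInputStream(docx)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                String name = entry.getName();
                if (entry.isDirectory() || !name.startsWith(MEDIA_PREFIX)) {
                    continue; // skip xml parts and anything outside the media folder
                }
                images.put(name.substring(MEDIA_PREFIX.length()), readEntry(zip));
            }
        }
        return images;
    }

    // Reads the current zip entry to its end; read() returns -1 at the entry boundary.
    private static byte[] readEntry(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[8192];
        int n;
        while ((n = in.read(buffer)) != -1) {
            out.write(buffer, 0, n);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java DocxImageExtractor <file.docx>");
            return;
        }
        try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
            extractImages(in).forEach((name, bytes) ->
                System.out.println(name + ": " + bytes.length + " bytes"));
        }
    }
}
```

Keeping the images as separate byte arrays is what makes the later steps possible: each one can be compressed, stored under its own url, and loaded lazily by the preview page.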

Conclusion

Taking these aspects into account, we decided to:

  • convert office documents into html
  • implement two versions:
    • version one: separate html and images, i.e. do not embed images into the html
    • version two: embed images, which is easy to store and load


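  1. Base64 encodes every 6 bits of input as one character that occupies 8 bits, which means it wastes some space (the output is 8/6 = 4/3 the size of the input). For example, when encoding “Man”, the three chars occupy 3*8 = 24 bits, and every 6 bits become one char, which results in “TWFu”. It also means the input is processed in groups of 3 bytes, so input whose length is not a multiple of 3 has to be padded.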
