
Elasticsearch MySQL Sync Challenge (3): Impl

Tony and his leader have discussed many different ways to sync data from MySQL to Elasticsearch, from a simple write split (adding more code to send two copies of the data to different persistence layers) and bulk loading from MySQL using Logstash, to a more sophisticated event-driven binlog sync.

Why Remake

Having decided to use the binlog sync method, Tony was asked to choose a suitable tool to finish the job. After doing some research, Tony decided to build one himself using a binlog connector, because:

  • The complex search requirement needs some extra joins on the data, which current tools do not support;
  • Incremental sync and the ability to resume: this way the tool can restart from any binlog position;
  • The possibility to extend this sync tool to other data stores, like Redis, Hadoop, etc.;
  • Wildcard database support, to reduce the work caused by horizontal splits;

Pipe and Filter

“Considering this tool is a data hub which does only a little mapping/join work but a lot of transferring work, I think the pipe-and-filter design, as Logstash uses, is very suitable as this tool’s architecture. This tool will have three main parts,” Tony reported.

  • The input module handles how to get the data from MySQL;
  • The filter module filters the data we are interested in and processes it;
  • The output module is responsible for sending the data to Elasticsearch;
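
A minimal sketch of how the three modules could be wired together under pipe and filter (SyncEvent, Input, Filter, Output and Pipeline are hypothetical names for illustration, not the tool’s actual API):

```java
import java.util.List;
import java.util.Map;

// One binlog event flowing through the pipe, modeled as a mutable field map.
class SyncEvent {
    String table;
    Map<String, Object> fields;
}

interface Input  { SyncEvent next() throws InterruptedException; }
interface Filter { SyncEvent apply(SyncEvent event); }  // may drop an event by returning null
interface Output { void send(SyncEvent event); }

class Pipeline {
    private final Input input;
    private final List<Filter> filters;
    private final Output output;

    Pipeline(Input input, List<Filter> filters, Output output) {
        this.input = input;
        this.filters = filters;
        this.output = output;
    }

    void run() throws InterruptedException {
        while (true) {
            SyncEvent event = input.next();   // blocks until the next binlog event arrives
            for (Filter f : filters) {
                event = f.apply(event);
                if (event == null) break;     // filtered out, never reaches the output
            }
            if (event != null) output.send(event);
        }
    }
}
```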

“In order to ensure the flexibility of the sync tool, we need to externalize the application-specific configuration. For example, we need to keep the MySQL address and port info outside of the program, so we don’t need to change the code, or even restart the tool,” Leader reminded.

Config

"Yes, we need one more config module. Actually, we have two kinds of configurations: the first is for a single sync job, the second is for the tool itself. The config for sync job will be somewhat complicated because our requirement: we will need config multiple remote MySQL server, config multiple filter operations to rename field, to do some transformation, config multiple ways to send message to ES.

“I think we can use a yml file as the config file because Spring Boot has a very convenient way¹ to load and interpret it, i.e. bind the config file to classes/objects, which saves us much time,” Tony added.

“Fine, this is a reasonable choice,” Leader said.
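
For illustration, a sync-job config and its Spring Boot binding might look like the sketch below; the property names and the InputConfig class are invented here, not the tool’s actual schema:

```yaml
# sync-job.yml -- hypothetical layout for one sync job
input:
  host: 127.0.0.1
  port: 3306
  user: repl
  password: secret
```

```java
import org.springframework.boot.context.properties.ConfigurationProperties;

// Spring Boot binds the "input" section of the yml file onto this class.
@ConfigurationProperties(prefix = "input")
public class InputConfig {
    private String host;
    private int port;
    private String user;
    private String password;

    // getters/setters are required so Spring Boot can populate the fields
    public String getHost() { return host; }
    public void setHost(String host) { this.host = host; }
    public int getPort() { return port; }
    public void setPort(int port) { this.port = port; }
    public String getUser() { return user; }
    public void setUser(String user) { this.user = user; }
    public String getPassword() { return password; }
    public void setPassword(String password) { this.password = password; }
}
```

Note that the class still has to be registered, e.g. with @EnableConfigurationProperties, before the binding takes effect.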

Input

"In input module, we have to connect to MySQL using the client server protocol. We can register ourselves as a slave2 of master, which will send the binlog event to us as stream. I have found a binlog connector library, what we need to do is write our listener and customize our configurations.

“We need to define the master address and port, and the schemas/tables/columns that we are interested in,” Tony said.
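
The post does not name the connector, but one widely used option is shyiko’s mysql-binlog-connector-java; a rough sketch with it (the host, credentials, server id and binlog position below are placeholders):

```java
import com.github.shyiko.mysql.binlog.BinaryLogClient;
import com.github.shyiko.mysql.binlog.event.EventType;

public class BinlogInput {
    public static void main(String[] args) throws Exception {
        // Connect to the master as if we were a replication slave.
        BinaryLogClient client = new BinaryLogClient("127.0.0.1", 3306, "repl", "secret");
        client.setServerId(1001);             // must be unique in the replication group (see footnote 2)

        // Resume from a previously saved binlog position instead of the beginning.
        client.setBinlogFilename("mysql-bin.000042");
        client.setBinlogPosition(4);

        client.registerEventListener(event -> {
            EventType type = event.getHeader().getEventType();
            // WRITE_ROWS/UPDATE_ROWS/DELETE_ROWS events carry the row changes we care about.
            System.out.println(type + ": " + event.getData());
        });
        client.connect();                     // blocks and streams binlog events
    }
}
```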

Filter

"Filter module is much more complicated. We need to support some common operations to update the event content, like rename field name, remove some field, add some common field. We also need to support some kind of control flow, from switch, if, to for to handle different tables.

"We enrich the functionality of this part in order to reduce the work of output module. Otherwise, the output module have to customize the mapping of fields between MySQL and ES.

Output

“For output, we can go two ways. Elasticsearch supports two ways to index: one is the HTTP REST interface, the other is Java’s native API, which defines its own network protocol to communicate with the ES server. The HTTP way is more general and may be reused in the future for other destinations, but it is more wasteful than the native protocol ES defines, considering network usage and time spent,” Tony added.
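
As a rough sketch of the HTTP REST way, indexing one document is a single PUT request; the index name, document id and endpoint below are placeholders:

```java
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class EsHttpOutput {
    // PUT http://<host>:9200/<index>/<type>/<id> with a JSON body indexes one document.
    static int indexDocument(String id, String json) throws Exception {
        URL url = new URL("http://127.0.0.1:9200/person/_doc/" + id);
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("PUT");
        conn.setRequestProperty("Content-Type", "application/json");
        conn.setDoOutput(true);
        try (OutputStream out = conn.getOutputStream()) {
            out.write(json.getBytes(StandardCharsets.UTF_8));
        }
        return conn.getResponseCode();   // 200 or 201 on success
    }

    public static void main(String[] args) throws Exception {
        System.out.println(indexDocument("1", "{\"name\":\"tony\"}"));
    }
}
```

The native way would instead use the client that ships with Elasticsearch, trading generality for a more compact wire protocol, as Tony notes.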

Ref



  1. For more details, refer to this blog.

  2. Every server in MySQL has an id (in the range from 1 to 2^32 − 1). This value MUST be unique across the whole replication group (that is, different from any other server id being used by any master or slave). Keep in mind that each binary log client should be treated as a simplified slave and thus MUST also use a different server id.
