
Elasticsearch MySQL Sync Challenge (3): Impl

Tony and his leader have discussed many different ways to sync data from MySQL to Elasticsearch, from a simple write split (adding more code to send two copies of the data to different persistence layers) and bulk loading from MySQL using Logstash, to a more sophisticated event-driven binlog sync.

Why Remake

Having decided to use the binlog sync method, Tony was asked to choose a suitable tool to finish the job. After doing some research, Tony decided to build one himself using a binlog connector, because:

  • The complex search requirement needs some extra joins on the data, which current tools do not support;
  • Incremental sync and the ability to resume: this way the tool can restart from any binlog position;
  • The possibility to extend this sync tool to other data stores, like Redis, Hadoop, etc.;
  • Wildcard database support, to reduce the work caused by horizontal splits;

Pipe and Filter

“Considering this tool is a data hub which does only a little mapping/join work but a lot of transferring work, I think the pipe-and-filter design, as Logstash uses, is very suitable as this tool’s architecture. This tool will have three main parts,” Tony reported.

  • The input module handles how to get the data from MySQL;
  • The filter module filters the data we are interested in and processes it;
  • The output module is responsible for sending the data to Elasticsearch;
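
A minimal sketch of how the three modules could be wired together under pipe and filter (SyncEvent, Input, Filter, Output and Pipeline are hypothetical names for illustration, not the tool’s actual API):

```java
import java.util.List;
import java.util.Map;

// One binlog event flowing through the pipe, modeled as a mutable field map.
class SyncEvent {
    String table;
    Map<String, Object> fields;
}

interface Input  { SyncEvent next() throws InterruptedException; }
interface Filter { SyncEvent apply(SyncEvent event); }  // may drop an event by returning null
interface Output { void send(SyncEvent event); }

class Pipeline {
    private final Input input;
    private final List<Filter> filters;
    private final Output output;

    Pipeline(Input input, List<Filter> filters, Output output) {
        this.input = input;
        this.filters = filters;
        this.output = output;
    }

    void run() throws InterruptedException {
        while (true) {
            SyncEvent event = input.next();   // blocks until the next binlog event arrives
            for (Filter f : filters) {
                event = f.apply(event);
                if (event == null) break;     // filtered out, never reaches the output
            }
            if (event != null) output.send(event);
        }
    }
}
```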

“In order to ensure the flexibility of the sync tool, we need to externalize the application-specific configuration. For example, we need to keep the MySQL address and port info outside of the program, so we don’t need to change the code, or even restart the tool,” Leader reminded.

Config

"Yes, we need one more config module. Actually, we have two kinds of configurations: the first is for a single sync job, the second is for the tool itself. The config for sync job will be somewhat complicated because our requirement: we will need config multiple remote MySQL server, config multiple filter operations to rename field, to do some transformation, config multiple ways to send message to ES.

“I think we can use a yml file as the config file because Spring Boot has a very convenient way¹ to load and interpret it, i.e. bind the config file to classes/objects, which saves us much time,” Tony added.

“Fine, this is a reasonable choice,” Leader said.
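
For illustration, a sync-job config and its Spring Boot binding might look like the sketch below; the property names and the InputConfig class are invented here, not the tool’s actual schema:

```yaml
# sync-job.yml -- hypothetical layout for one sync job
input:
  host: 127.0.0.1
  port: 3306
  user: repl
  password: secret
```

```java
import org.springframework.boot.context.properties.ConfigurationProperties;

// Spring Boot binds the "input" section of the yml file onto this class.
@ConfigurationProperties(prefix = "input")
public class InputConfig {
    private String host;
    private int port;
    private String user;
    private String password;

    // getters/setters are required so Spring Boot can populate the fields
    public String getHost() { return host; }
    public void setHost(String host) { this.host = host; }
    public int getPort() { return port; }
    public void setPort(int port) { this.port = port; }
    public String getUser() { return user; }
    public void setUser(String user) { this.user = user; }
    public String getPassword() { return password; }
    public void setPassword(String password) { this.password = password; }
}
```

Note that the class still has to be registered, e.g. with @EnableConfigurationProperties, before the binding takes effect.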

Input

"In input module, we have to connect to MySQL using the client server protocol. We can register ourselves as a slave2 of master, which will send the binlog event to us as stream. I have found a binlog connector library, what we need to do is write our listener and customize our configurations.

“We need to define the master address and port, and the schemas/tables/columns that we are interested in,” Tony said.
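
The post does not name the connector, but one widely used option is shyiko’s mysql-binlog-connector-java; a rough sketch with it (the host, credentials, server id and binlog position below are placeholders):

```java
import com.github.shyiko.mysql.binlog.BinaryLogClient;
import com.github.shyiko.mysql.binlog.event.EventType;

public class BinlogInput {
    public static void main(String[] args) throws Exception {
        // Connect to the master as if we were a replication slave.
        BinaryLogClient client = new BinaryLogClient("127.0.0.1", 3306, "repl", "secret");
        client.setServerId(1001);             // must be unique in the replication group (see footnote 2)

        // Resume from a previously saved binlog position instead of the beginning.
        client.setBinlogFilename("mysql-bin.000042");
        client.setBinlogPosition(4);

        client.registerEventListener(event -> {
            EventType type = event.getHeader().getEventType();
            // WRITE_ROWS/UPDATE_ROWS/DELETE_ROWS events carry the row changes we care about.
            System.out.println(type + ": " + event.getData());
        });
        client.connect();                     // blocks and streams binlog events
    }
}
```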

Filter

"Filter module is much more complicated. We need to support some common operations to update the event content, like rename field name, remove some field, add some common field. We also need to support some kind of control flow, from switch, if, to for to handle different tables.

"We enrich the functionality of this part in order to reduce the work of output module. Otherwise, the output module have to customize the mapping of fields between MySQL and ES.

Output

“For output, we can go two ways. Elasticsearch supports two ways to index: one is the HTTP REST interface, the other is Java’s native API, which defines its own network protocol to communicate with the ES server. The HTTP way is more general and may be reused in the future for other destinations, but it is more wasteful than the native protocol ES defines, considering network usage and time spent,” Tony added.
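
As a rough sketch of the HTTP REST way, indexing one document is a single PUT request; the index name, document id and endpoint below are placeholders:

```java
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class EsHttpOutput {
    // PUT http://<host>:9200/<index>/<type>/<id> with a JSON body indexes one document.
    static int indexDocument(String id, String json) throws Exception {
        URL url = new URL("http://127.0.0.1:9200/person/_doc/" + id);
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("PUT");
        conn.setRequestProperty("Content-Type", "application/json");
        conn.setDoOutput(true);
        try (OutputStream out = conn.getOutputStream()) {
            out.write(json.getBytes(StandardCharsets.UTF_8));
        }
        return conn.getResponseCode();   // 200 or 201 on success
    }

    public static void main(String[] args) throws Exception {
        System.out.println(indexDocument("1", "{\"name\":\"tony\"}"));
    }
}
```

The native way would instead use the client that ships with Elasticsearch, trading generality for a more compact wire protocol, as Tony notes.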

Ref



  1. For more details, refer to this blog.

  2. Every server in MySQL has an id (in the range from 1 to 2^32 − 1). This value MUST be unique across the whole replication group (that is, different from any other server id being used by any master or slave). Keep in mind that each binary log client should be treated as a simplified slave and thus MUST also use a different server id.
