Elasticsearch MySQL Sync Challenge (1)

Today, Tony was called into the team leader’s office.

The leader said, “You have done a great job finishing the basic search engine implementation recently. Now we have another very challenging task for you: to sync the data between the database and ES. Do you have any ideas?”

Tony said, “Em, for now, I can only come up with the idea of writing some code in the business layer that writes another copy of the data to ES.”

The leader replied, “This is one way, but there are other methods that deserve to be dived into. You can do some research and we will talk about it later. But remember our requirements for this sync job:”

Aims

  • Updated asynchronously - The user’s database request (or search request) should be delayed as little as possible.
  • Eventually consistent - While it can lag behind slightly, serving stale results indefinitely isn’t an option.
  • Easy to rebuild - Updates can be lost before reaching Elasticsearch, and Elasticsearch itself is known to lose data under network partitions.

Then, Tony started his research on data synchronization between Elasticsearch and MySQL.


The next day, Tony took his research results to the leader.

Write Split

“The first method is called ‘write split’: add some code to the original code base to send data to Elasticsearch as well. To minimize changes to the business code, we can use AOP to add an aspect around the persistence layer1. Asynchronous updates are easy to achieve this way, but rebuilding seems hard if data is lost,” Tony said.
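To make the idea concrete, here is a minimal sketch of such an aspect with Spring AOP. The PersonRepository.save(..) pointcut, the Person entity, and the PersonEsWriter helper are hypothetical names used for illustration, not part of the original design:

```java
import org.aspectj.lang.annotation.AfterReturning;
import org.aspectj.lang.annotation.Aspect;
import org.springframework.stereotype.Component;

// Hypothetical entity and ES writer; the names are illustrative only.
class Person { Long id; String name; }

interface PersonEsWriter {
    void indexAsync(Person p); // e.g. hands off to a queue drained by an ES indexer
}

@Aspect
@Component
public class EsSyncAspect {
    private final PersonEsWriter esWriter;

    public EsSyncAspect(PersonEsWriter esWriter) {
        this.esWriter = esWriter;
    }

    // Runs after the persistence layer saves successfully, so the business
    // code itself never has to mention Elasticsearch.
    @AfterReturning(
        pointcut = "execution(* com.example.repo.PersonRepository.save(..))",
        returning = "saved")
    public void mirrorToEs(Person saved) {
        esWriter.indexAsync(saved); // asynchronous: the user's request is not delayed
    }
}
```

The aspect keeps the sync logic out of the business code, but as the leader points out next, it still lives inside the business process.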

"Good. But this option has one obvious defeat: this data sync service is not independent. It relies on the status of business system. Furthermore, If we are going to sync data in other place, we have to repeat ourselves rather than reuse. By the way, how to handle the old data already stored in databases?” Leader asked.

Bulk Load

"This is the second way – bulk load. we can select data from database and bulk insert into Elasticsearch. This process is relative simple, we can do it by ourselves or using Logstash.

JDBC Input Plugin

"The Logstash has a plugin called JDBC input plugin. This plugin can execute SQL statement and get result of as input of Logstash, then go through filter and send to Elasticsearch. This plugin can resume last execution by using parameters like sql_last_value, use_column_value, clean_run.2

“But as this blog post said, marking source records is an anti-pattern, and adding a special timestamp or counter to the business table is also intrusive. However, the good news is that our table already has a creation timestamp, which is handy to use,” Tony added.
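Putting these pieces together, a pipeline that tracks that creation timestamp might look like the sketch below; the person table, its columns, the credentials, and the file paths are all hypothetical:

```
input {
  jdbc {
    jdbc_connection_string => "jdbc:mysql://localhost:3306/app"
    jdbc_user => "sync"
    jdbc_password => "secret"
    jdbc_driver_library => "/path/to/mysql-connector-java.jar"
    jdbc_driver_class => "com.mysql.cj.jdbc.Driver"
    schedule => "*/5 * * * *"                     # poll every five minutes
    statement => "SELECT id, name, created_at FROM person
                  WHERE created_at > :sql_last_value"
    use_column_value => true
    tracking_column => "created_at"
    tracking_column_type => "timestamp"
    last_run_metadata_path => "/var/lib/logstash/.person_last_run"
    clean_run => false                            # true would reload from scratch
  }
}

output {
  elasticsearch {
    hosts => ["localhost:9200"]
    index => "person"
    document_id => "%{id}"                        # keep ES ids aligned with MySQL ids
  }
}
```

On each run, sql_last_value holds the largest created_at seen so far, so only new rows are fetched.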

"Great. This sounds like it can solve the initial data sync perfectly. But this way is not suitable for daily sync of search. If we set the rate of execution of this plugin too often, we can’t get the result by search when user just create some entity in our system, which will fails the functionality of search engine. On the other hand, if we can increase the rate of pulling data, this will cost reasonable time/band-width for our database system to return result, which will downgrade its output.

“Is there any way to do this in an event-driven style: when the data changes, we sync it, like a database trigger? Could we receive a change event whenever a user inserts, updates, or deletes data? That way we would get better database performance and better isolation from the business system,” the leader reminded Tony.

“Great idea. I think I need to do more research on event-driven sync before continuing.”

Written with StackEdit.


  1. A common multitier architecture has three layers: a presentation layer, a business layer, and a persistence layer. This situation also demonstrates the value of abstraction – a good abstraction encapsulates information and operations, so we can handle change more gracefully, such as adding more operations in this case. ↩︎

  2. The plugin persists the sql_last_value parameter in a metadata file stored at the configured last_run_metadata_path. Upon query execution, this file is updated with the current value of sql_last_value. The next time the pipeline starts up, the value is restored by reading that file. This also makes it easy to restart the load: if clean_run is set to true, the stored value is ignored and sql_last_value is reset as if no query had ever been executed, that is, to:

    • Jan 1, 1970;
    • or 0 if use_column_value is true and tracking_column is set;
    ↩︎
