Elasticsearch MySQL Sync Challenge (2): Event Driven

Yesterday Tony was asked whether there is some event-driven way to sync data from MySQL to Elasticsearch, so he put more effort into this direction.

Observer Pattern

"As you have suggested, I have dived into more elegant way of sync data – event-driven way, or in design pattern: a observer pattern. I found three different ways to do.

JDBC Logging

“Whenever we use JDBC to connect to a database, JDBC can log the SQL statements we execute. So we can enable this functionality and listen to the log. We can easily detect newly added log entries using the Java NIO watch service1 or other mature tools like Filebeat, but we would need to translate the logged SQL statements into Elasticsearch queries, which is a relatively big piece of work. Considering the effort of writing even a simple SQL parser, I think we can drop this option,” Tony finished.
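For reference, here is a minimal sketch of the log-watching half of this idea with the Java NIO watch service. The log directory and file name are invented for illustration, and the translation step is left as a comment because that is exactly the work Tony wants to avoid:

```java
import java.nio.file.FileSystems;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardWatchEventKinds;
import java.nio.file.WatchEvent;
import java.nio.file.WatchKey;
import java.nio.file.WatchService;

public class JdbcLogWatcher {
    public static void main(String[] args) throws Exception {
        Path logDir = Paths.get("/var/log/myapp");   // hypothetical JDBC log directory
        WatchService watcher = FileSystems.getDefault().newWatchService();
        logDir.register(watcher, StandardWatchEventKinds.ENTRY_MODIFY);

        while (true) {
            WatchKey key = watcher.take();           // blocks until the directory changes
            for (WatchEvent<?> event : key.pollEvents()) {
                Path changed = (Path) event.context();
                if (changed.endsWith("jdbc.log")) {
                    // Read the appended lines, parse each SQL statement and
                    // translate it into an Elasticsearch request -- the SQL
                    // parsing is the "relatively big piece of work" above.
                }
            }
            key.reset();                             // re-arm the key for further events
        }
    }
}
```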

DB Trigger

"The second way works in event-driven model is trigger. The PostgreSQL support notify and listen functionality to notify application when data changed. We make it via the internal notification queue2 of PostgreSQL (which may fails in long transaction) or like following:

  • Create a trigger against any tables that need to be pushed to the search cluster on modification.
  • The trigger calls a function that adds a reference to the staging table, then raises a notification with that reference as the payload.
  • On notification, the client reads the referenced data, pushes it to the search cluster, and then deletes the reference from the staging table. This should be done in a transaction to avoid losing references in case of a crash.
  • On startup the client performs a read/update of any outstanding references from the staging table and then deletes them.
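A rough sketch of the client side of this scheme using the PostgreSQL JDBC driver (pgjdbc). The channel name es_sync, the staging-table payload, and the connection details are all made-up illustrations:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;
import org.postgresql.PGConnection;
import org.postgresql.PGNotification;

public class StagingTableListener {
    public static void main(String[] args) throws Exception {
        Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://localhost:5432/app", "app", "secret");
        PGConnection pgConn = conn.unwrap(PGConnection.class);

        try (Statement st = conn.createStatement()) {
            st.execute("LISTEN es_sync");   // the channel the trigger NOTIFYs on
        }

        while (true) {
            // Poll for NOTIFY payloads, waiting up to 10 seconds per round.
            PGNotification[] notifications = pgConn.getNotifications(10_000);
            if (notifications == null || notifications.length == 0) {
                continue;
            }
            for (PGNotification n : notifications) {
                String ref = n.getParameter();   // staging-table reference from the payload
                // In one transaction: read the referenced row, push it to
                // Elasticsearch, then delete the reference from the staging table.
            }
        }
    }
}
```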

“This approach has many advantages: the update is asynchronous, eventual consistency is ensured, and there is no worry about data loss because the references are persisted. Unfortunately, MySQL does not seem to have any comparable built-in notification function. We could write our own UDF to accomplish something similar, but that does not look easy to do,” Tony added.

Binlog & Dump

"The final way is to use the binlog of MySQL. The binlog of MySQL records the operations MySQL server received from client.

The binary log is a set of log files that contain information about data modifications made to a MySQL server instance.

"The binlog has two functionality: one for data recovery, one for replication. For replication, the binary log is used on master replication servers as a record of the statements to be sent to slave servers. The master server sends the events contained in its binary log to its slaves, which execute those events to make the same data changes that were made on the master. A slave stores events received from the master in its relay log until they can be executed. The relay log has the same format as the binary log.

“We are not going to read the binlog files directly, which is not easy to do and would require understanding MySQL’s file format. Instead, we can follow the MySQL replication protocol: we write a client that registers itself with the MySQL master as a slave and receives binlog events continuously. Once that is done, we can listen for the events, parse them, and send the changed data to Elasticsearch,” Tony said.
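For illustration, here is a minimal sketch of such a pseudo-slave client using the open-source mysql-binlog-connector-java library (one common way to implement this in Java); the host, credentials, and server id are placeholders:

```java
import com.github.shyiko.mysql.binlog.BinaryLogClient;
import com.github.shyiko.mysql.binlog.event.EventType;
import com.github.shyiko.mysql.binlog.event.WriteRowsEventData;

public class BinlogToEs {
    public static void main(String[] args) throws Exception {
        // Registers itself with the master as if it were a replication slave.
        BinaryLogClient client = new BinaryLogClient("localhost", 3306, "repl", "secret");
        client.setServerId(1001);   // must be unique among all slaves of this master

        client.registerEventListener(event -> {
            EventType type = event.getHeader().getEventType();
            if (EventType.isWrite(type)) {
                WriteRowsEventData data = event.getData();
                // Each row is an array of column values; map it to a JSON
                // document and index it into Elasticsearch here.
                data.getRows().forEach(row ->
                        System.out.println("insert with " + row.length + " columns"));
            }
            // UPDATE_ROWS and DELETE_ROWS events are handled similarly.
        });

        client.connect();   // blocks, streaming binlog events continuously
    }
}
```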

“What about the old data that hasn’t been indexed yet?” Leader asked.

“We can use the mysqldump tool to dump the old data and process it in a similar way if needed. In normal operation, we track the binlog position, so even if our client stops for some time, we can catch up from where we left off,” Tony said.
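With the same hypothetical client library as above, catching up is a matter of restoring the last acknowledged binlog coordinates before connecting (mysqldump can record the matching coordinates in the dump via its --master-data option); the file name and position below are placeholders read from our own storage:

```java
// Resume streaming from the last position we persisted, instead of the log tail.
BinaryLogClient client = new BinaryLogClient("localhost", 3306, "repl", "secret");
client.setBinlogFilename("mysql-bin.000042");   // last acknowledged binlog file
client.setBinlogPosition(154L);                 // last acknowledged offset within it
client.connect();
```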

Tools:

  1. MySQL to ES in go
  2. MySQL to ES in Java

More

Change Schema

"One more thing is very problematic is how to change schema of Elasticsearch. Due to the internal mechanism of Elasticsearch, it have to reindex data in case of change of schema. In other word, we have to change all the old data. If we need to add new field, we have to re-read from MySQL.

“If we just need to change the mapping, we can do it within Elasticsearch. Considering that we can’t stop the service while changing the schema, we need to create a new index with the new mapping, then read data from the old index and reindex it into that secondary index,” Tony said.
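A sketch of that step using Elasticsearch’s _reindex API through the high-level REST client, assuming a 7.x client; the index names products_v1 and products_v2 are invented for illustration:

```java
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestHighLevelClient;
import org.elasticsearch.index.reindex.BulkByScrollResponse;
import org.elasticsearch.index.reindex.ReindexRequest;

public class ReindexDemo {
    static void reindex(RestHighLevelClient client) throws Exception {
        ReindexRequest request = new ReindexRequest();
        request.setSourceIndices("products_v1");   // old index with the old mapping
        request.setDestIndex("products_v2");       // new index created with the new mapping
        BulkByScrollResponse response = client.reindex(request, RequestOptions.DEFAULT);
        System.out.println("reindexed docs: " + response.getTotal());
    }
}
```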

“How do we handle requests for new changes while the reindexing is not finished yet?” Leader asked.

"Yes, we also have to deal with any modifications that happen during the reindexing process. Any changes made to records after they have been already reindexed would not be reflected in the new index since we’re still using the old index for all CRUD operations. To avoid that, we decided to dual write the data into two indices simultaneously during reindexing to ensure that both indices have the correct data while still reading from the primary one.

“You mean you will switch the index once the process has completed. But how? Will you rename the old index and the new index at the same time?”

“Yes, we do it in a similar way, using an alias. An alias sits on top of an index, and we point our requests at the alias instead of at the index directly. This gives us an extra layer of abstraction and the flexibility to quickly ‘rename’ an index on the fly. Once the reindexing process has completed, we point the alias at the secondary index. In other words, we should refer to indices through aliases in normal usage, in case we ever need to roll the index3,” Tony said.
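The switch itself can be done atomically in a single aliases update, sketched below with the same hypothetical names:

```java
import org.elasticsearch.action.admin.indices.alias.IndicesAliasesRequest;
import org.elasticsearch.action.admin.indices.alias.IndicesAliasesRequest.AliasActions;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestHighLevelClient;

public class AliasSwitch {
    // Atomically repoint the "products" alias from the old index to the new
    // one, so clients never see a moment without a searchable index.
    static void switchAlias(RestHighLevelClient client) throws Exception {
        IndicesAliasesRequest request = new IndicesAliasesRequest();
        request.addAliasAction(AliasActions.remove().index("products_v1").alias("products"));
        request.addAliasAction(AliasActions.add().index("products_v2").alias("products"));
        client.indices().updateAliases(request, RequestOptions.DEFAULT);
    }
}
```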

One For All

“The process of syncing data is somewhat painful,” Tony sighed.

“Yes, so people are wondering whether we can do all of this in one single data store. Apache CarbonData is one of the projects aiming to combine full-scan queries, small-scan queries, OLAP, and so on,” Leader said.

Postscript

Considering the requirements of data synchronization between MySQL and Elasticsearch, Tony and his leader think syncing via the binlog is the better solution, because:

  • Updated asynchronously - the user’s DB requests and search requests see almost no extra delay, because we rely on MySQL’s internal master-slave mechanism, which has little impact on DB requests, and use a cluster with multiple replicas to keep search requests served;
  • Eventually consistent - this is ensured by the MySQL replication protocol;
  • Easy to rebuild - if some data is lost, we can resume syncing from a specific binlog position to rebuild it.

Ref

Written with StackEdit.


  1. For details, we can refer to this tutorial from Oracle. ↩︎

  2. For details of notifications, refer to the PostgreSQL documentation. ↩︎

  3. For more details on alias creation and updates, refer to indices aliases and rollover index. ↩︎
