Elasticsearch MySQL Sync Challenge (1)

Today, Tony was called into the team leader’s office.

The leader said, “You have done a great job finishing the basic search engine implementation recently. Now we have another very challenging task for you: to sync the data between the database and ES. Do you have any ideas?”

Tony said, “Em, for now, I can only come up with the idea of writing some code in the business layer that writes another copy of the data to ES.”

The leader replied, “This is one way, but there are other methods that deserve to be dived into. You can do some research and we will talk about it later. But remember our requirements for this sync job:”

Aims

  • Updated asynchronously - The user’s database request (or search request) should be delayed as little as possible.
  • Eventually consistent - While it can lag behind slightly, serving stale results indefinitely isn’t an option.
  • Easy to rebuild - Updates can be lost before reaching Elasticsearch, and Elasticsearch itself is known to lose data under network partitions.

Then, Tony started his research on data synchronization between Elasticsearch and MySQL.


The next day, Tony took his research results to the leader.

Write Split

“The first method is called ‘write split’: add some code to the original code base to send data to Elasticsearch as well. To minimize changes to the business code, we can use AOP to add an aspect around the persistence layer1. Asynchronous updates are easy to achieve this way, but rebuilding seems hard if data is lost,” Tony said.
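To make the idea concrete, here is a minimal sketch of such an aspect with Spring AOP. The PersonRepository.save(..) pointcut, the Person entity, and the PersonEsWriter helper are hypothetical names used for illustration, not part of the original design:

```java
import org.aspectj.lang.annotation.AfterReturning;
import org.aspectj.lang.annotation.Aspect;
import org.springframework.stereotype.Component;

// Hypothetical entity and ES writer; the names are illustrative only.
class Person { Long id; String name; }

interface PersonEsWriter {
    void indexAsync(Person p); // e.g. hands off to a queue drained by an ES indexer
}

@Aspect
@Component
public class EsSyncAspect {
    private final PersonEsWriter esWriter;

    public EsSyncAspect(PersonEsWriter esWriter) {
        this.esWriter = esWriter;
    }

    // Runs after the persistence layer saves successfully, so the business
    // code itself never has to mention Elasticsearch.
    @AfterReturning(
        pointcut = "execution(* com.example.repo.PersonRepository.save(..))",
        returning = "saved")
    public void mirrorToEs(Person saved) {
        esWriter.indexAsync(saved); // asynchronous: the user's request is not delayed
    }
}
```

The aspect keeps the sync logic out of the business code, but as the leader points out next, it still lives inside the business process.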

"Good. But this option has one obvious defeat: this data sync service is not independent. It relies on the status of business system. Furthermore, If we are going to sync data in other place, we have to repeat ourselves rather than reuse. By the way, how to handle the old data already stored in databases?” Leader asked.

Bulk Load

"This is the second way – bulk load. we can select data from database and bulk insert into Elasticsearch. This process is relative simple, we can do it by ourselves or using Logstash.

JDBC Input Plugin

"The Logstash has a plugin called JDBC input plugin. This plugin can execute SQL statement and get result of as input of Logstash, then go through filter and send to Elasticsearch. This plugin can resume last execution by using parameters like sql_last_value, use_column_value, clean_run.2

“But as this blog post said, marking source records is an anti-pattern, and adding a special timestamp or counter to the business table is also intrusive. However, the good news is that our table already has a creation timestamp, which is handy to use,” Tony added.
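Putting these pieces together, a pipeline that tracks that creation timestamp might look like the sketch below; the person table, its columns, the credentials, and the file paths are all hypothetical:

```
input {
  jdbc {
    jdbc_connection_string => "jdbc:mysql://localhost:3306/app"
    jdbc_user => "sync"
    jdbc_password => "secret"
    jdbc_driver_library => "/path/to/mysql-connector-java.jar"
    jdbc_driver_class => "com.mysql.cj.jdbc.Driver"
    schedule => "*/5 * * * *"                     # poll every five minutes
    statement => "SELECT id, name, created_at FROM person
                  WHERE created_at > :sql_last_value"
    use_column_value => true
    tracking_column => "created_at"
    tracking_column_type => "timestamp"
    last_run_metadata_path => "/var/lib/logstash/.person_last_run"
    clean_run => false                            # true would reload from scratch
  }
}

output {
  elasticsearch {
    hosts => ["localhost:9200"]
    index => "person"
    document_id => "%{id}"                        # keep ES ids aligned with MySQL ids
  }
}
```

On each run, sql_last_value holds the largest created_at seen so far, so only new rows are fetched.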

"Great. This sounds like it can solve the initial data sync perfectly. But this way is not suitable for daily sync of search. If we set the rate of execution of this plugin too often, we can’t get the result by search when user just create some entity in our system, which will fails the functionality of search engine. On the other hand, if we can increase the rate of pulling data, this will cost reasonable time/band-width for our database system to return result, which will downgrade its output.

“Is there any way to do this in an event-driven style: when the data changes, we sync it, like a database trigger? Could we receive a change event whenever a user inserts, updates, or deletes data? That way we would get better database performance and better isolation from the business system,” the leader reminded Tony.

“Great idea. I think I need to do more research on event-driven sync before continuing.”

Written with StackEdit.


  1. A common multitier architecture has three layers: a presentation layer, a business layer, and a persistence layer. This situation also demonstrates the value of abstraction – a good abstraction encapsulates information and operations, so we can handle change more gracefully, such as adding more operations in this case. ↩︎

  2. The plugin persists the sql_last_value parameter in a metadata file stored at the configured last_run_metadata_path. Upon query execution, this file is updated with the current value of sql_last_value. The next time the pipeline starts up, the value is restored by reading that file. This also makes it easy to restart the load: if clean_run is set to true, the stored value is ignored and sql_last_value is reset as if no query had ever been executed, that is, to:

    • Jan 1, 1970;
    • or 0 if use_column_value is true and tracking_column is set;
    ↩︎
