Elasticsearch: Join and SubQuery

Tony was bothered by a recent change to the search engine requirements: the team wanted SQL-like joins in Elasticsearch!

“They are crazy! How can they think like that? Don't they understand that Elasticsearch is a kind of NoSQL[1] store in which every index should be independent and self-contained? That way, every index can work and scale on its own without considering other indexes, which boosts performance. Following this design principle, Elasticsearch offers little support for joins,” Tony thought after listening to their requirements.

The team leader noticed Tony's unwillingness and said, “Maybe it is hard to do, but the requirement is reasonable. We need to search for a person by their friends, don't we? What's more, the harder it is to implement, the more you can learn from it, right?”

Tony thought the leader's words made sense, so he set out to work on the implementation.

Application-Side Join

“The first implementation method is an application-side join,” Tony said to himself. “Before we index data into Elasticsearch, we join the data we need into one complete entity, like this.” He drew an illustration of the transformation.

id  name  friendId
1   tom   2
2   tim   1

select t1.id, t1.name, t2.name as friend from user t1, user t2 where t1.friendId = t2.id;

{
	"id": 1,
	"name": "tom",
	"friend": "tim"
},
{
	"id": 2,
	"name": "tim",
	"friend": "tom"
}

"The main advantage is that the data is already denormalized before it reaches Elasticsearch, so ES needs to do no extra work. However, the disadvantages are also obvious. The first is that we have to run extra queries to join documents at index time (in our example, a self-join on the user table), which carries a performance penalty.

"Even if this penalty is acceptable, the second drawback is harder to solve. We need to sync changes manually whenever the table used in the join is updated. Because a document in Elasticsearch is immutable, an update means deleting the old document and adding a new one. In other words, this approach is only suitable when the entity used in the join (the user in this example) has a small number of documents and, preferably, seldom changes. This allows the application to cache the results, avoid running the first query often, and avoid reindexing," Tony thought.
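The join-before-index step Tony describes can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the rows are the two users from the table above, and `denormalize` and `docs_to_reindex` are hypothetical helper names, not part of any Elasticsearch client.

```python
# Rows from the `user` table in the example above.
users = [
    {"id": 1, "name": "tom", "friendId": 2},
    {"id": 2, "name": "tim", "friendId": 1},
]


def denormalize(rows):
    """Join each user with the friend's name before indexing (hypothetical helper)."""
    name_by_id = {row["id"]: row["name"] for row in rows}
    docs = []
    for row in rows:
        docs.append({
            "id": row["id"],
            "name": row["name"],
            # Resolve the foreign key on the application side.
            "friend": name_by_id.get(row["friendId"]),
        })
    return docs


def docs_to_reindex(rows, changed_user_id):
    """Ids of documents that embed data from the changed user (hypothetical helper)."""
    dependents = {row["id"] for row in rows if row["friendId"] == changed_user_id}
    return sorted({changed_user_id} | dependents)


docs = denormalize(users)
```

The second helper hints at the sync problem: renaming one user forces a reindex of that user's document and of every document that embeds the old name.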

Nested Object

The advantage of data denormalization is speed. Because each document contains all of the information that is required to determine whether it matches the query, there is no need for expensive joins.

Tony read this paragraph in the documentation and thought, "Of course there is a speed gain when reading, but we pay a performance penalty when writing. Anyway, how do we denormalize the data?"

“If we just want to include the other joined fields at the top level of the document, there is no further problem. If we want a hierarchy, such as a room that has chairs, we need to do more: use nested objects, create a nested mapping, and handle the corresponding inserts, updates, and deletes.” With that, Tony finished the first method, the application-side join.
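Why a room with chairs needs a nested mapping can be illustrated without a cluster. The sketch below emulates, in plain Python, how an array of plain objects is flattened into one value list per sub-field, compared with matching inside a single inner object as a nested query does. The chair attributes and all function names are assumptions made up for the illustration.

```python
# Hypothetical document: a room that has chairs.
room = {
    "name": "meeting room",
    "chairs": [
        {"color": "red", "material": "wood"},
        {"color": "blue", "material": "metal"},
    ],
}


def flatten(doc, field):
    """Emulate a plain object array: one list of values per sub-field."""
    flat = {}
    for item in doc[field]:
        for key, value in item.items():
            flat.setdefault(f"{field}.{key}", []).append(value)
    return flat


def matches_flattened(doc, field, **conditions):
    """Each condition is checked against the merged value lists."""
    flat = flatten(doc, field)
    return all(
        value in flat.get(f"{field}.{key}", [])
        for key, value in conditions.items()
    )


def matches_nested(doc, field, **conditions):
    """All conditions must hold within the same inner object."""
    return any(
        all(item.get(key) == value for key, value in conditions.items())
        for item in doc[field]
    )
```

With the flattened form, a search for a red metal chair matches this room even though no such chair exists; the nested form keeps each chair's fields together and rejects it.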

Elasticsearch Join

“Another choice is to let ES join the related objects. The first option is, of course, to use a parent-child relationship.”

Parent-Child

"The parent-child relationship is similar to a foreign key in the SQL world. Each child stores its parent's id, so they can be joined at search time when needed. It can be used to find children by parent or to find parents by child.

"This approach has the opposite pros and cons compared with the application-side join. It has good write performance: because the parent and child documents are stored separately, a change to only one of them does not require reindexing the others. Read performance, however, is limited. One reason is that ES has to join the documents when searching on the relationship between them. Another problem is that a child document has to live on the same shard as its parent in order to be joined, which restricts scalability.

“Oh, I almost forgot that this functionality is limited to one-to-many relations and cannot refer to itself (i.e. parent and child can't be the same type). What about many-to-many relationships and self-referential structures like trees?” Tony wondered in bed, and soon fell asleep.
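The parent-child idea from this section can be emulated in plain Python to make the trade-off concrete. This is a sketch, not Elasticsearch code: the sample data and the two function names are hypothetical, and the lookups are only loosely analogous to what the `has_child` and `has_parent` queries do.

```python
# Hypothetical data: parents and children are stored separately,
# and each child keeps its parent's id, like a foreign key.
parents = {
    "p1": {"name": "tom"},
    "p2": {"name": "tim"},
}
children = [
    {"id": "c1", "parent": "p1", "title": "hello elasticsearch"},
    {"id": "c2", "parent": "p1", "title": "join and subquery"},
    {"id": "c3", "parent": "p2", "title": "spring boot"},
]


def parents_having_child(predicate):
    """Find parent ids with at least one child matching the predicate."""
    return sorted({child["parent"] for child in children if predicate(child)})


def children_having_parent(predicate):
    """Find child ids whose parent matches the predicate."""
    return [
        child["id"]
        for child in children
        if predicate(parents[child["parent"]])
    ]


# Updating a parent touches one record; no child needs to be rewritten.
parents["p2"]["name"] = "timothy"
```

Every lookup has to walk the relationship at query time, which is the read cost Tony mentions, while the final update shows why writes stay cheap.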

Tree Structure

"Aha, I remember that ES has a path_hierarchy tokenizer, which can be used to store tree-like structures such as a file system. The example looks like this:

// create analyzer
PUT /fs
{
  "settings": {
    "analysis": {
      "analyzer": {
        "paths": { 
          "tokenizer": "path_hierarchy"
        }
      }
    }
  }
}
// create file mapping
PUT /fs/_mapping/file
{
  "properties": {
    "name": { 
      "type":  "string",
      "index": "not_analyzed"
    },
    "path": { 
      "type":  "string",
      "index": "not_analyzed",
      "fields": {
        "tree": { 
          "type":     "string",
          "analyzer": "paths"
        }
      }
    }
  }
}
// search under '/clinton'!
GET /fs/file/_search
{
  "query": {
    "filtered": {
      "query": {
        "match": {
          "contents": "elasticsearch"
        }
      },
      "filter": {
        "term": { 
          "path.tree": "/clinton"
        }
      }
    }
  }
}

“So if we need a self-referential search, we can construct a path and associate the content to search with that path. In this way, we can easily search content under a specific path.” Tony clapped excitedly.
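To see why a term filter on `path.tree` finds everything under `/clinton`, it helps to look at the tokens produced for a path. The following is a simplified Python emulation, assuming absolute paths that start with the delimiter; the function names and the sample paths other than `/clinton` are made up for the illustration, and this is not the tokenizer's actual implementation.

```python
def path_hierarchy_tokens(path, delimiter="/"):
    """Emulate the ancestor prefixes emitted for an absolute path."""
    parts = [part for part in path.split(delimiter) if part]
    tokens = []
    current = ""
    for part in parts:
        current = f"{current}{delimiter}{part}"
        tokens.append(current)
    return tokens


def is_under(path, ancestor):
    """A term match on the tree field succeeds if the ancestor is one of the tokens."""
    return ancestor in path_hierarchy_tokens(path)
```

Because every ancestor path is stored as its own token, an exact term match on `/clinton` selects the whole subtree, while a path such as `/clint/notes` is not matched by accident.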

Ref


  1. For details about the architecture of Elasticsearch and why it is a kind of NoSQL store, refer to this blog page.
