跳至主要内容

Elasticsearch Learning (3): Search

In the last blog, we have discussed some natures of Elasticsearch: how it search text and how it store document. Now, we will focus on search deeper.

Inverted Index Revisit

We have talked about that Elasticsearch use the inverted index to assist searching functionality, but Elasticsearch build a more complex inverted index than we have talked.

Analyzed Inverted Index

So we have following two documents:

  1. Doc_1: The quick brown fox jumped over the lazy dog
  2. Doc_2: Quick brown foxes leap over lazy dogs in summer

A simple inverted index looks like following:

Term      Doc_1  Doc_2
-------------------------
Quick   |       |  X
The     |   X   |
brown   |   X   |  X
dog     |   X   |
dogs    |       |  X
fox     |   X   |
foxes   |       |  X
in      |       |  X
jumped  |   X   |
lazy    |   X   |  X
leap    |       |  X
over    |   X   |  X
quick   |   X   |
summer  |       |  X
the     |   X   |
------------------------

A normalized index in which document is first analyzed by a analyzer, then indexed by inverted index:

Term      Doc_1  Doc_2
-------------------------
brown   |   X   |  X
dog     |   X   |  X
fox     |   X   |  X
in      |       |  X
jump    |   X   |  X
lazy    |   X   |  X
over    |   X   |  X
quick   |   X   |  X
summer  |       |  X
the     |   X   |  X
------------------------

As we can notice some changes:

  • Quick -> quick
  • dogs -> dog, foxes -> fox
  • jumped -> jump
  • leap -> jump

So what analyzer do?

Analyzer

A analyzer is usually composed by following parts:

  • Character Filters: could be used to strip out HTML, or to convert & characters to the word and
  • Tokenizer: split text into tokens
  • Token Filters
    • change terms: e.g. lowercase
    • remove terms: e.g. remove stopwords such as a, and, the
    • add terms: e.g. synonym
Testing

we can test our analyzer through _analyze interface:

GET /_analyze?pretty
{
  "analyzer": "smartcn",
  "text": "如果这个发布想要指定盟外事务作为接收方,那么需要先给发布添加标签(通过标签搜索查看目前系统内支持的标签),然后才可以查看对标签感兴趣的事务再添加其成为接收方"
}

Doc Values Intro

Elasticsearch adds the terms/tokens to the inverted index for search. But it also extracts the terms and adds them to the columnar doc values for:

  • Sorting on a field
  • Aggregations on a field
  • Certain filters (for example, geolocation filters)
  • Scripts that refer to fields

Search with Body

Now, we will introduce more API about search, more than just using parameters in url.

Query DSL

A common query body looks like the following code block:

{
  QUERY_NAME: {
    FIELD_NAME: {
      ARGUMENT: VALUE,
      ARGUMENT: VALUE,...
    }
  }
}

Most Important Queries

The following is some commonly used queries:

  • match: match a field
  • multi_match: match multiple fields at a single query
  • range: whether a field is in some range
  • term, terms: search by exact value
  • exists, missing: test whether a value is existing or not

Compound

We have known some basic queries, now we can combine them to compose very complex query:

"bool": {
   "must":     { "match": { "title": "how to make millions" }},
   "must_not": { "match": { "tag":   "spam" }},
   "should": [
     { "match": { "tag": "starred" }}
   ],
   "filter": {
     "range": { "date": { "gte": "2014-01-01" }} 
   }
}

Filtering vs Scoring

The goal of filtering is to reduce the number of documents that have to be examined by the scoring queries.

So, as a general rule

  • use query clauses for full-text search or for any condition that should affect the relevance score;
  • use filters for everything else. e.g. for exact-value searches, we probably want to use a filter clause instead of a query, as a filter will be cached.

Sort and Relevance

Now, we get the document that match the term, then we are going to sort them by some order.

Relevance

If we don’t specify sorting orders, Elasticsearch will sort document by its relevance score. The standard similarity algorithm used in Elasticsearch is known as term frequency/inverse document frequency, or TF/IDF, which takes the following factors into account:

  • term frequency: search “test” against “a test in test” & “test in”, more frequent in a field, more relevant;
  • field length: found “test” in “title” & “content”, shorter the field, more relevant;
  • inverse document frequency: search “test” & “inverse”,

Sorting

We can also customize the order we want to have:

  • by field values:
GET /_search
{
    "query" : {
        "bool" : {
            "filter" : { "term" : { "user_id" : 1 }}
        }
    },
    "sort": "date"
}
  • sort multi-level: first sort by a, then b
"sort": [
    { "date":   { "order": "desc" }},
    { "_score": { "order": "desc" }}
]
  • sort multi-value field: we have to reduce a this field into a value through – min, max, avg or sum
"sort": {
    "dates": {
        "order": "asc",
        "mode":  "min"
    }
}

Search Execution

We can know exactly which shard in the cluster holds a document by a unique combination of _index, _type, and routing values (which defaults to the document’s _id).

But A search request has to consult a copy of every shard in the index or indices we’re interested in to see if they have any matching documents, because we don’t that unique combination.

Query Phase

Each shard executes the query locally (i.e. every shard have inverted index of its content) and builds a sorted priority queue of length from + size—in other words, enough results to satisfy the global search request all by itself. It returns a lightweight list of results to the coordinating node, which contains just the doc IDs and any values required for sorting, such as the _score.

An index can consist of one or more primary shards, so a search request against a single index needs to be able to combine the results from multiple shards. A search against multiple or all indices works in exactly the same way—there are just more shards involved.

Fetch Phase

Each node which receive the GET request from coordinating note, enriches documents if required, and then return them.

Deep Paging in Distributed System

As we have known from last query phase, each shard generates its own sorted results, which then need to be sorted centrally to ensure that the overall order is correct. So, in a distributed system, the cost of sorting results grows exponentially the deeper we page (num_doc = num_shards * (page * page_size)). There is a good reason that web search engines don’t return more than 1,000 results for any query.

Scroll

Apart from paging, we can also use scroll to get a large number of documents. But, a scroll query is more efficient, without paying the penalty of deep pagination.

A scrolled search takes a snapshot in time. It doesn’t see any changes that are made to the index after the initial search request has been made. It does this by keeping the old data files around, so that it can preserve its “view” on what the index looked like at the time it started.


Written with StackEdit.

评论

此博客中的热门博文

Spring Boot: Customize Environment

Spring Boot: Customize Environment Environment variable is a very commonly used feature in daily programming: used in init script used in startup configuration used by logging etc In Spring Boot, all environment variables are a part of properties in Spring context and managed by Environment abstraction. Because Spring Boot can handle the parse of configuration files, when we want to implement a project which uses yml file as a separate config file, we choose the Spring Boot. The following is the problems we met when we implementing the parse of yml file and it is recorded for future reader. Bind to Class Property values can be injected directly into your beans using the @Value annotation, accessed via Spring’s Environment abstraction or bound to structured objects via @ConfigurationProperties. As the document says, there exists three ways to access properties in *.properties or *.yml : @Value : access single value Environment : can access multi

Elasticsearch: Join and SubQuery

Elasticsearch: Join and SubQuery Tony was bothered by the recent change of search engine requirement: they want the functionality of SQL-like join in Elasticsearch! “They are crazy! How can they think like that. Didn’t they understand that Elasticsearch is kind-of NoSQL 1 in which every index should be independent and self-contained? In this way, every index can work independently and scale as they like without considering other indexes, so the performance can boost. Following this design principle, Elasticsearch has little related supports.” Tony thought, after listening their requirements. Leader notice tony’s unwillingness and said, “Maybe it is hard to do, but the requirement is reasonable. We need to search person by his friends, didn’t we? What’s more, the harder to implement, the more you can learn from it, right?” Tony thought leader’s word does make sense so he set out to do the related implementations Application-Side Join “The first implementation

Implement isdigit

It is seems very easy to implement c library function isdigit , but for a library code, performance is very important. So we will try to implement it and make it faster. Function So, first we make it right. int isdigit ( char c) { return c >= '0' && c <= '9' ; } Improvements One – Macro When it comes to performance for c code, macro can always be tried. #define isdigit (c) c >= '0' && c <= '9' Two – Table Upper version use two comparison and one logical operation, but we can do better with more space: # define isdigit(c) table[c] This works and faster, but somewhat wasteful. We need only one bit to represent true or false, but we use a int. So what to do? There are many similar functions like isalpha(), isupper ... in c header file, so we can combine them into one int and get result by table[c]&SOME_BIT , which is what source do. Source code of ctype.h : # define _ISbit(bit) (1 << (