Elasticsearch Learning (3): Search

In the last blog, we have discussed some natures of Elasticsearch: how it search text and how it store document. Now, we will focus on search deeper.

Inverted Index Revisit

We have talked about that Elasticsearch use the inverted index to assist searching functionality, but Elasticsearch build a more complex inverted index than we have talked.

Analyzed Inverted Index

So we have following two documents:

Doc_1: The quick brown fox jumped over the lazy dog

Doc_2: Quick brown foxes leap over lazy dogs in summer

A simple inverted index looks like following:

Term      Doc_1  Doc_2
-------------------------
Quick   |       |  X
The     |   X   |
brown   |   X   |  X
dog     |   X   |
dogs    |       |  X
fox     |   X   |
foxes   |       |  X
in      |       |  X
jumped  |   X   |
lazy    |   X   |  X
leap    |       |  X
over    |   X   |  X
quick   |   X   |
summer  |       |  X
the     |   X   |
------------------------

A normalized index in which document is first analyzed by a analyzer, then indexed by inverted index:

Term      Doc_1  Doc_2
-------------------------
brown   |   X   |  X
dog     |   X   |  X
fox     |   X   |  X
in      |       |  X
jump    |   X   |  X
lazy    |   X   |  X
over    |   X   |  X
quick   |   X   |  X
summer  |       |  X
the     |   X   |  X
------------------------

As we can notice some changes:

Quick -> quick
dogs -> dog, foxes -> fox
jumped -> jump
leap -> jump

So what analyzer do?

Analyzer

A analyzer is usually composed by following parts:

Character Filters: could be used to strip out HTML, or to convert & characters to the word and
Tokenizer: split text into tokens
Token Filters
- change terms: e.g. lowercase
- remove terms: e.g. remove stopwords such as a, and, the
- add terms: e.g. synonym

Testing

we can test our analyzer through _analyze interface:

GET /_analyze?pretty
{
  "analyzer": "smartcn",
  "text": "如果这个发布想要指定盟外事务作为接收方，那么需要先给发布添加标签（通过标签搜索查看目前系统内支持的标签），然后才可以查看对标签感兴趣的事务再添加其成为接收方"
}

Doc Values Intro

Elasticsearch adds the terms/tokens to the inverted index for search. But it also extracts the terms and adds them to the columnar doc values for:

Sorting on a field
Aggregations on a field
Certain filters (for example, geolocation filters)
Scripts that refer to fields

Search with Body

Now, we will introduce more API about search, more than just using parameters in url.

Query DSL

A common query body looks like the following code block:

{
  QUERY_NAME: {
    FIELD_NAME: {
      ARGUMENT: VALUE,
      ARGUMENT: VALUE,...
    }
  }
}

Most Important Queries

The following is some commonly used queries:

match: match a field
multi_match: match multiple fields at a single query
range: whether a field is in some range
term, terms: search by exact value
exists, missing: test whether a value is existing or not

Compound

We have known some basic queries, now we can combine them to compose very complex query:

"bool": {
   "must":     { "match": { "title": "how to make millions" }},
   "must_not": { "match": { "tag":   "spam" }},
   "should": [
     { "match": { "tag": "starred" }}
   ],
   "filter": {
     "range": { "date": { "gte": "2014-01-01" }} 
   }
}

Filtering vs Scoring

The goal of filtering is to reduce the number of documents that have to be examined by the scoring queries.

So, as a general rule

use query clauses for full-text search or for any condition that should affect the relevance score;
use filters for everything else. e.g. for exact-value searches, we probably want to use a filter clause instead of a query, as a filter will be cached.

Sort and Relevance

Now, we get the document that match the term, then we are going to sort them by some order.

Relevance

If we don’t specify sorting orders, Elasticsearch will sort document by its relevance score. The standard similarity algorithm used in Elasticsearch is known as term frequency/inverse document frequency, or TF/IDF, which takes the following factors into account:

term frequency: search “test” against “a test in test” & “test in”, more frequent in a field, more relevant;
field length: found “test” in “title” & “content”, shorter the field, more relevant;
inverse document frequency: search “test” & “inverse”,

Sorting

We can also customize the order we want to have:

by field values:

GET /_search
{
    "query" : {
        "bool" : {
            "filter" : { "term" : { "user_id" : 1 }}
        }
    },
    "sort": "date"
}

sort multi-level: first sort by a, then b …

"sort": [
    { "date":   { "order": "desc" }},
    { "_score": { "order": "desc" }}
]

sort multi-value field: we have to reduce a this field into a value through – min, max, avg or sum

"sort": {
    "dates": {
        "order": "asc",
        "mode":  "min"
    }
}

Search Execution

We can know exactly which shard in the cluster holds a document by a unique combination of _index, _type, and routing values (which defaults to the document’s _id).

But A search request has to consult a copy of every shard in the index or indices we’re interested in to see if they have any matching documents, because we don’t that unique combination.

Query Phase

Each shard executes the query locally (i.e. every shard have inverted index of its content) and builds a sorted priority queue of length from + size—in other words, enough results to satisfy the global search request all by itself. It returns a lightweight list of results to the coordinating node, which contains just the doc IDs and any values required for sorting, such as the _score.

An index can consist of one or more primary shards, so a search request against a single index needs to be able to combine the results from multiple shards. A search against multiple or all indices works in exactly the same way—there are just more shards involved.

Fetch Phase

Each node which receive the GET request from coordinating note, enriches documents if required, and then return them.

Deep Paging in Distributed System

As we have known from last query phase, each shard generates its own sorted results, which then need to be sorted centrally to ensure that the overall order is correct. So, in a distributed system, the cost of sorting results grows exponentially the deeper we page (num_doc = num_shards * (page * page_size)). There is a good reason that web search engines don’t return more than 1,000 results for any query.

Scroll

Apart from paging, we can also use scroll to get a large number of documents. But, a scroll query is more efficient, without paying the penalty of deep pagination.

A scrolled search takes a snapshot in time. It doesn’t see any changes that are made to the index after the initial search request has been made. It does this by keeping the old data files around, so that it can preserve its “view” on what the index looked like at the time it started.

Written with StackEdit.

Elasticsearch: Join and SubQuery

Elasticsearch: Join and SubQuery Tony was bothered by the recent change of search engine requirement: they want the functionality of SQL-like join in Elasticsearch! “They are crazy! How can they think like that. Didn’t they understand that Elasticsearch is kind-of NoSQL 1 in which every index should be independent and self-contained? In this way, every index can work independently and scale as they like without considering other indexes, so the performance can boost. Following this design principle, Elasticsearch has little related supports.” Tony thought, after listening their requirements. Leader notice tony’s unwillingness and said, “Maybe it is hard to do, but the requirement is reasonable. We need to search person by his friends, didn’t we? What’s more, the harder to implement, the more you can learn from it, right?” Tony thought leader’s word does make sense so he set out to do the related implementations Application-Side Join “The first implementation

阅读全文

On teh way

Blog Search