In the last blog, we discussed some characteristics of Elasticsearch: how it searches text and how it stores documents. Now, we will look deeper into search.
Inverted Index Revisited
We have mentioned that Elasticsearch uses an inverted index to support search, but the inverted index it actually builds is more sophisticated than what we have described so far.
Analyzed Inverted Index
Suppose we have the following two documents:
- Doc_1: The quick brown fox jumped over the lazy dog
- Doc_2: Quick brown foxes leap over lazy dogs in summer
A simple inverted index looks like the following:
| Term   | Doc_1 | Doc_2 |
|--------|-------|-------|
| Quick  |       | X     |
| The    | X     |       |
| brown  | X     | X     |
| dog    | X     |       |
| dogs   |       | X     |
| fox    | X     |       |
| foxes  |       | X     |
| in     |       | X     |
| jumped | X     |       |
| lazy   | X     | X     |
| leap   |       | X     |
| over   | X     | X     |
| quick  | X     |       |
| summer |       | X     |
| the    | X     |       |
A normalized index, in which each document is first processed by an analyzer and then indexed, looks like this:
| Term   | Doc_1 | Doc_2 |
|--------|-------|-------|
| brown  | X     | X     |
| dog    | X     | X     |
| fox    | X     | X     |
| in     |       | X     |
| jump   | X     | X     |
| lazy   | X     | X     |
| over   | X     | X     |
| quick  | X     | X     |
| summer |       | X     |
| the    | X     |       |
We can notice some changes:
- Quick -> quick
- dogs -> dog, foxes -> fox
- jumped -> jump
- leap -> jump
So what does an analyzer do?
Analyzer
An analyzer is usually composed of the following parts:
- Character Filters: can be used to strip out HTML, or to convert & characters to the word “and”
- Tokenizer: splits the text into tokens
- Token Filters
  - change terms: e.g. lowercase
  - remove terms: e.g. remove stopwords such as a, and, the
  - add terms: e.g. add synonyms
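As a rough sketch, these parts can be combined into a custom analyzer when creating an index. The index name my_index and analyzer name my_analyzer below are just placeholders; the char_filter, tokenizer, and filter entries are all built-in components:
PUT /my_index
{
    "settings": {
        "analysis": {
            "analyzer": {
                "my_analyzer": {
                    "type": "custom",
                    "char_filter": [ "html_strip" ],
                    "tokenizer": "standard",
                    "filter": [ "lowercase", "stop" ]
                }
            }
        }
    }
}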
Testing
We can test an analyzer through the _analyze API; here we feed a Chinese sentence to the smartcn analyzer (from the Smart Chinese Analysis plugin):
GET /_analyze?pretty
{
    "analyzer": "smartcn",
    "text": "如果这个发布想要指定盟外事务作为接收方,那么需要先给发布添加标签(通过标签搜索查看目前系统内支持的标签),然后才可以查看对标签感兴趣的事务再添加其成为接收方"
}
Doc Values Intro
Elasticsearch adds the terms/tokens to the inverted index for search. But it also extracts the terms and adds them to the columnar doc values for:
- Sorting on a field
- Aggregations on a field
- Certain filters (for example, geolocation filters)
- Scripts that refer to fields
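For example, an aggregation like the following reads field values from doc values rather than from the inverted index. The field name tag is hypothetical and assumed to be a not_analyzed (keyword) field:
GET /_search
{
    "size": 0,
    "aggs": {
        "popular_tags": {
            "terms": { "field": "tag" }
        }
    }
}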
Search with Body
Now, we will introduce more of the search API, beyond just passing parameters in the URL.
Query DSL
A common query body looks like the following code block:
{
    QUERY_NAME: {
        FIELD_NAME: {
            ARGUMENT: VALUE,
            ARGUMENT: VALUE,...
        }
    }
}
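To use it, we pass such a clause to the query parameter of the search API. A minimal sketch with a match query (the title field here is just an example):
GET /_search
{
    "query": {
        "match": {
            "title": "quick brown fox"
        }
    }
}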
Most Important Queries
The following are some commonly used queries:
- match: full-text match against a field
- multi_match: run the same match query against multiple fields
- range: test whether a field falls within a range
- term, terms: search by exact value(s)
- exists, missing: test whether a field has any value or is missing
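As a quick sketch of one of these, a range query looks like the following (the age field and its bounds are made up for illustration):
GET /_search
{
    "query": {
        "range": {
            "age": { "gte": 20, "lt": 30 }
        }
    }
}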
Compound Queries
Now that we know some basic queries, we can combine them with the bool query to compose very complex queries:
"bool": {
"must": { "match": { "title": "how to make millions" }},
"must_not": { "match": { "tag": "spam" }},
"should": [
{ "match": { "tag": "starred" }}
],
"filter": {
"range": { "date": { "gte": "2014-01-01" }}
}
}
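Clauses under must and must_not have to match (or not match) and contribute to the score; should clauses are optional but boost the score when they match; the filter clause only includes or excludes documents without affecting the score. Embedded in a full search request, the query above looks roughly like this:
GET /_search
{
    "query": {
        "bool": {
            "must":     { "match": { "title": "how to make millions" }},
            "must_not": { "match": { "tag": "spam" }},
            "should":   [ { "match": { "tag": "starred" }} ],
            "filter":   { "range": { "date": { "gte": "2014-01-01" }}}
        }
    }
}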
Filtering vs Scoring
The goal of filtering is to reduce the number of documents that have to be examined by the scoring queries.
So, as a general rule:
- use query clauses for full-text search or for any condition that should affect the relevance score;
- use filters for everything else, e.g. exact-value searches, where a filter clause is preferable because non-scoring filters can be cached.
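For example, an exact-value lookup that should not affect scoring can be expressed as a non-scoring filter, here wrapped in constant_score (the status field and its value are hypothetical):
GET /_search
{
    "query": {
        "constant_score": {
            "filter": {
                "term": { "status": "published" }
            }
        }
    }
}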
Sort and Relevance
Now that we have the documents that match the query, we want to sort them in some order.
Relevance
If we don’t specify a sort order, Elasticsearch sorts documents by their relevance score. The standard similarity algorithm used in Elasticsearch is known as term frequency/inverse document frequency, or TF/IDF, which takes the following factors into account:
- term frequency: searching for “test” against “a test in test” and “test in”: the more often the term appears in a field, the more relevant the document;
- field length: finding “test” in a short field such as “title” counts for more than finding it in a long field such as “content”: the shorter the field, the more relevant;
- inverse document frequency: searching for “test” and “inverse”, the rarer term “inverse” (appearing in fewer documents across the index) contributes more to the score than the common term “test”.
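As a rough sketch, the classic Lucene TF/IDF formulas behind these three factors look like the following (exact scoring details vary by Elasticsearch/Lucene version):

$$
\mathrm{tf}(t, d) = \sqrt{\mathrm{freq}(t, d)}, \qquad
\mathrm{idf}(t) = 1 + \ln\frac{\mathrm{numDocs}}{\mathrm{docFreq}(t) + 1}, \qquad
\mathrm{norm}(d) = \frac{1}{\sqrt{\mathrm{numTerms}(d)}}
$$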
Sorting
We can also customize the order we want to have:
- by field values:
GET /_search
{
    "query" : {
        "bool" : {
            "filter" : { "term" : { "user_id" : 1 }}
        }
    },
    "sort": "date"
}
- sort multi-level: first sort by a, then by b, and so on:
"sort": [
    { "date":   { "order": "desc" }},
    { "_score": { "order": "desc" }}
]
- sort on a multi-value field: we have to reduce the multiple values to a single value using min, max, avg, or sum:
"sort": {
    "dates": {
        "order": "asc",
        "mode": "min"
    }
}
Search Execution
We can know exactly which shard in the cluster holds a document from the unique combination of its _index, _type, and routing value (which defaults to the document’s _id).
But a search request has to consult a copy of every shard in the index or indices we are interested in to see whether they have any matching documents, because we don't have that unique combination: we don't know in advance which shards hold the matches.
Query Phase
Each shard executes the query locally (every shard has an inverted index for its own content) and builds a sorted priority queue of length from + size, in other words, enough results to satisfy the global search request all by itself. It returns a lightweight list of results to the coordinating node, containing just the doc IDs and any values required for sorting, such as the _score.
An index can consist of one or more primary shards, so a search request against a single index needs to be able to combine the results from multiple shards. A search against multiple or all indices works in exactly the same way—there are just more shards involved.
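For example, a request for the tenth page of ten results (the title field below is just an example) means every shard must build a priority queue of from + size = 100 entries before the coordinating node merges them:
GET /_search
{
    "from": 90,
    "size": 10,
    "query": { "match": { "title": "quick brown fox" } }
}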
Fetch Phase
Each shard that receives a GET request from the coordinating node loads the requested documents, enriches them if required, and then returns them to the coordinating node.
Deep Paging in a Distributed System
As we saw in the query phase, each shard generates its own sorted results, which then need to be merged and sorted centrally to ensure that the overall order is correct. So, in a distributed system, the cost of sorting results grows steeply the deeper we page (num_docs = num_shards * (page * page_size)). There is a good reason that web search engines don’t return more than 1,000 results for any query.
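As a worked example, asking for results 10,001 to 10,010 (from = 10000, size = 10) on an index with 5 primary shards means each shard has to produce a priority queue of 10,010 entries, and the coordinating node has to sort through 5 × 10,010 = 50,050 results just to return 10 of them.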
Scroll
Apart from paging, we can also use a scroll to retrieve large numbers of documents; a scrolled search is more efficient because it avoids the penalty of deep pagination.
A scrolled search takes a snapshot in time. It doesn’t see any changes that are made to the index after the initial search request has been made. It does this by keeping the old data files around, so that it can preserve its “view” on what the index looked like at the time it started.
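A minimal sketch of a scrolled search, keeping the scroll context alive for one minute (old_index is a placeholder; sorting by _doc is the most efficient order when we don't care about relevance):
GET /old_index/_search?scroll=1m
{
    "query": { "match_all": {} },
    "sort": [ "_doc" ],
    "size": 1000
}
Each response returns a _scroll_id; to fetch the next batch we pass it back (the "..." stands for the ID taken from the previous response):
GET /_search/scroll
{
    "scroll": "1m",
    "scroll_id": "..."
}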