While using Elasticsearch we ran into many problems. This post lists those problems and their solutions for future readers.
Elasticsearch Concepts
The first set of problems relates to some Elasticsearch concepts that confused us at the very beginning of adoption.
Indices or Shards
In Elasticsearch, a shard is the smallest unit of searching and scaling. When we search within a single index, Elasticsearch forwards the search request to a primary or replica of every shard in that index, and then gathers the results from each shard.
So we can understand that:
Searching one index that has five primary shards is exactly equivalent to searching five indices that have one primary shard each.
That is why Elasticsearch suggests rolling over to new indices to scale our applications: the number of primary shards in an index is fixed once the index is created.
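The equivalence above can be sketched with a toy model in Python. This is only an illustration: the routing function, the sample documents and the helper names are assumptions made for the example, not Elasticsearch internals (the real routing hash is different).

```python
def shard_for(doc_id: str, num_shards: int) -> int:
    # Toy routing function (an assumption for this sketch, not the real hash).
    return sum(doc_id.encode()) % num_shards

def search_shards(shards, term):
    # Fan the search out to every shard, then merge the per-shard hits.
    hits = []
    for shard in shards:
        hits.extend(doc_id for doc_id, text in shard.items() if term in text.split())
    return sorted(hits)

# Hypothetical sample documents.
docs = {"1": "brown dog", "2": "lazy cat", "3": "quick dog",
        "4": "brown fox", "5": "dog park", "6": "cat nap"}

# One index with five primary shards.
one_index = [dict() for _ in range(5)]
for doc_id, text in docs.items():
    one_index[shard_for(doc_id, 5)][doc_id] = text

# Five indices with one primary shard each.
five_indices = [[dict()] for _ in range(5)]
for doc_id, text in docs.items():
    five_indices[shard_for(doc_id, 5)][0][doc_id] = text

hits_one = search_shards(one_index, "dog")
hits_five = search_shards([s for index in five_indices for s in index], "dog")
print(hits_one, hits_five)
```

In both layouts the search touches five shards and merges five partial results, which is why the two setups behave the same from the search side.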
Parent Child vs Nested
Say we have a Person class to search, and each person has some tags describing them:
class Person {
List<Tags> tags;
}
In order to search persons by tags, we can't use a normal object in the mapping: it is flattened into a simple key-value format and loses the relationship between the inner object and the outer object (see the Elasticsearch documentation for details). Instead, we can use either a nested object or a parent-child relationship:
- parent-child way:
GET /company/branch/_search
{
"query": {
"has_child": {
"type": "employee",
"score_mode": "max",
"query": {
"match": {
"name": "Alice"
}
}
}
}
}
- nested object:
GET /company/branch/_search
{
"query": {
"nested": {
"path": "employee",
"query": {
"match": {
"employee.name": "Alice"
}
}
}
}
}
So what is the difference between them?
The difference mainly lies in how the documents are stored: a nested object is stored together with the outer JSON document, while in a parent-child relationship the child documents are stored separately.
The parent-child relationship makes it easier to change either object independently, which is its advantage over nested objects. But it also has a limitation: the parent and its children have to live in the same shard so that they can be joined dynamically and quickly (joining within one shard avoids network communication, which saves a lot of time).
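To make the flattening problem of a normal object concrete, here is a small Python sketch. It only mimics the idea (one multi-valued field per inner key); the `role` field and all function names are hypothetical and exist only for this example.

```python
def flatten_object_field(doc, field):
    # Mimic a plain object array: each inner key becomes one multi-valued field.
    flat = {}
    for inner in doc[field]:
        for key, value in inner.items():
            flat.setdefault(f"{field}.{key}", []).append(value)
    return flat

def match_flattened(flat, field, conditions):
    # Every condition is checked against the whole value list, independently.
    return all(v in flat.get(f"{field}.{k}", []) for k, v in conditions.items())

def match_nested(doc, field, conditions):
    # Every condition must hold inside the same inner object.
    return any(all(inner.get(k) == v for k, v in conditions.items())
               for inner in doc[field])

branch = {"employee": [{"name": "Alice", "role": "engineer"},
                       {"name": "Bob", "role": "manager"}]}
flat = flatten_object_field(branch, "employee")
wanted = {"name": "Alice", "role": "manager"}  # no single employee matches this

flat_hit = match_flattened(flat, "employee", wanted)
nested_hit = match_nested(branch, "employee", wanted)
print(flat, flat_hit, nested_hit)
```

The flattened form reports a match for "Alice who is a manager" although no such employee exists, while the per-object check does not. That false positive is what nested objects and parent-child relationships are meant to avoid.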
must vs should
must and should are both clauses of the bool query, and they behave differently depending on the context in which they are used:
- filter context:
  - must: this query must match
  - should: if there is no must or filter clause, at least one should clause must match
- query context:
  - must: the query must match, and it contributes to the score
  - should: if there is a must or filter clause, the bool query still matches even when no should clause matches
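The matching rules can be modelled with a few lines of Python. This is a simplified sketch, assuming the default documented for recent Elasticsearch versions (at least one should clause is required only when there is no must or filter clause); scoring is ignored, clauses are reduced to single terms, and the function name is made up for the example.

```python
def bool_matches(doc_terms, must=(), filter_=(), should=(), minimum_should_match=None):
    # Assumed default: 0 when a must/filter clause exists, otherwise 1
    # (as long as there is at least one should clause).
    if minimum_should_match is None:
        if must or filter_:
            minimum_should_match = 0
        else:
            minimum_should_match = 1 if should else 0
    required_ok = all(term in doc_terms for term in (*must, *filter_))
    should_hits = sum(1 for term in should if term in doc_terms)
    return required_ok and should_hits >= minimum_should_match

doc = {"brown", "dog"}
print(bool_matches(doc, should=["cat"]))                 # only should, none matches
print(bool_matches(doc, must=["dog"], should=["cat"]))   # must matches, should optional
print(bool_matches(doc, filter_=["dog"], should=["cat"], minimum_should_match=1))
```

Passing `minimum_should_match` explicitly, as in the last call, is a way to stop depending on the context-sensitive default.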
multi_match vs _all
The multi_match query provides a convenient shorthand way of running the same query against multiple fields.
If we list all fields in a multi_match query, in most cases we will get the same result as querying the _all meta field. But there are some differences:
- multi_match is more flexible: you can change the fields of your query dynamically;
- you can play with different boosts per field when using the multi_match query, i.e. give different weights to different fields;
- term frequencies and field length may be different in the _all field and in the individual fields, which may affect scoring.
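As a sketch of the per-field boost point, the following Python helper builds a multi_match request body using the `field^boost` notation. The helper name and the `title`/`content` fields are assumptions for the example; the body is only printed, not sent to a cluster.

```python
import json

def multi_match_body(text, boosts):
    # Fields with a boost of 1 are listed as-is; others use the "field^boost" form.
    fields = [name if boost == 1 else f"{name}^{boost}"
              for name, boost in boosts.items()]
    return {"query": {"multi_match": {"query": text, "fields": fields}}}

# Hypothetical fields: matches in "title" count three times as much as in "content".
body = multi_match_body("brown dog", {"title": 3, "content": 1})
print(json.dumps(body, indent=2))
```

Because the field list is built at request time, changing which fields are searched or how they are weighted needs no change to the mapping.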
Multiword vs More Like This
The match query makes multiword queries just as simple:
curl -XGET 'localhost:9200/my_index/my_type/_search?pretty' -H 'Content-Type: application/json' -d'
{
"query": {
"match": {
"title": "BROWN DOG!"
}
}
}
'
Elasticsearch analyzes the query string with the analyzer of the target field, and then matches the resulting terms against that field.
The More Like This query, on the other hand, is more complex and consists of the following steps:
- selects a set of representative terms of these input documents,
- forms a query using these terms
- executes the query and returns the results.
In other words, an MLT query does not match every word of the input. It first picks the terms that best characterize the input, and then runs what is effectively a multiword query built from only those terms.
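The three steps can be sketched in Python as a toy. The tf-idf formula, the tiny corpus and the function names are assumptions for illustration; the parameter names `max_query_terms` and `min_doc_freq` loosely mirror options of the real query, but the actual term selection in Elasticsearch is more involved.

```python
import math
from collections import Counter

def representative_terms(like_text, corpus, max_query_terms=2, min_doc_freq=1):
    # Step 1: score each input term by a simple tf-idf and keep the best ones.
    tf = Counter(like_text.lower().split())
    doc_terms = [set(doc.lower().split()) for doc in corpus]
    scores = {}
    for term, freq in tf.items():
        df = sum(1 for terms in doc_terms if term in terms)
        if df < min_doc_freq:
            continue  # ignore terms that are too rare in the corpus
        scores[term] = freq * math.log((len(corpus) + 1) / (df + 1))
    ranked = sorted(scores, key=lambda t: (-scores[t], t))
    return ranked[:max_query_terms]

def build_query(terms, field):
    # Step 2: form a query in which any selected term may match.
    return {"bool": {"should": [{"term": {field: t}} for t in terms]}}

# Hypothetical corpus and input text.
corpus = ["the brown dog", "the lazy cat", "the quick fox", "the dog park"]
terms = representative_terms("the brown dog chased the cat", corpus)
query = build_query(terms, "title")
print(terms, query)
```

Step 3 would be to execute `query` against the index. Note how a word that appears in every document ("the") scores zero and is never selected.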
Default Behavior for Highlight
- fragment size: 100
- number of fragments: 5 (at most five fragments are returned per field)
- empty fields: no highlight field is shown, just like a normal search
GET /_search
{
"query" : {
"match": { "content": "kimchy" }
},
"highlight" : {
"fields" : {
}
}
}
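Instead of relying on the defaults, the highlight settings can be spelled out per field. The following Python sketch builds such a request body with the `fragment_size` and `number_of_fragments` options; the helper name and the chosen values are assumptions for the example, and the body is only printed.

```python
import json

def highlight_body(field, text, fragment_size, number_of_fragments):
    # Explicit highlight settings for one field, next to a match query on it.
    return {
        "query": {"match": {field: text}},
        "highlight": {
            "fields": {
                field: {
                    "fragment_size": fragment_size,
                    "number_of_fragments": number_of_fragments,
                }
            }
        },
    }

body = highlight_body("content", "kimchy", fragment_size=150, number_of_fragments=3)
print(json.dumps(body, indent=2))
```

Unlike the empty `fields` object above, naming the field here is what makes a highlight section appear in the hits.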
_source
The _source field contains the original JSON document body that was passed at index time. The _source field itself is not indexed (and thus is not searchable), but it is stored so that it can:
- be returned when executing fetch requests, like get or search;
- be used to reindex from one Elasticsearch index to another;
- be used to debug queries or aggregations by viewing the original document used at index time;
- etc.
Store vs Index
By default, field values are indexed to make them searchable, but they are not stored. This means that the field can be queried, but the original field value cannot be retrieved.
Usually this doesn’t matter. The field value is already part of the _source field, which is stored by default. If you only want to retrieve the value of a single field or of a few fields, instead of the whole _source, then this can be achieved with source filtering.
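A minimal Python sketch of source filtering: the first helper builds a search body that asks for only some fields via `_source`, and the second imitates locally what comes back. Both helper names and the sample document are assumptions; real source filtering also supports wildcards and nested paths, which this toy ignores.

```python
def source_filtered_body(query, includes):
    # Ask Elasticsearch to return only the listed fields of _source.
    return {"_source": includes, "query": query}

def apply_source_filter(source, includes):
    # Local imitation for top-level fields only (no wildcards, no nested paths).
    return {key: value for key, value in source.items() if key in includes}

# Hypothetical stored document.
stored = {"title": "Brown dog", "content": "A long body ...", "tags": ["pets"]}

body = source_filtered_body({"match": {"title": "dog"}}, ["title", "tags"])
returned = apply_source_filter(stored, ["title", "tags"])
print(body, returned)
```

This keeps the response small without having to mark individual fields as stored in the mapping.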
Elasticsearch Config
Client Version Differs from Server Version
The client must have the same major version (e.g. 2.x, or 5.x) as the nodes in the cluster. Clients may connect to clusters which have a different minor version (e.g. 2.3.x) but it is possible that new functionality may not be supported. Ideally, the client should have the same version as the cluster.
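The rule above is easy to check before connecting. A minimal sketch in Python, assuming version strings of the form `major.minor.patch`; the function names are made up for the example.

```python
def major(version: str) -> int:
    # "2.3.1" -> 2
    return int(version.split(".")[0])

def client_compatible(client_version: str, server_version: str) -> bool:
    # Only the major version has to be identical; minor versions may differ.
    return major(client_version) == major(server_version)

print(client_compatible("2.3.1", "2.4.0"))  # same major version
print(client_compatible("2.4.0", "5.0.0"))  # different major version
```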
Allow Remote Access
The initial config of Elasticsearch does not allow access from remote addresses. To allow remote access, we have to edit the config file (/etc/elasticsearch/elasticsearch.yml) to contain:
network.host: 0.0.0.0
File Number
As the docs say:
Lucene uses a very large number of files. At the same time, Elasticsearch uses a large number of sockets to communicate between nodes and HTTP clients.
Elasticsearch needs many file descriptors to work, so we may need to raise the maximum number of file descriptors for the server process. (On macOS, it may complain with the following warning/error.)
max file descriptors [10240] for elasticsearch process likely too low, consider increasing to at least [65536]
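To see what limit a process gets on a machine, the standard `resource` module of Python (Unix only) can be used. This is a sketch: it reports the limit of the Python process itself, which is only a hint for what Elasticsearch will get when started from the same shell. The threshold is the value from the warning above, and the helper name is an assumption.

```python
import resource

RECOMMENDED = 65536  # value suggested by the warning message

def is_too_low(limit, recommended=RECOMMENDED):
    # An unlimited limit is never too low.
    return limit != resource.RLIM_INFINITY and limit < recommended

# Soft and hard limits on open file descriptors for the current process.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f"soft={soft} hard={hard} too_low={is_too_low(soft)}")
```

If the soft limit is reported as too low, it has to be raised in the environment that launches Elasticsearch (shell or service configuration) before the node is started.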
Fail to Join Cluster (Yet to Be Solved)
I first started Elasticsearch on an Ubuntu server, then started another instance on a Mac with the same cluster name. The second instance complained that it failed to join the cluster. Starting multiple instances on the same machine works fine.
Things that have already been checked:
- the versions of Elasticsearch on the two machines are the same
- the Elasticsearch cluster names are the same