Elasticsearch Problem List (1): Concept Confusion

When using Elasticsearch, we ran into many problems. This post lists those problems and their solutions for future readers.

Elasticsearch Concepts

The first set of problems relates to some Elasticsearch concepts that confused us when we first adopted it.

Indices or Shards

In Elasticsearch, a shard is the smallest unit of searching and scaling. When we search within a single index, Elasticsearch forwards the search request to a primary or replica of every shard in that index, and then gathers the results from each shard.

So we can understand that:

Searching one index that has five primary shards is exactly equivalent to searching five indices that have one primary shard each.
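The scatter-gather behaviour described above can be sketched in plain Python. This is a toy model, not how Elasticsearch is implemented: shards are dictionaries, the score is a simple term count, and the names search_shard and search are made up for illustration.

```python
import heapq

def search_shard(shard, term):
    """Score every document in one shard; the score here is just the term count."""
    hits = []
    for doc_id, text in shard.items():
        score = text.lower().split().count(term)
        if score > 0:
            hits.append((score, doc_id))
    return hits

def search(shards, term, size=10):
    """Scatter the request to every shard, then gather and merge the top hits."""
    all_hits = []
    for shard in shards:
        all_hits.extend(search_shard(shard, term))
    return heapq.nlargest(size, all_hits)

# One index with two primary shards ...
index_with_two_shards = [{"1": "brown dog"}, {"2": "brown brown fox"}]
# ... versus two indices with one primary shard each.
index_a = [{"1": "brown dog"}]
index_b = [{"2": "brown brown fox"}]

same = search(index_with_two_shards, "brown") == search(index_a + index_b, "brown")
print(same)
```

From the point of view of the merge step, it makes no difference whether the shards belong to one index or to several.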

Because of this equivalence, and because the number of primary shards of an index is fixed once the index is created, Elasticsearch suggests scaling an application by rolling over to new indices.
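Why is the shard count fixed? Elasticsearch picks the shard for a document as hash(routing) % number_of_primary_shards, where the routing value defaults to the document id. The sketch below mimics that rule. zlib.crc32 is only a stand-in for the real hash function, and shard_for is a hypothetical helper name.

```python
import zlib

def shard_for(routing, num_primary_shards):
    """Pick a shard as hash(routing) % number_of_primary_shards.

    zlib.crc32 is a stand-in hash for this sketch, not the one Elasticsearch uses.
    """
    return zlib.crc32(routing.encode("utf-8")) % num_primary_shards

doc_ids = [f"doc-{i}" for i in range(20)]
with_five = {d: shard_for(d, 5) for d in doc_ids}
with_six = {d: shard_for(d, 6) for d in doc_ids}

# Documents whose shard number changes when the shard count changes.
moved = [d for d in doc_ids if with_five[d] != with_six[d]]
print(len(moved), "of", len(doc_ids), "documents would be looked up on a different shard")
```

If the number of primary shards changed, most documents would be looked up on the wrong shard, so adding capacity is done by adding new indices instead.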

Parent Child vs Nested

Say we have a Person class to search, and each person has some tags describing them:

class Person {
  List<Tags> tags;
}

In order to search for a person by tags, we can't use a normal object field in the mapping: it is flattened into a simple key-value format and loses the relationship between the inner object and the outer object. Instead, we can use a nested object or a parent-child relationship:

  • parent-child approach:
GET /company/branch/_search
{
  "query": {
    "has_child": {
      "type":       "employee",
      "score_mode": "max",
      "query": {
        "match": {
          "name": "Alice"
        }
      }
    }
  }
}
  • nested object:
GET /company/branch/_search
{
 "query": {
    "nested": {
      "path": "employee",
      "query": {
        "match": {
          "employee.name": "Alice"
        }
      }
    }
 }
}
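To see why a normal (flattened) object loses the relationship between inner fields, here is a small Python sketch of the flattening. It only imitates the idea. The flatten helper and the age field are made up for illustration.

```python
def flatten(doc, prefix=""):
    """Flatten inner objects into dotted keys, collecting values from arrays of objects."""
    flat = {}
    for key, value in doc.items():
        path = f"{prefix}{key}"
        items = value if isinstance(value, list) else [value]
        for item in items:
            if isinstance(item, dict):
                for sub_key, sub_values in flatten(item, path + ".").items():
                    flat.setdefault(sub_key, []).extend(sub_values)
            else:
                flat.setdefault(path, []).append(item)
    return flat

branch = {
    "employee": [
        {"name": "Alice", "age": 30},
        {"name": "Bob", "age": 45},
    ]
}
flat = flatten(branch)
print(flat)

# No single employee is named Alice AND aged 45, yet the flattened form matches both.
false_match = "Alice" in flat["employee.name"] and 45 in flat["employee.age"]
print(false_match)
```

Nested objects and parent-child relationships both keep each inner object as its own unit, which avoids this kind of cross-object match.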

So what's the difference between them?

The difference mainly lies in how they are stored: a nested object is stored together with the outer JSON document, whereas in a parent-child relationship the child documents are stored separately.

A parent-child relationship makes it easier to change either object independently, which is its advantage over nested objects. But it also has a limitation: a parent and its children have to live in the same shard, so that they can be joined dynamically and quickly (joining within one shard avoids network communication, which saves a lot of time).

must vs should

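must and should are both clauses of the bool query, and they behave differently depending on the context:

  • filter context:
    • must: the clause must match (no score is computed)
    • should: if the bool query has no must or filter clause, at least one should clause has to match
  • query context:
    • must: the clause must match, and it contributes to the score
    • should: if the bool query has a must or filter clause, a document can still match even when no should clause matches; matching should clauses only raise the score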
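The should rule in the list above can be written down as a small truth function. This is only a sketch of the default behaviour as we understand it. The exact defaults have varied between Elasticsearch versions, and bool_matches and default_minimum_should_match are hypothetical helper names rather than any client API.

```python
def default_minimum_should_match(has_must_or_filter, num_should):
    """At least one should clause is required only when nothing else constrains the query."""
    if num_should > 0 and not has_must_or_filter:
        return 1
    return 0

def bool_matches(must_results, filter_results, should_results):
    """Each argument is a list of booleans: did that clause match the document?"""
    if not all(must_results) or not all(filter_results):
        return False
    has_must_or_filter = bool(must_results or filter_results)
    needed = default_minimum_should_match(has_must_or_filter, len(should_results))
    return sum(should_results) >= needed

# With a must clause, should is optional and only affects the score.
print(bool_matches([True], [], [False]))
# Without must or filter, at least one should clause has to match.
print(bool_matches([], [], [False, False]))
```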

multi_match vs _all

The multi_match query provides a convenient shorthand way of running the same query against multiple fields.

If we list all fields in a multi_match query, in most cases we get the same result as querying the _all meta field. But there are some differences:

  • multi_match is more flexible: you can change the fields of your query dynamically;
  • multi_match lets you set a different boost per field, i.e. give different weights to different fields;
  • term frequencies and field lengths may differ between the _all field and the individual fields, which may affect scoring.
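As an illustration of per-field boosts, the following Python sketch builds a multi_match request body using the field^boost syntax. The field names title and body and the helper multi_match_query are assumptions for the example.

```python
import json

def multi_match_query(text, boosts):
    """Build a multi_match body; `boosts` maps field name -> weight."""
    fields = [
        f"{field}^{weight}" if weight != 1 else field
        for field, weight in boosts.items()
    ]
    return {"query": {"multi_match": {"query": text, "fields": fields}}}

# Hypothetical fields: a match in the title counts three times as much as one in the body.
body = multi_match_query("brown dog", {"title": 3, "body": 1})
print(json.dumps(body, indent=2))
```

Changing the set of fields or their weights is just a matter of passing a different dictionary, which is the flexibility the _all field cannot offer.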

Multiword vs More Like This

The match query makes multiword queries just as simple:

curl -XGET 'localhost:9200/my_index/my_type/_search?pretty' -H 'Content-Type: application/json' -d'
{
    "query": {
        "match": {
            "title": "BROWN DOG!"
        }
    }
}
'

Elasticsearch runs the query string through the analyzer of the target field and matches the resulting terms against that field.

The More Like This (MLT) query, on the other hand, is more complex. It works in the following steps:

  • selects a set of representative terms from the input documents,
  • forms a query using these terms,
  • executes the query and returns the results.

In other words, an MLT query first extracts the most representative terms from the input and then runs what is effectively a multiword query built from those terms.
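The three MLT steps can be imitated with a toy Python sketch. It uses a simple tf-idf ranking to choose terms and counts matching terms as the score. This is an assumption-laden illustration, not the Lucene implementation, and the function names are made up.

```python
import math
from collections import Counter

def representative_terms(text, corpus, max_query_terms=3):
    """Step 1: pick the terms of `text` with the highest tf-idf relative to `corpus`."""
    tf = Counter(text.lower().split())

    def idf(term):
        df = sum(1 for doc in corpus if term in doc.lower().split())
        return math.log((1 + len(corpus)) / (1 + df)) + 1

    # Sort by descending tf-idf, breaking ties alphabetically.
    ranked = sorted(tf, key=lambda t: (-tf[t] * idf(t), t))
    return ranked[:max_query_terms]

def more_like_this(text, corpus, max_query_terms=3):
    """Steps 2 and 3: form an OR query from the terms, run it, return ranked doc indices."""
    terms = representative_terms(text, corpus, max_query_terms)
    scored = []
    for i, doc in enumerate(corpus):
        words = set(doc.lower().split())
        score = sum(1 for t in terms if t in words)
        if score:
            scored.append((score, i))
    return [i for score, i in sorted(scored, key=lambda s: (-s[0], s[1]))]

corpus = [
    "the quick brown fox",
    "the lazy dog sleeps",
    "the brown dog barks",
]
print(representative_terms("the brown dog", corpus, max_query_terms=2))
print(more_like_this("the brown dog", corpus, max_query_terms=2))
```

The common word "the" is dropped in the first step because it appears in every document, so only the more distinctive terms drive the final query.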

Default Behavior for Highlight

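  • fragment size: 100 characters
  • number of fragments: 5 per field (setting number_of_fragments to 0 returns the whole field content, highlighted, instead of fragments)
  • empty fields: if fields is left empty, as in the request below, no highlight section is returned and the response looks like a normal search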
GET /_search
{
    "query" : {
        "match": { "content": "kimchy" }
    },
    "highlight" : {
        "fields" : {
        }
    }
}
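For comparison, here is a sketch of a request that names a field to highlight and spells the settings out explicitly. The values 100 and 5 are, to our knowledge, the documented defaults for fragment_size and number_of_fragments. The helper highlight_request is hypothetical.

```python
import json

def highlight_request(field, text, fragment_size=100, number_of_fragments=5):
    """Build a search body that highlights `field` with explicit highlight settings."""
    return {
        "query": {"match": {field: text}},
        "highlight": {
            "fields": {
                field: {
                    "fragment_size": fragment_size,
                    "number_of_fragments": number_of_fragments,
                }
            }
        },
    }

request_body = highlight_request("content", "kimchy")
print(json.dumps(request_body, indent=2))
```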

_source

The _source field contains the original JSON document body that was passed at index time. The _source field itself is not indexed (and thus is not searchable), but it is stored so that:

  • it can be returned when executing fetch requests, like get or search;
  • documents can be reindexed from one Elasticsearch index to another;
  • queries or aggregations can be debugged by viewing the original document used at index time;
  • and so on.

Store vs Index

By default, field values are indexed to make them searchable, but they are not stored. This means that the field can be queried, but the original field value cannot be retrieved.

Usually this doesn’t matter. The field value is already part of the _source field, which is stored by default. If you only want to retrieve the value of a single field or of a few fields, instead of the whole _source, then this can be achieved with source filtering.
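Source filtering can be requested by adding a _source list to the search body. The sketch below builds such a body in Python. The field names and the helper search_with_source_filter are assumptions for illustration.

```python
import json

def search_with_source_filter(field, text, source_fields):
    """Build a search body that returns only `source_fields` from each hit's _source."""
    return {
        "_source": list(source_fields),
        "query": {"match": {field: text}},
    }

# Hypothetical fields: search the content field, but return only title and date.
filtered_body = search_with_source_filter("content", "kimchy", ["title", "date"])
print(json.dumps(filtered_body, indent=2))
```

The full document is still stored in _source. The filter only trims what is sent back in the response.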

Elasticsearch Config

Client Version Differs from Server Version

The client must have the same major version (e.g. 2.x, or 5.x) as the nodes in the cluster. Clients may connect to clusters which have a different minor version (e.g. 2.3.x) but it is possible that new functionality may not be supported. Ideally, the client should have the same version as the cluster.
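The rule above is easy to check before connecting. This is a minimal sketch that only compares version strings. The helper names are made up and it does not query a real cluster.

```python
def major(version):
    """Return the major component of a version string such as '5.6.3'."""
    return int(version.split(".")[0])

def client_compatibility(client_version, server_version):
    """Classify a client/server pair according to the rule quoted above."""
    if client_version == server_version:
        return "ideal"
    if major(client_version) == major(server_version):
        return "ok, but newer functionality may be unsupported"
    return "incompatible"

print(client_compatibility("5.6.3", "5.6.3"))
print(client_compatibility("2.3.1", "2.4.0"))
print(client_compatibility("2.4.0", "5.6.3"))
```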

Allow Remote Access

The default Elasticsearch config does not allow access from remote addresses. To allow remote access, we have to edit the config file (/etc/elasticsearch/elasticsearch.yml) and set:

network.host: 0.0.0.0

File Number

As the docs say:

Lucene uses a very large number of files. At the same time, Elasticsearch uses a large number of sockets to communicate between nodes and HTTP clients.

Elasticsearch needs many file descriptors to work, so we may need to raise the max file descriptors limit for the server process (on macOS, it may complain with the following warning/error):

max file descriptors [10240] for elasticsearch process likely too low, consider increasing to at least [65536]
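To see what limit the current shell would hand to a process, we can read it with Python's standard resource module (Unix only). The 65536 threshold is taken from the warning above. The helper fd_limit_ok is a made-up name for this sketch.

```python
import resource

REQUIRED_FDS = 65536  # threshold taken from the warning message above

def fd_limit_ok(soft_limit, required=REQUIRED_FDS):
    """Return True if the soft file descriptor limit is at least `required`."""
    if soft_limit == resource.RLIM_INFINITY:
        return True
    return soft_limit >= required

# Limits of the current process, which child processes inherit by default.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f"soft={soft} hard={hard} ok={fd_limit_ok(soft)}")
```

This only reports the limit. Raising it is done at the operating system level (for example with ulimit -n in the shell that starts Elasticsearch).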

Fail to Join Cluster (Left to Be Solved)

I first started Elasticsearch on an Ubuntu server, then started another instance on a Mac with the same cluster name. The second instance complained that it failed to join the cluster. Starting multiple instances on the same machine works fine.

Things already checked:

  • the Elasticsearch versions on the two machines are the same
  • the Elasticsearch cluster names are the same

