Elasticsearch Learning (4): Mappings

“Today, we are going to dive into one of the most important settings in Elasticsearch: mappings. They determine how ES interprets our document JSON, how it analyzes fields and indexes documents, and how it searches for what we request,” Tony said during the technology sharing session.

Mapping and Analysis

“As we all know, common databases have different data types, and ES has its own set of common data types,” Tony said.

  • Text: string for full text search
  • Keyword: string for exact match
  • Whole number: byte, short, integer, long
  • Floating-point: float, double
  • Boolean: boolean
  • Date: date

"The text and keyword types are both strings, yet they seem to be different. Can you explain the difference?" someone asked.

Tony: “Good question. But before we get to their differences, we need to understand that data falls into two broad categories: exact values and full text.”

Exact Values vs Full Text

"Data in Elasticsearch can be broadly divided into two types: exact values and full text.

"Exact values are exactly what they look like. Examples are a date or a user ID, but can also include exact strings such as a username or an email address. The exact value Foo is not the same as the exact value foo. The exact value 2014 is not the same as the exact value 2014-09-15.

"Full text, on the other hand, refers to textual data, usually written in some human language, like the text of a tweet or the body of an email. We seldom want to match a whole full-text field exactly, so it is split into words or characters.

"As we might expect, each of the core data types (strings, numbers, booleans, and dates) is indexed slightly differently. To facilitate queries on full-text fields, Elasticsearch first analyzes the text and then uses the results to build an inverted index. Exact values, on the other hand, are indexed as they are.

“So, back to the question of how text and keyword differ: the difference lies in whether the field is analyzed and how it is stored in the inverted index,” Tony said.
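To make the distinction concrete, here is a minimal Python sketch of the idea, not Elasticsearch's actual implementation. The `analyze` function is a hypothetical stand-in for an analyzer (it only lowercases and splits on non-alphanumeric characters), and `build_index` is an illustrative helper that builds a toy inverted index either from analyzed terms (text-like) or from the untouched value (keyword-like).

```python
import re
from collections import defaultdict


def analyze(text):
    """Rough stand-in for an analyzer: lowercase, then split on non-alphanumerics."""
    return [tok for tok in re.split(r"[^0-9a-z]+", text.lower()) if tok]


def build_index(docs, analyzed):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, value in docs.items():
        # Text-like fields contribute one term per token;
        # keyword-like fields contribute the whole value as a single term.
        terms = analyze(value) if analyzed else [value]
        for term in terms:
            index[term].add(doc_id)
    return dict(index)


docs = {1: "Quick brown Foo", 2: "foo bar"}
text_index = build_index(docs, analyzed=True)
keyword_index = build_index(docs, analyzed=False)

print(sorted(text_index.get("foo", set())))         # [1, 2]
print(sorted(keyword_index.get("foo bar", set())))  # [2]
print(sorted(keyword_index.get("foo", set())))      # []
```

In this toy model, searching the analyzed index for foo finds both documents, while the exact-value index only matches the complete, case-sensitive string.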

Indexing

"We have said that ES uses the mapping to decide whether to analyze our document fields. But before that, in order to treat date fields as dates, numeric fields as numbers, and string fields as full-text or exact-value strings, Elasticsearch needs to know what type of data each field contains. This is actually the first job of a mapping: interpreting the values in our JSON as different types.

“When we index a document that contains a new field, or an entirely new type of document that has not been seen before, Elasticsearch uses dynamic mapping to try to guess the field data type from the basic data types available in JSON. But note that dynamic mapping may hide bugs in our programs and adds overhead to indexing. If our document fields are fixed and do not need to change, we can disable dynamic mapping with:” Tony said.

PUT /test_idx
{
  "settings": {
    "index.mapper.dynamic": false # disable type creation
  },
  "mappings": {
    "test_type": {
      "dynamic": "strict", # disable field addition
      "properties": {
        "field1": {
          "type": "keyword"
        }
      }
    }
  }
}
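The guessing step that dynamic mapping performs can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the real logic: `guess_type` and `guess_mapping` are hypothetical names, the date check is a single ISO-like regular expression (real date detection uses configurable formats), and strings are reported simply as text.

```python
import json
import re

# Assumption: only ISO-like dates (yyyy-MM-dd, optionally with a time) count as dates here.
_DATE_RE = re.compile(r"^\d{4}-\d{2}-\d{2}(T\d{2}:\d{2}(:\d{2})?)?$")


def guess_type(value):
    """Guess a field type from a parsed JSON value (simplified sketch)."""
    if value is None:
        return None  # a null value gives no hint, so no field type is guessed
    if isinstance(value, bool):  # must come before int: bool is a subclass of int
        return "boolean"
    if isinstance(value, int):
        return "long"
    if isinstance(value, float):
        return "float"
    if isinstance(value, str):
        return "date" if _DATE_RE.match(value) else "text"
    if isinstance(value, dict):
        return "object"
    if isinstance(value, list):
        # An array is typed by its first non-null element.
        for item in value:
            if item is not None:
                return guess_type(item)
        return None
    raise TypeError(f"unsupported JSON value: {value!r}")


def guess_mapping(doc):
    """Guess a type for every top-level field of a document."""
    return {field: guess_type(value) for field, value in doc.items()}


doc = json.loads(
    '{"name": "tony", "age": 30, "score": 1.5, "active": true, "joined": "2014-09-15"}'
)
print(guess_mapping(doc))
```

A string that merely looks like a date is typed as a date in this sketch, which hints at how a guessed mapping can hide a bug: the first document seen decides the type of the field.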

Completion

“Now that we understand the basic functionality of mappings, let's look at how to handle a special case: auto completion. Auto completion (i.e. search as you type) is a very handy feature for assisting user input. To get completion support from Elasticsearch, we can use the completion suggester,” Tony said.

“I have heard that ES also has prefix search. What is the difference between prefix search and the completion suggester?” someone asked.

“Thanks for your question. The difference between the completion suggester and prefix search is speed: the completion suggester builds its lookup structure ahead of time, at index time, which takes more space but is fast enough to respond instantly; prefix search is not fast enough for this use case,” Tony added.
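As a loose analogy only (neither function reflects Elasticsearch internals), the trade-off Tony describes, spending time and space up front so that each lookup is cheap, can be sketched in Python. `prefix_scan` and `PrefixIndex` are hypothetical names; the sorted list with binary search is a much simpler stand-in for whatever structure the suggester really builds.

```python
from bisect import bisect_left


def prefix_scan(names, prefix):
    """No preparation: every lookup tests every name."""
    return sorted(n for n in names if n.startswith(prefix))


class PrefixIndex:
    """Preparation up front: sort once, then answer each prefix with a binary search."""

    def __init__(self, names):
        self._sorted = sorted(set(names))

    def suggest(self, prefix, size=5):
        # All names sharing the prefix sit next to each other in sorted order,
        # starting at the first position where the prefix could be inserted.
        start = bisect_left(self._sorted, prefix)
        matches = []
        for name in self._sorted[start:start + size]:
            if not name.startswith(prefix):
                break
            matches.append(name)
        return matches


names = ["tony", "tom", "tina", "bob"]
index = PrefixIndex(names)
print(prefix_scan(names, "to"))  # ['tom', 'tony']
print(index.suggest("to"))       # ['tom', 'tony']
```

Both calls return the same names; the difference is only where the work happens, which is the point of the comparison.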

Mapping

"To use this feature, we have to add another field to the mapping, which pre-processes our target field for fast completion.

"Say we want to auto-complete user names as the user searches. We can define the user mapping like this:

PUT user
{
    "mappings": {
        "user" : {
            "properties" : {
                "name-suggest" : {
                    "type" : "completion"
                },
                "name" : {
                    "type": "keyword"
                }
            }
        }
    }
}

Indexing Document

"In Elasticsearch, we have to index the suggestions ourselves.

PUT user/user/1?refresh
{
    "name-suggest" : {
        "input": [ "tony", "tom" ],
        "weight" : 34
    }
}

“This looks somewhat strange and tedious. It seems this should be done automatically by binding the suggestion field to our target field and keeping it in sync when we add or remove values, which would have the following advantages:”

  • convenient
  • no need to sync suggestion and target

“And I am not sure why ES designed completion like this,” Tony added.
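Until then, one way to keep the two fields from drifting apart is to build them together on the application side. The sketch below is only an illustration: `build_user_doc` is a hypothetical helper, it assumes the `name` and `name-suggest` fields from the mapping above, and it produces the request body without sending it anywhere.

```python
import json


def build_user_doc(name, weight=1):
    """Build one document body whose suggestion input is derived from the name."""
    return {
        "name": name,
        "name-suggest": {
            "input": [name],   # the suggestion always mirrors the target field
            "weight": weight,  # higher weight ranks the suggestion higher
        },
    }


body = build_user_doc("tony", weight=34)
print(json.dumps(body, indent=2))
```

Because the suggestion is computed from the name in a single place, updating a user means rebuilding the whole body, so the two fields cannot get out of sync.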

Query

POST user/_search?pretty
{
    "suggest": {
        "user-suggest" : {
            "prefix" : "to",
            "completion" : {
                "field" : "name-suggest"
            }
        }
    }
}

“The query works like a common search, except that we use a suggest query. The weight of each suggestion becomes the _score of the match.”
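Reading the suggestions back out of the response could look like the following Python sketch. The `sample_response` is hand-written for illustration, not captured from a real cluster, and it assumes the usual shape in which each named suggester holds a list of entries with an `options` array; `extract_suggestions` is a hypothetical helper.

```python
def extract_suggestions(response, suggest_name):
    """Return (text, score) pairs for every option under the named suggester."""
    pairs = []
    for entry in response.get("suggest", {}).get(suggest_name, []):
        for option in entry.get("options", []):
            pairs.append((option["text"], option["_score"]))
    return pairs


# Hand-written sample; the ids and scores are made up for this example.
sample_response = {
    "suggest": {
        "user-suggest": [
            {
                "text": "to",
                "offset": 0,
                "length": 2,
                "options": [
                    {"text": "tony", "_id": "1", "_score": 34.0},
                    {"text": "tom", "_id": "2", "_score": 10.0},
                ],
            }
        ]
    }
}

print(extract_suggestions(sample_response, "user-suggest"))
# [('tony', 34.0), ('tom', 10.0)]
```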

Fuzzy

“The completion query also supports fuzzy queries, i.e. it tolerates some typos in the query, which is a very handy feature:”

POST user/_search?pretty
{
    "suggest": {
        "user-suggest" : {
            "prefix" : "to",
            "completion" : {
                "field" : "name-suggest",
                "fuzzy" : {
                    "fuzziness" : 2
                }
            }
        }
    }
}
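What a fuzziness of 2 means can be illustrated with edit distance: the number of single-character insertions, deletions, and substitutions needed to turn one string into another. The Python sketch below is a simplified model, not the suggester's real algorithm: it uses plain Levenshtein distance (a swapped pair of letters counts as two edits here), and `fuzzy_prefix_match` is a hypothetical helper that accepts a candidate when the query is within the allowed distance of some prefix of it.

```python
def edit_distance(a, b):
    """Plain Levenshtein distance, computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete a character from a
                curr[j - 1] + 1,     # insert a character into a
                prev[j - 1] + cost,  # substitute, or keep when equal
            ))
        prev = curr
    return prev[-1]


def fuzzy_prefix_match(query, candidate, fuzziness):
    """True if the query is within `fuzziness` edits of some prefix of the candidate."""
    return any(
        edit_distance(query, candidate[:k]) <= fuzziness
        for k in range(len(candidate) + 1)
    )


print(edit_distance("tony", "tomy"))          # 1
print(fuzzy_prefix_match("tno", "tony", 2))   # True
print(fuzzy_prefix_match("xyz", "tony", 2))   # False
```

Under this rule a very short query is within two edits of almost anything, which is presumably why real implementations add limits such as a minimum input length before fuzziness applies.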

"For more fuzziness configuration, see the fuzzy options of the completion suggester in the Elasticsearch documentation.

“Thanks for coming, that’s all,” Tony said.
