
Elasticsearch Learning (1): Introduction

Today we start learning Elasticsearch, a widely used search engine built on Lucene. This will be a series of posts covering the following subjects, among others:


We start with the basic concepts of Elasticsearch and then look at how to use it in our programs.

Basic Concepts

As a distributed system, Elasticsearch shares common concepts such as node and cluster; as a storage system, it also has its own terms for describing how data is organized, introduced below.

  • Node: single instance of Elasticsearch
  • Cluster: collection of nodes

Term         | Elasticsearch                                                      | Equivalent in RDBMS
-------------------------------------------------------------------------------------------------------
Index        | collection of different types of documents and document properties | database
Shards       | horizontal partition of an index                                   | -
Type/Mapping | collection of documents sharing a set of common fields             | table
Document     | collection of fields defined in JSON                               | row
UID          | unique identifier associated with every document                   | primary key

Structure

Document Oriented

As the Elasticsearch documentation says:

Elasticsearch is document oriented, meaning that it stores entire objects or documents. It not only stores them, but also indexes the contents of each document in order to make them searchable. In Elasticsearch, you index, search, sort, and filter documents—not rows of columnar data

Elasticsearch stores whole documents, unlike an RDBMS, which stores structured rows. From this perspective, Elasticsearch is more like a NoSQL store.

Immutable Document

As in many other NoSQL stores, a document in Elasticsearch is immutable: you can't update it in place; you can only delete the old one and add a new one to replace it. This design choice brings some advantages:

  • No need of locking
  • Cache friendly

The nature and implementation of Elasticsearch will be discussed in detail in the next post.
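
To make the "delete the old one and add a new one" idea concrete, here is a minimal Python sketch. The class `AppendOnlyStore` and its fields are hypothetical names made up for illustration; this is not how Elasticsearch or Lucene are actually implemented, only the general shape of the idea under the assumption of a single in-memory list.

```python
class AppendOnlyStore:
    """Toy store in which written entries are never modified in place."""

    def __init__(self):
        self.entries = []    # (doc_id, version, body) tuples, append-only
        self.deleted = set() # positions of entries that were replaced
        self.latest = {}     # doc_id -> position of the live entry

    def index(self, doc_id, body):
        """Add a document; an existing one is marked deleted, not edited."""
        version = 1
        old_pos = self.latest.get(doc_id)
        if old_pos is not None:
            self.deleted.add(old_pos)
            version = self.entries[old_pos][1] + 1
        self.entries.append((doc_id, version, body))
        self.latest[doc_id] = len(self.entries) - 1
        return version

    def get(self, doc_id):
        """Return the live body for doc_id, or None if it is unknown."""
        pos = self.latest.get(doc_id)
        return None if pos is None else self.entries[pos][2]


store = AppendOnlyStore()
first_version = store.index("1", {"last_name": "Smith"})
second_version = store.index("1", {"last_name": "Smyth"})
print(first_version, second_version, store.get("1"))
```

Because old entries are only marked as deleted and never rewritten, readers of an existing entry need no lock, and anything cached from it stays valid.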

Inverted Index

The inverted index is a common data structure for fast search. It stores the mapping from content (such as terms) to the locations where that content appears, and is named in contrast to the forward index, which maps each location to its content.

The indexes used in an RDBMS are, in this sense, also a kind of inverted index; they can be implemented with a search tree (typically a B-tree), a bit set, etc.

Example

Sample inputs:

  1. The quick brown fox jumped over the lazy dog
  2. Quick brown foxes leap over lazy dogs in summer

What a simple inverted index may look like:

Term      Doc_1  Doc_2
-------------------------
Quick   |       |  X
The     |   X   |
brown   |   X   |  X
dog     |   X   |
dogs    |       |  X
fox     |   X   |
foxes   |       |  X
in      |       |  X
jumped  |   X   |
lazy    |   X   |  X
leap    |       |  X
over    |   X   |  X
quick   |   X   |
summer  |       |  X
the     |   X   |
------------------------

If we search for quick brown, we will get the following results:

Term      Doc_1  Doc_2
-------------------------
brown   |   X   |  X
quick   |   X   |
------------------------
Total   |   2   |  1
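
The two tables above can be reproduced with a few lines of Python. This is only a sketch: it assumes a naive, case-sensitive whitespace tokenizer (which is why `The`/`the` and `Quick`/`quick` are separate terms, as in the table), and the function names `build_inverted_index` and `search` are made up for this post.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to the set of ids of the documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.split():  # naive tokenizer: split on whitespace
            index[term].add(doc_id)
    return index


def search(index, query):
    """For each document, count how many of the query terms it contains."""
    totals = defaultdict(int)
    for term in query.split():
        for doc_id in index.get(term, ()):
            totals[doc_id] += 1
    return dict(totals)


docs = {
    1: "The quick brown fox jumped over the lazy dog",
    2: "Quick brown foxes leap over lazy dogs in summer",
}
index = build_inverted_index(docs)
result = search(index, "quick brown")
print(result)
```

Doc_1 matches both terms and Doc_2 only one, which is the Total row above. A real analyzer would also lowercase and stem the terms, so that `Quick` and `foxes` could match `quick` and `fox`.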

Application

After installing Elasticsearch on our server, we need some basic knowledge of its configuration.

An Elasticsearch cluster consists of one or more nodes with the same cluster.name that are working together to share their data and workload.

Cluster Config

  • log file location

    /var/log/elasticsearch/cluster_name_xxx.log

  • config file location

    /etc/elasticsearch/elasticsearch.yml

Discovery Service

Elasticsearch also provides a node management service, which offers:

  • node discovery: automatic addition and removal of nodes via heartbeats
  • master election
  • cluster state management
  • etc
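
As a rough illustration of heartbeat-based membership, here is a generic Python sketch. It is not Elasticsearch's discovery implementation: the class `HeartbeatTracker`, the timeout value, and the explicit `now` parameter (used instead of a real clock so the example is deterministic) are all assumptions made for this post.

```python
class HeartbeatTracker:
    """Track which nodes are alive based on the time of their last heartbeat."""

    def __init__(self, timeout):
        self.timeout = timeout
        self.last_seen = {}  # node name -> time of the most recent heartbeat

    def heartbeat(self, node, now):
        """Record a heartbeat; return True if the node was just added."""
        added = node not in self.last_seen
        self.last_seen[node] = now
        return added

    def expire(self, now):
        """Remove and return the nodes whose last heartbeat is too old."""
        gone = [n for n, t in self.last_seen.items() if now - t > self.timeout]
        for node in gone:
            del self.last_seen[node]
        return gone


tracker = HeartbeatTracker(timeout=3)
added_1 = tracker.heartbeat("node-1", now=0)  # new node joins
added_2 = tracker.heartbeat("node-2", now=0)  # new node joins
again_1 = tracker.heartbeat("node-1", now=2)  # node-1 is still alive
removed = tracker.expire(now=4)               # node-2 has been silent too long
print(added_1, added_2, again_1, removed)
```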

Example

We can start multiple instances of Elasticsearch on a single computer if we just want a simple test:

By starting and stopping the server instances, we can see nodes being added and removed in the log file:

[INFO ][o.e.c.s.ClusterService   ] [-JQvRTY] added {{F5W-FEp}{F5W-FEpSTYGIWxZ7SUkcmQ}{dS9kWcnaSRK39fsw4apqFw}{127.0.0.1}{127.0.0.1:9302},}, reason: zen-disco-receive(from master [master {cJIcl8H}{cJIcl8HHQgyo9E58W4g6lg}{svKZPGm1QuGSsfpBRiNEaA}{127.0.0.1}{127.0.0.1:9300} committed version [9]])

[INFO ][o.e.c.s.ClusterService   ] [cJIcl8H] removed {{-JQvRTY}{-JQvRTYKQ929mMuIVp6YBA}{I22gPMBORv2EbCfcGnLg3Q}{127.0.0.1}{127.0.0.1:9301},}, reason: zen-disco-node-left({-JQvRTY}{-JQvRTYKQ929mMuIVp6YBA}{I22gPMBORv2EbCfcGnLg3Q}{127.0.0.1}{127.0.0.1:9301}), reason(left)[{-JQvRTY}{-JQvRTYKQ929mMuIVp6YBA}{I22gPMBORv2EbCfcGnLg3Q}{127.0.0.1}{127.0.0.1:9301} left]

Interaction

There are two ways to interact with Elasticsearch:

  • Java API
  • RESTful API

Java API

  • Node client: joins the cluster as a non-data node
  • Transport client: a lightweight client that connects to the cluster without joining it

Usage example of Java client’s configuration in Spring Boot (the combination with Spring will be discussed in the upcoming posts):

spring.data.elasticsearch.cluster-name=elasticsearch_demo
# setting cluster-nodes below makes the client a transport client,
# rather than a NodeClient, which is the default
spring.data.elasticsearch.cluster-nodes=192.168.1.100:9300

RESTful API

Elasticsearch also provides a RESTful API, so that other languages can access its search functionality, as the following shows:

curl -XGET 'localhost:9200/_count?pretty' -H 'Content-Type: application/json' -d'
{
    "query": {
        "match_all": {}
    }
}
'
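
Any language with an HTTP client can issue the same request. Below is a sketch in Python using only the standard library. The helper name `build_count_request` is made up for this post, and the address is assumed to be the same localhost:9200 as in the curl command. The request is only built here, not sent, so the snippet runs without a cluster.

```python
import json
import urllib.request


def build_count_request(base_url="http://localhost:9200"):
    """Build (but do not send) the same _count request as the curl command."""
    body = json.dumps({"query": {"match_all": {}}}).encode("utf-8")
    return urllib.request.Request(
        url=base_url + "/_count?pretty",
        data=body,
        headers={"Content-Type": "application/json"},
        method="GET",
    )


req = build_count_request()
print(req.get_method(), req.full_url)

# To actually send it, a node must be listening on that address:
# with urllib.request.urlopen(req) as resp:
#     print(resp.read().decode("utf-8"))
```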

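The Java API exposes the same functionality through method chains. Note that the node and transport clients above do not send HTTP requests: they talk to the cluster over Elasticsearch's native transport protocol (port 9300 in the configuration above), while the RESTful API is served over HTTP (port 9200).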

Utility

Sense is a Kibana app that provides a UI for interacting with Elasticsearch directly from the browser. As of Elasticsearch 5.0, Sense ships with Kibana (as the Console in Dev Tools), so there is no need to install it manually.

GET /megacorp/employee/_search
{
    "query" : {
        "match" : {
            "last_name" : "Smith"
        }
    }
}

Written with StackEdit.
