Elasticsearch Learning (1): Introduction

Today we start learning Elasticsearch, a widely used search engine based on Lucene. This will be a series of posts covering at least the following subjects:

Introduction: this post
Implementation
Search
Application:
- Similar Concepts
- With Spring

We start with the basic concepts of Elasticsearch and then look at how to use it in our programs.

Basic Concepts

As a distributed system, Elasticsearch shares common concepts such as Node and Cluster with other systems. As a storage system, it also has its own terms to describe its structure, introduced below.

Node: single instance of Elasticsearch
Cluster: collection of nodes

Term	Elasticsearch	Equivalent in RDBMS
Index	collection of documents of different types, plus their properties	database
Shard	horizontal partition of an index	-
Type/Mapping	collection of documents sharing a set of common fields	table
Document	collection of fields defined in JSON	row
UID	unique identifier associated with every document	primary key

Structure

Document Oriented

As the Elasticsearch documentation says:

Elasticsearch is document oriented, meaning that it stores entire objects or documents. It not only stores them, but also indexes the contents of each document in order to make them searchable. In Elasticsearch, you index, search, sort, and filter documents—not rows of columnar data

Elasticsearch stores whole documents, unlike an RDBMS, which stores structured rows. From this perspective, Elasticsearch is more like a NoSQL store.

Immutable Document

As in many other NoSQL stores, a document in Elasticsearch is immutable: you can’t modify it in place. An ‘update’ deletes the old document and indexes a new one to replace it. This design choice brings some advantages:

No need of locking
Cache friendly

A detailed discussion will follow in the next post, which covers the nature and implementation of Elasticsearch.
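The replace-instead-of-update idea can be illustrated with a toy in-memory store. This is only a sketch of the concept, not how Elasticsearch is implemented; the class name `VersionedStore` and its methods are hypothetical.

```python
class VersionedStore:
    """Toy store in which documents are never modified in place."""

    def __init__(self):
        # doc id -> (version, frozen tuple of (field, value) pairs)
        self._docs = {}

    def index(self, doc_id, fields):
        """Add a document, or replace the old one with a new version."""
        old_version = self._docs.get(doc_id, (0, ()))[0]
        # The old entry is dropped as a whole; nothing is edited in place.
        self._docs[doc_id] = (old_version + 1, tuple(sorted(fields.items())))
        return old_version + 1

    def get(self, doc_id):
        """Return (version, fields) as a fresh copy of the stored document."""
        version, frozen = self._docs[doc_id]
        return version, dict(frozen)


store = VersionedStore()
store.index("1", {"last_name": "Smith"})
# An "update" is just indexing a complete replacement document.
store.index("1", {"last_name": "Smith", "age": 32})
version, doc = store.get("1")
```

Because each stored version is a frozen value, readers never see a half-written document, which is the intuition behind the "no locking" and "cache friendly" points above.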

Inverted Index

The inverted index is a common data structure for fast search. It stores a mapping from content (such as words) to the locations where that content occurs. The name contrasts with a forward index, which maps each document to its content.

The indexes used in an RDBMS play a similar role: they map column values to the rows that contain them, and are typically implemented with structures such as search trees or bitmaps.

Example

Sample inputs:

The quick brown fox jumped over the lazy dog

Quick brown foxes leap over lazy dogs in summer

A simple inverted index for these inputs may look like this:

Term      Doc_1  Doc_2
-------------------------
Quick   |       |  X
The     |   X   |
brown   |   X   |  X
dog     |   X   |
dogs    |       |  X
fox     |   X   |
foxes   |       |  X
in      |       |  X
jumped  |   X   |
lazy    |   X   |  X
leap    |       |  X
over    |   X   |  X
quick   |   X   |
summer  |       |  X
the     |   X   |
------------------------

If we search for quick brown, we get the following result:

Term      Doc_1  Doc_2
-------------------------
brown   |   X   |  X
quick   |   X   |
------------------------
Total   |   2   |  1
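The tables above can be reproduced with a few lines of Python. This is a minimal sketch that assumes a naive, case-sensitive whitespace tokenizer (which is why Quick and quick are separate terms); the function names are made up for illustration and are not part of Elasticsearch.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.split():  # naive tokenizer: split on whitespace
            index[term].add(doc_id)
    return index


def search(index, query):
    """Count, per document, how many of the query terms it contains."""
    hits = defaultdict(int)
    for term in query.split():
        for doc_id in index.get(term, ()):
            hits[doc_id] += 1
    return dict(hits)


docs = {
    "Doc_1": "The quick brown fox jumped over the lazy dog",
    "Doc_2": "Quick brown foxes leap over lazy dogs in summer",
}
index = build_inverted_index(docs)
result = search(index, "quick brown")
print(result)  # {'Doc_1': 2, 'Doc_2': 1}
```

Doc_1 matches both terms and Doc_2 only matches brown, which agrees with the Total row above.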

Application

After installing Elasticsearch on our server, we need some basic knowledge of its configuration.

An Elasticsearch cluster consists of one or more nodes with the same cluster.name that work together to share their data and workload.

Cluster Config

Log file location:

/var/log/elasticsearch/cluster_name_xxx.log

Config file location:

/etc/elasticsearch/elasticsearch.yml

Discovery Service

Elasticsearch also provides a discovery service for node management, which handles:

node discovery: nodes are added and removed automatically, based on heartbeats
master election
cluster state management
etc
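To make the heartbeat idea concrete, here is a toy membership tracker. It only illustrates heartbeat-based adding and removal in general; it is not Elasticsearch's discovery implementation, and the `Membership` class, its methods, and the timeout value are assumptions made for this example.

```python
class Membership:
    """Toy heartbeat tracker: a node is alive while its heartbeats keep arriving."""

    def __init__(self, timeout):
        self.timeout = timeout  # max allowed silence, in the same unit as `now`
        self.last_seen = {}     # node name -> time of the last heartbeat

    def heartbeat(self, node, now):
        """Record a heartbeat; return True if the node was just added."""
        is_new = node not in self.last_seen
        self.last_seen[node] = now
        return is_new

    def expire(self, now):
        """Remove and return the nodes that have been silent for too long."""
        dead = sorted(n for n, t in self.last_seen.items()
                      if now - t > self.timeout)
        for node in dead:
            del self.last_seen[node]
        return dead


cluster = Membership(timeout=3)
cluster.heartbeat("node-1", now=0)  # node-1 joins
cluster.heartbeat("node-2", now=0)  # node-2 joins
cluster.heartbeat("node-1", now=2)  # node-1 is still alive
removed = cluster.expire(now=4)     # node-2 has been silent for 4 > 3
```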

Example

For a simple test, we can start multiple instances of Elasticsearch on a single computer.

By starting and stopping the server instances, we can see nodes being added and removed in the log file:

[INFO ][o.e.c.s.ClusterService   ] [-JQvRTY] added {{F5W-FEp}{F5W-FEpSTYGIWxZ7SUkcmQ}{dS9kWcnaSRK39fsw4apqFw}{127.0.0.1}{127.0.0.1:9302},}, reason: zen-disco-receive(from master [master {cJIcl8H}{cJIcl8HHQgyo9E58W4g6lg}{svKZPGm1QuGSsfpBRiNEaA}{127.0.0.1}{127.0.0.1:9300} committed version [9]])

[INFO ][o.e.c.s.ClusterService   ] [cJIcl8H] removed {{-JQvRTY}{-JQvRTYKQ929mMuIVp6YBA}{I22gPMBORv2EbCfcGnLg3Q}{127.0.0.1}{127.0.0.1:9301},}, reason: zen-disco-node-left({-JQvRTY}{-JQvRTYKQ929mMuIVp6YBA}{I22gPMBORv2EbCfcGnLg3Q}{127.0.0.1}{127.0.0.1:9301}), reason(left)[{-JQvRTY}{-JQvRTYKQ929mMuIVp6YBA}{I22gPMBORv2EbCfcGnLg3Q}{127.0.0.1}{127.0.0.1:9301} left]

Interaction

There are two ways to interact with Elasticsearch:

Java API
RESTful API

Java API

Node client: joins the cluster as a non-data node
Transport client: a lightweight client that does not join the cluster, but forwards requests to a node in it

An example of configuring the Java client in Spring Boot (the integration with Spring will be discussed in upcoming posts):

spring.data.elasticsearch.cluster-name=elasticsearch_demo
# the following setting makes the client a transport client,
# rather than a NodeClient, which is the default
spring.data.elasticsearch.cluster-nodes=192.168.1.100:9300

RESTful API

Elasticsearch also provides a RESTful API, so that other languages can access its search functionality, as the following example shows:

curl -XGET 'localhost:9200/_count?pretty' -H 'Content-Type: application/json' -d'
{
    "query": {
        "match_all": {}
    }
}
'
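The same request can be built from any language with an HTTP library. Below is a sketch using only the Python standard library; it assumes a node listening on localhost:9200, and the helper names `build_count_request` and `count_documents` are hypothetical. Only `count_documents` touches the network, so it is defined but not called here.

```python
import json
import urllib.request


def build_count_request(base_url="http://localhost:9200"):
    """Build (but do not send) the same _count request as the curl call."""
    body = json.dumps({"query": {"match_all": {}}}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/_count?pretty",
        data=body,
        headers={"Content-Type": "application/json"},
        method="GET",  # like curl -XGET with a request body
    )


def count_documents(base_url="http://localhost:9200"):
    """Send the request; needs a running node, so it is not called here."""
    with urllib.request.urlopen(build_count_request(base_url)) as resp:
        return json.loads(resp.read())["count"]


request = build_count_request()
print(request.get_method(), request.full_url)
```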

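Note that the node client and the transport client do not go through the RESTful API: they talk to the cluster over Elasticsearch's native transport protocol (port 9300 by default, as in the Spring configuration above). Only the separate Java REST client is a wrapper around the RESTful API, converting method calls into HTTP requests.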

Utility

Sense is a Kibana app that provides a UI for interacting with Elasticsearch directly from the browser. As of Elasticsearch 5.0, Sense ships with Kibana (as Console, under Dev Tools) and no longer needs to be installed manually. A request in Sense looks like this:

GET /megacorp/employee/_search
{
    "query" : {
        "match" : {
            "last_name" : "Smith"
        }
    }
}
