Elasticsearch Introduction Speech Note

Concepts

node, cluster

master: creating or deleting an index, or adding or removing a node from the cluster

index, mapping/type, document

Term	Elasticsearch	Equivalent in RDMS
Index	collection of different type of documents and document properties	database
Shards	the horizontal separation of a index	X
Type/Mapping	collection of dicuments sharing a set of common fields	table
Document	collection of fields defined in JSON	row
ID	Every document is associated with ID	primary key

Example

-------        --------          --------
 Node0          Node1             Node2
 shard0         shard1              
 shard1                           shard0
--------       --------          --------

one copy of replicas
two copy of data & two shards (shard0, shard1)
three nodes

Application

clients

Java

Restful

curl -XGET 'localhost:9200/my_index/my_type/_search' -d'
{
    "query": {
        "match": {
            "title": "BROWN DOG!"
        }
    }
}

Nature

use
debug
opt

Search

binary search
hashmap
skiplist
tree: binary tree, red-black tree, B+ tree
heap

Another data structure: Inverted index

Example

The quick brown fox jumped over the lazy dog

Quick brown foxes leap over lazy dogs in summer

A simple inverted index may looks like:

Term      Doc_1  Doc_2
-------------------------
Quick   |       |  X
The     |   X   |
brown   |   X   |  X
dog     |   X   |
dogs    |       |  X
fox     |   X   |
foxes   |       |  X
in      |       |  X
jumped  |   X   |
lazy    |   X   |  X
leap    |       |  X
over    |   X   |  X
quick   |   X   |
summer  |       |  X
the     |   X   |
------------------------

If we search for quick brown， we will get the following results:

Term      Doc_1  Doc_2
-------------------------
brown   |   X   |  X
quick   |   X   |
------------------------
Total   |   2   |  1

Store

Two design choice

Immutable
- No need of locking
  - programmer
  - performance: avoid distributed lock
- Cache friendly
  - file system cache
Updatable: multiple version index, example.
- Segments: fsync
- Memory buffer
- Disk cache: refresh – write new entry into disk cache, which is the lightweight way to make new documents visible to search
- Transaction log

Inverted Index + NoSQL

Algorithm

Three core algorithm

routing: search for id=1
- which shard: distributed hash table
- which node: round robin
analyzer: inverted index revisit
- character filter: could be used to strip out HTML, or to convert & characters to the word and
- tokenizer
- token filter: change/add/remove term

Term      Doc_1  Doc_2
-------------------------
brown   |   X   |  X
dog     |   X   |  X
fox     |   X   |  X
in      |       |  X
jump    |   X   |  X
lazy    |   X   |  X
over    |   X   |  X
quick   |   X   |  X
summer  |       |  X
the     |   X   |  X
------------------------

scoring
- term frequency: “test” – “a test in test” vs “test in”
- field length: “title” vs “content”
- inverse document frequency: across multiple documents

Questions

Document Deletion

When a document is “deleted,” it is actually just marked as deleted in the .del file. A document that has been marked as deleted can still match a query, but it is removed from the results list before the final query results are returned.

version-1	version-2
a	a’
b
c

version-1.del	version-2.del
b
a

Master election

Once a node joins, it will send a join request to the master.

There are two fault detection processes running. The first is by the master, to ping all the other nodes in the cluster and verify that they are alive. And on the other end, each node pings to master to verify if its still alive or an election process needs to be initiated.

ZenDiscovery service elect like this:

Each node calculates the lowest known node id and sends a vote for leadership to this node
If a node receives sufficiently many votes and the node also voted for itself, then it takes on the role of leader and starts publishing cluster state.

Update collision

document has version number
optimistic locking: CAS like, compare version and update if version match
auto retry

curl -XPOST 'http://localhost:9200/designs/shirt/1/_update?retry_on_conflict=5' -d'
{
   "script" : "ctx._source.votes += 1"
}'

On teh way

Blog Search

Elasticsearch Introduction Speech Note

Concepts

Example

Application

clients

Java

Restful

Nature

Search

Example

Store

Algorithm

Questions

Document Deletion

Master election

Update collision

Ref

标签

评论

发表评论

此博客中的热门博文

Spring Boot: Customize Environment

LevelDB Source Reading (1): Structure

LevelDB Source Reading (4): Concurrent Access