Concepts
node, cluster
master: creating or deleting an index, or adding or removing a node from the cluster
index, mapping/type, document
Term | Elasticsearch | Equivalent in RDMS |
---|---|---|
Index | collection of different type of documents and document properties | database |
Shards | the horizontal separation of a index | X |
Type/Mapping | collection of dicuments sharing a set of common fields | table |
Document | collection of fields defined in JSON | row |
ID | Every document is associated with ID | primary key |
Example
------- -------- --------
Node0 Node1 Node2
shard0 shard1
shard1 shard0
-------- -------- --------
- one copy of replicas
- two copy of data & two shards (shard0, shard1)
- three nodes
Application
clients
Java
Restful
curl -XGET 'localhost:9200/my_index/my_type/_search' -d'
{
"query": {
"match": {
"title": "BROWN DOG!"
}
}
}
Nature
- use
- debug
- opt
Search
- binary search
- hashmap
- skiplist
- tree: binary tree, red-black tree, B+ tree
- heap
Another data structure: Inverted index
Example
- The quick brown fox jumped over the lazy dog
- Quick brown foxes leap over lazy dogs in summer
A simple inverted index may looks like:
Term Doc_1 Doc_2
-------------------------
Quick | | X
The | X |
brown | X | X
dog | X |
dogs | | X
fox | X |
foxes | | X
in | | X
jumped | X |
lazy | X | X
leap | | X
over | X | X
quick | X |
summer | | X
the | X |
------------------------
If we search for quick brown
, we will get the following results:
Term Doc_1 Doc_2
-------------------------
brown | X | X
quick | X |
------------------------
Total | 2 | 1
Store
Two design choice
Immutable
- No need of locking
- programmer
- performance: avoid distributed lock
- Cache friendly
- file system cache
- No need of locking
Updatable: multiple version index, example.
- Segments: fsync
- Memory buffer
- Disk cache: refresh – write new entry into disk cache, which is the lightweight way to make new documents visible to search
- Transaction log
Inverted Index + NoSQL
Algorithm
Three core algorithm
- routing: search for id=1
- which shard: distributed hash table
- which node: round robin
- analyzer: inverted index revisit
- character filter: could be used to strip out HTML, or to convert & characters to the word and
- tokenizer
- token filter: change/add/remove term
Term Doc_1 Doc_2
-------------------------
brown | X | X
dog | X | X
fox | X | X
in | | X
jump | X | X
lazy | X | X
over | X | X
quick | X | X
summer | | X
the | X | X
------------------------
- scoring
- term frequency: “test” – “a test in test” vs “test in”
- field length: “title” vs “content”
- inverse document frequency: across multiple documents
Questions
Document Deletion
When a document is “deleted,” it is actually just marked as deleted in the .del file. A document that has been marked as deleted can still match a query, but it is removed from the results list before the final query results are returned.
version-1 | version-2 |
---|---|
a | a’ |
b | |
c |
version-1.del | version-2.del |
---|---|
b | |
a |
Master election
Once a node joins, it will send a join request to the master.
There are two fault detection processes running. The first is by the master, to ping all the other nodes in the cluster and verify that they are alive. And on the other end, each node pings to master to verify if its still alive or an election process needs to be initiated.
ZenDiscovery service elect like this:
- Each node calculates the lowest known node id and sends a vote for leadership to this node
- If a node receives sufficiently many votes and the node also voted for itself, then it takes on the role of leader and starts publishing cluster state.
Update collision
- document has version number
- optimistic locking: CAS like, compare version and update if version match
- auto retry
curl -XPOST 'http://localhost:9200/designs/shirt/1/_update?retry_on_conflict=5' -d'
{
"script" : "ctx._source.votes += 1"
}'
评论
发表评论