Elasticsearch Learning (2): Nature

In the last blog, we introduced some basics of Elasticsearch, such as its core concepts and configuration.

In this blog, we will dive deeper into the nature and core of Elasticsearch as a distributed search engine. Most of the following content is based on the book Elasticsearch: The Definitive Guide.

Distributed Nature

Master Node

Like most clusters, the nodes in an Elasticsearch cluster elect one of themselves to be the master node, which is in charge of cluster-wide management, such as creating or deleting an index, or adding or removing a node from the cluster. We don't need to worry that a single master node will become a bottleneck as traffic grows: the master node does not need to be involved in document-level changes or searches, so it stays out of the most common operations.

Shards

An index in Elasticsearch is divided into several shards, which are the scaling unit of that index. A shard can be seen as a small, self-contained instance of Elasticsearch, and requests can be handled by shards independently. So the more shards an index has, the more it can scale out across nodes.

However, the number of primary shards can be set only when an index is created and can never be changed afterwards. If you want a different number of shards, you can only delete the old index and create a new one.

Routing

When a request for a document comes into the cluster, it needs to be routed to a shard, as the following shows:

shard = hash(routing_key) % number_of_primary_shards

# The routing key is an arbitrary string, which defaults to the document’s _id but can also be set to a custom value.
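
As an illustration, here is a minimal Python sketch of this routing rule. Note that real Elasticsearch hashes the routing key with murmur3; the md5 hash below is just a stand-in to get a stable integer.

import hashlib

def route_to_shard(routing_key: str, number_of_primary_shards: int) -> int:
    # Elasticsearch itself uses a murmur3 hash of the routing key; md5 is
    # used here only to produce a stable integer for this illustration.
    digest = hashlib.md5(routing_key.encode("utf-8")).hexdigest()
    return int(digest, 16) % number_of_primary_shards

# The routing key defaults to the document's _id.
print(route_to_shard("my-doc-id", 5))  # the same key always maps to the same shard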

This explains why the number of shards is fixed: if the number of primary shards ever changed, all previous routing values would become invalid and documents could never be found, unless we re-indexed all the data, which is very expensive.

Although changing the number of shards is expensive and should be avoided, we can create an index with multiple shards up front, because relocation is relatively cheap: shards can be moved around between nodes, as the next section explains.

Primary and Replica

Index (write) requests require a majority of shard copies to be available before they proceed:

This is to prevent writing data to the “wrong side” of a network partition.

A majority is defined as follows:

int( (primary + number_of_replicas) / 2 ) + 1

# Here primary is always 1 (each shard group has exactly one primary copy), and
# number_of_replicas is the number of replicas configured in the index settings.
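
For example, a tiny sketch of this quorum computation in Python, where primary is 1 because each shard group has exactly one primary copy:

def write_quorum(number_of_replicas: int) -> int:
    primary = 1  # one primary copy per shard group
    return (primary + number_of_replicas) // 2 + 1

print(write_quorum(2))  # 1 primary + 2 replicas -> writes need 2 copies available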

In contrast, for read requests, the coordinating node will choose a different shard copy on every request in order to balance the load; it round-robins through all shard copies.

When a primary shard forwards changes to its replica shards, it doesn't forward the update request itself; instead it forwards the new version of the full document. The reason Elasticsearch forwards the full document rather than the change is that data may arrive out of order: if Elasticsearch forwarded just the change, it is possible that changes would be applied in the wrong order, resulting in a corrupt document.
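
A toy Python sketch of why this matters, contrasting a hypothetical "forward only the change" behavior with forwarding the full document plus a version number (the names and data are made up for illustration):

# Two successive updates to the same document; the intended final price is 120
updates = [("set price", 100), ("set price", 120)]

# If only the changes were forwarded, a replica receiving them
# in the wrong order would end up with a stale document:
replica = {}
for op, value in reversed(updates):   # simulate out-of-order delivery
    replica["price"] = value
print(replica)                        # {'price': 100} -> wrong final state

# Forwarding the full document with a version lets the replica
# simply keep the highest version it has seen, regardless of order:
versions = [(1, {"price": 100}), (2, {"price": 120})]
replica_doc, replica_version = None, 0
for version, doc in reversed(versions):
    if version > replica_version:
        replica_doc, replica_version = doc, version
print(replica_doc)                    # {'price': 120} -> correct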

Example
--------       --------          --------
 Node0          Node1             Node2
 shard1         shard1
 shard0                           shard0
--------       --------          --------
  • two primary shards (shard0, shard1)
  • one replica per primary shard, so every document is stored in two copies
  • three nodes

Inside a Shard

Inverted Index

We have already introduced the concept of the inverted index in the last blog; here we just provide some additional information about it.

The inverted index may hold a lot more information than the list of documents that contain a particular term. It may store:

  • a count of the number of documents that contain each term;
  • the number of times a term appears in a particular document;
  • the order of terms in each document;
  • the length of each document;
  • the average length of all documents;
  • and more.
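
To make this concrete, here is a minimal, hypothetical Python sketch (nothing like Lucene's real data structures) of an inverted index that also keeps this extra information:

from collections import defaultdict

docs = {
    1: "the quick brown fox",
    2: "the lazy brown dog",
}

postings = defaultdict(dict)   # term -> {doc_id: [positions]}
doc_lengths = {}               # doc_id -> number of terms

for doc_id, text in docs.items():
    terms = text.split()
    doc_lengths[doc_id] = len(terms)
    for position, term in enumerate(terms):
        postings[term].setdefault(doc_id, []).append(position)

term = "brown"
doc_count = len(postings[term])                                # documents containing the term
term_freq = {d: len(p) for d, p in postings[term].items()}     # occurrences per document
avg_doc_length = sum(doc_lengths.values()) / len(doc_lengths)  # average document length

print(doc_count, term_freq, doc_lengths, avg_doc_length)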

Immutable

We have already discussed that, in Elasticsearch, documents are immutable once written. This design brings the following advantages:

  • No need for locking
    • In a complex system, multi-thread/multi-process issues often trouble programmers; having no locks simplifies development;
    • No locking also means no deadlocks to worry about;
    • Performance gain: in a distributed system, distributed locking involves networking, which adds a performance penalty;
  • Cache friendly
    • File system cache;
    • Filter cache;

But we still need the ability to update documents, so we need to make the index updatable.

Updatable

In order to make documents updatable, Elasticsearch chooses to add new supplementary indices to reflect more-recent changes rather than rewriting the whole inverted index. Each inverted index can be queried in turn, starting with the oldest, and the results combined.

This behavior is much like that of common NoSQL data stores, such as the LevelDB we introduced before, which keep multiple versions of data.

Segments are immutable, so documents cannot be removed from older segments, nor can older segments be updated to reflect a newer version of a document. Instead, every commit point (a file listing the current segments) includes a .del file that lists which documents in which segments have been deleted.

When a document is “deleted,” it is actually just marked as deleted in the .del file. A document that has been marked as deleted can still match a query, but it is removed from the results list before the final query results are returned.

Small segments are merged into bigger segments (much like the compaction process in NoSQL stores), and this is the moment when old deleted documents are purged from the filesystem: deleted documents (or old versions of updated documents) are not copied over to the new, bigger segment.
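
A minimal Python sketch of the idea, not the real Lucene implementation: query segments from oldest to newest, skip documents marked as deleted, and merge by copying only live documents.

segments = [
    {1: "v1 of doc 1", 2: "v1 of doc 2"},   # older segment
    {1: "v2 of doc 1"},                      # newer segment holding the updated doc 1
]
deleted = {(0, 1)}   # (segment_index, doc_id): old version of doc 1 marked deleted in ".del"

def search(doc_id):
    result = None
    for i, segment in enumerate(segments):             # query each segment in turn, oldest first
        if doc_id in segment and (i, doc_id) not in deleted:
            result = segment[doc_id]                   # newer segments win when combining results
    return result

def merge():
    merged = {}
    for i, segment in enumerate(segments):
        for doc_id, doc in segment.items():
            if (i, doc_id) not in deleted:             # deleted/old versions are not copied over
                merged[doc_id] = doc
    return [merged]

print(search(1))   # "v2 of doc 1"
print(merge())     # one big segment containing only live documents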

The following lists some similarities between NoSQL and Elasticsearch:

LevelDB                 Elasticsearch
memory table            memory buffer
multiple level files    multiple segments
compaction              merging segments
log                     transaction log

Near Real-Time

We have made the index searchable with the inverted index and updatable with multiple segments. Now we need to make it fast.

As the documentation says:

With the development of per-segment search, the delay between indexing a document and making it visible to search dropped dramatically. New documents could be made searchable within minutes, but that still isn’t fast enough.

When an indexing request comes in, the document is first stored in the in-memory buffer, and new segments are only written to disk from time to time. The bottleneck is the disk.

Committing a new segment to disk requires an fsync to ensure that the segment is physically written to disk and that data will not be lost if there is a power failure.

But an fsync is costly; it cannot be performed every time a document is indexed without a big performance hit.

So Elasticsearch adds a layer of cache between memory and disk, the filesystem cache, which is much faster: a new segment can be written to the cache and then opened for search, with no need to wait for a disk write.

But the cache is still memory, so we may lose some updates if the server crashes before they are flushed to disk. So Elasticsearch also keeps a log of operations (the transaction log) to recover data if necessary.

In conclusion, Elasticsearch's write path is built from the following pieces:

  • Memory buffer (index buffer, used to collect new documents)
  • Refresh: write the buffer out as a new segment into the filesystem cache, which is the lightweight way to make new documents visible to search
  • Flush: fsync segments from the filesystem cache to disk, creating a commit point, so data survives a power failure
  • Transaction log: records every operation, so anything not yet flushed to disk can be recovered after a crash
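
A tiny, hypothetical Python sketch of this write path; index/refresh/flush/translog here are just names mirroring the description above, not any real client API.

class TinyIndex:
    def __init__(self):
        self.buffer = []            # in-memory index buffer
        self.cached_segments = []   # segments living only in the filesystem cache
        self.disk_segments = []     # segments fsynced to disk
        self.translog = []          # operation log for crash recovery

    def index(self, doc):
        self.translog.append(("index", doc))   # every operation is logged first
        self.buffer.append(doc)

    def refresh(self):
        # lightweight: turn the buffer into a searchable segment without touching disk
        if self.buffer:
            self.cached_segments.append(list(self.buffer))
            self.buffer.clear()

    def flush(self):
        # heavyweight: fsync cached segments to disk, then the translog can be cleared
        self.refresh()
        self.disk_segments.extend(self.cached_segments)
        self.cached_segments.clear()
        self.translog.clear()

    def search(self, word):
        segments = self.disk_segments + self.cached_segments
        return [doc for segment in segments for doc in segment if word in doc]

idx = TinyIndex()
idx.index("hello world")
idx.refresh()                # now searchable, even though nothing has hit the disk yet
print(idx.search("hello"))   # ['hello world']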

Ref

  • Elasticsearch: The Definitive Guide

Written with StackEdit.
