Today we start learning Elasticsearch, a widely used search engine built on Lucene. This will be a series of posts covering at least the following subjects:
- Introduction: this post
- Implementation
- Search
- Application
We start with the basic concepts of Elasticsearch and then look at how to use it in our programs.
Basic Concepts
As a distributed system, Elasticsearch shares some common concepts such as Node and Cluster. As a storage system, it also has its own terms to describe its structure, introduced below.
- Node: single instance of Elasticsearch
- Cluster: collection of nodes
Term | Elasticsearch | Equivalent in RDBMS |
---|---|---|
Index | collection of documents of different types, together with their properties | database |
Shard | a horizontal partition of an index | |
Type/Mapping | collection of documents sharing a set of common fields | table |
Document | collection of fields defined in JSON | row |
UID | unique identifier associated with every document | primary key |
Structure
Document Oriented
As the Elasticsearch documentation says:
Elasticsearch is document oriented, meaning that it stores entire objects or documents. It not only stores them, but also indexes the contents of each document in order to make them searchable. In Elasticsearch, you index, search, sort, and filter documents—not rows of columnar data
Elasticsearch stores whole documents, unlike an RDBMS, which stores structured rows. From this perspective, Elasticsearch is more like a NoSQL store.
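To make the terms above concrete, here is a minimal Python sketch of one document and where it lives. The index `megacorp`, the type `employee`, and the field `last_name` come from the query at the end of this post; the other field names and the UID `1` are made up for illustration.

```python
import json


def document_path(index, doc_type, uid):
    """Build the path that addresses one document: /index/type/uid."""
    return "/{}/{}/{}".format(index, doc_type, uid)


# A document is a collection of fields expressed as JSON.
# Only last_name appears in this post; the other fields are hypothetical.
employee = {
    "first_name": "John",
    "last_name": "Smith",
    "interests": ["sports", "music"],
}

path = document_path("megacorp", "employee", 1)
body = json.dumps(employee)

print(path)  # /megacorp/employee/1
print(body)
```

The whole object is stored as one unit, rather than being split into rows across several tables.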
Immutable Document
As in many other NoSQL stores, a document in Elasticsearch is immutable: you can't modify it in place. An 'update' deletes the old document and adds a new one to replace it. This design choice brings some advantages:
- No need of locking
- Cache friendly
A detailed discussion of the nature and implementation of Elasticsearch will follow in the next post.
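The delete-and-replace behaviour can be illustrated with a toy in-memory store. This is only a sketch of the idea, not how Elasticsearch is implemented; the class `ToyDocumentStore` and all of its methods are hypothetical names invented for this example.

```python
class ToyDocumentStore:
    """Toy append-only store: documents are never modified in place."""

    def __init__(self):
        self._segments = []  # append-only list of (uid, version, source)
        self._live = {}      # uid -> position in _segments of the live copy

    def index(self, uid, source):
        """Add a document; an existing uid is replaced, not edited."""
        if uid in self._live:
            version = self._segments[self._live[uid]][1] + 1
        else:
            version = 1
        # The old copy stays untouched; we only append and repoint the uid.
        self._segments.append((uid, version, dict(source)))
        self._live[uid] = len(self._segments) - 1
        return version

    def get(self, uid):
        """Return (version, source) of the live copy of a document."""
        _, version, source = self._segments[self._live[uid]]
        return version, dict(source)

    def stored_copies(self):
        """Number of copies ever written, including replaced ones."""
        return len(self._segments)


store = ToyDocumentStore()
v1 = store.index("1", {"last_name": "Smith"})
v2 = store.index("1", {"last_name": "Jones"})  # an "update" is a new copy

print(v1, v2)                 # 1 2
print(store.get("1"))         # (2, {'last_name': 'Jones'})
print(store.stored_copies())  # 2
```

Because written copies never change, readers need no lock and cached copies never go stale, which matches the two advantages listed above.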
Inverted Index
The inverted index is a common data structure for fast search. It stores the mapping from content (terms) to the locations where that content appears, and is named in contrast to the forward index, which maps each location to its content.
The term index as commonly used in RDBMS is actually a kind of inverted index, which can be implemented with a binary search tree, a bit set, etc.
Example
Sample inputs:
- The quick brown fox jumped over the lazy dog
- Quick brown foxes leap over lazy dogs in summer
A simple inverted index may look like this:
Term Doc_1 Doc_2
-------------------------
Quick | | X
The | X |
brown | X | X
dog | X |
dogs | | X
fox | X |
foxes | | X
in | | X
jumped | X |
lazy | X | X
leap | | X
over | X | X
quick | X |
summer | | X
the | X |
------------------------
If we search for quick brown, we get the following result:
Term Doc_1 Doc_2
-------------------------
brown | X | X
quick | X |
------------------------
Total | 2 | 1
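The tables above can be reproduced with a few lines of Python. This is a minimal sketch that assumes the simplest possible analysis: terms are split on whitespace and are case sensitive, so Quick and quick stay separate, as in the table. The function names are made up for illustration.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.split():
            index[term].add(doc_id)
    return index


def search(index, query):
    """Count, per document, how many query terms it contains."""
    totals = defaultdict(int)
    for term in query.split():
        for doc_id in index.get(term, ()):
            totals[doc_id] += 1
    return dict(totals)


docs = {
    "Doc_1": "The quick brown fox jumped over the lazy dog",
    "Doc_2": "Quick brown foxes leap over lazy dogs in summer",
}

index = build_inverted_index(docs)
totals = search(index, "quick brown")

print(sorted(index["brown"]))  # ['Doc_1', 'Doc_2']
print(totals["Doc_1"], totals["Doc_2"])  # 2 1
```

Doc_1 matches both terms and Doc_2 only one, so Doc_1 ranks first, as in the Total row.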
Application
After installing Elasticsearch on our server, we need some basic knowledge of its configuration.
An Elasticsearch cluster consists of one or more nodes with the same cluster.name that work together to share their data and workload.
Cluster Config
log file location
/var/log/elasticsearch/cluster_name_xxx.log
config file location
/etc/elasticsearch/elasticsearch.yml
Discovery Service
Elasticsearch also provides a node management service, which offers:
- node discovery: automatic addition and removal of nodes via heartbeat
- master election
- cluster state management
- etc
Example
We can start multiple instances of Elasticsearch on a single machine if we just want a simple test.
By starting and stopping the server instances, we can see nodes being added and removed in the log file:
[INFO ][o.e.c.s.ClusterService ] [-JQvRTY] added {{F5W-FEp}{F5W-FEpSTYGIWxZ7SUkcmQ}{dS9kWcnaSRK39fsw4apqFw}{127.0.0.1}{127.0.0.1:9302},}, reason: zen-disco-receive(from master [master {cJIcl8H}{cJIcl8HHQgyo9E58W4g6lg}{svKZPGm1QuGSsfpBRiNEaA}{127.0.0.1}{127.0.0.1:9300} committed version [9]])
[INFO ][o.e.c.s.ClusterService ] [cJIcl8H] removed {{-JQvRTY}{-JQvRTYKQ929mMuIVp6YBA}{I22gPMBORv2EbCfcGnLg3Q}{127.0.0.1}{127.0.0.1:9301},}, reason: zen-disco-node-left({-JQvRTY}{-JQvRTYKQ929mMuIVp6YBA}{I22gPMBORv2EbCfcGnLg3Q}{127.0.0.1}{127.0.0.1:9301}), reason(left)[{-JQvRTY}{-JQvRTYKQ929mMuIVp6YBA}{I22gPMBORv2EbCfcGnLg3Q}{127.0.0.1}{127.0.0.1:9301} left]
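If we want to follow membership changes programmatically, lines like the two above can be parsed with a regular expression. The sketch below assumes the exact layout shown in these two log lines; other Elasticsearch versions may format them differently. The function name and the dictionary keys are made up, and the sample lines are cut off after the node description to keep them short.

```python
import re

# [reporter] added|removed {{node}{id}{ephemeral id}{host}{address},}
EVENT = re.compile(
    r"\[(?P<reporter>[^\]]+)\] (?P<action>added|removed) "
    r"\{\{(?P<node>[^}]+)\}\{[^}]+\}\{[^}]+\}\{[^}]+\}\{(?P<address>[^}]+)\}"
)


def parse_membership_event(line):
    """Return a dict describing the event, or None if the line is unrelated."""
    match = EVENT.search(line)
    return match.groupdict() if match else None


# Shortened copies of the two log lines above (the reason part is cut).
LOG_LINES = [
    "[INFO ][o.e.c.s.ClusterService ] [-JQvRTY] added "
    "{{F5W-FEp}{F5W-FEpSTYGIWxZ7SUkcmQ}{dS9kWcnaSRK39fsw4apqFw}"
    "{127.0.0.1}{127.0.0.1:9302},}",
    "[INFO ][o.e.c.s.ClusterService ] [cJIcl8H] removed "
    "{{-JQvRTY}{-JQvRTYKQ929mMuIVp6YBA}{I22gPMBORv2EbCfcGnLg3Q}"
    "{127.0.0.1}{127.0.0.1:9301},}",
]

events = [parse_membership_event(line) for line in LOG_LINES]
for event in events:
    print(event["action"], event["node"], event["address"])
```

The first event says that node F5W-FEp on port 9302 joined the cluster; the second says that node -JQvRTY on port 9301 left.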
Interaction
There are two ways to interact with Elasticsearch:
- Java API
- RESTful API
Java API
- Node client: joins the cluster as a non-data node
- Transport client: a lightweight client that sends requests to a remote cluster without joining it
An example of configuring the Java client in Spring Boot (the integration with Spring will be discussed in upcoming posts):
spring.data.elasticsearch.cluster-name=elasticsearch_demo
# setting cluster-nodes makes the client a transport client
# rather than a NodeClient, which is the default
spring.data.elasticsearch.cluster-nodes=192.168.1.100:9300
RESTful API
Elasticsearch also provides a RESTful API so that other languages can access its search functionality, as the following shows:
curl -XGET 'localhost:9200/_count?pretty' -H 'Content-Type: application/json' -d'
{
  "query": {
    "match_all": {}
  }
}
'
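Note that the node client and the transport client do not go through the RESTful API: they talk to the cluster over Elasticsearch's own transport protocol (port 9300 by default, as in the Spring Boot configuration above), while the RESTful API is served over HTTP (port 9200 by default). Both expose the same functionality; the Java API turns method chain invocations into requests to the cluster.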
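Any language with an HTTP library can issue the same `_count` request as the curl command above. Here is a minimal Python sketch using only the standard library; the function names are made up, and it assumes a node listening on `localhost:9200` as in the curl example. Building the request is kept separate from sending it, so the request can be inspected without a running cluster.

```python
import json
import urllib.request


def build_count_request(base_url="http://localhost:9200"):
    """Build the same request as the curl example: GET /_count with a body."""
    body = json.dumps({"query": {"match_all": {}}}).encode("utf-8")
    return urllib.request.Request(
        url=base_url + "/_count?pretty",
        data=body,
        headers={"Content-Type": "application/json"},
        method="GET",
    )


def count_documents(base_url="http://localhost:9200"):
    """Send the request; this needs a running Elasticsearch node."""
    with urllib.request.urlopen(build_count_request(base_url)) as response:
        return json.load(response)["count"]


request = build_count_request()
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
```

Calling `count_documents()` against a live node should return the number of documents matched by the query.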
Utility
Sense is a Kibana app that provides a UI for interacting with Elasticsearch directly from the browser. From Elasticsearch 5.0 on, Sense ships with Kibana (as the Console) and no longer needs to be installed manually.
GET /megacorp/employee/_search
{
  "query" : {
    "match" : {
      "last_name" : "Smith"
    }
  }
}
Written with StackEdit.