
Elasticsearch Problem Lists (2): With Spring

In the last blog, we introduced some confusions about basic Elasticsearch concepts and some configuration problems we met. Now we come to the problems we came across when using Elasticsearch in an application with the help of Spring Data Elasticsearch.

With Spring

After understanding Elasticsearch and configuring the server, we need to write code to interact with it. We chose the Spring Data Elasticsearch framework to assist our implementation. The following are the problems we met when using Spring to access Elasticsearch.

  • spring-boot-starter-data-elasticsearch: 1.5.3.RELEASE
  • Elasticsearch server: 2.4.x

Connection

Clients

When using Java to access Elasticsearch, there are two types of clients we can choose from to communicate with the server:

  • Transport Client: this client is not part of the cluster; it just communicates with the server
  • Node Client: this client becomes part of the cluster – it stores data shards and responds to search requests

In our case, we just want to communicate with a dedicated Elasticsearch cluster, so we choose the Transport Client.

To configure a Transport Client, we can do it in Java code:

@Bean
public Client elasticClient() {
    Settings settings = Settings.builder().put(ClusterName.SETTING, "demo").build();
    return TransportClient.builder().settings(settings).build()
            .addTransportAddress(new InetSocketTransportAddress(new InetSocketAddress("xxx", 9300)));
}

or, even cleaner, using Spring Boot’s property file:

spring.data.elasticsearch.cluster-name=demo
# the following determines the client type to be transport, rather than node
spring.data.elasticsearch.cluster-nodes=xxx:9300

Index Definition

Duplicate id

If we just mark our id field with @Id, we will get a duplicated id in _source:

"_index" : "file-8947",
"_type" : "file",
"_id" : "7685",
"_source" : {
  "id" : "7685",
  "name" : "directory3",
  "uploadRoleId" : "4353",
  "type" : 1
}

and because Spring Data uses Jackson to transform the object to JSON:

indexRequestBuilder.setSource(resultsMapper.getEntityMapper().mapToString(query.getObject()));

we can mark our id with @JsonIgnore to remove the field from _source.
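For illustration, a minimal sketch of such a document class (the FilePO name is hypothetical; the fields mirror the _source above):

```java
// Hypothetical document class: @JsonIgnore keeps the id out of _source,
// while @Id still maps the value to the _id meta field.
@Document(indexName = "file-8947", type = "file")
public class FilePO {

    @Id
    @JsonIgnore
    private String id;

    private String name;
    private String uploadRoleId;
    private int type;
}
```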

DateFormat

When defining a Field, we will notice that there is a DateFormat to fill in. This represents the format we want Elasticsearch to use to interpret the JSON we send to it.

In JSON documents, dates are represented as strings. Elasticsearch uses a set of formats to recognize and parse these strings into a long value representing milliseconds-since-the-epoch in UTC.
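For example, a date field with an explicit format might look like the following sketch (FieldType.Date and the DateFormat enum come from Spring Data Elasticsearch; the field name is illustrative):

```java
// The format tells Elasticsearch how to parse the incoming JSON string
// into milliseconds-since-the-epoch; date_time expects strings like
// "2017-05-01T12:30:45.000Z".
@Field(type = FieldType.Date, format = DateFormat.date_time)
private Date createdAt;
```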

FieldIndex: no vs not_analyzed

FieldIndex can be used to specify how Elasticsearch will handle a field:

  • analyzed: the field is a string and will be processed by an analyzer;
  • not_analyzed: the field is a string, but does not need to be analyzed; it will be stored as an exact value;
  • no: do not index this field at all, i.e. it is not searchable by filter, query, etc.

In the latest Spring Data Elasticsearch builds, this element has been replaced by a boolean that says whether to index the field, plus field types that distinguish analyzed strings from exact-value strings.
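In that newer API (Spring Data Elasticsearch 3.x, targeting Elasticsearch 5+), the three cases above look roughly like the following sketch; note this does not apply to the 2.x versions used in this post:

```java
// Newer annotation style: FieldType distinguishes analyzed (Text) from
// exact-value (Keyword) strings, and index is a plain boolean.
@Field(type = FieldType.Text)                   // analyzed
private String description;

@Field(type = FieldType.Keyword)                // not_analyzed equivalent
private String status;

@Field(type = FieldType.Keyword, index = false) // "no": not searchable
private String rawPayload;
```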

Type Auto Detection Failure

In our project, if we don’t specify the field type but do specify the index mode, like the following shows:

@Field(index = FieldIndex.not_analyzed)
private String modifier;

Spring will log an exception message:

AbstractElasticsearchRepository : failed to load elasticsearch nodes : org.elasticsearch.index.mapper.MapperParsingException: No type specified for field [modifier]
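The fix is simply to specify the type explicitly, as in the other field declarations in this post:

```java
// Declaring the type avoids the MapperParsingException above.
@Field(type = FieldType.String, index = FieldIndex.not_analyzed)
private String modifier;
```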

Repo Definition

Class Cast When Using Slice

When we use Slice as the return value of our query, as the Spring Data documentation suggests, Spring complains with a class cast exception.

After reading the source code, we find that Spring does not treat this query as a paged query:

public final boolean isPageQuery() {
    return org.springframework.util.ClassUtils.isAssignable(Page.class, this.unwrappedReturnType);
}

Because Slice is the supertype of Page, a Slice return type is not assignable to Page.class. As a result, the query is implemented using queryForObject, which only returns one result and causes the exception.
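Until Slice is supported, a straightforward workaround is to declare Page as the return type; since Page extends Slice, callers that only need a Slice still work (the method and entity names below are illustrative):

```java
// Page makes isPageQuery() return true, so Spring builds a real paged
// query instead of falling back to queryForObject.
Page<Announcement> findByTitle(String title, Pageable pageable);
```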

findBy vs findAllBy

Spring seems to make no distinction between the following two notations:

List<Announcement> findByTitle(String title);
List<Announcement> findAllByTitle(String title);

They will produce exactly the same JSON body.

Find by _all

If we want to match against the content of the _all meta field, we can’t write a repo method as for a common field, because Spring can’t find such a field in our document class.

We can do the following as a workaround:

@Query("{\"bool\" : {\"must\" : [ {\"match\" : {\"?0\" : \"?1\"}} ]}}")
Page<MyDoc> getbyAll(String field, String query, Pageable pageable);

@Query

The content of the @Query annotation must be a complete JSON object, including the outer braces:

{"bool" : {"should" : [ {"match" : {"?0" : "?1"}} ]}}

Otherwise, if we miss a { like the following:

@Query(" \"multi_match\": {\n" +
        "        \"query\":    \"?0\",\n" +
        "        \"fields\":   [ \"name^2\", \"path\" ]\n" +
        "    }" +
        "}")
Page<Affair> findByNameOrPath(String info, Pageable pageable);

Spring will fail to interpret the query, and the simple query becomes a strange query_binary:

nested: SearchParseException[failed to parse search source [{"from":0,"size":10,"query_binary":"..."}]]
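Restoring the missing opening brace gives a complete JSON object and a working query (same repository method as above):

```java
// The annotation value is now a complete JSON object, so Spring parses
// it as a real query instead of sending it as query_binary.
@Query("{\"multi_match\": {\n" +
        "        \"query\":    \"?0\",\n" +
        "        \"fields\":   [ \"name^2\", \"path\" ]\n" +
        "    }" +
        "}")
Page<Affair> findByNameOrPath(String info, Pageable pageable);
```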

Add Implementation in Repo

Spring Data Elasticsearch provides a very convenient way to implement many simple queries, auto-generating the query from the method name and parameters:

Page<RolePO> findByTaskIdAndTitle(Long taskId, String title, Pageable pageable);

Or by specifying a string query:

@Query("{\"bool\" : {\"should\" : [ {\"match\" : {\"?0\" : \"?1\"}} ]}}")
Page<AnnouncementPO> findByAll(String field, String info, Pageable pageable);

But sometimes we need a more complex query and the functionality of Spring Data Elasticsearch at the same time. We can add such methods as follows:

interface UserRepositoryCustom {
  public void someCustomMethod(User user);
}

@Component
class UserRepositoryImpl implements UserRepositoryCustom {
  public void someCustomMethod(User user) {
  }
}

interface UserRepository extends CrudRepository<User, Long>, UserRepositoryCustom {
  // Declare query methods here
}

Two points to notice:

  • the Impl postfix of the implementation class name, compared to the core repository interface;
  • @Component, so that the implementation can be found by Spring.

Searching

Nested Class Searching

Say we have Tag as a nested object in class A. If we want to search by tag to find the outer class A, we have to add a toString() method to Tag, like the following:

@Field(type = FieldType.Nested)
private List<Tag> tags;


public class Tag {

    @Field(type = FieldType.String, index = FieldIndex.not_analyzed)
    private String des;

    // **have to add toString()**
}

Otherwise, Spring will fail to convert the query:

 "query_string" : {
   "query" : "com.superid.query.Tag@5d1d9d73",
   "fields" : [ "tags" ]
 }

This is because Spring uses toString() to construct the JSON query:

CriteriaQueryProcessor#processCriteriaEntry(..)

private QueryBuilder processCriteriaEntry(Criteria.CriteriaEntry entry,/* OperationKey key, Object value,*/ String fieldName) {
    Object value = entry.getValue();
    if (value == null) {
        return null;
    }
    OperationKey key = entry.getKey();
    QueryBuilder query = null;

    String searchText = StringUtils.toString(value);
    //...
}
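So a minimal fix is to make toString() return the searchable text, e.g. the des value (a sketch, assuming des is what we want to match on):

```java
// With this toString(), the generated query text becomes the tag's des
// value instead of "com.superid.query.Tag@5d1d9d73".
@Override
public String toString() {
    return des;
}
```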

Page Count From 0 or 1?

When we use Page as the return value of our query, we should pass a Pageable parameter to specify the page. What should be noticed is that page numbering starts from 0, not 1.
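For example, with Spring Data’s PageRequest (reusing the earlier findByTaskIdAndTitle repository method; the argument values are illustrative):

```java
// Page 0 is the first page of 10 hits; page 1 is the next 10.
Page<RolePO> first  = roleRepo.findByTaskIdAndTitle(2L, "dev", new PageRequest(0, 10));
Page<RolePO> second = roleRepo.findByTaskIdAndTitle(2L, "dev", new PageRequest(1, 10));
```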

Completion

In an earlier blog, we introduced how to do auto completion in pure Elasticsearch. Here, we focus on how to do it with Spring Data Elasticsearch.

Mapping

In order to use the auto-complete feature, we can use a JSON file or @CompletionField to define the mapping.

The concise way, using the annotation:

@CompletionField()
private Completion suggest;

Or the more powerful but tedious way:

{
    "file" : {
        "properties" : {
            "title" : { "type" : "string" },
            "suggest" : { "type" : "completion",
                "analyzer" : "simple",
                "search_analyzer" : "simple"
            }
        }
    }
}

Then we can refer to the mapping by @Mapping:

@Setting(settingPath = "elasticsearch-settings.json")
@Document(indexName = "file", type = "file", shards = 1, replicas = 0, createIndex = true, refreshInterval = "-1")
@Mapping(mappingPath = "/mappings/file-mapping.json")
public class File {...}

Index

We can index it like a common entity, for example through the repository:

fileRepo.save(new File(...));

Query

The ElasticsearchTemplate has the method for query suggest:

public SuggestResponse suggest(SuggestBuilder.SuggestionBuilder<?> suggestion, String... indices);
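A usage sketch against the mapping above (the suggestion and index names are illustrative; CompletionSuggestionBuilder comes from the Elasticsearch 2.x client):

```java
// Build a completion suggestion on the "suggest" field and run it
// against the "file" index.
CompletionSuggestionBuilder suggestion =
        new CompletionSuggestionBuilder("file-suggest")
                .field("suggest")
                .text("dir")
                .size(5);
SuggestResponse response = esTemplate.suggest(suggestion, "file");
```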

Dynamic Index Creation/Using

Elasticsearch recommends the use of rolling indices, which can be used to scale our application. Spring Data Elasticsearch currently doesn’t support this directly, but we can work around it using Spring EL.

First, we define a bean to be used as the suffix of the index name:

@Bean
public Suffix suffix(){
    return new Suffix();
}

Then, we can use Spring Expression Language to define our index name:

@Document(indexName = "role_#{suffix.toString()}", type = "role")
public class Role {}

Now we can change the suffix to access a different index:

suffix.setSuffix("123");
roleRepo.save(new Role("7", "后端开发", false, 2L, taskId));
suffix.setSuffix("234");
roleRepo.save(new Role("3", "前端架构", false, 2L, taskId));

A Suffix example can be found here.
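For reference, a minimal Suffix sketch (an assumption of what the linked class looks like): it only needs to hold a mutable string and expose it via toString(), which the SpEL expression calls.

```java
// Minimal holder for the index suffix; @Document's SpEL expression
// reads it through toString() each time the index name is resolved.
public class Suffix {

    private String suffix = "";

    public void setSuffix(String suffix) {
        this.suffix = suffix;
    }

    @Override
    public String toString() {
        return suffix;
    }
}
```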

Furthermore, I have already submitted a pull request for this kind of utility class to assist with rolling indices.

Search Across Index

If we have to search across multiple indices, Spring can’t generate the method for us; we have to write the query manually:

SearchQuery searchQuery = new NativeSearchQueryBuilder()
        .withQuery(matchQuery("title", query))
        .withIndices("role_*", "-role_xxx")
        .build();
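The query is then executed through the template rather than a repository method (queryForPage is on ElasticsearchTemplate; esTemplate and Role reuse earlier names):

```java
// "-role_xxx" in withIndices excludes that index from the wildcard match.
Page<Role> result = esTemplate.queryForPage(searchQuery, Role.class);
```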

Partial Update

Sometimes, we want to do partial update:

POST /website/blog/1/_update
{
   "doc" : {
      "tags" : [ "testing" ],
      "views": 0
   }
}

With Spring Data Elasticsearch, the same partial update looks like:

IndexRequest indexRequest = new IndexRequest();
indexRequest.source("name", file.getName());
UpdateQuery updateQuery = new UpdateQueryBuilder().withId(file.getId())
    // class is used to get index and type
    .withClass(FilePO.class)
    // indexRequest will be used as 'doc'
    .withIndexRequest(indexRequest).build();
template.update(updateQuery);

Debug

In this section, we introduce some useful utilities for debugging Elasticsearch.

Explain

curl -XGET 'localhost:9200/_search?pretty' -H 'Content-Type: application/json' -d'
{
    "explain": true,
    "query" : {
        "term" : { "user" : "kimchy" }
    }
}'

Log

  • Set the logging level to DEBUG or lower, which makes Spring print more information and stack traces when an exception occurs.
  • Using the Node Client of Elasticsearch will surface Elasticsearch’s internal errors through stack traces.

Samples

If we can’t find how a functionality is achieved, we can look for samples in the following places:

  • Sample project
  • Test cases in repos

