
Logstash Learning (2): Config

In the last blog, we introduced some Logstash concepts: the log data flow from input to filter to output, buffers and batches, etc.

In this blog, we focus on how to set up Logstash.

Settings Files

After installing Logstash, we can find its settings files under /etc/logstash (on Linux):

  • logstash.yml: Logstash parameter config file
  • log4j2.properties: Logstash logging config
  • jvm.options: Logstash JVM config
  • startup.options: used by the system-install script in /usr/share/logstash/bin to build the startup script; it sets options like the user, group, service name, and service description.

logstash.yml

Besides the usual YAML syntax, the logstash.yml file also supports bash-style interpolation of environment variables in setting values.

pipeline:
  batch:
    size: ${BATCH_SIZE}
    delay: ${BATCH_DELAY:5}
node:
  name: "node_${LS_NODE_NAME}"
path:
   queue: "/tmp/${QUEUE_DIR:queue}"

We can also set logging variables in logstash.yml and reference them in log4j2.properties, the Logstash logging config. In logstash.yml:

log.level: debug 
log.format: plain

Then, in log4j2.properties:

rootLogger.level = ${sys:ls.log.level}
rootLogger.appenderRef.rolling.ref = ${sys:ls.log.format}_rolling

Glob Pattern Paths

This is the pattern for specifying file paths. It can be used either in logstash.yml or in the pipeline config (e.g. pipe.conf), anywhere we want to refer to a file:

  • *: match any file except dot files
  • **: match directories recursively
  • ?: match any one character
  • {p,q}: match either literal p or literal q.

E.g. "/path/to/logs/{app1,app2,app3}/data-*.log"

Notice that this is not the same as a grok pattern, which is a regex-based pattern for interpreting log messages.
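
As a minimal sketch, the glob above can be used directly in a file input (the path is just the illustration from above):

input {
  file {
    # pick up data-*.log from any of the three application directories
    path => "/path/to/logs/{app1,app2,app3}/data-*.log"
  }
}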

Pipeline Config

Now that we have finished configuring Logstash itself, we can configure a pipeline workflow to handle our log data.

Input

The first part of a pipeline is the input plugin:

  • file: read logs from a file;
  • syslog: listen on port 514 and parse logs according to RFC 3164;
  • redis: read logs from Redis;
  • beats: read logs from Elastic Beats, lightweight data shippers that send data to Logstash.

If there is more than one input plugin, Logstash reads from all of them at the same time and combines them into one stream.
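
For example, a minimal sketch of an input block that reads from a log file and from Beats at the same time (the path is illustrative; 5044 is the port conventionally used for Beats):

input {
  # tail a local log file, starting from its beginning on first run
  file {
    path => "/var/log/nginx/access.log"
    start_position => "beginning"
  }
  # listen for events shipped by Filebeat or other Beats
  beats {
    port => 5044
  }
}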

Filter

Then the log stream goes through a list of filters.

  • grok: turn unstructured text into structured data, breaking it up into many different discrete bits of information
    • e.g. match => { "message" => "%{COMBINEDAPACHELOG}" }
    • grok debugger
  • mutate: rename/remove/replace/modify fields (e.g. rename, strip)
  • drop: drops everything that gets to this filter
  • date: parse a date from a field and store it in a target field (by default @timestamp)
    • match => [ "logdate", "MMM dd yyyy HH:mm:ss" ]
    • This filter parses out a timestamp and uses it as the timestamp for the event (regardless of when you’re ingesting the log data)
  • dissect: extract unstructured event data into fields by using delimiters, e.g. for the log line:

Apr 26 12:20:02 localhost systemd[1]: Starting system activity accounting tool...

mapping => { "message" => "%{ts} %{+ts} %{+ts} %{src} %{prog}[%{pid}]: %{msg}" }
  • kv: parse key-value pairs
  • geoip, dns, useragent, translate etc
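
Putting a few of these together, here is a rough sketch of a filter block for Apache access logs: grok breaks the line into fields, date uses the parsed timestamp as the event timestamp, and mutate drops the now-redundant raw line (the field names come from the COMBINEDAPACHELOG pattern):

filter {
  # break the Apache combined log line into discrete fields
  grok {
    match => { "message" => "%{COMBINEDAPACHELOG}" }
  }
  # use the parsed timestamp as the event timestamp
  date {
    match => [ "timestamp", "dd/MMM/yyyy:HH:mm:ss Z" ]
  }
  # the raw line is no longer needed once it is parsed
  mutate {
    remove_field => [ "message" ]
  }
}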

Output

Finally, log events will go to output plugins.

  • elasticsearch: write to an Elasticsearch cluster
  • file: write to a file
  • graphite: pull metrics from logs and ship them to Graphite, which is an open source tool for storing and graphing metrics
  • statsd etc

If there is more than one output plugin, Logstash writes to all of them: each target gets its own copy of the data.
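
For instance, a sketch that writes every event both to an Elasticsearch cluster and to a local file (the host, index name and path are illustrative):

output {
  # every event is sent to both outputs
  elasticsearch {
    hosts => ["http://localhost:9200"]
    index => "weblogs"
  }
  file {
    path => "/var/log/logstash/events.log"
  }
}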

Codec

Input codecs provide a convenient way to decode your data before it enters the input. Output codecs provide a convenient way to encode your data before it leaves the output. Using an input or output codec eliminates the need for a separate filter in your Logstash pipeline.

Codecs separate the transport of a message from the serialization process. Some common codecs are listed below:

  • json
  • multiline
  • msgpack
  • plain
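
As a small sketch, a json codec on the input decodes each line into fields without a separate json filter, and the rubydebug codec on stdout pretty-prints events for debugging (the path is illustrative):

input {
  file {
    path => "/var/log/app/events.json"
    # decode each line as JSON as it enters the pipeline
    codec => json
  }
}
output {
  # pretty-print the resulting events for inspection
  stdout { codec => rubydebug }
}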

Customize Filter

When handling our log data, we can not only use plugins but also refer to the fields of our log events. Because inputs generate events, there are no fields to evaluate within the input block—they do not exist yet. Because of their dependency on events and fields, the following configuration options will only work within filter and output blocks.

Field Ref

The syntax to access a field is [fieldname]. If we are referring to a top-level field, we can omit the [] and simply use fieldname. To refer to a nested field, we specify the full path to that field: [top-level field][nested field].
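
For instance, a minimal sketch that renames a nested field to a top-level one (the field names are hypothetical):

filter {
  mutate {
    # move the nested [host][name] field up to a top-level "hostname" field
    rename => { "[host][name]" => "hostname" }
  }
}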

sprintf Format

To embed a field's content in a string, we can use %{}:

output {
  statsd {
    increment => "apache.%{[response][status]}"
  }
}

Instead of specifying a field name inside the %{}, we can use the +FORMAT syntax to represent a timestamp, where FORMAT is a time format.
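
For example, a sketch that splits output files by day using the event's timestamp (the path and the type field are illustrative):

output {
  file {
    # %{type} is a field assumed to exist on the event;
    # %{+yyyy-MM-dd} formats the event's @timestamp
    path => "/var/log/app/%{type}-%{+yyyy-MM-dd}.log"
  }
}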

Conditionals

Conditionals, like field references and the sprintf format, will not work in an input block. A conditional looks like this:

if [@metadata][test] == "Hello" {
  stdout { codec => rubydebug }
}
  • equality: ==, !=, <, >, <=, >=
  • regexp: =~, !~ (checks a pattern on the right against a string value on the left)
  • inclusion: in, not in
  • and, or, nand, xor, !
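
As a slightly larger sketch, conditionals can route events: tag suspected timeouts in the filter stage, then send tagged events to a different output (the field names, index and path are illustrative):

filter {
  # tag events whose message mentions a timeout
  if [message] =~ /timeout/ {
    mutate { add_tag => [ "timeout" ] }
  }
}
output {
  if "timeout" in [tags] {
    elasticsearch {
      hosts => ["http://localhost:9200"]
      index => "timeouts"
    }
  } else {
    file { path => "/var/log/logstash/other.log" }
  }
}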

The @metadata

Make use of the @metadata field any time you need a temporary field but do not want it to be in the final output.

For example, it can be used in timestamp extraction:

date {
  match => [ "[@metadata][timestamp]" , "ISO8601" ]
}
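
A fuller sketch of this pattern (the grok pattern here is just an assumption about the log format): grok captures the leading timestamp into @metadata, date uses it to set @timestamp, and the temporary field never appears in the final output:

filter {
  grok {
    # capture the leading ISO8601 timestamp into a temporary @metadata field
    match => { "message" => "^%{TIMESTAMP_ISO8601:[@metadata][timestamp]}" }
  }
  date {
    match => [ "[@metadata][timestamp]", "ISO8601" ]
  }
}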

Environment Variable

  • Give a default value by using the form ${var:default value}. Logstash uses the default value if the environment variable is undefined.
  • Environment variables are immutable. If you update the environment variable, you’ll have to restart Logstash to pick up the updated value.
  • The replacement is case-sensitive.

Example

Here’s an example that uses an environment variable to set the path to a log file:

filter {
  mutate {
    add_field => {
      "my_path" => "${HOME}/file.log"
    }
  }
}

Given the value of HOME:

export HOME="/path"

At startup, Logstash uses the following configuration:

filter {
  mutate {
    add_field => {
      "my_path" => "/path/file.log"
    }
  }
}

Multiline Events

A common use of Logstash is to combine a multi-line log into a single log event. Here we explore three examples:

  • Combining a Java stack trace into a single event
  • Combining C-style line continuations into a single event
  • Combining multiple lines from time-stamped events

According to the Filebeat documentation, if we use Filebeat to ship the logs, we had better combine the lines in Filebeat rather than in Logstash; otherwise, we may corrupt the stream of data.
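
For reference, a rough sketch of the Filebeat-side setting (option names may vary between Filebeat versions, and the timestamp pattern is just an assumption about the log format):

filebeat.inputs:
  - type: log
    paths:
      - /var/log/someapp.log
    # lines that do not start with a date are appended to the previous line
    multiline.pattern: '^\d{4}-\d{2}-\d{2}'
    multiline.negate: true
    multiline.match: after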

Java Stack Traces

input {
  stdin {
    codec => multiline {
      pattern => "(^[a-zA-Z.]+(?:Error|Exception): .+)|(^\s+at .+)|(^\s+... \d+ )|(^\s*Caused by:.+)"
      what => "previous"
    }
  }
}

Line Continuations

This configuration merges any line that ends with the \ character with the following line.

input {
  stdin {
    codec => multiline {
      pattern => "\\$"
      what => "next"
    }
  }
}

Timestamps

This configuration uses the negate option to specify that any line that does not begin with a timestamp belongs to the previous line.

input {
  file {
    path => "/var/log/someapp.log"
    codec => multiline {
      pattern => "^%{TIMESTAMP_ISO8601} "
      negate => true
      what => "previous"
    }
  }
}

