Elasticsearch: Analyzers

Elasticsearch uses a data structure called an inverted index, which is designed to allow very fast full-text searches. An inverted index consists of a list of all the unique words that appear in any document and, for each word, a list of the documents in which it appears. Let's see this with an example. Suppose we have the following three sentences in three documents:

Doc 1 – Android operating system is built on Java.

Doc 2 – Java is an object oriented programming language.

Doc 3 – James Gosling invented Java programming language.

Elasticsearch will treat each continuous block of letters and numbers as a 'term'. An inverted index consists of two parts: the sorted dictionary of terms and the postings lists.

[Image: inverted index for the three documents – sorted term dictionary with postings lists]

That's why it's called inverted: the terms, not the documents, are the primary data. So when you search, you first look up each term in the sorted dictionary and then walk its postings list.
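The structure above can be sketched in a few lines of Python. This is only a toy illustration of the idea, not how Elasticsearch stores an index internally (Lucene's on-disk format is far more compact):

```python
import re

docs = {
    1: "Android operating system is built on Java.",
    2: "Java is an object oriented programming language.",
    3: "James Gosling invented Java programming language.",
}

# Build the inverted index: a dictionary of terms, each pointing
# at its postings list (the ids of the documents it appears in).
index = {}
for doc_id, text in docs.items():
    # crude tokenization: continuous blocks of letters/digits, lowercased
    for term in re.findall(r"\w+", text.lower()):
        postings = index.setdefault(term, [])
        if doc_id not in postings:
            postings.append(doc_id)

# Walk the sorted dictionary, printing each term with its postings list.
for term in sorted(index):
    print(term, "->", index[term])
```

A search for "java" now needs only a single dictionary lookup to discover that all three documents contain it.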

The process of breaking sentences down into terms is done by an 'analyzer'; the splitting step itself is called 'tokenization'. An analyzer does three main jobs when reading in a string of characters:

  • Character filtering – e.g. removing HTML, or converting 9 to nine.
  • Tokenizing – splitting on whitespace, periods, commas, etc.
  • Token filtering – removing certain terms like 'and' or 'the', or some offensive words.
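The three stages above can be sketched as plain Python functions. This is a simplified illustration with a hypothetical stop-word list; in Elasticsearch these stages are configurable, pluggable components:

```python
import re

STOPWORDS = {"and", "the", "is", "a", "an", "or"}  # hypothetical stop-word list

def char_filter(text):
    # character filtering: strip HTML tags such as <b>...</b>
    return re.sub(r"<[^>]+>", "", text)

def tokenize(text):
    # tokenizing: split into continuous blocks of letters/digits
    return re.findall(r"\w+", text)

def token_filter(tokens):
    # token filtering: lowercase and drop stop words
    return [t.lower() for t in tokens if t.lower() not in STOPWORDS]

def analyze(text):
    return token_filter(tokenize(char_filter(text)))

print(analyze("<b>The Quick fox and the dog</b>"))
```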

Elasticsearch comes with a lot of built-in analyzers, and you can create your own for your needs by combining the built-in character filters, tokenizers and token filters into custom analyzers. Some of the built-in analyzers:

  • Standard analyzer – the default analyzer that Elasticsearch uses. It is the best general choice for analyzing text that may be in any language. It splits the text, removes most punctuation and, finally, lowercases all terms.
  • Simple analyzer – splits the text on anything that isn't a letter, and lowercases the terms.
  • Whitespace analyzer – splits the text on whitespace only and does not lowercase. Good for strings like computer code or logs.

Elasticsearch comes with an API for testing analysis. You just have to send a basic HTTP request along with a body:

(POST) – http://localhost:9200/_analyze?analyzer=standard

(BODY) – the string to be analyzed

We will not be using the Sense plugin for this, as it doesn't handle a plain-text body well. I am using the RESTClient add-on for Firefox; you can use any client of your choice – Postman, for example, is quite popular in Chrome.

Let's take the following string:

Convert string-to-interger by using Integer.parseInt(String) method

Let's use the standard analyzer first:

[Image: RESTClient response – standard analyzer output]

As you can see, the standard analyzer removed characters like '-', '(' and ')'. It also lowercased every term.

Now let's use the whitespace analyzer:

[Image: RESTClient response – whitespace analyzer output]

As you can see, it does not lowercase the terms. The '-' is still there: the text is only split on whitespace, so all our original terms are still intact.
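The difference between the two analyzers can be approximated in Python. The real standard analyzer uses Unicode text segmentation rules, so the regex below is only a rough stand-in for its behavior:

```python
import re

text = "Convert string-to-interger by using Integer.parseInt(String) method"

# whitespace analyzer: split on whitespace only, no lowercasing
whitespace_tokens = text.split()

# rough approximation of the standard analyzer:
# split on anything that isn't a letter or digit, then lowercase
standard_tokens = [t.lower() for t in re.findall(r"[A-Za-z0-9]+", text)]

print(whitespace_tokens)
print(standard_tokens)
```

Note how the whitespace analyzer keeps `Integer.parseInt(String)` as a single, case-sensitive token, while the standard-style tokenization breaks it apart and lowercases the pieces.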

So now the question is where to specify your analyzer. There are actually a few places where you can do it, but the most common is in the mapping settings of your index. Let's take the following mapping for now:


POST /filerepo
{
  "mappings": {
    "files": {
      "properties": {
        "name": {
          "type": "string"
        },
        "publish_date": {
          "type": "date",
          "format": "yyyy-MM-dd"
        },
        "file_text": {
          "type": "string"
        }
      }
    }
  }
}

Above we have an index called 'filerepo' which has a type 'files' containing three properties. As we know, the default analyzer is the standard one. Let's index a document:

POST filerepo/files/1
{
  "name": "Java Basics",
  "publish_date": "2012-03-20",
  "file_text": "Convert string-to-interger by using Integer.parseInt(String) method"
}

Let's do a search:


GET filerepo/files/_search
{
  "query": {
    "term": {
      "file_text": "Convert"
    }
  }
}

[Image: search response – no hits for 'Convert']

As you can see, it didn't find anything: 'Convert' was lowercased to 'convert' before being stored, and a term query does not analyze its input. If you search for 'convert' you will get the result.

[Image: search response – one hit for 'convert']
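This behavior is specific to the term query, which matches stored terms exactly. A match query, by contrast, runs the query text through the field's analyzer first, so it should find the document even for 'Convert' – a quick sketch in the same request style:

```
GET filerepo/files/_search
{
  "query": {
    "match": {
      "file_text": "Convert"
    }
  }
}
```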

To use an analyzer of your choice, add the analyzer attribute to the property of your type. Let's use the whitespace analyzer. First delete your previous index using:


DELETE /filerepo

Then post


POST /filerepo
{
  "mappings": {
    "files": {
      "properties": {
        "name": {
          "type": "string"
        },
        "publish_date": {
          "type": "date",
          "format": "yyyy-MM-dd"
        },
        "file_text": {
          "type": "string",
          "analyzer": "whitespace"
        }
      }
    }
  }
}

Notice the whitespace analyzer on the 'file_text' property. Now index the same document as before and search for 'Convert':

[Image: search response – 'Convert' now matches with the whitespace analyzer]

As you can guess, a search for 'convert' will not find anything now, since the terms were stored without lowercasing.

All fields/properties are analyzed by default when they are indexed. You can tell Elasticsearch not to analyze particular fields using the not_analyzed option:


"id": {
  "type": "string",
  "index": "not_analyzed"
}

So whatever is sent is indexed as it is without any analysis.

That's it for now. Hope this was informative.
