Elasticsearch: Datatypes, _source, _all

In the previous post, we saw how to create an index and do a simple search. Before moving further, let’s take a look at the five datatypes that Elasticsearch provides out of the box.

  • String – The text-based string type is the most basic type and contains one or more characters.
  • Boolean – The boolean type also accepts its value as a number or a string; in that case 0, an empty string, false, off and no are treated as false, and all other values as true.
  • Number – A number-based type supporting float, double, byte, short, integer and long. It uses specific constructs within Lucene to support numeric values. These correspond to the primitive numeric types in Java.
  • Date – Dates in Elasticsearch can be supplied in many different formats. By default Elasticsearch stores dates in UTC, and the default format is dateOptionalTime. Elasticsearch also lets you specify multiple formats at once, so you can enter a date in one format or the other.
  • Binary – The binary type is a base64 representation of binary data (say, an image or a blob) that can be stored in the index. The field is not stored by default and is not indexed at all.
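To make the types concrete, here is a minimal Python sketch (standard library only) that builds a JSON document with one field of each kind. The field names in_stock, rating and poster, and all the values, are made up for illustration and are not part of the mapping used in this post. The main point is that a binary value has to be base64-encoded before it goes into the JSON.

```python
import base64
import json

# Hypothetical raw bytes standing in for an image or blob.
raw_bytes = b"\x89PNG\r\n"

# Illustrative document: every field name and value here is an assumption.
doc = {
    "name": "Example Movie",       # string
    "in_stock": True,              # boolean (serialized as JSON true)
    "rating": 7.5,                 # number (float)
    "release_date": "2014-03-15",  # date, written as a string
    # binary: base64 text, because JSON cannot carry raw bytes
    "poster": base64.b64encode(raw_bytes).decode("ascii"),
}

body = json.dumps(doc)
print(body)
```

Decoding the poster value with base64.b64decode gives back the original bytes, which is what a client would do after reading the field.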

Let’s delete our previous index and create a new mapping. To delete it, use the following:


DELETE media

To check:


GET /_cat/indices

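Let’s create a new mapping for media/movies, this time with a date format. The format is a Joda-Time pattern, so year-month-day is written yyyy-MM-dd (uppercase YYYY and DD mean year of era and day of year, which is not what we want here).


POST /media
{
  "settings": {
    "index": {
      "number_of_shards": 5
    }
  },
  "mappings": {
    "movies": {
      "properties": {
        "name": {
          "type": "string"
        },
        "release_date": {
          "type": "date",
          "format": "yyyy-MM-dd"
        },
        "actors": {
          "type": "string"
        },
        "directors": {
          "type": "string"
        }
      }
    }
  }
}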

Notice the extra settings parameter, number_of_shards: the number of primary shards that an index should have, which defaults to 5. This setting cannot be changed after index creation. Another important setting is number_of_replicas: the number of replica shards (copies) that each primary shard should have, which defaults to 1. This setting can be changed at any time on a live index.
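Since number_of_replicas can be changed on a live index, here is a rough Python sketch of what such a change could look like against the index settings endpoint. It only builds the request and does not send it. The address http://localhost:9200 and the helper name replica_update_request are assumptions for illustration; adjust them to your own setup.

```python
import json
import urllib.request

ES_URL = "http://localhost:9200"  # assumption: a local node on the default port


def replica_update_request(index, replicas):
    """Build (but do not send) a PUT <index>/_settings request."""
    body = json.dumps({"index": {"number_of_replicas": replicas}}).encode("utf-8")
    return urllib.request.Request(
        url=f"{ES_URL}/{index}/_settings",
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


req = replica_update_request("media", 2)
print(req.get_method(), req.full_url)
print(req.data.decode("utf-8"))
# To actually send it against a running node: urllib.request.urlopen(req)
```

There is no equivalent helper for number_of_shards, because that value is fixed once the index has been created.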

You can check the mapping with:


GET media/_mapping

Let’s try to insert a document whose date is in a different format.

[Screenshot: es_insertIncorrect0]

As you can see, Elasticsearch throws an error. Change the date to the correct format and Elasticsearch will accept it.
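If you index from your own code, you can catch such documents before they reach the server. The following is a rough client-side check in Python, assuming the year-month-day format used in the mapping. It relies on Python’s own date parser, not on Elasticsearch’s, so treat it as an approximation; the function name matches_mapping_format is made up for this example.

```python
from datetime import datetime


def matches_mapping_format(value):
    """Return True if value parses as year-month-day, e.g. 2014-03-15."""
    try:
        datetime.strptime(value, "%Y-%m-%d")
    except ValueError:
        return False
    return True


# Sample values are illustrative only.
for candidate in ["2014-03-15", "15/03/2014", "2014-13-01"]:
    print(candidate, matches_mapping_format(candidate))
```

Only the first value passes: the second uses a different layout and the third has an invalid month.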

_source

In most scenarios, Elasticsearch will not be your primary data store. You may only be using it for searching, with some code that pulls data from an RDBMS or a file repository and indexes it. By default, Elasticsearch keeps a copy of the original data in addition to the data it has indexed for you. The _source field is an automatically generated field that stores the actual JSON that was sent as the document to index. It is not indexed (searchable), just stored. When executing “fetch” requests, like get or search, the _source field is returned by default.

[Screenshot: es_fields0]

The downside of this is that your index tends to be quite large. Large indexes are not a problem for Elasticsearch, but you may choose to tell it not to keep the original copy of the data and to keep only the indexed version. Let’s delete our previous index and create a new one.


POST /media
{
  "mappings": {
    "movies": {
      "_source": {
        "enabled": false
      },
      "properties": {
        "name": {
          "type": "string",
          "store": true
        },
        "release_date": {
          "type": "date",
          "format": "yyyy-MM-dd"
        },
        "actors": {
          "type": "string"
        },
        "directors": {
          "type": "string"
        }
      }
    }
  }
}

In the above mapping, we have disabled _source and at the same time told Elasticsearch to store the name field. So the properties release_date, actors and directors won’t be stored, only indexed. This means I can search on the indexed version of those fields but cannot retrieve the original data; for that we would need another data source. Let’s insert a document and search.

[Screenshot: es_storespecific0]

As you can see, we don’t have the _source field now. I also cannot see any of the properties except the id. To get the ‘name’ property that we have stored, we have to fetch it specifically.


GET media/movies/1?fields=name

[Screenshot: es_storespecific1]

Even if I try to retrieve other fields, like

GET media/movies/1?fields=name,actors

It will still return only the name field, because that’s the only field we have stored. Disabling _source can save you a lot of space, especially on a production index, but it makes Elasticsearch a little more difficult to work with when debugging.
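The rule described above can be written down as a small Python model. This is only a simplified sketch of the behaviour, not how Elasticsearch implements it: given a type mapping, it lists the fields you can expect to get back. The helper name retrievable_fields is hypothetical, and the mapping is a trimmed copy of the one in this post (the date format is left out for brevity).

```python
# Trimmed copy of the movies mapping from this post.
movies = {
    "_source": {"enabled": False},
    "properties": {
        "name": {"type": "string", "store": True},
        "release_date": {"type": "date"},
        "actors": {"type": "string"},
        "directors": {"type": "string"},
    },
}


def retrievable_fields(type_mapping):
    """Fields whose original values can be fetched back (simplified model)."""
    source_enabled = type_mapping.get("_source", {}).get("enabled", True)
    props = type_mapping["properties"]
    if source_enabled:
        # With _source kept, every field can be read from the original JSON.
        return sorted(props)
    # Without _source, only explicitly stored fields survive.
    return sorted(name for name, spec in props.items() if spec.get("store", False))


print(retrievable_fields(movies))  # ['name']
```

Removing the _source entry from the mapping makes the function return all four field names, which matches the default behaviour described earlier.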

_all

Another important field to keep in mind is the _all field. The idea of the _all field is that it includes the text of all the other fields of the indexed document. It is enabled by default. It can come in very handy for search requests where we want to run a query against the content of a document without knowing which fields to search on. If you want to disable it, you must specify that when creating the mapping, similar to the previous example:


POST /media
{
  "mappings": {
    "movies": {
      "_all": {
        "enabled": false
      },
      "properties":.....(rest of your properties)

When disabling the _all field, it is good practice to set index.query.default_field to a different field, since it defaults to _all.
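As a sketch of how the two settings fit together, the following Python snippet builds an index-creation body that disables _all and sets index.query.default_field in the same request. Choosing name as the default field is an assumption made for this example, as is the helper name index_body_without_all; pick whichever field makes sense for your searches.

```python
import json


def index_body_without_all(default_field, properties):
    """Build an index-creation body with _all disabled for the movies type."""
    return {
        "settings": {
            # Field used by queries that do not name one explicitly.
            "index.query.default_field": default_field,
        },
        "mappings": {
            "movies": {
                "_all": {"enabled": False},
                "properties": properties,
            }
        },
    }


# Only one property is shown here; the rest of the mapping is omitted.
body = index_body_without_all("name", {"name": {"type": "string"}})
print(json.dumps(body, indent=2))
```

The printed JSON can be used as the body of the POST /media request shown above.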
