Regexp vs Include performance comparison in Elasticsearch - regex

I work on a project and I need to aggregate the results based on "created" and "labels" field.
I created following queries that both give the result as I expected. But I want to learn that which query runs more fast?
My first query:
{
"size": 0,
"aggs": {
"HEATMAP": {
"date_histogram": {
"field": "created",
"interval": "day"
},
"aggs": {
"BEHAVIOUR_CHANGE": {
"terms": {
"field": "labels",
"include": "behavior-change"
}
},
"FIRST_OCCURRENCE": {
"terms": {
"field": "labels",
"include": "first-occurrence"
}
}
}
}
}
}
My second query:
{
"size": 0,
"aggs": {
"HEATMAP": {
"date_histogram": {
"field": "created",
"interval": "day"
},
"aggs": {
"BEHAVIOUR_CHANGE": {
"filter": {
"regexp": {
"labels": "behavior-change"
}
}
},
"FIRST_OCCURRENCE": {
"filter": {
"regexp": {
"labels": "first-occurrence"
}
}
}
}
}
}
}

Since that field is a keyword and you don't need anything special when it comes to a regular expression (only a perfect match), I would do it like the following. You'd note also that I added a terms filter to the query part to try and narrow down the results before being put through the aggregations (theoretically, for the aggregations to have less work to do). Also, I don't see a reason to use regexp here, thus I used the terms aggregations. If you are really interested in the performance comparison, I'd suggest setting up a load test with many more documents and terms in that field and perform some tests. Elastic has its own benchmarking tool that you could use for this: Rally.
{
"size": 0,
"query": {
"terms": {
"labels": [
"behavior-change",
"first-occurrence"
]
}
},
"aggs": {
"HEATMAP": {
"date_histogram": {
"field": "created",
"interval": "day"
},
"aggs": {
"BEHAVIOUR_CHANGE": {
"terms": {
"field": "labels",
"include": "behavior-change"
}
},
"FIRST_OCCURRENCE": {
"terms": {
"field": "labels",
"include": "first-occurrence"
}
}
}
}
}
}

Related

How to apply custom score to a search filed in Elastic Search

I am making a search query in Elastic Search and I want to treat the fields the same when they match. For example if I search for field field1 and it matches, then the _score is increase by 10(for example), same for the field2.
I was tried function_score but it's not working. It throws an error.
"caused_by": {
"type": "class_cast_exception",
"reason": "class
org.elasticsearch.index.fielddata.plain.SortedSetDVOrdinalsIndexFieldData
cannot be cast to class
org.elasticsearch.index.fielddata.IndexNumericFieldData
(org.elasticsearch.index.fielddata.plain.SortedSetDVOrdinalsIndexFieldData
and org.elasticsearch.index.fielddata.IndexNumericFieldData are in unnamed
module of loader 'app')"
}
The query:
{
"track_total_hits": true,
"size": 50,
"query": {
"function_score": {
"query": {
"bool": {
"must": [
{
"term": {
"field1": {
"value": "Value 1"
}
}
},
{
"term": {
"field2": {
"value": "value 2"
}
}
}
]
}
},
"functions": [
{
"field_value_factor": {
"field": "field1",
"factor": 10,
"missing": 0
}
},
{
"field_value_factor": {
"field": "field2",
"factor": 10,
"missing": 0
}
}
],
"boost_mode": "multiply"
}
}
}
You can use function score with filter function to boost.
assuming that your mapping looks like the one below
{
"mappings": {
"properties": {
"field_1": {
"type": "keyword"
},
"field_2": {
"type": "keyword"
}
}
}
}
with documents
{"index":{}}
{"field_1": "foo", "field_2": "bar"}
{"index":{}}
{"field_1": "foo", "field_2": "foo"}
{"index":{}}
{"field_1": "bar", "field_2": "bar"}
you can use weight parameter to boost the documents matched for each query.
{
"query": {
"function_score": {
"query": {
"match_all": {}
},
"functions": [
{
"filter": {
"term": {
"field_1": "foo"
}
},
"weight": 10
},
{
"filter": {
"term": {
"field_2": "foo"
}
},
"weight": 20
}
],
"score_mode": "multiply"
}
}
}
You can refer below solution if you want to provide manual weight for different field in query. This will always replace highest weight field on top of your query response -
Elasticsearch query different fields with different weight

Getting all values of 2 columns

I am looking for appropriate elasticsearch query for,
SELECT col1,col2 FROM myTable WHERE col1="value1" AND col2 = "value2"
eg:
This is my mapping,
{
"mapping": {
"doc": {
"properties": {
"book": {
"properties": {
"name": {
"type": "text"
},
"price": {
"type": "integer"
},
"booktype": {
"properties": {
"booktype": {
"type": "text"
}
}
}
}
}
}
}
}
}
I am trying to write a query which will give me price and name which has booktype=Fiction
Try this:
GET myTable/_search
{
"size": 1000,
"_source": [
"price",
"name"
],
"query": {
"bool": {
"must": [
{
"match": {
"booktype.booktype": "Fiction"
}
}
]
}
}
}
Note: you might need to adapt "size" to fit your needs.

Aggregation count of mail domains using elastisearch

I have following documents in my index:
{
"name":"rakesh"
"age":"26"
"email":"rakesh#gmail.com"
}
{
"name":"sam"
"age":"24"
"email":"samjoe#elastic.com"
}
{
"name":"joseph"
"age":"26"
"email":"joseph#gmail.com"
}
{
"name":"genny"
"age":"24"
"email":"genny#hotmail.com"
}
Now i need to get the count of all mail domains. Like:
#gmail.com:2,
#hotmail.com:1,
#elastic.com:1
using elastic search aggregations.
I can able to find the records which matches the given query. But i need have a count of each domain.
Thanks in advance for your help.
This can easily be achieved by creating a sub-field that will contain only the email domain name. First create the index with the appropriate analyzer:
PUT my_index
{
"settings": {
"index": {
"analysis": {
"analyzer": {
"email_domain_analyzer": {
"type": "pattern",
"pattern": "(.+)#",
"lowercase": true
}
}
}
}
},
"mappings": {
"doc": {
"properties": {
"email": {
"type": "text",
"fields": {
"domain": {
"type": "text",
"fielddata": true,
"analyzer": "email_domain_analyzer"
}
}
}
}
}
}
}
Then create your documents:
POST my_index/doc/_bulk
{ "index": {"_id": 1 }}
{ "name":"rakesh", "age":"26", "email":"rakesh#gmail.com" }
{ "index": {"_id": 2 }}
{ "name":"sam", "age":"24", "email":"samjoe#elastic.com" }
{ "index": {"_id": 3 }}
{ "name":"joseph", "age":"26", "email":"joseph#gmail.com" }
{ "index": {"_id": 4 }}
{ "name":"genny", "age":"24", "email":"genny#gmail.com" }
And finally, you can aggregate on the email.domain field and you'll get exactly what you need:
POST my_index/_search
{
"size": 0,
"aggs": {
"domains": {
"terms": {
"field": "email.domain"
}
}
}
}

How can I exclude results from elasticsearch based on the contents of a field?

I'm using elasticsearch on AWS to store logs from Cloudfront. I have created a simple query that will give me all entries from the past 24h, sorted from new to old:
{
"from": 0,
"size": 1000,
"query": {
"bool": {
"must": [
{ "match": { "site_name": "some-site" } }
],
"filter": [
{
"range": {
"timestamp": {
"lt": "now",
"gte": "now-1d"
}
}
}
]
}
},
"sort": [
{ "timestamp": { "order": "desc" } }
]
}
Now, there a are certain sources (based on the user agent) for which I would like to exclude results. So my question boils down to this:
How can I filter out entries from the results when a certain field contains a certain string? Or:
query.filter.where('cs_user_agent').does.not.contain('Some string')
(This is not real code, obviously.)
I have tried to make sense of the Elasticsearch documentation, but I couldn't find a good example of how to achieve this.
I hope this makes sense. Thanks in advance!
Okay, I figured it out. What I've done is use a Bool Query in combination with a wildcard:
{
"from": 0,
"size": 1000,
"query": {
"bool": {
"must": [
{ "match": { "site_name": "some-site" } }
],
"filter": [
{
"range": {
"timestamp": {
"lt": "now",
"gte": "now-1d"
}
}
}
],
"must_not": [
{ "wildcard": { "cs_user_agent": "some string*" } }
]
}
},
"sort": [
{ "timestamp": { "order": "desc" } }
]
}
This basically matches any user agent string containing "some string", and then filters it out (because of the "must_not").
I hope this helps others who run into this problem.
nod.js client version:
const { from, size, value, tagsIdExclude } = req.body;
const { body } = await elasticWrapper.client.search({
index: ElasticIndexs.Tags,
body: {
from: from,
size: size,
query: {
bool: {
must: {
wildcard: {
name: {
value: `*${value}*`,
boost: 1.0,
rewrite: 'constant_score',
},
},
},
filter: {
bool: {
must_not: [
{
terms: {
id: tagsIdExclude ? tagsIdExclude : [],
},
},
],
},
},
},
},
},
});

Elasticsearch/Lucene Regex fquery/query_string not returning all documents

I currently have this mapping in Elasticsearch that I am indexing with a not_analyzed field:
PUT /twitter/_mapping/tweet
{
"tweet": {
"properties" : {
"user" : {
"type" : "string",
"index": "not_analyzed"
}
}
}
}
PUT /twitter/tweet/1
{
"user": "CNN"
}
PUT /twitter/tweet/2
{
"user": "cnn"
}
PUT /twitter/tweet/3
{
"user": "Cnn"
}
PUT /twitter/tweet/4
{
"user": "cNN"
}
PUT /twitter/tweet/5
{
"user": "CnN"
}
I want to search on this index with a case-insensitive filter like so (generated through NEST, so not too flexible in changing this query syntax):
POST /twitter/tweet/_search
{
"from": 0,
"size": 10,
"query": {
"filtered": {
"filter": {
"bool": {
"must": [
{
"fquery": {
"query": {
"query_string": {
"query": "user:/[cC][nN][nN]/"
}
}
}
}
]
}
}
}
}
}
This query only returns 1 documents though: "user": "cnn" (lowercase), not all of the documents.
Why is this? The same query with "query": "user:CNN" returns the correct document with the correct casing (uppercase).
EDIT: Also, if I remove the document with cnn (lowercase), the query returns nothing.
EDIT 2: In the case that this is a problem with my NEST code, here's the code used to generate the query:
// property path would be something like "user". queryTerm would be something like "cnn"
filterDescriptor.Query(
q =>
q.QueryString(
d =>
d.Query(string.Format("{0}:{1}", propertyPath,
GetCaseInsentitiveRegexExpression(queryTerm))))); // returns something like /[cC][nN][nN]/
You need to set lowercase_expanded_terms:false. By default lowercase_expanded_terms is set to true which lower-cases wildcard,regexp queries.
Example:
POST /twitter/tweet/_search
{
"from": 0,
"size": 10,
"query": {
"filtered": {
"filter": {
"bool": {
"must": [
{
"fquery": {
"query": {
"query_string": {
"query": "user:/[Cc][nN][nN]/",
"lowercase_expanded_terms": false
}
}
}
}
]
}
}
}
}
}
Or on nest code it would be something on these lines
q.QueryString(
d =>
d.Query(string.Format("{0}:{1}", propertyPath,
GetCaseInsentitiveRegexExpression(queryTerm))).LowercaseExpendedTerms(false))