Aggregation count of mail domains using Elasticsearch - regex

I have the following documents in my index:
{
"name":"rakesh",
"age":"26",
"email":"rakesh#gmail.com"
}
{
"name":"sam",
"age":"24",
"email":"samjoe#elastic.com"
}
{
"name":"joseph",
"age":"26",
"email":"joseph#gmail.com"
}
{
"name":"genny",
"age":"24",
"email":"genny#hotmail.com"
}
Now I need to get the count of each mail domain, like:
#gmail.com:2,
#hotmail.com:1,
#elastic.com:1
using Elasticsearch aggregations.
I am able to find the records that match a given query, but I need a count for each domain.
Thanks in advance for your help.

This can easily be achieved by creating a sub-field that will contain only the email domain name. First create the index with the appropriate analyzer:
PUT my_index
{
"settings": {
"index": {
"analysis": {
"analyzer": {
"email_domain_analyzer": {
"type": "pattern",
"pattern": "(.+)#",
"lowercase": true
}
}
}
}
},
"mappings": {
"doc": {
"properties": {
"email": {
"type": "text",
"fields": {
"domain": {
"type": "text",
"fielddata": true,
"analyzer": "email_domain_analyzer"
}
}
}
}
}
}
}
Then create your documents (the same ones as in the question):
POST my_index/doc/_bulk
{ "index": {"_id": 1 }}
{ "name":"rakesh", "age":"26", "email":"rakesh#gmail.com" }
{ "index": {"_id": 2 }}
{ "name":"sam", "age":"24", "email":"samjoe#elastic.com" }
{ "index": {"_id": 3 }}
{ "name":"joseph", "age":"26", "email":"joseph#gmail.com" }
{ "index": {"_id": 4 }}
{ "name":"genny", "age":"24", "email":"genny#hotmail.com" }
And finally, you can aggregate on the email.domain field and you'll get exactly what you need:
POST my_index/_search
{
"size": 0,
"aggs": {
"domains": {
"terms": {
"field": "email.domain"
}
}
}
}
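To see why the sub-field ends up holding only the domain: the pattern analyzer treats every match of its pattern as a separator, so with (.+)# everything up to and including the # is thrown away. Below is a rough Python sketch of that tokenization and of the counting a terms aggregation performs. It only imitates the behaviour locally and does not talk to Elasticsearch; the helper names are made up for illustration, and the emails are the ones from the question.

```python
import re
from collections import Counter

# Separator from the analyzer settings. A non-capturing group is used here
# because Python's re.split would otherwise also return the captured text.
SEPARATOR = re.compile(r"(?:.+)#")


def email_domain_tokens(email):
    """Imitate the pattern analyzer: split on the separator, lowercase, drop empties."""
    return [token.lower() for token in SEPARATOR.split(email) if token]


def count_domains(emails):
    """Imitate a terms aggregation over the domain tokens."""
    return Counter(token for email in emails for token in email_domain_tokens(email))


emails = [
    "rakesh#gmail.com",
    "samjoe#elastic.com",
    "joseph#gmail.com",
    "genny#hotmail.com",
]
domain_counts = count_domains(emails)
for domain, count in domain_counts.most_common():
    print(f"{domain}: {count}")
```

Note that the buckets are keyed by the bare domain (gmail.com), without the leading # shown in the question.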

How to apply custom score to a search field in Elasticsearch

I am making a search query in Elasticsearch and I want to treat the fields the same when they match. For example, if I search on field1 and it matches, the _score is increased by 10 (for example), and the same goes for field2.
I tried function_score, but it's not working. It throws an error:
"caused_by": {
"type": "class_cast_exception",
"reason": "class org.elasticsearch.index.fielddata.plain.SortedSetDVOrdinalsIndexFieldData cannot be cast to class org.elasticsearch.index.fielddata.IndexNumericFieldData (org.elasticsearch.index.fielddata.plain.SortedSetDVOrdinalsIndexFieldData and org.elasticsearch.index.fielddata.IndexNumericFieldData are in unnamed module of loader 'app')"
}
The query:
{
"track_total_hits": true,
"size": 50,
"query": {
"function_score": {
"query": {
"bool": {
"must": [
{
"term": {
"field1": {
"value": "Value 1"
}
}
},
{
"term": {
"field2": {
"value": "value 2"
}
}
}
]
}
},
"functions": [
{
"field_value_factor": {
"field": "field1",
"factor": 10,
"missing": 0
}
},
{
"field_value_factor": {
"field": "field2",
"factor": 10,
"missing": 0
}
}
],
"boost_mode": "multiply"
}
}
}
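The error most likely occurs because field_value_factor expects a numeric field, while field1 and field2 appear to be keyword (string) fields. You can use function_score with filter functions to boost instead.
Assuming that your mapping looks like the one below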
{
"mappings": {
"properties": {
"field_1": {
"type": "keyword"
},
"field_2": {
"type": "keyword"
}
}
}
}
with documents
{"index":{}}
{"field_1": "foo", "field_2": "bar"}
{"index":{}}
{"field_1": "foo", "field_2": "foo"}
{"index":{}}
{"field_1": "bar", "field_2": "bar"}
you can use the weight parameter to boost the documents matched by each filter:
{
"query": {
"function_score": {
"query": {
"match_all": {}
},
"functions": [
{
"filter": {
"term": {
"field_1": "foo"
}
},
"weight": 10
},
{
"filter": {
"term": {
"field_2": "foo"
}
},
"weight": 20
}
],
"score_mode": "multiply"
}
}
}
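To make the effect of the weights concrete, here is a small Python sketch that works out the scores for the three sample documents by hand. It is only an approximation of the scoring rules, under these assumptions: match_all gives every document a query score of 1.0, boost_mode is left at its default (multiply), and a document matched by no function keeps a neutral factor of 1. The function name and structure are illustrative and not part of any Elasticsearch API.

```python
from math import prod

# (filter field, filter value, weight) for each function in the query above.
FUNCTIONS = [("field_1", "foo", 10), ("field_2", "foo", 20)]


def function_score(doc, score_mode="multiply", query_score=1.0):
    """Combine the weights of the matching functions, then apply them to the query score."""
    weights = [w for field, value, w in FUNCTIONS if doc.get(field) == value]
    if not weights:
        combined = 1.0  # assumption: no matching function leaves the score unchanged
    elif score_mode == "multiply":
        combined = float(prod(weights))
    elif score_mode == "sum":
        combined = float(sum(weights))
    else:
        raise ValueError(f"unsupported score_mode in this sketch: {score_mode}")
    # boost_mode is not set in the query, so the default (multiply) is assumed.
    return query_score * combined


docs = [
    {"field_1": "foo", "field_2": "bar"},
    {"field_1": "foo", "field_2": "foo"},
    {"field_1": "bar", "field_2": "bar"},
]
for mode in ("multiply", "sum"):
    print(mode, [function_score(doc, score_mode=mode) for doc in docs])
```

If the goal is to add a fixed amount per matching field, as described in the question, "score_mode": "sum" may be closer to that intent than "multiply".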
You can refer to the solution below if you want to assign a manual weight to each field in the query. Matches on the highest-weighted field will always appear at the top of your query response:
Elasticsearch query different fields with different weight

Elasticsearch reindexing with selected fields results in the addition of a non-selected empty field

Scenario:
We are using AWS Elasticsearch 6.8. We have an index (index-A) whose mapping consists of multiple nested objects in a JSON hierarchy. We need to create a new index (index-B) and move all documents from index-A to index-B.
We need to create index-B with only specific fields.
We need to rename fields while reindexing.
e.g.
index-A mapping:
{
"userdata": {
"properties": {
"payload": {
"type": "object",
"properties": {
"Alldata": {
"Username": {
"type": "keyword"
},
"Designation": {
"type": "keyword"
},
"Company": {
"type": "keyword"
},
"Region": {
"type": "keyword"
}
}
}
}
}
}}
Expected structure of the index-B mapping after reindexing with the renames (Company to cnm, Region to rg):
{
"userdata": {
"properties": {
"cnm": {
"type": "keyword"
},
"rg": {
"type": "keyword"
}
}
}}
Steps we are following:
First, we use the Create Index API to create index-B with the mapping structure above.
Once the index is created, we create an ingest pipeline:
PUT ElasticSearch domain endpoint/_ingest/pipeline/my_rename_pipeline
{
"description": "rename field pipeline",
"processors": [{
"rename": {
"field": "payload.Company",
"target_field": "cnm",
"ignore_missing": true
}
},
{
"rename": {
"field": "payload.Region",
"target_field": "rg",
"ignore_missing": true
}
}
]
}
Then we perform the reindex operation with the payload below:
let reindexParams = {
wait_for_completion: false,
slices: "auto",
body: {
"conflicts": "proceed",
"source": {
"size": 8000,
"index": "index-A",
"_source": ["payload.Company", "payload.Region"]
},
"dest": {
"index": "index-B",
"pipeline": "my_rename_pipeline",
"version_type": "external"
}
}
};
Problem:
Once the reindexing is complete, all documents have been transferred to the new index with the renamed fields, as expected, but there is one additional field that was not selected. As you can see below, the "payload" object, along with its metadata, is also added to the new index after reindexing. This field is empty and contains no data.
index-B looks like this after reindexing:
{
"userdata": {
"properties": {
"cnm": {
"type": "keyword"
},
"rg": {
"type": "keyword"
},
"payload": {
"properties": {
"Alldata": {
"type": "object"
}
}
}
}
}}
We are unable to find a workaround and need help preventing this field from being created. Any help will be appreciated.
Great job! You're almost there. You simply need to remove the payload field within your pipeline using the remove processor, and you're good:
{
"description": "rename field pipeline",
"processors": [
{
"rename": {
"field": "payload.Company",
"target_field": "cnm",
"ignore_missing": true
}
},
{
"rename": {
"field": "payload.Region",
"target_field": "rg",
"ignore_missing": true
}
},
{
"remove": { <--- add this processor
"field": "payload"
}
}
]
}
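The reason the extra processor is needed can be illustrated with a small Python sketch that imitates the pipeline on a plain dictionary: renaming moves the leaf values out, but the now-empty parent object stays behind unless it is removed. This is only a local imitation of the rename and remove processors, not the ingest implementation; the sample document and its values are hypothetical.

```python
import copy


def rename(doc, field, target_field, ignore_missing=True):
    """Move the value at a dotted source path to a top-level target field."""
    *parents, leaf = field.split(".")
    node = doc
    for key in parents:
        node = node.get(key) if isinstance(node, dict) else None
    if not isinstance(node, dict) or leaf not in node:
        if ignore_missing:
            return doc
        raise KeyError(field)
    doc[target_field] = node.pop(leaf)
    return doc


def remove(doc, field):
    """Drop a top-level field (this sketch silently ignores a missing field)."""
    doc.pop(field, None)
    return doc


def run_pipeline(doc, with_remove):
    """Apply the two renames and, optionally, the remove step to a copy of doc."""
    doc = copy.deepcopy(doc)
    rename(doc, "payload.Company", "cnm")
    rename(doc, "payload.Region", "rg")
    if with_remove:
        remove(doc, "payload")
    return doc


# Hypothetical source document, already reduced by the _source filter.
source = {"payload": {"Company": "Acme", "Region": "EMEA"}}

print(run_pipeline(source, with_remove=False))  # {'payload': {}, 'cnm': 'Acme', 'rg': 'EMEA'}
print(run_pipeline(source, with_remove=True))   # {'cnm': 'Acme', 'rg': 'EMEA'}
```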

Getting all values of 2 columns

I am looking for the appropriate Elasticsearch query for:
SELECT col1,col2 FROM myTable WHERE col1="value1" AND col2 = "value2"
For example, this is my mapping:
{
"mapping": {
"doc": {
"properties": {
"book": {
"properties": {
"name": {
"type": "text"
},
"price": {
"type": "integer"
},
"booktype": {
"properties": {
"booktype": {
"type": "text"
}
}
}
}
}
}
}
}
}
I am trying to write a query that returns the price and name of the books whose booktype is Fiction.
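Try this. In the mapping shown, name, price, and booktype sit under the book object, so the field paths carry the book. prefix:
GET myTable/_search
{
"size": 1000,
"_source": [
"book.price",
"book.name"
],
"query": {
"bool": {
"must": [
{
"match": {
"book.booktype.booktype": "Fiction"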
}
]
}
}
}
Note: you might need to adapt "size" to fit your needs.
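The general shape of the translation from the SQL in the question can be sketched as a small Python helper: the selected columns become the _source list, and each AND condition becomes one clause in bool/must. The helper name select_where is hypothetical, and it only builds the request body as a dictionary; sending it to a cluster is left out.

```python
import json


def select_where(fields, conditions, size=1000):
    """Build a search body: `fields` go into _source, each condition becomes a must clause."""
    return {
        "size": size,
        "_source": list(fields),
        "query": {
            "bool": {
                "must": [{"match": {field: value}} for field, value in conditions.items()]
            }
        },
    }


# SELECT col1,col2 FROM myTable WHERE col1="value1" AND col2 = "value2"
body = select_where(["col1", "col2"], {"col1": "value1", "col2": "value2"})
print(json.dumps(body, indent=2))
```

Keep in mind that match performs analyzed full-text matching rather than strict equality; for exact values on keyword fields, a term query would be the closer equivalent of the SQL = operator.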

How can I exclude results from elasticsearch based on the contents of a field?

I'm using Elasticsearch on AWS to store logs from CloudFront. I have created a simple query that will give me all entries from the past 24h, sorted from new to old:
{
"from": 0,
"size": 1000,
"query": {
"bool": {
"must": [
{ "match": { "site_name": "some-site" } }
],
"filter": [
{
"range": {
"timestamp": {
"lt": "now",
"gte": "now-1d"
}
}
}
]
}
},
"sort": [
{ "timestamp": { "order": "desc" } }
]
}
Now, there are certain sources (based on the user agent) for which I would like to exclude results. So my question boils down to this:
How can I filter out entries from the results when a certain field contains a certain string? Or:
query.filter.where('cs_user_agent').does.not.contain('Some string')
(This is not real code, obviously.)
I have tried to make sense of the Elasticsearch documentation, but I couldn't find a good example of how to achieve this.
I hope this makes sense. Thanks in advance!
Okay, I figured it out. What I've done is use a Bool Query in combination with a wildcard:
{
"from": 0,
"size": 1000,
"query": {
"bool": {
"must": [
{ "match": { "site_name": "some-site" } }
],
"filter": [
{
"range": {
"timestamp": {
"lt": "now",
"gte": "now-1d"
}
}
}
],
"must_not": [
{ "wildcard": { "cs_user_agent": "some string*" } }
]
}
},
"sort": [
{ "timestamp": { "order": "desc" } }
]
}
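This matches any user agent value that starts with "some string" and then filters it out (because of the "must_not"). To exclude values that contain the string anywhere, the pattern would be "*some string*", although a leading wildcard makes the query slower. Wildcard patterns are compared against the indexed terms, so this behaves as described when cs_user_agent is indexed as a single, unanalyzed term.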
I hope this helps others who run into this problem.
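The difference between a trailing wildcard and a wildcard on both sides can be checked with a quick Python sketch. It uses fnmatchcase from the standard library as a stand-in for the wildcard query (both understand * and ?), and it assumes the user agent is indexed as one unanalyzed term. The user agent values are hypothetical.

```python
from fnmatch import fnmatchcase


def excluded_by(pattern, user_agent):
    """True if a must_not wildcard clause with this pattern would drop the value."""
    return fnmatchcase(user_agent, pattern)


# Hypothetical user agent values.
agents = ["some string bot/1.0", "Mozilla/5.0 (some string)", "curl/7.68.0"]

for pattern in ("some string*", "*some string*"):
    kept = [agent for agent in agents if not excluded_by(pattern, agent)]
    print(pattern, "keeps", kept)
```

With "some string*" only the first value is excluded; with "*some string*" the first two are.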
Node.js client version:
const { from, size, value, tagsIdExclude } = req.body;
const { body } = await elasticWrapper.client.search({
index: ElasticIndexs.Tags,
body: {
from: from,
size: size,
query: {
bool: {
must: {
wildcard: {
name: {
value: `*${value}*`,
boost: 1.0,
rewrite: 'constant_score',
},
},
},
filter: {
bool: {
must_not: [
{
terms: {
id: tagsIdExclude ? tagsIdExclude : [],
},
},
],
},
},
},
},
},
});

Elasticsearch/Lucene Regex fquery/query_string not returning all documents

I currently have this mapping in Elasticsearch that I am indexing with a not_analyzed field:
PUT /twitter/_mapping/tweet
{
"tweet": {
"properties" : {
"user" : {
"type" : "string",
"index": "not_analyzed"
}
}
}
}
PUT /twitter/tweet/1
{
"user": "CNN"
}
PUT /twitter/tweet/2
{
"user": "cnn"
}
PUT /twitter/tweet/3
{
"user": "Cnn"
}
PUT /twitter/tweet/4
{
"user": "cNN"
}
PUT /twitter/tweet/5
{
"user": "CnN"
}
I want to search on this index with a case-insensitive filter like so (generated through NEST, so not too flexible in changing this query syntax):
POST /twitter/tweet/_search
{
"from": 0,
"size": 10,
"query": {
"filtered": {
"filter": {
"bool": {
"must": [
{
"fquery": {
"query": {
"query_string": {
"query": "user:/[cC][nN][nN]/"
}
}
}
}
]
}
}
}
}
}
This query returns only 1 document though: "user": "cnn" (lowercase), not all of the documents.
Why is this? The same query with "query": "user:CNN" returns the correct document with the correct casing (uppercase).
EDIT: Also, if I remove the document with cnn (lowercase), the query returns nothing.
EDIT 2: In the case that this is a problem with my NEST code, here's the code used to generate the query:
// property path would be something like "user". queryTerm would be something like "cnn"
filterDescriptor.Query(
q =>
q.QueryString(
d =>
d.Query(string.Format("{0}:{1}", propertyPath,
GetCaseInsentitiveRegexExpression(queryTerm))))); // returns something like /[cC][nN][nN]/
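You need to set lowercase_expanded_terms to false. By default, lowercase_expanded_terms is true, which lowercases the text of wildcard and regexp queries before they are run. Your regex /[cC][nN][nN]/ is therefore effectively executed as /[cc][nn][nn]/, which can only match the lowercase term cnn in a not_analyzed field.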
Example:
POST /twitter/tweet/_search
{
"from": 0,
"size": 10,
"query": {
"filtered": {
"filter": {
"bool": {
"must": [
{
"fquery": {
"query": {
"query_string": {
"query": "user:/[Cc][nN][nN]/",
"lowercase_expanded_terms": false
}
}
}
}
]
}
}
}
}
}
Or, in NEST code, it would be something along these lines:
q.QueryString(
d =>
d.Query(string.Format("{0}:{1}", propertyPath,
GetCaseInsentitiveRegexExpression(queryTerm))).LowercaseExpendedTerms(false))
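The effect of the setting can be reproduced outside Elasticsearch with a few lines of Python. This is only a sketch of the behaviour, assuming that the lowercasing is applied to the regex text itself and that the regex must match the whole term; the function name is made up for illustration.

```python
import re

# The five not_analyzed values indexed in the question.
USERS = ["CNN", "cnn", "Cnn", "cNN", "CnN"]


def regexp_hits(pattern, lowercase_expanded_terms):
    """Return the users whose whole value matches the (possibly lowercased) pattern."""
    if lowercase_expanded_terms:
        pattern = pattern.lower()  # "[cC][nN][nN]" becomes "[cc][nn][nn]"
    return [user for user in USERS if re.fullmatch(pattern, user)]


print(regexp_hits("[cC][nN][nN]", lowercase_expanded_terms=True))   # ['cnn']
print(regexp_hits("[cC][nN][nN]", lowercase_expanded_terms=False))  # all five users
```

This also matches the observation in the question that removing the lowercase cnn document makes the query return nothing.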