Elasticsearch not working with 'not_analyzed' index - ruby-on-rails-4

I am unable to figure out why elasticsearch not searching with not_analysed indexes. I have following settings in my model,
settings index: { number_of_shards: 1 } do
mappings dynamic: 'false' do
indexes :id
indexes :name, index: 'not_analyzed'
indexes :email, index: 'not_analyzed'
indexes :contact_number
end
end
def as_indexed_json(options = {})
as_json(only: [ :id, :name, :username, :user_type, :is_verified, :email, :contact_number ])
end
And my mapping at elasticsearch is right, as below.
{
"users-development" : {
"mappings" : {
"user" : {
"dynamic" : "false",
"properties" : {
"contact_number" : {
"type" : "string"
},
"email" : {
"type" : "string",
"index" : "not_analyzed"
},
"id" : {
"type" : "string"
},
"name" : {
"type" : "string",
"index" : "not_analyzed"
}
}
}
}
}
}
But issue is when I make search on not analyzed fields (name and email, as I wanted them to be not analyzed) it only search on full word. Like in the example below it should have return John, Johny and Tiger, all 3 records. But it only returns 2 of the records.
I am searching as below
settings = {
query: {
filtered: {
filter: {
bool: {
must: [
{ terms: { name: [ "john", "tiger" ] } },
]
}
}
}
},
size: 10
}
User.__elasticsearch__.search(settings).records
This is how I am creating index on my user object in callback after_save,
User.__elasticsearch__.client.indices.create(
index: User.index_name,
id: self.id,
body: self.as_indexed_json,
)
Some of the document that should match
[{
"_index" : "users-development",
"_type" : "user",
"_id" : "670",
"_score" : 1.0,
"_source":{"id":670,"email":"john#monkeyofdoom.com","name":"john baba","contact_number":null}
},
{
"_index" : "users-development",
"_type" : "user",
"_id" : "671",
"_score" : 1.0,
"_source":{"id":671,"email":"human#monkeyofdoom.com","name":"Johny Rocket","contact_number":null}
}
, {
"_index" : "users-development",
"_type" : "user",
"_id" : "736",
"_score" : 1.0,
"_source":{"id":736,"email":"tiger#monkeyofdoom.com","name":"tiger sherof", "contact_number":null}
} ]
Any suggestions please.

I think you would get desired results with keyword toknizer combined with lowercase filter rather than using not_analyzed.
The reason john* did not match Johny was due to case sensitivity.
This setup will work
{
"settings": {
"analysis": {
"analyzer": {
"keyword_analyzer": {
"type": "custom",
"filter": [
"lowercase"
],
"tokenizer": "keyword"
}
}
}
},
"mappings": {
"my_type": {
"properties": {
"name": {
"type": "string",
"analyzer": "keyword_analyzer"
}
}
}
}
}
Now john* will match johny. You should be using multi-fields if you have various requirements. terms query for john wont give you john baba as inside inverted index there is no token as john. You could use standard analyzer on one field and keyword analyzer on other.

As per the documentation term query
The term query finds documents that contain the exact term specified in the inverted index.
You are searching for john but none of your documnents contain john i.e why you were not getting any result. Either you can your field analysed and then apply query string or search for exact term.
Refer https://www.elastic.co/guide/en/elasticsearch/reference/2.x/query-dsl-term-query.html for more details

Related

must match query not working as expected in Elasticsearch

I've created my index below using Kibana which connected to my AWS ES domain:
PUT sals_poc_test_20210217-7
{
"settings" : {
"index" : {
"number_of_shards" : 10,
"number_of_replicas" : 1,
"max_result_window": 50000,
"max_rescore_window": 50000
}
},
"mappings": {
"properties": {
"identifier": {
"type": "keyword"
},
"CLASS_NAME": {
"type": "keyword"
},
"CLIENT_ID": {
"type": "keyword"
}
}
}
}
then I've indexed 100 documents, using below command returns all 100 documents:
POST /sals_poc_test_20210217-7/_search
{
"query": {
"match": {
"_index": "sals_poc_test_20210217-7"
}
}
}
two sample documents are below:
{
"_index" : "sals_poc_test_20210217-7",
"_type" : "_doc",
"_id" : "cd0a3723-106b-4aea-b916-161e5563290f",
"_score" : 1.0,
"_source" : {
"identifier" : "xweeqkrz",
"class_name" : "/Sample_class_name_1",
"client_id" : "random_str"
}
},
{
"_index" : "sals_poc_test_20210217-7",
"_type" : "_doc",
"_id" : "cd0a3723-106b-4aea-b916-161e556329ab",
"_score" : 1.0,
"_source" : {
"identifier" : "xweeqkra",
"class_name" : "/Sample_class_name_2",
"client_id" : "random_str_2"
}
}
but when I wanted to search by CLASS_NAME by below command:
POST /sals_poc_test_20210217-7/_search
{
"size": 200,
"query": {
"bool": {
"must": [
{ "match": { "CLASS_NAME": "/Sample_class_name_1"}}
]
}
}
}
Not only the documents that match this class_name returned, but also other ones.
Anyone could shed any light into this case please?
I'm suspecting the way I wrote my search query is problematic. But cannot figure out why.
Thanks!
Elastic search, is case sensitive. class_name is not equal to CLASS_NAME sample documents seems to have class_name but mapping in index seems to have 'CLASS_NAME.
If we GET sals_poc_test_20210217-7, both class name attributes should be in the index mapping. The one when creating the index and second one created when adding documents to index.
so, query should be on CLASS_NAME or class_name.keyword , by default elastic search creates both text and .keyword field for dynamic attributes
"CLASS_NAME" : {
"type" : "keyword"
},
"class_name" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword",
"ignore_above" : 256
}
}
}

Why doesn't the Keyword analyzer applied to a Text field return results when the pattern contains a dash in Regexp search query?

I have created a small example to demonstrate the specific issue I'm having. Briefly, when I create a multi-field mapping using a field type of Text and the Keyword analyzer, no documents are returned from an Elasticsearch Regexp search query that contains punctuation. I use a dash in the following example to demonstrate the problem.
I’m using Elasticsearch 7.10.2. The index I’m targeting is already populated with millions of documents. The field of type Text where I need to run some regular expressions uses the Standard (default) analyzer. I understand that, because the field gets tokenized by the Standard analyzer, the following request:
POST _analyze
{
"analyzer" : "default",
"text" : "The number is: 123-4576891-73.\n\n"
}
will yield three words: "the", "number", "is" and three groups of numbers: "123", "4567891", "73". It's obvious that a regular expression that relies on punctuation, like this one that contains two literal dashes:
"(.*[^a-z0-9_])?[0-9]{3}-[0-9]{7}-[0-9]{2}([^a-z0-9_].*)?"
will not return a result. Note, for those not familiar with this, regex shortcuts do not work for Lucene-based Elasticsearch requests (at least not yet). Here's a reference: https://www.elastic.co/guide/en/elasticsearch/reference/current/regexp-syntax.html. Also, the use of word boundaries that I show in my examples (.*[^a-z0-9_])? and ([^a-z0-9_].*)? are from this post: Word boundary in Lucene regex.
To see this for yourself with an example, create and populate an index like so:
PUT /index-01
{
"settings": {
"number_of_shards": 1
},
"mappings": {
"properties": {
"text": { "type": "text" }
}
}
}
POST index-01/_doc/
{
"text": "The number is: 123-4576891-73.\n\n"
}
The following Regexp search query will return nothing because of the tokenization issue I described earlier:
POST index-01/_search
{
"size": 1,
"query": {
"regexp": {
"text": {
"value": "(.*[^a-z0-9_])?[0-9]{3}-[0-9]{7}-[0-9]{2}([^a-z0-9_].*)?",
"flags": "ALL",
"case_insensitive": true,
"max_determinized_states": 100000
}
}
},
"_source": false,
"highlight": {
"fields": {
"text": {}
}
}
}
Most posts suggest a quick fix would be to target the Keyword type multi-field instead of the text field. The Keyword multi-type field gets created automatically, as this shows:
GET index-01/_mapping/field/text
response:
{
"index-01" : {
"mappings" : {
"text" : {
"full_name" : "text",
"mapping" : {
"text" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword",
"ignore_above" : 256
}
}
}
}
}
}
}
}
Targeting the keyword field, I get return results for the following Regexp search query:
POST index-01/_search
{
"size": 1,
"query": {
"regexp": {
"text.keyword": {
"value": "(.*[^a-z0-9_])?[0-9]{3}-[0-9]{7}-[0-9]{2}([^a-z0-9_].*)?",
"flags": "ALL",
"case_insensitive": true,
"max_determinized_states": 100000
}
}
},
"_source": false,
"highlight": {
"fields": {
"text.keyword": {}
}
}
}
here's the hit-highlighted part of the result:
...
"highlight" : {
"text.keyword" : [
"<em>This is my number 123-4576891-73. Thanks\n\n</em>"
]
}
...
Because some of the documents have a large amount of text, I adjusted the text.keyword field size with ignore_above parameter:
PUT /index-01/_mapping
{
"properties": {
"text": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 32766
}
}
}
}
}
However, this will skip some documents since the targeted index, contains larger text fields than this upper-bound for a field type Keyword. Also, according to the Elasticsearch documentation here: https://www.elastic.co/guide/en/elasticsearch/reference/current/keyword.html, this type of field is really designed for structured data, constant values and wildcard queries.
Following that guidance, I assigned the Keyword analyzer to a new field type Text (text.raw) by making this update to the mapping:
PUT /index-01/_mapping
{
"properties": {
"text": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 32766
},
"raw": {
"type": "text",
"analyzer": "keyword",
"index": true
}
}
}
}
}
Now, you can see the additional mapping text.raw with this request:
GET index-01/_mapping/field/text
response:
{
"index-01" : {
"mappings" : {
"text" : {
"full_name" : "text",
"mapping" : {
"text" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword",
"ignore_above" : 32766
},
"raw" : {
"type" : "text",
"analyzer" : "keyword"
}
}
}
}
}
}
}
}
Next, I verified that the data was, in fact, mapped to the multi-fields:
POST index-01/_search
{
"query":
{
"match_all": {}
},
"fields": ["text", "text.keyword", "text.raw"]
}
response:
...
"hits" : [
{
"_index" : "index-01",
"_type" : "_doc",
"_id" : "2R-OgncBn-TNB4PjXYAh",
"_score" : 1.0,
"_source" : {
"text" : "The number is: 123-4576891-73.\n\n"
},
"fields" : {
"text" : [
"The number is: 123-4576891-73.\n\n"
],
"text.keyword" : [
"The number is: 123-4576891-73.\n\n"
],
"text.raw" : [
"The number is: 123-4576891-73.\n\n"
]
}
}
]
...
I also verified that the Keyword analyzer applied to the text.raw field contains a single token, as shown in the following request:
POST _analyze
{
"analyzer" : "keyword",
"text" : "The number is: 123-4576891-73.\n\n"
}
response:
{
"tokens" : [
{
"token" : "The number is: 123-4576891-73.\n\n",
"start_offset" : 0,
"end_offset" : 32,
"type" : "word",
"position" : 0
}
]
}
However, the exact same Regexp search query targeting the text.raw field returns nothing:
POST index-01/_search
{
"size": 1,
"query": {
"bool": {
"must": [
{
"regexp": {
"text.raw": {
"value": "(.*[^a-z0-9_])?[0-9]{3}-[0-9]{7}-[0-9]{2}([^a-z0-9_].*)?",
"flags": "ALL",
"case_insensitive": true,
"max_determinized_states": 100000
}
}
}
]
}
},
"_source": false,
"highlight" : {
"fields" : {
"text.raw": {}
}
}
}
Please let me know if you know why I'm not getting back a result using the field type Text with the Keyword analyzer.

AWS Elastic search : Search should be performed on all combination with given query

I'm working on AWS Elastic Search. I've come across one situation in my project where in my reports i have to search keywords like "corona virus".
But result should come with containing keywords like "Corona virus" and "corona" and "virus" and "coronavirus".
Please guide me how i should build my query DSL.
Note: Working on PHP language.
Appreciate your help.
//Amit
You need to use shingle token filter
A token filter of type shingle that constructs shingles (token
n-grams) from a token stream. In other words, it creates combinations
of tokens as a single token. For example, the sentence "please divide
this sentence into shingles" might be tokenized into shingles "please
divide", "divide this", "this sentence", "sentence into", and "into
shingles".
Mapping
PUT index91
{
"settings": {
"analysis": {
"analyzer": {
"my_analyzer": {
"tokenizer": "standard",
"filter": [
"lowercase",
"shingle_filter"
]
}
},
"filter": {
"shingle_filter": {
"type": "shingle",
"min_shingle_size": 2,
"max_shingle_size": 3,
"output_unigrams": true,
"token_separator": ""
}
}
}
},
"mappings": {
"properties": {
"title": {
"type": "text",
"analyzer": "my_analyzer"
}
}
}
}
Data:
POST index91/_doc
{
"title":"corona virus"
}
Query:
GET index91/_search
{
"query": {
"match": {
"title": "coronavirus"
}
}
}
Result:
"hits" : [
{
"_index" : "index91",
"_type" : "_doc",
"_id" : "gNmUZHEBrJsHVOidaoU_",
"_score" : 0.9438393,
"_source" : {
"title" : "corona virus"
}
}
It will also work for "corona", "corona virus","virus"

How do i define ElasticSearch Dynamic Templates?

I'm trying to define dynamic templates in Elastic Search to automatically set analysers for currently undefined properties for translations.
E.g. The following does exactly what i want, which is to set lang.en.title to use the english analyzer:
PUT /cl
{
"mappings" : {
"titles" : {
"properties" : {
"id" : {
"type" : "integer",
"index" : "not_analyzed"
},
"lang" : {
"type" : "object",
"properties" : {
"en" : {
"type" : "object",
"properties" : {
"title" : {
"type" : "string",
"index" : "analyzed",
"analyzer" : "english"
}
}
}
}
}
}
}
}
}
Which stems lang.en.title as expected e.g.
GET /cl/_analyze?field=lang.en.title&text=knocked
{
"tokens": [
{
"token": "knock",
"start_offset": 0,
"end_offset": 7,
"type": "<ALPHANUM>",
"position": 1
}
]
}
But what i'm trying to do is set all future string properties of lang.en to use the english analyser using a dynamic template which i can't seem to get working...
PUT /cl
{
"mappings" : {
"titles" : {
"dynamic_templates" : [{
"lang_en" : {
"path_match" : "lang.en.*",
"mapping" : {
"type" : "string",
"index" : "analyzed",
"analyzer" : "english"
}
}
}],`enter code here`
"properties" : {
"id" : {
"type" : "integer",
"index" : "not_analyzed"
},
"lang" : {
"type" : "object"
}
}
}
}
}
The english analyser isn't being applied as lang.en.title isn't stemmed as desired -
GET /cl/_analyze?field=lang.en.title&text=knocked
{
"tokens": [
{
"token": "knocked",
"start_offset": 0,
"end_offset": 7,
"type": "<ALPHANUM>",
"position": 1
}
]
}
What am i missing? :)
Your dynamic template is defined correctly. The issue is that you will need to index a document with the lang.en.title field in it before the dynamic template will apply the appropriate mapping. I ran the same dynamic mapping that you have defined above in your question locally and got the same result as you.
However, then I added a single document to the index.
POST /cl/titles/1
{
"lang.en.title": "Knocked out"
}
After adding the document, I ran the analyzer again and I got the expected output:
GET /cl/_analyze?field=lang.en.title&text=knocked
{
"tokens": [
{
"token": "knock",
"start_offset": 0,
"end_offset": 7,
"type": "<ALPHANUM>",
"position": 1
}
]
}
The index needs to have a document inserted so that it can execute the defined mapping template for the inserted fields. Once that field exists in the index and has the dynamic mapping applied, _analyze API calls will execute as expected.

Regex inside array in mongoDB

i want to do a query inside a array in mongodb with regex, the collections have documents like this:
{
"_id" : ObjectId("53340d07d6429d27e1284c77"),
"company" : "New Company",
"worktypes" : [
{
"name" : "Pompas",
"works" : [
{
"name" : "name 2",
"code" : "A00011",
"price" : "22,22"
},
{
"name" : "name 3",
"code" : "A00011",
"price" : "22,22"
},
{
"name" : "name 4",
"code" : "A00011",
"price" : "22,22"
},
{
"code" : "asdasd",
"name" : "asdads",
"price" : "22"
},
{
"code" : "yy",
"name" : "yy",
"price" : "11"
}
]
},
{
"name" : "name 4",
"works" : [
{
"code" : "A112",
"name" : "Nombre",
"price" : "11,2"
}
]
},
{
"name" : "ee",
works":[
{
"code" : "aa",
"name" : "aa",
"price" : "11"
},
{
"code" : "A00112",
"name" : "Nombre",
"price" : "12,22"
}
]
}
]
}
Then i need to find a document by the company name and any work inside it have match a regex in code or name work.
I have this:
var companyquery = { "company": "New Company"};
var regQuery = new RegExp('^A0011.*$', 'i');
db.categories.find({$and: [companyquery,
{$or: [
{"worktypes.works.$.name": regQuery},
{"worktypes.works.$.code": regQuery}
]}]})
But dont return any result..I think the error is try to search inside array with de dot and $..
Any idea?
Edit:
With this:
db.categories.find({$and: [{"company":"New Company"},
{$or: [
{"worktypes.works.name": {"$regex": "^A00011$|^a00011$"}},
{"worktypes.works.code": {"$regex": "^A00011$|^a00011$"}}
]}]})
This is the result:
{
"_id" : ObjectId("53340d07d6429d27e1284c77"),
"company" : "New Company",
"worktypes" : [
{
"name" : "Pompas",
"works" : [
{
"name" : "name 2",
"code" : "A00011",
"price" : "22,22"
},
{
"code" : "aa",
"name" : "aa",
"price" : "11"
},
{
"code" : "A00112",
"name" : "Nombre",
"price" : "12,22"
},
{
"code" : "asdasd",
"name" : "asdads",
"price" : "22"
},
{
"code" : "yy",
"name" : "yy",
"price" : "11"
}
]
},
{
"name" : "name 4",
"works" : [
{
"code" : "A112",
"name" : "Nombre",
"price" : "11,2"
}
]
},
{
"name" : "Bombillos"
},
{
"name" : "Pompas"
},
{
"name" : "Bombillos 2"
},
{
"name" : "Other type"
},
{
"name" : "Other new type"
}
]
}
The regex dont field the results ok..
You are using a JavaScript native RegExp object for the regular expression, however for mongo to process the regular expression it needs to be sent as part of the query document, and this is not the same thing.
Also the regex will not match the values that you want. It could actualy be ^A0111$ for the exact match, but your case insensitive match causes a problem causing a larger scan of a possible index. So there is a better way to write that. Also see the documentation link for the problems with case insensitive matches.
Use the $regex operator instead:
db.categories.find({
"$and": [
{"company":"New Company"},
{ "$or": [
{ "worktypes.works.name": { "$regex": "^A00011$|^a00011$" }},
{ "worktypes.works.code": { "$regex": "^A00011$|^a00011$" }}
]}
]
})
Also the positional $ placeholders are not valid for a query, they are only used in projection or an update or the first matching element found by the query.
But your actual problem seems to be that you are trying to only get the elements of an array that "match" your conditions. You cannot do this with .find() and for that you need to use .aggregate() instead:
db.categories.aggregate([
// Always makes sense to match the actual documents
{ "$match": {
"$and": [
{"company":"New Company"},
{ "$or": [
{ "worktypes.works.name": { "$regex": "^A00011$|^a00011$" }},
{ "worktypes.works.code": { "$regex": "^A00011$|^a00011$" }}
]}
]
}},
// Unwind the worktypes array
{ "$unwind": "$worktypes" },
// Unwind the works array
{ "$unwind": "$worktypes.works" },
// Then use match to filter only the matching entries
{ "$match": {
"$or": [
{ "worktypes.works.name": { "$regex": "^A00011$|^a00011$" } },
{ "worktypes.works.code": { "$regex": "^A00011$|^a00011$" } }
]
}},
/* Stop */
// If you "really" need the arrays back then include all the following
// Otherwise the steps up to here actually got you your results
// First put the "works" array back together
{ "$group": {
"_id": {
"_id": "$_id",
"company": "$company",
"workname": "$worktypes.name"
},
"works": { "$push": "$worktypes.works" }
}},
// Then put the "worktypes" array back
{ "$group": {
"_id": "$_id._id",
"company": { "$first": "$_id.company" },
"worktypes": {
"$push": {
"name": "$_id.workname",
"works": "$works"
}
}
}}
])
So what .aggregate() does with all of these stages is it breaks the array elements into normal document form so they can be filtered using the $match operator. In that way, only the elements that "match" are returned.
What "find" is correctly doing is matching the "document" that meets the conditions. Since documents contain the elements that match then they are returned. The two principles are very different things.
When you mean to "filter" use aggregate.
i think there is a typo :
the regex should be : ^A00011.*$
triple 0 instead of double 0
You can try aggregate method and aggregation array operators, so this query will be supported from MongoDB 4.2,
$match to match your condition
$addFields to add/edit field in document
$map to iterate loop of worktypes array
$filter to iterate loop of works array and it will return the filtered result as per provided condition
$regexMatch to match regex expression same as we did in $match stage, it will return a boolean response, so we checked $or condition here,
$mergeObjects to merge current object of worktypes and updated works array property
second $addFields for remove empty result of works array
$filter to iterate loop of worktypes array and check negative condition to remove empty works document
db.categories.aggregate([
{
$match: {
$and: [
{ "company": "New Company" },
{
$or: [
{ "worktypes.works.name": { "$regex": "^A00011$|^a00011$" } },
{ "worktypes.works.code": { "$regex": "^A00011$|^a00011$" } }
]
}
]
}
},
{
$addFields: {
worktypes: {
$map: {
input: "$worktypes",
in: {
$mergeObjects: [
"$$this",
{
works: {
$filter: {
input: "$$this.works",
cond: {
$or: [
{
$regexMatch: {
input: "$$this.name",
regex: "^A00011$|^a00011$"
}
},
{
$regexMatch: {
input: "$$this.code",
regex: "^A00011$|^a00011$"
}
}
]
}
}
}
}
]
}
}
}
}
},
{
$addFields: {
worktypes: {
$filter: {
input: "$worktypes",
cond: { $ne: ["$$this.works", []] }
}
}
}
}
])
Playground