Case Insensitive Search With Regex - regex

I'm trying to implement a case-insensitive search with regex.
Example: /^sanford/i (searching for anything starting with "sanford", disregarding case sensitivity).
For case-insensitive queries, the documentation recommends creating indexes with a custom collation. This works fine as long as no regex is involved.
The problem: when searching with a regex (in this case "starts with"), the case-insensitive search does NOT use the index.
This is stated in the documentation multiple times and is also reproducible with an explain command.
To sum it up: it works, but without effectively using the index. I'd be glad to get any hints; I can't shake the feeling that I'm missing something fundamentally important here.
Inserting the string with toLowerCase and then searching only with lower cased strings is not an option.
I can't use a text index because there can only be one per collection.
For an example from the documentation, see the link below (the green info box at the bottom).
#D.SM: The index is used, but it scans all documents.
https://docs.atlas.mongodb.com/schema-suggestions/case-insensitive-regex/
Example document:
{
    "name": [{
        "family": "Test",
        "given": "Name"
    }]
}
Index with collation:
{ "name" : "name_family", "key" : { "name.family" : 1 }, "host" : "noneofyourbusiness.com", "accesses" : { "ops" : NumberLong(114), "since" : ISODate("2020-07-30T20:25:59.079Z") }, "shard" : "shA", "spec" : { "v" : 2, "key" : { "name.family" : 1 }, "name" : "name_family", "ns" : "noneofyourbusiness.somethingwithaname", "collation" : { "locale" : "de", "caseLevel" : false, "caseFirst" : "off", "strength" : 1, "numericOrdering" : false, "alternate" : "non-ignorable", "maxVariable" : "punct", "normalization" : false, "backwards" : false, "version" : "57.1" } } }
}
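For reference, a minimal sketch of the two explain() calls that illustrate the behavior (collection name taken from the ns above): an equality match that specifies the index's collation can use the index, while the case-insensitive prefix regex cannot be bounded by it.

// equality match with the index's collation -> expect a bounded IXSCAN on name.family
db.somethingwithaname.find({ "name.family" : "sanford" })
    .collation({ "locale" : "de", "strength" : 1 })
    .explain("executionStats")

// case-insensitive prefix regex -> expect an unbounded index or collection scan
db.somethingwithaname.find({ "name.family" : /^sanford/i })
    .explain("executionStats")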

Related

Partial String Search along with Multiple Word in AWS Elastic Search

I have started using the AWS Elasticsearch Service and I want to search a JSON array of objects with both partial string search and multiple-word search.
For example, I have added these three objects in an array.
[{
"_id" : "1",
"TitleKeywords" : "Game of thrones"
},
{
"_id" : "2",
"TitleKeywords" : "Baywatch"
},
{
"_id" : "3",
"TitleKeywords" : "Spider Man"
}]
Now I want to search on the field TitleKeywords, with support for both partial words and multiple words.
For example, if I search for 'Spi', it should return the JSON object below.
{
"_id" : "3",
"TitleKeywords" : "Spider Man"
}
I searched for the query syntax and found the query below:
query: {
query_string: {
default_field: "TitleKeywords",
query: "*Spi*"
}
}
Also, the search keyword 'Spider M' (a multiple-word search) should return the JSON object below:
{
"_id" : "3",
"TitleKeywords" : "Spider Man"
}
If I want to search on multiple words, I can use the query below:
query: {
match: {
"TitleKeywords": {
"query": "Spider M",
"operator": "and"
}
}
}
I want my result to combine both queries, i.e. partial string search on multiple words.
Can anyone please help me with this?
Thanks
I think you should consider using a combination of multi-fields with the NGram tokenizer.
So you add an ngram subfield to your "TitleKeywords" and query it this way:
query: {
match: {
"TitleKeywords.ngram": {
"query" : "Spider M",
"operator": "and"
}
}
}
But n-gramming from a single character can be ineffective, so I'm not sure if this suits your needs.
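For illustration, a possible mapping sketch that adds such an ngram subfield (the index name, gram sizes, and analyzer names are placeholders, and the exact syntax depends on your Elasticsearch version):

PUT titles
{
  "settings": {
    "analysis": {
      "tokenizer": {
        "ngram_tokenizer": { "type": "ngram", "min_gram": 2, "max_gram": 3 }
      },
      "analyzer": {
        "ngram_analyzer": {
          "type": "custom",
          "tokenizer": "ngram_tokenizer",
          "filter": ["lowercase"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "TitleKeywords": {
        "type": "text",
        "fields": {
          "ngram": {
            "type": "text",
            "analyzer": "ngram_analyzer",
            "search_analyzer": "standard"
          }
        }
      }
    }
  }
}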

elasticsearch span_near query false hits

I have a text field containing an XML document, in which I try to find this kind of match:
<Payer> [...] bic=\"123456789\" [...] </Payer>
with the following query:
{
"query": {
"span_near" : {
"clauses" : [
{ "span_term" : { "field" : "payer" }},
{ "span_term" : { "field" : "bic" }},
{ "span_term" : { "field" : "123456789" }},
{ "span_term" : { "field" : "payer"}}
],
"slop" : 500,
"in_order" : true
}
}
}
The problem is that I sometimes get wrong matches if the XML document contains something like:
<Payer>bic=\"111111111\"</Payer><Payee>bic=\"123456789\"</Payee><Payer>bic=\"222222222\"</Payer>
The query finds PayeE instead of PayeR; from Elasticsearch's point of view this is still a valid match.
Any ideas how I can prevent this "greedy" search?
As far as I know from this topic, regexp is not an option because "Elasticsearch (and lucene) don't support full Perl-compatible regex syntax". This means a regexp query matches tokens, not the whole string.
I also tried making the last span_term /payer or \\/payer or </payer, but then it finds nothing at all.
You may add a span_not query:
Removes matches which overlap with another span query. The span not query maps to Lucene SpanNotQuery.
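A sketch of how this could wrap the query above (the exclude clause is an assumption: it drops any candidate span that also contains the token "payee"):

{
  "query": {
    "span_not": {
      "include": {
        "span_near": {
          "clauses": [
            { "span_term": { "field": "payer" } },
            { "span_term": { "field": "bic" } },
            { "span_term": { "field": "123456789" } },
            { "span_term": { "field": "payer" } }
          ],
          "slop": 500,
          "in_order": true
        }
      },
      "exclude": { "span_term": { "field": "payee" } }
    }
  }
}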

MongoDB Query For Fields That Vary - Wildcards?

I am looking for a way to get distinct "unit" values from a collection that has a structure similar to the following:
{
"_id" : ObjectId("548b1aee6e444414f00d5cf1"),
"KPI" : {
"NPV" : {
"value" : 100,
"unit" : "kUSD"
},
"NPM" : {
"value" : 100,
"unit" : "kUSD"
},
"GPM" : {
"value" : 50,
"unit" : "CAD"
}
}
}
I looked into using wildcards and regex, but from what I have come across this is not supported for field-name matching. I would like to do something like db.collection.distinct('KPI.*.unit') but cannot determine how, and it seems like performance would be poor. Does anyone have a recommendation? Thanks.
It's not a good practice to make the keys a part of the content of the document - don't use keys as data. If you don't change your document structure, you'll need to know what the possible subfields of KPI are. If you don't know what those could be, you will need to examine the documents manually to find them. Then you can issue a distinct for each using dot notation, e.g. db.collection.distinct("KPI.NPM.unit").
If what you're looking for instead is the distinct values of unit across all values of the parent KPI subfield, then you could take the union of the results of those distincts. You can also do it easily with the aggregation framework in MongoDB 2.6. For simplicity, I'll assume there are just three distinct subfields of KPI, the ones in the document above.
db.collection.aggregate([
    { "$group" : {
        "_id" : 0,
        "NPVunits" : { "$addToSet" : "$KPI.NPV.unit" },
        "NPMunits" : { "$addToSet" : "$KPI.NPM.unit" },
        "GPMunits" : { "$addToSet" : "$KPI.GPM.unit" }
    } },
    { "$project" : { "distinct_units" : { "$setUnion" : ["$NPVunits", "$NPMunits", "$GPMunits"] } } }
])
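As a side note, on MongoDB 3.4.4 or later you could avoid hard-coding the subfield names with $objectToArray; a sketch:

db.collection.aggregate([
    // turn { NPV: {...}, NPM: {...}, ... } into an array of { k, v } pairs
    { "$project" : { "kpis" : { "$objectToArray" : "$KPI" } } },
    { "$unwind" : "$kpis" },
    { "$group" : { "_id" : null, "distinct_units" : { "$addToSet" : "$kpis.v.unit" } } }
])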
You could also structure your data as dynamic attributes. The document above would be recast as something like
{
"_id" : ObjectId("548b1aee6e444414f00d5cf1"),
"KPI" : [
{ "type" : "NPV", "value" : 100, "unit" : "kUSD" },
{ "type" : "NPM", "value" : 100, "unit" : "kUSD" },
{ "type" : "GPM", "value" : 50, "unit" : "CAD" }
]
}
Querying for distinct units is easy now, whether you want it per type or over all types:
Per type (all types in one query)
db.collection.aggregate([
{ "$unwind" : "$KPI" },
{ "$group" : { "_id" : "$KPI.type", "units" : { "$addToSet" : "$KPI.unit" } } }
])
Over all types
db.collection.distinct("KPI.unit")

Can I do a MongoDB "starts with" query on an indexed subdocument field?

I'm trying to find documents where a field starts with a value.
Table scans are disabled using notablescan.
This works:
db.articles.find({"url" : { $regex : /^http/ }})
This doesn't:
db.articles.find({"source.homeUrl" : { $regex : /^http/ }})
I get the error:
error: { "$err" : "table scans not allowed:moreover.articles", "code" : 10111 }
There are indexes on both url and source.homeUrl:
{
"v" : 1,
"key" : {
"url" : 1
},
"ns" : "mydb.articles",
"name" : "url_1"
}
{
"v" : 1,
"key" : {
"source.homeUrl" : 1
},
"ns" : "mydb.articles",
"name" : "source.homeUrl_1",
"background" : true
}
Are there any limitations with regex queries on subdocument indexes?
When you disable table scans, any query for which a table scan "wins" in the query optimizer will fail to run. You haven't posted an explain, but based on the error it's reasonable to assume that's what is happening here. Try hinting the index explicitly:
db.articles.find({"source.homeUrl" : { $regex : /^http/ }}).hint({"source.homeUrl" : 1})
That should eliminate the table scan as a possible choice and allow the query to return successfully.
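To verify, you can check the query plan; a minimal sketch, expecting an index scan stage rather than a collection scan:

db.articles.find({"source.homeUrl" : { $regex : /^http/ }}).hint({"source.homeUrl" : 1}).explain()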

combining regex and embedded objects in mongodb queries

I am trying to combine regex and embedded object queries and failing miserably. I am either hitting a limitation of MongoDB or getting something slightly wrong; maybe someone out there has encountered this. The documentation certainly doesn't cover this case.
data being queried:
{
"_id" : ObjectId("4f94fe633004c1ef4d892314"),
"productname" : "lightbulb",
"availability" : [
{
"country" : "USA",
"storeCode" : "abc-1234"
},
{
"country" : "USA",
"storeCode" : "xzy-6784"
},
{
"country" : "USA",
"storeCode" : "abc-3454"
},
{
"country" : "CANADA",
"storeCode" : "abc-6845"
}
]
}
Assume the collection contains only one record.
This query returns 1:
db.testCol.find({"availability":{"country" : "USA","storeCode":"xzy-6784"}}).count();
This query returns 1:
db.testCol.find({"availability.storeCode":/.*/}).count();
But, this query returns 0:
db.testCol.find({"availability":{"country" : "USA","storeCode":/.*/}}).count();
Does anyone understand why? Is this a bug?
thanks
You are referencing the embedded storeCode incorrectly - you are treating it as an embedded object when in fact what you have is an array of objects. Compare these results:
db.testCol.find({"availability.0.storeCode":/x/});
db.testCol.find({"availability.0.storeCode":/a/});
Using your sample doc above, the first one will not return the document, because the first storeCode does not have an x in it ("abc-1234"); the second will return it. That's fine when you are looking at a single element of the array and pass in its position. In order to search all of the objects in the array, you want $elemMatch.
As an example, I added this second example doc:
{
"_id" : ObjectId("4f94fe633004c1ef4d892315"),
"productname" : "hammer",
"availability" : [
{
"country" : "USA",
"storeCode" : "abc-1234"
}
]
}
Now, have a look at the results of these queries:
PRIMARY> db.testCol.find({"availability" : {$elemMatch : {"storeCode":/a/}}}).count();
2
PRIMARY> db.testCol.find({"availability" : {$elemMatch : {"storeCode":/x/}}}).count();
1
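To combine this with the country condition from the original question, both predicates can go inside the same $elemMatch; a sketch against the two sample docs above (it should count both, since each has a USA element):

db.testCol.find({ "availability" : { $elemMatch : { "country" : "USA", "storeCode" : /.*/ } } }).count();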