Regex inside array in mongoDB - regex

I want to query inside an array in MongoDB with a regex. The collection has documents like this:
{
"_id" : ObjectId("53340d07d6429d27e1284c77"),
"company" : "New Company",
"worktypes" : [
{
"name" : "Pompas",
"works" : [
{
"name" : "name 2",
"code" : "A00011",
"price" : "22,22"
},
{
"name" : "name 3",
"code" : "A00011",
"price" : "22,22"
},
{
"name" : "name 4",
"code" : "A00011",
"price" : "22,22"
},
{
"code" : "asdasd",
"name" : "asdads",
"price" : "22"
},
{
"code" : "yy",
"name" : "yy",
"price" : "11"
}
]
},
{
"name" : "name 4",
"works" : [
{
"code" : "A112",
"name" : "Nombre",
"price" : "11,2"
}
]
},
{
"name" : "ee",
"works" : [
{
"code" : "aa",
"name" : "aa",
"price" : "11"
},
{
"code" : "A00112",
"name" : "Nombre",
"price" : "12,22"
}
]
}
]
}
I need to find a document by company name where any work inside it matches a regex on the work's code or name.
I have this:
var companyquery = { "company": "New Company"};
var regQuery = new RegExp('^A0011.*$', 'i');
db.categories.find({$and: [companyquery,
{$or: [
{"worktypes.works.$.name": regQuery},
{"worktypes.works.$.code": regQuery}
]}]})
But it doesn't return any results. I think the error is in trying to search inside the array with the dot and $.
Any ideas?
Edit:
With this:
db.categories.find({$and: [{"company":"New Company"},
{$or: [
{"worktypes.works.name": {"$regex": "^A00011$|^a00011$"}},
{"worktypes.works.code": {"$regex": "^A00011$|^a00011$"}}
]}]})
This is the result:
{
"_id" : ObjectId("53340d07d6429d27e1284c77"),
"company" : "New Company",
"worktypes" : [
{
"name" : "Pompas",
"works" : [
{
"name" : "name 2",
"code" : "A00011",
"price" : "22,22"
},
{
"code" : "aa",
"name" : "aa",
"price" : "11"
},
{
"code" : "A00112",
"name" : "Nombre",
"price" : "12,22"
},
{
"code" : "asdasd",
"name" : "asdads",
"price" : "22"
},
{
"code" : "yy",
"name" : "yy",
"price" : "11"
}
]
},
{
"name" : "name 4",
"works" : [
{
"code" : "A112",
"name" : "Nombre",
"price" : "11,2"
}
]
},
{
"name" : "Bombillos"
},
{
"name" : "Pompas"
},
{
"name" : "Bombillos 2"
},
{
"name" : "Other type"
},
{
"name" : "Other new type"
}
]
}
The regex doesn't filter the results correctly.

You are using a JavaScript native RegExp object for the regular expression, however for mongo to process the regular expression it needs to be sent as part of the query document, and this is not the same thing.
Also, the regex will not match the values that you want: the pattern contains two zeros, but the stored code is A00011. For an exact match it could actually be ^A00011$. The case-insensitive match is a problem too, because it causes a larger scan of any index that could otherwise be used, so there is a better way to write it. Also see the documentation for the problems with case-insensitive matches.
Use the $regex operator instead:
db.categories.find({
"$and": [
{"company":"New Company"},
{ "$or": [
{ "worktypes.works.name": { "$regex": "^A00011$|^a00011$" }},
{ "worktypes.works.code": { "$regex": "^A00011$|^a00011$" }}
]}
]
})
Also, the positional $ placeholders are not valid in a query. They are only used in a projection or an update, where they refer to the first matching element found by the query.
But your actual problem seems to be that you are trying to only get the elements of an array that "match" your conditions. You cannot do this with .find() and for that you need to use .aggregate() instead:
db.categories.aggregate([
// Always makes sense to match the actual documents
{ "$match": {
"$and": [
{"company":"New Company"},
{ "$or": [
{ "worktypes.works.name": { "$regex": "^A00011$|^a00011$" }},
{ "worktypes.works.code": { "$regex": "^A00011$|^a00011$" }}
]}
]
}},
// Unwind the worktypes array
{ "$unwind": "$worktypes" },
// Unwind the works array
{ "$unwind": "$worktypes.works" },
// Then use match to filter only the matching entries
{ "$match": {
"$or": [
{ "worktypes.works.name": { "$regex": "^A00011$|^a00011$" } },
{ "worktypes.works.code": { "$regex": "^A00011$|^a00011$" } }
]
}},
/* Stop */
// If you "really" need the arrays back then include all the following
// Otherwise the steps up to here actually got you your results
// First put the "works" array back together
{ "$group": {
"_id": {
"_id": "$_id",
"company": "$company",
"workname": "$worktypes.name"
},
"works": { "$push": "$worktypes.works" }
}},
// Then put the "worktypes" array back
{ "$group": {
"_id": "$_id._id",
"company": { "$first": "$_id.company" },
"worktypes": {
"$push": {
"name": "$_id.workname",
"works": "$works"
}
}
}}
])
So what .aggregate() does with all of these stages is it breaks the array elements into normal document form so they can be filtered using the $match operator. In that way, only the elements that "match" are returned.
What .find() is correctly doing is matching the "document" that meets the conditions. Since the document contains elements that match, the whole document is returned. The two principles are very different things.
When you mean to "filter" use aggregate.
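The difference between matching a document and filtering its elements can be sketched in plain JavaScript. This is only an illustration of the idea, not how MongoDB evaluates queries; the trimmed-down document and the helper names (matches, documentMatches, filtered) are made up for the example.

```javascript
// Trimmed-down copy of the sample document from the question.
const doc = {
  company: "New Company",
  worktypes: [
    { name: "Pompas", works: [
      { name: "name 2", code: "A00011" },
      { name: "yy", code: "yy" }
    ]},
    { name: "name 4", works: [{ name: "Nombre", code: "A112" }] }
  ]
};

const pattern = /^A00011$|^a00011$/;
// A work matches if either its name or its code matches the pattern.
const matches = (w) => pattern.test(w.name) || pattern.test(w.code);

// What .find() decides: does ANY nested element match?
// If so, the whole document comes back unchanged.
const documentMatches = doc.worktypes.some((t) => (t.works || []).some(matches));

// What the filtering pipeline produces: only the matching nested elements,
// with worktypes that end up empty dropped.
const filtered = doc.worktypes
  .map((t) => ({ ...t, works: (t.works || []).filter(matches) }))
  .filter((t) => t.works.length > 0);

console.log(documentMatches);          // true
console.log(JSON.stringify(filtered)); // [{"name":"Pompas","works":[{"name":"name 2","code":"A00011"}]}]
```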

I think there is a typo: the regex should be ^A00011.*$, with a triple 0 instead of a double 0.
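A quick way to check this is to test both patterns against the stored code in plain JavaScript. This is a minimal sketch; the sample value comes from the question's document and the variable names are made up for illustration.

```javascript
// Value stored in the "code" field of the sample document.
const code = "A00011";

// Pattern from the question: "A", two zeros, then "11".
const original = new RegExp("^A0011.*$", "i");
// Corrected pattern: "A", three zeros, then "11".
const corrected = new RegExp("^A00011.*$", "i");

console.log(original.test(code));  // false
console.log(corrected.test(code)); // true
```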

You can try the aggregate method with aggregation array operators. $regexMatch is available from MongoDB 4.2, so this query needs that version or later.
$match to match your condition
$addFields to add or overwrite a field in the document
$map to iterate over the worktypes array
$filter to iterate over the works array and return only the elements that satisfy the condition
$regexMatch to apply the same regular expression as in the $match stage; it returns a boolean, so the two checks are combined with $or
$mergeObjects to merge the current worktypes object with the filtered works array
a second $addFields to remove worktypes whose works array ended up empty
$filter to iterate over the worktypes array and keep only the elements whose works array is not empty
db.categories.aggregate([
{
$match: {
$and: [
{ "company": "New Company" },
{
$or: [
{ "worktypes.works.name": { "$regex": "^A00011$|^a00011$" } },
{ "worktypes.works.code": { "$regex": "^A00011$|^a00011$" } }
]
}
]
}
},
{
$addFields: {
worktypes: {
$map: {
input: "$worktypes",
in: {
$mergeObjects: [
"$$this",
{
works: {
$filter: {
input: "$$this.works",
cond: {
$or: [
{
$regexMatch: {
input: "$$this.name",
regex: "^A00011$|^a00011$"
}
},
{
$regexMatch: {
input: "$$this.code",
regex: "^A00011$|^a00011$"
}
}
]
}
}
}
}
]
}
}
}
}
},
{
$addFields: {
worktypes: {
$filter: {
input: "$worktypes",
cond: { $ne: ["$$this.works", []] }
}
}
}
}
])

Kotlin - group by list of Maps

I have a fieldList variable.
val fieldList: List<MutableMap<String, String>>
// fieldList Data :
[ {
"field_id" : "1",
"section_id" : "1",
"section_name" : "section1",
"field_name" : "something_1"
}, {
"field_id" : "2",
"section_id" : "1",
"section_name" : "section1",
"field_name" : "something_2"
}, {
"field_id" : "3",
"section_id" : "2",
"section_name" : "section2",
"field_name" : "something_3"
}, {
"field_id" : "4",
"section_id" : "3",
"section_name" : "section3",
"field_name" : "something_4"
} ]
And I want to group by section_id.
The results should be as follows:
val result: List<MutableMap<String, Any>>
// result Data :
[
{
"section_id": "1",
"section_name": "section1",
"field": [
{
"id": "1",
"name": "something_1"
},
{
"id": "2",
"name": "something_2"
}
]
},
{
"section_id": "2",
"section_name": "section2",
"field": [
{
"id": "3",
"name": "something_3"
}
]
},
.
.
.
]
What is the most idiomatic way of doing this in Kotlin?
I have an ugly-looking working version in Java, but I am quite sure Kotlin has a nicer way of doing it; I just haven't found it so far.
Any ideas?
Thanks
Another way:
val newList = fieldList.groupBy { it["section_id"] }.values
.map {
mapOf(
"section_id" to it[0]["section_id"]!!,
"section_name" to it[0]["section_name"]!!,
"field" to it.map { mapOf("id" to it["field_id"], "name" to it["field_name"]) }
)
}
Also, as broot mentioned, prefer using data classes instead of such maps.
Assuming the data is guaranteed to be correct and we don't have to validate it, that is:
all fields always exist,
section_name is always the same for a specific section_id,
this is how you can do it:
val result = fieldList.groupBy(
keySelector = { it["section_id"]!! to it["section_name"]!! },
valueTransform = {
mutableMapOf(
"id" to it["field_id"]!!,
"name" to it["field_name"]!!,
)
}
).map { (section, fields) ->
mutableMapOf(
"section_id" to section.first,
"section_name" to section.second,
"field" to fields
)
}
However, I suggest not using maps and lists, but proper data classes. Using a Map to store known properties and using Any to store either String or List is just very inconvenient to use and error-prone.

Why doesn't the Keyword analyzer applied to a Text field return results when the pattern contains a dash in Regexp search query?

I have created a small example to demonstrate the specific issue I'm having. Briefly, when I create a multi-field mapping using a field type of Text and the Keyword analyzer, no documents are returned from an Elasticsearch Regexp search query that contains punctuation. I use a dash in the following example to demonstrate the problem.
I’m using Elasticsearch 7.10.2. The index I’m targeting is already populated with millions of documents. The field of type Text where I need to run some regular expressions uses the Standard (default) analyzer. I understand that, because the field gets tokenized by the Standard analyzer, the following request:
POST _analyze
{
"analyzer" : "default",
"text" : "The number is: 123-4576891-73.\n\n"
}
will yield three words: "the", "number", "is" and three groups of numbers: "123", "4576891", "73". It's obvious that a regular expression that relies on punctuation, like this one that contains two literal dashes:
"(.*[^a-z0-9_])?[0-9]{3}-[0-9]{7}-[0-9]{2}([^a-z0-9_].*)?"
will not return a result. Note, for those not familiar with this, regex shortcuts do not work for Lucene-based Elasticsearch requests (at least not yet). Here's a reference: https://www.elastic.co/guide/en/elasticsearch/reference/current/regexp-syntax.html. Also, the use of word boundaries that I show in my examples (.*[^a-z0-9_])? and ([^a-z0-9_].*)? are from this post: Word boundary in Lucene regex.
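As a rough illustration of why tokenization defeats the pattern, here is a plain JavaScript sketch. It is not Elasticsearch itself: the split rule only approximates the Standard analyzer, the anchors and flags are added to mimic whole-term, case-insensitive matching, and the variable names are made up.

```javascript
const text = "The number is: 123-4576891-73.\n\n";

// Rough stand-in for the Standard analyzer: lowercase, then split on
// anything that is not a letter or digit.
const tokens = text.toLowerCase().split(/[^a-z0-9]+/).filter(Boolean);

// The pattern from the question, anchored at both ends because a regexp
// query has to match a whole term. "s" lets "." match the trailing
// newlines; "i" mirrors case_insensitive.
const pattern = /^(.*[^a-z0-9_])?[0-9]{3}-[0-9]{7}-[0-9]{2}([^a-z0-9_].*)?$/is;

console.log(tokens);                              // the, number, is, 123, 4576891, 73
console.log(tokens.some((t) => pattern.test(t))); // false: no single token contains a dash
console.log(pattern.test(text));                  // true: the untokenized string matches
```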
To see this for yourself with an example, create and populate an index like so:
PUT /index-01
{
"settings": {
"number_of_shards": 1
},
"mappings": {
"properties": {
"text": { "type": "text" }
}
}
}
POST index-01/_doc/
{
"text": "The number is: 123-4576891-73.\n\n"
}
The following Regexp search query will return nothing because of the tokenization issue I described earlier:
POST index-01/_search
{
"size": 1,
"query": {
"regexp": {
"text": {
"value": "(.*[^a-z0-9_])?[0-9]{3}-[0-9]{7}-[0-9]{2}([^a-z0-9_].*)?",
"flags": "ALL",
"case_insensitive": true,
"max_determinized_states": 100000
}
}
},
"_source": false,
"highlight": {
"fields": {
"text": {}
}
}
}
Most posts suggest a quick fix would be to target the Keyword type multi-field instead of the text field. The Keyword multi-type field gets created automatically, as this shows:
GET index-01/_mapping/field/text
response:
{
"index-01" : {
"mappings" : {
"text" : {
"full_name" : "text",
"mapping" : {
"text" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword",
"ignore_above" : 256
}
}
}
}
}
}
}
}
Targeting the keyword field, I get results for the following Regexp search query:
POST index-01/_search
{
"size": 1,
"query": {
"regexp": {
"text.keyword": {
"value": "(.*[^a-z0-9_])?[0-9]{3}-[0-9]{7}-[0-9]{2}([^a-z0-9_].*)?",
"flags": "ALL",
"case_insensitive": true,
"max_determinized_states": 100000
}
}
},
"_source": false,
"highlight": {
"fields": {
"text.keyword": {}
}
}
}
Here's the hit-highlighted part of the result:
...
"highlight" : {
"text.keyword" : [
"<em>This is my number 123-4576891-73. Thanks\n\n</em>"
]
}
...
Because some of the documents have a large amount of text, I adjusted the text.keyword field size with ignore_above parameter:
PUT /index-01/_mapping
{
"properties": {
"text": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 32766
}
}
}
}
}
However, this will still skip some documents, since the targeted index contains text fields larger than this upper bound for a field of type Keyword. Also, according to the Elasticsearch documentation here: https://www.elastic.co/guide/en/elasticsearch/reference/current/keyword.html, this type of field is really designed for structured data, constant values and wildcard queries.
Following that guidance, I assigned the Keyword analyzer to a new field type Text (text.raw) by making this update to the mapping:
PUT /index-01/_mapping
{
"properties": {
"text": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 32766
},
"raw": {
"type": "text",
"analyzer": "keyword",
"index": true
}
}
}
}
}
Now, you can see the additional mapping text.raw with this request:
GET index-01/_mapping/field/text
response:
{
"index-01" : {
"mappings" : {
"text" : {
"full_name" : "text",
"mapping" : {
"text" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword",
"ignore_above" : 32766
},
"raw" : {
"type" : "text",
"analyzer" : "keyword"
}
}
}
}
}
}
}
}
Next, I verified that the data was, in fact, mapped to the multi-fields:
POST index-01/_search
{
"query":
{
"match_all": {}
},
"fields": ["text", "text.keyword", "text.raw"]
}
response:
...
"hits" : [
{
"_index" : "index-01",
"_type" : "_doc",
"_id" : "2R-OgncBn-TNB4PjXYAh",
"_score" : 1.0,
"_source" : {
"text" : "The number is: 123-4576891-73.\n\n"
},
"fields" : {
"text" : [
"The number is: 123-4576891-73.\n\n"
],
"text.keyword" : [
"The number is: 123-4576891-73.\n\n"
],
"text.raw" : [
"The number is: 123-4576891-73.\n\n"
]
}
}
]
...
I also verified that the Keyword analyzer applied to the text.raw field contains a single token, as shown in the following request:
POST _analyze
{
"analyzer" : "keyword",
"text" : "The number is: 123-4576891-73.\n\n"
}
response:
{
"tokens" : [
{
"token" : "The number is: 123-4576891-73.\n\n",
"start_offset" : 0,
"end_offset" : 32,
"type" : "word",
"position" : 0
}
]
}
However, the exact same Regexp search query targeting the text.raw field returns nothing:
POST index-01/_search
{
"size": 1,
"query": {
"bool": {
"must": [
{
"regexp": {
"text.raw": {
"value": "(.*[^a-z0-9_])?[0-9]{3}-[0-9]{7}-[0-9]{2}([^a-z0-9_].*)?",
"flags": "ALL",
"case_insensitive": true,
"max_determinized_states": 100000
}
}
}
]
}
},
"_source": false,
"highlight" : {
"fields" : {
"text.raw": {}
}
}
}
Please let me know if you know why I'm not getting back a result using the field type Text with the Keyword analyzer.

ElasticSearch sorting by regexp

I have a field in an ElasticSearch 6 index that can be matched with a regexp. I need to sort the search results so that documents whose values match come before the ones that don't. Is there some way to use a regexp in the sorting clause?
Example mapping:
"mappings" : {
"unit" : {
"properties" : {
"description" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword",
"ignore_above" : 256
}
}
}
}
}
I thought about script sorting, something like this:
"sort" : {
"_script" : {
"type" : "number",
"script" : {
"source": "regex('some_regexp_here').match(doc['description'].value) ? 1 : 0"
},
"order" : "desc"
}
}
Is it possible? Are there any other workarounds? Thank you.
I figured this out. The sort clause should look like this:
"sort": {
"_script": {
"order": "desc",
"type": "number",
"script": {
"source": "def m = /my_regex_here/.matcher(doc['description'].value); if (m.matches()) { return 1 } else { return 0 }"
}
}
}
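Note that the '/' symbols around the regexp are required. Regexes also have to be enabled for Painless (script.painless.regex.enabled: true in elasticsearch.yml), since they are disabled by default in ElasticSearch 6.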

Elasticsearch not working with 'not_analyzed' index

I am unable to figure out why Elasticsearch is not searching as expected on not_analyzed fields. I have the following settings in my model:
settings index: { number_of_shards: 1 } do
mappings dynamic: 'false' do
indexes :id
indexes :name, index: 'not_analyzed'
indexes :email, index: 'not_analyzed'
indexes :contact_number
end
end
def as_indexed_json(options = {})
as_json(only: [ :id, :name, :username, :user_type, :is_verified, :email, :contact_number ])
end
The mapping in Elasticsearch looks right, as shown below.
{
"users-development" : {
"mappings" : {
"user" : {
"dynamic" : "false",
"properties" : {
"contact_number" : {
"type" : "string"
},
"email" : {
"type" : "string",
"index" : "not_analyzed"
},
"id" : {
"type" : "string"
},
"name" : {
"type" : "string",
"index" : "not_analyzed"
}
}
}
}
}
}
The issue is that when I search on the not_analyzed fields (name and email, which I wanted to be not analyzed), it only matches on the full word. In the example below it should have returned John, Johny and Tiger, all 3 records, but it only returns 2 of them.
I am searching as below:
settings = {
query: {
filtered: {
filter: {
bool: {
must: [
{ terms: { name: [ "john", "tiger" ] } },
]
}
}
}
},
size: 10
}
User.__elasticsearch__.search(settings).records
This is how I index my user object in the after_save callback:
User.__elasticsearch__.client.indices.create(
index: User.index_name,
id: self.id,
body: self.as_indexed_json,
)
Some of the documents that should match:
[{
"_index" : "users-development",
"_type" : "user",
"_id" : "670",
"_score" : 1.0,
"_source":{"id":670,"email":"john#monkeyofdoom.com","name":"john baba","contact_number":null}
},
{
"_index" : "users-development",
"_type" : "user",
"_id" : "671",
"_score" : 1.0,
"_source":{"id":671,"email":"human#monkeyofdoom.com","name":"Johny Rocket","contact_number":null}
}
, {
"_index" : "users-development",
"_type" : "user",
"_id" : "736",
"_score" : 1.0,
"_source":{"id":736,"email":"tiger#monkeyofdoom.com","name":"tiger sherof", "contact_number":null}
} ]
Any suggestions please.
I think you would get the desired results with the keyword tokenizer combined with the lowercase filter rather than with not_analyzed.
The reason john* did not match Johny is case sensitivity.
This setup will work:
{
"settings": {
"analysis": {
"analyzer": {
"keyword_analyzer": {
"type": "custom",
"filter": [
"lowercase"
],
"tokenizer": "keyword"
}
}
}
},
"mappings": {
"my_type": {
"properties": {
"name": {
"type": "string",
"analyzer": "keyword_analyzer"
}
}
}
}
}
Now john* will match johny. You should use multi-fields if you have various requirements. A terms query for john won't give you john baba, because the inverted index contains no token john. You could use the standard analyzer on one field and the keyword analyzer on the other.
As per the documentation for the term query:
The term query finds documents that contain the exact term specified in the inverted index.
You are searching for john, but none of your documents contain the exact term john, which is why you were not getting any result. Either make the field analyzed and then apply a query string, or search for the exact term.
Refer to https://www.elastic.co/guide/en/elasticsearch/reference/2.x/query-dsl-term-query.html for more details.
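The effect of the two index setups can be imitated in plain JavaScript. This is only a sketch of the idea under simplified assumptions (one term per value, prefix lookups standing in for john* and tiger*), not Elasticsearch code; the variable names are made up for the example.

```javascript
const names = ["john baba", "Johny Rocket", "tiger sherof"];
const queryTerms = ["john", "tiger"];

// not_analyzed: each value is indexed as one exact, case-sensitive term,
// so a terms lookup only hits when the whole value equals the query term.
const exactHits = names.filter((n) => queryTerms.includes(n));

// keyword tokenizer + lowercase filter: still one term per value, but
// lowercased, so a prefix-style lookup (john*, tiger*) can find it.
const prefixHits = names.filter((n) =>
  queryTerms.some((q) => n.toLowerCase().startsWith(q))
);

console.log(exactHits);  // []
console.log(prefixHits); // all three names
```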

Combining $regex and $or operators in Mongo

I want to use the $or and $regex operators at the same time.
db.users.insert([{name: "Alice"}, {name: "Bob"}, {name: "Carol"}, {name: "Dan"}, {name: "Dave"}])
Using $regex works fine:
> db.users.find({name: {$regex: "^Da"}})
{ "_id" : ObjectId("53e33682b09f1ca437078b1d"), "name" : "Dan" }
{ "_id" : ObjectId("53e33682b09f1ca437078b1e"), "name" : "Dave" }
When introducing $or, the response is changed. I expected the same response:
> db.users.find({name: {$regex: {$or: ["^Da"]}}})
{ "_id" : ObjectId("53e33682b09f1ca437078b1a"), "name" : "Alice" }
{ "_id" : ObjectId("53e33682b09f1ca437078b1b"), "name" : "Bob" }
{ "_id" : ObjectId("53e33682b09f1ca437078b1c"), "name" : "Carol" }
{ "_id" : ObjectId("53e33682b09f1ca437078b1d"), "name" : "Dan" }
{ "_id" : ObjectId("53e33682b09f1ca437078b1e"), "name" : "Dave" }
I also tried to change the order of the operators:
> db.users.find({name: {$or: [{$regex: "^Da"}, {$regex: "^Ali"}]}})
error: { "$err" : "invalid operator: $or", "code" : 10068 }
However, it seems that following query works fine, but it's a little bit long (name is repeated):
> db.users.find({$or: [{name: {$regex: "^Da"}}, {name: {$regex: "^Ali"}}]})
{ "_id" : ObjectId("53e33682b09f1ca437078b1a"), "name" : "Alice" }
{ "_id" : ObjectId("53e33682b09f1ca437078b1d"), "name" : "Dan" }
{ "_id" : ObjectId("53e33682b09f1ca437078b1e"), "name" : "Dave" }
Is there any shorter way to use $regex and $or in queries like this?
The goal is to use $regex operator and not /.../ (real regular expressions).
The $or operator expects whole conditions so the correct form would be:
db.users.find({ "$or": [
{ "name": { "$regex": "^Da"} },
{ "name": { "$regex": "^Ali" }}
]})
Or of course using $in:
db.users.find({ "name": { "$in": [/^Da/,/^Ali/] } })
But it's a regex so you can do:
db.users.find({ "name": { "$regex": "^Da|^Ali" } })
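That the alternation selects the same names as the $or form can be checked in plain JavaScript on the sample names from the question. This is a minimal sketch that runs outside MongoDB; the variable names are made up.

```javascript
const names = ["Alice", "Bob", "Carol", "Dan", "Dave"];

// $or form: a name passes if either pattern matches.
const viaOr = names.filter((n) => /^Da/.test(n) || /^Ali/.test(n));
// Single-pattern form: the alternation lives inside the regex.
const viaAlternation = names.filter((n) => /^Da|^Ali/.test(n));

console.log(viaOr);          // [ 'Alice', 'Dan', 'Dave' ]
console.log(viaAlternation); // [ 'Alice', 'Dan', 'Dave' ]
```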
It has been a while, but I would add case insensitivity to the regex query, as in the query below, so that it doesn't matter whether names were saved to the database with capital letters:
db.users.find({ "name": { "$regex": "^Da|^Ali", "$options": "i" } })
Hope it helps
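When you combine $and or $or with $regex, check the pattern itself.
Of the two queries below, the first works as expected. The second does not restrict by name, because /^J*/ matches zero or more J characters and therefore matches every string; mixing a plain equality condition such as {sex: "M"} with a $regex condition is otherwise fine.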
db.big_data.users.find(
{ $and: [
{ sex: { $regex: /^M.*/ } },
{ name: { $regex: /^J.*/ } }
] })
db.big_data.users.find({ $and: [ {sex: "M"}, { name: { $regex: /^J*/m } } ] })
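One thing worth checking in the second query is the pattern itself. The following plain JavaScript sketch (sample names and variable names are made up) compares /^J*/m with /^J.*/:

```javascript
const names = ["John", "Alice", "Bob"];

// "J*" means zero or more "J", so an empty match at the start always succeeds.
const loose = /^J*/m;
// "J.*" requires a literal "J" as the first character.
const strict = /^J.*/;

console.log(names.filter((n) => loose.test(n)));  // [ 'John', 'Alice', 'Bob' ]
console.log(names.filter((n) => strict.test(n))); // [ 'John' ]
```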
You can express OR inside the regex with alternation:
db.collName.find({ "name": { "$regex": "^Da|^Ali" ,"$options": "i" } })
A single pattern with case-insensitive matching looks like this:
db.collName.find({ "name": { "$regex": "Ali" ,"$options": "i" } })
For more information on regular expressions, see:
https://www.cs.jhu.edu/~jason/405/lectures1-2/sld049.htm