How to use a regular expression match inside an aggregation $switch case in MongoDB?

Here is what I did, inside $addFields:
{
ClientStatus:{
$switch: {
branches: [
{ case: {
$eq:
[
"$CaseClientStatus",
/In Progress/i
]},
then:'In Progress'
},
{ case: {
$eq:
[
"$CaseClientStatus",
{regex:/Cancelled/i}
],
},then:'Cancelled'},
{ case: {$eq:['$CaseClientStatus','Complete - All Results Clear']}, then:'Complete'},
{ case: {$eq:['$CaseClientStatus','Case on Hold']}, then:'Case on Hold'}
],
default: 'Other'
}}
}
But with this, ClientStatus only ever comes out as Complete, Case on Hold, or Other, never the values from the regex branches, although the field contains those words.
Here is one of the documents:
{
"CandidateName": "Bruce Consumer",
"_id": "61b30daeaa237672bb7a17cc",
"CaseClientStatus": "Background Check Case In Progress",
"TAT": "N/A",
"CaseCloseDate": null,
"FormationAutomationStatus": "Automated",
"MethodOfDataSupply": "Automated",
"Status": "Background Case In Progress",
"CreatedDate": "2021-12-10T08:19:58.389Z",
"OrderId": "Ord3954",
"PONumber": "",
"Position": "",
"FacilityCode": "",
"IsCaseClose": false,
"Requester": "Shah Shah",
"ReportErrorList": 0
}

Assuming you are on version 4.2 or higher (and you should be, because 4.2 came out almost 2 years ago), the $regexFind function gives you what you need. Prior to 4.2, regex was only available in a $match operator, not in complex agg expressions. Your attempt above is admirable, but the // regex syntax is not doing what you think it should be doing. Inside $eq, /In Progress/i is compared as a value of type regex rather than applied as a pattern, and {regex:/Cancelled/i} simply creates a new object with key regex; neither will ever equal the string in $CaseClientStatus. $regexFind returns null when there is no match, so testing its result with $ne against null gives a boolean for each case. Here is a solution:
ClientStatus:{
$switch: {
branches: [
{ case: {
$ne: [null, {$regexFind: {input: "$CaseClientStatus", regex: /In Progress/i}}]
}, then:'In Progress'},
{ case: {
$ne: [null, {$regexFind: {input: "$CaseClientStatus", regex: /Cancelled/i}}]
},then:'Cancelled'},
{ case: {$eq:['$CaseClientStatus','Complete - All Results Clear']}, then:'Complete'},
{ case: {$eq:['$CaseClientStatus','Case on Hold']}, then:'Case on Hold'}
],
default: 'Other'
}}
It looks like you are trying to take a somewhat free-form status "description" field and create a strong enumerated status from it. I would recommend that your $ClientStatus output be more code-ish e.g. IN_PROGRESS, COMPLETE, CXL etc. Eliminate case and certainly whitespace.
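To make the branch logic and the "code-ish" output concrete, here is a minimal Python sketch. It only mirrors what the $switch does on plain strings (the first matching branch wins) and does not talk to MongoDB. The function name classify_status and the codes ON_HOLD and OTHER are assumptions made up for illustration; IN_PROGRESS, COMPLETE and CXL come from the suggestion above.

```python
import re

# Ordered like the $switch branches: the first rule that matches wins.
# Each rule is (predicate, code). ON_HOLD and OTHER are hypothetical codes.
RULES = [
    (lambda s: re.search(r"In Progress", s, re.IGNORECASE) is not None, "IN_PROGRESS"),
    (lambda s: re.search(r"Cancelled", s, re.IGNORECASE) is not None, "CXL"),
    (lambda s: s == "Complete - All Results Clear", "COMPLETE"),
    (lambda s: s == "Case on Hold", "ON_HOLD"),
]

def classify_status(case_client_status):
    """Map a free-form status description to an enumerated code."""
    if not isinstance(case_client_status, str):
        return "OTHER"  # plays the role of the $switch default
    for matches, code in RULES:
        if matches(case_client_status):
            return code
    return "OTHER"

print(classify_status("Background Check Case In Progress"))
```

Because the rules are checked in order, a description containing both "In Progress" and "Cancelled" resolves to the first one, just as in the pipeline.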

Related

Match decimal as string to 0 in MongoDB without regex

I have a MongoDB collection where decimal numbers are stored as string. I need to find all those items that have one of such fields, quantity, equal to 0. Thus when looking for 0 I am actually looking for the strings:
"0"
"0.0"
"0.00"
...
and so on
I tried to use $toInt as follows
db.MyCollection.find({$toInt(Product.Quantity): 0})
but the query is flagged as wrong, it does not even get executed
After some digging I finally found the solution using regex:
db.MyCollection.find({"Product.Quantity": {$regex: "^0+$|^0+(\.0+)"}})
which indeed works, but I am sure there is a more straightforward solution; it cannot be so utterly complex. Does anybody have a better solution?
Does this help?
db.collection.aggregate({
$match: {
$expr: {
$eq: [
{
"$convert": {
"input": "$key",
"to": "double",
"onError": "$key",
"onNull": "$key"
}
},
0
]
}
}
})
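Just replace key with your field (Product.Quantity in the question).
Under the hood, $convert turns the string into a double; because onError and onNull are set to the original value, a non-numeric, null or missing value is passed through unchanged and simply fails the comparison with 0.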
Or using find
db.collection.find({
$expr: {
$eq: [
{
"$convert": {
"input": "$key",
"to": "double",
"onError": "$key",
"onNull": "$key"
}
},
0
]
}
})
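If you build the query from application code, the "replace key with your field" step can be wrapped in a small helper. This is only a sketch: zero_quantity_filter is a hypothetical name, and the assumption is that the resulting dict is handed to a driver such as pymongo as the filter argument of find.

```python
def zero_quantity_filter(field_path):
    """Build the $expr filter shown above for an arbitrary field path."""
    ref = "$" + field_path  # aggregation field reference, e.g. "$Product.Quantity"
    return {
        "$expr": {
            "$eq": [
                {
                    "$convert": {
                        "input": ref,
                        "to": "double",
                        "onError": ref,  # non-numeric strings fall through unchanged
                        "onNull": ref,   # null or missing values fall through unchanged
                    }
                },
                0,
            ]
        }
    }

query = zero_quantity_filter("Product.Quantity")
print(query)
```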
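Yes, regex is convenient to search for numbers in string format. You could simplify the regex a bit. Note the doubled backslash: inside a quoted string in the shell, a single \. would reach the regex engine as a bare dot that matches any character.
db.MyCollection.find({"Product.Quantity": {$regex: "^0(\\.0+)?$"}})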
Explanation of regex:
^ ... $ - anchor at the beginning and end
0 - expect a 0
(\.0+)? - followed by optional .0, .00, etc
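A quick way to sanity-check the pattern is to run it over a few sample strings. The sketch below uses Python's re module as a stand-in for the server's regex engine (the two are not identical, but they agree on a pattern this simple) and also shows what happens when the dot is left unescaped. The sample values are made up for illustration.

```python
import re

ZERO = re.compile(r"^0(\.0+)?$")   # dot escaped: a literal decimal point
LOOSE = re.compile("^0(.0+)?$")    # dot unescaped: matches any character

samples = ["0", "0.0", "0.00", "00", "0.", "0.5", "10", "0x0"]

# Strings accepted by the escaped pattern.
matches = [s for s in samples if ZERO.match(s)]
print(matches)

# The unescaped variant also lets "0x0" through.
print(bool(LOOSE.match("0x0")))
```

Note that, unlike the pattern in the question, this one does not accept "00" or "000"; if such values can occur, keep the leading 0+.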

mongodb aggregate - match $nin array regex values

Must work in mongo version 3.4
Hi,
As part of aggregating relevant tags, I would like to return tags that have script_url that is not contained in the whiteList array.
The thing is, i want to compare script_url to the regex of the array values.
I have this projection:
{
"script_url" : "www.analytics.com/path/file-7.js",
"whiteList" : [
null,
"www.analytics.com/path/*",
"www.analytics.com/path/.*",
"www.analytics.com/path/file-6.js",
"www.maps.com/*",
"www.maps.com/.*"
]
}
This $match compares script_url to exact whiteList values. So the document given above passes when it shouldn't since it has www.analytics.com/path/.* in whiteList
{
"$match": {
"script_url": {"$nin": ["$whiteList"]}
}
}
How do i match script_url with regex values of whiteList?
update
I was able to reach this stage in my aggregation:
{
"script_url" : "www.asaf-test.com/path/file-1.js",
"whiteList" : [
"http://sd.bla.com/bla/878/676.js",
"www.asaf-test.com/path/*"
],
"whiteListRegex" : [
"/http:\/\/sd\.bla\.com\/bla\/878\/676\.js/",
"/www\.asaf-test\.com\/path\/.*/"
]
}
But $match is not filtering out this script_url as it is supposed to, because it is comparing literal strings and not casting the array values to regex values.
Is there a way to convert array values to Regex values in $map using v3.4?
I know you specifically mentioned v3.4, but I can't find a solution to make it work using v3.4.
So for others who have less restrictions and are able to use v4.2 this is one solution.
For version 4.2 or later only
The trick is to use $filter on whiteList with $regexMatch (available from v4.2): if the filtered array is empty, script_url doesn't match anything in whiteList.
db.collection.aggregate([
{
$match: {
$expr: {
$eq: [
{
$filter: {
input: "$whiteList",
cond: {
$regexMatch: { input: "$script_url", regex: "$$this" }
}
}
},
[]
]
}
}
}
])
It's also possible to use $reduce instead of $filter
db.collection.aggregate([
{
$match: {
$expr: {
$not: {
$reduce: {
input: "$whiteList",
initialValue: false,
in: {
$or: [
{
$regexMatch: { input: "$script_url", regex: "$$this" }
},
"$$value"
]
}
}
}
}
}
}
])
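To see what both pipelines compute, here is a rough Python emulation of the same check on the sample documents from the question. It is a sketch under assumptions: Python's re stands in for the server's regex engine, null entries are skipped on the assumption that they should never match, and is_whitelisted / should_keep are hypothetical helper names.

```python
import re

def is_whitelisted(script_url, white_list):
    """True if any non-null whitelist entry, used as a regex, matches script_url."""
    return any(
        re.search(pattern, script_url) is not None
        for pattern in white_list
        if pattern is not None
    )

def should_keep(doc):
    # Mirrors the $match: keep the document only when nothing in whiteList matches.
    return not is_whitelisted(doc["script_url"], doc["whiteList"])

white_list = [
    None,
    "www.analytics.com/path/*",
    "www.analytics.com/path/.*",
    "www.analytics.com/path/file-6.js",
    "www.maps.com/*",
    "www.maps.com/.*",
]

print(should_keep({"script_url": "www.analytics.com/path/file-7.js", "whiteList": white_list}))
print(should_keep({"script_url": "www.other.com/x.js", "whiteList": white_list}))
```

Keep in mind that the whitelist entries are used as regexes verbatim, so an unescaped dot matches any character and a trailing /* means "zero or more slashes", not a glob.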

negative lookahead Regexp doesnt work in ES dsl query

The mapping of my Elastic search looks like below:
{
"settings": {
"index": {
"number_of_shards": "5",
"number_of_replicas": "1"
}
},
"mappings": {
"node": {
"properties": {
"field1": {
"type": "keyword"
},
"field2": {
"type": "keyword"
},
"query": {
"properties": {
"regexp": {
"properties": {
"field1": {
"type": "keyword"
},
"field2": {
"type": "keyword"
}
}
}
}
}
}
}
}
}
Problem is :
I am forming ES queries using elasticsearch_dsl Q(). It works perfectly fine in most cases, even when my query contains a complex regexp. But it fails completely if the pattern contains the regexp character '!': the search returns no results at all.
For eg:
1.) Q('regexp', field1 = "^[a-z]{3}.b.*") (works perfectly)
2.) Q('regexp', field1 = "^f04.*") (works perfectly)
3.)Q('regexp', field1 = "f00.*") (works perfectly)
4.) Q('regexp', field1 = "f04baz?") (works perfectly)
Fails in below case:
5.) Q('regexp', field1 = "f04((?!z).)*") (Fails with no results at all)
I tried adding "analyzer":"keyword" along with "type":"keyword" as above in the fields, but in that case nothing works.
In the browser i tried to check how analyzer:keyword will work on the input on the case it fails:
http://localhost:9210/search/_analyze?analyzer=keyword&text=f04((?!z).)*
Seems to look fine here with result:
{
"tokens": [
{
"token": "f04((?!z).)*",
"start_offset": 0,
"end_offset": 12,
"type": "word",
"position": 0
}
]
}
I'm running my queries like below:
search_obj = Search(using = _conn, index = _index, doc_type = _type).query(Q('regexp', field1 = "f04baz?"))
count = search_obj.count()
response = search_obj[0:count].execute()
logger.debug("total nodes(hits):" + " " + str(response.hits.total))
Please help, it's a really annoying problem: every regex character works fine in all the queries except !.
Also, how do i check what analyzer is currently applied with above setting in my mappings?
Elasticsearch's Lucene regex engine does not support any type of lookaround. The ES regex documentation is rather ambiguous, saying that matching everything like .* is very slow, as is using lookaround regular expressions (which is not only ambiguous but also wrong, since lookarounds, when used wisely, may greatly speed up regex matching).
Since you want to match any string that contains f04 and does not contain z, you may actually use
[^z]*f04[^z]*
Details
[^z]* - any 0+ chars other than z
f04 - f04 substring
[^z]* - any 0+ chars other than z.
In case you have a multicharacter string to "exclude" (say, z4 rather than z), you may use your approach using a complement operator:
.*f04.*&~(.*z4.*)
This means almost the same but does not support line breaks:
.* - any chars other than newline, as many as possible
f04 - f04
.* - any chars other than newline, as many as possible
& - AND
~(.*z4.*) - any string other than the one having z4
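The patterns above can be compared side by side in Python. This is only an emulation: Python's re is not Lucene, so re.fullmatch is used to imitate Lucene's whole-term matching, and the & and ~ operators (which Python does not have) are imitated with two separate checks. The function names and test terms are made up for illustration.

```python
import re

def lookahead_version(term):
    # The pattern from the question; works in Python but not in Lucene.
    return re.fullmatch(r"f04((?!z).)*", term) is not None

def char_class_version(term):
    # Single-character exclusion rewritten without lookaround.
    return re.fullmatch(r"[^z]*f04[^z]*", term) is not None

def complement_version(term):
    # Emulates .*f04.*&~(.*z4.*): contains f04 AND is not a string containing z4.
    return (re.fullmatch(r".*f04.*", term) is not None
            and re.fullmatch(r".*z4.*", term) is None)

for term in ["f04bar", "f04baz", "xf04bar", "f04z", "f04z4"]:
    print(term, lookahead_version(term), char_class_version(term), complement_version(term))
```

One difference worth noticing: the lookahead pattern requires the term to start with f04, while [^z]*f04[^z]* also accepts z-free text before it. If the prefix matters, f04[^z]* is the closer rewrite.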

MongoDB Search and Sort, with Number of Matches and Exact Match

I want to create a small MongoDB Search Query where I want to sort the result set based exact match followed by no. of matches.
For eg. if I have following labels
Physics
11th-Physics
JEE-IIT-Physics
Physics-Physics
Then, if I search for "Physics" it should sort as
Physics
Physics-Physics
11th-Physics
JEE-IIT-Physics
Looking for the sort of "scoring" you are talking about here is an exercise in "imperfect solutions". In this case, the "best fit" starts with "text search", and "imperfect" is the term to consider first when working with the text search capabilities of MongoDB.
MongoDB is "not" a dedicated "text search" product, nor is it (like most databases) trying to be one. The full capabilities of "text search" are reserved for dedicated products that do that as their area of expertise. So it is maybe not the best fit, but "text search" is given as an option for those who can live with the limitations and don't want to implement another engine. Or not yet, at least.
With that said, let's look at what you can do with the data sample as given. First set up some data in a collection:
db.junk.insert([
{ "data": "Physics" },
{ "data": "11th-Physics" },
{ "data": "JEE-IIT-Physics" },
{ "data": "Physics-Physics" },
{ "data": "Something Unrelated" }
])
Then of course, to "enable" the text search capabilities, you need to index at least one of the fields in the document with the "text" index type:
db.junk.createIndex({ "data": "text" })
Now that is "ready to go", let's have a look at a first basic query:
db.junk.find(
{ "$text": { "$search": "\"Physics\"" } },
{ "score": { "$meta": "textScore" } }
).sort({ "score": { "$meta": "textScore" } })
That is going to give results like this:
{
"_id" : ObjectId("55af83b964876554be823f33"),
"data" : "Physics-Physics",
"score" : 1.5
}
{
"_id" : ObjectId("55af83b964876554be823f30"),
"data" : "Physics",
"score" : 1
}
{
"_id" : ObjectId("55af83b964876554be823f31"),
"data" : "11th-Physics",
"score" : 0.75
}
{
"_id" : ObjectId("55af83b964876554be823f32"),
"data" : "JEE-IIT-Physics",
"score" : 0.6666666666666666
}
So that is "close" to your desired result, but of course there is no "exact match" component. In addition, the logic here used by the text search capabilities with the $text operator means that "Physics-Physics" is the preferred match here.
This is because the engine does not recognize "non words" such as the "hyphen" in between. To it, the word "Physics" appears several times in the indexed content for the document, therefore it has a higher score.
Now the rest of your logic here depends on the application of "exact match" and what you mean by that. If you are looking for "Physics" in the string and "not" surrounded by "hyphens" or other characters then the following does not suit. But you can just match a field "value" that is "exactly" just "Physics":
db.junk.aggregate([
{ "$match": {
"$text": { "$search": "Physics" }
}},
{ "$project": {
"data": 1,
"score": {
"$add": [
{ "$meta": "textScore" },
{ "$cond": [
{ "$eq": [ "$data", "Physics" ] },
10,
0
]}
]
}
}},
{ "$sort": { "score": -1 } }
])
And that will give you a result that both looks at the "textScore" produced by the engine and then applies some math with a logical test. In this case where the "data" is exactly equal to "Physics" then we "weight" the score by an additional factor using $add:
{
"_id": ObjectId("55af83b964876554be823f30"),
"data" : "Physics",
"score" : 11
}
{
"_id" : ObjectId("55af83b964876554be823f33"),
"data" : "Physics-Physics",
"score" : 1.5
}
{
"_id" : ObjectId("55af83b964876554be823f31"),
"data" : "11th-Physics",
"score" : 0.75
}
{
"_id" : ObjectId("55af83b964876554be823f32"),
"data" : "JEE-IIT-Physics",
"score" : 0.6666666666666666
}
That is what the aggregation framework can do for you, by allowing manipulation of the returned data with additional conditions. The end result is passed to the $sort stage (notice it is reversed into descending order) to allow that new value to be the sorting key.
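The arithmetic is simple enough to replay outside the database. The following Python sketch takes the text scores from the first result listing and applies the same "add 10 on exact equality, then sort descending" rule; the function name boosted and the bonus parameter are illustrative assumptions, and nothing here calls MongoDB.

```python
def boosted(results, exact, bonus=10):
    """Add a fixed bonus to exact matches, then sort by score, highest first."""
    rescored = [
        {"data": r["data"], "score": r["score"] + (bonus if r["data"] == exact else 0)}
        for r in results
    ]
    return sorted(rescored, key=lambda r: r["score"], reverse=True)

# Scores copied from the plain $text query output above.
text_results = [
    {"data": "Physics-Physics", "score": 1.5},
    {"data": "Physics", "score": 1},
    {"data": "11th-Physics", "score": 0.75},
    {"data": "JEE-IIT-Physics", "score": 0.6666666666666666},
]

ranked = boosted(text_results, "Physics")
for row in ranked:
    print(row)
```

The bonus only needs to be larger than any text score you expect, so that an exact match always sorts first.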
But the aggregation framework can really only deal with "exact matches" like this on strings. At the time of writing, there is no facility to deal with regular expression matches or index positions in strings that return a meaningful value for projection (later releases added operators such as $indexOfCP and, from 4.2, $regexMatch). Not even a logical match. And the $regex operation is only used to "filter" in queries, so it is not of use here.
So if you were looking for something in a "phrase" that was a bit more involved than a "string equals" exact match, then the other option is using mapReduce.
This is another "imperfect" approach, as the limitations of the mapReduce command mean that the "textScore" from such a query by the engine is "completely gone". While the actual documents will be selected correctly, the inherent "ranking data" is not available to the engine. This is a by-product of how MongoDB "projects" the "score" into the document in the first place, and "projection" is not a feature available to mapReduce.
But you can "play with" the strings using JavaScript, as in my "imperfect" sample:
db.junk.mapReduce(
function() {
var _id = this._id,
score = 0;
delete this._id;
score += this.data.indexOf(search);
score += this.data.lastIndexOf(search);
emit({ "score": score, "id": _id }, this);
},
function() {},
{
"out": { "inline": 1 },
"query": { "$text": { "$search": "Physics" } },
"scope": { "search": "Physics" }
}
)
Which gives results like this:
{
"_id" : {
"score" : 0,
"id" : ObjectId("55af83b964876554be823f30")
},
"value" : {
"data" : "Physics"
}
},
{
"_id" : {
"score" : 8,
"id" : ObjectId("55af83b964876554be823f33")
},
"value" : {
"data" : "Physics-Physics"
}
},
{
"_id" : {
"score" : 10,
"id" : ObjectId("55af83b964876554be823f31")
},
"value" : {
"data" : "11th-Physics"
}
},
{
"_id" : {
"score" : 16,
"id" : ObjectId("55af83b964876554be823f32")
},
"value" : {
"data" : "JEE-IIT-Physics"
}
}
My own "silly little algorithm" here basically takes both the "first" and "last" index position of the matched string and adds them together to produce a score. It's likely not what you really want, but the point is that if you can code your logic in JavaScript, then you can throw it at the engine to produce the desired "ranking".
The only real "trick" to remember here is that the "score" must be the "preceding" part of the grouping "key", and that if you include the original document _id value, then that composite key part must be renamed; otherwise the _id will take precedence in the ordering.
This is just part of mapReduce where, as an "optimization", all output "key" values are sorted in "ascending order" before being processed by the reducer. That of course does nothing here since we are not "aggregating", but just using the JavaScript runner and document reshaping of mapReduce in general.
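If you want to experiment with the scoring rule before pushing it into mapReduce, the same arithmetic is easy to try locally. In this Python sketch, str.find and str.rfind play the role of JavaScript's indexOf and lastIndexOf (both return -1 when the substring is absent), and a plain substring test stands in for the $text query filter; position_score is a hypothetical name.

```python
def position_score(data, search):
    """First index plus last index of the search string, as in the map function."""
    return data.find(search) + data.rfind(search)

labels = [
    "Physics",
    "11th-Physics",
    "JEE-IIT-Physics",
    "Physics-Physics",
    "Something Unrelated",
]

search = "Physics"
# Filter first: an unmatched label would score -2 and sort ahead of everything.
hits = [s for s in labels if search in s]
ranked = sorted(hits, key=lambda s: position_score(s, search))
print([(s, position_score(s, search)) for s in ranked])
```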
So the overall note is, those are the available options. None of them perfect, but you might be able to live with them or even just "accept" the default engine result.
If you want more then look at external "dedicated" text search products, which would be better suited.
Side Note: The $text searches here are preferred over $regex because they can use an index. A "non-anchored" regular expression ( without the caret ^ ) cannot use an index optimally with MongoDB. Therefore the $text searches are generally going to be a better base for finding "words" within a phrase.
One more way is to use the $indexOfCP aggregation operator to get the index of the matched string and then sort on the resulting field.
Data insertion
db.junk.insert([
{ "data": "Physics" },
{ "data": "11th-Physics" },
{ "data": "JEE-IIT-Physics" },
{ "data": "Physics-Physics" },
{ "data": "Something Unrelated" }
])
Query
const data = "Physics";
db.junk.aggregate([
{ "$match": { "data": { "$regex": data, "$options": "i" }}},
{ "$addFields": { "score": { "$indexOfCP": [{ "$toLower": "$data" }, { "$toLower": data }]}}},
{ "$sort": { "score": 1 }}
])
This gives the following output:
[
{
"_id": ObjectId("5a934e000102030405000000"),
"data": "Physics",
"score": 0
},
{
"_id": ObjectId("5a934e000102030405000003"),
"data": "Physics-Physics",
"score": 0
},
{
"_id": ObjectId("5a934e000102030405000001"),
"data": "11th-Physics",
"score": 5
},
{
"_id": ObjectId("5a934e000102030405000002"),
"data": "JEE-IIT-Physics",
"score": 8
}
]
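Notice that "Physics" and "Physics-Physics" tie at score 0, so their relative order is not guaranteed by the index alone. One possible tie-breaker, shown here as a Python sketch rather than as part of the pipeline, is to sort by string length second, so the exact match comes first. The function name rank is an assumption; in the aggregation itself the length could presumably be computed with $strLenCP and added as a second sort key.

```python
def rank(labels, search):
    """Sort matches by position of the search string, then by label length."""
    needle = search.lower()
    hits = [s for s in labels if needle in s.lower()]
    return sorted(hits, key=lambda s: (s.lower().find(needle), len(s)))

labels = [
    "Physics-Physics",
    "JEE-IIT-Physics",
    "Physics",
    "11th-Physics",
    "Something Unrelated",
]

print(rank(labels, "Physics"))
```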

ElasticSearch and Regex queries

I am trying to query for documents that have dates within the body of the "content" field.
curl -XGET 'http://localhost:9200/index/_search' -d '{
"query": {
"regexp": {
"content": "^(0[1-9]|[12][0-9]|3[01])[- /.](0[1-9]|1[012])[- /.]((19|20)\\d\\d)$"
}
}
}'
Getting closer maybe?
curl -XGET 'http://localhost:9200/index/_search' -d '{
"filtered": {
"query": {
"match_all": {}
},
"filter": {
"regexp":{
"content" : "^(0[1-9]|[12][0-9]|3[01])[- /.](0[1-9]|1[012])[- /.]((19|20)\\d\\d)$"
}
}
}
}'
My regex seems to have been off. This regex has been validated on regex101.com. The following query still returns nothing from the 175k documents I have.
curl -XPOST 'http://localhost:9200/index/_search?pretty=true' -d '{
"query": {
"regexp":{
"content" : "/[0-9]{4}-[0-9]{2}-[0-9]{2}|[0-9]{2}-[0-9]{2}-[0-9]{4}|[0-9]{2}/[0-9]{2}/[0-9]{4}|[0-9]{4}/[0-9]{2}/[0-9]{2}/g"
}
}
}'
I am starting to think that my index might not be set up for such a query. What type of field do you have to use to be able to use regular expressions?
mappings: {
  doc: {
    properties: {
      content: { type: string }
      title: { type: string }
      host: { type: string }
      cache: { type: string }
      segment: { type: string }
      query: {
        properties: {
          match_all: { type: object }
        }
      }
      digest: { type: string }
      boost: { type: string }
      tstamp: { format: dateOptionalTime, type: date }
      url: { type: string }
      fields: { type: string }
      anchor: { type: string }
    }
  }
}
I want to find any record that has a date and graph the volume of documents by that date. Step 1 is to get this query working. Step 2 will be to pull the dates out and group the documents by them accordingly. Can someone suggest a way to get the first part working? I know the second part will be really tricky.
Thanks!
You should read Elasticsearch's Regexp Query documentation carefully, you are making some incorrect assumptions about how the regexp query works.
Probably the most important thing to understand here is what the string you are trying to match is. You are trying to match terms, not the entire string. If this is being indexed with StandardAnalyzer, as I would suspect, your dates will be separated into multiple terms:
"01/01/1901" becomes tokens "01", "01" and "1901"
"01 01 1901" becomes tokens "01", "01" and "1901"
"01-01-1901" becomes tokens "01", "01" and "1901"
"01.01.1901" actually will be a single token: "01.01.1901" (Due to decimal handling, see UAX #29)
You can only match a single, whole token with a regexp query.
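The effect of tokenization on the regexp query can be illustrated with a small Python sketch. The splitter below is only a crude stand-in for an analyzer (it splits on whitespace, '/' and '-', and leaves '.' inside tokens, which reproduces the four examples above but nothing more), and re.fullmatch imitates the rule that the pattern must match a whole token. The names rough_tokens and any_token_matches are hypothetical.

```python
import re

# The date pattern from the question, without anchors and with \d spelled out.
DATE = r"(0[1-9]|[12][0-9]|3[01])[- /.](0[1-9]|1[012])[- /.]((19|20)[0-9]{2})"

def rough_tokens(text):
    # Crude approximation of analysis: split on whitespace, '/' and '-'.
    return [t for t in re.split(r"[\s/-]+", text) if t]

def any_token_matches(text, pattern):
    # A regexp query succeeds only if one single token matches the whole pattern.
    return any(re.fullmatch(pattern, tok) is not None for tok in rough_tokens(text))

print(rough_tokens("01/01/1901"))
print(any_token_matches("due 01/01/1901", DATE))
print(any_token_matches("due 01.01.1901", DATE))
```

With slashes, spaces or hyphens, no single token ever contains the whole date, so the full-date pattern has nothing to match against.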
Elasticsearch (and lucene) don't support full Perl-compatible regex syntax.
In your first couple of examples, you are using anchors, ^ and $. These are not supported. Your regex must match the entire token to get a match anyway, so anchors are not needed.
Shorthand character classes like \d (or \\d) are also not supported. Instead of \\d\\d, use [0-9]{2}.
In your last attempt, you are using /{regex}/g, which is also not supported. Since your regex needs to match the whole string, the global flag wouldn't even make sense in context. Unless you are using a query parser which uses them to denote a regex, your regex should not be wrapped in slashes.
(By the way: How did this one validate on regex101? You have a bunch of unescaped /s. It complains at me when I try it.)
To support this sort of query on such an analyzed field, you'll probably want to look to span queries, and particularly Span Multiterm and Span Near. Perhaps something like:
{
"span_near" : {
"clauses" : [
{ "span_multi" : {
"match": {
"regexp": {"content": "0[1-9]|[12][0-9]|3[01]"}
}
}},
{ "span_multi" : {
"match": {
"regexp": {"content": "0[1-9]|1[012]"}
}
}},
{ "span_multi" : {
"match": {
"regexp": {"content": "(19|20)[0-9]{2}"}
}
}}
],
"slop" : 0,
"in_order" : true
}
}
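Conceptually, span_near with slop 0 and in_order true asks for adjacent tokens that match the clauses one after another. The Python sketch below imitates that idea on an already tokenized list; it is an illustration of the matching rule under that assumption, not of how Lucene implements spans, and spans_in_order is a hypothetical name.

```python
import re

# One regex per clause, in the same order as the span_multi clauses above.
DATE_PARTS = ["0[1-9]|[12][0-9]|3[01]", "0[1-9]|1[012]", "(19|20)[0-9]{2}"]

def spans_in_order(tokens, patterns):
    """True if some run of consecutive tokens fully matches the patterns one by one."""
    n = len(patterns)
    return any(
        all(re.fullmatch(p, t) is not None for p, t in zip(patterns, tokens[i:i + n]))
        for i in range(len(tokens) - n + 1)
    )

print(spans_in_order(["on", "01", "12", "1999", "ok"], DATE_PARTS))
print(spans_in_order(["01", "1999", "12"], DATE_PARTS))
```

As with the query itself, this accepts any day, month and year tokens that happen to sit next to each other, so impossible dates and unrelated number sequences can still match.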
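For newer Elasticsearch versions (tested on 8.5), you can query the .keyword sub-field instead, assuming the field has one (default dynamic mapping creates it for strings). It holds the whole unanalyzed value, so the regexp is matched against the entire sentence rather than individual terms.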
{
"size": 10,
"_source": [
"load",
"unload"
],
"query": {
"bool": {
"should": [
{
"regexp": {
"load.keyword": {
"value": ".*Search Term.*",
"flags": "ALL"
}
}
}
]
}
}
}