MongoDB distinct query with contains query - regex

I have a Mongo collection User which contains data like:
{
  id: 1,
  name: "gaurav",
  skills: "C++ HTML CSS"
}
When I search for all users that have the C++ skill with the following query, I get the correct results as expected:
db.user.find({skills: {contains: "C++"}});
But when I search for all the unique names of users matching the same condition, I don't get the desired result:
db.user.distinct('name', {skills: {contains: "C++"}});
Can anyone help me with what I am doing wrong?

The "contains" is not a valid keyword for MongoDB queries. You need $regex which submits a general "regular expression" statement matching the pcre specifications:
db.user.distinct( "name", { "skills": { "$regex": "C\+\+" } })
If you are using JavaScript as your language, then this is also safe:
db.user.distinct( "name", { "skills": /C\+\+/ })
This determines whether the string "C++" occurs anywhere within the string value of the field being tested. The + character is reserved in regex operations, so you need to escape it with a \ character; note that inside a JavaScript string the backslash itself must also be escaped, hence "C\\+\\+".
On your data this is the result:
db.user.distinct( "name", { "skills": { "$regex": "C\+\+" } })
[ "gaurav" ]

Alternatively, try $regex with a pattern like the one in the query below; note that the + characters still need escaping:
db.user.distinct("name", {"skills": {"$regex": "C\\+\\+.*"}})

Related

Nested Array search in MongoDB/PyMongo while using aggregate

I am trying to search for a keyword inside an array of arrays in a Mongo document:
{
  "PRODUCT_NAME": "Truffle Cake",
  "TAGS": [
    ["Cakes", 100],
    ["Flowers", 100]
  ]
}
Usually, I would do something like this, and it would work:
db.collection.find( {"TAGS":{"$elemMatch":{ "$elemMatch": {"$in":['search_text']} } }} )
But now I have changed this query to an aggregate-based query due to other requirements. I've tried $filter and $match, but I am not able to replicate the above query exactly.
Can anyone convert the above code so that it can directly work with aggregate?
(I use PyMongo)
$match uses the same query syntax as the query language (find). From the docs:
The query syntax is identical to the read operation query syntax.
This means that if a query works in a find, it will also work within a $match stage, like so:
db.collection.aggregate([
  {
    $match: {
      "TAGS": {
        "$elemMatch": {
          "$elemMatch": {
            "$in": ["Cakes"]
          }
        }
      }
    }
  }
])
Check this live on Mongo Playground
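Since the question mentions PyMongo, here is a hedged sketch of the same pipeline from Python (the connection URI, database and collection names are assumptions):

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
collection = client["test"]["collection"]  # assumed names

# The same double $elemMatch as the find() version, wrapped in a $match stage.
pipeline = [
    {"$match": {"TAGS": {"$elemMatch": {"$elemMatch": {"$in": ["Cakes"]}}}}},
    # ...other stages required by your use case can follow here
]
for doc in collection.aggregate(pipeline):
    print(doc)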

How to apply Mongo DB find command for nested dynamic keys

I want to search for all matching patterns in nested MongoDB fields with dynamic keys.
DB Structure:
{
  _id: 'dsdsdsadadad',
  results: {
    tables: {
      jvm: {
        data: [
          {
            Prediction: 1,
            Jvm: 'service_name',
            Status: 'OK'
          },
          {
            second: 'New second set'
          }
        ]
      }
    }
  }
}
Tried with $:
db.col_name.find({'results.tables.jvm.data.$.Jvm': {'$regexp': 'service.*'}})
With $i:
db.col_name.find({'results.tables.jvm.data.$i.Jvm': {'$regexp': 'service.*'}})
And by giving the particular key 0:
db.col_name.find({'results.tables.jvm.data.0.Jvm': {'$regexp': 'service.*'}})
No results!
Expected output: the above doc, and any others where a Jvm value starts with the service keyword.
Thanks,
You should directly use the dot notation to query an array of nested objects:
db.collection.find({ "results.tables.jvm.data.Jvm": { $regex: "service.*" } })
MongoDB will try to find every document that contains at least one nested document under data having Jvm field matching your regex.
MongoDB Playground
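If you are querying from Python, the same dot-notation query might look like this (connection and database names are assumptions):

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
col = client["test"]["col_name"]  # assumed database name

# No positional operator needed: dot notation matches any element of "data"
# whose "Jvm" field satisfies the regex.
for doc in col.find({"results.tables.jvm.data.Jvm": {"$regex": "service.*"}}):
    print(doc["_id"])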

Regex breaks when I use a colon (:)

I just started working with Elasticsearch. By "started working" I mean I have to query an already running Elastic database. Is there good documentation of the regex syntax it follows? I know about the page on the official site, but it's not very helpful.
The more specific problem is that I want to query for lines of this sort:
10:02:37:623421|0098-TSOT {TRANSITION} {ID} {1619245525} {securityID} {} {fromStatus} {NOT_PRESENT} {toStatus} {WAITING}
or
01:01:36:832516|0058-CT {ADD} {0} {3137TTDR7} {23} {COM} {New} {0} {0} {52} {1}
and more of a similar structure. I don't want a generalized regex. If possible, could someone give me a regex expression for each of these that would run with elastic?
I noticed that it also matches when the regexp matches only a substring, as when I ran this:
query = {"query":
{"regexp":
{
"message": "[0-9]{2}"
}
},
"sort":
[
{"#timestamp":"asc"}
]
}
But it won't match anything if I use:
query = {"query":
{"regexp":
{
"message": "[0-9]{2}:.*"
}
},
"sort":
[
{"#timestamp":"asc"}
]
}
I want to write regexes that are more specific and different for the two examples given near the top.
It turns out my message is stored in tokenized form rather than raw form, and : is one of the default delimiters of the tokenizer in Elastic. For that reason I can't use a regexp query on the whole message: the regexp is matched against each token individually.
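One hedged workaround, assuming the index's dynamic mapping created an unanalyzed message.keyword sub-field (worth verifying against your actual mapping, and note its default ignore_above of 256 characters): run the regexp against that sub-field so the pattern is applied to the raw line instead of individual tokens, for example:

query = {
    "query": {
        "regexp": {
            # "message.keyword" is an assumption; check your mapping first.
            # Literal | and { } must be escaped in Lucene regexp syntax.
            "message.keyword": "[0-9]{2}:[0-9]{2}:[0-9]{2}:[0-9]+\\|[0-9]{4}-TSOT \\{TRANSITION\\}.*"
        }
    },
    "sort": [
        {"#timestamp": "asc"}
    ]
}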

Elasticsearch Query on indexes whose name is matching a certain pattern

I have a number of indices in my Elasticsearch DB, as follows:
Index_2019_01
Index_2019_02
Index_2019_03
Index_2019_04
.
.
Index_2019_12
Suppose I want to search only on the first 3 Indexes.
I mean a regular expression like this:
select count(*) from Index_2019_0[1-3] where LanguageId="English"
What is the correct way to do that in Elasticsearch?
How can I query several indexes with certain names?
This can be achieved via multi-index search, which is a built-in capability of Elasticsearch. To achieve the described behavior, one should try a query like this:
POST /index_2019_01,index_2019_02/_search
{
  "query": {
    "match": {
      "LanguageID": "English"
    }
  }
}
Or, using URI search:
curl 'http://<host>:<port>/index_2019_01,index_2019_02/_search?q=LanguageID:English'
More details are available here. Note that Elasticsearch requires index names to be lowercase.
Can I use a regex to specify index name pattern?
In short, no. It is possible to use the index name in queries via a special "virtual" field, _index, but its use is limited. For instance, one cannot use a regexp against an index name:
The _index is exposed as a virtual field — it is not added to the Lucene index as a real field. This means that you can use the _index field in a term or terms query (or any query that is rewritten to a term query, such as the match, query_string or simple_query_string query), but it does not support prefix, wildcard, regexp, or fuzzy queries.
For instance, the query from above can be rewritten as:
POST /_search
{
  "query": {
    "bool": {
      "must": [
        {
          "terms": {
            "_index": ["index_2019_01", "index_2019_02"]
          }
        },
        {
          "match": {
            "LanguageID": "English"
          }
        }
      ]
    }
  }
}
This combines a terms query on _index with the original match query inside a bool query.
Hope that helps!
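For reference, the same multi-index search through the official Python client might look like this (a minimal sketch using the 8.x elasticsearch-py signature; host and port are assumptions):

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumed host

# The index argument accepts a list, which the client joins into the
# comma-separated multi-index form shown above.
resp = es.search(
    index=["index_2019_01", "index_2019_02"],
    query={"match": {"LanguageID": "English"}},
)
print(resp["hits"]["total"])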
Why use POST when you are not adding any data? I advise using GET for your case. Secondly, if the indices have similar names, as in your case, you should use an index pattern like the one in the query below:
GET /index_2019_*/_search
{
  "query": {
    "match": {
      "LanguageID": "English"
    }
  }
}
Or as a URL:
curl -XGET "http://<host>:<port>/index_2019_*/_search" -H 'Content-Type: application/json' -d'{"query": {"match":{"LanguageID": "English"}}}'
While searching for indices using a regex is not possible, you might be able to use date math to take you a bit further.
You can look at the docs here
As an example, let's say you want the last 3 months from those indices. That means that if we have
index_2019_01
index_2019_02
index_2019_03
index_2019_04
and today is 2019/04/20, we could use the following query to get 04, 03 and 02:
GET /<index-{now/M-0M{yyyy_MM}}>,<index-{now/M-1M{yyyy_MM}}>,<index-{now/M-2M{yyyy_MM}}>
I used M-0M for the first one so the query-construction loop doesn't need a special case for the first index.
Look at the docs regarding URL-encoding this query and how to include literal braces in the index name; if a client is used, the URL encoding is done for you (at least in the Python client), as in the sketch below.
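If you build the request path by hand, the date-math expressions must be percent-encoded. A small Python sketch, assuming the index prefix is index_ as in the listing above:

from urllib.parse import quote

def month_index(n_months_back: int) -> str:
    # Date-math index name as in the answer above; "index_" prefix assumed.
    expr = "<index_{now/M-%dM{yyyy_MM}}>" % n_months_back
    return quote(expr, safe="")  # percent-encode <, >, {, } and /

path = "/" + ",".join(month_index(n) for n in range(3)) + "/_search"
print(path)  # the commas separating the three index names stay literal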

Regexp starts with not working Elasticsearch 6.*

I'm having trouble understanding the regexp mechanism in Elasticsearch. I have documents that represent property units:
{
  "Unit": {
    "DailyAvailablity":
"UIAOUUUUUUUIAAAAAAAAAAAAAAAAAOUUUUIAAAAOUUUIAOUUUUUUUUUUUUUUUUUUUUUUUUUUIAAAAAAAAAAAAAAAAAAAAAAOUUUUUUUUUUIAAAAAOUUUUUUUUUUUUUIAAAAOUUUUUUUUUUUUUIAAAAAAAAOUUUUUUIAAAAAAAAAOUUUUUUUUUUUUUUUUUUIUUUUUUUUIUUUUUUUUUUUUUUIAAAOUUUUUUUUUUUUUIUUUUIAOUUUUUUUUUUUUUUUUUUUUUUUUUUUUUIAAAAAAAAAAAAOUUUUUUUUUUUUUUUUUUUUIAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAOUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUIAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA"
  }
}
The DailyAvailablity field encodes the availability of the property by day for the next two years from today: 'A' means available, 'U' unavailable, 'I' can check in, and 'O' can check out. How can I write a regexp filter to get all units that are available on particular dates?
I tried to find an 'A' substring with a particular length and offset in the DailyAvailablity field. For example, to find units that are available for 7 days starting 7 days from today:
{
  "query": {
    "bool": {
      "filter": [
        {
          "regexp": { "Unit.DailyAvailability": { "value": ".{7}a{7}.*" } }
        }
      ]
    }
  }
}
This query returns, for instance, a unit whose DailyAvailability starts with "UUUUUUUUUUUUUUUUUUUIAA" but contains suitable sequences somewhere inside the field. How can I anchor the regexp to the entire source string? The ES docs say that Lucene regexes are anchored by default.
P.S. I have tried '^.{7}a{7}.*$'. It returns an empty set.
It looks like you are using the text datatype to store Unit.DailyAvailablity (which is also the default for strings if you are using dynamic mapping). You should consider using the keyword datatype instead.
Let me explain in a bit more detail.
Why does my regex match something in the middle of a text field?
What happens with the text datatype is that the data gets analyzed for full-text search. It undergoes transformations like lowercasing and splitting into tokens.
Let's try to use the Analyze API against your input:
POST _analyze
{
"text": "UIAOUUUUUUUIAAAAAAAAAAAAAAAAAOUUUUIAAAAOUUUIAOUUUUUUUUUUUUUUUUUUUUUUUUUUIAAAAAAAAAAAAAAAAAAAAAAOUUUUUUUUUUIAAAAAOUUUUUUUUUUUUUIAAAAOUUUUUUUUUUUUUIAAAAAAAAOUUUUUUIAAAAAAAAAOUUUUUUUUUUUUUUUUUUIUUUUUUUUIUUUUUUUUUUUUUUIAAAOUUUUUUUUUUUUUIUUUUIAOUUUUUUUUUUUUUUUUUUUUUUUUUUUUUIAAAAAAAAAAAAOUUUUUUUUUUUUUUUUUUUUIAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAOUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUIAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA"
}
The response is:
{
  "tokens": [
    {
"token": "uiaouuuuuuuiaaaaaaaaaaaaaaaaaouuuuiaaaaouuuiaouuuuuuuuuuuuuuuuuuuuuuuuuuiaaaaaaaaaaaaaaaaaaaaaaouuuuuuuuuuiaaaaaouuuuuuuuuuuuuiaaaaouuuuuuuuuuuuuiaaaaaaaaouuuuuuiaaaaaaaaaouuuuuuuuuuuuuuuuuuiuuuuuuuuiuuuuuuuuuuuuuuiaaaouuuuuuuuuuuuuiuuuuiaouuuuuuuuuuuuuuu",
"start_offset": 0,
"end_offset": 255,
"type": "<ALPHANUM>",
"position": 0
},
{
"token": "uuuuuuuuuuuuuuiaaaaaaaaaaaaouuuuuuuuuuuuuuuuuuuuiaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa",
"start_offset": 255,
"end_offset": 510,
"type": "<ALPHANUM>",
"position": 1
},
{
"token": "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaouuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuiaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa",
"start_offset": 510,
"end_offset": 732,
"type": "<ALPHANUM>",
"position": 2
}
]
}
As you can see, Elasticsearch split your input into three tokens and lowercased them. This may look unexpected, but it makes sense once you consider that the analyzer is designed to facilitate searching for words in human language, and no human-language words are that long.
That's why the regexp query ".{7}a{7}.*" now matches: there is a token that actually starts with a lot of a's, which is the expected behavior of the regexp query:
...Elasticsearch will apply the regexp to the terms produced by the tokenizer for that field, and not to the original text of the field.
How can I make the regexp query consider the entire string?
It is very simple: do not apply analyzers. The keyword type stores the string you provide as is.
With a mapping like this:
PUT my_regexes
{
  "mappings": {
    "doc": {
      "properties": {
        "Unit": {
          "properties": {
            "DailyAvailablity": {
              "type": "keyword"
            }
          }
        }
      }
    }
  }
}
You will be able to do a query like this that will match the document from the post:
POST my_regexes/doc/_search
{
  "query": {
    "bool": {
      "filter": [
        {
          "regexp": { "Unit.DailyAvailablity": "UIAOUUUUUUUIA.*" }
        }
      ]
    }
  }
}
Note that the query became case-sensitive because the field is not analyzed.
This regexp won't return any results anymore: ".{12}a{7}.*"
This will: ".{12}A{7}.*"
So what about anchoring?
The regexes are anchored:
Lucene’s patterns are always anchored. The pattern provided must match the entire string.
The reason the anchoring looked wrong was most likely that the tokens were being split in an analyzed text field.
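To make the pattern reusable, you can compute the offset and stay length instead of hard-coding them. A minimal Python sketch under the assumptions of the answer above (keyword-mapped field with its original spelling, uppercase availability flags):

def availability_query(days_from_now: int, stay_days: int) -> dict:
    # Lucene regexps are anchored, so the pattern describes the whole string:
    # skip `days_from_now` characters, then require `stay_days` A's.
    pattern = ".{%d}A{%d}.*" % (days_from_now, stay_days)
    return {
        "query": {
            "bool": {
                "filter": [
                    {"regexp": {"Unit.DailyAvailablity": {"value": pattern}}}
                ]
            }
        }
    }

print(availability_query(7, 7))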
Just an addition to the brilliant and helpful answer of Nikolay Vasiliev: in my case I had to go further to make it work with NEST (.NET). I added an attribute mapping to DailyAvailability:
[Keyword(Name = "DailyAvailability")]
public string DailyAvailability { get; set; }
The filter still didn't work, and I got this mapping:
"DailyAvailability": {
  "type": "text",
  "fields": {
    "keyword": {
      "type": "keyword",
      "ignore_above": 256
    }
  }
}
My field contained about 732 characters, so it was ignored by the index. I tried:
[Keyword(Name = "DailyAvailability", IgnoreAbove = 1024)]
public string DailyAvailability { get; set; }
It didn't make any difference to the mapping. Only after adding manual mappings did it start working properly:
var client = new ElasticClient(settings);
client.CreateIndex("vrp", c => c
    .Mappings(ms => ms.Map<Unit>(m => m
        .Properties(ps => ps
            .Keyword(k => k.Name(u => u.DailyAvailability).IgnoreAbove(1024))
        )
    )
));
The point is this:
ignore_above - Do not index any string longer than this value. Defaults to 2147483647 so that all values would be accepted. Please however note that default dynamic mapping rules create a sub keyword field that overrides this default by setting ignore_above: 256.
So use an explicit mapping for long keyword fields to set ignore_above if you need to filter them with a regexp.
In case it's useful to anyone: the Elasticsearch regexp syntax does not support the \d or \w character classes; you should write those as [0-9] and [a-zA-Z0-9_] respectively.