Nested Array search in MongoDB/PyMongo while using aggregate - django

I am trying to search for a keyword inside an array of arrays in a Mongo document.
{
"PRODUCT_NAME" : "Truffle Cake",
"TAGS": [
["Cakes", 100],
["Flowers", 100],
]
}
Usually, I would do something like this and it would work.
db.collection.find( {"TAGS":{"$elemMatch":{ "$elemMatch": {"$in":['search_text']} } }} )
But now I have changed this query to an aggregate-based query because of other requirements. I've tried $filter and $match, but I have not been able to replicate the above query exactly.
Can anyone convert the above code so that it can directly work with aggregate?
(I use PyMongo)

$match uses the same query syntax as the query language (find), from the docs:
The query syntax is identical to the read operation query syntax;
This means if you have a query that works in a "find", it will also work within a $match stage, like so:
db.collection.aggregate([
{
$match: {
"TAGS": {
"$elemMatch": {
"$elemMatch": {
"$in": [
"Cakes"
]
}
}
}
}
}
])
Check this live on Mongo Playground
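For PyMongo, the same stage can be written as a plain Python dictionary. This is a minimal sketch: the helper name build_tag_match_pipeline is hypothetical, and the database and collection names in the usage comment are placeholders, not taken from the question.

```python
def build_tag_match_pipeline(search_text):
    """Return an aggregation pipeline whose $match stage mirrors the find() filter."""
    return [
        {
            "$match": {
                "TAGS": {
                    # outer $elemMatch: some inner array of TAGS ...
                    "$elemMatch": {
                        # inner $elemMatch: ... has an element equal to search_text
                        "$elemMatch": {"$in": [search_text]}
                    }
                }
            }
        }
        # further stages ($project, $group, ...) can be appended here
    ]


pipeline = build_tag_match_pipeline("Cakes")
print(pipeline)

# Usage with PyMongo (connection details and names are placeholders):
# from pymongo import MongoClient
# collection = MongoClient()["mydb"]["products"]
# results = list(collection.aggregate(pipeline))
```

Keeping the filter in a function makes it easy to put the $match stage first and append the other stages that motivated the move to aggregate.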

Related

Regexp starts with not working Elasticsearch 6.*

I am having trouble understanding the regexp mechanism in Elasticsearch. I have documents that represent property units:
{
"Unit" :
{
"DailyAvailablity" :
"UIAOUUUUUUUIAAAAAAAAAAAAAAAAAOUUUUIAAAAOUUUIAOUUUUUUUUUUUUUUUUUUUUUUUUUUIAAAAAAAAAAAAAAAAAAAAAAOUUUUUUUUUUIAAAAAOUUUUUUUUUUUUUIAAAAOUUUUUUUUUUUUUIAAAAAAAAOUUUUUUIAAAAAAAAAOUUUUUUUUUUUUUUUUUUIUUUUUUUUIUUUUUUUUUUUUUUIAAAOUUUUUUUUUUUUUIUUUUIAOUUUUUUUUUUUUUUUUUUUUUUUUUUUUUIAAAAAAAAAAAAOUUUUUUUUUUUUUUUUUUUUIAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAOUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUIAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA"
}
}
The DailyAvailability field encodes the availability of a property by day for the next two years from today. 'A' means available, 'U' unavailable, 'I' can check in, 'O' can check out. How can I write a regexp filter to get all units that are available on particular dates?
I tried to find the 'A' substring with particular length and offset in DailyAvailability field. For example to find units that would be available for 7 days in 7 days from today:
{
"query": {
"bool": {
"filter": [
{
"regexp": { "Unit.DailyAvailability": {"value": ".{7}a{7}.*" } }
}
]
}
}
}
This query returns, for instance, a unit whose DailyAvailability starts with "UUUUUUUUUUUUUUUUUUUIAA" but contains a suitable sequence somewhere inside the field. How can I anchor the regexp to the entire source string? The ES docs say that Lucene regexes should be anchored by default.
P.S. I have tried '^.{7}a{7}.*$'. Returns empty set.
It looks like you are using text datatype to store Unit.DailyAvailability (which is also the default one for strings if you are using dynamic mapping). You should consider using keyword datatype instead.
Let me explain in a bit more detail.
Why does my regex match something in the middle of a text field?
What happens with text datatype is that the data gets analyzed for full-text search. It does some transformations like lowercasing and splitting into tokens.
Let's try to use the Analyze API against your input:
POST _analyze
{
"text": "UIAOUUUUUUUIAAAAAAAAAAAAAAAAAOUUUUIAAAAOUUUIAOUUUUUUUUUUUUUUUUUUUUUUUUUUIAAAAAAAAAAAAAAAAAAAAAAOUUUUUUUUUUIAAAAAOUUUUUUUUUUUUUIAAAAOUUUUUUUUUUUUUIAAAAAAAAOUUUUUUIAAAAAAAAAOUUUUUUUUUUUUUUUUUUIUUUUUUUUIUUUUUUUUUUUUUUIAAAOUUUUUUUUUUUUUIUUUUIAOUUUUUUUUUUUUUUUUUUUUUUUUUUUUUIAAAAAAAAAAAAOUUUUUUUUUUUUUUUUUUUUIAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAOUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUIAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA"
}
The response is:
{
"tokens": [
{
"token": "uiaouuuuuuuiaaaaaaaaaaaaaaaaaouuuuiaaaaouuuiaouuuuuuuuuuuuuuuuuuuuuuuuuuiaaaaaaaaaaaaaaaaaaaaaaouuuuuuuuuuiaaaaaouuuuuuuuuuuuuiaaaaouuuuuuuuuuuuuiaaaaaaaaouuuuuuiaaaaaaaaaouuuuuuuuuuuuuuuuuuiuuuuuuuuiuuuuuuuuuuuuuuiaaaouuuuuuuuuuuuuiuuuuiaouuuuuuuuuuuuuuu",
"start_offset": 0,
"end_offset": 255,
"type": "<ALPHANUM>",
"position": 0
},
{
"token": "uuuuuuuuuuuuuuiaaaaaaaaaaaaouuuuuuuuuuuuuuuuuuuuiaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa",
"start_offset": 255,
"end_offset": 510,
"type": "<ALPHANUM>",
"position": 1
},
{
"token": "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaouuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuiaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa",
"start_offset": 510,
"end_offset": 732,
"type": "<ALPHANUM>",
"position": 2
}
]
}
As you can see, Elasticsearch has split your input into three tokens and lowercased them. This looks unexpected, but if you think that it actually tries to facilitate search for words in human language, it makes sense - there are no such long words.
That's why now regexp query ".{7}a{7}.*" will match: there is a token that actually starts with a lot of a's, which is an expected behavior of regexp query.
...Elasticsearch will apply the regexp to the terms produced by the
tokenizer for that field, and not to the original text of the field.
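The effect can be reproduced locally with a rough simulation. The sketch below is not Elasticsearch code: it assumes the analyzer only lowercases the value and cuts it every 255 characters (the token boundaries visible in the response above), and it uses Python's re.fullmatch as a stand-in for Lucene's whole-term matching. The availability string is synthetic.

```python
import re

MAX_TOKEN_LENGTH = 255  # token length seen in the _analyze response above


def analyze_like_text_field(value, max_len=MAX_TOKEN_LENGTH):
    """Rough stand-in for the analyzer on one long word:
    lowercase it and cut it into chunks of at most max_len characters."""
    lowered = value.lower()
    return [lowered[i:i + max_len] for i in range(0, len(lowered), max_len)]


def regexp_matches_any_token(pattern, tokens):
    """A regexp query must match a whole term, so use fullmatch per token."""
    return any(re.fullmatch(pattern, token) for token in tokens)


# Synthetic availability string: 262 unavailable days, then 20 available days.
availability = "U" * 262 + "A" * 20 + "U" * 30
tokens = analyze_like_text_field(availability)

pattern = ".{7}a{7}.*"
whole_string_match = re.fullmatch(pattern, availability.lower()) is not None
token_match = regexp_matches_any_token(pattern, tokens)

print(len(tokens), whole_string_match, token_match)
```

The pattern fails against the whole string but succeeds against the second chunk, because that chunk happens to start 7 characters before the run of a's.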
How can I make regexp query consider the entire string?
It is very simple: do not apply analyzers. The type keyword stores the string you provide as is.
With a mapping like this:
PUT my_regexes
{
"mappings": {
"doc": {
"properties": {
"Unit": {
"properties": {
"DailyAvailablity": {
"type": "keyword"
}
}
}
}
}
}
}
You will be able to do a query like this that will match the document from the post:
POST my_regexes/doc/_search
{
"query": {
"bool": {
"filter": [
{
"regexp": { "Unit.DailyAvailablity": "UIAOUUUUUUUIA.*" }
}
]
}
}
}
Note that the query became case-sensitive because the field is not analyzed.
This regexp won't return any results anymore: ".{12}a{7}.*"
This will: ".{12}A{7}.*"
So what about anchoring?
The regexes are anchored:
Lucene’s patterns are always anchored. The pattern provided must match the entire string.
The reason why it looked like the anchoring was wrong was most likely because tokens got split in an analyzed text field.
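With a keyword field in place, the pattern for "available for N nights starting M days from today" can be generated rather than typed by hand. This is a sketch under assumptions: the function names are hypothetical, the field name is copied as spelled in the mapping above, and the local check uses Python's re.fullmatch only to mimic Lucene's implicit anchoring.

```python
import re


def availability_pattern(offset_days, nights):
    """Lucene-style pattern: skip offset_days characters, then require
    `nights` consecutive 'A' (available) days, then anything."""
    if offset_days < 0 or nights < 1:
        raise ValueError("offset_days must be >= 0 and nights >= 1")
    return ".{%d}A{%d}.*" % (offset_days, nights)


def availability_query(offset_days, nights, field="Unit.DailyAvailablity"):
    """Request body for a filtered regexp query on a keyword field."""
    return {
        "query": {
            "bool": {
                "filter": [
                    {"regexp": {field: availability_pattern(offset_days, nights)}}
                ]
            }
        }
    }


body = availability_query(12, 7)
print(body)

# Local sanity check on a string shaped like the start of the sample document.
sample = "UIAO" + "U" * 7 + "I" + "A" * 17 + "O" + "U" * 4
print(bool(re.fullmatch(availability_pattern(12, 7), sample)))
print(bool(re.fullmatch(availability_pattern(7, 7), sample)))
```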
Just an addition to the brilliant and helpful answer of Nikolay Vasiliev. In my case I had to go further to make it work with NEST (.NET). I added an attribute mapping to DailyAvailability:
[Keyword(Name = "DailyAvailability")]
public string DailyAvailability { get; set; }
The filter still didn't work, and I got this mapping:
"DailyAvailability":{"type":"text",
"fields":{
"keyword":{
"type":"keyword",
"ignore_above":256
}
}
}
My field contained about 732 characters, so it was not indexed. I tried:
[Keyword(Name = "DailyAvailability", IgnoreAbove = 1024)]
public string DailyAvailability { get; set; }
It made no difference to the mapping. Only after adding manual mappings did it start working properly:
var client = new ElasticClient(settings);
client.CreateIndex("vrp", c => c
.Mappings(ms => ms.Map<Unit>(m => m
.Properties(ps => ps
.Keyword(k => k.Name(u => u.DailyAvailability).IgnoreAbove(1024))
)
)
));
The point is that:
ignore_above - Do not index any string longer than this value. Defaults to 2147483647 so that all values would be accepted. Please however note that default dynamic mapping rules create a sub keyword field that overrides this default by setting ignore_above: 256.
So use explicit mapping for long keyword fields to set ignore_above if you need to filter them with regexp.
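For readers not using NEST, the same idea can be sketched as a plain request body. This is an illustration, not the author's code: the helper names are hypothetical, and the mapping type "doc" is borrowed from the earlier example in this thread.

```python
def keyword_mapping(ignore_above=1024):
    """Index-creation body that maps DailyAvailability as a keyword field
    with an explicit ignore_above."""
    return {
        "mappings": {
            "doc": {
                "properties": {
                    "DailyAvailability": {
                        "type": "keyword",
                        "ignore_above": ignore_above,
                    }
                }
            }
        }
    }


def is_indexed(value, ignore_above):
    """Strings longer than ignore_above are not indexed as keyword terms."""
    return len(value) <= ignore_above


availability = "A" * 732  # same length as the field described above
print(is_indexed(availability, 256), is_indexed(availability, 1024))
```

A 732-character value is skipped under the dynamic-mapping limit of 256 but indexed once the limit is raised to 1024.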
In case it is useful to anyone: the ES regexp syntax does not support the \d and \w shorthand classes; write them as [0-9] and [a-z] instead.

Conditional query with Elastic Search

My Elasticsearch index looks like this:
title:The Godfather
year:1972
genres:Crime & Drama
title:Lawrence of Arabia
year:1962
genres:Adventure,Biography &Drama
title:To Kill a Mockingbird
year:1973
genres:Mystery
title:Apocalypse Now
year:1975
genres:Thriller
I am trying to write a query in Elasticsearch that first checks whether the genres field contains &. If it does, it should perform another matching operation on the same field; if genres doesn't contain &, it should skip the other matching operation. Basically, I am looking for an if condition in Elasticsearch.
Below is my query, but it doesn't seem to be working:
{
"query": {
"bool": {
"should": [
{
"regexp": {
"genres": ".*&.*"
}
},
{"match": {"genres": {"query": "Adventure"}}}
]
}
}
}
I was following the suggestion below from Stack Overflow:
How to write a conditional query with Elasticsearch?
You can nest bool queries: use a top-level bool query with only a should clause, and inside that should clause put two more bool queries. Each of those contains a must part that holds the search for & and whatever else. Like this:
bool:
  should:
    - bool:
        must: [ _search for & and whatever else_ ]
    - bool:
        must: [ _search for another criteria_ ]
Hope this helps!
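One possible concrete form of that outline, written as a Python dictionary. This is a sketch under several assumptions: the function name is hypothetical, genres.keyword is an assumed un-analyzed sub-field (an analyzed text field would normally drop the & during tokenization), and the second branch interprets "skip the other matching" as "accept documents without & as they are".

```python
def conditional_genre_query(keyword, field="genres", raw_field="genres.keyword"):
    """Nested bool query: 'if genres contains &, also require a match on keyword'."""
    has_ampersand = {"regexp": {raw_field: ".*&.*"}}
    return {
        "query": {
            "bool": {
                "should": [
                    # Branch 1: genres contains '&' AND the extra match applies
                    {"bool": {"must": [has_ampersand, {"match": {field: keyword}}]}},
                    # Branch 2: genres has no '&', so no extra condition is applied
                    {"bool": {"must_not": [has_ampersand]}},
                ],
                # at least one branch has to hold for a document to be returned
                "minimum_should_match": 1,
            }
        }
    }


body = conditional_genre_query("Adventure")
print(body)
```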

elasticsearch - search with regex involving space

I want to search using a regular expression that involves whitespace in Elasticsearch.
I have already set my field to not_analyzed. Its mapping looks like this:
"type1": {
"properties": {
"field1": {
"type": "string",
"index": "not_analyzed",
"store": true
}
}
}
I indexed two values for testing:
"field1":"XXX YYY ZZZ"
"field1":"XXX ZZZ YYY"
Then I ran the regex query /XXX YYY/ (I want this query to find record 1 but not record 2):
{
"query": {
"query_string": {
"query": "/XXX YYY/"
}
}
}
But it returns 0 results.
However, if I search without the regex (without the forward slashes '/'), both record 1 and record 2 are returned.
Is it not possible in Elasticsearch to search with a regex query that involves a space?
What you need is a "term" query that doesn't tokenise the search query by breaking it down into smaller parts. More about the term query here: https://www.elastic.co/guide/en/elasticsearch/reference/2.0/query-dsl-term-query.html
There's a special breed of term queries that allows you to use regexes called regexp queries. That should match any whitespaces as well: https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-regexp-query.html
You can keep using your query string, but your regexp is just missing a tiny part, i.e. the .* at the end. If you run that you'll get the single result you expect.
{
"query": {
"query_string": {
"query": "/XXX YYY.*/"
}
}
}
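The reason the trailing .* matters is that the pattern has to cover the whole stored value. A small local illustration, using Python's re.fullmatch as an approximation of that anchoring (Python's regex engine is not Lucene's, so this only demonstrates the principle):

```python
import re

values = ["XXX YYY ZZZ", "XXX ZZZ YYY"]  # the two test records from the question


def matching_values(pattern, candidates):
    """Return the candidates that the pattern matches in full (anchored match)."""
    return [v for v in candidates if re.fullmatch(pattern, v)]


without_wildcard = matching_values("XXX YYY", values)
with_wildcard = matching_values("XXX YYY.*", values)

print(without_wildcard)
print(with_wildcard)
```

Without the wildcard nothing matches, because neither value is exactly "XXX YYY"; with it, only the first record matches.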
You can use regexp queries to achieve this. Mind you, the query performance may be slow. The below query will search for all documents in which the value of field1 contains "XXX YYY".
POST <index_name>/type1/_search
{
"query": {
"regexp": {
"field1": ".*XXX YYY.*"
}
}
}

Mongodb distinct query with contains query

I have a mongo collection User which contains data like:-
{
id : 1,
name : "gaurav",
skills : "C++ HTML CSS"
}
When I search for all users that have the C++ skill with the following query, I get the correct results as expected:
db.user.find({skills:{contains:"C++"}});
But when I search for all the unique names using the same condition, I don't get the desired result:
db.user.distinct('name',{skills:{contains:"C++"}});
Can anyone help me with what I am doing wrong?
"contains" is not a valid operator for MongoDB queries. You need $regex, which takes a regular expression in PCRE syntax:
db.user.distinct( "name", { "skills": { "$regex": "C\\+\\+" } })
If you are using JavaScript as your language, a regex literal is also safe:
db.user.distinct( "name", { "skills": /C\+\+/ })
Either form tests whether the string "C++" occurs somewhere within the string value of the field. The + character is reserved in regular expressions, so you need to escape it with a \ character. Inside a quoted string the backslash itself must be doubled, because a single \+ in a string literal collapses to a plain +.
On your data this is the result:
db.user.distinct( "name", { "skills": { "$regex": "C\\+\\+" } })
[ "gaurav" ]
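From Python, the escaping can be delegated to re.escape instead of being written by hand. A minimal sketch: the helper name contains_filter is hypothetical, and the PyMongo call in the comment uses placeholder connection details.

```python
import re


def contains_filter(field, literal):
    """Build a filter that matches documents whose field contains `literal`
    as plain text, escaping any regex metacharacters first."""
    return {field: {"$regex": re.escape(literal)}}


query = contains_filter("skills", "C++")
pattern = query["skills"]["$regex"]
print(query)

# Local check of the escaped pattern with Python's own regex engine.
print(bool(re.search(pattern, "C++ HTML CSS")), bool(re.search(pattern, "HTML CSS")))

# Usage with PyMongo (names are placeholders):
# from pymongo import MongoClient
# names = MongoClient()["mydb"]["user"].distinct("name", query)
```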
Try using $regex as in the query below:
db.user.distinct("name",{"skills":{"$regex":"C\\+\\+.*"}})

'like' or $regex query inside $cond in MongoDB

Please go through this question of mine:
MongoDB $group and explicit group formation with computed column
But this time I need to compare strings, not numbers. The CASE query must have a LIKE:
CASE WHEN source LIKE '%Web%' THEN 'Web'
I then need to group by source. How to write this in Mongo? I am trying the following but not sure if $regex is supported inside $cond. By the way, is there a list of valid operators inside $cond somewhere? Looks like $cond isn't very fond of me :)
db.Twitter.aggregate(
{ $project: {
"_id":0,
"Source": {
$cond: [
{ $regex:['$source','/.* Android.*/'] },
'Android',
{ $cond: [
{ $eq: ['$source', 'web'] }, 'Web', 'Others'
] }
]
}
} }
);
There are many other values that I need to write in there, with deeper nesting. This is just an example with only 'Android' and 'Web' for the sake of brevity. I have tried both $eq and $regex. Using $regex gives an invalid operator error, whereas $eq doesn't understand the regex expression and puts everything under 'Others'. If this is possible with regex, kindly let me know how to write it for a case-insensitive match.
Thanks for any help :-)
Well, it still seems to be not even scheduled to be implemented :(
https://jira.mongodb.org/browse/SERVER-8892
I'm using 2.6 and took a peek on 3.0, but it's just not there.
There's one workaround though, if the text you need to match sits at a stable position in the string. Then you can $substr the field and use multiple nested $cond. It's awkward, but it works.
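A sketch of that $substr workaround as a PyMongo-style pipeline, assuming the interesting text is a fixed prefix of source (the prefixes used here are illustrative, and the helper names are hypothetical). For comparison, the second helper uses $regexMatch, an aggregation operator that MongoDB added in version 4.2, later than the versions discussed in this answer.

```python
def source_by_prefix(prefix_map, default="Others"):
    """Nested $cond expression comparing a fixed-length prefix of $source.
    $substr counts bytes, so this assumes ASCII prefixes."""
    expr = default
    for prefix, label in reversed(list(prefix_map.items())):
        expr = {
            "$cond": [
                {"$eq": [{"$substr": ["$source", 0, len(prefix)]}, prefix]},
                label,
                expr,
            ]
        }
    return expr


def source_by_regex(regex_map, default="Others"):
    """Same nesting, but with a case-insensitive $regexMatch (MongoDB 4.2+)."""
    expr = default
    for regex, label in reversed(list(regex_map.items())):
        condition = {"$regexMatch": {"input": "$source", "regex": regex, "options": "i"}}
        expr = {"$cond": [condition, label, expr]}
    return expr


labels = {"Android": "Android", "web": "Web"}
prefix_pipeline = [{"$project": {"_id": 0, "Source": source_by_prefix(labels)}}]
regex_pipeline = [{"$project": {"_id": 0, "Source": source_by_regex(labels)}}]
print(prefix_pipeline)
print(regex_pipeline)

# Usage with PyMongo (names are placeholders):
# results = list(db["Twitter"].aggregate(prefix_pipeline))
```

Building the nested $cond in a loop keeps the pipeline readable when many more source values have to be added.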
Maybe you can try it with MapReduce.
var map = function()
{
var reg1=new RegExp("(Android)+");
var reg2=new RegExp("(web)+");
if (reg1.test(this.source)){
emit(this._id,'Android');
}
else if (reg2.test(this.source))
{
emit(this._id,'web');
}
}
var reduce = function (key,value){
var reduced = {
id:key,
source:value
}
return reduced;
}
db.Twitter.mapReduce(map,reduce,{out:'map_reduce_result'});
db.map_reduce_result.find();
You can use JavaScript regular expressions instead of MongoDB $regex.