Group documents using substring of a field - regex

I am working with MongoDB and I am enjoying a lot!
There is one query I am having problems to work with:
I have this set of data that represents an hierarchy (a tree, where 1 is the root, 1.1 and 1.2 are children of 1, and so on)
db.test.insert({id:1, hierarchy:"1"})
db.test.insert({id:2, hierarchy:"1.1"})
db.test.insert({id:3, hierarchy:"1.2"})
db.test.insert({id:4, hierarchy:"1.1.1"})
db.test.insert({id:5, hierarchy:"1.1.2"})
db.test.insert({id:6, hierarchy:"1.2.1"})
db.test.insert({id:7, hierarchy:"1.2.2"})
db.test.insert({id:8, hierarchy:"1.2.3"})
So if I make a query:
> db.test.find()
{ "_id" : ObjectId("546a6095cafd2fa3ff8e4760"), "id" : 1, "hierarchy" : "1" }
{ "_id" : ObjectId("546a6095cafd2fa3ff8e4761"), "id" : 2, "hierarchy" : "1.1" }
{ "_id" : ObjectId("546a6095cafd2fa3ff8e4762"), "id" : 3, "hierarchy" : "1.2" }
{ "_id" : ObjectId("546a6095cafd2fa3ff8e4763"), "id" : 4, "hierarchy" : "1.1.1" }
{ "_id" : ObjectId("546a6095cafd2fa3ff8e4764"), "id" : 5, "hierarchy" : "1.1.2" }
{ "_id" : ObjectId("546a6095cafd2fa3ff8e4765"), "id" : 6, "hierarchy" : "1.2.1" }
{ "_id" : ObjectId("546a6095cafd2fa3ff8e4766"), "id" : 7, "hierarchy" : "1.2.2" }
{ "_id" : ObjectId("546a6095cafd2fa3ff8e4767"), "id" : 8, "hierarchy" : "1.2.3" }
the document with id 1 represents the CEO and I would like to gather information about the teams under the VPs (1.1 and 1.2).
I would like to have an output like this
{
id: null,
teams:
[
{
manager: 2,
hierarchy: "1.1",
subordinates: 2
},
{
manager: 3,
hierarchy: "1.2",
subordinates: 3
}
]
}
I am having problems to aggregate the documents in the right "slot".
I tried to use a regex to aggregate using the substring, and project before grouping and create a new field which would be "manager_hierarchy", so I could group using this field. But with none of these approaches I had any kind of success, so I am stuck now.
Is there anyway I could accomplish this task?
EDIT: sorry, I forgot to make one thing explicit:
This query is to get information about subordinate teams of an employee. I've used an example as the CEO, but if I was the employee 1.2.3 in the hierarchy, I would like to see the teams 1.2.3.1, 1.2.3.2, ..., 1.2.3.xx
There is also the possibility (rare, but possible) that someone has more than 9 subordinates, so making a "hardcoded" substring would not work, since
$substr:["$hierarchy",0,3]}
would work for 1.2 but not for 1.10
and
$substr:["$hierarchy",0,4]}
would work for 1.10, but not for 1.2

You can get your results using the below aggregate pipeline operations.
Sort the rows based on their hierarchy, so that the manager
comes on top.
Group together records that start with similar ancestors.(i.e 1.1
or 1.2,...). The manager record will be on top for each group
due to the above sort operation.
Take the count of each group, so the number of subordinates
will be count-1(the manager record).
Again group the records to get a single array.
The code:
db.test.aggregate([
{$match:{"id":{$gt:1}}},
{$sort:{"hierarchy":1}},
{$group:{"_id":{"grp":{$substr:["$hierarchy",0,3]}},
"manHeir":{$first:"$hierarchy"},
"count":{$sum:1},"manager":{$first:"$id"}}},
{$project:{"manager":1,
"hierarchy":"$manHeir",
"subordinates":{$subtract:["$count",1]},"_id":0}},
{$group:{"_id":null,"teams":{$push:"$$ROOT"}}},
{$project:{"_id":0,"teams":1}}
])

Related

Case Insensitive Search With Regex

I'm trying to implement a case-insensitive search with regex.
Example: /^sanford/i (searching for anything starting with "sanford" disregarding case sensivity.
For case insensitive queries, creating indeces with a custom collation is recommended by the documentation. This works fine as long as no regex is involved.
The problem: searching with a regex (in this case "starts with"), the case-insensitive search does NOT take the index into account.
This is stated in the documentation multiple times and is also reproducable with an explain command.
To sum it up: It works, but without effectively using the index. I'd be glad to get any hints, I can't get rid of the feeling that I'm missing something fundamentally important here.
Inserting the string with toLowerCase and then searching only with lower cased strings is not an option.
I can't use a text index because there can only be one per collection.
Example from the documentation see here, the green info box on the bottom.
#D.SM: The index is used, but it scans all documents.
https://docs.atlas.mongodb.com/schema-suggestions/case-insensitive-regex/
Example document:
{
"name": [{
"family": "Test",
"given": "Name",
}],
}
Index with collation:
{ "name" : "name_family", "key" : { "name.family" : 1 }, "host" : "noneofyourbusiness.com", "accesses" : { "ops" : NumberLong(114), "since" : ISODate("2020-07-30T20:25:59.079Z") }, "shard" : "shA", "spec" : { "v" : 2, "key" : { "name.family" : 1 }, "name" : "name_family", "ns" : "noneofyourbusiness.somethingwithaname", "collation" : { "locale" : "de", "caseLevel" : false, "caseFirst" : "off", "strength" : 1, "numericOrdering" : false, "alternate" : "non-ignorable", "maxVariable" : "punct", "normalization" : false, "backwards" : false, "version" : "57.1" } } }
}

Ordering a term aggregation with a multi-bucket sub-aggregation

Given a term aggregation (label), I would like to sort the bucket by a string field (energy).
The problem is that we cannot use a multi-bucket value in the order clause.
For a given label, I'm sure that there is only one energy. What I would like to do is to use the first (and only) result of my energy sub aggregation.
I'm using the AWS elasticsearch service which is in a 1.5 version, and scripts are disabled, so I did not find a way to sort the bucket by another term :(
Any idea ?
{
"aggs" : {
"label" : {
"terms" : { "field" : "label" },
"order" : { "energy[0]" : "desc" } // cannot do this
},
"aggs" : {
"energy" : {
"terms" : {
"field" : "energy",
"size" : 1
}
}
}
}
}

MongoDB Query For Fields That Vary - Wildcards?

I am looking for a way to get distinct "unit" values from a collection that has a structure similar to the following:
{
"_id" : ObjectId("548b1aee6e444414f00d5cf1"),
"KPI" : {
"NPV" : {
"value" : 100,
"unit" : "kUSD"
},
"NPM" : {
"value" : 100,
"unit" : "kUSD"
},
"GPM" : {
"value" : 50,
"unit" : "CAD"
}
}
}
I looked into using wildcards and regex but from what I have come across this is not supported for field matching. I would like to do something like db.collection.distinct('KPI.*.unit') but cannot determine how and it seems like performance would be poor. Does anyone have a recommendation? Thanks.
It's not a good practice to make the keys a part of the content of the document - don't use keys as data. If you don't change your document structure, you'll need to know what the possible subfields of KPI are. If you don't know what those could be, you will need to examine the documents manually to find them. Then you can issue a distinct for each using dot notation, e.g. db.collection.distinct("KPI.NPM.unit").
If what you're looking for instead is the distinct values of unit across all values of the parent KPI subfield, then you could take the union of all of the results of the distincts. You can also do it easily with an aggregation framework in MongoDB 2.6. For simplicity, I'll assume there's just three distinct subfields of KPI, the ones in the document above.
db.collection.aggregate([
{ "$group" : { "_id" : 0, "NPVunits" : { "$addToSet" : "$KPI.NPV.unit" }, "NPMunits" : { "$addToSet" : "$KPI.NPM.unit" }, "GPMunits" : { "$addToSet" : "$KPI.GPM.unit" } }
{ "$project" : { "distinct_units" : { "$setUnion" : ["$NPVunits", "$NPMunits", "$GPMunits"] } } }
])
You could also structure your data as dynamic attributes. The document above would be recast as something like
{
"_id" : ObjectId("548b1aee6e444414f00d5cf1"),
"KPI" : [
{ "type" : "NPV", "value" : 100, "unit" : "kUSD" },
{ "type" : "NPM", "value" : 100, "unit" : "kUSD" },
{ "type" : "GPM", "value" : 50, "unit" : "CAD" }
]
}
Querying for distinct units is easy now, whether you want it per type or over all types:
Per type (all types in one query)
db.collection.aggregate([
{ "$unwind" : "$KPI" },
{ "$group" : { "_id" : "$KPI.type", "units" : { "$addToSet" : "$KPI.unit" } } }
])
Over all types
db.collection.distinct("KPI.unit")

Can I do a MongoDB "starts with" query on an indexed subdocument field?

I'm trying to find documents where a field starts with a value.
Table scans are disabled using notablescan.
This works:
db.articles.find({"url" : { $regex : /^http/ }})
This doesn't:
db.articles.find({"source.homeUrl" : { $regex : /^http/ }})
I get the error:
error: { "$err" : "table scans not allowed:moreover.articles", "code" : 10111 }
There are indexes on both url and source.homeUrl:
{
"v" : 1,
"key" : {
"url" : 1
},
"ns" : "mydb.articles",
"name" : "url_1"
}
{
"v" : 1,
"key" : {
"source.homeUrl" : 1
},
"ns" : "mydb.articles",
"name" : "source.homeUrl_1",
"background" : true
}
Are there any limitations with regex queries on subdocument indexes?
When you disable table scans, it means that any query where a table scan "wins" in the query optimizer will fail to run. You haven't posted an explain but it's reasonable to assume that's what is happening here based on the error. Try hinting the index explicitly:
db.articles.find({"source.homeUrl" : { $regex : /^http/ }}).hint({"source.homeUrl" : 1})
That should eliminate the table scan as a possible choice and allow the query to return successfully.

combining regex and embedded objects in mongodb queries

I am trying to combine regex and embedded object queries and failing miserably. I am either hitting a limitation of mongodb or just getting something slightly wrong maybe someone out ther has encountered this. The documentation certainly does'nt cover this case.
data being queried:
{
"_id" : ObjectId("4f94fe633004c1ef4d892314"),
"productname" : "lightbulb",
"availability" : [
{
"country" : "USA",
"storeCode" : "abc-1234"
},
{
"country" : "USA",
"storeCode" : "xzy-6784"
},
{
"country" : "USA",
"storeCode" : "abc-3454"
},
{
"country" : "CANADA",
"storeCode" : "abc-6845"
}
]
}
assume the collection contains only one record
This query returns 1:
db.testCol.find({"availability":{"country" : "USA","storeCode":"xzy-6784"}}).count();
This query returns 1:
db.testCol.find({"availability.storeCode":/.*/}).count();
But, this query returns 0:
db.testCol.find({"availability":{"country" : "USA","storeCode":/.*/}}).count();
Does anyone understand why? Is this a bug?
thanks
You are referencing the embedded storecode incorrectly - you are referencing it as an embedded object when in fact what you have is an array of objects. Compare these results:
db.testCol.find({"availability.0.storeCode":/x/});
db.testCol.find({"availability.0.storeCode":/a/});
Using your sample doc above, the first one will not return, because the first storeCode does not have an x in it ("abc-1234"), the second will return the document. That's fine for the case where you are looking at a single element of the array and pass in the position. In order to search all of the objcts in the array, you want $elemMatch
As an example, I added this second example doc:
{
"_id" : ObjectId("4f94fe633004c1ef4d892315"),
"productname" : "hammer",
"availability" : [
{
"country" : "USA",
"storeCode" : "abc-1234"
},
]
}
Now, have a look at the results of these queries:
PRIMARY> db.testCol.find({"availability" : {$elemMatch : {"storeCode":/a/}}}).count();
2
PRIMARY> db.testCol.find({"availability" : {$elemMatch : {"storeCode":/x/}}}).count();
1