MongoDB MapReduce update in place how to

Basically, I'm trying to order objects by their score over the last hour, so I need to generate an hourly vote sum for each object in my database. Votes are embedded in each object. The object schema looks like this:
{
    _id: ObjectId
    score: int
    hourly-score: int                                     <- need to update this value so I can order by it
    recently-voted: boolean
    votes: {
        "4e4634821dff6f103c040000": {                     <- Key is __toString of voter ObjectId
            "_id": ObjectId("4e4634821dff6f103c040000"),  <- Voter ObjectId
            "a": 1,                                       <- Vote amount
            "ca": ISODate("2011-08-16T00:01:34.975Z"),    <- Created at MongoDate
            "ts": 1313452894                              <- Created at timestamp
        },
        ... repeat ...
    }
}
This question is actually related to a question I asked a couple of days ago: Best way to model a voting system in MongoDB.
How would I (can I?) run a MapReduce command to do the following:
Only run on objects with recently-voted = true OR hourly-score > 0.
Calculate the sum of the votes created in the last hour.
Update hourly-score = the sum calculated above, and recently-voted = false.
I also read here that I can perform a MapReduce on the slave DB by running db.getMongo().setSlaveOk() before the M/R command. Could I run the reduce on a slave and update the master DB?
Are in-place updates even possible with Mongo MapReduce?

You can definitely do this. I'll address your questions one at a time:
1.
You can specify a query along with your map-reduce that filters the set of objects passed into the map phase. In the mongo shell, this would look like (assuming m and r are the names of your mapper and reducer functions, respectively):
> db.coll.mapReduce(m, r, {query: {$or: [{"recently-voted": true}, {"hourly-score": {$gt: 0}}]}})
2.
Step #1 will let you run your mapper on all documents with at least one vote in the last hour (or with recently-voted set to true), but not all of their votes will have been cast in the last hour. So you'll need to filter the list in your mapper, and only emit those votes you wish to count:
function m() {
    // "ts" is a Unix timestamp in seconds, so build the cutoff in the same
    // unit (getTime() returns milliseconds)
    var hour_ago = Math.floor(new Date().getTime() / 1000) - 3600;
    // "votes" is an object keyed by voter id, not an array, so iterate its keys
    for (var voter in this.votes) {
        var vote = this.votes[voter];
        if (vote.ts > hour_ago) {
            emit(/* your key */, vote.a);
        }
    }
}
And to reduce:
function r(key, values) {
    var sum = 0;
    values.forEach(function (value) { sum += value; });
    return sum;
}
3.
To update the hourly scores, you can use map-reduce's reduce output mode (out: {reduce: ...}), which will call your reducer with both the newly emitted values and the value previously saved in the output collection (if any). The result of that pass is then saved back into the output collection. This looks like:
> db.coll.mapReduce(m, r, {query: ..., out: {reduce: "output_coll"}})
In addition to re-reducing output, you can use merge, which overwrites documents in the output collection with newly created ones (but leaves behind any documents whose _id differs from the _ids created by your m-r job); replace, which is effectively a drop-and-create operation and is the default; or {inline: 1}, which returns the results directly to the shell or to your driver. Note that when using {inline: 1}, your results must fit in the size allowed for a single document (16MB in recent MongoDB releases).
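Note that out: {reduce: ...} updates the separate output collection rather than your original documents. A minimal follow-up sketch (assuming you emitted each object's _id as the key and used "output_coll" as the output collection) to copy each sum back and reset the flag:
db.output_coll.find().forEach(function (doc) {
    // doc._id is the emitted key; doc.value is the reduced hourly sum
    db.coll.update(
        { _id: doc._id },
        { $set: { "hourly-score": doc.value, "recently-voted": false } }
    );
});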
4.
You can run map-reduce jobs on secondaries ("slaves"), but since secondaries cannot accept writes (that's what makes them secondary), you can only do this when using inline output.
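For example, a minimal sketch of running the job against a secondary with inline output (so nothing is written on the secondary):
> db.getMongo().setSlaveOk()
> db.coll.mapReduce(m, r, {query: {"recently-voted": true}, out: {inline: 1}})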

Related

REST Api pagination Loop... Power Query M language

I am wondering if anyone can help me with API pagination. I am trying to get all records from an external API, but it restricts me to a maximum of 10 records per request. There are around 40k records.
The API also does not return the number of pages (response below), hence I can't get my head around a solution.
There is NO "skip", "count", or "top" parameter supported either. I am stuck, and I don't know how to write a loop in M language that runs until all records are fetched. Can someone help me write the code or show what it could look like?
Below is my code.
let
    Source = Json.Document(
        Web.Contents(
            "https://api.somedummy.com/api/v2/Account",
            [
                RelativePath = "Search",
                Headers =
                    [
                        ApiKey = "XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXx",
                        Authorization = "XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX",
                        #"Content-Type" = "application/json"
                    ],
                Content =
                    Json.FromValue(
                        [key = "status", operator = "EqualTo", value = "Active", resultType = "Full"]
                    )
            ]
        )
    )
in
    Source
And below is the output:
"data": {
"totalCount": 6705,
"page": 1,
"pageSize": 10,
"list":[
This might help you along your way. While I was looking into something similar for working with Jira, I found some helpful info from two individuals on the Atlassian Community site. Below is what I think is the relevant snippet from a query I developed with the assistance of their posts (to be clear, this snippet is their code, which I used in my query). While I'm providing more of the query below (that segment is also comprised of their code), I think the key part that relates to your particular issue is this:
yourJiraInstance = "https://site.atlassian.net/rest/api/2/search",
Source = Json.Document(Web.Contents(yourJiraInstance, [Query=[maxResults="100",startAt="0"]])),
totalIssuesCount = Source[total],
// Now it is time to build a list of startAt values, starting on 0, incrementing 100 per item
startAtList = List.Generate(()=>0, each _ < totalIssuesCount, each _ +100),
urlList = List.Transform(startAtList, each Json.Document(Web.Contents(yourJiraInstance, [Query=[maxResults="100",startAt=Text.From(_)]]))),
// ===== Consolidate records into a single list ======
// so we have all the records in data, but it is in a bunch of lists each 100 records
// long. The issues will be more useful to us if they're consolidated into one long list
I'm thinking that maybe you could try substituting pageSize for maxResults and totalCount for totalIssuesCount. I don't know about startAt; there must be something similar available to you. Who knows? It could actually be startAt. I believe your pageSize would be 10, and you would increment your startAt by 10 instead of 100.
This is from Nick's and Tiago's posts on that thread. I think the only real difference may be that I buffered a table. (It's been a while, and I did not dig back into their thread to compare it for this answer.)
let
// I must credit the first part of this code -- the part between the ********** lines -- as being from Nick Cerneaz (and Tiago Machado) from their posts on this thread:
// https://community.atlassian.com/t5/Marketplace-Apps-Integrations/All-data-not-displayed-in-Power-BI-from-Jira/qaq-p/723117.
// **********
yourJiraInstance = "https://site.atlassian.net/rest/api/2/search",
Source = Json.Document(Web.Contents(yourJiraInstance, [Query=[maxResults="100",startAt="0"]])),
totalIssuesCount = Source[total],
// Now it is time to build a list of startAt values, starting on 0, incrementing 100 per item
startAtList = List.Generate(()=>0, each _ < totalIssuesCount, each _ +100),
urlList = List.Transform(startAtList, each Json.Document(Web.Contents(yourJiraInstance, [Query=[maxResults="100",startAt=Text.From(_)]]))),
// ===== Consolidate records into a single list ======
// so we have all the records in data, but it is in a bunch of lists each 100 records
// long. The issues will be more useful to us if they're consolidated into one long list
//
// In essence we need to extract the separate lists of issues in each data{i}[issues] for 0<=i<#"total"
// and concatenate those into a single list of issues, which we can then analyse
//
// to figure this out I found this post particularly helpful (thanks Vitaly!):
// https://potyarkin.ml/posts/2017/loops-in-power-query-m-language/
//
// so first create a single list that has as its members each sub-list of the issues,
// 100 in each except for the last one that will have just the residual list.
// So iLL is a List of Lists (of issues):
iLL = List.Generate(
() => [i=-1, iL={} ],
each [i] < List.Count(urlList),
each [
i = [i]+1,
iL = urlList{i}[issues]
],
each [iL]
),
// and finally, collapse that list of lists into just a single list (of issues)
issues = List.Combine(iLL),
// Convert the list of issues records into a table
#"Converted to table" = Table.Buffer(Table.FromList(issues, Splitter.SplitByNothing(), null, null, ExtraValues.Error)),
// **********
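Adapting that pattern to your API could look something like the sketch below. This is untested: the page query parameter is an assumption based on the "page" field in your sample response, and your ApiKey/Authorization headers and JSON body from above would need to be added back into each Web.Contents call.
let
    baseUrl = "https://api.somedummy.com/api/v2/Account",
    // Fetch the first page to learn the total record count and the page size
    firstPage = Json.Document(Web.Contents(baseUrl, [RelativePath = "Search", Query = [page = "1"]]))[data],
    pageCount = Number.RoundUp(firstPage[totalCount] / firstPage[pageSize]),
    // Build the list of page numbers 1..pageCount, then fetch each page's record list
    pageNumbers = List.Generate(() => 1, each _ <= pageCount, each _ + 1),
    pages = List.Transform(
        pageNumbers,
        each Json.Document(Web.Contents(baseUrl, [RelativePath = "Search", Query = [page = Text.From(_)]]))[data][list]
    ),
    // Collapse the list of per-page lists into one list of records
    allRecords = List.Combine(pages)
in
    allRecords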

Kinda newbie: PowerApps to populate a list with constructed data fields

Updated...
Trying to inject a series of rows into a SharePoint list via PowerApps, but running up against the fact that PowerApps seems to have only FORALL as a looping function, and that does not support SET.
Set(AlertString, "");   // to be used later
Set(REQ_Value, "");
Set(RITM_Value, "");
Set(Asset_Value, "");
Set(CustomerSignatureFileLocation_Value, "File location: ");
Set(LoanerKitCode_Value, "");
Set(IncidentCode_Value, "");
Set(TransferOrderCode_Value, "");
Set(TransactionType_Value, Workflow.SelectedText.Value & " - " & Workflow_Steps.SelectedText.Value);
Set(ScanItemCodeType, "");
Set(ErrorString, "");
Collect(ScanDataCollection, Split(ScanData.Text, Char(10)));   // Split the data into the ScanDataCollection collection
ForAll(
    ScanDataCollection,
    If(Left(Result, 4) = "RITM", Set(RITM_Value, Result);   // FAIL HERE
    Collect('Spider - Master Transaction List', {
        REQ: REQ_Value,
        RITM: RITM_Value,
        Scan_Code: Result,
        Asset: Asset_Value,
        Transaction_Type: TransactionType_Value,
        Timestamp: Now(),
        Agent_Name: User().FullName,
        Agent_Email: User().Email,
        Agent_Location: DD_Location.SelectedText.Value,
        Agent_Notes: "It was weird, man.",
        Customer_Name: Cust_Name.Text,
        Customer_Email: Cust_NTAccount.Text,
        Customer_Signature: CustomerSignatureFileLocation_Value,
        Task_Name: "",
        Task_Action: "",
        State_Name: "",
        State_Action: "",
        Stage_Name: "",
        Stage_Action: "",
        Work_Note_String: "",
        Customer_Note_String: "",
        Loaner_Kit_Code: LoanerKitCode_Value,
        Incident: IncidentCode_Value,
        Transfer_Order_Code: TransferOrderCode_Value,
        Item_Description: ""
    });
);
My scanner tool will pick up a variety of different kinds of item scans, all in the same scan. Depending on what type of data it is, it populates different columns in Spider - Master Transaction List.
But we are forbidden to use the SET function inside a FORALL.
How would you recommend I approach this -- considering that each piece of data from the SPLIT could be any of the sorts of codes (such as RITM Code, REQ Code, Transfer Order Code, etc.)?
You can do what you want in various ways, using a collection or a gallery; in PowerApps, galleries can be used like collections. I suggest:
ForAll(
    Gallery.AllItems,
    Patch(
        'SharepointListName',
        ThisRecord
    )
);
Fields in the gallery must have the same names as the SharePoint list columns, or you have to create a record to assign the names:
{sharepointColumnName: ThisRecord.ColumnName, ...}
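For example, a sketch of that mapping using the list from the question (the gallery column names ScanCode and RITMValue are placeholders for whatever your gallery actually contains):
ForAll(
    Gallery.AllItems,
    Patch(
        'Spider - Master Transaction List',
        Defaults('Spider - Master Transaction List'),   // create a new row per gallery item
        {
            Scan_Code: ThisRecord.ScanCode,
            RITM: ThisRecord.RITMValue,
            Timestamp: Now()
        }
    )
);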

CouchDB View - filter keys before grouping

I have a CouchDB database which has documents with the following format:
{ createdBy: 'userId', at: 123456, type: 'action_type' }
I want to write a view that will give me how many actions of each type were created by which user. I was able to do that by creating a view that does this:
emit([doc.createdBy, doc.type, doc.at], 1);
With the reduce function "sum" and consuming the view in this way:
/_design/userActionsDoc/_view/userActions?group_level=2
This returns a result with rows just the way I want:
"rows":[ {"key":["userId","ACTION_1"],"value":20}, ...
The problem is that now I want to filter the results for a given time period, i.e. get the exact same information but only considering actions which happened within that period.
I can filter the documents by "at" if I emit the fields in a different order:
emit([doc.at, doc.type, doc.createdBy], 1);
?group_level=3&startkey=[149328316160]&endkey=[1493283161647,{},{}]
But then I won't get the results grouped by userId and actionType. Is there a way to have both? Maybe by writing my own reduce function?
I feel your pain. I have done two different things in the past to attempt to solve similar issues.
The first pattern is a pain and may work great or may not work at all. I've experienced both. Your map function looks something like this:
function(doc) {
    var obj = {};
    obj[doc.createdBy] = {};
    obj[doc.createdBy][doc.type] = 1;
    emit(doc.at, obj);
    // Ignore this for now
    // emit(doc.at, JSON.stringify(obj));
}
Then your reduce function looks like this:
function(key, values, rereduce) {
    var output = {};
    values.forEach(function(v) {
        // Ignore this for now
        // v = JSON.parse(v);
        for (var user in v) {
            // Create the per-user object on first sight, or the next line throws
            output[user] = output[user] || {};
            for (var action in v[user]) {
                output[user][action] = (output[user][action] || 0) + v[user][action];
            }
        }
    });
    return output;
    // Ignore this for now
    // return JSON.stringify(output);
}
With large datasets, this usually results in a couch error stating that your reduce function is not shrinking fast enough. In that case, you may be able to stringify/parse the objects as shown in the "ignore" comments in the code.
The reasoning behind this is that CouchDB ultimately wants a reduce function to output a simple value like a string or an integer. In my experience, it doesn't seem to matter that the string gets longer, as long as it remains a string. If you output an object, at some point the function errors because you have added too many props to that object.
The second pattern is potentially better, but requires that your time periods be "defined" ahead of time. If your time period requirements can be locked down to a specific year, month, day, quarter, etc., you just emit multiple times in your map function. Below I assume the at property is epoch milliseconds, or at least something the Date constructor can accurately parse.
function(doc) {
    var time_key;
    var my_date = new Date(doc.at);
    //// Used for filtering results in a given year
    //// e.g. startkey=["2017"]&endkey=["2017",{}]
    time_key = my_date.toISOString().substr(0, 4);
    emit([time_key, doc.createdBy, doc.type], 1);
    //// Used for filtering results in a given month
    //// e.g. startkey=["2017-01"]&endkey=["2017-01",{}]
    time_key = my_date.toISOString().substr(0, 7);
    emit([time_key, doc.createdBy, doc.type], 1);
    //// Used for filtering results in a given quarter
    //// e.g. startkey=["2017Q1"]&endkey=["2017Q1",{}]
    //// (add 1 so January-March comes out as Q1 rather than Q0; getUTCMonth
    //// keeps this consistent with the UTC-based toISOString above)
    time_key = my_date.toISOString().substr(0, 4) + 'Q' + (Math.floor(my_date.getUTCMonth() / 3) + 1).toString();
    emit([time_key, doc.createdBy, doc.type], 1);
}
Then, your reduce function is the same as in your original. Essentially you're just trying to define a constant value for the first item in your key that corresponds to a defined time period. Works well for business reporting, but not so much for allowing for flexible time periods.
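As a usage sketch (the view name actionsByPeriod is a placeholder; the design document name is taken from the question), the quarter grouping above would be queried like the original view, with the period as the leading key element:
/_design/userActionsDoc/_view/actionsByPeriod?group_level=3&startkey=["2017Q1"]&endkey=["2017Q1",{}]
yielding rows such as:
{"key":["2017Q1","userId","ACTION_1"],"value":20}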

Using a CouchDB view, can I count groups and filter by key range at the same time?

I'm using CouchDB. I'd like to be able to count occurrences of values of specific fields within a date range that can be specified at query time. I seem to be able to do parts of this, but I'm having trouble understanding the best way to pull it all together.
Assuming documents that have a timestamp field and another field, e.g.:
{ date: '20120101-1853', author: 'bart' }
{ date: '20120102-1850', author: 'homer'}
{ date: '20120103-2359', author: 'homer'}
{ date: '20120104-1200', author: 'lisa'}
{ date: '20120815-1250', author: 'lisa'}
I can easily create a view that filters documents by a flexible date range. This can be done with a view like the one below, called with key range parameters, e.g. _view/all-docs?startkey="20120101-0000"&endkey="20120201-0000".
all-docs/map.js:
function(doc) {
    emit(doc.date, doc);
}
With the data above, this would return a CouchDB view containing just the first 4 docs (the only docs in the date range).
I can also create a query that counts occurrences of a given field, like this, called with grouping, i.e. _view/author-count?group=true:
author-count/map.js:
function(doc) {
    emit(doc.author, 1);
}
author-count/reduce.js:
function(keys, values, rereduce) {
    return sum(values);
}
This would yield something like:
{
    "rows": [
        {"key":"bart","value":1},
        {"key":"homer","value":2},
        {"key":"lisa","value":2}
    ]
}
However, I can't find the best way to both filter by date and count occurrences. For example, with the data above, I'd like to be able to specify range parameters like startkey="20120101-0000"&endkey="20120201-0000" and get a result like this, where the last doc is excluded from the count because it is outside the specified date range:
{
    "rows": [
        {"key":"bart","value":1},
        {"key":"homer","value":2},
        {"key":"lisa","value":1}
    ]
}
What's the most elegant way to do this? Is this achievable with a single query? Should I be using another CouchDB construct, or is a view sufficient for this?
You can get pretty close to the desired result with a list:
{
    _id: "_design/authors",
    views: {
        authors_by_date: {
            map: function(doc) {
                emit(doc.date, doc.author);
            }
        }
    },
    lists: {
        count_occurrences: function(head, req) {
            start({ headers: { "Content-Type": "application/json" } });
            var result = {};
            var row;
            while ((row = getRow())) {
                var val = row.value;
                if (result[val]) result[val]++;
                else result[val] = 1;
            }
            // A list function must return a string, so serialize the tally
            return JSON.stringify(result);
        }
    }
}
This design can be requested as such:
http://<couchurl>/<db>/_design/authors/_list/count_occurrences/authors_by_date?startkey=<startDate>&endkey=<endDate>
This will be slower than a normal map-reduce, and is a bit of a workaround. Unfortunately, this is the only way to do a multi-dimensional query, "which CouchDB isn’t suited for".
The result of requesting this design will be something like this:
{
"bart": 1,
"homer": 2,
"lisa": 2
}
What we do is basically emit a lot of elements, then use a list to group them as we want. A list can be used to display a result in any way you want, but it will also often be slower. Whereas a normal map-reduce can be cached and only changes according to the diffs, the list has to be built anew every time it is requested. It is pretty much as slow as getting all the elements resulting from the map (the overhead of orchestrating the data is mostly negligible): a lot slower than getting the result of a reduce.
If you want to use the list for a different view, you can simply exchange it in the URL you request:
http://<couchurl>/<db>/_design/authors/_list/count_occurrences/<view>
Read more about lists on the couchdb wiki.
You need to create a combined view:
combined/map.js:
function(doc) {
    emit([doc.date, doc.author], 1);
}
combined/reduce.js:
_sum
This way you will be able to filter documents by start/end date (the dates are strings, so they must be quoted in the query, and {} serves as a high-end sentinel):
startkey=["20120101-0000"]&endkey=["20120201-0000",{}]
Although your problem is hard to solve in the general case, knowing some more restrictions on the possible queries can help a lot. E.g. if you know you will search on ranges that cover full days/months, you can use arrays of [year, month, day, time] instead of the string:
emit([doc.date_year, doc.date_month, doc.date_day, doc.date_time, doc.author], 1);
Even if you cannot predict that all possible queries will fit into grouping based on this key type, splitting the key may help you optimize your range queries and decrease the number of lookups needed (at the cost of some extra space).
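For instance, assuming the array-key variant above is saved as the combined view, a range query covering all of January 2012 would look like this; array keys sort element by element, so every key beginning [2012, 1, ...] falls inside the range:
_view/combined?startkey=[2012,1]&endkey=[2012,1,{}]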

Getting odd behavior from $query->setMaxResults()

When I call setMaxResults on a query, it seems to treat the max number as "2", no matter what its actual value is.
function findMostRecentByOwnerUser(\Entities\User $user, $limit)
{
    echo "2: $limit<br>";
    $query = $this->getEntityManager()->createQuery('
        SELECT t
        FROM Entities\Thread t
        JOIN t.messages m
        JOIN t.group g
        WHERE
            g.ownerUser = :owner_user
        ORDER BY m.timestamp DESC
    ');
    $query->setParameter("owner_user", $user);
    $query->setMaxResults(4);
    echo $query->getSQL()."<br>";
    $results = $query->getResult();
    echo "3: ".count($results);
    return $results;
}
When I comment out the setMaxResults line, I get 6 results. When I leave it in, I get the 2 most recent results. When I run the generated SQL code in phpMyAdmin, I get the 4 most recent results. The generated SQL, for reference, is:
SELECT <lots of columns, all from t0_>
FROM Thread t0_
INNER JOIN Message m1_ ON t0_.id = m1_.thread_id
INNER JOIN Groups g2_ ON t0_.group_id = g2_.id
WHERE g2_.ownerUser_id = ?
ORDER BY m1_.timestamp DESC
LIMIT 4
Edit:
While reading the DQL "Limit" documentation, I came across the following:
If your query contains a fetch-joined collection specifying the result limit methods are not working as you would expect. Set Max Results restricts the number of database result rows, however in the case of fetch-joined collections one root entity might appear in many rows, effectively hydrating less than the specified number of results.
I'm pretty sure that I'm not doing a fetch-joined collection. I'm under the impression that a fetch-joined collection is where I do something like SELECT t, m FROM Entities\Thread t JOIN t.messages m. Am I incorrect in my understanding of this?
An update: with Doctrine 2.2+ you can use the Paginator: http://docs.doctrine-project.org/en/latest/tutorials/pagination.html
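A minimal sketch of the Paginator applied to the query from the question (the limit then counts distinct root entities rather than raw joined rows):
use Doctrine\ORM\Tools\Pagination\Paginator;

$query->setMaxResults(4);
// true: tell the Paginator to handle fetch-joined collections when limiting
$paginator = new Paginator($query, true);
foreach ($paginator as $thread) {
    // each $thread is a distinct Entities\Thread
}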
Using ->groupBy('your_entity.id') seems to solve the issue!
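Applied to the DQL from the question, that would be a sketch like the following; note that ordering by a non-grouped column is tolerated by MySQL, but stricter platforms may require an aggregate such as MAX(m.timestamp):
SELECT t
FROM Entities\Thread t
JOIN t.messages m
JOIN t.group g
WHERE g.ownerUser = :owner_user
GROUP BY t.id
ORDER BY m.timestamp DESC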
I solved the same issue by fetching only the contents of the master table, and having all joined tables fetched with fetch="EAGER", which is defined in the entity (described here: http://www.doctrine-project.org/docs/orm/2.1/en/reference/annotations-reference.html?highlight=eager#manytoone).
class VehicleRepository extends EntityRepository
{
    /**
     * @var integer
     */
    protected $pageSize = 10;

    public function page($number = 1)
    {
        // Limit to one page of results and offset by whole pages
        return $this->_em->createQuery('SELECT v FROM Entities\VehicleManagement\Vehicles v')
            ->setMaxResults($this->pageSize)
            ->setFirstResult(($number - 1) * $this->pageSize)
            ->getResult();
    }
}
In my example repo you can see I only fetched the vehicle table to get the correct result count, but all properties (like make, model, category) are fetched immediately. (I also iterated over the entity contents because I needed the entity represented as an array, but that shouldn't matter, afaik.)
Here's an excerpt from my entity:
class Vehicles
{
    ...
    /**
     * @ManyToOne(targetEntity="Makes", fetch="EAGER")
     * @var Makes
     */
    public $make;
    ...
}
It's important that you map every entity correctly; otherwise it won't work.