Scalding operate on group after groupBy - mapreduce

I am writing a scalding job.
Here's what I want to do: first groupBy a Key. That should give me a (Key, Iterator[Value]) pair for each Key (correct me if I'm wrong here). Then for each Key, I want to apply a function on its associated Iterator[Value].
How should I do this? I am currently using a groupBy followed by a mapGroup. I get an Iterator[Value] but that iterator only has one value for some reason. mapGroup is not getting to operate on multiple values. It's important that I get to see all the values for any given key at the same time. Any ideas?
Thanks!

Related

How can I partition an Arrow Table by value in one pass?

I would like to be able to partition an Arrow table by the values of one of its columns (assuming the set of n values occurring in that column is known). The straightforward way is a for-loop: for each of these values, scan the whole table and build a new table of matching rows. Are there ways to do this in one pass instead of n passes?
I initially thought that Arrow's support for group-by scans would be the solution -- but Arrow (in contrast to Pandas) does not support extracting groups after a group-by scan.
Am I just thinking about this wrong and there is another way to partition a table in one pass?
For the group by support, there is a "hash_list" function that returns all values in the group. Is that what you're looking for? You could then slice the resulting values after-the-fact to extract the individual groups.
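The one-pass idea (visit each row once and append it to its group) can be sketched in plain Python. This is only an illustration of the technique, not Arrow's API: `partition_rows` and the sample rows are hypothetical, and rows are modeled as simple dicts.

```python
from collections import defaultdict


def partition_rows(rows, key_column):
    """Split rows into per-value groups in a single pass over the input."""
    groups = defaultdict(list)
    for row in rows:  # each row is visited exactly once
        groups[row[key_column]].append(row)
    return dict(groups)


# Hypothetical sample data standing in for a table.
rows = [
    {"color": "red", "n": 1},
    {"color": "blue", "n": 2},
    {"color": "red", "n": 3},
]
parts = partition_rows(rows, "color")
```

Each entry of `parts` plays the role of one partition, and the cost is one scan regardless of how many distinct values the column holds.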

Why is an AWS DynamoDB query result a list of items when the partition key is a unique value?

I'm new to AWS DynamoDB and wanted to clarify something.
As I understand it, a query must use the partition key, which is unique among the items. So how can the result of a query be a list? It should be a single row. And why would we even need to add more conditions?
I think I am missing something here, can someone please help me to understand?
I need this because I want to query for a list of applications with a specific status value or within a specific time range, but if I am supposed to provide the appId, what is the point of the condition?
Thank you in advance.
Often your table will have a sort key, which together with your partition key forms a composite primary key. In this scenario the partition key alone is not unique, so a query can return multiple items. To return a single item rather than a list, you can use get_item instead, provided you know the full composite primary key.
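To make the difference concrete, here is a toy in-memory model of a table with a composite primary key. It is not boto3 or the DynamoDB API; `ToyTable`, its methods, and the sample items are hypothetical and only mimic the behavior described above.

```python
class ToyTable:
    """In-memory stand-in for a table with a composite primary key."""

    def __init__(self):
        self._items = {}  # (partition_key, sort_key) -> item

    def put_item(self, pk, sk, item):
        self._items[(pk, sk)] = item

    def query(self, pk, condition=lambda sk: True):
        # Every item sharing the partition key, optionally narrowed by a
        # condition on the sort key, returned in sort-key order.
        matches = [
            (sk, item)
            for (p, sk), item in self._items.items()
            if p == pk and condition(sk)
        ]
        return [item for _, item in sorted(matches, key=lambda m: m[0])]

    def get_item(self, pk, sk):
        # The full composite key identifies at most one item.
        return self._items.get((pk, sk))


table = ToyTable()
table.put_item("app-1", "2024-01-01", {"status": "NEW"})
table.put_item("app-1", "2024-02-01", {"status": "DONE"})
table.put_item("app-2", "2024-01-15", {"status": "NEW"})

all_app1 = table.query("app-1")
recent_app1 = table.query("app-1", condition=lambda sk: sk >= "2024-02-01")
one_item = table.get_item("app-1", "2024-01-01")
```

In this model the extra condition is what narrows the several items under one partition key, which is why a query takes one in addition to the partition key.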

DynamoDB - Update Map Object by appending Another Map Object to it

I want to update a Map Object in DynamoDB by adding another map object to it.
Here they add a single key value pair to a map. I want to add multiple key value pairs to a map in one request. There should be something like map_append similar to list_append. How can I do this? I went through the docs but could not find anything similar.
Thanks in advance!
Just use a single SET clause with multiple comma-separated assignments, following the example of the question you cited. SET map.#number1 = :string1, map.#number2 = :string2, map.#number3 = :string3
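A small helper can generate such an expression from a dict of new entries. This is a sketch under assumptions: `build_map_update` is a hypothetical helper, the attribute name `settings` is made up, and the returned keys are meant to line up with the `UpdateExpression`, `ExpressionAttributeNames`, and `ExpressionAttributeValues` parameters of an update call. It also uses a name placeholder for the map attribute itself, which avoids clashes with reserved words.

```python
def build_map_update(map_attr, entries):
    """Build a SET expression adding several key/value pairs to a map attribute."""
    names = {"#m": map_attr}  # placeholder for the map attribute itself
    values = {}
    clauses = []
    for i, (key, value) in enumerate(entries.items()):
        names[f"#k{i}"] = key      # placeholder for each nested key
        values[f":v{i}"] = value   # placeholder for each new value
        clauses.append(f"#m.#k{i} = :v{i}")
    return {
        "UpdateExpression": "SET " + ", ".join(clauses),
        "ExpressionAttributeNames": names,
        "ExpressionAttributeValues": values,
    }


# Hypothetical attribute name and entries.
params = build_map_update("settings", {"theme": "dark", "lang": "en"})
```

The resulting dict could then be passed along with the item key to an update call, for example as keyword arguments.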

rereduce and group=true in CouchDB

This question was already asked by someone else at https://stackoverflow.com/questions/13338799/does-couchdbs-group-true-prevent-rereduce, but it has no convincing answer.
group=true is the conceptual equivalent of group_level=exact, so CouchDB runs a reduce per unique key in the map row set.
This is how it is explained in the docs.
It sounds like CouchDB would collect all the values for the same key and reduce only once per distinct key.
But another article says:
If the query is on the reduce value of each key (group_by_key = true),
then CouchDB try to locate the boundary of each key. Since this range
is probably not fitting exactly along the B+Tree node, CouchDB need to
figure out the edge of both ends to locate the partially matched leave
B+Tree node and resend its map result (with that key) to the View
Server. This reduce result will then merge with existing rereduce
result to compute the final reduce result of this key.
It sounds like rereduce may happen when group=true.
In my project, there are many documents, but after grouping there are at most 2 values for each distinct key.
Will rereduce happen in this case?
Best Regards
Yes. Rereduce is always a possibility.
If this is a problem, there is a rereduce parameter in the reduce function, which allows you to detect if this is happening.
http://docs.couchdb.org/en/latest/couchapp/ddocs.html#reduce-and-rereduce-functions
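Why the rereduce flag matters can be shown with a small simulation. CouchDB reduce functions are normally written in JavaScript with the signature (keys, values, rereduce); the Python below only imitates that contract, and `count_reduce` and the sample values are hypothetical.

```python
def count_reduce(keys, values, rereduce):
    """Count rows, handling both the first reduce and a later rereduce."""
    if rereduce:
        # values are partial counts produced by earlier reduce calls
        return sum(values)
    # values are raw map outputs
    return len(values)


values = ["a", "b", "c", "d", "e"]

# One reduce call over all values.
single = count_reduce(None, values, False)

# The same values reduced in two chunks, then merged by a rereduce call.
partials = [
    count_reduce(None, values[:2], False),
    count_reduce(None, values[2:], False),
]
combined = count_reduce(None, partials, True)
```

A function that ignored the flag and always returned `len(values)` would report the number of chunks on rereduce rather than the number of rows, so a reduce function has to give the same answer on either path.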

Do I need to sort after a Group By, before a Merge Join?

I need to GroupBy and MergeJoin in PDI (Kettle). Both use the same field as the key.
I could not find anything confirming that the data remains ordered after the GroupBy.
So I need to know which of these would be correct:
SORT> GROUPBY> SORT> MERGEJOIN
or
SORT> GROUPBY> MERGEJOIN
Could someone tell me which is correct, and why?
Thank you very much.
You need to sort BEFORE the Group By and the Merge Join based on the keys you're grouping or joining on. The data on exit will have the same order as before, so if you group and then merge based on the same keys, you don't need the sort between Group by and Merge Join.
If the keys change, however, you do.
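The same behavior can be illustrated outside PDI with Python's `itertools.groupby`, which also expects input sorted on the grouping key. This is only an analogy, not PDI's implementation; the sample rows are made up.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical (key, amount) rows in arbitrary order.
rows = [("b", 2), ("a", 1), ("b", 5), ("a", 4), ("c", 7)]

# Sort once on the key, as required before grouping.
sorted_rows = sorted(rows, key=itemgetter(0))

# Group on the same key and sum the amounts per group.
grouped = [
    (key, sum(amount for _, amount in group))
    for key, group in groupby(sorted_rows, key=itemgetter(0))
]

# Keys leave the grouping step in the order they entered it.
group_keys = [key for key, _ in grouped]
```

Because `group_keys` is still sorted, a merge-style join on the same key could consume `grouped` directly; a second sort would only be needed if the join used a different key.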