Emulating an INNER JOIN with MapReduce is relatively trivial: map on the common keys, join the values in the reducer, and the job gets done. But with a LEFT OUTER JOIN one faces the problem of filling in empty values for the Right table when there are keys in the Left table that are not present in the Right table. The non-matching keys would be discarded on the way into the reducer, so how can one then add these non-matching keys from the Left table?
For example, let's assume we have two files:
Left = {'matches': 1}
Right = {'matches': 2,
         'matches_not': 3}
One would want an output like:
Output: {'matches-matches': [1, 2],
         'matches-matches_not': [1, None]}
Emitting the common key 'matches' from the mapper poses no problem, as both occurrences, from Left and from Right, will reach the reducer under the common key; but how can one get the combination for 'matches_not' if it never makes it to the reducer?
The reducer will get every record that was mapped, even when the matching record from the other side is absent. You just have to add an indicator to each key/value pair showing which side it came from. That indicator lets your reduce() method determine the exact case you're dealing with for each key.
For more details on how to set that up, see my answer to a different question about join.
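Here is a minimal sketch of that idea, in plain Python standing in for the actual MapReduce framework (the names mapped, left_records and right_records are illustrative, not from the original post, and it sketches the tagging mechanism rather than the exact output format above):

from collections import defaultdict

left_records = {'matches': 1}
right_records = {'matches': 2, 'matches_not': 3}

# "Map" phase: tag every value with the side it came from.
mapped = defaultdict(list)
for key, value in left_records.items():
    mapped[key].append(('L', value))
for key, value in right_records.items():
    mapped[key].append(('R', value))

# "Reduce" phase: the tags reveal which case each key is in, so a key
# that arrived from only one side can still be emitted, padded with None.
for key, tagged in mapped.items():
    left_vals = [v for side, v in tagged if side == 'L'] or [None]
    right_vals = [v for side, v in tagged if side == 'R'] or [None]
    for lv in left_vals:
        for rv in right_vals:
            print(key, [lv, rv])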
I'm working in the Ab Initio GDE and have a specific set of IDs in two different columns that, if put together, should yield the levels of a tree structure.
The two columns are as follows:
[image: the SOURCE_ID and TARGET_ID columns]
Both IDs describe operators (each with a specific ID#). The TARGET_ID is the root, while the SOURCE_ID is the follow-up operator one level below the previous one.
So the tree structure looks like this:
[image: the resulting tree structure]
I can't seem to find a way in the GDE to reference the previously read value in these columns when transforming the data.
This data is sorted by a timestamp key, because for each timestamp there is one tree structure.
After the Sort component I added a Levels column filled with NULLs to get the goal format.
My data now consists of the Time column, the two ID columns and the Levels column.
I have tried partitioning the data by SOURCE_ID value, so that all entries in a given partition have to be the next iterated level.
I can't seem to get the code in "Text View" right, with the temp_level that I iterate upwards and the function that loads the previous or next entry's value (I read it was the metacomp function, but I can't seem to find an example implementation/code).
Can someone help me solve this? Maybe I'm overcomplicating it, and a different combination of components can do the trick as well.
Thanks in advance, everyone!
I would like to be able to partition an Arrow table by the values of one of its columns (assuming the set of n values occurring in that column is known). The straightforward way is a for-loop: for each of these values, scan the whole table and build a new table of matching rows. Are there ways to do this in one pass instead of n passes?
I initially thought that Arrow's support for group-by scans would be the solution -- but Arrow (in contrast to Pandas) does not support extracting groups after a group-by scan.
Am I just thinking about this wrong and there is another way to partition a table in one pass?
For the group-by support, there is a "hash_list" function that returns all the values in each group. Is that what you're looking for? You could then slice the resulting values after the fact to extract the individual groups.
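A minimal sketch of that approach, assuming PyArrow's group_by API (available in pyarrow >= 7.0, where the "list" aggregation is backed by the hash_list kernel; the table and column names here are made up):

import pyarrow as pa

table = pa.table({'part': ['a', 'b', 'a', 'b', 'a'],
                  'value': [1, 2, 3, 4, 5]})

# One pass over the table: collect each group's values into a list.
grouped = table.group_by('part').aggregate([('value', 'list')])

# Slice the result after the fact to extract the individual groups.
for i in range(grouped.num_rows):
    key = grouped['part'][i].as_py()
    values = grouped['value_list'][i].as_py()  # this group's values as a list
    print(key, values)

Note this buffers every group's values in memory, so it trades the n scans for one scan plus the memory to hold the regrouped data.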
I have a DynamoDB table that, for the sake of this question, looks like this:
id (String partition key)
origin (String sort key)
I want to query the table for a subset of origins under a specific id.
From my understanding, the only operator DynamoDB allows on sort keys in a Query are 'between', 'begins_with', '=', '<=' and '>='.
The problem is that my query needs a form of 'CONTAINS', because the list of origins is not necessarily an ordered range (so the 'between' operator won't work).
If this was SQL it would be something like:
SELECT * from Table where id={id} AND origin IN {origin_list}
My exact question is: what do I need to do to achieve this functionality in the most efficient way? Should I change my table structure? Maybe add a GSI? I'm open to suggestions.
I am aware that this can be achieved with a Scan operation but I want to have an efficient query. Same goes for BatchGetItem, I would rather avoid that functionality unless absolutely necessary.
Thanks
This is a case for using filter expressions with Query. Filter expressions support the IN comparison operator:
Comparison Operator
a IN (b, c, d) — true if a is equal to any value in the list — for example, any of b, c or d. The list can contain up to 100 values, separated by commas.
However, you cannot use filter expressions on key attributes.
Filter Expressions for Query
A filter expression cannot contain partition key or sort key attributes. You need to specify those attributes in the key condition expression, not the filter expression.
So, what you could do is not use origin as the sort key (or duplicate it into another, non-key attribute) and filter on it after the query. Of course, the filter first reads all the items that have that id and discards the non-matching ones afterwards, which consumes read capacity and is less efficient, but there is no other way to express this query. Depending on your item sizes, query frequency, and the estimated number of returned items, BatchGetItem could be the better choice.
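A hedged boto3 sketch of that workaround, assuming origin is duplicated into a plain non-key attribute (called origin_copy here) and a table named 'MyTable' (both names are made up):

import boto3
from boto3.dynamodb.conditions import Key, Attr

table = boto3.resource('dynamodb').Table('MyTable')  # hypothetical table name
origin_list = ['origin-a', 'origin-b']               # example values

# The key condition narrows the read to this id; the filter expression
# then drops non-matching origins (after their read capacity is consumed).
resp = table.query(
    KeyConditionExpression=Key('id').eq('some-id'),
    FilterExpression=Attr('origin_copy').is_in(origin_list),
)
items = resp['Items']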
I want to write a map-side join and want to include reducer code as well. I have a smaller data set which I will ship via the distributed cache.
Can I write the map-side join with reducer code?
Yes!! Why not? Look, the reducer is meant for aggregating the key/value pairs emitted from the map. So you can always have a reducer in your code whenever you want to aggregate your result (say, a count, an average, or any other numerical summarization) based on criteria you've set in your code or that follow from the problem statement. The map is just for filtering the data and emitting some useful key/value pairs out of a LOT of data. A map-side join is only needed when one of the datasets is small enough to fit in the memory of a commodity machine. By the way, a reduce-side join would serve your purpose too!!
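A minimal Hadoop Streaming sketch of a map-side join that still feeds a reducer (the file name small.tsv, the tab-separated record format, and the per-key count in the reducer are all made up for illustration; the small file would be shipped to every mapper with -files, i.e. the distributed cache):

# mapper.py
import sys

# Load the small, distributed-cache dataset into memory once per mapper.
lookup = {}
with open('small.tsv') as f:
    for line in f:
        k, v = line.rstrip('\n').split('\t')
        lookup[k] = v

# The join happens here, on the map side; only matching keys are emitted.
for line in sys.stdin:
    key, value = line.rstrip('\n').split('\t')
    if key in lookup:
        print('%s\t%s,%s' % (key, value, lookup[key]))

# reducer.py -- aggregates the joined records, e.g. counting rows per key.
import sys

current_key, count = None, 0
for line in sys.stdin:
    key, _ = line.rstrip('\n').split('\t', 1)
    if key != current_key:
        if current_key is not None:
            print('%s\t%d' % (current_key, count))
        current_key, count = key, 0
    count += 1
if current_key is not None:
    print('%s\t%d' % (current_key, count))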
I'm using Ben Firshman's fork of django-MPTT (hat tip to Daniel Roseman for the recommendation).
I've got stuck trying to re-order nodes which share a common parent. I've got a list of primary keys, like this:
ids = [5, 9, 7, 3]
All of these nodes have a parent, say with primary key 1.
At present, these nodes are ordered [5, 3, 9, 7], how can I re-order them to [5, 9, 7, 3]?
I've tried something like this:
last_m = MyModel.objects.get(pk=ids.pop(0))
last_m.move_to(last_m.parent, position='first-child')
for id in ids:
    m = MyModel.objects.get(pk=id)
    m.move_to(last_m, position='right')
    last_m = m  # advance, so each node is placed after the previously placed one
Which I'd expect to do what I want, per the docs on move_to, but it doesn't seem to change anything. Sometimes it seems to move the first item in ids to be the first child of its parent, sometimes it doesn't.
Am I right in my reading of the docs for move_to that calling move_to on a node n with position=right and a target which is a sibling of n will move n to immediately after the target?
It's possible I've screwed up my model's table in trying to figure this out, so maybe the code above is actually right. It's also possible there's a much more elegant way of doing this (perhaps one that doesn't involve O(n) selects and O(n) updates).
Have I misunderstood something?
Bonus question: is there a way of forcing django-MPTT to rebuild lft and rght values for all instances of a given model?
I think this is an artefact of a failure in MPTT that I've mentioned before: when you move nodes around, it correctly updates the instance of the node you're moving, but it doesn't update the instance of the target (although the target does get updated in the database).
The consequence of this is that in your code each m gets moved to the right of last_m, but the values held by last_m still reflect its position before the move, so the next move uses the original lft/rght values instead of the new post-move ones.
The solution is to reload last_m each time:
for id in ids:
    last_m = MyModel.objects.get(pk=last_m.id)  # reload so lft/rght are fresh
    m = MyModel.objects.get(pk=id)
    m.move_to(last_m, position='right')
    last_m = m
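As for the bonus question: later versions of django-mptt expose a manager-level rebuild (I can't say whether Ben Firshman's fork already has it at this point), along the lines of:

MyModel.objects.rebuild()  # recomputes lft/rght for all instances of the model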