I wrote a UDF that implements the Accumulator interface. However, for my UDF to work, the incoming relation needs to be sorted. I'm managing this with a secondary sort:
out = FOREACH (GROUP test BY key) {
    sorted = ORDER test BY sub_key;
    GENERATE MyUDF(sorted);
}
Per the Accumulator docs, my UDF is expecting a series of incremental bags. Is the total order in which my UDF receives tuples maintained? I.e. is each incremental bag internally ordered, and is the sequence in which I see the incremental bags ordered?
Everything seems to be ordered when I test it, but I'd like to be sure since the Pig docs describe bags as "unordered".
Since you have used the ORDER operator to sort the tuples inside the bag, your UDF will definitely receive the tuples of the 'sorted' bag in that order.
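For reference, a minimal sketch of what an Accumulator UDF of this shape might look like (the class name MyUDF comes from the question; the counting logic is only a placeholder, not the asker's actual implementation):

import java.io.IOException;
import org.apache.pig.Accumulator;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.DataBag;
import org.apache.pig.data.Tuple;

public class MyUDF extends EvalFunc<Long> implements Accumulator<Long> {
    private long count = 0;

    @Override
    public void accumulate(Tuple input) throws IOException {
        // Each call receives a tuple whose first field is an incremental bag;
        // with the nested ORDER, tuples arrive here in sub_key order.
        DataBag bag = (DataBag) input.get(0);
        for (Tuple t : bag) {
            count++; // placeholder: process tuples in sorted order here
        }
    }

    @Override
    public Long getValue() {
        return count;
    }

    @Override
    public void cleanup() {
        count = 0;
    }

    @Override
    public Long exec(Tuple input) throws IOException {
        // Fallback for non-accumulative execution
        accumulate(input);
        Long result = getValue();
        cleanup();
        return result;
    }
}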
I need to get rows by key (e.g. where status is "Active"), but sorted on multiple columns.
I'm using pagination, which is why I cannot sort the result after fetching it from DynamoDB. (For context, I'm using the Serverless Framework.)
The expected output is an array of rows sorted (ordered) by multiple columns.
In DynamoDB you get "free" lexicographical sorting on the range keys.
When an item is inserted, its partition is first calculated from the partition key; the item is then inserted into a B-tree that keeps the partition lexicographically sorted at all times. This doesn't give you all of the features of SQL's ORDER BY, which is not supported.
So if your sort keys look something like this:
Status#Active#UserId#0000004
You can do a "begins_with" query with SK = "Status#Active".
This will give you all of the items that are in active status, ordered by the UserId (which has to be zero-padded in order to enforce the lexicographical order).
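For illustration, a sketch of such a begins_with query with the AWS SDK for Java v2 (the table name "users" and the key attribute names "PK"/"SK" are assumptions, not from the question):

import java.util.Map;
import software.amazon.awssdk.services.dynamodb.DynamoDbClient;
import software.amazon.awssdk.services.dynamodb.model.AttributeValue;
import software.amazon.awssdk.services.dynamodb.model.QueryRequest;
import software.amazon.awssdk.services.dynamodb.model.QueryResponse;

public class ActiveUsersQuery {
    public static void main(String[] args) {
        DynamoDbClient ddb = DynamoDbClient.create();

        // Hypothetical table/attribute names: table "users", partition key "PK", sort key "SK"
        QueryRequest request = QueryRequest.builder()
                .tableName("users")
                .keyConditionExpression("PK = :pk AND begins_with(SK, :prefix)")
                .expressionAttributeValues(Map.of(
                        ":pk", AttributeValue.builder().s("USER").build(),
                        ":prefix", AttributeValue.builder().s("Status#Active").build()))
                .build();

        // Items come back ordered lexicographically by SK, i.e. by the zero-padded UserId
        QueryResponse response = ddb.query(request);
        response.items().forEach(item -> System.out.println(item.get("SK").s()));
    }
}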
You can't do that. Sorting can only be done on the SK under the same PK. You could combine multiple columns into one value and query based on it, something like column1-value1#column2-value2.
In that case you'll probably have issues keeping that field updated; DynamoDB Streams could help with that. You can trigger an event on any modification and asynchronously update the sorting field.
I'm trying to come up with an algorithm to do the following in a map reduce. I receive a bunch of objects and the user ids of the owner. In other words, I receive a bunch of pairs:
(object, uid)
I want to end up with a list of pairs (object,count), where count refers to the number of times the object occurs in the list. The caveat is that we would need to filter everything as follows:
(1) We should only include (object, count) pairs such that the object appears under at least n different uids.
(2) We should only include objects whose total count of occurrences is at least m.
Objects and users are all represented as integers. It would be trivial to convert each (object, uid) pair into (object, 1) and then reduce these by summing the second integers; I could then filter out everything that doesn't hit the threshold in (2). The problem is that at that point I would have lost the information needed to filter by (1), and I don't know how to incorporate that. Anyone have any suggestions?
The easiest and most natural way is to run two MR jobs in sequence. The goal of the first job is to count how many times each object is owned by each uid; the result is triplets (object, uid, count). The uid field here is for debugging purposes only -- it is not required in the second job. The second job groups the triplets by object. At the end of each reduce() call, you know:
the number of different uids for the object (the number of received triplets)
the total number of times the object is owned (the sum of the count fields)
So, now you may apply both filters.
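As an illustration, a sketch of what the second job's reducer could look like, assuming its mapper re-emits each triplet as (object, count); the class name, Writable types and threshold values are assumptions for the sake of the example:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.mapreduce.Reducer;

// Second job's reducer: key = object, values = the per-uid counts produced by the first job.
public class ObjectFilterReducer extends Reducer<IntWritable, IntWritable, IntWritable, LongWritable> {
    private static final int MIN_UIDS = 3;    // threshold n (assumed value)
    private static final long MIN_TOTAL = 10; // threshold m (assumed value)

    @Override
    protected void reduce(IntWritable object, Iterable<IntWritable> perUidCounts, Context context)
            throws IOException, InterruptedException {
        int distinctUids = 0;
        long total = 0;
        for (IntWritable count : perUidCounts) {
            distinctUids++;       // one value per (object, uid) triplet
            total += count.get(); // sum of the count fields
        }
        if (distinctUids >= MIN_UIDS && total >= MIN_TOTAL) {
            context.write(object, new LongWritable(total));
        }
    }
}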
A single-job setup is also possible, but it requires manipulating the job at a slightly lower level via setSortComparatorClass(), setGroupingComparatorClass() and setPartitionerClass(). The idea is that map() should emit composite keys containing both the object and uid fields; the value is not used at all (NullWritable):
The Partitioner class partitions keys using only the object field of the key. This guarantees that all records with the same object will go to the same reduce task.
The SortComparator class is implemented so that it first compares the object field and, if the objects are identical, the uid field.
The GroupingComparator uses only the object field for comparison.
As a result, the input of a single reduce task will look like the following:
object1 uid1
object1 uid2
object1 uid2
object1 uid2
object1 uid5
object1 uid6
object1 uid6
------------ <-- boundary of call to reduce()
object7 uid1
object7 uid1
object7 uid5
------------ <-- boundary of call to reduce()
object9 uid3
As you can see, the uids are strictly ordered inside each call to reduce(), which allows you to count both the number of distinct uids and the total number of occurrences in a single pass.
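A minimal sketch of the partitioner and grouping comparator for this setup, assuming the mapper emits a composite Text key of the form "object<TAB>uid" with both ids zero-padded (the key format and class names are illustrative assumptions):

import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;
import org.apache.hadoop.mapreduce.Partitioner;

// Routes records to reducers by the object part only, so all uids for an
// object end up in the same reduce task.
public class ObjectPartitioner extends Partitioner<Text, NullWritable> {
    @Override
    public int getPartition(Text key, NullWritable value, int numPartitions) {
        String object = key.toString().split("\t")[0];
        return (object.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}

// Groups keys by the object part only, so one reduce() call sees every uid
// for that object, while the sort order (full key) keeps the uids sorted
// within the call.
class ObjectGroupingComparator extends WritableComparator {
    protected ObjectGroupingComparator() {
        super(Text.class, true);
    }

    @Override
    public int compare(WritableComparable a, WritableComparable b) {
        String objectA = a.toString().split("\t")[0];
        String objectB = b.toString().split("\t")[0];
        return objectA.compareTo(objectB);
    }
}

In the driver you would register these with job.setPartitionerClass(ObjectPartitioner.class) and job.setGroupingComparatorClass(ObjectGroupingComparator.class). With zero-padded fields, the default Text sort order already compares the object part first and the uid part second, so an explicit SortComparator may not be needed in this particular sketch.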
I want to write a map-side join and want to include reducer code as well. I have a smaller data set which I will distribute via the distributed cache.
Can I write the map-side join with reducer code?
Yes!! Why not. Look, the reducer is meant for aggregating the key-value pairs emitted from the map. So you can always have a reducer in your code whenever you want to aggregate your result (say you want to count, find an average, or do any other numerical summarization) based on certain criteria that you've set in your code or in accordance with the problem statement. The map is just for filtering the data and emitting some useful key-value pairs out of a LOT of data. A map-side join is only needed when one of the datasets is small enough to fit in the memory of a commodity machine. By the way, a reduce-side join would serve your purpose too!!
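To make this concrete, here is a rough sketch of a map-side join followed by a reducer doing the aggregation (the cache file name, the tab-delimited record format, and the count-based aggregation are all assumptions for illustration):

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map-side join: the small dataset is loaded from the distributed cache in
// setup(), joined against each input record in map(), and the reducer aggregates.
public class JoinMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final Map<String, String> lookup = new HashMap<>();

    @Override
    protected void setup(Context context) throws IOException {
        // "small_dataset.txt" is a hypothetical cache file added via job.addCacheFile(...)
        try (BufferedReader reader = new BufferedReader(new FileReader("small_dataset.txt"))) {
            String line;
            while ((line = reader.readLine()) != null) {
                String[] parts = line.split("\t");
                lookup.put(parts[0], parts[1]); // join key -> attribute from the small side
            }
        }
    }

    @Override
    protected void map(LongWritable offset, Text value, Context context)
            throws IOException, InterruptedException {
        String[] fields = value.toString().split("\t");
        String joined = lookup.get(fields[0]);
        if (joined != null) {
            context.write(new Text(joined), new IntWritable(1)); // join succeeded, emit for aggregation
        }
    }
}

// The reducer then aggregates the joined records, e.g. a simple count per key.
class CountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        context.write(key, new IntWritable(sum));
    }
}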
I have a String Set attribute (i.e. SS) in a DynamoDB table. I need to scan the table to check whether a value is present in any item's set.
Which comparison operator should I use for this scan?
For example, the table has items like this:
name
[email1, email2]
phone
I need to search for items containing a particular email, say email1 alone, without providing the entire set.
It seems like you are looking for the CONTAINS operator of the Scan operation. It is basically the equivalent of in in Python.
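In the newer expression syntax, the same check is written as a contains() filter expression. A sketch with the AWS SDK for Java v2 (the table name "contacts" and the set attribute name "emails" are assumptions, not from the question):

import java.util.Map;
import software.amazon.awssdk.services.dynamodb.DynamoDbClient;
import software.amazon.awssdk.services.dynamodb.model.AttributeValue;
import software.amazon.awssdk.services.dynamodb.model.ScanRequest;
import software.amazon.awssdk.services.dynamodb.model.ScanResponse;

public class EmailScan {
    public static void main(String[] args) {
        DynamoDbClient ddb = DynamoDbClient.create();

        // Hypothetical table "contacts" with a string-set attribute "emails"
        ScanRequest request = ScanRequest.builder()
                .tableName("contacts")
                .filterExpression("contains(emails, :email)")
                .expressionAttributeValues(Map.of(
                        ":email", AttributeValue.builder().s("email1").build()))
                .build();

        // Scan reads the whole table and filters afterwards, so it can be slow and expensive
        ScanResponse response = ddb.scan(request);
        response.items().forEach(item -> System.out.println(item.get("name").s()));
    }
}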
This said, if you need to perform this often, you probably should de-normalize your data to make it faster.
For example, you could build a second table like this:
hash_key: name
range_key: email
Of course, you would have to maintain this index yourself and query it manually.
I have let the modeling tools in my IDE create entities from tables, so each entity is one record. How can I select n records starting at the i'th record, such that I may easily implement pagination?
I'm using criteria queries, but a simple reference should be enough. My tables are varied, so I can't do this by key. I can do this with native queries, but at the moment I'm uncertain how a criteria query and a native query can be combined.
Currently I am returning a list and discarding the portion I do not want, but this is proving to be too inefficient.
You can use the combination of javax.persistence.Query#setFirstResult and javax.persistence.Query#setMaxResults if you don't insist on using criteria.
Criteria criteria = session.createCriteria(SomeClass.class);
criteria.setFirstResult(0);
criteria.setMaxResults(10);
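Equivalently, with plain JPA the same pagination can be sketched like this (SomeClass is the entity from the snippet above; the helper class and method names are just for illustration):

import java.util.List;
import javax.persistence.EntityManager;
import javax.persistence.TypedQuery;

// Fetches n rows starting at the i-th record of the hypothetical entity SomeClass.
public class PageFetcher {
    public List<SomeClass> fetchPage(EntityManager em, int i, int n) {
        TypedQuery<SomeClass> query =
                em.createQuery("SELECT s FROM SomeClass s", SomeClass.class);
        query.setFirstResult(i); // offset: skip the first i rows
        query.setMaxResults(n);  // page size: return at most n rows
        return query.getResultList();
    }
}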