How to group duplicates using MapReduce?

I want to use MapReduce to return a list of duplicate tuples. By duplicate tuples, I mean tuples having similar values of a set of attributes.
Could I put the values of this set of attributes as an intermediate key and adjust reduce to process all similar keys as one key?

Yes. I implemented my own intermediate key class, which implements the WritableComparable interface, so I had to implement the compareTo method, which returns 0 when its inputs are equal.
In my case, the fields of the key class are the attributes of my tuples, so I wrote compareTo to return 0 when all of these attributes are similar. Similarity here can be computed with the Levenshtein edit distance.
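The similarity test described above can be sketched independently of Hadoop. The following is a minimal C++ illustration, not the code from the answer: the function names and the maxDistance threshold are assumptions made for the example.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Dynamic-programming Levenshtein edit distance, keeping only two rows.
std::size_t levenshtein(const std::string& a, const std::string& b) {
    std::vector<std::size_t> prev(b.size() + 1), curr(b.size() + 1);
    for (std::size_t j = 0; j <= b.size(); ++j) prev[j] = j;
    for (std::size_t i = 1; i <= a.size(); ++i) {
        curr[0] = i;
        for (std::size_t j = 1; j <= b.size(); ++j) {
            std::size_t cost = (a[i - 1] == b[j - 1]) ? 0 : 1;
            // deletion, insertion, substitution (or match)
            curr[j] = std::min({prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost});
        }
        std::swap(prev, curr);
    }
    return prev[b.size()];
}

// Hypothetical similarity test: two attribute values count as "similar"
// when their edit distance does not exceed maxDistance (an assumed threshold).
bool similar(const std::string& a, const std::string& b, std::size_t maxDistance) {
    return levenshtein(a, b) <= maxDistance;
}
```

One thing to keep in mind: this kind of similarity is not transitive. With a threshold of 1, "cat" is similar to "cot" and "cot" is similar to "dot", but "cat" is not similar to "dot". A comparison method built on it may therefore not behave like a consistent ordering, and the grouping you get can depend on the order in which keys are compared.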

Related

Do all MapReduce implementations take a keys parameter as input to the reduce function?

Previously I asked what the use case for passing a list of keys to CouchDB's reduce function was; the answer (https://stackoverflow.com/a/46713674/3114742) mentions two potential use-cases:
From a design perspective you may want to work with keys emitted by a map function
You may be calculating something based on key input (such as how often a particular key appears)
Do all implementations of MapReduce take an array of keys as input to the reduce functions? CouchDB specifically keeps track of the original document that produces a key. i.e. the input to a CouchDB reduce function:
function(keys, values, rereduce) {...}
The keys arg looks like this: [[key1,id1], [key2,id2], [key3,id3]].
i.e. Couch keeps track of the entity that emitted the key as a result of the Map function, even in the reduce function. Do other MapReduce implementations keep track of this information? Or is this specific to CouchDB...

Not all MapReduce implementations have the same structure as CouchDB's.
For example, in MongoDB's mapReduce the reduce function receives a single key and a list of values, unlike in CouchDB. All values that the map function emits under the same key are grouped and passed to the reduce function as one key plus a list of values.
Example:
emit(1,10)
emit(1,20)
will be grouped to
reduce(1,[10,20])
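The grouping step shown above can be sketched in a few lines. This is only a C++ illustration of the idea, not MongoDB's API: groupByKey and reduceSum are made-up names, and summing is an assumed example of a reduce function.

```cpp
#include <map>
#include <numeric>
#include <utility>
#include <vector>

// Grouping step: collect every value emitted under the same key.
std::map<int, std::vector<int>> groupByKey(
        const std::vector<std::pair<int, int>>& emitted) {
    std::map<int, std::vector<int>> grouped;
    for (const auto& kv : emitted) grouped[kv.first].push_back(kv.second);
    return grouped;
}

// Example reduce function (assumed to be a sum): one key, one list of values.
int reduceSum(int /*key*/, const std::vector<int>& values) {
    return std::accumulate(values.begin(), values.end(), 0);
}
```

Calling groupByKey on the pairs (1,10) and (1,20) yields a single entry with key 1 and the values 10 and 20, which is what the reduce function then receives.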

How to create a custom enumerator in Ember?

I'm looking for a way to create a custom enumerator for a collection, using a sum enumerator as the example; calling reduce() in several places just to sum values is not a good solution.
The Enumerable API says that an enumerator has to implement the nextObject method and have a length property, but not all the built-in enumerators seem to have them.
I tried reopening the Ember.Enumerable class, but it failed.
An example of how to create a sum enumerator based on reduce (or a resource explaining how to do it) would be a great help.
edit
# custom enumerator I want to have
sum = @get('items').sum(0)
# current solution
sum = @get('items').reduce ((prev, curr) -> prev + curr), 0
Ember has Em.computed.sum but I couldn't find it as a normal enumerable.

I think you are confusing enumerating functions and enumerable objects.
An object that is enumerable will have to implement nextObject and length. These are things like linked lists, queues, stacks, and sets. Basically, data structures that you would want to iterate.
An enumerating function is something you can apply to an enumerable object, like sum, min, max.
You should not be adding this method to Ember.Enumerable because not all enumerables can be summed. A list of numbers can, but how would you sum a collection of fruits?
That said, this should answer your question:
http://emberjs.jsbin.com/nodojadi/1/edit

Why is a set used instead of a map? C++

Sets are used to get information about an object by providing all of its information, usually to check whether the data exists. A map is used to get information about an object by using a key (a single piece of data). Correct me if I am wrong. Now the question is: why would we need a set in the first place? Can't we use a map to see if the data exists? Why would we need to provide all the information just to see if it exists?

There are many operations where you just need a set. Using a map would be just extra space.
Set operations (Union, Intersection etc.).
Keeping unique elements from a collection of numbers, objects etc.

A set serves to group items of the same type that are different among themselves (i.e., they are not equal). For example, the numbers 1 and 2 are both of int type, but 1!=2.
set containers are useful when you want to keep track of collections of homogeneous things as a group, and perform mathematical operations on such groups (like intersection, union, difference, etc.). For example, imagine a set of search results containing all the documents mentioning the words cat and dog, and another set containing all the documents mentioning the word pet. The union of those two sets gives you every document that appears in either set. Notice that this group will have no repetitions (i.e., if a document was in both sets initially, it will appear only once in the resulting set).
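As a rough sketch of the search-results example, the union can be computed with std::set_union. The helper name unionOf and the use of strings as document identifiers are assumptions made for illustration.

```cpp
#include <algorithm>
#include <iterator>
#include <set>
#include <string>

// Union of two result sets; a document present in both appears only once.
std::set<std::string> unionOf(const std::set<std::string>& a,
                              const std::set<std::string>& b) {
    std::set<std::string> out;
    std::set_union(a.begin(), a.end(), b.begin(), b.end(),
                   std::inserter(out, out.end()));
    return out;
}
```

For instance, the union of {"doc1", "doc2"} and {"doc2", "doc3"} has three elements, with "doc2" stored once.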
maps are most certainly not a set, but they can be seen as an arrangement which allows you to associate a value with every element of a set. They are used to represent relationships. For example, the set of people working for a company have an associated employee_number; in this case a map would be a useful structure to represent such relationship.
Going back to the previous example, if you wanted to know how many times each page has been accessed, you could create a map along the lines of std::map<Page, int>, that is, a relationship between the pages and the number of times each has been visited.
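A minimal sketch of such a visit counter might look like this. Page is assumed here to be a plain string (for example a URL), and countVisits is a hypothetical helper name.

```cpp
#include <map>
#include <string>
#include <vector>

using Page = std::string;  // assumption: a page is identified by its URL

// Count how many times each page occurs in an access log.
std::map<Page, int> countVisits(const std::vector<Page>& accessLog) {
    std::map<Page, int> visits;
    // operator[] value-initializes a missing counter to 0 before the increment.
    for (const auto& page : accessLog) ++visits[page];
    return visits;
}
```

Each distinct page ends up as exactly one key, which reflects the point that the keys of a map form a set.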
Notice that the keys of a map form a set (probably this is what confuses many people), and an implication of this property is that you can only have a given key once (there are some esoteric containers where a key can be mapped to different values though).
So, if you need to interact with groups and collections as a whole, and with the members of the group itself, you probably want a set. If you need to associate certain things with members of a group or a collection, you probably want a map. If one member needs to be associated with several values, you probably want a std::multimap.
Note that in C++ std::set and std::map are ordered. C++11 offers unordered alternatives called std::unordered_set and std::unordered_map.

A set contains unique values (kept in sorted order in the case of std::set), whereas a map contains values that need not be unique, each accessed through a unique key.
Either could be used to determine if an object exists, it depends on your use case and how you need to be able to access that object - can you test to see if the Set contains an object that you have a reference to, or do you need to look it up by one or more keys to be able to compare it?

QMap::insertMulti or QMultiMap?

Which should I use, QMap::insertMulti or QMultiMap, to handle:
2 -> abc
2 -> def
3 -> ghi
3 -> jkl
What's the difference between the two solutions?

Reading Container Classes:
QMap<Key, T>
This provides a dictionary (associative array) that maps keys of type Key to values of type T. Normally each key is associated with a single value. QMap stores its data in Key order; if order doesn't matter QHash is a faster alternative.
QMultiMap<Key, T>
This is a convenience subclass of QMap that provides a nice interface for multi-valued maps, i.e. maps where one key can be associated with multiple values.
It looks like both can do the job. The same document also has an Algorithmic Complexity section, where you can see that both classes have the same complexity.
I would choose QMultiMap just to better document the fact I'm going to hold multiple values with the same key.

Both can serve this purpose. QMultiMap is actually a subclass of QMap.
If you want multiple values for a single key, you can use:
QMap : for inserting use insertMulti
QMultiMap : for inserting use insert
If you want a single value for a single key, you can use:
QMap : for inserting use insert
QMultiMap : for inserting use replace
You can see that both can serve both purposes. But each has a default behavior that matches its name, and each has some methods or operators that are convenient for the single-value or multi-value case.
It is good practice to choose the type depending on your need. For example, if you use QMap to store multiple values per key, someone reading through your class members might get the impression (from the data type) that you intend to store single key-value pairs.
Similarly, if you use QMultiMap, anyone reading the definition can tell that the data will have multiple values for the same key.
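To keep the illustration free of a Qt dependency, here is the data from the question stored in std::multimap, the standard-library counterpart of a multi-valued map. This is a stand-in rather than Qt code; valuesFor and sampleData are hypothetical helper names.

```cpp
#include <map>
#include <string>
#include <vector>

// Collect every value stored under one key of a multi-valued map.
std::vector<std::string> valuesFor(const std::multimap<int, std::string>& m,
                                   int key) {
    std::vector<std::string> out;
    auto range = m.equal_range(key);
    for (auto it = range.first; it != range.second; ++it) {
        out.push_back(it->second);
    }
    return out;
}

// The data from the question: 2 -> abc, def and 3 -> ghi, jkl.
std::multimap<int, std::string> sampleData() {
    return {{2, "abc"}, {2, "def"}, {3, "ghi"}, {3, "jkl"}};
}
```

As with QMultiMap, the type name itself tells the reader that a key may carry several values. The order in which Qt returns the values under one key may differ from what std::multimap does, so check the Qt documentation if that order matters to you.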

Multiple keys Hash Table (unordered_map)

I need to use multiple keys (of type int) to store and retrieve a single value from a hash table, that is, several keys together index a single item. I need fast insertion and lookup for the hash table. By the way, I am not allowed to use the Boost library in the implementation.
How could I do that?

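If you mean that two ints form a single key, then use unordered_map<std::pair<int,int>, value_type>. Note that the standard library provides no std::hash specialization for std::pair, so you have to supply your own hash functor as the third template argument. If you want to index the same set of data by multiple keys, then look at Boost.MultiIndex.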
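A Boost-free sketch of the pair-keyed table could look like the following. PairHash and PairKeyMap are names invented for this example, std::string stands in for value_type, and the hash-combining formula is just one commonly used choice, not something the standard prescribes.

```cpp
#include <cstddef>
#include <functional>
#include <string>
#include <unordered_map>
#include <utility>

// Hash functor for std::pair<int, int>, needed because the standard library
// has no std::hash specialization for std::pair.
struct PairHash {
    std::size_t operator()(const std::pair<int, int>& p) const {
        std::size_t h1 = std::hash<int>{}(p.first);
        std::size_t h2 = std::hash<int>{}(p.second);
        // Mix the two hashes so that (1, 2) and (2, 1) are unlikely to collide.
        return h1 ^ (h2 + 0x9e3779b9 + (h1 << 6) + (h1 >> 2));
    }
};

// std::string is an assumed value type for the illustration.
using PairKeyMap =
    std::unordered_map<std::pair<int, int>, std::string, PairHash>;
```

Equality of keys needs no extra code, since std::pair already provides operator==.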

If the key to your container is comprised of the combination of multiple ints, you could use boost::tuple as your key, to encapsulate the ints without more work on your part. This holds provided your count of key int subcomponents is fixed.

Easiest way is probably to keep a map of pointers/indexes to the elements in a list.
A few more details are needed here though: do you need to support deletion? How are the elements set up? Can you use boost::shared_ptr? (It is rather helpful if you need to support deletion.)
I'm assuming that the value object in this case is large, or there is some other reason you can't simply duplicate values in a regular map.
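One way to read this suggestion, sketched with std::shared_ptr from C++11 so that Boost is not needed: several int keys point at one shared value object. MultiKeyTable and its member names are hypothetical, and std::string stands in for the (possibly large) value type.

```cpp
#include <memory>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical index in which several int keys refer to one shared value.
class MultiKeyTable {
public:
    // Store one value and register every key in `keys` as an alias for it.
    void insert(const std::vector<int>& keys, std::string value) {
        auto shared = std::make_shared<std::string>(std::move(value));
        for (int k : keys) index_[k] = shared;
    }

    // Return the value for `key`, or a null pointer when the key is unknown.
    std::shared_ptr<std::string> find(int key) const {
        auto it = index_.find(key);
        if (it == index_.end()) return nullptr;
        return it->second;
    }

    // Remove one key; the value is freed once its last key is gone.
    void erase(int key) { index_.erase(key); }

private:
    std::unordered_map<int, std::shared_ptr<std::string>> index_;
};
```

The value is stored once regardless of how many keys refer to it, and deletion is handled by the reference count.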

If retrieval always uses the full combination of keys, it is better to form a single compound key from the individual keys.
You can do this in either of two ways:
Store the key as a concatenated string of the ints, like
(int1,int2,int3) => data
Use a wider data type such as uint64_t, into which you combine the individual values to form the key
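The uint64_t variant could be sketched as follows, assuming exactly two keys that each fit in 32 bits. makeKey and CompoundMap are names made up for the example, and std::string is an assumed value type.

```cpp
#include <cstdint>
#include <string>
#include <unordered_map>

// Pack two 32-bit ints into one 64-bit key:
// `a` goes into the high half, `b` into the low half.
std::uint64_t makeKey(std::int32_t a, std::int32_t b) {
    return (static_cast<std::uint64_t>(static_cast<std::uint32_t>(a)) << 32) |
           static_cast<std::uint32_t>(b);
}

// std::hash<std::uint64_t> exists, so no custom hash functor is required.
using CompoundMap = std::unordered_map<std::uint64_t, std::string>;
```

Because each int occupies its own half of the key, distinct pairs always produce distinct keys, including pairs with negative values. With more than two ints, or wider ints, this packing no longer fits and the string or pair-based approaches are the safer choice.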
