Do all MapReduce implementations take a keys paramater input into the reduce function? - mapreduce

Previously I asked what the use case for passing a list of keys to CouchDB's reduce function was; the answer (https://stackoverflow.com/a/46713674/3114742) mentions two potential use-cases:
From a design perspective you may want to work with keys emitted by a map function
You may be calculating something based on key input (such as how often a particular key appears)
Do all implementations of MapReduce take an array of keys as input to the reduce functions? CouchDB specifically keeps track of the original document that produces a key. i.e. the input to a CouchDB reduce function:
function(keys, values, rereduce) {...}
The keys arg looks like this: [[key1,id1], [key2,id2], [key3,id3]].
i.e. Couch keeps track of the entity that emitted the key as a result of the Map function, even in the reduce function. Do other MapReduce implementations keep track of this information? Or is this specific to CouchDB...

Not all mapreduce implementation has the same structure as in couchdb.
For example in mongodb mapreduce, there is just a single key and list of values unlike the couch db. So, all the keys that is emitted by map function is grouped and passed as just one key and list of values to reduce function.
Example:
emit(1,10)
emit(1,20)
will be grouped to
reduce(1,[10,20])

Related

Using of hash(map) as key in durable cache

I'm going to use redis cache where key is a clojure map (serialized into bytes by nippy).
Can i use hash of the clojure map as a key in redis cache?
Another words, does clojure map hash depends only on data structure value and does not depend on any memory allocation.
Investigating:
I navigated through code and found out IHashEq interface which is implemented by clojure data structures.
In the result, IHashEq impl ends with calling of Object.hashCode which has following contract:
Whenever it is invoked on the same object more than once during
an execution of a Java application, the {#code hashCode} method
must consistently return the same integer, provided no information
used in {#code equals} comparisons on the object is modified.
This integer need not remain consistent from one execution of an
application to another execution of the same application.
Well, just want to clarify that i cannot use hash as id persistent in other process because:
two equal values give two equal hash codes, but not vice verse. So there is a chance of collision
there is no guarantee that clojure map hash will be the same for the the same values in different jvm processes
Please, confirm.
Re your two points:
Two equal values will yield the same hash code, but two unequal values may also give the same hash code. So the chance of collision makes this a bad choice of key.
Different JVM's should generate the same hashcode for a given value, given the same version of Java & Clojure (& very probably for different versions, although this is not guarenteed).
You can use secure hash library (like this one) to address your concerns (like in blockchain). Albeit you have to pay for its performance penalty.

Why is a set used instead of a map? C++

Sets are used to get information of an object by providing all the information, usually used to check if the data exists. A map is used to get the information of an object by using a key (single data). Correct me if I am wrong. Now the question is why would we need a set in the first place, can't we a map to see if the data exist? why would we need to provide all the information just to see if it exist?
There are many operations where you just need a set. Using a map would be just extra space.
Set operations (Union, Intersection etc.).
Keeping unique elements from a collection of numbers, objects etc.
A set serves to group items of the same type that are different among themselves (i.e., they are not equal). For example, the numbers 1 and 2 are both of int type, but 1!=2.
set containers are useful when you want to keep track of collections of homogeneous things as a group, and perform mathematical operations on such groups (like intersection, union, difference, etc). For example, imagine a set of search results containing all the documents mentioning the words cat and dog. And then another set containing all the documents mentioning the words pet. The union of those two sets would give you the group of documents containing the words cat, dog, and pet. Notice that such group will have no repetitions (i.e., if a document was in the both sets initially, it will be only once in the second set).
maps are most certainly not a set, but they can be seen as an arrangement which allows you to associate a value with every element of a set. They are used to represent relationships. For example, the set of people working for a company have an associated employee_number; in this case a map would be a useful structure to represent such relationship.
Going back to the previous example, if you wanted to know how many times has each page been accessed, you could probably create a map along the lines of std::map<Page, int>, that is, a relationship between the pages, and the number of times each has been visited.
Notice that the keys of a map form a set (probably this is what confuses many people), and an implication of this property is that you can only have a given key once (there are some esoteric containers where a key can be mapped to different values though).
So, if you need to interact with groups and collections as a whole, and with the members of the group itself, probably you want a set. If you need to associate certain things with members of a group or a collection, probably you want a map. If the association spans more than one dimension, probably you want a multi_map.
Important notice that in C++ std::set and std::map are ordered. C++11 offers alternative unordered containers called std::unordered_set and std::unordered_map.
A Set contains a unique list of ordered values, but a Map can contain a non unique set of unordered values accessed using a key.
Either could be used to determine if an object exists, it depends on your use case and how you need to be able to access that object - can you test to see if the Set contains an object that you have a reference to, or do you need to look it up by one or more keys to be able to compare it?

How to get a clojure array-map to maintain insertion order after assoc?

I have an array-map which I am associng some values into it. After a certain size the returned value is a PersistentHashMap rather than the original PersistentArrayMap. I've read about this behavior on a few web sites. Is there any way to force the insertion order even after assoc?
I do have a separate function which will take a ash-map and a vector of keys, and return a "fresh" array-map with keys in this order, but it means that for each assoc, I have to extract the keys first, cons/conj the new key to the vector, then create a new array-map. Seems kludgey, even if written in a separate function.
Is there a more direct language-supported way of keeping insertion order even on large-ish (>10, but < 50) keys array-map?
In case it's relevant, I'm using a list of array-maps as data into an incanter dataset and then outputting to excel. The save-xls function keeps the order of the keys/columns.
Thanks
You can use an ordered map: https://github.com/flatland/ordered

Hash table with two keys

I have a large amount of data the I want to be able to access in two different ways. I would like constant time look up based on either key, constant time insertion with one key, and constant time deletion with the other. Is there such a data structure and can I construct one using the data structures in tr1 and maybe boost?
Use two parallel hash-tables. Make sure that the keys are stored inside the element value, because you'll need all the keys during deletion.
Have you looked at Bloom Filters? They aren't O(1), but I think they perform better than hash tables in terms of both time and space required to do lookups.
Hard to find why you need to do this but as someone said try using 2 different hashtables.
Just pseudocode in here:
Hashtable inHash;
Hashtable outHash;
//Hello myObj example!!
myObj.inKey="one";
myObj.outKey=1;
myObj.data="blahblah...";
//adding stuff
inHash.store(myObj.inKey,myObj.outKey);
outHash.store(myObj.outKey,myObj);
//deleting stuff
inHash.del(myObj.inKey,myObj.outKey);
outHash.del(myObj.outKey,myObj);
//findin stuff
//straight
myObj=outHash.get(1);
//the other way; still constant time
key=inHash.get("one");
myObj=outHash.get(key);
Not sure, thats what you're looking for.
This is one of the limits of the design of standard containers: a container in a sense "own" the contained data and expects to be the only owner... containers are not merely "indexes".
For your case a simple, but not 100% effective, solution is to have two std::maps with "Node *" as value and storing both keys in the Node structure (so you have each key stored twice). With this approach you can update your data structure with reasonable overhead (you will do some extra map search but that should be fast enough).
A possibly "correct" solution however would IMO be something like
struct Node
{
Key key1;
Key key2;
Payload data;
Node *Collision1Prev, *Collision1Next;
Node *Collision2Prev, *Collision2Next;
};
basically having each node in two different hash tables at the same time.
Standard containers cannot be combined this way. Other examples I coded by hand in the past are for example an hash table where all nodes are also in a doubly-linked list, or a tree where all nodes are also in an array.
For very complex data structures (e.g. network of structures where each one is both the "owner" of several chains and part of several other chains simultaneously) I even resorted sometimes to code generation (i.e. scripts that generate correct pointer-handling code given a description of the data structure).

Multiple keys Hash Table (unordered_map)

I need to use multiple keys(int type) to store and retrieve a single value from a hash table. I would use multiple key to index a single item. I need fast insertion and look up for the hash table. By the way, I am not allowed to use the Boost library in the implementation.
How could I do that?
If you mean that two ints form a single key then unordered_map<std::pair<int,int>, value_type>. If you want to index the same set of data by multiple keys then look at Boost.MultiIndex.
If the key to your container is comprised of the combination of multiple ints, you could use boost::tuple as your key, to encapsulate the ints without more work on your part. This holds provided your count of key int subcomponents is fixed.
Easiest way is probably to keep a map of pointers/indexes to the elements in a list.
A few more details are needed here though, do you need to support deletion? how are the elements setup? Can you use boost::shared pointers? (rather helpful if you need to support deletion)
I'm assuming that the value object in this case is large, or there is some other reason you can't simply duplicate values in a regular map.
If its always going to be a combination for retrieval.
Then its better to form a single compound key using multiple keys.
You can do this either
Storing the key as a concatenated string of ints like
(int1,int2,int3) => data
Using a higher data type like uint64_t where in u can add individual values to form a key
// Refer comment below for the approach