MapReduce and hash partitioning - mapreduce

While learning about MapReduce, I encountered this question:
A given MapReduce program has the Map phase generate 100 key-value pairs with 10 unique keys.
How many Reduce tasks can this program have such that at least one Reduce task will certainly be assigned no keys when a hash partitioner is used (select all answers that are correct)?
[ ] A. 3
[ ] B. 11
[ ] C. 50
[ ] D. 101
The answers are B, C, D.
Since the number of unique keys is 10, I figured we must have at least 10 reduce tasks, with at least one reduce task getting no key.
I am not able to understand how these answers were arrived at. Please help me with this.

Each unique key from the map output is assigned to exactly one reduce task. If there are 10 unique keys and there are 11, 50, or 101 reduce tasks, then there will necessarily be some reduce tasks that receive no keys.

As there are 10 unique keys, at most 10 reducers can receive keys; since we want at least 1 reducer with no keys assigned, that means 11 reducers in total.
If the number of reducers is greater than or equal to 11, at least one reducer is guaranteed to receive no keys. So any number greater than or equal to 11 is an answer.

A hash partitioner in this context merely means that reduce work is consolidated by unique key. It is assumed that a reduce task is completed on only one server, therefore each of the 10 tasks is atomic.
The modulo operator (or any reasonable partitioner) will ensure that each of the 3 servers/reducers stays active in the case of 10 keys.
For the other options, where there are more "reducers" than unique keys, the partitioning function can only produce 10 distinct remainders, so some reducers must receive nothing. This is confusing without additional context; apparently, partitioning only really matters when the number of keys exceeds the number of reducers/servers.

To get one reducer output file that is empty, i.e., a reducer with no key assigned, we need at least 11 reducers, because the hash partitioner distributes keys based on a hash function. Since there are only 10 unique keys, at most 10 of the part-r-* output files can receive any data.
reducer number = key.hashCode() % n   (where n is the number of reducers)
So the possible remainders are 0 to n-1, and with 10 unique keys there are at most 10 different remainders. We may get empty reducer files even when the number of reducers is less than the number of unique keys (if two keys land on the same remainder), but whenever the number of reducers is greater than the number of unique keys, at least one reducer file is guaranteed to be empty.
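As an illustration (not from the thread itself), here is a minimal Java sketch that simulates the assignment using the same (hash & Integer.MAX_VALUE) % numReduceTasks formula Hadoop's default HashPartitioner applies; the key names are made up:

    import java.util.HashSet;
    import java.util.Set;

    // Simulates how a hash partitioner spreads 10 unique keys over the
    // candidate reducer counts from the question.
    public class PartitionDemo {

        // Same formula as Hadoop's default HashPartitioner.
        static int partition(String key, int numReduceTasks) {
            return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
        }

        public static void main(String[] args) {
            for (int reducers : new int[] {3, 11, 50, 101}) {
                Set<Integer> nonEmpty = new HashSet<>();
                for (int i = 0; i < 10; i++) {
                    nonEmpty.add(partition("key" + i, reducers));  // 10 made-up unique keys
                }
                // With more than 10 reducers, at most 10 partitions can be hit,
                // so at least (reducers - 10) reducers are certainly empty.
                System.out.println(reducers + " reducers: " + nonEmpty.size()
                        + " receive keys, " + (reducers - nonEmpty.size()) + " receive none");
            }
        }
    }

With 3 reducers an empty one is merely possible (all 10 hash codes would have to collide modulo 3), whereas with 11, 50, or 101 reducers at least one empty reducer is unavoidable.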

Related

HashTable: determining table size and which hash function to use

If the input data entries are around 10^9, do we keep the size of the hash table the same as the input size, or reduce it? How do we decide the table size?
If we are using numbers in the range of 10^6 as the key, how do we hash these numbers to smaller values? I know we use the modulo operator, but modulo with what?
Kindly explain how these two things work; it's getting quite confusing. Thanks!!
I tried to make the table size around 75% of the input data size, call it X. Then I did key % X to get the hash code. But I am not sure if this is correct.
If the input data entries are around 10^9, do we keep the size of the hash table the same as the input size, or reduce it? How do we decide the table size?
The ratio of the number of elements stored to the number of buckets in the hash table is known as the load factor. In a separate chaining implementation, I'd suggest doing what std::unordered_set et al. do and keeping it roughly in the range 0.5 to 1.0. So, for 10^9 elements have 10^9 to 2x10^9 buckets. Luckily, with separate chaining nothing awful happens if you go a bit outside this range: lower load factors just waste some memory on extra unused buckets, and higher load factors lead to increased collisions, longer lists and longer search times. But at load factors under 5 or 10, with an ok hash function, the slowdown will be roughly linear on average (so 5 or 10x slower than at load factor 1).
One important decision you should make is whether to pick a number around this magnitude that is a power of two, or a prime number. Explaining the implications is tedious, and anyway - which will work best for you is best determined by trying both and measuring the performance (if you really have to care about smallish differences in performance; if not - a prime number is the safer bet).
If we are using numbers in the range of 10^6 as the key, how do we hash these numbers to smaller values? I know we use the modulo operator, but modulo with what?
Are these keys unsigned integers? In general, you can't have only 10^6 potential keys and end up with 10^9 hash table entries, as hash tables don't normally store duplicates (std::unordered_multiset/multimap can, but it'll be easier for you to model that kind of thing as a hash table from distinct keys to a container of values). More generally, it's best to separate the act of hashing (which is usually expected to generate a size_t result) from the "folding" of the hash value over the number of buckets in the hash table. That folding can be done using % in the general case, or by bitwise-ANDing with a bitmask for power-of-two bucket counts (e.g. for 256 buckets, & 255 is the same as % 256, but may execute faster on the CPU when those 255/256 values aren't known at compile time).
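To illustrate that folding step (in Java rather than C++, with helper names of my own; the hash value is treated as unsigned):

    // Folding a hash value onto a bucket index.
    public class Folding {

        // General case: works for any bucket count (e.g. a prime).
        static int bucketByMod(long hash, int bucketCount) {
            return (int) Long.remainderUnsigned(hash, bucketCount);
        }

        // Power-of-two case: (hash & (bucketCount - 1)) == (hash % bucketCount)
        // for non-negative hashes, and the AND may be cheaper.
        static int bucketByMask(long hash, int bucketCount) {
            return (int) (hash & (bucketCount - 1));
        }

        public static void main(String[] args) {
            long hash = 123456789L;                          // pretend this came from hash(key)
            System.out.println(bucketByMod(hash, 257));      // prime bucket count
            System.out.println(bucketByMask(hash, 256));     // power-of-two bucket count
            System.out.println(hash % 256 == (hash & 255));  // true for non-negative hashes
        }
    }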
I tried to make the table size around 75% of the input data size, call it X.
So that's a load factor around 1.33, which is ok.
Then I did key % X to get the hash code. But I am not sure if this is correct.
It ends up being the same thing, but I'd suggest thinking of that as having a hash function hash(key) = key, followed by mod-ing into the bucket count. Such a hash function is known as an identity hash function, and is the implementation used for integers by all major C++ compiler Standard Libraries, though no particular hash functions are specified in the C++ Standard. It tends to work ok, but if your integer keys are particularly prone to collisions (for example, if they were all distinct multiples of 16 and your bucket count was a power of two they'd tend to only map to every 16th bucket) then it'd be better to use a stronger hash function. There are other questions about that - e.g. What integer hash function are good that accepts an integer hash key?
Rehashing
If the number of elements may increase dramatically beyond your initial expectations at run-time, then you'll want to increase the number of buckets to keep the load factor reasonable (in the range discussed above).
Implementing support for that can easily be done by first writing a hash table class that doesn't support rehashing, simply taking the number of buckets to use as a constructor argument. Then write an outer rehashing-capable hash table class with a data member of the above type. When an insert would push the load factor too high (Standard Library containers have a max_load_factor member which defaults to 1.0), you can construct an additional inner hash table object, telling the constructor a new, larger bucket count to use, then iterate over the smaller hash table inserting (or, better, moving; see below) the elements into the new hash table, then swap the two hash tables so the data member ends up with the new, larger content and the smaller one is destructed.
By "moving" above I mean simply relinking the linked-list elements from the smaller hash table into the lists in the larger one, instead of deep-copying the elements, which will be dramatically faster and use less memory momentarily while rehashing.
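To make the mechanics concrete, here is a compressed single-class sketch of the same idea (in Java rather than C++, with made-up names); the interesting part is rehash, which relinks existing nodes instead of copying them:

    // Toy separate-chaining hash set of longs, for illustration only.
    public class ChainedSet {
        private static final double MAX_LOAD_FACTOR = 1.0;

        private static final class Node {
            final long key;
            Node next;
            Node(long key, Node next) { this.key = key; this.next = next; }
        }

        private Node[] buckets = new Node[16];
        private int size = 0;

        // Identity hash folded onto the bucket count.
        private static int bucketIndex(long key, int bucketCount) {
            return (int) Long.remainderUnsigned(key, bucketCount);
        }

        public boolean add(long key) {
            int b = bucketIndex(key, buckets.length);
            for (Node n = buckets[b]; n != null; n = n.next) {
                if (n.key == key) return false;              // already present
            }
            buckets[b] = new Node(key, buckets[b]);
            size++;
            if ((double) size / buckets.length > MAX_LOAD_FACTOR) {
                rehash(buckets.length * 2);
            }
            return true;
        }

        // "Moving" rather than copying: nodes are relinked into the new
        // bucket array, so nothing is reallocated during a rehash.
        private void rehash(int newBucketCount) {
            Node[] newBuckets = new Node[newBucketCount];
            for (Node head : buckets) {
                while (head != null) {
                    Node next = head.next;
                    int b = bucketIndex(head.key, newBucketCount);
                    head.next = newBuckets[b];
                    newBuckets[b] = head;
                    head = next;
                }
            }
            buckets = newBuckets;
        }
    }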

Easiest primary key for main table?

My main table, Users, stores information about users. I plan to have a UserId field as the primary key of the table. I have full control of creation and assignment of these keys, and I want to ensure that I assign keys in a way that provides good performance. What should I do?
You have a few options:
The most generic solution is to use UUIDs, as specified in RFC 4122.
For example, you could have a STRING(36) that stores UUIDs. Or you could store the UUID as a pair of INT64s or as a BYTE(16). There are some pitfalls to using UUIDs, so read the details of this answer.
If you want to save a bit of space and are absolutely sure that you will have fewer than a few billion users, then you could use an INT64 and assign UserIds using a random number generator. The reason you want to be sure you have fewer than a few billion users is the Birthday Problem: the odds of getting at least one collision reach about 50% once you have around 4B users, and they increase very fast from there. If you assign a UserId that has already been assigned to a previous user, your insertion transaction will fail, so you'll need to be prepared for that (by retrying the transaction after generating a new random number).
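A rough sketch of this option, with hypothetical names (tryInsertUser stands in for whatever insert transaction your database exposes, and it is assumed to report a duplicate-key failure):

    import java.security.SecureRandom;

    // Random 64-bit UserIds with a retry on collision.
    public class RandomUserId {
        private static final SecureRandom RNG = new SecureRandom();

        static long createUser(String name) {
            while (true) {
                long id = RNG.nextLong();
                if (tryInsertUser(id, name)) {   // returns false on a duplicate key
                    return id;
                }
                // Collision: vanishingly rare until you approach billions of
                // users, but it must be handled, so just draw another id.
            }
        }

        // Hypothetical stand-in for the real database call.
        static boolean tryInsertUser(long id, String name) {
            return true;
        }
    }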
If there's some column, MyColumn, in the Users table that you would like to have as the primary key (possibly because you know you'll want to look up entries using this column frequently), but you're not sure about the tendency of this column to cause hotspots (say, because it's generated sequentially or based on timestamps), then you have two other options:
3a) You could "encrypt" MyColumn and use this as your primary key. In mathematical terms, you could use an automorphism on the key values, which has the effect of chaotically scrambling them while still never assigning the same value multiple times. In this case, you wouldn't need to store MyColumn separately at all, but rather you would only store/use the encrypted version and could decrypt it when necessary in your application code. Note that this encryption doesn't need to be secure but instead just needs to guarantee that the bits of the original value are sufficiently scrambled in a reversible way. For example: If your values of MyColumn are integers assigned sequentially, you could just reverse the bits of MyColumn to create a sufficiently scrambled primary key. If you have a more interesting use-case, you could use an encryption algorithm like XTEA.
3b) Have a compound primary key where the first part is a ShardId, computed as hash(MyColumn) % numShards, and the second part is MyColumn. The hash function will ensure that you don't create a hotspot by allocating your rows to a single split. More information on this approach can be found here. Note that you do not need to use a cryptographic hash, although md5 or sha512 are fine functions; SpookyHash is a good option too. Picking the right number of shards is an interesting question and can depend upon the number of nodes in your instance; it's effectively a trade-off between hotspot-avoiding power (more shards) and read/scan efficiency (fewer shards). If you only have 3 nodes, then 8 shards is probably fine. If you have 100 nodes, then 32 shards is a reasonable value to try.
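And a small sketch of the 3b ShardId computation, using Java's built-in String hash purely for illustration (any decent non-cryptographic hash would do; the sample values are made up):

    public class ShardedKey {
        // ShardId in [0, numShards), derived from the column value.
        static int shardId(String myColumn, int numShards) {
            return (myColumn.hashCode() & Integer.MAX_VALUE) % numShards;
        }

        public static void main(String[] args) {
            int numShards = 8;   // e.g. for a small 3-node instance
            for (String value : new String[] {"2024-01-01T00:00:01", "2024-01-01T00:00:02"}) {
                // Compound primary key: (ShardId, MyColumn)
                System.out.println("(" + shardId(value, numShards) + ", " + value + ")");
            }
        }
    }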

mapreduce - how to anonymize a column's values

input

    1 - - GET hm_brdr.gif
    2 - - GET s102382.gif
    3 - - GET bg_stars.gif
    3 - - GET phrase.gif

after the map-reduce grouping step:

    ("1", {"- - GET hm_brdr.gif"})
    ("2", {"- - GET s102382.gif"})
    ("3", {"- - GET bg_stars.gif", "- - GET phrase.gif"})
I want to make the first column's values (1, 2, 3, ...) anonymous using random integers, but the mapping must be consistent: 1 must not become x in one line and t in another. So my solution is to replace the "keys" with random integers (rand(1)=x, rand(2)=y, ...) in the reduce step, then ungroup the values with their new keys and write them to files again as shown below.
output file
x - - GET hm_brdr.gif
y - - GET s102382.gif
z - - GET bg_stars.gif
z - - GET phrase.gif
My question is: is there a better way of doing this in terms of running time?
If you want to assign a random integer to a key value then you'll have to do that in a reducer, where all key/value pairs for that key are gathered in one place. As @jason pointed out, you don't want to assign a random number, since there's no guarantee that a particular random number won't be chosen for two different keys. What you can do instead is increment a counter held as an instance variable on the reducer to get the next available number to associate with a key. If you have a small amount of data then a single reducer can be used and the numbers will be unique. If you're forced to use multiple reducers then you'll need a slightly more complicated technique: use
Context.getTaskAttemptID().getTaskID().getId()
to get a unique reducer number with which to calculate an overall unique number for each key.
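A sketch of how that could look in a Hadoop reducer (the OFFSET constant is my assumption: it just has to exceed the number of distinct keys any single reducer will see):

    import java.io.IOException;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    // Each reducer numbers its keys with a local counter, offset by the
    // reducer's task id, so the ids never clash across reducers.
    public class AnonymizingReducer extends Reducer<Text, Text, Text, Text> {
        private static final long OFFSET = 1_000_000_000L;  // assumed bound on keys per reducer
        private long base;
        private long nextLocal;

        @Override
        protected void setup(Context context) {
            base = (long) context.getTaskAttemptID().getTaskID().getId() * OFFSET;
            nextLocal = 0;
        }

        @Override
        protected void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            long anonymousId = base + nextLocal++;           // unique across the whole job
            for (Text value : values) {
                context.write(new Text(Long.toString(anonymousId)), value);
            }
        }
    }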
There is no way this is a bottleneck to your MapReduce job. More precisely, the runtime of your job is dominated by other concerns (network and disk I/O, etc.). A quick little key function? Meh.
But that's not even the biggest issue with your proposal. The biggest issue with your proposal is that it's doomed to fail. What is a key fact about keys? They serve as unique identifiers for records. Do random number generators guarantee uniqueness? No.
In fact, pretend for just a minute that your random key space has 365 possible values. It turns out that if you generate a mere 23 random keys, you are more likely than not to have a key collision; welcome to the birthday paradox. And all of a sudden, you've lost the whole point to the keys in the first place as you've started smashing together records by giving two that shouldn't have the same key the same key!
And you might be thinking, well, my key space isn't as small as 365 possible keys, it's more like 2^32 possible keys, so I'm, like, totally in the clear. No. After approximately 77,000 keys you're more likely than not to have a collision.
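The numbers check out against the standard birthday approximation, 1 - e^(-n(n-1)/2d) for n keys drawn from a space of size d; a quick sketch (the formula is the usual approximation, not something from the answer itself):

    public class BirthdayBound {
        // Approximate probability of at least one collision among n random
        // keys drawn uniformly from a space of size d.
        static double collisionProbability(double n, double d) {
            return 1.0 - Math.exp(-n * (n - 1) / (2.0 * d));
        }

        public static void main(String[] args) {
            System.out.println(collisionProbability(23, 365));                 // ~0.5
            System.out.println(collisionProbability(77_000, Math.pow(2, 32))); // ~0.5
        }
    }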
Your idea is just completely untenable because it's the wrong tool for the job. You need unique identifiers. Random doesn't guarantee uniqueness. Get a different tool.
In your case, you need a function that is injective on your input key space (that is, it guarantees that f(x) != f(y) if x != y). You haven't given me enough details to propose anything concrete, but that's what you're looking for.
And seriously, there is no way that performance of this function will be an issue. Your job's runtime really will be completely dominated by other concerns.
Edit:
To respond to your comment:
here i am actually trying to make the ip numbers anonymous in the log files, so if you think there is a better way i ll be happy to know.
First off, we have a serious XY problem here. You should have asked that question instead, or searched for existing answers to it. Anonymizing IP addresses, or anything for that matter, is hard. You haven't even told us the criteria for a "solution" (e.g., who are the attackers?). I recommend taking a look at this answer on the IT Security Stack Exchange site.

mapreduce program

Consider one .txt file in which I have a number of paragraphs separated by a newline character. Now I need to count the number of words in each paragraph. Consider the counted words as a key in the mapper and assign an initial value of 1 to all of them, and in the reducer give me a sorted output.
Please give me complete code for better understanding, because I am a fresher, and please clarify how it counts the number of words in each paragraph.
Having the mapper alone do all the counting would not yield the performance that you are trying to achieve through the MapReduce technique.
To really utilise the benefit of MapReduce, you should consider treating the paragraph number (1 for the 1st paragraph, 2 for the 2nd, and so on) as the key and then sending the paragraphs for individual counting to different reducers running on different nodes (harnessing the capability of parallel processing). To sort the output, you can then feed it into a simple program, or, if the number of paragraphs is large, into another MapReduce job. In that case, you would need to consider a range of numbers as the key for the second job: say counts (of words in a paragraph, obtained from the previous MapReduce job) from 1 to 10 fall into one bucket and are mapped to one key; the individual reducers can then sort these individual buckets, and the results can be collated at the end to get the complete sorted output.
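A hedged sketch of the first job, assuming a reasonably recent Hadoop where TextInputFormat honours the textinputformat.record.delimiter property (so a blank line separates paragraphs) and using each paragraph's byte offset as a stand-in for its number:

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    // Sketch only: one record per paragraph, the mapper emits
    // (paragraph offset, word count) and the reducer sums per paragraph.
    public class ParagraphWordCount {

        public static class ParagraphMapper
                extends Mapper<LongWritable, Text, LongWritable, IntWritable> {
            @Override
            protected void map(LongWritable offset, Text paragraph, Context context)
                    throws IOException, InterruptedException {
                String text = paragraph.toString().trim();
                int words = text.isEmpty() ? 0 : text.split("\\s+").length;
                context.write(offset, new IntWritable(words));
            }
        }

        public static class SumReducer
                extends Reducer<LongWritable, IntWritable, LongWritable, IntWritable> {
            @Override
            protected void reduce(LongWritable offset, Iterable<IntWritable> counts, Context context)
                    throws IOException, InterruptedException {
                int total = 0;
                for (IntWritable c : counts) total += c.get();
                context.write(offset, new IntWritable(total));
            }
        }

        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Treat a blank line as the record separator, so each record is one paragraph.
            conf.set("textinputformat.record.delimiter", "\n\n");
            Job job = Job.getInstance(conf, "paragraph word count");
            job.setJarByClass(ParagraphWordCount.class);
            job.setMapperClass(ParagraphMapper.class);
            job.setReducerClass(SumReducer.class);
            job.setOutputKeyClass(LongWritable.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

Sorting the resulting (paragraph, count) pairs can then be done in a small follow-up program or a second job, as described above.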
An example implementation of map-reduce can be found at : http://hadoop.apache.org/common/docs/r0.20.2/mapred_tutorial.html

Can you encode to fewer bits when you don't need to preserve order?

Say you have a List of 32-bit Integers and the same collection of 32-bit Integers in a Multiset (a set that allows duplicate members).
Since Sets don't preserve order but Lists do, does this mean we can encode a Multiset in fewer bits than the List?
If so how would you encode the Multiset?
If this is true what other examples are there where not needing to preserve order saves bits?
Note, I just used 32-bit Integers as an example. Does the data type matter in the encoding? Does the data type need to be fixed length and comparable for you to get the savings?
EDIT
Any solution should work well for collections that have low duplication as well as high duplication. It's obvious that with high duplication, encoding a Multiset by simply counting duplicates is very easy, but this takes more space if there is no duplication in the collection.
In the multiset, each entry would be a pair of numbers: The integer value, and a count of how many times it is used in the set. This means additional repeats of each value in the multiset do not cost any more to store (you just increment the counter).
However (assuming both values are ints), this would only be more efficient storage than a simple list if each list item is repeated twice or more on average. There could be more efficient or higher-performance ways of implementing this, depending on the ranges, sparsity, and repetitiveness of the numbers being stored. (For example, if you know there won't be more than 255 repeats of any value, you could use a byte rather than an int to store the counter.)
This approach would work with any types of data, as you are just storing the count of how many repeats there are of each data item. Each data item needs to be comparable (but only to the point where you know that two items are the same or different). There is no need for the items to take the same amount of storage each.
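A minimal sketch of that (value, count) representation, with made-up class and method names:

    import java.util.HashMap;
    import java.util.Map;

    // Multiset stored as value -> count pairs.
    public class CountedMultiset {
        private final Map<Integer, Integer> counts = new HashMap<>();

        // A repeated value only bumps the counter; it does not store another entry.
        public void add(int value) {
            counts.merge(value, 1, Integer::sum);
        }

        public int countOf(int value) {
            return counts.getOrDefault(value, 0);
        }

        public static void main(String[] args) {
            CountedMultiset ms = new CountedMultiset();
            for (int v : new int[] {7, 7, 7, 42}) ms.add(v);
            System.out.println(ms.countOf(7));   // 3, stored as a single (7, 3) pair
            System.out.println(ms.countOf(42));  // 1
        }
    }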
If there are duplicates in the multiset, it could be compressed to a smaller size than a naive list. You may want to have a look at Run-length encoding, which can be used to efficiently store duplicates (a very simple algorithm).
Hope that is what you meant...
Data compression is a fairly complicated subject, and there are redundancies in data that are hard to use for compression.
It is fundamentally ad hoc, since a non-lossy scheme (one where you can recover the input data) that shrinks some data sets has to enlarge others. A collection of integers with lots of repeats will do very well in a multiset, but if there's no repetition you're using a lot of space on repeating counts of 1. You can test this by running compression utilities on different files. Text files have a lot of redundancy and can typically be compressed a lot. Files of random numbers will tend to grow when compressed.
I don't know that there really is an exploitable advantage in losing the order information. It depends on what the actual numbers are, primarily if there's a lot of duplication or not.
In principle, this is the equivalent of sorting the values and storing the first entry and the ordered differences between subsequent entries.
In other words, for a sparsely populated set only a little saving can be had, but for a denser set, or one with clustered entries, more significant compression is possible (i.e. fewer bits need to be stored per entry, possibly less than one in the case of many duplicates). That is, compression is possible, but the level depends on the actual data.
The operation sort followed by list delta will result in a serialized form that is easier to compress.
E.g. [2 12 3 9 4 4 0 11] -> sorted [0 2 3 4 4 9 11 12] -> deltas [0 2 1 1 0 5 2 1], which weighs about half as much.
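A short sketch of that transform and its inverse (the order of the original list is lost, but the multiset of values is fully recoverable):

    import java.util.Arrays;

    // Sort-then-delta encoding of an unordered collection of ints.
    public class DeltaEncode {
        static int[] encode(int[] values) {
            int[] sorted = values.clone();
            Arrays.sort(sorted);
            int[] deltas = new int[sorted.length];
            int prev = 0;
            for (int i = 0; i < sorted.length; i++) {
                deltas[i] = sorted[i] - prev;   // small, non-negative numbers: easy to compress
                prev = sorted[i];
            }
            return deltas;
        }

        static int[] decode(int[] deltas) {
            int[] sorted = new int[deltas.length];
            int running = 0;
            for (int i = 0; i < deltas.length; i++) {
                running += deltas[i];           // running sum restores the sorted values
                sorted[i] = running;
            }
            return sorted;
        }

        public static void main(String[] args) {
            int[] input = {2, 12, 3, 9, 4, 4, 0, 11};
            int[] deltas = encode(input);
            System.out.println(Arrays.toString(deltas));          // [0, 2, 1, 1, 0, 5, 2, 1]
            System.out.println(Arrays.toString(decode(deltas)));  // [0, 2, 3, 4, 4, 9, 11, 12]
        }
    }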