Good MapReduce examples [closed] - mapreduce

Closed 10 years ago.
I couldn't think of any good examples other than the "how to count words in a long text with MapReduce" task. I found this wasn't the best example to give others an impression of how powerful this tool can be.
I'm not looking for code-snippets, really just "textual" examples.

MapReduce is a framework that was developed to process massive amounts of data efficiently.
For example, if a dataset holds 1 million records in a relational representation, deriving values from them and performing transformations on them is very expensive.
In SQL, for instance, given each person's date of birth, finding how many people are older than 30 across a million records would take a while, and the cost only grows, by orders of magnitude, as the query becomes more complex.
MapReduce provides a cluster-based implementation in which the data is processed in a distributed manner.
Here is a Wikipedia article explaining what MapReduce is all about.
Another good example is finding friends via MapReduce; it is a powerful way to understand the concept and a widely used use case.
Personally, I found this link quite useful for understanding the concept.
I am copying the explanation provided in the blog here (in case the link goes stale):
Finding Friends
MapReduce is a framework originally developed at Google that allows
for easy large scale distributed computing across a number of domains.
Apache Hadoop is an open source implementation.
I'll gloss over the details, but it comes down to defining two
functions: a map function and a reduce function. The map function
takes a value and outputs key:value pairs. For instance, if we define
a map function that takes a string and outputs the length of the word
as the key and the word itself as the value then map(steve) would
return 5:steve and map(savannah) would return 8:savannah. You may have
noticed that the map function is stateless and only requires the input
value to compute its output value. This allows us to run the map
function against values in parallel and provides a huge advantage.
Before we get to the reduce function, the mapreduce framework groups
all of the values together by key, so if the map functions output the
following key:value pairs:
3 : the
3 : and
3 : you
4 : then
4 : what
4 : when
5 : steve
5 : where
8 : savannah
8 : research
They get grouped as:
3 : [the, and, you]
4 : [then, what, when]
5 : [steve, where]
8 : [savannah, research]
Each of these lines would then be passed as an argument to the reduce
function, which accepts a key and a list of values. In this instance,
we might be trying to figure out how many words of certain lengths
exist, so our reduce function will just count the number of items in
the list and output the key with the size of the list, like:
3 : 3
4 : 3
5 : 2
8 : 2
The reductions can also be done in parallel, again providing a huge
advantage. We can then look at these final results and see that there
were only two words of length 5 in our corpus, etc...
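The word-length walkthrough above can be sketched in a few lines of plain Python. This is only a single-process illustration of the three phases, not Hadoop code; the function names `map_word`, `group_by_key` and `reduce_count` are made up for this sketch.

```python
from collections import defaultdict

def map_word(word):
    """Map step: emit (length, word) for a single input word."""
    return (len(word), word)

def group_by_key(pairs):
    """Shuffle step: collect all values that share a key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(groups)

def reduce_count(key, values):
    """Reduce step: output the key with the number of values seen."""
    return (key, len(values))

words = ["the", "and", "you", "then", "what", "when",
         "steve", "where", "savannah", "research"]

pairs = [map_word(w) for w in words]          # could run in parallel
grouped = group_by_key(pairs)                 # done by the framework
counts = dict(reduce_count(k, v) for k, v in grouped.items())
print(counts)  # {3: 3, 4: 3, 5: 2, 8: 2}
```

In a real cluster the map calls and the reduce calls would each be spread over many machines; here they are ordinary loops.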
The most common example of mapreduce is for counting the number of
times words occur in a corpus. Suppose you had a copy of the internet
(I've been fortunate enough to have worked in such a situation), and
you wanted a list of every word on the internet as well as how many
times it occurred.
The way you would approach this would be to tokenize the documents you
have (break it into words), and pass each word to a mapper. The mapper
would then spit the word back out along with a value of 1. The
grouping phase will take all the keys (in this case words), and make a
list of 1's. The reduce phase then takes a key (the word) and a list
(a list of 1's for every time the key appeared on the internet), and
sums the list. The reducer then outputs the word, along with its
count. When all is said and done you'll have a list of every word on
the internet, along with how many times it appeared.
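A minimal word-count sketch along the same lines, again in plain Python rather than on a real cluster. The two tiny documents and the names `mapper`, `shuffle` and `reducer` are assumptions for illustration only.

```python
from collections import defaultdict

def mapper(document):
    """Tokenize a document and emit (word, 1) for every token."""
    for word in document.lower().split():
        yield (word, 1)

def shuffle(pairs):
    """Group the emitted 1's by word, as the framework would."""
    groups = defaultdict(list)
    for word, one in pairs:
        groups[word].append(one)
    return groups

def reducer(word, ones):
    """Sum the list of 1's to get the word's total count."""
    return (word, sum(ones))

documents = ["the cat sat", "the dog sat down"]

pairs = [pair for doc in documents for pair in mapper(doc)]
counts = dict(reducer(word, ones) for word, ones in shuffle(pairs).items())
print(counts)  # {'the': 2, 'cat': 1, 'sat': 2, 'dog': 1, 'down': 1}
```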
Easy, right? If you've ever read about mapreduce, the above scenario
isn't anything new... it's the "Hello, World" of mapreduce. So here is
a real world use case (Facebook may or may not actually do the
following, it's just an example):
Facebook has a list of friends (note that friends are a bi-directional
thing on Facebook. If I'm your friend, you're mine). They also have
lots of disk space and they serve hundreds of millions of requests
every day. They've decided to pre-compute calculations when they can to
reduce the processing time of requests. One common processing request
is the "You and Joe have 230 friends in common" feature. When you
visit someone's profile, you see a list of friends that you have in
common. This list doesn't change frequently so it'd be wasteful to
recalculate it every time you visited the profile (sure you could use
a decent caching strategy, but then I wouldn't be able to continue
writing about mapreduce for this problem). We're going to use
mapreduce so that we can calculate everyone's common friends once a
day and store those results. Later on it's just a quick lookup. We've
got lots of disk, it's cheap.
Assume the friends are stored as Person->[List of Friends], our
friends list is then:
A -> B C D
B -> A C D E
C -> A B D E
D -> A B C E
E -> B C D
Each line will be an argument to a mapper. For every friend in the
list of friends, the mapper will output a key-value pair. The key will
be a friend along with the person. The value will be the list of
friends. The key will be sorted so that the friends are in order,
causing all pairs of friends to go to the same reducer. This is hard
to explain with text, so let's just do it and see if you can see the
pattern. After all the mappers are done running, you'll have a list
like this:
For map(A -> B C D) :
(A B) -> B C D
(A C) -> B C D
(A D) -> B C D
For map(B -> A C D E) : (Note that A comes before B in the key)
(A B) -> A C D E
(B C) -> A C D E
(B D) -> A C D E
(B E) -> A C D E
For map(C -> A B D E) :
(A C) -> A B D E
(B C) -> A B D E
(C D) -> A B D E
(C E) -> A B D E
For map(D -> A B C E) :
(A D) -> A B C E
(B D) -> A B C E
(C D) -> A B C E
(D E) -> A B C E
And finally for map(E -> B C D):
(B E) -> B C D
(C E) -> B C D
(D E) -> B C D
Before we send these key-value pairs to the reducers, we group them by their keys and get:
(A B) -> (A C D E) (B C D)
(A C) -> (A B D E) (B C D)
(A D) -> (A B C E) (B C D)
(B C) -> (A B D E) (A C D E)
(B D) -> (A B C E) (A C D E)
(B E) -> (A C D E) (B C D)
(C D) -> (A B C E) (A B D E)
(C E) -> (A B D E) (B C D)
(D E) -> (A B C E) (B C D)
Each line will be passed as an argument to a reducer. The reduce
function will simply intersect the lists of values and output the same
key with the result of the intersection. For example, reduce((A B) ->
(A C D E) (B C D)) will output (A B) : (C D) and means that friends A
and B have C and D as common friends.
The result after reduction is:
(A B) -> (C D)
(A C) -> (B D)
(A D) -> (B C)
(B C) -> (A D E)
(B D) -> (A C E)
(B E) -> (C D)
(C D) -> (A B E)
(C E) -> (B D)
(D E) -> (B C)
Now when D visits B's profile, we can quickly look up (B D) and see
that they have three friends in common, (A C E).
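The common-friends job above can be reproduced with a short Python sketch. It runs in one process and only mimics the map, group and reduce phases; the names `mapper` and `reducer` and the dictionary layout are assumptions of this sketch, not part of the original blog post.

```python
from collections import defaultdict

friends = {
    "A": ["B", "C", "D"],
    "B": ["A", "C", "D", "E"],
    "C": ["A", "B", "D", "E"],
    "D": ["A", "B", "C", "E"],
    "E": ["B", "C", "D"],
}

def mapper(person, friend_list):
    """Emit ((person, friend) sorted, friend_list) for every friend."""
    for friend in friend_list:
        key = tuple(sorted((person, friend)))  # (A, B) and (B, A) collide
        yield (key, friend_list)

def reducer(pair, friend_lists):
    """Intersect all friend lists that arrived for this pair."""
    common = set(friend_lists[0])
    for other in friend_lists[1:]:
        common &= set(other)
    return (pair, sorted(common))

# Group phase: collect the values emitted for each key.
grouped = defaultdict(list)
for person, friend_list in friends.items():
    for key, value in mapper(person, friend_list):
        grouped[key].append(value)

common_friends = dict(reducer(pair, lists) for pair, lists in grouped.items())
print(common_friends[("B", "D")])  # ['A', 'C', 'E']
```

The resulting dictionary plays the role of the precomputed lookup table: a profile visit becomes a single key lookup.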

One of the best examples of Hadoop-like MapReduce implementation.
Keep in mind though that they are limited to key-value based implementations of the MapReduce idea (so they are limiting in applicability).

One set of familiar operations that you can do in MapReduce is the set of normal SQL operations: SELECT, SELECT WHERE, GROUP BY, etc.
Another good example is matrix-vector multiplication, where you pass each mapper one row of M and the entire vector x, and it computes one element of M * x.
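As a rough sketch of the matrix example: each map call receives one row (with its index) plus the whole vector x and emits one element of the product. The function name `map_row` and the sample data are assumptions; a production job would also have to handle a vector too large to ship to every mapper.

```python
def map_row(row_index, row, x):
    """Map step: dot one row of M with the full vector x."""
    return (row_index, sum(m_ij * x_j for m_ij, x_j in zip(row, x)))

M = [[1, 2],
     [3, 4]]
x = [5, 6]

# Each row can be handled by a different mapper; no reduce work is
# needed beyond placing each element at its row index.
emitted = [map_row(i, row, x) for i, row in enumerate(M)]
product = [value for _, value in sorted(emitted)]
print(product)  # [17, 39]
```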

From time to time I present MR concepts to people. I find processing tasks familiar to people and then map them to the MR paradigm.
Usually I take two things:
Group By / Aggregations. Here the advantage of the shuffling stage is clear. It also helps to explain that shuffling is itself a distributed sort, and to explain how the distributed sort algorithm works.
Join of two tables. People who work with databases are familiar with the concept and its scalability problem. Show how it can be done in MR.
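One common way to show a join in MR is a reduce-side join: mappers tag each record with the table it came from and key it on the join column, and the reducer pairs up the records that meet at the same key. The sketch below assumes two hypothetical tables, `users` and `orders`, and is only a single-process illustration.

```python
from collections import defaultdict

# Hypothetical input tables, joined on the user id.
users = [(1, "ann"), (2, "bob")]
orders = [(1, "book"), (1, "pen"), (2, "lamp")]

def map_user(record):
    """Key a user record on its id and tag it with its table."""
    user_id, name = record
    return (user_id, ("user", name))

def map_order(record):
    """Key an order record on its user id and tag it with its table."""
    user_id, item = record
    return (user_id, ("order", item))

def reduce_join(user_id, tagged_values):
    """Pair every user row with every order row sharing the key."""
    names = [v for tag, v in tagged_values if tag == "user"]
    items = [v for tag, v in tagged_values if tag == "order"]
    return [(user_id, name, item) for name in names for item in items]

# Shuffle: records from both tables meet at the same key.
grouped = defaultdict(list)
for key, value in [map_user(u) for u in users] + [map_order(o) for o in orders]:
    grouped[key].append(value)

joined = []
for user_id in sorted(grouped):
    joined.extend(reduce_join(user_id, grouped[user_id]))
print(joined)
# [(1, 'ann', 'book'), (1, 'ann', 'pen'), (2, 'bob', 'lamp')]
```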

Related

What kind of structure is this? (Monad with a partial inverse but not a comonad)

I have encountered a structure that looks like a monad with a one-sided inverse and some additional properties. I am not sure which properties of this structure are essential and which are accidental, so I will follow a simple example in my description.
I have a base type a which consists of sorted strings (e.g. "aacdee" but not "abca") and the monad M from a, which is just the List monad: M a is lists of sorted strings. This monad defines pure: a -> M a, fmap: (a -> a) -> M a -> M a and bind: (a -> M a) -> M a -> M a.
Now I define extract: M a -> a which takes a list of strings, concatenates them and sorts the result. This is a left inverse of pure, i.e. extract . pure = id on a, but not a right inverse.
I also want to define extend: (M a -> a) -> M a -> M a in such a way that extract . (extend f) = f for all f: M a -> a.
While it would be possible to define extend f = pure . f, I do not want to do this.
For example, if f is the function that replaces each character with the next one in the alphabet, concatenates and sorts, I want extend f to just replace each character with the next. Similarly if f removes all “a”’s from the first string, all “b”’s from the second, etc.
For a less trivial example, take f as the function that takes the first string, then if the second string is longer than the first extends the first string with the last elements of the second and so on. For example, f ["ab", "c", "def"] = "abf". In this case I want extend f to just filter each string, leaving only the letters that will contribute to the result; in the example, (extend f) ["ab", "c", "def"] = ["ab", "", "f"].
The idea behind all of this is that in M a one can have parallel optimization for many kinds of f, and I want to define extend f as an optimized implementation for many specific cases, falling back to extend f = pure . f only in the unoptimized cases.
My extend will not satisfy the comonad axioms, but will at least satisfy the following conditions (or very similar ones, I am not completely sure about the associativity):
(extend f) . pure = pure . f . pure, i.e. on a single string f and extend f are essentially the same,
extend (extract . (fmap h)) = fmap h, i.e. if g = extract . (fmap h) acts on each string separately, then extend g does the same,
(extend f) . (extend g) = extend (f . pure . g), i.e. associativity, or maybe a weaker form of it.
My question. Is this a well known structure? Does it have any peculiar interesting properties?
Looking at extract alone, we see that extract . pure = id. We also see that extract . join = extract . fmap extract. This makes extract an algebra over the [] monad.
In particular, algebras over the [] monad correspond exactly to monoids (category theory explanation: the forgetful functor Monoids -> Sets is monadic, and its left adjoint is [], so monoids are exactly algebras over the [] functor). So extract defines a monoid on a with the obvious unit and composition law.
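The two algebra laws and the induced monoid can be spot-checked on sample data. The sketch below uses Python lists to stand in for the [] monad, purely for illustration; `pure`, `join`, `fmap`, `extract` and `combine` are hypothetical helper names, and passing on a few samples is of course not a proof.

```python
def pure(s):
    """Wrap a sorted string in a singleton list."""
    return [s]

def join(xss):
    """Flatten a list of lists by one level."""
    return [x for xs in xss for x in xs]

def fmap(f, xs):
    """Apply f to every element of the list."""
    return [f(x) for x in xs]

def extract(xs):
    """Concatenate the strings and sort the characters."""
    return "".join(sorted("".join(xs)))

def combine(a, b):
    """Monoid operation induced by extract; the unit is extract([])."""
    return extract([a, b])

nested = [["ab", "c"], [], ["def", "aa"]]

# Law 1: extract . pure = id (on sorted strings)
print(extract(pure("acd")) == "acd")                            # True
# Law 2: extract . join = extract . fmap extract
print(extract(join(nested)) == extract(fmap(extract, nested)))  # True
# The induced monoid: sorted merge with the empty string as unit.
print(combine("ace", "bd"), repr(extract([])))                  # abcde ''
```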
As for extend, I don't think you have the correct types. This is because extend f :: M a -> M a, so extend f cannot be an argument to extract and therefore extract (extend f) doesn't type check. Perhaps once you fix this, it'll be easier to understand what's going on here.

Regular Language Closure Unconcatenation

I'm trying to find an operation that can take a regular language and "unconcatenate" it with another. For example:
a*L - a* = L | where L is a regular language
I know that difference (subtraction) isn't the operation I want. But I believe I'm getting my point across.
Another way to look at it: suppose we have a set L that is logically equal to (A ∪ B), but we do not have access to A. If we can only use L, B, and derivations of them, can we somehow derive A? Basically:
L - B = A | L = (A ∪ B)
I have put plenty of thought into this problem, using many variations of complement, intersection, and other closure properties of regular languages, but I simply can't figure it out.
The best I've managed to come up with is:
A = (L - B) ∪ (A ∩ B) | L = (A ∪ B)
However this requires A on the right side.
If L = A U B, define an operator - such that L - B = A.
The problem with this is that the operator - is not well-defined: Given L and B, there are potentially several languages which satisfy L = A U B. In particular, if A is a subset of L and any (possibly improper) superset of L \ B, then A is a solution; that is, if A = (L \ B) U C, where C is a (possibly improper) subset of B, then L - B might as well be equal to that set.
Now, you could define - to mean the set of all such A, and in that case, you could make this workable using set difference, union and power set operators. Then, L - B = Q where Q = {(L \ B) U {}, (L \ B) U {B[0]}, ..., (L \ B) U B = L}.
You can make this well-defined if you specify - always returns the "smallest" element of Q (for finite sets, the one with the fewest elements; for infinite sets, the one which is a subset of all other sets) in which case you recover simply L \ B.
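For finite languages the union case can be made concrete. The sketch below enumerates every A with A ∪ B = L, assuming B is a subset of L (otherwise no such A exists); `candidate_solutions` is a hypothetical helper name and the sample languages are made up.

```python
from itertools import combinations

def candidate_solutions(L, B):
    """All sets A with A | B == L, assuming finite sets and B <= L."""
    base = L - B                 # every solution must contain this
    extras = sorted(B)           # any subset of B may be added on top
    solutions = []
    for size in range(len(extras) + 1):
        for chosen in combinations(extras, size):
            solutions.append(base | set(chosen))
    return solutions

L = {"a", "b", "ab"}
B = {"a", "b"}

Q = candidate_solutions(L, B)
print(len(Q))                          # 4
print(all(A | B == L for A in Q))      # True
print(min(Q, key=len) == L - B)        # True
```

The smallest member of Q is exactly the set difference, which matches the observation that the well-defined version of the operator is just L \ B.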
If L = B.A, define an operator - such that L - B = A.
A similar problem exists here: there may be several languages which, when appended to B, give L. For example, consider B = a*, and two choices for A: a* and {e}, the language containing only the empty string. You can show without much effort that a* a* = a* {e}, so L is the same either way, B is the same, and L - B must now produce two different values: either a* or {e}.

Interleaving in OCaml

I am trying to create a function which interleaves a pair of triples such as ((6, 3, 2), (4, 5, 1)) and creates a 6-tuple out of this interleaving.
I did some research but could not understand how interleaving is supposed to work, so I tried something on my own and ended up with code that creates a 6-tuple, but not in the right interleaved way. This is my code:
let interleave ((a, b, c), (a', b', c')) =
  let sort2 (a, b) = if a > b then (a, b) else (b, a) in
  let sort3 (a, b, c) =
    let (a, b) = sort2 (a, b) in
    let (b, c) = sort2 (b, c) in
    let (a, b) = sort2 (a, b) in
    (a, b, c) in
  let touch ((x), (y)) =
    let (x) = sort3 (x) in
    let (y) = sort3 (y) in
    ((x),(y)) in
  let ((a, b, c), (a', b', c')) = touch ((a, b, c), (a', b', c')) in
  (a, b', a', b, c, c');;
Can someone please explain to me how, and with what functions, I can achieve a proper form of interleaving? I haven't learned about recursion and lists yet, in case you wonder why I am trying to do it this way.
Thank you already.
The problem statement uses the word "max" without defining it. If you use the built-in compare function of OCaml as your definition, it uses lexicographic order. So you want the largest value (of the 6 values) in the first position in the 6-tuple, the second largest value next, and so on.
This should be pretty easy given your previously established skill with the sorting of tuples.
For what it's worth, there doesn't seem to be much value in preserving the identities of the two 3-tuples. Once inside the outermost function you can just work with the 6 values as a 6-tuple. Or so it would seem to me.
Update
From your example (should probably have given it at the beginning :-) it's pretty clear what you're being asked to do. You want to end up with a sequence in which the elements of the original tuples are in their original order, but they can be interleaved arbitrarily. This is often called a "shuffle" (or a merge). You have to find the shuffle that has the maximum value lexicographically.
If you reason this out, it amounts to taking whichever value is largest from the front of the two tuples and putting it next in the output.
This is much easier to do with lists.
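To illustrate the list-based idea (in Python rather than OCaml, and only as a sketch): repeatedly take the front element from whichever input is currently "larger". Comparing the whole remaining lists, rather than just the two front values, is an added assumption here so that ties between equal front values are also resolved; `max_interleave` is a made-up name.

```python
def max_interleave(xs, ys):
    """Lexicographically largest shuffle that keeps each input's order."""
    xs, ys = list(xs), list(ys)
    out = []
    while xs and ys:
        # Python compares lists lexicographically, so this picks the
        # side with the larger front value and looks further ahead
        # only when the front values are equal.
        if xs >= ys:
            out.append(xs.pop(0))
        else:
            out.append(ys.pop(0))
    out.extend(xs or ys)  # one side is empty; append the remainder
    return tuple(out)

print(max_interleave((6, 3, 2), (4, 5, 1)))  # (6, 4, 5, 3, 2, 1)
```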
Now that I understand what your end-goal is . . .
Since tuples of n elements are different types for different n's, you need to define helper functions for manipulating different sizes of tuples.
One approach, that basically mimics a recursive function over lists (but requires many extra functions because of tuples all having different types), is to have two sets of helper functions:
functions that prepend a value to an existing tuple: prepend_to_2, up through prepend_to_5. For example,
let prepend_to_3 (a, (b, c, d)) = (a, b, c, d)
functions that interleave two tuples of each possible size up to 3: interleave_1_1, interleave_1_2, interleave_1_3, interleave_2_2, interleave_2_3, and interleave_3_3. (Note that we don't need e.g. interleave_2_1, because we can just call interleave_1_2 with the arguments in the opposite order.) For example,
let interleave_2_2 ((a, b), (a', b')) =
  if a > a'
  then prepend_to_3 (a, interleave_1_2 (b, (a', b')))
  else prepend_to_3 (a', interleave_1_2 (b', (a, b)))
(Do you see how that works?)
Then interleave is just interleave_3_3.
With lists and recursion this would be much simpler, since a single function can operate on lists of any length, so you don't need multiple different copies of the same logic.

Form the route with the endpoints

So let's say we are given the endpoints (A, B), (B, C), (C, D); then we can form the route A -> B -> C -> D.
Note that the order in which the endpoints are given is random. So (A, B), (C, D), (B, C) would also have yielded the route A -> B -> C -> D.
But in general, if we are given ordered pairs of endpoints, how to construct the route?
I'm not sure what data structure is most helpful here. I'm thinking of storing each coordinate (x, y) in a list as the inputs are read in.
So (A, B), (C, D) would be stored as {A, B, C, D}. Whether each element is x or y coordinates can be determined by the parity of its position in the list (so the 1st entry in the list is x, 2nd entry is y, 3rd is x, etc). Then as each ordered pair is read in, we look up the list to see if either the x or y coordinate is already in the list. If so, we connect.
To demonstrate, suppose we are reading in (A, B), (C, D), (B, C), our list would be {A, B, C, D} after (C, D) is just read. When (B, C) is read, we see that B is already in the list. So we know A -> B -> C. Also C is in the list, and we have A-> B -> C -> D, and then we add (B, C) to the list to form {A, B, C, D, B, C}.
My difficulty is: how do we store A -> B -> C? What data structure should I use? How do we keep track of the partial route we have formed as we go?
Thank you!
Construct a graph of directed edges using an adjacency-list representation. Then run DFS from the start point toward the end point, storing the visited nodes in a buffer; as soon as you reach the destination, the values in the buffer form the path.
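A minimal sketch of that idea, assuming the pairs form a single simple route (no branches, no cycles). Under that assumption each node has at most one outgoing edge, so the adjacency list collapses to a dictionary and the DFS to simply following links; `build_route` is a hypothetical name.

```python
def build_route(edges):
    """Chain (start, end) pairs, given in any order, into one route."""
    next_stop = {start: end for start, end in edges}
    ends = set(next_stop.values())
    # The route begins at the only node that is never an endpoint.
    starts = [node for node in next_stop if node not in ends]
    if len(starts) != 1:
        raise ValueError("edges do not form a single simple route")
    route = [starts[0]]
    while route[-1] in next_stop:
        route.append(next_stop[route[-1]])
    return route

print(build_route([("A", "B"), ("C", "D"), ("B", "C")]))
# ['A', 'B', 'C', 'D']
```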

Lisp : (A (B C)), why 1 list and 1 atom?

I'm learning Lisp and I don't understand some examples given in a course for explaining lists and atoms.
I understand :
(A B) : 1 list, 2 atoms
(A B C) : 1 list, 3 atoms
I don't understand this part:
(A (B C)) : 1 list, 1 atom
After thinking a lot, I think that :
A is the atom and (B C) is the list, but I don't really understand why...
Why are the first and the last parentheses not considered to be one list?
Why don't we count B and C as atoms here?
Thanks in advance for any enlightenment on this weird thing :)
I'd say the answer is wrong. For consistency with the previous answers, it should have been:
(A (B C)) : 2 lists, 3 atoms
Here's why: there are three atoms in total: A B C. There's a nested list: (B C) and an outer list: (A (B C)), totaling two lists.
It'd be correct to state that there's "1 list, 1 atom" if the question were "count the top-level elements inside the list" - but that's not consistent with the first two examples, which take into account all the atoms and lists shown, including the outer list.
(A (B C)) : 1 list, 1 atom, 1 list, 2 atoms
As other people have indicated, this question is kind of confusing. But you can understand the concepts, even if the questions are confusing.
Let's take the first one:
(A B)
What is this? Well, it's a list, so it contains smaller things. Yay! How many elements are in it? No, really. Stop here and answer the question.
...
Two!
   (A B)
    ^ ^
___/   \___
/          \
|          |
element    element
one        two
What are the elements? Two atoms: A and B . Note that, as the name "atom" suggests, they can't be broken down into further elements.
How about the second one?
(A B C)
It's also a list, but this one has three elements in it, again all atoms: A, B, and C.
Let's take the third one, which is more confusing:
(A (B C))
If you've been keeping track, this is also a list. How many elements does it have? This one is trickier.
...
Two! Two elements.
   (A (B C))
    ^ \___/
   /     \
  /       \
 /         \
 |          |
element   element
one       two
The first element is A , and the second element is (B C) . But wait, what are their types?
A is an atom, but (B C) is a list! So we recurse, and talk about (B C). It's a list, with two elements: B and C. Both of these are atoms, so we're done.
So now you should understand lists a little better, even if the question from whatever book you're learning from doesn't quite make sense. But now it doesn't make sense because it's ill-defined, not because you don't understand the concepts.
Extra credit! List the types of the elements in this list, and if they're lists, keep going!
(A ((B C) D) (E F))
Let's call A = Fred
(B C) = George
and (A (B C)) = Ginny.
How many lists are Ginny? Just one. What does Ginny consist of? One list, George, and one atom, Fred.
Hope that helps.
PS: Don't over-think it.
Number of lists = number of paren pairs. Number of atoms = number of everything else. Let's apply it:
(A B) : 1 list, 2 atoms
(A B C) : 1 list, 3 atoms
(A (B C)) : 2 lists, 3 atoms
(((1))) : 3 lists, 1 atom
(A . B) : 1 (improper) list, 2 atoms
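The counting rule above can be sketched as a small recursive function. Here nested Python lists stand in for Lisp lists and anything that is not a list counts as an atom; this is an assumption of the sketch, so the dotted pair (A . B) has no counterpart and is left out. `count_lists_and_atoms` is a made-up name.

```python
def count_lists_and_atoms(expr):
    """Return (number of lists, number of atoms) in a nested list."""
    if not isinstance(expr, list):
        return (0, 1)                 # an atom: no lists, one atom
    lists, atoms = 1, 0               # this list itself counts once
    for element in expr:
        inner_lists, inner_atoms = count_lists_and_atoms(element)
        lists += inner_lists
        atoms += inner_atoms
    return (lists, atoms)

print(count_lists_and_atoms(["A", "B"]))         # (1, 2)
print(count_lists_and_atoms(["A", "B", "C"]))    # (1, 3)
print(count_lists_and_atoms(["A", ["B", "C"]]))  # (2, 3)
print(count_lists_and_atoms([[[1]]]))            # (3, 1)
```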