MapReduce Common friends pseudocode - mapreduce

I am currently trying to develop some MapReduce pseudocode for calculating the number of friends in common each user of a particular site has. However, friendships are not mutual and therein lies my problem.
My input consists of a list of users and their respective friends in the form:
A -> (B C E), B -> (C F H) and so on.
The pseudocode I have developed so far is
MAP (String Input_key, String Input_Value):
    // Input_key: person
    // Input_Value: list of their friends
    for each pair in Input_Value:
        Emit (pair, 1)
REDUCE (String Intermediate_key, Iterator Intermediate_values):
    // Intermediate_key: pair of friends
    // Intermediate_values: list of '1's corresponding to the number of people who are friends with both in the pair
    int count = 0;
    for each value v in Intermediate_values:
        count += ParseInt (v);
    Emit (Intermediate_key, count);
However, I do not think the above code is suitable for the situation where friendships are not mutual (just because A is friends with the pair B and C does not mean that B and C are friends with A).
Can anybody shed some light on this or point me in the right direction to solve this?
Any help greatly appreciated.
Thanks in advance.
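One possible reading of "friends in common" with one-way friendships is: for a pair of users (X, Y), count the people that both X and Y list as friends. Under that assumption, a minimal sketch is to invert the lists first (key on the friend, not on the user), then pair up the users who share that friend. The function names below are hypothetical, and the shuffle between map and reduce is simulated in memory with a dictionary.

```python
from collections import defaultdict
from itertools import combinations


def map_invert(user, friends):
    """Map step: for every friend a user lists, emit (friend, user)."""
    for friend in friends:
        yield friend, user


def reduce_pairs(friend, users):
    """Reduce step: every pair of users who both list `friend` shares one friend."""
    for pair in combinations(sorted(users), 2):
        yield pair, 1


def count_common_friends(adjacency):
    """Simulate the job: shuffle by friend, pair up the listers, sum the 1s."""
    listed_by = defaultdict(list)
    for user, friends in adjacency.items():
        for friend, lister in map_invert(user, friends):
            listed_by[friend].append(lister)

    counts = defaultdict(int)
    for friend, users in listed_by.items():
        for pair, one in reduce_pairs(friend, users):
            counts[pair] += one  # in real MapReduce this sum is a second reduce
    return dict(counts)


if __name__ == "__main__":
    print(count_common_friends({"A": ["B", "C", "E"], "B": ["C", "F", "H"]}))
```

With the sample input, only C is listed by both A and B, so the pair (A, B) is the only one reported. The pseudocode in the question instead counts, for a pair (B, C), how many users list both B and C, which is a different quantity when friendships are not mutual.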

Related

How to implement the division of two relations in mapreduce?

I want to implement the division of two relations in MapReduce. I have two relations: T(A,B,C) and U(B,C). I know that
for the relations R(A,B,D) and S(A,B). This is pretty much my scenario. I am not sure how I would go about implementing this in MapReduce. With my limited knowledge, I'm guessing there would be three MapReduce jobs. I would assume the first round might be the (projection of B -C(T) x U) - T
Mapper 1: our input is either a tuple from T or a tuple from U.
If tuple t is (a,b,c) from T, then we emit key NULL and value ("T" a)
If tuple t is (b,c) from U, then we emit key NULL and value (b,c "U")
With these values we can perform the Cartesian product between ("T" a) and the values (b,c "U") and emit the new key NULL and value (a,b,c)
Reducer 2
We remove from the new Cartesian tuples any that are in the original table T and emit the tuples that are not contained in the original table.
I am confused about what I would do next: would it be another mapper, or could I use a reducer again for the next B -C projection? I'm not sure if I did the first round correctly. If anyone can tell me the steps for how this would go, preferably in pseudocode, that would help me understand. I have not found any answers for this online.
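As a point of comparison, here is a minimal single-job sketch of T(A,B,C) divided by U(B,C). It is not the three-job plan described above: it assumes U is small enough to be handed to every reducer (for example via a distributed cache), groups T by A in the map step, and keeps an A value only if its (B,C) pairs cover all of U. The function name `divide` is hypothetical and the shuffle is simulated in memory.

```python
from collections import defaultdict


def divide(t_rows, u_rows):
    """Return the A values of T(A,B,C) that are paired with every (B,C) in U."""
    required = set(u_rows)  # assumption: U fits in memory on each reducer

    # Map step: key each T tuple by a, with (b, c) as the value.
    groups = defaultdict(set)
    for a, b, c in t_rows:
        groups[a].add((b, c))

    # Reduce step: emit a only if its (b, c) values contain all of U.
    return sorted(a for a, pairs in groups.items() if required <= pairs)


if __name__ == "__main__":
    T = [(1, "x", "y"), (1, "p", "q"), (2, "x", "y")]
    U = [("x", "y"), ("p", "q")]
    print(divide(T, U))
```

If U is too large to ship to each reducer, a multi-job plan built from the Cartesian product and set difference, as in the question, is the more general route.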

Finding Shortest Distance on Graph -- Logic Issue

In my C++ course we have been working on graphs for a while, and there's a certain question that I've been stuck on for quite some time. The teacher gave us a program that created a graph of integers and then was able to find the shortest path between two integers in the graph. Our job was to adapt this to work for strings (specifically, find the shortest path only jumping to words that have 1 different letter than the previous word e.g bears -> beard).
Here is a sample of what I would expect my program to do:
Given the list [board, beard, bears, brand, boars, bland, blank]
it would create an edge matrix that resembled this:
board | beard boars
beard | board bears
bears | beard boars
brand | bland
boars | board bears
bland | brand blank
blank | bland
And then if asked to find the distance between board & bears it would output:
board->beard->bears
The way I adapted my program is that it creates a graph of a struct named 'node', which contains a number and a word. I use the number to compare the order within one vector to other variables, and the word to create the path. My adapted program successfully creates the graph from data in a text file and connects all words that have a 1 letter difference. However, when I run my function to find the shortest distance, it bypasses my edges and simply prints out that the start word and end word are a distance of 1 apart.
I will post my full, compile-able program below and explain what I know about the problem.
Here are two pastebin links (I do not have a high enough reputation yet to post more than two links, so I must combine them). The first is my full program; I have adapted it to use a set of words that I know are a word distance of 1 apart rather than a text file.
http://pastebin.com/W7HRZG2v
This second link is a download of the code my teacher gave (in case you wish to see a working version of the program)
I've narrowed the problem down to how I'm filling the vector "parents". Somehow it isn't generating properly and is creating an issue when the program tries to retrieve a path from it.
Here is a link to a photo (reputation not high enough to post images yet) comparing what the parents vector looks like in my teacher's "healthy" program (find distance between 2 & 5) to the parents vector in my program:
http://puu.sh/95zQI/26e9b83b9a.png
Notice how in my teacher's, 2 and 4, both integers used in the path, are present in the parents vector and called on to create it.
Notice how in mine the only word present in the parents vector is the beginning word, and hence it is the only word available to call on. However when comparing the way my teacher filled parents with the way I do, there are no differences I can see, aside from the fact that my parents is a string so I am entering a word instead of a number:
(my adapted version is on the left, teacher's is on the right)
if (distanceNodes[edgeNum] > distanceNodes[currNum] + 1)  |  if (distanceNodes[edge] > distanceNodes[curr] + 1)
{                                                         |  {
    distanceNodes[edgeNum] = distanceNodes[currNum] + 1;  |      distanceNodes[edge] = distanceNodes[curr] + 1;
    parents[edgeNum] = curr->word;                        |      parents[edge] = curr;
}                                                         |  }
If someone more proficient in graph applications could look at this and assist me, I would be extremely grateful. I've been stuck on this problem for over a week, and the only tip my teacher will give me is that I should compare my program to his line by line. I did that and I still can't find the problem; I'm about ready to give up.
If you can help me, thank you very much,
Tristan
Here:
node * one = createNewNode(1, "board");
...
node * three = createNewNode(3, "bears");
...
insertEdge(&g, one, three);
The program correctly finds the edge you put there.
More generally, you must learn to step through your code and see what's happening.
And don't use global variables if you can help it.
EDIT:
I had some free time, so here's another problem:
int currNum = start->num;
while (! inTree[currNum])
{
    ...
    parents[edgeNum] = curr->word;
    ...
}
You iterate by number, but you look things up by a pointer which you never update.
I'm sure there are other problems. The bottom line is that you're not checking things. For some reason, testing, which real coders do all the time, is never taught in programming courses.
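To illustrate the point about keeping the current node and the parents table in sync, here is a minimal breadth-first sketch in Python (not the teacher's or the asker's C++ program). The names `differ_by_one` and `shortest_path` are hypothetical. The word taken off the queue is the same value used to record each parent, so there is no separate pointer that can go stale.

```python
from collections import deque


def differ_by_one(a, b):
    """True when two equal-length words differ in exactly one position."""
    return len(a) == len(b) and sum(x != y for x, y in zip(a, b)) == 1


def shortest_path(words, start, goal):
    """Breadth-first search over the implicit one-letter-difference graph."""
    parents = {start: None}  # word -> the word it was reached from
    queue = deque([start])
    while queue:
        curr = queue.popleft()  # curr is updated on every iteration
        if curr == goal:
            break
        for word in words:
            if word not in parents and differ_by_one(curr, word):
                parents[word] = curr
                queue.append(word)

    if goal not in parents:
        return None  # goal is not reachable from start

    # Walk the parents table backwards from the goal to rebuild the path.
    path = []
    node = goal
    while node is not None:
        path.append(node)
        node = parents[node]
    return path[::-1]


if __name__ == "__main__":
    words = ["board", "beard", "bears", "brand", "boars", "bland", "blank"]
    print("->".join(shortest_path(words, "board", "bears")))
```

Printing the parents table after each iteration of the loop is a quick way to check that every word on the expected path actually gets recorded.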

calculating "levenshtein social network" *very* efficiently

I'm doing a code challenge online involving finding the 'social network' of words that are related through their Levenshtein distances. My Levenshtein function is correct. I'm recursively adding to a global set, and I'm using a map of tuples to boolean values to cache whether or not any pair of words has a Levenshtein distance of 1. The code is supposed to terminate in 5 seconds. I'm not sure how this is even close to possible. I'm sure that there is some aha insight that makes this possible. Can anyone see that right off the bat?
Problem Statement:
Two words are friends if they have a Levenshtein distance of 1. That is, you can add, remove, or substitute exactly one letter in word X to create word Y. A word’s social network consists of all of its friends, plus all of their friends, and all of their friends’ friends, and so on. Write a program to tell us how big the social network for the word 'hello' is, using this word list
My pseudocode:
get_network(friend)
    if friend not in network
        add friend to network
        friends = []
        check friend against all words in network
            consult cache or calculate lev distance
            cache if necessary, append to friends if necessary
        for all friends
            get_network(friend)
To rephrase the question: "what's the fundamental insight that makes possible an astronomical boost in efficiency?"
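One commonly used idea for this kind of problem, offered here as a sketch and not as the challenge's intended solution, is to stop comparing word pairs altogether. For each word, generate every string at edit distance 1 (insertions, deletions, substitutions) and keep only those present in a hash set of the word list. The cost per word then depends on the word length and alphabet size, not on the size of the dictionary. The names `neighbors` and `network` are hypothetical, and a lowercase a-z alphabet is assumed.

```python
from collections import deque
import string


def neighbors(word, dictionary, alphabet=string.ascii_lowercase):
    """All dictionary words exactly one insert, delete or substitution away."""
    candidates = set()
    for i in range(len(word) + 1):
        for ch in alphabet:
            candidates.add(word[:i] + ch + word[i:])  # insertion at position i
        if i < len(word):
            candidates.add(word[:i] + word[i + 1:])  # deletion of position i
            for ch in alphabet:
                candidates.add(word[:i] + ch + word[i + 1:])  # substitution
    candidates.discard(word)  # a substitution with the same letter is not a friend
    return candidates & dictionary


def network(start, dictionary):
    """Iterative breadth-first expansion; avoids deep recursion on large lists."""
    seen = {start}
    queue = deque([start])
    while queue:
        for friend in neighbors(queue.popleft(), dictionary):
            if friend not in seen:
                seen.add(friend)
                queue.append(friend)
    return seen


if __name__ == "__main__":
    words = {"hello", "hallo", "hell", "help", "shell", "world"}
    print(sorted(network("hello", words)))
```

Whether the start word itself counts towards the network size is a matter of how the problem statement is read; the set returned here includes it.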

Prolog - Term replacement, Term alteration in workflow graphs

In this link ( Meta Interpreter ) I believe I have found a nifty way of solving a problem I have to tackle, but since my Prolog is very bad, I'd first like to ask whether what I have in mind is even possible.
I want to transform certain parts of a workflow/graph depending on a set of rules. A graph basically consists of sequences (a->b) and split/joins, which are either parallel or conditional, i.e. two steps run in parallel in the workflow or a single branch is picked depending on a condition (the condition itself does not matter on this level) (parallel-split - (a && b) - parallel-join) etc. Now, a graph usually has nodes and edges; by representing the graph as terms I want to get rid of the edges.
Furthermore each node has a partner attribute, specifying who will execute it.
I'll try to give a simple example of what I want to achieve:
A node called A, executed by a partner X, connected with a node called B, executed by a partner Y.
A_X -> B_Y
seq((A,X),(B,Y))
If I detect a pattern like this, i.e. two steps in sequence with different partners, I want this to be replaced with:
A_X -> Send_(X-Y) -> Receive_(Y-X) -> B_Y // send step from X to Y and a receive step at Y waiting for something from X
seq((A,X), seq(send(X-Y), seq(receive(Y-X), (B,Y))))
If anyone could give me some pointers or help to come up with a solution I would be very thankful!
A graph basically consists of sequences (a->b) and split/joins, which are either parallel or conditional, i.e. two steps run in parallel in the workflow or a single branch is picked depending on a condition
This sounds an awful lot like an and/or graph. Prolog algorithms on these graphs are covered by Ivan Bratko in Prolog Programming for Artificial Intelligence, chapter 13. Even if your graphs aren't really and/or graphs, you may be able to adapt some of these algorithms to your task.
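The rewrite rule from the question is a recursive walk over nested terms, and that shape can be sketched outside Prolog too. The following Python sketch uses tuples as stand-ins for terms: ("step", name, partner), ("seq", left, right), ("send", x, y) and ("receive", y, x). This encoding and the function names are assumptions made for illustration; in Prolog each case would become a clause of a transformation predicate. Split/join terms are not handled.

```python
def first_partner(term):
    """Partner of the first step in a term, or None if it is not a plain step."""
    if term[0] == "step":
        return term[2]
    if term[0] == "seq":
        return first_partner(term[1])
    return None


def last_partner(term):
    """Partner of the last step in a term, or None if it is not a plain step."""
    if term[0] == "step":
        return term[2]
    if term[0] == "seq":
        return last_partner(term[2])
    return None


def insert_messages(term):
    """Insert send/receive steps wherever a sequence crosses partners."""
    if term[0] != "seq":
        return term
    left = insert_messages(term[1])
    right = insert_messages(term[2])
    x, y = last_partner(left), first_partner(right)
    if x is not None and y is not None and x != y:
        right = ("seq", ("send", x, y), ("seq", ("receive", y, x), right))
    return ("seq", left, right)


if __name__ == "__main__":
    workflow = ("seq", ("step", "A", "X"), ("step", "B", "Y"))
    print(insert_messages(workflow))
```

A sequence whose two steps share the same partner is returned unchanged, which corresponds to the rule simply not matching.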

recursively find subsets

Here is a recursive function that I'm trying to create that finds all the subsets of an STL set that is passed in. The two parameters are an STL set to search for subsets, and a number i >= 0 which specifies how big the subsets should be. If the integer is bigger than the set, it should return an empty subset.
I don't think I'm doing this correctly. Sometimes it's right, sometimes it's not. The STL set gets passed in fine.
list<set<int> > findSub(set<int>& inset, int i)
{
    list<set<int> > the_list;
    list<set<int> >::iterator el = the_list.begin();
    if(inset.size()>i)
    {
        set<int> tmp_set;
        for(int j(0); j<=i;j++)
        {
            set<int>::iterator first = inset.begin();
            tmp_set.insert(*(first));
            the_list.push_back(tmp_set);
            inset.erase(first);
        }
        the_list.splice(el,findSub(inset,i));
    }
    return the_list;
}
From what I understand, you are actually trying to generate all subsets of 'i' elements from a given set, right?
Modifying the input set is going to get you into trouble; you'd be better off not modifying it.
I think that the idea is simple enough, though I would say that you got it backwards. Since it looks like homework, I won't give you a C++ algorithm ;)
generate_subsets(set, sizeOfSubsets)  # I assume sizeOfSubsets cannot be negative
                                      # use a type that enforces this for god's sake!
    if sizeOfSubsets is 0 then return {}
    else if sizeOfSubsets is 1 then
        result = []
        for each element in set do result <- result + {element}
        return result
    else
        result = []
        baseSubsets = generate_subsets(set, sizeOfSubsets - 1)
        for each subset in baseSubsets
            for each element in set
                if element not in subset then result <- result + { subset + element }
        return result
The key points are:
generate the subsets of lower rank first, as you'll have to iterate over them
don't try to insert an element into a subset that already contains it; that would give you a subset of incorrect size
Now, you'll have to understand this and transpose it to 'real' code.
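For comparison, here is a minimal Python sketch of a different recursion from the pseudocode above: each element is either included in the subset or left out, and the input is never modified. The function name `subsets_of_size` is hypothetical, and the elements are assumed to be sortable. Because each subset is built in one fixed order, no duplicates are produced.

```python
def subsets_of_size(items, k):
    """Return all k-element subsets of items without modifying the input."""
    items = sorted(items)  # work on a sorted copy, not on the caller's set
    if k < 0 or k > len(items):
        return []  # no subset of that size exists
    if k == 0:
        return [set()]  # exactly one subset of size zero: the empty set

    first, rest = items[0], items[1:]
    # Subsets that contain the first element, then subsets that do not.
    with_first = [{first} | subset for subset in subsets_of_size(rest, k - 1)]
    without_first = subsets_of_size(rest, k)
    return with_first + without_first


if __name__ == "__main__":
    print(subsets_of_size({1, 2, 3}, 2))
```

The same structure carries over to C++ by passing a pair of iterators or an index into the set, so that nothing has to be erased from it.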
I have been staring at this for several minutes and I can't figure out what your train of thought is for thinking that it would work. You are permanently removing several members of the input list before exploring every possible subset that they could participate in.
Try working out the solution you intend in pseudo-code and see if you can see the problem without the stl interfering.
It seems (I'm not a native English speaker) that what you could do is compute the power set (the set of all subsets) and then select only the subsets that match your condition.
You can find methods for calculating the power set on the Wikipedia "Power set" page and on Math Is Fun (the link is in the External links section of that Wikipedia page, named "Power Set from Math Is Fun"; I cannot post it here directly because of the spam prevention mechanism). On Math Is Fun, see mainly the section "It's binary".
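The binary view of the power set can be sketched as follows: every integer from 0 to 2^n - 1 is read as a bitmask in which bit i says whether element i is in the subset, and only masks with exactly k bits set are kept. The function name `subsets_by_bitmask` is hypothetical. This visits all 2^n masks, so it is only practical for small sets.

```python
def subsets_by_bitmask(items, k):
    """Enumerate the power set as bitmasks and keep the subsets of size k."""
    items = sorted(items)
    n = len(items)
    result = []
    for mask in range(1 << n):  # one mask per subset of the power set
        if bin(mask).count("1") == k:  # number of chosen elements
            result.append({items[i] for i in range(n) if (mask >> i) & 1})
    return result


if __name__ == "__main__":
    print(subsets_by_bitmask({1, 2, 3}, 2))
```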
I also can't see what this is supposed to achieve.
If this isn't homework with specific restrictions, I'd simply suggest testing against a temporary std::set with std::includes().