How to compute mutual information with MapReduce in R?

I want to compute the mutual information I(x, y) for every pair of features x and y.
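For reference, this is the standard definition being computed:

I(X;Y) = \sum_{x,y} P(x,y) \log \frac{P(x,y)}{P(x)\,P(y)}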
To do that I need to compute P(x), P(y), and P(x, y) from the data. For example:
X    Y
---  ---
yes  2
no   2
yes  2
no   1
yes  1
P(yes) = 3/5, P(2) = 3/5, P(yes, 2) = 2/5
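For this tiny example the full joint distribution is P(yes,2) = 2/5, P(yes,1) = 1/5, P(no,2) = 1/5, P(no,1) = 1/5, so plugging into the definition above gives (my worked arithmetic, in nats):

I(X;Y) = \tfrac{2}{5}\log\tfrac{10}{9} + \tfrac{1}{5}\log\tfrac{5}{6} + \tfrac{1}{5}\log\tfrac{5}{6} + \tfrac{1}{5}\log\tfrac{5}{4} \approx 0.0138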
Counting in MapReduce is easy, and I did it for P(x) and P(y), but for P(x, y) I want to count the co-occurrences within each record.
I wrote a map function:
mapper <- function(key, line) {
  fvec <- unlist(strsplit(line, split = " "))
  n <- length(fvec)  # number of features in this record (don't hard-code 55/56)
  pairs <- character(0)
  for (i in 1:(n - 1)) {
    for (j in (i + 1):n) {
      pairs <- c(pairs, paste0(fvec[i], ",", fvec[j]))
    }
  }
  keyval(c(fvec, pairs), 1)  # emit every single feature and every pair with count 1
}
"fvec" is a vector of features as first row in our example :
fvec[1]="yes" "2"
in for loops I want to concatenate this features so that
fvec[1]="yes" "2" "yes2"
in Order to count the Occurance of yes-2 together in reduce func :
reduce <- function(k, v) {
  return(keyval(k, length(v)))
}
but because of the "loop problem" in Hadoop it does not work properly. Please help me with an R solution to handle it. :)

Related

Given N lines on a Cartesian plane. How to find the bottommost intersection of lines efficiently?

I have N distinct lines on a Cartesian plane. Since the slope-intercept form of a line is y = mx + c, the slope and y-intercept of each line are given. I have to find the y-coordinate of the bottommost intersection of any two lines.
I have implemented an O(N^2) brute-force solution in C++, which is too slow for N = 10^5. Here is my code:
#include <iostream>
#include <vector>
#include <algorithm>
using namespace std;

int main() {
    int n;
    cin >> n;
    vector<pair<int, int>> lines(n);  // (slope, y-intercept) of each line
    for (int i = 0; i < n; ++i) {
        int slope, y_intercept;
        cin >> slope >> y_intercept;
        lines[i].first = slope;
        lines[i].second = y_intercept;
    }
    double min_y = 1e9;
    for (int i = 0; i < n; ++i) {
        for (int j = i + 1; j < n; ++j) {
            // Since the lines are distinct, two lines with the same slope
            // are parallel and will never intersect.
            if (lines[i].first == lines[j].first)
                continue;
            // x- and y-coordinates of the intersection point of lines i and j
            double x = (double) (lines[j].second - lines[i].second) / (lines[i].first - lines[j].first);
            double y = lines[i].first * x + lines[i].second;
            min_y = min(y, min_y);
        }
    }
    cout << min_y << endl;
}
How to solve this efficiently?
In case you are considering solving this by means of linear programming (LP), it can be done efficiently, since the solution that minimizes or maximizes the objective function always lies at an intersection of the constraint equations. I will show you how to model this problem as a maximization LP. Suppose you have N = 2 first-degree equations to consider:
y = 2x + 3
y = -4x + 7
then you will set up your simplex tableau like this:
x0  x1  x2  x3   b
-2   1   1   0   3
 4   1   0   1   7
where column x0 holds the negated coefficient of x from each original first-degree function, column x1 holds the coefficient of y (generally +1), x2 and x3 form the N-by-N identity matrix (they are the slack variables), and b holds the value of the independent term. In this case, the constraints are subject to the <= operator.
Now, the objective function should be:
x0  x1  x2  x3
 1   1   0   0
To solve this LP you may use the simplex algorithm, which is generally efficient.
The result will be an array holding the value assigned to each variable. In this scenario the solution is:
x0             x1             x2    x3
0.6666666667   4.3333333333   0.0   0.0
The pair (x0, x1) represents the point you are looking for, where x0 is its x-coordinate and x1 is its y-coordinate. There are other possible outcomes; for example, there could be no solution at all. You can find out more in books such as "Linear Programming and Extensions" by George Dantzig.
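As a quick sanity check (my verification, independent of the simplex), intersecting the two example lines directly gives the same point:

2x + 3 = -4x + 7 \;\Rightarrow\; 6x = 4 \;\Rightarrow\; x = \tfrac{2}{3} \approx 0.6667, \qquad y = 2 \cdot \tfrac{2}{3} + 3 = \tfrac{13}{3} \approx 4.3333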
Keep in mind that the simplex algorithm only works for nonnegative values of x0, x1, ..., xn. This means that before applying it, you must make sure the optimum point you are looking for does not lie outside the feasible region.
EDIT 2:
I believe the problem can easily be made feasible in O(N) by shifting the original functions to a new position, i.e. by adding a big constant to the independent term of each function. Check my comment below. (EDIT 3: this implies it won't work for every possible scenario, though it's quite easy to implement. If you want an exact answer for any possible scenario, check the following explanation on how to convert the infeasible quadrants into the feasible one and back.)
EDIT 3:
A better method to address this problem, one that can precisely find the minimum point even if it lies on the negative side of the x- or y-axis: convert the other 3 quadrants into quadrant 1.
Consider the following generic first-degree function template:
f(x) = mx + k
Consider the following generic Cartesian plane point template:
p = (p0, p1)
Converting a function and a point from y-negative quadrants to y-positive:
y_negative_to_y_positive( f(x) ) = -mx - k
y_negative_to_y_positive( p ) = (p0, -p1)
Converting a function and a point from x-negative quadrants to x-positive:
x_negative_to_x_positive( f(x) ) = -mx + k
x_negative_to_x_positive( p ) = (-p0, p1)
Summarizing:
Quadrant     sign of (x, y)   conversion of f(x) or p to Q1
Quadrant 1   (+, +)           f(x)
Quadrant 2   (-, +)           x_negative_to_x_positive( f(x) )
Quadrant 3   (-, -)           y_negative_to_y_positive( x_negative_to_x_positive( f(x) ) )
Quadrant 4   (+, -)           y_negative_to_y_positive( f(x) )
Now convert the functions from quadrants 2, 3 and 4 into quadrant 1. Run the simplex 4 times: once on the original quadrant 1 and three more times on the converted quadrants 2, 3 and 4. For the cases originating from a y-negative quadrant, you will need to model the simplex as a minimization instance with negative slack variables, which turns the constraints into the >= format. I will leave the details of modelling the same problem as a minimization task to you.
Once you have the results for each quadrant, you will have at hand at most 4 points (you might find, for example, that there is no point in a specific quadrant). Convert each of them back to its original quadrant, applying the conversions in reverse.
Now you may freely compare the 4 points with each other and decide which one is the one you need.
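As a small illustration of these conversions (my sketch; the Line and Point types are mine, not part of the original answer):

#include <cstdio>

// A line f(x) = m*x + k and a point p = (p0, p1).
struct Line  { double m, k; };
struct Point { double p0, p1; };

// Mirror in the x-axis: maps the y-negative half-plane onto y-positive.
Line  y_neg_to_y_pos(Line f)  { return { -f.m, -f.k }; }
Point y_neg_to_y_pos(Point p) { return { p.p0, -p.p1 }; }

// Mirror in the y-axis: maps the x-negative half-plane onto x-positive.
Line  x_neg_to_x_pos(Line f)  { return { -f.m, f.k }; }
Point x_neg_to_x_pos(Point p) { return { -p.p0, p.p1 }; }

int main() {
    // Quadrant 3 case: apply both mirrors before running the simplex, then
    // apply the same two mirrors to the returned point to map it back.
    Line f = { 2, 3 };                           // f(x) = 2x + 3
    Line g = y_neg_to_y_pos(x_neg_to_x_pos(f));  // g(x) = 2x - 3
    std::printf("g(x) = %gx + %g\n", g.m, g.k);
}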
EDIT 1:
Note that the quantity N of first-degree functions may be as large as you wish.
Other methods for solving this problem could be better.
EDIT 3: Check out the complexity of the simplex method; in the average case it works efficiently.
Cheers!

How to use pymc to parameterize a probabilistic graphical model?

How can one use pymc to parameterize a probabilistic graphical model?
Suppose I have a PGM with two nodes X and Y.
Let's say X -> Y is the graph, where X takes two values {0,1} and Y also takes two values {0,1}.
I want to use pymc to learn the parameters of the distribution and populate the graphical model with them for running inferences.
The way I could think of doing it is as follows:
X_p = pm.Uniform("X_p", 0, 1)                              # prior on P(X = 1)
X = pm.Bernoulli("X", X_p, value=X_Vals, observed=True)
Y0_p = pm.Uniform("Y0_p", 0, 1)                            # prior on P(Y = 1 | X = 0)
Y0 = pm.Bernoulli("Y0", Y0_p, value=Y0Vals, observed=True)
Y1_p = pm.Uniform("Y1_p", 0, 1)                            # prior on P(Y = 1 | X = 1)
Y1 = pm.Bernoulli("Y1", Y1_p, value=Y1Vals, observed=True)
Here Y0Vals are the values of Y corresponding to X = 0, and Y1Vals are the values of Y corresponding to X = 1.
The plan is to draw MCMC samples from these and use the means of Y0_p and Y1_p to populate the discrete Bayesian network's probability tables. Since pm.Bernoulli(p) assigns probability p to the value 1, the table for P(X) is (1 - X_p, X_p) for X = (0, 1), while that of P(Y|X) is:
        Y = 0      Y = 1
X = 0   1 - Y0_p   Y0_p
X = 1   1 - Y1_p   Y1_p
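Concretely, the plan amounts to using the posterior means as plug-in estimates (my restatement of the above):

\hat{P}(X=1) = \mathbb{E}[X_p], \qquad \hat{P}(Y=1 \mid X=0) = \mathbb{E}[Y0_p], \qquad \hat{P}(Y=1 \mid X=1) = \mathbb{E}[Y1_p]

where the expectations are taken over the MCMC samples.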
Questions:
Is this the correct way of doing it?
Doesn't this get clumsy, especially if X takes hundreds of discrete values, or if a variable has two parents X and Y with 10 discrete values each?
Is there something better I can do?
Are there any good books that detail how to do this kind of interconnection?

Bellman-Ford algorithm in a 2D array

I've got a problem applying the Bellman-Ford algorithm to a 2D array (not to a graph).
The input array has m x n dimensions:
s[1,1] s[1,2] ... s[1,n] -> Exit
s[2,1] s[2,2] ... s[2,n]
...
Entry -> s[m,1] s[m,2] ... s[m,n]
And it is room-like (each entry is a room with entrance cost s[x,y]). A room can also have a negative cost, and we have to find the cheapest path from the Entry, through a chosen room, to the Exit.
For example, we've got this array of rooms and costs:
1 5 6
2 -3 4
5 2 -8
And we want to walk through room [3,2], s[3,2] = 4 (indexing here is [column, row]). We start from the 5 at [1,3] and must pass through [3,2] before we go to [3,3].
And my question is: what is the best way to implement this with the Bellman-Ford algorithm? I know that Dijkstra's algorithm will not work because of the negative costs.
Is iterating over every room and relaxing all of its neighbours correct? Like this:
for (int i = height - 1; i >= 0; --i) {
    for (int j = 0; j < width; ++j) {
        int x = i;
        int y = j;
        if (x > 0)              // up
            Relax(x, y, x - 1, y);
        if (y + 1 < width)      // right
            Relax(x, y, x, y + 1);
        if (y > 0)              // left
            Relax(x, y, x, y - 1);
        if (x + 1 < height)     // down
            Relax(x, y, x + 1, y);
    }
}
But how can I then read the cost to the chosen room and from that room to the exit?
If you already know how to treat the array as a graph, you can scroll down to the additional condition paragraph, but read the next paragraph too.
In fact, you can look at that building as a graph.
You can see it like this: (image of the grid viewed as a graph omitted; I forgot the doors in the second line, sorry.)
So, how can it be implemented? Ignore the additional condition (visiting a particular vertex before leaving) for the moment.
Weight function:
Let S[][] be the array of entrance costs. Notice that the weight of an edge is decided only by the vertex at its end; it doesn't matter whether the edge is (1,2) -> (1,3) or (2,3) -> (1,3), the cost is defined by the second vertex. So the function may look like:
cost_type cost(vertex v, vertex w) {
    // As you can see, the first argument is unnecessary:
    // the weight is determined by the destination alone.
    return S[w.y][w.x];
}
Edges:
In fact you don't have to keep all the edges in an array; you can compute them in a function every time you need them.
The neighbours of vertex (x, y) are (x+1, y), (x-1, y), (x, y+1) and (x, y-1), provided those nodes exist. You have to check that, but it's easy (check that the new coordinates are within bounds). It may look like this:
// Size of the matrix is M x N; coordinates are 1-based.
bool is_correct(vertex w) {
    if (w.y < 1 || w.y > M || w.x < 1 || w.x > N) {
        return INCORRECT;
    }
    return CORRECT;
}
Generating the neighbours can look like this:
std::tie(x, y) = std::make_tuple(v.x, v.y);
for (vertex w : {vertex{x + 1, y}, vertex{x - 1, y}, vertex{x, y + 1}, vertex{x, y - 1}}) {
    if (is_correct(w) == CORRECT) {  // CORRECT may simply be true
        relax(v, w);
    }
}
I believe this shouldn't take extra memory for the four edges. If you don't know std::tie, look it up on cppreference. (The extra variables x and y take a bit more memory, but I believe the code is more readable this way; they need not appear in your version.)
Obviously you also need another 2D array for the distances and (if necessary) one for the predecessors, but I think that's clear and I don't have to describe it.
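For concreteness, here is a sketch of how the distance array and the relax step might look (my illustration; the vertex type, dimensions and cost function are repeated from the snippets above so it stands alone):

// Assumed from the snippets above, repeated so this sketch is self-contained.
const int M = 3, N = 3;
struct vertex { int x, y; };
double S[M + 1][N + 1];      // entrance costs, 1-based
double cost(vertex, vertex w) { return S[w.y][w.x]; }

const double INF = 1e18;     // larger than any real path cost
double dist[M + 1][N + 1];   // shortest known distances; init to INF, 0 at the start
vertex pred[M + 1][N + 1];   // optional, for recovering the actual path

// One relaxation step: try to improve the best known distance to w via v.
void relax(vertex v, vertex w) {
    if (dist[v.y][v.x] + cost(v, w) < dist[w.y][w.x]) {
        dist[w.y][w.x] = dist[v.y][v.x] + cost(v, w);
        pred[w.y][w.x] = v;
    }
}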
Additional condition:
You want to know the cost from the entry to the exit, but you must visit a particular vertex on the way. The easiest way is to calculate the cost from the entry to the compulsory vertex and the cost from the compulsory vertex to the exit (two separate calculations) and add the results; this does not change the big-O time.
You just have to guarantee that it's impossible to visit the exit before the compulsory vertex. That's easy: erase the outgoing edges of the exit, either with an extra check in the is_correct function (then the vertex v becomes a necessary argument there) or in the neighbour-generating fragment.
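Put together, the two-phase computation might look like this (a sketch; bellman_ford here is a hypothetical helper, not defined in the answer, that runs the full relaxation loop from a given start vertex and returns the distance table):

#include <vector>

struct vertex { int x, y; };

// Hypothetical helper: runs |V|-1 rounds of relax() over every cell starting
// from `start` and returns the table of shortest distances dist[y][x].
std::vector<std::vector<double>> bellman_ford(vertex start);

double cheapest_walk(vertex entry, vertex compulsory, vertex exit) {
    auto d1 = bellman_ford(entry);       // entry -> everywhere
    auto d2 = bellman_ford(compulsory);  // compulsory -> everywhere

    // With the cost convention above you pay for every room you enter but not
    // for the room you start in, so the compulsory room is charged exactly once.
    return d1[compulsory.y][compulsory.x] + d2[exit.y][exit.x];
}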
Now you can implement it based on the Wikipedia description. You have a graph.
Why you shouldn't stop there:
A better way is to run the Bellman-Ford algorithm from a different vertex. Notice that if you know the optimal path from A to B, you also know the optimal path from B to A. Why? You always pay for the last vertex and never for the first, so the endpoint costs can be ignored; the rest is symmetric.
Now, if you want to know the paths A -> B and B -> C, you can calculate B -> A and B -> C with a single Bellman-Ford run from node B and then reverse the path B -> A. That's it.
You just have to erase the outgoing edges of the entry and exit nodes.
However, if you need a very fast algorithm, you will have to optimize further; that is a topic for another question, I think, and it looks like no one here is interested in heavy optimization anyway.
One small and easy optimization I can quickly add: you can skip relaxations from vertices that are still too far away to matter; in an array, distances are easy to calculate, so this is a pleasant optimization.
I have not mentioned the well-known optimizations, since I believe all of them can be found around the web.

Adjacency matrix and Bron–Kerbosch algorithm

I want to find maximum cliques in a graph that is given to me in the form of an adjacency matrix. What I am trying to do: I am given the number of shops with the same product tag that I need to find, and I have to report whether a sufficient number of such shops exists.
so the input goes along these lines:
x - shop count
y - product count/tag
z - in how many shops the product needs to be present
So let's say I got:
x = 5
y = 2
z = 4
Then the adjacency matrix going with it is:
0 1 1 1 1
1 0 2 2 1
1 2 0 2 2
1 2 2 0 1
1 1 2 1 0
There are two different products available, and now I want to find out whether there are at least 4 shops selling a specific product. I found out about the Bron–Kerbosch algorithm, e.g. http://en.wikipedia.org/wiki/Bron%E2%80%93Kerbosch_algorithm
But I don't know how to pick my R, P and X subsets and how to represent them. It does not have to be very efficient, and I don't believe any data structure more advanced than a 2D array is needed, but I just don't know how to use this adjacency matrix as a list of my vertices, etc. Could anyone give me an idea of how to get started with this algorithm? Telling me how to fill R, P and X from my data would probably be sufficient. I want to write my program in C++.
From the Wikipedia article:
BronKerbosch3(G):
    P = V(G)
    R = X = empty
    for each vertex v in a degeneracy ordering of G:
        BronKerbosch2(R ⋃ {v}, P ⋂ N(v), X ⋂ N(v))
        P := P \ {v}
        X := X ⋃ {v}
where G is the graph and BronKerbosch2() is defined as:
BronKerbosch2(R, P, X):
    if P and X are both empty:
        report R as a maximal clique
    choose a pivot vertex u in P ⋃ X
    for each vertex v in P \ N(u):
        BronKerbosch2(R ⋃ {v}, P ⋂ N(v), X ⋂ N(v))
        P := P \ {v}
        X := X ⋃ {v}
So now you know your choice of R, P and X. Also look up what an adjacency matrix actually is, as mentioned in the comments, and read the article carefully for the use of degeneracy ordering in the algorithm.
In C++, you could use std::array<std::array<int, SIZE>, SIZE> for the 2D adjacency matrix and then proceed with the algorithm.
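To make that concrete, here is a minimal sketch of how it might look in C++ (my illustration, not part of the original answer; it uses std::set<int> for clarity rather than speed, and it treats any nonzero matrix entry as an edge — for a specific product tag t you would test adj[v][w] == t instead):

#include <algorithm>
#include <iostream>
#include <iterator>
#include <set>

const int SIZE = 5;
int adj[SIZE][SIZE] = {
    {0, 1, 1, 1, 1},
    {1, 0, 2, 2, 1},
    {1, 2, 0, 2, 2},
    {1, 2, 2, 0, 1},
    {1, 1, 2, 1, 0},
};

// N(v): all vertices connected to v (any nonzero entry counts as an edge here).
std::set<int> neighbours(int v) {
    std::set<int> n;
    for (int w = 0; w < SIZE; ++w)
        if (w != v && adj[v][w] != 0) n.insert(w);
    return n;
}

std::set<int> intersect(const std::set<int>& a, const std::set<int>& b) {
    std::set<int> out;
    std::set_intersection(a.begin(), a.end(), b.begin(), b.end(),
                          std::inserter(out, out.begin()));
    return out;
}

void bron_kerbosch(std::set<int> R, std::set<int> P, std::set<int> X) {
    if (P.empty() && X.empty()) {
        for (int v : R) std::cout << v << ' ';   // report R as a maximal clique
        std::cout << '\n';
        return;
    }
    int u = P.empty() ? *X.begin() : *P.begin(); // pivot vertex from P ⋃ X
    std::set<int> Nu = neighbours(u);
    std::set<int> candidates;                    // P \ N(u)
    for (int v : P)
        if (!Nu.count(v)) candidates.insert(v);
    for (int v : candidates) {
        std::set<int> Rv = R; Rv.insert(v);
        std::set<int> Nv = neighbours(v);
        bron_kerbosch(Rv, intersect(P, Nv), intersect(X, Nv));
        P.erase(v);
        X.insert(v);
    }
}

int main() {
    std::set<int> P;
    for (int v = 0; v < SIZE; ++v) P.insert(v);  // P starts as all vertices
    bron_kerbosch({}, P, {});                    // R and X start empty
}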

Variation on set cover problem in R / C++

Given a universe of elements U = {1, 2, 3,...,n} and a number of sets in this universe {S1, S2,...,Sm}, what is the smallest set we can create that will cover at least one element in each of the m sets?
For example, given the following elements U = {1,2,3,4} and sets S = {{4,3,1},{3,1},{4}}, the following sets will cover at least one element from each set:
{1,4}
or
{3,4}
so the minimum sized set required here is 2.
Any thoughts on how this can be scaled up to solve the problem for m=100 or m=1000 sets? Or thoughts on how to code this up in R or C++?
The sample data from above, using R's library(sets):
s1 <- set(4, 3, 1)
s2 <- set(3, 1)
s3 <- set(4)
s <- set(s1, s2, s3)
Cheers
This is the hitting set problem, which is basically set cover with the roles of elements and sets interchanged. Letting A = {4, 3, 1} and B = {3, 1} and C = {4}, the element-set containment relation is
   A  B  C
1  +  +  -
2  -  -  -
3  +  +  -
4  +  -  +
so you basically want to solve the problem of covering {A, B, C} with sets 1 = {A, B} and 2 = {} and 3 = {A, B} and 4 = {A, C}.
Probably the easiest way to solve nontrivial instances of set cover in practice is to find an integer programming package with an interface to R or C++. Your example would be rendered as the following integer program, in LP format.
Minimize
obj: x1 + x2 + x3 + x4
Subject To
A: x1 + x3 + x4 >= 1
B: x1 + x3 >= 1
C: x4 >= 1
Binary
x1 x2 x3 x4
End
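For example (my suggestion; the answer itself doesn't name a package), GLPK is one such integer programming package with a C API, and the program above can be built and solved like this:

#include <glpk.h>
#include <cstdio>

// Solve the example: minimize x1+x2+x3+x4 subject to the three covering
// constraints A, B, C, with all variables binary.
int main() {
    glp_prob *lp = glp_create_prob();
    glp_set_obj_dir(lp, GLP_MIN);

    glp_add_cols(lp, 4);                      // variables x1..x4
    for (int j = 1; j <= 4; ++j) {
        glp_set_col_kind(lp, j, GLP_BV);      // binary variable
        glp_set_obj_coef(lp, j, 1.0);         // each chosen element costs 1
    }

    glp_add_rows(lp, 3);                      // constraints A, B, C
    for (int i = 1; i <= 3; ++i)
        glp_set_row_bnds(lp, i, GLP_LO, 1.0, 0.0);  // each row >= 1

    // Sparse constraint matrix (1-based): A: x1+x3+x4, B: x1+x3, C: x4.
    int    ia[] = {0, 1, 1, 1, 2, 2, 3};
    int    ja[] = {0, 1, 3, 4, 1, 3, 4};
    double ar[] = {0, 1, 1, 1, 1, 1, 1};
    glp_load_matrix(lp, 6, ia, ja, ar);

    glp_iocp parm;
    glp_init_iocp(&parm);
    parm.presolve = GLP_ON;                   // solve the LP relaxation internally
    glp_intopt(lp, &parm);

    std::printf("minimum size: %g\n", glp_mip_obj_val(lp));
    for (int j = 1; j <= 4; ++j)
        if (glp_mip_col_val(lp, j) > 0.5) std::printf("x%d ", j);
    std::printf("\n");

    glp_delete_prob(lp);
}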
At first I misunderstood the complexity of the problem and came up with a function that finds a set covering the m sets, but I then realized that it isn't necessarily the smallest one:
cover <- function(sets, elements = NULL) {
  if (is.null(elements)) {
    # Build the union of all sets
    su <- integer()
    for (si in sets) su <- union(su, si)
  } else {
    su <- elements
  }
  s <- su
  for (i in seq_along(s)) {
    # Create a candidate set with one element removed
    sc <- s[-i]
    ok <- TRUE
    for (si in sets) {
      if (!any(match(si, sc, nomatch = 0L))) {
        ok <- FALSE
        break
      }
    }
    if (ok) {
      s <- sc
    }
  }
  # The resulting set
  s
}
sets <- list(s1 = c(1, 3, 4), s2 = c(1, 3), s3 = c(4))
cover(sets)
# [1] 3 4
Then we can time it:
n <- 100   # number of elements
m <- 1000  # number of sets
sets <- lapply(seq_len(m), function(i) sample.int(n, runif(1, 1, n)))
system.time( s <- cover(sets) )  # 0.53 seconds
Not too bad, but still not the smallest one.
The obvious solution: generate all permutations of the elements, pass each to the cover function, and keep the smallest result. This would take close to "forever".
Another approach is to generate a limited number of random permutations - this way you get an approximation of the smallest set.
ns <- 10             # number of random permutations to try
elements <- seq_len(n)
smin <- cover(sets)  # start from the deterministic solution
for (i in seq_len(ns)) {
  s <- cover(sets, sample(elements))
  if (length(s) < length(smin)) {
    smin <- s
  }
}
length(smin)  # approximate smallest length
If you restrict each set to have 2 elements, you have the NP-complete problem vertex cover (node cover), so I would guess the more general problem is also NP-complete (in its exact version).
If you're just interested in an algorithm (rather than an efficient or feasible one), you can simply generate the subsets of the universe in order of increasing cardinality and check that each has a non-empty intersection with every set in S. Stop as soon as you find one that works; its cardinality is the minimum possible.
The complexity of this is 2^|U| in the worst case, I think. Given Foo Bah's answer, I don't think you're going to get a polynomial-time answer...
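In code, that brute force might look like the following sketch (mine, not the answerer's; each set is encoded as a bitmask over U = {1..n}, and __builtin_popcount is GCC/Clang-specific, so this is only practical for small universes):

#include <cstdint>
#include <iostream>
#include <vector>

int main() {
    int n = 4;  // universe U = {1, 2, 3, 4}; bit i represents element i+1
    std::vector<uint32_t> sets = {
        (1u << 0) | (1u << 2) | (1u << 3),  // {1, 3, 4}
        (1u << 0) | (1u << 2),              // {1, 3}
        (1u << 3)                           // {4}
    };
    // Try candidate subsets in order of increasing cardinality k.
    for (int k = 1; k <= n; ++k) {
        for (uint32_t c = 0; c < (1u << n); ++c) {
            if (__builtin_popcount(c) != k) continue;
            bool hitsAll = true;
            for (uint32_t s : sets)
                if ((c & s) == 0) { hitsAll = false; break; }  // misses a set
            if (hitsAll) {
                std::cout << "minimum hitting set size: " << k << "\n";
                for (int i = 0; i < n; ++i)
                    if (c & (1u << i)) std::cout << (i + 1) << " ";
                std::cout << "\n";
                return 0;  // first hit at cardinality k is optimal
            }
        }
    }
}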