Could PCA be used to combine multiple rankings? - combinations
I have n (in my case just 9) different rankings of the same items. I'm trying to find a combination of them using PCA (Principal Component Analysis) in order to improve the accuracy of my ranking. The method should be unsupervised, that is, I'd like to generate the new ranking based only on the given rankings.
My idea is to try every possible subset (without repetitions) of the 9 rankings and run PCA on each of them. I therefore end up with 501 different new rankings (in the case of n=9). With some subsets I obtain better results.
By better results I mean that I have the true ranking, and once the combination is done I compare every ranking (the combined ones and the original 9) against it.
Does this method make sense?
Your question involves a subset of voting theory and there are many possibilities on how to solve this. Some of the techniques are more flexible than others. For example, some techniques can accommodate ordered rankings of variable sizes (imagine one ranking only contained 5 ordered items, while the other rankings contained 9 ordered items) while others cannot. Some techniques can assign variable weights to the different reviewers. Netflix has very complex proprietary algorithms they use to combine multiple users' movie rankings into overall rankings. That being said, I would say your combinatorial PCA approach does not strike me as either computationally efficient or terribly relevant. If you are taking information from only a subset of your 9 rankings, you are potentially discarding useful (although perhaps subtle) information.
Schulze method: Somewhat complex, but widely regarded as one of the best ways to pick a single winner from a set of rankings. Could be applied iteratively or otherwise modified to get an ordered list of winners.
Borda count: Several variations, all of which are quite simple and intuitive and usually lead to reasonable results.
Perhaps the biggest problem with the Borda count is that it does not adequately handle the different standard deviations of two items that may have very similar average rankings. If we constrain ourselves to the subset of problems where all ordered rankings are of the same size and all have equal weight, I can recommend one method that is intuitive and leads to very good results across a wide range of cases: Aggregate Z-score Minimization. This works as follows:
For each one of the ranked items, compute the mean μ and standard deviation σ of its rankings (assume a Gaussian distribution).
Next compute the |z-score| "distance" matrix for every item to every possible ranking position. Z-score = (proposed ranking position - μ) / σ
Then exhaustively calculate which assignment of items to ranking positions gives the lowest aggregate (total) z-score distance.
Effectively, the ranking problem is converted into a classification problem where we are trying to classify each ranking position into the best fitting sampled distribution of each item. The constraint is that only one ranking position can be assigned to each Gaussian item distribution. By minimizing the aggregate z-score distance globally, we are finding the most statistically likely configuration for the "true" ranking.
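The three steps above can be sketched in C++ as follows. This is a minimal sketch, not a reference implementation: the names zScoreMatrix, bestAssignment and Assignment are illustrative, the standard deviation is assumed to be the sample standard deviation (n - 1 in the denominator), every item is assumed to have σ > 0, and the exhaustive search simply tries all n! orderings, which is only practical for a small number of items.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <limits>
#include <numeric>
#include <vector>

// positions[i] holds the 1-based ranks that item i received from each ranker.
// Returns z with z[i][p] = |(p + 1) - mean_i| / sd_i for 0-based position p.
std::vector<std::vector<double>> zScoreMatrix(
        const std::vector<std::vector<int>>& positions) {
    const std::size_t n = positions.size();
    std::vector<std::vector<double>> z(n, std::vector<double>(n));
    for (std::size_t i = 0; i < n; ++i) {
        const std::vector<int>& r = positions[i];
        assert(r.size() >= 2);  // need at least two rankings for a sample sd
        double mean = std::accumulate(r.begin(), r.end(), 0.0) / r.size();
        double ss = 0.0;
        for (int v : r) ss += (v - mean) * (v - mean);
        double sd = std::sqrt(ss / (r.size() - 1));  // assumption: sd > 0
        for (std::size_t p = 0; p < n; ++p)
            z[i][p] = std::fabs((p + 1.0) - mean) / sd;
    }
    return z;
}

struct Assignment {
    std::vector<int> itemAt;  // itemAt[p] = index of the item placed at position p
    double cost;              // aggregate z-score distance of this ordering
};

// Step 3: try every ordering and keep the one with the lowest total distance.
Assignment bestAssignment(const std::vector<std::vector<double>>& z) {
    std::vector<int> perm(z.size());
    std::iota(perm.begin(), perm.end(), 0);
    Assignment best{perm, std::numeric_limits<double>::infinity()};
    do {
        double cost = 0.0;
        for (std::size_t p = 0; p < perm.size(); ++p) cost += z[perm[p]][p];
        if (cost < best.cost) best = Assignment{perm, cost};
    } while (std::next_permutation(perm.begin(), perm.end()));
    return best;
}
```

For the six-item example worked through in the rest of this answer, this search should report an aggregate of about 3.3612. Because A and F have the same σ there, the orderings A, F, C, B, E, D and F, A, C, B, E, D tie at that value, so which of the two is returned depends on floating-point rounding.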
If you don't want to do the programming to exhaustively calculate the combinatorial sums of step 3, there is a heuristic method I'll demonstrate here that usually gives good results (but not guaranteed to be the best solution).
Consider we have 4 independent rankings of 6 items (A-F) here. Assume the first item listed in each ranking is at ranking position #1:
1. A,C,F,E,B,D
2. D,B,C,E,F,A
3. F,A,B,C,D,E
4. E,A,C,B,D,F
Next compute the mean and standard deviation of each item's ranking positions:
A: (#1, #6, #2, #2); μ = 2.75, σ = 2.217
B: μ = 3.5, σ = 1.291
C: μ = 3.0, σ = 0.816
D: μ = 4.25, σ = 2.217
E: μ = 3.75, σ = 2.062
F: μ = 3.75, σ = 2.217
We can see from the relatively tight spread of means (2.75 to 4.25) that all of the items are contending for about the same average, middle positions. This is a case where the Borda count may tend to perform poorly because the standard deviations become extra important when the averages are all so close. So next, we create the matrix of z-score distances from each item to each possible ranking position:
A: 0.7892, 0.3382, 0.1127, 0.5637, 1.0147, 1.4657
B: 1.9365, 1.1619, 0.3873, 0.3873, 1.1619, 1.9365
C: 2.4495, 1.2247, 0.0000, 1.2247, 2.4495, 3.6742
D: 1.4657, 1.0147, 0.5637, 0.1127, 0.3382, 0.7892
E: 1.3339, 0.8489, 0.3638, 0.1213, 0.6063, 1.0914
F: 1.2402, 0.7892, 0.3382, 0.1127, 0.5637, 1.0147
It's probably obvious, but in the event you had any item with σ = 0, you can assign that item to its exclusive ranking position immediately. Now if you don't want to exhaustively solve this matrix for the ranking combination with the lowest possible aggregate z-score assignment, you can use this heuristic. Sum each column, and then subtract the minimum value from that column to get a value we can call "savings":
sum: 9.2151, 5.3777, 1.7658, 2.5225, 6.1344, 9.9718
min: 0.7892, 0.3382, 0.0000, 0.1127, 0.3382, 0.7892
savings: 8.4259, 5.0395, 1.7658, 2.4098, 5.7962, 9.1826
Take the column with the largest "savings" value and assign the item with the min value to that position. In our example here, this means we will assign the item "D" to the 6th position. After doing this, recompute the sum, min, and savings values, but first remove the "D" item's row and also remove the 6th column (because they have already been assigned). Then assign the new largest "savings" value to the item with the min value in that column. Continue until all rankings are assigned. In this example, the final (heuristic) ranking will be as follows: A, E, C, B, F, D (aggregate z-score: 3.3783). I didn't check my work, but it looks like the exhaustively solved solution of A, F, C, B, E, D (aggregate z-score: 3.3612) might be 0.5% better than the heuristic solution.
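The "savings" heuristic described above can be sketched as follows. It is a sketch under stated assumptions: savingsHeuristic is an illustrative name, the input is the square z-score distance matrix (rows are items, columns are positions), and ties are broken by taking the first column or row encountered, which the description above does not specify.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// z[i][p] = z-score distance of item i from position p (both 0-based).
// Returns itemAt, where itemAt[p] is the item assigned to position p.
std::vector<int> savingsHeuristic(const std::vector<std::vector<double>>& z) {
    const std::size_t n = z.size();
    std::vector<int> itemAt(n, -1);        // -1 marks a position still open
    std::vector<bool> itemUsed(n, false);
    for (std::size_t round = 0; round < n; ++round) {
        double bestSavings = -1.0;         // savings are never negative
        std::size_t bestPos = 0, bestItem = 0;
        for (std::size_t p = 0; p < n; ++p) {
            if (itemAt[p] != -1) continue; // column already removed
            double sum = 0.0, minVal = 0.0;
            std::size_t minItem = n;       // n means "no item seen yet"
            for (std::size_t i = 0; i < n; ++i) {
                if (itemUsed[i]) continue; // row already removed
                sum += z[i][p];
                if (minItem == n || z[i][p] < minVal) {
                    minVal = z[i][p];
                    minItem = i;
                }
            }
            double savings = sum - minVal; // column sum minus column minimum
            if (savings > bestSavings) {
                bestSavings = savings;
                bestPos = p;
                bestItem = minItem;
            }
        }
        // Fix the min-value item of the largest-savings column, then repeat.
        itemAt[bestPos] = static_cast<int>(bestItem);
        itemUsed[bestItem] = true;
    }
    return itemAt;
}
```

Fed the 6x6 matrix above (A = 0 through F = 5), the function returns the indices 0, 4, 2, 1, 5, 3, which is the ordering A, E, C, B, F, D.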
It's worth noting that the naive solution where we just simply ordered the means A, C, B, E, F, D (aggregate z-score: 3.8754) is substantially less likely (statistically) to be the best ranking.
Related
How to optimize nonlinear function with some constraint in c++
I want to find the variables, in C++, that give a nonlinear formula its maximum value under a constraint. That is, I need to calculate the maximum value of the formula below under the given constraint in C++. Using a library (e.g. NLopt) is fine.
formula: ln(1 + a*x + b*y + c*z)
a, b, c are numbers input by the user
x, y, z are the variables to be derived
constraint: x, y, z are positive and x+y+z<=1
This can actually be transformed into a linear optimization problem. Because ln is monotonically increasing, maximizing the logarithm is the same as maximizing its argument:
max ln(1+ax+by+cz) <--> max (ax+by+cz) s.t. ax+by+cz > -1
This means that it is a linear optimization problem (with one more constraint) that you can easily handle with whatever C++ methods together with your Convex Optimization knowledge.
Reminders for writing good code:
Check the validity of the input values.
Since the input values can be negative, you need to consider this circumstance, which can yield different results.
P.S. This problem seems to be Off Topic on SO. If it is your homework, it is for your own good to write the code yourself. Besides, we do not have enough time to write that for you. This should have been a comment if I had more reputation.
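To make the reduction concrete: a linear objective over the region x, y, z >= 0, x+y+z <= 1 attains its maximum at one of the four corners (the origin or one of the three unit points), so no solver is needed for this particular shape of problem. The sketch below assumes "positive" may be read as "non-negative" (with strictly positive variables the maximum is only approached, not attained); Optimum and maximizeLog are illustrative names.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

struct Optimum {
    double x, y, z;  // maximizing point
    double value;    // ln(1 + a*x + b*y + c*z) at that point
};

// Maximizes ln(1 + a*x + b*y + c*z) subject to x, y, z >= 0 and x + y + z <= 1.
Optimum maximizeLog(double a, double b, double c) {
    // Best linear value over the corners (0,0,0), (1,0,0), (0,1,0), (0,0,1).
    double m = std::max({0.0, a, b, c});
    Optimum best{0.0, 0.0, 0.0, std::log1p(m)};  // log1p(m) = ln(1 + m)
    if (m > 0.0) {
        // Put all the weight on the variable with the largest coefficient.
        if (m == a) best.x = 1.0;
        else if (m == b) best.y = 1.0;
        else best.z = 1.0;
    }
    return best;
}
```

When a, b and c are all negative the origin wins and the maximum is ln(1) = 0, which is the "negative input" case the reminders above warn about. The extra constraint ax+by+cz > -1 never binds here, because the origin is always feasible and already gives 0.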
Least Squares Solution of Overdetermined Linear Algebraic Equation Ax = By
I have a linear algebraic equation of the form Ax = By, where A is a 6x5 matrix, x is a vector of size 5, B is a 6x6 matrix and y is a vector of size 6. A, B and y are known; their values arrive in real time from sensors. x is unknown and has to be found.
One solution is the least-squares estimate x = [(A^T*A)^-1]*(A^T)*B*y, which is the conventional solution of such equations. I used Eigen's QR decomposition to solve this as below:
    matrixA = getMatrixA();
    matrixB = getMatrixB();
    vectorY = getVectorY();
    // LSE solution
    Eigen::ColPivHouseholderQR<Eigen::MatrixXd> dec1(matrixA);
    vectorX = dec1.solve(matrixB*vectorY);
Everything is fine until now. But when I check the error e = Ax-By, it is not always zero. The error is not very big, but it is not negligible either. Is there any other type of decomposition which is more reliable?
I have gone through one reference page but could not understand what it means or how to implement it. Below are the lines from the reference on how to solve the problem. Could anybody suggest how to implement this?
The solution of such equations Ax = By is obtained by forming the error vector e = Ax-By and then finding the unknown vector x that minimizes the weighted error (e^T*W*e), where W is a weighting matrix. For simplicity, this weighting matrix is chosen to be of the form W = K*S, where S is a constant diagonal scaling matrix, and K is a scalar weight. Hence the solution to the equation becomes x = [(A^T*W*A)^-1]*(A^T)*W*B*y
I did not understand how to form the matrix W.
Your statement "But when I check the error e = Ax-By, it is not always zero" will almost always be true, regardless of your technique or what weighting you choose. When you have an overdetermined system, you are basically trying to fit a straight line to a slew of points. Unless, by chance, all the points can be placed exactly on a single perfectly straight line, there will be some error. So no matter what technique you use to choose the line (weights and so on), you will always have some error if the points are not collinear.
The alternative would be to use some kind of spline, or in higher dimensions to allow for warping. In those cases, you can choose to fit all the points exactly to a more complicated shape, and hence end up with 0 error.
So the choice of a weight matrix simply changes which straight line you will use by giving each point a slightly different weight. It will never completely remove the error. But if you had a few particular points that you care more about than the others, you can give the error on those points a higher weight when choosing the least-squares fit.
For spline fitting see: http://en.wikipedia.org/wiki/Spline_interpolation
For the really nicest spline curve interpolation you can use centripetal Catmull-Rom, which in addition to finding a curve to fit all the points, will prevent unnecessary loops and self-intersections that can sometimes come up during abrupt changes in the data direction.
Catmull-rom curve with no cusps and no self-intersections
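On the question of how to form W: a minimal sketch, assuming W is diagonal (W = K*S with S = diag(s_1, ..., s_m)), so that each equation simply gets its own non-negative weight. The function weightedLeastSquares and the parameters w and rhs (standing for B*y) are illustrative names, not part of Eigen or of the quoted reference. For brevity it uses only the standard library and solves the weighted normal equations (A^T W A) x = A^T W rhs by Gaussian elimination, which is less numerically robust than a QR-based solve.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <utility>
#include <vector>

using Matrix = std::vector<std::vector<double>>;
using Vector = std::vector<double>;

// Minimizes sum_i w[i] * (A[i] . x - rhs[i])^2, i.e. e^T W e with W = diag(w).
Vector weightedLeastSquares(const Matrix& A, const Vector& w, const Vector& rhs) {
    const std::size_t m = A.size(), n = A[0].size();
    assert(w.size() == m && rhs.size() == m);
    // Build the augmented system [A^T W A | A^T W rhs].
    Matrix N(n, Vector(n + 1, 0.0));
    for (std::size_t i = 0; i < m; ++i)
        for (std::size_t r = 0; r < n; ++r) {
            for (std::size_t c = 0; c < n; ++c) N[r][c] += w[i] * A[i][r] * A[i][c];
            N[r][n] += w[i] * A[i][r] * rhs[i];
        }
    // Forward elimination with partial pivoting.
    for (std::size_t k = 0; k < n; ++k) {
        std::size_t piv = k;
        for (std::size_t r = k + 1; r < n; ++r)
            if (std::fabs(N[r][k]) > std::fabs(N[piv][k])) piv = r;
        std::swap(N[k], N[piv]);
        assert(std::fabs(N[k][k]) > 1e-12);  // assumes A has full column rank
        for (std::size_t r = k + 1; r < n; ++r) {
            double f = N[r][k] / N[k][k];
            for (std::size_t c = k; c <= n; ++c) N[r][c] -= f * N[k][c];
        }
    }
    // Back substitution.
    Vector x(n);
    for (std::size_t k = n; k-- > 0;) {
        double s = N[k][n];
        for (std::size_t c = k + 1; c < n; ++c) s -= N[k][c] * x[c];
        x[k] = s / N[k][k];
    }
    return x;
}
```

The scalar K cancels out of x = [(A^T*W*A)^-1]*(A^T)*W*B*y, so only the relative sizes of the diagonal entries matter: a row with a larger weight ends up with a smaller residual, at the expense of the other rows. An equivalent route with an existing QR solver would be to scale row i of A and entry i of B*y by sqrt(w_i) and solve the ordinary least-squares problem.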
Algo: find max Xor in array for various interval limits, given N inputs, and p,q where 0<=p<=i<=q<=N
The problem statement is the following:
Xorq has invented an encryption algorithm which uses bitwise XOR operations extensively. This encryption algorithm uses a sequence of non-negative integers x1, x2, … xn as key. To implement this algorithm efficiently, Xorq needs to find the maximum value of (a xor xj) for given integers a, p and q such that p<=j<=q. Help Xorq to implement this function.
Input
First line of input contains a single integer T (1<=T<=6). T test cases follow. First line of each test case contains two integers N and Q separated by a single space (1<= N<=100,000; 1<=Q<= 50,000). Next line contains N integers x1, x2, … xn separated by a single space (0<=xi< 2^15). Each of next Q lines describes a query which consists of three integers ai, pi and qi (0<=ai< 2^15, 1<=pi<=qi<= N).
Output
For each query, print the maximum value for (ai xor xj) such that pi<=j<=qi in a single line.
    #include <iostream>
    using namespace std;

    int xArray[100000];

    int main() {
        int t, n, q;
        cin >> t;
        for (int j = 0; j < t; j++) {
            cin >> n >> q;
            int i, a, pi, qi;
            for (i = 0; i < n; i++) {
                cin >> xArray[i];
            }
            for (i = 0; i < q; i++) {
                cin >> a >> pi >> qi;
                int max = 0;
                // Brute force: scan x[pi..qi] (1-based) for the best XOR with a.
                for (int it = pi - 1; it < qi; it++) {
                    int v = xArray[it] ^ a;
                    if (v > max) max = v;
                }
                cout << max << "\n";
            }
        }
        return 0;
    }
No other assumptions may be made except for those stated in the text of the problem (numbers are not sorted). The code is functional but not fast enough; is reading from stdin really that slow or is there anything else I'm missing?
XOR flips bits. The max result of XOR is 0b11111111 (for 8-bit values; the keys in this problem have 15 bits). To get the best result:
if 'a' has 1 at the i-th place, then you have to XOR it with a key that has i-th bit = 0
if 'a' has 0 at the i-th place, then you have to XOR it with a key that has i-th bit = 1
Put simply, for bit B you need !B.
Another obvious thing is that higher-order bits are more important than lower-order bits. That is: if 'a' has B at the highest place and you have found a key with highest bit = !B, then ALL keys that have highest bit = B are worse than this one. This cuts your amount of numbers by half "on average".
How about building a huge binary tree from all the keys and ordering them in the tree by their bits, from MSB to LSB? Then, walking through A bit by bit from MSB to LSB would tell you which left/right branch to take next to get the best result. Of course, that ignores the PI/QI limits, but it surely would give you the best result, since you always pick the best available bit on the i-th level.
Now if you annotate the tree nodes with the low/high index ranges of their subelements (done only once, when building the tree), then later, when querying against a case A-PI-QI, you could use that to filter out branches that do not fall in the index range.
The point is that if you order the tree levels like the MSB->LSB bit order, then the decision made at the "upper nodes" guarantees that you are currently in the best possible branch, and it would hold even if all the subbranches were the worst. Being at level 3, the result of 0b111????? can then be expanded into 0b11100000, 0b11100001, 0b11100010 and so on, but even if the ????? are expanded poorly, the overall result is still greater than 0b11011111, which would be the best possible result if you picked the other branch at level 3.
I have absolutely no idea how long preparing the tree would take, but querying it for an A-PI-QI that has 32 bits seems to cost something like 32 times N-comparisons and jumps, certainly faster than iterating randomly 0-100000 times and xor/maxing. And since you have up to 50000 queries, building such a tree can actually be a good investment, since the tree would be built once per keyset. Now, the best part is that you actually don't need the whole tree. You may build it from e.g. the first two or four or eight bits only, and use the index ranges from the nodes to limit your xor-max loop to a smaller part. At worst, you'd end up with the same range as PI-QI. At best, it'd be down to one element. But, looking at the max N keys, I think the whole tree might actually fit in the memory pool and you may get away without any xor-maxing loop.
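A minimal sketch of the bitwise tree idea, with one deliberate change that should be read as an assumption rather than as the answer's exact proposal: instead of only a low/high index range, each node stores the sorted list of (1-based) key positions passing through it. A bare low/high range can report a branch as usable even when none of its keys actually lies inside [p, q]; a binary search in the sorted list avoids that. XorTrie, kBits and the member names are illustrative.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

constexpr int kBits = 15;  // keys and a satisfy 0 <= value < 2^15

struct XorTrie {
    struct Node {
        int child[2] = {-1, -1};
        std::vector<int> idx;  // 1-based key positions in this subtree, ascending
    };
    std::vector<Node> nodes;

    explicit XorTrie(const std::vector<int>& keys) {
        nodes.emplace_back();  // root
        for (int i = 0; i < static_cast<int>(keys.size()); ++i) {
            int cur = 0;
            for (int b = kBits - 1; b >= 0; --b) {  // MSB first
                int bit = (keys[i] >> b) & 1;
                if (nodes[cur].child[bit] < 0) {
                    nodes[cur].child[bit] = static_cast<int>(nodes.size());
                    nodes.emplace_back();
                }
                cur = nodes[cur].child[bit];
                nodes[cur].idx.push_back(i + 1);  // inserted in order, so sorted
            }
        }
    }

    // True if the subtree holds at least one key with position in [p, q].
    bool hasIndexIn(int node, int p, int q) const {
        if (node < 0) return false;
        const std::vector<int>& v = nodes[node].idx;
        auto it = std::lower_bound(v.begin(), v.end(), p);
        return it != v.end() && *it <= q;
    }

    // Maximum of (a xor x_j) over p <= j <= q; assumes 1 <= p <= q <= N.
    int maxXor(int a, int p, int q) const {
        int cur = 0, best = 0;
        for (int b = kBits - 1; b >= 0; --b) {
            int want = ((a >> b) & 1) ^ 1;  // the opposite bit sets bit b of the result
            int next = nodes[cur].child[want];
            if (hasIndexIn(next, p, q)) best |= 1 << b;
            else next = nodes[cur].child[want ^ 1];  // forced onto the other branch
            cur = next;
        }
        return best;
    }
};
```

Each query then costs 15 binary searches instead of a scan over up to N keys, and the tree holds at most 15 index entries per key, so it is built once per test case and reused for all Q queries.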
I've spent some time googling this problem and it seems that you can find it in the context of various programming competitions. While the brute-force approach is intuitive, it does not really solve the challenge as it is too slow. There are a few constraints in the problem which you need to exploit in order to write a faster algorithm:
the input consists of max 100k numbers, but there are only 32768 (2^15) possible numbers for each input array
there are Q, max 50k, test cases; each test case consists of 3 values, a, pi and qi. Since 0<=a<2^15 and there are 50k cases, there is a chance the same value will come up again.
I've found 2 ideas for solving the problem: splitting the input into sqrt(N) intervals and building a segment tree ( a nice explanation for these approaches can be found here )
The biggest problem is the fact that for each test case you can have different values for a, and that would make previous results useless, since you need to compute max(a^x[i]), for a small number of test cases. However, when Q is large enough and the value a repeats, using previous results can be possible.
I will come back with the actual results once I finish implementing both methods
Linear programming - multi-objective optimization. How to build single function
I have an LP problem. It is similar to the assignment problem for workers. I have three assignments with constraint scores (Constraint 1 is more important than Constraint 2):
            Constraint1   Constraint2
assign1     590           5
assign2     585           580
assign3     585           336
My current greedy solver compares assignments by the first constraint. The best one becomes the solution. The solver compares Constraint 2 if and only if it found two assignments with the same score for the previous constraint, and so on. For the given example, in the first round assign2 and assign3 will be chosen because they have the same lowest constraint score. After the second round the solver chooses assign3.
So I need a cost function which will represent that behavior, i.e. I expect assign1_cost > assign2_cost > assign3_cost. Is it possible to do? I believe that I can apply some weighted math function or something like that.
There are two ways to do this simply, that I know of:
1. Assume that you can put an upper bound on each objective function (besides the first). Then you rewrite your objective as follows: (objective_1 * max_2 + objective_2) * max_3 + objective_3 and so on, and minimize said objective. For your example, say you know that objective_2 will be less than 1000. You can then solve, minimizing objective_1*1000 + objective_2. This may fail with some solvers if your upper bounds are too large.
2. This method requires n solver iterations, where n is the number of objectives, but is general: it doesn't require you to know upper bounds beforehand. Solve, minimizing objective_1 and ignoring the other objective functions. Add a constraint that objective_1 == <value you just got>. Solve, minimizing objective_2 and ignoring the remaining objective functions. Repeat until you've gone through every objective function.
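The upper-bound approach can be sketched as a small helper that folds the objectives into one integer cost. This is only an illustration: lexicographicCost and bounds are hypothetical names, the objectives are assumed to be non-negative integers, and each bound must be strictly greater than any value the corresponding objective can take (otherwise a lower-priority objective could outweigh a higher-priority one).

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Folds objectives (most important first) into a single cost such that
// comparing costs reproduces the lexicographic comparison of the objectives.
// bounds[i] is an exclusive upper bound on objectives[i]; bounds[0] is unused.
long long lexicographicCost(const std::vector<long long>& objectives,
                            const std::vector<long long>& bounds) {
    assert(!objectives.empty() && objectives.size() == bounds.size());
    long long cost = objectives[0];
    for (std::size_t i = 1; i < objectives.size(); ++i) {
        assert(objectives[i] >= 0 && objectives[i] < bounds[i]);
        // (objective_1 * max_2 + objective_2) * max_3 + objective_3 ...
        cost = cost * bounds[i] + objectives[i];
    }
    return cost;
}
```

With a bound of 1000 on the second score, the three assignments from the question get costs 590005, 585580 and 585336, so assign1_cost > assign2_cost > assign3_cost as required. The product of the bounds grows quickly, which is the overflow and precision risk mentioned above for large upper bounds.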
Alglib: solving A * x = b in a least squares sense
I have a somewhat complicated algorithm that requires the fitting of a quadric to a set of points. This quadric is given by its parametrization (u, v, f(u,v)), where f(u,v) = au^2+bv^2+cuv+du+ev+f. The coefficients of the f(u,v) function need to be found, since I have a set of exactly 6 constraints this function should obey. The problem is that this set of constraints, although yielding a problem like A*x = b, is not well behaved enough to guarantee a unique solution. Thus, to cut it short, I'd like to use alglib's facilities to somehow either determine A's pseudoinverse or directly find the best fit for the x vector. Apart from computing the SVD, is there a more direct algorithm implemented in this library that can solve a system in a least squares sense (again, apart from the SVD or from using the naive inv(transpose(A)*A)*transpose(A)*b formula for general least squares problems where A is not a square matrix)?
Found the answer through some careful documentation browsing:
rmatrixsolvels( A, noRows, noCols, b, singularValueThreshold, info, solverReport, x)
The documentation states that the singular value threshold is a clamping threshold that sets any singular value from the SVD's S matrix to 0 if that value is below it. Thus it should be a scalar between 0 and 1. Hopefully, it will help someone else too.