I have two datasets and I want to find out how strongly they are correlated.
The datasets represent the match results of two teams, where 1 represents a win, 0 represents a draw and -1 represents a loss.
e.g. for 5 games
team1 = [1,1,0,-1,0]
team2 = [0,1,0,1,0]
Calculating the Pearson correlation coefficient works fine until one team has won all of the last 5 games, giving a constant array, e.g.
team1 = [1,1,1,1,1]
In this case the Pearson correlation coefficient will be undefined regardless of what team2 did, because Pearson divides by the standard deviation of each array, and the standard deviation of a constant array is zero.
I find this weird, because if team2 also won most of the 5 games, the correlation should intuitively be close to 1, not undefined.
And vice versa: if team2 lost most of their matches, the correlation should be close to -1, based on my understanding.
Am I doing something wrong here, or does my data need another method to measure how strong the relation between the datasets is?
Thanks in advance.
So, I found this good resource:
http://www.ashukumar27.io/similarity_functions/
I think I will go for Euclidean distance, which is more suitable for my use case.
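For completeness, here is a minimal sketch of that computation in C++ (assuming two equal-length result vectors; the function name euclidean is just illustrative). Note that, unlike Pearson, the distance stays well defined even when one of the vectors is constant:

#include <cmath>
#include <iostream>
#include <vector>
using namespace std;

// Euclidean distance between two equal-length result vectors:
// the smaller the distance, the more similar the match histories.
double euclidean(const vector<int>& a, const vector<int>& b) {
    double sum = 0.0;
    for (size_t i = 0; i < a.size(); i++) {
        double d = a[i] - b[i];
        sum += d * d;
    }
    return sqrt(sum);
}

int main() {
    vector<int> team1 = {1, 1, 0, -1, 0};
    vector<int> team2 = {0, 1, 0, 1, 0};
    cout << euclidean(team1, team2) << "\n"; // sqrt(1 + 0 + 0 + 4 + 0) = sqrt(5)
}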
The situation is that lower numbers are better. Take a ranking on a leader board as an example: if you are number 1, your ranking is better than if you are in 100th place. Yet a bar graph of the rankings will make 100th place look more important, because its bar is taller.
How can the sizes of the bars be reversed so that the lower numbers get the taller bars?
If you know there are 100 participants on the leader board, you can simply go through each piece of data, subtract 100, and take the absolute value:
1st place: rank 1 => |1 - 100| = 99
100th place: rank 100 => |100 - 100| = 0
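A minimal sketch of that transformation (assuming 1-based ranks and 100 participants):

#include <cstdlib>
#include <iostream>
using namespace std;

int main() {
    const int participants = 100;
    int ranks[] = {1, 50, 100};                // example leader-board ranks
    for (int r : ranks) {
        int barHeight = abs(r - participants); // 1st -> 99, 100th -> 0
        cout << "rank " << r << " -> bar height " << barHeight << "\n";
    }
}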
This is a question from the Australian Informatics Olympiad
The question is:
Have you ever heard of Melodramia, my friend? It is a land of forbidden forests and boundless swamps, of sprinting heroes and dashing heroines. And it is home to two dragons, Rose and Scarlet, who, despite their competitive streak, are the best of friends.
Rose and Scarlet love playing Binary Snap, a game for two players. The game is played with a deck of cards, each with a numeric label from 1 to N. There are two cards with each possible label, making 2N cards in total. The game goes as follows:
Rose shuffles the cards and places them face down in front of Scarlet.
Scarlet then chooses either the top card, or the second-from-top card from the deck and reveals it.
Scarlet continues to do this until the deck is empty. If at any point the card she reveals has the same label as the previous card she revealed, the cards are a Dragon Pair, and whichever dragon shouts `Snap!' first gains a point.
After many millennia of playing, the dragons noticed that having more possible Dragon Pairs would often lead to a more exciting game. It is for this reason they have summoned you, the village computermancer, to write a program that reads in the order of cards in the shuffled deck and outputs the maximum number of Dragon Pairs that the dragons can find.
I'm not sure how to solve this. I thought of an approach which turned out to be wrong (for each card, comparing it with its previous occurrence and taking the maximum over all cards).
Here's my code as of now:
#include <iostream>
#include <fstream>
using namespace std;

int main() {
    ifstream fin("snapin.txt");
    ofstream fout("snapout.txt");

    int n;
    fin >> n;

    // The 2N shuffled card labels.
    int arr[(2 * n) + 1];
    for (int i = 0; i < 2 * n; i++) {
        fin >> arr[i];
    }

    // dp[i]: best pair count found up to position i, all starting at 0.
    int dp[(2 * n) + 1];
    for (int i = 0; i < (2 * n) + 1; i++) {
        dp[i] = 0;
    }

    // pos[label]: first position where this label was seen, or -1.
    int pos[n + 1];
    for (int i = 0; i < n + 1; i++) {
        pos[i] = -1;
    }

    int maxi = 0;
    for (int i = 2; i < (2 * n) - 2; i++) {
        if (pos[arr[i]] == -1) {
            pos[arr[i]] = i;
        } else {
            dp[i] = pos[arr[i]] + 1;
            maxi = max(dp[i], maxi);
        }
        dp[i] = max(dp[i], maxi);
    }
    fout << dp[2 * n - 1];
}
Ok, let's get some basic measurements of the problem out of the way first:
There are 2N cards. 1 card is drawn at a time, without replacement. Therefore there are 2N draws, taking the deck from size 2N (before the first draw) to size 0 (after the last draw).
The final draw takes place from a deck of size 1, and must take the last remaining card.
The 2N-1 preceding draws have deck size 2N, ... 3, 2. For each of these you have a choice between the top two cards. 2N-1 decisions, each with 2 possibilities.
The brute force search space is therefore 2^(2N-1).
That is exponential growth, every optimization scientist's favorite sort of challenge.
If N is small, say 20, the brute force method needs to search "only" a trillion possibilities, which you can get done in a few thousand seconds on a readily available PC that does a few billion operations per second (each solution takes more than one CPU instruction to check).
If N is not quite as small, perhaps 100, the brute force method is akin to breaking the encryption on government secrets.
Not happy with the brute force approach then? I'm not either.
Before we get to the optimal solution, let’s take a break to explore what the Markov assumption is and what it means for us. It shows up in different fields using different verbiage, but I’ll just paraphrase it in a way that is particularly useful for this problem involving gameplay choices:
Markov Assumption
A process is Markov if and only if the choices available to you in the future depend only on what you have now, and not on how you got it.
A bad but often used real-world example is the stock market. Not only do taxation differences between short-term and long-term capital gains make history important in a small way, but investors do trend analysis and remember what stocks have done before, which affects future behavior in a big way.
A better example, especially for StackOverflow, is that of Turing machines and computer processors. What your program does next depends on the current instruction pointer and the contents of memory, but not the history of memory that’s since been overwritten. But there are many more. As we’ll see shortly, the Binary Snap problem can be formulated as Markov.
Now let’s talk about what makes the Markov assumption so important. For that, we’ll use the Travelling Salesman Problem. No, the Travelling International Salesman Problem. Still too messy. Let’s try the “Travelling International Salesman with a Single-Entry Visa Problem”. But we’ll go through all three of them briefly:
Travelling Salesman Problem
A salesman has to visit potential buyers in N cities. Plan an itinerary for the salesman which minimizes the total cost of visiting all N cities (variations: at least once / exactly once), given a matrix a[j][k] which is the cost of travel from city j to city k.
Another variation is whether the starting city is predetermined or not.
Travelling International Salesman Problem
The cities the salesman needs to visit are split between two (or more) nations. A subset of the cities have border crossings and have travel options to all cities. The other cities can only reach cities which are either in the same country or are border-equipped.
Alternatively, instead of cities along the border, use cities with international airports. Won’t make a difference in the end.
The cost matrix for this problem looks rather like the flag of the Dominican Republic. Travel between interior cities of country A is permitted, as is travel between interior cities of country B (blue fields). Border cities connect with interior and border cities in both countries (white cross). And direct travel between an interior city of country A and one of country B is impossible (red areas).
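To make that structure concrete, here is an assumed toy instance with two interior cities per country and one border city AB ("ok" marks a permitted leg, "--" an impossible one):

        A1   A2   AB   B1   B2
A1      ok   ok   ok   --   --
A2      ok   ok   ok   --   --
AB      ok   ok   ok   ok   ok
B1      --   --   ok   ok   ok
B2      --   --   ok   ok   ok

The two interior blocks are the blue fields, the AB row and column are the white cross, and the "--" corners are the red areas.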
Travelling International Salesman with a Single-Entry Visa
Now not only does the salesman need to visit cities in both countries, but he can only cross the border once.
(For travel fanatics, assume he starts in a third country and has single-entry visas for both countries, so he can’t visit some of A, all of B, then return to A for the rest).
Let’s look at an extremely simple case first: Only one border city. We’ll use one additional trick, the one from proof by induction: We assume that all problems smaller than the current one can be solved.
It should be fairly obvious that the Markov assumption holds when the salesman reaches the border city. No matter what path he took through country A, he has exactly the same choice of paths through country B.
But there’s a really important point here: Any path through country A ending at the border and any path through country B starting at the border, can be combined into a feasible full itinerary. If we have two full itineraries x and y, and x spent more money in country A than y did, then even if x has a lower total cost than the total cost of y, we can plan a path better than both, using the portion of y in country A and the portion of x in country B. I’m going to call that “splicing”. The Markov assumption lets us do it, by making all roads leading to the border interchangeable!
In fact, we can look just at the cities of country A, pick the best of all routes to the border, and forget about all the other options as soon as (in our plan) the salesman steps across into B.
This means instead of having factorial(N_A) * factorial(N_B) routes to look at, there are only factorial(N_A) + factorial(N_B). Which is pretty much factorial(N_A) times better. Wow, is this Markov thing helpful or what?
Ok, that was too easy. Let's mess it all up by having N_AB border cities instead of just one. Now if I have a path x which costs less in country B and a path y which costs less in country A, but they cross the border in different cities, I can't just splice them together. So I have to keep track of all the paths through all the cities again, right?
Not exactly. What if, instead of throwing away all the paths through country A except the single best one, I keep one path ending in each border city (the lowest-cost path among all paths ending at that city)? Now, for any path x I look at in country B, I have a path y_endpt(x) that crosses at the same border city, to splice it with. So I have to solve the country A and country B partitions each N_AB times to find the best splice into a complete itinerary, for total work of N_AB * factorial(N_A) + N_AB * factorial(N_B), which is still way better than factorial(N_A) * factorial(N_B).
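In code, that pruning step is tiny. A hypothetical sketch (bestToBorder and recordPath are illustrative names, not from any library):

#include <map>
#include <string>
using namespace std;

// For each border city, remember only the cheapest known path through
// country A that ends there. Any costlier path ending at the same city
// can be discarded: whatever continuation it has through country B, the
// cheaper path has too (the splicing argument above).
map<string, double> bestToBorder;

void recordPath(const string& borderCity, double costInA) {
    auto it = bestToBorder.find(borderCity);
    if (it == bestToBorder.end() || costInA < it->second)
        bestToBorder[borderCity] = costInA;
}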
Enough development of tools. Let's get back to the dragons, since they are subtle and quick to anger and I don't want to be eaten or burnt to a crisp.
I claim that at any step T of the Binary Snap game, if we consider our “location” a pair of (card just drawn, card on top of deck), the Markov assumption will hold. These are the only things that determine our future options. All the cards below the top one in the deck must be in the same order no matter what we did before. And for knowing whether to count a Snap! with the next card, we need to know the last one taken. And that’s it!
Furthermore, there are N possible labels for the card last drawn, and N possible for the top card of the deck, for a total of N^2 “border cities”. As we start playing the game, there are two choices on the first turn, two on the second, two on the third, so we start out with 2^T possible game states (and a count of Snap!s for each). But by the pigeonhole principle, once 2^T > N^2, some of these plays must end in exactly the same game state (“border city”) as each other, and when that happens, we only need to keep the "dominating" one that got the best score on the way there.
Final complexity bound: 2*N timesteps, no more than N^2 game states, with 2 draw choices at each, equals an upper limit of 4*N^3 simulated draws.
And that means the same trillion calculations that allowed us to do N=20 with the brute force method, now permit right around N=8000.
That makes the dragons happy, which makes us alive and well.
Implementation note: since the challenge didn't ask for the order of draws, just the highest attainable number of snaps, all the data you need to keep track of, in addition to the initial ordering of the cards, is the time T and a 2-dimensional array (N rows, N columns) holding the best score with which you can reach each state at time T.
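Here is a minimal sketch of that dynamic program in C++ (assuming the same snapin.txt / snapout.txt format as the code in the question; best and relax are just illustrative names). A map holds only the reachable (top of deck, last drawn) states, so it never exceeds the N^2 bound:

#include <algorithm>
#include <fstream>
#include <map>
#include <utility>
#include <vector>
using namespace std;

int main() {
    ifstream fin("snapin.txt");
    ofstream fout("snapout.txt");

    int n;
    fin >> n;
    vector<int> card(2 * n);
    for (int i = 0; i < 2 * n; i++) fin >> card[i];

    // best[{top, last}] = most snaps attainable while reaching the state
    // "this label on top of the deck, that label drawn last".
    // last == 0 is a sentinel for "nothing drawn yet" (labels start at 1).
    map<pair<int, int>, int> best;
    best[{card[0], 0}] = 0;

    for (int t = 0; t < 2 * n; t++) {
        // At time t the deck is [top, card[t+1], card[t+2], ...]:
        // everything below the top card is the untouched original suffix.
        map<pair<int, int>, int> next;
        auto relax = [&](pair<int, int> state, int score) {
            auto it = next.find(state);
            if (it == next.end() || it->second < score) next[state] = score;
        };
        bool secondExists = (t + 1 < 2 * n);
        for (auto& [state, score] : best) {
            auto [top, last] = state;
            // Choice 1: reveal the top card.
            relax({secondExists ? card[t + 1] : 0, top},
                  score + (top == last ? 1 : 0));
            // Choice 2: reveal the second-from-top card, if there is one.
            if (secondExists)
                relax({top, card[t + 1]},
                      score + (card[t + 1] == last ? 1 : 0));
        }
        best = move(next);
    }

    int answer = 0;
    for (auto& kv : best) answer = max(answer, kv.second);
    fout << answer << endl;
}

As a bonus, at any time t one of the two state components is forced to equal card[t], so the map actually holds only O(N) states per step and the run time lands well under the 4*N^3 worst case.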
Real world applications: If you take this approach and apply it to a digital radio (fixed uniform bit timing, discrete signal levels) receiving a signal using a convolutional error-correcting code, you have the Viterbi decoder. If you apply it to acquired medical data, with variable timing intervals and continuous signal levels, and add some other gnarly math, you get my doctoral project.
I am using Weka 3.7.1
I am attempting to analyze sport predictions for baseball using Weka. I would like to use a cost matrix because the cost of different outcomes is not the same at the sportsbook where I gamble on the games. My data set is simple: it is a set of predictions with a nominal class {WIN,LOSS}. For this question, the attributes are not a concern.
In the Weka Explorer, after loading my ARFF file I can set up a cost matrix from Classify -> More Options... -> Cost-sensitive evaluation -> Set... ; a 2x2 grid appears in the cost-sensitive evaluation dialog after I set the number of classes to 2.
Here are the values I would like to enter into the cost matrix:
Correctly classified as loss, cost is 0 (I did not wager)
Incorrectly classified as loss, cost is 0 (I did not wager)
Correctly classified as win, cost is -.909 (I won .909 dollars)
Incorrectly classified as win, cost is 1.0 (I lost a dollar)
Observe that, to stay true to this being a 'cost' matrix, I set my profit to a negative value (a profit being the opposite of a cost), and I set the loss to a positive number (because it cost me when I lost the wager).
After some reflection I decided to use the following grid. I have no clue whether I did this correctly, so please let me know:
  a     b        <---- "classified as"
  0     1.0      a=LOSS
  0     -.909    b=WIN
And here is my probably faulty logic, reading cells as (col, row):
(0,0) of grid = 0: classified as LOSS, and was LOSS
(0,1) of grid = 0: classified as LOSS, but was WIN
(1,0) of grid = 1.0: classified as WIN, but was LOSS
(1,1) of grid = -.909: classified as WIN, and was WIN
and of course (0,0) and (0,1) represent the classifier predicting a LOSS and in these cases I do not wager, and therefore there is no cost.
on the other hand (1,0) and (1,1) represent the classifier predicting a WIN and in these cases I place a wager, and therefore there is a cost associated.
One other item is a great source of confusion: after I set up the cost matrix and run a classifier, the output report contains the following:
Evaluation cost matrix:
0 1
0 0.91 <--- notice that this is not a negative value!
And as you can see, in the report cell (1,1) is 0.91 when I had actually entered -.909. I did find another post about this topic, but it does not explain why the negative value became positive.
Thank you in advance. Please note that these are answerable questions; however, if you want to provide some guidance I would be very happy as I am a newbie still trying to build a framework of understanding.
A cost matrix is a way to change the threshold of the decision boundary.
It is explained in the following paper:
http://research.ijcaonline.org/volume44/number13/pxc3878677.pdf
Looking at your cost matrix, it seems a small correction is required, so that it has the form:

0     cost
cost  0
Just for explanation, consider the following cost matrix:

a  b
c  d

This is the general format of a cost matrix that I have observed for two-class problems. When something is classified at the a or d location, there is no need to incorporate any cost. So the point here is that the cost comes into the picture only when there is a misclassification, i.e. at the b or c location.
But since you have written a negative value as the cost at location d, it creates confusion. (Kindly explain what you mean by a negative cost.)
An example cost matrix could be:

0   1
10  0

which says that the cost of misclassifying an example as a false positive is 10 times higher than the cost of misclassifying a similar example as a false negative. Moreover, there is no cost when examples are classified correctly.
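To make the "threshold change" concrete, here is a minimal sketch of the minimum-expected-cost rule that a cost matrix implements (the class probabilities are hypothetical):

#include <iostream>
using namespace std;

int main() {
    // cost[actual][predicted], using the example matrix above.
    double cost[2][2] = {{0.0,  1.0},
                         {10.0, 0.0}};
    double p0 = 0.7, p1 = 0.3; // hypothetical classifier probabilities

    // Expected cost of each possible prediction, weighted by how
    // likely each actual class is.
    double ePred0 = p0 * cost[0][0] + p1 * cost[1][0]; // 0.3 * 10 = 3.0
    double ePred1 = p0 * cost[0][1] + p1 * cost[1][1]; // 0.7 * 1  = 0.7

    // Even though class 0 is more probable, the expensive c cell
    // pushes the decision to class 1: the threshold has moved.
    cout << "predict class " << (ePred1 < ePred0 ? 1 : 0) << "\n";
}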