Pearson correlation coefficient, is it the right way? - pearson-correlation

I have two datasets and want to find how strongly they are correlated.
The datasets represent the match results of two teams, where 1 represents a win, 0 a draw and -1 a loss.
E.g. for 5 games:
team1 = [1,1,0,-1,0]
team2 = [0,1,0,1,0]
Calculating the Pearson correlation coefficient works fine until one team wins all of its last 5 games, which gives a constant array, e.g.
team1 = [1,1,1,1,1]
In this case the Pearson correlation coefficient is undefined regardless of what team2 did, because the standard deviation of team1 is zero.
I find this weird, because if team2 also won most of the 5 games, the correlation should be close to 1, not undefined.
And vice versa: if team2 lost most of their matches, the correlation should be close to -1, based on my understanding.
Am I doing something wrong here? Or does my data need another method to measure how strong the relation between the datasets is?
Thanks in advance.

So, I found this good resource:
http://www.ashukumar27.io/similarity_functions/
I think I will go with Euclidean distance, which is more suitable for my use case.
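To make the two measures concrete, here is a minimal pure-Python sketch (the helper names `pearson` and `euclidean` are made up for illustration). It shows why Pearson's r has no value for a constant series: the formula divides by the product of the two standard deviations, and one of them is zero. Euclidean distance stays defined, but note that it measures how far apart the results are (0 means identical), not how they move together.

```python
import math

def pearson(x, y):
    """Return Pearson's r, or None when either series has zero variance."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    dx = [a - mean_x for a in x]
    dy = [b - mean_y for b in y]
    ss_x = sum(d * d for d in dx)  # sum of squared deviations of x
    ss_y = sum(d * d for d in dy)  # sum of squared deviations of y
    if ss_x == 0 or ss_y == 0:
        return None  # r would require dividing by zero
    cov = sum(a * b for a, b in zip(dx, dy))
    return cov / math.sqrt(ss_x * ss_y)

def euclidean(x, y):
    """Straight-line distance between two equally long result vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

team1 = [1, 1, 0, -1, 0]
team2 = [0, 1, 0, 1, 0]
all_wins = [1, 1, 1, 1, 1]

print(pearson(team1, team2))      # defined: both series vary
print(pearson(all_wins, team2))   # None: all_wins is constant
print(euclidean(all_wins, team2)) # still defined
```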

pvlib : time convention issue with cumulated GHI

I have a question regarding time conventions in pvlib.
As far as I understand, all computations use an instantaneous timestep convention: instantaneous weather produces instantaneous electric power.
However, in weather models like GFS, the GHI parameter is accumulated over the last hour. This makes the solar radiation inconsistent with the astronomical parameters (zenith, azimuth...).
For example, take the Erbs function, used to compute DNI and DHI from GHI:
df_dni_dhi_kt = pvlib.irradiance.erbs(ghi, zenith, ghi.index)
Here all parameters are hourly time series, but because of the mismatched conventions the output may be inaccurate:
ghi : cumulated radiation over last hour
zenith : zenith angle at exact hour (instantaneous)
ghi.index : hourly DateTimeIndex
At the end of the power conversion process, a time shift is observed between observations and model (please ignore the amplitude difference; only the time shift matters).
Any ideas on using accumulated GHI as input to the library?
When using hourly data there is definitely a dilemma in how to calculate the solar position. The most common method is to calculate the solar position for the middle of the time step. This is a clear improvement over using either the start or the end of the hour (as in your example). However, around sunrise and sunset this poses an issue, as the sun may be below the horizon at the middle of the hour. Some therefore calculate the sun position for the middle of the part of the hour during which the sun is above the horizon, but that adds complexity.
There's a good discussion on the topic here: https://pvlib-python.readthedocs.io/en/stable/gallery/irradiance-transposition/plot_interval_transposition_error.html
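As a rough sketch of the mid-interval idea, assuming the GHI timestamps label the end of each accumulation hour (as described in the question): shift each timestamp back by half an interval and compute the solar position at those shifted times instead of at ghi.index. The helper name `interval_midpoints` is hypothetical, and only the standard library is used here; the shifted times would then be handed to whatever solar position routine you use (for example pvlib.solarposition.get_solarposition).

```python
from datetime import datetime, timedelta

def interval_midpoints(end_times, interval=timedelta(hours=1)):
    """Map interval-ending timestamps to the middle of each interval."""
    return [t - interval / 2 for t in end_times]

# Hour-ending labels 10:00, 11:00, 12:00 cover 09:00-10:00, 10:00-11:00, ...
hour_ending = [datetime(2020, 6, 1, h) for h in (10, 11, 12)]
midpoints = interval_midpoints(hour_ending)

for end, mid in zip(hour_ending, midpoints):
    print(end.isoformat(), "->", mid.isoformat())
```

This does not handle the sunrise/sunset case mentioned above, where the midpoint of the sunlit part of the hour would be needed.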

Binary Snap [AIO 2015]

This is a question from the Australian Informatics Olympiad
The question is:
Have you ever heard of Melodramia, my friend? It is a land of forbidden forests and boundless swamps, of sprinting heroes and dashing heroines. And it is home to two dragons, Rose and Scarlet, who, despite their competitive streak, are the best of friends.
Rose and Scarlet love playing Binary Snap, a game for two players. The game is played with a deck of cards, each with a numeric label from 1 to N. There are two cards with each possible label, making 2N cards in total. The game goes as follows:
Rose shuffles the cards and places them face down in front of Scarlet.
Scarlet then chooses either the top card, or the second-from-top card from the deck and reveals it.
Scarlet continues to do this until the deck is empty. If at any point the card she reveals has the same label as the previous card she revealed, the cards are a Dragon Pair, and whichever dragon shouts `Snap!' first gains a point.
After many millennia of playing, the dragons noticed that having more possible Dragon Pairs would often lead to a more exciting game. It is for this reason they have summoned you, the village computermancer, to write a program that reads in the order of cards in the shuffled deck and outputs the maximum number of Dragon Pairs that the dragons can find.
I'm not sure how to solve this. I thought of something, but it is wrong (for each card, comparing it with its previous occurrence and taking the maximum over all cards).
Here's my code as of now:
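#include <algorithm>
#include <fstream>
#include <vector>
using namespace std;

int main() {
    ifstream fin("snapin.txt");
    ofstream fout("snapout.txt");
    int n;
    fin >> n;
    // Read the 2N card labels in deck order.
    vector<int> arr(2 * n);
    for (int i = 0; i < 2 * n; i++) {
        fin >> arr[i];
    }
    vector<int> dp(2 * n, 0);    // zero-initialised so every read is defined
    vector<int> pos(n + 1, -1);  // pos[label] = index of first occurrence, or -1
    int maxi = 0;
    // Note: this loop skips the first two and the last two cards, so
    // dp[2n - 1] (printed below) is never updated by it.
    for (int i = 2; i < (2 * n) - 2; i++) {
        if (pos[arr[i]] == -1) {
            pos[arr[i]] = i;
        } else {
            dp[i] = pos[arr[i]] + 1;
            maxi = max(dp[i], maxi);
        }
        dp[i] = max(dp[i], maxi);
    }
    fout << dp[2 * n - 1];
}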
Ok, let's get some basic measurements of the problem out of the way first:
There are 2N cards. 1 card is drawn at a time, without replacement. Therefore there are 2N draws, taking the deck from size 2N (before the first draw) to size 0 (after the last draw).
The final draw takes place from a deck of size 1, and must take the last remaining card.
The 2N-1 preceding draws have deck size 2N, ... 3, 2. For each of these you have a choice between the top two cards. 2N-1 decisions, each with 2 possibilities.
The brute force search space is therefore 2^(2N-1).
That is exponential growth, every optimization scientist's favorite sort of challenge.
If N is small, say 20, the brute force method needs to search "only" a trillion possibilities, which you can get done in a few thousand seconds on a readily available PC that does a few billion operations per second (each solution takes more than one CPU instruction to check).
If N is not quite as small, perhaps 100, the brute force method is akin to breaking the encryption on government secrets.
Not happy with the brute force approach then? I'm not either.
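For reference, the brute force search described above can be sketched in a few lines of Python (the function name `brute_force_snaps` is made up here). It tries both choices at every draw, so it explores 2^(2N-1) lines of play and is only usable for tiny decks, but it is handy for checking a faster solution.

```python
def brute_force_snaps(deck):
    """Try every sequence of choices; deck is listed top card first."""
    def best(top, rest_index, last):
        # top: the card currently on top of the deck
        # deck[rest_index:]: the untouched cards beneath it
        # last: the card revealed most recently (None before the first draw)
        if rest_index == len(deck):
            return int(top == last)  # forced final draw
        nxt = deck[rest_index]
        take_top = int(top == last) + best(nxt, rest_index + 1, top)
        take_second = int(nxt == last) + best(top, rest_index + 1, nxt)
        return max(take_top, take_second)

    return best(deck[0], 1, None) if deck else 0

print(brute_force_snaps([1, 2, 1, 2]))
```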
Before we get to the optimal solution, let’s take a break to explore what the Markov assumption is and what it means for us. It shows up in different fields using different verbiage, but I’ll just paraphrase it in a way that is particularly useful for this problem involving gameplay choices:
Markov Assumption
A process is Markov if and only if the choices available to you in the future depend only on what you have now, and not on how you got it.
A bad but often used real-world example is the stock market. Not only do taxation differences between short-term and long-term capital gains make history important in a small way, but investors do trend analysis and remember what stocks have done before, which affects future behavior in a big way.
A better example, especially for StackOverflow, is that of Turing machines and computer processors. What your program does next depends on the current instruction pointer and the contents of memory, but not the history of memory that’s since been overwritten. But there are many more. As we’ll see shortly, the Binary Snap problem can be formulated as Markov.
Now let’s talk about what makes the Markov assumption so important. For that, we’ll use the Travelling Salesman Problem. No, the Travelling International Salesman Problem. Still too messy. Let’s try the “Travelling International Salesman with a Single-Entry Visa Problem”. But we’ll go through all three of them briefly:
Travelling Salesman Problem
A salesman has to visit potential buyers in N cities. Plan an itinerary for the salesman which minimizes the total cost of visiting all N cities (variations: at least once / exactly once), given a matrix a_{j,k} which is the cost of travel from city j to city k.
Another variation is whether the starting city is predetermined or not.
Travelling International Salesman Problem
The cities the salesman needs to visit are split between two (or more) nations. A subset of the cities have border crossings and have travel options to all cities. The other cities can only reach cities which are either in the same country or are border-equipped.
Alternatively, instead of cities along the border, use cities with international airports. Won’t make a difference in the end.
The cost matrix for this problem looks rather like the flag of the Dominican Republic. Travel between interior cities of country A is permitted, as is travel between interior cities of country B (blue fields). Border cities connect with interior and border cities in both countries (white cross). And direct travel between an interior city of country A and one of country B is impossible (red areas).
Travelling International Salesman with a Single-Entry Visa
Now not only does the salesman need to visit cities in both countries, but he can only cross the border once.
(For travel fanatics, assume he starts in a third country and has single-entry visas for both countries, so he can’t visit some of A, all of B, then return to A for the rest).
Let’s look at an extremely simple case first: Only one border city. We’ll use one additional trick, the one from proof by induction: We assume that all problems smaller than the current one can be solved.
It should be fairly obvious that the Markov assumption holds when the salesman reaches the border city. No matter what path he took through country A, he has exactly the same choice of paths through country B.
But there’s a really important point here: Any path through country A ending at the border and any path through country B starting at the border, can be combined into a feasible full itinerary. If we have two full itineraries x and y, and x spent more money in country A than y did, then even if x has a lower total cost than the total cost of y, we can plan a path better than both, using the portion of y in country A and the portion of x in country B. I’m going to call that “splicing”. The Markov assumption lets us do it, by making all roads leading to the border interchangeable!
In fact, we can look just at the cities of country A, pick the best of all routes to the border, and forget about all the other options as soon as (in our plan) the salesman steps across into B.
This means instead of having factorial(N_A) * factorial(N_B) routes to look at, there’s only factorial(N_A) + factorial(N_B). Which is pretty much factorial(N_A) times better. Wow, is this Markov thing helpful or what?
Ok, that was too easy. Let’s mess it all up by having N_AB border cities instead of just one. Now if I have a path x which costs less in country B and a path y which costs less in country A, but they cross the border in different cities, I can’t just splice them together. So I have to keep track of all the paths through all the cities again, right?
Not exactly. What if, instead of throwing away all the paths through country A except the best y path, I actually keep one path ending in each border city (the lowest cost of all paths ending in the same border city). Now, for any path x I look at in country B, I have a path y_endpt(x) that uses the same border city, to splice it with. So I have to solve the country A and country B partitions each N_AB times to find the best splice of a complete itinerary, for total work of N_AB * factorial(N_A) + N_AB * factorial(N_B), which is still way better than factorial(N_A) * factorial(N_B).
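To get a feel for the size of that saving, here is a small arithmetic sketch. The city counts (8 interior cities per country, 3 border cities) are made-up numbers chosen only for illustration.

```python
from math import factorial

n_a, n_b, n_ab = 8, 8, 3  # hypothetical problem sizes

# Without splicing: every route through A combined with every route through B.
without_splicing = factorial(n_a) * factorial(n_b)

# With splicing: each country is solved once per border city.
with_splicing = n_ab * factorial(n_a) + n_ab * factorial(n_b)

print(without_splicing)  # 1625702400
print(with_splicing)     # 241920
```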
Enough development of tools. Let’s get back to the dragons, since they are subtle and quick to anger and I don’t want to be eaten or burnt to a crisp.
I claim that at any step T of the Binary Snap game, if we consider our “location” a pair of (card just drawn, card on top of deck), the Markov assumption will hold. These are the only things that determine our future options. All the cards below the top one in the deck must be in the same order no matter what we did before. And for knowing whether to count a Snap! with the next card, we need to know the last one taken. And that’s it!
Furthermore, there are N possible labels on the card last drawn, and N possible for the top card on the deck, for a total of N^2 “border cities”. As we start playing the game, there are two choices on the first turn, two on the second, two on the third, so we start out with 2^T possible game states (and a count of Snap!s for each). But by the pigeonhole principle, when 2^T > N^2, some of these plays must end in exactly the same game state (“border city”) as each other, and when that happens, we only need to keep the "dominating" one that got the best score on the way there.
Final complexity bound: 2*N timesteps, from no more than N^2 game states, with 2 draw choices at each, equals an upper limit of 4*N^3 simulated draws.
And that means the same trillion calculations that allowed us to do N=20 with the brute force method now permit N in the thousands (4*N^3 reaches a trillion at about N=6300).
That makes the dragons happy, which makes us alive and well.
Implementation note: Since the challenge didn’t ask for the order of draws, but just the highest attainable number of snaps, all the data you need to keep track of, in addition to the initial ordering of the cards, is the time, T, and a 2-dimensional array (N rows, N columns) of the best score with which you can reach each state at time T.
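Here is a minimal Python sketch of the approach described above, under a few assumptions: the deck is given as a list with the top card first, a dictionary keyed by (card last drawn, card on top) stands in for the N-by-N array, and the file handling for snapin.txt / snapout.txt is left out. The function name `max_snaps` is made up for this sketch.

```python
def max_snaps(deck):
    """Maximum number of Dragon Pairs for a deck listed top card first."""
    if not deck:
        return 0
    # State: (card last drawn, card on top of the deck) -> best score so far.
    # Before the first draw nothing has been revealed, hence None.
    best = {(None, deck[0]): 0}
    for nxt in deck[1:]:  # nxt is the second-from-top card at this step
        new_best = {}
        for (last, top), score in best.items():
            # Option 1: reveal the top card; nxt becomes the new top.
            # Option 2: reveal nxt; the current top card stays on top.
            for state, gained in (((top, nxt), top == last),
                                  ((nxt, top), nxt == last)):
                value = score + gained
                # Keep only the dominating play that reaches each state.
                if value > new_best.get(state, -1):
                    new_best[state] = value
        best = new_best
    # Only one card is left, so the final draw is forced.
    return max(score + (top == last) for (last, top), score in best.items())

print(max_snaps([1, 2, 1, 2]))
```

Because only the best score per state is kept, the work per draw is bounded by the number of distinct states rather than doubling at every step.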
Real world applications: If you take this approach and apply it to a digital radio (fixed uniform bit timing, discrete signal levels) receiving a signal using a convolutional error-correcting code, you have the Viterbi decoder. If you apply it to acquired medical data, with variable timing intervals and continuous signal levels, and add some other gnarly math, you get my doctoral project.

Card game combination

This is more of a maths question, but since I am trying to write code for it, I will ask here.
Situation:
I have poker cards and each player holds 14 cards. This is exactly the game I am trying to implement.
Aim:
I want to make a rating system so the computer can decide whether it is better to take a card from the deck or to continue with the current hand.
Rules:
Cards of the same suit can form rows of consecutive ranks, like spades 3-4-5-6-7, and cards of the same rank in different suits can form a row, like spades 3 - hearts 3 - diamonds 3 - clubs 3. The only important rule is that a row has to contain more than 2 cards, so spades 3-4 is not valid.
Extra rule:
I want the computer to arrange the cards according to the best combination when the cards are dealt in mixed order.
I have come with 2 solutions:
Generating every permutation of the 14 cards and rating each one according to the neighbours of each card, to get an overall score. This has around 87 billion (14!) possibilities and each has to be rated; that takes a huge amount of time, and I want a solution that can be calculated in 1-3 seconds.
Choosing each card and searching for possible rows; if there is a row with at least 3 members, remove it from the list and try the others. Trying this for all cards and row possibilities gives me a manageable number of cases. But this method has the disadvantages of being too complicated and needing many extra checks, such as whether a card is better suited to a same-suit row or a different-suit row.
Has anybody come across a problem like this, or have an idea to point me in the right direction?
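As a possible starting point for the second approach, here is a small Python sketch that only enumerates candidate rows in a hand. It makes several assumptions not stated in the question: cards are (rank, suit) tuples with integer ranks, duplicate cards are ignored, and the function name `find_melds` is made up. Deciding which candidates to actually use when a card fits both a run and a set is a separate step that this sketch does not attempt.

```python
from collections import defaultdict

def find_melds(hand, min_len=3):
    """Return (runs, sets) found in a hand of (rank, suit) tuples.

    runs: maximal same-suit sequences of consecutive ranks
    sets: one rank held in at least min_len different suits
    """
    by_suit = defaultdict(set)
    by_rank = defaultdict(set)
    for rank, suit in hand:
        by_suit[suit].add(rank)
        by_rank[rank].add(suit)

    runs = []
    for suit, ranks in by_suit.items():
        ordered = sorted(ranks)
        start = 0
        for i in range(1, len(ordered) + 1):
            # Close the current run when the sequence breaks or ends.
            if i == len(ordered) or ordered[i] != ordered[i - 1] + 1:
                if i - start >= min_len:
                    runs.append([(r, suit) for r in ordered[start:i]])
                start = i

    sets = [[(rank, s) for s in sorted(suits)]
            for rank, suits in by_rank.items() if len(suits) >= min_len]
    return runs, sets

hand = [(3, "S"), (4, "S"), (5, "S"), (6, "S"), (7, "S"),
        (3, "H"), (3, "D"), (9, "C"), (10, "C")]
runs, sets = find_melds(hand)
print(runs)
print(sets)
```

Note that spades 3 appears in both a run and a set here, which is exactly the conflict a rating step would have to resolve.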

Understanding cost-sensitive evaluation in Weka (cost matrix)

I am using Weka 3.7.1
I am attempting to analyze sport predictions for baseball using weka. I would like to use a cost matrix because the cost of different outcomes is not the same at a sportsbook where I gamble on the game. My data set is simple: it is a set of predictions with a nominal class {WIN,LOSS}. For this question, the attributes are not a concern.
In the Weka Explorer, after loading my ARFF file, I can set up a cost matrix via
Classify -> More Options... -> Cost-sensitive evaluation -> Set...
A 2x2 grid appears in the cost-sensitive evaluation dialog after I set the number of classes to 2.
Here are the values I would like to enter in to the cost matrix:
Correctly classified as loss, cost is 0 (I did not wager)
Incorrectly classified as loss, cost is 0 (I did not wager)
Correctly classified as win, cost is -.909 (I won .909 dollars)
Incorrectly classified as win, cost is 1.0 (I lost a dollar)
Observe that, to stay true to it being a 'cost matrix', I set my profit to a negative value (profit is the opposite of cost) and I set the loss to a positive number (because it cost me when I lost the wager).
After some reflection I decided to use the following grid. I have no idea whether I did this correctly, so please let me know:
a     b       <---- "classified as"
0     1.0     a=LOSS
0     -.909   b=WIN
And here is my probably faulty logic, as (col, row):
(0,0) of grid = 0: classified as LOSS, and was LOSS
(0,1) of grid = 0: classified as LOSS, but was WIN
(1,0) of grid = 1.0: classified as WIN, but was LOSS
(1,1) of grid = -.909: classified as WIN, and was WIN
Of course (0,0) and (0,1) represent the classifier predicting a LOSS; in these cases I do not wager, and therefore there is no cost.
On the other hand, (1,0) and (1,1) represent the classifier predicting a WIN; in these cases I place a wager, and therefore there is a cost associated.
One other item that confuses me greatly: after I set up the cost matrix and run a classifier, the output report contains the following:
Evaluation cost matrix:
0 1
0 0.91 <--- notice that this is not a negative value!
And as you can see, in the report (1,1) is 0.91 when I had actually entered -.909. I did find another post about this topic, but it does not explain why the negative value became positive.
Thank you in advance. Please note that these are answerable questions; however, if you want to provide some guidance I would be very happy as I am a newbie still trying to build a framework of understanding.
A cost matrix is a way to change the threshold value of the decision boundary.
It is explained in the following paper:
http://research.ijcaonline.org/volume44/number13/pxc3878677.pdf
Looking at your cost matrix, it seems a small correction is required. The usual form is:
0 cost
cost 0
To explain, consider the following cost matrix:
a b
c d
This is the general format of a cost matrix that I have seen for two-class problems.
When an example falls in cell a or d, it has been classified correctly, so there is no cost to incorporate.
So the point is that cost comes into the picture only when there is a misclassification, i.e. in cell b or c.
But since you have entered a negative value as the cost in cell d, it creates confusion. (Could you please explain what you mean by a negative cost?)
An example cost matrix could be:
0 1
10 0
which says that the cost of one kind of misclassification (a false positive) is 10 times higher than the cost of the other kind (a false negative). Moreover, there is no cost when examples are classified correctly.
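To see how a cost matrix turns into a single number, here is a minimal Python sketch that multiplies each cell of a confusion matrix by the matching cell of the cost matrix and adds everything up. It assumes rows are the actual class and columns are the "classified as" class, in the order LOSS, WIN, as in the grid from the question. The confusion counts are invented for illustration, and the function name `total_cost` is made up.

```python
def total_cost(confusion, cost):
    """Sum of count * cost over all (actual, classified-as) cells."""
    return sum(confusion[i][j] * cost[i][j]
               for i in range(len(cost))
               for j in range(len(cost[i])))

# Rows: actual LOSS, actual WIN. Columns: classified as LOSS, WIN.
cost = [[0.0, 1.0],      # actual LOSS: no wager / lost one dollar
        [0.0, -0.909]]   # actual WIN:  no wager / won 0.909 dollars

confusion = [[40, 10],   # hypothetical counts, not real results
             [20, 30]]

print(total_cost(confusion, cost))  # negative total means a net profit
```

With these made-up counts, the 10 losing wagers cost 10 dollars and the 30 winning wagers return about 27.27 dollars, so the total cost is about -17.27.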

Population-weighted center of a state

I have a list of states, major cities in each state, their populations, and lat/long coordinates for each. Using this, I need to calculate the latitude and longitude that corresponds to the center of a state, weighted by where the population lives.
For example, if a state has two cities, A (population 100) and B (population 200), I want the coordinates of the point that lies 2/3rds of the way between A and B.
I'm using the SAS dataset that comes installed called maps.uscity. It also has some variables called "Projected Longitude/Latitude from Radians", which I think might allow me just to take a simple average of the numbers, but I'm not sure how to get them back into unprojected coordinates.
More generally, if anyone can suggest a straightforward approach to calculate this, it would be much appreciated.
The Census Bureau has actually done these calculations, and posted the results here: http://www.census.gov/geo/www/cenpop/statecenters.txt
Details on the calculation are in this pdf: http://www.census.gov/geo/www/cenpop/calculate2k.pdf
To answer the question that was asked, it sounds like you might be looking for a weighted mean. Just use PROC MEANS and take a weighted average of each coordinate:
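/* data from http://www.world-gazetteer.com/ */
/* lon is given in degrees west (positive values) */
data AL;
input city :$10. pop lat lon; /* :$10. reads city names up to 10 characters */
datalines;
Birmingham 242452 33.53 86.80
Huntsville 159912 34.71 86.63
Mobile 199186 30.68 88.09
Montgomery 201726 32.35 86.28
;
proc means data=AL;
weight pop; /* weight each city by its population */
var lat lon;
run;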
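For comparison, the same weighted average can be sketched outside SAS. The Python below is only an illustration (the function name `weighted_center` is made up) and, like the PROC MEANS step, it treats latitude and longitude as plain numbers, which is an approximation. It uses the two-city example from the question: the result lies 2/3 of the way from A to B.

```python
def weighted_center(cities):
    """cities: list of (population, lat, lon); returns weighted mean (lat, lon)."""
    total_pop = sum(pop for pop, _, _ in cities)
    mean_lat = sum(pop * lat for pop, lat, _ in cities) / total_pop
    mean_lon = sum(pop * lon for pop, _, lon in cities) / total_pop
    return mean_lat, mean_lon

# City A (population 100) and city B (population 200); the coordinates
# are made-up values chosen to make the arithmetic easy to follow.
two_cities = [(100, 0.0, 0.0), (200, 3.0, 6.0)]
print(weighted_center(two_cities))  # (2.0, 4.0)
```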
Itzy's answer is correct. The US Census's lat/lng centroids are based on population. In contrast, the USGS GNIS data's lat/lng averages are based on administrative boundaries.
The files referenced by Itzy are the 2000 US Census data. The Census Bureau is in the process of rolling out the 2010 data. The following link is a starting point for all of this data.
http://www.census.gov/geo/www/tiger/
I can answer a lot of geospatial questions. I am part of a public domain geospatial team at OpenGeoCode.Org
I believe you can do this using the same method used for calculating the center of gravity of an airplane:
Establish a reference point southwest of any part of the state. Actually it doesn't matter where the reference point is, but putting it SW will keep all numbers positive in the usual x-y sense we tend to think of things.
Logically extend N-S and E-W lines from this point.
Also extend such lines from the cities.
For each city get the distance from its lines to the reference lines. These are the moment arms.
Multiply each of the distance values by the population of the city. Effectively you're getting the moment for each city.
Add all of the moments.
Add all of the populations.
Divide the total of the moments by the total of the populations and you have the center of gravity, with respect to the reference point, of the populations involved.
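The steps above can be sketched as follows. This is a rough illustration only: it uses degrees of latitude and longitude directly as the distances (moment arms), which is a flat approximation, and the city data and the function name `center_of_gravity` are made up. It also shows that the choice of reference point does not change the result.

```python
def center_of_gravity(cities, ref=(0.0, 0.0)):
    """cities: list of (population, lat, lon); ref: (lat, lon) reference point."""
    ref_lat, ref_lon = ref
    # Moment of each city = population * distance from the reference lines.
    moment_lat = sum(pop * (lat - ref_lat) for pop, lat, lon in cities)
    moment_lon = sum(pop * (lon - ref_lon) for pop, lat, lon in cities)
    total_pop = sum(pop for pop, lat, lon in cities)
    # Total moment / total population is the offset from the reference point.
    return (ref_lat + moment_lat / total_pop,
            ref_lon + moment_lon / total_pop)

cities = [(100, 32.0, -87.0), (200, 35.0, -84.0)]  # hypothetical data
from_origin = center_of_gravity(cities)
from_southwest = center_of_gravity(cities, ref=(30.0, -90.0))
print(from_origin, from_southwest)
```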