I am working on a question that asks me to compute the weighted average of my dependent variable (hourly wage), using my independent variable as the weight. The independent variable is discrete, with 16 categories and more than 300,000 observations.
How am I supposed to generate the weighted average when the variable has so many observations?
First, determine whether the weights of x are sampling weights, frequency weights, or analytic weights. Then, if y is your dependent variable and x_weight is the variable that contains the weights for your independent variable, type:
mean y [pweight = x_weight] for sampling (probability) weights
mean y [fweight = x_weight] for frequency weights
mean y [aweight = x_weight] for analytic weights
You can find a nice summary of these different options here, as well as information on the more specialized option iweight.
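All three weight types above give the same point estimate of the mean; they differ in how the standard errors are computed. For intuition, here is a minimal sketch of the underlying formula in Python, with made-up values (the variable names are illustrative, not from your dataset):

```python
# A weighted mean is sum(w * y) / sum(w); the number of observations
# doesn't change the formula, only the length of the lists.
y = [10.0, 12.5, 15.0, 20.0]   # e.g., hourly wages (hypothetical)
w = [3, 1, 2, 4]               # weights from the independent variable (hypothetical)

weighted_mean = sum(wi * yi for wi, yi in zip(w, y)) / sum(w)
print(weighted_mean)  # 15.25
```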
I have a dataset with around 15 numeric columns and two categorical columns: a "State" column and an "Income" column with six buckets representing different income ranges. Do I need to encode the "Income" column if it already contains integers 1-6 representing each income range? In addition, what type of encoder should I use for the "State" column, and does anyone have any good resources on this?
In addition, does one typically perform feature selection (wrapper and filter methods such as Pearson's correlation and Recursive Feature Elimination) before PCA? What is a typical correlation threshold when using a method like Pearson's? And what is the ideal number of dimensions, or explained-variance ratio, to use when running PCA? I'm confused about whether you use one of them or both. Thank you.
I have conducted a PCA (in Matlab) on a set of thousands of points of spatial data. I have also calculated the variance explained across the full dataset by each principal component (i.e. PC or eigenvector) by dividing its eigenvalue by the sum of all eigenvalues. As an example, PC 15 accounts for 2% of the variance in the entire dataset; however, there is a subset of points in this dataset for which I suspect PC 15 accounts for a much higher % of their variance (e.g. 80%).
My question is this: is there a way to calculate, from my existing analysis, the variance explained by a given PC for only a subset of points (e.g., 1,000 points from the full dataset of 500k+)? I know that I could run another PCA on just the subset, but for my purposes I need to continue using the PCs from my original analysis. Any ideas for how to do this would be very helpful.
Thanks!
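One way to sketch this (shown in Python/NumPy rather than Matlab, with randomly generated stand-in data): project the subset onto the PCs from the original analysis, using the full-dataset mean for centering, then divide the variance of the subset's scores along one PC by the total variance of the subset's scores.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 6))        # stand-in for the full dataset
mu = X.mean(axis=0)

# PCA via SVD of the centered data; columns of coeff are the PCs
_, _, Vt = np.linalg.svd(X - mu, full_matrices=False)
coeff = Vt.T

subset = X[:100]                      # stand-in for the subset of points
scores = (subset - mu) @ coeff        # project subset onto the ORIGINAL PCs
var_by_pc = scores.var(axis=0, ddof=1)
explained = var_by_pc / var_by_pc.sum()   # fraction of subset variance per PC
print(explained)
```

Because the PCs form an orthonormal basis, the total variance of the subset is preserved under the projection, so the fractions sum to 1.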
I have a categorical dataset and I am using the WEKA software for feature selection. I used CfsSubsetEval as the attribute evaluator with the GreedyStepwise search method. I learned from this link that CFS uses Pearson correlation to find strong correlations in the dataset, and I also found out how to calculate the Pearson correlation coefficient from this link. According to that link, the data values need to be numerical for the evaluation. How, then, could WEKA do the evaluation on my categorical dataset?
The strange result is that, among 70 attributes, CFS selects only 10. Is that because of the categorical dataset? Additionally, my dataset is highly imbalanced, with an imbalance ratio of 1:9 (yes:no).
A quick question: if you go through the link, you can find the statement that the correlation coefficient measures the strength and direction of the linear relationship between two numerical variables X and Y. I understand the strength of the correlation coefficient, which varies between -1 and +1, but what about the direction? How can I get that? The variable is not a vector, so it should not have a direction.
The method correlate in the CfsSubsetEval class is used to compute the correlation between two attributes. It calls other methods, depending on the attribute types, which I've linked here:
two numeric attributes: num_num
numeric/nominal attributes: num_nom2
two nominal attributes: nom_nom
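On the "direction" part of the question: direction here means the sign of the coefficient, not a geometric direction. The coefficient is positive when Y tends to increase with X and negative when Y tends to decrease as X increases. A quick illustration with hypothetical data:

```python
def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

x = [1, 2, 3, 4, 5]
print(pearson(x, [2, 4, 6, 8, 10]))   # 1.0  -- y rises with x (positive direction)
print(pearson(x, [10, 8, 6, 4, 2]))   # -1.0 -- y falls as x rises (negative direction)
```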
I have a matrix in which the rows are brands and the columns are the features of each brand.
First, I calculate the affinity matrix with scikit-learn and then apply spectral clustering to the affinity matrix to do the clustering.
When I calculate the silhouette value for each number of clusters, the silhouette value increases as the number of clusters increases.
Eventually, when the number of clusters gets big enough, the silhouette calculation returns NaN:
# coding: utf-8
import pandas as pd
from sklearn.cluster import SpectralClustering, spectral_clustering
from sklearn.metrics import silhouette_score

data_event = pd.read_csv(r'\Data\data_of_events.csv', header=0, index_col=0,
                         parse_dates=True)
feature_cols = ['Furniture', 'Food & Drinks', 'Technology', 'Architecture', 'Show',
                'Fashion', 'Travel', 'Art', 'Graphics', 'Product Design']
data_event_matrix = data_event[feature_cols].to_numpy()

# compute the affinity matrix
data_event_affinitymatrix = SpectralClustering().fit(data_event_matrix).affinity_matrix_

# clustering
for n_clusters in range(2, 100, 2):
    print(n_clusters)
    labels = spectral_clustering(data_event_affinitymatrix, n_clusters=n_clusters,
                                 n_components=None, eigen_solver=None,
                                 random_state=None, n_init=10, eigen_tol=0.0,
                                 assign_labels='kmeans')
    silhouette_avg = silhouette_score(data_event_affinitymatrix, labels)
    print("For n_clusters =", n_clusters,
          "the average silhouette_score of event clustering is:", silhouette_avg)
If your intention is to find the optimal number of clusters, you can try the elbow method. Multiple variations of this method exist, but the main idea is the same: for different values of K (the number of clusters), say K = 1 to 8, compute the cost function that is most appropriate for your application (for example, the sum of squared distances of all points in a cluster to its centroid, or any other error/cost/variance function). If it is a distance-based function, you will notice that after a certain number of clusters the decrease in the value becomes negligible. Plot the number of clusters along the x-axis and your metric along the y-axis, and choose the value of K at the point where the y-axis value stops changing abruptly.
You can see in the example plot that the optimal value of K is 4.
Image Source : Wikipedia.
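The elbow idea above can be sketched with scikit-learn's KMeans, using its inertia_ attribute (the within-cluster sum of squared distances) as the cost, on made-up data:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# three well-separated blobs, so the "elbow" should appear around K = 3
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(50, 2))
               for c in ([0, 0], [5, 5], [10, 0])])

inertias = []
for k in range(1, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias.append(km.inertia_)

# the drop in inertia flattens out after K = 3 -- that is the elbow
for k, inertia in zip(range(1, 9), inertias):
    print(k, round(inertia, 1))
```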
Another measure that you can use to validate your clusters is the V-measure score. It is a symmetric measure and is often used when the ground truth is unknown. It is defined as the harmonic mean of homogeneity and completeness. Here is an example in scikit-learn for your reference.
EDIT: V-measure is basically used to compare two different cluster assignments to each other.
Finally, if you are interested, you can take a look at Normalized Mutual Information Score to validate your results as well.
References :
Biclustering Scikit-Learn
Elbow Method : Coursera
Research Paper on V-Measure
Choosing the right number of clusters
Update : I recently came across this Self Tuning Spectral Clustering. You can give it a try.
I have a list of states, major cities in each state, their populations, and lat/long coordinates for each. Using this, I need to calculate the latitude and longitude that corresponds to the center of a state, weighted by where the population lives.
For example, if a state has two cities, A (population 100) and B (population 200), I want the coordinates of the point that lies 2/3rds of the way between A and B.
I'm using the maps.uscity dataset that comes installed with SAS. It also has some variables called "Projected Longitude/Latitude from Radians", which I think might allow me to just take a simple average of the numbers, but I'm not sure how to get them back into unprojected coordinates.
More generally, if anyone can suggest of a straightforward approach to calculate this it would be much appreciated.
The Census Bureau has actually done these calculations, and posted the results here: http://www.census.gov/geo/www/cenpop/statecenters.txt
Details on the calculation are in this pdf: http://www.census.gov/geo/www/cenpop/calculate2k.pdf
To answer the question that was asked, it sounds like you might be looking for a weighted mean. Just use PROC MEANS and take a weighted average of each coordinate:
/* data from http://www.world-gazetteer.com/ */
data AL;
input city :$10. pop lat lon;
datalines;
Birmingham 242452 33.53 86.80
Huntsville 159912 34.71 86.63
Mobile 199186 30.68 88.09
Montgomery 201726 32.35 86.28
;
proc means data=AL;
weight pop;
var lat lon;
run;
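For reference, here is the same weighted average computed by hand in Python, using the four-city table above (note these gazetteer longitudes are listed as positive degrees west):

```python
# (city, population, latitude, longitude) from the table above
cities = [("Birmingham", 242452, 33.53, 86.80),
          ("Huntsville", 159912, 34.71, 86.63),
          ("Mobile",     199186, 30.68, 88.09),
          ("Montgomery", 201726, 32.35, 86.28)]

total = sum(pop for _, pop, _, _ in cities)
wlat = sum(pop * lat for _, pop, lat, _ in cities) / total
wlon = sum(pop * lon for _, pop, _, lon in cities) / total
print(wlat, wlon)  # population-weighted centroid, roughly (32.8, 87.0)
```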
Itzy's answer is correct. The US Census's lat/long centroids are based on population. In contrast, the USGS GNIS data's lat/long averages are based on administrative boundaries.
The files referenced by Itzy are from the 2000 US Census. The Census Bureau is in the process of rolling out the 2010 data. The following link is a gateway to all of this data:
http://www.census.gov/geo/www/tiger/
I can answer a lot of geospatial questions. I am part of a public domain geospatial team at OpenGeoCode.Org
I believe you can do this using the same method used for calculating the center of gravity of an airplane:
Establish a reference point southwest of any part of the state. Actually, it doesn't matter where the reference point is, but putting it to the SW keeps all numbers positive in the usual x-y sense we tend to think in.
Logically extend N-S and E-W lines from this point.
Also extend such lines from the cities.
For each city get the distance from its lines to the reference lines. These are the moment arms.
Multiply each of the distance values by the population of the city. Effectively you're getting the moment for each city.
Add all of the moments.
Add all of the populations.
Divide the total of the moments by the total of the populations, and you have the center of gravity of the populations involved, with respect to the reference point.
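The steps above can be sketched directly in Python (hypothetical cities, with x/y distances in miles east and north of the SW reference point):

```python
# (distance_east, distance_north, population) relative to a SW reference point
cities = [(10.0, 40.0, 100),
          (70.0, 10.0, 200),
          (40.0, 60.0, 50)]

total_pop = sum(pop for _, _, pop in cities)
moment_x = sum(x * pop for x, _, pop in cities)   # moments about the N-S reference line
moment_y = sum(y * pop for _, y, pop in cities)   # moments about the E-W reference line

center_x = moment_x / total_pop
center_y = moment_y / total_pop
print(center_x, center_y)  # roughly (48.57, 25.71)
```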