How does CfsSubsetEval (Correlation-based Feature Selection) work in Weka?

I have a categorical dataset and am using WEKA for feature selection. I used CfsSubsetEval as the attribute evaluator with the GreedyStepwise search method. I learned from this link that CFS uses Pearson correlation to find strong correlations in the dataset. I also found out how to calculate the Pearson correlation coefficient using this link. According to that link, the data values need to be numerical for the calculation. How, then, can WEKA evaluate my categorical dataset?
The strange result is that, among 70 attributes, CFS selects only 10. Is that because the dataset is categorical? Additionally, my dataset is highly imbalanced, with an imbalance ratio of 1:9 (yes:no).
A quick question:
If you go through the link, you will find the statement that the correlation coefficient measures the strength and direction of the linear relationship between two numerical variables X and Y. I understand the strength of the correlation coefficient, which varies between -1 and +1, but what about the direction? How can I get that? The variable is not a vector, so it should not have a direction.

The method correlate in the CfsSubsetEval class is used to compute the correlation between two attributes. It calls other methods, depending on the attribute types, which I've linked here:
two numeric attributes: num_num
numeric/nominal attributes: num_nom2
two nominal attributes: nom_nom
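To make the two ideas in the question concrete, here is a minimal Python sketch (not Weka's code). The first function is the plain Pearson coefficient: its sign is the "direction", positive when the two variables rise together and negative when one falls as the other rises. The second is the subset merit heuristic from Hall's CFS formulation, k * r_cf / sqrt(k + k(k-1) * r_ff), which rewards correlation with the class and penalises correlation among the selected features. That penalty is one plausible reason a small subset such as 10 of 70 attributes can come out on top. The function names and toy numbers are illustrative assumptions.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)

def cfs_merit(feature_class_corrs, mean_feature_feature_corr):
    """Hall's CFS merit for a subset of k features.

    feature_class_corrs: correlation of each feature with the class.
    mean_feature_feature_corr: average pairwise correlation among the features.
    """
    k = len(feature_class_corrs)
    r_cf = sum(abs(r) for r in feature_class_corrs) / k
    return k * r_cf / math.sqrt(k + k * (k - 1) * mean_feature_feature_corr)

# Direction is just the sign of the coefficient.
r_up = pearson([1, 2, 3, 4], [2, 4, 6, 8])    # y rises with x
r_down = pearson([1, 2, 3, 4], [8, 6, 4, 2])  # y falls as x rises

# Same feature-class correlations, different redundancy among the features.
merit_independent = cfs_merit([0.5, 0.5, 0.5], 0.0)
merit_redundant = cfs_merit([0.5, 0.5, 0.5], 0.9)
```

With these toy inputs the redundant subset scores lower than the independent one, even though each feature is equally correlated with the class.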

Related

Feature Selection and PCA in Machine Learning

I have a dataset with around 15 numeric columns and two categorical columns: a "State" column and an "Income" column with six buckets, each representing a different income range. Do I need to encode the "Income" column if it already contains the integers 1-6 representing each income range? In addition, what type of encoder should I use for the "State" column, and does anyone have any good resources on this?
Also, does one typically perform feature selection (filter and wrapper methods such as Pearson's correlation and Recursive Feature Elimination) before PCA? What is the typical correlation threshold when using a method like Pearson's? And what is the ideal number of dimensions or explained variance ratio to use when running PCA? I'm unsure whether you use one of them or both. Thank you.

Nonlinear model (with country and time fixed effects)

I am trying to estimate the above nonlinear model in Stata. Unfortunately, I am not comfortable with Stata. Can anyone help me write the above function in Stata?
How can we write the regional dummy, time fixed effects and country fixed effects in the nl command in Stata?
Is there a way to write the summation in the above equation in Stata? Alternatively, is it easier to estimate the equation for each individual region?
Stata 15 introduced a native command for fitting non-linear panel data models.
https://www.stata.com/new-in-stata/nonlinear-panel-data-models-with-random-effects/
That might help get you started, but you need Stata 15.

Confusion regarding Conditional Random Fields

http://i.imgur.com/dspFhlO.png
I am trying to label objects in an image using Conditional Random Fields, but I am stuck on understanding this formula.
Can anyone tell me the meaning of the terms in the formula and how to calculate them?
I am using the MS-COCO dataset, which has labelled images, i.e., I have segmented images.
Here Z(.) = partition function, P(ci | Sj) = probability that segment Sj of image I belongs to class ci, and q = number of pairwise spatial relations.
This is in fact the conditional probability distribution of the labeling c={c1,c2,...,ck} for the image segments, given the segment features S={S1,S2,...,Sk}. p(ci|Si) is the probability of assigning class label ci to segment i, which can be computed using various classifiers such as logistic regression, a neural network, or an SVM. The term B represents the aggregate pairwise function that determines how likely it is for each adjacent pair {i,j} to take labels {ci,cj}. This term can be realized by computing the co-occurrence statistics of different class pairs in the dataset, which is described in detail in this paper:
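As a rough illustration of "co-occurrence statistics of different class pairs", the Python sketch below counts how often two class labels appear in the same image and turns the counts into relative frequencies. It is an assumption-laden toy: the function names and the label lists are hypothetical, and it uses image-level co-occurrence only. The exact statistic and the spatial relations used in the referenced paper are not reproduced here.

```python
from collections import Counter
from itertools import combinations

def cooccurrence_counts(image_labels):
    """Count, for each unordered class pair, the images containing both classes."""
    counts = Counter()
    for labels in image_labels:
        # Use each class once per image; sort so (a, b) and (b, a) share a key.
        for a, b in combinations(sorted(set(labels)), 2):
            counts[(a, b)] += 1
    return counts

def cooccurrence_freq(counts, n_images):
    """Fraction of images in which each class pair co-occurs."""
    return {pair: c / n_images for pair, c in counts.items()}

# Hypothetical per-image class labels (e.g. taken from segment annotations).
images = [
    ["person", "dog"],
    ["person", "car"],
    ["person", "dog", "car"],
]
counts = cooccurrence_counts(images)
freqs = cooccurrence_freq(counts, len(images))
```

A table like `freqs` could then feed a pairwise term that favours label pairs seen together often in the training data.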
Object Categorization using Co-Occurrence, Location and Appearance

Implementing Difference-In-Differences Estimator with GLM in Stata

I am trying to implement a difference-in-differences estimator with a GLM model with Stata 13.0. The parameter I am interested in is the derivative of the expected value with respect to the interaction of binary treatment group indicator T and binary post-treatment period indicator S only (T#S, rather than the full derivative with respect to T). This approach is explained towards the end of this thread on Statalist. This is my code:
glm y i.T##i.S, exposure(e) cluster(user_id) link(log) family(poisson) robust
preserve
replace e = 30
margins rb0.T#rb0.S
restore
The preserve/replace/restore step is necessary because margins does not allow the at() option to be used with exposure variables.
Two questions.
How would I get a p-value for this effect?
Is it possible to get the effect in semi-elasticity form, perhaps by using margins with eydx() in some way?

Population-weighted center of a state

I have a list of states, major cities in each state, their populations, and lat/long coordinates for each. Using this, I need to calculate the latitude and longitude that corresponds to the center of a state, weighted by where the population lives.
For example, if a state has two cities, A (population 100) and B (population 200), I want the coordinates of the point that lies 2/3rds of the way between A and B.
I'm using the SAS dataset that comes installed, called maps.uscity. It also has some variables labelled "Projected Longitude/Latitude from Radians", which I think might allow me just to take a simple average of the numbers, but I'm not sure how to get them back into unprojected coordinates.
More generally, if anyone can suggest a straightforward approach to calculating this, it would be much appreciated.
The Census Bureau has actually done these calculations, and posted the results here: http://www.census.gov/geo/www/cenpop/statecenters.txt
Details on the calculation are in this pdf: http://www.census.gov/geo/www/cenpop/calculate2k.pdf
To answer the question that was asked, it sounds like you might be looking for a weighted mean. Just use PROC MEANS and take a weighted average of each coordinate:
/* data from http://www.world-gazetteer.com/ */
data AL;
input city :$10. pop lat lon; /* colon modifier: list input, city names up to 10 characters */
datalines;
Birmingham 242452 33.53 86.80
Huntsville 159912 34.71 86.63
Mobile 199186 30.68 88.09
Montgomery 201726 32.35 86.28
;
proc means data=AL;
weight pop;
var lat lon;
run;
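For readers without SAS, the same weighted average can be sketched in Python. The example uses the two-city case from the question (A with population 100, B with population 200) at made-up coordinates, so the result should land two thirds of the way from A to B. The function name and the coordinates are illustrative assumptions. Averaging latitude and longitude directly treats the area as flat, which is an approximation.

```python
def weighted_center(cities):
    """Population-weighted mean of (lat, lon).

    cities: iterable of (population, lat, lon) tuples.
    """
    cities = list(cities)
    total_pop = sum(pop for pop, _, _ in cities)
    lat = sum(pop * lat for pop, lat, _ in cities) / total_pop
    lon = sum(pop * lon for pop, _, lon in cities) / total_pop
    return lat, lon

# Hypothetical coordinates: A at (0, 0), B at (3, 6).
city_a = (100, 0.0, 0.0)
city_b = (200, 3.0, 6.0)
center_lat, center_lon = weighted_center([city_a, city_b])
```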
Itzy's answer is correct. The US Census's lat/lng centroids are based on population. In contrast, the USGS GNIS data's lat/lng averages are based on administrative boundaries.
The files referenced by Itzy are the 2000 US Census data. The Census Bureau is in the process of rolling out the 2010 data. The following link is a starting point for all of this data.
http://www.census.gov/geo/www/tiger/
I can answer a lot of geospatial questions. I am part of a public domain geospatial team at OpenGeoCode.Org
I believe you can do this using the same method used for calculating the center of gravity of an airplane:
Establish a reference point southwest of any part of the state. Actually, it doesn't matter where the reference point is, but putting it SW will keep all numbers positive in the usual x-y sense we tend to think of things.
Logically extend N-S and E-W lines from this point.
Also extend such lines from the cities.
For each city get the distance from its lines to the reference lines. These are the moment arms.
Multiply each of the distance values by the population of the city. Effectively you're getting the moment for each city.
Add all of the moments.
Add all of the populations.
Divide the total of the moments by the total of the populations and you have the center of gravity, with respect to the reference point, of the populations involved.
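The steps above can be sketched in Python as follows. This is a minimal illustration on a flat x-y grid with made-up city positions, and the function name is an assumption. Adding the reference point back at the end converts the result to absolute coordinates, which also shows that the choice of reference point does not change the answer.

```python
def center_of_gravity(cities, ref=(0.0, 0.0)):
    """Center of gravity of populations, computed via moments about a reference point.

    cities: list of (population, x, y) tuples on a flat grid.
    ref: (x, y) of the reference point.
    """
    ref_x, ref_y = ref
    # Moment = population * moment arm (distance from the reference lines).
    moment_x = sum(pop * (x - ref_x) for pop, x, _ in cities)
    moment_y = sum(pop * (y - ref_y) for pop, _, y in cities)
    total_pop = sum(pop for pop, _, _ in cities)
    # Offset from the reference point, shifted back to absolute coordinates.
    return ref_x + moment_x / total_pop, ref_y + moment_y / total_pop

# Hypothetical cities: (population, x, y).
cities = [(100, 0.0, 0.0), (200, 3.0, 6.0)]
cog_origin = center_of_gravity(cities)                     # reference at the origin
cog_sw = center_of_gravity(cities, ref=(-10.0, -5.0))      # reference to the southwest
```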