Creating box plots of the gap between two groups by deciles - stata

I work with Stata and I have math grades for two different groups: A and B.
I want to see the gap that exists between both groups in each decile. In addition I want to do a box plot of this gap for each decile (I want to have 10 box plots, one for each decile which shows the gap between group grades).
What I first did was to compute the deciles using xtile for both groups:
xtile decileA= mat if group==1, nq(10)
xtile decileB= mat if group==0, nq(10)
However, groups A and B have neither the same number of observations nor the same distribution. I thought of computing quantiles within each decile for each group and subtracting them, to get the difference at each quantile within each decile and build the box plot from those differences. But I do not know how to proceed from there to create the graph, and because the number of observations differs between the groups in each decile, I do not know whether this is a valid approach.
If I try to use the pctile command and compute the difference at each decile, I lose all the variance in the data inside each decile. I only get median differences and not all the quantiles I want.
Example:
pctile decileA= mat if group==1, nq(10)
pctile decileB= mat if group==0, nq(10)
gen qdiff= decileA- decileB if _n<10
gen qtau=_n/10 if _n<10
graph box qdiff, over(qtau)
Is there a way to produce the graph I have in mind?
Cross-posted on Statalist.

There is certainly a way to accomplish what you want with a bit of effort, but if the goal is to make a comparison between the two groups at each decile with some notion of variability, you can easily get that from a simultaneous quantile regression and the SEs that it produces:
sysuse auto, clear
sqreg price i.foreign, quantile(.1 .2 .3 .4 .5 .6 .7 .8 .9)
margins, dydx(foreign) ///
predict(outcome(q10)) ///
predict(outcome(q20)) ///
predict(outcome(q30)) ///
predict(outcome(q40)) ///
predict(outcome(q50)) ///
predict(outcome(q60)) ///
predict(outcome(q70)) ///
predict(outcome(q80)) ///
predict(outcome(q90)) ///
post
marginsplot, yline(0) xlab(, grid) ylab(#10, grid angle(90))
This yields a graph showing that foreign origin is associated with a higher price at higher deciles, with the exception of the top decile, though probably none of the differences is significant here given how much the CIs overlap:
You can even conduct formal hypothesis tests that the effects are equal like this:
. test _b[1.foreign:9._predict] = _b[1.foreign:8._predict]
( 1) - [1.foreign]8._predict + [1.foreign]9._predict = 0
chi2( 1) = 3.72
Prob > chi2 = 0.0537
With 74 cars, we cannot reject at the 5% level (p = 0.0537) that the effects at the 80th and 90th percentiles are the same, even though the point estimates have opposite signs and similar magnitudes.
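If the goal is the picture described in the question (a spread of gap values at each decile), one alternative to the quantile regression above is to bootstrap the decile gaps. This is a different technique from the answer's, shown here only as a minimal Python sketch: the grades are simulated, and the names `decile_gaps` and `bootstrap_decile_gaps` are made up for illustration.

```python
import random
import statistics


def decile_gaps(a, b):
    """Gap between the nine decile cut points of a and of b."""
    qa = statistics.quantiles(a, n=10)  # 9 cut points: 10th..90th percentile
    qb = statistics.quantiles(b, n=10)
    return [x - y for x, y in zip(qa, qb)]


def bootstrap_decile_gaps(a, b, reps=200, seed=0):
    """Resample each group with replacement and recompute the gaps."""
    rng = random.Random(seed)
    draws = []
    for _ in range(reps):
        resampled_a = rng.choices(a, k=len(a))
        resampled_b = rng.choices(b, k=len(b))
        draws.append(decile_gaps(resampled_a, resampled_b))
    return draws


# Simulated grades (assumption: stand-ins for the real data);
# the groups deliberately differ in size and spread.
data_rng = random.Random(1)
group_a = [data_rng.gauss(60, 10) for _ in range(300)]
group_b = [data_rng.gauss(55, 15) for _ in range(200)]

point_gaps = decile_gaps(group_a, group_b)        # one gap per decile
gap_draws = bootstrap_decile_gaps(group_a, group_b)

# Column d of gap_draws holds the bootstrap distribution of the gap at
# decile d + 1, which is what a per-decile box plot would summarize.
gaps_at_median = [draw[4] for draw in gap_draws]
print(round(point_gaps[4], 2), round(statistics.stdev(gaps_at_median), 2))
```

Because each group is resampled separately, unequal group sizes are not a problem for this approach.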

Related

Title of second y-axis in stata

I am trying to create a graph with two y axes (below code), but the title of the right-side y-axis does not appear. Can anyone please help me with that?
sysuse auto, clear
generate kpl=mpg*0.425144
twoway (scatter mpg weight, color(navy) yaxis(1)) (scatter kpl weight, color(navy) yaxis(2) ylabel(4.25 8.5 12.75 17, axis(2)) ytitle(Kilometres per Litre, axis(2))), by(foreign, legend(off) note(Graphs by Car origin))
I think I understand what you want but I would approach it quite differently.
If you want a second scale of km per litre to compare with miles per gallon, that is just the same data points expressed in different units, just as you could show Celsius and Fahrenheit temperatures on different axes or calculate proportions and show percents, or vice versa.
Another variable holding km per litre makes this more difficult, not easier, as the values differ by the corresponding conversion factor.
Here I use mylabels from SSC, which must be installed before you can use it.
Naturally you don't need to show zero, but the identity 0 miles per gallon = 0 km per litre may make the point easier to follow.
sysuse auto, clear
set scheme s1color
mylabels 0(4)16, myscale(#/0.425144) local(yla)
scatter mpg weight, yaxis(1 2) yla(`yla', axis(2) ang(h)) yla(0(10)40, axis(1) ang(h)) ytitle(km per litre, axis(2)) ms(Oh)
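The arithmetic behind the second axis is just a change of units. As a rough Python sketch (an illustration of the conversion, not of how mylabels is implemented), each km-per-litre label is placed at the miles-per-gallon position that corresponds to it:

```python
# Conversion factor used in the question: 1 mile per US gallon = 0.425144 km per litre.
MPG_TO_KPL = 0.425144

# Labels to show on the km-per-litre axis, matching 0(4)16 above.
kpl_labels = list(range(0, 17, 4))

# Position of each label on the underlying mpg scale (label / factor).
mpg_positions = [kpl / MPG_TO_KPL for kpl in kpl_labels]

for kpl, mpg in zip(kpl_labels, mpg_positions):
    print(f"{kpl:>2} km per litre sits at {mpg:.2f} mpg")
```

Only one variable (mpg) is plotted; the second axis merely relabels the same positions.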

Stata: Storing only part of a FE regression output for graphing

I am running a regression with two categories of fixed effects (country and year; the data are macroeconomic). Since I am using xtreg, one set is absorbed and hidden automatically, but the other enters as a variable:
xtreg fiveyearyg taxratio i.year if taxratiocut == 1, i(wbcode1) fe cluster(wbcode1)
estimates store yi
I am running a number of these and I want to graph the coefficients for taxratio from each. But when I store the data, it stores both the taxratio coefficient, and the 50+ coefficients for the year fixed effects.
After a lot of searching, I cannot find any way to store (or recall) just part of the regression output, the one coefficient (with SEs) that I care about. Does anyone know a way to do that?
Here is how you can do that:
webuse grunfeld,clear
qui xtreg mvalue invest i.year,fe cluster(company)
//e(b) stores coefficient matrix and e(V) stores variance-covariance matrix. For details type: ereturn list after running the model
//let's say you want to extract only the coefficient on invest
mat coef_matrix=e(b)
scalar coef_invest=coef_matrix[1,1]
dis coef_invest
1.7178414
//to extract the se of the coefficient on invest
mat var_matrix=e(V)
mat diag_var_matrix=vecdiag(var_matrix) //diagonal elements are variances and the standard errors are square roots of these variances
matmap diag_var_matrix se_matrix , m(sqrt(#)) //you need to install matmap using ssc install matmap, you will get error if variance is negative
scalar se_invest=se_matrix[1,1]
dis se_invest
.14082153
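The relationship used above (standard errors are the square roots of the diagonal of the variance-covariance matrix) can be checked by hand. A minimal Python sketch, with a made-up 2x2 matrix standing in for e(V):

```python
import math

# Hypothetical variance-covariance matrix (values invented for illustration).
var_matrix = [
    [0.04, 0.01],
    [0.01, 0.09],
]

# Take the diagonal (the variances), then the square root of each entry.
variances = [var_matrix[i][i] for i in range(len(var_matrix))]
std_errors = [math.sqrt(v) for v in variances]

print(std_errors)
```

The off-diagonal covariances play no role in the individual standard errors.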
Accessing coefficients is as easy as calling _b[varname]; analogously the corresponding standard errors: _se[varname].
An example:
webuse grunfeld, clear
qui xtreg mvalue invest i.year,fe cluster(company)
// coef for invest
display _b[invest]
// std error for invest
display _se[invest]
// displayed results in matrix
matrix list r(table)
For multiple-equation models use [eqno]_b[varname] where the preceding bracket contains an equation number.
More detail can be found in [U] 13.5 Accessing coefficients and standard errors.
Starting with Stata 12, estimation commands also store results in r() [and not just e()]. Notice I listed r(table), which contains most results displayed by the estimation command xtreg.
You show interest in plotting coefficients, so you should read up on the user-written command coefplot. Run ssc install coefplot to download it and help coefplot to get started. It has many options.
Edit
A complete example that plots only coefficients for invest (leaving out those for year), using coefplot, and based on conditional regressions is:
clear
set more off
webuse grunfeld
xtreg mvalue invest i.year if time <= 10,fe cluster(company)
estimates store before10
xtreg mvalue invest i.year if time > 10,fe cluster(company)
estimates store after10
coefplot before10 after10, keep(invest)

SAS Sample Size Estimation

The seeds of the garden pea are either yellow or green. A certain cross between pea plants produces progeny where 75% are plants with yellow seeds and 25% are plants with green seeds. What is the minimum number of progeny you would need to grow to have probability no less than 0.99 of obtaining at least 10 plants with green seeds?
I understand how to estimate a required sample size when I have data such as standard deviation, mean, correlation, etc., but I don't even know where to start to estimate it based on the percentage values with a certain probability.
So far I set up this code in SAS:
Proc power;
onesamplefreq test=Z method=normal
sides=1
alpha=.01
nullproportion=.5
proportion=.25
power=.99
ntotal= .;
run;
Running this program resulted in a sample size of 76, but I don't feel like this is correct. I don't know how to specify that I need at least 10 plants with green seeds, and I don't know how to set the nullproportion or if it matters.
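This is a binomial distribution problem, where the chance of winning (a plant with green seeds) is 25%. You want to win at least 10 times, so how many times do you need to play (that is, how many seeds do you need)?
The mean of the binomial distribution gives a starting point:
np = 10
n*0.25 = 10
n = 40
So 40 seeds give an expected count of 10 plants with green seeds. That is only the mean, though: with n = 40 the probability of getting at least 10 is roughly one half, not 0.99. What the question asks for is the smallest n with P(X >= 10) >= 0.99 for X ~ Binomial(n, 0.25), which is considerably larger than 40, so a sample size of the order of 76 is plausible in magnitude.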
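The smallest n can be found directly from the exact binomial distribution rather than from a power calculation. A minimal sketch in Python (not SAS), where `prob_at_least` and `min_trials` are names invented for this example:

```python
from math import comb


def prob_at_least(k, n, p):
    """Exact P(X >= k) for X ~ Binomial(n, p)."""
    below = sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k))
    return 1.0 - below


def min_trials(k, p, target):
    """Smallest n such that P(X >= k) >= target.

    P(X >= k) increases with n, so the first n that reaches the
    target is the minimum.
    """
    n = k  # fewer than k trials cannot yield k successes
    while prob_at_least(k, n, p) < target:
        n += 1
    return n


n_needed = min_trials(k=10, p=0.25, target=0.99)
print("n = 40 gives P(X >= 10) =", round(prob_at_least(10, 40, 0.25), 3))
print("minimum n:", n_needed, "with P(X >= 10) =", round(prob_at_least(10, n_needed, 0.25), 4))
```

No null proportion is involved here, because the question is about a probability under a known p = 0.25 rather than about a hypothesis test.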

Plotting Multiple ROC curves, or an average one from multi class labels (multinomial regression)

I have a data set with multiple discrete labels, say 4, 5 and 6. On this I run the ExtraTreesClassifier (I will also run a multinomial logit afterward on the same data; this is just a short example) as below:
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.metrics import roc_curve, auc
clf = ExtraTreesClassifier(n_estimators=200, random_state=0, criterion='gini', bootstrap=True, oob_score=True)  # compute_importances dropped: recent scikit-learn versions no longer accept it
# Also tried entropy for the information gain
clf.fit(x_train, y_train)
# y_test holds the true test labels; y_predicted holds the labels predicted by the fitted ExtraTreesClassifier
y_predicted=clf.predict(x_test)
fpr, tpr, thresholds = roc_curve(y_test, y_predicted,pos_label=4) # recall my labels are 4,5 and 6
roc_auc = auc(fpr, tpr)
print("Area under the ROC curve : %f" % roc_auc)
The question is: is there something like an average ROC curve? Basically, I could add up all the tpr and fpr values separately for EACH label value and then take means (would that make sense, by the way?), and then just call
# Would this be statistically correct, and would it mean something worth interpreting?
roc_auc_average = auc(fpr_average, tpr_average)
print("Area under the ROC curve : %f" % roc_auc_average)
I assume I will get something similar to this afterward, but how do I interpret thresholds in this case?
How to plot a ROC curve for a knn model
Hence, please also mention whether I can or should get individual thresholds in this case, and why one approach would be statistically better than the other.
What I tried so far (besides averaging):
On changing pos_label to 4, then 5 and 6, and plotting the ROC curves, I see very poor performance, even worse than y=x (the perfectly random case, where tpr=fpr). How should I approach this problem?
ROC averaging for multiclass problems was proposed by Hand & Till in 2001. They basically compute the ROC analysis for all comparison pairs (4 vs. 5, 4 vs. 6 and 5 vs. 6) and average the resulting AUCs.
When you compute the ROC curve with pos_label=4, you implicitly say that the other labels are the negatives (5 and 6). Note that this is slightly different from what was proposed by Hand & Till.
A few notes:
You should make sure that your classifier was trained in a way that makes sense for your ROC analysis. If you say pos_label=5 in roc_curve, and your classifier was trained to recognize 5 as intermediate between 4 and 6, you will certainly get nothing useful here.
If you get AUC < 0.5, it means you are looking at it in the wrong way (and you should reverse your predictions).
In general, ROC analysis is useful for binary classification. Whether it makes sense for multiclass problems is case-dependent, and it might not for you.
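The pairwise averaging described above can be sketched in plain Python. This is an illustrative reading of the Hand & Till idea, not a reference implementation: the function names are invented, the scores are made up, and it assumes per-class scores (such as the columns of predict_proba) rather than the hard labels returned by predict.

```python
from itertools import combinations


def auc_binary(pos_scores, neg_scores):
    """AUC as the chance that a positive outscores a negative (ties count half)."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


def pairwise_average_auc(y_true, scores, classes):
    """Average AUC over all class pairs.

    scores[k][c] is the score of sample k for classes[c].
    For each pair, only samples of those two classes are used.
    """
    pair_aucs = []
    for i, j in combinations(range(len(classes)), 2):
        idx_i = [k for k, y in enumerate(y_true) if y == classes[i]]
        idx_j = [k for k, y in enumerate(y_true) if y == classes[j]]
        # Separate class i from class j using the score for class i ...
        a_ij = auc_binary([scores[k][i] for k in idx_i],
                          [scores[k][i] for k in idx_j])
        # ... and class j from class i using the score for class j.
        a_ji = auc_binary([scores[k][j] for k in idx_j],
                          [scores[k][j] for k in idx_i])
        pair_aucs.append((a_ij + a_ji) / 2)
    return sum(pair_aucs) / len(pair_aucs)


# Made-up labels and per-class scores for classes 4, 5 and 6.
classes = [4, 5, 6]
y_true = [4, 4, 5, 5, 6, 6]
scores = [
    [0.8, 0.1, 0.1], [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1], [0.2, 0.6, 0.2],
    [0.1, 0.1, 0.8], [0.1, 0.2, 0.7],
]
print(pairwise_average_auc(y_true, scores, classes))
```

Unlike the pos_label approach, no class is lumped together with another when a pair is evaluated.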

How to find/detect optimal parameters of a Grid Search in Libsvm+Weka?

I'm trying to use SVM with the Weka framework, so I'm using Libsvm. I'm new to SVM, and in the guide on the Libsvm site I read that it is possible to discover optimal parameters for SVM (cost and gamma) using GridSearch. So I chose Grid Search in Weka and obtained bad classification results (TN rate around 1%). How should I interpret these results? If I get bad results with the optimal parameters, is there no chance for me to get a better classification? What I mean is: does Grid Search give me the best results that I can obtain using SVM?
My dataset consists of 1124 instances (89% negative class, 11% positive class) and 31 attributes (2 of them nominal, the others numeric). I'm using 10-fold cross-validation on the whole dataset to test the model.
I tried GridSearch (I normalized each attribute's values to between 0 and 1 and did no feature selection, but I changed the class values from 0 and 1 to 1 and -1 according to SVM theory, though I don't know if that is useful) with these parameters: cost from 1 to 18 with a step of 1.0 and gamma from -5 to 10 with a step of 1.0. The results are sensitivity 93.6% and specificity 64.8%, but the computation takes around 1 hour to complete!
I'd like to get better results than with a decision tree. Using feature selection (Info Gain ranking) + SMOTE oversampling + cost-sensitive learning I obtained sensitivity 91% and specificity 80%. Is there a way to tune SVM without trying every possible range of values for cost and gamma?