cycling Ranksum on Stata - stata

I have some data on two different groups of patients, automatically exported from a diagnostic tool.
Variables are named automatically by the diagnostic tool (e.g. L1DensityWholeImage, L1WholeImageSHemi, L1WholeImageIHemi, L1WholeETDRS, [...], DeepL2StartLayer, L2Startoffsetum, L2EndLayer, [...], Perimeter, AcircularityIndex).
I have to perform a rank-sum test (Mann-Whitney U test) on all the variables (more than 80) by group.
Normally, I would write each analysis separately, like this:
ranksum L1DensityWholeImage, by(Group)
ranksum L1WholeImageSHemi, by(Group)
ranksum L1WholeImageIHemi, by(Group)
ranksum L1WholeETDRS, by(Group)
Is there any way to write the command with a varlist? And ideally to obtain a single output listing all the P-values?
e.g.: ranksum L1DensityWholeImage L1WholeImageSHemi L1WholeImageIHemi L1WholeETDRS, DeepL2StartLayer L2Startoffsetum L2EndLayer Perimeter AcircularityIndex, by(Group)

The short answer is to write a loop and customise the output.
Here is a token example which you can run.
sysuse auto, clear
foreach v of var mpg price weight length displacement {
quietly ranksum `v', by(foreign) porder
scalar pval = 2*normal(-abs(r(z)))
di "`v'{col 14}" %05.3f pval " " %6.4e pval " " %05.3f r(porder)
}
Output is
mpg          0.002 1.9e-03 0.271
price        0.298 3.0e-01 0.423
weight       0.000 3.8e-07 0.875
length       0.000 9.4e-07 0.862
displacement 0.000 1.1e-08 0.921
Notes:
If your variable names are longer, they will need more space.
Displaying P-values with fixed numbers of decimal places won't prepare you for the circumstance in which all displayed digits are zero. The code exemplifies two forms of output.
The probability that values for the first group exceed those for the second group is very helpful in interpretation. Further summary statistics could be added.
Naturally a presentable table needs more header lines, best given with display.
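As a rough sketch of the last note, header lines can be written with display before the loop starts. The column positions, header wording and formats below are arbitrary choices for illustration, not anything ranksum prescribes, and wider columns would be needed for longer variable names.

```stata
sysuse auto, clear

* header lines; column positions chosen by eye for this example
display "Variable" _col(16) "P-value" _col(26) "P (sci.)" _col(38) "P(1st > 2nd)"
display "{hline 50}"

foreach v of varlist mpg price weight length displacement {
    quietly ranksum `v', by(foreign) porder
    * two-sided P-value from the normal approximation
    scalar pval = 2*normal(-abs(r(z)))
    display "`v'" _col(16) %7.3f pval _col(26) %9.2e pval _col(38) %7.3f r(porder)
}
```

For the original problem, the varlist after foreach would be replaced by the 80 or so variable names (or a wildcard or range that picks them out) and foreign by Group.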

Related

How to store confidence intervals from stata margins estimation?

Stata experts,
I have been trying to find a way to store marginal estimations, including the p value and confidence interval.
Below is the code I have. All that I can get is the estimated marginal effect of variable I. Looks like I can't specify "ci" like what we can do for usual regression models. Is there a way to also store and present the other numbers from marginal estimations?
probit Y1 X
margins, dydx(X) post
est store m1
probit Y2 X
margins, dydx(X) post
est store m2
esttab m1 m2
esttab m1 m2, ci
Another related question is: how do I save marginal estimations for interaction terms? Example code below
probit Y2 year month year*month
margins year#month, asbalanced post
Thank you in advance!
Here's a way to grab p-values and confidence intervals after a margins command.
sysuse auto, clear
probit foreign price trunk
margins, dydx(price) post
eststo m1
The results from the margins command:
------------------------------------------------------------------------------
             |            Delta-method
             |      dy/dx   std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
       price |   .0000268   .0000159     1.69   0.092    -4.36e-06     .000058
------------------------------------------------------------------------------
The p-value and confidence interval are then recoverable from the stored matrices e(b) and e(V). To get the p-value, we need the z-score, which is the point estimate divided by its standard error (e(b)[1,1]/sqrt(e(V)[1,1])). The rest is calculating the area in the two tails using normal().
The confidence interval is the point estimate e(b)[1,1] plus or minus the standard error sqrt(e(V)[1,1]) times the critical value of z, invnormal(0.975).
Shown with the output so that you can see the numbers line up:
. di "P-value: " normal(-abs(e(b)[1,1]/sqrt(e(V)[1,1])))*2
P-value: .09186065
. di "Upper bound: " e(b)[1,1] + sqrt(e(V)[1,1])*invnormal(0.975)
Upper bound: .00005796
. di "Lower bound: " e(b)[1,1] - sqrt(e(V)[1,1])*invnormal(0.975)
Lower bound: -4.361e-06
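The same quantities can also be sketched with _b[] and _se[], which refer to the posted margins results by coefficient name rather than by matrix position. This assumes margins was run with the post option, as above; the scalar names (zstat, pval, ci_lo, ci_hi) are made up for this example.

```stata
sysuse auto, clear
quietly probit foreign price trunk
quietly margins, dydx(price) post

* _b[price] and _se[price] now refer to the posted marginal effect
scalar zstat = _b[price]/_se[price]
scalar pval  = 2*normal(-abs(scalar(zstat)))
scalar ci_lo = _b[price] - invnormal(0.975)*_se[price]
scalar ci_hi = _b[price] + invnormal(0.975)*_se[price]

display "P-value:     " scalar(pval)
display "Lower bound: " scalar(ci_lo)
display "Upper bound: " scalar(ci_hi)
```

Wrapping the names in scalar() guards against a scalar name being read as an abbreviation of a variable name in the dataset.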
To put the p-value in a table, for example, you could use estadd:
estadd scalar pvalue = normal(-abs(e(b)[1,1]/sqrt(e(V)[1,1])))*2
And then esttab:
esttab m1, stats(pvalue, label("P-value"))
. esttab m1, stats(pvalue, label("P-value"))
----------------------------
                      (1)

----------------------------
price           0.0000268
                   (1.69)
----------------------------
P-value            0.0919
----------------------------
t statistics in parentheses
* p<0.05, ** p<0.01, *** p<0.001

Wilcoxon Z score is negative when it should be positive and vice versa

SAS coding: I perform a t-test on the difference between two groups (independent, but from the same population). The sign of the difference and the sign of the t-statistic match (i.e. if the difference between the two groups is negative, the t-statistic is negative; if the difference is positive, the t-statistic is positive).
However, when I run a Wilcoxon rank-sum test, the signs of my z-scores don't match the sign (-/+) of the group difference (i.e. if the difference between the two groups is negative, the z-score is positive; if the difference is positive, the z-score is negative).
I have tried sorting the dataset in both ascending and descending order.
Here's my code:
proc sort data = fundawin3t;
by vb_nvb_TTest;
run;
**Wilcoxon rank sums for vb vs nvb firms.;
proc npar1way data = fundawin3t wilcoxon;
title "NVB vs VB univariate tests and Wilcoxon-Table 4";
var ma_score_2015 age mve roa BM BHAR prcc_f CFI CFF momen6 vb_nvb SERIAL recyc_v;
class vb_nvb_TTest;
run;
Here is my log:
3208
3209 proc sort data = fundawin3t;
3210 by vb_nvb_TTest;
3211 run;
NOTE: Input data set is already sorted, no sorting done.
NOTE: PROCEDURE SORT used (Total process time):
real time 0.00 seconds
cpu time 0.01 seconds
3212
3213 **Wilcoxon rank sums for vb vs nvb firms.;
3214 proc npar1way data = fundawin3t wilcoxon;
3215 title "NVB vs VB univariate tests and Wilcoxon-Table 4";
3216 var ma_score_2015 age mve roa BM BHAR prcc_f CFI CFF momen6
tenure vb_nvb SERIAL
3216! recyc_v;
3217 class vb_nvb_TTest;
3218 run;
NOTE: PROCEDURE NPAR1WAY used (Total process time):
real time 6.59 seconds
cpu time 5.25 seconds
Read the manual. The PROC NPAR1WAY documentation explains this:
To compute the linear rank statistic S, PROC NPAR1WAY sums the scores of the observations in the smaller of the two samples. If both samples have the same number of observations, PROC NPAR1WAY sums those scores for the sample that appears first in the input data set.
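PROC NPAR1WAY computes one-sided and two-sided asymptotic p-values for each two-sample linear rank test. When the test statistic z is greater than its null hypothesis expected value of 0, PROC NPAR1WAY computes the right-sided p-value, which is the probability of a larger value of the statistic occurring under the null hypothesis. When the test statistic is less than or equal to 0, PROC NPAR1WAY computes the left-sided p-value, which is the probability of a smaller value of the statistic occurring under the null hypothesis. The one-sided p-value $P_1(z)$ can be expressed as
$$P_1(z) = \begin{cases} \Pr(Z > z) & \text{if } z > 0 \\ \Pr(Z < z) & \text{if } z \le 0 \end{cases}$$
where $Z$ has a standard normal distribution.
In other words, the sign of z depends on which sample's scores are summed (the smaller sample, or the first one in the input data set when the sizes are equal), not on the direction of the difference in means reported by the t-test.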

Computing and plotting difference in group means

In what follows I plot the mean of an outcome of interest (price) by a grouping variable (foreign) for each possible value taken by the fake variable time:
sysuse auto, clear
gen time = rep78 - 3
bysort foreign time: egen avg_p = mean(price)
scatter avg_p time if (foreign==0 & time>=0) || ///
scatter avg_p time if (foreign==1 & time>=0), ///
legend(order(1 "Domestic" 2 "Foreign")) ///
ytitle("Average price") xlab(#3)
What I would like to do is to plot the difference in the two group means over time, not the two separate means.
I am surely missing something, but to me it looks complicated because the information about the averages is stored "vertically" (in avg_p).
Arguably the easiest way to do this is to use linear regression to estimate the differences:
/* Regression Way */
drop if time < 0 | missing(time)
reg price i.foreign##i.time
margins, dydx(foreign) at(time =(0(1)2))
marginsplot, noci title("Foreign vs Domestic Difference in Price")
If regression is hard to wrap your mind around, the other way involves mangling the data with a reshape:
/* Transform the Data */
keep price time foreign
collapse (mean) price, by(time foreign)
reshape wide price, i(time) j(foreign)
gen diff = price1-price0
tw connected diff time
Here is another approach. graph dot will happily plot means.
sysuse auto, clear
set scheme s1color
collapse price if inrange(rep78, 3, 5), by(foreign rep78)
reshape wide price, i(rep78) j(foreign)
rename price0 Domestic
label var Domestic
rename price1 Foreign
label var Foreign
graph dot (asis) Domestic Foreign, over(rep78) vertical ///
marker(1, ms(Oh)) marker(2, ms(+))
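If the data should stay in their original long layout, a further possibility is to compute both group means on every row with egen and difference them directly. This is only a sketch under the same setup as the question; the variable names mean_dom, mean_for, diff and tag are invented for this example.

```stata
sysuse auto, clear
gen time = rep78 - 3
keep if time >= 0 & !missing(time)

* mean price of each group within each time value, copied to every row
bysort time: egen mean_dom = mean(cond(foreign == 0, price, .))
bysort time: egen mean_for = mean(cond(foreign == 1, price, .))
gen diff = mean_for - mean_dom

* plot one point per time value
egen tag = tag(time)
twoway connected diff time if tag, sort ///
    ytitle("Foreign minus domestic mean price")
```

Because cond() returns missing for the other group and egen's mean() ignores missings, each new variable holds one group's mean for that time value, so no reshape is needed.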

Wrap local macro in double quotes that will be visible to esttab's cells() option

I am trying to lazily create a table of means and standard errors for a longish list of variables. It seems that the estout package from SSC and tabstat are the best tools, but I can't get the local macros to work properly to specify esttab's cells() option.
sysuse auto, clear
* build macro for `cells()` option
local i = 1
foreach v of varlist price weight displacement {
local cells "`cells'" " `v'"
if (`i' == 1) local cells "`cells'(fmt(%9.3gc))"
local ++i
}
* properly built
display "`cells'"
* but does not work with `esttab`
estpost tabstat price weight displacement, statistics(mean semean)
esttab ., cells("`cells'")
This yields an "empty" table.
. esttab ., cells("`cells'")
-------------------------
(1)
b
-------------------------
-------------------------
N 74
-------------------------
It seems that cells() needs to see double quotes, but my attempts to add them with single and double quotes at various points in the process have all failed. Is there a way to make this approach work? I would like to avoid manually generating the cells() argument.
* The following approach does work.
esttab ., cells("price(fmt(%9.3gc)) weight displacement")
This yields the correct table.
. esttab ., cells("price(fmt(%9.3gc)) weight displacement")
---------------------------------------------------
(1)
price weight displacement
---------------------------------------------------
mean 6,165 3,019 197
semean 343 90.3 10.7
---------------------------------------------------
N 74
---------------------------------------------------
@Nick has already given a solution to the problem. He claims only stylistic changes were made, but I suspect more.
The double quotes originally used by the poster introduce an additional word in the definition of the local macro cells. That becomes clear when we count the words contained in the local macro using the extended macro function word count. @Richard works with three variables, but we count four words. Careful inspection shows that the additional, surprise word is "", introduced in the first pass through the loop.
In this case, using display to check the contents of the local is misleading, because the command will simply do away with "". As a result, we see "three" words on screen. Displaying each word one by one shows more clearly that the first one is empty.
What this means is you are coding something like:
esttab ., cells(""" price weight displacement")
when you really mean
esttab ., cells("price weight displacement")
Below I post some code consistent with this hypothesis. To simplify exposition, I have stripped away unnecessary complications from the original code.
sysuse auto, clear
// build macro
local i = 1
foreach v of varlist price weight displacement {
local cells "`cells'" " `v'"
*local cells `cells' `v'
}
// check contents of macro cells
local wc : word count "`cells'"
display `wc'
forvalues i = 1/4 {
local w`i' : word `i' of "`cells'"
display "`w`i''"
}
// display a test
local test "" " price" " weight" " displacement"
local wct : word count "`test'"
display `wct' // four words also
// more displays
display "`cells'"
display """ price weight displacement" // same display result
// tables
// post
quietly estpost tabstat price weight displacement, statistics(mean semean)
// original with error
esttab ., cells("`cells'")
// original with error after dereferencing the local macro cells
esttab ., cells(""" price weight displacement")
Nick's solution, that doesn't use double quotes, solves the problem.
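As a minimal sketch of the same check on the unquoted version (run as one block in a do-file, so the local macro stays in scope), the word count should then match the number of variables:

```stata
* start from an empty macro, then append names without double quotes
local cells
foreach v in price weight displacement {
    local cells `cells' `v'
}

* count the words stored in the macro
local wc : word count `cells'
display `wc'
```

Here display should show 3, one word per variable, with no empty leading word.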
One clue might be that the assert is not passing:
sysuse auto, clear
* build macro for `cells()` option
local i = 1
foreach v of varlist price weight displacement {
local cells "`cells'" " `v'"
if (`i' == 1) local cells "`cells'(fmt(%9.3gc))"
local ++i
}
* properly built
display "`cells'"
* but does not work with `esttab`
estpost tabstat price weight displacement, statistics(mean semean)
display "`cells'"
esttab ., cells("`cells'")
local cells2 price(fmt(%9.3gc)) weight displacement
esttab ., cells("`cells2'")
assert "`cells'" == "`cells2'"
esttab ., cells(price(fmt(%9.3gc)) weight displacement)
This works for me with Stata 13.1 and updated estout from SSC. My changes from your code were intended just as stylistic, but see an answer from @Roberto Ferrer.
I did get an error with an older version lurking on my machine, so updating appears to be at least part of the solution.
sysuse auto, clear
* build macro for `cells()` option
local i = 1
foreach v of varlist price weight displacement {
local cells `cells' `v'
if (`i' == 1) local cells `cells'(fmt(%9.3gc))
local ++i
}
* properly built
display "`cells'"
estpost tabstat price weight displacement, statistics(mean semean)
esttab ., cells("`cells'")

Add column with number of observations to esttab summary statistics table

I would like to make a summary statistics table using esttab from the estout package on SSC. I can make the table just fine, but I would like to add a column that counts the number of non-missing observations for each variable. That is, some variables may not be complete and I would like this to be clear to the reader.
In the example below I removed the first five observations for price, so I would like a 69 in that row. But my code doesn't include row-specific observation counts, only the total number of observations in the footer.
sysuse auto, clear
estpost summarize, detail
replace price = . in 1/5
local screen ///
cells("N mean sd min p50 max") ///
nonumber label
esttab, `screen'
This yields an empty N column, which I would prefer to have at 69 , followed by all 74s.
Is this it:
clear all
set more off
*----- example data -----
sysuse auto, clear
keep price mpg rep78 headroom
replace price = . in 1/5
*----- what you want -----
estpost summarize, detail
local screen cells("count mean sd") nonumber label noobs
esttab, `screen'
?
It just uses count. esttab is a wrapper for estout, and the help for the latter documents that it will take "results from e(myel)", which you have from estpost summarize, detail.
An alternative is:
tabstat _all, statistics(count mean sd) columns(statistics)
Yet another, which also allows variable labels to be displayed:
fsum _all, stat(n mean sd) uselabel
fsum is from SSC.
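Returning to the esttab approach, the count column can be given its own header and format through cell suboptions. This is a sketch that assumes a reasonably recent estout from SSC; the label text "Obs" and the numbers of decimals are arbitrary choices.

```stata
sysuse auto, clear
keep price mpg rep78 headroom
replace price = . in 1/5

estpost summarize, detail
* label() and fmt() are set per cell inside cells()
esttab, cells("count(label(Obs) fmt(0)) mean(fmt(2)) sd(fmt(2)) min p50 max") ///
    nonumber label noobs
```

Note that the values are set to missing before estpost summarize runs, so that the stored counts reflect them.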