Plot confidence interval efficiently - stata

I want to plot confidence intervals for some estimates after running a regression model.
As I'm working with a very big dataset, I need an efficient solution: in particular, a solution that does not require me to sort or save the dataset. In the following example, I plot estimates for b1 to b6:
reg y b1 b2 b3 b4 b5 b6
foreach i of numlist 1/6 {
local mean `mean' `=_b[b`i']' `i'
local ci `ci' ///
(scatteri ///
`=_b[b`i'] +1.96*_se[b`i']' `i' ///
`=_b[b`i'] -1.96 * _se[b`i']' `i' ///
,lpattern(shortdash) lcolor(navy))
}
twoway `ci' (scatteri `mean', mcolor(navy)), legend(off) yline(0)
While scatteri efficiently plots the estimates, I can't get boundaries for the confidence interval similar to rcap.
Is there a better way to do this?

Here's token code for what you seem to want. The example is ridiculous. It's my personal view that refining this would be pointless given the very accomplished previous work behind coefplot. The multiplier of 1.96 only applies in very large samples.
sysuse auto, clear
set scheme s1color
reg mpg weight length displ
gen coeff = .
gen upper = .
gen lower = .
gen which = .
local i = 0
quietly foreach v in weight length displ {
local ++i
replace coeff = _b[`v'] in `i'
replace upper = _b[`v'] + 1.96 * _se[`v'] in `i'
replace lower = _b[`v'] - 1.96 * _se[`v'] in `i'
replace which = `i' in `i'
label def which `i' "`v'", modify
}
label val which which
twoway scatter coeff which, mcolor(navy) xsc(r(0.5, `i'.5)) xla(1/`i', val) ///
|| rcap upper lower which, lcolor(navy) xtitle("") legend(off)
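If creating variables is the obstacle in a very large dataset, one possible sketch stays entirely with immediate plot types. twoway pci draws line segments from numbers supplied on the command line, so each interval can be built from a vertical spike plus two short horizontal caps. The cap half-width of 0.1 and the local macro names below are illustrative assumptions, and the multiplier is taken from invttail() with the residual degrees of freedom rather than fixed at 1.96.

```stata
sysuse auto, clear
quietly regress mpg weight length displacement

* t multiplier for a 95% interval, using the residual degrees of freedom
local tcrit = invttail(e(df_r), 0.025)

local i = 0
local segs
local pts
foreach v in weight length displacement {
    local ++i
    local lo = _b[`v'] - `tcrit' * _se[`v']
    local hi = _b[`v'] + `tcrit' * _se[`v']
    local xl = `i' - 0.1    // left end of the caps (0.1 is an arbitrary half-width)
    local xr = `i' + 0.1    // right end of the caps
    * pci takes y1 x1 y2 x2: one vertical spike, then the lower and upper caps
    local segs `segs' `lo' `i' `hi' `i' `lo' `xl' `lo' `xr' `hi' `xl' `hi' `xr'
    local pts `pts' `=_b[`v']' `i'
}

twoway (pci `segs', lcolor(navy)) (scatteri `pts', mcolor(navy)), ///
    legend(off) yline(0) xlabel(1/`i') xtitle("")
```

Nothing is generated, sorted, or saved here; the price is that the whole graph lives in two local macros, which becomes unwieldy for many coefficients.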

Related

k-fold cross validation: how to filter data based on a randomly generated integer variable in Stata

The following seems obvious, yet it does not behave as I would expect. I want to do k-fold cross validation without using SSC packages, and thought I could just filter my data and run my own regressions on the subsets.
First I generate a variable with a random integer between 1 and 5 (5-fold cross validation), then I loop over each fold number. I want to filter the data by the fold number, but using a boolean filter fails to filter anything. Why?
Bonus: what would be the best way to capture all of the test MSEs and average them? In Python I would just make a list or a numpy array and take the average.
gen randint = floor((6-1)*runiform()+1)
recast int randint
forval b = 1(1)5 {
xtreg c.DepVar /// // training set
c.IndVar1 ///
c.IndVar2 ///
if randint !=`b' ///
, fe vce(cluster uuid)
xtreg c.DepVar /// // test set, needs to be performed with model above, not a
c.IndVar1 /// // new model...
c.IndVar2 ///
if randint ==`b' ///
, fe vce(cluster uuid)
}
EDIT: Test set needs to be performed with model fit to training set. I changed my comment in the code to reflect this.
Ultimately the solution to the filtering issue was I was using a scalar in quotes to define the bounds and I had:
replace randint = floor((`varscalar'-1)*runiform()+1)
instead of just
replace randint = floor((varscalar-1)*runiform()+1)
When and where to use the quotes in Stata is confusing to me. I cannot just use varscalar in a loop, I have to use `=varscalar', but I can for some reason use varscalar - 1 and get the expected result. Interestingly, I cannot use
replace randint = floor((`varscalar')*runiform()+1)
I suppose I should just use
replace randint = floor((`=varscalar')*runiform()+1)
So why is it ok to use the version with the minus one and without the equals sign??
The answer below is still extremely helpful and I learned much from it.
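On the quoting question, a short sketch may help. It assumes varscalar was created with the scalar command and that no local macro of the same name exists. Writing a name between a left quote and an apostrophe substitutes the text of a local macro; adding an equals sign after the left quote evaluates an expression first and substitutes its result. A scalar can also be used by name inside an expression.

```stata
clear
set obs 1000
set seed 12345
scalar varscalar = 6

* `varscalar' looks for a LOCAL MACRO of that name; none exists, so nothing is substituted
display "[`varscalar']"

* `=varscalar' evaluates the expression (the scalar) and substitutes its value
display "[`=varscalar']"

* inside an expression the scalar can be used by name, no quotes needed
gen randint = floor((varscalar - 1) * runiform() + 1)

* with the empty macro, (`varscalar'-1) becomes (-1): legal syntax, but not the intended bound
display (`varscalar'-1)

* (`varscalar')*runiform() would become ()*runiform(), which is a syntax error
summarize randint, meanonly
display r(min) " to " r(max)
```

This would also explain the original symptom: with the quoted version the expression reduces to floor((-1)*runiform()+1), which is 0 for essentially every draw, so no observation matches folds 1 to 5.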
As a matter of fact, two different things are going on here that are not necessarily directly related. 1) How to filter data with a randomly generated integer value and 2) k-fold cross-validation procedure.
For the first one, I will leave an example below that should help you work things out in Stata, using tools that transfer easily to other problems (such as generating and manipulating a matrix to store the metrics). However, I would call neither your sketch of code nor my example "k-fold cross-validation", mainly because both fit a separate model on the training data and on the testing data. Strictly speaking, the model should be fit on the training data only, and the resulting parameters should then be used to assess its performance on the testing data.
For further references on the procedure Scikit-learn has done brilliant work explaining it with several visualizations included.
That being said, here is something that could be helpful.
clear all
set seed 4
set obs 100
*Simulate model
gen x1 = rnormal()
gen x2 = rnormal()
gen y = 1 + 0.5 * x1 + 1.5 *x2 + rnormal()
gen byte randint = runiformint(1, 5)
tab randint
/*
randint | Freq. Percent Cum.
------------+-----------------------------------
1 | 17 17.00 17.00
2 | 18 18.00 35.00
3 | 21 21.00 56.00
4 | 19 19.00 75.00
5 | 25 25.00 100.00
------------+-----------------------------------
Total | 100 100.00
*/
// create a matrix to store results
matrix res = J(5,4,.)
matrix colnames res = "R2_fold" "MSE_fold" "R2_hold" "MSE_hold"
matrix rownames res ="1" "2" "3" "4" "5"
// show formatted empty matrix
matrix li res
/*
res[5,4]
R2_fold MSE_fold R2_hold MSE_hold
1 . . . .
2 . . . .
3 . . . .
4 . . . .
5 . . . .
*/
// loop over different samples
forvalues b = 1/5 {
// run the model using fold == `b'
qui reg y x1 x2 if randint ==`b'
// save R squared training
matrix res[`b', 1] = e(r2)
// save rmse training
matrix res[`b', 2] = e(rmse)
// run the model using fold != `b'
qui reg y x1 x2 if randint !=`b'
// save R squared training (?)
matrix res[`b', 3] = e(r2)
// save rmse testing (?)
matrix res[`b', 4] = e(rmse)
}
// Show matrix with stored metrics
mat li res
/*
res[5,4]
R2_fold MSE_fold R2_hold MSE_hold
1 .50949187 1.2877728 .74155365 1.0070531
2 .89942838 .71776458 .66401888 1.089422
3 .75542004 1.0870525 .68884359 1.0517139
4 .68140328 1.1103964 .71990589 1.0329239
5 .68816084 1.0017175 .71229925 1.0596865
*/
// some matrix algebra workout to obtain the mean of the metrics
mat U = J(rowsof(res),1,1)
mat sum = U'*res
/* create vector of column (variable) means */
mat mean_res = sum/rowsof(res)
// show the average of the metrics across the folds
mat li mean_res
/*
mean_res[1,4]
R2_fold MSE_fold R2_hold MSE_hold
c1 .70678088 1.0409408 .70532425 1.0481599
*/
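For the stricter procedure described above (fit on the training folds, evaluate on the held-out fold), a minimal sketch could look like the following. It uses regress on simulated data rather than the xtreg, fe model from the question, where out-of-sample prediction needs more care, and the names fold, yhat, sqerr, sum_mse and cv_mse are made up for illustration.

```stata
clear all
set seed 4
set obs 100
gen x1 = rnormal()
gen x2 = rnormal()
gen y = 1 + 0.5 * x1 + 1.5 * x2 + rnormal()
gen byte fold = runiformint(1, 5)

matrix mse = J(5, 1, .)
scalar sum_mse = 0
forvalues b = 1/5 {
    * fit on the training folds only
    quietly regress y x1 x2 if fold != `b'
    * predict for the held-out fold with those coefficients
    quietly predict double yhat if fold == `b', xb
    quietly gen double sqerr = (y - yhat)^2
    quietly summarize sqerr, meanonly
    * test MSE for this fold
    matrix mse[`b', 1] = r(mean)
    scalar sum_mse = sum_mse + r(mean)
    drop yhat sqerr
}
matrix list mse
scalar cv_mse = sum_mse / 5
display "average test MSE = " cv_mse
```

The matrix plays the role of the Python list in the bonus question; the running scalar is just a shortcut to the average.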

Multiple local in foreach command macro

I have a dataset with multiple subgroups (variable economist) and dates (variable temps99).
I want to run a tabsplit command that does not accept bysort or by prefixes. So I created a macro to apply my tabsplit command to each of my subgroups within my data.
For example:
levelsof economist, local(liste)
foreach gars of local liste {
display "`gars'"
tabsplit SubjectCategory if economist=="`gars'", p(;) sort
return list
replace nbcateco = r(r) if economist == "`gars'"
}
For each subgroup, Stata runs the tabsplit command and I use the variable nbcateco to store count results.
I did the same for the date so I can have the evolution of r(r) over time:
levelsof temps99, local(liste23)
foreach time of local liste23 {
display "`time'"
tabsplit SubjectCategory if temps99 == "`time'", p(;) sort
return list
replace nbcattime = r(r) if temps99 == "`time'"
}
Now I want to do it for each subgroup of economist by date temps99. I tried multiple combinations, but I am not very good with macros (yet?).
What I want is to be able to have my r(r) for each of my subgroups over time.
Here's a solution that shows how to calculate the number of distinct publication categories within each by-group. This uses runby (from SSC). runby loops over each by-group, each time replacing the data in memory with the data from the current by-group. For each by-group, the commands contained in the user's program are executed. Whatever is left in memory when the user's program terminates is considered results and accumulates. Once all the groups have been processed, these results replace the data in memory.
I used the verbose option because I wanted to present the results for each by-group using nice formatting. The derivation of the list of distinct categories is done by splitting each list, converting to a long layout, and reducing to one observation per distinct value. The distinct_categories program generates one variable that contains the final count of distinct categories for the by-group.
* create a demonstration dataset
* ------------------------------------------------------------------------------
clear all
set seed 12345
* Example generated by -dataex-. To install: ssc install dataex
clear
input str19 economist
"Carmen M. Reinhart"
"Janet Currie"
"Asli Demirguc-Kunt"
"Esther Duflo"
"Marianne Bertrand"
"Claudia Goldin"
"Bronwyn Hughes Hall"
"Serena Ng"
"Anne Case"
"Valerie Ann Ramey"
end
expand 20
bysort economist: gen temps99 = 1998 + _n
gen pubs = runiformint(1,10)
expand pubs
sort economist temps99
gen pubid = _n
local nep NEP-AGR NEP-CBA NEP-COM NEP-DEV NEP-DGE NEP-ECM NEP-EEC NEP-ENE ///
NEP-ENV NEP-HIS NEP-INO NEP-INT NEP-LAB NEP-MAC NEP-MIC NEP-MON ///
NEP-PBE NEP-TRA NEP-URE
gen SubjectCategory = ""
forvalues i=1/19 {
replace SubjectCategory = SubjectCategory + " " + word("`nep'",`i') ///
if runiform() < .1
}
replace SubjectCategory = subinstr(trim(SubjectCategory)," ",";",.)
leftalign // from SSC
* ------------------------------------------------------------------------------
program distinct_categories
dis _n _n _dup(80) "-"
dis as txt "fille = " as res economist[1] as txt _col(68) " temps = " as res temps99[1]
// if there are no subjects for the group, exit now to avoid a no obs error
qui count if !mi(trim(SubjectCategory))
if r(N) == 0 exit
// split categories, reshape to a long layout, and reduce to unique values
preserve
keep pubid SubjectCategory
quietly {
split SubjectCategory, parse(;) gen(cat)
reshape long cat, i(pubid)
bysort cat: keep if _n == 1
drop if mi(cat)
}
// show results and generate the wanted variable
list cat
local distinct = _N
dis _n as txt "distinct = " as res `distinct'
restore
gen wanted = `distinct'
end
runby distinct_categories, by(economist temps99) verbose
This is an example of the XY problem, I think. See http://xyproblem.info/
tabsplit is a command in the package tab_chi from SSC. I have no negative feelings about it, as I wrote it, but it seems quite unnecessary here.
You want to count categories in a string variable: semi-colons are your separators. So count semi-colons and add 1.
local SC SubjectCategory
gen NCategory = 1 + length(`SC') - length(subinstr(`SC', ";", "", .))
Then (e.g.) table or tabstat will let you explore further by groups of interest.
To see the counting idea, consider 3 categories with 2 semi-colons.
. display length("frog;toad;newt")
14
. display length(subinstr("frog;toad;newt", ";", "", .))
12
If we replace each semi-colon with an empty string, the change in length is the number of semi-colons deleted. Note that we don't have to change the variable to do this. Then add 1. See also this paper.
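One edge case worth guarding against, as a sketch: an empty string contains no semi-colons, so the formula above would report 1 category for it. The toy data and the cond() guard below are illustrative additions, not part of the original answer.

```stata
clear
input str20 SubjectCategory
"frog;toad;newt"
"frog"
""
end

local SC SubjectCategory
* 0 for empty strings, otherwise number of semi-colons plus 1
gen NCategory = cond(trim(`SC') == "", 0, ///
    1 + length(`SC') - length(subinstr(`SC', ";", "", .)))
list
```

Note that this counts categories within each observation; counting distinct categories across a group still needs a split-and-reduce step.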
That said, a way to extend your approach might be
egen class = group(economist temps99), label
su class, meanonly
local nclass = r(max)
gen result = .
forval i = 1/`nclass' {
di "`: label (class) `i''"
tabsplit SubjectCategory if class == `i', p(;) sort
return list
replace result = r(r) if class == `i'
}
Using statsby would be even better. See also this FAQ.

test with missing standard errors

How can I conduct a hypothesis test in Stata when my predictor perfectly predicts my dependent variable?
I would like to run the same regression over many subsets of my data. For each regression, I would then like to test the hypothesis that beta_1 = 1/2. However, for some subsets, I have perfect collinearity, and Stata is not able to calculate standard errors.
For example, in the below case,
sysuse auto, clear
gen value = 2*foreign*(price<6165)
gen value2 = 2*foreign*(price>6165)
gen id = 1 + (price<6165)
I get the output
. reg foreign value value2 weight length, noconstant
Source | SS df MS Number of obs = 74
-------------+------------------------------ F( 4, 70) = .
Model | 22 4 5.5 Prob > F = .
Residual | 0 70 0 R-squared = 1.0000
-------------+------------------------------ Adj R-squared = 1.0000
Total | 22 74 .297297297 Root MSE = 0
------------------------------------------------------------------------------
foreign | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
value | .5 . . . . .
value2 | .5 . . . . .
weight | 3.54e-19 . . . . .
length | -6.31e-18 . . . . .
------------------------------------------------------------------------------
and
. test value = .5
( 1) value = .5
F( 1, 70) = .
Prob > F = .
In the actual data, there is usually more variation. So I can identify the cases where the predictor does a very good job of predicting the DV--but I miss those cases where prediction is perfect. Is there a way to conduct a hypothesis test that catches these cases?
EDIT:
The end goal would be to classify observations within subsets based on the hypothesis test. If I cannot reject the hypothesis at the 95% confidence level, I classify the observation as type 1. Below, both groups would be classified as type 1, though I only want the second group.
gen type = .
forvalues i = 1/2 {
quietly: reg foreign value value2 weight length if id == `i', noconstant
test value = .5
replace type = 1 if r(p)>.05
}
There is no way to do this out of the box that I'm aware of. Of course you could program it yourself to get an approximation of the p-value in these cases. The standard error is missing here because the relationship between x and y is perfectly collinear. There is no noise in the model, nothing deviates.
Interestingly enough though, the standard error of the estimate is useless in this case anyway. test performs a Wald test for beta_i = exp against beta_i != exp, not a t-test.
The Wald test uses the variance-covariance matrix from the regression. To see this yourself, refer to the Methods and formulas section here and run the following code:
(also, if you remove the -1 from gen mpg2 = and run, you will see the issue)
sysuse auto, clear
gen mpg2 = mpg * 2.5 - 1
qui reg mpg2 mpg, nocons
* collect matrices to calculate Wald statistic
mat b = e(b) // Vector of Coefficients
mat V = e(V) // Var-Cov matrix
mat R = (1) // for use in Rb-r. This does not == [0,1] because of
// the use of the noconstant option in regress
mat r = (2.5) // Value you want to test for equality
mat W = (R*b-r)'*inv(R*V*R')*(R*b-r)
// This is where it breaks for you, because with perfect collinearity, V == 0
reg mpg2 mpg, nocons
test mpg = 2.5
sca F = r(F)
sca list F
mat list W
Now, as @Brendan Cox suggested, you might be able to simply use the missing value returned in r(p) to condition your replace command, depending on exactly how you are using it. A word of caution, however: when the relationship between some x and y is such that y = 2x, test x = 5 and test x = 2 both return missing p-values, so in both cases the observations are classified as type == 1, although the two tests should not lead to the same outcome. Be very careful about how you interpret missing p-values.
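The reason both groups end up as type 1 in the question's loop is that Stata treats a numeric missing value as larger than any number, so r(p) > .05 is true when r(p) is missing. A minimal illustration, with pval as a made-up scalar standing in for r(p):

```stata
scalar pval = .

* missing sorts above every number, so this comparison is true
display (pval > .05)

* excluding missing values explicitly makes the condition false
display (pval > .05 & !missing(pval))
```

In the loop, the replace could be conditioned on r(p) > .05 & !missing(r(p)), with the perfectly predicted subsets then handled separately.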
Another work-around would be to simply set p = 0 in these cases, since the variance estimate will asymptotically approach 0 as the linear relationship becomes near perfect, and thus the Wald statistic will approach infinity (driving p down, all else equal).
A final yet more complicated work-around in this case could be to calculate the F-statistic manually using the formula in the manual, and setting V to some arbitrary, yet infinitesimally small number. I've included code to do this below, but it is quite a bit more involved than simply issuing the test command, and in truth only an approximation of the actual p-value from the F distribution.
clear *
sysuse auto
gen i = ceil(_n/5)
qui sum i
gen mpg2 = mpg * 2 if i <= 5 // Get different estimation results
replace mpg2 = mpg * 10 if i > 5 // over different subsets of data
gen type = .
local N = _N // use for d.f. calculation later
local iMax = r(max) // use to iterate loop
forvalues i = 1/`iMax' {
qui reg mpg2 mpg if i == `i', nocons
mat b`i' = e(b) // collect returned results for Wald stat
mat V`i' = e(V)
sca cov`i' = V`i'[1,1]
mat R`i' = (1)
mat r`i' = (2) // Value you wish to test against
if (cov`i' == 0) { // set V to be very small if Variance = 0 & calculate Wald
mat V`i' = 1.0e-14
}
mat W`i' = (R`i'*b`i'-r`i')'*inv(R`i'*V`i'*R`i'')*(R`i'*b`i'-r`i')
sca W`i' = W`i'[1,1] // collect Wald statistic into scalar
sca p`i' = Ftail(1,`N'-2, W`i') // pull p-value from F dist
if p`i' > .05 {
replace type = 1 if i == `i'
}
}
Also note that this workaround will become slightly more involved if you want to test multiple coefficients.
I hesitate to recommend these approaches without a word of caution, since you are in a very real sense "making up" variance estimates; but without a variance estimate you won't be able to test the coefficients at all.

Stata: Subsetting data using criteria stored in other data set

I have a large data set. I have to subset the data set (Big_data) by using values stored in other dta file (Criteria_data). I will show you the problem first:
**Big_data**               **Criteria_data**
====================       ================================================
lon       lat              4_digit_id  minlon   maxlon   minlat   maxlat
-76.22    44.27            0765        -78.44   -77.22   34.324   35.011
-67.55    33.19            6161        -66.11   -65.93   40.32    41.88
.......                    ........
(over 1 million obs)       (271 observations)
====================       ================================================
I have to subset the big data as follows:
use Big_data
preserve
keep if (-78.44<lon<-77.22) & (34.324<lat<35.011)
save data_0765, replace
restore
preserve
keep if (-66.11<lon<-65.93) & (40.32<lat<41.88)
save data_6161, replace
restore
....
(1) What should be the efficient programming for the subsetting in Stata? (2) Are the inequality expressions correctly written?
1) Subsetting data
With 400,000 observations in the main file and 300 in the reference file, it takes about 1.5 minutes. I can't test this with double the observations in the main file because the lack of RAM takes my computer to a crawl.
The strategy involves creating as many variables as needed to hold the reference latitudes and longitudes (271*4 = 1084 in the OP's case; Stata IC and up can handle this. See help limits). This requires some reshaping and appending. Then we check for those observations of the big data file that meet the conditions.
clear all
set more off
*----- create example databases -----
tempfile bigdata reference
input ///
lon lat
-76.22 44.27
-66.0 40.85 // meets conditions
-77.10 34.8 // meets conditions
-66.00 42.0
end
expand 100000
save "`bigdata'"
*list
clear all
input ///
str4 id minlon maxlon minlat maxlat
"0765" -78.44 -75.22 34.324 35.011
"6161" -66.11 -65.93 40.32 41.88
end
drop id
expand 150
gen id = _n
save "`reference'"
*list
*----- reshape original reference file -----
use "`reference'", clear
tempfile reference2
destring id, replace
levelsof id, local(lev)
gen i = 1
reshape wide minlon maxlon minlat maxlat, i(i) j(id)
gen lat = .
gen lon = .
save "`reference2'"
*----- create working database -----
use "`bigdata'"
timer on 1
quietly {
forvalues num = 1/300 {
gen minlon`num' = .
gen maxlon`num' = .
gen minlat`num' = .
gen maxlat`num' = .
}
}
timer off 1
timer on 2
append using "`reference2'"
drop i
timer off 2
*----- flag observations for which conditions are met -----
timer on 3
gen byte flag = 0
foreach le of local lev {
quietly replace flag = 1 if inrange(lon, minlon`le'[_N], maxlon`le'[_N]) & inrange(lat, minlat`le'[_N], maxlat`le'[_N])
}
timer off 3
*keep if flag
*keep lon lat
*list
timer list
The inrange() function implies that the minimums and maximums must be adjusted beforehand to satisfy the OP's strict inequalities (the function tests <=, >=).
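A quick check of that boundary behaviour, using one of the limits from the question as the test value (a sketch only; real coordinates stored as float may also differ slightly from the typed decimal values):

```stata
* inrange(x, a, b) is true when a <= x <= b, so boundary values are included
display inrange(35.011, 34.324, 35.011)

* the strict version excludes the boundary
display (35.011 > 34.324) & (35.011 < 35.011)
```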
Probably some expansion using expand, use of correlatives and by (so data is in long form) could speed things up. It's not totally clear for me right now. I'm sure there are better ways in plain Stata mode. Mata may be even better.
(joinby was also tested but again RAM was a problem.)
Edit
Doing the computations in chunks, rather than on the complete database, significantly eases the RAM problem. Using a main file with 1.2 million observations and a reference file with 300 observations, the following code does all the work in about 1.5 minutes:
set more off
*----- create example big data -----
clear all
set obs 1200000
set seed 13056
gen lat = runiform()*100
gen lon = runiform()*100
local sizebd `=_N' // to be used in computations
tempfile bigdata
save "`bigdata'"
*----- create example reference data -----
clear all
set obs 300
set seed 97532
gen minlat = runiform()*100
gen maxlat = minlat + runiform()*5
gen minlon = runiform()*100
gen maxlon = minlon + runiform()*5
gen id = _n
tempfile reference
save "`reference'"
*----- reshape original reference file -----
use "`reference'", clear
destring id, replace
levelsof id, local(lev)
gen i = 1
reshape wide minlon maxlon minlat maxlat, i(i) j(id)
drop i
tempfile reference2
save "`reference2'"
*----- create file to save results -----
tempfile results
clear all
set obs 0
gen lon = .
gen lat = .
save "`results'"
*----- start computations -----
clear all
* local that controls # of observations in intermediate files
local step = 5000 // can't be larger than sizebd
timer clear
timer on 99
forvalues en = `step'(`step')`sizebd' {
* load observations and join with references
timer on 1
local start = `en' - (`step' - 1)
use in `start'/`en' using "`bigdata'", clear
timer off 1
timer on 2
append using "`reference2'"
timer off 2
* flag observations that meet conditions
timer on 3
gen byte flag = 0
foreach le of local lev {
quietly replace flag = 1 if inrange(lon, minlon`le'[_N], maxlon`le'[_N]) & inrange(lat, minlat`le'[_N], maxlat`le'[_N])
}
timer off 3
* append to result database
timer on 4
quietly {
keep if flag
keep lon lat
append using "`results'"
save "`results'", replace
}
timer off 4
}
timer off 99
timer list
display "total time is " `r(t99)'/60 " minutes"
use "`results'"
browse
2) Inequalities
You ask if your inequalities are correct. They are in fact legal, meaning that Stata will not complain, but the result is probably unexpected.
The following result may seem surprising:
. display (66.11 < 100 < 67.93)
1
How is it the case that the expression evaluates to true (i.e. 1) ? Stata first evaluates 66.11 < 100 which is true, and then sees 1 < 67.93 which is also true, of course.
The intended expression was (and Stata will now do what you want):
. display (66.11 < 100) & (100 < 67.93)
0
You can also rely on the function inrange().
The following example is consistent with the previous explanation:
. display (66.11 < 100 < 0)
0
Stata sees 66.11 < 100 which is true (i.e. 1) and follows up with 1 < 0, which is false (i.e. 0).
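Applied to the first set of limits in the question, the same logic means the chained condition can never be true for negative longitudes. A small sketch with two made-up observations:

```stata
clear
input lon lat
-77.90 34.50
-76.22 44.27
end

* chained form: (-78.44 < lon) is 0 or 1, and neither is below -77.22
count if (-78.44 < lon < -77.22) & (34.324 < lat < 35.011)
local chained = r(N)

* each comparison written out and joined with &
count if lon > -78.44 & lon < -77.22 & lat > 34.324 & lat < 35.011
local explicit = r(N)

display "chained: `chained'   explicit: `explicit'"
```

The first observation lies inside the box, yet only the explicit form finds it.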
This uses Roberto's data setup:
clear all
set obs 1200000
set seed 13056
gen lat = runiform()*100
gen lon = runiform()*100
local sizebd `=_N' // to be used in computations
tempfile bigdata
save "`bigdata'"
*----- create example reference data -----
clear all
set obs 300
set seed 97532
gen minlat = runiform()*100
gen maxlat = minlat + runiform()*5
gen minlon = runiform()*100
gen maxlon = minlon + runiform()*5
gen id = _n
tempfile reference
save "`reference'"
timer on 1
levelsof id, local(id_list)
foreach id of local id_list {
sum minlat if id==`id', meanonly
local minlat = r(min)
sum maxlat if id==`id', meanonly
local maxlat = r(max)
sum minlon if id==`id', meanonly
local minlon = r(min)
sum maxlon if id==`id', meanonly
local maxlon = r(max)
preserve
use if (inrange(lon,`minlon',`maxlon') & inrange(lat,`minlat',`maxlat')) using "`bigdata'", clear
qui save data_`id', replace
restore
}
timer off 1
I would try to avoid preserving and restoring the "big" file; doing so is possible, but at the expense of losing the Stata format.
Using the same set up as Roberto and Dimitriy did,
set more off
use `bigdata', clear
merge 1:1 _n using `reference'
* check for data consistency:
* minlat, maxlat, minlon, maxlon are either all defined or all missing
assert inlist( mi(minlat) + mi(maxlat) + mi(minlon) + mi(maxlon), 0, 4)
* this will come in handy later
gen byte touse = 0
* set up and cycle over the reference data
count if !missing(minlat)
forvalues n=1/`=r(N)' {
replace touse = inrange(lat,minlat[`n'],maxlat[`n']) & inrange(lon,minlon[`n'],maxlon[`n'])
local thisid = id[`n']
outfile lat lon if touse using data_`thisid'.csv, replace comma
}
Time it on your machine. You could avoid touse and thisid and only have the single outfile within the cycle, but it would be less readable.
You can then infile lat lon using data_###.csv, clear later. If you really need the Stata files proper, you can convert that swarm of CSV files with
clear
local allcsv : dir . files "*.csv"
foreach f of local allcsv {
* change the filename
local dtaname = subinstr(`"`f'"',".csv",".dta",.)
infile lat lon using `"`f'"', clear
if _N>0 save `"`dtaname'"', replace
}
Time it, too. I protected the save as some of the simulated data sets were empty. I think this was faster than 1.5 min on my machine, including the conversion.

Is Conger's kappa available in Stata?

Is the modified version of kappa proposed by Conger (1980) available in Stata? Tried to google it to no avail.
This is an old question, but in case anyone is still looking--the SSC package kappaetc now calculates that, along with every other inter-rater statistic you could ever want.
Since no one has responded with a Stata solution, I developed some code to calculate Conger's kappa using the formulas provided in Gwet, K. L. (2012). Handbook of Inter-Rater Reliability (3rd ed.), Gaithersburg, MD: Advanced Analytics, LLC. See especially pp. 34-35.
My code is undoubtedly not as efficient as others could write, and I would welcome any improvements to the code or to the program format that others wish to make.
cap prog drop congerkappa
prog def congerkappa
* This program has only been tested with Stata 11.2, 12.1, and 13.0.
preserve
* Number of judges
scalar judgesnum = _N
* Subject IDs
quietly ds
local vlist `r(varlist)'
local removeit = word("`vlist'",1)
local targets: list vlist - removeit
* Sums of ratings by each judge
egen judgesum = rowtotal(`targets')
* Sum of each target's ratings
foreach i in `targets' {
quietly summarize `i', meanonly
scalar mean`i' = r(mean)
}
* % each target rating of all target ratings
foreach i in `targets' {
gen `i'2 = `i'/judgesum
}
* Variance of each target's % ratings
foreach i in `targets' {
quietly summarize `i'2
scalar s2`i'2 = r(Var)
}
* Mean variance of each target's % ratings
foreach i in `targets' {
quietly summarize `i'2, meanonly
scalar mean`i'2 = r(mean)
}
* Square of mean of each target's % ratings
foreach i in `targets' {
scalar mean`i'2sq = mean`i'2^2
}
* Sum of variances of each target's % ratings
scalar sumvar = 0
foreach i in `targets' {
scalar sumvar = sumvar + s2`i'2
}
* Sum of means of each target's % ratings
scalar summeans = 0
foreach i in `targets' {
scalar summeans = summeans + mean`i'2
}
* Sum of meansquares of each target's % ratings
scalar summeansqs = 0
foreach i in `targets' {
scalar summeansqs = summeansqs + mean`i'2sq
}
* Conger's kappa
scalar conkappa = summeansqs -(sumvar/judgesnum)
di _n "Conger's kappa = " conkappa
restore
end
The data structure required by the program is shown below. The variable names are not fixed, but the judge/rater variable must be in the first position in the data set. The data set should not include any variables other than the judge/rater and targets/ratings.
Judge S1 S2 S3 S4 S5 S6
Rater1 2 4 2 1 1 4
Rater2 2 3 2 2 2 3
Rater3 2 5 3 3 3 5
Rater4 3 3 2 3 2 3
If you would like to run this against a test data set, you can use the judges data set from StataCorp and reshape it as shown.
use http://www.stata-press.com/data/r12/judges.dta, clear
sort judge
list, sepby(judge)
reshape wide rating, i(judge) j(target)
rename rating* S*
list, noobs
* Run congerkappa program on demo data set in memory
congerkappa
I have run only a single validation test of this code against the data in Table 2.16 in Gwet (p. 35) and have replicated the Conger's kappa = .23343 as calculated by Gwet on p. 34. Please test this code on other data with known Conger's kappas before relying on it.
I don't know if Conger's kappa for multiple raters is available in Stata, but it is available in R via the irr package, using the kappam.fleiss function and specifying the exact option. For information on the irr package in R, see http://cran.r-project.org/web/packages/irr/irr.pdf#page.12 .
After installing and loading the irr package in R, you can view a demo data set and Conger's kappa calculation using the following code.
data(diagnoses)
print(diagnoses)
kappam.fleiss(diagnoses, exact=TRUE)
I hope someone else here can help with a Stata solution, as you requested, but this may at least provide a solution if you can't find it in Stata.
In response to Dimitriy's comment below, I believe Stata's native kappa command applies either to two unique raters or to more than two non-unique raters.
The original poster may also want to consider the icc command in Stata, which allows for multiple unique raters.