Is Conger's kappa available in Stata? - stata

Is the modified version of kappa proposed by Conger (1980) available in Stata? Tried to google it to no avail.

This is an old question, but in case anyone is still looking--the SSC package kappaetc now calculates that, along with every other inter-rater statistic you could ever want.

Since no one has responded with a Stata solution, I developed some code to calculate Conger's kappa using the formulas provided in Gwet, K. L. (2012). Handbook of Inter-Rater Reliability (3rd ed.), Gaithersburg, MD: Advanced Analytics, LLC. See especially pp. 34-35.
My code is undoubtedly not as efficient as others could write, and I would welcome any improvements to the code or to the program format that others wish to make.
cap prog drop congerkappa
prog def congerkappa
* This program has only been tested with Stata 11.2, 12.1, and 13.0.
preserve
* Number of judges
scalar judgesnum = _N
* Subject IDs
quietly ds
local vlist `r(varlist)'
local removeit = word("`vlist'",1)
local targets: list vlist - removeit
* Sums of ratings by each judge
egen judgesum = rowtotal(`targets')
* Sum of each target's ratings
foreach i in `targets' {
quietly summarize `i', meanonly
scalar mean`i' = r(mean)
}
* % each target rating of all target ratings
foreach i in `targets' {
gen `i'2 = `i'/judgesum
}
* Variance of each target's % ratings
foreach i in `targets' {
quietly summarize `i'2
scalar s2`i'2 = r(Var)
}
* Mean variance of each target's % ratings
foreach i in `targets' {
quietly summarize `i'2, meanonly
scalar mean`i'2 = r(mean)
}
* Square of mean of each target's % ratings
foreach i in `targets' {
scalar mean`i'2sq = mean`i'2^2
}
* Sum of variances of each target's % ratings
scalar sumvar = 0
foreach i in `targets' {
scalar sumvar = sumvar + s2`i'2
}
* Sum of means of each target's % ratings
scalar summeans = 0
foreach i in `targets' {
scalar summeans = summeans + mean`i'2
}
* Sum of meansquares of each target's % ratings
scalar summeansqs = 0
foreach i in `targets' {
scalar summeansqs = summeansqs + mean`i'2sq
}
* Conger's kappa
scalar conkappa = summeansqs -(sumvar/judgesnum)
di _n "Conger's kappa = " conkappa
restore
end
The data structure required by the program is shown below. The variable names are not fixed, but the judge/rater variable must be in the first position in the data set. The data set should not include any variables other than the judge/rater and targets/ratings.
Judge S1 S2 S3 S4 S5 S6
Rater1 2 4 2 1 1 4
Rater2 2 3 2 2 2 3
Rater3 2 5 3 3 3 5
Rater4 3 3 2 3 2 3
If you would like to run this against a test data set, you can use the judges data set from StataCorp and reshape it as shown.
use http://www.stata-press.com/data/r12/judges.dta, clear
sort judge
list, sepby(judge)
reshape wide rating, i(judge) j(target)
rename rating* S*
list, noobs
* Run congerkappa program on demo data set in memory
congerkappa
I have run only a single validation test of this code against the data in Table 2.16 in Gwet (p. 35) and have replicated the Conger's kappa = .23343 as calculated by Gwet on p. 34. Please test this code on other data with known Conger's kappas before relying on it.

I don't know if Conger's kappa for multiple raters is available in Stata, but it is available in R via the irr package, using the kappam.fleiss function and specifying the exact option. For information on the irr package in R, see http://cran.r-project.org/web/packages/irr/irr.pdf#page.12 .
After installing and loading the irr package in R, you can view a demo data set and Conger's kappa calculation using the following code.
data(diagnoses)
print(diagnoses)
kappam.fleiss(diagnoses, exact=TRUE)
I hope someone else here can help with a Stata solution, as you requested, but this may at least provide a solution if you can't find it in Stata.

In response to Dimitriy's comment below, I believe Stata's native kappa command applies either to two unique raters or to more than two non-unique raters.
The original poster may also want to consider the icc command in Stata, which allows for multiple unique raters.

Related

k-fold cross validation: how to filter data based on a randomly generated integer variable in Stata

The following seems obvious, yet it does not behave as I would expect. I want to do k-fold cross validation without using SCC packages, and thought I could just filter my data and run my own regressions on the subsets.
First I generate a variable with a random integer between 1 and 5 (5-fold cross validation), then I loop over each fold number. I want to filter the data by the fold number, but using a boolean filter fails to filter anything. Why?
Bonus: what would be the best way to capture all of the test MSEs and average them? In Python I would just make a list or a numpy array and take the average.
gen randint = floor((6-1)*runiform()+1)
recast int randint
forval b = 1(1)5 {
xtreg c.DepVar /// // training set
c.IndVar1 ///
c.IndVar2 ///
if randint !=`b' ///
, fe vce(cluster uuid)
xtreg c.DepVar /// // test set, needs to be performed with model above, not a
c.IndVar1 /// // new model...
c.IndVar2 ///
if randint ==`b' ///
, fe vce(cluster uuid)
}
EDIT: Test set needs to be performed with model fit to training set. I changed my comment in the code to reflect this.
Ultimately the solution to the filtering issue was I was using a scalar in quotes to define the bounds and I had:
replace randint = floor((`varscalar'-1)*runiform()+1)
instead of just
replace randint = floor((varscalar-1)*runiform()+1)
When and where to use the quotes in Stata is confusing to me. I cannot just use varscalar in a loop, I have to use `=varscalar', but I can for some reason use varscalar - 1 and get the expected result. Interestingly, I cannot use
replace randint = floor((`varscalar')*runiform()+1)
I suppose I should just use
replace randint = floor((`=varscalar')*runiform()+1)
So why is it ok to use the version with the minus one and without the equals sign??
The answer below is still extremely helpful and I learned much from it.
As a matter of fact, two different things are going on here that are not necessarily directly related. 1) How to filter data with a randomly generated integer value and 2) k-fold cross-validation procedure.
For the first one, I will leave an example below that could help you work things out using Stata with some tools that can be easily transferable to other problems (such as matrix generation and manipulation to store the metrics). However, I would call neither your sketch of code nor my example "k-fold cross-validation", mainly because they fit the model, both in the testing and in training data. Nonetheless, the case should be that strictly speaking, the model should be trained in the training data, and using those parameters, assess the performance of the model in testing data.
For further references on the procedure Scikit-learn has done brilliant work explaining it with several visualizations included.
That being said, here is something that could be helpful.
clear all
set seed 4
set obs 100
*Simulate model
gen x1 = rnormal()
gen x2 = rnormal()
gen y = 1 + 0.5 * x1 + 1.5 *x2 + rnormal()
gen byte randint = runiformint(1, 5)
tab randint
/*
randint | Freq. Percent Cum.
------------+-----------------------------------
1 | 17 17.00 17.00
2 | 18 18.00 35.00
3 | 21 21.00 56.00
4 | 19 19.00 75.00
5 | 25 25.00 100.00
------------+-----------------------------------
Total | 100 100.00
*/
// create a matrix to store results
matrix res = J(5,4,.)
matrix colnames res = "R2_fold" "MSE_fold" "R2_hold" "MSE_hold"
matrix rownames res ="1" "2" "3" "4" "5"
// show formated empty matrix
matrix li res
/*
res[5,4]
R2_fold MSE_fold R2_hold MSE_hold
1 . . . .
2 . . . .
3 . . . .
4 . . . .
5 . . . .
*/
// loop over different samples
forvalues b = 1/5 {
// run the model using fold == `b'
qui reg y x1 x2 if randint ==`b'
// save R squared training
matrix res[`b', 1] = e(r2)
// save rmse training
matrix res[`b', 2] = e(rmse)
// run the model using fold != `b'
qui reg y x1 x2 if randint !=`b'
// save R squared training (?)
matrix res[`b', 3] = e(r2)
// save rmse testing (?)
matrix res[`b', 4] = e(rmse)
}
// Show matrix with stored metrics
mat li res
/*
res[5,4]
R2_fold MSE_fold R2_hold MSE_hold
1 .50949187 1.2877728 .74155365 1.0070531
2 .89942838 .71776458 .66401888 1.089422
3 .75542004 1.0870525 .68884359 1.0517139
4 .68140328 1.1103964 .71990589 1.0329239
5 .68816084 1.0017175 .71229925 1.0596865
*/
// some matrix algebra workout to obtain the mean of the metrics
mat U = J(rowsof(res),1,1)
mat sum = U'*res
/* create vector of column (variable) means */
mat mean_res = sum/rowsof(res)
// show the average of the metrics acros the holds
mat li mean_res
/*
mean_res[1,4]
R2_fold MSE_fold R2_hold MSE_hold
c1 .70678088 1.0409408 .70532425 1.0481599
*/

Power analysis via simulations in Stata version 15.1

I have been trying to run this simulation code in Stata version 15.1, but am having issues running it as indicated below.
local num_clus 3 6 9 18 36
local clussize 5 10 15 20 25
*Model specifications
local intercept 17.87
local timecoeff1 -5.42
local timecoeff2 -5.72
local timecoeff3 -7.03
local timecoeff4 -6.13
local timecoeff5 -9.13
local intrvcoeff 5.00
local sigma_u3 25.77
local sigma_u2 120.62
local sigma_error 38.35
local nrep 1000
local alpha 0.05
*Generate multi-level data
capture program drop swcrt
program define swcrt, rclass
version 15.1
preserve
clear
args num_clus clussize intercept intrvcoeff timecoeff1 timecoeff2 timecoeff3 timecoeff4 timecoeff5 sigma_u3 sigma_error alpha
assert `num_clus' > 0 & `clussize' > 0 & `intercept' > 0 & `intrvcoeff' > 0 & `timecoeff1' < 0 & `timecoeff2' < 0 & `timecoeff3' < 0 & `timecoeff4' < 0 & `timecoeff5' < 0 & `sigma_u3' > 0 & `sigma_error' > 0 & `alpha' > 0
/*Generate simulated multi—level data*/
qui
clear
set obs `num_clus'
qui gen cluster = _n
qui gen group = 1+mod(_n-1,4)
/*Generate cluster-level errors*/
qui gen u_3 = rnormal(0,`sigma_u3')
expand `clussize'
bysort cluster: gen individual = _n
/*Set up time*/
expand 6
bysort cluster individual: gen time = _n-1
/*Set up intervention variable*/
gen intrv = (time>=group)
/*Generate residual errors*/
qui gen error = rnormal(0,`sigma_error')
/*Generate outcome y*/
qui gen y = `intercept' + `intrvcoeff'*intrv + `timecoeff1'*1.time + `timecoeff2'*2.time + `timecoeff3'*3.time + `timecoeff4'*4.time + `timecoeff5'*5.time + u_3 + error
/*Fit multi-level model to simulated dataset*/
mixed y intrv i.time ||cluster:, covariance(unstructured) reml dfmethod(kroger)
/*Return estimated effect size, bias, p-value, and significance dichotomy*/
tempname M
matrix `M' = r(table)
return scalar bias = _b[intrv] - `intrvcoeff'
return scalar p = `M'[1,4]
return scalar p_= (`M'[1,4] < `alpha')
exit
end swcrt
*Postfile to store results
tempname step
tempfile powerresults
capture postutil clear
postfile `step' num_clus [B]clussize[/B] intrvcoeff p p_ bias using `powerresults', replace
ERROR: (note: file /var/folders/v4/j5kzzhc52q9fvh6w9pcx9fgm0000gn/T//S_00310.00000c not found)
*Loop over number of clusters
foreach c of local num_clus{
display as text "Number of clusters" as result "`c'"
foreach s of local clussize{
display as text "Cluster size" as result "`s'"
forvalue i = 1/`nrep'{
display as text "Iterations" as result `nrep'
quietly swcrt `num_clus' `clussize' `intercept' `intrvcoeff' `timecoeff1' `timecoeff2' `timecoeff3' `timecoeff4' `timecoeff5' `sigma_u3' `sigma_error' `alpha'
post `step' (`c') (`s') (`intrvcoeff') (`r(p)') (`r(p_)') (`r(bias)')
}
}
}
postclose `step'
ERROR:
Number of clusters3
Cluster size5
Iterations1000
r(9);
*Open results, calculate power
use `powerresults', clear
levelsof num_clus, local(num_clus)
levelsof clussize, local(clussize)
matrix drop _all
*Loop over combinations of clusters
*Add power results to matrix
foreach c of local num_clus{
foreach s of local clussize{
quietly ci proportions p_ if num_clus == `c' & clussize = `s'
local power `r(proportion)'
local power_lb `r(lb)'
local power_ub `r(ub)'
quietly ci mean bias if num_clus == `c' & clussize = `s'
local bias `r(mean)'
matrix M = nullmat(M) \ (`c', `s', `intrvcoeff', `power', `power_lb', `power_ub', `bias')
}
}
*Display the matrix
matrix colnames M = c s intrvcoeff power power_lb power_ub bias
ERROR:
matrix M not found
r(111);
matrix list M, noheader format(%3.2f)
ERROR:
matrix M not found
r(111);
There are a few things that seem to be amiss above.
I get a message after the postfile command saying that the file is not found. Nowhere in my code do I actually use that name so it seems to be generated by Stata.
After the loop and the post command I get error r(9).
Error message r(111) - says that the matrix is not found.
I have checked the following parts of the code to try and resolve the issue:
Specified local macros outside of the program and passed into it via the args statement of the program
Match between the variables in the call of the swcrt with the args statement in the program
Match between arguments in assert statement of the program with args command and whether the alligator clips are specified appropriately
Match b/w the number of variables in the post and postfile commands
I am not quite sure why I get these errors considering that the code did work previously and the program iterated (even when I take away the changes there is still the error). Does anyone know why this happens? If I had to guess, the matrix can't be found because of the error with the file not being found when I use postfile.

Multiple local in foreach command macro

I have a dataset with multiple subgroups (variable economist) and dates (variable temps99).
I want to run a tabsplit command that does not accept bysort or by prefixes. So I created a macro to apply my tabsplit command to each of my subgroups within my data.
For example:
levelsof economist, local(liste)
foreach gars of local liste {
display "`gars'"
tabsplit SubjectCategory if economist=="`gars'", p(;) sort
return list
replace nbcateco = r(r) if economist == "`gars'"
}
For each subgroup, Stata runs the tabsplit command and I use the variable nbcateco to store count results.
I did the same for the date so I can have the evolution of r(r) over time:
levelsof temps99, local(liste23)
foreach time of local liste23 {
display "`time'"
tabsplit SubjectCategory if temps99 == "`time'", p(;) sort
return list
replace nbcattime = r(r) if temps99 == "`time'"
}
Now I want to do it on each subgroups economist by date temps99. I tried multiple combination but I am not very good with macros (yet?).
What I want is to be able to have my r(r) for each of my subgroups over time.
Here's a solution that shows how to calculate the number of distinct publication categories within each by-group. This uses runby (from SSC). runby loops over each by-group, each time replacing the data in memory with the data from the current by-group. For each by-group, the commands contained in the user's program are executed. Whatever is left in memory when the user's program terminates is considered results and accumulates. Once all the groups have been processed, these results replace the data in memory.
I used the verbose option because I wanted to present the results for each by-group using nice formatting. The derivation of the list of distinct categories is done by splitting each list, converting to a long layout, and reducing to one observation per distinct value. The distinct_categories program generates one variable that contains the final count of distinct categories for the by-group.
* create a demontration dataset
* ------------------------------------------------------------------------------
clear all
set seed 12345
* Example generated by -dataex-. To install: ssc install dataex
clear
input str19 economist
"Carmen M. Reinhart"
"Janet Currie"
"Asli Demirguc-Kunt"
"Esther Duflo"
"Marianne Bertrand"
"Claudia Goldin"
"Bronwyn Hughes Hall"
"Serena Ng"
"Anne Case"
"Valerie Ann Ramey"
end
expand 20
bysort economist: gen temps99 = 1998 + _n
gen pubs = runiformint(1,10)
expand pubs
sort economist temps99
gen pubid = _n
local nep NEP-AGR NEP-CBA NEP-COM NEP-DEV NEP-DGE NEP-ECM NEP-EEC NEP-ENE ///
NEP-ENV NEP-HIS NEP-INO NEP-INT NEP-LAB NEP-MAC NEP-MIC NEP-MON ///
NEP-PBE NEP-TRA NEP-URE
gen SubjectCategory = ""
forvalues i=1/19 {
replace SubjectCategory = SubjectCategory + " " + word("`nep'",`i') ///
if runiform() < .1
}
replace SubjectCategory = subinstr(trim(SubjectCategory)," ",";",.)
leftalign // from SSC
* ------------------------------------------------------------------------------
program distinct_categories
dis _n _n _dup(80) "-"
dis as txt "fille = " as res economist[1] as txt _col(68) " temps = " as res temps99[1]
// if there are no subjects for the group, exit now to avoid a no obs error
qui count if !mi(trim(SubjectCategory))
if r(N) == 0 exit
// split categories, reshape to a long layout, and reduce to unique values
preserve
keep pubid SubjectCategory
quietly {
split SubjectCategory, parse(;) gen(cat)
reshape long cat, i(pubid)
bysort cat: keep if _n == 1
drop if mi(cat)
}
// show results and generate the wanted variable
list cat
local distinct = _N
dis _n as txt "distinct = " as res `distinct'
restore
gen wanted = `distinct'
end
runby distinct_categories, by(economist temps99) verbose
This is an example of the XY problem, I think. See http://xyproblem.info/
tabsplit is a command in the package tab_chi from SSC. I have no negative feelings about it, as I wrote it, but it seems quite unnecessary here.
You want to count categories in a string variable: semi-colons are your separators. So count semi-colons and add 1.
local SC SubjectCategory
gen NCategory = 1 + length(`SC') - length(subinstr(`SC', ";", "", .))
Then (e.g.) table or tabstat will let you explore further by groups of interest.
To see the counting idea, consider 3 categories with 2 semi-colons.
. display length("frog;toad;newt")
14
. display length(subinstr("frog;toad;newt", ";", "", .))
12
If we replace each semi-colon with an empty string, the change in length is the number of semi-colons deleted. Note that we don't have to change the variable to do this. Then add 1. See also this paper.
That said, a way to extend your approach might be
egen class = group(economist temps99), label
su class, meanonly
local nclass = r(N)
gen result = .
forval i = 1/`nclass' {
di "`: label (class) `i''"
tabsplit SubjectCategory if class == `i', p(;) sort
return list
replace result = r(r) if class == `i'
}
Using statsby would be even better. See also this FAQ.

Extract intercepts from multiple regressions in stata

I am attempting to reproduce the following in stata. This is a scatter plot of average portfolio returns (y axis) and predicted retruns (x axis).
To do so, I need your help on how I can extract the intercepts from 25 regressions into one variable? I am currently running the 25 portfolio regressions as follows. I have seen that parmest can potentially do this but can't get it to work with the forval. Many thanks
forval s = 1 / 5 {
forval h = 1 / 5 {
reg S`s'H`h' Mkt_Rf SMB HML
}
}
I don't know what your data look like, but maybe something like this will work:
gen intercepts = .
local i = 1
forval s = 1 / 5 {
forval h = 1 / 5 {
reg S`s'H`h' Mkt_Rf SMB HML
// assign the ith observation of intercepts
// equal to the regression constant
replace intercepts = _b[_cons] if _n == `i'
// increment i
local ++i
}
}
The postfile series of commands can be very helpful in a situation like this. The commands allows you to store results in a separate data set without losing the data in memory.
You can start with this as a simple example. This code will produce a Stata data set called "results.dta" with the variables s h and constant with a record of each regression.
cap postclose results
postfile results s h constant using results.dta, replace
forval s = 1 / 5 {
forval h = 1 / 5 {
reg S`s'H`h' Mkt_Rf SMB HML
loc c = _b[_cons]
post results (`s') (`h') (`c')
}
}
postclose results
use results, clear

Rolling Standard Deviation

I use Stata for estimating rolling standard deviation of ROA (using 4 window in previous year). Now, I would like to keep only those rolling standard deviation that has at least 3 observation (out of 4) in the ROA. How can I do this using Stata?
ROA roa_sd
. .
. .
. .
.0108869 .
.0033411 .
.0032814 .0053356 (this value should be missing as it was calculated using only 2 valid value)
.0030827 .0043739
.0029793 .0038275
Your question is answered on the blog post I link to above in the comments. You can use rolling and then add an additional screen to discard sigma when the number of observations doesn't meet your threshold.
But for simple calculations like sigma and beta (i.e., standard deviation and univariate regression coefficient) you can do much better with a more manual approach. Compare the rolling solution with my manual solution.
/* generate panel by adpating the linked code */
clear
set obs 20000
gen date = _n
gen id = floor((_n - 1) / 20) + 1
gen roa = int((100) * runiform())
replace roa = . in 1/4
replace roa = . in 10/12
replace roa = . in 18/20
/* solution with rolling */
/* http://statadaily.wordpress.com/2014/03/31/rolling-standard-deviations-and-missing-observations/ */
timer on 1
xtset id date
rolling sd2 = r(sd), window(4) keep(date) saving(f2, replace): sum roa
merge 1:1 date using f2, nogenerate keepusing(sd2)
xtset id date
gen tag = missing(l3.roa) + missing(l2.roa) + missing(l1.roa) + missing(roa) > 1
gen sd = sd2 if (tag == 0)
timer off 1
/* my solution */
timer on 2
rolling_sd roa, window(4) minimum(3)
timer off 2
/* compare */
timer list
list in 1/50
I show the manual solution is much faster.
. /* compare */
. timer list
1: 132.38 / 1 = 132.3830
2: 0.10 / 1 = 0.0990
Save the following as rolling_sd.ado in your personal ado file directory (or in your current working directory). I'm sure that someone could further streamline this code. Note that this code has the additional advantage of meeting the minimum data requirements at the front edge of the window (i.e., calculates sigma with first three observations, rather than waiting for all four).
*! 0.2 Richard Herron 3/30/14
* added minimum data requirement
*! 0.1 Richard Herron 1/12/12
program rolling_sd
version 11.2
syntax varlist(numeric), window(int) minimum(int)
* get dependent and indpendent vars from varlist
tempvar n miss xs x2s nonmiss1 nonmiss2 sigma1 sigma2
local w = `window'
local m = `minimum'
* generate cumulative sums and missing values
xtset
bysort `r(panelvar)' (`timevar'): generate `n' = _n
by `r(panelvar)': generate `miss' = sum(missing(`varlist'))
by `r(panelvar)': generate `xs' = sum(`varlist')
by `r(panelvar)': generate `x2s' = sum(`varlist' * `varlist')
* generate variance 1 (front of window)
generate `nonmiss1' = `n' - `miss'
generate `sigma1' = sqrt((`x2s' - `xs'*`xs'/`nonmiss1')/(`nonmiss1' - 1)) if inrange(`nonmiss1', `m', `w') & !missing(`nonmiss1')
* generate variance 2 (back of window, main part)
generate `nonmiss2' = `w' - s`w'.`miss'
generate `sigma2' = sqrt((s`w'.`x2s' - s`w'.`xs'*s`w'.`xs'/`nonmiss2')/(`nonmiss2' - 1)) if inrange(`nonmiss2', `m', `w') & !missing(`nonmiss2')
* return standard deviation
egen sigma = rowfirst(`sigma2' `sigma1')
end