Summarizing a variable in Stata and extracting standard deviation - stata

I am trying to create a variable for each year in my data based on mathematical expressions of other variables (I have annual data and used "..." to avoid writing each year). I am using the summarize command in Stata to extract the standard deviation but Stata does not recognize the frac variable. I have tried to use egen but that results in an unknown function error. Using gen results in an already defined variable. I would appreciate anyone helping with the following code or pointing me to a link where this issue has been discussed.
foreach yr of numlist 1995...2012 {
local row = `yr' - 1994
local numerator = 100*(income - L1.income)
local denominator = ((abs(income) + abs(L1.income)) / 2)
local frac = (`numerator' / `denominator')
summarize frac
local sdfrac = r(sd)
matrix C[`row', 1] = `numerator'
matrix C[`row', 2] = `denominator'
matrix C[`row', 3] = `sdfrac'
}

If I am understanding your question right, maybe you don't need to use a loop until the end and then you can post the results to a postfile:
This is just a thought:
tempname memhold
tempfile filename
postfile `memhold' year sdfrac using `filename'
gen row=year-1994
gen numerator=100*(income-L1.income)
gen denominator=((abs(income)+abs(L1.income))/2)
gen frac=numerator/denominator
foreach yr of numlist 1995...2012 {
summarize frac if year=`yr'
local sdfrac=r(sd)
post `memhold' (year) (`sdfrac')
}
postclose `memhold'
clear all
use `filename'
*View Results
list
This code should get you a data set with the name of the year and the standard deviation of the frac variable as variables.

In a comment, OP added a question about code similar to this (but ignored the request to post it in a more civilised form). Note that backticks or left quotation marks in Stata clash with SO mark-up codes in comments. Presumably some
tempname memhold
definition preceded this.
postfile `memhold' year sdfrac sex race using myresults
levels of sex, local (s)
levelsof race, local (r)
foreach a of local s {
foreach b of local r {
forval yr = 1995/2012 {
summarize frac if year == `yr' & sex == `a' & race == `b'
post `memhold' (`yr') (`r(sd)') (`sex') (`race')
}
}
}
Let's focus on what the problem is. You want the standard deviations of frac for all combinations of sex, race and year in a separate file. That's one line
collapse (sd) frac, by(year sex race)
If you want to see a table alongside the data, consider
egen group = group(sex race year), label
and then
tab group, su(frac)
or
tabstat frac, by(group) stat(sd)

This code modifies that by #Pcarlitz, mostly by simplifying it. I can't check with your data, which I don't have.
It's too long to fit into a comment.
I would not use a temporary file as you want to save these results, it seems.
tempname memhold
postfile `memhold' year sdfrac using myresults
gen frac = (100*(income - L1.income))/((abs(income) + abs(L1.income))/2)
forval yr = 1995/2012 {
summarize frac if year==`yr'
post `memhold' (`yr') (`r(sd)')
}
postclose `memhold'
use myresults
list
UPDATE As in a later answer, consider collapse as a much simpler direct alternative here.

Related

Repeating code in an if qualifier in Stata

In Stata I am trying to repeat code inside an if qualifier using perhaps a forvalues loop. My code looks something like this:
gen y=0
replace y=1 if x_1==1 & x_2==1 & x_3==1 & x_4==1
Instead of writing the & x_i==1 statement every time for each variable, I want to do it using a loop, something like this:
gen y=0
replace y=1 if forvalues i=1/4{x_`i'==1 &}
LATER EDIT:
Would it be possible to create a local in the line of this with the elements added together:
forvalues i=1/4{
local text_`i' "x_`i'==1 &"
display "`text_`i''"
}
And then call it at the if qualifier ?
Although you use the term "if statement" all your code is phrased in terms of if qualifiers, which aren't commands or statements. (Your use of the term "statement" is looser than customary, but that doesn't affect an answer directly.)
You can't insert loops in if qualifiers.
See for the differences
help if
help ifcmd
The entire example
gen y = 0
replace y = 1 if x==1 | x==2 | x==3 | x==4
would be better as
gen y = inlist(x, 1, 2, 3, 4)
or (dependent possibly on whatever values are allowed)
gen y = inrange(x, 1, 4)
A loop solution could be
gen y = 0
quietly forval i = 1/4 {
replace y = 1 if x == `i'
}
We can't discuss whether inlist() or inrange() would or would not be a solution for your real problem if you don't show to us.
I usually don't like - in Nick's terms - to write code to write code. I see an immediate, though not elegant nor 'heterodox', solution to your issue. The whole thing amounts to generate an indicator function for all your indicators, and use it with your if qualifier.
Implicit assumptions, which make this a bad, non-generalizable solution, are: 1) all variables are dummies, and you need them to be == 1, and 2) variable names are conveniently ordered 1 to N (although, if that is not the case, you can easily change the forv into a 'foreach var of varlist etc.')
g touse = 1
forv i =1/30{
replace touse = touse * x_'i'
}
<your action> if touse == 1

Using local in a forvalues loop reports a syntax error

I am using two-level loops to create a set of variables. But Stata reports a syntax error.
forvalues i = 1/5 {
local to `i'+1
dis `to'
forvalues j = `to'/6{
dis `j'
gen e_`i'_`j' = .
}
}
I could not figure out where I made the syntax error.
And a follow-up question. I would like to change how the number of loops are coded in the example above. Right now, it's hard-coded as 5 and 6. But I want to make it based on the data. For instance,I am coding as below:
sum x
scalar x_max_1 = `r(max)'-1
scalar x_max_2 = `r(max)'
forvalues i = 1/x_max_1 {
local to = `i'+1
dis `to'
forvalues j = `to'/x_max_2{
dis `j'
gen e_`i'_`j' = .
}
}
However, Stata reports a syntax error in this case. I am not sure why. The scalar is a numeric variable. Why would the code above not work?
Your code would be better as
forvalues i = 1/5 {
local to = `i' + 1
forvalues j = `to'/6 {
gen e_`i'_`j' = .
}
}
With your code you went
local to `i' + 1
so first time around the loop to becomes the string or text 1 + 1 which is then illegal as an argument to forvalues. That is, a local definition without an = sign will result in copying of text, not evaluation of the expression.
The way you used display could not show you this error because display used that way will evaluate expressions to the extent possible. If you had insisted that the macro was a string with
di "`to'"
then you would have seen its contents.
Another way to do it is
forvalues i = 1/5 {
forvalues j = `= `i' + 1'/6 {
gen e_`i'_`j' = .
}
}
EDIT
You asked further about
sum x
scalar x_max_1 = `r(max)'-1
scalar x_max_2 = `r(max)'
forvalues i = 1/x_max_1 {
and quite a lot can be said about that. Let's work backwards from one of various better solutions:
sum x, meanonly
forvalues i = 1/`= r(max) - 1' {
or another, perhaps a little more transparent:
sum x, meanonly
local max = r(max) - 1
forvalues i = 1/`max' {
What are the messages here:
If you only want the maximum, specify meanonly. Agreed: the option name alone does not imply this. See https://www.stata-journal.com/sjpdf.html?articlenum=st0135 for more.
What is the point of pushing the r-class result r(max) into a scalar? You already have what you need in r(max). Educate yourself out of this with the following analogy.
I have what I want. Now I put it into a box. Now I take it out of the box. Now I have what I want again. Come to think of it, the box business can be cut.
The box is the scalar, two scalars in this case.
forvalues won't evaluate scalars to give you the number you want. That will happen in many languages, but not here.
More subtly, forvalues doesn't even evaluate local references or similar constructs. What happens is that Stata's generic syntax parser does that for you before what you typed is passed to forvalues.

Multiple local in foreach command macro

I have a dataset with multiple subgroups (variable economist) and dates (variable temps99).
I want to run a tabsplit command that does not accept bysort or by prefixes. So I created a macro to apply my tabsplit command to each of my subgroups within my data.
For example:
levelsof economist, local(liste)
foreach gars of local liste {
display "`gars'"
tabsplit SubjectCategory if economist=="`gars'", p(;) sort
return list
replace nbcateco = r(r) if economist == "`gars'"
}
For each subgroup, Stata runs the tabsplit command and I use the variable nbcateco to store count results.
I did the same for the date so I can have the evolution of r(r) over time:
levelsof temps99, local(liste23)
foreach time of local liste23 {
display "`time'"
tabsplit SubjectCategory if temps99 == "`time'", p(;) sort
return list
replace nbcattime = r(r) if temps99 == "`time'"
}
Now I want to do it on each subgroups economist by date temps99. I tried multiple combination but I am not very good with macros (yet?).
What I want is to be able to have my r(r) for each of my subgroups over time.
Here's a solution that shows how to calculate the number of distinct publication categories within each by-group. This uses runby (from SSC). runby loops over each by-group, each time replacing the data in memory with the data from the current by-group. For each by-group, the commands contained in the user's program are executed. Whatever is left in memory when the user's program terminates is considered results and accumulates. Once all the groups have been processed, these results replace the data in memory.
I used the verbose option because I wanted to present the results for each by-group using nice formatting. The derivation of the list of distinct categories is done by splitting each list, converting to a long layout, and reducing to one observation per distinct value. The distinct_categories program generates one variable that contains the final count of distinct categories for the by-group.
* create a demontration dataset
* ------------------------------------------------------------------------------
clear all
set seed 12345
* Example generated by -dataex-. To install: ssc install dataex
clear
input str19 economist
"Carmen M. Reinhart"
"Janet Currie"
"Asli Demirguc-Kunt"
"Esther Duflo"
"Marianne Bertrand"
"Claudia Goldin"
"Bronwyn Hughes Hall"
"Serena Ng"
"Anne Case"
"Valerie Ann Ramey"
end
expand 20
bysort economist: gen temps99 = 1998 + _n
gen pubs = runiformint(1,10)
expand pubs
sort economist temps99
gen pubid = _n
local nep NEP-AGR NEP-CBA NEP-COM NEP-DEV NEP-DGE NEP-ECM NEP-EEC NEP-ENE ///
NEP-ENV NEP-HIS NEP-INO NEP-INT NEP-LAB NEP-MAC NEP-MIC NEP-MON ///
NEP-PBE NEP-TRA NEP-URE
gen SubjectCategory = ""
forvalues i=1/19 {
replace SubjectCategory = SubjectCategory + " " + word("`nep'",`i') ///
if runiform() < .1
}
replace SubjectCategory = subinstr(trim(SubjectCategory)," ",";",.)
leftalign // from SSC
* ------------------------------------------------------------------------------
program distinct_categories
dis _n _n _dup(80) "-"
dis as txt "fille = " as res economist[1] as txt _col(68) " temps = " as res temps99[1]
// if there are no subjects for the group, exit now to avoid a no obs error
qui count if !mi(trim(SubjectCategory))
if r(N) == 0 exit
// split categories, reshape to a long layout, and reduce to unique values
preserve
keep pubid SubjectCategory
quietly {
split SubjectCategory, parse(;) gen(cat)
reshape long cat, i(pubid)
bysort cat: keep if _n == 1
drop if mi(cat)
}
// show results and generate the wanted variable
list cat
local distinct = _N
dis _n as txt "distinct = " as res `distinct'
restore
gen wanted = `distinct'
end
runby distinct_categories, by(economist temps99) verbose
This is an example of the XY problem, I think. See http://xyproblem.info/
tabsplit is a command in the package tab_chi from SSC. I have no negative feelings about it, as I wrote it, but it seems quite unnecessary here.
You want to count categories in a string variable: semi-colons are your separators. So count semi-colons and add 1.
local SC SubjectCategory
gen NCategory = 1 + length(`SC') - length(subinstr(`SC', ";", "", .))
Then (e.g.) table or tabstat will let you explore further by groups of interest.
To see the counting idea, consider 3 categories with 2 semi-colons.
. display length("frog;toad;newt")
14
. display length(subinstr("frog;toad;newt", ";", "", .))
12
If we replace each semi-colon with an empty string, the change in length is the number of semi-colons deleted. Note that we don't have to change the variable to do this. Then add 1. See also this paper.
That said, a way to extend your approach might be
egen class = group(economist temps99), label
su class, meanonly
local nclass = r(N)
gen result = .
forval i = 1/`nclass' {
di "`: label (class) `i''"
tabsplit SubjectCategory if class == `i', p(;) sort
return list
replace result = r(r) if class == `i'
}
Using statsby would be even better. See also this FAQ.

Fixed Compositional Weighting in Stata

I'm looking at the Current Population Survey in Stata, although this question could apply to any survey with individual weights.
It's straightforward to generate a table showing the mean of a variable -- say wages -- over time given individual weights:
table qtr [aw=pworwgt], contents(mean wage)
What I'd like to do automatically is show the average level of, in this example, wages, but with the proportions of certain categories fixed to a date.
So for example, let's say I have 6 educational categories (Less than HS, HS, Some College, AA, BA/BS, Grad School)... I'd want to see how wages would be different if I fixed the educational proportions of the workforce to their, say, 2005 levels.
Ideally, the solution would not be resource intensive for large-numbered categories. For example, I might want to do something similar with the CPS's detail occupational metric, which has hundreds of levels.
My gut tells me "margins" may be part of the solution but I'm not familiar enough with that command... also, I'd like to be able to generate table output so I can graph in other software.
ETA: Here's the way I tried to do this for fixing weights by age and sex: by cycling through all the data, comparing the contemporaneous proportions to the base quarter proportions, and then adjusting the individual weights accordingly. This takes a really long time to cycle through however.
local start = tq(1994q1)
local end = tq(2014q4)
local base = tq(2006q1)
tempvar pop2006
tempvar cohort2006
tempvar poptemp
gen pworwgt_a = pworwgt
levelsof pesex, local(sex)
sum pworwgt if qtr == `base'
gen `pop2006' = r(N)*r(mean)
gen `cohort2006' = .
gen `poptemp' = .
forvalues age = 16/85 {
foreach s in `sex' {
sum pworwgt if age == `age' & pesex == `s' & qtr == `base'
replace `cohort2006' = r(N)*r(mean)/`pop2006'
forvalues q = `start'/`end' {
sum pworwgt if qtr == `q'
replace `poptemp' = r(N)*r(mean)
sum pworwgt if age == `age' & pesex == `s' & qtr == `q'
replace pworwgt_a = pworwgt_a*`cohort2006'/((r(N)*r(mean))/`poptemp') if age == `age' & pesex == `s' & qtr == `q'
}
}
}
I don't have scope to test this, but here are suggested simplifications to the code segment. I don't address the main question, which I don't understand, partly because there is no precise description of data structure in the question.
To summarize suggestions:
Use summarize, meanonly when that is all you need and use r(sum) ditto.
Use scalars not variables for constants.
Shift repeated calculations to once-and-for-all calculations of variables. I think you can do even more of this, but I will stop here.
local start = tq(1994q1)
local end = tq(2014q4)
local base = tq(2006q1)
tempname pop2006 cohort2006
tempvar qassum qsum
// quarter-age-sex sums in a single variable
bysort qtr age pesex : gen double `qassum` = sum(pworwgt)
by qtr age pesex : replace `qassum` = `qassum`[_N]
// quarterly sums in a single variable
by qtr: gen double `qsum' = sum(pworwgt)
by qtr: replace `qsum` = `qsum'[_N]
gen pworwgt_a = pworwgt
levelsof pesex, local(sex)
sum pworwgt if qtr == `base', meanonly
scalar `pop2006' = r(sum)
forvalues age = 16/85 {
foreach s in `sex' {
sum pworwgt if age == `age' & pesex == `s' & qtr == `base', meanonly
scalar `cohort2006' = r(sum)/`pop2006'
replace pworwgt_a = pworwgt_a*`cohort2006'/`qassum'/`qsum' if age == `age' & pesex == `s'
}
}

Is Conger's kappa available in Stata?

Is the modified version of kappa proposed by Conger (1980) available in Stata? Tried to google it to no avail.
This is an old question, but in case anyone is still looking--the SSC package kappaetc now calculates that, along with every other inter-rater statistic you could ever want.
Since no one has responded with a Stata solution, I developed some code to calculate Conger's kappa using the formulas provided in Gwet, K. L. (2012). Handbook of Inter-Rater Reliability (3rd ed.), Gaithersburg, MD: Advanced Analytics, LLC. See especially pp. 34-35.
My code is undoubtedly not as efficient as others could write, and I would welcome any improvements to the code or to the program format that others wish to make.
cap prog drop congerkappa
prog def congerkappa
* This program has only been tested with Stata 11.2, 12.1, and 13.0.
preserve
* Number of judges
scalar judgesnum = _N
* Subject IDs
quietly ds
local vlist `r(varlist)'
local removeit = word("`vlist'",1)
local targets: list vlist - removeit
* Sums of ratings by each judge
egen judgesum = rowtotal(`targets')
* Sum of each target's ratings
foreach i in `targets' {
quietly summarize `i', meanonly
scalar mean`i' = r(mean)
}
* % each target rating of all target ratings
foreach i in `targets' {
gen `i'2 = `i'/judgesum
}
* Variance of each target's % ratings
foreach i in `targets' {
quietly summarize `i'2
scalar s2`i'2 = r(Var)
}
* Mean variance of each target's % ratings
foreach i in `targets' {
quietly summarize `i'2, meanonly
scalar mean`i'2 = r(mean)
}
* Square of mean of each target's % ratings
foreach i in `targets' {
scalar mean`i'2sq = mean`i'2^2
}
* Sum of variances of each target's % ratings
scalar sumvar = 0
foreach i in `targets' {
scalar sumvar = sumvar + s2`i'2
}
* Sum of means of each target's % ratings
scalar summeans = 0
foreach i in `targets' {
scalar summeans = summeans + mean`i'2
}
* Sum of meansquares of each target's % ratings
scalar summeansqs = 0
foreach i in `targets' {
scalar summeansqs = summeansqs + mean`i'2sq
}
* Conger's kappa
scalar conkappa = summeansqs -(sumvar/judgesnum)
di _n "Conger's kappa = " conkappa
restore
end
The data structure required by the program is shown below. The variable names are not fixed, but the judge/rater variable must be in the first position in the data set. The data set should not include any variables other than the judge/rater and targets/ratings.
Judge S1 S2 S3 S4 S5 S6
Rater1 2 4 2 1 1 4
Rater2 2 3 2 2 2 3
Rater3 2 5 3 3 3 5
Rater4 3 3 2 3 2 3
If you would like to run this against a test data set, you can use the judges data set from StataCorp and reshape it as shown.
use http://www.stata-press.com/data/r12/judges.dta, clear
sort judge
list, sepby(judge)
reshape wide rating, i(judge) j(target)
rename rating* S*
list, noobs
* Run congerkappa program on demo data set in memory
congerkappa
I have run only a single validation test of this code against the data in Table 2.16 in Gwet (p. 35) and have replicated the Conger's kappa = .23343 as calculated by Gwet on p. 34. Please test this code on other data with known Conger's kappas before relying on it.
I don't know if Conger's kappa for multiple raters is available in Stata, but it is available in R via the irr package, using the kappam.fleiss function and specifying the exact option. For information on the irr package in R, see http://cran.r-project.org/web/packages/irr/irr.pdf#page.12 .
After installing and loading the irr package in R, you can view a demo data set and Conger's kappa calculation using the following code.
data(diagnoses)
print(diagnoses)
kappam.fleiss(diagnoses, exact=TRUE)
I hope someone else here can help with a Stata solution, as you requested, but this may at least provide a solution if you can't find it in Stata.
In response to Dimitriy's comment below, I believe Stata's native kappa command applies either to two unique raters or to more than two non-unique raters.
The original poster may also want to consider the icc command in Stata, which allows for multiple unique raters.