Calculating the Gini Coefficient from LIS data (in Stata)

I need to calculate the Gini coefficient from disposable personal income data at LIS. According to a LIS training document, the Stata code to do this is:
di "** INCOME DISTRIBUTION II – Exercise 13 **"
program define bottop
qui sum ey [w=hweight*d4]
replace ey = .01*r(mean) if ey<.01*r(mean)
qui sum dpi [w=hweight*d4], de
replace ey = (10*r(p50)/(d4^.5)) if dpi>10*r(p50)
end
foreach file in $us00h $fi00h {
display "`file'"
use hweight d4 dpi if (!mi(dpi) & !(dpi==0)) using "`file'", clear
gen ey=dpi/(d4^0.5)
bottop
ineqdeco ey [w=hweight*d4]
}
I have simply copied and pasted this code from the training document. The snippets
qui sum ey [w=hweight*d4]
replace ey=0.01*r(mean) if ey<0.01*r(mean)
and
qui sum dpi [w=hweight*d4], de
replace ey=(10*r(p50)/(d4^0.5)) if dpi>10*r(p50)
are bottom and top coding, respectively.
When I tried to run this code, the variable hweight was not found. Does anyone know what the new name of hweight is at LIS? Or can anyone suggest how I might otherwise overcome this impasse?
I'm familiar with Stata, but the sophistication of this code is beyond my ken.
Much appreciated.

Based on the variable definition list on the LIS Documentation page, it looks like the variable is now called HWGT.

This is more of a second-best solution. The census of population provides income by brackets. If you are willing to work with that, you can get the counts for every bracket, with a top-coded bracket as the last one. Use the median income value within each bracket as the income of everyone in it. Then you can directly apply the formula for the Gini coefficient. It is second best because it only approximates what you would get from the individual-level data.
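The bracket approach can be sketched in Stata roughly as follows. The bracket incomes and counts are made-up numbers for illustration (not census figures), and the variable names are hypothetical. The Gini is computed with the trapezoid rule on the Lorenz curve implied by the brackets, which treats everyone in a bracket as having the same income and so ignores within-bracket inequality.

```stata
* Illustrative (made-up) bracket data: a representative income per bracket
* (for example the bracket median) and the number of households in it
clear
input double repinc double n
  5000 120
 15000 300
 30000 350
 60000 180
150000  50
end

sort repinc
gen double inc_total = repinc * n        // total income held by the bracket

* cumulative population and income shares (points on the Lorenz curve)
quietly summarize n
gen double cum_pop = sum(n) / r(sum)
quietly summarize inc_total
gen double cum_inc = sum(inc_total) / r(sum)

* trapezoid rule: Gini = 1 - sum of (p_k - p_{k-1}) * (L_k + L_{k-1})
gen double prev_pop = cond(_n == 1, 0, cum_pop[_n-1])
gen double prev_inc = cond(_n == 1, 0, cum_inc[_n-1])
gen double piece    = (cum_pop - prev_pop) * (cum_inc + prev_inc)
quietly summarize piece
display "Approximate Gini from brackets: " %6.4f (1 - r(sum))
```

With more, narrower brackets the approximation should move closer to the individual-level value.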

Why don't you try the fastgini command:
http://www.stata.com/statalist/archive/2007-02/msg00524.html
ssc install fastgini
fastgini income
return list
This should give you the Gini for the variable income.
The package also allows for weights. Type
help fastgini
for more information.


Stata coefplot: plot coefficients and corresponding confidence intervals on 2nd axis

When trying to depict two coefficients from one regression on separate axes with Ben Jann's superb coefplot (ssc install coefplot) command, the coefficient to be shown on the 2nd axis is correctly displayed, but its confidence interval is depicted on the 1st scale.
Can anyone explain how I get the CI displayed on the same (2nd) axis as the coefficient it belongs to? I couldn't find any option to change this - and imagine it should be the default, if not the only, option to plot the CI around the point estimate it belongs to.
I use the latest coefplot version with Stata 16.
Here is a minimum example to illustrate the problem:
webuse union, clear
eststo results: reg idcode i.union grade
coefplot (results, keep(1.union)) (results, keep(grade) xaxis(2))
In the line
coefplot (results, keep(1.union)) (results, keep(grade) xaxis(2))
you specify the option xaxis(2), but this is not a documented option of coefplot, although it is a valid option of twoway rspike which is called by coefplot. Apparently, if you use xaxis(2) something goes wrong with the communication between coefplot and rspike.
This works for me:
coefplot (results, keep(1.union)) (results, keep(grade) axis(2))
I'm trying to create something similar. Since this option is not built-in we need to write a program to tweak how coefplot works. I'm sharing the code from the user manual here: http://repec.sowi.unibe.ch/stata/coefplot/markers.html
capt program drop coefplot_mlbl
*! version 1.0.0 10jun2021 Ben Jann
program coefplot_mlbl, sclass
_parse comma plots 0 : 0
syntax [, MLabel(passthru) * ]
if `"`mlabel'"'=="" local mlabel mlabel(string(#b, "%5.2f") + " (" + string(#ll, "%5.2f") + "; " + string(#ul, "%5.2f") + ")")
preserve
qui coefplot `plots', `options' `mlabel' generate replace nodraw
sreturn clear
tempvar touse
qui gen byte `touse' = __at<.
mata: st_global("s(mlbl)", ///
invtokens((strofreal(st_data(.,"__at","`touse'")) :+ " " :+ ///
"`" :+ `"""' :+ st_sdata(.,"__mlbl","`touse'") :+ `"""' :+ "'")'))
sreturn local plots `"`plots'"'
sreturn local options `"`options'"'
end
capt program drop coefplot_ymlbl
*! version 1.0.0 10jun2021 Ben Jann
program coefplot_ymlbl
_parse comma plots 0 : 0
syntax [, MLabel(str asis) * ]
_parse comma mlspec mlopts : mlabel
local mlopts = substr(`"`mlopts'"', 2, .) // remove leading comma
if `"`mlspec'"'!="" local mlabel mlabel(`mlspec')
else local mlabel
coefplot_mlbl `plots', `options' `mlabel'
coefplot `plots', ///
yaxis(1 2) yscale(alt) yscale(axis(2) alt noline) ///
ylabel(none, axis(2)) yti("", axis(2)) ///
ymlabel(`s(mlbl)', axis(2) notick angle(0) `mlopts') `options'
end
coefplot_ymlbl D F, drop(_cons) xline(0)
However, the above program does not allow for the option 'bylabel'. I get a Stata error saying "bylabel not allowed". Is there a way to edit this code to include the bylabel option, which is used to label subplots?

Precisions and counts

I am working with an educational dataset called IPEDS from the National Center for Education Statistics. It tracks college students by major, degree completion, etc. In Stata, I am trying to determine the total count of degrees obtained in a specific major.
There is a variable cipcode whose values identify majors: for example, 14.2501 is "Petroleum Engineering", 16.0102 is "Linguistics", and so forth.
When I write code like
tab cipcode if cipcode==14.2501
it reports no observations. What code will give me the totals?
/*Convert Float Variable to String Variable and use Force Replace*/
tostring cipcode, gen(cipcode_str) format(%6.4f) force
replace cipcode_str = reverse(substr(reverse(cipcode_str), indexnot(reverse(cipcode_str), "0"), .))
replace cipcode_str = reverse(substr(reverse(cipcode_str), indexnot(reverse(cipcode_str), "."), .))
/* Created a total variable called total_t1 for total count of all stem majors listed in table 1*/
gen total_t1 = cipcode_str== "14.2501" + "14.3901" + "15.0999" + "40.0601"
This minimal example confirms your problem. (See, by the way, https://stackoverflow.com/help/mcve for advice on good examples.)
* code
clear
input code
14.2501
14.2501
14.2501
end
tab code if code == 14.2501
tab code if code == float(14.2501)
* results
. tab code if code == 14.2501
no observations
. tab code if code == float(14.2501)
code | Freq. Percent Cum.
------------+-----------------------------------
14.2501 | 3 100.00 100.00
------------+-----------------------------------
Total | 3 100.00
The keyword is one you use, precision. In Stata, search precision for resources, starting with blog posts by William Gould. A decimal like 14.2501 is hard (impossible) to hold exactly in binary and the details of holding a variable as type float can bite.
It's hard to see what you're doing with your last block of code, which you don't explain. The last statement looks puzzling, as you're adding strings. Consider what happens with
. gen whatever = "14.2501" + "14.3901" + "15.0999" + "40.0601"
. di whatever[1]
14.250114.390115.099940.0601
The result is a long string that cannot be a valid cipcode. I suspect that you are reaching towards
... if inlist(cipcode_str, "14.2501", "14.3901", "15.0999", "40.0601")
which is quite different.
But using float() is the minimal trick for this problem.
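To get the totals the question asks for, the float() comparison can be combined with count or inlist(). The following is a sketch on a tiny made-up dataset; the variable name cipcode mirrors the question, the new variable cip_int is hypothetical, and the sketch assumes the variable is stored as float, as the behavior in the question suggests.

```stata
* Made-up data: cipcode held as float, as in the question
clear
input float cipcode
14.2501
14.2501
14.3901
16.0102
end

* comparing against the double-precision literal finds no matches
count if cipcode == 14.2501

* rounding the literal to float precision matches the stored values
count if cipcode == float(14.2501)

* several majors at once
count if inlist(cipcode, float(14.2501), float(14.3901))

* alternative: work with an exact integer version of the code
gen long cip_int = round(cipcode * 10000)
tabulate cip_int if inlist(cip_int, 142501, 143901)
```

The integer version avoids the precision issue altogether, at the cost of a less readable code value.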

Graph evolution of quantile non-linear coefficient: can it be done with grqreg? Other options?

I have the following model:
Y_{it} = alpha_i + B1*weight_{it} + B2*Dummy_Foreign_{i} + B3*(weight*Dummy_Foreign)_{it} + e_{it}
and I am interested in the effect of weight on Y for foreign cars, and in graphing the evolution of the relevant coefficient across quantiles with its standard errors. That is, I need to see the evolution of the sum of coefficients (B1 + B3). I know this is a non-linear effect, and would require some sort of delta method to obtain the variance-covariance matrix needed for the standard error of (B1 + B3).
Before I delve into writing a program that attempts to do this, I thought I would try and ask if there is a way of doing it with grqreg. If this is not possible with grqreg, would someone please guide me into how they would start writing a code that computes the proper standard errors, and graphs the quantile coefficient.
For a cross section example of what I am trying to do, please see code below.
I use grqreg to generate the evolution of the separate coefficients, but I need the joint one: a single graph for the evolution of (B1 + B3) with its respective standard errors.
Thanks.
(I am using Stata 14.1 on Windows 10):
clear
sysuse auto
set scheme s1color
gen gptm = 1000/mpg
label var gptm "gallons / 1000 miles"
gen weight_foreign= weight*foreign
label var weight_foreign "Interaction weight and foreign car"
qreg gptm weight foreign weight_foreign , q(.5)
grqreg weight weight_foreign , ci ols olsci reps(40)
*** Question 1: How to construct the plot of the coefficient of interest?
Your second question is off-topic here since it is statistical. Try the CV SE site or Statalist.
Here's how you might do (1) in a cross section, using margins and marginsplot:
clear
set more off
sysuse auto
set scheme s1color
gen gptm = 1000/mpg
label var gptm "gallons / 1000 miles"
sqreg gptm c.weight##i.foreign, q(10 25 50 75 95) reps(500) coefl
margins, dydx(weight) predict(outcome(q10)) predict(outcome(q25)) predict(outcome(q50)) predict(outcome(q75)) predict(outcome(q95)) at(foreign=(0 1))
marginsplot, xdimension(_predict) xtitle("Quantile") ///
legend(label(1 "Domestic") label(2 "Foreign")) ///
xlabel(none) xlabel(1 "Q10" 2 "Q25" 3 "Q50" 4 "Q75" 5 "Q95", add) ///
title("Marginal Effect of Weight By Origin") ///
ytitle("GPTM")
This produces a graph like this:
I didn't recast the CI here since it would look cluttered, but that would make it look more like your graph. Just add recastci(rarea) to the options.
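Because B1 + B3 is a linear combination of the coefficients, its standard error in the cross section can also be checked directly with lincom after qreg. A minimal sketch for the median only, using the same auto data; it relies on the default qreg standard errors rather than bootstrapped ones, so the numbers need not match the sqreg results exactly.

```stata
sysuse auto, clear
gen gptm = 1000/mpg
label var gptm "gallons / 1000 miles"

* median regression with the weight-by-foreign interaction
qreg gptm c.weight##i.foreign, quantile(.5)

* effect of weight for foreign cars: B1 + B3, with its standard error and CI
lincom weight + 1.foreign#c.weight
```

Repeating this for other values of quantile() gives the same quantities that margins reports for foreign cars, one quantile at a time.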
Unfortunately, none of the panel quantile regression commands play nice with factor variables and margins. But we can hack something together. First, you can calculate the sums of coefficients with nlcom (instead of the more natural lincom, which lacks the post option), store them, and use Ben Jann's coefplot to graph them. Here's a toy example to give you the main idea, where we look at the effect of tenure for union members:
set more off
estimates clear
webuse nlswork, clear
gen tXu = tenure*union
local quantiles 1 5 10 25 50 75 90 95 99 // K quantiles that you care about
local models "" // names of K quantile models for coefplot to graph
local xlabel "" // for x-axis labels
local j=1 // counter for quantiles
foreach q of numlist `quantiles' {
qregpd ln_wage tenure union tXu, id(idcode) fix(year) quantile(`q')
nlcom (me_tu:_b[tenure]+_b[tXu]), post
estimates store me_tu`q'
local models `"`models' me_tu`q' || "'
local xlabel `"`xlabel' `j++' "Q{sub:`q'}""'
}
di "`models'"
di `"`xlabel'"'
coefplot `models' ///
, vertical bycoefs rescale(100) ///
xlab(none) xlabel(`xlabel', add) ///
title("Marginal Effect of Tenure for Union Members On Each Conditional Quantile Q{sub:{&tau}}", size(medsmall)) ///
ytitle("Wage Change in Percent" "") yline(0) ciopts(recast(rcap))
This makes a dromedary curve, which suggests that the effect of tenure is larger in the middle of the wage distribution than at the tails.

Query on plotting Lorenz curves on Stata

I am trying to plot a lorenz curve, using the following command:
glcurve drugs, sortvar(death) pvar(rank) glvar(yord) lorenz nograph
generate rank1=rank
label variable rank "Cum share of mortality"
label variable rank1 "Equality Line"
twoway (line rank1 rank, sort clwidth(medthin) clpat(longdash))(line yord rank , sort clwidth(medthin) clcolor(red)), ///
ytitle(Cumulative share of drug activity, size(medsmall)) yscale(titlegap(2)) xtitle(Cumulative share of mortality (2012), size(medsmall)) ///
legend(rows(5)) xscale(titlegap(5)) legend(region(lwidth(none))) plotregion(margin(zero)) ysize(6.75) xsize(6) plotregion(lcolor(none))
However, in the resulting graph the line of equality does not start from 0. Is there a way to fix this?
Is it recommended to use the following in order to get a perfect 45-degree line of equality?
(function y=x, range(0 1))
Also, what is the minimum number of observations required to plot the above graph? Does it work well with only 2 observations?
Your line of perfect equality does not pass through (0,0) because the values of your variable do not include 0.
The smallest value you will have for rank will be 1/_N. Although this value will asymptotically approach 0, it will never actually reach 0.
To see this, try:
quietly sum rank
di r(min)
di 1/_N
Further, by applying the program code to your data (beginning around line 152 in the ado file and removing unnecessary bits), one can easily see that yord cannot take on a value of 0 without values of 0 for drugs:
glcurve drugs, sortvar(death) pvar(rank) glvar(yord) lorenz nograph
sort death drugs , stable
gen double rank1 = _n / _N
qui sum drugs
gen yord1= (sum(drugs) / _N) / r(mean)
The best way to plot your line of equality would be the method from your edit, namely:
twoway(function y = x, ra(0 1))
One quick yet (very) crude fix to force the Lorenz curve to start at the origin (if it doesn't already) is to add an observation to the data after obtaining rank and yord, and then delete it once you have your curve:
glcurve drugs, sortvar(death) pvar(rank) glvar(yord) lorenz nograph
expand 2 in 1
replace yord = 0 in 1
replace rank = 0 in 1
twoway (function y = x, ra(0 1)) ///
(line yord rank, sort)
drop in 1
Like I said, this is admittedly crude and even somewhat ill advised, but I can't see a much better alternative at the moment, and with this method you will not be altering any of the other values of yord by running glcurve on the extrapolated data.
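A variant of the same idea, sketched on the auto data shipped with Stata rather than the data in the question: build the Lorenz ordinates by hand, append a separate (0,0) observation with set obs instead of overwriting an existing one, and let twoway sort the points. The variable names p and L are illustrative, and this is an ordinary Lorenz curve (sorted by the variable itself), not the glcurve call with sortvar() used above.

```stata
sysuse auto, clear
keep price
sort price

gen double p = _n / _N                 // cumulative population share
quietly summarize price
gen double L = sum(price) / r(sum)     // cumulative share of price

* append one observation that holds the origin
set obs `=_N + 1'
replace p = 0 in l
replace L = 0 in l

twoway (function y = x, range(0 1) lpattern(longdash)) ///
       (line L p, sort),                               ///
       ytitle("Cumulative share of price")             ///
       xtitle("Cumulative share of cars")              ///
       legend(order(1 "Equality line" 2 "Lorenz curve"))
```

Since the origin is stored as an extra row, none of the computed ordinates change, and the row can be dropped afterwards with drop in l.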

Stata output files in surveys

I have some survey data which I'm using Stata to analyze. I want to compute means of one variable by group and save those means to a Stata file. My code looks like this:
svyset [iw=wtsupp], sdrweight(repwtp1-repwtp160) vce(sdr)
svy: mean x
I tried
svy: by grp: mean x
but that did not work. I could save each mean to a separate file by simply saying
svy: mean x if grp==1
but that's inefficient. Is there a better way?
I also need to save the results to a file, much as SAS ODS can capture results. I am not talking about the log here; I need the means and the associated group. I'm thinking of
estimates save [path],replace
but I'm not sure whether that will give me a Stata file, or whether it will include the group even if I can figure out how to use by-processing.
Here's a simpler approach that creates a data set of the displayed estimation results: estimated means, standard errors, confidence limits, z statistics, and p-values. svy: mean is called with the over() option, which does away with the need for the foreach loop and computes standard errors appropriate for subpopulation analysis. The estimation results are contained in the returned matrix r(table), which is converted by the svmat command to a Stata data set. While svmat maintains column names, it does not preserve row (group) names, so it is necessary to merge these in to the created data set.
set more off
use http://www.stata-press.com/data/r13/ss07ptx, clear
svyset _n [pw= pwgtp], sdrweight(pwgtp*) vce(sdr)
**************************************************
* Set name of grouping variable in double quotes *
* in the next line.                              *
**************************************************
local gpname "sex"
tempvar gp
egen `gp' = group(`gpname')
preserve
tempfile t1
bys `gp': keep if _n==1
keep `gp' `gpname'
save `t1'
restore
svy: mean agep , over(`gp')
matrix a = r(table)'
clear
qui svmat double a, names(col)
gen `gp'=_n
merge 1:1 `gp' using `t1'
keep `gpname' b se z pvalue ll ul
order `gpname'
save results, replace
list
Edited 10/28
This version contains legibility improvements and the outcome variable and saved datasets are specified in a local macro. Therefore the analyst need not touch the foreach block. Easier to write and read matrix subscript expressions are used instead of the el matrix function: thus m[1,1] instead of el("m",1,1).
sysuse auto, clear
svyset _n
***************************************************
* Set names of grouping variable and results data *
* set in double quotes in the next lines.         *
***************************************************
local yvar mpg // variable for mean
local gpname "foreign"
local d_results "results"
tempvar gp
gen `gp' = `gpname'
tempname memhold
postfile `memhold' ///
`gpname' n mean se sd using `d_results', replace
levelsof `gp', local(lg)
foreach x of local lg{
svy, subpop(if `gp'==`x'): mean `yvar'
matrix m = e(b)
matrix v = e(V)
matrix a = e(V_srssub)
matrix b = e(_N_subp)
matrix c = e(_N)
scalar gx = `x'
scalar mean = m[1,1]
scalar sem = sqrt(v[1,1])
scalar sd = sqrt(b[1,1]*a[1,1])
scalar n = c[1,1]
post `memhold' (gx) (n) (mean) (sem) (sd)
}
postclose `memhold'
use results, clear
list