Quintiles with different numbers of observations - Stata

I am using Stata and investigating the variable household net wealth (NetWealth).
I want to construct quintiles of this variable and use the following command -- as you can see, I use survey data and thus apply survey weights:
xtile Quintile = NetWealth [pw=surveyweight], nq(5)
Then I give the following command to check what I have obtained:
tab Quintile, sum(NetWealth)
This is the result:
Means, Standard Deviations and Frequencies of DN3001 Net wealth

5 quantiles |
  of dn3001 |       Mean   Std. Dev.      Freq.
------------+-----------------------------------
          1 |  1519.4221   43114.959        154
          2 |  135506.67   74360.816        179
          3 |  396712.16    69715.49        161
          4 |  669065.69   111102.02        182
          5 |  2552620.5   3872350.9        274
------------+-----------------------------------
      Total |  957419.29   2323329.8        950
Why do I get a different number of households in each quintile, in particular in the last one?
The only explanation that I can come up with is that when Stata constructs quintiles with xtile, it excludes from the computation those observations that have a duplicate value of NetWealth. I got this impression also from the Stata documentation.
What do you think?

Your problem is not fully reproducible insofar as you don't give a self-contained example, but in general there is no puzzle here.
People seeking such binnings often hit a small problem: their number of observations is not an exact multiple of the number of quantile-based bins they want. In your case that does not bite, as the calculation
. di 154 + 179 + 161 + 182 + 274
950
shows that you have 950 observations, which is 5 x 190.
The bigger deal -- here and almost always -- arises from Stata's rule that identical values in different observations must be assigned to the same bin. So, ties are likely to be the problem here.
You have perhaps three possible solutions. Only one involves direct coding.
1. Live with it.
2. Do something else. For example, why are you doing this anyway? Why not use the original data?
3. Try a different boundary condition. To do that, just negate the variable and bin the negated version. Then values on the boundary will jump differently.
Adding random noise to separate ties is utterly indefensible in my view. It's not reproducible (except trivially, by using the same program and the same settings), and it arbitrarily splits observations that are identical on the binned variable, with knock-on implications for those observations' values on other variables.
Here's an example where #3 doesn't help, but it sometimes does:
. sysuse auto, clear
(1978 Automobile Data)
. xtile bin5 = mpg, nq(5)
. gen negmpg = -mpg
. xtile bin5_2 = negmpg, nq(5)
. tab bin5
5 quantiles |
of mpg | Freq. Percent Cum.
------------+-----------------------------------
1 | 18 24.32 24.32
2 | 17 22.97 47.30
3 | 13 17.57 64.86
4 | 12 16.22 81.08
5 | 14 18.92 100.00
------------+-----------------------------------
Total | 74 100.00
. tab bin5_2
5 quantiles |
of negmpg | Freq. Percent Cum.
------------+-----------------------------------
1 | 19 25.68 25.68
2 | 12 16.22 41.89
3 | 16 21.62 63.51
4 | 13 17.57 81.08
5 | 14 18.92 100.00
------------+-----------------------------------
Total | 74 100.00
See also the discussion in Section 4 of this paper.
I see no hint whatsoever in the documentation that xtile would omit observations in the way that you imply. You give no precise quotation supporting that. It would be perverse to exclude any non-missing values unless so instructed.
I don't comment directly on the use of pweights, except to note that they might be a further complicating factor here.
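As a hedged aside on that point: weighted and unweighted quantile boundaries generally differ, so even in the absence of ties the unweighted bin counts need not come out equal. A minimal sketch, abusing the auto dataset's weight variable (car weight in lbs) as if it were a survey weight, purely for illustration:

sysuse auto, clear
* unweighted vs weighted quintile bins of the same variable
xtile bin_u = mpg, nq(5)
xtile bin_w = mpg [pw=weight], nq(5)
* compare the unweighted frequencies falling in each bin
tab bin_u
tab bin_w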

Related

How can I get more decimal digits for the p-value of a chi-square test in a two-way table

I use the command cchi2 to display each cell's contribution to Pearson's chi-squared in a two-way table in Stata. The p-value of each cell is displayed to 1 decimal place, e.g. 0.0, but I would like to see more digits, e.g. 0.052 or .050.
Is there any way to set the number of decimal places shown for the p-value?
This lacks a good reproducible example with a data call we can understand and in fact gives no code whatsoever. It also seems confused in terms of both Stata and statistics.
There is an option cchi2 to the tabulate command when used with two variables. cchi2 is not a separate command; it yields the contribution to chi-square and makes most sense when combined with the chi2 option, e.g.
. sysuse auto, clear
(1978 Automobile Data)
. tab foreign rep78, chi2 cchi2
+-------------------+
| Key |
|-------------------|
| frequency |
| chi2 contribution |
+-------------------+
| Repair Record 1978
Car type | 1 2 3 4 5 | Total
-----------+-------------------------------------------------------+----------
Domestic | 2 8 27 9 2 | 48
| 0.3 1.1 1.8 1.0 4.2 | 8.3
-----------+-------------------------------------------------------+----------
Foreign | 0 0 3 9 9 | 21
| 0.6 2.4 4.1 2.3 9.5 | 19.0
-----------+-------------------------------------------------------+----------
Total | 2 8 30 18 11 | 69
| 0.9 3.5 5.9 3.3 13.7 | 27.3
Pearson chi2(4) = 27.2640 Pr = 0.000
Here we can see the contribution to chi-square; in this case 9.5 of the 27.264 chi-square statistic comes from the bottom right cell. These are not individual P-values; there is just one P-value, for the entire table, here given as 0.000 to 3 d.p.
One way of getting more detail is with the tabchi command downloadable with ssc install tab_chi. Here the pearson option gives the more useful Pearson residuals, (observed - expected) / square root of expected, which are the signed square roots of the contribution to chi-square.
. tabchi foreign rep78, pearson
observed frequency
expected frequency
Pearson residual
--------------------------------------------------
| Repair Record 1978
Car type | 1 2 3 4 5
----------+---------------------------------------
Domestic | 2 8 27 9 2
| 1.391 5.565 20.870 12.522 7.652
| 0.516 1.032 1.342 -0.995 -2.043
|
Foreign | 0 0 3 9 9
| 0.609 2.435 9.130 5.478 3.348
| -0.780 -1.560 -2.029 1.505 3.089
--------------------------------------------------
4 cells with expected frequency < 5
1 cell with expected frequency < 1
Pearson chi2(4) = 27.2640 Pr = 0.000
likelihood-ratio chi2(4) = 29.9121 Pr = 0.000
Typing return list after that command gives more decimal places for the P-value:
. ret li
scalars:
r(N) = 69
r(r) = 2
r(c) = 5
r(chi2) = 27.26396103896104
r(p) = .0000175796084266
In your case, we can use your displayed frequencies to get Pearson residuals from the command tabchii in the same package.
. tabchii 3 10 2 \ 11 54 10, pearson
observed frequency
expected frequency
Pearson residual
----------------------------------
| col
row | 1 2 3
----------+-----------------------
1 | 3 10 2
| 2.333 10.667 2.000
| 0.436 -0.204 0.000
|
2 | 11 54 10
| 11.667 53.333 10.000
| -0.195 0.091 0.000
----------------------------------
2 cells with expected frequency < 5
Pearson chi2(2) = 0.2786 Pr = 0.870
likelihood-ratio chi2(2) = 0.2643 Pr = 0.876
In your case, as the total chi-squared statistic happens to be less than 1, all the contributions, the squares of (observed - expected) / square root of expected, are also less than 1, but (a) that is not true in general and (b) they are not P-values (there isn't a separate test being carried out in each cell).
Both tabchi and tabchii do also have cont options equivalent to the cchi2 option of tabulate. You can also tune the number of decimal places shown using tabdisp options, as documented.
I think the quickest way to show the P-value to more than 3 d.p. is to display the stored results after you run your chi-square test:
. tab var1 var2, col chi
. display `r(p)'
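If you want control over how many digits are shown, apply a display format to the stored result. A minimal sketch using the auto data (the format %12.10f is just an illustrative choice):

sysuse auto, clear
tab foreign rep78, chi2
* tabulate, twoway with the chi2 option stores the P-value in r(p)
display %12.10f r(p)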
There is an easy way to do it without using any software other than your browser.
This online statistical calculator can provide the p-value to up to 12 decimal places: https://www.icalcu.com/stat/chisqtest.html
For the first question, just paste the numbers below:
3 10 2
11 54 10
and you will get a p value of 0.869979427395
For the second dataset, just paste the numbers below:
2 8 27 9 2
0 0 3 9 9
and you will get a p value of 0.000017579608.

Create 3-way percentages table

I would like to have a 3-way table displaying column or row percentages using three categorical variables. The command below gives the counts but I cannot find how to get percentages instead.
sysuse nlsw88
table married race collgrad, col
--------------------------------------------------------------------
| college graduate and race
| ---- not college grad ---- ------ college grad ------
married | white black other Total white black other Total
----------+---------------------------------------------------------
single | 355 256 5 616 132 53 3 188
married | 862 224 12 1,098 288 50 6 344
--------------------------------------------------------------------
How can I get percentages?
This answer will show a miscellany of tricks. The downside is that I don't know an easy way to get exactly what you ask. The upside is that all these tricks are easy to understand and often useful.
Let's use your example, which is excellent for the purpose.
. sysuse nlsw88, clear
(NLSW, 1988 extract)
Tip #1 You can calculate a percent variable for yourself. I focus on % single. In this data set married is binary, so I won't show the complementary percent.
Once you have calculated it, you can (a) rely on the fact that it is constant within the groups you used to define it and (b) tabulate it directly. I find that tabdisp is underrated by users. It's billed as a programmer's command, but it is not difficult to use at all. tabdisp lets you set a display format on the fly; it also does no harm, and might help with other commands, to assign one directly using format.
. egen pcsingle = mean(100 * (1 - married)), by(collgrad race)
. tabdisp collgrad race, c(pcsingle) format(%2.1f)
--------------------------------------
| race
college graduate | white black other
-----------------+--------------------
not college grad | 29.2 53.3 29.4
college grad | 31.4 51.5 33.3
--------------------------------------
. format pcsingle %2.1f
Tip #2 A user-written command groups offers different flexibility. groups can be installed from SSC (strictly, it must be installed before you can use it). It's a wrapper for various kinds of tables, using list as a display engine.
. * do this installation just once
. ssc inst groups
. groups collgrad race pcsingle
+-------------------------------------------------------+
| collgrad race pcsingle Freq. Percent |
|-------------------------------------------------------|
| not college grad white 29.2 1217 54.19 |
| not college grad black 53.3 480 21.37 |
| not college grad other 29.4 17 0.76 |
| college grad white 31.4 420 18.70 |
| college grad black 51.5 103 4.59 |
|-------------------------------------------------------|
| college grad other 33.3 9 0.40 |
+-------------------------------------------------------+
We can improve on that. We can set up better header text using characteristics. (In practice, these can be less constrained than variable names but often need to be shorter than variable labels.) We can use separators by calling up standard list options.
. char pcsingle[varname] "% single"
. char collgrad[varname] "college?"
. groups collgrad race pcsingle , subvarname sepby(collgrad)
+-------------------------------------------------------+
| college? race % single Freq. Percent |
|-------------------------------------------------------|
| not college grad white 29.2 1217 54.19 |
| not college grad black 53.3 480 21.37 |
| not college grad other 29.4 17 0.76 |
|-------------------------------------------------------|
| college grad white 31.4 420 18.70 |
| college grad black 51.5 103 4.59 |
| college grad other 33.3 9 0.40 |
+-------------------------------------------------------+
Tip #3 Wire display formats into a variable by making a string equivalent. I don't illustrate this fully, but I often use it when I want to combine a display of counts with numerical results shown to so many decimal places in tabdisp. format(%2.1f) and format(%3.2f) might do fine for most variables (the important detail is the number of decimal places), but they would lead to a count of 42 being displayed as 42.0 or 42.00, which would look pretty silly. The format() option of tabdisp does not reach into a string and change its contents; it doesn't even know what the string variable contains or where it came from. So strings just get shown by tabdisp as they come, which is what you want here.
. gen s_pcsingle = string(pcsingle, "%2.1f")
. char s_pcsingle[varname] "% single"
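For instance, the pre-formatted string can then sit alongside other cell contents in tabdisp. A minimal sketch continuing the example (n is a hypothetical helper variable holding cell counts):

* counts are constant within the cells defined by collgrad and race,
* so tabdisp can show them alongside the pre-formatted string percent
bysort collgrad race : gen n = _N
tabdisp collgrad race, c(s_pcsingle n)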
groups has an option to save what is tabulated as a fresh dataset.
Tip #4 To have a total category, temporarily double up the data: the clone of the original is relabelled as a Total category. You may need to do some extra calculations, but nothing here amounts to rocket science: a smart high school student could figure it out. A concrete example for line-by-line study beats lengthy explanation.
. preserve
. local Np1 = _N + 1
. expand 2
(2,246 observations created)
. replace race = 4 in `Np1'/L
(2,246 real changes made)
. label def racelbl 4 "Total", modify
. drop pcsingle
. egen pcsingle = mean(100 * (1 - married)), by(collgrad race)
. char pcsingle[varname] "% single"
. format pcsingle %2.1f
. gen istotal = race == 4
. bysort collgrad istotal: gen total = _N
. * for percents of the global total, we need to correct for doubling up
. scalar alltotal = _N/2
. * the table shows percents for college & race | collgrad and for collgrad | total
. bysort collgrad race : gen pc = 100 * cond(istotal, total/alltotal, _N/total)
. format pc %2.1f
. char pc[varname] "Percent"
. groups collgrad race pcsingle pc , show(f) subvarname sepby(collgrad istotal)
+-------------------------------------------------------+
| college? race % single Percent Freq. |
|-------------------------------------------------------|
| not college grad white 29.2 71.0 1217 |
| not college grad black 53.3 28.0 480 |
| not college grad other 29.4 1.0 17 |
|-------------------------------------------------------|
| not college grad Total 35.9 76.3 1714 |
|-------------------------------------------------------|
| college grad white 31.4 78.9 420 |
| college grad black 51.5 19.4 103 |
| college grad other 33.3 1.7 9 |
|-------------------------------------------------------|
| college grad Total 35.3 23.7 532 |
+-------------------------------------------------------+
Note the extra trick of using a variable not shown explicitly to add separator lines.

Stata: egen rowpctile gives a range of values instead of a single percentile value

I have a variable var with many missing values. I want to calculate its 95th percentile and then use this value to drop observations that lie above the 95th percentile (among those observations that are not missing on the variable).
Because of the many missing values, I use egen with rowpctile, which is supposed to calculate the p(#) percentile while ignoring missing values. When I look at the p95 values, however, they are a range of different values rather than a single 95th-percentile value, as seen below:
. egen p95 = rowpctile(var), p(95)
. list p95
+-----------+
| p95 |
|-----------|
1. | . |
2. | 65.71429 |
3. | 14.28571 |
4. | . |
5. | . |
...
Am I using the function incorrectly or is there a better way to go about this?
The rowpctile function of the egen command calculates a percentile of a list of variables separately within each observation, that is, across each row. With a single variable in the list, the 95th percentile of a row is just that observation's own value (or missing), which is exactly what your listing shows; what you want is a percentile calculated down the column. Here is some technique which should set you on the right path.
. sysuse auto, clear
(1978 Automobile Data)
. replace price = . in 1/5
(5 real changes made, 5 to missing)
. summarize price, detail
Price
-------------------------------------------------------------
Percentiles Smallest
1% 3291 3291
5% 3748 3299
10% 3895 3667 Obs 69
25% 4296 3748 Sum of Wgt. 69
50% 5104 Mean 6245.493
Largest Std. Dev. 3015.072
75% 6342 13466
90% 11497 13594 Variance 9090661
95% 13466 14500 Skewness 1.594391
99% 15906 15906 Kurtosis 4.555704
. display r(p95)
13466
. generate toobig = price>r(p95)
. list make price if toobig | price==.
+---------------------------+
| make price |
|---------------------------|
1. | AMC Concord . |
2. | AMC Pacer . |
3. | AMC Spirit . |
4. | Buick Century . |
5. | Buick Electra . |
|---------------------------|
12. | Cad. Eldorado 14,500 |
13. | Cad. Seville 15,906 |
27. | Linc. Mark V 13,594 |
+---------------------------+
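To go on and actually drop the over-the-top observations while keeping the missings, guard the comparison, because in Stata a missing value counts as larger than any number. A minimal sketch continuing the example:

* park the percentile in a scalar first, as later r-class commands
* would overwrite r(); scalar() ensures Stata reads the scalar,
* not any variable of the same name
scalar p95 = r(p95)
* drop only observations known to exceed the 95th percentile;
* missing prices are deliberately left in place
drop if price > scalar(p95) & !missing(price)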

How to create highest & lowest quartiles of a variable in Stata

This is the Stata code I used to divide a Winsorised & centred variable (WC_num_exp, denoting number of experienced managers) into 4 quartile-based groups & thereafter to generate the highest & lowest quartile dummies:
egen quartile_num_exp = xtile(WC_num_exp), n(4)
gen high_quartile_numexp = 1 if quartile_num_exp==4
(1433 missing values generated)
gen low_quartile_num_exp = 1 if quartile_num_exp==1
(1062 missing values generated)
Thanks everybody - here's the link
https://dl.dropboxusercontent.com/u/64545449/No%20of%20expeienced%20managers.dta
I did try both Aspen Chen's and Roberto's suggestions. Chen's way of creating the high-quartile dummy gives the same results as I had earlier, and with Roberto's, both quartile dummies show 1 for the same rows -- how is that possible?
I forgot to mention that there are indeed many ties: the original variable W_num_exp ranges from 0 to 7 with mean 2.126618, and I subtracted that mean from each observation of W_num_exp to get WC_num_exp.
tab high_quartile_numexp shows the same problem I originally had:
le_numexp | Freq. Percent Cum.
------------+-----------------------------------
0 | 1,433 80.64 80.64
1 | 344 19.36 100.00
------------+-----------------------------------
Total | 1,777 100.00
Also, I checked that egenmore is already installed in my Stata 13.1.
What I fail to understand is why the dummy based on the highest quartile doesn't have 75% of observations below it (I have 1,777 observations in total): to my understanding, this dummy should mark the cut-off above which exactly 25% of observations lie, yet it flags only 19.36% of them.
Am I doing anything wrong in the Stata code for the high-quartile and low-quartile dummies?
Consider the following code:
clear
set more off
sysuse auto
keep make mpg
*-----
// your way (kind of)
egen mpg4 = xtile(mpg), nq(4)
gen lowq = mpg4 == 1
gen highq = mpg4 == 4
*-----
// what you want
summarize mpg, detail
gen lowq2 = mpg < r(p25)
gen highq2 = mpg > r(p75)
*-----
summarize high* low*
list
Now check the listing to see what's going on.
See help stored results.
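For instance, immediately after summarize mpg, detail the quartile cut-offs are available as stored results. A minimal sketch:

* stored results survive only until the next r-class command runs
return list
display "p25 = " r(p25) "   p75 = " r(p75)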
The dataset provided answers the question. Consider the tabulation:
. tab W_num_exp
num_execs_i |
ntl_exp, |
Winsorized |
fraction |
.01 | Freq. Percent Cum.
------------+-----------------------------------
0 | 297 16.71 16.71
1 | 418 23.52 40.24
2 | 436 24.54 64.77
3 | 282 15.87 80.64
4 | 171 9.62 90.26
5 | 109 6.13 96.40
6 | 34 1.91 98.31
7 | 30 1.69 100.00
------------+-----------------------------------
Total | 1,777 100.00
Exactly equal numbers in each of 4 quartile-based bins are possible if, and only if, there are values with cumulative percents of exactly 25, 50 and 75. No such values exist here. You have to make do with approximations. The approximations can be lousy, but the only alternative, arbitrarily assigning observations with the same value to different bins to even up frequencies, is statistically indefensible.
(The number of observations needing to be a multiple of 4 for 4 bins, etc., for exactly equal frequencies is also a complication, which bites hard for small datasets, but that is not the major issue here.)
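To spell out how the approximation plays out here, working from the tabulation above: with 1,777 observations, the 25th, 50th and 75th percentiles of W_num_exp are the values 1, 2 and 3. So xtile puts values <= 1 in bin 1 (cumulative 40.24%), 2 in bin 2 (24.54%), 3 in bin 3 (15.87%) and values >= 4 in bin 4 (171 + 109 + 34 + 30 = 344 observations, or 19.36%). That last figure matches the 344/1,777 = 19.36% in your tabulation of high_quartile_numexp exactly: no observations are dropped; the bins are just unavoidably unequal.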

Generating a new variable by selection from multiple variables

I have some data on diseases and age of diagnosis. Each participant was asked what diseases they have had and at what age that disease was diagnosed.
There are a set of variables disease1-28 with a numeric code for each disease and another set age1-28 with the age at diagnosis in years. The diseases are placed in successive variables in the order recalled; the age of diagnosis is placed in the appropriate age variable.
I would like to generate a new variable for each of several diseases giving the age at diagnosis of that disease, e.g. asthma_age_at_diagnosis.
Can I do this without having 28 replace statements?
Example of the data:
+-------------+----------+----------+----------+------+------+------+
| Participant | Disease1 | Disease2 | Disease3 | Age1 | Age2 | Age3 |
+-------------+----------+----------+----------+------+------+------+
| 1 | 123 | 3 | . | 30 | 2 | . |
| 2 | 122 | 123 | 5 | 23 | 51 | 44 |
| 3 | 5 | . | . | 50 | . | . |
+-------------+----------+----------+----------+------+------+------+
I give a general heads-up that a question of this form without any code of your own is often considered off-topic for Stack Overflow. Still, the Stata users around here are the people answering Stata questions (surprise) and we usually indulge questions like this if interesting and well-posed.
I'd advise a different data structure, period. With your example data
clear
input Patient Disease1 Disease2 Disease3 Age1 Age2 Age3
1 123 3 . 30 2 .
2 122 123 5 23 51 44
3 5 . . 50 . .
end
You can reshape
reshape long Disease Age, i(Patient) j(Order)
drop if missing(Disease)
list, sep(0)
+--------------------------------+
| Patient Order Disease Age |
|--------------------------------|
1. | 1 1 123 30 |
2. | 1 2 3 2 |
3. | 2 1 122 23 |
4. | 2 2 123 51 |
5. | 2 3 5 44 |
6. | 3 1 5 50 |
+--------------------------------+
With the data in this form you can now answer lots of questions easily. I don't see that a whole bunch of new variables would make many analyses easier. Another way to see this is that you have hinted that the order in which diseases are coded is arbitrary; that being so, wiring that into the data structure is ill-advised. Even if order is important, it is still accessible as part of the dataset (variable Order).
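If you still want a variable such as asthma_age_at_diagnosis, it is a one-liner from the long layout. A minimal sketch, assuming purely for illustration that asthma is the disease coded 123:

* age at diagnosis of disease code 123, spread to all of a patient's rows;
* 123 is a placeholder for whichever code means asthma in your data
egen asthma_age = min(cond(Disease == 123, Age, .)), by(Patient)

min() here just picks out the single non-missing age per patient, or the earliest such age should a disease ever be recorded twice.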
Hint: If you still want separate variables for some purposes, look at separate.
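For instance, a minimal sketch continuing the example, which creates one age variable per distinct disease code (here Age3, Age5, Age122 and Age123):

* one new variable per observed value of Disease
separate Age, by(Disease)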