I have created a two-way summary table in Stata, but I am struggling to output my results.
Using the auto.dta sample dataset as an example, I am trying to build a table that displays the means and standard deviations of mpg, by two other variables (expensive and foreign).
My code currently looks as follows:
sysuse auto.dta, clear
gen expensive = (price > 5000)
The table that I would like to display can be created by either of the two commands below:
tabulate expensive foreign, sum(mpg)
Means, Standard Deviations and Frequencies of Mileage (mpg)
| Car type
expensive | Domestic Foreign | Total
-----------+----------------------+----------
0 | 22.137931 28.875 | 23.594595
| 4.3648281 4.8825491 | 5.2305696
| 29 8 | 37
-----------+----------------------+----------
1 | 16.913043 22.428571 | 19
| 3.4629604 6.4416229 | 5.4467115
| 23 14 | 37
-----------+----------------------+----------
Total | 19.826923 24.772727 | 21.297297
| 4.7432972 6.6111869 | 5.7855032
| 52 22 | 74
table expensive foreign, c(mean mpg sd mpg count mpg) row col
----------------------------------------
| Car type
expensive | Domestic Foreign Total
----------+-----------------------------
0 | 22.1379 28.875 23.5946
| 4.364828 4.882549 5.23057
| 29 8 37
|
1 | 16.913 22.4286 19
| 3.46296 6.441623 5.446712
| 23 14 37
|
Total | 19.8269 24.7727 21.2973
| 4.743297 6.611187 5.785503
| 52 22 74
----------------------------------------
I can also closely approximate the same results using collapse, but this does not calculate row and column totals.
My issue is that neither tabulate (with the sum option) nor table seems easy to export. I have tried converting the results to matrices, but tabulate with the sum option does not allow the matcell option, and table seems similarly uncooperative.
I'm familiar with tabstat, esttab etc., but was not able to create the two-way table that I need with any of those packages. Any help would be really appreciated.
The community-contributed command asdoc does exactly that:
. asdoc table expensive foreign, c(mean mpg sd mpg count mpg) row col
----------------------------------------
| Car type
expensive | Domestic Foreign Total
----------+-----------------------------
0 | 22.1379 28.875 23.5946
| 4.364828 4.882549 5.23057
| 29 8 37
|
1 | 16.913 22.4286 19
| 3.46296 6.441623 5.446712
| 23 14 37
|
Total | 19.8269 24.7727 21.2973
| 4.743297 6.611187 5.785503
| 52 22 74
----------------------------------------
Click to Open File: Myfile.doc
Alternatively, one could use the community-contributed command tabout:
. tabout expensive foreign using table1.txt, c(mean mpg) sum replace
Table output written to: table1.txt
Car type
Domestic Foreign Total
Mean mpg Mean mpg Mean mpg
expensive
0 22.1 28.9 23.6
1 16.9 22.4 19.0
Total 19.8 24.8 21.3
. tabout expensive foreign using table2.txt, c(sd mpg) sum replace
Table output written to: table2.txt
Car type
Domestic Foreign Total
Sd mpg Sd mpg Sd mpg
expensive
0 4.4 4.9 5.2
1 3.5 6.4 5.4
Total 4.7 6.6 5.8
. tabout expensive foreign using table3.txt, c(count mpg) sum replace
Table output written to: table3.txt
Car type
Domestic Foreign Total
Count mpg Count mpg Count mpg
expensive
0 29.0 8.0 37.0
1 23.0 14.0 37.0
Total 52.0 22.0 74.0
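If Stata 17 or later is available (an assumption; the question does not say which version is in use), the redesigned built-in table command may be another route, because its results can be written to a file with collect export. A minimal sketch, with the file name mpg_table.xlsx chosen purely for illustration and the continuation lines assuming the code is run from a do-file:

```stata
sysuse auto, clear
generate byte expensive = price > 5000

* Stata 17+ syntax: row variable, column variable, then one
* statistic() option per summary; totals are included by default
table (expensive) (foreign), statistic(mean mpg) statistic(sd mpg) ///
    statistic(count mpg) nformat(%9.2f mean sd)

* Write the table just produced to an Excel workbook
collect export mpg_table.xlsx, replace
```

The older c() syntax shown in the question is not accepted by the Stata 17 version of table, so the two forms should not be mixed.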
An easy solution is to use collapse to get a dataset that reproduces your desired table, and then export the dataset as a CSV file.
For example:
collapse (mean) mean_mpg = mpg (sd) sd_mpg = mpg (count) n_mpg = mpg, by(expensive foreign)
and then
export delimited using mydata.csv
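As noted in the question, collapse on its own does not produce the row and column totals. One possible workaround, sketched here rather than offered as the definitive method, is to duplicate the observations with expand so that each car also counts toward a "Total" category before collapsing. The code 9 for the totals, the variable names copy1 and copy2, and the file name mpg_table.csv are all arbitrary choices made for this illustration.

```stata
sysuse auto, clear
generate byte expensive = price > 5000

* First copy of every car feeds the column margin (all car types)
expand 2, generate(copy1)
replace foreign = 9 if copy1

* Second copy of every row feeds the row margin (all price groups)
expand 2, generate(copy2)
replace expensive = 9 if copy2

* One row per cell, including the margins coded 9
collapse (mean) mean_mpg = mpg (sd) sd_mpg = mpg (count) n_mpg = mpg, ///
    by(expensive foreign)

export delimited using mpg_table.csv, replace
```

The collapsed dataset has nine rows: the four interior cells, two row totals, two column totals, and the grand total.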
I use command cchi2 to display each cell’s contribution to Pearson’s chi-squared in a two-way table in Stata. The p-value of each cell is displayed as 1 decimal, e.g., that the p-value is 0.0, but I would like to see more digits, e.g. 0.052 or .050.
Is there any possible way to set the digits of the p-value decimal?
This lacks a good reproducible example with a data call we can understand and in fact gives no code whatsoever. It also seems confused in terms of both Stata and statistics.
There is an option cchi2 to the tabulate command when used with two variables. cchi2 is not a separate command; it yields the contribution to chi-square and makes most sense when combined with the chi2 option, e.g.
. sysuse auto, clear
(1978 Automobile Data)
. tab foreign rep78, chi2 cchi2
+-------------------+
| Key |
|-------------------|
| frequency |
| chi2 contribution |
+-------------------+
| Repair Record 1978
Car type | 1 2 3 4 5 | Total
-----------+-------------------------------------------------------+----------
Domestic | 2 8 27 9 2 | 48
| 0.3 1.1 1.8 1.0 4.2 | 8.3
-----------+-------------------------------------------------------+----------
Foreign | 0 0 3 9 9 | 21
| 0.6 2.4 4.1 2.3 9.5 | 19.0
-----------+-------------------------------------------------------+----------
Total | 2 8 30 18 11 | 69
| 0.9 3.5 5.9 3.3 13.7 | 27.3
Pearson chi2(4) = 27.2640 Pr = 0.000
Here we can see the contribution to chi-square; in this case 9.5 of the 27.264 chi-square statistic comes from the bottom right cell. These are not individual P-values; there is just one P-value, for the entire table, here given as 0.000 to 3 d.p.
One way of getting more detail is with the tabchi command downloadable with ssc install tab_chi. Here the pearson option gives the more useful Pearson residuals, (observed - expected) / square root of expected, which are the signed square roots of the contribution to chi-square.
. tabchi foreign rep78, pearson
observed frequency
expected frequency
Pearson residual
--------------------------------------------------
| Repair Record 1978
Car type | 1 2 3 4 5
----------+---------------------------------------
Domestic | 2 8 27 9 2
| 1.391 5.565 20.870 12.522 7.652
| 0.516 1.032 1.342 -0.995 -2.043
|
Foreign | 0 0 3 9 9
| 0.609 2.435 9.130 5.478 3.348
| -0.780 -1.560 -2.029 1.505 3.089
--------------------------------------------------
4 cells with expected frequency < 5
1 cell with expected frequency < 1
Pearson chi2(4) = 27.2640 Pr = 0.000
likelihood-ratio chi2(4) = 29.9121 Pr = 0.000
Typing return list after that command gives more decimal places for the P-value:
. ret li
scalars:
r(N) = 69
r(r) = 2
r(c) = 5
r(chi2) = 27.26396103896104
r(p) = .0000175796084266
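The relationship between the two tables above can be checked by hand for the bottom right cell (Foreign, rep78 of 5). This is only an arithmetic sketch using the frequencies shown in the output; the scalar names are made up for the illustration.

```stata
* Observed count 9; row total 21, column total 11, table total 69
scalar observed = 9
scalar expected = 21 * 11 / 69

* Pearson residual and its square, the contribution to chi-square
scalar resid = (observed - expected) / sqrt(expected)

display "expected     = " %6.3f scalar(expected)
display "residual     = " %6.3f scalar(resid)
display "contribution = " %6.3f scalar(resid)^2
```

The expected frequency and residual should agree with the 3.348 and 3.089 reported by tabchi, and the squared residual rounds to the 9.5 reported by tabulate with cchi2.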
In your case, we can use your displayed frequencies to get Pearson residuals from the command tabchii in the same package.
. tabchii 3 10 2 \ 11 54 10, pearson
observed frequency
expected frequency
Pearson residual
----------------------------------
| col
row | 1 2 3
----------+-----------------------
1 | 3 10 2
| 2.333 10.667 2.000
| 0.436 -0.204 0.000
|
2 | 11 54 10
| 11.667 53.333 10.000
| -0.195 0.091 0.000
----------------------------------
2 cells with expected frequency < 5
Pearson chi2(2) = 0.2786 Pr = 0.870
likelihood-ratio chi2(2) = 0.2643 Pr = 0.876
In your case the total chi-squared statistic happens to be less than 1, so all the contributions, the squares of (observed - expected) / square root of expected, are also less than 1. But (a) that is not true in general and (b) they are not P-values (there isn't a separate test being carried out in each cell).
Both tabchi and tabchii do also have cont options equivalent to the cchi2 option of tabulate. You can also tune the number of decimal places shown using tabdisp options, as documented.
I think the quickest way to show the p value to more than 3 dp is to display stored results after you do your chi square:
. tab var1 var2, col chi
. display `r(p)'
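A slightly fuller sketch of the same idea, using the auto data in place of the placeholder names var1 and var2. The display formats are just examples; any valid numeric format can be substituted.

```stata
sysuse auto, clear
tabulate foreign rep78, chi2

* Stored results hold the P-value at full precision
display %12.10f r(p)
display "chi2 = " %9.4f r(chi2) ",  P = " %8.2e r(p)

* The same P-value recomputed from the statistic and its degrees of freedom
display chi2tail((r(r) - 1) * (r(c) - 1), r(chi2))
```

Here r(r) and r(c) are the numbers of rows and columns that tabulate leaves behind, so the degrees of freedom do not need to be typed in by hand.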
There is an easy way to do it without using any software other than your browser.
This online statistical calculator can provide a p value to up to 12 decimal places: https://www.icalcu.com/stat/chisqtest.html
For the first question, just paste the numbers below:
3 10 2
11 54 10
and you will get a p value of 0.869979427395
For the second dataset, just paste the numbers below:
2 8 27 9 2
0 0 3 9 9
and you will get a p value of 0.000017579608.
I am using Stata and investigating the variable household net wealth (NetWealth).
I want to construct the quintiles of this variable and use the following command--as you can see I use survey data and thus apply survey weights:
xtile Quintile = NetWealth [pw=surveyweight], nq(5)
Then I give the following command to check what I have obtained:
tab Quintile, sum(NetWealth)
This is the result:
Means, Standard Deviations and Frequencies of DN3001 Net wealth
5 |
quantiles |
of dn3001 |
-----------+-----------+
1 |1519.4221
|43114.959
| 154
-----------+-----------+
2 | 135506.67
| 74360.816
| 179
-----------+-----------+
3 | 396712.16
| 69715.49
| 161
-----------+-----------+
4 | 669065.69
| 111102.02
| 182
-----------+-----------+
5 | 2552620.5
| 3872350.9
| 274
-----------+-----------+
Total | 957419.29
| 2323329.8
| 950
Why do I get a different number of households in each quintile? In particular in the last quintile?
The only explanation that I can come up with is that when Stata constructs quintiles with xtile, it excludes from the computation those observations that present a replicate value of NetWealth. I have had this impression also while consulting the Stata material.
What do you think?
Your problem is not fully reproducible in so far as you don't give a self-contained example, but in general there is no puzzle here.
Often people seeking such binnings have a small problem in that their number of observations is not a multiple (meaning, exact multiple) of the number of quantile-based bins they want, but in your case that does not bite, as the calculation
. di 154 + 179 + 161 + 182 + 274
950
shows that you have 950 observations, which is 5 x 190.
The bigger deal -- here and almost always -- arises from Stata's rule that identical values in different observations must be assigned to the same bin. So, ties are likely to be the problem here.
You have perhaps three possible solutions. Only one involves direct coding.
1. Live with it.
2. Do something else. For example, why are you doing this anyway? Why not use the original data?
3. Try a different boundary condition. To do that, just negate the variable and bin that version. Then values on the boundary will jump differently.
Adding random noise to separate ties is utterly indefensible in my view. It's not reproducible (except trivially using the same program and the same settings) and it will have different implications in terms of the same observations' values on other variables.
Here's an example where #3 doesn't help, but it sometimes does:
. sysuse auto, clear
(1978 Automobile Data)
. xtile bin5 = mpg, nq(5)
. gen negmpg = -mpg
. xtile bin5_2 = negmpg, nq(5)
. tab bin5
5 quantiles |
of mpg | Freq. Percent Cum.
------------+-----------------------------------
1 | 18 24.32 24.32
2 | 17 22.97 47.30
3 | 13 17.57 64.86
4 | 12 16.22 81.08
5 | 14 18.92 100.00
------------+-----------------------------------
Total | 74 100.00
. tab bin5_2
5 quantiles |
of negmpg | Freq. Percent Cum.
------------+-----------------------------------
1 | 19 25.68 25.68
2 | 12 16.22 41.89
3 | 16 21.62 63.51
4 | 13 17.57 81.08
5 | 14 18.92 100.00
------------+-----------------------------------
Total | 74 100.00
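To see that ties are what drives the uneven bins in this example, one can check that cars sharing a value of mpg never straddle two bins, and then look at how many cars sit exactly on the cutpoints. This is a sketch using _pctile to recover the cutpoints; the local macro name cuts is arbitrary, and because it uses a local macro the lines should be run together from a do-file.

```stata
sysuse auto, clear
xtile bin5 = mpg, nq(5)

* Within each distinct value of mpg the bin must be constant
bysort mpg (bin5): assert bin5[1] == bin5[_N]

* Recover the four cutpoints that separate the five bins
_pctile mpg, nq(5)
local cuts `r(r1)', `r(r2)', `r(r3)', `r(r4)'

* Cars whose mpg equals a cutpoint, and the bin each group ends up in
tabulate mpg bin5 if inlist(mpg, `cuts')
```

Any cutpoint value shared by several cars moves all of those cars into one bin together, which is why the bin sizes cannot all equal 74/5.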
See also some discussion within Section 4 of this paper
I see no hint whatsoever in the documentation that xtile would omit observations in the way that you imply. You give no precise quotation supporting that. It would be perverse to exclude any non-missing values unless so instructed.
I don't comment directly here on use of pweights except that using pweights might be a complicating factor here.
I am cross-tabulating two variables: variable1, with 5 levels, and variable2, with 2 levels. Levels 1 and 2 of variable1 are not displayed in the tabulation because their frequency is zero, as follows:
sysuse auto
levelsof rep78
1 2 3 4 5
tab rep78 foreign if foreign, col nofreq
Repair |
Record | Car type
1978 | Foreign | Total
-----------+-----------+----------
3 | 14.29 | 14.29
4 | 42.86 | 42.86
5 | 42.86 | 42.86
-----------+-----------+----------
Total | 100.00 | 100.00
I would like to have the tabulation with all the levels displayed as follows:
tab rep78 foreign if foreign, col nofreq
Repair |
Record | Car type
1978 | Foreign | Total
-----------+-----------+----------
1 | 0.00 | 0.00
2 | 0.00 | 0.00
3 | 14.29 | 14.29
4 | 42.86 | 42.86
5 | 42.86 | 42.86
-----------+-----------+----------
Total | 100.00 | 100.00
How can I do that?
The reason I need this is that I have created a program that tabulates a given variable and posts the results into an Excel report template using the putexcel functionality of Stata. In some cases some levels are not displayed in the tabulation, and this results in some values getting posted to the wrong row of the Excel report.
No decent example as yet from the OP, but here is some technique.
In general, it's tricky. Stata's no metaphysician and is reluctant to display anything without empirical evidence to hand that it exists. I here create a dataset with all the cross-combinations needed and also create a variable with explicit zeros to show. For many problems, also see help fillin.
. clear
. sysuse auto
(1978 Automobile Data)
. contract foreign rep78, zero
. egen pc = pc(_freq), by(foreign)
. tabdisp rep78 foreign if !foreign, c(pc) format(%2.1f)
--------------------
Repair |
Record | Car type
1978 | Domestic
----------+---------
1 | 3.8
2 | 15.4
3 | 51.9
4 | 17.3
5 | 3.8
. | 7.7
--------------------
. tabdisp rep78 foreign if foreign, c(pc) format(%2.1f)
-------------------
Repair |
Record |Car type
1978 | Foreign
----------+--------
1 | 0.0
2 | 0.0
3 | 13.6
4 | 40.9
5 | 40.9
. | 4.5
-------------------
Commands that create tables echoing what you give them (notably tabdisp) are here more helpful than commands that create summaries and then create tables that show the summaries (e.g. tabulate, table).
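For comparison, here is a sketch of the fillin route mentioned above. It should give the same rectangular set of cells as contract with the zero option, but makes the filling-in step explicit, which can help when the combinations come from somewhere other than contract.

```stata
sysuse auto, clear

* Frequencies for the combinations that actually occur
contract foreign rep78

* Add the combinations that do not occur; fillin marks them with _fillin == 1
fillin foreign rep78
replace _freq = 0 if _fillin

list foreign rep78 _freq _fillin, sepby(foreign)
```

As in the tabdisp output above, missing values of rep78 are kept as a category of their own, so each car type ends up with six rows.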
Here are the relevant commands:
sysuse auto
table foreign, c(max mpg max rep78) row
Reading through the documentation (row: add row totals), I expected it to turn out like this:
----------------------------------
Car type | max(mpg) max(rep78)
----------+-----------------------
Domestic | 34 5
Foreign | 41 5
|
Total | 75 10
----------------------------------
However, the Total row is actually just the max of the column:
----------------------------------
Car type | max(mpg) max(rep78)
----------+-----------------------
Domestic | 34 5
Foreign | 41 5
|
Total | 41 5
----------------------------------
I was wondering if there is a similar command (without me having to collapse) that would allow me to construct a table like this (within the Stata window) but actually have the Total SUM at the bottom. Thanks for your time.
Stata's answer in table is arguably what would be expected. Given an instruction to calculate maximums, it does that by group and for the total dataset.
You want the maximums by group, but also to see their total or sum. That seems puzzling, but it can be done indirectly:
. sysuse auto , clear
(1978 Automobile Data)
. egen mpg_max = max(mpg), by(foreign)
. egen rep_max = max(rep78), by(foreign)
. egen tag = tag(foreign)
. table foreign if tag, c(sum mpg_max sum rep_max) row
--------------------------------------
Car type | sum(mpg_max) sum(rep_max)
----------+---------------------------
Domestic | 34 5
Foreign | 41 5
|
Total | 75 10
--------------------------------------
The trick here is that taking the maximums is done outside table. Then we feed just one observation in each category to table and the total is what is needed.
In a panel data set, I'm using
table Region TIME if TIME==2014 | TIME==2020 | TIME==2030 | TIME==2040, contents(sum BF ) row
to create the following table:
------------------------------------------
| TIME
Region | 2014 2020 2030 2040
----------+-------------------------------
701 | 26751 27941 29944 31477
702 | 10456 11354 12723 13788
704 | 41550 44481 49340 53273
706 | 44976 47535 51940 55573
709 | 43258 44398 46612 48191
711 | 6580 7011 7539 7856
713 | 9036 10139 11776 13194
714 | 3091 3284 3563 3750
716 | 9144 9730 10724 11543
719 | 5719 6292 7258 8036
720 | 11509 12161 13188 13919
722 | 21403 22344 23839 25006
723 | 4927 5094 5345 5447
728 | 2460 2576 2761 2906
|
Total | 240860 254340 276552 293959
------------------------------------------
I'd like to add a fifth column, which displays the difference between the year 2014 and 2040 in %.
Question: is this possible WITHOUT adding a new variable to the dataset? For instance by letting the fifth column being derived from a formula?
If not, how do I easily compute a new variable, taking account of the long format of the panel data set?
This isn't possible within table.
Your variable could be something like
egen total2014 = total(BF / (TIME == 2014)), by(Region)
egen total2040 = total(BF / (TIME == 2040)), by(Region)
gen pcdiff = 100 * (total2040 - total2014)/total2014
after which you can tabulate its (mean) value for each region. See Section 10 in http://www.stata-journal.com/sjpdf.html?articlenum=dm0055 for the first trick here.
You may need to go outside table for the tabulation, but if all else fails, collapse to a new dataset of totals and means.
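A small self-contained sketch of the division trick, using made-up numbers rather than the poster's data. Dividing by (TIME == 2014) divides by 1 in the wanted year and by 0 otherwise; division by zero yields missing in Stata, and the egen function total() ignores missing values, so only the wanted year contributes to the sum.

```stata
clear
input Region TIME BF
701 2014 100
701 2040 150
702 2014 200
702 2040 210
end

* Region totals for each year, spread to every observation of the region
egen total2014 = total(BF / (TIME == 2014)), by(Region)
egen total2040 = total(BF / (TIME == 2040)), by(Region)
generate pcdiff = 100 * (total2040 - total2014) / total2014

* Show one row per region
egen tag = tag(Region)
tabdisp Region if tag, cellvar(total2014 total2040 pcdiff) format(%9.1f)
```

With these invented values the percentage change is 50 for region 701 and 5 for region 702.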