I am cross-tabulating two variables: variable1 with 5 levels and variable2 with 2 levels. Levels 1 and 2 of variable1 are not displayed in the tabulation because their frequency is zero, as follows:
sysuse auto
levelsof rep78
1 2 3 4 5
tab rep78 foreign if foreign, col nofreq
Repair |
Record | Car type
1978 | Foreign | Total
-----------+-----------+----------
3 | 14.29 | 14.29
4 | 42.86 | 42.86
5 | 42.86 | 42.86
-----------+-----------+----------
Total | 100.00 | 100.00
I would like to have the tabulation with all the levels displayed as follows:
tab rep78 foreign if foreign, col nofreq
Repair |
Record | Car type
1978 | Foreign | Total
-----------+-----------+----------
1 | 0.00 | 0.00
2 | 0.00 | 0.00
3 | 14.29 | 14.29
4 | 42.86 | 42.86
5 | 42.86 | 42.86
-----------+-----------+----------
Total | 100.00 | 100.00
How can I do that?
The reason I need this is that I have created a program that tabulates a given variable and posts the results into an Excel report template using Stata's putexcel command. In some cases some levels are not displayed in the tabulation, and this results in some values getting posted to the wrong row of the Excel report.
No decent example as yet from the OP, but here is some technique.
In general, it's tricky. Stata's no metaphysician and is reluctant to display anything without empirical evidence to hand that it exists. I here create a dataset with all the cross-combinations needed and also create a variable with explicit zeros to show. For many problems, also see help fillin.
. clear
. sysuse auto
(1978 Automobile Data)
. contract foreign rep78, zero
. egen pc = pc(_freq), by(foreign)
. tabdisp rep78 foreign if !foreign, c(pc) format(%2.1f)
--------------------
Repair |
Record | Car type
1978 | Domestic
----------+---------
1 | 3.8
2 | 15.4
3 | 51.9
4 | 17.3
5 | 3.8
. | 7.7
--------------------
. tabdisp rep78 foreign if foreign, c(pc) format(%2.1f)
-------------------
Repair |
Record |Car type
1978 | Foreign
----------+--------
1 | 0.0
2 | 0.0
3 | 13.6
4 | 40.9
5 | 40.9
. | 4.5
-------------------
Commands that create tables echoing what you give them (notably tabdisp) are here more helpful than commands that create summaries and then create tables that show the summaries (e.g. tabulate, table).
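To connect this with the putexcel goal in the question, here is a minimal sketch, not a definitive implementation. It assumes Stata 14 or later (for the putexcel set syntax), that missing repair records should be excluded so the percentages match the table the OP wants, and a hypothetical output file report.xlsx. Because contract with the zero option keeps every level, each level always lands in the same row.

```stata
sysuse auto, clear
contract foreign rep78, zero         // one row per cell; _freq is 0 for empty cells
drop if missing(rep78)               // assumption: leave out missing repair records
egen pc = pc(_freq), by(foreign)     // column percentages within car type
keep if foreign == 1
sort rep78

putexcel set report.xlsx, replace    // hypothetical file name
putexcel A1 = "rep78"
putexcel B1 = "Foreign (%)"
forvalues i = 1/`=_N' {
    local row = `i' + 1              // level i is always written to row i + 1
    putexcel A`row' = rep78[`i']
    putexcel B`row' = pc[`i']
}
```

Since the dataset now has exactly one observation per level, the row written for level 3 does not shift when levels 1 and 2 are empty.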
I have a variable called Category that specifies the category of observations. The problem is that some observations have multiple categories. For example:
id Category
1 Economics
2 Biology
3 Psychology; Economics
4 Economics; Psychology
There is no meaning in the order of categories. They are always separated by ";". There are 250 categories, so creating dummy variables might be tricky. I have the complete list of categories in a separate Excel file if this might help.
What I want is simply to summarize my dataset by unique categories such as Economics (3), Psychology (2), Biology (1) (so the sum of all counts can exceed the number of observations).
tabsplit from the tab_chi package on SSC will do this for you.
clear
input id str42 Category
1 "Economics"
2 "Biology"
3 "Psychology; Economics"
4 "Economics; Psychology"
end
capture ssc install tab_chi
tabsplit Category, p(;)
Category | Freq. Percent Cum.
------------+-----------------------------------
Biology | 1 16.67 16.67
Economics | 3 50.00 66.67
Psychology | 2 33.33 100.00
------------+-----------------------------------
Total | 6 100.00
Note: You can count semi-colons and thus phrases like this.
gen count = 1 + length(Category) - length(subinstr(Category, ";", "", .))
The logic is that you measure the length of the string and the length it would have if the semi-colons ; were replaced by empty strings (namely, removed). The difference is the number of semi-colons, to which you add 1 to get the number of phrases.
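As a quick worked check of that arithmetic, here is the same calculation on a literal string taken from the example data:

```stata
* "Psychology; Economics" has 21 characters, and 20 once the semi-colon is removed
display 1 + length("Psychology; Economics") - length(subinstr("Psychology; Economics", ";", "", .))
```

This displays 2, the number of categories in that string.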
EDIT: How to get to a different data structure, starting with the data example above.
. split Category, p(;)
variables created as string:
Category1 Category2
. drop Category
. reshape long Category, i(id) j(mention)
(note: j = 1 2)
Data wide -> long
-----------------------------------------------------------------------------
Number of obs. 4 -> 8
Number of variables 3 -> 3
j variable (2 values) -> mention
xij variables:
Category1 Category2 -> Category
-----------------------------------------------------------------------------
. drop if missing(Category)
(2 observations deleted)
. list, sepby(id)
+----------------------------+
| id mention Category |
|----------------------------|
1. | 1 1 Economics |
|----------------------------|
2. | 2 1 Biology |
|----------------------------|
3. | 3 1 Psychology |
4. | 3 2 Economics |
|----------------------------|
5. | 4 1 Economics |
6. | 4 2 Psychology |
+----------------------------+
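One detail worth hedging: split with p(;) keeps the blank that follows each semi-colon, so second mentions may be stored as " Economics" rather than "Economics". The sketch below repeats the steps above, trims the strings, and then tabulates the long data. It assumes Stata 14 or later for strtrim(); on older versions trim() should do the same job.

```stata
clear
input id str42 Category
1 "Economics"
2 "Biology"
3 "Psychology; Economics"
4 "Economics; Psychology"
end

split Category, p(;)
drop Category
reshape long Category, i(id) j(mention)
drop if missing(Category)
replace Category = strtrim(Category)   // remove the blank left after "; "
tabulate Category                      // one row per distinct category
```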
I have a variable var with many missing values for which I want to calculate the 95th percentile then use this value to drop observations that lie above the 95th percentile (for those observations that are not missing the variable).
Because of the many missing values, I use egen with rowpctile which is supposed to calculate the p(#) percentile, ignoring missing values. When I look at the p95 values, however, they're a range of different values rather than a single 95th percentile value as seen below:
. egen p95 = rowpctile(var), p(95)
. list p95
+-----------+
| p95 |
|-----------|
1. | . |
2. | 65.71429 |
3. | 14.28571 |
4. | . |
5. | . |
...
Am I using the function incorrectly or is there a better way to go about this?
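The rowpctile function of the egen command calculates the percentile of the values of a list of variables separately for each observation. With a single variable, that is just each observation's own value, which is why you see many different values and missings. Here is some technique which should set you on the right path.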
. sysuse auto, clear
(1978 Automobile Data)
. replace price = . in 1/5
(5 real changes made, 5 to missing)
. summarize price, detail
                            Price
-------------------------------------------------------------
      Percentiles      Smallest
 1%         3291           3291
 5%         3748           3299
10%         3895           3667       Obs                  69
25%         4296           3748       Sum of Wgt.          69
50%         5104                      Mean           6245.493
                        Largest       Std. Dev.      3015.072
75%         6342          13466
90%        11497          13594       Variance        9090661
95%        13466          14500       Skewness       1.594391
99%        15906          15906       Kurtosis       4.555704
. display r(p95)
13466
. generate toobig = price>r(p95)
. list make price if toobig | price==.
+---------------------------+
| make price |
|---------------------------|
1. | AMC Concord . |
2. | AMC Pacer . |
3. | AMC Spirit . |
4. | Buick Century . |
5. | Buick Electra . |
|---------------------------|
12. | Cad. Eldorado 14,500 |
13. | Cad. Seville 15,906 |
27. | Linc. Mark V 13,594 |
+---------------------------+
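For the dropping step the question asks about, note that Stata treats numeric missing values as larger than any number, so a bare price > r(p95) is also true for the missing prices. A minimal sketch that guards against this, using the same auto data as a stand-in for the OP's var:

```stata
sysuse auto, clear
replace price = . in 1/5
summarize price, detail                     // leaves the 95th percentile in r(p95)
drop if price > r(p95) & !missing(price)    // keep the observations with missing price
count
```

With these data the three cars listed above are dropped and the five observations with missing price stay in. egen's pctile() function (as opposed to rowpctile()) is another way to get a single percentile across observations.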
I have created a two-way summary table in Stata, but I am struggling to output my results.
Using the auto.dta sample dataset as an example, I am trying to build a table that displays the means and standard deviations of mpg, by two other variables (expensive and foreign).
My code currently looks as follows:
sysuse auto.dta, clear
gen expensive = (price > 5000)
The table that I would like to display can be created by either of the two commands below:
tabulate expensive foreign, sum(mpg)
Means, Standard Deviations and Frequencies of Mileage (mpg)
| Car type
expensive | Domestic Foreign | Total
-----------+----------------------+----------
0 | 22.137931 28.875 | 23.594595
| 4.3648281 4.8825491 | 5.2305696
| 29 8 | 37
-----------+----------------------+----------
1 | 16.913043 22.428571 | 19
| 3.4629604 6.4416229 | 5.4467115
| 23 14 | 37
-----------+----------------------+----------
Total | 19.826923 24.772727 | 21.297297
| 4.7432972 6.6111869 | 5.7855032
| 52 22 | 74
table expensive foreign, c(mean mpg sd mpg count mpg) row col
----------------------------------------
| Car type
expensive | Domestic Foreign Total
----------+-----------------------------
0 | 22.1379 28.875 23.5946
| 4.364828 4.882549 5.23057
| 29 8 37
|
1 | 16.913 22.4286 19
| 3.46296 6.441623 5.446712
| 23 14 37
|
Total | 19.8269 24.7727 21.2973
| 4.743297 6.611187 5.785503
| 52 22 74
----------------------------------------
I can also closely approximate the same results using collapse, but this does not calculate row and column totals.
My issue is that neither the tabulate command (with the sum option) nor the table command seems friendly to output. I have tried converting to matrices, but tabulate with the sum option does not allow the matcell option and table seems similarly uncooperative.
I'm familiar with tabstat, esttab etc., but was not able to create the two-way table that I need with any of those packages. Any help would be really appreciated.
The community-contributed command asdoc does exactly that:
. asdoc table expensive foreign, c(mean mpg sd mpg count mpg) row col
----------------------------------------
| Car type
expensive | Domestic Foreign Total
----------+-----------------------------
0 | 22.1379 28.875 23.5946
| 4.364828 4.882549 5.23057
| 29 8 37
|
1 | 16.913 22.4286 19
| 3.46296 6.441623 5.446712
| 23 14 37
|
Total | 19.8269 24.7727 21.2973
| 4.743297 6.611187 5.785503
| 52 22 74
----------------------------------------
Click to Open File: Myfile.doc
Alternatively, one could use the community-contributed command tabout:
. tabout expensive foreign using table1.txt, c(mean mpg) sum replace
Table output written to: table1.txt
Car type
Domestic Foreign Total
Mean mpg Mean mpg Mean mpg
expensive
0 22.1 28.9 23.6
1 16.9 22.4 19.0
Total 19.8 24.8 21.3
. tabout expensive foreign using table2.txt, c(sd mpg) sum replace
Table output written to: table2.txt
Car type
Domestic Foreign Total
Sd mpg Sd mpg Sd mpg
expensive
0 4.4 4.9 5.2
1 3.5 6.4 5.4
Total 4.7 6.6 5.8
. tabout expensive foreign using table3.txt, c(count mpg) sum replace
Table output written to: table3.txt
Car type
Domestic Foreign Total
Count mpg Count mpg Count mpg
expensive
0 29.0 8.0 37.0
1 23.0 14.0 37.0
Total 52.0 22.0 74.0
An easy solution is to use collapse to get a dataset that reproduces the cells of your desired table, and then export the dataset as a CSV file.
For example:
collapse (mean) mean_mpg=mpg (sd) sd_mpg=mpg (count) n_mpg=mpg, by(expensive foreign)
and then
export delimited using mydata.csv
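The question notes that collapse on its own gives no row and column totals. One possible sketch, under the assumption that stacking the margins as extra rows is acceptable, collapses at each level and appends the pieces. The variable names mean_mpg, sd_mpg, n_mpg and the file name mpg_table.csv are illustrative choices, not part of the original answer.

```stata
sysuse auto, clear
generate expensive = price > 5000

* the same three statistics are requested at every level
local stats (mean) mean_mpg=mpg (sd) sd_mpg=mpg (count) n_mpg=mpg
tempfile cells rows cols

preserve
collapse `stats', by(expensive foreign)    // the four cells
save "`cells'"
restore, preserve
collapse `stats', by(expensive)            // totals for each level of expensive
save "`rows'"
restore, preserve
collapse `stats', by(foreign)              // totals for each car type
save "`cols'"
restore

collapse `stats'                           // grand total, a single observation
append using "`cells'" "`rows'" "`cols'"
export delimited using mpg_table.csv, replace
```

In the exported file, a missing value of expensive or foreign marks a total over that variable.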
Here are the relevant commands:
sysuse auto
table foreign, c(max mpg max rep78) row
Reading through the documentation (row: add row totals), I expected it to turn out like this:
----------------------------------
Car type | max(mpg) max(rep78)
----------+-----------------------
Domestic | 34 5
Foreign | 41 5
|
Total | 75 10
----------------------------------
However, the Total row is actually just the max of the column:
----------------------------------
Car type | max(mpg) max(rep78)
----------+-----------------------
Domestic | 34 5
Foreign | 41 5
|
Total | 41 5
----------------------------------
I was wondering if there is a similar command (without me having to collapse) that would allow me to construct a table like this (within the Stata window) but actually have the Total SUM at the bottom. Thanks for your time.
Stata's answer in table is arguably what would be expected. Given an instruction to calculate maximums, it does that by group and for the total dataset.
You want the maximums by group, but also to see their total or sum. That seems puzzling, but it can be done indirectly:
. sysuse auto , clear
(1978 Automobile Data)
. egen mpg_max = max(mpg), by(foreign)
. egen rep_max = max(rep78), by(foreign)
. egen tag = tag(foreign)
. table foreign if tag, c(sum mpg_max sum rep_max) row
--------------------------------------
Car type | sum(mpg_max) sum(rep_max)
----------+---------------------------
Domestic | 34 5
Foreign | 41 5
|
Total | 75 10
--------------------------------------
The trick here is that taking the maximums is done outside table. Then we feed just one observation in each category to table and the total is what is needed.
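The same tagging trick should carry over to other tabulation commands. As a hedged alternative sketch, tabstat with by() prints the group sums and adds a Total line by default, and its syntax does not depend on the older c() option of table:

```stata
sysuse auto, clear
egen mpg_max = max(mpg), by(foreign)
egen rep_max = max(rep78), by(foreign)
egen tag = tag(foreign)                    // flags one observation per car type
tabstat mpg_max rep_max if tag, by(foreign) statistics(sum)
```

The Total line should again show 75 and 10, because only one observation per group is fed in.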
I have some data on diseases and age of diagnosis. Each participant was asked what diseases they have had and at what age that disease was diagnosed.
There is a set of variables disease1-28 with a numeric code for each disease and another set age1-28 with the age at diagnosis in years. The diseases are placed in successive variables in the order recalled; the age of diagnosis is placed in the appropriate age variable.
I would like to generate a new variable for each of several diseases giving the age of diagnosis of that disease: e.g. asthma_age_at_diagnosis
Can I do this without having 28 replace statements?
Example of the data:
+-------------+----------+----------+----------+------+------+------+
| Participant | Disease1 | Disease2 | Disease3 | Age1 | Age2 | Age3 |
+-------------+----------+----------+----------+------+------+------+
| 1 | 123 | 3 | . | 30 | 2 | . |
| 2 | 122 | 123 | 5 | 23 | 51 | 44 |
| 3 | 5 | . | . | 50 | . | . |
+-------------+----------+----------+----------+------+------+------+
I give a general heads-up that a question of this form without any code of your own is often considered off-topic for Stack Overflow. Still, the Stata users around here are the people answering Stata questions (surprise) and we usually indulge questions like this if interesting and well-posed.
I'd advise a different data structure, period. With your example data
clear
input Patient Disease1 Disease2 Disease3 Age1 Age2 Age3
1 123 3 . 30 2 .
2 122 123 5 23 51 44
3 5 . . 50 . .
end
You can reshape
reshape long Disease Age, i(Patient) j(Order)
drop if missing(Disease)
list, sep(0)
+--------------------------------+
| Patient Order Disease Age |
|--------------------------------|
1. | 1 1 123 30 |
2. | 1 2 3 2 |
3. | 2 1 122 23 |
4. | 2 2 123 51 |
5. | 2 3 5 44 |
6. | 3 1 5 50 |
+--------------------------------+
With the data in this form you can now answer lots of questions easily. I don't see that a whole bunch of new variables would make many analyses easier. Another way to see this is that you have hinted that the order in which diseases are coded is arbitrary; that being so, wiring that into the data structure is ill-advised. Even if order is important, it is still accessible as part of the dataset (variable Order).
Hint: If you still want separate variables for some purposes, look at separate.
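If you do stay with the wide layout, a loop avoids writing out one replace per variable. This is a minimal sketch on the example data, assuming purely for illustration that code 123 is the disease of interest; the variable name age_d123 is hypothetical, and with the real data the loop would run over 1/28.

```stata
clear
input Patient Disease1 Disease2 Disease3 Age1 Age2 Age3
1 123   3 . 30  2  .
2 122 123 5 23 51 44
3   5   . . 50  .  .
end

local code 123                      // hypothetical code for the disease of interest
generate age_d`code' = .
forvalues j = 1/3 {
    * take the age from the first slot in which the code appears
    replace age_d`code' = Age`j' if Disease`j' == `code' & missing(age_d`code')
}
list Patient age_d`code'
```

Patients who never report the code keep a missing value in the new variable.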