Column totals by group (Stata)

Here are the relevant commands:
sysuse auto
table foreign, c(max mpg max rep78) row
Reading through the documentation (row: add row totals), I expected it to turn out like this:
----------------------------------
 Car type |   max(mpg)  max(rep78)
----------+-----------------------
 Domestic |         34           5
  Foreign |         41           5
          |
    Total |         75          10
----------------------------------
However, the Total row is actually just the max of the column:
----------------------------------
 Car type |   max(mpg)  max(rep78)
----------+-----------------------
 Domestic |         34           5
  Foreign |         41           5
          |
    Total |         41           5
----------------------------------
I was wondering if there is a similar command (without me having to collapse) that would allow me to construct a table like this (within the Stata window) but actually have the Total SUM at the bottom. Thanks for your time.

The result from table is arguably what should be expected: given an instruction to calculate maximums, it does so for each group and for the whole dataset.
You want the maximums by group, but also their sum in the total row. That seems puzzling, but it can be done indirectly:
. sysuse auto , clear
(1978 Automobile Data)
. egen mpg_max = max(mpg), by(foreign)
. egen rep_max = max(rep78), by(foreign)
. egen tag = tag(foreign)
. table foreign if tag, c(sum mpg_max sum rep_max) row
--------------------------------------
 Car type | sum(mpg_max)  sum(rep_max)
----------+---------------------------
 Domestic |           34             5
  Foreign |           41             5
          |
    Total |           75            10
--------------------------------------
The trick here is that the maximums are calculated outside table. We then feed table just one observation per category, so the sum it reports in the total row is exactly what is needed.
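As a cross-check on that total row, the same sums can be obtained without table. This is only a minimal sketch, assuming the variables mpg_max, rep_max, and tag are built exactly as in the answer; it relies on summarize leaving the sum of the variable in r(sum).

```stata
sysuse auto, clear
egen mpg_max = max(mpg), by(foreign)
egen rep_max = max(rep78), by(foreign)
egen tag = tag(foreign)

* Sum the group maximums over the one tagged observation per group
quietly summarize mpg_max if tag
display "sum of group maximums of mpg:   " r(sum)
quietly summarize rep_max if tag
display "sum of group maximums of rep78: " r(sum)
```

The two displayed values should agree with the Total row of the table above (75 and 10).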

Related

Count the number of distinct strings and their occurrence in a variable

I have a variable called Category that specifies the category of observations. The problem is that some observations have multiple categories. For example:
id Category
1 Economics
2 Biology
3 Psychology; Economics
4 Economics; Psychology
There is no meaning in the order of categories. They are always separated by ";". There are 250 categories, so creating dummy variables might be tricky. I have the complete list of categories in a separate Excel file if this might help.
What I want is simply to summarize my dataset by unique categories, such as Economics (3), Psychology (2), Biology (1), so the sum of the counts can exceed the number of observations.
tabsplit from the tab_chi package on SSC will do this for you.
clear
input id str42 Category
1 "Economics"
2 "Biology"
3 "Psychology; Economics"
4 "Economics; Psychology"
end
capture ssc install tab_chi
tabsplit Category, p(;)
   Category |      Freq.     Percent        Cum.
------------+-----------------------------------
    Biology |          1       16.67       16.67
  Economics |          3       50.00       66.67
 Psychology |          2       33.33      100.00
------------+-----------------------------------
      Total |          6      100.00
Note: You can also count the semicolons, and thus the number of categories in each observation, like this:
gen count = 1 + length(Category) - length(subinstr(Category, ";", "", .))
The logic is to compare the length of the string with the length it would have if every semicolon were replaced by an empty string (that is, removed). The difference is the number of semicolons, to which you add 1.
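Applied to the example data, the counting rule can be sketched as follows. The variable name n_cat and the guard for empty strings are assumptions added for illustration; they are not part of the original answer.

```stata
clear
input id str42 Category
1 "Economics"
2 "Biology"
3 "Psychology; Economics"
4 "Economics; Psychology"
end

* Number of categories = number of semicolons + 1;
* an empty string is counted as 0 categories (an added assumption)
gen n_cat = cond(Category == "", 0, ///
    1 + length(Category) - length(subinstr(Category, ";", "", .)))

list id Category n_cat
```

Observations 1 and 2 contain no semicolon and so count as one category each; observations 3 and 4 contain one semicolon and count as two.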
EDIT: How to get to a different data structure, starting with the data example above.
. split Category, p(;)
variables created as string:
Category1 Category2
. drop Category
. reshape long Category, i(id) j(mention)
(note: j = 1 2)
Data wide -> long
-----------------------------------------------------------------------------
Number of obs. 4 -> 8
Number of variables 3 -> 3
j variable (2 values) -> mention
xij variables:
Category1 Category2 -> Category
-----------------------------------------------------------------------------
. drop if missing(Category)
(2 observations deleted)
. list, sepby(id)
+----------------------------+
| id mention Category |
|----------------------------|
1. | 1 1 Economics |
|----------------------------|
2. | 2 1 Biology |
|----------------------------|
3. | 3 1 Psychology |
4. | 3 2 Economics |
|----------------------------|
5. | 4 1 Economics |
6. | 4 2 Psychology |
+----------------------------+
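One caveat with this route, offered as a sketch rather than as part of the original answer: split with p(;) leaves the blank that followed each semicolon at the start of the later pieces, so " Economics" and "Economics" would be tabulated as different categories. Assuming the same example data, trimming before tabulating avoids that.

```stata
clear
input id str42 Category
1 "Economics"
2 "Biology"
3 "Psychology; Economics"
4 "Economics; Psychology"
end

split Category, p(;)
drop Category
reshape long Category, i(id) j(mention)
drop if missing(Category)

* Remove leading and trailing blanks left over from "; "
* (trim() is the older name; strtrim() is the same function in Stata 14 and later)
replace Category = trim(Category)

tabulate Category
```

With the blanks removed, the frequencies should match those reported by tabsplit above.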

Update results in a column from multiple columns with different names

Based on the image, I would like to loop through the columns to find where the text mo appears, and then fill a variable mo with the corresponding result rather than with the text mo itself. The challenge has been how to select the result, which sits in the column next to the one where mo is found.
Your answer to my comment above suggests to me that the question you ask reflects the wrong approach to the larger problem. Your description suggests that you have observations with a varying number of testname/testvalue pairs, such as
+----------------------------------------+
| id   day   test1   val1   test2   val2 |
|----------------------------------------|
|  A     1      mo     11              . |
|  A     2      mo     12      df   98.2 |
|----------------------------------------|
|  B     1      df   98.3      mo     23 |
|  B     2      mo     14              . |
+----------------------------------------+
and your objective is to produce observations that look like this
+----------------------+
| id   day     df   mo |
|----------------------|
|  A     1      .   11 |
|  A     2   98.2   12 |
|----------------------|
|  B     1   98.3   23 |
|  B     2      .   14 |
+----------------------+
If that is the case, here is a reproducible example that you can copy and paste into Stata's Do-file Editor window, execute, and then examine the output to see how the technique avoids all the complexity you introduce by trying to use loops to accomplish the task. The reshape command is one of Stata's most powerful data management tools, and it will benefit you to learn how to use it.
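clear
input str8 id int day str8 test1 float val1 str8 test2 float val2
A 1 "mo" 11 "" .
A 2 "mo" 12 "df" 98.2
B 1 "df" 98.3 "mo" 23
B 2 "mo" 14 "" .
end
list, sepby(id) noobs
* Stack the test/value pairs: one observation per id, day, and pair number
reshape long test val, i(id day) j(num)
* Discard the empty pairs and the pair counter, which is no longer needed
drop if missing(test)
drop num
list, sepby(id) noobs
* Spread the values into one variable per test name (valdf, valmo)
reshape wide val, i(id day) j(test) string
* Strip the val prefix, leaving variables named df and mo
rename val* *
list, sepby(id) noobs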

How can I export a two-way table?

I have created a two-way summary table in Stata, but I am struggling to output my results.
Using the auto.dta sample dataset as an example, I am trying to build a table that displays the means and standard deviations of mpg, by two other variables (expensive and foreign).
My code currently looks as follows:
sysuse auto.dta, clear
gen expensive = (price > 5000)
The table that I would like to display can be created by either of the two commands below:
tabulate expensive foreign, sum(mpg)
Means, Standard Deviations and Frequencies of Mileage (mpg)
| Car type
expensive | Domestic Foreign | Total
-----------+----------------------+----------
0 | 22.137931 28.875 | 23.594595
| 4.3648281 4.8825491 | 5.2305696
| 29 8 | 37
-----------+----------------------+----------
1 | 16.913043 22.428571 | 19
| 3.4629604 6.4416229 | 5.4467115
| 23 14 | 37
-----------+----------------------+----------
Total | 19.826923 24.772727 | 21.297297
| 4.7432972 6.6111869 | 5.7855032
| 52 22 | 74
table expensive foreign, c(mean mpg sd mpg count mpg) row col
----------------------------------------
| Car type
expensive | Domestic Foreign Total
----------+-----------------------------
0 | 22.1379 28.875 23.5946
| 4.364828 4.882549 5.23057
| 29 8 37
|
1 | 16.913 22.4286 19
| 3.46296 6.441623 5.446712
| 23 14 37
|
Total | 19.8269 24.7727 21.2973
| 4.743297 6.611187 5.785503
| 52 22 74
----------------------------------------
I can also closely approximate the same results using collapse, but this does not calculate row and column totals.
My issue is that neither tabulate (with the sum option) nor table seems friendly to exporting output. I have tried converting to matrices, but tabulate with the sum option does not allow the matcell option and table seems similarly uncooperative.
I'm familiar with tabstat, esttab etc., but was not able to create the two-way table that I need with any of those packages. Any help would be really appreciated.
The community-contributed command asdoc does exactly that:
. asdoc table expensive foreign, c(mean mpg sd mpg count mpg) row col
----------------------------------------
| Car type
expensive | Domestic Foreign Total
----------+-----------------------------
0 | 22.1379 28.875 23.5946
| 4.364828 4.882549 5.23057
| 29 8 37
|
1 | 16.913 22.4286 19
| 3.46296 6.441623 5.446712
| 23 14 37
|
Total | 19.8269 24.7727 21.2973
| 4.743297 6.611187 5.785503
| 52 22 74
----------------------------------------
Click to Open File: Myfile.doc
Alternatively, one could use the community-contributed command tabout:
. tabout expensive foreign using table1.txt, c(mean mpg) sum replace
Table output written to: table1.txt
Car type
Domestic Foreign Total
Mean mpg Mean mpg Mean mpg
expensive
0 22.1 28.9 23.6
1 16.9 22.4 19.0
Total 19.8 24.8 21.3
. tabout expensive foreign using table2.txt, c(sd mpg) sum replace
Table output written to: table2.txt
Car type
Domestic Foreign Total
Sd mpg Sd mpg Sd mpg
expensive
0 4.4 4.9 5.2
1 3.5 6.4 5.4
Total 4.7 6.6 5.8
. tabout expensive foreign using table3.txt, c(count mpg) sum replace
Table output written to: table3.txt
Car type
Domestic Foreign Total
Count mpg Count mpg Count mpg
expensive
0 29.0 8.0 37.0
1 23.0 14.0 37.0
Total 52.0 22.0 74.0
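An easy solution is to use collapse to get a dataset that reproduces the cells of your desired table (without the row and column totals), and then export the dataset as a CSV file.
For example:
collapse (mean) mean_mpg = mpg (sd) sd_mpg = mpg (count) n_mpg = mpg, by(expensive foreign)
and then
export delimited using mydata.csv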
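Readers on Stata 17 or later should note that table was rewritten in that release: the c() syntax used in this thread is the older one, which can still be run under version control (for example, version 16: table ...). A minimal sketch of the newer route follows, assuming Stata 17 or later and a hypothetical output file name mytable.xlsx; it uses statistic() options and then collect export.

```stata
* Assumes Stata 17 or later; mytable.xlsx is a hypothetical file name
sysuse auto, clear
generate expensive = (price > 5000)

* Mean, SD, and count of mpg for each expensive x foreign cell;
* the rewritten table command reports totals by default
table (expensive) (foreign), statistic(mean mpg) statistic(sd mpg) statistic(count mpg)

* Write the table just produced to an Excel workbook
collect export mytable.xlsx, replace
```

The layout of the exported table may need adjusting with the collect suite; this sketch only shows that the totals and the export can come from official commands.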

Display all levels of a variable while tabulating in Stata

I am cross-tabulating two variables, variable1 with 5 levels and variable2 with 2 levels. Levels 1 and 2 of variable1 are not displayed in the tabulation because their frequency is zero, as follows:
sysuse auto
levelsof rep78
1 2 3 4 5
tab rep78 foreign if foreign, col nofreq
Repair |
Record | Car type
1978 | Foreign | Total
-----------+-----------+----------
3 | 14.29 | 14.29
4 | 42.86 | 42.86
5 | 42.86 | 42.86
-----------+-----------+----------
Total | 100.00 | 100.00
I would like to have the tabulation with all the levels displayed as follows:
tab rep78 foreign if foreign, col nofreq
Repair |
Record | Car type
1978 | Foreign | Total
-----------+-----------+----------
1 | 0.00 | 0.00
2 | 0.00 | 0.00
3 | 14.29 | 14.29
4 | 42.86 | 42.86
5 | 42.86 | 42.86
-----------+-----------+----------
Total | 100.00 | 100.00
How can I do that?
The reason I need this is that I have created a program that tabulates a given variable and posts the results into an Excel report template using Stata's putexcel functionality. In some cases some levels are not displayed in the tabulation, and this results in some values getting posted to the wrong row of the Excel report.
No decent example as yet from the OP, but here is some technique.
In general, it's tricky. Stata's no metaphysician and is reluctant to display anything without empirical evidence to hand that it exists. I here create a dataset with all the cross-combinations needed and also create a variable with explicit zeros to show. For many problems, also see help fillin.
. clear
. sysuse auto
(1978 Automobile Data)
. contract foreign rep78, zero
. egen pc = pc(_freq), by(foreign)
. tabdisp rep78 foreign if !foreign, c(pc) format(%2.1f)
--------------------
Repair |
Record | Car type
1978 | Domestic
----------+---------
1 | 3.8
2 | 15.4
3 | 51.9
4 | 17.3
5 | 3.8
. | 7.7
--------------------
. tabdisp rep78 foreign if foreign, c(pc) format(%2.1f)
-------------------
Repair |
Record |Car type
1978 | Foreign
----------+--------
1 | 0.0
2 | 0.0
3 | 13.6
4 | 40.9
5 | 40.9
. | 4.5
-------------------
Commands that create tables echoing what you give them (notably tabdisp) are here more helpful than commands that create summaries and then create tables that show the summaries (e.g. tabulate, table).
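The fillin route mentioned above can be sketched in the same spirit. This is an illustration under stated assumptions, not the original answer's code: it drops missing values of rep78 with the nomiss option of contract, so the percentages are computed among nonmissing values only.

```stata
sysuse auto, clear

* One observation per observed combination; missing rep78 is left out
contract foreign rep78, nomiss

* Add the combinations that never occur; fillin flags them with _fillin == 1
fillin foreign rep78
replace _freq = 0 if _fillin

* Column percentages within each car type
egen pc = pc(_freq), by(foreign)
tabdisp rep78 foreign if foreign, c(pc) format(%5.2f)
```

For foreign cars, levels 1 and 2 now appear with explicit zeros, and the remaining percentages correspond to those in the question's desired table.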

Get the number of levels of a categorical variable as a single number in Stata

I am trying to find a way to get the number of levels of a categorical variable as a single number. For example if I have a variable X with 4 levels I need to somehow get that number. If I type levelsof X I get the following 1 2 3 4 but I can't get only number 4 from there. Is there a way to do it using the levelsof or another command?
Various commands will give you the number of distinct values, for any kind of variable. ("Categorical variable" is a statistical concept, rather than a Stata concept.) Perhaps the simplest way to do it for one-off purposes is to ask for a one-way tabulation using tabulate. The number of distinct values is then the number of rows in that table, returned as r(r). Note that (1) you can suppress the table itself (which is useful in a program or do file) and (2) missing values are excluded by default:
. sysuse auto, clear
(1978 Automobile Data)
. qui tab foreign
. ret li
scalars:
r(N) = 74
r(r) = 2
. qui tab rep78
. ret li
scalars:
r(N) = 69
r(r) = 5
. qui tab rep78, missing
. ret li
scalars:
r(N) = 74
r(r) = 6
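In a program or do-file, the returned result is usually copied straight into a local macro. A minimal sketch follows; the names nlevels and tagged are hypothetical, and the second route with egen's tag() function is an alternative added here for comparison.

```stata
sysuse auto, clear

* Route 1: tabulate quietly and keep the number of rows
quietly tabulate rep78
local nlevels = r(r)
display "rep78 has `nlevels' distinct nonmissing values"

* Route 2: tag one observation per distinct value and count the tags
egen tagged = tag(rep78)
quietly count if tagged
display "rep78 has " r(N) " distinct nonmissing values"
```

Both routes ignore missing values by default, so each should report 5 for rep78.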
An extended review of this problem, pitched more generally, is available here. That paper introduces a distinct command. Its uses include direct support for looking at the number of distinct values systematically. search distinct in Stata to find a download source for the most recent version.
. distinct
| Observations
| total distinct
--------------+----------------------------
make | 74 74
price | 74 74
mpg | 74 21
rep78 | 69 5
headroom | 74 8
trunk | 74 18
weight | 74 64
length | 74 47
turn | 74 18
displacement | 74 31
gear_ratio | 74 36
foreign | 74 2
You can look at r(r) after levelsof:
. sysuse auto
(1978 Automobile Data)
. levelsof rep78
1 2 3 4 5
. display "rep78 has " `r(r)' " levels."
rep78 has 5 levels.