Return a matrix from distinct command - stata

I have a simple question about the distinct command in Stata.
When used with the by prefix, can it return a one-dimensional matrix of r(N)?
For example:
sysuse auto,clear
bysort foreign: distinct rep78
Can I store a [2,1] matrix, with each row representing the number of distinct values of rep78?
The manual seems to suggest that it only stores the number of distinct values for the last by group.

You can easily create your own wrapper for that:
sysuse auto,clear
sort foreign
levelsof foreign, local(foreign_levels)
local number_of_foreign_levels : word count `foreign_levels'
matrix distinct_mat = J(`number_of_foreign_levels', 1, 0)
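forvalues i = 1 / `number_of_foreign_levels' {
    * Look up the i-th level rather than assuming the levels are 0, 1, ...
    local level : word `i' of `foreign_levels'
    quietly distinct rep78 if foreign == `level'
    matrix distinct_mat[`i', 1] = r(ndistinct)
}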
matrix list distinct_mat
distinct_mat[2,1]
    c1
r1   5
r2   3
Note that the number of distinct observations is stored in r(ndistinct), not r(N).

Here is another way to get numbers of distinct values into a matrix.
. sysuse auto
(1978 Automobile Data)
. egen tag = tag(foreign rep78)
. tab foreign if tag, matcell(foo)
   Car type |      Freq.     Percent        Cum.
------------+-----------------------------------
   Domestic |          5       62.50       62.50
    Foreign |          3       37.50      100.00
------------+-----------------------------------
      Total |          8      100.00
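To confirm that the counts really end up in a matrix, here is a minimal sketch of the same idea with the table suppressed and the matrix listed explicitly. The matrix name foo is taken from the example above; nothing else is assumed beyond auto.dta.

```stata
sysuse auto, clear

* Tag one observation per distinct (foreign, rep78) pair;
* observations with missing rep78 are not tagged by default
egen tag = tag(foreign rep78)

* Tabulate the tagged observations only and keep the frequencies in foo
quietly tabulate foreign if tag, matcell(foo)

* foo has one row per level of foreign, one column of counts
matrix list foo
display rowsof(foo) " x " colsof(foo)
```

The rows of foo follow the sorted levels of foreign, so the first row is Domestic and the second is Foreign.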

Related

Stata output table: Conditional symbols in estout

For simplicity, let's assume the following script, which creates a simple regression table:
sysuse auto
eststo clear
qui regress price weight mpg
esttab using "table.rtf", cells(t) mtitles onecell nogap ///
stats(N, labels("Observations")) label ///
compress replace
eststo clear
Output:
(1)
.
t
Weight (lbs.) 2.723238
Mileage (mpg) -.5746808
Constant .541018
Observations 74
Question:
Would it be possible to mark every t-value above 0.5 or below -0.5 with an asterisk (that is, every t-value whose absolute value exceeds 0.5)?
Please note: In the specific application case, I can't work with given p-values, and need a custom solution that works with thresholds of t.
Desired outcome:
(1)
.
t
Weight (lbs.) 2.723238*
Mileage (mpg) -.5746808*
Constant .541018*
Observations 74
Thank you for your help!
You cannot do this directly with estout but the following works all the same:
sysuse auto, clear
regress price weight mpg
quietly esttab, mtitles onecell nogap stats(N, labels("Observations")) label ///
compress replace star staraux
matrix A = r(coefs)
matrix A = A[1...,2]
svmat A
generate A2 = "*" if abs(A1) >= 0.5
generate A4 = string(A1) + A2
local names : rownames A
generate A3 = ""
forvalues i = 1 / `: word count `names'' {
replace A3 = `"`: word `i' of `names''"' in `i'
}
list A3 A4 if !missing(A3)
     +---------------------+
     |     A3           A4 |
     |---------------------|
  1. | weight    2.723238* |
  2. |    mpg   -.5746808* |
  3. |  _cons     .541018* |
     +---------------------+
preserve
keep if !missing(A3)
export delimited A3 A4 using table.txt, delimiter(" ") novarnames
restore
You will have to do some more gymnastics to get the variable labels etc.
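As a rough illustration of those gymnastics, here is a sketch that skips esttab entirely: it computes each t statistic from the stored coefficient and standard error, applies the 0.5 threshold from the question, and looks up the variable label. The local names (lbl, t, star) and the "Constant" label for _cons are my own choices, not anything esttab produces.

```stata
sysuse auto, clear
quietly regress price weight mpg

* Coefficient names in estimation order: weight mpg _cons
local names : colnames e(b)

foreach name of local names {
    * t statistic from the stored coefficient and standard error
    local t = _b[`name'] / _se[`name']

    * Custom threshold on |t| (0.5, as in the question)
    local star = cond(abs(`t') > 0.5, "*", "")

    * _cons has no variable label, so supply one by hand
    if "`name'" == "_cons" {
        local lbl "Constant"
    }
    else {
        local lbl : variable label `name'
    }

    display "`lbl'" _col(20) %9.0g (`t') "`star'"
}
```

The displayed lines could be written to a file with file write, or stored in string variables as in the answer above, depending on where the table needs to go.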

Export tabulation in Excel

Consider the following toy example:
sysuse auto, clear
tab foreign, sum(price)
            |          Summary of Price
   Car type |        Mean   Std. Dev.       Freq.
------------+------------------------------------
   Domestic |   6,072.423   3,097.104          52
    Foreign |   6,384.682   2,621.915          22
------------+------------------------------------
      Total |   6,165.257   2,949.496          74
How can I save the results in an Excel file?
Using the community-contributed command esttab, the following works for me:
sysuse auto, clear
egen m_total = mean(price)
egen s_total = sd(price)
scalar mtotal = m_total
scalar stotal = s_total
scalar N = _N
collapse (mean) Mean=price (sd) StdDev=price (count) Freq = price, by(foreign)
set obs 3
replace Mean = mtotal in 3
replace StdDev = stotal in 3
replace Freq = N in 3
mkmat Mean StdDev Freq, matrix(A)
esttab matrix(A) using myfilename.xls, varlabels(r1 Domestic r2 Foreign r3 Total) ///
title(" Summary of Price") mlabels(none)
Summary of Price
---------------------------------------------------
                     Mean       StdDev         Freq
---------------------------------------------------
Domestic         6072.423     3097.104           52
Foreign          6384.682     2621.915           22
Total            6165.257     2949.496           74
---------------------------------------------------
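If a genuine .xlsx workbook is wanted and no community-contributed command is available, a simpler route is the official export excel command. This is only a sketch: the file name summary_price.xlsx is made up, and unlike the esttab version it does not add the Total row.

```stata
sysuse auto, clear

preserve

* One row per car type with the three summary statistics
collapse (mean) Mean=price (sd) StdDev=price (count) Freq=price, by(foreign)

* Write the collapsed data, with variable names as the header row
export excel using "summary_price.xlsx", firstrow(variables) replace

restore
```

By default export excel writes the value labels of foreign (Domestic, Foreign) rather than 0 and 1; add the nolabel option if the numeric codes are preferred.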

Group by with percentages and raw numbers

I have a dataset that looks like this:
I would like to create a table that groups by area and shows the total amount for the area both as a percentage of total amount and as a raw number, as well as the percent of the total number of records/observations per area and total number of records/observations as a raw number.
The code below generates a table of raw numbers but does not show the percent of total:
tabstat amount, by(county) stat(sum count)
There isn't a canned command for doing what you want. You will have to program the table yourself.
Here's a quick example using auto.dta:
. sysuse auto, clear
(1978 Automobile Data)
. tabstat price, by(foreign) stat(sum count)
Summary for variables: price
     by categories of: foreign (Car type)

 foreign |       sum         N
---------+--------------------
Domestic |    315766        52
 Foreign |    140463        22
---------+--------------------
   Total |    456229        74
------------------------------
You can do the calculations and save the raw numbers in variables as follows:
. generate total_obs = _N
. display total_obs
74
. count if foreign == 0
52
. generate total_domestic_obs = r(N)
. count if foreign == 1
22
. generate total_foreign_obs = r(N)
. egen total_domestic_price = total(price) if foreign == 0
. sort total_domestic_price
. local tdp = total_domestic_price
. display total_domestic_price
315766
. egen total_foreign_price = total(price) if foreign == 1
. sort total_foreign_price
. local tfp = total_foreign_price
. display total_foreign_price
140463
. generate total_price = `tdp' + `tfp'
. display total_price
456229
And for the percentages:
. generate pct_domestic_price = (`tdp' / total_price) * 100
. display pct_domestic_price
69.212173
. generate pct_foreign_price = (`tfp' / total_price) * 100
. display pct_foreign_price
30.787828
EDIT:
Here's a more automated way to do the above without having to specify individual values:
program define foo
syntax varlist(min=1 max=1), by(string)
generate total_obs = _N
display total_obs
quietly levelsof `by', local(nlevels)
foreach x of local nlevels {
count if `by' == `x'
quietly generate total_`by'`x'_obs = r(N)
quietly egen total_`by'`x'_`varlist' = total(`varlist') if `by' == `x'
sort total_`by'`x'_`varlist'
local tvar`x' = total_`by'`x'_`varlist'
local tvarall `tvarall' `tvar`x'' +
display total_`by'`x'_`varlist'
}
quietly generate total_`varlist' = `tvarall' 0
display total_`varlist'
foreach x of local nlevels {
quietly generate pct_`by'`x'_`varlist' = (`tvar`x'' / total_`varlist') * 100
display pct_`by'`x'_`varlist'
}
end
The results are identical:
. foo price, by(foreign)
74
52
315766
22
140463
456229
69.212173
30.787828
You will obviously need to format the results in a table of your liking.
Here's another approach. I stole @Pearly Spencer's example. It could be generalised to a command. The main message I want to convey is that list is useful for tabulations and other reports, usually with just some obligation to calculate what you want to show beforehand.
. sysuse auto, clear
(1978 Automobile Data)
. preserve
. collapse (sum) total=price (count) obs=price, by(foreign)
. egen pc2 = pc(total)
. egen pc1 = pc(obs)
. char pc2[varname] "%"
. char pc1[varname] "%"
. format pc* %2.1f
. list foreign obs pc1 total pc2 , subvarname noobs sum(obs pc1 total pc2)
     +-----------------------------------------+
     |  foreign   obs       %    total       % |
     |-----------------------------------------|
     | Domestic    52    70.3   315766    69.2 |
     |  Foreign    22    29.7   140463    30.8 |
     |-----------------------------------------|
 Sum |             74   100.0   456229   100.0 |
     +-----------------------------------------+
. restore
EDIT Here's an essay in egen with similar flavour but leaving the original data in place and new variables also available for export or graphics.
. sysuse auto, clear
(1978 Automobile Data)
. egen total = total(price), by(foreign)
. egen obs = count(price), by(foreign)
. egen tag = tag(foreign)
. egen pc2 = pc(total) if tag
(72 missing values generated)
. egen pc1 = pc(obs) if tag
(72 missing values generated)
. char pc2[varname] "%"
. char pc1[varname] "%"
. format pc* %2.1f
. list foreign obs pc1 total pc2 if tag, subvarname noobs sum(obs pc1 total pc2)
     +-----------------------------------------+
     |  foreign   obs       %    total       % |
     |-----------------------------------------|
     | Domestic    52    70.3   315766    69.2 |
     |  Foreign    22    29.7   140463    30.8 |
     |-----------------------------------------|
 Sum |             74   100.0   456229   100.0 |
     +-----------------------------------------+

Exporting results from regressions in Excel

I want to store results from ordinary least squares (OLS) regressions in Stata within a double loop.
Here is the structure of my code:
foreach i2 of numlist 1 2 3{
foreach i3 of numlist 1 2 3 4{
quiet: eststo: reg dep covariates, robust
}
}
The end goal is to have a table in Excel composed of twelve rows (one for each model) and seven columns (number of observations, estimated constant, five estimated coefficients).
Any suggestions on how I can do this?
Such a table can be created simply by using the community-contributed command esttab:
sysuse auto, clear
eststo clear
eststo m1: quietly regress price weight
eststo m2: quietly regress price weight mpg
quietly esttab
matrix A = r(coefs)'
matrix C = r(stats)'
tokenize "`: rownames A'"
forvalues i = 1 / `=rowsof(A)' {
if strmatch("``i''", "*b*") matrix B = nullmat(B) \ A[`i', 1...]
}
matrix C = B , C
matrix rownames C = "Model 1" "Model 2"
Result:
esttab matrix(C) using table.csv, eqlabels(none) mlabels(none) varlabels("Model 1" "Model 2")
----------------------------------------------------------------
                   weight          mpg        _cons            N
----------------------------------------------------------------
Model 1          2.044063                 -6.707353           74
Model 2          1.746559    -49.51222     1946.069           74
----------------------------------------------------------------
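For the double loop in the question, one alternative that needs no community-contributed commands is to stack e(N) and e(b) into a matrix inside the loop and write it out with putexcel at the end. The sketch below uses a placeholder regression on auto.dta, since the real dependent variable and covariates were not given; the matrix name results, the row names and the file name results.xlsx are all made up for illustration. It assumes a Stata version with the newer putexcel syntax (14 or later).

```stata
sysuse auto, clear

capture matrix drop results
local rnames

foreach i2 of numlist 1 2 3 {
    foreach i3 of numlist 1 2 3 4 {
        * Placeholder model: the real specification would depend on i2 and i3
        quietly regress price weight mpg, robust

        * Append one row per model: N first, then the coefficient vector
        matrix results = nullmat(results) \ (e(N), e(b))
        local rnames `rnames' m`i2'_`i3'
    }
}

matrix colnames results = N weight mpg _cons
matrix rownames results = `rnames'
matrix list results

* Write the matrix, with row and column names, to an Excel workbook
putexcel set "results.xlsx", replace
putexcel A1 = matrix(results), names
```

This only works as written when every model has the same set of coefficients, so that all rows have the same length. If the covariates differ across models, the esttab route above is more forgiving.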

Generating dummies in Stata

I have a dataset in Stata of the following form
id | year
a | 1950
b | 1950
c | 1950
d | 1950
.
.
.
y | 1950
-----
a | 1951
b | 1951
c | 1951
d | 1951
.
.
.
y | 1951
-----
...
I'm looking for a quick way to rewrite the following code
gen dummya=1 if id=="a"
gen dummyb=1 if id=="b"
gen dummyc=1 if id=="c"
...
gen dummyy=1 if id=="y"
and
gen dummy50=1 if year==1950
gen dummy51=1 if year==1951
...
Note that all your dummies would be created as 1 or missing. It is almost always more useful to create them directly as 1 or 0. Indeed, that is the usual definition of dummies.
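To see the difference between the two codings, here is a small sketch on auto.dta; the variable names d_missing and d_zero_one are made up for the illustration.

```stata
sysuse auto, clear

* 1 or missing: what "gen ... = 1 if ..." produces
gen byte d_missing = 1 if foreign == 1

* 1 or 0: the true-or-false expression evaluates to 1 or 0 directly
gen byte d_zero_one = foreign == 1

tabulate d_missing, missing
tabulate d_zero_one, missing

* The mean of a 1/0 indicator is the share of 1s;
* the mean of the 1/missing version is always 1
summarize d_missing d_zero_one
```

One caveat with the second form: when the source variable itself has missing values (rep78 in auto.dta, say), the expression evaluates to 0 for those observations, so add a qualifier such as if !missing(rep78) if they should stay missing.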
In general, it's a loop over the possibilities using forvalues or foreach, but the shortcut is too easy not to be preferred in this case. Consider this reproducible example:
. sysuse auto, clear
(1978 Automobile Data)
. tab rep78, gen(rep78)
     Repair |
Record 1978 |      Freq.     Percent        Cum.
------------+-----------------------------------
          1 |          2        2.90        2.90
          2 |          8       11.59       14.49
          3 |         30       43.48       57.97
          4 |         18       26.09       84.06
          5 |         11       15.94      100.00
------------+-----------------------------------
      Total |         69      100.00
. d rep78?
              storage   display    value
variable name   type    format     label      variable label
------------------------------------------------------------------------------
rep781          byte    %8.0g                 rep78== 1.0000
rep782          byte    %8.0g                 rep78== 2.0000
rep783          byte    %8.0g                 rep78== 3.0000
rep784          byte    %8.0g                 rep78== 4.0000
rep785          byte    %8.0g                 rep78== 5.0000
That's all the dummies (some prefer to say "indicators") in one fell swoop through an option of tabulate.
For completeness, consider an example doing it the loop way. We imagine that years 1950-2015 are represented:
forval y = 1950/2015 {
gen byte dummy`y' = year == `y'
}
Two digit identifiers dummy50 to dummy15 would be unambiguous in this example, so here they are as a bonus:
forval y = 1950/2015 {
    * Y holds the two-digit year with a leading zero, e.g. 50 or 05
    local Y : di %02.0f mod(`y', 100)
    gen byte dummy`Y' = year == `y'
}
Here byte is dispensable unless memory is very short, but it's good practice anyway.
If anyone was determined to write a loop to create indicators for the distinct values of a string variable, that can be done too. Here are two possibilities. Absent an easily reproducible example in the original post, let's create a sandbox. The first method is to encode first, then loop over distinct numeric values. The second method is to find the distinct string values directly and then loop over them.
clear
set obs 3
gen mystring = word("frog toad newt", _n)
* Method 1
encode mystring, gen(mynumber)
su mynumber, meanonly
forval j = 1/`r(max)' {
gen dummy`j' = mynumber == `j'
label var dummy`j' "mystring == `: label (mynumber) `j''"
}
* Method 2
levelsof mystring
local j = 1
foreach level in `r(levels)' {
gen dummy2`j' = mystring == `"`level'"'
label var dummy2`j' `"mystring == `level'"'
local ++j
}
describe
Contains data
  obs:             3
 vars:             8
 size:            96
------------------------------------------------------------------------------
              storage   display    value
variable name   type    format     label      variable label
------------------------------------------------------------------------------
mystring        str4    %9s
mynumber        long    %8.0g      mynumber
dummy1          float   %9.0g                 mystring == frog
dummy2          float   %9.0g                 mystring == newt
dummy3          float   %9.0g                 mystring == toad
dummy21         float   %9.0g                 mystring == frog
dummy22         float   %9.0g                 mystring == newt
dummy23         float   %9.0g                 mystring == toad
------------------------------------------------------------------------------
Sorted by:
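Use i.<variable_name> (factor-variable notation) on the categorical variable itself, so that you do not have to create the dummies at all.
For example, in your case, you can use the following command for the regression: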
reg y i.year
I also recommend using
egen year_dum = group(year)
reg y i.year_dum
This can be generalized arbitrarily, and you can easily create, e.g., year-by-state fixed effects this way:
egen year_state_dum = group(year state)
reg y i.year_state_dum
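One wrinkle for the data in the question: factor-variable notation does not accept string variables, so a string id needs a numeric counterpart first. A minimal sketch on made-up data follows; the variable names id_num and y, the seed and the simulated outcome are all assumptions for the illustration.

```stata
clear
set seed 12345
set obs 30

* Five string ids (a-e) observed in each of six years (1950-1955)
gen id   = word("a b c d e", mod(_n - 1, 5) + 1)
gen year = 1950 + floor((_n - 1) / 5)
gen y    = rnormal()

* i.id would fail because id is a string; map it to integers 1, 2, ...
egen id_num = group(id)

regress y i.id_num i.year
```

encode id, gen(id_num) would work equally well here and has the advantage of keeping the original strings as value labels.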