I have two tables in Stata: one shows the total count of parents in each state who divorced in a given divorce year cohort; the other shows the count of divorced parents with csphycus == 2 in each state and divorce year cohort.
I want a table that displays the percentage of parents who have csphycus == 2 for each state and each divorce year cohort.
So I want to divide the counts in these two tables. How should I do it?
The weighted percentage you want is
egen double numer = total(rdasecwt * (csphycus == 2)), by(statefip yrdivbin)
egen double denom = total(rdasecwt), by(statefip yrdivbin)
gen wanted = 100 * numer/denom
You can show it by some variation on
tabdisp statefip yrdivbin, c(wanted) format(%2.1f)
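The arithmetic behind numer and denom can be cross-checked outside Stata. This is a minimal Python sketch, assuming each record carries the state, the cohort, the weight and csphycus; the function name weighted_percent and the sample rows are made up for illustration and are not part of the original data.

```python
from collections import defaultdict


def weighted_percent(rows):
    """Return {(state, cohort): 100 * sum(w * (csphycus == 2)) / sum(w)}."""
    numer = defaultdict(float)
    denom = defaultdict(float)
    for state, cohort, weight, csphycus in rows:
        key = (state, cohort)
        # (csphycus == 2) is True/False, which counts as 1/0 in arithmetic
        numer[key] += weight * (csphycus == 2)
        denom[key] += weight
    return {key: 100 * numer[key] / denom[key] for key in denom}


# Hypothetical records: (statefip, yrdivbin, rdasecwt, csphycus)
rows = [
    (1, 1990, 2.0, 2),
    (1, 1990, 1.0, 1),
    (1, 1990, 1.0, 2),
    (2, 1990, 3.0, 1),
]
result = weighted_percent(rows)
```

In the first group the weights with csphycus == 2 sum to 3 out of a total of 4, which is the same ratio the two egen total() calls produce per group.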
Related
DATA proj4.gasQTR;
SET proj4.gasQTR;
INPUT Q1 Q2 Q3 Q4;
IF MONTH = 1 or 2 or 3 THEN Q1 = 1;
ELSE IF MONTH = 4 or 5 or 6 THEN Q2 = 2;
ELSE IF MONTH = 7 or 8 or 9 THEN Q3 = 3;
ELSE IF MONTH = 10 or 11 or 12 THEN Q4 = 4;
quarter = MONTH; FORMAT Quarter qtrw.;
RUN;
I am trying to get a 1-4 value for each qtr of each year, my error comes from Quarter qtrw. 'ERROR 388-185 Expecting an arithmetic operator'
*Data is already in 1-4 format for the month variable
What am I doing wrong?
Any help would be appreciated!
Thank you!
You normally do not use both a SET statement to retrieve data from an existing dataset and an INPUT statement to read values from a text file in the same data step. And if you do want to INPUT values from a text file, you must tell SAS where to find the text, either by including an INFILE statement or by adding the text in-line with the code using a DATALINES (or CARDS) statement.
SAS treats any number that is not zero or missing as TRUE. So in the condition MONTH = 1 or 2 or 3 the terms 2 and 3 are always TRUE, which makes the whole condition always TRUE. As a result Q1 will always be set to 1, and Q2, Q3 and Q4 will always be missing (or unchanged, if they already existed). To test whether a variable has any of a number of values, use the IN operator instead of the equality operator: month in (1 2 3)
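The same pitfall can be reproduced in Python, where a bare nonzero number is also truthy. The sketch below is only an analogy to the SAS condition, not SAS code; the function names wrong_q1 and right_q1 are made up for illustration.

```python
def wrong_q1(month):
    # Parsed as (month == 1) or 2 or 3; the bare 2 is truthy,
    # so the result is True for every month.
    return bool(month == 1 or 2 or 3)


def right_q1(month):
    # Membership test, the counterpart of the IN operator.
    return month in (1, 2, 3)


wrong = [m for m in range(1, 13) if wrong_q1(m)]
right = [m for m in range(1, 13) if right_q1(m)]
```

Here wrong contains all twelve months, while right contains only 1, 2 and 3.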
You also should not be reading and writing the same dataset. If there are logic issues in your code, you might destroy the original dataset. So hopefully you have a backup copy of proj4.gasQTR, or a program that can recreate it.
What is the format QTRW? Is that something you created? If so, show its definition.
Assuming you have a variable named MONTH with integer values in the range 1 to 12 you can calculate QUARTER with integer values in the range 1 to 4 with a simple arithmetic function instead of coding a series of IF conditions.
data want;
set have;
quarter = ceil(month/3) ;
run;
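The mapping that ceil(month/3) produces can be checked for all twelve months. A small Python sketch of the same formula:

```python
import math

# Quarter for each month number 1..12, using the same ceil(month / 3) rule.
quarters = {month: math.ceil(month / 3) for month in range(1, 13)}
```

Months 1 to 3 map to quarter 1, months 4 to 6 to quarter 2, and so on up to quarter 4.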
If you actually have a DATE variable then perhaps all you were supposed to do was use the MONTH or QTR format to display the dates as the month number or quarter number that they fall into.
Try this program to see the impact of applying different formats to the same values.
data test;
do month=1 to 12;
date1=mdy(month,1,2022);
date2=date1;
date3=date1;
output;
end;
format date1 date9. date2 month. date3 qtr.;
run;
proc print;
run;
Use the in operator or repeat the equality for every case.
Example from the doc:
You can use the IN operator with character strings to determine whether a variable's value is among a list of character values. The following statements produce the same results:
if state in ('NY','NJ','PA') then region+1;
if state='NY' or state='NJ' or state='PA' then region+1;
Therefore
DATA proj4.gasQTR;
SET proj4.gasQTR;
IF MONTH = 1 or MONTH = 2 or MONTH = 3 THEN Q1 = 1;
ELSE IF MONTH = 4 or MONTH = 5 or MONTH = 6 THEN Q2 = 2;
ELSE IF MONTH = 7 or MONTH = 8 or MONTH = 9 THEN Q3 = 3;
ELSE IF MONTH = 10 or MONTH = 11 or MONTH = 12 THEN Q4 = 4;
quarter = MONTH; FORMAT Quarter qtrw.;
RUN;
is equivalent to
DATA proj4.gasQTR;
SET proj4.gasQTR;
IF MONTH in (1,2,3) THEN Q1 = 1;
ELSE IF MONTH in (4,5,6) THEN Q2 = 2;
ELSE IF MONTH in (7,8,9) THEN Q3 = 3;
ELSE IF MONTH in (10,11,12) THEN Q4 = 4;
quarter = MONTH; FORMAT Quarter qtrw.;
RUN;
How to get number of people in the group instead of person-years when using strate in Stata?
Using cohort data, I have created a survival dataset in Stata like so:
stset end, id(person) failure(event==1) scale(365.25) enter(time start) origin(time dob)
stsplit ageband, at(0 (1) 5) after(time=dob)
stsplit year, after(time=mdy(1,1,1960)) at(40 (1) 45)
replace year = 1960 + year
strate ageband year sex, per(100000) output("rates.dta", replace)
where each person is born on dob and enters the study at start date and leaves at end date. If a person has an event (event == 1) during this period, then they leave at the event date.
stset creates the survival data.
stsplit splits the dataset into agebands (0-5 years old) and calendar year (2000-2005).
strate calculates the rates by each distinct value of ageband year sex and stores the summary data in "rates.dta". These summary results show, for each combination of ageband year sex: _D for number of events and _Y for person-years, which will be the numerator and denominator, respectively, when calculating rates.
I want to calculate the proportion of events _D out of the total number of people in each group.
Is there a way for _Y to be the total number of people within that group, e.g. ageband = 0, year = 2000, sex = 1 ?
How else can I get the number of people per group?
My solution:
* ADD before stset to get total number of people in the dataset
egen tag = tag(person)
egen N_total = total(tag)
drop tag
stset end, id(person) failure(event==1) scale(365.25) enter(time start) origin(time dob)
stsplit ageband, at(0 (1) 5) after(time=dob)
stsplit year, after(time=mdy(1,1,1960)) at(40 (1) 45)
replace year = 1960 + year
* ADD for each variable / group that you want to find the number of people in
egen tag = tag(person ageband)
egen N_ageband = total(tag), by(ageband)
drop tag
egen tag = tag(person year)
egen N_year = total(tag), by(year)
drop tag
egen tag = tag(person sex)
egen N_sex = total(tag), by(sex)
drop tag
* KEEP variables of interest
keep person event ageband year sex N_total N_ageband N_year N_sex
collapse (mean) N_total N_ageband N_year N_sex, by(ageband sex year person)
* SAVE
save "proportions.dta", replace
strate ageband year sex, per(100000) output("rates.dta", replace)
Then later, for each group (total, ageband, year, sex) merge your rates.dta file with the proportions.dta file:
foreach var in total ageband year sex {
use "rates.dta", clear
merge m:1 person sex ageband year using "proportions.dta", keep(match) nogen
if "`var'" == "total" collapse (sum) _D N_`var'
else collapse (sum) _D N_`var', by(`var')
* do any other processing with the results
save "proportions_by_`var'.dta", replace
}
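The counting idea behind egen tag() followed by egen total() is "count each person once per group". A minimal Python sketch of that idea, assuming the split episodes are available as records with a person identifier; the function name people_per_group and the sample episodes are hypothetical.

```python
from collections import defaultdict


def people_per_group(records, key):
    """Count distinct persons for each value of records[...][key]."""
    persons = defaultdict(set)
    for rec in records:
        # A set keeps each person once, however many episodes they have.
        persons[rec[key]].add(rec["person"])
    return {group: len(ids) for group, ids in persons.items()}


# Hypothetical split episodes: person 1 contributes two episodes to ageband 0.
episodes = [
    {"person": 1, "ageband": 0, "year": 2000},
    {"person": 1, "ageband": 0, "year": 2001},
    {"person": 1, "ageband": 1, "year": 2001},
    {"person": 2, "ageband": 0, "year": 2000},
]
n_ageband = people_per_group(episodes, "ageband")
n_year = people_per_group(episodes, "year")
```

Person 1 appears twice in ageband 0 but is counted once there, which is the distinction between people and person-time.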
How can I generate panel data in Stata?
I would like each individual to be affected by unobserved heterogeneity.
For example, I want the DGP (data generating process) to be something like:
Wages_{it}= \beta (Labor market experience_{it}) + \alpha_{i} + \epsilon_{it},
where \alpha_{i} is the unobserved heterogeneity and where \epsilon_{it} is the error term which is normally distributed.
Finally, I would like that (Labor market experience_{it}) is an AR(1) process, e.g.:
Labor market experience_{it}= 0.8 * (Labor market experience_{i,t-1}) + v_{it},
where v_{it} is the error term which is normally distributed.
You can do something like this by using subscripting combined with bysort:
clear
set seed 10011979
set obs 4 // Set the number of panels (N)
gen id = _n
gen alpha = rnormal(0,1)
expand 3 // Set the number of periods (T)
bys id: gen t=_n
xtset id t
bysort id (t): gen lme = rnormal(0,1) + rnormal(0,1) if _n==1
bysort id (t): replace lme = .8 * lme[_n-1] + rnormal(0,1) if _n!=1
gen w = 3 * lme + alpha + rnormal(0,1)
drop alpha
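The same data generating process can be sketched in Python for comparison. This is an illustration under the assumptions of the Stata code above (beta = 3, AR coefficient 0.8, standard normal errors); the function name simulate_panel is made up, and the seed will not reproduce Stata's random numbers.

```python
import random


def simulate_panel(n_panels=4, n_periods=3, beta=3.0, rho=0.8, seed=10011979):
    """Simulate w_it = beta * lme_it + alpha_i + eps_it with AR(1) lme."""
    rng = random.Random(seed)
    rows = []
    for i in range(1, n_panels + 1):
        alpha = rng.gauss(0, 1)  # unobserved heterogeneity, constant within panel
        lme = None
        for t in range(1, n_periods + 1):
            shock = rng.gauss(0, 1)
            if t == 1:
                # Initial value: sum of two normal draws, as in the Stata code
                lme = rng.gauss(0, 1) + shock
            else:
                lme = rho * lme + shock  # AR(1) step
            wage = beta * lme + alpha + rng.gauss(0, 1)
            rows.append({"id": i, "t": t, "lme": lme, "w": wage})
    return rows


panel = simulate_panel()
```

As in the Stata version, alpha is used to build the wage but is not kept in the output, so it stays unobserved.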
In my dataset, I have observations for football matches. One of my variables is hometeam. Now I want to get the average number of observations per hometeam. How do I do that in Stata?
I know that I could tab hometeam, but since there are over 500 distinct hometeams, I don't want to do the calculation manually.
bysort hometeam : gen n = _N
bysort hometeam : gen tag = _n == 1
su n if tag
EDIT Another way to do it more concisely
bysort hometeam : gen n = _N if _n == 1
su n
Why the tagging then? It is often useful to have a tag variable when you are moving back and forth between individual and group level. egen, tag() does the same thing.
Why if _n == 1? You need to have this value just once for each group, and there are two ways of doing it that always work for groups that could be as small as one observation, to do it for the first or the last observation in a group. In a group of 1, they are the same, but that doesn't matter. So if _n == _N is another way to do it.
bysort hometeam : gen n = _N if _n == _N
The code needs to change if you do not want to count missing values of some variable:
bysort hometeam : gen n = sum(!missing(myvar))
by hometeam : replace n = . if _n < _N
egen, count() is similar, but not identical.
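The quantity being summarized is simply the mean of the per-team counts. A short Python sketch of that calculation, with hypothetical team names standing in for the hometeam variable:

```python
from collections import Counter
from statistics import mean

# Hypothetical hometeam values, one per match
hometeams = ["Ajax", "Ajax", "Ajax", "Porto", "Porto", "Celtic"]

counts = Counter(hometeams)  # observations per team
average_per_team = mean(list(counts.values()))
```

Each team contributes one count to the mean, which is what tagging one observation per group achieves in Stata.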
I assume you can identify the different hometeams with some id variable.
If you want the average number of observations per id this is one way:
clear all
set more off
input id hometeam
1 .
1 5
1 0
3 6
3 2
3 1
3 9
2 7
2 7
end
list, sepby(id)
bysort id: egen c = count(hometeam)
by id: keep if _n == 1
summarize c, meanonly
disp r(mean)
Note that observations with missings are not counted by count. If you did want to count the missings, then you could do:
bysort id: gen c = _n
by id: keep if _n == _N
summarize c, meanonly
disp r(mean)
Option 2: Using the data of @Roberto
collapse (count) hometeam, by(id)
sum hometeam,meanonly
How do I use the gen or egen commands to generate the percent change between observations for different years in Stata? For example, I have observations for 1990 through 2010, each with a different value for expenditures, and I'm trying to generate a new variable with the percent change from 1990-1991, 1991-1992, etc.
// Here's an example with another measure of growth:
clear
set obs 100
gen year = _n + 1959
gen expenditure = _n^(1/3) + runiform()
line expenditure year, yti("Synthetic data example")
// From Statalist:
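// Sort by year first; with "bys year:" every group holds one observation, so [_n-1] would always be missing
sort year
g expendituregrowth=100*(expenditure[_n]-expenditure[_n-1])/expenditure[_n-1]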
// Also:
gen expenditure_gr = (expenditure/expenditure[_n-1] - 1)*100 // growth rate for expenditure
gen expenditure_bl = 100*expenditure/expenditure[1] // baseline growth rate for expenditure; base 100 = 1960
line expenditure_gr year, yti("Growth rate")
line expenditure_bl year, yti("Growth rate (base 100 = 1960)")
// The computation of expenditure_gr is what I think you are looking for.
// If your data are well-formed, use Stata with time series and get the growth rate easily:
tsset year, delta(1)
cap drop expenditure_gr
gen expenditure_gr = 100 * D.expenditure / L.expenditure
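The year-over-year percent change is 100 * (current - previous) / previous, with no value for the first year. A minimal Python sketch of that formula, using made-up expenditure figures; the function name percent_change is hypothetical.

```python
def percent_change(values):
    """Percent change from each value to the next; None for the first entry."""
    out = [None]
    for prev, curr in zip(values, values[1:]):
        out.append(100 * (curr - prev) / prev)
    return out


# Hypothetical expenditures for three consecutive years
expenditure = [100.0, 110.0, 99.0]
growth = percent_change(expenditure)
```

The first entry is None for the same reason the first observation is missing in Stata: there is no previous year to compare against.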