Generate percent change between annual observations in Stata?

How do I use the gen or egen commands to generate the percent change between observations for different years in Stata? For example, I have observations for 1990 through 2010, each with a different value for expenditures, and I'm trying to generate a new variable holding the percent change from 1990 to 1991, 1991 to 1992, etc.

// Here's an example with another measure of growth:
clear
set obs 100
gen year = _n + 1959
gen expenditure = _n^(1/3) + runiform()
line expenditure year, yti("Synthetic data example")
// From Statalist (sort by year first; with one observation per year, "bys year:" would make every value missing):
sort year
g expendituregrowth=100*(expenditure[_n]-expenditure[_n-1])/expenditure[_n-1]
// Also:
gen expenditure_gr = (expenditure/expenditure[_n-1] - 1)*100 // growth rate for expenditure
gen expenditure_bl = 100*expenditure/expenditure[1] // baseline growth rate for expenditure; base 100 = 1960
line expenditure_gr year, yti("Growth rate")
line expenditure_bl year, yti("Growth rate (base 100 = 1960)")
// The computation of expenditure_gr is what I think you are looking for.
// If your data are well-formed, declare them as a time series and use the lag (L.) and difference (D.) operators to get the growth rate easily:
tsset year, delta(1)
cap drop expenditure_gr
gen expenditure_gr = 100 * D.expenditure / L.expenditure
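
For reference, the year-over-year percent change is 100 * (x[t] - x[t-1]) / x[t-1], with no value for the first year. Here is a minimal Python sketch of that arithmetic as a language-neutral check. The expenditure figures are made up, and pct_change is a hypothetical helper name, not a Stata or library function.

```python
def pct_change(values):
    """Return the percent change from each value to the next.

    The first entry is None because there is no earlier observation,
    which mirrors the missing value Stata produces for the first year.
    """
    changes = [None]
    for prev, curr in zip(values, values[1:]):
        changes.append(100 * (curr - prev) / prev)
    return changes


# Made-up expenditure figures for four consecutive years (illustration only).
expenditure = [200.0, 220.0, 209.0, 209.0]
growth = pct_change(expenditure)
print(growth)
```

The same formula underlies both the subscripting version (expenditure[_n-1]) and the time-series operator version, so all of them should agree on sorted data with no gaps between years.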

Related

How to get population size instead of person-years when using stsplit and strate in Stata?

How to get number of people in the group instead of person-years when using strate in Stata?
Using cohort data, I have created a survival dataset in Stata like so:
stset end, id(person) failure(event==1) scale(365.25) enter(time start) origin(time dob)
stsplit ageband, at(0 (1) 5) after(time=dob)
stsplit year, after(time=mdy(1,1,1960)) at(40 (1) 45)
replace year = 1960 + year
strate ageband year sex, per(100000) output("rates.dta", replace)
where each person is born on dob and enters the study at start date and leaves at end date. If a person has an event (event == 1) during this period, then they leave at the event date.
stset creates the survival data.
stsplit splits the dataset into agebands (0-5 years old) and calendar year (2000-2005).
strate calculates the rates by each distinct value of ageband year sex and stores the summary data in "rates.dta". These summary results show, for each combination of ageband year sex: _D for number of events and _Y for person-years, which will be the numerator and denominator, respectively, when calculating rates.
I want to calculate the proportion of events _D out of the total number of people in each group.
Is there a way for _Y to be the total number of people within that group, e.g. ageband = 0, year = 2000, sex = 1 ?
How else can I get the number of people per group?
My solution:
* ADD before stset to get total number of people in the dataset
egen tag = tag(person)
egen N_total = total(tag)
drop tag
stset end, id(person) failure(event==1) scale(365.25) enter(time start) origin(time dob)
stsplit ageband, at(0 (1) 5) after(time=dob)
stsplit year, after(time=mdy(1,1,1960)) at(40 (1) 45)
replace year = 1960 + year
* ADD for each variable / group that you want to find the number of people in
egen tag = tag(person ageband)
egen N_ageband = total(tag), by(ageband)
drop tag
egen tag = tag(person year)
egen N_year = total(tag), by(year)
drop tag
egen tag = tag(person sex)
egen N_sex = total(tag), by(sex)
drop tag
* RUN strate while the data are still stset (keep and collapse below drop the st variables)
strate ageband year sex, per(100000) output("rates.dta", replace)
* KEEP variables of interest
keep person event ageband year sex N_total N_ageband N_year N_sex
collapse (mean) N_total N_ageband N_year N_sex, by(ageband sex year person)
* SAVE
save "proportions.dta", replace
Then later, for each group (total, ageband, year, sex) merge your rates.dta file with the proportions.dta file:
foreach var in total ageband year sex {
use "rates.dta", clear
merge m:1 person sex ageband year using "proportions.dta", keep(match) nogen
if "`var'" == "total" collapse (sum) _D N_`var'
else collapse (sum) _D N_`var', by(`var')
* do any other processing with the results
save "proportions_by_`var'.dta", replace
}
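
The tag/total pattern above amounts to counting distinct persons within each group: after stsplit a person contributes several rows, and egen's tag() marks exactly one row per person-group pair. Here is a minimal Python sketch of that counting logic on made-up records. The helper name people_per_group and the data are assumptions for illustration only.

```python
from collections import defaultdict


def people_per_group(records, key):
    """Count distinct persons within each value of `key`."""
    persons = defaultdict(set)
    for rec in records:
        # A set keeps one entry per person, however many split rows they have.
        persons[rec[key]].add(rec["person"])
    return {group: len(ids) for group, ids in persons.items()}


# Made-up split records: one row per person, ageband and calendar year.
records = [
    {"person": 1, "ageband": 0, "year": 2000},
    {"person": 1, "ageband": 0, "year": 2001},
    {"person": 1, "ageband": 1, "year": 2001},
    {"person": 2, "ageband": 0, "year": 2000},
    {"person": 2, "ageband": 1, "year": 2001},
    {"person": 3, "ageband": 1, "year": 2001},
]

by_ageband = people_per_group(records, "ageband")
by_year = people_per_group(records, "year")
print(by_ageband, by_year)
```

Note that a person who appears in more than one ageband or year is counted once in each of those groups, so the group counts do not sum to the total number of people.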

Multiple operations on a single value in SAS?

I'm trying to create a column that applies two different interest rates based on each customer's cumulative purchases. I was thinking that I'd need a do while statement, but I'm not entirely sure.
This is what I have so far, but I don't know how to get it to perform two operations on one value. That is, it should apply one interest rate up to, say, 4000, and then apply the other interest rate to the amount above 4000.
data cards;
set sortedccards;
by Cust_ID;
if first.Cust_ID then cp=0;
cp+Purchase;
if cp<=4000 then cb=(cp*.2);
if cp>4000 then cb=(cp*.2)+(cp*.1);
format cp dollar10.2 cb dollar10.2;
run;
What I'd like my output to look like.
You will also want to track the prior cumulative purchase so that you can detect when a purchase pushes the cumulative total across the $4,000 threshold (or breakpoint). A breakpoint-crossing purchase is split into pre- and post-breakpoint portions, each earning a different bonus rate.
Example:
Because pcp is retained and only updated at the end of each iteration, it holds the previous row's cumulative total, so it acts like a LAGged variable.
data have;
input id $ p;
datalines;
C001 1000
C001 2300
C001 2000
C001 1500
C001 800
C002 6200
C002 800
C002 300
C003 2200
C003 1700
C003 2500
C003 600
;
data want;
set have;
by id;
if first.id then do;
cp = 0;
pcp = 0; retain pcp; /* prior cumulative purchase */
end;
cp + p; /* sum statement causes cp to be implicitly retained */
* break point is 4,000;
if (cp > 4000 and pcp > 4000) then do;
* entire purchase in post breakpoint territory;
b = 0.01 * p;
end;
else
if (cp > 4000) then do;
* split purchase into pre and post breakpoint portions;
b = 0.10 * (4000 - pcp) + 0.01 * (p - (4000 - pcp));
end;
else do;
* entire purchase in pre breakpoint territory;
b = 0.10 * p;
end;
* update prior for next implicit iteration;
pcp = cp;
run;
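
The three branches above (entirely below, crossing, entirely above the breakpoint) can be collapsed into one expression by clamping the prior and current cumulative totals at the breakpoint. Here is a minimal Python sketch of that idea, using the same 10% and 1% rates and the C001 purchases from the example data. The function name tiered_bonus and its parameters are hypothetical, and this is a cross-check of the arithmetic rather than a SAS replacement.

```python
def tiered_bonus(purchases, breakpoint=4000.0, rate_below=0.10, rate_above=0.01):
    """Return the bonus earned on each purchase for a single customer."""
    bonuses = []
    prior = 0.0  # cumulative purchases before the current one (like pcp)
    for p in purchases:
        current = prior + p  # cumulative purchases including this one (like cp)
        # Portion of this purchase that falls at or below the breakpoint.
        below = max(0.0, min(current, breakpoint) - min(prior, breakpoint))
        above = p - below
        bonuses.append(rate_below * below + rate_above * above)
        prior = current
    return bonuses


# Purchases for customer C001 from the example data.
c001 = [1000, 2300, 2000, 1500, 800]
c001_bonus = tiered_bonus(c001)
print([round(b, 2) for b in c001_bonus])
```

The third purchase crosses the breakpoint (cumulative total goes from 3300 to 5300), so 700 of it earns the higher rate and the remaining 1300 earns the lower rate.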
Here is a fairly straightforward solution which is not optimized but works. We calculate the cumulative purchases and cumulative bonus at each step (which can be done quite simply), and then calculate the current period bonus as cumulative bonus minus previous cumulative bonus.
This is assuming that the percentage is 20% up to $4000 and 30% over $4000.
data have;
input id $ period MMDDYY10. purchase;
datalines;
C001 01/25/2019 1000
C001 02/25/2019 2300
C001 03/25/2019 2000
C001 04/25/2019 1500
C001 05/25/2019 800
C002 03/25/2019 6200
C002 04/25/2019 800
C002 05/25/2019 300
C003 02/25/2019 2200
C003 03/25/2019 1700
C003 04/25/2019 2500
C003 05/25/2019 600
;
run;
data want (drop=cumul_bonus);
set have;
by id;
retain cumul_purchase cumul_bonus;
if first.id then call missing(cumul_purchase,cumul_bonus);
** Calculate total cumulative purchase including current purchase **;
cumul_purchase + purchase;
** Calculate total cumulative bonus including current purchase **;
cumul_bonus = (0.2 * cumul_purchase) + ifn(cumul_purchase > 4000, 0.1 * (cumul_purchase - 4000), 0);
** Bonus for current purchase = total cumulative bonus - previous cumulative bonus **;
bonus = ifn(first.id,cumul_bonus,dif(cumul_bonus));
format period MMDDYY10.
purchase cumul_purchase bonus DOLLAR10.2
;
run;
proc print data=want;
run;

Can you divide the entries of two tables?

I have 2 tables in Stata: one shows the count of total parents in each state that had a divorce in the specific divorce year cohort, the other shows the count of divorced parents with csphycus == 2 in each state and divorce year cohort.
I want a table that displays the percentage of parents who have csphycus == 2 for each state and each divorce year cohort.
So I want to divide the counts in these two tables. How should I do it?
The percentage you want is
egen double numer = total(rdasecwt * (csphycus == 2)), by(statefip yrdivbin)
egen double denom = total(rdasecwt), by(statefip yrdivbin)
gen wanted = 100 * numer/denom
You can show it by some variation on
tabdisp statefip yrdivbin, c(wanted) format(%2.1f)
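
The two egen calls compute, within each state and cohort, the total weight of parents with csphycus == 2 and the total weight of all parents. Their ratio times 100 is the percentage. Here is a minimal Python sketch of the same calculation on made-up rows. The function name weighted_percent and the data are assumptions for illustration only.

```python
from collections import defaultdict


def weighted_percent(rows):
    """Weighted percentage with csphycus == 2 for each (state, cohort) cell."""
    numer = defaultdict(float)
    denom = defaultdict(float)
    for state, cohort, weight, csphycus in rows:
        key = (state, cohort)
        denom[key] += weight  # total weight in the cell
        if csphycus == 2:
            numer[key] += weight  # weight of parents with csphycus == 2
    return {key: 100 * numer[key] / denom[key] for key in denom}


# Made-up rows: (state, divorce-year cohort, weight, csphycus).
rows = [
    ("A", 1990, 2.0, 2),
    ("A", 1990, 6.0, 1),
    ("A", 2000, 5.0, 2),
    ("B", 1990, 1.0, 1),
    ("B", 1990, 3.0, 2),
]

wanted = weighted_percent(rows)
print(wanted)
```

Computing the numerator and denominator from the underlying observations avoids having to divide one displayed table by another.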

Using do loops in sas

Assume you have a data file called VIRUS_PROLIF from an infectious disease research center. Each observation has three variables: COUNTRY, START_DATE, and DOUBLE_RATE, where START_DATE is the date on which the country registered its 100th case of COVID-19. For each country, DOUBLE_RATE is the number of days it takes for the number of cases to double in that country. Write the SAS code using DO UNTIL to calculate the date on which that country would be predicted to register 200,000 cases of COVID-19.
data VIRUS_PROLIF;
INPUT COUNTRY $ start_date mmddyy10. num_of_cases double_rate ;
*here doubling rate is 100% so if day 1 had 100 cases day 2 will have 200;
Datalines;
US 03/13/2020 100 100
;
run;
data VIRUS_PROLIF1 (drop=start_date);
set VIRUS_PROLIF;
do until (num_of_cases>200000);
double_rate+1;
num_of_cases+ (num_of_cases*1);
end;
run;
proc print data=VIRUS_PROLIF1;
run;
The key concept you're missing here is how to employ the growth rate. That would be using the following formula, similar to interest growth for money.
If you have one dollar today and you get 100% interest it becomes
StartingAmount * (1 + interestRate) where the interest rate here is 100/100 = 1.
*fake data;
data VIRUS_PROLIF;
INPUT COUNTRY $ start_date mmddyy10. num_of_cases double_rate;
*here doubling rate is 100% so if day 1 had 100 cases day 2 will have 200;
Datalines;
US 03/13/2020 100 100
AB 03/17/2020 100 20
;
run;
data VIRUS_PROLIF1;
set VIRUS_PROLIF;
*assign date to starting date so both are in output;
date=start_date;
*save record to data set;
output;
do until (num_of_cases>200000);
*increment your day;
date=date+1;
*doubling rate is represented as a percent so add it to 1 to show the rate;
num_of_cases=num_of_cases*(1+double_rate/100);
*save record to data set;
output;
end;
*control date display;
format date start_date date9.;
run;
*check results;
proc print data=VIRUS_PROLIF1;
run;
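The inequality 200,000 <= N0 * (1 + R/100)^k, where N0 is NUM_OF_CASES and R is DOUBLE_RATE, can be solved for the smallest integer k without iteration:
days_to_200K = ceil (
LOG ( 200000 / NUM_OF_CASES )
/ LOG ( 1 + DOUBLE_RATE / 100 )
);
date_of_200K = START_DATE + days_to_200K;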
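
The closed-form day count can be checked against a day-by-day loop. Here is a minimal Python sketch of both, using the two example rows (100 starting cases with daily growth of 100% and 20%). The function names are hypothetical, and the loop stops as soon as the count reaches 200,000, which matches the ceiling formula.

```python
import math
from datetime import date, timedelta

TARGET = 200_000


def days_closed_form(start_cases, daily_growth_pct):
    """Smallest integer k with start_cases * (1 + r/100)**k >= TARGET."""
    return math.ceil(
        math.log(TARGET / start_cases) / math.log(1 + daily_growth_pct / 100)
    )


def days_by_loop(start_cases, daily_growth_pct):
    """Same quantity found by compounding one day at a time."""
    cases = float(start_cases)
    days = 0
    while cases < TARGET:
        cases *= 1 + daily_growth_pct / 100
        days += 1
    return days


# The two example rows: (country, start date, starting cases, daily growth %).
examples = [
    ("US", date(2020, 3, 13), 100, 100),
    ("AB", date(2020, 3, 17), 100, 20),
]

predicted = {}
for country, start, cases, rate in examples:
    k = days_closed_form(cases, rate)
    predicted[country] = (k, start + timedelta(days=k))
    print(country, k, predicted[country][1])
```

Adding the day count to the start date gives the predicted date directly, so no per-day output rows are needed when only the final date matters.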

Generating panel data in Stata

How can I generate panel data in Stata?
I would like each individual to be affected by unobserved heterogeneity.
For example, I want the DGP (data generating process) to be something like:
Wages_{it}= \beta (Labor market experience_{it}) + \alpha_{i} + \epsilon_{it},
where \alpha_{i} is the unobserved heterogeneity and where \epsilon_{it} is the error term which is normally distributed.
Finally, I would like (Labor market experience_{it}) to follow an AR(1) process, e.g.:
Labor market experience_{it}= 0.8 * (Labor market experience_{i,t-1}) + v_{it},
where v_{it} is the error term which is normally distributed.
You can do something like this by using subscripting combined with bysort:
clear
set seed 10011979
set obs 4 // Set the number of panels (N)
gen id = _n
gen alpha = rnormal(0,1)
expand 3 // Set the number of periods (T)
bys id: gen t=_n
xtset id t
bysort id (t): gen lme = rnormal(0,1) + rnormal(0,1) if _n==1
bysort id (t): replace lme = .8 * lme[_n-1] + rnormal(0,1) if _n!=1
gen w = 3 * lme + alpha + rnormal(0,1)
drop alpha
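
For comparison, the same data generating process can be written out step by step. Here is a minimal Python sketch using only the standard library. The function name simulate_panel and its parameters are hypothetical, the initial value of lme is a single standard normal draw (a simplification of the Stata code above), and the random numbers will not match Stata's even with the same seed.

```python
import random


def simulate_panel(n_ids=4, n_periods=3, beta=3.0, rho=0.8, seed=10011979):
    """Simulate w_it = beta * lme_it + alpha_i + eps_it with AR(1) lme_it."""
    rng = random.Random(seed)
    rows = []
    for i in range(1, n_ids + 1):
        alpha = rng.gauss(0, 1)  # unobserved heterogeneity, fixed within id
        lme = None
        for t in range(1, n_periods + 1):
            shock = rng.gauss(0, 1)  # v_it
            # First period starts from the shock; later periods follow the AR(1).
            lme = shock if lme is None else rho * lme + shock
            wage = beta * lme + alpha + rng.gauss(0, 1)  # eps_it is the last draw
            rows.append({"id": i, "t": t, "lme": lme, "w": wage})
    return rows


panel = simulate_panel()
for row in panel[:3]:
    print(row)
```

As in the Stata version, alpha is used to build the wage but is not kept in the output, so it stays unobserved to whoever analyzes the simulated panel.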