How to write the best code for data aggregation?

How to write the best code for data aggregation? - stata

I have the following dataset (individual level data):
pid year state income
1 2000 il 100
2 2000 ms 200
3 2000 al 30
4 2000 dc 400
5 2000 ri 205
1 2001 il 120
2 2001 ms 230
3 2001 al 50
4 2001 dc 400
5 2001 ri 235
.........etc.......
I need to estimate average income for each state in each year and create a new dataset that would look like this:
state year average_income
ar 2000 150
ar 2001 200
ar 2002 250
il 2000 150
il 2001 160
il 2002 160
...........etc...............
I already have a code that runs perfectly fine (I have two loops). However, I would like to know is there any better way in Stata like sql style query?

This is shorter code than any suggested so far:
collapse average_income=income, by(state year)

This shouldn't need 2 loops, or any for that matter. There are in fact more efficient ways to do this. When you are repeating an operation on many groups, the bysort command is useful:
bysort year state: egen average_income = mean(income)
You also don't have to create a new dataset, you can just prune this one and save it. Start by only keeping the variables you want (state, year and average_income) and get rid of duplicates:
keep state year average_income
duplicates drop
save "mynewdataset.dta"

You have the SQL tag on the question. This is a basic aggregation query in SQL:
select state, year, avg(income) as average_income
from t
group by state, year;
To put this in a table, depends on your database. One of the following typically works:
create table NewTable as
select state, year, avg(income) as average_income
from t
group by state, year;
Or:
select state, year, avg(income) as average_income
into NewTable
from t
group by state, year;

Related

Inactivity duration variable in panel data (Stata)

I have a dataset for U.S. manufacturing workers in the past 30 decades, and I am particularly interested in the following variables:
Month and year of 1st manufacturing job, recorded separately and named "start_month_job_1" & "start_yr_job_1."
Month and year of leaving the 1st manufacturing job, recorded separately and named "end_month_job_1" & "end_yr_job_1."
The reason for leaving the job (e.g. retirement, firing, factory shutdown, etc.), named "leaving_reason"
Month and year of 2nd manufacturing job, recorded separately and named "start_month_job_2" & "start_yr_job_2."
Month and year of leaving the 2nd manufacturing job, recorded separately and named "end_month_job_2" & "end_yr_job_2."
I am trying to create a variable that measures the duration of economic inactivity/idleness. I am defining "duration of economic inactivity" this as the time difference between leaving a 1st job and starting another job. I have created a variable that accomplishes that with years as in below:
gen econ_inactivity_duration_1 = start_yr_job_2 - end_yr_job_1
replace econ_inactivity_1 = 2018 - end_yr_job_1 if missing(start_yr_job_2 ) /// In cases where a worker never starts a second job until 2018, which is the latest year measured in the survey.
However, I want to actually create an economic_inactivity_duration variable that takes into account the difference in month and year, for both starting and leaving a job, respectively. For instance, the duration for the worker in row 1 would be 2 months, between May, 1993 and July, 1993, as opposed to zero, which is what my code above computes.
dataex start_month_job_1 byte start_yr_job_1 byte end_month_job_1 byte end_yr_job_1 byte start_month_job_2 byte start_yr_job_2 byte end_month_job_2 byte end_yr_job_2 byte leaving_reason
3 1990 5 1993 7 1993 4 1994 "Firm shutdown"
1 2003 7 2015 . . . . "job automation"
98 1979 98 2004 . . . . "Firm shutdown"
98 1975 98 2010 98 2010 98 2015 "job automation"
1 1983 12 1985 1 1986 . . "Firm shutdown"
98 1996 98 1998 . . . . "Firm shutdown"

There is probably a better way, but here is a crude method.
* Data example
input end_month_job_1 end_yr_job_1 start_month_job_2 start_yr_job_2
5 1993 7 1993
end
* Calculate months since 1960
gen j1_end = (end_yr_job_1 - 1960) * 12 + end_month_job_1
gen j2_start = (start_yr_job_2 - 1960) * 12 + start_month_job_2
* Calculate difference
gen wanted = j2_start - j1_end
* Check difference is positive
assert wanted > 0
list
+------------------------------------------------------------------------+
| end_mo~1 end_yr~1 s~mont~2 s~yr_j~2 j1_end j2_start wanted |
|------------------------------------------------------------------------|
1. | 5 1993 7 1993 401 403 2 |
+------------------------------------------------------------------------+

Rolling Window Model for Unbalanced Dataset in SAS

I have an unbalanced panel dataset of the following form (simplified):
data have;
input ID YEAR EARN LAG_EARN;
datalines;
1 1960 450 .
1 1961 310 450
1 1962 529 310
2 1978 10 .
2 1979 15 10
2 1980 8 15
2 1981 10 8
2 1982 15 10
2 1983 8 15
2 1984 10 8
3 1972 1000 .
3 1973 1599 1000
3 1974 1599 1599
;
run;
I now want to estimate the following model for each ID:
proc reg;
by ID;
EARN = LAG_EARN;
run;
However, I want to do this for rolling windows of some size. Say for example for windows of size 2. The window should only contain non-empty observations. For example, in the case of firm A, the window is applicable from 1961 onwards and thus only one time (since only one year follows after 1961 and the window is supposed to be of size 2).
Finally, I want to get a table with year columns and firm rows. The table should indicate the following: The regression model (with window size 2) has been performed one time for firm A. The quantity of available years, has only allowed one estimation of this model. Put differently, in 1962 the coefficient of the regression model has a value of X based on the 2 year prior window. Applying the same logic to the other two firms, one can get the following table. "X" representing the respective estimated coefficient value in certain year for firm A/B/C based on the 2-year window and "n" indicating the non-existence of such a value:
data want;
input ID 1962 1974 1980 1981 1982 1983 1984;
datalines;
1 X n n n n n n
2 n n X X X X X
3 n X n n n n n
;
run;
I do not know how to execute this. Furthermore, I would like to create a macro that allows me to estimate different rolling window models while still creating analogous output dataframes. I would appreciate any help with it, since I have been struggling quite some time now.

Try this macro. This will only output if there are non-missing values of lags that you specify.
%macro lag(data=, out=, window=);
data _want_;
set &data.;
by ID;
LAG_EARN = lag&window.(earn);
if(first.ID) then call missing(lag_earn);
if(NOT missing(lag_earn));
run;
proc sort data=_want_;
by year id;
run;
proc transpose data=_want_
out=&out.(drop=_NAME_);
by ID notsorted;
id year;
var lag_earn;
run;
proc sort data=&out.;
by id;
run;
%mend;
%lag(data=have, out=want, window=1);

How to collapse data by week correctly in Stata?

I have a transaction level dataset and I want to collapse and calculate weekly average price. The dataset can be simplified as follows,
clear
input str9 date quantity price id
"01jan2010" 50 70 1
"02jan2010" 60 80 2
"02jan2010" 70 90 3
"04jan2010" 70 95 4
"08jan2010" 60 81 5
"09jan2010" 70 88 6
"12jan2010" 55 87 7
"13jan2010" 52 88 8
end
gen date2=date(date,"DMY")
format date2 %td
drop date
I want to create a variable date3. For every transaction happened in a week, date3 is the Monday of that week.
Here's the code I have:
sort date2
gen date3=date2 if dow(date2)==1
replace date3=date3[_n-1] if missing(date3)
format date3 %td
However, there are Mondays with no transactions, but the rest of the week has transactions. In those cases, date3 is not the Monday date of that week, but Monday date in the weeks before.
My data becomes the following using the above code:
quantity price id date2 date3
50 70 1 01jan2010
60 80 2 02jan2010
70 90 3 02jan2010
70 95 4 04jan2010 04jan2010
60 81 5 08jan2010 04jan2010
70 88 6 09jan2010 04jan2010
55 87 7 12jan2010 04jan2010
52 88 8 13jan2010 04jan2010
To me, it does not matter if id =1,2,3 have no date3. What I am concerned is that id=7 and id=8 should have a date3 of 11jan2010. But because there is no transaction on that day, the date becomes 04jan2010. Is there a way to fix this?
(I was thinking of constructing a new dataset with consecutive dates since 01jan2010 and then merge with the one above, and then drop if missing quantity of price. But I was wondering if there's a more efficient way).
In addition, I have a weekly index data that reports on every Friday since 01jan2010. If I use wofd command, Stata will generate 53 weeks in 2010. (Or more precisely, two 2010w52.) How can I get just 52 weeks in Stata?
(I found this http://www.stata.com/statalist/archive/2012-02/msg01030.html but I still cannot figure out how this can help solve my problem. )

Your weeks start on Mondays. Everything you need follows from using dow() to exploit the fact that in every one of your weeks, the day of week function dow() yields 1, 2, 3, 4, 5, 6, 0 for the days from Monday to Sunday.
The present or previous Monday for daily dates daily is just
gen Monday = cond(dow(daily) == 0, daily - 6, daily - dow(daily) + 1)
The branch is like this. If it's a Sunday, the previous Monday was 6 days ago. Otherwise, the Monday that starts the week was today if it's Monday and dow() yields 1, yesterday if it's Tuesday and 2, and so forth. Here the variable Monday is just the dates of Mondays that define the weeks.
Important detail: There are no assumptions here about dates being complete in the data or even in order.
Small note: Arbitrary names like date2 and date3 mean nothing much. Use evocative names in your questions (and your practice).
There was a sequel to the article mentioned by Robert Ferrer. search week, sj in Stata to get the references.
Do not use Stata's weeks and in particular do not use the wofd() function (not a command), as they can't help you. Stata's weeks will not map on to your weeks. The article mentioned by Robert Ferrer really is worthwhile reading to understand this (even though I wrote it).
(This is all explained in the Statalist threads you link to.)

Adding across years

Quick question. I'm working with code that produces a spreadsheet that contains the information like the following:
year business sales profit
2001 a 5 3
2002 a 6 4
2003 a 4 2
2001 b 2 1
2002 b 6 3
2003 b 7 5
How can I get Stata to total sales and profits across years?
Thanks

Try
collapse (sum) sales profit, by(year)
or, if you want to retain your original data,
bysort year: egen tot_sales = total(sales)
egen stands for extended generate, a very useful command.

Convert Daily Returns to Monthly Returns in Stata

I am using Stata and I have 6 years of daily returns for stocks that individuals hold in their portfolios. I would like to aggregate the daily returns to monthly portfolio returns. In some instances, the individual may hold more than one stock in the portfolio. I am struggling with writing the code to do this.
For a visual, my data looks like this:
I would like the results to look like this:
Where individual 2's portfolio return for the month of December 1996 is calculated as: 0.3 * 0.0031 + 0.7 * 0.0076 = 0.00625.
I have tried the collapse command such as
collapse Return, by (ID Year Month)
but this does not provide the same return that I calculated out in Excel.
I am able to make a weighted portfolio return for all the days using
bysort ID year month: egen wt_return = stock_weight * monthly_return
But this gives me daily returns. My trouble is then aggregating them into one return for the corresponding month.
As for the specifics, I would like to calculate the monthly portfolio return as the product of 1 + the weighted daily returns. As a last resort, the mean return for the month could work.

You don't show monthly portfolio return for person 2 in 1991. Your initial example data doesn't show stock weights but the desired example
data does. Your variable Monthly Return is not reproducible. You should take time to verify your question is clear when posting.
It's supposed be clear to the public who will read it, not only to you.
I didn't bother checking if your computations are correct but below is what I
understand you want. The procedure is simply to compute a weighted return and then
add them up by person year month groups. (I assume the stock weights apply to stocks on a daily basis, which is what your example data implies.)
clear all
set more off
input ///
perid year month day str3 stockid return stockw
1 1991 1 1 "ABC" .01 1
1 1991 1 2 "ABC" .02 1
1 1991 1 3 "ABC" -.01 1
1 1991 1 31 "ABC" .004 1
1 1996 12 31 "ABC" .002 1
2 1991 1 1 "ABC" .01 .3
2 1991 1 2 "ABC" .02 .3
2 1996 12 31 "ABC" .004 .3
2 1991 1 1 "XYZ" .001 .7
2 1991 1 2 "XYZ" .004 .7
2 1996 12 31 "XYZ" .021 .7
end
* create weighted return
gen returnw = return * stockw
sort perid year month day
list, sepby(perid year month day)
* sum weighted returns by person, year, month
collapse (sum) returnw, by (perid year month)
list, sepby(perid)
If you want collapse to sum, then you must indicate it with the (sum) (although I'm not clear if this is what you want). By default, it computes the mean. Read help collapse thouroughly.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js

How to write the best code for data aggregation? - stata

This is shorter code than any suggested so far: collapse average_income=income, by(state year)

Related

Inactivity duration variable in panel data (Stata)

Rolling Window Model for Unbalanced Dataset in SAS

How to collapse data by week correctly in Stata?

Adding across years

Convert Daily Returns to Monthly Returns in Stata

Categories

Resources