I am struggling with a question in Cameron and Trivedi's "Microeconometrics using Stata". The question concerns a cross-sectional dataset with two key variables, log of annual earnings (lnearns) and annual hours worked (hours).
I am struggling with part 2 of the question, but I'll type the whole thing for context.
A moving average of y after data are sorted by x is a simple case of nonparametric regression of y on x.
Sort the data by hours.
Create a centered 25-period moving average of lnearns, with ith observation yma_i = (1/25) * sum from j = -12 to j = 12 of y_(i+j). This is easiest using the command forvalues.
Plot this moving average against hours using the twoway connected graph command.
I'm unsure what command(s) to use for a moving average of cross-sectional data, nor do I really understand what a moving average over single-period (cross-sectional) data shows.
Any help would be great and please say if more information is needed.
Thanks!
Edit1:
Should be able to download the dataset from here https://www.dropbox.com/s/5d8qg5i8xdozv3j/mus02psid92m.dta?dl=0. It is a small extract from the 1992 Individual-level data from the Panel Study of Income Dynamics - used in the textbook.
Still getting used to the syntax, but here is my attempt at it
sort hours
gen yma=0
forvalues i = 1/4290 {
quietly replace yma = yma + (1/25)(lnearns[`i'-12] to lnearns[`i'+12])
}
There are other ways to do this, but here I create a variable for each lead and lag, take the sum of all of these variables and the original, and then divide as in the equation you provided:
sort hours
// generate variables for the 12 leads and lags
forvalues i = 1/12 {
gen lnearns_plus`i' = lnearns[_n+`i']
gen lnearns_minus`i' = lnearns[_n-`i']
}
// get the sum of the lnearns variables
egen yma = rowtotal(lnearns_* lnearns)
// get the number of nonmissing lnearns variables
egen count = rownonmiss(lnearns_* lnearns)
// get the average
replace yma = yma/count
// clean up
drop lnearns_* count
This gives you the variable you are looking for (the moving average). Note that it divides by the number of nonmissing values rather than simply by 25, because you have many missing observations.
As to your question of what this shows, my interpretation is that it shows the local average of lnearns around each value of hours. If you graph lnearns on the y axis and hours on the x axis, you get something that looks crazy because there is a lot of variation, but if you plot the moving average instead, the trend is much clearer.
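For the final part of the exercise, here is a minimal plotting sketch; it assumes the yma variable created above, and the overlay and graph options are my own choices rather than the textbook's:
twoway connected yma hours
// or overlay the smooth on the raw data to see how much noise it removes
twoway (scatter lnearns hours, msymbol(oh) mcolor(gs10)) (line yma hours, sort lcolor(red))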
In fact, this dataset can be downloaded into a suitable directory and read in with
net from http://www.stata-press.com/data/musr
net install musr
net get musr
u mus02psid92m, clear
This smoothing method is problematic in that sort hours doesn't have a unique result in terms of the values of the response being smoothed. But an implementation in a similar spirit is possible with rangestat (SSC).
// rangestat is user-written; install once with: ssc install rangestat
sort hours
gen counter = _n
rangestat (mean) mean=lnearns (count) n=lnearns, interval(counter -12 12)
There are many other ways to smooth. One is
gen binhours = round(hours, 50)
egen binmean = mean(lnearns), by(binhours)
scatter lnearns hours, ms(Oh) mc(gs8) || scatter binmean binhours, ms(+) mc(red)
Even better would be to use lpoly.
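For example, a minimal lpoly sketch, using the default kernel and letting Stata choose the bandwidth (add options as needed):
lpoly lnearns hours, ci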
Related
I have the percent change of a variable for 20 years, and I want to find the average percent change over 3 years, rolling continuously across the period. So, suppose I have the data from 2000-2020: I want to form the average of 2000, 2001, 2002, then 2001, 2002, 2003, and so on, in groups of 3 up to 2018, 2019, 2020, in Stata.
Please help me with the code.
This is just a running mean or moving average. (For some reason, running average and moving mean aren't expressions I ever hear.) So you need to tsset or xtset your data and then look at help tssmooth ma.
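A minimal sketch, assuming a yearly time variable named year and a percent-change variable named pchange (both names are assumptions on my part):
tsset year
// average of each year and the following two (so 2000, 2001, 2002 is assigned to 2000), as described
tssmooth ma ma3 = pchange, window(0 1 2)
// or a centered 3-year moving average (previous, current, and next year)
tssmooth ma ma3c = pchange, window(1 1 1)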
I have panel data (time variable: date; panel variable: ticker). I want to create up to 10 lagged variables of x, so I use the following code.
tsrevar L(1/10).x
rename (`r(varlist)') x_#, addnumber
Because my data are at an hourly frequency, with observations only during the daytime, the first observation of each trading day comes out missing when I use the code above.
My alternative solution is:
by ticker: gen lag1 = return[_n-1]
Then I have to copy and paste this code 10 times, which looks very messy. Could anyone teach me how to solve this problem, please?
This is my "10-minute guess" given I don't actually work with anything finer than a daily periodicity.
Stata has no hourly display format that I know of. One way to achieve what you want is to use the delta() option when you tsset the data.
clear
set more off
*----- example data -----
// an "hour-by-hour" time series which really has millisecond format
set obs 25
gen double t = _n*1000*60*60
format %tcDDmonCCYY_HH:MM:SS:.sss t
set seed 3129745
gen ret = runiform()
list, sep(0)
*----- what you want? -----
// 1000*60*60 milliseconds make up 1 hour
tsset t, delta((1000*60*60))
// one way
tsrevar L(1/2).ret
rename (`r(varlist)') ret_#, addnumber
// two other ways
gen ret1 = L.ret
gen ret11 = ret[_n-1]
// check
assert ret_1 == ret1
assert ret_1 == ret11
list, sep(0)
tsset also has a generic option, and delta() itself has several specifications. Take a look and test, to see if you find a better fit.
(You mention an "hourly frequency" but you don't give example data with the specifics. There really is no way to know for sure what you're dealing with.)
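If the overnight gaps are what really bother you, another route, in the spirit of the generic option mentioned above, is to number the observations consecutively within each ticker so that lags simply ignore the gaps. A sketch, using the variable names from your description (ticker, date, x), which are assumptions on my part:
// numeric panel id, in case ticker is a string
egen long id = group(ticker)
// consecutive within-ticker counter, ordered by the original time variable
bysort id (date): gen long obsno = _n
xtset id obsno
tsrevar L(1/10).x
rename (`r(varlist)') x_#, addnumber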
I am having trouble generating a new variable which will be created for every month, while having multiple entries for every month.
date1 x b
1925m12 .01213 .323
1925m12 .94323 .343
1926m01 .34343 .342
The code would look like this: gen newvar = sum(x*b), but I want to create the variable for each month.
What I tried so far was
to create an index for the date1 variable with
sort date1
gen n=_n
and after that create a binary marker for when the date changes
with
gen byte new=date1!=date[[_n-1]
After that I received a value for every other month, but I'm not sure whether this is correct, so I would like someone to have a look and confirm it. The thing is, as there are a lot of values, it's hard to check manually whether the numbers are correct. I hope it's clear what I want to do.
Two comments on your code:
There's a typo: date[[_n-1] should be date1[_n-1]
In your posted code there's no need for gen n = _n.
Maybe something along the lines of:
clear
set more off
*-----example data -----
input ///
str10 date1 x b
1925m12 .01213 .323
1925m12 .94323 .343
1926m01 .34343 .342
end
gen date2 = monthly(date1, "YM")
format %tm date2
*----- what you want -----
gen month = month(dofm(date2))
bysort month: gen newvar = sum(x*b)
list, sepby(month)
will help.
But notice that the series of cumulative sums can differ from run to run, because of the way in which Stata sorts and because month does not uniquely identify observations. That is, the last observation will always be the same, but the way in which you arrive at the sum, observation by observation, won't be. If you want the total, then use egen, total() instead of sum().
If you want to group by month/year, then you want: bysort date2: ...
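For example, if the per-month/year total (rather than a running sum) is what you are after, a sketch using the date2 variable created above:
bysort date2: egen newvar2 = total(x*b)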
The key here is the by: prefix. See, for example, Speaking Stata: How to move step by: step by Nick Cox, and of course, help by.
A major error is touched on in this thread, and it deserves its own answer.
As used with generate the function sum() returns cumulative or running sums.
As used with egen the function name sum() is an out-of-date but still legal and functioning name for the egen function total().
The word "function" is over-loaded here even within Stata. egen functions are those documented under egen and cannot be used in any other command or context. In contrast, Stata functions can be used in many places, although the most common uses are within calls to generate or display (and examples can be found even of uses within egen calls).
This use of the same name for different things is undoubtedly the source of confusion. In Stata 9, the egen function name sum() went undocumented in favour of total(), but difficulties are still possible through people guessing wrong or not studying the documentation really carefully.
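A small illustration of the difference, using a toy dataset (the variable names here are just for illustration):
sysuse auto, clear
gen runsum = sum(price)   // generate's sum(): a cumulative (running) sum
egen grand = total(price) // egen's total(), for which sum() is the old name: a constant, the overall total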
I am attempting to run a quantile regression on monthly observations (of mutual fund characteristics). What I would like to do is distribute my observations into quintiles for each month (my dataset comprises 99 months). I want to base the quintiles on a variable (lagged fund size, i.e. Total Net Assets) that will later be employed as an independent variable to explain fund performance.
What I already tried is the qreg command, but that uses quantiles based on the dependent variable, not the independent variable as needed here.
Moreover, I tried to use the xtile command to create the quintiles; however, the by: prefix is not supported:
. by Date: xtile QLagTNA= LagTNA, nq(5)
xtile may not be combined with by
r(190);
Is there a (combination of) command(s) which saves me from creating quintiles manually on a month-by-month basis?
Statistical comments first before getting to your question, which has two Stata answers at least.
Quantile regression is defined by prediction of quantiles of the response (what you call the dependent variable). You may or may not want to do that, but using quantile-based groups for predictors does not itself make a regression a quantile regression.
Quantiles (here quintiles) are values that divide a variable into bands of defined frequency. Here you want the 0, 20, 40, 60, 80, 100% points. The bands, intervals or groups themselves are not best called quantiles, although many statistically-minded people would know what you mean.
What you propose seems common in economics and business, but it is still degrading the information in the data.
All that said, you could always write a loop using forval, something like this
egen group = group(Date)
su group, meanonly
gen QLagTNA = .
quietly forval d = 1/`r(max)' {
xtile work = LagTNA if group == `d', nq(5)
replace QLagTNA = work if group == `d'
drop work
}
For more, see this link
But you will probably prefer to download a user-written egen function [correct term here] to do this
ssc inst egenmore
h egenmore
The function you want is xtile().
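If I remember the egenmore syntax correctly (check h egenmore for the exact option names), the by()-able version would be something along the lines of:
egen QLagTNA2 = xtile(LagTNA), by(Date) nq(5)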
I have a SAS dataset of 60k customers with the following attributes:
1) customer number
2) X coordinate
3) Y coordinate
4) store visits
I need to calculate the average weighted distance from each customer to all the other customers in the table, where each distance is weighted by the comparing customer's number of visits. As an example, suppose the distance between Customer A and Customer B is 10. We would then weight that distance by Customer B's number of visits (2), which equals 5. This process would repeat for all other customers in the table, and we would then average all of these weighted distances for each of the 60k customers.
I suppose the brute-force way to do this is a Cartesian join (i.e. create a 60k x 60k = 3.6 billion record table), but that will likely run out of memory or crash SAS. I have also thought of breaking this up into more manageable Cartesian joins (i.e. 10 x 60k = 600k records per join, over 6,000 iterations), but this is likely to be quite time consuming; maybe it's my only choice though. I'm guessing you guys/gals know a much better way to do this!
I appreciate all your suggestions.
Thanks for your help!
Bad news: there is no way to speed up this calculation (that I know of).
The good news is that SAS won't crash or run out of memory if you do a Cartesian product. More good news: doing this in a DATA step is faster than doing it in PROC SQL.
data test;
do cn=1 to 64000;
x = ceil(Ranuni(13)*100);
y = ceil(ranuni(13)*100);
visits = max(1,round(rannor(12)*3 + 8,1));
output;
end;
run;
sasfile test load;
data ave_dist(keep=cn ave_dist);
set test end=last;
dist=0;
td= 0;
total_visits=0;
do i=1 to n;
set test(rename=(cn=cn_2 x=x_2 y=y_2 visits=visits_2)) point=i nobs=n;
if cn ^= cn_2 then do;
xx = (x-x_2);
yy = (y-y_2);
/* weight by the comparing customer's visits, not the current customer's */
total_visits = total_visits + visits_2;
dist = sqrt(xx*xx + yy*yy);
if dist ^= 0 then
dist = 1/dist;
else
dist = 100; /*Adjust to something that makes sense to your data*/
td = visits_2*dist + td;
end;
end;
ave_dist = td / total_visits;
output;
run;
sasfile test close;
I inverted the distance calculation. You want small distances to have a higher score. I made this a true visit weighted average.
This takes about 13 minutes to run on my laptop.
If your customer base is going to be <100k, then PROC DISTANCE could be of help. Using the dataset created by @DomPazz, you could run the following code and examine the results. In this case I'm only trying it out on the first 10K customers, which runs in 16 secs. Do not let that fool you into a false sense of security: when you double the number of customers, the time taken goes up by a factor of four.
(actual times: 10K - 16 secs, 20K - 47 secs, 40K - 3 mins, ...)
This procedure produces an N x N square matrix (where N is the number of customers in your input dataset). You could experiment and see at what point SAS runs into RAM memory issues (be sure to have plenty of hard drive space, at least on the order of 1.10*N*N*8 bytes).
Each cell in the matrix represents the ith customer's (in rows) distance to the jth customer (in columns). Once you have the distances, it is a simple matter of multiplying the respective distances by the customers' visits and taking the average.
proc distance data = test(obs = 100)
OUT=test_distances(compress = binary)
METHOD= EUCLID shape = SQUARE
UNDEF=1000000
VARDEF=wdf;
var INTERVAL(x y)
;
copy cn visits;
run;
data avg_dist;
set test_distances;
array dist{*} dist:;
prod=0;
do i = 1 to dim(dist);
prod = visits*dist{i}+prod;
end;
avg_dist=prod/dim(dist);
dims=dim(dist);
drop i dist:
;
run;
proc sql;
drop table test_distances;
quit;
The type of problem you are looking to solve is generally known as a k-nearest-neighbour problem. There have been decades of research in this area, and most often these problems are solved using special data structures such as k-d trees for performance. Most often one is interested in answering questions such as: who are the top 10 (or k) closest customers to this customer I'm interested in? Another procedure that is very good for solving these types of problems efficiently is PROC PMBR, which supports both the k-d tree and SAS's proprietary structure called the Rd-tree. Look it up; you will only find a PDF document from the SAS Enterprise Miner 4.3 days.
The moment you have to calculate distances between N*N pairs of items, you are asking for trouble.
From reading your project description in the comments, it appears that what you need is not to calculate the distance between every customer and every other customer, but something like the distance between every customer and every store.
This will dramatically improve performance, since the scale of the problem is greatly reduced.
Let's say you have N customers and S stores; then you only need to calculate distances between N*S pairs. (A simple DATA step will do the job, as there is no need for an N*N Cartesian product nor specialised data structures.)
From there you can look at, for each store in S, what proportion of the customers who shopped at that store live within 1 km, 2 km, 3 km, and so on.
Then you can come up with answers such as 80% live within 1 km, 15% live within 2 km, etc.