How to use tsrevar to create lag variables using Stata - Stata

I have panel data (time variable: date; panel variable: ticker). I want to create up to 10 lagged variables for x, so I use the following code:
tsrevar L(1/10).x
rename (`r(varlist)') x_#, addnumber
Because my data are hourly and contain observations only during the daytime, the code above leaves the lag missing for the first observation of each trading day.
My alternative solution is:
by ticker: gen lag1 = return[_n-1]
Then I have to copy and paste this code 10 times, which looks very messy. Could anyone show me how to solve this problem, please?

This is my "10-minute guess" given I don't actually work with anything finer than a daily periodicity.
Stata has no hourly display format that I know of. One way to achieve what you want is using the delta() option when you tsset the data.
clear
set more off
*----- example data -----
// an "hour-by-hour" time series stored in milliseconds (%tc)
set obs 25
gen double t = _n*1000*60*60
format %tcDDmonCCYY_HH:MM:SS.sss t
set seed 3129745
gen ret = runiform()
list, sep(0)
*----- what you want? -----
// 1000*60*60 milliseconds make up 1 hour
tsset t, delta((1000*60*60))
// one way
tsrevar L(1/2).ret
rename (`r(varlist)') ret_#, addnumber
// two other ways
gen ret1 = L.ret
gen ret11 = ret[_n-1]
// check
assert ret_1 == ret1
assert ret_1 == ret11
list, sep(0)
tsset also has a generic option, and delta() itself has several specifications. Take a look and test, to see if you find a better fit.
(You mention an "hourly frequency" but you don't give example data with the specifics. There really is no way to know for sure what you're dealing with.)
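For the panel case in the question, where the lag should be the previous observation within each ticker regardless of clock gaps, a loop avoids the copy-and-paste. This is only a minimal sketch: the example data are made up, and the names ticker, date, x, and x_lag# are assumptions based on the question.

```stata
clear
set obs 30
* hypothetical panel: two tickers, 15 observations each
gen str3 ticker = cond(_n <= 15, "AAA", "BBB")
bysort ticker: gen double date = _n
set seed 12345
gen x = runiform()

* lags by observation order within ticker, not by elapsed time
forvalues i = 1/10 {
    bysort ticker (date): gen x_lag`i' = x[_n-`i']
}
list ticker date x x_lag1 x_lag2 in 1/5
```

Note that these are observation-order lags: the first daytime observation is paired with the last observation of the previous trading day, which may or may not be what you want.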

Can I impute a variable conditional on another?

I am trying to impute the data about whether someone was born in the UK from wave 1 to wave 2. I suspect the egen function would work, but I am not sure what the code would look like.
As you can see, I need to assign the same born in the uk response for person id 1 in wave 1 to wave 2.
I know I could do it by reshaping the dataset to a wide format but do you know whether there is any other way?
This is a Stata FAQ as accessible here.
You can copy downwards in the dataset without creating any new variables.
bysort id (wave) : replace born_in_uk = born_in_uk[_n-1] if missing(born_in_uk)
mipolate (SSC) has a groupwise option that checks for there being more than one non-missing value. Search within www.statalist.org for mentions.
Note that egen is a command, not a function.
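To see the copy-down at work, here is a small sketch on made-up data; the values are hypothetical and only the variable names follow the code above.

```stata
clear
input id wave born_in_uk
1 1 1
1 2 .
2 1 0
2 2 .
3 1 1
3 2 .
3 3 .
end

* within each id, in wave order, fill a missing value from the previous wave
bysort id (wave) : replace born_in_uk = born_in_uk[_n-1] if missing(born_in_uk)
list, sepby(id)
```

Because replace works through the observations in order, the value for id 3 cascades from wave 1 to wave 2 and then to wave 3.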
I am not sure whether born in the UK here is a numeric variable with value labels or a string. If it is a string, what if you did something like this:
encode born_in_UK, gen(born_num)
bysort person_id: egen born_num2=mean(born_num)
drop born_num
rename born_num2 born_num
The idea is to treat the repeated person ids as groups and use the mean function to fill in the missing values within each group. I think this should work.

Moving average using forvalues - Stata

I am struggling with a question in Cameron and Trivedi's "Microeconometrics using Stata". The question concerns a cross-sectional dataset with two key variables, log of annual earnings (lnearns) and annual hours worked (hours).
I am struggling with part 2 of the question, but I'll type the whole thing for context.
A moving average of y after data are sorted by x is a simple case of nonparametric regression of y on x.
1. Sort the data by hours.
2. Create a centered 25-period moving average of lnearns with ith observation yma_i = (1/25) * sum_{j=-12}^{12} y_{i+j}. This is easiest using the command forvalues.
3. Plot this moving average against hours using the twoway connected graph command.
I'm unsure what command(s) to use for a moving average of cross-sectional data. Nor do I really understand what a moving average over one-period data shows.
Any help would be great and please say if more information is needed.
Thanks!
Edit1:
Should be able to download the dataset from here https://www.dropbox.com/s/5d8qg5i8xdozv3j/mus02psid92m.dta?dl=0. It is a small extract from the 1992 Individual-level data from the Panel Study of Income Dynamics - used in the textbook.
Still getting used to the syntax, but here is my attempt at it
sort hours
gen yma=0
forvalues i = 1/4290 {
quietly replace yma = yma + (1/25)(lnearns[`i'-12] to lnearns[`i'+12])
}
There are other ways to do this, but I created a variable for each lag and lead, then took the sum of all of these variables and the original, and then divided by the number of nonmissing terms (25 when all are present, as in the equation you provided):
sort hours
// generate variables for the 12 leads and lags
forvalues i = 1/12 {
gen lnearns_plus`i' = lnearns[_n+`i']
gen lnearns_minus`i' = lnearns[_n-`i']
}
// get the sum of the lnearns variables
egen yma = rowtotal(lnearns_* lnearns)
// get the number of nonmissing lnearns variables
egen count = rownonmiss(lnearns_* lnearns)
// get the average
replace yma = yma/count
// clean up
drop lnearns_* count
This gives you the variable you are looking for (the moving average) and does not simply divide by 25, because some of the leads and lags are missing (for example, near the first and last observations).
As to your question of what this shows, my interpretation is that it shows the local average of lnearns around each value of hours. If you graph lnearns on the y axis and hours on the x axis, you get something that looks crazy because there is a lot of variation, but if you plot the moving average, the trend is much clearer.
In fact this dataset can be read into a suitable directory by
net from http://www.stata-press.com/data/musr
net install musr
net get musr
u mus02psid92m, clear
This smoothing method is problematic in that sort hours doesn't have a unique result in terms of values of the response being smoothed. But an implementation with similar spirit is possible with rangestat (SSC).
sort hours
gen counter = _n
rangestat (mean) mean=lnearns (count) n=lnearns, interval(counter -12 12)
There are many other ways to smooth. One is
gen binhours = round(hours, 50)
egen binmean = mean(lnearns), by(binhours)
scatter lnearns hours, ms(Oh) mc(gs8) || scatter binmean binhours , ms(+) mc(red)
Even better would be to use lpoly.
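A minimal sketch of the lpoly route, run here on simulated data so that it is self-contained; the simulated hours and lnearns are stand-ins, not the PSID extract, and the names hgrid and lnearns_hat are made up for illustration.

```stata
clear
set obs 500
set seed 2024
* simulated stand-ins for the real variables
gen hours = floor(3000*runiform())
gen lnearns = 8 + 0.0005*hours + rnormal()

* local linear smooth; save the grid and the smoothed values
lpoly lnearns hours, degree(1) noscatter generate(hgrid lnearns_hat)
```

With the real dataset in memory, only the lpoly line would be needed. The kernel and bandwidth are left at their defaults here; they are worth experimenting with.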

How can I do =INDEX(_, MATCH(_, _,0)) of Excel In Stata?

I would like to use the same concept as =INDEX(_,MATCH(_,_,0)) of Excel in Stata 12, exclusively using Stata programming.
Is there a way to match one value with a column (say variable A), and then give another column (say variable B) as the output?
It is not a good idea to rely on Stata users knowing what MS Excel functions do: many knowledgeable Stata users don't use MS Excel. Conversely, it's a good idea to put forward your failed attempts. See https://stackoverflow.com/help/asking on asking good questions.
Can the following be what you want?
clear
set more off
*----- example data -----
sysuse auto
keep make foreign
bysort foreign (make) : keep if _n == 1
list, nolabel
*----- what you want ? -----
// two cases
list make if foreign == 1
list make if foreign == 0
Run findit vlookup, for a user-written command that does just that, but in Stata.
I believe you are looking for the merge command. Type help merge for an explanation.
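As a rough sketch of how merge can play the role of a lookup: the key variable A and the returned variable B follow the question, while the data values and the tempfile name lookup are invented for illustration.

```stata
clear
* lookup table: A is the key, B is the value to return
input str1 A B
"a" 10
"b" 20
"c" 30
end
tempfile lookup
save `lookup'

clear
* values to look up; "z" has no match
input str1 A
"b"
"c"
"z"
end
merge m:1 A using `lookup', keep(master match)
list
```

Unmatched keys keep a missing B, and the _merge variable records which observations found a match. Run the block in one go, since the tempfile disappears when the do-file ends.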

Stata: Newvar for multiple equal dates

I am having trouble generating a new variable that is computed for every month when there are multiple entries for every month.
date1 x b
1925m12 .01213 .323
1925m12 .94323 .343
1926m01 .34343 .342
The code would look like gen newvar = sum(x*b), but I want to create the variable for each month.
What I have tried so far is to create an index for the date1 variable with
sort date1
gen n=_n
and after that to create a binary marker for when the date changes with
gen byte new=date1!=date[[_n-1]
After that I got a value for every other month, but I am not sure whether this is correct, so I would like someone to have a look and confirm it. As there are a lot of values, it is hard to check by hand whether the numbers are correct. I hope it is clear what I want to do.
Two comments on your code:
There's a typo: date[[_n-1] should be date1[_n-1]
In your posted code there's no need for gen n = _n.
Maybe something along the lines of:
clear
set more off
*-----example data -----
input ///
str10 date1 x b
1925m12 .01213 .323
1925m12 .94323 .343
1926m01 .34343 .342
end
gen date2 = monthly(date1, "YM")
format %tm date2
*----- what you want -----
gen month = month(dofm(date2))
bysort month: gen newvar = sum(x*b)
list, sepby(month)
will help.
But, notice that the series of the cumulative sum can be different for each run due to the way in which Stata sorts and because month does not uniquely identify observations. That is, the last observation will always be the same, but the way in which you arrive at the sum, observation-by-observation, won't be. If you want the total, then use egen, total() instead of sum().
If you want to group by month/year, then you want: bysort date2: ...
The key here is the by: prefix. See, for example, Speaking Stata: How to move step by: step by Nick Cox, and of course, help by.
A major error is touched on in this thread which deserves its own answer.
As used with generate the function sum() returns cumulative or running sums.
As used with egen the function name sum() is an out-of-date but still legal and functioning name for the egen function total().
The word "function" is over-loaded here even within Stata. egen functions are those documented under egen and cannot be used in any other command or context. In contrast, Stata functions can be used in many places, although the most common uses are within calls to generate or display (and examples can be found even of uses within egen calls).
This use of the same name for different things is undoubtedly the source of confusion. In Stata 9, the egen function name sum() went undocumented in favour of total(), but difficulties are still possible through people guessing wrong or not studying the documentation really carefully.
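The difference is easy to see side by side. A small sketch using the example data from the question; the new variable names running and mtotal are made up.

```stata
clear
input str7 date1 x b
1925m12 .01213 .323
1925m12 .94323 .343
1926m01 .34343 .342
end
gen date2 = monthly(date1, "YM")
format %tm date2

* generate with sum(): running (cumulative) sum within each month
bysort date2: gen running = sum(x*b)
* egen with total(): the same month total repeated on every observation
egen mtotal = total(x*b), by(date2)
list, sepby(date2)
```

Within each month the two variables agree only on the last observation.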

Stata estpost esttab: Generate table with mean of variable split by year and group

I want to create a table in Stata with the estout package to show the mean of a variable split by 2 groups (year and binary indicator) in an efficient way.
I found a solution, which is to split the main variable cash_at into 2 groups by hand through the generation of new variables, e.g. cash_at1 and cash_at2. Then, I can generate summary statistics with tabstat and get output with esttab.
estpost tabstat cash_at1 cash_at2, stat(mean) by(year)
esttab, cells("cash_at1 cash_at2")
Link to current result: http://imgur.com/2QytUz0
However, I'd prefer a horizontal table (e.g. year on the x axis) and a way to do it without splitting the groups by hand - is there a way to do so?
My preference in these cases is for year to be in rows and the statistic (e.g. mean) in the columns, but if you want to do it the other way around, there should be no problem.
For a table like the one you want it suffices to have the binary variable you already mention (which I name flag) and appropriate labeling. You can use the built-in table command:
clear all
set more off
* Create example data
set seed 8642
set obs 40
egen year = seq(), from(1985) to (2005) block(4)
gen cash = floor(runiform()*500)
gen flag = round(runiform())
list, sepby(year)
* Define labels
label define lflag 0 "cash0" 1 "cash1"
label values flag lflag
* Table
table flag year, contents(mean cash)
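A note for newer installations: as far as I know, the table command was rewritten in Stata 17 and contents() was replaced by statistic(). A hedged sketch that picks the syntax by version, recreating the same example data:

```stata
clear all
set seed 8642
set obs 40
egen year = seq(), from(1985) to(2005) block(4)
gen cash = floor(runiform()*500)
gen flag = round(runiform())

* assumption: statistic() is the Stata 17+ replacement for contents()
if c(stata_version) >= 17 {
    table flag year, statistic(mean cash)
}
else {
    table flag year, contents(mean cash)
}
```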
In general, for tables, apart from the estout module you may want to consider also the user-written command tabout. Run ssc describe tabout for more information.
On the other hand, it's not clear what you mean by "splitting groups by hand". You show no code for this operation, but as long as it's general enough for your purposes (and practical) I think you should allow for it. The code might not be as elegant as you wish but if it's doing what it's supposed to, I think it's alright. For example:
clear all
set more off
set seed 8642
set obs 40
* Create example data
egen year = seq(), from(1985) to (2005) block(4)
gen cash = floor(runiform()*500)
gen flag = round(runiform())
* Data management
gen cash0 = cash if flag == 0
gen cash1 = cash if flag == 1
* Table
estpost tabstat cash*, stat(mean) by(year)
esttab, cells("cash0 cash1")
can be used for a table like the one you give in your original post. It's true you have two extra lines and variables, but they may be harmless. I agree with the idea that in general, efficiency is something you worry about once your program is behaving appropriately; unless of course, the lack of it prevents you from reaching that state.