Generating group mean for a continuous variable in Stata

I have the percent change of a variable for 20 years. I want to find the average percent change over rolling 3-year windows across those years. So, suppose I have the data from 2000-2020: I want to form the average of 2000, 2001, 2002, then 2001, 2002, 2003, and so on, in groups of 3, until 2018, 2019, 2020, in Stata.
Please help me with the code.

This is just a running mean or moving average. (For some reason, running average and moving mean aren't expressions I ever hear.) So you need to tsset or xtset your data and then look at help tssmooth ma.
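For example, a minimal sketch, assuming a yearly time variable named year and a percent-change variable named pctchange (both names are placeholders for whatever your data uses):

tsset year
* 3-year moving average of the current and two preceding years: 2000-2002, 2001-2003, ...
tssmooth ma pct_ma3 = pctchange, window(2 1)

A centered window such as window(1 1 1) would instead average each year with its two neighbours; see help tssmooth ma for the details.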

Related

Power BI - Counting SWITCH Statements

I have the following measure:
test = SWITCH(TRUE(),
    MAX(test[month]) >= 9 && MAX(test[month]) <= 12, "fall",
    MAX(test[month]) >= 1 && MAX(test[month]) <= 3, "winter",
    MAX(test[month]) >= 4 && MAX(test[month]) <= 6, "spring",
    MAX(test[month]) >= 7 && MAX(test[month]) <= 8, "summer")
Currently it looks at the month number (i.e. "3" for March) and outputs "winter". What I'd like, however, is for it to output a count per season, to show the distribution of the seasons in the dataset.
For example my desired output would be
Month Number    Count of occurrences of each season
fall            5
winter          7
spring          11
summer          2
I can't have a calculated column here either, as I will want to make this measure dynamic later on with the use of a slicer. Can someone tell me if this is possible?
The issue here is that you want to define your categories within the measure. Measures are not dynamic without some filter context.
Take this for example:
Notice that the output of the calculation is identical between seasons.
There is no filter context to help the measure discern between the different seasons, because these seasons are not defined in the model. (At least, I don't know how to make this work.)
SWITCH returns the first true result. So, if you have values like those in your sample, start with the smallest threshold, then larger ones, with the largest at the end.
test =
SWITCH(
    TRUE(),
    MAX(test[month]) < 4, "winter",    -- test < 4
    MAX(test[month]) < 7, "spring",    -- 3 < test < 7
    MAX(test[month]) < 9, "summer",    -- 6 < test < 9 (is it OK that you have 2 months in summer
    "fall"                             -- 8 < test         and 4 in fall?)
)
If you use MAX(test[month])<4,"winter" instead of MAX(test[month])<=3,"winter" then you avoid one calculation step and the code will be faster.
Then you need to use the result to find the month numbers and get dates for the selected months, then calculate your table filtered by those month dates. If this answer is not enough to solve the case, then give more information about your table and its columns, and explain what you mean by 'Count of occurrences of each season': what exactly does 'occurrences' mean - a number of certain rows, or some unique values?

In SAS, how do you create a certain number of records where the primary outcome does not occur based on the value of another variable?

I am examining the effect of passing vs running plays on injuries across a few football seasons. The way the data was collected, all injuries were recorded as well as information about the play in which the injury occurred (i.e., position, quarter, play type), game info (i.e., weather conditions, playing surface, etc.), and team info (i.e., number of pass vs run plays in the game).
I would like to use play type as the primary exposure, with the outcome being injury vs no injury, analysed using logistic regression, but to do so I would need to create all the records with no injury. There is a range from 0 to around 6-7 injuries in a game for a team, and the total passing and running plays are recorded, so I would need to find a way to add X (total passing plays minus injuries on passing plays) and Y (total running plays minus injuries on running plays) records that share all the details for that particular game but have no injury as the outcome. I imagine there is a way in PROC SQL to do this, but I could not find it online. How would I go about coding this?
I have attached an example of the relevant data. An example of what I would need to do is for game 1 add 30 records for passing plays and 38 records for running plays with outcome of no injury and otherwise the same data (team A, dry weather, game plays).
You can use the FREQ statement to avoid having to de-aggregate the data.
The FREQ statement identifies a variable that contains the frequency of occurrence of each observation. PROC LOGISTIC treats each observation as if it appears n times, where n is the value of the FREQ variable for the observation. If it is not an integer, the frequency value is truncated to an integer.
SAS Documentation
De-aggregating the data would require a DATA step and a DO loop; doing it that way is not recommended.

Moving average using forvalues - Stata

I am struggling with a question in Cameron and Trivedi's "Microeconometrics using Stata". The question concerns a cross-sectional dataset with two key variables, log of annual earnings (lnearns) and annual hours worked (hours).
I am struggling with part 2 of the question, but I'll type the whole thing for context.
A moving average of y after data are sorted by x is a simple case of nonparametric regression of y on x.
1. Sort the data by hours.
2. Create a centered 25-period moving average of lnearns, with ith observation yma_i = (1/25) * sum_{j=-12}^{+12} lnearns_{i+j}. This is easiest using the command forvalues.
3. Plot this moving average against hours using the twoway connected graph command.
I'm unsure what command(s) to use for a moving average of cross-sectional data. Nor do I really understand what a moving average over one-period data shows.
Any help would be great and please say if more information is needed.
Thanks!
Edit1:
Should be able to download the dataset from here https://www.dropbox.com/s/5d8qg5i8xdozv3j/mus02psid92m.dta?dl=0. It is a small extract from the 1992 Individual-level data from the Panel Study of Income Dynamics - used in the textbook.
Still getting used to the syntax, but here is my attempt at it
sort hours
gen yma=0
forvalues i = 1/4290 {
    quietly replace yma = yma + (1/25)(lnearns[`i'-12] to lnearns[`i'+12])
}
There are other ways to do this, but here I create a variable for each lag and lead, then take the sum of all of these variables plus the original, and divide by 25 as in the equation you provided:
sort hours
// generate variables for the 12 leads and lags
forvalues i = 1/12 {
    gen lnearns_plus`i' = lnearns[_n+`i']
    gen lnearns_minus`i' = lnearns[_n-`i']
}
// get the sum of the lnearns variables
egen yma = rowtotal(lnearns_* lnearns)
// get the number of nonmissing lnearns variables
egen count = rownonmiss(lnearns_* lnearns)
// get the average
replace yma = yma/count
// clean up
drop lnearns_* count
This gives you the variable you are looking for (the moving average) and also does not simply divide by 25 because you have many missing observations.
As to your question of what this shows, my interpretation is that it shows the local average of log earnings at each value of hours. If you graph lnearns on the y-axis and hours on the x-axis, you get something that looks crazy because there is a lot of variation, but if you plot the moving average instead, the trend is much clearer.
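A minimal sketch of that comparison, reusing the yma variable created above (axis titles are only a suggestion):

* raw scatter of log earnings against hours, with the moving average overlaid
twoway (scatter lnearns hours, ms(oh) mc(gs10)) ///
       (connected yma hours, sort), ///
       ytitle("log annual earnings") xtitle("annual hours worked")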
In fact this dataset can be downloaded into a suitable directory and loaded by
net from http://www.stata-press.com/data/musr
net install musr
net get musr
u mus02psid92m, clear
This smoothing method is problematic in that sort hours doesn't have a unique result: observations tied on hours can be ordered arbitrarily, so the values of the response being smoothed depend on that ordering. But an implementation with a similar spirit is possible with rangestat (from SSC).
sort hours
gen counter = _n
rangestat (mean) mean=lnearns (count) n=lnearns, interval(counter -12 12)
There are many other ways to smooth. One is
gen binhours = round(hours, 50)
egen binmean = mean(lnearns), by(binhours)
scatter lnearns hours, ms(Oh) mc(gs8) || scatter binmean binhours , ms(+) mc(red)
Even better would be to use lpoly.
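For instance, a minimal lpoly sketch along the same lines (the degree and confidence-interval options are just one reasonable choice):

* local linear smooth of log earnings on hours, with a confidence band
lpoly lnearns hours, degree(1) ci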

How to do proportionate stratified sampling without replacement?

I want to select my sample in Stata 13 based on three stratum variables with 12 strata in total (size - two strata; sector - three strata; intangible intensity - two strata). The selection should be proportional without replacement.
However, I can only find disproportionate selection commands that select for instance x% of each stratum.
Can anyone help me out with this problem?
Thank you for this discussion. I think I know where my problem was.
The command "gsample" can select strata based on different variables. Therefore, I thought I had to define three different stratum variables. But the solution should be more simple.
There are 12 strata in total (the large firms with high intensity in sector 1, the small firms with high intensity in sector 1, and so on) with each firm in the sample falling in to one of the strata.
All I have to do is creating a variable "strataident" with values from 1 to 12 identifying the different strata. I do this for the population dataset, so the number of firms falling into each stratum is representative for the population. The following code will provide me a stratified random sample that is representative for the population.
gsample 10, percent strata(strataident) wor
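One way to build that identifier, sketched here with assumed names for the three stratum variables, is egen's group() function:

* combine size, sector and intangible intensity into a single stratum identifier 1-12
egen strataident = group(size sector intensity)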
This command works as well and is much easier, see the example in 1:
gsample 10, percent wor strata(size sector intensity)
The problem is that strata may "overlap", so you probably have to rebalance the sample after the initial draw.
Now the question is how this can be implemented. The final sample should represent the proportions of the population as well as possible.

Stata: Groupwise regressions and ranking

I am currently developing a sentiment index using Google search frequencies taken from Google Trends.
I am using Stata 12 on Windows.
My approach is as follows:
I downloaded approximately 150 business-related search queries from Google Trends, covering Jan 2004 to Dec 2013.
I now want to construct an index using the 30 queries that are, at each point in time, most relevant to the market I observe.
To achieve that I want to use monthly expanding, backward-looking rolling regressions of each query on the market.
Thus I need to regress 150 items one-by-one on the market 120 times (12 months x 10 years), using different time windows, and then extract the 30 queries with the most negative t-statistics.
To exemplify the procedure: if I wanted to construct the sentiment for January 2010, I would regress the query terms on the market during the period from Jan 2004 to December 2009 and then extract the 30 queries with the most negative t-statistics.
Now I am looking for a way to make this as automated as possible. I guess I should be able to run the 150 items at once, and I can specify the time window using the time stamps. Using Excel commands and creating a do-file with all the regression commands in it (which would be quite large), I could probably create the regressions relatively efficiently (although it depends on how much Stata can handle - any experience with that?).
What I would need to make the data extraction much easier is a command which I can use to rank the results of the regressions according to their t-statistics. Does someone have an efficient approach to this? Or general advice?
If you are using Stata, once you run a command such as ttest, you can type return list (or ereturn list after estimation commands such as regress) and you will see the scalars that Stata stores. Once you run a loop you can store these values in a number of different ways; check out the post command.
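A minimal sketch of that idea for a single cutoff date, assuming the 150 query series are named q1-q150, the market variable is mktret, and the monthly date variable is mdate (all placeholder names):

tempname memhold
postfile `memhold' str32 query tstat using tstats, replace
foreach v of varlist q1-q150 {
    * regress the query on the market over the estimation window
    quietly regress `v' mktret if mdate < tm(2010m1)
    * store the t-statistic on the market coefficient
    post `memhold' ("`v'") (_b[mktret]/_se[mktret])
}
postclose `memhold'
use tstats, clear
sort tstat
list query tstat in 1/30    // the 30 most negative t-statistics

Wrapping this in an outer loop over the 120 monthly cutoff dates would give the expanding-window version.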