Drop a block of observations if any value is missing within that block - stata

I have 10 years of data for all firms from Compustat from 2010 to 2020. I want to drop any firm which does not have an entry for any of these 10 years. In short I want only those firms that have the entire 10 years of data. How can I do this?

bysort firmid (whatever) : drop if missing(whatever[_N])
Missings sort to the end of a block of values, so if any value of whatever is missing in a block it will show up at the end.
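Note that the one-liner above handles missing values of a variable, whereas the question body also describes firms that have no row at all for some year. A minimal sketch covering both cases on toy data follows; the variable names firmid, year and sales are made up for illustration, and the second rule assumes at most one row per firm-year.

```stata
* Toy data; firmid, year and sales are hypothetical names
clear
input byte firmid int year float sales
1 2010 5
1 2011 .
1 2012 7
2 2010 3
2 2011 4
2 2012 6
3 2010 8
3 2012 9
end

* Rule 1: drop firms with any missing sales
* (missings sort last within each firm)
bysort firmid (sales) : drop if missing(sales[_N])

* Rule 2: drop firms lacking a row for some year. The toy window is
* 2010-2012, so a complete firm has 3 rows
bysort firmid : drop if _N < 3

list, sepby(firmid)
```

Only firm 2 survives both rules here. For the window in the question, the threshold would be the number of calendar years covered; 2010 through 2020 inclusive is 11 rows per firm.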

Related

Attrition in panel data - Stata

I am constructing a panel dataset based on the survey data for the years 2010-2013 (four consecutive years). As is usually the case with household survey data, there is an issue of attrition, i.e. some households drop out from the survey from year to year. I need to figure out whether these households are missing at random.
My idea is to come up with a dummy equal to 1 in 2011 if a household present in 2010 is missing in 2011 (and 0 otherwise), and so on for the years 2012, 2013. Then I want to run the logit/probit regression on this dummy with a set of covariates that I would like to control for in my study. The variable for household id is "hhid" and I have of course the time dimension variable "year".
Does anyone have a precise idea how this should be properly coded in Stata? I know it is not complicated, but I just cannot wrap my head around it and figure this out....
Here is an example of how to create a dummy in panel data, collapse the dummies to the parent unit of observation (so that each dummy is 1 if the parent unit had a 1 in any time period), and then merge the parent-level data back onto the panel.
* Example generated by -dataex-. For more info, type help dataex
clear
input byte hhid int year
1 2010
1 2011
1 2012
1 2013
2 2010
2 2011
2 2013
3 2010
3 2011
end
*Create a dummy for each year-hh level observation for each year
local year_dummies ""
forvalues year = 2010/2013 {
gen dummy`year' = (year==`year')
local year_dummies "`year_dummies' dummy`year'"
}
*Collapse the data set to hh level, where each dummy is 1 if any year-hh observation was 1
preserve
collapse (max) `year_dummies' , by(hhid)
tempfile year_dummy_hhlevel
save `year_dummy_hhlevel'
restore
*Rename the year-hh level dummies so the hh level dummies can be merged in under the original names
rename dummy???? org_dummy????
*Merge the hh level data back to the year-hh level data,
*attaching the hh dummies to each year-hh observation
merge m:1 hhid using `year_dummy_hhlevel', nogen
Your question is whether the households you do not observe in year X differ from those you do observe in year X. There is no perfect way to answer this, since by definition you did not observe those households.
You did, however, observe all households in your study in year 0 (2010 in your case). As you imply yourself, you can use the year 0 observations as a proxy to test whether those households are different in year X. I can show you how to code this, but StackOverflow is not the appropriate forum to discuss whether this is statistically valid given your data, how it was collected and what analysis you intend to run.
One way to code this is to use iebaltab in the package called ietoolkit available from SSC (disclosure, I wrote that command).
You can create an attrition dummy indicating attrition and use iebaltab like this: iebaltab balancevars, grpvar(attrition) where balancevars is a list of household characteristics that you want to check were similar in year 0. You can use the option ftest to include a test across all balance variables, the way you are suggesting.
Note that this command generates statistics, but it is up to you to decide whether this is valid, and the validity of balance tests is hotly debated. Those debates are not about coding, though, which is what StackOverflow is about.
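The attrition dummy passed to grpvar() is not constructed in the code above. One possible way to build it, following the definition in the question (present in the previous year, absent in the current one), is sketched below. The names obs2010-obs2013 and attrit2011-attrit2013 are invented here, and the toy data repeat the dataex example.

```stata
clear
input byte hhid int year
1 2010
1 2011
1 2012
1 2013
2 2010
2 2011
2 2013
3 2010
3 2011
end

* hh level flag: 1 if the household has a row in that year
forvalues y = 2010/2013 {
    egen byte obs`y' = max(year == `y'), by(hhid)
}

* Attrition in year y: observed in y-1 but not in y
forvalues y = 2011/2013 {
    local prev = `y' - 1
    gen byte attrit`y' = (obs`prev' == 1 & obs`y' == 0)
}

list hhid year attrit2011 attrit2012 attrit2013, sepby(hhid)
```

With this definition household 2, which skips 2012 and returns in 2013, counts as attrited in 2012; whether returners should be treated that way is a judgment call. Because the flags are constant within a household, a comparison of baseline characteristics would presumably be restricted to the 2010 rows.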

Generating group mean for a continuous variable in stata

I have the percent change of a variable for 20 years. I want to find the average percent change over each run of 3 consecutive years. So, suppose I have the data from 2000-2020. I want to form the average of 2000, 2001, 2002, then 2001, 2002, 2003, and so on, in groups of 3, up to 2018, 2019, 2020, in Stata.
Please help me with the code.
This is just a running mean or moving average. (For some reason, running average and moving mean aren't expressions I ever hear.) So you need to tsset or xtset your data and then look at help tssmooth ma.
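As a rough sketch of what that could look like: the variable names year and pctchange and the made-up values are assumptions, and the window runs forward (current year plus the next two) to match the 2000, 2001, 2002 grouping in the question.

```stata
clear
set obs 21
gen int year = 1999 + _n
* Hypothetical percent changes: 1, 2, ..., 21
gen double pctchange = _n
tsset year

* window(0 1 2): no lags, the current year, two leads
tssmooth ma avg3 = pctchange, window(0 1 2)

* The same average written out with lead operators; it is missing
* for 2019 and 2020, where a full three-year window does not exist
gen double avg3_manual = (pctchange + F1.pctchange + F2.pctchange) / 3

list year pctchange avg3 avg3_manual in 1/5
```

It is worth checking how tssmooth ma treats the last two years, where fewer than three values are available, before relying on those values.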

drop entire panel id/firm if at least 1 missing value for variable

I have panel data and want to delete an entire panel id/firm ID if it has at least 1 missing total assets (at) in one of the years. Could someone help me?
So to be clear the panel data contains the following variables:
1) year: year
2) gvkey: firm id
3) TotalAssets: amount of Total Assets
So if a firm (id) has in one of the years at least 1 missing value for TotalAssets, then it needs to be completely removed out of the sample.
It wouldn't seem possible to have more than one missing value in any year and firm with annual data, but most of the question seems to imply that the criterion is one or more missing values within each panel. If you sort on the outcome variable within panels, then missing values are sorted to the end. So the last value will be missing if any values are missing and a drop is conditional on that:
bysort gvkey (TotalAssets) : drop if missing(TotalAssets[_N])
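If you would rather inspect the affected firms before dropping them, a per-firm count of missing values is one alternative. This is only a sketch on invented data; nmiss is a hypothetical helper variable.

```stata
clear
input int gvkey int year double TotalAssets
1001 2018 50
1001 2019 .
1001 2020 70
1002 2018 20
1002 2019 25
1002 2020 30
end

* Count missing TotalAssets within each firm
egen nmiss = total(missing(TotalAssets)), by(gvkey)

* Look at which firms would be removed, then remove them
tabulate gvkey if nmiss > 0
drop if nmiss > 0
drop nmiss
```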

insert columns that are missing in a range

I've created panel data by transposing columns, based on weeks, and some of the weeks never had observations, so those weeks never showed up as columns. Is there a reasonable way to insert the weeks that had no observations?
I need week0-week61, but currently I am missing week0, week4, week8... It seems silly to do this by hand in Excel.
The simplest way is like this:
data ttt;
input id week0 week4;
datalines;
1 10 20
2 11 21
;
run;
data ttt1;
set ttt;
/* Listing week0-week61 in an array creates any of those variables that do not exist yet */
array a{*} week0-week61;
run;

SAS: backward looking data step to compute the average

Sorry for the "not really informative" title of this post.
I have the following data set in SAS:
time Add time_delete
5 3.00 5
5 3.15 11
5 3.11 11
8 4.21 8
8 3.42 8
8 4.20 11
11 3.12 .
Here time corresponds to a newly added (Add) price in an auction, recorded every 3 minutes. A price can be deleted within the same time interval or later, as shown in time_delete. My objective is to compute the average of the Add prices still standing at each time. For instance, my average price at time=5 is (3.15+3.11)/2 since the 3.00 gets deleted within the interval. Then the average price standing at time=8 is (4.20+3.15+3.11)/3. As you can see, I have to look back from the current time and see which prices are still valid at time=8. Also, I would like to have a field that gives, for every time, the highest price available that was not deleted.
Any help?
You have a variant of a rolling sum here. There's no one straightforward solution (especially as you undoubtedly have a few complications not mentioned); but here are a few pointers.
First, you may want to change the format of your data. This is actually a relatively easy problem to solve if you have one row for each possible timepoint rather than just a single row.
data have;
input time Add time_delete;
datalines;
5 3.00 5
5 3.15 11
5 3.11 11
8 4.21 8
8 3.42 8
8 4.20 11
11 3.12 .
;;;;
run;
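data want;
set have;
/* Prices deleted within their own interval never stand, so they are removed. */
/* A missing time_delete is treated as "never deleted": the row is kept at its own time only. */
/* All other prices get one row per integer time up to the time before deletion. */
if time=time_delete then delete;
else if missing(time_delete) then output;
else do time=time to time_delete-1;
output;
end;
keep time add;
run;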
proc means data=want mean max n;
class time;
var add;
run;
You could output the proc means to a dataset and have your maximum value plus the average value, and then either put that back on the main dataset or whatever you need.
The main downside to this is that it produces a much larger dataset, so if you're looking at hundreds of thousands of data points, this is likely not your best option.
You can also perform this in SQL without the extra rows, although this is where those "other complications" would potentially throw a wrench in things.
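proc sql;
/* One row per distinct time, joined to every price added at or before it */
/* that is deleted later or never deleted (missing time_delete) */
select H.time, mean(V.add) as mean_add, max(V.add) as max_add
from (select distinct time from have) H
left join have V
on V.time le H.time
and (V.time_delete gt H.time or V.time_delete is missing)
group by H.time;
quit;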
Fairly straightforward and quick query, except that if you have a lot of time values it might take some time to execute the join.
Other options:
Read the data into an array, with a second array tracking the delete points. This can get a bit complex as you probably need to sort your array by delete point - so rather than just adding a new record at the end, you need to move a bunch of records down. SAS isn't quite as friendly to this sort of operation as a C-type language would be.
Use a hash table solution. Somewhat less messy than an array, particularly as you can sort a hash table more easily than two separate arrays.
Use IML and vectors. Similar to the array solution but with more powerful manipulation techniques available.