Trying to analyze panel data but feel like I am mixing up commands - could anybody review and check? - stata

I have the following data structure:
186 unique firm acquisitions
Observations for 5 years per firm; 2 years before acquisition year, acquisition year, and 2 years after
Total number of observations is thus 186 * 5 = 930
Two dependent variables, which I would like to use in separate analyses - one is binary (1/0), the other is the ratio of one variable to another, ranging from 0 to 5.
Acquisition years range from 2008 to 2019
Acquisitions took place in 20 different industries
Goal: test whether there are significant differences in target characteristics (the two DVs mentioned above) after acquisition vs before acquisition.
I expect the following unobserved factors to exist that can bias results:
Deal-specific: some deals involve characteristics that others do not
Target-specific: some targets might be more difficult to change, for example. Also, some targets get acquired twice in the period I am examining, so without controlling for that fact, the results will be biased.
Acquirer-specific: some acquirers are more likely to implement change than others. Also, some acquirers engage in multiple acquisitions during the period I am examining (max is 9)
Industry-specific: there might have been some unobserved industry-trends going on, which caused targets in certain industries to be more likely to change than targets in other industries.
Year-specific: since the acquisitions took place in different years between 2008 and 2019, observations might be biased by unobserved year-specific factors. For example, 2020 and 2021 observations will likely be affected by the COVID-19 pandemic.
I have constructed a dummy variable, post, which is coded 1 for year 1 and year 2 after acquisition, and 0 for year 1 and year 2 before acquisition.
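In code, the construction is roughly the following (a sketch, assuming a variable acq_year - a hypothetical name - holds each deal's acquisition year):
gen rel_year = year - acq_year
gen post = rel_year > 0 if inlist(rel_year, -2, -1, 1, 2)  // 1 for years +1/+2, 0 for years -1/-2; acquisition year itself left missing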
I have been struggling with using the right models and commands in Stata. The code I have been using:
BINARY DV
First, I ran an OLS regression so that I could remove outliers after the regression:
reg Y1 i.post##c.X1 $controls i.industry i.year
Then, I removed outliers (not sure if this is the right method though):
predict estu if e(sample), rstudent
drop if abs(estu)>3.5
Then, I ran the xtprobit regression below:
egen id = group(target_id acquiror_id)
xtset deal_id year
xtprobit Y1 i.post##c.X1 $controls i.industry i.year, vce(cluster id)
OTHER DV
Same as above, but replacing xtprobit with xtreg and Y1 with Y2
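Spelled out, that second specification is:
xtreg Y2 i.post##c.X1 $controls i.industry i.year, vce(cluster id)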
Although I get results which theoretically make sense, I feel like I am messing things up.
Any thoughts on how to improve my code?

You could try reghdfe for the different fixed effects you're running, though I don't fully understand the question. http://scorreia.com/software/reghdfe/
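A minimal sketch of what that could look like for the continuous DV (reghdfe fits linear models, so for the binary DV this would amount to a linear probability model; install with ssc install reghdfe; variable names follow the question):
reghdfe Y2 i.post##c.X1 $controls, absorb(id industry year) vce(cluster id)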

Related

How do I create a non mutually exclusive categorical variable?

In Stata I am analyzing a study looking at pre-existing conditions that participants may have had that affect whether they experience side effects after vaccination.
For each participant, there are three binary variables that denote whether the participant had that condition (0: does not have, 1: does have), namely hypertension: 0/1, asthma: 0/1, diabetes: 0/1.
However, these categories are not mutually exclusive as the participant can have any combination of conditions: (no pre-existing conditions, only hypertension, only asthma, only diabetes, hypertension and asthma, hypertension and diabetes, asthma and diabetes, hypertension and asthma and diabetes).
I would like to perform a regression analysis to determine the risk of developing side effects given exposure to pre-existing conditions and to create a variable denoting the different combinations.
I would like to get the risk ratios for the following table:
Type of pre-existing condition   With side effects   No side effects   Risk ratio
None                             455                 316               ref
Hypertension                     51                  28
Asthma                           42                  26
Diabetes                         17                  7
Does anyone have code that would help in creating a new categorical variable for this regression analysis?
I've tried the following code, but because the categories are not mutually exclusive, the values assigned overwrite each other. new_var is the new variable denoting the pre-existing conditions.
generate new_var = 0
replace new_var = 1 if hypertension == 1
replace new_var = 2 if asthma == 1
replace new_var = 3 if diabetes == 1
This is as much statistical as Stata-oriented, but there is a Stata dimension, so here goes.
@Stuart has indicated some ways of getting composite variables in Stata, but as no doubt he would emphasise too, watch out: the numeric coding is arbitrary and not to be taken literally.
Other methods of creating composite variables were discussed in this paper and that advice remains valid.
That said, I suspect most researchers would not use a composite variable here at all, but would instead use as predictors the three indicators you already have, together with their interactions. That is the most direct and well-supported way to get estimates of effect size together with appropriate tests.
There are 8 possible combinations of preexisting conditions, and one approach is to add the variables like this, then manually label them:
generate new_var = hypertension * 4 + asthma * 2 + diabetes
label define preexisting 0 none 1 diabetes 2 asthma 3 "asthma and diabetes" 4 hypertension 5 "hypertension and asthma" 6 "hypertension and diabetes" 7 "hypertension, asthma and diabetes"
label values new_var preexisting
If you have additional preexisting condition variables, multiply them by 8, 16, 32 and so on to get unique values for every combination.
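For instance, with a hypothetical fourth 0/1 indicator, say obesity, the weighting would be:
generate new_var = hypertension * 4 + asthma * 2 + diabetes + obesity * 8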
Another approach is to use interactions in the regression.
regress outcome hypertension##asthma##diabetes
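Since the table above asks for risk ratios rather than risk differences, one option (a hedged sketch; side_effects is an assumed name for the 0/1 outcome) is a modified Poisson regression with robust standard errors, whose exponentiated coefficients approximate risk ratios against the none category:
poisson side_effects i.new_var, vce(robust) irr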

Using Logistic Regression For Timeseries Data in Amazon SageMaker

For a project I am working on, which uses annual financial report data (of multiple categories) from companies that have either been successful or gone bust/into liquidation, I previously created a fairly well-performing model on AWS SageMaker using the AWS stock 'Linear Learner' algorithm configured for logistic regression/classification.
This model just produces a simple "company is in good health" or "company looks like it will go bust" binary prediction, based on one set of annual data fed in; e.g.
query input:
{"data": [{
    "Gross Revenue": -4000,
    "Balance Sheet": 10000,
    "Creditors": 4000,
    "Debts": 1000000
}]}
inference output: "in good health" / "in bad health"
I trained this model by ignoring which year each company's values came from and piling in all of the annual financial report data (i.e., one year's financial data for one company = one input line), along with the label of "good" or "bad" - a good company is one which has existed for a while but hasn't gone bust; a bad company is one which was found to have eventually gone bust; e.g.:
label   Gross Revenue   Balance Sheet   Creditors   Debts
good    10000           20000           0           0
bad     0               5               100         10000
bad     20000           0               4           100000000
I hence used these multiple features (gross revenue, balance sheet...) along with the label (good/bad) in my training input, to create my first model.
I would like to use the same features as before (gross revenue, balance sheet, ...) but over multiple years; e.g., take the values from 2020 and 2019 and use these (along with the eventual company status of "good" or "bad") as a single input for my new model. However, I'm unsure of the following:
Is this an inappropriate use of logistic regression machine learning? I.e., is there a more suitable algorithm I should consider?
Is it fine, or terribly wrong, to just use the same technique as before but combine the data for both years into one input line, like:
label   Gross Revenue (2019)   Balance Sheet (2019)   Creditors (2019)   Debts (2019)   Gross Revenue (2020)   Balance Sheet (2020)   Creditors (2020)   Debts (2020)
good    10000                  20000                  0                  0              30000                  10000                  40                 500
bad     100                    50                     200                50000          100                    5                      100                10000
bad     5000                   0                      2000               800000         2000                   0                      4                  100000000
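(For what it's worth, building such one-row-per-company lines from long data, one row per company-year, is a standard wide reshape. A sketch in Stata with hypothetical variable names, though any data tool works:
reshape wide gross_revenue balance_sheet creditors debts, i(company_id) j(year)
This yields columns such as gross_revenue2019, gross_revenue2020, and so on.)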
I would personally expect that a company which has gotten worse over time (i.e., its finances are worse in 2020 than in 2019) should be more likely to be found to be "bad"/likely to go bust. So I would hope that, if I feed in data like the above example (i.e., earlier years' data comes before later years' data on an input line), my training job ends up creating a model which gives greater weighting to the earlier years' data when making predictions.
Any advice or tips would be greatly appreciated - I'm pretty new to machine learning and would like to learn more
UPDATE:
Using Long Short-Term Memory recurrent neural networks (LSTM RNNs) is one potential route I think I could try, but these seem to be commonly used with multivariate data over many dates; my data only has 2 or 3 dates' worth of multivariate data per company. I would want to train on the data I have for all the companies over the few dates available.
I once developed a so-called genetic time series in R. I used a genetic algorithm which sorted out the best solutions from multivariate data, which were then fitted with a VAR in differences or a VECM. Your data seems more macroeconomic or financial than user-centric, so a VAR or VECM seems appropriate. (It is certainly possible to treat time-series data in such a way that LSTM or other approaches can be used, but those are very common.) However, I do not know whether a VAR in differences or a VECM works with binary class labels. Perhaps if you calculated a metric outcome, which you later encode into a categorical label (or label it first), then a VAR or VECM may also be appropriate.
You could also collapse all the yearly data points into one data point per firm to forecast its survival, but you would lose a lot of insight. If you are interested in time-series ML, which works a little differently than neural networks or an elastic net (which could also be used with time series), let me know and we can work something out, or I'll paste you some sources.
Summary:
1.) It is possible to use an LSTM or an elastic net (time points may be dummies or treated as a cross-sectional panel), or to use a VAR in differences or a VECM with a slightly different outcome variable.
2.) It is possible, but you will lose information over time.
All the best,
Patrick

Linear Programming - Resetting a variable based on its cumulative count

Detailed business problem:
I'm trying to solve a production scheduling business problem as below:
I have two plants producing FG A and B respectively.
Both the products consume the same Raw Material x
I need to create a 30 day production schedule looking at the Raw Material availability.
FG A and B can be produced if there is sufficient raw material available on the day.
After every 6 days of production the plant has to undergo maintenance and the production on that day will be zero.
The objective is to maximize margin given the day-level raw material availability while adhering to the production constraint (i.e., shutdown after every 6th day).
I need to build a linear program to address the problem below:
Variable y: (binary)
Variable z: cumulative count of y
When z > 6, then y = 0. I also need to reset the cumulation of z after this point.
How can I express this as a MILP constraint? Are there any techniques for solving this problem? Thank you.
I think you can model your maintenance differently. Just forbid any sequences of 7 ones for y. I.e.
y[t-6]+y[t-5]+y[t-4]+y[t-3]+y[t-2]+y[t-1]+y[t] <= 6 for t=1,..,T
This is easier than using your accumulator. Note that the beginning needs some attention: you can use historic data for this. I.e., at t=1, the values for t=0,-1,-2,.. are known.
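More generally, if at most L consecutive production days are allowed before a maintenance day, the same rolling-window idea gives (a sketch, with y[s] for s <= 0 fixed to historical values):
sum_{s=t-L}^{t} y[s] <= L   for t=1,..,T
With L = 6 this reduces to the constraint above.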
Your accumulator approach is not inherently wrong. We often use it to model inventory. An inventory capacity is a restriction on how large the accumulated inventory can be.

How do I "fill down" a dataset in Stata, but for only a certain number of rows?

I have a dataset with observations at specific timepoints, but those timepoints (and the length of time between them) vary by group. I'm trying to "fill down" the data so that existing observations are carried down into missing cells. But I only want to do this for a certain number of rows after the original observation. So for example, I could have a dataset that looks like this:
For group A, I'd want to fill in the value for 2002 with 2001's value, 2004 with 2003, etc. I wouldn't want to fill in 2000 at all, since I don't have the preceding value. And I ALSO wouldn't want to fill in the 2011 value, because the "cyclelength" variable tells me that group A's observations are supposed to take place every two years, so I don't want to carry data forward past that. 2011 is just a genuinely missing value.
Similarly, in group B, I'd want to carry 2000's value forward into years 2001, 2002, and 2003 (because the "cyclelength" here is 4 years). I'd want to carry 2004's value into 2005, 2006, and 2007, but not beyond that--the later years should stay missing.
I've tried setting this up with the "carryforward" command, but haven't figured out how to have it stop filling down after a specified number of years that varies by group. Is there a way to do this, either with carryforward or otherwise?
This is a variation on a problem documented since 2000 as an FAQ: see here
The variation lies in limiting how far non-missing values are copied. But it falls easily to the same idea.
The last known value was recorded in certain years which we can copy down the dataset:
gen when_last_known = year if !missing(value)
bysort group (year) : replace when_last_known = when_last_known[_n-1] if missing(when_last_known)
Now the replacement wanted is
by group : replace value = value[_n-1] if missing(value) & (year - when_last_known) < cyclelength
That statement presupposes the sort order of the previous statement.
On Statalist (see here) you'd be expected to document that carryforward is a user-written command to be installed from SSC. That's a good convention here too.
In practice, it's good data management to keep the original data exactly as they arrive and do this on a clone of the variable. Sooner or later someone will ask to see the original values, and then you could be seriously embarrassed.
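Concretely, a sketch of that safer pattern using the variable names above:
* work on a clone so the original values stay intact
clonevar value_filled = value
gen when_last_known = year if !missing(value)
bysort group (year) : replace when_last_known = when_last_known[_n-1] if missing(when_last_known)
by group : replace value_filled = value_filled[_n-1] if missing(value_filled) & (year - when_last_known) < cyclelength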

Block bootstrap with indicator variable for each block

I want to run block bootstrap, where the blocks are countries, and include country indicator variables. I thought the following would work.
regress mvalue kstock i.country, vce(bootstrap, cluster(country))
But I get the following error.
. regress mvalue kstock i.country, vce(bootstrap, cluster(country))
(running regress on estimation sample)
Bootstrap replications (50)
----+--- 1 ---+--- 2 ---+--- 3 ---+--- 4 ---+--- 5
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx.xxxxx 50
insufficient observations to compute bootstrap standard errors
no results will be saved
r(2000);
It seems that this should work. If the block bootstrap picks the same country for every block, then it seems it should just drop the intercept.
Is my error coding or conceptual? Here is some code using the grunfeld data.
webuse grunfeld, clear
xtset, clear
generate country = int((company - 1) / 2) + 1
regress mvalue kstock i.country, vce(bootstrap, cluster(country))
The problem here is not with your coding, but is conceptual. The problem is that you cannot identify each coefficient in each regression in each bootstrap sample. Not all "countries" are included in the dataset for each bootstrap repetition. You can diagnose what is going on with the vce( , noisily) sub-option:
. regress mvalue kstock i.country, vce(bootstrap, cluster(country) noisily)
Errors are generated because some coefficients are missing when the regression runs on particular bootstrap samples. In each regression you can see that some country dummies are being omitted due to collinearity. This should be expected and makes a lot of sense -- a country dummy will be 0 for all observations in the bootstrap sample if that country was not drawn!
If you are really trying to estimate the coefficients on the country dummies, you are going to have to find another approach than bootstrapping with K clusters, where K is the number of countries. If you don't care about the coefficients on the dummies, you could use another command that simply absorbs the fixed effects and only reports the coefficients on the other independent variables (e.g., areg or xtreg). One way to think about what is going on is that it is analogous to this:
. bootstrap, cluster(country) idcluster(bscountry) noisily: regress mvalue kstock i.bscountry
With the idcluster() option, each country that is drawn in a bootstrap sample is given its own ID number. If a country is drawn twice, then there are two dummies. (The coefficients for the two dummies naturally turn out to be identical or near-identical.) However, the coefficients in this output are completely meaningless, because bscountry "2" will be a different country in different bootstrap iterations. Since you would ignore any output on the dummies, you might as well use a model like areg or xtreg, since they run more quickly.
Although there are many applications where bootstrapping with clusters works fine, the problem here is the inclusion of cluster dummies in the regression. This all raises the question of whether the exercise makes any sense at all. If you are trying to estimate the coefficients on the country dummies, then certainly not. Otherwise, the solutions above might be OK, but it is hard to say without knowing your research question.
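A minimal sketch of the absorbing route, reusing the grunfeld setup from the question (the seed is arbitrary):
webuse grunfeld, clear
xtset, clear
generate country = int((company - 1) / 2) + 1
* give each drawn cluster its own ID, absorb those dummies, and report only the slope
bootstrap, cluster(country) idcluster(bscountry) seed(12345): areg mvalue kstock, absorb(bscountry)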