How do I create a non mutually exclusive categorical variable? - stata

In Stata I am analyzing a study looking at pre-existing conditions that participants may have had that affect whether they experience side effects after vaccination.
For each participant, there are three binary variables that denote whether the participant had that condition (0: does not have, 1: does have), namely hypertension: 0/1, asthma: 0/1, diabetes: 0/1.
However, these categories are not mutually exclusive as the participant can have any combination of conditions: (no pre-existing conditions, only hypertension, only asthma, only diabetes, hypertension and asthma, hypertension and diabetes, asthma and diabetes, hypertension and asthma and diabetes).
I would like to perform a regression analysis to determine the risk of developing side effects given exposure to pre-existing conditions and to create a variable denoting the different combinations.
I would like to get the risk ratios for the following table:
Type of pre-existing condition   With side effects   No side effects   Risk ratio
None                             455                 316               ref
Hypertension                     51                  28
Asthma                           42                  26
Diabetes                         17                  7
Does anyone have code that would help in creating a new categorical variable for this regression analysis?
I've tried the following code, but because the categories are not mutually exclusive, the assigned values overwrite each other. Here new_var is the new variable recording the pre-existing conditions.
generate new_var = 0
replace new_var = 1 if hypertension == 1
replace new_var = 2 if asthma == 1
replace new_var = 3 if diabetes == 1

This is as much statistical as Stata-oriented, but there is a Stata dimension, so here goes.
@Stuart has indicated some ways of getting composite variables in Stata, but as no doubt he would emphasise too, watch out that the numeric coding is arbitrary and not to be taken literally.
Other methods of creating composite variables were discussed in this paper and that advice remains valid.
That said, I suspect most researchers would not use a composite variable here at all, but would use as predictors the three indicators you already have and their interactions. That is the only serious and supported method to get estimates of effect size together with appropriate tests.

There are 8 possible combinations of preexisting conditions, and one approach is to add the variables like this, then manually label them:
generate new_var = hypertension * 4 + asthma * 2 + diabetes
label define preexisting 0 none 1 diabetes 2 asthma 3 "asthma and diabetes" 4 hypertension 5 "hypertension and asthma" 6 "hypertension and diabetes" 7 "hypertension, asthma and diabetes"
label values new_var preexisting
If you have additional preexisting condition variables, multiply them by 8, 16, 32 and so on to get unique values for every combination.
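As an alternative to the arithmetic above, here is a minimal sketch that lets egen build the composite variable. It assumes the three indicators are coded 0/1 as described in the question; new_var2 is a hypothetical name. Note that group() numbers only the combinations that actually occur in the data, starting from 1, so the codes are as arbitrary as any other composite coding.

```stata
* Sketch: one category per observed combination of the three indicators.
* new_var2 is a hypothetical variable name.
egen new_var2 = group(hypertension asthma diabetes), label

* Check which combinations occur and how often.
tabulate new_var2
```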
Another approach is to use interactions in the regression.
regress outcome hypertension##asthma##diabetes
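Since the question asks for risk ratios and the outcome is binary, a log-link model may be closer to what is wanted than a linear regression. The following is only a sketch under the assumption that outcome is a 0/1 indicator of side effects; it is not part of the original answer.

```stata
* Sketch: log-binomial model; eform reports exponentiated coefficients,
* which are risk ratios under this family and link.
glm outcome i.hypertension##i.asthma##i.diabetes, family(binomial) link(log) eform

* A common fallback if the log-binomial model fails to converge:
* Poisson regression with robust standard errors.
poisson outcome i.hypertension##i.asthma##i.diabetes, vce(robust) irr
```

With interactions in the model, the exponentiated main-effect terms compare each single condition with the "none" group; combined conditions need the interaction terms as well.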


Trying to analyze panel data but feel like I am mixing up commands - could anybody review and check?

I have the following data structure:
186 unique firm acquisitions
Observations for 5 years per firm; 2 years before acquisition year, acquisition year, and 2 years after
Total number of observations is thus 186 * 5 = 930
Two dependent variables, which I would like to use in separate analyses: one is binary (1/0), the other is a ratio of two variables, which ranges from 0 to 5.
Acquisition years range from 2008 to 2019
Acquisitions took place in 20 different industries
Goal: test whether there are significant differences in target characteristics (the two DVs mentioned above) after acquisition vs before acquisition.
I expect the following unobserved factors to exist that can bias results:
Deal-specific: some deals involve characteristics that others do not
Target-specific: some targets might be more difficult to change, for example. Also, some targets get acquired twice in the period I am examining, so without controlling for that fact, the results will be biased.
Acquirer-specific: some acquirers are more likely to implement change than others. Also, some acquirers engage in multiple acquisitions during the period I am examining (max is 9)
Industry-specific: there might have been some unobserved industry-trends going on, which caused targets in certain industries to be more likely to change than targets in other industries.
Year-specific: since the acquisitions took place in different years between 2008 and 2019, observations might be biased by unobserved year-specific factors. For example, 2020 and 2021 observations will likely be affected by the COVID-19 pandemic.
I have constructed a dummy variable, post, which is coded 1 for year 1 and year 2 after acquisition, and 0 for year 1 and year 2 before acquisition.
I have been struggling with using the right models and commands in Stata. The code I have been using:
BINARY DV
First, I ran an OLS regression so that I could remove outliers after the regression:
reg Y1 post X1 post*X1 $controls i.industry i.year
Then, I removed outliers (not sure if this is the right method though):
predict estu if e(sample), rstudent
drop if abs(estu)>3.5
Then, ran the xtprobit regression below:
egen id = group(target_id acquiror_id)
xtset deal_id year
xtprobit Y1 post X1 post*X1 $controls i.industry i.year, vce(cluster id)
OTHER DV
Same as above, but replacing xtprobit with xtreg and Y1 with Y2
Although I get results which theoretically make sense, I feel like I am messing things up.
Any thoughts on how to improve my code?
You could look at reghdfe for handling the several fixed effects you are including. I don't fully understand the question, though. http://scorreia.com/software/reghdfe/
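To make the reghdfe suggestion concrete, here is a rough sketch for the continuous dependent variable. It assumes reghdfe (a community-contributed command) is installed, that X1 is continuous, and that the variables and the $controls global are defined as in the question. One point worth checking in the original code: in a Stata varlist, * is a wildcard rather than a multiplication operator, so post*X1 matches variable names that begin with post and end with X1. Factor-variable notation builds the interaction explicitly.

```stata
* Assumed one-time setup for the community-contributed command:
* ssc install ftools
* ssc install reghdfe

* i.post##c.X1 expands to post, X1, and their interaction.
* Industry and year fixed effects are absorbed rather than estimated as dummies.
reghdfe Y2 i.post##c.X1 $controls, absorb(industry year) vce(cluster id)
```

Whether industry and year are the right fixed effects to absorb, and which level to cluster on, depends on the design and is not settled by this sketch.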

Creating an ID based on factor and filling down with Stata

Consider the fictional data to illustrate my problem, which contains in reality thousands of rows.
Figure 1
Each individual is characterized by values attached to A, B, C, D, E. In Figure 1, I show 3 individuals for whom some characteristics are missing. Do you have any idea how I can get the following completed table (Figure 2)?
Figure 2
With an ID in Figure 1 I could have used the carryforward command to fill in the values. But since each individual has a different number of rows, I don't know how to create the ID.
Edit: All individuals share the characteristic "A".
Edit: the existing order of observations is informative.
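To detect where a new individual starts, compare each row's value of char with the previous row's: if the previous value is greater than or equal to the current one, a new individual has begun.
This works only if the data are ordered within individuals, which appears to be the case in your data.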
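* Flag the first row of each individual: the first observation, or any row
* whose char does not increase relative to the previous row.
gen id = 1 if (char[_n-1] >= char[_n]) | _n == 1
* Turn the flags into a running count, then copy it down to the other rows.
replace id = sum(id) if id == 1
replace id = id[_n-1] if missing(id)
* Add the missing id-char combinations, then drop the marker variable.
fillin id char
drop _fillin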
If one individual has only the characteristics A and C and the next individual has only the characteristics D and E, this won't work, but that situation seems impossible to detect from your data.
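The same logic can be tried on a small made-up dataset. This is a sketch only: the seven rows below are invented for illustration, and the ID is built in one step with a running sum, which is equivalent to the three-step version above.

```stata
clear
input str1 char
"A"
"C"
"A"
"B"
"D"
"A"
"E"
end

* A new individual starts at the first row, or wherever char fails to increase.
gen id = sum(_n == 1 | char[_n-1] >= char[_n])

* Rectangularize: one row per id-char combination, then drop the marker.
fillin id char
drop _fillin
list, sepby(id)
```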

How to create "non-standard" descriptive statistics more efficiently in Stata

Say I want to create some scalar value like median price/median income or mean downpayment/house price. I know I can first use the su command, extract the numerators and denominators separately from the r-class results, and then create the desired scalars.
However, when I have a dozen such scalars, each by household type, this approach is tedious in practice. Is there any way to do this more efficiently? If I could create a table containing such scalars within Stata, that would be even better.
Executive summary: don't use scalars; use variables instead.
There is a prior statistical issue, which is that (say) summary(y) / summary(x) is not necessarily equal to summary(y/x); in general, the two will differ. It seems to me that the latter usually makes more sense, but I set that aside here.
Here is one not too crazy example. How much do you have to pay (in US dollars circa 1978) per pound weight (physicists: mass, really) for various cars in the Stata auto dataset?
. sysuse auto
(1978 Automobile Data)
. gen pricePERlb = price/weight
. egen mean = mean(pricePERlb), by(rep78)
. tabstat mean, s(n mean) by(rep78)
Summary for variables: mean
     by categories of: rep78 (Repair Record 1978)

   rep78 |         N      mean
---------+--------------------
       1 |         2  1.479266
       2 |         8  1.731407
       3 |        30  1.895855
       4 |        18   2.25233
       5 |        11  2.472519
---------+--------------------
   Total |        69  2.049639
------------------------------
Now here's a small twist. The generate wasn't needed here. We could have gone directly to
egen mean = mean(price/weight), by(rep78)
The tools are all trivial: generate to create new variables, egen to create new variables that here can be summary statistics calculated for groups, and tabstat, among many other tabulation commands, to show results. Since the statistics here are by construction constant within groups, asking for their mean is just one of several ways of getting at them. Similarly, graph dot, graph hbar, etc. are immediate for display.
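For the ratio-of-summaries case raised in the question (for example median price / median income), a similar sketch with the auto dataset might look like the following. Weight stands in for income here purely for illustration, and the new variable names are hypothetical.

```stata
sysuse auto, clear

* Ratio of group medians: median(price) / median(weight) within each rep78.
egen med_price  = median(price),  by(rep78)
egen med_weight = median(weight), by(rep78)
gen ratio_of_medians = med_price / med_weight

* Median of the per-car ratio, for comparison; in general the two differ.
egen median_of_ratio = median(price/weight), by(rep78)

* Both are constant within groups, so their group means just display them.
tabstat ratio_of_medians median_of_ratio, s(mean) by(rep78)
```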

Dummy and Heckman

I'm using a Heckman selection model, which consists of two equations. I'm using a probit as the selection equation and a multiple regression as the outcome equation.
How can I put dummy variables into those equations?
Do we have to put the variables into logarithmic form?
How can I create logarithmic variables in Stata?
Thank you.
Here's an example of how you might do what you ask. The example looks at the effect of being a union member on log wages:
webuse union3
gen log_wage = ln(wage)
etregress log_wage age grade i.smsa i.black tenure, treat(union = i.south i.black tenure) twostep
etregress estimates an average treatment effect of an endogenous binary-treatment variable. In plain English, that means the "first-stage" is a probit. Estimation is by either full maximum likelihood or a two-step consistent estimator, as above.
The dummies are created on the fly by putting an i. in front of the covariates. This is called factor variable notation, and it also makes interactions a breeze. You can also do tab race, gen(d_) to create d_1, d_2, and d_3 (3 race dummies, one of which you can drop).
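If a Heckman selection model proper is what is needed, rather than the endogenous-treatment model above, the same ideas carry over to the heckman command. The following is a sketch using the womenwk example dataset from the Stata manuals; the choice of covariates is only illustrative, and ln_wage is a hypothetical name.

```stata
webuse womenwk, clear

* Log of the outcome; wage is missing for women not in work, so ln_wage is too.
gen ln_wage = ln(wage)

* Outcome equation: regression of ln_wage on educ and age.
* Selection equation: probit, with a dummy entered in factor-variable notation.
heckman ln_wage educ age, select(i.married children educ age) twostep
```

Whether to log a variable is a modelling decision and is not required by the Heckman model itself.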

CONTRAST for CLASS variable with more than two levels in PROC GLM

Background: When we test the significance of a categorical variable that has been coded as dummy variables, we need to test simultaneously that the coefficients of all the dummy variables are 0. For example, if X takes on values of 0, 1, 2, 3 and 4, I would fit dummy variables for levels 1-4 (assuming I want 0 to be baseline), then want to simultaneously test B1=B2=B3=B4=0.
If this is the only variable in my data set, I can use the overall F-statistic to achieve this. However, if I have other covariates, the overall F-test doesn't work.
In Stata, for example, this is (very, very) simply carried out by the testparm command as:
testparm i.x (after fitting the desired regression model), where the i. prefix tells Stata that X is a categorical variable to be treated as a set of dummy variables.
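As a concrete illustration of the Stata workflow just described, here is a minimal sketch using the auto dataset that ships with Stata. The model (price on rep78 and weight) is only an assumed example and not part of the original problem.

```stata
sysuse auto, clear

* Fit a model with a categorical predictor (rep78) and another covariate.
regress price i.rep78 weight

* Joint test that all rep78 dummy coefficients are zero.
testparm i.rep78
```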
Question/issue: I'm wondering how I can do this in SAS with a CONTRAST (or ESTIMATE?) statement while fitting a regression model with PROC GLM. Since I have scoured the internet and haven't found what I'm looking for, I'm guessing I'm missing something very obvious. However, all of the examples I've seen are NOT for categorical (class) variables, but rather two separate (say continuous) variables. The contrast statement in that case would simply be something like
CONTRAST 'Contrast1' y 1 z 1;
Otherwise, they're for calculating hypotheses like H_0: B1-B2=0.
I feel like I need to break down the hypothesis into smaller pieces and determine the set that defines the whole relationship, but I'm not doing it correctly. For example, for B1=B2=B3=B4=0, I thought I might say B1=B2=B3=-B4, then define (1) B1=-B4, (2) B2=-B4 and (3) B2=B3. I was trying to code this as a CONTRAST statement as follows (say X is in descending order in the data set: 4-0):
CONTRAST 'Contrast' x -1 0 0 1 0
x -1 0 1 0 0
x 0 1 1 0 0;
I know this is not correct, and I tried many, many variations and whatever random logic I could come up with. My problem is I have relatively novice-level knowledge of CONTRAST (and unfortunately have not found great documentation to help with this) and also of how this hypothesis test should really be formulated for the sake of estimation (do I try to split it up into pieces as I did above, or...?).
From my note above, you actually can get SAS to do this for you with PROC GENMOD and the CLASS statement and a TYPE3 specification.
proc genmod data=input;
class classvar ;
model slope= classvar othervar/ type3;
run;
quit;
In the example above, my class levels are in the classvar variable. The othervar is my other covariate.
At the end of the output, you see a table labeled LR Statistics For Type 3 Analysis. The row for classvar is the LR test of all the class effects=0.
Another option is PROC REG with a TEST statement (e.g., TEST x1=0, x2=0, x3=0, x4=0). That doesn't answer my initial question for PROC GLM, but it is an option if PROC REG gets the job done for your type of model.