Stata: how I add an interaction between two variables? - stata

I have multiple regression model question:
I am required to add an interaction between two variables in stata, but every time I try and do so (reg log_p c##lux) I get a message prompted, "c: factor variables may not contain noninteger values".
Any advice on where to go from here?

Related

How to generate dummy variables from two categorical variables?

I have two variables containing state identifier and year. If I want to create dummy variables indicating each state, I usually write the following code:
tab state_id, gen(state_id_)
This will give me a group of variables, state_id_1,state_id_2,... etc. But what operations are available if I want to get a list of dummy variables for the interaction of state and year, for instance a dummy variable indicating state 1 in year 2005.
Have you tried looking at xi (https://www.stata.com/manuals13/rxi.pdf)? It will create dummies for each of the categorical variables and for the interaction of those two. So if you do:
xi i.state*i.year
This should give you what you are looking for, but note that it will naturally code this and omit the first category of each of your categorical variables.

Variables creation before or after Data cleaning?

I am trying to perform k-means cluster analysis on a dataset with 20 variables and 9000 observations. I want to create 2 new variables (Usage and Payment Ratio) using 4 (Balance, Due date, Payment, Minimum Payment) of the 20 variables in the dataset. Ex: Usage = (Balance/Due date) and Payment_ratio = (payment/minimum payment). Now, should I make these variables at the start itself, or should I first cap the outliers, remove/impute missing values and then create these 2 new variables?
I tried to first clean the data and then created these two variables. Then I also capped the outliers and treated missing values again for these 2 variables. After doing this, these 2 new variables are skewed and I tried taking log, sqrt etc but the variables are still not normal i.e there is still skewness in these variables. After this step, I need to do the Factor Analysis but I cannot proceed.
Can anyone please suggest a way to go about this? Thanks.

Stata: Esttab of xtreg with time fixed effects

I'm trying to save output from several hundred eststo's storing results of bivariate probability models into one excel file using esttab. It works for xtlogit(both ,re and ,pa), xtprobit (both ,re and ,pa) and for the linear probability model xtreg (both standard and ,fe. However, when I use xtreg y x i.year, fe I get the error message too many base levels specified. Google doesn't help me much.
I've been trying for an hour to create a reproducible example but the stata datasets all work fine. It does not seem to be due to the number of years or the fact that different specifications have data for different years. Still, the normal xtreg, fe' works, the problem only appears with time dummies. The weirdest thing is that it works for all subsets of my variables but not for the whole list (again just the time fixed effects specifications).
Does anyone have an idea how to proceed? Using drop(*.year) works whenever the problem does not arise (so in specifications where it works, I get outputs without the year dummies) but does not prevent the too many base levels specified error; ,nobaselevels has no apparent effect as well. Is there a way to remove the time fixed effects from eststo before I pass those on to esttab? Any workaround would be appreciated as well.
The problem you might be facing is that of Stata creating different base levels for the factor variable year, in different regressions.
Try fixing the factor variable base level beforehand with fvset:
fvset base <some_number> year
Check help fvset and the manual entry for details. Also, read the source given below, which contains more information.
Source: two posts from Statalist; one from Tim Wade and another by Jeff Pitblado.

Panel Data Ordered Probit in Stata 12

I am using Stata 12 and I have to run a Ordered Probit (oprobit) with a panel dataset. I know that "oprobit" command is compatible with cross-section analysis. In the new version of Stata (Stata 13) they have "xtoprobit" command to account for Random Effects Ordered Probit. I need the similar command for Stata 12. I have checked "reoprob" command but when I use it with my panel dataset I have the following error :
"factor variables and time-series operators not allowed"
That means you need to create your own dummy variables instead using the factor variable notation i.dummyvar. Try this:
tab dummyvar, gen(D)
reg y D*
This will creates a set of dummy variables (D1, D2,...) reflecting the observed values of the tabulated variable.
Some of the older user-written commands do not know what to do with the factor variable notation, which is convenient, but fairly new.
You can also explore xi for more complicated tasks.

Correlation with one variable and a lot of others

In Stata, is there a quick way to show the correlation between a variable and a bunch of dummies. In my data I have an independent variable, goals_scored in a game, and a bunch of dummies for stadium played. How can I show the correlation between the goals_scored and i.stadium in one table, without getting the correlations between stadiums, which I do not care about.
Here's one way:
#delimit;
quietly tab stadium, gen(D); // create dummies
foreach var of varlist D* {;
quietly corr goals_scored `var';
di as text "`: variable label `var'': " as result r(rho);
};
drop D*; // get rid of dummies
cpcorr from SSC (install with ssc inst cpcorr) supports minimal correlation tables, i.e. only the correlations between one set and another set, without the others. But it's an old program (2001) and doesn't support factor variables directly. The indicator variables (a.k.a. dummy variables) would have to exist first.
If you store all of the stadium variables in a local, you would probably loop through them to pull the correlations.
1.
If all stadium variables are placed next to each other in the dataset:
foreach s of varlist stadium1-stadium150 {
// do whatever
}
2a.
If the stadium variables are not next to each other, use order to get there.
2b.
If the variable names follow a pattern, there might be another workaround.
3.
I would not use correlation for this. Depending on the distribution of goals, I would consider something else.