I am using Stata 12 and I have to run an ordered probit (oprobit) on a panel dataset. I know that the "oprobit" command is meant for cross-sectional analysis. The newer version (Stata 13) has the "xtoprobit" command for random-effects ordered probit. I need a similar command for Stata 12. I have tried the "reoprob" command, but when I use it with my panel dataset I get the following error:
"factor variables and time-series operators not allowed"
That means you need to create your own dummy variables instead of using the factor variable notation i.dummyvar. Try this:
tab dummyvar, gen(D)
reg y D*
This creates a set of dummy variables (D1, D2, ...) reflecting the observed values of the tabulated variable.
Some of the older user-written commands do not know what to do with the factor variable notation, which is convenient, but fairly new.
You can also explore xi for more complicated tasks.
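A minimal sketch of the same idea on the auto dataset that ships with Stata. The variable names (rep78, price, mpg) come from that dataset, not from the question, and regress stands in for any command that rejects the i. notation:

```stata
* Load an example dataset shipped with Stata
sysuse auto, clear

* Create one indicator per observed value of rep78: D1, D2, ..., D5
tabulate rep78, generate(D)

* Leave out D1 as the base category to avoid perfect collinearity
regress price mpg D2-D5
```

The same list of indicators (D2-D5 here) can then be passed to the older command in place of i.rep78.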
I am a complete noob, so this is a fairly basic question. I'm looking to build a decision tree-based classification model in SAS.
I'm unable to embed pictures in my question nor am I able to attach images, but I have a dataset that I am working with.
Here is a link preview of my data set:
I'm trying to build this decision-tree using the hpsplit procedure in SAS, but it's not working. I think it's because:
(1) I am not using all of the categorical variables
(2) I have missing values in the "node-caps" column: the available options are yes, no, and ? - I think I should be using the "ASSIGNMISSING" option, but I'm not sure. See image:
Here is my current code:
proc hpsplit data=bcancer seed=1;
class class;
model class = Age Menopause tumor_size inv_nodes node_caps deg_malig breast breast_quad irradiat;
grow entropy;
prune costcomplexity;
run;
I think that I should be:
(1) Adding more variables to the second row (as they are categorical)
(2) Adding the "ASSIGNMISSING" option to account for missing values in one column. See link: https://i.imgur.com/aJmB3kx.png
NOTE: The ASSIGNMISSING= option has not been specified. Because of this, all observations with
missing values in the explanatory variables will be excluded from tree construction.
ERROR: Character variable appeared on the MODEL statement without appearing on a CLASS statement.
ERROR: Unable to create a usable predictor variable set.
NOTE: The SAS System stopped processing this step because of errors.
NOTE: PROCEDURE HPSPLIT used (Total process time):
real time 0.01 seconds
cpu time 0.01 seconds
For reference, this is the error I see in the log. Any help would be greatly appreciated!
Categorical variables need to be in the CLASS statement. It looks like many of your variables are categorical and need to be in the CLASS statement.
Continuous variables should be numeric, but I don't see any in your data.
Because the categorical variable node_caps is a character variable, "?" will be treated as a level of its own, not as missing. Do you want those values coded as missing or kept as their own level of that variable?
proc hpsplit data=bcancer seed=1;
class class age menopause tumor_size inv_nodes node_caps deg_malig breast breast_quad irradiat;
model class = Age Menopause tumor_size inv_nodes node_caps deg_malig breast breast_quad irradiat;
grow entropy;
prune costcomplexity;
run;
I have a panel dataset covering multiple firms over 8 years, and I'm trying to run a pooled OLS with industry-specific effects, using the reghdfe command to control for a categorical variable (NAICS Industry Code). I typed
reghdfe DV IV control variables i.year, absorb(NAICS Industry Code)
Is this the correct way to use the command? Is it correct to use i.year within the variables or should I add it to the absorbed variables?
In addition I'm using a Fixed Effect Panel Regression and control for clustered standard errors. Do I have to control for clustered standard errors in the reghdfe as well or is it sufficient to just do it within the fixed effect panel regression?
You should include your year variable in the absorb() option, which is how reghdfe is intended to be used:
reghdfe y x, absorb(naics year)
Alternatively, you can also use reg y x i.naics i.year.
I assume NAICS codes to be numeric; otherwise, you might need to transform the variable to numeric, e.g. using egen num_naics= group(naics).
Note: The R-squared rests on different assumptions and might differ between the two commands.
Note_2: If your question is specifically about coding, everyone is better off when you provide example data. Statistical questions might be better suited for Cross Validated.
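On the clustering part of the question: standard errors are specified separately for each estimation command, so clustering in one regression does not carry over to another. A hedged sketch, assuming the nlswork example dataset (downloaded with webuse) in place of your data, with idcode standing in for the firm identifier and ind_code for the industry code:

```stata
* reghdfe is user-written; install once with: ssc install reghdfe
webuse nlswork, clear

* Absorb industry and year effects, cluster standard errors by panel unit
reghdfe ln_wage tenure, absorb(ind_code year) vce(cluster idcode)

* The same model with explicit indicator variables
regress ln_wage tenure i.ind_code i.year, vce(cluster idcode)
```

Both commands should give the same coefficient on tenure; whether to cluster at the firm level or at some other level depends on your data and is a statistical choice rather than a coding one.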
I am trying to perform k-means cluster analysis on a dataset with 20 variables and 9000 observations. I want to create 2 new variables (Usage and Payment Ratio) using 4 of the 20 variables in the dataset (Balance, Due date, Payment, Minimum Payment). For example: Usage = (Balance/Due date) and Payment_ratio = (payment/minimum payment). Should I create these variables at the very start, or should I first cap the outliers, remove or impute missing values, and only then create the 2 new variables?
I tried cleaning the data first and then creating the two variables. I then capped the outliers and treated missing values again for these 2 variables. After doing this, the 2 new variables are skewed. I tried log, sqrt, and similar transformations, but the variables are still not normal, i.e., they are still skewed. The next step is factor analysis, but I cannot proceed.
Can anyone please suggest a way to go about this? Thanks.
Why does Stata complain with a cryptic error when I use string variables in the table command?
Consider the following toy example:
sysuse auto, clear
decode foreign, g(foreign_str)
table foreign, contents(n foreign_str mean mpg)
This raises an r(111) variable __000002 not found error in Stata 13.1.
Tracing the error tells me that it is trying to run format __000002 %9.0gc and crashing when it does not find the variable. If I switch the order of the variables in the clist, that is, I run table foreign, contents(mean mpg n foreign_str), I get the same error but with __000003 instead of __000002.
So it appears that Stata crashes when it finds the string variable. If I replace the string variable with a numeric variable, the error doesn't occur.
I know it is not meaningful to compute summary statistics on string variables, but counting the number of observations of a string variable (in each group specified by the rowvar) makes perfect sense.
Stata complains because variable __000002 (or __000003 if you change the order) is not created by the collapse command (which is used internally by table) due to the following error:
collapse (count) foreign_str
type mismatch
r(109);
What really happens is not visible to the user because capture is used in combination with collapse and the output from trace confirms that:
- capture collapse `clist' `wgt', by(`varlist' `by') fast `cw'
= capture collapse (count) __000002=foreign_str (mean) __000003=mpg , by(foreign ) fast
There are only provisions for error codes 111 and 135, so the table command continues to run until it hits a wall when it cannot find the aforementioned variables.
Stata 14 and later versions check the variable(s) provided by the user in the contents() option and only accept numeric types, issuing a more informative error message if this is not the case.
It is also worth pointing out that collapse treats strings differently in more recent Stata versions.
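One possible workaround, sketched here under the assumption that you are on the older table syntax with the contents() option (Stata 16 or earlier), is to count a numeric indicator instead of the string itself. The variable name has_str is made up for illustration:

```stata
sysuse auto, clear
decode foreign, generate(foreign_str)

* 1 if the string is nonempty, 0 if it is empty (missing)
generate byte has_str = !missing(foreign_str)

* Summing the indicator counts the nonmissing strings in each group
table foreign, contents(sum has_str mean mpg)
```

Because has_str is numeric, collapse has no type mismatch to trip over, and the sum reproduces what n foreign_str was meant to report.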
In Stata, I have a factor variable with 50 levels (state) and an integer-valued variable (year). I want to create 50 new variables: 50 interactions of state indicators with the year variable. Is there a way to do this without writing 50 lines of code?
I can produce the 50 state dummies with tabulate state, generate (state), but I don't know how to get further than that without writing a line to create each individual state-year variable.
I want to use the new state-year variables in a regression. Stata's factor notation makes it easy to include the state-year variables as regressors without creating them beforehand (e.g., with a command like regress y i.state#c.year), but some add-on functions don't support factor notation.
You can try using xi, both as a stand-alone command to create indicator and interaction terms, and as a command prefix. A nonsensical example:
clear all
set more off
sysuse auto
* stand-alone
xi i.rep78*mpg
* as prefix
xi: regress price i.rep78*mpg
Run help xi for all the details.
Edit
To make this a bit clearer, suppose the regress command did not admit the use of either factor variable notation or the xi: prefix. Then using the xi stand-alone syntax you could create the indicator and interaction terms (which answers your original question) and then use those terms with the regress command:
sysuse auto, clear
xi i.rep78*mpg
regress price mpg _Irep78* _IrepXmpg*
(Remember to use Stata's help capabilities. Running search interactions, for example, leads you to xi......Interaction expansion.)
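If you would rather control the variable names yourself, a loop over the observed levels is an alternative to xi. This is only a sketch on the auto dataset, with rep78 playing the role of state and mpg the role of year; it assumes the categorical variable is numeric, and the rep78_#_mpg names are invented for illustration:

```stata
sysuse auto, clear

* Collect the distinct nonmissing values of the categorical variable
levelsof rep78, local(levels)

* One interaction variable per level: indicator times the continuous variable
foreach l of local levels {
    generate rep78_`l'_mpg = (rep78 == `l') * mpg if !missing(rep78)
}

* Inspect the new variables
describe rep78_*_mpg
```

With 50 states the loop body stays the same; only the variable names change. As with any full set of interactions, one level usually has to be left out of the regression as the base category.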