Dummy and Heckman - stata

I'm using the Heckman selection model, which consists of two equations: a probit selection equation and a multiple regression outcome equation.
How can I put dummy variables into these equations?
Do the variables have to be in logarithmic form?
How can I create logarithmic variables in Stata?
Thank you.

Here's an example of how you might do what you ask. The example looks at the effect of being a union member on log wages:
webuse union3
gen log_wage = ln(wage)
etregress log_wage age grade i.smsa i.black tenure, treat(union = i.south i.black tenure) twostep
etregress estimates an average treatment effect of an endogenous binary-treatment variable. In plain English, that means the "first-stage" is a probit. Estimation is by either full maximum likelihood or a two-step consistent estimator, as above.
The dummies are created on the fly by putting an i. in front of the covariates. This is called factor variable notation, and it also makes interactions a breeze. You can also do tab race, gen(d_) to create d_1, d_2, and d_3 (3 race dummies, one of which you can drop).
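As a quick illustration of how factor-variable notation handles interactions (a sketch reusing the union3 variables above, not part of the original model):
* i.black##c.tenure expands to the black dummy, tenure, and their interaction
regress log_wage age grade i.smsa i.black##c.tenure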


factor variables may not contain noninteger values (margins, Stata)

I am trying to estimate the marginal effects of my xtlogit model in stata, which looks like this:
xtlogit onset c.l.log_welfarespending##c.l.ethnic_groups l.gdplog, re vce(robust)
The model itself runs smoothly and there are no complications at all. When I try to estimate the marginal effects:
margins log_welfarespending, at(l.ethnic_groups(= (0, 6))
I receive the error log_welfarespending: factor variables may not contain noninteger values. From what I have read, Stata sometimes needs to be told whether a variable is categorical or continuous. When I change the code to:
c.margins log_welfarespending, at(l.ethnic_groups(= (0, 6))
The error changes to
only factor variables and their interactions are allowed. Any suggestions on how to solve the problem?
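(A hedged suggestion, not from the original thread: because log_welfarespending is continuous, it should not be passed to margins as if it were a factor variable. The usual fix is to request its marginal effect through the dydx() option, along these lines:)
* sketch only: dydx() treats the lagged regressor as continuous
margins, dydx(l.log_welfarespending) at(l.ethnic_groups=(0 6))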

Value of coefficient (Beta1) at different values of other covariate (X2), hopefully graphed

(cross-posted at http://www.statalist.org/forums/forum/general-stata-discussion/general/1370770-margins-plot-of-treatment-effect-rather-than-y-for-values-of-a-covariate)
I'm running a multiple regression (the outcome variable is continuous; it happens to be GPA). The covariate of interest is a dummy variable for treatment status; another of the covariates is a pre-score. We want to look at how the treatment effect differs at various values of pre-score. The structure of the model is not complicated:
regress GPA treatment pre_score X3 X4 X5...
What I want is a graph that shows what the treatment effect is (values of Beta1) at various values of pre-score (X2). It's straightforward to get a graph with values of the OUTCOME at various values of X2:
margins, at(pre_score= (1(0.25)5)) post
marginsplot
I have consulted an array of resources and tried alternatives using marginscontplot, coefplot with recast, the dy/dx option, and so forth. I remain unsuccessful. But this seems like something that there must be a way to do; wanting to know if a treatment effect varies for values of a control (say, income) must be common.
Can anyone direct me to the right command, or options for margins, to output values of Beta1 (the coefficient on the treatment dummy), rather than of Y (GPA), at values of pre_score?
The question was resolved at Statalist. It turns out that margins alone can't do what I was trying to do; the model needs to be run with an interaction term. Then it's simple.
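For concreteness, here is a sketch of that interaction approach (covariate names as in the question):
* interact treatment with pre_score, then ask margins for dy/dx of treatment
regress GPA i.treatment##c.pre_score X3 X4 X5
margins, dydx(treatment) at(pre_score=(1(0.25)5))
marginsplot
With dydx(treatment), margins reports the treatment effect itself, rather than predicted GPA, at each value of pre_score.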

Code observation as belonging to quantile in Stata

In Stata, I wanted to be able to put observations in buckets based on a specific variable, or equivalently code observations as belonging to a certain quantile. I looked around for some existing code that would accomplish this task but didn't quite find what I wanted. I wrote the following simple ado:
program toquantiles
    version 13
    syntax varname [, n(integer 4)]
    quietly {
        local interval = 100/`n'
        local binVarName = "`varlist'_quantile"
        gen `binVarName' = `n'
        local upper = `n'-1
        forvalues i = 1/`upper' {
            local y = `i'*`interval'
            // Abuse the egen cmd to calculate the yth percentile.
            tempvar x
            egen `x' = pctile(`varlist'), p(`y')
            // Label an observation as belonging to the (n-i)th bin if the
            // value of the variable in question is greater than x.
            replace `binVarName' = `n'-`i' if `varlist' > `x'
            drop `x'
        }
    }
end
The output is a new variable, varname_quantile, coded 1, 2, 3, etc. according to the quantile bin each observation falls into. My code seems like a pretty naive approach to this problem.
Is there any built-in functionality that does what I do above? If not, are there any improvements to this ado that would speed up execution? Currently it runs quite slowly. (Slowly as in: it is faster to summarize all 100+ variables than to calculate the quintiles for one variable.) Thanks so much.
There is a terminology problem here, most simply illustrated by quartiles. The term can refer to three particular summary statistics, the lower and upper quartiles and the median in between, or to the first, second, third, and fourth quarters (some say quartiles here too), the intervals defined by falling below or above particular quartiles. (What happens when values equal particular quartiles is a matter of convention.)
In other words, quartiles, and more generally quantiles, can be particular levels (which I take to be the standard statistical use of the term) or intervals (a common (mis?)use of the term in some applied fields, e.g. applied economics).
It seems that you want the second sense.
Turning to Stata, doesn't xtile do this?
See also http://www.stata.com/support/faqs/statistics/percentile-ranks-and-plotting-positions/index.html
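For example, a minimal sketch (the variable name myvar is assumed):
* assign each observation to one of 4 quantile-based bins
xtile myvar_quantile = myvar, nq(4)
Note that xtile numbers bins from the bottom up (1 = lowest quarter), whereas the ado above assigns 1 to the top bin.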

Stata. How to use if statement with sum()?

I am trying to execute the following code:
forval i = 1/51 {
    // number of households
    by hhid, sort: gen nvals = _n==1
    count if (nvals & stateID == `i')
    local stateTotalHH = r(N)
    local avPersonHH`i' = sum(numper)/`stateTotalHH' if (nvals & stateID == `i')
    drop nvals
}
Everything works fine except that if is not allowed with sum(). How can I compute the sum of all values of the numper variable for each state, at the household level?
ps:
I cannot use collapse numper, by(stateID) because I have other estimations
also, I cannot do the following: duplicates drop hhid, force
Your problem does not even call for sum() with if, so it is best to start at the beginning.
Reconstructing your problem, which is not well explained:
You have observations for individuals within households (identifier hhid) within 50 states of the USA and the District of Columbia (identifier stateID).
You have a variable numper, the number of persons per household, and you want the average per state.
Observations are repeated for each individual in a household, so it is necessary to use just one observation per household.
You can tag each household once by
egen tag = tag(hhid)
The average as a new variable would be
egen avPersonHH = mean(numper/tag), by(stateID)
Stata is going to average numper/tag which variously will be numper/1 and numper/0; the missings from the latter division will just be ignored, which is what is wanted.
That variable is repeated for each household. To see just one value for each stateID,
tabdisp stateID, cell(avPersonHH)
What is wrong with your code? Here is a partial list:
a. No loop is required.
b. If it were, the statement by hhid, sort: gen nvals = _n==1 should not be repeated.
c. sum() is a function for cumulative sums across observations, not what you want here.
d. The line
local avPersonHH`i' = sum(numper)/`stateTotalHH' if(nvals & stateID ==`i')
would at best calculate one number, and in any case the if condition is misplaced: an if qualifier often makes sense on a Stata command, but putting if on the right-hand side of a local definition only makes sense when manipulating text containing commands.
Your comment on this line overlooks these basic misconceptions, c. and d.
e. You were aiming to have collected 51 values of averages in as many local macros, but still need to put them somewhere useful.
f. Separate calculation of totals and numbers is not required, as you can get Stata to calculate the mean for you.
(LATER) This code plays along step by step with your aversion to using collapse and duplicates, the grounds for which are not stated. But most experienced Stata users would be happy to use brute force:
duplicates drop hhid, force
collapse numper, by(stateID)
and then merge back. That solution is not only direct, but also uses fewer idiosyncratic Stata details, which can take time to figure out.
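For completeness, a sketch of that brute-force route, including the merge back (the tempfile name is a placeholder):
* keep one observation per household, then average within state
preserve
duplicates drop hhid, force
collapse (mean) avPersonHH=numper, by(stateID)
tempfile statemeans
save `statemeans'
restore
* attach the state-level average back to every observation
merge m:1 stateID using `statemeans', nogenerate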

Equivalent R^2 for Logit Regression in Stata

I am running Logit Regression in Stata.
How can I assess the explanatory power of the regression (in OLS, I look at R^2)?
Is there a systematic way to expand the regression with other independent variables (in OLS, I manually keep adding independent variables and look at the adjusted R^2; my guess is that Stata has simplified this manual process)?
The concept of R^2 is meaningless in logit regression and you should disregard the McFadden Pseudo R2 in the Stata output altogether.
Lemeshow recommends 'to assess the significance of an independent variable we compare the value of D with and without the independent variable in the equation' with the likelihood-ratio test (G): G = D(model without the variables [B]) - D(model with the variables [A]), where D is the deviance.
The Likelihood ratio test (G):
H0: coefficients for eliminated variables are all equal to 0
Ha: at least one coefficient is not equal to 0
When the LR test gives p > .05, do not reject H0, which implies that, statistically speaking, there is no advantage to including the additional IVs in the model.
Example Stata syntax to do this is:
logit DV IV1 IV2
estimates store A
logit DV IV1
estimates store B
lrtest A B // LR test: the restricted model B is nested in the full model A
Note, however, that many more aspects have to be checked and tested before we can conclude whether or not a logit model is 'acceptable'. For more details, I recommend visiting:
http://www.ats.ucla.edu/stat/stata/topics/logistic_regression.html
and consult:
David W. Hosmer and Stanley Lemeshow, Applied Logistic Regression, ISBN-13: 978-0471356325
I'm worried that you are getting the fundamentals of modelling wrong here:
The explanatory power of a regression model is theoretically determined by your interpretation of the coefficients, not by the R-squared. The R^2 represents the amount of variance that your linear model predicts, which may or may not be an appropriate benchmark for your model.
Likewise, the presence or absence of an independent variable in your model requires substantive justification. If you want to look at how the R-squared changes when adding or subtracting parts of your model, see help nestreg for help on nested regression.
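For instance, nestreg adds blocks of regressors in sequence and reports an incremental test for each block (a sketch with placeholder names):
* each parenthesized block is added in turn, with a test for the added block
nestreg: regress y (x1 x2) (x3)
With nestreg, lr: ... you get likelihood-ratio tests instead of Wald tests.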
To summarize: the explanatory power of your model and its variable composition cannot be determined just by crunching the numbers. You first need an adequate theory to build your model onto.
Now, if you are running logit:
Read Long and Freese (Ch. 3) to understand how log likelihood converges (or not) in your model.
Do not expect to find something as straightforward as the R-squared for logit.
Use logit diagnostics on your model, just as you would after running OLS.
You might also want to read about the likelihood-ratio chi-squared test, or run additional lrtest commands as explained by Eric.
I certainly agree with the above posters that almost any measure of R^2 for a binary model like logit or probit shouldn't be considered very important. There are, however, ways to see how good a job your model does at predicting. For example, check out the following commands:
lroc
estat class
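For instance, a minimal sequence (variable names reused from Eric's example):
logit DV IV1 IV2
lroc, nograph // area under the ROC curve
estat classification // classification table at the default 0.5 cutoff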
Also, here's a good article for further reading:
http://www.statisticalhorizons.com/r2logistic