I have a group of treated firms in a country, and for each firm I would like to find the closest match in terms of industry, size, and profitability in the rest of the country. I am working in Stata. All I need is to form a control group; could anybody guide me with the code? That would be greatly appreciated! I currently have the following, which doesn't get me what I need:
psmatch2 (logpension) (treated sector logassets logebitda), logit ate
Here's how you might match on x1 and x2 using Mahalanobis distance as a metric, to get the effect on y from treatment t:
use http://ssc.wisc.edu/sscc/pubs/files/psm, clear
psmatch2 t, mahalanobis(x1 x2) outcome(y) ate
The variable _n1 stores the observation number of the matched control observation for every treatment observation.
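For example, to eyeball a few of the resulting pairs (a minimal sketch; psmatch2 also creates the _id and _treated variables used here):
sort _id
list _id _treated _n1 in 1/10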
The following is a full set of code you can run to find the average treatment effect on the treated (your most important result) and to check whether the data are balanced (i.e., whether your result is valid). Before you run it, make sure your treatment variable is coded so that 0 marks the control group and 1 marks the experimental/treatment group. "neighbor(1)" means I chose nearest-neighbor matching: it pairs each treated observation with the control observation whose propensity score differs by the least in absolute value.
psmatch2 treated sector logassets logebitda, outcome(logpension) neighbor(1) common
After running psmatch2, you need to make sure your data are balanced, so run this:
pstest sector logassets logebitda, treated(treated)
If any of the t-tests show a p-value below 0.05, your data are not balanced. To check the balance of your data visually, you can also run
psgraph
right after your psmatch2 command.
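Since your end goal is to form a control group: psmatch2 also creates _weight, which is nonmissing for control observations that were used as a match, so a hedged sketch for flagging your control group is:
gen byte in_control = (_treated == 0) & (_weight < .)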
Good luck!
I am using the propensity score stratification method. I got some output but can't interpret it, and I am looking for a source on how to interpret those results.
I have divided the PS scores into 5 groups and got this output at the end after running some code:
obs = 1    type = 0    freq = 10
sum_wt = 1010988.4    sum_diff = 0.0015572
mean_diff = 0.0015572    SE_diff = 0.0000994551
I know that the freq column stands for 2*5 (the number of groups), that mean_diff is equal to sum_diff, and that SE_diff is the square root of 1 - the sum of weights.
Does it say that ranking the PS scores into 5 groups is an appropriate approach? Which of the above criteria should I use for the final decision?
I believe your output is just describing the distribution within the groups. You evaluate whether propensity score matching (in your case, stratified matching) works by looking at the absolute standardized differences of the variables pre- versus post-matching.
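In Stata, those standardized differences are what pstest reports as "% bias"; a minimal sketch, assuming your treatment indicator is named treated and x1-x3 stand in for your covariates, run after your matching command:
pstest x1 x2 x3, treated(treated) both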
Here is a peer-reviewed paper my colleagues and I published that incorporates propensity score matching. There are some details in the methodology section that I wrote which should answer your question on how to evaluate whether your approach is working.
I am trying to create a comparison group of observations using propensity score matching. There are some characteristics that I care more about matching on than others. My questions are:
Is it possible to adjust the relative weights of variables I'm matching on when constructing the propensity score?
If so, how would one do this in Stata (with the psmatch2 command, for example)?
Thanks!
Take a look at coarsened exact matching. There's a user-written command called cem that implements this in Stata and several other packages.
However, this is not equivalent to PSM, where the scores are estimated rather than imposed by the analyst.
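A minimal sketch of cem (x1-x3 and the cutpoints are illustrative placeholders; the coarsening you choose for each variable plays the role of the relative weights you asked about, since it controls how finely that variable must match):
ssc install cem
cem x1 (0 25 50 75 100) x2 x3, treatment(treated)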
(cross-posted at http://www.statalist.org/forums/forum/general-stata-discussion/general/1370770-margins-plot-of-treatment-effect-rather-than-y-for-values-of-a-covariate)
I'm running a multiple regression (the outcome variable is continuous; it happens to be GPA). The covariate of interest is a dummy variable for treatment status; another of the covariates is a pre-score. We want to look at how the treatment effect differs at various values of the pre-score. The structure of the model is not complicated:
regress GPA treatment pre_score X3 X4 X5...
What I want is a graph that shows what the treatment effect is (values of Beta1) at various values of pre-score (X2). It's straightforward to get a graph with values of the OUTCOME at various values of X2:
margins, at(pre_score= (1(0.25)5)) post
marginsplot
I have consulted an array of resources and tried alternatives using marginscontplot, coefplot with recast, the dy/dx option, and so forth. I remain unsuccessful. But this seems like something that there must be a way to do; wanting to know if a treatment effect varies for values of a control (say, income) must be common.
Can anyone direct me to the right command, or options for margins, to output values of Beta1 (the coefficient on the treatment dummy), rather than of Y (GPA), at values of pre_score?
The question was resolved at Statalist. It turns out that margins alone can't do what I was trying to do; the model needs to be run with an interaction term. Then it's simple.
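For later readers, a minimal sketch of that interaction approach using the variable names from the post (## includes both main effects and the interaction; dydx() then reports the treatment effect rather than predicted GPA):
regress GPA i.treatment##c.pre_score X3 X4 X5
margins, dydx(treatment) at(pre_score = (1(0.25)5))
marginsplot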
In Stata, I wanted to be able to put observations into buckets based on a specific variable or, equivalently, code observations as belonging to a certain quantile. I looked around for existing code that would accomplish this task but didn't quite find what I wanted, so I wrote the following simple ado:
program toquantiles
    version 13
    syntax varname [, n(integer 4)]
    quietly {
        local interval = 100/`n'
        local binVarName = "`varlist'_quantile"
        // Start everyone in bin n (the lowest values); leave missings missing.
        gen `binVarName' = `n' if !missing(`varlist')
        local upper = `n' - 1
        forvalues i = 1/`upper' {
            local y = `i'*`interval'
            // Abuse the egen command to calculate the yth percentile.
            tempvar x
            egen `x' = pctile(`varlist'), p(`y')
            // Label this observation as belonging to the (n-i)th bin if the
            // value of the var in question is greater than x.
            replace `binVarName' = `n'-`i' if `varlist' > `x' & !missing(`varlist')
            drop `x'
        }
    }
end
The output is that each observation gets a new variable, varname_quantile, coded as 1, 2, 3, etc. based on the quantile in which it falls. My code seems like a pretty naive approach to this problem.
Is there any built-in functionality that does what I do above? If not, are there any improvements to this ado that would speed up execution? Currently it runs quite slowly. (Slowly as in: it is faster to summarize all 100+ variables than to calculate the quintiles for one variable.) Thanks so much.
There is a terminology problem here, most simply illustrated by quartiles. On one hand there are the quartiles themselves, three particular summary statistics: the lower and upper quartiles and the median in between. On the other hand there are the first, second, third, and fourth quarters (some say quartiles here too): intervals defined by falling below or above particular quartiles. (What happens when values equal particular quartiles is a matter of convention.)
In other words, quartiles and more generally quantiles can be particular levels (which I take to be the standard statistical use of the term) or intervals (a common (mis?)use of the term in some applied fields, e.g. applied economics).
It seems that you want the second sense.
Turning to Stata, doesn't xtile do this?
See also http://www.stata.com/support/faqs/statistics/percentile-ranks-and-plotting-positions/index.html
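For concreteness, a minimal sketch with xtile (myvar is a placeholder):
xtile myvar_quantile = myvar, nq(4)
Note that xtile codes the lowest group as 1, whereas the ado above, as written, ends up assigning 1 to the highest group.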
I've estimated a model via maximum likelihood in Stata and was surprised to find that estimated standard errors for one particular parameter are drastically smaller when clustering observations. I take it from the Stata manual on robust standard error estimation in ML that this can happen if the contributions of individual observations to the score (the derivative of the log-likelihood) tend to cancel each other within clusters.
I would now like to dig a little deeper into what exactly is happening and would therefore like to have a look at these score contributions. As far as I can see, however, Stata only gives me the total sum as e(gradient). Is there any way to pry the individual summands out of Stata?
If you have written your own command, you can create a new variable containing these scores using the ml score command. Official Stata commands, and most polished user-written commands, often have score as an option for predict, which does the same thing with an easier syntax.
These will give you the score of the log likelihood ($\ell$) with respect to the linear predictor, $x\beta = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots$. To get the derivative of the log likelihood with respect to an individual parameter, say $\beta_1$, you just use the chain rule:
$\frac{\partial \ell}{\partial \beta_1} = \frac{\partial \ell }{\partial x\beta} \frac{\partial x\beta}{\partial \beta_1}$
The scores returned by Stata are $ \frac{\partial \ell }{\partial x\beta}$, and $\frac{\partial x\beta}{\partial \beta_1} = x_1$.
So, to get the score for $\beta_1$, you just multiply the score returned by Stata by $x_1$.
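To make that concrete, a minimal sketch using probit (an arbitrary choice; any estimator whose predict supports the score option works the same way):
probit y x1 x2
predict double s, score      // observation-level derivative of lnL with respect to xb
gen double s_b1 = s * x1     // score contribution for the coefficient on x1
total s_b1                   // should sum to (numerically close to) zero at the MLE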