Creating a matched sample using some variables - stata

My dependent variable is an indicator variable (1 or 0). My N is very large, but the share of observations where the dependent variable equals 1 is very small (only 3%), so when I run regressions my coefficients are tiny. I therefore want to create a matched sample across some dimensions. I know I should use psmatch2.
I tried
psmatch2 depvar v1 v2 v3, common
but I still have only a very small number of "treated" and a large number of "untreated" observations, pretty much what I had in my original data. I want to keep the observations where y=1 and build a comparison sample of observations that are similar across v1, v2, v3, with the two groups reasonably similar in size.
Any idea?

Instead of creating a matched sample, an alternative approach would be to consider quasi-matching techniques, such as entropy balancing and coarsened exact matching.
To implement entropy balancing in Stata, you can try something like below:
ssc install ebalance
ebalance treat_var v1 v2 v3, tar(2)
The above commands install the ebalance package and assign weights to each observation such that the mean and the variance of variables v1, v2, v3 are roughly the same for treatment and control groups. Use help ebalance to find out more.
To implement coarsened exact matching in Stata, you can try something like the following:
ssc install cem
cem v1 v2 v3, tr(treat_var)
Both cem and ebalance generate a weight for each observation. The weight variables are stored in the dataset and named cem_weights and _webal respectively. To incorporate entropy balancing or coarsened exact matching into your regression analysis, estimate a weighted regression, for example with the ebalance weights:
regress y treat_var [aweight=_webal]
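A minimal sketch of the two weighted regressions side by side. It assumes a dataset in memory with the placeholder names y, treat_var, v1, v2 and v3 used above, and that both packages are installed from SSC. The weight types shown (pweight for ebalance, iweight for cem) are the ones I recall from the packages' own help files, so check help ebalance and help cem before relying on them. The coefficient estimates are the same as with aweight; only the standard errors differ.

```stata
* Assumes y, treat_var, v1, v2, v3 exist in the data in memory
* (placeholder names taken from the answer, not real variables)

* Entropy balancing: stores the weights in _webal
ebalance treat_var v1 v2 v3, targets(2)
regress y treat_var [pweight=_webal]

* Coarsened exact matching: stores the weights in cem_weights
* (unmatched observations get weight 0)
cem v1 v2 v3, treatment(treat_var)
regress y treat_var [iweight=cem_weights]
```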

Related

Adjust weighting on propensity score matching in Stata

I am trying to create a comparison group of observations using propensity score matching. There are some characteristics that I care more about matching on than others. My questions are:
Is it possible to adjust the relative weights of variables I'm matching on when constructing the propensity score?
If so, how would one do this in Stata (with the psmatch2 command, for example)?
Thanks!
Take a look at coarsened exact matching. There's a user-written command called cem that implements this in Stata and several other packages.
However, this is not equivalent to PSM, where the scores are estimated rather than imposed by the analyst.

Value of coefficient (Beta1) at different values of other covariate (X2), hopefully graphed

(cross-posted at http://www.statalist.org/forums/forum/general-stata-discussion/general/1370770-margins-plot-of-treatment-effect-rather-than-y-for-values-of-a-covariate)
I'm running a multivariate regression (outcome variable is continuous, happens to be GPA). The covariate of interest is a dummy variable for treatment status; another of the covariates is a pre-score. We want to look at how the treatment effect differs at various values of pre-score. The structure of the model is not complicated:
regress GPA treatment pre_score X3 X4 X5...
What I want is a graph that shows what the treatment effect is (values of Beta1) at various values of pre-score (X2). It's straightforward to get a graph with values of the OUTCOME at various values of X2:
margins, at(pre_score= (1(0.25)5)) post
marginsplot
I have consulted an array of resources and tried alternatives using marginscontplot, coefplot with recast, the dy/dx option, and so forth. I remain unsuccessful. But this seems like something that there must be a way to do; wanting to know if a treatment effect varies for values of a control (say, income) must be common.
Can anyone direct me to the right command, or options for Margins, to output values of Beta1 (coefficient on treatment dummy), rather than of Y (GPA), at values of the pre_score?
Question was resolved at Statalist. It turns out that margins alone can't do what I was trying to do; the model needs to be run with an interaction term between treatment and pre_score. Then it's simple.
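For readers who land here, a sketch of what the interaction-term approach could look like. It uses Stata's shipped auto data as a stand-in so it runs as is: price plays the role of GPA, foreign the treatment dummy, mpg the pre-score. This is my reading of the resolution, not the exact code posted on Statalist.

```stata
* Stand-in data: price ~ GPA, foreign ~ treatment, mpg ~ pre_score
sysuse auto, clear

* Factor-variable notation adds both main effects and their interaction
regress price i.foreign##c.mpg weight

* Treatment effect (discrete change from 0 to 1) at chosen pre-score values
margins, dydx(foreign) at(mpg=(15(5)35))

* Plot the treatment effect against the pre-score
marginsplot
```

With the variables from the question, the same pattern would be regress GPA i.treatment##c.pre_score X3 X4 X5, followed by margins, dydx(treatment) at(pre_score=(1(0.25)5)) and marginsplot.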

Propensity score matching on stata

I have a group of treated firms in a country, and for each firm I would like to find the closest match in terms of industry, size and profitability in the rest of the country. I am working on Stata. All I need is to form a control group- could anybody guide me with the code? That'd be greatly appreciated! I currently have the following, which doesn't get me what I need:
psmatch2 (logpension) (treated sector logassets logebitda), logit ate
Here's how you might match on x1 and x2 using Mahalanobis distance as a metric, to get the effect on y from treatment t:
use http://ssc.wisc.edu/sscc/pubs/files/psm, clear
psmatch2 t, mahalanobis(x1 x2) outcome(y) ate
The variable _n1 stores the observation number of the matched control observation for every treatment observation.
The following code estimates the average treatment effect on the treated (usually the main quantity of interest) and then checks whether the covariates are balanced (whether the result can be trusted). Before you run it, make sure your treatment variable is coded 0 for the control group and 1 for the treated group. "neighbor(1)" requests nearest-neighbor matching with one neighbor: each treated observation is paired with the control observation whose propensity score is closest to its own.
psmatch2 treated sector logassets logebitda, outcome(logpension) neighbor(1) common
After running psmatch2, you need to check that the matched sample is balanced. So run this:
pstest sector logassets logebitda, treated(treated)
If any of the t-tests has a p-value below 0.05, that covariate is not balanced between the two groups. To check the balance visually, you can also run
psgraph
right after your psmatch2 command.
Good luck!

Calculating difference in survival functions at time t in Stata

I am estimating a Cox model in Stata using stcox. I estimate the model as
stcox treat x1 x2 x3
I can then use the stcurve command to plot the survival function for treatment and control groups, with the x1, x2 and x3 variables set at their means by doing
stcurve, survival at1(treat=0) at2(treat=1)
However, I would also like to calculate the difference in the survival function at specific, discrete points in time. For instance, I'd like to know the probability of survival to 1 year for treated and control groups, with x's set to their means. I think I might be able to do this with the sts generate command and its adjustfor option, but I am a little confused about whether I should use by or strata when using sts generate and I'm also not sure how to hold the control variables at their means rather than at 0. The Stata help pages suggest I can center the values of the controls by subtracting x1's mean from x1, but I am not sure if I am reading this correctly.
I wrote an answer to a similar question which might be useful here, but I do not have enough reputation to post this as a comment, so here goes:
You can do stcox treat x1 x2 x3 and stcurve, survival at1(treat=0) at2(treat=1) outfile(stcurve.dta). The file stcurve.dta will contain the data that produced the graph, which you could use to look up specific time points (I am not terribly familiar with stcox and stcurve, but I think it should work).
Regarding sts generate: I would use by(treat) if you only want to see the difference between treatment groups. However, by versus strata is more of a statistical question, which I am not qualified to answer. You are reading the help pages correctly, but I cannot say whether this approach makes statistical sense, and I do not know if sts generate allows several variables in adjustfor. My suggestion is to just try it and see if it works. If the data you need is not in stcurve.dta, make your own mean variable and use it to build a centered x for sts generate. Something like this:
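* no "=" here: with "=", Stata would try to evaluate x1 x2 x3 as an expression
local varlist x1 x2 x3
foreach x of local varlist {
    * center each control at its sample mean
    egen mean_`x' = mean(`x')
    gen adj_`x' = `x' - mean_`x'
}
sts generate survival = s, by(treat) adjustfor(adj_x1)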
Then you will have your data in the variable called survival. Another possibility is to use sts list, by(treat) adjustfor(adj_x1) compare instead of sts generate.
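A different route from the two above, sketched under assumptions: stay inside the Cox model and use the fact that the survivor function for a covariate pattern x is the baseline survivor function raised to exp(xb). With the controls centered at their means, the baseline is the control group at mean covariates. The example uses Stata's drugtr example data (already stset, with drug as the treatment and age as the only control); the 12-month time point is arbitrary.

```stata
* Stata example data, already stset (analysis time in months)
webuse drugtr, clear

* Center the control so the baseline refers to a subject of mean age
summarize age, meanonly
generate double c_age = age - r(mean)

stcox drug c_age

* Baseline survivor function: drug = 0, age at its mean
predict double s0, basesurv
* Survivor function for drug = 1 at the same age: S0(t)^exp(b_drug)
generate double s1 = s0^exp(_b[drug])

* Both are non-increasing step functions of _t, so the value at t = 12
* is the smallest value among analysis times up to 12
summarize s0 if _t <= 12, meanonly
scalar S0_12 = r(min)
summarize s1 if _t <= 12, meanonly
scalar S1_12 = r(min)

display "S(12 | control) = " S0_12
display "S(12 | treated) = " S1_12
display "difference      = " S1_12 - S0_12
```

With several controls, center each one before stcox; the baseline then corresponds to all of them held at their means, which is what stcurve does by default.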

Individual score contributions in ML estimation

I've estimated a model via maximum likelihood in Stata and was surprised to find that estimated standard errors for one particular parameter are drastically smaller when clustering observations. I take it from the Stata manual on robust standard error estimation in ML that this can happen if the contributions of individual observations to the score (the derivative of the log-likelihood) tend to cancel each other within clusters.
I would now like to dig a little deeper into what exactly is happening and would therefore like to have a look at these score contributions. As far as I can see, however, Stata only gives me the total sum as e(gradient). Is there any way to pry the individual summands out of Stata?
If you have written your own command, you can create a new variable containing these scores using the ml score command. Official Stata commands and most finished user written commands will often have score as an option for predict, which does the same thing but with an easier syntax.
These will give you the score of the log likelihood ($\ell$) with respect to the linear predictor, $x\beta = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \ldots$. To get the derivative of the log likelihood with respect to an individual parameter, say $\beta_1$, you just use the chain rule:
$\frac{\partial \ell}{\partial \beta_1} = \frac{\partial \ell }{\partial x\beta} \frac{\partial x\beta}{\partial \beta_1}$
The scores returned by Stata are $ \frac{\partial \ell }{\partial x\beta}$, and $\frac{\partial x\beta}{\partial \beta_1} = x_1$.
So, to get the score for $\beta_1$ you just multiply the score returned by Stata and $x_1$.
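A minimal sketch of that multiplication for an official command, assuming a logit on Stata's auto data purely for illustration (foreign, mpg, weight and rep78 are stand-ins for your own outcome, regressors and cluster variable). At the maximum the contributions for each parameter sum to roughly zero over the whole sample, which gives a quick sanity check.

```stata
sysuse auto, clear
logit foreign mpg weight

* Score with respect to the linear predictor, one value per observation
predict double sc_xb, score

* Chain rule: contribution of each observation to the score for the mpg coefficient
generate double sc_mpg = sc_xb * mpg

* Should sum to (approximately) zero at the ML estimates
summarize sc_mpg
display "sum of score contributions for mpg: " r(sum)

* Within-cluster sums, to see whether contributions cancel inside clusters
bysort rep78: egen double sc_mpg_cl = total(sc_mpg)
```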