Interpreting propensity score results from the stratification method in SAS

I am using the propensity score stratification method. I got some output but cannot interpret it, and I am looking for a source on how to interpret these results.
I divided the propensity scores into 5 groups and got the following output after running my code:
obs = 1
type = 0
freq = 10, sum_wt = 1010988.4, sum_diff = 0.0015572, mean_diff = 0.0015572, SE_diff = 0.0000994551
I know that the freq column equals 2*5 (twice the number of groups), that mean_diff equals sum_diff, and that SE_diff is the square root of 1/(sum of weights).
Does this say that ranking the propensity scores into 5 groups is an appropriate approach? Which of the above criteria should I use for the final decision?

I believe your output is just describing the distribution within the groups. You evaluate whether propensity score matching, in your case stratified matching, is working by looking at the absolute standardized differences of the variables pre- vs. post-matching.
Here is a peer-reviewed paper my colleagues and I published that incorporates propensity score matching. There are some details in the methodology section, which I wrote, that should answer your question on how to evaluate whether your approach is working.
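For instance, here is a minimal Stata sketch of that check for a single covariate (the variable names age and treated are purely illustrative; compute the quantity before and after stratification or matching and compare):
* absolute standardized difference for one covariate (illustrative names)
summarize age if treated == 1
scalar m1 = r(mean)
scalar v1 = r(Var)
summarize age if treated == 0
scalar m0 = r(mean)
scalar v0 = r(Var)
* difference in means scaled by the pooled standard deviation
display "abs. std. diff. = " abs(m1 - m0) / sqrt((v1 + v0)/2)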

Related

Stata code to conditionally sum values based on a group rank

I'm trying to write code for a fairly large dataset (3 million observations) that has been segregated into smaller groups (ID). For each observation (described in the table below), I want to create a cumulative sum of the variable "Value" over all lower-ranked observations whose Condition equals that of the current observation.
I want to write this code without using loops, if there is a way to do so.
Could someone help me?
Thank you!
UPDATE:
I have pasted the equation for the output variable below.
UPDATE 2:
The CSV format of the above table is:
ID,Rank,Condition,Value,Expected output
1,1,30,10,0
1,2,40,20,0
1,3,20,30,0
1,4,30,40,10
1,5,40,50,20
1,6,20,60,30
1,7,30,70,80
2,1,40,80,0
2,2,20,90,0
2,3,30,100,0
2,4,40,110,80
2,5,20,120,90
2,6,30,130,100
2,7,40,140,190
2,8,20,150,210
2,9,30,160,230
Equation
If I understand correctly, for each combination of ID and Condition you want to calculate a running sum of the variable Value, ordered by Rank and excluding the current observation. If that is indeed your goal, the following untested code might set you on the path to a solution:
sort ID Condition Rank
// be sure there is a single observation for each combination
isid ID Condition Rank
// generate the running sum
by ID Condition (Rank): generate output = sum(Value)
// subtract out the current observation
replace output = output - Value
// return to the original order
sort ID Rank
As I said, this is untested, because my copy of Stata cannot read pictures of data. If your testing shows that it is imperfect and you cannot resolve the problem yourself, providing your sample data in a usable format will increase the likelihood someone will be able to help.
Added in edit: Corrected the isid command.
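Since the data are now available in CSV form, here is a minimal sketch that keys in the first group with input and applies the commands above, so that the result can be compared with the Expected output column:
* enter the first group of the posted data and apply the suggestion
clear
input ID Rank Condition Value
1 1 30 10
1 2 40 20
1 3 20 30
1 4 30 40
1 5 40 50
1 6 20 60
1 7 30 70
end
sort ID Condition Rank
isid ID Condition Rank
by ID Condition (Rank): generate output = sum(Value)
replace output = output - Value
sort ID Rank
list, clean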

Propensity score matching in Stata

I have a group of treated firms in a country, and for each firm I would like to find the closest match in terms of industry, size and profitability in the rest of the country. I am working in Stata. All I need is to form a control group; could anybody guide me with the code? That would be greatly appreciated! I currently have the following, which doesn't get me what I need:
psmatch2 (logpension) (treated sector logassets logebitda), logit ate
Here's how you might match on x1 and x2 using Mahalanobis distance as a metric, to get the effect on y from treatment t:
use http://ssc.wisc.edu/sscc/pubs/files/psm, clear
psmatch2 t, mahalanobis(x1 x2) outcome(y) ate
The variable _n1 stores the observation number of the matched control observation for every treatment observation.
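As a quick illustration (a hedged sketch, assuming psmatch2 has created its usual _id, _n1 and _treated variables), you can eyeball a few matched pairs directly:
* inspect matched pairs via the identifiers psmatch2 creates
sort _id
list _id _n1 y if _treated == 1, clean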
The following is a full set of code you can run to find the average treatment effect on the treated (your most important result) and to check whether the data are balanced (whether your result is valid). Before you run it, make sure your treatment variable is coded so that 0 is the control group and 1 is the experimental/treatment group. "neighbor(1)" means I chose nearest-neighbor matching: it pairs each treated observation with the control observation whose propensity score is closest in absolute value.
psmatch2 treated sector logassets logebitda, outcome(logpension) neighbor(1) common
After running psmatch2, you need to make sure your data are balanced, so run:
pstest sector logassets logebitda, treated(treated)
If your t-test shows any significance below 0.05, your data are not balanced. To check the balance of your data visually, you can also run
psgraph
right after your psmatch2 command.
Good luck!

Calculating p-value by hand from Stata table

I want to ask how to compute the p-value without a t-stat table, just by looking at the regression table, like the one on the first page of the PDF at the following link: http://faculty.arts.ubc.ca/dwhistler/326UBC/stataHILL.pdf. For example, if I did not know the value 0.062, how could I work it out from the other information in the table?
You need to use the ttail() function, which returns the reverse cumulative Student's t distribution, aka the probability T > t:
display ttail(38,abs(_b[_cons]/_se[_cons]))*2
The first argument, 38, is the degrees of freedom (sample size less number of parameters), while the second, 1.92, is the absolute value of the coefficient of interest divided by its standard error, or the t-stat. The factor of two comes from the fact that Stata is doing a two-tailed test. You can also use the stored DoF with
display ttail(e(df_r),abs(_b[_cons]/_se[_cons]))*2
You can also do the integration of the t density by "hand" using Adrian Mander's integrate:
ssc install integrate
integrate, f(tden(38,x)) l(-1.92) u(1.92)
This gives you 0.93761229, but you want Pr(T>|t|), which is 1-0.93761229=0.06238771.
If you look at many statistics textbooks, you will find a table called the Z-table which will give you the probability that Z is beyond your test statistic. The table is actually a cumulative distribution function of the normal curve.
When people went to school with 4-function calculators, one or more of the questions on the statistics test would include a copy of this Z-table, and the dear students would have to interpolate columns of numbers to find the p-value. In your example, you would see that the p-value falls between .06 and .07, closer to .06, and a linear interpolation would give .062.
Today, the p-value is something that Stata or SAS will calculate for you.
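For the example above, a single line reproduces the tabulated value, using the degrees of freedom and t-statistic quoted earlier:
display 2*ttail(38, 1.92)    // two-tailed p-value, approximately 0.062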
Here is another SO question that may be of interest: How do I calculate a p-value if I have the t-statistic and d.f. (in Perl)?
Here is a basic page on how to determine p-value "by hand": http://www.dummies.com/how-to/content/how-to-determine-a-pvalue-when-testing-a-null-hypo.html
Here is how you can determine p-value using Excel: http://ms-office.wonderhowto.com/how-to/find-p-value-with-excel-346366/
===EDIT===
My Stata text ("Microeconometrics using Stata", Revised Ed, Cameron & Trivedi) says the following on p. 402.
* p-values for t(30), F(1,30), Z, and chi(1) at y=2
. scalar y=2
. scalar p_t30 = 2 * ttail(30,y)
. scalar p_f1and30 = Ftail(1,30,y^2)
. scalar p_z = 2 * (1 - normal(y))
. scalar p_chi1 = chi2tail(1,y^2)
. display "p-values" " t(30)=" %7.4f p_t30
p-values t(30) = 0.0546

Code observation as belonging to quantile in Stata

In Stata, I wanted to be able to put observations in buckets based on a specific variable, or equivalently code observations as belonging to a certain quantile. I looked around for some existing code that would accomplish this task but didn't quite find what I wanted. I wrote the following simple ado:
program toquantiles
    version 13
    syntax varname [, n(integer 4)]
    quietly {
        local interval = 100/`n'
        local binVarName = "`varlist'_quantile"
        gen `binVarName' = `n'
        local upper = `n' - 1
        forvalues i = 1/`upper' {
            local y = `i'*`interval'
            // Abuse the egen command to calculate the yth percentile.
            tempvar x
            egen `x' = pctile(`varlist'), p(`y')
            // Label this observation as belonging to the ith bin if the value
            // of the variable in question is greater than x.
            replace `binVarName' = `n'-`i' if `varlist' > `x'
            drop `x'
        }
    }
end
The output is that each observation gets a new variable, varname_quantile, coded 1, 2, 3, etc. according to the quantile in which it falls. My code seems like a pretty naive approach to this problem.
Is there any built-in functionality that does what I do above? If not, are there any improvements to this ado that would speed up execution? Currently, it runs quite slowly. (Slowly as in, it is faster to summarize all 100+ variables than to calculate the quintiles for one variable.) Thanks so much.
There is a terminology problem here, most simply illustrated by quartiles: on the one hand, three particular summary statistics, the lower and upper quartiles and the median in between; on the other, the first, second, third and fourth quarters (some say quartiles here too), which are intervals defined by falling below or above particular quartiles. (What happens when values equal particular quartiles is a matter of convention.)
In other words, quartiles and more generally quantiles can be particular levels (which I take to be the standard statistical use of the term) or intervals (a common (mis?)use of the term in some applied fields, e.g. applied economics).
It seems that you want the second sense.
Turning to Stata, doesn't xtile do this?
See also http://www.stata.com/support/faqs/statistics/percentile-ranks-and-plotting-positions/index.html
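For example, a minimal sketch with the built-in command (using Stata's auto dataset purely for illustration):
* assign each observation to one of 4 quantile-based groups of price
sysuse auto, clear
xtile price_quantile = price, nq(4)
tabulate price_quantile
Note that xtile numbers the groups from the bottom up (group 1 holds the smallest values), which appears to be the reverse of the numbering produced by the ado above.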

Individual score contributions in ML estimation

I've estimated a model via maximum likelihood in Stata and was surprised to find that estimated standard errors for one particular parameter are drastically smaller when clustering observations. I take it from the Stata manual on robust standard error estimation in ML that this can happen if the contributions of individual observations to the score (the derivative of the log-likelihood) tend to cancel each other within clusters.
I would now like to dig a little deeper into what exactly is happening and would therefore like to have a look at these score contributions. As far as I can see, however, Stata only gives me the total sum as e(gradient). Is there any way to pry the individual summands out of Stata?
If you have written your own command, you can create a new variable containing these scores using the ml score command. Official Stata commands and most finished user written commands will often have score as an option for predict, which does the same thing but with an easier syntax.
These will give you the score of the log likelihood ($\ell$) with respect to the linear predictor, $x\beta = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \ldots$. To get the derivative of the log likelihood with respect to an individual parameter, say $\beta_1$, you just use the chain rule:
$\frac{\partial \ell}{\partial \beta_1} = \frac{\partial \ell }{\partial x\beta} \frac{\partial x\beta}{\partial \beta_1}$
The scores returned by Stata are $\frac{\partial \ell}{\partial x\beta}$, and $\frac{\partial x\beta}{\partial \beta_1} = x_1$.
So, to get the score for $\beta_1$, you just multiply the score returned by Stata by $x_1$.
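For example, here is a minimal sketch using an official command and predict's score option (the dataset and variable names are purely illustrative):
* observation-level scores after an official estimation command
sysuse auto, clear
logit foreign mpg weight
predict double score_xb, score
* score contribution of each observation to the gradient for the mpg coefficient
generate double score_mpg = score_xb * mpg
* these contributions sum to approximately zero at the maximum likelihood estimates
summarize score_mpg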