kmatch: Check propensity scores and individuals who are matched - stata

I'm using kmatch in Stata. I chose kmatch because its ematch() option lets me match exactly on a specific variable in addition to matching on the propensity score. Here is my code:
kmatch ps treatment age sex edu (outcome), ematch(level) comsup
I think kmatch is different from pscore and psmatch2 in that propensity scores will not be automatically stored in the dataset. I wonder if there is a way to save these propensity scores and to check which individuals are included in the matched sample.

The answer is in the help file, help kmatch. Add generate[(spec)] as an option to store the propensity scores as _KM_ps. Other helpful matching results also have the _KM_ prefix. wgenerate[(spec)] generates variables containing the ready-to-use matching weights. idgenerate[(prefix)] generates variables containing the IDs (observation numbers) of the matched controls.
Here is an example.
webuse cattaneo2, clear
kmatch ps mbsmoke mmarried mage fbaby medu (bweight), ///
generate(kpscore) wgenerate(_matched) idgenerate(_controlid) ate
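Applied to the model in the question, a minimal sketch might look like the following. It assumes the variable names from the question (treatment, age, sex, edu, outcome, level) and relies on generate being given without arguments, so that the default _KM_ prefix is used. Only _KM_ps is named in this answer; check the output of describe and help kmatch for the meaning of the other stored variables.

```stata
* Sketch only: variable names are taken from the question
kmatch ps treatment age sex edu (outcome), ematch(level) comsup generate

* See which matching results were stored under the default _KM_ prefix
describe _KM_*

* Inspect the stored propensity scores
summarize _KM_ps
```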
Try this to compare results from kmatch and teffects psmatch, keeping only the propensity scores from each.
webuse cattaneo2, clear
* tempfile names live in local macros, so refer to them as `temp1' and `temp2'
tempfile temp1 temp2
keep mbsmoke mmarried mage fbaby medu bweight
gen id = _n
save `temp1'
* propensity scores from teffects psmatch
teffects psmatch (bweight) (mbsmoke mmarried mage fbaby medu), ///
ate generate(_pscore)
predict te_pscore, ps
keep te_pscore id
* flip the predicted score so that it is comparable with the one from kmatch
replace te_pscore = 1 - te_pscore
save `temp2'
* propensity scores from kmatch
use `temp1', clear
kmatch ps mbsmoke mmarried mage fbaby medu (bweight), generate(kpscore) ate
rename _KM_ps k_pscore
keep k_pscore id
* put the two sets of scores side by side
merge 1:1 id using `temp2'
drop _merge
list in 1/10
+---------------------------+
| id k_pscore te_psc~e |
|---------------------------|
1. | 1 .13229635 .1322963 |
2. | 2 .4204439 .4204439 |
3. | 3 .22490795 .2249079 |
4. | 4 .16333027 .1633303 |
5. | 5 .11024706 .1102471 |
|---------------------------|
6. | 6 .25395923 .2539592 |
7. | 7 .16283038 .1628304 |
8. | 8 .10881813 .1088181 |
9. | 9 .10988829 .1098883 |
10. | 10 .11608692 .1160869 |
+---------------------------+
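Eyeballing ten rows only goes so far. As a rough check over all observations, one could summarize the absolute difference between the two scores. This sketch assumes the merged dataset from the example above is still in memory; diff is a name introduced here for illustration.

```stata
* Assumes k_pscore and te_pscore from the merge above are in memory
gen double diff = abs(k_pscore - te_pscore)
summarize diff
```

A maximum close to zero would suggest that the two commands estimate the same propensity scores up to storage precision.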

Related

Create Custom Definition of Week

I have daily data and want to convert them to weekly, using the following definition. Every Monday denotes the beginning of week i, and Sunday denotes the end of week i.
My date variable is called day and already has %td format. I have a feeling that I should use the dow() function, combined with egen, group(), but I struggle to get it quite right.
If your data are once a week and you have data for Mondays only, then your date variable is fine and all you need to do is declare delta(7) if you use tsset or xtset.
If your data are for two or more days a week and you wish to collapse or contract to weekly data, then you can convert to a suitable time basis like this:
* Example generated by -dataex-. To install: ssc install dataex
clear
input float date
22067
22068
22069
22070
22071
22072
22073
22074
22075
22076
22077
22078
22079
22080
end
format %td date
gen wdate = cond(dow(date) == 1, date, cond(dow(date) == 0, date - 6, date - dow(date) + 1))
format wdate %td
gen dow = dow(date)
list, sepby(wdate)
+-----------------------------+
| date dow wdate |
|-----------------------------|
1. | 01jun2020 1 01jun2020 |
2. | 02jun2020 2 01jun2020 |
3. | 03jun2020 3 01jun2020 |
4. | 04jun2020 4 01jun2020 |
5. | 05jun2020 5 01jun2020 |
6. | 06jun2020 6 01jun2020 |
7. | 07jun2020 0 01jun2020 |
|-----------------------------|
8. | 08jun2020 1 08jun2020 |
9. | 09jun2020 2 08jun2020 |
10. | 10jun2020 3 08jun2020 |
11. | 11jun2020 4 08jun2020 |
12. | 12jun2020 5 08jun2020 |
13. | 13jun2020 6 08jun2020 |
14. | 14jun2020 0 08jun2020 |
+-----------------------------+
In short, index weeks by the Mondays that start them. Now collapse or contract your dataset. Naturally if you have panel or longitudinal data some identifier may be involved too. delta(7) remains essential for anything depending on tsset or xtset.
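The collapse step could be sketched as follows. This is an illustration under assumptions: y is a made-up daily measurement, and the Monday of each week is computed with a shorter expression, date - mod(dow(date) + 6, 7), which yields the same wdate as the nested cond() above.

```stata
clear
set obs 14
* Same 14 days as the example above: 01jun2020 to 14jun2020
gen date = mdy(6, 1, 2020) + _n - 1
format %td date

* y is a hypothetical daily measurement, used only for illustration
gen y = _n

* Monday that starts each week (equivalent to the nested cond() expression)
gen wdate = date - mod(dow(date) + 6, 7)
format %td wdate

* One observation per week, indexed by its Monday
collapse (mean) y, by(wdate)
tsset wdate, delta(7)
list
```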
There is no harm in using egen to map to successive integers, but no advantage in that either.
A theme underlying this is that Stata's own weeks are idiosyncratic, always starting week 1 on 1 January and always having 8 or 9 days in week 52. For more on weeks in Stata, see the papers here and here, which include the advice given in this answer, and much more.
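To see how Stata's own weeks behave, it can help to display week() for a few dates near the turn of a year. The dates below are arbitrary choices for illustration.

```stata
* Stata's week() starts week 1 on 1 January regardless of the day of the week
display week(mdy(1, 1, 2020))
display dow(mdy(1, 1, 2020))

* The last days of December all fall into week 52
display week(mdy(12, 22, 2020))
display week(mdy(12, 31, 2020))
```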

Preserving data more than once

I am writing some code in Stata and I have already used preserve once. However, now I would like to preserve again, without using restore.
I know this will give an error message, but does it save up to the new preserve area?
No, preserving twice without restoring in between simply throws an error:
sysuse auto, clear
preserve
drop mpg
preserve
already preserved
r(621);
However, you can do something similar using temporary files. From help macro:
"...tempfile assigns names to the specified local macro names that may be used as names for temporary files. When the program or do-file concludes, any datasets created with these assigned names are erased..."
Consider the following toy example:
tempfile one two three
sysuse auto, clear
save `one'
drop mpg
save `two'
drop price
save `three'
use `two'
list price in 1/5
+-------+
| price |
|-------|
1. | 4,099 |
2. | 4,749 |
3. | 3,799 |
4. | 4,816 |
5. | 7,827 |
+-------+
use `one'
list mpg in 1/5
+-----+
| mpg |
|-----|
1. | 22 |
2. | 17 |
3. | 22 |
4. | 20 |
5. | 15 |
+-----+
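If the goal is only to return to the preserved data several times, rather than to keep several different snapshots, another possibility is the preserve option of restore, which restores the data while keeping the preserved copy. This is a different technique from the temporary files shown above, sketched here on the same toy data.

```stata
sysuse auto, clear
preserve

drop mpg
* ... work on the data without mpg ...

* Go back to the preserved data, but keep the preserved copy available
restore, preserve

drop price
* ... work on the data without price ...

* Final restore: the full dataset is back and the preserved copy is released
restore
```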

Count the number of distinct strings and their occurrence in a variable

I have a variable called Category that specifies the category of observations. The problem is that some observations have multiple categories. For example:
id Category
1 Economics
2 Biology
3 Psychology; Economics
4 Economics; Psychology
There is no meaning in the order of categories. They are always separated by ";". There are 250 categories, so creating dummy variables might be tricky. I have the complete list of categories in a separate Excel file if this might help.
What I want is simply to summarize my dataset by unique categories such as Economics (3), Psychology (2), Biology (1) (so the sum of all counts can exceed the number of observations).
tabsplit from the tab_chi package on SSC will do this for you.
clear
input id str42 Category
1 "Economics"
2 "Biology"
3 "Psychology; Economics"
4 "Economics; Psychology"
end
capture ssc install tab_chi
tabsplit Category, p(;)
Category | Freq. Percent Cum.
------------+-----------------------------------
Biology | 1 16.67 16.67
Economics | 3 50.00 66.67
Psychology | 2 33.33 100.00
------------+-----------------------------------
Total | 6 100.00
Note: You can count semi-colons and thus phrases like this.
gen count = 1 + length(Category) - length(subinstr(Category, ";", "", .))
The logic is that you measure the length of the string and its length should semi-colons ; be replaced by empty strings (namely, removed). The difference is the number of semi-colons, to which you add 1.
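The arithmetic can be checked on a single literal string. For "Psychology; Economics" the full length is 21, the length without the semi-colon is 20, and so the count is 1 + 21 - 20 = 2.

```stata
* Worked example of the counting logic on one string
display length("Psychology; Economics")
display length(subinstr("Psychology; Economics", ";", "", .))
display 1 + length("Psychology; Economics") - length(subinstr("Psychology; Economics", ";", "", .))
```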
EDIT: How to get to a different data structure, starting with the data example above.
. split Category, p(;)
variables created as string:
Category1 Category2
. drop Category
. reshape long Category, i(id) j(mention)
(note: j = 1 2)
Data wide -> long
-----------------------------------------------------------------------------
Number of obs. 4 -> 8
Number of variables 3 -> 3
j variable (2 values) -> mention
xij variables:
Category1 Category2 -> Category
-----------------------------------------------------------------------------
. drop if missing(Category)
(2 observations deleted)
. list, sepby(id)
+----------------------------+
| id mention Category |
|----------------------------|
1. | 1 1 Economics |
|----------------------------|
2. | 2 1 Biology |
|----------------------------|
3. | 3 1 Psychology |
4. | 3 2 Economics |
|----------------------------|
5. | 4 1 Economics |
6. | 4 2 Psychology |
+----------------------------+
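From this long layout, the counts asked for in the question could be obtained with tabulate. One caveat, stated as an assumption to verify on your own data: pieces that followed a semi-colon may carry a leading blank (for example " Economics"), so it seems prudent to trim before tabulating. strtrim() is available in recent Stata versions; older versions use trim().

```stata
* Assumes the long dataset produced by split and reshape above is in memory
replace Category = strtrim(Category)
tabulate Category, sort
```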

Stata: egen rowpctile a range of values instead of single percentile value

I have a variable var with many missing values for which I want to calculate the 95th percentile then use this value to drop observations that lie above the 95th percentile (for those observations that are not missing the variable).
Because of the many missing values, I use egen with rowpctile which is supposed to calculate the p(#) percentile, ignoring missing values. When I look at the p95 values, however, they're a range of different values rather than a single 95th percentile value as seen below:
. egen p95 = rowpctile(var), p(95)
. list p95
+-----------+
| p95 |
|-----------|
1. | . |
2. | 65.71429 |
3. | 14.28571 |
4. | . |
5. | . |
...
Am I using the function incorrectly or is there a better way to go about this?
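The rowpctile function of the egen command calculates the percentile of the values of a list of variables separately for each observation, which is why you get a different value in each row. For a single percentile across observations, use summarize with the detail option, which ignores missing values and leaves the result in r(p95). Here is a technique that should set you on the right path.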
. sysuse auto, clear
(1978 Automobile Data)
. replace price = . in 1/5
(5 real changes made, 5 to missing)
. summarize price, detail
Price
-------------------------------------------------------------
Percentiles Smallest
1% 3291 3291
5% 3748 3299
10% 3895 3667 Obs 69
25% 4296 3748 Sum of Wgt. 69
50% 5104 Mean 6245.493
Largest Std. Dev. 3015.072
75% 6342 13466
90% 11497 13594 Variance 9090661
95% 13466 14500 Skewness 1.594391
99% 15906 15906 Kurtosis 4.555704
. display r(p95)
13466
. generate toobig = price>r(p95)
. list make price if toobig | price==.
+---------------------------+
| make price |
|---------------------------|
1. | AMC Concord . |
2. | AMC Pacer . |
3. | AMC Spirit . |
4. | Buick Century . |
5. | Buick Electra . |
|---------------------------|
12. | Cad. Eldorado 14,500 |
13. | Cad. Seville 15,906 |
27. | Linc. Mark V 13,594 |
+---------------------------+
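To actually drop the observations above the 95th percentile while keeping those with missing values, the comparison needs an explicit missing-value guard, because Stata treats missing as larger than any number (which is why toobig above also flags the five missing prices). A minimal sketch on the same auto data:

```stata
sysuse auto, clear
replace price = . in 1/5

* summarize ignores missing values and leaves the percentile in r(p95)
summarize price, detail

* Drop only nonmissing values above the 95th percentile
drop if price > r(p95) & !missing(price)

* Alternative: store the percentile as a variable (constant across observations)
* egen p95 = pctile(price), p(95)
```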

Create new variable by dividing column by observation in last row

I want to create a new variable, say cheese2, that takes cheese and divides every value by the last observation (2921333).
+----------+
| cheese |
|----------|
1. | 3060000 |
2. | 840333.3 |
3. | 1839667 |
4. | 1.17e+07 |
5. | 1374000 |
|----------|
6. | 2092333 |
7. | 341000 |
8. | 3149000 |
9. | 3557667 |
10. | 590666.7 |
|----------|
11. | 8937000 |
12. | 4142000 |
13. | 2624000 |
14. | 1973667 |
15. | 2921333 |
I would also like to do this for multiple columns at once i.e. divide multiple columns by the last row of my data set.
In Stata terminology,
create a new variable by dividing a column by the observation in the last row
becomes
create a new variable by dividing a variable by the value in the last observation.
Such a question suggests that you are storing totals in your last observation, spreadsheet style. Such a practice is undoubtedly convenient for what you are asking, but it creates obligations to exclude the last observation from almost every other manipulation and to maintain precisely the same sort order, and would generally be considered a bad idea therefore.
All that said,
gen cheese2 = cheese/cheese[_N]
is what you ask and a loop over several variables could be
foreach v of var frog newt toad lizard dragon {
gen `v'2 = `v'/`v'[_N]
}
See also the help for foreach.
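If every numeric variable should be treated this way, the variable list need not be typed out. The sketch below uses a toy dataset in which wine is a hypothetical second variable, and it assumes that all numeric variables in memory are meant to be divided.

```stata
clear
input cheese wine
3060000 10
840333.3 20
2921333 40
end

* Collect the names of all numeric variables in r(varlist)
ds, has(type numeric)

* Divide each variable by its value in the last observation
foreach v of varlist `r(varlist)' {
    gen `v'2 = `v' / `v'[_N]
}
list
```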