Matching observations based on a single variable or multiple variables on a single data set (Stata)

I need to match observations based on an index variable that measures home conditions, personal variables such as age, gender, education, etc. and year. My home index variable is numerical (from 0 to 103) and the personal characteristics are either dummies or categorical variables. For my analysis I need to match the most similar observations based on these variables. It is sort of a nearest neighbor match but without having a control or treatment group.
The dataset looks something like this.
indice_hogar anio mes directorio orden mujer nivel__educativo_cat trabaja
0 2018 08 4700731 1 1 4 1
0 2018 08 4700731 2 0 5 1
0 2018 11 4777752 1 0 5 1
37 2018 04 4605803 1 0 3 1
42 2011 07 2735691 1 1 4 1
42 2018 02 4545459 1 0 3 1
43 2018 12 4803694 1 0 5 1
44 2018 10 4747974 1 0 5 1
46 2018 05 4610096 1 0 3 1
47 2018 04 4598828 1 1 1 0
47 2018 08 4687722 1 0 1 0
48 2018 04 4592941 1 0 5 0
48 2018 06 4636177 1 0 3 1
50 2018 06 4645892 1 0 1 1
50 2018 06 4645892 2 1 4 1
For context, I am using an IV that is the ability of the most similar person according to the index and to personal characteristics. That means I need to find the most similar observation to, for example, person A and then be able to take the match's ability and use it in a regression.
I have not been able to write code that does this.

Duplicate your dataset, and match the 1st copy to the 2nd using nnmatch.
* Duplicate the data set
gen byte treat = 1
gen nobs = _N
save temp, replace
replace treat = 0
append using temp
* Make a fake outcome variable to keep nnmatch happy
gen byte outcome = runiform()<.5
* nnmatch performs a nearest neighbor match, return the id of the matched cases as nnid
teffects nnmatch (outcome indice_hogar nivel_educativo_cat trabaja) (treat), gen(nnid)
* Unduplicate the data set
keep if treat == 0
* gen(nnid) stores the first match in nnid1 (and nnid2, ... if there are ties)
* Change nnid1 to point to the 1st copy of the data set, not the 2nd
replace nnid1 = nnid1 - nobs
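To make the idea of matching within a single data set concrete, here is a minimal Python sketch (not Stata, and not what teffects nnmatch does internally). It assumes a handful of made-up rows, uses plain unscaled Euclidean distance as the similarity measure, and skips the observation itself when searching for its closest neighbor. The function name nearest_other and the toy rows are hypothetical.

```python
from math import sqrt

# Hypothetical toy rows: (indice_hogar, mujer, nivel_educativo_cat, trabaja)
rows = [
    (0, 1, 4, 1),
    (0, 0, 5, 1),
    (37, 0, 3, 1),
    (42, 1, 4, 1),
    (43, 0, 5, 1),
]


def nearest_other(rows):
    """For each row, return the index of the closest *other* row."""
    matches = []
    for i, a in enumerate(rows):
        best_j, best_d = None, float("inf")
        for j, b in enumerate(rows):
            if i == j:
                continue  # never match an observation to itself
            # Unscaled Euclidean distance (an assumption of this sketch)
            d = sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
            if d < best_d:
                best_j, best_d = j, d
        matches.append(best_j)
    return matches


matches = nearest_other(rows)
print(matches)
```

Because the index ranges from 0 to 103 while the other variables are small, an unscaled distance is dominated by the index; in practice the variables would probably need rescaling or a different metric.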

Subtracting values based on a index column and using a condition in the same column in DAX

I've read a lot of material on Stack Overflow about this, but I'm still not able to reproduce it.
Sample data set.
Asset Value Index
A 10 1
B 15 1
C 20 1
A 11 2
B 17 2
C 24 2
A 18 3
B 25 3
C 30 3
What I want to do is subtract the Asset values individually, based on the Index column.
Ex:
Asset A [1] -> 10
Asset A [2] -> 11
11 - 10 = 1
So the table would be like this.
Asset Value Index Diff
A 10 1 0
B 15 1 0
C 20 1 0
A 11 2 1
B 17 2 2
C 24 2 4
A 18 3 7
B 25 3 8
C 30 3 6
This needs to be done using DAX.
Can you guys help me?
Best Regards!
I just did this and it worked.
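Diff =
VAR Assets = 'Table'[Asset]
VAR Ind = 'Table'[Index] - 1
// Value of the same asset at the previous index; BLANK when there is no previous row
VAR Prev = CALCULATE(SUM('Table'[Value]), FILTER('Table', 'Table'[Asset] = Assets && 'Table'[Index] = Ind))
RETURN
IF(ISBLANK(Prev), 0, 'Table'[Value] - Prev)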
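The lookup logic of this calculated column (find the same asset at the previous index, subtract, and return 0 when there is no previous row) can be checked outside DAX. The following is a small Python sketch on the sample data from the question; the names value_at and diffs are only illustrative.

```python
# Sample data from the question: (Asset, Value, Index)
rows = [
    ("A", 10, 1), ("B", 15, 1), ("C", 20, 1),
    ("A", 11, 2), ("B", 17, 2), ("C", 24, 2),
    ("A", 18, 3), ("B", 25, 3), ("C", 30, 3),
]

# Lookup table keyed by (asset, index)
value_at = {(asset, index): value for asset, value, index in rows}

# Difference to the same asset at the previous index; when there is no
# previous row, fall back to the current value so the difference is 0
diffs = [
    value - value_at.get((asset, index - 1), value)
    for asset, value, index in rows
]
print(diffs)
```

The result matches the Diff column in the expected table.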

Subsampling years before and after an event of an unbalanced panel dataset

I'm trying to figure out a concise way to keep only the two years before and after the year in which an event takes place using daily panel data in Stata. The panel is unbalanced. Ultimately, I'm trying to conduct an event study but I experienced issues because the unique groups report inconsistent years.
The data looks something like this:
ID year month day event
1 1999 1 1 0
1 1999 1 2 0
1 1999 1 3 0
1 1999 1 4 0
1 1999 1 5 0
1 1999 1 6 0
1 1999 1 7 0
1 1999 1 8 0
1 1999 1 9 0
1 1999 1 10 0
1 1999 1 11 0
1 1999 1 12 0
1 1999 1 13 0
1 1999 1 14 0
1 1999 1 15 0
1 1999 1 16 0
1 1999 1 17 0
1 1999 1 18 0
1 1999 1 19 0
1 1999 1 20 0
1 1999 1 21 0
1 1999 1 22 0
1 1999 1 23 0
1 1999 1 24 0
1 1999 1 25 0
1 1999 1 26 0
1 1999 1 27 0
1 1999 1 28 0
1 1999 1 29 0
1 1999 1 30 0
1 1999 1 31 0
1 1999 2 1 1
1 1999 2 2 1
In this case, the event takes place in February 1999. The event is monthly, but I need the daily data for a later part of the analysis. I want to somehow tag the 24 months before February 1999 and the 24 months after February 1999. However, I need to do this in a way that won't codify any months in 2002 if group 1 reported no data in 2000.
I got the following to work on a similar set of monthly data but I can't figure out a way to do it with daily data. Furthermore, if anyone has suggestions for a less clunky solution, I would be very appreciative.
bys ID year (month) : egen year_change = max(event)
bys ID (year month) : replace year_change = 2 if ///
(year_change[_n+24] == 1 & year[_n] == year[_n+24] - 2) | ///
(year_change[_n+12] == 1 & year[_n] == year[_n+12] - 1) | ///
(year_change[_n-12] == 1 & year[_n] == year[_n-12] + 1) | ///
(year_change[_n-24] == 1 & year[_n] == year[_n-24] + 2)
keep if year_change >= 1
It seems that your event date is the first date with event 1. So,
gen dailydate = mdy(month, day, year)
bysort ID : egen key = min(cond(event == 1, dailydate, .))
gen wanted = inrange(dailydate, key - 730, key + 730)
Check that wanted gives the dates you want and then modify the rule or keep accordingly.
This code doesn't assume that the event date is the same for each panel, but that would not be a problem.
See this paper for a review of related technique.
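The rule above (take the first event date per panel, then flag every day within 730 days of it) can be illustrated with a short Python sketch. The rows are made up for illustration, and flagging IDs without any event as False is a choice of this sketch, not something stated in the answer.

```python
from datetime import date, timedelta

# Hypothetical rows: (ID, year, month, day, event)
rows = [
    (1, 1996, 12, 31, 0),
    (1, 1999, 1, 31, 0),
    (1, 1999, 2, 1, 1),
    (1, 1999, 2, 2, 1),
    (1, 2001, 6, 1, 0),
]

# First date with event == 1 for each ID
first_event = {}
for pid, y, m, d, event in rows:
    if event == 1:
        day = date(y, m, d)
        if pid not in first_event or day < first_event[pid]:
            first_event[pid] = day

# Flag days that fall within 730 days of the ID's first event date
window = timedelta(days=730)
wanted = []
for pid, y, m, d, _ in rows:
    key = first_event.get(pid)
    wanted.append(key is not None and abs(date(y, m, d) - key) <= window)
print(wanted)
```

Since the test is on calendar dates rather than on row positions, gaps in the panel do not shift the window.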
For your task, I suggest working with actual Stata dates instead of relying on separate year, month and day variables. This makes it easier to add or subtract 24 months without relying on the sort order of the data (the "_n+24" part in your code), and the coding would not suffer from the missing-data issue that you outline in the question.
I see a straightforward solution, which relies on an assumption I made on your setting (that you did not specify, but is the general form of event studies): the event date is unique for all IDs, hence there is no group-specific "treatment" date.
g stata_date = mdy(month, day, year) // generate variable with Stata date
/* Unique event on Feb 1, 1999 */
bys ID: egen treat_group = max(event) // indicator for an ID to ever be "treated"
g event_window = (stata_date >= td(01Feb1997) & stata_date < td(01Feb2001)) // indicator for event window - 2 years before and after Feb 1, 1999
g event_treatment = treat_group * event_window // indicator for a treated ID during the event window

How to modify a variable conditioned on max value of other variable

I have a long format dataset: ID, time varying variable, time and outcome (y).
Subjects have differing numbers of rows due to different times and different outcome values (0, 1 or 2). But I need to keep only the outcome value corresponding to the last time point, and replace the outcome in all other rows with 0.
I can't figure out how to gen a new variable = outcome only for max(time) by ID
id sbp y time
1 120 1 0
1 126 1 1
1 126 1 2
1 126 1 3
1 126 1 4
1 132 1 5
1 132 1 6
1 132 1 7
1 150 1 8
1 150 1 9
1 150 1 10
1 160 1 11
1 160 1 12
1 160 1 13
1 160 1 14
You seem to be asking quite different things:
Replacing outcome values before the last for each panel with 0.
Keeping only the last.
Here they are in turn:
bysort id (time) : replace y = 0 if _n < _N
by id: keep if _n == _N
If you just want the second, you need bysort id (time) rather than by id.
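For readers who want to see the two operations side by side, here is a minimal Python sketch of the same logic on made-up rows: one result zeroes the outcome everywhere except at each id's last time point, the other keeps only the last row per id. The names zeroed and last_only are illustrative.

```python
# Hypothetical rows: (id, time, y)
rows = [(1, 0, 1), (1, 1, 1), (1, 2, 2), (2, 0, 1), (2, 1, 0)]

# Last observed time for each id
last_time = {}
for pid, t, _ in rows:
    last_time[pid] = max(t, last_time.get(pid, t))

# 1. Replace outcome values before the last time point with 0
zeroed = [(pid, t, y if t == last_time[pid] else 0) for pid, t, y in rows]

# 2. Keep only the row at the last time point
last_only = [(pid, t, y) for pid, t, y in rows if t == last_time[pid]]

print(zeroed)
print(last_only)
```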

Management of spell data: months spent in given state in the past 24 months

I am working with a spell dataset that has the following form:
clear all
input persid start end t_start t_end spell_type year spell_number event
1 8 9 44 45 1 1999 1 0
1 12 12 60 60 1 2000 1 0
1 1 1 61 61 1 2001 1 0
1 7 11 67 71 1 2001 2 0
1 1 4 85 88 2 2003 1 0
1 5 7 89 91 1 2003 2 1
1 8 11 92 95 2 2003 3 0
1 1 1 97 97 2 2004 1 0
1 1 3 121 123 1 2006 1 1
1 4 5 124 125 2 2006 2 0
1 6 9 126 129 1 2006 3 1
1 10 11 130 131 2 2006 4 0
1 12 12 132 132 1 2006 5 1
1 1 12 157 168 1 2009 1 0
1 1 12 169 180 1 2010 1 0
1 1 12 181 192 1 2011 1 0
1 1 12 193 204 1 2012 1 0
1 1 12 205 216 1 2013 1 0
end
lab define lab_spelltype 1 "unemployment spell" 2 "employment spell"
lab val spell_type lab_spelltype
where persid is the id of the person; start and end are the months when the yearly unemployment/employment spell starts and ends, respectively; t_start and t_end are the same measures but starting to count from 1st January 1996; event is equal to 1 for the employment entries for which the previous row was an unemployment spell.
The data is such that there are no overlapping spells during a given year, and each year contiguous spells of the same type have been merged together.
My goal is, for each row such that event is 1, to compute the number of months spent as employed in the last 6 months and 24 months.
In this specific example, what I would like to get is:
clear all
input persid start end t_start t_end spell_type year spell_number event empl_6 empl_24
1 8 9 44 45 1 1999 1 0 . .
1 12 12 60 60 1 2000 1 0 . .
1 1 1 61 61 1 2001 1 0 . .
1 7 11 67 71 1 2001 2 0 . .
1 1 4 85 88 2 2003 1 0 . .
1 5 7 89 91 1 2003 2 1 0 5
1 8 11 92 95 2 2003 3 0 . .
1 1 1 97 97 2 2004 1 0 . .
1 1 3 121 123 1 2006 1 1 0 0
1 4 5 124 125 2 2006 2 0 . .
1 6 9 126 129 1 2006 3 1 3 3
1 10 11 130 131 2 2006 4 0 . .
1 12 12 132 132 1 2006 5 1 4 7
1 1 12 157 168 1 2009 1 0 . .
1 1 12 169 180 1 2010 1 0 . .
1 1 12 181 192 1 2011 1 0 . .
1 1 12 193 204 1 2012 1 0 . .
1 1 12 205 216 1 2013 1 0 . .
end
So the idea is that I have to go back to rows preceding each event==1 entry and count how many months the individual was employed.
Can you suggest a way to obtain this final result?
Some suggested to expand the dataset, but perhaps there are better ways to tackle the problem (especially because the dataset is quite large).
EDIT
The correct labeling of the employment status is:
lab define lab_spelltype 1 "employment spell" 2 "unemployment spell"
The number of past months spent in employment (empl_6 and empl_24) and the definition of event are now correct with this label.
A solution to the problem is to:
expand the data so as to have it monthly,
fill in the gap months with tsfill and finally,
use sum() and lag operators to get the running sum for the last 6 and 24 months.
See also Robert's solution for some ideas I borrowed.
Important: this is almost surely not an efficient way to solve the issue, especially if the data is large (as in my case).
However, the plus is that one actually "sees" what happens in the background, which helps to make sure the final result is the desired one.
Also, importantly, this solution takes into account cases where 2 (or more) events happen within 6 (or 24) months from each other.
clear all
input persid start end t_start t_end spell_type year spell_number event
1 8 9 44 45 1 1999 1 0
1 12 12 60 60 1 2000 1 0
1 1 1 61 61 1 2001 1 0
1 7 11 67 71 1 2001 2 0
1 1 4 85 88 2 2003 1 0
1 5 7 89 91 1 2003 2 1
1 8 11 92 95 2 2003 3 0
1 1 1 97 97 2 2004 1 0
1 1 3 121 123 1 2006 1 1
1 4 5 124 125 2 2006 2 0
1 6 9 126 129 1 2006 3 1
1 10 11 130 131 2 2006 4 0
1 12 12 132 132 1 2006 5 1
1 1 12 157 168 1 2009 1 0
1 1 12 169 180 1 2010 1 0
1 1 12 181 192 1 2011 1 0
1 1 12 193 204 1 2012 1 0
1 1 12 205 216 1 2013 1 0
end
lab define lab_spelltype 1 "employment" 2 "unemployment"
lab val spell_type lab_spelltype
list
* generate Stata monthly dates
gen spell_start = ym(year,start)
gen spell_end = ym(year,end)
format %tm spell_start spell_end
list
* expand to monthly data
gen n = spell_end - spell_start + 1
expand n, gen(expanded)
sort persid year spell_number (expanded)
bysort persid year spell_number: gen month = spell_start + _n - 1
by persid year spell_number: replace event = 0 if _n > 1
format %tm month
* xtset, fill months gaps with "empty" rows, use lags and cumsum to count past months in employment
xtset persid month, monthly // %tm format
tsfill
bysort persid (month): gen cumsum = sum(spell_type) if spell_type==1
bysort persid (month): replace cumsum = cumsum[_n-1] if cumsum==.
bysort persid (month): gen m6 = cumsum-1 - L7.cumsum if event==1 // "-1" otherwise it sums also current empl month
bysort persid (month): gen m24 = cumsum-1 - L25.cumsum if event==1
drop if event==.
list persid start end year m* if event
The posted example is of little utility in developing and testing a solution so I made up fake data that has the same properties. It's bad practice to use 1 and 2 as values for an indicator so I replaced the employed indicator with 1 meaning employed, 0 otherwise. Using month and year separately is also useless so Stata monthly dates are used.
The first solution uses tsegen (from SSC) after expanding each spell to one observation per month. With panel data, all you need to do is to sum the employment indicator for the desired time window.
The second solution uses rangestat (also from SSC) and does the same computations without expanding the data at all. The idea is simple: just add the duration of previous employment spells if the end of the spell falls within the desired window. Of course, if the end of the spell falls within the window but not the start, the months outside the window must be subtracted.
* fake data for 100 persons, up to 10 spells with no overlap
clear
set seed 123423
set obs 100
gen long persid = _n
gen spell_start = ym(runiformint(1990,2013),1)
expand runiformint(1,10)
bysort persid: gen spellid = _n
by persid: gen employed = runiformint(0,1)
by persid: gen spell_avg = int((ym(2015,12) - spell_start) / _N) + 1
by persid: replace spell_start = spell_start[_n-1] + ///
runiformint(1,spell_avg) if _n > 1
by persid: gen spell_end = runiformint(spell_start, spell_start[_n+1]-1)
replace spell_end = spell_start + runiformint(1,12) if mi(spell_end)
format %tm spell_start spell_end
* an event is an employment spell that immediately follows an unemployment spell
by persid: gen event = employed & employed[_n-1] == 0
* expand to one obs per month and declare as panel data
expand spell_end - spell_start + 1
bysort persid spellid: gen ym = spell_start + _n - 1
format %tm ym
tsset persid ym
* only count employment months; limit results to first month event obs
tsegen m6 = rowtotal(L(1/6).employed)
tsegen m24 = rowtotal(L(1/24).employed)
bysort persid spellid (ym): replace m6 = . if _n > 1 | !event
bysort persid spellid (ym): replace m24 = . if _n > 1 | !event
* --------- redo using rangestat, without any monthly expansion ----------------
* return to original obs but keep first month results
bysort persid spellid: keep if _n == 1
* employment end and duration for employed observations only
gen e_end = spell_end if employed
gen e_len = spell_end - spell_start + 1 if employed
foreach target in 6 24 {
// define interval bounds but only for event observations
// an out-of-sample [0,0] interval will yield no results for non-events
gen low`target' = cond(event, spell_start-`target', 0)
gen high`target' = cond(event, spell_start-1, 0)
// sum employment lengths and save earliest employment spell info
rangestat (sum) empl`target'=e_len ///
(firstnm) firste`target'=e_end firste`target'len=e_len, ///
by(persid) interval(spell_end low`target' high`target')
// remove from the count months that occur before lower bound
gen e_start = firste`target' - firste`target'len + 1
gen outside = low`target' - e_start
gen empl`target'final = cond(outside > 0, empl`target'-outside, empl`target')
replace empl`target'final = 0 if mi(empl`target'final) & event
drop e_start outside
}
* confirm that we match the -tsegen- results
assert m24 == empl24final
assert m6 == empl6final
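The idea of counting past employment months without expanding the data can also be written as a direct interval overlap. The Python sketch below is not the rangestat code above; it uses the t_start and t_end columns of the single person in the question's example, with spell_type 1 meaning employment as stated in the question's edit, and a hypothetical helper months_employed.

```python
# (t_start, t_end, spell_type, event) from the question's example;
# spell_type 1 = employment, 2 = unemployment
spells = [
    (44, 45, 1, 0), (60, 60, 1, 0), (61, 61, 1, 0), (67, 71, 1, 0),
    (85, 88, 2, 0), (89, 91, 1, 1), (92, 95, 2, 0), (97, 97, 2, 0),
    (121, 123, 1, 1), (124, 125, 2, 0), (126, 129, 1, 1),
    (130, 131, 2, 0), (132, 132, 1, 1), (157, 168, 1, 0),
    (169, 180, 1, 0), (181, 192, 1, 0), (193, 204, 1, 0),
    (205, 216, 1, 0),
]


def months_employed(spells, event_start, lookback):
    """Employment months in [event_start - lookback, event_start - 1]."""
    low, high = event_start - lookback, event_start - 1
    total = 0
    for start, end, spell_type, _ in spells:
        if spell_type != 1:
            continue  # only employment spells count
        # Months of this spell that fall inside the window
        overlap = min(end, high) - max(start, low) + 1
        if overlap > 0:
            total += overlap
    return total


results = [
    (start, months_employed(spells, start, 6), months_employed(spells, start, 24))
    for start, _, _, event in spells
    if event == 1
]
print(results)
```

For the four event rows this reproduces the empl_6 and empl_24 values listed in the question. With several persons, the same loop would have to be restricted to the spells of the person who has the event.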

SAS_Add value for specific rows

I want to assign a value to specific rows. I think showing it by example would be better. I have the following dataset:
Date Value
01/01/2001 10
02/01/2001 20
03/01/2001 35
04/01/2001 15
05/01/2001 25
06/01/2001 35
07/01/2001 20
08/01/2001 45
09/01/2001 35
My result should be:
Date Value Spec.Value
01/01/2001 10 1
02/01/2001 20 1
03/01/2001 35 1
04/01/2001 15 2
05/01/2001 25 2
06/01/2001 35 2
07/01/2001 20 3
08/01/2001 45 3
09/01/2001 35 3
As you can see, my condition value is 35, and it occurs three times. I need to group my data using this condition value.
data want;
set have;
retain specvalue 1;
if lag(value) = 35 then do;
specvalue +1;
end;
run;
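The data step above starts a new group on the row after each 35. As a cross-check of that logic, here is a minimal Python sketch on the values from the question; the variable names are illustrative.

```python
# Values from the question, in date order
values = [10, 20, 35, 15, 25, 35, 20, 45, 35]

spec_values = []
group = 1     # plays the role of the retained specvalue, initialised to 1
prev = None   # plays the role of lag(value); nothing precedes the first row
for v in values:
    if prev == 35:
        group += 1  # the previous row closed a group, so start the next one
    spec_values.append(group)
    prev = v
print(spec_values)
```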