Create a numeric encoding of levels for a string variable in SAS

I would like to create a numeric indicator variable that captures the levels of a complex string id variable in SAS. Note that I cannot use if-then logic, as the id values will change constantly.
I have searched multiple threads and previous questions, and I am unable to find an approach that works.
Below is some sample code that captures the essence of the problem.
data value_id;
length id $54 ;
informat id $54. ;
input id $ ;
cards ;
9jdbh2e7z8-dc4o8mgsft-qi778mt7s0-rz20vk4xwo-ybcx8gaiy8
tsknifwb29-818zgpj2be-vq7558xhqa-1lgqck7219-rq1ojedtmp
ts1j2y9q6u-nghpfhdxsl-vkdwk060gg-s3tred6a7g-h5iqsl8cir
jg1qpqhofy-02d2m62ayb-fg2f6dtvqc-vx4lsnowcj-s4kg37wxah
o3qadvrqtl-kdyw9qpfir-7xeilvuk7e-g73olb67tm-nwwvla6r4g
gc6dny3d5n-qzdkgfkpoc-iv1vnmwu4d-hubjun73y1-mbaggyrmkq
dbvdcbvafv-cmb2zp67tn-gpnfwcvvpt-nl8qpgwn8b-l3biox4318
byxl352kpd-se5godm0gv-jukpzv1u7x-8kffj5th80-mf04nzwvrf
98llcibqon-u86ww0mzgo-ut58htcfv1-lybgj2gsn2-zlvu6n0mym
wfrbcr3i8e-3vodo5wkrr-hp733zwkhy-uxm9uf16zp-5y11re5um5
9jdbh2e7z8-dc4o8mgsft-qi778mt7s0-rz20vk4xwo-ybcx8gaiy8
tsknifwb29-818zgpj2be-vq7558xhqa-1lgqck7219-rq1ojedtmp
ts1j2y9q6u-nghpfhdxsl-vkdwk060gg-s3tred6a7g-h5iqsl8cir
jg1qpqhofy-02d2m62ayb-fg2f6dtvqc-vx4lsnowcj-s4kg37wxah
o3qadvrqtl-kdyw9qpfir-7xeilvuk7e-g73olb67tm-nwwvla6r4g
gc6dny3d5n-qzdkgfkpoc-iv1vnmwu4d-hubjun73y1-mbaggyrmkq
dbvdcbvafv-cmb2zp67tn-gpnfwcvvpt-nl8qpgwn8b-l3biox4318
byxl352kpd-se5godm0gv-jukpzv1u7x-8kffj5th80-mf04nzwvrf
98llcibqon-u86ww0mzgo-ut58htcfv1-lybgj2gsn2-zlvu6n0mym
wfrbcr3i8e-3vodo5wkrr-hp733zwkhy-uxm9uf16zp-5y11re5um5
9jdbh2e7z8-dc4o8mgsft-qi778mt7s0-rz20vk4xwo-ybcx8gaiy8
tsknifwb29-818zgpj2be-vq7558xhqa-1lgqck7219-rq1ojedtmp
ts1j2y9q6u-nghpfhdxsl-vkdwk060gg-s3tred6a7g-h5iqsl8cir
jg1qpqhofy-02d2m62ayb-fg2f6dtvqc-vx4lsnowcj-s4kg37wxah
o3qadvrqtl-kdyw9qpfir-7xeilvuk7e-g73olb67tm-nwwvla6r4g
gc6dny3d5n-qzdkgfkpoc-iv1vnmwu4d-hubjun73y1-mbaggyrmkq
dbvdcbvafv-cmb2zp67tn-gpnfwcvvpt-nl8qpgwn8b-l3biox4318
byxl352kpd-se5godm0gv-jukpzv1u7x-8kffj5th80-mf04nzwvrf
98llcibqon-u86ww0mzgo-ut58htcfv1-lybgj2gsn2-zlvu6n0mym
wfrbcr3i8e-3vodo5wkrr-hp733zwkhy-uxm9uf16zp-5y11re5um5
9jdbh2e7z8-dc4o8mgsft-qi778mt7s0-rz20vk4xwo-ybcx8gaiy8
tsknifwb29-818zgpj2be-vq7558xhqa-1lgqck7219-rq1ojedtmp
ts1j2y9q6u-nghpfhdxsl-vkdwk060gg-s3tred6a7g-h5iqsl8cir
jg1qpqhofy-02d2m62ayb-fg2f6dtvqc-vx4lsnowcj-s4kg37wxah
o3qadvrqtl-kdyw9qpfir-7xeilvuk7e-g73olb67tm-nwwvla6r4g
gc6dny3d5n-qzdkgfkpoc-iv1vnmwu4d-hubjun73y1-mbaggyrmkq
dbvdcbvafv-cmb2zp67tn-gpnfwcvvpt-nl8qpgwn8b-l3biox4318
byxl352kpd-se5godm0gv-jukpzv1u7x-8kffj5th80-mf04nzwvrf
98llcibqon-u86ww0mzgo-ut58htcfv1-lybgj2gsn2-zlvu6n0mym
wfrbcr3i8e-3vodo5wkrr-hp733zwkhy-uxm9uf16zp-5y11re5um5
;
proc sort data = value_id
out = value_id_s ascii ;
by id ;
run;
Ideally, I would like to have a result where I can create a numeric indicator variable id_n, like the following:
id id_n
98llcibqon-u86ww0mzgo-ut58htcfv1-lybgj2gsn2-zlvu6n0mym 1
98llcibqon-u86ww0mzgo-ut58htcfv1-lybgj2gsn2-zlvu6n0mym 1
98llcibqon-u86ww0mzgo-ut58htcfv1-lybgj2gsn2-zlvu6n0mym 1
98llcibqon-u86ww0mzgo-ut58htcfv1-lybgj2gsn2-zlvu6n0mym 1
9jdbh2e7z8-dc4o8mgsft-qi778mt7s0-rz20vk4xwo-ybcx8gaiy8 2
9jdbh2e7z8-dc4o8mgsft-qi778mt7s0-rz20vk4xwo-ybcx8gaiy8 2
9jdbh2e7z8-dc4o8mgsft-qi778mt7s0-rz20vk4xwo-ybcx8gaiy8 2
9jdbh2e7z8-dc4o8mgsft-qi778mt7s0-rz20vk4xwo-ybcx8gaiy8 2
byxl352kpd-se5godm0gv-jukpzv1u7x-8kffj5th80-mf04nzwvrf 3
byxl352kpd-se5godm0gv-jukpzv1u7x-8kffj5th80-mf04nzwvrf 3
byxl352kpd-se5godm0gv-jukpzv1u7x-8kffj5th80-mf04nzwvrf 3
byxl352kpd-se5godm0gv-jukpzv1u7x-8kffj5th80-mf04nzwvrf 3
... ...
Any advice / guidance on the approach would be very much appreciated.
Aside
If I were using Stata, I would use encode and move on with my day...
. encode id, gen(id_n)
.
.
. tab1 id id_n , nolabel
-> tabulation of id
id | Freq. Percent Cum.
----------------------------------------+-----------------------------------
98llcibqon-u86ww0mzgo-ut58htcfv1-lybg.. | 4 10.00 10.00
9jdbh2e7z8-dc4o8mgsft-qi778mt7s0-rz20.. | 4 10.00 20.00
byxl352kpd-se5godm0gv-jukpzv1u7x-8kff.. | 4 10.00 30.00
dbvdcbvafv-cmb2zp67tn-gpnfwcvvpt-nl8q.. | 4 10.00 40.00
gc6dny3d5n-qzdkgfkpoc-iv1vnmwu4d-hubj.. | 4 10.00 50.00
jg1qpqhofy-02d2m62ayb-fg2f6dtvqc-vx4l.. | 4 10.00 60.00
o3qadvrqtl-kdyw9qpfir-7xeilvuk7e-g73o.. | 4 10.00 70.00
ts1j2y9q6u-nghpfhdxsl-vkdwk060gg-s3tr.. | 4 10.00 80.00
tsknifwb29-818zgpj2be-vq7558xhqa-1lgq.. | 4 10.00 90.00
wfrbcr3i8e-3vodo5wkrr-hp733zwkhy-uxm9.. | 4 10.00 100.00
----------------------------------------+-----------------------------------
Total | 40 100.00
-> tabulation of id_n
id_n | Freq. Percent Cum.
------------+-----------------------------------
1 | 4 10.00 10.00
2 | 4 10.00 20.00
3 | 4 10.00 30.00
4 | 4 10.00 40.00
5 | 4 10.00 50.00
6 | 4 10.00 60.00
7 | 4 10.00 70.00
8 | 4 10.00 80.00
9 | 4 10.00 90.00
10 | 4 10.00 100.00
------------+-----------------------------------
Total | 40 100.00

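Try this. After sorting, the sum statement id_n + 1 increments a counter on the first row of each BY group; the counter is retained across rows and starts at 0: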
proc sort data = value_id out = value_id_s ascii;
by id;
run;
data want;
set value_id_s;
by id;
if first.id then id_n + 1;
run;

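Perhaps use this. The input must already be sorted by id (value_id_s from the proc sort above). The retain statement is optional here, because the sum statement id_n+1 already retains id_n and initializes it to 0: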
data want;
retain id_n 0;
set value_id_s;
by id;
if first.id then id_n+1;
run;
id id_n
98llcibqon-u86ww0mzgo-ut58htcfv1-lybgj2gsn2-zlvu6n0mym 1
98llcibqon-u86ww0mzgo-ut58htcfv1-lybgj2gsn2-zlvu6n0mym 1
98llcibqon-u86ww0mzgo-ut58htcfv1-lybgj2gsn2-zlvu6n0mym 1
98llcibqon-u86ww0mzgo-ut58htcfv1-lybgj2gsn2-zlvu6n0mym 1
9jdbh2e7z8-dc4o8mgsft-qi778mt7s0-rz20vk4xwo-ybcx8gaiy8 2
9jdbh2e7z8-dc4o8mgsft-qi778mt7s0-rz20vk4xwo-ybcx8gaiy8 2
9jdbh2e7z8-dc4o8mgsft-qi778mt7s0-rz20vk4xwo-ybcx8gaiy8 2
9jdbh2e7z8-dc4o8mgsft-qi778mt7s0-rz20vk4xwo-ybcx8gaiy8 2
byxl352kpd-se5godm0gv-jukpzv1u7x-8kffj5th80-mf04nzwvrf 3
byxl352kpd-se5godm0gv-jukpzv1u7x-8kffj5th80-mf04nzwvrf 3
byxl352kpd-se5godm0gv-jukpzv1u7x-8kffj5th80-mf04nzwvrf 3
byxl352kpd-se5godm0gv-jukpzv1u7x-8kffj5th80-mf04nzwvrf 3
...
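For readers who want to check the logic outside SAS, here is a minimal sketch of the same encoding in Python (standard library only). The helper name encode_levels is hypothetical and not part of either answer. It assigns 1, 2, 3, ... to the distinct ids in sorted order, which is what the BY-group counter produces after the ASCII sort.

```python
def encode_levels(ids):
    """Map each distinct string to 1, 2, 3, ... in sorted order."""
    # sorted() on str compares by code point, which agrees with an ASCII
    # sort for plain ASCII ids such as the ones in the question.
    codes = {value: n for n, value in enumerate(sorted(set(ids)), start=1)}
    return [codes[value] for value in ids]


# A few ids from the question, deliberately left unsorted.
ids = [
    "9jdbh2e7z8-dc4o8mgsft-qi778mt7s0-rz20vk4xwo-ybcx8gaiy8",
    "byxl352kpd-se5godm0gv-jukpzv1u7x-8kffj5th80-mf04nzwvrf",
    "98llcibqon-u86ww0mzgo-ut58htcfv1-lybgj2gsn2-zlvu6n0mym",
    "9jdbh2e7z8-dc4o8mgsft-qi778mt7s0-rz20vk4xwo-ybcx8gaiy8",
]
id_n = encode_levels(ids)
print(id_n)
```

Unlike the data step, this sketch does not need the rows themselves to be sorted, because the codes come from the sorted set of distinct values.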

Related

Keep all the records for all IDs with a specific code

Consider the following data example:
clear
input id code cost
1 15342 18
2 15366 12
1 16786 32
2 15342 12
3 12345 45
4 23453 345
1 34234 23
2 22223 12
4 22342 64
3 23452 23
1 23432 22
end
How can I keep all the records for the IDs that contain the code 15342 in any row?
This is a follow-up question to a previous one of mine: Keeping all the records for specific IDs
The following works for me:
clear
input id code cost
1 15342 18
2 15366 12
1 16786 32
2 15342 12
3 12345 45
4 23453 345
1 34234 23
2 22223 12
4 15342 64
3 23452 23
1 23432 22
end
bysort id (code): egen tag = total(inlist(code, 15342))
keep if tag
Results:
list, sepby(id)
+-------------------------+
| id code cost tag |
|-------------------------|
1. | 1 15342 18 1 |
2. | 1 16786 32 1 |
3. | 1 23432 22 1 |
4. | 1 34234 23 1 |
|-------------------------|
5. | 2 15342 12 1 |
6. | 2 15366 12 1 |
7. | 2 22223 12 1 |
|-------------------------|
8. | 4 15342 64 1 |
9. | 4 23453 345 1 |
+-------------------------+
Note that I changed the data example slightly for better illustration.
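The same two-step idea (flag the ids that have the code anywhere, then keep every row of those ids) can be sketched in Python as a cross-check. This is only an illustration of the logic, not a replacement for the egen approach; the names TARGET, ids_with_target and kept are made up for the example.

```python
# (id, code, cost) rows from the modified data example in this answer
rows = [
    (1, 15342, 18), (2, 15366, 12), (1, 16786, 32), (2, 15342, 12),
    (3, 12345, 45), (4, 23453, 345), (1, 34234, 23), (2, 22223, 12),
    (4, 15342, 64), (3, 23452, 23), (1, 23432, 22),
]
TARGET = 15342

# Step 1: collect the ids that have the target code in at least one row.
ids_with_target = {id_ for id_, code, _ in rows if code == TARGET}

# Step 2: keep every row belonging to one of those ids, sorted by id and code.
kept = sorted(row for row in rows if row[0] in ids_with_target)

for row in kept:
    print(row)
```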

Calculate the sum of a variable

I would like to calculate the sum of variable boasav:
clear
input id boasav
1 2500
1 2900
1 4200
2 5700
2 6100
3 7400
3 7600
3 8300
end
I know that the tabulate command can be used to summarize data but it only counts:
bys id: tab boasav
-> id = 1
boasav | Freq. Percent Cum.
------------+-----------------------------------
2500 | 1 33.33 33.33
2900 | 1 33.33 66.67
4200 | 1 33.33 100.00
------------+-----------------------------------
Total | 3 100.00
-> id = 2
boasav | Freq. Percent Cum.
------------+-----------------------------------
5700 | 1 50.00 50.00
6100 | 1 50.00 100.00
------------+-----------------------------------
Total | 2 100.00
-> id = 3
boasav | Freq. Percent Cum.
------------+-----------------------------------
7400 | 1 33.33 33.33
7600 | 1 33.33 66.67
8300 | 1 33.33 100.00
------------+-----------------------------------
Total | 3 100.00
However, what I want is the following:
1 9600
2 11800
3 23300
Is there a function that can do this in Stata?
Here are three more.
clear
input id boasav
1 2500
1 2900
1 4200
2 5700
2 6100
3 7400
3 7600
3 8300
end
* Method 4: use summarize
forval g = 1/3 {
    su boasav if id == `g', meanonly
    di "`g' " %5.0f r(sum)
}
1 9600
2 11800
3 23300
* Method 5: tabstat
tabstat boasav, by(id) stat(sum)
Summary for variables: boasav
by categories of: id
id | sum
---------+----------
1 | 9600
2 | 11800
3 | 23300
---------+----------
Total | 44700
--------------------
* Method 6: use rangestat (SSC)
rangestat (sum) boasav, int(id 0 0)
tabdisp id, c(boasav_sum)
-------------------------
id | sum of boasav
----------+--------------
1 | 9600
2 | 11800
3 | 23300
-------------------------
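As a quick way to verify the group totals independently of Stata, here is a small Python sketch that accumulates boasav by id. It only mirrors the arithmetic of the methods above; the variable name totals is an assumption of the example.

```python
from collections import defaultdict

# (id, boasav) rows from the question
rows = [
    (1, 2500), (1, 2900), (1, 4200),
    (2, 5700), (2, 6100),
    (3, 7400), (3, 7600), (3, 8300),
]

# Accumulate one running sum per id.
totals = defaultdict(int)
for id_, boasav in rows:
    totals[id_] += boasav

for id_ in sorted(totals):
    print(id_, totals[id_])
```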

Create table for asclogit and nlogit

Suppose I have the following table:
id | car | sex | income
-------------------------------
1 | European | Male | 45000
2 | Japanese | Female | 48000
3 | American | Male | 53000
I would like to create the one below:
| id | car | choice | sex | income
--------------------------------------------
1.| 1 | European | 1 | Male | 45000
2.| 1 | American | 0 | Male | 45000
3.| 1 | Japanese | 0 | Male | 45000
| ----------------------------------------
4.| 2 | European | 0 | Female | 48000
5.| 2 | American | 0 | Female | 48000
6.| 2 | Japanese | 1 | Female | 48000
| ----------------------------------------
7.| 3 | European | 0 | Male | 53000
8.| 3 | American | 1 | Male | 53000
9.| 3 | Japanese | 0 | Male | 53000
I would like to fit an asclogit model and, according to Example 1 in Stata's manual, this table format seems necessary. However, I have not found a way to create it easily.
You can use the cross command to generate all the possible combinations:
clear
input byte id str10 car str8 sex long income
1 "European" "Male" 45000
2 "Japanese" "Female" 48000
3 "American" "Male" 53000
end
generate choice = 0
save old, replace
keep id
save new, replace
use old
rename id =_0
cross using new
replace choice = 1 if id_0 == id
replace sex = cond(id == 2, "Female", "Male")
replace income = cond(id == 1, 45000, cond(id == 2, 48000, 53000))
Note that the use of the cond() function here is equivalent to:
replace sex = "Male" if id == 1
replace sex = "Female" if id == 2
replace sex = "Male" if id == 3
replace income = 45000 if id == 1
replace income = 48000 if id == 2
replace income = 53000 if id == 3
The above code snippet produces the desired output:
drop id_0
order id car choice sex income
sort id car
list, sepby(id)
+------------------------------------------+
| id car choice sex income |
|------------------------------------------|
1. | 1 American 0 Male 45000 |
2. | 1 European 1 Male 45000 |
3. | 1 Japanese 0 Male 45000 |
|------------------------------------------|
4. | 2 American 0 Female 48000 |
5. | 2 European 0 Female 48000 |
6. | 2 Japanese 1 Female 48000 |
|------------------------------------------|
7. | 3 American 1 Male 53000 |
8. | 3 European 0 Male 53000 |
9. | 3 Japanese 0 Male 53000 |
+------------------------------------------+
For more information, type help cross and help cond() from Stata's command prompt.
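To make the target layout concrete, here is a hedged Python sketch of the same expansion: every person is paired with every car, and choice marks the car that person actually picked. It is not how cross works internally, just the end result; people, cars and long_rows are names invented for the illustration.

```python
from itertools import product

# One row per person, as in the original table.
people = [
    {"id": 1, "car": "European", "sex": "Male", "income": 45000},
    {"id": 2, "car": "Japanese", "sex": "Female", "income": 48000},
    {"id": 3, "car": "American", "sex": "Male", "income": 53000},
]

# All alternatives, sorted so the output matches the sorted listing above.
cars = sorted({p["car"] for p in people})

# One row per (person, alternative); choice is 1 only for the chosen car.
long_rows = [
    {
        "id": p["id"],
        "car": car,
        "choice": int(car == p["car"]),
        "sex": p["sex"],
        "income": p["income"],
    }
    for p, car in product(people, cars)
]

for row in long_rows:
    print(row)
```

Because sex and income are copied from the person's own row, no separate fill-in step is needed in this version.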
Please see dataex in Stata for how to produce data examples useful in web forums. (If necessary, install first using ssc install dataex.)
This could be an exercise in using fillin followed by filling in the missings.
* Example generated by -dataex-. To install: ssc install dataex
clear
input byte id str10 car str8 sex long income
1 "European" "Male" 45000
2 "Japanese" "Female" 48000
3 "American" "Male" 53000
end
fillin id car
foreach v in sex income {
    bysort id (_fillin) : replace `v' = `v'[1]
}
list , sepby(id)
+-------------------------------------------+
| id car sex income _fillin |
|-------------------------------------------|
1. | 1 European Male 45000 0 |
2. | 1 American Male 45000 1 |
3. | 1 Japanese Male 45000 1 |
|-------------------------------------------|
4. | 2 Japanese Female 48000 0 |
5. | 2 European Female 48000 1 |
6. | 2 American Female 48000 1 |
|-------------------------------------------|
7. | 3 American Male 53000 0 |
8. | 3 European Male 53000 1 |
9. | 3 Japanese Male 53000 1 |
+-------------------------------------------+
A provisional solution using pandas in Python is the following:
1) Load the dataset:
import pandas as pd
df = pd.read_stata("mybase.dta")
2) Use the code of the accepted answer of this question.
3) Save the dataset:
df.to_stata("newbase.dta")
If one wants to use dummy variables, reshape also is an option.
clear
input byte id str10 car str8 sex long income
1 "European" "Male" 45000
2 "Japanese" "Female" 48000
3 "American" "Male" 53000
end
tabulate car, gen(choice)
reshape long choice, i(id)
label define car 2 "European" 3 "Japanese" 1 "American"
drop car
rename _j car
label values car car
list, sepby(id)
+------------------------------------------+
| id car sex income choice |
|------------------------------------------|
1. | 1 American Male 45000 0 |
2. | 1 European Male 45000 1 |
3. | 1 Japanese Male 45000 0 |
|------------------------------------------|
4. | 2 American Female 48000 0 |
5. | 2 European Female 48000 0 |
6. | 2 Japanese Female 48000 1 |
|------------------------------------------|
7. | 3 American Male 53000 1 |
8. | 3 European Male 53000 0 |
9. | 3 Japanese Male 53000 0 |
+------------------------------------------+

Find Lagged Average of Group

I am trying to create instruments from a three-dimensional panel dataset, as included below:
input firm year market price comp_avg
1 2000 10 1 .
3 2000 10 2 .
3 2001 10 3 .
1 2002 10 4 .
3 2002 10 5 .
1 2000 20 6 .
3 2000 20 7 .
1 2001 20 8 .
2 2001 20 9 .
3 2001 20 10 .
1 2002 20 20 .
2 2002 20 30 .
3 2002 20 40 .
2 2000 30 50 .
1 2001 30 60 .
2 2001 30 70 .
1 2002 30 80 .
2 2002 30 90 .
end
The instrument I am trying to create is the lagged (year-1) average price of a firm's competitors (the other firms in the same market), computed for each market in which the firm operates in a given year.
At the moment, I have some code that does the job, but I am hoping that I am missing something and can do this in a clearer or more efficient way.
Here is the code:
// for each firm
qui levelsof firm, local(firms)
qui foreach f in `firms' {
    // find all years for that firm
    levelsof year if firm == `f', local(years)
    foreach y in `years' {
        // skip first year (because there is no lagged data)
        if `y' == 2000 {
            continue
        }
        // find all markets in that year
        levelsof market if firm == `f' & year == `y', local(mkts)
        local L1 = `y'-1
        foreach m in `mkts' {
            // get average of all competitors in that market in the year prior
            gen temp = firm != `f' & year == `L1' & market == `m'
            su price if temp
            replace comp_avg = r(mean) if firm == `f' & market == `m' & year == `y'
            drop temp
        }
    }
}
The data I am working with are reasonably large (~1 million obs) so the faster the better.
clear
input firm year market price
1 2000 10 1
3 2000 10 2
3 2001 10 3
1 2002 10 4
3 2002 10 5
1 2000 20 6
3 2000 20 7
1 2001 20 8
2 2001 20 9
3 2001 20 10
1 2002 20 20
2 2002 20 30
3 2002 20 40
2 2000 30 50
1 2001 30 60
2 2001 30 70
1 2002 30 80
2 2002 30 90
end
bysort firm market (year) : gen Lprice = price[_n-1] if year - year[_n-1] == 1
bysort market year : egen total = total(Lprice)
bysort market year : egen count = count(Lprice)
gen mean_others = (total - cond(missing(Lprice), 0, Lprice)) ///
/ (count - cond(missing(Lprice), 0, 1))
sort market year
list market year firm price Lprice mean_others total count, sepby(market year)
     +------------------------------------------------------------------+
     | market   year   firm   price   Lprice   mean_o~s   total   count |
     |------------------------------------------------------------------|
  1. |     10   2000      1       1        .          .       0       0 |
  2. |     10   2000      3       2        .          .       0       0 |
     |------------------------------------------------------------------|
  3. |     10   2001      3       3        2          .       2       1 |
     |------------------------------------------------------------------|
  4. |     10   2002      1       4        .          3       3       1 |
  5. |     10   2002      3       5        3          .       3       1 |
     |------------------------------------------------------------------|
  6. |     20   2000      3       7        .          .       0       0 |
  7. |     20   2000      1       6        .          .       0       0 |
     |------------------------------------------------------------------|
  8. |     20   2001      2       9        .        6.5      13       2 |
  9. |     20   2001      3      10        7          6      13       2 |
 10. |     20   2001      1       8        6          7      13       2 |
     |------------------------------------------------------------------|
 11. |     20   2002      1      20        8        9.5      27       3 |
 12. |     20   2002      3      40       10        8.5      27       3 |
 13. |     20   2002      2      30        9          9      27       3 |
     |------------------------------------------------------------------|
 14. |     30   2000      2      50        .          .       0       0 |
     |------------------------------------------------------------------|
 15. |     30   2001      2      70       50          .      50       1 |
 16. |     30   2001      1      60        .         50      50       1 |
     |------------------------------------------------------------------|
 17. |     30   2002      2      90       70         60     130       2 |
 18. |     30   2002      1      80       60         70     130       2 |
     +------------------------------------------------------------------+
My approach breaks it down:
1. Calculate the previous price for the same firm and market. (Step 1 could also be done by declaring a (firm, market) pair a panel.)
2. The mean of the other values (here, previous prices) in the same market and year is (sum of all values MINUS this firm's value) divided by (number of values MINUS 1).
3. Step 2 needs a modification: if this firm's previous price is missing, you need to subtract 0 from both numerator and denominator. Stata's normal rules would render sum MINUS missing as missing, but this firm's previous price might be unknown while others in the same market have known prices.
Note: There are small ways of speeding up your code, but this should be faster (so long as it is correct).
EDIT: Another solution (2 lines) using rangestat (must be installed using ssc inst rangestat):
bysort firm market (year) : gen Lprice = price[_n-1] if year - year[_n-1] == 1
rangestat Lprice, interval(year 0 0) by(market) excludeself
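The leave-one-out arithmetic of the first solution can be checked with a short Python sketch. It is only a reference implementation of the three steps under the same definitions (lag by exactly one year within firm and market, then total and count of non-missing lags per market and year); the helper name mean_others and the dictionaries are assumptions of the example, and None stands in for Stata's missing value.

```python
from collections import defaultdict

# (firm, year, market, price) rows from the question
rows = [
    (1, 2000, 10, 1), (3, 2000, 10, 2), (3, 2001, 10, 3), (1, 2002, 10, 4),
    (3, 2002, 10, 5), (1, 2000, 20, 6), (3, 2000, 20, 7), (1, 2001, 20, 8),
    (2, 2001, 20, 9), (3, 2001, 20, 10), (1, 2002, 20, 20), (2, 2002, 20, 30),
    (3, 2002, 20, 40), (2, 2000, 30, 50), (1, 2001, 30, 60), (2, 2001, 30, 70),
    (1, 2002, 30, 80), (2, 2002, 30, 90),
]

price = {(f, m, y): p for f, y, m, p in rows}

# Step 1: lagged price, None when the firm has no record in year - 1.
lagged = {(f, m, y): price.get((f, m, y - 1)) for f, y, m, _ in rows}

# Step 2: total and count of non-missing lagged prices per (market, year).
total = defaultdict(float)
count = defaultdict(int)
for (f, m, y), lp in lagged.items():
    if lp is not None:
        total[m, y] += lp
        count[m, y] += 1


def mean_others(f, m, y):
    """Mean lagged price of the other firms in the same market and year."""
    lp = lagged[f, m, y]
    # Step 3: a missing own value contributes 0 to both sum and count.
    own_sum = 0 if lp is None else lp
    own_n = 0 if lp is None else 1
    n = count[m, y] - own_n
    return (total[m, y] - own_sum) / n if n > 0 else None


print(mean_others(2, 20, 2001))
```

The values it returns can be compared row by row with the mean_others column in the listing above.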

How to collapse numbers with same identifier but different date, but preserve the date of first observation for each identifier

I have a dataset that can be simplified in the following format:
clear
input str9 Date ID VarA VarB
"12jan2010" 5 21 42
"12jan2010" 6 47 21
"15jan2010" 10 7 68
"17jan2010" 6 -5 -3
"19jan2010" 6 -1 -1
end
In the dataset, there is Date, ID, VarA, and VarB. Each ID represents a unique set of transactions. I want to collapse (sum) VarA VarB, by(Date) in Stata. However, I want to keep the date of the first observation for each ID number.
Essentially, I want the above dataset to become the following:
+--------------------------------+
| Date ID VarA VarB |
|--------------------------------|
| 12jan2010 5 21 42 |
| 12jan2010 6 41 17 |
| 15jan2010 10 7 68 |
+--------------------------------+
The observations dated 12jan2010, 17jan2010, and 19jan2010 for ID 6 share the same ID, so I want to collapse (sum) VarA VarB over these three observations. I want to keep the date 12jan2010 because it is the date of the first observation. The other two observations are dropped.
I know it might be possible to collapse by ID first and then merge with the original dataset and then subset. I was wondering if there is an easier way to make this work. Thanks!
collapse allows you to calculate a variety of statistics, so you can convert your string date into a numerical date, then take the minimum of the numerical date to get the first occurrence.
clear
input str9 Date ID VarA VarB
"12jan2010" 5 21 42
"12jan2010" 6 47 21
"15jan2010" 10 7 68
"17jan2010" 6 -5 -3
"19jan2010" 6 -1 -1
end
gen Date2 = date(Date, "DMY")
format Date2 %td
collapse (sum) VarA VarB (min) Date2 , by(ID)
order Date2, first
li
yielding
+------------------------------+
| Date2 ID VarA VarB |
|------------------------------|
1. | 12jan2010 5 21 42 |
2. | 12jan2010 6 41 17 |
3. | 15jan2010 10 7 68 |
+------------------------------+
In response to the comment: You can generate the formatted date for only observations where VarA is > 0 (and not missing). (Assuming that, per your comment, VarA & VarB always have the same sign.)
// now assume ID 6 has an earliest date of 17jan2005 (obs.4)
// but you want to return your 'first date' as the
// first date where varA & varB are both positive
clear
input str9 Date ID VarA VarB
"12jan2010" 5 21 42
"12jan2010" 6 47 21
"15jan2010" 10 7 68
"17jan2005" 6 -5 -3
"19jan2010" 6 -1 -1
end
gen Date2 = date(Date, "DMY") if VarA > 0 & !missing(VarA)
format Date2 %td
collapse (sum) VarA VarB (min) Date2 , by(ID)
order Date2, first
li
yielding
+------------------------------+
| Date2 ID VarA VarB |
|------------------------------|
1. | 12jan2010 5 21 42 |
2. | 12jan2010 6 41 17 |
3. | 15jan2010 10 7 68 |
+------------------------------+
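For comparison, the first collapse above (sum of VarA and VarB plus the earliest date per ID) can be sketched in plain Python. This is a rough equivalent for checking results, not a description of how collapse is implemented; the dictionary name collapsed is made up for the example, and the date format string assumes English month abbreviations.

```python
from datetime import datetime

# (Date, ID, VarA, VarB) rows from the first data example
rows = [
    ("12jan2010", 5, 21, 42),
    ("12jan2010", 6, 47, 21),
    ("15jan2010", 10, 7, 68),
    ("17jan2010", 6, -5, -3),
    ("19jan2010", 6, -1, -1),
]

collapsed = {}
for date_str, id_, var_a, var_b in rows:
    # Convert the string date to a real date so that min() is chronological.
    day = datetime.strptime(date_str, "%d%b%Y").date()
    if id_ not in collapsed:
        collapsed[id_] = {"Date": day, "VarA": var_a, "VarB": var_b}
    else:
        rec = collapsed[id_]
        rec["Date"] = min(rec["Date"], day)  # keep the earliest date
        rec["VarA"] += var_a                 # running sums
        rec["VarB"] += var_b

for id_ in sorted(collapsed):
    print(id_, collapsed[id_])
```

To mimic the second variant, one would skip the date update for rows where VarA is not positive, in the same spirit as the if qualifier on gen Date2.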