Stata: Reverse Sort

I have a data set that looks like this:
id varA varB varC
1 0 10 .
1 0 20 .
1 0 35 .
2 1 60 76
2 1 76 60
2 0 32 .
I want to create varC, which reverses the order of varB within each consecutive block of observations with varA == 1, and is missing otherwise.

This may help:
clear
input id varA varB varC
1 0 10 .
1 0 20 .
1 0 35 .
2 1 60 76
2 1 76 60
2 0 32 .
end
* number the runs of constant (id, varA): the comparison is 1 at each change, and its running sum labels the runs
gen group = sum(id != id[_n-1] | varA != varA[_n-1])
sort group, stable
* within each run, reverse the order of varB when varA == 1; missing otherwise
by group: gen wanted = cond(varA == 1, varB[_N - _n + 1], .)
list id var* wanted, sepby(id varA)
+----------------------------------+
| id varA varB varC wanted |
|----------------------------------|
1. | 1 0 10 . . |
2. | 1 0 20 . . |
3. | 1 0 35 . . |
|----------------------------------|
4. | 2 1 60 76 76 |
5. | 2 1 76 60 60 |
|----------------------------------|
6. | 2 0 32 . . |
+----------------------------------+
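As a quick check (a small sketch added here, not part of the original answer), the constructed variable can be compared against the varC column already present in the example data:
assert wanted == varC
assert missing(wanted) if varA == 0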

Related

Bidirectional VLOOKUP - flag in the same table - SAS

I need to do this:
table 1:
ID Cod.
1 20
2 102
4 30
7 10
9 201
10 305
table 2:
ID Cod.
1 20
2 50
3 15
4 30
5 25
7 10
10 300
Now I have a table like this, produced with an outer join:
ID Cod. ID1 Cod1.
1 20 1 20
2 50 . .
. . 2 102
3 15 . .
4 30 4 30
5 25 . .
7 10 7 10
. . 9 201
10 300 . .
. . 10 305
Now I want to add a flag that tells me whether the IDs have common values, so:
ID Cod. ID1 Cod1. Flag_ID Flag_cod
1 20 1 20 0 0
2 50 . . 0 1
. . 2 102 0 1
3 15 . . 1 1
4 30 4 30 0 0
5 25 . . 1 1
7 10 7 10 0 0
. . 9 201 1 1
10 300 . . 0 1
. . 10 305 0 1
I would like to know how I can get Flag_ID, specifically to cover the cases of ID = 2 or ID = 10.
Thank you
You can group by a coalesce() of the two id columns in order to count and compare details.
Example
data table1;
input id code @@; datalines;
1 20 2 102 4 30 7 10 9 201 10 305
;
data table2;
input id code @@; datalines;
1 20 2 50 3 15 4 30 5 25 7 10 10 300
;
proc sql;
create table got as
select
table2.id, table2.code
, table1.id as id1, table1.code as code1
, case
when count(table1.id) = 1 and count(table2.id) = 1 then 0 else 1
end as flag_id
, case
when table1.code - table2.code ne 0 then 1 else 0
end as flag_code
from
table1
full join
table2
on
table2.id=table1.id and table2.code=table1.code
group by
coalesce(table2.id,table1.id)
;
quit;
You might also want to look into
Proc COMPARE with BY

How to rank observations in panel data?

I have a panel dataset in Stata with several countries and each country containing groups. I would like to rank the groups by country, according to the variable var1.
The structure of my dataset is as follows (the rank column is what I would like to achieve). Note that var1 is indeed constant within groups (it is just the within group average of another variable).
--country--|--groupId--|---time----|---var1----|---rank---
1 | 1 | 1 | 50 | 3
1 | 1 | 2 | 50 | 3
1 | 1 | 3 | 50 | 3
1 | 2 | 1 | 90 | 1
1 | 2 | 2 | 90 | 1
1 | 2 | 3 | 90 | 1
1 | 3 | 1 | 60 | 2
1 | 3 | 2 | 60 | 2
1 | 3 | 3 | 60 | 2
2 | 4 | 1 | 15 | 2
2 | 4 | 2 | 15 | 2
2 | 4 | 3 | 15 | 2
2 | 5 | 1 | 10 | 3
2 | 5 | 2 | 10 | 3
2 | 5 | 3 | 10 | 3
2 | 6 | 1 | 80 | 1
2 | 6 | 2 | 80 | 1
2 | 6 | 3 | 80 | 1
Among the options I have tried is the following:
sort country groupId
by country (groupId): egen rank = rank(var1)
However, I cannot achieve the desired result.
Thanks for the data example. There are two problems with your code. One is that as you want to rank from highest to lowest, you need to negate the argument to rank(). The second is that given the repetitions, you need to rank on one time only and then copy those ranks to other times.
This works with your data example, here edited to be input code. (See also the Stata tag wiki for that principle.)
clear
input country groupId time var1 rank
1 1 1 50 3
1 1 2 50 3
1 1 3 50 3
1 2 1 90 1
1 2 2 90 1
1 2 3 90 1
1 3 1 60 2
1 3 2 60 2
1 3 3 60 2
2 4 1 15 2
2 4 2 15 2
2 4 3 15 2
2 5 1 10 3
2 5 2 10 3
2 5 3 10 3
2 6 1 80 1
2 6 2 80 1
2 6 3 80 1
end
bysort country : egen wanted = rank(-var1) if time == 1
bysort country groupId (time) : replace wanted = wanted[1]
assert rank == wanted
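An alternative sketch (not from the original answer) relies on the field option of egen's rank() function, which assigns rank 1 to the highest value, so no negation is needed:
bysort country : egen wanted2 = rank(var1) if time == 1, field
bysort country groupId (time) : replace wanted2 = wanted2[1]
assert rank == wanted2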

Keep all the records for all IDs with a specific code

Consider the following data example:
clear
input id code cost
1 15342 18
2 15366 12
1 16786 32
2 15342 12
3 12345 45
4 23453 345
1 34234 23
2 22223 12
4 22342 64
3 23452 23
1 23432 22
end
How can I keep all the records for the IDs that contain the code 15342 in any row?
This is a follow-up question to a previous one of mine: Keeping all the records for specific IDs
The following works for me:
clear
input id code cost
1 15342 18
2 15366 12
1 16786 32
2 15342 12
3 12345 45
4 23453 345
1 34234 23
2 22223 12
4 15342 64
3 23452 23
1 23432 22
end
bysort id (code): egen tag = total(inlist(code, 15342))
keep if tag
Results:
list, sepby(id)
+-------------------------+
| id code cost tag |
|-------------------------|
1. | 1 15342 18 1 |
2. | 1 16786 32 1 |
3. | 1 23432 22 1 |
4. | 1 34234 23 1 |
|-------------------------|
5. | 2 15342 12 1 |
6. | 2 15366 12 1 |
7. | 2 22223 12 1 |
|-------------------------|
8. | 4 15342 64 1 |
9. | 4 23453 345 1 |
+-------------------------+
Note that I changed the data example slightly for better illustration.
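A closely related sketch (not from the original answer), starting again from the example data, tags each id with the maximum of an indicator instead of a total:
egen tag2 = max(code == 15342), by(id)
keep if tag2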

How to transform long to wide data in Stata?

I have this data:
id test test_date value
1 A 02/06/2014 12:26 11
1 B 02/06/2014 12:26 23
1 C 02/06/2014 13:17 43
1 D 02/06/2014 13:17 65
1 E 02/06/2014 13:17 34
1 F 02/06/2014 13:17 64
1 A 05/06/2014 15:14 234
1 B 05/06/2014 15:14 646
1 C 05/06/2014 16:50 44
1 E 05/06/2014 16:50 55
2 E 05/06/2014 16:50 443
2 F 05/06/2014 16:50 22
2 G 05/06/2014 16:59 445
2 B 05/06/2014 20:03 66
2 C 05/06/2014 20:03 77
2 D 05/06/2014 20:03 88
2 E 05/06/2014 20:03 44
2 F 05/06/2014 20:19 33
2 G 05/06/2014 20:19 22
I would like to transform this data into wide format like this:
id date A B C D E F G
1 02/06/2014 12:26 11 23 43 65 34 64 .
1 05/06/2014 15:14 234 646 44 . 55 . .
2 05/06/2014 16:50 . . . . 443 22 445
2 05/06/2014 20:03 . 66 77 88 44 33 22
I am using the reshape command in Stata, but it is not producing the required results:
reshape wide test_date value, i(id) j(test) string
Any idea how to do this?
UPDATE:
You're right that we need this missvar. I tried to create it programmatically, but failed. Let's say that test dates within 2 hours of each other belong to the same batch. We have only 7 tests (A, B, C, D, E, F, G). First I tried to find the time difference:
bysort id: gen diff_bd = (test_date[_n] - test_date[_n-1])/(1000*60*60)
bysort id: generate missvar = _n if diff_bd <= 2
@jfeigenbaum has given part of the answer.
The problem I see is that you are missing a variable that identifies relevant sub-groups. These sub-groups seem to be bounded by test taking values A - G. But I may be wrong.
I've included this variable in the example data set, and named it missvar. I forced this variable into the data set believing it identifies groups that, although implicit in your original post, are important for your analysis.
clear
set more off
*----- example data -----
input ///
id str1 test str30 test_date value missvar
1 A "02/06/2014 12:26" 11 1
1 B "02/06/2014 12:26" 23 1
1 C "02/06/2014 13:17" 43 1
1 D "02/06/2014 13:17" 65 1
1 E "02/06/2014 13:17" 34 1
1 F "02/06/2014 13:17" 64 1
1 A "05/06/2014 15:14" 234 2
1 B "05/06/2014 15:14" 646 2
1 C "05/06/2014 16:50" 44 2
1 E "05/06/2014 16:50" 55 2
2 E "05/06/2014 16:50" 443 1
2 F "05/06/2014 16:50" 22 1
2 G "05/06/2014 16:59" 445 1
2 B "05/06/2014 20:03" 66 2
2 C "05/06/2014 20:03" 77 2
2 D "05/06/2014 20:03" 88 2
2 E "05/06/2014 20:03" 44 2
2 F "05/06/2014 20:19" 33 2
2 G "05/06/2014 20:19" 22 2
end
gen double tdate = clock( test_date, "DM20Yhm")
format %tc tdate
drop test_date
list, sepby(id)
*----- what you want ? -----
reshape wide value, i(id missvar tdate) j(test) string
collapse (min) tdate value?, by(id missvar)
rename value* *
list
There should be some way of identifying the groups programmatically. Relying on the original sort order of the data is one way, but it may not be the safest. It may be the only way, but only you know that.
Edit
Regarding your comment and the "missing" variable, one way to create it is:
// two hours is 7200000 milliseconds (one hour = 3600000)
bysort id (tdate): gen batch = sum(tdate - tdate[_n-1] > 7200000)
For your example data, this creates a batch variable identical to my missvar. You can also use time-series operators.
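To make the time-series-operator remark concrete, here is a minimal sketch (added for illustration, not code from the original answer); it assumes tdate is the %tc variable created above and uses a within-id observation counter as the time variable:
bysort id (tdate): gen obsno = _n // observation counter within id
xtset id obsno // declare the panel structure so D. can be used
gen byte newgap = D.tdate > 7200000 | missing(D.tdate) // D.tdate is missing at the first observation of each id
bysort id (obsno): gen batch2 = sum(newgap) // running count of gaps = batch number within id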
Let me emphasize the need for you to be careful when presenting your example data. It must be representative of the real data, or you might get code that doesn't suit it; that includes the possibility that you don't notice, because Stata gives no error.
For example, if you have the same test, applied to the same id within the two-hour limit, then you'll lose information with this code (in the collapse). (This is not a problem in your example data.)
Edit 2
In response to another question found in the comments:
Suppose a new observation for person 1, such that he receives a repeated test within the two-hour limit, but at a different time:
1 A "02/06/2014 12:26" 11 1 // old observation
1 B "02/06/2014 12:26" 23 1
1 A "02/06/2014 12:35" 99 1 // new observation
1 C "02/06/2014 13:17" 43 1
1 D "02/06/2014 13:17" 65 1
1 E "02/06/2014 13:17" 34 1
1 F "02/06/2014 13:17" 64 1
1 A "05/06/2014 15:14" 234 2
1 B "05/06/2014 15:14" 646 2
1 C "05/06/2014 16:50" 44 2
1 E "05/06/2014 16:50" 55 2
Test A is applied at 12:26 and at 12:35. Reshape will have no problem with this, but the collapse will discard information because it takes the minimum values among the id missvar groups; notice that for the variable valueA, the new information (the 99) will be lost (the same happens with the other variables, but you are explicit about wanting to discard that). After the reshape but before the collapse you get:
. list, sepby(id)
+--------------------------------------------------------------------------------------------------+
| id missvar tdate valueA valueB valueC valueD valueE valueF valueG |
|--------------------------------------------------------------------------------------------------|
1. | 1 1 02jun2014 12:26:00 11 23 . . . . . |
2. | 1 1 02jun2014 12:35:00 99 . . . . . . |
3. | 1 1 02jun2014 13:17:00 . . 43 65 34 64 . |
4. | 1 2 05jun2014 15:14:00 234 646 . . . . . |
5. | 1 2 05jun2014 16:50:00 . . 44 . 55 . . |
|--------------------------------------------------------------------------------------------------|
6. | 2 1 05jun2014 16:50:00 . . . . 443 22 . |
7. | 2 1 05jun2014 16:59:00 . . . . . . 445 |
8. | 2 2 05jun2014 20:03:00 . 66 77 88 44 . . |
9. | 2 2 05jun2014 20:19:00 . . . . . 33 22 |
+--------------------------------------------------------------------------------------------------+
Running the complete code confirms what we just said:
. list, sepby(id)
+--------------------------------------------------------------------------+
| id missvar tdate A B C D E F G |
|--------------------------------------------------------------------------|
1. | 1 1 02jun2014 12:26:00 11 23 43 65 34 64 . |
2. | 1 2 05jun2014 15:14:00 234 646 44 . 55 . . |
|--------------------------------------------------------------------------|
3. | 2 1 05jun2014 16:50:00 . . . . 443 22 445 |
4. | 2 2 05jun2014 20:03:00 . 66 77 88 44 33 22 |
+--------------------------------------------------------------------------+
Suppose now a new observation for person 1, such that he receives a repeated test within the two-hour limit, but at the same time:
1 A "02/06/2014 12:26" 11 1 // old observation
1 B "02/06/2014 12:26" 23 1
1 A "02/06/2014 12:26" 99 1 // new observation
1 C "02/06/2014 13:17" 43 1
1 D "02/06/2014 13:17" 65 1
1 E "02/06/2014 13:17" 34 1
1 F "02/06/2014 13:17" 64 1
1 A "05/06/2014 15:14" 234 2
1 B "05/06/2014 15:14" 646 2
1 C "05/06/2014 16:50" 44 2
1 E "05/06/2014 16:50" 55 2
Then the reshape won't work. Stata complains:
values of variable test not unique within id missvar tdate
and with reason. The error is clear in signalling the problem. (If not clear, go back to help reshape and work out some exercises.) The request makes no sense given the functioning of the command.
Finally, note it's relatively easy to check if something will work or not: just try it! All that was necessary in this case was to modify a bit the example data. Go back to help files and manuals, if necessary.
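A quick programmatic pre-check (a small sketch, not part of the original answer) is isid, which exits with an error if the intended i() and j() variables do not uniquely identify the observations:
isid id missvar tdate test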
The command is slightly misspecified. You want to reshape value. Look at the output you want and notice that the observations are uniquely identified by id and test_date. Therefore, those variables should go in the i() option.
reshape wide value, i(id test_date) j(test) string
This yields something close to what you want; you just need to rename a few variables to get exactly that output. Specifically:
rename test_date date
renpfix value
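In Stata 12 and later, the same renaming can be written with rename's group syntax instead of the older renpfix command (an equivalent sketch):
rename test_date date
rename value* *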

Updated exposure variables in Stata

I'm trying to create a variable for updated body mass index (bmi) through 4 visits of a study. I've tried the code below, but it only gives the value from the last visit. My data are in wide format, where visit_v1 = 1 if the participant was present for visit 1 and bmi_v1 = bmi at visit 1. I want bmi_su to equal bmi_v1 if visit_v1==1, bmi_v2 if visit_v2==1, etc. Any thoughts on where I'm going wrong?
gen bmi_su = .
replace bmi_su = bmi_v4 if visit_v4==1
replace bmi_su = bmi_v3 if visit_v3==1 & visit_v4==0
replace bmi_su = bmi_v2 if visit_v2==1 & visit_v4==0 & visit_v3==0
replace bmi_su = bmi_v1 if visit_v1==1 & visit_v4==0 & visit_v3==0 & visit_v2==0
Do you seek something like this:
. clear all
. set more off
.
. * Assumed data structure
. input ///
> id bmi visit1 visit2 visit3 bmi1 bmi2 bmi3
id bmi visit1 visit2 visit3 bmi1 bmi2 bmi3
1. 1 20 1 0 0 20 0 0
2. 1 . 0 1 0 0 25 0
3. 1 . 0 0 1 0 0 28
4. end
.
. list, noobs
+----------------------------------------------------------+
| id bmi visit1 visit2 visit3 bmi1 bmi2 bmi3 |
|----------------------------------------------------------|
| 1 20 1 0 0 20 0 0 |
| 1 . 0 1 0 0 25 0 |
| 1 . 0 0 1 0 0 28 |
+----------------------------------------------------------+
.
. * What you want?
. gen bmisu = bmi1 + bmi2 + bmi3
.
. list, noobs
+------------------------------------------------------------------+
| id bmi visit1 visit2 visit3 bmi1 bmi2 bmi3 bmisu |
|------------------------------------------------------------------|
| 1 20 1 0 0 20 0 0 20 |
| 1 . 0 1 0 0 25 0 25 |
| 1 . 0 0 1 0 0 28 28 |
+------------------------------------------------------------------+
?
Panel or longitudinal data are usually much better off in a long data structure or shape (some say format).
In your case, the definitions imply that the last measurement will trump earlier measurements, so it is not clear why you seem surprised.
Here are some more systematic ways to do calculations. First,
gen bmi_su = bmi_v4
forval j = 3(-1)1 {
replace bmi_su = bmi_v`j' if visit_v`j'
}
Second,
gen bmi_su2 = bmi_v1
forval j = 2/4 {
replace bmi_su2 = bmi_v`j' if visit_v`j'
}
Consider also variants of the above with if missing(bmi_su) or if missing(bmi_su2) rather than the if conditions shown.
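As noted above, this kind of data is often easier to handle in a long layout. Here is a minimal sketch of that route (added for illustration; the identifier id is a hypothetical variable name, assuming one row per participant in the wide data):
reshape long bmi_v visit_v, i(id) j(visit)
bysort id (visit): gen bmi_su3 = bmi_v if visit_v == 1 // bmi at attended visits only
by id: replace bmi_su3 = bmi_su3[_n-1] if missing(bmi_su3) // carry the most recent measurement forward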