I have this data:
id test test_date value
1 A 02/06/2014 12:26 11
1 B 02/06/2014 12:26 23
1 C 02/06/2014 13:17 43
1 D 02/06/2014 13:17 65
1 E 02/06/2014 13:17 34
1 F 02/06/2014 13:17 64
1 A 05/06/2014 15:14 234
1 B 05/06/2014 15:14 646
1 C 05/06/2014 16:50 44
1 E 05/06/2014 16:50 55
2 E 05/06/2014 16:50 443
2 F 05/06/2014 16:50 22
2 G 05/06/2014 16:59 445
2 B 05/06/2014 20:03 66
2 C 05/06/2014 20:03 77
2 D 05/06/2014 20:03 88
2 E 05/06/2014 20:03 44
2 F 05/06/2014 20:19 33
2 G 05/06/2014 20:19 22
I would like to transform this data into wide format like this:
id date A B C D E F G
1 02/06/2014 12:26 11 23 43 65 34 64 .
1 05/06/2014 15:14 234 646 44 . 55 . .
2 05/06/2014 16:50 . . . . 443 22 445
2 05/06/2014 20:03 . 66 77 88 44 33 22
I am using the reshape command in Stata, but it is not producing the required results:
reshape wide test_date value, i(id) j(test) string
Any idea how to do this?
UPDATE:
You're right that we need this missvar. I tried to create it programmatically, but failed. Let's say that tests within 2 hours of each other belong to the same batch. We have only 7 tests (A, B, C, D, E, F, G). First I tried to compute the time difference:
bysort id: gen diff_bd = (test_date[_n] - test_date[_n-1])/(1000*60*60)
bysort id: generate missvar = _n if diff_bd <= 2
@jfeigenbaum has given part of the answer.
The problem I see is that you are missing a variable that identifies the relevant sub-groups. These sub-groups seem to be delimited by test taking the values A through G, but I may be wrong.
I've included this variable in the example data set, and named it missvar. I forced this variable into the data set believing it identifies groups that, although implicit in your original post, are important for your analysis.
clear
set more off
*----- example data -----
input ///
id str1 test str30 test_date value missvar
1 A "02/06/2014 12:26" 11 1
1 B "02/06/2014 12:26" 23 1
1 C "02/06/2014 13:17" 43 1
1 D "02/06/2014 13:17" 65 1
1 E "02/06/2014 13:17" 34 1
1 F "02/06/2014 13:17" 64 1
1 A "05/06/2014 15:14" 234 2
1 B "05/06/2014 15:14" 646 2
1 C "05/06/2014 16:50" 44 2
1 E "05/06/2014 16:50" 55 2
2 E "05/06/2014 16:50" 443 1
2 F "05/06/2014 16:50" 22 1
2 G "05/06/2014 16:59" 445 1
2 B "05/06/2014 20:03" 66 2
2 C "05/06/2014 20:03" 77 2
2 D "05/06/2014 20:03" 88 2
2 E "05/06/2014 20:03" 44 2
2 F "05/06/2014 20:19" 33 2
2 G "05/06/2014 20:19" 22 2
end
gen double tdate = clock( test_date, "DM20Yhm")
format %tc tdate
drop test_date
list, sepby(id)
*----- what you want ? -----
reshape wide value, i(id missvar tdate) j(test) string
collapse (min) tdate value?, by(id missvar)
rename value* *
list
There should be some way of identifying the groups programmatically. Relying on the original sort order of the data is one way, but it may not be the safest. It may be the only way, but only you know that.
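To make the reshape-then-collapse logic concrete, here is a minimal sketch in plain Python (illustrative data and names, not the poster's pipeline): one output row per (id, batch), keeping the earliest tdate in the batch and one column per test.

```python
# Sketch of `reshape wide` followed by `collapse (min)`:
# rows are (id, missvar, test, tdate, value); tdate as a sortable string here.
rows = [
    (1, 1, "A", "2014-06-02 12:26", 11),
    (1, 1, "B", "2014-06-02 12:26", 23),
    (1, 1, "C", "2014-06-02 13:17", 43),
    (1, 2, "A", "2014-06-05 15:14", 234),
]

wide = {}
for id_, batch, test, tdate, value in rows:
    entry = wide.setdefault((id_, batch), {"tdate": tdate})
    entry["tdate"] = min(entry["tdate"], tdate)  # collapse (min) tdate
    entry[test] = value                          # reshape: one key per test

print(wide[(1, 1)]["tdate"])  # → 2014-06-02 12:26
```

Each (id, batch) pair plays the role of the i() variables; the test letter plays the role of j().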
Edit
Regarding your comment and the "missing" variable, one way to create it is:
// two hours is 7200000 milliseconds
bysort id (tdate): gen batch = sum(tdate - tdate[_n-1] > 7200000)
For your example data, this creates a batch variable identical to my missvar. You can also use time-series operators.
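The running-sum trick can be illustrated in plain Python (hypothetical times, for illustration only): a new batch starts whenever the gap to the previous test exceeds two hours.

```python
from datetime import datetime, timedelta

# A new batch starts whenever the gap to the previous test exceeds two hours.
times = [
    datetime(2014, 6, 2, 12, 26),
    datetime(2014, 6, 2, 13, 17),
    datetime(2014, 6, 5, 15, 14),
    datetime(2014, 6, 5, 16, 50),
]

batch = []
current = 0
for i, t in enumerate(times):
    if i > 0 and t - times[i - 1] > timedelta(hours=2):
        current += 1  # gap too large: start a new batch
    batch.append(current)

print(batch)  # → [0, 0, 1, 1]
```

This mirrors the cumulative sum of the gap indicator; only the starting label differs from Stata's sum().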
Let me emphasize the need to be careful when presenting your example data. It must be representative of the real data, or you might get code that doesn't suit it; that includes the possibility that you don't notice the mismatch because Stata gives no error.
For example, if you have the same test, applied to the same id within the two-hour limit, then you'll lose information with this code (in the collapse). (This is not a problem in your example data.)
Edit 2
In response to another question found in the comments:
Suppose a new observation for person 1, such that he receives a repeated test within the two-hour limit, but at a different time:
1 A "02/06/2014 12:26" 11 1 // old observation
1 B "02/06/2014 12:26" 23 1
1 A "02/06/2014 12:35" 99 1 // new observation
1 C "02/06/2014 13:17" 43 1
1 D "02/06/2014 13:17" 65 1
1 E "02/06/2014 13:17" 34 1
1 F "02/06/2014 13:17" 64 1
1 A "05/06/2014 15:14" 234 2
1 B "05/06/2014 15:14" 646 2
1 C "05/06/2014 16:50" 44 2
1 E "05/06/2014 16:50" 55 2
Test A is applied at 12:26 and again at 12:35. reshape will have no problem with this, but the collapse will discard information because it takes the minimum values within each id missvar group; notice that for the variable valueA, the new information (the 99) will be lost. (The same happens with all the other variables, but for those you are explicit about wanting to discard the later values.) After the reshape but before the collapse you get:
. list, sepby(id)
+--------------------------------------------------------------------------------------------------+
| id missvar tdate valueA valueB valueC valueD valueE valueF valueG |
|--------------------------------------------------------------------------------------------------|
1. | 1 1 02jun2014 12:26:00 11 23 . . . . . |
2. | 1 1 02jun2014 12:35:00 99 . . . . . . |
3. | 1 1 02jun2014 13:17:00 . . 43 65 34 64 . |
4. | 1 2 05jun2014 15:14:00 234 646 . . . . . |
5. | 1 2 05jun2014 16:50:00 . . 44 . 55 . . |
|--------------------------------------------------------------------------------------------------|
6. | 2 1 05jun2014 16:50:00 . . . . 443 22 . |
7. | 2 1 05jun2014 16:59:00 . . . . . . 445 |
8. | 2 2 05jun2014 20:03:00 . 66 77 88 44 . . |
9. | 2 2 05jun2014 20:19:00 . . . . . 33 22 |
+--------------------------------------------------------------------------------------------------+
Running the complete code confirms what we just said:
. list, sepby(id)
+--------------------------------------------------------------------------+
| id missvar tdate A B C D E F G |
|--------------------------------------------------------------------------|
1. | 1 1 02jun2014 12:26:00 11 23 43 65 34 64 . |
2. | 1 2 05jun2014 15:14:00 234 646 44 . 55 . . |
|--------------------------------------------------------------------------|
3. | 2 1 05jun2014 16:50:00 . . . . 443 22 445 |
4. | 2 2 05jun2014 20:03:00 . 66 77 88 44 33 22 |
+--------------------------------------------------------------------------+
Suppose now a new observation for person 1, such that he receives a repeated test within the two-hour limit, but at the same time:
1 A "02/06/2014 12:26" 11 1 // old observation
1 B "02/06/2014 12:26" 23 1
1 A "02/06/2014 12:26" 99 1 // new observation
1 C "02/06/2014 13:17" 43 1
1 D "02/06/2014 13:17" 65 1
1 E "02/06/2014 13:17" 34 1
1 F "02/06/2014 13:17" 64 1
1 A "05/06/2014 15:14" 234 2
1 B "05/06/2014 15:14" 646 2
1 C "05/06/2014 16:50" 44 2
1 E "05/06/2014 16:50" 55 2
Then the reshape won't work. Stata complains:
values of variable test not unique within id missvar tdate
and with good reason. The error message clearly signals the problem. (If it is not clear, go back to help reshape and work through some exercises.) The request makes no sense given how the command works.
Finally, note that it's relatively easy to check whether something will work: just try it! All that was necessary in this case was to modify the example data a bit. Go back to the help files and manuals if necessary.
The command is slightly misspecified. You want to reshape value. Look at the output you want and notice that the observations are uniquely identified by id and test_date. Therefore, those should go in the i() option.
reshape wide value, i(id test_date) j(test) string
This yields something close to what you want; you just need to rename a few variables to get exactly the desired output. Specifically:
rename test_date date
renpfix value
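The pivot that reshape wide performs here can be sketched in plain Python (illustrative subset of the data): each (id, test_date) pair becomes one row, and each test becomes a column holding its value.

```python
# Pivot `value` so each `test` becomes a column, with (id, test_date) as the row key.
rows = [
    (1, "02/06/2014 12:26", "A", 11),
    (1, "02/06/2014 12:26", "B", 23),
    (1, "02/06/2014 13:17", "C", 43),
]

wide = {}
for id_, date, test, value in rows:
    wide.setdefault((id_, date), {})[test] = value

print(wide[(1, "02/06/2014 12:26")])  # → {'A': 11, 'B': 23}
```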
Related
I have a data set that looks like this:
id varA varB varC
1 0 10 .
1 0 20 .
1 0 35 .
2 1 60 76
2 1 76 60
2 0 32 .
I want to create varC, which reverses the order of varB for the observations where varA==1 and is missing otherwise.
This may help:
clear
input id varA varB varC
1 0 10 .
1 0 20 .
1 0 35 .
2 1 60 76
2 1 76 60
2 0 32 .
end
gen group = sum(id != id[_n-1] | varA != varA[_n-1])
sort group, stable
by group: gen wanted = cond(varA == 1, varB[_N - _n + 1], .)
list id var* wanted, sepby(id varA)
+----------------------------------+
| id varA varB varC wanted |
|----------------------------------|
1. | 1 0 10 . . |
2. | 1 0 20 . . |
3. | 1 0 35 . . |
|----------------------------------|
4. | 2 1 60 76 76 |
5. | 2 1 76 60 60 |
|----------------------------------|
6. | 2 0 32 . . |
+----------------------------------+
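The same idea can be sketched in plain Python, for illustration: within each run of identical (id, varA), reverse varB when varA == 1, and emit missing (None) otherwise.

```python
from itertools import groupby

# Reverse varB within each contiguous (id, varA) run, but only when varA == 1.
rows = [(1, 0, 10), (1, 0, 20), (1, 0, 35), (2, 1, 60), (2, 1, 76), (2, 0, 32)]

wanted = []
for (_, varA), grp in groupby(rows, key=lambda r: (r[0], r[1])):
    vals = [r[2] for r in grp]
    wanted.extend(reversed(vals) if varA == 1 else [None] * len(vals))

print(wanted)  # → [None, None, None, 76, 60, None]
```

The groupby over contiguous runs plays the role of the running-sum group variable in the Stata code.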
I need to do this:
table 1:
ID Cod.
1 20
2 102
4 30
7 10
9 201
10 305
table 2:
ID Cod.
1 20
2 50
3 15
4 30
5 25
7 10
10 300
Now, I got a table like this with an outer join:
ID Cod. ID1 Cod1.
1 20 1 20
2 50 . .
. . 2 102
3 15 . .
4 30 4 30
5 25 . .
7 10 7 10
. . 9 201
10 300 . .
. . 10 305
Now I want to add a flag that tell me if the ID have common values, so:
ID Cod. ID1 Cod1. FLag_ID Flag_cod:
1 20 1 20 0 0
2 50 . . 0 1
. . 2 102 0 1
3 15 . . 1 1
4 30 4 30 0 0
5 25 . . 1 1
7 10 7 10 0 0
. . 9 201 1 1
10 300 . . 0 1
. . 10 305 0 1
I would like to know how can I get the flag_ID, specifically to cover the cases of ID = 2 or ID=10.
Thank you
You can group by a coalesce of the id columns in order to count and compare details.
Example
data table1;
input id code ##; datalines;
1 20 2 102 4 30 7 10 9 201 10 305
;
data table2;
input id code ##; datalines;
1 20 2 50 3 15 4 30 5 25 7 10 10 300
;
proc sql;
create table got as
select
table2.id, table2.code
, table1.id as id1, table1.code as code1
, case
when count(table1.id) = 1 and count(table2.id) = 1 then 0 else 1
end as flag_id
, case
when table1.code - table2.code ne 0 then 1 else 0
end as flag_code
from
table1
full join
table2
on
table2.id=table1.id and table2.code=table1.code
group by
coalesce(table2.id,table1.id)
;
You might also want to look into
Proc COMPARE with BY
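The flag logic can also be sketched outside SQL, in plain Python (hypothetical row layout, for illustration): after a full outer join of the two tables on (id, code), flag_id is 0 when the id occurs in both tables, and flag_code is 0 only when the exact (id, code) pair matched.

```python
# Full-outer-join flags: table1 and table2 as sets of (id, code) pairs.
table1 = {(1, 20), (2, 102), (4, 30), (7, 10), (9, 201), (10, 305)}
table2 = {(1, 20), (2, 50), (3, 15), (4, 30), (5, 25), (7, 10), (10, 300)}

ids1 = {i for i, _ in table1}
ids2 = {i for i, _ in table2}

flags = {}
for pair in sorted(table1 | table2):
    id_, _ = pair
    flag_id = 0 if id_ in ids1 and id_ in ids2 else 1      # id present in both?
    flag_code = 0 if pair in table1 and pair in table2 else 1  # exact pair matched?
    flags[pair] = (flag_id, flag_code)

print(flags[(2, 50)])  # → (0, 1): id 2 is in both tables, code 50 is not
```

This reproduces the desired rows for id = 2 and id = 10: flag_id is 0 because the id appears in both tables, even though the codes differ.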
Consider the following data example:
clear
input id code cost
1 15342 18
2 15366 12
1 16786 32
2 15342 12
3 12345 45
4 23453 345
1 34234 23
2 22223 12
4 22342 64
3 23452 23
1 23432 22
end
How can I keep all the records for the IDs that contain the code 15342 in any row?
This is a follow-up question to a previous one of mine: Keeping all the records for specific IDs
The following works for me:
clear
input id code cost
1 15342 18
2 15366 12
1 16786 32
2 15342 12
3 12345 45
4 23453 345
1 34234 23
2 22223 12
4 15342 64
3 23452 23
1 23432 22
end
bysort id (code): egen tag = total(inlist(code, 15342))
keep if tag
Results:
list, sepby(id)
+-------------------------+
| id code cost tag |
|-------------------------|
1. | 1 15342 18 1 |
2. | 1 16786 32 1 |
3. | 1 23432 22 1 |
4. | 1 34234 23 1 |
|-------------------------|
5. | 2 15342 12 1 |
6. | 2 15366 12 1 |
7. | 2 22223 12 1 |
|-------------------------|
8. | 4 15342 64 1 |
9. | 4 23453 345 1 |
+-------------------------+
Note that I changed the data example slightly for better illustration.
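The tag-then-keep logic can be sketched in plain Python, for illustration: keep every record whose id has at least one row with code 15342.

```python
# Tag ids that ever carry code 15342, then keep all of their rows.
rows = [(1, 15342, 18), (2, 15366, 12), (1, 16786, 32), (2, 15342, 12),
        (3, 12345, 45), (4, 23453, 345), (1, 34234, 23), (2, 22223, 12),
        (4, 15342, 64), (3, 23452, 23), (1, 23432, 22)]

tagged = {id_ for id_, code, _ in rows if code == 15342}
kept = [r for r in rows if r[0] in tagged]

print(tagged)  # → {1, 2, 4}
```

The set of tagged ids plays the role of the egen total() indicator; id 3 never carries the code, so all of its rows are dropped.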
I have a large Stata dataset that contains the following variables: year, state, household_id, individual_id, partner_id, and race. Here is an example of my data:
year state household_id individual_id partner_id race
1980 CA 23 2 1 3
1980 CA 23 1 2 1
1990 NY 43 4 2 1
1990 NY 43 2 4 1
Note that, in the above table, column 1 and 2 are married to each other.
I want to create a variable that is one if the person is in an interracial marriage.
As a first step, I used the following code
by household_id year: gen inter=0 if race==race[partner_id]
replace inter=1 if inter==.
This code seemed to work but gave the wrong result in a few cases. As an alternative, I created string variables identifying each user and their partner, using
gen id_user=string(household_id)+"."+string(individual_id)+string(year)
gen id_partner=string(household_id)+"."+string(partner_id)+string(year)
What I want to do now is to create something like what vlookup does in Excel: for each row, save id_partner locally, find it among the id_user values, retrieve that person's race, and compare it with the race of the original user.
I guess it should be something like this?
gen inter2==1 if (find race[idpartner]) == (race[iduser])
The expected output should be like this
year state household_id individual_id partner_id race inter2
1980 CA 23 2 1 3 1
1980 CA 23 1 2 1 1
1990 NY 43 4 2 1 0
1990 NY 43 2 4 1 0
I don't think you need anything so general. As you realise, the information on identifiers suffices to find couples, and that in turn allows comparison of race for the people in each couple.
In the code below, _N == 2 is meant to catch data errors, such as one partner but not the other appearing in the dataset, or repetitions of one or both partners.
clear
input year str2 state household_id individual_id partner_id race
1980 CA 23 2 1 3
1980 CA 23 1 2 1
1990 NY 43 4 2 1
1990 NY 43 2 4 1
end
generate couple_id = cond(individual_id < partner_id, string(individual_id) + ///
" " + string(partner_id), string(partner_id) + ///
" " + string(individual_id))
bysort state year household_id couple_id : generate mixed = race[1] != race[2] if _N == 2
list, sepby(household_id) abbreviate(15)
+-------------------------------------------------------------------------------------+
| year state household_id individual_id partner_id race couple_id mixed |
|-------------------------------------------------------------------------------------|
1. | 1980 CA 23 2 1 3 1 2 1 |
2. | 1980 CA 23 1 2 1 1 2 1 |
|-------------------------------------------------------------------------------------|
3. | 1990 NY 43 4 2 1 2 4 0 |
4. | 1990 NY 43 2 4 1 2 4 0 |
+-------------------------------------------------------------------------------------+
This idea is documented in this article. The link gives free access to a pdf file.
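The couple-key idea can be sketched in plain Python: order the two person ids so both partners share the same key, then compare races within each key.

```python
from collections import defaultdict

# Build a symmetric couple key and collect the two partners' races under it.
rows = [  # (year, state, household_id, individual_id, partner_id, race)
    (1980, "CA", 23, 2, 1, 3),
    (1980, "CA", 23, 1, 2, 1),
    (1990, "NY", 43, 4, 2, 1),
    (1990, "NY", 43, 2, 4, 1),
]

races = defaultdict(list)
for year, state, hh, ind, partner, race in rows:
    key = (state, year, hh, min(ind, partner), max(ind, partner))
    races[key].append(race)

# mixed = 1 when the partners' races differ; defined only for complete couples
mixed = {k: int(r[0] != r[1]) for k, r in races.items() if len(r) == 2}
print(mixed[("CA", 1980, 23, 1, 2)])  # → 1
```

The min/max ordering plays the role of the cond() expression building couple_id, and the len == 2 guard mirrors the _N == 2 check.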
In my data, I have variables as follows: household ID, ID of persons in household, father ID, years of education, who is the father. So person 3 in house 23 for example might say that person 1 is his or her father, while person 6 and 7 and 8 also in house 23 says that person 9 is their father. This is likely a joint family.
So I can't create a new column eduF in the usual way, since for person 3 and persons 6/7/8 in the same household the father is different, so the eduF level varies even within the same household. I need this new column eduF to record, for each member of the family, the education level of the person they list as their father.
I think this requires forvalues or foreach loops, but I'm not sure what the code would be!
In the sample below, 'fath i' and 'fath n' mean that the father is dead or the information is not available.
key pid fathID yearsEDU
282 10 fath n 13
282 9 1 10
282 8 4
282 7 4 12
282 6 4 14
282 5 fath n 10
282 4 1 9
282 3 1 8
282 2 fath i
282 1 fath i 4
283 4 1 4
283 3 1 6
283 2 fath i 14
283 1 fath i 17
In the example given, the values of xpers run from 1 upwards in each household. (If that's not true, it can be arranged.)
There is a singular lack of information here about which variables are numeric, which are numeric with value labels and which are string.
But assuming that q0111 is string, we can get the numeric-only values for the fathers' identifiers with
gen fatherid = real(q0111)
Then it is
bysort xhhkey (xpers) : gen father_educ = q0407_a[fatherid]
The key idea here is that under the aegis of by: subscripts are interpreted within groups, and therefore the values of fatherid are precisely the subscripts we need.
As @Metrics asserted, no loop is necessary.
list xhhkey xpers q0111 q0407_a fatherid father_educ, sep(0)
+-----------------------------------------------------------+
| xhhkey xpers q0111 q0407_a fatherid father~c |
|-----------------------------------------------------------|
1. | 282 1 father i 13 . . |
2. | 282 2 father i 10 . . |
3. | 282 3 1 . 1 13 |
4. | 282 4 1 12 1 13 |
5. | 282 5 father n 14 . . |
6. | 282 6 4 10 4 12 |
7. | 282 7 4 9 4 12 |
8. | 282 8 4 8 4 12 |
9. | 282 9 1 . 1 13 |
10. | 282 10 father n 4 . . |
11. | 283 1 father i 4 . . |
12. | 283 2 father i 6 . . |
13. | 283 3 1 14 1 4 |
14. | 283 4 1 17 1 4 |
15. | 284 1 father i 5 . . |
16. | 284 2 father n . . . |
17. | 284 3 1 1 1 5 |
18. | 284 4 father i 4 . . |
19. | 284 5 father n 8 . . |
20. | 284 6 father i 7 . . |
21. | 284 7 father n 18 . . |
22. | 284 8 6 2 6 7 |
23. | 284 9 6 . 6 7 |
24. | 284 10 father i 9 . . |
+-----------------------------------------------------------+
By the way, the terminology of columns is alien to Stata outside the context of matrices: they are variables.
There is a moderately detailed tutorial on by: at http://www.stata-journal.com/article.html?article=pr0004. Even experienced Stata users often underestimate what can be done with by:.
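The within-household subscripting idea can be sketched in plain Python (hypothetical numbers, for illustration): each person's father id indexes directly into the household's own records, so no loop over individuals is needed.

```python
# pid -> (father's pid or None, years of education); None marks 'fath i'/'fath n'.
households = {
    282: {1: (None, 4), 3: (1, 8), 4: (1, 9), 6: (4, 14)},
}

father_educ = {}
for hh, people in households.items():
    for pid, (fath, _) in people.items():
        # look up the father's record by his pid within the same household
        father_educ[(hh, pid)] = people[fath][1] if fath in people else None

print(father_educ[(282, 3)])  # → 4 (person 3's father is person 1)
```

The dictionary lookup by fath plays the role of the by:-group subscript q0407_a[fatherid].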