I'm trying to create a data set that will show me the duplicate transactions. The trouble I'm running into is when there are multiple orders on one order_id. The records that get assigned the 2s I would be considering the duplicate order.
data have;
input acct_id order_id;
datalines;
1 121
1 122
2 123
2 124
3 125
3 125
3 125
3 126
3 126
3 126
data want;
set have;
by acct_id order_id;
if first.acct_id then order_count = 1;
else order_count =2;
run;
My desired output is below.
acct_id | order_id | order_count
1 121 1
1 122 2
2 123 1
2 124 2
3 125 1
3 125 1
3 125 1
3 126 2
3 126 2
3 126 2
What I have coded out already I feel like is close but I can't get it figured out.
data want;
set have;
by acct_id order_id notsorted;
if first.acct_id then order_count=0;
if first.order_id then order_count+1;
put acct_id order_id order_count;
run;
acct_id order_id order_count
1 121 1
1 122 2
2 123 1
2 124 2
3 125 1
3 125 1
3 125 1
3 126 2
3 126 2
3 126 2
Lets say I ahve a dataset as follows:
ID Cat
101 G
101 G
101 F
101 G
102 F
102 F
102 G
102 F
102 F
i want to create a variable for sequence by group variable ID,Cat(notsorted)
a count can be this way
data X1; set have; by ID, cat notsorted;
if first.cat then count=1; else count+1;run;
ID Cat count
101 G 1
101 G 2
101 F 1
101 G 1
102 F 1
102 F 2
102 G 1
102 F 1
102 F 1
however what I am looking for is.
ID Cat Seq
101 G 1
101 G 1
101 F 2
101 G 3
102 F 1
102 F 1
102 G 2
102 F 3
102 F 3
You can just use
seq+first.cat;
So every time you start a new CAT value the SEQ will increment by one.
To reset for each ID add:
if first.id then seq=1;
I have the following dataset:
DATA survey;
INPUT id sex $ age inc r1 r2 r3 ;
DATALINES;
1 F 35 17 7 2 2
17 M 50 14 5 5 3
33 F 45 6 7 2 7
49 M 24 14 7 5 7
65 F 52 9 4 7 7
81 M 44 11 7 7 7
2 F 34 17 6 5 3
18 M 40 14 7 5 2
34 F 47 6 6 5 6
50 M 35 17 5 7 5
;
Now I would like to create to files based on whether the records are Female (F)or NOT. Therefore I do this:
date female other;
set survey;
if sex = "F" then output USA;
else output other;
run;
PROC PRINT; RUN;
This however does not give me two sets with data depending on the F and M value. Any idea on what I am doing wrong here?
When you look in the log window, do you see any error messages?
If your code is
if sex = "F" then output USA;
you should see an error, because the DATA statement does not include a dataset named USA. If you change USA to FEMALE it should work.
Learning to read log messages is an essential skill in SAS.
I need to assign seq number by group. I have tried using seq number but got it by one group (1,2,3, etc). However, I need it by two groups. As in the example below:
Have:
Var1 Var2 Var3
101 aaa 202
101 aaa 202
101 bbb 203
101 ccc 206
101 ddd 207
102 aaa 222
102 aaa 222
102 bbb 223
Want:
Obs var1 var2 var3 seq
1 101 aaa 202 1
2 101 aaa 202 1
3 101 bbb 203 2
4 101 ccc 206 3
5 101 ddd 207 4
6 102 aaa 222 1
7 102 aaa 222 1
8 102 bbb 223 2
If you sort your data it is quite simple:
proc sort data=sashelp.class out=class;
by sex age;
run;
data class;
set class;
by sex age;
if first.sex then
seqn = 0;
if first.age then
seqn + 1;
run;
I have this data:
id test test_date value
1 A 02/06/2014 12:26 11
1 B 02/06/2014 12:26 23
1 C 02/06/2014 13:17 43
1 D 02/06/2014 13:17 65
1 E 02/06/2014 13:17 34
1 F 02/06/2014 13:17 64
1 A 05/06/2014 15:14 234
1 B 05/06/2014 15:14 646
1 C 05/06/2014 16:50 44
1 E 05/06/2014 16:50 55
2 E 05/06/2014 16:50 443
2 F 05/06/2014 16:50 22
2 G 05/06/2014 16:59 445
2 B 05/06/2014 20:03 66
2 C 05/06/2014 20:03 77
2 D 05/06/2014 20:03 88
2 E 05/06/2014 20:03 44
2 F 05/06/2014 20:19 33
2 G 05/06/2014 20:19 22
I would like to transform this data into wide format like this:
id date A B C D E F G
1 02/06/2014 12:26 11 23 43 65 34 64 .
1 05/06/2014 15:14 234 646 44 . 55 . .
2 05/06/2014 16:50 . . . . 443 22 445
2 05/06/2014 20:03 . 66 77 88 44 33 22
I am using reshape command in Stata, but it is not producing required results:
reshape wide test_date value, i(id) j(test) string
Any idea how to do this?
UPDATE:
You're right that we need this missvar. I try to create this by programming, but failed. Let say with-in 2 hours of test date the batch will consider same. We have only 7 tests (A,B,C,D,E,F,G). First I try to find the time difference;
bysort id: gen diff_bd = (test_date[_n] - test_date[_n-1])/(1000*60*60)
bysort id: generate missvar = _n if diff_bd <= 2
#jfeigenbaum has given part of the answer.
The problem I see is that you are missing a variable that identifies relevant sub-groups. These sub-groups seem to be bounded by test taking values A - G. But I may be wrong.
I've included this variable in the example data set, and named it missvar. I forced this variable into the data set believing it identifies groups that, although implicit in your original post, are important for your analysis.
clear
set more off
*----- example data -----
input ///
id str1 test str30 test_date value missvar
1 A "02/06/2014 12:26" 11 1
1 B "02/06/2014 12:26" 23 1
1 C "02/06/2014 13:17" 43 1
1 D "02/06/2014 13:17" 65 1
1 E "02/06/2014 13:17" 34 1
1 F "02/06/2014 13:17" 64 1
1 A "05/06/2014 15:14" 234 2
1 B "05/06/2014 15:14" 646 2
1 C "05/06/2014 16:50" 44 2
1 E "05/06/2014 16:50" 55 2
2 E "05/06/2014 16:50" 443 1
2 F "05/06/2014 16:50" 22 1
2 G "05/06/2014 16:59" 445 1
2 B "05/06/2014 20:03" 66 2
2 C "05/06/2014 20:03" 77 2
2 D "05/06/2014 20:03" 88 2
2 E "05/06/2014 20:03" 44 2
2 F "05/06/2014 20:19" 33 2
2 G "05/06/2014 20:19" 22 2
end
gen double tdate = clock( test_date, "DM20Yhm")
format %tc tdate
drop test_date
list, sepby(id)
*----- what you want ? -----
reshape wide value, i(id missvar tdate) j(test) string
collapse (min) tdate value?, by(id missvar)
rename value* *
list
There should be some way of identifying the groups programmatically. Relying on the original sort order of the data is one way, but it may not be the safest. It may be the only way, but only you know that.
Edit
Regarding your comment and the "missing" variable, one way to create it is:
// one hour is 3600000 milliseconds
bysort id (tdate): gen batch = sum(tdate - tdate[_n-1] > 7200000)
For your example data, this creates a batch variable identical to my missvar. You can also use time-series operators.
Let me emphasize the need for you to be carefull when presenting your example data. It must be representative of the real one or you might get code that doesn't suit it; that includes the possibility that you don't notice it because Stata gives no error.
For example, if you have the same test, applied to the same id within the two-hour limit, then you'll lose information with this code (in the collapse). (This is not a problem in your example data.)
Edit 2
In response to another question found in the comments:
Suppose a new observation for person 1, such that he receives a repeated test within the two-hour limit, but at a different time :
1 A "02/06/2014 12:26" 11 1 // old observation
1 B "02/06/2014 12:26" 23 1
1 A "02/06/2014 12:35" 99 1 // new observation
1 C "02/06/2014 13:17" 43 1
1 D "02/06/2014 13:17" 65 1
1 E "02/06/2014 13:17" 34 1
1 F "02/06/2014 13:17" 64 1
1 A "05/06/2014 15:14" 234 2
1 B "05/06/2014 15:14" 646 2
1 C "05/06/2014 16:50" 44 2
1 E "05/06/2014 16:50" 55 2
Test A is applied at 12:26 and at 12:35. Reshape will have no problem with this, but the collapse will discard information because it is taking the minimum values amongst the id missvar groups; notice that for the variable valueA, new information (the 99) will be lost (so too happens with all other variables, but you are explicit about wanting to discard that). After the reshape but before the collapse you get:
. list, sepby(id)
+--------------------------------------------------------------------------------------------------+
| id missvar tdate valueA valueB valueC valueD valueE valueF valueG |
|--------------------------------------------------------------------------------------------------|
1. | 1 1 02jun2014 12:26:00 11 23 . . . . . |
2. | 1 1 02jun2014 12:35:00 99 . . . . . . |
3. | 1 1 02jun2014 13:17:00 . . 43 65 34 64 . |
4. | 1 2 05jun2014 15:14:00 234 646 . . . . . |
5. | 1 2 05jun2014 16:50:00 . . 44 . 55 . . |
|--------------------------------------------------------------------------------------------------|
6. | 2 1 05jun2014 16:50:00 . . . . 443 22 . |
7. | 2 1 05jun2014 16:59:00 . . . . . . 445 |
8. | 2 2 05jun2014 20:03:00 . 66 77 88 44 . . |
9. | 2 2 05jun2014 20:19:00 . . . . . 33 22 |
+--------------------------------------------------------------------------------------------------+
Running the complete code confirms what we just said:
. list, sepby(id)
+--------------------------------------------------------------------------+
| id missvar tdate A B C D E F G |
|--------------------------------------------------------------------------|
1. | 1 1 02jun2014 12:26:00 11 23 43 65 34 64 . |
2. | 1 2 05jun2014 15:14:00 234 646 44 . 55 . . |
|--------------------------------------------------------------------------|
3. | 2 1 05jun2014 16:50:00 . . . . 443 22 445 |
4. | 2 2 05jun2014 20:03:00 . 66 77 88 44 33 22 |
+--------------------------------------------------------------------------+
Suppose now a new observation for person 1, such that he receives a repeated test within the two-hour limit, but at the same time:
1 A "02/06/2014 12:26" 11 1 // old observation
1 B "02/06/2014 12:26" 23 1
1 A "02/06/2014 12:26" 99 1 // new observation
1 C "02/06/2014 13:17" 43 1
1 D "02/06/2014 13:17" 65 1
1 E "02/06/2014 13:17" 34 1
1 F "02/06/2014 13:17" 64 1
1 A "05/06/2014 15:14" 234 2
1 B "05/06/2014 15:14" 646 2
1 C "05/06/2014 16:50" 44 2
1 E "05/06/2014 16:50" 55 2
Then the reshape won't work. Stata complains:
values of variable test not unique within id missvar tdate
and with reason. The error is clear in signalling the problem. (If not clear, go back to help reshape and work out some exercises.) The request makes no sense given the functioning of the command.
Finally, note it's relatively easy to check if something will work or not: just try it! All that was necessary in this case was to modify a bit the example data. Go back to help files and manuals, if necessary.
The command is slightly misspecified. You want to reshape value. Look at the output you want and notice the observations are uniquely identified by id and test_date. Therefore, they should be in the i option.
reshape wide value, i(id test_date) j(test) string
This yields something close you what you want, you just need to rename a few variables to get exactly the output. Specifically:
rename test_date date
renpfix value