I have a dataset temp1 as:
ID   Drug     lob      Timestamp
123  acetam   counter  01JAN22 04:25:17
123  acetam   counter  01JAN22 09:15:13
123  acetam   prescr   02JAN22 15:05:25
123  acetam   counter  03JAN22 23:28:05
234  tylenol  counter  11JAN22 18:12:39
345  aztyr    counter  03FEB22 16:11:19
345  aztyr    counter  03FEB22 16:15:20
For each combination of ID, Drug and lob, the record with the oldest timestamp should get a flag of 'Yes'.
A later record with the same ID, Drug and lob should also get a flag of 'Yes' when its timestamp is more than 5 days after the timestamp of the previous entry; otherwise the flag is 'No'.
Expected output:
ID   Drug     lob      Timestamp         flag
123  acetam   counter  01JAN22 04:25:17  Yes
123  acetam   counter  01JAN22 09:15:13  No
123  acetam   prescr   02JAN22 15:05:25  Yes
123  acetam   counter  11JAN22 23:28:05  Yes
234  tylenol  counter  11JAN22 18:12:39  Yes
345  aztyr    counter  03FEB22 16:11:19  Yes
345  aztyr    counter  03FEB22 16:15:20  No
My Code:
proc sql;
create table temp2 as
select id, drug, lob, datepart(timestamp) as timestamp format=mmddyy10.
from temp1
order by id, drug, lob;
quit;
data temp3;
set temp2;
by id drug lob;
diff=timestamp-lag(timestamp);
if first.id and first.drug and first.lob then diff=.;
run;
I am first trying to calculate the difference between dates so that I can create the flag. For the 3rd record, where lob='prescr', the diff I get is 1, but because the lob differs from the previous records it should be treated as a new record and the flag should be 'Yes'. I am stuck here. Can anyone help?
I recommend reading up on BY group processing:
https://documentation.sas.com/doc/en/pgmsascdc/9.4_3.5/lrcon/n01a08zkzy5igbn173zjz82zsi1s.htm
Example 1 illustrates how the first. and last. variables are set.
If I understand correctly, you probably want to do something like this:
data temp3;
  set temp2;
  by id drug lob;
  diff = timestamp - lag(timestamp);
  if first.lob or diff > 5 then
    flag = 'yes';
  else
    flag = 'no';
run;
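For reference, here is a minimal end-to-end sketch of that approach, assuming Timestamp in temp1 is a SAS datetime value. The sample rows are typed in by hand from the question; the fourth row uses 11JAN22, as in the expected output, so that the more-than-5-days case is exercised. The variables date and prev_date are introduced here only for illustration.

```sas
/* Hypothetical sample data, typed in from the question's table */
data temp1;
    input ID Drug :$8. lob :$8. Timestamp :datetime18.;
    format Timestamp datetime18.;
    datalines;
123 acetam counter 01JAN22:04:25:17
123 acetam counter 01JAN22:09:15:13
123 acetam prescr 02JAN22:15:05:25
123 acetam counter 11JAN22:23:28:05
234 tylenol counter 11JAN22:18:12:39
345 aztyr counter 03FEB22:16:11:19
345 aztyr counter 03FEB22:16:15:20
;
run;

/* Sort so each ID/Drug/lob group is contiguous and in time order */
proc sort data=temp1 out=temp2;
    by ID Drug lob Timestamp;
run;

data temp3;
    set temp2;
    by ID Drug lob;
    length flag $3;
    date = datepart(Timestamp);   /* date portion of the datetime */
    prev_date = lag(date);        /* LAG is called on every row */
    /* First row of a group, or more than 5 days since the previous row */
    if first.lob or date - prev_date > 5 then flag = 'Yes';
    else flag = 'No';
    drop date prev_date;
run;
```

Because of the sort, the prescr row ends up after the counter rows for ID 123, so the row order differs from the table in the question. Whether "more than 5 days" should be measured in calendar dates, as here, or in elapsed seconds is a choice the question leaves open.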
Related
I feel like I am making this more complicated than it should be. I have a sample dataset below with an ID column and a Counter column. The counter column resets, and I would like to create a dataset containing only the rows where the counter column is at its max value before it resets again. My dataset also has thousands of IDs that I would need to do this for.
data test;
infile datalines delimiter=",";
informat ID $3.
TCOUNT 10.;
input ID TCOUNT;
datalines;
123,1
123,2
123,3
123,4
123,1
123,2
123,3
123,1
123,2
;
run;
and my desired output in a new table would look like...
ID TCOUNT
123 4
123 3
123 2
It might be easiest/clearest to first assign a label to each of the non-decreasing TCOUNT blocks of observations.
data groups;
set test;
by id ;
if first.id then group=0;
if first.id or tcount<lag(tcount) then group+1;
run;
Then it is a simple matter to find the last observation in each group.
data want;
set groups;
by id group;
if last.group;
run;
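If a single pass is preferred, a look-ahead read is one possible alternative to the two steps above. This is only a sketch: next_tcount is a name made up here, and the sample data repeats the question's values.

```sas
data test;
    infile datalines delimiter=",";
    input ID :$3. TCOUNT;
    datalines;
123,1
123,2
123,3
123,4
123,1
123,2
123,3
123,1
123,2
;
run;

data want;
    set test end=eof;
    by id;
    /* Read the following row's TCOUNT; skip the read on the last row
       so the step does not stop before that row is written */
    if not eof then
        set test(firstobs=2 keep=tcount rename=(tcount=next_tcount));
    /* Keep a row when the ID ends or the counter is about to reset */
    if last.id or next_tcount < tcount then output;
    drop next_tcount;
run;
```

On this sample the kept rows should be the ones with TCOUNT 4, 3 and 2. The two-step version is arguably easier to read; the look-ahead only saves the intermediate dataset.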
I use SAS EG and have a data set that looks like:
CLIENT_ID Segment Yearmonth
XXXX A 201305
XXXX A 201306
XXXX A 201307
YYYY A 201305
YYYY A 201306
YYYY B 201307
I want an output with a number assigned in a new column that resets when a new account starts:
CLIENT_ID Segment Yearmonth New_Variable
XXXX A 201305 1
XXXX A 201306 2
XXXX A 201307 3
YYYY A 201305 1
YYYY A 201306 2
YYYY B 201307 3
That was problem number one, which I solved with this code:
PROC SORT DATA= GENERAL.HISTORICAL_SEGMENTS;
by Client_ID;
RUN;
data HISTORICAL_SEGMENTS2;
SET GENERAL.HISTORICAL_SEGMENTS;
count + 1;
by Client_ID;
if first.Client_ID then count = 1;
run;
I want to create a second data set, and I want to see if there is a way to keep a row only when the segment changes. For example, from the data above:
CLIENT_ID Segment Yearmonth New_Variable
YYYY A 201305 1
YYYY B 201306 2
Any help would be appreciated. Thanks.
Nice job on answering your first question. I think that step reads more clearly if you rearrange it a bit, e.g.:
data HISTORICAL_SEGMENTS2 ;
set GENERAL.HISTORICAL_SEGMENTS ;
by Client_ID ;
if first.Client_ID then count = 0 ;
count + 1 ;
run;
I think it's customary to put the BY statement right after the SET statement it applies to, for clarity's sake. The counter is reset to 0 when Client_ID changes.
It looks like you want a second dataset, call it FIRSTS, with the first record from each by group. To do that, note that it's possible for one DATA step to write multiple output datasets. This can be done by using an explicit OUTPUT statement to write to each dataset, e.g. :
data HISTORICAL_SEGMENTS2 FIRSTS ;
set GENERAL.HISTORICAL_SEGMENTS ;
by Client_ID ;
if first.Client_ID then count = 0 ;
count + 1 ;
output HISTORICAL_SEGMENTS2 ; *output every record;
if first.Client_ID then output FIRSTS ; *output first of each group;
run;
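If the goal is instead one row per segment change within each client, a BY statement with NOTSORTED is one way to sketch it. This assumes the rows are already in Client_ID and Yearmonth order; the dataset names segments and segment_changes are made up for the example.

```sas
/* Hypothetical copy of the question's sample data */
data segments;
    input Client_ID $ Segment $ Yearmonth;
    datalines;
XXXX A 201305
XXXX A 201306
XXXX A 201307
YYYY A 201305
YYYY A 201306
YYYY B 201307
;
run;

data segment_changes;
    set segments;
    /* NOTSORTED: first.Segment is set whenever Segment (or Client_ID)
       differs from the previous row, without requiring sorted order */
    by Client_ID Segment notsorted;
    if first.Client_ID then count = 0;
    if first.Segment then do;
        count + 1;   /* number the segments within each client */
        output;
    end;
run;
```

With this sample, XXXX yields one row (segment A) and YYYY yields two (A, then B), numbered from 1 within each client.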
I have the following dataset:
DATA survey;
INPUT zip_code number;
DATALINES;
1212 12
1213 23
1214 23
;
PROC PRINT; RUN;
I want to link this data to another table but the thing is that the numbers in the other table are stored in the following format: 0012, 0023, 0023.
So I am looking for a way to do the following:
Check how long the number is
If length = 1, add three zeros to the beginning
If length = 2, add two zeros to the beginning
Any thoughts on how I can get this working?
Numbers are numbers so if the other table has the field as a number then you don't need to do anything. 13 = 0013 = 13.00 = ....
If the other table actually has a character variable then you need to convert one or the other.
char_number = put(number, Z4.);
number = input(char_number, 4.);
You can use z#. formats to accomplish this:
DATA survey;
INPUT zip_code number;
DATALINES;
1212 12
1213 23
1214 23
9999 999
8888 8
;
data survey2;
set survey;
number_long = put(number, z4.);
run;
If number is already stored as a character variable and you need it to be four characters long, you could convert it to numeric first and then apply the format:
want = put(input(number,best32.),z4.);
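To tie this back to the original goal of linking the two tables, the conversion can also be done directly in the join condition. The sketch below assumes the other table stores the key as a 4-character variable; the table other and its columns code and descr are hypothetical.

```sas
data survey;
    input zip_code number;
    datalines;
1212 12
1213 23
1214 23
;
run;

/* Hypothetical lookup table with a zero-padded character key */
data other;
    input code :$4. descr :$10.;
    datalines;
0012 first
0023 second
;
run;

proc sql;
    create table linked as
    select s.zip_code, s.number, o.descr
    from survey as s
         inner join other as o
         /* Z4. pads the numeric value with leading zeros to width 4 */
         on put(s.number, z4.) = o.code;
quit;
```

Converting in the ON clause avoids creating an extra variable, at the cost of applying PUT to every row during the join.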
I have a large dataset, with hundreds of variables and hundreds of observations, coming from a clinical trial. Variable V1 is a yes/no variable, indicating some condition. V2 is numeric, and representing a dose. T is a time variable. The dataset is "long" shaped, every subject has few observations, one for each time point. For every subject, I want to create a new yes/no variable (can be in a new dataset), which is yes if: V1 is "yes" in at least one time point, OR, V2 is above 0 in at least one time point. How do I do that? Thank you.
Try the following:
data ds;
set ds;
if V1="yes" or V2>0 then do;
flag="yes";
end;
else do;
flag="no";
end;
run;
Then summarize the dataset to ID level:
proc sql;
create table summary as
select ID, count(flag) as flag_cnt
from ds
where flag="yes"
group by ID;
quit;
These are the IDs which satisfy the condition
You can run the code below on the example data to verify.
Here (V1="yes" or V2>0) gives a dummy variable for each row. When we sum it, we get the number of rows satisfying the condition you mentioned for each ID.
To get a flag, we compare the sum to 0 and wrap the comparison in parentheses, which creates the 0/1 variable you want.
Hope it helps!
MK
data have;
input ID V1 $ V2;
cards;
1 yes 0
1 no 0
1 no 0
2 no 0
2 no 0
2 no 0
3 no 1
3 no 0
4 yes 0
4 yes 0
5 yes 1
5 no 1
5 yes 0
;
run;
proc sql;
select ID
, (sum((V1="yes")or(V2>0))>0) as new_flag
from have
group by ID;
quit;
data want (keep=id flag);
flag = 'no ';
do until (last.id);
set have;
by id;
if v1 = 'yes' or v2 > 0 then flag = 'yes';
end;
output;
run;
I have a dataset that looks something like this:
IDnum State Product Consumption
123 MI A 30
123 MI B 20
123 MI C 45
456 NJ A 15
456 NJ D 10
789 MI B 60
... ... ... ...
I would like to create a new dataset with one row for each IDnum and a new dummy variable for each distinct product (my real dataset has close to 1000 products), along with its associated consumption. It would look something like this:
IDnum State Prod.A Cons.A Prod.B Cons.B Prod.C Cons.C Prod.D Cons.D
123 MI yes 30 yes 20 yes 45 no -
456 NJ yes 15 no - no - yes 10
789 MI no - yes 60 no - no -
... ... ... ... ... ... ... ... ... ...
Some variables, like "State", don't change within the same IDnum, but each row in the original dataset corresponds to one purchase, hence the change in the "Product" and "Consumption" variables for the same IDnum. I would like my new dataset to show all the consumption habits of each customer in a single row, but so far I have failed.
Any help would be greatly appreciated.
Without yes/no variables, it's really easy:
data input;
length State $2 Product $1;
input IDnum State Product Consumption;
cards;
123 MI A 30
123 MI B 20
123 MI C 45
456 NJ A 15
456 NJ D 10
789 MI B 60
;
run;
proc transpose data=input out=output(drop=_NAME_) prefix=Cons_;
var Consumption;
id Product;
by IDnum State;
run;
Adding the yes/no fields:
proc sql;/* from column names or alternatively
create it from source data directly if not taking too long */
create table work.products as
select scan(name, 2, '_') as product length=1
from dictionary.columns
where libname='WORK' and memname='OUTPUT'
and upcase(name) like 'CONS_%';
quit;
filename vars temp;/* write a temp file containing variable definitions
in desired order */
data _null_;
set work.products end=last;
file vars;
length str $40;
if _N_ = 1 then put 'LENGTH ';
str = catt('Prod_', product, ' $3');
put str;
str = catt('Cons_', product, ' 8');
put str;
if last then put ';';
run;
options source2;
data output2;
length IdNum 8 State $2;
%include vars;
set output;
array prod{*} Prod_:;
array cons{*} Cons_:;
drop i;
do i=1 to dim(prod);
if coalesce(cons(i), 0) ne 0 then prod(i)='yes';
else prod(i)='no';
end;
run;