assign site name and number sequence to patid in SAS - sas

I have a dataset with many sites (like in excerpt below) where the 3 participants at the end with XXX are additional dummy participants. What I want is to modify these XXX patids so that they are in the format I have in the "want" dataset below, as opposed to having XXX at the end, while keeping the group assignment from have dataset unchanged.
data have;
input site $ patid :$12. group $;
datalines;
ABC ABCPROT01001 A
ABC ABCPROT01002 B
ABC ABCPROT01003 A
ABC ABCPROT01005 A
ABC ABCPROT01006 A
ABC ABCPROT01XXX B
ABC ABCPROT01XXX A
ABC ABCPROT01XXX B
CDF CDFPROT01004 A
CDF CDFPROT01005 A
CDF CDFPROT01006 A
CDF CDFPROT01007 B
CDF CDFPROT01008 A
CDF CDFPROT01009 B
CDF CDFPROT01010 A
CDF CDFPROT01012 A
CDF CDFPROT01013 B
CDF CDFPROT01XXX B
CDF CDFPROT01XXX B
CDF CDFPROT01XXX A
AMD AMDPROT01001 A
AMD AMDPROT01002 B
AMD AMDPROT01003 A
AMD AMDPROT01XXX B
AMD AMDPROT01XXX A
AMD AMDPROT01XXX A
;
run;
data want;
input site $ patid :$12. group $;
datalines;
ABC ABCPROT01001 A
ABC ABCPROT01002 B
ABC ABCPROT01003 A
ABC ABCPROT01005 A
ABC ABCPROT01006 A
ABC ABCPROT01007 B
ABC ABCPROT01008 A
ABC ABCPROT01009 B
CDF CDFPROT01004 A
CDF CDFPROT01005 A
CDF CDFPROT01006 A
CDF CDFPROT01007 B
CDF CDFPROT01008 A
CDF CDFPROT01009 B
CDF CDFPROT01010 A
CDF CDFPROT01012 A
CDF CDFPROT01013 B
CDF CDFPROT01014 B
CDF CDFPROT01015 B
CDF CDFPROT01016 A
AMD AMDPROT01001 A
AMD AMDPROT01002 B
AMD AMDPROT01003 A
AMD AMDPROT01004 B
AMD AMDPROT01005 A
AMD AMDPROT01006 A
;
run;

This answer handles both situations:
The XXX observations are already in the data.
A site has only XXX observations (see site QQL below).
Code:
data have;
input site $ patid :$12. group $;
datalines;
ABC ABCPROT01001 A
ABC ABCPROT01002 B
ABC ABCPROT01003 A
ABC ABCPROT01005 A
ABC ABCPROT01006 A
ABC ABCPROT01XXX B
ABC ABCPROT01XXX A
ABC ABCPROT01XXX B
CDF CDFPROT01004 A
CDF CDFPROT01005 A
CDF CDFPROT01006 A
CDF CDFPROT01007 B
CDF CDFPROT01008 A
CDF CDFPROT01009 B
CDF CDFPROT01010 A
CDF CDFPROT01012 A
CDF CDFPROT01013 B
CDF CDFPROT01XXX B
CDF CDFPROT01XXX B
CDF CDFPROT01XXX A
QQL QQLPROT01XXX A
QQL QQLPROT01XXX B
QQL QQLPROT01XXX A
AMD AMDPROT01001 A
AMD AMDPROT01002 B
AMD AMDPROT01003 A
AMD AMDPROT01XXX B
AMD AMDPROT01XXX A
AMD AMDPROT01XXX A
;
run;
data want (drop = n l);
set have;
by site notsorted;
if first.site then n = 0;
l = substr(patid, length(patid) - 2);
if not find(l, 'x', 'i') then n = input(l, 3.);
else n = sum(n, 1);
substr(patid, 10, 3) = put(n, z3.);
retain n;
run;

I assume that the three extra observations are not yet in your have data set.
Try this:
data have;
input site $ patid $12.;
datalines;
ABC ABCPROT01001
ABC ABCPROT01002
ABC ABCPROT01003
ABC ABCPROT01005
ABC ABCPROT01006
CDF CDFPROT01004
CDF CDFPROT01005
CDF CDFPROT01006
CDF CDFPROT01007
CDF CDFPROT01008
CDF CDFPROT01009
CDF CDFPROT01010
CDF CDFPROT01012
CDF CDFPROT01013
AMD AMDPROT01001
AMD AMDPROT01002
AMD AMDPROT01003
;
data want;
do _N_ = 1 by 1 until (last.site);
set have;
by site notsorted;
l = input(substr(patid, length(patid) - 2), 3.);
output;
end;
do l = l + 1 to l + 3;
substr(patid, 10, 3) = put(l, z3.);
output;
end;
run;

Removing rows between two values in SAS

For the following data I am trying to filter rows, of each group ID, based on these conditions:
After every row with type='B' and value='Y' do the following
Remove the rows until the next row having type='F' and value='Y'.
If there is no row with type='B' and value='Y', then keep all of the rows (e.g. id=002).
Can we create the flag variable shown in my want dataset, so that I can filter on Flag='Y'?
Have
ID Type Date Value
001 F 1/2/2018 Y
001 B 1/3/2018
001 B 1/4/2018 Y
001 B 1/5/2018
001 B 1/6/2018
001 F 1/6/2018 Y
001 B 1/6/2018
001 B 1/7/2018
001 B 1/8/2018 Y
001 B 1/8/2018
001 B 1/9/2018
002 F 1/2/2018 Y
002 B 1/3/2018
002 B 1/4/2018
Want
ID Type Date Value Flag
001 F 1/2/2018 Y Y
001 B 1/3/2018 Y
001 B 1/4/2018 Y Y
001 B 1/5/2018
001 B 1/6/2018
001 F 1/6/2018 Y Y
001 B 1/6/2018 Y
001 B 1/7/2018 Y
001 B 1/8/2018 Y Y
001 B 1/8/2018
001 B 1/9/2018
002 F 1/2/2018 Y Y
002 B 1/3/2018 Y
002 B 1/4/2018 Y
I tried to do the following
data F;
set have;
where Type='F';run;
data B;
set have;
where Type='B';run;
proc sql;
create table all as select
a.* from B as b
inner join F as f
on a.id=b.id
and b.date >= a.date;
quit;
This includes all the rows from my have dataset. Any help is much appreciated.
The criteria for deciding whether a row belongs to a contiguous sub-group (call it a 'run' of rows) within an ID group are relatively simple, but the state can be compromised if the data contain odd cases such as:
two or more B Y before a F Y (extra 'run ending')
two or more F Y before a B Y ('run starting' within a run)
first row in group not F Y ('run starting' not first in group)
data want(drop=run_:);
SET have;
BY id;
run_first = (type='F' and value='Y');
run_final = (type='B' and value='Y');
* set the flag when the criteria for the start of a contiguous sub-group are met;
run_flag + run_first;
if first.id and NOT run_flag then
put 'WARNING: first row in group ' id= ' is not F Y, this may be incorrect';
if run_flag > 1 and run_first then
put 'WARNING: an additional F Y before a B Y at row ' _n_;
if run_flag then
OUTPUT;
if run_flag = 0 and run_final then
put 'WARNING: an additional B Y before a F Y at row ' _n_;
* reset the flag when the criteria for the end of a contiguous sub-group are met;
if last.id or run_final then
run_flag = 0;
run;
Same as Richard, I don't quite understand what the filtering criteria are.
I can see one problem with your join: you used a.* in your select statement, but "b" and "f" as your dataset aliases. This does not work because no dataset has been assigned to the alias "a".
The proper way would be as follows:
proc sql;
create table all as
select b.* from B as b
inner join F as f
on b.id=f.id
and b.date >= f.date;
quit;
However, even then, I don't believe an inner join is the proper way to solve your problem. Please let us know your filtering condition.
I have a solution, but it is not the most elegant (and might not cover corner cases). If anyone else has a better solution, please share.
First, to create the dataset in case anyone else wants to try it out:
data work.have;
* truncover leaves Value blank on rows that have no fourth field;
infile datalines truncover;
* list input: ID is numeric, Date is read with the date7. informat;
input ID Type $ Date :date7. Value $;
format ID z3. Date date11.;
datalines;
001 F 02Jan18 Y
001 B 03Jan18
001 B 04Jan18 Y
001 B 05Jan18
001 B 06Jan18
001 F 06Jan18 Y
001 B 06Jan18
001 B 07Jan18
001 B 08Jan18 Y
001 B 08Jan18
001 B 09Jan18
002 F 02Jan18 Y
002 B 03Jan18
002 B 04Jan18
;
run;
Solution:
This is based on your edited suggestion of creating a flag variable.
Data Flag;
set work.have;
if Type = 'B' and Value = 'Y' then
flag + 1;
if Type = 'F' then
flag = 0;
if Value ne 'Y' and flag = 1 then delete;
run;
The flag variable is 0 by default, because the sum statement (flag + 1) initializes it to 0 and retains it across rows.
The first IF-THEN statement identifies the Type='B', Value='Y' rows and sets the flag to 1, which is then retained for the subsequent rows.
The second IF-THEN statement identifies the Type='F' rows and resets the flag to 0.
The last IF-THEN statement deletes every row with flag=1 except the Type='B', Value='Y' row that started it.
I hope this applies to your problem.
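Two corner cases are not covered by the step above: an ID whose first row is not Type='F' inherits the flag from the previous ID, and a second Type='B', Value='Y' row before the next 'F' row raises the flag to 2, so the flag=1 test stops deleting. A minimal sketch of one way to handle both, assuming work.have is grouped by ID and that rows after any B/Y row should be dropped until the next F row (the output name flag_by_id is made up for illustration):

```sas
data flag_by_id;
  set work.have;
  by ID;
  /* assumption: rows are grouped by ID, so each ID starts with a clean flag */
  if first.ID then flag = 0;
  /* the sum statement retains flag across rows */
  if Type = 'B' and Value = 'Y' then flag + 1;
  if Type = 'F' then flag = 0;
  /* drop non-Y rows after one or more B/Y rows, until the next F row */
  if Value ne 'Y' and flag >= 1 then delete;
run;
```

On the sample data this should keep the same rows as the original step, since every ID there starts with an F row and no run contains two B/Y rows.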

SAS count unique observation by group

I am looking to figure out how many customers get their product from a certain store. The problem is that each prod_id can have up to 12 weeks of data for each customer. I have tried many variations of the code; some add up all of the observations for each customer, while others, like the one below, remove all but the last observation.
proc sort data= have; BY Prod_ID cust; run;
Data want;
Set have;
by Prod_Id cust;
if (last.Prod_Id and last.cust);
count= +1;
run;
data have
prod_id cust week store
1 A 7/29 ABC
1 A 8/5 ABC
1 A 8/12 ABC
1 A 8/19 ABC
1 B 7/29 ABC
1 B 8/5 ABC
1 B 8/12 ABC
1 B 8/19 ABC
1 B 8/26 ABC
1 C 7/29 XYZ
1 C 8/5 XYZ
1 F 7/29 XYZ
1 F 8/5 XYZ
2 A 7/29 ABC
2 A 8/5 ABC
2 A 8/12 ABC
2 A 8/19 ABC
2 C 7/29 EFG
2 C 8/5 EFG
2 C 8/12 EFG
2 C 8/19 EFG
2 C 8/26 EFG
What I want it to look like:
prod_id store count
1 ABC 2
1 XYZ 2
2 ABC 1
2 EFG 2
First, read up on the IF statement: an IF without THEN is a subsetting IF, which discards every row that fails the condition.
I've just edited your code to make it work:
proc sort data=have;
by prod_id store cust;
run;
data want(drop=cust week);
set have;
by prod_id store cust;
* reset the counter at the start of each prod_id/store group;
if first.store then count = 0;
* count each customer once; the sum statement retains count across rows;
if last.cust then count + 1;
* write one row per prod_id/store group;
if last.store then output;
run;
If you have questions, ask.
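To illustrate why the original step kept only the last observation, here is a small sketch contrasting the subsetting IF with IF-THEN. The data set names demo, subset and flagged and the variables x and hit are made up for this illustration. Note also that `count= +1;` simply assigns the value 1; the incrementing form is the sum statement `count + 1;`.

```sas
data demo;
  input x;
datalines;
1
2
3
;
run;

/* subsetting IF: only rows where the condition is true are written */
data subset;
  set demo;
  if x = 3;
run;

/* IF-THEN: every row is written; the assignment runs only when the condition is true */
data flagged;
  set demo;
  if x = 3 then hit = 1;
run;
```

Here subset ends up with one row (x=3), while flagged keeps all three rows and sets hit only on the last one.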
In PROC SQL, COUNT(DISTINCT cust) gives the number of unique customers per group. The only place where the result of the COUNT() aggregate function might be confusing is that it does not count missing values of the variable.
proc sql;
select prod_id
, store
, count(distinct cust) as count
, count(distinct cust)+max(missing(cust)) as count_plus_missing
from have
group by prod_id ,store
;
quit;

Update master table with update table while no obs in master table SAS

I have two datasets: base (the master table) and updateX (the update table, which might contain new observations).
data base;
input Field1 $ Field2 $ Field3 $ Field4 $;
datalines;
F 0001 20160501 ABC
NF 0001 20160502 CDF
NF 0002 20160601 ABC
NF 0002 20160602 CDF
;
run;
data updateX;
input Field1 $ Field2 $ Field3 $ Field4 $;
datalines;
F 0001 20160502 CDF
F 0002 20160602 CDF
F 0003 20160603 CDF
;
run;
My desired output
F 0001 20160501 ABC
F 0001 20160502 CDF
NF 0001 20160502 CDF
F 0002 20160602 CDF
F 0003 20160603 CDF
My effort:
data base;
modify base updateX;
by Field2 Field3;
run;
With MODIFY you need to tell SAS to REPLACE or OUTPUT depending on if the records are matched or not.
data base;
modify base updatex;
by field2 field3;
if _iorc_ eq 0 then replace;
else do;
output;
_error_=0;
end;
run;
It is easier if you can create a new data set using UPDATE. With update the matching records are updated and then output (replaced) and new records from the transaction file are output.
data ubase;
update base updatex;
by field2 field3;
run;
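UPDATE relies on BY-group processing, so both data sets need to be sorted (or indexed) by the BY variables, and the master should have at most one observation per BY value. The sample data already arrive in that order; a minimal sketch for input that might not, using the data set names from the question:

```sas
/* sort both inputs by the BY variables before UPDATE */
proc sort data=base;
  by field2 field3;
run;

proc sort data=updatex;
  by field2 field3;
run;

data ubase;
  update base updatex;
  by field2 field3;
run;
```

By default, missing values in the transaction data set do not overwrite values in the master.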

SAS: comparisons across multiple columns for pairs of IDs

I am working with data that derives from an 'indicate all that apply' question. Two raters were asked to complete the question for a unique subject list. The data looks something like this.
ID| Rater|Q1A|Q1B|Q1C|Q1D
------------------------
1 | 1 | A | F | E | B
1 | 2 | E | G |
2 | 1 | D | C | A
2 | 2 | C | D | A
I want to compare the two raters' answers for each ID and determine whether answers for Q1A-Q1D are the same. I am not interested in the direct comparisons between each rater by ID for Q1A, Q1B, etc. individually. I want to know if all the values in Q1A-Q1D as a set are the same. (E.g., in the example data above, the raters for ID 2 would be identical). I am assuming I would do this with an array. Thanks.
Here is a similar solution, also using call sortc, but with arrays and retained variables.
Create example dataset
data ratings;
infile datalines truncover;
input ID Rater (Q1A Q1B Q1C Q1D) ($);
datalines;
1 1 A F E B
1 2 E G
2 1 D C A
2 2 C D A
3 1 A B C
3 2 A B D
;
Do the comparison
data compare(keep=ID EQUAL);
set ratings;
by ID;
format PREV_Q1A PREV_Q1B PREV_Q1C PREV_Q1D $1.
EQUAL 1.;
retain PREV_:;
call sortc(of Q1:);
array Q(4) Q1:;
array PREV(4) PREV_:;
if first.ID then do;
do _i = 1 to 4;
PREV(_i) = Q(_i);
end;
end;
else do;
EQUAL = 1;
do _i = 1 to 4;
if Q(_i) NE PREV(_i) then EQUAL = 0;
end;
output;
end;
run;
Results
ID EQUAL
1 0
2 1
3 0
This looks like a job for call sortc:
data have;
infile cards missover;
input ID Rater (Q1A Q1B Q1C Q1D) ($);
cards;
1 1 A F E B
1 2 E G
2 1 D C A
2 2 C D A
3 1 A B C
3 2 A B D
;
run;
/*You can use an array if you like, but this works fine too*/
data temp /view = temp;
set have;
call sortc(of q:);
run;
data want;
set temp;
/*If you have more questions, extend the double-dash list to cover all of them*/
by ID Q1A--Q1D notsorted;
/*Replace Q1D with the name of the variable for the last question*/
IDENTICAL_RATERS = not(first.Q1D and last.Q1D);
run;
Sort, Concatenate, then compare.
data want ;
set ratings;
by id;
call sortc(of Q1A -- Q1D);
rating = cats(of Q1A -- Q1D);
retain rater1 rating1 ;
if first.id then rater1=rater;
if first.id then rating1=rating;
if not first.id ;
rater2 = rater ;
rating2 = rating;
match = rating1=rating2 ;
keep id rater1 rater2 rating1 rating2 match;
run;

Transpose data and only keep needed obs

I have a table with 3 observations and 4 fields (ID, last name, first name and telephone number) for each ID. I would like to transpose last name into 3 fields. For first name, I want to keep only the value associated with the 1st last name, and for telephone number, only the value associated with the 3rd (last) last name.
Table:
ID Lastname FirstName TelephoneNumber
001 Y A 123
001 W B 345
001 Z C 567
002 M D 789
002 N E 912
002 L F 934
Table want:
ID LastName_1 LastName_2 LastName_3 FirstName TelephoneNumber
001 Y W Z A 567
002 M N L D 934
Can anyone help out?
You can do this with PROC SUMMARY IDGROUP. I will leave it to you to research the syntax.
data id;
input (ID Lastname FirstName)(:$3. 2*:$1.) TelephoneNumber;
cards;
001 Y A 123
001 W B 345
001 Z C 567
002 M D 789
002 N E 912
002 L F 934
;;;;
run;
proc print;
run;
proc summary nway;
class id;
output
out=id2
idgroup(out(firstname)=)
idgroup(last out(telephonenumber)=)
idgroup(out[3](lastname)=)
;
run;
proc print;
run;
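For comparison, a DATA step sketch that should produce the same layout without PROC SUMMARY. It assumes the id data set above is grouped by ID with at most three rows per ID; the output name id3 and the temporary name _fn are made up for illustration.

```sas
data id3(keep=ID LastName_1-LastName_3 FirstName TelephoneNumber);
  array ln[3] $1 LastName_1-LastName_3;
  /* read one whole ID group per iteration of the DATA step */
  do _n_ = 1 by 1 until (last.ID);
    set id(rename=(FirstName=_fn));
    by ID;
    if _n_ <= 3 then ln[_n_] = Lastname;
    /* keep the first name from the first row of the group */
    if _n_ = 1 then FirstName = _fn;
  end;
  /* TelephoneNumber still holds the value from the last row read */
  output;
run;
```

The IDGROUP version is more compact; the DATA step form makes the "first row" and "last row" choices explicit.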