Condense multiple records into a single record in SAS

I have row-level data by account, and I want to collapse it to one row per account owner in a new data set. Within each owner, a Yes in any row takes priority over No.
Account_Owner Account_No Ever_Purchase Ever_Purchase_within_2days Ever_Deliver_in_2weeks
Tom 12345 Yes Yes No
Tom 34567 Yes No Yes
Tom 09876 No No No
Desired Outcome
Account_Owner Ever_Purchase Ever_Purchase_within_2days Ever_Deliver_in_2weeks
Tom Yes Yes Yes
I am sorry that I don't have any code because I don't know where to start.

You can use a DOW loop to track the group result for each ever_* variable in a temporary array.
proc format;
value yesno .,0 = 'No' other='Yes';
data have; input
Account_Owner $ Account_No Ever_Purchase $ Ever_Purchase_within_2days $ Ever_Deliver_in_2weeks $;
datalines;
Tom 12345 Yes Yes No
Tom 34567 Yes No Yes
Tom 09876 No No No
;
data want;
array evals(100) _temporary_; * presume never more than 100 flag variables;
call missing (of evals(*));
* dow loop;
do until (last.account_owner);
set have;
by account_owner;
array flags ever:;
do _n_ = 1 to dim(flags);
evals(_n_) = evals(_n_) or flags(_n_) = 'Yes'; * compute aggregate result;
end;
end;
* move results back into original variables;
do _n_ = 1 to dim(flags);
flags(_n_) = put(evals(_n_), yesno.);
end;
* implicit output, one row per group combination;
run;
Note: As an alternative, if you convert Yes/No to numeric 1/0, you can use Proc SUMMARY or Proc MEANS to compute the group result (the max of each variable will be 1 if any row is Yes and 0 if all rows are No).
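A minimal sketch of that alternative, assuming the have data set and the yesno. format defined above. The names have_num, want_num and the three *_n variables are made up for illustration.

```sas
data have_num;
  set have;
  /* a comparison evaluates to 1 when true and 0 when false */
  purchase_n = (Ever_Purchase = 'Yes');
  within2d_n = (Ever_Purchase_within_2days = 'Yes');
  deliver_n  = (Ever_Deliver_in_2weeks = 'Yes');
run;

proc summary data=have_num nway;
  class Account_Owner;
  var purchase_n within2d_n deliver_n;
  /* max is 1 if any row in the group was Yes, otherwise 0 */
  output out=want_num(drop=_type_ _freq_) max=;
run;

proc print data=want_num;
  /* display the 1/0 results as Yes/No */
  format purchase_n within2d_n deliver_n yesno.;
run;
```

Because CLASS is used rather than BY, the input does not need to be sorted by Account_Owner.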

Related

SAS - Row by row Comparison within different ID Variables of Same Dataset and delete ALL Duplicates

I need some help in trying to execute a comparison of rows within different ID variable groups, all in a single dataset.
That is, if the same value occurs in two or more ID groups, I'd like to delete those observations entirely.
For example:
ID Value
1 A
1 B
1 C
1 D
1 D
2 A
2 C
3 A
3 Z
3 B
The output I desire is:
ID Value
1 D
3 Z
I have looked online extensively, and tried a few things. I thought I could mark the duplicates with a flag and then delete based off that flag.
The flagging code is:
data have;
set want;
flag = first.ID ne last.ID;
run;
This worked for some cases, but I also got duplicates within the same value group flagged.
Therefore the first observation got deleted:
ID Value
3 Z
I also tried:
data have;
set want;
flag = first.ID ne last.ID and first.value ne last.value;
run;
but that didn't mark any duplicates at all.
I would appreciate any help.
Please let me know if any other information is required.
Thanks.
Here's a fairly simple way to do it: sort and deduplicate by value + ID, then keep only rows with values that occur only for a single ID.
data have;
input ID Value $;
cards;
1 A
1 B
1 C
1 D
1 D
2 A
2 C
3 A
3 Z
3 B
;
run;
proc sort data = have nodupkey;
by value ID;
run;
data want;
set have;
by value;
if first.value and last.value;
run;
proc sql version:
proc sql;
create table want as
select distinct ID, value from have
group by value
having count(distinct id) =1
order by id
;
quit;
This is my interpretation of the requirements.
Find levels of value that occur in only 1 ID.
data have;
input ID Value:$1.;
cards;
1 A
1 B
1 C
1 D
1 D
2 A
2 C
3 A
3 Z
3 B
;;;;
proc print;
proc summary nway; /*Dedup*/
class id value;
output out=dedup(drop=_type_ rename=(_freq_=occr));
run;
proc print;
run;
proc summary nway;
class value;
output out=want(drop=_type_) idgroup(out[1](id)=) sum(occr)=;
run;
proc print;
where _freq_ eq 1;
run;
proc print;
run;
A slightly different approach can use a hash object to track the unique values belonging to a single group.
data have; input
ID Value:& $1.; datalines;
1 A
1 B
1 C
1 D
1 D
2 A
2 C
3 A
3 Z
3 B
run;
proc delete data=want;
proc ds2;
data _null_;
declare package hash values();
declare package hash discards();
declare double idhave;
method init();
values.keys([value]);
values.data([value ID]);
values.defineDone();
discards.keys([value]);
discards.defineDone();
end;
method run();
set have;
if discards.find() ne 0 then do;
idhave = id;
if values.find() eq 0 and id ne idhave then do;
values.remove();
discards.add();
end;
else
values.add();
end;
end;
method term();
values.output('want');
end;
enddata;
run;
quit;
%let syslast = want;
I think what you should do is:
data want;
set have;
by ID value;
if not first.value then flag = 1;
else flag = 0;
run;
This basically flags all occurrences of a value except the first for a given ID.
Also I changed want and have assuming you create what you want from what you have. Also I assume have is sorted by ID value order.
Also this will only flag 1 D above. Not 3 Z
Additional Inputs
Can't you just do a sort to get rid of the duplicates:
proc sort data = have out = want nodupkey dupout = not_wanted;
by ID value;
run;
So if you process the observations by VALUE levels (instead of by ID levels), then you just need to keep track of whether any ID is ever different from the first one.
data want ;
do until (last.value);
set have ;
by value ;
if first.value then first_id=id;
else if id ne first_id then remapped=1;
end;
if not remapped;
keep value id;
run;
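The BY value statement in that step expects have to be ordered by value. A minimal sketch of the sort that would need to run first, assuming the have data set from the question:

```sas
/* order the rows so that all IDs for one value are adjacent */
proc sort data=have;
  by value id;
run;
```

Without this sort, the DATA step would stop with a "BY variables are not properly sorted" error on the sample data, since the question lists the rows in ID order.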

SAS: Creating dummy variables from categorical variable

I would like to turn the following long dataset:
data test;
input Id Injury $;
datalines;
1 Ankle
1 Shoulder
2 Ankle
2 Head
3 Head
3 Shoulder
;
run;
Into a wide dataset that looks like this:
ID Ankle Shoulder Head
1 1 1 0
2 1 0 1
3 0 1 1
This answer seemed the most relevant but was falling over at the proc freq stage (my real dataset is around 1 million records, and has around 30 injury types):
Creating dummy variables from multiple strings in the same row
Additional help: https://communities.sas.com/t5/SAS-Statistical-Procedures/Possible-to-create-dummy-variables-with-proc-transpose/td-p/235140
Thanks for the help!
Here's a basic method that should work easily, even with several million records.
First you sort the data, then add in a count to create the 1 variable. Next you use PROC TRANSPOSE to flip the data from long to wide. Then fill in the missing values with a 0. This is a fully dynamic method, it doesn't matter how many different Injury types you have or how many records per person. There are other methods that are probably shorter code, but I think this is simple and easy to understand and modify if required.
data test;
input Id Injury $;
datalines;
1 Ankle
1 Shoulder
2 Ankle
2 Head
3 Head
3 Shoulder
;
run;
proc sort data=test;
by id injury;
run;
data test2;
set test;
count=1;
run;
proc transpose data=test2 out=want prefix=Injury_;
by id;
var count;
id injury;
idlabel injury;
run;
data want;
set want;
array inj(*) injury_:;
do i=1 to dim(inj);
if inj(i)=. then inj(i) = 0;
end;
drop _name_ i;
run;
Here's a solution involving only two steps... Just make sure your data is sorted by id first (the injury column doesn't need to be sorted).
First, create a macro variable containing the list of injuries
proc sql noprint;
select distinct injury
into :injuries separated by " "
from have
order by injury;
quit;
Then, let RETAIN do the magic -- no transposition needed!
data want(drop=i injury);
set have;
by id;
format &injuries 1.;
retain &injuries;
array injuries(*) &injuries;
if first.id then do i = 1 to dim(injuries);
injuries(i) = 0;
end;
do i = 1 to dim(injuries);
if injury = scan("&injuries",i) then injuries(i) = 1;
end;
if last.id then output;
run;
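The two steps above read a data set named have, while the question's sample data is called test. A minimal sketch of a preparatory step, assuming you want to leave test unchanged:

```sas
/* copy TEST to HAVE, ordered by id as the BY statement requires */
proc sort data=test out=have;
  by id;
run;
```

Note that this approach uses the injury values themselves as variable names, so it assumes every value is a valid SAS name (no spaces, not starting with a digit).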
EDIT
Following OP's question in the comments, here's how we could use codes and labels for injuries. It could be done directly in the last data step with a label statement, but to minimize hard-coding, I'll assume the labels are entered into a sas dataset.
1 - Define Labels:
data myLabels;
infile datalines dlm="|" truncover;
informat injury $12. labl $32.; /* wide enough for the longest label (27 characters) */
input injury labl;
datalines;
S460|Acute meniscal tear, medial
S520|Head trauma
;
2 - Add a new query to the existing proc sql step to prepare the label assignment.
proc sql noprint;
/* Existing query */
select distinct injury
into :injuries separated by " "
from have
order by injury;
/* New query */
select catx("=",injury,quote(trim(labl)))
into :labls separated by " "
from myLabels;
quit;
3 - Then, at the end of the data want step, just add a label statement.
data want(drop=i injury);
set have;
by id;
/* ...same as before... */
* Add labels;
label &labls;
run;
And that should do it!
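For comparison, the hard-coded variant mentioned at the start of the edit might look like the sketch below. It assumes want already contains variables named S460 and S520 (the example codes from myLabels); want_labeled is a made-up name.

```sas
data want_labeled;
  set want;
  /* labels typed directly instead of being read from myLabels */
  label S460 = "Acute meniscal tear, medial"
        S520 = "Head trauma";
run;
```

This is shorter for a handful of codes, but every new injury code means editing the program, which is what the data-driven version avoids.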

Flagging rows that fulfill two conditions in SAS

I have a large dataset, with hundreds of variables and hundreds of observations, coming from a clinical trial. Variable V1 is a yes/no variable indicating some condition. V2 is numeric and represents a dose. T is a time variable. The dataset is "long" shaped: every subject has a few observations, one for each time point. For every subject, I want to create a new yes/no variable (it can be in a new dataset) which is yes if V1 is "yes" in at least one time point, OR V2 is above 0 in at least one time point. How do I do that? Thank you.
Try the following:
data ds;
set ds;
if V1="yes" or V2>0 then do;
flag="yes";
end;
else do;
flag= "no";
end;
run;
Then summarize the dataset to ID level:
proc sql;
create table summary as
select ID, count(flag) as flag_cnt
from ds
where flag="yes"
group by ID;
quit;
These are the IDs which satisfy the condition
You can submit the code on the example below to verify.
Here (V1="yes" or V2>0) gives a 0/1 dummy variable for each row. When we sum it, we get the number of rows satisfying the condition you mentioned for each ID.
To get a flag, we compare the sum to 0 and wrap the comparison in parentheses, which creates the 0/1 variable that you want.
Hope it helps!
MK
data have;
input ID V1 $ V2;
cards;
1 yes 0
1 no 0
1 no 0
2 no 0
2 no 0
2 no 0
3 no 1
3 no 0
4 yes 0
4 yes 0
5 yes 1
5 no 1
5 yes 0
;
run;
proc sql;
select ID
, (sum((V1="yes")or(V2>0))>0) as new_flag
from have
group by ID;
quit;
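If a stored yes/no variable is wanted rather than a printed 0/1 column, the same query can be adapted. This is a sketch assuming the have data set above; want_sql and flag are illustrative names.

```sas
proc sql;
  create table want_sql as
  select ID,
         /* max of the row-level 0/1 condition is 1 if any row qualifies */
         case when max((V1 = "yes") or (V2 > 0)) = 1 then "yes"
              else "no"
         end as flag
  from have
  group by ID;
quit;
```

Using max instead of sum(...) > 0 is only a stylistic choice here; both collapse the row-level conditions to one value per ID.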
data want (keep=id flag);
flag = 'no ';
do until (last.id);
set have;
by id;
if v1 = 'yes' or v2 > 0 then flag = 'yes';
end;
output;
run;

sas drop value through the whole data

I want to turn invalid values into a single value and then print only the IDs of the observations that have these values. I get multiple errors in the second part (the loop). Is it possible to fix this code somehow, or is it simply impossible to do it this way?
data comb8;set comb;
if q2 not in (1,2,3,4,5)then q2=666;
if q3 not in (1,2,3,4,5)then q3=666;
if q10 not in (1,2,3,4,5)then q10=666;
if gender not in(0,1)then gender=666;
if married not in(0,1)then married=666;
run;
data comb10; set comb8;
do i=1 to n;
if i NE 666 then drop(?????????);
end;
keep id;
run;
Hoping I've understood this right: the end result you're after is to keep just the observation number of any observation that contains an invalid value for any of the criteria you're checking? If so, try this:
data comb8(keep=id obid where=(not missing(obid)));
set comb;
if q2 not in (1,2,3,4,5) then obid=_n_;
if q3 not in (1,2,3,4,5) then obid=_n_;
if q10 not in (1,2,3,4,5) then obid=_n_;
if gender not in(0,1) then obid=_n_;
if married not in(0,1) then obid=_n_;
run;
_n_ is an automatic variable that counts DATA step iterations; with a single set statement it matches the number of the observation just read. You can set obid to this value when an issue is found and then use (where=(not missing(obid))) to keep only the observations that had invalid values.
Just in case it's of help to anyone else later on...
data comb8;set comb;
if q2 not in (1,2,3,4,5)then q2=666;
if q3 not in (1,2,3,4,5)then q3=666;
if q10 not in (1,2,3,4,5)then q10=666;
if gender not in(0,1)then gender=666;
if married not in(0,1)then married=666;
run;
data comb10(keep=obid where=(not missing(obid)));
set comb8;
Array Q q1-q10 gender married;
do i=1 to dim(Q) ;
if Q[i] EQ 666 then obid=_n_;
end;
run;
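If the goal is the id values themselves rather than observation numbers, a subsetting IF is one more option. This is a sketch on made-up sample data; comb_demo, bad_ids and the three sample rows are hypothetical.

```sas
data comb_demo;
  input id q2 q3 q10 gender married;
  datalines;
101 1 2 3 0 1
102 9 2 3 0 1
103 1 2 3 1 7
;
run;

data bad_ids(keep=id);
  set comb_demo;
  /* keep the row only when at least one checked variable is invalid */
  if q2 not in (1,2,3,4,5) or q3 not in (1,2,3,4,5) or q10 not in (1,2,3,4,5)
     or gender not in (0,1) or married not in (0,1);
run;

proc print data=bad_ids;
run;
```

This skips the 666 recoding step entirely, so it only fits when the recoded values are not needed later.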

Concatenating 3 variable name in a new variable within a datastep by using an array reference

I would like to create a variable called DATFL that has the following value at the last observation:
DATFL
gender/scan
Here is the code :
data mix_ ;
input id $ name $ gender $ scan $;
datalines;
1 jon M F
2 jill F L
3 james F M
4 jonas M M
;
run;
data mix_3; set mix_;
length datfl datfl_ $ 50;
array m4(*) id name gender scan;
retain datfl;
do i=1 to dim(m4);
if index(m4(i) ,'M') then do;
datfl_=vname(m4(i)) ;
if missing(datfl) then datfl=datfl_;
else datfl=strip(datfl)||"/"||datfl_;
end;
end;
run;
Unfortunately, the value I get for DATFL at the last observation is 'gender/scan/gender/scan'. Because of the RETAIN statement on DATFL, I end up with duplicates. At the end of this data step I was planning to use a CALL SYMPUT statement to load the last value into a macro variable, but I won't do that until I fix this issue. Can anyone give me guidance on how to prevent DATFL from containing duplicate values at the end of the dataset? Cheers
sas_kappel
Don't retain DATFL, Instead, retain DATFL_.
data mix_3; set mix_;
length datfl datfl_ $ 50;
array m4(*) id name gender scan;
retain datfl_;
do i=1 to dim(m4);
if index(m4(i) ,'M') then do;
datfl_=vname(m4(i)) ;
if missing(datfl) then datfl=datfl_;
else datfl=strip(datfl)||"/"||datfl_;
end;
end;
if missing(datfl) then datfl = datfl_;
run;
It doesn't work. Let me change the dataset (mix_) so you can see that RETAIN DATFL_ does not work in this scenario.
data mix_ ;
input id $ name $ gender $ scan $;
datalines;
1 jon M M
2 Marc F L
3 james F M
4 jonas H M
;
run;
To summarize, what I want is the DISTINCT values of DATFL in a macro variable. For each record, my code searches for variables containing the letter M; if one is found, DATFL receives the name of that array variable. Multiple variable names are separated by '/'. For the next records it should do the same, BUT add only variable names that satisfy the condition AND are not already in DATFL. Currently, if you run my program, DATFL at observation 4 is gender/scan/name/scan/scan, but I would like DATFL=gender/scan/name, because those are the distinct values. Ultimately, I will then write the following code:
if eof then CALL SYMPUT('DATFL',datfl);
sas_kappel
Your revised data makes it much clearer what you're looking for. Here is some code that should give the correct result.
I've used the CALL CATX routine to add new values to DATFL, separated by a /. Before adding, the code checks that the relevant variable name doesn't already exist in the string.
data mix_ ;
input id $ name $ gender $ scan $;
datalines;
1 jon M M
2 Marc F L
3 james F M
4 jonas H M
;
run;
data _null_;
set mix_ end=eof;
length datfl $100; /*or whatever*/
retain datfl;
array m4{*} $ id name gender scan;
do i = 1 to dim(m4);
if index(m4{i},'M') and not index(datfl,vname(m4{i})) then call catx('/',datfl,vname(m4{i}));
end;
if eof then call symput('DATFL', datfl);
run;
%put datfl = &DATFL.;
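One caveat with an INDEX-based check is that it matches substrings, so a variable called scan would be treated as already present once a hypothetical scan2 is in the list. A possible variant, sketched here under the assumption that whole-name matching is wanted, uses FINDW with '/' as the only delimiter. The helper variable vn is introduced for illustration.

```sas
data _null_;
  set mix_ end=eof;
  length datfl $100 vn $32;
  retain datfl;
  array m4{*} $ id name gender scan;
  do i = 1 to dim(m4);
    vn = vname(m4{i});
    /* FINDW looks for vn as a whole '/'-delimited word; 't' trims trailing blanks */
    if index(m4{i}, 'M') and not findw(datfl, vn, '/', 't') then
      call catx('/', datfl, vn);
  end;
  /* CALL SYMPUTX strips leading and trailing blanks before storing the value */
  if eof then call symputx('DATFL', datfl);
run;
%put datfl = &DATFL.;
```

For the four-variable example in the question the two checks behave the same, since none of the names is a substring of another.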