Combination of grouped values in SAS

I am trying to find the various combinations of type, but only in the cases where the records in the dataset below differ solely by the column type. For example, the first three rows differ only by the column type.
Given Dataset
ins_id ins_number type
1234 1234-1234-1 AU
1234 1234-1234-1 HM
1234 1234-1234-1 RE
567 567-567-12 TL
567 567-567-13 TL
9101 9101-1234-1 AU
9101 9101-1234-1 TX
9101 9101-1234-1 CN
8854 8854-1234-1 TX
8854 8854-1234-1 GB
8854 8854-1234-1 RE
8854 8854-1234-2 RX
Expected Output:
combination count
AU,HM,RE 1
AU,TX,CN 1
TX,GB,RE 1
I tried writing the query below, but I am not getting the desired output. Please help:
proc sql;
  create table tst as
  select cp.type,
         count(distinct ins_id)
  from (select distinct fac_prod_typ from dataset3a) cp
       cross join
       (select distinct ins_number from dataset3a) pes
       left join
       dataset3a
         on dataset3a.type = cp.type and
            dataset3a.ins_number = pes.ins_number
  group by cp.type, pes.ins_number;
quit;

You will want to sort the data to ensure the types list is consistent over all ids.
A DOW loop over a SET...; BY...; will output one types list per group.
The last step is using Proc FREQ to count the number of ids for each types list.
Example:
data have;
informat ins_id $8. ins_number $25. type $2.;
input ins_id $ ins_number $ type $;
cards;
1234 1234-1234-1 AU
1234 1234-1234-1 HM
1234 1234-1234-1 RE
567 567-567-12 TL
567 567-567-13 TL
9101 9101-1234-1 AU
9101 9101-1234-1 TX
9101 9101-1234-1 CN
8854 8854-1234-1 TX
8854 8854-1234-1 GB
8854 8854-1234-1 RE
8854 8854-1234-2 RX
;
/* force specific ordering of type within group id and number */
/* necessary for proper frequency counting */
/* if the original sequence of types IS important, do not sort and use BY ... NOTSORTED in the data step */
proc sort data=have;
by ins_id ins_number type;
run;
data types(keep=types);
  length types $200;
  do until (last.ins_number);
    set have;
    by ins_id ins_number;
    /* third argument makes comma the word delimiter, so a repeated type is not appended again */
    if indexw(types, type, ',') = 0 then types = catx(',', types, type);
  end;
  if index(types,',') then output;
run;
proc freq noprint data=types;
table types / out=types_counts(keep=types count) ;
run;

Using FIRST./LAST. logic is nice here.
To get the counts, run a PROC FREQ on the final output; that would also let you identify the ins_id for each mix. A sketch of that PROC FREQ step is shown after the code below.
data have;
informat ins_id $8. ins_number $25. type $2.;
input ins_id $ ins_number $ type $;
cards;
1234 1234-1234-1 AU
1234 1234-1234-1 HM
1234 1234-1234-1 RE
567 567-567-12 TL
567 567-567-13 TL
9101 9101-1234-1 AU
9101 9101-1234-1 TX
9101 9101-1234-1 CN
8854 8854-1234-1 TX
8854 8854-1234-1 GB
8854 8854-1234-1 RE
;;;;
data want;
set have;
by ins_id ins_number type notsorted;
retain combo;
length combo $256.;
if first.ins_number then call missing(combo);
if first.type then combo = catx(", ", combo, type);
if last.ins_number and countw(combo)>1 then output;
run;
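For reference, a minimal sketch of that PROC FREQ step, assuming the want dataset produced above (the output name combo_counts is made up for illustration):
proc freq data=want noprint;
  /* one row per distinct type combination, with the number of ins_number groups that share it */
  tables combo / out=combo_counts(keep=combo count);
run;
Requesting tables combo*ins_id instead would additionally show which ins_id contributed each mix.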

Related

Populating a dataset depending on the values of a variable in another dataset

I have two data sets INPUT and OUTPUT.
data INPUT;
input
id 1-4
var1 $ 6-10
var2 $ 12-17
var3 $ 19-22
transformation $ 24-26
;
datalines;
1023 apple banana oats 1:1
1049 12 22 8 2x
1219 milk cream fish 1:1
;
run;
The OUTPUT dataset has a different structure. The variables do not have the same name.
data work.output;
attrib
variable_1 length=8 format=best12. label="Variable 1"
variable_2 length=$50 format=$50. label="Variable 2"
Variable_3 length=8 format=date9. label="Variable 3";
stop;
run;
OUTPUT will be filled with the values from INPUT based on what is specified in the column "transformation" of table INPUT: when "transformation" equals "1:1", I want to fill the OUTPUT ds with the values of the corresponding INPUT observation. If this were a small Excel sheet, I would copy & paste or use a lookup.
For example, obs1 of dataset INPUT has transformation = 1:1, so I want to fill variable_1 of dataset OUTPUT with "apple", variable_2 with "banana" and variable_3 with "oats".
For the second observation of ds INPUT I want to multiply each variable by two and assign the results to variable_1 - variable_3, respectively.
In my real dataset I have many more columns, so I need to automate this, probably via index, since the variable names do not correspond.
You probably need to code each transformation rule separately.
This works for your example. But you did not include any date transformations so variable3 is not used.
data INPUT;
input
id 1-4
var1 $ 6-10
var2 $ 12-17
var3 $ 19-22
transformation $ 24-26
;
datalines;
1023 apple banana oats 1:1
1049 12 22 8 2x
1219 milk cream fish 1:1
;
proc transpose data=input prefix=value out=step1;
by id transformation;
var var1-var3 ;
run;
data output;
set step1;
length variable1 8 variable2 $50 variable3 8;
format variable3 date9.;
if transformation='1:1' then variable2=value1;
if transformation='2x' then variable1 = 2*input(value1,32.);
run;
Result
Obs    id   transformation  _NAME_  value1  variable1  variable2  variable3
  1  1023   1:1             var1    apple           .  apple              .
  2  1023   1:1             var2    banana          .  banana             .
  3  1023   1:1             var3    oats            .  oats               .
  4  1049   2x              var1    12             24                     .
  5  1049   2x              var2    22             44                     .
  6  1049   2x              var3    8              16                     .
  7  1219   1:1             var1    milk            .  milk               .
  8  1219   1:1             var2    cream           .  cream              .
  9  1219   1:1             var3    fish            .  fish               .

SAS cumulative count by unique ID and date

I have a dataset like below
Customer_ID Vistited_Date
1234 7-Feb-20
4567 7-Feb-20
9870 7-Feb-20
1234 14-Feb-20
7654 14-Feb-20
3421 14-Feb-20
I am trying to find the cumulative unique count of customers by date, assuming my output will look like the one below
Cust_count Vistited_Date
3 7-Feb-20
2 14-Feb-20
7-Feb-2020 has 3 unique customers, whereas 14-Feb-2020 has only 2, since customer 1234 has already visited.
Does anyone know how I could build a dataset under these conditions?
Sorry if my question is not clear enough; I am happy to give more details if necessary.
Thanks!
NOTE: @draycut's answer has the same logic but is faster, and I will explain why.
@draycut's code uses one hash method, add(), using its return code as the test for a conditional increment. My code uses check() to test for the conditional increment and then add() (which will never fail) to track. The one-method approach can be anywhere from 15% to 40% faster (depending on the number of groups, the size of the groups, and the id reuse rate).
You will need to track the IDs that have occurred in all prior groups, and exclude the tracked IDs from the current group count.
Tracking can be done with a hash, and conditional counting can be performed in a DOW loop over each group. A DOW loop places the SET statement inside an explicit DO.
Example:
data have;
input ID Date: date9.; format date date11.;
datalines;
1234 7-Feb-20
4567 7-Feb-20
9870 7-Feb-20
1234 14-Feb-20
7654 14-Feb-20
3421 14-Feb-20
;
data counts(keep=date count);
  if _n_ = 1 then do;
    declare hash tracker();
    tracker.defineKey('id');
    tracker.defineDone();
  end;

  do until (last.date);
    set have;
    by date;
    if tracker.check() ne 0 then do;
      count = sum(count, 1);
      tracker.add();
    end;
  end;
run;
Raw performance benchmark: there is no disk I/O, but CPU time is required to fill the array before doing the hashing, so those performance components are combined.
The root performance question is how fast new items can be added to the hash.
Simulate 3,000,000 'records': 1,000 groups of 3,000 ids each, with 10% id reuse between adjacent groups (so the distinct ids will be ~2.7M).
%macro array_fill (top=3000000, n_group=1000, overlap_factor=0.10);
  %local group_size n_overlap index P Q;
  %let group_size = %eval (&top / &n_group);
  %if (&group_size < 1) %then %let group_size = 1;
  %let n_overlap = %sysevalf (&group_size * &overlap_factor, floor);
  %if &n_overlap < 0 %then %let n_overlap = 0;
  %let top = %sysevalf (&group_size * &n_group);

  P = 1;
  Q = &group_size;
  array ids(&top) _temporary_;
  _n_ = 0;
  do i = 1 to &n_group;
    do j = P to Q;
      _n_ + 1;
      ids(_n_) = j;
    end;
    P = Q - &n_overlap;
    Q = P + &group_size - 1;
  end;
%mend;
options nomprint;
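/* Assumption about how the timings are read (not stated in the original): with     */
/* FULLSTIMER on, the log reports detailed real and CPU time for each of the two    */
/* _null_ benchmark steps below, which is where a 15%-40% difference would be seen. */
options fullstimer;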
data _null_ (label='check then add');
  length id 8;
  declare hash h();
  h.defineKey('id');
  h.defineDone();

  %array_fill;

  do index = 1 to dim(ids);
    id = ids(index);
    if h.check() ne 0 then do;
      count = sum(count,1);
      h.add();
    end;
  end;

  _n_ = h.num_items;
  put 'num_items=' _n_ comma12.;
  put index= comma12.;
  stop;
run;
data _null_ (label='just add');
  length id 8;
  declare hash h();
  h.defineKey('id');
  h.defineDone();

  %array_fill;

  do index = 1 to dim(ids);
    id = ids(index);
    if h.add() = 0 then
      count = sum(count,1);
  end;

  _n_ = h.num_items;
  put 'num_items=' _n_ comma12.;
  put index= comma12.;
  stop;
run;
data have;
input Customer_ID Vistited_Date :anydtdte12.;
format Vistited_Date date9.;
datalines;
1234 7-Feb-2020
4567 7-Feb-2020
9870 7-Feb-2020
1234 14-Feb-2020
7654 14-Feb-2020
3421 14-Feb-2020
;
data want (drop=Customer_ID);
  if _N_ = 1 then do;
    declare hash h ();
    h.definekey ('Customer_ID');
    h.definedone ();
  end;

  do until (last.Vistited_Date);
    set have;
    by Vistited_Date;
    if h.add() = 0 then Count = sum(Count, 1);
  end;
run;
If your data is not sorted and you like SQL, maybe this solution works just as well for you, and it is very simple:
/* your example 3 rows */
data have;
input ID Date: date9.; format date date11.;
datalines;
1234 7-Feb-20
4567 7-Feb-20
9870 7-Feb-20
1234 14-Feb-20
7654 14-Feb-20
3421 14-Feb-20
1234 15-Feb-20
7654 15-Feb-20
1111 15-Feb-20
;
run;
/* simple set theory: the FINAL dataset contains your results, shown below */
proc sql;
  create table temp(where=(mindate=date)) as
    select ID, date, min(date) as mindate
    from have
    group by id;
  create table final as
    select count(*) as customer_count, date
    from temp
    group by date;
quit;
/* results:
customer_count Date
3 07.febr.20
2 14.febr.20
1 15.febr.20
*/
Another method, because I don't know the hash object so well. >_<
data have;
input ID Date: date9.; format date date11.;
datalines;
1234 7-Feb-20
4567 7-Feb-20
9870 7-Feb-20
1234 14-Feb-20
7654 14-Feb-20
3421 14-Feb-20
;
data want;
  length Used $200.;
  retain Used;
  set have;
  by Date;
  if first.Date then count = .;
  /* FINDW with a comma delimiter prevents partial matches, e.g. ID 123 matching 1234 */
  if not findw(Used, cats(ID), ',') then do;
    count + 1;
    Used = catx(',', Used, ID);
  end;
  if last.Date;
  put Date= count=;
run;
If you are not overly concerned with processing speed and want something simple:
proc sort data=have;
by id date;
** Get date of each customer's first unique visit **;
proc sort data=have out=first_visit nodupkey;
by id;
proc freq data=first_visit noprint;
tables date /out=want (keep=date count);
run;

Finding values not in another dataset in SAS

In SAS: I have 2 datasets, and if I want to find the values of a single variable that are not present in another dataset, that's easy.
Now what if I have to compare in the following way:
data dataset1;
input PointA $ PointB $ #6 date date7.;
format date mmddyy10.;
datalines;
NY LV 02Oct2018
NY LV 04Oct2018
NY LV 06Oct2018
;
which gives Dataset1:
Obs PointA PointB Date
1 NY LV 10/02/2002
2 NY LV 10/04/2002
3 NY LV 10/06/2002
Dataset2 has dates from 01Oct2018 to 06Oct2018.
DATE
01Oct2018
02Oct2018
03Oct2018
04Oct2018
05Oct2018
06Oct2018
WANTED: The final output I want is all the dates that are absent in Dataset1 for the PointA-PointB pair when compared to Dataset2. So my desired output is:
Obs PointA PointB Date
1 NY LV 10/01/2002
2 NY LV 10/03/2002
3 NY LV 10/05/2002
I'm using NOT IN, but it gives me only the dates. Somehow I need to include the other variables as well; in this case PointA and PointB.
Perform a full Cartesian join with ON criteria involving a not-equal (NE) comparison, then remove the original 'have' rows from the Cartesian product with an EXCEPT set operator.
* use the proper informat! ;
data have;
  input PointA $ PointB $ date date9.;
  format date mmddyy10.;
datalines;
NY LV 02Oct2018
NY LV 04Oct2018
NY LV 06Oct2018
;
data dates;
  input date date9.;
  format date mmddyy10.;
datalines;
01Oct2018
02Oct2018
03Oct2018
04Oct2018
05Oct2018
06Oct2018
;
proc sql;
  create table have_not_lookup as
  select distinct
         have.PointA, have.PointB, lookup.date
  from have
       join dates lookup
         on lookup.date NE have.date
  except
  select * from have
  ;
quit;
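For comparison, here is a sketch along the lines of the NOT IN idea from the question; it is not part of the answer above, and the table name missing_dates is just illustrative. It builds every PointA/PointB/date combination first, then keeps only the dates that do not already appear for that pair.
proc sql;
  create table missing_dates as
  select p.PointA, p.PointB, d.date
  from (select distinct PointA, PointB from have) as p
       cross join dates as d
  /* correlated subquery: drop dates already present for this PointA/PointB pair */
  where d.date not in (select date from have
                       where have.PointA = p.PointA
                         and have.PointB = p.PointB)
  order by p.PointA, p.PointB, d.date;
quit;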

Flattening a file while preserving duplicate contents

I want to flatten a file to consolidate the variable contents for any occurrence of an ID into one record. Consider the example below...
I have:
ID Date Color Letter
1012 01/23 Red X
1012 10/17 Blu F
1012 07/28 Red N
1012 04/09 Ylw G
1392 04/12 Ylw P
1392 03/11 Blu A
1001 03/11 Blu E
I want:
ID    Date1  Date2  Date3  Date4  Clr1  Clr2  Clr3  Clr4  Ltr1  Ltr2  Ltr3  Ltr4
1012  01/23  10/17  07/28  04/09  Red   Blu   Red   Ylw   X     F     N     G
1392  04/12  03/11  .      .      Ylw   Blu               P     A
1001  03/11  .      .      .      Blu                     E
What is an efficient way to do this?
This works well if you have 100 or fewer obs per group (ID). It transposes both character and numeric variables at the same time. If you want to preserve the original order of ID, you can add the PROC statement option ORDER=DATA (see the sketch after the code below).
data tall;
input (ID Date Color Letter)($);
cards;
1012 01/23 Red X
1012 10/17 Blu F
1012 07/28 Red N
1012 04/09 Ylw G
1392 04/12 Ylw P
1392 03/11 Blu A
1001 03/11 Blu E
;;;;
run;
proc sql noprint;
select max(obs) into :obs
from (select count(*) as obs from tall group by id);
quit;
%put NOTE: &=obs;
proc summary data=tall nway;
class id;
output out=wide(drop=_: id_:) idgroup(out[&obs](_all_)=);
run;
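For reference, the ORDER=DATA variant mentioned above only changes the PROC statement; the IDGROUP output request stays the same:
proc summary data=tall nway order=data;
  /* class levels are kept in the order they first appear in TALL */
  class id;
  output out=wide(drop=_: id_:) idgroup(out[&obs](_all_)=);
run;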
I currently do this by transposing every variable (in a macro when there are more than a few), then merging the resulting datasets (that just contain IDs and the transposed variable of choice) together.
Transposing:
%macro flattener(minids= , fix= , trnvar= );
proc transpose data=have out=&minids prefix=&fix;
by ID;
var &trnvar;
run;
%mend flattener;
%flattener(minids=datDS, fix=Date, trnvar=Date );
%flattener(minids=clrDS, fix=Clr , trnvar=Color );
%flattener(minids=ltrDS, fix=Ltr , trnvar=Letter);
Merging the result:
data ostudentflat;
merge datDS (drop=_NAME_ _LABEL_)
clrDS (drop=_NAME_ _LABEL_)
ltrDS (drop=_NAME_ _LABEL_);
by ID;
run;
I feel like there has to be an easier and faster way to do this, but it gets the job done.

PROC SUMMARY using multiple fields as a key

Is there a way to replicate the behaviour below using PROC SUMMARY without having to create the dummy variable UniqueKey?
DATA table1;
input record_desc $ record_id $ response;
informat record_desc $4. record_id $1. response best32.;
format record_desc $4. record_id $1. response best32.;
DATALINES;
A001 1 300
A001 1 150
A002 1 200
A003 1 150
A003 1 99
A003 2 250
A003 2 450
A003 2 250
A004 1 900
A005 1 100
;
RUN;
DATA table2;
SET table1;
UniqueKey = record_desc || record_id;
RUN;
PROC SUMMARY data = table2 NWAY MISSING;
class UniqueKey record_desc record_id;
var response;
output out=table3(drop = _FREQ_ _TYPE_) sum=;
RUN;
You can summarise by record_desc and record_id (see class statement below) without creating a concatenation of the two columns. What made you think you couldn't?
PROC SUMMARY data = table1 NWAY MISSING;
class record_desc record_id;
var response;
output out=table4(drop = _FREQ_ _TYPE_) sum=;
RUN;
proc compare
base=table3
compare=table4;
run;