SAS Find Top Combinations in Dataset - sas

Hello everyone,
I have some sales data which looks like this:
data have;
input order_id item $;
cards;
1 A
1 B
2 A
2 C
3 B
4 A
4 B
;
run;
What I'm trying to find out is which combinations of items are ordered together most often. For example, in the data above there were 2 orders containing items A and B, 1 order containing A and C, and 1 order containing only B. What would be the best way to output the different combinations along with the number of orders placed for each?

Since the order of items within an order doesn't seem to matter (A,B is the same as B,A), you could sort the items and concatenate them per order:
proc sort data=have;
by order_id item;
run;
data temp;
set have;
by order_id;
length comb $40; *long enough to hold several concatenated item values;
retain comb;
comb=cats(comb,item);
if last.order_id then do;
output;
call missing(comb);
end;
run;
proc freq data=temp;
table comb/norow nopercent nocol nocum;
run;

There are many possible approaches to this problem, and I would not presume to say which is the best. Here's a fairly simple method you could use:
Transpose your data so that you only have 1 row for each order, with an indicator variable for each product.
Feed the transposed dataset into proc corr to produce a correlation matrix for the indicator variables, and look for the strongest correlations.
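The two steps above could be sketched roughly as follows. This is only an illustration, assuming the HAVE dataset from the question and that no item appears twice within the same order (a repeated item would make the ID statement in proc transpose fail). The dataset names have_sorted, flags, wide and wide01, the variable flag, and the item_ prefix are made up for this sketch.

```sas
* Sort so that BY-group processing works in proc transpose;
proc sort data=have out=have_sorted;
  by order_id item;
run;

* Add a constant 1 that becomes the indicator value after transposing;
data flags;
  set have_sorted;
  flag = 1;
run;

* One row per order, one column per item (item_A, item_B, ...);
proc transpose data=flags out=wide(drop=_name_) prefix=item_;
  by order_id;
  id item;
  var flag;
run;

* Items that were not in an order come out missing, so set them to 0;
data wide01;
  set wide;
  array f(*) item_:;
  do i = 1 to dim(f);
    if missing(f(i)) then f(i) = 0;
  end;
  drop i;
run;

* Correlation matrix of the indicator variables;
proc corr data=wide01 nosimple noprob;
  var item_:;
run;
```

Pairs of items with a strongly positive correlation tend to be ordered together. Note that this only looks at pairs, not at larger combinations.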

Related

Proc transpose in SAS with multiple observations in var variable

I have a dataset that I want to transpose from long to wide. I have:
ID Question Answer
1 Referral to a
1 Referral to b
1 Referral to d
2 Referral to a
2 Referral to c
4 Referral to a
6 Referral to a
6 Referral to c
6 Referral to d
What I want the transposed dataset to look like:
ID Referral to
1 a, b, d
2 a, c
4 a
6 a, c, d
I've tried to transpose the data, but the resulting dataset only contains 1 of the responses from the answer column, not all of them.
Code I've been using:
proc transpose data=test out=test2 let;
by ID;
id Question;
var Answer; run;
The dataset has hundreds of thousands of rows with dozens of variables that are structured exactly like the 'Referral to' example. How can I make it so the transposed wide dataset contains all of the answers to a Question in the same cell and not just one? Any help would be appreciated.
Thank you.
Here are two methods you can use in this case.
The first uses a data step approach, which is a single step. The second is more dynamic and uses a PROC TRANSPOSE + CATX() after the fact to create the combined variable. Note the use of PREFIX option in the transpose procedure to make the variables easier to identify and concatenate.
*create sample data for demonstration;
data have;
*the values below are separated by blanks, so the default delimiter is used;
input OrgID Product $ States $;
cards;
1 football DC
1 football VA
1 football MD
2 football CA
3 football NV
3 football CA
;
run;
*Sort - required for both options;
proc sort data=have;
by orgID;
run;
**********************************************************************;
*Use RETAIN and BY group processing to combine the information;
**********************************************************************;
data want_option1;
set have;
by orgID;
length combined $100.;
retain combined;
if first.orgID then
combined=states;
else
combined=catx(', ', combined, states);
if last.orgID then
output;
run;
**********************************************************************;
*Transpose it to a wide format and then combine into a single field;
**********************************************************************;
proc transpose data=have out=wide prefix=state_;
by orgID;
var states;
run;
data want_option2;
set wide;
length combined $100.;
combined=catx(', ', of state_:);
run;

Select many columns and other non-continuous columns to find duplicate?

I have a dataset with many columns like this:
ID Indicator Name C1 C2 C3....C90
A 0001 Black 0 1 1.....0
B 0001 Blue 1 0 0.....1
B 0002 Blue 1 0 0.....1
Some of the IDs are duplicates because the indicator is different, but they're essentially the same record. To find duplicates, I want to select distinct ID, Name and then C1 through C90 to check because some claims who have the same Id and indicator have different C1...C90 values.
Is there a way to select c1...c90 either through proc sql or a sas data step? It seems the only way I can think of is to set the dataset and then drop the non essential columns, but in the actual dataset, it's not only Indicator but at least 15 other columns.
It would be nice if PROC SQL used the : variable name wildcard like other Procs do. When no other alternative is reasonable, I usually use a macro to select bulk columns. This might work for you:
%macro sel_C(n);
%do i=1 %to %eval(&n.-1);
C&i.,
%end;
C&n.
%mend sel_C;
proc sql;
select ID,
Indicator,
Name,
%sel_C(90)
from have_data;
quit;
If I understand the question properly, the easiest way would be to concatenate the columns to one. RETAIN that value from row to row, and you can compare it across rows to see if it's the same or not.
data want;
set have;
by id indicator;
length cols last_cols $500; *same length for both so the comparison is not affected by truncation;
retain last_cols;
cols = catx('|',of c1-c90);
if first.id then call missing(last_cols);
else do;
identical = (cols = last_cols); *or whatever check you need to perform;
end;
output;
last_cols = cols;
run;
There are a few different ways you can do this and it will be much easier if the actual column names are C1 - C90. If you're just looking to remove anything that you know is a duplicate you can use proc sort.
proc sort data=dups out=nodups nodupkey;
by ID Name C1-C90;
run;
The nodupkey option keeps only the first observation for each unique combination of the variables in the by statement and deletes the rest.
Alternatively, if you want to know which records contain duplicates, you could use proc summary.
proc summary data=dups nway missing;
class ID Name C1-C90;
output out=onlydups(where=(_freq_ > 1));
run;
proc summary creates two new variables, _type_ and _freq_. If you specify _freq_ > 1 you will only output the duplicate records. Also, note that this will remove the Indicator variable.

How do I stack many frequency tables in SAS

I have a dataset of about 800 observations. I want to get the frequency of 14 variables. I want to get the frequency of these variable by shape (an example). There are 3 different shapes.
An example of doing this one time would obviously be:
proc freq; tables color; by shape;run;
However, I do not want 42 frequency tables. I want one frequency table that has the list of 14 variables on the left side. The top heading will have shape1 shape2 shape3 with the frequencies of each variable underneath them.
It would look like I transposed the data sets by percentage and then stacked them on top of each other.
I have several sets of combinations where I need to do this. I have about 5 different groups of variables and I need to make tables using 3 different by groups (necessitating about 15 tables). The first example I discussed is one example of such groups.
Any help would be appreciated!
You can do this with proc means and proc transpose. Here is an example using sashelp.class; you can add more categories as needed.
proc means data=sashelp.class nway n;
class sex age;
output out=class(drop=_freq_ _type_) n=freq;
run;
proc transpose data=class out=class(drop=_name_) prefix=AGE;
by sex;
var freq;
id age;
run;
data class_sum;
set class;
array a(*) age:;
age_sum = sum(of a(*)); *sum over the array so that age_sum itself can never be part of the list;
do i = 1 to dim(a);
a(i) = a(i) / age_sum;
end;
drop i;
run;
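As a different route to the layout described in the question (several variables stacked on the left, the grouping variable across the top), proc tabulate can produce it in one step. This is a sketch of a swapped-in technique, not the method used in the answer above. It uses sashelp.cars as stand-in data, with origin playing the role of shape and type and drivetrain standing in for the 14 variables.

```sas
proc tabulate data=sashelp.cars;
  * Every variable used in the table must be listed as a class variable;
  class origin type drivetrain;
  * Rows: the stacked variables. Columns: count and column percent within each origin;
  table type drivetrain, origin*(n colpctn);
run;
```

Adding more variables to the class statement and to the row dimension of the table statement stacks further blocks underneath.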

Frequency of a value across multiple variables?

I have a data set of patient information where I want to count how many patients (observations) have a given diagnostic code. I have 9 possible variables where it can be, in diag1, diag2... diag9. The code is V271. I cannot figure out how to do this with the "WHERE" clause or proc freq.
Any help would be appreciated!!
The basic strategy is to create a dataset that is not at the patient level, but has one observation per patient-diagnostic code (so up to 9 observations per patient). Something like this:
data want;
set have;
array diag[9];
do _i = 1 to dim(diag);
if not missing(diag[_i]) then do;
diagnosis_Code = diag[_i];
output;
end;
end;
keep diagnosis_code patient_id; *add any other variables you want to keep;
run;
You could then run a proc freq on the resulting dataset. You could also change the criteria from not missing to if diag[_i] = 'V271' then do; to get only V271s in the data.
An alternate way to reshape the data, which gives the same result as the data step method above, is to use proc transpose like so.
proc transpose data=have out=want(keep=patient_id col1
rename=(col1=diag)
where=(diag is not missing));
by patient_id;
var diag1-diag9;
run;
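If the goal is only to count patients with a particular code, as in the original question, the data can also stay at the patient level. A minimal sketch, assuming HAVE has one row per patient and that diag1-diag9 are character variables. The names flagged and has_v271 are made up for this example.

```sas
data flagged;
  set have;
  array diag[9] diag1-diag9;
  * Flag is 1 if any of the nine diagnosis variables equals V271, else 0;
  has_v271 = 0;
  do _i = 1 to dim(diag);
    if diag[_i] = 'V271' then has_v271 = 1;
  end;
  drop _i;
run;

* The row with has_v271 = 1 gives the number of patients with the code;
proc freq data=flagged;
  tables has_v271;
run;
```

Because each patient is flagged at most once, a patient who has the code in more than one diag variable is still counted only once.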

calculate total combinations of observations in SAS

I have a variable that takes a handful of values.
Example: Var1 A,B,C,D,E,F,G,H
How can I find all of the possible 2-letter combinations? E.g. AB, AC, AD, etc.
The list I have mentioned here is small, but in general I have a huge list and need to find all of the two-letter combinations that are possible with the values present for the variable. Thanks
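A Cartesian join pairs every row with every row, so a self join here will give you all possibilities. Note that this returns every ordered pair, including each value paired with itself. I usually use Proc SQL:
Proc sql;
create table cartesian1 as
select a.var1 as first_value, b.var1 as second_value
from table1 as a, table1 as b;
Quit;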
Does this give you the table you want? I'm assuming that you want all 2 letter combinations rather than permutations (i.e. order is not relevant).
data have;
input var1 $;
datalines;
A
B
C
D
E
F
G
H
;
run;
data want;
set have nobs=nobs;
length two_way $2;
do i=_n_+1 to nobs;
set have (rename=(var1=temp)) point=i;
two_way=cats(var1,temp);
keep two_way;
output;
end;
run;
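If only the total number of two-letter combinations is needed, rather than the list itself, it can be computed directly from the number of values. A small sketch, assuming the HAVE dataset above contains each value exactly once. The variable names n and total_pairs are made up for this example.

```sas
data _null_;
  * NOBS= is filled in at compile time, so the SET statement never has to execute;
  if 0 then set have nobs=n;
  * Number of ways to choose 2 values out of n, ignoring order;
  total_pairs = comb(n, 2);
  put total_pairs=;
  stop;
run;
```

For the 8 values in the example this writes total_pairs=28 to the log, which matches the number of rows in the WANT dataset created above.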