Row-wise operation for subset of columns - sas

I have the following data:
data df;
input id $ d1 d2 d3;
datalines;
a . 2 3
b . . .
c 1 . 3
d . . .
;
run;
I want to apply some transformation/operation across a subset of columns. In this case, that means dropping all rows where columns prefixed with d are all missing/null.
Here's one way I accomplished this, taking heavy influence from this SO post.
First, sum all numeric columns, row-wise.
data df_total;
set df;
total = sum(of _numeric_);
run;
Next, drop all rows where total is missing/null.
data df_final;
set df_total;
where total is not missing;
run;
Which gives me the output I wanted:
a . 2 3
c 1 . 3
My issue, however, is that this approach assumes that there's only one "primary-key" column (id, in this case) and everything else is numeric and should be considered as a part of this sum(of _numeric_) is not missing logic.
In reality, I have a diverse array of other columns in the original dataset, df, and it's not feasible to simply drop all of them, writing all of that out. I know the columns for which I want to run this "test" all are prefixed with d (and more specifically, match the pattern d<mm><dd>).
How can I extend this approach to a particular subset of columns?

Use a different short cut reference, since you know it all starts with D,
total = sum( of D:);
if n(of D:) = 0 then delete;
Which will add variables that are numeric and start with D. If you have variables you want to exclude that start with D, that's problematic.
Since it's numeric, you can also use the N() function instead, which counts the non missing values in the row. In general though, SAS will do this automatically for most PROCS such as REG/GLM(not in a data step obviously).
If that doesn't work for some reason you can query the list of variables from the sashelp table.
proc sql noprint;
select name into :var_list separated by ", " from sashelp.vcolumn
where libname='WORK' and memname='DF' and name like 'D%';
quit;
data df;
set have;
if n(&var_list.)=0 then delete;
run;

Related

Collapsing a large dataset while conditionally preserving some missing values

Dataset HAVE includes id values and a character variable of names. Values in names are usually missing. If names is missing for all values of an id EXCEPT one, the obs for IDs with missing values in names can be deleted. If names is completely missing for all id of a certain value (like id = 2 or 5 below), one record for this id value must be preserved.
In other words, I need to turn HAVE:
id names
1
1
1 Matt, Lisa, Dan
1
2
2
2
3
3
3 Emily, Nate
3
4
4
4 Bob
5
into WANT:
id names
1 Matt, Lisa, Dan
2
3 Emily, Nate
4 Bob
5
I currently do this by deleting all records where names is missing, then merging the results onto a new dataset KEY with one variable id that contains all original values (1, 2, 3, 4, and 5):
data WANT_pre;
set HAVE;
if names = " " then delete;
run;
data WANT;
merge KEY
WANT_pre;
by id;
run;
This is ideal for HAVE because I know that id is a set of numeric values ranging from 1 to 5. But I am less sure how I could do this efficiently (A) on a much larger file, and (B) if if I couldn't simply create an id KEY dataset by counting from 1 to n. If your HAVE had a few million observations and your id values were more complex (e.g., hexadecimal values like XR4GN), how would you produce WANT?
You can use SQL here easily, MAX() applies to character variables within SQL.
proc sql;
create table want as
select id, max(names) as names
from have
group by ID;
quit;
Another option is to use an UPDATE statement instead.
data want;
update have (obs=0) have;
by ID;
run;
This seems like a good candidate for a DOW-loop, assuming that your dataset is sorted by id:
data want;
do until(last.id);
set have;
by id;
length t_names $50; /*Set this to at least the same length as names unless you want the default length of 200 from coalescec*/
t_names = coalescec(t_names,names);
end;
names = t_names;
drop t_names;
run;
proc summary data=have nway missing;
class id;
output out=want(drop=_:) idgroup(max(names) out(names)=);
run;
Use the UPDATE statement. That will ignore the missing values and keep the last non-missing value. It normally requires a master and transaction dataset, but you can use your single dataset for both.
data want;
update have(obs=0) have ;
by id;
run;

SAS Find Top Combinations in Dataset

Hell everyone --
I have some sales data which looks like this:
data have;
input order_id item $;
cards;
1 A
1 B
2 A
2 C
3 B
4 A
4 B
;
run;
What I'm trying to find out is what are the most popular combinations of items ordered. For example in the above case, there were 2 orders that contained items A&B, 1 order of A&C, and 1 order of B. What would be the best way to output the different combinations along with the numbers of orders placed?
It seems there is no permutation issue, you could try this:
proc sort data=have;
by order_id item;
run;
data temp;
set have;
by order_id;
retain comb;
length comb $4;
comb=cats(comb,item);
if last.order_id then do;
output;
call missing(comb);
end;
run;
proc freq data=temp;
table comb/norow nopercent nocol nocum;
run;
There are many possible approaches to this problem, and I would not presume to say which is the best. Here's a fairly simple method you could use:
Transpose your data so that you only have 1 row for each order, with an indicator variable for each product.
Feed the transposed dataset into proc corr to produce a correlation matrix for the indicator variables, and look for the strongest correlations.

PROC SQL - Counting distinct values across variables

Looking for ways of counting distinct entries across multiple columns / variables with PROC SQL, all I am coming across is how to count combinations of values.
However, I would like to search through 2 (character) columns (within rows that meet a certain condition) and count the number of distinct values that appear in any of the two.
Consider a dataset that looks like this:
DATA have;
INPUT A_ID C C_ID1 $ C_ID2 $;
DATALINES;
1 1 abc .
2 0 . .
3 1 efg abc
4 0 . .
5 1 abc kli
6 1 hij .
;
RUN;
I now want to have a table containing the count of the nr. of unique values within C_ID1 and C_ID2 in rows where C = 1.
The result should be 4 (abc, efg, hij, kli):
nr_distinct_C_IDs
4
So far, I only have been able to process one column (C_ID1):
PROC SQL;
CREATE TABLE try AS
SELECT
COUNT (DISTINCT
(CASE WHEN C=1 THEN C_ID1 ELSE ' ' END)) AS nr_distinct_C_IDs
FROM have;
QUIT;
(Note that I use CASE processing instead of a WHERE clause since my actual PROC SQL also processes other cases within the same query).
This gives me:
nr_distinct_C_IDs
3
How can I extend this to two variables (C_ID1 and C_ID2 in my example)?
It is hard to extend this to two or more variables with your method. Try to stack variables first, then count distinct value. Like this:
proc sql;
create table want as
select count(ID) as nr_distinct_C_IDs from
(select C_ID1 as ID from have
union
select C_ID2 as ID from have)
where not missing(ID);
quit;
I think in this case a data step may be a better fit if your priority is to come up with something that extends easily to a large number of variables. E.g.
data _null_;
length ID $3;
declare hash h();
rc = h.definekey('ID');
rc = h.definedone();
array IDs $ C_ID1-C_ID2;
do until(eof);
set have(where = (C = 1)) end = eof;
do i = 1 to dim(IDs);
if not(missing(IDs[i])) then do;
ID = IDs[i];
rc = h.add();
if rc = 0 then COUNT + 1;
end;
end;
end;
put "Total distinct values found: " COUNT;
run;
All that needs to be done here to accommodate a further variable is to add it to the array.
N.B. as this uses a hash object, you will need sufficient memory to hold all of the distinct values you expect to find. On the other hand, it only reads the input dataset once, with no sorting required, so it might be faster than SQL approaches that require multiple internal reads and sorts.

proc tabulate output not in alphabetical order

original output
Count
AAB BB
01NOV2014 5 4
02NOV2014 4 3
But ideal output is
Count
BB AAB
01NOV2014 4 5
02NOV2014 4 4
Is there a way to change a n by k tables from proc tabulate to list it as requested?
Since k is not small, I'm looking for an efficient way to achieve this. Maybe store the requested order in a macro variable?
The easiest answer depends on how the order is derived.
You have some ordering options on the class variable, such as order=data, which may give you the desired result if the data is stored in that order. This can be tricky, but sometimes is a simple method to get to that result.
Second, you have a couple of options related to formats.
If the data can be stored as a formatted numeric, where BB=1, AAB=2, etc., then use order=unformatted to achieve that.
Create a format that lists the values in order, just formatting them to themselves, with notsorted in the options of the value statement, and then use order=data on the class statement and preloadfmt.
Example of the second option:
data have;
input var $ count;
datalines;
AAA 1
AAB 2
BBA 3
BBB 4
;;;;
run;
proc format;
value $myformatf (notsorted)
BBB=BBB
AAB=AAB
BBA=BBA
AAA=AAA
other=' ';
quit;
proc tabulate data=have;
class var/order=data preloadfmt;
format var $myformatf.;
var count;
tables var,count*sum;
run;

calculate total combinations of observations in SAS

I have a variable with few values for that.
Example: Var1 A,B,C,D,E,F,G,H
How can i find the total 2 letter combinations possible? eg: AB, AC, AD etc.
Here the list i have mentioned is small but in general I have a huge list and need to find total two letter combinations possible with all the values present for the variable. Thanks
A Cartesian join will give you every combination against every combination, so a self join here will give you all possibilities. I usually use Proc SQL:
Proc sql;
create table cartesian1 as
select * from table1,table1;
Quit;
Does this give you the table you want? I'm assuming that you want all 2 letter combinations rather than permutations (i.e. order is not relevant).
data have;
input var1 $;
datalines;
A
B
C
D
E
F
G
H
;
run;
data want;
set have nobs=nobs;
length two_way $2;
do i=_n_+1 to nobs;
set have (rename=(var1=temp)) point=i;
two_way=cats(var1,temp);
keep two_way;
output;
end;
run;