PROC SQL - Counting distinct values across variables


I am looking for a way to count distinct entries across multiple columns / variables with PROC SQL, but all I come across is how to count combinations of values.
However, I would like to search through 2 (character) columns (within rows that meet a certain condition) and count the number of distinct values that appear in either of the two.
Consider a dataset that looks like this:
DATA have;
INPUT A_ID C C_ID1 $ C_ID2 $;
DATALINES;
1 1 abc .
2 0 . .
3 1 efg abc
4 0 . .
5 1 abc kli
6 1 hij .
;
RUN;
I now want a table containing the number of unique values across C_ID1 and C_ID2 in rows where C = 1.
The result should be 4 (abc, efg, hij, kli):
nr_distinct_C_IDs
4
So far, I only have been able to process one column (C_ID1):
PROC SQL;
CREATE TABLE try AS
SELECT
COUNT (DISTINCT
(CASE WHEN C=1 THEN C_ID1 ELSE ' ' END)) AS nr_distinct_C_IDs
FROM have;
QUIT;
(Note that I use CASE processing instead of a WHERE clause since my actual PROC SQL also processes other cases within the same query).
This gives me:
nr_distinct_C_IDs
3
How can I extend this to two variables (C_ID1 and C_ID2 in my example)?

It is hard to extend your method to two or more variables. Stack the variables first, then count the distinct values. UNION removes duplicates across the two columns, and the WHERE clause in each branch keeps only the rows with C = 1. Like this:
proc sql;
  create table want as
  select count(ID) as nr_distinct_C_IDs from
    (select C_ID1 as ID from have where C = 1
     union
     select C_ID2 as ID from have where C = 1)
  where not missing(ID);
quit;
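A variant of the same stacking idea, offered as a sketch rather than the only way to write it: COUNT(DISTINCT ...) skips missing values on its own, so the outer WHERE can be dropped, and UNION ALL leaves the de-duplication to the count. The filter on C = 1 comes from the question's requirement.

```sas
proc sql;
  create table want as
  /* count(distinct ...) ignores missing IDs and collapses repeats */
  select count(distinct ID) as nr_distinct_C_IDs
  from (select C_ID1 as ID from have where C = 1
        union all
        select C_ID2 as ID from have where C = 1);
quit;
```

On the sample data this counts abc, efg, hij and kli. Adding a third ID column means adding one more UNION ALL branch.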

I think in this case a data step may be a better fit if your priority is to come up with something that extends easily to a large number of variables. E.g.
data _null_;
length ID $3;
declare hash h();
rc = h.definekey('ID');
rc = h.definedone();
array IDs $ C_ID1-C_ID2;
do until(eof);
set have(where = (C = 1)) end = eof;
do i = 1 to dim(IDs);
if not(missing(IDs[i])) then do;
ID = IDs[i];
rc = h.add();
if rc = 0 then COUNT + 1;
end;
end;
end;
put "Total distinct values found: " COUNT;
run;
All that needs to be done here to accommodate a further variable is to add it to the array.
N.B. as this uses a hash object, you will need sufficient memory to hold all of the distinct values you expect to find. On the other hand, it only reads the input dataset once, with no sorting required, so it might be faster than SQL approaches that require multiple internal reads and sorts.

Related

Row-wise operation for subset of columns

I have the following data:
data df;
input id $ d1 d2 d3;
datalines;
a . 2 3
b . . .
c 1 . 3
d . . .
;
run;
I want to apply some transformation/operation across a subset of columns. In this case, that means dropping all rows where columns prefixed with d are all missing/null.
Here's one way I accomplished this, taking heavy influence from this SO post.
First, sum all numeric columns, row-wise.
data df_total;
set df;
total = sum(of _numeric_);
run;
Next, drop all rows where total is missing/null.
data df_final;
set df_total;
where total is not missing;
run;
Which gives me the output I wanted:
a . 2 3
c 1 . 3
My issue, however, is that this approach assumes that there's only one "primary-key" column (id, in this case) and everything else is numeric and should be considered as a part of this sum(of _numeric_) is not missing logic.
In reality, I have a diverse array of other columns in the original dataset, df, and it's not feasible to simply drop all of them, writing all of that out. I know the columns for which I want to run this "test" all are prefixed with d (and more specifically, match the pattern d<mm><dd>).
How can I extend this approach to a particular subset of columns?

Use a different variable-list shortcut. Since you know the names all start with D:
total = sum( of D:);
if n(of D:) = 0 then delete;
The D: prefix list picks up every variable whose name starts with D, so this works as long as all of them are numeric. If you have variables you want to exclude that also start with D, that's problematic.
Since the variables are numeric, you don't need the sum at all: the N() function counts the non-missing values in the row. In general, though, SAS handles missing values automatically in most PROCs such as REG/GLM (not in a data step, obviously).
If that doesn't work for some reason, you can query the list of variables from sashelp.vcolumn. Note that libname and memname are stored in upper case, while name keeps the case the variable was created with, hence the upcase():
proc sql noprint;
  select name into :var_list separated by ", " from sashelp.vcolumn
  where libname='WORK' and memname='DF' and upcase(name) like 'D%' and type='num';
quit;
data df_final;
  set df;
  if n(&var_list.)=0 then delete;
run;
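Put together as one complete step on the question's df, the prefix-list shortcut might look like the sketch below. The output name df_final is borrowed from the question, and the step assumes every variable starting with d is numeric.

```sas
data df_final;
  set df;
  /* n() counts the non-missing numeric arguments; d: expands to d1 d2 d3 */
  if n(of d:) = 0 then delete;
run;
```

With the sample data this keeps the rows for id a and c and drops b and d, without creating a helper total column.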

SAS: Grouping by ID and summing the number of a condition in a variable for the ID

I have a dataset that contains the ID and a variable called CC. The CC holds multiple numbered values where each value represents something. It looks like this:
An ID can have the same CC in multiple rows. I just want to flag whether the CC exists or not, so even if Joe had five rows stating that he has CC equal to 3, I just want a 1 or 0 stating whether Joe ever had a CC equal to 3.
I want it to look like this:
I tried coding it as shown below, but the issue is that, although I know an ID can have more than one type of CC, the final dataset created by the code only shows 1 CC filled for each ID. I think maybe it's overwriting it?
Also, I should note that prior to this code I created the CC flag variables and filled them all with zeros.
proc sql;
DROP TABLE Flagged_CCs;
CREATE TABLE Flagged_CCs AS
select
ID,
COUNT(ID) as count_ID,
case when CC=1 then 1 end as CC_1,
case when CC=2 then 1 end as CC_2,
case when CC=3 then 1 end as CC_3
from Original_Dataset
group by ID;
quit;
Any help is appreciated, thank you.

Is your issue the fact that after running your new code you still get multiple lines per ID?
If so I propose this:
proc sql;
DROP TABLE Flagged_CCs;
CREATE TABLE Flagged_CCs AS
select ID
,case when CC_1 >0 then 1 else 0 end as CC_1
,case when CC_2 >0 then 1 else 0 end as CC_2
,case when CC_3 >0 then 1 else 0 end as CC_3
from (
select
ID,
COUNT(ID) as count_ID,
sum(case when CC=1 then 1 end) as CC_1,
sum(case when CC=2 then 1 end) as CC_2,
sum(case when CC=3 then 1 end) as CC_3
from Original_Dataset
group by ID
);
quit;
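The reason you are having the issue is that you aggregate only the count of ID and not the CC flags. Without an aggregate function on them, PROC SQL remerges the summary back onto every original row; wrapping each flag in an aggregate collapses the output to one row per ID.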
Hope this helps
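A more compact way to get the same flags, as a sketch under the assumption that CC is numeric: in SAS a comparison such as CC=1 evaluates to 1 or 0, so taking the MAX of it within each ID yields the "ever had it" indicator directly. Table and column names are the ones from the question.

```sas
proc sql;
  create table Flagged_CCs as
  select ID,
         count(ID)   as count_ID,
         max(CC = 1) as CC_1,  /* 1 if any row for this ID has CC = 1, else 0 */
         max(CC = 2) as CC_2,
         max(CC = 3) as CC_3
  from Original_Dataset
  group by ID;
quit;
```

Because every selected column is either the GROUP BY key or an aggregate, the result has one row per ID and no subquery is needed.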

If you're looking for a report here's one method, using PROC TABULATE.
proc format;
value indicator_fmt
low - 0, . = 0
0 <- high = 1;
run;
proc tabulate data=have;
class id cc;
table id , cc*N=''*f=indicator_fmt.;
run;
Your output will look like this then:
If you want a fully dynamic approach in a table, where you don't need to know anything ahead of time (such as the number of CCs), this is a different approach. It's a bit longer, but the dynamic part makes it possibly worthwhile to implement.

How can I eliminate subset combinations of binary indicators?

I have a data set where each observation is a combination of binary indicator variables, but not necessarily all possible combinations. I'd like to eliminate observations that are subsets of other observations. As an example, suppose I had these three observations:
var1 var2 var3 var4
0 0 1 1
1 0 0 1
0 1 1 1
In this case, I would want to eliminate observation 1, because it's a subset of observation 3. Observation 2 isn't a subset of anything else, so my output data set should contain observations 2 and 3.
Is there an elegant and preferably fast way to do this in SAS? My best solution thus far is a brute force loop through the data set using a second set statement with the point option to see if the current observation is a subset of any others, but these data sets could become huge once I start working with a lot of variables, so I'm hoping to find a better way.

First off, one consideration: is it possible for one row to have 1 for all indicators? You should check for that first: if one row does have all 1s, every other row is a subset of it, so it will always be the unique solution.
The POINT= option is inefficient, but loading into a hash table isn't a terribly bad way to do it. Just load up a hash table with a string of the binary indicators CATted together, and then search that table.
First, use PROC SORT NODUPKEY to eliminate the exact matches. Unless you have a very large number of indicator variables, this will eliminate many rows.
Then, sort it in an order where the more "complicated" rows are at the top and the less complicated at the bottom. This might be as simple as making a variable which is the sum of the binary indicators and sorting by that descending; or, if your data suggests it, sorting by a particular order of indicators (if some are more likely to be present). The purpose of this is to reduce the number of times we search; if the likely matches are on top, we will leave the loop faster.
Finally, use a hash iterator to walk the list from most to least complex, looking for a row that covers the current one.
See below for a partially-tested example. I didn't verify that it eliminated every valid elimination, but it eliminates around half of the rows, which sounds reasonable.
data have;
array vars var1-var20;
do _u = 1 to 1e4;
do _t = 1 to dim(Vars);
vars[_t] = round(ranuni(7),1);
end;
complexity = sum(of vars[*]);
indicators = cats(of vars[*]);
output;
end;
drop _:;
run;
proc sort nodupkey data=have;
by indicators;
run;
proc sort data=have;
by descending complexity;
run;
data want;
  if _n_ = 1 then do;
    format indicators $20.;
    call missing(indicators, complexity);
    /* key on complexity first so the descending iterator visits the fullest rows first */
    declare hash indic(dataset:'have', ordered:'d');
    indic.defineKey('complexity','indicators');
    indic.defineData('complexity','indicators');
    indic.defineDone();
    declare hiter inditer('indic');
  end;
  set have(drop=indicators rename=(complexity=thisrow_complex)); *assuming have has a variable, "indicators", like "0011001";
  array vars var1-var20;
  rc=inditer.first();
  rowcounter=1;
  /* only rows with strictly more 1s can be proper supersets (exact duplicates are already gone) */
  do while (rc=0 and complexity gt thisrow_complex);
    do _t = 1 to dim(vars);
      if vars[_t]=1 and char(indicators,_t) ne '1' then leave;
    end;
    if _t gt dim(vars) then delete;
    else rc=inditer.next();
    rowcounter=rowcounter+1;
  end;
run;

I'm pretty sure there is a more math-oriented way of doing this, but for now this is what I can think of. Proceed with caution, as I only checked a small number of test cases.
My pseudo-algorithm:
(Bit pattern = concatenation of all binary variables into a string.)
Get a unique list of binary (bit) patterns, sorted in ascending order (this is what the PROC SQL step is doing).
Set up two arrays: one array to track the variables (var1 to var4) on the current row and another array to track lag(of var1 to var4).
If the lagged bit pattern is a subset of the current bit pattern, then multiplying the two element by element ("dot product vector multiplication") should give back the lagged bit pattern.
If bit pattern = lagged_bit_pattern, then flag that pattern to be excluded.
In the last data step you will get the list of bit patterns you need to exclude. NOTE: this approach does not take care of duplicate patterns such as the following:
record1: 1 0 0 1
record2: 1 0 0 1
which can be easily excluded via PROC SORT & NODUPKEY.
/*sample data*/
data sample_data;
input id var1-var4;
bit_pattern = compress(catx('',var1,var2,var3,var4));
datalines;
1 0 0 1 1
2 1 0 0 1
3 0 1 1 1
4 0 0 0 1
5 1 1 1 0
;
run;
/*in the above example, 0001 0011 need to be eliminated. These will be highlighted in the last datastep*/
/*get unique combination of patterns in the dataset*/
proc sql ;
create table all_poss_patterns as
select
var1,var2, var3,var4,
count(*) as freq
from sample_data
group by var1,var2, var3,var4
order by var1,var2, var3,var4;
quit;
data patterns_to_exclude;
set all_poss_patterns;
by var1-var4;
length lagged1-lagged4 8;
array first_array{*} var1-var4;
array lagged{*}lagged1-lagged4;
length bit_pattern $32.;
length lagged_bit_pattern $32.;
bit_pattern = '';
lagged_bit_pattern='';
do i = 1 to dim(first_array);
lagged{i}=lag(first_array{i});
end;
do i = 1 to dim(first_array);
bit_pattern=cats("", bit_pattern,lagged{i}*first_array{i});
lagged_bit_pattern=cats("",lagged_bit_pattern,lagged{i});
end;
if bit_pattern=lagged_bit_pattern then exclude_pattern=1;
else exclude_pattern=0;
/*uncomment the following two lines to just keep the patterns that need to be excluded*/
/*if bit_pattern ne '....' and exclude_pattern=1;*/ /*note the bit_pattern ne '....' the no. of dots should equal no. of binary vars*/
/*keep bit_pattern;*/
run;
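The subset test behind both answers is a bitwise AND: row A is a subset of row B when A AND B gives back A. For small data, a brute-force cross-check can be sketched with the BINARYw. informat and the BAND function. The dataset name flags and the variable bits are made up for this illustration, and the self-join compares every pair of rows, so it is meant for validating results rather than for huge tables.

```sas
data flags;
  input var1-var4;
  /* pack the four 0/1 flags into one integer, var1 as the high bit */
  bits = input(cats(of var1-var4), binary4.);
  datalines;
0 0 1 1
1 0 0 1
0 1 1 1
;
run;

proc sql;
  create table want_check as
  select distinct a.var1, a.var2, a.var3, a.var4
  from flags as a
  /* keep a row only if no different row contains all of its 1s */
  where not exists
    (select * from flags as b
     where b.bits ne a.bits
       and band(a.bits, b.bits) = a.bits);
quit;
```

On the three observations from the question this drops 0 0 1 1 (covered by 0 1 1 1) and keeps the other two.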

SAS data step/ proc sql insert rows from another table with auto increment primary key

I have 2 datasets as below
id name status
1 A a
2 B b
3 C c
Another dataset
name status new
C c 0
D d 1
E e 1
F f 1
How do I insert all rows from 2nd table to 1st table? The situation is that the first table is permanent. The 2nd table is updated monthly, so I would like to add all rows from the monthly updated table to the permanent table, so that it would look like this
id name status
1 A a
2 B b
3 C c
4 D d
5 E e
6 F f
The problem I'm facing is that I cannot increment the id from dataset 1. As far as I can tell, a SAS dataset does not have an auto-increment property. Auto-increment can be done in a data step, but I don't know whether a data step can be used in a case with 2 tables like this.
The usual sql would be
Insert into table1 (name, status)
select name, status from table2 where new = 1;
But a SAS dataset does not support an auto-increment column, hence the problem I'm facing.
I could solve it by using SAS data step as below after the above proc sql
data table1;
set table1;
if _n_ > 3 then id = _n_;
run;
This would increase the value of id column, but the code is kinda ugly, and also the id is a primary key, and being used as a foreign key in other table, so I don't want to mess up the ids of old rows.
I'm in the process of both learning and working with SAS so help is really appreciated. Thanks in advance.
Extra question:
If the 2nd table does not have the new column, is there any way to complete what I want (add new row from monthly table (2nd) to permanent table (1st)) with data step? Currently, I use this ugly proc sql/data step to create new column
proc sql; /* create a temp table from table2 */
create table t2temp as select t2.*,
(case when t2.name = t1.name and t2.status = t1.status then 0 else 1 end) as new
from table2 as t2
left join table1 as t1
on t2.name = t1.name and t2.status = t1.status;
drop table table2; /* drop the old table2 with no column "new" */
quit;
data table2; /* rename the t2temp as table2 */
set t2temp;
run;

You can do it in the data step. BTW, if you were creating it entirely anew, you could just use
id+1;
to create an autonumbered field (assuming your data step wasn't too complicated). The step below keeps track of the current highest ID number and assigns one higher to each row that comes from the new dataset.
data have;
input id name $ status $;
datalines;
2 A a
3 B b
1 C c
;;;;
run;
data addon;
input name $ status $ new;
datalines;
C c 0
D d 1
E e 1
F f 1
;;;;
run;
data want;
retain _maxID; *keep the value of _maxID from one row to the next,
do not reset it;
set have(in=old) addon(in=add); *in= creates a temporary variable indicating which
dataset a row came from;
if (old) or (add and new); *in SAS, as in C etc., 0/missing (null) is false,
while negative/positive numbers are true;
if add then ID = _maxID+1; *assigns ID to the new records;
_maxID = max(id,_maxID); *determines the new maximum ID -
this structure guarantees it works
even if the old DS is not sorted;
put id= name=;
drop _maxID;
run;
Response to second question:
Yes, you can still do that. One of the easiest ways is, if you have the datasets sorted by NAME:
data want;
retain _maxID;
set have(in=old) addon(in=add);
by name;
if (old) or (add and first.name);
if add then ID = _maxID+1;
_maxID = max(id,_maxID);
put id= name=;
run;
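first.name is true only for the first record with a given value of name, and within a BY group the records from HAVE come before those from ADDON; so if HAVE already has that name, the ADDON record is not first and will not be added.
This does require name to be unique in HAVE, or you might delete some records. It also assumes every existing ID has been read before a new row is numbered, which interleaving by name does not guarantee in general (a new name that sorts before an existing row with a higher ID can receive a duplicate ID). If either is not true then you need a more complicated solution.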
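One way to avoid depending on the row order is to look up the current maximum ID first and only then number the new rows. The following is a sketch using the have and addon datasets from above; the names new_rows and max_id are made up for the illustration, and a row counts as "already present" when its name/status pair exists in have.

```sas
proc sql noprint;
  /* highest ID currently in use; 0 if have is empty */
  select coalesce(max(id), 0) into :max_id from have;

  /* rows of addon whose name/status pair is not yet in have */
  create table new_rows as
  select distinct a.name, a.status
  from addon as a
  where not exists
    (select * from have as h
     where h.name = a.name and h.status = a.status);
quit;

/* number the new rows upward from the old maximum */
data new_rows;
  set new_rows;
  id = &max_id + _n_;
run;

/* append in place; existing rows and their IDs are untouched */
proc append base=have data=new_rows;
run;
```

With the sample data, D, E and F are appended with IDs 4, 5 and 6. This version does not need the new column and does not require either dataset to be sorted.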

How to create a new variable in SAS by extracting part of the value of an existing numeric variable?

I have two datasets in SAS that I would like to merge, but they have no common variables. One dataset has a "subject_id" variable, while the other has a "mom_subject_id" variable. Both of these variables are 9-digit codes that have just 3 digits in the middle of the code with common meaning, and that's what I need to match the two datasets on when I merge them.
What I'd like to do is create a new common variable in each dataset that is just the 3 digits from within the subject ID. Those 3 digits will always be in the same location within the 9-digit subject ID, so I'm wondering if there's a way to extract those 3 digits from the variable to make a new variable.
Thanks!

SQL (using the sample data from the Data Step code below):
proc sql;
create table want2 as
select a.subject_id, a.other, b.mom_subject_id, b.misc
from have1 a JOIN have2 b
on(substr(a.subject_id,4,3)=substr(b.mom_subject_id,4,3));
quit;
Data Step:
data have1;
length subject_id $9;
input subject_id $ other $;
datalines;
abc001def other1
abc002def other2
abc003def other3
abc004def other4
abc005def other5
;
data have2;
length mom_subject_id $9;
input mom_subject_id $ misc $;
datalines;
ghi001jkl misc1
ghi003jkl misc3
ghi005jkl misc5
;
data have1;
length id $3;
set have1;
id=substr(subject_id,4,3);
run;
data have2;
length id $3;
set have2;
id=substr(mom_subject_id,4,3);
run;
Proc sort data=have1;
by id;
run;
Proc sort data=have2;
by id;
run;
data work.want;
merge have1(in=a) have2(in=b);
by id;
run;

An alternative would be to use proc sql with a join on substr(), just as explained above, if you are comfortable with SQL.

Assuming that your "subject_id" variable is a number, the substr function won't work directly: SAS will implicitly convert the number to a string, and by default that conversion pads some spaces on the left of the number, so the character positions shift.
You can use the modulus function mod(input, base), which returns the remainder when input is divided by base.
/*First get rid of the last 3 digits*/
temp_var = floor( subject_id / 1000);
/* then get the next three digits that we want*/
id = mod(temp_var ,1000);
Or in one line:
id = mod(floor(subject_id / 1000), 1000);
Then you can continue with sorting the new data sets by id and then merging.
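A small check of the arithmetic, as a sketch with a made-up 9-digit value. It also shows a character alternative: converting explicitly with put() and the Z9. format gives a 9-character string with no leading spaces, so substr() positions line up again.

```sas
data _null_;
  subject_id = 123456789;                        /* hypothetical example value */
  id_num = mod(floor(subject_id / 1000), 1000);  /* middle three digits as a number */
  id_chr = substr(put(subject_id, z9.), 4, 3);   /* middle three digits as text */
  put id_num= id_chr=;                           /* log shows id_num=456 id_chr=456 */
run;
```

The numeric version drops leading zeros (001 becomes 1), which is fine for matching as long as both datasets derive the key the same way.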
