check all entries in a column sas and return a dummy variable sas - sas

I am new in SAS and want to check if all entries in a variable in a data set satisfy a condition (namely =1) and return just one dummy variable 0 pr one depending whether all entries in the variable are 1 or at least one is not 1.
Any idea how to do it?
IF colvar = 1 THEN dummy_variable = 1
creates another variable dummy_variable of the same size as the original variable.
Thank you

* Generate test data;
data have;
colvar=0;
colvar2=0;
do i=1 to 20;
colvar=round(ranuni(0));
output;
end;
drop i;
run;
* Read the input dataset twice, first counting the number
* of observations and setting the dummy variables to 1 if
* the corresponding variable has the value 1 in any obser-
* vation, second outputting the result. The dummy variables
* remain unchanged during the second loop.;
data want;
_n_=0;
d_colvar=0;
d_colvar2=0;
do until (eof);
set have end=eof;
if colvar = 1
then d_colvar=1;
if colvar2 = 1
then d_colvar2=1;
* etc.... *;
_n_=_n_+1;
end;
do _n_=1 to _n_;
set have;
output;
end;
run;

PROC SQL is a good tool for quickly generating a summary of an arbitrarily defined condition. What exactly your condition is is not clear. I think you want the ALL_ONE value in the table the code below generates. That will be 1 when every observation has COLVAR=1. Any value that is NOT a one will cause the condition to be false (0) and so ALL_ONE will then have a value of 0 instead of 1.
You could store the result into a small table.
proc sql ;
create table check_if_one as
select min( colvar=1 ) as all_one
, max( colvar=1 ) as any_one
, max( colvar ne 1 ) as any_not_one
, min( colvar ne 1 ) as all_not_one
from my_table
;
quit;
But you could also just store the value into a macro variable that you could easily use later for some purpose.
proc sql noprint ;
select min( colvar=1 ) into :all_one trimmed from my_table ;
quit;

Related

Populate SAS macro-variable using a SQL statement within another SQL statement?

I stumbled upon the following code snippet in which the variable top3 has to be filled from a table have rather than from an array of numbers.
%let top3 = 14 15 42; /* This should be made obsolete.. */
%let no = 3;
proc sql;
create table want as
select *
from (select x, y from foo) a
%do i = 1 %to &no.;
%let current = %scan(&top3.,&i.); /* What do I need to put here? */
left join (select x, y from bar where z=&current.) row_&current.
on a.x = row_&current..x
%end;
;
quit;
The table have contains the xs from the string and looks as follows:
i x
1 14
2 15
3 42
I am now wondering how I should modify the %let current = ... line such that current is populated from the table have. I know how to populate a macro variable using proc sql with select .. into, but I am afraid that the way I am going right now is fully against SAS philosophy.
It looks like you're more or less transposing something. If that's the case, this is doable in macro/sql pretty easily.
First, here's the simple version - no macro.
proc sql;
create table class_t as
select * from (
select name from sashelp.class ) class
left join (
select name, age as age_Alfred
from sashelp.class
where name='Alfred') Alfred
on class.name = Alfred.name
;
quit;
We grab the value of age from the Alfred row and put it on the main join. This isn't exactly what you're doing, but it seems similar. (I'm just using one table, but you can of course use two here.)
Now, how do we extend this to be table-driven and not handwritten? Macros!
First, here's the macro - just taking the Alfred bit and making it generic.
%macro joiner(name=);
left join (
select name, age as age_&name.
from sashelp.class
where name="&name.") &name.
on class.name = &name..name
%mend joiner;
Second, we look at this and see two things we need to put into macro lists: the SELECT variable list (we'll get one new variable for each call), and the JOIN list.
proc sql;
select cats('%joiner(name=',name,')')
into :joinlist separated by ' '
from sashelp.class;
select cats(name,'.age_',name)
into :selectlist separated by ','
from sashelp.class;
quit;
And then, we just call it!
proc sql;
create table class_t as
select class.name,&selectlist. from (
select name from sashelp.class) class
&joinlist.
;
quit;
Now, your dataset you call the macro lists from is perhaps the dataset with the 3 rows in it you have above ("have"). The dataset you actually get the appending data from is some other dataset ("bar"), right? And then the ones you join to is perhaps a third dataset ("foo"). Here I just use the one, for simplicity, but the concept is the same, just different sources.
When the lookup data is in a table you can perform a three way join without any need for SAS Macro. You don't provide any data so the example will mock some.
Example:
Suppose a master record has several associated detail records, and the detail records contain a z value used for selection into a result set per a wanted z lookup table.
data masters;
call streaminit(2020);
do id = 1 to 100;
do x = 1 to 100;
m_rownum + 1;
code = rand('integer', 10,45);
output;
end;
end;
run;
data details;
call streaminit(2020);
do date = 1 to 20;
do x = 1 to 100;
do rep = 1 to 5;
d_rownum + 1;
amount = rand('integer', 100,200);
z = rand('integer', 10,45);
output;
end;
end;
end;
run;
data zs;
input z ##; datalines;
14 15 42
;
proc sql;
create table want as
select
m_rownum
, d_rownum
, masters.id
, masters.x
, masters.code
, details.z
, details.date
, details.amount
from
masters
left join
details
on
details.x = masters.x
inner join
zs
on
zs.z = details.z
order by
masters.id, masters.x, details.z, details.date
;
quit;

how to transpose data with multiple occurrences in sas

I have a 2 column dataset - accounts and attributes, where there are 6 types of attributes.
I am trying to use PROC TRANSPOSE in order to set the 6 different attributes as 6 new columns and set 1 where the column has that attribute and 0 where it doesn't
This answer shows two approaches:
Proc TRANSPOSE, and
array based transposition using index lookup via hash.
For the case that all of the accounts missing the same attribute, there would be no way for the data itself to exhibit all the attributes -- ideally the allowed or expected attributes should be listed in a separate table as part of your data reshaping.
Proc TRANSPOSE
When working with a table of only account and attribute you will need to construct a view adding a numeric variable that can be transposed. After TRANSPOSE the result data will have to be further massaged, replacing missing values (.) with 0.
Example:
data have;
call streaminit(123);
do account = 1 to 10;
do attribute = 'a','b','c','d','e','f';
if rand('uniform') < 0.75 then output;
end;
end;
run;
data stage / view=stage;
set have;
num = 1;
run;
proc transpose data=stage out=want;
by account;
id attribute;
var num;
run;
data want;
set want;
array attrs _numeric_;
do index = 1 to dim(attrs);
if missing(attrs(index)) then attrs(index) = 0;
end;
drop index;
run;
proc sql;
drop view stage;
From
To
Advanced technique - Array and Hash mapping
In some cases the Proc TRANSPOSE is deemed unusable by the coder or operator, perhaps very many by groups and very many attributes. An alternate way to transpose attribute values into like named flag variables is to code:
Two scans
Scan 1 determine attribute values that will be encountered and used as column names
Store list of values in a macro variable
Scan 2
Arrayify the attribute values as variable names
Map values to array index using hash (or custom informat per #Joe)
Process each group. Set arrayed variable corresponding to each encountered attribute value to 1.  Array index obtained via lookup through hash map.
Example:
* pass #1, determine attribute values present in data, the values will become column names;
proc sql noprint;
select distinct attribute into :attrs separated by ' ' from have;
* or make list of attributes from table of attributes (if such a table exists outside of 'have');
* select distinct attribute into :attrs separated by ' ' from attributes;
%put NOTE: &=attrs;
* pass #2, perform array based tranposformation;
data want2(drop=attribute);
* prep pdv, promulgate by group variable attributes;
if 0 then set have(keep=account);
array attrs &attrs.;
format &attrs. 4.;
if _n_=1 then do;
declare hash attrmap();
attrmap.defineKey('attribute');
attrmap.defineData('_n_');
attrmap.defineDone();
do _n_ = 1 to dim(attrs);
attrmap.add(key:vname(attrs(_n_)), data: _n_);
end;
end;
* preset all flags to zero;
do _n_ = 1 to dim(attrs);
attrs(_n_) = 0;
end;
* DOW loop over by group;
do until (last.account);
set have;
by account;
attrmap.find(); * lookup array index for attribute as column;
attrs(_n_) = 1; * set flag for attribute (as column);
end;
* implicit output one row per by group;
run;
One other option for doing this not using PROC TRANSPOSE is the data step array technique.
Here, I have a dataset that hopefully matches yours approximately. ID is probably your account, Product is your attribute.
data have;
call streaminit(2007);
do id = 1 to 4;
do prodnum = 1 to 6;
if rand('Uniform') > 0.5 then do;
product = byte(96+prodnum);
output;
end;
end;
end;
run;
Now, here we transpose it. We make an array with the six variables that could occur in HAVE. Then we iterate through the array to see if that variable is there. You can add a few additional lines to the if first.id block to set all of the variables to 0 instead of missing initially (I think missing is better, but YMMV).
data want;
set have;
by id;
array vars[6] a b c d e f;
retain a b c d e f;
if first.id then call missing(of vars[*]);
do _i = 1 to dim(vars);
if lowcase(vname(vars[_i])) = product then
vars[_i] = 1;
end;
if last.id then output;
run;
We could do it a lot faster if we knew how the dataset was constructed, of course.
data want;
set have;
by id;
array vars[6] a b c d e f;
if first.id then call missing(of vars[*]);
retain a b c d e f;
vars[rank(product)-96]=1;
if last.id then output;
run;
While your data doesn't really work that way, you could make an informat though that did this.
*First we build an informat relating the product to its number in the array order;
proc format;
invalue arrayi
'a'=1
'b'=2
'c'=3
'd'=4
'e'=5
'f'=6
;
quit;
*Now we can use that!;
data want;
set have;
by id;
array vars[6] a b c d e f;
if first.id then call missing(of vars[*]);
retain a b c d e f;
vars[input(product,arrayi.)]=1;
if last.id then output;
run;
This last one is probably the absolute fastest option - most likely much faster than PROC TRANSPOSE, which tends to be one of the slower procs in my book, but at the cost of having to know ahead of time what variables you're going to have in that array.

Designing new RK number for unique record

I am a SAS Developer. I am starting a project that requires me to assign RK number to unique record. Every extraction will get data that is already existed in the target table and some may not.
For example.
Source Data:
Name
A
B
C
D
E
Target Table:
Name RK
A 1
B 2
C 3
When I load, i want it to insert D and E into the target table with RK 4 & 5 respectively. Currently, I can think of doing hash lookup from source with target table. For data that is not mapped using hash object, RK field will be blank. I will put the max RK number from the target table and incremental 1 to it by appending D & E into it.
I am not sure if this is the most efficient way of doing so. Is there another more efficient way?
You could use a hash to determine if some name (I'll call it value) already exists in target table. However, new keys would have to be tracked, output at the end of the step and then PROC APPPEND'd to target table (I'll call it master) .
For the case of just updating the master table with new RK values, a traditional SAS approach is to use a DATA step to MODIFY a unique keyed master table. The coding pattern is:
SET <source>
MODIFY <master> KEY=<value> / UNIQUE;
... _IORC_ logic ...
Example:
%* Create some source data and the master table;
data have1 have2 have3 have4 have5;
call streaminit(123);
value = 2020; output; output; output;
do _n_ = 1 to 2500;
value = ceil(rand('uniform', 5000));
select;
when (rand('uniform') < 0.20) output have1;
when (rand('uniform') < 0.20) output have2;
when (rand('uniform') < 0.20) output have3;
when (rand('uniform') < 0.20) output have4;
otherwise output have5;
end;
end;
run;
data have6;
do _n_ = 1 to 20;
value = 2020;
output;
end;
run;
* Create the unique keyed master table;
* Typically done once and stored in a permanent library.;
proc sql;
create table keys (value integer, RK integer);
create distinct index value on work.keys;
quit;
%* A macro for adding new RK values as needed;
%macro RK_ASSIGN(master, data);
%local last;
proc sql noprint;
select max(RK) into :last trimmed from &master;
quit;
data &master;
retain newkey %sysevalf(0&last+0); %* trickery for 1st use case when max(RK) is .;
set &data;
modify &master key=value / unique;
if _iorc_ eq %sysrc(_DSENOM);
newkey + 1;
RK = newkey;
output;
_error_ = 0;
run;
%mend;
%* Use the macro to process source data;
%RK_ASSIGN(keys,have1)
%RK_ASSIGN(keys,have2)
%RK_ASSIGN(keys,have3)
%RK_ASSIGN(keys,have4)
%RK_ASSIGN(keys,have5)
%RK_ASSIGN(keys,have6)
You can see the forced repeats of the 2020 value in the source data is only RK'd once in the master table, and there are no errors during processing.
If you want to backfill the source data with the found or assigned RK value there would be additional steps. You could update a custom format, or do a traditional left join. If you want to focus on backfill during a read over source data the HASH step + APPEND new RK's step might be preferable.
Example 2 Master table is named values
HASH version with RK assignment added to source data. New RKs output and appended.
proc sql;
create table values (value integer, RK integer);
create distinct index value on work.values;
%macro RK_HASH_ASSIGN(master,data);
%local last;
proc sql noprint;
select max(RK) into :last trimmed from &master;
quit;
data &data(drop=next_RK);
set &data end=end;
if _n_ = 1 then do;
declare hash lookup (dataset:"&master");
lookup.defineKey("value");
lookup.defineData("value", "RK");
lookup.defineDone();
declare hash newlookup (dataset:"&master(obs=0)");
newlookup.defineKey("value");
newlookup.defineData("value", "RK");
newlookup.defineDone();
end;
retain next_RK %sysevalf(0&last+0); %* trick;
* either load existing RK from hash, or compute and apply next RK value;
if lookup.find() ne 0 then do;
next_RK + 1;
RK = next_RK;
lookup.add();
newlookup.add();
end;
if end then do;
newlookup.output(dataset:'work.newmasters');
end;
run;
proc append base=&master data=work.newmasters;
proc delete data=work.newmasters;
run;
%mend;
%RK_HASH_ASSIGN(values,have1)
%RK_HASH_ASSIGN(values,have2)
%RK_HASH_ASSIGN(values,have3)
%RK_HASH_ASSIGN(values,have4)
%RK_HASH_ASSIGN(values,have5)
%RK_HASH_ASSIGN(values,have6)
%* Compare the two assignment strategies, no differences!;
proc sort force data=values(index=(value));
by RK;
run;
proc compare noprint base=keys compare=values out=diffs outnoequal;
by RK;
run;
----- LOG -----
2525 proc compare noprint base=keys compare=values out=diffs
outnoequal <------------- do not output when data is identical ;
;
2526 by RK;
2527 run;
NOTE: There were 215971 observations read from the data set WORK.KEYS.
NOTE: There were 215971 observations read from the data set WORK.VALUES.
NOTE: The data set WORK.DIFFS has 0 observations and 4 variables. <--- all the same ---
NOTE: PROCEDURE COMPARE used (Total process time):
real time 0.25 seconds
cpu time 0.26 seconds

SAS - Row by row Comparison within different ID Variables of Same Dataset and delete ALL Duplicates

I need some help in trying to execute a comparison of rows within different ID variable groups, all in a single dataset.
That is, if there is any duplicate observation within two or more ID groups, then I'd like to delete the observation entirely.
I want to identify any duplicates between rows of different groups and delete the observation entirely.
For example:
ID Value
1 A
1 B
1 C
1 D
1 D
2 A
2 C
3 A
3 Z
3 B
The output I desire is:
ID Value
1 D
3 Z
I have looked online extensively, and tried a few things. I thought I could mark the duplicates with a flag and then delete based off that flag.
The flagging code is:
data have;
set want;
flag = first.ID ne last.ID;
run;
This worked for some cases, but I also got duplicates within the same value group flagged.
Therefore the first observation got deleted:
ID Value
3 Z
I also tried:
data have;
set want;
flag = first.ID ne last.ID and first.value ne last.value;
run;
but that didn't mark any duplicates at all.
I would appreciate any help.
Please let me know if any other information is required.
Thanks.
Here's a fairly simple way to do it: sort and deduplicate by value + ID, then keep only rows with values that occur only for a single ID.
data have;
input ID Value $;
cards;
1 A
1 B
1 C
1 D
1 D
2 A
2 C
3 A
3 Z
3 B
;
run;
proc sort data = have nodupkey;
by value ID;
run;
data want;
set have;
by value;
if first.value and last.value;
run;
proc sql version:
proc sql;
create table want as
select distinct ID, value from have
group by value
having count(distinct id) =1
order by id
;
quit;
This is my interpretation of the requirements.
Find levels of value that occur in only 1 ID.
data have;
input ID Value:$1.;
cards;
1 A
1 B
1 C
1 D
1 D
2 A
2 C
3 A
3 Z
3 B
;;;;
proc print;
proc summary nway; /*Dedup*/
class id value;
output out=dedup(drop=_type_ rename=(_freq_=occr));
run;
proc print;
run;
proc summary nway;
class value;
output out=want(drop=_type_) idgroup(out[1](id)=) sum(occr)=;
run;
proc print;
where _freq_ eq 1;
run;
proc print;
run;
A slightly different approach can use a hash object to track the unique values belonging to a single group.
data have; input
ID Value:& $1.; datalines;
1 A
1 B
1 C
1 D
1 D
2 A
2 C
3 A
3 Z
3 B
run;
proc delete data=want;
proc ds2;
data _null_;
declare package hash values();
declare package hash discards();
declare double idhave;
method init();
values.keys([value]);
values.data([value ID]);
values.defineDone();
discards.keys([value]);
discards.defineDone();
end;
method run();
set have;
if discards.find() ne 0 then do;
idhave = id;
if values.find() eq 0 and id ne idhave then do;
values.remove();
discards.add();
end;
else
values.add();
end;
end;
method term();
values.output('want');
end;
enddata;
run;
quit;
%let syslast = want;
I think what you should do is:
data want;
set have;
by ID value;
if not first.value then flag = 1;
else flag = 0;
run;
This basically flags all occurrences of a value except the first for a given ID.
Also I changed want and have assuming you create what you want from what you have. Also I assume have is sorted by ID value order.
Also this will only flag 1 D above. Not 3 Z
Additional Inputs
Can't you just do a sort to get rid of the duplicates:
proc sort data = have out = want nodupkey dupout = not_wanted;
by ID value;
run;
So if you process the observations by VALUE levels (instead of by ID levels) then you just need keep track of whether any ID is ever different than the first one.
data want ;
do until (last.value);
set have ;
by value ;
if first.value then first_id=id;
else if id ne first_id then remapped=1;
end;
if not remapped;
keep value id;
run;

SAS summary statistic from a dataset

The dataset looks like this:
colx coly colz
0 1 0
0 1 1
0 1 0
Required output:
Colname value count
colx 0 3
coly 1 3
colz 0 2
colz 1 1
The following code works perfectly...
ods output onewayfreqs=outfreq;
proc freq data=final;
tables colx coly colz / nocum nofreq;
run;
data freq;
retain colname column_value;
set outfreq;
colname = scan(tables, 2, ' ');
column_Value = trim(left(vvaluex(colname)));
keep colname column_value frequency percent;
run;
... but I believe that's not efficient. Say I have 1000 columns, running prof freq on all 1000 columns is not efficient. Is there any other efficient way with out using the proc freq that accomplishes my desired output?
One of the most efficient mechanisms for computing frequency counts is through a hash object set up for reference counting via the suminc tag.
The SAS documentation for "Hash Object - Maintaining Key Summaries" demonstrates the technique for a single variable. The following example goes one step further and computes for each variable specified in an array. The suminc:'one' specifies that each use of ref will add the value of one to an internal reference sum. While iterating over the distinct keys for output, the frequency count is extracted via the sum method.
* one million data values;
data have;
array v(1000);
do row = 1 to 1000;
do index = 1 to dim(v);
v(index) = ceil(100*ranuni(123));
end;
output;
end;
keep v:;
format v: 4.;
run;
* compute frequency counts via .ref();
data freak_out(keep=name value count);
length name $32 value 8;
declare hash bins(ordered:'a', suminc:'one');
bins.defineKey('name', 'value');
bins.defineData('name', 'value');
bins.defineDone();
one = 1;
do until (end_of_data);
set have end=end_of_data;
array v v1-v1000;
do index = 1 to dim(v);
name = vname(v(index));
value = v(index);
bins.ref();
end;
end;
declare hiter out('bins');
do while (out.next() = 0);
bins.sum(sum:count);
output;
end;
run;
Note Proc FREQ uses standard grammars, variables can be a mixed of character and numeric, and has lots of additional features that are specified through options.
I think the most time consuming part in your code is generation of the ODS report. You can transpose the data before applying the freq. The below example does the task for 1000 rows with 1000 variables in few seconds. If you do it using ODS it may take much longer.
data dummy;
array colNames [1000] col1-col1000;
do line = 1 to 1000;
do j = 1 to dim(colNames);
colNames[j] = int(rand("uniform")*100);
end;
output;
end;
drop j;
run;
proc transpose
data = dummy
out = dummyTransposed (drop = line rename = (_name_ = colName col1 = value))
;
var col1-col1000;
by line;
run;
proc freq data = dummyTransposed noprint;
tables colName*value / out = result(drop = percent);
run;
Perhaps this statement from the comments is the real problem.
I felt like the odsoutput with proc freq is slowing down and creating
huge logs and outputs. think of 10,000 variables and million records.
I felt there should be another way of accomplishing this and arrays
seems to be a great fit
You can tell ODS not to produce the printed output if you don't want it.
ods exclude all ;
ods output onewayfreqs=outfreq;
proc freq data=final;
tables colx coly colz / nocum nofreq;
run;
ods exclude none ;