Count function for the last columns - sas

How do I apply a count function to the last 3 columns in my dataset, given that the names of the last 3 columns always change?
Data test;
Set test1;
Count=count(column12,column13,column14);
Run;

You could also use an ARRAY, combined with the old if 0 then set trick.
data have;
retain id x1-x5 z1-z6 . z7-z10 . a ' ' z11-z12 . ;
id+1; z10 = 1; output;
id+1; z11 = 3.14159; output;
id+1; z12 = 42; output;
format _numeric_ 4.;
run;
data want;
if 0 then set have(drop=id /*or other numeric vars as needed*/);
array _v[*] _numeric_;
set have;
nmiss_3 = nmiss(_v[dim(_v)],_v[dim(_v)-1],_v[dim(_v)-2]);
run;
data want; /*move ID and NMISS_3 back to left*/
if 0 then set want(keep=id nmiss_3);
set want;
run;
proc print;
run;
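As a minimal sketch of why the if 0 then set trick works: the SET statement behind a false condition never executes, but the compiler still reads its variable list, so those variables are placed first in the output data set. The data set name demo is made up for illustration, and sashelp.class is assumed to be available.

```sas
/* Hypothetical demo: HEIGHT and WEIGHT are defined first at compile time,
   so they move to the front; the remaining variables follow in their usual order. */
data demo;
  if 0 then set sashelp.class(keep=height weight); /* never executes */
  set sashelp.class;                               /* does the actual reading */
run;

/* List the variables in creation order to inspect the result */
proc contents data=demo varnum;
run;
```

The same mechanism is what lets the array _v[*] _numeric_ above see the numeric variables in a known order before the real SET statement runs.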

You can query a data set's metadata to get the names of the last three columns.
data have;
retain id x1-x15 z1-z12 . ;
id+1; z10 = 1; output;
id+1; z11 = 3.14159; output;
id+1; z12 = 42; output;
format _numeric_ 4.;
run;
* get the data set's metadata;
proc contents noprint data=have out=have_metadata;
run;
* query for names of last 3 columns;
proc sql noprint;
select name
into :last_three_columns separated by ','
from have_metadata
having varnum > max(varnum) - 3
;
quit;
%put NOTE: &=last_three_columns;
data want;
attrib last3_nmiss_count length=8;
set have;
last3_nmiss_count = nmiss(&last_three_columns);
run;
dm 'viewtable
want(keep=last3_nmiss_count id z:)';

Related

SAS Array Variable Name Based on Another Array

I have data in the following format:
data have;
input id rtl_apples rtl_oranges rtl_berries;
datalines;
1 50 60 10
2 10 30 80
3 40 8 1
;
I'm trying to create new variables that represent each RTL variable as a percent of the sum of the RTL variables: PCT_APPLES, PCT_ORANGES, PCT_BERRIES. The problem is that I'm doing this within a macro, so the names and number of RTL variables will vary with each iteration and the new variable names need to be generated dynamically.
This data step essentially gets what I need, but the new variables are named PCT1, PCT2, ..., PCTn, so it's difficult to know which RTL variable each PCT corresponds to.
data want;
set have;
array rtls[*] rtl_:;
total_sales = sum(of rtl_:);
call symput("dim",dim(rtls));
array pct[&dim.];
do i=1 to dim(rtls);
pct[i] = rtls[i] / total_sales;
end;
drop i;
run;
I also tried creating the new variable name by using a macro variable, but only the last variable in the array is created. In this case, PCT_BERRIES.
data want;
set have;
array rtls[*] rtl_:;
total_sales = sum(of rtl_:);
do i=1 to dim(rtls);
var_name = compress(tranwrd(upcase(vname(rtls[i])),'RTL','PCT'));
call symput("var_name",var_name);
&var_name. = rtls[i] / total_sales;
end;
drop i var_name;
run;
I have a feeling I'm overcomplicating this, so any help would be appreciated.
If you already have the list of names in a data set, use that list to create the names you need for your arrays.
proc sql noprint;
select distinct cats('RTL_',name),cats('PCT_',name)
into :rtl_list separated by ' '
, :pct_list separated by ' '
from dataset_with_names
;
quit;
data want;
set have;
array rtls &rtl_list;
array pcts &pct_list;
total_sales = sum(of rtls[*]);
do index=1 to dim(rtls);
pcts[index] = rtls[index] / total_sales;
end;
drop index ;
run;
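If no such data set of names exists, one possible way to build the two lists is to read the variable names from DICTIONARY.COLUMNS. This is only a sketch under assumptions: the input is WORK.HAVE as in the question, every variable of interest starts with RTL_, and the macro variable names rtl_list and pct_list match the ones used above.

```sas
/* Sample data from the question */
data have;
  input id rtl_apples rtl_oranges rtl_berries;
  datalines;
1 50 60 10
2 10 30 80
3 40 8 1
;

/* Build matching lists of existing RTL_ names and new PCT_ names */
proc sql noprint;
  select name,
         cats('PCT_', substr(name, 5))   /* replace the 4-character prefix */
    into :rtl_list separated by ' ',
         :pct_list separated by ' '
  from dictionary.columns
  where libname = 'WORK'
    and memname = 'HAVE'
    and upcase(substr(name, 1, 4)) = 'RTL_'
  order by varnum;
quit;

%put NOTE: &=rtl_list;
%put NOTE: &=pct_list;
```

Because the lists are resolved before the next data step compiles, they can feed the two ARRAY statements shown above directly.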
You can't create variables while a data step is executing. This program uses PROC TRANSPOSE to create a new data set in which the RTL_ variables are "renamed" to PCT_.
data have;
input id rtl_apples rtl_oranges rtl_berries;
datalines;
1 50 60 10
2 10 30 80
3 40 8 1
;;;;
run;
proc transpose data=have(obs=0) out=names;
var rtl_:;
run;
data pct;
set names;
_name_ = transtrn(_name_,'rtl_','PCT_');
y = .;
run;
proc transpose data=pct out=pct2;
id _name_;
var y;
run;
data want;
set have;
if 0 then set pct2(drop=_name_);
array _rtl[*] rtl_:;
array _pct[*] pct_:;
call missing(of _pct[*]);
total = sum(of _rtl[*]);
do i = 1 to dim(_rtl);
_pct[i] = _rtl[i]/total*1e2;
end;
drop i;
run;
proc print;
run;
You may want to just report the row percents:
proc transpose data=&data out=&data.T;
by id;
var rtl_:;
run;
proc tabulate data=&data.T;
class id _name_;
var col1;
table
id=''
, _name_='Result'*col1=''*sum=''
_name_='Percent'*col1=''*rowpctsum=''
/ nocellmerge;
run;

Assign 3 values so that one record becomes three records

I want to turn a simple dataset (left) into the dataset on the right.
V1 takes three values: 1, 2 and 3. I want to assign each of them to every id.
No problem. Do this:
data have;
input id;
datalines;
1
2
;
data want;
set have;
do _N_ = 1 to 3;
v1 = put(_N_, z2. -l);
output;
end;
run;
Try this
data have;
input id;
datalines;
1
2
;
data want;
set have;
do v1 = 1 to 3;
output;
end;
run;

SAS code for calculating IV

I found some code on the obseveupdate website that is used for IV calculation. When I run the code it completes, but all IV and WOE values are zero. I tried a different data set and also got zeros for all variables. Could you help me figure out why?
data inputdata;
length Region $ 20 age $ 20 Gender $ 20;
infile datalines dsd dlm= ':' truncover;
input Region $ age $ Gender $ target ;
datalines;
Scotland:18-25:Male:1
Scotland:18-25:Female:0
Scotland:26-35:Male:0
Wales:26-35:Male:1
Wales:36-45:Female:0
Wales:26-35:Male:1
London:36-45:Male:1
London:26-35:Male:0
London:18-25:Unknown:1
London:36-45:Male:0
Northern Ireland:36-45:Female:0
Northern Ireland:26-35:Male:1
Northern Ireland:36-45:Male:0
Engand (Not London):45+:Female:0
Engand (Not London):18-25:Male:1
Engand (Not London):26-35:Female:0
Engand (Not London):45+:Female:0
Engand (Not London):36-45:Female:1
Engand (Not London):45+:Female:1
;
data _tempdata;
set inputdata;;
n=_n_;
run;
proc sort data=_tempdata;
by target n;
run;
proc transpose data=_tempdata out = _tempdata;
by target n;
var _character_ _numeric_;
run;
proc sort data=_tempdata out=_tempdata;
by _name_ target;
run;
proc freq data=_tempdata;
by _name_ target;
tables col1 /out=_tempdata;
run;
proc sort data=_tempdata;
by _name_ col1;
run;
proc transpose data=_tempdata out=_tempdata;
by _name_ col1;
id target;
var percent;
run;
data IV_Table(keep=variable IV) WOE_Table(keep=variable attribute woe);
set _tempdata;
by _name_;
rename col1=attribute _name_=variable;
_0=sum(_0,0)/100; *Convert to percent and convert null to zero;
_1=sum(_1,0)/100; *Convert to percent and convert null to zero;
woe=log(_0/_1)*100;output WOE_Table;*Output WOE;
if _1 ne 0 and _0 ne 0 then do;
raw=(_0-_1)*log(_0/_1);
end;
else raw=0;
IV+sum(raw,0);*Cumulatively add to IV, set null to zero;
if last._name_ then do; *only output the final row;
output IV_table;
IV=0;
end;
where upcase(_name_) ^='TARGET' and upcase(_name_) ^= 'N';run;
proc sort data=IV_table;by descending IV;run;
title1 "IV Listing";proc print data=IV_table;run;
proc sort data=woe_table;
by variable WOE;
run;
title1 "WOE Listing";
proc print data=WOE_Table;run;
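For reference, the quantities that the final data step in this code computes can be written out as below. This is a reading of the posted code, not a claim about the site's intended definition: p_{0,a} and p_{1,a} stand for the variables _0 and _1 after division by 100, that is, the share of target=0 and target=1 observations that fall in attribute a, assuming PROC FREQ computes PERCENT within each BY group.

```latex
\mathrm{WOE}_a = 100 \cdot \ln\!\left(\frac{p_{0,a}}{p_{1,a}}\right),
\qquad
\mathrm{IV} = \sum_{a} \left(p_{0,a} - p_{1,a}\right)\,
              \ln\!\left(\frac{p_{0,a}}{p_{1,a}}\right)
```

In the code, a term contributes 0 to IV whenever p_{0,a} or p_{1,a} is zero, and IV is reset after the last attribute of each variable.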

Indicator variable for maximum value by groups

Is there a more elegant way than the one below to do the following:
create indicator variables (below "MAX_X1" and "MAX_X2") within each group (below "key1") of multiple observations (below "key2"), with value 1 if the observation holds the maximum value of the variable in its group and 0 otherwise?
data have;
call streaminit(4321);
do key1=1 to 10;
do key2=1 to 5;
do x1=rand("uniform");
x2=rand("Normal");
output;
end;
end;
end;
run;
proc means data=have noprint;
by key1;
var x1 x2;
output out=max
max= / autoname;
run;
data want;
merge have max;
by key1;
drop _:;
run;
proc sql;
title "MAX";
select name into :MAXvars separated by ' '
from dictionary.columns
WHERE LIBNAME="WORK" AND MEMNAME="WANT" AND NAME like '%_Max'
order by name;
quit;
title;
data want; set want;
array MAX (*) &MAXvars;
array XVars (*) x1 x2;
array Indicators (*) MAX_X1 MAX_X2;
do i=1 to dim(MAX);
if XVars[i]=MAX[i] then Indicators[i]=1; else Indicators[i]=0;
end;
drop i;
run;
Thanks for any suggestion of optimization
PROC SQL can be used with a GROUP BY clause. SAS remerges the summary statistic (here max) back onto every row of the group, so each row's value can be compared with its group maximum.
data have;
call streaminit(4321);
do key1=1 to 10;
do key2=1 to 5;
do x1=rand("uniform");
x2=rand("Normal");
output;
end;
end;
end;
run;
proc sql;
create table want
as select
key1,
key2,
x1,
x2,
case
when x1 = max(x1) then 1
else 0 end as max_x1,
case
when x2 = max(x2) then 1
else 0 end as max_x2
from have
group by key1
order by key1, key2;
quit;
It is also possible to do this in a single data step, provided that you read the input dataset twice - this is an example of a double DOW-loop.
data have;
call streaminit(4321);
do key1=1 to 10;
do key2=1 to 5;
do x1=rand("uniform");
x2=rand("Normal");
output;
end;
end;
end;
run;
/*Sort by key1 (or generate index) if not already sorted*/
proc sort data = have;
by key1;
run;
data want;
if 0 then set have;
array xvars[3,2] x1 x2 x1_max_flag x2_max_flag t_x1_max t_x2_max;
/*1st DOW-loop*/
do _n_ = 1 by 1 until(last.key1);
set have;
by key1;
do i = 1 to 2;
xvars[3,i] = max(xvars[1,i],xvars[3,i]);
end;
end;
/*2nd DOW-loop*/
do _n_ = 1 to _n_;
set have;
do i = 1 to 2;
xvars[2,i] = (xvars[1,i] = xvars[3,i]);
end;
output;
end;
drop i t_:;
run;
This may be a bit complicated to understand, so here's a rough explanation of how it flows:
Read one by-group with the first DOW-loop, updating rolling max variables as each row is read in. Don't output anything yet.
Now read the same by-group again using the second DOW-loop, checking to see whether each row is equal to the rolling max and outputting each row.
Go back to first DOW-loop, read the next by-group and repeat.
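The three steps above can be sketched with a single variable, which may be easier to follow than the two-dimensional array version. The example assumes sashelp.class is available; the names class, flagged, max_height and is_max are made up for illustration.

```sas
/* The double DOW-loop needs the input sorted (or indexed) by the group variable */
proc sort data=sashelp.class out=class;
  by sex;
run;

data flagged;
  /* Pass 1: read one by-group and track its running maximum.
     max_height is not retained, so it starts missing for every group. */
  do _n_ = 1 by 1 until (last.sex);
    set class;
    by sex;
    max_height = max(max_height, height);
  end;
  /* Pass 2: re-read the same _n_ rows and flag those equal to the maximum */
  do _n_ = 1 to _n_;
    set class;
    is_max = (height = max_height);
    output;
  end;
run;

proc print data=flagged;
run;
```

Each iteration of the data step handles exactly one by-group, which is why max_height does not need to be reset by hand.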

How to delete blank observations in a data set in SAS

I want to delete ALL blank observations from a data set.
I only know how to get rid of blanks from one variable:
data a;
set data(where=(var1 ne .)) ;
run;
Here I create a new data set without the blanks in var1.
But how do I do it when I want to get rid of ALL the blanks in the whole data set?
Thanks in advance for your answers.
If you are attempting to get rid of rows where ALL variables are missing, it's quite easy:
/* Create an example with some or all columns missing */
data have;
set sashelp.class;
if _N_ in (2,5,8,13) then do;
call missing(of _numeric_);
end;
if _N_ in (5,6,8,12) then do;
call missing(of _character_);
end;
run;
/* This is the answer */
data want;
set have;
if compress(cats(of _all_),'.')=' ' then delete;
run;
Instead of the compress you could also use OPTIONS MISSING=' '; beforehand.
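A minimal sketch of that OPTIONS MISSING=' ' variant follows. The data set names have_small and want_small are hypothetical, and the option is restored to its default afterwards so later output is not affected.

```sas
/* Hypothetical mini data set: the second row is entirely missing */
data have_small;
  length name $8;
  name = 'Ann'; age = 31; output;
  name = ' ';   age = .;  output;
  name = 'Bob'; age = .;  output;
run;

options missing=' ';   /* show numeric missing values as a blank */

data want_small;
  set have_small;
  /* with MISSING=' ', an all-missing row concatenates to an empty string */
  if missing(cats(of _all_)) then delete;
run;

options missing='.';   /* restore the default */
```

Only the second row is dropped here; the third row is kept because name is populated.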
If you want to remove ALL Rows with ANY missing values, then you can use NMISS/CMISS functions.
data want;
set have;
if nmiss(of _numeric_) > 0 then delete;
run;
or, to check both character and numeric variables:
data want;
set have;
if nmiss(of _numeric_) + cmiss(of _character_) > 0 then delete;
run;
You can do something like this:
data myData;
set myData;
array a(*) _numeric_;
do i=1 to dim(a);
if a(i) = . then delete;
end;
drop i;
run;
This will scan through all the numeric variables and delete the observation as soon as it finds a missing value.
Here you go. This will work irrespective of the variable being character or numeric.
data withBlanks;
input a$ x y z;
datalines;
a 1 2 3
b 1 . 3
c . . 3
. . . .
d . 2 3
e 1 . 3
f 1 2 3
;
run;
%macro removeRowsWithMissingVals(inDsn, outDsn, Exclusion);
/*Inputs:
inDsn: Input dataset with some or all columns missing for some or all rows
outDsn: Output dataset containing only the rows that pass the missing-value check
Exclusion: Should be one of {AND, OR}. AND excludes a row if any column has a missing value; OR excludes a row only if all columns have missing values
*/
/*get a list of variables in the input dataset along with their types (i.e., whether they are numeric or character type)*/
PROC CONTENTS DATA = &inDsn OUT = CONTENTS(keep = name type varnum);
RUN;
/*put each variable with its own comparison string in a separate macro variable*/
data _null_;
set CONTENTS nobs = num_of_vars end = lastObs;
/*use NE . for numeric cols (type=1) and NE '' for char types*/
if type = 1 then call symputx(compress("var"!!varnum), compbl(name!!" NE . "));
else call symputx(compress("var"!!varnum), compbl(name!!" NE '' "));
/*make a note of no. of variables to check in the dataset*/
if lastObs then call symputx("no_of_obs", _n_);
run;
DATA &outDsn;
set &inDsn;
where
%do i =1 %to &no_of_obs.;
&&var&i.
%if &i < &no_of_obs. %then &Exclusion;
%end;
;
run;
%mend removeRowsWithMissingVals;
%removeRowsWithMissingVals(withBlanks, withOutBlanksAND, AND);
%removeRowsWithMissingVals(withBlanks, withOutBlanksOR, OR);
Output of withOutBlanksAND:
a x y z
a 1 2 3
f 1 2 3
Output of withOutBlanksOR:
a x y z
a 1 2 3
b 1 . 3
c . . 3
d . 2 3
e 1 . 3
f 1 2 3
Really weird nobody provided this elegant answer:
if missing(cats(of _all_)) then delete;
Edit: indeed, I didn't realize that cats(of _all_) returns a dot '.' for a missing numeric value.
As a fix, I suggest this, which seems to be more reliable:
*-- Building a sample dataset with test cases --*;
data test;
attrib a format=8.;
attrib b format=$8.;
a=.; b='a'; output;
a=1; b=''; output;
a=.; b=''; output; * should be deleted;
a=.a; b=''; output; * should be deleted;
a=.a; b='.'; output;
a=1; b='b'; output;
run;
*-- Apply the logic to delete blank records --*;
data test2;
set test;
*-- Build arrays of numeric and characters --*;
*-- Note: an array can only contain variables of the same type, thus we must create 2 different arrays --*;
array nvars(*) _numeric_;
array cvars(*) _character_;
*-- Delete blank records --*;
*-- Blank record: # of missing num variables + # of missing char variables = # of numeric variables + # of char variables --*;
if nmiss(of _numeric_) + cmiss(of _character_) = dim(nvars) + dim(cvars) then delete;
run;
The main issue is that if there are no numeric variables at all (or no character variables at all), creating an empty array generates a WARNING and the call to nmiss/cmiss an ERROR.
So, as far as I can tell, there is no other option than building a SAS statement outside the data step to identify empty records:
*-- Building a sample dataset with test cases --*;
data test;
attrib a format=8.;
attrib b format=$8.;
a=.; b='a'; output;
a=1; b=''; output;
a=.; b=''; output; * should be deleted;
a=.a; b=''; output; * should be deleted;
a=.a; b='.'; output;
a=1; b='b'; output;
run;
*-- Create a SAS statement that tests whether all variables are missing, regardless of their type --*;
proc sql noprint;
select distinct 'missing(' || strip(name) || ')'
into :miss_stmt separated by ' and '
from dictionary.columns
where libname = 'WORK'
and memname = 'TEST'
;
quit;
/*
miss_stmt looks like missing(a) and missing(b)
*/
*-- Delete blank records --*;
data test2;
set test;
if &miss_stmt. then delete;
run;