SAS - Creating combinations of different independent variables with lags

I'd like to create dynamic entries in a data set (in SAS) formed using the names of variables (e.g. VarA, VarB, VarC) each having lags up to 4.
The input data set HAVE has this information (the column names are Variables and Values):
Variables Values
VarA 0
VarB 0
VarC 0
Lags 4
and the output data set WANT should be something like below (Var1, Var2, and Var3 are dynamic column names i.e. appending 1,2,3 to any string Var)
Var1 Var2 Var3
VarA VarB VarC
VarA1 VarB1 VarC1
..
VarA4 VarB4 VarC4
The intention is to have this work for any number of variables in the HAVE data set.
Thanks

The following code returns what you want. Please modify according to your needs.
/*sample input dataset*/
data have;
input Variables $ Values;
datalines;
VarA 0
VarB 0
VarC 0
Lags 4
;
run;
/*get the no. of lags from the input dataset*/
proc sql noprint;
select Values into :num_of_lags from have where upcase(variables)='LAGS';
quit;
/*transpose the input dataset such that the VarA, VarB, VarC are put in columns Var1, Var2, & Var3 respectively*/
/*have_t, the transposed dataset only has 1 row.*/
proc transpose data = have out = have_t(drop = _name_) prefix = var;
where upcase(variables) ne 'LAGS';
var variables;
run;
/*replicate the 1 row in have_t num_of_lags+1 times: one row for the plain names plus one row per lag*/
data pre_want;
set have_t;
do j = 1 to &num_of_lags+1;
output;
end;
run;
/*final dataset*/
data want;
set pre_want;
array myVars{*} _character_;
if _N_>1 then do;
do i = 1 to dim(myVars);
myVars[i]=cats(myVars[i],_n_-1); /*append the lag number; cats() strips blanks and converts the number without a log note*/
end;
end;
drop i j;
run;
proc print data = want; run;
Output:
var1 var2 var3
VarA VarB VarC
VarA1 VarB1 VarC1
VarA2 VarB2 VarC2
VarA3 VarB3 VarC3
VarA4 VarB4 VarC4

Related

SAS - Flag if variable is present in another column of same dataset

I have a SAS dataset with 2 columns that I want to compare (VAR1 and VAR2). I would like to check if for each value of VAR1 this value exists anywhere in the column VAR2. If the VAR1 value does not exist anywhere in the column VAR2 I want to flag it as 1.
For example, I have this:
TABLE in
VAR1 VAR2
k3 t7
t7 g7
p8 k3
... ...
And I would want this:
TABLE out
VAR1 VAR2 FLAG
k3 t7 0
t7 g7 0
p8 k3 1
... ... ...
I tried using
FLAG = ifn(indexw(VAR2,VAR1),0,1);
But this method only compares the two columns for the current row.
Thank you in advance for your help!
Edit: I tried running this code, as suggested by Joe, but ran into an error.
Code:
data your_table;
length VAR1 $2;
length VAR2 $2;
input VAR1 VAR2;
datalines;
k3 t7
t7 g7
p8 k3
;
data for_fmt;
set your_table;
fmtname = 'VAR2F';
start = var2;
label = '0';
output;
if _n_ eq 1 then do;
hlo = 'o';
start = .;
label = '1';
output;
end;
run;
proc sort nodupkey data=for_fmt;
by start;
run;
proc format cntlin=for_fmt;
quit;
data want;
set your_table;
flag = put(var1,var2f.);
run;
Error:
ERROR: This range is repeated, or values overlap: .- ..

In SAS, the data step processes one row at a time, so you can't do what you're looking for directly.
What you can do, though, is use a lookup technique - there are quite a few - and that will let you get what you're after.
The easiest one to use in your case is probably a format.
data for_fmt;
set your_table;
fmtname = '$VAR2F'; *character format, because VAR1 and VAR2 are character;
start = var2;
label = '0';
output;
if _n_ eq 1 then do;
hlo = 'O'; *this is for "other" (not found) records;
start = ' ';
label = '1';
output;
end;
run;
proc sort nodupkey data=for_fmt;
by start;
run;
proc format cntlin=for_fmt;
run;
data want;
set your_table;
flag = put(var1,$var2f.);
run;
This is pretty fast (only limited by dataset read/write time) unless you have millions of unique rows.
You could also merge the dataset to itself, or do this in SQL, or use a hash table, but the format approach is probably simplest.
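The SQL and hash-table alternatives mentioned above might look roughly like the following. This is only a sketch, assuming the same your_table as in the question; the output names want_sql and want_hash and the hash object name h are made up for illustration.

```sas
/* Sample data from the question */
data your_table;
  length VAR1 $2 VAR2 $2;
  input VAR1 VAR2;
  datalines;
k3 t7
t7 g7
p8 k3
;
run;

/* SQL: flag = 1 when VAR1 is not found anywhere in VAR2 */
proc sql;
  create table want_sql as
  select a.*,
         case when a.VAR1 in (select VAR2 from your_table) then 0 else 1 end as flag
  from your_table as a;
quit;

/* Hash table: load the VAR2 values once, renamed to VAR1 so they can act as the lookup key */
data want_hash;
  if _n_ = 1 then do;
    declare hash h(dataset: 'your_table(keep=VAR2 rename=(VAR2=VAR1))');
    h.defineKey('VAR1');
    h.defineDone();
  end;
  set your_table;
  flag = (h.check() ne 0); /* check() returns 0 when the key is found */
run;
```

Both versions produce a numeric flag (0 for k3 and t7, 1 for p8), whereas the format approach returns the character values '0' and '1'.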

As @Joe says, the data step always works on one row at a time. So if you first collect all possible values of var2 into an array, you can do the matching within each row easily.
data your_table;
length VAR1 $2;
length VAR2 $2;
input VAR1 VAR2;
datalines;
k3 t7
t7 g7
p8 k3
;
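data want;
array var2_vals[&sysnobs.] $200 _temporary_;
/*first pass: store every VAR2 value in the temporary array*/
do until(eof1);
set your_table end=eof1;
n+1;
var2_vals[n]=var2;
end;
/*second pass: flag=0 as soon as VAR1 matches a stored VAR2 value*/
do until(eof2);
set your_table end=eof2;
flag=1;
do i=1 to dim(var2_vals) until(flag=0);
if var1=var2_vals[i] then flag=0;
end;
output;
end;
drop i n;
run;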

Here's the method I typically use in situations like this. I first create a list of the variables to check against, then merge with that and can easily pick out the ones that are found.
proc sort data=have (keep=var2) out=var2levels (rename=(var2=var1)) nodupkey;
by var2;
run;
proc sort data=have;
by var1;
run;
data want;
merge have (in=in1) var2levels (in=in2);
by var1;
if in1;
flag = not in2; *1 when var1 never appears in var2, as the question asks;
run;
So here the first proc sort creates a list of all the unique values of var2. The output data set renames that to var1 for merging purposes (this can be done more clearly but less efficiently by renaming multiple variables). Then we simply merge the original data set (keeping all records) with the list of existing var2 values and set the flag accordingly.

How do you add a row number in SAS by multiple groups with one variable in descending order?

I have discovered this code in SAS that mimics the following window function in SQL server:
ROW_NUMBER() OVER (PARTITION BY Var1,var2 ORDER BY var1, var2)
=
data want;
set have;
by var1 var2;
if first.var1 AND first.var2 then n=1;
else n+1;
run;
"She's a beaut' Clark"... but, How does one mimic this operation:
ROW_NUMBER() OVER (PARTITION BY Var1,var2 ORDER BY var1, var2 Desc)
I made sure to sort the data first:
PROC SORT DATA=WORK.TEST
OUT=WORK.TEST;
BY var1 DESCENDING var2 ;
RUN;
data WORK.want;
set WORK.Test;
by var1 var2;
if first.var1 AND last.var2 then n=1;
else n+1;
run;
But this doesn't work.
ERROR: BY variables are not properly sorted on data set WORK.TEST.
Sample DataSet:
data test;
infile datalines dlm='#';
INPUT var1 var2;
datalines;
1#5
2#4
1#3
1#6
1#9
2#5
2#2
1#7
;
run;
I was thinking I could make one variable temporarily negative, but I don't want to change the data; I'm looking for a more elegant solution.

You have to tell the data step to expect the data in descending order if that is what you are giving it.
You also don't seem to quite get the logic of the FIRST. and LAST. flags. If it is FIRST.VAR1 then by definition it is FIRST.VAR2. The first observation for this value of VAR1 is also the first observation for the first value of VAR2 within this specific value of VAR1.
Do you want to number the observations within each combination of VAR1 and VAR2?
data WORK.want;
set WORK.Test;
BY var1 DESCENDING var2 ;
if first.var2 then n=1;
else n+1;
run;
Or number the distinct values of VAR2 within VAR1?
data WORK.want;
set WORK.Test;
BY var1 DESCENDING var2 ;
if first.var1 then n=0;
if first.var2 then n+1;
run;
Or number the distinct combinations of VAR2 and VAR1?
data WORK.want;
set WORK.Test;
BY var1 DESCENDING var2 ;
if first.var2 then n+1;
run;
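If the goal is a plain row number within each VAR1 group, ordered by descending VAR2 (that is, ROW_NUMBER() OVER (PARTITION BY var1 ORDER BY var2 DESC), which is an assumption about the intent), a minimal end-to-end sketch on the sample data could look like this. The data set names test_sorted and want_rownum are made up for illustration.

```sas
/* Sample data from the question */
data test;
  infile datalines dlm='#';
  input var1 var2;
  datalines;
1#5
2#4
1#3
1#6
1#9
2#5
2#2
1#7
;
run;

/* Sort into a new data set so the original order is left untouched */
proc sort data=test out=test_sorted;
  by var1 descending var2;
run;

/* The BY statement must repeat the DESCENDING keyword used in the sort */
data want_rownum;
  set test_sorted;
  by var1 descending var2;
  if first.var1 then n = 1; /* restart the counter for each var1 group */
  else n + 1;
run;
```

For var1 = 1 the rows come out with var2 = 9, 7, 6, 5, 3 and n = 1 to 5; for var1 = 2 they come out with var2 = 5, 4, 2 and n = 1 to 3.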

Setting names to idgroup

Follow up to
SAS - transpose multiple variables in rows to columns
I have the following code:
data have;
input CX_ID 1. TYPE $1. COUNT_RATE 1. SUM_RATE 2.;
datalines;
1A110
1B220
2A120
;
run;
proc summary data = have nway;
class cx_id;
output out=want (drop = _:)
idgroup(out[2] (count_rate sum_rate)= count sum);
run;
So this table:
CX_ID TYPE COUNT_RATE SUM_RATE
1 A 1 10
1 B 2 20
2 A 1 20
becomes
CX_ID COUNT_1 COUNT_2 SUM_1 SUM_2
1 1 2 10 20
2 1 . 20 .
Which is perfect, but how do I set the names to be
Count_A Count_B Sum_A Sum_B
Or, in general, whatever values appear in the TYPE field of the HAVE table?
Thank you

A double PROC TRANSPOSE is dynamic and you can add a data step to customize the names easily.
*sample data;
data have;
input CX_ID 1. TYPE $1. COUNT 1. SUM 2.;
datalines;
1A110
1B220
2A120
;
run;
*transpose to long;
proc transpose data=have out=long;
by cx_id type;
run;
*transpose to wide;
proc transpose data=long out=wide;
by cx_id;
var col1;
id _name_ type;
run;
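As written, the ID statement simply concatenates the two ID values, so the new columns are named like COUNTA and SUMA. One way to get the underscore the question asks for, without an extra data step, may be the DELIMITER= option of PROC TRANSPOSE. The sketch below assumes the same renamed sample data as above; the output name wide_named is made up for illustration.

```sas
/* Sample data, with the rate columns already renamed to COUNT and SUM */
data have;
  input CX_ID 1. TYPE $1. COUNT 1. SUM 2.;
  datalines;
1A110
1B220
2A120
;
run;

/* Transpose to long: one row per CX_ID, TYPE and original variable */
proc transpose data=have out=long;
  by cx_id type;
run;

/* Transpose to wide: DELIMITER= puts "_" between the two ID values */
proc transpose data=long out=wide_named (drop=_name_) delimiter=_;
  by cx_id;
  var col1;
  id _name_ type;
run;
```

The resulting columns should be CX_ID, COUNT_A, SUM_A, COUNT_B and SUM_B, with missing values for CX_ID 2 in the B columns.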

SAS code for calculating IV

I found some code on the obseveupdate website that is used for IV (information value) calculation. When I run the code it completes without errors, but all IV and WOE values are zero. I tried another data set and also got zeros for all variables. Could you help me figure out why?
data inputdata;
length Region $ 20 age $ 20 Gender $ 20;
infile datalines dsd dlm= ':' truncover;
input Region $ age $ Gender $ target ;
datalines;
Scotland:18-25:Male:1
Scotland:18-25:Female:0
Scotland:26-35:Male:0
Wales:26-35:Male:1
Wales:36-45:Female:0
Wales:26-35:Male:1
London:36-45:Male:1
London:26-35:Male:0
London:18-25:Unknown:1
London:36-45:Male:0
Northern Ireland:36-45:Female:0
Northern Ireland:26-35:Male:1
Northern Ireland:36-45:Male:0
Engand (Not London):45+:Female:0
Engand (Not London):18-25:Male:1
Engand (Not London):26-35:Female:0
Engand (Not London):45+:Female:0
Engand (Not London):36-45:Female:1
Engand (Not London):45+:Female:1
;
data _tempdata;
set inputdata;
n=_n_;
run;
proc sort data=_tempdata;
by target n;
run;
proc transpose data=_tempdata out = _tempdata;
by target n;
var _character_ _numeric_;
run;
proc sort data=_tempdata out=_tempdata;
by _name_ target;
run;
proc freq data=_tempdata;
by _name_ target;
tables col1 /out=_tempdata;
run;
proc sort data=_tempdata;
by _name_ col1;
run;
proc transpose data=_tempdata out=_tempdata;
by _name_ col1;
id target;
var percent;
run;
data IV_Table(keep=variable IV) WOE_Table(keep=variable attribute woe);
set _tempdata;
by _name_;
rename col1=attribute _name_=variable;
_0=sum(_0,0)/100; *Convert to percent and convert null to zero;
_1=sum(_1,0)/100; *Convert to percent and convert null to zero;
woe=log(_0/_1)*100;output WOE_Table;*Output WOE;
if _1 ne 0 and _0 ne 0 then do;
raw=(_0-_1)*log(_0/_1);
end;
else raw=0;
IV+sum(raw,0);*Cumulatively add to IV, set null to zero;
if last._name_ then do; *only output the final row;
output IV_table;
IV=0;
end;
where upcase(_name_) ^='TARGET' and upcase(_name_) ^= 'N';run;
proc sort data=IV_table;by descending IV;run;
title1 "IV Listing";proc print data=IV_table;run;
proc sort data=woe_table;
by variable WOE;
run;
title1 "WOE Listing";
proc print data=WOE_Table;run;

Is there a built in method to choose non key variables in a SAS merge?

If you have multiple datasets (hundreds) with the same variable names and would like to merge them by a key, is there a simple way to control which value to take for the non-key variables? One way would be to rename the variables on the MERGE statement and then write another step that uses the renamed variables to calculate the most frequent value with an array, but I'm wondering if there's a built-in way of handling this. For example:
data ds1;
infile datalines dsd delimiter=' ';
input var1 $ var2;
datalines;
a 1
b 2
;
run;
data ds2;
infile datalines dsd delimiter=' ';
input var1 $ var2;
datalines;
a
b 2
;
run;
data ds3;
infile datalines dsd delimiter=' ';
input var1 $ var2;
datalines;
a 1
b
;
run;
data ds123;
merge ds1 ds2 ds3;
by var1;
run;
This code will 'pick' the 'furthest right' var2 i.e. the dataset ds123:
a 1
b
But I may want it to be:
a 1
b 2
as this would match the most frequent values.

Use an SQL join and the coalesce function. Specify the preference order in the coalesce and the first non-missing in that order will be used.
proc sql noprint;
create table ds123 as
select a.var1,
coalesce(a.var2,b.var2,c.var2) as var2
from ds1 as a,
ds2 as b,
ds3 as c
where a.var1 = b.var1
and b.var1 = c.var1;
quit;
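Note that coalesce returns the first non-missing value in the stated order, not the most frequent one. If the most frequent non-missing value per key is what is needed, one possible approach is to stack the data sets and count. The sketch below is an illustration under that assumption, not a built-in merge feature; the names stacked, counts and ds123_mode are made up, and the sample data writes missing values as "." so that plain list input can read them.

```sas
data ds1;
  input var1 $ var2;
  datalines;
a 1
b 2
;
run;

data ds2;
  input var1 $ var2;
  datalines;
a .
b 2
;
run;

data ds3;
  input var1 $ var2;
  datalines;
a 1
b .
;
run;

/* Stack all inputs, keeping only rows with a non-missing value */
data stacked;
  set ds1 ds2 ds3;
  where not missing(var2);
run;

/* Count how often each value occurs for each key */
proc freq data=stacked noprint;
  tables var1*var2 / out=counts (keep=var1 var2 count);
run;

/* Put the most frequent value first within each key */
proc sort data=counts;
  by var1 descending count;
run;

/* Keep one row per key: the value with the highest count */
data ds123_mode;
  set counts;
  by var1;
  if first.var1;
  drop count;
run;
```

With the sample data this gives a 1 and b 2. When two values are tied for a key, which one is kept is arbitrary in this sketch.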
