I have a SAS dataset with 2 columns that I want to compare (VAR1 and VAR2). I would like to check if for each value of VAR1 this value exists anywhere in the column VAR2. If the VAR1 value does not exist anywhere in the column VAR2 I want to flag it as 1.
For example, I have this:
TABLE in
VAR1 VAR2
k3 t7
t7 g7
p8 k3
... ...
And would want this:
TABLE out
VAR1 VAR2 FLAG
k3 t7 0
t7 g7 0
p8 k3 1
... ... ...
I tried using
FLAG = ifn(indexw(VAR2, VAR1), 0, 1);
But this method only compares the two columns within the current row.
Thank you in advance for your help !
Edit : I tried running this code as suggested by Joe but ran into an error.
Code :
data your_table;
length VAR1 $2;
length VAR2 $2;
input VAR1 VAR2;
datalines;
k3 t7
t7 g7
p8 k3
;
data for_fmt;
set your_table;
fmtname = 'VAR2F';
start = var2;
label = '0';
output;
if _n_ eq 1 then do;
hlo = 'o';
start = .;
label = '1';
output;
end;
run;
proc sort nodupkey data=for_fmt;
by start;
run;
proc format cntlin=for_fmt;
quit;
data want;
set your_table;
flag = put(var1,var2f.);
run;
Error:
ERROR: This range is repeated, or values overlap: .- ..
In SAS, everything is based on one row at a time in the data step, so you can't do what you're looking to directly.
What you can do, though, is use a lookup technique - there are quite a few - and that will let you get what you're after.
The easiest one to use in your case is probably a format.
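data for_fmt;
set your_table;
fmtname = '$VAR2F'; *the leading $ makes this a character format, otherwise the character START values are treated as numeric missing values and overlap;
start = var2;
label = '0';
output;
if _n_ eq 1 then do;
hlo = 'O'; *this is for "other" (not found) records;
start = ' ';
label = '1';
output;
end;
run;
proc sort nodupkey data=for_fmt;
by start;
run;
proc format cntlin=for_fmt;
run;
data want;
set your_table;
flag = put(var1,$var2f.); *flag is the character value '0' or '1';
run;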
This is pretty fast (only limited by dataset read/write time) unless you have millions of unique rows.
You could also merge the dataset to itself, or do this in SQL, or use a hash table, but the format approach is probably simplest.
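The SQL and hash-table alternatives mentioned above might look roughly like the following. This is only a sketch, assuming the your_table data set from the question; the output names want_sql and want_hash and the hash object name h are made up for illustration, and PROC SQL is not guaranteed to keep the original row order.

```sas
/* Sketch 1: PROC SQL, testing each VAR1 against the whole VAR2 column */
proc sql;
  create table want_sql as
  select t.*,
         case when t.var1 in (select var2 from your_table) then 0
              else 1
         end as flag
  from your_table as t;
quit;

/* Sketch 2: hash object keyed on VAR2, loaded once on the first iteration */
data want_hash;
  if _n_ = 1 then do;
    declare hash h(dataset: 'your_table(keep=var2)');
    h.defineKey('var2');
    h.defineDone();
  end;
  set your_table;
  /* check() returns 0 when the key is found, so flag = 1 means "not found" */
  flag = (h.check(key: var1) ne 0);
run;
```

Both variants produce a numeric flag, unlike the format approach, which returns the character values '0' and '1'.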
As @Joe says, the data step always works on one row at a time. So if you first load every value of VAR2 into a temporary array, which is available on every row, you can do the matching easily.
data your_table;
length VAR1 $2;
length VAR2 $2;
input VAR1 VAR2;
datalines;
k3 t7
t7 g7
p8 k3
;
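data want;
*&sysnobs holds the row count of the data set created just above;
array vals[&sysnobs.] $200 _temporary_;
*first pass: load every VAR2 value into the array;
do until(eof1);
set your_table end=eof1;
i+1;
vals[i]=var2;
end;
*second pass: look each VAR1 up in the array;
do until(eof2);
set your_table end=eof2;
flag=1;
do i=1 to dim(vals) until(flag=0);
if var1=vals[i] then flag=0;
end;
output;
end;
drop i;
run;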
Here's the method I typically use in situations like this. I first create a list of the variables to check against, then merge with that and can easily pick out the ones that are found.
proc sort data=have (keep=var2) out=var2levels (rename=(var2=var1)) nodupkey;
by var2;
run;
proc sort data=have;
by var1;
run;
data want;
merge have (in=in1) var2levels (in=in2);
by var1;
if in1;
flag = not in2; *1 when the VAR1 value never appears in VAR2;
run;
So here the first proc sort creates a list of all the unique values of var2. The output data set renames that to var1 for merging purposes (this can be done more clearly but less efficiently by renaming multiple variables). Then we simply merge the original data set (keeping all records) with the list of existing var2 values and set the flag accordingly.
I have discovered this code in SAS that mimics the following window function in SQL server:
ROW_NUMBER() OVER (PARTITION BY Var1,var2 ORDER BY var1, var2)
=
data want;
set have;
by var1 var2;
if first.var1 AND first.var2 then n=1;
else n+1;
run;
"She's a beaut' Clark"... but how does one mimic this operation:
ROW_NUMBER() OVER (PARTITION BY Var1,var2 ORDER BY var1, var2 Desc)
I've made sure to sort the data first:
PROC SORT DATA=WORK.TEST
OUT=WORK.TEST;
BY var1 DESCENDING var2 ;
RUN;
data WORK.want;
set WORK.Test;
by var1 var2;
if first.var1 AND last.var2 then n=1;
else n+1;
run;
But this doesn't work.
ERROR: BY variables are not properly sorted on data set WORK.TEST.
Sample DataSet:
data test;
infile datalines dlm='#';
INPUT var1 var2;
datalines;
1#5
2#4
1#3
1#6
1#9
2#5
2#2
1#7
;
run;
I was thinking I could make one variable temporarily negative, but I don't want to change the data; I'm looking for a more elegant solution.
You have to tell the data step to expect the data in descending order if that is what you are giving it.
You also don't seem to quite get the logic of the FIRST. and LAST. flags. If it is FIRST.VAR1 then by definition it is FIRST.VAR2. The first observation for this value of VAR1 is also the first observation for the first value of VAR2 within this specific value of VAR1.
Do you want to number the observations within each combination of VAR1 and VAR2?
data WORK.want;
set WORK.Test;
BY var1 DESCENDING var2 ;
if first.var2 then n=1;
else n+1;
run;
Or number the distinct values of VAR2 within VAR1?
data WORK.want;
set WORK.Test;
BY var1 DESCENDING var2 ;
if first.var1 then n=0;
if first.var2 then n+1;
run;
Or number the distinct combinations of VAR2 and VAR1?
data WORK.want;
set WORK.Test;
BY var1 DESCENDING var2 ;
if first.var2 then n+1;
run;
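To see how the FIRST. and LAST. flags described above behave, they can be copied into ordinary variables and printed. This is a minimal sketch that rebuilds the sample data from the question; the names test_sorted, flags, first_var1, first_var2 and last_var2 are made up for illustration.

```sas
/* Sample data from the question */
data work.test;
  infile datalines dlm='#';
  input var1 var2;
  datalines;
1#5
2#4
1#3
1#6
1#9
2#5
2#2
1#7
;
run;

/* Sort in the same order the data step will declare in its BY statement */
proc sort data=work.test out=work.test_sorted;
  by var1 descending var2;
run;

/* FIRST./LAST. variables are temporary, so copy them to keep them */
data work.flags;
  set work.test_sorted;
  by var1 descending var2;
  first_var1 = first.var1;
  first_var2 = first.var2;
  last_var2  = last.var2;
run;

proc print data=work.flags noobs;
run;
```

In the printed table, every row where first_var1 is 1 also has first_var2 equal to 1, which is why testing both flags together adds nothing.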
I've got a pretty big table where I want to replace rare values (for this example, those with fewer than 10 occurrences, but the real case is more complicated: a variable might have 1000 levels while I want to keep only 15). The list of possible levels might change, so I don't want to hardcode anything.
My code is like:
%let var = Make;
proc sql;
create table stage1_ as
select &var.,
count(*) as count
from sashelp.cars
group by &var.
having count >= 10
order by count desc
;
quit;
/* Join table with table including only top obs to replace rare
values with "other" category */
proc sql;
create table stage2_ as
select t1.*,
case when t2.&var. is missing then "Other_&var." else t1.&var. end as &var._new
from sashelp.cars t1 left join
stage1_ t2 on t1.&var. = t2.&var.
;
quit;
/* Drop old variable and rename the new as old */
data result;
set stage2_(drop= &var.);
rename &var._new=&var.;
run;
It works, but unfortunately it is not very efficient, as it needs a join for each variable (in the real case I am doing it in a loop).
Is there a better way to do it? Maybe some smart replace function?
Thanks!!
You probably don't want to change the actual data values. Instead consider creating a custom format for each variable that will map the rare values to an 'Other' category.
ODS output from the FREQ procedure (the OneWayFreqs table) can be used to capture the counts and percentages of every listed variable in a single table. NOTE: the TABLE statement's OUT= option captures only the last listed variable. Those counts can then be used to construct the format according to the 'othering' rules you want to implement.
data have;
do row = 1 to 1000;
array x x1-x10;
do over x;
if row < 600
then x = ceil(100*ranuni(123));
else x = ceil(150*ranuni(123));
end;
output;
end;
run;
ods output onewayfreqs=counts;
proc freq data=have ;
table x1-x10;
run;
data count_stack;
length name $32;
set counts;
array x x1-x10;
do over x;
name = vname(x);
value = x;
if value then output;
end;
keep name value frequency;
run;
proc sort data=count_stack;
by name descending frequency ;
run;
data cntlin;
do _n_ = 1 by 1 until (last.name);
set count_stack;
by name;
length fmtname $32;
fmtname = trim(name)||'top';
start = value;
label = cats(value);
if _n_ < 11 then output;
end;
hlo = 'O';
label = 'Other';
output;
run;
proc format cntlin=cntlin;
run;
ods html;
proc freq data=have;
table x1-x10;
format
x1 x1top.
x2 x2top.
x3 x3top.
x4 x4top.
x5 x5top.
x6 x6top.
x7 x7top.
x8 x8top.
x9 x9top.
x10 x10top.
;
run;
Is there a more elegant way than the one presented below for the following task:
create indicator variables (below "MAX_X1" and "MAX_X2") within each group (below "key1") of multiple observations (below "key2"), with value 1 if the observation holds the maximum value of the variable in its group and 0 otherwise
data have;
call streaminit(4321);
do key1=1 to 10;
do key2=1 to 5;
do x1=rand("uniform");
x2=rand("Normal");
output;
end;
end;
end;
run;
proc means data=have noprint;
by key1;
var x1 x2;
output out=max
max= / autoname;
run;
data want;
merge have max;
by key1;
drop _:;
run;
proc sql;
title "MAX";
select name into :MAXvars separated by ' '
from dictionary.columns
WHERE LIBNAME="WORK" AND MEMNAME="WANT" AND NAME like '%_Max'
order by name;
quit;
title;
data want; set want;
array MAX (*) &MAXvars;
array XVars (*) x1 x2;
array Indicators (*) MAX_X1 MAX_X2;
do i=1 to dim(MAX);
if XVars[i]=MAX[i] then Indicators[i]=1; else Indicators[i]=0;
end;
drop i;
run;
Thanks for any suggestion of optimization
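PROC SQL can be used with a GROUP BY clause to apply summary functions across the values of a variable. When detail columns are selected alongside a summary function, SAS remerges the group statistic back onto each row, so every row can be compared with its group's maximum.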
data have;
call streaminit(4321);
do key1=1 to 10;
do key2=1 to 5;
do x1=rand("uniform");
x2=rand("Normal");
output;
end;
end;
end;
run;
proc sql;
create table want
as select
key1,
key2,
x1,
x2,
case
when x1 = max(x1) then 1
else 0 end as max_x1,
case
when x2 = max(x2) then 1
else 0 end as max_x2
from have
group by key1
order by key1, key2;
quit;
It is also possible to do this in a single data step, provided that you read the input dataset twice - this is an example of a double DOW-loop.
data have;
call streaminit(4321);
do key1=1 to 10;
do key2=1 to 5;
do x1=rand("uniform");
x2=rand("Normal");
output;
end;
end;
end;
run;
/*Sort by key1 (or generate index) if not already sorted*/
proc sort data = have;
by key1;
run;
data want;
if 0 then set have;
array xvars[3,2] x1 x2 x1_max_flag x2_max_flag t_x1_max t_x2_max;
/*1st DOW-loop*/
do _n_ = 1 by 1 until(last.key1);
set have;
by key1;
do i = 1 to 2;
xvars[3,i] = max(xvars[1,i],xvars[3,i]);
end;
end;
/*2nd DOW-loop*/
do _n_ = 1 to _n_;
set have;
do i = 1 to 2;
xvars[2,i] = (xvars[1,i] = xvars[3,i]);
end;
output;
end;
drop i t_:;
run;
This may be a bit complicated to understand, so here's a rough explanation of how it flows:
1. Read one by-group with the first DOW-loop, updating rolling max variables as each row is read in. Don't output anything yet.
2. Now read the same by-group again using the second DOW-loop, checking to see whether each row is equal to the rolling max and outputting each row.
3. Go back to the first DOW-loop, read the next by-group and repeat.
I want to use proc compare to update dataset on a daily basis.
work.HAVE1
Date Key Var1 Var2
01Aug2013 K1 a 2
01Aug2013 K2 a 3
02Aug2013 K1 b 4
work.HAVE2
Date Key Var1 Var2
01Aug2013 K1 a 3
01Aug2013 K2 a 3
02Aug2013 K1 b 4
03Aug2013 K2 c 1
Date and Key together uniquely identify a record.
How can I use the above two tables to construct the following
work.WANT
Date Key Var1 Var2
01Aug2013 K1 a 3
01Aug2013 K2 a 3
02Aug2013 K1 b 4
03Aug2013 K2 c 1
I don't want to delete the previous data and then rebuild it. I want to modify it by appending new records at the bottom and adjusting the values in VAR1 or VAR2.
I'm struggling with proc compare but it just doesn't return what I want.
proc compare base=work.HAVE1 compare=work.HAVE2 out=WORK.DIFF outnoequal outcomp;
id Date Key;
run;
This will give you new and changed (unequal records) in single dataset WORK.DIFF. You'll have to distinguish new vs changed yourself.
However, what you want to achieve is actually a MERGE: it inserts new records and overwrites existing ones, though perhaps for performance reasons you don't want to re-create the full table.
data work.WANT;
merge work.HAVE1 work.HAVE2;
by Date Key;
run;
Edit1:
/* outdiff option will produce records with _type_ = 'DIF' for matched keys */
proc compare base=work.HAVE1 compare=work.HAVE2 out=WORK.RESULT outnoequal outcomp outdiff;
id Date Key;
run;
data WORK.DIFF_KEYS; /* keys of changed records */
set WORK.RESULT;
where _type_ = 'DIF';
keep Date Key;
run;
/* split NEW and CHANGED */
data
WORK.NEW
WORK.CHANGED
;
merge
WORK.RESULT (where=( _type_ ne 'DIF'))
WORK.DIFF_KEYS (in = d)
;
by Date Key;
if d then output WORK.CHANGED;
else output WORK.NEW;
run;
Edit2:
Now you can just APPEND the WORK.NEW to target table.
For WORK.CHANGED - either use MODIFY or UPDATE statement to update the records.
Depending on the size of the changes, you can also think about PROC SQL; DELETE to delete old records and PROC APPEND to add new values.
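The MODIFY route mentioned above could be sketched as follows, updating HAVE1 in place instead of rebuilding it. This is a minimal sketch that rebuilds the sample tables from the question and assumes Date and Key are unique in both; the SELECT on _IORC_ with the %SYSRC mnemonics follows the usual master/transaction pattern, and the message text is made up.

```sas
data work.have1;
  input Date :date9. Key $ Var1 $ Var2;
  format Date date9.;
  datalines;
01Aug2013 K1 a 2
01Aug2013 K2 a 3
02Aug2013 K1 b 4
;
run;

data work.have2;
  input Date :date9. Key $ Var1 $ Var2;
  format Date date9.;
  datalines;
01Aug2013 K1 a 3
01Aug2013 K2 a 3
02Aug2013 K1 b 4
03Aug2013 K2 c 1
;
run;

/* HAVE1 is the master, HAVE2 the transaction data set */
data work.have1;
  modify work.have1 work.have2;
  by Date Key;
  select (_iorc_);
    when (%sysrc(_sok)) replace;   /* key found: overwrite the row in place */
    when (%sysrc(_dsenmr)) do;     /* key not found in the master */
      _error_ = 0;                 /* clear the error set by the failed lookup */
      output;                      /* append it as a new row */
    end;
    otherwise do;
      put 'Unexpected I/O condition: ' _iorc_=;
      _error_ = 0;
      stop;
    end;
  end;
run;
```

Because MODIFY changes the data set in place, it is worth trying this on a copy first; an interrupted step can leave the master partially updated.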
All PROC COMPARE will do is tell you the differences between two datasets. To achieve your goal you need to use an UPDATE statement in a data step. This way, values in HAVE1 are updated with HAVE2 where the date and key match, and a new record is inserted where there is no match.
data have1;
input Date :date9. Key $ Var1 $ Var2;
format date date9.;
datalines;
01Aug2013 K1 a 2
01Aug2013 K2 a 3
02Aug2013 K1 b 4
;
run;
data have2;
input Date :date9. Key $ Var1 $ Var2;
format date date9.;
datalines;
01Aug2013 K1 a 3
01Aug2013 K2 a 3
02Aug2013 K1 b 4
03Aug2013 K2 c 1
;
run;
data want;
update have1 have2;
by date key;
run;
It is a simple one, but I'm struggling a bit.
What I have :
What I want :
I want to remove the v0 , v1 and etc.
I'm using this piece of code
data IndieDay20140704;
set IndieDay20140704;
do i=1 to 5;
VAR1=tranwrd(var1,"v&i","");
end;
run;
It is not working correctly: it gives me this instead (see below), plus the following warning:
WARNING: Apparent symbolic reference I not resolved.
Questions:
1) Do I need a macro?
2) Why the error?
Many thanks for your insights.
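The warning appears because "v&i" inside double quotes makes SAS look for a macro variable named i, which you (unintentionally) referenced but never defined. The data step variable i is not a macro variable, so &i cannot resolve to it.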
I guess the idea of tranwrd is to remove words in VAR2, VAR3.. from VAR1.
The logical error is to do it also for VAR1 itself.
Check if this helps (using array):
data IndieDay20140704;
length VAR1 VAR2 VAR3 VAR4 VAR5 $10;
VAR1 = 'TEST IT';VAR5 = 'TEST';
output;
VAR1 = 'STEST IT';VAR5 = 'TEST';
output;
run;
data IndieDay20140704_modified / view= IndieDay20140704_modified;
set IndieDay20140704;
array vals VAR1 - VAR5;
do i=1 to dim(vals);
if i ne 1 then VAR1=tranwrd(var1,trim(vals(i)),"");
end;
drop i;
run;
Here I'm creating a SAS view on top of table (not a good idea to overwrite the source).
Also I think you should trim() the values from VAR2,VAR3... depending on what you want to achieve and what's in the data.
EDIT:
here the version with 'v0', 'v1'...'v5' strings:
data IndieDay20140704;
length VAR1$10;
VAR1 = 'TEST v0';
output;
VAR1 = 'TEST v11';
output;
VAR1 = 'TEST v1';
output;
run;
data IndieDay20140704_modified / view= IndieDay20140704_modified;
set IndieDay20140704;
org_var1 = var1;
do i=0 to 5;
var1 =tranwrd(var1, catt('v', put(i, 1. -L)),"");
end;
run;
catt('v', put(i, 1. -L)) concatenates the string 'v' and the result of put.
put(i, 1. -L) converts the numeric variable i to text using the plain numeric format w.d (1. here, which is enough for single-digit numbers); -L left-aligns the result.
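The difference between macro resolution and data step concatenation can be checked in the log with a small sketch like this one. The variable names literal and token are made up for illustration.

```sas
data _null_;
  do i = 0 to 2;
    literal = 'v&i';         /* single quotes: no macro resolution, the text stays v&i */
    token   = cats('v', i);  /* built at run time from the data step variable i */
    put literal= token=;
  end;
run;
```

Here token takes the values v0, v1 and v2, while literal never changes, because the macro processor does not look inside single quotes and has no access to data step variables in any case.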
Here's one way; there are many others, and this may not work if your data has a lot of variability.
data have;
length VAR1 $11;
VAR1 = 'fic19v0.csv';
output;
VAR1 = 'fic19v1.csv';
output;
run;
data want ;
set have;
original_var=var1;
var1=substr(var1, 1, index(var1, ".")-3)||".csv";
run;