I'm trying to understand the behaviour of the code below. It seems wrong to me but I'd appreciate someone else taking a look in case I'm just going mad or something.
data testdata;
length var1 var2 $ 10;
var1 = "house"; var2 = ""; output;
var1 = "house"; var2 = "car"; output;
var1 = "house"; var2 = "house"; output;
run;
proc sql;
* Select all three obs- ok;
create table try1 as select * from testdata where indexw("house", var1);
* Selects one obs - ok;
create table try2 as select * from testdata where indexw("house", var2);
* Selects one obs - ok;
create table try3 as select * from testdata where indexw("car", var2);
* Selects all three obs - why?;
create table try4 as select * from testdata where indexw("house", var1) and indexw("house", var2);
* Selects one obs - ok;
create table try5 as select * from testdata where indexw("house", var1) and indexw("car", var2);
* Explicit comparison to zero - selects one obs - ok;
create table try6 as select * from testdata where indexw("house", var1) and (indexw("house", var2) ne 0);
* Compare to VAR2 first - selects one obs - ok;
create table try7 as select * from testdata where indexw("house", var2) and indexw("house", var1);
quit;
When I run this code, TRY1 has three observations and TRY2 has one - the ones with VAR1="house" and VAR2="house" respectively. This is what I would expect, and based on that, I would expect TRY4 to only contain the single observation where both VAR1 and VAR2 are "house". Instead TRY4 selects all three observations from the input.
Even more strangely, TRY6 uses an explicit compare to zero and only selects one observation as expected, as does TRY7, which reverses the order of the comparisons.
A similar thing happens in data step if I use a WHERE statement, but not if I use an IF statement:
114 data t;
115 set testdata;
116 where indexw("house", var1) and indexw("house", var2);
117 run;
NOTE: There were 3 observations read from the data set WORK.TESTDATA.
WHERE not (not INDEXW('house', var1));
NOTE: The data set WORK.T has 3 observations and 2 variables.
NOTE: DATA statement used (Total process time):
real time 0.00 seconds
cpu time 0.00 seconds
118 data t;
119 set testdata;
120 if indexw("house", var1) and indexw("house", var2);
121 run;
NOTE: There were 3 observations read from the data set WORK.TESTDATA.
NOTE: The data set WORK.T has 1 observations and 2 variables.
NOTE: DATA statement used (Total process time):
real time 0.00 seconds
cpu time 0.00 seconds
Note that the first data step outputs all three observations, while the second only outputs one.
The log for the first step reveals a clue - it looks like SAS has re-interpreted the WHERE clause in a way that changes its meaning.
What's going on here?
Run in 64-bit SAS 9.4 (TS1M7) on Windows 10.
The bug is in SAS at least as far back as 9.4M5 also. Contact SAS directly to get details on this bug. You can raise a ticket at: http://support.sas.com
As I remember it is related to using the same function (possibly it also requires that both usages use the same first argument?) and also as your example shows treating the function result as a boolean.
The similarity in the two expressions is confusing the optimization logic. The work around of adding a explicit comparison operator that you showed works to avoid the bug.
where 0<indexw("house", var1) and 0<indexw("house", var2);
Related
I have a table which contains one key id and 100 variables (x1, x2, x3 ..... x100) and i need to check every variables if there are any values stored as -9999, -8888, -7777, -6666 in of them.
For one variable i use
proc sql;
select keyid, x1
from mytable
where x1 in(-9999,-8888,-7777,-6666);
quit;
This is the data i am trying to get but it is just for one variable.
I do not have time for copying and pasting all the variables (100 times) in this basic query.
I have searched the forum but the answers i have found are a bit far from what i actually need
and since i am new to SAS i can not write a macro.
Can you help me please?
Thanks.
Try this. Just made up some sample data that resembles what you describe :-)
data have;
do key = 1 to 1e5;
array x x1 - x100;
do over x;
x = rand('integer', -10000, -5000);
end;
output;
end;
run;
data want;
set have;
array x x1 - x100;
do over x;
if x in (-9999, -8888, -7777, -6666) then do;
output;
leave;
end;
end;
run;
Don't use SQL. Instead use normal SAS code so you can take advantage of SAS syntax like ARRAYs and variable lists.
So make an array containing the variable you want to look at. Then loop over the array. There is no need to keep looking once you find one.
data want;
set mytable;
array list var1 varb another_var x1-x10 Z: ;
found=0;
do index=1 to dim(list) until (found);
found = ( list[index] in (-9999 -8888 -7777 -6666) );
end;
if found;
run;
And if you want to search all of the numeric variables you can even use the special variable list _NUMERIC_ when defining the array:
array list _numeric_;
thank you for your help i have found a solution and wanted to share it with you.
It has some points that needs to be evaluated but it is fine for me now. (gets the job done)
`%LET LIB = 'LIBRARY';
%LET MEM = 'GIVENTABLE';
%PUT &LIB &MEM;
PROC SQL;
SELECT
NAME INTO :VARLIST SEPARATED BY ' '
FROM DICTIONARY.COLUMNS
WHERE
LIBNAME=&LIB
AND
MEMNAME=&MEM
AND
TYPE='num';
QUIT;
%PUT &VARLIST;
%MACRO COUNTS(INPUT);
%LOCAL i NEXT_VAR;
%DO i=1 %TO %SYSFUNC(COUNTW(&VARLIST));
%LET NEXT_VAR = %SCAN(&VARLIST, &i);
PROC SQL;
CREATE TABLE &NEXT_VAR AS
SELECT
COUNT(ID) AS NUMBEROFDESIREDVALUES
FROM &INPUT
WHERE
&NEXT_VAR IN (6666, 7777, 8888, 9999)
GROUP BY
&NEXT_VAR;
QUIT;
%END;
%MEND;
%COUNTS(GIVENTABLE);`
The answer you provided to your own question gives more insight to what you really wanted. However, the solution you offered while it works is not very efficient. The SQL statement runs 100 times for each variable in the source data. That means the source table is read 100 times. Another problem is that it creates 100 output tables. Why?
A better solution is to create 1 table that contains the counts for each of the 100 variables. Even better is to do it in 1 pass of the source data instead of 100.
data sum;
set have end=eof;
array x(*) x:;
array csum(100) _temporary_;
do i = 1 to dim(x);
x(i) = (x(i) in (-9999, -8888, -7777, -6666)); * flag (0 or 1) those meeting criteria;
csum(i) + x(i); * cumulative count;
if eof then do;
x(i) = csum(i); * move the final total to the orig variable;
end;
end;
if eof then output; * only output the final obs which has the totals;
drop key i;
run;
Partial result:
x1 x2 x3 x4 x5 x6 x7 x8 ...
90 84 88 85 81 83 59 71 ...
You can keep it in that form or you can transpose it.
proc transpose data=sum out=want (rename=(col1=counts))
name=variable;
run;
Partial result:
variable counts
x1 90
x2 84
x3 88
x4 85
x5 81
... ...
I have a SAS dataset with 700 columns (variables). For all 700 of them, I want to cap all values below the 1st percentile to the 1st percentile and all values above the 99th percentile to the 99th percentile. I want to do this iteratively for all 700 variables without having to specify their names explicitly.
How can I do this?
Perhaps slightly easier than the hash table - and somewhat faster, I believe - is using the horizontal output of proc means, and then using an array.
proc means data=sashelp.prdsale;
var _numeric_;
output out=quantiles p1= p99= /autoname;
run;
proc sql;
select name
into :numlist separated by ' '
from dictionary.columns
where libname='SASHELP' and memname='PRDSALE' and type='num';
quit;
data prdsale_capped;
set sashelp.prdsale;
if _n_ eq 1 then set quantiles;
array vars &numlist.;
array p1 actual_p1--month_p1;
array p99 actual_p99--month_p99;
do _i = 1 to dim(vars);
vars[_i] = max(min(vars[_i],p99[_i]),p1[_i]);
end;
run;
Basically it's just setting up three arrays - vars, p1, p99 - and then you have all 3 values for every numeric variable on the PDV and can just compare during a single array traversal.
For a production process I'd probably not use the -- but instead make 3 lists from proc sql and make 100% sure they're in the same order by using an order by.
You can do this with proc means and a hash table lookup. Let's create some test data with 100 variables pulled from a normal distribution. For testing, we'll change all the variables in the first and second rows to really big and really small numbers.
Our approach: create a lookup table where we can find the variable's name, pull its percentiles, and compare its value against those percentiles.
data have;
array var[100];
do i = 1 to 100;
do j = 1 to dim(var);
var[j] = rand('normal');
/* Test values */
if(i = 1) then var[j] = 99999;
if(i = 2) then var[j] = -99999;
end;
output;
end;
drop i j;
run;
Data:
var1 var2 var3 ...
99999 99999 99999 ...
-99999 -99999 -99999 ...
-0.149875111 0.4455504523 -0.783127138 ...
-0.731432437 -0.572508065 -1.044928486 ...
0.0108184539 1.0605591996 1.9132874927 ...
... ... ... ...
Let's get all the percentiles with proc means. You might be tempted to use output out=, but it does not create the data in a vertical lookup table that's easy for us to use in this manner; however, the stackODSOutput option on proc means does. More info on this from Rick Wicklin.
We'll use ods select none so we don't render a large table but still produce the dataset that drives the table.
/* Get a dataset of all 1st and 99th percentiles for each variable */
ods select none;
proc means data=have stackODSOutput p1 p99;
var var1-var100;
ods output summary = percentiles;
run;
ods select all;
Note that all the percentiles will be the same in this case. This is expected. We set all the variables in the first and second rows to the same big and small numbers for easy testing.
Data:
Variable P1 P99 ...
var1 -50001 50001 ...
var2 -50001 50001 ...
var3 -50001 50001 ...
var4 -50001 50001 ...
... ... ... ...
Now we'll use our lookup approach. We know our variable names and we can store them in an array. We can loop through that array, look up the variable in the hash table by name with vname(), and get its percentile.
data want;
set have;
array var[*] var1-var100;
/* Load a table of these values into memory and search for each percentile.
Think of this like a simple lookup table that floats out in memory.
*/
if(_N_ = 1) then do;
length variable $32.;
dcl hash pctiles(dataset: 'percentiles');
pctiles.defineKey('variable');
pctiles.defineData('p1', 'p99');
pctiles.defineDone();
call missing(p1, p99);
end;
/* Get the 1st and 99th percentile of each variable.
If the variable's name matches the variable name
in the hash table, check the variable's value
against the lookup percentile.
Cap it if it's above or below the percentile.
*/
do i = 1 to dim(var);
if(pctiles.Find(key:vname(var[i]) ) = 0) then do;
if(var[i] < p1) then var[i] = p1;
else if(var[i] > p99) then var[i] = p99;
end;
end;
drop i variable p1 p99;
run;
Output:
var1 var2 var3 ...
50000.532908 50000.721522 99999 ...
-50000.61447 -50000.92196 -50001.19549 ...
-0.149875111 0.4455504523 -0.783127138 ...
-0.731432437 -0.572508065 -1.044928486 ...
0.0108184539 1.0605591996 1.9132874927 ...
... ... ... ...
If your variables do not follow an easy sequential name, you can use the -- shortcut. For example, varA varB varC varD can be selected by varA--varD.
Say I have two data sets A and B that have identical variables and want to rank values in B based on values in A, not B itself (as "PROC RANK data=B" does.)
Here's a simplified example of data sets A, B and want (the desired output):
A:
obs_A VAR1 VAR2 VAR3
1 10 100 2000
2 20 300 1000
3 30 200 4000
4 40 500 3000
5 50 400 5000
B:
obs_B VAR1 VAR2 VAR3
1 15 150 2234
2 14 352 1555
3 36 251 1000
4 41 350 2011
5 60 553 5012
want:
obs VAR1 VAR2 VAR3
1 2 2 3
2 2 4 2
3 4 3 1
4 5 4 3
5 6 6 6
I come up with a macro loop that involves PROC RANK and PROC APPEND like below:
%macro MyRank(A,B);
data AB; set &A &B; run;
%do i=1 %to 5;
proc rank data=AB(where=(obs_A ne . OR obs_B=&i) out=tmp;
var VAR1-3;
run;
proc append base=want data=tmp(where=(obs_B=&i) rename=(obs_B=obs)); run;
%end;
%mend;
This is ok when the number of observations in B is small. But when it comes to very large number, it takes so long and thus wouldn't be a good solution.
Thanks in advance for suggestions.
I would create formats to do this. What you're really doing is defining ranges via A that you want to apply to B. Formats are very fast - here assuming "A" is relatively small, "B" can be as big as you like and it's always going to take just as long as it takes to read and write out the B dataset once, plus a couple read/writes of A.
First, reading in the A dataset:
data ranking_vals;
input obs_A VAR1 VAR2 VAR3;
datalines;
1 10 100 2000
2 20 300 1000
3 30 200 4000
4 40 500 3000
5 50 400 5000
;;;;
run;
Then transposing it to vertical, as this will be the easiest way to rank them (just plain old sorting, no need for proc rank).
data for_ranking;
set ranking_vals;
array var[3];
do _i = 1 to dim(var);
var_name = vname(var[_i]);
var_value = var[_i];
output;
end;
run;
proc sort data=for_ranking;
by var_name var_value;
run;
Then we create a format input dataset, and use the rank as the label. The range is (previous value -> current value), and label is the rank. I leave it to you how you want to handle ties.
data for_fmt;
set for_ranking;
by var_name var_value;
retain prev_value;
if first.var_name then do; *initialize things for a new varname;
rank=0;
prev_value=.;
hlo='l'; *first record has 'minimum' as starting point;
end;
rank+1;
fmtname=cats(var_name,'F');
start=prev_value;
end=var_value;
label=rank;
output;
if last.var_name then do; *For last record, some special stuff;
start=var_value;
end=.;
hlo='h';
label=rank+1;
output; * Output that 'high' record;
start=.;
end=.;
label=.;
hlo='o';
output; * And a "invalid" record, though this should never happen;
end;
prev_value=var_value; * Store the value for next row.;
run;
proc format cntlin=for_fmt;
quit;
And then we test it out.
data test_b;
input obs_B VAR1 VAR2 VAR3;
var1r=put(var1,var1f.);
var2r=put(var2,var2f.);
var3r=put(var3,var3f.);
datalines;
1 15 150 2234
2 14 352 1555
3 36 251 1000
4 41 350 2011
5 60 553 5012
;;;;
run;
One way that you can rank by a variable from a separate dataset is by using proc sql's correlated subqueries. Essentially you counts the number of lower values in the lookup dataset for each value in the data to be ranked.
proc sql;
create table want as
select
B.obs_B,
(
select count(distinct A.Var1) + 1
from A
where A.var1 <= B.var1.
) as var1
from B;
quit;
Which can be wrapped in a macro. Below, a macro loop is used to write each of the subqueries. It looks through the list of variable and parametrises the subquery as required.
%macro rankBy(
inScore /*Dataset containing data to be ranked*/,
inLookup /*Dataset containing data against which to rank*/,
varID /*Variable by which to identify an observation*/,
varsRank /*Space separated list of variable names to be ranked*/,
outData /*Output dataset name*/);
/* Rank variables in one dataset by identically named variables in another */
proc sql;
create table &outData. as
select
scr.&varID.
/* Loop through each variable to be ranked */
%do i = 1 %to %sysfunc(countw(&varsRank., %str( )));
/* Store the variable name in a macro variable */
%let var = %scan(&varsRank., &i., %str( ));
/* Rank: count all the rows with lower value in lookup */
, (
select count(distinct lkp&i..&var.) + 1
from &inLookup. as lkp&i.
where lkp&i..&var. <= scr.&var.
) as &var.
%end;
from &inScore. as scr;
quit;
%mend rankBy;
%rankBy(
inScore = B,
inLookup = A,
varID = obs_B,
varsRank = VAR1 VAR2 VAR3,
outData = want);
Regarding speed, this will be slow if your A is large, but should be okay for large B and small A.
In rough testing on a slow PC I saw:
A: 1e1 B: 1e6 time: ~1s
A: 1e2 B: 1e6 time: ~2s
A: 1e3 B: 1e6 time: ~5s
A: 1e1 B: 1e7 time: ~10s
A: 1e2 B: 1e7 time: ~12s
A: 1e4 B: 1e6 time: ~30s
Edit:
As Joe points out below the length of time the query takes depends not just on the number of observations in the dataset, but how many unique values exist within the data. Apparently SAS performs optimisations to reduce the comparisons to only the distinct values in B, thereby reducing the number of times the elements in A need to be counted. This means that if the dataset B contains a large number of unique values (in the ranking variables) the process will take significantly longer then the times shown. This is more likely to happen if your data is not integers as Joe demonstrates.
Edit:
Runtime test rig:
data A;
input obs_A VAR1 VAR2 VAR3;
datalines;
1 10 100 2000
2 20 300 1000
3 30 200 4000
4 40 500 3000
5 50 400 5000
;
run;
data B;
do obs_B = 1 to 1e7;
VAR1 = ceil(rand("uniform")* 60);
VAR2 = ceil(rand("uniform")* 500);
VAR3 = ceil(rand("uniform")* 6000);
output;
end;
run;
%let start = %sysfunc(time());
%rankBy(
inScore = B,
inLookup = A,
varID = obs_B,
varsRank = VAR1 VAR2 VAR3,
outData = want);
%let time = %sysfunc(putn(%sysevalf(%sysfunc(time()) - &start.), time12.2));
%put &time.;
Output:
0:00:12.41
I want to use dataset B to overwrite some values in dataset A by merging dataset A & B with a merging ID. However it doesn't work as expected. Here is the test I did:
/* create table A */
data a;
infile datalines;
input id1 $ id2 $ var1;
datalines;
1 a 10
1 b 10
2 a 10
2 b 10
;
run;
/* create table B */
data b;
infile datalines;
input id1 $ var1 var2;
datalines;
1 20 30
2 20 30
;
run;
/* merge A&B to overwrite var1 in table A using values in table B */
data c;
merge a b;
by id1;
run;
Table C looks like this:
ID1 ID2 VAR1 VAR2
1 a 20 30
1 b 10 30
2 a 20 30
2 b 10 30
Why the 10s in row 2&4 didn't get replaced by 20 from table B? While var2 works as expected?
I know I can do this simply using proc SQL, and that's what I did to solve the problem. But I still quite curious if there is a way to do what I wanted using merge? And why this wasn't working? I prefer merge over SQL in this circumstance because the logic is easier to implement (util I found this not working properly).
I use SAS 9.4.
This has to do with how SAS iterates over the data sets during the merge. Basically, the second record for each of A doesn't get lined up with a record from B. The value of VAR2 is carried over from the previous record. VAR1 gets its value from A (because there is no B).
IF there is record in B for EVERY ID1, then you can rewrite your merge like this to achieve what you want.
/* merge A&B to overwrite var1 in table A using values in table B */
data c;
merge a(drop=var1) b;
by id1;
run;
This drops the VAR1 from A so that it is carried down from the record in B.
Otherwise you will need more complex logic (might I suggest an SQL left join with the coalesce() function?).
Like DomPazz suggests, proc sql is the way to do this. merge will only keep one value from each data set. The coalesce function pick the first non-missing value from the list, so it uses var1 from b, but if b.var1 is null then it uses a.var1.
proc sql;
create table c as
select
a.id1,
a.id2,
coalesce(b.var1,a.var1) as var1,
b.var2
from
a
left join b
on a.id1 = b.id1
;
quit;
The merge method could still work fine, you would just need to be more explicit about how to choose the 'best' value for var1, such as:
data c (drop = a_var1 b_var1);
merge a(rename=(var1 = a_var1))
b(rename=(var1 = b_var1));
by id1;
* Now you have two different variables named a_var1 and b_var1;
* Implement logic to choose your favorite;
if NOT MISSING(b_var1) Then DO;
var1 = b_var1;
var1_source='B';
END;
else DO;
var1 = a_var1;
var1_source='A';
END;
run;
If your criteria for which 'var1' to choose is as simple as 'If b exists, use it' then this is identical to the the SQL method with coalesce().
Where I've found this method useful is for more complicated criteria, plus its always nice to know the source of the data (which doesn't happen with coalesce()).
I have a PROC EXPORT question that I am wondering if you can answer.
I have a SAS dataset with 800+ variables and over 200K observations and I am trying to export a subset of the variables to a CSV file (i.e. I need all records; I just don’t want all 800+ variables). I can always create a temporary dataset “KEEP”ing just the fields I need and run the EXPORT on that temp dataset, but I am trying to avoid the additional step because I have a large number of records.
To demonstrate this, consider a dataset that has three variables named x, y and z. But, I want the text file generated through PROC EXPORT to only contain x and y. My attempt at a solution below does not quite work.
The SAS Code
When I run the following code, I don’t get exactly what I need. If you run this code and look at the text file that was generated, it has a comma at the end of every line and the header includes all variables in the dataset anyway. Also, I get some messages in the log that I shouldnt be getting.
data ds1;
do x = 1 to 100;
y = x * x;
z = x * x * x;
output;
end;
run;
proc export data=ds1(keep=x y)
file='c:\test.csv'
dbms=csv
replace;
quit;
Here are the first few lines of the text file that was generated ("C:\test.csv")
x,y,z
1,1,
2,4,
3,9,
4,16,
The SAS Log
9343 proc export data=ds1(keep=x y)
9344 file='c:\test.csv'
9345 dbms=csv
9346 replace;
9347 quit;
9348 /**********************************************************************
9349 * PRODUCT: SAS
9350 * VERSION: 9.2
9351 * CREATOR: External File Interface
9352 * DATE: 30JUL12
9353 * DESC: Generated SAS Datastep Code
9354 * TEMPLATE SOURCE: (None Specified.)
9355 ***********************************************************************/
9356 data _null_;
9357 %let _EFIERR_ = 0; /* set the ERROR detection macro variable */
9358 %let _EFIREC_ = 0; /* clear export record count macro variable */
9359 file 'c:\test.csv' delimiter=',' DSD DROPOVER lrecl=32767;
9360 if _n_ = 1 then /* write column names or labels */
9361 do;
9362 put
9363 "x"
9364 ','
9365 "y"
9366 ','
9367 "z"
9368 ;
9369 end;
9370 set DS1(keep=x y) end=EFIEOD;
9371 format x best12. ;
9372 format y best12. ;
9373 format z best12. ;
9374 do;
9375 EFIOUT + 1;
9376 put x #;
9377 put y #;
9378 put z ;
9379 ;
9380 end;
9381 if _ERROR_ then call symputx('_EFIERR_',1); /* set ERROR detection macro variable */
9382 if EFIEOD then call symputx('_EFIREC_',EFIOUT);
9383 run;
NOTE: Variable z is uninitialized.
NOTE: The file 'c:\test.csv' is:
Filename=c:\test.csv,
RECFM=V,LRECL=32767,File Size (bytes)=0,
Last Modified=30Jul2012:12:05:02,
Create Time=30Jul2012:12:05:02
NOTE: 101 records were written to the file 'c:\test.csv'.
The minimum record length was 4.
The maximum record length was 10.
NOTE: There were 100 observations read from the data set WORK.DS1.
NOTE: DATA statement used (Total process time):
real time 0.04 seconds
cpu time 0.01 seconds
100 records created in c:\test.csv from DS1.
NOTE: "c:\test.csv" file was successfully created.
NOTE: PROCEDURE EXPORT used (Total process time):
real time 0.12 seconds
cpu time 0.06 seconds
Any ideas how I can solve this problem? I am running SAS 9.2 on windows 7.
Any help would be appreciated. Thanks.
Karthik
Based in Itzy's comment to my question, here is the answer and this does exactly what I need.
proc sql;
create view vw_ds1 as
select x, y from ds1;
quit;
proc export data=vw_ds1
file='c:\test.csv'
dbms=csv
replace;
quit;
Thanks for the help!