product of common variables in two datasets - sas

data a1
a b c
2 3 4
1 2 3
data a2
a b d
0 .3 1
0 .2 0
proc sql;
create table a3 as
select a.*, a.a * b.a + a.b * b.b as Value
from a1 a, a2 b;
There are many common columns in a1 and a2 (numeric columns with different values). I want to calculate Value as the 'sumproduct' of those common columns.
I want to avoid writing out something like a.common1 * b.common1 + a.common2 * b.common2 + ...

As far as I can tell, a few preprocessing steps are needed.
Load your data:
data a1 ;
input a b c ;
cards ;
2 3 4
1 2 3
;run ;
data a2 ;
input a b d ;
cards ;
0 0.3 1
0 0.2 0
;run ;
Pull all variable names in A1 and A2 datasets (update your libname if required):
proc sql ;
create table data1 as
select libname, memname, name, label
from sashelp.vcolumn
where libname= 'WORK' and memname in ('A1','A2')
order by name
;quit ;
Keep only variables which are common to both datasets:
data data2 ;
set data1 ;
by name ;
if last.name and not first.name ;
run ;
Put both a list and a count of the common variables into macro variables:
proc sql ;
select name
into :commvarnames separated by ' '
from data2
;
select count(name)
into :commoncount
from data2
;quit ;
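Read in your source datasets: load the first, copy its values into a temporary array (so that they are not overwritten when the second dataset is read), then load the second dataset and do the calculation in a DO loop. Note that this pairs observations by position (row 1 of A1 with row 1 of A2) rather than forming the cross join used in the question: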
data output ;
set a1(keep=&commvarnames) ;
array one(&commoncount) _temporary_ ;
array two(&commoncount) &commvarnames ;
* Load A1 to temporary array ;
do i=1 to &commoncount ;
one(i)=two(i) ;
end ;
* Load A2 to variables ;
set a2(keep=&commvarnames) ;
do i=1 to &commoncount ;
product=sum(product,one(i)*two(i)) ;
end ;
run ;
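If the cross join from the question is what is actually needed, another option is to let PROC SQL build the sum-of-products expression itself. This is only a sketch under a few assumptions: the macro variable name sumprod is made up for illustration, both tables are assumed to live in WORK, and only numeric columns present in both tables are used.

```sas
data a1;
  input a b c;
  cards;
2 3 4
1 2 3
;
run;

data a2;
  input a b d;
  cards;
0 0.3 1
0 0.2 0
;
run;

proc sql noprint;
  /* Build "a.a*b.a + a.b*b.b + ..." from the numeric columns of A1 */
  /* that also exist as numeric columns in A2                       */
  select cats('a.', name, '*b.', name)
    into :sumprod separated by ' + '
    from dictionary.columns
    where libname = 'WORK' and memname = 'A1' and type = 'num'
      and upcase(name) in
          (select upcase(name)
             from dictionary.columns
             where libname = 'WORK' and memname = 'A2' and type = 'num');

  /* Same cross join as in the question, using the generated expression */
  create table a3 as
    select a.*, &sumprod as Value
    from a1 a, a2 b;
quit;

/* Write the generated expression to the log for inspection */
%put sumprod=&sumprod;
```

Because the expression is generated from the dictionary tables, adding another common column to both datasets should not require any change to the query.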

It would take quite a bit of code to make this dynamic. I'd break it down like so:
Get lists of the variables present in each dataset
Merge the lists to get a list of the common variables
Feed this into some array logic in a data step
Will post some code later, but hopefully that's enough to give you some ideas.


How does the SET statement work in a DO loop in SAS?

I have studied the SET statement inside a DO loop in SAS, but I don't understand how the SET statement behaves there.
I created the following example dataset a1:
/* Create data a1 */
data a1 ;
input fruit $ ;
cards ;
melon
apple
orange
;
run ;
proc print data=a1 ;
title "Results of a1" ;
run;
Then I created the following new dataset c1:
/* Create data c1 using a1 -- This is the upper code block */
data c1 ;
do i = 1 to 3 ;
set a1 ;
count + 1 ;
N_VAR = _N_ ;
ERR_VAR = _ERROR_ ;
output ;
end;
run ;
proc print data=c1 LABEL ;
LABEL N_VAR = "_N_" ;
LABEL ERR_VAR = "_ERROR_" ;
title "Results of c1" ;
run ;
Question: Why doesn't the upper code block produce the same output as the lower code block? I don't understand how the SET statement works in a DO loop. What concept am I missing?
/* My expectation for c1 -- This is the lower code block */
data my_expectation ;
input i fruit $ count N ERROR ;
cards ;
1 melon 1 1 0
1 apple 2 2 0
1 orange 3 3 0
2 melon 4 1 0
2 apple 5 2 0
2 orange 6 3 0
3 melon 7 1 0
3 apple 8 2 0
3 orange 9 3 0
;
run;
proc print data=my_expectation label ;
LABEL N = "_N_" ;
LABEL ERROR = "_ERROR_" ;
title "The result that I expected for c1" ;
run ;
Thank you for your attention.
Each SET statement sets up an independent reading stream.
A DATA step is an implicit loop.
After the DO loop iterates 3 times the implicit DATA step loop returns control to the top of the step.
At the second implicit iteration, the DO loop is entered, and in its first iteration the SET statement is reached (for the 4th time). The input data set (A1) has no more observations, so the DATA step ends.
You can observe the flow behavior with this version of your DATA step:
data c1 ;
put 'TOP';
do i = 1 to 3 ;
put i= 'pre SET';
set a1 ;
put i= 'post SET';
count + 1 ;
N_VAR = _N_ ;
ERR_VAR = _ERROR_ ;
output ;
end;
put 'BOTTOM';
run;
Aside:
When a DATA step does not have any explicit OUTPUT statements, the step will implicitly output an observation when control reaches the bottom of the step. Some statements prevent flow from reaching the bottom, such as a DELETE statement or a subsetting IF statement that fails.
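A small sketch of the implicit output behavior, using the same fruit data. The output dataset names (implicit_out, only_apple, no_melon) are made up for illustration.

```sas
data a1;
  input fruit $;
  cards;
melon
apple
orange
;
run;

/* No OUTPUT statement: one observation is written each time   */
/* control reaches the bottom of the step, so 3 observations.  */
data implicit_out;
  set a1;
run;

/* Subsetting IF: when the condition fails, the iteration ends */
/* before the bottom is reached, so only 1 observation.        */
data only_apple;
  set a1;
  if fruit = 'apple';
run;

/* DELETE: the melon iteration never reaches the bottom,       */
/* so 2 observations are written.                              */
data no_melon;
  set a1;
  if fruit = 'melon' then delete;
run;
```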
I answered your "why" question; @Tom showed you how to produce your expected result with a DATA step. The result is a cross join, which SQL can also perform:
data a1 ;
input fruit $ ;
cards ;
melon
apple
orange
;
data replicates;
do i = 1 to 3;
output;
end;
run;
proc sql;
create table want as
select i, a1.*
from replicates cross join a1
;
quit;
If you want to output each observation three times then move the DO loop after the SET.
set a1;
do i=1 to 3; output; end;
If you really want to read through the dataset three times then you either need three separate SET statements
i=1;
set a1;
output;
i=2;
set a1;
output;
i=3;
set a1;
output;
or use POINT= option to explicitly control which observation you are reading with the SET statement.
do i=1 to 3 ;
do p=1 to nobs;
set a1 point=p nobs=nobs ;
output;
end;
end;
stop;
Most DATA steps stop when they read past the end of the input. Since that cannot happen with the POINT= option, you need the STOP statement to prevent the data step from repeating forever.
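Put together as a complete step, the POINT= fragment might look like the sketch below. The variable name pass_obs is an assumption added here to mimic the repeating 1, 2, 3 pattern the question expected for _N_; the rest follows the fragment above.

```sas
data a1;
  input fruit $;
  cards;
melon
apple
orange
;
run;

data c1;
  do i = 1 to 3;                /* read through A1 three times         */
    do p = 1 to nobs;           /* NOBS= is set at compile time        */
      set a1 point=p nobs=nobs; /* direct access to observation number p */
      count + 1;                /* running counter, 1 through 9        */
      pass_obs = p;             /* position within the current pass    */
      output;
    end;
  end;
  stop;                         /* required: POINT= never hits end of file */
run;

proc print data=c1;
  title "c1 built with POINT=";
run;
```

The POINT= and NOBS= variables (p and nobs) are temporary and are not written to c1, which is why p is copied into a separate variable here.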

Add new empty rows to a SAS table with names from another table

Assume I have a table foo which contains a (dynamic) list of new columns (named in row_name) which I want to add to another table have, so that it yields a table want looking e.g. like this:
x y p_14 p_15
1 2 2 99
2 4 7 24
Example data for foo:
id row_name
14 p_14
15 p_15
Example data for have:
x y p Z
1 2 14 2
1 2 15 99
1 2 16 59
2 4 14 7
2 4 15 24
2 4 16 58
What I have so far is the following which is not yet in macro shape:
proc sql;
create table want as
select old.*, t1.p_14, t2.p_15 /* choosing non-duplicate rows */
from (select x, y from have) old
left join (select x, y, z as p_14 from have where p=14) t1
on old.x=t1.x and old.y=t1.y
left join (select x, y, z as p_15 from have where p=15) t2
on old.x=t2.x and old.y=t2.y
;
quit;
Ideally, I am aiming for a macro which takes foo as input and automatically creates all the joins from above. Also, the solution should not produce any warnings in the log. My challenge is how to dynamically choose the correct (non-duplicate) rows.
PS: This is a follow-up to the question "Populate SAS macro-variable using a SQL statement within another SQL statement?" The important bit is that it is not a full transpose, I guess.
You can go from HAVE to WANT with PROC TRANSPOSE.
proc transpose data=have out=want(drop=_name_) prefix=p_ ;
by x y ;
id p ;
var z;
run;
To limit it to the values of P that occur in FOO you could use a macro variable (as long as the number of observations in FOO is small enough).
proc sql noprint ;
select id into :idlist separated by ' ' from foo ;
quit;
proc transpose data=have out=want(drop=_name_) prefix=p_ ;
where p in (&idlist) ;
by x y ;
id p ;
var z;
run;
If the issue is you want variable P_17 to be in the result even if 17 does not appear in HAVE then add a little more complexity. For example add another data step that will force the creation of the empty variables. You can generate the list of variable names from the list of id's in FOO.
proc sql noprint ;
select id , cats('p_',id)
into :idlist separated by ' '
, :varlist separated by ' '
from foo
;
quit;
proc transpose data=have out=want(drop=_name_) prefix=p_ ;
where p in (&idlist) ;
by x y ;
id p ;
var z;
run;
data want ;
set want (keep=x y);
array all &varlist ;
set want ;
run;
Results:
Obs x y p_14 p_15 p_17
1 1 2 2 99 .
2 2 4 7 24 .
If the number of values is too large to store in a single macro variable (limit 64K bytes) you could generate the WHERE statement with a data step to a file and use %INCLUDE to add the WHERE statement into the code.
filename where temp;
data _null_;
set foo end=eof;
file where ;
if _n_=1 then put 'where p in (' @;
put id @ ;
if eof then put ');' ;
run;
proc transpose ... ;
%include where / source2;
...
Use a macro program:
data have;
input x y p Z;
cards;
1 2 14 2
1 2 15 99
1 2 16 59
2 4 14 7
2 4 15 24
2 4 16 58
;
data foo;
input id row_name $;
cards;
14 p_14
15 p_15
;
%macro test(dsn);
proc sql;
select count(*) into:n trimmed from &dsn;
select id into: value separated by ' ' from &dsn;
create table want as
select distinct a.x,a.y,
%do i=1 %to &n;
%let cur=%scan(&value,&i);
t&i..p_&cur
%if &i<&n %then ,;
%else ;
%end;
from have a
%do i=1 %to &n;
%let cur=%scan(&value,&i);
left join have (where=(p=&cur) rename=(z=p_&cur.)) t&i.
on a.x=t&i..x and a.y=t&i..y
%end;
;
quit;
%mend;
%test(foo);

SAS changes all numerics to length 8 even when input data sets define numerics otherwise

I have two input datasets which I need to interleave. The input files have defined lengths for numeric fields depending on the size of the integer. When I interleave the datasets, with either a DATA step or PROC SQL, the lengths of the numeric fields are all reset to the default of 8. Outside of explicitly defining the length for each field in a LENGTH statement, is there an option for SAS to keep the original attributes of the input columns?
More details ...
data A ;
length numeric_variable 3 ;
{input data}
;
data B ;
length numeric_variable 3 ;
{input data}
;
data AB ;
set A B ;
by some_id_variable ;
{stuff};
;
In the data set AB, the variable NUMERIC_VARIABLE is length 8 instead of 3. I can explicitly put another length statement in the "data AB" statement, but I have tons of columns.
Your description is wrong. A data step will set the length based on how it is first defined. If you just select the variable in SQL it keeps its length. However in SQL if you are doing something like UNION that combines variables from different sources then the length will be set to 8.
Example:
data one; length x 3; x=1; run;
data two; length x 5; x=2; run;
data one_two; set one two; run;
data two_one; set two one; run;
proc sql ;
create table sql_one as select * from one;
create table sql_two as select * from two;
create table sql_one_two as select * from one union select * from two;
create table sql_two_one as select * from two union select * from one;
quit;
proc sql;
select memname,name,length
from dictionary.columns
where libname='WORK'
and (memname like '%ONE%'
or memname like '%TWO%')
;
quit;
Results:
Member Name    Column Name    Column Length
-------------------------------------------
ONE            x              3
ONE_TWO        x              3
SQL_ONE        x              3
SQL_ONE_TWO    x              8
SQL_TWO        x              5
SQL_TWO_ONE    x              8
TWO            x              5
TWO_ONE        x              5
So if you want to control how your variables are defined, either add the LENGTH statement as you mentioned, or create a template dataset and reference it in your data steps before referencing the other dataset(s). For complex SQL code you will need to include the LENGTH= option in your SELECT clause to force the lengths of the variables you are creating.
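A minimal sketch of both suggestions, reusing the ONE and TWO datasets from the example. The dataset names template, tmpl_one_two and sql_len_one_two, and the chosen lengths, are assumptions made for illustration.

```sas
data one; length x 3; x=1; run;
data two; length x 5; x=2; run;

/* Zero-observation template that only carries variable definitions */
data template;
  length x 4;
  call missing(x);   /* avoids the "uninitialized" note in the log */
  stop;
run;

/* Referencing the template first fixes the length of X at 4,       */
/* because the DATA step uses the first definition it encounters.   */
data tmpl_one_two;
  if 0 then set template;   /* never executes; compile-time effect only */
  set one two;
run;

/* In SQL, force the length with the LENGTH= column modifier */
proc sql;
  create table sql_len_one_two as
    select x length=3
    from (select * from one union select * from two);
quit;

proc contents data=tmpl_one_two; run;
proc contents data=sql_len_one_two; run;
```

Shortening a numeric variable reduces the range of integers it can store exactly, so the forced length should be at least as large as the longest input length unless the values are known to be small.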
Can you post code that demonstrates the problem?
This code does NOT exhibit a final data set in which the numeric lengths get changed from 3 to 8.
data A; id = 'A'; length x 3; x=1;
data B; id = 'A'; length x 3; x=2;
data AB;
set A B;
by id;
run;
proc contents data=AB; run;
Contents
# Variable Type Len
1 id Char 1
2 x Num 3

Aggregate multiple vars on different groupings in one Proc SQL query

I need to aggregate about ten different vars on different groupings using Proc SQL.
Is there a way to achieve SUM() OVER ([partition_by_clause] order_by_clause) in one SQL query with different PARTITION BY clauses?
I've made an example here:
data have;
infile cards;
input a b c d e;
cards;
1 2 3 4 5
2 2 4 5 6
1 4 3 4 7
3 4 4 5 8
;
run;
proc sql;
create table want as
select *,
sum(a) over (partition by b, c) as a1,
sum(b) over (partition by c, d) as b1,
sum(c) over (partition by d, e) as c1,
sum(d) over (partition by a, c) as d1
from have
;
quit;
I don't want to write multiple SQL queries, each grouping on different vars and calculating one var per step.
Hope that makes sense.
Proc SQL does not implement window functions, so the OVER (PARTITION BY ...) syntax found in other SQL implementations is not available. You can only use PARTITION BY through pass-through SQL to a connection that supports such syntax.
You could perform such a computation in DATA step using hashes.
data have;
infile cards;
input a b c d e ;
cards;
1 2 3 4 5
2 2 4 5 6
1 4 3 4 7
3 4 4 5 8
;
run;
data want;
if 0 then set have;
length a1 b1 c1 d1 8;
declare hash a1s();
a1s.defineKey('b', 'c');
a1s.defineData('a1');
a1s.defineDone();
declare hash b1s();
b1s.defineKey('c', 'd');
b1s.defineData('b1');
b1s.defineDone();
declare hash c1s();
c1s.defineKey('d', 'e');
c1s.defineData('c1');
c1s.defineDone();
declare hash d1s();
d1s.defineKey('a', 'c');
d1s.defineData('d1');
d1s.defineDone();
do while (not end);
set have end=end;
if a1s.find() = 0 then a1+a; else a1=a; a1s.replace();
if b1s.find() = 0 then b1+b; else b1=b; b1s.replace();
if c1s.find() = 0 then c1+c; else c1=c; c1s.replace();
if d1s.find() = 0 then d1+d; else d1=d; d1s.replace();
end;
do while (not last);
set have end=last;
a1s.find();
b1s.find();
c1s.find();
d1s.find();
output;
end;
format _numeric_ 4.;
stop;
run;
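As an alternative to the hash approach above, the same group totals can be computed in a single PROC SQL statement by joining the table to one grouped inline view per partition. This is a sketch, not a drop-in replacement: the table name want_sql and the aliases h and g1 through g4 are made up, and unlike the DATA step the row order of the result is not guaranteed.

```sas
data have;
  input a b c d e;
  cards;
1 2 3 4 5
2 2 4 5 6
1 4 3 4 7
3 4 4 5 8
;
run;

proc sql;
  create table want_sql as
    select h.*, g1.a1, g2.b1, g3.c1, g4.d1
    from have h
      /* sum of A within each (b, c) group */
      inner join (select b, c, sum(a) as a1 from have group by b, c) g1
        on h.b = g1.b and h.c = g1.c
      /* sum of B within each (c, d) group */
      inner join (select c, d, sum(b) as b1 from have group by c, d) g2
        on h.c = g2.c and h.d = g2.d
      /* sum of C within each (d, e) group */
      inner join (select d, e, sum(c) as c1 from have group by d, e) g3
        on h.d = g3.d and h.e = g3.e
      /* sum of D within each (a, c) group */
      inner join (select a, c, sum(d) as d1 from have group by a, c) g4
        on h.a = g4.a and h.c = g4.c;
quit;
```

With about ten variables the join list gets long, so generating it with a macro loop may be worth considering.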

modify multiple observations in a by variable

data a1
col1 col2 flag
a 2 .
b 3 .
a 4 .
c 1 .
For data a1, flag is always missing. I want to update multiple rows using a2.
data a2
col1 flag
a 1
Ideal output:
col1 col2 flag
a 2 1
b 3 .
a 4 1
c 1 .
But this doesn't update all the records in each BY group:
data a1;
modify a1 a2;
by col1;
run;
Question edited
Actually a1 is a very large data set on server. Hence I prefer to modify it (if possible) instead of creating a new one. Otherwise I have to drop previous a1 first and copy a new a1 from local to server, which will take much more time.
If you want to do this with MODIFY, you have to loop over the modify dataset in some fashion or it will only replace the first row (because the other dataset will then run out of records - normally this behaves like merge, where once it finds a match it advances to next record). Here's one option - there are others.
data a1(index=(col1));
input col1 $ col2 flag;
datalines;
a 2 .
b 3 .
a 4 .
c 1 .
;;;;
run;
data a2(index=(col1));
col1='a';
flag=1;
run;
data a1;
set a2(rename=(flag=flag2));
do _n_ = 1 to nobs_a1;
modify a1 key=col1 nobs=nobs_a1;
if _iorc_=0 then do;
flag=flag2;
replace;
end;
end;
if _iorc_=%sysrc(_DSENOM) then _error_=0;
run;
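Another way to change the table in place, offered only as a sketch, is an SQL UPDATE with a correlated subquery. It assumes each col1 value appears at most once in a2; whether it performs acceptably on a very large server table would need to be tested.

```sas
data a1;
  input col1 $ col2 flag;
  datalines;
a 2 .
b 3 .
a 4 .
c 1 .
;
run;

data a2;
  col1 = 'a';
  flag = 1;
run;

proc sql;
  /* Update A1 in place: only rows whose col1 exists in A2 are touched */
  update a1
    set flag = (select flag from a2 where a2.col1 = a1.col1)
    where col1 in (select col1 from a2);
quit;

proc print data=a1;
run;
```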
If you are avoiding the MERGE statement because of the sorting requirement, you can simply change your merging approach.
If flag in A1 is always missing, you can drop it; otherwise you should temporarily rename it so that you do not lose that information.
Here I merge A1 and A2 using a hash object; this approach doesn't require any prior sorting of the datasets.
data final_merged(drop = finder);
length flag 8; /* change to the real length; use $ if flag is a character variable */
if _N_ = 1 then do;
declare hash merger(dataset:'A2');
merger.definekey('col1');
merger.DefineData ('flag');
merger.definedone();
end;
set A1(drop=flag);
finder = merger.find();
if finder ne 0 then flag = .;
/*then flag='' or then flag='unknown' as you want if flag is a character var*/
run;
Please, let me know if this will help.
You could do the following, but SQL may reorder the observations, so I am not sure how useful this would be for you. (You could always preprocess with ordvar=_n_; and then order the SQL result by it if that helps.)
Data:
data a1 ;
input col1 $ col2 flag ;
cards ;
a 2 .
b 3 .
a 4 .
c 1 .
;run ;
data a2 ;
input col1 $ flag ;
cards ;
a 1
;run ;
Merge:
proc sql ;
create table output as
select a.col1, a.col2, b.flag
from a1 a
left join
a2 b
on a.col1=b.col1
;quit ;
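The ordvar idea mentioned above could be sketched as follows. The names a1_ord and output_ordered are made up for illustration; ordvar is the helper variable suggested in the text.

```sas
data a1;
  input col1 $ col2 flag;
  cards;
a 2 .
b 3 .
a 4 .
c 1 .
;
run;

data a2;
  input col1 $ flag;
  cards;
a 1
;
run;

/* Record the original observation order before joining */
data a1_ord;
  set a1;
  ordvar = _n_;
run;

proc sql;
  /* Join as before, restore the original order, then drop the helper */
  create table output_ordered(drop=ordvar) as
    select a.col1, a.col2, b.flag, a.ordvar
    from a1_ord a
      left join a2 b
        on a.col1 = b.col1
    order by a.ordvar;
quit;
```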
To try to do it in one pass, how about creating two macro variables containing the mapping from a2?
proc sql ;
select distinct col1, flag
into :colvals separated by '', :flagvals separated by ''
from a2
;quit ;
Set flag to the value at the corresponding character position in the two macro variables (this only works when every col1 and flag value is a single character):
data a1 ;
set a1 ;
if findc("&colvals",col1) then
flag=input(substr("&flagvals", findc("&colvals",col1),1),8.) ;
run ;