How does the SET statement work in a DO loop in SAS?

I have studied using the SET statement inside a DO loop in SAS, but I don't understand how it works.
I create the following example dataset a1:
/* Create data a1 */
data a1 ;
input fruit $ ;
cards ;
melon
apple
orange
;
run ;
proc print data=a1 ;
title "Results of a1" ;
run;
Then I create the following new dataset c1:
/* Create data c1 using a1 -- This is the upper code block */
data c1 ;
do i = 1 to 3 ;
set a1 ;
count + 1 ;
N_VAR = _N_ ;
ERR_VAR = _ERROR_ ;
output ;
end;
run ;
proc print data=c1 LABEL ;
LABEL N_VAR = "_N_" ;
LABEL ERR_VAR = "_ERROR_" ;
title "Results of c1" ;
run ;
Question: Why doesn't the upper code block produce the same output as the lower code block? I don't understand how the SET statement works in a DO loop. What concept am I missing?
/* My expectation for c1 -- This is the lower code block */
data my_expectation ;
input i fruit $ count N ERROR ;
cards ;
1 melon 1 1 0
1 apple 2 2 0
1 orange 3 3 0
2 melon 4 1 0
2 apple 5 2 0
2 orange 6 3 0
3 melon 7 1 0
3 apple 8 2 0
3 orange 9 3 0
;
run;
proc print data=my_expectation label ;
LABEL N = "_N_" ;
LABEL ERROR = "_ERROR_" ;
title "The result that I expected for c1" ;
run ;
I attached result image file below.
Thank you for your attention.

Each SET statement sets up an independent reading stream.
A DATA step is an implicit loop.
After the DO loop iterates 3 times the implicit DATA step loop returns control to the top of the step.
At the second implicit iteration, the DO loop is entered, and in its first iteration the SET statement is reached (for the 4th time). The input data set (A1) has no more observations, so the DATA step ends.
You can observe the flow behavior with this version of your DATA step:
data c1 ;
put 'TOP';
do i = 1 to 3 ;
put i= 'pre SET';
set a1 ;
put i= 'post SET';
count + 1 ;
N_VAR = _N_ ;
ERR_VAR = _ERROR_ ;
output ;
end;
put 'BOTTOM';
run;
Aside:
When a DATA step does not have any explicit OUTPUT statements, the step implicitly outputs an observation when control reaches the bottom of the step. Some statements prevent that implicit output, such as a DELETE statement or a subsetting IF statement that fails. (A RETURN statement, by contrast, still triggers the implicit output before control goes back to the top.)
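As a minimal sketch of the implicit output described in this aside, assuming the a1 dataset from the question (the dataset name implicit_demo is made up for illustration): the step below has no OUTPUT statement, so a row is written only when control reaches the bottom of the step.

```sas
data implicit_demo ;
  set a1 ;
  /* subsetting IF: when it fails, control returns to the top and the row is not written */
  if fruit ne 'apple' ;
  /* no explicit OUTPUT: rows that get this far are written implicitly at the bottom */
run ;
```

With the three rows of a1, this should keep melon and orange and drop apple.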
I answered your "why" question; @Tom showed you how to produce your expected result with a DATA step. The result is a cross join, which SQL can also perform:
data a1 ;
input fruit $ ;
cards ;
melon
apple
orange
;
data replicates;
do i = 1 to 3;
output;
end;
run;
proc sql;
create table want as
select i, a1.*
from replicates cross join a1
;
quit;

If you want to output each observation three times then move the DO loop after the SET.
set a1;
do i=1 to 3; output; end;
If you really want to read through the dataset three times then you either need three separate SET statements
i=1;
set a1;
output;
i=2;
set a1;
output;
i=3;
set a1;
output;
or use the POINT= option to explicitly control which observation you are reading with the SET statement.
do i=1 to 3 ;
do p=1 to nobs;
set a1 point=p nobs=nobs ;
output;
end;
end;
stop;
Most DATA steps stop when they read past the end of their input. Since that cannot happen with the POINT= option, you need the STOP statement to prevent the DATA step from repeating forever.
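Putting the POINT= fragment into a complete step might look like the sketch below. It assumes the a1 dataset from the question. Using p as the value of N_VAR is an assumption about what the asker wanted to see: _N_ itself stays at 1 here, because the whole job runs inside a single iteration of the implicit DATA step loop.

```sas
data c1 ;
  do i = 1 to 3 ;               /* three passes over a1 */
    do p = 1 to nobs ;          /* nobs is set at compile time by NOBS= */
      set a1 point=p nobs=nobs ;
      count + 1 ;               /* running count across all passes */
      N_VAR = p ;               /* assumption: row position within the pass, not _N_ */
      ERR_VAR = _ERROR_ ;
      output ;
    end ;
  end ;
  stop ;                        /* required: POINT= never reads past the end of a1 */
run ;
```

This should give nine rows, with i running from 1 to 3 and count from 1 to 9. The POINT= and NOBS= variables (p and nobs) are temporary and are not written to c1.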

Related

Cumulative sum of columns in SAS

I am trying to sum across variables
N1 N2 N3
1 1 1
1 . 1
1 1 .
Want
N1 N2 N3 B1 B2 B3
1 1 1 1 2 3
1 . 1 1 1 2
1 1 . 1 2 2
The array approach that I am trying does not seem to work at all.
data temp2;
set temp;
array hh(*) N:;
array bb(3);
do i=1 to dim(hh);
bb(i)=bb(i)+hh(i+1);
end;
run;
I don't want to transpose the data and then cumulate the sum.
First, you have an error in your algorithm: a cumulated value should not be calculated from itself and the next input value, but from the previous cumulated value and the corresponding input value.
You need bb(i) = bb(i-1) + hh(i) instead. Of course, this does not work when i is 1 because there is no bb(0), so you start doing this from i = 2 on.
Second, you need to handle missing values, with something like COALESCE in SQL. Let us use IFN for that, a function that returns a numeric value. The first argument is a condition, the second is the value returned if the condition is true, and the third is the value returned if the condition is false.
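A small sketch of how IFN treats a missing value, next to the DATA step COALESCE function for comparison. The variable names x, y, a, b and c are made up for illustration.

```sas
data _null_ ;
  x = . ;
  y = 4 ;
  a = ifn(missing(x), 0, x) ;   /* x is missing, so the second argument (0) is returned */
  b = ifn(missing(y), 0, y) ;   /* y is not missing, so y itself (4) is returned */
  c = coalesce(x, 0) ;          /* first non-missing argument, here 0 */
  put a= b= c= ;
run ;
```

The log should show a=0 b=4 c=0.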
Putting it all together:
data AFTER;
set TEMP;
array hh(*) N:;
array bb(3);
bb1 = ifn(missing(N1), 0, N1);
do i=2 to 3;
bb(i) = bb(i-1) + ifn(missing(hh(i)), 0, hh(i));
end;
drop i;
run;
The downside of this solution is the hardcoded 3 in array bb(3) and do i=2 to 3, which user3658367 tried to solve with dim(hh). Unfortunately, that only works for one of them: dim(hh) can serve as the loop bound, but an array declaration needs a constant size.
So this is better:
proc sql;
select count(*)
into :B_count
from sasHelp.vColumn
where libname eq 'WORK'
and memName eq 'TEMP'
and name like 'N%';
quit;
data AFTER;
set TEMP;
array hh(*) N:;
array bb(&B_count);
bb1 = ifn(missing(N1), 0, N1);
do i=2 to &B_count;
bb(i) = bb(i-1) + ifn(missing(hh(i)), 0, hh(i));
end;
drop i;
run;
I added the condition name like 'N%' because I assume that in your real-life problem TEMP has variables other than the ones you are cumulating.
About the comments below: if you were not involved in this post from the start, you can ignore them. I have incorporated them in the text above.
(To the authors of these comments: thanks for your input.)
I am not very comfortable with arrays, so I would use a macro to do the work.
data temp;
input N1 N2 N3;
datalines;
1 1 1
1 . 1
1 1 .
;
run;
options mprint mlogic;
proc sql;
select name into:cols separated by ','
from dictionary.columns
where libname = upcase("work") and memname = upcase("temp") and upcase(name) like 'N%';
quit;
%macro cumul_colsum;
data temp2;
set temp;
run;
%do i = 1 %to %sysfunc(countw(%superq(cols)));
%let var = %scan(%superq(cols),&i,%str(,)); %put |&var|;
data temp2;
set temp2;
B&i. = sum(of N1-&var.);
run;
%end;
%mend cumul_colsum; %cumul_colsum;
This gives the desired result, which in this case is the one the OP requires. I used the same like 'N%' condition as Dirk to feed the column names to a macro, and a %DO loop to create the columns with cumulative sums. This might take a while for huge datasets, but it is easier to see what is happening (with options mprint nonotes;).
Follow @Reeza's suggestion of using the SUM function instead of + to handle missing values, something like the code below.
data have;
input
N1 N2 N3;
datalines;
1 1 1
1 . 1
1 1 .
;
data temp2;
set have;
array hh(*) N1 N2 N3;
array bb(3) b1 b2 B3;
do i= 1 to dim(hh);
if i= 1 then bb(i) = hh(i);
else bb(i)= sum(bb(i-1),hh(i));
end;
drop i;
run;

SAS: Coffee anyone?

I tried this in C# but have not had much success. So I am now trying in SAS. Using an EG session and my SAS code, we work with the list of students in SASHELP.CLASS.
These people want to get to know each other and have a monthly random pairing to go on a Coffee Date.
Rules:
A random Coffee Date List is Generated monthly;
I store each month's pairings in a Historical Dataset, which I append to monthly.
One person cannot have coffee with the same person within a 6 month period. So we keep a separate dataset for historical purposes with 3 Vars:
LastDate,InviterID,InvitedID
We check each pairing against the Historical list, of which we load only the most recent 6 months of data into a temp dataset for checking purposes.
If no recent matched pair is found, a new matched pair is added to a new Paired Dataset, and the 2 names (Rows) are removed from the original Participants dataset until the dataset has less than 2 rows. (a single person cannot be paired with another)
Unfortunately we have 19 people in this list so one person will be left out until we can add a new participant. Is anyone interested in joining our coffee club? :-)
So I start by deriving an ID (_n_) from the dataset, and I keep only the Name:
Data Participants(Keep=ID Name);
FORMAT ID 8.;
set SASHelp.class;
ID=_n_;
run;
These 19 People will be my Participants in the Coffee Club.
I more or less follow the line of thought:
data _null_;
randvar = ceil(rand('UNIFORM') * 100000);
call symput('RANDSEED', randvar);
run;
data CR.names2(keep=MEMID randid);
set CR.MasterNames;
randid = rand('UNIFORM');
run;
proc sort data=CR.names2 ; by randid; run;
data CR.pairs(keep=pairgrp MEMID);
set CR.names2 nobs=num_peeps;
pairgrp+1;
if pairgrp > floor(num_peeps/2) then pairgrp=1;
run;
proc sort data=CR.pairs; by pairgrp;run;
proc transpose data=CR.pairs
out=CR.pairs2 (drop=_NAME_);
var memid;
by pairgrp;
run;
Data CR.Pairs3;
set CR.pairs2;
rename COL1=InviterID COL2=InvitedID;
run;
But I get stuck :-(
I need help with the rest please...
Has anyone else done this type of random pairing successfully before? I am grasping at straws here...
Any help much appreciated.
Len
Here is my idea. It is far from efficient, especially when NOBS gets big, as there is a Cartesian product involved. I also cheated on the odd number by adding another row in that case.
1. Prepare data and generate an empty result table.
2. Create a list of all possible pairings (combinations), excluding recent pairings.
3. Sort randomly and descend through the list until every element has been picked once.
4. Append to the result table.
There is a drawback as there might be members who will not get pairings as all possible partners are already picked. To avoid that we could iterate until we get a maximum of pairings.
EDIT: Added iteration. Now the program makes draws randomly until everyone is matched or a threshold is reached.
This problem should probably be implemented in a matrix-oriented language like IML or R.
data Participants(Keep=ID Name) ;
set SASHelp.class nobs = num_peeps ;
ID=_n_ ;
output ;
if _n_ = 1 and mod(num_peeps,2) then do ; /* get even number of members: empty ID to pair with last participant*/
name = 'empty' ;
id = 0 ;
output ;
end ;
run ;
data list_of_meetings ;
length iteration InviterID InvitedID 8. ;
run ;
/****
iter = number of club meetings
hist = length of memory for pairings
tries = number of iterations to pair everyone
****/
%macro loop_coffee (iter=, hist=6, tries= 10) ;
proc sql noprint ;
select max(0,max(iteration)) + 1 into :base
from list_of_meetings ;
quit ;
%do i = &base. %to &iter. ; /* loop through number of meetings */
proc sort data = list_of_meetings (where=(iteration >= &i - &hist )) out = lookup nodupkey ; by InviterID InvitedID ; run ; /* get memory of pairings */
proc sql ; /* list all acceptable pairs */
create table all_pairs as
select a.ID as InviterID, b.ID as InvitedID
from Participants a
inner join Participants b
on a.ID lt b.ID
left join lookup c /* exclude the memory */
on a.ID eq c.InviterID and b.ID eq c.InvitedID
where c.InviterID is NULL ;
quit ;
%let j = 0 ;
%let all_pairs = 0 ;
%do %until (&all_pairs | &j > &tries) ; /* iterate and random sort until all members are paired */
%let j = %eval( &j + 1 ) ;
data all_pairs;
set all_pairs;
randnum = ranuni(12345 + &i + &j);
run;
proc sort data = all_pairs ; by randnum ; run ; /* random sort */
data out_pairs ; /* select the pairs: no. of IDs/2 */
declare hash h() ;
h.defineKey("ID") ;
h.defineDone() ;
do until ( eof1 ) ;
set Participants (keep= ID) end = eof1 ;
rc = h.add () ; /* populate list of members */
end ;
do until ( eof2 ) ;
set all_pairs (keep= InviterID InvitedID) end = eof2 ;
rc1 = h.check (key:InviterID) ;
rc2 = h.check (key:InvitedID) ;
if rc1 = 0 and rc2 = 0 then do ;
rc = h.remove (key:InviterID) ; /* delete member from list if paired */
rc = h.remove (key:InvitedID) ;
output ;
end ;
if h.num_items = 0 then do ;
call symput('all_pairs', 1 ) ;
stop ;
end;
end ;
stop ;
keep InviterID InvitedID ;
run ;
%end ;
data list_of_meetings ;
set list_of_meetings (where=(iteration ne .))
Out_pairs (in=pairs) ;
if pairs then iteration = &i. ;
run ;
%end ;
%mend ;
%loop_coffee (iter=10,hist=6,tries=10) ;
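One way to sanity-check the result against the six-month rule is a self-join on list_of_meetings, sketched below. It assumes the macro was run with hist=6 as in the call above and relies on each pair being stored with InviterID less than InvitedID, as the all_pairs query does. The column aliases first_iter and repeat_iter are made up for illustration.

```sas
proc sql ;
  /* list any pair that meets again within 6 iterations of an earlier meeting */
  select a.InviterID, a.InvitedID,
         a.iteration as first_iter,
         b.iteration as repeat_iter
  from list_of_meetings a
       inner join list_of_meetings b
       on a.InviterID eq b.InviterID and a.InvitedID eq b.InvitedID
  where a.iteration is not missing
    and b.iteration gt a.iteration
    and b.iteration - a.iteration le 6 ;
quit ;
```

If the pairing logic works as intended, the query should return no rows.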

product of common variables in two datasets

data a1
a b c
2 3 4
1 2 3
data a2
a b d
0 .3 1
0 .2 0
proc sql;
create table a3 as
select a.*, a.a * b.a + a.b * b.b as Value
from a1 a, a2 b;
There are many common columns in a1 and a2 (numeric columns with different values). I want to calculate Value as the 'sumproduct' of those common columns.
I want to avoid writing out something like a.common1 * b.common1 + a.common2 * b.common2 + ...
A few steps of preprocessing are needed as far as I can tell....
Load your data:
data a1 ;
input a b c ;
cards ;
2 3 4
1 2 3
;run ;
data a2 ;
input a b d ;
cards ;
0 0.3 1
0 0.2 0
;run ;
Pull all variable names in A1 and A2 datasets (update your libname if required):
proc sql ;
create table data1 as
select libname, memname, name, label
from sashelp.vcolumn
where libname= 'WORK' and memname in ('A1','A2')
order by name
;quit ;
Keep only variables which are common to both datasets:
data data2 ;
set data1 ;
by name ;
if last.name and not first.name ;
run ;
Put both a list and a count of the common variables into macro variables:
proc sql ;
select name
into :commvarnames separated by ' '
from data2
;
select count(name)
into :commoncount
from data2
;quit ;
Read in your source datasets: load the first, copy its values to a temporary array (so that they are not overwritten), then load the second dataset and do your calculations in a DO loop:
data output ;
set a1(keep=&commvarnames) ;
array one(&commoncount) _temporary_ ;
array two(&commoncount) &commvarnames ;
* Load A1 to temporary array ;
do i=1 to &commoncount ;
one(i)=two(i) ;
end ;
* Load A2 to variables ;
set a2(keep=&commvarnames) ;
do i=1 to &commoncount ;
product=sum(product,one(i)*two(i)) ;
end ;
run ;
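If the cross join from the question is what is wanted (every row of a1 against every row of a2), the list of common names can also be used to generate the SQL expression itself. This is a sketch under two assumptions: the data2 table of common variable names built above exists, and all common variables are numeric. The macro variable name sumprod is made up for illustration.

```sas
proc sql noprint ;
  /* build "a.a*b.a + a.b*b.b + ..." from the common variable names in data2 */
  select cats('a.', name, '*b.', name)
    into :sumprod separated by ' + '
  from data2 ;
  /* same cross join as in the question, using the generated expression */
  create table a3 as
  select a.*, &sumprod as Value
  from a1 a, a2 b ;
quit ;
```

Unlike the two-SET DATA step, which pairs the rows of a1 and a2 one to one, this keeps the Cartesian product of the original query.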
It would take quite a bit of code to make this dynamic. I'd break it down like so:
Get lists of the variables present in each dataset
Merge the lists to get a list of the common variables
Feed this into some array logic in a data step
Will post some code later, but hopefully that's enough to give you some ideas.

How do I perform a calculation on the last n observations?

How can I perform a calculation on the last n observations in a data set?
For example, if I have 10 observations, I would like to create a variable that sums the last 5 values of another variable. Please do not suggest that I lag 5 times or use mod(_N_). I need a somewhat more elegant solution than that.
In the code below, alpha is the data set that I have and bravo is the one I need.
data alpha;
input lima @@ ;
cards ;
3 1 4 21 3 3 2 4 2 5
;
run ;
data bravo;
input lima juliet;
cards;
3 .
1 .
4 .
21 .
3 32
3 32
2 33
4 33
2 14
5 16
;
run;
Thank you in advance!
You can do this in the data step or using PROC EXPAND from SAS/ETS if available.
For the data step the idea is that you start with a cumulative sum (summ), but keep track of the number of values that were added so far (ninsum). Once that reaches 5, you start outputting the cumulative sum to the target variable (juliet), and from the next step you start subtracting the lagged-5 value to only store the sum of the last five values.
data beta;
set alpha;
retain summ ninsum 0;
summ + lima;
ninsum + 1;
l5 = lag5(lima);
if ninsum = 6 then do;
summ = summ - l5;
ninsum = ninsum - 1;
end;
if ninsum = 5 then do;
juliet = summ;
end;
run;
proc print data=beta;
run;
However, there is a procedure that can do all kinds of cumulative, moving-window and similar calculations: PROC EXPAND, in which this is really just one line. We just tell it to calculate the backward moving sum in a window of width 5 and set the first 4 observations to missing (by default it pads your series with 0's on the left).
proc expand data=alpha out=gamma;
convert lima = juliet / transformout=(movsum 5 trimleft 4);
run;
proc print data=gamma;
run;
Edit
If you want to do more complicated calculations, you need to carry the previous values in retained variables. I thought you wanted to avoid that, but here it is:
data epsilon;
set alpha;
array lags {5};
retain lags1 - lags5;
/* shift over lagged values, and add self at the beginning */
do i=5 to 2 by -1;
lags{i} = lags{i-1};
end;
lags{1} = lima;
/* do whatever calculation is needed; with + the result stays missing until all five slots are filled */
juliet = 0;
do i=1 to 5;
juliet = juliet + lags{i};
end;
drop i;
run;
proc print data=epsilon;
run;
I can offer a rather ugly solution:
1. Run a data step that adds an increasing number to the rows within each group.
2. Run a SQL step that adds a column with the maximum of that number per group.
3. Run another data step and check whether the value (2)-(1) is less than 5. If so, assign the value that you want to sum to a _num_to_sum_ variable (for example); otherwise leave it blank or assign 0.
4. Last, do a SQL step with sum(_num_to_sum_) and group the results by the grouping variable from (1).
EDIT: I have added a live example of the concept in a somewhat more compact form.
data sourcetable;
input var1 $ var2;
cards;
aaa 3
aaa 5
aaa 7
aaa 1
aaa 11
aaa 8
aaa 6
bbb 3
bbb 2
bbb 4
bbb 6
;
run;
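data step1;
/* first pass: count the rows in the current group */
do n = 1 by 1 until (last.var1);
set sourcetable;
by var1;
end;
/* second pass: keep var2 only for the last 5 rows of the group */
do obs = 1 to n;
set sourcetable;
if n - obs < 5 then to_sum = var2;
else to_sum = .;
output;
end;
run;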
proc sql;
create table rezults as
select distinct var1, sum(to_sum) as needed_summs
from step1
group by var1;
quit;
In case anyone reads this :)
I solved it the way I needed it to be solved, although now I am curious which of the two (the RETAIN approach and my solution) is more efficient in terms of processing time.
Here is my solution:
data bravo(keep = var1 summ);
set alpha;
do i=_n_ to _n_-4 by -1;
set alpha(rename=var1=var2) point=i;
summ=sum(summ,var2);
end;
run;
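The same POINT= idea, written against the alpha dataset from the question, might look like the sketch below. The names bravo2, p and prev are made up for illustration, and the _n_ >= 5 guard is an addition so that POINT= is never asked for a row number below 1 and the first four rows stay missing.

```sas
data bravo2(keep=lima juliet) ;
  set alpha ;
  /* only rows 5 onward have five values to add up */
  if _n_ >= 5 then do p = _n_ - 4 to _n_ ;
    /* second, independent read of alpha; POINT= fetches row p directly */
    set alpha(keep=lima rename=(lima=prev)) point=p ;
    juliet = sum(juliet, prev) ;
  end ;
run ;
```

For the ten values in alpha, juliet should be missing for the first four rows and then 32, 32, 33, 33, 14 and 16, matching bravo. No STOP statement is needed because the first SET still ends the step when it runs out of rows.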

How do I add a new observation to an existing dataset in SAS?

How do I add a new observation to an existing dataset in SAS? For example, if I have a dataset 'dataX' with variables 'x' and 'y', and I want to add a new observation that is observation number n multiplied by a constant (ten in the example below), how can I do it?
dataX :
x y
1 1
1 21
2 3
I want to create :
dataX :
x y
1 1
1 21
2 3
10 210
where observation number four is observation number two multiplied by ten.
data X;
input x y;
datalines;
1 1
1 21
2 3
;
run;
data X ;
set X end=eof;
if eof then do;
output;
x=10 ;y=210;
end;
output;
run;
Here is one way to do this:
data dataX;
input x y;
datalines;
1 1
1 21
2 3
;
run;
/* Create a new observation into temp data set */
data _addRec;
set dataX(firstobs=2); /* Get observation 2 */
x = x * 10; /* Multiply each by 10 */
y = y * 10;
output; /* Output new observation */
stop;
run;
/* Add new obs to original data set */
proc append base=dataX data=_addRec;
run;
/* Delete the temp data set (to be safe) */
proc delete data=_addRec;
run;
data a ;
do kk=1 to 5 ;
output ;
end ;
run;
data a2 ;
kk=999 ;
output ;
run;
data a; set a a2 ;run ;
proc print data=a ;run ;
Result:
The SAS System 1
OBS kk
1 1
2 2
3 3
4 4
5 5
6 999
You can use a macro to obtain your desired result:
Write a macro that reads the first dataset and, when _n_=2, multiplies x and y by 10.
After that, create another dataset that holds only the multiplied values, say x'=10x and y'=10y.
Pass both datasets to another macro that SETs the original dataset and the newly created dataset together.
The logic is that you create another dataset with the values 10x and 10y and then SET it after the original dataset.
I hope this helps!
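The steps above could be sketched as a single macro. This is one possible reading of the description, not the answerer's own code: the macro name add_scaled_obs, its parameters and the work dataset _scaled are all made up for illustration, and FIRSTOBS=/OBS= is used to pick the row instead of testing _n_.

```sas
%macro add_scaled_obs(ds=, obsnum=, factor=) ;
  /* read only the requested row and scale it */
  data _scaled ;
    set &ds(firstobs=&obsnum obs=&obsnum) ;
    x = x * &factor ;
    y = y * &factor ;
  run ;
  /* stack the scaled row under the original rows */
  data &ds ;
    set &ds _scaled ;
  run ;
%mend add_scaled_obs ;

%add_scaled_obs(ds=dataX, obsnum=2, factor=10) ;
```

With the dataX example from the question, this should append a fourth row with x=10 and y=210.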