Can someone explain this code to me in depth? I have added comments in the code where I am confused. Is there any way I can attach a CSV of the data? Thanks in advance.
data have;
infile "&sasforum.\datasets\Returns.csv" firstobs=2 dsd truncover;
input DATE :mmddyy10. A B B_changed;
format date yymmdd10.;
run;
data spread;
do nb = 1 by 1 until(not missing(B));
set have;
end;
br = B;
do i = 1 to nb;
set have; *** I don't get how you can do i = 1 to nb with set have. There is no variable nb in have. The variable nb is read into the dataset spread;
if nb > 1 then B_spread = (1+br)**(1/nb) - 1;
else B_spread = B;
output;
end;
drop nb i br;
run;
***** If I comment out "drop nb i br", I see that nb takes a value of 2 for the null values of B. I don't get how this is possible, because if I run the code up to the line "br = B" and put an output statement in the first do loop, I clearly see that nb takes a value of one for null values of B. Honestly, it is like the first do loop reads in future observations of B as br. Can you please explain this to me? The second dataset "bunch" seems to follow the same principles as the first, so I imagine that if I get a grasp of how the dataset spread is created, then I will understand how bunch is created.;
This is an advanced DATA step programming technique, commonly referred to as a DoW loop. If you search lexjansen.com for DoW, you will find helpful papers like http://support.sas.com/resources/papers/proceedings09/038-2009.pdf. The DoW loop codes an explicit loop around a SET statement. This is actually a "Double-DoW loop", because you have two explicit loops.
I made some sample data, and added some PUT statements to your code:
data have ;
input B ;
cards ;
.
.
1
2
.
.
.
3
;
data spread;
do nb = 1 by 1 until(not missing(B));
set have;
put _n_= "top do-loop " (nb B)(=) ;
end;
br = B;
do i = 1 to nb;
set have;
if nb > 1 then B_spread = (1+br)**(1/nb) - 1;
else B_spread = B;
output;
put _n_= "bottom do-loop " (nb B br B_spread)(=) ;
end;
drop nb i br;
run;
With that sample data, on the first iteration of the DATA step (_N_=1), the top DO loop will iterate three times, reading the first three records of HAVE. At that point, (not missing(B)) will be true, and the loop will not iterate again. The variable NB will have a value of 3. The bottom loop will then iterate 3 times, because NB has a value of 3. Because it has its own SET statement, it will read the first three records of HAVE again. It will compute B_spread and output each record.
On the second iteration of the DATA step, the top DO loop will iterate only once. It will read the 4th record, with B=2. The bottom loop will iterate once, reading the 4th record, computing B_spread, and output.
On the third iteration of the DATA step, the top DO loop will iterate four times, reading the 5th through 8th records. The bottom loop will also iterate four times, reading the 5th through 8th records, computing B_spread, and output.
On the fourth iteration of the DATA step, the step will complete, because the SET statement in the top loop will read the end-of-file marker.
The core concept of a Double-DoW loop is that typically you are reading the data in groups. Often groups are identified by an ID. Here they are defined by sequential records read until not missing(B). The top DO-loop reads the first group of records, and computes some value (in this case, it computes NB, the number of records in the group). Then the bottom DO-loop reads the same group of records again, and computes some new value, using the value computed in the top DO-loop. In this case, the bottom DO-loop computes B_spread, using NB.
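When the groups are identified by an ID variable, the same pattern is usually written with BY-group processing. The following is a minimal sketch of that variant, not code from the question: the dataset SCORES and the variables ID, X, TOTAL, MEAN and DIFF are made up for illustration.

```sas
/* Hypothetical sample data: two groups identified by ID */
data scores;
  input id x;
  datalines;
1 10
1 20
2 5
2 7
2 9
;
run;

data want;
  /* Top loop: read one BY group, counting records (n) and summing x */
  do n = 1 by 1 until (last.id);
    set scores;
    by id;
    total = sum(total, x);
  end;
  mean = total / n;
  /* Bottom loop: its own SET re-reads the same group, now knowing MEAN */
  do until (last.id);
    set scores;
    by id;
    diff = x - mean;
    output;
  end;
  drop n total;
run;
```

For this sample data the group means are 15 and 7, so DIFF is -5 and 5 for ID=1, and -2, 0 and 2 for ID=2. TOTAL does not need a RETAIN because it only has to survive within one iteration of the DATA step, and it is reset to missing when the next iteration (the next group) starts.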
Related
I want to compute the sum of the 250 previous rows for each row, starting from the 250th row.
X= lag1(VWRETD)+ lag2(VWRETD)+ ... +lag250(VWRETD)
X = sum ( lag1(VWRETD), lag2(VWRETD), ... ,lag250(VWRETD) )
I tried to use the lag function, but it does not work for so many lags.
I also want to calculate the sum of the 250 next rows after each row.
What you're looking for is a moving sum both forwards and backwards where the sum is missing until that 250th observation. The easiest way to do this is with PROC EXPAND.
Sample data:
data have;
do MKDate = '01JAN1993'd to '31DEC2000'd;
VWRET = rand('uniform');
output;
end;
format MKDate mmddyy10.;
run;
Code:
proc expand data=have out=want;
id MKDate;
convert VWRET = x_backwards_250 / transform=(movsum 250 trimleft 250);
convert VWRET = x_forwards_250 / transform=(reverse movsum 250 trimleft 250 reverse);
run;
Here's what the transformation operations are doing:
Creating a backward moving sum over a window of 250 observations (the MOVSUM window includes the current observation), then setting the first 250 values to missing.
Reversing VWRET, creating a moving sum over 250 observations, setting the first 250 values to missing, then reversing it again. This effectively creates a forward moving sum.
The key is how to read observations from previous and following rows. As for your sum(n1, n2,...,nx) function, you can replace it with iterative summation.
This example uses the multiple-SET technique to sum a variable over the 25 previous and 25 following rows:
data test;
set sashelp.air nobs=nobs;
if 25<_n_<nobs-25+1 then do;
do i=_n_-25 to _n_-1;
set sashelp.air(keep=air rename=air=pre_air) point=i;
sum_pre=sum(sum_pre,pre_air);
end;
do j=_n_+1 to _n_+25;
set sashelp.air(keep=air rename=air=post_air) point=j;
sum_post=sum(sum_post,post_air);
end;
end;
drop pre_air post_air;
run;
Only the 26th through (nobs-25)th rows will be calculated, where nobs is the number of observations in the input dataset sashelp.air.
Multiple SET statements with POINT= can be slow on a big dataset. If you want to be more efficient, you can use an array and a DoW loop instead of the multiple-SET technique:
data test;
array _val_[1024] _temporary_;
if _n_=1 then do i=1 by 1 until(eof);
set sashelp.air end=eof;
_val_[i]=air;
end;
set sashelp.air nobs=nobs;
if 25<_n_<nobs-25+1 then do;
do i=_n_-25 to _n_-1;
sum_pre=sum(sum_pre,_val_[i]);
end;
do j=_n_+1 to _n_+25;
sum_post=sum(sum_post,_val_[j]);
end;
end;
drop i j;
run;
The weakness is that you have to give the array a dimension, and it must be equal to or greater than nobs.
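One way around the hard-coded dimension is to store the observation count in a macro variable first and size the array from it. This is only a sketch: the macro variable name AIR_NOBS is made up, and it assumes a native SAS dataset, since NOBS= is not reliable for views.

```sas
/* Store the observation count of SASHELP.AIR in a macro variable.
   NOBS= is filled in at compile time, so no rows are actually read. */
data _null_;
  if 0 then set sashelp.air nobs=nobs;
  call symputx('air_nobs', nobs);   /* AIR_NOBS is a hypothetical name */
  stop;
run;

data test;
  array _val_[&air_nobs] _temporary_;   /* sized to fit the data exactly */
  /* Load the whole column into the array once, on the first iteration */
  if _n_=1 then do i=1 by 1 until(eof);
    set sashelp.air end=eof;
    _val_[i]=air;
  end;
  set sashelp.air nobs=nobs;
  if 25<_n_<nobs-25+1 then do;
    do i=_n_-25 to _n_-1;
      sum_pre=sum(sum_pre,_val_[i]);
    end;
    do j=_n_+1 to _n_+25;
      sum_post=sum(sum_post,_val_[j]);
    end;
  end;
  drop i j;
run;
```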
These techniques come from a concept called "Table Look-Up". For the SAS context, read "Table Look-Up by Direct Addressing: Key-Indexing -- Bitmapping -- Hashing", Paul Dorfman, SUGI 26.
You don't want to use normal arithmetic with missing values because then the result is always a missing value. Use the SUM() function instead.
You don't need to spell out all of the lags. Just keep a normal running sum but add the wrinkle of removing the last one in by subtraction. So your equation only needs to reference the one lagged value.
Here is a simple example using running sum of 5 using SASHELP.CLASS data as an example:
%let n=5 ;
data step1;
set sashelp.class(keep=name age);
retain running_sum ;
running_sum=sum(running_sum,age,-(sum(0,lag&n.(age))));
if _n_ >= &n then want=running_sum;
run;
So the sum of the first 5 observations is 68. But for the next observation the sum goes down to 66 since the age on the 6th observation is 2 less than the age on the first observation.
To calculate the other variable (the sum over the following rows), sort the dataset in descending order and use the same logic to make another variable.
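A sketch of that reverse-order step, under the assumption that the data has no natural sort key, so a row counter is added first. The names STEP0, REVERSED, STEP2, ROW and WANT_FORWARD are made up for illustration.

```sas
%let n=5;

/* Remember the original row order so it can be restored afterwards */
data step0;
  set sashelp.class(keep=name age);
  row = _n_;
run;

proc sort data=step0 out=reversed;
  by descending row;
run;

/* Same running-sum logic as above, now applied to the reversed data,
   so each sum covers the current row and the n-1 rows that follow it */
data step2;
  set reversed;
  retain running_sum;
  running_sum = sum(running_sum, age, -(sum(0, lag&n.(age))));
  if _n_ >= &n then want_forward = running_sum;
  drop running_sum;
run;

/* Put the rows back in their original order */
proc sort data=step2;
  by row;
run;
```

As in the forward version, the window includes the current row. The last n-1 rows of the original order are left missing because fewer than n rows remain after them.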
I have the following code. I am trying to test a paragraph (descr) for a list of keywords (key_words). When I execute this code, the log reads in all the variables for the array, but will only test 2 of the 20,000 rows in the do loop (do i=1 to 100 and on). Any suggestions on how to fix this issue?
data JE.KeywordMatchTemp1;
set JE.JEMasterTemp end=eof;
if _n_ = 1 then do i = 1 by 1 until (eof);
set JE.KeyWords;
array keywords[100] $30 _temporary_;
keywords[i] = Key_Words;
end;
match = 0;
do i = 1 to 100;
if index(descr, keywords[i]) then match = 1;
end;
drop i;
run;
Your problem is that your end=eof is in the wrong place.
This is a trivial example calculating the 'rank' of the age value for each SASHELP.CLASS respondent.
See where I put the end=eof. That's because you need to use it to control the array filling operation. Otherwise, your loop do i = 1 by 1 until (eof); doesn't really do what you're saying it should: it's not actually terminating at eof, since that is never true while the array is being filled (it is defined on the first set statement). Instead it terminates because you reach beyond the end of the dataset, which is specifically what you don't want.
That's what the end=eof is doing: it's preventing you from trying to pull a row when the array filling dataset is finished, which terminates the whole data step. Any time you see a data step terminate after exactly 2 iterations, you can be confident that's what the problem is likely to be - it is a very common issue.
data class_ranks;
set sashelp.class; *This dataset you are okay iterating over until the end of the dataset and then quitting the data step, like a normal data step.;
array ages[19] _temporary_;
if _n_=1 then do;
do _i = 1 by 1 until (eof); *iterate until the end of the *second* set statement;
set sashelp.class end=eof; *see here? This eof is telling this loop when to stop. It is okay that it is not created until after the loop is.;
ages[_i] = age;
end;
call sortn(of ages[*]); *ordering the ages loaded by number so they are in proper order for doing the trivial rank task;
end;
age_rank = whichn(age,of ages[*]); *determine where in the list the age falls. For a real version of this task you would have to check whether this ever happens, and if not you would have to have logic to find the nearest point or whatnot.;
run;
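Applied to the code in the question, the fix might look like the sketch below. It assumes JE.KeyWords has at least one and at most 100 rows (the array size used in the question). The STRIP() call and the missing-value check are my additions, not part of the answer above: a $30 array element is padded with trailing blanks, which INDEX would otherwise try to match as well.

```sas
data JE.KeywordMatchTemp1;
  set JE.JEMasterTemp;                     /* no END= here */
  array keywords[100] $30 _temporary_;
  /* Fill the array once; END=EOF now belongs to the keyword dataset,
     so the loop stops before reading past its last row */
  if _n_ = 1 then do i = 1 by 1 until (eof);
    set JE.KeyWords end=eof;
    keywords[i] = Key_Words;
  end;
  match = 0;
  do i = 1 to dim(keywords);
    /* Skip unused array slots; strip trailing blanks before searching */
    if not missing(keywords[i]) and index(descr, strip(keywords[i])) then match = 1;
  end;
  drop i;
run;
```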
Probably a simple question. I have a simple dataset with scheduled payment dates in it.
DATA INFORM2;
INFORMAT previous_pmt_date scheduled_pmt_date MMDDYY10.;
INPUT previous_pmt_date scheduled_pmt_date;
FORMAT previous_pmt_date scheduled_pmt_date MMDDYYS10.;
DATALINES;
11/16/2015 12/16/2015
12/17/2015 01/16/2016
01/17/2016 02/16/2016
;
What I'm trying to do is to create a binary latest row indicator. For example, if I wanted to know the latest row as of 1/31/2016, I'd want row 2 to be flagged as the latest row. What I had been doing before is finding out where 1/31/2016 is between the previous_pmt_date and the scheduled_pmt_date, but that isn't correct for my purposes. I'd like to do this in a data step as opposed to SQL subqueries. Any ideas?
Want:
previous_pmt_date scheduled_pmt_date latest_row_ind
11/16/2015 12/16/2015 0
12/17/2015 01/16/2016 1
01/17/2016 02/16/2016 0
Here's a solution that does it all in the single existing datastep without any additional sorting. First I'm going to modify your data slightly to include account as the solution really should take that into account as well:
DATA INFORM2;
INFORMAT previous_pmt_date scheduled_pmt_date MMDDYY10.;
INPUT account previous_pmt_date scheduled_pmt_date;
FORMAT previous_pmt_date scheduled_pmt_date MMDDYYS10.;
DATALINES;
1 11/16/2015 12/16/2015
1 12/17/2015 01/16/2016
1 01/17/2016 02/16/2016
2 11/16/2015 12/16/2015
2 12/17/2015 01/16/2016
2 01/17/2016 02/16/2016
;
run;
Specify a cutoff date:
%let cutoff_date = %sysfunc(mdy(1,31,2016));
This solution uses the approach from this question to save the variables in the next row of data, into the current row. You can drop the vars at the end if desired (I've commented out for the purposes of testing).
data want;
set inform2 end=eof;
by account scheduled_pmt_date;
recno = _n_ + 1;
if not eof then do;
set inform2 (keep=account previous_pmt_date scheduled_pmt_date
rename=(account = next_account
previous_pmt_date = next_previous_pmt_date
scheduled_pmt_date = next_scheduled_pmt_date)
) point=recno;
end;
else do;
call missing(next_account, next_previous_pmt_date, next_scheduled_pmt_date);
end;
select;
when ( next_account eq account and next_scheduled_pmt_date gt &cutoff_date ) flag='a';
when ( next_account ne account ) flag='b';
otherwise flag = 'z';
end;
*drop next:;
run;
This approach works by using the current observation in the dataset (obtained via _n_) and adding 1 to it to get the next observation. We then use a second set statement with the point= option to load in that next observation and rename the variables at the same time so that they don't overwrite the current variables.
We then use some logic to flag the necessary records. I'm not 100% sure of the logic you require for your purposes, so I've provided some sample logic and used different flags to show which logic is being triggered.
Some notes...
The by statement isn't strictly necessary but I'm including it to (a) ensure that the data is sorted correctly, and (b) help future readers understand the intent of the datastep as some of the logic requires this sort order.
The call missing statement is simply there to clean up the log. SAS doesn't like it when you have variables that don't get assigned values, and this will happen on the very last observation so this is why we include this. Comment it out to see what happens.
The end=eof syntax basically creates a temporary variable called eof that has a value of 1 when we get to the last observation on that set statement. We simply use this to determine if we're at the last row or not.
Finally, but very importantly, make sure you keep only the required variables and rename them when you load in the second dataset; otherwise you will overwrite existing variables from the original data.
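If the goal is the 0/1 indicator from the question, one possible reading is "the last row per account whose scheduled_pmt_date is on or before the cutoff". The sketch below implements that reading. It is an assumption about the required logic rather than something stated in the question, and it relies on the next_ variables still being present in WANT (the drop statement above is commented out). WANT_FLAGGED is a made-up dataset name.

```sas
data want_flagged;
  set want;
  /* 1 when this row is due on or before the cutoff and the next row
     either belongs to another account (or does not exist) or is due
     after the cutoff; 0 otherwise */
  latest_row_ind = ( scheduled_pmt_date <= &cutoff_date
                     and ( next_account ne account
                           or next_scheduled_pmt_date > &cutoff_date ) );
  drop next_: recno flag;
run;
```

With the sample data and a cutoff of 1/31/2016, this flags the 01/16/2016 row of each account, which matches the "Want" table in the question.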
Here's a very similar question
My question is a bit different from the one in the above link.
Background
I have a data set that contains hourly data, so each object has 24 records per day. Now I want to create K new columns that represent the next 1, 2, ..., K hourly records for each object. If they do not exist, they should be set to missing values.
K is dynamic and is defined by users.
The original order must be preserved, whether that is guaranteed within the data steps or by sorting at the end.
I'm looking for an efficient way to achieve this.
Example
Original data:
Object Hour Value
A 1 2.3
A 2 2.3
A 3 4.0
A 4 1.3
Given K = 2, desired output is
Object Hour Value Value1 Value2
A 1 2.3 2.3 4.0
A 2 2.3 4.0 1.3
A 3 4.0 1.3 .
A 4 1.3 . .
Possible solutions
sort in reverse order -> obtain previous k records -> sort them back.
When the number of observations is large, this is not an ideal way.
proc expand. I am not familiar with it because it has never been licensed on my PC.
Using point in data step.
retain statement inside data step. I'm not sure how this works.
Assuming this is provided as a macro variable, this is pretty easily done with a side to side merge-ahead. Certainly faster than a transpose for K much larger than the total record count, and probably faster than looping POINTs.
Basically you merge the original dataset to itself, and use FIRSTOBS to push the starting point down one for each successive merge iteration. This needs a bit of extra work if you have BY groups that need protecting, but that's usually not too hard to manage.
Here's an example using SASHELP.CLASS:
%let K=5;
%macro makemergesets(k=, datain=, varin=, keepin=);
%do _i = 2 %to &k;
&datain (firstobs=&_i rename=&varin.=&varin._&_i. keep=&keepin. &varin.)
%end;
%mend makemergesets;
data class_all;
merge sashelp.class
%makemergesets(k=&k,datain=sashelp.class, varin=age,keepin=)
;
run;
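For the Object/Hour data in the question, protecting the BY groups could be done by carrying the look-ahead row's Object along and blanking values that belong to a different object. This is a hand-written sketch for K=2 rather than a generalization of the macro above. The dataset names HOURLY and LEAD2 and the helper variables _obj1 and _obj2 are made up, and object B was added to the sample data to show the group boundary.

```sas
data hourly;
  input Object $ Hour Value;
  datalines;
A 1 2.3
A 2 2.3
A 3 4.0
A 4 1.3
B 1 9.9
B 2 8.8
;
run;

data lead2;
  /* One-to-one merge (no BY): each copy starts one row further down */
  merge hourly
        hourly(firstobs=2 keep=Object Value rename=(Object=_obj1 Value=Value1))
        hourly(firstobs=3 keep=Object Value rename=(Object=_obj2 Value=Value2));
  /* Blank out look-ahead values that belong to a different object */
  if _obj1 ne Object then Value1 = .;
  if _obj2 ne Object then Value2 = .;
  drop _obj1 _obj2;
run;
```

The original row order is preserved because the first dataset in the MERGE is read sequentially. Once a shifted copy runs out of rows, its variables are missing, which gives the trailing missing values shown in the desired output.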
You could transpose the hours and then freely access the hours ahead within each object. Just to set the value of K and generate some dummy data:
* Assign K ;
%let K=3 ;
%let Kn=value&k;
* Generate test objects each containing 24 hourly records ;
data time ;
do object=1 to 10 ;
do hour=1 to 24 ;
value=round(ranuni(1)*10,0.1) ;
output ;
end ;
end ;
run ;
EDIT: I updated the step below as I realised the transpose isn't needed. Doing it all in one step gives a ~20% improvement in CPU time.
Use an array of the 24 hour values and loop through do i=1 to &k for each hour:
* Populate K variables ;
data output(keep=object hour value value1-&kn ) ;
set time ;
by object ;
retain k1-k24 . ;
array k(2,24) k1-k24 value1-value24 ;
k(1,hour)=value ;
if last.object then do hour=1 to 24 ;
value=k(1,hour) ;
do i=1 to &k ;
if hour+i <=24 then k(2,i)=k(1,hour+i) ;
else k(2,i)=.;
end ;
output ;
end ;
run ;
Consider the following example:
/* Create two not too interesting datasets: */
Data ones (keep = A);
Do i = 1 to 3;
A = 1;
output;
End;
run;
Data numbers;
Do B = 1 to 5;
output;
End;
Run;
/* The interesting step: */
Data together;
Set ones numbers;
if B = 2 then A = 2;
run;
So dataset ones contains one variable A with 3 observations, all ones and dataset numbers contains one variable (B) with 5 observations: the numbers 1 to 5.
I expect the resulting dataset together to have two columns (A and B) and the A column to read (vertically) 1, 1, 1, . , 2, . , . , .
However, when executing the code I find that column A reads 1, 1, 1, . , 2, 2, 2 , 2
Apparently the 2 created in the fifth observation is retained all the way down for no apparent reason. What is going on here?
(For the sake of completeness: when I split the last data step into two as below:
Data together;
set ones numbers;
run;
Data together;
set together;
if B = 2 then A = 2;
run;
it does do what I expect.)
Yes, any variable that is defined in a SET, MERGE, or UPDATE statement is automatically retained (not set to missing at the top of the data step loop). You can effectively ignore that with
output;
call missing(of <list of variables to clear out>);
run;
at the end of your data step.
This is how MERGE works for many-to-one merges, by the way, and the reason that many-to-many merges don't usually work the way you want them to.
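Applied to the step in the question, the CALL MISSING approach could look like the sketch below. It assumes the datasets ONES and NUMBERS from the question already exist.

```sas
data together;
  set ones numbers;
  if B = 2 then A = 2;
  output;            /* write the row explicitly...                 */
  call missing(A);   /* ...then clear A so it cannot carry forward  */
run;
```

With this change column A should read 1, 1, 1, . , 2, . , . , . as expected: rows from ONES get A back from the SET statement on each iteration, while rows from NUMBERS never read A, so it stays missing after being cleared.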
The difference between the one-step and the two-step versions is that in the one-step version, you are reading two separate data sets with different variables. If you are running this in interactive mode, i.e. SAS Program Editor or Enhanced Editor (not EG or batch mode), you can use the data step debugger to see this a little more clearly. You would see the following:
At the end of the last row of the ones dataset:
i A B
3 1 .
Notice B exists, but is missing. Then it goes back to the top of the data step loop. All three variables are left alone since they're all from the data sets. Then it attempts to read from ones one more time, which generates:
i A B
. . .
Then it realizes it cannot read from ones, and starts to read from numbers. At the end of the first row of the numbers dataset:
i A B
. . 1
Then it goes to the top, again changes nothing; then it reads in a 2 for B.
i A B
. . 2
Then it sets A to 2, per your program:
i A B
. 2 2
Then it returns to the start of the data step loop again.
i A B
. 2 2
Then it reads in B=3:
i A B
. 2 3
Then it continues looping, for B=4, 5.
Now, compare that to the two-step version, where the second step reads a single dataset. It will be nearly the same (with a small difference at the switch between datasets that does not yield a different result). Now we go to the step where A=2 B=2:
i A B
. 2 2
Now when the data step reads in the next row, it has all three variables on it. So it yields:
i A B
. . 3
Since it read in A=. from the row, it is setting it to missing. In the one-data-step version, it didn't have a value for A to read in, so it didn't replace the 2 with missing.