I have the following code. I am trying to test a paragraph (descr) for a list of keywords (key_words). When I execute this code, the log reads in all the variables for the array, but will only test 2 of the 20,000 rows in the do loop (do i=1 to 100 and on). Any suggestions on how to fix this issue?
data JE.KeywordMatchTemp1;
set JE.JEMasterTemp end=eof;
if _n_ = 1 then do i = 1 by 1 until (eof);
set JE.KeyWords;
array keywords[100] $30 _temporary_;
keywords[i] = Key_Words;
end;
match = 0;
do i = 1 to 100;
if index(descr, keywords[i]) then match = 1;
end;
drop i;
run;
Your problem is that your end=eof is in the wrong place.
This is a trivial example calculating the 'rank' of the age value for each SASHELP.CLASS respondent.
See where I put the end=eof in the example below: on the second set statement, because you need it to control the array-filling operation. Otherwise your loop, do i = 1 by 1 until (eof);, doesn't do what you're saying it should. It never terminates at eof, since eof (as defined on the first set statement) is never true while the array is being filled. Instead it terminates because you read beyond the end of the keyword dataset, which is specifically what you don't want.
That's what the end=eof on the second set statement is doing: it stops the loop before you try to pull a row after the array-filling dataset is finished, which would otherwise terminate the whole data step. Any time you see a data step terminate after exactly 2 iterations, you can be confident that's what the problem is likely to be - it is a very common issue.
data class_ranks;
set sashelp.class; *This dataset you are okay iterating over until the end of the dataset and then quitting the data step, like a normal data step.;
array ages[19] _temporary_;
if _n_=1 then do;
do _i = 1 by 1 until (eof); *iterate until the end of the *second* set statement;
set sashelp.class end=eof; *see here? This eof is telling this loop when to stop. It is okay that it is not created until after the loop is.;
ages[_i] = age;
end;
call sortn(of ages[*]); *ordering the ages loaded by number so they are in proper order for doing the trivial rank task;
end;
age_rank = whichn(age,of ages[*]); *determine where in the list the age falls. For a real version of this task you would have to check whether this ever happens, and if not you would have to have logic to find the nearest point or whatnot.;
run;
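Applied to the original keyword problem, the same idea might look like the sketch below. The datasets master and keyword_list are hypothetical stand-ins for JE.JEMasterTemp and JE.KeyWords, and the strip() call is an added assumption: INDEX treats trailing blanks as part of the search string, so a padded $30 array element would rarely match as-is.

```sas
/* Hypothetical stand-in for JE.JEMasterTemp */
data master;
  length descr $100;
  descr = 'Reversal of accrued bonus';    output;
  descr = 'Monthly rent payment';         output;
  descr = 'Manual adjustment to payroll'; output;
run;

/* Hypothetical stand-in for JE.KeyWords */
data keyword_list;
  length Key_Words $30;
  Key_Words = 'bonus';      output;
  Key_Words = 'adjustment'; output;
run;

data keyword_match;
  set master;                            /* drives the step; no END= needed here */
  array keywords[100] $30 _temporary_;   /* must be at least as large as keyword_list */
  if _n_ = 1 then do i = 1 by 1 until (eof);
    set keyword_list end=eof;            /* END= belongs to the lookup table */
    keywords[i] = Key_Words;
  end;
  match = 0;
  do i = 1 to dim(keywords);
    /* skip unused slots and remove padding before searching */
    if not missing(keywords[i]) then
      if index(descr, strip(keywords[i])) then match = 1;
  end;
  drop i Key_Words;
run;
```

INDEX is case sensitive, so in real data you may want to apply upcase() to both arguments.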
For each row, starting from the 250th row, I want to compute the sum of the 250 previous rows.
X= lag1(VWRETD)+ lag2(VWRETD)+ ... +lag250(VWRETD)
X = sum ( lag1(VWRETD), lag2(VWRETD), ... ,lag250(VWRETD) )
I tried to use the lag function, but spelling out this many lags is not practical.
I also want to calculate the sum of the 250 rows that follow each row.
What you're looking for is a moving sum both forwards and backwards where the sum is missing until that 250th observation. The easiest way to do this is with PROC EXPAND.
Sample data:
data have;
do MKDate = '01JAN1993'd to '31DEC2000'd;
VWRET = rand('uniform');
output;
end;
format MKDate mmddyy10.;
run;
Code:
proc expand data=have out=want;
id MKDate;
convert VWRET = x_backwards_250 / transform=(movsum 250 trimleft 250);
convert VWRET = x_forwards_250 / transform=(reverse movsum 250 trimleft 250 reverse);
run;
Here's what the transformation operations are doing:
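Creating a backwards moving sum of 250 observations, then setting the initial 250 to missing. Note that the MOVSUM window includes the current observation; if you need the sum of strictly previous rows, adding a LAG 1 operation after MOVSUM 250 should shift the window back by one.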
Reversing VWRET, creating a moving sum of 250 observations, setting the initial 250 to missing, then reversing it again. This effectively creates a forward moving sum.
The key is how to read observations from the preceding and following rows. As for your sum(n1, n2,...,nx) function, you can replace it with iterative summation.
This example uses multiple SET statements to sum a variable over the 25 preceding and 25 following rows:
data test;
set sashelp.air nobs=nobs;
if 25<_n_<nobs-25+1 then do;
do i=_n_-25 to _n_-1;
set sashelp.air(keep=air rename=air=pre_air) point=i;
sum_pre=sum(sum_pre,pre_air);
end;
do j=_n_+1 to _n_+25;
set sashelp.air(keep=air rename=air=post_air) point=j;
sum_post=sum(sum_post,post_air);
end;
end;
drop pre_air post_air;
run;
Only rows 26 through nobs-25 are calculated, where nobs is the number of observations in the input dataset sashelp.air.
Multiple SET statements with POINT= can be slow on a big dataset. If you want something more efficient, you can load the values into an array with a DOW-loop and use that instead:
data test;
array _val_[1024]_temporary_;
if _n_=1 then do i=1 by 1 until(eof);
set sashelp.air end=eof;
_val_[i]=air;
end;
set sashelp.air nobs=nobs;
if 25<_n_<nobs-25+1 then do;
do i=_n_-25 to _n_-1;
sum_pre=sum(sum_pre,_val_[i]);
end;
do j=_n_+1 to _n_+25;
sum_post=sum(sum_post,_val_[j]);
end;
end;
drop i j;
run;
The weakness is that you have to give the array a fixed dimension, which must be greater than or equal to nobs.
These techniques come from a concept called "Table Look-Up". For the SAS context, read "Table Look-Up by Direct Addressing: Key-Indexing -- Bitmapping -- Hashing", Paul Dorfman, SUGI 26.
You don't want to use normal arithmetic with missing values, because then the result is always a missing value. Use the SUM() function instead.
You don't need to spell out all of the lags. Just keep a normal running sum but add the wrinkle of removing the last one in by subtraction. So your equation only needs to reference the one lagged value.
Here is a simple example using running sum of 5 using SASHELP.CLASS data as an example:
%let n=5 ;
data step1;
set sashelp.class(keep=name age);
retain running_sum ;
running_sum=sum(running_sum,age,-(sum(0,lag&n.(age))));
if _n_ >= &n then want=running_sum;
run;
So the sum of the first 5 observations is 68. But for the next observation the sum goes down to 66 since the age on the 6th observation is 2 less than the age on the first observation.
To calculate the other variable sort the dataset in descending order and use the same logic to make another variable.
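A minimal sketch of that second pass, assuming the data has no natural sort key, so a row counter (the hypothetical variable row) is added first to allow reversing and restoring the order. The names numbered, reversed, step2, final and want_forward are illustrative only.

```sas
%let n=5;

/* Record the original order so it can be reversed and restored */
data numbered;
  set sashelp.class(keep=name age);
  row = _n_;
run;

proc sort data=numbered out=reversed;
  by descending row;
run;

/* Same running-sum logic as above, now applied to the reversed data */
data step2;
  set reversed;
  retain running_sum;
  running_sum = sum(running_sum, age, -(sum(0, lag&n.(age))));
  if _n_ >= &n then want_forward = running_sum;
run;

/* Put the rows back in their original order */
proc sort data=step2 out=final(drop=running_sum);
  by row;
run;
```

As in the backward version, the window here includes the current row, so want_forward on a given row is the sum of that row and the n-1 rows after it.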
I have the following code in SAS:
proc sort data=MYDATA1;
by VarNum Size Flavour Brand Retailer Market date;
run;
DATA MYDATA;
SET MYDATA1;
by VarNum Brand Size Flavour Retailer Market date;
/* Loop while for transformations. */
SUM = 0;
VAR1 = 1;
V1= Transformation;
VAR = Variable_for_SAS;
DO WHILE(FIND(V1,";")<>0);
V=V1;
V1=substr(V1,1,FIND(V1,";")-1);
IF SUBSTR(V1,1,1)="/" THEN
VT=STRIP(SUBSTR(V1,2,Find(V1,";")-2))||STRIP(date);
if _n_=1 then do;
declare hash h(dataset: 'MYDATA1');
h.definekey('Variable_date');
h.definedata('Variable_for_SAS');
h.definedone();
end;
if not h.find(key: VT) then new=Variable_for_SAS;
h.find();
SUM1=1*VAR;
/*Overwrite variable*/
VAR=SUM1;
V1=substr(TRIM(V),FIND(V,";")+1);
run;
But I get this error:
run;
_
117
ERROR 117-185: There was 1 unclosed DO block.
Do you know what I should do to solve this problem?
Is the problem that I use DO WHILE and a hash object together?
The code above is now complete.
Just add in the missing END statement for where you want your DO WHILE () loop to end.
Because you could be looping multiple times even on the first iteration of the data step, your IF condition around the hash creation steps is not sufficient to make sure those statements run only once. So either move the block that creates the HASH object to BEFORE the while loop, or add additional conditions to the IF statement to keep it from re-running on every iteration of the DO WHILE loop.
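A rough sketch of that structure, not a fix for the exact code above: the hash object is built once before the loop, and the DO WHILE has its matching END. The datasets lookup and requests, and the variables token, total and k, are hypothetical; SCAN is swapped in for the SUBSTR/FIND parsing to keep the example short.

```sas
/* Hypothetical lookup table standing in for MYDATA1 */
data lookup;
  length Variable_date $20;
  Variable_date = 'A20160101'; Variable_for_SAS = 10; output;
  Variable_date = 'B20160101'; Variable_for_SAS = 20; output;
run;

/* Hypothetical input: each row holds a ;-separated list of keys */
data requests;
  length Transformation $40;
  Transformation = 'A20160101;B20160101;'; output;
  Transformation = 'B20160101;';           output;
run;

data parsed;
  length token $20;
  if 0 then set lookup;        /* defines the hash variables without reading any rows */
  if _n_ = 1 then do;          /* built once, outside the DO WHILE loop */
    declare hash h(dataset: 'lookup');
    h.definekey('Variable_date');
    h.definedata('Variable_for_SAS');
    h.definedone();
  end;
  set requests;
  total = 0;
  k = 1;
  do while (scan(Transformation, k, ';') ne '');
    token = scan(Transformation, k, ';');
    /* find() returns 0 on success and loads Variable_for_SAS */
    if h.find(key: token) = 0 then total = total + Variable_for_SAS;
    k = k + 1;
  end;                         /* the END that closes the DO WHILE */
  drop k token Variable_date Variable_for_SAS;
run;
```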
Can someone explain this code to me in depth? I have added comments in the code where I am confused. Is there any way I can attach a csv of the data? Thanks in advance.
data have;
infile "&sasforum.\datasets\Returns.csv" firstobs=2 dsd truncover;
input DATE :mmddyy10. A B B_changed;
format date yymmdd10.;
run;
data spread;
do nb = 1 by 1 until(not missing(B));
set have;
end;
br = B;
do i = 1 to nb;
set have; /* I don't get how you can do i = 1 to nb with set have. There is no variable nb in have. The variable nb is read into the dataset spread. */
if nb > 1 then B_spread = (1+br)**(1/nb) - 1;
else B_spread = B;
output;
end;
drop nb i br;
run;
/* If I comment out "drop nb i br", I see that nb takes a value of 2 for the null values of B. I don't get how this is possible, because if I run the code only up to the line "br = B" and put an output statement in the first do loop, I clearly see nb take a value of one for the null values of B. Honestly, it is as if the first do loop reads in future observations of B as BR. Can you please explain this to me? The second dataset "bunch" seems to follow the same principles as the first, so I imagine that if I grasp how the dataset spread is created, I will understand how bunch is created. */
This is an advanced DATA step programming technique, commonly referred to as a DoW loop. If you search lexjansen.com for DoW, you will find helpful papers like http://support.sas.com/resources/papers/proceedings09/038-2009.pdf. The DoW loop codes an explicit loop around a SET statement. This is actually a "Double-DoW loop", because you have two explicit loops.
I made some sample data, and added some PUT statements to your code:
data have ;
input B ;
cards ;
.
.
1
2
.
.
.
3
;
data spread;
do nb = 1 by 1 until(not missing(B));
set have;
put _n_= "top do-loop " (nb B)(=) ;
end;
br = B;
do i = 1 to nb;
set have;
if nb > 1 then B_spread = (1+br)**(1/nb) - 1;
else B_spread = B;
output;
put _n_= "bottom do-loop " (nb B br B_spread)(=) ;
end;
drop nb i br;
run;
With that sample data, on the first iteration of the DATA step (_N_=1), the top DO loop will iterate three times, reading the first three records of HAVE. At that point, (not missing(B)) will be true, and the loop will not iterate again. The variable NB will have a value of 3. The bottom loop will then iterate 3 times, because NB has a value of 3. It will also read the first three records of HAVE, because each SET statement keeps its own position in the dataset. It will compute B_Spread, and output each record.
On the second iteration of the DATA step, the top DO loop will iterate only once. It will read the 4th record, with B=2. The bottom loop will iterate once, reading the 4th record, computing B_spread, and output.
On the third iteration of the DATA step, the top DO loop will iterate four times, reading the 5th through 8th records. The bottom loop will also iterate four times, reading the 5th through 8th records, computing B_spread, and output.
On the fourth iteration of the DATA step, the step will complete, because the SET statement in the top loop will read the End Of File mark.
The core concept of a Double-DoW loop is that typically you are reading the data in groups. Often groups are identified by an ID. Here they are defined by sequential records read until not missing(B). The top DO-loop reads the first group of records, and computes some value (in this case, it computes NB, the number of records in the group). Then the bottom DO-loop reads the first group of records, and computes some new value, using the value computed in top DO-loop. In this case, the bottom DO-loop computes B_spread, using NB.
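For the more common case where groups are identified by an ID, a Double-DoW loop might look like the sketch below. The dataset scores and the variables id, score, total, group_mean and deviation are made up for illustration; the data must already be sorted by id.

```sas
data scores;
  input id score;
  datalines;
1 10
1 20
2 5
2 7
2 9
;
run;

data centered;
  /* top loop: read one BY group and accumulate its total */
  do n = 1 by 1 until (last.id);
    set scores;
    by id;
    total = sum(total, score);
  end;
  group_mean = total / n;      /* n is the number of records in the group */

  /* bottom loop: re-read the same records and use the group value */
  do i = 1 to n;
    set scores;
    deviation = score - group_mean;
    output;
  end;
  drop n i total;
run;
```

Because total is not retained, it is reset to missing at the start of each DATA step iteration, which is what makes the per-group accumulation work without an explicit reset.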
I'm coming at SAS from a Python/R/Stata background, and learning that things are rather different in SAS. I'm approaching the following problem from the standpoint of one of these languages, perhaps SAS isn't up to what I want to do.
I have a panel dataset with an age column in it. I want to add new columns to the dataset using this age column. I'm going to simplify the functions of age to keep it simple in my example.
The goal is to loop over a sequence, and use the value of that sequence at each loop step to 1. assign the name of the new column and 2. assign the values of that column. I'm hoping to get my starting dataset, with new columns added to it taking values spline1 spline2... spline7
data somePath.FinalDataset;
do i = 1 to 7;
if i = 1 then
spline&i. = age;
if i ^= 1 then spline&i. = age + i;
end;
set somePath.StartingDataset;
run;
This code won't even run, though in an earlier version I was able to get it to run, but the new columns had their values shifted down one row from what they should have been. I include this code block as pseudocode of what I'm trying to do. Any help is much appreciated
One way to do this in SAS is with arrays. A SAS array can be used to reference a group of variables, and it can also create variables.
data have;
input age;
cards;
5
10
;
run;
data want;
set have;
array spline{7}; *create spline1 spline2 ... spline7;
do i=1 to 7;
if i = 1 then spline{i} = age;
else spline{i} = age + i;
end;
drop i;
run;
Spline{i} refers to the ith variable of the array named spline.
i is a regular variable, the DROP statement prevents it from being written to the output dataset.
When you say new columns were "shifted by one," note that spline1=age and spline2=age+2. You can change your code accordingly, e.g. if you want spline2=age+1, you could change your else statement to else spline{i} = age + i - 1 ; It is also possible to change the array statement to define it with 0 as the lower bound, rather than 1.
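A sketch of the zero-based variant, assuming you want spline1=age, spline2=age+1, and so on. The variable list is spelled out explicitly so the array elements 0 through 6 map onto spline1 through spline7.

```sas
data have;
  input age;
  datalines;
5
10
;
run;

data want;
  set have;
  /* index 0..6 maps onto spline1..spline7 */
  array spline{0:6} spline1-spline7;
  do i = lbound(spline) to hbound(spline);
    spline{i} = age + i;       /* i = 0 gives spline1 = age */
  end;
  drop i;
run;
```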
Arrays are likely the best way to solve this, but I will demonstrate a macro approach, which is necessary in some cases.
SAS separates its doing-things-with-data language from its writing-code language into the 'data step language' and the 'macro language'. They don't really talk to each other during a data step, because the macro language runs during the compilation stage (before any data is processed) while the data step language runs during the execution stage (while rows of data are being processed).
In any event, for something like this it's quite possible to write a macro to do what you want. Borrowing Quentin's general structure and initial dataset:
data have;
input age;
cards;
5
10
;
run;
%macro make_spline(var=, count=);
%local i;
%do i = 1 %to &count;
%if &i=1 %then &var.&i. = &var.;
%else &var.&i. = &var. + &i.;
; /* this semicolon ends the generated assignment statement */
%end;
/* You end up with the IF statement generating:
age1 = age
and the extra semicolon after the if/else generates the ; for that line, making it
age1 = age;
etc. for the other lines.
*/
%mend make_spline;
data want;
set have;
%make_spline(var=age,count=7);
run;
This would then perform what you're looking to perform. The looping is in the macro language, not in the data step. You can assign parameters however you see fit; I prefer to have parameters like above, or even more (start loop could also be a parameter, and in fact the assignment code could be a parameter!).
Probably a simple question. I have a simple dataset with scheduled payment dates in it.
DATA INFORM2;
INFORMAT previous_pmt_date scheduled_pmt_date MMDDYY10.;
INPUT previous_pmt_date scheduled_pmt_date;
FORMAT previous_pmt_date scheduled_pmt_date MMDDYYS10.;
DATALINES;
11/16/2015 12/16/2015
12/17/2015 01/16/2016
01/17/2016 02/16/2016
;
What I'm trying to do is to create a binary latest row indicator. For example, if I wanted to know the latest row as of 1/31/2016, I'd want row 2 to be flagged as the latest row. What I had been doing before is checking whether 1/31/2016 falls between the previous_pmt_date and the scheduled_pmt_date, but that isn't correct for my purposes. I'd like to do this in a data step as opposed to SQL subqueries. Any ideas?
Want:
previous_pmt_date scheduled_pmt_date latest_row_ind
11/16/2015 12/16/2015 0
12/17/2015 01/16/2016 1
01/17/2016 02/16/2016 0
Here's a solution that does it all in the single existing datastep without any additional sorting. First I'm going to modify your data slightly to include account as the solution really should take that into account as well:
DATA INFORM2;
INFORMAT previous_pmt_date scheduled_pmt_date MMDDYY10.;
INPUT account previous_pmt_date scheduled_pmt_date;
FORMAT previous_pmt_date scheduled_pmt_date MMDDYYS10.;
DATALINES;
1 11/16/2015 12/16/2015
1 12/17/2015 01/16/2016
1 01/17/2016 02/16/2016
2 11/16/2015 12/16/2015
2 12/17/2015 01/16/2016
2 01/17/2016 02/16/2016
;
run;
Specify a cutoff date:
%let cutoff_date = %sysfunc(mdy(1,31,2016));
This solution uses the approach from this question to save the variables in the next row of data, into the current row. You can drop the vars at the end if desired (I've commented out for the purposes of testing).
data want;
set inform2 end=eof;
by account scheduled_pmt_date;
recno = _n_ + 1;
if not eof then do;
set inform2 (keep=account previous_pmt_date scheduled_pmt_date
rename=(account = next_account
previous_pmt_date = next_previous_pmt_date
scheduled_pmt_date = next_scheduled_pmt_date)
) point=recno;
end;
else do;
call missing(next_account, next_previous_pmt_date, next_scheduled_pmt_date);
end;
select;
when ( next_account eq account and next_scheduled_pmt_date gt &cutoff_date ) flag='a';
when ( next_account ne account ) flag='b';
otherwise flag = 'z';
end;
*drop next:;
run;
This approach works by taking the current observation number in the dataset (obtained via _n_) and adding 1 to it to get the number of the next observation. We then use a second set statement with the point= option to load in that next observation and rename the variables at the same time so that they don't overwrite the current variables.
We then use some logic to flag the necessary records. I'm not 100% sure of the logic you require for your purposes, so I've provided some sample logic and used different flags to show which logic is being triggered.
Some notes...
The by statement isn't strictly necessary but I'm including it to (a) ensure that the data is sorted correctly, and (b) help future readers understand the intent of the datastep as some of the logic requires this sort order.
The call missing statement clears the next_ variables on the very last observation, where there is no following row to read. Variables read by a set statement are retained across iterations, so without it the last row would carry over the values loaded on the previous iteration. Comment it out to see what happens.
The end=eof syntax basically creates a temporary variable called eof that has a value of 1 when we get to the last observation on that set statement. We simply use this to determine if we're at the last row or not.
Finally, and very importantly, make sure you keep and rename only the variables required when you load in the second dataset; otherwise you will overwrite existing vars in the original data.
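If the rule you need is "the last row per account whose scheduled_pmt_date is on or before the cutoff", the same look-ahead can produce the 0/1 indicator directly. This is a sketch under that assumed rule, with the shortened name next_sched introduced here for illustration.

```sas
data inform2;
  informat previous_pmt_date scheduled_pmt_date mmddyy10.;
  input account previous_pmt_date scheduled_pmt_date;
  format previous_pmt_date scheduled_pmt_date mmddyys10.;
  datalines;
1 11/16/2015 12/16/2015
1 12/17/2015 01/16/2016
1 01/17/2016 02/16/2016
2 11/16/2015 12/16/2015
2 12/17/2015 01/16/2016
2 01/17/2016 02/16/2016
;
run;

%let cutoff_date = %sysfunc(mdy(1,31,2016));

data want;
  set inform2 end=eof;
  by account scheduled_pmt_date;
  recno = _n_ + 1;
  /* look ahead one row, or clear the look-ahead values on the last row */
  if not eof then
    set inform2(keep=account scheduled_pmt_date
                rename=(account=next_account scheduled_pmt_date=next_sched))
        point=recno;
  else call missing(next_account, next_sched);
  /* assumed rule: this row is due by the cutoff, and the next row
     belongs to another account or is scheduled after the cutoff */
  latest_row_ind = (scheduled_pmt_date <= &cutoff_date)
                   and (next_account ne account or next_sched > &cutoff_date);
  drop next_:;
run;
```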