SAS delete and return statements

I'm looking at someone else's code and trying to determine if:
if x = y then do;
delete;
return;
end;
is equivalent to:
if x = y then do;
delete;
end;
From the documentation on DELETE:
When DELETE executes, the current observation is not written to a data set, and SAS returns immediately to the beginning of the DATA step for the next iteration.
Which leads me to believe the 'return' statement in the first example is not executed?

May as well test, rather than guess.
Create a data set with x/y values, one that will meet the condition and one that will not.
Run the data step with PUT statements added so you can trace execution in the log.
From the log you see that nothing after the DELETE is executed, so you can confirm that RETURN is not executed and is redundant.
FYI - one thing to consider - has this behaviour changed over time or has the code changed where perhaps this was once valid? Usually that's the case.
data have;
input x y;
cards;
1 2
1 1
;;;;
run;
data demo;
set have;
if x = y then do;
put "Record Deleted 1";
delete;
put "Record Deleted 2";
return;
put "Record Deleted 3";
end;
else put "Record Retained";
run;
Log:
78 data demo;
79 set have;
80 if x = y then do;
81 put "Record Deleted 1";
82 delete;
83 put "Record Deleted 2";
84 return;
85
86 put "Record Deleted 3";
87
88 end;
89 else put "Record Retained";
90
91 run;
Record Retained
Record Deleted 1
NOTE: There were 2 observations read from the data set WORK.HAVE.
NOTE: The data set WORK.DEMO has 1 observations and 2 variables.
NOTE: DATA statement used (Total process time):
real time 0.00 seconds
user cpu time 0.00 seconds
system cpu time 0.00 seconds
memory 798.71k
OS Memory 24232.00k
Timestamp 10/18/2021 10:25:05 PM
Step Count 39 Switch Count 2
Page Faults 0
Page Reclaims 135
Page Swaps 0
Voluntary Context Switches 10
Involuntary Context Switches 0
Block Input Operations 0
Block Output Operations 264

SAS- Split Dataset based on values in a column (character) and then further split each of these datasets based on a time column

I have stock trading data for a day - about 60 million rows. Basically, I want to create a dataset that lists the average duration for each 5-minute interval for each of the stocks.
Dataset Original
Obs  time             symbol  tradePrice  tradeId    datatime      duration
1    093000154451968  A       152.24      7.1675E13  1943170200.2  .
2    093000845296640  A       151.99      5.2984E13  1943170200.8  0.69084
3    093000845296640  A       151.99      5.2984E13  1943170200.8  0.00000
4    093000846918400  A       151.99      5.2984E13  1943170200.8  0.00162
5    093000847665152  A       151.94      6.2879E13  1943170200.8  0.00075
6    093000847675136  A       151.94      6.2879E13  1943170200.8  0.00001
7    093000857328128  A       151.94      5.2984E13  1943170200.9  0.00965
8    093000889283840  A       151.24      7.1675E13  1943170200.9  0.03196
9    093001249114624  A       151.74      7.1675E13  1943170201.2  0.35983
10   093001824934912  A       151.99      7.1675E13  1943170201.8  0.57582
11   093001834587904  A       151.71      5.2989E13  1943170201.8  0.00965
12   093002261742336  A       151.99      7.1675E13  1943170202.3  0.42715
Here the "time" variable is laid out as hhmmssnnnnnnnnn (the n digits are the fraction of the second to 9 decimal places, i.e. nanoseconds),
and the "datetime" variable is converted to nanoseconds using both the date and the time.
For this code, I only work with one day of data so use "time" variable only.
Final Result
Stock  TimeInterval  Average duration
A      0930-0935     23456
A      0935-0940     56789
A      ........      ......
A      1555-1600     57689
B      0930-0935     23456
B      0935-0940     56789
B      ........      ......
B      1555-1600     57689
..     ...           ...
Z      0930-0935     23456
Z      0935-0940     56789
Z      ........      ......
Z      1555-1600     57689
Step 1:
I want to split the dataset such that I have a separate dataset for each of the stock symbols. I did this already.
Step 2:
To sum up the values in a column for every 5-minute interval from 0930 to 1600. I am struggling here.
Current Code:
/* Read Dataset */
DATA working_dataset;
set "C:\EQY_US_ALL_TRADE_202107\test_sample_sorted";
run;
/* List of Unique Symbols and feed them into new variables */
proc sql noprint;
select distinct symbol into :symbol1 - (NOTRIM)
from working_dataset;
%put &symbol1;
%put &symbol2;
/* Count of Unique Symbols and store the value in variable "n" */
proc sql noprint;
select count(distinct symbol) into: n
from working_dataset;
%put &n;
/* Keeping the variables needed for the analysis */
DATA working_dataset_2;
SET working_dataset (keep = symbol time duration tradePrice datetime tradeId);
run;
/* Extracting stock symbol names from the dataset;*/
proc sort data=working_dataset_2 out=symblist (keep = symbol)
nodupkey;
by symbol;
run;
/* Creating multiple datasets from the parent dataset;*/
data _null_;
set symblist;
call execute('data ' !! compress(symbol) !! '; set working_dataset_2; where symbol = "' !! symbol !! '"; run;');
run;
For Step 2:
I don't know how to but I am planning to run a loop for 78x 5 minute intervals between 0930 to 1600 using an if statement controlled by the loop value. The following is just wishful thinking - not code. I don't know where to begin.
data dataset_final;
set "A"; /* To be changed as per variable for stock symbol */
array symb(&n); /* this array should have all the stock symbols */
do over; /* do over for all the array items in the array symb(&n) */
do i = 1 to 78;
if (time GE (093000000000000 + &i.- 1)) & (time LT (093000000000000 + &i.))
then send obs to symb_j_0930+&i.-1
end;
Any help is appreciated. I am not sure how to attach the datafile.
Step 1 works. I am able to create different datasets using and call/execute.
Log for Step 1:
439
440 DATA working_dataset;
441 set "C:\EQY_US_ALL_TRADE_202107\test_sample_sorted";
442 run;
NOTE: There were 50000 observations read from the data set
C:\EQY_US_ALL_TRADE_202107\test_sample_sorted.
NOTE: The data set WORK.WORKING_DATASET has 50000 observations and 25 variables.
NOTE: DATA statement used (Total process time):
real time 0.12 seconds
cpu time 0.09 seconds
443
444 proc sql noprint;
445 select distinct symbol into :symbol1 - (NOTRIM)
-
22
76
ERROR 22-322: Syntax error, expecting one of the following: ',', :, FROM, NOTRIM.
ERROR 76-322: Syntax error, statement will be ignored.
446 from working_dataset;
447 %put &symbol1;
A
448 %put &symbol2;
AA
449
NOTE: The SAS System stopped processing this step because of errors.
NOTE: PROCEDURE SQL used (Total process time):
real time 0.03 seconds
cpu time 0.01 seconds
450 proc sql noprint;
451 select count(distinct symbol) into: n
452 from working_dataset;
453 %put &n;
2
454
NOTE: PROCEDURE SQL used (Total process time):
real time 0.04 seconds
cpu time 0.04 seconds
455 DATA working_dataset_2;
456 SET working_dataset (keep = symbol time duration tradePrice datetime tradeId);
457
458 /* Extracting stock symbol names from the dataset;*/
NOTE: There were 50000 observations read from the data set WORK.WORKING_DATASET.
NOTE: The data set WORK.WORKING_DATASET_2 has 50000 observations and 6 variables.
NOTE: DATA statement used (Total process time):
real time 0.00 seconds
cpu time 0.00 seconds
459 proc sort data=working_dataset_2 out=symblist (keep = symbol)
460 nodupkey;
461 by symbol;
462 run;
NOTE: There were 50000 observations read from the data set WORK.WORKING_DATASET_2.
NOTE: 49998 observations with duplicate key values were deleted.
NOTE: The data set WORK.SYMBLIST has 2 observations and 1 variables.
NOTE: PROCEDURE SORT used (Total process time):
real time 0.03 seconds
cpu time 0.01 seconds
463 /* Creating multiple datasets from the parent dataset;*/
464 data _null_;
465 set symblist;
466 call execute('data ' !! compress(symbol) !! '; set working_dataset_2; where symbol = "' !! symbol
466! !! '"; run;');
467 run;
NOTE: There were 2 observations read from the data set WORK.SYMBLIST.
NOTE: DATA statement used (Total process time):
real time 0.00 seconds
cpu time 0.00 seconds
NOTE: CALL EXECUTE generated line.
1 + data A; set working_dataset_2; where symbol = "A "; run;
NOTE: There were 24304 observations read from the data set WORK.WORKING_DATASET_2.
WHERE symbol='A ';
NOTE: The data set WORK.A has 24304 observations and 6 variables.
NOTE: DATA statement used (Total process time):
real time 0.01 seconds
cpu time 0.01 seconds
2 + data AA; set working_dataset_2; where symbol = "AA "; run;
NOTE: There were 25696 observations read from the data set WORK.WORKING_DATASET_2.
WHERE symbol='AA ';
NOTE: The data set WORK.AA has 25696 observations and 6 variables.
NOTE: DATA statement used (Total process time):
real time 0.02 seconds
cpu time 0.01 seconds
Step 2 is where I am horribly struggling. I am not sure how to do the code.
Assuming you have an actual time value (you can create one from your 15-digit time string), you can just convert that time to the start of its 5-minute interval and use that to group the data. No need for looping (or splitting).
Let's modify your example data so it actually has more than one stock symbol and more than one time interval. You can convert the first 6 characters of your TIME string into an actual TIME value. Which we can then convert to the beginning of the 5 minute interval.
data have ;
input time :$16. symbol :$4. tradePrice tradeId datatime duration;
tod = input(time,hhmmss6.);
interval='00:05:00't*int(tod/'00:05:00't);
format tod interval tod8.;
nanosec = input(substr(time,7),32.);
cards;
093000154451968 A 152.24 7.1675E13 1943170200.2 .
093000845296640 A 151.99 5.2984E13 1943170200.8 0.69084
093500845296640 A 151.99 5.2984E13 1943170200.8 0.00000
093500846918400 A 151.99 5.2984E13 1943170200.8 0.00162
093800847665152 A 151.94 6.2879E13 1943170200.8 0.00075
093000847675136 B 151.94 6.2879E13 1943170200.8 0.00001
093100857328128 B 151.94 5.2984E13 1943170200.9 0.00965
093900889283840 B 151.24 7.1675E13 1943170200.9 0.03196
093001249114624 C 151.74 7.1675E13 1943170201.2 0.35983
093301824934912 C 151.99 7.1675E13 1943170201.8 0.57582
093801834587904 C 151.71 5.2989E13 1943170201.8 0.00965
094102261742336 C 151.99 7.1675E13 1943170202.3 0.42715
;
So once you have a dataset (or even a view) that has the three variables needed, SYMBOL, INTERVAL and DURATION, you can then just use PROC SUMMARY to produce the mean of the durations.
proc summary nway ;
class symbol interval;
var duration;
output out=want mean=mean_duration ;
run;
Results:
mean_
Obs symbol interval _TYPE_ _FREQ_ duration
1 A 09:30:00 3 2 0.69084
2 A 09:35:00 3 3 0.00079
3 B 09:30:00 3 2 0.00483
4 B 09:35:00 3 1 0.03196
5 C 09:30:00 3 2 0.46783
6 C 09:35:00 3 1 0.00965
7 C 09:40:00 3 1 0.42715
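The "or even a view" idea mentioned above could be sketched roughly as follows. This assumes the question's working_dataset_2 and assumes that time is stored there as a character variable; the view name trades_v is made up for illustration.

```sas
/* Hypothetical view: derive the 5-minute interval without copying the data. */
/* Assumes TIME is character, laid out as hhmmss followed by 9 digits.       */
data trades_v / view=trades_v;
  set working_dataset_2(keep=symbol time duration);
  tod = input(time, hhmmss6.);                      /* first 6 characters -> SAS time value */
  interval = '00:05:00't * int(tod / '00:05:00't);  /* floor to the start of the interval   */
  format tod interval tod8.;
run;

/* One row per symbol and interval, with the mean duration */
proc summary data=trades_v nway;
  class symbol interval;
  var duration;
  output out=want(drop=_type_ _freq_) mean=mean_duration;
run;
```

With a view, the interval is computed while PROC SUMMARY reads the data, so the large trade file is not written out a second time.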
You say you're struggling with Step 2: "To sum up the values in a column for every 5-minute interval from 0930 to 1600. I am struggling here."
I'm just going to address that part of your question based on looking at the final result that you want. I'm providing code so you don't need to split the data into multiple datasets of each stock.
data final;
set <dataset>;
time_interval = intck("minute", "09:30:00"t, tradetime); /* the t suffix makes this a time literal, not a character string */
time_interval = time_interval - mod(time_interval, 5);
run;
proc sql;
select stock, time_interval, avg(duration) as avg_duration
from final
group by stock, time_interval;
quit;
But, if you want to keep multiple datasets by stock, then just remove the "stock" variable from the code and apply this to every stock dataset you have.

SAS - If then do condition

I have a column which is numeric,
and I have logic as shown below:
if col_1 = "2" then do;
col2 = col3+col4;
end;
Since col_1 is a numeric column, I was expecting SAS to throw an error, or at least not to execute the statements in the DO block.
However, the statements in the DO block are executed.
It produces the same result as the code below:
if col_1 = 2 then do;
col2 = col3+col4;
end;
Can you explain how this is possible?
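Did you not notice the note in the log? SAS converts the character value to numeric automatically and writes a note about it. The DATA statement option NOTE2ERR turns that note into an error, as the second step below shows.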
44 data _null_;
45 x = 2;
46 if x eq '2' then put 'NOTE: C2N ' _all_;
47 run;
NOTE: Character values have been converted to numeric values at the places given by: (Line):(Column).
46:12
NOTE: C2N x=2 _ERROR_=0 _N_=1
NOTE: DATA statement used (Total process time):
real time 0.00 seconds
cpu time 0.00 seconds
2 The SAS System 19:00 Friday, February 26, 2021
48
49 data _null_ / note2err;
50 x = 2;
51 if x eq '2' then put 'NOTE: C2N ' _all_;
ERROR: Character value found where numeric value needed at line 51 column 12.
52 run;
NOTE: The SAS System stopped processing this step because of errors.
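If the comparison value really does arrive as a character string, one way to avoid relying on automatic conversion is to convert explicitly with INPUT or PUT. A small sketch, where the variable c is made up for illustration:

```sas
data _null_;
  x = 2;      /* numeric   */
  c = '2';    /* character */
  /* Explicit character-to-numeric conversion */
  if x = input(c, 8.) then put 'Match after INPUT';
  /* Explicit numeric-to-character conversion */
  if put(x, 1.) = c then put 'Match after PUT';
run;
```

Because both comparisons are between values of the same type, this step should not generate the conversion note at all.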

SAS and do loop

I'm writing a program in SAS.
Here's the dataset I have:
id huuse days
1 0 4
1 0 3
1 1 12
1 1 1
1 2 15
2 1 13
2 0 16
2 1 18
2 0 44
For each ID, I want to delete the record if variable huuse ne 1, until I get to the first huuse=1. Then I want to keep that record and all subsequent records for that id, no matter what value huuse is. So for id=1, I want to delete the first two records, then keep all records for id=1 starting with the 3rd record. For id=2, the first record has huuse=1, so I want to keep all records for id=2.
The data set I want should look like this:
id huuse days
1 0 4
1 0 3
1 1 12
1 1 1
1 2 15
2 1 13
2 0 16
2 1 18
2 0 44
I tried this code, but it removes all records that have huuse ne 1.
data want;
set have;
by id;
do until (huuse=1);
if huuse = 1 then LEAVE;
if huuse ne 1 then DELETE;
END;
run;
I've tried several variations of do loops, but they all do the same thing.
The DATA step is a program with an implicit loop that reads every record of the data set specified in the SET statement. Any program data vector (pdv) variables not coming from the data set are, by default, reset to missing at the top of the implicit loop. You change that behavior using a RETAIN statement to name variables that should not get reset.
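The reset-versus-RETAIN behavior can be seen in a tiny sketch like this one. The data set three and the variables kept and not_kept are made up for illustration.

```sas
/* Three input rows, just to drive three iterations of the implicit loop */
data three;
  do n = 1 to 3;
    output;
  end;
run;

data _null_;
  set three;
  retain kept 0;
  kept = kept + 1;              /* retained: carries over, so 1, 2, 3          */
  not_kept = sum(not_kept, 1);  /* reset to missing each iteration: always 1   */
  put n= kept= not_kept=;
run;
```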
So, in your problem you need a tracking variable that records the state of the condition "Have I seen huuse=1 yet in this group?". Call this variable one_flag.
RETAIN one_flag, so that you control when its value changes.
At the start of each BY group, reset one_flag to false (0).
When huuse is first seen as 1, set the flag to true (1).
Example:
data want(drop=one_flag);
set have;
by id;
retain one_flag 0;
if first.id then one_flag = 0;
if not one_flag and huuse = 1 then one_flag = 1;
if one_flag then OUTPUT; * want all rows in group starting at first huuse=1;
run;
You can place the SET and BY statement inside an explicit DO and that changes the operating behavior of the program, especially if the explicit loop is terminated according to a LAST.<var> automatic variable. Such a loop is commonly called a DOW loop by SAS programmers. There is no phrase DOW loop in the SAS documentation.
Example:
data want;
do until (last.id);
set have;
by id;
if not one_flag and huuse=1 then one_flag = 1;
if one_flag then OUTPUT; * want all rows in group starting at first huuse=1;
end;
run;
Because the looping is explicit and never reaches the top of the program within the loop, there is no need to RETAIN the flag variable, nor to reset it. Program variables that are not retained are reset automatically at the top of the program, and the top of the program is only reached at the start of a BY group. Learn more about this programming construct in the SGF 2013 paper "The Magnificent DO", Paul M. Dorfman
Your source and result are same :-)
But if I understood your question correctly, the solution is quite simple with RETAIN. I added two rows (id=3) to the example to make it clear that I understood correctly.
The code with example table:
data test;
id=1;huuse=0;days=4;output;
id=1;huuse=0;days=3;output;
id=1;huuse=1;days=12;output;
id=1;huuse=1;days=1;output;
id=1;huuse=2;days=15;output;
id=2;huuse=1;days=13;output;
id=2;huuse=0;days=16;output;
id=2;huuse=1;days=18;output;
id=2;huuse=0;days=44;output;
id=3;huuse=0;days=1;output;
id=3;huuse=1;days=2;output;
run;
data test_output;
set test;
retain keep_id -1;
if (keep_id ne id and huuse = 1) then keep_id=id; /* start keeping at the first huuse=1 for this id */
if keep_id = id then output;
run;
/* the results:
id huuse days keep_id
1 1 12 1
1 1 1 1
1 2 15 1
2 1 13 2
2 0 16 2
2 1 18 2
2 0 44 2
3 1 2 3
*/

Reshaping data from long to wide

Below is an example that I found to reshape data from long to wide. But I am not able to understand the code, especially the way they are replacing blanks and why. Can someone help me understand the code?
Example 1: Reshaping one variable
We will begin with a small data set with only one variable to be reshaped. We will use the variables year and faminc (for family income) to create three new variables: faminc96, faminc97 and faminc98. First, let's look at the data set and use proc print to display it.
DATA long ;
INPUT famid year faminc ;
CARDS ;
1 96 40000
1 97 40500
1 98 41000
2 96 45000
2 97 45400
2 98 45800
3 96 75000
3 97 76000
3 98 77000
;
RUN ;
PROC PRINT DATA=long ;
RUN ;
Obs famid year faminc
1 1 96 40000
2 1 97 40500
3 1 98 41000
4 2 96 45000
5 2 97 45400
6 2 98 45800
7 3 96 75000
8 3 97 76000
9 3 98 77000
Now let's look at the program. The first step in the reshaping process is sorting the data (using proc sort) on an identification variable (famid) and saving the sorted data set (longsort). Next we write a data step to do the actual reshaping. We will explain each of the statements in the data step in order.
PROC SORT DATA=long OUT=longsort ;
BY famid ;
RUN ;
DATA wide1 ;
SET longsort ;
BY famid ;
KEEP famid faminc96 -faminc98 ;
RETAIN faminc96 - faminc98 ;
ARRAY afaminc(96:98) faminc96 - faminc98 ;
IF first.famid THEN
DO;
DO i = 96 to 98 ;
afaminc( i ) = . ;
END;
END;
afaminc( year ) = faminc ;
IF last.famid THEN OUTPUT ;
RUN;
This is a good example to compare and contrast with DO UNTIL(LAST.FAMID). The DO UNTIL version does away with the explicit RETAIN, the initialization to missing on FIRST.FAMID, and the LAST.FAMID test for when to OUTPUT. Those operations are still done, just using the built-in features of the data step loop.
DATA long;
INPUT famid year faminc;
CARDS;
1 96 40000
1 97 40500
1 98 41000
2 96 45000
2 97 45400
2 98 45800
3 96 75000
3 97 76000
3 98 77000
;;;;
RUN;
proc print;
run;
data wide;
do until(last.famid);
set long;
by famid;
ARRAY afaminc[96:98] faminc96-faminc98;
afaminc[year]=faminc;
end;
drop year faminc;
run;
proc print;
run;
The main element here is the SAS retain statement.
The data step is executed once for every observation in the input dataset. At the start of each iteration, variables created in the step (those not read from the input dataset) are set to missing, and then the next observation is loaded.
If a variable is RETAINed it will not be reset, but will keep the information from the last iteration.
BY famid ;
Your dataset is sorted and the data step uses a BY statement. This creates the automatic variables first.famid and last.famid. These are just flags that are 1 on the first/last observation of an id-group and 0 otherwise.
RETAIN faminc96 - faminc98 ;
As already explained faminc96 - faminc98 will keep their value from one datastep iteration to the next.
ARRAY afaminc(96:98) faminc96 - faminc98 ;
Just an array, so you can call the variables by number instead of name.
IF first.famid THEN
DO;
DO i = 96 to 98 ;
afaminc( i ) = . ;
END;
END;
For every first observation in an id-group the retained variables are reset. Otherwise you would carry values over from one id-group to the next. The same could be done with IF first.famid THEN call missing(of afaminc(*));
afaminc( year ) = faminc ;
Writing the information to your transposed variables, according to the year.
IF last.famid THEN OUTPUT ;
After you have written all the values to your new variables, you only OUTPUT one observation (the last) in every id-group to the new dataset. As the variables were retained, they are all filled at this point.
This data step is fast and purpose-built, but generally you could just use proc transpose.
I highly recommend proc transpose. It'll make your life easier.
http://support.sas.com/resources/papers/proceedings09/060-2009.pdf
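For comparison, a minimal PROC TRANSPOSE sketch of the same reshaping, assuming the sorted data set longsort from the question. The output name wide_t is made up for illustration.

```sas
/* One row per famid; the year values become the suffixes of the new columns */
proc transpose data=longsort out=wide_t(drop=_name_) prefix=faminc;
  by famid;     /* input must be sorted by famid      */
  id year;      /* 96, 97, 98 -> faminc96, faminc97, faminc98 */
  var faminc;   /* the values to spread across columns */
run;

proc print data=wide_t;
run;
```

Unlike the array version, this does not need the range of years to be known in advance, since the column names are taken from the data.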

Output when using FIRST and LAST

Say we have the SAS code:
data t1 (keep=KEY COUNT C_AMT2 C_AMT);
SET t1;
BY key;
RETAIN COUNT C_AMT;
IF FIRST.KEY THEN
DO;
COUNT=0;
C_AMT2=0;
END;
COUNT+1;
C_AMT=SUM(C_AMT2, C_AMT);
IF LAST.KEY THEN
OUTPUT;
RUN;
What would change here if I were to remove "IF LAST.KEY THEN OUTPUT;"? The documentation says that OUTPUT causes SAS to write to the data set immediately, not at the end of the data step. Since here it sits right before the end of the data step, would removing it make no difference?
Removing it would cause a difference.
Then you would have a record for every input observation rather than one per value of key (assuming keys repeat). Controlling the output with IF LAST.KEY means you keep only the last record of each key.
It looks like it's calculating a count and total so there are other ways to achieve this. I'm going to assume that there's some other code that you've suppressed.
The relevant section from the documentation that refers to this is in the link you have above
Implicit versus Explicit Output
By default, every DATA step contains an implicit OUTPUT statement at the end of each iteration that tells SAS to write observations to the data set or data sets that are being created. Placing an explicit OUTPUT statement in a DATA step overrides the automatic output, and SAS adds an observation to a data set only when an explicit OUTPUT statement is executed. Once you use an OUTPUT statement to write an observation to any one data set, however, there is no implicit OUTPUT statement at the end of the DATA step. In this situation, a DATA step writes an observation to a data set only when an explicit OUTPUT executes. You can use the OUTPUT statement alone or as part of an IF-THEN or SELECT statement or in DO-loop processing.
Here's some code that simulates your issue:
*Generate random data;
Data have;
do Key=1 to 2;
do i=1 to 3;
Amount=floor(rand('normal', 50, 5));
OUTPUT;
end;
end;
run;
data t1;
set have;
retain count C_Amt;
by Key;
if first.key then do;
count=0;
C_Amt=0;
end;
Count+1;
c_amt=sum(c_amt, amount);
if last.key then output;
run;
proc print data=t1;
run;
data t1;
set have;
retain count C_Amt;
by Key;
if first.key then do;
count=0;
C_Amt=0;
end;
Count+1;
c_amt=sum(c_amt, amount);
*if last.key then output;
run;
proc print data=t1;
run;
And the corresponding output:
With last.key then output
Obs Key i Amount count C_Amt
1 1 3 46 3 147
2 2 3 44 3 154
And with out last.key
Obs Key i Amount count C_Amt
1 1 1 47 1 47
2 1 2 54 2 101
3 1 3 46 3 147
4 2 1 61 1 61
5 2 2 49 2 110
6 2 3 44 3 154
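As noted earlier in this answer, a count and a total per key can be produced in other ways. One possible sketch uses PROC SQL on the simulated have data set; the names t1_sql, n_obs and total_amt are made up for illustration.

```sas
proc sql;
  create table t1_sql as
  select key,
         count(*)    as n_obs,      /* rows per key           */
         sum(amount) as total_amt   /* total of Amount per key */
  from have
  group by key;
quit;
```

This gives one row per key, like the IF LAST.KEY THEN OUTPUT version, but without carrying along the other variables from the last row of each group.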
Commas are an error here:
(keep=KEY, COUNT, C_AMT2, C_AMT)
Anyway:
RUN;
usually means:
output;
return;
But if SAS encounters an explicit output statement anywhere in your code, the implicit output at the end of the step (implied by the run statement) is no longer performed.
Hence, since your output statement is conditionally executed only IF LAST.KEY, in your dataset you will have only observations marked as last.key, because your RUN; will only mean return.
Something like:
data want; set have; output; run;
Is exactly the same as a step with no explicit output:
data want; set have; run;
You can use output as you want:
data want01 want02;
set have;
if a then output want01;
if b then output want02;
run;
data want01;
var=var1;
output;
var=var2;
output;
run;