SAS - If then do condition - sas

I have a numeric column, and I have logic like this:
if col_1 = "2" then do;
col2 = col3+col4;
end;
Since col_1 is a numeric column, I expected SAS to throw an error or to skip the statements inside the DO block.
However, the statements inside the DO block are executed.
It produces the same result as the code below:
if col_1 = 2 then do;
col2 = col3+col4;
end;
Can you explain how this is possible?

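Did you not notice the note in the log? SAS automatically converts the character value '2' to a number for the comparison and writes a NOTE about it. There is a DATA statement option, NOTE2ERR, which turns that automatic type conversion note into an error, as the second step below shows.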
44 data _null_;
45 x = 2;
46 if x eq '2' then put 'NOTE: C2N ' _all_;
47 run;
NOTE: Character values have been converted to numeric values at the places given by: (Line):(Column).
46:12
NOTE: C2N x=2 _ERROR_=0 _N_=1
NOTE: DATA statement used (Total process time):
real time 0.00 seconds
cpu time 0.00 seconds
2 The SAS System 19:00 Friday, February 26, 2021
48
49 data _null_ / note2err;
50 x = 2;
51 if x eq '2' then put 'NOTE: C2N ' _all_;
ERROR: Character value found where numeric value needed at line 51 column 12.
52 run;
NOTE: The SAS System stopped processing this step because of errors.
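If the comparison value really does arrive as text, one way to avoid relying on the automatic conversion is to convert it explicitly with the INPUT function. The following is a minimal sketch; the variable names x and c and the 8. informat are illustrative assumptions, not part of the original question.

```sas
data _null_;
  x = 2;      /* numeric variable, as in the question */
  c = '2';    /* hypothetical character value to compare against */
  /* INPUT converts the character value explicitly, so the step does not
     depend on the automatic character-to-numeric conversion */
  if x = input(c, 8.) then put 'Explicit conversion matched: ' x= c=;
run;
```

Because the conversion is written out, the comparison behaves the same whether or not NOTE2ERR is in effect.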

Related

SAS- Split Dataset based on values in a column (character) and then further split each of these datasets based on a time column

I have stock trading data for a day - about 60 million rows. Basically, I want to create a dataset that lists the average duration for each 5-minute interval for each of the stocks.
Dataset Original
Obs  time             symbol  tradePrice  tradeId    datatime      duration
  1  093000154451968  A       152.24      7.1675E13  1943170200.2  .
  2  093000845296640  A       151.99      5.2984E13  1943170200.8  0.69084
  3  093000845296640  A       151.99      5.2984E13  1943170200.8  0.00000
  4  093000846918400  A       151.99      5.2984E13  1943170200.8  0.00162
  5  093000847665152  A       151.94      6.2879E13  1943170200.8  0.00075
  6  093000847675136  A       151.94      6.2879E13  1943170200.8  0.00001
  7  093000857328128  A       151.94      5.2984E13  1943170200.9  0.00965
  8  093000889283840  A       151.24      7.1675E13  1943170200.9  0.03196
  9  093001249114624  A       151.74      7.1675E13  1943170201.2  0.35983
 10  093001824934912  A       151.99      7.1675E13  1943170201.8  0.57582
 11  093001834587904  A       151.71      5.2989E13  1943170201.8  0.00965
 12  093002261742336  A       151.99      7.1675E13  1943170202.3  0.42715
Here the "time" variable is laid out as hhmmssnnnnnnnnn (the n digits are nanoseconds, i.e. nine digits of fractional seconds),
and the "datetime" variable is converted to nanoseconds using both the date and the time.
For this code, I only work with one day of data, so I use the "time" variable only.
Final Result
Stock  TimeInterval  Average duration
A      0930-0935     23456
A      0935-0940     56789
A      ........      ......
A      1555-1600     57689
B      0930-0935     23456
B      0935-0940     56789
B      ........      ......
B      1555-1600     57689
..     ...           ...
Z      0930-0935     23456
Z      0935-0940     56789
Z      ........      ......
Z      1555-1600     57689
Step 1:
I want to split the dataset such that I have a separate dataset for each of the stock symbols. I did this already.
Step 2:
To sum up the values in a column for every 5-minute interval from 0930 to 1600. I am struggling here.
Current Code:
/* Read Dataset */
DATA working_dataset;
set "C:\EQY_US_ALL_TRADE_202107\test_sample_sorted";
run;
/* List of Unique Symbols and feed them into new variables */
proc sql noprint;
select distinct symbol into :symbol1 - (NOTRIM)
from working_dataset;
%put &symbol1;
%put &symbol2;
/* Count of Unique Symbols and store the value in variable "n" */
proc sql noprint;
select count(distinct symbol) into: n
from working_dataset;
%put &n;
/* Keeping the variables needed for the analysis */
DATA working_dataset_2;
SET working_dataset (keep = symbol time duration tradePrice datetime tradeId);
run;
/* Extracting stock symbol names from the dataset;*/
proc sort data=working_dataset_2 out=symblist (keep = symbol)
nodupkey;
by symbol;
run;
/* Creating multiple datasets from the parent dataset;*/
data _null_;
set symblist;
call execute('data ' !! compress(symbol) !! '; set working_dataset_2; where symbol = "' !! symbol !! '"; run;');
run;
For Step 2:
I don't know how to but I am planning to run a loop for 78x 5 minute intervals between 0930 to 1600 using an if statement controlled by the loop value. The following is just wishful thinking - not code. I don't know where to begin.
data dataset_final;
set "A"; /* To be changed as per variable for stock symbol */
array symb(&n); /* this array should have all the stock symbols */
do over; /* do over for all the array items in the array symb(&n) */
do i = 1 to 78;
if (time GE (093000000000000 + &i.- 1)) & (time LT (093000000000000 + &i.))
then send obs to symb_j_0930+&i.-1
end;
Any help is appreciated. I am not sure how to attach the data file.
Step 1 works. I am able to create the separate datasets using CALL EXECUTE.
Log for Step 1:
439
440 DATA working_dataset;
441 set "C:\EQY_US_ALL_TRADE_202107\test_sample_sorted";
442 run;
NOTE: There were 50000 observations read from the data set
C:\EQY_US_ALL_TRADE_202107\test_sample_sorted.
NOTE: The data set WORK.WORKING_DATASET has 50000 observations and 25 variables.
NOTE: DATA statement used (Total process time):
real time 0.12 seconds
cpu time 0.09 seconds
443
444 proc sql noprint;
445 select distinct symbol into :symbol1 - (NOTRIM)
-
22
76
ERROR 22-322: Syntax error, expecting one of the following: ',', :, FROM, NOTRIM.
ERROR 76-322: Syntax error, statement will be ignored.
446 from working_dataset;
447 %put &symbol1;
A
448 %put &symbol2;
AA
449
NOTE: The SAS System stopped processing this step because of errors.
NOTE: PROCEDURE SQL used (Total process time):
real time 0.03 seconds
cpu time 0.01 seconds
450 proc sql noprint;
451 select count(distinct symbol) into: n
452 from working_dataset;
453 %put &n;
2
454
NOTE: PROCEDURE SQL used (Total process time):
real time 0.04 seconds
cpu time 0.04 seconds
455 DATA working_dataset_2;
456 SET working_dataset (keep = symbol time duration tradePrice datetime tradeId);
457
458 /* Extracting stock symbol names from the dataset;*/
NOTE: There were 50000 observations read from the data set WORK.WORKING_DATASET.
NOTE: The data set WORK.WORKING_DATASET_2 has 50000 observations and 6 variables.
NOTE: DATA statement used (Total process time):
real time 0.00 seconds
cpu time 0.00 seconds
459 proc sort data=working_dataset_2 out=symblist (keep = symbol)
460 nodupkey;
461 by symbol;
462 run;
NOTE: There were 50000 observations read from the data set WORK.WORKING_DATASET_2.
NOTE: 49998 observations with duplicate key values were deleted.
NOTE: The data set WORK.SYMBLIST has 2 observations and 1 variables.
NOTE: PROCEDURE SORT used (Total process time):
real time 0.03 seconds
cpu time 0.01 seconds
463 /* Creating multiple datasets from the parent dataset;*/
464 data _null_;
465 set symblist;
466 call execute('data ' !! compress(symbol) !! '; set working_dataset_2; where symbol = "' !! symbol
466! !! '"; run;');
467 run;
NOTE: There were 2 observations read from the data set WORK.SYMBLIST.
NOTE: DATA statement used (Total process time):
real time 0.00 seconds
cpu time 0.00 seconds
NOTE: CALL EXECUTE generated line.
1 + data A; set working_dataset_2; where symbol = "A "; run;
NOTE: There were 24304 observations read from the data set WORK.WORKING_DATASET_2.
WHERE symbol='A ';
NOTE: The data set WORK.A has 24304 observations and 6 variables.
NOTE: DATA statement used (Total process time):
real time 0.01 seconds
cpu time 0.01 seconds
2 + data AA; set working_dataset_2; where symbol = "AA "; run;
NOTE: There were 25696 observations read from the data set WORK.WORKING_DATASET_2.
WHERE symbol='AA ';
NOTE: The data set WORK.AA has 25696 observations and 6 variables.
NOTE: DATA statement used (Total process time):
real time 0.02 seconds
cpu time 0.01 seconds
Step 2 is where I am horribly struggling. I am not sure how to do the code.
Assuming you have an actual time value (you can create one from the leading digits of your time string), you can just convert that time to the start of its 5-minute interval and use that to group the data. No need for looping (or splitting).
Let's modify your example data so it actually has more than one stock symbol and more than one time interval. You can convert the first 6 characters of your TIME string into an actual TIME value, which we can then convert to the beginning of the 5-minute interval.
data have ;
input time :$16. symbol :$4. tradePrice tradeId datatime duration;
tod = input(time,hhmmss6.);
interval='00:05:00't*int(tod/'00:05:00't);
format tod interval tod8.;
nanosec = input(substr(time,7),32.);
cards;
093000154451968 A 152.24 7.1675E13 1943170200.2 .
093000845296640 A 151.99 5.2984E13 1943170200.8 0.69084
093500845296640 A 151.99 5.2984E13 1943170200.8 0.00000
093500846918400 A 151.99 5.2984E13 1943170200.8 0.00162
093800847665152 A 151.94 6.2879E13 1943170200.8 0.00075
093000847675136 B 151.94 6.2879E13 1943170200.8 0.00001
093100857328128 B 151.94 5.2984E13 1943170200.9 0.00965
093900889283840 B 151.24 7.1675E13 1943170200.9 0.03196
093001249114624 C 151.74 7.1675E13 1943170201.2 0.35983
093301824934912 C 151.99 7.1675E13 1943170201.8 0.57582
093801834587904 C 151.71 5.2989E13 1943170201.8 0.00965
094102261742336 C 151.99 7.1675E13 1943170202.3 0.42715
;
So once you have a dataset (or even a view) that has the three variables needed, SYMBOL, INTERVAL and DURATION, you can then just use PROC SUMMARY to produce the mean of the durations.
proc summary nway ;
class symbol interval;
var duration;
output out=want mean=mean_duration ;
run;
Results:
mean_
Obs symbol interval _TYPE_ _FREQ_ duration
1 A 09:30:00 3 2 0.69084
2 A 09:35:00 3 3 0.00079
3 B 09:30:00 3 2 0.00483
4 B 09:35:00 3 1 0.03196
5 C 09:30:00 3 2 0.46783
6 C 09:35:00 3 1 0.00965
7 C 09:40:00 3 1 0.42715
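The question asked for interval labels such as 0930-0935. As a rough sketch, assuming the WANT dataset and the INTERVAL variable produced by the steps above, such a label could be built from the interval start time. The names want_labeled, TimeInterval and interval_end are hypothetical and chosen only for illustration.

```sas
/* Sketch: add a text label such as 0930-0935 to each row of WANT */
data want_labeled;
  set want;
  length TimeInterval $9;
  /* end of the 5-minute bucket that starts at INTERVAL */
  interval_end = interval + '00:05:00't;
  /* build hhmm-hhmm from zero-padded hour and minute parts */
  TimeInterval = cats(put(hour(interval), z2.), put(minute(interval), z2.),
                      '-',
                      put(hour(interval_end), z2.), put(minute(interval_end), z2.));
  drop interval_end;
run;
```

The Z2. format keeps the leading zero, so an interval starting at 09:30:00 is labeled with 0930 rather than 930.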
You say you're struggling with Step 2: "To sum up the values in a column for every 5-minute interval from 0930 to 1600. I am struggling here."
I'm just going to address that part of your question based on looking at the final result that you want. I'm providing code so you don't need to split the data into multiple datasets of each stock.
data final;
set <dataset>;
time_interval = intck("minute", "09:30:00"t, tradetime); /* the t suffix makes this a time literal; tradetime is assumed to be a SAS time value */
time_interval = time_interval - mod(time_interval, 5);
run;
proc sql;
select stock, time_interval, avg(duration) as avg_duration
from final
group by stock, time_interval;
quit;
But, if you want to keep multiple datasets by stock, then just remove the "stock" variable from the code and apply this to every stock dataset you have.

Reshaping data from long to wide

Below is an example that I found for reshaping data from long to wide. But I am not able to understand the code, especially the way the values are reset to missing and why. Can someone help me understand the code?
Example 1: Reshaping one variable
We will begin with a small data set with only one variable to be reshaped. We will use the variables year and faminc (for family income) to create three new variables: faminc96, faminc97 and faminc98. First, let's look at the data set and use proc print to display it.
DATA long ;
INPUT famid year faminc ;
CARDS ;
1 96 40000
1 97 40500
1 98 41000
2 96 45000
2 97 45400
2 98 45800
3 96 75000
3 97 76000
3 98 77000
;
RUN ;
PROC PRINT DATA=long ;
RUN ;
Obs famid year faminc
1 1 96 40000
2 1 97 40500
3 1 98 41000
4 2 96 45000
5 2 97 45400
6 2 98 45800
7 3 96 75000
8 3 97 76000
9 3 98 77000
Now let's look at the program. The first step in the reshaping process is sorting the data (using proc sort) on an identification variable (famid) and saving the sorted data set (longsort). Next we write a data step to do the actual reshaping. We will explain each of the statements in the data step in order.
PROC SORT DATA=long OUT=longsort ;
BY famid ;
RUN ;
DATA wide1 ;
SET longsort ;
BY famid ;
KEEP famid faminc96 -faminc98 ;
RETAIN faminc96 - faminc98 ;
ARRAY afaminc(96:98) faminc96 - faminc98 ;
IF first.famid THEN
DO;
DO i = 96 to 98 ;
afaminc( i ) = . ;
END;
END;
afaminc( year ) = faminc ;
IF last.famid THEN OUTPUT ;
RUN;
This is a good example to compare and contrast with a DO UNTIL(LAST.FAMID) loop. That approach does away with the RETAIN, the reset to missing on FIRST.FAMID, and the LAST. test that decides when to OUTPUT. Those operations are still performed, just by the built-in behavior of the data step loop.
DATA long;
INPUT famid year faminc;
CARDS;
1 96 40000
1 97 40500
1 98 41000
2 96 45000
2 97 45400
2 98 45800
3 96 75000
3 97 76000
3 98 77000
;;;;
RUN;
proc print;
run;
data wide;
do until(last.famid);
set long;
by famid;
ARRAY afaminc[96:98] faminc96-faminc98;
afaminc[year]=faminc;
end;
drop year faminc;
run;
proc print;
run;
The main element here is the SAS RETAIN statement.
The data step executes once for every observation in the input dataset. At the start of each iteration, variables created in the step (here faminc96 - faminc98) are reset to missing, and then the next observation is read from the dataset.
If a variable is RETAINed, it is not reset but keeps its value from the previous iteration.
BY famid ;
Your dataset is sorted and the data step uses a BY statement. This creates the automatic variables first.famid and last.famid. These are just flags that are set to 1 on the first/last observation of each id-group and to 0 otherwise.
RETAIN faminc96 - faminc98 ;
As already explained faminc96 - faminc98 will keep their value from one datastep iteration to the next.
ARRAY afaminc(96:98) faminc96 - faminc98 ;
Just an array, so you can call the variables by number instead of name.
IF first.famid THEN
DO;
DO i = 96 to 98 ;
afaminc( i ) = . ;
END;
END;
On the first observation of each id-group the retained variables are reset. Otherwise you would carry values over from one id-group to the next. The same could be done with IF first.famid then call missing(of afaminc(*));
afaminc( year ) = faminc ;
Writing the information to your transposed variables, according to the year.
IF last.famid THEN OUTPUT ;
After you have written all the values to your new variables, you only OUTPUT one observation (the last) in every id-group to the new dataset. As the variables were retained, they are all filled at this point.
This data step is fast and purpose-built, but in general you could just use PROC TRANSPOSE.
I highly recommend proc transpose. It'll make your life easier.
http://support.sas.com/resources/papers/proceedings09/060-2009.pdf
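For comparison, here is a minimal PROC TRANSPOSE sketch for the same long dataset used in the example above. The output dataset name wide_t is an assumption made for illustration; the PREFIX= and ID combination is what produces the variable names faminc96, faminc97 and faminc98.

```sas
/* BY-group processing requires the data to be sorted by famid */
proc sort data=long out=longsort;
  by famid;
run;

/* One output row per famid; each value of YEAR becomes a column
   named faminc<year> that holds the matching FAMINC value */
proc transpose data=longsort out=wide_t (drop=_name_) prefix=faminc;
  by famid;
  id year;
  var faminc;
run;

proc print data=wide_t;
run;
```

Unlike the array version, this does not need the range of years to be known in advance, since the columns are derived from the values of YEAR found in the data.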

Output when using FIRST and LAST

Say we have the SAS code:
data t1 (keep=KEY COUNT C_AMT2 C_AMT);
SET t1;
BY key;
RETAIN COUNT C_AMT;
IF FIRST.KEY THEN
DO;
COUNT=0;
C_AMT2=0;
END;
COUNT+1;
C_AMT=SUM(C_AMT2, C_AMT);
IF LAST.KEY THEN
OUTPUT;
RUN;
What would change here if I were to remove "IF LAST.KEY THEN OUTPUT;"? The documentation says that OUTPUT causes SAS to write the observation to the data set immediately, not at the end of the data step. Since it sits right before the end of the data step here, would removing it make no difference?
Removing it would cause a difference.
Without it, you would get one output record for every input record, assuming there are multiple records per key. Controlling the output with IF LAST.KEY means you keep only the last record for each key.
It looks like it's calculating a count and total so there are other ways to achieve this. I'm going to assume that there's some other code that you've suppressed.
The relevant section from the documentation that refers to this is in the link you have above
Implicit versus Explicit Output
By default, every DATA step contains an implicit OUTPUT statement at the end of each iteration that tells SAS to write observations to the data set or data sets that are being created. Placing an explicit OUTPUT statement in a DATA step overrides the automatic output, and SAS adds an observation to a data set only when an explicit OUTPUT statement is executed. Once you use an OUTPUT statement to write an observation to any one data set, however, there is no implicit OUTPUT statement at the end of the DATA step. In this situation, a DATA step writes an observation to a data set only when an explicit OUTPUT executes. You can use the OUTPUT statement alone or as part of an IF-THEN or SELECT statement or in DO-loop processing.
Here's some code that simulates your issue:
*Generate random data;
Data have;
do Key=1 to 2;
do i=1 to 3;
Amount=floor(rand('normal', 50, 5));
OUTPUT;
end;
end;
run;
data t1;
set have;
retain count C_Amt;
by Key;
if first.key then do;
count=0;
C_Amt=0;
end;
Count+1;
c_amt=sum(c_amt, amount);
if last.key then output;
run;
proc print data=t1;
run;
data t1;
set have;
retain count C_Amt;
by Key;
if first.key then do;
count=0;
C_Amt=0;
end;
Count+1;
c_amt=sum(c_amt, amount);
*if last.key then output;
run;
proc print data=t1;
run;
And the corresponding output:
With last.key then output
Obs Key i Amount count C_Amt
1 1 3 46 3 147
2 2 3 44 3 154
And without last.key
Obs Key i Amount count C_Amt
1 1 1 47 1 47
2 1 2 54 2 101
3 1 3 46 3 147
4 2 1 61 1 61
5 2 2 49 2 110
6 2 3 44 3 154
Commas are an error here:
(keep=KEY, COUNT, C_AMT2, C_AMT)
Anyway:
RUN;
usually means:
output;
return;
However, if the step contains an explicit OUTPUT statement anywhere, the implicit output at the end of the step (the one implied by RUN;) is no longer performed.
Hence, since your OUTPUT statement executes only IF LAST.KEY, your dataset will contain only the observations flagged as last.key, because RUN; then means only return.
Something like:
data want; set have; output; run;
is exactly the same as the version without an explicit output:
data want; set have; run;
You can use output as you want:
data want01 want02;
set have;
if a then output want01;
if b then output want02;
run;
data want01;
var=var1;
output;
var=var2;
output;
run;

How to sum values grouped by year from 4 quarters of data into one value in SAS

I have a data set that contains quarterly data value. But now I want to sum the quarterly values which have the same year.
Data h :
time value
01JAN90 23
01APR90 31
01JUL90 13
01OCT90 45
01JAN91 11
01APR91 4
01JUL91 1
01OCT91 17
I want my result data like this:
time value
1990 112
1991 33
If your time variable is numeric (a SAS date), you can use a FORMAT statement within PROC SUMMARY to group by year automatically as the PROC runs. (Thanks to @Joe for showing this in comments to my original answer.)
PROC SUMMARY NWAY DATA = h;
CLASS time;
FORMAT time YEAR. ;
VAR value;
/* the class variable is still named time; it displays as the year */
OUTPUT
OUT = result (
KEEP = time value
)
SUM (value) =
;
RUN;
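An equivalent result can also be sketched with PROC SQL and the YEAR function, which produces a real numeric year column instead of a formatted date. This assumes time is a numeric SAS date; the DATA step below only rebuilds the sample data from the question, and the names result_sql, yr and total_value are hypothetical.

```sas
/* Rebuild the sample data; DATE7. reads values such as 01JAN90.
   Two-digit years are resolved according to the YEARCUTOFF= option. */
data h;
  input time :date7. value;
  format time date7.;
  datalines;
01JAN90 23
01APR90 31
01JUL90 13
01OCT90 45
01JAN91 11
01APR91 4
01JUL91 1
01OCT91 17
;
run;

/* Sum the quarterly values within each calendar year */
proc sql;
  create table result_sql as
  select year(time) as yr, sum(value) as total_value
  from h
  group by calculated yr;
quit;
```

The CALCULATED keyword lets the GROUP BY clause refer to the yr column defined in the same SELECT.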

SAS: code to create 'ever' variable for subsequent observations once event occurs

data have;
input ID Herpes;
datalines;
111 1
111 .
111 1
111 1
111 1
111 .
111 .
254 0
254 0
254 1
254 .
254 1
331 1
331 1
331 1
331 0
331 1
331 1
;
Where 1=Positive, 0=Negative, .=Missing/Not Indicated
Observations are sorted by ID (random numbers, no meaning) and date of visit (not included because not needed from here forward). Once you have Herpes, you always have Herpes. How do I adjust the Herpes variable (or create a new one) so that once a Positive is indicated (Herpes=1), all following obs will show Herpes=1 for that ID?
I want the resulting set to look like this:
111 1
111 1 (missing changed to 1)
111 1
111 1
111 1
111 1 (missing changed to 1)
111 1 (missing changed to 1)
254 0
254 0
254 1
254 1 (missing changed to 1 following positive at prior visit)
254 1
331 1
331 1
331 1
331 1 (patient-indicated negative/0 changed to 1 because of prior + visit)
331 1
331 1
The code below should do it. The trick is to use BY-group processing in conjunction with the RETAIN statement.
proc sort data=have;
by id;
run;
data want;
set have;
by id;
retain uh_oh .;
if first.id then do;
uh_oh = .;
end;
if herpes then do;
uh_oh = 1;
end;
if uh_oh then do;
herpes = 1;
end;
drop uh_oh;
run;
You could create a new variable that sums the herpes flag within ID:-
proc sort data=have;
by id;
run;
data have_too;
set have;
by id;
if first.id then sum_herpes_in_id = 0;
/* sum statement: retains the total and ignores missing values */
sum_herpes_in_id + herpes;
run;
That way it's always positive from the first time herpes=1 within id. You can access these observations in other datasteps / procs with where sum_herpes_in_id;.
And for free, you also have the total number of herpes flags per id (if that's of any use).
This can also be done in SQL. Here is an example using UPDATE to update the table in place. (This could also be done in base SAS with MODIFY.)
proc sql undopolicy=none;
update have H
set herpes=1 where exists (
select 1 from have V
where h.id=v.id
and h.dtvar ge v.dtvar
and v.herpes=1
);
quit;
The SAS version using modify. BY doesn't work in a one-dataset modify for some reason, so you have to do your own version of first.id.
data have;
modify have;
drop _:;
retain _t _i;
if _i ne id then _t=.;
_i=id;
_t = _t or herpes;
if _t then herpes=1;
run;