I am using SAS UE, which doesn't come with PROC EXPAND. I need to compute rolling standard deviations for return_stock using a 12-month window. The date frequency of my dataset is monthly. It looks something like this:
date permno ret return_mkt
02/01/2000 10000 0.06 0.03
03/01/2000 10000 0.03 0.08
...
01/01/2005 10000 0.03 0.04
02/01/2005 10000 0.06 0.03
03/01/2005 10000 0.09 0.08
my code:
data df1;
array ret{0:11} _temporary_;
set df;
by permno;
if first.permno then call missing(of ret{*});
ret{mod(_n_,12)} = monthly_ret;
std_dev = std(of ret{*});
run;
Can anyone tell me why I am getting this error? "The variable type of ret is invalid in this context"?
Your temporary array name ret is the same as the variable ret in the df data set.
Change the name of the variable ret in the data set to monthly_ret.
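A sketch of how the corrected step could look; the RENAME= dataset option (used here instead of permanently renaming the column), the per-PERMNO counter, and the wait for a full 12-observation window are my additions, and the code assumes DF is sorted by PERMNO and DATE:
data df1;
    array ret{0:11} _temporary_;
    /* RENAME= removes the clash between the RET column and the temporary array RET */
    set df(rename=(ret=monthly_ret));
    by permno;
    if first.permno then do;
        call missing(of ret{*});
        n_obs = 0;
    end;
    n_obs + 1;                               /* per-PERMNO counter instead of _N_ */
    ret{mod(n_obs, 12)} = monthly_ret;
    /* only compute the rolling std once 12 months are available */
    if n_obs >= 12 then std_dev = std(of ret{*});
    drop n_obs;
run;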
You could do this with PROC SQL
proc sql;
create table want as
select *,
(select std(close) from sashelp.stocks
where stock=a.stock
and (intnx('month', a.Date, -11, 'b') le Date le a.Date))
as std
from sashelp.stocks as a;
quit;
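The same correlated-subquery pattern can be aimed at the poster's DF directly; this is just a sketch and assumes DATE holds numeric SAS dates and the return column is named RET:
proc sql;
    create table want as
    select a.*,
           (select std(b.ret)
            from df as b
            where b.permno = a.permno
              and intnx('month', a.date, -11, 'b') le b.date le a.date)
           as std_dev
    from df as a;
quit;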
Related
I have stock trading data for a day - about 60 million rows. Basically, I want to create a dataset that lists the average duration for each 5-minute interval for each of the stocks.
Dataset Original
Obs  time             symbol  tradePrice  tradeId    datatime      duration
1    093000154451968  A       152.24      7.1675E13  1943170200.2  .
2    093000845296640  A       151.99      5.2984E13  1943170200.8  0.69084
3    093000845296640  A       151.99      5.2984E13  1943170200.8  0.00000
4    093000846918400  A       151.99      5.2984E13  1943170200.8  0.00162
5    093000847665152  A       151.94      6.2879E13  1943170200.8  0.00075
6    093000847675136  A       151.94      6.2879E13  1943170200.8  0.00001
7    093000857328128  A       151.94      5.2984E13  1943170200.9  0.00965
8    093000889283840  A       151.24      7.1675E13  1943170200.9  0.03196
9    093001249114624  A       151.74      7.1675E13  1943170201.2  0.35983
10   093001824934912  A       151.99      7.1675E13  1943170201.8  0.57582
11   093001834587904  A       151.71      5.2989E13  1943170201.8  0.00965
12   093002261742336  A       151.99      7.1675E13  1943170202.3  0.42715
Here "time" variable is setup as hhmmssnnnnnnnnn (n indicates nanoseconds - i.e. seconds are counted for 9 significant digits after decimal)
and "datetime" variable is converted to nanoseconds using date and time both.
For this code, I only work with one day of data so use "time" variable only.
Final Result
Stock  TimeInterval  Average duration
A      0930-0935     23456
A      0935-0940     56789
A      ........      ......
A      1555-1600     57689
B      0930-0935     23456
B      0935-0940     56789
B      ........      ......
B      1555-1600     57689
..     ...           ...
Z      0930-0935     23456
Z      0935-0940     56789
Z      ........      ......
Z      1555-1600     57689
Step 1:
I want to split the dataset such that I have a separate dataset for each of the stock symbols. I did this already.
Step 2:
To sum up the values in a column for every 5-minute interval from 0930 to 1600. I am struggling here.
Current Code:
/* Read Dataset */
DATA working_dataset;
set "C:\EQY_US_ALL_TRADE_202107\test_sample_sorted";
run;
/* List of Unique Symbols and feed them into new variables */
proc sql noprint;
select distinct symbol into :symbol1 - (NOTRIM)
from working_dataset;
%put &symbol1;
%put &symbol2;
/* Count of Unique Symbols and store the value in variable "n" */
proc sql noprint;
select count(distinct symbol) into: n
from working_dataset;
%put &n;
/* Keeping the variables needed for the analysis */
DATA working_dataset_2;
SET working_dataset (keep = symbol time duration tradePrice datetime tradeId);
run;
/* Extracting stock symbol names from the dataset;*/
proc sort data=working_dataset_2 out=symblist (keep = symbol)
nodupkey;
by symbol;
run;
/* Creating multiple datasets from the parent dataset;*/
data _null_;
set symblist;
call execute('data ' !! compress(symbol) !! '; set working_dataset_2; where symbol = "' !! symbol !! '"; run;');
run;
For Step 2:
I don't know how yet, but I am planning to run a loop over the 78 five-minute intervals between 0930 and 1600, using an IF statement controlled by the loop value. The following is just wishful thinking, not code. I don't know where to begin.
data dataset_final;
set "A"; /* To be changed as per variable for stock symbol */
array symb(&n); /* this array should have all the stock symbols */
do over; /* do over for all the array items in the array symb(&n) */
do i = 1 to 78;
if (time GE (093000000000000 + &i.- 1)) & (time LT (093000000000000 + &i.))
then send obs to symb_j_0930+&i.-1
end;
Any help is appreciated. I am not sure how to attach the datafile.
Step 1 works. I am able to create the different datasets using CALL EXECUTE.
Log for Step 1:
439
440 DATA working_dataset;
441 set "C:\EQY_US_ALL_TRADE_202107\test_sample_sorted";
442 run;
NOTE: There were 50000 observations read from the data set
C:\EQY_US_ALL_TRADE_202107\test_sample_sorted.
NOTE: The data set WORK.WORKING_DATASET has 50000 observations and 25 variables.
NOTE: DATA statement used (Total process time):
real time 0.12 seconds
cpu time 0.09 seconds
443
444 proc sql noprint;
445 select distinct symbol into :symbol1 - (NOTRIM)
-
22
76
ERROR 22-322: Syntax error, expecting one of the following: ',', :, FROM, NOTRIM.
ERROR 76-322: Syntax error, statement will be ignored.
446 from working_dataset;
447 %put &symbol1;
A
448 %put &symbol2;
AA
449
NOTE: The SAS System stopped processing this step because of errors.
NOTE: PROCEDURE SQL used (Total process time):
real time 0.03 seconds
cpu time 0.01 seconds
450 proc sql noprint;
451 select count(distinct symbol) into: n
452 from working_dataset;
453 %put &n;
2
454
NOTE: PROCEDURE SQL used (Total process time):
real time 0.04 seconds
cpu time 0.04 seconds
455 DATA working_dataset_2;
456 SET working_dataset (keep = symbol time duration tradePrice datetime tradeId);
457
458 /* Extracting stock symbol names from the dataset;*/
NOTE: There were 50000 observations read from the data set WORK.WORKING_DATASET.
NOTE: The data set WORK.WORKING_DATASET_2 has 50000 observations and 6 variables.
NOTE: DATA statement used (Total process time):
real time 0.00 seconds
cpu time 0.00 seconds
459 proc sort data=working_dataset_2 out=symblist (keep = symbol)
460 nodupkey;
461 by symbol;
462 run;
NOTE: There were 50000 observations read from the data set WORK.WORKING_DATASET_2.
NOTE: 49998 observations with duplicate key values were deleted.
NOTE: The data set WORK.SYMBLIST has 2 observations and 1 variables.
NOTE: PROCEDURE SORT used (Total process time):
real time 0.03 seconds
cpu time 0.01 seconds
463 /* Creating multiple datasets from the parent dataset;*/
464 data _null_;
465 set symblist;
466 call execute('data ' !! compress(symbol) !! '; set working_dataset_2; where symbol = "' !! symbol
466! !! '"; run;');
467 run;
NOTE: There were 2 observations read from the data set WORK.SYMBLIST.
NOTE: DATA statement used (Total process time):
real time 0.00 seconds
cpu time 0.00 seconds
NOTE: CALL EXECUTE generated line.
1 + data A; set working_dataset_2; where symbol = "A "; run;
NOTE: There were 24304 observations read from the data set WORK.WORKING_DATASET_2.
WHERE symbol='A ';
NOTE: The data set WORK.A has 24304 observations and 6 variables.
NOTE: DATA statement used (Total process time):
real time 0.01 seconds
cpu time 0.01 seconds
2 + data AA; set working_dataset_2; where symbol = "AA "; run;
NOTE: There were 25696 observations read from the data set WORK.WORKING_DATASET_2.
WHERE symbol='AA ';
NOTE: The data set WORK.AA has 25696 observations and 6 variables.
NOTE: DATA statement used (Total process time):
real time 0.02 seconds
cpu time 0.01 seconds
Step 2 is where I am struggling badly. I am not sure how to write the code.
Assuming you have an actual time value (you can create one from your TIME string), you can just convert that time to the start of the 5-minute interval and use that to group the data. No need for looping (or splitting).
Let's modify your example data so it actually has more than one stock symbol and more than one time interval. You can convert the first 6 characters of your TIME string into an actual SAS time value, which we can then convert to the beginning of the 5-minute interval.
data have ;
input time :$16. symbol :$4. tradePrice tradeId datatime duration;
tod = input(time,hhmmss6.);
interval='00:05:00't*int(tod/'00:05:00't);
format tod interval tod8.;
nanosec = input(substr(time,7),32.);
cards;
093000154451968 A 152.24 7.1675E13 1943170200.2 .
093000845296640 A 151.99 5.2984E13 1943170200.8 0.69084
093500845296640 A 151.99 5.2984E13 1943170200.8 0.00000
093500846918400 A 151.99 5.2984E13 1943170200.8 0.00162
093800847665152 A 151.94 6.2879E13 1943170200.8 0.00075
093000847675136 B 151.94 6.2879E13 1943170200.8 0.00001
093100857328128 B 151.94 5.2984E13 1943170200.9 0.00965
093900889283840 B 151.24 7.1675E13 1943170200.9 0.03196
093001249114624 C 151.74 7.1675E13 1943170201.2 0.35983
093301824934912 C 151.99 7.1675E13 1943170201.8 0.57582
093801834587904 C 151.71 5.2989E13 1943170201.8 0.00965
094102261742336 C 151.99 7.1675E13 1943170202.3 0.42715
;
So once you have a dataset (or even a view) that has the three variables needed, SYMBOL, INTERVAL, and DURATION, you can then just use PROC SUMMARY to produce the mean of the durations.
proc summary nway ;
class symbol interval;
var duration;
output out=want mean=mean_duration ;
run;
Results:
mean_
Obs symbol interval _TYPE_ _FREQ_ duration
1 A 09:30:00 3 2 0.69084
2 A 09:35:00 3 3 0.00079
3 B 09:30:00 3 2 0.00483
4 B 09:35:00 3 1 0.03196
5 C 09:30:00 3 2 0.46783
6 C 09:35:00 3 1 0.00965
7 C 09:40:00 3 1 0.42715
You say you're struggling with Step 2: "To sum up the values in a column for every 5-minute interval from 0930 to 1600. I am struggling here."
I'm just going to address that part of your question, based on the final result that you want. I'm providing code so you don't need to split the data into one dataset per stock.
data final;
set <dataset>;
/* '09:30:00't is a SAS time literal; tradetime is assumed to be a numeric SAS time value */
time_interval = intck("minute", '09:30:00't, tradetime);
time_interval = time_interval - mod(time_interval, 5);
run;
proc sql;
select stock, time_interval, avg(duration) as avg_duration
from final
group by stock, time_interval;
quit;
But, if you want to keep multiple datasets by stock, then just remove the "stock" variable from the code and apply this to every stock dataset you have.
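For completeness, a per-dataset sketch using WORK.A from the CALL EXECUTE step; the TRADETIME derivation from the first 6 characters of the TIME string is an assumption borrowed from the other answer:
data final_a;
    set a;                                            /* one of the split datasets */
    tradetime = input(substr(time, 1, 6), hhmmss6.);  /* hhmmss portion of the TIME string */
    time_interval = intck('minute', '09:30:00't, tradetime);
    time_interval = time_interval - mod(time_interval, 5);
run;

proc sql;
    select time_interval, avg(duration) as avg_duration
    from final_a
    group by time_interval;
quit;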
I want to calculate the daily implied volatility for a data set of option chains. I have all the necessary data in a dataset with the columns:
OptionID opt_price strike today exp eq_price intrate
The SAS code for the IV is:
options pageno=1 nodate ls=80 ps=64;
proc fcmp;
opt_price=5;
strike=50;
today='20jul2010'd;
exp='21oct2010'd;
eq_price=50;
intrate=.05;
time=exp - today;
array opts[5] initial abconv relconv maxiter status
(.5 .001 1.0e-6 100 -1);
function blksch(strike, time, eq_price, intrate, volty);
return(blkshclprc(strike, time/365.25,
eq_price, intrate, volty));
endsub;
bsvolty=solve("blksch", opts, opt_price, strike,
time, eq_price, intrate, .);
put 'Option Implied Volatility:' bsvolty
'Initial value: ' opts[1]
'Solve status: ' opts[5];
run;
Source: https://documentation.sas.com/?docsetId=proc&docsetTarget=p1xoknqns865t7n1wehj6xarwhdb.htm&docsetVersion=9.4&locale=en#p0ymk0vrf7cecfn1kec073rxqm7z
Now, this function somehow does not need sigma. Why?
Second, how can I feed in and output a dataset with option series covering a few years?
I tried BY OptionID, but I don't know how to feed in the data correctly and then add the result to a dataset (as a new variable called bsvolty).
Use the FCMP options DATA= and OUT= to provide inputs and capture outputs.
As for the missing value (.) in the sigma argument position, the SOLVE documentation states:
The SOLVE function finds the value of the specified argument that makes the expression of the following form equal to zero.
expected-value - function-name(argument-1, argument-2, ..., argument-n)
You specify the argument of interest with a missing value (.), which appears in place of the argument in the parameter list that is shown above. If the SOLVE function finds the value, then the value that is returned for this function is the implied value.
So the missing argument passed to SOLVE() stands in for the blkshclprc sigma (i.e. the volatility).
Example code:
data have;
input OptionID opt_price strike today: date9. exp: date9. eq_price intrate;
format today exp date9.;
datalines;
1 5 50 20jul2010 21oct2010 50 0.05
2 5 75 21jul2010 22oct2010 50 0.05
3 5 55 22jul2010 23oct2010 50 0.05
4 5 60 23jul2010 24oct2010 50 0.05
;
proc fcmp data=have out=want;
time = exp - today;
array opts[5]
initial abconv relconv maxiter status
( .5 .001 1.0e-6 100 -1)
;
function blksch(strike, time, eq_price, intrate, volty);
put volty=; /* show the SOLVE iterations in the OUTPUT window */
return ( blkshclprc (
strike, /* E: exercise prices */
time/365.25, /* t: time to maturity (years) */
eq_price, /* S: share price */
intrate, /* r: annualized risk-free interest rate, continuously compounded */
volty /* sigma: volatility of the underlying asset */
));
endsub;
bsvolty=solve("blksch", opts, opt_price, strike,
time, eq_price, intrate, .);
run;
The OUT= data set WANT gains the new BSVOLTY variable, and the PUT statement echoes the SOLVE iterations to the OUTPUT window.
I have a dataset called stores. I want to extract total_sales (the sum of Retail_Price), the proportion of sales, and the cumulative proportion of sales for each store in SAS.
Sample dataset: Stores
Date Store_Postcode Retail_Price month Distance
08/31/2013 CR7 8LE 470 8 7057.8
10/26/2013 CR7 8LE 640 10 7057.8
08/19/2013 CR7 8LE 500 8 7057.8
08/17/2013 E2 0RY 365 8 1702.2
09/22/2013 W4 3PH 395.5 12 2522
06/19/2013 W4 3PH 360.5 6 1280.9
11/15/2013 W10 6HQ 475 12 3213.5
06/20/2013 W10 6HQ 500 1 3213.5
09/18/2013 E7 8NW 315 9 2154.8
10/23/2013 E7 8NW 570 10 5777.9
11/18/2013 W10 6HQ 455 11 3213.5
08/21/2013 W10 6HQ 530 8 3213.5
Code I tried:
Proc sql;
Create table work.Top_sellers as
Select Store_postcode as Stores,SUM(Retail_price) as Total_Sales,Round((Retail_price/Sum(Retail_price)),0.01) as
Proportion_of_sales
From work.stores
Group by Store_postcode
Order by total_sales;
Quit;
I've no idea how to calculate a cumulative variable in PROC SQL...
Please help me improve my code!
Computing a cumulative result in SQL requires the data to have an explicit unique ordered key, and the query involves a reflexive (self) join with 'triangular' criteria for the cumulative aspect.
data have;
do id = 100 to 120;
sales = ceil (10 + 25 * ranuni(123));
output;
end;
run;
proc sql;
create table want as
select
have1.id
, have1.sales
, sum(have2.sales) as sales_cusum
from
have as have1
join
have as have2
on
have1.id >= have2.id /* 'triangle' criteria */
group by
have1.id, have1.sales
order by
have1.id
;
quit;
A second way is to re-compute the cusum on a row-by-row basis with a correlated subquery:
proc sql;
create table want as
select have.id, have.sales,
( select sum(inner.sales)
from (select * from have) as inner
where inner.id <= have.id
)
as cusum
from
have;
quit;
I changed my mind; CDF is a different calculation.
Here's how to do this via a data step. First calculate the cumulative totals (I used a data step here, but you could use PROC EXPAND if you have SAS/ETS).
*sort demo data;
proc sort data=sashelp.shoes out=shoes;
by region sales;
run;
data cTotal last (keep = region cTotal);
set shoes;
by region;
*calculate running total;
if first.region then cTotal=0;
cTotal = cTotal + sales;
*output records: everything goes to the cTotal dataset, but only the last record per region (the region total) goes to the Last dataset;
if last.region then output last;
output cTotal;
retain cTotal;
run;
*merge in results and calculate percentages;
data calcs;
merge cTotal Last (rename=cTotal=Total);
by region;
percent = cTotal/Total;
run;
If you need a more efficient solution, I'd try a DoW solution.
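A sketch of that DoW (double DO-UNTIL) idea on the same sorted SHOES data; the first loop accumulates the region total, the second re-reads the group and outputs the running total and percentage:
data calcs_dow;
    do until (last.region);            /* pass 1: total for the region */
        set shoes;
        by region;
        total = sum(total, sales);
    end;
    do until (last.region);            /* pass 2: running total and percent */
        set shoes;
        by region;
        cTotal = sum(cTotal, sales);
        percent = cTotal / total;
        output;
    end;
run;
This reads each BY group twice within a single data step, so no intermediate LAST dataset or merge is needed.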
I have this horizontal data:
Placebo 0.90 0.37 1.63 0.83 0.95 0.78 0.86 0.61 0.38 1.97
Alcohol 1.46 1.45 1.76 1.44 1.11 3.07 0.98 1.27 2.56 1.32
But I want it to be vertical:
Placebo Alcohol
0.90 1.46
0.37 1.45
... ...
I successfully read and transposed the data this way, but I'm searching for a more elegant solution that does the same thing without creating two unnecessary datasets:
data female;
input cost_female :comma. ##;
datalines;
871 684 795 838 1,033 917 1,047 723 1,179 707 817 846 975 868 1,323 791 1,157 932 1,089 770
;
data male;
input cost_male :comma. ##;
datalines;
792 765 511 520 618 447 548 720 899 788 927 657 851 702 918 528 884 702 839 878
;
data repair_costs;
merge female male;
run;
You can use proc transpose to do the same.
data have;
input medicine :$7. a1-a10;
datalines;
Placebo 0.90 0.37 1.63 0.83 0.95 0.78 0.86 0.61 0.38 1.97
Alcohol 1.46 1.45 1.76 1.44 1.11 3.07 0.98 1.27 2.56 1.32
;
run;
proc transpose data=have out=want(drop=_name_);
id medicine;
var a1-a10;
run;
Let me know in case of any doubts.
For arbitrarily wide input data you will have to use binary mode input, which is specified with RECFM=N.
This sample code creates a wide data file in transposed form. Thus the data file has one row per final dataset column and one column per final dataset row.
The code presumes CRLF line termination and tests for it explicitly. The input data set is reshaped using a single Proc TRANSPOSE.
filename flipflop 'c:\temp\rowdata-across.txt';
%let NUM_ROWS = 10000; * thus 10,000 columns of data in flipflop;
%let NUM_COLS = 30;
* simulate input data where row data is across a line of arbitrary length (that means > 32K);
* recfm=n means binary mode output, hence no LRECL limit;
data _null_;
file flipflop recfm=n;
do colindex = 1 to &NUM_COLS;
put 'column' +(-1) colindex #; * first column of output data is column name;
do rowindex=1 to &NUM_ROWS;
value = (rowindex-1) * 10 ** floor(log10(&NUM_COLS)) * 10 + colindex;
put value #; * data for rows goes across;
end;
put '0d0a'x;
end;
run;
* recfm=n means binary mode input, hence no LRECL limit;
* as filesize increases, binary mode will become slower than <32K line orientated input;
data flipflop(keep=id rowseq colseq value);
length id $32 value 8;
infile flipflop unbuffered recfm=n col=p;
colseq+1;
input id +(-1);
do rowseq=1 by 1;
input value;
output;
input test $char2.;
if test = '0d0a'x then leave;
input #+(-2);
end;
run;
proc sort data=flipflop;
by rowseq colseq;
run;
proc transpose data=flipflop out=want(drop=_name_ rowseq);
by rowseq;
id id;
var value;
run;
There might be a way to speed up reading larger files (say, a file with a data-line width > 32K) in binary mode, but I have not investigated that.
Other variations could utilize a hash object; however, the entire data set would have to fit in memory.
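For what it's worth, here is a rough sketch of such a hash-object variation, reusing the long-form FLIPFLOP dataset and the &NUM_COLS macro variable from above (every row of the final table is held in memory):
data _null_;
    set flipflop end=done;
    array col[&NUM_COLS] column1-column&NUM_COLS;     /* matches the id values column1, column2, ... */
    if _n_ = 1 then do;
        declare hash h(ordered:'a');
        h.defineKey('rowseq');
        h.defineData('rowseq');
        do i = 1 to dim(col);
            h.defineData(vname(col[i]));
        end;
        h.defineDone();
    end;
    if h.find() ne 0 then call missing(of col[*]);    /* first value seen for this rowseq */
    col[colseq] = value;                              /* slot the value into its column */
    h.replace();                                      /* add or update the row */
    if done then h.output(dataset: 'want_hash');      /* one observation per rowseq */
run;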
I have a data set that has State, Corn, and Cotton. I want to create a new variable, Corn_Pct, in SAS (the % of a state's corn output relative to the country's total corn output), and the same for Cotton_Pct.
sample of data: (numbers are not real)
State Corn Cotton
TX 135 500
AK 120 350
...
Can anyone help?
You can do this using a simple PROC SQL step. Let the dataset be "Test":
Proc sql ;
create table test_percent as
select *,
Corn/sum(corn) as Corn_Pct format=percent7.1,
Cotton/sum(Cotton) as Cotton_Pct format=percent7.1
from test
;
quit;
If you have many columns, you can use arrays and DO loops to generate the percentages automatically every time.
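A rough sketch of that array-based approach (the TOTALS dataset and the *_tot variable names are mine): compute the column sums once with PROC MEANS, then divide through paired arrays in a data step.
proc means data=test sum noprint;
    var corn cotton;
    output out=totals(drop=_type_ _freq_) sum=corn_tot cotton_tot;
run;

data test_percent;
    if _n_ = 1 then set totals;        /* the totals stay in the PDV for every row */
    set test;
    array vals[2] corn cotton;
    array tots[2] corn_tot cotton_tot;
    array pcts[2] corn_pct cotton_pct;
    do i = 1 to dim(vals);
        pcts[i] = vals[i] / tots[i];
    end;
    format corn_pct cotton_pct percent7.1;
    drop i corn_tot cotton_tot;
run;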
I have calculated the total of each column in an inner query and then used those totals for the calculation in the outer query via a CROSS JOIN.
Hey, try this:
/*My Dataset */
Data Test;
input State $ Corn Cotton ;
cards;
TK 135 500
AK 120 350
CK 100 250
FG 200 300
run;
/*Code*/
Proc sql;
create table test_percent as
Select a.*, (corn * 100/sm_corn) as Corn_pct, (Cotton * 100/sm_cotton) as Cotton_pct
from test a
cross join
(
select sum(corn) as sm_corn ,
sum(Cotton) as sm_cotton
from test
) b ;
quit;
/*My Output*/
State Corn Cotton Corn_pct Cotton_pct
TK 135 500 24.32432432 35.71428571
AK 120 350 21.62162162 25
CK 100 250 18.01801802 17.85714286
FG 200 300 36.03603604 21.42857143
Here is an alternative using PROC MEANS and a data step:
proc means data=test sum noprint;
output out=test2(keep=corn cotton) sum=corn cotton;
quit;
data test_percent (drop=corn_sum cotton_sum);
set test2(rename=(corn=corn_sum cotton=cotton_sum) in=in1) test(in=in2);
if (in1=1) then do;
call symput('corn_sum',corn_sum);
call symput('cotton_sum',cotton_sum);
end;
else do;
Corn_pct = corn/symget('corn_sum');
Cotton_pct = cotton/symget('cotton_sum');
output;
end;
run;