Efficiently fitting cubic splines in SAS to specific grid of objects - sas

I have a dataset mydat with the following variables:
MNES IV
0.84 0.40
0.89 0.34
0.91 0.31
0.93 0.29
0.95 0.26
0.98 0.23
0.99 0.22
1.00 0.22
1.02 0.20
1.04 0.18
1.07 0.18
And I need to fit cubic splines to these elements, where MNES is the object (X) and IV is the image (Y).
I have successfully accomplished what I need through PROC IML but I am afraid this is not the most efficient solution.
Specifically, my intended output dataset is:
mnes iv
0.333 0.40
0.332 0.40 <- for mnes out of sample MNES range, copy first IV;
0.336 0.40
... ...
0.834 0.40
0.837 0.40
0.840 0.40
0.842 INTERPOLATION
0.845 INTERPOLATION
0.848 INTERPOLATION
...
1.066 INTERPOLATION
1.069 INTERPOLATION
1.072 INTERPOLATION
1.074 0.18
1.077 0.18 <- for mnes out of sample MNES range, copy last IV;
1.080 0.18
... ...
3.000 0.18
The necessary specifics are the following:
I always have 1001 points for MNES, ranging from 0.(3) to 3 (thus, each step is (3-1/3)/1000).
The interpolation for IV should only be used for the points between the minimum and maximum MNES.
For the points where MNES is greater than the maximum MNES in the sample, IV should be equal to the IV of the maximum MNES and likewise for the minimum MNES (it is always sorted by MNES).
My worry for efficiency is due to the fact that I have to solve this problem roughly 2 million times and right now it (the code below, using PROC IML) takes roughly 5 hours for 100k different input datasets.
My question is: What alternatives do I have if I wish to fit cubic splines given an input data set such as the one above and output it to a specific grid of objects?
And what solution would be the most efficient?
With PROC IML I can do exactly this with the splinev function, but I am concerned that using PROC IML is not the most efficient way;
With PROC EXPAND, given that this is not a time series, it does not seem adequate. Additionally, I do not know how to specify the grid of objects which I need through PROC EXPAND;
With PROC TRANSREG, I do not understand how to input a dataset into the knots and I do not understand whether it will output a dataset with the corresponding interpolation;
With the MSPLINT function, it seems doable but I do not know how to input a data set to its arguments.
I have attached the code I am using below for this purpose and an explanation of what I am doing. Reading what is below is not necessary for answering the question but it could be useful for someone solving this sort of problem with PROC IML or wanting a better understanding of what I am saying.
I am replicating a methodology (Buss and Vilkov (2012)) which, among other things, applies cubic splines to these elements, where MNES is the object (X) and IV is the image (Y).
The following code is heavily based on the Model Free Implied Volatility (MFIV) MATLAB code by Vilkov for Buss and Vilkov (2012), available on his website.
The interpolation is a means to calculate a figure for stock return volatility under the risk-neutral measure, by computing OTM put and call prices. I am using this for the purpose of my master thesis. Additionally, since my version of PROC IML does not have functions for Black-Scholes option pricing, I defined my own.
proc iml;
* Define BlackScholes call and put function;
* Built-in not available in SAS/IML 9.3;
* Reference http://www.lexjansen.com/wuss/1999/WUSS99039.pdf ;
start blackcall(x,t,s,r,v,d);
d1 = (log(s/x) + ((r-d) + 0.5#(v##2)) # t) / (v # sqrt(t));
d2 = d1 - v # sqrt(t);
bcall = s # exp(-d*t) # probnorm(d1) - x # exp(-r*t) # probnorm(d2);
return (bcall);
finish blackcall;
start blackput(x,t,s,r,v,d);
d1 = (log(s/x) + ((r-d) + 0.5#(v##2)) # t) / (v # sqrt(t));
d2 = d1 - v # sqrt(t);
bput = -s # exp(-d*t) # probnorm(-d1) + x # exp(-r*t) # probnorm(-d2);
return (bput);
finish blackput;
store module=(blackcall blackput);
quit;
proc iml;
* Specify necessary input parameters;
currdate = "&currdate"d;
currpermno = &currpermno;
currsecid = &currsecid;
rate = &currrate / 100;
mat = &currdays / 365;
* Use input dataset and convert to matrix;
use optday;
read all var{mnes impl_volatility};
mydata = mnes || impl_volatility;
* Load BlackScholes call and Put function;
load module=(blackcall blackput);
* Define parameters;
k = 2;
m = 500;
* Define auxiliary variables according to Buss and Vilkov;
u = (1+k)##(1/m);
a = 2 * (u-1);
* Define moneyness (ki) and implied volatility (vi) grids;
mi = (-m:m);
mi = mi`;
ki = u##mi;
* Preallocation of vi with 2*m+1 ones (1001 in the base case);
vi = J(2*m+1,1,1);
* Define IV below minimum MNES equal to the IV of the minimum MNES;
h = loc(ki<=mydata[1,1]);
vi[h,1] = mydata[1,2];
* Define IV above maximum MNES equal to the IV of the maximum MNES;
h = loc(ki>=mydata[nrow(mydata),1]);
vi[h,1] = mydata[nrow(mydata),2];
* Define MNES grid where there are IV from data;
* (equal to where ki still has ones resulting from the preallocation);
grid = ki[loc(vi=1),];
* Call splinec to interpolate based on available data and obtain coefficients;
* Use coefficients to create spline on grid and save on smoothFit;
* Save smoothFit in correct vi elements;
call splinec(fitted,coeff,endSlopes,mydata);
smoothFit = splinev(coeff,grid);
vi[loc(vi=1),1] = smoothFit[,2];
* Define elements of mi corresponding to OTM calls (MNES >=1) and OTM puts (MNES <1);
ic = mi[loc(ki>=1)];
ip = mi[loc(ki<1)];
* Calculate call and put prices based on call and put module;
calls = blackcall(ki[loc(ki>=1),1],mat,1,rate,vi[loc(ki>=1),1],0);
puts = blackput(ki[loc(ki<1),1],mat,1,rate,vi[loc(ki<1),1],0);
* Complete volatility calculation based on Buss and Vilkov;
b1 = sum((1-(log(1+k)/m)#ic)#calls/u##ic);
b2 = sum((1-(log(1+k)/m)#ip)#puts/u##ip);
stddev = sqrt(a*(b1+b2)/mat);
* Append to voldata dataset;
edit voldata;
append var{currdate currsecid currpermno mat stddev};
close voldata;
quit;

OK. I'm going to do this for two data sets to show how to handle the fact that you have many of them. You will have to adapt it to your inputs, but it should give you better performance.
Create some inputs
Get the first and last values from each input data set.
Create a list of all MNES values.
Merge each input to the MNES list and set the upper and lower values.
Append the Inputs together
Run PROC EXPAND with a BY statement to single pass all the input values and create the splines.
The trick is to "trick" EXPAND into thinking MNES is a Daily timeseries. I do this by making it an integer -- date values are integers behind the scenes in SAS. With no gaps, ETS Procedures will assume a "daily" frequency.
After this is done, run a Data Step to call the Black-Scholes (BLKSHPTPRC, BLKSHCLPRC) functions and complete your analysis.
/*Sample Data*/
data input1;
input MNES IV;
/*Make MNES an integer; ROUND guards against floating-point error in the product*/
MNES = round(MNES * 1000);
datalines;
0.84 0.40
0.89 0.34
0.91 0.31
0.93 0.29
0.95 0.26
0.98 0.23
0.99 0.22
1.00 0.22
1.02 0.20
1.04 0.18
1.07 0.18
;
run;
data input2;
input MNES IV;
MNES = round(MNES * 1000);
datalines;
0.80 0.40
0.9 0.34
0.91 0.31
0.93 0.29
0.95 0.26
0.98 0.23
1.02 0.19
1.04 0.18
1.07 0.16
;
run;
/*Get the first and last values from the input data*/
data _null_;
set input1 end=last;
if _n_ = 1 then do;
call symput("first1",mnes);
call symput("first1_v",iv);
end;
if last then do;
call symput("last1",mnes);
call symput("last1_v",iv);
end;
run;
data _null_;
set input2 end=last;
if _n_ = 1 then do;
call symput("first2",mnes);
call symput("first2_v",iv);
end;
if last then do;
call symput("last2",mnes);
call symput("last2_v",iv);
end;
run;
/*A list of the MNES values*/
data points;
do mnes=333 to 3000;
output;
end;
run;
/*Join Inputs to the values and set the lower and upper values*/
data input1;
merge points input1;
by mnes;
if mnes < &first1 then
iv = &first1_v;
if mnes > &last1 then
iv = &last1_v;
run;
data input2;
merge points input2;
by mnes;
if mnes < &first2 then
iv = &first2_v;
if mnes > &last2 then
iv = &last2_v;
run;
/*Append the data sets together, keep a value
so you can tell them apart*/
data toSpline;
set input1(in=ds1)
input2(in=ds2);
if ds1 then
Set=1;
else if ds2 then
Set=2;
run;
/*PROC Expand for the spline. The integer values
for MNES makes it think these are "daily" data*/
proc expand data=toSpline out=outSpline method=spline;
by set;
id mnes;
run;
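The final DATA step mentioned above could look roughly like the sketch below. It assumes the PROC EXPAND output has the columns mnes (still scaled by 1000) and iv. The three sample rows, the maturity, and the rate are made-up placeholders, and the underlying price is normalised to 1 as in the question's IML code. BLKSHCLPRC and BLKSHPTPRC take the strike, time to maturity, share price, risk-free rate, and volatility, in that order.

```sas
/* Hypothetical stand-in for the PROC EXPAND output */
data outSpline;
   input set mnes iv;
   datalines;
1 950 0.26
1 1000 0.22
1 1040 0.18
;
run;

data priced;
   set outSpline;
   k    = mnes / 1000;   /* undo the integer scaling */
   mat  = 30 / 365;      /* assumed time to maturity, in years */
   rate = 0.01;          /* assumed risk-free rate */
   /* OTM calls for moneyness >= 1, OTM puts below 1 */
   if k >= 1 then price = blkshclprc(k, mat, 1, rate, iv);
   else price = blkshptprc(k, mat, 1, rate, iv);
run;
```

Unlike the user-defined IML modules in the question, these built-in functions have no dividend-yield argument, which matches the question's use of d = 0.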

Here is the solution I came up with. Sadly, I cannot yet conclude whether it is more efficient than the PROC IML solution: for a single dataset, both take pretty much the same running time.
MSPLINT:
real time: 1.42 seconds
cpu time: 0.23 seconds
PROC IML:
real time: 1.02 seconds
cpu time: 0.26 seconds
The biggest disadvantage of this solution compared to the one above by @DomPazz is that I cannot process the data by BY groups, which would certainly make it a lot faster. I am still thinking about whether I can solve this without resorting to a macro loop, but I am all out of ideas.
I keep the approach of defining macro variables with the first and last values, as proposed by @DomPazz, but I then use a DATA step that either copies the first and last values or applies the interpolation, depending on which value of MNES it is stepping through. It applies the interpolation through the MSPLINT function. Its syntax is as follows:
MSPLINT(X, n, X1 <, X2, ..., Xn>, Y1 <,Y2, ..., Yn> <, D1, Dn>)
Where X is the object at which you wish to evaluate the spline, n is the number of knots supplied to the function (i.e. the number of observations in the input data), X1,...,Xn are the objects in the input data (i.e. MNES) and Y1,...,Yn are the images in the input data (i.e. IV). D1 and Dn (optional) are the derivatives you wish to maintain for interpolation objects X < X1 and X>Xn.
An interesting note is that by specifying D1 and Dn as 0 you can have the points beyond the grid equal to the last observation inside the interpolated area. However, this forces the spline images to converge to a derivative of zero, potentially generating a non-natural pattern in the data. I opted not to set these to zero and instead to define the points outside the interpolation area separately.
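As a small illustration of the two calling forms, the sketch below evaluates MSPLINT at one point with and without the optional end derivatives. The three knots are rounded values picked from the sample data purely for illustration. Note also that the SAS documentation describes MSPLINT as a monotonicity-preserving interpolating spline, so its values need not coincide with those from SPLINEC/SPLINEV in IML.

```sas
data _null_;
   /* three illustrative knots: x = 0.84, 0.95, 1.07 and y = 0.40, 0.26, 0.18 */
   y_default = msplint(0.90, 3, 0.84, 0.95, 1.07, 0.40, 0.26, 0.18);
   /* same knots, with the end derivatives D1 and Dn set to 0 */
   y_flat    = msplint(0.90, 3, 0.84, 0.95, 1.07, 0.40, 0.26, 0.18, 0, 0);
   put y_default= y_flat=;
run;
```

Comparing the two values written to the log shows how much the zero end slopes change the fit in the end intervals.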
So, I use PROC SQL to define the lists of elements of MNES and IV in macro variables, divided by commas, so that I can input them in the MSPLINT function. I also define the number of observations through PROC SQL.
MNES, as I commented in the answer above, was not well defined in my explanation. It needs to be defined as the variable u to the power of elements from -500 to 500. This is just a detail but it will allow you to understand where MNES comes from in the example below.
So, here is the solution, including example data.
* Set model inputs;
%let m = 500;
%let k = 2;
%let u = (1+&k) ** (1/&m);
/*Sample Data*/
data input1;
input MNES 13.10 IV 8.6;
cards;
0.8444984010 0.400535
0.8901469633 0.347988
0.9129712444 0.318596
0.9357955255 0.291456
0.9586198066 0.264852
0.9814440877 0.236231
0.9928562283 0.224858
1.0042683688 0.220035
1.0270926499 0.201118
1.0499169310 0.189373
1.0727412121 0.185628
;
run;
data _null_;
set input1 end=last;
if _n_ = 1 then do;
call symput("first1",MNES);
call symput("first1_v",IV);
end;
if last then do;
call symput("last1",MNES);
call symput("last1_v",IV);
end;
run;
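proc sql noprint;
/* FORMAT=BEST32. keeps full precision; by default INTO stores numbers using BEST8. */
select MNES format=best32. into:mneslist
separated by ','
from input1;
select IV format=best32. into:IVlist
separated by ','
from input1;
select count(*) into:countlist
from input1;
quit;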
data splined;
do grid=-500 to 500;
mnes = (&u) ** grid;
if mnes < &first1 then IV = &first1_v;
if mnes > &last1 then IV = &last1_v;
if mnes >= &first1 and mnes <= &last1 then IV = msplint(mnes, &countlist, &mneslist, &IVlist);
/* Write one observation per grid point; without an explicit OUTPUT the step writes a single observation */
output;
end;
run;

Related

Obtaining the odds ratio and 95% confidence interval from mixed effect logistic regression in sas

How can I get the odds ratio and 95% confidence interval from mixed effect logistic regression in sas?
I am aware that the odds ratio can be derived by exponentiating the obtained estimate.
I saw this link for getting odds ratio in R but I need it in sas.
A sample dataset and code:
data herd;
call streaminit(1);
do herd = 1 to 10;
do testyear = 2005, 2015;
do Time = 1 to 6;
eta = -1 + 0.1*herd + 0.5*Time - 2*(testyear=2015);
mu = logistic(eta);
mpd = rand("Bernoulli", mu);
output;
end;
end;
end;
proc GLIMMIX data = herd;
class testyear TIME;
model MPD = TESTYEAR TIME / s dist=binary;
RANDOM HERD;
RUN;
Appreciate any advice.
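One possible approach, offered as a sketch rather than a definitive answer, is the ODDSRATIO option of the MODEL statement in PROC GLIMMIX, which requests odds ratio estimates together with confidence limits (95% at the default ALPHA=0.05). The CL option adds confidence limits for the parameter estimates on the logit scale, which can be exponentiated by hand. The model is kept exactly as posted in the question; only the two options are added.

```sas
data herd;
   call streaminit(1);
   do herd = 1 to 10;
      do testyear = 2005, 2015;
         do Time = 1 to 6;
            eta = -1 + 0.1*herd + 0.5*Time - 2*(testyear=2015);
            mu  = logistic(eta);
            mpd = rand("Bernoulli", mu);
            output;
         end;
      end;
   end;
run;

proc glimmix data=herd;
   class testyear time;
   /* ODDSRATIO: odds ratio estimates with confidence limits */
   /* CL: confidence limits for the estimates on the logit scale */
   model mpd = testyear time / s dist=binary oddsratio cl;
   random herd;
run;
```

For a numeric binary response it is worth checking in the output which level is being modelled as the event, since that determines the direction of the odds ratios.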

SAS: How to repeat calculation and change values?

For the following data, I want to create several new variables that combine Australia and Canada with different weights. In total, I would like to examine 10 different weight combinations.
Is there a way to do this where I can use the one formula and just change the weight values?
For example, rather than calculate Weight_1 to Weight_etc, can I just list the weights I want and then create the variables based on this list?
data Weighted_returns; set returns;
Weight_1 = (Australia*0.6)+(Canada*0.4);
Weight_2 = (Australia*0.5)+(Canada*0.5);
Weight_3 = (Australia*0.4)+(Canada*0.6);
run;
DATA Step does not have any sort of vector math syntax. You can use one array to arrange and reference the target variables and another to hold the weights.
Your result variables weight* would be a little conflicting with an array of weights, so I named the result variables result*
data have;
input australia canada;
datalines;
0.07 0.08
0.02 -0.001
0.05 0.01
run;
data want;
set have;
array results result_1-result_3;
array weights (3) _temporary_ (0.6 0.5 0.4);
do _n_ = 1 to dim(results);
results(_n_) = australia * weights(_n_) + canada * (1 - weights(_n_));
end;
run;
Use two weight arrays if the weights to apply do not sum to one.
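A minimal sketch of the two-array variant follows. The array names w_aus and w_can and the weight values are made up for illustration; the sample data repeat the rows used above.

```sas
data have;
   input australia canada;
   datalines;
0.07 0.08
0.02 -0.001
0.05 0.01
;
run;

data want2;
   set have;
   array results result_1-result_3;
   /* separate weights per country; each pair need not sum to 1 */
   array w_aus (3) _temporary_ (0.6 0.5 0.4);
   array w_can (3) _temporary_ (0.3 0.5 0.7);
   do _n_ = 1 to dim(results);
      results(_n_) = australia * w_aus(_n_) + canada * w_can(_n_);
   end;
run;
```

Adding another combination means extending both weight lists and the result_ variable range by one.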

How to add a custom fitted line to SAS SGplot Scatter

I have a simple SAS data set I am plotting as a scatter plot, my two questions are:
I am trying to adjust the y-axis without excluding the (0.02,51) data point but I need the y-axis to only show 60 to 160 by 20. When I define this it excludes that specific data point and I don't know how to fix it.
I cannot figure out how to add a custom fitted curve and display the formula. Here is my line: Y=(160.3*x)/(0.0477+x)
Here is my code:
proc sgplot data=work.sas1;
title 'Puromycin Uptake Experiments';
scatter x=x y=y/ markerattrs=(color=black);
xaxis Label='Reactant Concentration X (mg/l)';
yaxis Label='Reaction Velocity Y (mg/s)' values=(60 to 160 by 20);
run;
Can anyone please help?
Try using OFFSETMIN= to extend the yaxis beyond your values.
Add a new variable, y_hat with the values of your formula. Plot that and label it appropriately.
data sas1;
x=.02; y=67; output;
x=.02; y=51; output;
x=.06; y=84; output;
x=.06; y=86; output;
x=.11; y=98; output;
x=.11; y=115; output;
x=.22; y=131; output;
x=.22; y=124; output;
x=.56; y=144; output;
x=.56; y=158; output;
x=1.1; y=160; output;
run;
data sas1;
set sas1;
Y_hat=(160.3*x)/(0.0477+x);
run;
proc sgplot data=work.sas1;
title 'Puromycin Uptake Experiments';
scatter x=x y=y/ markerattrs=(color=black);
series x=x y=y_hat / curvelabel="Y=(160.3*x)/(0.0477+x)";
xaxis Label='Reactant Concentration X (mg/l)';
yaxis Label='Reaction Velocity Y (mg/s)' offsetmin=.1 values=(60 to 160 by 20);
run;
y axis
There are a couple of y-axis options that can affect the axis rendering. Consider offsetmin or a tweaked list in values=.
formula line
There is no formula statement in SGPLOT, so you have to create an auxiliary column for drawing the formula as a series. Sometimes you can align the x's of the data with the x's of the formula. However, if you want a higher density of x's for the formula, you stack the scatter and formula data. Don't get hung up on the chunks of missing values or any feelings of wastefulness.
I am not sure where your curve fit comes from, but statistical graphics (the SG in SGPLOT) has many features for fitting data built into it.
* make some example data that looks something like the fit curve;
data have;
do x = 0.03 to 1 by 0.0125;
y = ( 160.3 * x ) / ( 0.0477 + x ) ;
y + round ( 4 * ranuni(123) - 8, 0.0001);
output;
x = x * ( 1 + ranuni(123) );
end;
x = 0.02;
y = 51;
output;
run;
* generate the series data for drawing the fit curve;
* for complicated formula you may want to adjust step during iteration;
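data fit;
step = 0.001;
x = 0;
* a DO WHILE loop lets step change between iterations (DO x = 0 TO 1 would advance x by 1 and stop after one point);
do while (x <= 1);
y = ( 160.3 * x ) / ( 0.0477 + x ) ;
output;
* step = step + smartly-adjusted-x-increment;
x + step;
end;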
keep x y;
rename x=xfit y=yfit;
run;
* stack the scatter data and the curve fit data;
data have_stack_fit;
set have fit;
run;
proc sgplot data=have_stack_fit;
scatter x = x y = y;
series x = xfit y = yfit / legendlabel="( 160.3 * x ) / ( 0.0477 + x )";
yaxis values = (0 60 to 160 by 20) ;
run;

How to do multiplication between two matrix using IML in SAS

I have data set named input_data below import from EXCEL.
0.353481635 0.704898683 0.078640917 0.813815803 0.510842666 0.240912872 0.986312218 0.781868961 0.682272971
0.443441526 0.653187181 0.753981865 0.34909803 0.84215961 0.793863082 0.047816942 0.176759112 0.54213244
0.21443281 0.142501578 0.927011587 0.407251043 0.290280445 0.90730524 0.677030212 0.770541244 0.915728969
0.583493041 0.685127614 0.119042255 0.067769934 0.795793907 0.405029459 0.817724346 0.594170688 0.345660875
0.816193304 0.636823417 0.036348358 0.027985453 0.117027493 0.436516667 0.593191955 0.916981676 0.574223091
0.766842249 0.743249552 0.400052263 0.809650253 0.683610082 0.42152573 0.050520292 0.329441952 0.868549022
0.112847881 0.462579082 0.526220066 0.320851313 0.944585551 0.233027402 0.66141107 0.8380858 0.120044416
0.873949265 0.118525986 0.590234323 0.481974796 0.668976582 0.466558592 0.934633956 0.643438048 0.053508922
And I have another data set called p below
data p;
input p;
datalines;
0.12
0.23
0.11
0.49
0.52
0.78
0.8
0.03
0.02
run;
proc transpose data = p out=p2;
run;
What I want to do is matrix manipulation in IML using SAS.
I have some code already, but the final calculation fails with an error. Can someone give me a hand?
proc iml;
use input_data;
read all var _num_ into x;
print x;
proc iml;
use p2;
read all var _num_ into k;
print k;
proc iml;
Value1 = k * x;
print Value1;
quit;
You have several problems here.
First off, you have three PROC IML statements. PROC IML only keeps values while it is running; once it quits (and each new PROC statement ends the previous step), all of the matrices go away. So remove the second and third PROC IML statements.
Second, you need to make sure your matrices are correctly ordered and structured. Matrix multiplication works by the following:
m x n * n x p = m x p
Where both N's must be the same. This is rows x columns, so the left-side matrix must have the same number of columns as the right-side matrix has rows. (This is because each element of each row on the left-side matrix is multiplied by the corresponding element in the column on the right-side matrix and then summed, so if the numbers don't match it's not possible to do.)
So you have 8x9 and 9x1, which you transpose to 1x9. So first off, don't transpose p, leave it 9x1. Then, make sure you have the order right (matrix multiplication is NOT commutative, the order matters). k * x means 9x1 * 8x9 which doesn't work (since 1 and 8 aren't the same - remember, the inner two numbers have to match.) x*k does work, since that is 8x9 * 9x1, the two 9s match.
Final output:
proc iml;
use input_data;
read all var _num_ into x;
print x;
use p;
read all var _num_ into k;
print k;
Value1 = x * k;
print Value1;
quit;
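To see the conformability rule in isolation, here is a small sketch that uses placeholder matrices instead of the imported data. It checks the inner dimensions before multiplying and prints the shape of the result.

```sas
proc iml;
   x = j(8, 9, 1);      /* 8 x 9 matrix of ones (placeholder data) */
   k = j(9, 1, 0.5);    /* 9 x 1 column vector (placeholder data) */
   /* multiply only when columns of x match rows of k */
   if ncol(x) = nrow(k) then do;
      value1 = x * k;   /* result is 8 x 1 */
      print (nrow(value1)) (ncol(value1));
   end;
   else print "x and k are not conformable";
quit;
```

Swapping the operands to k * x fails the check, because k has 1 column while x has 8 rows.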

SAS generate normal Y~N(...)

For my SAS project I have to generate pairs of (X,Y) with a distribution Y ~ N(3 + X + .5X^2, sd = 2). I have looked at all of the SAS documentation for normal() and I see absolutely no way to do this. I have tried many different methods and am very frustrated.
I believe this is an example of what the asker wants to do:
data sample;
do i = 1 to 1000;
x = ranuni(1);
y = rand('normal', 3 + x + 0.5*x**2, 2);
output;
end;
run;
proc summary data = sample;
var x y;
output out = xy_summary;
run;
Joe is already more or less there - I think the only key point that needed addressing was making the mean of each y depend on the corresponding x, rather than using a single fixed mean for all the pairs. So rather than 1000 samples from the same Normal distribution, the above generates 1 sample from each of 1000 different Normal distributions.
I've used a uniform [0,1] distribution for x, but you could use any distribution you like.
You generate random numbers in SAS using the rand function. It has all sorts of distributions available; read the documentation to fully understand.
I'm not sure if you can directly use your PDF, but if you're able to use it with a regular normal distribution, you can do that. On top of that, most of the Univariate DFs SAS supports start out with the Uniform distribution and then apply their formula (Discrete or continuous) to that, so that might be the right way to go. That's heading into stat-land which is somewhere I'm averse to going. There isn't a direct way to simply pass a function for X as far as I know, however.
To generate [numsamp] normals with mean M and standard deviation SD:
%let m=0;
%let sd=2;
%let numsamp=100;
data want;
call streaminit(7);
do id = 1 to &numsamp;
y = rand('Normal',&m.,&sd.);
output;
end;
run;
So if I understand what you want right, this might work:
%let m=0;
%let sd=2;
%let numsamp=1000;
data want;
call streaminit(7);
do id = 1 to &numsamp;
x = rand('Normal',&m.,&sd.);
y = 0.5*x**2 + x + 3;
output;
end;
run;
proc means data=want;
var x y;
run;
X has mean 0.5 with SD 1.96 (roughly what you ask for). Y has mean 5 with SD 3.5. If you're asking for Y to have an SD of 2, I'm not sure how to do that.
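If the goal is for Y to have an SD of 2 around a mean that depends on X, one way is to pass that mean directly to rand('Normal', ...), as in the first answer. The sketch below does this and checks the result: the residual y - mu should then have a sample mean near 0 and a sample SD near 2. The dataset name check, the seed, and the distribution chosen for X are arbitrary.

```sas
data check;
   call streaminit(7);
   do id = 1 to 1000;
      x  = rand('Normal', 0, 2);      /* any distribution for x would do */
      mu = 3 + x + 0.5*x**2;          /* conditional mean of y given x */
      y  = rand('Normal', mu, 2);     /* y ~ N(mu, sd = 2) */
      resid = y - mu;                 /* deviation from the conditional mean */
      output;
   end;
run;

proc means data=check mean std;
   var resid;
run;
```

The unconditional SD of y is still larger than 2, because it also includes the variation of the mean across different x values.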