I have a dataset of the following format:
a table of M rows and 2K columns.
My columns are pairs of variables: X_i, Y_i and the rows are observations.
I would like to perform many linear regressions: one for each pair of columns (Y_i ~ X_i)
and obtain the results.
I know how to access specific columns using arrays, like so:
data Xs_Ys_data (drop=i);
array Xs[60] X1-X60;
array Ys[60] Y1-Y60;
I also know how to fit a single linear regression model, like so:
proc reg data=some_data;
model y = x;
output out=out_lin_reg;
run;
And I am familiar with the concept of loops:
do i=1 to 60;
Xs[i] .......;
end;
How do I combine these three to get what I need?
Thanks!
P.S - I asked a similar question on a different format here:
SAS reading a file in long format
Update:
I have managed to create the regressions using a macro like so:
%macro mylogit();
%do i = 1 %to 60;
proc reg data=Xs_Ys_data;
model Y&i = X&i;
run;
%end;
%mend;
%mylogit()
Now I am not sure how to export the results into a single table...
You have this in your macro:
proc reg data=Xs_Ys_data;
model Y&i = X&i;
run;
So instead create:
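data x_y_data;
set Xs_Ys_data;
array xs x1-x60;
array ys y1-y60;
/* stack each (X_i, Y_i) pair into its own row, tagged with the pair number */
do iter = 1 to dim(xs);
x=xs[iter];
y=ys[iter];
output;
end;
keep iter x y;
run;
/* BY-group processing needs the rows grouped by iter */
proc sort data=x_y_data;
by iter;
run;
proc reg data=x_y_data;
by iter;
model y = x;
run;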
And then add an output statement however you normally would to get your resulting dataset. Now you get 1 output table with all 60 iterations (still 60 printed outputs), and if you want to create one printed output you can construct that from the output dataset.
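As a sketch of one way to get that single table: the OUTEST= option on the PROC REG statement writes one row of parameter estimates per BY group, and NOPRINT suppresses the 60 printed reports. The dataset name reg_estimates is only an illustrative choice, and the input is assumed to be the long dataset from above, sorted by iter.

```sas
/* one row of estimates (intercept and slope) per value of iter */
proc reg data=x_y_data outest=reg_estimates noprint;
by iter;
model y = x;
run;
quit;
```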
Related
I have been asked to provide summary statistics for the population mean, including the corresponding confidence interval (CI) and its width. I need to print the 85%, 90% and 99% intervals. I know I can use either PROC UNIVARIATE or PROC MEANS to return one interval of my choice, but how do I print all three in one table? Also, could someone explain the difference between PROC UNIVARIATE, PROC MEANS and PROC SQL, and when each is used?
This is what I did and it only printed 85% confidence.
proc means data = mydata n mean clm alpha = 0.01 alpha =0.1 alpha = 0.15;
var variable;
RUN;
To put all three values in one table you can execute your step three times and put the results in one table by using an append step.
For shorter code and easier usage you can define a macro for this purpose.
%macro clm_val(TAB=, VARIABLE=, CONF=);
proc means
data = &TAB. n mean clm
alpha = &CONF.;
ods output summary=result;
var &VARIABLE.;
run;
data result;
length conf $8;
format conf_interval percentn8.0;
conf="&CONF.";
conf_interval=1-&CONF.;
set result;
run;
proc append data = result
base = all_results;
run;
%mend;
%clm_val(TAB=sashelp.class, VARIABLE=age, CONF=0.01);
%clm_val(TAB=sashelp.class, VARIABLE=age, CONF=0.1);
%clm_val(TAB=sashelp.class, VARIABLE=age, CONF=0.15);
The resulting table looks like this:
Suppose I have a dataset called example with variable x and y, where both are binary {0,1}. I want to find the risk difference stratified by a variable strata.
proc freq data = example;
table strata*x*y / commonriskdiff(CL=NEWCOMBEMR);
run;
However, suppose I want them in different direction, i.e., I want 0.004 (-0.017, 0.028)
I can do
data example2;
set example;
y_2 = 1 - y;
run;
proc freq data = example2;
table strata*x*y_2 / commonriskdiff(CL=NEWCOMBEMR);
run;
but is there a way to do it directly on proc freq, without the extra step of creating example2 dataset?
Yes, add the COLUMN=2 suboption: commonriskdiff(CL=NEWCOMBEMR column=2)
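Spelled out against the original call, and assuming the rest of the TABLE statement stays unchanged, the option would be used like this, with no intermediate example2 dataset:

```sas
proc freq data = example;
/* COLUMN=2 bases the risks on the second column of y instead of the first */
table strata*x*y / commonriskdiff(CL=NEWCOMBEMR column=2);
run;
```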
I am trying to create a prediction interval in SAS. My SAS code is
Data M;
input y x;
datalines;
100 20
120 40
125 32
..
;
proc reg;
model y = x / clb clm alpha =0.05;
Output out=want p=Ypredicted;
run;
data want;
set want;
y1= Ypredicted;
proc reg data= want;
model y1 = x / clm cli;
run;
But when I run the code I cannot find the new Y1. How can I predict a new Y?
What you're trying to do is score your model, which takes the results from the regression and uses them to estimate new values.
The most common way to do this in SAS is simply to use PROC SCORE. This allows you to take the output of PROC REG and apply it to your data.
To use PROC SCORE, you need the OUTEST= option (think 'output estimates') on your PROC REG statement. The dataset that you assign there will be the input to PROC SCORE, along with the new data you want to score.
As Reeza notes in comments, this is covered, along with a bunch of other ways to do this that might work better for you, in Rick Wicklin's blog post, Scoring a regression model in SAS.
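A minimal sketch of the OUTEST= plus PROC SCORE route might look like the following. The dataset names est, new_x and scored are hypothetical, and new_x is assumed to hold the x values you want predictions for. Note that PROC SCORE returns point predictions only, not interval limits.

```sas
/* fit the model and save the parameter estimates */
proc reg data=M outest=est noprint;
model y = x;
run;
quit;

/* apply the saved estimates to the new x values */
proc score data=new_x score=est out=scored type=parms predict;
var x;
run;
```

The predicted values land in the scored dataset, in a variable named after the model label (typically MODEL1 when the MODEL statement has no label).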
I have a situation that seems like it should be easy to fix. But, I’m struggling to find an elegant solution. I was given data that was already formatted. Similar to the toy dataset below.
proc format;
value x1_f 1 = "Yes"
0 = "No";
value x2_f 1 = "Yes"
2 = "No";
run;
data ds;
input x1 x2;
datalines;
1 2
1 1
0 1
;
data ds;
set ds;
format x1 x1_f.
x2 x2_f.;
run;
Now, as part of my data management process I create a 2x2 table using x1 and x2. Let’s say I’m checking my data, and expect x1 and x2 to always agree.
proc freq data = ds;
tables x1*x2;
run;
When I look at the report I notice that x1 and x2 don't always agree. So, I want to print the observations that don't agree to see if I can figure out what might be going on. Because this is a toy example, there are no other variables to look at, but hopefully you get the idea.
proc print data = ds;
where x1 = "Yes" & x2 = "No";
run;
SAS gives me the following error:
ERROR: WHERE clause operator requires compatible variables
Ok, I guess I need to give SAS the numeric values instead of the formatted values. But, when I go look at the PROC FREQ report from earlier, it only shows me the formatted values. So, I run another PROC FREQ.
proc freq data = ds;
tables x1*x2;
format x1 x2;
run;
Now I can see which variable uses 0’s and 1’s, and which variable uses 1’s and 2’s.
proc print data = ds;
where x1 = 0 & x2 = 1;
run;
Finally, I get what I’m looking for. This just seems really clunky and inelegant. Can someone tell me how to either view my numeric values and formatted values simultaneously in my frequency report, OR how to use the formatted values in proc print?
If you know the format name then use the PUT() function in the WHERE statement.
proc print data=sashelp.class ;
where put(age,2.) = '12';
run;
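Applied to the toy dataset from the question, where the formats are known to be x1_f. and x2_f., the same idea would look like this sketch. It should select the rows where x1 is 1 and x2 is 2, since those are the values the two formats map to "Yes" and "No".

```sas
proc print data = ds;
/* compare the formatted text rather than the stored numbers */
where put(x1, x1_f.) = "Yes" and put(x2, x2_f.) = "No";
run;
```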
If you don't know the format name then you can use the VVALUE() function. But you probably need to add a data step for it to work.
data to_print;
set sashelp.class ;
if strip(vvalue(age))='12';
run;
proc print data=to_print;
run;
In the old days I used to just create a separate format catalog with formats that included the values in the labels.
proc format;
value x1_f 1 = "1=Yes" 0 = "0=No";
run;
Then when you read your output you knew the values the variables actually had. It is pretty simple to create a program to convert a format catalog.
http://github.com/sasutils/macros/blob/master/cfmtgen.sas
I have a point data set containing latitude, longitude and elevation data. I would like to identify the nearest neighbour of a given point by using the distance between any two given points (2d or 3d). Could anybody suggest the different methods available in SAS for such geo-spatial data analysis and an example SAS code? Thanks.
Your best bet is to look into the clustering procedures, as KNN style clustering is pretty close to what you want (and at minimum cluster analysis can get you to a 'set' of neighbors to check). PROC MODECLUS, PROC FASTCLUS, PROC CLUSTER all give you some value here, as does PROC DISTANCE which is used as input in some cases to the above. Exactly what you want to use depends on what you need and your speed/size constraints (PROC CLUSTER is very slow with large datasets, but gives more useful results oftentimes).
Here is an example of nearest-neighbour calculation via the use of SQL (given in the SAS help file somewhere):
options ls=80 ps=60 nodate pageno=1 ;
data stores;
input Store $ x y;
datalines;
store1 5 1
store2 5 3
store3 3 5
store4 7 5
;
data houses;
input House $ x y;
datalines;
house1 1 1
house2 3 3
house3 2 3
house4 7 7
;
options nodate pageno=1 linesize=80 pagesize=60;
proc sql;
title 'Each House and the Closest Store';
select house, store label='Closest Store',
sqrt((abs(s.x-h.x)**2)+(abs(h.y-s.y)**2)) as dist
label='Distance' format=4.2
from stores s, houses h
group by house
having dist=min(dist);
quit;
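For latitude and longitude data, the same self-join pattern can be sketched with the GEODIST function in place of the Euclidean formula. This assumes a hypothetical dataset points with variables id, lat and lon in degrees; the 'DK' argument asks for degrees in and kilometers out. Elevation is ignored here, so this is a 2D distance only.

```sas
proc sql;
create table nearest as
select a.id as point_id, b.id as neighbor_id,
geodist(a.lat, a.lon, b.lat, b.lon, 'DK') as dist_km
from points a, points b
where a.id ne b.id /* a point is not its own neighbour */
group by a.id
having dist_km = min(dist_km);
quit;
```

The join compares every point with every other point, so the cost grows with the square of the number of points; for large datasets the clustering procedures mentioned above are likely a better fit.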
I wrote 2 macros to accomplish this!
The first macro takes one input GPS location and adds its latitude and longitude as constant new variables to the dataset of "neighbor" locations. It then computes all the distances, selects the minimum, and stores it in a temporary dataset.
The second (calling) macro loops through the input dataset, passes in each individual GPS location, calls the first macro to do the work, and appends each minimum distance to my output dataset.
/*** first concatenate your input lat, lon and some id into long |-separated strings, so that %scan can later split them into individual inputs ***/
%macro min_distance;
data compute_all_dis;
set all_neighbor_gps_locations;
/** here create a new variable to this big dataset with the one point gps value***/
first_lat = &latitude;
first_lon = &longitude;
ID = .;
/** compute all **/
distance = geodist(lat, long, first_lat, first_lon, 'dm');
run;
/** get the shortest distance ***/
proc sql;
create table closest_neighbor as
select milepost,OFF_PERIOD_ID, lat, long, first_lat, first_lon, distance
from compute_all_dis
having distance = min ( distance);
quit;
%mend min_distance;
%macro find_all_closest_neighbors;
data _null_;
runno=countw("&ID",'|');
call symputx('runno',put(runno,8.));
run;
%put &runno;
%do i=1 %to &runno;
/* the delimiter is given unquoted, because quote marks inside a macro function call are treated as extra delimiters */
%let PERIOD = %scan(&OFF_ID, &i, |);
%let latitude = %scan(&LAT_I, &i, |);
%let longitude = %scan(&LONG_I, &i, |);
%min_distance;
proc datasets nowarn;
append base= pout.all_close_neighbors data=closest_neighbor;
run;
quit;
%end;
%mend find_all_closest_neighbors;
%find_all_closest_neighbors;