Nearest neighbour in SAS

I have a point data set containing latitude, longitude and elevation data. I would like to identify the nearest neighbour of a given point by using the distance between any two given points (2d or 3d). Could anybody suggest the different methods available in SAS for such geo-spatial data analysis and an example SAS code? Thanks.

Your best bet is to look into the clustering procedures, as KNN style clustering is pretty close to what you want (and at minimum cluster analysis can get you to a 'set' of neighbors to check). PROC MODECLUS, PROC FASTCLUS, PROC CLUSTER all give you some value here, as does PROC DISTANCE which is used as input in some cases to the above. Exactly what you want to use depends on what you need and your speed/size constraints (PROC CLUSTER is very slow with large datasets, but gives more useful results oftentimes).

Here is an example of nearest-neighbour calculation via the use of SQL (given in the SAS help file somewhere):
options ls=80 ps=60 nodate pageno=1 ;
data stores;
input Store $ x y;
datalines;
store1 5 1
store2 5 3
store3 3 5
store4 7 5
;
data houses;
input House $ x y;
datalines;
house1 1 1
house2 3 3
house3 2 3
house4 7 7
;
options nodate pageno=1 linesize=80 pagesize=60;
proc sql;
title 'Each House and the Closest Store';
select house, store label='Closest Store',
sqrt((abs(s.x-h.x)**2)+(abs(h.y-s.y)**2)) as dist
label='Distance' format=4.2
from stores s, houses h
group by house
having dist=min(dist);
quit;
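The same SQL pattern can be adapted to the original question, where every point sits in one dataset and is described by latitude and longitude. The following is only a sketch under assumptions: the dataset `points`, its columns `id`, `lat`, `lon`, and the sample coordinates are all made up for illustration. It ignores elevation and uses the GEODIST function with the 'DK' option (coordinates in degrees, result in kilometres).

```sas
/* Hypothetical sample points: id, latitude and longitude in degrees */
data points;
  input id $ lat lon;
  datalines;
p1 40.00 -75.00
p2 40.10 -75.10
p3 41.00 -74.00
p4 40.05 -75.02
;

proc sql;
  title 'Each Point and Its Nearest Neighbour';
  /* Self-join: pair every point with every other point,
     then keep the pair with the smallest distance per point */
  select a.id, b.id as neighbour label='Nearest Neighbour',
         geodist(a.lat, a.lon, b.lat, b.lon, 'DK') as dist
           label='Distance (km)' format=8.2
    from points a, points b
    where a.id ne b.id
    group by a.id
    having dist = min(dist);
quit;
```

Because this compares every point with every other point, the work grows with the square of the number of points, so it is probably only practical for small to medium datasets.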

I wrote two macros to accomplish this.
The first macro takes one input GPS location and adds its latitude and longitude as constant new variables to the "neighbor" location dataset. It then computes all the distances, selects the minimum, and stores it in a temporary dataset.
The second, calling macro loops through the input dataset, passes in each individual GPS location, calls the first macro to do the work, and appends each minimum distance to my output dataset.
/*** first concatenate your input lat, lon and some id into |-separated long strings, to be split later with %scan into individual inputs ***/
%macro min_distance;
data compute_all_dis;
set all_neighbor_gps_locations;
/** add the single input point's coordinates to every row of the neighbour dataset ***/
first_lat = &latitude;
first_lon = &longitude;
ID = .;
/** compute the distance from the input point to every neighbour (degrees in, miles out) **/
distance = geodist(lat, long, first_lat, first_lon, 'dm');
run;
/** get the shortest distance ***/
proc sql;
create table closest_neighbor as
select milepost,OFF_PERIOD_ID, lat, long, first_lat, first_lon, distance
from compute_all_dis
having distance = min ( distance);
quit;
%mend min_distance;
%macro find_all_closest_neighbors;
data _null_;
runno=countw("&ID",'|');
call symputx('runno',put(runno,8.));
run;
%put &runno;
%do i=1 %to &runno;
%let PERIOD = %SCAN(&OFF_ID, &i, |);
%let latitude = %SCAN(&LAT_I, &i, |);
%let longitude = %SCAN(&LONG_I, &i, |);
%min_distance;
proc datasets nowarn;
append base= pout.all_close_neighbors data=closest_neighbor;
quit;
%end;
%mend find_all_closest_neighbors;
%find_all_closest_neighbors;

Related

I need help printing multiple confidence intervals in sas

I am being asked to provide summary statistics including corresponding confidence interval (CI) with its width for the population mean. I need to print 85% 90% and 99%. I know I can either use univariate or proc means to return 1 interval of your choice but how do you print all 3 in a table? Also could someone explain the difference between univariate, proc means and proc sql and when they are used?
This is what I did and it only printed 85% confidence.
proc means data = mydata n mean clm alpha = 0.01 alpha =0.1 alpha = 0.15;
var variable;
RUN;
To put all three values in one table you can execute your step three times and put the results in one table by using an append step.
For shorter code and easier usage you can define a macro for this purpose.
%macro clm_val(TAB=, VARIABLE=, CONF=);
proc means
data = &TAB. n mean clm
alpha = &CONF.;
ods output summary=result;
var &VARIABLE.;
run;
data result;
length conf $8;
format conf_interval percentn8.0;
conf="&CONF.";
conf_interval=1-&CONF.;
set result;
run;
proc append data = result
base = all_results;
run;
%mend;
%clm_val(TAB=sashelp.class, VARIABLE=age, CONF=0.01);
%clm_val(TAB=sashelp.class, VARIABLE=age, CONF=0.1);
%clm_val(TAB=sashelp.class, VARIABLE=age, CONF=0.15);
The resulting table, all_results, contains one row per confidence level.

How can I create a prediction interval based on a linear model in SAS

I am trying to create a prediction interval in SAS. My SAS code is
Data M;
input y x;
datalines;
100 20
120 40
125 32
..
;
proc reg;
model y = x / clb clm alpha =0.05;
Output out=want p=Ypredicted;
run;
data want;
set want;
y1= Ypredicted;
proc reg data= want;
model y1 = x / clm cli;
run;
but when I run the code I could find the new Y1. How can I predict the new Y?
What you're trying to do is score your model, which takes the results from the regression and uses them to estimate new values.
The most common way to do this in SAS is simply to use PROC SCORE. This allows you to take the output of PROC REG and apply it to your data.
To use PROC SCORE, you need the OUTEST= option (think 'output estimates') on your PROC REG statement. The dataset that you assign there will be the input to PROC SCORE, along with the new data you want to score.
As Reeza notes in comments, this is covered, along with a bunch of other ways to do this that might work better for you, in Rick Wicklin's blog post, Scoring a regression model in SAS.
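A minimal sketch of the OUTEST= plus PROC SCORE approach described above follows. The dataset names `train`, `est`, `new` and `scored` are made up for illustration, the training rows are just the three shown in the question, and the new x values are invented. Note that PROC SCORE only produces point predictions; it does not compute confidence or prediction limits, so an interval would need one of the other approaches.

```sas
/* Training data: the three rows visible in the question */
data train;
  input y x;
  datalines;
100 20
120 40
125 32
;

/* Fit the model and save the parameter estimates in EST */
proc reg data=train outest=est noprint;
  model y = x;
run;
quit;

/* Hypothetical new observations to score */
data new;
  input x;
  datalines;
25
35
;

/* Apply the saved estimates to the new data.
   TYPE=PARMS tells PROC SCORE to use the regression coefficients;
   the predicted value is written to a variable named after the
   model label (MODEL1 when the MODEL statement has no label) */
proc score data=new score=est out=scored type=parms;
  var x;
run;

proc print data=scored;
run;
```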

several regressions on a single dataset in SAS

I have a dataset of the following format:
a table of M rows and 2K columns.
My columns are pairs of variables: X_i, Y_i and the rows are observations.
I would like to perform many linear regressions: one for each pair of columns (Y_i ~ X_i)
and obtain the results.
I know how to access specific columns using arrays, like so:
data Xs_Ys_data (drop=i);
array Xs[60] X1-X60;
array Ys[60] Y1-Y60;
I also know how to fit a single linear regression model, like so:
proc reg data=some_data;
model y = x;
output out=out_lin_reg;
run;
And I am familiar with the concept of loops:
do i=1 to 60;
Xs[i] .......;
end;
How do I combine these three to get what I need?
Thanks!
P.S - I asked a similar question on a different format here:
SAS reading a file in long format
Update:
I have managed to create the regressions using a macro like so:
%macro mylogit();
%do i = 1 %to 60;
proc reg data=Xs_Ys_data;
model Y&i = X&i;
run;
%end;
%mend;
%mylogit()
Now I am not sure how to export the results into a single table...
You have this in your macro:
proc reg data=Xs_Ys_data;
model Y&i = X&i;
run;
So instead create:
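data x_y_Data;
set Xs_Ys_data;
array xs x1-x60;
array ys y1-y60;
/* write one row per (observation, pair), tagged with the pair number */
do iter = 1 to dim(xs);
x=xs[iter];
y=ys[iter];
output;
end;
run;
/* BY-group processing needs the data sorted by the BY variable */
proc sort data=X_Y_data;
by iter;
run;
proc reg data=X_Y_data;
by iter;
model Y = X;
run;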
And then add an output statement however you normally would to get your resulting dataset. Now you get 1 output table with all 60 iterations (still 60 printed outputs), and if you want to create one printed output you can construct that from the output dataset.
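One possible way to collect the results in a single table, sketched under assumptions: the OUTEST= option on PROC REG writes one row of parameter estimates per BY group. The small long-format sample below (two pairs, four observations each) and the dataset name `all_estimates` are made up for illustration.

```sas
/* Hypothetical long-format data: iter identifies the (X_i, Y_i) pair */
data x_y_data;
  input iter x y;
  datalines;
1 1 2.1
1 2 3.9
1 3 6.2
1 4 7.8
2 1 0.9
2 2 2.2
2 3 2.8
2 4 4.1
;

/* One regression per BY group; NOPRINT suppresses the printed
   output and OUTEST= stores the estimates, one row per iter */
proc reg data=x_y_data outest=all_estimates noprint;
  by iter;
  model y = x;
run;
quit;

/* Intercept and x hold the fitted coefficients for each pair */
proc print data=all_estimates;
  var iter Intercept x _RMSE_;
run;
```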

Is there a way to name proc rank groups based on values within the group?

So I have multiple continuous variables that I have used proc rank to divide into 10 groups, ie for each observation there is now a "GPA" and a "GRP_GPA" value, ditto for Hmwrk_Hrs and GRP_Hmwrk_Hrs. But for each of the new group columns the values are between 1 - 10. Is there a way to change that value so that rather than 1 for instance it would be 1.2-2.8 if those were the min and max values within the group? I know I can do it by hand using proc format or if then or case in sql but since I have something like 40 different columns that would be very time intensive.
It's not clear from your question if you want to store the min-max values or just format the rank columns with them. My solution below formats the rank column and utilises the ability of SAS to create formats from a dataset. I've obviously only used 1 variable to rank, for your data it will be a simple matter to wrap a macro around the code and run for each of your 40 or so variables. Hope this helps.
/* create ranked dataset */
proc rank data=sashelp.steel groups=10 out=want;
var steel;
ranks steel_rank;
run;
/* calculate minimum and maximum values per rank */
proc summary data=want nway;
class steel_rank;
var steel;
output out=want_min_max (drop=_:) min= max= / autoname;
run;
/* create dataset with formatted values */
data steel_rank_fmt;
set want_min_max (rename=(steel_rank=start));
retain fmtname 'stl_fmt' type 'N';
label=catx('-',steel_min,steel_max);
run;
/* create format from previous dataset */
proc format cntlin=steel_rank_fmt;
run;
/* apply formatted value to rank column */
proc datasets lib=work nodetails nolist;
modify want;
format steel_rank stl_fmt10.;
quit;
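The macro wrapper mentioned above could look roughly like the sketch below. The macro name `rank_fmt`, its parameters, and the intermediate dataset names `_minmax` and `_fmt` are all assumptions made for illustration. It also assumes the output dataset lives in WORK, and that the variable name is short enough for the derived rank-variable and format names to stay within SAS name length limits.

```sas
%macro rank_fmt(data=, var=, out=, groups=10);
  /* rank the variable into groups */
  proc rank data=&data groups=&groups out=&out;
    var &var;
    ranks &var._rank;
  run;

  /* minimum and maximum of the variable within each rank */
  proc summary data=&out nway;
    class &var._rank;
    var &var;
    output out=_minmax (drop=_:) min=grp_min max=grp_max;
  run;

  /* control dataset for PROC FORMAT: one label per rank */
  data _fmt;
    set _minmax (rename=(&var._rank=start));
    retain fmtname "&var._f" type 'N';
    label = catx('-', grp_min, grp_max);
  run;

  proc format cntlin=_fmt;
  run;

  /* attach the new format to the rank column */
  proc datasets lib=work nodetails nolist;
    modify &out;
    format &var._rank &var._f.;
  quit;
%mend rank_fmt;

/* Usage: chain the calls so each one adds another formatted rank column */
%rank_fmt(data=sashelp.cars, var=enginesize, out=cars_ranked);
%rank_fmt(data=cars_ranked, var=horsepower, out=cars_ranked);
```

The format name is built by appending `_f` to the variable name because a format name may not end in a digit.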
In addition to Keith's good answer, you can also do the following:
proc rank data = sashelp.cars groups = 10 out = test;
var enginesize;
ranks es;
run;
proc sql ;
select *, catx('-',min(enginesize), max(enginesize)) as esrange, es from test
group by es
order by make, model
;
quit;

How to create a new variable in SAS by extracting part of the value of an existing numeric variable?

I have two datasets in SAS that I would like to merge, but they have no common variables. One dataset has a "subject_id" variable, while the other has a "mom_subject_id" variable. Both of these variables are 9-digit codes that have just 3 digits in the middle of the code with common meaning, and that's what I need to match the two datasets on when I merge them.
What I'd like to do is create a new common variable in each dataset that is just the 3 digits from within the subject ID. Those 3 digits will always be in the same location within the 9-digit subject ID, so I'm wondering if there's a way to extract those 3 digits from the variable to make a new variable.
Thanks!
SQL(using sample data from Data Step code):
proc sql;
create table want2 as
select a.subject_id, a.other, b.mom_subject_id, b.misc
from have1 a JOIN have2 b
on(substr(a.subject_id,4,3)=substr(b.mom_subject_id,4,3));
quit;
Data Step:
data have1;
length subject_id $9;
input subject_id $ other $;
datalines;
abc001def other1
abc002def other2
abc003def other3
abc004def other4
abc005def other5
;
data have2;
length mom_subject_id $9;
input mom_subject_id $ misc $;
datalines;
ghi001jkl misc1
ghi003jkl misc3
ghi005jkl misc5
;
data have1;
length id $3;
set have1;
id=substr(subject_id,4,3);
run;
data have2;
length id $3;
set have2;
id=substr(mom_subject_id,4,3);
run;
Proc sort data=have1;
by id;
run;
Proc sort data=have2;
by id;
run;
data work.want;
merge have1(in=a) have2(in=b);
by id;
run;
An alternative, if you are comfortable with SQL, is to use proc sql with a join on substr(), just as explained above.
If your "subject_id" variable is numeric, the substr function won't work as expected: SAS will convert the number to a string automatically, but by default it pads the result with spaces on the left, so the character positions shift.
You can use the modulus function mod(input, base) which returns the remainder when input is divided by base.
/*First get rid of the last 3 digits*/
temp_var = floor( subject_id / 1000);
/* then get the next three digits that we want*/
id = mod(temp_var ,1000);
Or in one line:
id = mod(floor(subject_id / 1000), 1000);
Then you can continue with sorting the new data sets by id and then merging.
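Put together in a DATA step, the numeric approach might look like the sketch below. The dataset `have_num` and its two sample IDs are made up for illustration. For comparison, it also shows a character version that first converts the number with the Z9. format, so that SUBSTR sees exactly nine characters.

```sas
/* Hypothetical numeric 9-digit IDs */
data have_num;
  input subject_id;
  datalines;
123001456
123002456
;

data want_num;
  set have_num;
  /* drop the last three digits, then keep the next three */
  id_num = mod(floor(subject_id / 1000), 1000);
  /* character alternative: zero-padded 9-character string, positions 4-6 */
  id_chr = substr(put(subject_id, z9.), 4, 3);
run;

proc print data=want_num;
run;
```

For the first row this gives id_num = 1 and id_chr = '001'. The numeric version loses leading zeros, which is harmless as long as both datasets build their merge key the same way.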