Use nearest neighbor algorithm to find lookalike population - sas

Nearest-neighbor methods classify a new data point based on the classes of its k nearest neighbors. Assume dataset A contains 10,000 data points and another dataset B contains 1 million data points. The goal is to find the records in dataset B that most closely resemble those in dataset A on a number of pre-decided attributes (features).
SAS has a couple of procedures that can do this, such as PROC DISCRIM, which takes training data and classifies the test data, as below. In this case, how should the training data be defined, given that the purpose is just to find the records in dataset B that look most like each individual record in dataset A?
proc discrim data=train
method=npar k=5
testdata=toscore
testout=toscore_out
;
class y;
var x1-x10;
run;

Related

Macro that outputs table with testing results of SAS table

Problem
I'm not a very experienced SAS user, but unfortunately the lab where I can access data is restricted to SAS. Also, I don't currently have access to the data since it is only available in the lab, so I've created simulated data for testing.
I need to create a macro that gets the values and dimensions from a PROC MEANS table and performs some tests that check whether or not the top two values from the data make up 90% of the results.
As an example, assume I have panel data that lists firms' revenue, costs, and profits. I've created a table that lists n, sum, mean, median, and std. Now I need to check whether the top two firms make up 90% of the total and, if so, flag whether it is profit, revenue, or costs where that happens.
I'm not sure how to get started.
Here are the steps:
1. Read the data.
2. Read the PROC MEANS table created, and get its dimensions and variables.
3. Get the top two firms for each variable and perform the check.
4. Create a new table that lists the variable, the value from the read table, the largest and second largest values, and the flag.
5. Print the table.
Simulated data :
https://www.dropbox.com/s/ypmri8s6i8irn8a/dataset.csv?dl=0
PROC MEANS Table
proc import datafile="/folders/myfolders/dataset.csv"
out=dt
dbms=csv
replace;
getnames=yes;
run;
TITLE "Macro Project Sample";
PROC MEANS data=dt n sum mean median std;
VAR V1 V2 V3;
RUN;
Desired Results :
Variable  Value      Largest  Sec. Largest  Flag
V1        463138.09  9888.09  9847.13
V2        148.92     1.99     1.99
V3        11503375   9999900  1000000       Y
I can't open your simulated dataset at the moment, but I can give you some advice that I hope will help.
You can add the n extreme values of given variables by using the OUTPUT OUT= statement with the IDGROUP option.
Here is an example using the Charity dataset (run this to create it: http://support.sas.com/documentation/cdl/en/proc/65145/HTML/default/viewer.htm#p1oii7oi6k9gfxn19hxiiszb70ms.htm):
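proc means data=Charity;
var MoneyRaised HoursVolunteered;
/* One IDGROUP per variable, so each top two is ranked on its own variable */
output out=try sum=
IDGROUP ( MAX (MoneyRaised) OUT[2] (MoneyRaised)=max1)
IDGROUP ( MAX (HoursVolunteered) OUT[2] (HoursVolunteered)=max2);
run;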
data var1 (keep=name1 _freq_ moneyraised max1_1 max1_2 rename=(moneyraised=value max1_1=largest max1_2=seclargest name1=name))
var2 (keep=name2 _freq_ HoursVolunteered max2_1 max2_2 rename=(HoursVolunteered=value max2_1=largest max2_2=seclargest name2=name));
length name1 name2 $4;
set try ;
name1='VAR1';
name2='VAR2';
run;
data finalmerge;
length flag $1;
set var1 var2;
if largest+seclargest > value*0.9 then flag='Y';
run;
In the PROC MEANS step I chose the variables MoneyRaised and HoursVolunteered; you will use your own V1, V2, and V3 and make the corresponding changes throughout the program.
IDGROUP with MAX outputs values from the observations that rank highest on the variable in the parentheses, and OUT[2] requests the top two, that is, the largest and second largest.
You must name the output variables; I chose max1 and max2, and SAS automatically appends _1 and _2 for the first and second maximum values.
All of the output is on a single row, so I use a DATA step that writes two output datasets (data var1 var2), keeping the variables needed and renaming them for the next step. I also chose a naming scheme, as you can see.
Finally, I stack the two datasets and add the flag.
Here are some initial steps and pointers for a non-macro approach that restructures the data so that no array processing is required. This approach should be good for teaching you a bit about manipulating data in SAS, but it will not be as fast as a single-pass approach (like the macros you originally posted) because it transposes and sorts the data.
First create some nice looking dummy data.
/* Create some dummy data with three variables to assess */
data have;
do firm = 1 to 3;
revenue = rand("uniform");
costs = rand("uniform");
profits = rand("uniform");
output;
end;
run;
Transpose the data so all the values are in one column (with the variable names in another).
/* Move from wide to deep table */
proc transpose
data = have
out = trans
name = Variable;
by firm;
var revenue costs profits;
run;
Sort the data so each variable is in a contiguous group of rows and the highest values are at the end of each Variable group.
/* Sort by Variable and then value
so the biggest values are at the end of each Variable group */
proc sort data = trans;
by Variable COL1;
run;
Because of the structure of this data, you could go down through each observation in turn, creating a running total, which when you get to the final observation in a Variable group would be the Variable total. In this observation you also have the largest value (the second largest was in the previous observation).
At this point you can create a data step that:
- Is aware when it is in the first and last values of each variable group
  - by statement to make the data step aware of your groups
  - first.Variable temporary variable so you can initialise your total variable to 0
  - last.Variable temporary variable so you can output only the last line of each group
- Sums up the values in each group
  - retain statement so SAS doesn't empty your total with each new observation
  - sum() function or + operator to create your total
- Creates and populates new variables for the largest and second largest values in each group
  - lag() function or retain statement to keep the previous value (the second largest)
- Creates your flag
- Outputs your new variables at the end of each group
  - output statement to request an observation be stored
  - keep statement to select which variables you want
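The outline above could be put together roughly as follows. This is only a sketch: it assumes the sorted trans dataset from the previous step, and the names want, total, largest, seclargest, and flag are made up for illustration.

```sas
/* Sketch only: assumes TRANS is already sorted by Variable COL1 */
data want;
  set trans;
  by Variable;                       /* enables first.Variable / last.Variable */
  retain total largest seclargest;   /* keep values across observations */
  length flag $1;

  if first.Variable then do;         /* reset at the start of each group */
    total = 0;
    largest = .;
    seclargest = .;
  end;

  total = sum(total, COL1);          /* running total for the group */
  seclargest = largest;              /* value from the previous observation */
  largest = COL1;                    /* current value; the maximum on the last row */

  if last.Variable then do;
    /* flag when the top two values exceed 90% of the group total */
    if sum(largest, seclargest) > 0.9 * total then flag = 'Y';
    output;                          /* one row per Variable */
  end;

  keep Variable total largest seclargest flag;
run;
```

Using retain here instead of lag() avoids the usual pitfall of calling lag() conditionally; either should work if the lag() call is executed on every observation.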
The macros you posted originally looked like they were meant to perform the analysis you are describing but with some extras (only positive values contributed to the Total, an arbitrary number of values could be included rather than just the top 2, the total was multiplied by another variable k1198, negative values were caught in the second largest, extra flags and values were calculated).

Renaming Coefficients that Result from Proc Logistic/Problems Surrounding Variable Names Common to Multiple Datasets

I am estimating a model for firm bankruptcy that involves 11 factors. I have data from 1900 to 2000, and my goal is to estimate my model using proc logistic for the period 1900-1950 and then test its performance on the 1951 through 2000 data. Proc logistic runs fine, but the problem I have is that the estimated coefficients have the same names as the factors I was using in my model. Suppose the dataset that contains all my observations is called myData, and the dataset that contains the estimated coefficients, which I obtain using the OUTEST= option (in proc logistic), is called factorEstimates. Both of these datasets have the variables factor1, factor2, ..., factorN. Now I want to form the dataset outOfSampleResults that does something like the following:
data outOfSampleResults;
set myData factorEstimates;
newVar=factor1*factor1;
run;
Where the first mention of factor1 refers to that contained in myData and the second refers to that contained in factorEstimates. How can I inform sas which dataset it should read for this variable that is common to both of the datasets in the set statement? Alternatively, how could I quickly rename factor1, factor2, ..., factorN as factor1Estimate, factor2Estimate, ..., factorNEstimate in the factorEstimates dataset so as to circumvent this common variable name issue altogether?
Two quick ways to get estimates for a model already developed:
1. Use the PROC LOGISTIC SCORE statement:
http://support.sas.com/documentation/cdl/en/statug/63033/HTML/default/viewer.htm#statug_logistic_sect066.htm
2. Include the data to be predicted in your original PROC LOGISTIC input, but use a new dependent variable and ensure that it is missing for the observations you want to predict.
data stacked;
set all;
if year >1950 then predicted=.;
else predicted=y;
run;
proc logistic data=stacked;
model predicted = factor1-factor11;
output out=out_predicted predicted=p;
run;
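For the first option, the SCORE statement, a minimal sketch might look like the following. It assumes the dataset and variable names used in the question and in the code above (myData, year, y, factor1-factor11, outOfSampleResults); adjust them to the real data.

```sas
/* Sketch only: fit on 1900-1950, then score the 1951-2000 observations */
proc logistic data=myData(where=(year <= 1950));
  model y = factor1-factor11;
  /* SCORE applies the fitted model to new data and writes predicted probabilities */
  score data=myData(where=(year > 1950)) out=outOfSampleResults;
run;
```

Either route sidesteps the name clash entirely, because the coefficients never have to be combined with the factor values by hand in a data step.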

Get values of survival function in SAS

I generated a random sample from an exponential distribution and sorted them so they are going from lowest to highest value, giving me my order statistics. Now I need to get the values of the survival function at these numbers and plot them against the order statistics. I cannot seem to figure out how to get the list of these survival values in SAS, so I can plot them.
Survivor function estimates at the observed event times can be obtained from the LIFETEST procedure in SAS.
proc lifetest data = yourdata outsurv = survest;
time time2event*eventindicator(0);
run;
The output dataset created by the OUTSURV= option contains the survivor function estimates. Note that by default this gives the Kaplan-Meier estimate of the survivor function.
A plot of the estimated survivor function across time is produced automatically by LIFETEST. If you want to plot against order statistics, use the STEP statement in the SGPLOT procedure after computing your order statistics.
proc sgplot data = survandorder;
step x = order y = survival;
run;
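As a rough end-to-end sketch, assuming a simulated exponential sample with no censoring (the seed, the sample size, and the variable i are arbitrary choices for illustration), the distinct event times in the OUTSURV= dataset are the order statistics, so the estimates can be plotted from it directly:

```sas
/* Sketch only: simulate an uncensored exponential sample */
data yourdata;
  call streaminit(12345);
  do i = 1 to 100;
    time2event = rand("exponential");
    eventindicator = 1;              /* 1 = event observed, 0 = censored */
    output;
  end;
run;

/* Kaplan-Meier estimates; SURVIVAL holds the survivor function values */
proc lifetest data = yourdata outsurv = survest noprint;
  time time2event*eventindicator(0);
run;

/* Step plot of the estimated survivor function against the ordered times */
proc sgplot data = survest;
  step x = time2event y = survival;
run;
```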

Extracting sub-data from a SAS dataset & applying to a different dataset

I have written a macro that uses proc univariate to calculate custom quantiles for variables in a dataset (say dsn1): %cust_quants(dsn= , varlist= , quant_list= ). The output is a summary dataset (say dsn2) that looks something like the following:
q_1     q_2.5  q_50  q_80  q_97.5  q_99   var_name
1       2.5    50    80    97.5    99     ex_var_1_100
-2      10     25    150   500     20000  ex_var_pos_skew
-20000  -500   -150  0     10      50     ex_var_neg_skew
What I would like to do is to use the summary dataset to cap/floor extreme values in the original dataset. My idea is to extract the column of interest (say q_99) and put it into a vector of macro-variables (say q_99_1, q_99_2, ..., q_99_n). I can then do something like the following:
/* create summary of dsn1 as above example */
%cust_quants(dsn= dsn1, varlist= ex_var_1_100 ex_var_pos_skew ex_var_neg_skew,
quant_list= 1 2.5 50 80 97.5 99);
/* cap dsn1 var's at 99th percentile */
data dsn1_cap;
set dsn1;
if ex_var_1_100 > &q_99_1 then ex_var_1_100 = &q_99_1;
if ex_var_pos_skew > &q_99_2 then ex_var_pos_skew = &q_99_2;
/* don't cap neg skew */
run;
In R, it is very easy to do this. One can extract sub-data from a data-frame using matrix like indexing and assign this sub-data to an object. This second object can then be referenced later. R example--extracting b from data-frame a:
> a <- as.data.frame(cbind(c(1,2,3), c(4,5,6)))
> print(a)
V1 V2
1 1 4
2 2 5
3 3 6
> a[, 2]
[1] 4 5 6
> b <- a[, 2]
> b[1]
[1] 4
Is it possible to do the same thing in SAS? I want to be able to assign a column (or columns) of sub-data to a macro variable / array, such that I can then use the macro / array within a 2nd data step. One thought is proc sql into:
proc sql noprint;
select v2 into :v2_macro separated by " "
from a;
quit;
However, this creates a single string variable when what I really want is a vector of variables (or array--no vectors in SAS). Another thought is to add %scan (assuming this is inside a macro):
proc sql noprint;
select v2 into :v2_macro separated by " "
from a;
quit;
%let i = 1;
/* macro comparisons are plain text, so test against nothing rather than "" */
%do %until(%scan(&v2_macro, &i) = );
%let var_&i = %scan(&v2_macro, &i);
%let i = %eval(&i + 1);
%end;
This seems inefficient and takes a lot of code. It also requires the programmer to remember which var_&i corresponds to each future purpose. Is there a simpler / cleaner way to do this?
**Please let me know in the comments if this is enough background / example. I'm happy to give a more complete description of why I'm doing what I'm attempting if needed.
First off, I assume you are talking about SAS/Base not SAS/IML; SAS/IML is essentially similar to R and has the same kind of operations available in the same manner.
SAS/Base is more similar to a database language than a matrix language (though has some elements of both, and some elements of an OOP language, as well as being a full-featured functional programming language).
As a result, you do things somewhat differently in order to achieve the same goal. Additionally, because of the cost of moving data in a large data table, you are given multiple methods to achieve the same result; you can choose the appropriate method for the required situation.
To begin with, you generally should not store data in a macro variable in the manner you suggest. It is bad programming practice, and it is inefficient (as you have already noticed). SAS Datasets exist to store data; SAS macro variables exist to help simplify your programming tasks and drive the code.
Creating the dataset "b" as above is trivial in Base SAS:
data b;
set a;
keep v2;
run;
That creates a new dataset with the same rows as A, but only the second column. KEEP and DROP allow you to control which columns are in the dataset.
However, there would be very little point in this dataset, unless you were planning on modifying the data; after all, it contains the same information as A, just less. So for example, if you wanted to merge V2 into another dataset, rather than creating b, you could simply use a dataset option with A:
data c;
merge z a(keep=v2);
by id;
run;
(Note: I presuppose an ID variable of some form to combine A and Z.)
This merge combines the v2 column onto z in a new dataset, c. This is equivalent to horizontally concatenating two matrices (a straight-up concatenation would remove the 'by id;' requirement, but in databases you do not typically do that, as order is not guaranteed to be what you expect).
If you plan on using b to do something else, how you create and/or use it depends on that usage. You can create a format, which is a mapping of values [ie, 1='Hello' 2='Goodbye'] and thus allows you to convert one value to another with a single programming statement. You can load it into a hash table. You can transpose it into a row (proc transpose). Supply more detail and a more specific answer can be provided.
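For the capping task in the question, one possible sketch of the transpose route is shown below. It assumes dsn2 has the layout shown in the question, with valid SAS names such as q_99 and var_name; the dataset caps and the cap_ prefix are made up for illustration.

```sas
/* Sketch only: turn the q_99 column into one row, one variable per var_name */
proc transpose data=dsn2 out=caps(drop=_name_) prefix=cap_;
  id var_name;
  var q_99;
run;

data dsn1_cap;
  if _n_ = 1 then set caps;          /* read once; values are kept for every row */
  set dsn1;
  /* a missing value compares low, so missing values are left untouched */
  if ex_var_1_100 > cap_ex_var_1_100 then ex_var_1_100 = cap_ex_var_1_100;
  if ex_var_pos_skew > cap_ex_var_pos_skew then ex_var_pos_skew = cap_ex_var_pos_skew;
  /* don't cap neg skew */
  drop cap_:;
run;
```

This keeps the quantiles in a dataset rather than in macro variables, and each cap is referenced by the name of the variable it belongs to instead of by position.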

How to perform 1NN with a single centroid per class in SAS?

I've computed a single centroid per class using PROC FASTCLUS in SAS:
proc fastclus data=sample.samples_train mean=knn.clusters
maxc=1 /* number of clusters */
maxiter=100;
var &predictors;
by class;
run;
I'm trying to classify a test set based on the closest of these centroids. I'm doing this using PROC DISCRIM, also in SAS.
proc discrim data=knn.clusters
testdata=sample.samples_test
method=NPAR k=1 metric=identity noclassify;
class class;
var &predictors;
ods output ErrorTestClass=knn.error;
run;
I'm using the Euclidean distance measure via the option metric=identity.
The following error is returned:
ERROR: There are not enough observations to evaluate the pooled covariance matrix in DATA= data set or BY group.
This works if I set the number of clusters in PROC FASTCLUS to 2.
However, how do I perform 1NN with a single centroid per class in SAS?