I've computer a single centroid per class using PROC fastclus in SAS,
proc fastclus data=sample.samples_train mean=knn.clusters
maxc=1 *number of clusters*
maxiter=100;
var &predictors;
by class;
run;
I'm trying to classify a test set based on the closest one of these centroids created. This i'm doing using PROC discrim in also in SAS.
proc discrim data=knn.clusters
testdata=sample.samples_test
method=NPAR k=1 metric=identity noclassify;
class class;
var &predictors;
ods output ErrorTestClass=knn.error;
run;
I'm using the euclidean distance measure with the option metric=identity.
The following error is returned.
ERROR: There are not enough observations to evaluate the pooled covariance matrix in DATA= data set or BY group.
This works if I set the number of cluster in fastproc equal to 2.
How do I however preform a 1NN with single centroid per class in SAS?
Related
The title may be a little ambiguous. Essentially, using the SASHELP.SHOES dataset, I'm trying to summarize the data in a new table by totaling the Sales, and Returns for each region. For instance, instead of having 56 rows for shoes sold in Africa and their individual sales/returns values, I have one row for Africa with columns TotalSales and TotalReturns. I need to do this for each region in the original dataset.
I'm not familiar at all with SAS, this is more or less the first thing I've really had to program in it. I've tried a few variations of data steps with IN or WHERE conditions, proc means steps with SUM() statements, and DO/DO WHILE loops, but I've missed something each time.
In Proc MEANS
Use a CLASS statement to specify which variable(s) are to be used to group the data. In your case REGION.
Use the VAR statement to specify which variable(s) are to have statistics calculated for within each grouping.
Default output
Corresponding to the minimal syntax
ods listing;
proc means noprint data=SASHELP.SHOES;
class region;
var sales returns;
output out=shoes_stats;
run;
Creates data set WORK.SHOES_STATS with one row per statistic per region.
Other output structure
Use procedure option NWAY to only get summarizations for combinations of all the CLASS variables. (In your case this corresponds to rows with _TYPE_=1)
The output columns can have the statistic name automatically concatenated to the variable name using the OUTPUT statement option / autoname.
Use data set options to control variables that are kept or dropped.
proc means nway noprint data=SASHELP.SHOES;
class region;
var sales returns;
output out=shoes_sums(drop=_type_ _freq_) sum= / autoname;
run;
dm 'vt shoes_sums; column names' viewtable;
I am new in SAS and I am trying to calculate Cronbach's alpha. The code I am using is:
proc corr data=test alpha;
var A B C;
run;
However,
This way I only get the Cronbach's alpha in a table in the results section. Is there a way by only using the editor, to automatically get a new column in the dataset with the value of the Cronbach's alpha coefficient?
Is it possible to calculate the Cronabch's alpha for the variables A, B, and C, but per group? So for instance if in my dataset I have 100 groups, would it be possible to calculate the coefficient of Cronbach's alpha per group, in one go and not by creating 100 different datasets?
Ok folks I've found it. Many thanks #user456789123 for helping me with 2.
So the code I used:
proc sort data=test;
by varGroup;
run;
Which will help in step 2. that I will need to get the Cronbach's alpha for each Group.
proc corr data=test alpha outp=stats;
var A B C;
by varGroup;
run;
Here I get 'x' number of tables in the results section, with the Chronbach's alpha depending on how many categories the varGroup has. Also the command outp=gg actually creates a table with all the categories, Cronbach's alpha coefficients per category, and a bunch of other info produced by the proc corr procedure which I can drop later.
So last thing to do is to merge the new table "stats" with the old table "test" by the "varGroup" variable and I get the original table I was looking for.
I am currently running a macro code in SAS and I want to do a calculation with regards to max and min. Right now the line of code I have is :
hhincscaled = 100*(hhinc - min(hhinc) )/ (max(hhinc) - min(hhinc));
hhvaluescaled = 100*(hhvalue - min(hhvalue))/ (max(hhvalue) - min(hhvalue));
What I am trying to do is re-scale household income and value variables with the calculations below. I am trying to subtract the minimum value of each variable and subtract it from the respective maximum value and then scale it by multiplying it by 100. I'm not sure if this is the right way or if SAS is recognizing the code the way I want it.
I assume you are in a Data Step. A Data Step has an implicit loop over the records in the data set. You only have access to the record of the current loop (with some exceptions).
The "SAS" way to do this is the calculate the Min and Max values and then add them to your data set.
Proc sql noprint;
create table want as
select *,
min(hhinc) as min_hhinc,
max(hhinc) as max_hhinc,
min(hhvalue) as min_hhvalue,
max(hhvalue) as max_hhvalue
from have;
quit;
data want;
set want;
hhincscaled = 100*(hhinc - min_hhinc )/ (max_hhinc - min_hhinc);
hhvaluescaled = 100*(hhvalue - min_hhvalue)/ (max_hhvalue - min_hhvalue);
/*Delete this if you want to keep the min max*/
drop min_: max_:;
run;
Another SAS way of doing this is to create the max/min table with PROC MEANS (or PROC SUMMARY or your choice of alternatives) and merge it on. Doesn't require SQL knowledge to do, and probably about the same speed.
proc means data=have;
*use a class value if you have one;
var hhinc hhvalue;
output out=minmax min= max= /autoname;
run;
data want;
if _n_=1 then set minmax; *get the min/max values- they will be retained automatically and available on every row;
set have;
*do your calculations, using the new variables hhinc_max hhinc_min etc.;
run;
If you have a class statement - ie, a grouping like 'by state' or similar - add that in proc means and then do a merge instead of a second set in want, by your class variable. It would require a sorted (initial) dataset to merge.
You also have the option of doing this in SAS-IML, which works more similarly to how you are thinking above. IML is the SAS interactive matrix language, and more similar to r or matlab than the SAS base language.
I'm not so familiar with SAS proc glm. All I have done using proc glm so far is to output parameter estimates and predicted values on training datasets. But I also need to use the fitted model to make prediction on testing dataset. (both point estimates and interval estimates)
Here is my code.
ods output ParameterEstimates=Pi_Parameters FitStatistics=Pi_Summary PredictedValues=Pi_Fitted;
proc glm data=Train_Pi;
class Area Fo5 Tye M0 M1 M2 M3;
model Pi = Dow Area Fo5 Tye M0|HC M1|HC M2|HC M3|HC/solution p ss3 /*tolerance*/;
run;
But how to proceed to next step? something like predict(Model_from_Train_Pi,Test_Pi)
If you're on SAS 9.4 see Jake's answer from this question:
How to predict probability in logistic regression in SAS?
If not on 9.4, my answer applies for adding the data in to the original data set.
A third option is PROC SCORE - documentation has an example for proc reg that's almost identical to your question:
http://support.sas.com/documentation/cdl/en/statug/63033/HTML/default/viewer.htm#statug_score_sect018.htm
I am estimating a model for firm bankruptcy that involves 11 factors. I have data from 1900 to 2000 and my goal is to estimate my model using proc logistic for the period 1900-1950 and then test its performance on the 1951 through 2000 data. Proc logistic runs fine but the problem I have is that the estimated coefficients have the same name as my factors that I was using in my model. Suppose the dataset that contains all my observations is called myData and the dataset that contains the estimated coefficients which I obtain using an outtest statement (in proc logistic) is called factorEstimates. Now both of these data sets have the variables factor1, factor2, ..., factorN. Now I want to form the dataset outOfSampleResults that does something like the following:
data outOfSampleResults;
set myData factorEstimates;
newVar=factor1*factor1;
run;
Where the first mention of factor1 refers to that contained in myData and the second refers to that contained in factorEstimates. How can I inform sas which dataset it should read for this variable that is common to both of the datasets in the set statement? Alternatively, how could I quickly rename factor1, factor2, ..., factorN as factor1Estimate, factor2Estimate, ..., factorNEstimate in the factorEstimates dataset so as to circumvent this common variable name issue altogether?
Two quick ways to get estimates for a model already developed:
1. Proc logistic score statement
http://support.sas.com/documentation/cdl/en/statug/63033/HTML/default/viewer.htm#statug_logistic_sect066.htm
Include the data in your original proc logistic but use a new variable and ensure that the dependent variable is missing for the observations you want to predict.
data stacked;
set all;
if year >1950 then predicted=.;
else predicted=y;
run;
proc logistic data=stacked;
model predicted = factor1 - factor12;
output out=out_predicted predicted=p;
run;