How to run many SVM Models with SAS Code?

I'm working in Enterprise Miner and saw a video from SAS in which the presenter briefly shows a SAS Code node that runs a batch of SVM models.
He didn't show the whole thing, but enough to get me curious about how to do this. Here's what I was able to get so far:
%macro hpsvm (run=1,runLabel=,penalty=10,method=activeSet,kernel=TBF);
proc hpsvm data=&em_import_data maxiter=25 metho = &method. tolerance=0.000001 c = &penalty.;
input %em_internal_input / level = interval;
target %em_target / level = binary ;
lernel &kernel.;
partition fraction (validate=.3 seed=12345);
ods output fitStatistics=fitStats&run.;
run;
data firStats&run.;
length method $ 10;
length kernel $ 10;
length runLabel $ 64;
set fitStats&run.;
run = &run;
runLabel = "&runLabel";
method="&method.";
kernel="&kernel.";
penalty=&penalty.;
run;
proc print data=fitStats&run.;
run;
proc append base=fitStats data=firStats&run.;
%mend hpsvm;
%hpsvm(run=1,runLabel=RBF c=1, method=activeSet, kernel=RBF,penalty=1);
%hpsvm(run=2,runLabel=RBF c=5, method=activeSet, kernel=RBF,penalty=5);
%hpsvm(run=3,runLabel=RBF c=10, method=activeSet, kernel=RBF,penalty=10);
%hpsvm(run=4,runLabel=RBF c=15, method=activeSet, kernel=RBF,penalty=15);
%hpsvm(run=5,runLabel=RBF c=20, method=activeSet, kernel=RBF,penalty=20);
%hpsvm(run=6,runLabel=Linear c=1, method=ipoint, kernel=linear,penalty=1);
%hpsvm(run=7,runLabel=Linear c=5, method=ipoint, kernel=linear,penalty=5);
%hpsvm(run=8,runLabel=Linear c=10, method=ipoint, kernel=linear,penalty=10);
%hpsvm(run=9,runLabel=Linear c=15, method=ipoint, kernel=linear,penalty=15);
%hpsvm(run=10,runLabel=Linear c=20, method=ipoint, kernel=linear,penalty=20);
%hpsvm(run=11,runLabel=Polynomial c=1, method=ipoint, kernel=POLYNOM,penalty=1);
%hpsvm(run=12,runLabel=Polynomial c=5, method=ipoint, kernel=POLYNOM,penalty=5);
%hpsvm(run=13,runLabel=Polynomial c=10, method=ipoint, kernel=POLYNOM,penalty=10);
%hpsvm(run=14,runLabel=Polynomial c=15, method=ipoint, kernel=POLYNOM,penalty=15);
%hpsvm(run=15,runLabel=Polynomial c=20, method=ipoint, kernel=POLYNOM,penalty=20);
%hpsvm(run=16,runLabel=Sigmoid c=1, method=activeSet, kernel=SIGMOID,penalty=1);
%hpsvm(run=17,runLabel=Sigmoid c=5, method=activeSet, kernel=SIGMOID,penalty=5);
%hpsvm(run=18,runLabel=Sigmoid c=10, method=activeSet, kernel=SIGMOID,penalty=10);
%hpsvm(run=19,runLabel=Sigmoid c=15, method=activeSet, kernel=SIGMOID,penalty=15);
%hpsvm(run=20,runLabel=Sigmoid c=20, method=activeSet, kernel=SIGMOID,penalty=20);
data fitStats;
retain run runLabel method kernal penalty;
set fitStats;
run;
%em_register(type=Data,key=fitStats);
data &em_user_fitStats;
retain Penalty;
set fitStats;
run;
%em_report(viewType=data,key=fitStats,autodisplay=y,description=Fit Statistics by Run);
%em_register(type=Data,key=Error);
A few things to note about this:
I'm using the MillionSongDataset from UCI (but let me know how to output data to a good format for SO, and I'll add some here)
This should run using data from the previous node (Data Partition)
The only error I can make out is something about quotes or semicolons not being in the right place, but everything looks OK to me (I have almost no SAS coding experience).
The video did not show the remaining fifth of the code.
I'm looking to run many SVM Models to try different combinations of options to find the best model.

SAS actually emailed me the code!
%macro hpsvm (run=1,runLabel=,penalty=10,method=activeSet,kernel=RBF);
proc hpsvm data=&em_import_data maxiter=25 method = &method. tolerance=0.000001 c = &Penalty.;
input %em_interval_input / level = interval;
target %em_target / level = binary ;
kernel &kernel.;
partition fraction (validate=.3 seed=12345);
ods output fitStatistics=fitStats&run.;
run;
data fitStats&run.;
length method $ 10;
length kernel $ 10;
length runLabel $ 64;
set fitStats&run.;
run = &run;
runLabel = "&runLabel";
method="&method.";
kernel="&kernel.";
penalty=&penalty.;
run;
proc print data=fitStats&run.;
run;
proc append base=fitStats data=fitStats&run.;
run;
%mend hpsvm;
%hpsvm(run=1,runLabel=RBF c=1,method=activeSet,kernel=RBF,penalty=1);
%hpsvm(run=2,runLabel=RBF c=5,method=activeSet,kernel=RBF,penalty=5);
%hpsvm(run=3,runLabel=RBF c=10,method=activeSet,kernel=RBF,penalty=10);
%hpsvm(run=4,runLabel=RBF c=15,method=activeSet,kernel=RBF,penalty=15);
%hpsvm(run=5,runLabel=RBF c=20,method=activeSet,kernel=RBF,penalty=20);
%hpsvm(run=6,runLabel=Linear c=1,method=ipoint,kernel=linear,penalty=1);
%hpsvm(run=7,runLabel=Linear c=5,method=ipoint,kernel=linear,penalty=5);
%hpsvm(run=8,runLabel=Linear c=10,method=ipoint,kernel=linear,penalty=10);
%hpsvm(run=9,runLabel=Linear c=15,method=ipoint,kernel=linear,penalty=15);
%hpsvm(run=10,runLabel=Linear c=20,method=ipoint,kernel=linear,penalty=20);
%hpsvm(run=11,runLabel=Polynomial c=1,method=ipoint,kernel=POLYNOM,penalty=1);
%hpsvm(run=12,runLabel=Polynomial c=5,method=ipoint,kernel=POLYNOM,penalty=5);
%hpsvm(run=13,runLabel=Polynomial c=10,method=ipoint,kernel=POLYNOM,penalty=10);
%hpsvm(run=14,runLabel=Polynomial c=15,method=ipoint,kernel=POLYNOM,penalty=15);
%hpsvm(run=15,runLabel=Polynomial c=20,method=ipoint,kernel=POLYNOM,penalty=20);
%hpsvm(run=16,runLabel=Sigmoid c=1,method=activeSet,kernel=SIGMOID,penalty=1);
%hpsvm(run=17,runLabel=Sigmoid c=5,method=activeSet,kernel=SIGMOID,penalty=5);
%hpsvm(run=18,runLabel=Sigmoid c=10,method=activeSet,kernel=SIGMOID,penalty=10);
%hpsvm(run=19,runLabel=Sigmoid c=15,method=activeSet,kernel=SIGMOID,penalty=15);
%hpsvm(run=20,runLabel=Sigmoid c=20,method=activeSet,kernel=SIGMOID,penalty=20);
data fitStats;
retain run runLabel method kernel penalty;
set fitStats;
run;
proc print data=fitStats;
run;
%em_register(type=Data,key=fitStats);
data &em_user_fitStats;
retain Penalty;
set fitStats;
run;
%em_report(viewType=data,key=fitStats,autodisplay=y,description=Fit Statistics by Run);
%em_register(type=Data,key=Error);
data &em_user_Error;
set fitStats;
if statistic = 'Error';
run;
%em_report(viewType=lineplot,key=Error,x=penalty,y=validation,group=kernel,description=Classification Error by penalty,autodisplay=y);
%em_register(type=Data,key=Sensitivity);
data &em_user_Sensitivity;
set fitStats;
if statistic = 'Sensitivity';
run;
%em_report(viewType=lineplot,key=Sensitivity,x=penalty,y=validation,group=kernel,description=Sensitivity by penalty,autodisplay=y);
%em_register(type=Data,key=Specificity);
data &em_user_Specificity;
set fitStats;
if statistic = 'Specificity';
run;
%em_report(viewType=lineplot,key=Specificity,x=penalty,y=validation,group=kernel,description=Specificity by penalty,autodisplay=y);
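The twenty hand-written calls above could also be generated by a macro loop. The following is only a sketch, not part of the code SAS supplied: the wrapper name run_grid and its penalties parameter are hypothetical, and the pairing of kernel to optimization method is copied from the calls above.

```sas
/* Hypothetical wrapper: loops over kernels and penalty values and
   calls the %hpsvm macro defined above once per combination. */
%macro run_grid(penalties=1 5 10 15 20);
  %local k p kernel method penalty run;
  %let run = 0;
  %do k = 1 %to 4;
    %let kernel = %scan(RBF LINEAR POLYNOM SIGMOID, &k);
    /* Method pairing copied from the explicit calls above */
    %if &kernel = RBF or &kernel = SIGMOID %then %let method = activeSet;
    %else %let method = ipoint;
    %do p = 1 %to %sysfunc(countw(&penalties));
      %let penalty = %scan(&penalties, &p);
      %let run = %eval(&run + 1);
      %hpsvm(run=&run, runLabel=&kernel c=&penalty, method=&method, kernel=&kernel, penalty=&penalty);
    %end;
  %end;
%mend run_grid;

%run_grid();
```

Changing the penalties list (for example penalties=0.1 1 10 100) would then extend the search without editing the individual calls.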

Related

SAS 9.4: How to use different weights on the same variable without datastep or proc sql

I can't find a way to summarize the same variable using different weights.
I'll try to explain it with an example (3 records):
data pippo;
a=10;
wgt1=0.5;
wgt2=1;
wgt3=0;
output;
a=3;
wgt1=0;
wgt2=0;
wgt3=1;
output;
a=8.9;
wgt1=1.2;
wgt2=0.3;
wgt3=0.1;
output;
run;
I tried the following:
proc summary data=pippo missing nway;
var a /weight=wgt1;
var a /weight=wgt2;
var a /weight=wgt3;
output out=pluto (drop=_freq_ _type_) sum()=;
run;
Obviously it gives me a warning because I used the same variable "a" more than once (and I can't rename it!).
I have to store a huge amount of data with limited physical space, and otherwise I would have to construct around 120 fields (a0-a6, b0-b6, etc.) that are the same variables multiplied by fixed weights (wgt0-wgt5).
I want to store a dataset with 20 columns (a, b, c, ...) and 6 weights (wgt0-wgt5) and, on demand, run a "summary" without an intermediate data step that obliges me to create 120 fields.
Because of the volume of data (roughly 55 GB every month) I would also prefer not to use a proc sql statement like this:
proc sql;
create table pluto as
select sum(db.a * wgt1) as a0, sum(db.a * wgt2) as a1 /* , etc. */
from pippo as db;
quit;
Is there a "super proc summary" that can summarize the same field with different weights?
Thanks in advance,
Paolo
I think there are a few options. One is the data step view that data_null_ mentions. Another is to run proc summary once per weight, either using ods output with persist=proc or writing one output dataset per run, and then setting the results together.
A third option, though, is to roll your own summarization. This is advantageous in that it only sees the data once - so it's faster. It's disadvantageous in that there's a bit of work involved and it's more complicated.
Here's an example of doing this with sashelp.baseball. In your actual case you'll want to use code to generate the array reference for the variables, and possibly for the weights, if they're not easily creatable using a variable list or similar. This assumes you have no CLASS variable, but it's easy to add that into the key if you do have a single (set of) class variable(s) that you want NWAY combinations of only.
data test;
set sashelp.baseball;
array w[5];
do _i = 1 to dim(w);
w[_i] = rand('Uniform')*100+50;
end;
output;
run;
data want;
set test end=eof;
i = .;
length varname $32;
sumval = 0 ;
sum=0;
if _n_ eq 1 then do;
declare hash h_summary(suminc:'sumval',keysum:'sum',ordered:'a');
h_summary.defineKey('i','varname'); *also would use any CLASS variable in the key;
h_summary.defineData('i','varname'); *also would include any CLASS variable in the data;
h_summary.defineDone();
end;
array w[5]; *if weights are not named in easy fashion like this generate this with code;
array vars[*] nHits nHome nRuns; *generate this with code for the real dataset;
do i = 1 to dim(w);
do j = 1 to dim(vars);
varname = vname(vars[j]);
sumval = vars[j]*w[i];
rc = h_summary.ref();
if i=1 then put varname= sumval= vars[j]= w[i]=;
end;
end;
if eof then do;
rc = h_summary.output(dataset:'summary_output');
end;
run;
One other thing to mention though... if you're doing this because you're doing something like jackknife variance estimation or that sort of thing, or anything that uses replicate weights, consider using PROC SURVEYMEANS which can handle replicate weights for you.
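For comparison, the data step view option mentioned at the start of this answer might look like the sketch below. It uses the pippo example from the question; the view name pippo_w and the column names a1-a3 are made up for illustration. A view stores no data, so the weighted columns exist only while proc summary reads them.

```sas
/* View: computes a*wgt1..a*wgt3 on the fly, nothing is written to disk */
data pippo_w / view=pippo_w;
  set pippo;
  array wgt[3] wgt1-wgt3;
  array aw[3] a1-a3;          /* hypothetical names for the weighted copies */
  do _k = 1 to dim(wgt);
    aw[_k] = a * wgt[_k];
  end;
  drop _k;
run;

/* A plain sum of a*wgt equals the weighted sum of a */
proc summary data=pippo_w nway;
  var a1-a3;
  output out=pluto(drop=_freq_ _type_) sum=;
run;
```

The data is still read once per proc summary call, so this trades a little CPU for not materializing the 120 columns.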
You can SCORE your data set using a customized SCORE data set that you can generate with a data step.
options center=0;
data pippo;
retain a 10 b 1.75 c 5 d 3 e 32;
run;
data score;
if 0 then set pippo;
array v[*] _numeric_;
retain _TYPE_ 'SCORE';
length _name_ $32;
array wt[3] _temporary_ (.5 1 .333);
do i = 1 to dim(v);
call missing(of v[*]);
do j = 1 to dim(wt);
_name_ = catx('_',vname(v[i]),'WGT',j);
v[i] = wt[j];
output;
end;
end;
drop i j;
run;
proc print;
run;
proc score data=pippo score=score;
id a--e;
var a--e;
run;
proc print;
run;
proc means stackods sum;
ods exclude summary;
ods output summary=summary;
run;
proc print;
run;

Highlight the corresponding line number from another dataset

I have two datasets; one contains the extreme values extracted by proc univariate. I would like to create a new variable in the original dataset and set it to 1 when the row number (_n_) in the original dataset equals an observation number listed in the univariate dataset. I don't know how to program this without entering the line numbers manually.
 
There are a few ways to do this, but one easy way is to add the row number to the original dataset and merge on it.
Here's an example.
ods output extremeobs=extreme_test;
proc univariate data=sashelp.heart;
run;
ods output close;
data extreme_diastolic extreme_systolic; *just creating the extreme datasets;
set extreme_test;
if varname='Diastolic' then output extreme_diastolic;
else if varname='Systolic' then output extreme_systolic;
run;
data for_merge; *adding rownum on to the original dataset;
set sashelp.heart;
rownum = _n_;
run;
*now, sort the extreme datasets by the `highobs` and `lowobs` values respectively and save those as `rownum`, so they can be merged;
proc sort data=extreme_diastolic out=high_diastolic(keep=highobs rename=highobs=rownum);
by highobs;
run;
proc sort data=extreme_systolic out=high_systolic(keep=highobs rename=highobs=rownum);
by highobs;
run;
proc sort data=extreme_diastolic out=low_diastolic(keep=lowobs rename=lowobs=rownum);
by lowobs;
run;
proc sort data=extreme_systolic out=low_systolic(keep=lowobs rename=lowobs=rownum);
by lowobs;
run;
*now, merge those on using `in=` to identify which are matches.;
data heart_extremes;
merge for_merge high_diastolic(in=_highd) high_systolic(in=_highs) low_diastolic(in=_lowd) low_systolic(in=_lows);
by rownum;
if _highd then high_diastolic = 1;
if _highs then high_systolic = 1;
if _lowd then low_diastolic = 1;
if _lows then low_systolic = 1;
run;

proc means output percentile statistics

proc means data = data1 stackODSoutput MIN P10 P25 P50 P75 P90 MAX N NMISS SUM nolabels maxdec=3;
var var1 var2;
output out = output;
run;
The generated report shows all the percentiles and SUM, but the output dataset only contains the basic statistics: N, MIN, MAX, MEAN and STD.
How can I also output the percentiles and the sum?
For output datasets in proc means, you need to specify which statistics you'd like within the output statement. Think of the proc statement as only controlling the visual output. Try this instead:
proc means data=sashelp.cars;
var horsepower MPG_City MPG_Highway;
output out=output
sum=
mean=
median=
std=
min=
max=
p10=
p25=
p75=
p90=
/ autoname
;
run;
Note that none of the statistics have anything after the =. The autoname option is automatically naming the statistic variables.
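If you would rather choose the output names yourself instead of relying on autoname, each statistic can list a variable in parentheses and a name after the =. A minimal sketch, where the dataset name output_named and the variable names hp_sum, hp_p10 and mpg_city_p90 are just illustrative choices:

```sas
proc means data=sashelp.cars noprint;
  var horsepower MPG_City;
  output out=output_named
    sum(horsepower)=hp_sum          /* sum of horsepower */
    p10(horsepower)=hp_p10          /* 10th percentile of horsepower */
    p90(MPG_City)=mpg_city_p90;     /* 90th percentile of MPG_City */
run;
```

Explicit names are more typing, but they avoid depending on the <variable>_<statistic> pattern that autoname generates.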
To make it easier to read, we can change the format of the output table. The naming convention of all variables is <variable>_<statistic>. Knowing this, we can transpose the table, separate out the variable and statistics from the name, then re-transpose it into a nicer format.
proc transpose data=output out=output_transposed;
var _NUMERIC_;
run;
data _want(index=(variable) );
set output_transposed;
Stat = scan(_NAME_, -1, '_');
Variable = tranwrd(_NAME_, cats('_', Stat), '');
keep Variable Stat COL1;
rename COL1 = Value;
run;
proc transpose data=_want out=want(drop=_NAME_);
by variable;
id stat;
var Value;
run;

SAS / PROC FREQ TABLES - can I suppress frequencies and percents if frequency is less than a given value?

I'm using tagsets.excelxp in SAS to output dozens of two-way tables to an .xml file. Is there syntax that will suppress rows (frequencies and percents) if the frequency in that row is less than 10? I need to apply that in order to de-identify the results, and it would be ideal if I could automate the process rather than use conditional formatting in each of the outputted tables. Below is the syntax I'm using to create the tables.
ETA: I need those suppressed values to be included in the computation of column frequencies and percents, but I need them to be invisible in the final table (examples of options I have considered: gray out the entire row, turn the font white so it doesn't show for those cells, replace those values with an asterisk).
Any suggestions would be greatly appreciated!!!
Thanks!
dr j
%include 'C:\Users\Me\Documents\excltags.tpl';
ods tagsets.excelxp file = "C:\Users\Me\Documents\Participation_rdg_LSS_3-8.xml"
style = MonoChromePrinter
options(
convert_percentages = 'yes'
embedded_titles = 'yes'
);
title1 'Participation';
title2 'LSS-Level';
title3 'Grades 3-8';
title4 'Reading';
ods noproctitle;
proc sort data = part_rdg_3to8;
by flag_accomm flag_participation lss_nm;
run;
proc freq data = part_rdg_3to8;
by flag_accomm flag_participation;
tables lss_nm*grade_p / crosslist nopercent;
run;
ods tagsets.excelxp close;
D.Jay: Proc FREQ does not contain any options for conditionally masking cells of its output. You can leverage the output data capture capability of the ODS system with a follow-up Proc REPORT to produce the desired masked output.
I am guessing that lss is a skill level and grade_p is a student grade level.
Generate some sample data
data have;
do student_id = 1 to 10000;
flag1 = ranuni(123) < 0.4;
flag2 = ranuni(123) < 0.6;
lss = byte(65+int(26*ranuni(123)));
grade = int(6*ranuni(123));
* at every third lss force data to have a low percent of grades < 3;
if mod(rank(lss),3)=0 then
do until (grade > 2 or _n_ < 0.15);
grade = int(6*ranuni(123));
_n_ = ranuni(123);
end;
else if mod(rank(lss),7)=0 then
do until (grade < 3 or _n_ < 0.15);
grade = int(6*ranuni(123));
_n_ = ranuni(123);
end;
output;
end;
run;
proc sort data=have;
by flag1 flag2;
*where lss in ('A' 'B') and flag1 and flag2; * remove comment to limit amount of output during 'learning the code' phase;
run;
Perform the Proc FREQ
Only capture the data corresponding to the output that would have been generated
ods _all_ close;
* ods trace on;
/* trace will log the Output names
* that a procedure creates, and thus can be captured
*/
ods output CrossList=crosslist;
proc freq data=have;
by flag1 flag2;
tables lss * grade / crosslist nopercent;
run;
ods output close;
ods trace off;
Now generate output to your target ODS destination (be it ExcelXP, html, pdf, etc)
First, the reference output, for which an equivalent with masked values needs to be produced:
* regular output of FREQ, to be compared to the masked output
* of the same information via REPORT;
proc freq data=have;
by flag1 flag2;
tables lss * grade / crosslist nopercent;
run;
Proc REPORT has great features for producing conditional output. The compute block is used to select either a value or a masked value indicator for output.
options missing = ' ';
proc format;
value $lss_report ' '= 'A0'x'Total';
value grade_report . = 'Total';
value blankfrq .b = '*masked*' ._=' ' other=[best8.];
value blankpct .b = '*masked*' ._=' ' other=[6.2];
proc report data=CrossList;
by flag1 flag2;
columns
('Table of lss by grade'
lss grade
Frequency RowPercent ColPercent
FreqMask RowPMask ColPMask
)
;
define lss / order order=formatted format=$lss_report. missing;
define grade / display format=grade_report.;
define Frequency / display noprint;
define RowPercent / display noprint;
define ColPercent / display noprint;
define FreqMask / computed format=blankfrq. 'Frequency' ;
define RowPMask / computed format=blankpct. 'Row/Percent';
define ColPMask / computed format=blankpct. 'Column/Percent';
compute FreqMask;
if 0 <= RowPercent < 10
then FreqMask = .b;
else FreqMask = Frequency;
endcomp;
compute RowPMask;
if 0 <= RowPercent < 10
then RowPMask = .b;
else RowPMask = RowPercent;
endcomp;
compute ColPMask;
if 0 <= RowPercent < 10
then ColPMask = .b;
else ColPMask = ColPercent;
endcomp;
run;
ods html close;
If you have to produce lots of cross listings for different data sets, the code is easily macro-ized.
When I've done this in the past, I've first generated the frequency to a dataset, then filtered out the N, then re-printed the dataset (using tabulate usually).
If you can't recreate the frequency table perfectly from the freq output, you can do a simple frequency, check which IDs or variables or what have you to exclude, and then filter them out from the input dataset and rerun the whole frequency.
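A rough sketch of that approach follows. It uses sashelp.cars as stand-in data and a threshold of 10, and the dataset names freq_out and freq_masked are made up for the example. Because the counts and percents are computed before masking, the remaining cells keep the values they had in the full table.

```sas
/* Step 1: write the cell counts and percents to a dataset instead of printing */
proc freq data=sashelp.cars noprint;
  tables origin*type / out=freq_out;   /* creates COUNT and PERCENT */
run;

/* Step 2: blank out the small cells */
data freq_masked;
  set freq_out;
  if count < 10 then call missing(count, percent);
run;

/* Step 3: print the masked table (PROC TABULATE or PROC REPORT would also work) */
proc print data=freq_masked noobs;
run;
```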
I don't believe that you can with PROC FREQ, but you can easily replicate your code with PROC TABULATE and you can use a custom format there to mask the numbers. This example sets it to M for missing and N for less than 5 and with one decimal place for the rest of the values. You could also replace the M/N with a space (single space) to have no values shown instead.
*Create a format to mask values less than 5;
proc format;
value mask_fmt
. = 'M' /*missing*/
low - < 5='N' /*less than 5 */
other = [8.1]; /*remaining values with one decimal place*/
run;
*sort data for demo;
proc sort data=sashelp.cars out=cars;
by origin;
run;
ods tagsets.excelxp file='/folders/myfolders/demo.xml';
*values partially masked;
proc tabulate data=cars;
where origin='Asia';
by origin;
class make cylinders;
table make, cylinders*n*f=mask_fmt. ;
run;
ods tagsets.excelxp close;
This was tested on SAS UE.
EDIT: Forgot the percentage piece, so this likely will not work for that, primarily because I don't think you'll get the percentages the same as in PROC FREQ (appearance) so it depends on how important that is to you. The other possibility to accomplish this would be to modify the PROC FREQ template to use the custom format as above. Unfortunately I do not have time to mock this up for you but maybe someone else can. I'll leave this here to help get you started and delete it later on.

proc transpose with duplicate ID values

I need help with proc transpose procedure in SAS. My code initially was:
proc transpose data=temp out=temp1;
by patid;
var text;
Id datanumber;
run;
This gave me error "The ID value " " occurs twice in the same BY group". I modified the code to this:
proc sort data = temp;
by patid text datanumber;
run;
data temp;
set temp by patid text datanumber;
if first.datanunmber then n = 0;
n+1;
run;
proc sort data = temp;
by patid text datanumber n;
run;
proc transpose out=temp1 (drop=n) let;
by patid;
var text;
id datanumber;
run;
This gives me the error "variable n is not recognized". Adding the let option produces many "occurs twice in the same BY group" messages. I want to keep all ID values.
Please help me with this.
Data Example:
Patid Text
When you get that error it is telling you that you have multiple data points for one or more variables that you are trying to create. SAS can force the transpose and delete the extra datapoints if you add "let" to the proc transpose line.
Your data is possibly not unique? I created a dataset (with unique values of patid and datanumber) and the transpose works:
data temp (drop=x y);
do x=1 to 4;
PATID='PATID'||left(x);
do y=1 to 3;
DATANUMBER='DATA'||left(y);
TEXT='TEXT'||left(x*y);
output;
end;
end;
proc sort; by _all_;
proc transpose out=temp2 (drop=_name_);
by patid;
var text;
id datanumber;
run;
My recommendation would be to forget the 'n' fix and focus on making the data unique for patid and datanumber. A dirty approach would be:
proc sort data = temp nodupkey;
by patid datanumber;
run;
at the start of your code.
Try sorting your dataset by patid text n datanumber (n before datanumber).
Try sorting your dataset by patid n datanumber (n before datanumber), and use "by patid n" in proc transpose.
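One way to read that last suggestion is sketched below, assuming the column names from the question (patid, datanumber, text); the intermediate dataset name temp_n is made up. Each repeat of a datanumber within a patid gets its own counter n, and transposing by patid n then yields one output row per repeat, so no ID value is dropped.

```sas
/* Number the repeats of each datanumber within a patid */
proc sort data=temp;
  by patid datanumber;
run;

data temp_n;
  set temp;
  by patid datanumber;
  if first.datanumber then n = 0;
  n + 1;
run;

/* n must come before datanumber so it can be a BY variable */
proc sort data=temp_n;
  by patid n datanumber;
run;

proc transpose data=temp_n out=temp1(drop=_name_);
  by patid n;
  var text;
  id datanumber;
run;
```

Within each patid/n group every datanumber now occurs at most once, which is what the ID statement requires.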