Proc reg: checking the best three variables in parallel - sas

I am working on a macro for regressions using the following code:
%Macro Regression;
%let index = 1;
%do %until (%Scan(&Var2,&index," ")=);
%let Ind = %Scan(&Var2,&index," ");
ods output SelectionSummary = SelectionSummary;
proc reg data = Regression2 plots = none;
model &Ind = &var / selection = stepwise maxstep=1;
output out = summary R = RSQUARE;
run;
quit;
%if &index = 1 %then %do;
data final;
set selectionsummary;
run;
%end;
%else %do;
data final;
set final selectionsummary;
run;
%end;
%let index = %eval(&Index + 1);
%end;
%mend;
%Regression;
This code works and gives me a table that highlights, for each dependent variable, the independent variable that explains the most variation in it.
I'm looking for a way to run this so that the regression gives me the three best independent variables for each dependent variable, each one ranked as if it were the first variable entered. For example:
models chosen:
GDP = Human Capital
GDP = Working Capital
GDP = Growth
DependentVar  Ind1           Ind2             Ind3    Rsq1  Rsq2  Rsq3
GDP           human capital  working capital  growth  0.76  0.75  0.69
or
DependentVar  Independent1     Rsq
GDP           human capital    0.76
GDP           working capital  0.75
GDP           growth           0.69
EDIT:
It would be an absolute bonus if there is a way to put stepwise maxstep = 3 and have the best three independent variable combinations for each dependent variable with the condition that the first independent variable is unique.
TIA.

Try the STOP=3 option on your MODEL statement. It fits the best models with up to three variables. It does not work with SELECTION=STEPWISE, but it does work with the R-square-based methods such as SELECTION=MAXR.
model &Ind = &var / selection = maxR stop=3;
If you only want to consider three-variable models, include START=3 as well.
model &Ind = &var / selection = maxR stop=3 start=3;
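For the first layout in the question (the three best single-variable models), one possible sketch is SELECTION=RSQUARE with START=1, STOP=1 and BEST=3, which ranks the one-variable models by R-square and keeps the top three. The sashelp.cars table and its variables are only stand-ins for the asker's data, the output name top3 is hypothetical, and SubsetSelSummary is, as far as I know, the ODS table that PROC REG produces for this selection method.

```sas
/* Rank every one-variable model by R-square and keep the top three.  */
/* sashelp.cars is an assumed stand-in for the Regression2 data set.   */
ods output SubsetSelSummary = top3;
proc reg data = sashelp.cars plots = none;
  model MSRP = Horsepower Weight EngineSize Length Wheelbase
        / selection = rsquare start = 1 stop = 1 best = 3;
run;
quit;
```

Inside the macro from the question, the same options would go on the existing statement, i.e. model &Ind = &var / selection = rsquare start=1 stop=1 best=3; with the ODS OUTPUT statement pointing at SubsetSelSummary instead of SelectionSummary.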

Related

Proc hpbin with minimum proportion per bin

I am using Proc HPBIN to split my data into equally-spaced buckets i.e. each bucket has an equal proportion of the total range of the variable.
My issue is when I have extremely skewed data with a large range. Almost all of my data points lie in one bucket, while a couple of observations are scattered around the extremes.
I'm wondering if there is a way to force PROC HPBIN to consider the proportion of values in each bin and make sure there is at least e.g. 5% of observations in a bin and to group others?
DATA var1;
DO VAR1 = 1 TO 100;
OUTPUT;
END;
DO VAR1 = 500 TO 505;
OUTPUT;
END;
DO VAR1 = 7000 TO 7015;
OUTPUT;
END;
DO VAR1 = 1000000 TO 1000010;
OUTPUT;
END;
RUN;
/*Use proc hpbin to generate bins of equal width*/
ODS EXCLUDE ALL;
ODS OUTPUT
Mapping = bin_width_results;
PROC HPBIN
DATA=var1
numbin = 15
bucket;
input VAR1 / numbin = 15;
RUN;
ODS EXCLUDE NONE;
I'd like to see a way for PROC HPBIN, or some other method, to group together the bins that are empty and guarantee at least 5% of the observations per bucket. However, I am not looking to use percentiles in this case (that is another plot in my PDF) because I'd still like to see the spread.
Have you tried using the WINSOR method (winsorised binning)? From the documentation:
Winsorized binning is similar to bucket binning except that both tails are cut off to obtain a smooth binning result. This technique is often used to remove outliers during the data preparation stage.
You can specify the WINSORRATE to impact how it adjusts these tails.
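A minimal sketch of that suggestion, reusing the var1 data set from the question. The rate of 0.05 is an arbitrary illustrative value and the output name bin_winsor_results is hypothetical; my understanding is that WINSORRATE= must be given whenever WINSOR is requested.

```sas
/* Winsorized binning: the tails are cut off before the bins are formed. */
/* winsorrate=0.05 is an assumed value; tune it to the data.              */
ods output Mapping = bin_winsor_results;
proc hpbin data=var1 numbin=15 winsor winsorrate=0.05;
  input var1;
run;
```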
The QUANTILE option with 20 bins should give you about 5% per bin:
PROC HPBIN DATA=var1 quantile;
input VAR1 / numbin = 20;
RUN;
When the values of a bin need to be dynamically rebinned because the bin holds too high a proportion of the data (a problem bin), you need to run HPBIN on only the values in the problem bins. A macro can be written to loop around the HPBIN process, zooming in on problem areas.
For example:
DATA have;
DO VAR1 = 1 TO 100;
OUTPUT;
END;
DO VAR1 = 500 TO 505;
OUTPUT;
END;
DO VAR1 = 7000 TO 7015;
OUTPUT;
END;
DO VAR1 = 1000000 TO 1000010;
OUTPUT;
END;
RUN;
%macro bin_zoomer (data=, var=, nbins=, rezoom=0.25, zoomlimit=8, out=);
%local data_view step nextstep outbins zoomers;
proc sql;
create view data_zoom1 as
select 1 as step, &var from &data;
quit;
%let step = 1;
%let data_view = data_zoom&step;
%let outbins = bins_step&step;
%bin:
%if &step > &zoomlimit %then %goto done;
ODS EXCLUDE ALL;
ODS OUTPUT Mapping = &outbins;
PROC HPBIN DATA=&data_view bucket ;
id step;
input &var / numbin = &nbins;
RUN;
ODS EXCLUDE NONE;
proc sql noprint;
select count(*) into :zoomers trimmed
from &outbins
where proportion >= &rezoom
;
%put NOTE: &=zoomers;
%if &zoomers = 0 %then %goto done;
%let step = %eval(&step+1);
proc sql;
create view data_zoom&step as
select &step as step, *
from &data_view data
join &outbins bins
on data.&var between bins.LB and bins.UB
and bins.proportion >= &rezoom
;
quit;
%let outbins = bins_step&step;
%let data_view = data_zoom&step;
%goto bin;
%done:
%put NOTE: done # &=step;
* stack the bins that are non-problem or of final zoom;
* the LB to UB domains from step2+ will discretely cover the bounds
* of the original step1 bins;
data &out;
set
bins_step1-bins_step&step
indsname = source
;
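* INDSNAME returns the name as LIBREF.MEMBER in upper case, so compare the member part only;
if proportion < &rezoom or upcase(scan(source, -1, '.')) = upcase("bins_step&step");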
step = source;
run;
%mend;
options mprint;
%bin_zoomer(data=have, var=var1, nbins=15, out=bins);

Splitting a SAS dataset into multiple datasets, according to value of one variable

Is there a more streamlined way of accomplishing this? This is a simplified example. In the real case there are > 10 values of var, each of which need their own dataset.
data
new1
new2
new3;
set old;
if var = 'new1' then output new1;
else if var = 'new2' then output new2;
else if var = 'new3' then output new3;
run;
This should work. You just need to change the upper bound in %to 5 to the largest suffix you have (for example 10). The point made by @Reeza is a good one, and I would also take a look at that post. Splitting a dataset like this is usually not a good way to handle data, but this should get you there.
data have;
input var $;
datalines;
new1
new2
new3
new4
new5
;
run;
*Actual code starts here;
%macro splitting;
%do i=1 %to 5;
%put "new&i";
proc sql;
create table table&i as
select *
from have
where var = "new&i";
quit;
%end;
%mend splitting;
%splitting;
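If the values do not follow a numbered pattern, a data-driven variant is possible. The sketch below reads the distinct values of var into a macro variable and creates one dataset per value. The macro name split_by_value and the macro variable vals are hypothetical, and it assumes every value of var is a valid SAS dataset name with no embedded blanks.

```sas
/* Small stand-in for the real table */
data have;
  input var $;
  datalines;
new1
new2
new3
;
run;

/* Collect the distinct values; assumes each is a valid dataset name */
proc sql noprint;
  select distinct var into :vals separated by ' '
  from have;
quit;

%macro split_by_value;
  %local i val;
  %do i = 1 %to %sysfunc(countw(&vals, %str( )));
    %let val = %scan(&vals, &i, %str( ));
    /* One output dataset per value, named after the value itself */
    data &val;
      set have;
      where var = "&val";
    run;
  %end;
%mend split_by_value;
%split_by_value
```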

How to do a loop with different start and end data date in SAS?

I have panel data with a quarterly time series for each ID, and each ID has a different start and end date. I'd like to run a rolling regression on each ID with a window of 10 observations. I'm wondering how I can write a loop to do it. For instance, for each ID, I'd like to run a regression from the first obs from date 19880831- 19910228 and then from the second obs 19890228-19910531 and so on and so forth until the last date for this ID. Each ID, however, has a different start and end date. Thanks.
Consider the following macro with nested %do loops. The outer loop iterates through the panel IDs and the inner loop slides a 10-observation window forward one row at a time, filtering on the within-ID row number _N_. The estimates from each regression are saved in a one-line data set whose name carries the ID and the window's starting row. Be sure to change the loop limits (50 and 1000) to fit your data. This assumes the data are sorted by date within each ID.
%macro rollingregression;
%do i = 1 %to 50; /* ID loop */
%do j = 1 %to 1000; /* window-start loop */
data new_data;
set current_data;
/* WHERE filters rows as they are read, so _N_ counts rows within this ID only */
where ID = &i;
if &j <= _N_ <= %eval(&j + 9);
run;
/* underscores keep names unambiguous, e.g. results_1_11 vs results_11_1 */
proc reg data = new_data noprint outest = results_&i._&j;
model y = x1 x2 x3;
run;
quit;
%end;
%end;
%mend rollingregression;
%rollingregression;

How to force SAS to interpret macro vars as numbers in if condition

SAS is interpreting in_year_tolerance and abs_cost as text vars in the below if statement. Therefore the if statement is testing the alphabetical order of the vars rather than the numerical order. How do I get SAS to treat them as numbers? I have tried sticking the if condition, and the macro vars, in %sysevalf but that made no difference. in_year_tolerance is 10,000,000 and cost can vary, but in the test run starts around 20,000,000 before dropping to 9,000,000 at which point it should exit the loop but doesn't.
%macro set_downward_caps(year, in_year_tolerance, large, small, start, end, increment);
%do c = &start. %to &end. %by &increment.;
%let nominal_down_large_&year. = %sysevalf(&large. + (&c. / 1000));
%let nominal_down_small_&year. = %sysevalf(&small. + (&c. / 100));
%let real_down_large_&year. = %sysevalf((1 - &&nominal_down_large_&year.) * &&rpi&year.);
%let real_down_small_&year. = %sysevalf((1 - &&nominal_down_small_&year.) * &&rpi&year.);
%rates(&year.);
proc means data = output.s_&scenario. noprint nway;
var transbill&year.;
output out = temporary (drop = _type_ _freq_) sum=cost;
run;
data _null_;
set temporary;
call symputx('cost', put(cost,best32.));
run;
data temp;
length scenario $ 30;
scenario = "&scenario.";
large = &&real_down_large_&year.;
small = &&real_down_small_&year.;
cost = &cost.;
run;
data output.summary_of_caps;
set output.summary_of_caps temp;
run;
%let abs_cost = %sysevalf(%sysfunc(abs(&cost)));
%if &in_year_tolerance. > &abs_cost. %then %return;
%end;
%mend set_downward_caps;
Use %sysevalf(&in_year_tolerance > &abs_cost,boolean).
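As you have seen, the comparison is text-based. %IF already evaluates its condition with the integer-only rules of %EVAL, so when either value contains something other than an integer (for example a decimal point) the two sides are compared as character strings. %SYSEVALF evaluates the expression with floating-point arithmetic, so the values are compared as numbers.
The ,boolean option tells it to return 1 (true) or 0 (false).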
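A small sketch of the difference, using a hypothetical macro named compare and made-up values. The first %IF relies on the implicit integer-only evaluation, the second wraps the same comparison in %SYSEVALF.

```sas
%macro compare(a, b);
  /* Implicit evaluation: a decimal point forces a character comparison */
  %if &a > &b %then %put NOTE: implicit evaluation says &a > &b;
  %else %put NOTE: implicit evaluation says &a is not > &b;

  /* Floating-point evaluation that returns 1 or 0 */
  %if %sysevalf(&a > &b, boolean) %then %put NOTE: SYSEVALF says &a > &b;
  %else %put NOTE: SYSEVALF says &a is not > &b;
%mend compare;

%compare(10000000, 9000000.5)
```

In the macro from the question, the exit test would then read %if %sysevalf(&in_year_tolerance. > &abs_cost., boolean) %then %return;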

SAS Remove Outliers

I'm looking for a macro or something in SAS that can help me in isolating the outliers from a dataset. I define an outlier as: Upperbound: Q3+1.5(IQR) Lowerbound: Q1-1.5(IQR). I have the following SAS code:
title 'Fall 2015';
proc univariate data = fall2015 freq;
var enrollment_count;
histogram enrollment_count / vscale = percent vaxis = 0 to 50 by 5 midpoints = 0 to 300 by 5;
inset n mean std max min range / position = ne;
run;
I would like to get rid of the outliers from the fall2015 dataset. I found some macros, but had no luck getting them to work. Several need a class variable, which I don't have. Any ideas how to separate my data?
Here's a macro I wrote a while ago to do this, under slightly different rules. I've modified it to meet your criteria (1.5).
1. Use proc means to calculate Q1/Q3 and IQR (QRANGE).
2. Build a macro to cap values based on the rules.
3. Call the macro using call execute, with the boundaries set from the output of step 1.
*Calculate IQR and first/third quartiles;
proc means data=sashelp.class stackods n qrange p25 p75;
var weight height;
ods output summary=ranges;
run;
*create data with outliers to check;
data capped;
set sashelp.class;
if name='Alfred' then weight=220;
if name='Jane' then height=-30;
run;
*macro to cap outliers (missing values are left untouched);
%macro cap(dset=,var=, lower=, upper=);
data &dset;
set &dset;
if &var>&upper then &var=&upper;
else if not missing(&var) and &var<&lower then &var=&lower;
run;
%mend;
*create cutoffs and execute macro for each variable;
data cutoffs;
set ranges;
lower=p25-1.5*qrange;
upper=p75+1.5*qrange;
string = catt('%cap(dset=capped, var=', variable, ", lower=", lower, ", upper=", upper ,");");
call execute(string);
run;
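The macro above caps values at the bounds. Since the question asks to remove outliers, a variant that deletes the rows instead could look like the sketch below. The macro name drop_outliers and the dataset name trimmed are hypothetical, sashelp.class stands in for the fall2015 data, and keeping rows with missing values is an assumption about what is wanted.

```sas
/* Quartiles and IQR for the variable of interest */
proc means data=sashelp.class stackods qrange p25 p75;
  var weight;
  ods output summary=ranges;
run;

/* Work on a copy so the source table is untouched */
data trimmed;
  set sashelp.class;
run;

/* Hypothetical helper: delete rows outside [lower, upper], keep missing values */
%macro drop_outliers(dset=, var=, lower=, upper=);
  data &dset;
    set &dset;
    if missing(&var) or (&lower <= &var <= &upper);
  run;
%mend drop_outliers;

/* Build one macro call per variable; %nrstr delays it until this step ends */
data _null_;
  set ranges;
  lower = p25 - 1.5*qrange;
  upper = p75 + 1.5*qrange;
  call execute(cats('%nrstr(%drop_outliers(dset=trimmed, var=', variable,
                    ', lower=', lower, ', upper=', upper, '))'));
run;
```

For the data in the question, data=fall2015 and var enrollment_count would replace the sashelp.class names.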