Problem:
I have a dataset with hundreds of variables (columns) and I want to standardize all numeric variables. But instead of centering and dividing by just one standard deviation, I need to center and divide all variables by two standard deviations.
This is an example of the dataset I have
data have;
INPUT year $1-4 program_id $6-8 program_name $10-31 enrollments 33-36 admissions 38-41 graduates 43-46;
datalines;
2010 002 Electrical Engineering 1563 0321 0156
2010 001 Civil Engineering 2356 0739 0236
2010 003 Mechanical Engineering 0982 0234 0069
2010 021 English 3945 1034 0269
2010 031 Physics 0459 0134 0069
2010 041 Arts 0234 0072 0045
2019 004 Engineering 4745 1202 0597
2019 022 English Teaching 2788 0887 0201
2019 023 English and Spanish 0751 0345 0092
2019 031 Physics 0589 0126 0039
2019 032 Astronomy 0093 0035 0021
2019 041 Arts 0359 0097 0062
2019 044 Cinema 0293 0100 0039
;
run;
I want two different datasets. In the first, standardization applies for all variables across the whole dataset.
proc sql;
create table want1 as
select *,
(enrollments - mean(enrollments))/(2*STD(enrollments)) as z_enrollments,
(admissions - mean(admissions))/(2*STD(admissions)) as z_admissions,
(graduates - mean(graduates))/(2*STD(graduates)) as z_graduates
from have;
quit;
In the second, standardization is grouped by year:
proc sql;
create table want2 as
select *,
(enrollments - mean(enrollments))/(2*STD(enrollments)) as z_enrollments,
(admissions - mean(admissions))/(2*STD(admissions)) as z_admissions,
(graduates - mean(graduates))/(2*STD(graduates)) as z_graduates
from have
group by year;
quit;
Question: How to do this for all the hundreds of numeric variables of my dataset, without needing to write down the name of each one of them?
What I tried:
As I want this code to be replicable to different datasets, I was trying to follow the reasoning of this other question. That is, first identify all numeric variables, then save all variable names into an array, and then do the computations. I thought that perhaps I also need to save the resulting parameters of each column (mean and std) in an array as well. But I still have not figured out how to make arrays, data steps and loops work together.
I started trying to set an array for calculating the number of numerical variables. This runs fine.
data _null_;
set have;
array x[*] _numeric_;
call symput("nVar",dim(x));
stop;
run;
%put Number Variables = &nVar;
Then I tried to adapt the following code - which is a combination of #DomPazz answer with #Tom suggestion in the comments - but it did not work:
data want;
set have nobs=nobs;
array x[&nVar] _numeric_;
array N[&nVar];
n(1)=x(1); do i=2 to dim(n); n(i)=(x(i) - mean(x(i))/(2*(STD(x(i)); end;
keep N:;
run;
I don't know if the above code would get the right result. But I get an error saying that I have the incorrect number of arguments for the STD function. I looked it up: apparently, STD() in datastep runs row-wise, not column-wise.
I also tried PROC STANDARD, I get some results, but they don't match with my calculations. Probably I did not set the parameters right:
proc standard data=have mean=0 std=2
out=want;
run;
You can use METHOD=STD on PROC STDIZE to standardize around the mean and one STD.
So just add the MULT= option to divide by 2.
proc stdize data=have method=STD mult=0.5 out=want;
run;
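The question also asks for a version grouped by year. A minimal sketch, assuming the HAVE dataset from the question: PROC STDIZE accepts a BY statement, so the same options can be applied within each year after sorting. The dataset names have_sorted, want2 and want1_alt are only illustrative. Note that, unlike the PROC SQL versions, this replaces the original values instead of adding z_ columns. The last step shows what I believe was wrong with the PROC STANDARD attempt: STD= sets the target standard deviation, so dividing by two standard deviations corresponds to STD=0.5, not STD=2.

```sas
/* BY-group processing needs the data sorted by the grouping variable */
proc sort data=have out=have_sorted;
  by year;
run;

/* Center on the mean and divide by 2*STD, separately within each year */
proc stdize data=have_sorted method=std mult=0.5 out=want2;
  by year;
run;

/* Whole-dataset alternative with PROC STANDARD:                 */
/* a target standard deviation of 0.5 equals (x - mean)/(2*std)  */
proc standard data=have mean=0 std=0.5 out=want1_alt;
run;
```

With no VAR statement, both procedures standardize every numeric variable, so no variable names need to be listed.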
Answering last comment:
#Tom I was reading the proc stdize documentation, but I could not figure out if I can customize the LOCATION and SCALE measures. For example, if instead of dividing by 2 std, I want to subtract the mean and divide by the range for all variables. Would it be possible?
Quick solution:
* Output Mean;
proc stdize data=have method=mean out=out1 outstat=mean1;
var _numeric_;
run;
* Output Range;
proc stdize data=have method=range out=out1 outstat=range1;
var _numeric_;
run;
* LOCATION and SCALE;
data scale_location;
set mean1 (where=(_type_='LOCATION')) range1 (where=(_type_='SCALE'));
run;
* Target;
proc stdize data=have method=in(scale_location) out=want;
var _numeric_;
run;
For an if-query I would like to create a macro variable giving the respective frequency of the underlying time series. I tried to get some descriptive statistics from proc timeseries. However, they unfortunately do not include the figure for the frequency.
The underlying time series does not necessarily include all periods of the frequency. That rules out a simple count via proc sql from my point of view.
Does anyone know an efficient procedure to determine the frequency without computing the frequency on my own (in a data step or a proc sql code)?
You can use the OUTSPECTRA= option to help learn what kind of seasonality it has. Based on the data, give PROC TIMESERIES your best guess of day, month, etc. In the example below, we know we want to forecast by month but we do not know what seasonality it has.
proc timeseries data=sashelp.air outspectra=spectra;
id date interval=month;
var air;
run;
Plot this spectra dataset in proc sgplot and you'll see something that looks like this:
proc sgplot data=spectra;
where NOT missing(period);
series x=period y=p;
run;
This line will naturally increase over time, but we're looking for bumps in the line. Notice the large bump somewhere between 0 and 24 months and the several smaller bumps before it. Let's zoom in on that by filtering out the longer periods.
proc sgplot data=spectra;
where period < 24 and NOT missing(period);
series x=period y=p;
run;
It's pretty clear that there is a strong seasonality of 12, with potentially smaller cycles at 3 and 6 months. From this spectra plot, we can conclude that our seasonality should be 12.
You can turn this into a macro to help identify the season if you'd like. Simply search for the largest bump within a reasonable timeframe. In our case we'll choose 36 because we do not suspect that we have any seasonality > 36 months.
proc sort data=spectra;
by period;
run;
data identify_period;
set spectra;
by period;
where NOT missing(period) AND period LE 36;
delta = abs(p - lag(p) );
run;
proc sql;
select period, max(delta) as max_delta
from identify_period
having delta = max(delta)
;
quit;
Output:
PERIOD max_delta
12 163712
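Since the original question asks for a macro variable, the same query can store its result with an INTO clause. This is a minimal sketch that assumes the identify_period dataset built above; the macro variable name season is only illustrative. If several periods tie for the largest delta, only the first row is stored.

```sas
proc sql noprint;
  /* Keep the period with the largest jump in the spectrum */
  select period into :season trimmed
  from identify_period
  having delta = max(delta);
quit;

%put &=season;
```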
I don't know how to do this without data step logic, but you could wrap the data step in a macro as follows:
%macro get_frequency(data,date_variable,output_variable);
proc sort data=&data (keep=&date_variable) out=__tempsorted;
by &date_variable;
run;
data _null_;
set __tempsorted end=lastobs;
prevdate=lag(&date_variable);
if _n_ > 1 then do;
interval_number+1;
interval_total + (&date_variable - prevdate);
end;
if lastobs then do;
average_interval = interval_total/interval_number;
frequency = round(365.25/average_interval);
call symputx("&output_variable", frequency, 'G'); /* 'G' makes the macro variable global so it survives after the macro ends */
end;
run;
proc datasets nolist;
delete __tempsorted;
run;
quit;
%mend get_frequency;
Then you can call the macro on your original data set timeseries to examine the variable date and create a new macro variable frequency1 with the required frequency.
data work.timeseries;
input date date. value;
format date date9.;
datalines;
01Oct18 3000
01Nov18 4000
01Dec18 6500
01Jan19 7000
01Feb19 4000
01Mar19 5000
01Apr19 7500
01May19 4800
01Jun19 4500
;
run;
%get_frequency(timeseries,date,frequency1)
%put &=frequency1;
This seems to work on your sample data where each date is the first of the month. If your dates are evenly distributed (e.g. always near month start/end, or always near mid-month etc.) then this macro should work ok. Obviously if you have multiple observations per date then it will give the completely incorrect frequency.
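One way to guard against multiple observations per date is to collapse the data to unique dates before calling the macro. A minimal sketch, assuming the timeseries dataset and the %get_frequency macro shown above; the dataset name unique_dates is only illustrative.

```sas
/* Keep one row per date so repeated dates do not shrink the average interval */
proc sort data=timeseries(keep=date) out=unique_dates nodupkey;
  by date;
run;

%get_frequency(unique_dates,date,frequency1)
```

This does not help with unevenly spaced dates, only with exact duplicates.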
I have a dataset with visitors and weather variables. I'm trying to forecast visitors based on the weather variables. Since the dataset only contains visitors in season, there are missing values and gaps for every year. Running proc reg in SAS works fine, but the issue comes when I'm using proc VARMAX: I cannot run the regression due to the missing values. How can I tackle this?
proc varmax data=tivoli4 printall plots=forecast(all);
id obs interval=day;
model lvisitors = rain sunshine averagetemp
dfebruary dmarch dmay djune djuly daugust doctober dnovember ddecember
dwednesday dthursday dfriday dsaturday dsunday
d_24Dec2016 d_05Dec2013 d_24Dec2017 d_24Dec2014 d_24Dec2015 d_24Dec2019
d_24Dec2018 d_24Sep2012 d_06Jul2015
d_08feb2019 d_16oct2014 d_15oct2019 d_20oct2016 d_15oct2015 d_22sep2017 d_08jul2015
d_20Sep2019 d_08jul2016 d_16oct2013 d_01aug2012 d_18oct2012 d_23dec2012 d_30nov2013 d_20sep2014 d_17oct2012 d_17jun2014
dFrock2012 dFrock2013 dFrock2014 dFrock2015 dFrock2016 dFrock2017 dFrock2018 dFrock2019
dYear2015 dYear2016 dYear2017
/p=7 q=2 Method=ml dftest;
garch p=1 q=1 form=ccc OUTHT=CONDITIONAL;
restrict
ar(3,1,1)=0, ar(4,1,1)=0, ar(5,1,1)=0,
XL(0,1,13)=0, XL(0,1,14)=0, XL(0,1,13)=0, XL(0,1,27)=0, XL(0,1,38)=0, XL(0,1,42)=0;
output lead=10 out=forecast;
run;
As with any forecast, you will first need to prepare your time series. You should first run your data through PROC TIMESERIES to fill in or impute missing values. The most appropriate imputation choice depends on your variables. The code below will:
Sum lvisitors by day and set missing values to 0
Set missing values of averagetemp to average
Set missing values of rain, sunshine, and your variables starting with d to 0 (assuming these are indicators)
Code:
proc timeseries data=have out=want;
id obs interval = day
setmissing = 0
notsorted
;
var lvisitors / accumulate=total;
crossvar averagetemp / accumulate=none setmissing=average;
crossvar rain sunshine d: / accumulate=none;
run;
Important Time Interval Consideration
Depending on your data, this could bias your error rate and estimates since you always know no one will be around in the off-season. If you have many missing values for off-season data, you will want to remove those rows.
Since PROC VARMAX does not support custom time intervals, you can instead create a simple time identifier. Alternatively, you can turn this into a format with proc format and convert time_id back at the end.
data want;
set have;
time_id+1;
run;
proc varmax data=want;
id time_id interval=day;
...
output lead=10 out=myforecast;
run;
data myforecast;
merge myforecast
want(keep=time_id date)
;
by time_id;
run;
Or, if you made a format:
data myforecast;
set myforecast;
date = put(time_id, timeid.);
drop time_id;
run;
I want to count the number of unique items in a variable (call it "categories") then use that count to set the number of iterations in a SAS macro (i.e., I'd rather not hard code the number of iterations).
I can get a count like this:
proc sql;
select count(*)
from (select DISTINCT categories from myData);
quit;
I can run a macro like this:
%macro superFreq;
%do i=1 %to &iterationVariable;
proc freq data=myData;
table var&i / out=var&i.freq;
run;
%end;
%mend superFreq;
%superFreq
I want to know how to get the count into the iteration variable so that the macro iterates as many times as there are unique values in the variable "categories".
Sorry if this is confusing. Happy to clarify if need be. Thanks in advance.
You can achieve this by using the into clause in proc sql:
proc sql noprint;
select max(age),
max(height),
max(weight)
into :max_age,
:max_height,
:max_weight
from sashelp.class;
quit;
%put &=max_age &=max_height &=max_weight;
Result:
MAX_AGE= 16 MAX_HEIGHT= 72 MAX_WEIGHT= 150
You can also select a list of results into a macro variable by combining the into clause with the separated by clause:
proc sql noprint;
select name into :list_of_names separated by ' ' from sashelp.class;
quit;
%put &=list_of_names;
Result:
LIST_OF_NAMES=Alfred Alice Barbara Carol Henry James Jane Janet Jeffrey John Joyce Judy Louise Mary Philip Robert Ronald Thomas William
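Applied to the question, the same INTO clause can store the number of distinct categories in the macro variable that drives the loop. A minimal sketch, assuming the dataset myData, the variable categories and the macro variable name iterationVariable from the question:

```sas
proc sql noprint;
  /* Count the unique values and store the count without leading blanks */
  select count(distinct categories) into :iterationVariable trimmed
  from myData;
quit;

%put &=iterationVariable;
```

Any macro that loops with %do i=1 %to &iterationVariable; will then iterate once per distinct category.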
Each ID has several instances, and each instance has a different value. I would like the final output to be the maximum value per ID. So the initial dataset is:
ID Value
1 100
1 7
1 65
2 12
2 97
3 82
3 54
And the output will be:
ID Value
1 100
2 97
3 82
I tried running proc sort twice thinking that the first sort would get things in the proper order so that nodupkey on the second sort would get rid of the right values. This did not work.
proc sort work.data; by id value descending; run;
proc sort work.data nodupkey; by id; run;
Thanks!
Your approach should have worked fine but it looks like you have a syntax error - did you forget to check your log? The descending keyword needs to go before the variable you want to sort in descending order.
proc sort data=sashelp.class out=tmp;
by sex descending height;
run;
proc sort data=tmp out=final nodupkey;
by sex;
run;
Also - in case you're not familiar with SQL, I strongly suggest that you should learn it as it will simplify many data manipulation tasks. This can also be solved in a single SQL step:
proc sql noprint;
create table want as
select sex,
max(height) as height
from sashelp.class
group by sex
;
quit;
My preferred solution:
proc means data=have noprint nway; /* nway drops the overall (_TYPE_=0) summary row */
class id;
var value;
output out=want max(value)=;
run;
Should be a lot faster than two sorts.
I have the following sample data and 'proc means' command.
data have;
input measure country :$12.; /* :$12. avoids truncating Netherlands to 8 characters */
datalines;
250 UK
800 Ireland
500 Finland
250 Slovakia
3888 Slovenia
34 Portugal
44 Netherlands
4666 Austria
;
run;
PROC PRINT data=have; RUN;
The following PROC MEANS command prints out a listing for each country above. How can I group some of those countries (e.g. UK and Ireland together, Slovakia and Slovenia as Central Europe) in the PROC MEANS step, rather than adding another data step with a 'case when' etc.?
proc means data=have sum maxdec=2 order=freq STACKODS;
var measure;
class country;
run;
Thanks for any help at all on this. I understand there are various things you can do in the PROC MEANS command itself, like limiting the number of countries by doing this:
proc means data=have(where=(country not in ('Finland', 'UK')));
I'd like to do the grouping in the PROC MEANS command for brevity.
Thanks.
This is very easy with a format for any PROC that takes a CLASS statement.
Simply build a format, either with code or from data; then apply the format with a FORMAT statement in the PROC MEANS step.
proc format lib=work;
value $countrygroup
"UK"="British Isles"
"Ireland"="British Isles"
"Slovakia","Slovenia"="Central Europe"
;
run;
proc means data=have;
class country;
var measure;
format country $countrygroup.;
run;
It's usually better to have numeric codes for country and then format those to be whichever set of names is needed at any one time, particularly as capitalization/etc. is pretty irritating, but this works well enough even here.
The CNTLIN= option in PROC FORMAT allows you to make a format from a dataset, with FMTNAME holding the format name, START the value to be labelled, and LABEL the label. (END holds the end of the range for numeric ranges.) There are other options as well; the documentation goes into more detail.
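As a rough sketch of the CNTLIN= route, the same grouping can be built from a dataset. The dataset name countrygroups and the variable lengths are assumptions for illustration; TYPE='C' marks the format as a character format. The label is read with the & modifier, so each data line needs at least two blanks between the country and its label.

```sas
data countrygroups;
  length fmtname $32 type $1 start $12 label $20;
  retain fmtname 'countrygroup' type 'C';
  /* & lets the label contain single embedded blanks */
  input start :$12. label & $20.;
  datalines;
UK  British Isles
Ireland  British Isles
Slovakia  Central Europe
Slovenia  Central Europe
;
run;

/* Build the $countrygroup format from the control dataset */
proc format cntlin=countrygroups;
run;

proc means data=have sum;
  class country;
  var measure;
  format country $countrygroup.;
run;
```

Countries without an entry in the control dataset keep their own names as class levels.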