Problem:
I have a dataset with hundreds of variables (columns) and I want to standardize all numeric variables. But instead of centering and dividing by just one standard deviation, I need to center and divide every variable by two standard deviations.
This is an example of the dataset I have:
data have;
INPUT year $1-4 program_id $6-8 program_name $10-31 enrollments 33-36 admissions 38-41 graduates 43-46;
datalines;
2010 002 Electrical Engineering 1563 0321 0156
2010 001 Civil Engineering 2356 0739 0236
2010 003 Mechanical Engineering 0982 0234 0069
2010 021 English 3945 1034 0269
2010 031 Physics 0459 0134 0069
2010 041 Arts 0234 0072 0045
2019 004 Engineering 4745 1202 0597
2019 022 English Teaching 2788 0887 0201
2019 023 English and Spanish 0751 0345 0092
2019 031 Physics 0589 0126 0039
2019 032 Astronomy 0093 0035 0021
2019 041 Arts 0359 0097 0062
2019 044 Cinema 0293 0100 0039
;
run;
I want two different datasets. In the first, standardization applies to all variables across the whole dataset.
proc sql;
create table want1 as
select *,
(enrollments - mean(enrollments))/(2*STD(enrollments)) as z_enrollments,
(admissions - mean(admissions))/(2*STD(admissions)) as z_admissions,
(graduates - mean(graduates))/(2*STD(graduates)) as z_graduates
from have;
quit;
In the second, standardization is grouped by year:
proc sql;
create table want2 as
select *,
(enrollments - mean(enrollments))/(2*STD(enrollments)) as z_enrollments,
(admissions - mean(admissions))/(2*STD(admissions)) as z_admissions,
(graduates - mean(graduates))/(2*STD(graduates)) as z_graduates
from have
group by year;
quit;
Question: How can I do this for all of the hundreds of numeric variables in my dataset, without having to write out each variable name?
What I tried:
As I want this code to be replicable across different datasets, I was trying to follow the reasoning of this other question. That is, first identify all numeric variables, then save all the variable names into an array, and then do the computations. I thought that perhaps I also need to save the resulting parameters of each column (mean and std) in an array as well. But I still have not figured out how to make arrays, data steps, and loops work together.
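(For illustration, here is a minimal sketch of that idea with the three sample variables written out - PROC MEANS stores each variable's mean and std once, and a data step applies them with parallel arrays; for hundreds of variables, the variable lists could be generated from DICTIONARY.COLUMNS. The z_ names and the AUTONAME suffixes are only illustrative.)
proc means data=have noprint;
var enrollments admissions graduates;
/* AUTONAME creates enrollments_Mean, enrollments_StdDev, ... */
output out=stats(drop=_type_ _freq_) mean= std= / autoname;
run;
data want;
if _n_ = 1 then set stats; /* load the stats once; values are retained */
set have;
array x[*] enrollments admissions graduates;
array m[*] enrollments_Mean admissions_Mean graduates_Mean;
array s[*] enrollments_StdDev admissions_StdDev graduates_StdDev;
array z[*] z_enrollments z_admissions z_graduates;
do i = 1 to dim(x);
z[i] = (x[i] - m[i]) / (2 * s[i]); /* center and divide by 2 std */
end;
drop i enrollments_Mean--graduates_StdDev;
run;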
I started by setting up an array to count the numeric variables. This runs fine.
data _null_;
set have;
array x[*] _numeric_;
call symput("nVar",dim(x));
stop;
run;
%put Number Variables = &nVar;
Then I tried to adapt the following code - a combination of @DomPazz's answer with @Tom's suggestion in the comments - but it did not work:
data want;
set have nobs=nobs;
array x[&nVar] _numeric_;
array N[&nVar];
n(1)=x(1); do i=2 to dim(n); n(i)=(x(i) - mean(x(i)))/(2*STD(x(i))); end;
keep N:;
run;
I don't know if the above code would get the right result. But I get an error saying that I have an incorrect number of arguments for the STD function. I looked it up: apparently, STD() in a data step runs row-wise (across its arguments within one observation), not column-wise.
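A quick illustration of that row-wise behavior:
/* STD() in a data step summarizes its arguments within one observation */
data _null_;
s = std(2, 4, 6); /* sample std of the values 2, 4, 6 */
put s=; /* prints s=2 */
run;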
I also tried PROC STANDARD; I get some results, but they don't match my calculations. Probably I did not set the parameters right:
proc standard data=have mean=0 std=2
out=want;
run;
You can use METHOD=STD on PROC STDIZE to standardize around the mean and one STD.
So just add MULT=0.5 to divide by the extra factor of 2.
proc stdize data=have method=STD mult=0.5 out=want;
run;
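For the grouped version (want2), PROC STDIZE also honors BY groups on sorted data - a sketch, assuming the HAVE dataset above (as an aside, PROC STANDARD's STD= option sets the resulting standard deviation, so MEAN=0 STD=0.5 would be the equivalent there):
/* want1: every numeric variable becomes (x - mean)/(2*std) */
proc stdize data=have method=STD mult=0.5 out=want1;
var _numeric_;
run;
/* want2: the same standardization within each year */
proc sort data=have;
by year;
run;
proc stdize data=have method=STD mult=0.5 out=want2;
by year;
var _numeric_;
run;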
Answering the last comment:
@Tom I was reading the PROC STDIZE documentation, but I could not figure out whether I can customize the LOCATION and SCALE measures. For example, what if, instead of dividing by 2 std, I want to subtract the mean and divide by the range for all variables? Would that be possible?
Quick solution:
* Output Mean;
proc stdize data=have method=mean out=out1 outstat=mean1;
var _numeric_;
run;
* Output Range;
proc stdize data=have method=range out=out1 outstat=range1;
var _numeric_;
run;
* LOCATION and SCALE;
data scale_location;
set mean1 (where=(_type_='LOCATION')) range1 (where=(_type_='SCALE'));
run;
* Target;
proc stdize data=have method=in(scale_location) out=want;
var _numeric_;
run;
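The same OUTSTAT pattern also covers the original goal of dividing by two standard deviations - a sketch (dataset names are illustrative): take the METHOD=STD statistics, double the SCALE row, and feed the result back in with METHOD=IN.
/* Get LOCATION (mean) and SCALE (std) rows from METHOD=STD */
proc stdize data=have method=std out=dummy outstat=stat1;
var _numeric_;
run;
/* Double the SCALE row so standardizing divides by 2*std */
data stat2;
set stat1 (where=(_type_ in ('LOCATION','SCALE')));
array v[*] _numeric_;
if _type_ = 'SCALE' then
do i = 1 to dim(v);
v[i] = 2 * v[i];
end;
drop i;
run;
proc stdize data=have method=in(stat2) out=want;
var _numeric_;
run;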
Hi, I have two tables with different column orders, and the column names are not capitalized the same way. How can I check whether the contents of these two tables are the same?
For example, I have two tables of students' grades
table A:
name     Math   English   History
------+-------+---------+---------
Tim        98        95        90
Helen     100        92        85
table B:
name    history   MATH   english
------+---------+------+---------
Tim         90      98        95
Helen       85     100        92
You may use either of the following two approaches to compare them, regardless of column order or capitalization.
/*1. Proc compare*/
proc sort data=A; by name; run;
proc sort data=B; by name; run;
proc compare base=A compare=B;
id name;
run;
/*2. Proc SQL*/
proc sql;
select Math, English, History from A
except corr /* or: union corr / intersect corr */
select MATH, english, history from B;
quit;
Use EXCEPT CORR (corresponding): it matches columns by name. If everything matches, you will get zero records.
data have1;
input Math English History;
datalines;
1 2 3
;
run;
data have2;
input English math History;
datalines;
2 1 3
;
run;
proc sql ;
select * from have1
except corr
select * from have2;
quit;
Edit 1:
If you want to check in which particular column the tables differ, you may have to transpose and compare, as shown in the example below.
data have1;
input name $ Math English psychology History;
datalines;
Tim 98 95 76 90
Helen 100 92 55 85
;
run;
data have2;
input name $ English Math psychology History;
datalines;
Tim 95 98 76 90
Helen 92 100 99 85
;
run;
proc sort data = have1 out =hav1;
by name;
run;
proc sort data = have2 out =hav2;
by name;
run;
proc transpose data =hav1 out=newhave1 (rename = (_name_= subject
col1=marks));
by name;
run;
proc transpose data =hav2 out=newhave2 (rename = (_name_= subject
col1=marks));
by name;
run;
proc sql;
create table want(drop=mark_dif) as
select
a.name as name
,a.subject as subject
,a.marks as have1_marks
,b.marks as have2_marks
,a.marks -b.marks as mark_dif
from newhave1 a inner join newhave2 b
on upcase(a.name) = upcase(b.name)
and upcase(a.subject) =upcase(b.subject)
where calculated mark_dif ne 0;
quit;
I have a SAS question. I have a large dataset containing unique ID's and a bunch of variables for each year in a time series. Some ID's are present throughout the entire timeseries, some new ID's are added and some old ID's are removed.
ID Year Var3 Var4
1 2015 500 200
1 2016 600 300
1 2017 800 100
2 2016 200 100
2 2017 100 204
3 2015 560 969
3 2016 456 768
4 2015 543 679
4 2017 765 534
As can be seen from the table above, ID 1 is present in all three years (2015-2017), ID 2 is present from 2016 onwards, ID 3 is removed in 2017, and ID 4 is present in 2015, removed in 2016, and then present again in 2017.
I would like to know which IDs are new and which are removed in any given year, whilst keeping all the data. E.g. a new table with indicators for which IDs are new and which are removed. Furthermore, it would be nice to get a frequency of how many IDs are added/removed in a given year and the sum of their "Var3" and "Var4". Do you have any suggestions on how to do that?
************* UPDATE ******************
Okay, so I tried the following program:
**** Addition to suggested code ****;
options validvarname=any;
proc sql noprint;
create table years as
select distinct year
from have;
create table ids as
select distinct id
from have;
create table all_id_years as
select a.id, b.year
from ids as a,
years as b
order by id, year;
create table indicators as
select coalesce(a.id,b.id) as id,
coalesce(a.year,b.year) as year,
coalesce(a.id/a.id,0) as indicator
from have as a
full join
all_id_years as b
on a.id = b.id
and a.year = b.year
order by id, year
;
quit;
Now this will provide me with a table that contains only the IDs that are new in 2017:
data new_in_17;
set indicators;
where ('2016'n=0) and ('2017'n=1);
run;
I can now merge this table to add var3 and var4:
data new17;
merge new_in_17(in=x1) have(in=x2);
by id;
if x1=x2;
run;
Now I can find the frequency of new IDs in 2017 and the sum of var3 and var4:
proc means data=new17 noprint;
var var3 var4;
where year in (2017);
output out=sum_var_freq_new sum(var3)=sum_var3 sum(var4)=sum_var4;
run;
This gives me the output I need. However, I would like the equivalent output for the IDs that are "gone" between 2016 and 2017, which can be obtained from:
data gone_in_17;
set indicators;
where ('2016'n=1) and ('2017'n=0);
run;
data gone17;
merge gone_in_17(in=x1) have(in=x2);
by id;
if x1=x2;
run;
proc means data=gone17 noprint;
var var3 var4;
where year in (2016);
output out=sum_var_freq_gone sum(var3)=sum_var3 sum(var4)=sum_var4;
run;
The end result should be a combination of the two tables "sum_var_freq_new" and "sum_var_freq_gone" into one table. Furthermore, I need this table for every year, so my current approach is very inefficient. Do you have any suggestions on how to achieve this efficiently?
Aside from a different sample, you didn't provide much extra information beyond your previous question to make clear what was lacking in the previous answer.
To build on the latter, though, you could use a macro do loop to dynamically account for the distinct year values present in your dataset.
data have;
infile datalines;
input ID year var3 var4;
datalines;
1 2015 500 200
1 2016 600 300
1 2017 800 100
2 2016 200 100
2 2017 100 204
3 2015 560 969
3 2016 456 768
4 2015 543 679
4 2017 765 534
;
run;
proc sql noprint;
select distinct year
into :year1-
from have
;
quit;
%macro doWant;
proc sql;
create table want as
select distinct ID
%let i=1;
%do %while(%symexist(year&i.));
,exists(select * from have b where year=&&year&i.. and a.id=b.id) as "&&year&i.."n
%let i=%eval(&i.+1);
%end;
from have a
;
quit;
%mend;
%doWant;
This will produce the following result:
ID 2015 2016 2017
-----------------
1 1 1 1
2 0 1 1
3 1 1 0
4 1 0 1
Here is a more efficient way of doing this that also gives you the summary values.
First, a little SQL magic: create the cross product of years and IDs, then join it to the table you have to create an indicator.
proc sql noprint;
/*All Years*/
create table years as
select distinct year
from have;
/*All IDS*/
create table ids as
select distinct id
from have;
/*All combinations of ID/year*/
create table all_id_years as
select a.id, b.year
from ids as a,
years as b
order by id, year;
/*Original data with rows added for missing years. Indicator=1 if it*/
/*existed prior, 0 if not.*/
create table indicators as
select coalesce(a.id,b.id) as id,
coalesce(a.year,b.year) as year,
coalesce(a.id/a.id,0) as indicator
from have as a
full join
all_id_years as b
on a.id = b.id
and a.year = b.year
order by id, year
;
quit;
Now transpose that.
proc transpose data=indicators out=indicators(drop=_name_);
by id;
id year;
var indicator;
run;
Create the sums. You could also add other summary statistics here if you wanted:
proc summary data=have;
by id;
var var3 var4;
output out=summary sum=;
run;
Merge the indicators and the summary values:
data want;
merge indicators summary(keep=id var3 var4);
by id;
run;
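If you also want the per-year counts and sums of newly added and removed IDs that the question asked for, one more pass over the long-format indicators table works. A sketch (assumptions: the long table is used before the transpose overwrites it, and years are consecutive, so year-1 is the previous year):
/* Flag each id/year as new or gone by comparing with the prior year */
data flags;
set indicators; /* long format: id, year, indicator */
by id;
prev_ind = lag(indicator);
if first.id then prev_ind = .; /* no prior year for the first record */
is_new = (prev_ind = 0 and indicator = 1);
is_gone = (prev_ind = 1 and indicator = 0);
run;
/* Count new/gone IDs per year and sum VAR3/VAR4: current-year values
for new IDs, prior-year values for gone IDs */
proc sql;
create table sum_new_gone as
select f.year,
sum(f.is_new) as n_new,
sum(f.is_gone) as n_gone,
sum(case when f.is_new = 1 then cur.var3 end) as var3_new,
sum(case when f.is_new = 1 then cur.var4 end) as var4_new,
sum(case when f.is_gone = 1 then prv.var3 end) as var3_gone,
sum(case when f.is_gone = 1 then prv.var4 end) as var4_gone
from flags as f
left join have as cur
on f.id = cur.id and f.year = cur.year
left join have as prv
on f.id = prv.id and f.year - 1 = prv.year
group by f.year;
quit;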
This question partially relates to this question.
My datafile can be found here. I use a sample period from 01 Jan 2008 to 31 Dec 2013. The datafile has no missing values.
The following code generates the rolling correlation matrix on each day from 01 Jan 2008 to 31 Dec 2013 using a rolling window of the previous 1 year worth of values. E.g., the correlation between AUT and BEL on 01 Jan 2008 is calculated using the series of values from 01 Jan 2007 to 01 Jan 2008, and likewise for all other pairs.
data work.rolling;
set mm.rolling;
run;
%macro rollingCorrelations(inputDataset=, refDate=);
/*first get a list of unique dates on or after the reference date*/
proc freq data = &inputDataset. noprint;
where date >="&refDate."d;
table date/out = dates(keep = date);
run;
/*for each date calculate what the window range is, here using a year's length*/
data dateRanges(drop = date);
set dates end = endOfFile
nobs= numDates;
format toDate fromDate date9.;
toDate=date;
fromDate = intnx('year', toDate, -1, 's');
call symputx(compress("toDate"!!_n_), put(toDate,date9.));
call symputx(compress("fromDate"!!_n_), put(fromDate, date9.) );
/*find how many times(numberOfWindows) we need to iterate through*/
if endOfFile then do;
call symputx("numberOfWindows", numDates);
end;
run;
%do i = 1 %to &numberOfWindows.;
/*create a temporary view which has the filtered data that is passed to PROC CORR*/
data windowedDataview / view = windowedDataview;
set &inputDataset.;
where date between "&&fromDate&i."d and "&&toDate&i."d;
drop date;
run;
/*the output dataset from each PROC CORR run will be named
correlations_<fromdate>_<todate>, e.g. correlations_01JAN2007_01JAN2008*/
proc corr data = windowedDataview
outp = correlations_&&fromDate&i.._&&toDate&i. (where=(_type_ = 'CORR'))
noprint;
run;
%end;
/*append all datasets into a single table*/
data all_correlations;
format from to date9.;
set correlations_:
indsname = datasetname
;
from = input(substr(datasetname,19,9),date9.);
to = input(substr(datasetname,29,9), date9.);
run;
%mend rollingCorrelations;
%rollingCorrelations(inputDataset=rolling, refDate=01JAN2008)
An excerpt of the output can be found here.
As can be seen, rows 2 to 53 present the correlation matrix for 1 Apr 2008. However, a problem arises in the correlation matrix for 1 Apr 2009: there are missing values for the correlation coefficients between ALPHA and its pairs. This is because, as the datafile shows, the values for ALPHA from 1 Apr 2008 to 1 Apr 2009 are all zero, causing a division by zero. The same happens with a few other series; for example, HSBC also has all values equal to 0 from 1 Apr 08 to 1 Apr 09.
To resolve this issue, I was wondering how the above code could be modified so that, whenever this situation occurs (i.e., all values are 0 between two dates), the correlation for the affected pairs is simply calculated over the WHOLE sample period. E.g., the correlation between ALPHA and AUT is missing on 1 Apr 09, so this correlation should be calculated using the values from 01 Jan 2008 to 31 Dec 2013, rather than the values from 1 Apr 08 to 1 Apr 09.
Once you have run the above macro and obtained your all_correlations dataset, you need to run another PROC CORR, this time using all of the data, i.e.:
/*first filter the data to be between "01JAN2008"d and "31DEC2013"d*/
data work.all_data_01JAN2008_31DEC2013;
set mm.rolling;
where date between "01JAN2008"d and "31DEC2013"d;
drop date ;
run;
Then pass the above dataset to PROC CORR:
proc corr data = work.all_data_01JAN2008_31DEC2013
outp = correlations_01JAN2008_31DEC2013
(where=(_type_ = 'CORR'))
noprint;
run;
data correlations_01JAN2008_31DEC2013;
length id 8;
set correlations_01JAN2008_31DEC2013;
/*add a column identifier to make sure the order of the correlation matrix is preserved when joined with other tables*/
id = _n_;
run;
You would get a dataset which is unique by the _name_ column.
Then you would have to join correlations_01JAN2008_31DEC2013 to all_correlations in such a way that if a value is missing in all_correlations then a corresponding value from correlations_01JAN2008_31DEC2013 is inserted in its place. For this we can use PROC SQL & the COALESCE function.
PROC SQL;
CREATE TABLE MISSING_VALUES_IMPUTED AS
SELECT
A.FROM
,A.TO
,b.id
,a._name_
,coalesce(a.AUT,b.AUT) as AUT
,coalesce(a.BEL,b.BEL) as BEL
,coalesce(a.DEN,b.DEN) as DEN
,coalesce(a.FRA,b.FRA) as FRA
,coalesce(a.GER,b.GER) as GER
,coalesce(a.GRE,b.GRE) as GRE
,coalesce(a.IRE,b.IRE) as IRE
,coalesce(a.ITA,b.ITA) as ITA
,coalesce(a.NOR,b.NOR) as NOR
,coalesce(a.POR,b.POR) as POR
,coalesce(a.SPA,b.SPA) as SPA
,coalesce(a.SWE,b.SWE) as SWE
,coalesce(a.NL,b.NL) as NL
,coalesce(a.ERS,b.ERS) as ERS
,coalesce(a.RZB,b.RZB) as RZB
,coalesce(a.DEX,b.DEX) as DEX
,coalesce(a.KBD,b.KBD) as KBD
,coalesce(a.DAB,b.DAB) as DAB
,coalesce(a.BNP,b.BNP) as BNP
,coalesce(a.CRDA,b.CRDA) as CRDA
,coalesce(a.KN,b.KN) as KN
,coalesce(a.SGE,b.SGE) as SGE
,coalesce(a.CBK,b.CBK) as CBK
,coalesce(a.DBK,b.DBK) as DBK
,coalesce(a.IKB,b.IKB) as IKB
,coalesce(a.ALPHA,b.ALPHA) as ALPHA
,coalesce(a.ALBK,b.ALBK) as ALBK
,coalesce(a.IPM,b.IPM) as IPM
,coalesce(a.BKIR,b.BKIR) as BKIR
,coalesce(a.BMPS,b.BMPS) as BMPS
,coalesce(a.PMI,b.PMI) as PMI
,coalesce(a.PLO,b.PLO) as PLO
,coalesce(a.BINS,b.BINS) as BINS
,coalesce(a.MB,b.MB) as MB
,coalesce(a.UC,b.UC) as UC
,coalesce(a.BCP,b.BCP) as BCP
,coalesce(a.BES,b.BES) as BES
,coalesce(a.BBV,b.BBV) as BBV
,coalesce(a.SCHSPS,b.SCHSPS) as SCHSPS
,coalesce(a.NDA,b.NDA) as NDA
,coalesce(a.SEA,b.SEA) as SEA
,coalesce(a.SVK,b.SVK) as SVK
,coalesce(a.SPAR,b.SPAR) as SPAR
,coalesce(a.CSGN,b.CSGN) as CSGN
,coalesce(a.UBSN,b.UBSN) as UBSN
,coalesce(a.ING,b.ING) as ING
,coalesce(a.SNS,b.SNS) as SNS
,coalesce(a.BARC,b.BARC) as BARC
,coalesce(a.HBOS,b.HBOS) as HBOS
,coalesce(a.HSBC,b.HSBC) as HSBC
,coalesce(a.LLOY,b.LLOY) as LLOY
,coalesce(a.STANBS,b.STANBS) as STANBS
from all_correlations as a
inner join correlations_01JAN2008_31DEC2013 as b
on a._name_ = b._name_
order by
A.FROM
,A.TO
,b.id
;
quit;
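Since the thread started from "hundreds of variables", note that the long COALESCE list does not have to be typed by hand - a sketch (the macro variable name is illustrative) that builds it from DICTIONARY.COLUMNS:
/* Generate 'coalesce(a.X,b.X) as X' for every numeric column except FROM/TO */
proc sql noprint;
select 'coalesce(a.' || strip(name) || ',b.' || strip(name) || ') as ' || strip(name)
into :coalesceList separated by ','
from dictionary.columns
where libname = 'WORK'
and memname = 'ALL_CORRELATIONS'
and type = 'num'
and upcase(name) not in ('FROM','TO');
quit;
proc sql;
create table missing_values_imputed as
select a.from, a.to, b.id, a._name_, &coalesceList.
from all_correlations as a
inner join correlations_01JAN2008_31DEC2013 as b
on a._name_ = b._name_
order by a.from, a.to, b.id;
quit;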
/*verify that no missing values are left. NMISS column should be 0 for all variables*/
proc means data = MISSING_VALUES_IMPUTED n nmiss;
run;
Following the question about outputting the trimmed mean from PROC UNIVARIATE to a dataset:
SAS: PROC UNIVARIATE: Output trimmed mean to dataset
I would like to output a trimmed mean from PROC UNIVARIATE by group. However, ODS OUTPUT does not seem to work with NOPRINT, and there are just too many group IDs for the printed output to be workable. Any way to sidestep this problem?
proc univariate data=Table1 idout trimmed=1 noprint;
var DaysBtwPay;
by id;
ods output trimmedmeans=trimMean2 (keep=id Mean StdMean);
run;
Doh! It appears you can use the ods _all_ close; statement to suppress the HTML output, instead of going through the trouble of writing your own data step routine.
%let trimmed = 1;
proc sort data=sashelp.shoes out=have;
by region;
run;
ods _all_ close;
PROC UNIVARIATE DATA=have trimmed=&trimmed. ;
VAR returns;
by region;
ods output TrimmedMeans=trimmedMeansUni ;
run;
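One caveat: ods _all_ close; leaves every destination closed for the rest of the session, so you may want to reopen your usual destination afterwards (HTML here is an assumption):
ods html; /* reopen the default destination after the proc has run */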
I can't think of any way around this issue other than writing your own data step to calculate the trimmed mean.
This can be done in 2 steps.
Step-1:
In this step we want to know how many observations there are in each BY group, along with the simple average of your measurement variable. In the next step, the simple average will be returned whenever calculating the trimmed mean is infeasible. For example: if you ask to trim the 5 most extreme observations from each tail but a BY group has only 7 observations, then PROC UNIVARIATE returns a missing value.
Notice the ORDER BY clause - this ordering is what is used to exclude the extreme observations.
proc sql;
create table inputForTmeans as
select
a.region /*your by-group var*/
,a.returns /*your measurement variable of interest*/
,b.count
,b.simpleAvgReturns
from sashelp.shoes as a
inner join (select region, count(*) as count, mean(returns) as simpleAvgReturns
from sashelp.shoes
group by region) as b
on a.region = b.region
order by
a.region
,a.returns;
quit;
Step-2:
%let trimmed = 1; /*no. of extreme obs to exclude from mean calculation*/
data trimmedMean;
set inputForTmeans;
row_count+1; /*counter variable to number each obs in a by-group*/
by region returns;
if first.region then do;
row_count=1;
returnsSum=.;
end;
if &trimmed.<row_count <=(count - &trimmed.) then returnsSum+returns;
/***************************************************************************/
if last.region then do;
trimmedMeanreturns = coalesce(returnsSum/(count - 2*&trimmed.), simpleAvgReturns) ;
N = row_count;
trimmedRowCount = 2*&trimmed.;
output;
end;
keep region trimmedMeanreturns N count trimmedRowCount ;
/***************************************************************************/
run;
Output:
%let trimmed = 1;
region DataStep ProcUnivariate
Africa 1183.962963 1183.963
Asia 662 662
Canada 3089.1428571 3089.143
Central America/Caribbean 3561.1333333 3561.133
Eastern Europe 2665.137931 2665.138
Middle East 6794.3181818 6794.318
Pacific 1538.5813953 1538.581
South America 1824.1153846 1824.115
United States 4462.4210526 4462.421
Western Europe 2538.05 2538.05
%let trimmed = 14;
region DataStep ProcUnivariate
Africa 897.92857143 897.9
Asia 778.21428571 .
Canada 1098.1111111 1098.1
Central America/Caribbean 2289.25 2289.3
Eastern Europe 2559.6666667 2559.7
Middle East 8620 .
Pacific 895.88235294 895.9
South America 1538.5769231 1538.6
United States 4010.4166667 4010.4
Western Europe 1968.5882353 1968.6
Output from the data step for trimmed=1:
Count: no. of rows in the by-group
N: ignore this column - same as Count
trimmedRowCount: no. of extreme rows excluded. If trimmedRowCount >= Count then trimmedMeanreturns is the SimpleAverage
Region count trimmedMeanreturns N trimmedRowCount
Africa 56 1183.963 56 2
Asia 14 662 14 2
Canada 37 3089.143 37 2
Central America/Caribbean 32 3561.133 32 2
Eastern Europe 31 2665.138 31 2
Middle East 24 6794.318 24 2
Pacific 45 1538.581 45 2
South America 54 1824.115 54 2
United States 40 4462.421 40 2
Western Europe 62 2538.05 62 2