I have a dataset where observations is student and then I have a variable for their test score. I need to standardize these scores like this :
newscore = (oldscore - mean of all scores) / std of all scores
So that I am thinking is using a Data Step where I create a new dataset with the 'newscore' added to each student. But I don't know how to calculate the mean and std of the entire dataset IN in the Data Step. I know I can just calculate it using proc means, and then manually type it it. But I need to do I a lot of times and maybe drop variables and other stuff. So I would like to be able to just calculate it in the same step.
Data example:
__VAR testscore newscore
Student1 5 x
Student2 8 x
Student3 5 x
Code I tried:
data new;
set old;
newscore=(oldscore-(mean of testscore))/(std of testscore)
run;
(Can't post any of the real data, can't remove it from the server)
How do I do this?
Method1: Efficient way of solving this problem is by using proc stdize . It will do the trick and you dont need to calculate mean and standard deviation for this.
data have;
input var $ testscore;
cards;
student1 5
student2 8
student3 5
;
run;
data have;
set have;
newscore = testscore;
run;
proc stdize data=have out=want;
var newscore;
run;
Method2: As you suggested taking out means and standard deviation from proc means, storing their value in a macro and using them in our calculation.
proc means data=have;
var testscore;
output out=have1 mean = m stddev=s;
run;
data _null_;
set have1;
call symputx("mean",m);
call symputx("std",s);
run;
data want;
set have;
newscore=(testscore-&mean.)/&std.;
run;
My output:
var testscore newscore
student1 5 -0.577350269
student2 8 1.1547005384
student3 5 -0.577350269
Let me know in case of any queries.
You should not try to do this in the data step. Do it with proc means. You don't need to type anything in, just grab the value in a dataset.
You don't provide enough to give complete code in the answer, but the basic idea.
proc means data=sashelp.class;
var height weight;
output out=class_stats mean= std= /autoname;
run;
data class;
if _n_=1 then set class_Stats; *copy in the values from class_Stats;
set sashelp.class;
height_norm = (height-height_mean)/(height_stddev);
weight_norm = (weight-weight_mean)/(weight_stddev);
run;
Alternately, just use PROC STDIZE which will do this for you.
proc stdize data=sashelp.class out=class_Std;
var height weight;
run;
If you want to achieve this via proc sql:
proc sql;
create table want as
select *, mean(oldscore) as mean ,std(oldscore) as sd
from have;
quit;
For other statistical functions in proc sql, see here: https://support.sas.com/kb/25/279.html
Related
I often work with a large number of variables that have zero or empty values only, but I could not find a SAS command to drop these unwanted variables. I know we can use SAS/IML, but I encountered such cases many times and would like to have a macro that may help me without having to type the variable names to avoid errors. Here is my code for removing variables with zero values only. It works to produce a cleaned output data set y from a raw data set x without using the names of the variables. I hope others could have a better solution or help me to make mine better.
%Macro dropZeroV(x, y) ;
proc means data = &x. ;
var _numeric_;
output out = sumTab ; run;
proc transpose data = sumTab(drop = _TYPE_) out= sumt; var _Numeric_; id _STAT_; run;
%let Vlst =;
proc sql noprint;
select _NAME_ into : dropLst separated by ' '
from sumT
where Max=0 and Min =0;
data &y.;
set &x.; drop &dropLst.;
run;
proc print data = &y.; run;
%Mend dropZeroV;
Use STACKODS and ODS SUMMARY to get the table in the format needed in one step rather than multiple steps. This limits it to the sum, since if the sum = 0, all values are 0. You may also want to look at rounding to avoid any issues with numeric precision.
PROC MEANS + PROC TRANSPOSE go to :
ods select none;
proc means data= &x. stackods sum;
var _numeric_;
ods output summary = sumT;
run;
I'm doing a simple count of occurrences of a by-variable within a class variable, but cannot find a way to rename the total count across class variables. At the moment, the output dataset includes counts for all cluster2 within each group as well as the total count across all groups (i.e. the class variable used). However, the counts within classes are named, while the total is shown by an empty string.
Code:
proc means data=seeds noprint;
class group;
by cluster2;
id label2;
output out=seeds_counts (drop= _type_ _freq_) n(id)=count;
run;
Example of output file:
cluster2 group label2 count
7 area 1 20
7 sa area 1 15
7 sb area 1 5
15 area 15 42
15 sa area 15 18
....
Naturally, renaming the emtpy string to "Total" could be accomplished in a separate datastep, but I would like to do it directly in the Proc Means-step. It should be simple and trivial, but I haven't found a way so far. Afterwards, I want to transpose the dataset, which means that the emtpy string has to be changed, or it will be dropped in the proc transpose.
I don't know of a way to do it directly, but you can sort-of-cheat: you can tell SAS to show "Total" instead of missing.
proc format;
value $MissTotalF
' ' = 'Total'
other = [$CHAR12.];
quit;
proc means data=sashelp.class noprint;
class sex;
id age;
output out=sex_counts (drop= _type_ _freq_) n(age)=count;
format sex $MissTotalF.;
run;
For example. I'd also recommend using PROC TABULATE instead of PROC MEANS if you're just going for counts, though in this case it doesn't really make much difference.
The problem here is that if the variable in the class statement is numeric, then the resultant column will be numeric, therefore you can't add the word Total (unless you use a format, similar to the answer from #Joe). This will be why the value is missing, as the class variable can be either numeric or character.
Here's an example of a numeric class variable.
proc sort data=sashelp.class out=class;
by sex;
run;
proc means data=class noprint;
class age;
by sex;
output out=class_counts (drop= _:) n=count;
run;
Using proc tabulate can display the result pretty much how you want it, however the output dataset will have the same missing values, so won't really help. Here's a couple of examples.
proc tabulate data=class out=class_tabulate1 (drop=_:);
class sex age;
table sex*(age all='Total'),n='';
run;
proc tabulate data=class out=class_tabulate2 (drop=_:);
class sex age;
table sex,age*n='' all='Total';
run;
I think the best option to achieve your final goal is to add the nway option to proc means, which will remove the subtotals, then transpose the data and finally write a data step that creates the Total column by summing each row. It's 3 steps, but doesn't involve much coding.
Here is one method you could use by taking advantage of the _TYPE_ variable so that you can process the totals and details separately. You will still have trouble with PROC TRANSPOSE if there is a class with missing values (separate from the overall summary record).
proc means data=sashelp.class noprint;
class sex;
id age;
output out=sex_counts (drop= _freq_ ) n(age)=count;
run;
proc transpose data=sex_counts out=transpose prefix=count_ ;
where _type_=1 ;
id sex ;
var count;
run;
data transpose ;
merge transpose sex_counts(where=(_type_=0) keep=_type_ count);
rename count=count_Total;
drop _type_;
run;
I used the following code to calculate the difference between max and min value in a column, but it doesn't like a smart way. So could anyone give me some suggestion?
p.s. I need to put the difference back to the dataset as a new variable, because I want to delete datasets based on this difference.
proc univariate noprint date=test;
var time_l_;
output out=result max=max min=min;
run;
data test;
set result test;
run;
data test;
set test;
gap=max-min;
run;
You're pretty close, actually, to what I'd consider a good result. This isn't the absolute fastest way to do it, but it's probably the best when you don't need amazing performance because it's a lot less complicated than the faster methods.
Create the max/min dataset, then use if _n_ = 1 then set result; which will bring it in once. The variables are automatically RETAINed, because they are brought in on the SET statement. Then calculate the gap in the same data step.
proc univariate noprint data=sashelp.class;
var age;
output out=result max=max min=min;
run;
data test;
if _n_=1 then set result;
set sashelp.class;
gap = max-min;
run;
The SQL solution is straightforward but leaves a message in your log regarding remerging.
Proc SQL;
Create table want as
Select *, max(age) as max_age, min(age) as min_age, calculated max_age - calculated min_age as age_diff
From have;
Quit;
A simpler SQL solution using the range function:
proc sql;
create table want as
select *,range(age) as age_range
from sashelp.class;
quit;
What i want to do: I need to create a new variables for each value labels of a variable and do some recoding. I have all the value labels output from a SPSS file (see sample).
Sample:
proc format; library = library ;
value SEXF
1 = 'Homme'
2 = 'Femme' ;
value FUMERT1F
0 = 'Non'
1 = 'Oui , occasionnellement'
2 = 'Oui , régulièrement'
3 = 'Non mais j''ai déjà fumé' ;
value ... (many more with different amount of levels)
The new variable name would be the actual one without F and with underscore+level (example: FUMERT1F level 0 would become FUMERT1_0).
After that i need to recode the variables on this pattern:
data ds; set ds;
FUMERT1_0=0;
if FUMERT1=0 then FUMERT1_0=1;
FUMERT1_1=0;
if FUMERT1=1 then FUMERT1_1=1;
FUMERT1_2=0;
if FUMERT1=2 then FUMERT1_2=1;
FUMERT1_3=0;
if FUMERT1=3 then FUMERT1_3=1;
run;
Any help will be appreciated :)
EDIT: Both answers from Joe and the one of data_null_ are working but stackoverflow won't let me pin more than one right answer.
Update to add an _ underscore to the end of each name. It looks like there is not option for PROC TRANSREG to put an underscore between the variable name and the value of the class variable so we can just do a temporary rename. Create rename name=newname pairs to rename class variable to end in underscore and to rename them back. CAT functions and SQL into macro variables.
data have;
call streaminit(1234);
do caseID = 1 to 1e4;
fumert1 = rand('table',.2,.2,.2) - 1;
sex = first(substrn('MF',rand('table',.5),1));
output;
end;
stop;
run;
%let class=sex fumert1;
proc transpose data=have(obs=0) out=vnames;
var &class;
run;
proc print;
run;
proc sql noprint;
select catx('=',_name_,cats(_name_,'_')), catx('=',cats(_name_,'_'),_name_), cats(_name_,'_')
into :rename1 separated by ' ', :rename2 separated by ' ', :class2 separated by ' '
from vnames;
quit;
%put NOTE: &=rename1;
%put NOTE: &=rename2;
%put NOTE: &=class2;
proc transreg data=have(rename=(&rename1));
model class(&class2 / zero=none);
id caseid;
output out=design(drop=_: inter: rename=(&rename2)) design;
run;
%put NOTE: _TRGIND(&_trgindn)=&_trgind;
First try:
Looking at the code you supplied and the output from Joe's I don't really understand the need for the formats. It looks to me like you just want to create dummies for a list of class variables. That can be done with TRANSREG.
data have;
call streaminit(1234);
do caseID = 1 to 1e4;
fumert1 = rand('table',.2,.2,.2) - 1;
sex = first(substrn('MF',rand('table',.5),1));
output;
end;
stop;
run;
proc transreg data=have;
model class(sex fumert1 / zero=none);
id caseid;
output out=design(drop=_: inter:) design;
run;
proc contents;
run;
proc print data=design(obs=40);
run;
One good alternative to your code is to use proc transpose. It won't get you 0's in the non-1 cells, but those are easy enough to get. It does have the disadvantage that it makes it harder to get your variables in a particular order.
Basically, transpose once to vertical, then transpose back using the old variable name concatenated to the variable value as the new variable name. Hat tip to Data null for showing this feature in a recent SAS-L post. If your version of SAS doesn't support concatenation in PROC TRANSPOSE, do it in the data step beforehand.
I show using PROC EXPAND to then set the missings to 0, but you can do this in a data step as well if you don't have ETS or if PROC EXPAND is too slow. There are other ways to do this - including setting up the dataset with 0s pre-proc-transpose - and if you have a complicated scenario where that would be needed, this might make a good separate question.
data have;
do caseID = 1 to 1e4;
fumert1 = rand('Binomial',.3,3);
sex = rand('Binomial',.5,1)+1;
output;
end;
run;
proc transpose data=have out=want_pre;
by caseID;
var fumert1 sex;
copy fumert1 sex;
run;
data want_pre_t;
set want_pre;
x=1; *dummy variable;
run;
proc transpose data=want_pre_t out=want delim=_;
by caseID;
var x;
id _name_ col1;
copy fumert1 sex;
run;
proc expand data=want out=want_e method=none;
convert _numeric_ /transformin=(setmiss 0);
run;
For this method, you need to use two concepts: the cntlout dataset from proc format, and code generation. This method will likely be faster than the other option I presented (as it passes through the data only once), but it does rely on the variable name <-> format relationship being straightforward. If it's not, a slightly more complex variation will be required; you should post to that effect, and this can be modified.
First, the cntlout option in proc format makes a dataset of the contents of the format catalog. This is not the only way to do this, but it's a very easy one. Specify the appropriate libname as you would when you create a format, but instead of making one, it will dump the dataset out, and you can use it for other purposes.
Second, we create a macro that performs your action one time (creating a variable with the name_value name and then assigning it to the appropriate value) and then use proc sql to make a bunch of calls to that macro, once for each row in your cntlout dataset. Note - you may need a where clause here, or some other modifications, if your format library includes formats for variables that aren't in your dataset - or if it doesn't have the nice neat relationship your example does. Then we just make those calls in a data step.
*Set up formats and dataset;
proc format;
value SEXF
1 = 'Homme'
2 = 'Femme' ;
value FUMERT1F
0 = 'Non'
1 = 'Oui , occasionnellement'
2 = 'Oui , régulièrement'
3 = 'Non mais j''ai déjà fumé' ;
quit;
data have;
do caseID = 1 to 1e4;
fumert1 = rand('Binomial',.3,3);
sex = rand('Binomial',.5,1)+1;
output;
end;
run;
*Dump formats into table;
proc format cntlout=formats;
quit;
*Macro that does the above assignment once;
%macro spread_var(var=, val=);
&var._&val.= (&var.=&val.); *result of boolean expression is 1 or 0 (T=1 F=0);
%mend spread_var;
*make the list. May want NOPRINT option here as it will make a lot of calls in your output window otherwise, but I like to see them as output.;
proc sql;
select cats('%spread_var(var=',substr(fmtname,1,length(Fmtname)-1),',val=',start,')')
into :spreadlist separated by ' '
from formats;
quit;
*Actually use the macro call list generated above;
data want;
set have;
&spreadlist.;
run;
I have a bunch of sas datasets of various lengths and I need to trim the nth highest and lowest values by a variable value.
To do this for when I needed to trim the highest and lowest I did this
DATA VDBP273_first_night_Systolic;
SET VDBP273_first_night end=eof;
IF _N_ =1 then delete;
if eof then delete;
run;
And it worked fine.
Now I need to do something more like this
PROC SORT DATA=foo OUT=foo_sorted;
BY bar;
run;
DATA foo_out;
SET foo_sorted end=eof;
IF _N_ <= 5 then delete;
if eof *OR THE 4 right before it* then delete;
run;
I'm sure this is easy but it's stumping me. How can I say the last 5 of this sorted data set delete those?
Since you are presorting your data and then trying to eliminate top n and bottom n record, You can easily solve your problem using OBS= and FIRSTOBS= dataset option.
proc sql noprint;
select count(*) -4 into:counter from sashelp.class ;
quit;
proc sort data=sashelp.class out=have;by height;run;
proc print data=have;run;
data want;
set have(firstobs=6 obs=&counter);
run;
proc print data=want;run;
You can use the nobs= dataset option to store the total number of observations, which then means you can do something similar to your code to exclude the top/bottom n records.
I'd recommend putting the number of records to be excluded in a macro variable, it makes it easier to read and change than hard coding it.
%let excl = 6;
data want;
set sashelp.class nobs=numobs;
if &excl.< _n_ <=(numobs-&excl.);
run;
or simply do the same step done before, adding descending to the proc sort variable
proc sort data=have out=want; by var1 descending; run;