Drop variables with all-zero values from a SAS data set

I often work with data sets in which many variables contain only zero or empty values, but I could not find a SAS statement that drops such variables. I know SAS/IML can do it, but I run into this often enough that I would like a macro that does it without my typing the variable names, to avoid errors. Here is my code for removing variables that contain only zeros. It produces a cleaned output data set y from a raw data set x without naming the variables. I hope others have a better solution or can help me improve mine.
%macro dropZeroV(x, y);
%local dropLst;
%let dropLst =; /* stays empty when no variable qualifies */
proc means data = &x. ;
var _numeric_;
output out = sumTab ; run;
/* one row per variable, one column per statistic (N, MIN, MAX, MEAN, STD) */
proc transpose data = sumTab(drop = _TYPE_ _FREQ_) out = sumT; var _numeric_; id _STAT_; run;
proc sql noprint;
select _NAME_ into :dropLst separated by ' '
from sumT
where Max = 0 and Min = 0;
quit;
data &y.;
set &x.; drop &dropLst.;
run;
proc print data = &y.; run;
%mend dropZeroV;

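Use the STACKODS option with ODS OUTPUT to get the table in the needed format in one step instead of PROC MEANS followed by PROC TRANSPOSE. The version below requests only the sum. A sum of 0 implies that all values are 0 only when no value is negative (positive and negative values can cancel), so request min and max instead if that can happen. You may also want to round the result to avoid numeric precision issues.
The PROC MEANS + PROC TRANSPOSE steps become:
ods select none;
proc means data= &x. stackods sum;
var _numeric_;
ods output summary = sumT;
run;
ods select all; /* restore normal output for later steps */
The resulting table has one row per variable, with the name in Variable and the statistic in Sum, so the PROC SQL step then selects Variable where Sum = 0.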
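Putting the pieces together, the macro might look like the sketch below. It is not the original poster's code: the name dropZeroV2 is made up for illustration, it requests min and max rather than the sum so that values of mixed sign cannot cancel, and it assumes the STACKODS summary table exposes columns named Variable, Min and Max.

```sas
%macro dropZeroV2(x, y);
  %local dropLst;
  %let dropLst =;                       /* empty when nothing qualifies */

  ods select none;                      /* suppress printed output */
  proc means data=&x. stackods min max;
    var _numeric_;
    ods output summary=sumT;            /* one row per variable */
  run;
  ods select all;                       /* restore printed output */

  proc sql noprint;
    select Variable into :dropLst separated by ' '
    from sumT
    where Min = 0 and Max = 0;          /* every non-missing value is 0 */
  quit;

  data &y.;
    set &x.;
    %if %length(&dropLst.) %then %do;   /* skip DROP when the list is empty */
    drop &dropLst.;
    %end;
  run;
%mend dropZeroV2;
```

A call such as %dropZeroV2(x, y) would then write the cleaned data set y. Variables that are entirely missing have missing Min and Max and are kept by this sketch.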

Related

proc report print null dataset

I have a null dataset such as
data a;
if 0;
run;
Now I wish to use proc report to print this dataset. Of course, there will be nothing in the report, but I want one sentence in the report saying "It is a null dataset". Any ideas?
Thanks.

You can test to see if there are any observations in the dataset first. If there are observations, then use the dataset, otherwise use a dummy dataset that looks like this and print it:
data use_this_if_no_obs;
msg = 'It is a null dataset';
run;
There are plenty of ways to test datasets to see if they contain any observations or not. My personal favorite is the %nobs macro found here: https://stackoverflow.com/a/5665758/214994 (other than my answer, there are several alternate approaches to pick from, or do a google search).
Using this %nobs macro we can then determine the dataset to use in a single line of code:
%let ds = %sysfunc(ifc(%nobs(iDs=sashelp.class) eq 0, use_this_if_no_obs, sashelp.class));
proc print data=&ds;
run;
Here's some code showing the alternate outcome:
data for_testing_only;
if 0;
run;
%let ds = %sysfunc(ifc(%nobs(iDs=for_testing_only) eq 0, use_this_if_no_obs, sashelp.class));
proc print data=&ds;
run;
I've used proc print to simplify the example, but you can adapt it to use proc report as necessary.
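If the linked %nobs macro is not available, a stand-in can be sketched with the OPEN, ATTRN and CLOSE functions. The macro name nobs_sketch is hypothetical and is not the macro from the linked answer. It assumes a native SAS data set, since the NLOBS attribute is not available for views.

```sas
/* Hypothetical stand-in for the linked %nobs macro: resolves to the number
   of non-deleted observations, or -1 if the data set cannot be opened. */
%macro nobs_sketch(iDs=);
  %local dsid n rc;
  %let n = -1;
  %let dsid = %sysfunc(open(&iDs));
  %if &dsid %then %do;
    %let n = %sysfunc(attrn(&dsid, nlobs));
    %let rc = %sysfunc(close(&dsid));
  %end;
  &n
%mend nobs_sketch;

/* Same pattern as above, using the stand-in macro */
%let ds = %sysfunc(ifc(%nobs_sketch(iDs=for_testing_only) eq 0, use_this_if_no_obs, for_testing_only));
proc print data=&ds;
run;
```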

For the no-data report, you don't need to know how many observations are in the data, just that there are none. This example shows how I would approach the problem.
Create example data with zero obs.
data class;
stop;
set sashelp.class;
run;
Check for no obs and add one obs with missing values on all vars. Note that no observations are ever read from class in this step.
data class;
if eof then output;
stop;
modify class end=eof;
run;
Make the report.
proc report data=class missing;
column _all_;
define _all_ / display;
define name / order;
compute before name;
retain_name=name;
endcomp;
compute after;
if not missing(retain_name) then l=0;
else l=40;
msg = 'No data for this report';
line msg $varying. l;
endcomp;
run;

how to calculate percentile in SAS

I want to calculate the 95th percentile of a distribution. I don't think I can use proc means, because I need the value itself, while the output of proc means is a table. I have to use the percentile to filter the dataset and create another dataset with only the observations greater than the percentile.
I don't want to hard-code the numeric value, because I want to use this in a macro.

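Don't put summary statistics into macro variables. You risk loss of precision.
This is based on your cryptic description of the problem; the data set name data and the variable name value are placeholders.
proc means data=data noprint;
var value;
output out=pct95(keep=pct95) p95=pct95;
run;
data subset;
if _n_ eq 1 then set pct95; /* pct95 is retained on every observation */
set data;
if value > pct95; /* keep only observations above the 95th percentile */
run;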

You can suppress proc means from outputting your results in a new tab using the noprint option. Try this:
proc means data = your_data noprint;
var variable_name;
output out = your_data2 p95= / autoname;
run;
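As a worked sketch that combines both answers, the example below uses sashelp.class and its height variable purely for illustration. With the AUTONAME option, the output variable is named by appending the statistic to the analysis variable, so it is Height_P95 here. The percentile stays in a data set rather than a macro variable, so no precision is lost.

```sas
/* 95th percentile of height, written to a one-row data set */
proc means data=sashelp.class noprint;
  var height;
  output out=class_p95 p95= / autoname;   /* creates Height_P95 */
run;

/* Keep only the observations above the percentile */
data tallest;
  if _n_ = 1 then set class_p95(keep=height_p95);
  set sashelp.class;
  if height > height_p95;
run;
```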

Create new variables from format values

What I want to do: I need to create a new variable for each value label of a variable and do some recoding. I have all the value labels output from an SPSS file (see sample).
Sample:
proc format library = library ;
value SEXF
1 = 'Homme'
2 = 'Femme' ;
value FUMERT1F
0 = 'Non'
1 = 'Oui , occasionnellement'
2 = 'Oui , régulièrement'
3 = 'Non mais j''ai déjà fumé' ;
value ... (many more with different amount of levels)
The new variable name would be the actual one without F and with underscore+level (example: FUMERT1F level 0 would become FUMERT1_0).
After that I need to recode the variables following this pattern:
data ds; set ds;
FUMERT1_0=0;
if FUMERT1=0 then FUMERT1_0=1;
FUMERT1_1=0;
if FUMERT1=1 then FUMERT1_1=1;
FUMERT1_2=0;
if FUMERT1=2 then FUMERT1_2=1;
FUMERT1_3=0;
if FUMERT1=3 then FUMERT1_3=1;
run;
Any help will be appreciated :)
EDIT: Both of Joe's answers and the one from data_null_ work, but Stack Overflow won't let me accept more than one answer.

Update: add an underscore to the end of each name. It looks like PROC TRANSREG has no option to put an underscore between the variable name and the value of the class variable, so we can do a temporary rename instead. Build name=newname pairs that rename each class variable to end in an underscore, plus the reverse pairs to rename them back, using the CAT functions and PROC SQL SELECT INTO to store them in macro variables.
data have;
call streaminit(1234);
do caseID = 1 to 1e4;
fumert1 = rand('table',.2,.2,.2) - 1;
sex = first(substrn('MF',rand('table',.5),1));
output;
end;
stop;
run;
%let class=sex fumert1;
proc transpose data=have(obs=0) out=vnames;
var &class;
run;
proc print;
run;
proc sql noprint;
select catx('=',_name_,cats(_name_,'_')), catx('=',cats(_name_,'_'),_name_), cats(_name_,'_')
into :rename1 separated by ' ', :rename2 separated by ' ', :class2 separated by ' '
from vnames;
quit;
%put NOTE: &=rename1;
%put NOTE: &=rename2;
%put NOTE: &=class2;
proc transreg data=have(rename=(&rename1));
model class(&class2 / zero=none);
id caseid;
output out=design(drop=_: inter: rename=(&rename2)) design;
run;
%put NOTE: _TRGIND(&_trgindn)=&_trgind;
First try:
Looking at the code you supplied and the output from Joe's I don't really understand the need for the formats. It looks to me like you just want to create dummies for a list of class variables. That can be done with TRANSREG.
data have;
call streaminit(1234);
do caseID = 1 to 1e4;
fumert1 = rand('table',.2,.2,.2) - 1;
sex = first(substrn('MF',rand('table',.5),1));
output;
end;
stop;
run;
proc transreg data=have;
model class(sex fumert1 / zero=none);
id caseid;
output out=design(drop=_: inter:) design;
run;
proc contents;
run;
proc print data=design(obs=40);
run;

One good alternative to your code is to use proc transpose. It won't get you 0's in the non-1 cells, but those are easy enough to get. It does have the disadvantage that it makes it harder to get your variables in a particular order.
Basically, transpose once to vertical, then transpose back using the old variable name concatenated to the variable value as the new variable name. Hat tip to Data null for showing this feature in a recent SAS-L post. If your version of SAS doesn't support concatenation in PROC TRANSPOSE, do it in the data step beforehand.
I show using PROC EXPAND to then set the missings to 0, but you can do this in a data step as well if you don't have ETS or if PROC EXPAND is too slow. There are other ways to do this - including setting up the dataset with 0s pre-proc-transpose - and if you have a complicated scenario where that would be needed, this might make a good separate question.
data have;
do caseID = 1 to 1e4;
fumert1 = rand('Binomial',.3,3);
sex = rand('Binomial',.5,1)+1;
output;
end;
run;
proc transpose data=have out=want_pre;
by caseID;
var fumert1 sex;
copy fumert1 sex;
run;
data want_pre_t;
set want_pre;
x=1; *dummy variable;
run;
proc transpose data=want_pre_t out=want delim=_;
by caseID;
var x;
id _name_ col1;
copy fumert1 sex;
run;
proc expand data=want out=want_e method=none;
convert _numeric_ /transformin=(setmiss 0);
run;
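Without SAS/ETS, the PROC EXPAND step can be replaced by a data step along these lines. This is a sketch that assumes the second transpose produced dummy variables whose names start with fumert1_ and sex_, as the delim=_ option should give for the example data.

```sas
data want_e;
  set want;
  array dummies {*} fumert1_: sex_:;    /* all dummy variables created by the transpose */
  do _i = 1 to dim(dummies);
    if missing(dummies{_i}) then dummies{_i} = 0;   /* non-1 cells become 0 */
  end;
  drop _i;
run;
```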

For this method, you need to use two concepts: the cntlout dataset from proc format, and code generation. This method will likely be faster than the other option I presented (as it passes through the data only once), but it does rely on the variable name <-> format relationship being straightforward. If it's not, a slightly more complex variation will be required; you should post to that effect, and this can be modified.
First, the cntlout option in proc format makes a dataset of the contents of the format catalog. This is not the only way to do this, but it's a very easy one. Specify the appropriate libname as you would when you create a format, but instead of making one, it will dump the dataset out, and you can use it for other purposes.
Second, we create a macro that performs your action one time (creating a variable with the name_value name and then assigning it to the appropriate value) and then use proc sql to make a bunch of calls to that macro, once for each row in your cntlout dataset. Note - you may need a where clause here, or some other modifications, if your format library includes formats for variables that aren't in your dataset - or if it doesn't have the nice neat relationship your example does. Then we just make those calls in a data step.
*Set up formats and dataset;
proc format;
value SEXF
1 = 'Homme'
2 = 'Femme' ;
value FUMERT1F
0 = 'Non'
1 = 'Oui , occasionnellement'
2 = 'Oui , régulièrement'
3 = 'Non mais j''ai déjà fumé' ;
quit;
data have;
do caseID = 1 to 1e4;
fumert1 = rand('Binomial',.3,3);
sex = rand('Binomial',.5,1)+1;
output;
end;
run;
*Dump formats into table;
proc format cntlout=formats;
quit;
*Macro that does the above assignment once;
%macro spread_var(var=, val=);
&var._&val.= (&var.=&val.); *result of boolean expression is 1 or 0 (T=1 F=0);
%mend spread_var;
*make the list. May want NOPRINT option here as it will make a lot of calls in your output window otherwise, but I like to see them as output.;
proc sql;
select cats('%spread_var(var=',substr(fmtname,1,length(Fmtname)-1),',val=',start,')')
into :spreadlist separated by ' '
from formats;
quit;
*Actually use the macro call list generated above;
data want;
set have;
&spreadlist.;
run;
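If the format catalog also holds formats that do not correspond to variables in the dataset, the WHERE clause mentioned above might look like the sketch below. The list of format names is an assumption based on the example, and TYPE = 'N' restricts the query to numeric formats in the cntlout dataset. %superq is used in the %put so the stored macro calls are displayed rather than executed.

```sas
proc sql noprint;
  select cats('%spread_var(var=', substr(fmtname, 1, length(fmtname) - 1), ',val=', start, ')')
    into :spreadlist separated by ' '
  from formats
  where upcase(fmtname) in ('SEXF', 'FUMERT1F')   /* formats that map to variables in HAVE */
    and type = 'N';                               /* numeric formats only */
quit;

/* Show the generated calls without running them */
%put NOTE: %superq(spreadlist);
```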

control the number of decimal places in SAS proc means

I am trying to report my proc means output with 10 decimal places by specifying maxdec=10. However, SAS does not report more than 7 decimal places.
Here is the warning I get:
WARNING: NDec value is inappropriate, BEST format will be used.
I appreciate any suggestion.

If you look at the documentation, it states that MEANS will print out 0-8 decimal places based on the value of MAXDEC. If you want more, you will need to save the results and print them yourself.
Try this:
data test;
format x 12.11;
do i=1 to 1000;
x = rannor(0);
output;
end;
drop i;
run;
proc means data=test noprint;
var x;
output out=means_out mean=mean std=std;
run;
proc print data=means_out noobs;
var mean std;
format mean std 12.11;
run;

As already mentioned, maxdec= only supports up to 8 decimal places. Proc means isn't going to let you do much to change the format of the summary statistics. I'd suggest using proc tabulate:
If your proc means looks like:
proc means data=yourdata;
var yourvariable;
run;
Then use something like:
proc tabulate data=yourdata;
var yourvariable;
table yourvariable*
(n
mean*format=15.10
stddev*format=15.10
min*format=15.10
max*format=15.10);
run;

Determining the frequency of ONLY certain values in all variables of a data set

I'd like to get a frequency table that lists all variables, but only tells me the number of times "-2", "-1" and "M" appear in each variable.
Currently, when I run the following code:
proc freq data=mydata;
tables _ALL_
/list missing;
I get one table for each variable and all of its values (sometimes 100s). Can I just get tables with the three values I want, and everything else suppressed?

You can do this a number of ways.
First off, you probably want to send the output to a dataset first so that you can filter it. I would use PROC TABULATE, but you can use PROC FREQ if you like it better.
*make up some data;
data mydata;
call streaminit(132);
array x[100];
do _i = 1 to 50;
do _t = 1 to dim(x);
x[_t]= floor(rand('Uniform')*9-5);
end;
output;
end;
keep x:;
run;
ods _all_ close; *close the 'visible' output types;
ods output onewayfreqs=outdata; *output the onewayfreqs (one way frequency tables) to a dataset;
proc freq data=mydata;
tables _all_/missing;
run;
ods output close; *close the dataset;
ods preferences; *open back up your default outputs;
Then filter it, and once you've done that print it however you want. Note in the PROC FREQ output, you get a column for each different variable - not super helpful. The F_ variables are the formatted values, which can then be combined using coalesce. I assume here they're all numeric variables - define f_val as character and use coalescec if there are any character variables or variables with character-ish formats applied to them.
data has_values;
set outdata;
f_val = coalesce(of f_:);
keep table f_val frequency percent;
if f_val in (0,-1,-2);
run;
The last line keeps only the 0,-1,-2.
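For the values in the question ("-2", "-1" and "M"), the character variant mentioned above could be sketched as follows. It assumes the F_ variables in outdata are character (or carry character-like formats) and that 32 characters is long enough for the formatted values; both are assumptions, not something the original answer states.

```sas
data has_values;
  set outdata;
  length f_val $32;
  f_val = strip(coalescec(of f_:));     /* first non-blank formatted value */
  keep table f_val frequency percent;
  if f_val in ('-2', '-1', 'M');        /* keep only the three values of interest */
run;

proc print data=has_values noobs;
run;
```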
