Null Columns only based on a variable value - SAS Dataset

I have a very large SAS dataset with over 280 variables, and I need to retrieve all the columns that are completely null for a given value of another variable. For example, the dataset has a variable called Reported (with only the values Yes and No), and I want to find all the columns that are entirely null where Reported is No.
Is there a quick way to find this out without writing out all the column names to check for complete NULL values?
So, for example, if I have 4 variables in the table,
Based on the above table, I would like to see output like this, where Var4='No', returning only the columns in which all values are missing.
This would help me identify variables that are not being populated at all where the Var4 value is 'No'.

Note the WHERE statement in PROC FREQ.
proc format;
value $_xmiss_(default=1 min=1 max=1) ' ' =' ' other='1';
value _xmiss_(default=1 min=1 max=1) ._-.Z=' ' other='1';
quit;
%let data=sashelp.heart;
proc freq data=&data nlevels;
where status eq: 'A';
ods select nlevels;
ods output nlevels=nlevels;
format _character_ $_xmiss_. _numeric_ _xmiss_.;
run;
data nlevels;
length TABLEVAR $32 TABLEVARLABEL $128 NLEVELS NMISSLEVELS NNONMISSLEVELS 8;
retain NLEVELS NMISSLEVELS NNONMISSLEVELS 0;
set nlevels;
run;

There are two parts to the question, I think. The first is to subset the records where Reported = "N". The second is to report, among those records, the columns in which every value is missing. If that is correct, you could do something like the following (I am assuming that the columns with missing values are all numeric; if not, this approach will need a slight modification):
/* Thanks to REEZA for pointing out this way of getting the freqs. This eliminates some constraints and is more efficient */
proc freq data=have nlevels ;
where var1 = "N" ;
ods output nlevels = freqs;
table _all_;
run;
proc sql noprint;
select TableVar into :cols separated by " " from freqs where NNonMissLevels = 0 ;
quit;
%put &cols;
data want;
set have (keep = &cols var1);
where var1 = "N" ;
run;
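
Since the sample tables from the question did not survive, here is a minimal sketch of the same NLEVELS approach on made-up data. The dataset have, its variables id, var2, var3 and var4, and the output name levels are all assumptions for illustration; var4 stands in for the Reported variable.

```sas
/* Hypothetical sample data: var4 stands in for Reported */
data have;
  input id var2 var3 $ var4 $;
  datalines;
1 10 a Yes
2 . . No
3 . . No
4 20 b Yes
;
run;

/* One row per variable with its count of non-missing levels,
   restricted to the rows where var4 is 'No' */
ods output nlevels=levels;
proc freq data=have nlevels;
  where var4 = 'No';
  tables _all_ / noprint;
run;

/* Variables with no non-missing level are entirely missing in the subset */
proc print data=levels noobs;
  where NNonMissLevels = 0;
  var TableVar;
run;
```

On this data the listing should contain var2 and var3. Note that the NNonMissLevels column is only present in the NLevels table when at least one variable has a missing value.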

Related

Highlight the corresponding line number with another datasets

I have two datasets; one holds the extreme values extracted by proc univariate. I would like to create a new variable in the original dataset and set it to 1 when the row number (n) in the original dataset equals the line number extracted in the univariate dataset. But I don't know how to program this without manually entering the line numbers.

There are a few ways to do this, but one easy way is to add the rownum to the original dataset and merge on it.
Here's an example.
ods output extremeobs=extreme_test;
proc univariate data=sashelp.heart;
run;
ods output close;
data extreme_diastolic extreme_systolic; *just creating the extreme datasets;
set extreme_test;
if varname='Diastolic' then output extreme_diastolic;
else if varname='Systolic' then output extreme_systolic;
run;
data for_merge; *adding rownum on to the original dataset;
set sashelp.heart;
rownum = _n_;
run;
*now, sort the extreme datasets by the `highobs` and `lowobs` values respectively and save those as `rownum`, so they can be merged;
proc sort data=extreme_diastolic out=high_diastolic(keep=highobs rename=(highobs=rownum));
by highobs;
run;
proc sort data=extreme_systolic out=high_systolic(keep=highobs rename=(highobs=rownum));
by highobs;
run;
proc sort data=extreme_diastolic out=low_diastolic(keep=lowobs rename=(lowobs=rownum));
by lowobs;
run;
proc sort data=extreme_systolic out=low_systolic(keep=lowobs rename=(lowobs=rownum));
by lowobs;
run;
*now, merge those on using `in=` to identify which are matches.;
data heart_extremes;
merge for_merge high_diastolic(in=_highd) high_systolic(in=_highs) low_diastolic(in=_lowd) low_systolic(in=_lows);
by rownum;
if _highd then high_diastolic = 1;
if _highs then high_systolic = 1;
if _lowd then low_diastolic = 1;
if _lows then low_systolic = 1;
run;

How to write a concise list of variables in table of a freq when the variables are differentiated only by a suffix?

I have a dataset with some variables named sx for x = 1 to n.
Is it possible to write a freq which gives the same result as:
proc freq data=prova;
table s1 * s2 * s3 * ... * sn /list missing;
run;
but without listing all the names of the variables?
I would like an output like this:
S1  S2  S3  S4  Frequency
A                      10
A   E                 100
A   E   J   F         300
B                      10
B   E                 100
B   E   J   F         300
but with an instruction like this (which, of course, is invented):
proc freq data=prova;
table s1:sn /list missing;
run;

Why not just use PROC SUMMARY instead?
Here is an example using two variables from SASHELP.CARS.
So this is PROC FREQ code.
proc freq data=sashelp.cars;
where make in: ('A','B');
tables make*type / list;
run;
Here is a way to get the counts using PROC SUMMARY.
proc summary missing nway data=sashelp.cars ;
where make in: ('A','B');
class make type ;
output out=want;
run;
proc print data=want ;
run;
If you need to calculate the percentages you can instead use the WAYS statement to get both the overall and the individual cell counts. And then add a data step to calculate the percentages.
proc summary missing data=sashelp.cars ;
where make in: ('A','B');
class make type ;
ways 0 2 ;
output out=want;
run;
data want ;
set want ;
retain total;
if _type_=0 then total=_freq_;
percent=100*_freq_/total;
run;
So if you have 10 variables you would use
ways 0 10 ;
class s1-s10 ;
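
Putting that together for the question's data, a sketch could look like the following. It assumes the dataset is prova with variables s1 to s10; the output name counts and the renamed column count are made up here.

```sas
/* One output row per observed combination of s1-s10 */
proc summary data=prova missing nway;
  class s1-s10;
  output out=counts(drop=_type_ rename=(_freq_=count));
run;

proc print data=counts;
run;
```

MISSING keeps combinations that include missing values, like the missing option on the TABLES statement, and NWAY keeps only the full crossing of all ten variables.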

If you just want to build up the string "S1*S2*..." then you could use a DO loop or a macro %DO loop and put the result into a macro variable.
data _null_;
length namelist $200;
do i=1 to 10;
namelist=catx('*',namelist,cats('S',i));
end;
call symputx('namelist',namelist);
run;
But here is an easy way to make such a macro variable from ANY variable list not just those with numeric suffixes.
First get the variables names into a dataset. PROC TRANSPOSE is a good way if you use the OBS=0 dataset option so that you only get the _NAME_ column.
proc transpose data=have(obs=0) ;
var s1-s10 ;
run;
Then use PROC SQL to stuff the names into a macro variable.
proc sql noprint;
select _name_
into :namelist separated by '*'
from &syslast
;
quit;
Then you can use the macro variable in your TABLES statement.
proc freq data=have ;
tables &namelist / list missing ;
run;

In short, no. There is no shortcut syntax for specifying a variable list that crosses dimension.
In long, yes -- if you create a surrogate variable that is an equivalent crossing.
Discussion
Sample data generator:
%macro have(top=5);
%local index;
data have;
%do index = 1 %to &top;
do s&index = 1 to 2+ceil(3*ranuni(123));
%end;
array V s:;
do _n_ = 1 to 5*ranuni(123);
x = ceil(100*ranuni(123));
if ranuni(123) < 0.1 then do;
ix = ceil(&top*ranuni(123));
h = V(ix);
V(ix) = .;
output;
V(ix) = h;
end;
else
output;
end;
%do index = 1 %to &top;
end;
%end;
run;
%mend;
%have;
As you probably noticed table s: created one freq per s* variable.
For example:
title "One table per variable";
proc freq data=have;
tables s: / list missing ;
run;
There is no shortcut syntax for specifying a variable list that crosses dimension.
NOTE: If you specify out=, the column names in the output data set will be the last variable in the level. So for above, the out= table will have a column "s5", but contain counts corresponding to combinations for each s1 through s5.
At each dimensional level you can use a variable list, as in level1 * (sublev:) * leaf. The same caveat for out= data applies.
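
For instance, a variable list in the middle dimension might look like this sketch, which uses the have data set generated above. The choice of s2 and s3 for the list is arbitrary.

```sas
title "Variable list at the middle level";
proc freq data=have;
  /* expands to two requests: s1*s2*s5 and s1*s3*s5 */
  tables s1 * (s2 s3) * s5 / list missing;
run;
title;
```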
Now, reconsider the original request discretely (no-shortcut) crossing all the s* variables:
title "1 table - 5 columns of crossings";
proc freq data=have;
tables s1*s2*s3*s4*s5 / list missing out=outEach;
run;
And, compare to what happens when a data step view uses a variable list to compute a surrogate value corresponding to the discrete combinations reported above.
data haveV / view=haveV;
set have;
crossing = catx(' * ', of s:); * concatenation of all the s variables;
keep crossing;
run;
title "1 table - 1 column of concatenated crossings";
proc freq data=haveV;
tables crossing / list missing out=outCat;
run;
Reality check with COMPARE, I don't trust eyeballs. If zero rows with differences (per noequal) then the out= data sets have identical counts.
proc compare noprint base=outEach compare=outCat out=diffs outnoequal;
var count;
run;
----- Log -----
NOTE: There were 31 observations read from the data set WORK.OUTEACH.
NOTE: There were 31 observations read from the data set WORK.OUTCAT.
NOTE: The data set WORK.DIFFS has 0 observations and 3 variables.
NOTE: PROCEDURE COMPARE used (Total process time)

Using proc freq with repeated ID variables

I would like to use proc freq to count the number of food types that someone consumed on a specific day (the fint variable). My data is in long format, with idno repeated for the different food types and a different number of interview dates per person. However, SAS hangs and does not run the code. I have more than 300,000 data lines. Is there another way to do this?
proc freq;
tables idno*fint*foodtype / out=countft;
run;

I am a little unsure of your data structure, but proc means can also count.
Assuming that you have multiple dates for each person, and multiple food types for each date, you can use:
data dataset;
set dataset;
count=1;
run;
proc means data=dataset sum;
class idno fint foodtype;
var count;
output out=countft sum=counftpday;
run;
/* Usually you only want the lines with the largest _type_, so keep going here */
proc sql noprint;
select max(_type_) into :want from countft;
quit; /*This grabs the max _type_ from output file */
data countft;
set countft;
where _type_=&want.;
run;
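
If only the full idno*fint*foodtype combinations are wanted, the NWAY option should make the max(_type_) step unnecessary, and PROC SUMMARY without a VAR statement already returns the row count in _FREQ_, so the helper variable count is not needed either. A sketch, assuming the input is called dataset as above; the renamed column count is my own choice:

```sas
/* NWAY keeps only the rows for the full three-way combination */
proc summary data=dataset nway;
  class idno fint foodtype;
  output out=countft(drop=_type_ rename=(_freq_=count));
run;
```

Rows with a missing class value are dropped unless you add the MISSING option.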

Try a proc sql:
proc sql;
create table want as
select idno, fint, foodtype, count(*) as count
from have
group by idno, fint, foodtype
order by 1, 2, 3;
quit;
Worst-case scenario, sort and count in a data step.
proc sort data=have;
by idno fint foodtype;
run;
data count;
set have;
by idno fint foodtype;
if first.foodtype then count=1;
else count+1;
if last.foodtype then output;
run;

SAS Second smallest value

The following code, built using the Summary Statistics task from SAS Enterprise Guide, finds the min of each column of a table.
How can I find the second smallest value?
I tried replacing MIN with SMALLEST(2), but it doesn't work.
Thank you.
TITLE;
TITLE1 "Summary Statistics";
TITLE2 "Results";
FOOTNOTE;
FOOTNOTE1 "Generated by the SAS System (&_SASSERVERNAME, &SYSSCPL) on
%TRIM (%QSYSFUNC(DATE(), NLDATE20.)) at
%TRIM(%SYSFUNC(TIME(), TIMEAMPM12.))";
PROC MEANS DATA=WORK.SORTTempTableSorted
NOPRINT
CHARTYPE
MIN NONOBS ;
VAR A B C;
OUTPUT OUT=WORK.MEANSummaryStats(LABEL="Summary Statistics for
WORK.QUERY_FOR_TRNSTRANSPOSEDPD__0001")
MIN()=
/ AUTONAME AUTOLABEL INHERIT
;
RUN;

Use the ExtremeValues table from PROC UNIVARIATE:
ods select none;
ods output ExtremeValues=ExtremeValues(where=(loworder=2) drop=high:);
proc univariate data=sashelp.class NEXTRVAL=2;
run;
ods select all;
proc print;
run;

I don't think there's any way to accomplish this within proc means. There are ways using a variety of other procs. The Univariate procedure highlights one method using the Extreme Observations.
https://support.sas.com/documentation/cdl/en/procstat/63104/HTML/default/viewer.htm#procstat_univariate_sect058.htm
title 'Extreme Blood Pressure Observations';
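ods select ExtremeObs;
ods output ExtremeObs=ExtremeObs; * save the table as a data set so the PROC PRINT below has something to read;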
proc univariate data=BPressure;
var Systolic Diastolic;
id PatientID;
run;
proc print data=ExtremeObs;
run;

I will assume you are interested in all numeric columns. If the IFLIST macro variable is longer than 64k bytes in length due to the number of numeric variables and the length of their names, this code will fail. It should work for all reasonably narrow data sets.
UNTESTED CODE
We get a list of the variables in your data set.
PROC CONTENTS DATA=SORTTempTableSorted OUT=md NOPRINT ;
RUN ;
We use that list to create statements and expressions.
IFLIST is a block of statements to store the minimum value of fieldname in fieldname_1 and the second lowest in fieldname_2. If the comparison is LT, then we keep distinct values, not necessarily the order statistics. If the comparison is LE and there are multiple observations with the minimum value, fieldname_1 and fieldname_2 will be equal to each other. I will assume you want distinct values.
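MAXLIST is a list of MAX() terms, one per column, used below to compute the largest numeric value in the data set :)
MINLIST and MIN2LIST are created for use in RETAIN and KEEP statements.
PROC SQL STIMER NOPRINT EXEC ;
/* TRIM() removes the trailing blanks that PROC CONTENTS stores in NAME. */
/* The generated IF block skips missing values and keeps distinct values only. */
SELECT 'IF ' || TRIM(name) || ' GT . AND ' || TRIM(name) || ' LT ' || TRIM(name) || '_1 THEN DO;' ||
TRIM(name) || '_2=' || TRIM(name) || '_1;' ||
TRIM(name) || '_1=' || TRIM(name) || ';END;ELSE IF ' ||
TRIM(name) || ' GT ' || TRIM(name) || '_1 AND ' || TRIM(name) || ' LT ' || TRIM(name) || '_2 THEN ' ||
TRIM(name) || '_2=' || TRIM(name),
'MAX(' || TRIM(name) || ')',
TRIM(name) || '_1',
TRIM(name) || '_2'
INTO :iflist SEPARATED BY '; ',
:maxlist SEPARATED BY ',',
:minlist SEPARATED BY ' ',
:min2list SEPARATED BY ' '
FROM md
WHERE type EQ 1
;
Now we get the largest numeric value from the data set:
/* In PROC SQL, <> means "not equal", so the column maxima are combined with the multi-argument MAX function instead. */
SELECT MAX(&maxlist, .) FORMAT=BEST32.
INTO :maxval
FROM SORTTempTableSorted
;
QUIT ;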
Now we do the work. The END option sets "eof" to 1 on the last observation, which is the only time we want to write a record to the output data set.
DATA min2 ;
SET SORTTempTableSorted END=eof;
RETAIN &minlist &min2list &maxval;
KEEP &minlist &min2list ;
&iflist ;
IF eof THEN
OUTPUT ;
RUN ;

A data step solution that should work for any number of columns without ever running into macro limitations:
proc sql noprint;
select count(*) into :NUM_COUNT from dictionary.columns
where LIBNAME='SASHELP' and MEMNAME = 'CLASS' and TYPE = 'num';
quit;
data class_min2;
do until(eof);
set sashelp.class end = eof;
array min2[&NUM_COUNT,2] _temporary_;
array nums[*] _numeric_;
do _n_ = 1 to &NUM_COUNT;
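/* skip missing values so they are never treated as the minimum */
if missing(nums[_n_]) then continue;
if missing(min2[_n_,1]) or nums[_n_] < min2[_n_,1] then do;
/* new minimum: the old minimum becomes the second smallest */
min2[_n_,2] = min2[_n_,1];
min2[_n_,1] = nums[_n_];
end;
else if nums[_n_] > min2[_n_,1] then min2[_n_,2] = min(nums[_n_],min2[_n_,2]);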
end;
end;
do _iorc_ = 1 to 2;
do _n_ = 1 to &NUM_COUNT;
nums[_n_] = min2[_n_,_iorc_];
end;
output;
end;
keep _NUMERIC_;
run;
This outputs the two lowest distinct values of each numeric variable, without transposing the data in the same way that proc univariate does. You can easily allow for duplicate minimum values with a bit of tweaking.

Sort the values in ascending order. Delete the first value. This would be the minimum value. Now the value left at the first position is your second minimum.
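
A sketch of that idea for a single column, using age from sashelp.class. NODUPKEY is my addition so that a repeated minimum does not count as the second value; drop it if ties should count. The output names ages and second_smallest are made up.

```sas
/* Sort the distinct non-missing values in ascending order */
proc sort data=sashelp.class(keep=age where=(age is not missing))
          out=ages nodupkey;
  by age;
run;

/* Skip the first row (the minimum) and keep only the next one */
data second_smallest;
  set ages(firstobs=2 obs=2);
run;

proc print data=second_smallest;
run;
```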

Saving results from SAS proc freq with multiple tables

I'm a beginner in SAS and I have the following problem.
I need to calculate counts and percents of several variables (A B C) from one dataset and save the results to another dataset.
My code is:
proc freq data=mydata;
tables A B C / out=data_out ; run;
The result of the procedure for each variable appears in the SAS output window, but data_out contains the results only for the last variable. How can I save them all in data_out?
Any help is appreciated.

ODS OUTPUT is your answer. You can't output directly using the OUT=, but you can output them like so:
ods output OneWayFreqs=freqs;
proc freq data=sashelp.class;
tables age height weight;
run;
ods output close;
OneWayFreqs holds the one-way tables; (n>1)-way tables are CrossTabFreqs:
ods output CrossTabFreqs=freqs;
ods trace on;
proc freq data=sashelp.class;
tables age*height*weight;
run;
ods output close;
You can find out the correct name by running ods trace on; and then running your initial proc whatever (to the screen); it will tell you the names of the output in the log. (ods trace off; when you get tired of seeing it.)

Lots of good basic sas stuff to learn here
1) Run three proc freq statements (one for each variable a b c) with a different output dataset name so the datasets are not over written.
2) use a rename option on the out = statement to change the count and percent variables for when you combine the datasets
3) sort by category and merge all datasets together
(I'm assuming there are values that appear in multiple variables; if not, you could just stack the data sets.)
data mydata;
input a $ b $ c$;
datalines;
r r g
g r b
b b r
r r r
g g b
b r r
;
run;
proc freq noprint data = mydata;
tables a / out = data_a
(rename = (a = category count = count_a percent = percent_a));
run;
proc freq noprint data = mydata;
tables b / out = data_b
(rename = (b = category count = count_b percent = percent_b));
run;
proc freq noprint data = mydata;
tables c / out = data_c
(rename = (c = category count = count_c percent = percent_c));
run;
proc sort data = data_a; by category; run;
proc sort data = data_b; by category; run;
proc sort data = data_c; by category; run;
data data_out;
merge data_a data_b data_c;
by category;
run;

As ever, there are lots of different ways of doing this sort of thing in SAS. Here are a couple of other options:
1. Use proc summary rather than proc freq:
proc summary data = sashelp.class;
class age height weight;
ways 1;
output out = freqs;
run;
2. Use multiple table statements in a single proc freq
This is more efficient than running 3 separate proc freq statements, as SAS only has to read the input dataset once rather than 3 times:
proc freq data = sashelp.class noprint;
table age /out = freq_age;
table height /out = freq_height;
table weight /out = freq_weight;
run;
data freqs;
informat age height weight count percent;
set freq_age freq_height freq_weight;
run;

This is a question I've dealt with many times and I WISH SAS had a better way of doing this.
My solution has been a generalized macro: you provide your input data, your list of variables and the name of your output dataset. I take into consideration the format/type/label of the variable, which you would have to do.
Hope it helps:
https://gist.github.com/statgeek/c099e294e2a8c8b5580a
/*
Description: Creates a One-Way Freq table of variables including percent/count
Parameters:
dsetin - inputdataset
varlist - list of variables to be analyzed separated by spaces
dsetout - name of dataset to be created
Author: F.Khurshed
Date: November 2011
*/
%macro one_way_summary(dsetin, varlist, dsetout);
proc datasets nodetails nolist;
delete &dsetout;
quit;
*loop through variable list;
%let i=1;
%do %while (%scan(&varlist, &i, " ") ^=%str());
%let var=%scan(&varlist, &i, " ");
%put &i &var;
*Cross tab;
proc freq data=&dsetin noprint;
table &var/ out=temp1;
run;
*Get variable label as name;
data _null_;
set &dsetin (obs=1);
call symput('var_name', vlabel(&var.));
run;
%put &var_name;
*Add in Variable name and store the levels as a text field;
data temp2;
keep variable value count percent;
Variable = "&var_name";
set temp1;
value=input(&var, $50.);
percent=percent/100; * I like to store these as decimals instead of numbers;
format percent percent8.1;
drop &var.;
run;
%put &var_name;
*Append datasets;
proc append data=temp2 base=&dsetout force;
run;
/*drop temp tables so theres no accidents*/
proc datasets nodetails nolist;
delete temp1 temp2;
quit;
*Increment counter;
%let i=%eval(&i+1);
%end;
%mend;
%one_way_summary(sashelp.class, sex age, summary1);
proc report data=summary1 nowd;
column variable value count percent;
define variable/ order 'Variable';
define value / format=$8. 'Value';
define count/'N';
define percent/'Percentage %';
run;
EDIT (2022):
Better way of doing this is to use the ODS Tables:
/*This code is an example of how to generate a table with
Variable Name, Variable Value, Frequency, Percent, Cumulative Freq and Cum Pct
No macro's are required
Use Proc Freq to generate the list, list variables in a table statement if only specific variables are desired
Use ODS Table to capture the output and then format the output into a printable table.
*/
*Run frequency for tables;
ods table onewayfreqs=temp;
proc freq data=sashelp.class;
table sex age;
run;
*Format output;
data want;
length variable $32. variable_value $50.;
set temp;
Variable=scan(table, 2);
Variable_Value=strip(trim(vvaluex(variable)));
keep variable variable_value frequency percent cum:;
label variable='Variable'
variable_value='Variable Value';
run;
*Display;
proc print data=want(obs=20) label;
run;

The STACKODSOUTPUT option (alias STACKODS), added to PROC MEANS in 9.3, makes this a much simpler task.
proc means data=have n nmiss stackods;
ods output summary=want;
run;
| Variable | N | NMiss |
| ------ | ----- | ----- |
| a | 4 | 3 |
| b | 7 | 0 |
| c | 6 | 1 |
