I am working on a SAS Dataset which has missing values.
I can identify whether a particular variable has missing values using IS NULL/IS MISSING operator.
Is there any alternative way, through which I can identify which variables have missing values in one shot.
Thanks in Advance
The syntax IS NULL or IS MISSING is limited to use in SQL code (also in WHERE statements or WHERE= dataset options since those essentially use the same parser.)
To test if a value is missing you can also use the MISSING() function. Or compare it to a missing value. So for character variables test if it is equal to all blanks: c=' '. For numeric you can test x=., but you also need to look out for special missing values. So you might test if x <= .z.
To get a quick summary of number of distinct missing values for each variable you could use the NLEVEL option on PROC FREQ. Note it might not work for a large dataset with too many distinct values as the procedure will run out of memory.
use array and vname to find variable with missing values. If you want rows with missing values use cmiss function.
data have;
infile datalines missover;
input id num char $ var $;
datalines;
1 . A C
2 3 D
5 6 B D
;
/* gives variables with missing values*/
data want1(keep=miss);
set have;
array chars(*) _character_;
array nums(*) _numeric_;
do i=1 to dim(chars);
if chars(i)=' ' then
miss=vname(chars(i));
if nums(i)=. then
miss=vname(nums(i));
end;
if miss=' ' then
delete;
run;
/* gives rows with missing value*/
data want(drop=rows);
set have;
rows=cmiss(of id -- var);
if rows=1;
run;
You can use proc freq table statement with missing option. It includes missing category if missing values exist. Useful for categorical data.
data example;
input A Freq;
datalines;
1 2
2 2
. 2
;
*list variables in tables statement;
proc freq data=example;
tables A / missing;
run;
You can also use Proc Univariate it creates MissingValues table in ODS by default if any missing values exist. Useful for numeric data.
Two options (in addition to Peter Slezák's) I can suggest are :
- Use proc means with nmiss
proc means data = ___ n nmiss;
var _numeric_;
run;
In SAS Enterprise Guide, there is a characterize data task - this helps profile character variables too. (Under the hood, it is a combination of various procs, but is an easy to use option).
Hope this helps,
regards,
Sundaresh
Related
This is the code that I have with me:
%let data=sashelp.cars;
proc transpose data=&data(obs=0) out=names;
var _all_;
run;
proc sql;
select cats('_',_name_,'=missing(',_name_,');') into: stmts separated by ' ' from names;
run;
data missing;
set &data;
_BIGN = 1;
/*%m_expand_varlist(data=&data,expr=cats('_',_name_,'=missing(',_name_,');'));*/
&stmts;
keep _:;
run;
proc summary data=missing;
var _numeric_;
output out=smry sum=;
run;
proc transpose data=smry(drop=_type_ _freq_) out=smry_;
run;
The goal of this code is to output the number of missing values for both character and numeric variables in the data set. The code accomplishes that objective but I have difficulty in understanding the purpose of certain lines in the code.
May I know what the following part of the code does?
select cats('_',_name_,'=missing(',_name_,');') into: stmts separated by ' ' from names;
I understand that the into part just stores the value into the macro variable stmts but what does "separated by ' ' from names" in the above line mean?
data missing;
set &data;
_BIGN = 1;
&stmts;
keep _:;
run;
And in the above portion of the code, what is the purpose of "keep :"? What does the ":" do in that? And is the "_BIGN = 1" necessary?
And also in the final output table called smry_, I get underscores before the names of the variables. But I don't need these underscores. What can I do to remove them? When I removed the underscore after the "keep :", the underscores in the smry table went away but I was left with only 10 rows instead of 15. Help would be appreciated. Thank you.
Before answering your questions, let me disclaim that this is not the way to go if you simply want a count of missing values for each numeric variable in your data.
However, this seems to be more of a practice assignment than an actual problem.
The Separated By Clause simply inserts the specific string between the values in the macro variable if the data source has >1 items. In this case the names data set has 15 items, so all 15 values are listed with a few spaces between them.
The colon operator in the keep statement tells the data step to keep only variables prefixed with an underscore.
_BIGN is not strictly necessary. However, it seems that the author of the code wants a simple count of observations in the final data set. That is all it does.
The underscores are applied to each variable name in the creation of the macro variable in. It is probably done to avoid conflicts between variable names (though this is technically still possible). Obviously, you can simply remove the underscore in the final data set.
As you know the "into" clause stores the values in the macro variable. The "separated by" leads to a list of values stored in the variable with spaces as delimiter here. If you don't use this you have only the value of the first row in your macro variable.
The ":" is a wildcard that means you keep all the variables starting with an underscore:
keep _:;
I managed to remove the underscores in the final data set using the following code:
data _smry_(drop = _name_);
set smry_;
name=compress(_name_, , 'kas');
run;
You could just rename the underscore variables back to original.
%let data=sashelp.heart;
proc transpose data=&data(obs=0) out=names;
var _all_;
run;
proc sql;
select cats('Label _',_name_,'=',quote(_label_),';') into: lb_stmts separated by ' ' from names;
select cats('_',_name_,'=missing(',_name_,');') into: stmts separated by ' ' from names;
select cats('_',_name_,'=',_name_) into: rn_stmts separated by ' ' from names;
run;
options symbolgen=1;
data missingV / view=missingV;
set &data;
&stmts;
&lb_stmts;
rename &rn_stmts;
keep _:;
run;
options symbolgen=0;
ods output summary=summary;
proc means data=missingV Sum Mean N STACKODSOUTPUT;
var _numeric_;
run;
ods output close;
proc print width=minimum label;
label sum='#Missing' mean='%Missing';
format sum 8. mean percentn8.1;
run;
I have a set of input macro variables in SAS. They are dynamic and generated based on the user selection in a sas stored process.
For example:There are 10 input values 1 to 10.
The name of the macro variable is VAR_. If a user selects 2,5,7 then 4 macro variables are created.
&VAR_0=3;
&VAR_=2;
&VAR_1=5;
&VAR_2=7;
The first one with suffix 0 provides the count. The next 3 provides the values.
Note:If a user select only one value then only one macro variable is created. For example If a user selects 9 then &var_=9; will be created. There will not be any count macro variable.
I am trying to create a sas table using these variables.
It should be like this
OBS VAR
-----------
1 2
2 5
3 7
-----------
This is what I tried. Not sure if this is the right way to do approach it.
It doesn't give me a final solution but I can atleast get the name of the macro variables in a table. How can I get their values ?
data tbl1;
do I=1 to &var_0;
VAR=CAT('&VAR_',I-1);
OUTPUT;
END;
RUN;
PROC SQL;
CREATE TABLE TBL2 AS
SELECT I,
CASE WHEN VAR= '&VAR_0' THEN '&VAR_' ELSE VAR END AS VAR
from TBL1;
QUIT;
Thank You for your help.
Jay
SAS helpfully stores them in a table for you already, you just need to parse out the ones you want. The table is called SASHELP.VMACRO or DICTIONARY.MACROS
Here's an example:
%let var=1;
%let var2=3;
%let var4=5;
proc sql;
create table want as
select * from sashelp.vmacro
where name like 'VAR%';
quit;
proc print data=want;
run;
I think the real issue is the inconsistent behavior of the stored process. It only creates the 0 and 1 variable when there are multiple selections. I think that your example is a little off. If the value of VAR_0 is three then their should be a VAR_3 macro variable. Also the value of VAR_ and VAR_1 should be set to the same thing.
To fix this in the past I have done something like this. First let's assign the parameter name a macro variable so that the code is reusable for other programs.
%let name=VAR_;
Then first make sure the minimal macro variables exist.
%global &name &name.0 &name.1 ;
Then make sure that you have a count by setting the 0 variable to 1 when it is empty.
%let &name.0 = %scan(&&&name.0 1,1);
Then make sure that you have a 1 variable. Since it should have the same value as the macro variable without a suffix just re-assign it.
%let &name.1 = &&&name ;
Now your data step is easier.
data want ;
length var $32 value $200 ;
do i=1 to &&&name.0 ;
var=cats(symget('name'),i);
value=symget(var);
output;
end;
run;
I don't understand your numbering scheme and recommend changing it, if you can; the &var_ variable is very confusing.
Anyway, the easiest way to do this is SYMGET. That returns a value from the macro symbol table which you can specify at runtime.
%let VAR_0=3;
%let VAR_=2;
%let VAR_1=5;
%let VAR_2=7;
data want;
do obs = 1 to &var_0.;
var = input(symget(cats('VAR_',ifc(obs=1,'',put(obs-1,2.)))),2.);
output;
end;
run;
I have a table with postings by category (a number) that I transposed. I got a table with each column name as _number for example _16, _881, _853 etc. (they aren't in order).
I need to do the sum of all of them in a proc sql, but I don't want to create the variable in a data step, and I don't want to write all of the columns names either . I tried this but doesn't work:
proc sql;
select sum(_815-_16) as nnl
from craw.xxxx;
quit;
I tried going to the first number to the last and also from the number corresponding to the first place to the one corresponding to the last place. Gives me a number that it's not correct.
Any ideas?
Thanks!
You can't use variable lists in SQL, so _: and var1-var6 and var1--var8 don't work.
The easiest way to do this is a data step view.
proc sort data=sashelp.class out=class;
by sex;
run;
*Make transposed dataset with similar looking names;
proc transpose data=class out=transposed;
by sex;
id height;
var height;
run;
*Make view;
data transpose_forsql/view=transpose_forsql;
set transposed;
sumvar = sum(of _:); *I confirmed this does not include _N_ for some reason - not sure why!;
run;
proc sql;
select sum(sumvar) from transpose_Forsql;
quit;
I have no documentation to support this but from my experience, I believe SAS will assume that any sum() statement in SQL is the sql-aggregate statement, unless it has reason to believe otherwise.
The only way I can see for SAS to differentiate between the two is by the way arguments are passed into it. In the below example you can see that the internal sum() function has 3 arguments being passed in so SAS will treat this as the SAS sum() function (as the sql-aggregate statement only allows for a single argument). The result of the SAS function is then passed in as the single parameter to the sql-aggregate sum function:
proc sql noprint;
create table test as
select sex,
sum(sum(height,weight,0)) as sum_height_and_weight
from sashelp.class
group by 1
;
quit;
Result:
proc print data=test;
run;
sum_height_
Obs Sex and_weight
1 F 1356.3
2 M 1728.6
Also note a trick I've used in the code by passing in 0 to the SAS function - this is an easy way to add an additional parameter without changing the intended result. Depending on your data, you may want to swap out the 0 for a null value (ie. .).
EDIT: To address the issue of unknown column names, you can create a macro variable that contains the list of column names you want to sum together:
proc sql noprint;
select name into :varlist separated by ','
from sashelp.vcolumn
where libname='SASHELP'
and memname='CLASS'
and upcase(name) like '%T' /* MATCHES HEIGHT AND WEIGHT */
;
quit;
%put &varlist;
Result:
Height,Weight
Note that you would need to change the above wildcard to match your scenario - ie. matching fields that begin with an underscore, instead of fields that end with the letter T. So your final SQL statement will look something like this:
proc sql noprint;
create table test as
select sex,
sum(sum(&varlist,0)) as sum_of_fields_ending_with_t
from sashelp.class
group by 1
;
quit;
This provides an alternate approach to Joe's answer - though I believe using the view as he suggests is a cleaner way to go.
I have a SAS dataset which has 20 character variables, all of which are names (e.g. Adam, Bob, Cathy etc..)
I would like a dynamic code to create variables called Adam_ref, Bob_ref etc.. which will work even if there a different dataset with different names (i.e. don't want to manually define each variable).
So far my approach has been to use proc contents to get all variable names and then use a macro to create macro variables Adam_ref, Bob_ref etc..
How do I create actual variables within the dataset from here? Do I need a different approach?
proc contents data=work.names
out=contents noprint;
run;
proc sort data = contents; by varnum; run;
data contents1;
set contents;
Name_Ref = compress(Name||"_Ref");
call symput (NAME, NAME_Ref);
%put _user_;
run;
If you want to create an empty dataset that has variables named like some values you have in a macro variables you could do something like this.
Save the values into macro variables that are named by some pattern, like v1, v2 ...
proc sql;
select compress(Name||"_Ref") into :v1-:v20 from contents;
quit;
If you don't know how many values there are, you have to count them first, I assumed there are only 20 of them.
Then, if all your variables are character variables of length 100, you create a dataset like this:
%macro create_dataset;
data want;
length %do i=1 %to 20; &&v&i $100 %end;
;
stop;
run;
%mend;
%create_dataset; run;
This is how you can do it if you have the values in macro variable, there is probably a better way to do it in general.
If you don't want to create an empty dataset but only change the variable names, you can do it like this:
proc sql;
select name into :v1-:v20 from contents;
quit;
%macro rename_dataset;
data new_names;
set have(rename=(%do i=1 %to 20; &&v&i = &&v&i.._ref %end;));
run;
%mend;
%rename_dataset; run;
You can use PROC TRANSPOSE with an ID statement.
This step creates an example dataset:
data names;
harry="sally";
dick="gordon";
joe="schmoe";
run;
This step is essentially a copy of your step above that produces a dataset of column names. I will reuse the dataset namerefs throughout.
proc contents data=names out=namerefs noprint;
run;
This step adds the "_Refs" to the names defined before and drops everything else. The variable "name" comes from the column attributes of the dataset output by PROC CONTENTS.
data namerefs;
set namerefs (keep=name);
name=compress(name||"_Ref");
run;
This step produces an empty dataset with the desired columns. The variable "name" is again obtained by looking at column attributes. You might get a harmless warning in the GUI if you try to view the dataset, but you can otherwise use it as you wish and you can confirm that it has the desired output.
proc transpose out=namerefs(drop=_name_) data=namerefs;
id name;
run;
Here is another approach which requires less coding. It does not require running proc contents, does not require knowing the number of variables, nor creating a macro function. It also can be extended to do some additional things.
Step 1 is to use built-in dictionary views to get the desired variable names. The appropriate view for this is dictionary.columns, which has alias of sashelp.vcolumn. The dictionary libref can be used only in proc sql, while th sashelp alias can be used anywhere. I tend to use sashelp alias since I work in windows with DMS and can always interactively view the sashelp library.
proc sql;
select compress(Name||"_Ref") into :name_list
separated by ' '
from sashelp.vcolumn
where libname = 'WORK'
and memname = 'NAMES';
quit;
This produces a space delimited macro vaiable with the desired names.
Step 2 To build the empty data set then this code will work:
Data New ;
length &name_list ;
run ;
You can avoid assuming lengths or create populated dataset with new variable names by using a slightly more complicated select statement.
For example
select compress(Name)||"_Ref $")||compress(put(length,best.))
into :name_list
separated by ' '
will generate a macro variable which retains the previous length for each variable. This will work with no changes to step 2 above.
To create populated data set for use with rename dataset option, replace the select statement as follows:
select compress(Name)||"= "||compress(_Ref")
into :name_list
separated by ' '
Then replace the Step 2 code with the following:
Data New ;
set names (rename = ( &name_list)) ;
run ;
I am trying to create categorical variables in sas. I have written the following macro, but I get an error: "Invalid symbolic variable name xxx" when I try to run. I am not sure this is even the correct way to accomplish my goal.
Here is my code:
%macro addvars;
proc sql noprint;
select distinct coverageid
into :coverageid1 - :coverageid9999999
from save.test;
%do i=1 %to &sqlobs;
%let n=coverageid&i;
%let v=%superq(&n);
%let f=coverageid_&v;
%put &f;
data save.test;
set save.test;
%if coverageid eq %superq(&v)
%then &f=1;
%else &f=0;
run;
%end;
%mend addvars;
%addvars;
You're combining macro code with data step code in a way that isn't correct. %if = macro language, meaning you are actually evaluating whether the text "coverageid" is equal to the text that %superq(&v) evaluates to, not whether the contents of the coverageid variable equal the value in &v. You could just convert %if to if, but even if you got that to work properly it would be hideously inefficient (you're rewriting the dataset N times, so if you have 1500 values for coverageID you rewrite the entire 500MB dataset or whatnot 1500 times, instead of just once).
If what you want to do is take the variable 'coverageid' and convert it to a set of variables that consist of all possible values of coverageid, 1/0 binary, for each, there are a nubmer of ways to do it. I'm fairly sure the ETS module has a procedure that just does this, but I don't recall it off the top of my head - if you were to post this to the SAS mailing list, one of the guys there would undoubtedly have it quickly.
The simple way for me, is to do this with entirely datastep code. First determine how many potential values there are for COVERAGEID, then assign each to a direct value, then assign the value to the correct variable.
If the COVERAGEID values are consecutive (ie, 1 to some number, no skips, or you don't mind skipping) then this is easy - set up an array and iterate over it. I will assume they are NOT consecutive.
*First, get the distinct values of coverageID. There are a dozen ways to do this, this works as well as any;
proc freq data=save.test;
tables coverageid/out=coverage_values(keep=coverageid);
run;
*Then save them into a format. This converts each value to a consecutive number (so the lowest value becomes 1, the next lowest 2, etc.) This is not only useful for this step, but it can be useful in the future in converting back.;
data coverage_values_fmt;
set coverage_values;
start=coverageid;
label=_n_;
fmtname='COVERAGEF';
type='i';
call symputx('CoverageCount',_n_);
run;
*Import the created format;
proc format cntlin=coverage_values_fmt;
quit;
*Now use the created format. If you had already-consecutive values, you could skip to this step and skip the input statement - just use the value itself;
data save.test_fin;
set save.test;
array coverageids coverageid1-coverageid&coveragecount.;
do _t = 1 to &coveragecount.;
if input(coverageid,COVERAGEF.) = _t then coverageids[_t]=1;
else coverageids[_t]=0;
end;
drop _t;
run;
Here's another way that doesn't use formats, and may be easier to follow.
First, just make some test data:
data test;
input coverageid ##;
cards;
3 27 99 105
;
run;
Next, create a data set with no observations but one variable for each level of coverageid. Note that this approach allows arbitrary values here.
proc transpose data=test out=wide(drop=_name_);
id coverageid;
run;
Finally, create a new data set that combines the initial data set and the wide one. Then, for each level of x, look at each categorical variable and decide whether to turn it "on".
data want;
set test wide;
array vars{*} _:;
do i=1 to dim(vars);
vars{i} = (coverageid = substr(vname(vars{i}),2,1));
end;
drop i;
run;
The line
vars{i} = (coverageid = substr(vname(vars{i}),2));
may require more explanation. vname returns the name of the variable, and since we didn't specify a prefix in proc transpose, all variables are named something like _1, _2, etc. So we take the substring of the variable name that starts in the second position, and compare it to coverageid; if they're the same, we set the variable to 1; otherwise it evaluates to 0.