Avoiding invalid data statements on conversion to numeric - sas

I have a large table with several variables which will be input to a statistical analysis. As a statistician, I prefer all factors to be numeric, so that they work predictably in regression models, with formats to show labels for numeric values, e.g. race, sex.
When I set the original data, I rename all the character variables that I want to recode to add a suffix of c (for character values). I then use the input function on each of them, with best1., best8., or another appropriate informat.
The problem is that the log becomes cluttered with invalid-data notes when missing values are coded as actual missing values, e.g. a period (.). I could add a line of if not missing(varc) then var = input(varc, best8.); for each variable, but this seems inefficient and hard to read.
Is there a better way to handle this?

If you're okay with suppressing the invalid-data messages entirely (including ones that could point to a real problem), you can prepend ?? to the informat to tell it not to give you that warning.
var=input(varc,??8.);

Create an informat like this and use it instead of best8.:
proc format library=work;
invalue myform
'' = .
other=[best8.] /* brackets refer to an existing informat */
;
run;
options fmtsearch=(work);
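A minimal sketch of how such an informat might then be applied, assuming a hypothetical input dataset have with a character variable var that is renamed to varc as described in the question:

```sas
/* Assumption: WORK.HAVE and its character variable VAR are hypothetical. */
data want;
  set have(rename=(var=varc));
  /* Blank strings map to a numeric missing value; everything else is read with best8. */
  var = input(varc, myform.);
run;
```

Values that are neither blank nor valid numbers would still produce invalid-data notes, which may be what you want if those notes flag real problems.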

Can you try to use
PROC STDIZE ..
REPONLY MISSING=0;
RUN;

Related

What's the difference between how the where statement and the if statement handle character variables in SAS 9.4M6?

Recently, I happened to find an interesting thing:
data test;
set sashelp.class;
if _N_ in (2 3 5 7) then name = '';
run;
data test;
set ;
where name;
run;
The result is: data Test has 15 observations.
I have used SAS 9.4M6 for over a year and don't remember the where statement being able to treat a character variable like a boolean. When I simply change where to if, the log reports that name is not a numeric variable, and data Test has 0 observations.
So here are my two questions:
1. Is where name; another way of writing where name is not missing? If not, what happens when I submit where name?
2. When did this kind of code (where <variable>;) start being allowed in SAS?
Thanks for any tips.
Yes, it is a test for non-empty values. I have used it for years, mainly when interactively browsing a dataset. I suspect it has been there since SAS introduced the WHERE statement, or at least since they refactored it to use the same syntax as SQL WHERE clauses.
WHERE statements use PROC SQL style syntax (LIKE is allowed, variable lists are not), but IF statements use normal SAS syntax.
So if you used
if name ;
in your data step then you would see notes about SAS trying to convert the character variable NAME into a number that it can evaluate as FALSE (zero or missing) or TRUE (any other value). Since the names in SASHELP.CLASS cannot be converted to numbers, all of them will be treated as FALSE.
Interesting find! I also have not noticed this. I tried this on both 9.4M4 and 9.4M5 and received the same results. This may be something added even earlier than that. I see no documentation on it as well. From testing, here is what I am able to see:
where statements can automatically treat a character variable, as in where name, as a boolean expression
if statements do not automatically convert characters to boolean and require the user to explicitly state the character you are trying to exclude, e.g.
Code:
data want;
set test ;
if(NOT missing(name));
run;
I cannot tell when this was added, but it is earlier than versions I have access to. Your test code would be a great quiz question!

Dealing with missing values in SAS?

I am working with a SAS dataset that has few observations, and I need to relate a variable to its lag.
By doing this, I lose a record, which results in a missing value.
Does anyone know how Base SAS handles such values in procedures such as PROC REG or PROC CORR?
Thank you all in advance.
It depends a bit on what exactly you're doing. PROC REG excludes missing values listwise: if any variable used or referenced in the PROC is missing in a row, that row is excluded.
http://documentation.sas.com/?docsetId=lrcon&docsetTarget=p175x77t7k6kggn1io94yedqagl3.htm&docsetVersion=9.4&locale=en
PROC CORR has several options that allow you to specify how it handles the missing values. By default it uses pairwise deletion, so each correlation is computed from all rows where that pair of variables is nonmissing. The NOMISS option tells SAS to use only complete cases.
https://blogs.sas.com/content/iml/2012/01/13/missing-values-pairwise-corr.html
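As a rough sketch of the situation in the question, assuming a hypothetical dataset have with a numeric variable x, the lagged variable is missing in the first row and both procedures simply leave that row out:

```sas
/* Assumption: WORK.HAVE and the variable X are hypothetical. */
data lagged;
  set have;
  x_lag = lag(x);   /* missing for the first observation */
run;

/* NOMISS restricts PROC CORR to rows where every VAR variable is nonmissing. */
proc corr data=lagged nomiss;
  var x x_lag;
run;

/* PROC REG drops any row with a missing value in the model variables. */
proc reg data=lagged;
  model x = x_lag;
run;
quit;
```

With only two variables, the pairwise and listwise results coincide; the distinction starts to matter once more variables with different missing patterns are listed.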

SAS proc Freq & gchart display additional value's frequency/ bars

This might be a weird question. I have a data set that contains data like agree, neutral, disagree... for many questions. There are not many observations, so for some questions one or more options have a frequency of 0, say neutral. When I run proc freq, since neutral never shows up in that variable, the table does not contain a row for neutral. I end up with tables with different numbers of rows. I would like to know if there is an option to show these 0 frequency rows. I will also need to run proc gchart on the same data set, and I will run into the same problem of having different numbers of bars. Please help me with this. Thank you!
This depends on how exactly you are running your PROC FREQ. It has the sparse option, which tells it to create a value for every logical cell on the table when creating an output dataset; normally, while you would have a cell with a missing value (or zero) in a crosstab, if that is output to a dataset (which is vertical, ie each combination of x and y axis value are placed in one row) those rows are left off. Sparse makes sure that doesn't happen; and in a larger (n-dimensional) crosstab, it creates rows for every possible combination of every variable, even ones that don't occur in the data.
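A short sketch of the sparse option for a two-way table written to an output dataset, assuming two hypothetical question variables q1 and q2 in mydata:

```sas
/* Assumption: Q1 and Q2 are hypothetical variables in MYDATA. */
proc freq data=mydata noprint;
  /* SPARSE keeps rows for combinations of q1 and q2 with a count of zero. */
  tables q1*q2 / sparse out=counts;
run;
```

This only fills in combinations of values that each appear somewhere in the data; a level that never occurs at all is still unknown to the procedure.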
However, if you're just doing
proc freq data=mydata;
tables myvar;
run;
That won't help you, as SAS doesn't really have anything to go on to figure out what should be there.
For that, you have to use a class variable procedure. Proc Tabulate is one such procedure, and is similar to Proc Freq in its syntax (sort of). You need to either use CLASSDATA on the proc statement, or PRINTMISS on the table statement. In the former case, you do not need to use a format, I don't believe. In the latter case (PRINTMISS), you need to create a format for your variable (if you don't already have one) that contains all levels of the data that you want to display (even if it's just an identity format, e.g. formatting character strings to identical character strings), and specify PRELOADFMT on the class statement. See this man page for more details.
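A minimal sketch of the PRINTMISS approach, assuming the variable myvar holds the strings agree, neutral and disagree; the format name $resp is made up for illustration:

```sas
/* Identity format listing every level that should appear in the table. */
/* Assumption: the format name $RESP and the exact level spellings are hypothetical. */
proc format;
  value $resp
    'agree'    = 'agree'
    'neutral'  = 'neutral'
    'disagree' = 'disagree';
run;

proc tabulate data=mydata;
  class myvar / preloadfmt;   /* load all levels from the format */
  format myvar $resp.;
  /* PRINTMISS shows levels with no data; MISSTEXT prints 0 instead of a blank. */
  table myvar, n / printmiss misstext='0';
run;
```

The same format could be reused for every question variable so that all tables end up with the same rows.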

Cycling through all variables

I started to learn SAS fairly recently and am getting the basics down pretty well, but have a question regarding something that is a little outside of my current realm of knowledge. Does anyone happen to know of a way to cycle through all variables in a SAS dataset? I know how to run a do loop/array on variables in a range (x1-x99), but ideally would like to look at every variable without having to rename any variables. Basically, I'm looking to run through a dataset and change variable values when the current value = 'True'/'False'. My guess is that I'll need to use proc contents in some way here, but I'm not really sure how to go about using it correctly. Any tips/insight would be greatly appreciated. Thanks!
You can create an array of non-similarly-named variables. You're on the right track with PROC CONTENTS, although you also can use dictionary.columns or sashelp.vcolumn, which contain basically the same information.
proc sql;
select name into :collist separated by ' '
from dictionary.columns
where memname='DATASETNAME' and libname='LIBNAME' and <other criteria>;
quit;
The variables have to be all of the same type (char/numeric) so you may want to include a criterion of variable type in your query, plus any other limiting factor you may need.
That will create a list, &collist., in a macro variable you can use in your array
array vars &collist.;
and now you can loop over the array.
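Put together, the approach might look like the sketch below. It assumes a hypothetical dataset WORK.HAVE with at least one character variable and no existing variable named i, and it recodes the strings to '1' and '0' purely as an illustration:

```sas
/* Assumption: WORK.HAVE is a hypothetical dataset name. */
proc sql noprint;
  select name into :collist separated by ' '
    from dictionary.columns
    where libname='WORK' and memname='HAVE' and type='char';
quit;

data want;
  set work.have;
  array vars {*} &collist.;            /* all character variables */
  do i = 1 to dim(vars);
    if vars{i} = 'True' then vars{i} = '1';
    else if vars{i} = 'False' then vars{i} = '0';
  end;
  drop i;
run;
```

Note that libname and memname are stored in uppercase in dictionary.columns, so the literals in the where clause need to be uppercase too.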
You may also be able to cheat a bit, if all of your variables are the same type and you know the order is fixed. The double dash list (x1--x99) means 'in variable order, all variables from x1 to x99' and doesn't require numeric suffixes or anything like that.
Finally, you also might be able to write a format in PROC FORMAT to accomplish what you need, depending on what you are intending to do (mapping TRUE to 1 and FALSE to 0 or something like that).
Adding to Joe's answer: you can overcome the requirement that all variables be of the same type. For that you can use a macro loop instead of an array. First you need to define the macro:
%macro loop;
%do i=1 %to %sysfunc(countw(&collist));
....
<here goes your code for changing values, where instead of a variable name
you use macro function %scan(&collist,&i)>
....
%end;
%mend loop;
and now you can paste %loop into the DATA step where you're going to process all variables.

Creating a template dataset from multiple datasets with different columns

Currently, I have several sets of business unit data that I'd like to put into a standard template format. Some business unit data contains columns that others don't. I would like to check if certain columns exist and then to create them if they don't. I understand that techniques to achieve similar functionality have been discussed earlier, here and here. However, I was wondering if a better method exists.
My current code is:
data Source_Data4;
set Interm.Source_Data3;
if 0 then do;
a="";
b="";
end;
run;
Using the RETAIN statement should be the fastest and easiest way to do this. If the field you are checking for is numeric then put a . instead of "".
data Source_Data4;
set Interm.Source_Data3;
retain a b "";
run;
If you have multiple datasets with different columns that you want to use a template for, an excellent way to do this is something like this:
data want;
if 0 then set template;
set have2;
run;
This is far easier to code than a bunch of retain/length statements. It accomplishes the same result as the retain solution (it defines the PDV), with one exception: it will define lengths and formats of variables based on template (while retain does not affect length or format). This may or may not be desirable, depending on your use case. It is very helpful when combining multiple datasets, as it provides a single point at which length/format differences can be tested for; once this step occurs, you can be confident that your various datasets are all identical in variable length/format.
Creating this dataset can be done a number of ways. One simple way is:
data template;
if 0 then set have;
if 0 then set have2;
stop;
run;
That will create a blank dataset with the variables of have in their original order, followed by any new variables from have2. If that's not desired, you may want to add a RETAIN statement prior to the if 0's that draws from a data dictionary.
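Once the template exists, a sketch of using it to stack the business-unit datasets could look like this; the output name all_units is hypothetical, and have and have2 are the datasets from the example above:

```sas
/* Assumption: ALL_UNITS is a hypothetical output dataset name. */
data all_units;
  if 0 then set template;   /* never executes; only defines variables, lengths, formats */
  set have have2;           /* stack the datasets; absent columns stay missing */
run;
```

Because the template is read first at compile time, its variable order, lengths and formats take precedence over those in the individual datasets.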