I've got some weird results I don't quite understand. I create a data set in a data step, using several data sets in the set statement. There is a variable that is present in some of the datasets, but not in all of them. If this variable is missing in my new dataset, I want to give it some value. This creates a dangerously non-intuitive result and no warnings or errors.
In the example below, y is not present in test1. When creating test3, the behavior is as I would expect: z is assigned the x value from the same row for all the observations coming from test1. But test4 is not what I expect: the first value of x is repeated for all the rows from test1. Why is this?
data test1;
x=1;
output;
x=2;
output;
x=3;
output;
run;
data test2;
x=1;
y=2;
run;
data test3;
set test1 test2;
if missing(y) then z=x;
run;
data test4;
set test1 test2;
if missing(y) then y=x;
run;
The answer is in the "When Variable Values Are Automatically Set to Missing by SAS" section of the Missing Variable Values documentation:
When variables are read with a SET, MERGE, or UPDATE statement, SAS
sets the values to missing only before the first iteration of the DATA
step. (If you use a BY statement, the variable values are also set to
missing when the BY group changes.) The variables retain their values
until new values become available (for example, through an assignment
statement or through the next execution of the SET, MERGE, or UPDATE
statement). Variables created with options in the SET, MERGE, and
UPDATE statements also retain their values from one iteration to the
next.
This means that in the test4 data step, missing(y) is true only in the first iteration. There you set y = 1, and that value is retained in the program data vector (PDV) until the SET statement reads a value of y from test2.
That is not an issue in test3, because you do not overwrite y.
Variables that are created new by the data step, like the Z in your step that creates TEST3, are set to missing at the start of each iteration of the data step.
But variables that come from source datasets are "retained" (that is, not set to missing automatically). So in the data step that creates TEST4, once a value is assigned to Y it is retained. Of course, when the SET statement reads an observation from TEST2, the value of Y that had been retained from the previous iteration is overwritten.
Add some PUT statements so you can watch the values of X, Y (and Z) as they change. First data step:
data test3;
put 'Before SET: ' (_n_ x y z) (=);
set test1 test2;
put ' After SET: ' (_n_ x y z) (=);
if missing(y) then z=x;
put ' After IF : ' (_n_ x y z) (=);
run;
Before SET: _N_=1 x=. y=. z=.
After SET: _N_=1 x=1 y=. z=.
After IF : _N_=1 x=1 y=. z=1
Before SET: _N_=2 x=1 y=. z=.
After SET: _N_=2 x=2 y=. z=.
After IF : _N_=2 x=2 y=. z=2
Before SET: _N_=3 x=2 y=. z=.
After SET: _N_=3 x=3 y=. z=.
After IF : _N_=3 x=3 y=. z=3
Before SET: _N_=4 x=3 y=. z=.
After SET: _N_=4 x=1 y=2 z=.
After IF : _N_=4 x=1 y=2 z=.
Before SET: _N_=5 x=1 y=2 z=.
NOTE: There were 3 observations read from the data set WORK.TEST1.
NOTE: There were 1 observations read from the data set WORK.TEST2.
NOTE: The data set WORK.TEST3 has 4 observations and 3 variables.
NOTE: DATA statement used (Total process time):
real time 0.00 seconds
cpu time 0.01 seconds
Second data step:
data test4;
put 'Before SET : ' (_n_ x y) (=);
set test1 test2;
put ' After SET : ' (_n_ x y) (=);
if missing(y) then y=x;
put ' After IF : ' (_n_ x y) (=);
run;
Before SET : _N_=1 x=. y=.
After SET : _N_=1 x=1 y=.
After IF : _N_=1 x=1 y=1
Before SET : _N_=2 x=1 y=1
After SET : _N_=2 x=2 y=1
After IF : _N_=2 x=2 y=1
Before SET : _N_=3 x=2 y=1
After SET : _N_=3 x=3 y=1
After IF : _N_=3 x=3 y=1
Before SET : _N_=4 x=3 y=1
After SET : _N_=4 x=1 y=2
After IF : _N_=4 x=1 y=2
Before SET : _N_=5 x=1 y=2
NOTE: There were 3 observations read from the data set WORK.TEST1.
NOTE: There were 1 observations read from the data set WORK.TEST2.
NOTE: The data set WORK.TEST4 has 4 observations and 2 variables.
NOTE: DATA statement used (Total process time):
real time 0.00 seconds
cpu time 0.00 seconds
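One way to get the result the question expected from test4 is to clear the retained value of y whenever the current row comes from the dataset that lacks it. The following is only a sketch built on the question's test1 and test2; the output name test4_fixed and the IN= flag name from_test1 are made up for illustration.

```sas
/* Same inputs as in the question */
data test1;
  do x = 1 to 3;
    output;
  end;
run;

data test2;
  x = 1;
  y = 2;
run;

data test4_fixed;
  /* from_test1 is a hypothetical IN= flag: 1 when the row was read from test1 */
  set test1(in=from_test1) test2;
  /* y is not in test1, so clear whatever was retained from the previous iteration */
  if from_test1 then call missing(y);
  if missing(y) then y = x;
run;
```

With this change the rows from test1 should get y equal to their own x (1, 2, 3), and the row from test2 keeps y=2.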
I stumbled upon the regexp option in proc format, so I gave it a try and ended up puzzled.
proc format;
invalue test
'/n\(.*\)/i'(regexp) = 1
;
run;
data _null_;
x = 'n(ADT,TRTDT)';
y = input(x,test.);
z = prxmatch('/n\(.*\)/i',x)^=0;
put y = z = ;
run;
I had thought that the regexp option would behave like prxmatch() in a data step, but apparently I was wrong.
NOTE: Invalid argument to function INPUT at row 466 column 9.
y=. z=1
x=n(ADT,TRTDT) y=. z=1 _ERROR_=1 _N_=1
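I have searched the help documentation and found nothing that really helps.
How does the regexp option in proc format work? Feel free to share your opinion, thanks.
You defined an informat with a default width of 10 and tried to read a string of length 12. Only the first 10 characters, n(ADT,TRTD, reach the informat, and they contain no closing parenthesis, so the pattern does not match. Specifying a wider width on the INPUT call fixes it: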
data _null_;
x = 'n(ADT,TRTDT)';
y1 = input(x,??test.);
y2 = input(x,??test20.);
z = prxmatch('/n\(.*\)/i',x)^=0;
put (_all_) (=);
run;
Results:
x=n(ADT,TRTDT) y1=. y2=1 z=1
You can add the DEFAULT= option to the INVALUE statement to change the default width.
proc format;
invalue test (default=40)
'/n\(.*\)/i'(regexp) = 1
;
run;
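The truncation can be reproduced without the informat. The sketch below applies the same pattern with prxmatch() to the full string and to its first 10 characters; the variable names len, first10, match_first10 and match_full are illustrative only.

```sas
data _null_;
  x = 'n(ADT,TRTDT)';
  len = length(x);
  /* what an informat with width 10 gets to see */
  first10 = substr(x, 1, 10);
  /* the closing parenthesis is cut off, so this should not match */
  match_first10 = prxmatch('/n\(.*\)/i', first10) ^= 0;
  /* the full string contains both parentheses and should match */
  match_full = prxmatch('/n\(.*\)/i', x) ^= 0;
  put (_all_) (=);
run;
```

If the explanation above is right, match_first10 comes out as 0 and match_full as 1.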
I have a question about the cumulative (running) addition of a particular column.
How do I write SAS code that generates a cumulative sum over a column? Please help me with this.
Thank you in advance.
Use a sum statement. The variable on the left of the + is automatically retained across iterations and initialized to 0:
data have;
input val;
datalines;
1
2
3
;
data want;
set have;
newval+val;
run;
Alternatively, use a RETAIN statement.
You can reuse the code below as a basis for any iterative or cumulative calculation.
data have;
input A;
datalines;
1
2
3
;
run;
data want;
set have;
Retain B;
/* If condition to initialize B only once, _N_ is the current row number */
if _N_= 1 then B=0;
B=B+A;
/* put statement will print the table in the log */
put _all_;
run;
Output:
A=1 B=1 _N_=1
A=2 B=3 _N_=2
A=3 B=6 _N_=3
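The two approaches differ when the input column contains missing values: the sum statement skips them, while an ordinary assignment such as B=B+A propagates them. A small sketch, using a made-up dataset have_miss to show both side by side:

```sas
data have_miss;
  input A;
  datalines;
1
.
3
;
run;

data want_miss;
  set have_miss;
  retain B 0;
  /* plain addition: B becomes missing at the second row and stays missing */
  B = B + A;
  /* sum statement: the missing A is ignored, so C keeps accumulating */
  C + A;
  put _all_;
run;
```

Which behavior is preferable depends on whether a missing input should invalidate the running total or simply be skipped.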
My categorical variable has four levels - east, west, north, south. I want these levels to be 1, 2, 3, 4 (numeric form). How do I do that in SAS? Thank you!
There are reasons to prefer an INFORMAT over a FORMAT for creation of a numeric variable.
proc format cntlout= cntl;
value $numvar
east = 1
west = 2
north = 3
south = 4
other=.
;
invalue numvar(upcase)
EAST = 1
WEST = 2
NORTH = 3
SOUTH = 4
other=.
;
run;
data _null_;
do x='norTH' , 'South' , 'East' , 'west' , 'outer';
length b 8;
b = put(x,$numvar.);
c = input(x,numvar.);
put _all_;
end;
run;
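Notice the different results. The informat (with UPCASE) matches regardless of case, and INPUT produces no conversion NOTE; only the PUT call on line 46 triggers one, because its character result has to be converted to numeric: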
43 data _null_;
44 do x='norTH' , 'South' , 'East' , 'west' , 'outer';
45 length b 8;
46 b = put(x,$numvar.);
47 c = input(x,numvar.);
48 put _all_;
49 end;
50 run;
NOTE: Character values have been converted to numeric values at the places given by: (Line):(Column).
46:11
x=norTH b=. c=3 _ERROR_=0 _N_=1
x=South b=. c=4 _ERROR_=0 _N_=1
x=East b=. c=1 _ERROR_=0 _N_=1
x=west b=2 c=2 _ERROR_=0 _N_=1
x=outer b=. c=. _ERROR_=0 _N_=1
NOTE: DATA statement used (Total process time):
The simplest and most appropriate way is to create a format:
proc format;
value $numvar
east = 1
west = 2
north = 3
south = 4
;
run;
In the data step, create the new numeric variable. PUT always returns a character value, so wrap it in INPUT to get a true numeric:
/* data step code */
new_var = input(put(your_categorical_variable, $numvar.), 8.);
/* data step code */
The advantage of this approach is that you can easily change the coding if necessary: you make the change only in proc format, not in every data step where you convert variables. That is not possible when you hard-code the mapping:
if var='east' then new_var=1 ...
Just curious: is this code
data Bla.SomeGreatNewDataset;
set WORK.InputTempDataset;
by SomeColumnName;
if first.SomeColumnName then output;
else delete;
run;
the same as:
data Bla.SomeGreatNewDataset;
set WORK.InputTempDataset;
by SomeColumnName;
if not missing(first.SomeColumnName) then output;
else delete;
run;
In other words, does
if first.SomeColumnName
just check if SomeColumnName does not contain a missing value?
Short answer, no.
BY-group processing with first.var and last.var operates on the distinct values of the variable. A missing value is a valid BY value and forms its own group.
first.var and last.var are Boolean values, either 1 or 0, and are never missing. Your code outputs just the first record for each unique value of SomeColumnName.
Note, the data needs to either be sorted by SomeColumnName or have an index on that column.
Here is an example:
data have;
input x;
datalines;
1
2
2
.
3
3
3
;
run;
proc sort data=have;
by x;
run;
data want;
set have;
by x;
if first.x;
run;
proc print data=want;
run;
Produces:
Obs x
1 .
2 1
3 2
4 3
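To see the flags themselves, it can help to write first.x and last.x to the log for every row. This sketch reuses the same data as the example above; only the PUT statement is new.

```sas
data have;
  input x;
  datalines;
1
2
2
.
3
3
3
;
run;

proc sort data=have;
  by x;
run;

data _null_;
  set have;
  by x;
  /* both flags are always 0 or 1, including for the missing-value group */
  put x= first.x= last.x=;
run;
```

The row with the missing x should show both flags set to 1, since it is the only member of its BY group.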
I want to delete ALL blank observations from a data set.
I only know how to get rid of blanks from one variable:
data a;
set data(where=(var1 ne .)) ;
run;
Here I set a new data set without the blanks from var1.
But how to do it, when I want to get rid of ALL the blanks in the whole data set?
Thanks in advance for your answers.
If you are attempting to get rid of rows where ALL variables are missing, it's quite easy:
/* Create an example with some or all columns missing */
data have;
set sashelp.class;
if _N_ in (2,5,8,13) then do;
call missing(of _numeric_);
end;
if _N_ in (5,6,8,12) then do;
call missing(of _character_);
end;
run;
/* This is the answer */
data want;
set have;
if compress(cats(of _all_),'.')=' ' then delete;
run;
Instead of the compress you could also use OPTIONS MISSING=' '; beforehand.
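The OPTIONS MISSING=' ' variant mentioned above could look roughly like this. It is a sketch that assumes the same sashelp.class based example; resetting the option to '.' afterwards assumes the session was using the default.

```sas
/* Same example data: some rows lose their numeric and/or character values */
data have;
  set sashelp.class;
  if _N_ in (2,5,8,13) then call missing(of _numeric_);
  if _N_ in (5,6,8,12) then call missing(of _character_);
run;

options missing=' ';   /* numeric missing values are written as a blank */

data want;
  set have;
  /* with MISSING=' ', the concatenation is empty when every variable is missing */
  if missing(cats(of _all_)) then delete;
run;

options missing='.';   /* restore the default display of missing values */
```

Only rows 5 and 8, where both the numeric and the character variables were cleared, should be dropped.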
If you want to remove ALL Rows with ANY missing values, then you can use NMISS/CMISS functions.
data want;
set have;
if nmiss(of _numeric_) > 0 then delete;
run;
or
data want;
set have;
if nmiss(of _numeric_) + cmiss(of _character_) > 0 then delete;
run;
for all char+numeric variables.
You can do something like this:
data myData;
set myData;
array a(*) _numeric_;
do i=1 to dim(a);
/* missing() also catches special missing values such as .a */
if missing(a(i)) then delete;
end;
drop i;
run;
This will scan through all the numeric variables and delete the observation as soon as it finds a missing value.
Here you go. This will work irrespective of the variable being character or numeric.
data withBlanks;
input a$ x y z;
datalines;
a 1 2 3
b 1 . 3
c . . 3
. . .
d . 2 3
e 1 . 3
f 1 2 3
;
run;
%macro removeRowsWithMissingVals(inDsn, outDsn, Exclusion);
/*Inputs:
inDsn: Input dataset with some or all columns missing for some or all rows
outDsn: Output dataset with some or all columns NOT missing for some or all rows
Exclusion: Should be one of {AND, OR}. AND will only exclude rows if any columns have missing values, OR will exclude only rows where all columns have missing values
*/
/*get a list of variables in the input dataset along with their types (i.e., whether they are numeric or character type)*/
PROC CONTENTS DATA = &inDsn OUT = CONTENTS(keep = name type varnum);
RUN;
/*put each variable with its own comparison string in a separate macro variable*/
data _null_;
set CONTENTS nobs = num_of_vars end = lastObs;
/*use NE. for numeric cols (type=1) and NE '' for char types*/
if type = 1 then call symputx(compress("var"!!varnum), compbl(name!!" NE . "));
else call symputx(compress("var"!!varnum), compbl(name!!" NE '' "));
/*make a note of no. of variables to check in the dataset*/
if lastObs then call symputx("no_of_obs", _n_);
run;
DATA &outDsn;
set &inDsn;
where
%do i =1 %to &no_of_obs.;
&&var&i.
%if &i < &no_of_obs. %then &Exclusion;
%end;
;
run;
%mend removeRowsWithMissingVals;
%removeRowsWithMissingVals(withBlanks, withOutBlanksAND, AND);
%removeRowsWithMissingVals(withBlanks, withOutBlanksOR, OR);
Output of withOutBlanksAND:
a x y z
a 1 2 3
f 1 2 3
Output of withOutBlanksOR:
a x y z
a 1 2 3
b 1 . 3
c . . 3
e 1 . 3
f 1 2 3
It is odd that nobody provided this elegant answer:
if missing(cats(of _all_)) then delete;
Edit: indeed, I did not realize that cats(of _all_) returns a dot '.' for a missing numeric value.
As a fix, I suggest the following, which seems to be more reliable:
*-- Building a sample dataset with test cases --*;
data test;
attrib a format=8.;
attrib b format=$8.;
a=.; b='a'; output;
a=1; b=''; output;
a=.; b=''; output; * should be deleted;
a=.a; b=''; output; * should be deleted;
a=.a; b='.'; output;
a=1; b='b'; output;
run;
*-- Apply the logic to delete blank records --*;
data test2;
set test;
*-- Build arrays of numeric and characters --*;
*-- Note: an array can only contain variables of the same type, thus we must create 2 different arrays --*;
array nvars(*) _numeric_;
array cvars(*) _character_;
*-- Delete blank records --*;
*-- Blank record: # of missing num variables + # of missing char variables = # of numeric variables + # of char variables --*;
if nmiss(of _numeric_) + cmiss(of _character_) = dim(nvars) + dim(cvars) then delete;
run;
The main issue is that if there are no numeric variables at all (or no character variables at all), creating an empty array generates a WARNING and the call to nmiss/cmiss generates an ERROR.
So, as far as I can tell, there is no option other than building a SAS expression outside the data step to identify empty records:
*-- Building a sample dataset with test cases --*;
data test;
attrib a format=8.;
attrib b format=$8.;
a=.; b='a'; output;
a=1; b=''; output;
a=.; b=''; output; * should be deleted;
a=.a; b=''; output; * should be deleted;
a=.a; b='.'; output;
a=1; b='b'; output;
run;
*-- Create a SAS expression that tests every variable for missing, regardless of its type --*;
proc sql noprint;
select distinct 'missing(' || strip(name) || ')'
into :miss_stmt separated by ' and '
from dictionary.columns
where libname = 'WORK'
and memname = 'TEST'
;
quit;
/*
miss_stmt looks like missing(a) and missing(b)
*/
*-- Delete blank records --*;
data test2;
set test;
if &miss_stmt. then delete;
run;