SAS: Check variables if they are not empty dynamically

I need help checking whether several variables are not empty. Normally a "where VarName is not missing" would suffice; however, the number of variables that are generated will vary.
I have the following macro I found which correctly determines the number of variables when called in a data step:
%macro var_count(var_count_name);
array vars_char _character_;
array vars_num _numeric_;
&var_count_name = dim(vars_char) + dim(vars_num);
%mend;
My dataset creates a variable number of COLs (i.e. COL1, COL2, COL3, etc.) depending on the input dataset I use. I would like to use this in a data step that returns observations after looking at each of my generated COL1, COL2, COL3, etc. variables. I envision something like below:
Data Want;
set Have;
where cats(COL, %var_count(total_vars)) is not missing;
run;
But this does not work. I would very much like to avoid having to write "where COL1 is not missing or COL2 is not missing or ..." every time the program is run. Any and all help will be appreciated.
EDIT: I fear I may have been too unspecific in my needs above. I will try to be more clear below. Not sure if I should make a new post, but here goes.
Dataset that I have (CVal = Character value)
| ID | COL1 | COL2 | COL3 | COL4 | COL5 | COL6 | COL7 |
| 1 | | | | | | | CVal |
| 2 | CVal | CVal | | | | | |
| 3 | | | | | | | |
| 4 | | CVal | | | | | |
I would like to keep ID1, 2 and 4, due to there being information in either COL1 through COL7 in each of these.
Essentially I would like a piece of code that can do the following:
Data Want;
Set have;
if missing(COL1) and missing(COL2) and missing(COL3) and missing(COL4)
and missing(COL5) and missing(COL6) and missing(COL7) then delete;
run;
My problem is then, the number of COLs will vary depending on the input dataset. It may sometimes just be COL1-COL5, sometimes COL1-COL20. How can this be made "automatic", so it will automatically register the number of COL-columns and then automatically check those columns, if they are all empty and then delete the observation.

In your case to test if any of the COL: variables is non-empty you can just test if the concatenation of them is non-empty.
data want;
set have;
if not missing(cats(of COL:));
run;
You need to use subsetting IF because you cannot use variable lists in a WHERE statement.
Example:
35 data test;
36 set sashelp.class;
37 where nmiss(of height weight) > 0 ;
------
22
76
ERROR: Syntax error while parsing WHERE clause.
ERROR 22-322: Syntax error, expecting one of the following: !, !!, &, (, ), *, **, +, ',', -, /, <, <=, <>, =, >, >=, ?, AND, BETWEEN, CONTAINS, EQ,
GE, GT, IN, IS, LE, LIKE, LT, NE, NOT, NOTIN, OR, ^, ^=, |, ||, ~, ~=.
ERROR 76-322: Syntax error, statement will be ignored.
38 run;
NOTE: The SAS System stopped processing this step because of errors.
WARNING: The data set WORK.TEST may be incomplete. When this step was stopped there were 0 observations and 5 variables.
WARNING: Data set WORK.TEST was not replaced because this step was stopped.
NOTE: DATA statement used (Total process time):
real time 0.02 seconds
cpu time 0.01 seconds
Note that if any of the COL... variables is numeric you would need to modify the test a little. When the MISSING option is set to ' ' the line above will work. But if it is set to the normal '.' then numeric missing values will appear as periods in the concatenated string. If you don't mind also treating character values that consist only of periods as missing, you could just use compress() to remove the periods.
if not missing(compress(cats(of COL:),'.'));

You can use N to count the number of non-missing numerics and CATS to check for some character values being not-missing.
Example:
Presume numeric and character variables are segregated with prior variable based array statements such as
array chars col1-col7;
array nums x1-x10;
The subsetting if would be
if N(of nums(*)) or not missing(CATS(of chars(*)));
or test using COALESCE and COALESCEC
if not missing(coalesce(of nums(*))) or
not missing(coalesceC(of chars(*)));
If you don't know the variable names ahead of time, you will need to examine the data set ahead of the subsetting DATA step and codegen the array statements into macro variables.
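One way to do that examination is to read the variable names from DICTIONARY.COLUMNS. This is a minimal sketch, assuming the input is WORK.HAVE and that every variable whose name starts with COL should be checked (note that the LIKE pattern would also match a name such as COLOR). The macro variable names cols and ncols are made up for illustration. Because CMISS() accepts both numeric and character arguments, a single list is enough here and no separate arrays are needed.

```sas
%let cols=;   /* make sure the macro variable exists even if nothing matches */

proc sql noprint;
  /* collect the names of all COL... variables in WORK.HAVE */
  select name
    into :cols separated by ' '
    from dictionary.columns
    where libname = 'WORK'
      and memname = 'HAVE'
      and upcase(name) like 'COL%';
quit;
%let ncols = &sqlobs;   /* number of names selected by the query */

data want;
  set have;
  /* keep the row unless every COL... variable is missing */
  if cmiss(of &cols) < &ncols;
run;
```

If no variable matches, &cols is empty and the DATA step will fail with a syntax error, so this assumes at least one COL variable exists.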

Related

How to code a macro that will iterate through a list of variables and return those that are missing

I'm still pretty new to SAS coding and I'm not great at loops. I want to code a macro that will iterate through a vector of variables and return a table of the 'study_id' of those that are missing this variable. Ideally, the macro would then append each list into one final table.
I know that I need a loop that iterates from 1 to the length of my vector of variables. I've also tested the sql step on a single variable and it works. Here's what I have, along with a truncated data set to reproduce the problem:
data test;
input Study_ID married_partner $ PT_Working $;
cards;
1 Yes Yes
2 No No
3 Yes .
5 Yes No
6 Yes No
8 Yes Yes
9 . No
10 Yes No
11 Yes No
12 Yes No
13 . No
14 Yes No
15 No No
17 Yes .
19 Yes No
20 Yes No
21 Yes No
;
run;
%let var=married_partner PT_Working;
%macro missing(data=, list=, var=);
do i = 1 to dim(&var);
proc sql;
create table missing_&var as
select &list
from &data
where missing(&var);
quit;
end;
%mend;
%missing(data=PT_BASELINE_ALLPT, list=Study_ID, var=&var)
I'm getting the following error:
61 missing_married_partner PT_Working
__________
78
202
NOTE: Line generated by the macro variable "VAR".
61 married_partner PT_Working
__________
22
ERROR 78-322: Expecting a '.'.
ERROR 202-322: The option or parameter is not recognized and will be ignored.
ERROR 22-322: Syntax error, expecting one of the following: !, !!, &, *, **, +, ',', -, '.', /, <, <=, <>, =, >, >=, ?, AND,
CONTAINS, EQ, EQT, GE, GET, GT, GTT, LE, LET, LIKE, LT, LTT, NE, NET, OR, ^=, |, ||, ~=.
Where am I going wrong and what further code should I add to combine all of these into one table?
Thanks for any help
You don't really need macro code for this problem. Remember that the purpose of macro code is to generate SAS code so first figure out what SAS code you want to run before trying to use macro logic to generate it.
To process a series of variables you can usually use an array. Although they do need to be of the same type (numeric or character).
If you just want to find observations with missing values on any of the variables you don't even need an array. The CMISS() function will work for both numeric and character variables. So this step will find all of the observations with any missing values of the two variables listed.
data want ;
set have;
if cmiss(of married_partner PT_Working);
run;
If you want it more flexible you could use a macro variable for the variable list.
data want ;
set have;
if cmiss(of &varlist);
run;
It would be harder to do in PROC SQL since that does not support the use of variable lists, including the OF keyword. Instead you would need to put commas between the variable names.
create table want as select * from have where cmiss(married_partner, PT_Working);
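If you also need to know which variable is missing for each Study_ID, all in one table, the array approach mentioned above can be sketched as follows. This assumes the sample dataset test from the question, where both checked variables are character (an array cannot mix types). The output dataset name missing_report and the variable name variable are hypothetical.

```sas
data missing_report;
  set test;
  /* both variables are character in the sample data */
  array chk {*} married_partner PT_Working;
  length variable $32;
  do i = 1 to dim(chk);
    if missing(chk{i}) then do;
      variable = vname(chk{i});   /* name of the variable that is missing */
      output;                     /* one row per Study_ID and missing variable */
    end;
  end;
  keep Study_ID variable;
run;
```

Each output row pairs a Study_ID with the name of one missing variable, so no appending step is required afterwards.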

SAS drop records in by group with only one observation

I have a dataset and I am reading it with a by group statement:
data TEMPDATA;
SET RAWDATA; by SYMBOL DATE;
run;
proc expand data=TEMPDATA out=GAPDATA to=day method=step;
by symbol date;
id time;
run;
However, I realized that the proc expand procedure would return an error if there is a record in the by group that has only one observation.
For example:
| Symbol | Date | Time | BB | BO | MIDPRICE |
|--------|----------|------|----|----|----------|
| AAPL | 20130102 | 2 | 2 | 3 | 2.5 |
If there is only one record of AAPL, SAS will refuse to execute the command.
Therefore, I was wondering if there is a way to drop all the records that are the only record in their by group (symbol, date)?
Since you are using a data step already just add logic to delete the singletons. Any record that is both the first and last in its group indicates there is only one record in that group.
data TEMPDATA;
SET RAWDATA;
by SYMBOL DATE;
if first.date and last.date then delete;
run;
One nice feature in SAS PROC SQL is that you can group by and add summary measures while retaining all detail. This makes such removal easy (and can be useful in many other contexts as well). I.e.
PROC SQL;
CREATE TABLE tempdata2 AS
SELECT *
FROM tempdata
GROUP BY symbol, date
HAVING count(*) > 1
;
QUIT;

Sort observations in a custom order

I have a dataset that results from the joins between a few results from a proc univariate.
After some more joins, I have a final dataset with a variable called "Measure", which has the name of certain measures, like 'mean' and 'standard deviation', for example, and other variables each with values for these measures, representing a month in a certain year.
I'd like to sort these measures in a particular order and, for now, I'm doing a proc transpose, doing a retain to establish the order I want, and doing another transpose. The problem is that this is a really naive solution and I feel it just takes longer than it should.
Is there a simpler/more effective way to do this sort?
An example of what I want to do, with random values:
What I have:
Measures | 2013/01 | 2013/02 | 2013/03
Mean | 10 | 9 | 11
Std Devi.| 1 | 1 | 1
Median | 3 | 5 | 4
What I want:
Measures | 2013/01 | 2013/02 | 2013/03
Std Devi.| 1 | 1 | 1
Median | 3 | 5 | 4
Mean | 10 | 9 | 11
I hope I was clear enough.
Thanks in advance
A couple of straightforward solutions. First, you could simply add a variable that you sort by and then drop. You don't need to transpose, just do it in the data step or PROC SQL after the join. if measures='Mean' then sortorder=3; else if measures='Median' then sortorder=2;... then sort by sortorder and drop it in the PROC SORT step. Keep in mind that the comparison is case sensitive, so the literals must match the stored values exactly.
Second, if you're using entirely numeric values, you can use PROC MEANS to do the sorting for you, with a custom format that defines the order (using NOTSORTED and order=data on the class statement) and idgroup functionality in PROC MEANS to do the sorting and output the right values. This is overkill in most cases, but if the dataset is huge it might be appropriate.
Third, if you're doing the joins in SQL, you can order by the variable after converting it into the order you want - I can explain that in more detail if you find that the most useful.
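The first approach can be sketched like this. The dataset names stats, ordered and final are hypothetical, and the variable is assumed to be called measures with exactly the values shown in the question.

```sas
data ordered;
  set stats;                       /* hypothetical name for the joined dataset */
  select (measures);
    when ('Std Devi.') sortorder = 1;
    when ('Median')    sortorder = 2;
    when ('Mean')      sortorder = 3;
    otherwise          sortorder = 99;   /* unexpected values go last */
  end;
run;

/* sort by the helper variable and drop it from the output dataset */
proc sort data=ordered out=final(drop=sortorder);
  by sortorder;
run;
```

The DROP= option on OUT= is applied when the output is written, so sortorder is still available for the sort itself.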

Make a SAS data column into a Macro variable?

How can I convert the output of a SAS data column into a macro variable?
For example:
Var1 | Var2
-----------
A | 1
B | 2
C | 3
D | 4
E | 5
What if I want a macro variable containing all of the values in Var1 to use in a PROC REG or other procedure? How can I extract that column into a variable which can be used in other PROCS?
In other words, I would want to generate the equivalent statement:
%LET Var1 =
A
B
C
D
E
;
But I will have different results coming from a previous procedure so I can't just do a '%LET'. I have been exploring SYMPUT and SYMGET, but they seem to apply only to single observations.
Thank you.
proc sql;
select var1
into :varlist separated by ' '
from have;
quit;
This creates the &varlist. macro variable, with the values separated by the separation character. If you don't specify a separation character it creates a variable with the first row's value only.
There are a lot of other ways, but this is the simplest. CALL SYMPUTX for example will do the same thing, except it's complicated to get it to pull all rows into one.
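For comparison, a DATA step version with CALL SYMPUTX might look like the sketch below. It assumes the same have dataset and var1 column as in the question; the helper variable list is made up for illustration.

```sas
data _null_;
  set have end=last;
  length list $32767;
  retain list;                      /* carry the growing list across rows */
  list = catx(' ', list, var1);     /* append the current value */
  if last then call symputx('varlist', list);
run;

%put &=varlist;
```

The extra bookkeeping (RETAIN, END=, a length for the accumulator) is why the PROC SQL form is usually the shorter choice.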
You can use it in a proc directly, no need for a macro variable. I used numeric values for your var1 for simplicity, but you get the idea.
data test;
input var1 var2 @@;
datalines;
1 100 2 200 3 300 4 400 5 500
;
run;
proc reg data=TEST;
MODEL VAR1 = VAR2;
RUN;

Inputting missing value in primary dataset based on values in secondary dataset and a matching condition

My understanding of SAS is very elementary. I am trying to do something like this and I need help.
I have a primary dataset A with 20,000 observations where Col1 stores the CITY and Col2 stores the MILES. Col2 contains a lot of missing data, as shown below.
+----------------+---------------+
| Col1 | Col2 |
+----------------+---------------+
| Gary,IN | 242.34 |
+----------------+---------------+
| Lafayette,OH | . |
+----------------+---------------+
| Ames, IA | 123.19 |
+----------------+---------------+
| San Jose,CA | 212.55 |
+----------------+---------------+
| Schuaumburg,IL | . |
+----------------+---------------+
| Santa Cruz,CA | 454.44 |
+----------------+---------------+
I have a secondary dataset B, which has around 5000 observations and is very similar to dataset A: Col1 stores the CITY and Col2 stores the MILES. However, in dataset B, Col2 DOES NOT CONTAIN MISSING DATA.
+----------------+---------------+
| Col1 | Col2 |
+----------------+---------------+
| Lafayette,OH | 321.45 |
+----------------+---------------+
| San Jose,CA | 212.55 |
+----------------+---------------+
| Schuaumburg,IL | 176.34 |
+----------------+---------------+
| Santa Cruz,CA | 454.44 |
+----------------+---------------+
My goal is to fill the missing miles in Dataset A based on the miles in Dataset B by matching the city names in col1.
In this example, I am trying to fill in 321.45 in Dataset A from Dataset B and similarly 176.34 by matching Col1 (city names) between the two datasets.
I need help doing this in SAS.
You just have to merge the two datasets. Note that values of Col1 needs to match exactly in the two datasets.
Also, I am assuming that Col1 is unique in dataset B. Otherwise you need to somehow tell more exactly what value you want to use or remove the duplicates (for example by adding nodupkey in proc sort statement).
Here is an example how to merge in SAS:
proc sort data=A;
by Col1;
proc sort data=B;
by Col1;
data AB;
merge A(in=a) B(keep=Col1 Col2 rename=(Col2 = Col2_new));
by Col1;
if a;
if missing(Col2) then Col2 = Col2_new;
drop Col2_new;
run;
This includes all observations and columns from dataset A. If Col2 is missing in A then we use the value from B.
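If Col1 is not unique in B, the duplicates could be removed before the merge with NODUPKEY, as mentioned above. A minimal sketch follows; the output name B_unique is hypothetical, and which of the duplicate rows survives depends on their order in B, so this only makes sense when the duplicates carry the same Col2 value.

```sas
/* keep one row per Col1 value */
proc sort data=B out=B_unique nodupkey;
  by Col1;
run;
```

B_unique would then take the place of B in the MERGE statement.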
Pekka's solution works perfectly; I add an alternative solution for the sake of completeness.
Sometimes in SAS a PROC SQL lets you skip some steps compared to a DATA step (with a corresponding gain in storage and computation time), and a MERGE is a typical example.
Here you can avoid sorting both input datasets and handling the renaming of variables (here the matching key has the same name col1 in both datasets, but in general this is not the case).
proc sql;
create table want as
select A.col1,
coalesce(A.col2,B.col2) as col2
from A left join B
on A.col1=B.col1
order by A.col1;
quit;
The coalesce() function returns the first non missing element encountered in the arguments list.