SAS: do I need to drop an array after creating it?

I was using the following code to create an array from the dataset:
DATA REPLACED;
SET TPS_DROPPED;
array arr_jin(*) _numeric_;
do i=1 to dim(arr_jin);
if arr_jin(i) = . then arr_jin(i) = 0;
end;
drop i arr_jin;
RUN;
However, I got the following error log:
ERROR 241-185: The array arr_jin is not allowed in a DROP/KEEP/RENAME context.
WARNING: The variable arr_jin in the DROP, KEEP, or RENAME list has never been referenced.
Is it true that generally arrays don't need to be dropped after creation?

An ARRAY in a SAS data step is not a variable so there is nothing to DROP.
If you want to drop the actual variables that the array references, list those variable names on the DROP statement.
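A minimal sketch of the corrected step, assuming the same TPS_DROPPED input as in the question: only the loop counter i goes on the DROP statement, and the array name is left off. The commented-out DROP uses var_a and var_b, which are made-up names for illustration only.

```sas
data replaced;
  set tps_dropped;
  /* arr_jin is only an alias for the numeric variables that exist at this point */
  array arr_jin(*) _numeric_;
  do i = 1 to dim(arr_jin);
    if arr_jin(i) = . then arr_jin(i) = 0;
  end;
  drop i;  /* i is a real variable, so it can be dropped */
  /* hypothetical: to drop underlying variables, name them directly, e.g. drop var_a var_b; */
run;
```

Because the array is defined before i is first assigned, i is not one of the variables that _numeric_ picks up.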

Related

How do I sum similarly named variables in SAS?

I want to remove every observation where the variable starting with R1 has a missing value. In order to do this, I first try to sum every variable with that prefix:
data test
input R1_1 R1_2 R1_3;
datalines;
. . .
;
run;
data test2;
set test;
diagnosis=sum (of R1:);
run;
This syntax should work according to this article. However something seems to be wrong. In the above example, I get an error complaining about the function call not having enough arguments. In other cases, the code seems to run smoothly but my diagnosis variable isn't created.
Can I fix this and in that case how?
Your code does not work because the DATA statement is missing its terminating semicolon. The text of the INPUT statement therefore became part of the DATA statement, so SAS created datasets named TEST, INPUT, R1_1, R1_2 and R1_3, none of which has any variables.
As for your actual question, you can use NMISS() to count the number of missing numeric values.
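A sketch of the question's code with the semicolon restored. The second data line (1 . 3) is an assumption added here so that one row has something to sum; it is not in the original post.

```sas
data test;  /* the semicolon ends the DATA statement */
  input R1_1 R1_2 R1_3;
  datalines;
. . .
1 . 3
;
run;

data test2;
  set test;
  /* SUM ignores missing arguments and returns missing only when all are missing */
  diagnosis = sum(of R1:);
run;
```

With this input, diagnosis should be missing for the first row and 4 for the second.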
nmiss = nmiss(of R1_:) ;
So you can eliminate observations with ANY missing values by using something like:
data want;
set have;
where nmiss(R1_1, R1_2, R1_3) = 0;
run;
If the goal is to remove observations where ALL of the values are missing you need to know how many variables you are testing. If you don't know that number in advance then you could use an ARRAY to count them. But then you would need to use a subsetting IF instead of WHERE.
data want;
set have;
array x r1_: ;
if nmiss(of r1_:) < dim(x);
run;
If you have a mix of numeric and character variables you can use CMISS() instead.
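As a rough sketch of that last point, assuming a dataset named have whose r1_-prefixed variables include both numeric and character columns, a subsetting IF with CMISS() keeps only the rows with no missing values among them:

```sas
data want;
  set have;
  /* CMISS counts missing values across numeric and character arguments */
  if cmiss(of r1_:) = 0;
run;
```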

Difficulty understanding the "_n_" variable in SAS, and how it applies to a concatenate function

I am very new to SAS and, for whatever reason, am having a lot of difficulty deciphering what the code block below does. I've googled and searched Stack Overflow to no avail. I'd appreciate any input, thanks!
set dataset;
id=cat("L",_n_);
run;
The block is probably missing a DATA statement at the top:
data newdataset;
set dataset;
id = cat("L", _n_);
run;
The code above creates a new dataset named newdataset from the existing dataset named dataset.
It also creates a new column called id by concatenating the constant character value "L" with the automatic variable _n_ using the CAT function. The automatic variable _n_ holds the number of times the DATA step has iterated.
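To see the effect, here is a small self-contained sketch. The three-row input and its name variable are invented for illustration.

```sas
/* hypothetical input: three rows with a single character variable */
data dataset;
  input name $;
  datalines;
Ann
Bob
Cy
;
run;

data newdataset;
  set dataset;
  /* _n_ is 1 on the first iteration, 2 on the second, and so on */
  id = cat("L", _n_);
run;

proc print data=newdataset;
run;
```

Since CAT strips leading and trailing blanks from numeric arguments after formatting them, the id values should come out as L1, L2 and L3.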

SAS: How to create a variable like 1,2,3,4,...,N

I need to introduce a time trend to a regression model for a course but have no idea how to create a variable that's just (1,2,3,4,...,108). In R or Python I would just create an empty vector of 0's and then loop through to fill them with the loop index but I have no clue how to do it in SAS.
Thank you in advance
data want;
set have;
time_trend+1;
run;
SAS is an inherently looping language. The code above does four things:
1. Read a row
2. Add 1 to a variable called time_trend
3. Output the row to a dataset named want
4. Read the next row and execute the statements again
SAS automatically initialized the variable time_trend for us at compilation, so we do not need to declare a length or type. SAS assumes it is a numeric variable by default.
The statement time_trend+1 is a sum statement, a shortcut for the logic below:
data want;
set have;
retain time_trend 0;
time_trend = time_trend + 1;
run;

Is there a reason why an array cannot be referenced in a keep= data set option?

The following is the code.
data WORK.TOTALSALES(keep=MonthSales{12});
set WORK.MONTHLYSALES(keep=Year Product Sales);
array MonthSales{12};
do i=1 to 12;
MonthSales{i}=Sales;
end;
drop i;
run;
Many thanks for your time and attention.
The keep= dataset option does not support arrays, but it does support sequentially numbered variables.
data WORK.TOTALSALES(keep=MonthSales1-MonthSales12);
set WORK.MONTHLYSALES(keep=Year Product Sales);
array MonthSales{12};
do i=1 to 12;
MonthSales{i}=Sales;
end;
drop i;
run;
A SAS Array in a Data Step is just a logical grouping of variables. That grouping is only available to the processing inside that data step. Data set options like drop= and keep= are handled by the SAS IO system, which is independent of the Data Step.
A SAS array, as documented in SAS(R) 9.4 Language Reference: Concepts, Fifth Edition, has the following restriction:
An array definition is in effect only for the duration of the DATA step. If you want to use the same array in several DATA steps, you must redefine the array in each step.
So even if you were to use the drop= (instead of keep=) data set option on WORK.TOTALSALES for some other variables in the data step, the array would still not be part of the output data set as it only exists within the data step that it is defined in with the array statement.

Naming variable using _n_, a column for each iteration of a datastep

I need to declare a variable for each iteration of a data step (for each n), but when I run the code, SAS outputs only the last variable declared, the one with the greatest n.
It may seem silly to declare a variable for each row, but I need this result: I'm working on a dataset created by PROC FREQ, and I need a column for each group (each row of the dataset).
The result will be in a macro, so it has to be completely flexible.
proc freq data=&data noprint ;
table &group / out=frgroup;
run;
data group1;
set group (keep=&group count ) end=eof;
call symput('gr', _n_);
*REQUESTED code will go here;
run;
I tried these:
var&gr.=.;
call missing(var&gr.);
and a lot of other statements, but none worked.
The result is always the same: the dataset includes only var&gr, where &gr is the maximum n.
It seems that the PDV is overwriting the new variable on each iteration, even though the name is different.
Please keep the solution in a single data step or, at least, make the code take as little time as possible.
Any idea on how can I achieve the requested result?
Thanks.
Macro variables don't work like you think they do. Any macro variable reference is resolved at compile time, so your call symput is changing the value of the macro variable after all the references have been resolved. The reason you are getting results where the &gr is the maximum n is because that is what &gr was as a result of the last time you ran the code.
If you know you can determine the maximum _n_, you can put the max value into a macro variable and declare an array like so:
Find max _n_ and assign value to maxn:
data _null_;
set have end=eof;
if eof then call symputx('maxn',_n_);
run;
Create variables:
data want;
set have;
array var (&maxn);
run;
If you don't like PROC TRANSPOSE (if you need 3 columns, you can run it once per column and then combine the outputs), what you ask can be done with arrays.
First, you need to determine the number of groups (i.e. rows) in the input dataset and then define an array with a dimension equal to that number.
Then the i-th element of your array can be recalled using _n_ as index.
In the following code &gr. contains the number of groups:
data group1;
set group;
array arr_counts(&gr.) var1-var&gr.;
arr_counts(_n_)= count;
run;
In SAS there are several methods to determine the number of observations in a dataset; my favorite is the following (it doesn't work with views):
data _null_;
if 0 then set group nobs=n;
call symputx('gr',n);
stop; /* end the step explicitly after one pass */
run;
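Put in execution order, the two steps from this answer might look like the sketch below. It assumes the dataset group and its count variable from the question, and it has to run in this order so that &gr. already has a value when the second step is compiled.

```sas
/* Step 1: store the number of observations of GROUP in macro variable gr */
data _null_;
  if 0 then set group nobs=n;  /* nobs= is filled in at compile time */
  call symputx('gr', n);
  stop;
run;

/* Step 2: create var1-var&gr. and fill the element that matches the current row */
data group1;
  set group;
  array arr_counts(&gr.) var1-var&gr.;
  arr_counts(_n_) = count;
run;
```

Because the array variables are not retained, each output row carries its count only in the column that matches its own row number, and the other var columns are missing on that row.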