I have a lot of data sets that I need to have the same structure - same variables, same order. I have a data set serving as a template ("all" in the code below). Other data are given this form by listing both the template data set (with obs=0) and a particular data set ("some" in the code below) in the same set statement. This works just fine.
I then want to loop through the variables. If one of them is missing (as it will be, if it's not present in the particular data set), it should be set to the value of the previous variable. var2 should get the value from var1 etc.
This should be done within each row. This works fine if done in a separate data step, but doesn't work if done in the same data step described above.
If done in the same data step, the values inserted for missing values will always be from row 1. Why is this? Can I achieve the wanted result without using another data step?
/* All the variables a complete data set should contain.*/
data all;
format var1-var5 $20.;
run;
/* Actual data have some of these variables, but not all. var1 is never missing, all other variables might be*/
data some;
var1="Obs 1, Value 1";
var4="Obs 1, Value 4";
output;
var1="Obs 2, Value 1";
var4="Obs 2, Value 4";
output;
run;
/* Not working - The values inserted when the conditional is true are all from row 1*/
data dont_want;
set all(obs=0) some;
array chars{*} _character_;
do i=1 to dim(chars);
if missing(chars{i}) then chars{i}=chars{i-1};
end;
drop i;
run;
/* Working*/
data temp;
set all(obs=0) some;
run;
data want;
set temp;
array chars{*} _character_;
do i=1 to dim(chars);
if missing(chars{i}) then chars{i}=chars{i-1};
end;
drop i;
run;
The values for the "extra" variables are being RETAINed since they are sourced from the ALL dataset. Any variable that is sourced from an input dataset is NOT reset to missing at the start of the data step iteration. Since those variables are not on the SOME dataset they do not change when an observation is read from it.
Just add code to reset them to missing. If you want to do it without knowing the names of the variables you might consider re-ordering the code.
You could define and clear the array after the compiler has "seen" the ALL dataset but before the run-time has read the SOME dataset.
data dont_want;
if 0 then set all;
array chars{*} _character_;
call missing(of chars{*});
set some;
do i=2 to dim(chars);
if missing(chars{i}) then chars{i}=chars{i-1};
end;
drop i;
run;
Or add an explicit OUTPUT statement and reset them after that.
data dont_want;
set all(obs=0) some;
array chars{*} _character_;
do i=2 to dim(chars);
if missing(chars{i}) then chars{i}=chars{i-1};
end;
drop i;
output;
call missing(of _all_);
run;
A couple of things to do if you want to use the implicit OUTPUT:
1. Prep the PDV before the data-reading SET by using a non-reading SET
2. Set up the array based on the prepped PDV
3. Clear the array
4. Read the data with SET
5. Impute your data
6. Let the implicit OUTPUT write the row
Example:
data dont_want;
if 0 then set all some; * non reading set preps the PDV;
array chars{*} _character_;
call missing(of chars(*)); * clears all auto-retained data set variables;
set all(obs=0) some; * data reading set;
* shifting an array right requires right-to-left processing;
do i=dim(chars) to 2 by -1;
if missing(chars{i}) then chars{i}=chars{i-1};
end;
*** OR (use one loop or the other) COPY right into empty slots, repeating prior copy if needed;
do i=2 to dim(chars);
if missing(chars{i}) then chars{i}=chars{i-1};
end;
drop i;
* implicit output;
run;
So I have a vector of search terms, and my main data set. My goal is to create an indicator for each observation in my main data set where variable1 includes at least one of the search terms. Both the search terms and variable1 are character variables.
Currently, I am trying to use a macro to iterate through the search terms, and for each search term, indicate if it is in the variable1. I do not care which search term triggered the match, I just care that there was a match (hence I only need 1 indicator variable at the end).
I am a novice when it comes to SAS macros and loops, but I have tried searching and piecing together code from some online sites. Unfortunately, when I run it, it does nothing; it does not even give me an error.
I have put the code I am trying to run below.
*for example, I am just testing on one of the SASHELP data sets;
*I take the first five team names to create a search list;
data terms; set sashelp.baseball (obs=5);
search_term = substr(team,1,3);
keep search_term;;
run;
*I will be searching through the baseball data set;
data test; set sashelp.baseball;
run;
%macro search;
%local i name_list next_name;
proc SQL;
select distinct search_term into : name_list separated by ' ' from work.terms;
quit;
%let i=1;
%do %while (%scan(&name_list, &i) ne );
%let next_name = %scan(&name_list, &i);
*I think one of my issues is here. I try to loop through the list, and use the find command to find the next_name and if it is in the variable, then I should get a non-zero value returned;
data test; set test;
indicator = index(team,&next_name);
run;
%let i = %eval(&i + 1);
%end;
%mend;
Thanks
Here's the temporary array solution which is fully data driven.
1. Store the number of terms in a macro variable to set the length of the arrays
2. Load the terms to search for into a temporary array
3. Loop through the terms for each row and search for each one
4. Exit the loop as soon as a term is found, to speed up the process
/*1*/
proc sql noprint;
select count(*) into :num_search_terms from terms;
quit;
%put &num_search_terms.;
data flagged;
*declare array;
array _search(&num_search_terms.) $ _temporary_;
/*2*/
*load array into memory;
if _n_ = 1 then do j=1 to &num_search_terms.;
set terms;
_search(j) = search_term;
end;
set test;
*set flag to 0 for initial start;
flag = 0;
/*3*/
*loop through and create flag;
do i=1 to &num_search_terms. while(flag=0); /*4*/
if find(team, _search(i), 'it')>0 then flag=1;
end;
drop i j search_term ;
run;
Not sure I totally understand what you are trying to do but if you want to add a new binary variable that indicates if any of the substrings are found just use code like:
data want;
set have;
indicator = index(term,'string1') or index(term,'string2')
... or index(term,'string27') ;
run;
Not sure what a "vector" would be but if you had the list of terms in a dataset you could easily generate that code from the data. And then use %include to add it to your program.
filename code temp;
data _null_;
set term_list end=eof;
file code ;
if _n_ =1 then put 'indicator=' @ ;
else put ' or ' @;
put 'index(term,' string :$quote. ')' @;
if eof then put ';' ;
run;
data want;
set have;
%include code / source2;
run;
If you did want to think about creating a macro to generate code like that then the parameters to the macro might be the two input dataset names, the two input variable names and the output variable name.
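A minimal sketch of such a macro, using the parameters described above. The macro name flag_terms and all parameter names are hypothetical. It assumes the terms data set is not empty, that the terms contain no & or % characters, and that the generated expression fits in one macro variable.

```sas
%macro flag_terms(data=, var=, terms=, termvar=, out=, flag=indicator);
  %local expr;
  /* Build one expression such as: index(team,"Atl") or index(team,"Bal") */
  proc sql noprint;
    select distinct cats('index(', "&var.", ',', quote(strip(&termvar.)), ')')
      into :expr separated by ' or '
      from &terms.;
  quit;

  data &out.;
    set &data.;
    /* OR returns 1 when any INDEX call is non-zero, otherwise 0 */
    &flag. = (&expr.);
  run;
%mend flag_terms;

/* Example call, assuming the TERMS data set built in the question */
%flag_terms(data=sashelp.baseball, var=team, terms=terms,
            termvar=search_term, out=flagged);
```

Like the generated code above, this matching is case-sensitive because it relies on INDEX.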
In a SAS data step I have a character variable called "varName". This variable stores the name of another variable. In the example below, it stores the name of the numeric variable "changeMe":
data TMP;
length
varName $32
changeMe 8
;
varName = 'changeMe';
/*??? How to change the content of variable that varName holds ???*/
run;
Now the question is: how do I change the content of the variable whose name varName holds?
The use case is that varName acts as a dynamic pointer to different variables that I want to manipulate in a big SAS data set.
DATA Step does not directly provide for named indirect assignment.
In some cases, the indirect assignment requirement might indicate you want to perform a Proc TRANSPOSE data transformation. If the variable names and values are provided in a transaction data set, and the data has BY group variables, your better solution might be to TRANSPOSE the transaction data and merge that transform to the master data using an UPDATE or MODIFY statement.
Regardless, you can array variables of a given type and iterate the array looking for the target requiring assignment.
Example:
data want;
set sashelp.class;
varname = 'name';
varvalue = 'Scooter';
array chars _character_;
do _n_ = 1 to dim(chars);
if upcase (vname(chars(_n_))) = upcase(varname) then do;
chars(_n_) = varvalue;
end;
end;
run;
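The same idea carries over to the numeric variable from the question. This is a minimal sketch: the extra variable otherVar, the loop index _i, and the assigned value 42 are made up for illustration.

```sas
data TMP;
  length varName $32 changeMe 8 otherVar 8;
  varName = 'changeMe';
  /* The array holds the numeric variables defined so far: changeMe and otherVar */
  array nums {*} _numeric_;
  do _i = 1 to dim(nums);
    /* VNAME returns the name of the variable behind the array element */
    if upcase(vname(nums{_i})) = upcase(varName) then nums{_i} = 42;
  end;
  drop _i;
run;
```

Because _i is created after the ARRAY statement, it is not itself a member of the array.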
call execute() is another workable solution: it generates a second data step that assigns to the variable named in varName.
data TMP;
length
varName $32
changeMe 8
;
varName = 'changeMe';
run;
data _null_;
set TMP end=eof;
if _n_ = 1 then call execute('data %trim(&syslast.); modify %trim(&syslast.);');
call execute(cats(varName)||' = rand("uniform",1,0);');
if eof then call execute('run;');
run;
Log:
NOTE: CALL EXECUTE generated line.
1 + data WORK.TMP; modify WORK.TMP;
2 + changeMe = rand("uniform",1,0);
3 + run;
NOTE: There were 1 observations read from the data set WORK.TMP.
NOTE: The data set WORK.TMP has been updated. There were 1 observations rewritten, 0 observations
added and 0 observations deleted.
I'm facing a problem when I try to put data into a character variable.
I have a long transposed dataset with three variables: date (by which I transposed beforehand), var (which holds the names of my three previous variables) and col1 (which holds the values of my previous variables).
Now I want to create a fourth variable which also takes three different values. My problem is that I can create the variable, but with my code it always contains missing values.
data pair2;
set data1;
if var="BNNESR" or var="BNNESR_r" or var="BNNESR_t" then output;
length all $ 20;
all=" ";
if var="BNNESR" then all="pdev";
if var="BNNESR_t" then all="trigger";
if var="BNNESR_r" then all="rdev";
drop var;
run;
Afterwards I want to transpose it back by the "all" variable. I know I could just rename the old vars before I transpose and then keep them.
But the complete calculation will go on and will eventually be turned into a macro, where doing it that way is not so easy.
Your program will just subset the input data and add a new variable that is empty because you are writing the data out before you assign any value to the new variable.
Use a subsetting IF (or WHERE) statement instead of using an explicit OUTPUT statement. Once your data step has an explicit OUTPUT statement then SAS no longer automatically writes the observation at the end of the data step iteration.
data pair2;
set data1;
if var="BNNESR" or var="BNNESR_r" or var="BNNESR_t" ;
length all $20;
if var="BNNESR" then all="pdev";
else if var="BNNESR_t" then all="trigger";
else if var="BNNESR_r" then all="rdev";
drop var;
run;
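The WHERE variant mentioned above could look like this. It is a sketch using the same data set and variable names as the question; the filter is applied as the rows are read, so no explicit OUTPUT is needed.

```sas
data pair2;
  /* WHERE= filters rows on input, before they reach the data step */
  set data1(where=(var in ("BNNESR", "BNNESR_r", "BNNESR_t")));
  length all $20;
  if var = "BNNESR" then all = "pdev";
  else if var = "BNNESR_t" then all = "trigger";
  else if var = "BNNESR_r" then all = "rdev";
  drop var;
run;
```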
Since the list in the IF statement matches the values in the recode step then perhaps you want to just use a DELETE statement instead?
data pair2;
set data1;
length all $20;
if var="BNNESR" then all="pdev";
else if var="BNNESR_t" then all="trigger";
else if var="BNNESR_r" then all="rdev";
else delete;
drop var;
run;
Is it possible to repeat a data step a number of times (like you might in a %do-%while loop) where the number of repetitions depends on the result of the data step?
I have a data set with numeric variables A. I calculate a new variable result = min(1, A). I would like the average value of result to equal a target and I can get there by scaling variable A by a constant k. That is solve for k where target = average(min(1,A*k)) - where k and target are constants and A is a list.
Here is what I have so far:
filename f0 'C:\Data\numbers.csv';
filename f1 'C:\Data\target.csv';
data myDataSet;
infile f0 dsd dlm=',' missover firstobs=2;
input A;
init_A = A; /* store the initial value of A */
run;
/* read in the target value (1 observation) */
data targets;
infile f1 dsd dlm=',' missover firstobs=2;
input target;
K = 1; * initialise the constant K;
run;
%macro iteration; /* I need to repeat this macro a number of times */
data myDataSet;
retain key 1;
set myDataSet;
set targets point=key;
A = INIT_A * K; /* update the value of A */
result = min(1, A);
run;
/* calculate average result */
proc sql;
create table estimate as
select avg(result) as estimate0
from myDataSet;
quit;
/* compare estimate0 to target and update K */
data targets;
set targets;
set estimate;
K = K * (target / estimate0);
run;
%mend iteration;
I can get the desired answer by running %iteration a few times, but ideally I would like to run the iteration until (target - estimate0 < 0.01). Is such a thing possible?
Thanks!
I had a similar problem to this just the other day. The approach below is what I used. You will need to change the loop structure from a for loop to a do while loop (or whatever suits your purposes):
First perform an initial scan of the table to figure out your loop termination conditions and get the number of rows in the table:
data read_once;
set sashelp.class end=eof;
if eof then do;
call symput('number_of_obs', cats(_n_) );
call symput('number_of_times_to_loop', cats(3) );
end;
run;
Make sure results are as expected:
%put &=number_of_obs;
%put &=number_of_times_to_loop;
Loop over the source table again multiple times:
data final;
do loop=1 to &number_of_times_to_loop;
do row=1 to &number_of_obs;
set sashelp.class point=row;
output;
end;
end;
stop; * REQUIRED BECAUSE WE ARE USING POINT=;
run;
Two part answer.
First, it's certainly possible to do what you say. There are some examples of code that works like this available online, if you want a working, useful-code example of iterative macros; for example, David Izrael's seminal Rakinge macro, which performs a rimweighting procedure by iterating over a relatively simple process (proc freqs, basically). This is pretty similar to what you're doing. In the process it looks in the datastep at the various termination criteria, and outputs a macro variable that is the total number of criteria met (as each stratification variable separately needs to meet the termination criterion). It then checks %if that criterion is met, and terminates if so.
The core of this is two things. First, you should have a fixed maximum number of iterations, unless you like infinite loops. That number should be larger than the largest reasonable number you should ever need, often by around a factor of two. Second, you need convergence criteria such that you can terminate the loop if they're met.
For example:
data have;
x=5;
run;
%macro reduce(data=, var=, amount=, target=, iter=20);
data want;
set &data.;
run;
%let calc=.;
%let _i=0;
%do %until (&calc.=&target. or &_i.=&iter.);
%let _i = %eval(&_i.+1);
data want;
set want;
&var. = &var. - &amount.;
call symputx('calc',&var.);
run;
%end;
%if &calc.=&target. %then %do;
%put &var. reduced to &target. in &_i. iterations.;
%end;
%else %do;
%put &var. not reduced to &target. in &iter. iterations. Try a larger number.;
%end;
%mend reduce;
%reduce(data=have,var=x,amount=1,target=0);
That is a very simple example, but it has all of the same elements. I prefer to use do-until and increment on my own but you can do the opposite also (as %rakinge does). Sadly the macro language doesn't allow for do-by-until like the data step language does. Oh well.
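Applied to the %iteration macro from the question, the same pattern might look like the sketch below. The wrapper name solve_k and its parameters maxiter and tol are hypothetical. It assumes %iteration leaves target and estimate0 in the one-row TARGETS data set, as the question's code does, and it compares the absolute difference rather than the signed one.

```sas
%macro solve_k(maxiter=20, tol=0.01);
  %local _i diff;
  %let _i = 0;
  %let diff = 1;
  /* Stop when converged or when the iteration cap is reached */
  %do %until (%sysevalf(&diff. < &tol.) or &_i. >= &maxiter.);
    %let _i = %eval(&_i. + 1);
    %iteration
    /* Publish the current gap as a macro variable for the loop test */
    data _null_;
      set targets;
      call symputx('diff', abs(target - estimate0));
    run;
  %end;
  %put NOTE: Stopped after &_i. iteration(s), difference=&diff.;
%mend solve_k;

%solve_k(maxiter=20, tol=0.01);
```

%sysevalf is used for the comparison because %eval only handles integer arithmetic.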
Secondly, you can often do things like this inside a single data step. Even in older versions (9.2 etc.), you can do all of what you ask above in a single data step, though it might look a little clunky. In 9.3+, and particularly 9.4, there are ways to run that proc sql inside the data step and get the result back without waiting for another data step, using RUN_MACRO or DOSUBL and/or the FCMP language. Even something simple, like this:
data have;
initial_a=0.3;
a=0.3;
target=0.5;
output;
initial_a=0.6;
a=0.6;
output;
initial_a=0.8;
a=0.8;
output;
run;
data want;
k=1;
do iter=1 to 20 until (abs(target-estimate0) < 0.001);
do _n_ = 1 to nobs;
if _n_=1 then result_tot=0;
set have nobs=nobs point=_n_;
a=initial_a*k;
result=min(1,a);
result_tot+result;
end;
estimate0 = result_tot/nobs;
k = k * (target/estimate0);
end;
output;
stop;
run;
That does it all in one data step. I'm cheating a bit because I'm writing my own data step iterator, but that's fairly common in this sort of thing, and it is very fast. Macros iterating multiple data steps and proc sql steps will be much slower typically as there is some overhead from each one.
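For comparison, here is a minimal sketch of the DOSUBL route mentioned above, assuming SAS 9.4. It uses sashelp.class purely for illustration, and the macro variable name avg_height is made up here.

```sas
/* Create the macro variable up front so it exists in the global scope */
%let avg_height = ;

data _null_;
  /* Run a PROC SQL step in a side session while this data step is executing */
  rc = dosubl('proc sql noprint; select avg(height) into :avg_height from sashelp.class; quit;');
  /* Read the value set by the side session back into a data step variable */
  avg_height = input(symget('avg_height'), best32.);
  put rc= avg_height=;
run;
```

Each DOSUBL call carries step overhead of its own, so the hand-written iterator above is likely to remain the faster option for tight loops.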
I have looked around quite a bit for something of this nature, and the majority of sources all give examples of counting the amount of observations etc.
But what I am actually after is a simple piece of code that checks whether there are any observations in the dataset. If there are, the program should continue as normal; if there are none, I would like a new record to be created with a variable stating that the dataset is empty.
I have seen macros and SQL code that can accomplish this, but what I would like to know is whether it is possible to do the same in data step code. I know the code I have below does not work, but any insight would be appreciated.
Data TEST;
length VAR1 $200.;
set sashelp.class nobs=n;
call symputx('nrows',n);
obs= &nrows;
if obs = . then VAR1= "Dataset is empty"; output;
Run;
You could do it by always appending a 1-row data set with the empty dataset message, and then delete the message if it doesn't apply.
data empty_marker;
length VAR1 $200;
VAR1='Dataset is empty';
run;
Data TEST;
length VAR1 $200.;
set
sashelp.class nobs=n
empty_marker (in=marker)
;
if (marker) and _n_ > 1 then delete;
Run;
The easiest way I can think of is to use the nobs= option of the SET statement to check the number of records. The trick is that you don't want to actually read from an empty data set: that terminates the DATA step before any later statement can run. So you put the SET behind an always-false IF statement. The nobs value is assigned at compile time, so it is available even though that SET never executes.
data test1;
format x best. msg $32.;
stop;
run;
data test1;
if _n_ = 0 then
set test1 nobs=nobs;
if ^nobs then do;
msg = "NO RECORDS";
output;
stop;
end;
set test1;
/*Normal code here*/
output;
run;
So this populates the nobs value with 0. The IF clause sees the 0 and lets you set the message and output that row. The STOP then terminates the DATA step. Outside of that check, do your normal data step code. You need the final OUTPUT statement because of the first one: once a data step contains an explicit OUTPUT, SAS no longer writes the observation automatically at the end of each iteration.
Here it works for a data set with values.
data test2;
format x best. msg $32.;
do x=1 to 5;
msg="Yup";
output;
end;
run;
data test2;
if _n_ = 0 then
set test2 nobs=nobs;
if ^nobs then do;
msg = "NO RECORDS";
output;
stop;
end;
set test2;
y=x+1;
output;
run;