I'm trying to apply the same logic to all the variables and create new variables based on that logic:
DATA want;
SET have;
IF "range" = 25 THEN "new range" = 1
ELSE "new range" = 0;
RUN;
If it's easier I can also just change the variables themselves as opposed to creating new variables from the logical statement.
As an example, I want any value of 25 within the variables to become 1, and everything else to become 0:
HAVE:
var_100 var_101 var_102
30 25 20
45 100 25
25 25 10
WANT:
var_100 var_101 var_102
0 1 0
0 0 1
1 1 0
So I have about 100 variables, all with the same prefix and increasing suffixes. Instead of writing 100 logical statements, I am trying to write one that will apply to every variable in the range var_1 to var_100.
There are lots of ways you can do this, mostly depending on exactly what you're doing to what.
Arrays are the simplest:
data want;
set have;
array vars[25] var1-var25;
array newvars[25] n_var1 - n_var25;
do _i = 1 to dim(vars);
if vars[_i] = 25 then newvars[_i] = 1;
else newvars[_i] = 0;
end;
run;
Of course you need some reasonable way to specify those variable lists (var1-var25 and n_var1 - n_var25); if they're not just sequential, you'll either have to write them all out, or use the macro language to do that.
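If the variables aren't sequential but do share a prefix (like the var_1 to var_100 in the question), one sketch is to build the list from the dictionary tables; the WORK.HAVE name and the var_ prefix here are assumptions, not from the original code:

```sas
/* Build the variable list dynamically; WORK.HAVE and the var_ prefix
   are assumptions for this sketch */
proc sql noprint;
  select name into :varlist separated by ' '
  from dictionary.columns
  where libname = 'WORK' and memname = 'HAVE'
    and upcase(name) eqt 'VAR_';
quit;

data want;
  set have;
  array vars[*] &varlist;
  do _i = 1 to dim(vars);
    vars[_i] = (vars[_i] = 25); /* recode in place to a 1/0 flag */
  end;
  drop _i;
run;
```

The in-place recode works because a SAS comparison evaluates to 1 or 0, so no IF/ELSE is needed.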
Another way is to write a macro to do what you want.
%macro recode(invar=, outvar=, inval=, outval=, otherval=);
if &invar. = &inval. then &outvar. = &outval.;
else &outvar. = &otherval.;
%mend recode;
data want;
set have;
%recode(invar=var1, outvar=n_Var1, inval=25, outval=1, otherval=0);
.. 25 of these ..
run;
You can then generate these macro calls with code; search on "sas data driven programming" either here or on a search engine for examples.
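For instance, the 25 calls could themselves be generated with CALL EXECUTE (a sketch assuming the %recode macro above is already compiled and the variables really are var1-var25):

```sas
data _null_;
  call execute('data want; set have;');
  do i = 1 to 25;
    /* each call expands to one IF/ELSE pair inside the generated step */
    call execute(cats('%recode(invar=var', i,
                      ', outvar=n_var', i,
                      ', inval=25, outval=1, otherval=0)'));
  end;
  call execute('run;');
run;
```

One caveat: macros pushed through CALL EXECUTE resolve as they are stacked, which is fine here because %recode only generates Data Step statements.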
The latter is better if that 25 -> 1 recode varies by variable. The former is better if it doesn't and the variables are easily "listable" (like var1-var25). If they're not listable but the 25 -> 1 recode is fixed, either one works about the same in my opinion.
And of course, instead of using newvars you can just recode vars[_i] = 1 or whatever if that's easier.
As an aside, there are also simpler ways of coding variables to 1/0 flags if that's what you're doing using procs. I think PROC SCORE is one common way, but probably worth a separate question if you want to go this route.
I have the following macro that I have created and a dataset
%censusdata (districtname=,districtnum=);
data districtcodes;
input distnumber distname$;
cards;
1 Kaap/Cape
2 Simonstad
3 Bellville
4 Goodwood
5 Kuilsrivier
6 Wynberg
run;
Essentially I want to create a do loop which takes distname from the districtcodes dataset and passes it to the districtname parameter of the %censusdata macro, and likewise passes distnumber to the districtnum parameter.
How should I go about this?
Assuming you already have a macro developed, you can call it using CALL EXECUTE and the parameter values from the data set.
data districtcodes;
input distnumber distname$;
str=catt('%censusdata(districtname=', distname, ' , districtnum=', distnumber,
');');
call execute (str);
cards;
1 Kaap/Cape
2 Simonstad
3 Bellville
4 Goodwood
5 Kuilsrivier
6 Wynberg
;
run;
As you discovered, DATA step datalines (aka cards) are not compatible with the macro system.
You may want to rethink why the data has to be inside the macro. There are some use cases, but fewer than might seem at first.
Regardless of the reasoning, here are a couple of ways (there are more):
place the data inside a DATA step string and parse it out using scan
place the data inside a macro variable and parse it out using %scan
create the data set prior to calling the macro and pass the data set name as well
Here is one using DATA step string parsing
%macro censusdata (districtname=,districtnum=);
data districtcodes (keep=dist:);
*input distnumber distname$;
*cards;
data = "
1 Kaap/Cape
2 Simonstad
3 Bellville
4 Goodwood
5 Kuilsrivier
6 Wynberg
";
put data=;
put data= $HEX60.;
do _n_ = 1 by 1 while (lengthn(scan(data,_n_,' ')));
distnumber = input ( scan (data, _n_, ' '), best8. );
_n_ + 1;
districtname = scan (data, _n_, ' ');
output;
if _n_ > 10 then stop;
end;
stop;
run;
%mend;
%censusdata();
The macro example you show seems a little peculiar, as you are passing in parameters, ostensibly to help operate on some data, which is a static entity with respect to the macro.
A more reasonable approach might be to:
eliminate the macro altogether
pass the name of the data set, and names of variables to be used for some code generation.
Such a macro would only make sense if you were performing identical types of processing on a wide range of data sets that fit an operating model (i.e., the data set has at least two columns: one for code numbers and a second for some associated text).
As you might see, coding macros with highly specific names and arguments (such as censusdata, districtname, districtcode) can be a wrapper with little re-use value.
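A generic version of such a macro might look like the sketch below; the format generation is just one illustrative example of "some code generation", not code from the question:

```sas
%macro censusdata(data=, codevar=, namevar=);
  /* Turn any code/name data set into a numeric format:
     cntlin= expects FMTNAME, START, and LABEL variables */
  data _fmt / view=_fmt;
    set &data. (rename=(&codevar.=start &namevar.=label));
    retain fmtname 'district' type 'n';
  run;

  proc format cntlin=_fmt;
  run;
%mend;
```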
data districtcodes;
input distnumber distname$;
cards;
1 Kaap/Cape
2 Simonstad
3 Bellville
4 Goodwood
5 Kuilsrivier
6 Wynberg
run;
Example invocation
%censusdata (data=districtcodes, codevar=distnumber, namevar=distname);
I am trying to find a quick way to replace missing values with the average of the two nearest non-missing values. Example:
Id Amount
1 10
2 .
3 20
4 30
5 .
6 .
7 40
Desired output
Id Amount
1 10
2 **15**
3 20
4 30
5 **35**
6 **35**
7 40
Any suggestions? I tried using the retain function, but I can only figure out how to retain last non-missing value.
I think what you are looking for might be more like interpolation. While this is not the mean of the two closest values, it might be useful.
There is a nifty little tool for interpolating in datasets called PROC EXPAND. (It should do extrapolation as well, but I haven't tried that yet.) It's very handy when making series of dates and cumulative calculations.
data have;
input Id Amount;
datalines;
1 10
2 .
3 20
4 30
5 .
6 .
7 40
;
run;
proc expand data=have out=Expanded;
convert amount=amount_expanded / method=join;
id id; /* the second "id" is the column name */
run;
For more on the proc expand see documentation: https://support.sas.com/documentation/onlinedoc/ets/132/expand.pdf
This works:
data have;
input id amount;
cards;
1 10
2 .
3 20
4 30
5 .
6 .
7 40
;
run;
proc sort data=have out=reversed;
by descending id;
run;
data retain_non_missing;
set reversed;
retain next_non_missing;
if amount ne . then next_non_missing = amount;
run;
proc sort data=retain_non_missing out=ordered;
by id;
run;
data final;
set ordered;
retain last_non_missing;
if amount ne . then last_non_missing = amount;
if amount = . then amount = (last_non_missing + next_non_missing) / 2;
run;
but as ever, will need extra error checking etc for production use.
The key idea is to sort the data into reverse order, allowing it to use RETAIN to carry the next_non_missing value back up the data set. When sorted back into the correct order, you then have enough information to interpolate the missing values.
There may well be a PROC to do this in a more controlled way (I don't know anything about PROC STANDARDIZE, mentioned in Reeza's comment) but this works as a data step solution.
Here's an alternative requiring no sorting. It does require IDs to be sequential, though that can be worked around if they're not.
It uses two SET statements: one that reads the main (and previous) amounts, and one that reads ahead until the next non-missing amount is found. Here I use the sequence of id values to guarantee it will be the right record, but you could write this differently (keeping track of which loop you're on) if the id values aren't sequential or in any particular order.
I use the first.amount check to make sure we don't try to execute the second set statement more than we should (which would terminate early).
You need to do two things differently if you want first/last rows treated differently. Here I assume prev_amount is 0 if it's the first row, and I assume next_amount is missing at end of file, meaning the last rows just get the last prev_amount repeated, while the first row is averaged between 0 and the next_amount. You can treat either one differently if you choose; I don't know your data.
data have;
input Id Amount;
datalines;
1 10
2 .
3 20
4 30
5 .
6 .
7 40
;;;;
run;
data want;
set have;
by amount notsorted; *so we can tell if we have consecutive missings;
retain prev_amount; *next_amount is auto-retained;
if not missing(amount ) then prev_amount=amount;
else if _n_=1 then prev_amount=0; *or whatever you want to treat the first row as;
else if first.amount then do;
do until ((next_id > id and not missing(next_amount)) or (eof));
set have(rename=(id=next_id amount=next_amount)) end=eof;
end;
amount = mean(prev_amount,next_amount);
end;
else amount = mean(prev_amount,next_amount);
run;
I'd like the sum of the rows of a data set. In particular, I would like to sum from the second element to the last element (skipping the first entry).
How can I achieve this?
It sounds like you want to add up everything except the first column. You also don't know how many variables you have, and the number may change over time.
There may be a smarter way to do this, but here are 3 options.
If your ID value is stored as text while everything else is a number, then it is trivial to say:
data sum;
set test;
sum = sum(of _numeric_);
run;
which will simply add up all numeric variables. However it sounds like you have integer IDs, so perhaps one of these options would work. First, some sample data:
data test;
input id var1 var2 var3;
cards;
1 2 2 2
2 3 3 3
3 4 4 4
4 5 5 5
;
run;
Option 1 - Simply add up all of the numeric variables, and then subtract your ID value, this leaves you with the sum of everything except the ID:
data test2;
set test;
sum=sum(of _numeric_)-id;
run;
Option 2 - You can tell SAS to operate over a range of variables in the order they are listed in the dataset. You could just do sum = sum(of var1--var3);, however you might not know what the first and last variables are. There's also a possibility that your ID variable is in the middle somewhere.
A solution to this would be to make sure your ID variable is first, and then create dummy variables before and after the range of variables you want to sum:
data test3;
format id START_SUM;
set test;
END_SUM = .;
sum = sum(of START_SUM--END_SUM);
drop START_SUM END_SUM;
run;
This creates ID and START_SUM before setting your data, and then creates the empty END_SUM at the end of your data. It then sums everything from START_SUM to END_SUM, and because sum(of ...) skips over missing values, you only get the sum of the variables you actually care about. Then you drop your dummy variables as they are no longer necessary.
Option 1 is obviously simpler, but Option 2 has some potential benefits in that it works with both numeric and non-numeric IDs, and has no chance of being subject to any sorts of weird rounding issues when you add and subtract the ID (although that won't happen if everything is an integer).
Confused about setting and recalling macro variables within a data step. I have an array of "Tumor" variables, and only one of them contains the information I need. There are a series of flags (BIN) to help me know which part of the Tumor array to reference. How can I do something like the following:
data tumors;
input ID $ BIN1 BIN2 Tumor1 Tumor2;
datalines;
1001 0 0 12 00
1002 1 0 01 01
1003 0 1 00 12
;
data newdata;
set tumors;
if BIN1 = 1 then do; %let value = 1; end;
if BIN2 = 1 then do; %let value = 2; end;
if Tumor&value in ('00','01','02') then Stage=0;
run;
The code outputs all blanks for "Stage" because I am doing this improperly, but I am not sure where the mistake is (there should be many listed as Stage 0). Any advice? I would want it to output the following:
data tumors_new;
input ID $ BIN1 BIN2 Tumor1 Tumor2 Stage;
datalines;
1001 0 0 12 00 ""
1002 1 0 01 02 01
1003 0 1 00 12 12
;
In your case, you do not need to use macro variables. You can do all of this with data step logic:
data newdata;
set olddata;
array Tumor[2];
if(BIN1 = 1) then value = 1;
if(BIN2 = 1) then value = 2;
if(value IN(1, 2) ) then do; *Prevent errors from occurring if value is missing;
if(Tumor[value]) in ('00','01','02') then Stage=0;
end;
run;
Assuming your variable names are Tumor1, Tumor2, we initialize an array named Tumor containing 2 values, which will automatically be named Tumor1 and Tumor2.
Explanation
The macro facility is a separate programming language from the SAS Data Step, and only a few Data Step functions connect the two languages. The reason this is not working is that SAS always resolves macro language elements first, before compiling any other code. When programming, always assume that your macro code will be interpreted first. I simply remember the compile order as:
Macro Code
SAS Code
In the program above, SAS does the following order of operations:
Assign 1 to the macro variable value
Assign 2 to the macro variable value
Resolve the macro variable value to 2
Compile the data step, then execute it
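You can see that ordering with a minimal illustration: both %LET statements run while the step is still being compiled, so the reference resolves to the final value before a single row is read.

```sas
%let value = 1;
%let value = 2; /* overwrites the first %LET during compilation */

data _null_;
  /* the macro reference was already replaced with 2 at compile time */
  put "value resolves to: &value";
run;
```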
To bridge the gap between the Data Step and Macro Language, you need to use one of two functions:
call symput('macro variable name', variable or constant)
call symputx('macro variable name', variable or constant, <optional scope: 'G' or 'L'>)
symput (standing for Symbol Put) will read in the value of a data step variable into a macro variable, but only for that record. This is the tricky part. Because the Data Step naturally loops, it will constantly overwrite the value of your macro variable until the end of file marker. For this reason, it is very common to find call symput routines embedded in conditionals.
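A sketch of that conditional pattern, using the question's tumors data (the ID value picked here is arbitrary): the macro variable keeps whatever the last matching record stored, and is safe to reference once the step has finished.

```sas
data _null_;
  set tumors;
  /* only overwrite the macro variable for the record we care about */
  if ID = '1002' then call symputx('picked', Tumor1);
run;

%put Picked value: &picked;
```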
With symput, you are unable to use that macro variable directly within the same Data Step; it is only usable after the data step completes. For example, you cannot use this logic:
data foo;
set bar;
call symput('macvar', var);
if(&macvar = 1) then put 'Woo!';
run;
This will produce an error
ERROR 22-322: Syntax error, expecting one of the following: a name, a quoted string, a numeric constant, a datetime constant, a missing value, INPUT, PUT.
WARNING: Apparent symbolic reference MACVAR not resolved.
This is because macvar is not given a value until the end of the data step, causing an error in the if statement (and, because of the error, symput never runs, so macvar is never created). To SAS, the if statement above looks like:
if( = 1);
You can confirm this by typing in that code and finding that Error 22-322 occurs again.
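If you genuinely need the value in the same step, one workaround (a sketch, and usually unnecessary, since the plain Data Step variable is right there) is SYMGET, which reads a macro variable at execution time instead of compile time:

```sas
data foo;
  set bar;
  call symputx('macvar', var);
  /* SYMGET is evaluated per row at run time, so it sees the value
     just stored by CALL SYMPUTX on this same iteration */
  if symget('macvar') = '1' then put 'Woo!';
run;
```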
Suppose the dataset has 3 columns
Obs Theo Cal
1 20 20
2 21 23
3 21 .
4 22 .
5 21 .
6 23 .
Theo is the theoretical value while Cal is the estimated value.
I need to calculate the missing Cal.
For each Obs, its Cal is a linear combination of previous two Cal values.
Cal(3) = Cal(2) * &coef1 + Cal(1) * &coef2.
Cal(4) = Cal(3) * &coef1 + Cal(2) * &coef2.
But Cal = lag1(Cal) * &coef1 + lag2(Cal) * &coef2 didn't work as I expected.
The problem with using lag is that when you use lag1(Cal), you're not getting the last value of Cal that was written to the output dataset; you're getting the last value that was passed to the lag1 function.
It would probably be easier to use a retain as follows:
data want(drop=Cal_l:);
set have;
retain Cal_l1 Cal_l2;
if missing(Cal) then Cal = Cal_l1 * &coef1 + Cal_l2 * &coef2;
Cal_l2 = Cal_l1;
Cal_l1 = Cal;
run;
I would guess you wrote a DATA step like so:
data want;
set have;
if missing(cal) then
cal = lag1(cal)*&coef1 + lag2(cal)*&coef2;
run;
LAG isn't grabbing a previous value; rather, it creates a queue that is N items long and hands you the item at the end of it. If you call it behind an IF statement, you will never put the useful values of CAL into that queue - you'll only be tossing missings into it. See it like so:
data have;
do x=1 to 10;
output;
end;
run;
data want;
set have;
real_lagx = lag(x);
if mod(x,2)=0 then do;
not_lagx = lag(x);
put real_lagx= not_lagx=;
end;
run;
The real lags are the immediate last value, while the NOT lags are the last even value, because they're inside the IF.
You have two major options here. Use RETAIN to keep track of the last two observations, or use LAG like I did above before the IF statement and then use the lagged values inside the IF statement. There's nothing inherently better or worse with either method; LAG works for what it does as long as you understand it well. RETAIN is often considered 'safer' because it's harder to screw up; it's also easier to watch what you're doing.
data want;
set have;
retain cal1 cal2;
if missing(cal) then cal=cal1*&coef1+cal2*&coef2;
output;
cal2=cal1;
cal1=cal;
run;
or
data want;
set have;
cal1=lag1(cal);
cal2=lag2(cal);
if missing(cal) then cal=cal1*&coef1+cal2*&coef2;
run;
The latter method will only work if cal is infrequently missing - specifically, if it's never missing more than once from any three observations. In the initial example, the first cal (row 3) will be populated, but from there on out it will always be missing. This may or may not be desired; if it's not, use retain.
There might be a way to accomplish it in a DATA step, but as for me, when I want SAS to process iteratively, I use PROC IML and a DO loop. I named your table SO and successfully ran the following:
PROC IML;
use SO; /* create a matrix from your table to be used in proc iml */
read all var _all_ into table;
close SO;
Cal=table[,3];
do i=3 to nrow(cal); /* process iteratively the calculations */
if cal[i]=. then cal[i]=&coef1.*cal[i-1]+&coef2.*cal[i-2];
end;
table[,3]=cal;
Varnames={"Obs" "Theo" "Cal"};
create SO_ok from table [colname=varnames]; /* outputs a new table */
append from table;
close SO_ok;
QUIT;
I'm not saying you couldn't use lag() and a DATA step to achieve what you want to do. But I find that PROC IML is useful and more intuitive when it comes to iterative processing.