Creating new variables in a SAS table according to a specific condition

I have a SAS table with a numeric variable age. I need to construct new variables depending on the value of age. The new variables should follow this logic:
if 0<=age<=25 then age0=1 else age0=0
if 26<=age<=40 then age25=1 else age25=0 //here age25 is a different variable from age0
So I wrote this code, using a macro to avoid repetition:
%macro intervalle_age(var,var1,var2);
if (&var=>&var1) and (&var<=&var2);
then return 1;
else return 0;
%mend;
Then I call the macro to get the value of each new variable:
age0=%intervalle_age(age,0,25);
age25=%intervalle_age(age,26,40);
age25=%intervalle_age(age,41,65);
age25=%intervalle_age(age,65,771);
But this doesn't work!
How can I resolve it, please?
Thank you in advance!

I agree with Nikolay that you should step back and avoid macros altogether. The sample code you posted also appears to be incorrect: it has four conditionals for different age ranges, but they are assigned to only two variables.
In SAS, a logical expression evaluates to 1 for true and 0 for false. Conversely, numeric values can be used in logical expressions, where non-zero, non-missing values mean true and anything else means false.
So a sequence of code for assigning age range flag variables would be:
age0 = 0 < age <= 25 ;
age25 = 25 < age <= 40 ;
age40 = 40 < age <= 65 ;
age65 = 65 < age <= 71 ;
age71 = 71 < age ;
Masking simple, readable SAS statements behind a wall of macro code can cause maintenance problems and make the program harder to understand later. However, if you need to construct many blocks of this kind, a macro that is driven by the breakpoints can improve legibility.
data have; age = 22; bmi = 20; run;
options mprint;
* easier to understand and not prone to copy paste issues or typos;
data want;
set have;
%make_flag_variables (var=age, breakpoints=0 25 40 65 71)
%make_flag_variables (var=bmi, breakpoints=0 18.5 25 30)
run;
This depends on the following macro, which must be compiled before the DATA step above is submitted:
%macro make_flag_variables (var=, breakpoints=);
%local I BREAKPOINT SUFFIX_LOW RANGE_LOW SUFFIX_HIGH RANGE_HIGH;
%let I = 1;
%do %while (%length(%scan(&breakpoints,&I,%str( ))));
%let BREAKPOINT = %scan(&breakpoints,&I,%str( ));
%let SUFFIX_LOW = &SUFFIX_HIGH;
%let SUFFIX_HIGH = %sysfunc(TRANSLATE(&BREAKPOINT,_,.));
%let RANGE_LOW = &RANGE_HIGH;
%let RANGE_HIGH = &BREAKPOINT;
%if &I > 1 %then %do;
&VAR.&SUFFIX_LOW = &RANGE_LOW < &VAR <= &RANGE_HIGH; /* data step source code emitted here */
%end;
%let I = %eval ( &I + 1 );
%end;
%mend;
The log snippet shows the code generated by the macro:
92 data want;
93 set have;
94
95 %make_flag_variables (var=age, breakpoints=0 25 40 65 71)
MPRINT(MAKE_FLAG_VARIABLES): age0 = 0 < age <= 25;
MPRINT(MAKE_FLAG_VARIABLES): age25 = 25 < age <= 40;
MPRINT(MAKE_FLAG_VARIABLES): age40 = 40 < age <= 65;
MPRINT(MAKE_FLAG_VARIABLES): age65 = 65 < age <= 71;
96 %make_flag_variables (var=bmi, breakpoints=0 18.5 25 30)
MPRINT(MAKE_FLAG_VARIABLES): bmi0 = 0 < bmi <= 18.5;
MPRINT(MAKE_FLAG_VARIABLES): bmi18_5 = 18.5 < bmi <= 25;
MPRINT(MAKE_FLAG_VARIABLES): bmi25 = 25 < bmi <= 30;
97 run;

return doesn't have any special meaning in SAS macros. Macros "generate" code: the macro invocation is replaced by whatever text is left after the macro processor has handled the things it "understands" (basically, tokens starting with & or %).
In your case the macro processor just expands the macro variables (the rest is just text, which the macro processor leaves untouched), resulting in:
age0=if (age=>0) and (age<=25);
then return 1;
else return 0;
age25=/*and so on*/
It's important to understand how the macro processor and regular execution interact (basically, all the macro expansions must be finished before the given DATA or PROC step starts executing).
To make this work you either need to generate the complete if statement, including the assignment to the output var:
%macro calc_age_interval(outvar, inputvar, lbound, ubound);
if (&inputvar=>&lbound) and (&inputvar<=&ubound) then do;
&outvar = 1;
end; else do;
&outvar = 0;
end;
%mend calc_age_interval;
%calc_age_interval(outvar=age0, inputvar=age, lbound=0, ubound=25);
Or make it generate an expression that evaluates to either 0 or 1 at execution time. You can assign the boolean expression directly to a variable (its result is 1 or 0 anyway) or wrap it in IFN() to be more explicit:
%macro calc_age_interval(inputvar, lbound, ubound);
ifn((&inputvar=>&lbound) and (&inputvar<=&ubound), 1, 0)
%mend;
age0 = %calc_age_interval(age, 0, 25); /* expands to age0=ifn(..., 1, 0); */
Taking a step back, I wouldn't bother with macros in this case at all. You can use the in (M:N) range notation or reset all output variables to 0, then do an if-elseif:
if age < 0 then age_missing_or_negative = 1;
else if age <= 25 then age0 = 1;
else if age <= 40 then age25 = 1;
...
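As a minimal sketch of the "reset, then if / else if" idea, assuming an input data set named have and a hypothetical catch-all flag age40 for everything above 40 (the real bands and names are up to you):

```sas
data want;
  set have;
  /* reset every flag to 0 so that exactly one is switched on below */
  age_missing_or_negative = 0;
  age0 = 0;
  age25 = 0;
  age40 = 0;   /* hypothetical catch-all flag for ages above 40 */
  /* a missing age compares lower than any number, so it lands in the first branch */
  if age < 0 then age_missing_or_negative = 1;
  else if age <= 25 then age0 = 1;
  else if age <= 40 then age25 = 1;
  else age40 = 1;
run;
```

For whole-number ages, the range notation gives the same kind of flag in one statement, for example age0 = age in (0:25);, but it only matches integers, so a value such as 24.5 would not be flagged.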

Related

mean of 10 variables with different starting point (SAS)

I have 19 numeric variables, pm25_total2000 to pm25_total2018.
Each person has a starting year between 2013 and 2018; we can call that variable "reqyear".
Now I want to calculate, for each person, the mean over the 10 years up to and including the starting year.
For example, if a person has starting year 2015, I want mean(of pm25_total2006-pm25_total2015).
Or, if a person has starting year 2013, I want mean(of pm25_total2004-pm25_total2013).
How can I do this? Here is what I tried:
data _null_;
set scapkon;
reqyear=substr(iCDate,1,4)*1;
call symput('reqy',reqyear);
run;
data scatm;
set scapkon;
/* Mean of the 10 years before the recruitment year */
pm25means=mean(of pm25_total%eval(&reqy.-9)-pm25_total%eval(&reqy.));
run;
%eval(&reqy.-9) resolves to a single constant (2007 in my case) that is applied to every person, so this doesn't work.

You can compute the mean with a traditional loop.
data want;
set have;
array x x2000-x2018;
call missing(sum, mean, n);
do _n_ = 1 to 10;
v = x ( start - 1999 -_n_ );
if not missing(v) then do;
sum + v;
n + 1;
end;
end;
if n then mean = sum / n;
run;
If you want to flex your SAS skill, you can use POKE and PEEK concepts to copy a fixed length slice (i.e. a fixed number of array elements) of an array to another array and compute the mean of the slice.
Example:
You will need to add sentinel elements and range checks on start to prevent errors when start-10 < 2000.
data have;
length id start x2000-x2018 8;
do id = 1 to 15;
start = 2013 + mod(id,6);
array x x2000-x2018;
do over x;
x = _n_;
_n_+1;
end;
output;
end;
format x: 5.;
run;
data want;
length id start mean10yrPriorStart 8;
set have;
array x x2000-x2018;
array slice(10) _temporary_;
call pokelong (
peekclong ( addrlong ( x(start-1999-10) ) , 10*8 ) ,
addrlong ( slice (1))
);
mean10yrPriorStart = mean(of slice(*));
run;

Use an array and a loop: index the array by year, accumulate the sum of the values, accumulate a count to account for any missing values, then divide to obtain the mean.
data want;
set have;
array _pm(2000:2018) pm25_total2000 - pm25_total2018;
do year=reqyear to (reqyear-9) by -1;
*add totals;
total = sum(total, _pm(year));
*add counts;
nyears = sum(nyears,not missing(_pm(year)));
end;
*accounts for possible missing years;
mean = total/nyears;
run;
Note this loop goes in reverse (start year to 9 years previous) because it's slightly easier to understand this way IMO.
If you have no missing values you can remove the nyears step, but not a bad thing to include anyways.

NOTE: My first answer did not address the OP's question, so this is a redux.
For this solution, I used Richard's code for generating test data. However, I added a line to randomly add missing values.
x = _n_;
if ranuni(1) < .1 then x = .;
_n_+1;
This alternative does not perform any explicit checks for missing values, because the sum() and n() functions already ignore them. The loop over the dynamic slice of the data array only copies values into a temporary array; the final sum and count are computed on the temporary array outside the loop.
data want;
set have;
array x(2000:2018) x:;
array t(10) _temporary_;
j = 1;
do i = start-9 to start;
t(j) = x(i);
j + 1;
end;
sum = sum(of t(*));
cnt = n(of t(*));
mean = sum / cnt;
drop x: i j;
run;
Result:
id  start  sum  cnt  mean
1   2014    72    7  10.285714286
2   2015   305   10  30.5
3   2016   458    9  50.888888889
4   2017   631    9  70.111111111

Random number generation without repetition in SAS

I am trying to generate random numbers without repetition. My idea is to use a do while loop that runs until 5 numbers have been picked. Inside it, I draw a random number, store it in an array, and on every iteration check whether the drawn number is already in the array to decide whether it is a repetition.
Here is the code where I try to implement this idea, but something is wrong and I do not know where I made a mistake.
data WithoutRepetition;
counter = 0;
array temp (5) _temporary_;
do while(1);
rand=round(4*ranuni(0) +1,1);
if counter = 0 then
do;
temp(1) = rand;
counter=counter+1;
output;
continue;
end;
do a=1 to counter by 1;
if temp(a) = rand then continue ;
end;
temp(counter) = rand;
output;
counter=counter+1;
if counter = 5 then do;
leave;
end;
end;
run;

Sounds like you want a random permutation.
data _null_;
seed=12345;
array r[5] (1:5);
put r[*];
call ranperm(seed,of r[*]);
put r[*];
run;
The log shows the array before and after the permutation:
1 2 3 4 5
5 1 4 3 2
This is a simplified version of what you are trying to do.
data WithoutRepetition;
i=0;
array temp[5];
do r=1 by 1 until(i eq dim(temp));
rand=round(4*ranuni(0)+1,1);
if rand not in temp then do; i+1; temp[i]=rand; end;
end;
drop i rand;
run;
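If the five numbers are wanted as five rows, as in the question's output, the CALL RANPERM idea can be sketched like this. The seed value and the variable names r, i and rand are only illustrative assumptions:

```sas
data WithoutRepetition;
  array r[5] (1 2 3 4 5);        /* the values to shuffle */
  seed = 12345;                  /* arbitrary seed; change it for a different order */
  call ranperm(seed, of r[*]);   /* permutes r1-r5 in place */
  do i = 1 to dim(r);
    rand = r[i];
    output;                      /* one row per value, in permuted order */
  end;
  keep rand;
run;
```

Because the values are permuted rather than drawn one at a time, no duplicate check or retry loop is needed.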

You were reasonably close to a working, if convoluted solution. For educational purposes, although data _null_'s answer is much cleaner, here's why your code wasn't working:
Your leave statement is inside a do-end block which is inside another do-loop. Leave statements only break out of the innermost do-loop, so yours has no effect.
The same is true of your continue statements, the first of which is completely unnecessary.
Because you are updating your array with newly-found unique values before you increment counter, previously populated values are being overwritten. This often results in duplicates of the overwritten values appearing in your output.
I would place continue and leave in the same category as goto - avoid using them if at all possible, as they tend to make code difficult to debug. It's clearer to set the exit conditions for all your loops at the point of entry.
Just for fun, though, here's a fixed version of your original code:
data WithoutRepetition;
counter = 0;
array temp (5) _temporary_;
do while(1);
rand=round(4*ranuni(0) +1,1);
if counter = 0 then do;
temp(1) = rand;
counter +1;
output;
end;
dupe = 0;
do a=1 to counter;
if temp(a) = rand then dupe=1;
end;
if dupe then continue;
counter +1;
temp(counter) = rand;
output;
if counter = 5 then leave;
end;
run;
And here is an equivalent version with all the leave and continue statements replaced with more readable alternatives:
data WithoutRepetition;
counter = 0;
array temp (5) _temporary_;
do while(counter < 5);
rand=round(4*ranuni(0) +1,1);
if counter = 0 then do;
temp(1) = rand;
counter +1;
output;
end;
else do;
dupe = 0;
do a=1 to counter while(dupe = 0);
if temp(a) = rand then dupe=1;
end;
if dupe = 0 then do;
counter +1;
temp(counter) = rand;
output;
end;
end;
end;
run;

SAS maximize a function of variables

Given a set of variables v(1) - v(k), a function f is defined as f(v1,v2,...vk).
The goal is to find the v(i) that maximize f subject to v(1)+v(2)+...+v(k)=n. All elements are restricted to non-negative integers.
Note: I don't have SAS/IML or SAS/OR.
If k is known, say 2, then I can do something like this.
data out;
set in;
maxf = 0;
n1 = 0;
n2 = 0;
do i = 0 to n;
do j = 0 to n;
if i + j ne n then continue;
_max = f(i,j);
if _max > maxf then do;
maxf = max(maxf,_max);
n1 = i;
n2 = j;
end;
end;
end;
drop i j;
run;
However, this solution has several issues.
Using nested loops seems very inefficient.
The number of nested loops cannot be written down in advance when k is unknown.
It's exactly the "allocate n balls into k bins" problem, where k is determined by the number of columns in data set in that have a specific prefix, and n is determined by a macro variable.
Function f is known, e.g. f(i,j) = 2*i+3*j;
Can this be done in a data step?

As said in the comments, general non-linear integer programs are hard to solve. The method below will solve for continuous parameters. You will have to take the output and find the nearest integer values that maximize your function. However, the loop will now be much smaller and quicker to run.
First let's make a function. This function has an extra parameter and is linear in that parameter. Wrap your function inside something like this.
proc fcmp outlib=work.fns.fns;
function f(x1,x2,a);
out = -10*(x1-5)*(x1-5) + -2*(x2-2)*(x2-2) + 2*(x1-5) + 3*(x2-2);
return(out+a);
endsub;
run;quit;
options cmplib=work.fns;
We need to add the a parameter so that we can have a value that SAS can pass besides the actual parameters. SAS will think it's solving the likelihood of A, based on x1 and x2.
Generate a Data Set with an A value.
data temp;
a = 1;
run;
Now use PROC NLMIXED to maximize the likelihood of A.
ods output ParameterEstimates=Parameters;
ods select ParameterEstimates;
proc nlmixed data=temp;
parms x1=1 x2=1;
bounds x1>0, x2>0;
y = f(x1,x2,a);
model a ~ general(y);
run;
ods select default;
I get output of x1=5.1 and x2=2.75. You can then search "around" that to see where the maximum comes out.
Here's my attempt at a Data Step to search around the value:
%macro call_fn(fn,n,parr);
%local i;
&fn(&parr[1]
%do i=2 %to &n;
, &parr[&i]
%end;
,0)
%mend;
%let n=2;
%let c=%sysevalf(2**&n);
data max;
set Parameters end=last;
array parms[&n] _temporary_;
array start[&n] _temporary_;
array pmax[&n];
max = -9.99e256;
parms[_n_] = estimate;
if last then do;
do i=1 to &n;
start[i] = floor(parms[i]);
end;
do i=1 to &c;
x = put(i,binary&n..); /* numeric BINARY format, one digit per parameter */
do j=1 to &n;
parms[j] = input(substr(x,j,1),best.) + start[j];
end;
/*You need a macro to write this dynamically*/
val = %call_fn(f,&n,parms);
*put i= max= val=;
if val > max then do;
do j=1 to &n;
pmax[j] = parms[j];
end;
max = val;
end;
end;
output;
end;
run;

SAS Proc GCHART equivalent of Proc GPLOT UNIFORM

Is there a way to make multiple bar charts with uniform axis with proc gchart?
In proc gplot, I can use the uniform option like this:
proc gplot data=test uniform;
by state;
plot var*date;
run;
This will give me a set of plots for the by variable that all use the same axis range.
This option doesn't exist for proc gchart--is there any other way to do this? I can't just define a fixed range since my data will vary.

Thanks for the input everyone.
Since it looks like there isn't a good solution within the proc itself, I went with a macro approach to manually setting the axis.
This paper provided the foundation for what I did:
http://analytics.ncsu.edu/sesug/2012/BB-09.pdf
Since I couldn't find the text of the program anywhere except for in that non-searchable PDF, I've typed it in here. My version adds one additional parameter that optionally pads the low value in order to leave space for data labels below the low point (useful if you are making a column chart with labels above the positive values and below the negative values)
%macro set_axis_minmaxincrement(ds=,
axisvar=,
axis_length = 51,
sa_min = 999999,
sa_max = -999999,
returned_min = axis_min,
returned_max = axis_max,
returned_increment = axis_increment,
force_zero = 0,
pad_bottom = 0
) ;
%global &returned_min &returned_max &returned_increment;
/* Find the high and low values. Note: a data step was used versus a proc */
/* to allow the application of the option parameters, if specified. */
proc sort data=&ds out=sortlb(keep=&axisvar);
by &axisvar;
where &axisvar ne .;
run;
data axisdata(keep=low high);
retain low 0;
set sortlb end=eof;
by &axisvar;
if _n_=1 then low = &axisvar;
if eof then do;
high = &axisvar;
if &sa_min ^= 999999 and &sa_min < low then low = &sa_min;
if &sa_max ^= -999999 and &sa_max > high then high = &sa_max;
%if &force_zero = 1 %then %do;
if low > 0 then low = 0;
else if high < 0 then high = 0;
%end;
%if &pad_bottom = 1 %then %do;
if low < 0 then low = low-((high-low)*.06);
%end;
output;
end;
run;
data axisdata;
set axisdata;
/* ensure that high is greater than low */
if high <= low then do;
if abs(low) <= 1 then high = low + 1;
else high = low+10;
end;
/* Calculate the conversion unit to transform the standard range to */
/* include the actual range. This value is used to convert the standard */
/* to the actual increment for the actual range. */
axisrange = high - low;
/* Ranges of 6 or less */
if axisrange <= 6 then do;
check = 6;
conversion_unit = .1; /* .1 (not .01) keeps the standardized range between 6 and 60 */
do until (axisrange > check);
check = check/10;
if axisrange <= check then conversion_unit = conversion_unit / 10;
end;
end;
/* Ranges greater than 6 */
else do;
check = 60;
conversion_unit = 1.0;
do while (axisrange > check);
check = check*10;
conversion_unit = conversion_unit * 10;
end;
end;
/* standardize the range to lie between 6 to 60 */
unit_range = axisrange/conversion_unit;
/* Set the increment based on the unitized range */
/* 'Long' axis, 8 - 12 increments */
%if &axis_length >50 %then %do;
if unit_range < 12 then axisinc = 1 * conversion_unit;
else if unit_range < 24 then axisinc = 2 * conversion_unit;
else if unit_range < 30 then axisinc = 2.5 * conversion_unit;
else axisinc = 5 * conversion_unit;
%end;
/* Otherwise, 'short' axis, 4-6 increments */
%else %do;
if unit_range < 12 then axisinc = 2 * conversion_unit;
else if unit_range < 18 then axisinc = 3 * conversion_unit;
else if unit_range < 24 then axisinc = 4 * conversion_unit;
else if unit_range < 30 then axisinc = 5 * conversion_unit;
else axisinc = 10 * conversion_unit;
%end;
/*Round the min's value to match the increment; if the number is */
/* rounded up so that it becomes larger than the lowest data value, */
/* decrease the min by one increment. */
axislow = round(low,axisinc);
if axislow > low then axislow = axislow - axisinc;
/* Round the max; if the number is rounded down, */
/* increase the max by one increment. */
axishigh = round(high, axisinc);
if axishigh < high then axishigh = axishigh + axisinc;
/* put the values into the global macro variables */
call symput("&returned_min",compress(put(axislow, best.)));
call symput("&returned_max",compress(put(axishigh, best.)));
call symput("&returned_increment",compress(put(axisinc, best.)));
run;
%mend set_axis_minmaxincrement;
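A minimal usage sketch, assuming the data set test with the variables state, date and var from the question, and the macro's default output names (axis_min, axis_max, axis_increment). The same AXIS definition is applied to every BY group, which is what gives the uniform response axis:

```sas
/* compute one min / max / increment over the whole data set */
%set_axis_minmaxincrement(ds=test, axisvar=var)

/* a single axis definition shared by all BY-group charts */
axis1 order=(&axis_min to &axis_max by &axis_increment);

/* BY processing needs the data sorted by the BY variable */
proc sort data=test;
  by state;
run;

proc gchart data=test;
  by state;
  vbar date / discrete sumvar=var raxis=axis1;
run;
quit;
```

If each bar sums several rows, the macro should be run on the summarized values rather than the raw ones, since the axis has to cover the bar heights.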

SAS creating a dynamic interval

This is somewhat complex (well, to me at least).
Here is what I have to do.
Say that I have the following dataset:
date    price  volume
02-Sep  40     100
03-Sep  45     200
04-Sep  46     150
05-Sep  43     300
Say that I have a breakpoint at which I wish to create an interval in my dataset. For instance, let my breakpoint be a transaction volume of 200.
What I want is to create an ID column that takes the values 1, 2, 3, ... and moves to the next value every time another 200 of volume has accumulated. When you sum the volume per ID, the total must be the same (200) for every ID.
So using my example above, my final dataset should look like the following:
date    price  volume  id
02-Sep  40     100     1
03-Sep  45     100     1
03-Sep  45     100     2
04-Sep  46     100     2
04-Sep  46     50      3
05-Sep  43     150     3
05-Sep  43     150     4
(The last ID may fall short of the breakpoint, but that is fine. I will drop the last ID.)
As you can see, I had to "decompose" some rows (the second row, for instance, where I split the 200 into two rows of 100) so that the sum of volume is a constant 200 for every ID.

Looks like you're doing volume bucketing for a flow toxicity VPIN calculation. I think this works:
%let bucketsize = 200;
data buckets(drop=bucket volume rename=(vol=volume));
set tmp;
retain bucket &bucketsize id 1;
do until(volume=0);
vol=min(volume,bucket);
output;
volume=volume-vol;
bucket=bucket-vol;
if bucket=0 then do;
bucket=&bucketsize;
id=id+1;
end;
end;
run;
I tested this with your dataset and it looks right, but I would carefully check several more cases to confirm that it behaves as expected.
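One way to run such a check is to total the volume per id after bucketing. The sketch below rebuilds the question's four rows as tmp (date is read as plain text here, which is an assumption), repeats the bucketing step so that it runs on its own, and then sums by id:

```sas
%let bucketsize = 200;

/* the four rows from the question */
data tmp;
  input date $ price volume;
  datalines;
02-Sep 40 100
03-Sep 45 200
04-Sep 46 150
05-Sep 43 300
;
run;

/* the bucketing step from above, repeated so the sketch is self-contained */
data buckets(drop=bucket volume rename=(vol=volume));
  set tmp;
  retain bucket &bucketsize id 1;
  do until(volume=0);
    vol=min(volume,bucket);
    output;
    volume=volume-vol;
    bucket=bucket-vol;
    if bucket=0 then do;
      bucket=&bucketsize;
      id=id+1;
    end;
  end;
run;

/* every id except possibly the last should total the bucket size */
proc sql;
  select id, sum(volume) as total_volume
  from buckets
  group by id;
quit;
```

With these four rows, ids 1 to 3 should each total 200 and id 4 should hold the remaining 150.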

If you have a variable which indicates 'Buy' or 'Sell', then you can try this. Let's say this variable is called type and takes the values 'B' or 'S'. One advantage of using this method would be that it is easier to process 'by-groups' if any.
%let bucketsize = 200;
data tmp2;
set tmp;
/* Initialize once; RETAIN carries the running totals and ids across rows. */
retain volsumb 0 idb 1 volsums 0 ids 1;
/* Store the current total for each type. */
if type = 'B' then volsumb = volsumb + volume;
else if type = 'S' then volsums = volsums + volume;
/* If the total has reached 200, then reset and increment id. */
/* You have not given the algorithm if the volume exceeds 200, for example the first two values are 150 and 75. */
if volsumb = &bucketsize then do; idb = idb + 1; volsumb = 0; end;
if volsums = &bucketsize then do; ids = ids + 1; volsums = 0; end;
drop volsumb volsums;
run;
