What does a dot do when it appears after a variable name in the FORMAT statement of PROC FREQ in SAS?

I wonder what the IQ. on the FORMAT line (second to last line) does in this piece of code. Does it mark the end of the variable name? Does it convert the value to an integer? Or something else? Thank you!
The dataset is called IQ, and it contains only one numerical variable called IQ. These numbers appear to be integers.
PROC FORMAT;
VALUE IQ
75 - <85 = '75 <= IQ score < 85'
85 - <95 = '85 <= IQ score < 95'
95 - <105 = '95 <= IQ score < 105'
105 - <115 = '105 <= IQ score < 115'
115 - <125 = '115 <= IQ score < 125'
125 - <135 = '125 <= IQ score < 135'
135 - <145 = '135 <= IQ score < 145'
145 - <155 = '145 <= IQ score < 155'
;
RUN;
/* *************************** Frequency table of the grouped IQ score **************************** */
PROC FREQ DATA = IQ;
TABLES IQ;
FORMAT IQ IQ.;
RUN;

The syntax for a FORMAT specification is the format name, followed by an optional width, followed by a period, followed by an optional decimal width.
A variable name cannot contain a period, so the PERIOD is what allows SAS to recognize that as a FORMAT specification instead of another variable name.
The FORMAT statement is used to attach a FORMAT specification to one or more VARIABLES. In your statement:
FORMAT IQ IQ. ;
The list of variables is just the single variable IQ and the format to be applied is IQ. , which will use the default width defined with the IQ format since there is no width listed.
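As a minimal sketch of that syntax, the step below writes one value to the log with a few built-in numeric formats, with and without an explicit width and decimal count. The variable x and its value are made up for illustration.

```sas
data _null_;
  x = 1234.5678;
  put x= 8.2;        /* the w.d format: width 8, 2 decimals */
  put x= comma10.2;  /* format COMMA, width 10, 2 decimals */
  put x= best.;      /* format BEST with no width given, so its default is used */
run;
```

In every case the period is what marks the token as a format rather than a variable name.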

Look in your code to see if there is any reference to a PROC FORMAT. Although SAS provides a large number of ready-made FORMATS and INFORMATS, it is often necessary to create formats designed for a specific purpose.
They can be created through PROC FORMAT and can be used to:
convert numeric variables into character values
convert character strings into numbers
convert character strings into other character strings
In your case, the user-defined format IQ. will convert the numeric variable iq to character values.
data have;
input iq iq_fmt;
format iq_fmt iq.; /* attach the user-defined format to the copy only */
cards;
80 80
100 100
110 110
130 130
150 150
;
run;
proc print data=have noobs;
run;
iq iq_fmt
80 75 <= IQ score < 85
100 95 <= IQ score < 105
110 105 <= IQ score < 115
130 125 <= IQ score < 135
150 145 <= IQ score < 155
Simple formats are created using the VALUE statement of PROC FORMAT. It takes the name of the format to be created and a set of paired mappings: the values on the left of the = sign are mapped to the label on the right of the = sign. As an example, if the value is between 75 and 85 (85 not included), the resulting value will be 75 <= IQ score < 85.
In the above example, iq and iq_fmt are identical columns. I apply the IQ. format to the iq_fmt column. You can see that, for the first observation, the value of iq is 80, thus mapped to 75 <= IQ score < 85 in the iq_fmt column.
More details available in Building and Using User Defined Formats
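If the grouped label is needed as an actual character variable, rather than only as a display format, one option is the PUT function. This is a small sketch that assumes the IQ. format from the question's PROC FORMAT step has already been created; the names want and iq_group are made up here.

```sas
/* Assumes the user-defined IQ. format already exists in the session. */
data want;
  input iq;
  length iq_group $24;        /* long enough for the longest label */
  iq_group = put(iq, iq.);    /* formatted text stored as a character value */
  datalines;
80
100
;
run;
```

Unlike the FORMAT statement, which only changes how iq is displayed, this stores the label itself in iq_group.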

Related

Creating new variables using sas table according specific condition

I have a SAS table which has a numeric variable age. I need to construct new variables depending on the value of age. New variables should have this logic:
if the 0<=age<=25 then age0=1 else age0=0
if the 26<=age<=40 then age25=1 else age25=0 //here age25 is different to age0!!
So I wrote this code using macro to avoid repetition:
%macro intervalle_age(var,var1,var2);
if (&var=>&var1) and (&var<=&var2);
then return 1;
else return 0;
%mend;
Then I call the macro to get values of each new variables:
age0=%intervalle_age(age,0,25);
age25=%intervalle_age(age,26,40);
age25=%intervalle_age(age,41,65);
age25=%intervalle_age(age,65,771);
But this doesn't work!
How can I resolve it, please?
Thank you in advance!
I agree with Nikolay that you should step back and avoid macros altogether. The sample code you posted also appears to be incorrect: you have four conditionals for different age ranges, but they are assigned to only two variables.
In SAS a logical evaluation resolves to 1 for true and 0 for false. Additionally numeric variables can be used in logical expressions with non-zero, non-missing values meaning true and false otherwise.
So a sequence of code for assigning age range flag variables would be:
age0 = 0 < age <= 25 ;
age25 = 25 < age <= 40 ;
age40 = 40 < age <= 65 ;
age65 = 65 < age <= 71 ;
age71 = 71 < age ;
Masking simple and readable SAS statements behind a wall of macro code can lead to maintenance issues and make the code harder to understand later. However, if your use case is to construct many sets of these kinds of code blocks, a macro that is based on the breakpoints could improve legibility and understanding.
data have; age = 22; bmi = 20; run;
options mprint;
* easier to understand and not prone to copy paste issues or typos;
data want;
set have;
%make_flag_variables (var=age, breakpoints=0 25 40 65 71)
%make_flag_variables (var=bmi, breakpoints=0 18.5 25 30)
run;
This depends on the following macro, which must be compiled before the DATA step above is submitted:
%macro make_flag_variables (var=, breakpoints=);
%local I BREAKPOINT SUFFIX_LOW RANGE_LOW SUFFIX_HIGH RANGE_HIGH;
%let I = 1;
%do %while (%length(%scan(&breakpoints,&I,%str( ))));
%let BREAKPOINT = %scan(&breakpoints,&I,%str( ));
%let SUFFIX_LOW = &SUFFIX_HIGH;
%let SUFFIX_HIGH = %sysfunc(TRANSLATE(&BREAKPOINT,_,.));
%let RANGE_LOW = &RANGE_HIGH;
%let RANGE_HIGH = &BREAKPOINT;
%if &I > 1 %then %do;
&VAR.&SUFFIX_LOW = &RANGE_LOW < &VAR <= &RANGE_HIGH; /* data step source code emitted here */
%end;
%let I = %eval ( &I + 1 );
%end;
%mend;
Log snippet shows the code generation performed by the macro
92 data want;
93 set have;
94
95 %make_flag_variables (var=age, breakpoints=0 25 40 65 71)
MPRINT(MAKE_FLAG_VARIABLES): age0 = 0 < age <= 25;
MPRINT(MAKE_FLAG_VARIABLES): age25 = 25 < age <= 40;
MPRINT(MAKE_FLAG_VARIABLES): age40 = 40 < age <= 65;
MPRINT(MAKE_FLAG_VARIABLES): age65 = 65 < age <= 71;
96 %make_flag_variables (var=bmi, breakpoints=0 18.5 25 30)
MPRINT(MAKE_FLAG_VARIABLES): bmi0 = 0 < bmi <= 18.5;
MPRINT(MAKE_FLAG_VARIABLES): bmi18_5 = 18.5 < bmi <= 25;
MPRINT(MAKE_FLAG_VARIABLES): bmi25 = 25 < bmi <= 30;
97 run;
return doesn't have any special meaning in SAS macros. The macros are said to "generate" code, i.e. the macro invocation is replaced by the text, that's left after processing the things that the macro processor "understands" (basically, involving tokens (words) starting with & or %).
In your case the macro processor just expands the macro variables (the rest is just text, which the macro processor leaves untouched), resulting in:
age0=if (age=>0) and (age<=25);
then return 1;
else return 0;
age25=/*and so on*/
It's important to understand how the macro processor and regular execution interact (basically, all the macro expansions must be finished before the given DATA or PROC step starts executing).
To make this work you either need to generate the complete if statement, including the assignment to the output var:
%macro calc_age_interval(outvar, inputvar, lbound, ubound);
if (&inputvar>=&lbound) and (&inputvar<=&ubound) then do;
&outvar = 1;
end; else do;
&outvar = 0;
end;
%mend calc_age_interval;
%calc_age_interval(outvar=age0, inputvar=age, lbound=0, ubound=25);
Or make it generate an expression that evaluates to either 0 or 1 at execution time. You can assign the boolean expression directly to a variable, since its result is either 1 or 0 anyway, or wrap it in IFN() to be more explicit:
%macro calc_age_interval(inputvar, lbound, ubound);
ifn((&inputvar>=&lbound) and (&inputvar<=&ubound), 1, 0)
%mend;
age0 = %calc_age_interval(age, 0, 25); /* expands to age0=ifn(..., 1, 0); */
Taking a step back, I wouldn't bother with macros in this case at all. You can use the in (M:N) range notation, or reset all output variables to 0 and then use an if / else if chain:
if age < 0 then age_missing_or_negative = 1;
else if age <= 25 then age0 = 1;
else if age <= 40 then age25 = 1;
...
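For completeness, here is one way the two alternatives mentioned above might look when written out in full. This is only a sketch: the data set names have and want, the flag names age40 and age65, and the _alt variables are assumptions (the question reuses age25 for the later ranges), and the in (M:N) form only matches integer ages.

```sas
data want;
  set have;   /* assumed input data set with a numeric variable age */

  /* Alternative 1: reset every flag, then set exactly one of them */
  age_missing_or_negative = 0;
  age0 = 0; age25 = 0; age40 = 0; age65 = 0;
  if missing(age) or age < 0 then age_missing_or_negative = 1;
  else if age <= 25 then age0 = 1;
  else if age <= 40 then age25 = 1;
  else if age <= 65 then age40 = 1;
  else age65 = 1;

  /* Alternative 2: integer range notation; the comparison itself yields 1 or 0 */
  age0_alt  = (age in (0:25));
  age25_alt = (age in (26:40));
run;
```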

Dealing with extreme outliers in sas with If-Then-Else statements

I have some extreme outliers throwing my regression model off, and I removed them using If-Then-Else statements. However, SAS eliminated those data points completely and found new outliers in the ones remaining. Is there a way to remove the outliers from analysis without it throwing more into the mix?
I calculated Q3 + 1.5 * IQR and used that value like so:
Data lungcancer; input trt surv age sex ##;
/* create a new variable diff */
diff = surv - 365;
/* create a new categorical variable resp */
If diff > 0 then resp= 1;
If diff <= 0 then resp= 0;
/* create a new categorical variable sev */
if 2276 > surv >= 1621 then sev=0;
Else If 456 <= surv <= 1620 then sev=1;
Else if 181 <= surv <= 455 then sev=2;
Else if 1 <= surv <= 180 then sev=3;
Else if surv > 2276 then delete; /* Remove outliers */
So, you removed some data points that were on the edge of your data, and then got a new set of data, and recalculated IQR, and ... are surprised that there are new "outliers"?
This isn't SAS doing anything special. It's doing what it's asked: flagging points that lie more than 1.5*IQR beyond the quartiles of whatever data it is given. Outlier removal is always up to you (when you're doing things this way, anyway, and not using one of the more advanced procs, I suppose): you decide what's an outlier and remove it or not, depending on your data. So, do you think these new data points are outliers? Remove them or not depending on that.

proc optmodel syntax for multiple constraints

I have been struggling with this since yesterday and have gone over a ton of material and a number of answers on stackoverflow already.
And I've also created a base code which I've pasted below.
background
we have units data for drugs, which can be combined into a combination, called a regimen. For example, drug1+drug2 would be regimen 1, drug1+drug2+drug3 would be regimen 2 and drug1+drug2+drug4 would be regimen 3.
Our ultimate objective is to find out the number of patients on a regimen. This can be accomplished only by finding the % contribution (called patient share) of each regimen to the market (we can't calculate it directly from units because each drug is used across multiple regimens).
basically
units = patientshare * dosing * compliance * duration of therapy * total patients
where we know units, dosing and duration of therapy
and total patients, compliance and patient share will be bounded variables.
My problem is that the variables and constraints are at different levels.
Units is at drug level (and month);
dosing is at drug level;
compliance is at drug level;
duration of therapy is at regimen level;
patient share is at regimen level
This is my code and I would appreciate if someone could tell me where I'm going wrong (which is in the arrays I suspect).
PROC OPTMODEL;
SET <STRING> DRUG;
SET <STRING> REGIMEN;
SET <STRING> MONTH;
NUMBER DOSING{DRUG};
READ DATA DRUG_DATA INTO DRUG=[DRUG] DOSING;
/*PRINT DOSING;*/
NUMBER COMPLIANCE{DRUG};
READ DATA DRUG_DATA INTO DRUG=[DRUG] COMPLIANCE;
/*PRINT COMPLIANCE;*/
NUMBER DOT{drug, regimen};
READ DATA REGIMEN INTO drug=[drug]
{R in regimen}< DOT [drug, R]=col(R)>;
PRINT DOT;
NUMBER UNITS{MONTH, DRUG};
READ DATA DATASET INTO MONTH=[MONTH]
{D IN DRUG}< UNITS[MONTH, D]=COL(D)>;
/*PRINT UNITS;*/
NUMBER RATIO{MONTH};
READ DATA RATIO_1 INTO MONTH=[MONTH] RATIO;
/*PRINT RATIO;*/
/*DEFINE THE PARAMETERS*/
var ps {MONTH,DRUG} init 0.1 >=0 <=1,
annualpatients init 7000 <=7700 >=6300,
compliance init 0.1 >=0.3 <=0.8,
DOSING[RIB] INIT 5 >=6 <=4;
/*SET THE OBJECTIVE*/
min sse = sum{M IN MONTH}( (units[M,D in drug]-(ps[M,R IN REGIMEN]*annualpatients*ratio[M]*dosing[D]*compliance[D]*dot[R]*7 ))**2 );
/*SET THE CONSTRAINTS*/
constraint MONTHLY_patient_share {M IN MONTH}: sum{r is regimen}(ps[R IN REGIMEN])=1;
constraint total_patients sum{M in months, r in regimen} : ps[m,r in regimen]*annualpatients*ratio[m]=annual_patients;
expand;
solve with nlpc;
quit;
And here's the log:
2824 PROC OPTMODEL;
2825
2826 /*DEFINE THE DATA LEVELS (SETS) OF DRUGS, MONTH AND REGIMEN*/
2827
2828 SET <STRING> DRUG;
2829 SET <STRING> REGIMEN;
2830 SET <STRING> MONTH;
2831
2832 NUMBER DOSING{DRUG};
2833 READ DATA DRUG_DATA INTO DRUG=[DRUG] DOSING;
NOTE: There were 4 observations read from the data set WORK.DRUG_DATA.
2834 /*PRINT DOSING;*/
2835
2836 /*NUMBER COMPLIANCE{DRUG};*/
2837 /*READ DATA DRUG_DATA INTO DRUG=[DRUG] COMPLIANCE;*/
2838 /*PRINT COMPLIANCE;*/
2839
2840 NUMBER DOT{drug, regimen};
2841 READ DATA REGIMEN INTO drug=[drug]
2842 {R in regimen}< DOT [drug, R]=col(R)>;
ERROR: The symbol 'REGIMEN' has no value at line 2842 column 7.
2843 PRINT DOT;
ERROR: The symbol 'REGIMEN' has no value at line 2840 column 18.
2844
2845 NUMBER UNITS{MONTH, DRUG};
2846 READ DATA DATASET INTO MONTH=[MONTH]
2847 {D IN DRUG}< UNITS[MONTH, D]=COL(D)>;
NOTE: There were 12 observations read from the data set WORK.DATASET.
2848 /*PRINT UNITS;*/
2849
2850 NUMBER RATIO{MONTH};
2851 READ DATA RATIO_1 INTO MONTH=[MONTH] RATIO;
NOTE: There were 12 observations read from the data set WORK.RATIO_1.
2852 /*PRINT RATIO;*/
2853
2854 /*DEFINE THE PARAMETERS*/
2855
2856 var ps {MONTH,DRUG} init 0.1 >=0 <=1,
2857 annualpatients init 7000 <=7700 >=6300,
2858 compliance init 0.1 >=0.3 <=0.8,
2859 DOSING[RIB] INIT 5 >=6 <=4;
-
22
200
------
528
ERROR 22-322: Syntax error, expecting one of the following: ;, ',', <=, >=, BINARY, INIT,
INTEGER, {.
ERROR 200-322: The symbol is not recognized and will be ignored.
ERROR 528-782: The name 'DOSING' is already declared.
2860
2861 /*SET THE OBJECTIVE*/
2862 min sse = sum{M IN MONTH}( (units[M,D in drug]-(ps[M,R IN
- - -- -
-
-
-
537 651 631 651
537
537
648
ERROR 537-782: The symbol 'D' is unknown.
ERROR 651-782: Subscript 2 must be a string, found a number.
ERROR 648-782: The subscript count does not match array 'DOT', 1 NE 2.
--
-
631
647
ERROR 631-782: The operand types for 'IN' are mismatched, found a number and a set<string>.
ERROR 647-782: The name 'compliance' must be an array.
2862! min sse = sum{M IN MONTH}( (units[M,D in drug]-(ps[M,R IN
-
-
-
537
651
537
2862! REGIMEN]*annualpatients*ratio[M]*dosing[D]*compliance[D]*dot[R]*7 ))**2 );
ERROR 537-782: The symbol 'R' is unknown.
ERROR 651-782: Subscript 1 must be a string, found a number.
2863
2864
2865 /*SET THE CONSTRAINTS*/
2866 constraint MONTHLY_patient_share {M IN MONTH}: sum{r in regimen}(ps[R IN REGIMEN])=1;
-
648
ERROR 648-782: The subscript count does not match array 'ps', 1 NE 2.
2867 constraint total_patients sum{M in months, r in regimen} : ps[m,r in
---
22
76
2867! regimen]*annualpatients*ratio[m]=annual_patients;
ERROR 22-322: Syntax error, expecting one of the following: !!, (, *, **, +, -, .., /, :, <=, <>,
=, ><, >=, BY, CROSS, DIFF, ELSE, INTER, SYMDIFF, TO, UNION, [, ^, {, ||.
ERROR 76-322: Syntax error, statement will be ignored.
2868
2869 expand;
NOTE: Previous errors might cause the problem to be resolved incorrectly.
ERROR: The constraint 'MONTHLY_patient_share' has an incomplete declaration.
NOTE: The problem has 50 variables (0 free, 0 fixed).
NOTE: The problem has 0 linear constraints (0 LE, 0 EQ, 0 GE, 0 range).
NOTE: The problem has 0 nonlinear constraints (0 LE, 0 EQ, 0 GE, 0 range).
NOTE: Unable to create problem instance due to previous errors.
2870 solve with nlpc;
ERROR: No objective has been specified at line 2870 column 6.
2871 quit;
NOTE: The SAS System stopped processing this step because of errors.
NOTE: PROCEDURE OPTMODEL used (Total process time):
real time 0.07 seconds
cpu time 0.07 seconds
The first error (on log line 2842) happens because OPTMODEL doesn't have the data for set REGIMEN when you refer to it.
I can't see your datasets, but it appears that you have a DRUG-by-REGIMEN matrix.
So you need to tell OPTMODEL the list of the REGIMEN before using it to read the columns. Here are two efficient ways to read the column names into OPTMODEL:
/* a simple example matrix with numbers and their squares */
data m_by_n (drop=i);
do i = 1 to 3;
n = i;
n_square = i * i;
output;
end;
run;
/* get the variable names in the `name` column, plus other information */
proc contents data=m_by_n out=contents_of_m_by_n; run;
proc optmodel;
set ROWS;
set<str> COLS;
num val{ROWS,COLS};
/* use the output from PROC CONTENTS */
read data contents_of_m_by_n into COLS=[name];
read data m_by_n into ROWS=[_N_]
{ j in COLS }< val[_N_,j] = col( j ) >;
put val[*]=;
/* Or, all within OPTMODEL */
num dsid init open('m_by_n');
set<str> COLS2 = setof{i in 1 .. attrn(dsid,'nvars')} varname(dsid,i);
read data m_by_n into ROWS=[_N_]
{ j in COLS2 }< val[_N_,j] = col( j ) >;
put val[*]=;
quit;

Proc Report Compute

I am learning proc report and wanted to make a simple report with a computed column.
Here is my code :
proc report data = schools nowd;
columns school class maths science total;
define school / group;
define class / display;
define maths / analysis;
define science / analysis;
define total / computed;
compute total;
total = maths + science;
endcomp;
run;
Here is the output I am getting:
Schools Class Maths Science total
Airport i 50 41 0
Airport ii 92 53 0
Airport iii 62 60 0
Airport iv 66 61 0
Amrut i 84 58 0
Amrut ii 42 83 0
Amrut iii 53 64 0
Amrut iv 89 100 0
Asia i 42 74 0
Asia ii 48 91 0
Asia iii 75 76 0
Asia iv 46 84 0
Can anyone please explain why I am getting the value of total as 0? I believe it is possible to create a new column in PROC REPORT. What am I doing wrong?
Thanks and Regards
Amit
Compound variable names are needed when an analysis variable has been used to calculate a statistic. You can reference maths and science as maths.sum and science.sum, respectively. If you had left those variables as display variables, you could also refer to them without compound names. Direct column references (_c3_ and _c4_) can be used as well; however, if you changed the order of those variables on the COLUMNS statement, it would alter your computation (just something to consider).
proc report data = schools nowd;
columns school class maths science total;
define school / group;
define class / display;
define maths / analysis;
define science / analysis;
define total / computed;
compute total;
total = maths.sum + science.sum;
endcomp;
run;
PROC REPORT compute order can be confusing. Basically, you have a log message saying 'Maths' and 'Science' are missing, because they haven't been associated with those columns as of the point in the report where the COMPUTE happens. You can use _C#_ where # is the column number to more easily refer to columns.
Also, as pointed out in comments, when accessing an analysis variable you need to refer to it by the type of analysis, so weight.sum instead of weight.
proc report data = sashelp.class nowd;
columns name sex height weight bmi;
define name / group;
define sex / display;
define height / analysis;
define weight / analysis;
define bmi / computed;
compute bmi;
bmi=_c4_/(_c3_**2);
endcomp;
run;
proc report data = school out= xyz nowd;
columns schools class maths science total;
define schools / group;
define class / display;
define maths / order;
define science / order;
define total / computed;
compute total ;
if maths or science ne . then
total = maths + science ;
endcomp;
run;

SAS creating a dynamic interval

This is somewhat complex (well to me at least).
Here is what I have to do:
Say that I have the following dataset:
date price volume
02-Sep 40 100
03-Sep 45 200
04-Sep 46 150
05-Sep 43 300
Say that I have a breakpoint where I wish to create an interval in my dataset. For instance, let my breakpoint = 200 volume transaction.
What I want is to create an ID column and record an ID variable =1,2,3,... for every breakpoint = 200. When you sum all the volume per ID, the value must be constant across all ID variables.
So using my example above, my final dataset should look like the following:
date price volume id
02-Sep 40 100 1
03-Sep 45 100 1
03-Sep 45 100 2
04-Sep 46 100 2
04-Sep 46 50 3
05-Sep 43 150 3
05-Sep 43 150 4
(last row can miss some value but that is fine. I will kick out the last id)
As you can see, I had to "decompose" some rows (like the second row for instance, I break the 200 into two 100 volume) in order to have constant value of the sum, 200, of volume across all ID.
Looks like you're doing volume bucketing for a flow toxicity VPIN calculation. I think this works:
%let bucketsize = 200;
data buckets(drop=bucket volume rename=(vol=volume));
set tmp;
retain bucket &bucketsize id 1;
do until(volume=0);
vol=min(volume,bucket);
output;
volume=volume-vol;
bucket=bucket-vol;
if bucket=0 then do;
bucket=&bucketsize;
id=id+1;
end;
end;
run;
I tested this with your dataset and it looks right, but I would check carefully several cases to confirm that it works right.
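One way to run that check, sketched under the assumption that the question's four rows are loaded into a data set named tmp (date is read as plain text here purely for simplicity). The bucketing step is repeated unchanged so that the example runs on its own, and PROC MEANS then totals the volume per id.

```sas
data tmp;
  input date $ price volume;
  datalines;
02-Sep 40 100
03-Sep 45 200
04-Sep 46 150
05-Sep 43 300
;
run;

%let bucketsize = 200;
data buckets(drop=bucket volume rename=(vol=volume));
  set tmp;
  retain bucket &bucketsize id 1;
  do until(volume=0);
    vol=min(volume,bucket);     /* take as much as fits in the current bucket */
    output;
    volume=volume-vol;
    bucket=bucket-vol;
    if bucket=0 then do;        /* bucket full: start the next id */
      bucket=&bucketsize;
      id=id+1;
    end;
  end;
run;

/* Total volume per id; only the last id may fall short of the bucket size */
proc means data=buckets sum;
  class id;
  var volume;
run;
```

Other inputs worth trying in the same way are a single row larger than several buckets and a row with zero volume.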
If you have a variable which indicates 'Buy' or 'Sell', then you can try this. Let's say this variable is called type and takes the values 'B' or 'S'. One advantage of using this method would be that it is easier to process 'by-groups' if any.
%let bucketsize = 200;
data tmp2;
set tmp;
/* Initialize the running totals and ids once; RETAIN keeps them across rows. */
retain volsumb 0 idb 1 volsums 0 ids 1;
/* Store the current total for each type. */
if type = 'B' then volsumb = volsumb + volume;
else if type = 'S' then volsums = volsums + volume;
/* If the total has reached 200, then reset and increment id. */
/* You have not given the algorithm if the volume exceeds 200, for example the first two values are 150 and 75. */
if volsumb = &bucketsize then do; idb = idb + 1; volsumb = 0; end;
if volsums = &bucketsize then do; ids = ids + 1; volsums = 0; end;
drop volsumb volsums;
run;