I recently moved to SAS 9.4 and found a weird bug with the IFN and IFC functions. Please see the code below illustrating the IFN function:
data a;
input b 8.;
datalines;
0.1
150
110
9.1
1
0
;
run;
proc sql;
create table test as
select b, (IFN(b = . | b > 100 | b < 1 ,1,0)) as f_no_overlap length = 8,
(IFN(b in (.,0) | b > 100 | b < 1 ,1,0)) as m_overlap length = 8,
(IFN(b in (.,0) | b < 1 ,1,0)) as u_no_100,
(IFN(b in (0,110) | b < 1 | b > 100 ,1,0)) as b_no_null,
(IFN(b = 0 | b < 1 | b > 100 | b= . ,1,0)) as h_last_miss,
(IFN(b = . | b = 0 | b < 1 | b > 100,1,0)) as k_first_null,
(IFN(b = 0 | b = . | b < 1 | b > 100,1,0)) as y_second_null,
(IFN(b = 0 | b < 1 | b = . | b > 100,1,0)) as l_third_null,
(IFN(b < 1 | b = 0 | b = . | b > 100,1,0)) as o_first_one from a;
quit;
I get the following results:
Whereas the same code on SAS 9.3 gives the following results (which are correct):
Why does it behave so strangely with the same conditions on 9.4? It mainly seems to be the null value in the IN condition that causes the issue.
Has anyone encountered the same problem? Do we have a solution for it?
The discussion of unexpected evaluation results is for code run in SAS 9.4 TS1M4 on Windows 10.0.17763 Build 17763.
3 proc options;
4 run;
SAS (r) Proprietary Software Release 9.4 TS1M4
SQL does not have a missing value (., .<letter>) concept in the same way the DATA step and other procs do. SQL has NULL, and SAS coerces missing values into NULL, so there is a fuzzy edge, and you found a problem there!
The failure to properly evaluate the expression appears to lie in how the 9.4 SQL implementation processes literal missing (.) values in this specific case. The failure is not in IFN, but rather in the evaluation of the expression passed to IFN!
Examining only the logical expression, the problem does not seem to be related to IN. Similar unexpected evaluation results occur when the IN is split into a series of ORs. The specific cause appears to be where the missing literal (.) appears in the expression -- which in turn points to the 9.4 SQL implementation's innards (parsing, etc.).
It definitely seems to be a bug when there are more than two sub-expressions and one of them uses a missing literal (.). The proper remedy, and one that is also more suitable for remote or pass-through processing, is to avoid missing literals (.) in your SQL and use the ANSI null-test operators IS NULL and IS NOT NULL.
data have;
b = 9.1;
run;
proc sql;
create table want as
select
b
, b in (.,0) | b > 100 as part1 /* correct result */
, b in (.,0) | b < 1 as part2 /* correct result */
, b > 100 | b < 1 as part3 /* correct result */
, b in (.,0) | b > 100 | b < 1 as parts_null_first /* INCORRECT result */
, b > 100 | b < 1 | b in (.,0) as parts_null_last /* INCORRECT result */
, b=. | b=0 | b > 100 | b < 1 as parts_no_in_null_first /* INCORRECT result */
, b=0 | b > 100 | b < 1 | b= . as parts_no_in_null_last /* correct - weird? */
, b is null | b=0 | b > 100 | b < 1 as parts_is_null /* correct result */
, calculated part1 | calculated part2 | calculated part3 as calc_parts_in_1_expr /* correct result */
from have
;
quit;
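As a sketch of how the IS NULL remedy might look applied to the first two flags of the original query (untested; the parenthesized boolean already evaluates to 1 or 0, so IFN is optional here):
proc sql;
create table test_fixed as
select
b
, (b is null | b > 100 | b < 1) as f_no_overlap length = 8
, (b is null | b = 0 | b > 100 | b < 1) as m_overlap length = 8
from a
;
quit;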
I didn't test whether the same issue occurs when the problematic expression is in a WHERE clause. The expression is not a problem as an assignment in a DATA step:
data want2;
set have;
parts_null_first = b in (.,0) | b > 100 | b < 1 ; /* correct result */
parts_null_last = b > 100 | b < 1 | b in (.,0); /* correct result */
run;
If the expression evaluation 'error' also occurs in WHERE expressions, then the WHERE evaluation engine is more likely the root cause -- I believe the same engine is used for Proc/DATA step WHERE statements, the WHERE= data set option and SQL evaluations.
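A minimal sketch of how that could be checked (untested, reusing the have table above; both filters should return no rows for b = 9.1 if the expression evaluates correctly):
proc sql;
create table where_check as
select b
from have
where b in (.,0) | b > 100 | b < 1 /* same predicate as parts_null_first */
;
quit;
data where_check_ds;
set have;
where b in (.,0) | b > 100 | b < 1; /* DATA step WHERE, believed to use the same engine */
run;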
There might be a SAS Note or Hotfix for the situation but I didn't go looking.
Another discussion of testing for missing values can be found in SAS_Tipster's "SAS Tip: Use IS MISSING and IS NULL with Numeric or Character Variables" on communities.sas.com. The important takeaway is the use of these operators in criteria that test for null values.
The IS MISSING and IS NULL operators, which are used with a WHERE statement, can handle character or numeric variables. They also work with the NOT operator:
Documentation summarizes IS MISSING predicate as:
Tests for a SAS missing value in a SAS native data store.
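For example, a small sketch with made-up data (the data set and variable names are hypothetical):
data example;
input num chr $;
datalines;
1 a
. b
2 .
;
run;
proc print data=example;
where num is missing; /* numeric missing */
run;
proc print data=example;
where chr is not null; /* character variable, combined with NOT */
run;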
SAS came back with this response:
We have now released Problem Note XXXXXX: Problem with the IFN function in SAS 9.4 TS1M4 in regard to this issue. It seems the only circumvention at this stage is to upgrade to a newer version of SAS, such as SAS 9.4M5 (TS1M5) or later. So, no hotfix or solution is available at this stage for this problem in the current version.
Related
Before we get to my question, please note that I purposely did not include example data in this post, as my problem occurs on my full dataset and on subsets of it. I have two datasets with client data in the following format.
Have_1
+------------+------------+------+
| dt | dt_next | id |
+------------+------------+------+
| 30.09.2010 | 31.10.2010 | 0001 |
+------------+------------+------+
| 31.10.2010 | 30.11.2010 | 0001 |
+------------+------------+------+
| 30.11.2010 | 31.12.2010 | 0001 |
+------------+------------+------+
| 31.12.2010 | 31.01.2011 | 0001 |
+------------+------------+------+
Have_2
+------+-------+------------+------------+
| id | event | start_date | end_date |
+------+-------+------------+------------+
| 0001 | 1 | 31.10.2010 | 30.11.2010 |
+------+-------+------------+------------+
| 0001 | 2 | 31.10.2010 | 31.12.2010 |
+------+-------+------------+------------+
I am trying to use the IFN function to put 1-0 flags in my dataset by using the following logic:
Proc SQL;
Create table want as
Select a.*
,ifn(a.id in (select id from have_2 where a.dt <= end_date and start_date <= a.dt_next), 1, 0) as flg_1
,ifn(a.id in (select id from have_2 where a.dt <= end_date and start_date <= a.dt), 1, 0) as flg_2
From have_1 as a;
Quit;
The code works fine if I take only one client. However, if I take the full dataset (or even a small subset of it, such as only 10 clients), then the code gets stuck: the process begins without error but simply never finishes. I tried adding indexes to both of my input datasets, without success.
Are there any peculiarities to the IFN function, which can make it behave that way?
So why not just join and take the max of all events if any event's dates fall into those periods? That should eliminate the need to do two subqueries for every observation in HAVE1.
proc sql;
create table want2 as
select a.id
, a.dt
, a.dt_next
, max(a.dt <= b.end_date and b.start_date <= a.dt_next) as flg1
, max(a.dt <= b.end_date and b.start_date <= a.dt) as flg2
from have1 a
left join have2 b
on a.id = b.id
group by 1,2,3
;
quit;
Note the issue is with the subqueries, not the IFN() function call. Also, there is no need for the IFN() function here: SAS evaluates boolean expressions to 1 or 0, so the expression a=b returns the same result as IFN(a=b,1,0).
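For example, a trivial sketch:
data _null_;
x = 5;
flag_bool = (x > 3); /* the boolean expression itself evaluates to 1 */
flag_ifn = ifn(x > 3, 1, 0); /* same result */
put flag_bool= flag_ifn=;
run;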
Looking at a continuous variable under four different categorically coded groups.
Attempting to run PROC POWER with a onewayanova test, but I can't seem to make it account for multiple standard deviations.
I'm trying to see if this is possible.
Title "Find Power for ANOVA"
proc power;
onewayanova test = overall
groupmeans = 1814120 | 1344300 | 953580 | 1352900
stddev = 1879922.09 | 969317.15 | 441433.68 | 970670.65
npergroup = 3 | 4 |5 | 4
power = .;
run;
This gives me:
180
ERROR 180-322: Statement is not valid or it is used out of proper order.
stddev and npergroup use number lists, whereas groupmeans uses grouped number lists. The syntax between the two is different.
proc power;
onewayanova
test = overall
groupmeans = 1814120 | 1344300 | 953580 | 1352900
stddev = 1879922.09 969317.15 441433.68 970670.65
npergroup = 3 4 5 4
power = .
;
run;
(I find it hard to give a good descriptive title, so I'll just ask by means of an example.)
I have a data set like this:
|ID | A1 A2 A3 | B1 B2 B3 | C1 C2 C3 |
+---+----------+----------+----------+
| 1 | a aa aaa| b bb bbb| c cc ccc|
| 2 | (... some values, etc ...)
What I want to do is, given an "ID", make a table output with the values A1,A2,etc for that ID, something like this:
| | A's | B's | C's |
+---+-----+-----+-----+
| 1 | a | b | c |
| 2 | aa | bb | cc |
| 3 | aaa | bbb | ccc |
So, to recap: I want to pick a row, and output a table with certain variables displayed in columns. I've tried to wrap my mind around how proc tabulate works, but haven't managed to wrangle it into giving me what I want; it may be I'm barking up the wrong tree. Is there a way to do this?
I don't need this to return a data table, just some screen output.
You can reshape the data by creating a transposing view that operates on the three arrays in parallel. Proc REPORT or PRINT can then be used to generate the presentation output.
Sample Data
data have;
do id = 1 to 10;
array a a1-a3;
array b b1-b3;
array c c1-c3;
do i = 1 to dim(a);
a(i) = 10 ** i + id;
b(i) = 2 * 10 ** i + id;
c(i) = 3 * 10 ** i + id;
end;
output;
keep id a: b: c:;
end;
run;
Transposing view
data have_v / view = have_v;
set have;
array as a1-a3;
array bs b1-b3;
array cs c1-c3;
do seq = 1 to dim(as);
a = as(seq);
b = bs(seq);
c = cs(seq);
output;
end;
keep id seq a b c;
run;
Output with a WHERE clause. A BY statement is used to show the id value in the output.
proc report data=have_v;
by id;
where id = 3;
column id seq a b c;
define id / display noprint;
run;
You could use VIEWTABLE and issue a WHERE command if you don't want to produce output.
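For example (a sketch; VIEWTABLE is only available in the SAS windowing environment):
dm 'vt work.have_v'; /* then enter:  where id = 3  on the VIEWTABLE command line */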
If each row encompasses an arbitrary number of 'arrays' (say a to z) of arbitrary but equal length (say 1 to 15), you would want to write a macro that performs some metadata examination of the data set in question. The examination would attempt to discover the array 'names' and the number of elements in each. The macro would then need to discover and output up to 15 rows by 26 columns for a given id.
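A sketch of the metadata query such a macro might start from (it assumes the WORK.HAVE table above and strips trailing digits from the column names to recover the array prefixes):
proc sql;
create table array_meta as
select prxchange('s/\d+$//', 1, name) as prefix
, count(*) as n_elements
from dictionary.columns
where libname = 'WORK' and memname = 'HAVE'
and upcase(name) ne 'ID'
group by 1;
quit;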
Sounds like something that an old-style DATA _NULL_ report could produce.
data _null_;
set have ;
where id=1 ;
array a a1-a3 ;
array b b1-b3 ;
array c c1-c3 ;
file print;
put #10 'A' #20 'B' #30 'C'
/ #10 8*'-' #20 8*'-' #30 8*'-'
;
do i=1 to dim(a);
put i 8. #10 a(i) #20 b(i) #30 c(i) ;
end;
run;
Results
A B C
-------- -------- --------
1 a b c
2 aa bb cc
3 aaa bbb ccc
I have two datasets, one for male and one for female, which contain identical variables. I need to find the percent difference between the sexes on each variable by group.
The datasets look something like this, but with more variables and groups,
| Group | Sex | VarA | VarB |
|-------+-----+------+------|
| 1 | F | 8 | 5 |
| 2 | F | 6 | 3 |
| 3 | F | 7 | 0 |
|-------+-----+------+------|
| Group | Sex | VarA | VarB |
|-------+-----+------+------|
| 1 | M | 9 | 7 |
| 2 | M | 8 | 5 |
| 3 | M | 6 | 3 |
|-------+-----+------+------|
The result I need is this:
| Group | percent_diffA | percent_diffB |
|-------+---------------+---------------|
| 1 | -0.117647059 | -0.333333333 |
| 2 | -0.285714286 | -0.5 |
| 3 | 0.153846154 | -2 |
|-------+---------------+---------------|
I could solve this via a merge by renaming each variable.
data difference;
merge
females (rename = (VarA = VarA_F VarB = VarB_F))
males (rename = (VarA = VarA_M VarB = VarB_M))
;
by group;
percent_diffA = (VarA_F - VarA_M) / ( (VarA_F + VarA_M) / 2 );
percent_diffB = (VarB_F - VarB_M) / ( (VarB_F + VarB_M) / 2 );
drop sex;
run;
However, this approach requires me to rename everything manually. With several variables, the rename statement becomes cumbersome. Unfortunately, this calculation is being inserted into some old code, so renaming the original datasets is not practical.
I'm wondering if there is another way to solve this problem which is less cumbersome.
EDIT: I have updated the variable names because that appears to have caused people confusion. They were originally called Var1 and Var2. They are now VarA and VarB. The real variable names are descriptive, for instance body_weight_g or gonadal_somatic_index. The variables are not simply listed with sequential numbers.
For a data set that contains sequentially numbered variables, there is variable-list syntax for renaming the whole range of variables:
This example creates sample data with 100 numbered variables.
data have1 have2;
do group = 1 to 100;
sex = 'M';
array var(100);
do _n_ = 1 to dim(var);
var(_n_) = ceil (25 * ranuni(123));
end;
if group ne 42 then output have1;
sex = 'F';
do _n_ = 1 to dim(var);
var(_n_) = ceil (25 * ranuni(123));
end;
if group ne 100-42 then output have2;
end;
run;
The rename option works on all 100 variables.
data want;
merge
have1(rename=(var1-var100=mvar1-mvar100) in=_M)
have2(rename=(var1-var100=fvar1-fvar100) in=_F)
;
by group;
if _M & _F & first.group & last.group then do;
array one mvar1-mvar100;
array two fvar1-fvar100;
array results result1-result100;
do i = 1 to dim(results);
diff = one(i) - two(i);
mean = mean (one(i), two(i));
results(i) = diff / mean * 100;
end;
end;
keep group result:;
run;
Shenglin's answer is a nice and concise use of SQL.
An alternative method is constructing a macro variable specifying the renames to be used in the rename DSO (data set option). This can be done with an SQL query to the dictionary table containing the column names.
* This macro creates the macro variable rename_suffix, to be used in a rename statement or data set option ;
* It will be of the form: var1 = var1_suffix var2 = var2_suffix ... ;
* &inset is the input set. &suffix is the suffix to be added to all variables except for the variables specified in &keys. ;
* &keys variables should be given each in quotation marks, and separated by spaces. ;
%macro rename_list(inset, suffix, keys) ;
%global rename_&inset ; * So that this macro variable is accessible outside the macro ;
proc sql ;
select strip(name) || ' = ' || strip(name) || "_&suffix"
into :rename_&inset separated by ' '
from sashelp.vcolumn /* dictionary.columns can be used in place of sashelp.vcolumn */
where libname = 'WORK' & memname = "%sysfunc(upcase(&inset))"
& upcase(strip(name)) not in (' ' %sysfunc(upcase(&keys))); * The ' ' is included, so there is no error if no keys are given ;
quit ;
%mend rename_list ;
%rename_list(females, F, 'GROUP' 'SEX')
%rename_list(males , M, 'GROUP' 'SEX')
%put &rename_females ; * Check that the macro variables are correct ;
%put &rename_males ;
%macro pct_diff(num) ;
percent_diff&num = (Var&num._F - Var&num._M) / ( (Var&num._F + Var&num._M) / 2 ) ;
%mend pct_diff ;
data difference ;
merge females(rename = (&rename_females) drop = sex)
males (rename = (&rename_males ) drop = sex) ;
by group ;
%pct_diff(1)
%pct_diff(2)
run ;
dm 'vt difference';
The percent_diff variable creation can also be shortened with a macro (as shown). If you had a large and/or variable number of variables to compare, you could shorten it further by automatically detecting the number of comparisons: run the same SQL query with the select into part modified to be
select count(name) into :varct trimmed
to count the number of variables, and then generate the assignments with a macro %DO loop (a plain DATA step DO loop would not work here, because %pct_diff resolves before the step runs):
%macro all_pct_diffs ;
    %do i = 1 %to &varct ;
        %pct_diff(&i)
    %end ;
%mend all_pct_diffs ;
and call %all_pct_diffs inside the DATA step in place of the numbered %pct_diff calls.
Use table aliases in PROC SQL to avoid renaming:
proc sql;
select a.group,(a.var1-b.var1)/((a.var1+b.var1)/2) as percent_diff1,
(a.var2-b.var2)/((a.var2+b.var2)/2) as percent_diff2
from females as a, males as b
where a.group=b.group;
quit;
I have a set of multiple choice responses from a survey with 45 questions, and I've placed the correct responses as my first observation in the dataset.
In my DATA step I would like to set values to 0 or 1 depending on whether the variable in each observation matches the same variable in the first observation. I want to replace the response letter (A-D) with the 0 or 1 in the dataset. How do I go about doing that comparison?
I'm not doing any grouping, so I believe I can access the first row using First.x, but I'm not sure how to compare that across each variable (answer1-answer45).
| Id  | answer1 | answer2 | ...through answer45 |
|-----|---------|---------|---------------------|
| KEY | A | B |
| 2 | A | C |
| 3 | C | D |
| 4 | A | B |
| 5 | D | C |
| 6 | B | B |
Should become:
| Id  | answer1 | answer2 | ...through answer45 |
|-----|---------|---------|---------------------|
| KEY | A | B |
| 2 | 1 | 0 |
| 3 | 0 | 0 |
| 4 | 1 | 1 |
| 5 | 0 | 0 |
| 6 | 0 | 1 |
Current code for reading in the data:
DATA TEST(drop=name fill answer0);
INFILE SCORES DSD firstobs=2;
length id $4;
length answer1-answer150 $1;
INPUT name $ fill id $ (answer0-answer150) ($);
RUN;
Thanks in advance!
Here's how I might do it. Create a data set to PROC COMPARE the KEY against the observed answers. Then you have X for a response that does not match the key and missing for one that does. You can then use PROC TRANSREG to score the X/missing values to 0/1. PROC TRANSREG also creates macro variables which contain the names of the new variables and their number.
From log NOTE: _TRGINDN=2 _TRGIND=answer1D answer2D
data questions;
input id:$3. (answer1-answer2)(:$1.);
cards;
KEY A B
2 A C
3 C D
4 A B
5 D C
6 B B
;;;;
run;
data key;
if _n_ eq 1 then set questions(obs=1);
set questions(keep=id firstobs=2);
run;
proc compare base=key compare=questions(firstobs=2) out=comp outdif noprint;
id id;
run;
options validvarname=v7;
proc transreg design data=comp(drop=_type_ type=data);
id id;
model class(answer:) / noint;
output out=scored(drop=intercept _:);
run;
%put NOTE: &=_TRGINDN &=_TRGIND;
I don't have my SAS license here at home, so I can't actually test this code. I'll give it my best shot, though ...
First, I'd keep my correct answers in a separate table, and then merge it with the answers from the respondents. That also makes the solution scalable, should you have more multiple choice solutions and answers in the same table, since you'd be joining on the assignment ID as well.
Now, import all your correct answers to a table answers_correct with column names answer_correct1-answer_correct45.
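One possible way to build that table from the data you already read (a sketch; it assumes the answers, including the key as the first observation, are in a table named answers with columns id and answer1-answer45):
DATA answers_correct;
* Keep only the first (KEY) observation and rename its columns;
SET answers(obs=1 rename=(answer1-answer45 = answer_correct1-answer_correct45));
DROP id;
RUN;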
Then, merge the two tables and determine the outcome for each question.
DATA outcome;
MERGE answers answers_correct;
* We will not be using any BY.;
* If you later add more questionnaires, merge BY the questionnaire ID;
ARRAY answer(*) answer1-answer45;
ARRAY answer_correct(*) answer_correct1-answer_correct45;
LENGTH result1-result45 $1;
ARRAY result(*) result1-result45;
DROP i;
DO i = 1 TO DIM(answer);
IF answer(i) = answer_correct(i) THEN result(i) = '1';
ELSE result(i) = '0';
END;
RUN;