Proc sql subquery based on nonexistent column returns not null - sas

Here is sample code derived from an actual application. There are two datasets: "aa" for the outer query and "bb" for the subquery. Column "m" in "aa" is matched against column "y" in "bb". Dataset "aa" also has a column "yy" whose value is always 30. Column "m" in "aa" contains the value 30 in one of its rows; column "y" in "bb" does not.
The first proc sql step uses the values in column "y" of "bb" to subset "aa" on matching values of "m". It is a correct query and produces the expected result.
In the second proc sql step, column "y" is intentionally misspelled as "yy" in the subquery, on the line that starts with the where clause. Otherwise the step is identical to the first one. Since there is no column "yy" in "bb", I would expect an error message and the whole query to fail. Instead, it returns one row with no failure and no error message. A closer look suggests that it actually uses the "yy" column from "aa" (see the tree in the log output).
I do not think this is correct behavior. I would greatly appreciate any comments or explanations; otherwise, I should perhaps report it to SAS as a bug. Thank you!
Here is the code:
options
msglevel = I
;
data aa;
do i=1 to 20;
m=i*5;
yy=30;
output;
end;
run;
data bb;
do i=10 to 20;
y=i*5;
output;
end;
run;
option DEBUG=JUNK ;
/*Correct sql command*/
proc sql _method
_tree
;
create table cc as
select *
from aa
where m in (select y from bb)
;quit;
/*Incorrect sql command - column "yy" in not on "bb" table"*/
proc sql _method
_tree;
create table dd as
select *
from aa
where m in (select yy from bb)
;quit;
Here is log with sql tree:
119 options
120 msglevel = I
121 ;
122 data aa;
123 do i=1 to 20;
124 m=i*5;
125 yy=30;
126 output;
127 end;
128 run;
NOTE: The data set WORK.AA has 20 observations and 3 variables.
NOTE: DATA statement used (Total process time):
real time 0.00 seconds
cpu time 0.00 seconds
129
130 data bb;
131 do i=10 to 20;
132 y=i*5;
133 output;
134 end;
135 run;
NOTE: The data set WORK.BB has 11 observations and 2 variables.
NOTE: DATA statement used (Total process time):
real time 0.00 seconds
cpu time 0.00 seconds
136 option DEBUG=JUNK ;
137
138 /*Correct sql command*/
139 proc sql _method
140 _tree
141 ;
142 create table cc as
143 select *
144 from aa
145 where m in (select y from bb)
146 ;
NOTE: SQL execution methods chosen are:
sqxcrta
sqxfil
sqxsrc( WORK.AA )
NOTE: SQL subquery execution methods chosen are:
sqxsubq
sqxsrc( WORK.BB )
Tree as planned.
/-SYM-V-(aa.i:1 flag=0001)
/-OBJ----|
| |--SYM-V-(aa.m:2 flag=0001)
| \-SYM-V-(aa.yy:3 flag=0001)
/-FIL----|
| | /-SYM-V-(aa.i:1 flag=0001)
| | /-OBJ----|
| | | |--SYM-V-(aa.m:2 flag=0001)
| | | \-SYM-V-(aa.yy:3 flag=0001)
| |--SRC----|
| | \-TABL[WORK].aa opt=''
| | /-SYM-V-(aa.m:2)
| \-IN-----|
| | /-SYM-V-(bb.y:2 flag=0001)
| | /-OBJ----|
| | /-SRC----|
| | | \-TABL[WORK].bb opt=''
| \-SUBC---|
--SSEL---|
NOTE: Table WORK.CC created, with 11 rows and 3 columns.
146! quit;
NOTE: PROCEDURE SQL used (Total process time):
real time 0.03 seconds
cpu time 0.03 seconds
147
148
149 /*Incorrect sql command - column "yy" in not on "bb" table"*/
150 proc sql _method
151 _tree;
152 create table dd as
153 select *
154 from aa
155 where m in (select yy from bb)
156 ;
NOTE: SQL execution methods chosen are:
sqxcrta
sqxfil
sqxsrc( WORK.AA )
NOTE: SQL subquery execution methods chosen are:
sqxsubq
sqxreps
sqxsrc( WORK.BB )
Tree as planned.
/-SYM-V-(aa.i:1 flag=0001)
/-OBJ----|
| |--SYM-V-(aa.m:2 flag=0001)
| \-SYM-V-(aa.yy:3 flag=0001)
/-FIL----|
| | /-SYM-V-(aa.i:1 flag=0001)
| | /-OBJ----|
| | | |--SYM-V-(aa.m:2 flag=0001)
| | | \-SYM-V-(aa.yy:3 flag=0001)
| |--SRC----|
| | \-TABL[WORK].aa opt=''
| | /-SYM-V-(aa.m:2)
| \-IN-----|
| | /-SYM-A-(#TEMA001:1 flag=0035)
| | /-OBJ----|
| | /-REPS---|
| | | |--empty-
| | | |--empty-
| | | | /-OBJ----|
| | | |--SRC----|
| | | | \-TABL[WORK].bb opt=''
| | | |--empty-
| | | |--empty-
| | | | /-SYM-A-(#TEMA001:1 flag=0035)
| | | | /-ASGN---|
| | | | | \-SUBP(1)
| | | \-OBJE---|
| \-SUBC---|
| \-SYM-V-(aa.yy:3)
--SSEL---|
NOTE: Table WORK.DD created, with 1 rows and 3 columns.
156! quit;
NOTE: PROCEDURE SQL used (Total process time):
real time 0.01 seconds
cpu time 0.01 seconds
Here are datasets:
aa:
i m yy
1 5 30
2 10 30
3 15 30
4 20 30
5 25 30
6 30 30
7 35 30
8 40 30
9 45 30
10 50 30
11 55 30
12 60 30
13 65 30
14 70 30
15 75 30
16 80 30
17 85 30
18 90 30
19 95 30
20 100 30
bb:
i y
10 50
11 55
12 60
13 65
14 70
15 75
16 80
17 85
18 90
19 95
20 100

I agree, this looks pretty weird and may well be a bug. I was able to reproduce this from the code you provided in SAS 9.4 and in SAS 9.1.3, which would make it at least ~12 years old.
In particular, I'm interested in this bit of the output you got from the _method option when creating the DD table but not when creating the CC table:
NOTE: SQL subquery execution methods chosen are:
sqxsubq
sqxreps <--- What is this doing?
sqxsrc( WORK.BB )
Similarly, the corresponding section from the _tree output is highly obscure:
| | /-SYM-A-(#TEMA001:1 flag=0035)
| | /-OBJ----|
| | /-REPS---|
| | | |--empty-
| | | |--empty-
| | | | /-OBJ----|
| | | |--SRC----|
| | | | \-TABL[WORK].bb opt=''
| | | |--empty-
| | | |--empty-
| | | | /-SYM-A-(#TEMA001:1 flag= 0035)
| | | | /-ASGN---|
| | | | | \-SUBP(1)
| | | \-OBJE---|
| \-SUBC---|
| \-SYM-V-(aa.yy:3)
I have never seen sqxreps or reps in the respective bits of output before. Neither of them is listed in any of the papers I was able to find based on a brief bit of googling (in fact, this question is currently the only hit on Google for sas + sqxreps):
http://support.sas.com/resources/papers/proceedings10/139-2010.pdf
http://www2.sas.com/proceedings/sugi30/101-30.pdf
Quoting the first of these:
Codes      Description
sqxcrta    Create table as Select
sqxslct    Select
sqxjsl     Step loop join (Cartesian)
sqxjm      Merge join
sqxjndx    Index join
sqxjhsh    Hash join
sqxsort    Sort
sqxsrc     Source rows from table
sqxfil     Filter rows
sqxsumg    Summary stats with GROUP BY
sqxsumn    Summary stats with no GROUP BY
Based on a bit of quick testing, this seems to happen regardless of the variable and table names used, provided that the variable name from AA is repeated multiple times in the subquery referencing table BB. It also happens if you have a variable named e.g. YYY in AA but one named YY in BB, or more generally whenever the name of a variable in AA begins with the name of a variable in BB and then continues for one or more characters.
From this, I'm guessing at some point in the SQL parser, someone used a like operator rather than checking for equality of variable names, and somehow as a result this syntax is triggering an undocumented or incomplete 'feature' in proc sql.
An example of the more general case:
options
  msglevel = I
;
data aa;
  do i=1 to 20;
    m=i*5;
    myvar_plus_suffix=30;
    output;
  end;
run;
data bb;
  do i=10 to 20;
    myvar=i*5;
    output;
  end;
run;
option DEBUG=JUNK ;
/* Incorrect sql command - column "myvar_plus_suffix" is not on the "bb" table */
proc sql _method
  _tree;
  create table dd as
  select *
  from aa
  where m in (select myvar_plus_suffix from bb)
;quit;

Here is a response from SAS support.
What you are seeing is related to column scoping in PROC SQL.
PROC SQL supports correlated subqueries. A correlated subquery references a column in the "outer" table, which can then be compared to columns in the "inner" table. PROC SQL does not require fully qualified column names. As a result, if it sees a column in the subquery that does not exist in the inner table (the table referenced in the subquery), it looks for that column in the "outer" table and uses its value if it finds one.
If a fully qualified column name is used, the error you are expecting will occur such as the following:
proc sql;
create table dd as
select *
from aa as outer
where outer.m in (select inner.yyy from bb as inner);
quit;
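This fallback to the outer scope is ordinary SQL name resolution rather than something peculiar to SAS. As a rough illustration only, here is a minimal sketch using Python's built-in sqlite3 module, on the assumption that SQLite resolves the unqualified column the same way PROC SQL does. The table contents mirror the question; the aliases o and b are made up for the example.

```python
import sqlite3

# In-memory stand-ins for the WORK.AA and WORK.BB datasets from the question.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE aa (i INTEGER, m INTEGER, yy INTEGER)")
conn.execute("CREATE TABLE bb (i INTEGER, y INTEGER)")
conn.executemany(
    "INSERT INTO aa VALUES (?, ?, ?)", [(i, i * 5, 30) for i in range(1, 21)]
)
conn.executemany(
    "INSERT INTO bb VALUES (?, ?)", [(i, i * 5) for i in range(10, 21)]
)

# Unqualified yy does not exist in bb, so it resolves to the outer aa.yy
# and the subquery silently becomes a correlated subquery.
unqualified_rows = conn.execute(
    "SELECT i, m, yy FROM aa WHERE m IN (SELECT yy FROM bb)"
).fetchall()

# Qualifying the column with the inner table's alias (hypothetical alias b)
# turns the typo into a hard error instead.
try:
    conn.execute(
        "SELECT i, m, yy FROM aa AS o WHERE o.m IN (SELECT b.yy FROM bb AS b)"
    )
    qualified_error = None
except sqlite3.OperationalError as exc:
    qualified_error = str(exc)

print(unqualified_rows)
print(qualified_error)
```

The unqualified query returns only the row where m equals 30, which matches the single row in WORK.DD. Qualifying column names in subqueries is a cheap way to catch this kind of typo.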

Related

How to create 2 new columns with appropriate prefix based on values in columns with same prefix in SAS Enterprise Guide / PROC SQL?

I have a table in SAS Enterprise Guide like the one below:
ID | COUNT_COL_A | COUNT_COL_B | SUM_COL_A | SUM_COL_B
-----|-------------|-------------|-----------|------------
111 | 10 | 10 | 320 | 120
222 | 15 | 80 | 500 | 500
333 | 1 | 5 | 110 | 350
444 | 20 | 5 | 670 | 0
Requirements:
I need to create a new column "TOP_COUNT" holding the name of the column (COUNT_COL_A or COUNT_COL_B) with the highest value for each ID.
If an ID has the same value in both "COUNT_" columns, "TOP_COUNT" should take the name of the column whose counterpart with prefix SUM_ (SUM_COL_A or SUM_COL_B) has the higher value.
I need to create a new column "TOP_SUM" holding the name of the column (SUM_COL_A or SUM_COL_B) with the highest value for each ID.
If an ID has the same value in both "SUM_" columns, "TOP_SUM" should take the name of the column whose counterpart with prefix COUNT_ (COUNT_COL_A or COUNT_COL_B) has the higher value.
It is not possible for an ID to have only 0 in the columns with prefix COUNT_ or only 0 in the columns with prefix SUM_.
There are no nulls in the table.
Desired output:
ID | COUNT_COL_A | COUNT_COL_B | SUM_COL_A | SUM_COL_B | TOP_COUNT | TOP_SUM
-----|-------------|-------------|-----------|------------|-------------|---------
111 | 10 | 10 | 320 | 120 | COUNT_COL_A | SUM_COL_A
222 | 15 | 80 | 500 | 500 | COUNT_COL_B | SUM_COL_B
333 | 1 | 5 | 110 | 350 | COUNT_COL_B | SUM_COL_B
444 | 20 | 5 | 670 | 0 | COUNT_COL_A | SUM_COL_A
How can I do that in SAS Enterprise Guide or in PROC SQL?
Use an array with a looping methodology:
Declare an array of the count variables
Set the maximum value to 0
Loop through the array
Check if each value is more than the current maximum
If yes, assign the value to the current maximum and store the variable name
If no, keep looping
Non-looping, function methodology:
Use MAX to find the maximum value of the array
Use WHICHN() to find the position of that value in the array
Use VNAME to get the variable name based on the position
*for count - you can extend this to the sum columns;
data want;
  set have;
  array _count(*) count_col_:;
  *looping methodology;
  top_count_value=0;
  do i=1 to dim(_count);  /* loop over the number of array elements */
    if _count(i) > top_count_value then do;
      top_count = vname(_count(i));
      top_count_value = _count(i);
    end;
  end;
  /*or function methodology*/
  top_count_max = max(of _count(*));
  index_top_count = whichn(top_count_max, of _count(*));
  top_count_name_2 = vname(_count(index_top_count));
run;
Just do the same thing as in your other question. But because you want to transpose two sets of variables, it is probably going to be easier to use a data step and arrays to do the first transform.
data tall;
  set have;
  array counts count_col_a count_col_b;
  array sums sum_col_a sum_col_b;
  do index=1 to dim(sums);
    length type $5 name $32 ;
    type='COUNT';
    name=vname(counts[index]);
    value1=counts[index];
    value2=sums[index];
    output;
    type='SUM';
    name=vname(sums[index]);
    value1=sums[index];
    value2=counts[index];
    output;
  end;
run;
Now sort and take the last per ID/TYPE combination to find the largest.
proc sort data=tall;
  by id type value1 value2 name;
run;
data top;
  set tall;
  by id type value1 value2;
  if last.type;
run;
And then transpose and re-merge.
proc transpose data=top out=want(drop=_name_) prefix=TOP_;
by id;
id type;
var name;
run;
data want;
merge have want;
by id;
run;
Result:
COUNT_ COUNT_ SUM_ SUM_
Obs ID COL_A COL_B COL_A COL_B TOP_COUNT TOP_SUM
1 111 10 10 320 120 COUNT_COL_A SUM_COL_A
2 222 15 80 500 500 COUNT_COL_B SUM_COL_B
3 333 1 5 110 350 COUNT_COL_B SUM_COL_B
4 444 20 5 670 0 COUNT_COL_A SUM_COL_A
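For readers who want to check the tie-breaking rule outside SAS, here is a minimal sketch of the same selection logic in plain Python. It is not a translation of either answer's code. The sample rows are copied from the question, and the helper name top_column is made up for the illustration.

```python
# Sample rows copied from the question's table.
rows = [
    {"ID": 111, "COUNT_COL_A": 10, "COUNT_COL_B": 10, "SUM_COL_A": 320, "SUM_COL_B": 120},
    {"ID": 222, "COUNT_COL_A": 15, "COUNT_COL_B": 80, "SUM_COL_A": 500, "SUM_COL_B": 500},
    {"ID": 333, "COUNT_COL_A": 1, "COUNT_COL_B": 5, "SUM_COL_A": 110, "SUM_COL_B": 350},
    {"ID": 444, "COUNT_COL_A": 20, "COUNT_COL_B": 5, "SUM_COL_A": 670, "SUM_COL_B": 0},
]

SUFFIXES = ["A", "B"]


def top_column(row, primary, secondary):
    """Name of the primary-prefixed column with the largest value.

    Ties on the primary value are broken by the value of the matching
    secondary-prefixed column, as the question requires.
    """
    best = max(
        SUFFIXES,
        key=lambda s: (row[f"{primary}_COL_{s}"], row[f"{secondary}_COL_{s}"]),
    )
    return f"{primary}_COL_{best}"


result = [
    (row["ID"], top_column(row, "COUNT", "SUM"), top_column(row, "SUM", "COUNT"))
    for row in rows
]

for line in result:
    print(line)
```

Sorting on the pair (own value, counterpart value) is the same idea as the BY id type value1 value2 sort followed by last.type in the data step approach.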

How to aggregate col1 per ID and val1 per ID and values in col1 in SAS Enterprise Guide or PROC SQL?

I have a table in SAS Enterprise Guide like the one below:
ID | COL1 | VAL1 |
----|------|------|
111 | A | 10 |
111 | A | 5 |
111 | B | 10 |
222 | B | 20 |
333 | C | 25 |
... | ... | ... |
And I need to aggregate the table above to get:
the number of occurrences of each COL1 value per ID
the sum of VAL1 per COL1 value per ID
So, as a result I need something like this:
ID | COL1_A | COL1_B | COL1_C | COL1_A_VAL1_SUM | COL1_B_VAL1_SUM | COL1_C_VAL1_SUM
----|--------|--------|---------|-----------------|-----------------|------------------
111 | 2 | 1 | 0 | 15 | 10 | 0
222 | 0 | 1 | 0 | 0 | 20 | 0
333 | 0 | 0 | 1 | 0 | 0 | 25
for example because:
COL1_A = 2 for ID 111, because ID=111 has 2 times "A" in COL1
COL1_A_VAL1_SUM = 15 for ID 111, because ID=111 has 10+5=15 in VAL1 for "A" in COL1
How can I do that in SAS Enterprise Guide or in PROC SQL?
First, we'll create the counts that we need by group with SQL:
proc sql;
create table totals_by_group as
select id
, col1
, count(col1) as count_col1
, sum(val1) as sum_val1
from have
group by id, col1
;
quit;
This produces the following table:
id col1 count_col1 sum_val1
111 A 2 15
111 B 1 10
222 B 1 20
333 C 1 25
Now we need to transpose this into the way we want it. We'll do this with two transpose steps: one for count_col1, and one for sum_val1. proc transpose has a few handy options to make this easy, namely the id, prefix, and suffix options.
First, we'll consider our ID variable col1. This creates columns named A, B, and C. For example:
id A B C
111 2 1 .
222 . 1 .
333 . . 1
The prefix and suffix options let us add a prefix and suffix to these names.
proc transpose
data = totals_by_group
out = count_by_group(drop=_NAME_)
prefix = COL1_;
by id;
id col1;
var count_col1;
run;
proc transpose
data = totals_by_group
out = sum_by_group(drop=_NAME_)
prefix = COL1_
suffix = _VAL1_SUM;
by id;
id col1;
var sum_val1;
run;
This gives us two tables:
COUNT_BY_GROUP
id COL1_A COL1_B COL1_C
111 2 1 .
222 . 1 .
333 . . 1
SUM_BY_GROUP
id COL1_A_VAL1_SUM COL1_B_VAL1_SUM COL1_C_VAL1_SUM
111 15 10 .
222 . 20 .
333 . . 25
Now we just need to merge them together, then set all missing values to 0 by iterating over each numeric column and checking if it's missing.
data want;
merge count_by_group
sum_by_group
;
by id;
array numvars[*] _NUMERIC_;
do i = 1 to dim(numvars);
if(missing(numvars[i])) then numvars[i] = 0;
end;
drop i;
run;
Final table:
id COL1_A COL1_B COL1_C COL1_A_VAL1_SUM COL1_B_VAL1_SUM COL1_C_VAL1_SUM
111 2 1 0 15 10 0
222 0 1 0 0 20 0
333 0 0 1 0 0 25
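The group, pivot, and zero-fill steps can also be sketched outside SAS. The following is a minimal pure-Python illustration, not a replacement for the PROC SQL and PROC TRANSPOSE code. The input rows are the ones shown in the question, and the variable names are chosen for the example.

```python
from collections import defaultdict

# Rows (ID, COL1, VAL1) copied from the question.
have = [
    (111, "A", 10),
    (111, "A", 5),
    (111, "B", 10),
    (222, "B", 20),
    (333, "C", 25),
]

# Every distinct COL1 value becomes a pair of output columns.
levels = sorted({col1 for _, col1, _ in have})

# Step 1: count and sum per (ID, COL1), like the GROUP BY query.
counts = defaultdict(lambda: defaultdict(int))
sums = defaultdict(lambda: defaultdict(int))
for id_, col1, val1 in have:
    counts[id_][col1] += 1
    sums[id_][col1] += val1

# Step 2: pivot to one row per ID, filling absent combinations with 0.
want = []
for id_ in sorted(counts):
    row = {"ID": id_}
    for level in levels:
        row[f"COL1_{level}"] = counts[id_].get(level, 0)
    for level in levels:
        row[f"COL1_{level}_VAL1_SUM"] = sums[id_].get(level, 0)
    want.append(row)

for row in want:
    print(row)
```

Using get with a default of 0 plays the role of the final data step that replaces missing values with 0.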

insert into table select from table where columnname = xxx and columnname2 = yyy looping too many times

I'll try to be as complete but concise as necessary. If I leave something out, please let me know.
I have a collection of activities and each activity is comprised of steps necessary to complete that given activity. Each step has a few additional components that go along with the step. If you were to look at this as a tree, it'd look like this:
ACTIVITY
-- STEP
---- COMPONENT
Below is the dataset results of the component table.
I'm looking to write a MySQL insert/select statement that will allow me to copy the rows with ID = 84. On the insert, though, the ID value should inherit the new ID of the ACTIVITY (for example, let's go with 299) and the AID should inherit the new STEP values (for this, let's go with 501,502,503,504,505,506).
I know what the MySQL statement would look like; that's not the problem. The problem I'm running into is how to write the loop so that I can pass in the new ID and the new AID values. The SID is the primary key (auto increment).
With the dataset above, I'd expect 6 new records to be inserted. Instead, I'm getting 9, so either my loops are not looping correctly or I'm passing in the wrong data.
Here is the loop:
for (local.data.newAID in local.data.list_newAID){
// COPY SET
for (local.data.origAID in local.data.list_existingAID){
local.formDataStruct.origAID = local.data.origAID;
variables.workoutDAO.makeCopyCoreSet(
origID = local.dataStruct.ID,
newID = local.dataStruct.newID,
origAID = local.dataStruct.origAID,
newAID = local.dataStruct.newAID
);
}
}
Here is the makeCopyCoreSet function:
INSERT INTO SET(ID, LID, AID)
SELECT
:newID, LID, :newAID
FROM
Set
WHERE ID = :origID AND AID = :origAID;
What am I missing?
We want to copy one of our Activities, so we pass the ID we want to copy and the ID we want to use as the new ID (unless we have another way to determine it).
variables.workoutDAO.NEW_makeCopyCoreSet(
origID = local.dataStruct.ID,
newID = local.dataStruct.newID
);
And then in our NEW_makeCopyCoreSet() function (a CF function), we have the query:
INSERT INTO component (ID, LID, AID)
SELECT DISTINCT :newID, LID, AID
FROM component
WHERE ID = :origID
To see it in action (from the SQL side):
https://dbfiddle.uk/?rdbms=mariadb_10.2&fiddle=ba4328dca3327814a7dc18fea284ead8
First we set up our base data.
/* SETUP 1 */
CREATE TABLE component ( ID int, LID int, AID int, SID int UNIQUE AUTO_INCREMENT)
/* SETUP 2 */
INSERT INTO component (ID, LID, AID)
SELECT 84,0,432 UNION ALL
SELECT 84,0,433 UNION ALL
SELECT 84,0,434 UNION ALL
SELECT 84,0,435 UNION ALL
SELECT 84,0,435 UNION ALL
SELECT 84,0,435 UNION ALL
SELECT 84,0,435 UNION ALL
SELECT 84,0,436 UNION ALL
SELECT 84,0,437
/* What's in the original? */
SELECT * FROM component
ID | LID | AID | SID
-: | --: | --: | --:
84 | 0 | 432 | 1
84 | 0 | 433 | 2
84 | 0 | 434 | 3
84 | 0 | 435 | 4
84 | 0 | 435 | 5
84 | 0 | 435 | 6
84 | 0 | 435 | 7
84 | 0 | 436 | 8
84 | 0 | 437 | 9
Then we copy an existing ID to a new ID.
/* Copy an ID. */
INSERT INTO component (ID, LID, AID)
SELECT DISTINCT 299, LID, AID
FROM component
WHERE ID = 84
/* What's in the table for the new ID? */
SELECT * FROM component WHERE ID = 299
ID | LID | AID | SID
--: | --: | --: | --:
299 | 0 | 432 | 16
299 | 0 | 433 | 17
299 | 0 | 434 | 18
299 | 0 | 435 | 19
299 | 0 | 436 | 20
299 | 0 | 437 | 21
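The same single-statement copy can be tried locally without a MariaDB server. Below is a minimal sketch using Python's built-in sqlite3 module, on the assumption that INSERT ... SELECT DISTINCT behaves the same way here as in the fiddle. The function name copy_component is made up for the example, and SQLite's AUTOINCREMENT stands in for MySQL's AUTO_INCREMENT.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE component ("
    "ID INTEGER, LID INTEGER, AID INTEGER, "
    "SID INTEGER PRIMARY KEY AUTOINCREMENT)"
)

# Same nine source rows as the fiddle: AID 435 appears four times.
aids = [432, 433, 434, 435, 435, 435, 435, 436, 437]
conn.executemany(
    "INSERT INTO component (ID, LID, AID) VALUES (84, 0, ?)",
    [(aid,) for aid in aids],
)


def copy_component(connection, orig_id, new_id):
    """Copy the distinct (LID, AID) pairs of orig_id under new_id in one statement."""
    connection.execute(
        "INSERT INTO component (ID, LID, AID) "
        "SELECT DISTINCT :new_id, LID, AID FROM component WHERE ID = :orig_id",
        {"new_id": new_id, "orig_id": orig_id},
    )


copy_component(conn, 84, 299)
copied = conn.execute(
    "SELECT ID, LID, AID FROM component WHERE ID = 299 ORDER BY AID"
).fetchall()
print(copied)
```

One set-based statement removes the nested loop entirely, so there is no loop bookkeeping left to get wrong. Note that this sketch keeps the original AID values, as the answer's query does.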

Performing t-test using SAS when variables are in different columns

I have data which looks like following.
I was wondering how to run a t-test when the variables that I want to compare are in different columns.
+---------+------------+----------+-------------+-------------+----------------+
| Case_id | Control_id | case_age | control_age | case_result | control_result |
+---------+------------+----------+-------------+-------------+----------------+
| 1 | 50 | 24 | 24 | 23 | 12 |
| 1 | 52 | 24 | 24 | 23 | 10 |
| 2 | 65 | 27 | 27 | 24 | 15 |
| 2 | 70 | 27 | 27 | 24 | 14 |
+---------+------------+----------+-------------+-------------+----------------+
The SAS tutorials show the following syntax for running a t-test. But in my case I do not have a class variable to distinguish between cases and controls. Is there a way to tell SAS to compare the two variables case_result and control_result?
proc ttest data;
class Gender;
var Score;
run;
If you simply want to compare the values of two variables, it can be done this way:
proc compare base=libname.dataset allstats briefsummary;
  var var1;
  with var2;
  title 'Comparison of two variables';
run;
To run a t-test on the difference between two variables (a paired comparison):
proc ttest data=libname.dataset;
  paired var1*var2;
run;
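To make the paired comparison concrete, here is a minimal sketch in plain Python of the statistic a paired t-test is built on: the mean of the row-wise differences divided by its standard error. It uses the four sample rows from the question and treats each row as one pair, which is an assumption about the design. The function name paired_t is made up for the example.

```python
import math
import statistics

# case_result and control_result columns from the four sample rows.
case_result = [23, 23, 24, 24]
control_result = [12, 10, 15, 14]


def paired_t(x, y):
    """Return the paired t statistic and its degrees of freedom."""
    diffs = [a - b for a, b in zip(x, y)]
    n = len(diffs)
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    t_stat = mean_diff / (sd_diff / math.sqrt(n))
    return t_stat, n - 1


t_stat, df = paired_t(case_result, control_result)
print(t_stat, df)
```

Because the test works on within-row differences, no class variable is needed.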

Chi-Sq test result difference when done Manually and by SAS

I am trying to perform a chi-square test on my data using SAS University Edition.
Here is the structure of my data:
+----------+------------+------------------+-------------------+
| study_id | Control_id | study_mortality | control_mortality |
+----------+------------+------------------+-------------------+
| 1 | 50 | Alive | Alive |
| 1 | 52 | Alive | Alive |
| 2 | 65 | Dead | Dead |
| 2 | 70 | Dead | Alive |
+----------+------------+------------------+-------------------+
I am getting different results when I do the test with SAS versus when I do it manually using an online calculator. I used the values from PROC FREQ to calculate the chi-square in the online calculator. Here are the frequency outputs and the chi-square test. Can someone point out where the issue is?
proc freq data = mydata;
tables study_mortality control_mortality;
where type=1;
run;
+-----------------+-----------+
| study_mortality | Frequency |
+-----------------+-----------+
| Alive           |      7614 |
| Dead            |       324 |
+-----------------+-----------+
+-------------------+-----------+
| control_mortality | Frequency |
+-------------------+-----------+
| Alive             |      6922 |
| Dead              |       159 |
+-------------------+-----------+
proc freq data = mydata;
tables study_mortality*control_mortality/ CHISQ;
where type=1;
run;
+-----------------+-------+------+-------+
|                 |  Control_mortality   |
| Study_mortality | Alive | Dead | Total |
+-----------------+-------+------+-------+
| Alive           |  5515 |  134 |  5649 |
| Dead            |   249 |    5 |   254 |
| Total           |  5764 |  139 |  5903 |
+-----------------+-------+------+-------+
Statistic DF Value Prob
Chi-Square 1 0.1722 0.6782
Likelihood Ratio Chi-Square 1 0.1818 0.6699
Continuity Adj. Chi-Square 1 0.0414 0.8388
Mantel-Haenszel Chi-Square 1 0.1722 0.6782
Phi Coefficient -0.0054
Contingency Coefficient 0.0054
Cramer's V -0.0054
You have missing data. Look at the Ns in those tables.
Study mortality has around 8000 records and control mortality around 7000, but when you cross them you only have 5903. This means that some records are excluded, because a row is dropped from the crosstab when either variable is missing. There should be a line in the output reporting the number missing. I am not sure whether SAS didn't print it or you only pasted selected output. When I enter your crosstab counts, the p-value matches an online calculator exactly and also matches your output.
data have;
infile cards;
input Study Control N;
cards;
1 1 5515
1 0 134
0 1 249
0 0 5
;
run;
proc freq data=have;
table study*control / chisq;
weight N;
run;
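The same check can be done by hand. Here is a minimal sketch in plain Python that computes the Pearson chi-square from the crosstab counts in the question and compares the Ns of the one-way and two-way tables. The function name pearson_chi_square is made up for the example.

```python
# Crosstab counts from the question (rows: study Alive/Dead,
# columns: control Alive/Dead).
table = [[5515, 134], [249, 5]]


def pearson_chi_square(tbl):
    """Return the Pearson chi-square statistic and the table total."""
    row_totals = [sum(row) for row in tbl]
    col_totals = [sum(col) for col in zip(*tbl)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(tbl):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    return chi2, n


chi2, n_crossed = pearson_chi_square(table)

# Ns from the two one-way frequency tables in the question.
n_study = 7614 + 324
n_control = 6922 + 159

print(round(chi2, 4), n_crossed, n_study, n_control)
```

The statistic agrees with the 0.1722 that SAS reports. The three Ns differ, which is the sign that the one-way tables and the crosstab are not built from the same set of records.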