How to aggregate col1 per ID and val1 per ID and values in col1 in SAS Enterprise Guide or PROC SQL? - sas

I have table in SAS Enterprise Guide like below:
ID | COL1 | VAL1 |
----|------|------|
111 | A | 10 |
111 | A | 5 |
111 | B | 10 |
222 | B | 20 |
333 | C | 25 |
... | ... | ... |
I need to aggregate the table above to get:
the number of occurrences of each COL1 value per ID
the sum of VAL1 per COL1 value per ID
As a result, I need something like this:
ID | COL1_A | COL1_B | COL1_C | COL1_A_VAL1_SUM | COL1_B_VAL1_SUM | COL1_C_VAL1_SUM
----|--------|--------|---------|-----------------|-----------------|------------------
111 | 2 | 1 | 0 | 15 | 10 | 0
222 | 0 | 1 | 0 | 0 | 20 | 0
333 | 0 | 0 | 1 | 0 | 0 | 25
For example:
COL1_A = 2 for ID 111, because ID=111 has "A" in COL1 2 times
COL1_A_VAL1_SUM = 15 for ID 111, because ID=111 has 10+5=15 in VAL1 for "A" in COL1
How can I do that in SAS Enterprise Guide or in PROC SQL?

First, we'll create the counts that we need by group with SQL:
proc sql;
create table totals_by_group as
select id
, col1
, count(col1) as count_col1
, sum(val1) as sum_val1
from have
group by id, col1
;
quit;
This produces the following table:
id col1 count_col1 sum_val1
111 A 2 15
111 B 1 10
222 B 1 20
333 C 1 25
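To reproduce the table above, the input can be built with a short data step. This is only a sketch: the dataset name have comes from the query, and only the five rows visible in the question are included (the elided rows are not).

```sas
/* Sample input from the question; only the five visible rows are included */
data have;
    input id col1 $ val1;
    datalines;
111 A 10
111 A 5
111 B 10
222 B 20
333 C 25
;
run;
```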
Now we need to transpose this into the way we want it. We'll do this with two transpose steps: one for count_col1, and one for sum_val1. proc transpose has a few handy options to make this easy, namely the id, prefix, and suffix options.
First, consider the id statement with col1. It names the transposed columns after the values of col1, which creates columns named A, B, and C. For example:
id A B C
111 2 1 .
222 . 1 .
333 . . 1
The prefix and suffix options let us add a prefix and suffix to these names.
proc transpose
data = totals_by_group
out = count_by_group(drop=_NAME_)
prefix = COL1_;
by id;
id col1;
var count_col1;
run;
proc transpose
data = totals_by_group
out = sum_by_group(drop=_NAME_)
prefix = COL1_
suffix = _VAL1_SUM;
by id;
id col1;
var sum_val1;
run;
This gives us two tables:
COUNT_BY_GROUP
id COL1_A COL1_B COL1_C
111 2 1 .
222 . 1 .
333 . . 1
SUM_BY_GROUP
id COL1_A_VAL1_SUM COL1_B_VAL1_SUM COL1_C_VAL1_SUM
111 15 10 .
222 . 20 .
333 . . 25
Now we just need to merge them together, then set all missing values to 0 by iterating over each numeric column and checking if it's missing.
data want;
merge count_by_group
sum_by_group
;
by id;
array numvars[*] _NUMERIC_;
do i = 1 to dim(numvars);
if(missing(numvars[i])) then numvars[i] = 0;
end;
drop i;
run;
Final table:
id COL1_A COL1_B COL1_C COL1_A_VAL1_SUM COL1_B_VAL1_SUM COL1_C_VAL1_SUM
111 2 1 0 15 10 0
222 0 1 0 0 20 0
333 0 0 1 0 0 25
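Since the question also asks about PROC SQL, a single query with conditional aggregation is another option. This is a sketch that assumes the distinct values of col1 (A, B, and C here) are known in advance and can be hard-coded; the output name want_sql is made up for illustration. It relies on SAS evaluating a comparison such as col1 = 'A' to 1 or 0.

```sas
proc sql;
    create table want_sql as
        select id
             /* a comparison yields 1 (true) or 0 (false), so its sum is a count */
             , sum(col1 = 'A') as COL1_A
             , sum(col1 = 'B') as COL1_B
             , sum(col1 = 'C') as COL1_C
             /* add val1 only for rows that carry the matching col1 value */
             , sum(case when col1 = 'A' then val1 else 0 end) as COL1_A_VAL1_SUM
             , sum(case when col1 = 'B' then val1 else 0 end) as COL1_B_VAL1_SUM
             , sum(case when col1 = 'C' then val1 else 0 end) as COL1_C_VAL1_SUM
        from have
        group by id
    ;
quit;
```

Unlike the transpose approach, this needs one pair of expressions per col1 value, so it is practical only for a short, fixed list of values.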

Related

How to create 2 new columns with appropriate prefix based on values in columns with same prefix in SAS Enterprise Guide / PROC SQL?

I have table in SAS Enterprise Guide like below:
ID | COUNT_COL_A | COUNT_COL_B | SUM_COL_A | SUM_COL_B
-----|-------------|-------------|-----------|------------
111 | 10 | 10 | 320 | 120
222 | 15 | 80 | 500 | 500
333 | 1 | 5 | 110 | 350
444 | 20 | 5 | 670 | 0
Requirements:
I need to create a new column "TOP_COUNT" containing the name of the column (COUNT_COL_A or COUNT_COL_B) with the highest value for each ID.
If an ID has the same value in both "COUNT_" columns, "TOP_COUNT" takes the name of the column whose counterpart with prefix SUM_ (SUM_COL_A or SUM_COL_B) has the higher value.
I need to create a new column "TOP_SUM" containing the name of the column (SUM_COL_A or SUM_COL_B) with the highest value for each ID.
If an ID has the same value in both "SUM_" columns, "TOP_SUM" takes the name of the column whose counterpart with prefix COUNT_ (COUNT_COL_A or COUNT_COL_B) has the higher value.
It is not possible to have only 0 in the columns with prefix COUNT_ or only 0 in the columns with prefix SUM_.
There are no nulls in the table.
Desired output:
ID | COUNT_COL_A | COUNT_COL_B | SUM_COL_A | SUM_COL_B | TOP_COUNT | TOP_SUM
-----|-------------|-------------|-----------|------------|-------------|---------
111 | 10 | 10 | 320 | 120 | COUNT_COL_A | SUM_COL_A
222 | 15 | 80 | 500 | 500 | COUNT_COL_B | SUM_COL_B
333 | 1 | 5 | 110 | 350 | COUNT_COL_B | SUM_COL_B
444 | 20 | 5 | 670 | 0 | COUNT_COL_A | SUM_COL_A
How can I do that in SAS Enterprise Guide or in PROC SQL?
Use an array with a looping methodology:
Declare an array of the count variables
Set the maximum value to 0
Loop through the array
Check if each value is more than the current maximum
If yes, assign the value to the current maximum and store the variable name
If no, keep looping
Non-looping, function methodology:
Use MAX to find the maximum value of the array
Use WHICHN() to find the position of that maximum in the array
Use VNAME to get the variable name based on that position
*for count - you can extend for sum;
data want;
set have;
array _count(*) count_col_:;
*looping methodology;
top_count_value=0;
do i=1 to dim(_count);
if _count(i) > top_count_value then do;
top_count = vname(_count(i));
top_count_value = _count(i);
end;
end;
/*or function methodology*/
top_count_max = max(of _count(*));
index_top_count = whichn(top_count_max, of _count(*));
top_count_name_2 = vname(_count(index_top_count));
run;
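Neither method above applies the tie-breaking rule from the question. For the two-column case shown, the rule can be written out directly. This is a sketch only: it hard-codes the four column names from the question, and when both the counts and the sums tie (a case the question does not define) it arbitrarily picks the A column.

```sas
data want;
    set have;
    length top_count top_sum $32;

    /* highest count wins; on a tie, the column with the higher SUM_ counterpart wins */
    if      count_col_a > count_col_b then top_count = 'COUNT_COL_A';
    else if count_col_a < count_col_b then top_count = 'COUNT_COL_B';
    else if sum_col_a >= sum_col_b    then top_count = 'COUNT_COL_A';
    else                                   top_count = 'COUNT_COL_B';

    /* highest sum wins; on a tie, the column with the higher COUNT_ counterpart wins */
    if      sum_col_a > sum_col_b        then top_sum = 'SUM_COL_A';
    else if sum_col_a < sum_col_b        then top_sum = 'SUM_COL_B';
    else if count_col_a >= count_col_b   then top_sum = 'SUM_COL_A';
    else                                      top_sum = 'SUM_COL_B';
run;
```

With more than two columns per prefix this explicit form stops scaling, and the array or transpose approaches become the better starting point.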
Just do the same thing as in your other question. But because you want to transpose two sets of variables, it is probably going to be easier to use a data step and arrays to do the first transform.
data tall;
set have;
array counts count_col_a count_col_b;
array sums sum_col_a sum_col_b;
do index=1 to dim(sums);
length type $5 name $32 ;
type='COUNT';
name=vname(counts[index]);
value1=counts[index];
value2=sums[index];
output;
type='SUM';
name=vname(sums[index]);
value1=sums[index];
value2=counts[index];
output;
end;
run;
Now sort and take the last per ID/TYPE combination to find the largest.
proc sort data=tall;
by id type value1 value2 name;
run;
data top;
set tall;
by id type value1 value2;
if last.type;
run;
And then transpose and re-merge.
proc transpose data=top out=want(drop=_name_) prefix=TOP_;
by id;
id type;
var name;
run;
data want;
merge have want;
by id;
run;
Result:
Obs ID COUNT_COL_A COUNT_COL_B SUM_COL_A SUM_COL_B TOP_COUNT TOP_SUM
1 111 10 10 320 120 COUNT_COL_A SUM_COL_A
2 222 15 80 500 500 COUNT_COL_B SUM_COL_B
3 333 1 5 110 350 COUNT_COL_B SUM_COL_B
4 444 20 5 670 0 COUNT_COL_A SUM_COL_A

In SAS, how do you stop flagging a group of rows if a specific condition is met?

I have a SAS dataset that looks like this:
proc sql;
create table my_table
(id char(1),
my_date num format=date9.,
my_col num);
insert into my_table
values('A','01JAN2010'd,.)
values('A','02JAN2010'd,0)
values('A','03DEC2009'd,1)
values('A','04NOV2009'd,1)
values('B','01JAN2010'd,.)
values('B','02NOV2009'd,2)
values('C','01JAN2010'd,.)
values('C','02OCT2009'd,3)
values('D','01JAN2010'd,.)
values('D','02NOV2009'd,2)
values('D','03OCT2009'd,1)
values('D','04AUG2009'd,2)
values('D','05MAY2009'd,3)
values('D','06APR2009'd,1);
quit;
I am trying to create a new column desired that, for each group of id column, flags the row with a value of 1 if the value in my_col is missing or less than 3.
The part I'm having trouble with is that when there is a my_col value that is greater than 2, I need the desired value for that row to be missing and also stop flagging any remaining rows in the id group with a value of 1.
The resulting dataset should look like this:
+----+-----------+--------+---------+
| id | my_date | my_col | desired |
+----+-----------+--------+---------+
| A | 01JAN2010 | . | 1 |
| A | 02JAN2010 | 0 | 1 |
| A | 03DEC2009 | 1 | 1 |
| A | 04NOV2009 | 1 | 1 |
| B | 01JAN2010 | . | 1 |
| B | 02NOV2009 | 2 | 1 |
| C | 01JAN2010 | . | 1 |
| C | 02OCT2009 | 3 | . |
| D | 01JAN2010 | . | 1 |
| D | 02NOV2009 | 2 | 1 |
| D | 03OCT2009 | 1 | 1 |
| D | 04AUG2009 | 2 | 1 |
| D | 05MAY2009 | 3 | . |
| D | 06APR2009 | 1 | . |
+----+-----------+--------+---------+
Looks like a simple application of a retained variable. Set the flag to 1 when you start a new group and then set it to missing when the value of MY_COL is larger than 2.
data want;
set my_table ;
by id;
if first.id then desired=1;
if my_col>2 then desired=.;
retain desired;
run;
Also it is not clear why you used such complicated code to create your example data. Why not a simple data step?
data my_table;
input id :$1. my_date :date. my_col;
format my_date date9.;
cards;
A 01JAN2010 .
A 02JAN2010 0
A 03DEC2009 1
A 04NOV2009 1
B 01JAN2010 .
B 02NOV2009 2
C 01JAN2010 .
C 02OCT2009 3
D 01JAN2010 .
D 02NOV2009 2
D 03OCT2009 1
D 04AUG2009 2
D 05MAY2009 3
D 06APR2009 1
;
I can't think of a simpler way to do it, but this works. You will need to have your data sorted by id.
data my_table2;
set my_table;
by id;
format gt2flag $1.;
retain gt2flag;
if first.id then gt2flag='';
if my_col gt 2 then gt2flag='Y';
if gt2flag = 'Y' then desired=.;
else desired=1;
drop gt2flag;
run;
id my_date my_col desired
A 01JAN2010 . 1
A 02JAN2010 0 1
A 03DEC2009 1 1
A 04NOV2009 1 1
B 01JAN2010 . 1
B 02NOV2009 2 1
C 01JAN2010 . 1
C 02OCT2009 3 .
D 01JAN2010 . 1
D 02NOV2009 2 1
D 03OCT2009 1 1
D 04AUG2009 2 1
D 05MAY2009 3 .
D 06APR2009 1 .

Grouping child items and displaying parent sum

I have the following table
+-------+--------+---------+
| group | item | value |
+-------+--------+---------+
| 1 | a | 10 |
| 1 | b | 20 |
| 2 | b | 30 |
| 2 | c | 40 |
+-------+--------+---------+
I would like to group the table by group, insert the grouped sum into value, and then ungroup:
+-------+--------+
| item | value |
+-------+--------+
| 1 | 30 |
| a | 10 |
| b | 20 |
| 2 | 70 |
| b | 30 |
| c | 40 |
+-------+--------+
The purpose of the result is to interpret the first column as items a and b belonging to group 1 with sum 30 and items b and c belonging to group 2 with sum 70.
Such a data transformation can be indicative of a reporting requirement more than a useful data structure for downstream processing. Proc REPORT can create output in the form desired.
data have;
infile datalines;
input group $ item $ value @@;
datalines;
1 a 10 1 b 20 2 b 30 2 c 40
;
proc report data=have;
column group item value;
define group / order order=data noprint;
break before group / summarize;
compute item;
if missing(item) then item=group;
endcomp;
run;
I assume that both group and item are character variables
data have;
infile datalines firstobs=4 dlm='|';
input group $ item $ value;
datalines;
+-------+--------+---------+
| group | item | value |
+-------+--------+---------+
| 1 | a | 10 |
| 1 | b | 20 |
| 2 | b | 30 |
| 2 | c | 40 |
+-------+--------+---------+
;
data want (keep=group value);
do _N_=1 by 1 until (last.group);
set have;
by group;
v + value;
end;
value = v;output;v=0;
do _N_=1 to _N_;
set have;
group = item;
output;
end;
run;

Removing observations before 'beginning' and after 'ending' - SAS code

My table has some leading and trailing observations that I am trying to remove. I want to remove the rows that come before every 'begin' event and after every 'end' event for every single group. The table resembles the below:
| Time | Group | Event | Value |
| 1 | 1 | NA | 0 |
| 2 | 1 | NA | 0 |
| 3 | 1 | Begin | 1.1 |
| 4 | 1 | NA | 1.2 |
| 5 | 1 | NA | 1.3 |
| 6 | 1 | End | 1.4 |
| 7 | 1 | NA | 0 |
| 1 | 2 | NA | 0 |
| 2 | 2 | Begin | 1.1 |
| 3 | 2 | NA | 1.2 |
| 4 | 2 | End | 1.3 |
| 5 | 2 | NA | 1.4 |
On the presumption that the incoming data is already sorted and that there are zero or more serially bounded ranges of Begin to End within each group:
data want;
do until (last.group);
set have;
by group time;
if event = 'Begin' then _keeprow = 1;
if _keeprow then output;
if event = 'End' then _keeprow = 0;
end;
drop _keeprow;
run;
I have come up with an easy way, but it is limited by the actual data size.
data have;
input Time Group Event $ Value ;
datalines;
1 1 NA 0
2 1 NA 0
3 1 Begin 1.1
4 1 NA 1.2
5 1 NA 1.3
6 1 End 1.4
7 1 NA 0
1 2 NA 0
2 2 Begin 1.1
3 2 NA 1.2
4 2 End 1.3
5 2 NA 1.4
;
run;
proc sort data = have;
by group time;
run;
data have1;
set have;
count + 1;
by group;
if first.group then count = -100;
if event = 'Begin' then count = 0;
if event = 'End' then count = 100;
if count < 0 or count >100 then delete;
run;
This code works as long as each group has fewer than 100 observations between 'Begin' and 'End' and fewer than 100 observations before 'Begin'. You can adjust the initial count value according to the actual data size.
One way to do it is:
data have;
input Time Group Event $ Value ;
datalines;
1 1 NA 0
2 1 NA 0
3 1 Begin 1.1
4 1 NA 1.2
5 1 NA 1.3
6 1 End 1.4
7 1 NA 0
1 2 NA 0
2 2 Begin 1.1
3 2 NA 1.2
4 2 End 1.3
5 2 NA 1.4
;
data have2(keep= Group min_var max_var);
set have;
by group;
retain min_var max_var;
if trim(Event)= "Begin" then min_var =_n_ ;
if trim(Event)= "End" then max_var =_n_;
if last.group;
run;
data want;
merge have have2;
by group;
if _n_ ge min_var and _n_ le max_var ;
drop min_var max_var;
run;

Proc sql subquery based on nonexistent column returns not null

Here is sample code derived from an actual application. There are two datasets: "aa" for the query and "bb" for the subquery. Column "m" in dataset "aa" matches column "y" in dataset "bb". Table "aa" also has a column "yy" whose value is always 30. Column "m" in "aa" contains the value 30 in one of its rows; column "y" in "bb" does not.
The first proc sql uses the values from column "y" of table "bb" to subset table "aa" on matching values in column "m". It is a correct query and produces the expected results.
The second proc sql block has column "y" intentionally misspelled as "yy" in the subquery, on the line that starts with the where clause. Otherwise the block is the same as the first one. Given that there is no column "yy" in dataset "bb", I would expect an error message and the whole query to fail. However, it returns one row without failing or issuing an error message. A closer look suggests that it actually uses the "yy" column from table "aa" (see the tree in the log output).
I do not think this is correct behavior. I would greatly appreciate any comments or explanations. Otherwise, I should perhaps report it to SAS as a bug. Thank you!
Here is the code:
options
msglevel = I
;
data aa;
do i=1 to 20;
m=i*5;
yy=30;
output;
end;
run;
data bb;
do i=10 to 20;
y=i*5;
output;
end;
run;
option DEBUG=JUNK ;
/*Correct sql command*/
proc sql _method
_tree
;
create table cc as
select *
from aa
where m in (select y from bb)
;quit;
/*Incorrect sql command - column "yy" in not on "bb" table"*/
proc sql _method
_tree;
create table dd as
select *
from aa
where m in (select yy from bb)
;quit;
Here is log with sql tree:
119 options
120 msglevel = I
121 ;
122 data aa;
123 do i=1 to 20;
124 m=i*5;
125 yy=30;
126 output;
127 end;
128 run;
NOTE: The data set WORK.AA has 20 observations and 3 variables.
NOTE: DATA statement used (Total process time):
real time 0.00 seconds
cpu time 0.00 seconds
129
130 data bb;
131 do i=10 to 20;
132 y=i*5;
133 output;
134 end;
135 run;
NOTE: The data set WORK.BB has 11 observations and 2 variables.
NOTE: DATA statement used (Total process time):
real time 0.00 seconds
cpu time 0.00 seconds
136 option DEBUG=JUNK ;
137
138 /*Correct sql command*/
139 proc sql _method
140 _tree
141 ;
142 create table cc as
143 select *
144 from aa
145 where m in (select y from bb)
146 ;
NOTE: SQL execution methods chosen are:
sqxcrta
sqxfil
sqxsrc( WORK.AA )
NOTE: SQL subquery execution methods chosen are:
sqxsubq
sqxsrc( WORK.BB )
Tree as planned.
/-SYM-V-(aa.i:1 flag=0001)
/-OBJ----|
| |--SYM-V-(aa.m:2 flag=0001)
| \-SYM-V-(aa.yy:3 flag=0001)
/-FIL----|
| | /-SYM-V-(aa.i:1 flag=0001)
| | /-OBJ----|
| | | |--SYM-V-(aa.m:2 flag=0001)
| | | \-SYM-V-(aa.yy:3 flag=0001)
| |--SRC----|
| | \-TABL[WORK].aa opt=''
| | /-SYM-V-(aa.m:2)
| \-IN-----|
| | /-SYM-V-(bb.y:2 flag=0001)
| | /-OBJ----|
| | /-SRC----|
| | | \-TABL[WORK].bb opt=''
| \-SUBC---|
--SSEL---|
NOTE: Table WORK.CC created, with 11 rows and 3 columns.
146! quit;
NOTE: PROCEDURE SQL used (Total process time):
real time 0.03 seconds
cpu time 0.03 seconds
147
148
149 /*Incorrect sql command - column "yy" in not on "bb" table"*/
150 proc sql _method
151 _tree;
152 create table dd as
153 select *
154 from aa
155 where m in (select yy from bb)
156 ;
NOTE: SQL execution methods chosen are:
sqxcrta
sqxfil
sqxsrc( WORK.AA )
NOTE: SQL subquery execution methods chosen are:
sqxsubq
sqxreps
sqxsrc( WORK.BB )
Tree as planned.
/-SYM-V-(aa.i:1 flag=0001)
/-OBJ----|
| |--SYM-V-(aa.m:2 flag=0001)
| \-SYM-V-(aa.yy:3 flag=0001)
/-FIL----|
| | /-SYM-V-(aa.i:1 flag=0001)
| | /-OBJ----|
| | | |--SYM-V-(aa.m:2 flag=0001)
| | | \-SYM-V-(aa.yy:3 flag=0001)
| |--SRC----|
| | \-TABL[WORK].aa opt=''
| | /-SYM-V-(aa.m:2)
| \-IN-----|
| | /-SYM-A-(#TEMA001:1 flag=0035)
| | /-OBJ----|
| | /-REPS---|
| | | |--empty-
| | | |--empty-
| | | | /-OBJ----|
| | | |--SRC----|
| | | | \-TABL[WORK].bb opt=''
| | | |--empty-
| | | |--empty-
| | | | /-SYM-A-(#TEMA001:1 flag=0035)
| | | | /-ASGN---|
| | | | | \-SUBP(1)
| | | \-OBJE---|
| \-SUBC---|
| \-SYM-V-(aa.yy:3)
--SSEL---|
NOTE: Table WORK.DD created, with 1 rows and 3 columns.
156! quit;
NOTE: PROCEDURE SQL used (Total process time):
real time 0.01 seconds
cpu time 0.01 seconds
Here are datasets:
aa:
i m yy
1 5 30
2 10 30
3 15 30
4 20 30
5 25 30
6 30 30
7 35 30
8 40 30
9 45 30
10 50 30
11 55 30
12 60 30
13 65 30
14 70 30
15 75 30
16 80 30
17 85 30
18 90 30
19 95 30
20 100 30
bb:
i y
10 50
11 55
12 60
13 65
14 70
15 75
16 80
17 85
18 90
19 95
20 100
I agree, this looks pretty weird and may well be a bug. I was able to reproduce this from the code you provided in SAS 9.4 and in SAS 9.1.3, which would make it at least ~12 years old.
In particular, I'm interested in this bit of the output you got from the _method option when creating the DD table but not when creating the CC table:
NOTE: SQL subquery execution methods chosen are:
sqxsubq
sqxreps <--- What is this doing?
sqxsrc( WORK.BB )
Similarly, the corresponding section from the _tree output is highly obscure:
| | /-SYM-A-(#TEMA001:1 flag=0035)
| | /-OBJ----|
| | /-REPS---|
| | | |--empty-
| | | |--empty-
| | | | /-OBJ----|
| | | |--SRC----|
| | | | \-TABL[WORK].bb opt=''
| | | |--empty-
| | | |--empty-
| | | | /-SYM-A-(#TEMA001:1 flag=0035)
| | | | /-ASGN---|
| | | | | \-SUBP(1)
| | | \-OBJE---|
| \-SUBC---|
| \-SYM-V-(aa.yy:3)
I have never seen sqxreps or reps in the respective bits of output before. Neither of them is listed in any of the papers I was able to find based on a brief bit of googling (in fact, this question is currently the only hit on Google for sas + sqxreps):
http://support.sas.com/resources/papers/proceedings10/139-2010.pdf
http://www2.sas.com/proceedings/sugi30/101-30.pdf
Quoting the first of these:
Codes Description
sqxcrta Create table as Select
Sqxslct Select
sqxjsl Step loop join (Cartesian)
sqxjm Merge join
sqxjndx Index join
sqxjhsh Hash join
sqxsort Sort
sqxsrc Source rows from table
sqxfil Filter rows
sqxsumg Summary stats with GROUP BY
sqxsumn Summary stats with no GROUP BY
Based on a bit of quick testing, this seems to happen regardless of the variable and table names used, provided that the variable name from AA is repeated multiple times in the subquery referencing table BB. It also happens if you have a variable named e.g. YYY in AA but one named YY in BB, or more generally whenever you have a variable in BB whose name starts out the same as the name of the corresponding variable in AA, with the AA name then continuing for one or more characters.
From this, I'm guessing at some point in the SQL parser, someone used a like operator rather than checking for equality of variable names, and somehow as a result this syntax is triggering an undocumented or incomplete 'feature' in proc sql.
An example of the more general case:
options
msglevel = I
;
data aa;
do i=1 to 20;
m=i*5;
myvar_plus_suffix=30;
output;
end;
run;
data bb;
do i=10 to 20;
myvar=i*5;
output;
end;
run;
option DEBUG=JUNK ;
/*Incorrect sql command - column "myvar_plus_suffix" is not on "bb" table*/
proc sql _method
_tree;
create table dd as
select *
from aa
where m in (select myvar_plus_suffix from bb)
;quit;
Here is a response from SAS support.
What you are seeing is related to column scoping in PROC SQL.
PROC SQL supports Correlated Subqueries. A Correlated Subquery references a column in the "outer" table which can then be compared to columns in the "inner" table. PROC SQL does not require that a fully qualified column name is used. As a result, if it sees a column in the subquery that does not exist in the inner table (the table referenced in the subquery), it looks for that column in the "outer" table and uses the value if it finds one.
If a fully qualified column name is used, the error you are expecting will occur such as the following:
proc sql;
create table dd as
select *
from aa as outer
where outer.m in (select inner.yyy from bb as inner);
quit;
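The scoping rule can be seen by writing the outer reference explicitly. The sketch below reuses the aa and bb datasets from the question; the output names dd_explicit and dd_error are made up for illustration. The first query should behave like the unqualified version, because yy resolves to aa.yy, which is 30 on every row, so only the row with m = 30 qualifies. The second qualifies the column with bb and is expected to stop with an error, since bb has no column yy.

```sas
proc sql;
    /* Explicit outer reference: expected to match the unqualified "select yy" query */
    create table dd_explicit as
        select *
        from aa
        where m in (select aa.yy from bb)
    ;

    /* Column qualified with the inner table: expected to fail, bb has no column yy */
    create table dd_error as
        select *
        from aa
        where m in (select bb.yy from bb)
    ;
quit;
```

Qualifying every column in a subquery with its table name or alias is a cheap way to turn this kind of typo into an error instead of a silently different result.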