List Aggregation and Group Concatenation in SAS Proc SQL - sas

I have a dataset which has to be rolled up based on the granularity(FIELD1 & FIELD2). Two of the metrics fields(METRIC1 & METRIC2) have to be summed up. Until now it seems to be an easy GROUP BY task. But I have a string field(FLAG) which has to be rolled up too, by concatenating the distinct values.
Input Dataset:
Expected Result:
This operation can be performed in Oracle using the LISTAGG() function.
Kindly help me out in achieving the same in SAS Proc SQL.

I don't believe there's a direct way to do this in SAS. CATS (and similar concatenation functions) aren't aggregation functions. It was suggested to add these back a few years ago but nothing came of it that I'm aware of (see this thread.)
If I understand right, what you're doing is GROUP BY field1/field2, SUM metric1/metric2, and make a single FLAG field that concatenates all seen FLAG field values (but doesn't group by them).
The way I would handle this is to first do your aggregation (field1/field2), and then join that to a separate table that was just field1/field2/flag. You could make that most easily in the data step, something like:
data want;
set have;
by field1 field2;
length flag_out $100; *or longer if you need longer;
flag_out = catx(',',flag_out,flag);
if last.field2 then output;
rename flag_out=flag;
drop flag;
run;
This assumes it's sorted already by field1/field2, otherwise you need to do that first.

As stated, there is no LISTAGG() function, and there is also no built-in feature for creating a custom aggregate function. However, there are two possibilities that get the output.
Example one
Data step with DOW processing and hashing for tracking distinct flag values while concatenating within a group.
data want;
if 0 then set have; *prep pdv;
length flags $200;
declare hash _flags();
_flags.defineKey('flag');
_flags.defineDone();
do until (last.f2);
set have;
by f1 f2;
m1_sum = sum(m1_sum,m1);
m2_sum = sum(m2_sum,m2);
if _flags.find() ne 0 then do;
_flags.add();
flags = catx(',',flags,flag);
end;
end;
drop m1 m2 flag;
_flags.delete();
run;
Example two
Create a FCMP custom function used from within SQL. Because FCMP can not create an aggregate function, the result will be automatically remerged against original data which must then be filtered. The FCMP function also uses a hash for tracking distinct values of flag within a group.
proc fcmp outlib=sasuser.functionsx.package;
function listagg(f1 $, f2 $, n, item $) $;
length result $32000 index 8;
static flag;
static index;
declare hash items();
if flag = . then do;
flag = 1;
rc = items.defineKey('item');
rc = items.defineDone();
end;
static items;
index + 1;
rc = items.replace();
if index = n then do;
declare hiter hi('items');
result = '';
do while (hi.next() = 0);
result = catx(',',result,item);
end;
index = 0;
rc = items.clear();
return (result);
end;
else
return ("");
endsub;
run;
options cmplib=sasuser.functionsx;
proc sql;
create table wanted as
select * from
(
select /* subselect is a remerge due to 'listagg' mimic */
f1,
f2,
listagg(f1, f2, count(*), flag) as flags,
sum(m1) as m1,
sum(m2) as m2
from have
group by f1, f2
)
where flags is not null /* filter the subselect */
;
quit;
Ideally a hash of hashes would have been used, but FCMP only provides for hash instance creation in a declare statement, and dynamic hashes can not be instantiated with _new_. SAS Viya users would be able to use the new component object Dictionary in an FCMP function, and might be able to have a Dictionary of Dictionaries to track the distinct flag values in each group.

Thanks everyone for your valuable inputs. Apparently there is no straightforward solution to this scenario in SAS. With the bigger picture of the requirement in mind, I have decided to tackle the problem at the data layer itself or add another intermediate presentation layer.
I'm sure many have pointed out this need to SAS, I have raised this issue with SAS too. Hope they look into it and come up with a similar function as LISTAGG OR GROUP_CONCAT.

Answer from Joe is very good but is missing one critical part.
There should be a
retain flag_out;
line after the 'by' line.

Related

I want to add auto_increment column in a table in SAS

I want to add a auto_Increment column in a table in SAS.Following code add's a column but not increment the value.
Thanks In Advance.
proc sql;
alter table pmt.W_cur_qtr_recoveries
add ID integer;
quit;
Wow, going to try for my second "SAS doesn't do that" answer this morning. Risky stuff.
A SAS dataset cannot define an auto-increment column. Whether you are creating a new dataset or inserting records into an existing dataset, you are responsible for creating any increment counters (ie they are just normal numeric vars where you have set the values to what you want).
That said, there are DATA step statements such as the sum statement (e.g. MyCounter+1) that make it easier to implement counters. If you describe more details of your problem, people could provide some alternatives.
The correct answer at this time is to create the ID yourself, BUT the discussion wouldn't be complete without mentioning that there is an unsupported SQL function Monotonic that can do what you want. It's not reliable, yet it persists.
The code pattern for its usage is
select monotonic() as ID, ....
Use the _N_ automatic variable in a data step like:
DATA TEMPLIB.my_dataset (label="my dataset with auto increment variables");
SET TEMPREP.my_dataset;
sas_incr_num = _N_; * add an auto increment 'sas_incr_num' variable;
sas_incr_cat = cat("AB.",cats(repeat("0",5-ceil(log10(sas_incr_num+1))),sas_incr_num),".YZ"); * auto increment the sas_incr_num variable and add 5 leading zeros and concatenate strings on either end;
LABEL
sas_incr_num="auto number each row"
sas_incr_cat="auto number each row, leading zeros, and add strings along for fun"
...
There is no such thing as an auto increment column in a SAS dataset. You can use a data step to create a new dataset that has the new variable. You can use the same name to have it replace the old one when done.
data pmt.W_cur_qtr_recoveries;
set pmt.W_cur_qtr_recoveries;
ID+1;
run;
It really depends on what your intended outcome is. But I have thrown together an example of how you may want to tackle this. it is a little rough, but gives you something to work from.
/*JUST SETTING UP THE DAY ONE DATA WITH AN ID ATTACHED
YOU WOULD MAKE THE FIRST RUN EXECUTE DIFFERENTLY TO SUBSEQUENT RUNS BY USING THE EXISTS FUNCTION AND MACRO LANGUAGE,
BUT I WILL LET YOU INVESTIGATE THIS FURTHER AS IT MAY BE IRRELEVANT.*/
DATA DAY1;
SET SASHELP.CLASS;
ID+1;
RUN;
/*ON DAY 2 WE ARE APPENDING ADDITIONAL RECORDS TO THE EXISTING DATASET*/
DATA DAY2;
/*APPEND DATASETS*/
SET DAY1 SASHELP.CLASS;
/*HOLD VALUE IN PROGRAM DATA VECTOR (PDV) UNTIL EXPLICITLY CHANGED*/
RETAIN _ID;
/*ADD VARIABLE _ID AND POPULATE WITH ID. IN DOING THIS THE LAST INSTANCE OF THE ID WILL BE HELD IN THE PDV FOR THE
FIRST OF THE NEW RECORDS*/
IF ID ~= . THEN _ID = ID;
/*INCREMENT THE VALUE IN _ID BY 1 AND DO SO FOR EACH RECORD ADDED*/
ELSE DO;
_ID+1;
END;
/*DROP THE ORIGINAL ID;*/
DROP ID;
/*RENAME _ID TO ID*/
RENAME _ID = ID;
RUN;
where "W_prv_qtr_recoveries" is a table Name and "pmt" is a library name.
Thanks to user2337871.
DATA pmt.W_prv_qtr_recoveries;
SET pmt.W_prv_qtr_recoveries;
RETAIN _ID;
IF ID ~= . THEN _ID = ID;
ELSE DO;
_ID+1;
END;
DROP ID;
RENAME _ID = ID;
RUN;
Assuming that this autoincrement column will be used for every record that is inserted.
We can accomplish the same as follows:-
We will first check the latest key in the dataset
PROC SQL;
SELECT MAX(KEY) INTO :MK FROM MYDATA;
QUIT;
%put KeyOld=&MK;
Then we increment this key
Data _NULL_;
call symput('KeyNew',&MK+1);
run;
%put KeyNew=&KeyNew;
Here we hold the New record that we want to insert, and add the correspoding key
Data TEMP1;
set TEMP;
Key=&KeyNew;
run;
Finally we load the new record in our dataset
PROC APPEND BASE=MYDATA DATA=TEMP1 FORCE;
RUN;

SAS Updating records sequentially

I have hundreds of thousands of IDs in a large dataset.
Some records have the same ID but different data points. Some of these IDs need to be merged into a single ID. People registered for a system more than once should be just one person in the database.
I also have a separate file that tells me which IDs need to be merged, but it's not always a one-to-one relationship. For example, in many cases I have x->y and then y->z because they registered three times. I had a macro that essentially was the following set of if-then statements:
if ID='1111111' then do; ID='2222222'; end;
if ID='2222222' then do; ID='3333333'; end;
I believe SAS runs this one record at a time. My list of merged IDs is almost 15k long, so it takes forever to run and the list just gets longer. Is there a faster method of updating these IDs?
Thanks
EDIT: Here is an example of the situation, except the macro is over 15k lines long due to all the merges.
data one;
input ID $5. v1 $ v2 $;
cards;
11111 a b
11111 c d
22222 e f
33333 g h
44444 i j
55555 k l
66666 m n
66666 o p
;
run;
%macro ID_Change;
if ID='11111' then do; ID='77777'; end; *77777 is a brand new ID;
if ID='22222' then do; ID='88888'; end; *88888 is a new ID but is merged below;
if ID='88888' then do; ID='99999'; end; *99999 becomes the newer ID;
%mend;
data two; set one; %ID_Change; run;
A hash table will greatly speed up the process. Hash tables are one of the little-used, but highly effective, tools in SAS. They're a bit bizarre since the syntax is very different from standard SAS programming. For now, think of it as a way to merge data together in-memory (a big reason as to why it's so fast).
First, create a dataset that has the conversions that you need. We want to match up by ID, then convert it to New_ID. Consider ID as your key column, and New_ID as your data column.
dataset: translate
ID New_ID
111111 222222
222222 333333
In a hash table, you need to consider two things:
The Key column(s)
The Data column(s)
The Data column is what will be replacing observations matched by the Key column. In other words, New_ID will be populated every time there's a match for ID.
Next, you'll want to do your hash merge. This is performed in the data step.
data want;
set have;
/* Only declare the hash object on the first iteration.
Otherwise it will do this every record. */
if(_N_ = 1) then do;
declare hash id_h(dataset: 'translate'); *Declare a hash object called 'id_h';
id_h.defineKey('ID'); *Define key for matching;
id_h.defineData('New_ID'); *The new ID after matching;
id_h.defineDone(); *Done declaring this hash object;
call missing(New_ID); *Prevents a warning in the log;
end;
/* If a customer has changed multiple times, keep iterating until
there is no longer a match between tables */
do while(id_h.Find() = 0);
_loop_count+1; *Tells us how long we've been in the loop;
/* Just in case the while loop gets to 500 iterations, then
there's likely a problem and you don't want the data step to get stuck */
if(_loop_count > 500) then do;
put 'WARNING: ' ID ' iterated 500 times. The loop will stop. Check observation ' _N_;
leave;
end;
/* If the ID of the hash table matches the ID of the dataset, then
we'll set ID to be New_ID from the hash object;
ID = New_ID;
end;
_loop_count = 0;
drop _loop_count;
run;
This should run very quickly and provide the desired output, assuming that your lookup table is coded in the way that you need it to be.
Use PROC SQL or a MERGE step against your separate file (after you have created a separate dataset from it, using infile or proc import) to append this unique id to all records. If your separate file contains only the duplicates, you will need to create a dummy unique id for the non-duplicates.
Do PROC SORT with BY unique id and timestamp of signup.
Use a DATA step with the same BY variables. Depending on whether you want to keep the first or last signup, do if first.timestamp then output; (or last, etc.)
Or you could do it all in one PROC SQL using a left join to the separate file, a coalesce step to return a dummy unique id if it is not contained in the separate file, a group by unique id, and a having max(timestamp) (or min). You can also coalesce any other variables you might want to try to preserve across signups -- for example, if the first signup contained a phone number and successive signups were missing that data point.
Without a reproducible example it's hard to be more specific.

select only a few columns from a large table in SAS

I have to join 2 tables on a key (say XYZ). I have to update one single column in table A using a coalesce function. Coalesce(a.status_cd, b.status_cd).
TABLE A:
contains some 100 columns. KEY Columns ABC.
TABLE B:
Contains just 2 columns. KEY Column ABC and status_cd
TABLE A, which I use in this left join query is having more than 100 columns. Is there a way to use a.* followed by this coalesce function in my PROC SQL without creating a new column from the PROC SQL; CREATE TABLE AS ... step?
Thanks in advance.
You can take advantage of dataset options to make it so you can use wildcards in the select statement. Note that the order of the columns could change doing this.
proc sql ;
create table want as
select a.*
, coalesce(a.old_status,b.status_cd) as status_cd
from tableA(rename=(status_cd=old_status)) a
left join tableB b
on a.abc = b.abc
;
quit;
I eventually found a fairly simple way of doing this in proc sql after working through several more complex approaches:
proc sql noprint;
update master a
set status_cd= coalesce(status_cd,
(select status_cd
from transaction b
where a.key= b.key))
where exists (select 1
from transaction b
where a.ABC = b.ABC);
quit;
This will update just the one column you're interested in and will only update it for rows with key values that match in the transaction dataset.
Earlier attempts:
The most obvious bit of more general SQL syntax would seem to be the update...set...from...where pattern as used in the top few answers to this question. However, this syntax is not currently supported - the documentation for the SQL update statement only allows for a where clause, not a from clause.
If you are running a pass-through query to another database that does support this syntax, it might still be a viable option.
Alternatively, there is a way to do this within SAS via a data step, provided that the master dataset is indexed on your key variable:
/*Create indexed master dataset with some missing values*/
data master(index = (name));
set sashelp.class;
if _n_ <= 5 then call missing(weight);
run;
/*Create transaction dataset with some missing values*/
data transaction;
set sashelp.class(obs = 10 keep = name weight);
if _n_ > 5 then call missing(weight);
run;
data master;
set transaction;
t_weight = weight;
modify master key = name;
if _IORC_ = 0 then do;
weight = coalesce(weight, t_weight);
replace;
end;
/*Suppress log messages if there are key values in transaction but not master*/
else _ERROR_ = 0;
run;
A standard warning relating to the the modify statement: if this data step is interrupted then the master dataset may be irreparably damaged, so make sure you have a backup first.
In this case I've assumed that the key variable is unique - a slightly more complex data step is needed if it isn't.
Another way to work around the lack of a from clause in the proc sql update statement would be to set up a format merge, e.g.
data v_format_def /view = v_format_def;
set transaction(rename = (name = start weight = label));
retain fmtname 'key' type 'i';
end = start;
run;
proc format cntlin = v_format_def; run;
proc sql noprint;
update master
set weight = coalesce(weight,input(name,key.))
where master.name in (select name from transaction);
run;
In this scenario I've used type = 'i' in the format definition to create a numeric informat, which proc sql uses convert the character variable name to the numeric variable weight. Depending on whether your key and status_cd columns are character or numeric you may need to do this slightly differently.
This approach effectively loads the entire transaction dataset into memory when using the format, which might be a problem if you have a very large transaction dataset. The data step approach should hardly use any memory as it only has to load 1 row at a time.

SAS creating subtables ordered by year

I have a table in SAS which consists of data from stock exchange. One of its columns holds information about date. I would like to create subtables, each one hold data from only one specific year.
Assuming you want to do this (often, this is an inferior option as analyses run separately by year can be run from one dataset using by year;, but certainly this can sometimes be appropriate), the gold standard method for doing this is the hash table, as the hash table can produce unlimited tables based on the data. I will edit in a method for doing this with hash table if I have time while running things this afternoon; it's the 'hashing' method described on this page.
Hashing code, adapted from the sascommunity.org page above:
data have;
call streaminit(7);
do year=1998 to 2014;
do id= 1 to 10;
x=rand('Uniform');
output;
end;
end;
run;
data _null_ ;
dcl hash byyear () ;
byyear.definekey ('k') ; if `id` or similar is a safe unique ID you could use that here, otherwise `k` is your unique identifier - hash requires unique;
byyear.definedata ('year','id','x') ;
byyear.definedone () ;
do k = 1 by 1 until ( last.year ) ;
set have;
by year ;
byyear.add () ;
end ;
dsetname=cats('year',year);
byyear.output (dataset: dsetname) ;
run ;
There is a similar set of methods that revolve around using a macro to generate the code. This paper goes into detail about one method to do that; I won't explain it in detail as I consider it inferior to the hash method (even if it is lower CPU time, it is more complicated to write than either a pure macro method or a pure hash method) but in certain cases it could be better.
A simple example of the macro method using the conceptual have aframe defined:
proc sql;
select distinct(cats('year',year(date))) into :dsetlist
separated by ' '
from have;
select distinct(cats('%outputyear(year=',year(date),')')) into :outputlist
separated by ' '
from have;
quit;
%macro outpuyear(year=);
if year(date)=&year. then output year&dset.;
%mend outputyear;
data &dsetlist.;
set have;
&outputlist.;
run;
data year1 year2 year3 yearN;
set stockdata;
if year(date) = 2014 then
output year1;
else if year(date) = 2013 then
output year2;
else if year(date) = 2012 then
output year3;
else
output yearN;
run;
You could also use case statements I guess.

SAS: Compare values between two numbers to make separate buckets

I'm trying to compare multiple sets of data by putting them in separate groups between two numbers. Originally I had statements like,
if COLUMN1 gt 0 and COLUMN1 LE 1000 then PRICE_GROUP = 1000;
I had this going up by 1000 to 100,000. The only problem is that once I counted how many were in each price_group, some price_groups were missing (57,000 had no values so when I would count(Price_group) it would not appear for some groups). The solution I think is to make a table with the bounds for each, and then compare the actual value vs the upper and lower bound.
proc iml;
mat = j(100,2,0);
total = 100000;
mat[1,1] = 0;
mat[1,2] = mat[1,1] + (total/100);
do i = 2 to nrow(mat);
mat[i,1] = mat[i-1,1] + (total/100);
mat[i,2] = mat[i,1] + (total/100);
end;
create dataset from mat;
append from mat;
quit;
This creates the table which I can compare the values, but is there an easier way besides proc iml? I was next going to do a loop to compare each value with the two columns and create a new column on the table to have the count in each bucket. This still seems like an intensive process that is inefficient.
IML isn't a terrible solution, but there are a few others depending on what exactly you're doing.
The most common is proc format. Create a format that manages each bucket, like so:
proc format;
value buckets
0-1000 = 1000
1000<-2000 = 2000
...
other="NA";
quit;
Then you can either use the format (or informat) to create a new variable with the bucketed value, or even better, use the format on the fly (ie, in proc means or whatnot) which not only means you don't have to rewrite the dataset, but you can swap formats on and off depending on how many buckets you want (say, buckets100 format or buckets20 and whatnot).
Second, your specific question looks like it's solveable just using math:
data want;
set have;
bucket = &total/100*(1+floor(column1/(&total/100)));
run;
although obviously that doesn't work for every example.
Third, you could use a hash lookup table, if you are unable to use formats (such as there are two or more elements that determine the bucket). If that's useful I can expand on that, or just google about as those are very commonly used for lookups in SAS. That's the closest solution to the IML solution inside a regular datastep.
Create another table with groups:
data group_table;
do price_group=1000 to 100000 by 1000;
output;
end;
run;
Then left join the grouping/comparison table with this new table using price_group as key:
proc sql;
create table price_group_compare as
select L.price_group,R.group1_count,R.group2_count
from group_table as L, group_counts as R
on L.price_group = R.price_group;
quit;