Suppose I'm subsetting a table and summarizing it in proc sql. The code uses a where ... in clause and a subquery to do the subsetting. I know that some SQL engines would set some limit on the number of arguments to the where ... in clause. Does SAS has have limit on this? This question would apply to a program like this:
proc sql;
create table want as
select
ID,
sum(var1) as var1,
sum(var2) as var2,
sum(var3) as var3
from largetable
where ID in (select ID from longlist)
group by ID;
quit;
What if longlist returns 10,000 IDs? How about 10,000,000?
I'm not aware of any explicit limit on this. SAS's SQL parser seems to convert these often to JOINs, when they're not explicitly coded in the table; that means there are some limitations, but not particularly small ones.
I do believe there is a limit to the length of a SQL statement in total, so if you were trying to include an extremely long list in text you might run into problems, but in the example above I don't see a problem with 10,000,000 IDs. I just tested it with 250,000,000 IDs in the longlist table, and SAS had no problem with it:
data largetable;
do id=1 to 1e8;
if mod(id,7)=0 then output;
end;
run;
data ids;
do id = 1 to 1e9;
if mod(id,4)=0 then output;
end;
run;
proc sql _method;
create table want as
select
ID
from largetable
where ID in (select ID from IDs)
group by ID;
quit;
Interestingly, adding _method indicates it does not do this as a join, but as a subquery. I'm not sure why, at least in this case; everything I've been told says that it should convert this to a join implicitly.
As Joe has said, there should probably be no problems with any reasonable number of rows in the longlist table. However, although this may be readable, a join may perform better.
Do you have a strong preference for running the query as written rather than doing a left join, e.g.
proc sql;
create table want as
select
b.ID,
sum(b.var1) as var1,
sum(b.var2) as var2,
sum(b.var3) as var3
from longlist a left join largetable b
on a.ID = b.ID
group by b.ID;
quit;
Elaborating a bit on entering a long list as text - I'm not aware of any limit on the length of any one statement in SAS, but there are various limits on the length of individual lines of code, depending on your version and how you're submitting it. I suspect it's possible to split a long statement over several lines each approaching the maximum allowed length.
Related
I have a dataset with 20 columns all starting with the name morb_, which are all 1 or 2, coded as No and Yes. There is an additional column called Pat_TNO which is the patient reference number. Patients have more than one row.
I wish to create a new dataset which summarises whether each patient has had at least one of each type of event. So far the code I have written works perfectly, but is there a way to simplify it using an array?
proc sql;
select
Pat_TNO,
max(morb_1) as morb_1 format yn.,
max(morb_2) as morb_2 format yn. /* etc etc */
from morbidity
group by Pat_TNO;
quit;
COumn names aren't morb_1 and morb_2, rather morb_amputation, morb_mi, morb_tia, etc.
proc summary data=morbidity nway missing;
class pat_tno;
output out=max max(morb_:) = ;
run;
I have a question about the following 2 codes in SAS PROC SQL.
Code 1: (Standard Book version)
CREATE TABLE WORK.OUTPUT AS
SELECT
"CLAIM" AS SOURCE,
a.CLAIMID,
a.DXCODE
FROM
DW.CLAIMS_BAV AS a
WHERE
a.SITEID = '0001'
AND a.CLAIMID IN (SELECT CLAIMID FROM WORK.INPUT)
Code 2: (The much faster way in practice)
CREATE TABLE WORK.OUTPUT AS
SELECT
"CLAIM" AS SOURCE,
a.CLAIMID,
a.DXCODE
FROM
DW.CLAIMS_BAV AS a
WHERE
a.SITEID = '0001'
AND a.CLAIMID IN ('10001', '10002', '10003', ... '15000')
When I try to do it more elegantly by using subquery in #1, the run time blows up to 50 minutes +. But the same input returns within 3 minutes using Code 2. Why is that? Note, it's just as slow using INNER JOIN too (after reading this). The input is 5000+ CLAIMID, which I manually paste into the IN('...') block everyday.
PS: The CLAIMID are made up, in real life they are random.
The CLAIMID are indexed in DW.CLAIMS. I am using SAS PROC SQL to access an Oracle database. What is going on, and is there a better way? Thanks!
I don't know that I can tell you why SAS is so slow at the first select; something's not optimized in that scenario clearly.
If I had to guess, I'd guess that SAS is deciding in the first case that it can't use pass-through SQL and so it's downloading the whole big table and then running this SAS-side, while in the second case it's passing the query up to the SQL database and only transporting the resulting rows back.
But there are several ways to work around this, anyway. Here's one: use a macro variable to do precisely the pasting you're doing!
proc sql;
select quote(strip(claimid)) into :claimlist separated by ','
from work.input
;
CREATE TABLE WORK.OUTPUT AS
SELECT
"CLAIM" AS SOURCE,
a.CLAIMID,
a.DXCODE
FROM
DW.CLAIMS_BAV AS a
WHERE
a.SITEID = '0001'
AND a.CLAIMID IN (&claimlist.)
;
quit;
Tada, you don't have to touch this anymore, and it's identical to the copy/paste that you did.
A few extra notes given some comments:
If CLAIMID is ever less than 15, you may have space padding, so I added strip to remove those. It doesn't matter for string comparisons - except insomuch as you might run out of macro language, and I worry that some DBMS may actually care about the padding. You can leave out strip if the 15 is a constant length.
Macro variables run up to 64K in space. If you have 15 character variable plus " " two plus comma one, you have 18 characters; you have room for a bit over 3500 values. That's under 5000, unfortunately.
In this case, you can either split up the field into two macro variables (easy enough hopefully, use obs and firstobs) or you can do some other solution.
Transfer the work.input dataset into the DW libname, then do the join in SQL there.
Put the contents of the claimID into a file instead of into a macro variable, and then %include that file.
Use call execute to execute the whole proc SQL.
Here's one example of CALL EXECUTE.
data _null_;
set work.input end=eof;
if _n_=1 then do;
call execute('CREATE TABLE WORK.OUTPUT AS
SELECT
"CLAIM" AS SOURCE,
a.CLAIMID,
a.DXCODE
FROM
DW.CLAIMS_BAV AS a
WHERE
a.SITEID = "0001"
AND a.CLAIMID IN ('); *the part of the SQL query before the list of IDs;
end;
call execute(quote(claimID) || ' ');
if EOF then do;
call execute('); QUIT;'); *the part of the SQL query after the list of IDs;
end;
run;
This would be nearly identical to the %INCLUDE solution really, except there you put that stuff to a text file instead of CALL EXECUTEing it, and then you %INCLUDE that text file.
I think you're working both with local data and data on your server. When SAS is working with data from different sources (databases) it brings it all into SAS for processing which can be really, really slow.
Instead, you can make a macro variable and use that within your query. If it's 5000, it should fit into one macro variable, assuming the length is less than 13 chars each. A macro variable size limit is 64K characters, so it depends on the length of the variable. If not you could create a macro instead.
proc sql noprint;
select quote(claimID, "'") into : claim_list separated by ", " from input;
quit;
proc sql;
CREATE TABLE WORK.OUTPUT AS
SELECT
"CLAIM" AS SOURCE,
a.CLAIMID,
a.DXCODE
FROM DW.CLAIMS_BAV AS a
WHERE
a.SITEID = '0001'
AND a.CLAIMID IN (&claim_list.);
quit;
Please be sure to use
option sastrace=',,,ds' sastraceloc=saslog nostsuffix;
to receive information on how your code is translated by SAS/Aceess engine to DB statements.
In order to give SAS a hint to dynamicly build IN (1,2,3, ..) clause from your IN (SELECT .. query
add MULTI_DATASRC_OPT=IN_CLAUSE to your libname DW ... statement and
add dbmaster dataset option to the "master" table
like one of the following queries:
CREATE TABLE WORK.OUTPUT AS
SELECT
"CLAIM" AS SOURCE,
a.CLAIMID,
a.DXCODE
FROM
DW.CLAIMS_BAV (dbmaster=yes) AS a
WHERE
a.SITEID = '0001'
AND a.CLAIMID IN (SELECT CLAIMID FROM WORK.INPUT)
or
CREATE TABLE WORK.OUTPUT AS
SELECT
"CLAIM" AS SOURCE,
a.CLAIMID,
a.DXCODE
FROM
DW.CLAIMS_BAV (dbmaster=yes) AS a
inner join WORK.INPUT AS b
on a.CLAIMID = b.CLAIMID
WHERE
a.SITEID = '0001'
Using the In() without sub-querying is definitely faster, but other performance consideration to keep in mind is the network and compute server load/traffic at the time of running; assuming you are running on a client / server configuration.
If you plan to use the SQL select into macro variable solution; keep in mind the count of distinct values and the length of the string you are saving in the macro as there is a size limit.
You can also save the In() values in a table and just do a join.
PROC SQL;
/*CLAIM ID Table*/
CREATE TABLE WORK.OUTPUT1 AS
SELECT
"CLAIM" AS SOURCE,
a.CLAIMID,
a.DXCODE
FROM
DW.CLAIMS_BAV AS a
WHERE
a.SITEID = '0001';
/*ID Lookup Table*/
CREATE TABLE WORK.OUTPUT2 AS
SELECT
DISTINCT b.CLAIMID FROM WORK.INPUT AS b
;
/*Inner Join Table / AKA lookup join*/
CREATE TABLE WORK.Final AS
SELECT
a.SOURCE, a.CLAIMID, a.DXCODE
FROM WORK.OUTPUT1 AS a INNER JOIN WORK.OUTPUT2 AS b
ON a.CLAIMID = b.CLAIMID
;
QUIT;
I'm trying to get all of all of a series of variables while pulling off of the most recent possible update date (PD_LAST_UPDATE) some fields were updated yesterday, some fields might have been a year ago, so I can't just do PD_LAST_UPDATE = (variable encoded to a specific time) and if I do any set time I'll get way too much data.
Here's my code
(SELECT N1.PD_PROP_NUM, N1.PD_START_DATE, N1.PD_END_DATE, N1.PD_DOW_FREQ,
N1.PD_RATE_PGM, N1.PD_ROOM_POOL, N1.PD_QUOTE_SERIES,
N1.PD_RPGM_SEQ_NUM, N1.PD_LAST_UPDATE
FROM OMP.OMT_PR_SSTRAT_DTL N1
INNER JOIN OMP.OMT_PROP_SSTRAT AS N2 ON (N1.PD_PROP_NUM=N2.PS_PROP_NUM AND
N1.PD_START_DATE=N2.PS_START_DATE AND
N1.PD_DOW_FREQ=N2.PS_DOW_FREQ AND
N1.PD_ROOM_POOL=N2.PS_ROOM_POOL)
WHERE N2.PS_PROP_NUM in (11612) AND **n1.PD_LAST_UPDATE = (MAX)**
);
quit;
The portion of particular interest is bolded, and the prop num ahead of it will be done away with once I can figure out how to select the max value so I can pull down all prob nums. Thanks in advance.
You have two ways to filter on the max value of a variable.
One is to group by everything you want to calculate the maximum by, and then use having (which is the after-group-by version of where), like so:
proc sql;
select origin, make, model
from sashelp.cars
group by origin, make
having mpg_city = max(mpg_city);
quit;
This is allowed in SAS, but not in most other SQL flavors. It's a shortcut to the other method below, largely, and it only works in some particular data structures.
The more traditional approach, then, is to do a correlated subquery:
proc sql;
select origin, make, model
from sashelp.cars C
where mpg_city = (
select max(mpg_city)
from sashelp.cars R
where C.origin=R.origin
and C.make=R.make
group by make, origin
);
quit;
In this case, we're getting to the same place, and more or less getting there the same way - SAS does this on the back end anyway.
In the case of a join, you can either perform this subquery or similar on the dataset prior to the join (or in a subquery whose result is then joined), or you can do so on the result of the join, depending on which is more efficient and whether you need rows from both tables to determine the maximum value.
I have to join 2 tables on a key (say XYZ). I have to update one single column in table A using a coalesce function. Coalesce(a.status_cd, b.status_cd).
TABLE A:
contains some 100 columns. KEY Columns ABC.
TABLE B:
Contains just 2 columns. KEY Column ABC and status_cd
TABLE A, which I use in this left join query is having more than 100 columns. Is there a way to use a.* followed by this coalesce function in my PROC SQL without creating a new column from the PROC SQL; CREATE TABLE AS ... step?
Thanks in advance.
You can take advantage of dataset options to make it so you can use wildcards in the select statement. Note that the order of the columns could change doing this.
proc sql ;
create table want as
select a.*
, coalesce(a.old_status,b.status_cd) as status_cd
from tableA(rename=(status_cd=old_status)) a
left join tableB b
on a.abc = b.abc
;
quit;
I eventually found a fairly simple way of doing this in proc sql after working through several more complex approaches:
proc sql noprint;
update master a
set status_cd= coalesce(status_cd,
(select status_cd
from transaction b
where a.key= b.key))
where exists (select 1
from transaction b
where a.ABC = b.ABC);
quit;
This will update just the one column you're interested in and will only update it for rows with key values that match in the transaction dataset.
Earlier attempts:
The most obvious bit of more general SQL syntax would seem to be the update...set...from...where pattern as used in the top few answers to this question. However, this syntax is not currently supported - the documentation for the SQL update statement only allows for a where clause, not a from clause.
If you are running a pass-through query to another database that does support this syntax, it might still be a viable option.
Alternatively, there is a way to do this within SAS via a data step, provided that the master dataset is indexed on your key variable:
/*Create indexed master dataset with some missing values*/
data master(index = (name));
set sashelp.class;
if _n_ <= 5 then call missing(weight);
run;
/*Create transaction dataset with some missing values*/
data transaction;
set sashelp.class(obs = 10 keep = name weight);
if _n_ > 5 then call missing(weight);
run;
data master;
set transaction;
t_weight = weight;
modify master key = name;
if _IORC_ = 0 then do;
weight = coalesce(weight, t_weight);
replace;
end;
/*Suppress log messages if there are key values in transaction but not master*/
else _ERROR_ = 0;
run;
A standard warning relating to the the modify statement: if this data step is interrupted then the master dataset may be irreparably damaged, so make sure you have a backup first.
In this case I've assumed that the key variable is unique - a slightly more complex data step is needed if it isn't.
Another way to work around the lack of a from clause in the proc sql update statement would be to set up a format merge, e.g.
data v_format_def /view = v_format_def;
set transaction(rename = (name = start weight = label));
retain fmtname 'key' type 'i';
end = start;
run;
proc format cntlin = v_format_def; run;
proc sql noprint;
update master
set weight = coalesce(weight,input(name,key.))
where master.name in (select name from transaction);
run;
In this scenario I've used type = 'i' in the format definition to create a numeric informat, which proc sql uses convert the character variable name to the numeric variable weight. Depending on whether your key and status_cd columns are character or numeric you may need to do this slightly differently.
This approach effectively loads the entire transaction dataset into memory when using the format, which might be a problem if you have a very large transaction dataset. The data step approach should hardly use any memory as it only has to load 1 row at a time.
I have a table with postings by category (a number) that I transposed. I got a table with each column name as _number for example _16, _881, _853 etc. (they aren't in order).
I need to do the sum of all of them in a proc sql, but I don't want to create the variable in a data step, and I don't want to write all of the columns names either . I tried this but doesn't work:
proc sql;
select sum(_815-_16) as nnl
from craw.xxxx;
quit;
I tried going to the first number to the last and also from the number corresponding to the first place to the one corresponding to the last place. Gives me a number that it's not correct.
Any ideas?
Thanks!
You can't use variable lists in SQL, so _: and var1-var6 and var1--var8 don't work.
The easiest way to do this is a data step view.
proc sort data=sashelp.class out=class;
by sex;
run;
*Make transposed dataset with similar looking names;
proc transpose data=class out=transposed;
by sex;
id height;
var height;
run;
*Make view;
data transpose_forsql/view=transpose_forsql;
set transposed;
sumvar = sum(of _:); *I confirmed this does not include _N_ for some reason - not sure why!;
run;
proc sql;
select sum(sumvar) from transpose_Forsql;
quit;
I have no documentation to support this but from my experience, I believe SAS will assume that any sum() statement in SQL is the sql-aggregate statement, unless it has reason to believe otherwise.
The only way I can see for SAS to differentiate between the two is by the way arguments are passed into it. In the below example you can see that the internal sum() function has 3 arguments being passed in so SAS will treat this as the SAS sum() function (as the sql-aggregate statement only allows for a single argument). The result of the SAS function is then passed in as the single parameter to the sql-aggregate sum function:
proc sql noprint;
create table test as
select sex,
sum(sum(height,weight,0)) as sum_height_and_weight
from sashelp.class
group by 1
;
quit;
Result:
proc print data=test;
run;
sum_height_
Obs Sex and_weight
1 F 1356.3
2 M 1728.6
Also note a trick I've used in the code by passing in 0 to the SAS function - this is an easy way to add an additional parameter without changing the intended result. Depending on your data, you may want to swap out the 0 for a null value (ie. .).
EDIT: To address the issue of unknown column names, you can create a macro variable that contains the list of column names you want to sum together:
proc sql noprint;
select name into :varlist separated by ','
from sashelp.vcolumn
where libname='SASHELP'
and memname='CLASS'
and upcase(name) like '%T' /* MATCHES HEIGHT AND WEIGHT */
;
quit;
%put &varlist;
Result:
Height,Weight
Note that you would need to change the above wildcard to match your scenario - ie. matching fields that begin with an underscore, instead of fields that end with the letter T. So your final SQL statement will look something like this:
proc sql noprint;
create table test as
select sex,
sum(sum(&varlist,0)) as sum_of_fields_ending_with_t
from sashelp.class
group by 1
;
quit;
This provides an alternate approach to Joe's answer - though I believe using the view as he suggests is a cleaner way to go.