I'm trying to get all of a series of variables while pulling the most recent update date (PD_LAST_UPDATE). Some fields were updated yesterday, some might have been updated a year ago, so I can't just do PD_LAST_UPDATE = (a variable encoded to a specific time), and if I use any fixed time I'll get way too much data.
Here's my code:
(SELECT N1.PD_PROP_NUM, N1.PD_START_DATE, N1.PD_END_DATE, N1.PD_DOW_FREQ,
N1.PD_RATE_PGM, N1.PD_ROOM_POOL, N1.PD_QUOTE_SERIES,
N1.PD_RPGM_SEQ_NUM, N1.PD_LAST_UPDATE
FROM OMP.OMT_PR_SSTRAT_DTL N1
INNER JOIN OMP.OMT_PROP_SSTRAT AS N2 ON (N1.PD_PROP_NUM=N2.PS_PROP_NUM AND
N1.PD_START_DATE=N2.PS_START_DATE AND
N1.PD_DOW_FREQ=N2.PS_DOW_FREQ AND
N1.PD_ROOM_POOL=N2.PS_ROOM_POOL)
WHERE N2.PS_PROP_NUM in (11612) AND **n1.PD_LAST_UPDATE = (MAX)**
);
quit;
The portion of particular interest is bolded, and the prop num ahead of it will be done away with once I can figure out how to select the max value so I can pull down all prop nums. Thanks in advance.
You have two ways to filter on the max value of a variable.
One is to group by everything you want to calculate the maximum by, and then use having (which is the after-group-by version of where), like so:
proc sql;
select origin, make, model
from sashelp.cars
group by origin, make
having mpg_city = max(mpg_city);
quit;
This is allowed in SAS, but not in most other SQL flavors. It's largely a shortcut for the more traditional method below, and it only works well for some particular data structures.
The more traditional approach, then, is to do a correlated subquery:
proc sql;
select origin, make, model
from sashelp.cars C
where mpg_city = (
select max(mpg_city)
from sashelp.cars R
where C.origin=R.origin
and C.make=R.make
);
quit;
In this case, we're getting to the same place, and more or less getting there the same way - SAS does this on the back end anyway.
In the case of a join, you can either perform this subquery or similar on the dataset prior to the join (or in a subquery whose result is then joined), or you can do so on the result of the join, depending on which is more efficient and whether you need rows from both tables to determine the maximum value.
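Applied to your join, a sketch could look like the following. I'm assuming you want the most recent PD_LAST_UPDATE per property (PD_PROP_NUM); if the latest update should be determined within a finer grouping, add those columns to the subquery's correlation:
proc sql;
    select N1.PD_PROP_NUM, N1.PD_START_DATE, N1.PD_END_DATE, N1.PD_DOW_FREQ,
           N1.PD_RATE_PGM, N1.PD_ROOM_POOL, N1.PD_QUOTE_SERIES,
           N1.PD_RPGM_SEQ_NUM, N1.PD_LAST_UPDATE
    from OMP.OMT_PR_SSTRAT_DTL N1
    inner join OMP.OMT_PROP_SSTRAT N2
        on  N1.PD_PROP_NUM   = N2.PS_PROP_NUM
        and N1.PD_START_DATE = N2.PS_START_DATE
        and N1.PD_DOW_FREQ   = N2.PS_DOW_FREQ
        and N1.PD_ROOM_POOL  = N2.PS_ROOM_POOL
    where N1.PD_LAST_UPDATE = (
            /* correlated subquery: latest update for this property */
            select max(D2.PD_LAST_UPDATE)
            from OMP.OMT_PR_SSTRAT_DTL D2
            where D2.PD_PROP_NUM = N1.PD_PROP_NUM
          );
quit;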
I'm very new to SAS and trying to learn it. I have a problem statement where I need to extract two files from a location and then perform joins. Below is a detailed explanation of what I'm trying to achieve in a single proc sql statement:
There are two tables, table a (columns - account#, sales, transaction, store#) and table b (columns - account#, account zipcode) and an excel file (columns - store# and store zipcode). I need to first join these two tables on column account#.
The next step is to join the resulting values with the excel file on column store# and also add a column called 'distance', which calculates the distance between account zipcode and store zipcode using the zipcitydistance(account zipcode, store zipcode) function. Let the resulting table be called "F".
Next I want to use a CASE statement to create a column of distance buckets based on the distance from the above query, e.g.:
select
case
when distance<=5 then "<=5"
when distance between 5 and 10 then "5-10"
when distance between 10 and 15 then "10-15"
else ">=15"
end as distance_bucket,
sum(transactions) as total_txn,
sum(sales) as total_sales
from F
group by 1
So far, below is the code that I have written:
data table_a;
set xyzstore.filea;
run;
data table_b;
set xyzstore.fileb;
run;
proc import datafile="/location/file.xlsx"
out=filec dbms=xlsx replace;
run;
proc sql;
create table d as
select a.store_number, b.account_number, sum(a.sales) as sales, sum(a.transactions) as transactions, b.account_zipcode
from table_a as a left join table_b as b
on a.account_number=b.account_number
group by a.store_number, b.account_number, b.account_zipcode;
quit;
proc sql;
create table e as
select d.*, c.store_zipcode, zipcitydistance(d.account_zipcode, c.store_zipcode) as distance
from d inner join filec as c
on d.store_number=c.store_number;
quit;
proc sql;
create table final as
select
case
when distance<=5 then "<=5"
when distance between 5 and 10 then "5-10"
when distance between 10 and 15 then "10-15"
else ">=15"
end as distance_bucket,
sum(transactions) as total_txn,
sum(sales) as total_sales
from e
group by 1;
quit;
How can I write the above lines of code in a single proc sql statement?
The way you are currently doing it is more readable and the preferred way to do it. Turning it into a single SQL statement will not yield any significant performance gains and will make it harder to troubleshoot in the future.
To do a little cleanup, you can remove the two data step set statements and join directly on those files themselves:
create table d as
...
from xyzstore.filea as a left join xyzstore.fileb as b
...
quit;
You could also use a format instead to clean up the CASE statement.
proc format;
value storedistance
low - 5 = '<=5'
5< - 10 = '5-10'
10< - 15 = '10-15'
15< - high = '>=15'
other = ' '
;
run;
...
proc sql;
create table final as
select put(distance, storedistance.) as distance_bucket
, sum(transactions) as total_txn
, sum(sales) as total_sales
from e
group by calculated distance_bucket
;
quit;
If you did want to turn your existing code into one big SQL statement, it would look like this:
proc sql;
create table final as
select CASE
when(distance <= 5) then '<=5'
when(distance between 5 and 10) then '5-10'
when(distance between 10 and 15) then '10-15'
else '>=15'
END as distance_bucket
, sum(transactions) as total_txn
, sum(sales) as total_sales
/* Join 'table d' with c */
from (select d.*
, c.store_zipcode
, zipcitydistance(d.account_zipcode, c.store_zipcode) as distance
/* Create 'table d' */
from (select a.store_number
, b.account_number
, sum(a.sales) as sales
, sum(a.transactions) as transactions
, b.account_zipcode
from xyzstore.filea as a
LEFT JOIN
xyzstore.fileb as b
ON a.account_number = b.account_number
group by a.store_number
, b.account_number
, b.account_zipcode
) as d
INNER JOIN
filec as c
ON d.store_number = c.store_number
)
group by calculated distance_bucket
;
quit;
While more compact, it is more difficult to troubleshoot. You lose those in-between steps that can identify if there's an issue with the data. Suppose the store distances look incorrect one day: you'd need to unpack all of those SQL statements, put them into individual PROC SQL blocks and run them. Every time you run into a problem you will need to do this. If you have them separated out, you'll use a negligible amount of temporary disk space and have a much easier time troubleshooting.
When dealing with raw data, especially data that updates regularly, assume something will go wrong one day and you'll need to review it in-depth. Sometimes the wrong file gets sent. Sometimes an upstream issue occurs that sends corrupted data. Any time that happens, you'll need to dig in and find out if it's a problem with your process or their process. Making easy-to-troubleshoot code will speed up the solution for everyone.
How do I start this?
I have two data sets.
For the output you will deliver:
It should be in Excel or XML format
Each query logic/programmed check should be on its own tab
Columns should be
Subject #,
Visit Date (You will need the Visit Date Listing also attached)
Visit Name (Visit date from the file_34422 must match Visit name in the Blood Pressure File)
Date of Assessment (From the BP Log), VSBPDT_RAW, VSTPT, BP results.
Columns for SYSBP1, SYSBP2, SYSBP3, DIABP1, DIABP2, DIABP3
Findings/query text.
Below are the specifications for BP:
For same SUBJECT and same FOLDERNAME, where VSTPT is Blood Pressure 1.
if VSBPYN is No, then all must be null or =0 (VSBPDT_RAW, VSBPTM1, SYSBP1, DIABP1, VSBPND2, VSBPTM2, SYSBP2, DIABP2, VSBPND3, VSBPTM3, SYSBP3, DIABP3)
This is what I have started with:
proc sql;
select
f.subject,
f.SVSTDT_RAW, f.FolderName,
b.FolderName,
VSBPDT_RAW, VSTPT,
SYSBP1, SYSBP2, SYSBP3,
DIABP1, DIABP2, DIABP3
FROM first_data as f, bp_data as b
group by subject, foldername
where f.subject = b.subject
having VSTPT is Blood Pressure set 1,
VSBPYN is No;
quit;
I just need to be pointed in the right direction. I know this can't be right.
I do not know the exact structure of your data, so you may need to modify the solution below to select the right columns.
From the description, this looks like a good situation for SQL plus a data step. You have a lot of columns to merge with the bp table, and it is easy to merge all of those columns with first_data in SQL.
When you have lots of by-row conditionals, a data step is easier to work with and read than many CASE statements in SQL, so we'll use a two-stage approach: SQL for the merge, then a data step for the conditional logic.
Step 1: Merge the data
proc sql noprint;
create table stage as
select t1.*
, t2.VSBPYN
from bp_data as t1
INNER JOIN
first_data as t2
ON t1.subject = t2.subject
AND t1.foldername = t2.foldername
where t1.VSTPT = 1
;
quit;
Step 2: Conditionally set values to missing
Next, we'll do a data step for our conditional logic. call missing() is a useful function that will let you set the value of many variables to missing all in a single statement.
data want;
set stage;
if(upcase(VSBPYN) = 'NO') then call missing(VSBPDT_RAW, VSBPTM1, SYSBP1, DIABP1,
VSBPND2, VSBPTM2, SYSBP2, DIABP2,
VSBPND3, VSBPTM3, SYSBP3, DIABP3
);
run;
Step 3: Output to Excel
Finally, we send the output to Excel.
proc export
data=want
outfile='/my/location/want.xlsx'
dbms=xlsx
replace;
run;
I have a question about the following two pieces of code in SAS PROC SQL.
Code 1: (Standard Book version)
CREATE TABLE WORK.OUTPUT AS
SELECT
"CLAIM" AS SOURCE,
a.CLAIMID,
a.DXCODE
FROM
DW.CLAIMS_BAV AS a
WHERE
a.SITEID = '0001'
AND a.CLAIMID IN (SELECT CLAIMID FROM WORK.INPUT)
Code 2: (The much faster way in practice)
CREATE TABLE WORK.OUTPUT AS
SELECT
"CLAIM" AS SOURCE,
a.CLAIMID,
a.DXCODE
FROM
DW.CLAIMS_BAV AS a
WHERE
a.SITEID = '0001'
AND a.CLAIMID IN ('10001', '10002', '10003', ... '15000')
When I try to do it more elegantly by using a subquery, as in Code 1, the run time blows up to 50+ minutes. But the same input returns within 3 minutes using Code 2. Why is that? Note that it's just as slow using an INNER JOIN too. The input is 5000+ CLAIMIDs, which I manually paste into the IN('...') block every day.
PS: The CLAIMID are made up, in real life they are random.
The CLAIMID are indexed in DW.CLAIMS. I am using SAS PROC SQL to access an Oracle database. What is going on, and is there a better way? Thanks!
I don't know that I can tell you why SAS is so slow at the first select; something's not optimized in that scenario clearly.
If I had to guess, I'd guess that SAS is deciding in the first case that it can't use pass-through SQL and so it's downloading the whole big table and then running this SAS-side, while in the second case it's passing the query up to the SQL database and only transporting the resulting rows back.
But there are several ways to work around this, anyway. Here's one: use a macro variable to do precisely the pasting you're doing!
proc sql;
select quote(strip(claimid)) into :claimlist separated by ','
from work.input
;
CREATE TABLE WORK.OUTPUT AS
SELECT
"CLAIM" AS SOURCE,
a.CLAIMID,
a.DXCODE
FROM
DW.CLAIMS_BAV AS a
WHERE
a.SITEID = '0001'
AND a.CLAIMID IN (&claimlist.)
;
quit;
Tada, you don't have to touch this anymore, and it's identical to the copy/paste that you did.
A few extra notes given some comments:
If CLAIMID is ever shorter than 15 characters, you may have space padding, so I added strip to remove it. The padding doesn't matter for string comparisons, except insomuch as you might run out of macro variable space, and I worry that some DBMSs may actually care about the padding. You can leave out strip if CLAIMID is always exactly 15 characters.
Macro variables hold up to 64K characters. A 15-character value plus two quotes plus a comma is 18 characters per ID, so you have room for a bit over 3,500 values. That's under 5,000, unfortunately.
In that case, you can either split the list into two macro variables (easy enough: use obs= and firstobs= to read each half of work.input) or try one of these other solutions:
Transfer the work.input dataset into the DW libname, then do the join in SQL there (see the sketch just after this list).
Put the contents of the claimID into a file instead of into a macro variable, and then %include that file.
Use call execute to execute the whole proc SQL.
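For that first option, here's a minimal sketch. It assumes your DW libname points at the Oracle schema and that you have rights to create tables there; the temporary table name claim_input is made up.
data DW.claim_input;    /* copy the ID list into Oracle so the join runs in-database */
    set work.input;
run;

proc sql;
    create table work.output as
    select "CLAIM" as SOURCE, a.CLAIMID, a.DXCODE
    from DW.CLAIMS_BAV as a
    inner join DW.claim_input as b
        on a.CLAIMID = b.CLAIMID
    where a.SITEID = '0001';
quit;

proc delete data=DW.claim_input;    /* drop the temporary table when done */
run;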
Here's one example of CALL EXECUTE.
data _null_;
set work.input end=eof;
if _n_=1 then do;
call execute('CREATE TABLE WORK.OUTPUT AS
SELECT
"CLAIM" AS SOURCE,
a.CLAIMID,
a.DXCODE
FROM
DW.CLAIMS_BAV AS a
WHERE
a.SITEID = "0001"
AND a.CLAIMID IN ('); *the part of the SQL query before the list of IDs;
end;
call execute(quote(claimID) || ' ');
if EOF then do;
call execute('); QUIT;'); *the part of the SQL query after the list of IDs;
end;
run;
This would be nearly identical to the %INCLUDE solution really, except there you put that stuff to a text file instead of CALL EXECUTEing it, and then you %INCLUDE that text file.
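For completeness, here's a sketch of that %INCLUDE variant, again assuming the same tables; the fileref name is arbitrary. The data step writes the entire PROC SQL step to a temporary text file, which is then %INCLUDEd:
filename sqlcode temp;

data _null_;
    set work.input end=eof;
    file sqlcode;
    if _n_=1 then put
        'proc sql;'
      / 'create table work.output as'
      / 'select "CLAIM" as SOURCE, a.CLAIMID, a.DXCODE'
      / 'from DW.CLAIMS_BAV as a'
      / "where a.SITEID = '0001'"
      / 'and a.CLAIMID in (';
    put claimID $quote17.;    /* one quoted ID per line; blanks delimit the IN list */
    if eof then put '); quit;';
run;

%include sqlcode;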
I think you're working with both local data and data on your server. When SAS works with data from different sources (databases), it brings it all into SAS for processing, which can be really, really slow.
Instead, you can make a macro variable and use that within your query. If it's 5,000 IDs, the list should fit into one macro variable, assuming each entry (value, quotes, and separator) is around 13 characters or less; the macro variable size limit is 64K characters, so it depends on the length of the values. If not, you could generate the code with a macro instead.
proc sql noprint;
select quote(claimID, "'") into : claim_list separated by ", " from input;
quit;
proc sql;
CREATE TABLE WORK.OUTPUT AS
SELECT
"CLAIM" AS SOURCE,
a.CLAIMID,
a.DXCODE
FROM DW.CLAIMS_BAV AS a
WHERE
a.SITEID = '0001'
AND a.CLAIMID IN (&claim_list.);
quit;
Please be sure to use
option sastrace=',,,ds' sastraceloc=saslog nostsuffix;
to receive information on how your code is translated into DBMS statements by the SAS/ACCESS engine.
In order to give SAS a hint to dynamically build an IN (1,2,3,...) clause from your IN (SELECT ...) subquery,
add MULTI_DATASRC_OPT=IN_CLAUSE to your libname DW ... statement and
add the dbmaster dataset option to the "master" table,
like one of the following queries:
CREATE TABLE WORK.OUTPUT AS
SELECT
"CLAIM" AS SOURCE,
a.CLAIMID,
a.DXCODE
FROM
DW.CLAIMS_BAV (dbmaster=yes) AS a
WHERE
a.SITEID = '0001'
AND a.CLAIMID IN (SELECT CLAIMID FROM WORK.INPUT)
or
CREATE TABLE WORK.OUTPUT AS
SELECT
"CLAIM" AS SOURCE,
a.CLAIMID,
a.DXCODE
FROM
DW.CLAIMS_BAV (dbmaster=yes) AS a
inner join WORK.INPUT AS b
on a.CLAIMID = b.CLAIMID
WHERE
a.SITEID = '0001'
Using IN() without a subquery is definitely faster, but another performance consideration to keep in mind is the network and compute-server load/traffic at the time of running, assuming you are running on a client/server configuration.
If you plan to use the select-into-macro-variable solution, keep in mind the count of distinct values and the length of the string you are saving into the macro variable, as there is a size limit.
You can also save the In() values in a table and just do a join.
PROC SQL;
/*CLAIM ID Table*/
CREATE TABLE WORK.OUTPUT1 AS
SELECT
"CLAIM" AS SOURCE,
a.CLAIMID,
a.DXCODE
FROM
DW.CLAIMS_BAV AS a
WHERE
a.SITEID = '0001';
/*ID Lookup Table*/
CREATE TABLE WORK.OUTPUT2 AS
SELECT
DISTINCT b.CLAIMID FROM WORK.INPUT AS b
;
/*Inner Join Table / AKA lookup join*/
CREATE TABLE WORK.Final AS
SELECT
a.SOURCE, a.CLAIMID, a.DXCODE
FROM WORK.OUTPUT1 AS a INNER JOIN WORK.OUTPUT2 AS b
ON a.CLAIMID = b.CLAIMID
;
QUIT;
Thanks for the feedback guys, but I have to rewrite the question to make it more clear.
Say, we have a table:
[table image: FP_NDT dates by Number and month]
What I am trying to get from this table is a list of numbers whose FP_NDT dates match my condition. For example, I want a list of numbers that have FP_NDT non-missing for 2014 and 2015 and missing for 2011, 2012 and 2013 (regardless of month). With this condition I should get only Number 4. Is it possible to do this from this table?
PS: If I write a simple sql select statement and put a condition like
where year(FP_NDT) in (2014,2015)
it would also give me numbers 2 and 3...
Why not first summarize the data?
proc sql;
create table XX as
select number
, max(year(fp_ndt)=2011) as yr2011
, max(year(fp_ndt)=2012) as yr2012
, max(year(fp_ndt)=2013) as yr2013
, max(year(fp_ndt)=2014) as yr2014
, max(year(fp_ndt)=2015) as yr2015
from table1
group by number
;
Now it is easy to make your tests.
select * from XX
where yr2014+yr2015=2 and yr2011+yr2012+yr2013=0
;
You could use the first query as a sub-query instead of creating a physical table.
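For instance, a sketch with the summary folded in as a sub-query (same table1 and conditions as above):
proc sql;
    select number
    from (select number
               , max(year(fp_ndt)=2011) as yr2011
               , max(year(fp_ndt)=2012) as yr2012
               , max(year(fp_ndt)=2013) as yr2013
               , max(year(fp_ndt)=2014) as yr2014
               , max(year(fp_ndt)=2015) as yr2015
          from table1
          group by number)
    where yr2014+yr2015=2 and yr2011+yr2012+yr2013=0
    ;
quit;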
So you want names associated with 1, 2 and 3, but in different rows.
You can group the rows by name and count the associated numbers like this:
PROC SQL;
CREATE TABLE xxx AS SELECT
name,
SUM(number=1) AS count1,
SUM(number=2) AS count2,
SUM(number=3) AS count3
FROM test GROUP BY name;
QUIT;
Then you can filter the results based on count1-count3, i.e. (count1>0 AND count2>0 AND count3>0).
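If you'd rather skip the intermediate table, the same test can be folded into a HAVING clause; a sketch, assuming the same test table and columns:
PROC SQL;
    SELECT name
    FROM test
    GROUP BY name
    HAVING SUM(number=1) > 0
       AND SUM(number=2) > 0
       AND SUM(number=3) > 0;
QUIT;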
Try this:
proc sql;
select *
from work.test
group by name having nmiss(number)=0;
quit;
I have found one workaround, which is to create a separate data set for each year and then inner join them with where conditions for missing and non-missing values in the needed years. However, it becomes a bit cumbersome when it comes to 60 months, for instance...
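One way to keep the summarize approach from the answer above manageable at 60 months is to generate the flag columns with a macro loop. This is only a sketch, untested against your data, and it assumes the same table1 and fp_ndt names as above:
%macro month_flags(start_yr, end_yr);
    proc sql;
        create table flags as
        select number
    %do y = &start_yr %to &end_yr;
        %do m = 1 %to 12;
            , max(year(fp_ndt)=&y and month(fp_ndt)=&m) as yr&y._m&m
        %end;
    %end;
        from table1
        group by number;
    quit;
%mend month_flags;

%month_flags(2011, 2015)
You can then test any combination of the yrYYYY_mM flags in a where clause, just as with yr2014+yr2015=2 above.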
Suppose I'm subsetting a table and summarizing it in PROC SQL. The code uses a where ... in clause and a subquery to do the subsetting. I know that some SQL engines set a limit on the number of arguments to the where ... in clause. Does SAS have a limit on this? This question applies to a program like this:
proc sql;
create table want as
select
ID,
sum(var1) as var1,
sum(var2) as var2,
sum(var3) as var3
from largetable
where ID in (select ID from longlist)
group by ID;
quit;
What if longlist returns 10,000 IDs? How about 10,000,000?
I'm not aware of any explicit limit on this. SAS's SQL parser often seems to convert these to JOINs when the values come from a table rather than being explicitly listed; that means there are some limitations, but not particularly small ones.
I do believe there is a limit to the length of a SQL statement in total, so if you were trying to include an extremely long list in text you might run into problems, but in the example above I don't see a problem with 10,000,000 IDs. I just tested it with 250,000,000 IDs in the longlist table, and SAS had no problem with it:
data largetable;
do id=1 to 1e8;
if mod(id,7)=0 then output;
end;
run;
data ids;
do id = 1 to 1e9;
if mod(id,4)=0 then output;
end;
run;
proc sql _method;
create table want as
select
ID
from largetable
where ID in (select ID from IDs)
group by ID;
quit;
Interestingly, adding _method indicates it does not do this as a join, but as a subquery. I'm not sure why, at least in this case; everything I've been told says that it should convert this to a join implicitly.
As Joe has said, there should probably be no problems with any reasonable number of rows in the longlist table. However, although this may be readable, a join may perform better.
Do you have a strong preference for running the query as written rather than doing a left join, e.g.
proc sql;
create table want as
select
b.ID,
sum(b.var1) as var1,
sum(b.var2) as var2,
sum(b.var3) as var3
from longlist a left join largetable b
on a.ID = b.ID
group by b.ID;
quit;
Elaborating a bit on entering a long list as text - I'm not aware of any limit on the length of any one statement in SAS, but there are various limits on the length of individual lines of code, depending on your version and how you're submitting it. I suspect it's possible to split a long statement over several lines each approaching the maximum allowed length.