Based on the data below, could anyone help me write a program that keeps the
test_date closest to delivery_date?
delivery_date 11/16/2011
test_date 21/nov/2011 10/nov/2011 5/oct/2010
Thanks in advance
Can't really answer properly without more information, but if you have two datasets and one row per date, and actual date variables, then the solution is something like this:
proc sql;
create table finaldsn as
select a.*
, b.*
, a.delivery_date-b.test_date as days
, abs(calculated days) as absdays
, min(calculated absdays)as close
from dsnA as a
full join
dsnB as b
on a.subject=b.id
where a.delivery_date ne . and b.test_date ne .
and b.id in (select distinct subject
from dsnA)
group by a.subject, a.delivery_date
having calculated absdays=calculated close
;
quit;
This is from the following paper: http://www.lexjansen.com/pharmasug/2003/coderscorner/cc001.pdf which also presents a few other solutions.
You can use the SAS function INTCK to calculate the difference between each test date and your target date, then keep the date whose absolute difference is closest to 0.
Here is a link to the documentation so you can quickly pick up this function:
http://support.sas.com/documentation/cdl/en/lrdict/64316/HTML/default/viewer.htm#a000212868.htm
Also, feel free to check out the other functions listed on the left side of that page, as they can come in handy in the long run.
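As a rough sketch of the INTCK approach, the step below uses the dates from the question. The data set names (delivery, tests) and the subject key are assumptions made for illustration; adjust them to your own data. For true SAS date values, intck('day', a, b) gives the same result as b - a.

```sas
/* Hypothetical sample data built from the dates in the question */
data delivery;
    input subject delivery_date :mmddyy10.;
    format delivery_date date9.;
    datalines;
1 11/16/2011
;
run;

data tests;
    input subject test_date :date11.;
    format test_date date9.;
    datalines;
1 21nov2011
1 10nov2011
1 05oct2010
;
run;

proc sql;
    create table closest as
    select d.subject
         , d.delivery_date
         , t.test_date
           /* number of days between the two dates, sign removed */
         , abs(intck('day', d.delivery_date, t.test_date)) as days_apart
    from delivery as d
         inner join tests as t
         on d.subject = t.subject
    group by d.subject, d.delivery_date
    /* keep only the row(s) with the smallest gap per delivery date */
    having calculated days_apart = min(calculated days_apart);
quit;
```

With this sample data the row kept is 21NOV2011, which is 5 days from the delivery date (10NOV2011 is 6 days away). If two test dates are equally close, both rows are kept.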
I'm very new to SAS and trying to learn it. I have a problem statement where I need to extract two files from a location and then perform joins. Below is a detailed explanation of what I'm trying to achieve in a single proc sql statement:
There are two tables, table a (columns - account#, sales, transaction, store#) and table b (columns - account#, account zipcode) and an excel file (columns - store# and store zipcode). I need to first join these two tables on column account#.
The next step is to join the result with the excel file on column store# and also add a column called 'distance', which calculates the distance between account zipcode and store zipcode with the help of the zipcitydistance(account zipcode, store zipcode) function. Let the resulting table be called "F".
Next I want to use a case expression to create a distance bucket column based on the distance from the query above, for example:
select
case
when distance<=5 then "<=5"
when distance between 5 and 10 then "5-10"
when distance between 10 and 15 then "10-15"
else ">=15"
end as distance_bucket,
sum(transactions) as total_txn,
sum(sales) as total_sales
from F
group by 1
So far, below is the code that I have written:
data table_a;
set xyzstore.filea;
run;
data table_b;
set xyzstore.fileb;
run;
proc import datafile="/location/file.xlsx"
out=filec dbms=xlsx replace;
run;
proc sql;
create table d as
select a.store_number, b.account_number, sum(a.sales) as sales, sum(a.transactions) as transactions, b.account_zipcode
from table_a as a left join table_b as b
on a.account_number=b.account_number
group by a.store_number, b.account_number, b.account_zipcode;
quit;
proc sql;
create table e as
select d.*, c.store_zipcode, zipcitydistance(d.account_zipcode, c.store_zipcode) as distance
from d inner join filec as c
on d.store_number=c.store_number;
quit;
proc sql;
create table final as
select
case
when distance<=5 then "<=5"
when distance between 5 and 10 then "5-10"
when distance between 10 and 15 then "10-15"
else ">=15"
end as distance_bucket,
sum(transactions) as total_txn,
sum(sales) as total_sales
from e
group by 1;
quit;
How can I write the above lines of code in a single proc sql statement?
The way you are currently doing it is more readable and the preferred way to do it. Turning it into a single SQL statement will not yield any significant performance gains and will make it harder to troubleshoot in the future.
To do a little cleanup, you can remove the two data step set statements and join directly on those files themselves:
proc sql;
create table d as
...
from xyzstore.filea as a left join xyzstore.fileb as b
...
quit;
You could also use a format instead to clean up the CASE statement.
proc format;
value storedistance
low - 5 = '<=5'
5< - 10 = '5-10'
10< - 15 = '10-15'
15 - high = '>=15'
other = ' '
;
run;
...
proc sql;
create table final as
select put(distance, storedistance.) as distance_bucket
, sum(transactions) as total_txn
, sum(sales) as total_sales
from e
group by calculated distance_bucket
;
quit;
If you did want to turn your existing code into one big SQL statement, it would look like this:
proc sql;
create table final as
select CASE
when(distance <= 5) then '<=5'
when(distance between 5 and 10) then '5-10'
when(distance between 10 and 15) then '10-15'
else '>=15'
END as distance_bucket
, sum(transactions) as total_txn
, sum(sales) as total_sales
/* Join 'table d' with c */
from (select d.*
, c.store_zipcode
, zipcitydistance(d.account_zipcode, c.store_zipcode) as distance
/* Create 'table d' */
from (select a.store_number
, b.account_number
, sum(a.sales) as sales
, sum(a.transactions) as transactions
, b.account_zipcode
from xyzstore.filea as a
LEFT JOIN
xyzstore.fileb as b
ON a.account_number = b.account_number
group by a.store_number
, b.account_number
, b.account_zipcode
) as d
INNER JOIN
filec as c
ON d.store_number = c.store_number
)
group by calculated distance_bucket
;
quit;
While more compact, it is more difficult to troubleshoot. You lose those in-between steps that can identify if there's an issue with the data. Suppose the store distances look incorrect one day: you'd need to unpack all of those SQL statements, put them into individual PROC SQL blocks and run them. Every time you run into a problem you will need to do this. If you have them separated out, you'll use a negligible amount of temporary disk space and have a much easier time troubleshooting.
When dealing with raw data, especially data that updates regularly, assume something will go wrong one day and you'll need to review it in-depth. Sometimes the wrong file gets sent. Sometimes an upstream issue occurs that sends corrupted data. Any time that happens, you'll need to dig in and find out if it's a problem with your process or their process. Making easy-to-troubleshoot code will speed up the solution for everyone.
How do I start this?
I have two data sets.
For the output you will deliver:
It should be an excel or XML format
Each query logic/programmed check should be on each tab
Columns should be
Subject #,
Visit Date (You will need the Visit Date Listing also attached)
Visit Name (Visit date from the file_34422 must match Visit name in the Blood Pressure File)
Date of Assessment (From the BP Log), VSBPDT_RAW, VSTPT, BP results.
A column for SYSBP1, SYSBP2, SYSBP3, DIABP1, DIABP2, DIABP3
Findings/query text.
Below are Specification for BP:
For same SUBJECT and same FOLDERNAME, where VSTPT is Blood Pressure 1.
if VSBPYN is No, then all must be null or =0 (VSBPDT_RAW, VSBPTM1, SYSBP1, DIABP1, VSBPND2, VSBPTM2, SYSBP2, DIABP2, VSBPND3, VSBPTM3, SYSBP3, DIABP3)
This is what I have started with:
proc sql;
select
f.subject,
f.SVSTDT_RAW, f.FolderName,
b.FolderName,
VSBPDT_RAW, VSTPT,
SYSBP1, SYSBP2, SYSBP3,
DIABP1, DIABP2, DIABP3
FROM first_data as f, bp_data as b
group by subject, foldername
where f.subject = b.subject
having VSTPT is Blood Pressure set 1,
VSBPYN is No;
quit;
I just need to be pointed in the right direction. I know this can't be right.
I do not know the exact structure of your data, so the solution below may need to be modified by you to select the right columns.
From the description, this looks like a good situation for combining SQL and a data step. You have a lot of columns to merge with the bp table, and merging all of these columns with first_data is easy to do in SQL.
When you have lots of by-row conditionals, a data step will be easier to work with and read than many CASE statements in SQL. We'll do a two-stage approach in which we use SQL and a data step.
Step 1: Merge the data
proc sql noprint;
create table stage as
select t1.*
, t2.VSBPYN
from bp_data as t1
INNER JOIN
first_data as t2
ON t1.subject = t2.subject
AND t1.foldername = t2.foldername
where t1.VSTPT = 1 /* adjust if VSTPT is character, e.g. 'Blood Pressure 1' */
;
quit;
Step 2: Conditionally set values to missing
Next, we'll do a data step for our conditional logic. call missing() is a useful CALL routine that lets you set many variables to missing in a single statement.
data want;
set stage;
if(upcase(VSBPYN) = 'NO') then call missing(VSBPDT_RAW, VSBPTM1, SYSBP1, DIABP1,
VSBPND2, VSBPTM2, SYSBP2, DIABP2,
VSBPND3, VSBPTM3, SYSBP3, DIABP3
);
run;
Step 3: Output to Excel
Finally, we send the output to Excel.
proc export
data=want
file='/my/location/want.xlsx'
dbms=xlsx
replace;
run;
I'm trying to pull a series of variables using only the most recent update date (PD_LAST_UPDATE). Some fields were updated yesterday and some might have been updated a year ago, so I can't just set PD_LAST_UPDATE equal to a specific time, and if I filter on any fixed cutoff I'll get far too much data.
Here's my code
(SELECT N1.PD_PROP_NUM, N1.PD_START_DATE, N1.PD_END_DATE, N1.PD_DOW_FREQ,
N1.PD_RATE_PGM, N1.PD_ROOM_POOL, N1.PD_QUOTE_SERIES,
N1.PD_RPGM_SEQ_NUM, N1.PD_LAST_UPDATE
FROM OMP.OMT_PR_SSTRAT_DTL N1
INNER JOIN OMP.OMT_PROP_SSTRAT AS N2 ON (N1.PD_PROP_NUM=N2.PS_PROP_NUM AND
N1.PD_START_DATE=N2.PS_START_DATE AND
N1.PD_DOW_FREQ=N2.PS_DOW_FREQ AND
N1.PD_ROOM_POOL=N2.PS_ROOM_POOL)
WHERE N2.PS_PROP_NUM in (11612) AND **n1.PD_LAST_UPDATE = (MAX)**
);
quit;
The portion of particular interest is bolded, and the prop num filter ahead of it will be removed once I can figure out how to select the max value, so that I can pull down all prop nums. Thanks in advance.
You have two ways to filter on the max value of a variable.
One is to group by everything you want to calculate the maximum by, and then use having (which is the after-group-by version of where), like so:
proc sql;
select origin, make, model
from sashelp.cars
group by origin, make
having mpg_city = max(mpg_city);
quit;
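This is allowed in SAS, but not in most other SQL flavors. SAS handles it by remerging the summary statistic back onto the detail rows (the log notes this), so it is largely a shortcut for the method below, and it only suits cases where the grouping columns alone define the maximum you want.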
The more traditional approach, then, is to do a correlated subquery:
proc sql;
select origin, make, model
from sashelp.cars C
where mpg_city = (
select max(mpg_city)
from sashelp.cars R
where C.origin=R.origin
and C.make=R.make
group by make, origin
);
quit;
In this case, we're getting to the same place, and more or less getting there the same way - SAS does this on the back end anyway.
In the case of a join, you can either perform this subquery or similar on the dataset prior to the join (or in a subquery whose result is then joined), or you can do so on the result of the join, depending on which is more efficient and whether you need rows from both tables to determine the maximum value.
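A minimal sketch of the "subquery whose result is then joined" variant, again using sashelp.cars so it can be run as is. The alias names c and m and the column name max_mpg are only illustrative.

```sas
proc sql;
    select c.origin, c.make, c.model, c.mpg_city
    from sashelp.cars as c
         inner join
         /* inline view: one row per group holding that group's maximum */
         (select origin, make, max(mpg_city) as max_mpg
          from sashelp.cars
          group by origin, make) as m
         on  c.origin   = m.origin
         and c.make     = m.make
         and c.mpg_city = m.max_mpg;
quit;
```

For the tables in the question, the inline view would presumably compute max(PD_LAST_UPDATE) grouped by whatever identifies one record (for example PD_PROP_NUM plus the other join keys). That choice of grouping is an assumption, since the question does not say at which level "most recent" is defined.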
Thanks for the feedback guys, but I have to rewrite the question to make it more clear.
Say, we have a table:
(table image not available; the columns referred to below are Number and FP_NDT)
What I am trying to get from this table is a list of numbers whose FP_NDT dates match my condition. For example, I want a list of numbers that have a non-null FP_NDT for 2014 and 2015 only, and missing values for 2011, 2012 and 2013 (irrespective of the month). With this condition I should get only Number 4. Is it possible to do this from this table?
PS: If I write a simple sql select statement and put a condition like
where year(FP_NDT) in (2014,2015)
it would also give me numbers 2 and 3...
Why not first summarize the data?
proc sql;
create table XX as
select number
, max(year(fp_ndt)=2011) as yr2011
, max(year(fp_ndt)=2012) as yr2012
, max(year(fp_ndt)=2013) as yr2013
, max(year(fp_ndt)=2014) as yr2014
, max(year(fp_ndt)=2015) as yr2015
from table1
group by number
;
Now it is easy to make your tests.
select * from XX
where yr2014+yr2015=2 and yr2011+yr2012+yr2013=0
;
quit;
You could use the first query as a sub-query instead of creating a physical table.
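A sketch of that sub-query form follows. The sample rows in table1 are hypothetical and only there to make the step runnable; the column names number and fp_ndt come from the query above.

```sas
/* Hypothetical sample data: number 2 has a 2013 date, number 4 does not */
data table1;
    input number fp_ndt :date9.;
    format fp_ndt date9.;
    datalines;
2 15mar2013
2 10jan2014
2 20feb2015
4 05may2014
4 11jun2015
;
run;

proc sql;
    select number
    from (select number
               /* each flag is 1 if the number has any date in that year */
               , max(year(fp_ndt) = 2011) as yr2011
               , max(year(fp_ndt) = 2012) as yr2012
               , max(year(fp_ndt) = 2013) as yr2013
               , max(year(fp_ndt) = 2014) as yr2014
               , max(year(fp_ndt) = 2015) as yr2015
          from table1
          group by number)
    where yr2014 + yr2015 = 2
      and yr2011 + yr2012 + yr2013 = 0;
quit;
```

With these sample rows only number 4 is returned, because number 2 has a date in 2013.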
So you want names associated with 1, 2 and 3, but in different rows.
You can group rows by names and count the associated numbers as this:
PROC SQL;
CREATE TABLE xxx AS SELECT
name,
SUM(number=1) AS count1,
SUM(number=2) AS count2,
SUM(number=3) AS count3
FROM test GROUP BY name;
QUIT;
Then you can filter the results based on count1-count3, i.e. (count1>0 AND count2>0 AND count3>0).
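If the intermediate table is not needed, the same filter could be written directly in a HAVING clause. This is only a sketch; the sample rows in test are hypothetical.

```sas
/* Hypothetical sample data: Ann has 1, 2 and 3; Bob is missing 2 */
data test;
    input name $ number;
    datalines;
Ann 1
Ann 2
Ann 3
Bob 1
Bob 3
;
run;

proc sql;
    select name
    from test
    group by name
    /* each sum counts the rows in the group matching that number */
    having sum(number = 1) > 0
       and sum(number = 2) > 0
       and sum(number = 3) > 0;
quit;
```

With these rows only Ann is returned.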
Try this:
proc sql;
select *
from work.test
group by name having nmiss(number)=0;
quit;
I have found one workaround, which is to create separate data sets for each year and then inner join them with a where condition for missing and not null for the needed years. However, it becomes a bit cumbersome when it comes to 60 months, for instance...
I have a WORK dataset with more than 30 columns, of which only 2 are date fields (start date and end date). I want these dates in the permanent dataset to use the date. format instead of yymmdd10., which is the current format in the WORK dataset. When I used the code below, the two date fields took the first two positions. I don't want to reorder the columns, and at the same time I don't want to list all 30+ columns in the format statement. Could someone please tell me if there is a way to do this?
data DLR.DEALER;
set work.dealer_invoices; * this dataset contains more than 30 columns;
format start_dt end_dt date.;
run;
I could not find any solution for this on our site. Any help beyond simply telling me to list all the columns in the format statement is highly appreciated :) Thanks in advance.
Certainly the format statement shouldn't have any impact on ordering given its location.
A workaround would be to use PROC DATASETS to change the format instead of in the data step.
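A minimal sketch of the PROC DATASETS route, assuming the DLR library is already assigned. The date9. width is an assumption for illustration; substitute date. if the default width is what you want.

```sas
/* Copy the data set first; the column order is carried over unchanged */
data DLR.DEALER;
    set work.dealer_invoices;
run;

/* Change only the format metadata, without rewriting the data */
proc datasets library=DLR nolist;
    modify DEALER;
    format start_dt end_dt date9.;
quit;
```

Because MODIFY only updates the descriptor of the data set, the variable positions are left untouched.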
You also could "mention all columns" fairly easily.
proc sql;
select name into :namelist separated by ' '
from dictionary.columns
where libname='WORK' and memname='DEALER_INVOICES'
order by varnum;
quit;
then
data DLR.DEALER;
retain &namelist;
set work.dealer_invoices;
format...;
run;