Thanks for the feedback, everyone, but I have to rewrite the question to make it clearer.
Say, we have a table:
[Table with columns Number and FP_NDT; Numbers 1-4 each appear with FP_NDT dates across 2011-2015]
What I am trying to get from this table is a list of Numbers whose FP_NDT dates match a condition. For example, I want the Numbers that have non-null FP_NDT values for 2014 and 2015 and missing values for 2011, 2012 and 2013 (regardless of the month). With this condition I should get only Number 4. Is it possible to do this from this table?
PS: If I write a simple sql select statement and put a condition like
where year(FP_NDT) in (2014,2015)
it would also give me numbers 2 and 3...
Why not first summarize the data?
proc sql;
create table XX as
select number
, max(year(fp_ndt)=2011) as yr2011
, max(year(fp_ndt)=2012) as yr2012
, max(year(fp_ndt)=2013) as yr2013
, max(year(fp_ndt)=2014) as yr2014
, max(year(fp_ndt)=2015) as yr2015
from table1
group by number
;
quit;
Now it is easy to make your tests.
proc sql;
select * from XX
where yr2014+yr2015=2 and yr2011+yr2012+yr2013=0
;
quit;
You could use the first query as a sub-query instead of creating a physical table.
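For example, something along these lines (an untested sketch; it reuses table1 and the column names from the summary query above, and the output table name want is just a placeholder):
proc sql;
create table want as
select number
from (select number
           , max(year(fp_ndt)=2011) as yr2011
           , max(year(fp_ndt)=2012) as yr2012
           , max(year(fp_ndt)=2013) as yr2013
           , max(year(fp_ndt)=2014) as yr2014
           , max(year(fp_ndt)=2015) as yr2015
      from table1
      group by number)
where yr2014+yr2015=2 and yr2011+yr2012+yr2013=0
;
quit;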
So you want names associated with 1, 2 and 3, but in different rows.
You can group rows by name and count the associated numbers like this:
PROC SQL;
CREATE TABLE xxx AS SELECT
name,
SUM(number=1) AS count1,
SUM(number=2) AS count2,
SUM(number=3) AS count3
FROM test GROUP BY name;
QUIT;
Then you can filter the results based on count1-count3, i.e. (count1>0 AND count2>0 AND count3>0).
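For instance, a quick follow-up query might look like this (an untested sketch; it assumes the xxx table created above, and the output name matches is just a placeholder):
proc sql;
create table matches as
select name
from xxx
where count1 > 0 and count2 > 0 and count3 > 0;
quit;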
Try this:
proc sql;
select *
from work.test
group by name having nmiss(number)=0;
quit;
I have found one workaround, which is to create separate data sets for each year and then inner join them with WHERE conditions for missing and non-null values for the needed years. However, it becomes cumbersome when it comes to 60 months, for instance...
I'm very new to SAS and trying to learn it. I have a problem statement where I need to extract two files from a location and then perform joins. Below is a detailed explanation of what I'm trying to achieve in a single proc sql statement:
There are two tables, table a (columns - account#, sales, transaction, store#) and table b (columns - account#, account zipcode) and an excel file (columns - store# and store zipcode). I need to first join these two tables on column account#.
Next step is to join their resulting values with the excel file on column store# and also add a column called as 'distance', which calculates the distance between account zipcode and store zipcode with the help of zipcitydistance(account zipcode, store zipcode) function. Let the resulting table be called "F".
Next, I want to use a CASE statement to create a distance-bucket column based on the distance from the above query, e.g.:
select
case
when distance<=5 then "<=5"
when distance between 5 and 10 then "5-10"
when distance between 10 and 15 then "10-15"
else ">=15"
end as distance_bucket,
sum(transactions) as total_txn,
sum(sales) as total_sales
from F
group by 1
So far, below is the code that I have written:
data table_a;
set xyzstore.filea;
run;
data table_b;
set xyzstore.fileb;
run;
proc import datafile="/location/file.xlsx"
out=filec dbms=xlsx replace;
run;
proc sql;
create table d as
select a.store_number, b.account_number, sum(a.sales) as sales, sum(a.transactions) as transactions, b.account_zipcode
from table_a as a left join table_b as b
on a.account_number=b.account_number
group by a.store_number, b.account_number, b.account_zipcode;
quit;
proc sql;
create table e as
select d.*, c.store_zipcode, zipcitydistance(d.account_zipcode, c.store_zipcode) as distance
from d inner join filec as c
on d.store_number=c.store_number;
quit;
proc sql;
create table final as
select
case
when distance<=5 then "<=5"
when distance between 5 and 10 then "5-10"
when distance between 10 and 15 then "10-15"
else ">=15"
end as distance_bucket,
sum(transactions) as total_txn,
sum(sales) as total_sales
from e
group by 1;
quit;
How can I write the above lines of code in a single proc sql statement?
The way you are currently doing it is more readable and the preferred way to do it. Turning it into a single SQL statement will not yield any significant performance gains and will make it harder to troubleshoot in the future.
To do a little cleanup, you can remove the two data step set statements and join directly on those files themselves:
create table d as
...
from xyzstore.filea left join xyzstore.fileb
...
quit;
You could also use a format instead to clean up the CASE statement.
proc format;
value storedistance
low - 5 = '<=5'
5< - 10 = '5-10'
10< - 15 = '10-15'
15< - high = '>=15'
other = ' '
;
run;
...
proc sql;
create table final as
select put(distance, storedistance.) as distance_bucket
, sum(transactions) as total_txn
, sum(sales) as total_sales
from e
group by calculated distance_bucket
;
quit;
If you did want to turn your existing code into one big SQL statement, it would look like this:
proc sql;
create table final as
select CASE
when(distance <= 5) then '<=5'
when(distance between 5 and 10) then '5-10'
when(distance between 10 and 15) then '10-15'
else '>=15'
END as distance_bucket
, sum(transactions) as total_txn
, sum(sales) as total_sales
/* Join 'table d' with c */
from (select d.*
, c.store_zipcode
, zipcitydistance(d.account_zipcode, c.store_zipcode) as distance
/* Create 'table d' */
from (select a.store_number
, b.account_number
, sum(a.sales) as sales
, sum(a.transactions) as transactions
, b.account_zipcode
from xyzstore.filea as a
LEFT JOIN
xyzstore.fileb as b
ON a.account_number = b.account_number
group by a.store_number
, b.account_number
, b.account_zipcode
) as d
INNER JOIN
filec as c
ON d.store_number = c.store_number
)
group by calculated distance_bucket
;
quit;
While more compact, it is more difficult to troubleshoot. You lose those in-between steps that can identify if there's an issue with the data. Suppose the store distances look incorrect one day: you'd need to unpack all of those SQL statements, put them into individual PROC SQL blocks and run them. Every time you run into a problem you will need to do this. If you have them separated out, you'll use a negligible amount of temporary disk space and have a much easier time troubleshooting.
When dealing with raw data, especially data that updates regularly, assume something will go wrong one day and you'll need to review it in-depth. Sometimes the wrong file gets sent. Sometimes an upstream issue occurs that sends corrupted data. Any time that happens, you'll need to dig in and find out if it's a problem with your process or their process. Making easy-to-troubleshoot code will speed up the solution for everyone.
I'm trying to get all of a series of variables while pulling from the most recent possible update date (PD_LAST_UPDATE). Some fields were updated yesterday, some might have been updated a year ago, so I can't just do PD_LAST_UPDATE = (a value encoded to a specific time), and if I use any fixed time I'll get way too much data.
Here's my code:
(SELECT N1.PD_PROP_NUM, N1.PD_START_DATE, N1.PD_END_DATE, N1.PD_DOW_FREQ,
N1.PD_RATE_PGM, N1.PD_ROOM_POOL, N1.PD_QUOTE_SERIES,
N1.PD_RPGM_SEQ_NUM, N1.PD_LAST_UPDATE
FROM OMP.OMT_PR_SSTRAT_DTL N1
INNER JOIN OMP.OMT_PROP_SSTRAT AS N2 ON (N1.PD_PROP_NUM=N2.PS_PROP_NUM AND
N1.PD_START_DATE=N2.PS_START_DATE AND
N1.PD_DOW_FREQ=N2.PS_DOW_FREQ AND
N1.PD_ROOM_POOL=N2.PS_ROOM_POOL)
WHERE N2.PS_PROP_NUM in (11612) AND **n1.PD_LAST_UPDATE = (MAX)**
);
quit;
The portion of particular interest is bolded, and the prop num filter ahead of it will be done away with once I can figure out how to select the max value so I can pull down all prop nums. Thanks in advance.
You have two ways to filter on the max value of a variable.
One is to group by everything you want to calculate the maximum by, and then use having (which is the after-group-by version of where), like so:
proc sql;
select origin, make, model
from sashelp.cars
group by origin, make
having mpg_city = max(mpg_city);
quit;
This is allowed in SAS, but not in most other SQL flavors. It's largely a shortcut for the other method below, and it only works with some particular data structures.
The more traditional approach, then, is to do a correlated subquery:
proc sql;
select origin, make, model
from sashelp.cars C
where mpg_city = (
select max(mpg_city)
from sashelp.cars R
where C.origin=R.origin
and C.make=R.make
group by make, origin
);
quit;
In this case, we're getting to the same place, and more or less getting there the same way - SAS does this on the back end anyway.
In the case of a join, you can either perform this subquery or similar on the dataset prior to the join (or in a subquery whose result is then joined), or you can do so on the result of the join, depending on which is more efficient and whether you need rows from both tables to determine the maximum value.
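As a rough illustration of the first option (a sketch only, again using sashelp.cars; the max_mpg alias and the best table name are just placeholders), you can compute the maximum in an inline view and join it back before doing any further joins:
proc sql;
create table best as
select c.origin, c.make, c.model
from sashelp.cars as c
inner join
     (select origin, make, max(mpg_city) as max_mpg
      from sashelp.cars
      group by origin, make) as m
on c.origin = m.origin
   and c.make = m.make
   and c.mpg_city = m.max_mpg;
quit;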
Suppose I'm subsetting a table and summarizing it in proc sql. The code uses a where ... in clause and a subquery to do the subsetting. I know that some SQL engines set a limit on the number of arguments to the where ... in clause. Does SAS have a limit on this? This question would apply to a program like this:
proc sql;
create table want as
select
ID,
sum(var1) as var1,
sum(var2) as var2,
sum(var3) as var3
from largetable
where ID in (select ID from longlist)
group by ID;
quit;
What if longlist returns 10,000 IDs? How about 10,000,000?
I'm not aware of any explicit limit on this. SAS's SQL parser often seems to convert these subqueries to JOINs when the values come from a table rather than a literal list; that means there are some limitations, but not particularly small ones.
I do believe there is a limit to the length of a SQL statement in total, so if you were trying to include an extremely long list in text you might run into problems, but in the example above I don't see a problem with 10,000,000 IDs. I just tested it with 250,000,000 IDs in the longlist table, and SAS had no problem with it:
data largetable;
do id=1 to 1e8;
if mod(id,7)=0 then output;
end;
run;
data ids;
do id = 1 to 1e9;
if mod(id,4)=0 then output;
end;
run;
proc sql _method;
create table want as
select
ID
from largetable
where ID in (select ID from IDs)
group by ID;
quit;
Interestingly, adding _method indicates it does not do this as a join, but as a subquery. I'm not sure why, at least in this case; everything I've been told says that it should convert this to a join implicitly.
As Joe has said, there should probably be no problems with any reasonable number of rows in the longlist table. However, although this may be readable, a join may perform better.
Do you have a strong preference for running the query as written rather than doing a left join, e.g.
proc sql;
create table want as
select
b.ID,
sum(b.var1) as var1,
sum(b.var2) as var2,
sum(b.var3) as var3
from longlist a left join largetable b
on a.ID = b.ID
group by b.ID;
quit;
Elaborating a bit on entering a long list as text - I'm not aware of any limit on the length of any one statement in SAS, but there are various limits on the length of individual lines of code, depending on your version and how you're submitting it. I suspect it's possible to split a long statement over several lines each approaching the maximum allowed length.
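In other words, if you did need to paste a literal list, splitting it across lines is straightforward (a sketch only; the values here are made up):
proc sql;
create table want as
select *
from largetable
where ID in (7, 14, 21, 28, 35,
             42, 49, 56, 63, 70,
             /* ... one chunk of values per line, continuing as needed ... */
             999999993, 1000000000);
quit;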
So, I have a significant problem with PROC COMPARE. I have two datasets, each with two columns: one column lists table names and the other lists the names of variables that correspond to the table names in the first column. I want to compare the values of the second column based on the values of the first. I have somewhat made it work, but the datasets have different sizes because of additional values in one of them, which means a new variable was added in the middle of a dataset (a new variable was added to a table). Unfortunately, PROC COMPARE compares values from the two datasets horizontally and checks them against each other, so in my case it looks like this:
ds 1 | ds 2
cost | box_nr
other | cost_total
As you can see, a new value box_nr was added to the second dataset and appears above the value I want to compare cost to (cost_total). So I would like to know if it's possible to compare values that have at least minimal similarity in their character sequence (for example, the first 3 letters, "cos"), or if it's possible to just push values like box_nr to the end, indicating that they don't appear in the other dataset.
My code:
PROC Compare base=USERSPDS.MIzew compare=USERSPDS.MIwew
out=USERSPDS.result outbase outcomp outdif noprint;
id 'TABLE HD'n;
where ;
run;
proc print data=USERSPDS.result noobs;
by 'TABLE HD'n;
id 'TABLE HD'n;
title 'COMPARISON:';
run;
Untested, but this should get you some of the way.
proc sql;
create table compare as
select
coalesce(a.cola, b.cola) as cola,
a.colb as acolb,
b.colb as bcolb
from dataa as a
full outer join datab as b
on
a.cola = b.cola and
compged(a.colb, b.colb) <= 100;
quit;
Have a look at the compged documentation for further information.
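To get a feel for an appropriate cutoff, you can print a few COMPGED scores first (a quick sketch; the exact scores depend on COMPGED's default operation costs):
data _null_;
   d1 = compged("cost", "cost_total");
   d2 = compged("cost", "box_nr");
   put d1= d2=;
run;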
Sounds like you could make a new variable in both datasets, VAR3chars=substr(var,1,3) and then add that variable to your ID statement. I think that should work unless there are duplicate values.
So if one dataset had var="cost" and the other had var="cost_total", they would match on the id so they would be compared and found to be different.
If one dataset had var="box_nr" and the other did not have any values starting with "box", they would not match on the id so compare would find that a record exists for that id in one dataset but not the other.
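An untested sketch of that idea, reusing the dataset names from the question (the column holding the variable names is assumed to be called var, and both datasets are assumed to be sorted by the ID variables):
data base2;
   set USERSPDS.MIzew;
   VAR3chars = substr(var, 1, 3);   /* first three characters of the variable name */
run;

data comp2;
   set USERSPDS.MIwew;
   VAR3chars = substr(var, 1, 3);
run;

proc compare base=base2 compare=comp2
             out=USERSPDS.result outbase outcomp outdif noprint;
   id 'TABLE HD'n VAR3chars;
run;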
Based on the data below, could anyone assist me in making a program that finds the test_date closest to delivery_date?
delivery_date 11/16/2011
test_date 21/nov/2011 10/nov/2011 5/oct/2010
Thanks in advance
Can't really answer properly without more information, but if you have two datasets and one row per date, and actual date variables, then the solution is something like this:
proc sql;
create table finaldsn as
select a.*
, b.*
, a.delivery_date-b.test_date as days
, abs(calculated days) as absdays
, min(calculated absdays) as close
from dsnA as a
full join
dsnB as b
on a.subject=b.id
where a.delivery_date ne . and b.test_date ne .
and b.id in (select distinct subject
from dsnA)
group by a.subject, a.delivery_date
having calculated absdays=calculated close
;
quit;
This is from the following paper: http://www.lexjansen.com/pharmasug/2003/coderscorner/cc001.pdf which also presents a few other solutions.
You can use the SAS function INTCK to calculate the difference between a sample date and your target date, then just keep the date whose difference is closest to 0.
Here is a link so you can quickly pick up on this function:
http://support.sas.com/documentation/cdl/en/lrdict/64316/HTML/default/viewer.htm#a000212868.htm
Also, feel free to check out the other functions listed on the left side of that page, as they can come in handy in the long run.
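A minimal sketch of that idea, assuming a dataset named have with a delivery_date and one row per test_date (both as actual SAS date variables):
data diffs;
   set have;
   days_apart = abs(intck('day', delivery_date, test_date));  /* days between the two dates */
run;

proc sort data=diffs;
   by days_apart;
run;

data closest;
   set diffs;
   if _n_ = 1;   /* keep only the test_date closest to delivery_date */
run;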