I have three datasets, A, B, and C, which contain the following variables:
A:
period region city
B:
period city Sales
C:
period region Sales
My goal is to do a left join on A using B and C to get the Sales information based on the geographic location. I tried the following sequence of steps:
/* Left joining B to A based on period and city */
proc sql;
Create table merge1 as
select l.* , r.* from
A as l left join B as r
on l.period = r.period and l.city=r.city;
quit;
/* Renaming Sales variable*/
data merged2;
set merge1;
rename Sales= s1;
run;
/*Doing another left join again, this time using C*/
proc sql;
create table merge3 as
select l.*,r.* from
merged2 as l left join C as r
on l.period= r.period and l.region=r.region;
quit;
/*Replacing some of the values*/
data merge4;
set merge3;
Sales1= IFN(s1=., Sales, s1);
drop s1 Sales;
run;
My question is whether there are better or more efficient ways to go about this, especially for the multiple left joins, since the process will get really tedious as the number of datasets and variables to be matched increases. Thanks!
You could do it in a single SQL procedure. Since you have multiple tables, you will have to join them one by one.
proc sql;
Create table merge1 as select
A.* ,
B.sales as s1,
C.sales as s2,
coalesce(B.sales, C.sales) as Sales /*takes first non missing value*/
from A
left join B on (A.period = B.period and A.city = B.city)
left join C on (A.period = C.period and A.region = C.region);
quit;
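To see how the coalesce fallback behaves, here is a minimal sketch on made-up data. The rows and values are hypothetical and only illustrate the pattern; the dataset and variable names follow the question.

```sas
/* Hypothetical sample data, for illustration only */
data A;
  input period region $ city $;
  datalines;
1 East Boston
1 East Albany
1 West Denver
;
run;

data B;
  input period city $ Sales;
  datalines;
1 Boston 100
;
run;

data C;
  input period region $ Sales;
  datalines;
1 East 50
;
run;

proc sql;
  create table merged as
  select A.*,
         /* city-level value first, region-level value as the fallback */
         coalesce(B.Sales, C.Sales) as Sales
  from A
  left join B on (A.period = B.period and A.city = B.city)
  left join C on (A.period = C.period and A.region = C.region);
quit;
```

With these rows, Boston should take its value from B, Albany should fall back to the East value from C, and Denver should stay missing because neither table has a match. Adding another source table means one more left join and one more argument to coalesce.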
I'm very new to SAS and trying to learn it. I have a problem statement where I need to extract two files from a location and then perform joins. Below is a detailed explanation of what I'm trying to achieve in a single proc sql statement:
There are two tables, table a (columns - account#, sales, transaction, store#) and table b (columns - account#, account zipcode) and an excel file (columns - store# and store zipcode). I need to first join these two tables on column account#.
The next step is to join the result with the Excel file on column store# and add a column called 'distance', which calculates the distance between account zipcode and store zipcode with the zipcitydistance(account zipcode, store zipcode) function. Let the resulting table be called "F".
Next, I want to use a CASE expression to create a distance bucket column based on the distance from the above query, e.g.:
select
case
when distance<=5 then "<=5"
when distance between 5 and 10 then "5-10"
when distance between 10 and 15 then "10-15"
else ">=15"
end as distance_bucket,
sum(transactions) as total_txn,
sum(sales) as total_sales
from F
group by 1
So far, below is the code that I have written:
data table_a;
set xyzstore.filea;
run;
data table_b;
set xyzstore.fileb;
run;
proc import datafile="/location/file.xlsx"
out=filec dbms=xlsx replace;
run;
proc sql;
create table d as
select a.store_number, b.account_number, sum(a.sales) as sales, sum(a.transactions) as transactions, b.account_zipcode
from table_a as a left join table_b as b
on a.account_number=b.account_number
group by a.store_number, b.account_number, b.account_zipcode;
quit;
proc sql;
create table e as
select d.*, c.store_zipcode, zipcitydistance(d.account_zipcode, c.store_zipcode) as distance
from d inner join filec as c
on d.store_number=c.store_number;
quit;
proc sql;
create table final as
select
case
when distance<=5 then "<=5"
when distance between 5 and 10 then "5-10"
when distance between 10 and 15 then "10-15"
else ">=15"
end as distance_bucket,
sum(transactions) as total_txn,
sum(sales) as total_sales
from e
group by 1;
quit;
How can I write the above lines of code in a single proc sql statement?
The way you are currently doing it is more readable and the preferred way to do it. Turning it into a single SQL statement will not yield any significant performance gains and will make it harder to troubleshoot in the future.
To do a little cleanup, you can remove the two data step set statements and join directly on those files themselves:
create table d as
...
from xyzstore.filea as a left join xyzstore.fileb as b
...
quit;
You could also use a format instead to clean up the CASE statement.
proc format;
value storedistance
low - 5 = '<=5'
5< - 10 = '5-10'
10< - 15 = '10-15'
15 - high = '>=15'
other = ' '
;
run;
...
proc sql;
create table final as
select put(distance, storedistance.) as distance_bucket
, sum(transactions) as total_txn
, sum(sales) as total_sales
from e
group by calculated distance_bucket
;
quit;
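A quick way to check which bucket each distance lands in, boundary values in particular, is to run a few test values through the format and read the log. This is only a sketch: the test distances are hypothetical, and the format definition is repeated from above so the block runs on its own.

```sas
/* Same ranges as the format above, repeated so this block is self-contained */
proc format;
  value storedistance
    low - 5    = '<=5'
    5<  - 10   = '5-10'
    10< - 15   = '10-15'
    15  - high = '>=15'
    other      = ' '
  ;
run;

/* Write the bucket assigned to a few hypothetical test distances to the log */
data _null_;
  do distance = 3, 5, 7.5, 10, 15, 20;
    bucket = put(distance, storedistance.);
    put distance= bucket=;
  end;
run;
```

The value 15 is worth checking in the log, since it is an endpoint of two ranges in the definition.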
If you did want to turn your existing code into one big SQL statement, it would look like this:
proc sql;
create table final as
select CASE
when(distance <= 5) then '<=5'
when(distance between 5 and 10) then '5-10'
when(distance between 10 and 15) then '10-15'
else '>=15'
END as distance_bucket
, sum(transactions) as total_txn
, sum(sales) as total_sales
/* Join 'table d' with c */
from (select d.*
, c.store_zipcode
, zipcitydistance(d.account_zipcode, c.store_zipcode) as distance
/* Create 'table d' */
from (select a.store_number
, b.account_number
, sum(a.sales) as sales
, sum(a.transactions) as transactions
, b.account_zipcode
from xyzstore.filea as a
LEFT JOIN
xyzstore.fileb as b
ON a.account_number = b.account_number
group by a.store_number
, b.account_number
, b.account_zipcode
) as d
INNER JOIN
filec as c
ON d.store_number = c.store_number
)
group by calculated distance_bucket
;
quit;
While more compact, it is more difficult to troubleshoot. You lose those in-between steps that can identify if there's an issue with the data. Suppose the store distances look incorrect one day: you'd need to unpack all of those SQL statements, put them into individual PROC SQL blocks and run them. Every time you run into a problem you will need to do this. If you have them separated out, you'll use a negligible amount of temporary disk space and have a much easier time troubleshooting.
When dealing with raw data, especially data that updates regularly, assume something will go wrong one day and you'll need to review it in-depth. Sometimes the wrong file gets sent. Sometimes an upstream issue occurs that sends corrupted data. Any time that happens, you'll need to dig in and find out if it's a problem with your process or their process. Making easy-to-troubleshoot code will speed up the solution for everyone.
When using a left join in SAS, the right-side table has duplicate IDs with different donations, so the join returns several rows per ID.
I only want one row per ID, the one with the highest donated amount.
The code is as follows:
proc sql;
Create table x
As select T1.*,
T2.Donations
From xxx t1
Left join yy t2 on (t1.id = t2.id);
Quit;
Thanks for any help
In SAS, follow https://stackoverflow.com/a/61486331/8227346
In MySQL, you can use partitioning with ROW_NUMBER:
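CREATE TABLE x AS SELECT t1.*, t2.Donations
FROM xxx t1
LEFT JOIN
(
SELECT * FROM
(
SELECT
*,
/* rn = 1 marks the highest donation per id; RANK is a reserved word in MySQL 8 */
ROW_NUMBER() OVER (PARTITION BY id ORDER BY Donations DESC) AS rn
FROM
yy
) ranked /* MySQL requires an alias on every derived table */
WHERE
rn = 1
)
t2
ON (t1.id = t2.id);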
More info can be found at https://www.c-sharpcorner.com/blogs/rownumber-function-with-partition-by-clause-in-sql-server1
You can either work with a subselect that selects only the highest donation for a given ID, or you could do some pre-work in SAS (which I prefer):
*Order ascending by ID and DONATIONS;
proc sort data=work.t2;
by ID DONATIONS;
run;
*Keep only the observation with the highest DONATIONS per ID;
data work.HIGHEST_DONATIONS;
set work.t2;
by ID;
if last.ID then output;
run;
I don't have SAS available right now but it should work.
Don't hesitate to ask further questions. :)
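Both options mentioned above can be sketched as follows. The left-hand table name work.t1 is an assumption (it mirrors the t1 alias in the question), as are the variable names ID and DONATIONS and the output table names.

```sas
/* Option 1: join against the pre-computed work.HIGHEST_DONATIONS dataset */
proc sql;
  create table work.x as
  select t1.*,
         h.DONATIONS
  from work.t1 as t1
  left join work.HIGHEST_DONATIONS as h
    on t1.ID = h.ID;
quit;

/* Option 2: the subselect variant, taking the maximum donation per ID inline */
proc sql;
  create table work.x_sub as
  select t1.*,
         m.max_donation as DONATIONS
  from work.t1 as t1
  left join (select ID, max(DONATIONS) as max_donation
             from work.t2
             group by ID) as m
    on t1.ID = m.ID;
quit;
```

Either way the right-hand side has at most one row per ID, so the left join no longer multiplies rows.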
I am trying to join two datasets. The first dataset1 has two columns item and price. The second dataset2 has three columns - item, customerid, and qty. I need to only include the unique rows from dataset1 that are not in dataset2. While trying to implement this code, I get the error:
Error: Unresolved reference to table/correlation name i.
I am unsure how to fix this error, thanks.
PROC SQL;
create table a as
select *
from dataset1 as i
except corr
select *
from dataset2 as p
where i.item = p.item;
describe table a;
QUIT;
EXCEPT is used to select records in the first set that do not exist in the second set. So if what you want is, to quote you, select records from dataset1 that do not appear in dataset2, you don't need the where clause:
PROC SQL;
create table a as
select *
from dataset1 as i
except corr
select *
from dataset2 as p
;
QUIT;
If however, like that where clause would suggest, you actually want to select records from dataset1 where the value of item is not found in dataset2, you could do this
proc sql;
select *
from dataset1 i
where not exists (select *
from dataset2 p
where i.item=p.item
)
;
quit;
EDIT: following your latest comment, and if you reaaaally need your query to feature an except, this should get you your result
proc sql;
create table a as
select t1.*
from dataset1 t1
inner join (select *
from dataset1 as i
except corr
select *
from dataset2 as p
) t2
on t1.item=t2.item
;
quit;
Even though this will do the same as the query above (with not exists) or, now that I think of it (stupid me), as this:
proc sql;
create table a as
select *
from dataset1
where item not in (select distinct item from dataset2)
;
quit;
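One difference between these variants is worth seeing on data: EXCEPT CORR compares only the columns that both tables share by name, so its result carries just those columns, while NOT EXISTS keeps every column of dataset1. A minimal sketch with hypothetical rows (the output table names are made up as well):

```sas
/* Hypothetical sample data, for illustration only */
data dataset1;
  input item $ price;
  datalines;
apple 1.5
pear 2.0
plum 3.0
;
run;

data dataset2;
  input item $ customerid qty;
  datalines;
apple 101 2
;
run;

proc sql;
  /* Only item is common to both tables, so the result has just that column */
  create table items_only as
  select * from dataset1
  except corr
  select * from dataset2;

  /* Keeps item and price for the items that do not appear in dataset2 */
  create table full_rows as
  select *
  from dataset1 as i
  where not exists (select 1
                    from dataset2 as p
                    where i.item = p.item);
quit;
```

This is why the EXCEPT version in the edit above joins back to dataset1: the join restores the columns that EXCEPT CORR leaves out.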
I have to join two tables on a key (say ABC). I need to update a single column in table A using a coalesce function: coalesce(a.status_cd, b.status_cd).
TABLE A:
contains some 100 columns. KEY Columns ABC.
TABLE B:
Contains just 2 columns. KEY Column ABC and status_cd
Table A, which I use in this left join query, has more than 100 columns. Is there a way to use a.* followed by this coalesce function in my PROC SQL without creating a new column from the PROC SQL; CREATE TABLE AS ... step?
Thanks in advance.
You can take advantage of dataset options to make it so you can use wildcards in the select statement. Note that the order of the columns could change doing this.
proc sql ;
create table want as
select a.*
, coalesce(a.old_status,b.status_cd) as status_cd
from tableA(rename=(status_cd=old_status)) a
left join tableB b
on a.abc = b.abc
;
quit;
I eventually found a fairly simple way of doing this in proc sql after working through several more complex approaches:
proc sql noprint;
update master a
set status_cd= coalesce(status_cd,
(select status_cd
from transaction b
where a.ABC = b.ABC))
where exists (select 1
from transaction b
where a.ABC = b.ABC);
quit;
This will update just the one column you're interested in and will only update it for rows with key values that match in the transaction dataset.
Earlier attempts:
The most obvious bit of more general SQL syntax would seem to be the update...set...from...where pattern as used in the top few answers to this question. However, this syntax is not currently supported - the documentation for the SQL update statement only allows for a where clause, not a from clause.
If you are running a pass-through query to another database that does support this syntax, it might still be a viable option.
Alternatively, there is a way to do this within SAS via a data step, provided that the master dataset is indexed on your key variable:
/*Create indexed master dataset with some missing values*/
data master(index = (name));
set sashelp.class;
if _n_ <= 5 then call missing(weight);
run;
/*Create transaction dataset with some missing values*/
data transaction;
set sashelp.class(obs = 10 keep = name weight);
if _n_ > 5 then call missing(weight);
run;
data master;
set transaction;
t_weight = weight;
modify master key = name;
if _IORC_ = 0 then do;
weight = coalesce(weight, t_weight);
replace;
end;
/*Suppress log messages if there are key values in transaction but not master*/
else _ERROR_ = 0;
run;
A standard warning relating to the modify statement: if this data step is interrupted, the master dataset may be irreparably damaged, so make sure you have a backup first.
In this case I've assumed that the key variable is unique - a slightly more complex data step is needed if it isn't.
Another way to work around the lack of a from clause in the proc sql update statement would be to set up a format merge, e.g.
data v_format_def /view = v_format_def;
set transaction(rename = (name = start weight = label));
retain fmtname 'key' type 'i';
end = start;
run;
proc format cntlin = v_format_def; run;
proc sql noprint;
update master
set weight = coalesce(weight,input(name,key.))
where master.name in (select name from transaction);
quit;
In this scenario I've used type = 'i' in the format definition to create a numeric informat, which proc sql uses to convert the character variable name to the numeric variable weight. Depending on whether your key and status_cd columns are character or numeric, you may need to do this slightly differently.
This approach effectively loads the entire transaction dataset into memory when using the format, which might be a problem if you have a very large transaction dataset. The data step approach should hardly use any memory as it only has to load 1 row at a time.
If I have Table A with two columns, ID and Mean, and Table B with a long list of columns including Mean, how can I replace the values of the Mean column in Table B for the IDs that exist in Table A?
I've tried PROC SQL UPDATE and both DATASET MERGE and DATASET UPDATE but they keep adding rows when the number of columns is not equal in both tables.
data want;
merge have1(in=H1) have2(in=H2);
by mergevar;
if H1;
run;
That will guarantee that H2 does not add any rows, unless there are duplicate values for one of the by values. Other conditions can be used as well; if h2; would do about the same thing for the right-hand dataset, and if h1 and h2; would only keep records that come from both tables.
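As a small illustration of the in= filter, here is a sketch on hypothetical data; the dataset and variable names follow the snippet above, and the values are made up.

```sas
/* Hypothetical sample data, for illustration only */
data have1;
  input mergevar value1;
  datalines;
1 10
2 20
;
run;

data have2;
  input mergevar value2;
  datalines;
2 200
3 300
;
run;

/* Both inputs must be sorted by the BY variable before MERGE */
proc sort data=have1; by mergevar; run;
proc sort data=have2; by mergevar; run;

data want;
  merge have1(in=H1) have2(in=H2);
  by mergevar;
  if H1; /* keep only rows that are present in have1 */
run;
```

Here want should contain mergevar 1 and 2 only: the row for mergevar 3 exists only in have2 and is dropped, while mergevar 1 is kept with value2 missing.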
PROC SQL join should also work fairly easily.
proc sql;
create table want as
select A.id, coalesce(B.mean, A.mean) as mean
from A left join B
on A.id=B.id;
quit;