I have this data step:
data myRegions;
set myRegions;
ext_price = price * qty;
mix = weighted_calc * ext_price;
run;
I want to do this in SQL because I want to use some groupings and subqueries,
but do I have to repeat the price * qty operation every time I want to use that value?
You can use CALCULATED. From the docs:
CALCULATED enables you to use the results of an expression in the same
SELECT clause or in the WHERE clause. It is valid only when used to
refer to columns that are calculated in the immediate query
expression.
Here is an example:
proc sql;
create table myRegions as
select a.*,
       (price * qty) as ext_price,
       (weighted_calc * calculated ext_price) as mix
from myRegions as a;
quit;
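The same CALCULATED keyword also works in a WHERE clause. A minimal sketch reusing the columns above (the output table name and the threshold of 100 are just for illustration):
proc sql;
create table bigMix as
select (price * qty) as ext_price,
       (weighted_calc * calculated ext_price) as mix
from myRegions
where calculated ext_price > 100;   /* filter on the derived column directly */
quit;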
I'm very new to SAS and trying to learn it. I have a problem statement where I need to extract two files from a location and then perform joins. Below is a detailed explanation of what I'm trying to achieve in a single proc sql statement:
There are two tables, table a (columns: account#, sales, transaction, store#) and table b (columns: account#, account zipcode), and an Excel file (columns: store# and store zipcode). I need to first join the two tables on the column account#.
The next step is to join the result with the Excel file on the column store# and also add a column called 'distance', which calculates the distance between account zipcode and store zipcode with the help of the zipcitydistance(account zipcode, store zipcode) function. Let the resulting table be called "F".
Next, I want to use a CASE expression to create a distance bucket column based on the distance from the above query, e.g.:
select
case
when distance<=5 then "<=5"
when distance between 5 and 10 then "5-10"
when distance between 10 and 15 then "10-15"
else ">=15"
end as distance_bucket,
sum(transactions) as total_txn,
sum(sales) as total_sales
from F
group by 1
So far, below is the code that I have written:
data table_a;
set xyzstore.filea;
run;
data table_b;
set xyzstore.fileb;
run;
proc import datafile="/location/file.xlsx"
out=filec dbms=xlsx replace;
run;
proc sql;
create table d as
select a.store_number, b.account_number, sum(a.sales) as sales, sum(a.transactions) as transactions, b.account_zipcode
from table_a as a left join table_b as b
on a.account_number=b.account_number
group by a.store_number, b.account_number, b.account_zipcode;
quit;
proc sql;
create table e as
select d.*, c.store_zipcode, zipcitydistance(d.account_zipcode, c.store_zipcode) as distance
from d inner join filec as c
on d.store_number=c.store_number;
quit;
proc sql;
create table final as
select
case
when distance<=5 then "<=5"
when distance between 5 and 10 then "5-10"
when distance between 10 and 15 then "10-15"
else ">=15"
end as distance_bucket,
sum(transactions) as total_txn,
sum(sales) as total_sales
from e
group by 1;
quit;
How can I write the above lines of code in a single proc sql statement?
The way you are currently doing it is more readable and the preferred way to do it. Turning it into a single SQL statement will not yield any significant performance gains and will make it harder to troubleshoot in the future.
To do a little cleanup, you can remove the two DATA steps and join directly on those files themselves:
create table d as
...
from xyzstore.filea left join xyzstore.fileb
...
quit;
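Filled in, that first query would look something like this (using the column names from your original code):
proc sql;
create table d as
select a.store_number, b.account_number,
       sum(a.sales) as sales, sum(a.transactions) as transactions,
       b.account_zipcode
from xyzstore.filea as a
     left join xyzstore.fileb as b
     on a.account_number = b.account_number
group by a.store_number, b.account_number, b.account_zipcode;
quit;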
You could also use a format instead to clean up the CASE statement.
proc format;
value storedistance
low - 5 = '<=5'
5< - 10 = '5-10'
10< - 15 = '10-15'
15< - high = '>=15'
other = ' '
;
run;
...
proc sql;
create table final as
select put(distance, storedistance.) as distance_bucket
, sum(transactions) as total_txn
, sum(sales) as total_sales
from e
group by calculated distance_bucket
;
quit;
If you did want to turn your existing code into one big SQL statement, it would look like this:
proc sql;
create table final as
select CASE
when(distance <= 5) then '<=5'
when(distance between 5 and 10) then '5-10'
when(distance between 10 and 15) then '10-15'
else '>=15'
END as distance_bucket
, sum(transactions) as total_txn
, sum(sales) as total_sales
/* Join 'table d' with c */
from (select d.*
, c.store_zipocde
, c.store_zipcode
, zipcitydistance(d.account_zipcode, c.store_zipcode) as distance
/* Create 'table d' */
from (select a.store_number
, b.account_number
, sum(a.sales) as sales
, sum(a.transactions) as transactions
, b.account_zipcode
from xyzstore.filea as a
LEFT JOIN
xyzstore.fileb as b
ON a.account_number = b.account_number
group by a.store_number
, b.account_number
, b.account_zipcode
) as d
        INNER JOIN
        filec as c
        ON d.store_number = c.store_number
     )
group by calculated distance_bucket
;
quit;
While more compact, it is more difficult to troubleshoot. You lose those in-between steps that can identify if there's an issue with the data. Suppose the store distances look incorrect one day: you'd need to unpack all of those SQL statements, put them into individual PROC SQL blocks and run them. Every time you run into a problem you will need to do this. If you have them separated out, you'll use a negligible amount of temporary disk space and have a much easier time troubleshooting.
When dealing with raw data, especially data that updates regularly, assume something will go wrong one day and you'll need to review it in-depth. Sometimes the wrong file gets sent. Sometimes an upstream issue occurs that sends corrupted data. Any time that happens, you'll need to dig in and find out if it's a problem with your process or their process. Making easy-to-troubleshoot code will speed up the solution for everyone.
I have two datasets. The first, big_dataset, has around 3000 columns, most of which are never used. The second, column_list, contains a single column called column_name with around 100 values. Each value is the name of a column I want to keep.
I want to filter big_dataset so that only columns in column_list are kept, and the rest are discarded.
If I were using Pandas dataframes in Python, this would be a trivial task:
cols = column_list['column_name'].tolist()
smaller_dataset = big_dataset[cols]
However, I can't figure out the SAS equivalent. Proc Transpose doesn't let me turn the rows into headers. I can't figure out a statement in the data step that would let this work, and as far as I'm aware this isn't something that Proc SQL could handle. I've read through the docs on Proc Datasets and that doesn't seem to have what I need either.
To obtain a list of columns from column_list to use against big_dataset, you can query the column_list table and put the result into a macro variable. This can be achieved with PROC SQL and the SEPARATED BY clause:
proc sql noprint;
select column_name
into :cols separated by ','
from column_list;
create table SMALLER_DATASET AS
select &cols.
from WORK.BIG_DATASET;
quit;
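If you want to verify what ended up in the macro variable before using it, a quick log check helps (an optional step, not required for the query above):
%put cols=&cols.;   /* writes the resolved column list to the log */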
Alternatively you may use SEPARATED BY ' ' and then use the resulting list in a KEEP statement or dataset option:
proc sql noprint;
select column_name
into :cols separated by ' '
from column_list;
quit;
data small_dataset;
set big_dataset (keep=&cols.);
/* or use a KEEP statement instead: keep &cols.; */
run;
I am trying to join two datasets. The first dataset1 has two columns item and price. The second dataset2 has three columns - item, customerid, and qty. I need to only include the unique rows from dataset1 that are not in dataset2. While trying to implement this code, I get the error:
Error: Unresolved reference to table/correlation name i.
I am unsure how to fix this error, thanks.
PROC SQL;
create table a as
select *
from dataset1 as i
except corr
select *
from dataset2 as p
where i.item = p.item;
describe table a;
QUIT;
EXCEPT is used to select records in the first set that do not exist in the second set. So if what you want is, as you put it, to select the rows from dataset1 that do not appear in dataset2, you don't need the WHERE clause:
PROC SQL;
create table a as
select *
from dataset1 as i
except corr
select *
from dataset2 as p
;
QUIT;
If, however, as that WHERE clause would suggest, you actually want to select records from dataset1 whose item value is not found in dataset2, you could do this:
proc sql;
select *
from dataset1 i
where not exists (select *
from dataset2 p
where i.item=p.item
)
;
quit;
EDIT: following your latest comment, if you really need your query to feature an EXCEPT, this should get you your result:
proc sql;
create table a as
select t1.*
from dataset1 t1
inner join (select *
from dataset1 as i
except corr
select *
from dataset2 as p
) t2
on t1.item=t2.item
;
quit;
That said, this will do the same as the query above (with not exists) or, now that I think of it, the same as this simpler version:
proc sql;
create table a as
select *
from dataset1
where item not in (select distinct item from dataset2)
;
quit;
I have to join 2 tables on a key column (ABC). I have to update one single column in table A using a coalesce function: coalesce(a.status_cd, b.status_cd).
TABLE A:
Contains some 100 columns. Key column: ABC.
TABLE B:
Contains just 2 columns: the key column ABC and status_cd.
TABLE A, which I use in this left join query, has more than 100 columns. Is there a way to use a.* followed by this coalesce function in my PROC SQL, without creating a new column in the PROC SQL; CREATE TABLE AS ... step?
Thanks in advance.
You can take advantage of dataset options so that you can use wildcards in the SELECT statement. Note that the order of the columns could change when you do this.
proc sql ;
create table want as
select a.*
, coalesce(a.old_status,b.status_cd) as status_cd
from tableA(rename=(status_cd=old_status)) a
left join tableB b
on a.abc = b.abc
;
quit;
I eventually found a fairly simple way of doing this in proc sql after working through several more complex approaches:
proc sql noprint;
update master a
set status_cd = coalesce(status_cd,
                (select b.status_cd
                 from transaction b
                 where a.ABC = b.ABC))
where exists (select 1
from transaction b
where a.ABC = b.ABC);
quit;
This will update just the one column you're interested in and will only update it for rows with key values that match in the transaction dataset.
Earlier attempts:
The most obvious bit of more general SQL syntax would seem to be the update...set...from...where pattern as used in the top few answers to this question. However, this syntax is not currently supported - the documentation for the SQL update statement only allows for a where clause, not a from clause.
If you are running a pass-through query to another database that does support this syntax, it might still be a viable option.
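For example, with explicit pass-through the UPDATE runs in the remote database's own dialect, so UPDATE ... FROM works if that DBMS supports it. A rough sketch, assuming an ODBC connection with a hypothetical DSN mydsn and illustrative remote table names (master_tbl and status_tbl); adjust the connection options and the SQL dialect to your database:
proc sql;
   connect to odbc (dsn=mydsn);            /* hypothetical connection details */
   execute (
      update a
         set status_cd = coalesce(a.status_cd, b.status_cd)
      from master_tbl a                     /* illustrative remote table names */
           inner join status_tbl b
           on a.abc = b.abc
   ) by odbc;
   disconnect from odbc;
quit;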
Alternatively, there is a way to do this within SAS via a data step, provided that the master dataset is indexed on your key variable:
/*Create indexed master dataset with some missing values*/
data master(index = (name));
set sashelp.class;
if _n_ <= 5 then call missing(weight);
run;
/*Create transaction dataset with some missing values*/
data transaction;
set sashelp.class(obs = 10 keep = name weight);
if _n_ > 5 then call missing(weight);
run;
data master;
set transaction;
t_weight = weight;
modify master key = name;
if _IORC_ = 0 then do;
weight = coalesce(weight, t_weight);
replace;
end;
/*Suppress log messages if there are key values in transaction but not master*/
else _ERROR_ = 0;
run;
A standard warning relating to the modify statement: if this data step is interrupted, the master dataset may be irreparably damaged, so make sure you have a backup first.
In this case I've assumed that the key variable is unique - a slightly more complex data step is needed if it isn't.
Another way to work around the lack of a from clause in the proc sql update statement would be to set up a format merge, e.g.
data v_format_def /view = v_format_def;
set transaction(rename = (name = start weight = label));
retain fmtname 'key' type 'i';
end = start;
run;
proc format cntlin = v_format_def; run;
proc sql noprint;
update master
set weight = coalesce(weight,input(name,key.))
where master.name in (select name from transaction);
quit;
In this scenario I've used type = 'i' in the format definition to create a numeric informat, which proc sql uses to convert the character variable name to the numeric variable weight. Depending on whether your key and status_cd columns are character or numeric, you may need to do this slightly differently.
This approach effectively loads the entire transaction dataset into memory when using the format, which might be a problem if you have a very large transaction dataset. The data step approach should hardly use any memory as it only has to load 1 row at a time.
I have an SQL query that creates, for each customer, a short excerpt of their history. Suppose the columns I am interested in are TIMESTAMP and PURCHASE VALUE. I'd like to calculate a linear regression for each customer and put the result into a table.
proc sql;
create table CUSTOMERHISTORY as
select
TIME_STAMP
,PURCHASE_VALUE
,CUSTOMER_ID
from <my data source>
;quit;
The table is quite large; it would be best if it didn't have to be loaded into RAM prior to computation.
I tried
proc reg
data = CUSTOMERHISTORY;
model PURCHASE_VALUE=TIME_STAMP;
outest = OUTTABLE;
by CUSTOMER_ID;
but it never wrote anything to the OUTTABLE. (I found parameter outest in http://support.sas.com/documentation/cdl/en/statug/63033/HTML/default/viewer.htm#statug_reg_sect007.htm )
According to the documentation you link to, outest is an option that you should specify on the proc reg statement. So to get that specific output, your code should look like this:
proc reg
data = CUSTOMERHISTORY
outest = OUTTABLE;
model PURCHASE_VALUE=TIME_STAMP;
by CUSTOMER_ID;
run;
Note that there is no semicolon between data = ... and outest = ....
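With BY processing, OUTTABLE will contain one row per CUSTOMER_ID, with the fitted slope stored in a column named after the regressor (TIME_STAMP here). A small follow-up sketch, assuming CUSTOMERHISTORY still needs sorting by CUSTOMER_ID; noprint suppresses the per-customer listing output, which you probably don't want with many customers:
proc sort data=CUSTOMERHISTORY;
   by CUSTOMER_ID;
run;

proc reg data=CUSTOMERHISTORY outest=OUTTABLE noprint;
   model PURCHASE_VALUE = TIME_STAMP;
   by CUSTOMER_ID;
run;

/* keep just the intercept and slope per customer */
data customer_slopes;
   set OUTTABLE(keep=CUSTOMER_ID Intercept TIME_STAMP);
run;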