Say I have two tables in Teradata. One of them, Reports, looks like this:
Year Report_ID BAD_PART_NUMBERS
2015 P12568 6989820
2015 P12568 1769819
2015 P12568 1988700
2015 P12697 879010
2015 P12697 287932
2015 P12697 17902
and the other table, Orders:
order_no Customer_id Purchase_dt PART_NUM PART_DESC
265187 B1792 3/4/2016 02-6989820 gfsahj
1669 B1792 7/8/2017 01-32769237 susisd
1692191 B1794 5/7/2015 03-6989820 gfsahj
16891 B1794 3/24/2016 78-1769819 ysatua
62919 B1794 2/7/2017 15-3287629 at8a9s7d
One of my objectives is to find, for every Report_ID, the part number most frequently purchased after a bad part was purchased.
For one Report_ID I wrote this code:
%let REPORT_ID=('P12568');
Proc SQL;
connect to teradata as tera1 (server='XXX' user=&userid pwd=&pwd Database
="XXXXX" );
create table BAD_PART as
select * from connection to tera1
(
select REPORT_ID,BAD_PART_NUMBERS from REPORTS where REPORT_ID=&REPORT_ID
/* other where conditions */
group by 1,2
)
;
disconnect from tera1;
quit;
/*creating a PART_NUM macro*/
PROC SQL NOPRINT;
SELECT quote(cats('%',BAD_PART_NUMBERS),"'")
INTO :PART_NUM separated by ", "
FROM BAD_PART ;
QUIT;
%put macro variable PART_NUM:&PART_NUM;
/*FINDING SECONDARY PART INFORMATION*/
proc sql;
connect to teradata as tera1 (server='XXXX' user=&userid pwd=&pwd Database
="XXXX" );
create table SEC_PART as
select * from connection to tera1
(
SELECT &REPORT_ID as REPORT_ID, PART_NUM, PART_DESC, COUNT(DISTINCT ORDER_NO)
as frequency
from (
select Customer_id, Min(Purchase_dt) as FIRST_BAD_PART_PURCHASE
from ORDERS
where (PART_NUM like any (&PART_NUM))
group by 1 ) A
left join (
select Customer_id, Purchase_dt, PART_NUM, PART_DESC, ORDER_NO
from ORDERS group by 1,2,3,4,5 ) B
on A.Customer_id = B.Customer_id
AND FIRST_BAD_PART_PURCHASE < Purchase_dt
group by 1,2,3
having frequency > 0
order by frequency desc
)
;
disconnect from tera1;
quit;
/*---various PROC SQL and Data steps*/
Ultimately, I have a dataset like this:
Report_ID MONTHS VALUE
P12568 0 21
P12568 1 34
P12568 2 40.38
P12568 3 67.05
P12568 4 100.08
where MONTHS is a continuous variable representing months of exposure. The final table needs to be appended for every Report_ID. Suppose I am interested in all Report_IDs for a given year, e.g.:
select REPORT_ID from reports where year='2015'.
Right now my code handles one Report_ID at a time; how can I run it for more than one at once?
Try performing the entire query in Teradata. Instead of constructing the ANY list, join to the bad-part query and use concatenation to construct the LIKE pattern.
Deep in the query, try having a
JOIN ( select part_num bad_part_num from <bad part_num query> ) bad_list ON
PART_NUM like '%' || bad_list.bad_part_num
instead of the
where (PART_NUM like any (&PART_NUM))
wherein the ANY list is a list of %-prefaced bad part numbers constructed via a SAS PROC SQL (INTO :) macro variable.
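As a sketch of what that join looks like, here is the same pattern run against SQLite through Python's sqlite3 module (SQLite standing in for Teradata; table and column names are taken from the question):

```python
import sqlite3

# SQLite stands in for Teradata here; the point is the LIKE-join pattern.
con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("CREATE TABLE reports (report_id TEXT, bad_part_numbers TEXT)")
cur.executemany("INSERT INTO reports VALUES (?, ?)",
                [("P12568", "6989820"), ("P12568", "1769819")])
cur.execute("CREATE TABLE orders (order_no TEXT, part_num TEXT)")
cur.executemany("INSERT INTO orders VALUES (?, ?)",
                [("265187", "02-6989820"), ("1669", "01-32769237"),
                 ("16891", "78-1769819")])

# Join on a concatenated LIKE pattern instead of a macro-built ANY list.
rows = cur.execute("""
    SELECT o.order_no, o.part_num
    FROM orders o
    JOIN (SELECT bad_part_numbers AS bad_part_num FROM reports) bad_list
      ON o.part_num LIKE '%' || bad_list.bad_part_num
    ORDER BY o.order_no
""").fetchall()
# Only the two orders whose part numbers end in a bad part number match.
```

In Teradata the `'%' || bad_list.bad_part_num` concatenation works the same way, so no macro variable needs to be built on the SAS side at all.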
I'm very new to SAS and trying to learn it. I have a problem statement where I need to extract two files from a location and then perform joins. Below is a detailed explanation of what I'm trying to achieve in a single proc sql statement:
There are two tables, table a (columns - account#, sales, transaction, store#) and table b (columns - account#, account zipcode) and an excel file (columns - store# and store zipcode). I need to first join these two tables on column account#.
Next step is to join their resulting values with the excel file on column store# and also add a column called as 'distance', which calculates the distance between account zipcode and store zipcode with the help of zipcitydistance(account zipcode, store zipcode) function. Let the resulting table be called "F".
Next I want to use a case statement to create a column of distance buckets based on the distance from the above query, e.g.:
select
case
when distance<=5 then "<=5"
when distance between 5 and 10 then "5-10"
when distance between 10 and 15 then "10-15"
else ">=15"
end as distance_bucket,
sum(transactions) as total_txn,
sum(sales) as total_sales
from F
group by 1
So far, below is the code that I have written:
data table_a;
set xyzstore.filea;
run;
data table_b;
set xyzstore.fileb;
run;
proc import datafile="/location/file.xlsx"
out=filec dbms=xlsx replace;
run;
proc sql;
create table d as
select a.store_number, b.account_number, sum(a.sales) as sales, sum(a.transactions) as transactions, b.account_zipcode
from table_a as a left join table_b as b
on a.account_number=b.account_number
group by a.store_number, b.account_number, b.account_zipcode;
quit;
proc sql;
create table e as
select d.*, c.store_zipcode, zipcitydistance(d.account_zipcode, c.store_zipcode) as distance
from d inner join filec as c
on d.store_number=c.store_number;
quit;
proc sql;
create table final as
select
case
when distance<=5 then "<=5"
when distance between 5 and 10 then "5-10"
when distance between 10 and 15 then "10-15"
else ">=15"
end as distance_bucket,
sum(transactions) as total_txn,
sum(sales) as total_sales
from e
group by 1;
quit;
How can I write the above lines of code in a single proc sql statement?
The way you are currently doing it is more readable and the preferred way to do it. Turning it into a single SQL statement will not yield any significant performance gains and will make it harder to troubleshoot in the future.
To do a little cleanup, you can remove the two data steps and join directly on those files:
create table d as
...
from xyzstore.filea left join xyzstore.fileb
...
quit;
You could also use a format instead to clean up the CASE statement.
proc format;
value storedistance
low - 5 = '<=5'
5< - 10 = '5-10'
10< - 15 = '10-15'
15< - high = '>=15'
other = ' '
;
run;
...
proc sql;
create table final as
select put(distance, storedistance.) as distance_bucket
, sum(transactions) as total_txn
, sum(sales) as total_sales
from e
group by calculated distance_bucket
;
quit;
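One thing worth checking with either form is the boundary behavior: in the CASE expression a distance of exactly 5 hits the first WHEN, 10 stays in '5-10', and 15 stays in '10-15', because the first matching WHEN wins. A small Python sketch of that mapping (not part of the SAS flow, just a sanity check of the bucket edges):

```python
# Mirrors the CASE expression's buckets: <=5, then 5-10, 10-15, above 15.
# Because the first matching branch wins, each boundary value lands in
# the lower bucket, which is what the format ranges low-5, 5<-10, 10<-15
# should also produce.
def bucket(distance):
    if distance <= 5:
        return "<=5"
    elif distance <= 10:
        return "5-10"
    elif distance <= 15:
        return "10-15"
    return ">=15"

buckets = [bucket(d) for d in (5, 5.1, 10, 15, 15.1)]
```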
If you did want to turn your existing code into one big SQL statement, it would look like this:
proc sql;
create table final as
select CASE
when(distance <= 5) then '<=5'
when(distance between 5 and 10) then '5-10'
when(distance between 10 and 15) then '10-15'
else '>=15'
END as distance_bucket
, sum(transactions) as total_txn
, sum(sales) as total_sales
/* Join 'table d' with c */
from (select d.*
, c.store_zipcode
, zipcitydistance(d.account_zipcode, c.store_zipcode) as distance
/* Create 'table d' */
from (select a.store_number
, b.account_number
, sum(a.sales) as sales
, sum(a.transactions) as transactions
, b.account_zipcode
from xyzstore.filea as a
LEFT JOIN
xyzstore.fileb as b
ON a.account_number = b.account_number
group by a.store_number
, b.account_number
, b.account_zipcode
) as d
INNER JOIN
filec as c
ON d.store_number = c.store_number
)
group by calculated distance_bucket
;
quit;
While more compact, it is more difficult to troubleshoot. You lose those in-between steps that can identify if there's an issue with the data. Suppose the store distances look incorrect one day: you'd need to unpack all of those SQL statements, put them into individual PROC SQL blocks and run them. Every time you run into a problem you will need to do this. If you have them separated out, you'll use a negligible amount of temporary disk space and have a much easier time troubleshooting.
When dealing with raw data, especially data that updates regularly, assume something will go wrong one day and you'll need to review it in-depth. Sometimes the wrong file gets sent. Sometimes an upstream issue occurs that sends corrupted data. Any time that happens, you'll need to dig in and find out if it's a problem with your process or their process. Making easy-to-troubleshoot code will speed up the solution for everyone.
I have a state table sitting in Teradata with 11 million rows and a unique row for every ID. I run logic in SAS such that if a column (class) is updated, the Teradata table is updated with the new record. The table structure in Teradata, and of the table generated in SAS, is:
id   class   updated_at
1    X       date1
2    Y       date2
If the class is updated in the SAS created table for an id, the class and updated_at columns are updated in Teradata (more columns can be updated as well). Moreover, if a new record (id) is added, it is inserted into Teradata.
I want to achieve this in SAS, without having to push the SAS table into Teradata, and use merge into. Every table created in SAS will be 11 million+ rows.
To update a record manually, I can just use this:
proc sql;
update TD.TABLE_IN_TERADATA
set class = 'Z'
where updated_at = date3;
quit;
As far as I understand, you have a Teradata master table with all your data, and new SAS tables with data to update that master data.
To generate some sample data (SAS tables only; I don't have Teradata at hand...):
data test_data;
input id 2. class $2. updated_at date9.;
format updated_at date9.;
datalines;
1 X 01jan2020
2 Y 12feb2020
3 Z 01jan2020
4 X 16mar2020
5 Y 23jun2020
6 Z 23jun2020
7 X 31dec2020
;
run;
data sas_data;
input id 2. class $2. ;
format updated_at date9.;
updated_at=today();
datalines;
1 Z
3 Z
5 Z
7 Z
8 Y
9 Z
;
run;
So, we have changes in id=1, 5 and 7, whereas 3 is unchanged and 8 and 9 are new.
In pure SAS code you can use a data step with update to update and insert in one step, see here:
/* Any data row without change has to be eliminated, */
/* here id=3, otherwise updated_at will be updated there */
proc sql;
create table changed_data as
select s.*
from sas_data s
left join test_data t
on s.id eq t.id
where s.class ne t.class;
quit;
/* in sas update and insert via data-update-step */
data test_data1;
update test_data changed_data;
by id;
run;
As noted in the code comments, the first SQL step is only needed if you don't want updated_at to be updated for id=3, where nothing changed. If you do want it updated as well, you can remove that step.
By the way, a precondition here is that the table is sorted by id or that there is an index on id on the table.
But the SAS data step may not work against the Teradata table. In that case you can use the following steps in "pure" SQL (starting with the first step above to generate the table changed_data), plus an append step:
/* Alternative steps in pure SQL */
/* Step1: SQL-update, no insert */
proc sql;
update test_data t
set class=(select class from changed_data s where t.id=s.id),
updated_at=(select updated_at from changed_data s where t.id=s.id)
where id in (select id from changed_data)
;
quit;
/* Preparation for step2: extract completely new data */
proc sql;
create table new_data as
select s.*
from sas_data s
where id not in (select id from test_data)
;
quit;
/* Step2: insert new data via proc-append */
proc append base=test_data
data=new_data;
quit;
Generally, performance might be poor with big data sets. In that case, consider using pass-through to the database and the Teradata MERGE ("upsert"), but then you will have to move your SAS data into Teradata.
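For reference, the upsert logic that Teradata's MERGE performs in a single statement can be sketched against SQLite through Python's sqlite3 module (SQLite's `INSERT ... ON CONFLICT` plays the role of MERGE here; data follows the sample above):

```python
import sqlite3

# SQLite stands in for Teradata; ON CONFLICT plays the role of MERGE (upsert).
con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("CREATE TABLE test_data "
            "(id INTEGER PRIMARY KEY, class TEXT, updated_at TEXT)")
cur.executemany("INSERT INTO test_data VALUES (?, ?, ?)",
                [(1, "X", "01JAN2020"), (3, "Z", "01JAN2020")])

# One statement: update class/updated_at on a key collision, insert otherwise.
changes = [(1, "Z", "23JUN2020"), (8, "Y", "23JUN2020")]
cur.executemany("""
    INSERT INTO test_data (id, class, updated_at) VALUES (?, ?, ?)
    ON CONFLICT(id) DO UPDATE SET
        class = excluded.class,
        updated_at = excluded.updated_at
""", changes)

rows = cur.execute("SELECT * FROM test_data ORDER BY id").fetchall()
# id=1 is updated in place, id=3 is untouched, id=8 is inserted.
```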
Given the dataset below, I am trying to find new users vs. repeated users.
DATE ID Unique_Event
20200901 a12345 1
20200902 a12345 1
20200903 b12345 1
20200903 a12345 1
20200904 c12345 1
In the above dataset, since a12345 appeared on multiple dates, it should be counted as a "repeated" user, whereas b12345 only appeared once, so it is a "new" user. Please note this is only sample data; the actual data is quite large. I tried the code below, but I am not getting the correct counts. Ideally, total_num_users - num_new_users should be the repeated users. Am I missing something?
Expected Output:
Month new_users repeated_users
9 2 1
Code:
data user_events;
set user_events;
new_date=input(date,yymmdd10.);
run;
proc sql;select month(new_date) as mm,
count(distinct vv.id) as total_num_users,
count(distinct case when v.new_date = vv.minva then v.id end) as num_new_users,
(count(distinct vv.id) - count(distinct case when v.new_date = vv.minva then id end)
) as num_repeated_users
from user_events v inner join
(select t.id, min(new_date) as minva
from user_events t
group by t.id
) vv
on v.id = vv.id
group by 1
order by 1;quit;
In a sub-select, count the number of distinct DATE values for each ID to determine its new/repeated status. The aggregate computations over all IDs are then made from that sub-select.
proc sql;
create table freq as
select
count(*) as id_count
, sum (status='repeated') as id_repeated_count /* sum counts a logic eval state */
, sum (status='new') as id_new_count
from
( select
id
, case
when count(distinct date) > 1 then 'repeated'
else 'new'
end as status
from
user_events
group by
id
) as statuses
;
quit;
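To verify the shape of that query, here is the same sub-select-then-aggregate pattern run against SQLite through Python's sqlite3 module, using the sample data from the question:

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("CREATE TABLE user_events (date TEXT, id TEXT)")
cur.executemany("INSERT INTO user_events VALUES (?, ?)",
                [("20200901", "a12345"), ("20200902", "a12345"),
                 ("20200903", "b12345"), ("20200903", "a12345"),
                 ("20200904", "c12345")])

# Classify each id in a sub-select, then aggregate the statuses.
# (Explicit CASE replaces SAS's sum(status='new') boolean shorthand.)
row = cur.execute("""
    SELECT count(*) AS id_count,
           sum(CASE WHEN status = 'repeated' THEN 1 ELSE 0 END) AS id_repeated_count,
           sum(CASE WHEN status = 'new' THEN 1 ELSE 0 END) AS id_new_count
    FROM (SELECT id,
                 CASE WHEN count(DISTINCT date) > 1 THEN 'repeated'
                      ELSE 'new' END AS status
          FROM user_events
          GROUP BY id) AS statuses
""").fetchone()
# a12345 appears on three dates (repeated); b12345 and c12345 are new.
```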
An alternative solution not using proc sql (though I'm aware you tagged this with "proc sql").
data final;
set user_events;
Month=month(new_date);
run;
proc sort data=final; by Month ID; run;
data final;
set final;
by Month ID;
if first.Month then do;
new_users=0;
repeated_users=0;
end;
if last.ID then do;
if first.ID then
new_users+1;
else
repeated_users+1;
end;
if last.Month then
output;
keep Month new_users repeated_users;
run;
Since you are using proc sql, this is a sql question, not a SAS question.
Try something like:
proc sql;
select ID,count(Unique_Event)
from <that table>
group by ID
order by ID
quit;
I have created a stored process in SAS that prompts the user to select a month/year combination which looks like this 2015_10. Then from the next box they can click on a calendar and select a startdate which is a timestamp and an enddate also a timestamp.
I would like to combine this into one step, where the user only selects the start and end date. However, my source table is in SQL Server, and the tables are partitioned by month and named like datatabel_2015_10, where the last two digits represent the month. Once the user selects the month, a PROC SQL query pulls from that table, and a second query then keeps only the rows that fall between the start date and end date. Those bounds are time1 and time2, stored as character strings in MS SQL Server that look like 30JAN2015:19:52:29.
How can I code this up so as to eliminate the month/year prompt and have only two selections, namely startdatetime and enddatetime, and still get the right query results?
Concatenating the monthly tables is not an option because they are huge and it runs forever, even if I use a pass-through query.
Please help.
Thanks
LIBNAME SWAPPLI ODBC ACCESS=READONLY read_lock_type=nolock noprompt="driver=SQL server; server=XXX; Trusted Connection=yes; database=XXX" schema='dbo';
proc sql;
create table a as
select
startdate_time,
enddate_time
from SWAPPLI.SQL_DB_2015_10;
quit;
proc sort data=a out=b;
by startdate_time enddate_time;
where enddate_time between "&startdate"dt and "&enddate"dt;
run;
You can eliminate a step by asking directly for the begin and end timestamps.
Then you can deduce the YEAR and MONTH from the timestamp selected.
For &startdate = 30JAN2015:19:52:29, you can create two macro variables from two substrings, like this:
%let startdate = 30JAN2015:19:52:29;
%let MONTH=%SUBSTR("&startdate",4,3);
%let YEAR=%SUBSTR("&startdate",7,4);
%PUT MONTH=&MONTH;
%PUT YEAR=&YEAR;
Result :
MONTH=JAN
YEAR=2015
Then you can create a %if condition to match JAN with 01, FEB with 02, etc.
Here MONTH will become 01.
So you no longer need to ask for the information twice.
Then you can select your dataset by doing this :
proc sql;
create table a as
select
startdate_time,
enddate_time
from SWAPPLI.SQL_DB_&YEAR._&MONTH.;
quit;
proc sort data=a out=b;
by startdate_time enddate_time;
where enddate_time between "&startdate"dt and "&enddate"dt;
run;
You should probably restrict the selection to a single month and not allow a selection spanning several months, like start=01JAN and end=01MAR, because the query will hit only the dataset SQL_DB_2017_01 and will not take end=01MAR into account.
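The same derivation is easy to sanity-check outside SAS. A quick Python sketch that parses the timestamp and builds the partition-table suffix directly (the `SQL_DB_` name pattern is taken from the answer above), which also skips the month-name-to-number lookup entirely:

```python
from datetime import datetime

# Parse the SAS-style datetime string and derive the partition table name.
startdate = "30JAN2015:19:52:29"
dt = datetime.strptime(startdate, "%d%b%Y:%H:%M:%S")  # %b matches JAN, FEB, ...
table = f"SQL_DB_{dt.year}_{dt.month:02d}"
# -> SQL_DB_2015_01
```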
I have a macro inspired by "PROC SQL by Example" that finds duplicate rows based on a single column/field:
data have ;
input name $ term $;
cards;
Joe 2000
Joe 2000
Joe 2002
Joe 2008
Sally 2001
Sally 2003
; run;
%MACRO DUPS(LIB, TABLE, GROUPBY) ;
PROC SQL ;
CREATE TABLE DUPROWS AS
SELECT &GROUPBY, COUNT(*) AS Duplicate_Rows
FROM &LIB..&TABLE
GROUP BY &GROUPBY
HAVING COUNT(*) > 1
ORDER BY Duplicate_Rows;
QUIT;
%MEND DUPS ;
%DUPS(WORK,have,name) ;
proc print data=duprows ; run;
I would like to extend this to look for duplicates based on multiple columns (Rows 1 and 2 in my example), but still be flexible enough to deal with a single column.
In this case it would run the code:
proc sql ;
create table duprows as select name,term,count(*) as Duplicate_Rows
from work.have
group by name,term
HAVING COUNT(*) > 1
;quit;
To produce:
name   term   Duplicate_Rows
Joe    2000   2
To include an arbitrary number of fields to group on, you can list them all in the groupby macro parameter, but the list must be comma-delimited and surrounded by %quote(). Otherwise SAS will see the commas and think you're providing more macro parameters.
So in your case, your macro call would be:
%dups(lib = work, table = have, groupby = %quote(name, term));
Since &groupby is included in the select and group by clauses, all fields listed will appear in the output and will be used for grouping. This is because when &groupby resolves, it becomes the text name, term.
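The generated query is easy to check outside SAS as well; here is the same multi-column duplicate search run against SQLite through Python's sqlite3 module, with the sample data from the question:

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("CREATE TABLE have (name TEXT, term TEXT)")
cur.executemany("INSERT INTO have VALUES (?, ?)",
                [("Joe", "2000"), ("Joe", "2000"), ("Joe", "2002"),
                 ("Joe", "2008"), ("Sally", "2001"), ("Sally", "2003")])

# The query the macro generates for groupby = %quote(name, term).
duprows = cur.execute("""
    SELECT name, term, count(*) AS Duplicate_Rows
    FROM have
    GROUP BY name, term
    HAVING count(*) > 1
    ORDER BY Duplicate_Rows
""").fetchall()
# Only the (Joe, 2000) pair occurs more than once.
```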