Can't use LAG function in Proc SQL in SAS - sas

I have created proc sql query in SAS program, but need to use LAG function and it tell me it can't be used in proc sql, just in data step.
Code:
proc sql;
CREATE TABLE agg_table AS
SELECT USER, MAX(TIME) AS LAST_TIME, SUM(BONUS) AS BONUS_SUM, LAG(EXPDT) AS EXPDT_LAG FROM WORK.MY_DATA GROUP BY USER_ID;
So, I don't how to combine proc sql and datastep into one query to get one table as an output?
Or maybe there is a better approach to the whole problem?
Thanks

PROC SQL does not have a concept of rows the same way the datastep does. SQL may process rows in any order, not necessarily sequential, and may use hash tables, parallel processing, or various binary tree and similar methods to process its query; and the same query may be processed in different methods. Thus lag is not usable in SQL, nor are diff or other functions that expect row data.
It's unclear from your question what exactly you're doing, so it's not really possible to give a direct answer how to do this separately; but you may be able to accomplish this entirely in one datastep, or you may combine a datastep and a SQL query, or two datasteps. You can perform the lag in a prior datastep or a view, then the rest in SQL; or you may use a DoW loop datastep to perform the max/sum elements.

Related

The processing of large data sets in sas

I am looking for solutions or ideas how to speed the processing of large data sets in sas.
What would you recommend?
What is better data step or proc sql procedure?
Speeding up your data processing depends on where your data is saved.
Your data can be either in:
SAS Table,
Database Table (Miscrosfot SQL, Oracle, DB2, MYSQL, ..
etc.)
Use SAS Data Step when:
You are querying/processing SAS tables,
You want to do iterative
processing (ex. retaining values or using arrays).
Use Proc SQL when:
You are querying a large Database table,
You can do a SQL "Pass Through" where you send SQL code to be
executed on the DB server and only the output is sent to SAS (instead
of bringing the entire tables through the network to SAS and then filter it),
You want to query SAS Tables but prefer SQL joins to data step merges.
Another topic you should consider is efficiency programming; where you are optimising your query and look-ups.
I find Proc SQL to be better for my use cases. We may need some more specifics on the size and variety of data your trying to join/export etc.
Give us some info on that and we can try to help.
Tips:
Limit the fields your pulling over
Subset data
Anecdotally from my experience Proc SQL seems faster.
Here are two tips on speeding up queries with Proc SQL:
In general, you want to rule out as much data as possible when querying. If you are usingProc SQL, the order of the restrictions in the where clause matters. Put the most restrictive parts first.
For example, if I'm querying a database for teachers with the last name "JONES", that were hired after Jan 2005, I would structure my where clause like this: where last_name = 'JONES' and hire_date > 200501 I would do this because last name is likely to exclude more records than the hire date restriction.
When possible, don't use Select * instead, list out the specific columns that you need. Remember, even if you are doing a calculation with a column, you don't have to include that column in your select statement.
Here is a very useful resource for understanding how to use proc sql efficiently. I recommend reading it in it's entirety if you do a lot of work with large data sets in SAS.
http://www2.sas.com/proceedings/sugi29/127-29.pdf

Filter a SAS dataset to contain only identifiers given in a list

I am working in SAS Enterprise guide and have a one column SAS table that contains unique identifiers (id_list).
I want to filter another SAS table to contain only observations that can be found in id_list.
My code so far is:
proc sql noprint;
CREATE TABLE test AS
SELECT *
FROM data_sample
WHERE id IN id_list
quit;
This code gives me the following errors:
Error 22-322: Syntax error, expecting on of the following: (, SELECT.
What am I doing wrong?
Thanks up front for the help.
You can't just give it the table name. You need to make a subquery that includes what variable you want it to read from ID_LIST.
CREATE TABLE test AS
SELECT *
FROM data_sample
WHERE id IN (select id from id_list)
;
You could use a join in proc sql but probably simpler to use a merge in a data step with an in= statement.
data want;
merge oneColData(in = A) otherData(in = B);
by id_list;
if A;
run;
You merge the two datasets together, and then using if A you only take the ID's that appear in the single column dataset. For this to work you have to merge on id_list which must be in both datasets, and both datasets must be sorted by id_list.
The problem with using a Data Step instead of a PROC SQL is that for the Data step the Data-set must be sorted on the variable used for the merge. If this is not yet the case, the complete Data-set must be sorted first.
If I have a very large SAS Data-set, which is not sorted on the variable to be merged, I have to sort it first (which can take quite some time). If I use the subquery in PROC SQL, I can read the Data-set selectively, so no sort is needed.
My bet is that PROC SQL is much faster for large Data-sets from which you want only a small subset.

SAS: Track progress inside of a Proc Sql statement

I found this piece of code online
data _null_;
set sashelp.class;
if mod(_n_,5)=0 then
rc = dosubl(cats('SYSECHO "OBS N=',_n_,'";'));
s = sleep(1); /* contrived delay - 1 second on Windows */
run;
I would like to know if you had any idea of how to adapt this piece to a proc sql statement, so I could track the progress of a long query...
For example
proc sql;
create table test as
select * from work.mytable
where mycolumn="thisvalue";
quit;
and somewhere in the statement above we would include the
rc = dosubl(cats('SYSECHO "OBS N=',_n_,'";'));
You wouldn't be able to directly check the progress of a SQL query, unfortunately (if it's operating on SAS datasets, anyway), except by monitoring the physical size of the table (you can do a directory listing of your WORK directory, or depending on how it's building the table, the Utility directory). However, it may or may not be linear; SQL might, for example, use a hash strategy which would not necessarily take up disk space until it was fairly close to being done.
For SQL, you're best off looking at the query plan to tell how long something's going to take. There are several guides out there, such as The SQL Optimizer Project, which explains the _METHOD and _TREE options among other things.

how to get timing information of a data step query

I just wanted to know like in proc sql we define stimer option.
The PROC SQL option STIMER | NOSTIMER specifies whether PROC SQL writes timing information for each statement to the SAS log, instead of writing a cumulative value for the entire procedure. NOSTIMER is the default.
Now in same way how to specify timing information in data set step. I am not using proc sql step
data h;
select name,empid
from employeemaster;
quit;
PROC SQL steps individually are effectively separate data steps, so in a sense you always get the identical information from SAS. What you're asking is effectively how to find out how long 'select name' takes versus 'empid'.
There's not a direct way to get the timing of an individual statement in a data step, but you could write data step code to find out. The problem is that the data step is executed row-wise, so it's really quite different from the PROC SQL STIMER details; almost nothing you do in a data step will take very long by itself, unless you are doing something more complex like a hash table lookup. What takes long is writing out the data first, and reading in the data second.
You do have some options for troubleshooting long data steps, if that's your concern. OPTIONS MSGLEVEL=I will give you information about index usage, merge details, etc., which can be helpful if you aren't sure why it is taking a long time to do certain things (see http://goo.gl/bpGWL in SAS documentation for more info). You can write your own timestamp:
data test;
set sashelp.class sashelp.class;
_t=time();
put _t=;
run;
Odds are that won't show you much of use since most data step iterations won't take very long but if you are doing something fancy it might help. You could also use conditional statements to only print the time at certain intervals - when at FIRST.ID for example in a process that works BY ID;.
Ultimately though the information you already get from notes is what is most useful. In PROC SQL you need the STIMER information because SQL is doing several things at once, while SAS lets/makes you do everything out step-wise. Example:
PROC SQL;
create table X as select * from A,B where A.ID=B.ID;
quit;
is one step - but in SAS this would be:
proc sort data=a; by ID; run;
proc sort data=b; by ID; run;
data x;
merge a(in=a) b(in=b);
by id;
if a and b;
run;
For that you would get information on the duration of each of those steps (the two sorts and the merge) in SAS, which is similar to what STIMER would tell you.
No way.
PROC SQL STIMER logs timing for each separately executable SQL statement/query.
In data step, as you may know, the data step looping occurs, observation per observation, so the data step statement timing would be something like per observation, let's say transactional. Anyway this would not describe all the details where the time is being spent - waiting for disk reads, writes, etc.
So I guess this won't be very usable. In general, SAS performance is I/O driven.

Efficient way to merge/join 2 large datasets in SAS 8.2

I have tried the following options with unacceptable response times - creating index 'key' did not help either (NOTE:duplicate'keys'in both datasets):
data a;
merge b
c;
by key
if b;
run;
=== OR ===
proc sql;
create a
as select *
from b
left outer join c
on b.key;
quit;
You should first sort the two datasets before merging them. This is what will give the performance. using an index when you have to scan the whole table to have a result is usually slower then presorting the datasets and merging them.
Be sure to trim your dataset as much as possible. Sort your dataset before the data step or proc sql. Also, I'm not 100% if it matters, but ANSI SQL would be proc sql; create a as select * from b left outer join c on b.key=C.KEY; quit;
Creating a key is the fastest way to get the datasets ready for joining. Sorting them can take just as long as merging them, if not longer, but is still a good idea.
AFHood's suggestion of trimming them is a good. Can you possibly run them through a PROC SUMMARY? Are there any columns you can drop? Any of these will reduce the size of your dataset and make the merging faster.
None of these methods may work however. I routinely merge files of several million rows and it can take a while.
You might try the SQL merge. I don't know if it would be faster for your needs but I've found SQL to be much more efficient than a regular SAS merge. Plus, once you realize what you can do with SQL, manipulating datasets becomes easier!
Don't use a data step merge to do this.
With duplicate keys in both datasets the result will be wrong.
The only way to do this is with a
Proc SQL;
Create table newdata
as select firsttable.aster, secondtable.aster
from table1 as firsttable
inner join table2 as secondtable
on (firstable.keyfield = secondtable.keyfield);
quit;
If you have more than one keyfield the join order should be least match field first to greatest match field last. SAS has a bad habbit of creating a tempory dataset containing all possible matches and the siveing it down from there. Can blow out your tempory space allocation and slow everyting down.
If you still wish to use a DATA Step then you need to get rid of the duplicate keys out of one of the datasets.