Processing large data sets in SAS

I am looking for solutions or ideas on how to speed up the processing of large data sets in SAS.
What would you recommend?
Which is better: a DATA step or PROC SQL?

Speeding up your data processing depends on where your data is stored.
Your data can be in either:
a SAS table, or
a database table (Microsoft SQL Server, Oracle, DB2, MySQL, etc.).
Use a SAS DATA step when:
You are querying/processing SAS tables,
You want to do iterative processing (e.g. retaining values or using arrays) - see the sketch after this list.
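A minimal hedged sketch of that second point, assuming an invented WORK.SALES table with REGION, AMOUNT, and Q1-Q4 columns:

proc sort data=work.sales;
  by region;
run;

data work.sales_running;
  set work.sales;
  by region;
  retain running_total 0;                /* value carried across iterations */
  if first.region then running_total = 0;
  running_total = running_total + amount;
  array qtr{4} q1-q4;                    /* apply the same fix to four columns */
  do i = 1 to 4;
    if qtr{i} = . then qtr{i} = 0;       /* replace missing with zero */
  end;
  drop i;
run;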
Use PROC SQL when:
You are querying a large database table,
You can do an SQL "pass-through", where you send SQL code to be executed on the DB server and only the output is sent back to SAS (instead of pulling the entire tables across the network into SAS and then filtering them) - see the sketch after this list,
You want to query SAS tables but prefer SQL joins to DATA step merges.
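A hedged sketch of the pass-through idea; the Oracle engine, connection details, and the ORDERS table are placeholders:

proc sql;
  connect to oracle (user=myuser password="XXXX" path="dbserver");
  /* The inner query runs on the database server; only its result set
     comes back to SAS */
  create table work.big_orders as
  select * from connection to oracle (
    select customer_id, order_date, amount
    from orders
    where amount > 10000
  );
  disconnect from oracle;
quit;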
Another topic you should consider is efficient programming: optimising your queries and look-ups.

I find PROC SQL to be better for my use cases. We may need some more specifics on the size and variety of the data you're trying to join/export, etc.
Give us some info on that and we can try to help.
Tips (a quick sketch follows):
Limit the fields you're pulling over
Subset the data early
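A hedged illustration of both tips; the ORALIB libref, the TRANSACTIONS table, and its columns are invented for the example:

data work.subset;
  /* Pull only the needed columns (KEEP=) and subset rows as early as
     possible (WHERE=) so less data crosses the network */
  set oralib.transactions(keep=account_id txn_date amount
                          where=(txn_date >= '01JAN2020'd));
run;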

Anecdotally, from my experience PROC SQL seems faster.
Here are two tips for speeding up queries with PROC SQL:
In general, you want to rule out as much data as possible when querying. If you are using PROC SQL, the order of the restrictions in the WHERE clause matters: put the most restrictive parts first.
For example, if I'm querying a database for teachers with the last name "JONES" that were hired after Jan 2005, I would structure my WHERE clause like this: where last_name = 'JONES' and hire_date > 200501. I would do this because the last name is likely to exclude more records than the hire date restriction.
When possible, don't use SELECT *; instead, list out the specific columns that you need. Remember, even if you are doing a calculation with a column, you don't have to include that column in your SELECT statement.
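Putting both tips together, the teachers example might look like this; the DBLIB libref, TEACHERS table, and column names are assumptions for illustration:

proc sql;
  create table work.jones_hires as
  select teacher_id, last_name, hire_date   /* only the columns needed */
  from dblib.teachers
  where last_name = 'JONES'                  /* most restrictive condition first */
    and hire_date > 200501;
quit;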
Here is a very useful resource for understanding how to use PROC SQL efficiently. I recommend reading it in its entirety if you do a lot of work with large data sets in SAS.
http://www2.sas.com/proceedings/sugi29/127-29.pdf

Related

Using merge in Power Query while keeping native query

I'm trying to reduce my dataset of 1,000,000 records to only the subset I need (+/- 500) by creating an Inner Join to a different table. Unfortunately, it seems that Power Query drops the "native query" and loads the entire dataset before reducing it by merging it with a related table. I have no access to the database unfortunately, otherwise I would have written the SQL myself. Is there a way to make Merge work with a native SQL query?
Thanks
I would first check that your "related table" query can run as a native query - right-click on its last step and check whether View Native Query is enabled.
If that's the case, then it may be due to the Join Kind in the Merge Queries step. I've noticed that against SQL Server data sources, Join Kinds other than the default Left Outer Join tend to kill the Native Query option.

Can't use LAG function in Proc SQL in SAS

I have created a PROC SQL query in a SAS program, but I need to use the LAG function, and SAS tells me it can't be used in PROC SQL, only in a DATA step.
Code:
proc sql;
  CREATE TABLE agg_table AS
  SELECT USER, MAX(TIME) AS LAST_TIME, SUM(BONUS) AS BONUS_SUM, LAG(EXPDT) AS EXPDT_LAG
  FROM WORK.MY_DATA
  GROUP BY USER_ID;
quit;
So I don't know how to combine PROC SQL and a DATA step into one query to get a single table as the output.
Or maybe there is a better approach to the whole problem?
Thanks
PROC SQL does not have a concept of rows the same way the DATA step does. SQL may process rows in any order, not necessarily sequentially, and may use hash tables, parallel processing, or various binary-tree and similar methods to process its query; and the same query may be processed in different ways. Thus LAG is not usable in SQL, nor are DIF or other functions that depend on row order.
It's unclear from your question what exactly you're doing, so it's not really possible to give a direct answer on how to do this; but you may be able to accomplish it entirely in one DATA step, or you may combine a DATA step and an SQL query, or two DATA steps. You can perform the LAG in a prior DATA step or a view, then do the rest in SQL (a sketch of that option follows); or you may use a DoW-loop DATA step to perform the max/sum elements.
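A hedged sketch of the "LAG in a view, aggregate in PROC SQL" option; grouping by USER_ID, the sort order, and the choice of aggregate for the lagged column are assumptions to adjust to your actual logic:

proc sort data=work.my_data out=work.my_data_sorted;
  by user_id time;
run;

data work.my_data_lagged / view=work.my_data_lagged;
  set work.my_data_sorted;
  by user_id;
  expdt_lag = lag(expdt);
  if first.user_id then expdt_lag = .;   /* do not carry a lag across users */
run;

proc sql;
  create table agg_table as
  select user_id,
         max(time)      as last_time,
         sum(bonus)     as bonus_sum,
         max(expdt_lag) as expdt_lag     /* pick the aggregate your logic needs */
  from work.my_data_lagged
  group by user_id;
quit;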

Reading (even joining) a very large (1.1bn row) table in Enterprise Guide from Teradata

Hopefully you guys can help with what I'm hoping is quite a simple question for those in the know!
I live (well, work) in SAS Enterprise Guide and am trying to perform a simple left join against a table in Teradata.
The table is extremely large (700+ columns, 1.1bn rows) and so far I have been connecting via a LIBNAME statement at the top of my program, followed by the usual PROC SQL to read the data.
The issue I am having is that it is extremely slow. I performed the join successfully using 90 rows on the left table and it took 3 hours to complete. The real table I want to use has something like 15,000 rows.
I have tried to connect via the SQL Pass-Through method, but this throws a hosts file error, which I can't fix due to corporate security limitations.
Has anyone had any experience performing this kind of task?
I should mention that I can run a simple select * query in Teradata SQL Assistant in just over 1 minute (16,666,666 obs/s!), so the limitation must be somewhere between SAS and Teradata, or in SAS itself.
I'm sorry I haven't posted actual code snippets as they're on my work machine but this has been bugging me for ages so thought I'd see if I'm missing any tricks.
Thanks in advance for your help.
So you're joining a SAS data set to a Teradata table and want to return the matching records. You'll want to use SAS's DBMASTER= data set option. It designates which of the tables is larger. By telling SAS this, it knows which table to move.
Here I assume librefs have already been assigned and that the Teradata table is larger--more obs--than the SAS data set.
proc sql threads;
  select tdTable.*
  from sastables.sasTable1, td.tdTable(dbmaster=yes)
  where tdTable.idNum=sasTable1.idNum;
quit;
If by chance your SAS data set is larger, you'll want to use the MULTI_DATASRC_OPT= option. Either google these terms or look in the SAS/Access to Relational Databases manual. It's pretty good.
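For completeness, a hedged sketch of the MULTI_DATASRC_OPT= route; the connection details are placeholders and the librefs follow the example above:

/* MULTI_DATASRC_OPT=IN_CLAUSE asks SAS to turn the join values from the
   smaller table into an IN (...) list passed to the DBMS, so only the
   matching rows are transferred */
libname td teradata server="tdprod" user=myuser password="XXXX"
        multi_datasrc_opt=in_clause;

proc sql;
  select tdTable.*
  from sastables.sasTable1, td.tdTable
  where tdTable.idNum=sasTable1.idNum;
quit;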
Good luck.
Have you considered creating a volatile table in Teradata? Since this is created in your spool allocation you shouldn't need explicit permissions to create the table. Once created you can load the SAS data set into the Volatile table and collect statistics on the table's join columns and filter columns. This will help the optimizer understand the demographics about your "small" table. The volatile table will only persist for the duration of your session and is not accessible across multiple sessions.
Then rewrite your SAS code to push the SQL down to Teradata, joining the large table to your volatile table. The results can be returned to SAS and loaded into another data set (a sketch of the full flow follows the DDL below).
CREATE VOLATILE TABLE MyTable, NO FALLBACK
( ColA SMALLINT NOT NULL,
ColB VARCHAR(10) NOT NULL
) PRIMARY INDEX (ColA)
ON COMMIT PRESERVE ROWS /* This is important */
;
The primary index is how Teradata distributes the data and accesses the data. Tables distributed on the same column will join "AMP local" and will not require a redistribution. This is not always possible, as your primary index selection has to consider even distribution as well as access path. The primary index does not have to be unique, but can be.
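A hedged sketch of how the whole flow might look from SAS; the server name, credentials, libref, and the big table's schema and columns are placeholders, and CONNECTION=GLOBAL / DBMSTEMP=YES are the usual options for sharing one Teradata session so the volatile table stays visible (check them against your site's setup):

/* One shared Teradata session: CONNECTION=GLOBAL keeps the volatile table
   alive across steps, DBMSTEMP=YES lets the libref see volatile tables */
libname td teradata server="tdprod" user=myuser password="XXXX"
        connection=global dbmstemp=yes;

proc sql;
  connect using td as tdx;
  execute (
    CREATE VOLATILE TABLE MyTable, NO FALLBACK
    ( ColA SMALLINT NOT NULL,
      ColB VARCHAR(10) NOT NULL
    ) PRIMARY INDEX (ColA)
    ON COMMIT PRESERVE ROWS
  ) by tdx;
quit;

/* Load the small SAS data set into the volatile table */
proc append base=td.MyTable data=work.small_keys force;
run;

proc sql;
  connect using td as tdx;
  execute (COLLECT STATISTICS COLUMN (ColA) ON MyTable) by tdx;
  /* Join the 1.1bn-row table to the volatile table on the server;
     only the matching rows come back to SAS */
  create table work.results as
  select * from connection to tdx (
    select big.*
    from BigSchema.BigTable big
    join MyTable v on big.ColA = v.ColA
  );
quit;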
Hope this helps.

SAS MDX drillthrough statement

I use SAS WRS sitting on an information map over a cube. My business users want to see the raw data behind each figure on a report. I have set up a drill-through table, but I need to limit the result data set to the measure being queried.
I've come across the "DRILLTHROUGH" option but wondered if someone could tell me whether I should use it directly in the OLAP cube code, create a stored process, or use some other method. I'm not really sure how to use this syntax. Will it serve my purpose? The syntax I'm thinking of is:
DRILLTHROUGH
SELECT ([measures].currentmember) ON COLUMNS,
       ([reporting date].[yqmd].[date]) ON ROWS
FROM [claim_table]
I'm not familiar with SAS, but DRILLTHROUGH is standard MDX and can be used to access the 'raw' data behind the MDX SELECT. There may be more or fewer limitations depending on the actual OLAP product you're using. To limit the number of rows returned (e.g., to 5000), use the syntax:
DRILLTHROUGH MAXROWS 5000 SELECT ...

Proc SQL: How / When does SAS Move the Data

I am a DBA / R user. I just took a job in an office full of SAS users and I am trying to understand better how SAS' proc sql works. I understand that SAS includes a relational database and it includes the ability to run proc sql against external servers like Oracle. I am trying to better understand when / how it decides to use the database server rather than its internal database system.
I have seen some really S. L. O. W. SAS code where my coworkers run a series of PROC SQL commands. These programs typically include 3 - 5 PROC SQL steps. Each PROC SQL command creates a local SAS table. They are not using pass-through SQL. The data sets are large (1 million rows +) and these PROC SQL steps run slowly. Most of the data lives on the server. There is usually a small table that defines the population that we want to look at, and it is in a SAS data file, but everything else lives on the server.
I have demonstrated dramatic improvements in speed by simply running all of the queries directly on the server. (Oracle in this case, but I don't think that is important.) Usually, I have to first upload a table to my personal schema that defines the population of clients we want to examine. Everything else is on the server. Sometimes I collapse their queries together because they can be done in a single step, but I do not believe that is why my version of their program is so much faster.
I think proc sql uploads the initial data set and then runs the first query on the server. It then downloads the output to the local computer, creating the local SAS data set. For the second proc sql step, it uploads the table created in step one back to the server and then runs the query on the server. To make this all even worse, the "local" SAS data sets are actually stored on a remote server, not the actual local machine. This is invisible to SAS, but it does mean we are copying data across the network yet again. I believe SAS is running slowly because of a large amount of unnecessary network traffic.
Question #1 - Is my understanding of what proc sql is doing correct? Are we really wasting as much time as I think we are uploading and downloading large tables / data sets across our network?
Question #2 - Is there some way to control when PROC SQL runs against the server versus when it runs against the local database? In some cases, if we could prevent the upload / download step, the query would run more efficiently.
Short answer
Your understanding is not exactly correct, but it's in the right ballpark. SQL is probably not sending the SAS dataset to the server; it is more likely downloading the server data to SAS - but it's probably downloading the entire table, not limiting it by the join criteria. Your solution is exactly what I would suggest doing - hopefully your colleagues will get on board.
Long answer
In terms of how the processing works, it depends on your code. PROC SQL will execute code locally (as in, on the SAS server/desktop) unless it decides to pass the query up to the server and hasn't been told it's not allowed to. That's called implicit pass-through. You can't really control it except to turn it off entirely (with NOIPASSTHRU on the PROC SQL statement). You can sometimes look at what happens using options msglevel=i; (a system option) and the _METHOD or _TREE options to see what SQL decided to do (similar to an explain plan) - a small example follows.
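A minimal hedged sketch of those diagnostic options; the query is a placeholder reusing the ORALIB libref from the examples further down:

/* MSGLEVEL=I prints informational notes (including implicit pass-through
   notes), and _METHOD / _TREE show the plan PROC SQL chose */
options msglevel=i;

proc sql _method _tree;
  select a.*, b.cost
  from oralib.tableA a inner join oralib.tableB b
    on a.id=b.id;
quit;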
I've had cases where it caused harm: SQL Server runs character comparisons case-insensitively while SAS does not, and I had a particular query that was sometimes sent up to the server and sometimes not, depending on details of the data. I wasn't careful enough with checking case, so it appeared to work when it really wasn't correct (comparing PROPCASE to UPCASE values).
The general rule is that SAS will try to send the query to the server if:
The data in the query entirely already resides on the server
The query is sufficiently simple that SAS can easily figure out how to tell the server to do it, in its native language
If you're running a query with a local SAS dataset (say, joining a server table to a SAS dataset locally), it won't (at least as far as I know) go to the server. It should always run locally, which means downloading from the server all the data in the contributing tables (possibly filtered if there is a logical filter in the query). For example (these examples aren't necessarily good SQL code, just illustrations of the concept):
libname oralib oracle [connection info];
proc sql;
*Will pass through likely;
select tableA.*, tableB.cost
from oralib.tableA inner join oralib.tableB
on tableA.id=tableB.id;
*Will probably not pass through;
select tableA.*, tableB.cost
from oralib.tableA inner join work.tableB
on tableA.id=tableB.id;
*Might pass through, might not;
select tableA.*, tableB.cost, tableC.productID
from oralib.tableA inner join oralib.tableB
on tableA.id=tableB.id
left join oralib.tableC
on tableA.id=tableC.id;
*This downloads the data but probably applies the where statement server side;
select tableA.*, tableB.cost
from oralib.tableA inner join work.tableB
on tableA.id=tableB.id
where tableA.date < '01JAN2010'd;
quit;
In the case of the second query, it probably pulls all of tableA down. In the fourth query, it will likely pass the WHERE clause to the server (assuming the date doesn't cause a problem, but it shouldn't - SAS knows how to convert date literals to Oracle dates).
Note that SAS procs can also generate passthrough. PROC MEANS, etc., will send the instructions to Oracle to do the means/sums/etc. if it can easily do so.
Your best bet is to:
Try to do everything in pass-through that you can (and that makes sense). The only way to be sure a query goes to the server is to use pass-through.
If you have a large table on the server and a small table in SAS, upload the SAS table to the server. A pass-through session and a libname session can't see each other's session-specific temporary tables, so you'd have to use a GTT or similar (something all users can see). Similarly, if you have a large table in SAS and a small table (or small query result) on the server, bring it down locally (through pass-through if necessary).
When you do have to bring things down, limit as much as possible. When I worked in that kind of environment, I made huge time savings simply by joining to tables on the server to limit my result set before bringing it down.
At the end of the day, you will be constrained by network traffic no matter what you do; just try to optimize it as best you can. It sounds like you understand how to do that already, so do what you would normally do in non-SAS environments.