Reading (even joining) a very large (1.1bn row) table in Enterprise Guide from Teradata - sas

Hopefully you guys can help with what I'm hoping is quite a simple question for those in the know!
I live (well, work) in SAS Enterprise Guide and am trying to perform a simple left join against a table in Teradata.
The table is extremely large (700+ columns, 1.1bn rows) and so far I have been connecting via a LIBNAME statement at the top of my program, followed by the usual PROC SQL to read the data.
The issue I am having is its is extremely slow. I performed the join successfully using 90 rows on the left table and it took 3 hours to complete. The real table I want to use has something like 15,000 rows.
I have tried to connect via the SQL Pass-Through method, but this throws a hosts file error, which I can't fix due to corporate security limitations.
Has anyone had any experience performing this kind of task?
I should mention that I can run a simple select * query in Teradata SQL Assistant is just over 1 minute (16,666,666 obs/s!) so the limitation must be somewhere between SAS/Teradata, or even SAS itself.
I'm sorry I haven't posted actual code snippets as they're on my work machine but this has been bugging me for ages so thought I'd see if I'm missing any tricks.
Thanks in advance for your help.

So you're joining a SAS data set to a Teradata table and want to return the matching records. You'll want to use SAS's DBMASTER= data set option. It designates which of the tables is larger. By telling SAS this, it knows which table to move.
Here I assume librefs have already been assigned and that the Teradata table is larger--more obs--than the SAS data set.
proc sql threads; select tdTable.* from sastables.sasTable1, td.tdTable(dbmaster=yes)
where tdTable.idNum=sasTable1.idNum; quit;
If by chance your SAS data set is larger, you'll want to use the MULTI_DATASRC_OPT= option. Either google these terms or look in the SAS/Access to Relational Databases manual. It's pretty good.
Good luck.

Have you considered creating a volatile table in Teradata? Since this is created in your spool allocation you shouldn't need explicit permissions to create the table. Once created you can load the SAS data set into the Volatile table and collect statistics on the table's join columns and filter columns. This will help the optimizer understand the demographics about your "small" table. The volatile table will only persist for the duration of your session and is not accessible across multiple sessions.
Then rewrite your SAS code to push-down the SQL to Teradata joining the large table to your volatile table. The results can be returned to SAS and loaded into another data set.
CREATE VOLATILE TABLE MyTable, NO FALLBACK
( ColA SMALLINT NOT NULL,
ColB VARCHAR(10) NOT NULL
) PRIMARY INDEX (ColA)
ON COMMIT PRESERVE ROWS /* This is important */
;
The primary index is how Teradata distributes the data and accesses the data. Tables distributed on the same column will join "AMP local" and will not require a redistribution. This is not always possible, as your primary index selection has to consider even distribution as well as access path. The primary index does not have to be unique, but can be.
Hope this helps.

Related

The processing of large data sets in sas

I am looking for solutions or ideas how to speed the processing of large data sets in sas.
What would you recommend?
What is better data step or proc sql procedure?
Speeding up your data processing depends on where your data is saved.
Your data can be either in:
SAS Table,
Database Table (Miscrosfot SQL, Oracle, DB2, MYSQL, ..
etc.)
Use SAS Data Step when:
You are querying/processing SAS tables,
You want to do iterative
processing (ex. retaining values or using arrays).
Use Proc SQL when:
You are querying a large Database table,
You can do a SQL "Pass Through" where you send SQL code to be
executed on the DB server and only the output is sent to SAS (instead
of bringing the entire tables through the network to SAS and then filter it),
You want to query SAS Tables but prefer SQL joins to data step merges.
Another topic you should consider is efficiency programming; where you are optimising your query and look-ups.
I find Proc SQL to be better for my use cases. We may need some more specifics on the size and variety of data your trying to join/export etc.
Give us some info on that and we can try to help.
Tips:
Limit the fields your pulling over
Subset data
Anecdotally from my experience Proc SQL seems faster.
Here are two tips on speeding up queries with Proc SQL:
In general, you want to rule out as much data as possible when querying. If you are usingProc SQL, the order of the restrictions in the where clause matters. Put the most restrictive parts first.
For example, if I'm querying a database for teachers with the last name "JONES", that were hired after Jan 2005, I would structure my where clause like this: where last_name = 'JONES' and hire_date > 200501 I would do this because last name is likely to exclude more records than the hire date restriction.
When possible, don't use Select * instead, list out the specific columns that you need. Remember, even if you are doing a calculation with a column, you don't have to include that column in your select statement.
Here is a very useful resource for understanding how to use proc sql efficiently. I recommend reading it in it's entirety if you do a lot of work with large data sets in SAS.
http://www2.sas.com/proceedings/sugi29/127-29.pdf

Sas mdx drillthrough statement

I use Sas WRS sat on a information map over a cube. My business users want to see the raw data behind each figure on a report. I have set up a drill through table but I need to limit the result data set to the measure being queried.
I've come across the option "drillthrough" but wondered if someone could tell me if I use this directly in the olap cube code, create a stored process or other method. I'm not really sure how to use this syntax. Will it serve my purpose? The syntax I'm thinking is
Drillthrough
(Select([measures].currentmember) on column
([reporting date].[yqmd].[date]) on rows
From (claim_table)
)
I'm not familiar with SAS but DRILLTHROUGH is standard MDX and can be used to access the ' raw ' data behind the MDX select. There might be more or less limitations depending on the actual OLAP product you're using. To limit the number of rows (e.g., 5000) returned use the syntax :
DRILLTHROUGH MAXROWS 5000 SELECT ...

Proc SQL: How / When does SAS Move the Data

I am a DBA / R user. I just took a job in an office full of SAS users and I am trying to understand better how SAS' proc sql works. I understand that SAS includes a relational database and it includes the ability to run proc sql against external servers like Oracle. I am trying to better understand when / how it decides to use the database server rather than its internal database system.
I have seen some really S. L. O. W. SAS code where my coworkers running a series of proc sql commands. These programs typically include 3 - 5 proc sql steps. Each proc sql command creates a local SAS table. They are not using passthrough sql. The data sets are large (1 million rows +) and these proc sql steps run slowly. Most of the data lives on the server. There is usually a small table that defines the population that we want to look at and it is in a SAS data file, but everything else lives on the server.
I have demonstrated dramatic improvements in speed by simply running all of the queries directly on the server. (Oracle in this case, but I don't think that is important.) Usually, I have to first upload a table to my personal schema that defines the population of clients we want to examine. Everything else is on the server. Sometimes I collapse their queries together because they can be done in a single step, but I do not believe that is why my version of their program is so much faster.
I think proc sql uploads the initial data set and then runs the first query on the server. It then downloads the output to the local computer, creating the local SAS data set. For the second proc sql step, it uploads the table created in step one back to the server and then runs the query on the server. To make this all even worse, the "local" SAS data sets are actually stored on a remote server, not the actual local machine. This is invisible to SAS, but it does mean we are copying data across the network yet again. I believe SAS is running slowly because of a large amount of unnecessary network traffic.
Question #1 - Is my understanding of what proc sql is doing correct? Are we really wasting as much time as I think we are uploading and downloading large tables / data sets across our network?
Qeustion #2 - Is there some way to control when proc sql runs against a server versus when it runs against the local database? In some cases, if we could prevent the upload / download step, the query would run more efficiently.
Short answer
Your understanding is not exactly correct, but it's in the right ballpark. SQL is probably not sending the SAS dataset to the server, it is more likely downloading the server data to SAS - but it's probably downloading the entire table, not limited by the join criteria. Your solution is exactly what I would suggest doing - hopefully your colleagues will get on board.
Long answer
In terms of how the processing works, it depends on your code. PROC SQL will execute code locally (as in, on the SAS server/desktop), unless it decides to pass the query up to the server and hasn't been told it's not allowed to. That's called implicit passthrough. You can't really control it except to turn it entirely off (with noipassthru on the PROC SQL statement). You can look at it sometimes using options msglevel=i; (a system option), and _METHOD or _TREE to see what SQL decided to do (similar to explain plan).
I've had cases where it caused harm: SQL Server runs character comparisons case-insensitively while SAS does not, and I had a particular query that sometimes was sent up to the server and sometimes not depending on details of the data. I wasn't careful enough with checking case, and so it appeared to work when it really wasn't correct (comparing Propcase to UPCASE).
The general rule is that SAS will try to send the query to the server if:
The data in the query entirely already resides on the server
The query is sufficiently simple that SAS can easily figure out how to tell the server to do it, in its native language
If you're running a query with local SAS dataset (say, joining a server table to a SAS dataset locally), it won't (at least as far as I know) go to the server. It should always run it locally, which would mean downloading from the server all data in the contributing tables (possibly filtered if there is a logical filter in the query). IE (these examples aren't necessarily good SQL code, just examples of concept):
libname oralib oracle [connection info];
proc sql;
*Will pass through likely;
select tableA.*, tableB.cost
from oralib.tableA inner join oralib.tableB
on tableA.id=tableB.id;
*Will probably not pass through;
select tableA.*, tableB.cost
from oralib.tableA inner join work.tableB
on tableA.id=tableB.id;
*Might pass through, might not;
select tableA.*, tableB.cost, tableC.productID
from oralib.tableA inner join oralib.tableB
on tableA.id=tableB.id
left join oralib.tableC
on tableA.id=tableC.id;
*This downloads the data but probably applies the where statement server side;
select tableA.*, tableB.cost
from oralib.tableA inner join work.tableB
on tableA.id=tableB.id
where tableA.date < '01JAN2010'd;
quit;
In the case of the second query, it probably pulls all of tableA down. In the fourth query, it likely will pass the where clause to the server (assuming the date doesn't cause a problem, but it shouldn't, SAS knows how to convert dates to oracle type dates).
Note that SAS procs can also generate passthrough. PROC MEANS, etc., will send the instructions to Oracle to do the means/sums/etc. if it can easily do so.
Your best bet is to:
Try to do everything in pass through that you can (and that makes sense). Only way to be sure it goes to the server is to use passthrough.
If you have a large table on the server and a small table in SAS, upload the table in SAS to the server. A passthrough session and a libname session can't see each others session-specific temporary tables, so you'd have to use a GTT or similar (something all users can see). Similarly, if you have a large table in SAS and a small table (or small query result) in SQL, bring it down locally (through passthrough if necessary).
When you do have to bring things down, limit as much as possible. When I worked in that kind of environment, I made huge time savings simply by joining to tables on the server to limit my result set before bringing them down.
At the end of the day, you will be constrained by network traffic no matter what you do; just try to optimize it as best you can. It sounds like you understand how to do that already, so just do what you normally would do in non-SAS environments.

Select for Teradata Primary/Foreign key relationship

Im in the process of learning to properly pull appropriate metadata from a Teradata database and a large part of what I need is to pull all existing primary/foreign keys within a database. I am still very much a beginner with Teradata as well as big data in general, so a simplified explanation would be nice.
A simplified version of a select statement would also be incredibly helpful. Thanks in advance.
Foreign Keys: dbc.All_RI_ParentsV[X]
PK/Unique: dbc.IndicesV[X]. Unique Indexes got a UniqueFlag Y, if it was defined as a PK in the Create Table IndexType will be P. Multi-column indexes got one row per column all sharing the same IndexNumber, 1 is always the PI.
But as Teradata is a DWH you might have tables without defined PK and you will hardly find any defined FKs.

Limiting results in PROC SQL

I am trying to use PROC SQL to query a DB2 table with hundreds of millions of records. During the development stage, I want to run my query on an arbitrarily small subset of those records (say, 1000). I've tried using INOBS to limit the observations, but I believe that this parameter is simply limiting the number of records which SAS is processing. I want SAS to only fetch an arbitrary number of records from the database (and then process all of them).
If I were writing a SQL query myself, I would simply use SELECT * FROM x FETCH FIRST 1000 ROWS ONLY ... (the equivalent of SELECT TOP 1000 * FROM x in SQL Server). But PROC SQL doesn't seem to have any option like this. It's taking an extremely long time to fetch the records.
The question: How can I instruct SAS to arbitrarily limit the number of records to return from the database.
I've read that PROC SQL uses ANSI SQL, which doesn't have any specification for a row limiting keyword. Perhaps SAS didn't feel like making the effort to translate its SQL syntax to vendor-specific keywords? Is there no work around?
Have you tried using the outobs option in your proc sql?
For example,
proc sql outobs=10; create table test
as
select * from schema.HUGE_TABLE
order by n;
quit;
Alternatively, you can use SQL passthrough to write a query using DB2 syntax (FETCH FIRST 10 ROWS ONLY), although this requires you to store all your data in the database, at least temporarily.
Passthrough looks something like this:
proc sql;
connect to db2 (user=&userid. password=&userpw. database=MY_DB);
create table test as
select * from connection to db2 (
select * from schema.HUGE_TABLE
order by n
FETCH FIRST 10 ROWS ONLY
);
quit;
It requires more syntax and can't access your sas datasets, so if outobs works for you, I would recommend that.
When SAS is talking to a database via SAS syntax, part of the query can be translated to DBMS language equivalent - this is called implicit pass through. The rest of the query is "post-processed" by SAS to produce final result.
Depending on SAS version, DBMS vendor and DBMS version, and in some cases even some connection/libname options, different parts of SAS syntax are translatable/considered compatible between SAS and DBMS and thus sent to be performed by DBMS instead of SAS.
With SAS SQL options - INOBS and OUTOBS - I've worked a lot with MS SQL and Oracle via different versions of SAS, but I haven't seen those ever translated to TOP xxx type of queries, so this is probably not supported yet, although when query touches just DMBS data (no joins to SAS data etc), should be quite doable.
So I think you're left with the so called explicit pass-through - specific SAS SQL syntax to connect to database. This type of queries look like this:
proc sql;
connect to oracle as db1 (user=user1 pw=pasw1 path=DB1);
create table test_table as
select *
from connection to db1
( /* here we're in oracle */
select * from test.table1 where rownum <20
)
;
disconnect from db1;
quit;
In SAS 9.3 the syntax can be simplified - if there's already a LIBNAME connection, you can reuse it for explicit pass-through:
LIBNAME ORALIB ORACLE user=...;
PROC SQL;
connect to oracle using ORALIB;
create table work.test_table as
select *
from connection to ORALIB (
....
When connecting using libname be sure to use READBUFF (I usually set some 5000 or so) or INSERTBUFF options (1000 or more) when loading database.
To see if implicit pass-through takes place, set sastrace option:
option sastrace=',,,ds' sastraceloc=saslog nostsuffix;