sort error at end of proc sql for inner join - sas

I ran the following code and an hour later, just as the code was finishing a sort execute error occurred. Is there something wrong with my code or is my computer processor and Ram insufficient
proc sql;
create table today as
select a.account_number, a.client_type, a.device ,a.entry_date_est,
a.entry_time_est, a.duration_seconds, a.channel_name, b.esn, b.service_start_date,
b.service_end_date, b.product_name, b.billing_frequency_fee, b.plan_category,
b.plan_subtype, b.plan_type
from listen_nomiss a inner join service_nomiss b
on (a.account_number = b.account_number)
order by account_number;
quit;

That error is most commonly seen when you run out of utility space to perform the sort. A few suggestions for troubleshooting are available in this SAS KB post; the most useful suggestions:
options fullstimer msglevel=i ; will give you a lot more information about what's going on behind the scenes, so you can troubleshoot what is causing the issue
proc options option=utilloc; run; will tell you where the utility directory is that your temporary files will be created in for the sort. Verify that about 3 times the space needed for the final table is available - sorting requires roughly 3 times the space in order to properly sort the dataset due to how the sort is processed.
OPTIONS COMPRESS; will save some (possibly a lot of) space if not already enabled.
options memsize; and options sortsize; will tell you how much memory is allocated to SAS, and at what size a sort is done in memory versus on disk. sortsize should be about 1/3 of memsize (given the requirement of 3x space to process it). If your final table is around but just over sortsize, you may be better off trying to increase sortsize if the default is too low (same for memsize).
You could also have some issues with permissions; some of the other suggestions in the kb article relate to verifying you actually have permission to write to the utility directory, or that it exists at all.

I've had a project in the past where resources was an issue as well.
A couple of ways around it when sorting were:
Don't forget that proc sort has a TAGSORT option, which will make it first only sort on the by statement variables and attach everything else afterwards. Useful when having many columns not involved in the by statement.
Indexes: if you build an index of exactly the variables in your by-statement, you can use a by statement without sorting, it will rely on the index.
Split it up: you can split up the dataset in multiple chunks and sort each chunk separately. Then you do a data step where you put them all in the set statement. When you use a by statement there as well, SAS will weave the records so that the result is also according to the by-statement.
Note that these approaches have a performance hit (maybe the third one only to a lesser extent) and indexes can give you headaches if you don't take them into account later on (or destroy them intentionally).
One note if/when you would rewrite the whole join as a SAS merge: keep in mind that SAS merge does not by itself mimic many-to-many joins. (it does one-to-one, one-to-many and many-to-one) Probably not the case here (it rarely is), but i mention it to be on the safe side.

Related

Redshift Query Performance to reduce CPU utilisation

I want to take a general Idea of how I can optimise the query performance in redshift Database, I have Huge queries with lots of joins , I do understand using sort and Dist key it can be achieved but is there a method which we can follow in order to get some optimal results.
What to look in a table and how to approach query optimisation in redshift?
What are the necessary steps to look for or approach in order to have a certain plan for optimisation?
Any guidance will help a lot
Having improved many queries on Redshift there are a few things I can point you towards. First let me list a few tools / techniques to make sure you have these in your toolbox.
Ability to read and EXPLAIN plan and find expected costly points
Know where to find the query "actual" execution report
Know the system tables to find join, distribution, and disk io reports
So with those understood let's look at where many queries go sideways on Redshift. I will try to list these out in pareto order but any of these, or combos, can create significant issue.
#1 - Fat in the middle queries. When joining it is possible to expand the number of rows being operated upon many fold. Cross joining is a clear way this can happen but isn't how this usually happens. If the join on conditions create a many to many join pattern the number of rows can expand. When the table sizes are very large and the "multiplication" can make absurd data sizes. The explain plan can show this but not always - use of DISTINCT and GROUP BY can "hide" the true size of the dataset in play. Performing a SELECT COUNT(*) on your join tree can help show how big this is. You may also may need to look a pieces of the join tree if a later join is collapsing the rows (failure of the query optimizer?). Redshift is a columnar database and not well set up for the creation of data - this includes during the execution of query.
#2 - Distribution of large amounts of data. Redshift is a cluster and the node are connected together by ethernet cables and these connections are the slowest part of the cluster. A lot of work is done by the query optimizer to minimize the amount of data that needs to move around the network. However, it doesn't know your data as well as you do and doesn't always do this well. Look at the type of joins you are getting - is distribution needed? how much data is being distributed? Also, group by (and window functions) need to combine rows and therefore may need redistribution to complete. How big are the data sets entering your aggregation steps?
Moving a lot of data around the network will be slow. The difficulty is that it isn't always clear how to reduce this movement. Large join trees like you say you have can do "odd" things when it comes to the resulting distribution of the "joined" data. Joins are performed one at a time and the order these happen can matter. The query optimizer is making a number of decisions about the order of joins and how to organize the resulting data from each join. The choices it makes is based on what it sees in the table metadata so completeness of metadata matters. WHERE conditions can also impact the optimizer's choices. There are just way to many interactions to itemize them out here. Best advice is to look at the performance per step and see if data distribution is a factor. Then work to control how data is distributed in the query's execution. This may mean changing the join trees or even decomposing the query into several with temp table that have distribution set so that data movement is minimized.
#3 Excessive IO traffic - While not as slow as the networks, the disk IO subsystem is often a bottleneck. This shows up in a few ways. Are you reading more data from disk than is needed? (Metadata up to date?) Do you need a redundant WHERE clause to eliminate data? (Redundant WHERE clause is one that isn't needed functionally but is added so Redshift can perform the metadata comparisons that will reduce data read at scan.) Data spill is another way that disk IO can be strained (this goes back to #1). If data needs to spill to disk it can bring the disk IO performance down considerably. Use your metadata and Where clauses well.
Now these 3 areas often team up to kill your performance. Read too many rows from your tables, join all these extra rows together across the network while also making many new rows. This data doesn't fit in memory so now Redshift needs to spill to disk to complete the query. Things slow down real fast in these conditions.
Lastly these factors I've listed are cluster wide "resources" of Redshift. If one query take up a lot of one of these then there is less for other queries running at the same time. What often happens is that the query writers on a cluster follow similar patterns (good or bad) and when their pattern is costly on one axis then many of their queries are costly on the same axis. This shows up as queries that work "ok" when run in isolation but very badly when others are using the cluster. This generally means that many queries are contributing to pushing the cluster "over the edge" on some limited resource. There are system tables that you can look at to see aggregated IO or network traffic to see these effects.
Good queries are:
Don't make a lot of new "rows" during execution (not fat in the middle)
Keep large data sets "on node" and only redistribute data once the data has been pared down significantly
Don't read more data from disk than is necessary and don't spill
The problem is that doing all of these isn't always possible the trick is to not over subscribe the cluster resources you have.

SAS out of memory error

I'm getting a "The remote Process is out of memory" in SAS DIS (Data Integration Studio):
Since it is possible that my approach is wrong, I'll explain the problem I'm working on and the solution I've decided on:
I have a large list of customer names which need cleaning. In order to achieve this, I use a .csv file containing regular expression patterns and their corresponding replacements; (I use this approach since it is easier to add new patterns to the file and upload it to the server for the deployed job to read from rather than harcoding new rules and redeploying the job).
In order to get my data step to make use of the rules in the file I add the patterns and their replacements to an array in the first iteration of my data step then apply them to my names. Something like:
DATA &_OUPUT;
ARRAY rule_nums{1:&NOBS} _temporary_;
IF(_n_ = 1) THEN
DO i=1 to &NOBS;
SET WORK.CLEANING_RULES;
rule_nums{i} = PRXPARSE(CATS('s/',rule_string_match,'/',rule_string_replace,'/i'));
END;
SET WORK.CUST_NAMES;
customer_name_clean = customer_name;
DO i=1 to &NOBS;
customer_name_clean = PRXCHANGE(a_rule_nums{i},1,customer_name_clean);
END;
RUN;
When I run this on around ~10K rows or less, it always completes and finishes extremely quickly. If I try on ~15K rows it chokes for a super long time and eventually throws an "Out of memory" error.
To try and deal with this I built a loop (using the SAS DIS loop transformation) wherein I number the rows of my dataset first, then apply the preceding logic in batches of 10000 names at a time. After a very long time I got the same out of memory error, but when I checked my target table (Teradata) I noticed that it ran and loaded the data for all but the last iteration. When I switched the loop size from 10000 to 1000 I saw exactly the same behaviour.
For testing purposes I've been working with only around ~500K rows but will soon have to handle millions and am worried about how this is going to work. For reference, the set of cleaning rules I'm applying is currently 20 rows but will grow to possibly a few hundred.
Is it significantly less efficient to use a file with rules rather than hard coding the regular expressions directly in my datastep?
Is there any way to achieve this without having to loop?
Since my dataset gets overwritten on every loop iteration, how can there be an out of memory error for datasets that are 1000 rows long (and like 3 columns)?
Ultimately, how do I solve this out of memory error?
Thanks!
The issue turned out to be that the log that the job was generating was too large. The possible solutions are to disable logging or to redirect the log to a location which can be periodically purged and/or has enough space.

Solution to work disk space not enough in sas

I have more than 50 tables running in work. Before, it worked well.
But recently, there are some errors like:
ERROR: An I/O error has occurred on file
WORK.'SASTMP-000000030'n.UTILITY. ERROR: File
WORK.'SASTMP-000000030'n.UTILITY is damaged. I/O processing did not
complete. NOTE: Error was encountered during utility-file processing.
You may be able to execute the SQL statement successfully if you
allocate more space to the WORK library. ERROR: There is not enough WORK disk space to store the results of an internal sorting
phase. ERROR: An error has occurred.
Does anyone know how to solve this error?
Your disk is full. If this is running on a server, ask your system administrator to investigate the problem.
If this is your desktop, find and delete un-needed files to free up space.
Clean out old SAS Work Folders
Often, old SAS Work folders do not get cleared when SAS closes. You can get back a lot of disk space by going to the path defined for SAS Work, and deleting all the old folders.
In SAS
%put %sysfunc(pathname(work));
will show you where the current WORK library is located. One level up is where all SAS Work folders are created.
On my system, that returns:
C:\Users\dpazzula\AppData\Local\Temp\SAS Temporary Files\_TD9512_GXM2L12-PAZZULA_
That means that I should look in "C:\Users\dpazzula\AppData\Local\Temp\SAS Temporary Files\" to find old folders to delete.
Your work space is full.
Your SAS server uses a dedicated directory where all SAS sessions store their temporary files: All files in the work libraries, as well as temp files as used while sorting, joining etc.
Solutions:
Have more space allocated.
Make certain only to put necessary files into work/ clean up/ close old sessions.
Run less processes.
Replace interim datasets with views instead, especially if you're using large source datasets :
data master /view=master ;
set lib.monthlydata20: ; /* all datasets since Jan 2000 */
run ;
proc sql ;
create table want as
select *
from master
where ID in(select ID from lookup) ;
quit ;
try to compress all datasets using this option
OPTIONS COMPRESS=YES REUSE=YES;
this should be in the very beginning of your code. it will compress all datasets by nearly 98%.It will also make your code run faster. It will consume more CPU but will decrease size.
In some cases, this might not help if the compressed data sets exceed the hard disk space.
Also, change your work directory to the biggest drive that has disk space.
Study your code.
Create a Data Flow Diagram to determine WHEN each file is created, where it is used downstream. Find out when a data set is no longer needed and DELETE it. If you have 50 data sets, chances are numerous data sets are 'value-added' by a subsequent step, and can go away freeing up your work space. A cute trick is to REUSE some of the data set names - to keep the number of unneeded data sets in check.
Rule of thumb: leave the environment the way you found it - if there were no files in WORK to start, manually clean up after yourself. Unless it is a Stored Process, which starts a completely new SAS job, and will clean up after itself upon completion of the job.

How to optimize writing this data to a postgres database

I'm parsing poker hand histories, and storing the data in a postgres database. Here's a quick view of that:
I'm getting a relatively bad performance, and parsing files will take several hours. I can see that the database part takes 97% of the total program time. So only a little optimization would make this a lot quicker.
The way I have it set-up now is as follows:
Read next file into a string.
Parse one game and store it into object GameData.
For every player, check if we have his name in the std::map. If so; store the playerids in an array and go to 5.
Insert the player, add it to the std::map, store the playerids in an array.
Using the playerids array, insert the moves for this betting round, store the moveids in an array.
Using the moveids array, insert a movesequence, store the movesequenceids in an array.
If this isn't the last round played, go to 5.
Using the movesequenceids array, insert a game.
If this was not the final game, go to 2.
If this was not the last file, go to 1.
Since I'm sending queries for every move, for every movesequence, for every game, I'm obviously doing too many queries. How should I bundle them for best performance? I don't mind rewriting a bit of code, so don't hold back. :)
Thanks in advance.
CX
It's very hard to answer this without any queries, schema, or a Pg version.
In general, though, the answer to these problems is to batch the work into bigger coarser batches to avoid repeating lots of work, and, most importantly, by doing it all in one transaction.
You haven't said anything about transactions, so I'm wondering if you're doing all this in autocommit mode. Bad plan. Try wrapping the whole process in a BEGIN and COMMIT. If it's a seriously long-running process the COMMIT every few minutes / tens of games / whatever, write a checkpoint file or DB entry your program can use to resume the import from that point, and open a new transaction to carry on.
It'll help to use multi-valued inserts where you're inserting multiple rows to the same table. Eg:
INSERT INTO some_table(col1, col2, col3) VALUES
('a','b','c'),
('1','2','3'),
('bork','spam','eggs');
You can improve commit rates with synchronous_commit=off and a commit_delay, but that's not very useful if you're batching work into bigger transactions.
One very good option will be to insert your new data into UNLOGGED tables (PostgreSQL 9.1 or newer) or TEMPORARY tables (all versions, but lost when session disconnects), then at the end of the process copy all the new rows into the main tables and drop the import tables with commands like:
INSERT INTO the_table
SELECT * FROM the_table_import;
When doing this, CREATE TABLE ... LIKE is useful.
Another option - really a more extreme version of the above - is to write your results to CSV flat files as you read and convert them, then COPY them into the database. Since you're working in C++ I'm assuming you're using libpq - in which case you're hopefully also using libpqtypes. libpq offers access to the COPY api for bulk-loading, so your app wouldn't need to call out to psql to load the CSV data once it'd produced it.

Does Sybase ASE 12.5 support Common Table Expressions?

I noted that Sybase SQL Anywhere supports them, but can't find any documentation on ASE also doing so.
If it doesn't, what would be my best option for designing a recursive query? In SQL Server 2008 I'd do it with a CTE, but if that's not available? A function perhaps?
Sybase ASE 12.5 (and also 15.0) doesn't support CTE.
You can use a inner query to solve your problem.
A simple CTE like this:
WITH Sales_CTE (Folders_Id)
AS
-- Define the CTE query.
(
SELECT Folders_Id FROM Folders
)
SELECT Folders_Id FROM Sales_CTE
is the same as this:
SELECT aux.Folders_Id
FROM (SELECT Folders_Id FROM Folders) aux
For a litle more info, check this!
Since 1984, the Standard, and Sybase, allowed for full recursion. We generally perform recursion in a stored proc, so that depth is controlled, infinite loops are avoided, and execution is faster than uncompiled SQL, etc.
Stored procs have no limits to recursion, result set construction etc. Of course, defining the content of the brackets as a View would make it faster again (it is after all a real View, not one that we have to Materialise every time we need it).
Point being, if you were used to this method (recursion in the server, a proc coded for recursion), as I am, there is no need for CTEs, with its new syntax; uncompiled speeds; temp tables; work tables; a cursor "walking" the hierarchy; all resulting in horrendous performance.
A recursive proc reads only the data, and nothing but the data, and reads only those rows that qualify at each level of the recursion. It does not use a cursor. It does not "walk", it builds the hierarchy.
A second option is to use Dynamic SQL. Simply construct the SELECTs, one per level of the hierarchy, and keep adding the UNIONs until you run out of levels; then execute.
You can use a function to provide the facility of a CTE, but do not do so. A function is intended for a different, column-oriented purpose, the code is subject to those constraints. It is scalar, good for constructing a column value. Stored procs and CTEs are row-oriented.