I'm getting a "The remote Process is out of memory" in SAS DIS (Data Integration Studio):
Since it is possible that my approach is wrong, I'll explain the problem I'm working on and the solution I've decided on:
I have a large list of customer names which need cleaning. To achieve this, I use a .csv file containing regular expression patterns and their corresponding replacements. (I use this approach because it is easier to add new patterns to the file and upload it to the server for the deployed job to read than to hardcode new rules and redeploy the job.)
In order to get my data step to make use of the rules in the file, I load the patterns and their replacements into an array on the first iteration of the data step, then apply them to my names. Something like:
DATA &_OUPUT;
ARRAY rule_nums{1:&NOBS} _temporary_;
IF(_n_ = 1) THEN
DO i=1 to &NOBS;
SET WORK.CLEANING_RULES;
rule_nums{i} = PRXPARSE(CATS('s/',rule_string_match,'/',rule_string_replace,'/i'));
END;
SET WORK.CUST_NAMES;
customer_name_clean = customer_name;
DO i=1 to &NOBS;
customer_name_clean = PRXCHANGE(a_rule_nums{i},1,customer_name_clean);
END;
RUN;
When I run this on around 10K rows or fewer, it always completes, and does so extremely quickly. If I try it on around 15K rows, it grinds for a very long time and eventually throws an "Out of memory" error.
To try and deal with this I built a loop (using the SAS DIS loop transformation) wherein I number the rows of my dataset first, then apply the preceding logic in batches of 10000 names at a time. After a very long time I got the same out of memory error, but when I checked my target table (Teradata) I noticed that it ran and loaded the data for all but the last iteration. When I switched the loop size from 10000 to 1000 I saw exactly the same behaviour.
For testing purposes I've been working with only around ~500K rows but will soon have to handle millions and am worried about how this is going to work. For reference, the set of cleaning rules I'm applying is currently 20 rows but will grow to possibly a few hundred.
Is it significantly less efficient to use a file with rules rather than hard coding the regular expressions directly in my datastep?
Is there any way to achieve this without having to loop?
Since my dataset gets overwritten on every loop iteration, how can there be an out of memory error for datasets that are 1000 rows long (and like 3 columns)?
Ultimately, how do I solve this out of memory error?
Thanks!
The issue turned out to be that the log that the job was generating was too large. The possible solutions are to disable logging or to redirect the log to a location which can be periodically purged and/or has enough space.
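For reference, one way to do the redirect in Base SAS code is PROC PRINTTO; a minimal sketch (the log path is just a placeholder, and the options shown simply reduce log volume):
/* Redirect the SAS log to a file on a drive with enough space (placeholder path) */
PROC PRINTTO LOG='/saslogs/clean_names.log' NEW;
RUN;

/* Optionally cut down the volume written to the log */
OPTIONS NONOTES NOSOURCE NOSOURCE2;

/* ... cleaning job runs here ... */

/* Restore normal logging */
OPTIONS NOTES SOURCE SOURCE2;
PROC PRINTTO;
RUN;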
I have a database whose size after importing into SAS is around 600 MB.
(I use OPTIONS COMPRESS=YES at the start of my program.)
Then I derive some columns/variables and get a final database of around 800 MB.
The final database has 1,929,743 observations.
What I want
I want to sort the data in descending order of PUBLICATION_DATE within each value of ITEM in my final database.
My code so far
PROC SORT DATA=newdb.access_db OUT= newdb.access_sorted;
BY ITEM DESCENDING PUBLICATION_DATE;
RUN;
The error which I am getting
ERROR: No disk space is available for the write operation. Filename =
C:\Users\AB364273\AppData\Local\Temp\SAS Temporary
Files\SAS_util00010000204C_A00DVDPCSAS2007\ut204C000008.utl.
ERROR: Failure while attempting to write page 134 of sorted run 11.
ERROR: Failure while attempting to write page 40544 to utility file 1.
ERROR: Failure encountered while creating initial set of sorted runs.
ERROR: Failure encountered during external sort.
ERROR: Sort execution failure.
NOTE: The SAS System stopped processing this step because of errors.
NOTE: There were 1244486 observations read from the data set
NEWDB.ACCESS_DB.
WARNING: The data set NEWDB.ACCESS_SORTED may be incomplete. When this step was
stopped there were 0 observations and 57 variables.
NOTE: PROCEDURE SORT used (Total process time):
real time 2:17.20
cpu time 14.66 seconds
My database is not so large that an error like "no disk space" should appear.
Also, I have a lot of space on my hard disk (around 500 GB on the drive where I am storing the database using a libname, and 8 GB on the C drive).
I have 4 GB of RAM.
So with all this, I don't understand why this error is appearing, or how I can get the desired output.
If you have 8GB free on C drive, then that is likely your problem.
Sorting happens in a temporary (scratch) file, and that file can be up to three times as large as the original file. It also has to be on uncompressed data, for obvious reasons. As such, if your uncompressed file is say 3-4 GB in size, it wouldn't be sortable on the 8GB drive.
You can solve this by either moving your work location to a larger drive (or freeing up space), or by using the TAGSORT option, which reduces the utility file usage at the cost of speed (See SAS documentation for more details).
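A rough sketch of the TAGSORT route, reusing the dataset names from the question:
PROC SORT DATA=newdb.access_db OUT=newdb.access_sorted TAGSORT;
    /* TAGSORT keeps only the BY variables plus record tags in the utility file, */
    /* so it needs far less scratch space, at the cost of a slower sort          */
    BY ITEM DESCENDING PUBLICATION_DATE;
RUN;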
You also could request it from the database sorted; this is what I'd recommend if you're sorting by fields in the database (not by modified fields). You don't even have to use proc sort in most cases; if the database is in a libname db:
data access_sorted;
set db.access_db_Table;
by item descending publication_date;
run;
That will work just fine and will ask for it in sorted order directly from the database.
My first thought was something @Joe said, that your work library location is lacking space even if overall you have space.
I don't know the answer to this, but is an ORDER BY statement in PROC SQL less expensive in terms of the temporary space required? You could try it at least.
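A minimal sketch of that, reusing the question's dataset names (no guarantee it needs less utility space, this is only something to test):
PROC SQL;
    CREATE TABLE newdb.access_sorted AS
    SELECT *
    FROM newdb.access_db
    ORDER BY ITEM, PUBLICATION_DATE DESC;
QUIT;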
I have more than 50 tables running in WORK. It worked well before.
But recently, I am getting errors like:
ERROR: An I/O error has occurred on file WORK.'SASTMP-000000030'n.UTILITY.
ERROR: File WORK.'SASTMP-000000030'n.UTILITY is damaged. I/O processing did not complete.
NOTE: Error was encountered during utility-file processing. You may be able to execute the SQL statement successfully if you allocate more space to the WORK library.
ERROR: There is not enough WORK disk space to store the results of an internal sorting phase.
ERROR: An error has occurred.
Does anyone know how to solve this error?
Your disk is full. If this is running on a server, ask your system administrator to investigate the problem.
If this is your desktop, find and delete un-needed files to free up space.
Clean out old SAS Work Folders
Often, old SAS Work folders do not get cleared when SAS closes. You can get back a lot of disk space by going to the path defined for SAS Work, and deleting all the old folders.
In SAS
%put %sysfunc(pathname(work));
will show you where the current WORK library is located. One level up is where all SAS Work folders are created.
On my system, that returns:
C:\Users\dpazzula\AppData\Local\Temp\SAS Temporary Files\_TD9512_GXM2L12-PAZZULA_
That means that I should look in "C:\Users\dpazzula\AppData\Local\Temp\SAS Temporary Files\" to find old folders to delete.
Your work space is full.
Your SAS server uses a dedicated directory where all SAS sessions store their temporary files: all files in the WORK libraries, as well as temporary files used while sorting, joining, etc.
Solutions:
Have more space allocated.
Make certain to put only necessary files into WORK; clean up; close old sessions.
Run fewer processes.
Replace interim datasets with views instead, especially if you're using large source datasets:
data master /view=master ;
set lib.monthlydata20: ; /* all datasets since Jan 2000 */
run ;
proc sql ;
create table want as
select *
from master
where ID in(select ID from lookup) ;
quit ;
Try to compress all datasets using this option:
OPTIONS COMPRESS=YES REUSE=YES;
This should be at the very beginning of your code. It will compress all datasets by nearly 98%. It will also make your code run faster. It will consume more CPU but will decrease size.
In some cases, this might not help if the compressed data sets exceed the hard disk space.
Also, change your work directory to the biggest drive that has disk space.
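For reference, the WORK location can only be set when SAS starts, via the -WORK invocation option or the configuration file; a minimal sketch with a placeholder path:
/* sasv9.cfg or the SAS command line - the drive and folder are placeholders */
-WORK "D:\saswork"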
Study your code.
Create a Data Flow Diagram to determine WHEN each file is created, where it is used downstream. Find out when a data set is no longer needed and DELETE it. If you have 50 data sets, chances are numerous data sets are 'value-added' by a subsequent step, and can go away freeing up your work space. A cute trick is to REUSE some of the data set names - to keep the number of unneeded data sets in check.
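For the cleanup step, a minimal sketch (INTERIM1 and INTERIM2 are hypothetical dataset names):
/* Delete intermediate WORK datasets as soon as nothing downstream needs them */
PROC DATASETS LIBRARY=WORK NOLIST;
    DELETE interim1 interim2;
QUIT;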
Rule of thumb: leave the environment the way you found it - if there were no files in WORK to start, manually clean up after yourself. Unless it is a Stored Process, which starts a completely new SAS job, and will clean up after itself upon completion of the job.
I ran the following code, and an hour later, just as the code was finishing, a sort execution error occurred. Is there something wrong with my code, or are my computer's processor and RAM insufficient?
proc sql;
create table today as
select a.account_number, a.client_type, a.device ,a.entry_date_est,
a.entry_time_est, a.duration_seconds, a.channel_name, b.esn, b.service_start_date,
b.service_end_date, b.product_name, b.billing_frequency_fee, b.plan_category,
b.plan_subtype, b.plan_type
from listen_nomiss a inner join service_nomiss b
on (a.account_number = b.account_number)
order by account_number;
quit;
That error is most commonly seen when you run out of utility space to perform the sort. A few suggestions for troubleshooting are available in this SAS KB post; the most useful suggestions:
options fullstimer msglevel=i ; will give you a lot more information about what's going on behind the scenes, so you can troubleshoot what is causing the issue
proc options option=utilloc; run; will tell you where the utility directory is that your temporary files will be created in for the sort. Verify that about 3 times the space needed for the final table is available - sorting requires roughly 3 times the space in order to properly sort the dataset due to how the sort is processed.
OPTIONS COMPRESS=YES; will save some (possibly a lot of) space if not already enabled.
The memsize and sortsize options (check their current values with proc options option=memsize; run; and proc options option=sortsize; run;) tell you how much memory is allocated to SAS, and at what size a sort is done in memory versus on disk. sortsize should be about 1/3 of memsize (given the requirement of 3x space to process it). If your final table is around, but just over, sortsize, you may be better off trying to increase sortsize if the default is too low (same for memsize).
You could also have some issues with permissions; some of the other suggestions in the KB article relate to verifying that you actually have permission to write to the utility directory, or that it exists at all.
I've had a project in the past where resources were an issue as well.
A couple of ways around it when sorting were:
Don't forget that proc sort has a TAGSORT option, which makes it sort only the BY-statement variables (plus record tags) first and attach everything else afterwards. Useful when you have many columns not involved in the BY statement.
Indexes: if you build an index on exactly the variables in your BY statement, you can use a BY statement without sorting; SAS will rely on the index.
Split it up: you can split the dataset into multiple chunks and sort each chunk separately. Then you run a data step that lists them all in the SET statement; when you use a BY statement there as well, SAS will interleave the records so that the result also follows the BY order (see the sketch at the end of this answer).
Note that these approaches have a performance hit (maybe the third one only to a lesser extent) and indexes can give you headaches if you don't take them into account later on (or destroy them intentionally).
One note if/when you would rewrite the whole join as a SAS merge: keep in mind that a SAS merge does not by itself mimic many-to-many joins (it does one-to-one, one-to-many and many-to-one). Probably not the case here (it rarely is), but I mention it to be on the safe side.
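A rough sketch of the split-and-interleave approach (CHUNK1/CHUNK2 are hypothetical names; the BY variable follows the question's ORDER BY):
/* Sort each chunk separately */
PROC SORT DATA=chunk1; BY account_number; RUN;
PROC SORT DATA=chunk2; BY account_number; RUN;

/* A SET statement listing several sorted datasets, combined with a BY statement, */
/* interleaves the records so the result comes out in BY order                    */
DATA sorted_all;
    SET chunk1 chunk2;
    BY account_number;
RUN;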
I'm parsing poker hand histories and storing the data in a Postgres database.
I'm getting relatively bad performance, and parsing the files takes several hours. I can see that the database part takes 97% of the total program time, so even a small optimization would make this a lot quicker.
The way I have it set up now is as follows:
1. Read next file into a string.
2. Parse one game and store it into object GameData.
3. For every player, check if we have his name in the std::map. If so, store the playerids in an array and go to 5.
4. Insert the player, add it to the std::map, store the playerids in an array.
5. Using the playerids array, insert the moves for this betting round, store the moveids in an array.
6. Using the moveids array, insert a movesequence, store the movesequenceids in an array.
7. If this isn't the last round played, go to 5.
8. Using the movesequenceids array, insert a game.
9. If this was not the final game, go to 2.
10. If this was not the last file, go to 1.
Since I'm sending queries for every move, for every movesequence, for every game, I'm obviously doing too many queries. How should I bundle them for best performance? I don't mind rewriting a bit of code, so don't hold back. :)
Thanks in advance.
CX
It's very hard to answer this without any queries, schema, or a Pg version.
In general, though, the answer to these problems is to batch the work into bigger coarser batches to avoid repeating lots of work, and, most importantly, by doing it all in one transaction.
You haven't said anything about transactions, so I'm wondering if you're doing all this in autocommit mode. Bad plan. Try wrapping the whole process in a BEGIN and COMMIT. If it's a seriously long-running process, then COMMIT every few minutes / tens of games / whatever, write a checkpoint file or DB entry your program can use to resume the import from that point, and open a new transaction to carry on.
It'll help to use multi-valued inserts where you're inserting multiple rows to the same table. Eg:
INSERT INTO some_table(col1, col2, col3) VALUES
('a','b','c'),
('1','2','3'),
('bork','spam','eggs');
You can improve commit rates with synchronous_commit=off and a commit_delay, but that's not very useful if you're batching work into bigger transactions.
One very good option will be to insert your new data into UNLOGGED tables (PostgreSQL 9.1 or newer) or TEMPORARY tables (all versions, but lost when session disconnects), then at the end of the process copy all the new rows into the main tables and drop the import tables with commands like:
INSERT INTO the_table
SELECT * FROM the_table_import;
When doing this, CREATE TABLE ... LIKE is useful.
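For instance, a sketch of that staging pattern, keeping the answer's table names (the staging table's layout is assumed to mirror the target):
-- PostgreSQL 9.1+: unlogged staging table with the same structure as the target
CREATE UNLOGGED TABLE the_table_import (LIKE the_table INCLUDING DEFAULTS);
-- ... bulk-insert the new rows here, then run the INSERT ... SELECT above ...
DROP TABLE the_table_import;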
Another option - really a more extreme version of the above - is to write your results to CSV flat files as you read and convert them, then COPY them into the database. Since you're working in C++ I'm assuming you're using libpq - in which case you're hopefully also using libpqtypes. libpq offers access to the COPY api for bulk-loading, so your app wouldn't need to call out to psql to load the CSV data once it'd produced it.
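For example, loading one such file might look roughly like this (table name, column list, and path are placeholders; through the libpq COPY API you would use COPY ... FROM STDIN instead of a server-side file):
-- Server-side bulk load of a prepared CSV (the file must be readable by the server)
COPY the_table (col1, col2, col3)
FROM '/tmp/the_table.csv'
WITH (FORMAT csv);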
I am trying to use SQLite (sqlite3) for a project to store hundreds of thousands of records (I would like SQLite so users of the program don't have to run a [my]sql server).
I sometimes have to update hundreds of thousands of records to enter left/right values (they are hierarchical), but have found the standard
update table set left_value = 4, right_value = 5 where id = 12340;
to be very slow. I have tried surrounding every thousand or so with
begin;
....
update...
update table set left_value = 4, right_value = 5 where id = 12340;
update...
....
commit;
but again, very slow. Odd, because when I populate it with a few hundred thousand (with inserts), it finishes in seconds.
I am currently trying to test the speed in Python (the slowness occurs both at the command line and in Python) before I move it to the C++ implementation, but right now this is way too slow and I need to find a new solution unless I am doing something wrong. Thoughts? (I would take an open-source alternative to SQLite that is portable as well.)
Create an index on table.id
create index table_id_index on table(id)
Other than making sure you have an index in place, you can check out the SQLite Optimization FAQ.
Using transactions can give you a very big speed increase, as you mentioned, and you can also try turning off journaling (a combined sketch follows the excerpts below).
Example 1:
2.2 PRAGMA synchronous
The Boolean synchronous value controls whether or not the library will wait for disk writes to be fully written to disk before continuing. This setting can be different from the default_synchronous value loaded from the database. In typical use the library may spend a lot of time just waiting on the file system. Setting "PRAGMA synchronous=OFF" can make a major speed difference.
Example 2:
2.3 PRAGMA count_changes
When the count_changes setting is ON, the callback function is invoked once for each DELETE, INSERT, or UPDATE operation. The argument is the number of rows that were changed. If you don't use this feature, there is a small speed increase from turning this off.
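Putting the transaction and PRAGMA suggestions together, a rough sketch (the table name is a placeholder for the question's table; synchronous=OFF risks corruption if the machine loses power mid-write):
PRAGMA synchronous = OFF;      -- don't wait for writes to reach the disk
PRAGMA count_changes = OFF;    -- skip the per-statement changed-row callback
PRAGMA journal_mode = MEMORY;  -- keep the rollback journal out of the filesystem

BEGIN;
UPDATE tree SET left_value = 4, right_value = 5 WHERE id = 12340;
-- ... the rest of the batch of updates ...
COMMIT;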