I am currently working on optimizing a script in SAS EG.
The script has to be run every day for each month since 2012.
Up until 2022, the code for all periods took less than 24 hours to run, but every now and then it exceeds this threshold.
The process (which is a macro) is structured as follows:
Retrieve data from several Oracle tables
Transform (transpose/concatenate...)
Compute statistics based on explicit rules
Delete statistics for the given month in Result Oracle table
Insert the new statistics for the given month in Result Oracle table
The reason it takes so much time is that we run the program sequentially, looping over every period.
%macro run_all_periods;
    %let start_date = 31jan2012; /* first reference period */
    %let i = %sysfunc(inputn(&start_date, date9.)); /* convert to a SAS date value */
    /* define last date to be considered */
    %let today = %sysfunc(today()); /* today as a SAS date value */
    %let last_date = %sysfunc(intnx(month, &today, -1, e)); /* last period to be considered */
    %do %until (&i > &last_date); /* loop over all reference periods until last_date */
        %let date = %sysfunc(putn(&i, date9.));
        %run_script(period=&date);
        %let i = %sysfunc(intnx(month, &i, +1, e)); /* following period */
    %end; /* closes the %do %until loop */
%mend run_all_periods;
As the periods are independent of each other (i.e., the order in which they run doesn't matter), I think it would be better to run all periods in parallel rather than optimize the script itself.
Therefore, is there any way to run the same script in SAS EG in parallel with different arguments (in my case, periods)?
At the same time, we are currently testing SAS Viya at work. While looking into its functionality, I found out about Background Submit.
You can run a saved SAS program, query, task, or flow as a background submission, which means that the file can run while you continue to use SAS Studio. You can view the status of files that have been submitted in the background, and you can cancel files that are currently running in the background.
And the associated note caught my eye:
Note: Because a background submission uses a separate compute server, any libraries or tables that are created by the submission do not appear in the Libraries section of the navigation pane in SAS Studio
Would it be possible to leverage this functionality to run the same script several times in the background with different periods?
There are at least ten different ways to do this, both on SAS 9.4 and SAS Viya. Viya is a bit more "native" to this concept, as it tries to do everything as parallel as possible, but 9.4 would also do this no problem.
Command line submission - you can write a .bat/.ps1/shell script/etc. to run SAS for each of your jobs, and submit them all in parallel. That's how we do it usually - we call SAS once for each job.
SAS/CONNECT - using MP CONNECT, this is very easy to do. Just make sure you set it up to run things in parallel and to wait only at the point where you want it to wait (if any exists).
Grid multiprocessing - using the SAS Grid Manager. Not that different from how SAS/CONNECT works, really. You use PROC SCAPROC to analyze the program and identify splits, or just do it yourself.
Background submits in SAS Studio - this is possible in both 9.4 and Viya. Each job is in a separate session. This is somewhat manual though, so you'd have to submit each job by hand.
Use Python to do the parallelization, then SASPy (9.4) or SWAT (Viya) to submit the SAS jobs.
Directly call SAS (using x command) for each of your sub-jobs
Use EG's built-in ability to run multiple processes at once - see The SAS Dummy for more.
Use EG's scheduling interface to run multiple jobs
Use the built-in SAS Scheduler to run the various jobs
Modify your script to use BY group processing so that you can do the whole thing at once but still take advantage of efficiencies (some jobs this will work for, some it won't).
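As an illustration of the "call SAS once for each job" idea, here is a minimal sketch that launches one batch SAS session per period with SYSTASK COMMAND and waits for them in batches. It is only a sketch under several assumptions: the session must be allowed to run operating-system commands (XCMD, which is often disabled on EG workspace servers), the `sas` command name, the `/path/...` locations, the wrapper program `run_one_period.sas` (which would define and call `%run_script(period=&sysparm);`) and the `batch_size` parameter are all hypothetical.

```sas
/* Sketch only: paths, the "sas" command and run_one_period.sas are hypothetical. */
%macro run_parallel(first=31jan2012, batch_size=4);
    %local i last n names date;
    %let i     = %sysfunc(inputn(&first, date9.));
    %let last  = %sysfunc(intnx(month, %sysfunc(today()), -1, e));
    %let n     = 0;
    %let names = ;

    %do %while (&i <= &last);
        %let date = %sysfunc(putn(&i, date9.));
        %let n    = %eval(&n + 1);

        /* Start one independent batch session; NOWAIT returns immediately. */
        systask command
            "sas -sysin /path/run_one_period.sas -sysparm &date -log /path/logs/run_&date..log"
            taskname=t&n status=rc&n nowait;
        %let names = &names t&n;

        /* After every batch_size submissions, wait for that batch to finish. */
        %if %sysfunc(mod(&n, &batch_size)) = 0 %then %do;
            waitfor _all_ &names;
            %let names = ;
        %end;

        %let i = %sysfunc(intnx(month, &i, 1, e));
    %end;

    /* Wait for any tasks left over from the last, partial batch. */
    %if %length(&names) > 0 %then %do;
        waitfor _all_ &names;
    %end;
%mend run_parallel;

%run_parallel(batch_size=4)
```

Limiting the batch size keeps the number of concurrent Oracle connections and SAS sessions bounded; the right value depends on the server and would need to be found by testing.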
I work on a team that runs this big data QA project. In addition to the QA I'm often tasked with trying to improve the speed/efficiency of the SAS code (the QA evolves a lot). One coworker will write multiple DATA or PROC steps and then put a single RUN; at the end of them all. I never learned to code that way. All I care about is speed and memory use. Does this style of coding impact that?
Example:
data a;
set b;
if yadda yadda;
proc transpose data= a;
out= c;
id ;
var;
data ;
set;
data;
set;
run;
Any execution speed impact (plus or minus) from leaving off the explicit step ending statements will be trivial. SAS will determine the step has ended when it sees that you have started another step (PROC or DATA).
The only speed impact would be if you were running interactively, say in Display Manager, and left the last DATA step unterminated. The SAS session would just wait for you to finish defining the data step before it would start to compile and run it.
But leaving them off might have a large impact on your ability to maintain the code or debug any issues by reading the SAS logs. You also might run the risk of a coding mistake impacting more than one step. If the step has been ended with a RUN statement and does not contain any coding errors, it will run. But if the step is left unterminated and there is a coding error in the first line of the next step, that might affect SAS's ability to understand and execute both steps.
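For comparison, a small sketch of the same kind of program with every step explicitly terminated. It uses the SASHELP.CLASS sample table instead of the tables in the question, so the data set and variable names here are illustrative assumptions only.

```sas
/* Each step ends with its own RUN; so the log shows one clean block per step. */
data a;
    set sashelp.class;
    if age >= 13;          /* subsetting IF: keep only these rows */
run;

proc transpose data=a out=c;
    id name;               /* one output column per student name */
    var height;
run;
```

With this layout, an error in the PROC TRANSPOSE statement cannot be mistaken for part of the preceding DATA step, because that step has already been compiled and run.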
I would consider this poor programming style, but it doesn't affect the functioning of the program.
SAS will consider a DATA or PROC step terminated when it encounters a step boundary, such as:
RUN
QUIT
DATA
PROC
Any of those ends the current step.
From SAS's RUN documentation:
Although the RUN statement is not required between steps in a SAS program, using it creates a step boundary and can make the SAS log easier to read.
For that reason, I consider it mandatory in my environment. But it doesn't affect the actual running time. However, to me running time is less important than programmer time.
Hope someone can shed some light on this for me.
I have a process that uses the below table. There is a subsequent table (resource5) that has the same data as resource4 - basically I can use either table - not sure why there's two to be honest but it may come in handy here.
Both tables are updated sequentially twice an hour at irregular intervals, so I cannot schedule around them and it seems to take around 5mins to update each table.
I always need the latest available data, and other data is live so I'm hitting the table quite frequently (every 15 mins).
Is there a way to check whether resource4 is available to be locked by my process? If it is, proceed to run the data step; if not, hit resource5 instead; and if resource5 is not available either, just quit the entire process so that nothing else (other proc sql steps against Oracle) tries to run.
As long as work.resource4 appears and is usable then all is well.
All my code does is this, once it's in WORK I can do whatever without fear of an issue.
data resource4;
set publprev.resource4;
run;
P.S. I'm using SAS EG in Windows to make the change, then the process is exported via a .sas file with all of the code and runs off of a Unix SAS server via crontab through a shell script which also creates a log file. Probably not the most efficient way to schedule this stuff but it is what I have.
Many thanks in advance.
You can use the function open to check if a table is available to you for reading (i.e. copying to WORK).
You will have to use macro to provide the name of the available data set to your DATA Step.
Example:
* NOTE: The DATA Step will automatically close resources it open()s;
%let RESOURCE=NONE AVAILABLE;
data _null_;
if open ('publprev.resource4') ne 0 then do;
call symput('RESOURCE', 'publprev.resource4');
stop;
end;
if open ('publprev.resource5') ne 0 then do;
call symput('RESOURCE', 'publprev.resource5');
end;
run;
data work.resource;
set &RESOURCE;
run;
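The question also asks to quit the whole process when neither table can be read. A possible variant, sketched here in macro code, is to test both tables with %SYSFUNC(OPEN()) and cancel the run if both fail. The macro name `pick_resource` is hypothetical, and the sketch assumes the job runs in batch (as in the crontab setup described), where %ABORT CANCEL is expected to stop the remaining program. A successful open() only shows the table was readable at that moment; it does not hold a lock afterwards.

```sas
/* Sketch: pick_resource is a hypothetical macro name. */
%macro pick_resource;
    %local dsid rc resource;
    %let resource = ;

    /* Try resource4 first; open() returns 0 when the table cannot be opened. */
    %let dsid = %sysfunc(open(publprev.resource4));
    %if &dsid > 0 %then %do;
        %let rc = %sysfunc(close(&dsid));
        %let resource = publprev.resource4;
    %end;
    %else %do;
        /* Fall back to resource5. */
        %let dsid = %sysfunc(open(publprev.resource5));
        %if &dsid > 0 %then %do;
            %let rc = %sysfunc(close(&dsid));
            %let resource = publprev.resource5;
        %end;
    %end;

    /* Neither table is readable: stop here so later steps do not run. */
    %if %length(&resource) = 0 %then %do;
        %put ERROR: Neither publprev.resource4 nor publprev.resource5 could be opened.;
        %abort cancel;
    %end;

    /* Copy whichever table was available into WORK. */
    data work.resource4;
        set &resource;
    run;
%mend pick_resource;

%pick_resource
```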
I am designing an automatic SAS program. I want it to execute the very first time I start SAS each day, and it should be executed only once. That is to say, I may start SAS several times that day, but the automatic program will be executed only the first time I start SAS.
There are also some restrictions:
1. It won't be executed if I have not used SAS that day;
2. It won't be executed if I happen to be working in SAS at daybreak;
I think recording the number of SAS startups is the key but have no idea on how to record it. Thanks for any hints.
Same as Quentin's comment
Add code such as the following to your autoexec.
options nodsnferr;
data _null_;
if not exist ('sasuser.laststart') then
call execute ('%include "my-once-a-day.sas";');
set sasuser.laststart;
if date < today() then
call execute ('%include "my-once-a-day.sas";');
run;
options dsnferr; /* restore the default behaviour for missing data sets */
data sasuser.laststart;
date = today();
run;
If you run multiple concurrent SAS sessions with different autoexecs and sasuser paths the above is not sufficient.
In SAS, what is the difference between the 'quit' and 'run' statements? I cannot figure out when to use 'quit' and when to use 'run'. For example, why does proc datasets use quit while proc contents uses run?
This dates back to where SAS used to be a mainframe program (and still can be!).
RUN; is a command for SAS to run the submitted statements. Back in the older mainframe days, statements would've been submitted to SAS one at a time (or in batches, but the core concept here is that each line is separate from SAS's point of view). SAS accepts statements without doing anything until it hits a RUN; or something else that would create a step boundary (another DATA or PROC line, usually). In a data step, or a non-interactive proc (proc means, for example - a proc that can only do one set of instructions, and then exits), run tells it to do (whatever) and then return to a blank slate.
QUIT; is used in interactive programming environments. IML, SQL, many of the regression and modelling PROCs, FORMAT, TEMPLATE, DATASETS, etc. - all can be used interactively, meaning, more than one set of instructions can be sent to them.
In these interactive cases, you want SAS to go ahead and run some of the instructions, but still keep that PROC's environment open - your next statement would be in the same PROC, for example. Some of those run immediately - PROC SQL is a good example of this - while in others (particularly the modelling PROCs) RUN; does something (tells it to run the model so far) but the proc won't exit until QUIT; is encountered (or another step boundary that requires it to exit, i.e. a data/proc statement). These are called "run groups", and "run group processing" is the term you'll see associated with that.
You will find that some people put run; quit; at every point that run; or quit; might be appropriate; that doesn't hurt anything, though it isn't really 'right', either. And there are some cases where it's needed to do that!
One example:
/* first run group*/
proc gplot data=sales;
title1 "Sales Summary";
plot sales*model_a;
run;
/* second run group */
plot sales*model_b;
run;
quit;
(from run-group processing)
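Since the question mentions PROC DATASETS specifically, here is a small sketch of the same run-group pattern there. The WORK tables `class` and `class_copy` are created from the SASHELP.CLASS sample table purely for illustration.

```sas
/* Create two throwaway tables to operate on. */
data work.class work.class_copy;
    set sashelp.class;
run;

proc datasets library=work nolist;
    /* First run group: executes at RUN;, the procedure stays open. */
    modify class;
        label height = "Height in inches";
    run;

    /* Second run group, still inside the same PROC DATASETS. */
    delete class_copy;
    run;
quit;   /* QUIT; is what actually ends the procedure */
```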
I am a DBA / R user. I just took a job in an office full of SAS users and I am trying to understand better how SAS' proc sql works. I understand that SAS includes a relational database and it includes the ability to run proc sql against external servers like Oracle. I am trying to better understand when / how it decides to use the database server rather than its internal database system.
I have seen some really S. L. O. W. SAS code where my coworkers are running a series of proc sql commands. These programs typically include 3 - 5 proc sql steps. Each proc sql command creates a local SAS table. They are not using passthrough sql. The data sets are large (1 million rows +) and these proc sql steps run slowly. Most of the data lives on the server. There is usually a small table that defines the population that we want to look at and it is in a SAS data file, but everything else lives on the server.
I have demonstrated dramatic improvements in speed by simply running all of the queries directly on the server. (Oracle in this case, but I don't think that is important.) Usually, I have to first upload a table to my personal schema that defines the population of clients we want to examine. Everything else is on the server. Sometimes I collapse their queries together because they can be done in a single step, but I do not believe that is why my version of their program is so much faster.
I think proc sql uploads the initial data set and then runs the first query on the server. It then downloads the output to the local computer, creating the local SAS data set. For the second proc sql step, it uploads the table created in step one back to the server and then runs the query on the server. To make this all even worse, the "local" SAS data sets are actually stored on a remote server, not the actual local machine. This is invisible to SAS, but it does mean we are copying data across the network yet again. I believe SAS is running slowly because of a large amount of unnecessary network traffic.
Question #1 - Is my understanding of what proc sql is doing correct? Are we really wasting as much time as I think we are uploading and downloading large tables / data sets across our network?
Question #2 - Is there some way to control when proc sql runs against a server versus when it runs against the local database? In some cases, if we could prevent the upload / download step, the query would run more efficiently.
Short answer
Your understanding is not exactly correct, but it's in the right ballpark. SQL is probably not sending the SAS dataset to the server, it is more likely downloading the server data to SAS - but it's probably downloading the entire table, not limited by the join criteria. Your solution is exactly what I would suggest doing - hopefully your colleagues will get on board.
Long answer
In terms of how the processing works, it depends on your code. PROC SQL will execute code locally (as in, on the SAS server/desktop), unless it decides to pass the query up to the server and hasn't been told it's not allowed to. That's called implicit passthrough. You can't really control it except to turn it entirely off (with noipassthru on the PROC SQL statement). You can look at it sometimes using options msglevel=i; (a system option), and _METHOD or _TREE to see what SQL decided to do (similar to explain plan).
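One way to see what is actually sent to the database is the SASTRACE system option, which writes the SQL generated by the SAS/ACCESS engine to the log. A minimal sketch, assuming a libref named `oralib` has already been assigned to Oracle and contains a table called `tableA` (both names are placeholders):

```sas
/* Write the SQL that SAS passes to the DBMS into the SAS log. */
options sastrace=',,,d' sastraceloc=saslog nostsuffix;
options msglevel=i;

proc sql _method;   /* _METHOD prints the join/execution methods SAS chose */
    select count(*) as n_rows
    from oralib.tableA;
quit;

/* Turn tracing off again so later logs stay readable. */
options sastrace=off;
```

If the log shows the whole query being prepared on the Oracle side, it was passed through; if it shows only a plain SELECT of the columns, the rows are being downloaded and processed by SAS.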
I've had cases where it caused harm: SQL Server runs character comparisons case-insensitively while SAS does not, and I had a particular query that sometimes was sent up to the server and sometimes not depending on details of the data. I wasn't careful enough with checking case, and so it appeared to work when it really wasn't correct (comparing Propcase to UPCASE).
The general rule is that SAS will try to send the query to the server if:
The data in the query entirely already resides on the server
The query is sufficiently simple that SAS can easily figure out how to tell the server to do it, in its native language
If you're running a query with a local SAS dataset (say, joining a server table to a SAS dataset locally), it won't (at least as far as I know) go to the server. It should always run it locally, which would mean downloading from the server all data in the contributing tables (possibly filtered if there is a logical filter in the query). For example (these aren't necessarily good SQL code, just examples of the concept):
libname oralib oracle [connection info];
proc sql;
*Will pass through likely;
select tableA.*, tableB.cost
from oralib.tableA inner join oralib.tableB
on tableA.id=tableB.id;
*Will probably not pass through;
select tableA.*, tableB.cost
from oralib.tableA inner join work.tableB
on tableA.id=tableB.id;
*Might pass through, might not;
select tableA.*, tableB.cost, tableC.productID
from oralib.tableA inner join oralib.tableB
on tableA.id=tableB.id
left join oralib.tableC
on tableA.id=tableC.id;
*This downloads the data but probably applies the where statement server side;
select tableA.*, tableB.cost
from oralib.tableA inner join work.tableB
on tableA.id=tableB.id
where tableA.date < '01JAN2010'd;
quit;
In the case of the second query, it probably pulls all of tableA down. In the fourth query, it likely will pass the where clause to the server (assuming the date doesn't cause a problem, but it shouldn't, SAS knows how to convert dates to oracle type dates).
Note that SAS procs can also generate passthrough. PROC MEANS, etc., will send the instructions to Oracle to do the means/sums/etc. if it can easily do so.
Your best bet is to:
Try to do everything in pass through that you can (and that makes sense). Only way to be sure it goes to the server is to use passthrough.
If you have a large table on the server and a small table in SAS, upload the table in SAS to the server. A passthrough session and a libname session can't see each other's session-specific temporary tables, so you'd have to use a GTT or similar (something all users can see). Similarly, if you have a large table in SAS and a small table (or small query result) in SQL, bring it down locally (through passthrough if necessary).
When you do have to bring things down, limit as much as possible. When I worked in that kind of environment, I made huge time savings simply by joining to tables on the server to limit my result set before bringing them down.
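The points above can be sketched with explicit passthrough, where the text inside the parentheses is sent to Oracle as written and only the result set comes back to SAS. This is an assumption-laden sketch: it reuses the connection of an already assigned libref `oralib` via CONNECT USING, and `tableA`, `tableB`, `cost` and `txn_date` are placeholder names, not objects from the question.

```sas
proc sql;
    /* Reuse the connection details of the existing libref. */
    connect using oralib as ora;

    /* The inner query is Oracle SQL and runs entirely on the server. */
    create table work.result as
    select *
    from connection to ora (
        select a.id, b.cost
        from tableA a
        inner join tableB b
            on a.id = b.id
        where a.txn_date < DATE '2010-01-01'
    );

    disconnect from ora;
quit;
```

Because the join and the filter run in Oracle, only the rows in the final result cross the network.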
At the end of the day, you will be constrained by network traffic no matter what you do; just try to optimize it as best you can. It sounds like you understand how to do that already, so just do what you normally would do in non-SAS environments.