I work on a team that runs a big data QA project. In addition to the QA, I'm often tasked with trying to improve the speed/efficiency of the SAS code (the QA evolves a lot). One coworker will write multiple DATA or PROC steps and then put a single RUN; at the end of them all. I never learned to code that way. All I care about is speed and memory use: does this style of coding impact that?
example"
data a;
    set b;
    if yadda yadda;
proc transpose data=a out=c;
    id ;
    var ;
data ;
    set ;
data ;
    set ;
run;
Any execution speed impact (plus or minus) from leaving off the explicit step ending statements will be trivial. SAS will determine the step has ended when it sees that you have started another step (PROC or DATA).
The only speed impact would be if you were running interactively, say in Display Manager, and left the last DATA step unterminated. The SAS session would just wait for you to finish defining the DATA step before it would start to compile and run it.
But leaving them off might have a large impact on your ability to maintain the code or debug issues by reading the SAS log. You also run the risk of a single coding mistake impacting more than one step. If a step has been ended with a RUN statement and does not contain any coding errors, it will run. But if there is a coding error in the first line of the next step, that might impact both steps' ability to be understood and executed by SAS.
I would consider this poor programming style, but it doesn't affect the functioning of the program.
SAS will consider a DATA or PROC step terminated when it encounters a step boundary, such as one of these statements:
RUN
QUIT
DATA
PROC
Any of those ends the current step.
From SAS's RUN documentation:
Although the RUN statement is not required between steps in a SAS program, using it creates a step boundary and can make the SAS log easier to read.
For that reason, I consider it mandatory in my environment. But it doesn't affect the actual running time. However, to me running time is less important than programmer time.
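For illustration, here is the same sort of flow with every step explicitly terminated (the data set and variable names are hypothetical stand-ins):
data a;
    set b;
    if region = 'EAST';   /* hypothetical subsetting IF */
run;

proc transpose data=a out=c;
    id product;           /* hypothetical ID variable */
    var sales;            /* hypothetical analysis variable */
run;
With a RUN statement after each step, the log attributes its notes and errors to the correct step, which is the maintainability benefit described above.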
I am currently working on optimizing a script in SAS EG.
The script basically has to be run for each month since 2012, on a daily basis.
Up until 2022 the code for all periods took less than 24 hours to run, but every now and then it exceeds this threshold.
The process (which is a macro) is structured as follows:
Retrieve data from several Oracle tables
Transform (transpose/concatenate...)
Compute statistics based on explicit rules
Delete statistics for the given month in Result Oracle table
Insert the new statistics for the given month in Result Oracle table
The reason why it takes so much time is that we run the program sequentially, looping over every period.
%macro run_all_periods;
    %let start_date = 31jan2012;                             /* define start date */
    %let i = %sysfunc(inputn(&start_date, date9.));          /* convert to a SAS date value */
    /* define the last date to be considered */
    %let today = %sysfunc(date());                           /* today */
    %let last_date = %sysfunc(intnx(month, &today, -1, e));  /* last period to be considered */
    %do %until (&i > &last_date);                            /* loop over all reference periods until last_date */
        %let date = %sysfunc(putn(&i, date9.));
        %run_script(period=&date);
        %let i = %sysfunc(intnx(month, &i, 1, e));           /* following period */
    %end;
%mend run_all_periods;
As the periods are independent of each other (i.e., the order in which they run doesn't matter), I think it would be better to run all periods in parallel instead of optimizing the script itself.
Therefore, is there any way to run the same script in SAS EG in parallel with different arguments (in my case, periods)?
At the same time, we are currently testing SAS Viya at work. While looking into its functionality, I found out about Background Submit.
You can run a saved SAS program, query, task, or flow as a background submission, which means that the file can run while you continue to use SAS Studio. You can view the status of files that have been submitted in the background, and you can cancel files that are currently running in the background.
And the associated note caught my eye:
Note: Because a background submission uses a separate compute server, any libraries or tables that are created by the submission do not appear in the Libraries section of the navigation pane in SAS Studio.
Would it be possible to leverage this functionality to run the same script several times in the background with different periods?
There are at least ten different ways to do this, both on SAS 9.4 and SAS Viya. Viya is a bit more "native" to this concept, as it tries to do everything as parallel as possible, but 9.4 would also do this no problem.
Command line submission - you can write a .bat/.ps1/shell script/etc. to run SAS for each of your jobs, and submit them all in parallel. That's how we do it usually - we call SAS once for each job.
SAS/CONNECT - using MP CONNECT, this is very easy to do (see the sketch after this list). Just make sure you have it set up to run things in parallel and not wait until the point where you actually need the results (if any such point exists).
Grid multiprocessing - using the SAS Grid Manager. Not that different from how SAS/CONNECT works, really. You use PROC SCAPROC to analyze the program and identify splits, or just do it yourself.
Background submits in SAS Studio - this is possible in both 9.4 and Viya. Each job is in a separate session. This is somewhat manual though, so you'd have to submit each job by hand.
Use Python to do the parallelization, then SASPy (9.4) or SWAT (Viya) to submit the SAS jobs.
Directly call SAS (using the X command) for each of your sub-jobs
Use EG's built in ability to run multiple processes at once - see The SAS Dummy for more.
Use EG's scheduling interface to run multiple jobs
Use the built-in SAS Scheduler to run the various jobs
Modify your script to use BY-group processing so that you can do the whole thing at once but still take advantage of efficiencies (this will work for some jobs and not for others).
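As a rough illustration of the SAS/CONNECT option above, here is a minimal MP CONNECT sketch. It assumes SAS/CONNECT is licensed, that the %run_script macro is defined in an autocall library visible to the spawned sessions, and that starting one session per period is acceptable (in practice you would throttle the number of concurrent sessions).
options autosignon sascmd="!sascmd";

%macro run_all_parallel;
    %let i = %sysfunc(inputn(31jan2012, date9.));
    %let last_date = %sysfunc(intnx(month, %sysfunc(date()), -1, e));
    %let n = 0;
    %let jobs = ;
    %do %until (&i > &last_date);
        %let n = %eval(&n + 1);
        %let jobs = &jobs job&n;
        %let date = %sysfunc(putn(&i, date9.));
        /* wait=no starts the remote job and returns immediately; &date
           resolves locally, while %nrstr defers %run_script so that it
           executes in the remote session */
        rsubmit job&n wait=no;
            %nrstr(%%run_script)(period=&date);
        endrsubmit;
        %let i = %sysfunc(intnx(month, &i, 1, e));
    %end;
    waitfor _all_ &jobs;   /* block until every remote session finishes */
    signoff _all_;         /* close all remote sessions */
%mend run_all_parallel;

%run_all_parallel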
Hope someone can shed some light on this for me.
I have a process that uses the table below. There is a second table (resource5) that has the same data as resource4; basically I can use either table. Not sure why there are two, to be honest, but it may come in handy here.
Both tables are updated sequentially twice an hour at irregular intervals, so I cannot schedule around them, and it seems to take around 5 minutes to update each table.
I always need the latest available data, and other data is live, so I'm hitting the table quite frequently (every 15 minutes).
Is there a way to check whether resource4 is available to be locked by my process and, if so, proceed to run the DATA step; if not, hit resource5 instead; and if resource5 is also unavailable, just quit the entire process so nothing else (other PROC SQL against Oracle) tries to run?
As long as work.resource4 appears and is usable, all is well.
All my code does is this; once the data is in WORK, I can do whatever I need without fear of an issue.
data resource4;
    set publprev.resource4;
run;
PS: I'm using SAS EG on Windows to make the change; the process is then exported via a .sas file with all of the code and runs on a Unix SAS server via crontab through a shell script, which also creates a log file. Probably not the most efficient way to schedule this stuff, but it is what I have.
Many thanks in advance.
You can use the OPEN function to check whether a table is available to you for reading (i.e., for copying to WORK).
You will have to use a macro variable to provide the name of the available data set to your DATA step.
Example:
* NOTE: the DATA step automatically closes any data sets it open()s at the end of the step;
%let RESOURCE=NONE AVAILABLE;
data _null_;
    if open('publprev.resource4') ne 0 then do;
        call symput('RESOURCE', 'publprev.resource4');
        stop;
    end;
    if open('publprev.resource5') ne 0 then do;
        call symput('RESOURCE', 'publprev.resource5');
    end;
run;
data work.resource;
    set &RESOURCE;   * fails the step, and thus the job, if neither table was available;
run;
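If you prefer to keep the check in macro code, here is an alternative sketch using %sysfunc(open()). Note that, unlike in a DATA step, data sets opened this way must be closed explicitly. The macro emits the name of the first readable table, or the same NONE AVAILABLE placeholder so that the copy step fails if neither is readable.
%macro pick_resource;
    %local dsid rc;
    %let dsid = %sysfunc(open(publprev.resource4));
    %if &dsid %then %do;
        %let rc = %sysfunc(close(&dsid));
        publprev.resource4
    %end;
    %else %do;
        %let dsid = %sysfunc(open(publprev.resource5));
        %if &dsid %then %do;
            %let rc = %sysfunc(close(&dsid));
            publprev.resource5
        %end;
        %else %do;
            NONE AVAILABLE
        %end;
    %end;
%mend pick_resource;

data work.resource;
    set %pick_resource;   /* fails the step if neither table was readable */
run;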
I have one main program in SAS, and within it two other SAS programs are called.
These two SAS programs create formats using PROC FORMAT with CNTLIN= from large data sets; the formats are temporary, meaning they reside in the WORK space. These formats are used in the SAS program to assign formats to some variables.
In the main SAS program, almost 15 large data sets are created in the WORK library.
Some PROC SQL joins and DATA step merges are happening.
We have index creation on data sets using PROC DATASETS.
We also used PROC SORT.
Wherever possible, we used WHERE instead of IF.
It has the MPRINT, MLOGIC, and SYMBOLGEN options enabled.
And some small logic-wise performance tuning has been done.
Here, most of the data set creation is done in the WORK library. If we clear the entire WORK space, the previously created formats are lost. We don't want to lose the formats until the end of the job because they are used throughout the SAS program.
It is taking 1 TB of SAS workspace to accomplish this whole job, so I want to reduce the space used.
Can someone please suggest what optimizations we can make to use less space as well as less memory?
Write the format catalogs to a different folder.
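A minimal sketch of that idea, assuming fmtlib points at a folder you can write to (the path and the input data set work.fmt_input are hypothetical): build the format catalog in a permanent library so that clearing WORK does not destroy it, and point FMTSEARCH at it.
libname fmtlib '/myproject/formats';

/* build the formats into the permanent catalog instead of WORK */
proc format library=fmtlib cntlin=work.fmt_input;
run;

/* make SAS search that catalog when resolving format references */
options fmtsearch=(fmtlib work);

/* intermediate WORK data sets can then be deleted as soon as they are
   no longer needed, freeing workspace */
proc datasets lib=work nolist;
    delete fmt_input;
quit;
With the formats safely outside WORK, you can delete or rebuild intermediate WORK data sets aggressively without losing them.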
I'm new to SAS EG; I usually use Base SAS when I actually need to program, but my company is moving heavily toward EG. I'm helping some areas with code to get data they need on an ad-hoc basis (the code won't change, though).
However, during processing we create many temporary files that are just iterations across months, i.e., if the user wants data from 2002-2016, we have to pull all those libraries and then concatenate them with our results. This is due to high transactional volume; the final data set is limited to a small number of observations. Whenever I run this program, though, SAS EG displays all 183 of the data sets created by the macro, making it very ugly, and sometimes the "Output Data" that appears isn't even the output of the last DATA step but of an intermediary step, making it annoying to search through for the final output data set.
Is there a way to limit the datasets written to "Output Data" so that it only shows the final dataset - so that our end user doesn't need to worry about being confused?
There's a ton of output data sets that I don't care to see; I just want the final one, which is located (somewhere) in that list...
Version is SAS EG 7.1.
When the program ends, EG will always automatically show every data set that was created. If you don't want it to show any intermediate tables, delete them at the very last step in your process.
In your case, it looks as if your temporary tables all share the prefix TRN. You can clean them up as such:
/* Start of process flow */
<program statements>;
/* End of process flow*/
proc datasets lib=work nolist nowarn nodetails;
    delete TRN:;
quit;
Be careful if you do this. Make sure that all of your temporary tables follow the same prefix naming scheme, otherwise you may accidentally delete tables that you need.
Another solution is to limit the number of datasets generated, and have a user-created link to the final dataset. There's an article about it here.
The alternate solution here is to add the output data set explicitly as an entry on your process flow and disregard the Output Data window unless you need to investigate something from the intermediary data sets.
This has the advantage that it lets you look at the intermediary data sets if something goes wrong, but also means you don't have to look through all of them to find the final data set.
Once the final output data set has been created once, you should be able to add it to the process flow easily, and from then on it will be there for you to select.
Is there a way to force SAS to continue processing despite finding errors?
I'm appending a large quantity of data sets at the moment; however, within the list of data set names I have, some don't exist. This is resulting in a bunch of errors and causing SAS to exit with the message "The SAS System stopped processing this step because of errors.".
You could evaluate the existence of a dataset using the EXIST() function and make the execution of the append conditional on the outcome.
http://support.sas.com/documentation/cdl/en/lrdict/64316/HTML/default/viewer.htm#a000210903.htm
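A minimal sketch of that approach, assuming you are appending with PROC APPEND; the macro and data set names below are hypothetical stand-ins.
%macro safe_append(base=, data=);
    %if %sysfunc(exist(&data)) %then %do;
        proc append base=&base data=&data;
        run;
    %end;
    %else %put NOTE: &data does not exist - append skipped.;
%mend safe_append;

%safe_append(base=work.all_months, data=work.sales_jan2024)
%safe_append(base=work.all_months, data=work.sales_feb2024)
Missing members are then skipped with a note in the log instead of stopping the step.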