How to record the number of SAS startups? - sas

I am designing an auto SAS program. I want it execute at the very first time I start SAS everyday and it should be executed only once. That is to say, I may start SAS several times this day, but the auto program will be executed only the first time I start SAS.
There are also some restricts:
1. It won't be executed if I have not use my SAS one day;
2. It won't be executed if I happen to working on SAS at daybreak;
I think recording the number of SAS startups is the key but have no idea on how to record it. Thanks for any hints.

Same as Quentin's comment
Add code such as the following to your autoexec.
options nodsnferr;
data _null_;
if not exist ('sasuser.laststart') then
call execute ('%include "my-once-a-day.sas";');
set sasuser.laststart;
if date < today() then
call execute ('%include "my-once-a-day.sas";');
run;
options nodsnferr;
data sasuser.laststart;
date = today();
run;
If you run multiple concurrent SAS sessions with different autoexecs and sasuser paths the above is not sufficient.

Related

Run the same SAS program in parallel with different argument

I am currently working on optimizing a script in SAS EG.
The script basically has to be run for each month since 2012 on a daily-basis.
Up until 2022 the code for all periods took less than 24 hours to run but every now and then, it exceed this treshold.
The process (which is a macro) is structured as follow:
Retrieve data from several Oracle tables
Transform (transpose/concatenate...)
Compute statistics based on explicit rules
Delete statistics for the given month in Result Oracle table
Insert the new statistics for the given month in Result Oracle table
The reason why it takes so much time is because we run the program sequentially, looping on every periods.
%macro run_all_periods;
%let start_date = 31jan2012; * define start date;
%let i = %sysfunc(inputn(&start_date, date9.)); * date format;
* define last date to be considered ;
%let today = %sysfunc(inputn(%sysfunc(date(), date9.), date9.)); * today;
%let last_date = %sysfunc(intnx(month, &today,-1, e)); * last period to be considered;
%do %until (&i > &last_date); * do loop for all reference periods until last_date;
%let date=%sysfunc(putn(&i, date9.));
%run_script(period=&date);
%let i = %sysfunc(intnx(month, &i, +1, e)); * following period;
%mend;
As the periods are independant to each other (i.e., it doesn't matter the order for which it run) I think that it would be better to run all periods in parallel instead of optimizing the script in itself.
Therefore, is there any way to run the same script in SAS EG in parallel with different argument (in my case periods)?
At the same time, we are currently testing SAS Viya at work. While looking into the functionnalities, I found out about the Background Submit.
You can run a saved SAS program, query, task, or flow as a background submission, which means that the file can run while you continue to use SAS Studio. You can view the status of files that have been submitted in the background, and you can cancel files that are currently running in the background.
And the associated note caught my eye:
Note: Because a background submission uses a separate compute server, any libraries or tables that are created by the submission do not appear in the Libraries section of the navigation pane in SAS Studio
Would it be possible to leverage this functionnality to run several times the same script in background with different periods ?
There are at least ten different ways to do this, both on SAS 9.4 and SAS Viya. Viya is a bit more "native" to this concept, as it tries to do everything as parallel as possible, but 9.4 would also do this no problem.
Command line submission - you can write a .bat/.ps1/shell script/etc. to run SAS for each of your jobs, and submit them all in parallel. That's how we do it usually - we call SAS once for each job.
SAS/CONNECT - using MP CONNECT, this is very easy to do. Just make sure you have it set up to run things in parallel and not wait until the point you want it to wait for (if any exists).
Grid multiprocessing - using the SAS Grid Manager. Not that different from how SAS/CONNECT works, really. You use PROC SCAPROC to analyze the program and identify splits, or just do it yourself.
Background submits in SAS Studio - this is possible in both 9.4 and Viya. Each job is in a separate session. This is somewhat manual though, so you'd have to submit each job by hand.
Use Python to do the parallelization, then SASPy (94) or SWAT (Viya) to submit the SAS jobs.
Directly call SAS (using x command) for each of your sub-jobs
Use EG's built in ability to run multiple processes at once - see The SAS Dummy for more.
Use EG's scheduling interface to run multiple jobs
Use the built-in SAS Scheduler to run the various jobs
Modify your script to use BY group processing so that you can do the whole thing at once but still take advantage of efficiencies (some jobs this will work for, some it won't).

How to accumulate the records of a newly imported table with the records of another table that I have stored on the servers in SAS?

I am new to SAS and I have the following problem:
When trying to join records I just imported (in one table) with records I have stored in another table.
What happens is that I am going to run the code in SAS daily, and I need the table that I am going to create today (17/05/2021) by importing a file 'X', to join the table that I created yesterday (16/05/2021) by importing a file 'Y'.
And so the code will be executed tomorrow, the next day and so on.
In conclusion the records will accumulate as the days go by.
To tackle this problem, I am first creating two variables, one with the date of the day the code will be executed and the other with the date of the last execution.
%let daily_date = 20210423; /*AAAAMMDD*/
%let last_execution_date = 20210422; /*AAAAMMDD*/
Then the import of a file is done, we can see that the name of this created table has the date of the day in which the code is being executed.
data InputAC.RA_ratings&daily_date;
infile "&ruta_InputRA." FIRSTOBS=2
dsd lrecl=4096 truncover;
input
#1 RA_Customer_ID $10.
#11 Rating_ID 10.
#21 ISRM_Model_Overlay_ID $10.
#31 Constant_ID 10.
#41 Value $100.
;
run;
proc sort data=inputac.RA_ratings&daily_date;
by RA_Customer_ID Rating_ID;
quit;
Finally the union of InputAC.RA_ratings&daily_date with InputAC.RA_ratings&last_execution_date is made. ('InputAC.RA_ratings&last_execution_date' should be the table that was imported at an earlier date than today.)
data InputAC.RA_ratings&fec_diario;
merge
InputAC.RA_ratings&fec_diario
InputAC.RA_ratings&ultima_fecha_de_ejecucion;
by RA_Customer_ID Rating_ID;
run;
This is how the tables are being stored on the server.
(Ignore date 20210413, let's imagine it is 20210422)
However, I have to perform this task without using the variable 'last_execution_date'.
I've been researching but I still can't find any SAS function that can help me with this problem.
I hope someone can help me, thank you very much in advance.
This is a pretty complex and interesting question from an operations point of view. The answer depends on a few things.
How much control do you have over the execution of this process?
Is "yesterday" guaranteed, or does the process need to work if "last execution date" is not yesterday?
What should happen if the process is run twice today?
The best practices way to solve this is to have a dataset (or table) that stores the last execution date. That allows you to handle #2 trivially, and the answer to #3 might guide exactly how you store this but is easily handled anyway.
Say for example you have a table, MetaAC.LastExecDate (or, in spanish, MetaAC.UltimaFecha or similar). It could store things this way:
data LastExecDate;
timestamp = datetime();
execdate = input(&daily_date,yymmdd8.);
run;
proc append base=MetaAC.LastExecDate data=LastExecDate;
run;
This lets you store an arbitrary execdate even if it's not today, and also store when you ran it (for audit purposes), and you could even add who ran it if that's interesting (there is a macro variable &sysuserid or similar). Then put all this at the bottom of your process, and it updates as you go.
Then, you can pull out from this the exact info you want - for example,
proc sql;
select max(execdate)
into :last_exec_date
from MetaAC.LastExecDate
where execdate ne today()
;
quit;
Now, if you don't have control over this for some reason, you could determine this in a different way. Again, the exact process depends on your circumstances and your answers to 2 and 3.
If your answer to 2 is you always want it to be yesterday, then this is really easy - just do this:
%let daily_date=20210517;
%let last_execution_date = %sysfunc(putn(%sysevalf(%sysfunc(inputn(&daily_date,yymmdd8.))-1),yymmddn8.));
%put &=last_execution_date;
The two %sysfuncs just do the input/put from SAS datastep inside the macro language, and %sysevalf lets you do math.
If you don't want it to always be the prior day (if there are weekends, or other days you don't necessarily want to assume it's the prior day), then your best bet is to either use the dictionary tables to look at what's there and find the largest date prior to your date, or maybe use a x command to look at the folder and do the same thing (might be easier to use OS command than to use SQL for this, sometimes SQL dictionary tables can be slow).

SAS EG - Check if lock is possible, else use other table

Hope someone can shed some light on this for me.
I have a process that uses the below table. There is a subsequent table (resource5) that has the same data as resource4 - basically I can use either table - not sure why there's two to be honest but it may come in handy here.
Both tables are updated sequentially twice an hour at irregular intervals, so I cannot schedule around them and it seems to take around 5mins to update each table.
I always need the latest available data, and other data is live so I'm hitting the table quite frequently (every 15 mins).
Is there a way to check resource4 is available to be locked by my process and if so, proceed to run the data step and if not, hit resource5 instead and if not res5 then just quit the entire process so nothing else tries (other proc sql from oracle) to run?
As long as work.resource4 appears and is usable then all is well.
All my code does is this, once it's in WORK I can do whatever without fear of an issue.
data resource4;
set publprev.resource4;
run;
ps. I'm using SAS EG in Windows to make the change, then the process is exported via a .sas file with all of the code and runs off of a Unix SAS server via crontab though a shell script which also creates a log file. Probably not the most efficient way to schedule this stuff but it is what I have.
Many thanks in advance.
You can use the function open to check if a table is available to you for reading (i.e. copying to WORK).
You will have to use macro to provide the name of the available data set to your DATA Step.
Example:
* NOTE: The DATA Step will automatically close resources it open()s;
%let RESOURCE=NONE AVAILABLE;
data _null_;
if open ('publprev.resource4') ne 0 then do;
call symput('RESOURCE', 'publprev.resource4');
stop;
end;
if open ('publprev.resource5') ne 0 then do;
call symput('RESOURCE', 'publprev.resource5');
end;
run;
data work.resource;
set &RESOURCE;
run;

Quit vs Run statements in SAS

In SAS, what is the difference between 'quit' and 'run'? statements? I cannot figure out when to use 'quit' and when to use 'run'? For example, why is proc datasets using quit but proc contents using run
This dates back to where SAS used to be a mainframe program (and still can be!).
RUN; is a command for SAS to run the submitted statements. Back in the older mainframe days, statements would've been submitted to SAS one at a time (or in batches, but the core concept here is that each line is separate from SAS's point of view). SAS accepts statements without doing anything until it hits a RUN; or something else that would create a step boundary (another DATA or PROC line, usually). In a data step, or a non-interactive proc (proc means, for example - a proc that can only do one set of instructions, and then exits), run tells it to do (whatever) and then return to a blank slate.
QUIT; is used in interactive programming environments. IML, SQL, many of the regression and modelling PROCs, FORMAT, TEMPLATE, DATASETS, etc. - all can be used interactively, meaning, more than one set of instructions can be sent to them.
In these interactive cases, you want SAS to go ahead and run some of the instructions, but still keep that PROC's environment open - your next statement would be in the same PROC, for example. Some of those run immediately - PROC SQL is a good example of this - while some (particularly the modelling PROCs) RUN; does something (tells it to run the model so far) but it won't exit the proc until QUIT; is encountered (or another step boundary that requires it to exit, i.e. a data/proc statement). These are called "run groups", and "run group processing" is the term you'll see associated with that.
You will find that some people put run; quit; at every point that run; or quit; might be appropriate; that doesn't hurt anything, though it isn't really 'right', either. And there are some cases where it's needed to do that!
One example:
/* first run group*/
proc gplot data=sales;
title1 "Sales Summary";
plot sales*model_a;
run;
/* second run group */
plot sales*model_b;
run;
quit;
(from run-group processing )

how to get timing information of a data step query

I just wanted to know like in proc sql we define stimer option.
The PROC SQL option STIMER | NOSTIMER specifies whether PROC SQL writes timing information for each statement to the SAS log, instead of writing a cumulative value for the entire procedure. NOSTIMER is the default.
Now in same way how to specify timing information in data set step. I am not using proc sql step
data h;
select name,empid
from employeemaster;
quit;
PROC SQL steps individually are effectively separate data steps, so in a sense you always get the identical information from SAS. What you're asking is effectively how to find out how long 'select name' takes versus 'empid'.
There's not a direct way to get the timing of an individual statement in a data step, but you could write data step code to find out. The problem is that the data step is executed row-wise, so it's really quite different from the PROC SQL STIMER details; almost nothing you do in a data step will take very long by itself, unless you are doing something more complex like a hash table lookup. What takes long is writing out the data first, and reading in the data second.
You do have some options for troubleshooting long data steps, if that's your concern. OPTIONS MSGLEVEL=I will give you information about index usage, merge details, etc., which can be helpful if you aren't sure why it is taking a long time to do certain things (see http://goo.gl/bpGWL in SAS documentation for more info). You can write your own timestamp:
data test;
set sashelp.class sashelp.class;
_t=time();
put _t=;
run;
Odds are that won't show you much of use since most data step iterations won't take very long but if you are doing something fancy it might help. You could also use conditional statements to only print the time at certain intervals - when at FIRST.ID for example in a process that works BY ID;.
Ultimately though the information you already get from notes is what is most useful. In PROC SQL you need the STIMER information because SQL is doing several things at once, while SAS lets/makes you do everything out step-wise. Example:
PROC SQL;
create table X as select * from A,B where A.ID=B.ID;
quit;
is one step - but in SAS this would be:
proc sort data=a; by ID; run;
proc sort data=b; by ID; run;
data x;
merge a(in=a) b(in=b);
by id;
if a and b;
run;
For that you would get information on the duration of each of those steps (the two sorts and the merge) in SAS, which is similar to what STIMER would tell you.
No way.
PROC SQL STIMER logs timing for each separately executable SQL statement/query.
In data step, as you may know, the data step looping occurs, observation per observation, so the data step statement timing would be something like per observation, let's say transactional. Anyway this would not describe all the details where the time is being spent - waiting for disk reads, writes, etc.
So I guess this won't be very usable. In general, SAS performance is I/O driven.