I have one main SAS program that calls two other SAS programs.
These two programs create formats with PROC FORMAT CNTLIN= from large data sets; the formats are temporary, i.e. they reside in the WORK library. They are used throughout the main program to assign formats to some variables.
The main program creates almost 15 large data sets in the WORK library.
Some PROC SQL joins and DATA step merges are happening.
We create indexes on data sets using PROC DATASETS.
We also use PROC SORT.
Wherever possible we used WHERE instead of IF.
The MPRINT, MLOGIC and SYMBOLGEN options are enabled.
And some small logic-wise performance tuning has been done.
Most of the data set creation happens in the WORK library. If we clear the entire workspace, the previously created formats are lost, and we don't want to lose them until the end of the job because they are used throughout the program.
The whole job takes about 1 TB of SAS workspace, and I want to reduce that.
Can someone please suggest what optimizations we can do to use less space as well as less memory?
Write the format catalogs to a different folder.
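For example, a minimal sketch of that idea, assuming a permanent folder of your choice and a CNTLIN input data set called work.fmt_input (both names are placeholders):

libname fmtlib '/some/permanent/folder';  /* the catalog will live here, not in WORK */

proc format library=fmtlib cntlin=work.fmt_input;
run;

/* make SAS search that catalog when resolving formats */
options fmtsearch=(fmtlib work);

With the catalog outside WORK, you can delete intermediate WORK data sets as soon as they are no longer needed without losing the formats.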
Hope someone can shed some light on this for me.
I have a process that uses the table referenced in the code below. There is a second table (resource5) that has the same data as resource4; basically I can use either table. I'm not sure why there are two, to be honest, but it may come in handy here.
Both tables are updated sequentially twice an hour at irregular intervals, so I cannot schedule around them, and each table takes around 5 minutes to update.
I always need the latest available data, and other data is live so I'm hitting the table quite frequently (every 15 mins).
Is there a way to check whether resource4 is available to be locked by my process? If it is, proceed to run the data step; if not, try resource5 instead; and if resource5 is also unavailable, quit the entire process so nothing downstream (other PROC SQL against Oracle) tries to run.
As long as work.resource4 appears and is usable, all is well.
All my code does is this; once the data is in WORK I can do whatever I need without fear of an issue.
data resource4;
set publprev.resource4;
run;
PS: I'm using SAS EG on Windows to make the change; the process is then exported as a .sas file with all of the code and runs on a Unix SAS server via crontab, through a shell script that also creates a log file. Probably not the most efficient way to schedule this, but it's what I have.
Many thanks in advance.
You can use the OPEN function to check whether a table is available to you for reading (i.e. for copying to WORK).
You will have to use a macro variable to pass the name of the available data set to your DATA step.
Example:
* NOTE: The DATA step will automatically close resources it open()s;
%let RESOURCE=NONE AVAILABLE;

data _null_;
  if open('publprev.resource4') ne 0 then do;
    call symput('RESOURCE', 'publprev.resource4');
    stop;
  end;
  if open('publprev.resource5') ne 0 then do;
    call symput('RESOURCE', 'publprev.resource5');
  end;
run;
data work.resource;
  set &RESOURCE;
run;
I have to perform statistical analysis in SAS on a file with hundreds of observations and 7 variables (columns). I know that it is necessary to insert all the observations after "cards" or "datalines", but obviously I can't type them all. How can I do this? Moreover, the given data file is already a .sas7bdat.
Then, since (in my case) the multiple correspondence analysis requires only six of the seven variables, does this affect what I have to write in INPUT and/or in CARDS?
You only use CARDS when you're trying to manually write a data set. If you already have a SAS data set (sas7bdat) you can usually use that directly (there are some exceptions, but they likely don't apply here).
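For context, CARDS/DATALINES is only for typing a handful of raw values inline, along these lines (a toy sketch, not something you need here):

data tiny;
  input name $ score;  /* read the inline lines that follow */
  datalines;
Ann 90
Bob 85
;
run;

Since your data is already a SAS data set, you can skip that entirely.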
First create a libname to the folder where the file is:
libname myFiles 'path to folder with sas file';
Then load it into your WORK library; this is a temporary space that is cleaned up when you're done, so no files here are saved permanently.
The step below copies the file into WORK, which is often faster to work with.
data myFileName;
set myFiles.myFileName;
run;
Alternatively, you can work with the file directly from that library by referencing it as myFiles.myFileName in your code:
proc means data=myFiles.myFileName;
run;
This should get you started, but you should take the free SAS e-course to understand the basics; it will save you time overall.
Just tell SAS to use the dataset. The INPUT statement (with CARDS/DATALINES or an INFILE statement) is for reading from text files.
proc corresp data='/my directory/mydataset.sas7bdat' .... ;
...
run;
You could also make a libref that points to the directory and use a two-level name to reference the dataset.
libname myfiles '/my directory/';
proc corresp data=myfiles.mydataset .... ;
...
run;
I'm new to SAS EG; I usually use Base SAS when I actually need a program, but my company is moving heavily toward EG. I'm helping some areas with code to get data they need on an ad-hoc basis (the code itself won't change, though).
However, during processing we create many temporary files that are just iterations across months, e.g. if the user wants data from 2002-2016, we have to pull all those monthly libraries and then concatenate them with our results. This is due to high transactional volume; the final dataset is limited to a small number of observations. Whenever I run this program, though, EG lists all 183 datasets created by the macro, which is very ugly, and sometimes the "Output Data" that appears isn't even output from the last data step but from an intermediary step, making it annoying to search through for the final output dataset.
Is there a way to limit the datasets shown in "Output Data" so that it only shows the final dataset, so our end user doesn't get confused?
The above is an example: there are a ton of output data sets that I don't care to see. I just want the final one, which is located (somewhere) in that list...
The version is SAS EG 7.1.
After the program ends, EG will always automatically show every dataset that was created. If you don't want it to show any intermediate tables, delete them as the very last step in your process.
In your case, it looks as if your temporary tables all share the prefix TRN. You can clean them up like this:
/* Start of process flow */
<program statements>;
/* End of process flow */
proc datasets lib=work nolist nowarn nodetails;
  delete TRN:;
quit;
Be careful if you do this. Make sure that all of your temporary tables follow the same prefix naming scheme, otherwise you may accidentally delete tables that you need.
Another solution is to limit the number of datasets generated, and have a user-created link to the final dataset. There's an article about it here.
The alternate solution here is to add the output dataset explicitly as an entry on your process flow, and disregard the OUTPUT window unless you need to investigate something from the intermediary datasets.
This has the advantage that it lets you look at the intermediary datasets if something goes wrong, but also lets you not have to look through all of them to see the final dataset.
You should be able to add the final output dataset to the process flow easily once it has been created, and after that one time it will be there for you to select and look at.
Is there any way to view all permanently stored datasets used by or created by a SAS project (or program, if this is not possible)? I have been tasked with creating a matrix of data inputs and outputs for 40 different SAS projects, each of which contains at least 50 programs. Needless to say there are THOUSANDS of temporary datasets created, but all I am interested in are the permanent ones. After manually checking one project, I noticed that the project process flow does not contain many permanently stored inputs (i.e. from libraries other than WORK) and it is very time consuming to check the properties of each dataset to see if it is temporary or not.
Three other things of note:
1. None of the code is documented.
2. I did not write any of it.
3. I am using SAS Enterprise Guide.
It is not exactly clear what you are asking for. You may want to check out the view sashelp.vcolumn, which lists every variable of every data set in the libraries currently assigned. If your project stores all of its permanent data sets in one library, you could try:
proc sql;
  select *
  from sashelp.vcolumn
  where libname = "YOURPROJECTSLIBRARY"; /* LIBNAME values are stored in upper case */
quit;
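If you want one row per data set rather than one row per variable, the companion view sashelp.vtable may be closer to what you need (same caveat: the library name below is a placeholder and must be upper case):

proc sql;
  /* one row per data set in the given library */
  select libname, memname
  from sashelp.vtable
  where libname = "YOURPROJECTSLIBRARY";
quit;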
In SAS, when creating a SAS data set from a raw data file (CSV), we can use either a DATA step with the INFILE statement or PROC IMPORT.
What are the advantages and disadvantages of each over the other?
PROC IMPORT makes assumptions about the lengths of character variables and the types of variables based on reading a number of rows of the CSV; how many rows it scans is controlled by the GUESSINGROWS= option. If you issue a RECALL command in interactive SAS after running PROC IMPORT, you get the DATA step code that PROC IMPORT generated to do the actual work. It generates FORMAT and INFORMAT statements that may or may not be exactly what you want.
I often use proc import as a data step code generator, recall the code, and then modify it to suit what I want.
You can also add other processing logic to extend the functionality of the step beyond simply reading the source data into a data set. Creating new variables as transformations of one or more of the columns in the CSV springs to mind.
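For example, a minimal PROC IMPORT call along those lines might look like this (the path, output data set name, and GUESSINGROWS value are placeholders):

proc import datafile='/my directory/mydata.csv'
    out=work.mydata
    dbms=csv
    replace;
  guessingrows=1000; /* scan more rows before guessing variable types and lengths */
run;

The generated DATA step is written to the log, so you can copy it from there (or recall it in interactive SAS) and adjust the INPUT, INFORMAT, and FORMAT statements to suit.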
I generally agree this is too broad a question. That said:
PROC IMPORT is slower than a DATA STEP. This is because PROC IMPORT looks at the file and then writes and executes a DATA STEP.
A DATA STEP requires you to know the name, position, and attributes (type, length, etc) for each variable.
If I need to read a file once, I just use PROC IMPORT.
If I need to read a file multiple times, I don't care about speed, and the file format might change, then I use PROC IMPORT.
If I am in a production system where speed matters and I want an ERROR if the format changes, then I use PROC IMPORT. But I take the DATA STEP it writes for me and put that into my code.
If PROC IMPORT fails to guess my columns correctly, I use PROC IMPORT, modify the DATA STEP it produces, and then use that.
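For reference, a hand-written DATA step equivalent might look roughly like this (a sketch only; the path, variable names, lengths, and informats are hypothetical and would have to match your actual file):

data work.mydata;
  /* dsd handles quoted, comma-separated fields; firstobs=2 skips the header row;
     truncover prevents lost records when a line ends early */
  infile '/my directory/mydata.csv' dsd firstobs=2 truncover;
  length id 8 name $ 40 amount 8;
  input id name $ amount join_date :yymmdd10.;
  format join_date date9.;
run;

Nothing here is guessed, so if a column changes type or disappears, the step fails loudly instead of quietly importing something different.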