Is there any way to view all permanently stored datasets used by or created by a SAS project (or program, if this is not possible)? I have been tasked with creating a matrix of data inputs and outputs for 40 different SAS projects, each of which contains at least 50 programs. Needless to say there are THOUSANDS of temporary datasets created, but all I am interested in are the permanent ones. After manually checking one project, I noticed that the project process flow does not contain many permanently stored inputs (i.e. from libraries other than WORK) and it is very time consuming to check the properties of each dataset to see if it is temporary or not.
Three other things of note:
1. None of the code is documented.
2. I did not write any of it.
3. I am using SAS Enterprise Guide.
It is not exactly clear what you are asking for. You may want to check out the view sashelp.vcolumn, which lists every column of every table in each library assigned to the current session. If your project stores all of its permanent data sets in one library, you could try:
proc sql;
    select *
        from sashelp.vcolumn
        where libname = "YOURPROJECTSLIBRARY"; /* libname values are stored in uppercase */
quit;
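If the projects write to several libraries, dictionary.tables (the PROC SQL counterpart of the sashelp views) can list every permanent data set in one pass. A minimal sketch, assuming each project's librefs are assigned in the current session:

proc sql;
    /* every data set in every assigned library except WORK */
    select libname, memname, nobs, crdate
        from dictionary.tables
        where libname ne 'WORK'
          and memtype = 'DATA';
quit;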
I'm facing a very weird problem.
I've assigned a libname such as
libname TEST_LIB "/Info-One/...."; /*have removed the exact location*/
/*The dataset TEST_DATA is visible in this output*/
proc datasets lib = TEST_LIB;
RUN;
/*This statement throws an error saying the file does not exist*/
DATA TEST_DATA_2;
set TEST_LIB.TEST_DATA;
RUN;
I'm running this code in SAS Enterprise guide connected to the remote server.
I'm also able to navigate to the location from the file explorer, and if I drag the file into Enterprise Guide the dataset is visible. But if I double-click the dataset from the defined library, it says that the dataset does not exist.
I've run out of ideas now and I'm not sure how to troubleshoot this.
A couple of things that I've tried/checked:
1. Case sensitivity is not an issue.
2. The filename does not have spaces.
3. I have permissions to the folder, because I can work normally with another dataset that I created and placed in that folder.
In fact, if I copy the data to Excel, upload the Excel file to SAS to create a SAS dataset, and place it in the same location with a different name, I face the same issue!
Would really appreciate any ideas that you guys may have, not just on why it's happening, but also on how to bypass it.
Transferred from comments and expanded
Here are two possibilities:
The filename of the data set may contain uppercase. This is an unlikely but possible scenario:
On Unix systems, filenames are case sensitive. A data set name in a SAS program is mapped internally to a corresponding lowercase-named data file (those sas7bdat files at the operating system level). If a copy process somehow creates a .sas7bdat data file on Unix whose name is mixed case or uppercase, the SAS session won't map to it. In such a scenario, the SAS file explorer might list the data set but be unable to open it. However, a direct file reference to the data set might work, such as set '~/project1/datasets/MyWeirdlyCasedDataset'; (see the sketch after this list).
There is a file permission problem with the folder mount, allowing reading of directory entries (the filenames) but not the file contents (the data sets) within. Try opening a terminal session (PuTTY or MobaXterm) and see what the detailed directory listing is for the data folder (ls -l). You might also have to look at the access control lists (getfacl on Linux; lsacl on HP-UX) and get network and IT admins involved.
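A minimal sketch of the direct-reference workaround from the first possibility; the path is a placeholder for wherever the mixed-case file actually lives:

/* Bypass the libref's lowercase name mapping with a direct file reference. */
/* The quoted path must match the on-disk casing exactly (placeholder path). */
data work.test_data_2;
    set '/Info-One/.../Test_Data.sas7bdat';
run;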
Are you sure that it is a data set?
Add memtype=data to PROC DATASETS, as below:
proc datasets lib=TEST_LIB memtype=data;
quit;
I have one main SAS program that calls two other SAS programs.
These two programs create formats with PROC FORMAT CNTLIN= from large data sets; the formats are temporary, residing in the WORK library, and are used throughout the main program to assign formats to variables.
The main program creates almost 15 large data sets in the WORK library.
Several PROC SQL joins and DATA step merges take place.
Indexes are created on data sets with PROC DATASETS.
PROC SORT is used as well.
Wherever possible, WHERE is used instead of IF.
The MPRINT, MLOGIC, and SYMBOLGEN options are enabled.
Some minor logic-level performance tuning has also been done.
Most of the data set creation happens in the WORK library. If we clear the entire workspace, the previously created formats are lost, and we don't want to lose the formats until the end of the job because they are used throughout the program.
The job takes 1 TB of SAS workspace, so I want to reduce that usage.
Can anyone please suggest what optimizations we can make to use less space as well as memory?
Write the format catalogs to a different folder.
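A minimal sketch of that idea; the FMTLIB libref, its path, and the CNTLIN= input data set are placeholder names:

/* build the format catalog in a permanent library instead of WORK */
libname fmtlib '/myproject/formats';

proc format library=fmtlib cntlin=work.fmt_source; /* fmt_source is a placeholder */
run;

/* make formats in FMTLIB resolvable throughout the job */
options fmtsearch=(fmtlib work);

/* WORK data sets can now be deleted mid-job without losing the formats */
proc datasets lib=work nolist memtype=data kill;
quit;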
I'm new to SAS EG; I usually use Base SAS when I actually need to program, but my company is moving heavily toward EG. I'm helping some areas with code to get data they need on an ad-hoc basis (the code won't change, though).
However, during processing we create many temporary files that are just iterations across months. For example, if the user wants data from 2002-2016, we have to pull all of those libraries and then concatenate them with our results. This is due to high transactional volume; the final dataset is limited to a small number of observations. Whenever I run this program, though, EG shows all 183 of the data sets created by the macro's DATA steps, making it very ugly, and sometimes the "Output Data" that appears isn't even the output from the last DATA step but from an intermediary step, making it annoying to search through for the final output dataset.
Is there a way to limit the datasets shown in "Output Data" so that only the final dataset appears and our end user doesn't need to worry about being confused?
There are a ton of output data sets that I don't care to see. I just want the final one, which is located (somewhere) in that list...
Version: SAS Enterprise Guide 7.1.
After the program ends, EG automatically shows every dataset that was created. If you don't want it to show any intermediate tables, delete them in the very last step of your process.
In your case, it looks as if your temporary tables all share the prefix TRN. You can clean them up as such:
/* Start of process flow */
<program statements>;
/* End of process flow*/
proc datasets lib=work nolist nowarn nodetails;
    delete TRN:;  /* the colon deletes every table whose name starts with TRN */
quit;
Be careful if you do this. Make sure that all of your temporary tables follow the same prefix naming scheme, otherwise you may accidentally delete tables that you need.
Another solution is to limit the number of datasets generated, and have a user-created link to the final dataset. There's an article about it here.
The alternate solution here is to add the output dataset explicitly as an entry on your process flow, and disregard the OUTPUT window unless you need to investigate something from the intermediary datasets.
This has the advantage that it lets you look at the intermediary datasets if something goes wrong, but also lets you not have to look through all of them to see the final dataset.
Once the final output dataset has been created, you can easily add it to the process flow; after that one time, it will be there for you to select and look at.
One of our critical SAS datasets has been left open in SAS Enterprise Guide by an offshore associate. Many updates across various jobs depend on that dataset. I have searched various sites for an option to unlock it, but to no avail. Kindly provide any suggestions. Thanks.
Depending on some of the specifics of your situation, another option is to prevent anyone from locking it in the first place using a PW= dataset option like:
data myImportantTable(PW=pass123);
    x=1;
    output;
run;
Then you could create a view that allows EG users to click and see the underlying data, but does not LOCK the original dataset:
proc sql;
CREATE VIEW myImportantTable_view AS
SELECT * FROM myImportantTable(read=pass123)
;quit;
Now INSERTS, UPDATES etc will work even if the view is opened by a user in EG:
*This will work even if view is opened in EG;
proc sql;
INSERT INTO myImportantTable(PW=pass123) VALUES(101)
;quit;
Note that this is not a good option if you've got a lot of different INSERT/UPDATE statements spread throughout your program - each of them would need the (PW=...) dataset option added to them in order to work.
Use the SYSTASK command to execute a mv (move) or cp (copy) UNIX command to replace the existing data set. If you need to move or copy more than one data set at a time, you can use the * wildcard, but you must also use the SHELL option.
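A sketch of that approach, with placeholder Unix paths; STATUS= stores each command's return code in a macro variable:

/* copy a replacement file over the data set at the operating-system level */
systask command "cp /data/staging/myImportantTable.sas7bdat /data/prod/" wait status=cprc;

/* wildcards only expand when the command runs through a shell, hence SHELL */
systask command "mv /data/staging/*.sas7bdat /data/prod/" shell wait status=mvrc;

%put NOTE: cp rc=&cprc, mv rc=&mvrc;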
There is an option in SAS Enterprise Guide under Tools --> Options --> Data --> Performance: the check box "Close Data Grid after period of inactivity (in minutes)". This way, even if the data grid is left open, after 'n' minutes the dataset will be released and available for others to update.
I am a DBA / R user. I just took a job in an office full of SAS users, and I am trying to understand better how SAS's proc sql works. I understand that SAS includes its own data engine and the ability to run proc sql against external servers like Oracle. I am trying to better understand when and how it decides to use the database server rather than its internal engine.
I have seen some really S. L. O. W. SAS code where my coworkers run a series of proc sql commands. These programs typically include 3-5 proc sql steps, each of which creates a local SAS table. They are not using pass-through SQL. The data sets are large (1 million+ rows) and these proc sql steps run slowly. Most of the data lives on the server. There is usually a small table, stored as a SAS data file, that defines the population we want to look at, but everything else lives on the server.
I have demonstrated dramatic improvements in speed by simply running all of the queries directly on the server. (Oracle in this case, but I don't think that is important.) Usually, I have to first upload a table to my personal schema that defines the population of clients we want to examine. Everything else is on the server. Sometimes I collapse their queries together because they can be done in a single step, but I do not believe that is why my version of their program is so much faster.
I think proc sql uploads the initial data set and then runs the first query on the server. It then downloads the output to the local computer, creating the local SAS data set. For the second proc sql step, it uploads the table created in step one back to the server and then runs the query on the server. To make this all even worse, the "local" SAS data sets are actually stored on a remote server, not the actual local machine. This is invisible to SAS, but it does mean we are copying data across the network yet again. I believe SAS is running slowly because of a large amount of unnecessary network traffic.
Question #1 - Is my understanding of what proc sql is doing correct? Are we really wasting as much time as I think we are uploading and downloading large tables / data sets across our network?
Question #2 - Is there some way to control when proc sql runs against the server versus when it runs against the local engine? In some cases, if we could prevent the upload/download steps, the query would run more efficiently.
Short answer
Your understanding is not exactly correct, but it's in the right ballpark. SQL is probably not sending the SAS dataset to the server, it is more likely downloading the server data to SAS - but it's probably downloading the entire table, not limited by the join criteria. Your solution is exactly what I would suggest doing - hopefully your colleagues will get on board.
Long answer
In terms of how the processing works, it depends on your code. PROC SQL will execute code locally (as in, on the SAS server/desktop), unless it decides to pass the query up to the server and hasn't been told it's not allowed to. That's called implicit passthrough. You can't really control it except to turn it entirely off (with noipassthru on the PROC SQL statement). You can look at it sometimes using options msglevel=i; (a system option), and _METHOD or _TREE to see what SQL decided to do (similar to explain plan).
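For example, a quick way to surface what the planner decided (the SASTRACE options are an extra, commonly used addition not mentioned above; they show the SQL actually sent to the DBMS, assuming the oralib libref from the example further down):

/* notes about index usage and pass-through decisions in the log */
options msglevel=i;

/* optional: trace the SQL that SAS/ACCESS sends to the DBMS */
options sastrace=',,,d' sastraceloc=saslog nostsuffix;

proc sql _method _tree; /* prints the access plan, similar to an explain plan */
    select count(*) from oralib.tableA;
quit;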
I've had cases where it caused harm: SQL Server runs character comparisons case-insensitively while SAS does not, and I had a particular query that sometimes was sent up to the server and sometimes not depending on details of the data. I wasn't careful enough with checking case, and so it appeared to work when it really wasn't correct (comparing Propcase to UPCASE).
The general rule is that SAS will try to send the query to the server if:
The data in the query entirely already resides on the server
The query is sufficiently simple that SAS can easily figure out how to tell the server to do it, in its native language
If you're running a query that involves a local SAS dataset (say, joining a server table to a SAS dataset), it won't (at least as far as I know) go to the server. It should always run locally, which means downloading from the server all data in the contributing tables (possibly filtered, if there is a logical filter in the query). For example (these examples aren't necessarily good SQL code, just illustrations of the concept):
libname oralib oracle [connection info];
proc sql;
  /* will likely pass through */
  select tableA.*, tableB.cost
    from oralib.tableA inner join oralib.tableB
      on tableA.id=tableB.id;

  /* will probably not pass through */
  select tableA.*, tableB.cost
    from oralib.tableA inner join work.tableB
      on tableA.id=tableB.id;

  /* might pass through, might not */
  select tableA.*, tableB.cost, tableC.productID
    from oralib.tableA inner join oralib.tableB
      on tableA.id=tableB.id
    left join oralib.tableC
      on tableA.id=tableC.id;

  /* downloads the data, but probably applies the where clause server-side */
  select tableA.*, tableB.cost
    from oralib.tableA inner join work.tableB
      on tableA.id=tableB.id
    where tableA.date < '01JAN2010'd;
quit;
In the case of the second query, it probably pulls all of tableA down. In the fourth query, it will likely pass the where clause to the server (assuming the date doesn't cause a problem, which it shouldn't; SAS knows how to convert dates to Oracle-type dates).
Note that SAS procs can also generate passthrough. PROC MEANS, etc., will send the instructions to Oracle to do the means/sums/etc. if it can easily do so.
Your best bet is to:
Try to do everything in pass-through that you can (and that makes sense). The only way to be sure a query goes to the server is to use explicit pass-through (see the sketch after this list).
If you have a large table on the server and a small table in SAS, upload the table in SAS to the server. A pass-through session and a libname session can't see each other's session-specific temporary tables, so you'd have to use a GTT (global temporary table) or similar (something all users can see). Similarly, if you have a large table in SAS and a small table (or small query result) in SQL, bring it down locally (through pass-through if necessary).
When you do have to bring things down, limit as much as possible. When I worked in that kind of environment, I made huge time savings simply by joining to tables on the server to limit my result set before bringing them down.
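A minimal explicit pass-through sketch; the connection details, table names, and columns are placeholders:

proc sql;
    connect to oracle (user=myuser password=XXXXXX path=mydb); /* placeholders */
    create table work.results as
    select * from connection to oracle (
        /* this block is sent verbatim to Oracle and runs entirely server-side */
        select a.id, a.client_id, b.cost
        from tableA a
        inner join tableB b on a.id = b.id
        where a.order_date >= date '2010-01-01'
    );
    disconnect from oracle;
quit;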
At the end of the day, you will be constrained by network traffic no matter what you do; just try to optimize it as best you can. It sounds like you understand how to do that already, so just do what you normally would do in non-SAS environments.