SAS/ACCESS and DATA step on external DB

I have the following concern regarding the SAS/ACCESS facility.
Let's imagine that we have an external DB (e.g., Oracle) which we have assigned to a certain libname.
Next, we perform a simple operation on one of the tables within this DB, e.g.:
data db.table_new;
    set db.table_old(keep=var1 var2 var3);
    if var1>0 then new_var1=5;
run;
My question is the following:
Will the whole table table_old be pulled from the external DB to the SAS server in order to process the data?
Or will SAS/ACCESS translate the DATA step into a DBMS operation or SQL, so that the whole processing is performed outside SAS?
The documentation is unclear on this point; see page 62.

Usually the rule of thumb is: if the SAS functions used in the DATA step can be converted to native DB SQL functions, then SAS will let the DB server do the data processing. In your case, this seems to be the situation.

You can answer this question for any piece of code by setting the following tracing options:
options sastrace=',,,d' sastraceloc=saslog nostsuffix;
When you run the DATA step, check the log. You will see information about whether SAS was able to translate the code successfully. If it was not, you will see:
ACCESS ENGINE: SQL statement was not passed to the DBMS, SAS will do the processing.
If this occurs, SAS will usually send a select * to the server and pull everything down before filtering. When you see that message, try explicit pass-through, or redesign your query so that everything can be done on the server. If the table is large enough, it is possible to bring down the SAS server, or severely degrade performance on the Oracle server.
Some common functions you'll want to avoid using directly in the query, especially with Oracle:
datepart()
intnx()
intck()
today()
put()
input()
If I have to use any of those functions, I usually play it safe and create a macro variable of the static ones beforehand (e.g. today()), filter the raw data at the lowest level first to get it onto the SAS server, or use explicit SQL pass-through, as sketched below.
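For illustration, a minimal sketch of both workarounds, assuming the libname db from the question and treating var2 as a SAS date column; the connection values are placeholders, not from the original post:
/* resolve today() into a macro variable up front, so the un-translatable
   function call never reaches the implicit pass-through layer */
%let cutoff = %sysfunc(today(), date9.);

data work.recent;
    set db.table_old(keep=var1 var2 var3);
    where var1 > 0 and var2 <= "&cutoff"d;  /* simple WHERE clauses can be pushed to the DB */
run;

/* explicit pass-through: Oracle executes the inner query verbatim */
proc sql;
    connect to oracle (user=scott password=tiger path="orapath");
    create table work.recent2 as
        select * from connection to oracle (
            select var1, var2, var3
            from table_old
            where var1 > 0
        );
    disconnect from oracle;
quit;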

In summary, I would say it depends on your method. On the second page of Chapter 1 of the SAS/ACCESS 9.2 documentation linked above, two methods of the SAS/ACCESS facility are described (alongside the older DBLOAD procedure):
LIBNAME reference - assign SAS librefs to DBMS objects such as schemas and databases; you can then work with the table or view as you would with a SAS data set... You can use such SAS procedures as PROC SQL or DATA step programming on any libref that references DBMS data.
SQL Pass-Through Facility - interact with a data source using its native SQL syntax without leaving your SAS session. SQL statements are passed directly to the data source for processing... The DBMS optimizer can take advantage of indexes on DBMS columns to process a query more quickly and efficiently.
Hence, with the first method SAS handles the processing, and with the second the DBMS handles the processing. Like most clients (a Java or C# application, a Python script, or a PHP web page) that connect to external RDBMS sources, unless a direct ODBC/OLEDB or other API connection is explicitly employed and the request sent through it, processing is handled on the front end (i.e., calculating parameters) and only the end result is updated to the back end via transactions. All of SAS's libraries live in memory (or temporary hard disk) during the appointed session and, depending on the code, SAS either handles the data itself and passes results to the external source, or passes data handling entirely to the other source.
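For concreteness, a hedged sketch of the LIBNAME method (connection values are placeholders, not from the question):
libname db oracle user=scott password=tiger path="orapath" schema=myschema;

proc sql;
    /* implicit pass-through: SAS attempts to translate this query into
       Oracle SQL; the SASTRACE options shown earlier reveal whether it
       succeeded or SAS pulled the rows locally instead */
    select count(*) from db.table_old where var1 > 0;
quit;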
Comparative Example: Microsoft Access
One good comparative example is Microsoft Access which, like SAS, provides a linked-table connection and pass-through queries for any ODBC-compliant RDBMS, including SQL Server, Oracle, MySQL, etc. It is a common misconception to call Access a database when it is actually a GUI program and a collection of objects, one of which is the default Windows JET/ACE engine (a .dll file), not at all restricted to Access but available to all Office programs. Notice the word default, as this engine can be switched out for any ODBC database source.
Linked tables are essentially Access GUI objects (specifically, special TableDefs), not unlike SAS libname refs, that are loaded into a JET/ACE table container with the data pointing externally. One can then use a linked table like any other local Access table and use anything in the ACE SQL dialect. This special linked table (much as SAS libname refs are established via ODBC or another connection type) points to the external source, and the driver translates the query commands for the migration action. Therefore, the exact same query may perform differently against an Access linked table than against the RDBMS directly.
Analogy
I imagine SAS behaves the same way: it exists as a front end, with libname refs as local objects holding pointers to the back end. All DATA step handling is processed locally, and only the result sets are imported or extracted by the engine. To use an analogy: the database is the home, and SAS is the garbage man, home decorator, or move-in helper. SAS (like Java's JDBC, PHP's PDO, Python's cursors, or R's database libraries) knocks on the door, which the database answers (annoyed by so many requests): "Hey buddy, we need to take out the garbage, and here are the exact items... or we need to remodel the basement, and here are the exact specs... or we have new furniture in the truck, ready for drop-off... with credentials signed, please carry out immediately." And in both tools, pass-through methods are requests carried out on the back-end engine. So SAS leaves instructions, maybe a note on the door (without exactness), for the homeowner to carry out.


SAS Metadata Objects Details - SASLibraries, PhysicalTables, Jobs

I've used this code to fetch the list of objects for all SAS Libraries, Physical Tables and Jobs.
https://github.com/sasjs/core/blob/master/meta/mm_getobjects.sas
I now need to fetch these objects' details:
Libraries - libname and full path
Teradata libs - schema name, lib path
Physical tables - location and other attributes
Jobs - location and other attributes
I'm not very familiar with which attributes we can report on, but I definitely need their paths and attributes.
Thank you.
The example you refer to uses PROC METADATA, which returns XML that you need to understand and process. The real problem here is that you have to learn how to build the input XML to construct the metadata query, which is quite a complex thing.
Maybe more straightforward is to use DATA step metadata functions, as in the example below.
The METABROWSE command is useful for understanding metadata object relations (if you have access to SAS Foundation).
The attributes you are asking for will change depending on the library engine you are checking for.
The following macro will generate a libname for BASE, OLEDB, ODBC and POSTGRES engines (note that the repository has moved):
https://github.com/sasjs/core/blob/main/meta/mm_assigndirectlib.sas
The direct attributes are available as per this answer: How to get details of metadata objects in SAS
Folder path is available as per this answer:
%let metauri=OMSOBJ:PhysicalTable\A5HOSDWY.BE0006N9;
/* get the metadata folder path for the object */
data;
    length tree_path $500 tree_uri parent_uri parent_name $200;
    call missing(tree_path, tree_uri, parent_uri, parent_name);
    drop tree_uri parent_uri parent_name rc;
    uri="&metauri";
    /* fetch the Tree (folder) the object belongs to, and its name */
    rc=metadata_getnasn(uri,"Trees",1,tree_uri);
    rc=metadata_getattr(tree_uri,"Name",tree_path);
    /* walk up the ParentTree associations, prepending each folder name */
    do while (metadata_getnasn(tree_uri,"ParentTree",1,parent_uri)>0);
        rc=metadata_getattr(parent_uri,"Name",parent_name);
        tree_path=strip(parent_name)||'/'||strip(tree_path);
        tree_uri=parent_uri;
    end;
    tree_path='/'||strip(tree_path);
run;

MySQL check if table has correct schema

I am currently developing server software in C++ with a MySQL data backend. I am using the official MySQL/connector library from Oracle to work with MySQL. The connection itself is working and I'm not having any issues with that.
My problem is that the database and table schemas tend to change every once in a while, because new tables and columns keep getting added. Existing columns may also be changed for the same reason. To make sure I recognize outdated server software quickly, I wanted to add a warning for when the database has changed.
My first idea was to hardcode how the database (and tables and such) should look and then check whether the current database matches the hardcoded data. But I have no clue how to achieve that.
In summary I want to be able to detect whether
A table has been added or removed
A column in a table has been altered
A column in a table has been added or removed
with as little C++ code as possible. Also it should be quite easy to maintain.
Additional information will be added when required.
I would suggest the following approach:
1) Fork and execute the mysql command-line client. Set up a pair of pipes to mysql's standard input and output.
2) At this point you should be able to execute simple commands by writing them to mysql via the standard-input pipe and reading the results from the standard-output pipe.
You will need to make careful notes on the output format of each mysql command, so that you know when you have finished reading its output and can send the next command.
3) As the first order of business, execute:
show tables;
The output that comes back lists all the tables in the database. Parsing it into a list of table names is trivial. Then, for each table, execute:
show create table <tablename>;
The resulting output shows all the fields in the table, its keys, and its constraints - pretty much the whole schema of the table. Lather, rinse, repeat for every table.
4) In this manner you can capture a basic schema of the entire database for comparison purposes. If necessary, use the same approach to capture the triggers and other objects. You'll likely need to do some minor massaging of the data and exclude a few bits; "show create table", for example, includes the current AUTO_INCREMENT value, which you can ignore.
This general approach, of driving a mysql process via its standard input and output, is a bit wobbly, of course. With a little bit of work you can use mysql's native client library to execute all of these commands and capture their results directly. This should be more reliable.
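Alternatively, if tables, columns, and types are all you need to compare, you could skip output parsing entirely and query information_schema, which is standard in MySQL; a sketch:
-- one row per column, in a stable order that is easy to diff
-- against the hardcoded expected schema
SELECT table_name, column_name, column_type, is_nullable, column_key
FROM information_schema.columns
WHERE table_schema = DATABASE()
ORDER BY table_name, ordinal_position;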

How do I manage SAS formats from various sources?

I am wondering how I can efficiently manage formats in SAS for a reporting office that takes in data from various sources, some with proper lookup tables / metadata, and some without.
For data sources that have proper metadata, joining tables for value descriptions works fine, but when metadata doesn't exist and needs to be maintained separately, how should that be done? Some straightforward examples/ideas:
Plain .sas files with a native PROC FORMAT step that is maintained separately.
External files (e.g., Excel, CSV) that are maintained separately and imported into SAS to create a format library.
Database tables maintained separately that can be read from to create a format library.
In addition to just the formatted values, managing value changes (i.e., effective dates for certain values) is also a concern.
Any help in conventions or standards that work well for this type of task is greatly appreciated.
I'm not sure there's a single best solution here - it depends largely on your environment, your users, etc.
If you have fairly naive users, then I'd definitely recommend a single complete repository if possible: either a .sas7bcat catalog file, if you are using a single SAS version/OS/bitness, or a ready-made table/dataset to input into PROC FORMAT (with a .sas file included in their autoexec to do the importing). The biggest drawbacks are that you have to manage it actively (you cannot allow users to write their own formats to the master format dataset, for example, as they may overwrite other ones), and that there is additional work to ensure format names do not conflict - YNF. might be 1=YES 2=NO, or 1=YES 0=NO, or something else entirely. This approach also doesn't let you handle effective dates very easily, but it may still be better for your users (in which case, just handle the documentation separately).
If you have more advanced users, then you might consider a table/dataset that is more relational in nature. A hybrid approach might include a dataset with columns:
Dataset Name (qualified as needed to ensure uniqueness)
Format Name
Start
Label
Other elements (Type, HLO, etc.)
Effective date
That would allow users to make their own modifications (assuming you trust them enough to add dataset name properly, anyway - or set up a stored proc to do the adding from a temp table that checked for conflicts) and allow you to handle format names that conflicted. You'd still have to have a way for the user to handle using multiple datasets, if that's necessary (such as by adding some unique element to the format name by default, like 'dataset ID').
In my mind, however, the best option is using a data dictionary to handle the metadata, which combines self-documentation with metadata management. Similar to the above, you have a table with dataset and format elements, but you add columns for descriptive text (the question description, for example) and other useful information, depending on your use cases. This can be maintained in a database table or dataset, or perhaps more usefully in an Excel or similar document that can be shared with non-programmers and easily edited. I use this method for several projects, and it has paid off by allowing my users to help write the documentation for my code, keeping my programs accurate and up to date while minimizing back-and-forth discussions of updates. I just import the spreadsheet and run a PROC FORMAT each time I run my data (see the sketch below).
You can then have one spreadsheet per dataset, one tab, or one full spreadsheet with all datasets in them - whichever is easiest to use. This easily handles 'effective date' type issues as well - or even versioning, as that can be handled in the spreadsheet.
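A minimal sketch of that import-and-format step, assuming SAS/ACCESS to PC Files is available and the spreadsheet carries the PROC FORMAT control-data columns fmtname, start, and label (the path is illustrative):
/* read the maintained spreadsheet into a control dataset */
proc import datafile="/shared/formats/master_formats.xlsx"
    out=work.fmt_control dbms=xlsx replace;
run;

/* build the format catalog from the control dataset */
proc format cntlin=work.fmt_control library=work;
run;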

Tell SAS not to add newly generated tables to the Process Flow

I have a SAS program that creates a lot of intermediary tables for my calculations. Thing is, I don't really care about these tables after the job is done; I only care about the final results.
But every time I run this code, SAS adds all the generated tables to my process flow, turning it into a huge mess (I am talking about 40+ intermediary tables here).
Is there a way to tell SAS not to add some tables to the process flow? Or at least tell it not to add any tables at all? I am using SAS Enterprise Guide 4.1.
Thanks in advance
Under SAS 9.1.x and 9.2.x (for Windows), it's possible to suppress the display of datasets in SAS client environments by prefixing the dataset name with "_TO". So in your code and/or tasks, you could name all your intermediate datasets _TO<DataSetName>, and they won't clutter up your process flow. But they will still be there and can be referenced in code and tasks.
If you do this and you're using tasks, note that it might be tricky to work out how to use the output data from a task as the input for another, if you can't see the dataset to select it. If you have trouble with this, comment on this post and we can address that.
Note that this "_TO" prefix thing is an undocumented, "hidden" feature that is to be deprecated in 9.3 - see this blog for details.
If you set the option "Maximum number of output data sets to add to the project" (under Results > General) to zero, EG will not add any datasets to the project, but they will still be available to view from the Server -> Library view (they are added to the flow at the point you request them).
I know this question is a year and a half old now, but if you are working with intermediate tables that can be deleted after you get the final results, SAS EG has a built-in macro you can use to delete these tables:
%_eg_conditional_dropds([table1], [table2], ... ,[table-n]);
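For example (the dataset names here are illustrative):
%_eg_conditional_dropds(work.tmp_calc1, work.tmp_calc2);
The macro checks that each dataset exists before dropping it, so it is safe to call even when some of the intermediate tables were never created.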

For a SAS dataset, what is the best way to prevent locking for multiple user access

Thanks for reading this.
I am using a shared service (server=shareLib) when setting up my libref, to allow users of my SAS/IntrNet application to modify and update (add new) records of a single dataset. The application will also be used to query the dataset. To minimize locking, I only use a DATA step to modify and update, rather than PROC SQL (which locks the entire member). However, I wonder whether locking is more or less likely if only update/modify access uses the share service while queries do not.
%if &type=QUERY %then %do;
    /* read-only requests attach to the library directly */
    libname lib '/myServer/library';
%end;
%else %do;
    /* modify/update requests go through the SAS/SHARE server */
    libname lib '/myServer/library' server=shareLib;
%end;
This isn't my actual code, but I do know whether a request is going to just send data back, modify an existing record, or add a new record (update).
I had originally made this distinction because we were having some failures attaching to the share service (not sure that is the correct terminology), while referencing the library just to query the data did not fail. Since then we have, I think, solved this problem, but I am wondering whether I am setting myself up for problems.
Thanks
Since your question reads more like a request for general advice on data access and concurrency in SAS, my answer will take the form of general advice rather than a specific solution.
There is excellent SAS online documentation; please visit the index and find the information relevant for your further reading.
Please have a further look at the ACCESS=READONLY libname option. It does pretty much what it says, namely restrict access to data members in the libname to read-only. This holds the benefit that you don't alter your data by accident during a non-altering query. It also leaves SAS room to give data-altering queries higher levels of control over the data.
SAS/SHARE enables concurrent data access, so if you need to provide concurrent access (read/write) to the same data, SAS/SHARE is a good choice. It means that you can get away with assigning your libname just once, giving your libname statement the option "SERVER=SHARELIB", and have SAS/SHARE manage concurrent data access. If you assign the libname to your SAS/SHARE server process, all subsequent SAS processes needing access to this libname only have to assign the libname like "LIBNAME LIB SERVER=SHARELIB" (note there is no physical path - the SAS/SHARE server process takes care of that). This setup functions at its best if you have a separate SAS process for your SAS/SHARE server.
You can also lock your libname with the LOCK statement or the LOCK command. This means that your SAS program can ensure itself exclusive access rights to a libname, in case that's what you need. Other SAS processes can then use the lock command to query a specific libname and see if it can get (exclusive) access.
You can also control access at the data-member level. This is done with the CNTLLEV data set option. For example, DATA LIB.MYDATA(CNTLLEV=LIB); specifies that access control is at the library level, restricting concurrent access to only one update process for the library. CNTLLEV=MEM and CNTLLEV=REC restrict concurrent access at the member and record level, respectively.
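A minimal sketch of these options side by side (the path, librefs, and dataset names are illustrative, not from your application):
/* read-only libref for query-type requests */
libname qlib '/myServer/library' access=readonly;

/* exclusive rights to one member via the LOCK statement */
lock lib.mydata;
/* ... updates that need exclusive access ... */
lock lib.mydata clear;

/* record-level control: concurrent processes may work on other records */
data work.snapshot;
    set lib.mydata(cntllev=rec);
run;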
These options can be combined in many different ways, giving a lot of room for you to make access as fine-grained as you need. I hope these choices will help you complete your task.