Dropping a table in SAS

What is the most efficient way to drop a table in SAS?
I have a program that loops and drops a large number of tables, and I would like to know whether there is a performance difference between PROC SQL and PROC DATASETS for dropping a single table at a time.
Or is there perhaps another way?

If it is reasonable to outsource to the OS, that might be fastest. Otherwise, my unscientific observations seem to suggest that drop table in proc sql is fastest. This surprised me as I expected proc datasets to be fastest.
In the code below, I create 4000 dummy data sets then try deleting them all with different methods. The first is with sql and on my system took about 11 seconds to delete the files.
The next two both use proc datasets. The first creates a delete statement for each data set and then deletes. The second just issues a blanket kill command to delete everything in the work directory. (I had expected this technique to be the fastest). Both proc datasets routines reported about 20 seconds to delete all 4000 files.
%macro create;
proc printto log='null';run;
%do i=1 %to 4000;
data temp&i;
x=1;
y="dummy";
output;run;
%end;
proc printto;run;
%mend;
%macro delsql;
proc sql;
%do i=1 %to 4000;
drop table temp&i;
%end;
quit;
%mend;
%macro deldata1;
proc datasets library=work nolist;
%do i=1 %to 4000;
delete temp&i.;
%end;
run;quit;
%mend;
%macro deldata2;
proc datasets library=work kill;
run;quit;
%mend;
option fullstimer;
%create;
%delsql;
%create;
%deldata1;
%create;
%deldata2;

I tried to fiddle with the OS-delete approach.
Deleting with the X command cannot be recommended. It took forever!
I then tried with the system command in a datastep:
%macro delos;
data _null_;
do i=1 to 9;
delcmd="rm -f "!!trim(left(pathname("WORK","L")))!!"/temp"!!trim(left(put(i,4.)))!!"*.sas7*";
rc=system(delcmd);
end;
run;
%mend;
As you can see, I had to split my deletes into 9 separate delete commands. The reason is, I'm using wildcards, "*", and the underlying operating system (AIX) expands these to a list, which then becomes too large for it to handle...
The program basically constructs a delete command for each of the nine filegroups "temp[1-9]*.sas7*" and issues the command.
Using the create macro from cmjohns's answer to create 4000 data tables, I can delete them in only 5 seconds using this approach.
So a direct operating system delete is the fastest way to mass-delete, as I expected.

Are we discussing tables or datasets?
Tables implies database tables. To get rid of these quickly, the PROC SQL pass-through facility would be the fastest, especially if you can connect to the database once, drop all of the tables, and then disconnect.
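A minimal sketch of that connect-once pattern, assuming an ODBC connection through SAS/ACCESS. The data source name mydsn and the table names temp1 and temp2 are hypothetical placeholders, not anything from the question:

```sas
proc sql;
  /* Open one connection; "mydsn" is a hypothetical ODBC data source name */
  connect to odbc as db (dsn="mydsn");
  /* Each EXECUTE sends its statement to the database unchanged */
  execute (drop table temp1) by db;
  execute (drop table temp2) by db;
  /* Close the connection once all drops are done */
  disconnect from db;
quit;
```

In a loop, the EXECUTE statements could be generated by a macro %do, keeping the CONNECT and DISCONNECT outside the loop.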
If we are discussing datasets in SAS, I would argue that proc sql and proc datasets are extremely similar. From an application standpoint, they both go through the same process to build a system command that deletes a file. All the testing I have seen from SAS user groups or presentations has suggested that the difference between one method and the other is marginal and depends on many variables.
If it is imperative that you have the absolute fastest way to drop the datasets / tables, you may just have to test it. Each install and setup of SAS is different enough to warrant testing.

In terms of which is faster, excluding extremely large data, I would wager that there is little difference between them.
When handling permanent SAS datasets, however, I like to use PROC DATASETS rather than PROC SQL, simply because I feel better manipulating permanent datasets using the SAS-designed method, and not the SQL implementation.

Simple Solution for temporary tables that are named similarly:
If all of your tables start with the same prefix, for example p1_table1 and p1_table2, then the following code will delete any table whose name starts with p1:
proc datasets library=work nolist;
delete p1: ;
run;
quit;

proc delete is another, albeit undocumented, solution.
http://www.sascommunity.org/wiki/PROC_Delete
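A minimal sketch of the PROC DELETE approach, assuming two throwaway data sets in WORK (the names temp1 and temp2 are only for illustration):

```sas
/* Create two throwaway data sets so there is something to delete */
data work.temp1 work.temp2;
  x = 1;
run;

/* PROC DELETE takes a list of data sets on DATA= */
proc delete data=work.temp1 work.temp2;
run;
```

Whether this is faster than PROC SQL or PROC DATASETS on a given install would need the same kind of timing test shown above.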

Related

How to skip code if created dataset has zero rows

I have a job which at first imports some xlsx files, then connects to multiple DB tables. Based on conditions, the job selects rows to output, and creates an excel file to send on to the final end-user.
Sometimes, that job returns zero rows, which is acceptable; in that case, I would prefer to create an empty excel file with only the variables, but not run the other code (checking/cleaning code).
How can I conditionally execute code only when there are results?
Something like this:
I get 0 rows
If Result = 0 then Go to *"here"*
Else *"just run the code further"*
You have a few useful things that can help you here.
First off, PROC SQL sets a macro variable SQLOBS, which is particularly useful in identifying how many records were returned from the last SQL query it ran.
proc sql;
select * from sashelp.class;
quit;
%put I returned &SQLOBS rows;
You might use this to drive further processing, either with %IF blocks as Tom notes in comments or other methods I will cover below.
You can also check how many rows are in a dataset explicitly, if you prefer a slightly more robust option.
proc sql;
select count(*) into :class_count from sashelp.class;
quit;
%put I returned &class_count rows;
For very large datasets, there are faster options (using the dataset descriptors, dictionary tables, or a few other options), but for most tables this is fine.
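As a sketch of the descriptor-based option, the row count can be read from the data set header without scanning the rows. This assumes a native SAS data set; for views or database tables the NLOBS attribute may not be available:

```sas
/* Open the data set, read the logical observation count, then close it */
%let dsid = %sysfunc(open(sashelp.class));
%let class_count = %sysfunc(attrn(&dsid, nlobs));
%let rc = %sysfunc(close(&dsid));
%put I found &class_count rows;
```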
Either way, what I would typically do with a program I intended to run in production would be then to drive the rest of the program from macros.
%macro whatIWantToDo(params);
...
do stuff
...
%mend whatIWantToDo;
proc sql;
mySqlStuff;
quit;
%if &sqlobs. gt 0 %then %do;
%whatIWantToDo(params);
%end;
%else %do;
%put Nothing to do;
%end;
Another option is to use call execute; this is appropriate if your data drives the macro parameters. The big advantage of call execute is that it only runs if you have data rows - if you have zero, it won't do anything!
Say you have some datasets to run code on. You could have up to twelve - one per month - but only have them for the current calendar year, so in Jan you have one, Feb you have two, etc. You could do this:
data mydata_jan mydata_feb mydata_mar;
set sashelp.class;
run;
%macro printit(data=);
title "Printing &data.";
proc print data=&data;
run;
title;
%mend printit;
data _null_;
set sashelp.vtable;
where libname = 'WORK' and upcase(memname) like 'MYDATA_%' and nobs gt 0;
callstr = cats('%printit(data=',memname,')');
call execute(Callstr);
run;
First I make the datasets, with a name I can programmatically identify. Then I make the macro that I want to run on each (this could be checking, cleaning, whatever). Then I use sashelp.vtable which shows which tables are created, and check the nobs variable (number of observations) is more than zero. Then I use call execute to run the macro on that dataset!

SAS: proc reg and macro

I have a data set that contains 30 variables and 2000 observations.
I want to calculate a regression in a loop, where in each step I delete the i-th row of the data,
so that in the end my output will be 2001 regressions: one on all the data, and 2000 where one row is dropped each time.
I am new to SAS, and I tried to find out how to do it with a macro, but I didn't understand.
Any comments and help will be appreciated!
This will create the data set I was talking about in my comment to Chris.
data del1V /view=del1v;
length group _obs_ 8;
set sashelp.class nobs=nobs;
_obs_ = _n_;
group=0;
output;
do group=1 to nobs;
if group eq _n_ then;
else output;
end;
run;
proc sort out=analysis;
by group;
run;
DATA NEW;
SET OLD;
do i = 1 to 2001;
IF _N_ ^= i THEN group=i;
else group=.;
output;
end;
run;
proc sort data=new;
by group;
run;
proc reg syntax;
by group;
run;
This will create a data set that is much longer. You will only call proc reg once, but it will run 2001 models.
Examining 2001 regression outputs will be difficult just written as output. You will likely need to go read the PROC REG support documentation and look into the output options for whatever type of output you're interested in. SAS can create a data set with the GROUP column to differentiate the results.
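One way to collect the results in a data set, sketched under assumptions: it uses the ANALYSIS data set built from sashelp.class earlier, and the model (weight on height and age) is a hypothetical stand-in for your own response and predictors. OUTEST= writes one row of parameter estimates per BY group:

```sas
/* ANALYSIS is the sorted data set built from sashelp.class above */
proc reg data=analysis outest=estimates noprint;
  by group;
  /* Hypothetical model; substitute your own response and predictors */
  model weight = height age;
run;
quit;

/* One row per group: group 0 is the full data, group k omits row k */
proc print data=estimates;
  var group intercept height age _rmse_;
run;
```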
I edited my original answer per @data _null_'s suggestion. I agree that the above is probably faster, though I'm not as confident that it would be 100x faster. I do not know enough about the overhead of calling proc reg repeatedly versus the cost of the by statement and a larger data set. Regardless, the answer above is simpler to program. Here is my original answer/alternate approach.
You can do this within a macro program. It will have this general structure:
%macro regress;
%do i=1 %to 2001;
DATA NEW;
SET OLD;
IF _N_=&I THEN DELETE;
RUN;
proc reg syntax;
run;
%end;
%mend;
%regress
Macros are an advanced programming feature in SAS. A macro program is required in order to loop over proc reg. The % prefix indicates macro statements and functions. &i is a macro variable reference (& is the prefix used when a macro variable is resolved). The macro is defined in a block that starts with %macro and ends with %mend, and is called with %regress.
Examining 2001 regression outputs will be difficult just written as output. You will likely need to go read the PROC REG support documentation and look into the output options for whatever type of output you're interested in. Use &i to create a different data set each time and then append together as part of the macro loop.

How do I work with a SAS file that was created in a different format (Linux/Windows) if I don't have access to machine that created it?

I have numerous SAS datasets on my Windows 7 / SAS 9.4 machine:
data_19921.sas7bdat
data_19922.sas7bdat
data_19923.sas7bdat
....
data_200212.sas7bdat
etc. The format is data_YYYYM.sas7bdat or data_YYYYMM.sas7bdat (the latter for two digit months) and every dataset has identical variables and formatting. I'm trying to iterate over all of these files and append them into one big SAS dataset. The datasets were created on some Unix machine elsewhere in my company that I don't have access to. When I try to append the files:
%let root = C:\data;
libname in "&root\in";
libname out "&root\out";
/*****************************************************************************/
* initialize final data set but don't add any observations to it;
data out.alldata;
set in.data_19921;
stop;
run;
%macro append_files;
%do year=1992 %to 2002;
%do month=1 %to 12;
proc append data=out.alldata base=in.data_&year&month;
run;
%end;
%end;
%mend;
%append_files;
I get errors that say
File in.data_19921 cannot be updated because its encoding does not match the session encoding or the file is in a format native to another host, such as LINUX_32, INTEL_ABI
I don't have access to the Unix machine that created these datasets (or any Unix machine right now), so I don't think I can use proc cport and proc cimport.
How can I read/append these data sets on my Windows machine?
You can use the colon operator or dash to shortcut your process:
data out.alldata;
set in.data_:;
run;
OR
data out.alldata;
set in.data_19921-in.data_200212;
run;
If you have variables that are different lengths you may get truncation. SAS will warn you though:
WARNING: Multiple lengths were specified for the variable name by input data set(s). This may
cause truncation of data.
One way to avoid it is to create a dummy table first, based on the original table. The following SQL code writes the CREATE TABLE statement to the log, and you can modify the lengths manually as needed.
proc sql;
describe table in.data_19921;
quit;
*run create table statement here with modified lengths;
data out.alldata;
set fake_data (obs=0)
in.data_19921-in.data_200212;
run;

Grouping plots and tables using a BY statement in SAS

I am using a BY statement with both proc boxplot and proc report to create a plot and a table for each level of the BY variable. As is, the code prints all of the plots and then prints all of the tables. I would like it to print the plot and then the table for each level of the BY variable (so the output would alternate between a plot and a table). Is there a way to do this?
This is the code I currently have for the plots and tables-
proc boxplot data=study;
plot Lead_Time*Study_ID/ horizontal;
by Project_Name;
format Lead_Time dum.;
run;
proc report data=study nowd;
column ID Title Contact Status Message Audience Priority;
by Project_Name;
run;
Thank You!!
Unfortunately, I don't think the ODS (Output Delivery System) can interleave outputs from procedures. You will need to use a macro to loop over all the by variables and call BOXPLOT and REPORT for each one.
Something like this:
%macro myreport();
%let byvars = A B C D;
%let n=4;
%do i=1 %to &n;
%let var = %scan(&byvars,&i);
proc something data=have(where=(byvar="&var"));
...;
run;
proc report data=have(where=(byvar="&var"));
....
run;
%end;
%mend;
%myreport();
Obviously you need to change this to fit your needs. There are plenty of examples of this on Stack Overflow. Here is one: looping over character values in SAS
This is in principle possible using PROC DOCUMENT and the ODS DOCUMENT output type. It's not exactly easy, but it's possible, and it has some advantages over the macro option, although I'm not sure they are sufficient to recommend its use. However, it's worth exploring nonetheless.
First off, this is largely guided (including, coincidentally, using the same dataset!) by Cynthia Zender's excellent tutorial, Have It Your Way: Rearrange and Replay Your Output with ODS DOCUMENT, presented during the 2009 SAS Global Forum. She initially describes a GUI method of doing this, but then later explains it in code, which would clearly be superior for this sort of thing. Kevin Smith covers similar ground in ODS DOCUMENT From Scratch, from 2012's SGF, though Cynthia's paper is a bit more applicable here (as she covers the exact topic).
First, you need to generate all of your results. Order here doesn't matter too much.
I generate a sample of SASHELP.PRDSALE that is sorted appropriately by country.
proc sort data=sashelp.prdsale out=prdsale;
by country;
run;
Then, we generate some tables; a proc means and a sgplot. Note the title uses #BYVAL1 to make sure the title is included - otherwise we lose the useful labels on the procs!
title "#BYVAL1 Report";
ods _all_ close;
ods document name=work.mydoc(write);
proc means data=prdsale sum;
by country;
class quarter year;
var predict;
run;
proc sgplot data=prdsale;
by country;
vbar quarter/response=predict group=year groupdisplay=cluster;
run;
ods document close;
ods preferences;
Now, we have something that is wrong, but is usable for what you actually want. You can use the techniques in Cynthia or Kevin's papers to look into this in detail; for now I'll just go into what you need for this purpose.
It's now organized like this, imagining a folder tree:
\REPORT\MEANS\COUNTRY\
What we need is:
\REPORT\COUNTRY\MEANS
That's easy enough to do. The code to do so is below. Obviously, for a production process this would be better automated; given the input dataset it should be trivial to generate this code. Note that the BYVALs increment for each by value, so CANADA is 1 and 4, GERMANY is 2 and 5, and USA is 3 and 6.
proc document name=work.mydoc_new(write);
make CANADA, GERMANY, USA; *make the lower level folders;
run;
dir ^^; *Go up one level, think "cd .." in unix/windows;
dir CANADA; *go to Canada folder;
dir; *Notes to the Listing destination where we are, not that important;
copy \work.mydoc\Means#1\ByGroup1#1\Summary#1 to ^; *copy that folder from orig doc to here;
copy \work.mydoc\SGPlot#1\ByGroup4#1\SGPlot#1 to ^; *^ being current directory, like '.' in unix/windows;
*You could also copy \ByGroup1#1 and \Bygroup4#1 without the last level of the tree. That would give a slightly different result (a bit more of the text around the table would be included), so do whichever matches your expectations.;
**Same for Germany and USA here. Note that this is the part that would be easy to automate!;
dir ^^;
dir GERMANY;
dir;
copy \work.mydoc\Means#1\ByGroup2#1\Summary#1 to ^;
copy \work.mydoc\SGPlot#1\ByGroup5#1\SGPlot#1 to ^;
dir ^^;
dir USA;
dir;
copy \work.mydoc\Means#1\ByGroup3#1\Summary#1 to ^;
copy \work.mydoc\SGPlot#1\ByGroup6#1\SGPlot#1 to ^;
run;
quit; *this is one of those run group procedures, need a quit;
Now, you only have to replay the document to get it out the right way.
proc document name=mydoc_new;
replay;
run;
quit;
Tada, you have what you want.
If you're going to run the procs once per by value, that's pretty easy. Create a macro to run just one instance, then use proc sql to create a call for each instance. That is entirely dynamic, and could be easily adjusted to allow for other options such as multiple by variables, levels, etc.
Given a single by value:
*Macro that runs it once;
%macro run_reports(project_name=);
title "Report for &project_name.";
proc boxplot data=study;
plot Lead_Time*Study_ID/ horizontal;
where Project_Name="&project_name.";
format Lead_Time dum.;
run;
proc report data=study nowd;
column ID Title Contact Status Message Audience Priority;
where Project_Name="&project_name.";
run;
%mend run_Reports;
*SQL pull to create a list of macro calls;
proc sql;
select distinct cats('%run_Reports(project_name=',project_name,')')
into :runlist separated by ' '
from study;
quit;
&runlist.;
Turn options symbolgen; on to see what the runlist looks like, or look at your output window (or results window in 9.3+). When you're running this in production, add noprint to proc sql to avoid generating that table.

Outputting the top 10% in SAS

How can I limit the number of observations that are output in SAS to the top 10%? I know about the obs= function, but I haven't been able to figure out how to make the obs= lead to a percentage.
Thanks in advance!
As far as I know there's not a direct way to do what you're asking. You could pretty easily write a macro to do this, though.
Assuming you're asking to view the first 10% of records in a PROC PRINT, you could do something like this.
%macro top10pct(lib=WORK,dataset=);
proc sql noprint;
select max(ceil(0.1*nlobs)) into :_nobs
from dictionary.tables
where upcase(libname)=upcase("&lib.") and upcase(memname)=upcase("&dataset.");
quit;
proc print data=&lib..&dataset.(obs=&_nobs.);
run;
%mend top10pct;
dictionary.tables has all of the PROC CONTENTS information available to it, including the number of logical observations (NLOBS). This number is not 100% guaranteed to be accurate if you've been doing things to the dataset like deleting observations in SQL, but for SAS datasets this is almost always accurate, or close enough. For RDBMS tables this may be undefined or may not be accurate.
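A short usage sketch for the macro above, assuming it has already been compiled. Note that it prints the first 10% of rows in stored order; if "top" means the highest values of some variable, the data would need sorting first. The variable height and the output name class_sorted are only illustrative:

```sas
/* Prints the first 10% of sashelp.class, rounded up */
%top10pct(lib=sashelp, dataset=class);

/* If "top" means largest values, sort descending first */
proc sort data=sashelp.class out=work.class_sorted;
  by descending height;
run;
%top10pct(lib=work, dataset=class_sorted);
```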