Outputting the top 10% in SAS - sas

How can I limit the number of observations that are output in SAS to the top 10%? I know about the obs= function, but I haven't been able to figure out how to make the obs= lead to a percentage.
Thanks in advance!

As far as I know there's not a direct way to do what you're asking. You could pretty easily write a macro to do this, though.
Assuming you're asking to view the first 10% of records in a PROC PRINT, you could do something like this.
%macro top10pct(lib=WORK,dataset=);
proc sql noprint;
select max(ceil(0.1*nlobs)) into :_nobs
from dictionary.tables
where upcase(libname)=upcase("&lib.") and upcase(memname)=upcase("&dataset.");
quit;
proc print data=&lib..&dataset.(obs=&_nobs.);
run;
%mend top10pct;
dictionary.tables has all of the PROC CONTENTS information available to it, including the number of logical observations (NLOBS). This number is not 100% guaranteed to be accurate if you've been doing things to the dataset like deleting observations in SQL, but for SAS datasets this is almost always accurate, or close enough. For RDBMS tables this may be undefined or may not be accurate.

Related

Understanding SAS output data sets

SAS has several forms it uses to create output data sets from within a procedure. It is not always clear whether or not a particular procedure can generate a data set and, if it seems to be able to, it's not always clear how.
Off the top of my head, here are some examples of how widely the syntax can differ.
Example 1
proc sort data = sashelp.baseball out = baseball_sorted;
by
league
division
;
run;
Example 2
proc means noprint data = baseball_sorted;
by
league
division
;
var nHits;
output
out = baseball_avg_hits (drop = _TYPE_ _FREQ_)
mean = mean_hits
;
run;
Example 3
ods exclude all;
ods output
statistics = baseball_statistics
equality = baseball_ftest
;
proc ttest data = baseball_sorted;
class league;
var nHits;
run;
ods exclude none;
Example 4
The PROC ANOVA OUTSTAT= option.
It seems almost as if SAS has implemented each of these willy-nilly. Is the SAS syntax dictating how to create a data set directed by some consistent approach I am not seeing or is it truly capricious and arbitrary?
For PROC code, the syntax for outputting data is often specific to that procedure, which often feels willy-nilly. (Your examples 1, 2, 4) I think PROC developers are given a lot of freedom, and remember that many of these PROCS are 30+ years old.
The great thing about the Output Delivery System (ODS, your example 3) is it provides a single syntax for outputting data, regardless of the procedure. So you can use the ODS OUTPUT statement with (almost?) any PROC. The names and structures of the output objects will of course vary between PROCs. So if you are looking for a consistent approach, I would focus on using ODS OUTPUT. ODS was added in V7 (I think).
It would be interesting to try to find an example of an output dataset which could be made by a PROC but could not be made by ODS OUTPUT. I hope there aren't any. If that is the case, you could consider the range of OUTPUT statements/options within PROCs as legacy code.
Agree with Quentin. You have to remember that there are SAS systems out there running code written in the 80s. SAS would have a huge headache if they forced every team to rewrite all the procedures and then forced their customers to change all their code. SAS has been around since the 60s and the organic growth of the syntax is to be expected.
FWIW, having an OUT= statement makes sense on things with no graphical output. I.E. PROC SORT or PROC TRANSPOSE.
The way I see it there are four main ways to specify the output data sets.
In the PROC statement you may be able to specify some type of output statements or options, such as OUT= OUTEST=.
In the main statement of the procedure, ie MODEL/TABLE can have options that allow for output. ie PROC FREQ has an OUT= on the TABLE statement.
An explicit OUTPUT statement within a procedure. These are typically from older procedures. ie PROC MEANS
ODS tables which are relatively newer method, more frequently used these days since the format aligns with what you'd expect to see.
Yes, there are multiple places to check, but fortunately the SAS documentation for procedures is relatively clear with the options and how to use/specify the outputs.
If I've missed anything that seems different post in the comments and I can update this.
PS. Although SAS is definitely bad, trying to navigate different packages/modules in Python to export an XLSX file isn't straight forward either. Some packages support some options others don't. I've given up on asking why these days and just accept it as peculiarities of the different languages at this point.

Interpreting WHERE= in Proc

Converting a SAS code to Sql based process. Came across this pretty simple snippet.
proc freq data=temp1(where=(SAMERETAIL='Y')) noprint;
tables RETAILER*store/list nocum nopercent out=retailer_list;
run;
My interpretation of this is:
From Temp1:
Choose all observations which fit the criteria (sameretail=Y)
Extract Retail, Store frequency counts:
Store Retailer Count(*)
Output to Retailer_List.
The question I have is on the WHERE=. Is this applied to the Proc or Data? Is my interpretation correct? Business wise this is incorrect since we are only restricting the records with the flag=Y. Hence the question.
Any pointers?
Any help is greatly appreciated.
TIA.
where= is applied to the dataset you're pulling observations from to use in the proc freq. SUGI 24 has a good summary of how this works (see page 3).
THE WHERE DATA SET OPTION
The syntax of the WHERE data set option, called
WHERE=, is a combination of standard data set
option parentheses and a where-expression.

SAS dynamically declaring macro variable

My company just switched from R to SAS and I am converting a lot of my R code to SAS. I am having a huge issue dynamically declaring variables (macro variables) in SAS.
For example one of my processes needs to take in the mean of a column and then apply it throughout the code in many steps.
%let numm =0;
I have tried the following with my numm variable but both methods do not work and I cannot seem to find anything online.
PROC MEANS DATA = ASSGN3.COMPLETE mean;
#does not work
&numm = VAR MNGPAY;
run;
Proc SQL;
#does not work
&numm =(Select avg(Payment) from CORP.INV);
quit;
I would highly recommend buying a book on SAS or taking a class from SAS Training. SAS Programming II is a good place to start (Programming I if you have not programmed anything else, but that doesn't sound like the case). The code you have shows you need it. It is a complete paradigm shift from R.
That said, try this:
proc sql noprint;
select mean(payment) into :numm from corp.inv;
quit;
%put The mean is: &numm;
Here's the proc summary / data step equivalent:
proc summary data = corp.inv;
var payment;
output out = inv_summary mean=;
run;
data _null_;
set inv_summary;
call symput('numm',payment);
run;
%put The mean is: &numm;
Proc sql is a more compact approach if you just want a simple arithmetic mean, but if you require more complex statistics it makes sense to use proc summary.

how to get timing information of a data step query

I just wanted to know like in proc sql we define stimer option.
The PROC SQL option STIMER | NOSTIMER specifies whether PROC SQL writes timing information for each statement to the SAS log, instead of writing a cumulative value for the entire procedure. NOSTIMER is the default.
Now in same way how to specify timing information in data set step. I am not using proc sql step
data h;
select name,empid
from employeemaster;
quit;
PROC SQL steps individually are effectively separate data steps, so in a sense you always get the identical information from SAS. What you're asking is effectively how to find out how long 'select name' takes versus 'empid'.
There's not a direct way to get the timing of an individual statement in a data step, but you could write data step code to find out. The problem is that the data step is executed row-wise, so it's really quite different from the PROC SQL STIMER details; almost nothing you do in a data step will take very long by itself, unless you are doing something more complex like a hash table lookup. What takes long is writing out the data first, and reading in the data second.
You do have some options for troubleshooting long data steps, if that's your concern. OPTIONS MSGLEVEL=I will give you information about index usage, merge details, etc., which can be helpful if you aren't sure why it is taking a long time to do certain things (see http://goo.gl/bpGWL in SAS documentation for more info). You can write your own timestamp:
data test;
set sashelp.class sashelp.class;
_t=time();
put _t=;
run;
Odds are that won't show you much of use since most data step iterations won't take very long but if you are doing something fancy it might help. You could also use conditional statements to only print the time at certain intervals - when at FIRST.ID for example in a process that works BY ID;.
Ultimately though the information you already get from notes is what is most useful. In PROC SQL you need the STIMER information because SQL is doing several things at once, while SAS lets/makes you do everything out step-wise. Example:
PROC SQL;
create table X as select * from A,B where A.ID=B.ID;
quit;
is one step - but in SAS this would be:
proc sort data=a; by ID; run;
proc sort data=b; by ID; run;
data x;
merge a(in=a) b(in=b);
by id;
if a and b;
run;
For that you would get information on the duration of each of those steps (the two sorts and the merge) in SAS, which is similar to what STIMER would tell you.
No way.
PROC SQL STIMER logs timing for each separately executable SQL statement/query.
In data step, as you may know, the data step looping occurs, observation per observation, so the data step statement timing would be something like per observation, let's say transactional. Anyway this would not describe all the details where the time is being spent - waiting for disk reads, writes, etc.
So I guess this won't be very usable. In general, SAS performance is I/O driven.

Dropping a table in SAS

What is the most efficient way to drop a table in SAS?
I have a program that loops and drops a large number of tables, and would like to know if there is a performance difference between PROC SQL; and PROC DATASETS; for dropping a single table at a time..
Or if there is another way perhaps???
If it is reasonable to outsource to the OS, that might be fastest. Otherwise, my unscientific observations seem to suggest that drop table in proc sql is fastest. This surprised me as I expected proc datasets to be fastest.
In the code below, I create 4000 dummy data sets then try deleting them all with different methods. The first is with sql and on my system took about 11 seconds to delete the files.
The next two both use proc datasets. The first creates a delete statement for each data set and then deletes. The second just issues a blanket kill command to delete everything in the work directory. (I had expected this technique to be the fastest). Both proc datasets routines reported about 20 seconds to delete all 4000 files.
%macro create;
proc printto log='null';run;
%do i=1 %to 4000;
data temp&i;
x=1;
y="dummy";
output;run;
%end;
proc printto;run;
%mend;
%macro delsql;
proc sql;
%do i=1 %to 4000;
drop table temp&i;
%end;
quit;
%mend;
%macro deldata1;
proc datasets library=work nolist;
%do i=1 %to 4000;
delete temp&i.;
%end;
run;quit;
%mend;
%macro deldata2;
proc datasets library=work kill;
run;quit;
%mend;
option fullstimer;
%create;
%delsql;
%create;
%deldata1;
%create;
%deldata2;
I tried to fiddle with the OS-delete approach.
Deleting with the X-command can not be recommended. It took forever!
I then tried with the system command in a datastep:
%macro delos;
data _null_;
do i=1 to 9;
delcmd="rm -f "!!trim(left(pathname("WORK","L")))!!"/temp"!!trim(left(put(i,4.)))!!"*.sas7*";
rc=system(delcmd);
end;
run;
%mend;
As you can see, I had to split my deletes into 9 separate delete commands. The reason is, I'm using wildcards, "*", and the underlying operating system (AIX) expands these to a list, which then becomes too large for it to handle...
The program basically constructs a delete command for each of the nine filegroups "temp[1-9]*.sas7*" and issues the command.
Using the create macro function from cmjohns answer to create 4000 data tables, I can delete those in only 5 seconds using this approach.
So a direct operating system delete is the fastest way to mass-delete, as I expected.
We are discussing tables or datasets?
Tables implies database tables. To get rid of these in a fast way, using proc SQL pass-through facility would be the fastest. Specifically if you can connect to the database once and drop all of the tables, then disconnect.
If we are discussing datasets in SAS, I would argue that both proc sql and proc datasets are extremely similar. From an application standpoint, they both go through the same deduction to create a system command that deletes a file. All testing I have seen from SAS users groups or presentations have always suggested that the use of one method over the other is marginal and based on many variables.
If it is imperative that you have the absolute fastest way to drop the datasets / tables, you may just have to test it. Each install and setup of SAS is different enough to warrant testing.
In terms of which is faster, excluding extremely large data, I would wager that there is little difference between them.
When handling permanent SAS datasets, however, I like to use PROC DATASETS rather than PROC SQL, simply because I feel better manipulating permanent datasets using the SAS-designed method, and not the SQL implementation
Simple Solution for temporary tables that are named similarly:
If all of your tables start with the same prefix, for example p1_table1 and p1_table2, then the following code will delete any table with that starts with p1
proc datasets;
delete p1: ;
run;
proc delete is another, albeit undocumented, solution..
http://www.sascommunity.org/wiki/PROC_Delete