I have many large tables (billions of rows each), which I want to subset based on an id_list (millions of rows). I'm using a hash table to speed it up:
data subset1;
    set large_table1;
    if _n_ eq 1 then do;
        declare hash ht(dataset:"id_list");
        ht.definekey('id');
        ht.definedone();
    end;
    if ht.check() eq 0 then output;
run;
How can I reuse id_list's hash table? Recreating it in each subset query wastes too much time.
Update: As the answers show, there is currently no workaround to make hash tables persist across steps in SAS. I empirically tested two suboptimal alternatives with a 12-million-row id_list and a 1.5-billion-row large_table. Using a format instead of a hash table took almost double the time (40 minutes vs. 23 minutes), which makes the overhead of recreating the hash table in each data step negligible by comparison, so I'll just do that for the time being.
Sadly, hash tables cannot persist across DATA steps: as far as I know, when the step ends they are deleted to free the memory. I saw a talk by Art Carpenter at SGF 2018 in which he tried several ways to trick SAS into making a persistent hash table, and none of them succeeded.
https://www.sas.com/content/dam/SAS/support/en/sas-global-forum-proceedings/2018/2399-2018.pdf
For completeness, this is how you'd re-use the hash: using FCMP. It doesn't truly reuse the table within a data step (it will re-load the hash table), but in a macro it does persist.
proc fcmp outlib=work.funcs.func;
    function check_ids(name $);
        declare hash h_ids(dataset:"work.class_names");
        rc = h_ids.defineKey("name");
        rc = h_ids.definedone();
        rc = h_ids.check();
        return (not rc);
    endsub;
quit;
data class_names;
    set sashelp.class;
    where sex='F';
run;
options cmplib=work.funcs;
data class_find_f;
    set sashelp.class;
    if check_ids(name)=1;
run;
See "Hashing in PROC FCMP to Enhance Your Productivity" for more details on hashing in FCMP.
The way I'd do it is not to use a hash table, but to use a format.
data for_fmt;
    set id_list;
    retain fmtname 'idlistf' type 'n'; *or type 'c' if id is character (and add $ to fmtname);
    start=id;
    label=1;
    output;
    if _n_=1 then do; *this section tells the format what to do with 'other' (not-found) IDs;
        hlo='o';
        call missing(start); *not needed, but I like to do this for clarity;
        label=0;
        output;
    end;
run;
*if ID can be duplicated, then run a proc sort nodupkey here;
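*a minimal sketch of that dedupe step, keyed on start (which holds the id value);
proc sort data=for_fmt nodupkey;
    by start;
run;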
proc format cntlin=for_fmt;
run;
This persists and should be about as fast as your hash table. If your ID list is very large, you can use a view here so it is processed only once.
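To apply it, the lookup becomes a PUT call. A minimal sketch using the names from the question (and assuming the numeric label above arrives in the format as the text '1'):
data subset1;
    set large_table1;
    if put(id, idlistf.) = '1'; *keep rows whose id is in the format;
run;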
You could also load the smaller dataset into memory using the SASFILE statement.
http://documentation.sas.com/?docsetId=lestmtsglobal&docsetTarget=n0osyhi338pfaan1plin9ioilduk.htm&docsetVersion=9.4&locale=en
This would speed up the load each time as it would be loading from memory to memory, rather than from disk to memory...
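A minimal sketch using the question's names; the hash is still rebuilt in each step, but it now loads memory-to-memory:
sasfile work.id_list load; /* pin id_list in memory across steps */

data subset1;
    set large_table1;
    if _n_ eq 1 then do;
        declare hash ht(dataset:"id_list");
        ht.definekey('id');
        ht.definedone();
    end;
    if ht.check() eq 0 then output;
run;

/* ...repeat for the other large tables... */

sasfile work.id_list close; /* release the memory */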
I know this is a very basic question, but my code keeps failing when I try to run what I found in the help documentation.
Up to now I have been running an analysis project out of the WORK library, which I understand gets wiped every time a session ends. I have done a bunch of data cleaning and preparation and do not want to redo all of that every time before I start my analysis.
So I understand, from reading this: https://support.sas.com/documentation/cdl/en/basess/58133/HTML/default/viewer.htm#a001310720.htm that I have to output the cleaned dataset to a non-temporary library.
Steps I have taken so far:
1) Created a new library called "Project"
2) Saved it in a folder that I have under "My Folders" in SAS
3) My code for saving the cleaned dataset to the "Project" library is as follows:
PROC SORT DATA=FAA_ALL NODUPKEY;
BY GROUND_SPEED;
DATA PROJECT.FAA_ALL;
RUN;
Then I run this code in a new program:
PROC PRINT DATA=PROJECT.FAA_ALL;
RUN;
It says there are no observations and that the dataset is essentially empty.
Can someone tell me where I'm going wrong?
Your problem is the PROC SORT:
PROC SORT DATA=FAA_ALL NODUPKEY;
BY GROUND_SPEED;
DATA PROJECT.FAA_ALL;
RUN;
It should be:
PROC SORT DATA=FAA_ALL OUT= PROJECT.FAA_ALL NODUPKEY;
BY GROUND_SPEED;
RUN;
The DATA PROJECT.FAA_ALL statement was starting a new DATA step that created a blank data set.
Something else worth mentioning: your data step didn't do what you might have expected because it had no SET statement. Your code was equivalent to:
PROC SORT DATA=WORK.FAA_ALL NODUPKEY;
BY GROUND_SPEED;
RUN;
DATA PROJECT.FAA_ALL;
SET _NULL_;
RUN;
PROJECT.FAA_ALL is empty because nothing was read in.
Without an OUT= option, the SORT procedure sorts the dataset in place. You could have SAS move the sorted data by adding a SET statement to your data step:
PROC SORT DATA=WORK.FAA_ALL NODUPKEY;
BY GROUND_SPEED;
RUN;
DATA PROJECT.FAA_ALL;
SET WORK.FAA_ALL;
RUN;
However, this still takes two steps and requires extra disk I/O. Using the OUT= option in a SAS procedure (as in DomPazz's answer) is almost always faster and more efficient than using a data step just to move data.
I'm new to SAS and am having trouble adding a column to an existing data set using the MODIFY statement (without PROC SQL).
Let's say I have data like this:
id name salary perks
1 John 2000 50
2 Mary 3000 120
What I need to get is a new column with the sum of salary and perks.
I tried to do it this way
data data1;
    modify data1;
    money=salary+perks;
run;
but apparently it doesn't work.
I would be grateful for any help!
As @Tom mentioned, you use SET to access the dataset.
I generally don't recommend programming this way, with the same name in the SET and DATA statements, especially while you're learning SAS. It makes errors harder to recover from: once you run the step and hit an error, you have destroyed your original dataset and must recreate it before you can start again.
If you want to work step by step, consider intermediary datasets, and then clean up when you're done by using PROC DATASETS to delete any unnecessary intermediary datasets. Use a naming convention so you can drop them all at once, e.g. data1, data2, data3 can be referenced as data1-data3 or data:.
data data2;
    set data1;
    money = salary + perks;
run;
You do now have two datasets, but it's easy to drop them later on, and you can run your code in sections rather than all at once.
Here's how you would drop the intermediary datasets:
proc datasets library=work nodetails nolist;
    delete data1-data3;
run; quit;
You can't add a column to an existing dataset in a DATA step. You can, however, make a new dataset with the same name.
data data1;
    set data1;
    money=salary+perks;
run;
SAS will build it as a new physical file (with a temporary name) and when the step finishes without error it deletes the original and renames the new one.
If you want to use a DATA step, you do it like this:
data dataset;
    set dataset;
    format new_column $12.;
    new_column = 'xxx';
run;
Or use PROC SQL and ALTER TABLE:
proc sql;
    alter table dataset
        add new_column char(8) format=$12.;
quit;
I have data with various types of loan descriptions; there are at least 100 of them.
I have to categorise them into various buckets using IF-THEN logic. Please have a look at the code for reference:
data des;
    set desc;
    if loan_desc in ('home_loan','auto_loan') then product_summary='Loan';
    if loan_desc in ('Multi') then product_summary='Multi options';
run;
For illustration I have shown just two loan descriptions, but I have around 1000 different values of loan_desc that I need to categorise into different buckets.
How can I categorise these loan descriptions without writing the product summary and the loan_desc again and again, which makes the code very lengthy and time consuming?
Please help!
Another option for categorizing is to use a format. This example defines the format manually, but you can also create a format from a dataset if you have the from/to values in one (see the CNTLIN= sketch after the code below). As @Tom indicated, this lets you change only the table for future updates; the code stays the same.
One note regarding your current code: you're using IF/THEN rather than IF/ELSE IF. Prefer IF/ELSE IF, because evaluation then stops as soon as one condition is met rather than running through every option.
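For example, the question's two IF statements become:
if loan_desc in ('home_loan','auto_loan') then product_summary='Loan';
else if loan_desc in ('Multi') then product_summary='Multi options';
With that said, here is the format-based version: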
proc format;
    value $loan_fmt
        'home_loan', 'auto_loan' = 'Loan'
        'Multi' = 'Multi options';
run;
data want;
    set have;
    product_summary = put(loan_desc, $loan_fmt.);
run;
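And if the from/to pairs live in a dataset rather than in code, here is a hedged sketch of building the same format with CNTLIN= (the dataset name loan_map and its two columns are assumptions):
data loan_cntl;
    retain fmtname 'loan_fmt' type 'c'; /* character format */
    set loan_map;            /* assumed columns: loan_desc, product_summary */
    start = loan_desc;       /* value to match */
    label = product_summary; /* bucket to assign */
run;
proc format cntlin=loan_cntl;
run;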
For a mapping exercise like this, the best technique is to use a mapping table. This is so the mappings can be changed without changing code, among other reasons.
A simple example is shown below:
/* create test data */
data desc (drop=x);
    do x=1 to 3;
        loan_desc='home_loan'; output;
        loan_desc='auto_loan'; output;
        loan_desc='Multi'; output;
        loan_desc=''; output;
    end;
run;

data map;
    length loan_desc $9 product_summary $13; /* avoid truncating 'Multi options' */
    loan_desc='home_loan'; product_summary='Loan'; output;
    loan_desc='auto_loan'; product_summary='Loan'; output;
    loan_desc='Multi'; product_summary='Multi options'; output;
run;
/* perform join */
proc sql;
    create table des as
        select a.*
             , coalescec(b.product_summary, 'UNMAPPED') as product_summary
        from desc a
        left join map b
            on a.loan_desc = b.loan_desc;
quit;
There is no need to use the macro language for this task (I have updated the question tag accordingly).
Good solutions have already been proposed (I like @Reeza's PROC FORMAT solution), but here's another route that also minimizes coding.
Generate sample data
data have;
    loan_desc="home_loan"; output;
    loan_desc="auto_loan"; output;
    loan_desc="Multi"; output;
    loan_desc=""; output;
run;
Using PROC SQL's case expression
To my knowledge, this form doesn't allow several criteria on a single WHEN clause, but it really simplifies coding, since the resulting variable's name needs to be written only once.
proc sql;
    create table want as
        select
            loan_desc,
            case loan_desc
                when "home_loan" then "Loan"
                when "auto_loan" then "Loan"
                when "Multi" then "Multi options"
                else "Unknown"
            end as product_summary
        from have;
quit;
Otherwise, using the following syntax is also possible, giving the same results:
proc sql;
    create table want as
        select
            loan_desc,
            case
                when loan_desc in ("home_loan", "auto_loan") then "Loan"
                when loan_desc = "Multi" then "Multi options"
                else "Unknown"
            end as product_summary
        from have;
quit;
I'm trying to replace PROC SQL and regular merges with hash objects wherever possible.
Sample data look like this:
data TABLE1;
    input Date :date9. Property :$6. Headcount;
    format Date date9.;
    datalines;
01Jul2013 East 100
02Jul2013 East 50
02Jul2013 West 50
;
run;
data TABLE2;
    input Date :date9. Property :$6. Headcount;
    format Date date9.;
    datalines;
11Aug2013 East 60
02Oct2013 East 50
22Dec2013 West 40
;
run;
Both data sets are already sorted by Date and Property. Currently I do it via:
data WANT;
    set TABLE1 TABLE2;
run;
But the problem is that the total number of records in the two tables is quite large, and the code above takes 20 minutes or more to finish the concatenation.
I do know how to use a hash object to obtain an outer join result, but how can I use one for this purpose?
Do you subsequently use the WANT dataset in other steps (DATA or PROC), e.g. to summarize or subset it down?
If so, you can reduce the I/O by specifying WANT as a view instead of a table.
data want /view=want;
    set table1 table2;
run;

/* Then use `want` elsewhere... */
proc summary data=want ... ;
    ...
run;
BUT... if you use want several times, it may still be more efficient (in terms of time or I/O) to build it as a table.
You are unlikely to get much of a performance gain from using a hash object in this scenario. The main benefit of using hash objects is that they allow you to merge on values from one or more small datasets onto a larger dataset without having to sort the large dataset. In this scenario:
Both of your datasets are large
You aren't doing any merging
Appending via hash iterators is possible if you're really keen, but I wouldn't bother. As other users have suggested, appending is the way to go here, as it will reduce the I/O requirements. Look at the documentation for PROC APPEND for more details.
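A minimal sketch with the sample tables above: PROC APPEND reads only TABLE2 and adds its rows to TABLE1 in place, so the large TABLE1 is never rewritten (note that this appends rather than interleaving by the sort keys):
proc append base=table1 data=table2;
run;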
What is the most efficient way to drop a table in SAS?
I have a program that loops and drops a large number of tables, and I'd like to know whether there is a performance difference between PROC SQL and PROC DATASETS for dropping a single table at a time.
Or is there perhaps another way?
If it is reasonable to outsource the work to the OS, that might be fastest. Otherwise, my unscientific observations suggest that DROP TABLE in PROC SQL is fastest. This surprised me, as I expected PROC DATASETS to be faster.
In the code below, I create 4000 dummy data sets and then try deleting them all with different methods. The first is with SQL, and on my system it took about 11 seconds to delete the files.
The next two both use PROC DATASETS. The first generates a DELETE statement for each data set; the second just issues a blanket KILL command to delete everything in the WORK directory (I had expected this technique to be the fastest). Both PROC DATASETS routines reported about 20 seconds to delete all 4000 files.
%macro create;
    proc printto log='null'; run;
    %do i=1 %to 4000;
        data temp&i;
            x=1;
            y="dummy";
            output;
        run;
    %end;
    proc printto; run;
%mend;

%macro delsql;
    proc sql;
        %do i=1 %to 4000;
            drop table temp&i;
        %end;
    quit;
%mend;

%macro deldata1;
    proc datasets library=work nolist;
        %do i=1 %to 4000;
            delete temp&i.;
        %end;
    run; quit;
%mend;

%macro deldata2;
    proc datasets library=work kill;
    run; quit;
%mend;

options fullstimer;
%create;
%delsql;
%create;
%deldata1;
%create;
%deldata2;
I tried to fiddle with the OS-delete approach.
Deleting with the X command cannot be recommended; it took forever!
I then tried the SYSTEM function in a DATA step:
%macro delos;
    data _null_;
        do i=1 to 9;
            delcmd="rm -f "!!trim(left(pathname("WORK","L")))!!"/temp"!!trim(left(put(i,4.)))!!"*.sas7*";
            rc=system(delcmd);
        end;
    run;
%mend;
As you can see, I had to split my deletes into 9 separate delete commands. The reason is that I'm using wildcards ("*"), and the underlying operating system (AIX) expands these into a list, which becomes too large for it to handle.
The program basically constructs a delete command for each of the nine filegroups "temp[1-9]*.sas7*" and issues the command.
Using the %create macro from cmjohns' answer to create 4000 data tables, I can delete them in only 5 seconds with this approach.
So a direct operating system delete is the fastest way to mass-delete, as I expected.
Are we discussing tables or datasets?
Tables implies database tables. To get rid of those quickly, the PROC SQL pass-through facility would be fastest: connect to the database once, drop all of the tables, then disconnect.
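For illustration, a sketch of explicit pass-through; the oracle engine, the connection options, and the table names below are all placeholders for your site's values:
proc sql;
    connect to oracle (user=myuser password=XXXXXX path=mydb); /* placeholders */
    execute (drop table temp1) by oracle; /* runs inside the database */
    execute (drop table temp2) by oracle;
    disconnect from oracle;
quit;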
If we are discussing SAS datasets, I would argue that PROC SQL and PROC DATASETS are extremely similar. From an application standpoint, they both go through the same deduction to arrive at a system command that deletes a file. All the testing I have seen from SAS user groups and presentations suggests that the difference between the two methods is marginal and depends on many variables.
If it is imperative that you have the absolute fastest way to drop the datasets / tables, you may just have to test it. Each install and setup of SAS is different enough to warrant testing.
In terms of which is faster, excluding extremely large data, I would wager that there is little difference between them.
When handling permanent SAS datasets, however, I like to use PROC DATASETS rather than PROC SQL, simply because I feel better manipulating permanent datasets using the SAS-designed method rather than the SQL implementation.
Simple Solution for temporary tables that are named similarly:
If all of your tables start with the same prefix, for example p1_table1 and p1_table2, then the following code will delete any table that starts with p1:
proc datasets;
    delete p1: ;
run;
PROC DELETE is another, albeit undocumented, solution.
http://www.sascommunity.org/wiki/PROC_Delete
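For what it's worth, a minimal call looks like this (the dataset names are illustrative):
proc delete data=work.temp1 work.temp2;
run;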