Would overwriting the existing SAS dataset take more time?

I have a short question: if we create a SAS dataset, say Sample.sas7bdat, that already exists, will the code take more time to execute (because the code has to overwrite the existing dataset) than it would if the dataset were not already there?
data sample;
.....
.....
run;
I did some research on the internet but could not find a satisfactory answer. To me it seems like the code should take a little bit of extra time, though I'm not sure how much of an impact it would make on a 10 GB dataset.

You could test this yourself fairly easily. A few caveats:
Make sure you use a large enough dataset that you won't lose the difference amid ordinary random CPU activity. 100+ MB is usually a good target.
Make sure you perform the test multiple times - the more the better, with no time in between if possible. One test will always be insufficient, and the first write will always tend to look faster, because it benefits from write caching (basically, the OS saying it's done writing when it's not, and simply has the write queued up in memory).
Here's an example of my test. This is a 100-million-row dataset with two 8-byte numerics, so 1.6 GB.
First, the results. I see a difference of a few seconds. Why? SAS performs a few operations when replacing a dataset:
Write the new dataset to a temporary file
Delete the old dataset
Rename the temporary file to the target dataset name
On some OSs this seems to be faster than others; I've found Windows desktop to be fairly slow about this, compared to Unix or even Windows Server, which is pretty quick. I'm guessing Windows desktop is more careful about deleting than simply changing a filesystem pointer, but I don't really know. It's certainly not copying the whole file over from the utility directory (there's nowhere near enough time for that). I also suspect write caching is still giving a bit of a boost to the new datasets, particularly as the time for all datasets grows as I write. The real difference is probably only about a second or so - the difference between _REP iteration 2 and _NEW iteration 3 seems the most reasonable estimate to me.
Iteration 1 _NEW=7.26999998099927 _REP=12.9079999922978
Iteration 2 _NEW=10.0119998454974 _REP=11.0789999961998
Iteration 3 _NEW=10.1360001564025 _REP=15.3819999695042
Iteration 4 _NEW=14.7720000743938 _REP=17.4649999142056
Iteration 5 _NEW=16.2560000418961 _REP=19.2009999752044
Notice the first _NEW iteration is far faster than the others, and overall time increases as you go (as the write caching is less and less able to keep up). I suspect that if you let it continue (or used a still larger file, which I don't have time for right now) you would see even more consistent times. I'm also not sure what happens with write caching when a write-cached file is deleted; it's possible SAS has to wait for the cache to flush to disk before performing the delete, or something similar. You could test this by waiting 30 seconds between _NEW and _REP.
The code:
%macro test_me(iter=1);
  %do _i=1 %to &iter.;
    %let start = %sysfunc(time());

    /* first write: the dataset does not exist yet (_NEW) */
    data test&_i.;
      do x = 1 to 1e8;
        y = x**2;
        output;
      end;
    run;

    %let mid = %sysfunc(time());

    /* second write: replaces the dataset just created (_REP) */
    data test&_i.;
      do x = 1 to 1e8;
        y = x**2;
        output;
      end;
    run;

    %let end = %sysfunc(time());
    %let _new = %sysevalf(&mid.-&start.);
    %let _rep = %sysevalf(&end.-&mid.);
    %put Iteration &_i. &=_new. &=_rep.;
  %end;

  proc datasets nolist kill;
  quit;
%mend test_me;

options nosource nonotes nomprint nosymbolgen;
%test_me(iter=5);
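If you want to test that write-caching theory, you can insert a pause between the two data steps. A minimal sketch of how I'd do it (my addition, not part of the original test), using the SLEEP function with its unit argument set to seconds - I believe that form is portable across OSs - and re-taking the &mid. timestamp afterwards so the wait doesn't count against _REP:

/* hypothetical pause between the _NEW and _REP steps */
data _null_;
  rc = sleep(30, 1);  /* SLEEP(n, 1) waits n seconds (unit argument = 1 second) */
run;
%let mid = %sysfunc(time());  /* restart the clock after the pause */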

There are more file operations involved when you are overwriting. After creating the table, SAS deletes the old table and renames the new one into place. In my tests this took about 0.2 seconds of extra time.

In a brief test, my 800 MB dataset took 4 seconds to create new and 10-15 seconds to overwrite. I'm assuming this is because SAS has to preserve the existing dataset until the data step finishes executing, so as to preserve data integrity. That's why you might get the following message in the log:
WARNING: Data set dset was not replaced because this step was stopped.
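As an aside, you can reproduce that warning deliberately by stopping a step that would replace a dataset; a minimal sketch (my construction, not from the original post - I believe ABORT CANCEL discards the partially built dataset and leaves the original in place):

data dset;
  x = 1;
run;

data dset;
  set dset;
  abort cancel;  /* stop the step; the original dset should be left untouched */
run;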
Overwrite test
NOTE: The data set WORK.SAMPLE has 100000000 observations and 1 variables.
NOTE: DATA statement used (Total process time):
real time 10.06 seconds
user cpu time 3.08 seconds
system cpu time 1.48 seconds
memory 1506.46k
OS Memory 26268.00k
Timestamp 08/12/2014 11:43:06 AM
Step Count 42 Switch Count 38
Page Faults 0
Page Reclaims 155
Page Swaps 0
Voluntary Context Switches 190
Involuntary Context Switches 288
Block Input Operations 0
Block Output Operations 1588496
New data test
NOTE: The data set WORK.SAMPLE1 has 100000000 observations and 1 variables.
NOTE: DATA statement used (Total process time):
real time 3.94 seconds
user cpu time 3.14 seconds
system cpu time 0.80 seconds
memory 1482.18k
OS Memory 26268.00k
Timestamp 08/12/2014 11:43:10 AM
Step Count 43 Switch Count 38
Page Faults 0
Page Reclaims 112
Page Swaps 0
Voluntary Context Switches 99
Involuntary Context Switches 294
Block Input Operations 0
Block Output Operations 1587464
The only difference between the log messages is the real time, which to me indicates that SAS is spending the extra time on filesystem operations on the dataset files.
N.B. I tested this on SAS(R) Proprietary Software Release 9.4 TS1M2, running through SAS Studio online. I think it's a Linux operating system; results could vary depending on your operating system.


SAS: Adding aggregated data to same dataset

I'm migrating from SPSS to SAS.
I need to compute the sum of variable varX, separately by groups of variables varA varB, and add it as a new variable SUMvarX to the same dataset.
In SPSS this is implemented easily with aggregate:
aggregate outfile=* mode=addvariables
  /break=varA varB
  /SUMvarX=sum(varX).
Can this be done in SAS?
There are a number of ways to do this, but the best way depends on your data.
For a typical use case, the PROC MEANS solution is what I'd recommend. It's not the fastest, but it gets the job done, and it has much lower opportunity for error - you're not really doing anything except match-merging afterwards.
Use the class statement instead of by in most cases; it shouldn't make much of a difference, but grouping is the purpose of class. by runs the analysis separately for each value of those variables, while class runs one analysis grouped by all of those variables. class is more flexible and doesn't require a sorted dataset (though you'd have to sort anyway for the later merge). class also lets you request multiple combinations - not just the nway combination you ask for here; if you want the data grouped just by a, just by b, and by a*b, you can get that (with class and types).
proc means data=have noprint nway;
  class a b;
  var x;
  /* name the sum explicitly so it doesn't overwrite x during the merge */
  output out=summary(drop=_type_ _freq_) sum(x)=sum_x;
run;

data want;
  merge have summary;
  by a b;
run;
The DoW loop covered in Kermit's answer is a reasonable data step option as well, though riskier in terms of programmer error; I'd use it only in particular cases where the dataset is very, very large - larger than fits in memory even in summarized form - and performance is important.
If the data fits in memory, you can also use a hash table to do the summary, and that's what I'd do if the summary dataset fit comfortably in memory. A full treatment is too long for an answer here, but Data Aggregation using Hash Object is a good start on how to do it. Basically, you use a hash table to store the results of the summary (not the raw data), adding to it with each row, and then output the hash table at the end. It's a bit faster than the DoW loop, but somewhat memory constrained (although if you were using SPSS, you were far more memory constrained than this!). It also makes handling multiple combinations very easy.
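Here is a minimal sketch of that approach, assuming the same have dataset with variables a, b, and x as in the PROC MEANS example above (the structure is the standard two-pass hash pattern; the names are mine, for illustration):

data want;
  if _n_ = 1 then do;
    declare hash h();
    h.defineKey('a', 'b');
    h.defineData('sum_x');
    h.defineDone();
    /* first pass: accumulate the sum for each (a, b) group */
    do until (done);
      set have end=done;
      if h.find() ne 0 then sum_x = 0;  /* first time we see this group */
      sum_x = sum(sum_x, x);
      h.replace();                      /* store the running total */
    end;
  end;
  /* second pass: look up and attach the group total to each row */
  set have;
  rc = h.find();
  drop rc;
run;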
Another "programmer easy" way to do it is with SQL.
proc sql;
  create table want as
  select *, sum(x) as sum_x
  from have
  group by a, b
  ;
quit;
This is not standard SQL, but SAS handles it - basically it does the two-step process of the PROC MEANS and the merge in one step. I like this in some ways (it skips the intermediate dataset - even though it actually does create that dataset in the utility folder, it cleans up for you automatically) and dislike it in others (it's not standard SQL, so it will confuse people, and it leaves a note in the log - only a note, so not a big deal, but still).
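For reference, the note it leaves is:
NOTE: The query requires remerging summary statistics back with the original data.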
Adding a note about SPSS -> SAS thinking. One of the bigger differences you'll see going from SPSS to SAS is that, in SPSS, you have one dataset, and you do stuff to it (mostly). You could save it as a different dataset, but you mostly don't until the end - all of your work really is just editing one dataset, in memory.
In SAS, you read datasets from disk and do stuff and then write them out, and if you're doing anything that is at the dataset level (like a summary), you mostly will do it separately and then recombine with the data in a later step. As such, it's very, very common to have lots of datasets - a program I just ran probably has a thousand. Not kidding! Don't worry about random temporary datasets being produced - it doesn't mean your code is not efficient. It's just how SAS works. There are times where you do have to be careful about it - like you have 150GB datasets or something - but if you're working with 5000 rows with 150 variables, your dataset is so small you could write it a thousand times without noticing a meaningful difference to your code execution time.
The big benefit to this style is that you have different datasets for each step, so if you go back and want to rerun part of your code, you can safely - knowing the predecessor dataset still exists, without having to rerun all of your code. It also lets you debug really easily since you can see each of the component parts.
It's a tradeoff for sure, because it does mean the code takes a little longer to run; but modern CPUs are really, really fast, and so are SSDs - it's just not necessary to write code that stays all in one data step or runs entirely in memory. What you get in exchange is the ability to do crazy large amounts of work that couldn't possibly fit in memory and to handle massive datasets, constrained only by disk, which is usually in far greater supply. It's a tradeoff worth making in many cases. When it's possible to do something in a PROC, do so, even when that means it costs a tiny bit of time at the end to re-merge - the PROCs are what you're paying SAS the big bucks for; they're easy to use, well tested, and fast at what they do.
OK, I think I found a way of doing it.
First, you produce the summary variable:
proc means data=<dataset> noprint nway;
  by varA varB;
  var varX;
  output out=<TEMPdataset> sum=SUMvarX;
run;
then you merge the two datasets:
DATA <dataset>;
  MERGE <TEMPdataset> <dataset>;
  BY varA varB;
run;
This seems to work, although an extra dataset and several extra variables are formed in the process.
There are probably more efficient ways of doing it...
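The extra variables are the automatic _TYPE_ and _FREQ_ variables that PROC MEANS adds to its output dataset; if they bother you, you can drop them during the merge, e.g.:

DATA <dataset>;
  MERGE <TEMPdataset> (drop=_TYPE_ _FREQ_) <dataset>;
  BY varA varB;
run;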
Ever heard of DoW Loop?
*-- Create synthetic data --*
data have;
varA=2; varB=4; varX=21; output;
varA=4; varB=6; varX=32; output;
varA=5; varB=8; varX=83; output;
varA=4; varB=3; varX=78; output;
varA=4; varB=8; varX=72; output;
varA=2; varB=4; varX=72; output;
run;
proc sort data=have; by varA varB; run;
varA varB varX
2 4 21
2 4 72
4 3 78
4 6 32
4 8 72
5 8 83
data stage1;
  set have;
  by varA varB;
  if first.varB then group_number + 1;  /* sequential group id per (varA, varB) */
run;

data want;
  /* first DoW loop: read the whole group, accumulating the sum */
  do _n_ = 1 by 1 until (last.group_number);
    set stage1;
    by group_number;
    SUMvarX = sum(SUMvarX, varX);
  end;
  /* second DoW loop: re-read the same group and output each row */
  do until (last.group_number);
    set stage1;
    by group_number;
    output;
  end;
  drop group_number;
run;
varA varB varX SUMvarX
2 4 21 93
2 4 72 93
4 3 78 78
4 6 32 32
4 8 72 72
5 8 83 83
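As an aside, the stage1 step isn't strictly required: the same double DoW can be driven directly by the original BY variables, since last.varB marks the end of each (varA, varB) group. A sketch of that variant:

data want;
  /* first loop: accumulate the group sum */
  do until (last.varB);
    set have;
    by varA varB;
    SUMvarX = sum(SUMvarX, varX);
  end;
  /* second loop: re-read the group and output each row with the sum */
  do until (last.varB);
    set have;
    by varA varB;
    output;
  end;
run;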

proc surveyselect sample defined versus sample received

I am using the following code
proc surveyselect data=tmp method=urs sampsize=500 seed=100 out=out_tmp;
run;
However, when I look at the log I am getting 491 records. My tmp dataset has 30,000 records. I need help understanding why the 9 records are getting dropped. I played around with changing the seed value and I get around 470 to 495 records per random seed, but never exactly 500. I referred to the documentation, and the URS option means "unrestricted random sampling, which is selection with equal probability and with replacement". Equal probability has no bearing here; "with replacement", I understand, means a record could be present more than once, which is what I am aiming for.
What I do not understand is why the drawn sample stops at a number less than the 500 I specified.
Thanks for the help.
The issue is that you're not quite understanding how URS works - I recommend a read through the documentation.
Take this (extreme) example:
proc surveyselect data=sashelp.cars method=urs out=sample_cars sampsize=10000 seed=100;
run;
NOTE: The sample size, 10000, is greater than the number of sampling units, 428.
NOTE: The data set WORK.SAMPLE_CARS has 428 observations and 16 variables.
NOTE: PROCEDURE SURVEYSELECT used (Total process time):
real time 0.02 seconds
cpu time 0.03 seconds
Here I ask for 10,000 (out of 428 total records!) and get... 428 records. The important detail to pay attention to is the NumberHits variable, which records how many times each record was sampled.
If you want one record output for each hit - meaning you want those duplicates - you can add the OUTHITS option to your PROC SURVEYSELECT statement. From the documentation on URS:
For unrestricted random sampling, by default, the output data set contains a single copy of each unit selected, even when a unit is selected more than once, and the variable NumberHits records the number of hits (selections) for each unit. If you specify the OUTHITS option, the output data set contains m copies of a sampling unit for which NumberHits is m; for example, the output data set contains three copies of a sampling unit that is selected three times (NumberHits is three). For information about the contents of the output data set, see the section Sample Output Data Set.
Here is my example modified to do just that.
proc surveyselect data=sashelp.cars method=urs out=sample_cars sampsize=10000 seed=100 outhits;
run;
NOTE: The sample size, 10000, is greater than the number of sampling units, 428.
NOTE: The data set WORK.SAMPLE_CARS has 10000 observations and 16 variables.
NOTE: PROCEDURE SURVEYSELECT used (Total process time):
real time 0.01 seconds
cpu time 0.01 seconds
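Either way, the hits account for the full requested sample size. A quick sanity check on the first (default, no OUTHITS) output - summing NumberHits should return exactly 10000:

proc sql;
  select sum(NumberHits) as total_hits
  from sample_cars;
quit;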

Capturing NOTE generated by _ERROR_ condition from input statement

Below is a simple representation of my problem. I do not control the data, nor the format applied (this is a backend service for a Stored Process Web App). My goal is to return the error message generated - which in this case is actually a NOTE.
data _null_;
  input x 8.;
  cards;
4 4
;
run;
The above generates:
NOTE: Invalid data for x in line 61 1-8.
RULE:      ----+----1----+----2----+----3----+----4----+----5----+----6----+----7----+----8----+----9----+----0
61         4 4
x=. _ERROR_=1 _N_=1
NOTE: DATA statement used (Total process time):
      real time           0.00 seconds
      cpu time            0.01 seconds
It's easy enough to capture the error status (if _error_ ne 0 then do), but what I'd like to do is return the text of the NOTE - which handily tells us which variable was invalid, along with the line and column numbers.
Is this possible without log scanning? I've tried sysmsg() and syswarningtext to no avail.
AFAIK, there is no feature for capturing the NOTEs a data step generates while it is running.
Since you are in an STP environment, you might use either:
-altlog at session startup, or
a proc printto log=… wrap of the step
and do the log scan on that output.
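A minimal sketch of the PROC PRINTTO wrap, assuming the STP session can write to a temporary fileref (the scan pattern and names are mine, for illustration):

filename steplog temp;

proc printto log=steplog; run;

data _null_;
  input x 8.;
  cards;
4 4
;
run;

proc printto; run;  /* restore the default log */

/* scan the captured log for the NOTE */
data notes;
  length note_text $200;
  infile steplog truncover;
  input;
  if _infile_ =: 'NOTE: Invalid data' then do;
    note_text = _infile_;
    output;
  end;
run;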

SAS Proc SQL Not Exist Query vs Data Step a=1 b=0

Trying to measure performance on two small sets of data in order to determine an efficient execution method for a much larger pair of data sets.
(This test is being done on a dataset with 32 observations and a dataset with 37 observations.)
Both methods give me identical results, with slightly different process times. I have a simple data step:
data check;
  merge d1(in=a) d2(in=b);
  by ssn;
  if a=0 and b=1;
run;
The Data Step method (1st execution) log produced the following -
NOTE: There were 32 observations read from the data set WORK.D1.
NOTE: There were 37 observations read from the data set WORK.D2.
NOTE: The data set WORK.CHECK has 5 observations and 1 variables.
NOTE: DATA statement used (Total process time):
real time 0.01 seconds
cpu time 0.01 seconds
The Proc SQL method (not exists query in our specific case) is below-
proc sql;
  create table chck2 as
  select b.*
  from d2 b
  where not exists (select a.*
                    from d1 a
                    where a.ssn = b.ssn);
quit;
The sql proc prints the following in the log -
NOTE: PROCEDURE SQL used (Total process time):
real time 0.04 seconds
cpu time 0.03 seconds
These methods both yield the same results, creating my final dataset of the same 5 individuals. While the data step processing seems faster (even if only by a fraction of a second), will these performance results ALWAYS hold true? Will the data step method ALWAYS win? What are the key influencing factors here? Does listing the tables in a certain order play a role, or does SAS scan both tables simultaneously?
FYI - I mentioned (1st execution) because I noticed, from the experiment above and general exposure, that if you run data steps repeatedly, SAS processes the subsequent runs faster than the original execution. I'm assuming this has something to do with SAS having memory of previously executed steps...?
You'll never get meaningful performance evaluations from small datasets. Overhead will inevitably swamp any actual performance difference. PROC SQL has a bit of overhead involved in invoking the procedure (a few hundredths of a second), which here is more than the total execution time. Run your test with datasets large enough that it takes minutes to run - that's usually the right balance between tests taking too long and legitimate differences being squashed by overhead and randomness.
As far as which would be faster: if the dataset is sorted, and SAS knows it's sorted, the odds are very good that both approaches will land in the same order of magnitude of time. Data step merge is quite fast, and so is the SQL join.
If it's not sorted, SQL might (and probably would) choose to turn the where-exists into a hash join, which would be much faster than sorting a large dataset. Of course, that requires the dataset to fit into memory. Sorting and then merging in the data step might take the same time as SQL, or it might be slower - or even faster, though I suspect usually not much faster if it requires sorting first. If sorting is the issue, there are faster data step solutions than sort/merge (hash or format); see the sketch below.
As far as the order on the PROC SQL statement: odds are it won't matter if SQL can figure out what you're doing and optimize it. However, it may matter, because SQL may not easily see the optimal path - so one order (usually the large dataset as the main one and the smaller dataset as the subquery) may help SQL find the right approach more easily than the other.
And the reason SAS runs a second or later execution faster is that your OS (or possibly your file system) caches the read, so it doesn't have to re-read the SET file from disk.
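For completeness, here's a sketch of the hash flavor mentioned above, using the question's d1/d2 names (no sorting required, but d1's keys must fit in memory):

data check_hash;
  if _n_ = 1 then do;
    /* load d1's keys into an in-memory hash table */
    declare hash h(dataset: 'd1');
    h.defineKey('ssn');
    h.defineDone();
  end;
  set d2;
  if h.check() ne 0;  /* keep only rows whose ssn is absent from d1 */
run;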

proc transpose using SPDE takes ~60x longer to run than v9 library

I've been moving all of my datasets into SPDE libraries because I've experienced wonderful performance gains in everything - everything, that is, until running PROC TRANSPOSE. It takes ~60x longer to execute on the SPDE dataset than on the same dataset stored in a normal v9 library. The dataset is sorted by item_id and is being read from and written to the same library.
Does anyone have an idea why this is the case? Am I missing something important about SPDE and Proc Transpose not playing well together?
SPDE Library
MPRINT(XMLIMPORT_VANTAGE): proc transpose data = smplus.links_response_mechanism out = smplus.response_mechanism (drop = _NAME_)
prefix = rm_;
MPRINT(XMLIMPORT_VANTAGE): by item_id;
MPRINT(XMLIMPORT_VANTAGE): id lookup_code;
MPRINT(XMLIMPORT_VANTAGE): var x;
MPRINT(XMLIMPORT_VANTAGE): run;
NOTE: There were 5866747 observations read from the data set SMPLUS.LINKS_RESPONSE_MECHANISM.
NOTE: The data set SMPLUS.RESPONSE_MECHANISM has 3209353 observations and 14 variables.
NOTE: Compressing data set SMPLUS.RESPONSE_MECHANISM decreased size by 37.98 percent.
NOTE: PROCEDURE TRANSPOSE used (Total process time):
real time 28:27.63
cpu time 28:34.64
V9 Library
MPRINT(XMLIMPORT_VANTAGE): proc transpose data = mplus.links_response_mechanism out = mplus.response_mechanism (drop = _NAME_)
prefix = rm_;
MPRINT(XMLIMPORT_VANTAGE): by item_id;
MPRINT(XMLIMPORT_VANTAGE): id lookup_code;
MPRINT(XMLIMPORT_VANTAGE): var x;
MPRINT(XMLIMPORT_VANTAGE): run;
NOTE: There were 5866747 observations read from the data set MPLUS.LINKS_RESPONSE_MECHANISM.
NOTE: The data set MPLUS.RESPONSE_MECHANISM has 3209353 observations and 14 variables.
NOTE: Compressing data set MPLUS.RESPONSE_MECHANISM decreased size by 27.60 percent.
Compressed is 32271 pages; un-compressed would require 44572 pages.
NOTE: PROCEDURE TRANSPOSE used (Total process time):
real time 28.76 seconds
cpu time 28.79 seconds
It looks to me like there is some issue with PROC TRANSPOSE and SPDE. Here's a simple SSCCE, which shows significant differences - not as significant as yours, but that may partly be because this is a desktop without particularly substantial performance tuning in the first place. Sounds like a call to SAS tech support is in order.
libname spdelib spde 'c:\temp\SPDE Main'
  datapath=('c:\temp\SPDE Data' 'd:\temp\SPDE Data')
  indexpath=('d:\temp\SPDE Index')
  partsize=512;
libname mainlib 'c:\temp\';

data mainlib.bigdata;
  do ID = 1 to 1500000;
    do _varn = 1 to 10;
      varname = cats("Var_", _varn);
      vardata = ranuni(7);
      output;
    end;
  end;
run;

data spdelib.bigdata;
  do ID = 1 to 1500000;
    do _varn = 1 to 10;
      varname = cats("Var_", _varn);
      vardata = ranuni(7);
      output;
    end;
  end;
run;
*These data steps take roughly the same amount of time, around 30 seconds each;
proc transpose data=spdelib.bigdata out=spdelib.transdata;
  by id;
  id varname;
  var vardata;
run;
*Run a few times, this takes around 3 to 4 minutes, with 1.5 minutes CPU time;

proc transpose data=mainlib.bigdata out=mainlib.transdata;
  by id;
  id varname;
  var vardata;
run;
*Run a few times, this takes around 30 to 45 seconds, with 20 seconds CPU time;
There have been known issues with SPDE and PROC COMPARE in the past (not multi-threading related), at least up to version 4.1. What version are you using? (It can be seen in the !install/logs folder.)
This is definitely something to raise with SAS support. To "speed" things along, I would recommend submitting a log with the following options:
proc setinit noalias; run;
proc options; run;
%put _ALL_;
options fullstimer msglevel=i;
Also:
options spdedebug='DA_TRACEIO_OCR CJNL=Trace.txt';
(The CJNL option simply routes the trace message output to a text file)
In the meantime, you may be able to take advantage of some of the SPDE-specific options described here:
http://support.sas.com/kb/11/349.html
This issue usually occurs when PROC TRANSPOSE is used with BY-group processing on compressed datasets. SAS is forced to read the same block of rows repeatedly, decompressing it every time, until all the records have been processed.
Set the COMPRESS=NO option and it will work. See the logs below: one run has COMPRESS=YES and the other COMPRESS=NO; the former took 57 minutes versus roughly half a minute for the latter.
OPTIONS COMPRESS=YES;
50 **tranpose from spde to spde;
51 proc transpose data=spdelib.balancewalkoutput out=spdelib.spdelib_to_spdelib;
52 var metric ;
53 by balancewalk facility_id isretained isexisting isicaapnpl monthofmaturity vintage;
54 run;
NOTE: There were 10000000 observations read from the data set SPDELIB.BALANCEWALKOUTPUT.
NOTE: The data set SPDELIB.SPDELIB_TO_SPDELIB has 160981 observations and 74 variables.
NOTE: Compressing data set SPDELIB.SPDELIB_TO_SPDELIB decreased size by 69.96 percent.
NOTE: PROCEDURE TRANSPOSE used (Total process time):
real time 56:58.54
user cpu time 52:03.65
system cpu time 4:03.00
memory 19028.75k
OS Memory 34208.00k
Timestamp 09/16/2019 06:19:55 PM
Step Count 9 Switch Count 22476
Page Faults 0
Page Reclaims 4056
Page Swaps 0
Voluntary Context Switches 142316
Involuntary Context Switches 5726
Block Input Operations 88
Block Output Operations 569200
OPTIONS COMPRESS=NO;
50 **tranpose from spde to spde;
51 proc transpose data=spdelib.balancewalkoutput out=spdelib.spdelib_to_spdelib;
52 var metric ;
53 by balancewalk facility_id isretained isexisting isicaapnpl monthofmaturity vintage;
54 run;
NOTE: There were 10000000 observations read from the data set SPDELIB.BALANCEWALKOUTPUT.
NOTE: The data set SPDELIB.SPDELIB_TO_SPDELIB has 160981 observations and 74 variables.
NOTE: PROCEDURE TRANSPOSE used (Total process time):
real time 26.73 seconds
user cpu time 14.52 seconds
system cpu time 11.99 seconds
memory 13016.71k
OS Memory 27556.00k
Timestamp 09/16/2019 04:13:06 PM
Step Count 9 Switch Count 24827
Page Faults 0
Page Reclaims 2662
Page Swaps 0
Voluntary Context Switches 162653
Involuntary Context Switches 1678
Block Input Operations 96
Block Output Operations 1510040
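If you need compression elsewhere but want this step to be fast, one workaround consistent with the diagnosis above is to rebuild the input uncompressed just for the transpose; a sketch (the _nc dataset name is mine):

/* rebuild the input uncompressed so BY-group processing doesn't
   repeatedly decompress the same blocks */
data spdelib.balancewalkoutput_nc (compress=no);
  set spdelib.balancewalkoutput;
run;

proc transpose data=spdelib.balancewalkoutput_nc
               out=spdelib.spdelib_to_spdelib (compress=no);
  var metric;
  by balancewalk facility_id isretained isexisting isicaapnpl monthofmaturity vintage;
run;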