proc surveyselect sample defined versus sample received - SAS

I am using the following code
proc surveyselect data = tmp method = urs sampsize = 500 seed = 100 out = out_tmp; run;
However, when I look at the log I am getting 491 records. My tmp dataset has 30,000 records. I need help understanding why those 9 records are being dropped. I played around with the seed value and get around 470 to 495 records per random seed, but never exactly 500. I referred to the documentation: the URS option means "unrestricted random sampling, which is selection with equal probability and with replacement". Equal probability has no bearing here; "with replacement" I understand to mean that a record can be selected more than once, which is what I am aiming for.
What I do not understand is why the drawn sample stops at a number less than the 500 I specified.
Thanks for the help.

The issue is that you don't quite understand how URS works - I recommend a read through the documentation.
Take this (extreme) example:
proc surveyselect data=sashelp.cars method=urs out=sample_cars sampsize=10000 seed=100;
run;
NOTE: The sample size, 10000, is greater than the number of sampling units, 428.
NOTE: The data set WORK.SAMPLE_CARS has 428 observations and 16 variables.
NOTE: PROCEDURE SURVEYSELECT used (Total process time):
real time 0.02 seconds
cpu time 0.03 seconds
Here I ask for 10,000 (out of 428 total records!), and get... 428 records. The important detail to pay attention to is the NumberHits variable. That says how many times each record was sampled.
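For instance, summing NumberHits shows that the hits account for the full requested sample size (a quick sketch using the sample_cars output from above):
proc means data=sample_cars sum;
   var NumberHits;
run;
The sum equals the requested sample size (10,000 here), even though the dataset only has 428 rows.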
If you want one record output for each hit, meaning you want those duplicates, you can add outhits to your PROC SURVEYSELECT statement. From the documentation on URS:
For unrestricted random sampling, by default, the output data set contains a single copy of each unit selected, even when a unit is selected more than once, and the variable NumberHits records the number of hits (selections) for each unit. If you specify the OUTHITS option, the output data set contains m copies of a sampling unit for which NumberHits is m; for example, the output data set contains three copies of a sampling unit that is selected three times (NumberHits is three). For information about the contents of the output data set, see the section Sample Output Data Set.
Here is my example modified to do just that.
proc surveyselect data=sashelp.cars method=urs out=sample_cars sampsize=10000 seed=100 outhits;
run;
NOTE: The sample size, 10000, is greater than the number of sampling units, 428.
NOTE: The data set WORK.SAMPLE_CARS has 10000 observations and 16 variables.
NOTE: PROCEDURE SURVEYSELECT used (Total process time):
real time 0.01 seconds
cpu time 0.01 seconds
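Applied to your original code, that would look like this (same dataset names as in your question):
proc surveyselect data = tmp method = urs sampsize = 500 seed = 100 out = out_tmp outhits; run;
With OUTHITS, out_tmp will contain exactly 500 observations, with a separate row for each time a record is hit.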

Related

SAS Regression Returns 0 Coefficient

I am running a SAS regression using the following model:
ods output ParameterEstimates=stock_params;
proc reg data=REG_DATA;
by SYMBOL DATE;
model RETURN_SEC = market_premium;
run;
ods output close;
Where RETURN_SEC is the per-second return of the stock and market_premium is the return of the SPY index minus the risk-free rate (the risk-free rate is quite close to zero because it is at the second level).
However, I get lots of 0s (not all, but a significant number) in the coefficient of market_premium. When I check the log it says:
NOTE: Model is not full rank. Least-squares solutions for the parameters are
not unique. Some statistics will be misleading. A reported DF of 0 or B
means that the estimate is biased.
NOTE: The following parameters have been set to 0, since the variables are a
linear combination of other variables as shown.
market_premium = - 19E-12 * Intercept
This is quite weird. I checked the data and it seems fine (although a lot of the data has 0 for RETURN_SEC, which is normal because sometimes the return doesn't change from second to second, only from minute to minute).
What also puzzles me is why SAS would return a 0 coefficient on market_premium when market_premium = -19E-12 * Intercept. I mean, does SAS treat the Intercept as the only variable when it sees that market_premium is a scalar multiple of the Intercept?
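One way to check this (a sketch, assuming the same dataset and variable names as above): if market_premium is constant within a SYMBOL-DATE BY group, it is collinear with the intercept, which would produce exactly that note. A quick diagnostic:
proc means data=REG_DATA noprint;
   by SYMBOL DATE;
   var market_premium;
   output out=mp_check(drop=_type_ _freq_) std=mp_std;
run;

/* BY groups where market_premium barely varies (std is zero, tiny, or missing) */
proc print data=mp_check;
   where mp_std < 1e-12;
run;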

"BY variables are not properly sorted" error although it was sorted already

I am using SAS with a large dataset (>20 GB). When I run a DATA step, I receive the "BY variables are not properly sorted" error, although I sorted the dataset by the same variables. When I ran the PROC SORT again, SAS even said "Input dataset is already sorted, no sorting done".
My code is:
proc sort data=output.TAQ;
by market ric date miliseconds descending type order;
run;
options nomprint;
data markers (keep=market ric date miliseconds type order);
set output.TAQ;
by market ric date;
if first.date;
* ie do the following once per stock-day;
* Make 1-second markers;
/*Type="AMARK"; Order=0; * Set order to zero to ensure that markers get placed before trades and quotes that occur at the same milisecond;
do i=((9*60*60)+(30*60)) to (16*60*60); miliseconds=i*1000; output; end;*/
run;
And the error message was:
ERROR: BY variables are not properly sorted on data set OUTPUT.TAQ.
RIC=CXR.CCP Date=20160914 Time=13:47:18.125 Type=Quote Price=. Volume=. BidPrice=9.03 BidSize=400
AskPrice=9.04 AskSize=100 Qualifiers= order=116458952 Miliseconds=49638125 exchange=CCP market=1
FIRST.market=0 LAST.market=0 FIRST.RIC=0 LAST.RIC=0 FIRST.Date=0 LAST.Date=1 i=. _ERROR_=1
_N_=43297873
NOTE: The SAS System stopped processing this step because of errors.
NOTE: There were 43297874 observations read from the data set OUTPUT.TAQ.
WARNING: The data set WORK.MARKERS may be incomplete. When this step was stopped there were
56770826 observations and 6 variables.
WARNING: Data set WORK.MARKERS was not replaced because this step was stopped.
NOTE: DATA statement used (Total process time):
real time 1:14.21
cpu time 26.71 seconds
The error is occurring deep into your data step, at _N_=43297873. That suggests to me that the PROC SORT is working up to a point, but then fails. It is hard to know what the reason is without knowing your SAS environment or how OUTPUT.TAQ is stored.
Some people have reported resource problems or file system limitations when sorting large data sets.
From SAS FAQ: Sorting Very Large Datasets with SAS (not an official source):
When sorting in a WORK folder, you must have free storage equal to 4x the size of the data set (or 5x if under Unix)
You may be running out of RAM
You may be able to use the MSGLEVEL=I and FULLSTIMER system options to get a fuller picture.
Also, options sastraceloc=saslog; can produce helpful messages.
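For example (a minimal sketch; set these before running the sort):
options msglevel=i fullstimer;   /* more detailed notes and full resource statistics in the log */
options sastraceloc=saslog;      /* direct any trace output to the SAS log */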
Maybe instead of sorting it, you could break it up into a few steps, something like:
/* Get your market ~ ric ~ date pairs */
proc sql;
create table market_ric_date as
select distinct market, ric, date
from output.TAQ
/* Possibly an order by clause here on market, ric, date */
; quit;
data millisecond_stuff;
set market_ric_date;
*Possibly add type/order in this step as well?;
do i=((9*60*60)+(30*60)) to (16*60*60); miliseconds=i*1000; output; end;
run;
/* Possibly a third step here to add type / order if you need to get from original data source */
If your source dataset is in a database, it may be sorted in a different collation.
Try the following before your sort:
options sortpgm=sas;
I had the same error, and the solution was to make a copy of the original table in the WORK library, do the sort there, and then the BY processing worked.
In your case something like below:
data tmp_TAQ;
set output.TAQ;
run;
proc sort data=tmp_TAQ;
by market ric date miliseconds descending type order;
run;
data markers (keep=market ric date miliseconds type order);
set tmp_TAQ;
by market ric date;
if first.date;
* ie do the following once per stock-day;
* Make 1-second markers;
/*Type="AMARK"; Order=0; * Set order to zero to ensure that markers get placed before trades and quotes that occur at the same milisecond;
do i=((9*60*60)+(30*60)) to (16*60*60); miliseconds=i*1000; output; end;*/
run;

Offsetting Oversampling in SAS for rare events in Logistic Regression

Can anyone help me understand the pre-model and post-model adjustments for oversampling using the offset method in logistic regression (preferably in Base SAS, with PROC LOGISTIC and scoring)?
I will take an example. Consider a traditional credit scoring model for a bank: let's say we have 50,000 good customers and 2,000 bad customers. For my logistic regression I am using all 2,000 bad customers and a random sample of 2,000 good customers. How can I adjust for this oversampling in PROC LOGISTIC using options like OFFSET, and also during scoring? Do you have any references with illustrations on this topic?
Thanks in advance for your help!
Ok here are my 2 cents.
Sometimes, the target variable is a rare event, like fraud. In this case, using logistic regression will have significant sample bias due to insufficient event data. Oversampling is a common method due to its simplicity.
However, model calibration is required when the scores are used for decisions (which is your case); nothing needs to be done if the model is only used for rank ordering (bear in mind the probabilities will be inflated, but the order stays the same).
Parameter and odds ratio estimates of the covariates (and their confidence limits) are unaffected by this type of sampling (or oversampling), so no weighting is needed. However, the intercept estimate is affected by the sampling, so any computation that is based on the full set of parameter estimates is incorrect.
Suppose the true model is ln(y/(1-y)) = b0 + b1*x. When oversampling, the estimate b1' is consistent with the true model; however, b0' is not equal to b0.
There are generally two ways to do that:
weighted logistic regression,
simply adding an offset.
I am going to explain the offset version only as per your question.
Let's create some dummy data where the true relationship between your DV (y) and your IV (iv) is ln(y/(1-y)) = -6 + 2*iv:
data dummy_data;
do j=1 to 1000;
iv=rannor(10000); *independent variable;
p=1/(1+exp(-(-6+2*iv))); * event probability;
y=ranbin(10000,1,p); * dependent variable 1/0;
drop j;
output;
end;
run;
and let’s see your event rate:
proc freq data=dummy_data;
tables y;
run;
Cumulative Cumulative
y Frequency Percent Frequency Percent
------------------------------------------------------
0 979 97.90 979 97.90
1 21 2.10 1000 100.00
Similar to your problem, the event rate is p = 0.0210 - in other words, very rare.
Let's use PROC LOGISTIC to estimate the parameters:
proc logistic data=dummy_data;
model y(event="1")=iv;
run;
Standard Wald
Parameter DF Estimate Error Chi-Square Pr > ChiSq
Intercept 1 -5.4337 0.4874 124.3027 <.0001
iv 1 1.8356 0.2776 43.7116 <.0001
The logistic result is quite close to the real model; however, as you already know, the usual assumptions will not hold with so few events.
Now let's oversample the original dataset by selecting all event cases and sampling non-event cases with probability 1/20:
data oversampling;
set dummy_data;
if y=1 then output;
if y=0 then do;
if ranuni(10000)<1/20 then output;
end;
run;
proc freq data=oversampling;
tables y;
run;
Cumulative Cumulative
y Frequency Percent Frequency Percent
------------------------------------------------------
0 54 72.00 54 72.00
1 21 28.00 75 100.00
Your event rate has jumped (magically) from 2.1% to 28%. Let’s run proc logistic again.
proc logistic data=oversampling;
model y(event="1")=iv;
run;
Standard Wald
Parameter DF Estimate Error Chi-Square Pr > ChiSq
Intercept 1 -2.9836 0.6982 18.2622 <.0001
iv 1 2.0068 0.5139 15.2519 <.0001
As you can see, the iv estimate is still close to the real value, but the intercept has changed from -5.43 to -2.98, which is very different from our true value of -6.
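As a quick sanity check: when all events are kept and non-events are subsampled at rate r, the fitted intercept shifts by roughly log(1/r), so here we would expect about -6 + log(20) = -6 + 3.0 = -3.0, which lines up with the -2.98 estimate above.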
Here is where the offset plays its part. The offset is the log of the ratio of the sample odds of the event to the population odds, and it adjusts the intercept so that predictions reflect the true distribution of events rather than the sample distribution (the oversampled dataset).
Offset = log((0.28/(1-0.28)) * ((1-0.0210)/0.0210)) = 2.897548
So your adjusted intercept will be -2.9836 - 2.897548 = -5.88115, which is quite close to the real value.
Or using the offset option in proc logistic:
data oversampling_with_offset;
set oversampling;
off= log((0.28/(1-0.28))*((1-0.0210)/0.0210)) ;
run;
proc logistic data=oversampling_with_offset;
model y(event="1")=iv / offset=off;
run;
Standard Wald
Parameter DF Estimate Error Chi-Square Pr > ChiSq
Intercept 1 -5.8811 0.6982 70.9582 <.0001
iv 1 2.0068 0.5138 15.2518 <.0001
off 1 1.0000 0 . .
From here all your estimates are correctly adjusted and analysis & interpretation should be carried out as normal.
Hope it helps.
This is a great explanation.
When you oversample or undersample for a rare event, the intercept is impacted, not the slope. Hence, in the final model you just need to adjust the intercept by adding the OFFSET= option to the MODEL statement in PROC LOGISTIC. Probabilities are impacted by oversampling but, again, ranking is not impacted, as explained above.
If your aim is to score your data into deciles, you do not need the offset adjustment: you can rank the observations based on the probabilities from the oversampled model and put them into deciles (using PROC RANK as normal; see the sketch below). However, the actual probability scores are impacted, so you cannot use the raw probability values. The ROC curve is not impacted either.
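A minimal sketch of that (reusing the oversampling dataset from the answer above purely for illustration; scored, scored_deciles, and decile are assumed names):
proc logistic data=oversampling;
   model y(event="1") = iv;
   /* SCORE writes the predicted probabilities; P_1 is the probability of y=1 */
   score data=oversampling out=scored;
run;

/* groups=10 buckets the scores into deciles (0-9); descending puts the highest probabilities in decile 0 */
proc rank data=scored out=scored_deciles groups=10 descending;
   var P_1;
   ranks decile;
run;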

SAS Proc SQL Not Exist Query vs Data Step a=1 b=0

Trying to measure performance on two small sets of data in order to determine an efficient execution method for a much larger pair of data sets.
(This test is being done on a dataset with 32 observations and a dataset with 37 observations.)
Both methods give me identical results with slightly different process times. I have a simple data step:
data check;
merge d1(in=a) d2(in=b);
by ssn;
if a=0 and b=1;
run;
The Data Step method (1st execution) log produced the following -
NOTE: There were 32 observations read from the data set WORK.D1.
NOTE: There were 37 observations read from the data set WORK.D2.
NOTE: The data set WORK.CHECK has 5 observations and 1 variables.
NOTE: DATA statement used (Total process time):
real time 0.01 seconds
cpu time 0.01 seconds
The Proc SQL method (not exists query in our specific case) is below-
proc sql;
create table chck2 as
select b.* from d2 b
where not exists (select a.* from d1 a
where a.ssn=b.ssn)
;
quit;
The sql proc prints the following in the log -
NOTE: PROCEDURE SQL used (Total process time):
real time 0.04 seconds
cpu time 0.03 seconds
These methods both yield the same results, creating my final data set of the same 5 individuals. While the data step processing seems faster (even if only by a fraction of a second), will these performance results ALWAYS hold true? Will the data step method ALWAYS win? What are the key influencing factors here? Does listing the tables in a certain order play a role, or does SAS scan both tables simultaneously?
FYI - I mentioned (1st execution) because I noticed, from the experiment above and from general exposure, that if you run data steps repeatedly, SAS processes subsequent runs faster than the original execution. I'm assuming this has something to do with SAS having some memory of previously executed steps...?
You'll never get meaningful performance evaluations from small datasets. Overhead will inevitably swamp any actual performance difference. PROC SQL has a bit of overhead involved in invoking the procedure (a few hundredths of a second), which here is more than the total execution time. Run your test with datasets large enough that it takes minutes to run - that's usually the right balance between tests taking too long and legitimate differences being squashed by overhead/randomness.
As far as what would be faster: if the dataset is sorted, and SAS knows it's sorted, then the odds are very good that both processes will take the same order of magnitude of time. The data step merge is quite fast, as is the SQL join.
If it's not sorted, SQL might (and probably would) choose to turn the where-not-exists into a hash join, which would be much faster than sorting a large dataset. Of course, that requires the dataset to fit into memory. Sorting and then merging in the data step might be the same as SQL, or it might be slower - or even faster, though I suspect usually not much faster if it requires sorting first. There are faster solutions in the data step than sort/merge if that's needed (hash or format); see the sketch below.
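As a rough illustration of the hash approach (a sketch only, using the d1, d2, and ssn names from the question): keep the rows of d2 whose ssn never appears in d1, with no sorting required.
data check_hash;
   if _n_ = 1 then do;
      declare hash h(dataset: "d1");
      h.defineKey("ssn");
      h.defineData("ssn");
      h.defineDone();
   end;
   set d2;
   /* find() returns 0 when the ssn exists in d1, so keep only the misses */
   if h.find() ne 0;
run;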
As for the order of the tables in the PROC SQL statement: odds are it won't matter if SQL can figure out what you're doing and optimize it. However, it might, because SQL may not easily see the optimal path, so one order (usually the large dataset as the main one and the smaller dataset in the subquery) may help SQL find the right approach more easily than the other.
And - the reason SAS has a faster time doing a second or later run is that your OS (or possibly your file system) is caching the read, so it doesn't have to re-read the SET file from disk.

Would overwriting the existing SAS dataset take more time?

I have a short question: if we are creating a SAS dataset, say Sample.sas7bdat, which already exists, will the code take more time to execute (because it has to overwrite the existing dataset) than when the dataset was not already there?
data sample;
.....
.....
run;
I did some research on the internet but could not find a satisfactory answer. To me it seems like the code should take a little extra time, though I'm not sure how much of an impact it would make on a 10 GB dataset.
You could test this yourself fairly easily. A few caveats:
Make sure you have a dataset large enough that real differences aren't lost in ordinary random CPU activity. 100+ MB is usually a good target.
Make sure you perform the test multiple times - the more the better, with no time in between if possible. One test is always insufficient and will tend to show the first dataset as faster, because it benefits from write caching (basically, the OS saying it's done writing when it isn't, and simply has the write queued up in memory).
Here's an example of my test. This is a 100 million row dataset with two 8 byte numerics, so 1.6 GB.
First, the results. I see a difference of a few seconds. Why? SAS performs a few operations when replacing a dataset:
Write dataset to temporary file
Delete the old dataset
Rename temporary dataset to new dataset
On some OSs this seems to be faster than others; I've found Windows desktop to be fairly slow about this, compared to Unix or even Windows Server, which is pretty quick. I'm guessing Windows is more careful about deleting than simply changing a file system pointer, but I don't really know. It's certainly not copying the whole file over from the utility directory (there is not nearly enough time for that). I also suspect write caching is still giving a bit of a boost to the new datasets, particularly as the time for all datasets grows as the test runs. The difference is probably only about a second or so - the difference between _REP iteration 2 and _NEW iteration 3 seems the most reasonable to me.
Iteration 1 _NEW=7.26999998099927 _REP=12.9079999922978
Iteration 2 _NEW=10.0119998454974 _REP=11.0789999961998
Iteration 3 _NEW=10.1360001564025 _REP=15.3819999695042
Iteration 4 _NEW=14.7720000743938 _REP=17.4649999142056
Iteration 5 _NEW=16.2560000418961 _REP=19.2009999752044
Notice the first iteration new is far faster than the others, and overall time increases as you go (as the write caching is less and less able to keep up). I suspect if you allow it to continue (or use a still larger file, which I don't have time for right now) you might see even more consistent times. I'm also not sure what happens with write caching when a file that is write cached is deleted; it's possible it has to wait for the write caching to write out to disk before doing the delete op or something similar. You could perform a test where you waited 30 seconds between _NEW and _REP to verify that.
The code:
%macro test_me(iter=1);
    %do _i=1 %to &iter.;
        %let start = %sysfunc(time());

        /* first write: the dataset does not exist yet (_NEW) */
        data test&_i.;
            do x = 1 to 1e8;
                y = x**2;
                output;
            end;
        run;

        %let mid = %sysfunc(time());

        /* second write: replaces the dataset just created (_REP) */
        data test&_i.;
            do x = 1 to 1e8;
                y = x**2;
                output;
            end;
        run;

        %let end = %sysfunc(time());
        %let _new = %sysevalf(&mid.-&start.);
        %let _rep = %sysevalf(&end.-&mid.);
        %put Iteration &_i. &=_new. &=_rep.;
    %end;

    /* delete all datasets in the WORK library */
    proc datasets nolist kill;
    quit;
%mend test_me;
options nosource nonotes nomprint nosymbolgen;
%test_me(iter=5);
There are more file operations involved when you are overwriting. After creating the table, SAS will delete the old table and rename the new one. In my tests this took an extra 0.2 seconds.
In a brief test, my 800 MB dataset took 4 seconds to create new and 10-15 seconds to overwrite. I'm assuming this is because SAS has to preserve the existing dataset until the data step finishes executing, so as to protect data integrity. That's why you might get the following message in the log:
WARNING: Data set dset was not replaced because this step was stopped.
Overwrite test
NOTE: The data set WORK.SAMPLE has 100000000 observations and 1 variables.
NOTE: DATA statement used (Total process time):
real time 10.06 seconds
user cpu time 3.08 seconds
system cpu time 1.48 seconds
memory 1506.46k
OS Memory 26268.00k
Timestamp 08/12/2014 11:43:06 AM
Step Count 42 Switch Count 38
Page Faults 0
Page Reclaims 155
Page Swaps 0
Voluntary Context Switches 190
Involuntary Context Switches 288
Block Input Operations 0
Block Output Operations 1588496
New data test
NOTE: The data set WORK.SAMPLE1 has 100000000 observations and 1 variables.
NOTE: DATA statement used (Total process time):
real time 3.94 seconds
user cpu time 3.14 seconds
system cpu time 0.80 seconds
memory 1482.18k
OS Memory 26268.00k
Timestamp 08/12/2014 11:43:10 AM
Step Count 43 Switch Count 38
Page Faults 0
Page Reclaims 112
Page Swaps 0
Voluntary Context Switches 99
Involuntary Context Switches 294
Block Input Operations 0
Block Output Operations 1587464
The only difference between the log messages is the real time, which to me would indicate SAS is processing filesystem operations on the dataset files.
N.B. I tested this on SAS(R) Proprietary Software Release 9.4 TS1M2, which I'm running through SAS Studio online. I believe it runs on a Linux operating system; results could vary depending on your operating system.