I have a data set with 1100 samples, target class isReturn, there are
800 isReturn='True'
300 isReturn='False'
How can I use PROC SURVEYSELECT to oversample the 300 isReturn='False' so that I will have 800 isReturn='False' and the data set is balanced?
Thanks in advance.
I may not understand what you want, but if you just want to have 800 of the false folks, you could use proc surveyselect or the data step.
The data step would give you granular control. The code below outputs each of your 300 twice, then picks another 200 of them at random (each of the 300 is picked either 0 or 1 times) for a third copy.
data have;
length isReturn $5;
do _n_=1 to 800;
isReturn='True';
output;
if _n_ le 300 then do;
isReturn='False';
output;
end;
end;
run;
data want;
set have;
retain k 200 n 300;
if isReturn='True' then output;
else do;
output;
output;
if ranuni(7) le k/n then do;
output;
k+-1;
end;
n+-1;
end;
run;
You could tweak that pretty easily to get any distribution you want (you could take 500 out of '600' (double 300) for example by setting k and n to 500 and 600 and doing the if bit twice, each time decrementing n once).
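As a rough sketch of that tweak (untested, and the loop counter copy is just a name I made up), taking 500 out of the doubled 300 could look something like this:

```sas
data want;
  set have;
  retain k 500 n 600;           /* k = picks still needed, n = candidates left */
  if isReturn='True' then output;
  else do;
    output;                     /* keep the original record once */
    do copy=1 to 2;             /* each False record is a candidate twice */
      if ranuni(7) le k/n then do;
        output;
        k+-1;
      end;
      n+-1;                     /* one fewer candidate either way */
    end;
  end;
  drop copy;
run;
```

That should leave you with the 300 originals plus 500 extra copies, so each False record shows up between one and three times.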
You could also use proc surveyselect to do this.
proc surveyselect data=have(where=(isReturn='False')) out=want_add method=urs n=500 outhits;
run;
That would give you an extra 500 records, chosen at random with replacement; just add those back to the original dataset. You don't have as granular control but it is very easy to code.
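Adding them back is just a matter of stacking the two datasets; a minimal sketch, assuming the dataset names used above:

```sas
data want;
  set have want_add;   /* original 1100 records plus the 500 resampled False records */
run;
```

Note that surveyselect adds its own bookkeeping variables to want_add, so want will have a few extra columns that are missing for the original rows; drop them if you don't need them.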
Alternately, you could do this in one step. However, this does not guarantee that every original record, True or False, will appear in the result, so it likely doesn't do exactly what you ask for; it is presented for completeness.
data sizes;
input isReturn :$5. _NSIZE_;
datalines;
False 800
True 800
;;;;
run;
proc sort data=have;
by isReturn;
run;
proc surveyselect data=have out=want method=urs n=sizes outhits;
strata isReturn;
run;
All of this assumes you're trying to get 100% of the original dataset plus some. If you're trying to oversample in the sense of picking False records with equal probability to True records, but you are ultimately picking a smaller sample than the total (and only picking each record once, i.e. without replacement), then the strata statement is what you should be looking at.
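For that last case, a sketch might look like the following. The sample size of 200 per class and the seed are arbitrary choices of mine, not something from the question:

```sas
proc sort data=have;
  by isReturn;
run;

/* method=srs samples without replacement; with a strata statement, */
/* n=200 is drawn from each stratum, so you get 200 True and 200 False */
proc surveyselect data=have out=sample method=srs n=200 seed=7;
  strata isReturn;
run;
```

The per-stratum size has to be no larger than the smallest class (300 here), since each record can only be picked once.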
For an if-condition I would like to create a macro variable giving the frequency of the underlying time series. I tried to get some descriptive statistics from PROC TIMESERIES. However, they unfortunately do not include the frequency.
The underlying time series does not necessarily include all periods of the frequency. From my point of view, that rules out a simple count with proc sql.
Does anyone know an efficient procedure to determine the frequency without computing the frequency on my own (in a data step or a proc sql code)?
You can use the outspectra= option to help learn what kind of seasonality it has. Based on the data, give PROC TIMESERIES your best guess of the interval (day, month, etc.). In the example below, we know we want to forecast by month but we do not know what seasonality it has.
proc timeseries data=sashelp.air outspectra=spectra;
id date interval=month;
var air;
run;
Plot this spectra dataset in proc sgplot and you'll see something that looks like this:
proc sgplot data=spectra;
where NOT missing(period);
series x=period y=p;
run;
This line will naturally increase as the period gets longer, but we're looking for bumps in the line. Notice the large bump somewhere between 0 and 24 months and the several smaller bumps before it. Let's zoom in on that by filtering out the longer periods.
proc sgplot data=spectra;
where period < 24 and NOT missing(period);
series x=period y=p;
run;
It's pretty clear that there is a strong seasonality of 12, with potentially smaller cycles at 3 and 6 months. From this spectra plot, we can conclude that our seasonality should be 12.
You can turn this into a macro to help identify the season if you'd like. Simply search for the largest bump within a reasonable timeframe. In our case we'll choose 36 because we do not suspect that we have any seasonality > 36 months.
proc sort data=spectra;
by period;
run;
data identify_period;
set spectra;
by period;
where NOT missing(period) AND period LE 36;
delta = abs(p - lag(p) );
run;
proc sql;
select period, max(delta) as max_delta
from identify_period
having delta = max(delta)
;
quit;
Output:
PERIOD max_delta
12 163712
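Since the question asks for a macro variable, the same query can write its result into one instead of printing it. A minimal sketch, where the macro variable name season is my own choice:

```sas
proc sql noprint;
  select period
    into :season trimmed        /* store the winning period as text */
    from identify_period
    having delta = max(delta);
quit;

%put &=season;
```

If two periods tie for the largest delta, only the first row returned ends up in the macro variable.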
I don't know how to do this without data step logic, but you could wrap the data step in a macro as follows:
%macro get_frequency(data,date_variable,output_variable);
proc sort data=&data (keep=&date_variable) out=__tempsorted;
by &date_variable;
run;
data _null_;
set __tempsorted end=lastobs;
prevdate=lag(&date_variable);
if _n_ > 1 then do;
interval_number+1;
interval_total + (&date_variable - prevdate);
end;
if lastobs then do;
average_interval = interval_total/interval_number;
frequency = round(365.25/average_interval);
/* 'G' puts the result in the global symbol table so it still exists after the macro ends */
call symputx("&output_variable",frequency,'G');
end;
run;
proc datasets nolist;
delete __tempsorted;
run;
quit;
%mend get_frequency;
Then you can call the macro on your original data set timeseries to examine the variable date and create a new macro variable frequency1 with the required frequency.
data work.timeseries;
input date date. value;
format date date9.;
datalines;
01Oct18 3000
01Nov18 4000
01Dec18 6500
01Jan19 7000
01Feb19 4000
01Mar19 5000
01Apr19 7500
01May19 4800
01Jun19 4500
;
run;
%get_frequency(timeseries,date,frequency1)
%put &=frequency1;
This seems to work on your sample data, where each date is the first of the month. If your dates are evenly spaced (e.g. always near month start/end, or always near mid-month), then this macro should work OK. Obviously, if you have multiple observations per date, it will give a completely incorrect frequency.
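If duplicate dates are a possibility, one way around it (a sketch only; unique_dates is a made-up dataset name) is to reduce the data to one row per date before calling the macro:

```sas
/* keep a single row per distinct date */
proc sort data=timeseries(keep=date) out=unique_dates nodupkey;
  by date;
run;

%get_frequency(unique_dates,date,frequency1)
```

This only fixes exact duplicates; several different dates falling inside the same period would still skew the average interval.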
I have a data set ranked by some variable.
I need to take the first 1000 observations and count how many of them have field1=1, then do the same for the next 1000 observations, and so on.
How can I do it?
I hope I understand correctly what you want.
You could try a datastep like this
data result (Keep=countob obnr);
retain obnr 1000;
retain countob 0;
set mydata;
if field1=1 then
countob=countob+1;
if mod(_n_,1000) = 0 then do;
output;
obnr=obnr+1000;
countob=0;
end;
run;
this would lead to a result like this:
obnr | countob
------------
1000 | 247
2000 | 325
3000 | 198
obnr is obviously optional...
Another, slightly shorter way, using the CEIL function and PROC FREQ:
data want;
set have;
thousand=ceil(_N_/1000)*1000;
run;
proc freq data=want;
tables thousand / out=want;
where field1=1;
run;
I'm trying to calculate the grand mean of a subset of observations (e.g., observation 20 to observation 50) in the data step. In this calculation, I also want to skip over (ignore) any missing values.
I've tried to play around with the mean function using various if … then statements, but I can't seem to fit all of it together.
Any help would be much appreciated.
For reference, here's the basic outline of my data steps:
data sas1;
infile '[file path]';
input v1 $ 1-9 v2 $ 11 v3 13-17 [redacted] RPQ 50-53 [redacted] v23 101-106;
v1=translate(v1,"0"," ");
format [redacted];
label [redacted];
run;
data gmean;
set sas1;
id=_N_;
if id = 10-40 then do;
avg = mean(RPQ);
end;
/*Here, I am trying to calculate the grand mean of the RPQ variable*/
/*but only observations 10 to 40, and skipping over missing values*/
run;
Use the automatic variable _N_ to identify the rows. Keep a running sum that is retained from row to row, then divide by the number of non-missing observations at the end. Use the missing() function to decide whether a row counts toward the number of observations and whether to add it to the running total.
data stocks;
set sashelp.stocks;
retain sum_total n_count 0;
if 10<=_n_<=40 and not missing(open) then do;
n_count=n_count+1;
sum_total=sum_total+open;
end;
if _n_ = 40 then average=sum_total/n_count;
run;
proc print data=stocks(obs=40 firstobs=40);
var average;
run;
*check with proc means that the value is correct;
proc means data=sashelp.stocks (firstobs=10 obs=40) mean;
var open;
run;
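As for why the attempt in the question does nothing useful: 10-40 is plain arithmetic, not a range, so the condition compares id to -30 and is never true. A quick way to see that for yourself:

```sas
data _null_;
  x = 10-40;
  put x=;        /* writes x=-30 to the log */
run;
```

On top of that, mean(RPQ) with a single argument only looks at the current row, which is why the retained running total above is needed.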
I need to create 100 copies of a data set (which has 3 variables), but one of the variables needs to be assigned randomly (1 through 1000).
I know I can use 100 data statements, but I don't want to go down that road!
Let's say I have data set A and want to create data sets A1 to A100. I used the following code:
data A1--A100;
set A;
do i=1 to 1000;
var3=int(ranuni(0) * 1000 + 1);
output A1--A1000;
end;
run;
but SAS does not generate anything at all
You can't do it via any shortcut like that. You could use the macro language to create the 1000 dataset names and 1000 output statements.
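If you really do want separate datasets, a macro along these lines would generate the names and the output statements for you. Treat it as a sketch: the macro name make_copies and its parameter n are made up for illustration.

```sas
%macro make_copies(n);
  %local i;
  /* builds: data A1 A2 ... An; */
  data %do i=1 %to &n; A&i %end; ;
    set A;
    /* one fresh random value and one output statement per copy */
    %do i=1 %to &n;
      var3 = int(ranuni(7)*1000 + 1);
      output A&i;
    %end;
  run;
%mend make_copies;

%make_copies(100)
```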
However, more than likely you shouldn't do this. Instead, have one dataset with a BY variable, and then in whatever you're going to do (MCMC or whatever) use that BY variable with the BY statement.
data want;
set have;
do byvar=1 to 1000;
var3 = int(ranuni(7)*1000+1);
output;
end;
run;
Also, don't use ranuni(0). Always use a positive seed (and save it) so you can replicate your results.
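To show what using the BY variable looks like downstream, here is a small sketch with PROC MEANS standing in for whatever analysis you actually run. The data step above writes byvar 1 to 1000 for each input row in turn, so sort by it first:

```sas
proc sort data=want;
  by byvar;
run;

/* one set of results per copy, no separate datasets needed */
proc means data=want mean;
  by byvar;
  var var3;
run;
```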
Here is the answer I ended up with; I hope it helps:
data want;
set have;
do dset=1 to 101;
rand=ranuni(4011120);
if dset=1 then real=1; else real=0;
output;
end;
run;
proc sort data=want;
by dset rand;
run;
data want2;
set permut;
if real=0 then rank= mod(_N_,366);
if real then realrank=rank;
run;
proc sort data=want2;
by dset dayofyear;
run;
Using the following code
data mydata5;
input default$ numofkids$ count;
datalines;
good nochildren 1500
good kids1to2 2200
good kids3plus 300
bad nochildren 500
bad kids1to2 300
bad kids3plus 200
;
run;
I created a dataset
Obs default numofkids count
1 good nochildr 1500
2 good kids1to2 2200
3 good kids3plu 300
4 bad nochildr 500
5 bad kids1to2 300
6 bad kids3plu 200
What I've been trying to get to is something like this
nochildren other
good 1500 2500
bad 500 500
I've tried many things, but nothing has worked so far. I know there must be an easy way without getting into complicated code.
I want to run a data step where I can set mydata5 and create a dataset formatted the way I want it, with minimal coding required.
Could someone please offer some insight on this?
The purpose is then to run a proc freq to get a chisq test done.
I managed to make some progress, but my code does not produce the table I want. However, I am able to do a chisq test nonetheless:
data mydata6;
set mydata5;
if numofkids='nochildren' then Group=1;
else Group=2;
run;
proc freq data=mydata6;
weight count;
tables default*Group/chisq;
run;
data mydata61;
set mydata5;
if numofkids='kids3plu' then Group=1;
else Group=2;
run;
proc freq data=mydata61;
weight count;
tables default*Group/chisq;
run;
Another issue I faced: when I tried to group the data, I had to specify numofkids='kids3plu' instead of the whole string kids3plus. The data did not group if I specified the whole string. Can someone comment on this as well, please?
I would use PROC SUMMARY/MEANS to do the sum and then transpose to create the format you are looking for.
I'm creating new data sets along the way that should help you see how this works.
data mydata5;
length default $4. numofkids $32.;
input default$ numofkids$ count;
datalines;
good nochildren 1500
good kids1to2 2200
good kids3plus 300
bad nochildren 500
bad kids1to2 300
bad kids3plus 200
;
run;
/*Populate a variable for "nochildren" and "other"*/
data mydata6;
set mydata5;
length kids $32.;
if numofkids = "nochildren"
then kids=numofkids;
else
kids = "other";
run;
proc sort data=mydata6;
by default kids;
run;
proc summary data=mydata6;
by default kids;
var count;
output out=mydata7 sum=;
run;
proc transpose data=mydata7 out=mydata8(drop=_name_);
by default;
id kids;
var count;
run;
Produces this:
default nochildren other
bad 500 500
good 1500 2500
Modify the first DATA step you have as follows in order to fix the truncation to 8 characters problem:
data mydata5;
length default numofkids $ 25;
input default $ numofkids $ count;
datalines;
Now, run PROC SORT followed by a DATA step to create your PROC FREQ-friendly variables. You'll need the BY statement, the LAST. automatic variable, and the RETAIN statement so that SAS remembers values from previous rows and can sum the counts into a single collapsed row.
proc sort data=mydata5; by default; run;
data mydata6; set mydata5;
by default;
if numofkids="nochildren" then output;
if numofkids="kids1to2" then hold1=count;
if numofkids="kids3plus" then hold2=count;
if last.default then do;
numofkids="other";
count=hold1+hold2;
output;
end;
retain hold1 hold2;
run;
Now you can run your PROC FREQ.
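Something like this, adapted from the PROC FREQ in the question (numofkids now only takes the values nochildren and other):

```sas
proc freq data=mydata6;
  weight count;                       /* each row stands for count observations */
  tables default*numofkids / chisq;
run;
```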
data mydata5;
length default $4 numofkids $10; /* without this, numofkids is truncated to 8 characters */
input default $ numofkids $ count;
datalines;
good nochildren 1500
good kids1to2 2200
good kids3plus 300
bad nochildren 500
bad kids1to2 300
bad kids3plus 200
;
run;
proc sort data=mydata5;
by default numofkids;
run;
data edf;
set mydata5;
by default;
retain nochildren;
/* reset both totals at the start of each default group */
if first.default then do;
nochildren=.;
other=0;
end;
if numofkids='nochildren' then nochildren=count;
else other+count; /* sum statement: other is retained automatically */
/* write one row per default group */
if last.default then output;
keep default nochildren other;
run;
Output will be like this:
default nochildren other
bad 500 500
good 1500 2500