I have a data set ranked by some variable.
I need to take the first 1000 observations and count how many have field1=1, then count the next 1000 observations in the same way, and so on.
How can I do it?
I hope I understand correctly what you want.
You could try a data step like this:
data result (Keep=countob obnr);
retain obnr 1000;
retain countob 0;
set mydata;
if field1=1 then
countob=countob+1;
if mod(_n_,1000) = 0 then do;
output;
obnr=obnr+1000;
countob=0;
end;
run;
This would lead to a result like this:
obnr | countob
------------
1000 | 247
2000 | 325
3000 | 198
obnr is obviously optional...
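One thing to watch: the step above only writes a row when _n_ is a multiple of 1000, so a trailing block of fewer than 1000 observations is not reported. A minimal sketch of one way to include it, assuming the same mydata and field1 names and using the END= option of SET (the flag name last is my own choice):

```sas
/* Sketch only: same counting logic, plus END= to flush a trailing partial block */
data result (keep=obnr countob);
  set mydata end=last;          /* last is 1 on the final observation */
  retain countob 0;
  if field1=1 then countob=countob+1;
  if mod(_n_,1000)=0 or last then do;
    obnr=_n_;                   /* number of the last observation in this block */
    output;
    countob=0;
  end;
run;
```

Here obnr is taken from _n_ directly, so the final row shows the actual number of observations read rather than the next multiple of 1000.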
Another, slightly shorter way uses the CEIL function and PROC FREQ:
data want;
set have;
thousand=ceil(_N_/1000)*1000;
run;
proc freq data=want;
tables thousand / out=counts; /* separate output name so the input data set is not overwritten */
where field1=1;
run;
I'm new to SAS and wondering how to randomly sample a dataset.
I create a dataset work.seg, then sample from that table. I want to continue sampling until the sum of the prem column in the resampled table is greater than some amount.
In my current version of the code, I think it resets sumprem to 0 each time, so it never exceeds the threshold, and the code just keeps running.
data work.seg;
input segment $3. prem loss;
datalines;
AAA 5000 0
AAA 3000 12584
AAA 250 245
AAA 500 678
;
data work.test;
sumprem = 0;
row_i=int(ranuni(777)*n)+1;
set work.seg point=row_i nobs=n;
sumprem=sumprem+prem_i;
if sumprem>15000 then stop;
run;
Since you are using the POINT= option, there is no need to let the normal iteration of the data step happen. Just add a loop and an OUTPUT statement. You might also want to put an upper bound on the maximum number of samples.
data work.test;
do _n_=1 to 100000 until (sumprem>15000) ;
row_i=int(ranuni(777)*n)+1;
set work.seg point=row_i nobs=n;
sumprem + prem; /* the variable in work.seg is prem, not prem_i */
output;
end;
stop;
run;
You just need to replace the sumprem=0 assignment with a RETAIN statement. Also, prem_i is undefined; use the prem variable instead.
sumprem=0; /* Change this to the next statement */
retain sumprem 0;
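Put together, the asker's original step with both changes applied might look like the sketch below. It assumes the work.seg data set from the question and changes nothing else.

```sas
/* Sketch of the original step with both fixes applied */
data work.test;
  retain sumprem 0;                 /* keep the running total across iterations */
  row_i=int(ranuni(777)*n)+1;       /* random row number between 1 and n */
  set work.seg point=row_i nobs=n;
  sumprem=sumprem+prem;             /* prem, not prem_i */
  if sumprem>15000 then stop;       /* POINT= never reaches end of file, so STOP is required */
run;
```

Note that STOP ends the step before the implicit output, so the row that pushes the total over the threshold is not written to work.test. The loop version above, with its explicit OUTPUT, does include that row.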
I have 10 different variables in 10 different tables, each with VARNAME and MISSING PERCENT.
Out of these 10, let's say 5 have no "MISSING PERCENT" value, and I want to include these observations with 0% missing. For now, the merge eliminates these observations from the final output.
data Final_Output_All_Missing;
length VARNAME $ 30;
merge work.Final_Output_MOLD work.Final_output_tbm_stage2
work.final_output_article7
work.final_output_tbm_stage1 work.final_output_bladder
work.final_output_batch_id;
by varname;
keep VARNAME PERCENT;
run;
VARNAME MISSING PERCENT
BLADDER 0.10
MOLD 0.06
TBM_STAGE1 0.18
TBM_STAGE2 99.9
Secondly, I have already merged the different tables containing the different variables (the 0% ones still need to be merged), as shown above.
After merging, I want to see the output in the format below. Is it possible to get it in this format?
BLADDER MOLD TBM_STAGE1 TBM_STAGE2
1. 0.10% 0.06% 0.18% 99.9%
Appreciate your help!
The code below will:
Replace missing values with 0 (I added missing values to the data)
Add the percent format xx.xx%
Transpose the data
Code:
/*Create input data*/
data have;
length VARNAME $10; /* declare the length before INPUT so it takes effect */
input VARNAME $ MISSING_PERCENT;
datalines;
BLADDER 0.10
MOLD 0.06
TBM_STAGE1 0.18
TBM_STAGE2 99.9
TBM_STAGE3 .
TBM_STAGE4 .
;;;;;;
run;
/*Replace missing values with 0, and Add % format */
proc sql;
create table work.new as
select
VARNAME,
coalesce(MISSING_PERCENT,0)/100 as MISSING_PERCENT format=percent8.2
from work.have;
quit;
/*Transpose Data*/
proc transpose data=work.new
out=work.want name=VARNAME;
id VARNAME;
run;
Thank you for the answer. It is perfectly giving what I want.
But I have a list of 160 variables, and I think hard-coding them would take too much effort. What I did was exclude the variables with 0% missing and extract only the ones that have a missing percent.
This is how I did it to extract the Missing Percent:
data Final_Output_&var;
set Final_Output_&var;
VARNAME = "&var";
%if &t=char %then
%do;
where put (&var., $missfmt.) in ("Missing");
%end;
%else
%do;
where put (&var., missfmt.) in ("Missing");
%end;
run;
Thanks again for your quick reply !!!
Best Regards,
Pankaj
I have a data set with 1100 samples, target class isReturn, there are
800 isReturn='True'
300 isReturn='False'
How can I use PROC SURVEYSELECT to oversample the 300 isReturn='False' records so that I have 800 isReturn='False' and the data set is balanced?
Thanks in advance.
I may not understand what you want, but if you just want to have 800 of the false folks, you could use proc surveyselect or the data step.
The data step would give you granular control. This gives you your 300 twice, plus another 200 picked randomly (possibly 1 or 0 times) from the 300 a third time.
data have;
length isReturn $5;
do _n_=1 to 800;
isReturn='True';
output;
if _n_ le 300 then do;
isReturn='False';
output;
end;
end;
run;
data want;
set have;
retain k 200 n 300;
if isReturn='True' then output;
else do;
output;
output;
if ranuni(7) le k/n then do;
output;
k+-1;
end;
n+-1;
end;
run;
You could tweak that pretty easily to get any distribution you want (for example, you could take 500 out of '600' (the 300 doubled) by setting k and n to 500 and 600 and doing the if block twice, decrementing n once each time).
You could also use proc surveyselect to do this.
proc surveyselect data=have(where=(isReturn='False')) out=want_add method=urs n=500 outhits;
run;
That would give you an extra 500 records, chosen at random with replacement; just add those back to the original dataset. You don't have as granular control but it is very easy to code.
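The "add those back" step could be sketched as a simple concatenation. This assumes the have and want_add names used above, and that the only extra variable in want_add is NumberHits, which PROC SURVEYSELECT adds for METHOD=URS.

```sas
/* Sketch: combine the original data with the 500 extra sampled records */
data want;
  set have
      want_add(drop=NumberHits);   /* NumberHits comes from METHOD=URS; drop it to match have */
run;
```

With the example data this yields 800 True and 800 False records. Adding a SEED= value on the PROC SURVEYSELECT statement would make the draw repeatable.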
Alternatively, you could do this in one step. However, this does not guarantee that every original record, False or True, will be represented at least once, so it likely doesn't do exactly what you asked for; it is presented for completeness.
data sizes;
input isReturn :$5. _NSIZE_;
datalines;
False 800
True 800
;;;;
run;
proc sort data=have;
by isReturn;
run;
proc surveyselect data=have out=want method=urs n=sizes outhits;
strata isReturn;
run;
All of this assumes you're trying to get 100% of the original dataset plus some. If you're trying to oversample in the sense of picking False records with the same probability as True records, but you are ultimately drawing a smaller sample than the total (and picking each record only once, i.e. without replacement), then the STRATA statement is what you should be looking at.
Using the following code
data mydata5;
input default$ numofkids$ count;
datalines;
good nochildren 1500
good kids1to2 2200
good kids3plus 300
bad nochildren 500
bad kids1to2 300
bad kids3plus 200
;
run;
I created a dataset
Obs default numofkids count
1 good nochildr 1500
2 good kids1to2 2200
3 good kids3plu 300
4 bad nochildr 500
5 bad kids1to2 300
6 bad kids3plu 200
What I've been trying to get to is something like this
nochildren other
good 1500 2500
bad 500 500
I've tried many things but nothing has worked so far. I know there is an easy way out without getting into complicated code.
I want to run a data step where I can set mydata5 and create a dataset formatted the way I want, with minimal coding required.
Could someone please offer some insights on this?
The purpose is then to run a PROC FREQ to get a chi-square test done.
I managed to make some progress, but my code does not produce the table the way I want it. However, I am able to do a chi-square test nonetheless.
data mydata6;
set mydata5;
if numofkids='nochildren' then Group=1;
else Group=2;
run;
proc freq data=mydata6;
weight count;
tables default*Group/chisq;
run;
data mydata61;
set mydata5;
if numofkids='kids3plu' then Group=1;
else Group=2;
run;
proc freq data=mydata61;
weight count;
tables default*Group/chisq;
run;
Also, I ran into another issue when I tried to group the data: I had to specify numofkids='kids3plu' instead of the whole string 'kids3plus'. The data did not group if I specified the whole string. Can someone comment on this as well, please?
I would use PROC SUMMARY/MEANS to do the sum and then transpose to create the format you are looking for.
I'm creating new data sets along the way that should help you see how this works.
data mydata5;
length default $4. numofkids $32.;
input default$ numofkids$ count;
datalines;
good nochildren 1500
good kids1to2 2200
good kids3plus 300
bad nochildren 500
bad kids1to2 300
bad kids3plus 200
;
run;
/*Populate a variable for "nochildren" and "other"*/
data mydata6;
set mydata5;
length kids $32.;
if numofkids = "nochildren"
then kids=numofkids;
else
kids = "other";
run;
proc sort data=mydata6;
by default kids;
run;
proc summary data=mydata6;
by default kids;
var count;
output out=mydata7 sum=;
run;
proc transpose data=mydata7 out=mydata8(drop=_name_);
by default;
id kids;
var count;
run;
This produces one row per default value, with the nochildren and other sums as columns.
Modify your first DATA step as follows to fix the problem of values being truncated to 8 characters:
data mydata5;
length default numofkids $ 25;
input default $ numofkids $ count;
datalines;
Now run PROC SORT followed by a DATA step to create your PROC FREQ-friendly variables. You'll need the BY statement, the LAST. variable, and the RETAIN statement so that SAS remembers previous rows and can sum them up and collapse them.
proc sort data=mydata5; by default; run;
data mydata6; set mydata5;
by default;
if numofkids="nochildren" then output;
if numofkids="kids1to2" then hold1=count;
if numofkids="kids3plus" then hold2=count;
if last.default then do;
numofkids="other";
count=hold1+hold2;
output;
end;
retain hold1 hold2;
run;
Now you can run your PROC FREQ.
data mydata5;
length default $4 numofkids $10; /* avoid truncation to 8 characters */
input default $ numofkids $ count;
datalines;
good nochildren 1500
good kids1to2 2200
good kids3plus 300
bad nochildren 500
bad kids1to2 300
bad kids3plus 200
;
run;
proc sort data=mydata5;
by default numofkids;
run;
data edf;
set mydata5;
by default;
/* reset both totals at the start of each default group */
if first.default then do;
nochildren=0;
other=0;
end;
/* sum statements retain their variables across rows */
if numofkids='nochildren' then nochildren+count;
else other+count;
if last.default then output; /* one row per default group */
keep default nochildren other;
run;
Output will look like this (sorted by default):
default nochildren other
bad 500 500
good 1500 2500
How can you create a SAS data set from another data set using only the last n observations of the original data set? This is easy when you know the value of n. If I don't know 'n', how can this be done?
This assumes you have a macro variable that says how many observations you want. The NOBS= option gives you the number of observations currently in the dataset without reading the whole thing.
%let obswant=5;
data want;
set sashelp.class nobs=obscount;
if _n_ gt (obscount-&obswant.);
run;
Using Joe's example of a macro variable to specify the number of observations you want, here is another answer:
%let obswant = 10;
data want;
do _i_=nobs-(&obswant-1) to nobs;
set have point=_i_ nobs=nobs;
output;
end;
stop; /* Needed to stop data step */
run;
This should perform better since it only reads the specific observations you want.
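One caveat with the POINT= loop: if the data set has fewer observations than requested, the start index falls below 1 and the read fails. A hedged variant that clamps the start index with MAX, assuming the same have data set and obswant macro variable:

```sas
%let obswant = 10;
data want;
  /* MAX(1, ...) keeps the start index valid when have has fewer rows than requested */
  do _i_=max(1, nobs-&obswant+1) to nobs;
    set have point=_i_ nobs=nobs;
    output;
  end;
  stop; /* still needed, because POINT= never reaches end of file */
run;
```

When have holds fewer than &obswant observations, this simply copies all of them.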
If the dataset is large, you might not want to read the whole dataset. Instead, you could try a construction that reads the total number of observations in the dataset first. So if you want the last &number observations:
data t;
input x;
datalines;
1
2
3
4
;
%let dsid=%sysfunc(open(t));
%let num=%sysfunc(attrn(&dsid,nlobs));
%let rc=%sysfunc(close(&dsid));
%let number = 2;
data tt;
set t (firstobs = %eval(&num.-&number.+1));
run;
For the sake of variety, here's another approach (not necessarily a better one)
%let obswant=5;
proc sql noprint;
select nlobs-&obswant.+1 into :obscalc
from dictionary.tables
where libname='SASHELP' and upcase(memname)='CLASS';
quit;
data want;
set sashelp.class (firstobs=&obscalc.);
run;
You can achieve this using the _nobs_ and _N_ variables. First, create a temporary variable (_nobs_) to store the total number of observations. Then compare the automatic variable _N_ to _nobs_.
data a;
set sashelp.class nobs=_nobs_;
if _N_ gt _nobs_ -5;
run;