Using the following code
data mydata5;
input default$ numofkids$ count;
datalines;
good nochildren 1500
good kids1to2 2200
good kids3plus 300
bad nochildren 500
bad kids1to2 300
bad kids3plus 200
;
run;
I created a dataset
Obs default numofkids count
1 good nochildr 1500
2 good kids1to2 2200
3 good kids3plu 300
4 bad nochildr 500
5 bad kids1to2 300
6 bad kids3plu 200
What I've been trying to get to is something like this
nochildren other
good 1500 2500
bad 500 500
I've tried many a things but nothing has worked so far. I know there is any easy way out without getting into complicated codes.
I want to run a datastep where I can set mydata5 and create a dataset which will format like the way i want it with minimal coding required.
Could someone please offer some insights on this.
The purpose is then to run a proc freq to get a chisq test done.
I managed to make some progress with the code but my code does not produce the table like I want it. However, I am able to do a chisq test nonethless
data mydata6;
set mydata5;
if numofkids='nochildren' then Group=1;
else Group=2;
run;
proc freq data=mydata6;
weight count;
tables default*Group/chisq;
run;
data mydata61;
set mydata5;
if numofkids='kids3plu' then Group=1;
else Group=2;
run;
proc freq data=mydata61;
weight count;
tables default*Group/chisq;
run;
Also, another thing I faced an issue with was when I tried to group the data, I had to specify numofkids=kids3plu instead of the whole string kids3plus. The data did not group if i specified the whole string. Can someone comment on this as well, please?
I would use PROC SUMMARY/MEANS to do the sum and then transpose to create the format you are looking for.
I'm creating new data sets along the way that should help you see how this works.
data mydata5;
length default $4. numofkids $32.;
input default$ numofkids$ count;
datalines;
good nochildren 1500
good kids1to2 2200
good kids3plus 300
bad nochildren 500
bad kids1to2 300
bad kids3plus 200
;
run;
/*Populate a variable for "nochildren" and "other"*/
data mydata6;
set mydata5;
length kids $32.;
if numofkids = "nochildren"
then kids=numofkids;
else
kids = "other";
run;
proc sort data=mydata6;
by default kids;
run;
proc summary data=mydata6;
by default kids;
var count;
output out=mydata7 sum=;
run;
proc transpose data=mydata7 out=mydata8(drop=_name_);
by default;
id kids;
var count;
run;
Produces this:
Modify the first DATA step you have as follows in order to fix the concatenation to 8 characters problem:
data mydata5;
length default numofkids $ 25;
input default $ numofkids $ count;
datalines;
Now, run PROC SORT followed by a DATA step to create your PROC FREQ-friendly variables. You'll need to use the by, last, and retain statements for SAS to remember previous rows in order to sum up columns to collapse them.
proc sort data=mydata5; by default; run;
data mydata6; set mydata5;
by default;
if numofkids="nochildren" then output;
if numofkids="kids1to2" then hold1=count;
if numofkids="kids3plus" then hold2=count;
if last.default then do;
numofkids="other";
count=hold1+hold2;
output;
end;
retain hold1 hold2;
run;
Now you can run your PROC FREQ.
data mydata5;
input default$ numofkids$ count;
datalines;
good nochildren 1500
good kids1to2 2200
good kids3plus 300
bad nochildren 500
bad kids1to2 300
bad kids3plus 200
;
run;
proc sort data=mydata5;
by default numofkids;
run;
data edf;
set mydata5(rename=(count=noofchildren));
by default;
if first.default then count1=0;
if numofkids= 'nochildren' then output;
else count1+noofchildren;
other=count1;
if last.default then output;
keep default noofchildren other ;
run;
Output will be like this:
default nochildren other
good 1500 2500
bad 500 500
Related
I want to use SAS and eg. proc report to produce a custom table within my workflow.
Why: Prior, I used proc export (dbms=excel) and did some very basic stats by hand and copied pasted to an excel sheet to complete the report. Recently, I've started to use ODS excel to print all the relevant data to excel sheets but since ODS excel would always overwrite the whole excel workbook (and hence also the handcrafted stats) I now want to streamline the process.
The task itself is actually very straightforward. We have some information about IDs, age, and registration, so something like this:
data test;
input ID $ AGE CENTER $;
datalines;
111 23 A
. 27 B
311 40 C
131 18 A
. 64 A
;
run;
The goal is to produce a table report which should look like this structure-wise:
ID NO-ID Total
Count 3 2 5
Age (mean) 27 45.5 34.4
Count by Center:
A 2 1 3
B 0 1 1
A 1 0 1
It seems, proc report only takes variables as columns but not a subsetted data set (ID NE .; ID =''). Of course I could just produce three reports with three subsetted data sets and print them all separately but I hope there is a way to put this in one table.
Is proc report the right tool for this and if so how should I proceed? Or is it better to use proc tabulate or proc template or...?
I found a way to achieve an almost match to what I wanted. First if all, I had to introduce a new variable vID (valid ID, 0 not valid, 1 valid) in the data set, like so:
data test;
input ID $ AGE CENTER $;
if ID = '' then vID = 0;
else vID = 1;
datalines;
111 23 A
. 27 B
311 40 C
131 18 A
. 64 A
;
run;
After this I was able to use proc tabulate as suggested by #Reeza in the comments to build a table which pretty much resembles what I initially aimed for:
proc tabulate data = test;
class vID Center;
var age;
keylabel N = 'Count';
table N age*mean Center*N, vID ALL;
run;
Still, I wonder if there is a way without introducing the new variable at all and just use the SAS counters for missing and non-missing observations.
UPDATE:
#Reeza pointed out to use the proc format to assign a value to missing/non-missing ID data. In combination with the missing option (prints missing values) in proc tabulate this delivers the output without introducing a new variable:
proc format;
value $ id_fmt
' ' = 'No-ID'
other = 'ID'
;
run;
proc tabulate data = test missing;
format ID $id_fmt.;
class ID Center;
var age;
keylabel N = 'Count';
table N age*(mean median) Center*N, (ID=' ') ALL;
run;
I'm new to SAS and wondering how to randomly sample a dataset.
I create a dataset work.seg, then sample from that table. I want to continue sampling until the sum of the prem column in the resampled table is greater than some amount.
In my current version of the code, I think it resets sumprem to 0 each time, so it never exceeds the threshold, and the code just keeps running.
data work.seg;
input segment $3. prem loss;
datalines;
AAA 5000 0
AAA 3000 12584
AAA 250 245
AAA 500 678
;
data work.test;
sumprem = 0;
row_i=int(ranuni(777)*n)+1;
set work.seg point=row_i nobs=n;
sumprem=sumprem+prem_i;
if sumprem>15000 then stop;
run;
Since you are using POINT= option there is no need to let the normal iteration of the data step happen. Just add a loop and an output statement. You might want to also put an upper bound on maximum number of samples.
data work.test;
do _n_=1 to 100000 until (sumprem>15000) ;
row_i=int(ranuni(777)*n)+1;
set work.seg point=row_i nobs=n;
sumprem + prem_i;
output;
end;
stop;
run;
you just need to replace the sumprem=0 to retain statement and also prem_i is unidentified, use prem variable instead
sumprem=0; /* Change this to next statement*/
retain sumprem 0;
I have 10 different variables in 10 different tables with the VARNAME and MISSING PERCENT.
Out of these 10, lets say 5 do not have the "MISSING PERCENT" and I want to include these observation with 0% Missing. For now, it eliminates this observation in the final output.
data Final_Output_All_Missing;
length VARNAME $ 30;
merge work.Final_Output_MOLD work.Final_output_tbm_stage2
work.final_output_article7
work.final_output_tbm_stage1 work.final_output_bladder
work.final_output_batch_id;
by varname;
keep VARNAME PERCENT;
run;
VARNAME MISSING PERCENT
BLADDER 0.10
MOLD 0.06
TBM_STAGE1 0.18
TBM_STAGE2 99.9
Secondly, I have already merged the different tables containing different variables(0% still needs to be merged) as shown below:
After merging, I want to see the output in this format. is it possible for me to get in this format?
BLADDER MOLD TBM_STAGE1 TBM_STAGE2
1. 0.10% 0.06% 0.18% 99.9%
Appreciate your help!
The Code below will:
Replace missing values with 0 (I added missing values in the data)
Adds Percent format xx.xx%
Transpose the data
Code:
/*Create input data*/
data have;
informat VARNAME $10.;
input VARNAME $ MISSING_PERCENT ;
length VARNAME $10;
datalines;
BLADDER 0.10
MOLD 0.06
TBM_STAGE1 0.18
TBM_STAGE2 99.9
TBM_STAGE3 .
TBM_STAGE4 .
;;;;;;
run;
/*Replace missing values with 0, and Add % format */
proc sql;
create table work.new as
select
VARNAME,
coalesce(MISSING_PERCENT,0)/100 as MISSING_PERCENT format=percent8.2
from work.have;
quit;
/*Transpose Data*/
proc transpose data=work.new
out=work.want name=VARNAME;
id VARNAME;
run;
Input:
Formatted:
Output:
Thank you for the answer. It is perfectly giving what I want.
But, I do have a list of 160 Variables and I think to hard code it, will take more effort. What I did is I excluded the Variables which have 0% Missing Percent and extracted the ones which have missing percent.
This is how I did it to extract the Missing Percent:
> data Final_Output_&var;
set Final_Output_&var;
VARNAME = "&var";
%if &t=char %then
%do;
where put (&var., $missfmt.) in ("Missing");
%end;
%else
%do;
where put (&var., missfmt.) in ("Missing");
%end;
run;
Thanks again for your quick reply !!!
Best Regards,
Pankaj
I have some data set ranked by some variable.
I need to take every 1000 observation from the beginning and count in which field1=1, then count next 1000 observations in the same way.
How can I do it?
I hope I understand correctly what you want.
You could try a datastep like this
data result (Keep=countob obnr);
retain obnr 1000;
retain countob 0;
set mydata;
if field1=1 then
countob=countob+1;
if mod(_n_,1000) = 0 then do;
output;
obnr=obnr+1000;
countob=0;
end;
run;
this would lead to a result like this:
obnr | countob
------------
1000 | 247
2000 | 325
3000 | 198
obnr is obviously optional...
Another, slightly shorter, way, utilizing CEIL-function and PROC FREQ-procedure:
data want;
set have;
thousand=ceil(_N_/1000)*1000;
run;
proc freq data=want;
tables thousand / out=want;
where field1=1;
run;
I have a data set with 1100 samples, target class isReturn, there are
800 isReturn='True'
300 isReturn='False'
How can I use PROC SURVEYSELECT to oversample the 300 isReturn='False' so that I will have 800 isReturn='False' to make the data set balance?
Thanks in advance.
I may not understand what you want, but if you just want to have 800 of the false folks, you could use proc surveyselect or the data step.
The data step would give you granular control. This gives you your 300 twice, plus another 200 picked randomly (possibly 1 or 0 times) from the 300 a third time.
data have;
length isReturn $5;
do _n_=1 to 800;
isReturn='True';
output;
if _n_ le 300 then do;
isReturn='False';
output;
end;
end;
run;
data want;
set have;
retain k 200 n 300;
if isReturn='True' then output;
else do;
output;
output;
if ranuni(7) le k/n then do;
output;
k+-1;
end;
n+-1;
end;
run;
You could tweak that pretty easily to get any distribution you want (you could take 500 out of '600' (double 300) for example by setting k and n to 500 and 600 and doing the if bit twice, each time decrementing n once).
You could also use proc surveyselect to do this.
proc surveyselect data=have(where=(isReturn='False')) out=want_add method=urs n=500 outhits;
run;
That would give you an extra 500 records, chosen at random with replacement; just add those back to the original dataset. You don't have as granular control but it is very easy to code.
Alternately, you could do this in one step. However, this does not guarantee you for either false or true a single record will always be represented - so this likely doesn't do exactly what you ask for; presented for completeness.
data sizes;
input isReturn :$5. _NSIZE_;
datalines;
False 800
True 800
;;;;
run;
proc sort data=have;
by isReturn;
run;
proc surveyselect data=have out=want method=urs n=sizes outhits;
strata isReturn;
run;
All of this assumes you're trying to get 100% of the original dataset plus some. If you're trying to oversample in the sense of pick False records with equal probability to True records, but you are ultimately picking a smaller sample than the total (and only picking each once, ie without replacement) then the strata statement is what you should be looking at.