I have a dataset (already sorted by the Blood Pressure variable)
Blood Pressure
87
99
99
109
111
112
117
119
121
123
139
143
145
151
165
198
I need to find the median without using PROC MEANS.
For this data, there are 16 observations, so the median is (119+121)/2 = 120.
How can I write code that always finds the median, regardless of how many observations there are? It should work for both an even and an odd number of observations.
And of course, PROC MEANS is not allowed.
Thank you.
I use an FCMP function for this. This is a generic quantile function from my personal library. As the median is the 50th percentile, this will work.
options cmplib=work.fns;
data input;
input BP;
datalines;
87
99
99
109
111
112
117
119
121
123
139
143
145
151
165
198
;run;
proc fcmp outlib=work.fns.fns;
function qtile_n(p, arr[*], n);
alphap=1;
betap=1;
if n > 1 then do;
m = alphap+p*(1-alphap-betap);
i = floor(n*p+m);
g = n*p + m - i;
qp = (1-g)*arr[i] + g*arr[i+1];
end;
else
qp = arr[1];
return(qp);
endsub;
quit;
proc sql noprint;
select count(*) into :n from input;
quit;
data _null_;
set input end=last;
array v[&n] _temporary_;
v[_n_] = bp;
if last then do;
med = qtile_n(.5,v,&n);
put med=;
end;
run;
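To see why the quantile function returns the usual median, the arithmetic for the 16 observations above can be traced by hand. This is only an illustrative sketch: it hard-codes n = 16, p = 0.5, and the 8th and 9th sorted values (119 and 121) from the question rather than calling qtile_n.

```sas
data _null_;
n = 16; p = 0.5;
alphap = 1; betap = 1;
m = alphap + p*(1 - alphap - betap); /* 1 + 0.5*(-1) = 0.5 */
i = floor(n*p + m);                  /* floor(8.5) = 8 */
g = n*p + m - i;                     /* 8.5 - 8 = 0.5 */
qp = (1-g)*119 + g*121;              /* 0.5*119 + 0.5*121 = 120 */
put m= i= g= qp=;
run;
```

With an odd n, n*p + m is a whole number, so g is 0 and the function returns the single middle value.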
Assuming you have a data set named HAVE sorted by the variable BP, you can try this:
data want(keep=median);
if mod(nobs,2) = 0 then do; /* even number of records in data set */
j = nobs / 2;
set HAVE(keep=bp) point=j nobs=nobs;
k = bp; /* hold value in temp variable */
j + 1;
set HAVE(keep=bp) point=j nobs=nobs;
median = (k + bp) / 2;
end;
else do;
j = round( nobs / 2 );
set HAVE(keep=bp) point=j nobs=nobs;
median = bp;
end;
put median=; /* if all you want is to see the result */
output; /* if you want it in a new data set */
stop; /* stop required to prevent infinite loop */
run;
This is "old fashioned" code; I'm sure someone can show another solution using hash objects that might eliminate the requirement to sort the data first.
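One way to drop the sort requirement without a hash object is the MEDIAN function applied to a temporary array. The following is a minimal sketch, not the approach of either answer above; it assumes the data fits in memory and uses a made-up, deliberately unsorted copy of the question's values in a data set named HAVE.

```sas
/* Sample data, deliberately unsorted (same 16 values as the question) */
data have;
input bp @@;
datalines;
121 87 198 99 109 111 99 112 117 119 123 139 143 145 151 165
;
run;

/* Count the rows so the temporary array can be sized */
proc sql noprint;
select count(*) into :n from have;
quit;

data _null_;
set have end=last;
array v[&n] _temporary_;
v[_n_] = bp;
if last then do;
/* MEDIAN works for odd and even counts and ignores missing values */
med = median(of v[*]);
put med=; /* med=120 for the sample values */
end;
run;
```

Because MEDIAN orders the values internally, the input order does not matter here.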
I would like to create a new column whose values equal the average of values in other columns. But the number of columns I am taking the average of is dictated by a variable. My data look like this, with 'length' dictating the number of columns x1-x5 that I want to average:
data have;
input ID $ length x1 x2 x3 x4 x5;
datalines;
A 5 8 234 79 36 78
B 4 8 26 589 3 54
C 3 19 892 764 89 43
D 5 72 48 65 4 9
;
run;
I would like to end up with the below where 'avg' is the average of the specified columns.
data want;
input ID $ length avg;
datalines;
A 5 87
B 4 156.5
C 3 558.3
D 5 39.6
;
run;
Any suggestions? Thanks! Sorry about the awful title, I did my best.
You have to do a little more work since mean(of x[1]-x[length]) is not valid syntax. Instead, save the values to a temporary array and take the mean of it, resetting the array at each row. For example, the temporary array would hold these values for each row:
tmp1 tmp2 tmp3 tmp4 tmp5
8 234 79 36 78
8 26 589 3 .
19 892 764 . .
72 48 65 4 9
data want;
set have;
array x[*] x:;
array tmp[5] _temporary_;
/* Reset the temp array */
call missing(of tmp[*]);
/* Save each value of x to the temp array */
do i = 1 to length;
tmp[i] = x[i];
end;
/* Get the average of the non-missing values in the temp array */
avg = mean(of tmp[*]);
drop i;
run;
Use an array to average it by summing up the array for the length and then dividing by the length.
data have;
input ID $ length x1 x2 x3 x4 x5;
datalines;
A 5 8 234 79 36 78
B 4 8 26 589 3 54
C 3 19 892 764 89 43
D 5 72 48 65 4 9
;
data want;
set have;
array x(5) x1-x5;
sum=0;
do i=1 to length;
sum + x(i);
end;
avg = sum/length;
keep id length avg;
format avg 8.2;
run;
@Reeza's solution is good, but if x contains missing values it will not always produce the desired result. It's better to use the SUM function. The code is also slightly simplified:
data want (drop=i s);
set have;
array a{*} x:;
s=0; nm=0;
do i=1 to length;
if missing(a{i}) then nm+1;
s=sum(s,a{i});
end;
avg=s/(length-nm);
run;
Rather than writing your own code to calculate means you could just calculate all of the possible means and then just use an index into an array to select the one you need.
data have;
input ID $ length x1 x2 x3 x4 x5;
datalines;
A 5 8 234 79 36 78
B 4 8 26 589 3 54
C 3 19 892 764 89 43
D 5 72 48 65 4 9
;
data want;
set have;
array means[5] ;
means[1]=x1;
means[2]=mean(x1,x2);
means[3]=mean(of x1-x3);
means[4]=mean(of x1-x4);
means[5]=mean(of x1-x5);
want = means[length];
run;
I have sales data for 3 months (sale1, sale2 and sale3), and I need to show different summations under different filters.
data sales;
input area load $ prod : $ sale1 sale2 sale3;
diff=sale3-sale2;
datalines;
1 Y p1 109 117 138
1 N p1 23 29 20
1 Y p2 78 70 68
1 N p2 63 19 22
2 Y p1 49 36 32
2 N p1 50 39 44
2 Y p3 138 157 158
2 N p3 110 126 107
3 Y p2 251 267 259
3 N p2 182 184 160
;
run;
ods excel close;
ods excel file="/C:/data/t1.xlsx"
options (sheet_name="tab1" frozen_headers='3' frozen_rowheaders='2'
embedded_footnotes='yes' autofilter='1-8');
proc report data=sales nocenter;
column area load prod sale1 sale2 sale3 diff change;
define area -- diff/ display;
define sale1-- diff / analysis sum format=comma12. style(column)=[cellwidth=.5in];
define change / computed format=percent8.2 '% change' style(column)=[cellwidth=.8in];
compute change;
change = diff.sum/sale2.sum;
if change >= 0.1 then call define ("change",'STYLE','STYLE=[color=red
fontweight=bold]');
if change <= -0.1 then call define ("change",'STYLE','STYLE=[color=blue
fontweight=bold]');
endcomp;
rbreak after / summarize style=[background=lightblue font_weight=bold];
run;
ods excel close;
This report with no filtering looks like this: [image: original report]
But if I filter on the column load='Y' in the .xlsx file, I want to see a result like this: [image: output with filter]
I wonder if anyone can help. Thanks!
Let's say I have a table like:
Z 25 26 27 ... 100
0 300 200 200 100
1 278 262 177 45
2 168 222 122 22
(The 1st line is also the header).
Now I want to add 20 more observations to my table:
Z 25 26 27 ... 100
0 300 200 200 100
1 278 262 177 45
2 168 222 122 22
3 84 111 61 11
...
22 84 111 61 11
So that each observation with Z = 3 to 22 equals (observation with Z = 2) * 1/2. Is there any way to do that?
The special variable name list _numeric_ can be used to put all the numeric variables into an array. A loop over that array will let you divide each variable of a selected row by 2.
Example:
data have;
input Z _25 _26 _27 _100;
datalines;
0 300 200 200 100
1 278 262 177 45
2 168 222 122 22
run;
data newrows(drop=last_z);
set have nobs=nobs point=nobs; * read last row;
array _ _numeric_; * array all numeric variables defined so far (declared before last_z so last_z is not halved);
last_z = z;
do _n_ = 1 to dim(_);
_(_n_) = _(_n_) / 2; * divide each variable by 2 (z is halved too, but the loop below resets it);
end;
do z = last_z + 1 to last_z + 20; * output 20 'new' rows;
output;
end;
stop;
run;
proc append base=have data=newrows;
run;
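One detail worth keeping in mind with _numeric_ is that it only covers numeric variables already defined at the point where the ARRAY statement appears. The sketch below uses made-up variables a, b and c purely to illustrate that behavior.

```sas
data _null_;
a = 1; b = 2;
array nums _numeric_; /* picks up a and b only */
c = 3;                /* defined after the ARRAY statement, so not in nums */
n_vars = dim(nums);
put n_vars=;          /* n_vars=2 */
run;
```

So a helper variable that should not be touched by the loop has to be created after the ARRAY statement.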
Just to be clear, a SAS variable name cannot be a number. However, this gives you what you want:
data have;
input z a b c;
datalines;
0 300 200 200
1 278 262 177
2 168 222 122
;
data want;
set have end=lr;
array arr a--c;
output;
if lr;
do over arr;
arr = arr / 2;
end;
do _N_ = 1 to 20;
z + 1;
output;
end;
run;
Updated Code:
data have;
do z = 0, 1, 2;
array arr _25-_100;
do over arr;
arr = ceil(rand('uniform')*100);
end;
output;
end;
run;
data want;
set have end=lr;
array arr _25--_100;
output;
if lr;
do over arr;
arr = arr / 2;
end;
do _N_ = 1 to 20;
z + 1;
output;
end;
run;
Say that I have the following database:
Min Rank Qty
2 1 100
2 2 90
2 3 80
2 4 70
5 1 110
5 2 100
5 3 90
5 4 80
5 5 70
7 1 120
7 2 110
7 3 100
7 4 90
I need to have the database with the continuous values for minutes like this:
Min Rank Qty
2 1 100
2 2 90
2 3 80
2 4 70
3 1 100
3 2 90
3 3 80
3 4 70
4 1 100
4 2 90
4 3 80
4 4 70
5 1 110
5 2 100
5 3 90
5 4 80
5 5 70
6 1 110
6 2 100
6 3 90
6 4 80
6 5 70
7 1 120
7 2 110
7 3 100
7 4 90
How can I do this in SAS? I just need to replicate the previous minute. The number of observations per minute varies...it can be 4 or 5 or more.
It is not that hard to imagine code that would do this, the problem is that it quickly starts to look messy.
If your dataset is not too large, you could consider the following approach:
/* We find all gaps. the output dataset is a mapping: the data of which minute (reference_minute) do we need to create each minute of data*/
data MINUTE_MAPPING (keep=current_minute reference_minute);
set YOUR_DATA;
by min;
retain last_minute;
if _N_ = 1 then last_minute = min; *start from the first minute you have;
if first.min then do; *also runs for the first minute so it is mapped to itself;
/* Find gaps, map them to the last minute of data we have*/
if last_minute+1 < min then do;
do current_minute=last_minute+1 to min-1;
reference_minute=last_minute;
output;
end;
end;
/* For the available data, we map the minute to itself*/
reference_minute=min;
current_minute=min;
output;
*update;
last_minute=min;
end;
run;
/* Now we apply our mapping to the data */
*you must use proc sql because it is a many-to-many join, data step merge would give a different outcome;
proc sql;
create table RESULT as
select MM.current_minute as min, YD.rank, YD.qty
from MINUTE_MAPPING as MM
join YOUR_DATA as YD
on (MM.reference_minute=YD.min)
order by MM.current_minute, YD.rank
;
quit;
The more performant approach would involve trickery with arrays.
But I find the approach above a bit more appealing (disclaimer: at first thought), and it is quicker for someone else to grasp afterwards (disclaimer again: in my opinion).
For good measure, the array approach:
data RESULT (keep=min rank qty);
set YOUR_DATA;
by min;
retain last_minute; *assume that first record really is first minute;
array last_data{5} _TEMPORARY_; *assumes at most 5 ranks per minute;
if first.min then do;
if _N_ NE 1 and last_minute+1 < min then do; *gap found;
*store data of current record;
curr_min=min;
curr_rank=rank;
curr_qty=qty;
do current_minute=last_minute+1 to curr_min-1;
*produce records from array with last available data;
do iter=1 to 5;
min = current_minute;
rank = iter;
qty = last_data{iter};
if qty NE . then output; *to prevent output of 5th element where there are only 4;
end;
end;
*put back values of actual current record before proceeding;
min=curr_min;
rank=curr_rank;
qty=curr_qty;
end;
*update on every new minute, not only when a gap was found;
last_minute=min;
end;
*insert data for use on later missing minutes;
last_data{rank}=qty;
*clear the slots beyond the last rank of this minute so stale values are not replayed;
if last.min then do iter=rank+1 to 5;
last_data{iter}=.;
end;
output; *output actual current data point;
run;
Hope it helps.
Note: I currently have no access to a SAS client where I am, so this is untested code and might contain a couple of typos.
Unless you have an absurd number of observations, I think transposing would make this easy.
I don't have access to SAS at the moment, so bear with me (I can test it out tomorrow if you can't get it working).
proc transpose data=data out=data_wide prefix=obs_;
by minute;
id rank;
var qty;
run;
*sort backwards so you can use lag() to fill in the next minute;
proc sort data=data_wide;
by descending minute;
run;
data data_wide; set data_wide;
nextminute = lag(minute);
run;
proc sort data=data_wide;
by minute;
run;
*output until you get to the next minute;
data data_wide; set data_wide;
*ensure that the last observation is output;
if nextminute = . then output;
else do until (minute ge nextminute); *else avoids writing the last minute twice;
output;
minute+1;
end;
run;
*then you probably want to reverse the transpose;
proc transpose data=data_wide(drop=nextminute)
out=data_narrow(rename=(col1=qty));
by minute;
var _numeric_;
run;
*clean up the observation number;
data data_narrow(drop=_NAME_); set data_narrow;
rank = substr(_NAME_,5)*1;
run;
Again, I can't test this now, but it should work.
Someone else may have a clever solution that avoids the reverse-sort/lag/forward-sort. I feel like I have dealt with this before, but the obvious solution for me right now is to have the data sorted backwards in whatever prior sort you do (you can do the transpose with a descending sort no problem), which saves you an extra sort.
Is it possible to merge the two tables below using a hash object in SAS 9.1, as in the example below? The main problem seems to be creating the Value variable in the result dataset. Each payment could pay for more than one charge, sometimes more than one payment is needed to pay for one charge, and these two cases can occur simultaneously. Does this problem have a general name?
http://support.sas.com/rnd/base/datastep/dot/hash-getting-started.pdf
data TABLE1;
input ID_client ID_commodity Charge;
datalines;
1 111111111 100
1 222222222 200
2 333333333 300
2 444444444 400
2 555555555 500
;;;;
run;
data TABLE2;
input ID_client_hash ID_ofpayment paymentValue;
datalines;
1 11 50
1 12 50
1 13 100
1 14 50
1 15 50
2 21 500
2 22 200
2 23 100
2 24 200
2 25 200
;;;;
run;
data OUT;
input ID_client ID_commodity ID_ofpayment value;
datalines;
1 111111111 11 50
1 111111111 12 50
1 222222222 13 100
1 222222222 14 50
1 222222222 15 50
2 333333333 21 300
2 444444444 21 200
2 444444444 22 200
2 555555555 23 100
2 555555555 24 200
2 555555555 25 200
;;;;
run;
This might work for you. I have 9.2, and 9.2 has some significant hash improvements, but I think I behaved myself and only used what was there in 9.1. You might try crossposting this to SAS-L [SAS listserv], as I believe Paul Dorfman (i.e., The Hash Guru) still reads that.
I assumed you want the 'leftovers' posted out. You may need to work on that part if it's not working the way you want. This isn't terribly well tested, but it works for your example dataset. I call missing the commodity for 24 and 25 since they're not used for that.
I'm pretty sure there's a cleaner way to do the iteration than what I do, but since 9.2+ is what I use and we have multidata available, I've always used that instead of hash iterators, so I don't know the cleaner methods.
data have;
input ID_client ID_commodity Charge;
datalines;
1 111111111 100
1 222222222 200
2 333333333 300
2 444444444 400
2 555555555 50
;;;;
run;
data for_hash;
input ID_client_hash ID_ofpayment paymentValue;
datalines;
1 11 50
1 12 50
1 13 100
1 14 50
1 15 50
2 21 500
2 22 200
2 23 100
2 24 200
2 25 200
;;;;
run;
data want;
*Create hash and hash iterator - must use iterator since 9.1 does not allow multidata option;
if _n_ = 1 then do;
format id_client_hash paymentValue id_ofpayment BEST12.;
declare hash h(dataset:'for_hash' , ordered: 'a');
h.defineKey('ID_client_hash','id_ofpayment'); *note I put id_client_hash, renaming the id - want to be able to compare them;
h.defineData('id_client_hash','id_ofpayment','paymentValue');
call missing(id_ofpayment,paymentValue, id_client_hash);
h.defineDone();
declare hiter hi('h');
end;
do _t = 1 by 1 until (last.id_client);
set have;
by id_client;
*Iterate through the hash and find the first record with the same ID_client;
do rc = hi.first() by 0 while (rc eq 0 and ID_client ne ID_client_hash);
rc = hi.next();
end;
*For the current charge record, iterate through the payment (hash) until all paid up.;
do while (charge gt 0 and rc eq 0 and ID_client=ID_client_hash);
if charge ge paymentValue then do; *If charge >= paymentvalue, use up the payment value;
value = paymentValue; *so whole paymentValue is value;
charge = charge - paymentValue; *charge is decremented by paymentValue;
output; *output row;
_id=ID_client_hash;
_pay=id_ofpayment;
rc = hi.next();
h.remove(key:_id,key:_pay); *remove payment row from hash now that it has been used up;
end;
else do; *this is if (remaining) charge is less than payment - we will not use all of the payment;
value = charge; *value is the remainder of the charge, ie, how much of payment was actually used;
paymentValue = paymentValue - charge; *paymentValue is the remainder of paymentValue;
charge= 0; *charge is zero now;
output; *output a row;
h.replace(); *replace paymentValue in the hash with the new value of paymentValue, minus charge;
end;
end; *end of iteration through hash - at this point, either charge = 0 or we have run out of payments with that ID;
if charge gt 0 then do;
value=-1*charge;
call missing(id_ofpayment);
output; *output a row for the charge, which is not paid;
end;
if last.id_client then do; *this is cleanup, checking to see if we have any leftover payments;
do while (rc=0); *iterate through the remaining hash;
do rc = hi.first() by 0 while (rc eq 0 and ID_client ne ID_client_hash);
rc = hi.next();
end;
if rc=0 then do;
call missing(id_commodity); *to make it clear this is a leftover payment;
value=paymentValue; *update the value;
output; *output the payment;
_id=ID_client_hash;
_pay=id_ofpayment;
rc = hi.next();
if rc= 0 then h.remove(key:_id,key:_pay); *remove the payment just output;
end;
end;
end;
end;
keep id_client id_ofpayment id_commodity value;
run;
Among other things, this isn't terribly fast - I do a lot of iterating that might be wasteful. It will be relatively faster if you don't have any payment ID_client records that aren't represented in the charge records- any that you do are getting skipped over, so that could end up super slow.
I'm not confident hash is the superior solution, at least pre-9.2; keyed UPDATE might be superior. UPDATE is pretty much made for transactional database structures, which this seems close to.