I am trying to determine the probability that the mean of a sample from a uniform distribution lies between .4 and .5.
data sample1 (drop= i x z) ;
z=0;
do i=1 to 50;
x= ranuni(234);
z= z+x;
meanz= z/50;
end;
output;
run;
This gives me the mean, but is there some nice way in the loop to output P(.4 <= meanz <= .5)?
Try this. It repeats the 50-draw sample 100 times and gives you the average of the 100 values of meanz and the proportion of them that fall between .4 and .5:
data sample1 (drop= i x z sim) ;
between4_5 = 0;
meanz = 0;
do sim=1 to 100;
z=0;
do i=1 to 50;
x= ranuni(234);
z= z+x;
end;
meanz = meanz + z/(50*100);
if .4 < z/50 < .5 then
between4_5 = between4_5 + 1/100;
end;
output;
run;
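As a rough cross-check on the simulated proportion, the central limit theorem gives an approximate answer directly. This is a minimal sketch, assuming the normal approximation is adequate for n = 50: the mean of n uniform(0,1) draws has expected value 0.5 and standard deviation sqrt(1/(12*n)). The variable names n, se and p are illustrative only.

```sas
data _null_;
  n  = 50;                      /* sample size used in the simulation */
  se = sqrt(1 / (12 * n));      /* std. dev. of the mean of n uniform(0,1) draws */
  /* P(.4 <= meanz <= .5) under the normal approximation */
  p  = probnorm((0.5 - 0.5) / se) - probnorm((0.4 - 0.5) / se);
  put p=;
run;
```

The value should come out a little under 0.5, so a simulated proportion far from that suggests a problem in the simulation (or too few repetitions).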
I have the requirement below and am wondering whether it can be implemented in a simple data step.
I will be starting with a simple dataset with two variables.
input:
x y
1 0
logic:
x y z
1 0 x+y
prev z prev y +1 x+y
prev z prev y +1 x+y
output:
x y z
1 0 1
1 1 2
2 2 4
4 3 7
7 4 11
Just output the computed rows one-by-one within a do-while loop.
data have;
input x y;
cards;
1 0
;run;
data want(drop=i);
set have;
z = x + y;
output;
i = 2; /*next row*/
do while (i <= 5); /* put the total number of rows here */
x = z;
y = y + 1;
z = x + y;
output;
i = i + 1;
end;
run;
Result
proc print data=want;
run;
Obs x y z
1 1 0 1
2 1 1 2
3 2 2 4
4 4 3 7
5 7 4 11
Macro version:
%macro gen(have, want, n_rows);
data &want(drop=i);
set &have;
z = x + y;
output;
i = 2;
do while (i <= &n_rows);
x = z;
y = y + 1;
z = x + y;
output;
i = i + 1;
end;
run;
%mend gen;
/* execute */
%gen(have, want, 5)
OK. I did some research to try to answer your question, and found that you (probably) asked the same question in a SAS forum (link here).
As any interested user may see, the question was answered in a very elegant way there (by Mr. Reinhard - credits to him).
While I was pasting Reinhard's answer here, I saw that @Bill Huang had posted an original answer of his own, so you should probably accept his answer. Anyway, Reinhard's answer is really elegant, so I thought it worth recording here, mainly for other users, because it is such an easy way of creating additional iterative rows in SAS:
data have;
x=1; y=0;
run;
data want;
set have;
do y=y to 4;
z=x+y;
output;
x=z;
end;
run;
@LuizZ's do y=y to 4 version presumes the first row has y=0. If that is not the case, try:
data have;
x=1; y=0;
run;
%let iterations = 4;
data want;
set have;
do y = y to y+&iterations;
z = x + y;
output;
x = z;
end;
run;
I have two sets of financial data that tend to contain differences due to unit errors e.g. $10000 in one dataset may be $1000 in the other.
I'm trying to code a check for such differences, but the only way I can think of is to divide the two variables and see if the difference is in a table of 0.001, 0.01, 0.1, 10, 100 etc, but it would be hard to catch all of the differences.
Is there a smarter way to do this?
Use proc compare. Be sure the two datasets are sorted in identical order, either by row or by specific groups. Use the by statement as needed. More info on options can be found in the documentation.
Example - compare a modified cars dataset with sashelp.cars:
data cars_modified;
set sashelp.cars;
if(mod(_N_, 2) = 0) then msrp = msrp - 100;
run;
proc compare base = sashelp.cars
compare = cars_modified
out = out_differences
outnoequal
outdif
noprint;
var msrp;
run;
Only the observations with differences are output in out_differences:
_TYPE_ _OBS_ MSRP
DIF 2 $-100
DIF 4 $-100
DIF 6 $-100
DIF 8 $-100
DIF 10 $-100
...
So you appear to be asking to find cases where X/Y is a number that is exactly 1.00Exx where XX is an integer, other than 0.
data _null_;
do x=1,10,100,1000;
do y=1,2,3,10.1,10 ;
ratio = x/y;
power = floor(log10(ratio));
if power ne 0 and 1.00 = round(ratio/10**power,0.01) then
put 'Ratio of ' x 'over ' y 'is 10**' power '.'
;
end;
end;
run;
Results:
Ratio of 1 over 10 is 10**-1 .
Ratio of 10 over 1 is 10**1 .
Ratio of 100 over 1 is 10**2 .
Ratio of 100 over 10 is 10**1 .
Ratio of 1000 over 1 is 10**3 .
Ratio of 1000 over 10 is 10**2 .
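The same test can be applied row by row to the two financial datasets. The following is a minimal sketch, not a definitive implementation: the dataset names a and b, the key variable id, and the variable amount are hypothetical, and both datasets are assumed to be sorted by id. It uses round(log10(ratio)) rather than floor so that ratios slightly below a power of ten (such as 0.0999) are also caught, and the 0.005 tolerance is an arbitrary choice.

```sas
data unit_errors;
  merge a(rename=(amount=amount_a))
        b(rename=(amount=amount_b));
  by id;                                   /* assumes both inputs are sorted by id */
  if amount_a > 0 and amount_b > 0;        /* log10 needs a positive ratio; drops missing values too */
  ratio = amount_a / amount_b;
  power = round(log10(ratio));             /* nearest integer power of ten */
  /* keep rows whose ratio is within 0.5% of a non-trivial power of ten */
  if power ne 0 and abs(ratio / 10**power - 1) < 0.005;
run;
```

Rows that survive both subsetting IF statements are candidate unit errors, and power tells you by how many decimal places the two sources disagree.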
For a numeric value X you can compute the nearest rational expression, p/q.
If you calculate the ratio
X = amount_for_source_A / amount_from_source_B;
status = math.rational(X,1e5,p,q);
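the ratio is a whole number when q=1 and a unit fraction when p=1. It is a power of 10 (a likely unit error) only if the other term is itself 10, 100, 1000 and so on.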
Example:
proc ds2;
package math / overwrite = yes;
method rational(double x, double maxden, in_out integer p, in_out integer q) returns double;
/*
** FROM: https://www.ics.uci.edu/~eppstein/numth/frap.c
** FROM: https://stackoverflow.com/questions/95727/how-to-convert-floats-to-human-readable-fractions
**
** find rational approximation to given real number
** David Eppstein / UC Irvine / 8 Aug 1993
**
** With corrections from Arno Formella, May 2008
**
** Modified for Proc DS2, Richard DeVenezia, Jan 2020.
**
** usage: rational(r,d,p,q)
** x is real number to approx
** maxden is the maximum denominator allowed
** p is return for numerator
** q is return for denominator
** returns 0 if no problems
**
** based on the theory of continued fractions
** if x = a1 + 1/(a2 + 1/(a3 + 1/(a4 + ...)))
** then best approximation is found by truncating this series
** (with some adjustments in the last term).
**
** Note the fraction can be recovered as the first column of the matrix
** ( a1 1 ) ( a2 1 ) ( a3 1 ) ...
** ( 1 0 ) ( 1 0 ) ( 1 0 )
** Instead of keeping the sequence of continued fraction terms,
** we just keep the last partial product of these matrices.
*/
declare integer m[0:1,0:1];
declare double startx e1 e2;
declare integer ai t result p1 q1 p2 q2;
startx = x;
/* initialize matrix */
m[0,0] = 1; m[1,1] = 1;
m[0,1] = 0; m[1,0] = 0;
/* loop finding terms until denom gets too big */
do while (1);
ai = x;
if not ( m[1,0] * ai + m[1,1] < maxden ) then leave;
t = m[0,0] * ai + m[0,1];
m[0,1] = m[0,0];
m[0,0] = t;
t = m[1,0] * ai + m[1,1];
m[1,1] = m[1,0];
m[1,0] = t;
if x = ai then leave; %* AF: division by zero;
x = 1 / (x - ai);
if x > 2147483647 /*x'7FFFFFFF'*/ then leave; %* AF: representation failure;
end;
/* now remaining x is between 0 and 1/ai */
/* approx as either 0 or 1/m where m is max that will fit in maxden */
/* first try zero */
p1 = m[0,0];
q1 = m[1,0];
e1 = startx - 1.0 * p1 / q1;
/* now try other possibility */
ai = (maxden - m[1,1]) / m[1,0];
m[0,0] = m[0,0] * ai + m[0,1];
m[1,0] = m[1,0] * ai + m[1,1];
p2 = m[0,0];
q2 = m[1,0];
e2 = startx - 1.0 * p2 / q2;
if abs(e1) <= abs(e2) then do;
p = p1;
q = q1;
end;
else do;
p = p2;
q = q2;
end;
return 0;
end;
endpackage;
run;
quit;
* Example usage;
proc ds2;
data _null_;
declare package math math();
declare double x;
declare int p1 q1 p q;
method run();
streaminit(12345);
x = 0;
do _n_ = 1 to 20;
p1 = ceil(rand('uniform',9));
q1 = ceil(rand('uniform',9));
x + 1. * p1 / q1;
math.rational (x, 10000, p, q);
put 'add' p1 '/' q1 ' ' x=best16. 'is' p '/' q;
end;
end;
enddata;
run;
quit;
----- LOG -----
add 4 / 1 x= 4 is 4 / 1
add 4 / 2 x= 6 is 6 / 1
add 2 / 7 x=6.28571428571429 is 44 / 7
add 4 / 6 x=6.95238095238095 is 146 / 21
add 5 / 2 x=9.45238095238095 is 397 / 42
add 5 / 2 x= 11.952380952381 is 251 / 21
add 7 / 1 x= 18.952380952381 is 398 / 21
add 8 / 6 x=20.2857142857143 is 142 / 7
add 9 / 3 x=23.2857142857143 is 163 / 7
add 8 / 2 x=27.2857142857143 is 191 / 7
add 3 / 1 x=30.2857142857143 is 212 / 7
add 9 / 3 x=33.2857142857143 is 233 / 7
add 4 / 3 x=34.6190476190476 is 727 / 21
add 4 / 6 x=35.2857142857143 is 247 / 7
add 1 / 9 x=35.3968253968254 is 2230 / 63
add 8 / 3 x=38.0634920634921 is 2398 / 63
add 2 / 4 x=38.5634920634921 is 4859 / 126
add 5 / 1 x=43.5634920634921 is 5489 / 126
add 1 / 2 x=44.0634920634921 is 2776 / 63
add 2 / 7 x=44.3492063492064 is 2794 / 63
I want a new data set in which the variable y is equal to the value in row n minus the sum of the values in all previous rows.
The original data set:
data test;
input x;
datalines;
20
40
2
5
74
;
run;
I used the dif function, but it only returns the difference from the previous row (lag one):
data want;
set test;
y = dif(x);
run;
And I want:
_n_ = 1 y = 20
_n_ = 2 y = 40 - 20 = 20
_n_ = 3 y = 2 - (40 + 20) = -58
_n_ = 4 y = 5 - (2 + 40 + 20) = - 57
_n_ = 5 y = 74 - (5 + 2 + 40 + 20) = 7
Thanks.
No need for lag() or dif(). Just make another variable to retain the running total.
data want ;
set test;
y=x-cumm;
output;
cumm+x;
run;
I kept the extra column and output the values before updating the running total to make it clearer what value was used in the calculation of Y.
Obs x y cumm
1 20 20 0
2 40 20 20
3 2 -58 60
4 5 -57 62
5 74 7 67
Possible solution (thanks to Longfish for suggestions):
data want;
set test;
retain total 0;
total = total + x;
y = x - coalesce(lag(total), 0);
run;
As can be seen, I sorted the data by rk and descending version:
data have;
rk = 1;
version = 7;
ind = 0;
output;
rk = 1;
version = 6;
ind = 1;
output;
rk = 1;
version = 5;
ind = 0;
output;
rk = 1;
version = 4;
ind = 0;
output;
rk = 1;
version = 3;
ind = 1;
output;
rk = 1;
version = 2;
ind = 0;
output;
rk = 1;
version = 1;
ind = 0;
output;
rk = 1;
version = 0;
ind = 0;
output;
run;
I thought of the RETAIN statement, but any solution for this problem will suit me just fine.
What I need to do is this:
if at some point ind = 1, I want all previous rows (versions) for the same rk to carry some sort of indication of that.
So basically,
versions 0,1,2 should be flagged, because version 3 has ind = 1;
versions 4,5 should be flagged , because version 6 has ind = 1;
but version 7 should not be affected at all, as it appears after rows of ind = 1,
and not before them.
It would be even better if every flagged row affected by a row with ind = 1
had an indicator stating the version number that caused the flag,
meaning
versions 0,1,2 will have a field named "affected_by" equals to 3
versions 4,5 will have that field equals to 6
Your help is very much appreciated!
Since the data set is sorted, we will go "forward" (which I think is easier) through your sorted set. We'll use the SELECT statement, as we only want one branch to execute per iteration. We'll also use the RETAIN statement that you suggested and the CAT function to concatenate strings into the indicator flag:
data test;
set have;
drop N count x;
select;
when(ind = 1) do;
N = 1;
count = version;
retain N count;
output;
end;
when(N = 1) do;
x = ind;
flag = cat('Flagged because of version ', count);
N = .;
retain x count;
output;
end;
when(x = ind) do;
flag = cat('Flagged because of version ', count);
retain x count;
output;
end;
otherwise do;
output;
end;
end;
run;
OUTPUT:
rk version ind flag
1 7 0
1 6 1
1 5 0 Flagged because of version 6
1 4 0 Flagged because of version 6
1 3 1
1 2 0 Flagged because of version 3
1 1 0 Flagged because of version 3
1 0 0 Flagged because of version 3
In this case, N indicates that the previous observation had ind = 1. We then reset it (i.e. N = .), otherwise it would satisfy the N = 1 condition again in the next iteration.
Note that we retain the variables x and count so that x can be compared with the next ind. Variable count holds the version of the most recent row that has ind = 1. For the flag indicator, the CAT function appends the numeric variable count to a string.
Cheers.
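For the numeric "affected_by" field asked about in the question, a shorter variant is possible. This is a minimal sketch under the same assumption that have is sorted by rk and descending version; the names want, last_ind_version and affected_by are illustrative, and rows with ind = 1 are deliberately left unflagged, as in the output above.

```sas
data want;
  set have;
  by rk descending version;               /* requires the sort order described in the question */
  retain last_ind_version;                /* version of the most recent ind = 1 row in this rk */
  if first.rk then last_ind_version = .;  /* do not carry a flag across rk groups */
  if ind = 1 then do;
    last_ind_version = version;           /* remember it for the older versions that follow */
    affected_by = .;                      /* the ind = 1 row itself is not flagged */
  end;
  else affected_by = last_ind_version;    /* stays missing until an ind = 1 row has been seen */
  drop last_ind_version;
run;
```

With the sample data, versions 5 and 4 get affected_by = 6, versions 2, 1 and 0 get affected_by = 3, and version 7 stays missing.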
I have a dataset (already sorted by the Blood Pressure variable)
Blood Pressure
87
99
99
109
111
112
117
119
121
123
139
143
145
151
165
198
I need to find the median without using proc means.
For this data, there are 16 observations, so the median is (119+121)/2 = 120.
How can I code this so that I can always find the median, regardless of how many observations there are? The code should work for both an even and an odd number of observations.
And of course, PROC MEANS is not allowed.
Thank you.
I use a FCMP function for this. This is a generic quantile function from my personal library. As the median is the 50%-tile, this will work.
options cmplib=work.fns;
data input;
input BP;
datalines;
87
99
99
109
111
112
117
119
121
123
139
143
145
151
165
198
;run;
proc fcmp outlib=work.fns.fns;
function qtile_n(p, arr[*], n);
alphap=1;
betap=1;
if n > 1 then do;
m = alphap+p*(1-alphap-betap);
i = floor(n*p+m);
g = n*p + m - i;
qp = (1-g)*arr[i] + g*arr[i+1];
end;
else
qp = arr[1];
return(qp);
endsub;
quit;
proc sql noprint;
select count(*) into :n from input;
quit;
data _null_;
set input end=last;
array v[&n] _temporary_;
v[_n_] = bp;
if last then do;
med = qtile_n(.5,v,&n);
put med=;
end;
run;
Assuming you have a data set named HAVE sorted by the variable BP, you can try this:
data want(keep=median);
if mod(nobs,2) = 0 then do; /* even number of records in data set */
j = nobs / 2;
set HAVE(keep=bp) point=j nobs=nobs;
k = bp; /* hold value in temp variable */
j + 1;
set HAVE(keep=bp) point=j nobs=nobs;
median = (k + bp) / 2;
end;
else do;
j = round( nobs / 2 );
set HAVE(keep=bp) point=j nobs=nobs;
median = bp;
end;
put median=; /* if all you want is to see the result */
output; /* if you want it in a new data set */
stop; /* stop required to prevent infinite loop */
run;
This is "old fashioned" code; I'm sure someone can show another solution using hash objects that might eliminate the requirement to sort the data first.
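If the sort requirement is the main concern, one alternative that avoids both PROC MEANS and sorting is to load the values into a temporary array and call the MEDIAN function. This is a minimal sketch rather than the hash-object approach mentioned above; it assumes the data set HAVE has no more than 10000 rows (the array bound is an arbitrary assumption to adjust), and the names want and med are illustrative.

```sas
data want(keep=med);
  array v[10000] _temporary_;     /* assumed upper bound on the number of rows */
  set have(keep=bp) end=last;
  v[_n_] = bp;                    /* store each value; unused slots stay missing */
  if last then do;
    med = median(of v[*]);        /* MEDIAN ignores missing values, so n can be odd or even */
    put med=;
    output;
  end;
run;
```

For the 16 blood pressure values in the question this should agree with the hand calculation of 120.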