SAS: make values missing - sas

I am trying to set some existing values to missing (without deleting the rows).
Here is the basic structure of my data set.
I want to treat AGE and GENDER as missing whenever A is less than B. For example, when A=1 and B=3, I want to treat values of AGE and GENDER on the last two rows as missing (as shown on the data sets).
In my data both A and B go from 1 to 4 and have every combination of them.
Asterisks mean I have more data between them. Thanks in advance!
BEFORE
ID A B AGE GENDER
--------------
1 1 1 35 M
* * * * *
* * * * *
5 1 2 23 F
5 1 2 21 M
6 1 2 42 F
6 1 2 43 M
* * * * *
* * * * *
20 1 3 43 F
20 1 3 39 M
20 1 3 23 M
21 1 3 32 F
21 1 3 39 M
21 1 3 23 F
* * * * *
* * * * *
55 2 4 32 M
55 2 4 12 M
55 2 4 31 F
55 2 4 43 M
* * * * *
* * * * *
AFTER
ID A B AGE GENDER
--------------
1 1 1 35 M
* * * * *
* * * * *
5 1 2 23 F
5 1 2 . .
6 1 2 42 F
6 1 2 . .
* * * * *
* * * * *
20 1 3 43 F
20 1 3 . .
20 1 3 . .
21 1 3 32 F
21 1 3 . .
21 1 3 . .
* * * * *
* * * * *
55 2 4 32 M
55 2 4 12 M
55 2 4 . .
55 2 4 . .
* * * * *
* * * * *

How about now?
data temp;
  retain idcount 0;
  set olddata;
  ** Create an observation counter for each id **;
  prev_id = lag(id);
  if id ^= prev_id then idcount = 0;
  idcount = idcount + 1;
run;
** Sort the obs by ID in reverse order **;
proc sort data=temp;
  by id descending idcount;
run;
data temp2;
  retain misscount 0;
  set temp;
  by id descending idcount;
  ** Keep the previous age and gender **;
  old_age = age;
  old_gender = gender;
  ** Count the number that should be missing **;
  if a < b then nummiss = b - a;
  else nummiss = 0;
  ** Set a counter of obs that we will set to missing **;
  if first.id then misscount = 0;
  ** Set the appropriate number of rows to missing and update the counter **;
  if misscount < nummiss then do;
    misscount = misscount + 1;
    call missing(age, gender);
  end;
run;
proc sort data=temp2 out=temp3(drop=misscount nummiss idcount prev_id);
  by id idcount;
run;


Detecting unit differences in data (SAS)

I have two sets of financial data that tend to contain differences due to unit errors e.g. $10000 in one dataset may be $1000 in the other.
I'm trying to code a check for such differences, but the only way I can think of is to divide the two variables and see whether the ratio is in a table of 0.001, 0.01, 0.1, 10, 100, etc., and it would be hard to catch all of the differences that way.
Is there a smarter way to do this?
Use proc compare. Be sure the two datasets are sorted in identical order, either by row or by specific groups. Use the by statement as needed. More info on options can be found in the documentation.
Example - compare a modified cars dataset with sashelp.cars:
data cars_modified;
  set sashelp.cars;
  if mod(_N_, 2) = 0 then msrp = msrp - 100;
run;
proc compare base = sashelp.cars
             compare = cars_modified
             out = out_differences
             outnoequal
             outdif
             noprint;
  var msrp;
run;
Only the observations with differences are output in out_differences:
_TYPE_ _OBS_ MSRP
DIF 2 $-100
DIF 4 $-100
DIF 6 $-100
DIF 8 $-100
DIF 10 $-100
...
So you appear to be asking to find cases where X/Y is a number that is exactly 1.00Exx where XX is an integer, other than 0.
data _null_;
  do x=1,10,100,1000;
    do y=1,2,3,10.1,10;
      ratio = x/y;
      power = floor(log10(ratio));
      if power ne 0 and 1.00 = round(ratio/10**power,0.01) then
        put 'Ratio of ' x 'over ' y 'is 10**' power '.';
    end;
  end;
run;
Results:
Ratio of 1 over 10 is 10**-1 .
Ratio of 10 over 1 is 10**1 .
Ratio of 100 over 1 is 10**2 .
Ratio of 100 over 10 is 10**1 .
Ratio of 1000 over 1 is 10**3 .
Ratio of 1000 over 10 is 10**2 .
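The same idea can be sketched outside SAS. The C++ below is an illustration, not a translation of the step above. It rounds log10 of the ratio to the nearest integer (instead of taking the floor) and compares the scaled ratio to 1 within a tolerance. The function name `isUnitShift` and the `tolerance` parameter are assumptions made up for this sketch.

```cpp
#include <cmath>

// Returns true when x / y is within `tolerance` (relative) of a power of ten
// other than 10^0, and stores that exponent in `power`.
// Hypothetical helper: both amounts are assumed to be strictly positive.
bool isUnitShift(double x, double y, double tolerance, int& power) {
    if (x <= 0.0 || y <= 0.0) return false;
    double ratio = x / y;
    // Nearest integer exponent, so 0.999 and 1.001 both map to the same power.
    power = static_cast<int>(std::lround(std::log10(ratio)));
    if (power == 0) return false;  // same units, nothing to flag
    double scaled = ratio / std::pow(10.0, power);
    return std::fabs(scaled - 1.0) <= tolerance;
}
```

With a tolerance of 0.005 this flags 1000 versus 10 (exponent 2) and 1 versus 10 (exponent -1), but not 10 versus 10.1.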
For a numeric value X you can compute the nearest rational approximation, p/q.
If you calculate ratio
X = amount_for_source_A / amount_from_source_B;
status = math.rational(X,1e5,p,q);
the ratio is a power of 10 when one of p and q is 1 and the other is a power of 10 (for example p=100, q=1 or p=1, q=1000).
Example:
proc ds2;
package math / overwrite = yes;
method rational(double x, double maxden, in_out integer p, in_out integer q) returns double;
/*
** FROM: https://www.ics.uci.edu/~eppstein/numth/frap.c
** FROM: https://stackoverflow.com/questions/95727/how-to-convert-floats-to-human-readable-fractions
**
** find rational approximation to given real number
** David Eppstein / UC Irvine / 8 Aug 1993
**
** With corrections from Arno Formella, May 2008
**
** Modified for Proc DS2, Richard DeVenezia, Jan 2020.
**
** usage: rational(r,d,p,q)
** x is real number to approx
** maxden is the maximum denominator allowed
** p is return for numerator
** q is return for denominator
** returns 0 if no problems
**
** based on the theory of continued fractions
** if x = a1 + 1/(a2 + 1/(a3 + 1/(a4 + ...)))
** then best approximation is found by truncating this series
** (with some adjustments in the last term).
**
** Note the fraction can be recovered as the first column of the matrix
** ( a1 1 ) ( a2 1 ) ( a3 1 ) ...
** ( 1 0 ) ( 1 0 ) ( 1 0 )
** Instead of keeping the sequence of continued fraction terms,
** we just keep the last partial product of these matrices.
*/
declare integer m[0:1,0:1];
declare double startx e1 e2;
declare integer ai t result p1 q1 p2 q2;
startx = x;
/* initialize matrix */
m[0,0] = 1; m[1,1] = 1;
m[0,1] = 0; m[1,0] = 0;
/* loop finding terms until denom gets too big */
do while (1);
ai = x;
if not ( m[1,0] * ai + m[1,1] < maxden ) then leave;
t = m[0,0] * ai + m[0,1];
m[0,1] = m[0,0];
m[0,0] = t;
t = m[1,0] * ai + m[1,1];
m[1,1] = m[1,0];
m[1,0] = t;
if x = ai then leave; %* AF: division by zero;
x = 1 / (x - ai);
if x > 2147483647 /*x'7FFFFFFF'*/ then leave; %* AF: representation failure;
end;
/* now remaining x is between 0 and 1/ai */
/* approx as either 0 or 1/m where m is max that will fit in maxden */
/* first try zero */
p1 = m[0,0];
q1 = m[1,0];
e1 = startx - 1.0 * p1 / q1;
/* now try other possibility */
ai = (maxden - m[1,1]) / m[1,0];
m[0,0] = m[0,0] * ai + m[0,1];
m[1,0] = m[1,0] * ai + m[1,1];
p2 = m[0,0];
q2 = m[1,0];
e2 = startx - 1.0 * p2 / q2;
if abs(e1) <= abs(e2) then do;
p = p1;
q = q1;
end;
else do;
p = p2;
q = q2;
end;
return 0;
end;
endpackage;
run;
quit;
* Example usage;
proc ds2;
data _null_;
declare package math math();
declare double x;
declare int p1 q1 p q;
method run();
streaminit(12345);
x = 0;
do _n_ = 1 to 20;
p1 = ceil(rand('uniform',9));
q1 = ceil(rand('uniform',9));
x + 1. * p1 / q1;
math.rational (x, 10000, p, q);
put 'add' p1 '/' q1 ' ' x=best16. 'is' p '/' q;
end;
end;
enddata;
run;
quit;
----- LOG -----
add 4 / 1 x= 4 is 4 / 1
add 4 / 2 x= 6 is 6 / 1
add 2 / 7 x=6.28571428571429 is 44 / 7
add 4 / 6 x=6.95238095238095 is 146 / 21
add 5 / 2 x=9.45238095238095 is 397 / 42
add 5 / 2 x= 11.952380952381 is 251 / 21
add 7 / 1 x= 18.952380952381 is 398 / 21
add 8 / 6 x=20.2857142857143 is 142 / 7
add 9 / 3 x=23.2857142857143 is 163 / 7
add 8 / 2 x=27.2857142857143 is 191 / 7
add 3 / 1 x=30.2857142857143 is 212 / 7
add 9 / 3 x=33.2857142857143 is 233 / 7
add 4 / 3 x=34.6190476190476 is 727 / 21
add 4 / 6 x=35.2857142857143 is 247 / 7
add 1 / 9 x=35.3968253968254 is 2230 / 63
add 8 / 3 x=38.0634920634921 is 2398 / 63
add 2 / 4 x=38.5634920634921 is 4859 / 126
add 5 / 1 x=43.5634920634921 is 5489 / 126
add 1 / 2 x=44.0634920634921 is 2776 / 63
add 2 / 7 x=44.3492063492064 is 2794 / 63
DS2 math package

Use the dif function to obtain the difference with several lags without specifying the number of lags

I want a new data set in which the variable y is equal to the value in row n minus the sum of all the previous values.
The original data set:
data test;
input x;
datalines;
20
40
2
5
74
;
run;
I used the dif function, but it only returns the difference at lag one:
data want;
set test;
y = dif(x);
run;
And I want:
_n_ = 1 y = 20
_n_ = 2 y = 40 - 20 = 20
_n_ = 3 y = 2 - (40 + 20) = -58
_n_ = 4 y = 5 - (2 + 40 + 20) = - 57
_n_ = 5 y = 74 - (5 + 2 + 40 + 20) = 7
Thanks.
No need for lag() or dif(). Just make another variable to retain the running total.
data want ;
set test;
y=x-cumm;
output;
cumm+x;
run;
I kept the extra column and output the values before updating the running total to make it clearer what value was used in the calculation of Y.
Obs x y cumm
1 20 20 0
2 40 20 20
3 2 -58 60
4 5 -57 62
5 74 7 67
Possible solution (thanks to Longfish for suggestions):
data want;
set test;
retain total 0;
total = total + x;
y = x - coalesce(lag(total), 0);
run;
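For readers who want the logic without the SAS retain mechanics, here is a minimal C++ sketch of the same running-total idea: each y is the current x minus the sum of everything before it. The function name `diffFromRunningTotal` is made up for this illustration.

```cpp
#include <vector>

// For each element, subtract the sum of all earlier elements.
// Hypothetical helper mirroring y = x - (running total so far).
std::vector<long long> diffFromRunningTotal(const std::vector<long long>& x) {
    std::vector<long long> y;
    y.reserve(x.size());
    long long total = 0;  // sum of the values seen before the current one
    for (long long value : x) {
        y.push_back(value - total);
        total += value;  // update only after the difference is taken
    }
    return y;
}
```

For the input 20, 40, 2, 5, 74 this produces 20, 20, -58, -57, 7, matching the values requested in the question.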

Keeping or deleting a group of observations based on a characteristic of a BY-group

I answered a SAS question a few minutes ago and realized there is a generalization that might be more useful than that one (here). I didn't see this question already in StackOverflow.
The general question is: How can you process and keep an entire BY-group based on some characteristic of the BY-group that you might not know until you have looked at all the observations in the group?
Using input data similar to that from the earlier question:
* For some reason, we are tasked with keeping only observations that
* are in groups of ID_1 and ID_2 that contain at least one obs with
* a VALUE of 0.;
* In the following data, the following ID and ID_2 groups should be
* kept:
* A 2 (2 obs)
* B 1 (3 obs)
* B 3 (2 obs)
* B 4 (1 obs)
* The resulting dataset will have 8 observations.;
data x;
input id $ id_2 value;
datalines;
A 1 1
A 1 1
A 1 1
A 2 0
A 2 1
B 1 0
B 1 1
B 1 3
B 2 1
B 3 0
B 3 0
B 4 0
C 2 4
;
run;
Double DoW loop solution:
data have;
input id $ id_2 value;
datalines;
A 1 1
A 1 1
A 1 1
A 2 0
A 2 1
B 1 0
B 1 1
B 1 3
B 2 1
B 3 0
B 3 0
B 4 0
C 2 4
;
run;
data want;
  do _n_ = 1 by 1 until(last.id_2);
    set have;
    by id id_2;
    flag = sum(flag,value=0);
  end;
  do _n_ = 1 to _n_;
    set have;
    if flag then output;
  end;
  drop flag;
run;
I've tested this against the point approach using ~55m rows and found no appreciable difference in performance. Dataset used:
data have;
do ID = 1 to 10000000;
do id_2 = 1 to ceil(ranuni(1)*10);
do value = floor(ranuni(2) * 5);
output;
end;
end;
end;
run;
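The two-pass structure of the double DoW loop (scan a group to compute its flag, then revisit the same rows and output them) can be sketched in C++ as follows. This is an illustration only: the `Row` struct and `keepGroupsWithZero` are hypothetical names, and the rows are assumed to be already sorted by (id, id_2), as the BY statement requires.

```cpp
#include <cstddef>
#include <string>
#include <vector>

struct Row {
    std::string id;
    int id2;
    int value;
};

// Keeps every row of each (id, id2) group that contains at least one value == 0.
// Assumes the input is sorted so that each group is contiguous.
std::vector<Row> keepGroupsWithZero(const std::vector<Row>& rows) {
    std::vector<Row> kept;
    std::size_t start = 0;
    while (start < rows.size()) {
        // First pass: find the end of the group and note whether it has a zero.
        std::size_t end = start;
        bool hasZero = false;
        while (end < rows.size() && rows[end].id == rows[start].id &&
               rows[end].id2 == rows[start].id2) {
            if (rows[end].value == 0) hasZero = true;
            ++end;
        }
        // Second pass: copy the whole group only if the flag was set.
        if (hasZero) {
            kept.insert(kept.end(), rows.begin() + start, rows.begin() + end);
        }
        start = end;
    }
    return kept;
}
```

On the 13-row sample data this keeps the groups A 2, B 1, B 3 and B 4, for 8 rows in total.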
My answer might not be the most efficient, especially for large datasets, and I'm interested in seeing other possible answers. Here it is:
* For some reason, we are tasked with keeping only observations that
* are in groups of ID_1 and ID_2 that contain at least one obs with
* a VALUE of 0.;
* In the following data, the following ID and ID_2 groups should be
* kept:
* A 2 (2 obs)
* B 1 (3 obs)
* B 3 (2 obs)
* B 4 (1 obs)
* The resulting dataset will have 8 observations.;
data x;
input id $ id_2 value;
datalines;
A 1 1
A 1 1
A 1 1
A 2 0
A 2 1
B 1 0
B 1 1
B 1 3
B 2 1
B 3 0
B 3 0
B 4 0
C 2 4
;
run;
* I realize the data are already sorted, but I think it is better
* not to assume they are.;
proc sort data=x;
by id id_2;
run;
data obstokeep;
keep id id_2 value;
retain startptr haszero;
* This SET statement reads through the dataset in sequence and
* uses the CUROBS option to obtain the observation number. In
* most situations, this will be the same as the _N_ automatic
* variable, but CUROBS is probably safer.;
set x curobs=myptr;
by id id_2;
* When this is the first observation in a BY-group, save the
* current observation number (pointer).
* Also initialize a flag variable that will become 1 if any
* obs contains a VALUE of 0;
* The variables are in a RETAIN statement, so they keep their
* values as the SET statement above is executed for each obs
* in the BY-group.;
if first.id_2
then do;
startptr=myptr;
haszero=0;
end;
* This statement is executed for each observation. We check
* whether VALUE is 0 and, if so, record that fact.;
if value = 0
then haszero=1;
* At the end of the BY-group, we check to see if there were
* any observations with VALUE = 0. If so, we go back using
* another SET statement, re-read them via direct access, and
* write them to the output dataset.
* (Note that if VALUE order is not relevant, you can gain a bit
* more efficiency by writing the current obs first, then going
* back to get the rest.);
if last.id_2 and haszero
then do;
* When LAST and FIRST at the same time, there is only one
* obs, so no need to backtrack, just output and go on.;
if first.id_2
then output obstokeep;
else do;
* Here we assume that the observations are sequential
* (which they will be for a sequential SET statement),
* so we re-read these observations using another SET
* statement with the POINT option for direct access
* starting with the first obs of the by-group (the
* saved pointer) and ending with the current one (the
* current pointer).;
do i=startptr to myptr;
set x point=i;
output obstokeep;
end;
end;
end;
run;
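proc sql;
/* DISTINCT stops groups with several zero rows (e.g. B 3) from being duplicated by the join */
select a.*, b.value
from (select distinct id, id_2 from have where value=0) a
left join have b
on a.id=b.id and a.id_2=b.id_2;
quit;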

Understanding recursion in c++

I think I understand the principle behind recursion, for example the stack-like behaviour and the way the program "yo-yos" back through the function calls. However, I'm having trouble figuring out why certain functions return the values that they do. The code below returns 160. Is this due to the return 5 playing a part? I think I'm right in saying it will go 4*2 + 8*2 + 12*2 etc., but I really doubt that when I change my values.
Would anybody be able to offer a brief explanation as to which values are being multiplied?
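#include <iostream>

int mysteryFunction(int n)
{
    if (n > 2)
    {
        // Recursive case: step down by 4 and double whatever comes back.
        return mysteryFunction(n - 4) * 2;
    }
    else return 5;  // base case
}

int main()
{
    std::cout << mysteryFunction(20);  // prints 160
}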
If you are interested in the actual call stack:
mysteryFunction(20):
[n > 2] -> mysteryFunction(16) * 2
[n > 2] -> mysteryFunction(12) * 2
[n > 2] -> mysteryFunction(8) * 2
[n > 2] -> mysteryFunction(4) * 2
[n > 2] -> mysteryFunction(0) * 2
[n <= 2] -> 5
5 * 2 = 10
10 * 2 = 20
20 * 2 = 40
40 * 2 = 80
80 * 2 = 160
More generally: 20 = 4*5, so 5 * 2^5 = 5 * 32 = 160.
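To check that "base value times 2 to the number of calls" reasoning for other inputs, here is a small sketch that puts the recursive function next to a closed form. `mysteryClosedForm` is a hypothetical name. The doubling count `(n + 1) / 4` is integer arithmetic for ceil((n - 2) / 4), which is how many times n must drop by 4 before it is no longer greater than 2.

```cpp
// The function from the question, unchanged in behaviour.
int mysteryFunction(int n) {
    if (n > 2) return mysteryFunction(n - 4) * 2;
    return 5;
}

// Closed form (illustrative): count the doublings, then apply them to the base value 5.
int mysteryClosedForm(int n) {
    int doublings = (n > 2) ? (n + 1) / 4 : 0;  // ceil((n - 2) / 4) when n > 2
    return 5 * (1 << doublings);
}
```

For n = 20 the count is 21 / 4 = 5 doublings, so both functions give 5 * 32 = 160.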
mysteryFunction(20) => 80 * 2 = 160
mysteryFunction(16) => 40 * 2 = 80
mysteryFunction(12) => 20 * 2 = 40
mysteryFunction(8) => 10 * 2 = 20
mysteryFunction(4) => 5 * 2 = 10
mysteryFunction(0) => 5
Recursion doesn't yo-yo, it just nests deeply.
In your case, the if statement results in either a) the function being called from within the function, or b) a return value... let's look at it running...
A- mysteryFunction(20)
B-- mysteryFunction(16)
C--- mysteryFunction(12)
D---- mysteryFunction(8)
E----- mysteryFunction(4)
F------ mysteryFunction(0) <-- this is the first time (n > 2) is false
Line F is the first time n > 2 is false, which means it returns a 5.
Line F was called by line E, and the value line E gets (5) is multiplied by 2 and returned. So line E returns 10.
Line E was called by line D... and the value it gets (10) is multiplied by 2 and returned, so line D returns 20.
... and so on.
Quick version... let's order these to match the order they act on the value...
F: 5
E: F * 2 = 10
D: E * 2 = 20
C: D * 2 = 40
B: C * 2 = 80
A: B * 2 = 160
I suggest you read this article on Wikipedia about recursion: http://en.wikipedia.org/wiki/Recursion
In a nutshell, a recursive function is one that calls itself until it reaches a base case (this is the key). If it never reaches the base case, your function will keep calling itself forever (in practice, typically until the stack overflows). In the case of your function, get a piece of paper and follow its path, picking any number as an example; it is the best way to figure out how it works. The factorial is a good example:
The factorial of a number, let's say 5, is 5! = 5 * 4 * 3 * 2 * 1, which is 120. Try it; the principles of recursion are the same regardless of the problem.
Here's an example for a factorial function.
Recursion in c++ Factorial Program
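As a minimal sketch of the base-case idea (not the code from the linked post), a recursive factorial might look like this:

```cpp
// Recursive factorial: the base case (n <= 1) is what stops the chain of calls.
unsigned long long factorial(unsigned int n) {
    if (n <= 1) return 1;         // base case: 0! = 1! = 1
    return n * factorial(n - 1);  // recursive case: n! = n * (n - 1)!
}
```

Tracing factorial(5) on paper gives 5 * 4 * 3 * 2 * 1 = 120, with the multiplications happening as the calls return.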
Just go through the code and substitute the values.
mysteryFunction(20) -> mysteryFunction(16) * 2
mysteryFunction(16) * 2 -> mysteryFunction(12) * 2 * 2
mysteryFunction(12) * 2 * 2 -> mysteryFunction(8) * 2 * 2 * 2
mysteryFunction(8) * 2 * 2 * 2 -> mysteryFunction(4) * 2 * 2 * 2 * 2
mysteryFunction(4) * 2 * 2 * 2 * 2 -> mysteryFunction(0) * 2 * 2 * 2 * 2 * 2
mysteryFunction(0) * 2 * 2 * 2 * 2 * 2 -> 5 * 2 * 2 * 2 * 2 * 2 -> 160

FIFO Page Replacement Algorithm - Counting Page Faults

I'm currently reading about Page Replacement Algorithms, and have been looking at a couple of examples with regards to the FIFO (First In, First Out) method.
My question is as follows; how do you count the number of page faults, as I have seen different practices.
For instance:
Example 1 (on page 9) and Example 2 take the exact same sequence. The first counts the number of page faults to be 12, whereas the second states the number is 15. They are using the same number of frames, 3.
The sequence is:
Sequence: 7 0 1 2 0 3 0 4 2 3 0 3 2 1 2 0 1 7 0 1
          ---------------------------------------
          7 7 7 0 0 1 2 3 0 4 2 2 2 3 0 0 0 1 2 7
            0 0 1 1 2 3 0 4 2 3 3 3 0 1 1 1 2 7 0
              1 2 2 3 0 4 2 3 0 0 0 1 2 2 2 7 0 1
          ---------------------------------------
PF (1):         *   * * * * * *     * *     * * *   Total = 12 page faults
PF (2):   * * * *   * * * * * *     * *     * * *   Total = 15 page faults
Hence, my question is; which method is the correct method? Do you count the first three instances as page faults?
If so, given the sequence:
Sequence: A B C D A E F G H I A J
          -----------------------
          A A A A A B C D E F G H
            B B B B C D E F G H I
              C C C D E F G H I A
                D D E F G H I A J
          -----------------------
PF (1):   * * * *   * * * * * * *   Total = 11 page faults
PF (2):             * * * * * * *   Total = 7 page faults
Any help would be highly appreciated. Thank you guys!
"Hence, my question is; which method is the correct method? Do you count the first three instances as page faults?"
Yes. A page fault occurs when the referenced page is not found in any of the frames. Therefore the first references, which load pages into empty frames, are always page faults.
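A small simulation makes the counting rule concrete. This is a sketch under the convention described above: every reference to a page that is not resident counts as a fault, including the initial loads into empty frames. The function name `countFifoFaults` is made up for this illustration.

```cpp
#include <algorithm>
#include <cstddef>
#include <deque>
#include <vector>

// Counts FIFO page faults, including the initial loads into empty frames.
int countFifoFaults(const std::vector<int>& refs, std::size_t frameCount) {
    if (frameCount == 0) return static_cast<int>(refs.size());  // nothing can stay resident
    std::deque<int> frames;  // front = oldest page, back = newest
    int faults = 0;
    for (int page : refs) {
        if (std::find(frames.begin(), frames.end(), page) != frames.end()) {
            continue;  // hit: FIFO does not reorder on access
        }
        ++faults;
        if (frames.size() == frameCount) frames.pop_front();  // evict the oldest page
        frames.push_back(page);
    }
    return faults;
}
```

Run on the first sequence with 3 frames this returns 15. Run on the second sequence with 4 frames (letters encoded as integers) it returns 11.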