Detecting unit differences in data (SAS)

I have two sets of financial data that tend to differ because of unit errors, e.g. $10000 in one dataset may be $1000 in the other.
I'm trying to code a check for such differences. The only way I can think of is to divide the two variables and see whether the ratio appears in a table of 0.001, 0.01, 0.1, 10, 100, etc., but it would be hard to catch all of the differences that way.
Is there a smarter way to do this?

Use proc compare. Be sure the two datasets are sorted in identical order, either by row or by specific groups. Use the by statement as needed. More info on options can be found in the documentation.
Example - compare a modified cars dataset with sashelp.cars:
data cars_modified;
set sashelp.cars;
if(mod(_N_, 2) = 0) then msrp = msrp - 100;
run;
proc compare base = sashelp.cars
compare = cars_modified
out = out_differences
outnoequal
outdif
noprint;
var msrp;
run;
Only the observations with differences are output in out_differences:
_TYPE_ _OBS_ MSRP
DIF 2 $-100
DIF 4 $-100
DIF 6 $-100
DIF 8 $-100
DIF 10 $-100
...

So you appear to be asking how to find cases where X/Y is exactly 1.00Exx, where xx is a nonzero integer.
data _null_;
do x=1,10,100,1000;
do y=1,2,3,10.1,10 ;
ratio = x/y;
power = floor(log10(ratio));
if power ne 0 and 1.00 = round(ratio/10**power,0.01) then
put 'Ratio of ' x 'over ' y 'is 10**' power '.'
;
end;
end;
run;
Results:
Ratio of 1 over 10 is 10**-1 .
Ratio of 10 over 1 is 10**1 .
Ratio of 100 over 1 is 10**2 .
Ratio of 100 over 10 is 10**1 .
Ratio of 1000 over 1 is 10**3 .
Ratio of 1000 over 10 is 10**2 .
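The same check can be sketched outside SAS. The Python below is only an illustration of the idea, not a port of the data step above: the function name `unit_shift` and the tolerance `tol` are made up for this example, and it rounds the base-10 logarithm instead of flooring it, so ratios slightly below a power of ten are also caught.

```python
import math

def unit_shift(a, b, tol=0.005):
    """Return the nonzero integer k if a / b is within tol of 10**k, else None.

    Hypothetical helper: a and b are the two amounts being compared.
    """
    if a <= 0 or b <= 0:
        return None  # the ratio test only makes sense for positive amounts
    ratio = a / b
    k = round(math.log10(ratio))  # nearest power of ten
    if k == 0:
        return None  # amounts are on the same scale
    # Compare the mantissa with 1, allowing a small relative tolerance.
    if abs(ratio / 10 ** k - 1.0) <= tol:
        return k
    return None

for a, b in [(10000, 1000), (1, 10), (100, 10.1), (5, 5)]:
    print(a, b, unit_shift(a, b))
```

A result of `None` means the pair either agrees or differs by something other than a power of ten; any other result is the exponent of the suspected unit error.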

For a numeric value X you can compute the nearest rational approximation, p/q.
If you calculate the ratio
X = amount_for_source_A / amount_from_source_B;
status = math.rational(X,1e5,p,q);
the ratio is an exact power of 10 when one of p and q is 1 and the other is a power of 10 (10, 100, 1000, ...).
Example:
proc ds2;
package math / overwrite = yes;
method rational(double x, double maxden, in_out integer p, in_out integer q) returns double;
/*
** FROM: https://www.ics.uci.edu/~eppstein/numth/frap.c
** FROM: https://stackoverflow.com/questions/95727/how-to-convert-floats-to-human-readable-fractions
**
** find rational approximation to given real number
** David Eppstein / UC Irvine / 8 Aug 1993
**
** With corrections from Arno Formella, May 2008
**
** Modified for Proc DS2, Richard DeVenezia, Jan 2020.
**
** usage: rational(x,maxden,p,q)
** x is real number to approx
** maxden is the maximum denominator allowed
** p is return for numerator
** q is return for denominator
** returns 0 if no problems
**
** based on the theory of continued fractions
** if x = a1 + 1/(a2 + 1/(a3 + 1/(a4 + ...)))
** then best approximation is found by truncating this series
** (with some adjustments in the last term).
**
** Note the fraction can be recovered as the first column of the matrix
** ( a1 1 ) ( a2 1 ) ( a3 1 ) ...
** ( 1 0 ) ( 1 0 ) ( 1 0 )
** Instead of keeping the sequence of continued fraction terms,
** we just keep the last partial product of these matrices.
*/
declare integer m[0:1,0:1];
declare double startx e1 e2;
declare integer ai t result p1 q1 p2 q2;
startx = x;
/* initialize matrix */
m[0,0] = 1; m[1,1] = 1;
m[0,1] = 0; m[1,0] = 0;
/* loop finding terms until denom gets too big */
do while (1);
ai = x;
if not ( m[1,0] * ai + m[1,1] < maxden ) then leave;
t = m[0,0] * ai + m[0,1];
m[0,1] = m[0,0];
m[0,0] = t;
t = m[1,0] * ai + m[1,1];
m[1,1] = m[1,0];
m[1,0] = t;
if x = ai then leave; %* AF: division by zero;
x = 1 / (x - ai);
if x > 2147483647 /*x'7FFFFFFF'*/ then leave; %* AF: representation failure;
end;
/* now remaining x is between 0 and 1/ai */
/* approx as either 0 or 1/m where m is max that will fit in maxden */
/* first try zero */
p1 = m[0,0];
q1 = m[1,0];
e1 = startx - 1.0 * p1 / q1;
/* now try other possibility */
ai = (maxden - m[1,1]) / m[1,0];
m[0,0] = m[0,0] * ai + m[0,1];
m[1,0] = m[1,0] * ai + m[1,1];
p2 = m[0,0];
q2 = m[1,0];
e2 = startx - 1.0 * p2 / q2;
if abs(e1) <= abs(e2) then do;
p = p1;
q = q1;
end;
else do;
p = p2;
q = q2;
end;
return 0;
end;
endpackage;
run;
quit;
* Example usage;
proc ds2;
data _null_;
declare package math math();
declare double x;
declare int p1 q1 p q;
method run();
streaminit(12345);
x = 0;
do _n_ = 1 to 20;
p1 = ceil(rand('uniform',9));
q1 = ceil(rand('uniform',9));
x + 1. * p1 / q1;
math.rational (x, 10000, p, q);
put 'add' p1 '/' q1 ' ' x=best16. 'is' p '/' q;
end;
end;
enddata;
run;
quit;
----- LOG -----
add 4 / 1 x= 4 is 4 / 1
add 4 / 2 x= 6 is 6 / 1
add 2 / 7 x=6.28571428571429 is 44 / 7
add 4 / 6 x=6.95238095238095 is 146 / 21
add 5 / 2 x=9.45238095238095 is 397 / 42
add 5 / 2 x= 11.952380952381 is 251 / 21
add 7 / 1 x= 18.952380952381 is 398 / 21
add 8 / 6 x=20.2857142857143 is 142 / 7
add 9 / 3 x=23.2857142857143 is 163 / 7
add 8 / 2 x=27.2857142857143 is 191 / 7
add 3 / 1 x=30.2857142857143 is 212 / 7
add 9 / 3 x=33.2857142857143 is 233 / 7
add 4 / 3 x=34.6190476190476 is 727 / 21
add 4 / 6 x=35.2857142857143 is 247 / 7
add 1 / 9 x=35.3968253968254 is 2230 / 63
add 8 / 3 x=38.0634920634921 is 2398 / 63
add 2 / 4 x=38.5634920634921 is 4859 / 126
add 5 / 1 x=43.5634920634921 is 5489 / 126
add 1 / 2 x=44.0634920634921 is 2776 / 63
add 2 / 7 x=44.3492063492064 is 2794 / 63
DS2 math package
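For comparison, the rational-approximation idea can be sketched in Python, whose standard library provides `fractions.Fraction.limit_denominator`. This is not the DS2 package above; the names `unit_error_factor` and `is_power_of_ten` are invented for this example, and the second amount is assumed to be nonzero.

```python
from fractions import Fraction

def is_power_of_ten(n):
    """True if n is 10, 100, 1000, ... (a positive power of ten)."""
    if n < 10:
        return False
    while n % 10 == 0:
        n //= 10
    return n == 1

def unit_error_factor(a, b, max_den=100000):
    """Return a / b as a Fraction if it is a power of ten other than 1, else None.

    Hypothetical helper: b is assumed to be nonzero.
    """
    frac = Fraction(a / b).limit_denominator(max_den)
    p, q = frac.numerator, frac.denominator
    # One side must be 1 and the other a power of ten.
    if q == 1 and is_power_of_ten(p):
        return frac
    if p == 1 and is_power_of_ten(q):
        return frac
    return None

print(unit_error_factor(10000, 1000))
print(unit_error_factor(1000, 10000))
print(unit_error_factor(44, 7))
```

Checking only `p == 1` or `q == 1` would also flag ratios such as 1/7 or 6/1, which is why the other side is tested for being a power of ten.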


Debugging old Fortran code for sediment dynamics

I am looking at some Fortran code from an old scanned paper. The scan quality is not great so I may have copied it wrong. I tried to run this using an online Fortran compiler but it bombs out. Not being familiar with Fortran, I was wondering if someone can point out where the syntax does not make sense? The code is from a paper on sediment dynamics:
Komar, P.D. and Miller, M.C., 1975. On the comparison between the threshold of sediment motion under waves and unidirectional currents with a discussion of the practical evaluation of the threshold: Reply. Journal of Sedimentary Research, 45(1).
PROGRAM TSHOLD
REAL LI, LO
G = 981.0
PIE = 3.1416
RHOW = 1.00
READ (6O,1) DIAM, RHOS
1 FORMAT (2X, F6.3,2X, F5.3)
IF(DIAM .LT. 0.05) GO TO 5
A = 0.463 * PIE
B = 0.25
GO TO 7
5 A = 0.21
B = 0.50
7 PWR = 1.0 / (2.0 - B)
FAC = (A * (RHOS - RHOW) * G/(RHOW * PIE**B))**PWR
FAC1 = FAC * DIAM**((1.0 - B) * PWR)
T = 1.0
15 J = 1.20
LD = 156.13 * (T**2)
UM = FAC1 * T**(B*PWR)
WRITE(61,9) DIAM, T, UM
9 FORMAT(1H0, 10X, 17HGRAIN DIAMETER = ,F6.3,1X,2HCM //
1 11X, 14HWAVE PERIOD = ,F5.2, 1X, 3HSEC //
2 11X, 22HORBITAL VELOCITY, UM = ,F6.2, 1X, 6HCM/SECl //
3 20X, 6HHEIGHT, 5X, 5HDEPTH, 8X, 3HH/L, 6X, 7HH/DEPTH //
4 22X, 2HCM, 8X, 2HCM /)
C INCREMENT WAVE HEIGHT, CALCULATE DEPTH
H = 10.0
DO 12 K = 1.60
SING = PIE * H / (UM * T)
X = SING
IF(X.LT.1.0) GO TO 30
30 ASINH = X - 0.16666*X**3.0 + 0.07500* X ** 5.0 - 0.04464 * X ** 7.0
1 + 0.03038 * X ** 9.0 - 0.02237 * X ** 11.0
32 LI = LD * (SINH(ASINH)/COSH(ASINH))
OPTH = ASINH * LI / 6.2832
C CHECK WAVE STABILITY
RATIO = H / DPTH
IF(RATIO.GE.0.78) GO TO 11
STEEP = H / LI
TEST = 0.142 * (SINH(ASINH)/COSH(ASINH))
IF(STEEP.GE.TEST) GO TO 11
WRITE(61,10) H, OPTH, STEEP, RATIO
I0 FORMAT(IH0, 20X, F5.1, 4X, E9.3, 4X, F5.3, 4X, F4.2)
11 H = H + 10.0
12 CONTINUE
T = T + 1.0
15 CONTINUE
END
The problem is more likely that old Fortran requires fixed-form source formatting, where the column in which a statement starts is significant.
Here are some general rules:
Normal statements start in column 7 or beyond.
Lines cannot exceed 72 columns.
Any character other than a blank or a zero in column 6 marks the line as a continuation of the line above. I see that in the code above in the lines following 9 FORMAT(..
A number placed in columns 1-5 is a label, which can be the target of a GO TO statement, a DO statement or a format specification.
The character C in the first column (and, with some compilers, any character in the first column) marks the line as a comment.
See https://people.cs.vt.edu/~asandu/Courses/MTU/CS2911/fortran_notes/node4.html for more info.
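To make the column rules concrete, here is a small Python sketch that splits one fixed-form line into its fields. It is only an illustration: `split_fixed_form` is a made-up name, tabs are not handled, and only the comment markers C, c, * and ! in column 1 are recognised.

```python
def split_fixed_form(line):
    """Split one fixed-form Fortran source line into its column fields."""
    # Column 1: a comment marker makes the whole line a comment.
    if line[:1] in ("C", "c", "*", "!"):
        return {"kind": "comment", "text": line[1:].strip()}
    padded = line.ljust(72)[:72]        # anything past column 72 is ignored
    label = padded[0:5].strip()         # columns 1-5: optional statement label
    is_cont = padded[5] not in (" ", "0")  # column 6: continuation marker
    statement = padded[6:72].strip()    # columns 7-72: the statement itself
    kind = "continuation" if is_cont else "statement"
    return {"kind": kind, "label": label, "statement": statement}

print(split_fixed_form("    5 A = 0.21"))
print(split_fixed_form("     1 11X, 14HWAVE PERIOD = ,F5.2"))
print(split_fixed_form("C CHECK WAVE STABILITY"))
```

Running a transcribed listing through a check like this shows quickly whether a label or continuation marker has drifted into the wrong column.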
Based on the rules above, here is how to enter the code with the correct spacing. I ran the F77 code through a converter to make it compatible with both F77 and F90. The code below might now compile with the online compiler.
      PROGRAM TSHOLD
      REAL LI, LO
      G = 981.0
      PIE = 3.1416
      RHOW = 1.00
      READ (60,1) DIAM, RHOS
    1 FORMAT (2X, F6.3,2X, F5.3)
      IF(DIAM .LT. 0.05) GO TO 5
      A = 0.463 * PIE
      B = 0.25
      GO TO 7
    5 A = 0.21
      B = 0.50
    7 PWR = 1.0 / (2.0 - B)
      FAC = (A * (RHOS - RHOW) * G/(RHOW * PIE**B))**PWR
      FAC1 = FAC * DIAM**((1.0 - B) * PWR)
      T = 1.0
      DO 15 J=1,20
! LD IN THE TRANSCRIPTION IS ASSUMED TO BE THE DECLARED REAL LO
      LO = 156.13 * (T**2)
      UM = FAC1 * T**(B*PWR)
      WRITE(61,9) DIAM, T, UM
    9 FORMAT(1H0, 10X, 17HGRAIN DIAMETER = ,F6.3,1X,2HCM //             &
     & 11X, 14HWAVE PERIOD = ,F5.2, 1X, 3HSEC //                        &
     & 11X, 22HORBITAL VELOCITY, UM = ,F6.2, 1X, 6HCM/SEC //            &
     & 20X, 6HHEIGHT, 5X, 5HDEPTH, 8X, 3HH/L, 6X, 7HH/DEPTH //          &
     & 22X, 2HCM, 8X, 2HCM /)
! INCREMENT WAVE HEIGHT, CALCULATE DEPTH
      H = 10.0
      DO 12 K = 1,60
      SING = PIE * H / (UM * T)
      X = SING
      IF(X.LT.1.0) GO TO 30
   30 ASINH = X - 0.16666*X**3.0 + 0.07500*X**5.0 - 0.04464*X**7.0      &
     &      + 0.03038*X**9.0 - 0.02237*X**11.0
   32 LI = LO * (SINH(ASINH)/COSH(ASINH))
! OPTH IN THE TRANSCRIPTION IS ASSUMED TO BE DPTH (LETTER O FOR D)
      DPTH = ASINH * LI / 6.2832
! CHECK WAVE STABILITY
      RATIO = H / DPTH
      IF(RATIO.GE.0.78) GO TO 11
      STEEP = H / LI
      TEST = 0.142 * (SINH(ASINH)/COSH(ASINH))
      IF(STEEP.GE.TEST) GO TO 11
      WRITE(61,10) H, DPTH, STEEP, RATIO
   10 FORMAT(1H0, 20X, F5.1, 4X, E9.3, 4X, F5.3, 4X, F4.2)
   11 H = H + 10.0
   12 CONTINUE
      T = T + 1.0
   15 CONTINUE
      END
I found several transcription errors: dots in place of commas, the letter O in place of zero, and a missing DO statement.

Use the dif function to obtain the difference with several lags without specifying the number of lags

I want a new data set in which the variable y equals the value of x in row n minus the sum of all previous values of x.
The original data set:
data test;
input x;
datalines;
20
40
2
5
74
;
run;
I used the dif function, but it returns only the difference at a lag of one:
data want;
set test;
y = dif(x);
run;
And I want:
_n_ = 1 y = 20
_n_ = 2 y = 40 - 20 = 20
_n_ = 3 y = 2 - (40 + 20) = -58
_n_ = 4 y = 5 - (2 + 40 + 20) = - 57
_n_ = 5 y = 74 - (5 + 2 + 40 + 20) = 7
Thanks.
No need for lag() or dif(). Just make another variable to retain the running total.
data want ;
set test;
y=x-cumm;
output;
cumm+x;
run;
I kept the extra column and output the values before updating the running total to make it clearer what value was used in the calculation of Y.
Obs x y cumm
1 20 20 0
2 40 20 20
3 2 -58 60
4 5 -57 62
5 74 7 67
Possible solution (thanks to Longfish for suggestions):
data want;
set test;
retain total 0;
total = total + x;
y = x - coalesce(lag(total), 0);
run;
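The running-total logic shared by both answers can be sketched in a few lines of Python (used here only as a language-neutral illustration; `diff_from_running_total` is a made-up name). Each y is computed before the current x is added to the total, which mirrors placing output before the sum statement.

```python
def diff_from_running_total(values):
    """For each x, return x minus the sum of all earlier values."""
    total = 0          # sum of the values seen so far
    result = []
    for x in values:
        result.append(x - total)  # use the total from the previous rows
        total += x                # then update it with the current row
    return result

print(diff_from_running_total([20, 40, 2, 5, 74]))  # [20, 20, -58, -57, 7]
```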

Keeping or deleting a group of observations based on a characteristic of a BY-group

I answered a SAS question a few minutes ago and realized there is a generalization that might be more useful than that one (here). I didn't see this question already on Stack Overflow.
The general question is: How can you process and keep an entire BY-group based on some characteristic of the BY-group that you might not know until you have looked at all the observations in the group?
Using input data similar to that from the earlier question:
* For some reason, we are tasked with keeping only observations that
* are in groups of ID_1 and ID_2 that contain at least one obs with
* a VALUE of 0.;
* In the following data, the following ID and ID_2 groups should be
* kept:
* A 2 (2 obs)
* B 1 (3 obs)
* B 3 (2 obs)
* B 4 (1 obs)
* The resulting dataset will have 8 observations.;
data x;
input id $ id_2 value;
datalines;
A 1 1
A 1 1
A 1 1
A 2 0
A 2 1
B 1 0
B 1 1
B 1 3
B 2 1
B 3 0
B 3 0
B 4 0
C 2 4
;
run;
Double DoW loop solution:
data have;
input id $ id_2 value;
datalines;
A 1 1
A 1 1
A 1 1
A 2 0
A 2 1
B 1 0
B 1 1
B 1 3
B 2 1
B 3 0
B 3 0
B 4 0
C 2 4
;
run;
data want;
do _n_ = 1 by 1 until(last.id_2);
set have;
by id id_2;
flag = sum(flag,value=0);
end;
do _n_ = 1 to _n_;
set have;
if flag then output;
end;
drop flag;
run;
I've tested this against the point approach using ~55m rows and found no appreciable difference in performance. Dataset used:
data have;
do ID = 1 to 10000000;
do id_2 = 1 to ceil(ranuni(1)*10);
do value = floor(ranuni(2) * 5);
output;
end;
end;
end;
run;
My answer might not be the most efficient, especially for large datasets, and I'm interested in seeing other possible answers. Here it is:
* For some reason, we are tasked with keeping only observations that
* are in groups of ID_1 and ID_2 that contain at least one obs with
* a VALUE of 0.;
* In the following data, the following ID and ID_2 groups should be
* kept:
* A 2 (2 obs)
* B 1 (3 obs)
* B 3 (2 obs)
* B 4 (1 obs)
* The resulting dataset will have 8 observations.;
data x;
input id $ id_2 value;
datalines;
A 1 1
A 1 1
A 1 1
A 2 0
A 2 1
B 1 0
B 1 1
B 1 3
B 2 1
B 3 0
B 3 0
B 4 0
C 2 4
;
run;
* I realize the data are already sorted, but I think it is better
* not to assume they are.;
proc sort data=x;
by id id_2;
run;
data obstokeep;
keep id id_2 value;
retain startptr haszero;
* This SET statement reads through the dataset in sequence and
* uses the CUROBS option to obtain the observation number. In
* most situations, this will be the same as the _N_ automatic
* variable, but CUROBS is probably safer.;
set x curobs=myptr;
by id id_2;
* When this is the first observation in a BY-group, save the
* current observation number (pointer).
* Also initialize a flag variable that will become 1 if any
* obs contains a VALUE of 0;
* The variables are in a RETAIN statement, so they keep their
* values as the SET statement above is executed for each obs
* in the BY-group.;
if first.id_2
then do;
startptr=myptr;
haszero=0;
end;
* This statement is executed for each observation. We check
* whether VALUE is 0 and, if so, record that fact.;
if value = 0
then haszero=1;
* At the end of the BY-group, we check to see if there were
* any observations with VALUE = 0. If so, we go back using
* another SET statement, re-read them via direct access, and
* write them to the output dataset.
* (Note that if VALUE order is not relevant, you can gain a bit
* more efficiency by writing the current obs first, then going
* back to get the rest.);
if last.id_2 and haszero
then do;
* When LAST and FIRST at the same time, there is only one
* obs, so no need to backtrack, just output and go on.;
if first.id_2
then output obstokeep;
else do;
* Here we assume that the observations are sequential
* (which they will be for a sequential SET statement),
* so we re-read these observations using another SET
* statement with the POINT option for direct access
* starting with the first obs of the by-group (the
* saved pointer) and ending with the current one (the
* current pointer).;
do i=startptr to myptr;
set x point=i;
output obstokeep;
end;
end;
end;
run;
proc sql;
/* distinct keeps a group with several zero rows from being joined more than once */
select a.*,b.value from (select distinct id,id_2 from have where value=0)a left join have b
on a.id=b.id and a.id_2=b.id_2;
quit;
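The approaches above share one pattern: look at a whole BY-group first, then decide whether to emit its rows. A minimal Python sketch of that pattern follows, assuming the rows are already sorted by the two ID columns; `keep_groups_with_zero` is a name made up for this example.

```python
from itertools import groupby

def keep_groups_with_zero(rows):
    """Keep every row of each (id, id_2) group that has at least one value of 0.

    rows: iterable of (id, id_2, value) tuples, already sorted by (id, id_2).
    """
    kept = []
    for _, group in groupby(rows, key=lambda r: (r[0], r[1])):
        members = list(group)  # first pass: collect the whole group
        if any(value == 0 for _, _, value in members):
            kept.extend(members)  # second pass: emit all of its rows
    return kept

data = [
    ("A", 1, 1), ("A", 1, 1), ("A", 1, 1), ("A", 2, 0), ("A", 2, 1),
    ("B", 1, 0), ("B", 1, 1), ("B", 1, 3), ("B", 2, 1), ("B", 3, 0),
    ("B", 3, 0), ("B", 4, 0), ("C", 2, 4),
]
print(len(keep_groups_with_zero(data)))  # 8
```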

How can I write an if condition for my variable in GLPK?

Here is my full problem:
Information:
*Max. total investment: $125
*Pay-off is the sum of the units bought x pay-off/unit
*Cost per investment: Buy-in cost + cost/unit x number of units if you buy at least one unit
*The cost is sum of the costs per investment
Constraints:
*You may not invest in both 2 and 5.
*You may invest in 1 only if you invest at least one of 2 and 3.
*You must invest at least two of 3,4,5.
*You may not invest more than max number of units.
Problem: Maximize profit : pay-off - cost
xi: # of units i ∈ {1,2,3,4,5}
yi=1 if xi>0 else yi=0
cost = sum{i in I} buyInCost_i * yi + cost-unit_i*xi
pay-off = sum{i in I} (pay-off/unit)_i*xi
profit = pay-off - cost
Maximize profit
Subject to
y2+y5 <= 1
y1<= y2+y3
y3+y4+y5 >= 2
x1<=5, x2<=4, x3<=5, x4<=7, x5<=3
cost<=125
Here is my question:
For example I have this binary variable y
yi=1 if xi>0 else yi=0 and i ∈ {1,2,3,4,5}
I declared i as a data set
set I;
data;
set I := 1 2 3 4 5;
I don't know how to add if else condition to y variable in glpk. Can you please help me out?
My modelling :
set I;
/*if x[i]>0 y[i]=1 else y[i]=0 ?????*/
var y{i in I}, binary;
param a{i in I};
/* buy-in cost of investment i */
param b{i in I};
/* cost per unit of investment i */
param c{i in I};
/* pay-off per unit of investment i */
param d{i in I};
/* max number of units of investment i */
var x{i in I} >=0;
/* Number of units that is bought of investment i */
var po := sum{i in I} c[i]*x[i];
var cost := sum{i in I} a[i]*y[i] + b[i]*x[i];
maximize profit: po-cost;
s.t. c1: y[2]+y[5]<=1;
s.t. c2: y[1]<y[2]+y[3];
s.t. c3: y[3]+y[4]+y[5]>=2;
s.t. c4: x[1]<=5
x[2]<=4
x[3]<=5
x[4]<=7
x[5]<=3;
s.t. c5: cost <=125;
s.t. c6{i in I}: M * y[i] > x[i]; // if condition of y[i]
set I := 1 2 3 4 5;
param a :=
1 25
2 35
3 28
4 20
5 40;
param b :=
1 5
2 7
3 6
4 4
5 8;
param c :=
1 15
2 25
3 17
4 13
5 18;
param d :=
1 5
2 4
3 5
4 7
5 3;
param M := 10000;
I am getting this syntax error:
problem.mod:21: syntax error in variable statement
Context: ...I } ; param d { i in I } ; var x { i in I } >= 0 ; var po :=
MathProg model processing error
You can't do that directly (there is no way to write an if constraint as such in an LP).
However, there are workarounds.
For example, you can write:
M * yi >= xi
where M is a large constant (at least as large as any value xi can take).
This way:
if xi > 0, the constraint forces yi > 0, that is yi == 1 since yi is binary (provided M is large enough).
if xi == 0, the constraint is always satisfied, and yi will be 0 unless another constraint forces it to 1, since yi only adds buy-in cost and you are maximizing profit.
In both cases, the constraint is equivalent to the if test.
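The effect of the linking constraint can be checked by brute force. The Python sketch below is not GLPK code; it simply enumerates the two binary values of y for a given x, with `feasible_y` and `big_m` as names made up for this example.

```python
def feasible_y(x, big_m):
    """Return the binary values of y that satisfy x <= big_m * y."""
    return [y for y in (0, 1) if x <= big_m * y]

# With big_m = 5 (for example, the maximum number of units of an investment):
for x in range(0, 6):
    print(x, feasible_y(x, big_m=5))
```

For x = 0 both values of y are feasible, so the objective decides; for any x > 0 only y = 1 survives. An empty list would mean M was chosen too small for that x.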

SAS: make values missing

I am trying to set some existing values to missing (not delete the observations).
Here is the basic structure of my data set.
I want to treat AGE and GENDER as missing whenever A is less than B. For example, when A=1 and B=3, I want to treat values of AGE and GENDER on the last two rows as missing (as shown on the data sets).
In my data both A and B go from 1 to 4 and have every combination of them.
Asterisks mean I have more data between them. Thanks in advance!
BEFORE
ID A B AGE GENDER
--------------
1 1 1 35 M
* * * * *
* * * * *
5 1 2 23 F
5 1 2 21 M
6 1 2 42 F
6 1 2 43 M
* * * * *
* * * * *
20 1 3 43 F
20 1 3 39 M
20 1 3 23 M
21 1 3 32 F
21 1 3 39 M
21 1 3 23 F
* * * * *
* * * * *
55 2 4 32 M
55 2 4 12 M
55 2 4 31 F
55 2 4 43 M
* * * * *
* * * * *
AFTER
ID A B AGE GENDER
--------------
1 1 1 35 M
* * * * *
* * * * *
5 1 2 23 F
5 1 2 . .
6 1 2 42 F
6 1 2 . .
* * * * *
* * * * *
20 1 3 43 F
20 1 3 . .
20 1 3 . .
21 1 3 32 F
21 1 3 . .
21 1 3 . .
* * * * *
* * * * *
55 2 4 32 M
55 2 4 12 M
55 2 4 . .
55 2 4 . .
* * * * *
* * * * *
How about now?
data temp;
retain idcount 0;
set olddata;
** Create an observation counter for each id **;
prev_id = lag(id);
if id ^= prev_id then idcount = 0;
idcount = idcount + 1;
run;
** Sort the obs by ID in reverse order **;
proc sort data=temp;
by id descending idcount;
run;
data temp2;
retain misscount 0;
set temp;
by id descending idcount;
** Keep the previous age and gender **;
old_age = age;
old_gender = gender;
** Count the number that should be missing **;
if a < b then nummiss = b - a;
else nummiss = 0;
** Set a counter of obs that we will set to missing **;
if first.id then misscount = 0;
** Set the appropriate number of rows to missing and update the counter **;
if misscount < nummiss then do;
misscount = misscount + 1;
call missing(age, gender);
end;
run;
proc sort data=temp2 out=temp3(drop=misscount nummiss idcount prev_id);
by id idcount;
run;
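The same logic, blanking the last B - A rows of each ID, can be sketched without the sort-reverse-sort steps. The Python below is an illustration only: `blank_trailing_rows` is a made-up name, rows are plain dictionaries, and A and B are assumed to be constant within an ID, as in the sample data.

```python
from itertools import groupby

def blank_trailing_rows(rows):
    """Set age and gender to None on the last (b - a) rows of each id.

    rows: list of dicts with keys id, a, b, age, gender, grouped by id.
    """
    result = []
    for _, group in groupby(rows, key=lambda r: r["id"]):
        members = [dict(r) for r in group]  # copy so the input is untouched
        n_missing = max(members[0]["b"] - members[0]["a"], 0)
        start = max(len(members) - n_missing, 0)
        for row in members[start:]:
            row["age"] = None
            row["gender"] = None
        result.extend(members)
    return result

sample = [
    {"id": 1, "a": 1, "b": 1, "age": 35, "gender": "M"},
    {"id": 20, "a": 1, "b": 3, "age": 43, "gender": "F"},
    {"id": 20, "a": 1, "b": 3, "age": 39, "gender": "M"},
    {"id": 20, "a": 1, "b": 3, "age": 23, "gender": "M"},
]
for row in blank_trailing_rows(sample):
    print(row["id"], row["age"], row["gender"])
```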