SAS create column based on comparing others - sas

I have following data:
data temp;
input id int1 int2 char1$ char2$;
cards;
1 2 3 AA BB
2 3 3 AB CC
3 4 5 AC DD
4 5 5 AD AD
5 6 7 AE FF
6 7 8 AF GG
;
run;
I want to create a new column "difference" that results from comparing int1 to int2 and char1 to char2. "difference" increases by one for each pair that differs:
if int1 = int2 and char1 = char2: difference = 0
if int1 ^= int2 and char1 = char2: difference = 1
if int1 ^= int2 and char1 ^= char2: difference = 2
...
I want a result like this:
data result;
input id int1 int2 char1$ char2$ differences;
cards;
1 2 3 AA BB 2
2 3 3 AB CC 1
3 4 5 AC DD 2
4 5 5 AD AD 0
5 6 7 AE FF 2
6 7 8 AF GG 2
;
run;
In this example there are only 4 columns, but the original data has 36 columns.
I started with a loop, but I don't know if that is the correct way to solve the problem:
data result;
SET temp;
array id int1 int2 char1 char2;
DO i = 1 to dim(id);
DIFFERENCE = 0;
if int1 ^= int2 THEN DIFFERENCE = DIFFERENCE + 1;
if char1 ^= char2 THEN DIFFERENCE = DIFFERENCE + 1;
END;
RUN;
but it doesn't work. Do you have any ideas?
Thanks a lot!

Judging from your result table, I assume the difference should also equal 1 when int1 = int2 and char1 != char2.
In that case, nested calls to the IFN() function produce the desired result.
data result;
set temp;
difference = ifn(int1=int2 and char1=char2, 0, ifn((int1 ne int2 and char1 = char2) or (int1 = int2 and char1 ne char2), 1, 2));
run;
EDIT: after DD Chen's comment
data result;
set temp;
/* reset the counter for every row; no loop is needed because the data step already iterates over the rows */
difference = 0;
if int1 ne int2 then difference + 1;
if char1 ne char2 then difference + 1;
/* add all comparison statements ... */
run;

SAS will evaluate a boolean expression to 1 (TRUE) or 0 (FALSE).
So to get your output you just need to do:
data want;
set temp;
difference = (int1 ne int2) + (char1 ne char2);
run;
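
The same idea scales to many column pairs. As a language-neutral check of the logic, here is a minimal Python sketch (not SAS) that sums one boolean comparison per pair. The `rows` data mirrors the `temp` dataset from the question; the `pairs` list and the function name `count_differences` are illustrative assumptions. With 36 columns, only the list of pairs would grow.

```python
# Rows mirroring the question's `temp` dataset.
rows = [
    {"id": 1, "int1": 2, "int2": 3, "char1": "AA", "char2": "BB"},
    {"id": 2, "int1": 3, "int2": 3, "char1": "AB", "char2": "CC"},
    {"id": 3, "int1": 4, "int2": 5, "char1": "AC", "char2": "DD"},
    {"id": 4, "int1": 5, "int2": 5, "char1": "AD", "char2": "AD"},
    {"id": 5, "int1": 6, "int2": 7, "char1": "AE", "char2": "FF"},
    {"id": 6, "int1": 7, "int2": 8, "char1": "AF", "char2": "GG"},
]

# Hypothetical list of column pairs to compare.
pairs = [("int1", "int2"), ("char1", "char2")]


def count_differences(row, pairs):
    """Count how many pairs differ; each comparison contributes 1 (True) or 0 (False)."""
    return sum(row[a] != row[b] for a, b in pairs)


differences = [count_differences(row, pairs) for row in rows]
print(differences)  # [2, 1, 2, 0, 2, 2]
```

The printed list matches the "differences" column in the expected result table.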

Related

sas datastep loop to calculate new rows of data

I have the requirement below and am wondering whether it can be implemented in a simple data step.
I will be starting with a simple dataset with two variables.
input:
x y
1 0
logic:
x y z
1 0 x+y
prev z prev y +1 x+y
prev z prev y +1 x+y
output:
x y z
1 0 1
1 1 2
2 2 4
4 3 7
7 4 11
Just output the computed rows one-by-one within a do-while loop.
data have;
input x y;
cards;
1 0
;run;
data want(drop=i);
set have;
z = x + y;
output;
i = 2; /*next row*/
do while (i <= 5); /* put the total number of rows here */
x = z;
y = y + 1;
z = x + y;
output;
i = i + 1;
end;
run;
Result
proc print data=want;
run;
Obs x y z
1 1 0 1
2 1 1 2
3 2 2 4
4 4 3 7
5 7 4 11
Macro version:
%macro gen(have, want, n_rows);
data &want(drop=i);
set &have;
z = x + y;
output;
i = 2;
do while (i <= &n_rows);
x = z;
y = y + 1;
z = x + y;
output;
i = i + 1;
end;
run;
%mend gen;
/* execute */
%gen(have, want, 5)
OK, I did some research to try to answer your question, and I found that you (probably) asked the same question in a SAS forum (link here).
As any interested user can see, it was answered very elegantly there by Mr. Reinhard (credit to him).
While I was pasting Reinhard's answer here, I saw that @Bill Huang had posted an original answer of his own, so you should probably accept his. Still, Reinhard's answer is elegant enough that I thought it worth recording here, mainly for other users, because it is such an easy way to create additional iterative rows in SAS:
data have;
x=1; y=0;
run;
data want;
set have;
do y=y to 4;
z=x+y;
output;
x=z;
end;
run;
@LuizZ's do y=y to 4 version presumes the first row has y=0. If that is not the case, try:
data have;
x=1; y=0;
run;
%let iterations = 4;
data want;
set have;
do y = y to y+&iterations;
z = x + y;
output;
x = z;
end;
run;
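
For readers who want to check the recurrence outside SAS, here is a minimal Python sketch of the same logic: each row has z = x + y, and the next row starts from x = previous z and y = previous y + 1. The function name `generate_rows` and its parameters are illustrative assumptions, not part of any answer above.

```python
def generate_rows(x, y, n_rows):
    """Return n_rows tuples (x, y, z) following z = x + y, then x <- z, y <- y + 1."""
    rows = []
    for _ in range(n_rows):
        z = x + y
        rows.append((x, y, z))
        # Carry the computed values into the next row.
        x, y = z, y + 1
    return rows


rows = generate_rows(1, 0, 5)
for row in rows:
    print(row)
# (1, 0, 1)
# (1, 1, 2)
# (2, 2, 4)
# (4, 3, 7)
# (7, 4, 11)
```

Starting from x=1, y=0 this reproduces the five rows of the requested output.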

Detecting unit differences in data (SAS)

I have two sets of financial data that tend to contain differences due to unit errors e.g. $10000 in one dataset may be $1000 in the other.
I'm trying to code a check for such differences, but the only way I can think of is to divide the two variables and see if the ratio is in a table of 0.001, 0.01, 0.1, 10, 100 etc., and it would be hard to catch all of the differences that way.
Is there a smarter way to do this?
Use proc compare. Be sure the two datasets are sorted in identical order, either by row or by specific groups. Use the by statement as needed. More info on options can be found in the documentation.
Example - compare a modified cars dataset with sashelp.cars:
data cars_modified;
set sashelp.cars;
if(mod(_N_, 2) = 0) then msrp = msrp - 100;
run;
proc compare base = sashelp.cars
compare = cars_modified
out = out_differences
outnoequal
outdif
noprint;
var msrp;
run;
Only the observations with differences are output in out_differences:
_TYPE_ _OBS_ MSRP
DIF 2 $-100
DIF 4 $-100
DIF 6 $-100
DIF 8 $-100
DIF 10 $-100
...
So you appear to be asking how to find cases where X/Y is exactly a power of ten, that is, a number of the form 1.00Exx where xx is an integer other than 0.
data _null_;
do x=1,10,100,1000;
do y=1,2,3,10.1,10 ;
ratio = x/y;
power = floor(log10(ratio));
if power ne 0 and 1.00 = round(ratio/10**power,0.01) then
put 'Ratio of ' x 'over ' y 'is 10**' power '.'
;
end;
end;
run;
Results:
Ratio of 1 over 10 is 10**-1 .
Ratio of 10 over 1 is 10**1 .
Ratio of 100 over 1 is 10**2 .
Ratio of 100 over 10 is 10**1 .
Ratio of 1000 over 1 is 10**3 .
Ratio of 1000 over 10 is 10**2 .
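
The check can also be sketched in Python as a way to reason about it outside SAS. This is an illustration under stated assumptions rather than a port of the data step: it rounds log10 of the ratio to the nearest integer instead of taking the floor, it assumes both amounts are positive, and the function name `unit_shift` and the tolerance `tol` are hypothetical.

```python
import math


def unit_shift(a, b, tol=0.005):
    """Return k if a / b is within tol (relative) of 10**k for some k != 0, else None."""
    if a <= 0 or b <= 0:
        return None  # assumption: only positive amounts are compared
    ratio = a / b
    # Nearest integer power of ten to the ratio.
    power = round(math.log10(ratio))
    if power != 0 and abs(ratio / 10.0 ** power - 1.0) <= tol:
        return power
    return None


print(unit_shift(10000, 1000))  # 1
print(unit_shift(1, 10))        # -1
print(unit_shift(100, 10.1))    # None
```

A tolerance is used because financial amounts that differ by a unit error may also differ slightly through rounding; setting `tol` close to zero makes the test strict.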
For a numeric value X you can compute the nearest rational approximation, p/q.
If you calculate the ratio
X = amount_from_source_A / amount_from_source_B;
status = math.rational(X,1e5,p,q);
then p=1 or q=1 means the ratio is an integer or the reciprocal of one; it is a power of 10 when the other term is 10, 100, 1000, and so on.
Example:
proc ds2;
package math / overwrite = yes;
method rational(double x, double maxden, in_out integer p, in_out integer q) returns double;
/*
** FROM: https://www.ics.uci.edu/~eppstein/numth/frap.c
** FROM: https://stackoverflow.com/questions/95727/how-to-convert-floats-to-human-readable-fractions
**
** find rational approximation to given real number
** David Eppstein / UC Irvine / 8 Aug 1993
**
** With corrections from Arno Formella, May 2008
**
** Modified for Proc DS2, Richard DeVenezia, Jan 2020.
**
** usage: rational(r,d,p,q)
** x is real number to approx
** maxden is the maximum denominator allowed
** p is return for numerator
** q is return for denominator
** returns 0 if no problems
**
** based on the theory of continued fractions
** if x = a1 + 1/(a2 + 1/(a3 + 1/(a4 + ...)))
** then best approximation is found by truncating this series
** (with some adjustments in the last term).
**
** Note the fraction can be recovered as the first column of the matrix
** ( a1 1 ) ( a2 1 ) ( a3 1 ) ...
** ( 1 0 ) ( 1 0 ) ( 1 0 )
** Instead of keeping the sequence of continued fraction terms,
** we just keep the last partial product of these matrices.
*/
declare integer m[0:1,0:1];
declare double startx e1 e2;
declare integer ai t result p1 q1 p2 q2;
startx = x;
/* initialize matrix */
m[0,0] = 1; m[1,1] = 1;
m[0,1] = 0; m[1,0] = 0;
/* loop finding terms until denom gets too big */
do while (1);
ai = x;
if not ( m[1,0] * ai + m[1,1] < maxden ) then leave;
t = m[0,0] * ai + m[0,1];
m[0,1] = m[0,0];
m[0,0] = t;
t = m[1,0] * ai + m[1,1];
m[1,1] = m[1,0];
m[1,0] = t;
if x = ai then leave; %* AF: division by zero;
x = 1 / (x - ai);
if x > 2147483647 /*x'7FFFFFFF'*/ then leave; %* AF: representation failure;
end;
/* now remaining x is between 0 and 1/ai */
/* approx as either 0 or 1/m where m is max that will fit in maxden */
/* first try zero */
p1 = m[0,0];
q1 = m[1,0];
e1 = startx - 1.0 * p1 / q1;
/* now try other possibility */
ai = (maxden - m[1,1]) / m[1,0];
m[0,0] = m[0,0] * ai + m[0,1];
m[1,0] = m[1,0] * ai + m[1,1];
p2 = m[0,0];
q2 = m[1,0];
e2 = startx - 1.0 * p2 / q2;
if abs(e1) <= abs(e2) then do;
p = p1;
q = q1;
end;
else do;
p = p2;
q = q2;
end;
return 0;
end;
endpackage;
run;
quit;
* Example usage;
proc ds2;
data _null_;
declare package math math();
declare double x;
declare int p1 q1 p q;
method run();
streaminit(12345);
x = 0;
do _n_ = 1 to 20;
p1 = ceil(rand('uniform',9));
q1 = ceil(rand('uniform',9));
x + 1. * p1 / q1;
math.rational (x, 10000, p, q);
put 'add' p1 '/' q1 ' ' x=best16. 'is' p '/' q;
end;
end;
enddata;
run;
quit;
----- LOG -----
add 4 / 1 x= 4 is 4 / 1
add 4 / 2 x= 6 is 6 / 1
add 2 / 7 x=6.28571428571429 is 44 / 7
add 4 / 6 x=6.95238095238095 is 146 / 21
add 5 / 2 x=9.45238095238095 is 397 / 42
add 5 / 2 x= 11.952380952381 is 251 / 21
add 7 / 1 x= 18.952380952381 is 398 / 21
add 8 / 6 x=20.2857142857143 is 142 / 7
add 9 / 3 x=23.2857142857143 is 163 / 7
add 8 / 2 x=27.2857142857143 is 191 / 7
add 3 / 1 x=30.2857142857143 is 212 / 7
add 9 / 3 x=33.2857142857143 is 233 / 7
add 4 / 3 x=34.6190476190476 is 727 / 21
add 4 / 6 x=35.2857142857143 is 247 / 7
add 1 / 9 x=35.3968253968254 is 2230 / 63
add 8 / 3 x=38.0634920634921 is 2398 / 63
add 2 / 4 x=38.5634920634921 is 4859 / 126
add 5 / 1 x=43.5634920634921 is 5489 / 126
add 1 / 2 x=44.0634920634921 is 2776 / 63
add 2 / 7 x=44.3492063492064 is 2794 / 63

Use the dif function to obtain the difference with several lags without specifying the number of lags

I want a new data set in which the variable y equals the value of x in row n minus the sum of all previous values of x.
The original data set:
data test;
input x;
datalines;
20
40
2
5
74
;
run;
I used the dif function, but it returns only the difference at lag one:
data want;
set test;
y = dif(x);
run;
And I want:
_n_ = 1 y = 20
_n_ = 2 y = 40 - 20 = 20
_n_ = 3 y = 2 - (40 + 20) = -58
_n_ = 4 y = 5 - (2 + 40 + 20) = - 57
_n_ = 5 y = 74 - (5 + 2 + 40 + 20) = 7
Thanks.
No need for lag() or dif(). Just make another variable to retain the running total.
data want ;
set test;
y=x-cumm;
output;
cumm+x;
run;
I kept the extra column and output the values before updating the running total to make it clearer what value was used in the calculation of Y.
Obs x y cumm
1 20 20 0
2 40 20 20
3 2 -58 60
4 5 -57 62
5 74 7 67
Possible solution (thanks to Longfish for suggestions):
data want;
set test;
retain total 0;
total = total + x;
y = x - coalesce(lag(total), 0);
run;
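
Both answers compute, for each row, the current value minus the sum of all earlier values. As a quick cross-check outside SAS, the same running-total logic can be sketched in Python; the function name `minus_running_total` is an illustrative assumption.

```python
def minus_running_total(values):
    """For each value, subtract the sum of all values that came before it."""
    result = []
    total = 0  # running total of the values seen so far
    for x in values:
        result.append(x - total)
        total += x  # update only after y has been computed
    return result


y = minus_running_total([20, 40, 2, 5, 74])
print(y)  # [20, 20, -58, -57, 7]
```

As in the first SAS answer, the total is updated after the subtraction, so the current row never subtracts itself.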

Impute missing values by Replicating Previous Value

My data is in this format:
Var1 Var1 Var1 Value1 Imputer_Value1 Value2 Imputer_Value2
A A1 A11 6 6 15 15
A A1 A11 9 9 14 14
A A1 A12 1 1 19 19
A A2 A12 . 1 16 16
A A2 A13 10 10 13 13
A A2 A13 4 4 . 13
B B1 B11 8 8 13 13
B B1 B11 9 9 17 17
B B1 B12 5 5 18 18
B B2 B12 . 5 12 12
B B2 B13 2 2 20 20
B B2 B13 1 1 . 20
I want to impute each missing value by carrying forward the previous value from the rows above in the same group. Can anyone please tell me how to do it? I tried the approach below, but it handles only one value column.
data imputedData;
set mydata;
n=_n_;
if missing(Value1) then
do;
do until (not missing(value1));
n=n-1;
set mydata(keep=Value1) point=n; *second SET statement;
end;
end;
run;
Thanks!
If I'm understanding your question correctly, there are a few simple ways to do this. The easiest is to use the lag and coalesce functions. Unfortunately, lag returns the previous row's original value, so this approach does not fill two or more consecutive missing values.
Here is an example using lag.
data want;
set have;
* The coalesce function returns the first non-missing value
* and the lag function returns the last value;
value1 = coalesce(value1, lag(value1));
value2 = coalesce(value2, lag(value2));
run;
If that does not work then you may have to use a retain statement.
data want;
set have;
retain val1 val2;
* If it's not the first record do;
if _n_ > 1 then do;
value1 = coalesce(value1, val1);
value2 = coalesce(value2, val2);
end;
val1 = value1;
val2 = value2;
run;
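
Neither data step above restarts at a group boundary. To illustrate the "last value carried forward within a group" logic on its own, here is a minimal Python sketch; the column names `grp`, `value1`, `value2`, the use of `None` for missing values, and the function name `fill_forward` are all assumptions made for the example.

```python
def fill_forward(rows, group_key, value_keys):
    """Replace None with the last non-missing value seen earlier in the same group."""
    last_seen = {}  # (group, column) -> last non-missing value
    filled = []
    for row in rows:
        new_row = dict(row)
        group = row[group_key]
        for key in value_keys:
            if new_row[key] is None:
                # Stays None when the group has no earlier non-missing value.
                new_row[key] = last_seen.get((group, key))
            else:
                last_seen[(group, key)] = new_row[key]
        filled.append(new_row)
    return filled


rows = [
    {"grp": "A", "value1": 1, "value2": 19},
    {"grp": "A", "value1": None, "value2": 16},
    {"grp": "A", "value1": 4, "value2": None},
    {"grp": "B", "value1": None, "value2": 13},
]
filled = fill_forward(rows, "grp", ["value1", "value2"])
print(filled[1]["value1"], filled[2]["value2"], filled[3]["value1"])  # 1 16 None
```

Keying the remembered value on the group means a missing value in the first row of a group is left missing rather than borrowed from the previous group.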

Counting values to get a matrix in Stata

I have a variable age, 13 variables x1 to x13, and 802 observations in a Stata dataset. age has values ranging 1 to 9. x1 to x13 have values ranging 1 to 13.
I want to know how to count the number of 1 .. 13 in x1 to x13 according to different values of age. For example, for age 1, in x1 to x13, count the number of 1,2,3,4,...13.
I first convert x1 to x13 to a matrix using
mkmat x1 x2 x3 x4 x5 x6 x7 x8 x9 x10 x11 x12 x13, matrix (a)
Then, I want to count using the following loop:
gen count = 0
quietly forval i = 1/802 {
quietly forval j = 1/13 {
replace count = count + inrange(a[r'i', x'j'], 0, 1), if age==1
}
}
I failed.
I am still somewhat uncertain as to what you would like to achieve. But if I understand you correctly, here is one way to do it.
First, a simple dataset in which age ranges from one to three and four variables x1-x4 each take integer values between 5 and 7.
clear
input age x1 x2 x3 x4
1 5 6 6 6
1 7 5 6 5
2 5 7 6 6
3 5 6 7 7
3 7 6 6 6
end
Then we create three count variables (n5, n6 and n7) that count the number of 5s, 6s, and 7s for each subject across x1-x4.
forval i=5/7 {
egen n`i'=anycount(x1 x2 x3 x4),v(`i')
}
Below is how the data look now. To explain, the first "1" under n5 indicates that there is only one "5" for the subject across x1-x4.
+----------------------------------------+
| age x1 x2 x3 x4 n5 n6 n7 |
|----------------------------------------|
1. | 1 5 6 6 6 1 3 0 |
2. | 1 7 5 6 5 2 1 1 |
3. | 2 5 7 6 6 1 2 1 |
4. | 3 5 6 7 7 1 1 2 |
5. | 3 7 6 6 6 0 3 1 |
+----------------------------------------+
It sounds to me like your ultimate goal is to have sums calculated separately for each value in age. Assuming this is true, let's create a 3x3 matrix to store such results.
mat A=J(3,3,.) // age (1-3) and values (5-7)
mat rown A=age1 age2 age3
mat coln A=value5 value6 value7
forval i=5/7 {
forval j=1/3 {
qui su n`i' if age==`j'
loca k=`i'-4 // the first column for value5
mat A[`j',`k']=r(sum)
}
}
The matrix looks like this. To explain, the first "3" under value5 indicates that for all children of age 1, the value 5 appears a total of three times across x1-x4.
A[3,3]
value5 value6 value7
age1 3 4 1
age2 1 2 1
age3 1 4 3
With Aspen's example, you could do this:
gen id = _n
reshape long x, i(id)
tab age x
Note that your sample code doesn't loop over different ages and there is an incorrect comma in the count command. I won't try to fix the code, as there are many more direct methods, one of which is above. tabulate has an option to save the table as a matrix.
Here is another solution closer to the original idea. Warning: code not tested.
matrix count = J(9, 13, 0)
forval i = 1/9 {
forval j = 1/13 {
forval J = 1/13 {
qui count if age == `i' & x`J' == `j'
matrix count[`i', `j'] = count[`i', `j'] + r(N)
}
}
}
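
To sanity-check the counting logic independently of Stata, the age-by-value table can be sketched in Python with a counter keyed on (age, value). The data below are the five rows from Aspen's example; the variable names are illustrative assumptions.

```python
from collections import Counter

# (age, [x1, x2, x3, x4]) rows from the small example dataset above.
data = [
    (1, [5, 6, 6, 6]),
    (1, [7, 5, 6, 5]),
    (2, [5, 7, 6, 6]),
    (3, [5, 6, 7, 7]),
    (3, [7, 6, 6, 6]),
]

# Count every occurrence of each value within each age.
counts = Counter()
for age, xs in data:
    for value in xs:
        counts[(age, value)] += 1

ages = [1, 2, 3]
values = [5, 6, 7]
# Rows are ages, columns are values; missing combinations count as 0.
matrix = [[counts[(a, v)] for v in values] for a in ages]
print(matrix)  # [[3, 4, 1], [1, 2, 1], [1, 4, 3]]
```

The result agrees with the matrix A shown earlier, which suggests that the per-row counts and the single cross-tabulation are two routes to the same table.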