Why does Stata calculate different sums depending on aggregation of variables?


I noticed that Stata calculates slightly different sums depending on how the summands are grouped.
As an example, I have 4 variables (Var1, Var2, Var3, Var4):
Var1      Var2        Var3         Var4
420966    10804428    21982560     1055822272
207381    20133238    69127000     580531008
217297.6  7946694.5   23631250     554597952
327553.2  7505444     10898800     261170592
119776.4  715082.75   607820.3125  414926752
3758613   2533234.5   225734784    88380432
First, I calculate the sum of all 4 variables:
gen sumVars1234 = Var1 + Var2 + Var3 + Var4
// this calculates the same sum as `egen rowtotal`
Then I calculate the sum of Vars 1 and 2, and of Vars 3 and 4, separately:
gen sumVars12 = Var1 + Var2
gen sumVars34 = Var3 + Var4
When I add together sumVars12 and sumVars34, this generates sumVars12_34:
gen sumVars12_34 = sumVars12 + sumVars34
gen dif = sumVars12_34 - sumVars1234 // I calculate difference between both sums
However, sumVars12_34 does NOT equal sumVars1234 and I don't understand why.
sumVars12   sumVars34   sumVars12_34  sumVars1234  dif
11225394    1077804800  1089030144    1089030272   -128
20340618    649657984   669998592     669998656    -64
8163992     578229184   586393152     586393216    -64
7832997     272069376   279902368     279902400    -32
834859.125  415534560   416369408     416369440    -32
6291848     314115200   320407040     320407072    -32
I know these differences are very small, and I'm sure there's a simple explanation, but I'm not sure what it is! Any insight would be very much appreciated. Thanks!

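It's most likely due to rounding. By default, gen stores new variables as float, which holds only about 7 significant digits, while your sums have 9 or 10. Stata does the arithmetic in double precision but rounds each result to float when it is stored. sumVars1234 is rounded once, whereas sumVars12 and sumVars34 are each rounded before they are added and the total is then rounded again, so the two results can end up one float step apart (32, 64 or 128 at these magnitudes). Creating the sums with gen double should make them agree. To check, I would replicate the calculations in Excel, which works in double precision. In Excel, as you may know, you can select all the data in a range of cells, right-click, then select Format Cells-->Number, and specify 1 for Decimal Places. And then do your summing.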
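The rounding can be reproduced outside Stata. The sketch below is in Python, not Stata, and rests on one assumption: every variable is stored as a 4-byte float (the default storage type for gen) while the arithmetic itself is done in double precision. to_float32 is a hypothetical helper that mimics rounding on storage; the numbers are the first row of the question's data.

```python
import struct

def to_float32(x):
    """Round a Python float (a double) to the nearest 4-byte float."""
    return struct.unpack("f", struct.pack("f", x))[0]

# First row of the question's data (all four values are exact in float32).
var1, var2, var3, var4 = 420966.0, 10804428.0, 21982560.0, 1055822272.0

# Assumption: each stored result is rounded to float32, as with a plain gen.
sum_vars12 = to_float32(var1 + var2)
sum_vars34 = to_float32(var3 + var4)                 # rounded once here
sum_vars12_34 = to_float32(sum_vars12 + sum_vars34)  # rounded a second time
sum_vars1234 = to_float32(var1 + var2 + var3 + var4) # rounded only once
dif = sum_vars12_34 - sum_vars1234

print(sum_vars12, sum_vars34, sum_vars12_34, sum_vars1234, dif)
```

Under that assumption the sketch gives 1089030144 and 1089030272, the same values as the first row of the question's output, so the -128 comes from rounding the partial sums before they are added.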

Related

SAS Rounding by thousandths place

I have been asked to write some code in SAS that rounds a number up, but only if the digit in the thousandths place is greater than one. For example, 78.858 would obviously round up to 78.86, but I would also want 78.852 to round up to 78.86.

I would just do it in two operations. Use the normal ROUND() function. Then check how much it changed. And then based on that difference decide whether or not to add an extra hundredth.
Example:
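data have;
  input x;
  cards;
78.858
78.86
78.852
78.8515
;
data want;
  set have;
  round = round(x, 0.01);  /* ordinary rounding to hundredths */
  diff = x - round;        /* positive when ROUND() rounded down */
  if diff > 0.001 then round = round + 0.01;  /* bump up when more than 0.001 was dropped */
run;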
Results
OBS        x    round     diff
  1  78.8580    78.86   -.0020
  2  78.8600    78.86   0.0000
  3  78.8520    78.86   0.0020
  4  78.8515    78.86   0.0015
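The same two-step idea (round normally, then bump by a hundredth when more than 0.001 was dropped) can be sketched in Python. This is an illustration, not SAS code: round_with_bump is a hypothetical name, and decimal arithmetic is used here as an assumption so that the comparison against 0.001 is exact rather than subject to binary floating-point noise.

```python
from decimal import Decimal, ROUND_HALF_UP

HUNDREDTH = Decimal("0.01")
THRESHOLD = Decimal("0.001")

def round_with_bump(text):
    """Round to hundredths, then add 0.01 if more than 0.001 was dropped."""
    x = Decimal(text)
    rounded = x.quantize(HUNDREDTH, rounding=ROUND_HALF_UP)
    diff = x - rounded          # positive when ordinary rounding went down
    if diff > THRESHOLD:
        rounded += HUNDREDTH
    return rounded

for value in ["78.858", "78.86", "78.852", "78.8515", "78.851"]:
    print(value, "->", round_with_bump(value))
```

The extra input 78.851 sits exactly on the threshold and stays at 78.85 here; with binary doubles, as in the SAS step, that boundary case may fall on either side.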

How to convert a time (char) variable into a numeric variable?

I have a time variable that is expressed as a character in SAS. Example: 0:04 0:12 0:01 0:11 etc. I would like to convert it to a numeric variable 0.04 0.12 0.01 etc.
Using this code:
data work.set2; set work.set;
TIME2 = input(TIME, best4.);
;
run;
creates a new column with nothing but missing values. Can you advise on what to improve in my code?

SAS stores dates and times as numbers; a time is the number of seconds since midnight. I think converting it to a SAS time is your best option. Writing the value as a decimal is ambiguous: 0:10 would become 0.10, which reads as a tenth of a minute (6 seconds) rather than 10 seconds. For example, if you had 0.1 and 0.2 and took the difference, that's 0.1: is that a 10 second or a 6 second difference? You really need to think through how you want to interpret it, because with your approach the difference between times will not be reflected correctly.
Also, is 0:04 4 seconds or 4 minutes? The standard reading would be 4 minutes, which is 240 seconds.
Here's how you can convert it:
data have;
  x = '0:04'; output;
  x = '0:12'; output;
  x = '0:11'; output;
  x = '1:00'; output;
  x = '4:25'; output;
run;
data want;
  set have;
  /* read the string as a SAS time value (stored as seconds) */
  sas_time = input(x, time.);
  sas_time2 = sas_time;
  format sas_time2 time4.;
  /* if the string is minutes:seconds instead, compute total seconds by hand */
  seconds = input(scan(x, 1, ':'), 8.)*60 + input(scan(x, 2, ':'), 8.);
run;
proc print data=want; run;
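To make the two readings concrete, here is a small Python sketch (not SAS) that converts the same strings to seconds both ways. The function names are hypothetical and the input is assumed to always have exactly one colon.

```python
def parse_clock(text):
    """Split 'a:b' into two integers."""
    first, second = text.split(":")
    return int(first), int(second)

def seconds_if_hours_minutes(text):
    """Read 'h:mm' as hours and minutes."""
    hours, minutes = parse_clock(text)
    return hours * 3600 + minutes * 60

def seconds_if_minutes_seconds(text):
    """Read 'm:ss' as minutes and seconds."""
    minutes, seconds = parse_clock(text)
    return minutes * 60 + seconds

for value in ["0:04", "0:12", "0:11", "1:00", "4:25"]:
    print(value, seconds_if_hours_minutes(value), seconds_if_minutes_seconds(value))
```

For 0:04 the first reading gives the 240 seconds mentioned above and the second gives 4, which is why the interpretation has to be settled before converting.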

If your times are of type string:
WITHDOTS=translate(TIME2,'.',':');
Source:
https://communities.sas.com/t5/Base-SAS-Programming/Find-And-Replace-within-a-string/td-p/45104

SAS decimal precision and writing to database

Good day,
I had an issue where I was writing some numbers to a database: they should have had the value 0.1 in SAS but, for some bizarre reason, appeared as 0.09 in the SQL database. When I manually checked the dataset, it showed 0.10 in format 12.2.
So what I do is check if the values are actually 0.1 or somewhat below this:
data _checking;
set publish_data;
if value < 0.1;
dummy = value*10000000;
run;
It turned out that a number of observations fulfill the first condition. OK, that explains why the values come out as 0.09: a rounding issue.
However, all dummy values come out as integers. I tried multiplying by 10, 100, 1k and 10k, and all appear to come out as integers (1, 10, 100 ...).
Next step I try:
data _checking2;
set _checking;
if dummy<10; /* depending on the factor */
run;
This is consistent. Dummy retains the value 'a little below the value shown'.
I solved the issue by round(value,.1);
Questions:
1. How can I observe the actual value stored in the dataset (especially in the 'a little below' case)?
2. If the first IF condition is true, how can the check with dummy still show integer values? (In a computer, epsilon has to have an actual value.)
2.b Or is this just a display issue? Or does SAS have a flag for 'value minus epsilon'?

Answer 1:
The most precise and least human way to see the actual value is to observe the underlying IEEE bytes using HEX format.
Answer 2:
The default format for those new dummy variables is BEST12., so you won't see any small offsets if they are smaller than what best12. will show, or more precisely epsilon < 1e-(12-log10(x)). The SAS format could be considered a display issue in this case.
If your use case is that of a 'shown' value must be the actual value sent to a remote database then you will want to use ROUND prior to populating the remote tables.
data x;
x = 1/3; output;
x = 0.1 - 1e-13; output;
format x 12.2;
run;
data y;
set x;
put x= x= HEX16.;
xhex = x;
format xhex hex16.;
array dummy dummy1-dummy13;
do _n_ = 1 to 13;
dummy(_n_) = x * 10**_n_;
end;
run;
proc print data=y;
run;
data z;
do p = 0 to 10;
do q = 1 to 15;
array z z1-z15;
z(q) = 10**p + 10**-q;
end; output;
end;
drop p q;
run;
==== LOG ====
x=0.33 x=3FD5555555555555
x=0.10 x=3FB9999999997D74
==== PRINT ====
Obs x xhex dummy1 dummy2 dummy3 dummy4 dummy5 dummy6 dummy7
1 0.33 3FD5555555555555 3.33333 33.3333 333.333 3333.33 33333.33 333333.33 3333333.33
2 0.10 3FB9999999997D74 1.00000 10.0000 100.000 1000.00 10000.00 100000.00 1000000.00
Obs dummy8 dummy9 dummy10 dummy11 dummy12 dummy13
1 33333333.33 333333333.33 3333333333.3 33333333333 333333333333 3.3333333E12
2 10000000.00 100000000.00 1000000000.0 10000000000 100000000000 999999999999
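The same inspection can be sketched in Python (not SAS), on the assumption that both use IEEE 754 doubles. It takes the value 0.1 - 1e-13 from the DATA step above, shows it with two decimals, and then exposes what is actually stored.

```python
import struct
from decimal import Decimal

x = 0.1 - 1e-13

shown = f"{x:.2f}"                              # two decimals, similar to a 12.2 format
hex_bytes = struct.pack(">d", x).hex().upper()  # big-endian IEEE 754 bytes of the double
exact = Decimal(x)                              # exact decimal expansion of the stored value
rounded = round(x, 1)                           # rounding before export

print("shown:  ", shown)
print("hex:    ", hex_bytes)
print("exact:  ", exact)
print("x < 0.1:", x < 0.1)
print("rounded == 0.1:", rounded == 0.1)
```

The two-decimal display reads 0.10 even though the stored value is below 0.1, and the hex string matches the 3FB9999999997D74 in the log above. After rounding to one decimal the value compares equal to 0.1.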

You can try a different format, such as 32.31 or best32.
Subtract 0.1-value and look at the result. Again, use a format with a lot of decimal places.
You are probably not seeing the value in the dummy variables because the epsilon is very small and the dummy is still getting rounded for display.
Try dummy=value*1e16 or higher.
Numbers in SAS are C doubles, fwiw.

How to set the precision on a SAS numeric value

I have this data set with 2 numeric values, these values are calculated by different systems with different precision parameters. So they round differently.
data test;
a = 10;
b= 11;
run;
Basically, a and b started out as almost the same float value but, due to rounding differences, ended up with different values.
I need a proc sql that treats values like these as the same (i.e. a precision of +/- 1).
So I need this to return true:
proc sql;
select * from test where a = b;
quit;

This is ugly, and it assumes you mean that any two values within 1 of each other should be treated as the same value, but you could do something like:
where max(a,b) - min(a,b) le 1;
This assumes that there are no missing values. If you have missing values you can use something like:
where max(sum(0,a),sum(0,b)) - min(sum(0,a),sum(0,b)) le 1;
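The comparison itself is just "absolute difference at most 1". A minimal Python sketch of the same rule follows; within_one is a hypothetical name, and treating a missing value (None here) as 0 is an assumption that mirrors the sum(0,a) trick above.

```python
def within_one(a, b):
    """True when a and b differ by at most 1; None (missing) counts as 0."""
    a = 0 if a is None else a
    b = 0 if b is None else b
    return abs(a - b) <= 1   # same as max(a, b) - min(a, b) <= 1

rows = [(10, 11), (10, 12), (None, 1)]
matches = [row for row in rows if within_one(*row)]
print(matches)
```

Whether a missing value should really count as 0 depends on the data; if not, those rows would need to be filtered out first.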

C++ Pointer Dereference Multiplication

I am currently programming an arduino and am using C++ objects to do so. I've run into a weird issue when I try to multiply the values that are being pointed at. Referring to the code below, when I run the program, var3 and var4 end up having two different values. Why is this? They are essentially multiplying the same values (or so I believe). Any help?
long var1 = info->accelXYZ[0];
long var2 = info->taughtAccelXYZ[0];
long var3 = var1*var2;
long var4 = info->accelXYZ[0]*info->taughtAccelXYZ[0];

It's possible you're overflowing in one of the situations.
The multiplication of var1 and var2 (both long) gives a long which is then loaded into var3.
If both info->accelXYZ[0] and info->taughtAccelXYZ[0] are int (for example), the result of the multiplication will be int which is then loaded into a long.
The intermediate int form may be overflowing, something you can see in the following snippet:
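#include <stdio.h>
#include <limits.h>
int main(void) {
    /* sizeof yields a size_t, so print it with %zu */
    printf("int has %zu bytes\n", sizeof(int));
    printf("long has %zu bytes\n", sizeof(long));
    int a = INT_MAX;
    int b = 2;
    long var1 = a;
    long var2 = b;
    long var3 = a * b;        /* multiplied as int: overflows before widening (formally undefined behaviour) */
    long var4 = var1 * var2;  /* multiplied as long: no overflow here */
    printf("var3=%ld\n", var3);
    printf("var4=%ld\n", var4);
    return 0;
}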
which outputs:
int has 4 bytes
long has 8 bytes
var3=-2
var4=4294967294

One reason why var3 may end up with a different value than var4 is integer overflow. This happens when both multiplicands fit in an int, but the product doesn't.
Since ints and longs have different sizes on Arduino Uno*, the computation of var3 is different from computation of var4.
When you compute var3, the multiplication is done in longs on the initial values that fit in an int, so the result of the multiplication is not truncated. When you compute var4, the computation is done in ints, and then promoted to long. However, by then the result is already truncated, which results in the discrepancy that you are observing.
To make var4 the same correct value as var3, add a cast to long to one of the multiplicands, like this:
long var4 = (info->accelXYZ[0])*((long)info->taughtAccelXYZ[0]);
* int has 16 bits, while long has 32 bits.
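To see what a truncated 16-bit product looks like, here is a small Python sketch (not Arduino code). The input values are hypothetical, and wrapping to the low 16 bits is only an assumption about the typical outcome: signed overflow is formally undefined in C++.

```python
import ctypes

def as_int16(value):
    """Keep only the low 16 bits and reinterpret them as a signed value."""
    return ctypes.c_int16(value).value

accel = 300    # hypothetical sensor reading
taught = 200   # hypothetical taught value

product_as_long = accel * taught            # 60000 fits easily in a 32-bit long
product_as_int = as_int16(accel * taught)   # what a wrapped 16-bit int would hold

print(product_as_long, product_as_int)
```

Both operands fit in 16 bits, but the product 60000 exceeds the 16-bit signed maximum of 32767, so the wrapped result is -5536.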
