Convert string to float, C++ implementation - c++

In Visual Studio, the minimum positive normalized double value is 2.2250738585072014e-308. The atof function converts a string to a double in such a way that the debugger shows the original string representation.
double d = atof("2.2250738585072014e-308"); // debugger will show 2.2250738585072014e-308
As we can see, the double value is not denormalized (there is no DEN).
I am trying to achieve the same precision when converting a string to a double. Here is the code:
double my_atof(char* digits, int digits_length, int ep)
{
int idot = digits_length;
for (int i = 0; i < digits_length; i++)
{
if (digits[i] == '.')
{
idot = i;
break;
}
}
double accum = 0.0;
int power = ep + idot - 1;
for (int i = 0; i < digits_length; i++)
{
if (digits[i] != '.')
{
if (digits[i] != '0')
{
double base_in_power = 1.0;
if (power >= 0)
{
for (int k = 0; k < power; k++) base_in_power *= 10.0;
}
else if (power < 0)
{
for (int k = 0; k < -power; k++) base_in_power *= 0.1;
}
accum += (digits[i] - '0') * base_in_power;
}
power--;
}
else power = ep - 1;
}
return accum;
}
Now, let's try:
char* float_str = "2.2250738585072014";
int float_length = strlen(float_str);
double d = my_atof(float_str, float_length, -308);
The debugger shows that d = 2.2250738585072379e-308. I tried replacing
for (int k = 0; k < -power; k++) base_in_power *= 0.1;
with
for (int k = 0; k < -power; k++) base_in_power /= 10.0;
but that results in a denormalized value. How can I achieve the same precision as VS, so that the debugger shows the same number?

The problem is with the double representation of the constant 0.1, or with the division by 10.0, which produces exactly the same result: negative powers of ten have no exact representation in binary floating point, because they cannot be written as a finite sum of negative powers of 2.
When you compute negative powers of ten by repeated multiplication, the error accumulates. The first few negative powers come out right, but from about 0.000001 onward the difference becomes visible. Run this program to see what is happening:
#include <stdio.h>
double p10[] = {
0.1, 0.01, 0.001, 0.0001, 0.00001, 0.000001, 0.0000001, 0.00000001, 0.000000001, 0.0000000001
};
int main(void) {
double a = 1;
for (int i = 0 ; i != 10 ; i++) {
double aa = a * 0.1;
double d = aa - p10[i];
printf("%d %.30lf\n", aa == p10[i], d);
a = aa;
}
return 0;
}
The output looks like this:
1 0.000000000000000000000000000000
1 0.000000000000000000000000000000
1 0.000000000000000000000000000000
1 0.000000000000000000000000000000
1 0.000000000000000000000000000000
0 0.000000000000000000000211758237
0 0.000000000000000000000026469780
0 0.000000000000000000000001654361
0 0.000000000000000000000000206795
0 0.000000000000000000000000025849
The first few powers match exactly, but then differences start to appear. When you use the powers you compute to compose the number during your string-to-float conversion, the accumulated errors make it into the final result. If the library function uses a look-up table (see this implementation for an example), the result that you get will differ from the result that the library gets.
You can fix your implementation by hard-coding a table of negative powers of ten and referencing this table instead of computing the powers manually. Alternatively, you could construct a positive power of ten by consecutive multiplications and then do a single division, 1 / pow10, to construct the corresponding negative power.
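The single-division idea can be sketched as follows. This is only an illustration, not the library's method, and the helper name neg_pow10 is hypothetical. The sketch limits itself to exponents up to 22, because powers of ten up to 10^22 are exactly representable in a double, so the quotient is rounded only once.

```cpp
#include <cassert>

// Hypothetical helper: returns 10^-k using one division instead of k
// multiplications by 0.1. For k <= 22 every intermediate power 10^i is
// exact in a double, so the only rounding happens in the final division.
double neg_pow10(int k) {
    assert(k >= 0 && k <= 22);  // beyond 22, 10^k itself is no longer exact
    double pow10 = 1.0;
    for (int i = 0; i < k; ++i) {
        pow10 *= 10.0;
    }
    return 1.0 / pow10;
}
```

Within that range, neg_pow10(k) compares equal to the corresponding decimal literal, unlike the repeated multiplication by 0.1 shown in the program above.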


Instructions inside a condition between long long unsigned integers doesn't execute

In the following code I'm trying to find the highest integer p such that 45^p is a divisor of n! (n is an integer).
int main(){
int n = 14;
long long unsigned int fact = 1;
for(int i = 1; i <= n; i++){
fact *= i;
}
bool until = true;
int ans;
// loop until x is greater than half of the factorial
for(int i = 1; until; i++){
long long unsigned int x = 1;
for(int j = 1; j <= i; j++){
x *= 45;
}
if(fact/2 < x){
until = false;
}
else{
if(fact % x == 0){
ans = i;
}
}
}
cout << ans;
}
However, when I try to end the loop where x becomes greater than half of the factorial, it keeps going until 45^7 for some reason, although it should stop at 45^5, where the number is less than half of n!. Why does this happen?
P.S.: I'm not saying the program doesn't return the number I want (it returns ans = 2, which is correct), but it's pointless to keep calculating x.
If you need the biggest value, then starting from x = 45 and with x > fact / 2 as the only way out of the loop, you have to go at least as far as the logarithm in base 45 of n! / 2.
That gives a limit of 7, because 45**6 <= 14! / 2 and 45**7 > 14! / 2.
Pen and paper, as suggested by @Raymond Chen, is the way to go.
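The search loop can also be avoided without computing n! at all. The sketch below uses a different technique from the code in the question (Legendre's formula for the exponent of a prime in n!), and the function names are hypothetical. Since 45 = 3^2 * 5, the largest p is the smaller of floor(e3 / 2) and e5, where e3 and e5 are the exponents of 3 and 5 in n!.

```cpp
#include <cassert>

// Exponent of the prime `prime` in n! (Legendre's formula):
// floor(n / prime) + floor(n / prime^2) + ...
int prime_exponent_in_factorial(int n, int prime) {
    assert(n >= 0 && prime >= 2);
    int exponent = 0;
    for (long long power = prime; power <= n; power *= prime) {
        exponent += static_cast<int>(n / power);
    }
    return exponent;
}

// Largest p such that 45^p divides n!, using 45 = 3^2 * 5.
int max_power_of_45(int n) {
    int threes = prime_exponent_in_factorial(n, 3) / 2;  // two 3s per factor 45
    int fives = prime_exponent_in_factorial(n, 5);       // one 5 per factor 45
    return threes < fives ? threes : fives;
}
```

For n = 14 the exponents are 5 and 2, so p = 2, which matches the result reported in the question.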

Optimization C++ code to match reference run time

I have an assignment to optimize some C++ code. I'm bad at coding, but I made some attempts. The original is:
#include "stdafx.h"
#include "HistogramStretching.h"
void CHistogramStretching::HistogramStretching(BYTE** pImage, int nW, int nH)
{
//find minimal value
int nMin = pImage[0][0];
for(int j = 0; j < nW; j++)
for(int i = 0; i < nH; i++)
if(pImage[i][j] < nMin)
nMin = pImage[i][j];
//find maximal value
int nMax = pImage[0][0];
for(int j = 0; j < nW; j++)
for(int i = 0; i < nH; i++)
if(pImage[i][j] > nMax)
nMax = pImage[i][j];
//stretches histogram
for(int j = 0; j < nW; j++)
for(int i = 0; i < nH; i++)
{
if(nMax != nMin)
{
float fScale = (nMax - nMin)/100.0;//calculates scale
float fVal = (pImage[i][j] - nMin)/fScale;//scales pixel value
int nVal = (int)(fVal + 0.5);//rounds floating point number to integer
//checks BYTE range (must be 0-255)
if(nVal < 0)
nVal = 0;
if(nVal > 255)
nVal = 255;
pImage[i][j] = nVal;
}
else
pImage[i][j] = 0;//if all pixel values are the same, the image is changed to black
}
}
And my version is:
#include "stdafx.h"
#include "HistogramStretching.h"
void CHistogramStretching::HistogramStretching(BYTE** pImage, int nW, int nH)
{
//find minimal value
int nMin = pImage[0][0];
int nMax = pImage[0][0];
for (int j = 0; j < nW; j++) {
for (int i = 0; i < nH; i++) {
if (pImage[i][j] < nMin)
nMin = pImage[i][j];
if (pImage[i][j] > nMax)
nMax = pImage[i][j];
}
}
if (nMax != nMin) {
float fScale = (nMax - nMin) / 100.0;//calculates scale
fScale = 1 / fScale;
//stretches histogram
for (int j = 0; j < nW; j++)
for (int i = 0; i < nH; i++)
{
float fVal = (pImage[i][j] - nMin) * fScale;//scales pixel value
int nVal = (int)(fVal + 0.5);//rounds floating point number to integer
//checks BYTE range (must be 0-255)
if (nVal < 0)
nVal = 0;
if (nVal > 255)
nVal = 255;
pImage[i][j] = nVal;
}
//if all pixel values are the same, the image is changed to black
}
else {
pImage[0][0] = 0;
}
}
So I merged the first two loops into one, but the first if still takes ~15% of CPU time. The next step was to pull the if statement outside the loops and to replace the division with a multiplication; here the division takes ~8% of CPU time and the float-to-int cast takes ~5%, but I don't think I can do much about the cast. With these "corrections" my code is still about 6-7 times slower than the reference code. I test both versions on the same machine. Can you point me to something I can improve?
I think tadman gave you the correct answer.
Replace
for (int j = 0; j < nW; j++) {
for (int i = 0; i < nH; i++) {
if (pImage[i][j] < nMin)
...
}
}
with
for (int i = 0; i < nH; i++) {
for (int j = 0; j < nW; j++) {
if (pImage[i][j] < nMin)
...
}
}
This way the inner loop walks along one row at a time, so your data access becomes sequential in memory and cache-friendly, which should be much faster.
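Put together, the loop swap might look like the sketch below. It is an illustration under assumptions, not the reference implementation: the function name is hypothetical, BYTE is taken to be an 8-bit unsigned type, and the scale is applied as a single multiplication, so individual pixels could in principle round differently from the original division.

```cpp
#include <cassert>
#include <cstdint>

// Sketch: the same 0-100 stretch as in the question, but every pass walks
// each row left to right, so consecutive reads touch consecutive bytes.
void stretch_row_major(std::uint8_t** image, int width, int height) {
    assert(image != nullptr && width >= 0 && height >= 0);
    std::uint8_t lo = 255, hi = 0;
    for (int i = 0; i < height; ++i) {
        const std::uint8_t* row = image[i];
        for (int j = 0; j < width; ++j) {
            if (row[j] < lo) lo = row[j];
            if (row[j] > hi) hi = row[j];
        }
    }
    // Computed once for the whole image instead of once per pixel.
    const bool flat = (hi == lo);
    const float scale = flat ? 0.0f : 100.0f / static_cast<float>(hi - lo);
    for (int i = 0; i < height; ++i) {
        std::uint8_t* row = image[i];
        for (int j = 0; j < width; ++j) {
            // A flat image becomes black; otherwise scale and round.
            // The result is always within 0..100, so no clamping is needed.
            row[j] = flat ? 0
                          : static_cast<std::uint8_t>((row[j] - lo) * scale + 0.5f);
        }
    }
}
```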
All modern compilers can vectorize this nicely, when compiled at full optimization (/O2 for MSVC, -O3 for gcc and clang).
The idea is to give the compiler some help so that it can see that the code can be in fact vectorized:
Let the inner loop operate on a single pointer, not on indices, and without accessing anything but the pointed-to value.
Perform the scaling as an integer operation - and don't forget rounding :)
Try to set up operations such that additional range checks are unnecessary, e.g. your checks for BYTE being less than 0. By having the offset and scale set up properly, the result will be guaranteed to fall into the desired range.
The inner loops will get unrolled, and will be vectorized to process 4 bytes at a time. I've tried the recent gcc, clang and MSVC releases and they produce pretty fast code for this.
You're doing something "weird" in that you purposefully scale the results to a 0-99 range. Thus you lose the resolution of the data - you've got a full byte to work with, so why not scale it to 255?
But if you want to scale to 100 values, it's fine. Note that 100(dec) = 0x64. We can make the outputSpan flexible - it will work for any value <= 255.
Thus:
/* Code Part 1 */
#include <cstdint>
constexpr uint32_t outputSpan = 100;
static constexpr uint32_t scale_16(uint8_t min, uint8_t max)
{
return (outputSpan * 0x10000) / (1+max-min);
}
// scale factor in 16.16 fixed point unsigned format
// empty histogram produces scale = outputSpan
static_assert(scale_16(10, 10) == outputSpan * 0x10000, "Scale calculation is wrong");
static constexpr uint8_t scale_pixel(uint8_t const pixel, uint8_t min, uint32_t const scale)
{
uint32_t px = (pixel - min) * scale;
// result in 16.16 fixed point format
return (px + 0x8080u) >> 16;
// round to an integer value
}
We work with fixed-point numbers (instead of floating-point). The scale is in 16.16 format, that is, 16 bits in the integer part and 16 bits in the fractional part, e.g. 0x1234.5678. The value 1.0(dec) would be 0x1.0000.
The pixel scaling simply multiplies the pixel by the scale, rounds it, and returns the truncated integer part.
The rounding is "interesting". You'd think that it'd suffice to add 0.5(dec) = 0x0.8 to the result to round it. That's not the case. The value needs to be a bit larger than that, and 0x0.8080 does the job. It pre-biases the value, so that the error range around the exact value has a zero mean. In all cases, the error is at most ±0.5, so the result, rounded to an integer, does not lose accuracy.
We use scale_16 and scale_pixel functions to implement the stretcher:
/* Code Part 2 */
void stretchHistogram(uint8_t **pImage, int const nW, int const nH)
{
uint8_t nMin = 255, nMax = 0;
for (uint8_t **row = pImage, **rowEnd = pImage + nH; row != rowEnd; ++row)
for (const uint8_t *p = *row, *pEnd = p + nW; p != pEnd; ++p)
{
auto const px = *p;
if (px < nMin) nMin = px;
if (px > nMax) nMax = px;
}
auto const scale = scale_16(nMin, nMax);
for (uint8_t **row = pImage, **rowEnd = pImage + nH; row != rowEnd; ++row)
for (uint8_t *p = *row, *pEnd = p + nW; p != pEnd; ++p)
*p = scale_pixel(*p, nMin, scale);
}
This also produces decent code on architectures without FPU, such as FPU-less ARM and AVR.
We can also do some manual checks. Suppose that min = 0x10, max = 0xEF, and pixel = 0x32. Let's remember that the scale is in 16.16 format:
scale = 0x64.0000 / (1 + max - min)
= 0x64.0000 / (1 + 0xEF - 0x10)
= 0x64.0000 / (1 + 0xDF)
= 0x64.0000 / 0xE0
Long division:
0x .7249
0x64.0000 / 0xE0
---------
64.0
- 62.0
------
2.00
- 1.C0
-------
.400
- .380
--------
. 800
- . 7E0
---------
. 20
So, we have scale = 0x0.7249. It's less than one (0x1.0), and also a bit less than 1/2 (0x0.8), since we map 224 values onto 100 values - a bit less than half as many.
Now
px = (pixel - min) * scale
= (0x32 - 0x10) * 0x0.7249
= 0x22 * 0x0.7249
Long multiplication:
0x 0.7249
* 0x .0022
------------
.E492
+ E.492
------------
0x F.2DB2
Thus, px = 0xF.2DB2 ≈ 0xF. We have to round it to an integer:
return = (px + 0x0.8080u) >> 16
= (0xF.2DB2 + 0x0.8080) >> 16
= 0xF.AE32 >> 16
≈ 0xF
Let's check in decimal system:
100 / (max-min+1) * (pixel-min) =
= 100 / (239 - 16 + 1) * (50 - 16)
= 100 / 224 * 34
= 100 * 34 / 224
= 3400 / 224
≈ 15.17
≈ 15
≈ 0xF
Here's a test case that ensures that there's no rounding bias for all combinations of min, max, and input pixel value, and that the error is bounded to [-0.5, 0.5]. Just append it to the code above and it should compile and run and produce the following output:
-0.5 0.5 1
For scaling to outputSpan = 256 values (instead of 100), it'd output:
-0.498039 0.498039 0.996078
/* Code Part 3 */
#include <cassert>
#include <cmath>
#include <iostream>
int main()
{
double errMin = 0, errMax = 0;
for (uint16_t min = 0; min <= 255; ++min)
for (uint16_t max = min; max <= 255; ++max)
for (uint16_t val = min; val <= max; ++val)
{
uint8_t const nMin = min, nMax = max;
uint8_t const span = nMax - nMin;
uint8_t const val_src = val;
uint8_t p_val = val_src;
uint8_t *const p = &p_val;
assert(nMin <= nMax);
assert(val >= nMin && val <= nMax);
auto const scale = scale_16(nMin, nMax);
*p = scale_pixel(*p, nMin, scale);
auto pValTarget = (val_src - nMin) * double(outputSpan)/(1.0+span); // exact target for the configured outputSpan
auto error = pValTarget - *p;
if (error < errMin) errMin = error;
if (error > errMax) errMax = error;
}
std::cout << '\n' << errMin << ' ' << errMax << ' ' << errMax-errMin << std::endl;
assert((errMax-errMin) <= 1.0); // constrain the error
assert(std::abs(errMax+errMin) == 0.0); // constrain the error average
}

Summation of a series [duplicate]

This is my code. I'm trying to calculate the series ((-1)^n)*n/(n+1) for n from 1 to 5, but the code is not working correctly. Can anyone help?
int main(){
int i,n;
double sum1;
for (i=1; i<6; i++)
sum1 += (pow(-1,i))*((i)/(i+1));
cout<<sum1;
return 0;
}
The correct result should be -0.6166666666666667, but the code does not compute it correctly.
I calculated the series from here. Is there a special function for summation?
Make sure to initialize your variables before you use them. You initialize i afterwards so it's fine like this, but sum1 needs to be initialized:
double sum1 = 0.0;
For the summation, even if the result is assigned to a double, the intermediate results might not be doubles, and integer division yields truncated values. For this reason, double literals should be used (such as 2.0 instead of 2) and i should be cast where applicable:
sum1 += (pow(-1, i))*(((double)i) / ((double)i + 1.0));
Finally, to get the desired precision, std::setprecision can be used in the print. The final result could look like this:
#include <cmath>
#include <iomanip>
#include <iostream>
int main() {
int i;
double sum1 = 0.0;
for (i = 1; i < 6; i++)
sum1 += (pow(-1, i))*(((double)i) / ((double)i + 1.0));
std::cout << std::setprecision(15) << sum1 << std::endl;
return 0;
}
Output:
-0.616666666666667
Always initialize variables before use: double sum1 = 0;
((i) / (i + 1)) performs integer division, so the result is 0 for every i in this loop.
Using the pow function to compute a power of -1 is extremely wasteful.
#include <iostream>
int main() {
int i;
double sum1 = 0;
double sign = -1;
for (i = 1; i < 6; i++)
{
sum1 += sign * i / (i + 1);
sign *= -1.0;
}
std::cout << sum1;
return 0;
}
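The effect of the integer division can be seen by running both variants side by side. The sketch below is only an illustration; the function names alternating_sum and truncated_sum are hypothetical.

```cpp
#include <cassert>

// Sums (-1)^i * i / (i + 1) for i = 1..n with floating-point division.
double alternating_sum(int n) {
    assert(n >= 0);
    double sum = 0.0;
    double sign = -1.0;
    for (int i = 1; i <= n; ++i) {
        sum += sign * static_cast<double>(i) / (i + 1);
        sign = -sign;
    }
    return sum;
}

// Same loop, but with the integer division from the question:
// i / (i + 1) is evaluated in int and is 0 for every i >= 1.
double truncated_sum(int n) {
    assert(n >= 0);
    double sum = 0.0;
    double sign = -1.0;
    for (int i = 1; i <= n; ++i) {
        sum += sign * (i / (i + 1));
        sign = -sign;
    }
    return sum;
}
```

For n = 5 the first function gives approximately -0.6167, while the second always returns zero.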
Try this instead
for (i = 0; i <= 5; i++) // from 0 to 5 inclusively
sum1 += (pow(-1, i)) * (static_cast<double>(i) / (i + 1));

Anderson Darling Test in C++

I am trying to compute the Anderson-Darling test found here. I followed the steps on Wikipedia, and I used MATLAB to verify the average and standard deviation that I calculate for the data I am testing, denoted X. I also use a function called phi to compute the standard normal CDF; I have tested this function to make sure it is correct, which it is. The problem appears when I actually compute A-squared (as it is denoted on Wikipedia; I call it A in C++).
Here is my function I made for Anderson-Darling Test:
void Anderson_Darling(int n, double X[]){
sort(X,X + n);
// Find the mean of X
double X_avg = 0.0;
double sum = 0.0;
for(int i = 0; i < n; i++){
sum += X[i];
}
X_avg = ((double)sum)/n;
// Find the variance of X
double X_sig = 0.0;
for(int i = 0; i < n; i++){
X_sig += (X[i] - X_avg)*(X[i] - X_avg);
}
X_sig /= n;
// The values X_i are standardized to create new values Y_i
double Y[n];
for(int i = 0; i < n; i++){
Y[i] = (X[i] - X_avg)/(sqrt(X_sig));
//cout << Y[i] << endl;
}
// With a standard normal CDF, we calculate the Anderson_Darling Statistic
double A = 0.0;
for(int i = 0; i < n; i++){
A += -n - 1/n *(2*(i) - 1)*(log(phi(Y[i])) + log(1 - phi(Y[n+1 - i])));
}
cout << A << endl;
}
Note: I know that the formula for Anderson-Darling (A-squared) runs from i = 1 to i = n. When I adjusted the index to work with C++ arrays, I still got the same result as without adjusting it.
The value I get in C++ is:
-4e+006
The value I should get, as computed in MATLAB, is:
0.2330
Any suggestions are greatly appreciated.
Here is my whole code:
#include <iostream>
#include <math.h>
#include <cmath>
#include <random>
#include <algorithm>
#include <chrono>
using namespace std;
double *Box_Muller(int n, double u[]);
double *Beasley_Springer_Moro(int n, double u[]);
void Anderson_Darling(int n, double X[]);
double phi(double x);
int main(){
int n = 2000;
double Mersenne[n];
random_device rd;
mt19937 e2(1);
uniform_real_distribution<double> dist(0, 1);
for(int i = 0; i < n; i++){
Mersenne[i] = dist(e2);
}
// Print Anderson Statistic for Mersenne 6a
double *result = new double[n];
result = Box_Muller(n,Mersenne);
Anderson_Darling(n,result);
return 0;
}
double *Box_Muller(int n, double u[]){
double *X = new double[n];
double Y[n];
double R_2[n];
double theta[n];
for(int i = 0; i < n; i++){
R_2[i] = -2.0*log(u[i]);
theta[i] = 2.0*M_PI*u[i+1];
}
for(int i = 0; i < n; i++){
X[i] = sqrt(-2.0*log(u[i]))*cos(2.0*M_PI*u[i+1]);
Y[i] = sqrt(-2.0*log(u[i]))*sin(2.0*M_PI*u[i+1]);
}
return X;
}
double *Beasley_Springer_Moro(int n, double u[]){
double y[n];
double r[n+1];
double *x = new double(n);
// Constants needed for algo
double a_0 = 2.50662823884; double b_0 = -8.47351093090;
double a_1 = -18.61500062529; double b_1 = 23.08336743743;
double a_2 = 41.39119773534; double b_2 = -21.06224101826;
double a_3 = -25.44106049637; double b_3 = 3.13082909833;
double c_0 = 0.3374754822726147; double c_5 = 0.0003951896511919;
double c_1 = 0.9761690190917186; double c_6 = 0.0000321767881768;
double c_2 = 0.1607979714918209; double c_7 = 0.0000002888167364;
double c_3 = 0.0276438810333863; double c_8 = 0.0000003960315187;
double c_4 = 0.0038405729373609;
// Set r and x to empty for now
for(int i = 0; i <= n; i++){
r[i] = 0.0;
x[i] = 0.0;
}
for(int i = 1; i <= n; i++){
y[i] = u[i] - 0.5;
if(fabs(y[i]) < 0.42){
r[i] = pow(y[i],2.0);
x[i] = y[i]*(((a_3*r[i] + a_2)*r[i] + a_1)*r[i] + a_0)/((((b_3*r[i] + b_2)*r[i] + b_1)*r[i] + b_0)*r[i] + 1);
}else{
r[i] = u[i];
if(y[i] > 0.0){
r[i] = 1.0 - u[i];
r[i] = log(-log(r[i]));
x[i] = c_0 + r[i]*(c_1 + r[i]*(c_2 + r[i]*(c_3 + r[i]*(c_4 + r[i]*(c_5 + r[i]*(c_6 + r[i]*(c_7 + r[i]*c_8)))))));
}
if(y[i] < 0){
x[i] = -x[i];
}
}
}
return x;
}
double phi(double x){
return 0.5 * erfc(-x * M_SQRT1_2);
}
void Anderson_Darling(int n, double X[]){
sort(X,X + n);
// Find the mean of X
double X_avg = 0.0;
double sum = 0.0;
for(int i = 0; i < n; i++){
sum += X[i];
}
X_avg = ((double)sum)/n;
// Find the variance of X
double X_sig = 0.0;
for(int i = 0; i < n; i++){
X_sig += (X[i] - X_avg)*(X[i] - X_avg);
}
X_sig /= (n-1);
// The values X_i are standardized to create new values Y_i
double Y[n];
for(int i = 0; i < n; i++){
Y[i] = (X[i] - X_avg)/(sqrt(X_sig));
//cout << Y[i] << endl;
}
// With a standard normal CDF, we calculate the Anderson_Darling Statistic
double A = -n;
for(int i = 0; i < n; i++){
A += -1.0/(double)n *(2*(i+1) - 1)*(log(phi(Y[i])) + log(1 - phi(Y[n - i])));
}
cout << A << endl;
}
Let me guess, your n was 2000. Right?
The major issue here is the 1/n in the last expression. 1 is an int and so is n, so dividing 1 by n performs integer division. Under integer division, 1 divided by any number greater than 1 is 0 (think of it as keeping only the integer part of the quotient). What you need to do is cast n to double by writing 1/(double)n.
Everything else should work fine.
Summary from discussions -
Indexes to Y[] should be i and n-1-i respectively.
n should not be added in the loop but only once.
Minor fixes like changing divisor to n instead of n-1 while calculating Variance.
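With these points applied, the statistic might be computed as in the sketch below. It is an illustration under assumptions: the function names are hypothetical, the data is passed as a std::vector instead of a raw array, and the variance divisor is n as stated in the summary. It is not claimed to reproduce the MATLAB value from the question.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Standard normal CDF, equivalent to the question's phi().
double normal_cdf(double x) {
    return 0.5 * std::erfc(-x / std::sqrt(2.0));
}

// A^2 = -n - (1/n) * sum_{i=1..n} (2i - 1) * [ln F(Y_i) + ln(1 - F(Y_{n+1-i}))],
// written with 0-based indices i and n-1-i. -n is added once, outside the loop.
double anderson_darling(std::vector<double> x) {
    const std::size_t n = x.size();
    assert(n >= 2);
    std::sort(x.begin(), x.end());

    double mean = 0.0;
    for (double v : x) mean += v;
    mean /= static_cast<double>(n);

    double var = 0.0;
    for (double v : x) var += (v - mean) * (v - mean);
    var /= static_cast<double>(n);  // assumption: divisor n, per the summary
    const double sd = std::sqrt(var);

    double sum = 0.0;
    for (std::size_t i = 0; i < n; ++i) {
        const double lower = normal_cdf((x[i] - mean) / sd);
        const double upper = normal_cdf((x[n - 1 - i] - mean) / sd);
        sum += (2.0 * static_cast<double>(i + 1) - 1.0) *
               (std::log(lower) + std::log(1.0 - upper));
    }
    return -static_cast<double>(n) - sum / static_cast<double>(n);
}
```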
You have integer division here:
A += -n - 1/n *(2*(i) - 1)*(log(phi(Y[i])) + log(1 - phi(Y[n+1 - i])));
^^^
1/n is zero when n > 1 - you need to change this to, e.g.: 1.0/n:
A += -n - 1.0/n *(2*(i) - 1)*(log(phi(Y[i])) + log(1 - phi(Y[n+1 - i])));
^^^^^

exp function using c++

I can't figure out why I keep getting the result 1.#INF from my_exp() when I give it 1 as input. Here is the code:
double factorial(const int k)
{
int prod = 1;
for(int i=1; i<=k; i++)
prod = i * prod;
return prod;
}
double power(const double base, const int exponent)
{
double result = 1;
for(int i=1; i<=exponent; i++)
result = result * base;
return result;
}
double my_exp(double x)
{
double sum = 1 + x;
for(int k=2; k<50; k++)
sum = sum + power(x,k) / factorial(k);
return sum;
}
You have an integer overflow in your factorial function, which causes it to return zero: 49! is divisible by 2^32, so the value truncated to 32 bits is zero.
You then divide by that zero, which makes the result infinite. The solution is to change prod to double:
double prod = 1;
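The wraparound can be reproduced in a well-defined way with an unsigned 32-bit type. The sketch below is only a model of what happens in the question (signed int overflow is formally undefined behaviour), and the function names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Models 32-bit wraparound with an unsigned type, where overflow is
// well defined: the product is reduced modulo 2^32 at every step.
std::uint32_t factorial_u32(int k) {
    assert(k >= 0);
    std::uint32_t prod = 1;
    for (int i = 1; i <= k; ++i) {
        prod *= static_cast<std::uint32_t>(i);
    }
    return prod;
}

// Same loop with a double accumulator, as the fix above suggests.
double factorial_double(int k) {
    assert(k >= 0);
    double prod = 1.0;
    for (int i = 1; i <= k; ++i) {
        prod *= i;
    }
    return prod;
}
```

Under this model the product first becomes exactly zero at 34!, and every later factorial stays zero, which is why dividing by factorial(49) produces infinity.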
Instead of completely evaluating the power and the factorial terms for each term in your expansion, you should consider how the k'th term is related to the k-1'th term and just update each term based on this relationship. That will avoid the nasty overflows in your power and factorial functions (which you will no longer need). E.g.
double my_exp(double x)
{
double sum = 1.0 + x;
double term = x; // term for k = 1 is just x
for (int k = 2; k < 50; k++)
{
term = term * x / (double)k; // term[k] = term[k-1] * x / k
sum = sum + term;
}
return sum;
}
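You could simply reduce the upper limit of k from 50 to about 30; that avoids the division by zero, although factorial still overflows int from 13! onward, so the result is only approximate.
One question: does your code only need to work near 0?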