This question already has answers here:
Convert from the IBM floating point to the IEEE floating point standard and vice versa in C#
(4 answers)
IBM Single Precision Floating Point data conversion to intended value
(1 answer)
Closed 3 years ago.
I'm trying to read single-precision floating-point numbers from a binary (.segy) file on Windows, using C++. These numbers are stored in the IBM floating-point format, so I need to convert them into IEEE-754 floats after reading.
I have found this C code:
https://www.thecodingforums.com/threads/c-code-for-converting-ibm-370-floating-point-to-ieee-754.438469/
Unfortunately it does not compile on Windows.
I also found this code:
https://www.codeproject.com/Articles/12363/Transform-between-IEEE-IBM-or-VAX-floating-point-n
This solution seems a bit too complicated for me, and I'm not sure how to change it to read (IBM format) binary data directly.
Is there simple C++ code available to solve this problem?
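For reference, here is a minimal sketch of such a conversion. It assumes the standard IBM single-precision layout (1 sign bit, 7-bit base-16 exponent with excess 64, 24-bit fraction with no hidden bit) and that the four bytes in the file are stored big-endian, which is the usual SEG-Y convention. The names load_be32 and ibm32_to_double are made up for this example. It returns a double because the IBM range is wider than that of an IEEE single; narrow to float afterwards if the values are known to fit.

```cpp
#include <cmath>
#include <cstdint>

// Assemble a 32-bit word from 4 bytes stored big-endian (most significant
// byte first), independent of the host's byte order.
// Assumption: the file stores samples big-endian.
inline std::uint32_t load_be32(const unsigned char* p) {
    return (static_cast<std::uint32_t>(p[0]) << 24) |
           (static_cast<std::uint32_t>(p[1]) << 16) |
           (static_cast<std::uint32_t>(p[2]) << 8) |
           static_cast<std::uint32_t>(p[3]);
}

// IBM single precision: value = (-1)^sign * (fraction / 2^24) * 16^(exponent - 64)
inline double ibm32_to_double(std::uint32_t word) {
    const bool negative = (word >> 31) != 0;
    const int exponent = static_cast<int>((word >> 24) & 0x7Fu);
    const std::uint32_t fraction = word & 0x00FFFFFFu;
    if (fraction == 0) {
        return negative ? -0.0 : 0.0;
    }
    // 16^(exponent - 64) / 2^24 == 2^(4 * (exponent - 64) - 24)
    const double magnitude =
        std::ldexp(static_cast<double>(fraction), 4 * (exponent - 64) - 24);
    return negative ? -magnitude : magnitude;
}
```

To use it, read four bytes at a time from the trace data into an unsigned char buffer and call ibm32_to_double(load_be32(buffer)).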
When I run the code below I get an output of 6378136.5 instead of 6378136.3:
PROGRAM test
implicit none
real*8 radius
radius = 6378136.3
print*,radius
END
I have read this other link (Precision problems of real numbers in Fortran) but it doesn't explain how to fix this problem.
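This happens not because the variable lacks precision, but because you initialized it with a single-precision literal: 6378136.3 without a kind suffix is a default (single-precision) real constant, so it is rounded to single precision before it is assigned to radius.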
Take a look at this answer for a good explanation, and an elegant solution to your problem for any larger programs.
If you just want to solve it quickly, then you only have to change one line:
radius = 6378136.3d0
Though this will still give you a value of 6378136.2999999998, because 6378136.3 has no exact binary floating-point representation.
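The same effect is not specific to Fortran. As a rough illustration, here is the equivalent in C++, where the f suffix plays the role of the missing d0; the function names are made up for this example. Near 6.4 million, adjacent single-precision values are 0.5 apart, so the literal is rounded to 6378136.5 before it is widened to double.

```cpp
// The literal is rounded to single precision first, then widened to double,
// so the extra precision of the destination variable does not help.
inline double radius_from_single_literal() { return 6378136.3f; }

// The literal is double precision from the start.
inline double radius_from_double_literal() { return 6378136.3; }
```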
Is there any floating-point type in C++ that stores more digits after the decimal point than double (or any alternative that makes double store more digits)?
I've read that long double is maybe more accurate.
In my program we can zoom into the Mandelbrot set, but after some zooming the picture gets pixelated. I think it is because the distance between the two complex numbers associated with two neighboring pixels is less than the difference between two consecutive values of double. In the program I used long double.
If it's important, the processor of my computer is Intel® Core™ i3 CPU M 380 @ 2.53GHz × 4, the computer is 64-bit, the operating system is Ubuntu and the compiler is gcc.
You should take a look at third-party libraries like Boost.Multiprecision or even GMP.
You can also do it "by hand", but that would be a lot of work. You would have to store numbers as strings (or arrays of digits) and implement the arithmetic operations yourself.
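To check whether the pixelation really comes from running out of precision, you can compare the pixel step with the gap between adjacent representable values at the current coordinate. This is only a sketch using the standard library; the function names are made up for the example, and it uses double (for long double, which with gcc on x86 is typically the 80-bit extended format, the same idea applies with the long double overloads).

```cpp
#include <cmath>
#include <limits>

// Gap between x and the next representable double above it.
inline double spacing_above(double x) {
    return std::nextafter(x, std::numeric_limits<double>::infinity()) - x;
}

// True if moving by one pixel step yields a different double, i.e. two
// neighboring pixels still map to distinct coordinates.
inline bool step_is_representable(double coordinate, double pixel_step) {
    return coordinate + pixel_step != coordinate;
}
```

Once step_is_representable returns false for the coordinates on screen, neighboring pixels collapse onto the same value and a wider type or a multiprecision library is needed.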
I'm -kind of- a beginner to C.
I made several searches but I haven't seen this question asked.
When I try to calculate very big numbers (let's say... Adding 45235432412321312 to 5495034095872309238) my calculator gives answers which are not true. (The answer of my calculator was -2 for the numbers I've given in the previous sentence).
But both Linux's and Windows's own calculators calculate these numbers precisely.
What causes my calculator written in C/C++ to give these wrong answers with big numbers? What can I do to calculate these?
Digital data is represented as binary information. So, for an 8-bit integer you would have: 0=00000000, 1=00000001, 2=00000010, 3=00000011, etc. As you can imagine, the larger the numbers grow, the more storage is required to represent them in binary form. What happens with your calculator is called overflow: the resulting number is too large for the type that holds it (there are not enough bits to hold the information).
Now as to why your computer's calculators give accurate results, it depends on how their software is implemented. Possible explanations are that they use higher-precision arithmetic (they dedicate more bits), use multiple-precision arithmetic, or perform floating-point calculations internally. My money would be on multiple-precision arithmetic though.
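To make the overflow concrete, here is a small sketch; the function names are made up for the example. The first function shows how an 8-bit unsigned sum wraps around, and the second adds two 64-bit signed integers only if the result fits (signed overflow itself is undefined behavior in C and C++, so the check has to happen before the addition). Note that the sum of the two numbers from the question still fits into 64 bits, so the wrong answer suggests a narrower type such as a 32-bit int was used.

```cpp
#include <cstdint>
#include <limits>

// 8-bit unsigned addition: the operands are promoted to int, added, and the
// result is reduced modulo 256 when converted back to 8 bits.
inline std::uint8_t wrapping_add_u8(std::uint8_t a, std::uint8_t b) {
    return static_cast<std::uint8_t>(a + b);
}

// Adds a and b into result if the sum fits into 64 signed bits.
// Returns false (and leaves result untouched) if it would overflow.
inline bool checked_add(std::int64_t a, std::int64_t b, std::int64_t& result) {
    const std::int64_t max = std::numeric_limits<std::int64_t>::max();
    const std::int64_t min = std::numeric_limits<std::int64_t>::min();
    if ((b > 0 && a > max - b) || (b < 0 && a < min - b)) {
        return false;
    }
    result = a + b;
    return true;
}
```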
Simply put, the built-in numeric data types in C and C++, such as float and int, are limited because they are represented with a fixed, finite number of bits, such as 32 or 64. You can't "stuff" more information into 32 bits than 32 bits can hold (2^32 distinct values); that's information theory (read up). Now, when you add two "very big" numbers, a so-called "overflow" occurs (read up): the result does not fit into the available bits, so the bit sequence that is left represents a "meaningless" number; and if the data type is signed, a negative number is likely to appear (again, due to the internal representation).
Now your calculators use so-called "big numbers arithmetic", or "long numbers arithmetic", implemented in the corresponding libraries. With this approach, the number is represented as an array of numbers, and is thus virtually unlimited (of course, there are limits to the length of an array too, but the range that you can represent this way is a lot wider than that of the built-in types.)
To sum up, read on:
theory of information
binary number system and conversions decimal <-> binary
binary arithmetics with signed numbers
big number arithmetics
Short answer (and I'm not sure why you didn't find it, because it's been asked many, many times): you want a multiple-precision arithmetic library, such as GMP.
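To give an idea of what "a number as an array of digits" means in practice, here is a minimal sketch of schoolbook addition on decimal strings. It is an illustration only: add_decimal is a made-up name, it handles non-negative integers only, and it does not validate its input. A real library such as GMP is far more general and much faster.

```cpp
#include <algorithm>
#include <string>

// Adds two non-negative decimal integers given as digit strings,
// e.g. "999" + "1" gives "1000". The length is limited only by memory.
inline std::string add_decimal(const std::string& a, const std::string& b) {
    std::string result;
    int carry = 0;
    auto i = a.rbegin();
    auto j = b.rbegin();
    // Walk both numbers from the least significant digit, carrying as needed.
    while (i != a.rend() || j != b.rend() || carry != 0) {
        int sum = carry;
        if (i != a.rend()) sum += *i++ - '0';
        if (j != b.rend()) sum += *j++ - '0';
        result.push_back(static_cast<char>('0' + sum % 10));
        carry = sum / 10;
    }
    if (result.empty()) result = "0";
    std::reverse(result.begin(), result.end());
    return result;
}
```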
I heard that C/C++ has a problem with the handling of floating-point numbers.
I've implemented a simple program to try it. It is a change machine: the user enters the amount to charge and the amount paid, and the program calculates how many coins of each type to give as change.
Here is the code: Link to my google drive folder with the code
The thing is, when you enter a non-integer value, the program enters a loop and never ends.
I've printed the contents of the variables to find out what's going on, and, somehow, a value with two decimals, let's say 0.10, gets changed by the program to 0.0999998.
Then the remaining change to be processed never reaches 0 and the program enters an infinite loop.
I've heard that this is due to the machine representation of floating-point numbers. I've seen the same behavior on both Windows and Linux, and also when programming it in Java, but I don't remember having had the same issue in Pascal.
Well, now the question is: what is the best workaround for this?
I've thought that one possible solution is to use a fixed-point representation via external libraries such as http://www.trenki.net/content/view/17/1/ or http://www.codef00.com/code/Fixed.h . Another might be to use an arbitrary-precision arithmetic library such as GMP.
Neither C nor C++ has a problem with floating point values. You as the programmer are trusted to use floating point appropriately in any language supporting it.
While integer variables can store neither fractions nor out-of-bounds values, floating point can only store a specific subset of fractions. A high-quality floating-point implementation also gives tight guarantees for the accuracy of calculations.
Floating-point numbers are not arbitrary real numbers, which would need infinite space to store exactly. Each finite value is a binary fraction, so a decimal value such as 0.10 is usually stored only approximately.
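For a change machine, one simple workaround is to do all the arithmetic in integer cents, so that the remaining amount really does reach 0. Here is a minimal sketch of that idea; the function names and the coin values in the usage note are assumptions for the example, not taken from the linked code.

```cpp
#include <cmath>
#include <vector>

// Convert a decimal amount to whole cents, rounding to the nearest cent so
// that an input like 0.10 (stored as roughly 0.1000000000000000055) gives 10.
inline long to_cents(double amount) {
    return std::lround(amount * 100.0);
}

// Greedy change-making on integers: for each coin value (given in cents,
// largest first), take as many coins as fit and continue with the remainder.
inline std::vector<int> make_change(long cents, const std::vector<int>& coin_values) {
    std::vector<int> counts;
    counts.reserve(coin_values.size());
    for (int value : coin_values) {
        counts.push_back(static_cast<int>(cents / value));
        cents %= value;
    }
    return counts;
}
```

For example, with coin values {50, 20, 10, 5, 2, 1}, make_change(to_cents(5.00) - to_cents(4.35), ...) works on exactly 65 cents, and the loop always terminates because it runs once per coin type instead of waiting for a floating-point remainder to become 0.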