How to convert a double to a string without using the CRT - c++

My question has no practical application. I'm just interested. Suppose, I have a double value and I want to obtain its string representation similarly to the printf function. How would I do that without the C runtime library? Let's suppose I'm on the x86 architecture.

Given that you state your question has no practical application, I figure you're trying to learn about floating point number representations.
Thus, if you're looking for a solution without any library support, start with the IEEE 754 format specification (on x86, a double is an IEEE 754 binary64 value). From that you can discern the various special values (infinity, NaN, etc.) as well as decode the actual numeric value. Once you have the significand and exponent, you know where to put the decimal point. You'll have to write your own itoa-type routine. For radices that are powers of two this can be as simple as a lookup table; for decimal you'll have to do a little extra math.
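As a starting point, here is a minimal C++ sketch of that decoding step, assuming IEEE 754 binary64 (the sample value is arbitrary; the comments mark what each bit field means):
#include <cstdint>
#include <cstdio>
#include <cstring>

int main() {
    double x = -6.25;

    // Copy the bit pattern into an integer (memcpy avoids aliasing problems).
    uint64_t bits;
    std::memcpy(&bits, &x, sizeof bits);

    // IEEE 754 binary64 layout: 1 sign bit, 11 exponent bits, 52 fraction bits.
    uint64_t sign     = bits >> 63;
    uint64_t expField = (bits >> 52) & 0x7FF;
    uint64_t fraction = bits & 0xFFFFFFFFFFFFFull;

    if (expField == 0x7FF) {                 // all-ones exponent: special values
        std::puts(fraction ? "nan" : (sign ? "-inf" : "inf"));
    } else {
        // Normal numbers carry an implicit leading 1 bit; subnormals do not.
        uint64_t significand = fraction | (expField ? (1ull << 52) : 0);
        int exponent = (int)(expField ? expField : 1) - 1023 - 52;
        // Now x = (-1)^sign * significand * 2^exponent, ready for digit generation.
        std::printf("%s%llu * 2^%d\n", sign ? "-" : "",
                    (unsigned long long)significand, exponent);
    }
}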

You can get the digits to the left of the decimal point by repeatedly taking the value modulo 10 (fmod in C/C++, since % doesn't work on doubles) and then dividing by 10; they come out right-to-left.
To get the digits to the right of the decimal point, repeatedly multiply by 10 and take the integer part modulo 10; they come out left-to-right.
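A quick-and-dirty C++ sketch of that scheme (the digit counts and sample value are arbitrary; everything happens in double precision, so expect rounding noise):
#include <cmath>
#include <cstdio>

int main() {
    double x = 123.456;

    // Integer part: peel digits off right-to-left with fmod.
    char left[32];
    int n = 0;
    double ip = std::floor(x);
    do {
        left[n++] = '0' + (int)std::fmod(ip, 10.0);
        ip = std::floor(ip / 10.0);
    } while (ip > 0.0 && n < 31);
    while (n > 0) std::putchar(left[--n]);   // reverse to print left-to-right
    std::putchar('.');

    // Fractional part: multiply by 10, digits come out left-to-right.
    double fp = x - std::floor(x);
    for (int i = 0; i < 6; ++i) {
        fp *= 10.0;
        int d = (int)fp;
        std::putchar('0' + d);
        fp -= d;
    }
    std::putchar('\n');   // prints 123.455999 -- the promised rounding noise
}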

If you want to do it simply with a "close enough" result, see my article http://www.exploringbinary.com/quick-and-dirty-floating-point-to-decimal-conversion/. It describes a simple program that uses floating-point to convert from floating-point to decimal, and explains why that approach can never be accurate for all conversions. (The program doesn't do decimal rounding like printf, but that should be easy enough to add.)

Related

How does printf extract digits from a floating point number?

How do functions such as printf extract digits from a floating point number? I understand how this could be done in principle. Given a number x, of which you want the first n digits, scale x by a power of 10 so that x is between pow(10, n) and pow(10, n-1). Then convert x into an integer, and take the digits of the integer.
I tried this, and it worked. Sort of. My answer was identical to the answer given by printf for the first 16 decimal digits, but tended to differ on the ones after that. How does printf do it?
The classic implementation is David Gay's dtoa. The exact details are somewhat arcane (see Why does "dtoa.c" contain so much code?), but in general it works by doing the base conversion using more precision beyond what you can get from a 32-bit, 64-bit, or even 80-bit floating point number. To do this, it uses so-called "bigints" or arbitrary-precision numbers, which can hold as many digits as you can fit in memory. Gay's code has been copied, with modifications, into countless other libraries including common implementations for the C standard library (so it might power your printf), Java, Python, PHP, JavaScript, etc.
(As a side note... not all of these copies of Gay's dtoa code were kept up to date, so because PHP used an old version of strtod it hung when parsing 2.2250738585072011e-308.)
In general, if you do things the "obvious" and simple way like multiplying by a power of 10 and then converting the integer, you will lose a small amount of precision and some of the results will be inaccurate... but maybe you will get the first 14 or 15 digits correct. Gay's implementation of dtoa() claims to get all the digits correct... but as a result, the code is quite difficult to follow. If you skip to the bottom and look at strtod itself, you can see that it starts with a "fast path" that just uses ordinary floating-point arithmetic, then detects if that result is incorrect and falls back to a more reliable algorithm using bigints that works in all cases (but is slower).
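The "obvious" method and the mismatch it produces are easy to demonstrate on an IEEE 754 system (a C++ sketch; the value of pi here is just an example):
#include <cinttypes>
#include <cmath>
#include <cstdint>
#include <cstdio>

int main() {
    double x = 3.14159265358979323846;

    // Naive method: scale into [10^16, 10^17) and truncate to an integer,
    // giving "the first 17 digits". The multiply rounds, so the last digits
    // can be off.
    double scaled = x * std::pow(10.0, 16);
    std::printf("naive : %" PRIu64 "\n", (uint64_t)scaled);  // 31415926535897932

    // printf's own conversion for comparison: the first 16 digits agree,
    // the 17th differs.
    std::printf("printf: %.17g\n", x);                       // 3.1415926535897931
}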
The implementation has the following citation, which you may find interesting:
* Inspired by "How to Print Floating-Point Numbers Accurately" by
* Guy L. Steele, Jr. and Jon L. White [Proc. ACM SIGPLAN '90, pp. 112-126].
The algorithm works by calculating a range of decimal numbers which produce the given binary number, and by using more digits, the range gets smaller and smaller until you either have an exact result or you can correctly round to the requested number of digits.
In particular, from section 2.2 (Algorithm):
The algorithm uses exact rational arithmetic to perform its computations so that there is no loss of accuracy. In order to generate digits, the algorithm scales the number so that it is of the form 0.d1d2..., where d1, d2, ..., are base-B digits. The first digit is computed by multiplying the scaled number by the output base, B, and taking the integer part. The remainder is used to compute the rest of the digits using the same approach.
The algorithm can then continue until it has the exact result (which is always possible, since floating-point numbers are base 2, and 2 is a factor of 10) or until it has as many digits as requested. The paper goes on to prove the algorithm's correctness.
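To get a feel for that digit loop, here is a stripped-down C++ sketch of just the digit-generation core (not the full algorithm with its stopping conditions). It represents the value exactly as num / 2^shift and assumes x is in (0, 1) and not so small that the arithmetic outgrows the GCC/Clang __int128 extension; a real implementation uses arbitrary-precision integers:
#include <cmath>
#include <cstdint>
#include <cstdio>

int main() {
    double x = 0.1;

    // Decompose x = m * 2^e exactly, then rewrite it as num / 2^shift.
    int e;
    double m = std::frexp(x, &e);                 // 0.5 <= m < 1
    uint64_t num = (uint64_t)std::ldexp(m, 53);   // 53-bit integer significand
    int shift = 53 - e;                           // x = num / 2^shift

    // Digit loop: multiply the remainder by 10; the integer part is the next
    // decimal digit. This is exact, and it terminates with the complete
    // expansion because the denominator is a power of two.
    std::printf("0.");
    unsigned __int128 n = num;
    while (n != 0) {
        n *= 10;
        unsigned digit = (unsigned)(n >> shift);
        std::printf("%u", digit);
        n -= (unsigned __int128)digit << shift;
    }
    std::printf("\n");  // 0.1000000000000000055511151231257827021181583404541015625
}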
Also note that not all implementations of printf are based on Gay's dtoa, this is just a particularly common implementation that's been copied a lot.
There are various ways to convert floating-point numbers to decimal numerals without error (either exactly or with rounding to a desired precision).
One method is to use arithmetic as taught in elementary school. C provides functions to work with floating-point numbers, such as frexp, which separates the fraction (also called the significand, often mistakenly called a mantissa) and the exponent. Given a floating-point number, you could create a large array to store decimal digits in and then compute the digits. Each bit in the fraction part of a floating-point number represents some power of two, as determined by the exponent in the floating-point number. So you can simply put a “1” in an array of digits and then use elementary school arithmetic to multiply or divide it the required number of times. You can do that for each bit and then add all the results, and the sum is the decimal numeral that equals the floating-point number.
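As an illustration of that method, here is a deliberately naive C++ sketch for a double in (0, 1): it keeps the running sum as an array of decimal digits and uses schoolbook halving and addition (the digit count and sample value are arbitrary):
#include <cmath>
#include <cstdint>
#include <cstdio>

enum { DIGITS = 1200 };  // enough decimal digits for any double's fraction

static void halve(int d[DIGITS]) {               // d /= 2, schoolbook long division
    int r = 0;
    for (int i = 0; i < DIGITS; ++i) {
        int v = r * 10 + d[i];
        d[i] = v / 2;
        r = v % 2;
    }
}

static void add(int dst[DIGITS], const int src[DIGITS]) {  // dst += src
    int carry = 0;
    for (int i = DIGITS - 1; i >= 0; --i) {
        int s = dst[i] + src[i] + carry;
        dst[i] = s % 10;
        carry = s / 10;
    }
}

int main() {
    double x = 0.1;                               // assumed to be in (0, 1)
    int e;
    uint64_t sig = (uint64_t)std::ldexp(std::frexp(x, &e), 53);  // x = sig * 2^(e-53)

    static int dec[DIGITS], term[DIGITS];         // dec[i] is the digit of 10^-(i+1)
    term[0] = 5;                                  // term = 2^-1 = 0.5
    for (int k = 1; k <= 53 - e; ++k) {           // term = 2^-k on each pass
        if (k > 1) halve(term);
        int bit = 53 - e - k;                     // significand bit worth 2^-k
        if (bit <= 52 && ((sig >> bit) & 1)) add(dec, term);
    }

    std::printf("0.");                            // exact expansion of the double
    for (int i = 0; i < 60; ++i) std::printf("%d", dec[i]);
    std::printf("...\n");
}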
Commercial printf implementations will use more sophisticated algorithms. Discussing them is beyond the scope of a Stack Overflow question-and-answer. The seminal paper on this is Correctly Rounded Binary-Decimal and Decimal-Binary Conversions by David M. Gay. (A copy appears to be available here, but that seems to be hosted by a third party; I am not sure how official or durable it is. A web search may turn up other sources.) A more recent paper with an algorithm for converting a binary floating-point number to decimal with the shortest number of digits needed to uniquely distinguish the value is Printing Floating-Point Numbers: An Always Correct Method by Marc Andrysco, Ranjit Jhala, and Sorin Lerner.
One key to how it is done is that printf will not just use the floating-point format and its operations to do the work. It will use some form of extended-precision arithmetic, either by working with parts of the floating-point number in an integer format with more bits, by separating the floating-point number into pieces and using multiple floating-point numbers to work with it, or by using a floating-point format with more precision.
Note that the first step in your question, multiplying x by a power of ten, already has two rounding errors. First, not all powers of ten are exactly representable in binary floating-point, so just producing such a power of ten necessarily has some representation error. Then, multiplying x by another number often produces a mathematical result that is not exactly representable, so it must be rounded to the floating-point format.
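Both of those roundings are easy to observe (a small C++ check, assuming IEEE 754 doubles; the sample values are arbitrary):
#include <cstdio>

int main() {
    // The power of ten itself is already inexact: 10^23 is not representable.
    std::printf("%.17g\n", 1e23);          // 9.9999999999999992e+22

    // And the multiplication rounds a second time.
    std::printf("%.17g\n", 4.35 * 100.0);  // 434.99999999999994
}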
Neither the C nor the C++ standard dictates a particular algorithm for such things, so it is impossible to say in general how printf does this.
If you want to know an example of a printf implementation, you can have a look here: http://sourceware.org/git/?p=glibc.git;a=blob;f=stdio-common/vfprintf.c and here: http://sourceware.org/git/?p=glibc.git;a=blob;f=stdio-common/printf_fp.c

Fortran - want to round to one decimal point

In Fortran I have to round latitude and longitude to one digit after the decimal point.
I am using gfortran compiler and the nint function but the following does not work:
print *, nint( 1.40 * 10. ) / 10. ! prints 1.39999998
print *, nint( 1.49 * 10. ) / 10. ! prints 1.50000000
Looking for both general and specific solutions here. For example:
How can we display numbers rounded to one decimal place?
How can we store such rounded numbers in Fortran? It's not possible in a float variable, but are there other ways?
How can we write such numbers to NetCDF?
How can we write such numbers to a CSV or text file?
As others have said, the issue is the use of floating-point representation in the NetCDF file. Using the NCO utilities, you can change the latitude/longitude to short integers with scale_factor and add_offset, like this:
ncap2 -s 'latitude=pack(latitude, 0.1, 0); longitude=pack(longitude, 0.1, 0);' old.nc new.nc
There is no way to do what you are asking. The underlying problem is that the rounded values you desire are not necessarily able to be represented using floating point.
For example, the value 10.58 is stored in IEEE 754 float32 as 1.3224999905 x 2^3 = 10.5799999, already slightly inexact even though it prints as 10.58 at seven significant digits.
When you round this value to one decimal place (however you choose to do so), the result would be 10.6, but 10.6 does not have an exact representation either. The nearest float32 is 1.3250000477 x 2^3 = 10.6000004. So no matter how you deal with the rounding, there is no way to store 10.6 exactly in a float32 value, and no way to write it as a floating-point value into a netCDF file.
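The representation issue is language-independent; a quick C++ check (the same applies to Fortran reals) shows the stored float32 values:
#include <cstdio>

int main() {
    // Nine significant digits are enough to expose the representation error.
    std::printf("%.9g\n", 10.58f);  // 10.5799999
    std::printf("%.9g\n", 10.6f);   // 10.6000004
}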
YES, IT CAN BE DONE! The "accepted" answer above is correct within its limited scope, but is wrong about what you can actually accomplish in Fortran (or various other HGLs).
The only question is what price you are willing to pay if something like a write with an F6.1 edit descriptor fails to meet your needs.
From one perspective, your problem is a particularly trivial variation on the subject of "arbitrary precision" computing. How do you imagine cryptography is handled when you need to store, manipulate, and perform "math" with, say, 1024-bit numbers, with exact precision?
A simple strategy in this case would be to separate each number into its constituent "LHSofD" (Left Hand Side of Decimal), and "RHSofD" values. For example, you might have an RLon(i,j) = 105.591, and would like to print 105.6 (or any manner of rounding) to your netCDF (or any normal) file. Split this into RLonLHS(i,j) = 105, and RLonRHS(i,j) = 591.
... at this point you have choices that increase generality, but at some expense. To save "money" the RHS might be retained as 0.591 (but you lose generality if you need to do fancier things).
For simplicity, assume the "cheap and cheerful" second strategy.
The LHS is easy (Int()).
Now, for the RHS, multiply by 10 (if you wish to round to 1 decimal), e.g. to arrive at RLonRHS(i,j) = 5.91, and then apply Fortran's "round to nearest integer" NInt() intrinsic ... leaving you with RLonRHS(i,j) = 6.0.
... and Bob's your uncle:
Now you print the LHS and RHS to your netCDF using a suitable write statement concatenating the "duals", which will create an EXACT representation as per the required objectives in the OP.
... of course later reading-in those values runs into the same issues as illustrated above, unless the read-in also is ArbPrec aware.
... we wrote our own ArbPrec lib, but there are several about, also in VBA and other HGLs ... but be warned, a full ArbPrec bit of machinery is a non-trivial matter ... luckily your problem is so simple.
There are several aspects one can consider in relation to "rounding to one decimal place". These relate to: internal storage and manipulation; display and interchange.
Display and interchange
The simplest aspects cover how we report stored value, regardless of the internal representation used. As covered in depth in other answers and elsewhere we can use a numeric edit descriptor with a single fractional digit:
print '(F0.1,2X,F0.1)', 10.3, 10.17
end
How the output is rounded is a changeable mode:
print '(RU,F0.1,2X,RD,F0.1)', 10.17, 10.17
end
In this example we've chosen to round up and then down, but we could also round to zero or round to nearest (or let the compiler choose for us).
For any formatted output, whether to screen or file, such edit descriptors are available. A G edit descriptor, such as one may use to write CSV files, will also do this rounding.
For unformatted output this concept of rounding is not applicable as the internal representation is referenced. Equally for an interchange format such as NetCDF and HDF5 we do not have this rounding.
For NetCDF your attribute convention may specify something like FORTRAN_format which gives an appropriate format for ultimate display of the (default) real, non-rounded, variable.
Internal storage
Other answers and the question itself mention the impossibility of accurately representing (and working with) decimal digits. However, nothing in the Fortran language requires this to be impossible:
integer, parameter :: rk = SELECTED_REAL_KIND(radix=10)
real(rk) x
x = 0.1_rk
print *, x
end
is a Fortran program which has a radix-10 variable and literal constant. See also IEEE_SELECTED_REAL_KIND(radix=10).
Now, you are exceptionally likely to see that selected_real_kind(radix=10) gives you the value -5, but if you want something positive that can be used as a type parameter you just need to find someone offering you such a system.
If you aren't able to find such a thing then you will need to work accounting for errors. There are two parts to consider here.
The intrinsic real numerical types in Fortran are floating point ones. To use a fixed point numeric type, or a system like binary-coded decimal, you will need to resort to non-intrinsic types. Such a topic is beyond the scope of this answer, but pointers are made in that direction by DrOli.
These efforts will not be computationally/programmer-time cheap. You will also need to take care of managing these types in your output and interchange.
Depending on the requirements of your work, you may find simply scaling by (powers of) ten and working on integers suits. In such cases, you will also want to find the corresponding NetCDF attribute in your convention, such as scale_factor.
Relating to our internal representation concerns we have similar rounding issues to output. For example, if my input data has a longitude of 10.17... but I want to round it in my internal representation to (the nearest representable value to) a single decimal digit (say 10.2/10.1999998) and then work through with that, how do I manage that?
We've seen how nint(10.17*10)/10. gives us this, but we've also learned something about how numeric edit descriptors do this nicely for output, including controlling the rounding mode:
character(10) :: intermediate
real :: rounded
write(intermediate, '(RN,F0.1)') 10.17
read(intermediate, *) rounded
print *, rounded ! This may look not "exact"
end
We can track the accumulation of errors here if this is desired.
The expression round_x = nint(x*10d0)/10d0 rounds x (for abs(x) < 2**31/10; for larger numbers use dnint()) and assigns the rounded value to the round_x variable for further calculations.
As mentioned in the answers above, not all numbers with one significant digit after the decimal point have an exact representation, for example, 0.3 does not.
print *, 0.3d0
Output:
0.29999999999999999
To output a rounded value to a file, to the screen, or to convert it to a string with a single significant digit after the decimal point, use the edit descriptor 'Fw.1' (w is the field width in characters; w = 0 means the processor chooses a minimal width). For example:
print '(5(1x, f0.1))', 1.30, 1.31, 1.35, 1.39, 345.46
Output:
1.3 1.3 1.4 1.4 345.5
@JohnE, using 'G10.2' is incorrect: it rounds the result to two significant digits, not to one digit after the decimal point. E.g.:
print '(g10.2)', 345.46
Output:
0.35E+03
P.S.
For NetCDF, rounding should be handled by the NetCDF viewer; however, you can output variables as the NC_STRING type:
write(NetCDF_out_string, '(F0.1)') 1.49
Or, alternatively, get "beautiful" NC_FLOAT/NC_DOUBLE numbers:
beautiful_float_x = nint(x*10.)/10. + epsilon(1.)*nint(x*10.)/10./2.
beautiful_double_x = dnint(x*10d0)/10d0 + epsilon(1d0)*dnint(x*10d0)/10d0/2d0
P.P.S. @JohnE
The preferred solution is not to round intermediate results in memory or in files; rounding is performed only when the final human-readable output is produced.
1. Use print with the edit descriptor 'Fw.1', see above.
2. There are no simple and reliable ways to accurately store rounded numbers (numbers with a decimal fixed point):
2.1. Theoretically, some Fortran implementations could support decimal arithmetic, but I am not aware of any implementation in which 'selected_real_kind(4, 4, 10)' returns a value other than -5;
2.2. It is possible to store rounded numbers as strings;
2.3. You can use the Fortran bindings of the GMP library; functions with the mpq_ prefix are designed to work with rational numbers.
3. There are no simple and reliable ways to write rounded numbers to a netCDF file while preserving their properties for the reader of that file:
3.1. netCDF supports 'Packed Data Values', i.e. you can declare an integer type with the attributes 'scale_factor' and 'add_offset' and save arrays of integers. But 'scale_factor' will be stored in the file as a single- or double-precision floating-point number, i.e. its value will differ from 0.1. Accordingly, when the netCDF library unpacks the data as unpacked_data_value = packed_data_value*scale_factor + add_offset, there will be a rounding error. (You can set scale_factor=0.1*(1.+epsilon(1.)) or scale_factor=0.1d0*(1d0+epsilon(1d0)) to avoid a long run of the digit 9.);
3.2. There are C_format and FORTRAN_format attributes, but it is quite difficult to predict which reader will use which attribute, or whether they will be used at all;
3.3. You can store rounded numbers as strings or user-defined types;
4. Use write() with the edit descriptor 'Fw.1', see above.

`std::sin` is wrong in the last bit

I am porting some program from Matlab to C++ for efficiency. It is important for the output of both programs to be exactly the same (**).
I am facing different results for this operation:
std::sin(0.497418836818383950) = 0.477158760259608410 (C++)
sin(0.497418836818383950) = 0.47715876025960846000 (Matlab)
N[Sin[0.497418836818383950], 20] = 0.477158760259608433 (Mathematica)
So, as far as I know, both C++ and Matlab are using IEEE 754 double arithmetic. I think I have read somewhere that IEEE 754 allows different results in the last bit. Using Mathematica as the arbiter, it seems C++ is closer to the true result. How can I force Matlab to compute the sin with precision to the last bit included, so that the results are the same?
In my program this behaviour leads to big errors, because the numerical differential equation solver keeps amplifying this last-bit error. However, I am not sure that the C++ port is correct. I am guessing that even if IEEE 754 allows the last bit to differ, it somehow guarantees that this error does not grow when the result is used in further IEEE 754 operations (because otherwise, two different programs, each correct according to the IEEE 754 standard, could produce completely different outputs). So the other question is: am I right about this?
I would like to get an answer to both bolded questions. Edit: The first question turned out to be quite controversial, but it is the less important one; can someone comment on the second?
Note: This is not an error in the printing, just in case you want to check, this is how I obtained these results:
http://i.imgur.com/cy5ToYy.png
Note (**): What I mean by this is that the final output, which is the result of some calculations shown as real numbers with 4 decimal places, needs to be exactly the same. The error I talk about in the question gets bigger with more operations, each of which differs slightly between Matlab and C++, so the final differences are huge. (If you are curious enough to see how the difference starts getting bigger, here is the full output [link soon], but this has nothing to do with the question.)
Firstly, if your numerical method depends on the accuracy of sin to the last bit, then you probably need to use an arbitrary precision library, such as MPFR.
The IEEE 754-2008 standard doesn't require that the functions be correctly rounded (it does "recommend" it though). Some C libms do provide correctly rounded trigonometric functions: I believe that the glibc libm does (typically used on most Linux distributions), as does CRlibm. Most other modern libms will provide trig functions that are within 1 ulp (i.e. one of the two floating-point values either side of the true value), often termed faithfully rounded, which is much quicker to compute.
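To see exactly what your libm returns, print every digit the double carries plus the hexfloat bit pattern (a quick C++ check; the output differs between correctly rounded and faithfully rounded libms):
#include <cmath>
#include <cstdio>

int main() {
    double s = std::sin(0.497418836818383950);
    std::printf("%.17g\n", s);  // all the decimal digits the double can carry
    std::printf("%a\n", s);     // hexfloat: unambiguous, good for comparing platforms
}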
None of those values you printed could actually arise as IEEE 64-bit floating-point values (even if rounded): the 3 nearest (printed to full precision) are:
0.477158760259608 405451814405751065351068973541259765625
0.477158760259608 46096296563700889237225055694580078125
0.477158760259608 516474116868266719393432140350341796875
The possible values you could want are:
The exact sin of the decimal .497418836818383950, which is
0.477158760259608 433132061388630377105954125778369485736356219...
(this appears to be what Mathematica gives).
The exact sin of the 64-bit float nearest .497418836818383950:
0.477158760259608 430531153841011107415427334794384396325832953...
In both cases, the first of the above list is the nearest (though only barely in case 1).
The sine of the double constant you wrote is about 0x1.e89c4e59427b173a8753edbcb95p-2, whose nearest double is 0x1.e89c4e59427b1p-2. To 20 decimal places, the two closest doubles are 0.47715876025960840545 and 0.47715876025960846096.
Perhaps Matlab is displaying a truncated value? (EDIT: I now see that the fourth-last digit is a 6, not a 0. Matlab is giving you a result that's still faithfully rounded, but it's the farther of the two closest doubles to the desired result, and it's still printing out the wrong number.)
I should also point out that Mathematica is probably trying to solve a different problem: compute the sine of the decimal number 0.497418836818383950 to 20 decimal places. You should not expect this to match either the C++ code's result or Matlab's result.

Conversion from string to double - Possibility and errors

I am aware that the string 2.34 would never be exactly equal to the double 2.34, no matter what library or algorithm you try (lexical_cast, atof). 2.3400 cannot be represented exactly as a double either; instead it will be something like 2.3399999999999999.
A little background: I am working on an application that passes values to an external application using its API. Think of it as some sort of trading application. The user can pass values using the application's API, or by using the application directly. When the user uses the application directly and types in 2.34, the value is processed as 2.34. However, when I use the API, which requires a double as a parameter, I pass 2.34 and it comes through as 2.3399999999999999, which is not acceptable.
My question is: how would the application be handling this, and is there a way to store 2.3400... in a double so that I can pass it to the API?
If you need to pass decimal values through an API which takes double but you need to get the exact values back, there isn't much of a problem: as long as you don't use more than std::numeric_limits<double>::digits10 digits, you can recover the original decimal value, although not necessarily the same representation (trailing fractional zeros will be lost). To do so, you need to convert the original decimal string into the closest double and later use a suitable algorithm to restore the best decimal representation again. The parsing and formatting functions from the C and C++ standard libraries will do that correctly for you.
Note that you shouldn't try to do any arithmetic on the double values when you want to restore the original decimal values: the result of double arithmetic will use binary rounding and the values won't be the closest decimal values. However, as long as you only transfer the double values, there is no problem.
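A small C++ sketch of that round trip (the input string is just an example):
#include <cstdio>
#include <cstdlib>
#include <limits>

int main() {
    const char *input = "2.34";        // the user's decimal value

    // Decimal string -> nearest double: what travels through the API.
    double throughApi = std::strtod(input, nullptr);

    // Double -> decimal string: digits10 (15 for IEEE 754 binary64) is the
    // most digits guaranteed to survive the round trip.
    char restored[64];
    std::snprintf(restored, sizeof restored, "%.*g",
                  std::numeric_limits<double>::digits10, throughApi);

    std::printf("%s\n", restored);     // prints 2.34 again
}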
Since you mention "trading application" I will conclude that the numbers represent currencies. If that is the case you are probably dealing with a fixed number of fractional digits as well. In that case you can scale your floating point numbers by multiplying them by 10 ^ number_of_fractional_digits, essentially making them integer values. Floating point numbers can accurately store integer values (as long as they do not exceed the floating point type's range).
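For instance, with two fractional digits (a minimal sketch; llround snaps the scaled value to the nearest integer, absorbing the tiny representation error):
#include <cmath>
#include <cstdio>

int main() {
    double price = 2.34;                          // stored as 2.33999999999999996...

    // Scale to whole cents and snap to the nearest integer.
    long long cents = std::llround(price * 100.0);

    // From here on, all arithmetic is exact integer arithmetic on cents.
    long long total = cents * 3;                  // three items at 2.34
    std::printf("%lld.%02lld\n", total / 100, total % 100);  // 7.02
}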
Another possibility - if the assumptions above are correct - would be to use binary-coded decimals (BCD).
One way to work around floating-point precision issues is to use a well-made fraction class. You may code one yourself or use one provided by common math libraries. Such a class will represent your 2.34 as 234/100 internally, which leads to higher memory consumption compared to a single float.
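A bare-bones sketch of such a class (hypothetical and minimal: no overflow handling or sign normalization; real libraries such as Boost.Rational do much more):
#include <cstdio>
#include <numeric>   // std::gcd (C++17)

// An exact fraction: 2.34 is held as 234/100 (reduced to 117/50), so
// arithmetic on it never suffers binary rounding.
struct Fraction {
    long long num, den;

    Fraction(long long n, long long d) : num(n), den(d) { reduce(); }

    void reduce() {
        long long g = std::gcd(num < 0 ? -num : num, den);
        if (g != 0) { num /= g; den /= g; }
    }

    Fraction operator+(const Fraction& o) const {
        return Fraction(num * o.den + o.num * den, den * o.den);
    }
    Fraction operator*(const Fraction& o) const {
        return Fraction(num * o.num, den * o.den);
    }
};

int main() {
    Fraction price(234, 100);                 // exactly 2.34
    Fraction total = price * Fraction(3, 1);  // three items
    std::printf("%lld/%lld\n", total.num, total.den);  // 351/50, i.e. 7.02
}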

Is there a solution for Floating point Arithmetic problems in C++?

I am doing some floating-point arithmetic and having precision problems: the resulting value differs on two machines for the same input. I read the post Why can't I multiply a float? and also other material on the web, and understood that it has to do with the binary representation of floating point and with machine epsilon. However, I wanted to check whether there is a way to solve this problem, or some workaround for floating-point arithmetic in C++.
I am converting a float to an unsigned short for storage and converting it back when necessary. However, when I convert it back, the precision (to 6 decimal points) remains correct on one machine but fails on the other.
// Convert float to unsigned short
unsigned short sConst = 0xFFFF;
unsigned short shortValue = (unsigned short)(floatValue * sConst);
// Convert unsigned short back to float
float restoredValue = (float)shortValue / sConst;
A short must be at least 16 bits, and in a whole lot of implementations that's exactly what it is. An unsigned 16-bit short will hold values from 0 to 65535. That means that a short will not hold a full five digits of precision, and certainly not six. If you want six digits, you need 20 bits.
Therefore, any loss of precision is likely due to the fact that you're trying to pack six digits of precision into something less than five digits. There is no solution to this, other than using an integral type that probably takes as much storage as a float.
I don't know why it would seem to work on one given system. Were you using the same numbers on both? Did one of them use an older floating-point format that coincidentally gave the results you were expecting on the samples you tried? Was it possibly using a larger short than the other?
If you want to use native floating point types, the best you can do is to assert that the values output by your program do not differ too much from a set of reference values.
The precise definition of "too much" depends entirely on your application. For example, if you compute a + b on different platforms, you should find the two results to be within machine precision of each other. On the other hand, if you're doing something more complicated like matrix inversion, the results will most likely differ by more than machine precision. Determining precisely how close you can expect the results to be to each other is a very subtle and complicated process. Unless you know exactly what you are doing, it is probably safer (and saner) to determine the amount of precision you need downstream in your application and verify that the result is sufficiently precise.
To get an idea about how to compute the relative error between two floating point values robustly, see this answer and the floating point guide linked therein:
Floating point comparison functions for C#
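In C++, such a comparison might look like the following sketch (the tolerances are placeholders; the right values depend entirely on your application):
#include <algorithm>
#include <cmath>
#include <cstdio>

// True if a and b agree to within a relative tolerance, with an absolute
// tolerance as a fallback near zero, where relative error loses meaning.
bool almostEqual(double a, double b,
                 double relTol = 1e-9, double absTol = 1e-12) {
    double diff = std::fabs(a - b);
    if (diff <= absTol) return true;
    return diff <= relTol * std::max(std::fabs(a), std::fabs(b));
}

int main() {
    std::printf("%d\n", almostEqual(0.1 + 0.2, 0.3));  // 1: equal within tolerance
}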
Are you looking for a standard like this:
Programming Languages C++ - Technical Report Type 2 on Extensions for the programming language C++ to support decimal floating-point arithmetic (draft)
Instead of using 0xFFFF, use half of it, i.e. 32768, for the conversion. 32768 (0x8000) has a binary representation of 1000000000000000, whereas 0xFFFF has a binary representation of 1111111111111111. Because 0x8000 is a power of two, the multiplication and division during the conversion (to short, or back to float) are exact and introduce no additional rounding. For a one-way conversion, however, 0xFFFF is preferable, as it leads to a more accurate result.