I have a file spec (here: http://www.septentrio.com/secure/asterx1v_2_1/SBF%20Reference%20Guide.pdf) that has fields marked as both 32-bit and 64-bit floats (see page 8). How can I use both widths in my program? I am developing on Mac OS X right now, but I will also deploy on a Linux machine.
More details:
I know I could tell the compiler the width, but how can I handle two different float widths in the same file? Maybe someone also has a suggestion for changing the way I parse, which is to reinterpret_cast(buffer+offset) and then use the values. These files are huge (4 GB), so I need performance.
This might seem obvious, nevertheless:
On the Intel platform, and many others, float is a 32-bit floating-point value and double is a 64-bit floating-point value. Try this approach; most likely it will work.
To be absolutely sure, check the sizeof of your types at the start of your program, or statically at compile time (e.g. with static_assert) if your compiler allows it.
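For example (a sketch, not from the spec; the offsets below are made up): static_assert does the compile-time size check, and a memcpy-based reader avoids the strict-aliasing and alignment pitfalls of a raw reinterpret_cast while typically compiling down to a single load.

#include <cstddef>
#include <cstring>

static_assert(sizeof(float) == 4,  "expected 32-bit float");
static_assert(sizeof(double) == 8, "expected 64-bit double");

// Read one value of type T from a raw byte buffer at a given offset.
template <typename T>
T read_at(const char *buffer, std::size_t offset) {
    T value;
    std::memcpy(&value, buffer + offset, sizeof value);
    return value;
}

// Hypothetical usage; real offsets come from the SBF block layout:
// float  f32 = read_at<float>(buffer, 8);
// double f64 = read_at<double>(buffer, 12);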
Once again, try the simple solution first.
Both float and double arithmetic are implemented in hardware on Intel, and both are fast. In any case, native arithmetic is the fastest you can get from the CPU.
IEEE 754 (http://en.wikipedia.org/wiki/IEEE_floating_point) defines not one floating-point format but several, e.g. 4, 8, and 16 bytes wide. They all have different range and precision, but they are all still IEEE values.
Related
I intend to use half-precision floating point in my code, but I am not able to figure out how to declare it. For example, I want to do something like the following:
fp16 a_fp16;
bfloat a_bfloat;
However, the compiler does not seem to know these types (fp16 and bfloat are just dummy types, for demonstration purposes).
I remember reading that bfloat support was added in GCC 10, but I am not able to find it in the manual. I am especially interested in bfloat16 floating-point numbers.
Additional Questions -
Does FP16 have hardware support on Intel/AMD as of today? I think native hardware support was added as far back as Ivy Bridge. (https://scicomp.stackexchange.com/questions/35187/is-half-precision-supported-by-modern-architecture)
I wanted to confirm whether using FP16 will indeed increase FLOPs. I remember reading somewhere that all arithmetic operations on fp16 are internally converted to fp32 first, so fp16 only affects cache footprint and bandwidth.
Is there SIMD intrinsic support for half-precision floats, especially bfloat? (I am aware of intrinsics like _mm256_mul_ph, but I am not sure how to pass the 16-bit FP datatype; I would really appreciate it if someone could highlight this too.)
Have these types been added to the Intel compilers as well?
PS - Related post: Half-precision floating-point arithmetic on Intel chips, but it does not cover declaring half-precision floating-point numbers.
TIA
Neither the C++ nor the C language has arithmetic types for half floats. (C++23 has since added std::float16_t and std::bfloat16_t in <stdfloat>, but they are optional and compiler support is still limited.)
The GCC compiler supports half floats as a language extension. Quote from the documentation:
On x86 targets with SSE2 enabled, GCC supports half-precision (16-bit) floating point via the _Float16 type. For C++, x86 provides a builtin type named _Float16 which has the same data format as C.
...
On x86 targets with SSE2 enabled, without -mavx512fp16, all operations will be emulated by software emulation and the float instructions. The default behavior for FLT_EVAL_METHOD is to keep the intermediate result of the operation as 32-bit precision. This may lead to inconsistent behavior between software emulation and AVX512-FP16 instructions. Using -fexcess-precision=16 will force round back after each operation.
Using -mavx512fp16 will generate AVX512-FP16 instructions instead of software emulation. The default behavior of FLT_EVAL_METHOD is to round after each operation. The same is true with -fexcess-precision=standard and -mfpmath=sse. If there is no -mfpmath=sse, -fexcess-precision=standard alone does the same thing as before; it is useful for code that does not have _Float16 and runs on the x87 FPU.
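With GCC, the declarations themselves can look like this (a minimal sketch; it assumes a recent GCC on x86 with SSE2 — _Float16 appeared for this target around GCC 12 and usable __bf16 conversions/arithmetic around GCC 13, so check your version's manual):

#include <cstdio>

int main() {
    _Float16 h = static_cast<_Float16>(1.5f);  // IEEE binary16 half precision
    __bf16   b = static_cast<__bf16>(2.5f);    // bfloat16 (truncated float32)
    // Older GCC versions treat these as storage-only, so widen for arithmetic:
    float sum = static_cast<float>(h) + static_cast<float>(b);
    std::printf("%f\n", static_cast<double>(sum));
    return 0;
}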
The question is in the title. It seems that the software I deliver to my customer behaves differently depending on whether some parameters are passed as integers or as floats. I build a DLL for my customer with MinGW, and he integrates it into his Visual Studio project, which uses some other compiler (no idea which; I guess the standard one for VS).
Could it be that floats are represented differently by his compiler than by mine?
Thanks for the heads up,
Charles
Yes, floating point representation is compiler dependent.
In theory you can use std::numeric_limits to determine the main aspects of the representation, such as whether it's IEEE 754, or whether it's binary or decimal.
In practice you can't rely on that except for the memory layout, because with at least one major compiler, g++, the semantics of floating-point operations are strongly influenced by the optimization options (e.g. whether NaN compares equal to itself or not).
I.e., in practice it's not only compiler-dependent but also option-dependent.
Happily, compilers for a given platform will generally conform to that platform's standard memory layout for floating point, e.g. IEEE 754 on Windows (the standard originated on the PC platform). So floating-point values should in general survive being exchanged between, say, g++ and Visual C++. One exception is that with g++, long double maps to 80-bit IEEE 754 extended precision, while with Visual C++ it maps to ordinary double, i.e. 64 bits, and that could conceivably be what is making trouble for you.
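A quick probe you can compile under each toolchain to compare representations (a sketch; on x86 g++ the last line typically prints 64 mantissa bits, while Visual C++ prints 53):

#include <iostream>
#include <limits>

int main() {
    std::cout << std::boolalpha
              << "double is IEEE 754:  "
              << std::numeric_limits<double>::is_iec559 << '\n'
              << "sizeof(long double): " << sizeof(long double) << '\n'
              << "long double mantissa bits: "
              << std::numeric_limits<long double>::digits << '\n';
}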
I'm working on updating a serialization library to add support for serializing floating point in a portable manner. Ideally I'd like to be able to test the code in an environment where IEEE 754 isn't supported. Would it be sufficient to test using a soft-float library? Or any other suggestions about how I can properly test the code?
Free toolchains that you can find for ARM (embedded Linux) development mostly do not support hard-float operation, only soft-float. You could try one of these (e.g. CodeSourcery), but you would need some kind of platform to run the compiled code (real hardware or QEMU).
Or if you want to do the same on an x86 machine, take a look at: Using software floating point on x86 linux
Should your library work on a system where neither hardware floating point nor soft-float is available? If so, and you test using a compiler with soft-float, your code may not compile/work on such a system.
Personally, I would test the library on an ARM9 system with a gcc compiler without soft-float.
Not an answer to your actual question, but a description of what you must do to solve the problem.
If you want to support "different" floating-point formats, your code has to understand the internal format of floats (unless you only support "same architecture at both ends"): pick the floating-point number apart into your own interchange format (which of course may be IEEE 754, but beware of denormals, 128-bit long doubles, NaN, infinity and other "exceptional values", and of course out-of-range numbers), and then put it back together in the format required by the other end. If you are not doing this, there is no point in hunting down a non-IEEE-754 system, because it won't work.
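A sketch of the first step of that picking-apart for a 32-bit IEEE 754 float (handling of denormals, NaN, infinity and out-of-range values is deliberately omitted here and must be added for real use):

#include <cstdint>
#include <cstring>

struct UnpackedFloat {
    std::uint32_t sign;      // 1 bit
    std::int32_t  exponent;  // unbiased
    std::uint32_t mantissa;  // 23 explicit bits (implicit leading 1 not included)
};

UnpackedFloat unpack(float f) {
    std::uint32_t bits;
    std::memcpy(&bits, &f, sizeof bits);   // get at the representation safely
    UnpackedFloat u;
    u.sign     = bits >> 31;
    u.exponent = static_cast<std::int32_t>((bits >> 23) & 0xFF) - 127;
    u.mantissa = bits & 0x7FFFFF;
    return u;
}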
I want to achieve exactly the same floating-point results in a gcc/Linux port of a Windows program. For that reason I want all double operations to be performed in 64-bit precision. This can be done using, for example, -mpc64, -msse2 or -ffloat-store (all with side effects). However, one thing I can't fix is the transcendental functions like sin/asin etc. The docs say that they internally expect (and use, I suppose) long double precision, and whatever I do they produce results different from their Windows counterparts.
How is it possible to make these functions calculate their results using 64-bit floating-point precision?
UPDATE: I was wrong; it is printf("%.17f") that incorrectly rounds the correct double result. "print x" in gdb shows that the number itself is correct. I suppose I need a different question on this one... perhaps on how to make printf not treat the double internally as extended. Maybe using a stringstream will give the expected results... Yes, it does.
Different libm libraries use different algorithms for the elementary functions, so you have to use the same library on both Windows and Linux to achieve exactly the same results. I would suggest compiling FDLibM and statically linking it with your software.
I found that it is printf("%.17f") that uses incorrect precision to print the results (probably treating the double as extended internally); when I use stringstream << setprecision(17), the result is correct. So the answer is not really related to the question, but at least it works for me.
I would still be glad if someone could provide a way to make printf produce the expected results.
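For reference, the workaround looks something like this (a minimal sketch):

#include <iostream>
#include <sstream>
#include <iomanip>

int main() {
    double x = 0.1;
    std::ostringstream ss;
    ss << std::setprecision(17) << x;   // formats from the 64-bit double value
    std::cout << ss.str() << '\n';
}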
An excellent solution for the transcendental-function problem is to use the GNU MPFR library. But be aware that Microsoft compilers do not support extended-precision floating point: with the Microsoft compiler, double and long double are both 53-bit precision, while with gcc on x86, long double is 64-bit (x87 extended) precision. To get matching results across Windows/Linux, you must either avoid using long double or avoid using Microsoft compilers. For many Windows projects, the Windows port of gcc (MinGW) works well; this lets the Windows project use 64-bit-precision long doubles. One problem with MinGW long double support is that MinGW uses the Microsoft runtime for calls such as printf, so printing a long double doesn't work correctly. A work-around for this problem is to use mpfr_printf.
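A minimal sketch of the MPFR approach (assuming MPFR is installed; link with -lmpfr -lgmp):

#include <mpfr.h>

int main() {
    mpfr_t x;
    mpfr_init2(x, 53);                 // 53-bit precision, same as IEEE double
    mpfr_set_d(x, 0.5, MPFR_RNDN);
    mpfr_sin(x, x, MPFR_RNDN);         // correctly rounded, same on every platform
    mpfr_printf("%.17Rf\n", x);        // %Rf formats an mpfr_t directly
    mpfr_clear(x);
    return 0;
}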
As per the C99 standard, the size of long long should be a minimum of 64 bits. How is this implemented on a 32-bit machine (e.g. addition or multiplication of two long longs)? Also, what is the equivalent of long long in C++?
The equivalent in C++ is long long as well. It's not required by the standard, but most compilers support it because it's so useful.
How is it implemented? Most computer architectures already have built-in support for multi-word additions and subtractions. They don't do a 64-bit addition directly, but use the carry flag and a special add-with-carry instruction to build a 64-bit add out of two 32-bit adds.
The same extension exists for subtraction (the carry is called a borrow in that case).
Long-word multiplications and divisions can be built from smaller multiplications without the help of carry flags. Sometimes simply doing the operations bit by bit is faster, though.
There are architectures that don't have any flags at all (some DSP chips and simple microcontrollers). On these architectures the overflow has to be detected with logic operations, and multi-word arithmetic tends to be slow.
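In portable C++ the carry trick looks something like this (a sketch; real compilers emit the carry-flag instructions directly instead):

#include <cstdint>

struct u64parts { std::uint32_t low, high; };

u64parts add64(u64parts a, u64parts b) {
    u64parts r;
    r.low = a.low + b.low;
    // Unsigned wraparound: if the low-word sum is smaller than an operand,
    // the addition carried out of bit 31.
    std::uint32_t carry = (r.low < a.low) ? 1u : 0u;
    r.high = a.high + b.high + carry;
    return r;
}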
On the IA-32 architecture, 64-bit integers are implemented using two 32-bit registers (eax and edx).
There are platform-specific equivalents for C++, and you can use the stdint.h header where available (Boost provides one).
As everyone has stated, a 64-bit integer is typically implemented by simply using two 32-bit integers together, with clever code generation to track the carry and/or borrow bits between the two halves and adjust accordingly.
This of course makes such arithmetic more costly in code space and execution time than the same code compiled for an architecture with native support for 64-bit operations.
If you care about bit-sizes, you should use
#include <stdint.h>
int32_t n;
and friends. This works for C++ as well.
64-bit numbers on 32-bit machines are implemented just as you would think: with 4 extra bytes, i.e. two 32-bit words. You could therefore implement your own 64-bit datatype by doing something like this:
struct my_64bit_integer {
uint32_t low;
uint32_t high;
};
You would of course have to implement mathematical operators yourself.
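For instance, addition could reuse the wraparound carry trick from the earlier answer (a sketch, ignoring sign handling):

my_64bit_integer operator+(const my_64bit_integer &a, const my_64bit_integer &b) {
    my_64bit_integer r;
    r.low  = a.low + b.low;                               // low words first
    r.high = a.high + b.high + (r.low < a.low ? 1u : 0u); // propagate the carry
    return r;
}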
There is an int64_t in the stdint.h that comes with my GCC version, and in Microsoft Visual C++ you have an __int64 type as well.
The next C++ standard (due 2009, or maybe 2010) is slated to include the "long long" type. As mentioned earlier, it's already in common use.
The implementation is up to the compiler writers, although computers have always supported multiple-precision operations. Some languages, like Python and Common Lisp, require support for arbitrary-precision integers. Long ago, I wrote 64-bit multiplication and division routines for a computer (the Z80) that could manage 16-bit addition and subtraction, with no hardware multiplication at all.
Probably the easiest way to see how an operation is implemented on your particular compiler is to write a code sample and examine the assembler output, which is available from all the major compilers I've worked with.
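For example (a sketch; the flags are illustrative), compiling this with g++ -m32 -O2 -S and reading the resulting .s file shows the add/adc pair used for the 64-bit addition:

long long add(long long a, long long b) {
    return a + b;   // on a 32-bit target: two adds linked by the carry flag
}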