Relevant reading: http://www.opengl.org/wiki/Image_Format#Color_formats
Normalized texture formats (e.g., GL_RGB8_SNORM and GL_RGB16) store integers that map to floating point ranges ([-1.0, 1.0] for signed normalized, [0.0, 1.0] for unsigned normalized).
It seems to me like there's a very good reason for having GL_RGB32, GL_RGBA_SNORM, etc. tokens: the precisions would surpass dedicated floating point formats, like GL_RGB32F. Also, for completeness: why have normalized formats for 8 bits and 16 bits, but not 32?
So, why don't GL_RGB32, GL_RGBA32 exist?
It seems to me like there's a very good reason for having GL_RGB32, GL_RGBA_SNORM, etc. tokens: the precisions would surpass dedicated floating point formats, like GL_RGB32F.
You've just answered your own question. They don't allow it because to allow it would mean either discarding 8 bits of that extra precision in the conversion to single-precision floats, or converting those 32-bit normalized integers to double-precision floats.
And you'll notice that not even GL 4.2 allows for double-precision float textures.
It's not allowed because it simply wouldn't be useful on current hardware. Current hardware doesn't support it because supporting it would mean fetching double-precision values from textures.
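To make the precision argument concrete, here is a minimal sketch in C++. The helper names are hypothetical and this is not how any GL implementation is written; it only mimics the unsigned normalized mapping on the CPU, assuming float is IEEE-754 binary32 with the default rounding mode.

```cpp
#include <cstdint>

// Hypothetical helper: map an unsigned normalized 32-bit integer to [0.0, 1.0]
// and return it as a single-precision float, as a texture fetch would have to.
inline float unorm32_to_float(std::uint32_t v) {
    return static_cast<float>(static_cast<double>(v) / 4294967295.0);
}

// The same mapping kept in double precision, for comparison.
inline double unorm32_to_double(std::uint32_t v) {
    return static_cast<double>(v) / 4294967295.0;
}
```

With these helpers, 0x80000000 and 0x80000001 both come out as the same float, while the double results still differ. A 24-bit significand cannot tell neighbouring 32-bit normalized values apart.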
Cppreference documents that stdfloat includes 5 new types: float16_t, float32_t, float64_t, float128_t and bfloat16_t. While the first 4 types are self-explanatory (a float with 16, 32, 64, and 128 bits respectively), the last type bfloat16_t is not at all clear to me. What does this type represent? What does the b in its name mean?
"bfloat16" refers to a fairly recent 16-bit floating-point format that is not a valid IEEE-754/IEC 60559 defined format. But it is related to them.
BINARY16 is just BINARY32 with smaller numbers for its components. But the size changes are evenly distributed; it has both a smaller mantissa and a smaller exponent.
Bfloat16 opts for a different way to spend its 16 bits. It elects to keep the exponent the same size as BINARY32 (8-bit) while shrinking the mantissa down to 7-bits (explicit). This makes transforming between bfloat16 and BINARY32 a faster operation.
It does have some special-case weirdness, but overall, it's a truncated BINARY32. And while it's widely supported in a surprising amount of GPU hardware, it's not truly a standard.
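The "truncated BINARY32" description can be sketched in a few lines of C++. The function names are hypothetical, the code assumes float is IEEE-754 binary32, and it truncates instead of rounding and gives NaN no special treatment, so it is an illustration and not a production conversion.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical helper: keep the top 16 bits of a binary32 value
// (1 sign bit, 8 exponent bits, 7 explicit mantissa bits).
// Truncates toward zero; NaN payloads are not handled specially.
inline std::uint16_t float_to_bf16_truncate(float f) {
    std::uint32_t bits;
    std::memcpy(&bits, &f, sizeof bits);   // read the float's bit pattern
    return static_cast<std::uint16_t>(bits >> 16);
}

// Widening back is exact: the missing 16 mantissa bits become zeros.
inline float bf16_to_float(std::uint16_t h) {
    std::uint32_t bits = static_cast<std::uint32_t>(h) << 16;
    float f;
    std::memcpy(&f, &bits, sizeof f);
    return f;
}
```

Since the exponent field is untouched, a value like 1e30f survives the round trip with a relative error below 2^-7, whereas it would overflow in BINARY16.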
The utility of bfloat16 comes down to two things: the speed of conversion, and the compromises of BINARY16.
BINARY16 is a great format... for colors. Having a maximum range of only 5 decimal orders of magnitude above zero is adequate for many cases of high-dynamic range rendering.
But this compromise is a problem for machine learning operations. If precision matters, you're going to have to spend 32-bits to get it. But if precision isn't all that important, being able to save 16 bits without the range compromise can be useful in these applications.
I just wanted to know how the CPU "Cast" a floating point number.
I mean, I suppose that when we use a "float" or "double" in C/C++ the compiler is using the x87 unit, or am I wrong? (I couldn't find the answer.) So, if this is the case and the floating point numbers are not emulated, how does the compiler cast it?
I mean, I suppose that when we use a "float" or "double" in C/C++ the compiler is using the x87 unit, or am I wrong?
On modern Intel processors, the compiler is likely to use the SSE/AVX registers. The FPU is often not in regular use.
I just wanted to know how the CPU "Cast" a floating point number.
Converting an integer to a floating-point number is a computation that is basically (glossing over some details):
Start with the binary (for unsigned types) or two’s complement (for signed types) representation of the integer.
If the number is zero, return all bits zero.
If it is negative, remember that and negate the number to make it positive.
Locate the highest bit set in the integer.
Locate the lowest bit that will fit in the significand of the destination format. (For example, for the IEEE-754 binary32 format commonly used for float, 24 bits fit in the significand, so the 25th bit after the highest bit set does not fit.)
Round the number at that position where the significand will end.
Calculate the exponent, which is a function of where the highest bit set is. Add a “bias” used in encoding the exponent (127 for binary32, 1023 for binary64).
Assemble a sign bit, bits for the exponent, and bits for the significand (omitting the high bit, because it is always one). Return those bits.
That computation prepares the bits that represent a floating-point number. (It omits details involving special cases like NaNs, infinities, and subnormal numbers because these do not occur when converting typical integer formats to typical floating-point formats.)
That computation may be performed “in software” (that is, with general instructions for shifting bits, testing values, and so on) or “in hardware” (that is, with special instructions for doing the conversion). All desktop computers have instructions for this. Small processors for special-purpose embedded use might not have such instructions.
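As an illustration, the steps above can be sketched "in software" in C++ for a 32-bit signed integer converted to binary32. The function names are made up for this example, the rounding is round-to-nearest with ties to even, and the final reinterpretation assumes float is binary32.

```cpp
#include <cstdint>
#include <cstring>

// Sketch: build the binary32 bit pattern for a 32-bit signed integer.
inline std::uint32_t int32_to_binary32_bits(std::int32_t value) {
    if (value == 0) return 0;                        // zero: all bits zero

    std::uint32_t sign = 0;
    std::uint32_t mag = static_cast<std::uint32_t>(value);
    if (value < 0) {                                 // remember the sign, then negate
        sign = 0x80000000u;
        mag = 0u - mag;                              // unsigned negate, fine for INT32_MIN
    }

    int high = 31;                                   // locate the highest set bit
    while (((mag >> high) & 1u) == 0) --high;

    std::uint32_t exponent = static_cast<std::uint32_t>(high);
    std::uint32_t significand;                       // 24 bits including the leading one
    if (high <= 23) {
        significand = mag << (23 - high);            // everything fits, no rounding
    } else {
        int shift = high - 23;                       // count of low bits that do not fit
        std::uint32_t dropped = mag & ((1u << shift) - 1u);
        std::uint32_t half = 1u << (shift - 1);
        significand = mag >> shift;
        if (dropped > half || (dropped == half && (significand & 1u))) {
            ++significand;                           // round up (ties go to even)
            if (significand == (1u << 24)) {         // rounding carried into a new bit
                significand >>= 1;
                ++exponent;
            }
        }
    }

    // Add the bias (127) and drop the leading one of the significand.
    return sign | ((exponent + 127u) << 23) | (significand & 0x7FFFFFu);
}

// Reinterpret the assembled bits as a float (assumes float is binary32).
inline float bits_to_float(std::uint32_t bits) {
    float f;
    std::memcpy(&f, &bits, sizeof f);
    return f;
}
```

Under those assumptions the result should agree with what `static_cast<float>` produces in the default rounding mode, for example for 16777217, which does not fit in 24 bits and rounds to 16777216.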
It is not clear what you mean by "Cast" a floating point number.
If the target architecture has an FPU, then the compiler will issue FPU instructions to manipulate floating point variables; no mystery there.
In order to assign a float variable to an int variable, the float must be truncated or rounded (up or down). Special instructions usually exist to serve this purpose.
If the target architecture is "FPU-less", then the compiler (toolchain) might provide a software implementation of floating point operations using the CPU instructions available. For example, an expression like a = x * y; will be equivalent to a = fmul(x, y); where fmul() is a compiler-provided special function (intrinsic) that does floating point operations without an FPU. Of course, this is typically MUCH slower than using a hardware FPU. Floating point arithmetic is not used on such platforms if performance matters; fixed point arithmetic https://en.wikipedia.org/wiki/Fixed-point_arithmetic could be used instead.
I'm writing a binary file reader/writer and have decided that to handle the issue of endianness I will convert all data to "network" (big) endianness on writing and to host endianness on reading. I'm avoiding hton* because I don't want to link with winsock for just those functions.
My main point of confusion comes from how to handle floating point values. For all integral values I have the sized types in <cstdint> (uint32_t, etc.), but from my research no such equivalent exists for floating point types. I'd like to convert all floating point values to a 32 bit representation on writing and convert back to whatever precision is used on the host (32 bit is enough for my application). This way I will know precisely how many bytes to write and read for floating point values; as opposed to if I used sizeof(float) and sizeof(float) was different on the machine loading the file than the machine that wrote it.
I was just made aware of the possibility of using frexp to get the mantissa and exponent in integer terms, writing those integers out (with some fixed size), then reading the integers in and reconstructing the floating point value using ldexp. This looks promising, but I am wondering if there is any generally accepted or recommended method for handling float endianness without htonf/ntohf.
I know that almost certainly any platform I'll be targeting anytime soon will have float represented with 32 bits, but I'd like to make the code I write now as compatible as I can for use in future projects.
If you want to be completely cross-platform and standards-compliant, then the frexp/ldexp solution is the best way to go. (Although you might need to consider the highly theoretical case where either the source or the target hardware uses decimal floating point.)
Suppose that one or the other machine did not have a 32-bit floating point representation. Then there is no datatype on that machine bit-compatible with a 32-bit floating point number, regardless of endianness. So there is then no standard way of converting the non-32-bit float to a transmittable 32-bit representation, or of converting the transmitted 32-bit representation to a native non-32-bit floating point number.
You could restrict your scope to machines which have a 32-bit floating point representation, but then you will need to assume that both machines have the same number and order of bits dedicated to sign, exponent and mantissa. That's likely to be the case, since IEEE-754 format is almost universal these days, but C++ does not insist on it and it is at least conceivable that there is a machine which implements 1/8/23-bit floating point numbers with the sign bit at the low-order end instead of the high-order end.
In short, endianness is only one of the possible incompatibilities between binary floating point formats. Reducing every floating point number to two integers, however, avoids having to deal with other incompatibilities (other than radix).
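A minimal sketch of the frexp/ldexp approach in C++ might look like the following. The 6-byte layout (a 32-bit signed mantissa followed by a 16-bit signed exponent, both most-significant byte first) and the function names are assumptions made for this example, not an established wire format. It handles finite values only (no NaN or infinity) and assumes a radix-2 float with at most 24 significand bits; wider formats would be truncated.

```cpp
#include <cmath>
#include <cstdint>

// Hypothetical layout: bytes 0..3 = mantissa (big-endian, two's complement),
// bytes 4..5 = exponent (big-endian, two's complement).
inline void encode_float(float value, unsigned char out[6]) {
    int exp = 0;
    float frac = std::frexp(value, &exp);            // value == frac * 2^exp, 0.5 <= |frac| < 1
    // Scale the fraction to an integer; exact when float has <= 24 significand bits.
    std::int32_t mant = static_cast<std::int32_t>(std::ldexp(frac, 24));
    std::uint32_t m = static_cast<std::uint32_t>(mant);
    std::uint16_t e = static_cast<std::uint16_t>(exp);
    out[0] = static_cast<unsigned char>((m >> 24) & 0xFFu);
    out[1] = static_cast<unsigned char>((m >> 16) & 0xFFu);
    out[2] = static_cast<unsigned char>((m >> 8) & 0xFFu);
    out[3] = static_cast<unsigned char>(m & 0xFFu);
    out[4] = static_cast<unsigned char>((e >> 8) & 0xFFu);
    out[5] = static_cast<unsigned char>(e & 0xFFu);
}

inline float decode_float(const unsigned char in[6]) {
    std::uint32_t m = (static_cast<std::uint32_t>(in[0]) << 24) |
                      (static_cast<std::uint32_t>(in[1]) << 16) |
                      (static_cast<std::uint32_t>(in[2]) << 8) |
                       static_cast<std::uint32_t>(in[3]);
    std::uint32_t e = (static_cast<std::uint32_t>(in[4]) << 8) |
                       static_cast<std::uint32_t>(in[5]);
    // Undo two's complement by arithmetic, without relying on narrowing casts.
    std::int64_t mant = (m & 0x80000000u) ? static_cast<std::int64_t>(m) - 4294967296LL
                                          : static_cast<std::int64_t>(m);
    int exp = (e & 0x8000u) ? static_cast<int>(e) - 65536 : static_cast<int>(e);
    return std::ldexp(static_cast<float>(mant), exp - 24);
}
```

Because the bytes are produced with shifts instead of by copying memory, the output is the same on big-endian and little-endian hosts, and no hton* call is needed.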
Firstly, the problem I'm trying to solve is coming up with a better representation for values that will always remain uniformly distributed in the range:
0.0 <= x < 1.0
The motivation for this is to attempt to reduce the number of bytes used to store this data (the application is heavily memory and I/O bandwidth bound). Currently a 32-bit floating-point representation is used, 16-bit floating-point is proving insufficiently accurate.
My initial thoughts are to try and store the data in a 16-bit integer and to simply use the scheme:
x/(2^16 - 1) [x is an unsigned short]
To keep the algorithms largely the same and to retain use of the same floating-point hardware operations (at least at first), I would ideally like to keep converting this fractional representation into floating-point representation, performing the operation(s), then converting back into fractional representation for storage.
Clearly, there will be a loss of precision going back and forth between these two quite different, imprecise representations, but for our application, I suspect this might be an acceptable tradeoff.
I've done some research looking at what is currently out there that might give us a good starting point. The seminal "What Every Computer Scientist Should Know About Floating-Point Arithmetic" article (http://docs.oracle.com/cd/E19957-01/806-3568/ncg_goldberg.html) led me to look at a few others, "Beyond Floating Point" (home.ccil.org/~cowan/temp/p319-clenshaw.pdf) being one such example.
Can anyone point me to other examples of representations that people have used elsewhere that might satisfy these requirements?
I'm concerned that any potential gain in exactness of representation (we're wasting much of the floating-point format currently by using this specific range) will be completely out-weighed by the requirement to round twice going from fractional representation to floating-point and back again. In which case, it may be required to do arithmetic using this fractional representation directly to get any benefit out of this approach. Any advice on this point would be helpful?
Don't use 2^16-1. Use 2^16. Yes, you will have very slightly less precision and waste your 0xFFFF, but you will guarantee that there is no loss in precision when converting to floating point. (In contrast, when converting away from floating point, you will lose 8 bits of mantissa precision.)
Round-trip conversions between precisions can cause problems with certain operations, in particular progressively summing numbers. If at all possible, treat your fixed-point values as "dirty", and don't use them for further floating-point computations; prefer recalculating from inputs to using intermediate results which are in fixed-point form.
Alternatively, use 24 bits. With this representation, you will lose no precision in either direction as long as your values don't underflow (that is, as long as they're above 2^-24).
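Here is a rough sketch of the 2^16 scheme in C++. The names pack_fraction and unpack_fraction are made up for this example, and the code assumes the input lies in [0.0, 1.0) and that float is IEEE-754 binary32.

```cpp
#include <cstdint>

// Hypothetical helper: a stored 16-bit value q stands for q / 2^16.
// Dividing by a power of two only changes the exponent, so this is exact.
inline float unpack_fraction(std::uint16_t q) {
    return static_cast<float>(q) * (1.0f / 65536.0f);
}

// Hypothetical helper: quantize x in [0.0, 1.0) to the nearest multiple of 2^-16.
inline std::uint16_t pack_fraction(float x) {
    if (x <= 0.0f) return 0;                         // guard against negative input
    float scaled = x * 65536.0f + 0.5f;              // add 0.5 so truncation rounds to nearest
    if (scaled >= 65536.0f) scaled = 65535.0f;       // clamp values that would round up to 1.0
    return static_cast<std::uint16_t>(scaled);
}
```

With this scaling, every stored value converts to float exactly and packs back to the same 16-bit value; the only loss happens when a general float is packed for the first time.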
Wouldn't 1/x be badly distributed in your range? 1/2 1/3 1/4 .. do you not want to represent numbers above 1/2?
This kind of thing is done in Netcdf quite a lot to encode data for saving space.
const double scale = 1.0/65536;
unsigned short x;
Any number in x is really x*scale
See example in NetCDF for a more general approach using scale and offset: http://www.unidata.ucar.edu/software/thredds/current/netcdf-java/tutorial/NetcdfDataset.html
Have a look at "Packed Data Values" section of this page:
https://www.unidata.ucar.edu/software/netcdf/docs/BestPractices.html#Packed%20Data%20Values
Floating point type represents a number by storing its significant digits and its exponent separately on separate binary words so it fits in 16, 32, 64 or 128 bits.
Fixed point type stores numbers with 2 words, one representing the integer part, another representing the part past the radix, in negative exponents, 2^-1, 2^-2, 2^-3, etc.
Floats are better because they have a wider range in an exponent sense, but not if one wants to store numbers with more precision within a certain range, for example only using integer parts from -16 to 16, thus leaving more bits to hold digits past the radix.
In terms of performances, which one has the best performance, or are there cases where some is faster than the other ?
In video game programming, does everybody use floating point because the FPU makes it faster, or because the performance drop is just negligible, or do they make their own fixed type ?
Why isn't there any fixed type in C/C++ ?
That definition covers a very limited subset of fixed point implementations.
It would be more correct to say that in fixed point only the mantissa is stored and the exponent is a constant determined a-priori. There is no requirement for the binary point to fall inside the mantissa, and definitely no requirement that it fall on a word boundary. For example, all of the following are "fixed point":
64 bit mantissa, scaled by 2^-32 (this fits the definition listed in the question)
64 bit mantissa, scaled by 2^-33 (now the integer and fractional parts cannot be separated by an octet boundary)
32 bit mantissa, scaled by 2^4 (now there is no fractional part)
32 bit mantissa, scaled by 2^-40 (now there is no integer part)
GPUs tend to use fixed point with no integer part (typically 32-bit mantissa scaled by 2^-32). Therefore APIs such as OpenGL and Direct3D often use floating-point types which are capable of holding these values. However, manipulating the integer mantissa is often more efficient so these APIs allow specifying coordinates (in texture space, color space, etc) this way as well.
As for your claim that C++ doesn't have a fixed point type, I disagree. All integer types in C++ are fixed point types. The exponent is often assumed to be zero, but this isn't required and I have quite a bit of fixed-point DSP code implemented in C++ this way.
At the code level, fixed-point arithmetic is simply integer arithmetic with an implied denominator.
For many simple arithmetic operations, fixed-point and integer operations are essentially the same. However, there are some operations for which the intermediate values must be represented with a higher number of bits and then rounded off. For example, to multiply two 16-bit fixed-point numbers, the result must be temporarily stored in 32 bits before renormalizing (or saturating) back to 16-bit fixed-point.
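As a sketch of that widening multiply, assume a Q15 format, where an int16_t holds value * 2^15 and covers [-1.0, 1.0). The function name q15_mul is made up for this example. The right shift of a negative value is arithmetic on all common compilers, but that is only guaranteed by the standard from C++20 onward.

```cpp
#include <cstdint>

// Multiply two Q15 fixed-point numbers with rounding and saturation.
inline std::int16_t q15_mul(std::int16_t a, std::int16_t b) {
    std::int32_t product = static_cast<std::int32_t>(a) * b;   // Q30 result needs 32 bits
    product += 1 << 14;                                        // add half an LSB to round
    product >>= 15;                                            // renormalize back to Q15
    if (product > 32767) product = 32767;                      // saturate: -1.0 * -1.0 overflows
    if (product < -32768) product = -32768;
    return static_cast<std::int16_t>(product);
}
```

For instance, 16384 represents 0.5 in Q15, and q15_mul(16384, 16384) yields 8192, which represents 0.25.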
When the software does not take advantage of vectorization (such as CPU-based SIMD or GPGPU), integer and fixed-point arithmetic is faster than the FPU. When vectorization is used, the efficiency of vectorization matters a lot more, such that the performance difference between fixed-point and floating-point becomes moot.
Some architectures provide hardware implementations for certain math functions, such as sin, cos, atan, sqrt, for floating-point types only. Some architectures do not provide any hardware implementation at all. In both cases, specialized math software libraries may provide those functions by using only integer or fixed-point arithmetic. Often, such libraries will provide multiple levels of precision, for example, answers which are only accurate up to N bits of precision, which is less than the full precision of the representation. The limited-precision versions may be faster than the highest-precision version.
Fixed point is widely used in DSP and embedded-systems where often the target processor has no FPU, and fixed point can be implemented reasonably efficiently using an integer ALU.
In terms of performance, that is likely to vary depending on the target architecture and application. Obviously if there is no FPU, then fixed point will be considerably faster. When you have an FPU it will depend on the application too. For example, performing some functions such as sqrt() or log() will be much faster when directly supported in the instruction set rather than implemented algorithmically.
There is no built-in fixed point type in C or C++, I imagine, because they (or at least C) were envisaged as systems-level languages and the need for fixed point is somewhat domain specific, and also perhaps because on a general purpose processor there is typically no direct hardware support for fixed point.
In C++, defining a fixed-point data type class with suitable operator overloads and associated math functions can easily overcome this shortcoming. However, there are good and bad solutions to this problem. A good example can be found here: http://www.drdobbs.com/cpp/207000448. The link to the code in that article is broken, but I tracked it down to ftp://66.77.27.238/sourcecode/ddj/2008/0804.zip
You need to be careful when discussing "precision" in this context.
For the same number of bits in representation the maximum fixed point value has more significant bits than any floating point value (because the floating point format has to give some bits away to the exponent), but the minimum fixed point value has fewer than any non-denormalized floating point value (because the fixed point value wastes most of its mantissa in leading zeros).
Also depending on the way you divide the fixed point number up, the floating point value may be able to represent smaller numbers meaning that it has a more precise representation of "tiny but non-zero".
And so on.
The difference between floating point and integer math depends on the CPU you have in mind. On Intel chips the difference is not big in clock ticks. Int math is still faster because there are multiple integer ALUs that can work in parallel. Compilers are also smart enough to use special address calculation instructions to perform an add and a multiply in a single instruction. Conversion counts as an operation too, so just choose your type and stick with it.
In C++ you can build your own type for fixed point math. You just define a struct with one int and overload the appropriate operators, making them do what they normally do plus a shift to put the binary point back in the right position.
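A bare-bones sketch of such a type follows, assuming a Q16.16 layout (16 integer bits, 16 fractional bits). The name Fixed and its members are invented for this example. It does no overflow checking and no rounding, and relies on arithmetic right shift for negative values.

```cpp
#include <cstdint>

// Hypothetical Q16.16 fixed-point type: raw holds value * 2^16.
struct Fixed {
    std::int32_t raw;

    static constexpr int kFracBits = 16;

    static Fixed from_double(double v) {
        return Fixed{static_cast<std::int32_t>(v * (1 << kFracBits))};
    }
    double to_double() const {
        return static_cast<double>(raw) / (1 << kFracBits);
    }
};

// Addition and subtraction are plain integer operations.
inline Fixed operator+(Fixed a, Fixed b) { return Fixed{a.raw + b.raw}; }
inline Fixed operator-(Fixed a, Fixed b) { return Fixed{a.raw - b.raw}; }

// Multiplication doubles the fractional bits, so shift the wide product back.
inline Fixed operator*(Fixed a, Fixed b) {
    std::int64_t wide = static_cast<std::int64_t>(a.raw) * b.raw;
    return Fixed{static_cast<std::int32_t>(wide >> Fixed::kFracBits)};
}

// Division removes the fractional bits, so pre-scale the dividend.
inline Fixed operator/(Fixed a, Fixed b) {
    std::int64_t wide = static_cast<std::int64_t>(a.raw) * (std::int64_t{1} << Fixed::kFracBits);
    return Fixed{static_cast<std::int32_t>(wide / b.raw)};
}
```

With this, an expression such as Fixed::from_double(1.5) * Fixed::from_double(2.0) reads like ordinary arithmetic while running entirely on integer instructions.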
You don't use float in games because it is faster or slower; you use it because it is easier to implement the algorithms in floating point than in fixed point. You are assuming the reason has to do with computing speed, and that is not the reason; it has to do with ease of programming.
For example, you may define the width of the screen/viewport as going from 0.0 to 1.0, the height of the screen as 0.0 to 1.0, the depth of the world as 0.0 to 1.0, and so on. Matrix math, etc. makes things really easy to implement. Do all of the math that way up to the point where you need to compute real pixels on a real screen size, say 800x400. Project the ray from the eye to the point on the object in the world and compute where it pierces the screen, using 0 to 1 math, then multiply x by 800 and y by 400 and place that pixel.
Floating point does not store the exponent and mantissa separately, and the mantissa is a goofy size: what is left over after the exponent and sign, like 23 bits, not 16 or 32 or 64 bits.
Floating point math at its core uses fixed point logic, with extra logic and extra steps required. By definition, compared apples to apples, fixed point math is cheaper because you don't have to manipulate the data on the way into the ALU and don't have to manipulate the data on the way out (normalize). When you add in IEEE and all of its garbage, that adds even more logic, more clock cycles, etc. (properly signed infinity, quiet and signaling NaNs, different results for the same operation if there is an exception handler enabled). As someone pointed out in a comment, in a real system where you can do fixed and float in parallel, you can take advantage of some or all of the processors and recover some clocks that way. With both float and fixed, the clock rate can be increased by using vast quantities of chip real estate; fixed will remain cheaper, but float can approach fixed speeds using these kinds of tricks as well as parallel operation.
One issue not covered in the answers is power consumption. Though it depends heavily on the specific hardware architecture, an FPU usually consumes much more energy than the ALU in a CPU, so if you target mobile applications where power consumption is important, it's worth considering a fixed point implementation of the algorithm.
It depends on what you're working on. If you're using fixed point then you lose precision; you have to select the number of places after the decimal point (which may not always be good enough). In floating point you don't need to worry about this, as the precision offered is nearly always good enough for the task in hand, since it uses a standard form implementation to represent the number.
The pros and cons come down to speed and resources. On modern 32-bit and 64-bit platforms there is really no need to use fixed point. Most systems come with built-in FPUs that are hardwired to be optimised for floating point operations. Furthermore, most modern CPUs come with SIMD instruction sets, exposed through intrinsics, which help optimise vector-based methods via vectorisation and unrolling. So fixed point only comes with a downside.
On embedded systems and small microcontrollers (8bit and 16bit) you may not have an FPU nor extended instruction sets. In which case you may be forced to use fixed point methods or the limited floating point instruction sets that are not very fast. So in these circumstances fixed point will be a better - or even your only - choice.