Convert half to float in OpenCL - c++

I apologize if this is trivial, but I've been unable to find an answer on Google.
As per the OpenCL standard (since 1.0), the half type is supported for storage reasons.
It seems to me however, that without the cl_khr_fp16 extension, it's impossible to use this for anything?
What I would like to do is to save my values as half, but perform all calculations in float.
I tried using convert_half(), but that's not supported without the cl_khr_fp16.
I tried just writing (float) before the half for an automatic C-style conversion, but that didn't work either.
So my question is, how do I utilize half for storage?
I need to be able to both read and write half's.

Use vload_halfN and vstore_halfN. The halfN values stored in memory are converted to/from floatN on load and store.

As far as I know the type half is only supported on the GPU, but you can convert it to and back from a float fairly simply, as long as you know a bit about bitwise manipulation.
Have a look at the following link for a good explanation on how to do so.
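As a rough illustration of the bitwise approach mentioned above, here is a minimal host-side C++ sketch that decodes a 16-bit half bit pattern into a float. It assumes the half uses the IEEE 754 binary16 layout (1 sign bit, 5 exponent bits, 10 mantissa bits) and that float is IEEE 754 binary32. The function name half_bits_to_float is made up for this example and is not part of OpenCL.

```cpp
#include <cstdint>
#include <cstring>

// Decode an IEEE 754 binary16 bit pattern into a float (assumed binary32).
// The name half_bits_to_float is hypothetical, chosen for this sketch.
float half_bits_to_float(std::uint16_t h) {
    const std::uint32_t sign = static_cast<std::uint32_t>(h & 0x8000u) << 16;
    std::uint32_t exponent = (h >> 10) & 0x1Fu;
    std::uint32_t mantissa = h & 0x03FFu;
    std::uint32_t bits;

    if (exponent == 0x1Fu) {
        // Infinity or NaN: all exponent bits set in the float as well.
        bits = sign | 0x7F800000u | (mantissa << 13);
    } else if (exponent != 0) {
        // Normal number: rebias the exponent from 15 to 127 (add 112).
        bits = sign | ((exponent + 112u) << 23) | (mantissa << 13);
    } else if (mantissa == 0) {
        // Signed zero.
        bits = sign;
    } else {
        // Subnormal half: shift the mantissa up until the implicit
        // leading 1 appears, lowering the exponent for each shift.
        exponent = 113u;
        while ((mantissa & 0x0400u) == 0) {
            mantissa <<= 1;
            --exponent;
        }
        bits = sign | (exponent << 23) | ((mantissa & 0x03FFu) << 13);
    }

    float result;
    std::memcpy(&result, &bits, sizeof result);  // reinterpret the bits
    return result;
}
```

Every half value is exactly representable as a float, so this direction needs no rounding. The reverse direction (float to half) does need a rounding step and is not shown here.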

Since it wasn't mentioned in any of the other answers I thought I'd add: You can also use half float in OpenCL images and the read_imagef and write_imagef functions will do the conversion to/from float for you (cl_khr_fp16 extension not required). That extension is only for having variables in (and doing math in) half.


Why does Eigen's sample code tend to use the <float> class?

I'm trying to use Eigen in C++ for matrix manipulation.
It looks like I can choose float or double type for real numbers,
such as Eigen::Matrix4f or Eigen::Matrix4d.
In normal C++ code, I guess double is more popular nowadays than float.
However, in Eigen's documentation, float seems to be more frequently used than double.
Is there any special reason?
I know this is a very basic question, but I need help.
Thank you in advance.
float is usually faster: it needs half the memory of double, so twice as many values fit in a cache line or SIMD register. Performance matters a lot for math code.

What is the standard way to maintain accuracy when dealing with incredibly precise floating point calculations in C++?

I'm in the process of converting a program to C++ from Scilab (similar to Matlab) and I'm required to maintain the same level of precision that is kept by the previous code.
Note: Although maintaining the same level of precision would be ideal, it's acceptable if there is some error in the finished result. The problem I'm facing (as I'll show below) is due to looping, so the calculation error compounds rather quickly. But if the final result is only a thousandth or so off (e.g. 1/1000 vs 1/1001) it won't be a problem.
I've briefly looked into a number of different ways to do this including:
GMP (A Multiple Precision Arithmetic Library)
Using integers instead of floats (see example below)
Int vs Float Example: Instead of using the float 12.45, store it as an integer being 124,500. Then simply convert everything back when appropriate to do so. Note: I'm not exactly sure how this will work with the code I'm working with (more detail below).
An example of how my program is producing incorrect results:
for (int i = 0; i <= 1000; i++)
{
    for (int j = 0; j <= 10000; j++)
    {
        // This calculation will be computed with less precision than in Scilab
        float1 = (1.0 / 100000.0);
        // The rounding error accumulates in float2 and becomes significant by the end of the loop
        float2 = (float1 + float2);
    }
}
My question is:
Is there a generally accepted way to go about retaining accuracy in floating point arithmetic OR will one of the above methods suffice?
Maintaining precision when porting code like this is very difficult to do. Not because the languages have implicitly different perspectives on what a float is, but because of what the different algorithms or assumptions of accuracy limits are. For example, when performing numerical integration in Scilab, it may use a Gaussian quadrature method. Whereas you might try using a trapezoidal method. The two may both be working on identical IEEE754 single-precision floating point numbers, but you will get different answers due to the convergence characteristics of the two algorithms. So how do you get around this?
Well, you can go through the Scilab source code and look at all of the algorithms it uses for each thing you need. You can then replicate these algorithms taking care of any pre- or post-conditioning of the data that Scilab implicitly does (if any at all). That's a lot of work. And, frankly, probably not the best way to spend your time. Rather, I would look into using the Interfacing with Other Languages section from the developer's documentation to see how you can call the Scilab functions directly from your C, C++, Java, or Fortran code.
Of course, with the second option, you have to consider how you are going to distribute your code (if you need to). Scilab has a GPL-compatible license, so you can just bundle it with your code. However, it is quite big (~180MB) and you may want to bundle only the pieces you need (e.g., you don't need the whole interpreter system). This is more work in a different way, but it guarantees numerical compatibility with your current Scilab solutions.
Is there a generally accepted way to go about retaining accuracy in floating point arithmetic
"Generally accepted" is too broad, so no.
will one of the above methods suffice?
Yes. Particularly gmp seems to be a standard choice. I would also have a look at the Boost Multiprecision library.
A hand-coded integer approach can work as well, but is surely not the method of choice: it requires much more coding, and a more involved means of storing and processing arbitrarily precise integers.
If your compiler supports it, use BCD (binary-coded decimal).
Well, another alternative if you use GCC compilers is to go with quadmath/__float128 types.

dtoa vs sprintf vs Grisu3 algorithm

What is the best way to render double precision numbers as strings in C++?
I ran across the article Here be dragons: advances in problems you didn’t even know you had which discusses printing floating point numbers.
I have been using sprintf. I don't understand why I would need to modify the code?
If you are happy with sprintf_s you shouldn't change. However if you need to format your output in a way that is not supported by your library, you might need to reimplement a specialized version of sprintf (with any of the known algorithms).
For example JavaScript has very precise requirements on how its numbers must be printed (see section 9.8.1 of the specification). The correct output can't be accomplished by simply calling sprintf. Indeed, Grisu has been developed to implement correct number-printing for a JavaScript compiler.
Grisu is also faster than sprintf, but unless floating-point printing is a bottleneck in your application this should not be a reason to switch to a different library.
Ahah !
The problem outlined in the article you link is that for some numbers, the computer displays something that is theoretically correct but not what we humans would have written.
For example, as the article says, 1.2999999... = 1.3, so if your result is 1.3, it's (quite) correct for the computer to display it as 1.299999999... But that's not what you would expect to see.
Now the question is: why does the computer do that? The reason is that the computer computes in base 2 (binary), while we usually compute in base 10 (decimal). The results are the same (thank God!), but the internal storage and the representation are not.
Some numbers look nice when displayed in base 10, like 1.3, but others don't, for example 1/3 = 0.333333333.... It's the same in base 2: some numbers "look" nice in base 2 (usually when they are sums of powers of 2) and others do not. When the computer stores a number internally, it may not be able to store it "exactly", so it stores the closest possible representation, even if the number looked "finite" in decimal. So yes, in this case, it "drifts" a little bit. If you do that again and again, you may lose precision. But there is no other way (unless you use special math libraries able to store fractions).
The problem arises when the computer tries to give you back, in base 10, the number you gave it. Then the computer may give you 1.299999 instead of the 1.3 you were expecting.
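That's also the reason why you should avoid comparing floats for exact equality with == and instead check whether they differ by less than a small tolerance. (The functions islessgreater(a, b), isgreater(a, b), etc. only change how NaN operands are handled; they do not add any tolerance.)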
So the function you currently use (sprintf) is fine and as exact as it can be; it gives you correct values. You just have to know that when dealing with floats, 1.2999999 at maximum precision is OK if you were expecting 1.3.
Now if you want to "pretty print" those numbers to get the best "human" representation (base 10), you may want to use a special library, like Grisu3, which will try to undo the drift that may have happened and align the number to the closest base 10 representation.
Now the library cannot use a crystal ball to find out which numbers have drifted and which have not, so it may happen that you really meant 1.2999999 at maximum precision as stored in the computer, and the library will "convert" it to 1.3... But that's neither worse nor less precise than displaying 1.29999 instead of 1.3.
If you need good readability, such a library will be useful. If not, it's just a waste of time.
Hope this helps!
The best way to do this in any reasonable language is:
Use your language's runtime library. Don't ever roll your own. Even if you have the knowledge and curiosity to write it, you don't want to test it and you don't want to maintain it.
If you notice any misbehavior from the runtime library conversion, file a bug.
If these conversions are a measurable bottleneck for your program, don't try to make them faster. Instead, find a way to avoid doing them at all. Instead of storing numbers as strings, just store the floating-point data (after possibly controlling for endianness). If you need a string representation, use a hexadecimal floating-point format instead.
I don't mean to discourage you, or anyone. These are actually fascinating functions to work on, but they are also shockingly complex, and trying to design good test coverage for any non-naive implementation is even more involved. Don't get started unless you're prepared to spend months thinking about the problem.
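For the hexadecimal floating-point suggestion above, a minimal sketch using only the standard library could look like the following. It relies on the "%a" conversion of snprintf and on strtod accepting hexadecimal input (both available since C99 and C++11), and it assumes the default "C" locale. The helper names to_hex_float and from_hex_float are made up for this example.

```cpp
#include <cstdio>
#include <cstdlib>
#include <string>

// Format a double as a hexadecimal floating-point string.
// With no precision given, "%a" emits enough hex digits to
// represent a binary double exactly.
std::string to_hex_float(double value) {
    char buffer[64];
    std::snprintf(buffer, sizeof buffer, "%a", value);
    return std::string(buffer);
}

// Parse a hexadecimal floating-point string back into a double.
double from_hex_float(const std::string& text) {
    return std::strtod(text.c_str(), nullptr);
}
```

Because hex digits map directly onto the binary mantissa, no decimal rounding is involved in either direction, so from_hex_float(to_hex_float(x)) should give back x exactly for any finite x. The exact text produced by "%a" can differ between C libraries, so it is safer to treat the string as opaque.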
You might want to use something like Grisu (or a faster method) because it gives you the shortest decimal representation with round trip guarantee unlike sprintf which only takes a fixed precision. The good news is that C++20 includes std::format that gives you this by default. For example:
printf("%.*g", std::numeric_limits<double>::max_digits10, 0.3);
prints 0.29999999999999999 while
puts(fmt::format("{}", 0.3).c_str());
prints 0.3 (godbolt).
In the meantime you can use the {fmt} library, which std::format is based on. {fmt} also provides the print function that makes this even easier and more efficient (godbolt):
fmt::print("{}", 0.3);
Disclaimer: I'm the author of {fmt} and C++20 std::format.
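For readers without {fmt} or C++20 at hand, the idea of a "shortest representation that round-trips" can be illustrated with a brute-force sketch built on snprintf and strtod. To be clear, this is not Grisu and not how {fmt} works internally; it simply tries increasing precisions until parsing the text gives back the original double. The function name shortest_round_trip is made up for this example.

```cpp
#include <cstdio>
#include <cstdlib>
#include <string>

// Brute-force search for the fewest significant digits that still
// parse back to exactly the same double. 17 digits are always enough
// for an IEEE 754 double, so the loop stops there.
std::string shortest_round_trip(double value) {
    char buffer[64];
    for (int precision = 1; precision <= 17; ++precision) {
        std::snprintf(buffer, sizeof buffer, "%.*g", precision, value);
        if (std::strtod(buffer, nullptr) == value) {
            break;  // this many digits identify the value uniquely
        }
    }
    return std::string(buffer);
}
```

With this, shortest_round_trip(0.3) gives "0.3", while a value that really differs from 0.3 in its last bit, such as 0.1 + 0.2, keeps all 17 digits. It formats and parses up to 17 times per number, so it is only useful as an illustration; dedicated algorithms such as Grisu exist precisely to avoid that cost.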
In C++ why aren't you using iostreams? You should probably be using cout for the console and ostringstream for string-oriented output (unless you have a very specific need to use a printf family method).
You shouldn't worry about formatting performance unless actual profiling shows that CPU is the bottleneck (compared to say I/O).
#include <sstream>
void outputdouble( std::ostringstream & oss, double d )
{
    oss.precision( 5 );   // print up to 5 significant digits
    oss << d;
}

What is the difference between float2 and cuComplex, which to use?

I am trying to figure out how to use complex numbers in both my host and device code. I came across cuComplex (but can't find any documentation!) and float2 which at least gets a mention in the CUDA programming guide.
What should I use? In the header file for cuComplex, it looks like the functions are declared with __host__ __device__ so I am assuming that means that it would be ok to use them in either place.
My original data is being read in from a file into a std::complex<float>, so I don't really want to mess with that. I guess in order to use the complex values on the GPU, though, I will have to copy from the original complex<float> to the cuComplex?
cuComplex is defined in /usr/local/cuda/include/cuComplex.h (modulo your install dir). The relevant snippets:
typedef float2 cuFloatComplex;
typedef cuFloatComplex cuComplex;
typedef double2 cuDoubleComplex;
There are also handy functions in there for working with complex numbers -- multiplying them, building them, etc.
As for whether to use float2 or cuComplex, you should use whichever is semantically appropriate -- is it a vector or a complex number? Also, if it is a complex number, you may want to consider using cuFloatComplex or cuDoubleComplex just to be fully explicit.
If you're trying to work with cuBLAS or cuFFT you should use cuComplex. If you are going to write your own functions, there should be no difference in performance, as both are just a structure of two floats.
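On the question of copying from std::complex<float>: since C++11 the standard guarantees that std::complex<float> is laid out like an array of two floats, real part first. The host-only sketch below shows a bytewise copy into a two-float struct. Float2Like is a hypothetical stand-in for CUDA's float2 so that the example compiles without the CUDA toolkit, and it assumes x holds the real part and y the imaginary part.

```cpp
#include <complex>
#include <cstring>
#include <vector>

// Hypothetical stand-in for CUDA's float2 / cuFloatComplex, so this
// sketch builds without CUDA headers. Assumption: x = real, y = imaginary.
struct Float2Like {
    float x;
    float y;
};

static_assert(sizeof(Float2Like) == sizeof(std::complex<float>),
              "both types are expected to be exactly two floats");

// Copy complex values bytewise into the two-float layout.
std::vector<Float2Like> to_float2_layout(
        const std::vector<std::complex<float>>& input) {
    std::vector<Float2Like> output(input.size());
    if (!input.empty()) {
        std::memcpy(output.data(), input.data(),
                    input.size() * sizeof(Float2Like));
    }
    return output;
}
```

If the layouts match like this, it may not be necessary to convert element by element on the host at all; the original std::complex<float> buffer could probably be passed to the device copy as raw bytes, but check that against your CUDA version.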
IIRC, float2 is an array of 2 numbers. cuComplex (from the name alone) sounds like CUDA's complex format.
This post seems to point to where to find more on cuComplex:

problem with casting float -> double in C when fread

I have a problem with casting from float to double when calling fread.
If I change the double pointer to a float pointer, it works just fine.
However, I need it to be a double pointer for later on, and I thought that writing from a smaller data type (float) into the memory of a bigger data type (double) should be fine. But it turns out it doesn't work as I expected.
What is wrong with it, and how do I solve this problem?
i know i can solve it by converting it one by one. but i have a huge amount of data. and i dont wanna extra 9000000+ round of converting.. that would be very expensive. and is there any trick i can solve it?
Are there any C++/C tricks?
If you write float-formatted data into a double, you're only going to get garbage as a result. Sure, you won't overflow your buffer, but that's not the only problem - it's still going to be finding two floats where it expects a double. You need to read it as a float, then convert - casting (even implicitly) in this manner lets the compiler know that the data was originally a float and needs to be converted:
float temp[500];
int i;
fread(temp, sizeof(temp[0]), 500, f);
for (i = 0; i < 500; i++)
doublePointer[i] = temp[i];
Suppose for example a float is 4 bytes on your computer. If you read 500 floats then you read 2000 bytes, one float per float and the result is correct.
Suppose for example a double is 8 bytes on your computer. If you read 500 floats then you read 2000 bytes, but you're reading them into 250 doubles, 2 floats per double, and the result is nonsense.
If your file has 500 floats you have to read 500 floats. Cast each float value to a double value. You can convert each numeric value that way.
When you abuse a pointer, pretending that the pointer points to a type of data that it doesn't really point to, then you're not converting each numeric value, you're preserving nonsense as nonsense.
Edit: You added to your question "and i dont wanna extra 9000000+ round of converting.. that would be very expensive. and is there any trick i can solve it?" The answer is yes, you can use a trick of keeping your floats as floats. If you don't want to convert to doubles then don't convert to doubles, just keep your floats as floats.
9000000 conversions from float to double is nothing. fread into a float array, then convert that into a double array.
Benchmark this code scientifically, don't guess about where the slowdowns might be.
If you're bottlenecked on the conversion, write an unrolled, vectorized conversion loop, or use one from a commercial vector library.
If it's still too slow, tile your reads so you read in your float data in batches of a few pages that fit in L1 cache, then convert those to double, then read the next few pages and convert those to double, etc.
If it's still too slow, investigate loading your data lazily so only the parts that are needed get loaded, and only when they are used.
A modern x86 core is capable of doing two float->double conversions per cycle in a hand-tuned vectorized loop; at 2GHz, that's 4 billion conversions per second per core. 9 million conversions is small change -- my laptop does it in less than 1 millisecond.
Alternatively, just convert the whole dataset to double once, and read it in that way from now on. Problem solved.
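The "read in batches, then convert" idea from the list above could be sketched in C++ roughly as follows. The function name read_floats_as_doubles and the batch size of 4096 floats (16 KB) are assumptions made for this example, not tuned values.

```cpp
#include <cstddef>
#include <cstdio>
#include <vector>

// Read up to `count` floats from `file` in small batches and widen each
// value to double. Returns fewer elements if the file ends early.
std::vector<double> read_floats_as_doubles(std::FILE* file, std::size_t count) {
    const std::size_t kBatch = 4096;  // floats per batch; an assumed value
    std::vector<float> batch(kBatch);
    std::vector<double> result;
    result.reserve(count);

    while (result.size() < count) {
        std::size_t want = count - result.size();
        if (want > kBatch) {
            want = kBatch;
        }
        const std::size_t got =
            std::fread(batch.data(), sizeof(float), want, file);
        for (std::size_t i = 0; i < got; ++i) {
            // A real value conversion, not a pointer reinterpretation.
            result.push_back(static_cast<double>(batch[i]));
        }
        if (got < want) {
            break;  // end of file or read error
        }
    }
    return result;
}
```

The batch buffer stays small enough to remain in cache while it is converted, and the only large allocation is the double array that was needed anyway. As the answer says, measure before assuming the conversion is the slow part.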
I would look at this from a different perspective. If the data is stored as float, then that is all the precision it will ever have. There is no point in converting to double until the rules of floating point arithmetic require it.
So I would allocate a buffer for the 500 (or whatever) floats, and read them from the data file with one suitable call to fread():
float *databuffer;
databuffer = malloc(500 * sizeof(float));
fread(databuffer, sizeof(float), 500, f);
Later, use the data in whatever math it needs to participate in. It will be promoted to double if required. Don't forget to eventually free the buffer after it is no longer needed.
If your results really do have all the precision of a double, then use a fresh buffer of doubles to hold them. However, if they are to be written back to file as float, then you will eventually need to put them into a buffer of floats.
Note that reading and writing files for interchange often needs to be considered a separate problem from efficient storage and usage of data in memory. It is often necessary to read a file and process each individual value in some way. For example, a portable program might be required to handle data written by a system using a different byte order. Less frequently today, you might find that even the layout of the bits in a float differs between systems. In general, this problem is often best solved by deferring to a library implementing a standard such as XDR (defined by RFC 4506) that was designed to deal with binary portability.
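As a small illustration of the byte-order point, the sketch below decodes one float stored in big-endian byte order (the order XDR uses) without depending on the byte order of the machine running it. It assumes float is a 32-bit IEEE 754 value, and the function name float_from_big_endian is made up for this example.

```cpp
#include <cstdint>
#include <cstring>

// Decode a 32-bit IEEE 754 float stored most-significant byte first.
// The integer is assembled arithmetically, so the result does not
// depend on the host's own byte order.
float float_from_big_endian(const unsigned char* bytes) {
    const std::uint32_t bits =
        (static_cast<std::uint32_t>(bytes[0]) << 24) |
        (static_cast<std::uint32_t>(bytes[1]) << 16) |
        (static_cast<std::uint32_t>(bytes[2]) << 8) |
        static_cast<std::uint32_t>(bytes[3]);
    float value;
    std::memcpy(&value, &bits, sizeof value);  // reinterpret the bit pattern
    return value;
}
```

A reader for a little-endian file would assemble the bytes in the opposite order. For anything beyond a simple file format, a library implementing a standard such as XDR remains the more robust choice.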