OpenCL, half vs float performance - c++

I'm currently working on an application that requires a large amount of data to be stored and processed (~4 GB as floats).
Since the precision of the individual values is of lesser importance (I know that they'll be bounded), I saw that I could use OpenCL's half instead of float, since that would really decrease the amount of memory.
My question is twofold.
Is there any performance hit to using half instead of float? (I'd imagine graphics cards are built for float operations.)
Is there a performance hit for mixing floats and halves in calculations? (e.g., a float times a half.)
Sincerely,
Andreas Falkenstrøm Mieritz

ARM CPUs and GPUs have native support for half in their ALUs so you'll get close to double speed, plus substantial savings in energy consumption. Edit: The same goes for PowerVR GPUs.
Desktop hardware only supports half in the load/store and texturing units, AFAIK. Even so, I'd expect half textures to perform better than float textures or buffers on any GPU. Particularly if you can make some clever use of texture filtering.

OpenCL kernels are almost always memory-speed or PCIe-speed bound. If you convert a decent chunk of your data to half floats, this will enable faster transfers of your values. Almost certainly faster on any platform/device.
As far as performance goes, half is rarely worse than float. I am fairly sure that any device which supports half will do computations as fast as it would with float. Again, even if there is a slight overhead here, you will more than make up for it in your far superior transfer times.
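If you want to try this, here is a minimal sketch of the kind of kernel involved (buffer and kernel names are hypothetical): the data stays in memory as half while the arithmetic still runs in float. The vload_half/vstore_half built-ins do the conversion and are available even on devices without the cl_khr_fp16 extension.

    // OpenCL C kernel source, embedded in the C++ host code as a raw string.
    static const char* kScaleKernelSrc = R"(
        kernel void scale(global const half* in, global half* out, const float factor)
        {
            const size_t i = get_global_id(0);
            const float x = vload_half(i, in);   // half in memory -> float in registers
            vstore_half(x * factor, i, out);     // float result -> half in memory
        }
    )";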

Related

GPU HLSL compute shader warnings int and uint division

I keep getting warnings from compute shader compilation recommending that I use uints instead of ints when dividing.
From the data type alone I would assume uints are faster; however, various tests online seem to point to the contrary. Perhaps this contradiction applies only to the CPU side, and GPU parallelisation has some advantage I'm not aware of?
(Or is it just bad advice?)
I know that this is an extremely late answer, but this is a question that has come up for me as well, and I wanted to provide some information for anyone who sees this in the future.
I recently found this resource - https://arxiv.org/pdf/1905.08778.pdf
The table at the bottom lists the latency of basic operations on several graphics cards. There is a small but consistent savings to be found by using uints on all measured hardware. However, what the warning doesn't state is that the greater optimization is to be found by replacing division with multiplication if at all possible.
https://www.slideshare.net/DevCentralAMD/lowlevel-shader-optimization-for-nextgen-and-dx11-by-emil-persson states that type conversion is a full-rate operation like int/float subtraction, addition, and multiplication, whereas division is very slow.
I've seen it suggested that to improve performance, one should convert to float, divide, then convert back to int, but as shown in the first source, this will at best give you small gains and at worst actually decrease performance.
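To make the divide-vs-multiply point concrete, here is a small sketch (hypothetical names, written as C++, but the same transformation applies in a shader): unsigned division by a power-of-two constant is a single shift, and a float divisor that is uniform across the dispatch can be replaced by one precomputed reciprocal.

    #include <cstdint>

    // With a compile-time power-of-two divisor, unsigned division is a single shift/mask;
    // the signed version needs extra fix-up instructions to round toward zero, which is
    // part of why the compiler nags about int vs uint.
    uint32_t row_u(uint32_t index) { return index / 16u; }  // compiles to index >> 4
    int32_t  row_s(int32_t  index) { return index / 16;  }  // shift plus sign fix-up

    // For float data with a divisor that is constant across the dispatch, precompute the
    // reciprocal once and multiply per element; multiplication is full-rate, division is not.
    float scale(float v, float inv_divisor) { return v * inv_divisor; }  // inv_divisor = 1.0f / divisor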
You are correct that this differs from the performance of the same operations on the CPU, although I'm not entirely certain why.
Looking at https://www.agner.org/optimize/instruction_tables.pdf it appears that which operation is faster (MUL vs IMUL) varies from CPU to CPU - in a few at the top of the list IMUL is actually faster, despite a higher instruction count. Other CPUs don't provide a distinction between MUL and IMUL at all.
TL;DR uint division is faster on the GPU, but on the CPU YMMV

glColorPointer performance, better with floats or bytes?

Doing some maintenance on an old project, and I was asked by the client to see if it was possible to improve performance. I've done the parts I know and can easily test, but then I tested
glColorPointer(4,GL_UNSIGNED_BYTE,...,...)
vs
glColorPointer(4,GL_FLOAT,...,...)
I could see literally no difference on the handful of machines I could test it on. Obviously that means it's not a bottleneck, but since it's the first time I've been in a situation where I have access to both color formats, it's also the first time I can wonder if there's a speed difference between the two.
I'm expecting the answer is that internally OpenGL adapters use float colors, so it would be preferable to use float when available, but does anyone have a more definitive answer than that?
edit: the client has a few dozen machines that are ~10 years old, and the project is used on those machines, if that makes a difference
There's really no generally valid answer. You did the right thing by testing.
At least on desktop GPUs, it's fairly safe to assume that they will internally operate with 32-bit floats. On mobile GPUs, lower precision formats are more common, and you have some control over it using precision qualifiers in the shader code.
Assuming that 32-bit floats are used internally, there are two competing considerations:
If you specify the colors in a different format, like GL_UNSIGNED_BYTE, a conversion is needed while fetching the vertex data.
If you specify the colors in a more compact format, the vertex data uses less memory. This also has the effect that less memory bandwidth is consumed for fetching the data, with fewer cache misses, and potentially less cache pollution.
Which of these is more relevant really depends on the exact hardware, and the overall workload. The format conversion for item 1 can potentially be almost free if the hardware supports the byte format as part of fixed function vertex fetching hardware. Otherwise, it can add a little overhead.
Saving memory bandwidth is always a good thing. So by default, I would think that using the most compact representation is more likely to be beneficial. But testing and measuring is the only conclusive way to decide.
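As a rough illustration (array names are hypothetical), the two layouts look like this; the byte version carries a quarter of the color data per vertex, and the 0..255 values are normalized to 0.0..1.0 for you:

    #include <GL/gl.h>

    // N vertices, colors stored tightly packed (stride 0).
    // GL_FLOAT:          4 components x 4 bytes = 16 bytes of color per vertex
    // GL_UNSIGNED_BYTE:  4 components x 1 byte  =  4 bytes of color per vertex
    void set_colors_float(const GLfloat* colors) { glColorPointer(4, GL_FLOAT,         0, colors); }
    void set_colors_byte (const GLubyte* colors) { glColorPointer(4, GL_UNSIGNED_BYTE, 0, colors); }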
In reality, it's fairly rare that fetching vertex data is a major bottleneck in the pipeline. It does happen, but it's just not very common. So it's not surprising that you couldn't measure a difference.
For example, in a lot of use cases, texture data is overall much bigger than vertex data. If that is the case, the bandwidth consumed by texture sampling is often much more significant than that used by vertex fetching. Also, related to this, there are usually many more fragments than vertices, so anything related to fragment processing is much more performance critical than vertex processing.
On top of this, many applications make too many OpenGL API calls, or use the API in inefficient ways, and end up being limited by CPU overhead, particularly on very high performance GPUs. If you're optimizing performance for an existing app, that is pretty much the first thing you should check: Find out if you're CPU or GPU limited.

OpenCL, float vs uint, expected performance gain?

What would be the expected performance gain from using (u)int16 over float in an OpenCL kernel? If any?
I expect the memory transfer to be roughly halved, but what about the device load?
Strangely, I can hardly find any benchmarks or documentation on the subject. (Or maybe my Google-fu is just failing me...)
I'm working on image processing (filtering mostly). The precision is not that critical; indeed, the result of several kernel operations is cast into a char. We have narrowed down several operations where using shorter data types is acceptable. So I was wondering if those operations can be sped up by using shorter data where the precision is not critical.
Thanks for your help.
GPUs tend to do floating-point operations better than integer ones. For example, some have extra pipelines for floating-point ops, and making everything integer just reduces the GPU's throughput. Data copy may not be your bottleneck, and halving the amount by using 16-bit integers may not help. Moreover, on integrated GPUs like Intel's or AMD's you can get zero-copy behavior, so the effect of image or buffer size is minimal (to a point).
Also, you might look into 16-bit floating-point support. That gets you the best of both worlds (half the data, with floating-point numbers).
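If you do experiment with 16-bit floats, you can first check whether the device advertises the cl_khr_fp16 extension; a minimal host-side sketch:

    #include <CL/cl.h>
    #include <string>
    #include <vector>

    // True if the device supports native half-precision arithmetic in kernels.
    bool device_supports_fp16(cl_device_id device)
    {
        size_t size = 0;
        clGetDeviceInfo(device, CL_DEVICE_EXTENSIONS, 0, nullptr, &size);
        std::vector<char> ext(size);
        clGetDeviceInfo(device, CL_DEVICE_EXTENSIONS, size, ext.data(), nullptr);
        return std::string(ext.begin(), ext.end()).find("cl_khr_fp16") != std::string::npos;
    }

Even without the extension, you can still store data as half and convert to float on load/store inside the kernel with vload_half/vstore_half.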

Physics engine: use double or single precision?

I am making a rigid body physics engine from scratch (for educational purposes), and I'm wondering if I should choose single or double precision floats for it.
I will be using OpenGL to visualize it and the glm library to calculate stuff internally in the engine as well as for the visualization. The convention seems to be to use floats for OpenGL pretty much everywhere, and glm::vec3 and glm::vec4 seem to use float internally. I also noticed that there are glm::dvec3 and glm::dvec4, though nobody seems to be using them. How do I decide which one to use? double seems to make sense as it has more precision and pretty much the same performance on today's hardware (as far as I know), but everything else seems to use float, except for some of GLU's functions and some of GLFW's.
This is all going to depend on your application. You pretty much already understand the tradeoffs between the two:
Single-precision
Less accurate
Faster computations, even on today's hardware. Values take up less memory and operations are faster, so you get more out of cache optimizations, etc.
Double-precision
More accurate
Slower computations.
Typically in graphics applications the precision of floats is plenty, given the number of pixels on the screen and the scaling of the scene. In scientific settings or smaller-scale simulation you may need the extra precision. It also may depend on your hardware. For instance, I coded a physically based simulation for rigid bodies on a netbook, and switching to float gained on average 10-15 FPS, which almost doubled the FPS at that point in my implementation.
My recommendation is that if this is an educational activity, use floats and target the graphics application. If you find through your studies, timings, and personal experience that you need double precision, then head in that direction.
Surely the general rule is correctness first and performance second? That means using doubles unless you can convince yourself that you'll get the fidelity required using floats.
The thing to look at is the effective size of one bit of the coordinate system relative to the smallest size you intend to model.
For example, if you use Earth coordinates, 100 degrees works out to around 1E7 metres.
An IEEE 754 float has only 23 bits of precision, so that gives a relative precision of only about 1E-7.
Hence the coordinate is only accurate to around 1 meter. This may or may not be sufficient for the problem.
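A tiny check of that arithmetic, if you want to see the spacing between adjacent representable values near 1E7:

    #include <cmath>
    #include <cstdio>

    int main()
    {
        // Spacing between adjacent representable values near 1e7 (roughly Earth-scale metres).
        float  f = 1.0e7f;
        double d = 1.0e7;
        std::printf("float  gap near 1e7: %g\n", std::nextafter(f, 2.0e7f) - f); // 1: about a metre
        std::printf("double gap near 1e7: %g\n", std::nextafter(d, 2.0e7)  - d); // ~1.9e-9
        return 0;
    }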
I have learnt from experience to always use doubles for the physics and physical modelling calculations, but concede that cannot be a universal requirement.
It does not of course follow that the rendering should be using double; you may well want that as a float.
I used typedef in a common header and went with float as my default.
typedef float real_t;
I would not recommend using templates for this because it causes huge design problems once you try to use polymorphic/virtual functions.
Why floats work
Floats worked pretty well for me, for three reasons:
First, almost every physical simulation involves adding some noise to the forces and torques to be realistic. This random noise is usually far larger in magnitude than the precision of floats (see the sketch after this list).
Second, having limited precision is actually beneficial in many instances. Consider that almost all of classical rigid-body mechanics doesn't apply exactly in the real world because there is no such thing as a perfect rigid body. So when you apply a force to a less-than-perfect rigid body, you don't get an acceleration that's perfect to the 7th digit.
Third, many simulations run for a short duration, so the accumulated errors remain small enough. Using double precision doesn't change this automatically. Creating long-running simulations that match the real world is extremely difficult and would be a very specialized project.
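As a sketch of the first point (the magnitudes here are hypothetical): the injected noise dwarfs float round-off.

    #include <random>

    // Hypothetical noise model: zero-mean Gaussian noise on a ~1 N force.
    // Float round-off near 1.0 is about 6e-8, thousands of times smaller than this noise.
    std::mt19937 rng{42};
    std::normal_distribution<float> force_noise{0.0f, 0.05f};

    float noisy_force(float force)
    {
        return force + force_noise(rng);
    }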
When floats don't work
Here are the situations where I had to consider using double:
Latitudes and longitudes should be doubles. Floats simply don't have good enough resolution for these quantities for most purposes.
Computing the integral of very small quantities over time. For example, a Gaussian Markov process is a good way to represent random walk in sensor bias. However, the values will typically be very small and will accumulate, and the calculation errors can be much bigger with floats than with doubles.
Specialized simulations that go beyond the usual classical mechanics of linear and rotational motion of rigid bodies. For example, if you do things with protein molecules, crystal growth, micro-gravity physics, etc., then you probably want to use double.
When doubles don't work
There are actually times when the higher precision of double hurts, although it's rare. An example from "What Every Computer Scientist Should Know...": say you have some quantity that converges to 1 over time, you take its log, and you do something if the result is 0. When using double, you might never get to 1 because the rounding might not happen, but with floats it might.
Another example: you need to use special code to compare real values. Such code often has a default epsilon, which for float is a fairly reasonable 1E-6 but for double is more like 1E-15. If you are not careful, this can give you a lot of surprises.
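A sketch of the kind of comparison helper meant here; the point is that the tolerance has to be chosen per type rather than reused blindly:

    #include <algorithm>
    #include <cmath>

    // Relative comparison; eps must match the type's precision.
    template <typename T>
    bool nearly_equal(T a, T b, T eps)
    {
        return std::fabs(a - b) <= eps * std::max(std::fabs(a), std::fabs(b));
    }

    // Sensible defaults differ by many orders of magnitude:
    //   nearly_equal(a, b, 1e-6f);   // float
    //   nearly_equal(a, b, 1e-12);   // double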
Performance
Here's another surprise: on modern x86 hardware there is little difference between the raw performance of float and double. Memory alignment, caching, etc. almost overwhelmingly dominate over the floating-point type. On my machine, a simple summation test of 100M random numbers took 22 s with floats and 25 s with doubles. So floats are indeed about 12% faster, but I still think that's too little to abandon double just for performance. However, if you use SSE instructions, GPUs, or embedded/mobile hardware like an Arduino, then floats will be much faster, and that can most certainly be the driving factor.
A physics engine that does nothing but linear and rotational motion of rigid bodies can run at 2000 Hz on today's desktop-grade hardware on a single thread, and you can trivially parallelize this across many cores. Lots of simple low-end simulations require just 50 Hz. At 100 Hz things start to get pretty smooth. If you have things like PID controllers, you might have to go up to 500 Hz. But even at that worst-case rate, you can still simulate plenty of objects with a good enough desktop.
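For reference, a minimal sketch of the kind of summation test described (absolute timings will of course vary by machine and compiler flags):

    #include <chrono>
    #include <cstdio>
    #include <random>
    #include <vector>

    template <typename T>
    double time_sum(std::size_t n)
    {
        std::mt19937 rng{1};
        std::uniform_real_distribution<T> dist(0, 1);
        std::vector<T> data(n);
        for (auto& v : data) v = dist(rng);

        const auto start = std::chrono::steady_clock::now();
        T sum = 0;
        for (T v : data) sum += v;
        const auto stop = std::chrono::steady_clock::now();

        std::printf("sum = %f\n", static_cast<double>(sum)); // keep the loop from being optimized out
        return std::chrono::duration<double>(stop - start).count();
    }

    int main()
    {
        std::printf("float:  %.3f s\n", time_sum<float>(100000000));
        std::printf("double: %.3f s\n", time_sum<double>(100000000));
        return 0;
    }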
In summary, don't let performance be your driving factor unless you actually measure it.
What to do
A rule of thumb is to use as much precision as you need to get your code to work. For a simple rigid-body physics engine, floats are often good enough. However, you want to be able to change your mind without revamping your code. So the best approach is to use a typedef as mentioned at the start and make sure your code works with float as well as double. Then measure often and choose the type as your project evolves.
Another vital thing in your case: keep the physics engine religiously separated from the rendering system. The output from the physics engine could be either double or float and should be cast to whatever the rendering system needs.
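A sketch of both points together (names are hypothetical): a typedef that can be flipped between float and double, and a single conversion at the physics/rendering boundary.

    #include <glm/glm.hpp>

    // Flip this one definition to rebuild the engine in double precision.
    #if defined(PHYSICS_USE_DOUBLE)
    typedef double     real_t;
    typedef glm::dvec3 real_vec3;
    #else
    typedef float      real_t;
    typedef glm::vec3  real_vec3;
    #endif

    // The renderer only ever sees floats; conversion happens once, at the boundary.
    inline glm::vec3 to_render(const real_vec3& v)
    {
        return glm::vec3(v);
    }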
Here's the short answer.
Q. Why does OpenGL use float rather than double?
A. Because most of the time you don't need the precision and doubles are twice the size.
Another thing to consider is that you shouldn't use doubles everywhere, just as some things may require using a double as opposed to a float. For example, if you are drawing a circle by drawing squares and looping through the angles, there can only be so many squares shown on the screen. They will overlap, and in this case doubles would be pointless. However, if you're doing arbitrary floating-point arithmetic, you may need the extra precision if you're trying to accurately represent the Mandelbrot set (although that totally depends on your algorithm).
Either way, in the end, you will usually need to cast back to float if you intend to use those values in drawing.
Single-precision operations are faster, and the data uses less memory and less network bandwidth. So you only use double if you gain something in exchange for slower ops and more memory and bandwidth required. There are certainly applications of rigid-body physics where the extra precision would be worth it, such as manipulating lat/lon, where single precision only gives you metre accuracy, but is this your case?
Since it's for educational purposes, maybe you want to educate yourself in the use of high-precision physics algorithms where the extra accuracy would matter. But a lot of rigid-body physics involves processes that can only be approximately quantified, such as friction between two solids, collision reaction after detection, etc., so the extra precision won't matter; you just get more precise approximate behavior :)

Reducing bandwidth between GPU and CPU (sending raw data or pre-calculating first)

OK, so I am just trying to work out the best way to reduce bandwidth between the GPU and CPU.
Particle Systems.
Should I be pre-calculating most things on the CPU and sending them to the GPU? This includes stuff like positions, rotations, velocity, calculations for alpha, random numbers, etc.
Or should I be doing as much as I can in the shaders and using the geometry shader as much as possible?
My problem is that the sort of app I have written has to have a good few variables sent to the shaders; for example, a user at run time will select emitter positions and velocity, plus a lot more. The sort of thing I am not sure how to tackle is this: if a user wants a random velocity and gives a min and max value for the random value to be selected from, should this random value be worked out on the CPU and sent as a single value to the GPU, or should both the min and max values be sent to the GPU and a random function generator on the GPU do it? Any comments on reducing bandwidth and optimization are much appreciated.
Should I be pre-calculating most things on the CPU and sending them to the GPU? This includes stuff like positions, rotations, velocity, calculations for alpha, random numbers, etc.
Or should I be doing as much as I can in the shaders and using the geometry shader as much as possible?
Impossible to answer. Spend too much CPU time and performance will drop. Spend too much GPU time and performance will drop too. Transfer too much data and performance will drop. So, instead of trying to guess (I don't know what app you're writing, what your target hardware is, etc. Hell, you didn't even specify your target API and platform), measure/profile and select the optimal method. PROFILE instead of trying to guess the performance. There are AQTime 7 Standard, gprof, and NVPerfKit for that (plus many other tools).
Do you actually have a performance problem in your application? If you don't have any performance problems, then don't do anything. Do you have, say, ten million particles per frame in real time? If not, there's little reason to worry, since a 600 MHz CPU was capable of handling thousands of them easily seven years ago. On the other hand, if you have, say, a dynamic 3D environment and the particles must interact with it (bounce), then doing it all on the GPU will be MUCH harder.
Anyway, to me it sounds like you don't have to optimize anything and there's no actual NEED to optimize. So the best idea would be to concentrate on some other things.
However, in any case, ensure that you're using the correct way to transfer "dynamic" data that is frequently updated. In DirectX that meant using dynamic write-only vertex buffers with D3DLOCK_DISCARD|D3DLOCK_NOOVERWRITE. With OpenGL that'll probably mean using STREAM or DYNAMIC buffer data with DRAW usage. That should be sufficient to avoid major performance hits.
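On the OpenGL side, that usually means a streaming buffer that is orphaned and refilled each frame; a sketch (function and parameter names are hypothetical):

    #include <GL/glew.h>  // or whichever loader the project already uses for GL 1.5+ entry points

    void upload_particles(GLuint vbo, const void* data, GLsizeiptr bytes)
    {
        glBindBuffer(GL_ARRAY_BUFFER, vbo);
        // Orphan the old store so the driver doesn't stall while the GPU is still reading it...
        glBufferData(GL_ARRAY_BUFFER, bytes, nullptr, GL_STREAM_DRAW);
        // ...then fill the fresh store with this frame's data.
        glBufferSubData(GL_ARRAY_BUFFER, 0, bytes, data);
    }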
There's no single right answer to this. Here are some things that might help you make up your mind:
Are you sure the volume of data going over the bus is high enough to be a problem? You might want to do the math and see how much data there is per second vs. what's available on the target hardware.
Is the application likely to be CPU bound or GPU bound? If it's already GPU bound there's no point loading it up further.
Particle systems are pretty easy to implement on the CPU and will run on any hardware. A GPU implementation that supports nontrivial particle systems will be more complex and limited to hardware that supports the required functionality (e.g. stream out and an API that gives access to it.)
Consider a mixed approach. Can you split the particle systems into low complexity, high bandwidth particle systems implemented on the GPU and high complexity, low bandwidth systems implemented on the CPU?
All that said, I think I would start with a CPU implementation and move some of the work to the GPU if it proves necessary and feasible.