The simple question.
Is the GLSL/ES float divide operation MUCH slower than multiply? I know it's slower on x86, but is it slower on the GPU?
When I look at the GLSL disassembly I just see one extra "rcp" instruction, and that's all. How much does that "rcp" cost?
It varies from GPU to GPU, but in most cases, an rcp (reciprocal) instruction is roughly as expensive as a mul instruction. A divide ends up being roughly as expensive as a mul + an rcp. Both are fairly cheap compared to a texture lookup or branch of any kind.
Related
I keep getting warnings from compute shader compilation recommending that I use uints instead of ints when dividing.
From the data type alone I'd assume uints are faster; however, various tests online seem to point to the contrary. Perhaps this contradiction is on the CPU side only, and GPU parallelisation gives uints some advantage I'm not aware of?
(Or is it just bad advice?)
I know that this is an extremely late answer, but this is a question that has come up for me as well, and I wanted to provide some information for anyone who sees this in the future.
I recently found this resource - https://arxiv.org/pdf/1905.08778.pdf
The table at the bottom lists the latency of basic operations on several graphics cards. There is a small but consistent savings to be found by using uints on all measured hardware. However, what the warning doesn't state is that the greater optimization is to be found by replacing division with multiplication if at all possible.
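For example (a minimal sketch with my own names; the same idea applies in shader code), when the divisor is reused, the division can be hoisted into a single reciprocal and replaced by multiplies:

/* Illustrative: replace per-element division by a precomputed reciprocal. */
void scale_by_inverse(float *out, const float *in, int n, float d)
{
    float inv = 1.0f / d;       /* one divide, done once            */
    for (int i = 0; i < n; ++i)
        out[i] = in[i] * inv;   /* cheap multiplies inside the loop */
}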
https://www.slideshare.net/DevCentralAMD/lowlevel-shader-optimization-for-nextgen-and-dx11-by-emil-persson states that type conversion is a full-rate operation like int/float subtraction, addition, and multiplication, whereas division is very slow.
I've seen it suggested that to improve performance, one should convert to float, divide, then convert back to int, but as shown in the first source, this will at best give you small gains and at worst actually decrease performance.
You are correct that this differs from the performance of the corresponding operations on the CPU, although I'm not entirely certain why.
Looking at https://www.agner.org/optimize/instruction_tables.pdf it appears that which operation is faster (MUL vs IMUL) varies from CPU to CPU - in a few at the top of the list IMUL is actually faster, despite a higher instruction count. Other CPUs don't provide a distinction between MUL and IMUL at all.
TL;DR uint division is faster on the GPU, but on the CPU YMMV
I'm working with AVX2 and need to compute a 64-bit x 64-bit -> 128-bit widening multiplication and get the 64-bit high part as fast as possible. Since AVX2 has no such instruction, is it reasonable for me to use the Karatsuba algorithm for efficiency and to gain speed?
No. On modern architectures the crossover at which Karatsuba beats schoolbook multiplication is usually somewhere between 8 and 24 machine words (e.g. between 512 and 1536 bits on x86_64). For fixed sizes, the threshold is at the smaller end of that range, and the new ADCX/ADOX instructions likely bring it in somewhat further for scalar code, but 64x64 is still too small to benefit from Karatsuba.
It's highly unlikely that AVX2 will beat the mulx instruction, which does 64b x 64b -> 128b in one instruction. There is one exception I'm aware of: large multiplications using floating-point FFT.
However, if you don't need exactly 64bx64b to 128b you could consider
53bx53b to 106b using double-double arithmetic.
Multiplying four pairs of 53-bit numbers (held in a and b) to get four 106-bit products takes only two instructions:
__m256d p = _mm256_mul_pd(a, b);       // rounded high part of each product
__m256d e = _mm256_fmsub_pd(a, b, p);  // exact low-order error: a*b - p
This gives four 106-bit numbers in two instructions compared to one 128-bit number in one instruction using mulx.
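For what it's worth, here is the same idea as a scalar sketch (the function name is mine): the FMA recovers the rounding error exactly, so p + e is the exact product as long as nothing overflows or underflows.

#include <math.h>

/* Illustrative: split the exact product of two <=53-bit values into a
   rounded high part p and the exact low-order error e (p + e == a*b). */
static void two_prod(double a, double b, double *p, double *e)
{
    *p = a * b;            /* rounded product (high part)       */
    *e = fma(a, b, -*p);   /* a*b - p, computed exactly via FMA */
}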
It's hard to tell without trying, but it might be faster to just use the AMD64 MUL instruction, which does 64x64=128 with the same throughput as most AVX2 instructions (but is not vectorized). The drawback is that you need to move the operands to general-purpose registers if they were in YMM registers. That would give something like LOAD + MUL + STORE for a single 64x64=128.
If you can vectorize Karatsuba in AVX2, try both AVX2 and MUL and see which is faster. If you can't vectorize, single MUL will probably be faster. If you can remove the load and store to regular registers, single MUL will be definitely faster.
Both MUL and AVX2 instructions can have an operand in memory with the same throughput, and it may help to remove one load for MUL.
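For reference, on GCC/Clang a plain unsigned __int128 multiply already compiles to a single MUL or MULX, so a scalar helper along these lines (a sketch; the name is mine) is usually the baseline to beat:

#include <stdint.h>

/* Illustrative 64x64 -> 128-bit widening multiply via a compiler extension;
   GCC/Clang emit a single MUL or MULX for this. */
static inline uint64_t mul64_hi(uint64_t a, uint64_t b, uint64_t *lo)
{
    unsigned __int128 p = (unsigned __int128)a * b;
    *lo = (uint64_t)p;
    return (uint64_t)(p >> 64);
}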
I often see code that converts ints to doubles, then back to ints, then to doubles again (sometimes for good reasons, sometimes not), and it just occurred to me that this seems like a "hidden" cost in my program. Let's assume the conversion method is truncation.
So, just how expensive is it? I'm sure it varies depending on hardware, so let's assume a newish Intel processor (Haswell, if you like, though I'll take anything). Some metrics I'd be interested in (though a good answer needn't have all of them):
# of generated instructions
# of cycles used
Relative cost compared to basic arithmetic operations
I would also assume that the way we would most acutely experience the impact of a slow conversion would be with respect to power usage rather than execution speed, given the difference in how many computations we can perform each second relative to how much data can actually arrive at the CPU each second.
Here's what I could dig up myself, for x86-64 doing FP math with SSE2 (not legacy x87 where changing the rounding mode for C++'s truncation semantics was expensive):
When I take a look at the generated assembly from clang and gcc, it looks like the cast from double to int boils down to one instruction: cvttsd2si.
From int to double it's cvtsi2sd. (cvtsi2sdl is the AT&T syntax for cvtsi2sd with a 32-bit integer operand.)
With auto-vectorization, we get cvtdq2pd.
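For concreteness, here is a minimal pair of functions (my own naming) that typically produce exactly those instructions at -O2:

/* Illustrative: which conversion instruction each direction compiles to. */
double int_to_double(int i)    { return (double)i; }  /* cvtsi2sd             */
int    double_to_int(double d) { return (int)d;     } /* cvttsd2si (C trunc.) */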
So I suppose the question becomes: what is the cost of those?
These instructions each cost approximately the same as an FP addsd plus a movq xmm, r64 (fp <- integer) or movq r64, xmm (integer <- fp), because they decode to 2 uops which run on the same ports, on mainstream (Sandybridge/Haswell/Skylake) Intel CPUs.
The Intel® 64 and IA-32 Architectures Optimization Reference Manual lists the latency of the cvttsd2si instruction as 5 cycles (see Appendix C-16). cvtsi2sd, depending on your architecture, has latency varying from 1 on Silvermont to more like 7-16 on several other architectures.
Agner Fog's instruction tables have more accurate/sensible numbers, like 5-cycle latency for cvtsi2sd on Silvermont (with 1 per 2 clock throughput), or 4c latency on Haswell, with one per clock throughput (if you avoid the dependency on the destination register from merging with the old upper half, like gcc usually does with pxor xmm0,xmm0).
SIMD packed-float to packed-int is great; single uop. But converting to double requires a shuffle to change element size. SIMD float/double<->int64_t doesn't exist until AVX512, but can be done manually with limited range.
Intel's manual defines latency as: "The number of clock cycles that are required for the execution core to complete the execution of all of the μops that form an instruction." But a more useful definition is the number of clocks from an input being ready until the output becomes ready. Throughput is more important than latency if there's enough parallelism for out-of-order execution to do its job: What considerations go into predicting latency for operations on modern superscalar processors and how can I calculate them by hand?.
The same Intel manual says that an integer add instruction has 1 cycle of latency and an integer imul has 3 (Appendix C-27). FP addsd and mulsd run at 2 per clock throughput, with 4 cycle latency, on Skylake. Same for the SIMD versions, and for FMA, with 128 or 256-bit vectors.
On Haswell, addsd / addpd is only 1 per clock throughput, but 3 cycle latency thanks to a dedicated FP-add unit.
So, the answer boils down to:
1) It's hardware optimized, and the compiler leverages the hardware machinery.
2) It costs only a bit more than a multiply does in terms of the number of cycles in one direction, and a highly variable amount in the other (depending on your architecture). Its cost is neither free nor absurd, but probably warrants more attention given how easy it is to write code that incurs the cost in a non-obvious way.
Of course this kind of question depends on the exact hardware and even on the mode.
On x86, with my i7 used in 32-bit mode and default options (gcc -m32 -O3), the conversion from int to double is quite fast; the opposite direction, instead, is much slower because the C standard mandates an absurd rule (truncation of the fractional part).
This way of rounding is bad both for math and for hardware and requires the FPU to switch to this special rounding mode, perform the truncation, and switch back to a sane way of rounding.
If you need speed, doing the float->int conversion with the simple fistp instruction is faster and also much better for computation results, but it requires some inline assembly.
inline int my_int(double x)
{
    int r;
    asm ("fldl %1\n"     /* push x onto the x87 stack                    */
         "fistpl %0\n"   /* store as int using the current rounding mode */
                         /* (round-to-nearest by default), then pop      */
         :"=m"(r)
         :"m"(x));
    return r;
}
This is more than 6 times faster than the naive x = (int)y; conversion (and doesn't have a bias toward 0).
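To make the behavioral difference concrete (assuming the default round-to-nearest FPU mode):

/* Illustrative: my_int() rounds to nearest, the cast truncates toward zero. */
int a = my_int(2.7);   /* a == 3 */
int b = (int)2.7;      /* b == 2 */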
The very same processor, when used in 64-bit mode, however, has no speed problem, and using the fistp code actually makes it run somewhat slower.
Apparently the hardware designers gave up and implemented the bad rounding algorithm directly in hardware: in 64-bit mode the compiler uses SSE2's cvttsd2si, which truncates in a single instruction with no rounding-mode switch, so badly-rounding code can now run fast.
I am using the ?GEMM functions from Intel MKL to multiply matrices. Consider the following two matrix multiplications:
cblas_?gemm(CblasRowMajor, CblasNoTrans, CblasNoTrans, m, n, k,
            1.0,
            Matrix1, k,   /* m x k, leading dimension k */
            Matrix2, n,   /* k x n, leading dimension n */
            0.0,
            A, n);        /* m x n, leading dimension n */
where m=1E5, n=1E4, and k=5. When I use pca_dgemm and pca_sgemm, this uses all 12 cores and executes beautifully.
However, when I do the following matrix multiplication:
cblas_?gemm(CblasRowMajor, CblasNoTrans, CblasNoTrans, m, l, n,
            1.0,
            A, n,         /* m x n, leading dimension n */
            Ran, l,       /* n x l, leading dimension l */
            0.0,
            Q, l);        /* m x l, leading dimension l */
where m=1E5, n=1E5, and l=7 (note that the order of the parameters passed is different, though; this is (m,n) * (n,l)). pca_dgemm uses all 12 cores and executes beautifully.
However, pca_sgemm does not. It uses only 1 core and, of course, takes much longer. Of course, for sgemm I am using arrays of floats, whereas for dgemm I am using arrays of doubles.
Why could this be? They both give accurate results, but sgemm only multithreads on the former, whereas dgemm multithreads on both! How could simply changing the data type make this kind of difference?
Note that all arrays were allocated using mkl_malloc with an alignment of 64.
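For reference, a concrete single-precision sketch of the allocation and the second call (types written out; m, n, l and the buffer names as above):

#include <mkl.h>

/* Illustrative: 64-byte-aligned buffers and the explicit sgemm for Q = A * Ran. */
float *A   = (float *)mkl_malloc((size_t)m * n * sizeof(float), 64);
float *Ran = (float *)mkl_malloc((size_t)n * l * sizeof(float), 64);
float *Q   = (float *)mkl_malloc((size_t)m * l * sizeof(float), 64);

cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans, m, l, n,
            1.0f, A, n, Ran, l, 0.0f, Q, l);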
Edit 2: Please also note that when l=12 (in other words, with a larger matrix) it does thread in sgemm. It is clear that the sgemm version requires larger matrices in order to parallelize, but dgemm does not have this requirement. Why is this?
The MKL functions do quite a bit of work up-front to try to guess what is going to be the fastest way of executing an operation, so it's no surprise that it comes to a different decision when processing doubles or singles.
When deciding which strategy to take, it has to weigh the cost of doing the operation in a single thread against the overhead of launching threads to do the operation in parallel. One factor that will come into play is that SSE instructions can do operations on single-precision numbers twice as fast as on double-precision numbers, so the heuristic might well decide that it's likely quicker to do the operation on singles as SSE SIMD operations on a single core rather than kicking off twelve threads to do it in parallel. Exactly how many operands it can process in parallel will depend on the details of your CPU architecture; SSE2, for instance, can operate on four single operands or two double operands at a time, while more recent SSE instruction sets support wider data.
I've found in the past that, for small matrices/vectors, it's often faster to roll your own functions than to use MKL. For instance, if all your operations are on 3-vectors and 3x3 matrices, it's quite a bit faster to just write your own BLAS functions in plain C and faster again to optimise them with SSE (if you can meet the alignment constraints). For a mix of 3- and 6-vectors, it's still faster to write your own optimised SSE version. This is because the cost of the MKL version deciding which strategy to use becomes a considerable overhead when the operations are small.
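To illustrate that last point, a hand-rolled 3x3 multiply is only a few lines of plain C (a sketch; names are mine) and carries none of the dispatch overhead:

/* Illustrative hand-rolled row-major 3x3 multiply: C = A * B. */
static void mat3_mul(const float A[9], const float B[9], float C[9])
{
    for (int i = 0; i < 3; ++i)
        for (int j = 0; j < 3; ++j)
            C[3*i + j] = A[3*i + 0] * B[0 + j]
                       + A[3*i + 1] * B[3 + j]
                       + A[3*i + 2] * B[6 + j];
}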
I'm currently working on an application that requires a large number of variables to be stored and processed (~4 GB as floats).
Since the precision of the individual variables is of less importance (I know that they'll be bounded), I saw that I could use OpenCL's half instead of float, since that would really decrease the amount of memory.
My question is twofold.
Is there any performance hit to using half instead of float? (I'd imagine graphics cards are built for float operations.)
Is there a performance hit for mixing floats and halves in calculations (i.e., a float times a half)?
Sincerely,
Andreas Falkenstrøm Mieritz
ARM CPUs and GPUs have native support for half in their ALUs so you'll get close to double speed, plus substantial savings in energy consumption. Edit: The same goes for PowerVR GPUs.
Desktop hardware only supports half in the load/store and texturing units, AFAIK. Even so, I'd expect half textures to perform better than float textures or buffers on any GPU. Particularly if you can make some clever use of texture filtering.
OpenCL kernels are almost always memory-speed or PCIe-speed bound. If you are converting a decent chunk of your data to half floats, this will enable faster transfers of your values. Almost certainly faster on any platform/device.
As far as performance, half is rarely worse than float. I am fairly sure that any device which supports half will do computations as fast as it would with float. Again, even if there is a slight overhead here, you will more than make up for it in your far-superior transfer times.
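As a sketch of the storage-only approach (kernel and buffer names are mine), OpenCL C lets the data live as half in global memory for bandwidth while the arithmetic still happens in float; vload_half/vstore_half do the conversions:

/* Illustrative OpenCL C kernel: half in memory, float in the ALU. */
__kernel void scale(__global const half *in, __global half *out, float k)
{
    size_t i = get_global_id(0);
    float x = vload_half(i, in);   /* half -> float on load  */
    vstore_half(x * k, i, out);    /* float -> half on store */
}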