sgemm does not multithread when dgemm does - Intel MKL - c++

I am using the ?GEMM functions from Intel MKL to multiply matrices. Consider the following two matrix multiplications:
cblas_?gemm(CblasRowMajor, CblasNoTrans, CblasNoTrans, m, n, k,
            1.0,
            Matrix1, k,
            Matrix2, n,
            0.0,
            A, n);
where m=1E5, n=1E4, and k=5. When I use pca_dgemm and pca_sgemm, this uses all 12 cores and executes beautifully.
However, when I do the following matrix multiplication:
cblas_?gemm(CblasRowMajor, CblasNoTrans, CblasNoTrans, m, l, n,
            1.0,
            A, n,
            Ran, l,
            0.0,
            Q, l);
where m=1E5, n=1E5, and l=7 (note that the order of the parameters passed is different, though; this is (m,n) * (n,l)). pca_dgemm uses all 12 cores and executes beautifully.
However, pca_sgemm does not. It uses only 1 core and, of course, takes much longer. (For sgemm I am using arrays of floats, whereas for dgemm I am using arrays of doubles.)
Why could this be? They both give accurate results, but sgemm only multithreads in the former case, whereas dgemm multithreads in both! How could simply changing the data type make this kind of difference?
Note that all arrays were allocated using mkl_malloc with an alignment of 64.
Edit 2: Please also note that when l=12, in other words with a larger matrix, sgemm does thread. In other words, the sgemm version clearly requires larger matrices before it parallelizes, but dgemm does not have this requirement. Why is this?

The MKL functions do quite a bit of work up-front to try to guess what is going to be the fastest way of executing an operation, so it's no surprise that it comes to a different decision when processing doubles or singles.
When deciding which strategy to take, it has to weigh the cost of doing the operation in a single thread against the overhead of launching threads to do the operation in parallel. One factor that will come into play is that SSE instructions can do operations on single-precision numbers twice as fast as on double-precision numbers, so the heuristic might well decide that it's likely quicker to do the operation on singles as SSE SIMD operations on a single core rather than kicking off twelve threads to do it in parallel. Exactly how many it can do in parallel will depend on the details of your CPU architecture; SSE2, for instance, can do an operation on four single operands or two double operands, while more recent SSE instruction sets support wider data.
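If you want to separate the heuristic's decision from your environment settings, MKL exposes service functions to pin the thread count. The following is a minimal sketch, assuming the standard mkl_set_dynamic / mkl_set_num_threads calls from mkl.h (not something shown in the question):

#include <mkl.h>

int main()
{
    mkl_set_dynamic(0);       // ask MKL not to adjust the thread count on its own
    mkl_set_num_threads(12);  // request all 12 cores explicitly

    // ...allocate with mkl_malloc and call cblas_sgemm / cblas_dgemm as in
    // the question; mkl_get_max_threads() reports the limit MKL will honour.
    return 0;
}

Even with these settings, MKL may still decide that a very thin problem is not worth threading, which is consistent with the behaviour described above.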
I've found in the past that, for small matrices/vectors, it's often faster to roll your own functions than to use MKL. For instance, if all your operations are on 3-vectors and 3x3 matrices, it's quite a bit faster to just write your own BLAS functions in plain C and faster again to optimise them with SSE (if you can meet the alignment constraints). For a mix of 3- and 6-vectors, it's still faster to write your own optimised SSE version. This is because the cost of the MKL version deciding which strategy to use becomes a considerable overhead when the operations are small.
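For illustration, a minimal plain C/C++ sketch of the kind of hand-rolled 3x3 multiply meant here (row-major storage assumed); with fixed trip counts the compiler inlines and unrolls it, and there is no dispatch overhead at all:

/* C = A * B for row-major 3x3 matrices. */
static inline void mat3_mul(const float A[9], const float B[9], float C[9])
{
    for (int i = 0; i < 3; ++i)
        for (int j = 0; j < 3; ++j)
            C[3 * i + j] = A[3 * i + 0] * B[0 * 3 + j]
                         + A[3 * i + 1] * B[1 * 3 + j]
                         + A[3 * i + 2] * B[2 * 3 + j];
}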

Related

Why doesn't Eigen support OpenMP for coefficient-wise operations?

This post along with some tests made it clear Eigen does not apply multiprocessing to coefficient-wise operations, such as cwiseProduct or Array multiplication, although matrix-matrix products can exploit multiple cores.
Still, with some optimization Eigen seems to be quite fast, and even if I tried to write my own matrix library for particular purposes, I doubt it would be faster than Eigen even with OpenMP enabled for my library.
Why doesn't Eigen support OpenMP when it comes to coefficient-wise operations? Is it some sort of a blunder by the developer or are there some specific reasons to avoid multiprocessing for specific operations?
Can I manually include OpenMP support for such operations? The code for Eigen seems complicated so it is hard to find the exact implementation of a particular function, even through the use of Visual Studio instruments.
Why doesn't Eigen support OpenMP when it comes to coefficient-wise operations? Is it some sort of a blunder by the developer or are there some specific reasons to avoid multiprocessing for specific operations?
Using multiple threads by default would not be a good idea, as users can already use multiple threads in their applications, and having nested parallel loops is clearly not efficient. Moreover, sharing the work for each operation can introduce a significant overhead, which is not great for basic operations on small/medium-sized arrays. Eigen is meant to be fast for both small and big arrays. Using OpenMP on top of Eigen is better in practice. This is especially true on NUMA systems due to the first-touch policy: hiding the multithreading can introduce surprising overheads due to remote accesses or page migrations.
For complex operations like LU decomposition, it is not reasonable to ask the user to parallelise the operation, as they would need to rewrite most of the algorithm. This is why Eigen tries to parallelize such algorithms.
Can I manually include OpenMP support for such operations? The code for Eigen seems complicated so it is hard to find the exact implementation of a particular function, even through the use of Visual Studio instruments.
Element-wise operations are trivial to parallelize with OpenMP. You can simply use a #pragma omp parallel for directive (possibly combined with a collapse clause) on a loop iterating from 0 to array.rows() (or array.cols()), using basic Eigen array indexing inside, as sketched below.
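A minimal sketch, assuming Eigen's ArrayXXf and a translation unit compiled with OpenMP enabled (e.g. -fopenmp); the pragma is simply ignored if OpenMP is off:

#include <Eigen/Dense>

// Coefficient-wise product computed row by row; each thread handles whole
// rows, and Eigen still vectorizes the per-row expression with SIMD.
Eigen::ArrayXXf parallelCwiseProduct(const Eigen::ArrayXXf& a,
                                     const Eigen::ArrayXXf& b)
{
    Eigen::ArrayXXf out(a.rows(), a.cols());
    #pragma omp parallel for
    for (Eigen::Index i = 0; i < a.rows(); ++i)
        out.row(i) = a.row(i) * b.row(i);
    return out;
}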
Note that while using multiple threads for basic operations seems interesting at first glance, in practice it is often disappointing in terms of speed-up on most machines. Indeed, reading/writing data in RAM is very expensive compared to applying arithmetic operations (e.g. add/sub/mul) to each item, as long as the operation is already vectorized using SIMD instructions. For example, on my machine 1 core can reach a throughput of 20-25 GiB/s while the maximum throughput is 40 GiB/s with 6 cores (i.e. a speed-up of less than 2x with 6x the threads).

GPU HLSL compute shader warnings int and uint division

I keep getting warnings from compute shader compilation recommending that I use uints instead of ints when dividing.
From the data types alone I would assume uints are faster; however, various tests online seem to point to the contrary. Perhaps this contradiction applies only on the CPU side and GPU parallelisation has some unknown advantage?
(Or is it just bad advice?)
I know that this is an extremely late answer, but this is a question that has come up for me as well, and I wanted to provide some information for anyone who sees this in the future.
I recently found this resource - https://arxiv.org/pdf/1905.08778.pdf
The table at the bottom lists the latency of basic operations on several graphics cards. There is a small but consistent saving to be had by using uints on all measured hardware. However, what the warning doesn't state is that the greater optimization is to be found by replacing division with multiplication wherever possible.
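As a small illustration (written in C++ here for consistency with the rest of the page, but the same rewrite applies in HLSL), a division by a known constant can be turned into a multiplication by its reciprocal; the function names are made up for the example:

float scale_div(float x) { return x / 16.0f; }            // division: the slow operation
float scale_mul(float x) { return x * (1.0f / 16.0f); }   // reciprocal is folded at compile time

For powers of two the reciprocal is exact; for other constants it is rounded, so this trades a last bit of accuracy for speed.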
https://www.slideshare.net/DevCentralAMD/lowlevel-shader-optimization-for-nextgen-and-dx11-by-emil-persson states that type conversion is a full-rate operation like int/float subtraction, addition, and multiplication, whereas division is very slow.
I've seen it suggested that to improve performance, one should convert to float, divide, then convert back to int, but as shown in the first source, this will at best give you small gains and at worst actually decrease performance.
You are correct that this differs from the performance of those operations on the CPU, although I'm not entirely certain why.
Looking at https://www.agner.org/optimize/instruction_tables.pdf it appears that which operation is faster (MUL vs IMUL) varies from CPU to CPU - in a few at the top of the list IMUL is actually faster, despite a higher instruction count. Other CPUs don't provide a distinction between MUL and IMUL at all.
TL;DR uint division is faster on the GPU, but on the CPU YMMV

Opencv Subtraction costlier than multiplication

I am trying to optimize the code of an image-processing project.
The analysis by the VS2013 preview profiler shows that the subtract operation is costlier than the multiplication (mul) operation.
In general, multiplication is costlier than subtraction, right?
If so, why isn't it here?
I think it is potentially a combination of several factors.
t1 needs to be allocated during the subtract call, and this takes a bit of time
t1 is quite possibly already in cache during t1.mul(t1) call, so accesses are faster
I'm not sure what type td is, but I bet there is a saturate_cast going on for every element in the matrix when you add 1 to td; no casting needed in the .mul() calls
subtract and multiply are both memory-bound operations, so for all but the smallest matrices, properly optimized code will hide the higher latency of the multiply instructions to achieve the same throughput for both operations, all else being equal (eg, caching, etc.)
the .mul() calls are in-place operations, which has significant advantages for caching
if this is a release build of the project, it's possible the optimizer rearranged code in such a way as to confuse the profiler about which time-consuming machine instructions correspond to which lines of code. You'd be surprised at the kind of deep wizardry involved in the optimized implementation of arithmetic operations on matrices in OpenCV.
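A hypothetical sketch of the pattern those points describe (the question's actual t1/td code isn't shown here, so the names and types are assumptions): an out-of-place subtract that allocates its destination, followed by an element-wise multiply on data that is already warm in cache:

#include <opencv2/core.hpp>

void example(const cv::Mat& a, const cv::Mat& b)
{
    cv::Mat t1;
    cv::subtract(a, b, t1);   // t1 is (re)allocated here and its data is cold
    cv::Mat sq = t1.mul(t1);  // t1 was just written, so it is likely still in cache
}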

double or float, which is faster? [duplicate]

I am reading "Accelerated C++". I found one sentence which states "sometimes double is faster in execution than float in C++". After reading that sentence I got confused about how float and double work. Please explain this point to me.
Depends on what the native hardware does.
If the hardware is (or is like) x86 with legacy x87 math, float and double are both extended (for free) to an internal 80-bit format, so both have the same performance (except for cache footprint / memory bandwidth)
If the hardware implements both natively, like most modern ISAs (including x86-64 where SSE2 is the default for scalar FP math), then usually most FPU operations are the same speed for both. Double division and sqrt can be slower than float, as well as of course being significantly slower than multiply or add. (Float being smaller can mean fewer cache misses. And with SIMD, twice as many elements per vector for loops that vectorize).
If the hardware implements only double, then float will be slower if conversion to/from the native double format isn't free as part of float-load and float-store instructions.
If the hardware implements float only, then emulating double with it will cost even more time. In this case, float will be faster.
And if the hardware implements neither, both have to be emulated in software. In this case, both will be slow, but double will be slightly slower (more load and store operations at the least).
The quote you mention is probably referring to the x86 platform, where the first case applies. But this doesn't hold true in general.
Also beware that x * 3.3 + y for float x,y will trigger promotion to double for both variables. This is not the hardware's fault, and you should avoid it by writing 3.3f to let your compiler make efficient asm that actually keeps numbers as floats if that's what you want.
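A minimal illustration of that promotion (the function names are made up):

float scaled_slow(float x, float y) { return x * 3.3  + y; } // 3.3 is a double literal: x and y are promoted,
                                                             // the math runs in double, and the result is
                                                             // narrowed back to float
float scaled_fast(float x, float y) { return x * 3.3f + y; } // 3.3f keeps the whole expression in float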
You can find a complete answer in this article:
What Every Computer Scientist Should Know About Floating-Point Arithmetic
This is a quote from a previous Stack Overflow thread, about how float and double variables affect memory bandwidth:
If a double requires more storage than a float, then it will take longer to read the data. That's the naive answer. On a modern IA32, it all depends on where the data is coming from. If it's in L1 cache, the load is negligible provided the data comes from a single cache line. If it spans more than one cache line there's a small overhead. If it's from L2, it takes a while longer; if it's in RAM then it's longer still; and finally, if it's on disk it's a huge time. So the choice of float or double is less important than the way the data is used. If you want to do a small calculation on lots of sequential data, a small data type is preferable. Doing a lot of computation on a small data set would allow you to use bigger data types without any significant effect. If you're accessing the data very randomly, then the choice of data size is unimportant - data is loaded in pages / cache lines. So even if you only want a byte from RAM, you could get 32 bytes transferred (this is very dependent on the architecture of the system). On top of all of this, the CPU/FPU could be super-scalar (aka pipelined). So, even though a load may take several cycles, the CPU/FPU could be busy doing something else (a multiply for instance) that hides the load time to a degree.
Short answer is: it depends.
CPU with x87 will crunch floats and doubles equally fast. Vectorized code will run faster with floats, because SSE can crunch 4 floats or 2 doubles in one pass.
Another thing to consider is memory speed. Depending on your algorithm, your CPU could be idling a lot while waiting for the data. Memory intensive code will benefit from using floats, but ALU limited code won't (unless it is vectorized).
I can think of two basic cases when doubles are faster than floats:
Your hardware supports double operations but not float operations, so floats will be emulated by software and therefore be slower.
You really need the precision of doubles. If you used floats anyway, you would have to combine two floats to reach similar precision to a double, and emulating a true double with floats in this way will be slower than using doubles in the first place.
You do not necessarily need doubles, but your numeric algorithm converges faster due to the enhanced precision of doubles. Also, doubles might offer enough precision to use a faster but numerically less stable algorithm in the first place.
For completeness' sake I also give some reasons for the opposite case, where floats are faster. You can see for yourself which reasons dominate in your case:
Floats are faster than doubles when you don't need double's precision and you are memory-bandwidth bound and your hardware doesn't carry a penalty on floats.
They conserve memory bandwidth because they occupy half the space per number.
There are also platforms that can process more floats than doubles in parallel.
On Intel, the coprocessor (nowadays integrated) will handle both equally fast, but as some others have noted, doubles result in higher memory bandwidth which can cause bottlenecks. If you're using scalar SSE instructions (default for most compilers on 64-bit), the same applies. So generally, unless you're working on a large set of data, it doesn't matter much.
However, parallel SSE instructions will allow four floats to be handled in one instruction, but only two doubles, so here float can be significantly faster.
In an experiment of adding 3.3 two billion (2E9) times, the results are:
Summation time in s: 2.82 summed value: 6.71089e+07 // float
Summation time in s: 2.78585 summed value: 6.6e+09 // double
Summation time in s: 2.76812 summed value: 6.6e+09 // long double
So double is faster and is the default in C and C++. It's more portable and the default across all C and C++ library functions. Also, double has significantly higher precision than float.
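For reference, a minimal sketch of the kind of loop behind timings like these (the exact benchmark code isn't given in the answer, so this is an assumption), using std::chrono for timing:

#include <chrono>
#include <cstdio>

int main()
{
    const long long iterations = 2000000000LL;
    auto start = std::chrono::steady_clock::now();

    double sum = 0.0;                      // swap for float / long double to compare
    for (long long i = 0; i < iterations; ++i)
        sum += 3.3;

    std::chrono::duration<double> elapsed = std::chrono::steady_clock::now() - start;
    std::printf("Summation time in s: %g summed value: %g\n", elapsed.count(), sum);
}

Note that the float run accumulates large rounding error (the 6.71089e+07 result above, far from the true 6.6e+09), which is itself a reason to prefer double here.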
Even Stroustrup recommends double over float:
"The exact meaning of single-, double-, and extended-precision is implementation-defined. Choosing the right precision for a problem where the choice matters requires significant understanding of floating-point computation. If you don't have that understanding, get advice, take the time to learn, or use double and hope for the best."
Perhaps the only case where you should use float instead of double is on 64-bit hardware with a modern gcc, because float is smaller: double is 8 bytes and float is 4 bytes.
float is usually faster. double offers greater precision. However performance may vary in some cases if special processor extensions such as 3dNow or SSE are used.
There is only one reason 32-bit floats can be slower than 64-bit doubles (or 80-bit x87 values), and that is alignment. Other than that, floats take less memory, which generally means faster access and better cache performance. It also takes fewer cycles to process 32-bit instructions. And even when the (co)processor has no 32-bit instructions, it can perform them on 64-bit registers at the same speed. It is probably possible to create a test case where doubles are faster than floats, and vice versa, but my measurements of real statistics algorithms didn't show a noticeable difference.

x86 4byte floats vs. 8byte doubles (vs. long long)?

We have a measurement data processing application and currently all data is held as C++ float, which means 32 bit / 4 bytes on our x86/Windows platform (32-bit Windows application).
Since precision is becoming an issue, there have been discussions to move to another datatype. The options currently discussed are switching to double (8byte) or implementing a fixed decimal type on top of __int64 (8byte).
The reason the fixed-decimal solution using __int64 as underlying type is even discussed is that someone claimed that double performance is (still) significantly worse than processing floats and that we might see significant performance benefits using a native integer type to store our numbers. (Note that we really would be fine with fixed decimal precision, although the code would obviously become more complex.)
Obviously we need to benchmark in the end, but I would like to ask whether the statement that doubles are worse holds any truth looking at modern processors? I guess for large arrays doubles may mess up cache hits more than floats, but otherwise I really fail to see how they could differ in performance?
It depends on what you do. Additions, subtractions and multiplies on double are just as fast as on float on current x86 and POWER architecture processors. Divisions, square roots and transcendental functions (exp, log, sin, cos, etc.) are usually notably slower with double arguments, since their runtime is dependent on the desired accuracy.
If you go fixed point, multiplies and divisions need to be implemented with long integer multiply / divide instructions which are usually slower than arithmetic on doubles (since processors aren't optimized as much for it). Even more so if you're running in 32 bit mode where a long 64 bit multiply with 128 bit results needs to be synthesized from several 32-bit long multiplies!
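To make that concrete, here is a hypothetical Q32.32 fixed-point multiply (the format and function name are made up for the example); the full product of two 64-bit fixed-point values needs a 128-bit intermediate before rescaling:

#include <cstdint>

// __int128 is a GCC/Clang extension available on 64-bit targets; a 32-bit
// build (or MSVC) would have to synthesize the wide multiply by other means.
int64_t fixed_mul_q32(int64_t a, int64_t b)
{
    __int128 wide = static_cast<__int128>(a) * b;
    return static_cast<int64_t>(wide >> 32);   // drop the extra 32 fractional bits
}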
Cache utilization is a red herring here. 64-bit integers and doubles are the same size - if you need more than 32 bits, you're gonna eat that penalty no matter what.
Look it up. Both AMD and Intel publish the instruction latencies for their CPUs in freely available PDF documents on their websites.
However, for the most part, performance won't be significantly different, for a couple of reasons:
when using the x87 FPU instead of SSE, all floating point operations are calculated at 80 bits precision internally, and then rounded off, which means that the actual computation is equally expensive for all floating-point types. The only cost is really memory-related then (in terms of CPU cache and memory bandwidth usage, and that's only an issue in float vs double, but irrelevant if you're comparing to int64)
with or without SSE, nearly all floating-point operations are pipelined. When using SSE, the double instructions may (I haven't looked this up) have a higher latency than their float equivalents, but the throughput is the same, so it should be possible to achieve similar performance with doubles.
It's also not a given that a fixed-point datatype would actually be faster either. It might, but the overhead of keeping this datatype consistent after some operations might outweigh the savings. Floating-point operations are fairly cheap on a modern CPU. They have a bit of latency, but as mentioned before, they're generally pipelined, potentially hiding this cost.
So my advice:
Write some quick tests. It shouldn't be that hard to write a program that performs a number of floating-point ops, and then measure how much slower the double version is relative to the float one.
Look it up in the manuals, and see for yourself if there's any significant performance difference between float and double computations
I have trouble understanding the rationale "since double is slower than float, we'll use a 64-bit int". Guessing at performance has always been a black art requiring a lot of experience; on today's hardware it is even worse considering the number of factors to take into account. Even measuring is difficult. I know of several cases where micro-benchmarks pointed to one solution but measurement in context showed that another was better.
First note that two of the factors that have been given to explain the claimed slower performance of double relative to float are not pertinent here: the bandwidth needed will be the same for double as for a 64-bit int, and SSE2 vectorization would give an advantage to double...
Then consider that using integer computation will increase the pressure on the integer registers and computation units, while the floating-point ones apparently stay idle. (I've already seen cases where doing integer computation in double was a win, attributed to the additional computation units made available.)
So I doubt that rolling your own fixed-point arithmetic would be advantageous over using double (but I could be shown wrong by measurements).
Implementing 64-bit fixed point isn't really fun, especially for more complex functions like square root or logarithm. Integers will probably still be a bit faster for simple operations like additions. And you'll need to deal with integer overflows. And you need to be careful when implementing rounding, or errors can easily accumulate.
We're implementing fixed point in a C# project because we need determinism, which floating point on .NET doesn't guarantee. And it's relatively painful. Some formula contained x^3 and, bang, integer overflow. Unless you have really compelling reasons not to, use float or double instead of fixed point.
SIMD instructions from SSE2 complicate the comparison further, since they allow operating on several floating-point numbers (4 floats or 2 doubles) at the same time. I'd use double and try to take advantage of these instructions. So double will probably be significantly slower than float, but comparing with ints is difficult, and I'd prefer float/double over fixed point in most scenarios.
It's always best to measure instead of guess. Yes, on many architectures, calculations on doubles process twice the data as calculations on floats (and long doubles are slower still). However, as other answers, and comments on this answer, have pointed out, the x86 architecture doesn't follow the same rules as, say, ARM processors, SPARC processors, etc. On x86 floats, doubles and long doubles are all converted to long doubles for computation. I should have known this, because the conversion causes x86 results to be more accurate than SPARC and Sun went through a lot of trouble to get the less accurate results for Java, sparking some debate (note, that page is from 1998, things have since changed).
Additionally, calculations on doubles are built in to the CPU where calculations on a fixed decimal datatype would be written in software and potentially slower.
You should be able to find a decent fixed sized decimal library and compare.
With various SIMD instruction sets you can perform 4 single precision floating point operations at the same cost as one, essentially you pack 4 floats into a single 128 bit register. When switching to doubles you can only pack 2 doubles into these registers and hence you can only do two operations at the same time.
As many people have said, a 64-bit int is probably not worth it if double is an option, at least when SSE is available. This might be different on microcontrollers of various kinds, but I guess that is not your application. If you need additional precision in long sums of floats, you should keep in mind that this operation is sometimes problematic with floats and doubles and would be more exact with integers.