I've come across several situations where the claim is made that doing a dot product in GLSL will end up being run in one cycle. For example:
Vertex and fragment processors operate on four-vectors, performing four-component instructions such as additions, multiplications, multiply-accumulates, or dot products in a single cycle.
http://http.developer.nvidia.com/GPUGems2/gpugems2_chapter35.html
I've also seen a claim in comments somewhere that:
dot(value, vec4(.25))
would be a more efficient way to average four values, compared to:
(x + y + z + w) / 4.0
Again, the claim was that dot(vec4, vec4) would run in one cycle.
I see that ARB says that dot product (DP3 and DP4) and cross product (XPD) are single instructions, but does that mean that those are just as computationally expensive as doing a vec4 add? Is there basically some hardware implementation, along the lines of multiply-accumulate on steroids, in play here? I can see how something like that is useful in computer graphics, but doing in one cycle what could be quite a few instructions on their own sounds like a lot.
The question cannot be answered in any definitive way as a whole. How long any operation takes is not just hardware-specific, but also code-specific. That is, the surrounding code can completely hide the cost of an operation, or it can make it take longer.
In general, you should not assume that a dot product is single-cycle.
However, there are certain aspects that can certainly be answered:
I've also seen a claim in comments somewhere that:
dot(value, vec4(.25))
would be a more efficient way to average four values, compared to:
(x + y + z + w) / 4.0
I would expect this to be kinda true, so long as x, y, z, and w are in fact different float values rather than members of the same vec4 (that is, they're not value.x, value.y, etc). If they are elements of the same vector, I would say that any decent optimizing compiler should compile both of these to the same set of instructions. A good peephole optimizer should catch patterns like this.
I say that it is "kinda true", because it depends on the hardware. The dot-product version should at the very least not be slower. And again, if they are elements of the same vector, the optimizer should handle it.
I see that ARB says that dot product (DP3 and DP4) and cross product (XPD) are single instructions, but does that mean that those are just as computationally expensive as doing a vec4 add?
You should not assume that ARB assembly has any relation to the actual hardware machine instruction code.
Is there basically some hardware implementation, along the lines of multiply-accumulate on steroids, in play here?
If you want to talk about hardware, it's very hardware-specific. Once upon a time, there was specialized dot-product hardware. This was in the days of so-called "DOT3 bumpmapping" and the early DX8-era of shaders.
However, in order to speed up general operations, they had to take that sort of thing out. So now, for most modern hardware (i.e. anything Radeon HD-class or NVIDIA GeForce 8xxx or better; so-called DX10 or DX11 hardware), dot products do pretty much what they say they do. Each multiply/add takes up a cycle.
However, this hardware also allows for a lot of parallelism, so you could have 4 separate vec4 dot products happening simultaneously. Each one would take 4 cycles. But, as long as the results of these operations are not used in the others, they can all execute in parallel. And therefore, the four of them total would take 4 cycles.
So again, it's very complicated. And hardware-dependent.
Your best bet is to start with something that is reasonable. Then learn about the hardware you're trying to code towards, and work from there.
Nicol Bolas handled the practical answer, from the perspective of "ARB assembly" or looking at IR dumps. I'll address the question "How can 4 multiplies and 3 adds be one cycle in hardware?! That sounds impossible.".
With heavy pipelining, any instruction can be made to have a one cycle throughput, no matter how complex.
Do not confuse this with one cycle of latency!
With fully pipelined execution, an instruction can be spread out into several stages of the pipeline. All stages of the pipeline operate simultaneously.
Each cycle, the first stage accepts a new instruction, and its outputs go into the next stage. Each cycle, a result comes out the end of the pipeline.
Let's examine a 4d dot product, for a hypothetical core, with a multiply latency of 3 cycles, and an add latency of 5 cycles.
If this pipeline were laid out the worst way, with no vector parallelism, it would be 4 multiplies and 3 adds, giving 12 + 15 cycles, for a total latency of 27 cycles.
Does this mean that a dot product takes 27 cycles? Absolutely not, because it can start a new one every cycle, and it gets the answer to it 27 cycles later.
If you needed to do one dot product and had to wait for the answer, you would have to wait the full 27-cycle latency for the result. If, however, you had 1000 separate dot products to compute, it would take about 1026 cycles: for the first 26 cycles no results appear, on the 27th cycle the first result comes out the end, and after the 1000th input is issued it takes another 26 cycles for the last result to come out the end. This is what makes the dot product take "one cycle".
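To make the arithmetic concrete, here is a toy C++ model of that fill-and-drain behavior, using the purely hypothetical latencies from above (3-cycle multiply, 5-cycle add); it just computes the numbers rather than simulating real hardware:

#include <cstdio>

int main() {
    // Hypothetical latencies from the example above.
    const int mul_latency = 3, add_latency = 5;
    const int latency = 4 * mul_latency + 3 * add_latency;  // 27-cycle end-to-end latency
    const int n = 1000;                                      // independent dot products
    // One new dot product is issued per cycle; the last result drains
    // latency - 1 cycles after the last issue.
    const int total_cycles = n + latency - 1;                // 1026 cycles
    std::printf("latency = %d cycles; %d dot products finish in %d cycles (~%.3f cycles each)\n",
                latency, n, total_cycles, (double)total_cycles / n);
    return 0;
}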
Real processors distribute the work across the stages in various ways, giving more or fewer pipeline stages, so they may have completely different numbers than the ones I describe above, but the idea remains the same. Generally, the less work you do per stage, the shorter the clock cycle can become.
The key is that a vec4 can be operated on in a single instruction (see the work Intel did on 16-byte register operations, i.e. SSE, which is much of the basis for the iOS Accelerate framework).
If you start splitting and swizzling the vector apart, there will no longer be a single memory address for the vector to perform the operation on.
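As a rough illustration of the same idea on the CPU side, here is a minimal sketch using SSE intrinsics (this assumes an x86 CPU with SSE4.1, and is not how any particular GPU implements dot products): a whole vec4 lives in one 16-byte register, and the dot product is a single instruction from the programmer's point of view.

#include <immintrin.h>
#include <cstdio>

int main() {
    alignas(16) float a[4] = {1.0f, 2.0f, 3.0f, 4.0f};
    alignas(16) float b[4] = {0.25f, 0.25f, 0.25f, 0.25f};  // dot with this = average of a

    __m128 va = _mm_load_ps(a);
    __m128 vb = _mm_load_ps(b);

    // 0xF1: multiply all four lanes, sum them, put the result in lane 0.
    __m128 dp = _mm_dp_ps(va, vb, 0xF1);

    std::printf("dot = %f\n", _mm_cvtss_f32(dp));  // prints 2.5
    return 0;
}

(Compile with something like -msse4.1. Whether this is faster than four scalar adds is, again, something to measure rather than assume.)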
I'm having trouble refactoring my C++ code. The code itself is barely 200 lines, if even; however, being an image-processing affair, it loops a lot, and the roadblocks I'm encountering (I assume) deal with very gritty details (e.g. memory access).
The program produces a correct output, but is supposed to ultimately run in realtime. Initially, it took ~3 minutes per 320x240px frame, but it's at around 2 seconds now (running approximately as fast on mid-range workstation and low-end laptop hardware; red flag?). Still a far cry from 24 times per second, however. Basically, any change I make propagates through the millions of repetitions, and tracking my beginner mistakes has become exponentially more cumbersome as I approach the realtime mark.
At 2 points, the program calculates a less computationally expensive variant of Euclidean distance, called taxicab distance (the sum of absolute differences).
Now, the abridged version:
std::vector<int> positiveRows, positiveCols;
/* looping through pixels, reading values */
distance = (abs(pValues[0] - qValues[0]) + abs(pValues[1] - qValues[1]) + abs(pValues[2] - qValues[2]));
if (distance < threshold)
{
    positiveRows.push_back(row);
    positiveCols.push_back(col);
}
If I wrap the functionality, as follows:
int taxicab_dist(int Lp, int ap, int bp,
                 int Lq, int aq, int bq)
{
    return abs(Lp - Lq) + abs(ap - aq) + abs(bp - bq);
}
and call it from within the same .cpp file, there is no performance degradation. If I instead declare and define it in separate .hpp / .cpp files, I get a significant slowdown. This directly opposes what I've been told in my undergraduate courses ("including a file is the same as copy-pasting it"). The closest I've gotten to the original code's performance was by declaring the arguments const, but it still takes ~100ms longer, which my judgement says is not affordable for such a meager task. Then again, I don't see why it slows down (significantly) if I also make them const int&. Then, when I do the most sensible thing, and choose to take arrays as arguments, again I take a performance hit. I don't even dare attempt any templating shenanigans, or try making the function modify its behavior so that it accepts an arbitrary number of pairs, at least not until I understand what I've gotten myself into.
So my question is: how can I move the calculation's definition to a separate file and have it perform the same as the original solution? Furthermore, should the fact that compiler optimizations take my program from 15 seconds down to 2 be a huge red flag (bad algorithm design, not using more exotic C++ keywords / features)?
I'm guessing the main reason why I've failed to find an answer is that I don't know what this stuff is called. I've heard the term "vectorization" tossed around quite a bit in the HPC community. Would this be related to that?
If it helps in any way at all, the code in its entirety can be found here.
As Joachim Pileborg says, you should profile first. Find out where in your program most of the execution time occurs. This is the place where you should optimize.
Reserving space in vector
Vectors start out small and then reallocate as necessary. This involves allocating a larger block of memory, copying the old elements over, and finally deallocating the old memory. std::vector lets you reserve capacity up front (with reserve(), or by constructing with a size). For large vectors this can be a time saver, eliminating many reallocations.
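A minimal sketch of what that looks like for this program (the names and the 320x240 frame size are taken from the question; the worst case of one entry per pixel is an assumption):

#include <vector>

int main() {
    const int rows = 240, cols = 320;

    std::vector<int> positiveRows, positiveCols;
    positiveRows.reserve(rows * cols);  // at most one entry per pixel
    positiveCols.reserve(rows * cols);

    // ... the pixel loop pushes back row/col indices as in the original code ...
    return 0;
}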
Compiling with speed optimizations
With modern compilers, you should turn on the optimizations for high speed (e.g. -O2 or -O3 with GCC/Clang, /O2 with MSVC) and see what they can do. The compiler writers have many tricks up their sleeves and can often spot opportunities to optimize that you or I would miss.
Truth is assembly language
You will need to view the assembly language listing. If the assembly language shows only two instructions in the area you think is the bottleneck, you really can't get faster.
Loop unrolling
You may be able to get more performance by repeating the body of a for loop several times per iteration. This is called loop unrolling. In some processors, branch or jump instructions cost more execution time than data-processing instructions, and unrolling a loop reduces the number of executed branch instructions. Again, the compiler may perform this automatically when you raise the optimization level.
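Here is a minimal sketch of manual 4x unrolling (purely illustrative; an optimizing compiler will often do this, or better, on its own):

#include <cstddef>
#include <cstdio>

int sum_unrolled(const int* data, std::size_t n) {
    int s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {   // four elements handled per loop iteration,
        s0 += data[i];             // so only one branch per four additions
        s1 += data[i + 1];
        s2 += data[i + 2];
        s3 += data[i + 3];
    }
    for (; i < n; ++i)             // handle any leftover elements
        s0 += data[i];
    return s0 + s1 + s2 + s3;
}

int main() {
    int values[10] = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10};
    std::printf("sum = %d\n", sum_unrolled(values, 10));  // prints 55
    return 0;
}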
Data cache optimization
Search the web for "data cache optimization". Loading and unloading the data cache wastes time. If your data fits in the processor's data cache, it doesn't have to keep loading and unloading it (these reloads are called cache misses). Also remember to perform all of your operations on the data in one place before moving on to other operations. This reduces the likelihood of the processor reloading the cache.
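For example, a sketch of walking an image in the order it is laid out in memory (row-major storage is an assumption here), so that consecutive accesses reuse the cache lines that were just loaded:

#include <vector>

void invert(std::vector<unsigned char>& image, int rows, int cols) {
    for (int row = 0; row < rows; ++row)        // outer loop over rows
        for (int col = 0; col < cols; ++col)    // inner loop walks contiguous bytes
            image[row * cols + col] = static_cast<unsigned char>(~image[row * cols + col]);
}

int main() {
    std::vector<unsigned char> frame(240 * 320, 100);  // one 320x240 frame
    invert(frame, 240, 320);
    return 0;
}

Swapping the two loops would stride through memory, touching a different cache line on almost every access for a large image, which is exactly the pattern to avoid.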
Multi-processor computing
If your platform has more than one processor, such as a Graphics Processing Unit (GPU), you may be able to delegate some tasks to it. Be aware that you have also added time by communicating with the other processor. So for small tasks, the communications overhead may waste the time you gained by delegating.
Parallel computing
Similar to multi-processors, you can have the operating system delegate the tasks. The OS can delegate to different cores in your processor (if you have a multi-core processor) or run them in other threads. Again there is a cost: the overhead of managing the tasks or threads, and of communication.
Summary
The three rules of Optimization:
Don't
Don't
Profile
After you profile, review the area where most of the execution time is spent. That will gain you more than optimizing a section that rarely gets called. Design optimizations will generally gain you more than code optimizations. Likewise, requirement changes (such as eliminating a requirement) may gain you more than design optimizations.
After your program is working correctly and is robust, you can optimize, only if warranted. If your UI is so slow that the User can go get a cup of coffee, it is a good place to optimize. If you gain 100 milliseconds by optimizing data transfer, but your program waits 1 second for the human response, you have not gained anything. Consider this as driving really fast to a stop sign. Regardless of your speed, you still have to stop.
If you still need performance gain, search the web for "Optimizations c++", or "data optimizations" or "performance optimization".
I have a problem where I need to process a known number of threads in parallel (great), but for which each thread may have a vastly different number of internal iterations (not great). In my mind, this makes it better to do a kernel scheme like this:
__kernel void something(whatever)
{
    unsigned int glIDx = get_global_id(0);
    for (condition_from_whatever)
    {
        /* per-iteration work */
    } // alternatively, a do-while
}
where id(0) is known beforehand, rather than:
__kernel void something(whatever)
{
    unsigned int glIDx = get_global_id(0);
    unsigned int glIDy = get_global_id(1); // max "unroll dimension"

    if (glIDy_meets_condition)
        do_something();
    else
        dont_do_anything();
}
which would necessarily execute for the FULL POSSIBLE RANGE of glIDy, with no way to terminate beforehand, as per this discussion:
Killing OpenCL Kernels
I can't seem to find any specific information about the cost of dynamically sized for loops / do-while statements within kernels, though I do see them everywhere in the kernels in NVIDIA's and AMD's SDKs. I remember reading something about how the more aperiodic an intra-kernel conditional branch is, the worse the performance.
ACTUAL QUESTION:
Is there a more efficient way to deal with this on a GPU architecture than the first scheme I proposed?
I'm also open to general information about this topic.
Thanks.
I don't think there's a general answer that can be given to that question. It really depends on your problem.
However here are some considerations about this topic:
for loops / if-else statements may or may not have an impact on the performance of a kernel. The fact is that the performance cost is not at the kernel level but at the work-group level. A work-group is composed of one or more warps (NVIDIA) / wavefronts (AMD). These warps (I'll keep the NVIDIA terminology, but it's exactly the same for AMD) are executed in lock-step.
So if within a warp you have divergence because of an if-else (or a for loop with a different number of iterations per thread), the execution will be serialized. That is to say, the threads within this warp that follow the first path will do their jobs while the others idle. Once their job is finished, those threads will idle while the others start working.
Another problem arises with these statements if you need to synchronize your threads with a barrier. You'll have undefined behavior if not all of the threads hit the barrier.
Now, knowing that, and depending on your specific problem, you might be able to group your threads in such a fashion that within the work-groups there is no divergence, even though you'll have divergence between work-groups (no impact there).
Knowing also that a warp is composed of 32 threads and a wavefront of 64 (maybe not on old AMD GPUs - not sure), you could make the size of your well-organized work-groups equal to, or a multiple of, these numbers. Note that this is quite simplified, because other issues should also be taken into consideration. See for instance this question and the answer given by Chanakya.sun (maybe more digging on that topic would be nice).
In case your problem cannot be organized as just described, I'd suggest considering OpenCL on CPUs, which are quite good at dealing with branching. If I recall correctly, you'll typically have one work-item per work-group there. In that case, it's better to check the documentation from Intel and AMD for CPUs. I also very much like chapter 6 of Heterogeneous Computing with OpenCL, which explains the differences between using OCL with GPUs and with CPUs.
I like this article too. It's mainly a discussion about increasing performance for a simple reduction on the GPU (not your problem), but the last part of the article also examines performance on CPUs.
Last thing, regarding your comments on the answer provided by @Oak about the "intra-device thread queuing support", which is actually called dynamic parallelism: this feature would obviously solve your problem, but even using CUDA you'd need a device with compute capability 3.5 or higher. So even NVIDIA GPUs with the Kepler GK104 architecture don't support it (compute capability 3.0). For OCL, dynamic parallelism is part of standard version 2.0 (as far as I know there is no implementation yet).
I like the 2nd version more, since the for loop inserts a false dependency between iterations. If the inner iterations are independent, send each one to a different work item and let the OpenCL implementation sort out how best to run them.
Two caveats:
If the average number of iterations is significantly lower than the max number of iterations, this might not be worth the extra dummy work items.
You will have a lot more work items and you still need to calculate the condition for each... if calculating the condition is complicated this might not be a good idea.
Alternatively, you can flatten the indices into the x dimension, group all the iterations into the same work-group, then calculate the condition just once per workgroup and use local memory + barriers to sync it.
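Here is a toy host-side C++ model of that flattening (hypothetical names; in the kernel you would use get_group_id/get_local_id instead of the loop arithmetic). The point is that all iterations belonging to one task land in the same work-group, so the per-task condition only has to be evaluated once per group and shared via local memory:

#include <cstdio>

int main() {
    const int numTasks = 4;
    const int maxIterations = 8;   // would be the work-group size in the flattened scheme

    for (int gid = 0; gid < numTasks * maxIterations; ++gid) {
        int task      = gid / maxIterations;   // plays the role of get_group_id(0)
        int iteration = gid % maxIterations;   // plays the role of get_local_id(0)
        std::printf("work-item %2d -> task %d, iteration %d\n", gid, task, iteration);
    }
    return 0;
}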
How do I measure FLOPS or IOPS? If I measure the time for an ordinary floating point addition / multiplication, is that equivalent to FLOPS?
FLOPS is floating point operations per second. To measure FLOPS you first need code that performs such operations. If you have such code, what you can measure is its execution time. You also need to sum up or estimate (not measure!) all the floating point operations it performs and divide that by the measured wall time. You should count all ordinary operations: additions, subtractions, multiplications, divisions (yes, even though divisions are slower and better avoided, they are still FLOPs). Be careful how you count! What you see in your source code is most likely not what the compiler produces after all the optimisations. To be sure, you will likely have to look at the assembly.
FLOPS is not the same as operations per second. So even though some architectures have a single MAD (multiply-and-add) instruction, it still counts as two FLOPs. The same goes for SSE instructions: you count them as one instruction, though they perform more than one FLOP.
FLOPS are not entirely meaningless, but you need to be careful when comparing your FLOPS to somebody else's FLOPS, especially the hardware vendors'. E.g. NVIDIA gives the peak FLOPS performance of their cards assuming MAD operations, so unless your code has those, you will never reach this peak. Either rethink the algorithm, or scale the peak hardware FLOPS by a correction factor, which you need to figure out for your own algorithm! E.g., if your code only performs multiplications, you would divide the peak by 2. Counting right might take your code from "suboptimal" to "quite efficient" without changing a single line of code.
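As a minimal sketch of the approach described above (count the FLOPs yourself, then divide by the measured wall time; the caveat about checking what the compiler really emits still applies):

#include <chrono>
#include <cstdio>
#include <vector>

int main() {
    const std::size_t n = 10000000;   // elements
    const int repeats = 20;
    std::vector<float> x(n, 1.0f), y(n, 2.0f);
    const float a = 0.5f;

    auto t0 = std::chrono::steady_clock::now();
    for (int r = 0; r < repeats; ++r)
        for (std::size_t i = 0; i < n; ++i)
            y[i] = a * x[i] + y[i];            // 1 multiply + 1 add = 2 FLOPs
    auto t1 = std::chrono::steady_clock::now();

    const double seconds = std::chrono::duration<double>(t1 - t0).count();
    const double flop = 2.0 * double(n) * repeats;   // counted, not measured
    std::printf("%.3f s, ~%.2f GFLOPS (y[0] = %f)\n", seconds, flop / seconds / 1e9, y[0]);
    return 0;
}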
You can use the CPU performance counters to get the CPU to itself count the number of floating point operations it executes for your particular program. Then it is a simple matter of dividing this by the run time. On Linux the perf tools allow this to be done very easily; I have a write-up on the details on my blog here:
http://www.bnikolic.co.uk/blog/hpc-howto-measure-flops.html
FLOPs are not well defined: mul FLOPS are different from add FLOPS. You have to either come up with your own definition or take the definition from a well-known benchmark.
Usually you use some well-known benchmark. Things like MIPS and megaFLOPS don't mean much to start with, and if you don't restrict them to specific benchmarks, even that tiny bit of meaning is lost.
Typically, for example, integer speed will be quoted in "Dhrystone MIPS" and floating point in "Linpack megaFLOPS". In these, "Dhrystone" and "Linpack" are the names of the benchmarks used to do the measurements.
IOPS are I/O operations. They're much the same, though in this case, there's not quite as much agreement about which benchmark(s) to use (though SPC-1 seems fairly popular).
This is a highly architecture-specific question. For a naive/basic start, I would recommend finding out how many operations one multiplication takes on your specific hardware, then doing a large matrix multiplication and seeing how long it takes. From that you can easily estimate the FLOPS of your particular hardware.
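A rough sketch of that matrix-multiplication estimate: an N x N x N triple loop performs about 2*N^3 floating point operations (one multiply and one add per innermost iteration), so dividing that count by the elapsed time gives a naive FLOPS figure for your hardware (well below peak, since this loop is neither vectorized nor cache-blocked):

#include <chrono>
#include <cstdio>
#include <vector>

int main() {
    const int N = 512;
    std::vector<double> A(N * N, 1.0), B(N * N, 2.0), C(N * N, 0.0);

    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < N; ++i)
        for (int k = 0; k < N; ++k)          // i-k-j order for better locality
            for (int j = 0; j < N; ++j)
                C[i * N + j] += A[i * N + k] * B[k * N + j];
    auto t1 = std::chrono::steady_clock::now();

    const double seconds = std::chrono::duration<double>(t1 - t0).count();
    const double flop = 2.0 * N * N * N;
    std::printf("%.3f s, ~%.2f GFLOPS (C[0] = %f)\n", seconds, flop / seconds / 1e9, C[0]);
    return 0;
}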
The industry standard for measuring FLOPS is the well-known LINPACK benchmark, or HPL (High-Performance Linpack); try looking at the source or running those yourself.
I would also refer to this answer as an excellent reference.
It is well known that the processor instruction for multiplication takes several times more time than addition, and division is even worse (UPD: which is not true any more, see below). What about more complex operations like the exponential? How expensive are they?
Motivation. I am interested because it would help in algorithm design to estimate performance-critical parts of algorithms at an early stage. Suppose I want to apply a set of filters to an image. One of them operates on the 3×3 neighborhood of each pixel, sums the values, and takes atan. Another one sums more neighbouring pixels, but does not use complicated functions. Which one would execute longer?
So, ideally I want approximate relative execution times for elementary operations, like "multiplication typically takes 5 times longer than addition" or "the exponential is about 100 multiplications". Of course, it's a matter of orders of magnitude, not exact values. I understand that it depends on the hardware and on the arguments, so let's say we measure the average time (in some sense) for floating-point operations on modern x86/x64. For operations that are not implemented in hardware, I am interested in the typical running time for the C++ standard library.
Have you seen any sources where such a thing was analyzed? Does this question make sense at all? Or can no rules of thumb like this be applied in practice?
First off, let's be clear. This:
It is well-known that processor instruction for multiplication takes
several times more time than addition
is no longer true in general. It hasn't been true for many, many years, and needs to stop being repeated. On most common architectures, integer multiplies are a couple cycles and integer adds are single-cycle; floating-point adds and multiplies tend to have nearly equal timing characteristics (typically around 4-6 cycles latency, with single-cycle throughput).
Now, to your actual question: it varies with both the architecture and the implementation. On a recent architecture, with a well written math library, simple elementary functions like exp and log usually require a few tens of cycles (20-50 cycles is a reasonable back-of-the-envelope figure). With a lower-quality library, you will sometimes see these operations require a few hundred cycles.
For more complicated functions, like pow, typical timings range from high tens into the hundreds of cycles.
You shouldn't be concerned about this. If I tell you that a typical C library implementation of the transcendental functions tends to take around 10 to 50 times a single floating point addition/multiplication, or around 5 times a floating point division, this wouldn't be useful to you.
Indeed, the way your processor schedules memory accesses will interfere badly with any premature optimization you'd do.
If after profiling you find that a particular implementation using transcendental functions is too slow, you can contemplate setting up a polynomial interpolation scheme. This will include a table and therefore will incur extra cache issues, so make sure to measure and not guess.
This will likely involve Chebyshev approximation. Read up on it; it is a particularly useful technique in this kind of domain.
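A minimal sketch of the table-based idea (this is plain linear interpolation, not a tuned Chebyshev fit; the [0, 8] input range and the table size are arbitrary assumptions for illustration, and both accuracy and cache behavior must be measured, not guessed):

#include <cmath>
#include <cstdio>
#include <vector>

class AtanTable {
public:
    AtanTable(double xmax, std::size_t n) : xmax_(xmax), table_(n + 1) {
        for (std::size_t i = 0; i <= n; ++i)
            table_[i] = std::atan(xmax * double(i) / double(n));
    }
    double operator()(double x) const {        // assumes 0 <= x <= xmax
        double t = x / xmax_ * double(table_.size() - 1);
        std::size_t i = static_cast<std::size_t>(t);
        if (i >= table_.size() - 1) return table_.back();
        double frac = t - double(i);
        return table_[i] * (1.0 - frac) + table_[i + 1] * frac;
    }
private:
    double xmax_;
    std::vector<double> table_;
};

int main() {
    AtanTable approx(8.0, 1024);
    std::printf("atan(1.5) ~ %f (exact %f)\n", approx(1.5), std::atan(1.5));
    return 0;
}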
I have been told that compilers are quite bad at optimizing floating point code. You may want to write custom assembly code.
Also, Intel Integrated Performance Primitives (if you are on an Intel CPU) is something good to have if you are ready to trade some accuracy for speed.
You could always start a second thread and time the operations. Most elementary operations don't differ that much in execution time; the big difference is how many times they are executed. The O(n) complexity is generally what you should be thinking about.
I'm currently profiling an implementation of binary search. Using some special instructions to measure this I noticed that the code has about a 20% misprediction rate. I'm curious if there is any way to check how many cycles I'm potentially losing due to this. It's a MIPS based architecture.
You're losing 0.2 * N cycles per iteration, where N is the number of cycles that it takes to flush the pipelines after a mispredicted branch. Suppose N = 10; then that means you are losing 2 clock cycles per iteration on aggregate. Unless you have a very small inner loop, this is probably not going to be a significant performance hit.
Look it up in the docs for your CPU. If you can't find this information specifically, the length of the CPU's pipeline is a fairly good estimate.
Given that it's MIPS and it's a 300MHz system, I'm going to guess that it's a fairly short pipeline. Probably 4-5 stages, so a cost of 3-4 cycles per mispredict is probably a reasonable guess.
On an in-order CPU you may be able to calculate the approximate misprediction cost as the product of the number of mispredictions and the cost per misprediction (which is generally a function of some part of the pipeline length).
On a modern out-of-order CPU, however, such a general calculation is usually not possible. There may be a large number of instructions in flight1, only some of which are flushed by a misprediction. The surrounding code may be latency bound by one or more chains of dependent instructions, or it may be throughput bound on resources like execution units, renaming throughput, etc, or it may be somewhere in-between.
On such a core, the penalty per misprediction is very difficult to determine, even with the help of performance counters. You can find entire papers dedicated to the topic: that one found a penalty ranging from 9 to 35 cycles averaged across entire benchmarks; if you look at some small piece of code the range will be even larger: a penalty of zero is easy to demonstrate, and you could create a scenario where the penalty is in the hundreds of cycles.
Where does that leave you, just trying to determine the misprediction cost in your binary search? Well, a simple approach is just to control the number of mispredictions and measure the difference! If you set up your benchmark inputs to have a range of behavior, starting with always following the same branch pattern all the way to having a random pattern, you can plot the misprediction count versus the runtime degradation. If you do, share your results!
1Hundreds of instructions in-flight in the case of modern big cores such as those offered by the x86, ARM and POWER architectures.
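As a concrete (non-MIPS-specific) illustration of the "control the mispredictions and measure" approach suggested above, here is a minimal C++ sketch: the same loop is timed with an unpredictable branch (random data) and a predictable one (sorted data), and the runtime difference divided by the number of mispredicted branches approximates the per-misprediction cost:

#include <algorithm>
#include <chrono>
#include <cstdio>
#include <random>
#include <vector>

static double time_sum(const std::vector<int>& v) {
    auto t0 = std::chrono::steady_clock::now();
    long long sum = 0;
    for (int x : v)
        if (x >= 128)                    // the branch being stressed
            sum += x;
    auto t1 = std::chrono::steady_clock::now();
    std::printf("sum = %lld\n", sum);    // keep the work from being optimized out
    return std::chrono::duration<double>(t1 - t0).count();
}

int main() {
    std::vector<int> data(1 << 24);
    std::mt19937 rng(42);
    for (int& x : data) x = static_cast<int>(rng() % 256);

    double t_random = time_sum(data);    // roughly 50% of these branches mispredict
    std::sort(data.begin(), data.end());
    double t_sorted = time_sum(data);    // the branch becomes almost perfectly predicted

    std::printf("random: %.3f s, sorted: %.3f s, difference: %.3f s\n",
                t_random, t_sorted, t_random - t_sorted);
    return 0;
}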
Look at your specs for that info, and if that fails, run it a billion times and time it externally to your program (a stopwatch or something). Then run it without a miss and compare.