setting and checking a 32 bit variable - c++

I am wondering whether checking a 32-bit variable before setting it will be faster than just setting it. E.g., variable a is of type uint32:
if (a != 0)
{
    a = 0;
}
or
a = 0;
The code will run inside a loop that executes many times, so I want to reduce its running time.
Note that variable a will be 0 most of the time, so the question can probably be shortened to: is it faster to check a 32-bit variable or to set it? Thank you in advance!
Edit: Thank you to everyone who commented on the question. I created a for loop and tested both plain assignment and check-then-assign 100 thousand times. It turns out assignment is faster (54 ms for check-then-assign and 44 ms for plain assignment).
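For reference, a test along those lines might look like this (a minimal sketch, not the original code; it uses std::chrono for timing and a volatile variable so the compiler cannot remove the stores):

#include <chrono>
#include <cstdint>
#include <cstdio>

int main()
{
    volatile uint32_t a = 0;          // volatile: keep the loads/stores in the generated code
    const int iterations = 100000;

    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < iterations; ++i) {
        if (a != 0)                   // check-then-set variant
            a = 0;
    }
    auto t1 = std::chrono::steady_clock::now();
    for (int i = 0; i < iterations; ++i) {
        a = 0;                        // plain-set variant
    }
    auto t2 = std::chrono::steady_clock::now();

    std::printf("check-then-set: %lld us, plain set: %lld us\n",
        (long long)std::chrono::duration_cast<std::chrono::microseconds>(t1 - t0).count(),
        (long long)std::chrono::duration_cast<std::chrono::microseconds>(t2 - t1).count());
}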

What you describe is called a "silent store" optimization.
PRO: unnecessary stores are avoided.
This can reduce pressure on the store-to-load forwarding buffers, a component of a modern out-of-order CPU that is quite expensive in hardware and, as a result, is often undersized and therefore a performance bottleneck. On Intel x86 CPUs there are Event Monitoring (EMON) performance counters that you can use to investigate whether this is a problem in your program.
Interestingly, it can also reduce the number of loads that your program does. First, SW: if the stores are not eliminated, the compiler may be unable to prove that they do not write to the memory occupied by a different variable (the so-called address and pointer disambiguation problem), so the compiler may generate unnecessary reloads of such possibly-but-not-actually conflicting memory locations. Eliminate the stores, and some of these loads may also be eliminated. Second, HW: most modern CPUs have store-to-load dependency predictors, and fewer stores increase their accuracy. If a dependency is predicted, the load may not actually be performed by hardware, and may be converted into a register-to-register move. This was the subject of the recent patent lawsuits that the University of Wisconsin asserted against Intel and Apple, with awards exceeding hundreds of millions of dollars.
But the most important reason to eliminate the unnecessary stores is to avoid unnecessarily dirtying the cache. A dirty cache line eventually has to be written back to memory, even if unchanged, wasting power. In many systems it will eventually be written to flash or SSD, wasting power and consuming the limited write cycles of the device.
These considerations have motivated academic research on silent stores, such as http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.28.8947&rep=rep1&type=pdf. However, a quick Google Scholar search shows these papers are mainly from 2000-2004, and I am aware of no modern CPUs implementing true silent store elimination - actually having hardware read the old value. I suspect, however, that this lack of deployment of silent stores is mainly because CPU design went on pause for more than a decade as focus changed from desktop PCs to cell phones. Now that cell phones are almost caught up to the sophistication of 2000-era desktop CPUs, it may arise again.
CON: Eliminating the silent store in software takes more instructions. Worse, it takes a branch. If the branch is not very predictable, the resulting branch mispredictions will consume any savings. Some machines have instructions that allow you to eliminate such stores without a branch: e.g. Intel's LRBNI vector store instructions with a conditional vector mask; I believe AVX has these instructions. If you or your compiler can use such instructions, then the cost is just the load of the old value and a vector compare; if the old value is already in a register, then just the compare.
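For illustration, a masked store of this kind might look roughly like the following with AVX2 intrinsics (a sketch of my own, not from the answer; note that the 32-bit-integer maskstore form is AVX2, and whether a fully masked-off store actually avoids dirtying the cache line depends on the microarchitecture):

#include <immintrin.h>

// Zero only the non-zero elements among 8 ints; lanes that are already zero are masked off.
void zero_nonzero_avx2(int* p)
{
    __m256i v    = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(p));
    __m256i zero = _mm256_setzero_si256();
    __m256i eq   = _mm256_cmpeq_epi32(v, zero);              // all-ones where v == 0
    __m256i mask = _mm256_xor_si256(eq, _mm256_set1_epi32(-1)); // all-ones where v != 0
    _mm256_maskstore_epi32(p, mask, zero);                    // store 0 only to non-zero lanes
}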
By the way, you can get some benefit without completely eliminating the store, by redirecting it to a safe address. Instead of
if a[i] != 0 then a[i] := 0
do
ptr := a + i; if *ptr == 0 then ptr := &safe; *ptr := 0
You are still doing the store, but not dirtying so many cache lines. I have used this way of faking a conditional store instruction a lot. It is very unlikely that a compiler will do this sort of optimization for you.
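In C++ the trick could be sketched like this (illustrative names only; a compiler may well turn the if into a conditional move, so there is no hard-to-predict branch):

#include <cstddef>
#include <cstdint>

static uint32_t safe_sink;                 // dummy target for the "silent" stores

void clear_nonzero(uint32_t* a, std::size_t n)
{
    for (std::size_t i = 0; i < n; ++i) {
        uint32_t* ptr = &a[i];
        if (*ptr == 0)                     // old value already 0: redirect the store
            ptr = &safe_sink;
        *ptr = 0;                          // unconditional store, possibly to the dummy
    }
}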
So, unfortunately, the answer is "it depends". If you are on a vector-mask machine or a GPU, and the silent stores are very common - say, more than 30% - it is worth thinking about. In scalar code you probably need more like 90% silent.
Ideally, measure it yourself. Although it can be hard to make realistic measurements.
I would start with what is probably the best case for this optimization:
char a[1024*1024*1024]; // zero filled
const int cachelinesize = 64;
for (char* p = a; p < a + sizeof(a); p += cachelinesize)
    if (*p != 0)
        *p = 0;
Every store is eliminated here - make sure that the compiler still emits the loads and compares rather than optimizing the loop away. Branch prediction is good, etc.
If this limit case shows no benefit, your realistic code is unlikely to.
Come to think of it, I ran such a benchmark back in the last century. The silent store code was 2x faster, since it was totally memory bound and the silent stores generate no dirty cache lines on a write-back cache. Recheck this, and then try it on a more realistic workload.
But first, measure whether you are memory bottlenecked or not.
By the way: if hardware implementations of silent store elimination become common, then you will never want to do it in software.
But at the moment I am aware of no hardware implementations of silent store elimination in commercially available CPUs.
As ECC becomes more common, silent store elimination becomes almost free - since you have to read the old bytes anyway to recalculate ECC in many cases.

The assignment alone would serve you better: firstly, the if statement is redundant, and omitting it makes the code clearer; secondly, the plain assignment should be faster. Even if you are not quite sure of that, you can simply write a small test function and time it with and without the if statement.

Related

Why are the relative performance results in Google Benchmark completely different from raw loops? [duplicate]

I am evaluating a network+rendering workload for my project.
The program continuously runs a main loop:
while (true) {
    doSomething()
    drawSomething()
    doSomething2()
    sendSomething()
}
The main loop runs more than 60 times per second.
I want to see the performance breakdown, how much time each procedure takes.
My concern is that if I print the time interval at every entry and exit of each procedure, it would incur a huge performance overhead.
I am curious what an idiomatic way of measuring the performance is.
Is printing or logging good enough?
Generally: For repeated short things, you can just time the whole repeat loop. (But microbenchmarking is hard; easy to distort results unless you understand the implications of doing that; for very short things, throughput and latency are different, so measure both separately by making one iteration use the result of the previous or not. Also beware that branch prediction and caching can make something look fast in a microbenchmark when it would actually be costly if done one at a time between other work in a larger program.
e.g. loop unrolling and lookup tables often look good because there's no pressure on I-cache or D-cache from anything else.)
Or if you insist on timing each separate iteration, record the results in an array and print later; you don't want to invoke heavy-weight printing code inside your loop.
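One way to do that (a sketch, not a specific library's API; the measured call is a placeholder) is to stamp the section with std::chrono inside the loop, push the durations into a pre-reserved vector, and only print the summary once the loop is done:

#include <chrono>
#include <cstdio>
#include <vector>

using clk = std::chrono::steady_clock;

int main()
{
    std::vector<clk::duration> draw_times;
    draw_times.reserve(10000);             // avoid reallocating inside the loop

    for (int frame = 0; frame < 10000; ++frame) {
        auto t0 = clk::now();
        // drawSomething();                // the procedure being measured (placeholder)
        auto t1 = clk::now();
        draw_times.push_back(t1 - t0);
    }

    clk::duration total = clk::duration::zero();
    for (auto d : draw_times) total += d;  // summarize after the loop, outside the hot path
    std::printf("drawSomething avg: %lld ns\n",
        (long long)std::chrono::duration_cast<std::chrono::nanoseconds>(total).count()
            / (long long)draw_times.size());
}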
This question is way too broad to say anything more specific.
Many languages have benchmarking packages that will help you write microbenchmarks of a single function. Use them. e.g. for Java, JMH makes sure the function under test is warmed up and fully optimized by the JIT, and all that jazz, before doing timed runs. And runs it for a specified interval, counting how many iterations it completes. See How do I write a correct micro-benchmark in Java? for that and more.
Beware common microbenchmark pitfalls
Failure to warm up code / data caches and stuff: page faults within the timed region for touching new memory, or code / data cache misses, that wouldn't be part of normal operation. (Example of noticing this effect: Performance: memset; or example of a wrong conclusion based on this mistake)
Never-written memory (obtained fresh from the kernel) gets all its pages copy-on-write mapped to the same system-wide physical page (4K or 2M) of zeros if you read without writing, at least on Linux. So you can get cache hits but TLB misses. e.g. A large allocation from new / calloc / malloc, or a zero-initialized array in static storage in .bss. Use a non-zero initializer or memset.
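A small illustrative sketch of that warm-up (my own example, not from the answer): write to the freshly allocated buffer before the timed region so every page is backed by distinct physical memory instead of the shared zero page.

#include <cstddef>
#include <cstring>

char* make_warm_buffer(std::size_t n)     // caller frees with delete[]
{
    char* buf = new char[n];              // fresh, never-written pages
    std::memset(buf, 1, n);               // non-zero warm-up write: forces real physical pages
    return buf;                           // timed benchmark can now read/write buf
}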
Failure to give the CPU time to ramp up to max turbo: modern CPUs clock down to idle speeds to save power, only clocking up after a few milliseconds. (Or longer depending on the OS / HW).
related: on modern x86, RDTSC counts reference cycles, not core clock cycles, so it's subject to the same CPU-frequency variation effects as wall-clock time.
Most integer and FP arithmetic asm instructions (except divide and square root which are already slower than others) have performance (latency and throughput) that doesn't depend on the actual data. Except for subnormal aka denormal floating point being very slow, and in some cases (e.g. legacy x87 but not SSE2) also producing NaN or Inf can be slow.
On modern CPUs with out-of-order execution, some things are too short to truly time meaningfully, see also this. Performance of a tiny block of assembly language (e.g. generated by a compiler for one function) can't be characterized by a single number, even if it doesn't branch or access memory (so no chance of mispredict or cache miss). It has a latency from inputs to outputs, and a throughput (if run repeatedly with independent inputs) that is higher. e.g. an add instruction on a Skylake CPU has 4/clock throughput, but 1 cycle latency. So dummy = foo(x) can be 4x faster than x = foo(x); in a loop. Floating-point instructions have higher latency than integer, so it's often a bigger deal. Memory access is also pipelined on most CPUs, so looping over an array (address for next load easy to calculate) is often much faster than walking a linked list (address for next load isn't available until the previous load completes).
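A tiny sketch of that latency-vs-throughput distinction (illustrative, not from the answer): the first loop is one dependency chain, so it is bound by add latency; the second uses independent accumulators, so it is bound by throughput and is typically several times faster per element on an out-of-order CPU.

double chain(const double* a, int n)
{
    double x = 0.0;
    for (int i = 0; i < n; ++i)
        x = x + a[i];                      // each add must wait for the previous one
    return x;
}

double independent(const double* a, int n) // tail handling omitted for brevity
{
    double s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    for (int i = 0; i + 3 < n; i += 4) {   // four independent chains run in parallel
        s0 += a[i];     s1 += a[i + 1];
        s2 += a[i + 2]; s3 += a[i + 3];
    }
    return s0 + s1 + s2 + s3;
}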
Obviously performance can differ between CPUs; in the big picture usually it's rare for version A to be faster on Intel, version B to be faster on AMD, but that can easily happen in the small scale. When reporting / recording benchmark numbers, always note what CPU you tested on.
Related to the above and below points: you can't "benchmark the * operator" in C in general, for example. Some use-cases for it will compile very differently from others, e.g. tmp = foo * i; in a loop can often turn into tmp += foo (strength reduction), or if the multiplier is a constant power of 2 the compiler will just use a shift. The same operator in the source can compile to very different instructions, depending on surrounding code.
You need to compile with optimization enabled, but you also need to stop the compiler from optimizing away the work, or hoisting it out of a loop. Make sure you use the result (e.g. print it or store it to a volatile) so the compiler has to produce it. For an array, volatile double sink = output[argc]; is a useful trick: the compiler doesn't know the value of argc so it has to generate the whole array, but you don't need to read the whole array or even call an RNG function. (Unless the compiler aggressively transforms to only calculate the one output selected by argc, but that tends not to be a problem in practice.)
For inputs, use a random number or argc or something instead of a compile-time constant so your compiler can't do constant-propagation for things that won't be constants in your real use-case. In C you can sometimes use inline asm or volatile for this, e.g. the stuff this question is asking about. A good benchmarking package like Google Benchmark will include functions for this.
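For example, a minimal sketch of the volatile-sink and empty-asm tricks mentioned above (gcc/clang extended-asm syntax; the function under test is a placeholder of my own):

static int work(int x) { return x * 41 + 7; }   // placeholder for the code under test

int main(int argc, char**)
{
    int input = argc;                      // runtime value: blocks constant propagation
    int result = 0;
    for (int i = 0; i < 1000000; ++i) {
        result += work(input + i);
        asm volatile("" : "+r"(result));   // pretend result is read/written: keeps the work
    }
    volatile int sink = result;            // force the final value to be materialized
    (void)sink;
    return 0;
}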
If the real use-case for a function lets it inline into callers where some inputs are constant, or the operations can be optimized into other work, it's not very useful to benchmark it on its own.
Big complicated functions with special handling for lots of special cases can look fast in a microbenchmark when you run them repeatedly, especially with the same input every time. In real life use-cases, branch prediction often won't be primed for that function with that input. Also, a massively unrolled loop can look good in a microbenchmark, but in real life it slows everything else down with its big instruction-cache footprint leading to eviction of other code.
Related to that last point: Don't tune only for huge inputs, if the real use-case for a function includes a lot of small inputs. e.g. a memcpy implementation that's great for huge inputs but takes too long to figure out which strategy to use for small inputs might not be good. It's a tradeoff; make sure it's good enough for large inputs (for an appropriate definition of "enough"), but also keep overhead low for small inputs.
Litmus tests:
If you're benchmarking two functions in one program: if reversing the order of testing changes the results, your benchmark isn't fair. e.g. function A might only look slow because you're testing it first, with insufficient warm-up. example: Why is std::vector slower than an array? (it's not, whichever loop runs first has to pay for all the page faults and cache misses; the 2nd just zooms through filling the same memory.)
Increasing the iteration count of a repeat loop should linearly increase the total time, and not affect the calculated time-per-call. If not, then you have non-negligible measurement overhead or your code optimized away (e.g. hoisted out of the loop and runs only once instead of N times).
Vary other test parameters as a sanity check.
For C / C++, see also Simple for() loop benchmark takes the same time with any loop bound where I went into some more detail about microbenchmarking and using volatile or asm to stop important work from optimizing away with gcc/clang.


Flushing the cache to prevent benchmarking fluctuations

I am running someone else's C++ code to do benchmarking on a dataset. The issue I have is that I often get one timing for the first run, and the numbers change massively (e.g. from 28 seconds to 10 seconds) if I run the same code again. I assume this happens due to the CPU's automatic caching. Is there a way to flush the cache, or to prevent these fluctuations somehow?
Not one that works "for everything, everywhere". Most processors have special instructions to flush the cache, but they are often privileged instructions, so it has to be done from inside the OS kernel, not your user-mode code. And of course, it's completely different instructions for each processor architecture.
All current x86 processors do have a clflush instruction that flushes one cache line, but to use it you have to have the address of the data (or code) you want to flush. That is fine for small and simple data structures, not so good if you have a binary tree that is spread all over the place. And of course it is not at all portable.
In most environments, reading and writing a large block of other data will do the trick, e.g. something like:
// Global variables.
const size_t bigger_than_cachesize = 10 * 1024 * 1024;
long *p = new long[bigger_than_cachesize];
...
// When you want to "flush" the cache.
for (size_t i = 0; i < bigger_than_cachesize; i++)
{
    p[i] = rand();
}
Using rand will be much slower than filling with something constant/known. But the compiler can't optimise the call away, which means it's (almost) guaranteed that the code will stay.
The above won't flush instruction caches - that is a lot more difficult to do; basically, you have to run some (large enough) other piece of code to do it reliably. However, instruction caches tend to have less effect on overall benchmark performance (the instruction cache is EXTREMELY important for a modern processor's performance, that's not what I'm saying; rather, the code for a benchmark is typically small enough that it all fits in cache, and the benchmark runs over the same code many times, so it's only slower on the first iteration).
Other ideas
Another way to simulate "non-cached" behaviour is to allocate a new area for each benchmark pass - in other words, not freeing the memory until the end of the benchmark, or using an array of input data and output results such that each run has its own set of data to work on.
Further, it's common to actually measure the performance of the "hot runs" of a benchmark, not the first "cold run" where the caches are empty. This does of course depend on what you are actually trying to achieve...
Here's my basic approach:
1. Allocate a memory region 2x the size of the LLC, if you can determine the LLC size dynamically (or you know it statically), or, if you don't, some reasonable multiple of the largest LLC size on the platform of interest1.
2. memset the memory region to some non-zero value: 1 will do just fine.
3. "Sink" the pointer somewhere so that the compiler can't optimize out the stuff above or below (writing to a volatile global works pretty much 100% of the time).
4. Read from random indexes in the region until you've touched each cache line an average of 10 times or so (accumulate the read values into a sum that you sink in a similar way to (3)).
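A rough sketch of those four steps in code (my own illustration; the LLC size and line size are hard-coded assumptions here rather than detected at runtime):

#include <cstddef>
#include <cstdint>
#include <cstring>
#include <random>
#include <vector>

volatile std::uintptr_t sink_ptr;   // "sink" for the pointer (step 3)
volatile std::uint64_t  sink_sum;   // "sink" for the accumulated reads (step 4)

void flush_caches()
{
    const std::size_t llc_size = 32 * 1024 * 1024;            // assumed LLC size
    const std::size_t region   = 2 * llc_size;                // step 1: 2x the LLC
    std::vector<std::uint8_t> buf(region);
    std::memset(buf.data(), 1, region);                       // step 2: non-zero memset
    sink_ptr = reinterpret_cast<std::uintptr_t>(buf.data());  // step 3: sink the pointer

    std::mt19937_64 rng(12345);
    const std::size_t line    = 64;                            // assumed cache line size
    const std::size_t lines   = region / line;
    const std::size_t touches = 10 * lines;                    // ~10 touches per line on average
    std::uint64_t sum = 0;
    for (std::size_t i = 0; i < touches; ++i)
        sum += buf[(rng() % lines) * line];                    // step 4: random cache-line reads
    sink_sum = sum;                                            // sink the sum
}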
Here are some notes on why this generally works and why doing less may not work - the details are x86-centric, but similar concerns will apply on many other architectures.
You absolutely want to write to the allocated memory (step 2) before you begin your main read-only flushing loop, since otherwise you might just be repeatedly reading from the same small zero-mapped page returned by the OS to satisfy your memory allocation.
You want to use a region considerably larger than the LLC size, since the outer cache levels are typically physically addressed, but you can only allocate and access virtual addresses. If you just allocate an LLC-sized region, you generally won't get full coverage of all the ways of every cache set: some sets will be over-represented (and so will be fully flushed), while other sets will be under-represented, so not all existing values can be flushed by accessing this region of memory. A 2x over-allocation makes it highly likely that almost all sets have enough representation.
You want to avoid the optimizer doing clever things, such as noting the memory never escapes the function and eliminating all your reads and writes.
You want to iterate randomly around the memory region, rather than just striding through it linearly: some designs like the LLC on recent Intel detect when a "streaming" pattern is present, and switch from LRU to MRU since LRU is about the worst-possible replacement policy for such a load. The effect is that no matter how many times you stream though memory, some "old" lines from before your efforts can remain in the cache. Randomly accessing memory defeats this behavior.
You want to access more than just LLC amount of memory for (a) the same reason you allocate more than the LLC size (virtual access vs physical caching) and (b) because random access needs more accesses before you have a high likelihood of hitting every set enough times (c) caches are usually only pseudo-LRU, so you need more than the number of accesses you'd expect under exact-LRU to flush out every line.
Even this is not foolproof. Other hardware optimizations or caching behaviors not considered above could cause this approach to fail. You might get very unlucky with the page allocation provided by the OS and not be able to reach all the pages (you can largely mitigate this by using 2MB pages). I highly recommend testing whether your flush technique is adequate: one approach is to measure the number of cache misses using CPU performance counters while running your benchmark and see if the number makes sense based on the known working-set size2.
Note that this leaves all levels of the cache with lines in E (exclusive) or perhaps S (shared) state, and not the M (modified) state. This means that these lines don't need to be evicted to other cache levels when they are replaced by accesses in your benchmark: they can simply be dropped. The approach described in the other answer will leave most/all lines in the M state, so you'll initially have 1 line of eviction traffic for every line you access in your benchmark. You can achieve the same behavior with my recipe above by changing step 4 to write rather than read.
In that regard, neither approach here is inherently "better" than the other: in the real world the cache levels will have a mix of modified and not-modified lines, while these approaches leave the cache at the two extremes of the continuum. In principle you could benchmark with both the all-M and no-M states and see if it matters much: if it does, you can try to evaluate what the real-world state of the cache will usually be and replicate that.
1 Remember that LLC sizes are growing almost every CPU generation (mostly because core counts are increasing), so you want to leave some room for growth if this needs to be future-proof.
2 I just throw that out there as if it were "easy", but in reality it may be very difficult depending on your exact problem.

Runtime performance (speed) optimization -- Cache size consideration

What are the basic tips and tricks that a C++ programmer should know when trying to optimize his code in the context of Caching?
Here's something to think about:
Here's something to think about: for instance, I know that reducing a function's footprint would make the code run a bit faster, since you would have fewer instructions overall in the processor's instruction cache.
When trying to allocate an std::array<char, <size>>, what would be the ideal size that would make reads and writes to the array faster?
How big can an object be to decide to put it on the heap instead of the stack?
In most cases, knowing the correct answer to your question will gain you less than 1% overall performance.
Some (data-)cache optimizations that come to my mind are:
For arrays: use less RAM. Try shorter data types or a simple compression algorithm like RLE. This can also save CPU at the same time, or, conversely, waste CPU cycles on data type conversions. Especially floating point to integer conversions can be quite expensive.
Avoid access to the same cacheline (usually around 64 bytes) from different threads, unless all access is read-only (see the sketch after this list).
Group members that are often used together next to each other. Prefer sequential access to random access.
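A small sketch of the false-sharing point (illustrative names of my own): per-thread counters packed next to each other share one cache line and ping-pong between cores; padding and aligning each counter to its own 64-byte line avoids that.

#include <cstdint>

struct PackedCounters {
    std::uint64_t count[4];                // all four counters share one cache line
};

struct alignas(64) PaddedCounter {         // one counter per 64-byte cache line
    std::uint64_t value;
    char pad[64 - sizeof(std::uint64_t)];  // explicit padding to fill the line
};

PaddedCounter per_thread_counter[4];       // each thread increments only its own element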
If you really want to know all about caches, read What Every Programmer Should Know About Memory. While I disagree with the title, it's a great in-depth document.
Because your question suggests that you actually expect gains from just following the tips above (in which case you will be disappointed), here are some general optimization tips:
Tip #1: About 90% of your code should be optimized for readability, not performance. If you decide to attempt an optimization for performance, make sure you actually measure the gain. When it is below 5% I usually go back to the more readable version.
Tip #2: If you have an existing codebase, profile it first. If you don't profile it, you will miss some very effective optimizations. Usually there are some calls to time-consuming functions that can be completely eliminated, or the result cached.
If you don't want to use a profiler, at least print the current time in a couple of places, or interrupt the program with a debugger a couple of times to check where it is most often spending its time.

How do I segregate C++ code without impacting performance?

I'm having trouble refactoring my C++ code. The code itself is barely 200 lines, if even; however, being an image processing affair, it loops a lot, and the roadblocks I'm encountering (I assume) deal with very gritty details (e.g. memory access).
The program produces a correct output, but is supposed to ultimately run in realtime. Initially, it took ~3 minutes per 320x240px frame, but it's at around 2 seconds now (running approximately as fast on mid-range workstation and low-end laptop hardware; red flag?). Still a far cry from 24 times per second, however. Basically, any change I make propagates through the millions of repetitions, and tracking my beginner mistakes has become exponentially more cumbersome as I approach the realtime mark.
At 2 points, the program calculates a less computationally expensive variant of Euclidean distance, called taxicab distance (the sum of absolute differences).
Now, the abridged version:
std::vector<int> positiveRows, positiveCols;
/* looping through pixels, reading values */
distance = (abs(pValues[0] - qValues[0]) + abs(pValues[1] - qValues[1]) + abs(pValues[2] - qValues[2]));
if (distance < threshold)
{
    positiveRows.push_back(row);
    positiveCols.push_back(col);
}
If I wrap the functionality, as follows:
int taxicab_dist(int Lp, int ap, int bp,
                 int Lq, int aq, int bq)
{
    return (abs(Lp - Lq) + abs(ap - aq) + abs(bp - bq));
}
and call it from within the same .cpp file, there is no performance degradation. If I instead declare and define it in separate .hpp / .cpp files, I get a significant slowdown. This directly opposes what I've been told in my undergraduate courses ("including a file is the same as copy-pasting it"). The closest I've gotten to the original code's performance was by declaring the arguments const, but it still takes ~100ms longer, which my judgement says is not affordable for such a meager task. Then again, I don't see why it slows down (significantly) if I also make them const int&. Then, when I do the most sensible thing, and choose to take arrays as arguments, again I take a performance hit. I don't even dare attempt any templating shenanigans, or try making the function modify its behavior so that it accepts an arbitrary number of pairs, at least not until I understand what I've gotten myself into.
So my question is: how can I take the calculation definition to a separate file and have it perform the same as the original solution? Furthermore, should the fact that compilers are optimizing my program to run in 2 seconds instead of 15 be a huge red flag (bad algorithm design, not using more exotic C++ keywords / features)?
I'm guessing the main reason why I've failed to find an answer is because I don't know what is the name of this stuff. I've heard the terms "vectorization" tossed around quite a bit in the HPC community. Would this be related to that?
If it helps in any way at all, the code in its entirety can be found here.
As Joachim Pileborg says, you should profile first. Find out where in your program most of the execution time occurs. This is the place where you should optimize.
Reserving space in vector
Vectors start out small and then reallocate as necessary. This involves allocating a larger space in memory and then copying the old elements to the new vector. Finally deallocating the memory. The std::vector has the capability of reserving space upon construction. For large sizes of vectors, this can be a time saver, eliminating many reallocations.
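For the questioner's vectors that might look like this (a sketch using the names from the question; the bound is an assumption, any reasonable upper bound on the number of matches works):

#include <cstddef>
#include <vector>

std::vector<int> positiveRows, positiveCols;

void prepare(int rows, int cols)
{
    // Reserve once up front so push_back never has to reallocate and copy inside the pixel loop.
    const std::size_t max_matches = static_cast<std::size_t>(rows) * cols;
    positiveRows.reserve(max_matches);
    positiveCols.reserve(max_matches);
}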
Compiling with speed optimizations
With modern compilers, you should set the optimizations for high speed and see what they can do. The compiler writers have many tricks up their sleeve and can often spot locations to optimize that you or I miss.
Truth is assembly language
You will need to view the assembly language listing. If the assembly language shows only two instructions in the area you think is the bottleneck, you really can't get faster.
Loop unrolling
You may be able to get more performance by copying the content in a for loop many times. This is called loop unrolling. In some processors, branch or jump instructions cost more execution time than data processing instructions. Unrolling a loop reduces the number of executed branch instructions. Again, the compiler may automatically perform this when you raise the optimization level.
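An illustrative sketch of manual unrolling (my own example; compilers often do this themselves at higher optimization levels): processing four elements per iteration executes one branch per four additions instead of one per addition. Tail handling is omitted, so this assumes n is a multiple of 4.

int sum4(const int* a, int n)
{
    int s = 0;
    for (int i = 0; i < n; i += 4)                       // one loop branch per 4 elements
        s += a[i] + a[i + 1] + a[i + 2] + a[i + 3];
    return s;
}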
Data cache optimization
Search the web for "Data cache optimization". Loading and unloading the data cache wastes time. If your data can fit into the processor's data cache, it doesn't have to keep loading and unloading (called cache misses). Also remember to perform all your operations on the data in one place before performing other operations. This reduces the likelihood of the processor reloading the cache.
Multi-processor computing
If your platform has more than one processor, such as a Graphics Processing Unit (GPU), you may be able to delegate some tasks to it. Be aware that you have also added time by communicating with the other processor. So for small tasks, the communications overhead may waste the time you gained by delegating.
Parallel computing
Similar to multi-processors, you can have the Operating System delegate the tasks. The OS could delegate to different cores in your processor (if you have a multi-core processor) or it runs it in another thread. Again there is a cost: overhead of managing the task or thread and communications.
Summary
The three rules of Optimization:
1. Don't
2. Don't
3. Profile
After you profile, review the area where the most execution takes place. This will gain you more time than optimizing a section that never gets called. Design optimizations will generally get you more time than code optimizations. Likewise, requirement changes (such as elimination) may gain you more time than design optimizations.
After your program is working correctly and is robust, you can optimize, but only if warranted. If your UI is so slow that the user can go get a cup of coffee, it is a good place to optimize. If you gain 100 milliseconds by optimizing data transfer, but your program waits 1 second for the human response, you have not gained anything. Consider this as driving really fast to a stop sign. Regardless of your speed, you still have to stop.
If you still need performance gain, search the web for "Optimizations c++", or "data optimizations" or "performance optimization".