Optimizing subroutine with memcmp, memcpy - c++

I wonder whether this subroutine can be optimized further (something more efficient than memcmp/memcpy, perhaps a plain for loop, or dropping down to fast assembly instructions). NUM_BYTES is a constant value (always = 18):
void ledSmoothWrite(uint8_t ledTarget[])
{
    // If the new target is different, set the new target
    if (memcmp(target_arr, ledTarget, NUM_BYTES)) memcpy(target_arr, ledTarget, NUM_BYTES);

    // Step each value one unit toward its target
    for (uint8_t i = 0; i < NUM_BYTES; i++)
    {
        if (rgb_arr[i] < target_arr[i]) rgb_arr[i]++;
        else if (rgb_arr[i] > target_arr[i]) rgb_arr[i]--;
    }

    render();
}
This subroutine, which fades the LEDs smoothly toward their target colors, may be called several hundred times per second. As the run time of the loop() function grows, each LED takes correspondingly longer to reach its desired value.
Any help would be greatly appreciated. Thank you in advance!

Check your documentation, but on many good compilers memcmp() and memcpy() are implemented as efficient machine code. They may well be (for practical purposes) as fast as it gets.
Also try not doing the comparison at all: depending on the probability that the two ranges are equal, comparing first and then (if different) copying may not be a net win over copying unconditionally.
However the best solution is to not perform the copy at all!
If possible just read out of ledTarget.
It's not exactly clear what you're doing but animations often perform 'double buffering' to avoid copying big states around the place.
So if you're working concurrently, write into one buffer while reading from the other; then, on the next cycle, swap: write into the other buffer and read from the first.
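For a rough idea, here is a minimal double-buffering sketch (the buffer and function names are hypothetical, not from the question's code); instead of copying 18 bytes per call, only two pointers are swapped:

#include <stdint.h>

#define NUM_BYTES 18

uint8_t buf_a[NUM_BYTES], buf_b[NUM_BYTES];
uint8_t *front = buf_a;   // currently being read/rendered
uint8_t *back  = buf_b;   // currently being written

// After filling `back` with the next frame, swap the roles of the
// two buffers instead of copying any bytes.
void swapBuffers(void)
{
    uint8_t *tmp = front;
    front = back;
    back  = tmp;
}

The writer only ever touches back, the reader only ever touches front, and a swap costs two pointer assignments regardless of the buffer size.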

Related

Performance impact of using 'break' inside 'for-loop'

I have done my best and read a lot of Q&As on SO.SE, but I haven't found an answer to my particular question. Most for-loop and break related questions refer to nested loops, whereas I am concerned with performance.
I want to know if using a break inside a for-loop has an impact on the performance of my C++ code (assuming the break gets almost never called). And if it has, I would also like to know tentatively how big the penalization is.
I quite suspect that it does indeed impact performance (although I do not know how much). So I wanted to ask you. My reasoning goes as follows:
Independently of the extra code for the conditional statement that triggers the break (like an if), it necessarily adds additional instructions to my loop.
Further, it probably also interferes when my compiler tries to unroll the for-loop, as the number of iterations is no longer known at compile time, effectively turning it into a while-loop.
Therefore, I suspect it does have a performance impact, which could be considerable for very fast and tight loops.
So this takes me to a follow-up question: is a for-loop with a break performance-wise equal to a while-loop? Take the following snippet, where we assume that checkCondition() evaluates to true 99.9% of the time. Do I lose the performance advantage of the for-loop?
// USING WHILE
int i = 100;
while (i-- && checkCondition())
{
    // do stuff
}

// USING FOR
for (int i = 100; i; --i)
{
    if (checkCondition()) {
        // do stuff
    } else {
        break;
    }
}
I have tried it on my computer, but I get the same execution time. And being wary of the compiler and its optimization voodoo, I wanted to know the conceptual answer.
EDIT:
Note that I have measured the execution time of both versions in my complete code, without any real difference. Also, I do not trust compiling with -S (which I usually do) for this matter, as I am not interested in the particular result of my compiler; I am rather interested in the concept itself (in an academic sense), as I am not sure I got this completely right :)
The principal answer is to avoid spending time on micro-optimizations like this until you have verified that the condition evaluation is a bottleneck.
The real answer is that CPUs have powerful branch prediction circuits which empirically work really well.
What will happen is that your CPU will guess whether the branch is going to be taken and execute the code as if the if condition were not even present. Of course this relies on multiple assumptions, like the condition calculation having no side effects (so that no part of the loop body depends on it), and the condition always evaluating to false up to a certain point, at which it becomes true and stops the loop.
Some compilers also allow you to specify the likeliness of an evaluation as a hint to the branch predictor.
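For instance, GCC and Clang provide __builtin_expect (and C++20 adds the [[likely]]/[[unlikely]] attributes); a small sketch, reusing the question's checkCondition():

// GCC/Clang extension: tell the compiler the branch is almost always taken
#define LIKELY(x) __builtin_expect(!!(x), 1)

for (int i = 100; i; --i) {
    if (LIKELY(checkCondition())) {
        // do stuff
    } else {
        break;
    }
}

This only rearranges the generated code (e.g. which path falls through); the hardware branch predictor still does the heavy lifting at run time.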
If you want to see the semantic difference between the two code versions, just compile them with -S and examine the generated asm code; there's no other magic way to do it.
The only sensible answer to "what is the performance impact of ...", is "measure it". There are very few generic answers.
In the particular case you show, it would be rather surprising if an optimising compiler generated significantly different code for the two examples. On the other hand, I can believe that a loop like:
unsigned sum = 0;
unsigned stop = -1;               // all bits set
for (int i = 0; i < 32; i++)
{
    stop &= checkcondition();     // returns 0 or all-bits-set
    sum += (stop & x[i]);
}
might be faster than:
unsigned sum = 0;
for (int i = 0; i < 32; i++)
{
    if (!checkcondition())
        break;
    sum += x[i];
}
for a particular compiler, for a particular platform, with the right optimization levels set, and for a particular pattern of "checkcondition" results.
... but the only way to tell would be to measure.

Is adding 1 to a number repeatedly slower than adding everything at the same time in C++? [closed]

If I have a number a, would it be slower to add 1 to it b times rather than simply adding a + b?
a += b;
or
for (int i = 0; i < b; i++) {
    a += 1;
}
I realize that the second example seems kind of silly, but I have a situation where coding would actually be easier that way, and I am wondering if that would impact performance.
EDIT: Thank you for all your answers. It looks like some posters would like to know what situation I have. I am trying to write a function that shifts an input character a certain number of characters over (i.e. a cipher) if it is a letter. So I want to say that one char += the number of shifts, but I also need to account for the jumps between the lowercase and uppercase characters on the ASCII table, and also for wrapping from z back to A. So, while it is doable in another way, I thought it would be easiest to keep adding one until I get to the end of a block of letter characters, then jump to the next one and keep going.
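For reference, a hedged sketch of the direct-arithmetic alternative (shiftLetter is a hypothetical name; it assumes ASCII, a non-negative shift, and wrapping within each case block rather than from z to A):

#include <cctype>

// Shift a letter n positions, wrapping within its own case block.
char shiftLetter(char c, int n)
{
    if (std::isupper(static_cast<unsigned char>(c)))
        return static_cast<char>('A' + (c - 'A' + n) % 26);
    if (std::islower(static_cast<unsigned char>(c)))
        return static_cast<char>('a' + (c - 'a' + n) % 26);
    return c;   // non-letters pass through unchanged
}

The modulo replaces the one-by-one wrapping, so the cost no longer depends on the size of the shift.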
If your loop is really that simple, I don't see any reason why a compiler couldn't optimize it. I have no idea if any actually would, though. If your compiler doesn't, the single addition will be much faster than the loop.
The language C++ does not describe how long either of those operations take. Compilers are free to turn your first statement into the second, and that is a legal way to compile it.
In practice, many compilers would treat those two subexpressions as the same expression, assuming everything is of type int. The second, however, would be fragile in that seemingly innocuous changes would cause massive performance degradation. Small changes in type that 'should not matter', extra statements nearby, etc.
It would be extremely rare for the first to be slower than the second, but if the type of a was such that += b was a much slower operation than calling += 1 a bunch of times, it could be. For example:
#include <vector>

struct A {
    std::vector<int> v;
    void operator+=( int x ) {
        // optimize for the common case of adding 1: double the capacity
        if (x==1 && v.size()==v.capacity()) v.reserve( v.size()*2 );
        // otherwise grow the buffer one element at a time, which forces
        // a reallocation on every step once the spare capacity is gone:
        for (int i = 0; i < x; ++i) {
            v.reserve( v.size()+1 );
            v.resize( v.size()+1 );
        }
    }
};
then A a; int b = 100000; a+=b; would take much longer than the loop construct.
But I had to work at it.
The overhead (CPU instructions) of incrementing a variable in a loop is likely to be insignificant compared to the total number of instructions in that loop (unless the only thing you are doing in the loop is incrementing). Loop variables are likely to remain in the low levels of the CPU cache (if not in CPU registers) and are very fast to increment, as they don't need to be read from RAM via the FSB. Anyway, if in doubt just make a quick profile and you'll know whether it makes sense to sacrifice code readability for speed.
Yes, absolutely slower. The second example is beyond silly. I highly doubt you have a situation where it would make sense to do it that way.
Let's say b is 500,000: most computers can add that in a single operation, so why do 500,000 operations (not counting the loop overhead)?
If the processor has an increment instruction, the compiler will usually translate the "add one" operation into an increment instruction.
Some processors may have an optimized increment instruction to help speed up things like loops. Other processors can combine an increment operation with a load or store instruction.
There is a possibility that a small loop containing only an increment instruction could be replaced by a multiply and add. The compiler is allowed to do so, if and only if the functionality is the same.
This kind of optimization generally produces negligible gains. However, for large data sets and performance-critical applications, it may be necessary, and the time gained could be significant.
Edit 1:
For adding values other than 1, the compiler would emit processor instructions to use the best addition operations.
Addition is handled in hardware as a different animal from incrementing: Arithmetic Logic Units (ALUs) have been around for a long time, and a single addition is heavily optimized and a lot faster than incrementing in a loop.

Algorithm: taking out every 4th item of an array

I have two huge arrays (int source[1000], dest[1000] in the code below, but having millions of elements in reality). The source array contains a series of ints of which I want to copy 3 out of every 4.
For example, if the source array is:
int source[1000] = {1,2,3,4,5,6,7,8....};
int dest[1000];
Here is my code:
for (int count_small = 0, count_large = 0; count_large < 1000; count_small += 3, count_large += 4)
{
    dest[count_small]   = source[count_large];
    dest[count_small+1] = source[count_large+1];
    dest[count_small+2] = source[count_large+2];
}
In the end, dest console output would be:
1 2 3 5 6 7 9 10 11...
But this algorithm is so slow! Is there an algorithm or an open source function that I can use / include?
Thank you :)
Edit: The actual length of my array would be about 1 million (640*480*3)
Edit 2: Processing this for loop takes about 0.98 to 2.28 seconds, while the rest of the code only takes 0.08 to 0.14 seconds, so the device spends at least 90% of its CPU time in this loop alone.
Well, the asymptotic complexity there is as good as it's going to get. You might be able to achieve slightly better performance by loading in the values as four 4-way SIMD integers, shuffling them into three 4-way SIMD integers, and writing them back out, but even that's not likely to be hugely faster.
With that said, though, the time to process 1000 elements (Edit: or one million elements) is going to be utterly trivial. If you think this is the bottleneck in your program, you are incorrect.
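To illustrate the SIMD idea, here is an untested sketch (SSE2; it assumes n is a multiple of 4 and that dest has one int of slack after the last group, since every store writes a stale fourth int that the following store overwrites):

#include <emmintrin.h>   // SSE2

// Copy 3 of every 4 ints using overlapping 16-byte stores.
void copy3of4(const int* src, int* dst, int n)
{
    for (int i = 0, j = 0; i < n; i += 4, j += 3) {
        __m128i v = _mm_loadu_si128(reinterpret_cast<const __m128i*>(src + i));
        _mm_storeu_si128(reinterpret_cast<__m128i*>(dst + j), v);
    }
}

No shuffle is needed for this particular pattern: the three ints to keep already sit at the front of each 16-byte group.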
Before you do much more, try profiling your application and determine whether this is the best place to spend your time. Then, if it is a hot spot, determine how fast it is now and how fast you need it to be (or could realistically make it). Then test the alternatives; the overhead of threading or OpenMP might even slow it down (especially, as you have now noted, if you are on a single-core processor, in which case it won't help at all). For single threading, I would look to memcpy as per Sean's answer.
@Sneftel has also referenced other options involving SIMD integers.
One option would be to try parallel processing the loop, and see if that helps. You could try using the OpenMP standard (see Wikipedia link here), but you would have to try it for your specific situation and see if it helped. I used this recently on an AI implementation and it helped us a lot.
#pragma omp parallel for
for (...)
{
    ... do work
}
Other than that, you are limited to the compiler's own optimisations.
You could also look at the recent threading support in C11, though you might be better off using pre-implemented framework tools like parallel_for (available in the new Windows Concurrency Runtime through the PPL in Visual Studio, if that's what you're using) than rolling your own.
parallel_for(0, max_iterations,
    [...] (int i)
    {
        ... do stuff
    }
);
Inside the loop itself, you still have other options. You could try a for loop that visits every element and simply skips every fourth one (skip when (i+1) % 4 == 0), instead of doing 3 copies per iteration, or do block memcpy operations for groups of 3 integers as per Sean's answer. You might trigger slightly different compiler optimisations with some of these, but it is unlikely (memcpy is probably as fast as you'll get).
for (int i = 0, j = 0; i < 1000; i++)
{
    if ((i+1) % 4 != 0)
    {
        dest[j] = source[i];
        j++;
    }
}
You should then develop a test rig so you can quickly performance test and decide on the best one for you. Above all, decide how much time is worth spending on this before optimising elsewhere.
You could try memcpy instead of the individual assignments:
memcpy(&dest[count_small], &source[count_large], sizeof(int) * 3);
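Applied to the question's loop (with <cstring> included), that would look something like:

for (int count_small = 0, count_large = 0; count_large < 1000; count_small += 3, count_large += 4)
{
    // copy each group of 3 kept ints in one call
    memcpy(&dest[count_small], &source[count_large], 3 * sizeof(int));
}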
Is your array size only 1000? If so, how is it slow? It should be done in no time!
As long as you are creating a new array, and for a single-threaded application, this is the only way AFAIK.
However, if the datasets are huge, you could try a multi-threaded application.
Also, you could explore having a bigger data type hold the values, so that the array size decreases, if that is viable for your real-life application.
If you have an Nvidia card you can consider using CUDA. If that's not the case, you can try other parallel programming methods/environments as well.

efficient check for value change in array of floats in c++

I want to optimize an OpenGL application, and one hotspot is the expensive handling (uploading to the graphics card) of relatively small arrays (8-64 values) whose values sometimes change but most of the time stay constant. So the most efficient solution would be to upload an array only when it has changed.
Of course the simplest way would be to set flags whenever the data is changed, but this would require many code changes, and for a quick test I would like to know the possible performance gains before too much work is done.
So I thought of a quick in-memory check (like a murmur hash etc.) of whether the data has changed from frame to frame, and deciding on the upload after this check. So the question is: how could I e.g. XOR an array of values like
float vptr[] = { box.x1,box.y1, box.x1,box.y2, box.x2,box.y2, box.x2,box.y1 };
together to reliably detect value changes?
Best & thanks,
Heiner
If you're using intel, you could look into intel intrinsics.
http://software.intel.com/en-us/articles/intel-intrinsics-guide gives you an interactive reference where you can explore. There are a bunch of instructions for comparing multiple integers or doubles in one instruction, which is a nice speed-up.
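For example, a sketch of a vectorized change check for the 8-float array from the question (plain SSE, untested; note that a NaN lane always compares unequal and therefore counts as changed):

#include <xmmintrin.h>   // SSE

// true if any of the 8 floats differs between a and b
bool changed(const float* a, const float* b)
{
    for (int i = 0; i < 8; i += 4) {
        __m128 eq = _mm_cmpeq_ps(_mm_loadu_ps(a + i), _mm_loadu_ps(b + i));
        if (_mm_movemask_ps(eq) != 0xF)   // 0xF = all four lanes equal
            return true;
    }
    return false;
}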
@Ming, thank you for the intrinsics pointer, I will have a look into this.
float vptr[] = { box.x1,box.y1, box.x1,box.y2, box.x2,box.y2, box.x2,box.y1 };
unsigned h = 0;
for (int i = 0; i < (int)(sizeof vptr / sizeof *vptr); ++i)
{
    h ^= (unsigned&) vptr[i];   // reinterpret the float's bits as an unsigned
}
Dead simple, and it worked for the really tiny arrays. The compiler should be able to auto-vectorize it, since the size of the array is known; I still have to test larger arrays.
origin: Hash function for floats
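One possible way to wire that into the frame loop (a sketch; upload() stands in for the real GL call, and note that a plain XOR misses changes that cancel out, e.g. two swapped values):

unsigned lastHash = 0;

void maybeUpload(const float* vptr, int n)
{
    unsigned h = 0;
    for (int i = 0; i < n; ++i)
        h ^= (unsigned&) vptr[i];
    if (h != lastHash) {       // only touch the GPU when the bits changed
        lastHash = h;
        upload(vptr, n);       // hypothetical: the expensive GPU upload
    }
}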

Performance of comparisons in C++ ( foo >= 0 vs. foo != 0 )

I've been working on a piece of code recently where performance is very important, and essentially I have the following situation:
int len = some_very_big_number;
int counter = some_rather_small_number;
for (int i = len; i >= 0; --i) {
    while (counter > 0 && costly other stuff here) {
        /* do stuff */
        --counter;
    }
    /* do more stuff */
}
So here I have a loop that runs very often, and for a certain number of runs the while block is executed as well, until counter has been reduced to zero; from then on the while loop is not entered because its first expression is false.
The question is now: is there a difference in performance between using counter > 0 and counter != 0? I suspect there would be; does anyone know the specifics?
To measure is to know. Do you think this will solve your problem? :D
if (x >= 0)
00CA1011  cmp  dword ptr [esp],0
00CA1015  jl   main+2Ch (0CA102Ch)   <----
...
if (x != 0)
00CA1026  cmp  dword ptr [esp],0
00CA102A  je   main+3Bh (0CA103Bh)   <----
In programming, the following statement is the sign designating the road to Hell:
I've been working on a piece of code recently where performance is very important
Write your code in the cleanest, most easy to understand way. Period.
Once that is done, you can measure its runtime. If it takes too long, measure the bottlenecks, and speed up the biggest ones. Keep doing that until it is fast enough.
The list of projects that failed or suffered catastrophic loss due to a misguided emphasis on blind optimization is large and tragic. Don't join them.
I think you're spending time optimizing the wrong thing. "costly other stuff here", "do stuff" and "do more stuff" are more important to look at. That is where you'll make the big performance improvements I bet.
There will be a huge difference if the counter starts with a negative number. Otherwise, on every platform I'm familiar with, there won't be a difference.
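To see why, consider this sketch of what each condition does when counter starts at -1:

int counter = -1;

while (counter > 0)  { --counter; }   // never entered: -1 > 0 is false

while (counter != 0) { --counter; }   // entered: counts down through INT_MIN,
                                      // where the next decrement is signed
                                      // overflow (undefined behavior)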
Is there a difference between counter > 0 and counter != 0? It depends on the platform.
A very common type of CPU is the one from Intel that we have in our PCs. Both comparisons map to a single instruction on that CPU, and I assume they will execute at the same speed. However, to be certain you will have to run your own benchmark.
As Jim said, when in doubt see for yourself :
#include <boost/date_time/posix_time/posix_time.hpp>
#include <iostream>

using namespace boost::posix_time;
using namespace std;

int main()
{
    ptime Before = microsec_clock::universal_time(); // UTC now

    // do stuff here

    ptime After = microsec_clock::universal_time();  // UTC now

    time_duration delta_t = After - Before;       // how much time has passed?
    cout << delta_t.total_seconds() << endl;      // whole seconds elapsed
    cout << delta_t.fractional_seconds() << endl; // fractional part, in microseconds
}
Here's a pretty nifty way of measuring time. Hope that helps.
OK, you can measure this, sure. However, comparisons of this sort are so fast that you will probably see more variation from process swapping and scheduling than from this single line of code.
This smells of unnecessary, and premature, optimization. Write your program, optimize what you see. If you need more, profile, and then go from there.
I would add that on modern CPUs the performance of this code will be dominated not by the comparison instruction but by whether the comparison is well predicted, since any mispredict wastes far more cycles than any integral operation.
As such, loop unrolling will probably be the biggest winner, but measure, measure, measure.
Thinking that the type of comparison is going to make a difference, without knowing it, is the definition of guessing.
Don't guess.
In general, they should be equivalent (both are usually implemented in single-cycle instructions/micro-ops). Your compiler may do some strange special-case optimization that is difficult to reason about from the source level, which may make either one slightly faster. Also, equality testing is more energy-efficient than inequality testing (>), though the system-level effect is so small as to not merit discussion.
There may be no difference. You could try examining the assembly output for each.
That being said, the only way to tell if any difference is significant is to try it both ways and measure. I'd bet that the change makes no difference whatsoever with optimizations on.
Assuming you are developing for the x86 architecture, when you look at the assembly output it will come down to jns vs. jne. jns checks the sign flag and jne checks the zero flag. Both operations should, as far as I know, be equally costly.
Clearly the solution is to use the correct data type.
Make counter an unsigned int. Then it can't be less than zero. Your compiler will obviously know this and be forced to choose the optimal solution.
Or you could just measure it.
You could also think about how it would be implemented...(here we go on a tangent)...
less than zero: the sign bit would be set, so need to check 1 bit
equal to zero : the whole value would be zero, so need to check all the bits
Of course, computers are funny things, and it may take longer to check a single bit than the whole value (however many bytes it is on your platform).
You could just measure it...
And you could find out that one is more optimal than the other (under the conditions you measured). But your program will still run like a dog, because you spent all your time optimising the wrong part of your code.
The best solution is to do what many large software companies do: blame the hardware for not running fast enough and encourage your customer to upgrade their equipment (which is clearly inferior, since your product doesn't run fast enough).
</rant>
I stumbled across this question just now, 3 years after it was asked, so I am not sure how useful the answer will still be... Still, I am surprised not to see it clearly stated that answering your question requires knowing two and only two things:
which processor you target
which compiler you work with
To the first point, each processor has different instructions for tests. On a given processor, two similar comparisons may turn out to take a different number of cycles. For example, you may have a 1-cycle instruction for gt (>), eq (==), or le (<=), but no 1-cycle instruction for other comparisons like ge (>=). Following a test, you may execute conditional instructions or, more often, as in your code example, take a jump. There again, jumps cost a variable number of cycles on most high-end processors, depending on whether the conditional jump is taken or not taken, predicted or not predicted. When you write time-critical code in assembly, you can spend quite a bit of time figuring out how best to arrange it to minimize the overall cycle count, and may end up with a solution that has to be tuned based on how often a given comparison returns true or false.
Which leads me to the second point: compilers, like human coders, try to arrange the code to take into account the instructions available and their latencies. Their job is harder, because some assumptions a hand-written assembly version could rely on, like "counter is small", are hard (though not impossible) for the compiler to derive. For trivial cases like a loop counter, most modern compilers can at least recognize that the counter will always be positive, and that != is then equivalent to >, and generate the best code accordingly. But, as many have mentioned in the posts, you will only know for sure if you either run measurements or inspect your assembly code and convince yourself this is the best you could do in assembly. And when you upgrade to a new compiler, you may then get a different answer.