Measure execution efficiency for compairing two solutions? - c++

I want to measure the efficiency for two solutions for the same problem.
I don't need to include any environmental "noise" into the calculation, just I want to know which of these below solutions would perform better in a perfect world, ie.: which needs more steps to execute?
string a;
int b;
string c;
//SOLUTION A
c = a;
c = std::move(c) + ',' + std:to_string(b);
//SOLUTION B
c = a;
c.append(",").append(std::to_string(b));
I don't really have any experience in measuring execution times in this small scale, so I might have lost in the jungle, sorry if this is the case here.

One way is to benchmark both solutions, e.g. with QuickBench. In the chart:
you can see that second solution is faster. However, execution time of your code might also be dependent on the size and the number of the strings you try to concatenate, so take it into account. I would also recommend benchmarking whole solution (whatever you try to achieve), not just one line of your code (here: append vs operator+).
You can also try it with different compilers and different optimization levels.

Related

Is one loop better than several of them?

I've been working on my implementation of BigInteger, and when I was contemplating the solution for addition, I decided to go with cleaner one, which had in mind adding corresponding digits in function and "normalizing" them later. Like in the following example
999 999 + 111 111
= 10 10 10 10 10 10 (value after addition)
= 1 111 110 (value after normalization)
But since then I was wondering about how it affects the efficiency of the program. Are several loops doing small things each generally going to work faster than one big nested loop?
For example, using
int a[7]={0,9,9,9,9,9,9};
int b[7]={0,1,1,1,1,1,1};
int c[7];
Is this,
for(int q=0; q<7; ++q){
c[q]=a[q]+b[q];
if(c[q]>9){
c[q-1]=c[q]/10;
c[q]%=10;
}
}
better than this
for(int q=0; q<7; ++q){
c[q]=a[q]+b[q];
}
for(int q=0;q<7;++q){
if(c[q]>9){
c[q-1]=c[q]/10;
c[q]%=10;
}
}
And what about bigger loops, that have much more things to go through on each iteration?
UPD.
As someone suggested I did measure performance time for both examples. For two loops the average time (for 100mil. elements) ~4.85sec. For one loop ~3.72sec
It is very difficult to tell which one of the two approaches will be more efficient. It probably varies among C++ compiler vendors and within a single vendor, from version to version of their compiler.
The bottom line is:
You will never know unless you benchmark.
As usual, it is almost certain that it does not matter anyway, and you are most probably unduly concerned about performance, like the vast majority of programmers do, in the vast majority of cases.
At the end of the day, all that matters is what is more readable and more maintainable. Code maintainability is far more important than saving clock cycles.
If you decide to follow the wise path of "what is more readable" keep in mind that different folks find different things more readable. For example, I personally hate surprises when I am reading code, so I would be rather annoyed to read your first loop which allows decimal digits to receive erroneous values outside of the 0-9 range, only to find out later that you are finally remedying that with another loop.

Visual studio: how to make C++ consoleApplication use more of CPU power?

so i am running a console project, but when the code is running i see in Task manager that only 5% (2.8 GHz) of Cpu is been used, of course i am not exacly sure how cpu distribute the proccessing power in windows to begain with. but for more of a future reffrence i would like to know if i had a performance demanding code that i need the answer faster how would i do that?
here is the code if you would like to know:
#include "stdafx.h"
#include <iostream>
#include <string.h>
using namespace std;
void swap(char *x, char *y)
{
char temp;
temp = *x;
*x = *y;
*y = temp;
}
void permute(char *a, int l, int r)
{
int i;
if (l == r)
cout << a << endl;
else
{
for (i = l; i <= r; i++)
{
swap((a + l), (a + i));
permute(a, l + 1, r);
swap((a + l), (a + i));
}
}
}
int main()
{
char Short[] = "ABCD";
int n1 = strlen(Short);
char Long[] = "ABCDEFGHIJKLMNOPQRSTUVWXYZ";
int n2 = strlen(Long);
while(true)
{
cout << "Would you like to see the permutions of only a) ABCD or b) the whole alphabet?!\n(please enter a or b): ";
char input;
cin >> input;
if (input == 'a')
{
cout << "The permutions of ABCD:\n";
permute(Short, 0, n1 - 1);
cout << "-----------------------------------";
}
else if (input == 'b')
{
cout << "The permutions of Alphabet:\n";
permute(Long, 0, n2 - 1);
cout << "-----------------------------------";
}
else
{
cout << "ERROR! : Enter either a or b.\n";
}
}
}
i found the code in a blog to show the permutions of "ABCD" as part of an assgiment but i also used it for the entire alphabet, and i wanted to know for that use is there a way to make code use more cpu?(it's kinda taking a much longer time than i expected)
Learning to optimize code efficiently is a major challenge for the even experienced coders, and there are volumes of books, articles, and presentations on the topic. As such, a complete treatment is well out of scope for a Stack Overflow question.
That said, here are a few principles:
Focus initially on the algorithm. You can write a messy bubble sort or an efficient one, but in most 'real world' cases quicksort will beat either handily. This is arguably the primary reason the field of computer science exists: the study and selection of algorithms and their theoretical performance.
Related to this, make sure you are comparing your implementation against a 'stock' algorithm when possible. For example, you should see how your implementation performs compared to using the C++11 std::random_shuffle in the <random> header.
Optimize the compiler settings first. Debug builds are never going to be fast, and they aren't supposed to be. Using inline can help, but only happens if the compiler is actually doing inline optimizations. For Visual C++, there are a number of different optimization settings you can try out, but remember that there are tradeoffs so /Ox (maximum optimization) may not always be the right choice, which is why most templates default to /O2 (maximize speed). In some cases, /O1 (minimize space) is actually better.
Always measure performance before and after optimization. Modern out-of-order CPUs are sophisticated systems, and they don't always do what you think they are doing. In many cases, what is a textbook optimization in code actually performs worse than the original code due to various pipelining and microarchitecture effects. The only way to know for sure is to use a good profiler, have solid test cases, and measure the impact of any optimization work. If it's slower on average than before, then revert to the 'unoptimized' version and try something else.
Focus optimization on the hotspots. This is the so-called '80/20' rule. In many applications the vast majority of the code is run rarely, so only a few areas of your application are actually spending enough time running to be worth optimizing.
As a corollary to this rule, having all of your code using extremely inefficient anti-patterns can really hurt the baseline performance of your entire application. For this reason, it's worth knowing how to write good code generally. The point of the 80/20 rule is to spend your limited time optimizing on the areas that will have the most impact rather than what you as the programmer assume matters.
All that said, in your case none of this matters. The vast majority of the CPU time is spent just creating your process and handling the serialized input and output. When dealing with an n of 4 or 26, it doesn't matter how bad or good your algorithm is. In other words, it is highly unlikely permute is your program's 'hotspot' unless you are working with tens of thousands of millions of characters.
NOTE: Yes I am oversimplifying the topic, but I'm concerned that
without this basic understanding, the more advanced topics will
actually lead to some disastrous program designs.
Maybe I'm missing something, but there also seems to be a misunderstanding regarding the link between CPU and efficiency in your mind.
Your program has N instructions, and the CPU will process those N instructions at relatively the same speed (3.56 GHz is about 3.56 billion instructions per second). That's the same (more or less), whether you're getting "5%" or "25%" use of the CPU from a single program. (I'll explain that percentage in a moment.)
The only way to get "faster" in terms of processor usage, as erip said, with parallel computing techniques, which in a nutshell employ multiple CPUs to accomplish the task.
If you think of it like an assembly line, your one worker can only process one widget at a time. If your batch of widgets to him takes up 5% of his time, that means that in order to process ALL of your widgets one-by-one, he uses 5% of his time, and the other 95% is not needed for that batch (and he'll probably use it for some other batches other people assigned him.)
He cannot process more than one widget at a time, so that's as fast as he'll get with your batch. You might be able to make things appear faster by having him alternate between two different types of widgets, instead of finishing all of batch A before starting on batch B, but it will still take the same amount of time in the end to process both batches.
MASSIVE EXCEPTION: If he's spending 100% of his time on someone else's batch of widgets, you're literally going to have to cool your heels. That's not something you can do a thing about.
However, if you add another worker to that assembly line, they can process twice (roughly) the widgets in the same amount of time, because you are processing two widgets at once. When we say you have a "quad core processor", that basically means that you have four workers available (literally 4 CPUs). Each one can only process a single instruction at once, but by assigning more than one to the batch of widgets, you get it done faster.
All of this said, one must keep in mind that those CPUs are doing a lot - they run the entire computer. You want to try and keep those percentages down as much as possible, so your program is fast and responsive on any supported computer. Not all of your users will have 3.46 GHz quad-core machines, after all.
Surely the reason this program is not using all available CPU bandwidth is because it's emitting the permutation results to the screen once for each permutation. This will result in blocking I/O within the implementation of cout.
If you want 100% cpu use you'll want to separate computation from I/O. In this case you'd then need to either:
a) store the results for later output, or
b) communicate results across a thread boundary (which will itself have a an efficiency cost because of the cost of acquiring mutexes and synchronising cache memory), or
c) a combination of the above (batching results and communicating them across the thread boundary)
For a quick check, you could remove comment out all the cout calls and see how much CPU use you get (as mentioned it will be close to 100% divided by the number of CPUs on your computer).

How to speed up program execution

This is a very simple question, but unfortunately, I am stuck and do not know what to do. My program is a simple program that keeps on accepting 3 numbers and outputs the largest of the 3. The program keeps on running until the user inputs a character.
As the tittle says, my question is how I can make this execute faster ( There will be a large amount of input data ). Any sort of help which may include using a different algorithm or using different functions or changing the entire code is accepted.
I'm not very experienced in C++ Standard, and thus do not know about all the different functions available in the different libraries, so please do explain your reasons and if you're too busy, at least try and provide a link.
Here is my code
#include<stdio.h>
int main()
{
int a,b,c;
while(scanf("%d %d %d",&a,&b,&c))
{
if(a>=b && a>=c)
printf("%d\n",a);
else if(b>=a && b>=c)
printf("%d\n",b);
else
printf("%d\n",c);
}
return 0;
}
It's working is very simple. The while loop will continue to execute until the user inputs a character. As I've explained earlier, the program accepts 3 numbers and outputs the largest. There is no other part of this code, this is all. I've tried to explain it as much as I can. If you need anything more from my side, please ask, ( I'll try as much as I can ).
I am compiling on an internet platform using CPP 4.9.2 ( That's what is said over there )
Any sort of help will be highly appreciated. Thanks in advance
EDIT
The input is made by a computer, so there is no delay in input.
Also, I will accept answers in c and c++.
UPDATE
I would also like to ask if there are any general library functions or algorithms, or any other sort of advise ( certain things we must do and what we must not do ) to follow to speed up execution ( Not just for this code, but in general ). Any help would be appreciated. ( and sorry for asking such an awkward question without giving any reference material )
Your "algorithm" is very simple and I would write it with the use of the max() function, just because it is better style.
But anyway...
What will take the most time is the scanf. This is your bottleneck. You should write your own read function which reads a huge block with fread and processes it. You may consider doing this asynchronously - but I wouldn't recommend this as a first step (some async implementations are indeed slower than the synchronous implementations).
So basically you do the following:
Read a huge block from file into memory (this is disk IO, so this is the bottleneck)
Parse that block and find your three integers (watch out for the block borders! the first two integers may lie within one block and the third lies in the next - or the block border splits your integer in the middle, so let your parser just catch those things)
Do your comparisions - that runs as hell compared to the disk IO, so no need to improve that
Unless you have a guarantee that the three input numbers are all different, I'd worry about making the program get the correct output. As noted, there's almost nothing to speed up, other than input and output buffering, and maybe speeding up decimal conversions by using custom parsing and formatting code, instead of the general-purpose scanf and printf.
Right now if you receive input values a=5, b=5, c=1, your code will report that 1 is the largest of those three values. Change the > comparisons to >= to fix that.
You can minimize the number of comparisons by remembering previous results. You can do this with:
int d;
if (a >= b)
if (a >= c)
d = a;
else
d = c;
else
if (b >= c)
d = b;
else
d = c;
[then output d as your maximum]
That does exactly 2 comparisons to find a value for d as max(a,b,c).
Your code uses at least two and maybe up to 4.

Is adding 1 to a number repeatedly slower than adding everything at the same time in C++? [closed]

Closed. This question needs debugging details. It is not currently accepting answers.
Edit the question to include desired behavior, a specific problem or error, and the shortest code necessary to reproduce the problem. This will help others answer the question.
Closed 8 years ago.
Improve this question
If I have a number a, would it be slower to add 1 to it b times rather than simply adding a + b?
a += b;
or
for (int i = 0; i < b; i++) {
a += 1;
}
I realize that the second example seems kind of silly, but I have a situation where coding would actually be easier that way, and I am wondering if that would impact performance.
EDIT: Thank you for all your answers. It looks like some posters would like to know what situation I have. I am trying to write a function to shift an inputted character a certain number of characters over (ie. a cipher) if it is a letter. So, I want to say that one char += the number of shifts, but I also need to account for the jumps between the lowercase characters and uppercase characters on the ascii table, and also wrapping from z back to A. So, while it is doable in another way, I thought it would be easiest to keep adding one until I get to the end of a block of letter characters, then jump to the next one and keep going.
If your loop is really that simple, I don't see any reason why a compiler couldn't optimize it. I have no idea if any actually would, though. If your compiler doesn't the single addition will be much faster than the loop.
The language C++ does not describe how long either of those operations take. Compilers are free to turn your first statement into the second, and that is a legal way to compile it.
In practice, many compilers would treat those two subexpressions as the same expression, assuming everything is of type int. The second, however, would be fragile in that seemingly innocuous changes would cause massive performance degradation. Small changes in type that 'should not matter', extra statements nearby, etc.
It would be extremely rare for the first to be slower than the second, but if the type of a was such that += b was a much slower operation than calling += 1 a bunch of times, it could be. For example;
struct A {
std::vector<int> v;
void operator+=( int x ) {
// optimize for common case:
if (x==1 && v.size()==v.capacity()) v.reserve( v.size()*2 );
// grow the buffer:
for (int i = 0; i < x; ++i)
v.reserve( v.size()+1 );
v.resize( v.size()+1 );
}
}
};
then A a; int b = 100000; a+=b; would take much longer than the loop construct.
But I had to work at it.
The overhead (CPU instructions) on having a variable being incremented in a loop is likely to be insignificant compared to the total number of instructions in that loop (unless the only thing you are doing in the loop is incrementing). Loop variables are likely to remain in the low levels of the CPU cache (if not in CPU registries) and is very fast to increment as in doesn't need to read from the RAM via the FSB. Anyway, if in doubt just make a quick profile and you'll know if it makes sense to sacrifice code readability for speed.
Yes, absolutely slower. The second example is beyond silly. I highly doubt you have a situation where it would make sense to do it that way.
Lets say 'b' is 500,000... most computers can add that in a single operation, why do 500,000 operations (not including the loop overhead).
If the processor has an increment instruction, the compiler will usually translate the "add one" operation into an increment instruction.
Some processors may have an optimized increment instructions to help speed up things like loops. Other processors can combine an increment operation with a load or store instruction.
There is a possibility that a small loop containing only an increment instruction could be replaced by a multiply and add. The compiler is allowed to do so, if and only if the functionality is the same.
This kind of operation, generally produces negligible results. However, for large data sets and performance critical applications, this kind of operation may be necessary and the time gained would be significant.
Edit 1:
For adding values other than 1, the compiler would emit processor instructions to use the best addition operations.
The add operation is optimized in hardware as a different animal than incrementing. Arithmetic Logic Units (ALU) have been around for a long time. The basic addition operation is very optimized and a lot faster than incrementing in a loop.

How do I force the compiler not to skip my function calls?

Let's say I want to benchmark two competing implementations of some function double a(double b, double c). I already have a large array <double, 1000000> vals from which I can take input values, so my benchmarking would look roughly like this:
//start timer here
double r;
for (int i = 0; i < 1000000; i+=2) {
r = a(vals[i], vals[i+1]);
}
//stop timer here
Now, a clever compiler could realize that I can only ever use the result of the last iteration and simply kill the rest, leaving me with double r = a(vals[999998], vals[999999]). This of course defeats the purpose of benchmarking.
Is there a good way (bonus points if it works on multiple compilers) to prevent this kind of optimization while keeping all other optimizations in place?
(I have seen other threads about inserting empty asm blocks but I'm worried that might prevent inlining or reordering. I'm also not particularly fond of the idea of adding the results sum += r; during each iteration because that's extra work that should not be included in the resulting timings. For the purposes of this question, it would be great if we could focus on other alternative solutions, although for anyone interested in this there is a lively discussion in the comments where the consensus is that += is the most appropriate method in many cases. )
Put a in a separate compilation unit and do not use LTO (link-time optimizations). That way:
The loop is always identical (no difference due to optimizations based on a)
The overhead of the function call is always the same
To measure the pure overhead and to have a baseline to compare implementations, just benchmark an empty version of a
Note that the compiler can not assume that the call to a has no side-effect, so it can not optimize the loop away and replace it with just the last call.
A totally different approach could use RDTSC, which is a hardware register in the CPU core that measures the clock cycles. It's sometimes useful for micro-benchmarks, but it's not exactly trivial to understand the results correctly. For example, check out this and goggle/search SO for more information on RDTSCs.