Severe & Bizzare performance issue - c++

Update
Ok, I removed the 3 couts and replaced it with *buffer = 'a', and there was a big performance difference. Removing that line made the program 2x as fast. If you go on godbolt and compile it using msvc, that single line of code changes most of the program. (It adds a whole lot more complexity)
The following might seem extremely weird, but it's true on my computer:
Alright, so I was doing some benchmarking of some code, and I noticed extremely weird performance anomalies that were 100% consistent. I'm running windows-10 and visual-studio-2019. Basically, deleting a line of code that is never called completely changes the performance of the program.
Here is exactly what to do:
Create new VS-2019 Console C++ App project
Set the configuration to Release & x64
Paste the code below:
#include <iostream>
#include <chrono>
class Test {
public:
size_t length;
size_t doublingVal;
char* buffer;
Test() : length(0), doublingVal(2) {
buffer = static_cast<char*>(malloc(1));
}
~Test() {
std::cout << "called" << "\n";
std::cout << "called" << "\n";
std::cout << "called" << "\n"; // Remove this line and the time decreases DRASTICALLY (ie remove line 14)
}
void append() {
if (doublingVal == length) {
doublingVal <<= 1;
}
*buffer = 'a';
++length;
}
};
int main()
{
Test test;
auto start = std::chrono::high_resolution_clock::now();
for (size_t i = 0; i < static_cast<size_t>(1024) * 1024 * 1024 * 4; ++i) {
test.append();
}
std::cout << std::chrono::duration_cast<std::chrono::milliseconds>(std::chrono::high_resolution_clock::now() - start).count() << "\n";
}
Run the program using CTRL+F5, not in debug. Now remember how long it takes to run. (a few seconds)
Then, in the destructor of Test, remove the third line which has the comment.
Run the program again, and you should see that the performance increases drastically. I tested this exact same code with 4 different projects all brand new, and 3 different computers.
The destructor is called at the very end, when the entire program is finished measuring time. The extra cout shouldn't affect anything.
Edit:
You can also see a similar thing go on if you remove the 3 cout's and replace it with a single *buffer = 'a'. Then CTRL+F5 once again, record the time, and then remove that line we just added. Then run it again and the time magically decreases by half.
WTF is going on, and how do you solve the weird performance difference?

Related

A code works slowly because of compiler optimizations

I have 3 separate global functions and I want to test it's speed. I'm using this code:
// case 1
{
chrono::duration<double, milli> totalTime;
for (uint32_t i{ 0 }; i < REPEATS; ++i)
{
auto start = chrono::steady_clock::now();
func1(); // "normal" c++ code
auto end = chrono::steady_clock::now();
auto diff = end - start;
cout << chrono::duration <double, milli>(diff).count() << " ms" << endl;
}
}
// case 2
{
chrono::duration<double, milli> totalTime;
for (uint32_t i{ 0 }; i < REPEATS; ++i)
{
auto start = chrono::steady_clock::now();
func2(); // multithreaded c++ code
auto end = chrono::steady_clock::now();
auto diff = end - start;
cout << chrono::duration <double, milli>(diff).count() << " ms" << endl;
}
}
// case 3
{
chrono::duration<double, milli> totalTime;
for (uint32_t i{ 0 }; i < REPEATS; ++i)
{
auto start = chrono::steady_clock::now();
func3(); // SIMD c++ code
auto end = chrono::steady_clock::now();
auto diff = end - start;
cout << chrono::duration <double, milli>(diff).count() << " ms" << endl;
}
}
This func1(), func2(), func3() are global functions that doesn't change the state of the program (I don't have any global variables).
The outputed result depends on the running cases. If I run case 1 and case 2 I have 100ms and 10ms respectively. If I run case 1 and case 3 I have 100ms and 130ms. If I run cases 1, 2, 3 I have 130ms, 10ms, 120 ms. The first case became slower on 30% and the third one became faster! If I run cases separatelly I have 100ms, 10ms, 130ms. I tried to turn optimisation off - the code became (surprise, surprise) much slower but at least the results are the same not depending on cases order. So I came to a conclusion that compiler do something special. Is it true?
I'm using Win7 and VS 2013.
A few things can happen:
Your test get's pre-empted by the kernel. Not much you can do to prevent that. Run your tests multiple times to make sure the results are consistent.
The compiler can inline your functions and optimize on the inlined code.
There is interaction between the functions. The simplest thing I would guess is memory allocation (i.e. func1() is optimized to request a memory block sufficient for the entire program, or one of the function brings some memory blocks into cache).
So suggestions:
First run each function before the benchmark to get rid of some of
the memory artifacts.
Run the benchmark a few times and see how
the values fluctuate to eliminate some of the OS artifacts.
Shuffle the order of the functions on each run, or take it as a
parameter, to make sure you eliminate ordering artifacts.
Don't do cout in your loop because that interacts with the OS and can mess up your cache or even get your process preempted. Write the results to some vector and output everything at the end.
There can be other things affecting your performance (i.e. Disk, networking, other processes, memory and cpu load), so take the values with a grain of salt.
Compiler optimizations vary. There are numerous things that compiler can do - one of optimizations(at least in GNU GCC) is aggresive loop unrolling - this might create faster code, but you must be aware that this can cause cache misses, effectively slowing down your code. That is, if we take just the compiler optimizations into consideration.
Now you have three different cases that if run separately give different output. This might be affected by alignment issue - if your code is properly aligned it will be faster, and if it isn't, additional padding might slow it down - I've seen similar thing happening in C#, but I can't find this thread now.
And last thing that could happen is you run too few tests to be sure - 10k tests is decent set, and you can start comparing speed output. One-time tests can be affected by OS, so keep that in mind.
Oh, and because Microsoft is brilliant at writing compilers, there are bugs in certain versions. I don't think the world of Microsoft's C++ compiler - there are many hacks and workarounds, it's not as up-to-date as other popular compilers - but that's simply my opinion. So another option is that compiler is simply malfunctioning. Also, see this and this beautiful typedef.

c++ Read File lines to vector<string>

I made this method to read from a file and put it into a vector of strings;
std::vector<std::string> read_file_lines1(const char* filepath){
std::vector<std::string> file;
std::ifstream input(filepath);
Timer timer;
float time = 0;
std::string line;
int i = 0;
while (getline(input, line)){
timer.reset();
file.push_back(line);
time += timer.elapsed();
if (i == 10000)
std::cout << "10000 done" << std::endl;
i = ((i + 1) % 10001);
}
std::cout << time << std::endl;;
return file;
}
But the performance was really bad in my opinion (200k lines in ~22 seconds)
with a small change making it a vector<string*> (using file.push_back(new std::string(line)) pushback calls went from ~16 seconds to ~1.2 seconds what was a huge improvement (still behind my goals) and it has a small disadvantage: memory usage; if I want to clear the memory used here I will have to remember to make a loop to clear each string*
Now it takes 6~seconds for the whole method, ~5 of them are mostly used in string in the "getline" method and I would really like to know how to optimize it or make an alternative.
PS: I am doing this do load a 3D model, using the same model in Java it takes ~0.8 seconds to read everything AND FILTER (putting "each line in the" vertex/texture... array and then putting them in the index order), so I'm really disappointed if I take that much time to read each line from a file in c++ (using debug mode in both java/c++, that probably makes it quite a bad benchmark but I'm still really disappointed);
Main reason why it is slow, that you need to reallocate memory and move all strings into new location each time when vector capacity is reached. Use std::deque instead of vector, deque doesn't reallocate memory, it adding new chunks. Or you could preallocate vector with reserve method, to avoid reallocations.
Also debug c++ code could be much slower than release, especially with a lot of template and/or inline code - you really need to measure release performance and you need to use timer just once for whole loop as I suspect that in release mode you will be spending a lot of time in timer code.
Another small optimization. instead of
if (i == 10000)
std::cout << "10000 done" << std::endl;
i = ((i + 1) % 10001);
use:
if (i == 10000)
{
std::cout << "10000 done" << std::endl;
i = 0;
}
++i;

How can I print a newline without flushing the buffer?

I was experimenting with c++ trying to figure out how I could print the numbers from 0 to n as fast as possible.
At first I just printed all the numbers with a loop:
for (int i = 0; i < n; i++)
{
std::cout << i << std::endl;
}
However, I think this flushes the buffer after every single number that it outputs, and surely that must take some time, so I tried to first print all the numbers to the buffer (or actually until it's full as it seems then seems to flush automatically) and then flush it all at once. However it seems that printing a \n after flushes the buffer like the std::endl so I omitted it:
for (int i = 0; i < n; i++)
{
std::cout << i << ' ';
}
std::cout << std::endl;
This seems to run about 10 times faster than the first example. However I want to know how to store all the values in the buffer and flush it all at once rather than letting it flush every time it becomes full so I have a few questions:
Is it possible to print a newline without flushing the buffer?
How can I change the buffer size so that I could store all the values inside it and flush it at the very end?
Is this method of outputting text dumb? If so, why, and what would be a better alternative to it?
EDIT: It seems that my results were biased by a laggy system (Terminal app of a smartphone)... With a faster system the execution times show no significant difference.
TL;DR: In general, using '\n' instead of std::endl is faster since std::endl
Explanation:
std::endl causes a flushing of the buffer, whereas '\n' does not.
However, you might or might not notice any speedup whatsoever depending upon the method of testing that you apply.
Consider the following test files:
endl.cpp:
#include <iostream>
int main() {
for ( int i = 0 ; i < 1000000 ; i++ ) {
std::cout << i << std::endl;
}
}
slashn.cpp:
#include <iostream>
int main() {
for ( int i = 0 ; i < 1000000 ; i++ ) {
std::cout << i << '\n';
}
}
Both of these are compiled using g++ on my linux system and undergo the following tests:
1. time ./a.out
For endl.cpp, it takes 19.415s.
For slashn.cpp, it takes 19.312s.
2. time ./a.out >/dev/null
For endl.cpp, it takes 0.397s
For slashn.cpp, it takes 0.153s
3. time ./a.out >temp
For endl.cpp, it takes 2.255s
For slashn.cpp, it takes 0.165s
Conclusion: '\n' is definitely faster (even practically), but the difference in speed can be dependant upon other factors. In the case of a terminal window, the limiting factor seems to depend upon how fast the terminal itself can display the text. As the text is shown on screen, and auto scrolling etc needs to happen, massive slowdowns occur in the execution. On the other hand, for normal files (like the temp example above), the rate at which the buffer is being flushed affects it a lot. In the case of some special files (like /dev/null above), since the data is just sinked into a black-hole, the flushing doesn't seem to have an effect.

Rarely executed and almost empty if statement drastically reduces performance in C++

Editor's clarification: When this was originally posted, there were two issues:
Test performance drops by a factor of three if seemingly inconsequential statement added
Time taken to complete the test appears to vary randomly
The second issue has been solved: the randomness only occurs when running under the debugger.
The remainder of this question should be understood as being about the first bullet point above, and in the context of running in VC++ 2010 Express's Release Mode with optimizations "Maximize Speed" and "favor fast code".
There are still some Comments in the comment section talking about the second point but they can now be disregarded.
I have a simulation where if I add a simple if statement into the while loop that runs the actual simulation, the performance drops about a factor of three (and I run a lot of calculations in the while loop, n-body gravity for the solar system besides other things) even though the if statement is almost never executed:
if (time - cb_last_orbital_update > 5000000)
{
cb_last_orbital_update = time;
}
with time and cb_last_orbital_update being both of type double and defined in the beginning of the main function, where this if statement is too. Usually there are computations I want to run there too, but it makes no difference if I delete them. The if statement as it is above has the same effect on the performance.
The variable time is the simulation time, it increases in 0.001 steps in the beginning so it takes a really long time until the if statement is executed for the first time (I also included printing a message to see if it is being executed, but it is not, or at least only when it's supposed to). Regardless, the performance drops by a factor of 3 even in the first minutes of the simulation when it hasn't been executed once yet. If I comment out the line
cb_last_orbital_update = time;
then it runs faster again, so it's not the check for
time - cb_last_orbital_update > 5000000
either, it's definitely the simple act of writing current simulation time into this variable.
Also, if I write the current time into another variable instead of cb_last_orbital_update, the performance does not drop. So this might be an issue with assigning a new value to a variable that is used to check if the "if" should be executed? These are all shots in the dark though.
Disclaimer: I am pretty new to programming, and sorry for all that text.
I am using Visual C++ 2010 Express, deactivating the stdafx.h precompiled header function didn't make a difference either.
EDIT: Basic structure of the program. Note that nowhere besides at the end of the while loop (time += time_interval;) is time changed. Also, cb_last_orbital_update has only 3 occurrences: Declaration / initialization, plus the two times in the if statement that is causing the problem.
int main(void)
{
...
double time = 0;
double time_interval = 0.001;
double cb_last_orbital_update = 0;
F_Rocket_Preset(time, time_interval, ...);
while(conditions)
{
Rocket[active].Stage[Rocket[active].r_stage].F_Update_Stage_Performance(time, time_interval, ...);
Rocket[active].F_Calculate_Aerodynamic_Variables(time);
Rocket[active].F_Calculate_Gravitational_Forces(cb_mu, cb_pos_d, time);
Rocket[active].F_Update_Rotation(time, time_interval, ...);
Rocket[active].F_Update_Position_Velocity(time_interval, time, ...);
Rocket[active].F_Calculate_Orbital_Elements(cb_mu);
F_Update_Celestial_Bodies(time, time_interval, ...);
if (time - cb_last_orbital_update > 5000000.0)
{
cb_last_orbital_update = time;
}
Rocket[active].F_Check_Apoapsis(time, time_interval);
Rocket[active].F_Status_Check(time, ...);
Rocket[active].F_Update_Mass (time_interval, time);
Rocket[active].F_Staging_Check (time, time_interval);
time += time_interval;
if (time > 3.1536E8)
{
std::cout << "\n\nBreak main loop! Sim Time: " << time << std::endl;
break;
}
}
...
}
EDIT 2:
Here is the difference in the assembly code. On the left is the fast code with the line
cb_last_orbital_update = time;
outcommented, on the right the slow code with the line.
EDIT 4:
So, i found a workaround that seems to work just fine so far:
int cb_orbit_update_counter = 1; // before while loop
if(time - cb_orbit_update_counter * 5E6 > 0)
{
cb_orbit_update_counter++;
}
EDIT 5:
While that workaround does work, it only works in combination with using __declspec(noinline). I just removed those from the function declarations again to see if that changes anything, and it does.
EDIT 6: Sorry this is getting confusing. I tracked down the culprit for the lower performance when removing __declspec(noinline) to this function, that is being executed inside the if:
__declspec(noinline) std::string F_Get_Body_Name(int r_body)
{
switch (r_body)
{
case 0:
{
return ("the Sun");
}
case 1:
{
return ("Mercury");
}
case 2:
{
return ("Venus");
}
case 3:
{
return ("Earth");
}
case 4:
{
return ("Mars");
}
case 5:
{
return ("Jupiter");
}
case 6:
{
return ("Saturn");
}
case 7:
{
return ("Uranus");
}
case 8:
{
return ("Neptune");
}
case 9:
{
return ("Pluto");
}
case 10:
{
return ("Ceres");
}
case 11:
{
return ("the Moon");
}
default:
{
return ("unnamed body");
}
}
}
The if also now does more than just increase the counter:
if(time - cb_orbit_update_counter * 1E7 > 0)
{
F_Update_Orbital_Elements_Of_Celestial_Bodies(args);
std::cout << F_Get_Body_Name(3) << " SMA: " << cb_sma[3] << "\tPos Earth: " << cb_pos_d[3][0] << " / " << cb_pos_d[3][1] << " / " << cb_pos_d[3][2] <<
"\tAlt: " << sqrt(pow(cb_pos_d[3][0] - cb_pos_d[0][0],2) + pow(cb_pos_d[3][1] - cb_pos_d[0][1],2) + pow(cb_pos_d[3][2] - cb_pos_d[0][2],2)) << std::endl;
std::cout << "Time: " << time << "\tcb_o_h[3]: " << cb_o_h[3] << std::endl;
cb_orbit_update_counter++;
}
I remove __declspec(noinline) from the function F_Get_Body_Name alone, the code gets slower. Similarly, if i remove the execution of this function or add __declspec(noinline) again, the code runs faster. All other functions still have __declspec(noinline).
EDIT 7:
So i changed the switch function to
const std::string cb_names[] = {"the Sun","Mercury","Venus","Earth","Mars","Jupiter","Saturn","Uranus","Neptune","Pluto","Ceres","the Moon","unnamed body"}; // global definition
const int cb_number = 12; // global definition
std::string F_Get_Body_Name(int r_body)
{
if (r_body >= 0 && r_body < cb_number)
{
return (cb_names[r_body]);
}
else
{
return (cb_names[cb_number]);
}
}
and also made another part of the code slimmer. The program now runs fast without any __declspec(noinline). As ElderBug suggested, an issue with the CPU instruction cache then / the code getting too big?
I'd put my money on Intel's branch predictor. http://en.wikipedia.org/wiki/Branch_predictor
The processor assumes (time - cb_last_orbital_update > 5000000) to be false most of the time and loads up the execution pipeline accordingly.
Once the condition (time - cb_last_orbital_update > 5000000) comes true. The misprediction delay is hitting you. You may loose 10 to 20 cycles.
if (time - cb_last_orbital_update > 5000000)
{
cb_last_orbital_update = time;
}
Something is happening that you don't expect.
One candidate is some uninitialised variables hanging around somewhere, which have different values depending on the exact code that you are running. For example, you might have uninitialised memory that is sometime a denormalised floating point number, and sometime it's not.
I think it should be clear that your code doesn't do what you expect it to do. So try debugging your code, compile with all warnings enabled, make sure you use the same compiler options (optimised vs. non-optimised can easily be a factor 10). Check that you get the same results.
Especially when you say "it runs faster again (this doesn't always work though, but i can't see a pattern). Also worked with changing 5000000 to 5E6 once. It only runs fast once though, recompiling causes the performance to drop again without changing anything. One time it ran slower only after recompiling twice." it looks quite likely that you are using different compiler options.
I will try another guess. This is hypothetical, and would be mostly due to the compiler.
My guess is that you use a lot of floating point calculations, and the introduction and use of double values in your main makes the compiler run out of XMM registers (the floating point SSE registers). This force the compiler to use memory instead of registers, and induce a lot of swapping between memory and registers, thus greatly reducing the performance. This would be happening mainly because of the computations functions inlining, because function calls are preserving registers.
The solution would be to add __declspec(noinline) to ALL your computation functions declarations.
I suggest using the Microsoft Profile Guided Optimizer -- if the compiler is making the wrong assumption for this particular branch it will help, and it will in all likelihood improve speed for the rest of the code as well.
Workaround, try 2:
The code is now looking like this:
int cb_orbit_update_counter = 1; // before while loop
if(time - cb_orbit_update_counter * 5E6 > 0)
{
cb_orbit_update_counter++;
}
So far it runs fast, plus the code is being executed when it should as far as i can tell. Again only a workaround, but if this proves to work all around then i'm satisfied.
After some more testing, seems good.
My guess is that this is because the variable cb_last_orbital_update is otherwise read-only, so when you assign to it inside the if, it destroys some optimizations that the compiler has for read-only variables (e.g. perhaps it's now stored in memory instead of a register).
Something to try (although this might still not work) is to make a third variable that is initialized via cb_last_orbital_update and time depending on whether the condition is true, and using that one instead. Presumably, the compiler would now treat that variable as a constant, but I'm not sure.

Why do std::string operations perform poorly?

I made a test to compare string operations in several languages for choosing a language for the server-side application. The results seemed normal until I finally tried C++, which surprised me a lot. So I wonder if I had missed any optimization and come here for help.
The test are mainly intensive string operations, including concatenate and searching. The test is performed on Ubuntu 11.10 amd64, with GCC's version 4.6.1. The machine is Dell Optiplex 960, with 4G RAM, and Quad-core CPU.
in Python (2.7.2):
def test():
x = ""
limit = 102 * 1024
while len(x) < limit:
x += "X"
if x.find("ABCDEFGHIJKLMNOPQRSTUVWXYZ", 0) > 0:
print("Oh my god, this is impossible!")
print("x's length is : %d" % len(x))
test()
which gives result:
x's length is : 104448
real 0m8.799s
user 0m8.769s
sys 0m0.008s
in Java (OpenJDK-7):
public class test {
public static void main(String[] args) {
int x = 0;
int limit = 102 * 1024;
String s="";
for (; s.length() < limit;) {
s += "X";
if (s.indexOf("ABCDEFGHIJKLMNOPQRSTUVWXYZ") > 0)
System.out.printf("Find!\n");
}
System.out.printf("x's length = %d\n", s.length());
}
}
which gives result:
x's length = 104448
real 0m50.436s
user 0m50.431s
sys 0m0.488s
in Javascript (Nodejs 0.6.3)
function test()
{
var x = "";
var limit = 102 * 1024;
while (x.length < limit) {
x += "X";
if (x.indexOf("ABCDEFGHIJKLMNOPQRSTUVWXYZ", 0) > 0)
console.log("OK");
}
console.log("x's length = " + x.length);
}();
which gives result:
x's length = 104448
real 0m3.115s
user 0m3.084s
sys 0m0.048s
in C++ (g++ -Ofast)
It's not surprising that Nodejs performas better than Python or Java. But I expected libstdc++ would give much better performance than Nodejs, whose result really suprised me.
#include <iostream>
#include <string>
using namespace std;
void test()
{
int x = 0;
int limit = 102 * 1024;
string s("");
for (; s.size() < limit;) {
s += "X";
if (s.find("ABCDEFGHIJKLMNOPQRSTUVWXYZ", 0) != string::npos)
cout << "Find!" << endl;
}
cout << "x's length = " << s.size() << endl;
}
int main()
{
test();
}
which gives result:
x length = 104448
real 0m5.905s
user 0m5.900s
sys 0m0.000s
Brief Summary
OK, now let's see the summary:
javascript on Nodejs(V8): 3.1s
Python on CPython 2.7.2 : 8.8s
C++ with libstdc++: 5.9s
Java on OpenJDK 7: 50.4s
Surprisingly! I tried "-O2, -O3" in C++ but noting helped. C++ seems about only 50% performance of javascript in V8, and even poor than CPython. Could anyone explain to me if I had missed some optimization in GCC or is this just the case? Thank you a lot.
It's not that std::string performs poorly (as much as I dislike C++), it's that string handling is so heavily optimized for those other languages.
Your comparisons of string performance are misleading, and presumptuous if they are intended to represent more than just that.
I know for a fact that Python string objects are completely implemented in C, and indeed on Python 2.7, numerous optimizations exist due to the lack of separation between unicode strings and bytes. If you ran this test on Python 3.x you will find it considerably slower.
Javascript has numerous heavily optimized implementations. It's to be expected that string handling is excellent here.
Your Java result may be due to improper string handling, or some other poor case. I expect that a Java expert could step in and fix this test with a few changes.
As for your C++ example, I'd expect performance to slightly exceed the Python version. It does the same operations, with less interpreter overhead. This is reflected in your results. Preceding the test with s.reserve(limit); would remove reallocation overhead.
I'll repeat that you're only testing a single facet of the languages' implementations. The results for this test do not reflect the overall language speed.
I've provided a C version to show how silly such pissing contests can be:
#define _GNU_SOURCE
#include <string.h>
#include <stdio.h>
void test()
{
int limit = 102 * 1024;
char s[limit];
size_t size = 0;
while (size < limit) {
s[size++] = 'X';
if (memmem(s, size, "ABCDEFGHIJKLMNOPQRSTUVWXYZ", 26)) {
fprintf(stderr, "zomg\n");
return;
}
}
printf("x's length = %zu\n", size);
}
int main()
{
test();
return 0;
}
Timing:
matt#stanley:~/Desktop$ time ./smash
x's length = 104448
real 0m0.681s
user 0m0.680s
sys 0m0.000s
So I went and played a bit with this on ideone.org.
Here a slightly modified version of your original C++ program, but with the appending in the loop eliminated, so it only measures the call to std::string::find(). Note that I had to cut the number of iterations to ~40%, otherwise ideone.org would kill the process.
#include <iostream>
#include <string>
int main()
{
const std::string::size_type limit = 42 * 1024;
unsigned int found = 0;
//std::string s;
std::string s(limit, 'X');
for (std::string::size_type i = 0; i < limit; ++i) {
//s += 'X';
if (s.find("ABCDEFGHIJKLMNOPQRSTUVWXYZ", 0) != std::string::npos)
++found;
}
if(found > 0)
std::cout << "Found " << found << " times!\n";
std::cout << "x's length = " << s.size() << '\n';
return 0;
}
My results at ideone.org are time: 3.37s. (Of course, this is highly questionably, but indulge me for a moment and wait for the other result.)
Now we take this code and swap the commented lines, to test appending, rather than finding. Note that, this time, I had increased the number of iterations tenfold in trying to see any time result at all.
#include <iostream>
#include <string>
int main()
{
const std::string::size_type limit = 1020 * 1024;
unsigned int found = 0;
std::string s;
//std::string s(limit, 'X');
for (std::string::size_type i = 0; i < limit; ++i) {
s += 'X';
//if (s.find("ABCDEFGHIJKLMNOPQRSTUVWXYZ", 0) != std::string::npos)
// ++found;
}
if(found > 0)
std::cout << "Found " << found << " times!\n";
std::cout << "x's length = " << s.size() << '\n';
return 0;
}
My results at ideone.org, despite the tenfold increase in iterations, are time: 0s.
My conclusion: This benchmark is, in C++, highly dominated by the searching operation, the appending of the character in the loop has no influence on the result at all. Was that really your intention?
The idiomatic C++ solution would be:
#include <iostream>
#include <string>
#include <algorithm>
int main()
{
const int limit = 102 * 1024;
std::string s;
s.reserve(limit);
const std::string pattern("ABCDEFGHIJKLMNOPQRSTUVWXYZ");
for (int i = 0; i < limit; ++i) {
s += 'X';
if (std::search(s.begin(), s.end(), pattern.begin(), pattern.end()) != s.end())
std::cout << "Omg Wtf found!";
}
std::cout << "X's length = " << s.size();
return 0;
}
I could speed this up considerably by putting the string on the stack, and using memmem -- but there seems to be no need. Running on my machine, this is over 10x the speed of the python solution already..
[On my laptop]
time ./test
X's length = 104448
real 0m2.055s
user 0m2.049s
sys 0m0.001s
That is the most obvious one: please try to do s.reserve(limit); before main loop.
Documentation is here.
I should mention that direct usage of standard classes in C++ in the same way you are used to do it in Java or Python will often give you sub-par performance if you are unaware of what is done behind the desk. There is no magical performance in language itself, it just gives you right tools.
My first thought is that there isn't a problem.
C++ gives second-best performance, nearly ten times faster than Java. Maybe all but Java are running close to the best performance achievable for that functionality, and you should be looking at how to fix the Java issue (hint - StringBuilder).
In the C++ case, there are some things to try to improve performance a bit. In particular...
s += 'X'; rather than s += "X";
Declare string searchpattern ("ABCDEFGHIJKLMNOPQRSTUVWXYZ"); outside the loop, and pass this for the find calls. An std::string instance knows it's own length, whereas a C string requires a linear-time check to determine that, and this may (or may not) be relevant to std::string::find performance.
Try using std::stringstream, for a similar reason to why you should be using StringBuilder for Java, though most likely the repeated conversions back to string will create more problems.
Overall, the result isn't too surprising though. JavaScript, with a good JIT compiler, may be able to optimise a little better than C++ static compilation is allowed to in this case.
With enough work, you should always be able to optimise C++ better than JavaScript, but there will always be cases where that doesn't just naturally happen and where it may take a fair bit of knowledge and effort to achieve that.
What you are missing here is the inherent complexity of the find search.
You are executing the search 102 * 1024 (104 448) times. A naive search algorithm will, each time, try to match the pattern starting from the first character, then the second, etc...
Therefore, you have a string that is going from length 1 to N, and at each step you search the pattern against this string, which is a linear operation in C++. That is N * (N+1) / 2 = 5 454 744 576 comparisons. I am not as surprised as you are that this would take some time...
Let us verify the hypothesis by using the overload of find that searches for a single A:
Original: 6.94938e+06 ms
Char : 2.10709e+06 ms
About 3 times faster, so we are within the same order of magnitude. Therefore the use of a full string is not really interesting.
Conclusion ? Maybe that find could be optimized a bit. But the problem is not worth it.
Note: and to those who tout Boyer Moore, I am afraid that the needle is too small, so it won't help much. May cut an order of magnitude (26 characters), but no more.
For C++, try to use std::string for "ABCDEFGHIJKLMNOPQRSTUVWXYZ" - in my implementation string::find(const charT* s, size_type pos = 0) const calculates length of string argument.
I just tested the C++ example myself. If I remove the the call to std::sting::find, the program terminates in no time. Thus the allocations during string concatenation is no problem here.
If I add a variable sdt::string abc = "ABCDEFGHIJKLMNOPQRSTUVWXYZ" and replace the occurence of "ABC...XYZ" in the call of std::string::find, the program needs almost the same time to finish as the original example. This again shows that allocation as well as computing the string's length does not add much to the runtime.
Therefore, it seems that the string search algorithm used by libstdc++ is not as fast for your example as the search algorithms of javascript or python. Maybe you want to try C++ again with your own string search algorithm which fits your purpose better.
C/C++ language are not easy and take years make fast programs.
with strncmp(3) version modified from c version:
#define _GNU_SOURCE
#include <string.h>
#include <stdio.h>
void test()
{
int limit = 102 * 1024;
char s[limit];
size_t size = 0;
while (size < limit) {
s[size++] = 'X';
if (!strncmp(s, "ABCDEFGHIJKLMNOPQRSTUVWXYZ", 26)) {
fprintf(stderr, "zomg\n");
return;
}
}
printf("x's length = %zu\n", size);
}
int main()
{
test();
return 0;
}
Your test code is checking a pathological scenario of excessive string concatenation. (The string-search part of the test could have probably been omitted, I bet you it contributes almost nothing to the final results.) Excessive string concatenation is a pitfall that most languages warn very strongly against, and provide very well known alternatives for, (i.e. StringBuilder,) so what you are essentially testing here is how badly these languages fail under scenarios of perfectly expected failure. That's pointless.
An example of a similarly pointless test would be to compare the performance of various languages when throwing and catching an exception in a tight loop. All languages warn that exception throwing and catching is abysmally slow. They do not specify how slow, they just warn you not to expect anything. Therefore, to go ahead and test precisely that, would be pointless.
So, it would make a lot more sense to repeat your test substituting the mindless string concatenation part (s += "X") with whatever construct is offered by each one of these languages precisely for avoiding string concatenation. (Such as class StringBuilder.)
As mentioned by sbi, the test case is dominated by the search operation.
I was curious how fast the text allocation compares between C++ and Javascript.
System: Raspberry Pi 2, g++ 4.6.3, node v0.12.0, g++ -std=c++0x -O2 perf.cpp
C++ : 770ms
C++ without reserve: 1196ms
Javascript: 2310ms
C++
#include <iostream>
#include <string>
#include <chrono>
using namespace std;
using namespace std::chrono;
void test()
{
high_resolution_clock::time_point t1 = high_resolution_clock::now();
int x = 0;
int limit = 1024 * 1024 * 100;
string s("");
s.reserve(1024 * 1024 * 101);
for(int i=0; s.size()< limit; i++){
s += "SUPER NICE TEST TEXT";
}
high_resolution_clock::time_point t2 = high_resolution_clock::now();
auto duration = std::chrono::duration_cast<std::chrono::milliseconds>( t2 - t1 ).count();
cout << duration << endl;
}
int main()
{
test();
}
JavaScript
function test()
{
var time = process.hrtime();
var x = "";
var limit = 1024 * 1024 * 100;
for(var i=0; x.length < limit; i++){
x += "SUPER NICE TEST TEXT";
}
var diff = process.hrtime(time);
console.log('benchmark took %d ms', diff[0] * 1e3 + diff[1] / 1e6 );
}
test();
It seems that in nodejs there are better algorithms for substring search. You can implement it by yourself and try it out.