`time` benchmarking utility: What does `user`+`sys` = 0 actually mean? - c++

One comparison I felt was missing from the "Why is this C++ program so incredibly fast?" discussion is Fortran. I translated Sven Hager's C++ benchmark:
#include <iostream>
#include <cstdlib>
#include <cstdint>

int main(int argc, char* argv[]) {
    uint32_t s = 0;
    uint32_t outer = atoi(argv[1]);
    uint32_t inner = atoi(argv[2]);
    for (uint32_t i = 0; i < outer; ++i) {
        for (uint32_t j = 0; j < inner; ++j)
            ++s;
        s -= inner;
    }
    std::cout << s << std::endl;
    return 0;
}
to its Fortran equivalent:
PROGRAM Benchmark
    IMPLICIT NONE
    INTEGER :: i, j, s
    INTEGER, PARAMETER :: outer = 1000, inner = 1000000
    s = 0
    DO i = 1, outer
        DO j = 1, inner
            s = s + 1
        END DO
        s = s - inner
    END DO
    PRINT *, s
END PROGRAM Benchmark
and compiled a fully optimized version with gfortran -g -std=f2008 -Wall -Wextra -O3 Benchmark.f08. I expected to achieve performance similar to Herr Hager's:
real 0m0.003s
user 0m0.002s
sys 0m0.002s
What I got was a little puzzling:
real 0m0.003s
user 0m0.000s
sys 0m0.000s
Digging more deeply, I found this discussion on What do 'real', 'user' and 'sys' mean in the output of time(1)?. In it they say that user+sys gives the actual CPU time the process used. So, what does a user+sys of zero actually mean?

In this case, "zero" means "less than a millisecond" (or half a millisecond, depending on how it's rounded), since that's the resolution of the times given by time.
It's only useful for measuring programs that take considerably longer than a millisecond to run.
The reason the Fortran version is so much faster is probably that the loop bounds are hard-coded constants, so the entire calculation can be done at compile time, leaving just the PRINT of the result (0) to do at runtime.
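If you need to measure something that finishes in under a millisecond, one option (my own sketch, not from the original discussion) is to time the region of interest inside the program with std::chrono instead of relying on time(1); marking the accumulator volatile also keeps the compiler from doing the whole loop at compile time:
#include <chrono>
#include <cstdint>
#include <iostream>

int main() {
    auto start = std::chrono::steady_clock::now();

    // Work to be measured; volatile prevents the loop from being
    // evaluated at compile time and optimized away entirely.
    volatile std::uint32_t s = 0;
    for (std::uint32_t i = 0; i < 1000000; ++i)
        s = s + 1;

    auto stop = std::chrono::steady_clock::now();
    std::cout << std::chrono::duration_cast<std::chrono::microseconds>(stop - start).count()
              << " microseconds" << std::endl;
    return 0;
}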

Related

infinite for loop with correct limits

Can you please explain why this code is going into an infinite loop? I am unable to find the error. It works fine with small values of n and m.
#include <bits/stdc++.h>
using namespace std;

int main()
{
    long long n = 1000000, m = 1000000;
    long long k = 1;
    for (long long i = 0; i < n; i++)
    {
        for (long long j = 0; j < m; j++)
        {
            k++;
        }
    }
    cout << k;
    return 0;
}
It's not infinite: that k++ operation has to run 1,000,000 * 1,000,000 = 1,000,000,000,000 times, which simply takes too long. That's exactly why it works fine with small values of n and m.
It is a typical target for optimization.
Build with -Ofast.
g++ t_duration.cpp -Ofast -std=c++11 -o a_fast
#time ./a_fast
1000000000001
real 0m0.002s
user 0m0.000s
sys 0m0.002s
It takes almost no time to return the output.
Build with -O1.
g++ t_duration.cpp -O1 -std=c++11 -o a_1
#./a_1
419774 ms
About 420 seconds to complete the calculation.
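What the optimizer effectively does here is recognize that the nested loop only counts, and collapse it into a closed-form expression, roughly like this (my own sketch of the transformation, not the actual generated code):
#include <iostream>

int main()
{
    long long n = 1000000, m = 1000000;
    // The loop nest only increments k, n*m times in total,
    // so it reduces to one multiply and one add.
    long long k = 1 + n * m;   // 1000000000001
    std::cout << k;
    return 0;
}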
When I run this code I get this as my output:
1000000000001
Reference: I ran this code in the CodeChef IDE (https://www.codechef.com/ide). You can try that IDE once; I guess there is some problem with your IDE, or it might be some other issue. It took me less than 20 seconds to run (on the clock).
But when I put the same code in Code::Blocks it takes a long time (the "infinite loop" you described), and the reason is quite simple: it has to do 1,000,000,000,000 iterations.
This raises a new question out of your question: what is the difference between the CodeChef IDE and the Code::Blocks compiler?
So, finally, the answer is: try it in the CodeChef IDE; the code runs fast there. Hope this helps!

Adding a print statement speeds up code by an order of magnitude

As suggested in the title, I've hit some extremely bizarre performance behavior in a piece of C/C++ code which I have no idea how to explain.
Here's an as-close-as-I've-found-to-minimal working example [EDIT: see below for a shorter one]:
#include <stdio.h>
#include <stdlib.h>
#include <complex>
using namespace std;

const int pp = 29;
typedef complex<double> cdbl;

int main() {
    cdbl ff[pp], gg[pp];
    for(int ii = 0; ii < pp; ii++) {
        ff[ii] = gg[ii] = 1.0;
    }

    for(int it = 0; it < 1000; it++) {
        cdbl dual[pp];
        for(int ii = 0; ii < pp; ii++) {
            dual[ii] = 0.0;
        }
        for(int h1 = 0; h1 < pp; h1 ++) {
            for(int h2 = 0; h2 < pp; h2 ++) {
                cdbl avg_right = 0.0;
                for(int xx = 0; xx < pp; xx ++) {
                    int c00 = xx, c01 = (xx + h1) % pp, c10 = (xx + h2) % pp,
                        c11 = (xx + h1 + h2) % pp;
                    avg_right += ff[c00] * conj(ff[c01]) * conj(ff[c10]) * gg[c11];
                }
                avg_right /= static_cast<cdbl>(pp);
                for(int xx = 0; xx < pp; xx ++) {
                    int c01 = (xx + h1) % pp, c10 = (xx + h2) % pp,
                        c11 = (xx + h1 + h2) % pp;
                    dual[xx] += conj(ff[c01]) * conj(ff[c10]) * ff[c11] * conj(avg_right);
                }
            }
        }
        for(int ii = 0; ii < pp; ii++) {
            dual[ii] = conj(dual[ii]) / static_cast<double>(pp*pp);
        }
        for(int ii = 0; ii < pp; ii++) {
            gg[ii] = dual[ii];
        }
#ifdef I_WANT_THIS_TO_RUN_REALLY_FAST
        printf("%.15lf\n", gg[0].real());
#else // I_WANT_THIS_TO_RUN_REALLY_SLOWLY
#endif
    }
    printf("%.15lf\n", gg[0].real());
    return 0;
}
Here are the results of running this on my system:
me@mine $ g++ -o test.elf test.cc -Wall -Wextra -O2
me@mine $ time ./test.elf > /dev/null
real 0m7.329s
user 0m7.328s
sys 0m0.000s
me@mine $ g++ -o test.elf test.cc -Wall -Wextra -O2 -DI_WANT_THIS_TO_RUN_REALLY_FAST
me@mine $ time ./test.elf > /dev/null
real 0m0.492s
user 0m0.490s
sys 0m0.001s
me@mine $ g++ --version
g++ (Gentoo 4.9.4 p1.0, pie-0.6.4) 4.9.4 [snip]
It's not terribly important what this code computes: it's just a tonne of complex arithmetic on arrays of length 29. It's been "simplified" from a much larger tonne of complex arithmetic that I care about.
So, the behavior seems to be, as claimed in the title: if I put this print statement back in, the code gets a lot faster.
I've played around a bit: e.g., printing a constant string doesn't give the speedup, but printing the clock time does. There's a pretty clear threshold: the code is either fast or slow.
I considered the possibility that some bizarre compiler optimization either does or doesn't kick in, maybe depending on whether the code does or doesn't have side effects. But, if so it's pretty subtle: when I looked at the disassembled binaries, they're seemingly identical except that one has an extra print statement in and they use different interchangeable registers. I may (must?) have missed something important.
I'm at a total loss to explain what on earth could be causing this. Worse, it does actually affect my life because I'm running related code a lot, and going round inserting extra print statements does not feel like a good solution.
Any plausible theories would be very welcome. Responses along the lines of "your computer's broken" are acceptable if you can explain how that might explain anything.
UPDATE: with apologies for the increasing length of the question, I've shrunk the example to
#include <stdio.h>
#include <stdlib.h>
#include <complex>
using namespace std;

const int pp = 29;
typedef complex<double> cdbl;

int main() {
    cdbl ff[pp];
    cdbl blah = 0.0;
    for(int ii = 0; ii < pp; ii++) {
        ff[ii] = 1.0;
    }

    for(int it = 0; it < 1000; it++) {
        cdbl xx = 0.0;
        for(int kk = 0; kk < 100; kk++) {
            for(int ii = 0; ii < pp; ii++) {
                for(int jj = 0; jj < pp; jj++) {
                    xx += conj(ff[ii]) * conj(ff[jj]) * ff[ii];
                }
            }
        }
        blah += xx;

        printf("%.15lf\n", blah.real());
    }
    printf("%.15lf\n", blah.real());
    return 0;
}
I could make it even smaller but already the machine code is manageable. If I change five bytes of the binary corresponding to the callq instruction for that first printf, to 0x90, the execution goes from fast to slow.
The compiled code is very heavy with function calls to __muldc3(). I feel it must be to do with how the Broadwell architecture does or doesn't handle these jumps well: both versions run the same number of instructions so it's a difference in instructions / cycle (about 0.16 vs about 2.8).
Also, compiling with -static makes things fast again.
Further shameless update: I'm conscious I'm the only one who can play with this, so here are some more observations:
It seems like calling any library function, including some foolish ones I made up that do nothing, for the first time puts the execution into the slow state. A subsequent call to printf, fprintf or sprintf somehow clears the state and execution is fast again. So, importantly, the first time __muldc3() is called we go into the slow state, and the next {,f,s}printf resets everything.
Once a library function has been called once, and the state has been reset, that function becomes free and you can use it as much as you like without changing the state.
So, e.g.:
#include <stdio.h>
#include <stdlib.h>
#include <complex>
using namespace std;

int main() {
    complex<double> foo = 0.0;
    foo += foo * foo; // 1
    char str[10];
    sprintf(str, "%c\n", 'c');
    //fflush(stdout); // 2
    for(int it = 0; it < 100000000; it++) {
        foo += foo * foo;
    }
    return (foo.real() > 10.0);
}
is fast, but commenting out line 1 or uncommenting line 2 makes it slow again.
It must be relevant that the first time a library call is run the "trampoline" in the PLT is initialized to point to the shared library. So, maybe somehow this dynamic loading code leaves the processor frontend in a bad place until it's "rescued".
For the record, I finally figured this out.
It turns out this is to do with AVX-SSE transition penalties. To quote this exposition from Intel:
When using Intel® AVX instructions, it is important to know that mixing 256-bit Intel® AVX instructions with legacy (non VEX-encoded) Intel® SSE instructions may result in penalties that could impact performance. 256-bit Intel® AVX instructions operate on the 256-bit YMM registers which are 256-bit extensions of the existing 128-bit XMM registers. 128-bit Intel® AVX instructions operate on the lower 128 bits of the YMM registers and zero the upper 128 bits. However, legacy Intel® SSE instructions operate on the XMM registers and have no knowledge of the upper 128 bits of the YMM registers. Because of this, the hardware saves the contents of the upper 128 bits of the YMM registers when transitioning from 256-bit Intel® AVX to legacy Intel® SSE, and then restores these values when transitioning back from Intel® SSE to Intel® AVX (256-bit or 128-bit). The save and restore operations both cause a penalty that amounts to several tens of clock cycles for each operation.
The compiled version of my main loops above includes legacy SSE instructions (movapd and friends, I think), whereas the implementation of __muldc3 in libgcc_s uses a lot of fancy AVX instructions (vmovapd, vmulsd etc.).
This is the ultimate cause of the slowdown.
Indeed, Intel performance diagnostics show that this AVX/SSE switching happens almost exactly once each way per call of `__muldc3' (in the last version of the code posted above):
$ perf stat -e cpu/event=0xc1,umask=0x08/ -e cpu/event=0xc1,umask=0x10/ ./slow.elf
Performance counter stats for './slow.elf':
100,000,064 cpu/event=0xc1,umask=0x08/
100,000,118 cpu/event=0xc1,umask=0x10/
(event codes taken from table 19.5 of another Intel manual).
That leaves the question of why the slowdown turns on when you call a library function for the first time, and turns off again when you call printf, sprintf or whatever. The clue is in the first document again:
When it is not possible to remove the transitions, it is often possible to avoid the penalty by explicitly zeroing the upper 128-bits of the YMM registers, in which case the hardware does not save these values.
I think the full story is therefore as follows. When you call a library function for the first time, the trampoline code in ld-linux-x86-64.so that sets up the PLT leaves the upper bits of the YMM registers in a non-zero state. When you call sprintf among other things it zeros out the upper bits of the YMM registers (whether by chance or design, I'm not sure).
Replacing the sprintf call with asm("vzeroupper"), which instructs the processor explicitly to zero these high bits, has the same effect.
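For concreteness, here is roughly what that substitution looks like in the small test program above (a sketch assuming GCC/Clang inline-asm syntax; as stated, the effect is the same as the sprintf call):
#include <complex>
using namespace std;

int main() {
    complex<double> foo = 0.0;
    foo += foo * foo;            // first call into __muldc3 leaves the upper YMM bits dirty
    asm volatile("vzeroupper");  // explicitly zero the upper 128 bits instead of calling sprintf
    for(int it = 0; it < 100000000; it++) {
        foo += foo * foo;
    }
    return (foo.real() > 10.0);
}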
The effect can be eliminated by adding -mavx or -march=native to the compile flags, which is how the rest of the system was built. Why this doesn't happen by default is just a mystery of my system I guess.
I'm not quite sure what we learn here, but there it is.

Array of array and filtering performance

What is the most efficient way to filter on an array of array?
Let's take this example:
[
  [key, timestamp, value],
  [2, 12211212440, 3.98],
  [2, 12211212897, 3.78],
  ...,
  [3, 12211212440, 3.28],
  [3, 12211212897, 3.58],
  ...,
  [4, 12211212440, 4.98],
  ... about 1 million rows :-)
]
So let's imagine we have more than 1 million rows. I want to filter by key and timestamp, e.g. key in [3,7,9,...] and timestamp between time1 and time2.
Of course the first idea is to use a database, but for some reason I want to do it directly inside the application and not with third-party software.
Therefore I am wondering what the most efficient way would be to perform this filtering over a large amount of data. C++ STL containers? Python NumPy? ...? But NumPy is based on C/C++, so I would imagine that the most efficient way is C/C++.
So could you give me some good ideas?
Well, if you really want to absolutely optimize the performance here, the answer is pretty much always to hand-code it in C/C++. But optimizing run-time performance usually means spending a lot longer writing the code, so you need to figure out that tradeoff for yourself.
But you're right: numpy is indeed written in C, and most of its processing happens at that level. As a result, the algorithms it implements are typically very close to the speed you could get by coding them yourself in C (or faster, because people have been working on numpy for a while). So you probably shouldn't bother re-coding anything you can find in numpy.
Now the question is whether or not numpy has a built-in way of doing what you want. Here's one way to do it:
import numpy as np

keys = np.array([2, 4])
t1 = 12211212440
t2 = 12211212897

d = np.array([[key, timestamp, value]
              for key in range(5)
              for timestamp in range(12211172430, 12211212997+1)
              for value in [3.98, 3.78, 3.28, 3.58, 4.98]])

filtered = d[(d[:, 1] >= t1) & (d[:, 1] <= t2)]     # Select rows in the time span
filtered = filtered[np.in1d(filtered[:, 0], keys)]  # Select rows where key is in keys
It's not quite built-in, but two lines isn't bad. Note that this d I made up has a little over a million entries. On my laptop, this runs in around 4 milliseconds, which is around 4 nanoseconds per entry. That's almost as fast as I would expect to be able to do it in C/C++.
Of course, there are other ways to do it. Also, you might want to make a custom numpy dtype to handle the fact that you evidently have int, int, float, whereas numpy assumes float, float, float. If I use a numpy recarray to do this, I get about a 50% speedup. There's also pandas, which can handle that automatically, and has lots of selection functions. In particular, translating the above straight to pandas runs in a little under 4 milliseconds -- maybe pandas is more clever about something:
import pandas as pd

keys = [2, 4]
t1 = 12211212440
t2 = 12211212897

d_pandas = pd.DataFrame([(key, timestamp, value)
                         for key in range(5)
                         for timestamp in range(12211172430, 12211212997+1)
                         for value in [3.98, 3.78, 3.28, 3.58, 4.98]],
                        columns=['key', 'timestamp', 'value'])

filtered = d_pandas[(d_pandas.timestamp >= t1) & (d_pandas.timestamp <= t2)]
filtered = filtered[filtered.key.isin(keys)]
Of course, this isn't how I would code it in C++, so you could do better in principle. In particular, I would expect to see a slowdown due to looping over the array two or three separate times; in C++ I would just do the loop once, check one condition, if that's true check the second condition, and if that's true keep the row. If you're already using Python, you could also look into numba, which lets you write Python but compiles it on the fly for you, so it's about as fast as C/C++. Even code as dumb as this runs very quickly:
import numba as nb
import numpy as np

keys = np.array([2, 4], dtype='i8')
t1 = 12211212440
t2 = 12211212897

d_recarray = np.rec.array([(key, timestamp, value)
                           for key in range(5)
                           for timestamp in range(12211172430, 12211212997+1)
                           for value in [3.98, 3.78, 3.28, 3.58, 4.98]],
                          dtype=[('key', 'i8'), ('timestamp', 'i8'), ('value', 'f8')])

@nb.njit
def select_elements_recarrray(d_in, keys, t1, t2):
    d_out = np.empty_like(d_in)
    k = 0
    for i in range(len(d_in)):
        if d_in[i].timestamp >= t1 and d_in[i].timestamp <= t2:
            matched_key = False
            for j in range(len(keys)):
                if d_in[i].key == keys[j]:
                    matched_key = True
                    break
            if matched_key:
                d_out[k] = d_in[i]
                k += 1
    return d_out[:k]

filtered = select_elements_recarrray(d_recarray, keys, t1, t2)
The jit compilation takes a little time, though much less time than compiling a typical C code, and it also only has to happen once. And then the filtering runs in just over one millisecond on my laptop -- almost four times faster than the numpy code, and about one nanosecond per input array element. This is roughly as fast as I would expect to be able to do anything, even in C/C++. So I don't expect it would be worth your time optimizing this.
I suggest you try something like this because it's so short and fast (and done for you). Then, only if it's too slow, put in the work to do something better. I don't know what you're doing, but chances are this will not make up a large fraction of the time your code takes to run.
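For comparison, here is a minimal C++ sketch of the single-pass loop described above (the Row struct, the field types and the example values are my own assumptions, not from the question):
#include <algorithm>
#include <cstdint>
#include <vector>

struct Row {
    std::int64_t key;
    std::int64_t timestamp;
    double value;
};

// Keep rows whose timestamp lies in [t1, t2] and whose key appears in `keys`,
// touching each input row exactly once.
std::vector<Row> filter_rows(const std::vector<Row>& rows,
                             const std::vector<std::int64_t>& keys,
                             std::int64_t t1, std::int64_t t2) {
    std::vector<Row> out;
    out.reserve(rows.size());
    for (const Row& r : rows) {
        if (r.timestamp >= t1 && r.timestamp <= t2 &&
            std::find(keys.begin(), keys.end(), r.key) != keys.end()) {
            out.push_back(r);
        }
    }
    return out;
}

int main() {
    std::vector<Row> rows = {{2, 12211212440, 3.98},
                             {3, 12211212897, 3.58},
                             {4, 12211172430, 4.98}};
    std::vector<Row> kept = filter_rows(rows, {2, 4}, 12211212440, 12211212897);
    return static_cast<int>(kept.size());   // 1 row survives the filter
}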
Quick'n'dirty filtering with C, single threaded, relying only on the operating system for buffering and the like:
#include <stdio.h>
#include <stdbool.h>
#include <string.h>
#include <assert.h>
#include <time.h>

double const fromTime = 3.45f;
double const toTime = 11.5f;
unsigned const someKeys[] = {21, 42, 63, 84};

bool matchingKey(unsigned const key) {
    size_t i;
    for (i = 0; i < sizeof(someKeys) / sizeof(someKeys[0]); ++i) {
        if (someKeys[i] == key) {
            return true;
        }
    }
    return false;
}

bool matchingTime(double const time) {
    return (time >= fromTime) && (time < toTime);
}

int main() {
    char buffer[100];
    time_t start_time = time(NULL);
    clock_t start_clock = clock();

    while (fgets(buffer, 100, stdin) != NULL) {
        size_t const lineLength = strlen(buffer);
        assert(lineLength > 0);
        if (buffer[lineLength - 1] != '\n') {
            fprintf(stderr, "Line too long:\n%s\nBye!\n", buffer);
            return 1;
        }
        unsigned key;
        double timestamp;
        if (sscanf(buffer, "%u %lf", &key, &timestamp) != 2) {
            fprintf(stderr, "Failed to parse line:\n%sBye!\n", buffer);
        }
        if (matchingTime(timestamp) && matchingKey(key)) {
            printf("%s", buffer);
        }
    }

    time_t end_time = time(NULL);
    clock_t end_clock = clock();
    fprintf(stderr, "time: %lf clock: %lf\n",
            difftime(end_time, start_time),
            (double) (end_clock - start_clock) / CLOCKS_PER_SEC);

    if (!feof(stdin)) {
        fprintf(stderr, "Something bad happend, bye!\n");
        return 1;
    }
    return 0;
}
There's a lot one could optimize here, but let's look at how it runs:
$ clang -Wall -Wextra -pedantic -O2 generate.c -o generate
$ clang -Wall -Wextra -pedantic -O2 filter.c -o filter
$ ./generate | time ./filter > /dev/null
time: 67.000000 clock: 66.535009
66.08user 0.45system 1:07.16elapsed 99%CPU (0avgtext+0avgdata 596maxresident)k
0inputs+0outputs (0major+178minor)pagefaults 0swaps
This is the result of filtering 100 million random rows. I ran this on a laptop with an Intel P6100, so nothing high end.
This is about 670 nanoseconds per row. Note that you cannot compare this to Mike's results: on the one hand it (of course) depends on the performance available from the system (mainly the CPU), and on the other hand my test includes the time taken by the operating system (Linux x86-64) to forward the data through the pipes, as well as parsing the data and printing it again.
For reference here's generate's code:
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

unsigned randomKey(void) {
    return rand() % 128;
}

double randomTime(void) {
    double a = rand(), b = rand();
    return (a > b) ? (a / b) : (b / a);
}

int main() {
    unsigned long const rows = 100000000UL;
    srand(time(NULL));
    unsigned long r;
    for (r = 0; r < rows; ++r) {
        printf("%u %lf\n", randomKey(), randomTime());
    }
    return 0;
}

Why do std::string operations perform poorly?

I made a test to compare string operations in several languages in order to choose a language for a server-side application. The results seemed normal until I finally tried C++, which surprised me a lot. So I wonder if I missed any optimization, and I have come here for help.
The test is mainly intensive string operations, including concatenation and searching. It was performed on Ubuntu 11.10 amd64 with GCC 4.6.1. The machine is a Dell Optiplex 960 with 4 GB RAM and a quad-core CPU.
in Python (2.7.2):
def test():
    x = ""
    limit = 102 * 1024
    while len(x) < limit:
        x += "X"
        if x.find("ABCDEFGHIJKLMNOPQRSTUVWXYZ", 0) > 0:
            print("Oh my god, this is impossible!")
    print("x's length is : %d" % len(x))

test()
which gives result:
x's length is : 104448
real 0m8.799s
user 0m8.769s
sys 0m0.008s
in Java (OpenJDK-7):
public class test {
    public static void main(String[] args) {
        int x = 0;
        int limit = 102 * 1024;
        String s = "";
        for (; s.length() < limit;) {
            s += "X";
            if (s.indexOf("ABCDEFGHIJKLMNOPQRSTUVWXYZ") > 0)
                System.out.printf("Find!\n");
        }
        System.out.printf("x's length = %d\n", s.length());
    }
}
which gives result:
x's length = 104448
real 0m50.436s
user 0m50.431s
sys 0m0.488s
in Javascript (Nodejs 0.6.3)
function test()
{
    var x = "";
    var limit = 102 * 1024;
    while (x.length < limit) {
        x += "X";
        if (x.indexOf("ABCDEFGHIJKLMNOPQRSTUVWXYZ", 0) > 0)
            console.log("OK");
    }
    console.log("x's length = " + x.length);
}
test();
which gives result:
x's length = 104448
real 0m3.115s
user 0m3.084s
sys 0m0.048s
in C++ (g++ -Ofast)
It's not surprising that Node.js performs better than Python or Java. But I expected libstdc++ to give much better performance than Node.js, so the result really surprised me.
#include <iostream>
#include <string>
using namespace std;

void test()
{
    int x = 0;
    int limit = 102 * 1024;
    string s("");
    for (; s.size() < limit;) {
        s += "X";
        if (s.find("ABCDEFGHIJKLMNOPQRSTUVWXYZ", 0) != string::npos)
            cout << "Find!" << endl;
    }
    cout << "x's length = " << s.size() << endl;
}

int main()
{
    test();
}
which gives result:
x length = 104448
real 0m5.905s
user 0m5.900s
sys 0m0.000s
Brief Summary
OK, now let's see the summary:
javascript on Nodejs(V8): 3.1s
Python on CPython 2.7.2 : 8.8s
C++ with libstdc++: 5.9s
Java on OpenJDK 7: 50.4s
Surprising! I tried "-O2, -O3" in C++ but nothing helped. C++ seems to give only about 50% of the performance of JavaScript on V8, and even poorer than CPython. Could anyone explain to me whether I have missed some optimization in GCC, or is this just the case? Thank you a lot.
It's not that std::string performs poorly (as much as I dislike C++), it's that string handling is so heavily optimized for those other languages.
Your comparisons of string performance are misleading, and presumptuous if they are intended to represent more than just that.
I know for a fact that Python string objects are completely implemented in C, and indeed on Python 2.7 numerous optimizations exist due to the lack of separation between unicode strings and bytes. If you run this test on Python 3.x you will find it considerably slower.
Javascript has numerous heavily optimized implementations. It's to be expected that string handling is excellent here.
Your Java result may be due to improper string handling, or some other poor case. I expect that a Java expert could step in and fix this test with a few changes.
As for your C++ example, I'd expect performance to slightly exceed the Python version. It does the same operations, with less interpreter overhead. This is reflected in your results. Preceding the test with s.reserve(limit); would remove reallocation overhead.
I'll repeat that you're only testing a single facet of the languages' implementations. The results for this test do not reflect the overall language speed.
I've provided a C version to show how silly such pissing contests can be:
#define _GNU_SOURCE
#include <string.h>
#include <stdio.h>

void test()
{
    int limit = 102 * 1024;
    char s[limit];
    size_t size = 0;
    while (size < limit) {
        s[size++] = 'X';
        if (memmem(s, size, "ABCDEFGHIJKLMNOPQRSTUVWXYZ", 26)) {
            fprintf(stderr, "zomg\n");
            return;
        }
    }
    printf("x's length = %zu\n", size);
}

int main()
{
    test();
    return 0;
}
Timing:
matt@stanley:~/Desktop$ time ./smash
x's length = 104448
real 0m0.681s
user 0m0.680s
sys 0m0.000s
So I went and played a bit with this on ideone.org.
Here is a slightly modified version of your original C++ program, but with the appending in the loop eliminated, so it only measures the call to std::string::find(). Note that I had to cut the number of iterations to ~40%, otherwise ideone.org would kill the process.
#include <iostream>
#include <string>

int main()
{
    const std::string::size_type limit = 42 * 1024;
    unsigned int found = 0;

    //std::string s;
    std::string s(limit, 'X');
    for (std::string::size_type i = 0; i < limit; ++i) {
        //s += 'X';
        if (s.find("ABCDEFGHIJKLMNOPQRSTUVWXYZ", 0) != std::string::npos)
            ++found;
    }

    if (found > 0)
        std::cout << "Found " << found << " times!\n";
    std::cout << "x's length = " << s.size() << '\n';
    return 0;
}
My results at ideone.org are time: 3.37s. (Of course, this is highly questionable, but indulge me for a moment and wait for the other result.)
Now we take this code and swap the commented lines, to test appending, rather than finding. Note that, this time, I had increased the number of iterations tenfold in trying to see any time result at all.
#include <iostream>
#include <string>

int main()
{
    const std::string::size_type limit = 1020 * 1024;
    unsigned int found = 0;

    std::string s;
    //std::string s(limit, 'X');
    for (std::string::size_type i = 0; i < limit; ++i) {
        s += 'X';
        //if (s.find("ABCDEFGHIJKLMNOPQRSTUVWXYZ", 0) != std::string::npos)
        //    ++found;
    }

    if (found > 0)
        std::cout << "Found " << found << " times!\n";
    std::cout << "x's length = " << s.size() << '\n';
    return 0;
}
My results at ideone.org, despite the tenfold increase in iterations, are time: 0s.
My conclusion: This benchmark is, in C++, highly dominated by the searching operation, the appending of the character in the loop has no influence on the result at all. Was that really your intention?
The idiomatic C++ solution would be:
#include <iostream>
#include <string>
#include <algorithm>

int main()
{
    const int limit = 102 * 1024;
    std::string s;
    s.reserve(limit);
    const std::string pattern("ABCDEFGHIJKLMNOPQRSTUVWXYZ");
    for (int i = 0; i < limit; ++i) {
        s += 'X';
        if (std::search(s.begin(), s.end(), pattern.begin(), pattern.end()) != s.end())
            std::cout << "Omg Wtf found!";
    }
    std::cout << "X's length = " << s.size();
    return 0;
}
I could speed this up considerably by putting the string on the stack and using memmem -- but there seems to be no need. Running on my machine, this is already over 10x the speed of the Python solution.
[On my laptop]
time ./test
X's length = 104448
real 0m2.055s
user 0m2.049s
sys 0m0.001s
That is the most obvious one: please try to do s.reserve(limit); before the main loop.
Documentation is here.
I should mention that using the standard classes in C++ in the same way you are used to doing it in Java or Python will often give you sub-par performance if you are unaware of what is done behind the scenes. There is no magical performance in the language itself; it just gives you the right tools.
My first thought is that there isn't a problem.
C++ gives second-best performance, nearly ten times faster than Java. Maybe all but Java are running close to the best performance achievable for that functionality, and you should be looking at how to fix the Java issue (hint - StringBuilder).
In the C++ case, there are some things to try to improve performance a bit. In particular...
s += 'X'; rather than s += "X";
Declare string searchpattern ("ABCDEFGHIJKLMNOPQRSTUVWXYZ"); outside the loop, and pass this to the find calls. An std::string instance knows its own length, whereas a C string requires a linear-time check to determine that, and this may (or may not) be relevant to std::string::find performance.
Try using std::stringstream, for a similar reason to why you should be using StringBuilder for Java, though most likely the repeated conversions back to string will create more problems.
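A minimal sketch of the first two suggestions applied to the original loop (my own illustration, not benchmarked here):
#include <iostream>
#include <string>

int main()
{
    const std::string::size_type limit = 102 * 1024;
    const std::string pattern("ABCDEFGHIJKLMNOPQRSTUVWXYZ");  // constructed once, length known
    std::string s;
    while (s.size() < limit) {
        s += 'X';                       // append a single char, not a one-character C string
        if (s.find(pattern) != std::string::npos)
            std::cout << "Find!" << std::endl;
    }
    std::cout << "x's length = " << s.size() << std::endl;
    return 0;
}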
Overall, the result isn't too surprising though. JavaScript, with a good JIT compiler, may be able to optimise a little better than C++ static compilation is allowed to in this case.
With enough work, you should always be able to optimise C++ better than JavaScript, but there will always be cases where that doesn't just naturally happen and where it may take a fair bit of knowledge and effort to achieve that.
What you are missing here is the inherent complexity of the find search.
You are executing the search 102 * 1024 (104 448) times. A naive search algorithm will, each time, try to match the pattern starting from the first character, then the second, etc...
Therefore, you have a string that is going from length 1 to N, and at each step you search the pattern against this string, which is a linear operation in C++. That is N * (N+1) / 2 = 5 454 744 576 comparisons. I am not as surprised as you are that this would take some time...
Let us verify the hypothesis by using the overload of find that searches for a single A:
Original: 6.94938e+06 ms
Char : 2.10709e+06 ms
About 3 times faster, so we are within the same order of magnitude. Therefore the use of a full string is not really interesting.
Conclusion ? Maybe that find could be optimized a bit. But the problem is not worth it.
Note: and to those who tout Boyer Moore, I am afraid that the needle is too small, so it won't help much. May cut an order of magnitude (26 characters), but no more.
For C++, try to use std::string for "ABCDEFGHIJKLMNOPQRSTUVWXYZ" - in my implementation string::find(const charT* s, size_type pos = 0) const calculates the length of the string argument.
I just tested the C++ example myself. If I remove the call to std::string::find, the program terminates in no time. Thus the allocations during string concatenation are no problem here.
If I add a variable std::string abc = "ABCDEFGHIJKLMNOPQRSTUVWXYZ" and replace the occurrence of "ABC...XYZ" in the call to std::string::find, the program needs almost the same time to finish as the original example. This again shows that allocation as well as computing the string's length does not add much to the runtime.
Therefore, it seems that the string search algorithm used by libstdc++ is not as fast for your example as the search algorithms of javascript or python. Maybe you want to try C++ again with your own string search algorithm which fits your purpose better.
The C/C++ languages are not easy, and it takes years to make fast programs.
Here is a strncmp(3) version modified from the C version:
#define _GNU_SOURCE
#include <string.h>
#include <stdio.h>

void test()
{
    int limit = 102 * 1024;
    char s[limit];
    size_t size = 0;
    while (size < limit) {
        s[size++] = 'X';
        if (!strncmp(s, "ABCDEFGHIJKLMNOPQRSTUVWXYZ", 26)) {
            fprintf(stderr, "zomg\n");
            return;
        }
    }
    printf("x's length = %zu\n", size);
}

int main()
{
    test();
    return 0;
}
Your test code is checking a pathological scenario of excessive string concatenation. (The string-search part of the test could have probably been omitted, I bet you it contributes almost nothing to the final results.) Excessive string concatenation is a pitfall that most languages warn very strongly against, and provide very well known alternatives for, (i.e. StringBuilder,) so what you are essentially testing here is how badly these languages fail under scenarios of perfectly expected failure. That's pointless.
An example of a similarly pointless test would be to compare the performance of various languages when throwing and catching an exception in a tight loop. All languages warn that exception throwing and catching is abysmally slow. They do not specify how slow, they just warn you not to expect anything. Therefore, to go ahead and test precisely that, would be pointless.
So, it would make a lot more sense to repeat your test substituting the mindless string concatenation part (s += "X") with whatever construct is offered by each one of these languages precisely for avoiding string concatenation. (Such as class StringBuilder.)
As mentioned by sbi, the test case is dominated by the search operation.
I was curious how fast the text allocation compares between C++ and Javascript.
System: Raspberry Pi 2, g++ 4.6.3, node v0.12.0, g++ -std=c++0x -O2 perf.cpp
C++ : 770ms
C++ without reserve: 1196ms
Javascript: 2310ms
C++
#include <iostream>
#include <string>
#include <chrono>
using namespace std;
using namespace std::chrono;

void test()
{
    high_resolution_clock::time_point t1 = high_resolution_clock::now();

    int x = 0;
    int limit = 1024 * 1024 * 100;
    string s("");
    s.reserve(1024 * 1024 * 101);
    for(int i = 0; s.size() < limit; i++){
        s += "SUPER NICE TEST TEXT";
    }

    high_resolution_clock::time_point t2 = high_resolution_clock::now();
    auto duration = std::chrono::duration_cast<std::chrono::milliseconds>( t2 - t1 ).count();
    cout << duration << endl;
}

int main()
{
    test();
}
JavaScript
function test()
{
    var time = process.hrtime();

    var x = "";
    var limit = 1024 * 1024 * 100;
    for(var i = 0; x.length < limit; i++){
        x += "SUPER NICE TEST TEXT";
    }

    var diff = process.hrtime(time);
    console.log('benchmark took %d ms', diff[0] * 1e3 + diff[1] / 1e6);
}
test();
It seems that in Node.js there are better algorithms for substring search. You can implement one yourself and try it out.
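If you do want to try a different search algorithm in the C++ version without writing one from scratch, here is a sketch using the Boyer-Moore-Horspool searcher that ships with C++17 (my own illustration; note the earlier caveat that a 26-character needle may be too short for it to pay off, so measure it):
#include <algorithm>
#include <functional>
#include <iostream>
#include <string>

int main()
{
    const std::string::size_type limit = 102 * 1024;
    const std::string pattern("ABCDEFGHIJKLMNOPQRSTUVWXYZ");
    // The searcher precomputes its skip table once, up front.
    const std::boyer_moore_horspool_searcher searcher(pattern.begin(), pattern.end());

    std::string s;
    s.reserve(limit);
    while (s.size() < limit) {
        s += 'X';
        if (std::search(s.begin(), s.end(), searcher) != s.end())
            std::cout << "Find!" << std::endl;
    }
    std::cout << "x's length = " << s.size() << std::endl;
    return 0;
}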

C++ Exp vs. Log: Which is faster?

I have a C++ application in which I need to compare two values and decide which is greater. The only complication is that one number is represented in log-space, the other is not. For example:
double log_num_1 = log(1.23);
double num_2 = 1.24;
If I want to compare num_1 and num_2, I have to use either log() or exp(), and I'm wondering if one is easier to compute than the other (i.e. runs in less time, in general). You can assume I'm using the standard cmath library.
In other words, the following are semantically equivalent, so which is faster:
if (exp(log_num_1) > num_2) cout << "num_1 is greater";
or
if(log_num_1 > log(num_2)) cout << "num_1 is greater";
AFAIK the complexity of the algorithms is the same; the difference should be only a (hopefully negligible) constant factor.
Because of this, I'd use exp(a) > b, simply because it doesn't break on invalid input.
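A small illustration of that robustness point (my own sketch; the negative num_2 is contrived just to show the difference): log() has a restricted domain, while exp() is defined for every finite double.
#include <cmath>
#include <cstdio>

int main() {
    double log_num_1 = std::log(1.23);
    double num_2 = -1.24;                    // non-positive, outside log()'s domain

    // exp() side: always well-defined.
    if (std::exp(log_num_1) > num_2)
        std::printf("num_1 is greater\n");   // printed

    // log() side: log(-1.24) is NaN, and comparisons with NaN are false,
    // so this branch silently gives the wrong answer.
    if (log_num_1 > std::log(num_2))
        std::printf("num_1 is greater\n");   // not printed
    return 0;
}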
Do you really need to know? Is this going to occupy a large fraction of your running time? How do you know?
Worse, it may be platform dependent. Then what?
So sure, test it if you care, but spending much time agonizing over micro-optimization is usually a bad idea.
Edit: Modified the code to avoid exp() overflow. This caused the margin between the two functions to shrink considerably. Thanks, fredrikj.
Code:
#include <stdio.h>
#include <stdlib.h>
#include <math.h>

int main(int argc, char **argv)
{
    if (argc != 3) {
        return 0;
    }

    int type = atoi(argv[1]);
    int count = atoi(argv[2]);
    double (*func)(double) = type == 1 ? exp : log;

    int i;
    for (i = 0; i < count; i++) {
        func(i % 100);
    }
    return 0;
}
(Compile using:)
emil@lanfear /home/emil/dev $ gcc -o test_log test_log.c -lm
The results seem rather conclusive:
emil@lanfear /home/emil/dev $ time ./test_log 0 10000000
real 0m2.307s
user 0m2.040s
sys 0m0.000s
emil@lanfear /home/emil/dev $ time ./test_log 1 10000000
real 0m2.639s
user 0m2.632s
sys 0m0.004s
A bit surprisingly, log seems to be the faster one.
Pure speculation:
Perhaps the underlying mathematical Taylor series converges faster for log or something? It actually seems to me like the natural logarithm is easier to calculate than the exponential function:
ln(1+x) = x - x^2/2 + x^3/3 ...
e^x = 1 + x + x^2/2! + x^3/3! + x^4/4! ...
Not sure that the c library functions even does it like this, however. But it doesn't seem totally unlikely.
Since you're working with values << 1, note that x-1 > log(x) for x<1,
which means that x-1 < log(y) implies log(x) < log(y), which already
takes care of 1/e ~ 37% of the cases without having to use log or exp.
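A minimal sketch of that shortcut (my own code; it assumes the plain value b is positive, as in the question, and falls back to log() when the cheap test is inconclusive):
#include <cmath>
#include <cstdio>

// Returns true when the value whose logarithm is log_a exceeds b (b > 0).
// Uses log(b) <= b - 1 to skip the transcendental call when possible.
bool log_space_greater(double log_a, double b) {
    if (b - 1.0 < log_a)            // then log(b) <= b - 1 < log_a, so a > b
        return true;
    return log_a > std::log(b);     // inconclusive: do the full comparison
}

int main() {
    std::printf("%d\n", log_space_greater(std::log(1.23), 1.24));  // 0: 1.23 < 1.24
    std::printf("%d\n", log_space_greater(std::log(1.23), 0.20));  // 1: shortcut fires
    return 0;
}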
Some quick tests in Python (which uses C for math):
$ time python -c "from math import log, exp;[exp(100) for i in xrange(1000000)]"
real 0m0.590s
user 0m0.520s
sys 0m0.042s
$ time python -c "from math import log, exp;[log(100) for i in xrange(1000000)]"
real 0m0.685s
user 0m0.558s
sys 0m0.044s
would indicate that log is slightly slower
Edit: the C functions are being optimized out by the compiler, it seems, so the loop is what is taking up the time.
Interestingly, in C they seem to be the same speed (possibly for the reasons Mark mentioned in his comment):
#include <math.h>

void runExp(int n) {
    int i;
    for (i = 0; i < n; i++) {
        exp(100);
    }
}

void runLog(int n) {
    int i;
    for (i = 0; i < n; i++) {
        log(100);
    }
}

int main(int argc, char **argv) {
    if (argc <= 1) {
        return 0;
    }
    if (argv[1][0] == 'e') {
        runExp(1000000000);
    } else if (argv[1][0] == 'l') {
        runLog(1000000000);
    }
    return 0;
}
giving times:
$ time ./exp l
real 0m2.987s
user 0m2.940s
sys 0m0.015s
$ time ./exp e
real 0m2.988s
user 0m2.942s
sys 0m0.012s
It can depend on your libm, platform and processor. You're best off writing some code that calls exp/log a large number of times, timing a few runs of it with time, and seeing whether there's a noticeable difference.
Both take basically the same time on my computer (Windows), so I'd use exp, since it's defined for all values (assuming you check for ERANGE). But if it's more natural to use log, you should use that instead of trying to optimise without good reason.
It would make sense for log to be faster... exp has to perform several multiplications to arrive at its answer, whereas log only has to convert the mantissa and exponent from base-2 to base-e.
Just be sure to boundary check (as many others have said) if you're using log.
If you are sure that this is a hotspot, compiler intrinsics are your friends. Although it's platform-dependent (if you go for performance in places like this, you cannot be platform-agnostic).
So the question really is: which one maps to an asm instruction on your target architecture, and with what latency and cycle count? Without that it is pure speculation.