C++ Exp vs. Log: Which is faster?

I have a C++ application in which I need to compare two values and decide which is greater. The only complication is that one number is represented in log-space, the other is not. For example:
double log_num_1 = log(1.23);
double num_2 = 1.24;
If I want to compare num_1 (stored as its logarithm, log_num_1) and num_2, I have to use either log() or exp(), and I'm wondering if one is cheaper to compute than the other (i.e. generally runs in less time). You can assume I'm using the standard cmath library.
In other words, the following are semantically equivalent, so which is faster:
if (exp(log_num_1) > num_2) cout << "num_1 is greater";
or
if(log_num_1 > log(num_2)) cout << "num_1 is greater";

AFAIK the complexity of the two algorithms is the same; the difference should only be a (hopefully negligible) constant factor.
Because of this, I'd use exp(a) > b, simply because it doesn't break on invalid input (log is undefined for arguments <= 0).
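For concreteness, a minimal sketch of the two options (hypothetical helper names; note that exp() can overflow to +inf for very large log-space values, but the comparison still behaves sensibly):

#include <cmath>
#include <iostream>

// Two equivalent comparisons; the exp() form never feeds an invalid argument
// to the math library, while the log() form breaks if num_2 <= 0.
bool greater_via_exp(double log_num_1, double num_2)
{
    return std::exp(log_num_1) > num_2;      // safe for any finite log_num_1
}

bool greater_via_log(double log_num_1, double num_2)
{
    return log_num_1 > std::log(num_2);      // requires num_2 > 0
}

int main()
{
    double log_num_1 = std::log(1.23);
    double num_2 = 1.24;
    std::cout << std::boolalpha
              << greater_via_exp(log_num_1, num_2) << ' '
              << greater_via_log(log_num_1, num_2) << '\n';   // both print false here
    return 0;
}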

Do you really need to know? Is this going to occupy a large fraction of your running time? How do you know?
Worse, it may be platform dependent. Then what?
So sure, test it if you care, but spending much time agonizing over micro-optimization is usually a bad idea.

Edit: Modified the code to avoid exp() overflow. This caused the margin between the two functions to shrink considerably. Thanks, fredrikj.
Code:
#include <stdio.h>
#include <stdlib.h>
#include <math.h>

int main(int argc, char **argv)
{
    if (argc != 3) {
        return 0;
    }

    int type = atoi(argv[1]);
    int count = atoi(argv[2]);
    double (*func)(double) = type == 1 ? exp : log;

    int i;
    for (i = 0; i < count; i++) {
        func(i % 100);
    }

    return 0;
}
(Compile using:)
emil@lanfear /home/emil/dev $ gcc -o test_log test_log.c -lm
The results seem rather conclusive:
emil@lanfear /home/emil/dev $ time ./test_log 0 10000000
real 0m2.307s
user 0m2.040s
sys 0m0.000s
emil@lanfear /home/emil/dev $ time ./test_log 1 10000000
real 0m2.639s
user 0m2.632s
sys 0m0.004s
A bit surprisingly, log seems to be the faster one.
Pure speculation:
Perhaps the underlying Taylor series converges faster for log, or something? It actually seems to me like the natural logarithm is easier to calculate than the exponential function:
ln(1+x) = x - x^2/2 + x^3/3 ...
e^x = 1 + x + x^2/2! + x^3/3! + x^4/4! ...
Not sure that the C library functions even do it like this, however. But it doesn't seem totally unlikely.

Since you're working with values << 1, note that x - 1 > log(x) for x < 1, which means that x - 1 < log(y) implies log(x) < log(y). That already takes care of about 1/e ~ 37% of the cases without having to use log or exp at all.
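A minimal sketch of that shortcut (hypothetical helper name, assuming num_2 > 0 as in the setting above):

#include <cmath>
#include <iostream>

// log_num_1 is a value already in log-space, num_2 is linear (num_2 > 0 assumed).
// Since log(x) <= x - 1 for all x > 0, a cheap test can often decide the
// comparison without calling log() at all.
bool log_space_is_greater(double log_num_1, double num_2)
{
    if (num_2 - 1.0 < log_num_1)         // then log(num_2) <= num_2 - 1 < log_num_1
        return true;
    return log_num_1 > std::log(num_2);  // otherwise fall back to the real comparison
}

int main()
{
    std::cout << std::boolalpha
              << log_space_is_greater(std::log(0.5), 0.2) << '\n';  // true, decided cheaply
    return 0;
}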

Some quick tests in Python (which uses C for math):
$ time python -c "from math import log, exp;[exp(100) for i in xrange(1000000)]"
real 0m0.590s
user 0m0.520s
sys 0m0.042s
$ time python -c "from math import log, exp;[log(100) for i in xrange(1000000)]"
real 0m0.685s
user 0m0.558s
sys 0m0.044s
would indicate that log is slightly slower.
Edit: the C functions seem to be optimized out by the compiler, so the loop is what's taking up the time.
Interestingly, in C they seem to be the same speed (possibly for the reasons Mark mentioned in a comment):
#include <math.h>

void runExp(int n) {
    int i;
    for (i = 0; i < n; i++) {
        exp(100);
    }
}

void runLog(int n) {
    int i;
    for (i = 0; i < n; i++) {
        log(100);
    }
}

int main(int argc, char **argv) {
    if (argc <= 1) {
        return 0;
    }
    if (argv[1][0] == 'e') {
        runExp(1000000000);
    } else if (argv[1][0] == 'l') {
        runLog(1000000000);
    }
    return 0;
}
giving times:
$ time ./exp l
real 0m2.987s
user 0m2.940s
sys 0m0.015s
$ time ./exp e
real 0m2.988s
user 0m2.942s
sys 0m0.012s

It can depend on your libm, platform and processor. You're best off writing some code that calls exp/log a large number of times, and running it under time a few times to see whether there's a noticeable difference.
Both take basically the same time on my computer (Windows), so I'd use exp, since it's defined for all values (assuming you check for ERANGE). But if it's more natural to use log, you should use that instead of trying to optimise without good reason.
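One way to run that comparison yourself: a small std::chrono harness that keeps the results observable so the calls can't be optimized away. This is a minimal sketch, assuming a C++11 compiler; the volatile accumulator and argument range are my own choices, and the final ERANGE check is only meaningful when math_errhandling includes MATH_ERRNO.

#include <cmath>
#include <cerrno>
#include <chrono>
#include <cstdio>

// Time `count` calls of a math function, accumulating into a volatile
// sink so the compiler cannot delete the loop.
template <class F>
static double time_calls(F fn, int count)
{
    volatile double sink = 0.0;
    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < count; ++i)
        sink = sink + fn(1.0 + i % 100);   // arguments stay in a safe range
    auto t1 = std::chrono::steady_clock::now();
    (void)sink;
    return std::chrono::duration<double>(t1 - t0).count();
}

int main()
{
    const int n = 10000000;
    std::printf("exp: %.3f s\n", time_calls([](double x) { return std::exp(x); }, n));
    std::printf("log: %.3f s\n", time_calls([](double x) { return std::log(x); }, n));

    // If you rely on exp(), check for a range error as suggested above
    // (only meaningful when math_errhandling includes MATH_ERRNO).
    errno = 0;
    double huge = std::exp(1e6);            // overflows to +inf
    if (errno == ERANGE)
        std::printf("exp overflowed (huge = %g)\n", huge);
    return 0;
}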

It would make sense for log to be faster: exp has to perform several multiplications to arrive at its answer, whereas log only has to convert the mantissa and exponent from base-2 to base-e.
Just be sure to boundary-check (as many others have said) if you're using log.

If you are sure this is a hotspot, compiler intrinsics are your friends, although they're platform-dependent (if you go for performance in places like this, you cannot be platform-agnostic).
So the question really is: which of the two maps to an asm instruction on your target architecture, and with what latency and cycle count? Without that, it is pure speculation.

Related

Array of array and filtering performance

What is the most efficient way to filter an array of arrays?
Let's take this example:
[
[key,timestamp,value],
[2,12211212440,3.98],
[2,12211212897,3.78],
...,
[3,12211212440,3.28],
[3,12211212897,3.58],
...,
[4,12211212440,4.98],
... about 1 million rows :-)
]
So let's imagine we have more than 1 million rows. I want to filter by key and timestamp, e.g. key in [3,7,9,...] and timestamp between time1 and time2.
Of course the first idea is to use a database, but for some reason I want to do it directly inside the application and not with third-party software.
Therefore I am wondering what would be the most efficient way to perform this filtering over a large amount of data. The C++ STL? Python NumPy? But NumPy is based on C/C++, so I would imagine that the most efficient way is C/C++.
So could you give me some good ideas?
Well, if you really want to absolutely optimize the performance here, the answer is pretty much always to hand-code it in C/C++. But optimizing run-time performance usually means spending a lot longer writing the code, so you need to figure out that tradeoff for yourself.
But you're right: numpy is indeed written in C, and most of its processing happens at that level. As a result, the algorithms it implements are typically very close to the speed you could get by coding them yourself in C (or faster, because people have been working on numpy for a while). So you probably shouldn't bother re-coding anything you can find in numpy.
Now the question is whether or not numpy has a built-in way of doing what you want. Here's one way to do it:
import numpy as np

keys = np.array([2, 4])
t1 = 12211212440
t2 = 12211212897
d = np.array([[key, timestamp, value]
              for key in range(5)
              for timestamp in range(12211172430, 12211212997+1)
              for value in [3.98, 3.78, 3.28, 3.58, 4.98]])

filtered = d[(d[:, 1] >= t1) & (d[:, 1] <= t2)]     # Select rows in time span
filtered = filtered[np.in1d(filtered[:, 0], keys)]  # Select rows whose key is in keys
It's not quite built-in, but two lines isn't bad. Note that this d I made up has a little over a million entries. On my laptop, this runs in around 4 milliseconds, which is around 4 nanoseconds per entry. That's almost as fast as I would expect to be able to do it in C/C++.
Of course, there are other ways to do it. Also, you might want to make a custom numpy dtype to handle the fact that you evidently have int, int, float, whereas numpy assumes float, float, float. If I use a numpy recarray to do this, I get about a 50% speedup. There's also pandas, which can handle that automatically, and has lots of selection functions. In particular, translating the above straight to pandas runs in a little under 4 milliseconds -- maybe pandas is more clever about something:
import pandas as pd

keys = [2, 4]
t1 = 12211212440
t2 = 12211212897
d_pandas = pd.DataFrame([(key, timestamp, value)
                         for key in range(5)
                         for timestamp in range(12211172430, 12211212997+1)
                         for value in [3.98, 3.78, 3.28, 3.58, 4.98]],
                        columns=['key', 'timestamp', 'value'])

filtered = d_pandas[(d_pandas.timestamp >= t1) & (d_pandas.timestamp <= t2)]
filtered = filtered[filtered.key.isin(keys)]
Of course, this isn't how I would code it in C++, so you could do better in principle. In particular, I would expect to see a slowdown due to looping over the array two or three separate times; in C++ I would just do the loop once, check one condition, and if that's true check the second condition, and if that's true keep the row (a C++ sketch of that single-pass approach follows the timing discussion below). If you're already using python, you could also look into numba, which lets you use python but compiles it on the fly for you, so it's about as fast as C/C++. Even code as dumb as this runs very quickly:
import numba as nb
import numpy as np

keys = np.array([2, 4], dtype='i8')
t1 = 12211212440
t2 = 12211212897
d_recarray = np.rec.array([(key, timestamp, value)
                           for key in range(5)
                           for timestamp in range(12211172430, 12211212997+1)
                           for value in [3.98, 3.78, 3.28, 3.58, 4.98]],
                          dtype=[('key', 'i8'), ('timestamp', 'i8'), ('value', 'f8')])

@nb.njit
def select_elements_recarray(d_in, keys, t1, t2):
    d_out = np.empty_like(d_in)
    k = 0
    for i in range(len(d_in)):
        if d_in[i].timestamp >= t1 and d_in[i].timestamp <= t2:
            matched_key = False
            for j in range(len(keys)):
                if d_in[i].key == keys[j]:
                    matched_key = True
                    break
            if matched_key:
                d_out[k] = d_in[i]
                k += 1
    return d_out[:k]

filtered = select_elements_recarray(d_recarray, keys, t1, t2)
The jit compilation takes a little time, though much less than compiling typical C code, and it only has to happen once. The filtering then runs in just over one millisecond on my laptop -- almost four times faster than the numpy code, and about one nanosecond per input array element. This is roughly as fast as I would expect to be able to do anything, even in C/C++. So I don't expect it would be worth your time to optimize this further.
I suggest you try something like this because it's so short and fast (and done for you). Then, only if it's too slow, put in the work to do something better. I don't know what you're doing, but chances are this will not make up a large fraction of the time your code takes to run.
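For comparison, here is roughly what that single-pass loop could look like in C++. This is a minimal sketch with a hypothetical Row struct, container and thresholds (not code from the answer above): one pass, cheapest test first.

#include <cstdint>
#include <iostream>
#include <unordered_set>
#include <vector>

struct Row {
    int          key;
    std::int64_t timestamp;
    double       value;
};

// Single pass over the data: check the timestamp window first, then the key.
std::vector<Row> filter_rows(const std::vector<Row>& rows,
                             const std::unordered_set<int>& keys,
                             std::int64_t t1, std::int64_t t2)
{
    std::vector<Row> out;
    out.reserve(rows.size() / 4);   // rough guess, just to limit reallocations
    for (const Row& r : rows) {
        if (r.timestamp >= t1 && r.timestamp <= t2 && keys.count(r.key) != 0)
            out.push_back(r);
    }
    return out;
}

int main()
{
    std::vector<Row> rows = {{2, 12211212440, 3.98}, {3, 12211212897, 3.58}, {4, 12211212440, 4.98}};
    std::unordered_set<int> keys = {3, 4};
    auto filtered = filter_rows(rows, keys, 12211212440, 12211212897);
    std::cout << filtered.size() << " rows matched\n";   // 2 matching rows in this toy input
    return 0;
}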
Quick'n'dirty filtering with C, single threaded, relying only on the operating system for buffering and the like:
#include <stdio.h>
#include <stdbool.h>
#include <string.h>
#include <assert.h>
#include <time.h>

double const fromTime = 3.45;
double const toTime = 11.5;
unsigned const someKeys[] = {21, 42, 63, 84};

bool matchingKey(unsigned const key) {
    size_t i;
    for (i = 0; i < sizeof(someKeys) / sizeof(someKeys[0]); ++i) {
        if (someKeys[i] == key) {
            return true;
        }
    }
    return false;
}

bool matchingTime(double const time) {
    return (time >= fromTime) && (time < toTime);
}

int main() {
    char buffer[100];
    time_t start_time = time(NULL);
    clock_t start_clock = clock();
    while (fgets(buffer, 100, stdin) != NULL) {
        size_t const lineLength = strlen(buffer);
        assert(lineLength > 0);
        if (buffer[lineLength - 1] != '\n') {
            fprintf(stderr, "Line too long:\n%s\nBye!\n", buffer);
            return 1;
        }
        unsigned key;
        double timestamp;
        if (sscanf(buffer, "%u %lf", &key, &timestamp) != 2) {
            fprintf(stderr, "Failed to parse line:\n%sBye!\n", buffer);
            continue; /* skip unparsable lines instead of using indeterminate values */
        }
        if (matchingTime(timestamp) && matchingKey(key)) {
            printf("%s", buffer);
        }
    }
    time_t end_time = time(NULL);
    clock_t end_clock = clock();
    fprintf(stderr, "time: %lf clock: %lf\n",
            difftime(end_time, start_time),
            (double) (end_clock - start_clock) / CLOCKS_PER_SEC);
    if (!feof(stdin)) {
        fprintf(stderr, "Something bad happened, bye!\n");
        return 1;
    }
    return 0;
}
There's a lot one could optimize here, but let's look at how it runs:
$ clang -Wall -Wextra -pedantic -O2 generate.c -o generate
$ clang -Wall -Wextra -pedantic -O2 filter.c -o filter
$ ./generate | time ./filter > /dev/null
time: 67.000000 clock: 66.535009
66.08user 0.45system 1:07.16elapsed 99%CPU (0avgtext+0avgdata 596maxresident)k
0inputs+0outputs (0major+178minor)pagefaults 0swaps
This is the result of filtering 100 million random rows. I ran this on a laptop with an Intel P6100, so nothing high end.
This works out to about 670 nanoseconds per row. Note that you cannot compare this directly to Mike's results: it depends on the available system performance (mainly CPU), and my test also includes the time taken by the operating system (Linux x86-64) to forward the data through the pipes, as well as parsing the data and printing it again.
For reference here's generate's code:
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

unsigned randomKey(void) {
    return rand() % 128;
}

double randomTime(void) {
    double a = rand(), b = rand();
    return (a > b) ? (a / b) : (b / a);
}

int main() {
    unsigned long const rows = 100000000UL;
    srand(time(NULL));
    unsigned long r;
    for (r = 0; r < rows; ++r) {
        printf("%u %lf\n", randomKey(), randomTime());
    }
    return 0;
}

Would a pre-calculated variable be faster than calculating it every time in a loop?

In a function that updates all particles I have the following code:
for (int i = 0; i < _maxParticles; i++)
{
    // check if active
    if (_particles[i].lifeTime > 0.0f)
    {
        _particles[i].lifeTime -= _decayRate * deltaTime;
    }
}
This decreases the lifetime of the particle based on the time that has passed.
The product _decayRate * deltaTime gets recalculated every iteration, so with 10000 particles that doesn't seem very efficient, because the value never changes within the loop.
So I came up with this:
float lifeMin = _decayRate * deltaTime;
for (int i = 0; i < _maxParticles; i++)
{
    // check if active
    if (_particles[i].lifeTime > 0.0f)
    {
        _particles[i].lifeTime -= lifeMin;
    }
}
This calculates it once and stores it in a variable that is read every iteration, so the CPU doesn't have to recompute it each time, which should theoretically increase performance.
Would it run faster than the old code? Or does the release compiler do optimizations like this?
I wrote a program that compares both methods:
#include <time.h>
#include <iostream>

const unsigned int MAX = 1000000000;

int main()
{
    float deltaTime = 20;
    float decayRate = 200;
    float foo = 2041.234f;

    unsigned int start = clock();
    for (unsigned int i = 0; i < MAX; i++)
    {
        foo -= decayRate * deltaTime;
    }
    std::cout << "Method 1 took " << clock() - start << "ms\n";

    start = clock();
    float calced = decayRate * deltaTime;
    for (unsigned int i = 0; i < MAX; i++)
    {
        foo -= calced;
    }
    std::cout << "Method 2 took " << clock() - start << "ms\n";

    int n;
    std::cin >> n;
    return 0;
}
Result in debug mode:
Method 1 took 2470ms
Method 2 took 2410ms
Result in release mode:
Method 1 took 0ms
Method 2 took 0ms
But that doesn't really work. I know it doesn't do exactly the same thing, but it gives an idea.
In debug mode, they take roughly the same time; sometimes Method 1 is faster than Method 2 (especially with lower iteration counts), sometimes Method 2 is faster.
In release mode, both take 0 ms, which is a little weird.
I tried measuring it in the game itself, but there aren't enough particles to get a clear result.
EDIT
I tried disabling optimizations and letting the variables be user input via std::cin.
Here are the results:
Method 1 took 2430ms
Method 2 took 2410ms
It will almost certainly make no difference whatsoever, at least if you compile with optimization (and of course, if you're concerned with performance, you are compiling with optimization). The optimization in question is called loop-invariant code motion, and it has been universally implemented for about 40 years.
On the other hand, it may make sense to use the separate variable anyway, to make the code clearer. This depends on the application, but in many cases giving a name to the result of an expression can make code clearer. (In other cases, of course, throwing in a lot of extra variables can make it less clear. It all depends on the application.)
In any case, for such things, write the code as clearly as possible first, and then, if (and only if) there is a performance problem, profile to see where it is, and fix that.
EDIT:
Just to be perfectly clear: I'm talking about this sort of optimization in general. In the exact case you show, since you don't use foo, the compiler will probably remove it (and the loops) completely.
In theory, yes. But your loop is extremely simple and thus likely to be heavily optimized.
Try the -O0 option to disable all compiler optimizations.
The 0 ms release-mode result is probably caused by the compiler computing the result statically.
I am pretty confident that any decent compiler will replace your loops with the following code:
foo -= MAX * decayRate * deltaTime;
and
foo -= MAX * calced ;
You can make MAX depend on some kind of input (e.g. a command-line parameter) to avoid that.
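A minimal sketch of that suggestion, my own arrangement of the question's two methods: the iteration count comes from the command line and the accumulated results are printed, so the compiler can neither pre-compute nor discard the loops.

#include <chrono>
#include <cstdlib>
#include <iostream>

int main(int argc, char** argv)
{
    // Iteration count is a runtime value, so nothing can be folded at compile time.
    const unsigned int max = argc > 1 ? std::strtoul(argv[1], nullptr, 10) : 1000000000u;
    float deltaTime = 20.0f;
    float decayRate = 200.0f;

    float foo = 2041.234f;
    auto t0 = std::chrono::steady_clock::now();
    for (unsigned int i = 0; i < max; ++i)
        foo -= decayRate * deltaTime;          // method 1: the compiler may still hoist this
    auto t1 = std::chrono::steady_clock::now();

    float bar = 2041.234f;
    const float calced = decayRate * deltaTime;
    for (unsigned int i = 0; i < max; ++i)
        bar -= calced;                         // method 2: hoisted by hand
    auto t2 = std::chrono::steady_clock::now();

    using ms = std::chrono::duration<double, std::milli>;
    // Printing the results makes them observable, so the loops cannot be removed.
    std::cout << "Method 1: " << ms(t1 - t0).count() << " ms (foo=" << foo << ")\n";
    std::cout << "Method 2: " << ms(t2 - t1).count() << " ms (bar=" << bar << ")\n";
    return 0;
}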

C++ Branch prediction algorithm comparison?

There are two possible techniques shown below which perform the same task.
I would like to know if there will be any performance difference between the two.
I think the first technique will suffer from branch misprediction because the contents of A are random.
Technique 1:
#include <iostream>
#include <cstdlib>
#include <ctime>
#define SIZE 1000000
using namespace std;

class MyClass
{
private:
    bool flag;
public:
    void setFlag(bool f) {flag = f;}
};

int main()
{
    MyClass obj;
    int *A = new int[SIZE];
    for(int i = 0; i < SIZE; i++)
        A[i] = (unsigned int)rand();

    time_t mytime1;
    time_t mytime2;
    time(&mytime1);

    for(int test = 0; test < 5000; test++)
    {
        for(int i = 0; i < SIZE; i++)
        {
            if(A[i] > 100)
                obj.setFlag(true);
            else
                obj.setFlag(false);
        }
    }

    time(&mytime2);
    cout << asctime(localtime(&mytime1)) << endl;
    cout << asctime(localtime(&mytime2)) << endl;
}
Result:
Sat May 03 20:08:07 2014
Sat May 03 20:08:32 2014
i.e. Time taken = 25sec
Technique 2:
#include <iostream>
#include <cstdlib>
#include <ctime>
#define SIZE 1000000
using namespace std;

class MyClass
{
private:
    bool flag;
public:
    void setFlag(bool f) {flag = f;}
};

int main()
{
    MyClass obj;
    int *A = new int[SIZE];
    for(int i = 0; i < SIZE; i++)
        A[i] = (unsigned int)rand();

    time_t mytime1;
    time_t mytime2;
    time(&mytime1);

    for(int test = 0; test < 5000; test++)
    {
        for(int i = 0; i < SIZE; i++)
        {
            obj.setFlag(A[i] > 100);
        }
    }

    time(&mytime2);
    cout << asctime(localtime(&mytime1)) << endl;
    cout << asctime(localtime(&mytime2)) << endl;
}
Result:
Sat May 03 20:08:42 2014
Sat May 03 20:09:10 2014
i.e. Time taken = 28sec
The compilation is done using the MinGW 64-bit compiler with no flags.
From the results it looks like the opposite is happening.
EDIT:
After making the check for RAND_MAX / 2 instead of 100, I am getting the following results:
Technique 1: 70sec
Technique 2: 28sec
So it now becomes clear that Technique 2 is better than Technique 1, which can be explained by branch prediction failures.
With optimisations enabled the binaries are exactly the same, in GCC 4.8 at least: demo.
They're different with optimisations disabled, though: demo.
This very poor attempt at a measurement suggests that the second is actually slower, though both programs run in the same duration in real terms: demo
real 0m0.052s
user 0m0.036s
sys 0m0.012s
real 0m0.052s
user 0m0.044s
sys 0m0.004s
To find out how they really differ in performance with optimisations disabled, you can benchmark properly with more runs.
Frankly, though, since it's irrelevant for your production code I wouldn't even bother.
I agree that this isn't very interesting for practical code (especially when the difference disappears with -O3), but for the sake of academic interest: in some conditions it may be better to rely on the branch predictor.
On one hand, in this particular case the branch is almost always going to be not-taken (as RAND_MAX >> 100), which is easy to predict both in terms of branch resolution and the next IP address. Try converting the prediction to a 50% chance and then benchmark this.
On the other hand, the second operation turns the stores to the obj flag into being data-dependent on the loads from A[i]. These loads are going to be slow, as your dataset is 1000000*sizeof(int) bytes at least (almost 4MB), meaning it will live in the L3 cache or in memory; either way, that's quite a few cycles per new cache line (once every few accesses). When the writes to the flag were independent, they could queue in parallel; now you have to stall them until you get the data. In theory, the CPU should be able to "pipeline" this, since stores are performed much later than loads along the pipeline on most CPUs, but in practice you're limited by the size of the out-of-order window (in most machines that would be ~100 instructions, I believe), so if the store of the current iteration is stalled, you won't be able to launch the loads needed for future iterations far enough ahead.
In other words, you may be losing out because CPUs have fairly decent branch prediction, but no (or hardly any) data prediction.
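If you want to try that 50/50 experiment, here is a minimal self-contained sketch (my own harness, not from the answer): the threshold RAND_MAX / 2 makes the branch unpredictable, as in the question's edit, and the volatile flag keeps the stores observable.

#include <chrono>
#include <cstdlib>
#include <iostream>

int main()
{
    const int size = 1000000;
    const int passes = 500;
    int* a = new int[size];
    for (int i = 0; i < size; ++i)
        a[i] = std::rand();

    volatile bool flag = false;            // volatile so the stores are kept

    auto t0 = std::chrono::steady_clock::now();
    for (int p = 0; p < passes; ++p)
        for (int i = 0; i < size; ++i) {
            if (a[i] > RAND_MAX / 2)       // ~50% taken, so hard to predict
                flag = true;
            else
                flag = false;
        }
    auto t1 = std::chrono::steady_clock::now();

    for (int p = 0; p < passes; ++p)
        for (int i = 0; i < size; ++i)
            flag = a[i] > RAND_MAX / 2;    // branchless store of the comparison result
    auto t2 = std::chrono::steady_clock::now();

    using sec = std::chrono::duration<double>;
    std::cout << "branchy:    " << sec(t1 - t0).count() << " s\n";
    std::cout << "branchless: " << sec(t2 - t1).count() << " s\n";
    delete[] a;
    return 0;
}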

Why do std::string operations perform poorly?

I made a test comparing string operations in several languages in order to choose a language for a server-side application. The results seemed normal until I finally tried C++, which surprised me a lot. So I wonder whether I have missed some optimization, and I've come here for help.
The test consists mainly of intensive string operations, including concatenation and searching. The test is performed on Ubuntu 11.10 amd64, with GCC 4.6.1. The machine is a Dell Optiplex 960, with 4GB RAM and a quad-core CPU.
in Python (2.7.2):
def test():
    x = ""
    limit = 102 * 1024
    while len(x) < limit:
        x += "X"
        if x.find("ABCDEFGHIJKLMNOPQRSTUVWXYZ", 0) > 0:
            print("Oh my god, this is impossible!")
    print("x's length is : %d" % len(x))

test()
which gives result:
x's length is : 104448
real 0m8.799s
user 0m8.769s
sys 0m0.008s
in Java (OpenJDK-7):
public class test {
    public static void main(String[] args) {
        int x = 0;
        int limit = 102 * 1024;
        String s = "";
        for (; s.length() < limit;) {
            s += "X";
            if (s.indexOf("ABCDEFGHIJKLMNOPQRSTUVWXYZ") > 0)
                System.out.printf("Find!\n");
        }
        System.out.printf("x's length = %d\n", s.length());
    }
}
which gives result:
x's length = 104448
real 0m50.436s
user 0m50.431s
sys 0m0.488s
in Javascript (Nodejs 0.6.3)
function test()
{
    var x = "";
    var limit = 102 * 1024;
    while (x.length < limit) {
        x += "X";
        if (x.indexOf("ABCDEFGHIJKLMNOPQRSTUVWXYZ", 0) > 0)
            console.log("OK");
    }
    console.log("x's length = " + x.length);
}
test();
which gives result:
x's length = 104448
real 0m3.115s
user 0m3.084s
sys 0m0.048s
in C++ (g++ -Ofast)
It's not surprising that Node.js performs better than Python or Java. But I expected libstdc++ to give much better performance than Node.js, so the result really surprised me.
#include <iostream>
#include <string>
using namespace std;

void test()
{
    int x = 0;
    int limit = 102 * 1024;
    string s("");
    for (; s.size() < limit;) {
        s += "X";
        if (s.find("ABCDEFGHIJKLMNOPQRSTUVWXYZ", 0) != string::npos)
            cout << "Find!" << endl;
    }
    cout << "x's length = " << s.size() << endl;
}

int main()
{
    test();
}
which gives result:
x length = 104448
real 0m5.905s
user 0m5.900s
sys 0m0.000s
Brief Summary
OK, now let's see the summary:
javascript on Nodejs(V8): 3.1s
Python on CPython 2.7.2 : 8.8s
C++ with libstdc++: 5.9s
Java on OpenJDK 7: 50.4s
Surprising! I tried -O2 and -O3 in C++ but nothing helped. C++ seems to reach only about 50% of the performance of JavaScript in V8, and even performs worse than CPython. Could anyone explain whether I have missed some optimization in GCC, or is this just how it is? Thank you a lot.
It's not that std::string performs poorly (as much as I dislike C++), it's that string handling is so heavily optimized for those other languages.
Your comparisons of string performance are misleading, and presumptuous if they are intended to represent more than just that.
I know for a fact that Python string objects are completely implemented in C, and indeed on Python 2.7, numerous optimizations exist due to the lack of separation between unicode strings and bytes. If you ran this test on Python 3.x you will find it considerably slower.
Javascript has numerous heavily optimized implementations. It's to be expected that string handling is excellent here.
Your Java result may be due to improper string handling, or some other poor case. I expect that a Java expert could step in and fix this test with a few changes.
As for your C++ example, I'd expect performance to slightly exceed the Python version. It does the same operations, with less interpreter overhead. This is reflected in your results. Preceding the test with s.reserve(limit); would remove reallocation overhead.
I'll repeat that you're only testing a single facet of the languages' implementations. The results for this test do not reflect the overall language speed.
I've provided a C version to show how silly such pissing contests can be:
#define _GNU_SOURCE
#include <string.h>
#include <stdio.h>

void test()
{
    int limit = 102 * 1024;
    char s[limit];
    size_t size = 0;
    while (size < limit) {
        s[size++] = 'X';
        if (memmem(s, size, "ABCDEFGHIJKLMNOPQRSTUVWXYZ", 26)) {
            fprintf(stderr, "zomg\n");
            return;
        }
    }
    printf("x's length = %zu\n", size);
}

int main()
{
    test();
    return 0;
}
Timing:
matt@stanley:~/Desktop$ time ./smash
x's length = 104448
real 0m0.681s
user 0m0.680s
sys 0m0.000s
So I went and played a bit with this on ideone.org.
Here is a slightly modified version of your original C++ program, but with the appending in the loop eliminated, so it only measures the call to std::string::find(). Note that I had to cut the number of iterations to ~40%, otherwise ideone.org would kill the process.
#include <iostream>
#include <string>

int main()
{
    const std::string::size_type limit = 42 * 1024;
    unsigned int found = 0;
    //std::string s;
    std::string s(limit, 'X');
    for (std::string::size_type i = 0; i < limit; ++i) {
        //s += 'X';
        if (s.find("ABCDEFGHIJKLMNOPQRSTUVWXYZ", 0) != std::string::npos)
            ++found;
    }
    if (found > 0)
        std::cout << "Found " << found << " times!\n";
    std::cout << "x's length = " << s.size() << '\n';
    return 0;
}
My results at ideone.org are time: 3.37s. (Of course, this is highly questionable, but indulge me for a moment and wait for the other result.)
Now we take this code and swap the commented lines, to test appending rather than finding. Note that, this time, I increased the number of iterations tenfold in order to see any time result at all.
#include <iostream>
#include <string>

int main()
{
    const std::string::size_type limit = 1020 * 1024;
    unsigned int found = 0;
    std::string s;
    //std::string s(limit, 'X');
    for (std::string::size_type i = 0; i < limit; ++i) {
        s += 'X';
        //if (s.find("ABCDEFGHIJKLMNOPQRSTUVWXYZ", 0) != std::string::npos)
        //    ++found;
    }
    if (found > 0)
        std::cout << "Found " << found << " times!\n";
    std::cout << "x's length = " << s.size() << '\n';
    return 0;
}
My results at ideone.org, despite the tenfold increase in iterations, are time: 0s.
My conclusion: in C++, this benchmark is highly dominated by the searching operation; the appending of the character in the loop has no influence on the result at all. Was that really your intention?
The idiomatic C++ solution would be:
#include <iostream>
#include <string>
#include <algorithm>

int main()
{
    const int limit = 102 * 1024;
    std::string s;
    s.reserve(limit);
    const std::string pattern("ABCDEFGHIJKLMNOPQRSTUVWXYZ");
    for (int i = 0; i < limit; ++i) {
        s += 'X';
        if (std::search(s.begin(), s.end(), pattern.begin(), pattern.end()) != s.end())
            std::cout << "Omg Wtf found!";
    }
    std::cout << "X's length = " << s.size();
    return 0;
}
I could speed this up considerably by putting the string on the stack and using memmem, but there seems to be no need. Running on my machine, this is already over 10x the speed of the Python solution.
[On my laptop]
time ./test
X's length = 104448
real 0m2.055s
user 0m2.049s
sys 0m0.001s
The most obvious fix: try doing s.reserve(limit); before the main loop.
Documentation is here.
I should mention that using the standard classes in C++ in the same way you are used to in Java or Python will often give you sub-par performance if you are unaware of what is done behind the scenes. There is no magical performance in the language itself; it just gives you the right tools.
My first thought is that there isn't a problem.
C++ gives second-best performance, nearly ten times faster than Java. Maybe all but Java are running close to the best performance achievable for that functionality, and you should be looking at how to fix the Java issue (hint - StringBuilder).
In the C++ case, there are some things to try to improve performance a bit. In particular...
s += 'X'; rather than s += "X";
Declare string searchpattern ("ABCDEFGHIJKLMNOPQRSTUVWXYZ"); outside the loop, and pass this to the find calls. An std::string instance knows its own length, whereas a C string requires a linear-time check to determine that, and this may (or may not) be relevant to std::string::find performance.
Try using std::stringstream, for a similar reason to why you should be using StringBuilder for Java, though most likely the repeated conversions back to string will create more problems (see the sketch after this answer).
Overall, the result isn't too surprising though. JavaScript, with a good JIT compiler, may be able to optimise a little better than C++ static compilation is allowed to in this case.
With enough work, you should always be able to optimise C++ better than JavaScript, but there will always be cases where that doesn't just naturally happen and where it may take a fair bit of knowledge and effort to achieve that.
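If you want to try the stringstream suggestion, a minimal sketch might look like the following. Note that it searches once at the end rather than every iteration, since, as sbi's measurements below show, the per-iteration search is what dominates the original benchmark.

#include <iostream>
#include <sstream>
#include <string>

int main()
{
    const std::string::size_type limit = 102 * 1024;
    std::ostringstream oss;
    for (std::string::size_type i = 0; i < limit; ++i)
        oss << 'X';                        // cheap appends into the stream buffer
    const std::string s = oss.str();       // single conversion back to a string
    if (s.find("ABCDEFGHIJKLMNOPQRSTUVWXYZ") != std::string::npos)
        std::cout << "Found!\n";
    std::cout << "x's length = " << s.size() << '\n';
    return 0;
}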
What you are missing here is the inherent complexity of the find search.
You are executing the search 102 * 1024 (104 448) times. A naive search algorithm will, each time, try to match the pattern starting from the first character, then the second, etc...
Therefore, you have a string that is going from length 1 to N, and at each step you search the pattern against this string, which is a linear operation in C++. That is N * (N+1) / 2 = 5 454 744 576 comparisons. I am not as surprised as you are that this would take some time...
Let us verify the hypothesis by using the overload of find that searches for a single A:
Original: 6.94938e+06 ms
Char : 2.10709e+06 ms
About 3 times faster, so we are within the same order of magnitude. Therefore the use of a full string is not really interesting.
Conclusion? Maybe find could be optimized a bit, but the problem is not worth it.
Note: to those who tout Boyer–Moore, I am afraid the needle is too small, so it won't help much. It may cut a factor of up to the needle length (26 characters), but no more.
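For reference, C++17 does ship a ready-made Boyer–Moore searcher in <functional>; a minimal sketch of how one would plug it into std::search (requires -std=c++17, and per the note above the short needle limits any gain):

#include <algorithm>
#include <functional>   // std::boyer_moore_searcher (C++17)
#include <iostream>
#include <string>

int main()
{
    const std::string haystack(102 * 1024, 'X');
    const std::string needle = "ABCDEFGHIJKLMNOPQRSTUVWXYZ";

    // Build the searcher once, then reuse it for every search.
    std::boyer_moore_searcher searcher(needle.begin(), needle.end());
    auto it = std::search(haystack.begin(), haystack.end(), searcher);
    std::cout << (it == haystack.end() ? "not found\n" : "found\n");
    return 0;
}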
For C++, try using a std::string for "ABCDEFGHIJKLMNOPQRSTUVWXYZ": in my implementation, string::find(const charT* s, size_type pos = 0) const calculates the length of the string argument on every call.
I just tested the C++ example myself. If I remove the call to std::string::find, the program terminates in no time. Thus the allocations during string concatenation are no problem here.
If I add a variable std::string abc = "ABCDEFGHIJKLMNOPQRSTUVWXYZ" and replace the occurrence of "ABC...XYZ" in the call to std::string::find, the program needs almost the same time to finish as the original example. This again shows that neither allocation nor computing the string's length adds much to the runtime.
Therefore, it seems that the string search algorithm used by libstdc++ is not as fast for your example as the search algorithms of JavaScript or Python. Maybe you want to try C++ again with your own string search algorithm that fits your purpose better (one illustration appears below).
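One illustration of a purpose-built search, not taken from any answer above: because each iteration appends exactly one character, any new match must end at that character, so only the last pattern.size() characters need to be compared each time.

#include <iostream>
#include <string>

int main()
{
    const std::string::size_type limit = 102 * 1024;
    const std::string pattern = "ABCDEFGHIJKLMNOPQRSTUVWXYZ";
    std::string s;
    s.reserve(limit);
    bool found = false;
    while (s.size() < limit) {
        s += 'X';
        // Any new match must end at the character just appended, so compare
        // only the final pattern.size() characters of s against the pattern.
        if (s.size() >= pattern.size() &&
            s.compare(s.size() - pattern.size(), pattern.size(), pattern) == 0)
            found = true;
    }
    if (found)
        std::cout << "Found!\n";
    std::cout << "x's length = " << s.size() << '\n';
    return 0;
}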
The C/C++ languages are not easy; it takes years to learn to write fast programs.
Here is a strncmp(3) version, modified from the C version above:
#define _GNU_SOURCE
#include <string.h>
#include <stdio.h>

void test()
{
    int limit = 102 * 1024;
    char s[limit];
    size_t size = 0;
    while (size < limit) {
        s[size++] = 'X';
        if (!strncmp(s, "ABCDEFGHIJKLMNOPQRSTUVWXYZ", 26)) {
            fprintf(stderr, "zomg\n");
            return;
        }
    }
    printf("x's length = %zu\n", size);
}

int main()
{
    test();
    return 0;
}
Your test code is checking a pathological scenario of excessive string concatenation. (The string-search part of the test could have probably been omitted, I bet you it contributes almost nothing to the final results.) Excessive string concatenation is a pitfall that most languages warn very strongly against, and provide very well known alternatives for, (i.e. StringBuilder,) so what you are essentially testing here is how badly these languages fail under scenarios of perfectly expected failure. That's pointless.
An example of a similarly pointless test would be to compare the performance of various languages when throwing and catching an exception in a tight loop. All languages warn that exception throwing and catching is abysmally slow. They do not specify how slow, they just warn you not to expect anything. Therefore, to go ahead and test precisely that, would be pointless.
So, it would make a lot more sense to repeat your test substituting the mindless string concatenation part (s += "X") with whatever construct is offered by each one of these languages precisely for avoiding string concatenation. (Such as class StringBuilder.)
As mentioned by sbi, the test case is dominated by the search operation.
I was curious how the speed of pure text appending compares between C++ and JavaScript.
System: Raspberry Pi 2, g++ 4.6.3, node v0.12.0, g++ -std=c++0x -O2 perf.cpp
C++ : 770ms
C++ without reserve: 1196ms
Javascript: 2310ms
C++
#include <iostream>
#include <string>
#include <chrono>
using namespace std;
using namespace std::chrono;

void test()
{
    high_resolution_clock::time_point t1 = high_resolution_clock::now();

    int x = 0;
    int limit = 1024 * 1024 * 100;
    string s("");
    s.reserve(1024 * 1024 * 101);
    for (int i = 0; s.size() < limit; i++) {
        s += "SUPER NICE TEST TEXT";
    }

    high_resolution_clock::time_point t2 = high_resolution_clock::now();
    auto duration = std::chrono::duration_cast<std::chrono::milliseconds>(t2 - t1).count();
    cout << duration << endl;
}

int main()
{
    test();
}
JavaScript
function test()
{
    var time = process.hrtime();

    var x = "";
    var limit = 1024 * 1024 * 100;
    for (var i = 0; x.length < limit; i++) {
        x += "SUPER NICE TEST TEXT";
    }

    var diff = process.hrtime(time);
    console.log('benchmark took %d ms', diff[0] * 1e3 + diff[1] / 1e6);
}
test();
It seems that Node.js has better algorithms for substring search. You could implement one yourself and try it out.

Benchmarking math.h square root and Quake square root

Okay, so I was bored and wondered how fast the math.h square root is in comparison to the one with the magic number in it (made famous by Quake but made by SGI).
But this has ended up in a world of hurt for me.
I first tried this on the Mac, where math.h would win hands down every time, then on Windows, where the magic number always won, but I think this is all down to my own noobness.
Compiling on the Mac with "g++ -o sq_root sq_root_test.cpp", the program takes about 15 seconds to complete. But compiled in VS2005 in release mode it takes a split second (in fact I had to compile in debug just to get it to show some numbers).
My poor man's benchmarking: is this really stupid? Because I get 0.01 for math.h and 0 for the magic number (it can't be that fast, can it?).
I don't know if this matters, but the Mac is Intel and the PC is AMD. Is the Mac using hardware for the math.h square root?
I got the fast square root algorithm from http://en.wikipedia.org/wiki/Fast_inverse_square_root
//sq_root_test.cpp
#include <iostream>
#include <math.h>
#include <ctime>

float invSqrt(float x)
{
    union {
        float f;
        int i;
    } tmp;
    tmp.f = x;
    tmp.i = 0x5f3759df - (tmp.i >> 1);
    float y = tmp.f;
    return y * (1.5f - 0.5f * x * y * y);
}

int main() {
    std::clock_t start;// = std::clock();
    std::clock_t end;
    float rootMe;
    int iterations = 999999999;

    // ---
    rootMe = 2.0f;
    start = std::clock();
    std::cout << "Math.h SqRoot: ";
    for (int m = 0; m < iterations; m++) {
        (float)(1.0/sqrt(rootMe));
        rootMe++;
    }
    end = std::clock();
    std::cout << (difftime(end, start)) << std::endl;

    // ---
    std::cout << "Quake SqRoot: ";
    rootMe = 2.0f;
    start = std::clock();
    for (int q = 0; q < iterations; q++) {
        invSqrt(rootMe);
        rootMe++;
    }
    end = std::clock();
    std::cout << (difftime(end, start)) << std::endl;
}
There are several problems with your benchmarks. First, your benchmark includes a potentially expensive cast from int to float. If you want to know what a square root costs, you should benchmark square roots, not datatype conversions.
Second, your entire benchmark can be (and is) optimized out by the compiler because it has no observable side effects. You don't use the returned value (or store it in a volatile memory location), so the compiler sees that it can skip the whole thing.
A clue here is that you had to disable optimizations. That means your benchmarking code is broken. Never ever disable optimizations when benchmarking. You want to know which version runs fastest, so you should test it under the conditions it'd actually be used under. If you were to use square roots in performance-sensitive code, you'd enable optimizations, so how it behaves without optimizations is completely irrelevant.
Also, you're not benchmarking the cost of computing a square root, but of the inverse square root.
If you want to know which way of computing the square root is fastest, you have to move the 1.0/... division down to the Quake version. (And since division is a pretty expensive operation, this might make a big difference in your results)
Finally, it might be worth pointing out that Carmack's little trick was designed to be fast on 12-year-old computers. Once you fix your benchmark, you'll probably find that it's no longer an optimization, because today's CPUs are much faster at computing "real" square roots.
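Putting that advice together, a fairer harness might look like the following sketch (my own code, reusing the question's invSqrt): compile with optimizations on, apply the 1.0f/ division to both sides, and accumulate the results so the loops cannot be removed.

#include <chrono>
#include <cmath>
#include <cstdio>

static float invSqrt(float x)
{
    // Same magic-number routine as in the question.
    union { float f; int i; } tmp;
    tmp.f = x;
    tmp.i = 0x5f3759df - (tmp.i >> 1);
    float y = tmp.f;
    return y * (1.5f - 0.5f * x * y * y);
}

// Time `iterations` calls of f, keeping every result in an accumulator so
// the compiler cannot discard the work.
template <class F>
static double time_it(F f, int iterations, float& sink)
{
    auto t0 = std::chrono::steady_clock::now();
    float acc = 0.0f;
    float x = 2.0f;
    for (int i = 0; i < iterations; ++i) {
        acc += f(x);
        x += 1.0f;
    }
    auto t1 = std::chrono::steady_clock::now();
    sink = acc;
    return std::chrono::duration<double>(t1 - t0).count();
}

int main()
{
    const int iterations = 100000000;
    float s1 = 0.0f, s2 = 0.0f;
    double t_std   = time_it([](float x) { return 1.0f / std::sqrt(x); }, iterations, s1);
    double t_quake = time_it([](float x) { return invSqrt(x); },          iterations, s2);
    // Printing the sums makes the results observable (and lets you sanity-check accuracy).
    std::printf("std 1/sqrt : %.3f s (sum %g)\n", t_std, s1);
    std::printf("quake      : %.3f s (sum %g)\n", t_quake, s2);
    return 0;
}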