numpy.dot 100 times slower than native C++11

numpy.dot 100 times slower than native C++11 - c++

I have a Matlab-background, and when I bought a laptop a year ago, I carefully selected one which has a lot of compute power, the machine has 4 threads and it offers me 8 threads at 2.4GHz. The machine proved itself to be very powerful, and using simple parfor-loops I could utilize all the processor threads, with which I got a speedup near 8 for many problems and experiments.
This nice Sunday I was experimenting with numpy, people often tell me that the core business of numpy is implemented efficiently using libblas, and possibly even using multiple cores and libraries like OpenMP (with OpenMP you can create parfor-like loops using c-style pragma's).
This is the general approach for many numerical and machine learning algorithms, you express them using expensive high-level operations like matrix multiplications, but in an expensive, high-level language like Matlab and python for comfort. Moreover, c(++) allows us to bypass the GIL.
So the cool part is that linear algebra-stuff should process really fast in python, whenever you use numpy. You just have the overhead of some function-calling, but then if the calculation behind it is large, that's negligible.
So, without even touching the topic that not everything can be expressed in linear algebra or other numpy operations, I gave it a spin:
t = time.time(); numpy.dot(range(100000000), range(100000000)); print(time.time() - t)
40.37656021118164
So I, these 40 seconds I saw ONE of the 8 threads on my machine working for 100%, and the others were near 0%. I didn't like this, but even with one thread working I'd expect this to run in approximately 0.something seconds. The dot-product does 100M +'es and *'es, so we have 2400M / 100M = 24 clock ticks per second for one +, one * and whatever overhead.
Nevertheless, the algorithm needs 40* 24 =approx= 1000 ticks (!!!!!) for the +, * and overhead. Let's do this in C++:
#include<iostream>
int main() {
unsigned long long result = 0;
for(unsigned long long i=0; i < 100000000; i++)
result += i * i;
std::cout << result << '\n';
}
BLITZ:
herbert#machine:~$ g++ -std=c++11 dot100M.cc
herbert#machine:~$ time ./a.out
662921401752298880
real 0m0.254s
user 0m0.254s
sys 0m0.000s
0.254 seconds, almost 100 times faster than numpy.dot.
I thought, maybe the python3 range-generator is the slow part, so I handicapped my c++11 implementation by storing all 100M numbers in a std::vector first (using iterative push_back's), and than iterating over it. This was a lot slower, it took a little below 4 seconds, which still is 10 times faster.
I installed my numpy using 'pip3 install numpy' on ubuntu, and it started compiling for some time, using both gcc and gfortran, moreover I saw mentions of blas-header files passing through the compiler output.
For what reason is numpy.dot so extremely slow?

So your comparison is unfair. In your python example, you first generate two range objects, convert them to numpy-arrays and then doing the scalar product. The calculation takes the least part. Here are the numbers for my computer:
>>> t=time.time();x=numpy.arange(100000000);numpy.dot(x,x);print time.time()-t
1.28280997276
And without the generation of the array:
>>> t=time.time();numpy.dot(x,x);print time.time()-t
0.124325990677
For completion, the C-version takes roughly the same time:
real 0m0.108s
user 0m0.100s
sys 0m0.007s

range generates a list based on your given parameters, where as your for loop in C merely increments a number.
I agree that it seems fairly costly computationally wise to spend so much time on generating one list-- then again, it is a big list, and you're requesting two of them ;-)
EDIT: As mentioned in comments range generates lists, not arrays.
Try substituting your range method with an incrementing while loop or similar and see if you get more tolerable results.

Related

How can one verify the proper operation of a sieve close to 2^64?

Small primes - up to about 1,000,000,000,000 - are readily available from various sources. The Prime Pages (utm.edu) have lists for the first 50 million primes, primos.mat.br goes up to 10^12, and programs like the one available at primesieve.org go even higher.
However, when it comes to numbers close to 2^64 there's only the ten primes mentioned on the page Primes just less than a power of two at primes.utm.edu and that seems to be it.
The primality test found there refuses to work on numbers that don't fit a double, others - elsewhere - fail to refuse and just print trash. The primesieve.org program refuses to work with numbers that aren't at least some 40 billion below 2^64, which doesn't exactly inspire confidence in the quality of their coding. The same result everywhere: nada, zilch, niente.
The cogs and gears of sieves start creaking around the 2^62 mark, and close to 2^64 there's hardly a cog that doesn't creak loudly threatening to break apart. Hence the need for testing the implementation is greatest where verification is most difficult, because of the scarcity/absence of reliable reference data. The primesieve.org program seems to be the only one that works at least up to 2^63 or thereabouts, but I don't trust it too much because of the above-mentioned issue.
So how then can one verify the proper operation of a sieve close to 2^64? Are there reliable lists somewhere for a million (or ten million or a hundred million) primes just below and above powers of two like 2^64, 2^63 and so on? Or are there reliable (trustworthy, verified, banged-on a lot) programs that yield such sequences or that can verify primes or lists of primes?
Once a sieve has been verified it can be used to produce handy lists with sums/checksums for loads of interesting ranges, but absent such lists the situation seems difficult...
P.S.: I determined the upper limit for the primesieve.org turbo siever to be UINT64_MAX - 10 * UINT32_MAX, or 0xFFFFFFF600000009. That means only the 10 * UINT32_MAX highest primes don't have any reference data at all so far...

Instead of looking for a pre-computed list, you could compare the output of your sieve to a different sieve. A good sieve, written by Tomás Oliveira e Silva, is available at http://sweet.ua.pt/tos/software/prime_sieve.html.
Another way to test your code is by testing the primality of all numbers your sieve reports as prime (or conversely, testing the non-primality of all numbers your sieve does not report as prime). A good way to do that is the Baillie-Wagstaff test. You can find a good-quality implementation by Thomas R. Nicely at http://www.trnicely.net/misc/bpsw.html.
You might also be interested in Jan Feitsma's tables of pseudoprimes at http://www.janfeitsma.nl/math/psp2/index, which are complete to 264.

First, thanks for sharing your program and working on correctness. I think it's important to do testing, and sieving near the size boundary was something I spent time working on for my code.
"The same result everywhere: nada, zilch, niente." You're not looking hard enough. There are plenty of tools that do this. It's too bad primesieve doesn't go all the way to 2^64-1, but that doesn't mean nothing else does.
"So how then can one verify the proper operation of a sieve close to 2^64?" One thing I did it is make an edge-case test that runs through all combinations of start/end points near 2^64-1, verifying a number of methods all generate the same results. This relies on having a list of these primes to start, but there are many ways to get these. Not only does this test the sieve at this range, but tests the start/end conditions to make sure there are no issues there.
Some ways to generate a million primes below 2^64:
time perl -Mntheory=:all -E 'forprimes { say } ~0-44347170,~0' | md5sum
Takes ~2s to generate 1M primes. We can force use of different code (Perl or GMP), use primality tests, etc. Lots of ways to do this, including just looping and calling is_provable_prime($n), for example. There are also other Perl modules including Math::Primality though they are much slower.
echo 'forprime(i=2^64-44347170,2^64-1,print(i))' | time gp -f -q | md5sum
Takes ~13s to generate 1M primes. As with the Perl module, there are lots of alternate ways including looping calling isprime which is a deterministic routine (assuming a non-ancient version of Pari/GP).
#include <stdio.h>
#include <gmp.h>
int main(void) {
mpz_t n;
mpz_init_set_str(n,"18446744073665204445",10);
mpz_nextprime(n, n);
while (mpz_sizeinbase(n,2) < 65) {
/* If you don't trust mpz_nextprime, one could add this:
* if (!mpz_probab_prime_p(n, 100))
* { fprintf(stderr, "Bad nextprime!\n"); return -1; }
*/
gmp_printf("%Zd\n",n);
mpz_nextprime(n, n);
}
mpz_clear(n);
return 0;
}
Takes about 30s and get the same results. This one is more dubious as I don't trust its 25 preset-random base MR test as much as BPSW or one of the proof methods, but it doesn't matter in this case as we see the results match. Adding the extra 100 tests is very expensive in time, but would make it extremely unlikely to have false results (I suspect we have overlapping bases so this is also wasteful).
from sympy import nextprime
n = 2**64-44347170;
n = nextprime(n)
while n < 2**64:
print n
n = nextprime(n)
Using Python's SymPy. Unfortunately primerange uses crazy memory when given 2^64-1 so that's not possible to use. Doing the simple nextprime method isn't ideal -- it takes about 5 minutes, but generates the same results (the current SymPy isprime uses 46 prime bases, which is many more than needed for deterministic results under 2^64).
There are other tools, e.g. FLINT, GAP, etc.
I realize that since you're on Windows, the world is wonky and lots of things don't work right. I have tested Perl's ntheory on Windows and with both Cygwin and Strawberry Perl from command prompt I get the same results. The GMP code ought to work the same, assuming GMP works correctly.
Edit add: If your results don't match one of the comparison methods, then one of the two (or both) is wrong. It may be the comparison code that is wrong! It helps everyone if you find and report errors. It's unlikely but possible they are both wrong in the same way, which is why I like to compare with as many other sources as possible. To me that is more robust than picking one "golden" code to compare against. Especially if you're using an oddball platform that may not have been thoroughly tested.
For BPSW, there are a few implementations around:
Pari. AES Lucas, in the Pari source code so not sure how portable it is.
TR Nicely. Strong Lucas, standalone code.
David Cleaver. Standard, Strong or Extra Strong Lucas. Standalone library, very clear, very easy to use.
My non-GMP code, including asm Montgomery math for x86_64. Quite a bit faster than bigint codes of course.
My GMP code. Standard, Strong, AES, or Extra strong Lucas. Faster than the other bigint codes. Also has other Frobenius and other compositeness tests. Can be made standalone.
I have a version using LibTomMath that I hope to get into one of the Perl6 VMs. Only interesting if you want to use LTM.
All verified vs. the Feitsma data. I'm sure there are more implementations around as well. FLINT has a variation that is quite fast, but it isn't really BPSW (but it's been verified for numbers under 2^64).

In general, one must use less naive techniques than trial division, or be very patient.
(gp/PARI documentation)
For 64-bit integers, trial division takes millions of times as long as even a simple sieve, let alone thoroughbreds like Kim Walisch's program (primesieve.org) which is orders of magnitude faster.
The reference sieve I want to verify (there's a standalone .cpp # pastebin) finds about a million primes per second when sieving close to 2^64, whereas the trial division code I lifted out of the gmp implementation takes 20 seconds to find even one. Restricting trial division to presieved primes (stored as deltas with one byte per prime for fast iteration) speeds it up by an order of magnitude, but it still outputs less than one prime per second on my laptop.
Hence, trial division can deliver only homœopathic amounts of reference data, even if I use all cores I can lay hands on including Kindle, phone and toaster.
More sophisticated tests like Miller-Rabin or the Baillie-PSW linked by user448810 are several orders of magnitude faster than trial division. For numbers up to 2^64 the Baillie-PSW has been verified to be deterministic (no strong pseudo primes below that threshold). The Miller-Rabin may or may not be deterministic up to 2^64 if the first 12 primes are used as base, or the 7-base set found by Jim Sinclar (meaning the 'net offers statements to that effect but apparently no evidence).
With Baillie-PSW verified - and faster to boot - it seems like a good choice. Unfortunately it is also several orders of magnitude more complicated than a sieve, making it even more important to find trustworthy implementations that are ready to compile without lots of twiddling or - ideally - available as binaries.
Thomas Nicely's Baillie-PSW page has source code that uses the gmp, and gp/PARI can use either gmp or its own code. The latter is also available as a binary, which is very fortunate since building gmp code on an exotic, off-beat platform like MinGW under Windows is a non-trivial undertaking, even if MPIR is used instead of gmp.
That gets us some bulk data but still nowhere near enough for verifying the sieve, since it is orders of magnitude too slow even for covering the blank area left by the cap of primesieve.org (10 * 2^32 numbers).
This is where Will Ness's bigint idea comes in. The operation of the sieve can be verified up to 1,000,000,000,000 using reference data from multiple, independent sources. Switching index variables from 32-bit to 64-bit eliminates most of the boundary cases that could cause the code to mess up in higher regions, leaving only a very few places where even uint64_t gets close to its limits. With those places thoroughly inspected and generously covered by test cases derived from the Baillie-PSW undertaking we can have reasonably high confidence that the sieve code is good. Add copious verification against primesieve.org in the range from 10^12 up to its cap, and it should be sufficient to regard the sieve implementation as trustworthy.
With the sieve up and running, it's easy to cover arbitray ranges with bulk data. Or with digests, as a canned/compressed means of verification that can serve needs of any size and shape. It's what I use in the demo .cpp I mentioned earlier, although my real code uses a mixture between an optimised digest implementation for general work and a special raw memory checksum of 128 bits for quick self-checks of factor sieve bitmaps.
SUMMARY
up to 1,000,000,000,000 verification against primos.mat.br or similar
up to 2^64 - 10 * 2^32 verification against primesieve.org
rest up to 2^64-1: verification of strategically chosen segments using Baillie-PSW (e.g. gp/PARI)

Correct QueryPerformanceCounter Function implementation / Time changes everytime

I have to create a sorting algorithm function that returns number of comparisons, number of copies and number of MICROSECONDS it uses to finish its sorting.
I have seen that to use microseconds i have to use the function QueryPerformance counter as it's accurate (Ps i know it isn't portable between OS)
So i've done that :
void Exchange_sort(int vect[], int dim, int &countconf, int &countcopy, double &time)
{
LARGE_INTEGER a, b, oh, freq;
QueryPerformanceFrequency(&freq);
QueryPerformanceCounter(&a);
QueryPerformanceCounter(&b);
oh.QuadPart = b.QuadPart - a.QuadPart; //Saves in oh the overhead time (?) accuracy
QueryPerformanceCounter(&a);
int i=0,j=0; // The sorting algorithm starts
for (i=0 ; i<dim-1 ; i++)
{ for(j=i+1 ; j<dim; j++ )
{
countconf++; // +1 Comparisons
if (vect[i]>vect[j])
{
scambio ( vect[i],vect[j] ); // It is a function that swaps 2 integers
countcopy=countcopy+3; // +3 copies
}
}
}
QueryPerformanceCounter(&b); // Ends timer
time = ( ( (double)(b.QuadPart - a.QuadPart - oh.QuadPart) /freq.QuadPart )
*1000000 ) ;
}
The *1000000 is actually to give microseconds...
I think like this it should work but everytime i call the function giving it the same dimension of the array, it returns a different time... How can i solve that?
Thank you very much, and sorry for my bad coding

Firstly, the performance counter frequency might not be that great. It's usually several hundred thousand or more, which gives a microsecond or tens of microseconds resolution, but you should be aware that it can be even worse.
Secondly, if your array size is small, your sort might finish in nanoseconds or microseconds, and you would not be able to measure that accurately with QueryPerformanceCounter.
Thirdly, when your benchmark process is running, Windows might take the CPU away from it for a (relatively) long time, milliseconds or maybe even hundreds of milliseconds. This will lead to highly irregular and seemingly erratic timings.
I have two suggestions that you might pursue independently of each other:
I suggest you investigate using the RDTSC instruction (using inline assembly or compiler intrinsics or even an existing library.) Which will most likely give you better resolution with far less overhead. But I have to warn you that it has its own bag of problems.
For this type of benchmark, you have to run your sort routine with the exact same input many times (tens or hundreds) and then take the smallest time measurement. The reason that you should adopt this strategy is that there are a few phenomena that will interfere with your timing and make it longer, but there is nothing that can make your sort go faster than it would on paper. Therefore, you need to run the test many many times and hope to all your gods that the fastest time you've measured is the actual running time with no interference or noise.
UPDATE: Reading through the comments on the question, it seems that you are trying to time a very short-running piece of code with a timer that doesn't have enough resolution. Either increase your input size, or use RDTSC.

The short answer for your question is that it is not possible to measure exactly the same time for all calls of the same function.
The fact that you are receiving different times is expected because your operating system is not a perfect Real-Time System, but a general purpose OS with multiple processes running at the same time and competing to be scheduled by the kernel to get its own CPU cycles.
And also, consider that, each time you execute your program or function, some of its instructions might be located at the RAM, and some might be available at the CPU L1 or L2 cache memory, and it will probably change from one execution to another. So, there are lots of variables to consider when evaluating the elapsed time for function calls using high level of precision.

Make g++ produce a program that can use multiple cores?

I have a c++ program with multiple For loops; each one runs about 5 million iterations. Is there any command I can use with g++ to make the resulting .exe will use multiple cores; i.e. make the first For loop run on the first core and the second For loop run on the second core at the same time? I've tried -O3 and -O3 -ftree-vectorize, but in both cases, my cpu usage still only hovers at around 25%.
EDIT:
Here is my code, in case in helps. I'm basically just making a program to test the speed capabilities of my computer.
#include <iostream>
using namespace std;
#include <math.h>
int main()
{
float *bob = new float[50102133];
float *jim = new float[50102133];
float *joe = new float[50102133];
int i,j,k,l;
//cout << "Starting test...";
for (i=0;i<50102133;i++)
bob[i] = sin(i);
for (j=0;j<50102133;j++)
bob[j] = sin(j*j);
for (k=0;k<50102133;k++)
bob[k] = sin(sqrt(k));
for (l=0;l<50102133;l++)
bob[l] = cos(l*l);
cout << "finished test.";
cout << "the 100120 element is," << bob[1001200];
return 0;
}

The most obviously choice would be to use OpenMP. Assuming your loop is one that's really easy to execute multiple iterations in parallel, you might be able to just add:
#pragma openmp parallel for
...immediately before the loop, and get it to execute in parallel. You'll also have to add -fopenmp when you compile.
Depending on the content of the loop, that may give anywhere from a nearly-linear speedup to slowing the code down somewhat. In the latter cases (slowdown or minimal speedup) there may be other things you can do with OpenMP to help speed it up, but without knowing at least a little about the code itself, it's hard to guess what to do or what improvement you may be able to expect at maximum.
The other advice you're getting ("Use threads") may be suitable. OpenMP is basically an automated way of putting threads to use for specific types of parallel code. For a situation such as you describe (executing multiple iterations of a loop in parallel) OpenMP is generally preferred--it's quite a bit simpler to implement, and may well give better performance unless you know multithreading quite well and/or expend a great deal of effort on parallelizing the code.
Edit:
The code you gave in the question probably won't benefit from multiple threads. The problem is that it does very little computation on each data item before writing the result out to memory. Even a single core can probably do the computation fast enough that the overall speed will be limited by the bandwidth to memory.
To stand a decent chance of getting some real benefit from multiple threads, you probably want to write some code that does more computation and less just reading and writing memory. For example, if we collapse your computations together, and do all of them on a single item, then sum the results:
double total = 0;
for (int i = 0; i < size; i++)
total += sin(i) + sin(i*i) + sin(sqrt(i)) + cos(i*i);
By adding a pragma:
#pragma omp parallel for reduction(+:total)
...just before the for loop, we stand a good chance of seeing a substantial improvement in execution speed. Without OpenMP, I get a time like this:
Real 16.0399
User 15.9589
Sys 0.0156001
...but with the #pragma and OpenMP enabled when I compile, I get a time like this:
Real 8.96051
User 17.5033
Sys 0.0468003
So, on my (dual core) processor, time has dropped from 16 to 9 seconds--not quite twice as fast, but pretty close. Of course, a lot of the improvement you get will depend on exactly how many cores you have available. For example, on my other computer (with an Intel i7 CPU), I get a rather larger improvement because it has more cores.
Without OpenMP:
Real 15.339
User 15.3281
Sys 0.015625
...and with OpenMP:
Real 3.09105
User 23.7813
Sys 0.171875
For completeness, here's the final code I used:
#include <math.h>
#include <iostream>
static const int size = 1024 * 1024 * 128;
int main(){
double total = 0;
#pragma omp parallel for reduction(+:total)
for (int i = 0; i < size; i++)
total += sin(i) + sin(i*i) + sin(sqrt(i)) + cos(i*i);
std::cout << total << "\n";
}

The compiler has no way to tell if your code inside the loop can be safely executed on multiple cores. If you want to use all your cores, use threads.

Use Threads or Processes, you may want to look to OpenMp

C++11 got support for threading but c++ compilers won't/can't do any threading on their own.

As others have pointed out, you can manually use threads to achieve this. You might look at libraries such as libdispatch (aka. GCD) or Intel's TBB to help you do this with the least pain.
The -ftree-vectorize option you mention is for targeting SIMD vector processor units on CPUs such as ARM's NEON or Intel's SSE. The code produced is not thread-parallel, but rather operation parallel using a single thread.
The code example posted above is highly amenable to parallelism on SIMD systems as the body of each loop very obviously has no dependancies on the previous iteration, and the operations in the loop are linear.
On some ARM Cortex A series systems at least, you may need to accept slightly reduced accuracy to get the full benefits.

Algorithm: taking out every 4th item of an array

I have two huge arrays (int source[1000], dest[1000] in the code below, but having millions of elements in reality). The source array contains a series of ints of which I want to copy 3 out of every 4.
For example, if the source array is:
int source[1000] = {1,2,3,4,5,6,7,8....};
int dest[1000];
Here is my code:
for (int count_small = 0, count_large = 0; count_large < 1000; count_small += 3, count_large +=4)
{
dest[count_small] = source[count_large];
dest[count_small+1] = source[count_large+1];
dest[count_small+2] = source[count_large+2];
}
In the end, dest console output would be:
1 2 3 5 6 7 9 10 11...
But this algorithm is so slow! Is there an algorithm or an open source function that I can use / include?
Thank you :)
Edit: The actual length of my array would be about 1 million (640*480*3)
Edit 2: Processing this for loop takes about 0.98 seconds to 2.28 seconds, while the other code only take 0.08 seconds to 0.14 seconds, so the device uses at least 90 % cpu time only for the loop

Well, the asymptotic complexity there is as good as it's going to get. You might be able to achieve slightly better performance by loading in the values as four 4-way SIMD integers, shuffling them into three 4-way SIMD integers, and writing them back out, but even that's not likely to be hugely faster.
With that said, though, the time to process 1000 elements (Edit: or one million elements) is going to be utterly trivial. If you think this is the bottleneck in your program, you are incorrect.

Before you do much more, try profiling your application and determine if this is the best place to spend your time. Then, if this is a hot spot, determine how fast is it, and how fast you need it to be/might achieve? Then test the alternatives; the overhead of threading or OpenMP might even slow it down (especially, as you now have noted, if you are on a single core processor - in which case it won't help at all). For single threading, I would look to memcpy as per Sean's answer.
#Sneftel has also reference other options below involving SIMD integers.
One option would be to try parallel processing the loop, and see if that helps. You could try using the OpenMP standard (see Wikipedia link here), but you would have to try it for your specific situation and see if it helped. I used this recently on an AI implementation and it helped us a lot.
#pragma omp parallel for
for (...)
{
... do work
}
Other than that, you are limited to the compiler's own optimisations.
You could also look at the recent threading support in C11, though you might be better off using pre-implemented framework tools like parallel_for (available in the new Windows Concurrency Runtime through the PPL in Visual Studio, if that's what you're using) than rolling your own.
parallel_for(0, max_iterations,
[...] (int i)
{
... do stuff
}
);
Inside the for loop, you still have other options. You could try a for loop that iterates and skips every for, instead of doing 3 copies per iteration (just skip when (i+1) % 4 == 0), or doing block memcopy operations for groups of 3 integers as per Seans answer. You might achieve slightly different compiler optimisations for some of these, but it is unlikely (memcpy is probably as fast as you'll get).
for (int i = 0, int j = 0; i < 1000; i++)
{
if ((i+1) % 4 != 0)
{
dest[j] = source[i];
j++;
}
}
You should then develop a test rig so you can quickly performance test and decide on the best one for you. Above all, decide how much time is worth spending on this before optimising elsewhere.

You could try memcpy instead of the individual assignments:
memcpy(&dest[count_small], &source[count_large], sizeof(int) * 3);

Is your array size only a 1000? If so, how is it slow? It should be done in no time!
As long as you are creating a new array and for a single threaded application, this is the only away AFAIK.
However, if the datasets are huge, you could try a multi threaded application.
Also you could explore having a bigger data type holding the value, such that the array size decreases... That is if this is viable to your real life application.

If you have Nvidia card you can consider using CUDA. If thats not the case you can try other parallel programming methods/environments as well.

Best way to test code speed in C++ without profiler, or does it not make sense to try?

On SO, there are quite a few questions about performance profiling, but I don't seem to find the whole picture. There are quite a few issues involved and most Q & A ignore all but a few at a time, or don't justify their proposals.
What Im wondering about. If I have two functions that do the same thing, and Im curious about the difference in speed, does it make sense to test this without external tools, with timers, or will this compiled in testing affect the results to much?
I ask this because if it is sensible, as a C++ programmer, I want to know how it should best be done, as they are much simpler than using external tools. If it makes sense, lets proceed with all the possible pitfalls:
Consider this example. The following code shows 2 ways of doing the same thing:
#include <algorithm>
#include <ctime>
#include <iostream>
typedef unsigned char byte;
inline
void
swapBytes( void* in, size_t n )
{
for( size_t lo=0, hi=n-1; hi>lo; ++lo, --hi )
in[lo] ^= in[hi]
, in[hi] ^= in[lo]
, in[lo] ^= in[hi] ;
}
int
main()
{
byte arr[9] = { 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h' };
const int iterations = 100000000;
clock_t begin = clock();
for( int i=iterations; i!=0; --i )
swapBytes( arr, 8 );
clock_t middle = clock();
for( int i=iterations; i!=0; --i )
std::reverse( arr, arr+8 );
clock_t end = clock();
double secSwap = (double) ( middle-begin ) / CLOCKS_PER_SEC;
double secReve = (double) ( end-middle ) / CLOCKS_PER_SEC;
std::cout << "swapBytes, for: " << iterations << " times takes: " << middle-begin
<< " clock ticks, which is: " << secSwap << "sec." << std::endl;
std::cout << "std::reverse, for: " << iterations << " times takes: " << end-middle
<< " clock ticks, which is: " << secReve << "sec." << std::endl;
std::cin.get();
return 0;
}
// Output:
// Release:
// swapBytes, for: 100000000 times takes: 3000 clock ticks, which is: 3sec.
// std::reverse, for: 100000000 times takes: 1437 clock ticks, which is: 1.437sec.
// Debug:
// swapBytes, for: 10000000 times takes: 1781 clock ticks, which is: 1.781sec.
// std::reverse, for: 10000000 times takes: 12781 clock ticks, which is: 12.781sec.
The issues:
Which timers to use and how get the cpu time actually consumed by the code under question?
What are the effects of compiler optimization (since these functions just swap bytes back and forth, the most efficient thing is obviously to do nothing at all)?
Considering the results presented here, do you think they are accurate (I can assure you that multiple runs give very similar results)? If yes, can you explain how std::reverse gets to be so fast, considering the simplicity of the custom function. I don't have the source code from the vc++ version that I used for this test, but here is the implementation from GNU. It boils down to the function iter_swap, which is completely incomprehensible for me. Would this also be expected to run twice as fast as that custom function, and if so, why?
Contemplations:
It seems two high precision timers are being proposed: clock() and QueryPerformanceCounter (on windows). Obviously we would like to measure the cpu time of our code and not the real time, but as far as I understand, these functions don't give that functionality, so other processes on the system would interfere with measurements. This page on the gnu c library seems to contradict that, but when I put a breakpoint in vc++, the debugged process gets a lot of clock ticks even though it was suspended (I have not tested under gnu). Am I missing alternative counters for this, or do we need at least special libraries or classes for this? If not, is clock good enough in this example or would there be a reason to use the QueryPerformanceCounter?
What can we know for certain without debugging, dissassembling and profiling tools? Is anything actually happening? Is the function call being inlined or not? When checking in the debugger, the bytes do actually get swapped, but I'd rather know from theory why, than from testing.
Thanks for any directions.
update
Thanks to a hint from tojas the swapBytes function now runs as fast as the std::reverse. I had failed to realize that the temporary copy in case of a byte must be only a register, and thus is very fast. Elegance can blind you.
inline
void
swapBytes( byte* in, size_t n )
{
byte t;
for( int i=0; i<7-i; ++i )
{
t = in[i];
in[i] = in[7-i];
in[7-i] = t;
}
}
Thanks to a tip from ChrisW I have found that on windows you can get the actual cpu time consumed by a (read:your) process trough Windows Management Instrumentation. This definitely looks more interesting than the high precision counter.

Obviously we would like to measure the cpu time of our code and not the real time, but as far as I understand, these functions don't give that functionality, so other processes on the system would interfere with measurements.
I do two things, to ensure that wall-clock time and CPU time are approximately the same thing:
Test for a significant length of time, i.e. several seconds (e.g. by testing a loop of however many thousands of iterations)
Test when the machine is more or less relatively idle except for whatever I'm testing.
Alternatively if you want to measure only/more exactly the CPU time per thread, that's available as a performance counter (see e.g. perfmon.exe).
What can we know for certain without debugging, dissassembling and profiling tools?
Nearly nothing (except that I/O tends to be relatively slow).

To answer you main question, it "reverse" algorithm just swaps elements from the array and not operating on the elements of the array.

Use QueryPerformanceCounter on Windows if you need a high-resolution timing. The counter accuracy depends on the CPU but it can go up to per clock pulse. However, profiling in real world operations is always a better idea.

Is it safe to say you're asking two questions?
Which one is faster, and by how much?
And why is it faster?
For the first, you don't need high precision timers. All you need to do is run them "long enough" and measure with low precision timers. (I'm old-fashioned, my wristwatch has a stop-watch function, and it is entirely good enough.)
For the second, surely you can run the code under a debugger and single-step it at the instruction level. Since the basic operations are so simple, you will be able to easily see roughly how many instructions are required for the basic cycle.
Think simple. Performance is not a hard subject. Usually, people are trying to find problems, for which this is a simple approach.

(This answer is specific to Windows XP and the 32-bit VC++ compiler.)
The easiest thing for timing little bits of code is the time-stamp counter of the CPU. This is a 64-bit value, a count of the number of CPU cycles run so far, which is about as fine a resolution as you're going to get. The actual numbers you get aren't especially useful as they stand, but if you average out several runs of various competing approaches then you can compare them that way. The results are a bit noisy, but still valid for comparison purposes.
To read the time-stamp counter, use code like the following:
LARGE_INTEGER tsc;
__asm {
cpuid
rdtsc
mov tsc.LowPart,eax
mov tsc.HighPart,edx
}
(The cpuid instruction is there to ensure that there aren't any incomplete instructions waiting to complete.)
There are four things worth noting about this approach.
Firstly, because of the inline assembly language, it won't work as-is on MS's x64 compiler. (You'll have to create a .ASM file with a function in it. An exercise for the reader; I don't know the details.)
Secondly, to avoid problems with cycle counters not being in sync across different cores/threads/what have you, you may find it necessary to set your process's affinity so that it only runs on one specific execution unit. (Then again... you may not.)
Thirdly, you'll definitely want to check the generated assembly language to ensure that the compiler is generating roughly the code you expect. Watch out for bits of code being removed, functions being inlined, that sort of thing.
Finally, the results are rather noisy. The cycle counters count cycles spent on everything, including waiting for caches, time spent on running other processes, time spent in the OS itself, etc. Unfortunately, it's not possible (under Windows, at least) to time just your process. So, I suggest running the code under test a lot of times (several tens of thousands) and working out the average. This isn't very cunning, but it seems to have produced useful results for me at any rate.

I would suppose that anyone competent enough to answer all your questions is gong to be far too busy to answer all your questions. In practice it is probably more effective to ask a single, well-defined questions. That way you may hope to get well-defined answers which you can collect and be on your way to wisdom.
So, anyway, perhaps I can answer your question about which clock to use on Windows.
clock() is not considered a high precision clock. If you look at the value of CLOCKS_PER_SEC you will see it has a resolution of 1 millisecond. This is only adequate if you are timing very long routines, or a loop with 10000's of iterations. As you point out, if you try and repeat a simple method 10000's of times in order to get a time that can be measured with clock() the compiler is liable to step in and optimize the whole thing away.
So, really, the only clock to use is QueryPerformanceCounter()

Is there something you have against profilers? They help a ton. Since you are on WinXP, you should really give a trial of vtune a try. Try a call graph sampling test and look at self time and total time of the functions being called. There's no better way to tune your program so that it's the fastest possible without being an assembly genius (and a truly exceptional one).
Some people just seem to be allergic to profilers. I used to be one of those and thought I knew best about where my hotspots were. I was often correct about obvious algorithmic inefficiencies, but practically always incorrect about more micro-optimization cases. Just rewriting a function without changing any of the logic (ex: reordering things, putting exceptional case code in a separate, non-inlined function, etc) can make functions a dozen times faster and even the best disassembly experts usually can't predict that without the profiler.
As for relying on simplistic timing tests alone, they are extremely problematic. That current test is not so bad but it's a very common mistake to write timing tests in ways in which the optimizer will optimize out dead code and end up testing the time it takes to do essentially a nop or even nothing at all. You should have some knowledge to interpret the disassembly to make sure the compiler isn't doing this.
Also timing tests like this have a tendency to bias the results significantly since a lot of them just involve running your code over and over in the same loop, which tends to simply test the effect of your code when all the memory in the cache with all the branch prediction working perfectly for it. It's often just showing you best case scenarios without showing you the average, real-world case.
Depending on real world timing tests is a little bit better; something closer to what your application will be doing at a high level. It won't give you specifics about what is taking what amount of time, but that's precisely what the profiler is meant to do.

Wha? How to measure speed without a profiler? The very act of measuring speed is profiling! The question amounts to, "how can I write my own profiler?" And the answer is clearly, "don't".
Besides, you should be using std::swap in the first place, which complete invalidates this whole pointless pursuit.
-1 for pointlessness.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js