std::vector faster than plain array? - c++

I have just tried to benchmark std::sort on both a std::vector<std::pair<float, unsigned int>> (filled with push_back) and a plain std::pair<float, unsigned int> * array (allocated using new and then filled one by one). The compare function just compared the float parts of the pairs.
Surprisingly, when used on 16M values, std::sort took only about 1940 ms on the std::vector but about 2190 ms on the array. Can anybody explain how the vector can be faster? Is it due to caching, or is the array version of std::sort just poorly implemented?
gcc (GCC) 4.4.5 20110214 (Red Hat 4.4.5-6)
Intel(R) Core(TM) i7 CPU 870 @ 2.93 GHz - cache size 8192 KB (the computer has two quad-core CPUs, but I assume the sort is single-threaded)
EDIT: Now you can call me a dumbass, but when I tried to reproduce the code I used for the measurements (I had already deleted the original), I could not reproduce the results - now the array version takes about 1915 ± 5 ms (measured over 32 runs). I can only swear that I ran the original test three times with 10 measurements each (manually), with similar results, but this is not a rigorous proof.
There was probably some bug in the original code. A background process seems improbable, because I alternated measurements of the vector and array versions, the vector results still hold, and no other user was logged in.
Please, consider this question as closed. Thanks for your efforts.
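For anyone who wants to re-run the comparison, here is a minimal sketch of the setup the question describes; the random key generation and std::chrono timing are assumptions, not the original code.

#include <algorithm>
#include <chrono>
#include <cstdlib>
#include <iostream>
#include <utility>
#include <vector>

// Compare only the float part of the pair, as the question describes.
static bool cmp_first(const std::pair<float, unsigned int>& a,
                      const std::pair<float, unsigned int>& b)
{
    return a.first < b.first;
}

int main()
{
    const std::size_t n = 16 * 1024 * 1024;   // 16M values, as in the question

    std::vector<std::pair<float, unsigned int> > v;
    std::pair<float, unsigned int>* a = new std::pair<float, unsigned int>[n];

    // Fill both containers with identical data so the two sorts do the same work.
    for (std::size_t i = 0; i < n; ++i) {
        float key = static_cast<float>(std::rand()) / RAND_MAX;
        v.push_back(std::make_pair(key, static_cast<unsigned int>(i)));
        a[i] = std::make_pair(key, static_cast<unsigned int>(i));
    }

    auto t0 = std::chrono::steady_clock::now();
    std::sort(v.begin(), v.end(), cmp_first);
    auto t1 = std::chrono::steady_clock::now();
    std::sort(a, a + n, cmp_first);
    auto t2 = std::chrono::steady_clock::now();

    std::cout << "vector: "
              << std::chrono::duration_cast<std::chrono::milliseconds>(t1 - t0).count()
              << " ms, array: "
              << std::chrono::duration_cast<std::chrono::milliseconds>(t2 - t1).count()
              << " ms" << std::endl;
    delete[] a;
    return 0;
}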

std::vector<std::pair<float, unsigned int>> (filled with push_back operation)
This stores all the data contiguously, so memory locality is very good.
std::pair<float, unsigned int> * array (allocated using new and then filled one by one)
This scatters the data all over memory.
You've set up a very unfair comparison between vector and a simple array. The extra indirection involved in the array case is going to hurt, and the lack of locality is going to kill cache performance. I'm surprised you don't see a bigger win in favor of contiguous storage.

They will use the same version of sort. The difference is quite likely down to random CPU effects, like caching or thread context switches.

Did you use -O3 to compile your code?
If not, do it. All other benchmark results are meaningless, especially for template code.
Did you run the test many times?
This helps prevent things like interrupts or caching from having too much of an impact on your result.
Don't use floating-point comparison or arithmetic for benchmarks. The results depend heavily on the compiler, platform, compiler options, etc.
How was your test data created?
The time required by most sorting algorithms changes depending on the ordering of the input data.
Which method for measuring time did you use? Clock cycles? A timer?
Anyway, writing benchmarks that provide reliable results is not as easy as it might seem at first. Don't use a benchmark to determine what the proper code for your problem is.
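As an illustration of the "run the test many times" advice, here is a minimal timing-harness sketch; the helper name and the best-of-N policy are my own choices, not a standard utility.

#include <algorithm>
#include <chrono>
#include <iostream>
#include <vector>

// Time a callable over several runs and report the best (minimum) time.
// The minimum is usually the most repeatable figure, since interrupts,
// context switches and cold caches only ever make a run slower.
// Build with optimizations enabled, e.g. g++ -O3.
template <typename F>
long long time_best_of_ms(F f, int runs = 10)
{
    long long best = -1;
    for (int i = 0; i < runs; ++i) {
        auto t0 = std::chrono::steady_clock::now();
        f();
        auto t1 = std::chrono::steady_clock::now();
        long long ms = std::chrono::duration_cast<std::chrono::milliseconds>(t1 - t0).count();
        if (best < 0 || ms < best)
            best = ms;
    }
    return best;
}

// Example use: re-sort a fresh copy each run so every run does the same work.
// std::vector<int> data = ...;
// std::cout << time_best_of_ms([&] {
//     std::vector<int> copy = data;
//     std::sort(copy.begin(), copy.end());
// }) << " ms\n";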

Related

concurrency::parallel_sort overhead and performance hit (rule of thumb)?

Recently I stumbled across a very large performance improvement -- I'm talking about a 4x improvement -- with a one-line code change. I just changed a std::sort call to concurrency::parallel_buffered_sort:
// Get a contiguous vector copy of the pixels from the image.
std::vector<float> vals = image.copyPixels();
// New, fast way. Takes 7 seconds on a test image.
concurrency::parallel_buffered_sort(vals.begin(), vals.end());
// Old, slow way -- takes 30 seconds on a test image
// std::sort(vals.begin(), vals.end());
This was for a large image and dropped my processing time from 30 seconds to 7 seconds. However, some cases will involve small images, and I don't know if I can or should just do this blindly.
I would like to make some judicious use of parallel_sort, parallel_for and the like but I'm wondering about what threshold needs to be crossed (in terms of number of elements to be sorted/iterated through) before it becomes a help and not a hindrance.
I will eventually go through some lengthy performance testing, but at the moment I don't have much time to do that. I would like to get this working better "most" of the time and not hurting performance any of the time (or at least only rarely).
Can someone out there with some experience in this area give me a reasonable rule of thumb that will help me in "most" cases? Does one exist?
The requirement of RandomIterator and the presence of overloads with a const size_t _Chunk_size = 2048 parameter, which controls the threshold of serialisation, imply that the library authors are aware of this concern. Thus probably just using concurrency::parallel_* as drop-in replacements for std::* will do fine.
Here is how I think about it: the Windows thread scheduling quantum is roughly 20-60 ms on a workstation and 120 ms on a server, so anything that can be done in that much time doesn't need concurrency.
So I am guessing that up to 1k-10k elements you are fine with std::sort, since the latency of launching multiple threads would be overkill; from 10k onwards there is a distinct advantage in using parallel sort or parallel buffered sort (if you can afford it), and a parallel radix sort would probably be great for very, very large inputs.
Caveats apply. :o)
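Putting that rule of thumb into code, here is a hedged sketch of a size-threshold dispatch; the 10000 cutoff is illustrative, not measured, and should be tuned for your own data.

#include <algorithm>
#include <cstddef>
#include <vector>
#include <ppl.h>   // Microsoft PPL: concurrency::parallel_buffered_sort

// Dispatch on input size: below the cutoff the cost of spinning up threads
// dominates, above it the parallel version tends to win. The 10000 figure is
// the rule of thumb quoted above, not a measured constant.
template <typename RandomIt>
void sort_maybe_parallel(RandomIt first, RandomIt last)
{
    const std::size_t cutoff = 10000;
    if (static_cast<std::size_t>(last - first) < cutoff)
        std::sort(first, last);
    else
        concurrency::parallel_buffered_sort(first, last);  // uses O(n) scratch space
}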
I don’t know about that concurrency namespace, but any sane implementation of a parallel algorithm will adapt appropriately to the size of the input. You shouldn’t have to worry about the details of the underlying thread implementation. Just do it.

Counting the length of the Collatz sequence - custom iterator produces slowdown?

I've been solving UVA problem #100 - "The 3n + 1 problem". This is their "sample" problem with a very forgiving time limit (3 sec; their sample solution with no caching at all runs in 0.738 sec, and my best solution so far runs in 0.016 sec), so I thought that however much I experimented with the code, I would always fit within the limit. Well, I was mistaken.
The problem specification is simple: each line of input has two numbers i and j, and the output should print these numbers followed by the maximum length of a Collatz sequence whose starting value lies between i and j inclusive.
I've prepared four solutions. They are quite similar and all produce good answers.
The first solution caches up to 0x100000 sequence lengths in a vector. Length of 0 means the length for the sequence that starts at this particular number has not yet been calculated. It runs fast enough - 0.03 sec.
The second solution is quite similar, only it caches every single length in a sparse array implemented with an unordered_map. It runs noticeably slower than the previous solution, but still fits well within the limit: 0.28 sec.
As an exercise, I also wrote a third solution based on the second one. The goal was to use the max_element function, which only accepts iterators. I couldn't use unordered_map::iterator, since incrementing such an iterator is, AFAIK, linear in the map size; therefore, I wrote a custom iterator operating on an abstract "container" that "holds" the sequence length of every possible number (but in fact calculates and caches lengths only as needed). At its core it is still the unordered_map solution, just with an extra layer of iterators on top. This solution did not fit within the 3 sec limit.
And now comes the thing I can't understand. While the third solution is obviously overcomplicated on purpose, I could hardly believe that an extra layer of iterators could produce such a slowdown. To check this, I added the same iterator layer to the vector solution. This is my fourth solution. Judging from what the iterator idea did to my unordered_map solution, I expected a considerable slowdown here as well; but oddly enough, this did not happen at all. This solution runs almost as fast as the plain vector one, in 0.035 sec.
How is this possible? What exactly is responsible for the slowdown in the third solution? How can overcomplicating two similar solutions in exactly the same way slow one of them down so badly while hardly harming the other at all? Why does adding the iterator layer to the unordered_map solution make it miss the time limit, while doing the same to the vector solution hardly slows it down?
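For reference, a minimal sketch of the first (vector-cache) approach described above; this is an illustration under assumed names, not the poster's actual code.

#include <cstdint>
#include <vector>

// Cache lengths for starting values below LIMIT; 0 means "not computed yet",
// as in the first solution above. Values that wander above LIMIT during the
// walk are computed but not cached.
const std::uint64_t LIMIT = 0x100000;
std::vector<unsigned> cache(LIMIT, 0);

unsigned collatz_length(std::uint64_t n)
{
    if (n < LIMIT && cache[n] != 0)
        return cache[n];
    unsigned len;
    if (n == 1)
        len = 1;
    else if (n % 2 == 0)
        len = 1 + collatz_length(n / 2);
    else
        len = 1 + collatz_length(3 * n + 1);
    if (n < LIMIT)
        cache[n] = len;
    return len;
}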
EDIT:
I found out that the problem is most visible when the input contains many repeated lines. I tested all four solutions on my machine against the input 1 1000000, repeated 200 times. The solution with a plain vector processed all of them in 1.531 sec. The solution with the vector and the additional iterator layer took 3.087 sec. The solution with the plain unordered_map took 33.821 sec. And the solution with the unordered_map and the additional iterator layer took more than half an hour - I halted it after 31 minutes 0.482 sec! I tested on Linux Mint 17.2 64-bit, g++ version Ubuntu 4.8.4-2ubuntu1~14.04 with flags -std=c++11 -O2, on a Celeron 2955U @ 1.4 GHz x 2.
It appears to be a problem in GCC 4.8; it doesn't occur in 4.9 and up. For some reason, subsequent outer loops (with the populated unordered_map cache) run progressively slower, not faster. I'm not sure why, since the unordered_map isn't getting bigger.
If you follow that link and switch GCC 4.8 to 4.9, then you'll see the expected behavior where subsequent runs over the same numeric range add little time because they leverage the cache.
Altogether, the philosophy of being "conservative" with compiler updates has been outdated for a long time. Compilers today are rigorously tested, and you should use the latest (or at least a recent) release for everyday development.
For an online judge to subject you to long-fixed bugs is just cruel.

SIMD Implementation of std::nth_element

I have an algorithm that runs on my dual-core, 3 GHz Intel processor in 250 ms on average, and I am trying to optimize it. Currently, I have an std::nth_element call that is invoked around 6,000 times on std::vectors of between 150 and 300 elements, taking about 50 ms on average. I've spent some time optimizing the comparator I use, which currently looks up two doubles from a vector and does a simple < comparison. The comparator takes a negligible fraction of the time spent in std::nth_element. The comparator's copy constructor is also simple.
Since this call is currently taking 20% of the time for my algorithm, and since the time is mostly spent in the code for nth_element that I did not write (i.e. not the comparator), I'm wondering if anyone knows of a way of optimizing nth_element using SIMD or any other approach? I've seen some questions on parallelizing std::nth_element using OpenCL and multiple threads, but since the vectors are pretty short, I'm not sure how much benefit I would get from that approach, though I'm open to being told I'm wrong.
If there is an SSE approach, I can use any SSE instruction up to (the current, I think) SSE4.2.
Thanks!
Two thoughts:
Multithreading probably won't speed up processing for any single vector, but might help you as the number of vectors grows large.
Sorting is too powerful a tool for your problem: you're computing the entire order of the vector, but you don't care about anything except the top few. You know for each vector how many elements make up the top 5%, so instead of sorting the whole thing, make one pass through the array and keep track of the k largest. You can do this in O(n) time with k extra space, so it's probably a win overall.
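As a sketch of that one-pass idea: the version below uses a bounded min-heap, which is O(n log k) rather than strictly O(n), but effectively linear when k is small. The function name and element type are illustrative, not from the question.

#include <cstddef>
#include <functional>
#include <queue>
#include <vector>

// Stream through the data once, keeping only the k largest values in a
// bounded min-heap: O(n log k) time and O(k) extra space.
std::vector<double> k_largest(const std::vector<double>& data, std::size_t k)
{
    std::vector<double> out;
    if (k == 0)
        return out;
    std::priority_queue<double, std::vector<double>, std::greater<double> > heap;
    for (std::size_t i = 0; i < data.size(); ++i) {
        double x = data[i];
        if (heap.size() < k)
            heap.push(x);
        else if (x > heap.top()) {
            heap.pop();
            heap.push(x);
        }
    }
    while (!heap.empty()) {
        out.push_back(heap.top());   // ascending order of the k largest
        heap.pop();
    }
    return out;
}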

OpenMP and cores/threads

My CPU is a Core i3 330M with 2 cores and 4 threads. When I execute the command cat /proc/cpuinfo in my terminal, it looks as if I have 4 CPUs. When I use the OpenMP function omp_get_num_procs() I also get 4.
Now I have a standard C++ vector class, by which I mean a fixed-size double-array class that does not use expression templates. I have carefully parallelized all the methods of my class and I get the "expected" speedup.
The question is: can I guess the expected speedup in such a simple case? For instance, if I add two vectors without parallelized for-loops I get some time (using the shell time command). Now if I use OpenMP, should I get a time divided by 2 or 4, according to the number of cores/threads? I emphasize that I am only asking for this particular simple problem, where there is no interdependence in the data and everything is linear (vector addition).
Here is some code:
Vector Vector::operator+(const Vector& rhs) const
{
    assert(m_size == rhs.m_size);
    Vector result(m_size);
    #pragma omp parallel for schedule(static)
    for (unsigned int i = 0; i < m_size; i++)
        result.m_data[i] = m_data[i] + rhs.m_data[i];
    return result;
}
I have already read this post: OpenMP thread mapping to physical cores.
I hope that somebody will tell me more about how OpenMP get the work done in this simple case. I should say that I am a beginner in parallel computing.
Thanks!
EDIT : Now that some code has been added.
In that particular example, there is very little computation and lots of memory access. So the performance will depend heavily on:
The size of the vector.
How you are timing it (do you have an outer loop for timing purposes?).
Whether the data is already in cache.
For larger vector sizes, you will likely find that the performance is limited by your memory bandwidth, in which case parallelism is not going to help much. For smaller sizes, the overhead of threading will dominate. If you're getting the "expected" speedup, you're probably somewhere in between, where the result is optimal.
I refuse to give hard numbers because, in general, "guessing" performance, especially in multi-threaded applications, is a lost cause unless you have prior testing knowledge or intimate knowledge of both the program and the system that it's running on.
Just as a simple example taken from my answer here: How to get 100% CPU usage from a C program
On a Core i7 920 @ 3.5 GHz (4 cores, 8 threads):
If I run with 4 threads, the result is:
This machine calculated all 78498 prime numbers under 1000000 in 39.3498 seconds
If I run with 4 threads and explicitly (using Task Manager) pin the threads on 4 distinct physical cores, the result is:
This machine calculated all 78498 prime numbers under 1000000 in 30.4429 seconds
So this shows how unpredictable it is, even for a very simple and embarrassingly parallel application. Applications involving heavy memory usage and synchronization get a lot uglier...
To add to Mystical's answer: your problem is purely memory-bandwidth bound. Have a look at the STREAM benchmark. Run it on your computer in the single- and multi-threaded cases and look at the Triad results - this is your case (well, almost, since your output vector is at the same time one of your input vectors). Calculate how much data you move around and you will know exactly what performance to expect.
Does multi-threading work for this problem? Yes. It is rare that a single CPU core can saturate the entire memory bandwidth of the system. Modern computers balance the available memory bandwidth against the number of cores available. From my experience, you will need around half of the cores to saturate the memory bandwidth with a simple memcpy operation. It might take a few more if you do some calculations along the way.
Note that on NUMA systems you will need to bind the threads to CPU cores and use local memory allocation to get optimal results. On such systems every CPU has its own local memory, to which access is fastest. You can still access the entire system memory as on a usual SMP, but that incurs a communication cost: the CPUs have to explicitly exchange data. Binding threads to CPUs and using local allocation is extremely important; failing to do so kills scalability. Check libnuma if you want to do this on Linux.
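As a back-of-the-envelope illustration of "calculate how much data you move around", here is a hypothetical estimate for the vector addition above; the vector length and the 20 GB/s figure are assumed placeholders for your own STREAM Triad result.

#include <cstddef>
#include <iostream>

// Each element of result[i] = a[i] + b[i] reads two doubles and writes one
// (plus, on most CPUs, a write-allocate read of the destination line), so
// roughly 24-32 bytes of traffic per element.
int main()
{
    const std::size_t n = 10 * 1000 * 1000;  // vector length (assumed)
    const double bytes_per_element = 24.0;   // 2 loads + 1 store of double
    const double bandwidth = 20e9;           // bytes/s, assumed STREAM result

    double seconds = n * bytes_per_element / bandwidth;
    std::cout << "bandwidth-bound lower limit: " << seconds * 1e3 << " ms\n";
    // If the single-threaded loop already runs close to this figure,
    // adding threads cannot make it much faster.
    return 0;
}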

How best to quickly populate a vector?

I have some simulation code I'm working on, and I've just gotten rid of all the low-hanging fruit as far as optimisation is concerned. The code now spends half its time pushing back into vectors. (The size of the final vectors is known and I reserve appropriately.)
Essentially I'm rearranging one vector into a permutation of another, or populating the vector with random elements.
Is there any faster means of pushing back into a vector? Or pushing back/copying multiple elements?
std::vector<unsigned int, std::allocator<unsigned int> >::push_back(unsigned int const&)
Thanks in advance.
EDIT: Extra info; I'm running a release build with -O3, also: the original vector needs to be preserved.
You can have a look at:
C++0x (which enables a lot of optimizations in this area in the form of move semantics)
EASTL (which boasts superior performance, mainly through the use of custom allocators; you can get it up and running in about an hour, and the only visible change will be std::vector --> eastl::vector plus some extra link objects)
You can also drop in Google perftools' tcmalloc (although since you apparently already optimize by pre-allocating, this shouldn't really matter).
I'd personally not expect much gain if really the vector handling is the bottleneck. I'd really look at parallelizing with (in order of preference):
GNU openmp (CPPFLAGS+=-D_GLIBCXX_PARALLEL -fopenmp)
just OpenMP and a 'manual' #pragma omp parallel for
Intel TBB (most appropriate if using Intel compiler)
I must be forgetting stuff. Oh yeah, look here: http://www.agner.org/optimize/
Edit: I always forget the simplest things: use memcpy/memmove for bulk-appending POD elements to pre-allocated vectors.
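To illustrate the bulk-append point, a minimal sketch with hypothetical helper names; the range-insert version is the idiomatic one, and the memcpy variant only applies to trivially copyable element types.

#include <cstddef>
#include <cstring>
#include <vector>

// Append a block of PODs in one call instead of element-by-element push_back.
// Both variants assume the source is a contiguous buffer of unsigned int.
void bulk_append(std::vector<unsigned int>& dst,
                 const unsigned int* src, std::size_t count)
{
    // Idiomatic: a range insert does one growth step and a bulk copy internally.
    dst.insert(dst.end(), src, src + count);
}

void bulk_append_memcpy(std::vector<unsigned int>& dst,
                        const unsigned int* src, std::size_t count)
{
    // Lower-level variant: grow once, then memcpy into the new tail.
    // Only valid for trivially copyable element types.
    if (count == 0)
        return;
    std::size_t old_size = dst.size();
    dst.resize(old_size + count);
    std::memcpy(&dst[old_size], src, count * sizeof(unsigned int));
}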
If you're pre-reserving space then your vector is as fast as an array. You cannot mathematically make it faster; stop worrying and move on to something else!
You may be experiencing slow-down if you're running a "debug build" i.e. where your standard library implementation has optimisations turned off, and debug tracking info turned on.
push_back on int is extremely efficient. So I would look elsewhere for optimization opportunities.
Nemo's first rule of micro-optimization: Math is fast; memory is slow. Creating a huge vector is very cache-unfriendly.
For example, instead of creating a permutation of the original vector, can you just compute which element you need as you need it and then access that element directly from the original vector?
Similarly, do you really need a vector of random integers? Why not just generate a random number when it is needed? (If you have to remember it for later, then go ahead and push it onto the vector then... But not before.)
push_back on int is about as fast as it is going to get. I would bet you could barely notice the difference even if you got rid of the reserve (because the re-allocation does not happen often and is going to use a very fast bulk copy already). So you need to take a broader view to improve performance.
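As a sketch of the "compute it when you need it" idea above, with hypothetical names: instead of materializing a permuted copy, index through the permutation on demand.

#include <cstddef>
#include <vector>

// Look elements up through the permutation index at the point of use,
// so no second data vector is ever built.
inline unsigned int permuted_at(const std::vector<unsigned int>& original,
                                const std::vector<std::size_t>& perm,
                                std::size_t i)
{
    return original[perm[i]];
}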
If you have multiple vectors, you may be able to improve speed by allocating them continuously using a custom allocator. Improving memory locality may well improve the running time of the algorithm.
If you are using a Debug version of the STL, there is debug overhead (esp. in iterators) in all STL calls.
I would advise replacing the STL vector with a regular array. If you are using trivially-copyable types, you can easily copy multiple elements with a single memcpy call.