How to auto-vectorize range-based for loops? - C++

A similar question for g++ was posted on SO but was rather vague, so I thought I'd post a specific example for VC++12 / VS2013 that we can hopefully get an answer to.
Cross-link:
g++, range based for and vectorization
MSDN gives the following as an example of a loop that can be vectorized:
for (int i=0; i<1000; ++i)
{
    A[i] = A[i] + 1;
}
(http://msdn.microsoft.com/en-us/library/vstudio/jj658585.aspx)
Here is my version of a range-based analogue to the above, a c-style monstrosity, and a similar loop using std::for_each. I compiled with the /Qvec-report:2 flag and added the compiler messages as comments:
#include <vector>
#include <algorithm>

int main()
{
    std::vector<int> vec(1000, 1);

    // simple range-based for loop
    {
        for (int& elem : vec)
        {
            elem = elem + 1;
        }
    } // info C5002: loop not vectorized due to reason '1304'

    // c-style iteration
    {
        int * begin = vec.data();
        int * end = begin + vec.size();
        for (int* it = begin; it != end; ++it)
        {
            *it = *it + 1;
        }
    } // info C5001: loop vectorized

    // for_each iteration
    {
        std::for_each(vec.begin(), vec.end(), [](int& elem)
        {
            elem = elem + 1;
        });
    } // (no compiler message provided)

    return 0;
}
Only the c-style loop gets vectorized. Reason 1304 is described as follows in the MSDN docs:
1304: Loop includes assignments that are of different sizes.
It gives the following as an example of code that would trigger a 1304 message:
void code_1304(int *A, short *B)
{
    // Code 1304 is emitted when the compiler detects
    // different-sized statements in the loop body.
    // In this case, there is a 32-bit statement and a
    // 16-bit statement.
    // In cases like this, consider splitting the loop into loops to
    // maximize vector register utilization.
    for (int i=0; i<1000; ++i)
    {
        A[i] = A[i] + 1;
        B[i] = B[i] + 1;
    }
}
I'm no expert but I can't see the relationship. Is this just buggy reporting? I've noticed that none of my range-based loops are getting vectorized in my actual program. What gives?
(In case this is buggy behavior I'm running VS2013 Professional Version 12.0.21005.1 REL)
EDIT: Bug report posted: https://connect.microsoft.com/VisualStudio/feedback/details/807826/range-based-for-loops-are-not-vectorized

Posted bug report here:
https://connect.microsoft.com/VisualStudio/feedback/details/807826/range-based-for-loops-are-not-vectorized
Response:
Hi, thanks for the report.
Vectorizing range-based-for-loop-y code is something we are actively
making better. We'll address vectorizing this, plus enabling
auto-vectorization for other C++ language & library features in future
releases of the compiler.
The emission of reason code 1304 (on x64) and reason code 1301 (on
x86) are artifacts of compiler internals. The details of that, for
this particular code, is not important.
Thanks for the report! I am closing this MSConnect item. Feel free to
respond if you need anything else.
Eric Brumer Microsoft Visual C++ Team
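In the meantime, a workaround consistent with the results above is to write hot loops in the pointer style that /Qvec-report:2 already reports as vectorized. A minimal sketch (the function is just an illustration wrapping the c-style loop from the question):

#include <vector>

// Workaround sketch: until range-based for is handled by the
// auto-vectorizer, fall back to the pointer form for hot loops.
void increment_all(std::vector<int>& vec)
{
    int* const begin = vec.data();
    int* const end = begin + vec.size();
    for (int* it = begin; it != end; ++it)  // the form reported as vectorized above
        *it = *it + 1;
}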

Related

Why is overwriting slower than writing in C++?

USELESS QUESTION - ASKED TO BE DELETED
I have to run a piece of code that manages a video stream from a camera.
I am trying to speed it up, and I noticed some weird C++ behaviour. (I have to admit I am realizing I do not know C++ that well.)
Why does the first piece of code run faster than the other two? Could it be that the stack is almost full?
Faster version
double* temp = new double[N];
for(int i = 0; i < N; i++){
    temp[i] = operation(x[i],y[i]);
    res = res + (temp[i]*temp[i])*coeff[i];
}
Slower version 1
double temp;
for(int i = 0; i < N; i++){
    temp = operation(x[i],y[i]);
    res = res + (temp*temp)*coeff[i];
}
Slower version 2
for(int i = 0; i < N; i++){
    double temp = operation(x[i],y[i]);
    res = res + (temp*temp)*coeff[i];
}
EDIT
I realized the compiler was optimizing away the product between elements of coeff and temp. I beg your pardon for the useless question. I will delete this post.
This has obviously nothing to do with "writing vs overwriting".
Assuming your results are indeed correct, I can guess that your "faster" version can be vectorized (i.e. pipelined) by the compiler more efficiently.
The difference is that in this version you allocate storage for temp as an array, and each iteration uses its own member of that array, hence all the iterations can be executed independently.
Your "slow 1" version creates a (sort of) false dependence on a single temp variable. A primitive compiler might "buy" it, producing non-pipelined code.
Your "slow 2" version seems to be OK actually; loop iterations are independent.
Why is this still slower?
I can guess that this is due to the use of the same CPU registers. That is, arithmetic on double is usually implemented via FPU stack registers, and this creates interference between loop iterations.
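If the single res accumulator is what serializes the FPU, a common mitigation is to split it into independent partial sums. A sketch using the question's names (N, x, y, coeff, res, operation), assuming operation() has no side effects:

double res0 = 0.0, res1 = 0.0;           // independent accumulators
for (int i = 0; i + 1 < N; i += 2) {
    double t0 = operation(x[i],     y[i]);
    double t1 = operation(x[i + 1], y[i + 1]);
    res0 += (t0 * t0) * coeff[i];        // the two chains do not depend
    res1 += (t1 * t1) * coeff[i + 1];    // on each other's results
}
if (N % 2) {                             // leftover element when N is odd
    double t = operation(x[N - 1], y[N - 1]);
    res0 += (t * t) * coeff[N - 1];
}
res = res + res0 + res1;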

Where is the loop-carried dependency here?

Does anyone see anything obvious about the loop code below that I'm not seeing as to why this cannot be auto-vectorized by VS2012's C++ compiler?
All the compiler gives me is info C5002: loop not vectorized due to reason '1200' when I use the /Qvec-report:2 command-line switch.
Reason 1200 is documented in MSDN as:
Loop contains loop-carried data dependences that prevent
vectorization. Different iterations of the loop interfere with each
other such that vectorizing the loop would produce wrong answers, and
the auto-vectorizer cannot prove to itself that there are no such data
dependences.
I know (or I'm pretty sure that) there aren't any loop-carried data dependencies but I'm not sure what's preventing the compiler from realizing this.
The source and dest pointers never overlap nor alias the same memory, and I'm trying to provide the compiler with that hint via __restrict.
pitch is always a positive integer value, something like 4096, depending on the screen resolution, since this is an 8bpp->32bpp rendering/conversion function, operating column-by-column.
byte * __restrict source;
DWORD * __restrict dest;
int pitch;

for (int i = 0; i < count; ++i) {
    dest[(i*2*pitch)+0] = (source[(i*8)+0]);
    dest[(i*2*pitch)+1] = (source[(i*8)+1]);
    dest[(i*2*pitch)+2] = (source[(i*8)+2]);
    dest[(i*2*pitch)+3] = (source[(i*8)+3]);
    dest[((i*2+1)*pitch)+0] = (source[(i*8)+4]);
    dest[((i*2+1)*pitch)+1] = (source[(i*8)+5]);
    dest[((i*2+1)*pitch)+2] = (source[(i*8)+6]);
    dest[((i*2+1)*pitch)+3] = (source[(i*8)+7]);
}
The parens around each source[] are remnants of a function call which I have elided here, because the loop still won't be auto-vectorized without the function call, even in its simplest form.
EDIT:
I've simplified the loop to its most trivial form that I can:
for (int i = 0; i < 200; ++i) {
    dest[(i*2*4096)+0] = (source[(i*8)+0]);
}
This still produces the same 1200 reason code.
EDIT (2):
This minimal test case with local allocations and identical pointer types still fails to auto-vectorize. I'm simply baffled at this point.
const byte * __restrict source;
byte * __restrict dest;

source = (const byte * __restrict) new byte[1600];
dest = (byte * __restrict) new byte[1600];

for (int i = 0; i < 200; ++i) {
    dest[(i*2*4096)+0] = (source[(i*8)+0]);
}
Let's just say there's more than just a couple of things preventing this loop from vectorizing...
Consider this:
int main(){
    byte *source = new byte[1000];
    DWORD *dest = new DWORD[1000];
    for (int i = 0; i < 200; ++i) {
        dest[(i*2*4096)+0] = (source[(i*8)+0]);
    }
    for (int i = 0; i < 200; ++i) {
        dest[i*2*4096] = source[i*8];
    }
    for (int i = 0; i < 200; ++i) {
        dest[i*8192] = source[i*8];
    }
    for (int i = 0; i < 200; ++i) {
        dest[i] = source[i];
    }
}
Compiler Output:
main.cpp(10) : info C5002: loop not vectorized due to reason '1200'
main.cpp(13) : info C5002: loop not vectorized due to reason '1200'
main.cpp(16) : info C5002: loop not vectorized due to reason '1203'
main.cpp(19) : info C5002: loop not vectorized due to reason '1101'
Let's break this down:
The first two loops are the same. So they give the original reason 1200 which is the loop-carried dependency.
The 3rd loop is the same as the 2nd loop. Yet the compiler gives a different reason 1203:
Loop body includes non-contiguous accesses into an array
Okay... Why a different reason? I dunno. But this time, the reason is correct.
The 4th loop gives 1101:
Loop contains a non-vectorizable conversion operation (may be implicit)
So VC++ isn't smart enough to issue the SSE4.1 pmovzxbd instruction.
That's a pretty niche case; I wouldn't have expected any modern compiler to be able to do this. And if it could, you'd need to specify SSE4.1.
So the only thing that's out of the ordinary is why the initial loop reports a loop-carried dependency. Well, that's a tough call... I would go so far as to say that the compiler just isn't issuing the correct reason. (When it really should be non-contiguous access.)
Getting back to the point, I wouldn't expect MSVC or any compiler to be able to vectorize your original loop. Your original loop has accesses grouped in chunks of 4 - which makes it contiguous enough to vectorize. But it's a long shot to expect the compiler to be able to recognize that.
So if it matters, I suggest manually vectorizing this loop. The intrinsic that you will need is _mm_cvtepu8_epi32().
Your original loop:
for (int i = 0; i < count; ++i) {
    dest[(i*2*pitch)+0] = (source[(i*8)+0]);
    dest[(i*2*pitch)+1] = (source[(i*8)+1]);
    dest[(i*2*pitch)+2] = (source[(i*8)+2]);
    dest[(i*2*pitch)+3] = (source[(i*8)+3]);
    dest[((i*2+1)*pitch)+0] = (source[(i*8)+4]);
    dest[((i*2+1)*pitch)+1] = (source[(i*8)+5]);
    dest[((i*2+1)*pitch)+2] = (source[(i*8)+6]);
    dest[((i*2+1)*pitch)+3] = (source[(i*8)+7]);
}
vectorizes as follows:
for (int i = 0; i < count; ++i) {
    __m128i s0 = _mm_loadl_epi64((__m128i*)(source + i*8));
    __m128i s1 = _mm_unpackhi_epi64(s0,s0);
    *(__m128i*)(dest + (i*2 + 0)*pitch) = _mm_cvtepu8_epi32(s0);
    *(__m128i*)(dest + (i*2 + 1)*pitch) = _mm_cvtepu8_epi32(s1);
}
Disclaimer: This is untested and ignores alignment.
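If dest might not be 16-byte aligned, an unaligned-store variant of the same (equally untested) sketch would be:

#include <smmintrin.h>   // SSE4.1: _mm_cvtepu8_epi32 (pulls in earlier SSE headers)

// Same conversion, but with unaligned stores so dest needs no alignment.
for (int i = 0; i < count; ++i) {
    __m128i s0 = _mm_loadl_epi64((const __m128i*)(source + i*8));
    __m128i s1 = _mm_unpackhi_epi64(s0, s0);
    _mm_storeu_si128((__m128i*)(dest + (i*2 + 0)*pitch), _mm_cvtepu8_epi32(s0));
    _mm_storeu_si128((__m128i*)(dest + (i*2 + 1)*pitch), _mm_cvtepu8_epi32(s1));
}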
From the MSDN documentation, here is a case in which reason 1203 would be reported:
void code_1203(int *A)
{
    // Code 1203 is emitted when non-vectorizable memory references
    // are present in the loop body. Vectorization of some non-contiguous
    // memory access is supported - for example, the gather/scatter pattern.
    for (int i=0; i<1000; ++i)
    {
        A[i] += A[0] + 1;     // constant memory access not vectorized
        A[i] += A[i*2+2] + 2; // non-contiguous memory access not vectorized
    }
}
It really could be the computations at the indexes that mess with the auto-vectorizer. Funny that the reason code shown is not 1203, though.
MSDN Parallelizer and Vectorizer Messages

C++ 2011 : std::thread : simple example to parallelize a loop?

C++ 2011 includes very cool new features, but I can't find many examples of how to parallelize a for-loop.
So my very naive question is: how do you parallelize a simple for loop (like you would with "omp parallel for") using std::thread? (I'm searching for an example.)
Thank you very much.
std::thread is not necessarily meant to parallelize loops. It is meant to be the low-level abstraction on which to build constructs like a parallel_for algorithm. If you want to parallelize your loops, you should either write a parallel_for algorithm yourself or use existing libraries which offer task-based parallelism.
The following example shows how you could parallelize a simple loop, but on the other hand it also shows the disadvantages, like the missing load balancing and the complexity involved for a simple loop.
typedef std::vector<int> container;
typedef container::iterator iter;

container v(100, 1);

auto worker = [] (iter begin, iter end) {
    for(auto it = begin; it != end; ++it) {
        *it *= 2;
    }
};

// serial
worker(std::begin(v), std::end(v));
std::cout << std::accumulate(std::begin(v), std::end(v), 0) << std::endl; // 200

// parallel
std::vector<std::thread> threads(8);
const int grainsize = v.size() / 8;

auto work_iter = std::begin(v);
for(auto it = std::begin(threads); it != std::end(threads) - 1; ++it) {
    *it = std::thread(worker, work_iter, work_iter + grainsize);
    work_iter += grainsize;
}
threads.back() = std::thread(worker, work_iter, std::end(v));

for(auto&& i : threads) {
    i.join();
}
std::cout << std::accumulate(std::begin(v), std::end(v), 0) << std::endl; // 400
Using a library which offers a parallel_for template, it can be simplified to
parallel_for(std::begin(v), std::end(v), worker);
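Such a parallel_for can itself be sketched on top of std::thread, along the lines of the example above. The name and chunking strategy here are mine, not a particular library's; it assumes random-access iterators and a copyable worker taking a (begin, end) pair, like the lambda defined earlier:

#include <cstddef>
#include <thread>
#include <vector>

template <typename Iter, typename Worker>
void parallel_for(Iter begin, Iter end, Worker worker)
{
    unsigned n = std::thread::hardware_concurrency();
    if (n == 0) n = 2;                      // fall back if the count is unknown

    const std::size_t grainsize = (end - begin) / n;
    std::vector<std::thread> threads;

    for (unsigned i = 0; i + 1 < n; ++i) {  // first n-1 chunks on new threads
        threads.emplace_back(worker, begin, begin + grainsize);
        begin += grainsize;
    }
    worker(begin, end);                     // last chunk (plus remainder) runs here

    for (auto& t : threads) t.join();
}

With something like this in scope, the one-liner parallel_for(std::begin(v), std::end(v), worker); works as written.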
Well, obviously it depends on what your loop does, how you choose to parallelize, and how you manage the threads' lifetimes.
I'm reading the book on the C++11 standard threading library (by an author who is also one of the boost.thread maintainers and wrote Just::Thread) and I can see that "it depends".
Now to give you an idea of the basics using the new standard threading, I would recommend reading the book, as it gives plenty of examples.
Also, take a look at http://www.justsoftwaresolutions.co.uk/threading/ and https://stackoverflow.com/questions/415994/boost-thread-tutorials
Can't provide a C++11 specific answer since we're still mostly using pthreads. But, as a language-agnostic answer, you parallelise something by setting it up to run in a separate function (the thread function).
In other words, you have a function like:
def processArraySegment (threadData):
    arrayAddr = threadData->arrayAddr
    startIdx = threadData->startIdx
    endIdx = threadData->endIdx

    for i = startIdx to endIdx:
        doSomethingWith (arrayAddr[i])

    exitThread()
and, in your main code, you can process the array in two chunks:
int xyzzy[100]

threadData->arrayAddr = xyzzy
threadData->startIdx = 0
threadData->endIdx = 49
threadData->done = false
tid1 = startThread (processArraySegment, threadData)

// caveat coder: see below.
threadData->arrayAddr = xyzzy
threadData->startIdx = 50
threadData->endIdx = 99
threadData->done = false
tid2 = startThread (processArraySegment, threadData)

waitForThreadExit (tid1)
waitForThreadExit (tid2)
(keeping in mind the caveat that you should ensure thread 1 has loaded the data into its local storage before the main thread starts modifying it for thread 2, possibly with a mutex or by using an array of structures, one per thread).
In other words, it's rarely a simple matter of just modifying a for loop so that it runs in parallel, though that would be nice, something like:
for {threads=10} ({i} = 0; {i} < ARR_SZ; {i}++)
array[{i}] = array[{i}] + 1;
Instead, it requires a bit of rearranging your code to take advantage of threads.
And, of course, you have to ensure that it makes sense for the data to be processed in parallel. If you're setting each array element to the previous one plus 1, no amount of parallel processing will help, simply because you have to wait for the previous element to be modified first.
This particular example above simply uses an argument passed to the thread function to specify which part of the array it should process. The thread function itself contains the loop to do the work.
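For a C++11 rendering of the same idea with std::thread (a sketch; processArraySegment and doSomethingWith are illustrative stand-ins, and passing the segment bounds as arguments also sidesteps the shared-threadData caveat above):

#include <thread>

void doSomethingWith(int& x) { x += 1; }    // placeholder for the real work

void processArraySegment(int* arrayAddr, int startIdx, int endIdx)
{
    for (int i = startIdx; i <= endIdx; ++i)
        doSomethingWith(arrayAddr[i]);
}

int main()
{
    int xyzzy[100] = {};
    std::thread t1(processArraySegment, xyzzy, 0, 49);   // first half
    std::thread t2(processArraySegment, xyzzy, 50, 99);  // second half
    t1.join();
    t2.join();
}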
Using this class you can do it as:
Range based loop (read and write)
pforeach(auto &val, container) {
    val = sin(val);
};
Index based for-loop
auto new_container = container;
pfor(size_t i, 0, container.size()) {
    new_container[i] = sin(container[i]);
};
Define a macro using std::thread and a lambda expression:
#ifndef PARALLEL_FOR
#define PARALLEL_FOR(INT_LOOP_BEGIN_INCLUSIVE, INT_LOOP_END_EXCLUSIVE, I, O)   \
{                                                                              \
    const int LOOP_LIMIT = INT_LOOP_END_EXCLUSIVE - INT_LOOP_BEGIN_INCLUSIVE;  \
    /* requires <thread> and <vector>; a vector avoids the non-standard VLA */ \
    std::vector<std::thread> threads(LOOP_LIMIT);                              \
    auto fParallelLoop = [&](int I){ O; };                                     \
    for(int i = 0; i < LOOP_LIMIT; i++)                                        \
    {                                                                          \
        threads[i] = std::thread(fParallelLoop, i + INT_LOOP_BEGIN_INCLUSIVE); \
    }                                                                          \
    for(int i = 0; i < LOOP_LIMIT; i++)                                        \
    {                                                                          \
        threads[i].join();                                                     \
    }                                                                          \
}
#endif
Usage:
int aaa = 0; // std::atomic<int> aaa;
PARALLEL_FOR(0, 90, i,
{
    aaa += i;
});
It's ugly, but it works (I mean the multi-threading part, not the non-atomic incrementing).
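The corrected usage hinted at in the comment above, with std::atomic so the concurrent increments are well-defined:

#include <atomic>

std::atomic<int> aaa(0);   // atomic accumulator: safe across the 90 threads
PARALLEL_FOR(0, 90, i,
{
    aaa += i;              // atomic fetch-add
});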
AFAIK the simplest way to parallelize a loop, if you are sure that no concurrent accesses are possible, is to use OpenMP.
It is supported by all major compilers except LLVM (as of August 2013).
Example:
for(int i = 0; i < n; ++i)
{
    tab[i] *= 2;
    tab2[i] /= 2;
    tab3[i] += tab[i] - tab2[i];
}
This would be parallelized very easily like this:
#pragma omp parallel for
for(int i = 0; i < n; ++i)
{
    tab[i] *= 2;
    tab2[i] /= 2;
    tab3[i] += tab[i] - tab2[i];
}
However, be aware that this is only efficient with a large number of values.
If you use g++, another very C++11-ish way of doing it would be to use a lambda and for_each, with the GNU parallel extensions (which can use OpenMP behind the scenes):
__gnu_parallel::for_each(std::begin(tab), std::end(tab), [&](int& value)
{
    stuff_of_your_loop(value);
});
However, for_each is mainly intended for containers such as arrays, vectors, etc.
But you can "cheat" if you only want to iterate over a range of integers, by creating a Range class with begin and end methods whose iterator mostly just increments an int, as sketched below.
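A sketch of such a Range class (the name and layout are mine; the iterator-traits typedefs let the standard algorithms inspect it):

#include <cstddef>
#include <iterator>

class Range {
public:
    class iterator {
        int value;
    public:
        typedef std::forward_iterator_tag iterator_category;
        typedef int                       value_type;
        typedef std::ptrdiff_t            difference_type;
        typedef const int*                pointer;
        typedef const int&                reference;

        explicit iterator(int v) : value(v) {}
        int operator*() const { return value; }          // yields the current int
        iterator& operator++() { ++value; return *this; } // "mostly increments an int"
        bool operator==(const iterator& o) const { return value == o.value; }
        bool operator!=(const iterator& o) const { return value != o.value; }
    };

    Range(int from, int to) : from_(from), to_(to) {}
    iterator begin() const { return iterator(from_); }
    iterator end() const { return iterator(to_); }

private:
    int from_, to_;
};

// e.g. iterating i over [0, n):
//   std::for_each(Range(0, n).begin(), Range(0, n).end(),
//                 [&](int i) { tab[i] *= 2; });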
Note that for simple loops that do mathematical stuff, the algorithms in #include <numeric> and #include <algorithm> can all be parallelized with G++.

Executable runs faster on Wine than Windows -- why?

Solution: Apparently the culprit was the use of floor(), the performance of which turns out to be OS-dependent in glibc.
This is a followup question to an earlier one: Same program faster on Linux than Windows -- why?
I have a small C++ program, that, when compiled with nuwen gcc 4.6.1, runs much faster on Wine than Windows XP (on the same computer). The question: why does this happen?
The timings are ~15.8 and 25.9 seconds, for Wine and Windows respectively. Note that I'm talking about the same executable, not only the same C++ program.
The source code is at the end of the post. The compiled executable is here (if you trust me enough).
This particular program does nothing useful, it is just a minimal example boiled down from a larger program I have. Please see this other question for some more precise benchmarking of the original program (important!!) and the most common possibilities ruled out (such as other programs hogging the CPU on Windows, process startup penalty, difference in system calls such as memory allocation). Also note that while here I used rand() for simplicity, in the original I used my own RNG which I know does no heap-allocation.
The reason I opened a new question on the topic is that now I can post an actual simplified code example for reproducing the phenomenon.
The code:
#include <cstdlib>
#include <cmath>

int irand(int top) {
    return int(std::floor((std::rand() / (RAND_MAX + 1.0)) * top));
}

template<typename T>
class Vector {
    T *vec;
    const int sz;
public:
    Vector(int n) : sz(n) {
        vec = new T[sz];
    }
    ~Vector() {
        delete [] vec;
    }
    int size() const { return sz; }
    const T & operator [] (int i) const { return vec[i]; }
    T & operator [] (int i) { return vec[i]; }
};

int main() {
    const int tmax = 20000; // increase this to make it run longer
    const int m = 10000;

    Vector<int> vec(150);
    for (int i=0; i < vec.size(); ++i)
        vec[i] = 0;

    // main loop
    for (int t=0; t < tmax; ++t)
        for (int j=0; j < m; ++j) {
            int s = irand(100) + 1;
            vec[s] += 1;
        }

    return 0;
}
UPDATE
It seems that if I replace irand() above with something deterministic such as
int irand(int top) {
    static int c = 0;
    return (c++) % top;
}
then the timing difference disappears. I'd like to note though that in my original program I used a different RNG, not the system rand(). I'm digging into the source of that now.
UPDATE 2
Now I replaced the irand() function with an equivalent of what I had in the original program. It is a bit lengthy (the algorithm is from Numerical Recipes), but the point is to show that no system libraries are being called explicitly (except possibly through floor()). Yet the timing difference is still there!
Perhaps floor() could be to blame? Or does the compiler generate calls to something else?
class ran1 {
    static const int table_len = 32;
    static const int int_max = (1u << 31) - 1;

    int idum;
    int next;
    int *shuffle_table;

    void propagate() {
        const int int_quo = 1277731;
        int k = idum/int_quo;
        idum = 16807*(idum - k*int_quo) - 2836*k;
        if (idum < 0)
            idum += int_max;
    }

public:
    ran1() {
        shuffle_table = new int[table_len];
        seedrand(54321);
    }
    ~ran1() {
        delete [] shuffle_table;
    }
    void seedrand(int seed) {
        idum = seed;
        for (int i = table_len-1; i >= 0; i--) {
            propagate();
            shuffle_table[i] = idum;
        }
        next = idum;
    }
    double frand() {
        int i = next/(1 + (int_max-1)/table_len);
        next = shuffle_table[i];
        propagate();
        shuffle_table[i] = idum;
        return next/(int_max + 1.0);
    }
} rng;

int irand(int top) {
    return int(std::floor(rng.frand() * top));
}
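One experiment I can try (untested): since frand() returns a value in [0, 1) and top is non-negative here, std::floor() can be replaced with a plain integer cast, which compiles to no library call. If the Wine/Windows gap disappears with this version, floor() is the culprit:

int irand(int top) {
    return int(rng.frand() * top);   // truncation == floor for non-negative values
}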
edit: It turned out that the culprit was floor() and not rand() as I suspected - see
the update at the top of the OP's question.
The run time of your program is dominated by the calls to rand().
I therefore think that rand() is the culprit. I suspect that the underlying function is provided by the WINE/Windows runtime, and the two implementations have different performance characteristics.
The easiest way to test this hypothesis would be to simply call rand() in a loop, and time the same executable in both environments.
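A minimal sketch of such a harness (my code, not tested under Wine; the iteration count is arbitrary):

#include <cstdio>
#include <cstdlib>
#include <ctime>

int main()
{
    std::srand(12345);
    long sum = 0;                                // keep the result live
    std::clock_t start = std::clock();
    for (long i = 0; i < 100000000L; ++i)        // tight rand() loop
        sum += std::rand();
    double secs = double(std::clock() - start) / CLOCKS_PER_SEC;
    std::printf("sum = %ld, time = %.2f s\n", sum, secs);
    return 0;
}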
Edit: I've had a look at the WINE source code, and here is its implementation of rand():
/*********************************************************************
 *    rand (MSVCRT.@)
 */
int CDECL MSVCRT_rand(void)
{
    thread_data_t *data = msvcrt_get_thread_data();

    /* this is the algorithm used by MSVC, according to
     * http://en.wikipedia.org/wiki/List_of_pseudorandom_number_generators */
    data->random_seed = data->random_seed * 214013 + 2531011;
    return (data->random_seed >> 16) & MSVCRT_RAND_MAX;
}
I don't have access to Microsoft's source code to compare, but it wouldn't surprise me if the difference in performance was in the getting of thread-local data rather than in the RNG itself.
Wikipedia says:
Wine is a compatibility layer not an emulator. It duplicates functions
of a Windows computer by providing alternative implementations of the
DLLs that Windows programs call,[citation needed] and a process to
substitute for the Windows NT kernel. This method of duplication
differs from other methods that might also be considered emulation,
where Windows programs run in a virtual machine.[2] Wine is
predominantly written using black-box testing reverse-engineering, to
avoid copyright issues.
This implies that the developers of Wine could replace an API call with anything at all, as long as the end result was the same as you would get with a native Windows call. And I suppose they weren't constrained by needing to make it compatible with the rest of Windows.
From what I can tell, the C standard libraries used WILL be different in the two different scenarios. This affects the rand() call as well as floor().
From the mingw site... MinGW compilers provide access to the functionality of the Microsoft C runtime and some language-specific runtimes. Running under XP, this will use the Microsoft libraries. Seems straightforward.
However, the model under wine is much more complex. According to this diagram, the operating system's libc comes into play. This could be the difference between the two.
While Wine is basically Windows, you're still comparing apples to oranges. As well, not only is it apples/oranges, the underlying vehicles hauling those apples and oranges around are completely different.
In short, your question could trivially be rephrased as "this code runs faster on Mac OSX than it does on Windows" and get the same answer.

C OpenMP parallel quickSort

Once again I'm stuck using OpenMP in C++. This time I'm trying to implement a parallel quicksort.
Code:
#include <iostream>
#include <vector>
#include <stack>
#include <utility>
#include <omp.h>
#include <stdio.h>

#define SWITCH_LIMIT 1000

using namespace std;

template <typename T>
void insertionSort(std::vector<T> &v, int q, int r)
{
    T key;
    int i;
    for(int j = q + 1; j <= r; ++j)
    {
        key = v[j];
        i = j - 1;
        while( i >= q && v[i] > key )
        {
            v[i+1] = v[i];
            --i;
        }
        v[i+1] = key;
    }
}

stack<pair<int,int> > s;

template <typename T>
void qs(vector<T> &v, int q, int r)
{
    T pivot;
    int i = q - 1, j = r;

    //switch to insertion sort for small data
    if(r - q < SWITCH_LIMIT)
    {
        insertionSort(v, q, r);
        return;
    }

    pivot = v[r];
    while(true)
    {
        while(v[++i] < pivot);
        while(v[--j] > pivot);
        if(i >= j) break;
        std::swap(v[i], v[j]);
    }
    std::swap(v[i], v[r]);

    #pragma omp critical
    {
        s.push(make_pair(q, i - 1));
        s.push(make_pair(i + 1, r));
    }
}

int main()
{
    int n, x;
    int numThreads = 4, numBusyThreads = 0;
    bool *idle = new bool[numThreads];
    for(int i = 0; i < numThreads; ++i)
        idle[i] = true;
    pair<int, int> p;
    vector<int> v;

    cin >> n;
    for(int i = 0; i < n; ++i)
    {
        cin >> x;
        v.push_back(x);
    }
    cout << v.size() << endl;
    s.push(make_pair(0, v.size()));

    #pragma omp parallel shared(s, v, idle, numThreads, numBusyThreads, p)
    {
        bool done = false;
        while(!done)
        {
            int id = omp_get_thread_num();
            #pragma omp critical
            {
                if(s.empty() == false && numBusyThreads < numThreads)
                {
                    ++numBusyThreads;
                    //the current thread is not idle anymore
                    //it will get the interval [q, r] from the stack
                    //and run qs on it
                    idle[id] = false;
                    p = s.top();
                    s.pop();
                }
                if(numBusyThreads == 0)
                {
                    done = true;
                }
            }
            if(idle[id] == false)
            {
                qs(v, p.first, p.second);
                idle[id] = true;
                #pragma omp critical
                --numBusyThreads;
            }
        }
    }
    return 0;
}
Algorithm:
To use OpenMP for a recursive function I used a stack to keep track of the next intervals on which the qs function should run. I manually add the first interval [0, size] and then let the threads get to work when a new interval is added to the stack.
The problem:
The program ends too early, without sorting the array, after creating the first set of intervals ([q, i - 1], [i + 1, r] if you look at the code). My guess is that the threads which get the work consider the local variables of the quicksort function (qs in the code) shared by default, so they mess them up and add no intervals to the stack.
How I compile:
g++ -o qs qs.cc -Wall -fopenmp
How I run:
./qs < in_100000 > out_100000
where in_100000 is a file containing 100000 on the first line, followed by 100k integers on the next line, separated by spaces.
I am using gcc 4.5.2 on Linux.
Thank you for your help,
Dan
I didn't actually run your code, but I see an immediate mistake on p, which should be private not shared. The parallel invocation of qs: qs(v, p.first, p.second); will have races on p, resulting in unpredictable behavior. The local variables at qs should be okay because all threads have their own stack. However, the overall approach is good. You're on the right track.
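A minimal sketch of that fix (untested): give each thread its own p, either by adding it to a private clause or, more idiomatically, by declaring it inside the parallel region:

// Option 1: keep the global p but make it private per thread:
//   #pragma omp parallel shared(s, v, idle, numThreads, numBusyThreads) private(p)

// Option 2: declare p inside the parallel region so it is private by construction:
#pragma omp parallel shared(s, v, idle, numThreads, numBusyThreads)
{
    pair<int, int> p;      // each thread now has its own copy: no race
    bool done = false;
    while(!done)
    {
        // ... body exactly as in the question, using this thread's p ...
    }
}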
Here are my general comments on the implementation of parallel quicksort. Quicksort itself is embarrassingly parallel, which means no synchronization is needed: the recursive calls of qs on a partitioned array are independent of each other.
However, the parallelism is exposed in a recursive form. If you simply use the nested parallelism in OpenMP, you will end up with thousands of threads in a second, and no speedup will be gained. So, mostly you need to turn the recursive algorithm into an iterative one, and then implement a sort of work queue. This is your approach. And, it's not easy.
For your approach, there is a good benchmark: OmpSCR. You can download it at http://sourceforge.net/projects/ompscr/
In the benchmark, there are several versions of OpenMP-based quicksort. Most of them are similar to yours. However, to increase parallelism, one must minimize the contention on a global queue (in your code, it's s). So, there could be a couple of optimizations such as having local queues. Although the algorithm itself is purely parallel, the implementation may require synchronization artifacts. And, most of all, it's very hard to gain speedups.
However, you can still directly use recursive parallelism in OpenMP in two ways: (1) throttling the total number of threads, and (2) using OpenMP 3.0's task.
Here is pseudo code for the first approach (this is loosely based on OmpSCR's benchmark):
void qsort_omp_recursive(int* begin, int* end)
{
    if (begin != end) {
        // Partition ...

        // Throttling
        if (...) {
            qsort_omp_recursive(begin, middle);
            qsort_omp_recursive(++middle, ++end);
        } else {
            #pragma omp parallel sections nowait
            {
                #pragma omp section
                qsort_omp_recursive(begin, middle);
                #pragma omp section
                qsort_omp_recursive(++middle, ++end);
            }
        }
    }
}
In order to run this code, you need to call omp_set_nested(1) and omp_set_num_threads(2). The code is really simple. We simply spawn two threads on the division of the work. However, we insert a simple throttling logic to prevent excessive threads. Note that my experimentation showed decent speedups for this approach.
Finally, you may use OpenMP 3.0's task, where a task is a logically concurrent unit of work. In all the OpenMP approaches above, each parallel construct spawns two physical threads; you may say there is a hard 1-to-1 mapping between a task and a worker thread. Task, however, separates logical tasks from workers.
Because OpenMP 3.0 is not popular yet, I will use Cilk Plus, which is great for expressing this kind of nested and recursive parallelism. In Cilk Plus, the parallelization is extremely easy:
void qsort(int* begin, int* end)
{
    if (begin != end) {
        --end;
        int* middle = std::partition(begin, end,
                          std::bind2nd(std::less<int>(), *end));
        std::swap(*end, *middle);

        cilk_spawn qsort(begin, middle);
        qsort(++middle, ++end);
        // cilk_sync; Only necessary at the final stage.
    }
}
I copied this code from Cilk Plus' example code. You will see that a single keyword, cilk_spawn, is everything needed to parallelize quicksort. I'm skipping the explanations of Cilk Plus and the spawn keyword. However, it's easy to understand: the two recursive calls are declared as logically concurrent tasks. Whenever the recursion takes place, the logical tasks are created. But the Cilk Plus runtime (which implements an efficient work-stealing scheduler) will handle all kinds of dirty work. It optimally queues the parallel tasks and maps them onto the worker threads.
Note that OpenMP 3.0's task is essentially similar to the Cilk Plus approach. My experimentation shows that pretty nice speedups were feasible: I got a 3~4x speedup on an 8-core machine, and the speedup scaled. Cilk Plus' absolute speedups are greater than those of OpenMP 3.0's.
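For reference, an OpenMP 3.0 task rendering of the same code might look like this (an untested sketch, assuming a compiler with task support):

#include <algorithm>
#include <functional>

void qsort_omp_task(int* begin, int* end)
{
    if (begin != end) {
        --end;
        int* middle = std::partition(begin, end,
                          std::bind2nd(std::less<int>(), *end));
        std::swap(*end, *middle);

        #pragma omp task        // first half becomes a logically concurrent task
        qsort_omp_task(begin, middle);
        qsort_omp_task(++middle, ++end);
        #pragma omp taskwait    // join the child task before returning
    }
}

// The initial call must come from inside a parallel region, e.g.:
//   #pragma omp parallel
//   #pragma omp single nowait
//   qsort_omp_task(data, data + n);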
The approach of Cilk Plus (and OpenMP 3.0) and your approach are essentially the same: the separation of parallel task and workload assignment. However, it's very difficult to implement efficiently. For example, you must reduce the contention and use lock-free data structures.