False sharing and stack variables - C++

I have small but frequently used function objects. Each thread gets its own copy. Everything is allocated statically. The copies don't share any global or static data. Do I need to protect these objects against false sharing?
Thank you.
EDIT: Here is a toy program which uses Boost.Threads. Can false sharing occur for the field data?
#include <boost/thread/thread.hpp>

struct Work {
    void operator()() {
        ++data;
    }
    int data;
};

int main() {
    boost::thread_group threads;
    for (int i = 0; i < 10; ++i)
        threads.create_thread(Work());
    threads.join_all();
}

False sharing between threads occurs when two or more threads write to data that lives in the same cache line.
E.g.:
struct Work {
    Work(int& d) : data(d) {}
    void operator()() {
        ++data;
    }
    int& data;
};

int main() {
    int false_sharing[10] = { 0 };
    boost::thread_group threads;
    for (int i = 0; i < 10; ++i)
        threads.create_thread(Work(false_sharing[i]));
    threads.join_all();

    int no_false_sharing[10 * CACHELINE_SIZE_INTS] = { 0 };
    for (int i = 0; i < 10; ++i)
        threads.create_thread(Work(no_false_sharing[i * CACHELINE_SIZE_INTS]));
    threads.join_all();
}
The threads in the first block do suffer from false sharing. The threads in the second block do not (thanks to the CACHELINE_SIZE_INTS spacing).
Data on the stack is always 'far' away from other threads. (E.g. under Windows, at least a couple of pages.)
With your definition of the function object, false sharing can occur, because the Work instances are created on the heap and that heap space is used inside the threads.
This may lead to several Work instances being adjacent, which can cause them to share cache lines.
But ... your sample does not make much sense, because data is never touched from outside the thread, so any false sharing is induced needlessly.
The easiest way to prevent problems like this is to copy your 'shared' data locally onto the stack and work on the stack copy. When the work is finished, copy it back to the output variable.
E.g:
struct Work {
    Work(int& d) : data(d) {}
    void operator()()
    {
        int tmp = data;
        for (int i = 0; i < lengthy_op; ++i)
            ++tmp;
        data = tmp;
    }
    int& data;
};
This confines the sharing to a single read at the start and a single write at the end, which removes the false-sharing cost of the lengthy operation.

I did a fair bit of research and it seems there is no silver-bullet solution to false sharing. Here is what I came up with (thanks to Christopher):
1) Pad your data on both sides with unused or less frequently used data.
2) Copy your data onto the stack and copy it back after all the hard work is done.
3) Use cache-aligned memory allocation (see the sketch below).
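For illustration, here is a minimal sketch of (1) and (3) combined, assuming a C++11 compiler and a 64-byte cache line (the real line size is platform-dependent); alignas pads each Work instance out to a full cache line so adjacent instances cannot share one:
#include <boost/ref.hpp>
#include <boost/thread/thread.hpp>

struct Work {
    void operator()() {
        for (int i = 0; i < 1000000; ++i)
            ++data;
    }
    alignas(64) int data;   // the struct now occupies (at least) one assumed cache line
};

int main() {
    Work workers[10] = {};                  // adjacent, but one cache line apart
    boost::thread_group threads;
    for (int i = 0; i < 10; ++i)
        threads.create_thread(boost::ref(workers[i]));  // boost::ref: run on the array elements, not copies
    threads.join_all();
}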

I don't feel entirely sure about the details, but here's my take:
(1) Your simplified example is broken, since boost's create_thread expects a reference and you pass a temporary.
(2) If you use vector<Work> with one item for each thread, or otherwise lay them out sequentially in memory, false sharing will occur.

Related

Reset thread-local variables in OpenMP

I need a consistent way of resetting all the thread-local variables my program creates. The problem is that the thread-local data is created in places different from where it is used.
My program outline is the following:
struct data_t { /* ... */ };

// 1. Function that fetches the "global" thread-local data
data_t& GetData()
{
    static data_t *d = NULL;
    #pragma omp threadprivate(d) // !!!
    if (!d) { d = new data_t(); }
    return *d;
}

// 2. Example function that uses the data
void user(int *elements, int num, int *output)
{
    #pragma omp parallel for shared(elements, output) if (num > 1000)
    for (int i = 0; i < num; ++i)
    {
        // computation is a heavy calculation on memoized data
        computation(GetData());
    }
}
Now, my problem is that I need a function that resets the data, i.e. every thread-local object created must be accounted for.
For now, my solution is to use a parallel region that hopefully uses as many or more threads than the "parallel for", so that every object is "iterated" through:
void ClearThreadLocalData()
{
    #pragma omp parallel
    {
        // assuming data_t has a "clear()" method
        GetData().clear();
    }
}
Is there a more idiomatic / safe way to implement ClearThreadLocalData() ?
You can create and use a global version number for your data. Increment it every time you need to clear the existing caches. Then modify GetData to check the version number if there is an existing data object, discarding the existing one and creating a new one if it is out of date. (The version number for the allocated data_t object can be stored within data_t if you can modify the class, or in a second thread local variable if not.) You'd end up with something like
static int dataVersion;

data_t& GetData()
{
    static data_t *d = NULL;
    #pragma omp threadprivate(d) // !!!
    if (d && d->myDataVersion != dataVersion) {
        delete d;
        d = nullptr;
    }
    if (!d) {
        d = new data_t();
        d->myDataVersion = dataVersion;
    }
    return *d;
}
This doesn't depend on the existence of a Clear method in data_t, but if you have one replace the delete-and-reset with a call to Clear. I'm using d = nullptr to avoid duplicating the call to new data_t().
The global dataVersion could be a static member of data_t if you want to avoid the global variable, and it can be atomic if necessary although GetData would need changes to handle that.
When it comes time to reset the data, just change the global version number:
++dataVersion;
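So ClearThreadLocalData reduces to bumping the counter (a sketch, assuming it is only called from serial code so the increment does not race with readers):
void ClearThreadLocalData()
{
    // Every thread's cached data_t now has a stale myDataVersion and will be
    // rebuilt lazily on that thread's next call to GetData().
    ++dataVersion;
}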

C++ Dereferencing with Parenthesis (with iterators)

My question is pretty simple. I have a vector of values (threads here, but that is irrelevant) and I want to iterate through them. However, there are two versions of the code which look the same to me, but only the second one compiles. I want to know why.
Version 1 (Does not compile)
int main() {
    int someValue = 5;
    vector<std::thread *> threadVector;
    threadVector.resize(20);
    for (int i = 0; i < 20; i++) {
        threadVector[i] = new std::thread(foo, std::ref(someValue));
    }
    for (std::vector<std::thread *>::iterator it = threadVector.begin(); it != threadVector.end(); ++it) {
        *it->join(); // *********Notice this Line*********
    }
    system("pause"); // I know I shouldn't be using this
}
Version 2 (Does work)
int main() {
    int someValue = 5;
    vector<std::thread *> threadVector;
    threadVector.resize(20);
    for (int i = 0; i < 20; i++) {
        threadVector[i] = new std::thread(foo, std::ref(someValue));
    }
    for (std::vector<std::thread *>::iterator it = threadVector.begin(); it != threadVector.end(); ++it) {
        (*it)->join(); // *********Notice this Line*********
    }
    system("pause"); // I know I shouldn't be using this
}
This is an issue of operator precedence: -> binds more tightly than the unary *, so
*it->join();
is parsed as
*(it->join());
which tries to call join() directly on the std::thread* element (a pointer has no members), hence the compile error.
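For what it's worth, a C++11 range-based for loop sidesteps the precedence question entirely (a sketch, assuming the vector of raw std::thread pointers from the question):
for (std::thread* t : threadVector)
    t->join();   // no iterator dereference, so no parenthesization pitfall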
Taking it as a challenge, I just dipped my toes into C++11 for the first time. I found that you can achieve the same without any dynamic allocation of std::thread objects:
#include <iostream>
#include <thread>
#include <vector>

void function()
{
    std::cout << "thread function\n";
}

int main()
{
    std::vector<std::thread> ths;
    ths.push_back(std::move(std::thread(&function)));
    ths.push_back(std::move(std::thread(&function)));
    ths.push_back(std::move(std::thread(&function)));
    while (!ths.empty()) {
        std::thread th = std::move(ths.back());
        ths.pop_back();
        th.join();
    }
}
This works because std::thread has a constructor and an assignment operator taking an rvalue reference, making it movable. Further, all containers have gained support for storing movable objects, and they move rather than copy them on reallocation. Read some online articles about this new C++11 feature; it's too broad to explain here and I don't know it well enough myself.
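As a side note, the std::move around a temporary is not required, and since the container is move-aware you can construct the threads in place (a small sketch under the same assumptions as the code above):
ths.emplace_back(&function);   // constructs the std::thread directly inside the vector
ths.emplace_back(&function);
ths.emplace_back(&function);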
About the concern you raised that threads have a memory cost: I don't think your approach is an optimization. Rather, the dynamic allocation itself has overhead, both in memory and performance. For small objects, the memory overhead of one or two pointers plus possibly some padding is enormous. I wouldn't be surprised if a std::thread object were only the size of a single pointer, giving you an overhead of more than 100%.
Note that this only concerns the std::thread object. The memory required for the actual thread, in particular its stack, is a different issue. However, std::thread objects and the actual threads don't have a 1:1 relation, and dynamically allocating the std::thread object doesn't change anything there either.
If you're still afraid that the reallocations are too expensive, you could reserve a suitable size up front to avoid them. However, if that really is an issue, then you are creating and terminating threads far too often, and that will by far dwarf the overhead of shifting a few small objects around. Consider using a thread pool.

std::vector push_back fails when used in a parallel for loop

I have code that looks like this (simplified):
for (int i = 0; i < input.rows; i++)
{
    if (IsGoodMatch(input[i]))
    {
        Newvalues newValues;
        newValues.x1 = input[i].x1;
        newValues.x2 = input[i].x1 * 2;
        output.push_back(newValues);
    }
}
This code works well, but if I make it parallel using omp parallel for, I get an error on output.push_back, and it seems the memory gets corrupted during the vector resize.
What is the problem and how can I fix it?
How can I make sure only one thread is inserting a new item into the vector at any time?
The simple answer is that std::vector::push_back is not thread-safe.
To do this safely in parallel, you need to synchronize so that push_back isn't called from multiple threads at the same time.
Synchronization in C++11 is easily achieved with an std::mutex.
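For example, a minimal self-contained sketch (the predicate and element type here are stand-ins, not the question's actual code); compile with OpenMP enabled, e.g. -fopenmp:
#include <mutex>
#include <vector>

// Stand-in for the question's IsGoodMatch predicate.
static bool IsGoodMatch(int value) { return value % 2 == 0; }

int main()
{
    std::vector<int> input(100000);
    for (int i = 0; i < (int)input.size(); ++i) input[i] = i;

    std::vector<int> output;
    std::mutex outputMutex;

    #pragma omp parallel for
    for (int i = 0; i < (int)input.size(); ++i)
    {
        if (IsGoodMatch(input[i]))
        {
            int newValue = input[i] * 2;
            std::lock_guard<std::mutex> guard(outputMutex); // only one push_back at a time
            output.push_back(newValue);
        }
    }
    return 0;
}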
std::vector's push_back cannot guarantee correct behavior when called concurrently the way you are doing now (there is no thread-safety).
However, since the elements don't depend on each other, it is very reasonable to resize the vector up front and assign the elements inside the loop instead:
output.resize(input.rows);
int k = 0;
#pragma omp parallel for shared(k, input)
for (int i = 0; i < input.rows; i++)
{
    if (IsGoodMatch(input[i]))
    {
        Newvalues newValues;
        ...
        // ! prevent other threads from modifying k !
        output[k] = newValues;
        k++;
        // ! allow other threads to modify k again !
    }
}
output.resize(k);
since direct access through operator[] doesn't depend on the other members of std::vector that might become inconsistent between threads. However, this solution still needs explicit synchronization (e.g. a mutex or an atomic update) to ensure that each thread uses a correct, unique value of k.
"How can I make sure only one thread is inserting a new item into the vector at any time?"
You don't need to. The threads will be modifying different elements (which reside in different parts of memory). You just need to make sure that the element each thread modifies is the correct one.
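One possible way to make the index update safe with OpenMP itself (a sketch that reuses the names from the snippet above and assumes OpenMP 3.1 or later for atomic capture):
output.resize(input.rows);
int k = 0;
#pragma omp parallel for shared(k, output)
for (int i = 0; i < input.rows; i++)
{
    if (IsGoodMatch(input[i]))
    {
        Newvalues newValues;
        // ... fill newValues as before ...
        int myIndex;
        #pragma omp atomic capture
        myIndex = k++;              // claim a unique slot atomically
        output[myIndex] = newValues;
    }
}
output.resize(k);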
Use a concurrent vector:
#include <concurrent_vector.h>
Concurrency::concurrent_vector<int> comes from Microsoft's Parallel Patterns Library (Intel TBB offers the equivalent tbb::concurrent_vector); it is not part of standard C++11.
It is a thread-safe counterpart of std::vector whose push_back can be called from several threads concurrently.
Put a #pragma omp critical before the push_back.
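For example (a sketch that reuses the loop and names from the question):
#pragma omp parallel for
for (int i = 0; i < input.rows; i++)
{
    if (IsGoodMatch(input[i]))
    {
        Newvalues newValues;
        newValues.x1 = input[i].x1;
        newValues.x2 = input[i].x1 * 2;
        #pragma omp critical
        output.push_back(newValues);   // only one thread at a time executes this statement
    }
}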
I solved a similar problem by deriving from the standard std::vector class just to implement an atomic_push_back method suitable for the OpenMP paradigm.
Here is my "OpenMP-safe" vector implementation:
template <typename T>
class omp_vector : public std::vector<T>
{
private:
    omp_lock_t lock;
public:
    omp_vector()
    {
        omp_init_lock(&lock);
    }
    ~omp_vector()
    {
        omp_destroy_lock(&lock);   // release the lock's resources when the vector goes away
    }
    void atomic_push_back(T const &p)
    {
        omp_set_lock(&lock);
        std::vector<T>::push_back(p);
        omp_unset_lock(&lock);
    }
};
Of course you have to include omp.h. Then your code could be as follows:
omp_vector<...> output;
#pragma omp parallel for shared(input, output)
for (int i = 0; i < input.rows; i++)
{
    if (IsGoodMatch(input[i]))
    {
        Newvalues newValues;
        newValues.x1 = input[i].x1;
        newValues.x2 = input[i].x1 * 2;
        output.atomic_push_back(newValues);
    }
}
If you still need the output vector somewhere else in a non-parallel section of the code, you could just use the normal push_back method.
You can try to use a mutex to fix the problem.
Usually I prefer to implement such things myself:
static int mutex = 1;

int signal(int &x)
{
    x += 1;
    return 0;
}

int wait(int &x)
{
    x -= 1;
    while (x < 0);
    return 0;
}

for (int i = 0; i < input.rows; i++)
{
    if (IsGoodMatch(input[i]))
    {
        Newvalues newValues;
        newValues.x1 = input[i].x1;
        newValues.x2 = input[i].x1 * 2;
        wait(mutex);
        output.push_back(newValues);
        signal(mutex);
    }
}
Hope this could help.

Multithreaded matrix multiplication in C++

I've been having trouble with this parallel matrix multiplication code; I keep getting an error when trying to access a data member in my structure.
This is my main function:
struct arg_struct
{
    int* arg1;
    int* arg2;
    int arg3;
    int* arg4;
};

int main()
{
    pthread_t allthreads[4];
    int A[N*N];
    int B[N*N];
    int C[N*N];
    randomMatrix(A);
    randomMatrix(B);
    printMatrix(A);
    printMatrix(B);
    struct arg_struct *args = (arg_struct*)malloc(sizeof(struct arg_struct));
    args.arg1 = A;
    args.arg2 = B;
    int x;
    for (int i = 0; i < 4; i++)
    {
        args.arg3 = i;
        args.arg4 = C;
        x = pthread_create(&allthreads[i], NULL, &matrixMultiplication, (void*)args);
        if (x != 0)
            exit(1);
    }
    return 0;
}
and the matrixMultiplication method used from another C file:
void *matrixMultiplication(void* arguments)
{
    struct arg_struct* args = (struct arg_struct*) arguments;
    int block = args.arg3;
    int* A = args.arg1;
    int* B = args.arg2;
    int* C = args->arg4;
    free(args);
    int startln = getStartLineFromBlock(block);
    int startcol = getStartColumnFromBlock(block);
    for (int i = startln; i < startln + (N/2); i++)
    {
        for (int j = startcol; j < startcol + (N/2); j++)
        {
            setMatrixValue(C, 0, i, j);
            for (int k = 0; k < N; k++)
            {
                C[i*N+j] += (getMatrixValue(A, i, k) * getMatrixValue(B, k, j));
                usleep(1);
            }
        }
    }
}
Another error I am getting is when creating the thread: "invalid conversion from ‘void (*)(int*, int*, int, int*)’ to ‘void* (*)(void*)’ [-fpermissive]"
Can anyone please tell me what I'm doing wrong?
First, you mix C and C++ very badly; either use plain C or use C++. In C++ you can simply use new and delete.
But the reason for your error is that you allocate one arg_struct and then free it in all four threads. You should allocate one arg_struct for each thread.
Big Boss is right in the sense that he has identified the problem, but to add to and augment his reply:
Option 1:
Just create an arg_struct in the loop and set the members, then pass it through:
for (...)
{
    struct arg_struct *args = (arg_struct*)malloc(sizeof(struct arg_struct));
    args->arg1 = A;
    args->arg2 = B; // set up args as now...
    ...
    x = pthread_create(&allthreads[i], NULL, &matrixMultiplication, (void*)args);
    ....
}
Keep the free call in the thread, but now you can use the passed struct directly rather than creating locals in your thread.
Option 2:
It looks like you want to copy the params from the struct into thread-local variables anyway, so you don't need to allocate dynamically.
Just create an arg_struct and set the members, then pass it through:
arg_struct args;
// set up args as now...
for (...)
{
    ...
    x = pthread_create(&allthreads[i], NULL, &matrixMultiplication, (void*)&args);
}
Then remove the free call.
However, as James pointed out, you would then need to synchronize between the parent and each thread on the structure to make sure it isn't changed before the thread has read it. That would mean a mutex or some other mechanism, so moving the allocation into the for loop is probably easier to begin with.
Part 2:
I'm working on Windows (so I can't experiment right now), but pthread_create's third parameter refers to the thread function matrixMultiplication, which is defined as void* matrixMultiplication(void*); that looks correct to me (signature-wise) against the online man pages: void* fn(void*).
I think I'll have to defer to someone else on your second error. I made this post a community wiki entry so the answer can be added to it if desired.
It's not clear to me what you are trying to do. You start some threads, then you return from main (exiting the process) before getting any results from them.
In this case, I'd probably not use any dynamic allocation directly. (I would use std::vector for the matrices, which uses dynamic allocation internally.) There's no reason to dynamically allocate the arg_struct, since it can safely be copied. Of course, you'll have to wait until each thread has successfully extracted its data before looping to construct the next thread. This would normally be done with a condition variable: the new thread signals the condition once it has extracted the arguments from the arg_struct (or even better, you could use boost::thread, which does this part for you). Alternatively, you could use an array of arg_struct, but there is absolutely no reason to allocate them dynamically. (If for some reason you cannot use std::vector for A, B and C, you will want to allocate these dynamically, in order to avoid any risk of stack overflow. But std::vector is a much better solution.)
Finally, of course, you must wait for all of the threads to finish before leaving main. Otherwise, the threads will continue working on data that doesn't exist any more. You should pthread_join all of the threads before exiting main. Presumably, too, you want to do something with the results of the multiplication, but in any case, exiting main before all of the threads have finished accessing the matrices will cause undefined behavior.
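A minimal sketch of that last point, joining before main returns (this assumes the per-thread allocation fix from the other answers and reuses allthreads, C and printMatrix from the question):
for (int i = 0; i < 4; i++)
    pthread_join(allthreads[i], NULL);   // block until each worker has finished writing into C
printMatrix(C);
return 0;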

Boost, Shared Memory and Vectors

I need to share a stack of strings between processes (and possibly more complex objects in the future). I've decided to use boost::interprocess, but I can't get it to work. I'm sure it's because I'm not understanding something. I followed their example, but I would really appreciate it if someone with experience with that library could look at my code and tell me what's wrong. The problem is that it seems to work, but after a few iterations I get all kinds of exceptions, both in the reader process and sometimes in the writer process. Here's a simplified version of my implementation:
using namespace boost::interprocess;

class SharedMemoryWrapper
{
public:
    SharedMemoryWrapper(const std::string & name, bool server) :
        m_name(name),
        m_server(server)
    {
        if (server)
        {
            named_mutex::remove("named_mutex");
            shared_memory_object::remove(m_name.c_str());
            m_segment = new managed_shared_memory(create_only, name.c_str(), 65536);
            m_stackAllocator = new StringStackAllocator(m_segment->get_segment_manager());
            m_stack = m_segment->construct<StringStack>("MyStack")(*m_stackAllocator);
        }
        else
        {
            m_segment = new managed_shared_memory(open_only, name.c_str());
            m_stack = m_segment->find<StringStack>("MyStack").first;
        }
        m_mutex = new named_mutex(open_or_create, "named_mutex");
    }
    ~SharedMemoryWrapper()
    {
        if (m_server)
        {
            named_mutex::remove("named_mutex");
            m_segment->destroy<StringStack>("MyStack");
            delete m_stackAllocator;
            shared_memory_object::remove(m_name.c_str());
        }
        delete m_mutex;
        delete m_segment;
    }
    void push(const std::string & in)
    {
        scoped_lock<named_mutex> lock(*m_mutex);
        boost::interprocess::string inStr(in.c_str());
        m_stack->push_back(inStr);
    }
    std::string pop()
    {
        scoped_lock<named_mutex> lock(*m_mutex);
        std::string result = "";
        if (m_stack->size() > 0)
        {
            result = std::string(m_stack->begin()->c_str());
            m_stack->erase(m_stack->begin());
        }
        return result;
    }
private:
    typedef boost::interprocess::allocator<boost::interprocess::string, boost::interprocess::managed_shared_memory::segment_manager> StringStackAllocator;
    typedef boost::interprocess::vector<boost::interprocess::string, StringStackAllocator> StringStack;

    bool m_server;
    std::string m_name;
    boost::interprocess::managed_shared_memory * m_segment;
    StringStackAllocator * m_stackAllocator;
    StringStack * m_stack;
    boost::interprocess::named_mutex * m_mutex;
};
EDIT Edited to use named_mutex. Original code was using interprocess_mutex which is incorrect, but that wasn't the problem.
EDIT2 I should also note that things work up to a point. The writer process can push several small strings (or one very large string) before the reader breaks. The reader breaks in such a way that m_stack->begin() no longer refers to a valid string; it's garbage, and further execution throws an exception.
EDIT3 I have modified the class to use boost::interprocess::string rather than std::string. Still, the reader fails with an invalid memory address. Here are the reader and writer:
// reader process
SharedMemoryWrapper mem("MyMemory", true);
std::string myString;
int x = 5;
do
{
    myString = mem.pop();
    if (myString != "")
    {
        std::cout << myString << std::endl;
    }
} while (1); // while (myString != "");

// writer
SharedMemoryWrapper mem("MyMemory", false);
for (int i = 0; i < 1000000000; i++)
{
    std::stringstream ss;
    ss << i; // causes failure after a few thousand iterations
    //ss << "AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA" << i; // causes immediate failure
    mem.push(ss.str());
}
return 0;
Several things leaped out at me about your implementation. One was the use of a pointer to the named mutex object, whereas the documentation of most Boost libraries tends to bend over backwards to avoid pointers. This leads me to ask for a reference to the example snippet you worked from in building your own test case; I have had similar misadventures, and sometimes the only way out was to go back to the exemplar and work forward one step at a time until I came across the breaking change.
The other thing that seems questionable is your allocation of a 64 KB block of shared memory, while your test code loops to 1000000000, pushing a string onto your stack on each iteration.
With a modern PC able to execute 1000 instructions per microsecond and more, and operating systems like Windows still doling out execution quanta in 15-millisecond chunks, it won't take long to exhaust that 64 KB segment. That would be my first guess as to why things go haywire.
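If that is the cause, one way to make it visible (a sketch against the wrapper from the question; the 1024-byte slack is an arbitrary safety margin) is to check the segment's remaining capacity before pushing, or simply to create a much larger segment:
void push(const std::string & in)
{
    scoped_lock<named_mutex> lock(*m_mutex);
    // managed_shared_memory::get_free_memory() reports the bytes still available
    if (m_segment->get_free_memory() < in.size() + 1024)
        throw std::runtime_error("shared memory segment exhausted");
    boost::interprocess::string inStr(in.c_str());
    m_stack->push_back(inStr);
}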
P.S.
I just returned from fixing my name to something resembling my actual identity. Then the irony hit that my answer to your question has been staring us both in the face from the upper left hand corner of the browser page! (That is, of course, presuming I was correct, which is so often not the case in this biz.)
Well, maybe shared memory is not the right design for your problem to begin with. However, we can't know, because we don't know what you are trying to achieve in the first place.