Multiple threads writing into a single buffer concurrently - c++

I have an output container similar to this:
struct cont {
    std::mutex m;
    size_t offset = 0;   // must start at 0, otherwise the first write lands at an indeterminate offset
    char* data;
    cont(size_t sizeB) {
        data = new char[sizeB];
    }
    void write(char* data, size_t sizeB) {
        m.lock();
        size_t off = offset;
        offset += sizeB;
        m.unlock();
        std::memcpy(this->data + off, data, sizeB);
    }
};
The idea is that I have many threads, each working on a dynamically sized workload and outputting data in no specific order into that container. The threads are triggered by server access and there is no telling how many are running concurrently or how much each will contribute.
The reason I'm questioning this is because as you can see, the main workload is outside the mutex lock since in theory, only the distribution of the available buffer needs to be synchronized and the threads shouldn't collide after that bit.
It's been working fine so far, but from previous experience, threading problems can manifest themselves way down the road, so is this considered a thread-safe practice?

Seems OK. If you want to optimize, you could make the offset atomic, to avoid the mutex altogether. So, just declare
std::atomic<size_t> offset;
and the mutex can be removed.
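For illustration, a minimal sketch of that change, reusing the question's cont struct (assuming, as in the original, that nothing reads the buffer until all writers have finished):

#include <atomic>
#include <cstring>

struct cont {
    std::atomic<size_t> offset{0};
    char* data;

    explicit cont(size_t sizeB) { data = new char[sizeB]; }
    ~cont() { delete[] data; }

    void write(const char* src, size_t sizeB) {
        // Atomically reserve a region; the copy itself needs no lock because
        // no two threads can ever receive overlapping [off, off + sizeB) ranges.
        const size_t off = offset.fetch_add(sizeB, std::memory_order_relaxed);
        std::memcpy(data + off, src, sizeB);
    }
};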

As it stands, I'm afraid this is incomplete: your solution correctly allocates space between multiple threads, but you also need a solution for threads to "commit" their writes. Imagine that one writer thread is indefinitely delayed in the midst of a memcpy (or even prior to commencing its memcpy). How does any other thread ever find out about this so that you can eventually use this buffer at all?
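One way to address that (a sketch of my own, not part of the question's code) is to count committed bytes separately from reserved bytes, so a consumer can at least tell when every outstanding write has finished:

#include <atomic>
#include <cstring>

struct cont {
    std::atomic<size_t> reserved{0};   // bytes handed out to writers
    std::atomic<size_t> committed{0};  // bytes whose memcpy has completed
    char* data;

    explicit cont(size_t sizeB) { data = new char[sizeB]; }
    ~cont() { delete[] data; }

    void write(const char* src, size_t sizeB) {
        const size_t off = reserved.fetch_add(sizeB, std::memory_order_relaxed);
        std::memcpy(data + off, src, sizeB);
        committed.fetch_add(sizeB, std::memory_order_release);  // publish this write
    }

    // True only when every reserved byte has also been committed.
    // Note this says nothing about *which* regions are complete mid-stream.
    bool all_writes_finished() const {
        return committed.load(std::memory_order_acquire) ==
               reserved.load(std::memory_order_acquire);
    }
};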

This seems perfectly safe. You're probably worried about trampling on "leftover" bytes concurrently when offset changes by a number which is not a multiple of 4 or 8 bytes. I wanted to alleviate your concerns by quoting the Standard, but the entry for memcpy points to the C Library Reference, which is scant on details. Nevertheless, the function treats the buffers as arrays of unsigned char, so it cannot reliably assume it can optimize the copy of the tail when it's unaligned or an incomplete word, as that could constitute an out-of-bounds access.

Related

Could this publish / check-for-update class for a single writer + reader use memory_order_relaxed or acquire/release for efficiency?

Introduction
I have a small class which makes use of std::atomic for a lock-free operation. Since this class is called massively, it's affecting performance and I'm having trouble.
Class description
The class is similar to a LIFO, but once the pop() function is called, it only returns the last written element of its ring buffer (and only if there are new elements since the last pop()).
A single thread is calling push(), and another single thread is calling pop().
Source I've read
Since this is using too much of my computer's time, I decided to study the std::atomic class and its memory_order a bit further. I've read a lot of the memory_order posts available on StackOverflow and in other sources and books, but I'm not able to get a clear idea of the different modes. Especially, I'm struggling with the acquire and release modes: I fail to see why they are different from memory_order_seq_cst.
What I think each memory order does, in my own words, from my own research
memory_order_relaxed: Within the same thread, the atomic operations are instant, but other threads may fail to see the latest values instantly; they will need some time until they are updated. The code can be re-ordered freely by the compiler or OS.
memory_order_acquire / release: Used by atomic::load. It prevents the lines of code before it from being reordered (the compiler/OS may reorder everything after this line all it wants), and reads the latest value that was stored on this atomic using memory_order_release or memory_order_seq_cst in this thread or another thread. memory_order_release also prevents code after it from being reordered. So, in an acquire/release pair, all the code between both can be shuffled by the OS. I'm not sure if that's within the same thread, or across different threads.
memory_order_seq_cst: Easiest to use because it's like the natural writing we are used to with variables, instantly refreshing the values seen by other threads' load functions.
The LockFreeEx class
template<typename T>
class LockFreeEx
{
public:
    void push(const T& element)
    {
        const int wPos = m_position.load(std::memory_order_seq_cst);
        const int nextPos = getNextPos(wPos);
        m_buffer[nextPos] = element;
        m_position.store(nextPos, std::memory_order_seq_cst);
    }

    const bool pop(T& returnedElement)
    {
        const int wPos = m_position.exchange(-1, std::memory_order_seq_cst);
        if (wPos != -1)
        {
            returnedElement = m_buffer[wPos];
            return true;
        }
        else
        {
            return false;
        }
    }

private:
    static constexpr int maxElements = 8;
    static constexpr int getNextPos(int pos) noexcept { return (++pos == maxElements) ? 0 : pos; }
    std::array<T, maxElements> m_buffer;
    std::atomic<int> m_position {-1};
};
How I expect it could be improved
So, my first idea was to use memory_order_relaxed in all atomic operations, since the pop() thread is in a loop looking for available updates every 10-15 ms, so it's acceptable for the first few pop() calls to fail and only realize later that there is a new update. It's only a bunch of milliseconds.
Another option would be using release/acquire - but I'm not sure about them. Using release in all store() and acquire in all load() functions.
Unfortunately, all the memory_order options I described seem to work, and I'm not sure when they will fail, if they are supposed to fail at all.
Final
Please, could you tell me if you see a problem with using relaxed memory order here? Or should I use release/acquire (maybe a further explanation of these could help me)? Why?
I think that relaxed is the best fit for this class, in all its store() and load() calls. But I'm not sure!
Thanks for reading.
EDIT: EXTRA EXPLANATION:
Since I see everyone is asking about the 'char', I've changed it to int; problem solved! But that isn't the problem I want to solve.
The class, as I stated before, is something like a LIFO, but where only the last element pushed matters, if there is any.
I have a big struct T (copyable and assignable) that I must share between two threads in a lock-free way. So, the only way I know to do it is using a circular buffer that stores the last known value for T, and an atomic which knows the index of the last value written. When there isn't any, the index is -1.
Notice that my pop thread must know when there is a "new T" available, that's why pop() returns a bool.
Thanks again to everyone trying to assist me with memory orders! :)
AFTER READING SOLUTIONS:
template<typename T>
class LockFreeEx
{
public:
    LockFreeEx() {}
    LockFreeEx(const T& initValue): m_data(initValue) {}

    // WRITE THREAD - CAN BE SLOW, WILL BE CALLED EACH 500-800ms
    void publish(const T& element)
    {
        // I used acquire instead of relaxed to make sure wPos is always the latest m_writePos value,
        // and nextPos calculates the right one
        const int wPos = m_writePos.load(std::memory_order_acquire);
        const int nextPos = (wPos + 1) % bufferMaxSize;
        m_buffer[nextPos] = element;
        m_writePos.store(nextPos, std::memory_order_release);
    }

    // READ THREAD - NEEDS TO BE VERY FAST - CALLED ONCE AT THE BEGINNING OF THE LOOP each 2ms
    inline void update()
    {
        // should I change to relaxed? It doesn't matter if I get the new value or the old one,
        // since I will call this function again very soon, and again, and again...
        const int writeIndex = m_writePos.load(std::memory_order_acquire);

        // Updating only in case there is something new... T may be a heavy struct
        if (m_readPos != writeIndex)
        {
            m_readPos = writeIndex;
            m_data = m_buffer[m_readPos];
        }
    }

    // NEEDS TO BE LIGHTNING FAST, CALLED MULTIPLE TIMES IN THE READ THREAD
    inline const T& get() const noexcept { return m_data; }

private:
    // Buffer
    static constexpr int bufferMaxSize = 4;
    std::array<T, bufferMaxSize> m_buffer;
    std::atomic<int> m_writePos {0};
    int m_readPos = 0;

    // Data
    T m_data;
};
Memory order is not about when you see some particular change to an atomic object but rather about what this change can guarantee about the surrounding code. Relaxed atomics guarantee nothing except the change to the atomic object itself: the change will be atomic. But you can't use relaxed atomics in any synchronization context.
And you have some code which requires synchronization. You want to pop something that was pushed and not trying to pop what has not been pushed yet. So if you use a relaxed operation then there is no guarantee that your pop will see this push code:
m_buffer[nextPos] = element;
m_position.store(nextPos, std::memory_order_relaxed);
as it is written. It just as well can see it this way:
m_position.store(nextPos, std::memory_order_relaxed);
m_buffer[nextPos] = element;
So you might try to get an element from the buffer which is not there yet. Hence, you have to use some synchronization and at least use acquire/release memory order.
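As a stand-alone illustration of the difference (my own sketch, not the OP's class): a plain flag published with release and read with acquire guarantees that non-atomic data written before the store is visible after the load, which relaxed would not.

#include <atomic>

int payload = 0;                  // ordinary, non-atomic data
std::atomic<bool> ready{false};

void writer() {
    payload = 42;                                  // (1) plain store
    ready.store(true, std::memory_order_release);  // (2) nothing above may sink below this
}

void reader() {
    if (ready.load(std::memory_order_acquire)) {   // (3) synchronizes-with (2)
        int x = payload;                           // guaranteed to read 42
        (void)x;
    }
}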
And to your actual code. I think the order can be as follows:
const int wPos = m_position.load(std::memory_order_relaxed);
...
m_position.store(nextPos, std::memory_order_release);
...
const int wPos = m_position.exchange(-1, std::memory_order_acquire);
Your writer only needs release, not seq-cst, but relaxed is too weak. You can't publish a value for m_position until after the non-atomic assignment to the corresponding m_buffer[] entry. You need release ordering to make sure the m_position store is visible to other threads only after all earlier memory operations. (Including the non-atomic assignment). https://preshing.com/20120913/acquire-and-release-semantics/
This has to "synchronize-with" an acquire or seq_cst load in the reader. Or at least mo_consume in the reader.
In theory you also need wpos = m_position to be at least acquire (or consume in the reader), not relaxed, because C++11's memory model is weak enough for things like value-prediction which can let the compiler speculatively use a value for wPos before the load actually takes a value from coherent cache.
(In practice on real CPUs, a crazy compiler could do this with test/branch to introduce a control dependency, allowing branch prediction + speculative execution to break the data dependency for a likely value of wPos.)
But normal compilers don't do that. On CPUs other than DEC Alpha, the data dependency in the source code of wPos = m_position and then using m_buffer[wPos] will create a data dependency in the asm, like mo_consume is supposed to take advantage of. Real ISAs other than Alpha guarantee dependency-ordering for dependent loads. (And even on Alpha, using a relaxed atomic exchange might be enough to close the tiny window that exists on the few real Alpha CPUs that allow this reordering.)
When compiling for x86, there's no downside at all to using mo_acquire; it doesn't cost any extra barriers. There can be on other ISAs, like 32-bit ARM where acquire costs a barrier, so "cheating" with a relaxed load could be a win that's still safe in practice. Current compilers always strengthen mo_consume to mo_acquire so we unfortunately can't take advantage of it.
You already have a real-world race condition even using seq_cst.
initial state: m_position = 0
reader "claims" slot 0 by exchanging in a m_position = -1 and reads part of m_buffer[0];
reader sleeps for some reason (e.g. timer interrupt deschedules it), or simply races with a writer.
writer reads wPos = m_position as -1, and calculates nextPos = 0.
It overwrites the partially-read m_buffer[0]
reader wakes up and finishes reading, getting a torn T &element. Data race UB in the C++ abstract machine, and tearing in practice.
Adding a 2nd check of m_position after the read (like a SeqLock) can't detect this in every case because the writer doesn't update m_position until after writing the buffer element.
Even though your real use-case has long gaps between reads and writes, this defect can bite you with just one read and one write happening at almost the same time.
I know for sure that the read side cannot wait for anything and cannot be stopped (it's audio) and it's popped every 5-10ms, while the write side is the user input, which is much slower; a fast one could do a push once every 500ms.
A millisecond is ages on a modern CPU. Inter-thread latency is often something like 60 ns, so fractions of a microsecond, e.g. from a quad-core Intel x86. As long as you don't sleep on a mutex, it's not a problem to spin-retry once or twice before giving up.
Code review:
The class is similar to a LIFO, but once the pop() function is called, it only returns the last written element of its ring buffer (and only if there are new elements since the last pop()).
This isn't a real queue or stack: push and pop aren't great names. "publish" and "read" or "get" might be better and make it more obvious what this is for.
I'd include comments in the code to describe the fact that this is safe for a single writer, multiple readers. (The non-atomic increment of m_position in push makes it clearly unsafe for multiple writers.)
Even so, it's kinda weird even with 1 writer + 1 reader running at the same time. If a read starts while a write is in progress, it will get the "old" value instead of spin-waiting for a fraction of a microsecond to get the new value. Then next time it reads there will already be a new value waiting; the one it just missed seeing last time. So e.g. m_position can update in this order: 2, -1, 3.
That might or might not be desirable, depending on whether "stale" data has any value, and on acceptability of the reader blocking if the writer sleeps mid-write. Or even without the writer sleeping, of spin-waiting.
The standard pattern for rarely written smallish data with multiple read-only readers is a SeqLock. e.g. for publishing a 128-bit current timestamp on a CPU that can't atomically read or write a 128-bit value. See Implementing 64 bit atomic counter with 32 bit atomics
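For reference, a minimal single-writer SeqLock sketch (illustrative names; as discussed further down, the non-atomic copy is technically a data race in ISO C++ but this is the standard pattern in practice):

#include <atomic>

template <typename T>
class SeqLock {
    std::atomic<unsigned> seq{0};   // odd while a write is in progress
    T value{};
public:
    void publish(const T& v) {      // single writer only
        const unsigned s = seq.load(std::memory_order_relaxed);
        seq.store(s + 1, std::memory_order_relaxed);             // mark "write in progress"
        std::atomic_thread_fence(std::memory_order_release);     // keep the data stores after the odd count
        value = v;
        seq.store(s + 2, std::memory_order_release);             // data complete before the even count
    }

    T read() const {                // any number of readers
        T out;
        unsigned s0, s1;
        do {
            s0 = seq.load(std::memory_order_acquire);
            out = value;                                          // may be torn; retried below if so
            std::atomic_thread_fence(std::memory_order_acquire);  // order the data reads before the re-check
            s1 = seq.load(std::memory_order_relaxed);
        } while (s0 != s1 || (s0 & 1));                           // retry if a write overlapped
        return out;
    }
};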
Possible design changes
To make this safe, we could let the writer run free, always wrapping around its circular buffer, and have the reader keep track of the last element it looked at.
If there's only one reader, this should be a simple non-atomic variable. If it's an instance variable, at least put it on the other side of m_buffer[] from the write-position.
// Possible failure mode: writer wraps around between reads, leaving same m_position
// single-reader
const bool read(T &elem)
{
    // FIXME: big hack to get this in a separate cache line from the instance vars
    // maybe instead use alignas(64) int m_lastread as a class member, and/or on the other side of m_buffer from m_position.
    static int lastread = -1;

    int wPos = m_position.load(std::memory_order_acquire);   // or cheat with relaxed to get asm that's like "consume"
    if (lastread == wPos)
        return false;

    elem = m_buffer[wPos];
    lastread = wPos;
    return true;
}
You want lastread in a separate cache line from the stuff the writer writes. Otherwise the reader's updates of lastread will be slower because of false sharing with the writer's writes, and vice versa.
This lets the reader(s) be truly read-only wrt. the cache lines written by the writer. It will still take MESI traffic to request read access to lines that are in Modified state after the writer writes them, though. But the writer can still read m_position with no cache miss, so it can get its stores into the store buffer right away. It only has to wait for an RFO to get exclusive ownership of the cache line(s) before it can commit the element and the updated m_position from its store buffer to L1d cache.
TODO: let m_position increment without manual wrapping, so we have a write sequence number that takes a very long time to wrap around, avoiding false-negative early out from lastread == wPos.
Use wPos & (maxElements-1) as the index. And static_assert((maxElements & (maxElements-1)) == 0, "maxElements must be a power of 2"); (note the parentheses: == binds tighter than &).
Then the only danger is undetected tearing in a tiny time-window if the writer has wrapped all the way around and is writing the element being read. For frequent reads and infrequent writes, and a buffer that's not too small, this should never happen. Checking the m_position again after a read (like a SeqLock, similar to below) narrows the race window to only writes that are still in progress.
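A sketch of that TODO, assuming a single reader and a power-of-2 buffer (the names here are mine, not from the question):

#include <array>
#include <atomic>

template <typename T>
class SeqPublish {                            // illustrative name
    static constexpr unsigned maxElements = 8;
    static_assert((maxElements & (maxElements - 1)) == 0,
                  "maxElements must be a power of 2");
    std::array<T, maxElements> m_buffer;
    std::atomic<unsigned> m_position{0};      // monotonic write sequence number, never wrapped manually
    unsigned m_lastread = 0;                  // reader-private; ideally on another cache line
public:
    void publish(const T& elem)               // single writer
    {
        const unsigned seq = m_position.load(std::memory_order_relaxed) + 1;
        m_buffer[seq & (maxElements - 1)] = elem;
        m_position.store(seq, std::memory_order_release);
    }

    bool read(T& elem)                        // single reader
    {
        const unsigned seq = m_position.load(std::memory_order_acquire);
        if (seq == m_lastread)
            return false;                     // no false negatives until the counter itself wraps
        elem = m_buffer[seq & (maxElements - 1)];
        m_lastread = seq;
        return true;                          // can still tear if the writer laps the buffer mid-read
    }
};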
If there are multiple readers, another good option might be a claimed flag in each m_buffer entry. So you'd define
template<typename T>
class WaitFreePublish
{
private:
    struct {
        alignas(32) T elem;           // at most 2 elements per cache line
        std::atomic<int8_t> claimed;  // writer sets this to 0, readers try to CAS it to 1
                                      // could be bool if we don't end up needing 3 states for anything.
                                      // set to "1" in the constructor? or invert and call it "unclaimed"
    } m_buffer[maxElements];
    std::atomic<int> m_position {-1};
};
If T has padding at the end, it's a shame we can't take advantage of that for the claimed flag :/
This avoids the possible failure mode of comparing positions: if the writer wraps around between reads, the worst we get is tearing. And we could detect such tearing by having the writer clear the claimed flag first, before writing the rest of the element.
With no other threads writing m_position, we can definitely use a relaxed load without worry. We could even cache the write-position somewhere else, but the reader hopefully isn't invalidating the cache-line containing m_position very often. And apparently in your use-case, writer performance/latency probably isn't a big deal.
So the writer + reader could look like this, with SeqLock-style tearing detection using the known update-order for claimed flag, element, and m_position.
/// claimed flag per array element supports concurrent readers
// thread-safety: single-writer only
// update claimed flag first, then element, then m_position.
void publish(const T& elem)
{
    const int wPos = m_position.load(std::memory_order_relaxed);
    const int nextPos = getNextPos(wPos);

    m_buffer[nextPos].claimed.store(0, std::memory_order_relaxed);
    std::atomic_thread_fence(std::memory_order_release);  // make sure that `0` is visible *before* the non-atomic element modification
    m_buffer[nextPos].elem = elem;

    m_position.store(nextPos, std::memory_order_release);
}

// thread-safety: multiple readers are ok.  First one to claim an entry gets it
// check claimed flag before/after to detect overwrite, like a SeqLock
const bool read(T &elem)
{
    int rPos = m_position.load(std::memory_order_acquire);

    int8_t claimed = m_buffer[rPos].claimed.load(std::memory_order_relaxed);
    if (claimed != 0)
        return false;      // read-only early-out

    claimed = 0;
    if (!m_buffer[rPos].claimed.compare_exchange_strong(
            claimed, 1, std::memory_order_acquire, std::memory_order_relaxed))
        return false;  // strong CAS failed: another thread claimed it

    elem = m_buffer[rPos].elem;

    // final check that the writer didn't step on this buffer during read, like a SeqLock
    std::atomic_thread_fence(std::memory_order_acquire);    // LoadLoad barrier

    // We expect it to still be claimed=1 like we set with CAS
    // Otherwise we raced with a writer and elem may be torn.
    // optionally retry once or twice in this case because we know there's a new value waiting to be read.
    return m_buffer[rPos].claimed.load(std::memory_order_relaxed) == 1;

    // Note that elem can be updated even if we return false, if there was tearing.  Use a temporary if that's not ok.
}
Using claimed = m_buffer[rPos].claimed.exchange(1) and checking for claimed==0 would be another option, vs. CAS-strong. Maybe slightly more efficient on x86. On LL/SC machines I guess CAS might be able to bail out without doing a write at all if it finds a mismatch with expected, in which case the read-only check is pointless.
I used .claimed.compare_exchange_strong(claimed, 1) with success ordering = acquire to make sure that read of claimed happens-before reading .elem.
The "failure" memory ordering can be relaxed: If we see it already claimed by another thread, we give up and don't look at any shared data.
The memory-ordering of the store part of compare_exchange_strong can be relaxed, so we just need mo_acquire, not acq_rel. Readers don't do any other stores to the shared data, and I don't think the ordering of the store matters wrt. to the loads. CAS is an atomic RMW. Only one thread's CAS can succeed on a given buffer element because they're all trying to set it from 0 to 1. That's how atomic RMWs work, regardless of being relaxed or seq_cst or anything in between.
It doesn't need to be seq_cst: we don't need to flush the store buffer or whatever to make sure the store is visible before this thread reads .elem. Just being an atomic RMW is enough to stop multiple threads from actually thinking they succeed. Release would just make sure it can't move earlier, ahead of the relaxed read-only check. That wouldn't be a correctness problem. Hopefully no x86 compilers would do that at compile time. (At runtime on x86, RMW atomic operations are always seq_cst.)
I think being an RMW makes it impossible for it to "step on" a write from a writer (after wrapping around). But this might be real-CPU implementation detail, not ISO C++. In the global modification order for any given .claimed, I think the RMW stays together, and the "acquire" ordering does keep it ahead of the read of the .elem. A release store that wasn't part of a RMW would be a potential problem though: a writer could wrap around and put claimed=0 in a new entry, then the reader's store could eventually commit and set it to 1, when actually no reader has ever read that element.
If we're very sure the reader doesn't need to detect writer wrap-around of the circular buffer, leave out the std::atomic_thread_fence in the writer and reader. (The claimed and the non-atomic element store will still be ordered by the release-store to m_position). The reader can be simplified to leave out the 2nd check and always return true if it gets past the CAS.
Notice that m_buffer[nextPos].claimed.store(0, std::memory_order_release); would not be sufficient to stop later non-atomic stores from appearing before it: release-stores are a one-way barrier, unlike release fences. A release-fence is like a 2-way StoreStore barrier. (Free on x86, cheap on other ISAs.)
This SeqLock-style tearing detection doesn't technically avoid UB in the C++ abstract machine, unfortunately. There's no good / safe way to express this pattern in ISO C++, and it's known to be safe in asm on real hardware. Nothing actually uses the torn value (assuming read()'s caller ignores its elem value if it returns false).
Making elem a std::atomic<T> would defeat the entire purpose: that would use a spinlock to get atomicity, so you might as well use one directly.
Using volatile T elem would break buffer[i].elem = elem because unlike C, C++ doesn't allow copying a volatile struct to/from a regular struct. (volatile struct = struct not possible, why?). This is highly annoying for a SeqLock type of pattern where you'd like the compiler to emit efficient code to copy the whole object representation, optionally using SIMD vectors. You won't get that if you write a constructor or assignment operator that takes a volatile &T argument and does individual members. So clearly volatile is the wrong tool, and that only leaves compiler memory barriers to make sure the non-atomic object is fully read or fully written before the barrier. std::atomic_thread_fence is I think actually safe for that, like asm("" ::: "memory") in GNU C. It works in practice on current compilers.

array pointers that don't change size. Do I need locks?

I am using threads to increase the speed of my program.
As a result I now have an array of 8 bitset<UINT64_MAX> bitsets. I plan on creating 8 separate threads, each of which is responsible for setting and checking the bitset it owns, which is defined by an index passed to each thread.
Given that they are accessing and modifying the same bitset array, do I need to use mutexes?
Here is an example of my code:
#define NUM_CORES 8
class MyBitsetClass {
public:
    bitset<UINT64_MAX> bitsets[NUM_CORES];
    thread threads[NUM_CORES];

    void init() {
        for (uint8_t i = 0; i < NUM_CORES; i++) {
            threads[i] = thread(&MyBitsetClass::thread_handler, this, i);
        }
        ... do other stuff
    }

    void thread_handler(uint8_t i){
        // 2 threads are never passed the same i value so they are always
        // modifying their 'own' bitset. do I need a mutex?
        bitsets[i].set(some_index);
    }
};
do I need to use mutexes?
No, because the array is pre-allocated before the threads are created and does not change size, and each thread is independently accessing a different element of the array, so there is no overlap or sharing of any data that needs to be protected from concurrent access across thread boundaries.
Given that they are accessing and modifying the same bitset array, do I need to use mutexes?
No; as long as each thread uses a separate element of the array, no synchronisation is needed.
However, the access to that array may be effectively serialised if the bitsets are small, due to "false sharing" caused by accessing the same cache line from multiple threads. This won't be a problem if the threads only spend a small amount of time accessing the array for example only writing at the very end of an expensive calculation.
bitset<UINT64_MAX> isn't small though. 8 of those bitsets are 16 exabytes in total. I hope you got a good deal when sourcing the hardware :)
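For the general case where each thread's element is small (unlike these enormous bitsets), a common way to avoid the false sharing described above is to pad each per-thread slot out to its own cache line. A sketch (64-byte line size and C++17 aligned allocation assumed):

#include <cstdint>
#include <thread>
#include <vector>

constexpr std::size_t kCacheLine = 64;        // typical x86 line size; an assumption

struct alignas(kCacheLine) PerThreadCounter {
    std::uint64_t value = 0;                  // each slot occupies a full cache line
};

void count_in_parallel(unsigned numThreads)
{
    std::vector<PerThreadCounter> slots(numThreads);
    std::vector<std::thread> workers;
    for (unsigned i = 0; i < numThreads; ++i) {
        workers.emplace_back([&slots, i] {
            for (int n = 0; n < 1000000; ++n)
                slots[i].value++;             // no cache-line ping-pong between threads
        });
    }
    for (auto& w : workers)
        w.join();
}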

Safe to use int in multithreaded single writer multi-reader code

I'm writing parallel code that has a single writer and multiple readers. The writer will fill in an array from beginning to end, and the readers will access elements of the array in order. Pseudocode is something like the following:
std::vector<Stuff> vec(knownSize);
int producerIndex = 0;
std::atomic<int> consumerIndex = 0;
Producer thread:
for(a while){
    vec[producerIndex] = someStuff();
    ++producerIndex;
}
Consumer thread:
while(!finished){
    int myIndex = consumerIndex++;
    while(myIndex >= producerIndex){ spin(); }
    use(vec[myIndex]);
}
Do I need any sort of synchronization around the producerIndex? It seems like the worst thing that could happen is that I would read an old value while it's being updated so I might spin an extra time. Am I missing anything? Can I be sure that each assignment to myIndex will be unique?
As the comments have pointed out, this code has a data race. Instead of speculating about whether the code has a chance of doing what you want, just fix it: change the type of producerIndex and consumerIndex from int to std::atomic<int> and let the compiler implementor and standard library implementor worry about how to make that work right on your target platform.
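A sketch of that fix, staying close to the question's pseudocode (Stuff, someStuff(), spin(), use(), finished and knownSize are the question's placeholders): the element writes stay non-atomic, but the shared indices become atomics with a release/acquire pairing so the consumer only sees an index after the element behind it is complete.

std::vector<Stuff> vec(knownSize);
std::atomic<int> producerIndex{0};
std::atomic<int> consumerIndex{0};

// Producer thread:
for(a while){
    vec[producerIndex.load(std::memory_order_relaxed)] = someStuff();
    // release: the element above is fully written before the new index becomes visible
    producerIndex.fetch_add(1, std::memory_order_release);
}

// Consumer thread:
while(!finished){
    int myIndex = consumerIndex.fetch_add(1, std::memory_order_relaxed);
    // acquire pairs with the producer's release, so vec[myIndex] is ready once the index covers it
    while(myIndex >= producerIndex.load(std::memory_order_acquire)){ spin(); }
    use(vec[myIndex]);
}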
It's likely that the array will be stored in the cache so all the threads will have their own copy of it. Whenever your producer puts a new value in the array this will set the dirty bit on the store address, so every other thread that uses the value will retrieve it from the RAM to its own copy in the cache.
That means you will get a lot of cache misses but no race conditions. :)

How to get multithreads working properly using pthreads and not boost in class using C++

I have a project that I have summarized here with some pseudo code to illustrate the problem.
I do not have a compiler issue and my code compiles well whether it be using boost or pthreads. Remember this is pseudo code designed to illustrate the problem and not directly compilable.
The problem I am having is that for a multithreaded function the memory usage and processing time is always greater than if the same function is achieved using serial programming, e.g. a for/while loop.
Here is a simplified version of the problem I am facing:
class aproject(){
public:
    typedef struct
    {
        char** somedata;
        double output, fitness;
    } entity;

    entity **entity_array;
    int whichthread, numthreads;
    pthread_mutex_t mutexdata;

    aproject(){
        numthreads = 100;
        *entity_array = new entity[numthreads];
        for(int i; i < numthreads; i++){
            entity_array[i]->somedata[i] = new char[100];
        }
        /*.....more memory allocations for entity_array.......*/
        this->initdata();
        this->eval_thread();
    }

    void initdata(){
        /**put zeros and ones in entity_array**/
    }

    float somefunc(char *somedata){
        float output = countzero(); //someother function not listed
        return output;
    }

    void* thread_function()
    {
        pthread_mutex_lock(&mutexdata);
        int currentthread = this->whichthread;
        this->whichthread += 1;
        pthread_mutex_unlock(&mutexdata);

        entity *ent = this->entity_array[currentthread];
        double A=0, B=0, C=0, D=0, E=0, F=0;
        int i, j, k, l;

        A = somefunc(ent->somedata[0]);
        B = somefunc(ent->somedata[1]);
        t4 = anotherfunc(A, B);
        ent->output = t4;
        ent->fitness = sqrt(pow(t4, 2));
    }

    static void* staticthreadproc(void* p){
        return reinterpret_cast<ga*>(p)->thread_function();
    }

    void eval_thread(){
        //use multithreading to evaluate individuals in parallel
        int i, j, k;
        nthreads = this->numthreads;
        pthread_t threads[nthreads];

        //create threads
        pthread_mutex_init(&this->mutexdata, NULL);
        this->whichthread = 0;

        for(i = 0; i < nthreads; i++){
            pthread_create(&threads[i], NULL, &ga::staticthreadproc, this);
            //printf("creating thread, %d\n",i);
        }

        //join threads
        for(i = 0; i < nthreads; i++){
            pthread_join(threads[i], NULL);
        }
    }
};
I am using pthreads here because it works better than boost on machines with less memory.
Each thread is started in eval_thread and terminated there as well. I am using a mutex to ensure every thread starts with the correct index for the entity_array, as each thread only applies its work to its respective entity_array element indexed by the variable this->whichthread. This variable is the only thing that needs to be locked by the mutex, as it is updated for every thread and must not be changed by other threads. You can happily ignore everything apart from thread_function, eval_thread, and staticthreadproc, as they are the only relevant functions; assume that all the other functions apart from init are both processor and memory intensive.
So my question is: why is using multithreading in this way more costly in memory and speed than the traditional method of not using threads at all?
I MUST REITERATE THE CODE IS PSEUDO CODE AND THE PROBLEM ISN'T WHETHER IT WILL COMPILE
Thanks, I would appreciate any suggestions you might have for pthreads and/or boost solutions.
Each thread requires its own call-stack, which consumes memory. Every local variable of your function (and all other functions on the call-stack) counts toward that memory.
When creating a new thread, space for its call-stack is reserved. I don't know what the default-value is for pthreads, but you might want to look into that. If you know you require less stack-space than is reserved by default, you might be able to reduce memory-consumption significantly by explicitly specifying the desired stack-size when spawning the thread.
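For reference, a sketch of how that looks with pthread attributes inside something like the question's eval_thread() (the 256 KiB figure is just an example; the glibc default is commonly 8 MiB):

pthread_attr_t attr;
pthread_attr_init(&attr);
pthread_attr_setstacksize(&attr, 256 * 1024);   // request a 256 KiB stack per thread

for (i = 0; i < nthreads; i++) {
    pthread_create(&threads[i], &attr, &ga::staticthreadproc, this);
}

pthread_attr_destroy(&attr);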
As for the performance-part - it could depend on several issues. Generally, you should not expect a performance boost from parallelizing number-crunching operations onto more threads than you have cores (don't know if that is the case here). This might end up being slower, due to the additional overhead of context-switches, increased amount of cache-misses, etc. There are ways to profile this, depending on your platform (for instance, the Visual Studio profiler can count cache-misses, and there are tools for Linux as well).
Creating a thread is quite an expensive operation. If each thread only does a very small amount of work, then your program may be dominated by the time taken to create them all. Also, a large number of active threads can increase the work needed to schedule them, degrading system performance. And, as another answer mentions, each thread requires its own stack memory, so memory usage will grow with the number of threads.
Another issue can be cache invalidation; when one thread writes its results in to its entity structure, it may invalidate nearby cached memory and force other threads to fetch data from higher-level caches or main memory.
You may have better results if you use a smaller number of threads, each processing a larger subset of the data. For a CPU-bound task like this, one thread per CPU is probably best - that means that you can keep all CPUs busy, and there's no need to waste time scheduling multiple threads on each. Make sure that the entities each thread works on are located together in the array, so it does not invalidate those cached by other threads.
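A sketch of that approach using the question's entity type, with one thread per hardware core and a contiguous slice per thread (process_entity() is a hypothetical stand-in for the per-entity work done in thread_function):

#include <algorithm>
#include <thread>
#include <vector>

void eval_parallel(entity** entity_array, int numEntities)
{
    const int numCores = std::max(1u, std::thread::hardware_concurrency());
    std::vector<std::thread> workers;
    for (int t = 0; t < numCores; ++t) {
        workers.emplace_back([=] {
            // Contiguous slice: keeps each thread's entities close together in memory.
            const int begin = numEntities * t / numCores;
            const int end   = numEntities * (t + 1) / numCores;
            for (int i = begin; i < end; ++i)
                process_entity(entity_array[i]);   // hypothetical per-entity work function
        });
    }
    for (auto& w : workers)
        w.join();
}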

Any issues with large numbers of critical sections?

I have a large array of structures, like this:
typedef struct
{
    int a;
    int b;
    int c;
    etc...
} data_type;

data_type data[100000];
I have a bunch of separate threads, each of which will want to make alterations to elements within data[]. I need to make sure that no two threads attempt to access the same data element at the same time. To be precise: one thread performing data[475].a = 3; and another thread performing data[475].b = 7; at the same time is not allowed, but one thread performing data[475].a = 3; while another thread performs data[476].a = 7; is allowed. The program is highly speed critical. My plan is to make a separate critical section for each data element like so:
typedef struct
{
    CRITICAL_SECTION critsec;
    int a;
    int b;
    int c;
    etc...
} data_type;
In one way I guess it should all work and I should have no real questions, but not having had much experience in multithreaded programming I am just feeling a little uneasy about having so many critical sections. I'm wondering if the sheer number of them could be creating some sort of inefficiency. I'm also wondering if perhaps some other multithreading technique could be faster? Should I just relax and go ahead with plan A?
With this many objects, most of their critical sections will be unlocked, and there will be almost no contention. As you already know (other comment), critical sections don't require a kernel-mode transition if they're unowned. That makes critical sections efficient for this situation.
The only other consideration would be whether you would want the critical sections inside your objects or in another array. Locality of reference is a good reason to put the critical sections inside the object. When you've entered the critical section, an entire cache line (typically 64 bytes) will already be in cache. With a bit of padding, you can make sure each object starts on a cache line. As a result, the object will be (partially) in cache once its critical section is entered.
Your plan is worth trying, but I think you will find that Windows is unhappy creating that many Critical Sections. Each CS contains some kernel handle(s) and you are using up precious kernel space. I think, depending on your version of Windows, you will run out of handle memory and InitializeCriticalSection() or some other function will start to fail.
What you might want to do is have a pool of CSs available for use, and store a pointer to the 'in use' CS inside your struct. But then this gets tricky quite quickly and you will need to use Atomic operations to set/clear the CS pointer (to atomically flag the array entry as 'in use'). Might also need some reference counting, etc...
Gets complicated.
So try your way first, and see what happens. We had a similar situation once, and we had to go with a pool, but maybe things have changed since then.
Depending on the data member types in your data_type structure (and also depending on the operations you want to perform on those members), you might be able to forgo using a separate synchronization object, using the Interlocked functions instead.
In your sample code, all the data members are integers, and all the operations are assignments (and presumably reads), so you could use InterlockedExchange() to set the values atomically and InterlockedCompareExchange() to read the values atomically.
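A sketch of that all-integer variant using the Win32 Interlocked APIs directly (LONG rather than int, since that is what this Interlocked family operates on):

#include <windows.h>

struct data_type {
    volatile LONG a;
    volatile LONG b;
    volatile LONG c;
};

data_type data[100000];

void writer_thread() {
    InterlockedExchange(&data[475].a, 3);   // atomic store with a full barrier
}

void reader_thread() {
    // Compare-exchange with identical exchange/comparand just returns the current value.
    // (On x86 an aligned LONG read is already atomic; this mainly adds ordering.)
    LONG current = InterlockedCompareExchange(&data[475].a, 0, 0);
    (void)current;
}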
If you need to use non-integer data member types, or if you need to perform more complex operations, or if you need to coordinate atomic access to more than one operation at a time (e.g., read data[1].a and then write data[1].b), then you will have to use a synchronization object, such as a CRITICAL_SECTION.
If you must use a synchronization object, I recommend that you consider partitioning your data set into subsets and use a single synchronization object per subset. For example, you might consider using one CRITICAL_SECTION for each span of 1000 elements in the data array.
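A sketch of that partitioning (one CRITICAL_SECTION per 1000-element stripe; the stripe size is just the example figure from above):

#include <windows.h>

constexpr int NUM_ELEMENTS = 100000;
constexpr int STRIPE_SIZE  = 1000;                        // one lock per 1000 elements
constexpr int NUM_STRIPES  = NUM_ELEMENTS / STRIPE_SIZE;

data_type data[NUM_ELEMENTS];
CRITICAL_SECTION stripes[NUM_STRIPES];

void init_locks() {
    for (int i = 0; i < NUM_STRIPES; ++i)
        InitializeCriticalSection(&stripes[i]);
}

void update_element(int index, int newA, int newB) {
    CRITICAL_SECTION* cs = &stripes[index / STRIPE_SIZE]; // all elements of a stripe share one lock
    EnterCriticalSection(cs);
    data[index].a = newA;                                 // multiple fields updated under one lock
    data[index].b = newB;
    LeaveCriticalSection(cs);
}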
You could also consider a mutex.
This is a nice method.
Each client could reserve the resource by itself with a mutex (mutual exclusion).
This is more common; some libraries also support this with threads.
Read about boost::thread and its mutexes.
With Your approach:
data_type data[100000];
I'd be afraid of stack overflow, unless you're allocating it on the heap.
EDIT:
boost::mutex
uses Win32 critical sections
As others have pointed out, yes, there is an issue, and it is called too-fine-grained locking. It's resource-wasteful, and even though the chances are small, you will start creating a lot of backing primitives and data when things do get an occasional (call it longer-than-usual) contention. Plus you are wasting resources, as it is not really a trivial data structure, as for example in VM implementations.
If I recall correctly you will have a higher chance of an SEH exception from that point onwards on Win32, or just higher memory usage. Partitioning and pooling them is probably the way to go, but it is a more complex implementation. Partitioning on something else (re: action) and expecting some short-lived contention is another way to deal with it.
In any case, it is a problem of resource management with what you have right now.