Faster locking of an integer counter when using pthreads in C/C++

I have a counter that's used by multiple threads to write to a specific element in an array. Here's what I have so far...
int count = 0;
pthread_mutex_t count_mutex;

void *Foo(void *arg)
{
    // something = random value from I/O redirection
    pthread_mutex_lock(&count_mutex);
    count = count + 1;
    currentCount = count;
    pthread_mutex_unlock(&count_mutex);
    // do quick assignment operation: array[currentCount] = something
    return NULL;
}

int main()
{
    // create n pthreads with the task Foo
}
The problem is that it is ungodly slow. I'm accepting a file of integers via I/O redirection and writing them into an array. It seems like each thread spends a lot of time waiting for the lock to be released. Is there a faster way to increment the counter?
Note: I need to keep the numbers in order which is why I have to use a counter vs giving each thread a specific chunk of the array to write to.

You need to use atomic operations. Check out the Interlocked* functions on Windows, Apple's OSAtomic* functions, or libatomic_ops / the __sync builtins on Linux.
If you have a compiler that supports C++11 well, you may even be able to use std::atomic.

Well, one option is to batch up the changes locally somewhere before applying the batch to your protected resource.
For example, have each thread gather ten pieces of information (or fewer, if it runs out before gathering ten), then modify Foo to take an array and a length - that way, you amortise the cost of the locking, making it much more efficient.
I'd also be very wary of doing:
// do quick assignment operation. array[currentCount] = something
outside the protected area - that's a recipe for disaster since another thread may change currentCount from underneath you. That's not a problem if it's a local variable since each thread will have its own copy but it's not clear from the code what scope that variable has.

Related

Is this request frequency limiter thread safe?

To prevent excessive server load, I implemented a request rate limiter using a sliding-window algorithm, which decides whether the current request is allowed through based on its parameters. To make the algorithm thread-safe, I used an atomic type to control the number of sliding steps of the window, and a unique_lock to compute the correct total number of requests in the current window.
But I'm not sure whether my implementation is thread-safe, and if it is, whether it will hurt service performance. Is there a better way to achieve this?
class SlideWindowLimiter
{
public:
    bool TryAcquire();
    void SlideWindow(int64_t window_number);

private:
    int32_t limit_;                      // maximum number of requests per window
    int32_t split_num_;                  // number of subwindows
    int32_t window_size_;                // the big window
    int32_t sub_window_size_;            // size of a subwindow = window_size_ / split_num_
    int16_t index_{0};                   // current index into the window vector
    std::mutex mtx_;
    std::vector<int32_t> sub_windows_;   // window vector
    std::atomic<int64_t> start_time_{0}; // start time of the limiter
};
bool SlideWindowLimiter::TryAcquire() {
    int64_t cur_time = std::chrono::duration_cast<std::chrono::microseconds>(
        std::chrono::system_clock::now().time_since_epoch()).count();
    auto time_stamp = start_time_.load();
    int64_t window_num = std::max(cur_time - window_size_ - start_time_, int64_t(0)) / sub_window_size_;

    std::unique_lock<std::mutex> guard(mtx_, std::defer_lock);
    if (window_num > 0 && start_time_.compare_exchange_strong(time_stamp, start_time_.load() + window_num * sub_window_size_)) {
        guard.lock();
        SlideWindow(window_num);
        guard.unlock();
    }
    monitor_->TotalRequestQps();
    {
        guard.lock();
        int32_t total_req = 0;
        std::cout << " " << std::endl;
        for (auto &p : sub_windows_) {
            std::cout << p << " " << std::this_thread::get_id() << std::endl;
            total_req += p;
        }
        if (total_req >= limit_) {
            monitor_->RejectedRequestQps();
            return false;
        } else {
            monitor_->PassedRequestQps();
            sub_windows_[index_] += 1;
            return true;
        }
        guard.unlock();
    }
}
void SlideWindowLimiter::SlideWindow(int64_t window_num) {
    int64_t slide_num = std::min(window_num, int64_t(split_num_));
    for (int i = 0; i < slide_num; i++) {
        index_ += 1;
        index_ = index_ % split_num_;
        sub_windows_[index_] = 0;
    }
}
First of all, thread safety is a relative property. Two sequences of operations are thread-safe relative to each other. A single bit of code cannot be thread-safe by itself.
I'll instead answer "am I handling threading in such a way that reasonable thread-safety guarantees could be made with other reasonable code".
The answer is "No".
I found one concrete problem: your use of atomics and compare_exchange_strong isn't in a loop, and you read and write start_time_ at multiple spots without proper care. If start_time_ changes between the three places where you touch it, the exchange fails, you skip the call to SlideWindow, and then... proceed as if you had made it.
I can't think of why that would be a reasonable response to contention, so that is a "No, this code isn't written to behave reasonably under multiple threads using it".
There are a lot of bad smells in your code. You are mixing concurrency code with a whole pile of state, which means it isn't clear which mutexes are guarding which data.
You have a pointer (monitor_) that is never defined. Maybe it is supposed to be a global variable?
You are writing to cout using multiple << on one line. That is a bad plan in a multithreaded environment; even if your cout is concurrency-hardened, you get scrambled output. Build a buffer string and do one <<.
You are passing data between functions via the back door: index_, for example. One function sets a member variable, another reads it. Is there any possibility it gets edited by another thread? Hard to audit, but it seems reasonably likely; you set it under one .lock(), then .unlock(), then read it as if it were in a sensible state under a later .lock(). What's more, you use it to index into a vector; if the vector or the index changed in unplanned ways, that could crash or lead to memory corruption.
...
I would be shocked if this code didn't have a pile of race conditions, crashes and the like in production. I see no sign of any attempt to prove that this code is concurrency safe, or simplify it to the point where it is easy to sketch such a proof.
In actual real practice, any code that you haven't proven is concurrency safe is going to be unsafe to use concurrently. So complex concurrency code is almost guaranteed to be unsafe to use concurrently.
...
Start with a really, really simple model. If you have a mutex and some data, make that mutex and the data into a single struct, so you know exactly what that mutex is guarding.
If you are messing with an atomic, don't use it in the middle of other code mixed up with other variables. Put it in its own class. Give that class a name, representing some concrete semantics, ideally ones that you have found elsewhere. Describe what it is supposed to do, and what the methods guarantee before and after. Then use that.
Elsewhere, avoid any kind of global state. This includes class member variables used to pass state around. Pass your data explicitly from one function to another. Avoid pointers to anything mutable.
If your data is all value types in automatic storage and pointers to immutable (never changing in the lifetime of your threads) data, that data can't be directly involved in a race condition.
The remaining data is bundled up and firewalled in as small a spot as possible, and you can look at how you interact with it and determine if you are messing up.
...
Multithreaded programming is hard, especially in an environment with mutable data. If you aren't working to make it possible to prove your code is correct, you are going to produce code that isn't correct, and you won't know it.
Well, based on my experience, I know it: any code that isn't obviously trying to make its own correctness easy to show is simply incorrect. If the code is old and has a decade-plus of patches over it, the incorrectness is probably rarer and harder to find, but it is probably still there. If it is new code, the incorrectness is probably easier to find.

Thread Safe Integer Array?

I have a situation where I have a legacy multi-threaded application I'm trying to move to a linux platform and convert into C++.
I have a fixed size array of integers:
int R[5000];
And I perform a lot of operations like:
R[5] = (R[10] + R[20]) / 50;
R[5]++;
I have one Foreground task that mostly reads the values....but on occasion can update one. And then I have a background worker that is updating the values constantly.
I need to make this structure thread safe.
I would rather only update the value if the value has actually changed. The worker is constantly collecting data and doing calculation and storing the data whether it changes or not.
So should I create a custom class MyInt that wraps the structure, include an array of mutexes to lock when updating/reading each value, and then overload the [], =, ++, +=, -=, etc.? Or should I try to implement an atomic integer array?
Any suggestions as to what that would look like? I'd like to try and keep the above notation for doing the updates...but I get that it might not be possible.
Thanks,
WB
The first thing to do is make the program work reliably, and the easiest way to do that is to have a single mutex that controls access to the entire array. That is, whenever either thread needs to read or write anything in the array, it should do:
the_mutex.lock();
// do all the array-reads, calculations, and array-writes it needs to do
the_mutex.unlock();
... then test your program and see if it still runs fast enough for your needs. If so, you're done; that's all you need to do.
If you find that the program isn't fast enough due to contention on the mutex, you can start trying optimizations to make things faster. For example, if you know that your threads' operations will only need to work on local segments of the array at one time, you could create multiple mutexes, and assign different subsets of the array to each mutex (e.g. mutex #1 is used to serialize access to the first 100 array items, mutex #2 for the second 100 array items, etc). That will greatly decrease the chances of one thread having to wait for the other thread to release a mutex before it can continue.
If things still aren't fast enough for you, you could then look in to having two different arrays, one for each thread, and occasionally copying from one array to the other. That way each thread could safely access its own private array without any serialization needed. The copying operation would need to be handled carefully, probably using some sort of inter-thread message-passing protocol.

Safe to use int in multithreaded single writer multi-reader code

I'm writing parallel code that has a single writer and multiple readers. The writer will fill in an array from beginning to end, and the readers will access elements of the array in order. Pseudocode is something like the following:
std::vector<Stuff> vec(knownSize);
int producerIndex = 0;
std::atomic<int> consumerIndex = 0;
Producer thread:
for (a while) {
    vec[producerIndex] = someStuff();
    ++producerIndex;
}
Consumer thread:
while (!finished) {
    int myIndex = consumerIndex++;
    while (myIndex >= producerIndex) { spin(); }
    use(vec[myIndex]);
}
Do I need any sort of synchronization around the producerIndex? It seems like the worst thing that could happen is that I would read an old value while it's being updated so I might spin an extra time. Am I missing anything? Can I be sure that each assignment to myIndex will be unique?
As the comments have pointed out, this code has a data race. Instead of speculating about whether the code has a chance of doing what you want, just fix it: change the type of producerIndex and consumerIndex from int to std::atomic<int> and let the compiler implementor and standard library implementor worry about how to make that work right on your target platform.
It's likely that parts of the array will live in each core's cache. Whenever your producer writes a new value, the cache coherence protocol marks that cache line dirty and invalidates the other cores' copies, so the next reader pulls the updated line in again. That means you will get a lot of coherence misses. Be aware, though, that at the C++ language level an unsynchronized int is still formally a data race: the hardware keeps the caches coherent, but it does not stop the compiler from reordering accesses or keeping the plain int in a register.

How to get multithreading working properly in a C++ class using pthreads (not Boost)

I have a project that I have summarized here with some pseudo code to illustrate the problem.
I do not have a compiler issue and my code compiles well whether it be using boost or pthreads. Remember this is pseudo code designed to illustrate the problem and not directly compilable.
The problem I am having is that the multithreaded version of the function always uses more memory and more processing time than the same work done serially, e.g. with a for/while loop.
Here is a simplified version of the problem I am facing:
class aproject {
public:
    typedef struct {
        char **somedata;
        double output, fitness;
    } entity;

    entity **entity_array;
    int whichthread, numthreads;
    pthread_mutex_t mutexdata;

    aproject() {
        numthreads = 100;
        entity_array = new entity*[numthreads];
        for (int i = 0; i < numthreads; i++) {
            entity_array[i]->somedata[i] = new char[100];
        }
        /* .....more memory allocations for entity_array....... */
        this->initdata();
        this->eval_thread();
    }

    void initdata() {
        /* put zeros and ones in entity_array */
    }

    float somefunc(char *somedata) {
        float output = countzero(); // some other function not listed
        return output;
    }

    void *thread_function() {
        pthread_mutex_lock(&mutexdata);
        int currentthread = this->whichthread;
        this->whichthread += 1;
        pthread_mutex_unlock(&mutexdata);

        entity *ent = this->entity_array[currentthread];
        double A = somefunc(ent->somedata[0]);
        double B = somefunc(ent->somedata[1]);
        double t4 = anotherfunc(A, B);
        ent->output = t4;
        ent->fitness = sqrt(pow(t4, 2));
        return NULL;
    }

    static void *staticthreadproc(void *p) {
        return reinterpret_cast<aproject*>(p)->thread_function();
    }

    void eval_thread() {
        // use multithreading to evaluate individuals in parallel
        int nthreads = this->numthreads;
        pthread_t threads[nthreads];

        pthread_mutex_init(&this->mutexdata, NULL);
        this->whichthread = 0;

        // create threads
        for (int i = 0; i < nthreads; i++) {
            pthread_create(&threads[i], NULL, &aproject::staticthreadproc, this);
        }
        // join threads
        for (int i = 0; i < nthreads; i++) {
            pthread_join(threads[i], NULL);
        }
    }
};
I am using pthreads here because it works better than boost on machines with less memory.
Each thread is started in eval_thread and terminated there as well. I am using a mutex to ensure every thread starts with the correct index into entity_array, as each thread only applies its work to its respective entity_array element, indexed by the variable this->whichthread. This variable is the only thing that needs to be protected by the mutex, as it is updated for every thread and must not be changed by other threads. You can happily ignore everything apart from thread_function, eval_thread, and staticthreadproc, as they are the only relevant functions; assume all the other functions apart from init are both processor- and memory-intensive.
So my question is: why is using multithreading in this way more costly in memory and speed than the traditional method of not using threads at all?
I MUST REITERATE THE CODE IS PSEUDO CODE AND THE PROBLEM ISN'T WHETHER IT WILL COMPILE
Thanks, I would appreciate any suggestions you might have for pthreads and/or boost solutions.
Each thread requires its own call stack, which consumes memory. Every local variable of your function (and of all other functions on the call stack) counts toward that memory.
When a new thread is created, space for its call stack is reserved. I don't know what the default value is for pthreads, but you might want to look into that. If you know you require less stack space than is reserved by default, you may be able to reduce memory consumption significantly by explicitly specifying the desired stack size when spawning the thread.
As for the performance-part - it could depend on several issues. Generally, you should not expect a performance boost from parallelizing number-crunching operations onto more threads than you have cores (don't know if that is the case here). This might end up being slower, due to the additional overhead of context-switches, increased amount of cache-misses, etc. There are ways to profile this, depending on your platform (for instance, the Visual Studio profiler can count cache-misses, and there are tools for Linux as well).
Creating a thread is quite an expensive operation. If each thread only does a very small amount of work, then your program may be dominated by the time taken to create them all. Also, a large number of active threads can increase the work needed to schedule them, degrading system performance. And, as another answer mentions, each thread requires its own stack memory, so memory usage will grow with the number of threads.
Another issue can be cache invalidation; when one thread writes its results into its entity structure, it may invalidate nearby cached memory and force other threads to fetch data from higher-level caches or main memory.
You may have better results if you use a smaller number of threads, each processing a larger subset of the data. For a CPU-bound task like this, one thread per CPU is probably best - that means that you can keep all CPUs busy, and there's no need to waste time scheduling multiple threads on each. Make sure that the entities each thread works on are located together in the array, so it does not invalidate those cached by other threads.

how do i properly design a worker thread? (avoid for example Sleep(1))

i am still a beginner at multi-threading, so bear with me please:
i am currently writing an application that does some FVM calculation on a grid. it's a time-explicit model, so at every timestep i need to calculate new values for the whole grid. my idea was to distribute this calculation to 4 worker-threads, which then deal with the cells of the grid (first thread calculating 0, 4, 8... second thread 1, 5, 9... and so forth).
i create those 4 threads at program start.
they look something like this:
void __fastcall TCalculationThread::Execute()
{
    bool alive = true;
    THREAD_SIGNAL ts;
    while (alive)
    {
        Sleep(1);
        if (TryEnterCriticalSection(&TMS))
        {
            ts = thread_signal;
            LeaveCriticalSection(&TMS);
            alive = !ts.kill;
            if (ts.go && !ts.done.at(this->index))
            {
                double delta_t = ts.dt;
                for (unsigned int i = this->index; i < cells.size(); i += this->steps)
                {
                    calculate_one_cell();
                }
                EnterCriticalSection(&TMS);
                thread_signal.done.at(this->index) = true;
                LeaveCriticalSection(&TMS);
            }
        }
    }
}
they use a global struct to communicate with the main thread (the main thread sets ts.go to true when the workers need to start).
now i am sure this is not the way to do it! not only does it feel wrong, it also doesn't perform very well...
i read for example here that a semaphore or an event would work better. the answer to this guy's question talks about a lockless queue.
i am not very familiar with these concepts would like some pointers how to continue.
could you line out any of the ways to do this better?
thank you for your time. (and sorry for the formatting)
i am using borland c++ builder and its thread-object (TThread).
A definitely more effective split would be to calculate cells 0,1,2,3 on one thread, 4,5,6,7 on another, etc. Interleaving memory accesses the way you do is very bad: even if the variables are completely independent, you'll get false-sharing problems, which is the equivalent of the CPU locking every write.
Calling Sleep(1) in a calculation thread can't be a good solution to any problem. You want your threads to be doing useful work rather than blocking for no good reason.
I think your basic problem can be expressed as a serial algorithm of this basic form:
for (int i=0; i<N; i++)
cells[i]->Calculate();
You are in the happy position that calls to Calculate() are independent of each other—what you have here is a parallel for. This means that you can implement this without a mutex.
There are a variety of ways to achieve this. OpenMP would be one; a threadpool class another. If you are going to roll your own thread based solution then use InterlockedIncrement() on a shared variable to iterate through the array.
You may hit some false sharing problems, as #DeadMG suggests, but quite possibly not. If you do have false sharing then yet another approach is to stride across larger sub-arrays. Essentially the increment (i.e. stride) passed to InterlockedIncrement() would be greater than one.
The bottom line is that the way to make the code faster is to remove both the critical section (and hence the contention on it) and the Sleep(1).