I was wondering Could Strict-Alternative Method work on multi-core CPU?
for this purpose, I've written some code
int counter = 0;
int turn;
void producer()
{
while(turn == 1);
for (int i = 0; i < 10000000; ++i)
{
counter++;
}
turn = 1;
}
void consumer()
{
while(turn == 0);
// Slip
turn = 0;
for (int i = 0; i < 10000000; ++i) {
counter--;
}
// it should be here (turn = 0;)
}
int main()
{
int i = 0;
while (counter == 0)
{
i++;
thread t1, t2;
t2 = thread(consumer);
t1 = thread(producer);
t1.join();
t2.join();
cout << i << " " << counter << endl;
}
cout << i << endl;
return 0;
}
first of all, I've tried to run this code without this slip thinking that if that code runs on multi-core CPU it should be a moment when turn variable is changed before it should be due to memory reordering happened by CPU or compiler making the 2 threads to be inside the CS and hence causing data inconsistency, but after running thousands of iterations this does not happen!
ok, I thought that I must be running on a single core! so I have made this little slip (you can say I have made memory-reordering deliberately)
if they are running on a single core it should come a time when turn variable is assigned to 0 in consumer and the thread gets preempted allowing another thread to enter CS before it finishes and hence causes a data inconsistency, but after about 67000 iterations this does not happen either!
Related
After several searches, I cannot find a lock-free vector implementation.
There is a document that speaks about it but nothing concrete (in any case I have not found it). http://pirkelbauer.com/papers/opodis06.pdf
There are currently 2 threads dealing with arrays, there may be more in a while.
One thread that updates different vectors and another thread that accesses the vector to do calculations, etc. Each thread accesses the different array a large number of times per second.
I implemented a lock with mutex on the different vectors but when the reading or writing thread takes too long to unlock, all further updates are delayed.
I then thought of copying the array all the time to go faster, but copying thousands of times per second an array of thousands of elements doesn't seem great to me.
So I thought to use 1 mutex per value in each table to lock only the value I am working on.
A lock-free could be better but I can not find a solution and I wonder if the performances would be really better.
EDIT:
I have a thread that receives data and ranges in vectors.
When I instantiate the structure, I use a fixed size.
I have to do 2 different things for the updates:
-Update vector elements. (1d vector which simulates a 2d vector)
-Add a line at the end of the vector and remove the first line. The array always remains sorted. Adding elements is much much rarer than updating
The thread that is read-only walks the array and performs calculations.
To limit the time spent on the array and do as little calculation as possible, I use arrays that store the result of my calculations. Despite this, I often have to scan the table enough to do new calculations or just update them. (the application is in real-time so the calculations to be made vary according to the requests)
When a new element is added to the vector, the reading thread must directly use it to update the calculations.
When I say calculation, it is not necessarily only arithmetic, it is more a treatment to be done.
There is no perfect implementation to run concurrency, each task has it's own good enogh. My goto method to find a decent implementation is to only alow what is needed and then check if i would need somthing more in the future.
You described a quite simple scenario, one thread one accion to a shared vector, then the vector needs to tell if the acction is alowed soo std::atomic_flag is good enogh.
This example shuld give you an idea on how it works and what to expent. Mainly i just attached a flag to eatch array and checkt it before to see if is safe to do somthing and some people like to add a guard to the flag, just in case.
#include <iostream>
#include <thread>
#include <atomic>
#include <chrono>
const int vector_size = 1024;
struct Element {
void some_yield(){
std::this_thread::yield();
};
void some_wait(){
std::this_thread::sleep_for(
std::chrono::microseconds(1)
);
};
};
Element ** data;
std::atomic_flag * vector_safe;
bool alive = true;
uint32_t c_down_time = 0;
uint32_t p_down_time = 0;
uint32_t c_intinerations = 0;
uint32_t p_intinerations = 0;
std::chrono::high_resolution_clock::time_point c_time_point;
std::chrono::high_resolution_clock::time_point p_time_point;
int simple_consumer_work(){
Element a_read;
uint16_t i, e;
while (alive){
// Loops thru the vectors
for (i=0; i < vector_size; i++){
// locks the thread untin the vector
// at index i is free to read
while (!vector_safe[i].test_and_set()){}
// Do the watherver
for (e=0; e < vector_size; e++){
a_read = data[i][e];
}
// And signal that this vector is done
vector_safe[i].clear();
}
}
return 0;
};
int simple_producer_work(){
uint16_t i;
while (alive){
for (i=0; i < vector_size; i++){
while (!vector_safe[i].test_and_set()){}
data[i][i].some_wait();
vector_safe[i].clear();
}
p_intinerations++;
}
return 0;
};
int consumer_work(){
Element a_read;
uint16_t i, e;
bool waiting;
while (alive){
for (i=0; i < vector_size; i++){
waiting = false;
c_time_point = std::chrono::high_resolution_clock::now();
while (!vector_safe[i].test_and_set(std::memory_order_acquire)){
waiting = true;
}
if (waiting){
c_down_time += (uint32_t)std::chrono::duration_cast<std::chrono::nanoseconds>
(std::chrono::high_resolution_clock::now() - c_time_point).count();
}
for (e=0; e < vector_size; e++){
a_read = data[i][e];
}
vector_safe[i].clear(std::memory_order_release);
}
c_intinerations++;
}
return 0;
};
int producer_work(){
bool waiting;
uint16_t i;
while (alive){
for (i=0; i < vector_size; i++){
waiting = false;
p_time_point = std::chrono::high_resolution_clock::now();
while (!vector_safe[i].test_and_set(std::memory_order_acquire)){
waiting = true;
}
if (waiting){
p_down_time += (uint32_t)std::chrono::duration_cast<std::chrono::nanoseconds>
(std::chrono::high_resolution_clock::now() - p_time_point).count();
}
data[i][i].some_wait();
vector_safe[i].clear(std::memory_order_release);
}
p_intinerations++;
}
return 0;
};
void print_time(uint32_t down_time){
if ( down_time <= 1000) {
std::cout << down_time << " [nanosecods] \n";
} else if (down_time <= 1000000) {
std::cout << down_time / 1000 << " [microseconds] \n";
} else if (down_time <= 1000000000) {
std::cout << down_time / 1000000 << " [miliseconds] \n";
} else {
std::cout << down_time / 1000000000 << " [seconds] \n";
}
};
int main(){
std::uint16_t i;
std::thread consumer;
std::thread producer;
vector_safe = new std::atomic_flag [vector_size] {ATOMIC_FLAG_INIT};
data = new Element * [vector_size];
for(i=0; i < vector_size; i++){
data[i] = new Element;
}
consumer = std::thread(consumer_work);
producer = std::thread(producer_work);
std::this_thread::sleep_for(
std::chrono::seconds(10)
);
alive = false;
producer.join();
consumer.join();
std::cout << " Consumer loops > " << c_intinerations << std::endl;
std::cout << " Consumer time lost > "; print_time(c_down_time);
std::cout << " Producer loops > " << p_intinerations << std::endl;
std::cout << " Producer time lost > "; print_time(p_down_time);
for(i=0; i < vector_size; i++){
delete data[i];
}
delete [] vector_safe;
delete [] data;
return 0;
}
And dont forget that the compiler can and will change portions of the code, spagueti code is realy realy buggy in multithreading.
I am currently working on a multithreaded code in c++ in which each thread is set to execute a piece of code unless a global flag becomes TRUE. The threads will not share any data, with the exception of the global flag. I am wondering if the best practice would be to add a mutex for the flag.
To keep this question simple, I will resort to the following trivial example: assume that you are trying to find whether a large vector contains a given integer, and you want to split the search between several threads. There is a global variable whose initial value is FALSE. If one of the threads finds the integer, it should set the flag to TRUE and all the threads should stop afterwards (it is ok if some threads execute a few extra operations until they realize the flag has changed).
The following code does that (I apologize for the race condition over cout, but I want to keep the code short). It creates a large vector filled with 500, replaces one of the 500 with a 100, and implements the search procedure.
The code compiles and seems to work, but I am wondering if something can go wrong with the read/write of the flag at some point. I am wondering if I should add a mutex for the flag variable. I am hesitant because (1) for the most part the threads will only read the value of the flag, (2) the flag will only change once (it will never change back to false), (3) the value of the flag does not change the data within the threads execution (it does nothing besides besides stoping them) (4) it is Ok if the threads continue for a few iterations.
Should I perhaps only add a mutex lock for the writing part in the functor (flag=true)?
vector <int> a;
bool flag = false;
int size = 100000000
class Fctor {
int s;
int t;
public:
Fctor(int s, int t) : s(s), t(t) {}
\\ finds if there is a 100 in the vector between positions s and t
void operator()() {
int i = 0;
for (i = s; i < t; i++) {
if (flag == true) break;
if (a[i] == 100) {
flag = true;
break;
}
}
cout << s << " " << t << " " << flag << " " << i << "\n";
}
};
int main() {
a = vector<int> (8 * size, 500); \\creates a vector filled with 500
a[2*size+1] = 100; // This position will have a 100
chrono::high_resolution_clock::time_point begin_time =
chrono::high_resolution_clock::now();
int cores = 4;
vector<std::thread> threads;
int num = 8/cores;
int s = 0;
int t = num * size;
Fctor fctor(s, t);
std::thread th(fctor);
threads.push_back(std::move(th));
for (int i = 1; i < cores; i++) {
int s1 = i * num * size+1;
int t1 = (i+1) * num * size;
Fctor fctor1(s1, t1);
std::thread th1(fctor1);
threads.push_back(std::move(th1));
}
for (std::thread& th : threads) {
th.join();
}
chrono::high_resolution_clock::time_point end_time =
chrono::high_resolution_clock::now();
chrono::duration<double> time_span =
chrono::duration_cast<chrono::duration<double> >(end_time -
begin_time);
cout << "Found in: "<< time_span.count() << " seconds. \n";
return 0;
}
Setting the flag as atomic does the job
I came across a unexpected results in test for pthread read-write lock.
the following is my code.
#include <iostream>
#include <thread>
#include <pthread.h>
//locks declaration
pthread_rwlock_t rwlock;
//shared resource
int numbers[20];
int size = 0;
void readFrom()
{
int rc;
rc = pthread_rwlock_rdlock(&rwlock);
for(int index = 0; index < size; index++) {
std::cout << numbers[index] << " ";
}
std::cout << std::endl;
rc = pthread_rwlock_unlock(&rwlock);
}
void writeTo(int index, int val)
{
int rc;
rc = pthread_rwlock_wrlock(&rwlock);
numbers[index] = val;
size++;
rc = pthread_rwlock_unlock(&rwlock);
}
int main(int argc, char **argv)
{
int rc=0;
std::cout << std::endl;
std::thread threads[25];
rc = pthread_rwlock_init(&rwlock, NULL);
for(int i=0; i<20; ++i) {
threads[i] = std::thread(writeTo, i, i);
if(i % 5 == 0) {
threads[20 + (i / 5)] = std::thread(readFrom);
}
}
for(int i=0; i<24; ++i) {
threads[i].join();
}
std::cout << "size is " << size << std::endl;
threads[24] = std::thread(readFrom);
threads[24].join();
std::cout << std::endl;
rc = pthread_rwlock_destroy(&rwlock);
return 0;
}
after several runs, I occasionally find there are something unexpected. Here is an example:
0 1 2 3 0
It is a output from a reader thread. Basically, it says the size of numbers is 5 for now. in that case, I expect the result should be 0 1 2 3 4.
By the way, I have tried to implement addition mutually exclusive lock, which stoke unexpected behaviour.
I am interesting in the solution as well as the root cause. Could anyone help me out?
any help is appreciated in advance.
A reader/writer lock just prevents two writers from running at the same time or a writer from running at the same time as a reader. Neither of those things are required to get the output "0 1 2 3 0". So there is no reason you should consider that unexpected.
In fact, if you have four cores, "0 1 2 3 0" is certainly an output that I'd expect at least some of the time. The threads run in the order they were started until all four cores are in use, then new threads have to wait until existing threads finish their timeslices. That seems perfectly reasonable to me.
If you can detail the thought process that made you think "0 1 2 3 0" is unexpected, we can point out the specific flaw in it.
By the way, for an application like this, you should just use a regular lock. Using a reader/writer lock only makes sense when reader operations significantly outnumber write operations or readers need to hold the lock for relatively long times.
I wanted to learn to use C++ 11 std::threads with VS2012 and I wrote a very simple C++ console program with two threads which just increment a counter. I also want to test the performance difference when two threads are used. Test program is given below:
#include <iostream>
#include <thread>
#include <conio.h>
#include <atomic>
std::atomic<long long> sum(0);
//long long sum;
using namespace std;
const int RANGE = 100000000;
void test_without_threds()
{
sum = 0;
for(unsigned int j = 0; j < 2; j++)
for(unsigned int k = 0; k < RANGE; k++)
sum ++ ;
}
void call_from_thread(int tid)
{
for(unsigned int k = 0; k < RANGE; k++)
sum ++ ;
}
void test_with_2_threds()
{
std::thread t[2];
sum = 0;
//Launch a group of threads
for (int i = 0; i < 2; ++i) {
t[i] = std::thread(call_from_thread, i);
}
//Join the threads with the main thread
for (int i = 0; i < 2; ++i) {
t[i].join();
}
}
int _tmain(int argc, _TCHAR* argv[])
{
chrono::time_point<chrono::system_clock> start, end;
cout << "-----------------------------------------\n";
cout << "test without threds()\n";
start = chrono::system_clock::now();
test_without_threds();
end = chrono::system_clock::now();
chrono::duration<double> elapsed_seconds = end-start;
cout << "finished calculation for "
<< chrono::duration_cast<std::chrono::milliseconds>(end - start).count()
<< "ms.\n";
cout << "sum:\t" << sum << "\n";\
cout << "-----------------------------------------\n";
cout << "test with 2_threds\n";
start = chrono::system_clock::now();
test_with_2_threds();
end = chrono::system_clock::now();
cout << "finished calculation for "
<< chrono::duration_cast<std::chrono::milliseconds>(end - start).count()
<< "ms.\n";
cout << "sum:\t" << sum << "\n";\
_getch();
return 0;
}
Now, when I use for the counter just the long long variable (which is commented) I get value which is different from the correct - 100000000 instead of 200000000. I am not sure why is that and I suppose that the two threads are changing the counter at the same time, but I am not sure how it happens really because ++ is just a very simple instruction. It seems that the threads are caching the sum variable at beginning. Performance is 110 ms with two threads vs 200 ms for one thread.
So the correct way according to documentation is to use std::atomic. However now the performance is much worse for both cases as about 3300 ms without threads and 15820 ms with threads. What is the correct way to use std::atomic in this case?
I am not sure why is that and I suppose that the two threads are changing the counter at the same time, but I am not sure how it happens really because ++ is just a very simple instruction.
Each thread is pulling the value of sum into a register, incrementing the register, and finally writing it back to memory at the end of the loop.
So the correct way according to documentation is to use std::atomic. However now the performance is much worse for both cases as about 3300 ms without threads and 15820 ms with threads. What is the correct way to use std::atomic in this case?
You're paying for the synchronization std::atomic provides. It won't be nearly as fast as using an un-synchronized integer, though you can get a small improvement to performance by refining the memory order of the add:
sum.fetch_add(1, std::memory_order_relaxed);
In this particular case, you're compiling for x86 and operating on a 64-bit integer. This means that the compiler has to generate code to update the value in two 32-bit operations; if you change the target platform to x64, the compiler will generate code to do the increment in a single 64-bit operation.
As a general rule, the solution to problems like this is to reduce the number of writes to shared data.
Your code has a couple of problems. First of all, all the "inputs" involved are compile-time constants, so a good compiler can pre-compute the value for the single-threaded code, so (regardless of the value you give for range) it shows as running in 0 ms.
Second, you're sharing a single variable (sum) between all the threads, forcing all of their accesses to be synchronized at that point. Without synchronization, that gives undefined behavior. As you've already found, synchronizing the access to that variable is quite expensive, so you usually want to avoid it if at all reasonable.
One way to do that is to use a separate subtotal for each thread, so they can all do their additions in parallel, without synchronizing, the adding together the individual results at the end.
Another point is to ensure against false sharing. False sharing arises when two (or more) threads are writing to data that really is separate, but has been allocated in the same cache line. In this case, access to the memory can be serialized even though (as already noted) you don't have any data actually shared between the threads.
Based on those factors, I've rewritten your code slightly to create a separate sum variable for each thread. Those variables are of a class type that gives (fairly) direct access to the data, but does stop the optimizer from seeing that it can do the whole computation at compile-time, so we end up comparing one thread to 4 (which reminds me: I did increase the number of threads from 2 to 4, since I'm using a quad-core machine). I moved that number into a const variable though, so it should be easy to test with different numbers of threads.
#include <iostream>
#include <thread>
#include <conio.h>
#include <atomic>
#include <numeric>
const int num_threads = 4;
struct val {
long long sum;
int pad[2];
val &operator=(long long i) { sum = i; return *this; }
operator long long &() { return sum; }
operator long long() const { return sum; }
};
val sum[num_threads];
using namespace std;
const int RANGE = 100000000;
void test_without_threds()
{
sum[0] = 0LL;
for(unsigned int j = 0; j < num_threads; j++)
for(unsigned int k = 0; k < RANGE; k++)
sum[0] ++ ;
}
void call_from_thread(int tid)
{
for(unsigned int k = 0; k < RANGE; k++)
sum[tid] ++ ;
}
void test_with_threads()
{
std::thread t[num_threads];
std::fill_n(sum, num_threads, 0);
//Launch a group of threads
for (int i = 0; i < num_threads; ++i) {
t[i] = std::thread(call_from_thread, i);
}
//Join the threads with the main thread
for (int i = 0; i < num_threads; ++i) {
t[i].join();
}
long long total = std::accumulate(std::begin(sum), std::end(sum), 0LL);
}
int main()
{
chrono::time_point<chrono::system_clock> start, end;
cout << "-----------------------------------------\n";
cout << "test without threds()\n";
start = chrono::system_clock::now();
test_without_threds();
end = chrono::system_clock::now();
chrono::duration<double> elapsed_seconds = end-start;
cout << "finished calculation for "
<< chrono::duration_cast<std::chrono::milliseconds>(end - start).count()
<< "ms.\n";
cout << "sum:\t" << sum << "\n";\
cout << "-----------------------------------------\n";
cout << "test with threads\n";
start = chrono::system_clock::now();
test_with_threads();
end = chrono::system_clock::now();
cout << "finished calculation for "
<< chrono::duration_cast<std::chrono::milliseconds>(end - start).count()
<< "ms.\n";
cout << "sum:\t" << sum << "\n";\
_getch();
return 0;
}
When I run this, my results are closer to what I'd guess you hoped for:
-----------------------------------------
test without threds()
finished calculation for 78ms.
sum: 000000013FCBC370
-----------------------------------------
test with threads
finished calculation for 15ms.
sum: 000000013FCBC370
... the sums are identical, but N threads increases speed by a factor of approximately N (up to the number of cores available).
Try to use prefix increment, which will give performance improvement.
Test on my machine, std::memory_order_relaxed does not give any advantage.
This code works only when any of the lines under /* debug messages */ are uncommented. Or if the list being mapped to is less than 30 elements.
func_map is a linear implementation of a Lisp-style mapping and can be assumed to work.
Use of it would be as follows func_map(FUNC_PTR foo, std::vector* list, locs* start_and_end)
FUNC_PTR is a pointer to a function that returns void and takes in an int pointer
For example: &foo in which foo is defined as:
void foo (int* num){ (*num) = (*num) * (*num);}
locs is a struct with two members int_start and int_end; I use it to tell func_map which elements it should iterate over.
void par_map(FUNC_PTR func_transform, std::vector<int>* vector_array) //function for mapping a function to a list alla lisp
{
int array_size = (*vector_array).size(); //retain the number of elements in our vector
int num_threads = std::thread::hardware_concurrency(); //figure out number of cores
int array_sub = array_size/num_threads; //number that we use to figure out how many elements should be assigned per thread
std::vector<std::thread> threads; //the vector that we will initialize threads in
std::vector<locs> vector_locs; // the vector that we will store the start and end position for each thread
for(int i = 0; i < num_threads && i < array_size; i++)
{
locs cur_loc; //the locs struct that we will create using the power of LOGIC
if(array_sub == 0) //the LOGIC
{
cur_loc.int_start = i; //if the number of elements in the array is less than the number of cores just assign one core to each element
}
else
{
cur_loc.int_start = (i * array_sub); //otherwise figure out the starting point given the number of cores
}
if(i == (num_threads - 1))
{
cur_loc.int_end = array_size; //make sure all elements will be iterated over
}
else if(array_sub == 0)
{
cur_loc.int_end = (i + 1); //ditto
}
else
{
cur_loc.int_end = ((i+1) * array_sub); //otherwise use the number of threads to determine our ending point
}
vector_locs.push_back(cur_loc); //store the created locs struct so it doesnt get changed during reference
threads.push_back(std::thread(func_map,
func_transform,
vector_array,
(&vector_locs[i]))); //create a thread
/*debug messages*/ // <--- whenever any of these are uncommented the code works
//cout << "i = " << i << endl;
//cout << "int_start == " << cur_loc.int_start << endl;
//cout << "int_end == " << cur_loc.int_end << endl << endl;
//cout << "Thread " << i << " initialized" << endl;
}
for(int i = 0; i < num_threads && i < array_size; i++)
{
(threads[i]).join(); //make sure all the threads are done
}
}
I think that the issue might be in how vector_locs[i] is used and how threads are resolved. But the use of a vector to maintain the state of the locs instance referenced by thread should prevent that from being an issue; I'm really stumped.
You're giving the thread function a pointer, &vector_locs[i], that may become invalidated as you push_back more items into the vector.
Since you know beforehand how many items vector_locs will contain - min(num_threads, array_size) - you can reserve that space in advance to prevent reallocation.
As to why it doesn't crash if you uncomment the output, I would guess that the output is so slow that the thread you just started will finish before the output is done, so the next iteration can't affect it.
I think you should make this loop inner to the main one:
...
for(int i = 0; i < num_threads && i < array_size; i++)
{
(threads[i]).join(); //make sure all the threads are done
}
}