Implementation of a lock free vector - c++

After several searches, I cannot find a lock-free vector implementation.
There is a document that speaks about it but nothing concrete (in any case I have not found it). http://pirkelbauer.com/papers/opodis06.pdf
There are currently 2 threads dealing with arrays, there may be more in a while.
One thread that updates different vectors and another thread that accesses the vector to do calculations, etc. Each thread accesses the different array a large number of times per second.
I implemented a lock with mutex on the different vectors but when the reading or writing thread takes too long to unlock, all further updates are delayed.
I then thought of copying the array all the time to go faster, but copying thousands of times per second an array of thousands of elements doesn't seem great to me.
So I thought to use 1 mutex per value in each table to lock only the value I am working on.
A lock-free approach might be better, but I cannot find a solution, and I wonder whether the performance would really be better.
EDIT:
I have a thread that receives data and stores it in vectors.
When I instantiate the structure, I use a fixed size.
I have to do 2 different things for the updates:
-Update vector elements. (1d vector which simulates a 2d vector)
-Add a line at the end of the vector and remove the first line. The array always remains sorted. Adding elements is much much rarer than updating
The thread that is read-only walks the array and performs calculations.
To limit the time spent on the array and do as little calculation as possible, I use arrays that store the result of my calculations. Despite this, I often have to scan the table enough to do new calculations or just update them. (the application is in real-time so the calculations to be made vary according to the requests)
When a new element is added to the vector, the reading thread must directly use it to update the calculations.
When I say calculation, it is not necessarily only arithmetic, it is more a treatment to be done.

There is no perfect implementation for concurrency; each task has its own good-enough solution. My go-to method for finding a decent implementation is to allow only what is needed and then check whether I would need something more in the future.
You described a quite simple scenario: one thread, one action on a shared vector. The vector then only needs to tell whether the action is allowed, so std::atomic_flag is good enough.
This example should give you an idea of how it works and what to expect. Mainly, I just attached a flag to each array and checked it first to see whether it is safe to do something. Some people like to add a guard to the flag, just in case.
#include <iostream>
#include <thread>
#include <atomic>
#include <chrono>
const int vector_size = 1024;
struct Element {
void some_yield(){
std::this_thread::yield();
};
void some_wait(){
std::this_thread::sleep_for(
std::chrono::microseconds(1)
);
};
};
Element ** data;
std::atomic_flag * vector_safe;
std::atomic<bool> alive{true}; // written by main and read by both workers, so it must be atomic
uint32_t c_down_time = 0;
uint32_t p_down_time = 0;
uint32_t c_intinerations = 0;
uint32_t p_intinerations = 0;
std::chrono::high_resolution_clock::time_point c_time_point;
std::chrono::high_resolution_clock::time_point p_time_point;
int simple_consumer_work(){
Element a_read;
uint16_t i, e;
while (alive){
// Loop through the vectors
for (i=0; i < vector_size; i++){
// Spin until the vector at index i is free to read.
// test_and_set returns the previous value, so true means it is still taken.
while (vector_safe[i].test_and_set()){}
// Do whatever work is needed
for (e=0; e < vector_size; e++){
a_read = data[i][e];
}
// And signal that this vector is done
vector_safe[i].clear();
}
}
return 0;
};
int simple_producer_work(){
uint16_t i;
while (alive){
for (i=0; i < vector_size; i++){
while (vector_safe[i].test_and_set()){} // spin while the flag was already set
data[i][i].some_wait();
vector_safe[i].clear();
}
p_intinerations++;
}
return 0;
};
int consumer_work(){
Element a_read;
uint16_t i, e;
bool waiting;
while (alive){
for (i=0; i < vector_size; i++){
waiting = false;
c_time_point = std::chrono::high_resolution_clock::now();
while (vector_safe[i].test_and_set(std::memory_order_acquire)){ // true: still taken
waiting = true;
}
if (waiting){
c_down_time += (uint32_t)std::chrono::duration_cast<std::chrono::nanoseconds>
(std::chrono::high_resolution_clock::now() - c_time_point).count();
}
for (e=0; e < vector_size; e++){
a_read = data[i][e];
}
vector_safe[i].clear(std::memory_order_release);
}
c_intinerations++;
}
return 0;
};
int producer_work(){
bool waiting;
uint16_t i;
while (alive){
for (i=0; i < vector_size; i++){
waiting = false;
p_time_point = std::chrono::high_resolution_clock::now();
while (vector_safe[i].test_and_set(std::memory_order_acquire)){ // true: still taken
waiting = true;
}
if (waiting){
p_down_time += (uint32_t)std::chrono::duration_cast<std::chrono::nanoseconds>
(std::chrono::high_resolution_clock::now() - p_time_point).count();
}
data[i][i].some_wait();
vector_safe[i].clear(std::memory_order_release);
}
p_intinerations++;
}
return 0;
};
void print_time(uint32_t down_time){
if ( down_time <= 1000) {
std::cout << down_time << " [nanoseconds] \n";
} else if (down_time <= 1000000) {
std::cout << down_time / 1000 << " [microseconds] \n";
} else if (down_time <= 1000000000) {
std::cout << down_time / 1000000 << " [milliseconds] \n";
} else {
std::cout << down_time / 1000000000 << " [seconds] \n";
}
};
int main(){
std::uint16_t i;
std::thread consumer;
std::thread producer;
vector_safe = new std::atomic_flag [vector_size];
data = new Element * [vector_size];
for(i=0; i < vector_size; i++){
vector_safe[i].clear(); // start every flag in the clear (unlocked) state
data[i] = new Element [vector_size]; // each row holds vector_size elements
}
consumer = std::thread(consumer_work);
producer = std::thread(producer_work);
std::this_thread::sleep_for(
std::chrono::seconds(10)
);
alive = false;
producer.join();
consumer.join();
std::cout << " Consumer loops > " << c_intinerations << std::endl;
std::cout << " Consumer time lost > "; print_time(c_down_time);
std::cout << " Producer loops > " << p_intinerations << std::endl;
std::cout << " Producer time lost > "; print_time(p_down_time);
for(i=0; i < vector_size; i++){
delete [] data[i]; // rows were allocated with new[]
}
delete [] vector_safe;
delete [] data;
return 0;
}
And don't forget that the compiler can and will reorder or change portions of the code; spaghetti code is really, really buggy in multithreading.
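The guard mentioned above could look roughly like the sketch below. FlagGuard and guarded_count are hypothetical names, not part of the answer's code. Note that spinning on a std::atomic_flag is a spinlock, so this is a lightweight lock rather than a truly lock-free structure.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Hypothetical RAII guard: acquires the flag in the constructor
// and always releases it in the destructor, even on early return.
class FlagGuard {
public:
    explicit FlagGuard(std::atomic_flag& flag) : flag_(flag) {
        // test_and_set returns the previous value: true means another
        // thread still holds the flag, so keep spinning.
        while (flag_.test_and_set(std::memory_order_acquire)) {
            std::this_thread::yield();
        }
    }
    ~FlagGuard() { flag_.clear(std::memory_order_release); }
    FlagGuard(const FlagGuard&) = delete;
    FlagGuard& operator=(const FlagGuard&) = delete;

private:
    std::atomic_flag& flag_;
};

// Demo: several threads increment a plain counter under the guard.
long guarded_count(int num_threads, int increments_per_thread) {
    assert(num_threads > 0 && increments_per_thread >= 0);
    std::atomic_flag flag = ATOMIC_FLAG_INIT;
    long counter = 0;
    std::vector<std::thread> workers;
    for (int t = 0; t < num_threads; ++t) {
        workers.emplace_back([&flag, &counter, increments_per_thread] {
            for (int k = 0; k < increments_per_thread; ++k) {
                FlagGuard guard(flag);  // held only for this iteration
                ++counter;
            }
        });
    }
    for (auto& w : workers) w.join();
    return counter;
}
```

With the guard, a forgotten clear() cannot leave a vector permanently locked, which is the main reason to prefer it over manual test_and_set/clear pairs.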

Related

Strict-Alternation Synchronization

I was wondering: could the strict-alternation method work on a multi-core CPU?
For this purpose, I've written some code:
int counter = 0;
int turn;
void producer()
{
while(turn == 1);
for (int i = 0; i < 10000000; ++i)
{
counter++;
}
turn = 1;
}
void consumer()
{
while(turn == 0);
// Slip
turn = 0;
for (int i = 0; i < 10000000; ++i) {
counter--;
}
// it should be here (turn = 0;)
}
int main()
{
int i = 0;
while (counter == 0)
{
i++;
thread t1, t2;
t2 = thread(consumer);
t1 = thread(producer);
t1.join();
t2.join();
cout << i << " " << counter << endl;
}
cout << i << endl;
return 0;
}
First of all, I tried running this code without the slip. I expected that, on a multi-core CPU, there would be a moment when the turn variable changes earlier than it should because of memory reordering by the CPU or the compiler, putting both threads inside the critical section (CS) at once and causing data inconsistency. But after thousands of iterations, this did not happen!
OK, I thought, I must be running on a single core! So I introduced this little slip (you could say I reordered memory deliberately).
If the threads are running on a single core, there should come a time when turn is assigned 0 in consumer and the thread gets preempted, allowing the other thread to enter the CS before the first one finishes, causing data inconsistency. But after about 67000 iterations, this did not happen either!
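For comparison, here is a minimal sketch of strict alternation with well-defined behavior. It is not the asker's code: it assumes turn is made a std::atomic<int> (with a plain int, the concurrent reads and writes are a data race, so the program has undefined behavior and the compiler may hoist the load out of the spin loop). The run_once helper and the parameter n are illustrative.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

std::atomic<int> turn{0};  // 0: producer may run, 1: consumer may run
int counter = 0;           // plain int, protected by the turn protocol

void producer(int n) {
    // Wait for our turn; acquire pairs with the release store below.
    while (turn.load(std::memory_order_acquire) != 0) {
        std::this_thread::yield();
    }
    for (int i = 0; i < n; ++i) ++counter;
    turn.store(1, std::memory_order_release);  // hand over only after the work
}

void consumer(int n) {
    while (turn.load(std::memory_order_acquire) != 1) {
        std::this_thread::yield();
    }
    for (int i = 0; i < n; ++i) --counter;
    turn.store(0, std::memory_order_release);
}

// Runs one producer/consumer round and returns the final counter value.
int run_once(int n) {
    assert(n >= 0);
    turn = 0;
    counter = 0;
    std::thread t2(consumer, n);
    std::thread t1(producer, n);
    t1.join();
    t2.join();
    return counter;
}
```

Because each thread hands over the turn only after finishing its loop, the two critical sections cannot overlap, regardless of the number of cores.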

Multithread program in C++: Using mutex on a flag variable

I am currently working on a multithreaded code in c++ in which each thread is set to execute a piece of code unless a global flag becomes TRUE. The threads will not share any data, with the exception of the global flag. I am wondering if the best practice would be to add a mutex for the flag.
To keep this question simple, I will resort to the following trivial example: assume that you are trying to find whether a large vector contains a given integer, and you want to split the search between several threads. There is a global variable whose initial value is FALSE. If one of the threads finds the integer, it should set the flag to TRUE and all the threads should stop afterwards (it is ok if some threads execute a few extra operations until they realize the flag has changed).
The following code does that (I apologize for the race condition over cout, but I want to keep the code short). It creates a large vector filled with 500, replaces one of the 500 with a 100, and implements the search procedure.
The code compiles and seems to work, but I am wondering if something can go wrong with the read/write of the flag at some point. I am wondering if I should add a mutex for the flag variable. I am hesitant because (1) for the most part the threads will only read the value of the flag, (2) the flag will only change once (it will never change back to false), (3) the value of the flag does not change the data within the threads' execution (it does nothing besides stopping them), and (4) it is OK if the threads continue for a few iterations.
Should I perhaps only add a mutex lock for the writing part in the functor (flag=true)?
vector <int> a;
bool flag = false;
int size = 100000000;
class Fctor {
int s;
int t;
public:
Fctor(int s, int t) : s(s), t(t) {}
// finds if there is a 100 in the vector between positions s and t
void operator()() {
int i = 0;
for (i = s; i < t; i++) {
if (flag == true) break;
if (a[i] == 100) {
flag = true;
break;
}
}
cout << s << " " << t << " " << flag << " " << i << "\n";
}
};
int main() {
a = vector<int> (8 * size, 500); // creates a vector filled with 500
a[2*size+1] = 100; // This position will have a 100
chrono::high_resolution_clock::time_point begin_time =
chrono::high_resolution_clock::now();
int cores = 4;
vector<std::thread> threads;
int num = 8/cores;
int s = 0;
int t = num * size;
Fctor fctor(s, t);
std::thread th(fctor);
threads.push_back(std::move(th));
for (int i = 1; i < cores; i++) {
int s1 = i * num * size+1;
int t1 = (i+1) * num * size;
Fctor fctor1(s1, t1);
std::thread th1(fctor1);
threads.push_back(std::move(th1));
}
for (std::thread& th : threads) {
th.join();
}
chrono::high_resolution_clock::time_point end_time =
chrono::high_resolution_clock::now();
chrono::duration<double> time_span =
chrono::duration_cast<chrono::duration<double> >(end_time -
begin_time);
cout << "Found in: "<< time_span.count() << " seconds. \n";
return 0;
}

Declaring the flag as std::atomic<bool> does the job: reads and writes of the flag are then free of data races, and no mutex is needed.
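A minimal sketch of that idea, assuming a helper named parallel_contains (hypothetical, not from the question) that splits the search into equal chunks. Relaxed ordering is used because the flag carries no other data and join() already synchronizes the final read.

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Returns true if `target` occurs in `data`, searching with `num_threads` threads.
bool parallel_contains(const std::vector<int>& data, int target, unsigned num_threads) {
    assert(num_threads > 0);
    std::atomic<bool> found{false};  // the shared stop flag
    std::vector<std::thread> threads;
    const std::size_t chunk = (data.size() + num_threads - 1) / num_threads;
    for (unsigned t = 0; t < num_threads; ++t) {
        const std::size_t begin = std::min(data.size(), t * chunk);
        const std::size_t end = std::min(data.size(), begin + chunk);
        threads.emplace_back([&data, &found, target, begin, end] {
            for (std::size_t i = begin; i < end; ++i) {
                // Stop early once any thread has reported a hit.
                if (found.load(std::memory_order_relaxed)) return;
                if (data[i] == target) {
                    found.store(true, std::memory_order_relaxed);
                    return;
                }
            }
        });
    }
    for (auto& th : threads) th.join();
    return found.load();
}
```

Checking the flag on every iteration is cheap on most hardware as long as it is rarely written, which matches the "set once" usage described in the question.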

How can I raise a matrix to a power with multiple threads?

I am trying to raise a matrix to a power with multiple threads, but I am not very good with threads. Also I enter the number of threads from keyboard and that number is in range [1, matrix height], then I do the following:
unsigned period = ceil((double)A.getHeight() / threadNum);
unsigned prev = 0, next = period;
for (unsigned i(0); i < threadNum; ++i) {
threads.emplace_back(&power<long long>, std::ref(result), std::ref(A), std::ref(B), prev, next, p);
if (next + period > A.getHeight()) {
prev = next;
next = A.getHeight();
}
else {
prev = next;
next += period;
}
}
It was easy for me to multiply one matrix by another with multiple threads, but here the problem is that once 1 step is done, for example I need to raise A to the power of 3, A^2 would be that one step, after that step I have to wait for all the threads to finish up, before moving on to doing A^2*A. How can I make my threads wait for that? I'm using std::thread's.
After the first reply was posted I realized that I forgot to mention that I want to create those threads only once, and not recreate them for each multiplication step.

I would suggest using condition_variable.
Algorithm would be something like this:
Split the matrix in N parts for N threads.
Each thread calculates the necessary resulting sub matrix for a single multiplication.
Then it increments an atomic threads_finished counter using fetch_add and waits on a shared condition variable.
Last thread that finishes (fetch_add()+1 == thread count), notifies all threads, that they can now continue processing.
Profit.
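The steps above can be sketched as a small reusable barrier. This is one possible reading, not the answer's code: StepBarrier and demo_barrier are hypothetical names, and the sketch keeps the counter under the mutex with a generation number instead of using an atomic fetch_add, which makes reuse across multiplication steps easier to reason about. C++20 also provides std::barrier for this purpose.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical reusable barrier: the last thread to arrive wakes the others.
class StepBarrier {
public:
    explicit StepBarrier(std::size_t count) : count_(count) { assert(count > 0); }

    void arrive_and_wait() {
        std::unique_lock<std::mutex> lock(mutex_);
        const std::size_t my_generation = generation_;
        if (++arrived_ == count_) {
            arrived_ = 0;    // reset for the next step
            ++generation_;   // releases everyone waiting on this step
            cv_.notify_all();
        } else {
            // The predicate protects against spurious wakeups.
            cv_.wait(lock, [&] { return generation_ != my_generation; });
        }
    }

private:
    std::mutex mutex_;
    std::condition_variable cv_;
    const std::size_t count_;
    std::size_t arrived_ = 0;
    std::size_t generation_ = 0;
};

// Demo: every thread writes its slot in step 1, then checks all slots in step 2.
bool demo_barrier(std::size_t num_threads) {
    StepBarrier barrier(num_threads);
    std::vector<int> step1(num_threads, 0);
    std::vector<int> ok(num_threads, 0);
    std::vector<std::thread> threads;
    for (std::size_t i = 0; i < num_threads; ++i) {
        threads.emplace_back([&, i] {
            step1[i] = 1;
            barrier.arrive_and_wait();  // nobody continues until all have written
            int sum = 0;
            for (int v : step1) sum += v;
            ok[i] = (sum == static_cast<int>(num_threads)) ? 1 : 0;
        });
    }
    for (auto& t : threads) t.join();
    for (int v : ok) {
        if (v == 0) return false;
    }
    return true;
}
```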
Edit:
Here is an example of how to stop threads:
#include <iostream>
#include <thread>
#include <mutex>
#include <condition_variable>
#include <vector>
#include <algorithm>
#include <atomic>
#include <string> // for std::to_string
void sync_threads(std::condition_variable & cv, std::mutex & mut, std::vector<int> & threads, const int idx) {
std::unique_lock<std::mutex> lock(mut);
threads[idx] = 1;
if(std::find(threads.begin(),threads.end(),0) == threads.end()) {
for(auto & i: threads)
i = 0;
cv.notify_all();
} else {
while(threads[idx])
cv.wait(lock);
}
}
int main(){
std::vector<std::thread> threads;
std::mutex mut;
std::condition_variable cv;
int max_threads = 10;
std::vector<int> thread_wait(max_threads,0);
for(int i = 0; i < max_threads; i++) {
threads.emplace_back([&,i](){
std::cout << "Thread "+ std::to_string(i)+" started\n";
sync_threads(cv,mut,thread_wait,i);
std::cout << "Continuing thread " + std::to_string(i) + "\n";
sync_threads(cv,mut,thread_wait,i);
std::cout << "Continuing thread for second time " + std::to_string(i) + "\n";
});
}
for(auto & i: threads)
i.join();
}
The interesting part is here:
void sync_threads(std::condition_variable & cv, std::mutex & mut, std::vector<int> & threads, const int idx) {
std::unique_lock<std::mutex> lock(mut); // Lock because we modify the shared threads vector and wait on cv
threads[idx] = 1; // Set my idx to 1, so we know we are sleeping
if(std::find(threads.begin(),threads.end(),0) == threads.end()) {
// I'm the last thread, wake up everyone
for(auto & i: threads)
i = 0;
cv.notify_all();
} else { //I'm not the last thread - sleep until all are finished
while(threads[idx]) // In loop so, if we wake up unexpectedly, we go back to sleep. (Thanks for pointing that out Yakk)
cv.wait(lock);
}
}

Here is a mass_thread_pool:
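#include <atomic>
#include <condition_variable>
#include <cstddef>
#include <future>
#include <memory>
#include <mutex>
#include <thread>
#include <vector>
// launches n threads all doing task F with an index: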
template<class F>
struct mass_thread_pool {
F f;
std::vector< std::thread > threads;
std::condition_variable cv;
std::mutex m;
size_t task_id = 0;
size_t finished_count = 0;
std::unique_ptr<std::promise<void>> task_done;
std::atomic<bool> finished;
void task( F f, size_t n, size_t cur_task ) {
//std::cout << "Thread " << n << " launched" << std::endl;
do {
f(n);
std::unique_lock<std::mutex> lock(m);
if (finished)
break;
++finished_count;
if (finished_count == threads.size())
{
//std::cout << "task set finished" << std::endl;
task_done->set_value();
finished_count = 0;
}
cv.wait(lock,[&]{if (finished) return true; if (cur_task == task_id) return false; cur_task=task_id; return true;});
} while(!finished);
//std::cout << finished << std::endl;
//std::cout << "Thread " << n << " finished" << std::endl;
}
mass_thread_pool() = delete;
mass_thread_pool(F fin):f(fin),finished(false) {}
mass_thread_pool(mass_thread_pool&&)=delete; // address is part of its identity
std::future<void> kick( size_t n ) {
//std::cout << "kicking " << n << " threads off. Prior count is " << threads.size() << std::endl;
std::future<void> r;
{
std::unique_lock<std::mutex> lock(m);
++task_id;
task_done.reset( new std::promise<void>() );
finished_count = 0;
r = task_done->get_future();
while (threads.size() < n) {
size_t i = threads.size();
threads.emplace_back( &mass_thread_pool::task, this, f, i, task_id );
}
//std::cout << "count is now " << threads.size() << std::endl;
}
cv.notify_all();
return r;
}
~mass_thread_pool() {
//std::cout << "destroying thread pool" << std::endl;
finished = true;
cv.notify_all();
for (auto&& t:threads) {
//std::cout << "joining thread" << std::endl;
t.join();
}
//std::cout << "destroyed thread pool" << std::endl;
}
};
You construct it with a task, and then you kick(77) to launch 77 copies of that task (each with a different index).
kick returns a std::future<void>. You must wait on this future for all of the tasks to be finished.
Then you can either destroy the thread pool, or call kick(77) again to relaunch the task.
The idea is that the function object you pass to mass_thread_pool has access to both your input and output data (say, the matrices you want to multiply, or pointers to them). Each kick causes it to call your function once for each index. You are in charge of turning indexes into offsets of whatever.
Live example where I use it to add 1 to an entry in another vector. Between iterations, we swap vectors. This does 2000 iterations, and launches 10 threads, and calls the lambda 20000 times.
Note the auto&& pool = make_pool( lambda ) bit. Use of auto&& is required -- as the thread pool has pointers into itself, I disabled both move and copy construct on a mass thread pool. If you really need to pass it around, create a unique pointer to the thread pool.
I ran into some issues with std::promise resetting, so I wrapped it in a unique_ptr. That may not be required.
Trace statements I used to debug it are commented out.
Calling kick with a different n may or may not work. Definitely calling it with a smaller n will not work the way you expect (it will ignore the n in that case).
No processing is done until you call kick. kick is short for "kick off".
...
In the case of your problem, what I'd do is make a multiplier object that owns a mass_thread_pool.
The multiplier has a pointer to 3 matrices (a, b and out). Each of the n subtasks generate some subsection of out.
You pass 2 matrices to the multiplier, it sets a pointer to out to a local matrix and a and b to the passed in matrices, does a kick, then a wait, then returns the local matrix.
For powers, you use the above multiplier to build a power-of-two tower, while multiply-accumulating based off the bits of the exponent into your result (again using the above multiplier).
A fancier version of the above could allow queuing up of multiplications and std::future<Matrix>s (and multiplications of future matrices).
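The power-of-two tower can be sketched as exponentiation by squaring. In this sketch, Matrix, identity, and multiply are illustrative stand-ins: multiply is a plain single-threaded product that marks where the pool-backed multiplier would be called, and square matrices are assumed.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Matrix = std::vector<std::vector<long long>>;

Matrix identity(std::size_t n) {
    Matrix m(n, std::vector<long long>(n, 0));
    for (std::size_t i = 0; i < n; ++i) m[i][i] = 1;
    return m;
}

// Stand-in for the threaded multiplier (assumes square matrices of equal size).
Matrix multiply(const Matrix& a, const Matrix& b) {
    const std::size_t n = a.size();
    Matrix out(n, std::vector<long long>(n, 0));
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t k = 0; k < n; ++k)
            for (std::size_t j = 0; j < n; ++j)
                out[i][j] += a[i][k] * b[k][j];
    return out;
}

// Computes base^exponent with O(log exponent) multiplications.
Matrix power(Matrix base, unsigned exponent) {
    assert(!base.empty() && base.size() == base[0].size());
    Matrix result = identity(base.size());
    while (exponent > 0) {
        // If the current bit is set, fold this power of two into the result.
        if (exponent & 1u) result = multiply(result, base);
        exponent >>= 1;
        // Build the next level of the tower: base^(2^k) -> base^(2^(k+1)).
        if (exponent > 0) base = multiply(base, base);
    }
    return result;
}
```

Each multiply call is a natural synchronization point: all subtasks of one product must finish before the next product starts, which is exactly what waiting on the future returned by kick provides.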

I would start with a simple decomposition:
matrix multiplication gets multithreaded
matrix exponent calls the multiplication several times.
Something like that:
Mat multithreaded_multiply(Mat const& left, Mat const& right) {...}
Mat power(Mat const& M, int n)
{
// Handle degenerate cases here (n = 0, 1)
// Regular loop
Mat intermediate = M;
for (int i = 2; i <= n; ++i)
{
intermediate = multithreaded_multiply(M, intermediate);
}
return intermediate; // M^n for n >= 1
}
For waiting for std::thread, you have the method join().

Not a programming answer but a math one: every square matrix has a set of so-called "eigenvalues" and "eigenvectors" such that M * E_i = lambda_i * E_i, where M is the matrix, E_i is an eigenvector, and lambda_i is the corresponding eigenvalue, which is just a (possibly complex) number. It follows that M^n * E_i = lambda_i^n * E_i, so you only need the nth power of a number instead of a matrix. If the matrix is diagonalizable, its eigenvectors form a basis (they are orthogonal only in special cases, such as symmetric matrices), i.e. any vector V = sum_i a_i * E_i. So M^n * V = sum_i a_i * lambda_i^n * E_i.
Depending on your problem this might speed up things significantly.
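Written out in matrix form, and assuming M is diagonalizable with the eigenvectors E_i as the columns of an invertible matrix P (P and D are names introduced here for illustration), the same argument reads:

```latex
M = P D P^{-1}, \qquad D = \operatorname{diag}(\lambda_1, \dots, \lambda_k)
\quad\Longrightarrow\quad
M^n = P D^n P^{-1}, \qquad D^n = \operatorname{diag}(\lambda_1^n, \dots, \lambda_k^n)

V = \sum_i a_i E_i
\quad\Longrightarrow\quad
M^n V = \sum_i a_i \lambda_i^n E_i
```

The cost then moves from repeated matrix products to one eigendecomposition plus scalar powers. With integer entries such as long long, the floating-point eigendecomposition may introduce rounding error.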

C++ 11 std thread summation with atomic very slow

I wanted to learn to use C++ 11 std::threads with VS2012 and I wrote a very simple C++ console program with two threads which just increment a counter. I also want to test the performance difference when two threads are used. Test program is given below:
#include <iostream>
#include <thread>
#include <conio.h>
#include <atomic>
std::atomic<long long> sum(0);
//long long sum;
using namespace std;
const int RANGE = 100000000;
void test_without_threds()
{
sum = 0;
for(unsigned int j = 0; j < 2; j++)
for(unsigned int k = 0; k < RANGE; k++)
sum ++ ;
}
void call_from_thread(int tid)
{
for(unsigned int k = 0; k < RANGE; k++)
sum ++ ;
}
void test_with_2_threds()
{
std::thread t[2];
sum = 0;
//Launch a group of threads
for (int i = 0; i < 2; ++i) {
t[i] = std::thread(call_from_thread, i);
}
//Join the threads with the main thread
for (int i = 0; i < 2; ++i) {
t[i].join();
}
}
int _tmain(int argc, _TCHAR* argv[])
{
chrono::time_point<chrono::system_clock> start, end;
cout << "-----------------------------------------\n";
cout << "test without threds()\n";
start = chrono::system_clock::now();
test_without_threds();
end = chrono::system_clock::now();
chrono::duration<double> elapsed_seconds = end-start;
cout << "finished calculation for "
<< chrono::duration_cast<std::chrono::milliseconds>(end - start).count()
<< "ms.\n";
cout << "sum:\t" << sum << "\n";
cout << "-----------------------------------------\n";
cout << "test with 2_threds\n";
start = chrono::system_clock::now();
test_with_2_threds();
end = chrono::system_clock::now();
cout << "finished calculation for "
<< chrono::duration_cast<std::chrono::milliseconds>(end - start).count()
<< "ms.\n";
cout << "sum:\t" << sum << "\n";
_getch();
return 0;
}
Now, when I use for the counter just the long long variable (which is commented) I get value which is different from the correct - 100000000 instead of 200000000. I am not sure why is that and I suppose that the two threads are changing the counter at the same time, but I am not sure how it happens really because ++ is just a very simple instruction. It seems that the threads are caching the sum variable at beginning. Performance is 110 ms with two threads vs 200 ms for one thread.
So the correct way according to documentation is to use std::atomic. However now the performance is much worse for both cases as about 3300 ms without threads and 15820 ms with threads. What is the correct way to use std::atomic in this case?

"I am not sure why is that and I suppose that the two threads are changing the counter at the same time, but I am not sure how it happens really because ++ is just a very simple instruction."
Each thread is pulling the value of sum into a register, incrementing the register, and finally writing it back to memory at the end of the loop.
"So the correct way according to documentation is to use std::atomic. However now the performance is much worse for both cases as about 3300 ms without threads and 15820 ms with threads. What is the correct way to use std::atomic in this case?"
You're paying for the synchronization std::atomic provides. It won't be nearly as fast as using an un-synchronized integer, though you can get a small improvement to performance by refining the memory order of the add:
sum.fetch_add(1, std::memory_order_relaxed);
In this particular case, you're compiling for x86 and operating on a 64-bit integer. This means that the compiler has to generate code to update the value in two 32-bit operations; if you change the target platform to x64, the compiler will generate code to do the increment in a single 64-bit operation.
As a general rule, the solution to problems like this is to reduce the number of writes to shared data.
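One way to apply that rule, sketched under the assumption that each thread can keep a private subtotal: accumulate locally and touch the shared atomic once per thread. The function name count_with_local_subtotals is illustrative.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Each thread counts into a local variable and publishes it with a single fetch_add.
long long count_with_local_subtotals(unsigned num_threads, long long per_thread) {
    assert(num_threads > 0 && per_thread >= 0);
    std::atomic<long long> sum{0};
    std::vector<std::thread> threads;
    for (unsigned t = 0; t < num_threads; ++t) {
        threads.emplace_back([&sum, per_thread] {
            long long local = 0;  // private to this thread: no synchronization needed
            for (long long k = 0; k < per_thread; ++k) ++local;
            // One shared write per thread instead of one per iteration.
            sum.fetch_add(local, std::memory_order_relaxed);
        });
    }
    for (auto& th : threads) th.join();
    return sum.load();
}
```

An optimizing compiler may collapse the trivial counting loop entirely, so this shape is more useful as a pattern for real per-element work than as a benchmark.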

Your code has a couple of problems. First of all, all the "inputs" involved are compile-time constants, so a good compiler can pre-compute the value for the single-threaded code, so (regardless of the value you give for range) it shows as running in 0 ms.
Second, you're sharing a single variable (sum) between all the threads, forcing all of their accesses to be synchronized at that point. Without synchronization, that gives undefined behavior. As you've already found, synchronizing the access to that variable is quite expensive, so you usually want to avoid it if at all reasonable.
One way to do that is to use a separate subtotal for each thread, so they can all do their additions in parallel, without synchronizing, then adding together the individual results at the end.
Another point is to ensure against false sharing. False sharing arises when two (or more) threads are writing to data that really is separate, but has been allocated in the same cache line. In this case, access to the memory can be serialized even though (as already noted) you don't have any data actually shared between the threads.
Based on those factors, I've rewritten your code slightly to create a separate sum variable for each thread. Those variables are of a class type that gives (fairly) direct access to the data, but does stop the optimizer from seeing that it can do the whole computation at compile-time, so we end up comparing one thread to 4 (which reminds me: I did increase the number of threads from 2 to 4, since I'm using a quad-core machine). I moved that number into a const variable though, so it should be easy to test with different numbers of threads.
#include <iostream>
#include <thread>
#include <conio.h>
#include <atomic>
#include <numeric>
const int num_threads = 4;
struct alignas(64) val { // assumes 64-byte cache lines, so each sum gets its own line
long long sum;
val &operator=(long long i) { sum = i; return *this; }
operator long long &() { return sum; }
operator long long() const { return sum; }
};
val sum[num_threads];
using namespace std;
const int RANGE = 100000000;
void test_without_threds()
{
sum[0] = 0LL;
for(unsigned int j = 0; j < num_threads; j++)
for(unsigned int k = 0; k < RANGE; k++)
sum[0] ++ ;
}
void call_from_thread(int tid)
{
for(unsigned int k = 0; k < RANGE; k++)
sum[tid] ++ ;
}
void test_with_threads()
{
std::thread t[num_threads];
std::fill_n(sum, num_threads, 0);
//Launch a group of threads
for (int i = 0; i < num_threads; ++i) {
t[i] = std::thread(call_from_thread, i);
}
//Join the threads with the main thread
for (int i = 0; i < num_threads; ++i) {
t[i].join();
}
long long total = std::accumulate(std::begin(sum), std::end(sum), 0LL);
}
int main()
{
chrono::time_point<chrono::system_clock> start, end;
cout << "-----------------------------------------\n";
cout << "test without threds()\n";
start = chrono::system_clock::now();
test_without_threds();
end = chrono::system_clock::now();
chrono::duration<double> elapsed_seconds = end-start;
cout << "finished calculation for "
<< chrono::duration_cast<std::chrono::milliseconds>(end - start).count()
<< "ms.\n";
cout << "sum:\t" << sum << "\n";
cout << "-----------------------------------------\n";
cout << "test with threads\n";
start = chrono::system_clock::now();
test_with_threads();
end = chrono::system_clock::now();
cout << "finished calculation for "
<< chrono::duration_cast<std::chrono::milliseconds>(end - start).count()
<< "ms.\n";
cout << "sum:\t" << sum << "\n";
_getch();
return 0;
}
When I run this, my results are closer to what I'd guess you hoped for:
-----------------------------------------
test without threds()
finished calculation for 78ms.
sum: 000000013FCBC370
-----------------------------------------
test with threads
finished calculation for 15ms.
sum: 000000013FCBC370
... the printed sum values are identical (note that sum is now an array, so the stream prints its address rather than a total), but N threads increase speed by a factor of approximately N (up to the number of cores available).
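A compact variant of the same idea, as a sketch rather than a replacement for the code above: PaddedSum and sum_with_padded_slots are hypothetical names, a 64-byte cache line is assumed, and the function returns the combined total as a number. C++17 or later is assumed so that std::vector honors the over-aligned type.

```cpp
#include <cassert>
#include <thread>
#include <vector>

// Assumption: 64-byte cache lines. Each slot occupies its own line,
// so threads writing neighbouring slots do not falsely share.
struct alignas(64) PaddedSum {
    long long value = 0;
};

long long sum_with_padded_slots(unsigned num_threads, long long per_thread) {
    assert(num_threads > 0 && per_thread >= 0);
    std::vector<PaddedSum> slots(num_threads);
    std::vector<std::thread> threads;
    for (unsigned t = 0; t < num_threads; ++t) {
        threads.emplace_back([&slots, t, per_thread] {
            // Each thread writes only its own slot: no data is shared.
            for (long long k = 0; k < per_thread; ++k) ++slots[t].value;
        });
    }
    for (auto& th : threads) th.join();
    // Combine the per-thread subtotals after all threads have finished.
    long long total = 0;
    for (const PaddedSum& s : slots) total += s.value;
    return total;
}
```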

Try using prefix increment, which may improve performance.
In a test on my machine, std::memory_order_relaxed did not give any advantage.

Parallel implemention of Lisp-style mapping of a function to a list in C++ fails without cout after use of thread

This code works only when any of the lines under /* debug messages */ are uncommented. Or if the list being mapped to is less than 30 elements.
func_map is a linear implementation of a Lisp-style mapping and can be assumed to work.
Use of it would be as follows: func_map(FUNC_PTR foo, std::vector<int>* list, locs* start_and_end)
FUNC_PTR is a pointer to a function that returns void and takes in an int pointer
For example: &foo in which foo is defined as:
void foo (int* num){ (*num) = (*num) * (*num);}
locs is a struct with two members int_start and int_end; I use it to tell func_map which elements it should iterate over.
void par_map(FUNC_PTR func_transform, std::vector<int>* vector_array) //function for mapping a function to a list, a la Lisp
{
int array_size = (*vector_array).size(); //retain the number of elements in our vector
int num_threads = std::thread::hardware_concurrency(); //figure out number of cores
int array_sub = array_size/num_threads; //number that we use to figure out how many elements should be assigned per thread
std::vector<std::thread> threads; //the vector that we will initialize threads in
std::vector<locs> vector_locs; // the vector that we will store the start and end position for each thread
for(int i = 0; i < num_threads && i < array_size; i++)
{
locs cur_loc; //the locs struct that we will create using the power of LOGIC
if(array_sub == 0) //the LOGIC
{
cur_loc.int_start = i; //if the number of elements in the array is less than the number of cores just assign one core to each element
}
else
{
cur_loc.int_start = (i * array_sub); //otherwise figure out the starting point given the number of cores
}
if(i == (num_threads - 1))
{
cur_loc.int_end = array_size; //make sure all elements will be iterated over
}
else if(array_sub == 0)
{
cur_loc.int_end = (i + 1); //ditto
}
else
{
cur_loc.int_end = ((i+1) * array_sub); //otherwise use the number of threads to determine our ending point
}
vector_locs.push_back(cur_loc); //store the created locs struct so it doesnt get changed during reference
threads.push_back(std::thread(func_map,
func_transform,
vector_array,
(&vector_locs[i]))); //create a thread
/*debug messages*/ // <--- whenever any of these are uncommented the code works
//cout << "i = " << i << endl;
//cout << "int_start == " << cur_loc.int_start << endl;
//cout << "int_end == " << cur_loc.int_end << endl << endl;
//cout << "Thread " << i << " initialized" << endl;
}
for(int i = 0; i < num_threads && i < array_size; i++)
{
(threads[i]).join(); //make sure all the threads are done
}
}
I think that the issue might be in how vector_locs[i] is used and how threads are resolved. But the use of a vector to maintain the state of the locs instance referenced by thread should prevent that from being an issue; I'm really stumped.

You're giving the thread function a pointer, &vector_locs[i], that may become invalidated as you push_back more items into the vector.
Since you know beforehand how many items vector_locs will contain - min(num_threads, array_size) - you can reserve that space in advance to prevent reallocation.
As to why it doesn't crash if you uncomment the output, I would guess that the output is so slow that the thread you just started will finish before the output is done, so the next iteration can't affect it.
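A minimal sketch of the reserve fix. The locs struct mirrors the one described in the question; square_range is a hypothetical stand-in for func_map, and par_square is an illustrative name.

```cpp
#include <algorithm>
#include <cassert>
#include <thread>
#include <vector>

struct locs { int int_start; int int_end; };  // mirrors the struct from the question

// Hypothetical stand-in for func_map: squares the elements in [int_start, int_end).
void square_range(std::vector<int>* values, const locs* range) {
    for (int i = range->int_start; i < range->int_end; ++i) {
        (*values)[i] *= (*values)[i];
    }
}

void par_square(std::vector<int>* values) {
    assert(values != nullptr);
    const int size = static_cast<int>(values->size());
    if (size == 0) return;
    int num_threads = static_cast<int>(std::thread::hardware_concurrency());
    if (num_threads == 0) num_threads = 1;  // hardware_concurrency may return 0
    const int count = std::min(num_threads, size);
    const int chunk = size / count;

    std::vector<locs> ranges;
    ranges.reserve(count);  // no reallocation later, so &ranges[i] stays valid
    std::vector<std::thread> threads;
    for (int i = 0; i < count; ++i) {
        const int start = i * chunk;
        const int end = (i == count - 1) ? size : (i + 1) * chunk;
        ranges.push_back(locs{start, end});
        threads.emplace_back(square_range, values, &ranges[i]);
    }
    for (auto& t : threads) t.join();
}
```

An alternative with the same effect is to pass each locs to the thread by value, which removes the dependency on the vector's storage altogether.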

I think you should make this loop inner to the main one:
...
for(int i = 0; i < num_threads && i < array_size; i++)
{
(threads[i]).join(); //make sure all the threads are done
}
}
