When creating a thread, how do I run the thread's main method? - c++

The main method of my thread is:
void thrMain(const std::vector<long>& list, std::vector<int>& result,
const int startInd, const int endInd) {
for (int i = startInd; (i < endInd); i++) {
result[i] = countFactors(list[i]);
}
}
I create a list of threads each time using another method:
std::vector<int> getFactorCount(const std::vector<long>& numList, const int thrCount) {
// First allocate the return vector
const int listSize = numList.size();
const int count = (listSize / thrCount) + 1;
std::vector<std::thread> thrList; // List of threads
const std::vector<long> interFac(thrCount); // Intermediate factors
// Store factorial counts
std::vector<int> factCounts(numList.size());
for (int start = 0, thr = 0; (thr < thrCount); thr++, start += count) {
int end = std::max(listSize, (start + count));
thrList.push_back(std::thread(thrMain, std::ref(numList),
std::ref(interFac[thr]), start, end));
}
for (auto& t : thrList) {
t.join();
}
// Return the result back
return factCounts;
}
The main problem I am having is that passing std::ref(interFac[thr]) produces compile errors that point into the <thread> header. I have tried removing the pass by reference, and that does not fix the problem.

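I don't know what interFac is for, but thrMain expects a std::vector<int>& as its second argument, while interFac[thr] is a single const long. So it looks like this: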
thrList.push_back(std::thread(thrMain, std::ref(numList),
std::ref(interFac[thr]), start, end));
should be this:
thrList.push_back(std::thread(thrMain, std::ref(numList),
std::ref(factCounts), start, end));
Then it compiles.
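Compiling is only half of it: the loop also computes end with std::max, which lets every thread run to at least listSize and, for later chunks, past the end of the vectors. A minimal sketch of the whole function with both fixes applied follows. countFactors is not shown in the question, so the version here is a hypothetical stand-in, and thrCount is assumed to be greater than zero.

```cpp
#include <algorithm>
#include <functional>
#include <thread>
#include <vector>

// Hypothetical stand-in: the question does not show countFactors().
int countFactors(long n) {
    int count = 0;
    for (long d = 1; d <= n; ++d) {
        if (n % d == 0) ++count;
    }
    return count;
}

// Each thread fills result[startInd, endInd) from list[startInd, endInd).
void thrMain(const std::vector<long>& list, std::vector<int>& result,
             const int startInd, const int endInd) {
    for (int i = startInd; i < endInd; i++) {
        result[i] = countFactors(list[i]);
    }
}

std::vector<int> getFactorCount(const std::vector<long>& numList, const int thrCount) {
    const int listSize = static_cast<int>(numList.size());
    const int count = (listSize / thrCount) + 1;  // chunk size per thread
    std::vector<int> factCounts(numList.size());
    std::vector<std::thread> thrList;
    for (int start = 0, thr = 0; thr < thrCount; thr++, start += count) {
        // Clamp both ends to the list size (std::min, not std::max).
        const int begin = std::min(listSize, start);
        const int end = std::min(listSize, start + count);
        thrList.emplace_back(thrMain, std::cref(numList), std::ref(factCounts),
                             begin, end);
    }
    for (auto& t : thrList) {
        t.join();
    }
    return factCounts;
}
```

The threads write to disjoint index ranges of factCounts, so no locking is needed here.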

Vector processing issues in multi threading

I'm implementing data processing across multiple threads.
I want to process the data in class DataProcess and merge the results in class DataStorage.
My problem is that adding data to the vector sometimes throws an exception.
As far as I can tell, each thread works on a class instance at a different address.
Is it a problem to create a new data-handling object for each thread and process the data separately?
Here is my code.
#include <iostream>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <thread>
#include <vector>
#include <mutex>
using namespace::std;
static std::mutex m;
class DataStorage
{
private :
std::vector<long long> vecData;
public:
DataStorage()
{
}
~DataStorage()
{
}
void SetDataVectorSize(int size)
{
vecData.clear();
vecData.resize(size);
}
void DataInsertLoop(void* Data, int start, int end)
{
m.lock();
std::vector<long long> const * _v1 = static_cast<std::vector<long long> const *>(Data);
long long num = 0;
for (int idx = start; idx < _v1->size(); ++idx)
{
vecData[idx] = _v1->at(idx);
}
m.unlock();
}
};
class DataProcess
{
private:
int m_index;
long long m_startIndex;
long long m_endIndex;
int m_coreNum;
long long num;
DataStorage* m_mainStorage;
std::vector<long long> m_vecData;
public :
DataProcess(int pindex, long long startindex, long long endindex)
: m_index(pindex), m_startIndex(startindex), m_endIndex(endindex),
m_coreNum(0),m_mainStorage(NULL), num(0)
{
m_vecData.clear();
}
~DataProcess()
{
}
void SetMainAdrr(DataStorage* const mainstorage)
{
m_mainStorage = mainstorage;
}
void SetCoreInCPU(int num)
{
m_coreNum = num;
}
void DataRun()
{
for (long long idx = m_startIndex; idx < m_endIndex; ++idx)
{
num += rand();
m_vecData.push_back(num); //<- exception error position
}
m_mainStorage->DataInsertLoop(&m_vecData, m_startIndex, m_endIndex);
}
};
int main()
{
//auto beginTime = std::chrono::high_resolution_clock::now();
clock_t beginTime, endTime;
DataStorage* main = new DataStorage();
beginTime = clock();
long long totalcount = 200000000;
long long halfdata = totalcount / 2;
std::thread t1,t2;
for (int t = 0; t < 2; ++t)
{
DataProcess* clsDP = new DataProcess(1, 0, halfdata);
clsDP->SetCoreInCPU(2);
clsDP->SetMainAdrr(main);
if (t == 0)
{
t1 = std::thread([&]() {clsDP->DataRun(); });
}
else
{
t2 = std::thread([&]() {clsDP->DataRun(); });
}
}
t1.join(); t2.join();
endTime = clock();
double resultTime = (double)(endTime - beginTime);
std::cout << "Multi Thread " << resultTime / 1000 << " sec" << std::endl;
printf("--------------------\n");
int value = getchar();
}
Interestingly, if none of your threads accesses portions of vecData accessed by another thread, DataStorage::DataInsertLoop should not need to be synchronized at all. That should make processing much faster. That is, after all bugs are fixed... This also means you should not need a mutex at all.
There are other issues with your code... The most easily spotted is a memory leak.
In main:
DataStorage* main = new DataStorage(); // you call new, but never call delete...
// that's a memory leak. Avoid calling
// new() directly.
//
// Also: 'main' is kind of a reserved
// name, don't use it except for the
// program entry point.
// How about this, instead ?
DataStorage dataSrc; // DataSrc has a very small footprint (a few pointers).
// ...
std::thread t1,t2; // why not use an array ?
// as in:
std::vector<std::thread> thrds;
// ...
// You forgot to set the size of your data set before starting, by calling:
dataSrc.SetDataVectorSize(200000000);
for (int t = 0; t < 2; ++t)
{
// ...
// Calling new again, and not delete... Use a smart pointer type instead of:
// DataProcess* clsDP = new DataProcess(1, 0, halfdata);
// Also, fix the start and end indices (NOTE: code below works for t < 2, but
// probably not for t < 3)
auto clsDP = std::make_unique<DataProcess>(t, t * halfdata, (t + 1) * halfdata);
// You need to keep a reference to these pointers
// Either by storing them in an array, or by passing them to
// the threads. As in, for example:
thrds.emplace_back([dp = std::move(clsDP)]() {dp->DataRun(); });
}
//...
std::for_each(thrds.begin(), thrds.end(), [](auto& t) { t.join(); });
//...
More...
You create a mutex on your very first line of executable code. That's good... somewhat...
static std::mutex m; // a one letter name is a terrible choice for a variable with
// file scope.
Apart from the name, it's not in the right scope... If you want to use a mutex to protect DataStorage::vecData, this mutex should be declared in the same scope as DataStorage::vecData.
One last thing. Have you considered using iterators (aka pointers) as arguments to DataProcess::DataProcess() ? This would simplify the code quite a bit, and it would very likely run faster.
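Putting these points together, here is a minimal sketch of how the program could look. It is a simplification under stated assumptions, not a drop-in replacement: the class interfaces are trimmed, processInParallel is a hypothetical helper, and rand() is replaced by a deterministic value so the result can be checked. The likely cause of the original exception is that the lambdas captured the loop-local clsDP pointer by reference, so a thread could end up using a dangling reference or the same DataProcess as the other thread.

```cpp
#include <cstddef>
#include <memory>
#include <thread>
#include <utility>
#include <vector>

class DataStorage {
    std::vector<long long> vecData;
public:
    explicit DataStorage(std::size_t size) : vecData(size) {}
    // Copies src into [start, start + src.size()).
    // No mutex: each caller writes a disjoint range.
    void insert(const std::vector<long long>& src, std::size_t start) {
        for (std::size_t i = 0; i < src.size(); ++i) vecData[start + i] = src[i];
    }
    const std::vector<long long>& data() const { return vecData; }
};

class DataProcess {
    std::size_t m_start, m_end;
    DataStorage& m_storage;
    std::vector<long long> m_local;
public:
    DataProcess(std::size_t start, std::size_t end, DataStorage& storage)
        : m_start(start), m_end(end), m_storage(storage) {}
    void run() {
        m_local.reserve(m_end - m_start);
        long long num = 0;
        for (std::size_t idx = m_start; idx < m_end; ++idx) {
            num += static_cast<long long>(idx);  // deterministic stand-in for rand()
            m_local.push_back(num);
        }
        m_storage.insert(m_local, m_start);
    }
};

// Hypothetical helper: splits [0, total) into threadCount chunks.
std::vector<long long> processInParallel(std::size_t total, std::size_t threadCount) {
    DataStorage storage(total);  // sized up front, lives on the stack
    const std::size_t chunk = total / threadCount;
    std::vector<std::thread> threads;
    for (std::size_t t = 0; t < threadCount; ++t) {
        const std::size_t begin = t * chunk;
        const std::size_t end = (t + 1 == threadCount) ? total : begin + chunk;
        auto dp = std::make_unique<DataProcess>(begin, end, storage);
        // The lambda owns its DataProcess, so nothing dangles when the loop iterates.
        threads.emplace_back([owned = std::move(dp)] { owned->run(); });
    }
    for (auto& th : threads) th.join();
    return storage.data();
}
```

Calling processInParallel(6, 2) runs two threads over the ranges [0, 3) and [3, 6) and returns the merged vector after both have joined.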

Cyclic splitting of execution into several threads (1-N-1-N-1...)

Consider this case:
for (...)
{
const size_t count = ...
for (size_t i = 0; i < count; ++i)
{
calculate(i); // thread-safe function
}
}
What is the most elegant solution to maximize performance using C++17 and/or boost?
Repeatedly creating and joining threads makes no sense because of the huge overhead (which in my case exactly cancels the possible gain).
So I have to create N threads only once and keep them synchronized with the main one (using mutex, shared_mutex, condition_variable, atomic, etc.). This turned out to be quite a difficult task for such a common and clear situation (if everything is to be really safe and fast). After days of working on it, I have a feeling I am reinventing the wheel...
Update 1: calculate(x) and calculate(y) can (and should) run in parallel
Update 2: std::atomic::fetch_add (or something similar) is preferable to a queue (or something similar)
Update 3: extreme computations (i.e. millions of "outer" calls and hundreds of "inner")
Update 4: calculate() changes internal object's data without returning a value
Intermediate solution
For some reason "async + wait" is much faster than "create + join" threads. So these two examples give a 100% speed increase:
Example 1
for (...)
{
const size_t count = ...
future<void> execution[cpu_cores];
for (size_t x = 0; x < cpu_cores; ++x)
{
execution[x] = async(launch::async, ref(*this), x, count);
}
for (size_t x = 0; x < cpu_cores; ++x)
{
execution[x].wait();
}
}
void operator()(const size_t x, const size_t count)
{
for (size_t i = x; i < count; i += cpu_cores)
{
calculate(i);
}
}
Example 2
for (...)
{
index = 0;
const size_t count = ...
future<void> execution[cpu_cores];
for (size_t x = 0; x < cpu_cores; ++x)
{
execution[x] = async(launch::async, ref(*this), count);
}
for (size_t x = 0; x < cpu_cores; ++x)
{
execution[x].wait();
}
}
atomic<size_t> index;
void operator()(const size_t count)
{
for (size_t i = index.fetch_add(1); i < count; i = index.fetch_add(1))
{
calculate(i);
}
}
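For readers who want to try Example 2 outside the original class, here is a rough self-contained sketch of the same idea: a shared atomic counter hands out indices to tasks started with std::async. The name parallelFor and its parameters are made up for illustration, and calculate is assumed to be safe to call concurrently for different indices.

```cpp
#include <atomic>
#include <cstddef>
#include <future>
#include <vector>

// Runs calculate(i) for every i in [0, count), sharing the indices among
// `workers` tasks through one atomic counter.
template <typename Fn>
void parallelFor(std::size_t count, std::size_t workers, Fn calculate) {
    std::atomic<std::size_t> index{0};
    std::vector<std::future<void>> execution;
    execution.reserve(workers);
    for (std::size_t x = 0; x < workers; ++x) {
        execution.push_back(std::async(std::launch::async, [&] {
            // Each fetch_add claims the next unprocessed index.
            for (std::size_t i = index.fetch_add(1); i < count;
                 i = index.fetch_add(1)) {
                calculate(i);
            }
        }));
    }
    // Wait before returning so the captured references stay valid.
    for (auto& f : execution) {
        f.wait();
    }
}
```

Because every index is claimed exactly once, the tasks balance themselves even when some calls to calculate take longer than others.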
Is it possible to make it even faster by creating threads only once and then synchronize them with a small overhead?
Final solution
Additional +20% of speed increase in comparison to std::async!
for (size_t i = 0; i < _countof(index); ++i) { index[i] = i; }
for_each_n(par_unseq, index, count, [&](const size_t i) { calculate(i); });
Is it possible to avoid redundant array "index"?
Yes:
for_each_n(par_unseq, counting_iterator<size_t>(0), count,
[&](const size_t i)
{
calculate(i);
});
In the past, you'd use OpenMP, GNU Parallel, Intel TBB.¹
If you have c++17², I'd suggest using execution policies with standard algorithms.
It's really better than you can expect to do things yourself, although
it requires some forethought to choose your types to be amenable to standard algorithms, and
it still helps if you know what will happen under the hood.
Here's a simple example without further ado:
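#include <thread>
#include <algorithm>
#include <chrono>
#include <execution>
#include <functional>
#include <iostream>
#include <iterator>
#include <numeric>
#include <random>
#include <string>
#include <vector>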
using namespace std::chrono_literals;
static size_t s_random_seed = std::random_device{}();
static auto generate_param() {
static std::mt19937 prng {s_random_seed};
static std::uniform_int_distribution<> dist;
return dist(prng);
}
struct Task {
Task(int p = generate_param()) : param(p), output(0) {}
int param;
int output;
struct ByParam { bool operator()(Task const& a, Task const& b) const { return a.param < b.param; } };
struct ByOutput { bool operator()(Task const& a, Task const& b) const { return a.output < b.output; } };
};
static void calculate(Task& task) {
//std::this_thread::sleep_for(1us);
task.output = task.param ^ 0xf0f0f0f0;
}
int main(int argc, char** argv) {
if (argc>1) {
s_random_seed = std::stoull(argv[1]);
}
std::vector<Task> jobs;
auto now = std::chrono::high_resolution_clock::now;
auto start = now();
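// No execution policy here: back_inserter is not a forward iterator and
// generate_param shares one static PRNG, so this step has to run serially.
std::generate_n(
back_inserter(jobs),
1ull << 28, // reduce for small RAM!
generate_param);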
auto laptime = [&](auto caption) {
std::cout << caption << " in " << (now() - start)/1.0s << "s" << std::endl;
start = now();
};
laptime("generate randum input");
std::sort(
std::execution::par_unseq,
begin(jobs), end(jobs),
Task::ByParam{});
laptime("sort by param");
std::for_each(
std::execution::par_unseq,
begin(jobs), end(jobs),
calculate);
laptime("calculate");
std::sort(
std::execution::par_unseq,
begin(jobs), end(jobs),
Task::ByOutput{});
laptime("sort by output");
auto const checksum = std::transform_reduce(
std::execution::par_unseq,
begin(jobs), end(jobs),
0, std::bit_xor<>{},
std::mem_fn(&Task::output)
);
laptime("reduce");
std::cout << "Checksum: " << checksum << "\n";
}
When run with the seed 42, prints:
generate randum input in 10.8819s
sort by param in 8.29467s
calculate in 0.22513s
sort by output in 5.64708s
reduce in 0.108768s
Checksum: 683872090
CPU utilization is 100% on all cores except for the first (random-generation) step.
¹ (I think I have answers demoing all of these on this site).
² See Are C++17 Parallel Algorithms implemented already?

c++ threading, duplicate/missing threads

I'm trying to write a program that concurrently adds and removes items from a "storehouse". I have a "Monitor" class that handles the "storehouse" operations:
class Monitor
{
private:
mutex m;
condition_variable cv;
vector<Storage> S;
int counter = 0;
bool busy = false;
public:
void add(Computer c, int index) {
unique_lock <mutex> lock(m);
if (busy)
cout << "Thread " << index << ": waiting for !busy " << endl;
cv.wait(lock, [&] { return !busy; });
busy = true;
cout << "Thread " << index << ": Request: add " << c.CPUFrequency << endl;
for (int i = 0; i < counter; i++) {
if (S[i].f == c.CPUFrequency) {
S[i].n++;
busy = false; cv.notify_one();
return;
}
}
Storage s;
s.f = c.CPUFrequency;
s.n = 1;
// put the new item in a sorted position
S.push_back(s);
counter++;
busy = false; cv.notify_one();
}
};
The threads are created like this:
void doThreadStuff(vector<Computer> P, vector <Storage> R, Monitor &S)
{
int Pcount = P.size();
vector<thread> myThreads;
myThreads.reserve(Pcount);
for (atomic<size_t> i = 0; i < Pcount; i++)
{
int index = i;
Computer c = P[index];
myThreads.emplace_back([&] { S.add(c, index); });
}
for (size_t i = 0; i < Pcount; i++)
{
myThreads[i].join();
}
// printing results
}
Running the program produced the following results:
I'm familiar with race conditions, but this doesn't look like one to me. My bet would be on something reference related, because in the results we can see that for every "missing thread" (threads 1, 3, 10, 25) I get "duplicate threads" (threads 2, 9, 24, 28).
I have tried to create local variables in functions and loops but it changed nothing.
I have heard about threads sharing memory regions, but my previous work should have produced similar results, so I don't think that's the case here, but feel free to prove me wrong.
I'm using Visual Studio 2017
Here you capture local variables by reference in a loop. They are destroyed at the end of every iteration, so the threads are left with dangling references, which is undefined behavior:
for (atomic<size_t> i = 0; i < Pcount; i++)
{
int index = i;
Computer c = P[index];
myThreads.emplace_back([&] { S.add(c, index); });
}
You should capture index and c by value:
myThreads.emplace_back([&S, index, c] { S.add(c, index); });
Another approach would be to pass S, index and c as arguments instead of capturing them by defining the following non-capturing lambda, th_func:
auto th_func = [](Monitor &S, int index, Computer c){ S.add(c, index); };
This way you have to explicitly wrap the arguments that must be passed by reference to the thread's callable object with std::reference_wrapper by means of the function template std::ref(). In your case, only S:
for (atomic<size_t> i = 0; i < Pcount; i++) {
int index = i;
Computer c = P[index];
myThreads.emplace_back(th_func, std::ref(S), index, c);
}
Failing to wrap with std::reference_wrapper the arguments that must be passed by reference will result in a compile-time error. That is, the following won't compile:
myThreads.emplace_back(th_func, S, index, c); // <-- it should be std::ref(S)
See also this question.
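To see the capture-by-value fix in isolation, here is a small sketch that does not depend on the Monitor, Computer or Storage classes from the question. The function collectIndices is hypothetical: each thread records the index it was given, and the shared vector is protected by a mutex.

```cpp
#include <algorithm>
#include <mutex>
#include <thread>
#include <vector>

// Starts threadCount threads; each one records the index it was given.
std::vector<int> collectIndices(int threadCount) {
    std::vector<int> seen;
    std::mutex m;
    std::vector<std::thread> threads;
    threads.reserve(threadCount);
    for (int i = 0; i < threadCount; ++i) {
        int index = i;  // loop-local, destroyed at the end of each iteration
        // index is copied into the closure; seen and m outlive the threads,
        // so capturing those two by reference is safe.
        threads.emplace_back([&seen, &m, index] {
            std::lock_guard<std::mutex> lock(m);
            seen.push_back(index);
        });
    }
    for (auto& t : threads) {
        t.join();
    }
    std::sort(seen.begin(), seen.end());
    return seen;
}
```

With the by-value capture every index shows up exactly once. Changing the capture list to [&] would reintroduce the dangling reference and, with it, the duplicate and missing thread numbers described in the question.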

Segmentation fault during multithreaded quicksort in c++

#include <iostream>
#include <algorithm>
#include <future>
#include <iterator>
using namespace std;
void qsort(int *beg, int *end)
{
if (end - beg <= 1)
return;
int lhs = *beg;
int *mid = partition(beg + 1, end,
[&](int arg)
{
return arg < lhs;
}
);
swap(*beg, *(mid - 1));
qsort(beg, mid);
qsort(mid, end);
}
std::future<void> qsortMulti(int *beg, int *end) // SEG FAULT
{
if (end - beg <= 1)
return future<void>();
int lhs = *beg;
int *mid = partition(beg + 1, end,
[&](int arg)
{
return arg < lhs;
}
);
swap(*beg, *(mid - 1));
//spawn new thread for one side of the recursion
auto future = async(launch::async, qsortMulti, beg, mid);
//other side of the recursion is done in the current thread
qsortMulti(mid, end);
future.wait();
inplace_merge(beg, mid, end);
}
void printArray(int *arr, size_t sz)
{
for (size_t i = 0; i != sz; i++)
cout << arr[i] << ' ';
cout << endl;
}
int main()
{
int ia[] = {5,3,6,8,4,6,2,5,2,9,7,8,4,2,6,8};
int ia2[] = {5,3,6,8,4,6,2,5,2,9,7,8,4,2,6,8};
size_t iaSize = 16;
size_t ia2Size = 16;
qsort(ia, ia + iaSize);
printArray(ia, iaSize);
qsortMulti(ia2, ia2 + ia2Size);
printArray(ia2, ia2Size);
}
From the above piece of code it is clear I am simply trying to implement the same qsort function, but with multiple threads. The other questions and answers on Stack Overflow regarding related issues have led me to this version of the code, which leaves me with a very simple problem and related question:
What is causing the multithreaded section to cause segmentation faults?
To be clear: I do not require anyone to build a solution for me, I'd much rather have an indication or directions as to where to find the source of the segmentation fault, as I don't see it. Thanks in advance!
In order to make std::async return an object of type std::future<T>, the function you pass to it merely has to return T. Example:
int compute() { return 42; }
std::future<int> result = std::async(&compute);
In your case that means that qsortMulti is supposed to have the signature
void qsortMulti(int* beg, int* end);
and nothing has to be returned from it. In the code you provided, qsortMulti returns std::future<void> itself, which leads to std::async returning an object of type std::future<std::future<void>>, which is probably not what you intended. Furthermore, your function only returns something when the range has at most one element (in the if at the top). In all other code paths (e.g. reaching the end of the function) you are not returning anything at all. Flowing off the end of a non-void function is undefined behavior, and the caller then accesses an uninitialized object, which may be the reason for the seg fault.
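A possible corrected version, as a sketch rather than a tuned implementation, could look like the following. Two things here are my own additions and not part of the question: the depth parameter, which caps how many levels spawn new threads, and the left range ending at mid - 1, which skips the pivot because it is already in its final place. The inplace_merge call is dropped since quicksort partitions in place and has nothing to merge.

```cpp
#include <algorithm>
#include <future>
#include <utility>

// Returns void; std::async wraps that in a std::future<void> by itself.
// depth (an assumption, not in the question) limits thread creation.
void qsortMulti(int* beg, int* end, int depth = 3) {
    if (end - beg <= 1)
        return;
    const int pivot = *beg;
    int* mid = std::partition(beg + 1, end,
                              [pivot](int arg) { return arg < pivot; });
    std::swap(*beg, *(mid - 1));  // pivot lands at its final position, mid - 1
    if (depth > 0) {
        // One side of the recursion runs in a new thread...
        auto left = std::async(std::launch::async, qsortMulti,
                               beg, mid - 1, depth - 1);
        // ...the other side runs in the current thread.
        qsortMulti(mid, end, depth - 1);
        left.get();
    } else {
        qsortMulti(beg, mid - 1, 0);
        qsortMulti(mid, end, 0);
    }
}
```

The two halves never overlap, so the threads do not need any synchronization beyond waiting on the future.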

C++: Issues with Circular Buffer

I'm having some trouble writing a circular buffer in C++. Here is my code base at the moment:
circ_buf.h:
#ifndef __CIRC_BUF_H__
#define __CIRC_BUF_H__
#define MAX_DATA (25) // Arbitrary size limit
// The Circular Buffer itself
struct circ_buf {
int s; // Index of oldest reading
int e; // Index of most recent reading
int data[MAX_DATA]; // The data
};
/*** Function Declarations ***/
void empty(circ_buf*);
bool is_empty(circ_buf*);
bool is_full(circ_buf*);
void read(circ_buf*, int);
int overwrite(circ_buf*);
#endif // __CIRC_BUF_H__
circ_buf.cpp:
#include "circ_buf.h"
/*** Function Definitions ***/
// Empty the buffer
void empty(circ_buf* cb) {
cb->s = 0; cb->e = 0;
}
// Is the buffer empty?
bool is_empty(circ_buf* cb) {
// By common convention, if the start index is equal to the end
// index, our buffer is considered empty.
return cb->s == cb->e;
}
// Is the buffer full?
bool is_full(circ_buf* cb) {
// By common convention, if the start index is one greater than
// the end index, our buffer is considered full.
// REMEMBER: we still need to account for wrapping around!
return cb->s == ((cb->e + 1) % MAX_DATA);
}
// Read data into the buffer
void read(circ_buf* cb, int k) {
int i = cb->e;
cb->data[i] = k;
cb->e = (i + 1) % MAX_DATA;
}
// Overwrite data in the buffer
int overwrite(circ_buf* cb) {
int i = cb->s;
int k = cb->data[i];
cb->s = (i + 1) % MAX_DATA;
}
circ_buf_test.cpp:
#include <iostream>
#include <fstream>
#include <string>
#include <cstdlib>
#include "circ_buf.h"
int main(int argc, char** argv) {
// Our data source
std::string file = "million_numbers.txt";
std::fstream in(file, std::ios_base::in);
// The buffer
circ_buf buffer = { .s = 0, .e = 0, .data = {} };
for (int i = 0; i < MAX_DATA; ++i) {
int k = 0; in >> k; // Get next int from in
read(&buffer, k);
}
for (int i = 0; i < MAX_DATA; ++i)
std::cout << overwrite(&buffer) << std::endl;
}
The main issue I'm having is getting the buffer to write integers to its array. When I compile and run the main program (circ_buf_test), it just prints the same number 25 times, instead of what I expect it to print (the numbers 1 through 25 - "million_numbers.txt" is literally just the numbers 1 through 1000000). The number is 2292656, in case this may be important.
Does anyone have an idea about what might be going wrong here?
Your function overwrite(circ_buf* cb) returns nothing (there is no return statement in its body). So the code that prints the values can print anything (see "undefined behavior"):
for (int i = 0; i < MAX_DATA; ++i)
std::cout << overwrite(&buffer) << std::endl;
I expect you can find the cause of this "main issue" in the compilation log (see the lines starting with "Warning"). You can fix it this way:
int overwrite(circ_buf* cb) {
int i = cb->s;
int k = cb->data[i];
cb->s = (i + 1) % MAX_DATA;
return k;
}