I am trying to create a proof of concept for inter-thread communication by means of shared state: the main thread creates worker threads, giving each a separate vector by reference, lets each do its work and fill its vector with results, and finally collects the results.
However, weird things are happening for which I can't find an explanation other than some race between the initialization of the vectors and the launch of the worker threads. Here is the code.
#include <iostream>
#include <vector>
#include <thread>
class Case {
public:
int val;
Case(int i):val(i) {}
};
void
run_thread (std::vector<Case*> &case_list, int idx)
{
std::cout << "size in thread " << idx <<": " << case_list.size() << '\n';
for (int i=0; i<10; i++) {
case_list.push_back(new Case(i));
}
}
int
main(int argc, char **argv)
{
int nthrd = 3;
std::vector<std::thread> threads;
std::vector<std::vector<Case*>> case_lists;
for (int i=0; i<nthrd; i++) {
case_lists.push_back(std::vector<Case*>());
std::cout << "size of " << i << " in main:" << case_lists[i].size() << '\n';
threads.push_back( std::thread( run_thread, std::ref(case_lists[i]), i) );
}
std::cout << "All threads lauched.\n";
for (int i=0; i<nthrd; i++) {
threads[i].join();
for (const auto cp:case_lists[i]) {
std::cout << cp->val << '\n';
}
}
return 0;
}
Tested on repl.it (gcc 4.6.3), the program gives the following result:
size of 0 in main:0
size of 1 in main:0
size of 2 in main:0
All threads lauched.
size in thread 0: 18446744073705569740
size in thread 2: 0
size in thread 1: 0
terminate called after throwing an instance of 'std::bad_alloc'
what(): std::bad_alloc
exit status -1
On my computer, besides something like the above, I also get:
Segmentation fault (core dumped)
It appears thread 0 is getting a vector that hasn't been initialized, although the vector appears properly initialized in main.
To isolate the problem, I have tried going single threaded by changing the line:
threads.push_back( std::thread( run_thread, std::ref(case_lists[i]), i) );
to
run_thread(case_lists[i], i);
and commenting out:
threads[i].join();
Now the program runs as expected, with the "threads" running one after another before the main collects the results.
My question is: what is wrong with the multi-threaded version above?
References (and iterators) for a vector are invalidated any time the capacity of the vector changes. The exact rules for overallocation vary by implementation, but odds are, you've got at least one capacity change between the first push_back and the last, and all the references made before that final capacity increase are garbage the moment it occurs, invoking undefined behavior.
Either reserve your total vector size up front (so push_backs don't cause capacity increases), initialize the whole vector to the final size up front (so no resizes occur at all), or have one loop populate completely, then launch the threads (so all resizes occur before you extract any references). The simplest fix here would be to initialize it to the final size, changing:
std::vector<std::vector<Case*>> case_lists;
for (int i=0; i<nthrd; i++) {
case_lists.push_back(std::vector<Case*>());
std::cout << "size of " << i << " in main:" << case_lists[i].size() << '\n';
threads.push_back( std::thread( run_thread, std::ref(case_lists[i]), i) );
}
to:
std::vector<std::vector<Case*>> case_lists(nthrd); // Default initialize nthrd elements up front
for (int i=0; i<nthrd; i++) {
// No push_back needed
std::cout << "size of " << i << " in main:" << case_lists[i].size() << '\n';
threads.push_back( std::thread( run_thread, std::ref(case_lists[i]), i) );
}
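The reserve-based alternative can be sketched along the same lines. This is only a minimal sketch, not the asker's code: `run_workers` and `fill` are hypothetical names, and plain `int` results stand in for the `Case*` pointers. The point is that `reserve(nthrd)` runs before any reference is taken, so the later `push_back` calls on the outer vector cannot reallocate it.

```cpp
#include <functional>
#include <thread>
#include <vector>

// Worker: appends `count` results to the vector it was handed by reference.
void fill(std::vector<int>& results, int count) {
    for (int i = 0; i < count; ++i) {
        results.push_back(i);
    }
}

// Hypothetical helper: launches `nthrd` workers, each filling its own vector.
std::vector<std::vector<int>> run_workers(int nthrd, int count) {
    std::vector<std::vector<int>> lists;
    lists.reserve(nthrd); // capacity fixed up front: push_back below never reallocates

    std::vector<std::thread> threads;
    for (int i = 0; i < nthrd; ++i) {
        lists.push_back(std::vector<int>());
        // Safe: `lists` will not reallocate, so this reference stays valid.
        threads.emplace_back(fill, std::ref(lists[i]), count);
    }
    for (auto& t : threads) {
        t.join(); // all workers finish before the results are returned
    }
    return lists;
}
```

Calling `run_workers(3, 10)` should give three inner vectors of ten elements each. Reallocation of `threads` itself is harmless here, because nothing holds a reference into it.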
You might expect vectors to overallocate fairly aggressively from the start, but at least on many popular implementations this is not the case; the standard libraries shipped with gcc and clang both follow a strict doubling pattern, so each of the first three insertions reallocates (capacity goes to 1, then 2, then 4). The reference to the first element is invalidated by the insertion of the second, and the reference to the second is invalidated by the insertion of the third.
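If you want to see the growth pattern of your own standard library, something like the following can help. It is a rough sketch with a hypothetical helper name (`reallocation_points`); the values it reports are implementation-defined, so treat any particular sequence as an observation, not a guarantee.

```cpp
#include <cstddef>
#include <vector>

// Returns the element counts at which push_back changed the capacity,
// i.e. the insertions that invalidated all earlier references.
std::vector<std::size_t> reallocation_points(std::size_t n) {
    std::vector<int> v;
    std::vector<std::size_t> points;
    std::size_t last_capacity = v.capacity();
    for (std::size_t i = 0; i < n; ++i) {
        v.push_back(static_cast<int>(i));
        if (v.capacity() != last_capacity) {
            last_capacity = v.capacity();
            points.push_back(v.size()); // size right after the reallocation
        }
    }
    return points;
}
```

On an implementation that starts empty and doubles, the reported sizes would begin 1, 2, 3, 5, 9, which is exactly why references taken during the first three `push_back` calls in the question go stale.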
I'm messing around with multithreading in C++ and here is my code:
#include <iostream>
#include <vector>
#include <string>
#include <thread>
void read(int i);
bool isThreadEnabled;
std::thread threads[100];
int main()
{
isThreadEnabled = true; // I change this to compare the threaded vs non threaded method
if (isThreadEnabled)
{
for (int i = 0;i < 100;i++) //this for loop is what I'm confused about
{
threads[i] = std::thread(read,i);
}
for (int i = 0; i < 100; i++)
{
threads[i].join();
}
}
else
{
for (int i = 0; i < 100; i++)
{
read(i);
}
}
}
void read(int i)
{
int w = 0;
while (true) // wasting cpu cycles to actually see the difference between the threaded and non threaded
{
++w;
if (w == 100000000) break;
}
std::cout << i << std::endl;
}
In the for loop that uses threads, the console prints the values in a random order (e.g. 5, 40, 26, ...), which is expected and totally fine, since threads don't necessarily run in the order in which they were started.
What confuses me is that some of the printed values are larger than anything i can reach (i stays below 100): values like 8000, 2032, 274 are also printed to the console, even though i never gets that high. I don't understand why.
This line:
std::cout << i << std::endl;
is actually equivalent to
std::cout << i;
std::cout << std::endl;
So while the line is thread safe (meaning there's no undefined behaviour), the order in which the individual writes of different threads interleave is unspecified. Given two threads, the following execution is possible:
T20: std::cout << 20
T32: std::cout << 32
T20: std::cout << std::endl
T32: std::cout << std::endl
which results in 2032 on the console (the two numbers glued together), followed by an empty line.
The simplest (not necessarily the best) fix for that is to guard this line with a mutex shared by all threads:
{
    // `mutex` is a std::mutex shared by all threads, e.g. a global (requires <mutex>)
    std::lock_guard<std::mutex> lg { mutex };
    std::cout << i << std::endl;
}
(the braces for a separate scope are not needed if std::cout << i << std::endl; is the last line in the function)
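Put together, the mutex approach might look like the sketch below. The names `out_mutex`, `print_value` and `run_printers` are made up for illustration, and the output goes to a `std::ostringstream` instead of `std::cout` only so that the result can be inspected afterwards; the locking is the same either way.

```cpp
#include <functional>
#include <iostream>
#include <mutex>
#include <sstream>
#include <string>
#include <thread>
#include <vector>

std::mutex out_mutex; // hypothetical name: one mutex shared by every writer

// Writes the number and the newline as one unit: no other thread can
// slip its own output in between while the lock is held.
void print_value(std::ostream& out, int i) {
    std::lock_guard<std::mutex> lock(out_mutex);
    out << i << '\n';
}

// Starts n threads that each print their index, and returns everything printed.
std::string run_printers(int n) {
    std::ostringstream out;
    std::vector<std::thread> threads;
    for (int i = 0; i < n; ++i) {
        threads.emplace_back(print_value, std::ref(out), i);
    }
    for (auto& t : threads) {
        t.join();
    }
    return out.str();
}
```

The lines can still come out in any order, but each line holds exactly one number, so glued values like 2032 no longer appear.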
I am trying to run a hello-world DPC++ sample of oneAPI which adds two 1-D Arrays on both CPU and GPU, and verifies the results. Code is shown below:
/*
DataParallel Addition of two Vectors
*/
#include <CL/sycl.hpp>
#include <array>
#include <iostream>
using namespace sycl;
constexpr size_t array_size = 100000;
typedef std::array<int, array_size> IntArray;
// Initialize array with the same value as its index
void InitializeArray(IntArray& a) { for (size_t i = 0; i < a.size(); i++) a[i] = i; }
/*
Create an asynchronous Exception Handler for sycl
*/
static auto exception_handler = [](cl::sycl::exception_list eList) {
for (std::exception_ptr const& e : eList) {
try {
std::rethrow_exception(e);
}
catch (std::exception const& e) {
std::cout << "Failure" << std::endl;
std::terminate();
}
}
};
void VectorAddParallel(queue &q, const IntArray& x, const IntArray& y, IntArray& parallel_sum) {
range<1> num_items{ x.size() };
buffer x_buf(x);
buffer y_buf(y);
buffer sum_buf(parallel_sum.data(), num_items);
/*
Submit a command group to the queue by a lambda
which contains data access permissions and device computation
*/
q.submit([&](handler& h) {
auto xa = x_buf.get_access<access::mode::read>(h);
auto ya = y_buf.get_access<access::mode::read>(h);
auto sa = sum_buf.get_access<access::mode::write>(h);
std::cout << "Adding on GPU (Parallel)\n";
h.parallel_for(num_items, [=](id<1> i) { sa[i] = xa[i] + ya[i]; });
std::cout << "Done on GPU (Parallel)\n";
});
/*
queue runs the kernel asynchronously. Once beyond the scope,
buffers' data is copied back to the host.
*/
}
int main() {
default_selector d_selector;
IntArray a, b, sequential, parallel;
InitializeArray(a);
InitializeArray(b);
try {
// Queue needs: Device and Exception handler
queue q(d_selector, exception_handler);
std::cout << "Accelerator: "
<< q.get_device().get_info<info::device::name>() << "\n";
std::cout << "Vector size: " << a.size() << "\n";
VectorAddParallel(q, a, b, parallel);
}
catch (std::exception const& e) {
std::cout << "Exception while creating Queue. Terminating...\n";
std::terminate();
}
/*
Do the sequential, which is supposed to be slow
*/
std::cout << "Adding on CPU (Scalar)\n";
for (size_t i = 0; i < sequential.size(); i++) {
sequential[i] = a[i] + b[i];
}
std::cout << "Done on CPU (Scalar)\n";
/*
Verify results, the old-school way
*/
for (size_t i = 0; i < parallel.size(); i++) {
if (parallel[i] != sequential[i]) {
std::cout << "Fail: " << parallel[i] << " != " << sequential[i] << std::endl;
std::cout << "Failed. Results do not match.\n";
return -1;
}
}
std::cout << "Success!\n";
return 0;
}
With a relatively small array_size, (I tested 100-50k elements) the computation works out to be fine.
Sample output:
Accelerator: Intel(R) Gen9
Vector size: 50000
Adding on GPU (Parallel)
Done on GPU (Parallel)
Adding on CPU (Scalar)
Done on CPU (Scalar)
Success!
It can be noted that it takes barely a second to finish the computation on both CPU and GPU.
But when I increase the array_size to, say, 100000, I get this cryptic error:
C:\Users\myuser\source\repos\dpcpp-iotas\x64\Debug\dpcpp-iotas.exe (process 24472) exited with code -1073741571.
I am not sure at what precise value the error starts occurring, but it seems to happen above roughly 70000 elements. I have no idea why this is happening. Any insights into what could be wrong?
Turns out, this is due to the stack size limit enforced by VS. A contiguous array with too many elements resulted in a stack overflow.
As mentioned by @user4581301, the exit code -1073741571, read as an unsigned 32-bit value in hex, is C00000FD, which is the Windows status code for stack exhaustion/overflow (STATUS_STACK_OVERFLOW).
Ways to fix this:
Increase the /STACK reserve to something higher than 1MB (this is the default) in the Project Properties > Linker > System > Stack Reserve/Commit values.
Use a binary editor (editbin.exe and dumpbin.exe) to edit /STACK:reserve.
Use std::vector instead, which allocates its elements dynamically (suggested by @Retired Ninja).
I couldn't find an option to change /STACK for the oneAPI project the normal way, through the Linker properties, shown here.
I decided to go with dynamic allocation.
Related: https://stackoverflow.com/a/26311584/9230398
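The numbers line up with the 1 MB default. Assuming a 4-byte int and ignoring other stack usage, the four IntArray locals in main take 4 × 100000 × 4 bytes = 1.6 MB, while at 50000 elements they take 0.8 MB, so the limit is crossed at roughly 65000 elements. A minimal sketch of the std::vector route, leaving the SYCL parts out (`IntVector`, `make_indexed` and `add` are hypothetical names):

```cpp
#include <cstddef>
#include <numeric>
#include <vector>

// Heap-backed stand-in for the stack-allocated IntArray (assumed alias name).
using IntVector = std::vector<int>;

// Initialize each element with its own index, like InitializeArray does.
IntVector make_indexed(std::size_t n) {
    IntVector v(n);                   // elements live on the heap, not the stack
    std::iota(v.begin(), v.end(), 0); // 0, 1, 2, ...
    return v;
}

// Sequential reference sum, element by element.
IntVector add(const IntVector& x, const IntVector& y) {
    IntVector sum(x.size());
    for (std::size_t i = 0; i < x.size(); ++i) {
        sum[i] = x[i] + y[i];
    }
    return sum;
}
```

Only the small vector objects themselves sit on the stack, so the element count no longer matters for stack usage. On the SYCL side, a buffer can be built from `v.data()` and a range, the same way the question already does for sum_buf.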
When I program big applications I always do a
ulimit -s unlimited
to explain to the shell that I am grown up and I really want some space on my stack.
This is the bash syntax, but you can obviously adapt it to other shells.
I guess there might be an equivalent for non-UNIX OS?
I know that std::map is not thread-safe when one thread reads while another writes. But is it OK to insert from multiple threads?
void writeMap()
{
for (int i = 0; i < 1000; i++)
{
long long random_variable = (std::rand()) % 1000;
std::cout << "Thread ID -> " << std::this_thread::get_id() << " with looping index " << i << std::endl;
k1map.insert(std::make_pair(i, new p(i)));
}
}
int main()
{
std::srand((int)std::time(0));
for (int i = 0; i < 1000; ++i)
{
long long random_variable = (std::rand()) % 1000;
std::thread t(writeMap);
std::cout << "Thread created " << t.get_id() << std::endl;
t.detach();
}
return 0;
}
Code like this seems to run normally no matter how many times I try.
The program is complex and, to some extent, behaves like magic (LOL).
The results of running the code differ between IDEs.
Previously, with VS2013, it was always right.
But with VS2019 and on Linux, the result of the same code is wrong.
Maybe the std::map implementation in VS2013 does something special.
No, std::map::insert is not thread-safe.
Most standard library types are thread safe only if you are using separate object instances in separate threads. Take a look at the thread safety part of container's docs.
As @NutCracker has mentioned, std::map::insert is not thread-safe.
But if the posted code appears to work fine, I think the reason is that the map is filled very quickly by one thread; after that, the other threads only try to insert keys that already exist, so they no longer modify the map.
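If several threads really have to insert into one map, the usual approach is to serialize the inserts with a mutex. The following is a minimal sketch under that assumption; `fill_map` is a hypothetical helper, and the keys are chosen so that every thread inserts distinct entries. It also joins the threads instead of detaching them, so the map is guaranteed to outlive its writers.

```cpp
#include <map>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical helper: num_threads workers each insert per_thread distinct keys.
std::map<int, int> fill_map(int num_threads, int per_thread) {
    std::map<int, int> shared;
    std::mutex map_mutex; // guards every access to `shared`

    std::vector<std::thread> threads;
    for (int t = 0; t < num_threads; ++t) {
        threads.emplace_back([&shared, &map_mutex, t, per_thread] {
            for (int i = 0; i < per_thread; ++i) {
                std::lock_guard<std::mutex> lock(map_mutex);
                shared.emplace(t * per_thread + i, t); // key -> id of inserting thread
            }
        });
    }
    for (auto& th : threads) {
        th.join(); // wait for all writers before reading the map
    }
    return shared;
}
```

With the lock in place, `fill_map(8, 1000)` should always end up with 8000 entries; without it, the behaviour is undefined, even if a particular compiler or run happens to look fine.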
Can iterating over an unsorted data structure, such as an array or a tree, with multiple threads make it faster?
For example, I have a big array of unsorted data.
int array[1000];
I'm searching array[i] == 8
Can running:
Thread 1:
for(auto i = 0; i < 500; i++)
{
if(array[i] == 8)
std::cout << "found" << std::endl;
}
Thread 2:
for(auto i = 500; i < 1000; i++)
{
if(array[i] == 8)
std::cout << "found" << std::endl;
}
be faster than normal iteration?
Update:
I've written a simple test which describes the problem better.
Searching int* array = new int[100000000];
and repeating the search 1000 times,
I got this result:
a
Number of threads = 2
End of multithread iteration
End of normal iteration
Time with 2 threads 73581
Time with 1 thread 154070
Bool values:0
0
0
Process returned 0 (0x0) execution time : 256.216 s
Press any key to continue.
What's more, when the program was running with 2 threads, the CPU usage of the process was around 90%, and when iterating with 1 thread it was never more than 50%.
So Smeeheey and erip are right that it can make iteration faster.
Of course, it can be trickier for less trivial problems.
I also learned from this test that the compiler can optimize away work in the main thread (when I was not printing the boolean that stores the result, the search loop in the main thread was dropped), but it did not do that for the other threads.
This is code I have used:
#include<cstdlib>
#include<thread>
#include<ctime>
#include<iostream>
#define SIZE_OF_ARRAY 100000000
#define REPEAT 1000
inline bool threadSearch(int* array){
for(auto i = 0; i < SIZE_OF_ARRAY/2; i++)
if(array[i] == 101) // there is no array[i]==101
return true;
return false;
}
int main(){
int i;
std::cin >> i; // stops program enabling to set real time priority of the process
clock_t with_multi_thread;
clock_t normal;
srand(time(NULL));
std::cout << "Number of threads = "
<< std::thread::hardware_concurrency() << std::endl;
int* array = new int[SIZE_OF_ARRAY];
bool true_if_found_t1 =false;
bool true_if_found_t2 =false;
bool true_if_found_normal =false;
for(auto i = 0; i < SIZE_OF_ARRAY; i++)
array[i] = rand()%100;
with_multi_thread=clock();
for(auto j=0; j<REPEAT; j++){
std::thread t([&](){
if(threadSearch(array))
true_if_found_t1=true;
});
std::thread u([&](){
if(threadSearch(array+SIZE_OF_ARRAY/2))
true_if_found_t2=true;
});
if(t.joinable())
t.join();
if(u.joinable())
u.join();
}
with_multi_thread=(clock()-with_multi_thread);
std::cout << "End of multithread iteration" << std::endl;
for(auto i = 0; i < SIZE_OF_ARRAY; i++)
array[i] = rand()%100;
normal=clock();
for(auto j=0; j<REPEAT; j++)
for(auto i = 0; i < SIZE_OF_ARRAY; i++)
if(array[i] == 101) // there is no array[i]==101
true_if_found_normal=true;
normal=(clock()-normal);
std::cout << "End of normal iteration" << std::endl;
std::cout << "Time with 2 threads " << with_multi_thread<<std::endl;
std::cout << "Time with 1 thread " << normal<<std::endl;
std::cout << "Bool values:" << true_if_found_t1<<std::endl
<< true_if_found_t2<<std::endl
<<true_if_found_normal<<std::endl;// showing bool values to prevent compiler from optimization
return 0;
}
The answer is yes, it can make it faster, but not necessarily. In your case, when you're iterating over pretty small arrays, it is likely that the overhead of launching a new thread will be much higher than the benefit gained. If your array were much bigger, that overhead would shrink as a proportion of the overall runtime and threading would eventually become worth it. Note that you will only get a speed-up if your system has more than 1 physical core available to it.
Additionally, you should note that whilst the code that reads the array in your case is perfectly thread-safe, the writes to std::cout from both threads are not synchronised, so their output can interleave (you will get very strange looking output if you try this). Instead, perhaps your thread should do something like return an integer indicating the number of instances found.
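As a rough illustration of that last suggestion, each thread can count matches in its own slice and write the count into its own slot, with the main thread adding the slots up after joining. This is a sketch, not a tuned implementation; `count_parallel` and its parameters are hypothetical names.

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <thread>
#include <vector>

// Counts occurrences of `target` using `parts` threads, one contiguous slice each.
std::size_t count_parallel(const std::vector<int>& data, int target, std::size_t parts) {
    if (parts == 0) {
        parts = 1;
    }
    // One result slot per thread, sized up front so no thread shares a slot.
    std::vector<std::size_t> counts(parts, 0);
    std::vector<std::thread> threads;
    const std::size_t chunk = (data.size() + parts - 1) / parts; // ceiling division

    for (std::size_t p = 0; p < parts; ++p) {
        const std::size_t begin = std::min(p * chunk, data.size());
        const std::size_t end = std::min(begin + chunk, data.size());
        threads.emplace_back([&data, &counts, target, begin, end, p] {
            // Read-only access to data; the only write goes to this thread's slot.
            counts[p] = static_cast<std::size_t>(
                std::count(data.begin() + begin, data.begin() + end, target));
        });
    }
    for (auto& t : threads) {
        t.join();
    }
    return std::accumulate(counts.begin(), counts.end(), std::size_t{0});
}
```

No locking is needed because the threads never touch the same element of `counts`, and nothing is printed from inside the workers. Whether this beats a plain loop still depends on the array size and the number of cores, as described above.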
My code:
#include <iostream>
#include <thread>
void function_1()
{
std::cout << "Thread t1 started!\n";
for (int j=0; j>-100; j--) {
std::cout << "t1 says: " << j << "\n";
}
}
int main()
{
std::thread t1(function_1); // t1 starts running
for (int i=0; i<100; i++) {
std::cout << "from main: " << i << "\n";
}
t1.join(); // main thread waits for t1 to finish
return 0;
}
I create a thread that prints numbers in decreasing order while main prints in increasing order.
Sample output here. Why is my code printing garbage?
Both threads are outputting at the same time, thereby scrambling your output.
You need some kind of thread synchronization mechanism on the printing part.
See this answer for an example using a std::mutex combined with std::lock_guard for cout.
It's not "garbage" — it's the output you asked for! It's just jumbled up, because you have used a grand total of zero synchronisation mechanisms to prevent individual std::cout << ... << std::endl lines (which are not atomic) from being interrupted by similar lines (which are still not atomic) in the other thread.
Traditionally we'd lock a mutex around each of those lines.
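For lines built from several pieces, like the ones in the question, one possible shape is a small helper that formats the whole line first and then writes it while holding the lock. This is a sketch under assumptions: `line_mutex`, `print_line` and `run_demo` are made-up names, and the demo writes to a `std::ostringstream` only so that the result can be checked; passing `std::cout` works the same way.

```cpp
#include <iostream>
#include <mutex>
#include <sstream>
#include <string>
#include <thread>

std::mutex line_mutex; // hypothetical name: shared by every thread that prints

// Formats all arguments into one line, then writes that line under the lock.
template <typename... Args>
void print_line(std::ostream& out, const Args&... args) {
    std::ostringstream line; // private to this call, so no lock needed yet
    int unused[] = {0, ((void)(line << args), 0)...}; // stream each argument in order
    (void)unused;
    line << '\n';
    std::lock_guard<std::mutex> lock(line_mutex);
    out << line.str(); // the complete line goes out in one piece
}

// Mirrors the question: one thread counts down, the caller counts up.
std::string run_demo(int n) {
    std::ostringstream out;
    std::thread t1([&out, n] {
        for (int j = 0; j > -n; --j) {
            print_line(out, "t1 says: ", j);
        }
    });
    for (int i = 0; i < n; ++i) {
        print_line(out, "from main: ", i);
    }
    t1.join();
    return out.str();
}
```

The two threads' lines still alternate unpredictably, but every line is intact. Formatting outside the lock keeps the critical section short.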