Correct place to use cudaSetDeviceFlags?

Win10 x64, CUDA 8.0, VS2015, 6-core CPU (12 logical cores), 2 GTX580 GPUs.
In general, I'm working on a multithreaded application that launches 2 threads, each associated with one of the 2 available GPUs; these threads are stored in a thread pool.
Each thread performs the following initialization procedure upon its launch (i.e. it is done only once during the lifetime of each thread):
::cudaSetDevice(0 or 1, as we have only two GPUs);
::cudaDeviceSetCacheConfig(cudaFuncCachePreferL1);
::cudaSetDeviceFlags(cudaDeviceMapHost | cudaDeviceScheduleBlockingSync);
Then, from other worker threads (12 more threads that do not touch the GPUs at all), I begin feeding these 2 GPU-associated worker threads with data. It works perfectly as long as the number of GPU threads being launched is equal to the number of physical GPUs available.
Now I want to launch 4 GPU threads (i.e. 2 threads per GPU) and make each one work via a separate CUDA stream. I know the requirements that are essential for proper CUDA stream usage, and I meet all of them. What I'm failing on is the initialization procedure mentioned above.
As soon as this procedure is executed twice from different GPU threads but for the same GPU, ::cudaSetDeviceFlags(...) starts failing with the "cannot set while device is active in this process" error message.
I have looked into the manual and I think I understand why this happens; what I can't understand is how to use ::cudaSetDeviceFlags(...) properly for my setup.
I can comment out this ::cudaSetDeviceFlags(...) line and the program will work fine even with 8 threads per GPU, but I need the cudaDeviceMapHost flag to be set in order to use streams; pinned memory won't be available otherwise.
EDIT Extra info to consider #1:
If ::cudaSetDeviceFlags is called before ::cudaSetDevice, no error occurs.
Each GPU thread allocates a chunk of pinned memory via the ::VirtualAlloc -> ::cudaHostRegister approach upon thread launch (this works just fine no matter how many GPU threads are launched) and deallocates it upon thread termination (via ::cudaHostUnregister -> ::VirtualFree). ::cudaHostUnregister fails with "pointer does not correspond to a registered memory region" for half the threads if the number of threads per GPU is greater than 1.

Well, the highly sophisticated method of trythis-trythat-seewhathappens-tryagain finally did the trick, as always.
Here is the excerpt from the documentation on ::cudaSetDeviceFlags():
Records flags as the flags to use when initializing the current
device. If no device has been made current to the calling thread, then
flags will be applied to the initialization of any device initialized
by the calling host thread, unless that device has had its
initialization flags set explicitly by this or any host thread.
Consequently, in the GPU worker thread it is necessary to call ::cudaSetDeviceFlags() before ::cudaSetDevice().
I have implemented something like this in the GPU thread initialization code, in order to make sure that the device flags set before selecting the device are actually applied properly:
bse__throw_CUDAHOST_FAILED(::cudaSetDeviceFlags(nFlagsOfDesire));
bse__throw_CUDAHOST_FAILED(::cudaSetDevice(nDevice));
unsigned int nDeviceFlagsActual = 0;
bse__throw_CUDAHOST_FAILED(::cudaGetDeviceFlags(&nDeviceFlagsActual));
bse__throw_IF(nFlagsOfDesire != nDeviceFlagsActual);
Also, the comment of talonmies showed the way to resolve the ::cudaHostUnregister errors.
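With several worker threads per GPU, one way to keep device-wide setup from being attempted twice is to guard it with a small once-per-device registry. The sketch below is host-only C++ and is not taken from the answer above: DeviceInitRegistry and init_once are hypothetical names, and the callback merely stands in for whatever CUDA calls the setup needs. Each thread would still make its own ::cudaSetDevice() call, since the current device is per-thread state.

```cpp
#include <cassert>
#include <functional>
#include <mutex>
#include <set>

// Hypothetical helper: runs a device-wide setup callback at most once per
// device id, no matter how many worker threads ask for it.
class DeviceInitRegistry {
public:
    // Returns true if `setup` ran in this call, false if the device
    // had already been set up by an earlier call.
    bool init_once(int device, const std::function<void(int)>& setup) {
        assert(setup && "a setup callback is required");
        std::lock_guard<std::mutex> lock(mutex_);
        if (done_.count(device) != 0) {
            return false;  // another thread already configured this device
        }
        setup(device);         // stand-in for the flag-setting CUDA calls
        done_.insert(device);  // mark only after the callback returned normally
        return true;
    }

private:
    std::mutex mutex_;
    std::set<int> done_;
};
```

A GPU worker thread would call init_once(nDevice, ...) at the start of its initialization and then continue with its per-thread setup regardless of the return value.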

Related

TBB thread pool unexpectedly increasing

We have a piece of code that uses TBB to spawn tasks that perform some processing. The TBB thread pool is initialised with the following code:
tbb::task_scheduler_init(8);
Then for each task we want to spawn we use the following code (where MainTask is derived from the tbb::task class):
task = new (tbb::task::allocate_root()) MainTask(theAction, theOutputData);
tbb::task::enqueue(*task);
When we run our code we start off with a thread pool whose size equals the number of cores (in our case 8 threads), as expected. But as the program executes and spawns new TBB tasks as described above, the number of threads suddenly increases at seemingly random points. After 40 minutes of program execution the thread count has increased from 8 to 15.
Why is this happening? Shouldn’t TBB keep the number of worker threads fixed to equal the number of cores?

As I said in another answer to you: Don't worry :-)
TBB does a great job of preventing actual over-subscription: only 8 threads will be active in your program at the same time. For various reasons, though, it sometimes needs more threads than there are hardware resources. One example is tbb::task_arena with no master slots reserved; another, more recent addition is the tbb::global_control class, which allows changing the number of active threads in the pool dynamically. Unfortunately, the way TBB implements this leaves some room for a data race. It happens when some threads are on their way back to the thread pool to get some sleep while new work arrives and requests all 8 threads to start processing immediately; the threads in this intermediate state are not yet accounted for in the thread pool, so new threads are created instead.
TBB reduced the window for this data race as much as possible, but closing it completely would require synchronization on the hot path, which would affect general performance. Thus the decision was made to allow the data race and have fewer obstacles on the hot path.
But again, don't worry: there is no resource leak, because TBB has a hard limit on the maximum number of threads it can create this way. Depending on the platform, this number varies somewhere from 2x to 4x (though these are internal implementation specifics that keep changing).
Still, I'm surprised that it goes as far as 15 threads, and I understand your concerns. The TBB team would appreciate it if you shared a reproducer with them, through either the TBB Forum or the OSS site.

Creating a cuda stream on each host thread (multi-threaded CPU)

I have a multi-threaded CPU and I would like each thread of the CPU to be able to launch a separate CUDA stream. The separate CPU threads will be doing different things at different times, so there is a chance that they won't overlap, but if they do launch a CUDA kernel at the same time I would like the kernels to run concurrently.
I'm pretty sure this is possible because section 3.2.5.5 of the CUDA Toolkit documentation says "A stream is a sequence of commands (possibly issued by different host threads)..."
So if I want to implement this I would do something like
void main(int CPU_ThreadID) {
    cudaStream_t stream;                  // a stream handle, not a pointer to one
    cudaStreamCreate(&stream);
    int *d_a;
    int *a;
    cudaMalloc((void**)&d_a, 100*sizeof(int));
    cudaMallocHost((void**)&a, 100*8*sizeof(int));
    // copy this thread's 100-element slice of the pinned host buffer
    cudaMemcpyAsync(d_a, &a[100*CPU_ThreadID], 100*sizeof(int), cudaMemcpyHostToDevice, stream);
    sum<<<100,32,0,stream>>>(d_a);
    cudaStreamDestroy(stream);
}
That is just a simple example. If I know there are only 8 CPU Threads then I know at most 8 streams will be created. Is this the proper way to do this? Will this run concurrently if two or more different host threads reach this code around the same time? Thanks for any help!
Edit:
I corrected some of the syntax issues in the code block and put in the cudaMemcpyAsync as sgar91 suggested.

It really looks to me like you are proposing a multi-process application, not multithreaded. You don't mention which threading architecture you have in mind, nor even an OS, but the threading architectures I know of don't posit a thread routine called "main", and you haven't shown any preamble to the thread code.
A multi-process environment will generally create one device context per process, which will inhibit fine-grained concurrency.
Even if that's just an oversight, I would point out that a multi-threaded application should establish a GPU context on the desired device before threads are spawned.
Each thread can then issue a cudaSetDevice(0); or similar call, which should cause each thread to pick up the established context on the indicated device.
Once that is in place, you should be able to issue commands to the desired streams from whichever threads you like.
You may wish to refer to the cudaOpenMP sample code. Although it omits the streams concept, it demonstrates a multi-threaded app with the potential for multiple threads to issue commands to the same device (and could be extended to the same stream).
Whether kernels actually run concurrently once the above issues have been addressed is a separate question. Concurrent kernel execution has a number of requirements, and the kernels themselves must have compatible resource requirements (blocks, shared memory, registers, etc.), which generally implies "small" kernels.

Fork and core dump with threads

Similar points to the one in this question have been raised before here and here, and I'm aware of the Google coredump library (which I've appraised and found lacking, though I might try and work on that if I understand the problem better).
I want to obtain a core dump of a running Linux process without interrupting the process. The natural approach is to say:
if (!fork()) { abort(); }
Since the forked process gets a fixed snapshot copy of the original process's memory, I should get a complete core dump, and since the copy uses copy-on-write, it should generally be cheap. However, a critical shortcoming of this approach is that fork() only forks the current thread, and all other threads of the original process won't exist in the forked copy.
My question is whether it is possible to somehow obtain the relevant data of the other, original threads. I'm not entirely sure how to approach this problem, but here are a couple of sub-questions I've come up with:
Is the memory that contains all of the threads' stacks still available and accessible in the forked process?
Is it possible to (quickly) enumerate all the running threads in the original process and store the addresses of the bases of their stacks? As I understand it, the base of a thread stack on Linux contains a pointer to the kernel's thread bookkeeping data, so...
with the stored thread base addresses, could you read out the relevant data for each of the original threads in the forked process?
If that is possible, perhaps it would only be a matter of appending the data of the other threads to the core dump. However, if that data is lost at the point of the fork already, then there doesn't seem to be any hope for this approach.

Are you familiar with process checkpoint-restart? In particular, CRIU? It seems to me it might provide an easy option for you.
I want to obtain a core dump of a running Linux process without interrupting the process [and] to somehow obtain the relevant data of the other, original threads.
Forget about not interrupting the process. If you think about it, a core dump has to interrupt the process for the duration of the dump; your true goal must therefore be to minimize the duration of this interruption. Your original idea of using fork() does interrupt the process, it just does so for a very short time.
Is the memory that contains all of the threads' stacks still available and accessible in the forked process?
No. The fork() only retains the thread that does the actual call, and the stacks for the rest of the threads are lost.
Here is the procedure I'd use, assuming CRIU is unsuitable:
Have a parent process that generates a core dump of the child process whenever the child is stopped. (Note that more than one consecutive stop event may be generated; only the first one until the next continue event should be acted on.)
You can detect the stop/continue events using waitpid(child,,WUNTRACED|WCONTINUED).
Optional: Use sched_setaffinity() to restrict the process to a single CPU, and sched_setscheduler() (and perhaps sched_setparam()) to drop the process priority to IDLE.
You can do this from the parent process, which only needs the CAP_SYS_NICE capability (which you can give it using setcap 'cap_sys_nice=pe' parent-binary to the parent binary, if you have filesystem capabilities enabled like most current Linux distributions do), in both the effective and permitted sets.
The intent is to minimize the progress of the other threads between the moment a thread decides it wants a snapshot/dump, and the moment when all threads have been stopped. I have not tested how long it takes for the changes to take effect -- certainly they only happen at the end of their current timeslices at the very earliest. So, this step should probably be done a bit beforehand.
Personally, I don't bother. On my four-core machine, the following SIGSTOP alone yields similar latencies between threads as a mutex or a semaphore does, so I don't see any need to strive for even better synchronization.
When a thread in the child process decides it wants to take a snapshot of itself, it sends a SIGSTOP to itself (via kill(getpid(), SIGSTOP)). This stops all threads in the process.
The parent process will receive the notification that the child was stopped. It will first examine /proc/PID/task/ to obtain the TIDs for each thread of the child process (and perhaps the /proc/PID/task/TID/ pseudofiles for other information), then attach to each TID using ptrace(PTRACE_ATTACH, TID). ptrace(PTRACE_GETREGS, TID, ...) will then obtain the per-thread register states, which can be used in conjunction with /proc/PID/task/TID/smaps and /proc/PID/task/TID/mem to obtain the per-thread stack trace, and whatever other information you're interested in. (For example, you could create a debugger-compatible core file for each thread.)
When the parent process is done grabbing the dump, it lets the child process continue. I believe you need to send a separate SIGCONT signal to let the entire child process continue, instead of just relying on ptrace(PTRACE_CONT, TID), but I haven't checked this; do verify this, please.
I do believe that the above will yield a minimal delay in wall clock time between the threads in the process stopping. Quick tests on an AMD Athlon II X4 640 running Xubuntu with a 3.8.0-29-generic kernel indicate that tight loops incrementing a volatile variable in the other threads only advance the counters by a few thousand, depending on the number of threads (there's too much noise in the few tests I made to say anything more specific).
Limiting the process to a single CPU, and even to IDLE priority, will drastically reduce that delay even further. CAP_SYS_NICE capability allows the parent to not only reduce the priority of the child process, but also lift the priority back to original levels; filesystem capabilities mean the parent process does not even have to be setuid, as CAP_SYS_NICE alone suffices. (I think it'd be safe enough -- with some good checks in the parent program -- to be installed in e.g. university computers, where students are quite active in finding interesting ways to exploit the installed programs.)
It is possible to create a kernel patch (or module) that provides a boosted kill(getpid(), SIGSTOP) that also tries to kick off the other threads from running CPUs, and thus try to make the delay between the threads stopping even smaller. Personally, I wouldn't bother. Even without the CPU/priority manipulation I get sufficient synchronization (small enough delays between the times the threads are stopped).
Do you need some example code to illustrate my ideas above?
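As a starting point, the stop/continue handshake between parent and child can be sketched as follows. This is only a minimal illustration under stated assumptions: the function name observe_stop_then_resume is hypothetical, the child is created with fork() purely to keep the example self-contained, and the actual dump step (ptrace attach, reading registers and memory) is left as a comment.

```cpp
#include <cassert>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

// Hypothetical demo: returns true if the parent saw the child stop,
// resumed it, and then saw it exit normally with status 0.
bool observe_stop_then_resume() {
    const pid_t child = fork();
    if (child < 0) return false;
    if (child == 0) {
        kill(getpid(), SIGSTOP);  // child: stop the whole process to request a dump
        _exit(0);                 // reached only after the parent sends SIGCONT
    }

    int status = 0;
    // WUNTRACED makes waitpid() report stopped children, not only terminated ones.
    if (waitpid(child, &status, WUNTRACED) != child || !WIFSTOPPED(status)) {
        return false;
    }

    // ... a real parent would attach with ptrace and write the dump here ...

    if (kill(child, SIGCONT) != 0) return false;             // let the child continue
    if (waitpid(child, &status, 0) != child) return false;   // wait for termination
    return WIFEXITED(status) && WEXITSTATUS(status) == 0;
}
```

In a long-running setup the parent would loop on waitpid() with WUNTRACED|WCONTINUED instead of waiting for the child to exit.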

When you fork you get a full copy of the running process's memory. This includes all threads' stacks (after all, you could have valid pointers into them). But only the calling thread continues to execute in the child.
You can easily test this. Make a multithreaded program and run:
pid_t parent_pid = getpid();
if (!fork()) {
kill(parent_pid, SIGSTOP);
char buffer[0x1000];
pid_t child_pid = getpid();
sprintf(buffer, "diff /proc/%d/maps /proc/%d/maps", parent_pid, child_pid);
system(buffer);
kill(parent_pid, SIGTERM);
return 0;
} else for (;;);
So all your memory is there, and when you create a core dump it will contain all the other threads' stacks (provided your maximum core file size permits it). The only pieces that will be missing are their register sets. If you need those then you will have to ptrace your parent to obtain them.
You should keep in mind, though, that core dumps are not designed to contain runtime information for more than one thread: the one that caused the core dump.
To answer some of your other questions:
You can enumerate threads by going through /proc/[pid]/task, but you cannot identify their stack bases until you ptrace them.
Yes, you have full access to snapshots of the other threads' stacks (see above) from the forked process. It is not trivial to locate them, but they do get put into a core dump provided the core file size permits it. Your best bet is to save their locations in some globally accessible structure upon thread creation, if you can.
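Enumerating the threads of a process through /proc/[pid]/task can be sketched like this. It assumes a Linux system with /proc mounted; the function name list_thread_ids is a hypothetical helper, and the sketch only collects the thread IDs without attaching to anything.

```cpp
#include <cassert>
#include <cstdlib>
#include <dirent.h>
#include <string>
#include <sys/types.h>
#include <unistd.h>
#include <vector>

// Hypothetical helper: returns the thread IDs listed under /proc/<pid>/task,
// or an empty vector if the directory cannot be opened.
std::vector<pid_t> list_thread_ids(pid_t pid) {
    assert(pid > 0);
    std::vector<pid_t> tids;
    const std::string dir = "/proc/" + std::to_string(pid) + "/task";
    DIR* d = opendir(dir.c_str());
    if (d == nullptr) return tids;
    while (dirent* entry = readdir(d)) {
        if (entry->d_name[0] == '.') continue;  // skip "." and ".."
        // Every remaining entry name is a decimal thread ID.
        tids.push_back(static_cast<pid_t>(std::strtol(entry->d_name, nullptr, 10)));
    }
    closedir(d);
    return tids;
}
```

Calling list_thread_ids(parent_pid) in the forked child, while the parent is stopped, would give the set of TIDs to inspect further.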

If you want a core file at no specific location, i.e. just a core image of the running process without killing it, then you can use gcore.
If you want the core file at a specific location (condition) and still want the process to continue running, a crude approach is to execute gcore programmatically from that location.
A more classical, clean approach would be to look at the API that gcore uses and embed it in your application, but most of the time that would be too much effort compared to the need.
HTH!

If your goal is to snapshot the entire process in order to understand the exact state of all threads at a specific point then I can't see any way to do this that doesn't require some kind of interrupt service routine. You must halt all processors and record off the current state of each thread.
I don't know of any system that provides this kind of full process core dump. The rough outlines of the process would be:
issue an interrupt across all CPUs (both logical and physical cores).
busy wait for all cores to synchronize (this shouldn't take long).
clone the desired process's memory space: duplicate the page tables and mark all pages as copy on write.
have each processor check whether its current thread is in the target process. If so record the current stack pointer for that thread.
for every other thread examine the thread data block for the current stack pointer and record it.
create a kernel thread to save off the copied memory spaces and the thread stack pointers
resume all cores.
This should capture the entire process state, including a snapshot of any processes that were running at the moment the inter-processor interrupt was issued. Because all threads are interrupted (either through standard scheduler suspension process, or via our custom interrupt process) all register states will be on a stack somewhere in the process memory. You then only need to know where the top of each thread stack is. Using the copy on write mechanism to clone the page tables allows transparent save-off while the original process is allowed to resume.
This is a pretty heavyweight option, since its main functionality requires suspending all processors for a significant amount of time (synchronize, clone, walk all threads). However, this should allow you to exactly capture the status of all threads, as well as determine which threads were running (and on which CPUs) when the checkpoint was reached. I would assume some of the framework for doing this exists (in CRIU, for instance). Of course, resuming the process will result in a storm of page allocations as the copy-on-write mechanism protects the check-pointed system state.

How to reduce CUDA synchronize latency / delay

This question is related to using CUDA streams to run many kernels.
In CUDA there are many synchronization commands:
cudaStreamSynchronize,
cudaDeviceSynchronize,
cudaThreadSynchronize,
and also cudaStreamQuery to check if streams are empty.
I noticed when using the profiler that these synchronize commands introduce a large delay to the program. I was wondering if anyone knows any means to reduce this latency, apart from, of course, using as few synchronisation commands as possible.
Also, are there any figures for judging the most efficient synchronisation method? That is, consider 3 streams used in an application, two of which need to complete before I launch a fourth stream: should I use 2 cudaStreamSynchronize calls or just one cudaDeviceSynchronize, and which will incur less loss?

The main difference between synchronize methods is "polling" and "blocking."
"Polling" is the default mechanism for the driver to wait for the GPU - it waits for a 32-bit memory location to attain a certain value written by the GPU. It may return the wait more quickly after the wait is resolved, but while waiting, it burns a CPU core looking at that memory location.
"Blocking" can be requested by calling cudaSetDeviceFlags() with cudaDeviceScheduleBlockingSync, or calling cudaEventCreate() with cudaEventBlockingSync. Blocking waits cause the driver to insert a command into the DMA command buffer that signals an interrupt when all preceding commands in the buffer have been executed. The driver can then map the interrupt to a Windows event or a Linux file handle, enabling the synchronization commands to wait without constantly burning CPU, as do the default polling methods.
The queries are basically a manual check of that 32-bit memory location used for polling waits; so in most situations, they are very cheap. But if ECC is enabled, the query will dive into kernel mode to check if there are any ECC errors; and on Windows, any pending commands will be flushed to the driver (which requires a kernel thunk).
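The trade-off between the two waiting styles can be illustrated with a host-only analogy. The sketch below is not CUDA driver code: the Completion class is hypothetical, an atomic flag plays the role of the memory location written by the GPU, and a condition variable plays the role of the interrupt-backed event.

```cpp
#include <atomic>
#include <cassert>
#include <condition_variable>
#include <mutex>

// Host-only analogy: one completion flag, two ways of waiting for it.
class Completion {
public:
    // Marks the work as finished and wakes any blocked waiter.
    void signal() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            done_.store(true, std::memory_order_release);
        }
        cv_.notify_all();
    }

    bool is_done() const { return done_.load(std::memory_order_acquire); }

    // Polling: spin on the flag. Reacts as soon as the flag changes,
    // but keeps a CPU core busy for the whole wait.
    void wait_polling() const {
        while (!done_.load(std::memory_order_acquire)) {
        }
    }

    // Blocking: sleep until notified. Frees the core, at the price of
    // going through the OS scheduler to wake up.
    void wait_blocking() {
        std::unique_lock<std::mutex> lock(mutex_);
        cv_.wait(lock, [this] { return done_.load(std::memory_order_acquire); });
    }

private:
    std::atomic<bool> done_{false};
    std::mutex mutex_;
    std::condition_variable cv_;
};
```

A query in this analogy corresponds to a single is_done() call: one look at the flag, with no waiting at all.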

How much overhead is there when creating a thread?

I just reviewed some really terrible code: code that sends messages on a serial port by creating a new thread to package and assemble each message, for every single message sent. Yes, for every message a pthread is created, bits are properly set up, then the thread terminates. I haven't a clue why anyone would do such a thing, but it raises the question: how much overhead is there when actually creating a thread?

To resurrect this old thread, I just did some simple test code:
#include <thread>
int main(int argc, char** argv)
{
for (volatile int i = 0; i < 500000; i++)
std::thread([](){}).detach();
return 0;
}
I compiled it with g++ test.cpp -std=c++11 -lpthread -O3 -o test. I then ran it three times in a row on an old (kernel 2.6.18) heavily loaded (doing a database rebuild) slow laptop (Intel core i5-2540M). Results from three consecutive runs: 5.647s, 5.515s, and 5.561s. So we're looking at a tad over 10 microseconds per thread on this machine, probably much less on yours.
That's not much overhead at all, given that serial ports max out at around 1 bit per 10 microseconds. Now, of course there are various additional thread losses one can get involving passed/captured arguments (although function calls themselves can impose some), cache slowdowns between cores (if multiple threads on different cores are battling over the same memory at the same time), etc. But in general I highly doubt the use case you presented will adversely impact performance at all (and it could provide benefits, depending), despite your having already preemptively labeled the concept "really terrible code" without even knowing how much time it takes to launch a thread.
Whether it's a good idea or not depends a lot on the details of your situation. What else is the calling thread responsible for? What precisely is involved in preparing and writing out the packets? How frequently are they written out (with what sort of distribution? uniform, clustered, etc...?) and what's their structure like? How many cores does the system have? Etc. Depending on the details, the optimal solution could be anywhere from "no threads at all" to "shared thread pool" to "thread for each packet".
Note that thread pools aren't magic and can in some cases be a slowdown versus unique threads, since one of the biggest slowdowns with threads is synchronizing cached memory used by multiple threads at the same time, and thread pools by their very nature of having to look for and process updates from a different thread have to do this. So either your primary thread or child processing thread can get stuck having to wait if the processor isn't sure whether the other process has altered a section of memory. By contrast, in an ideal situation, a unique processing thread for a given task only has to share memory with its calling task once (when it's launched) and then they never interfere with each other again.
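For anyone who wants a per-thread figure directly rather than dividing a total run time by hand, a timed variant of the test can be sketched as below. The function name average_thread_cost_us is hypothetical, and unlike the detach() loop above it joins each thread, so it measures creation plus teardown with at most one extra thread alive at a time.

```cpp
#include <cassert>
#include <chrono>
#include <thread>

// Average wall-clock cost, in microseconds, of creating and joining
// one empty thread, measured over `count` iterations.
double average_thread_cost_us(int count) {
    assert(count > 0);
    const auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < count; ++i) {
        std::thread([] {}).join();  // create an empty thread and wait for it to finish
    }
    const std::chrono::duration<double, std::micro> elapsed =
        std::chrono::steady_clock::now() - start;
    return elapsed.count() / count;
}
```

The result depends heavily on the OS, the machine and its load, so it is only meaningful as a measurement of the system it runs on.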

I have always been told that thread creation is cheap, especially when compared to the alternative of creating a process. If the program you are talking about does not have a lot of operations that need to run concurrently then threading might not be necessary, and judging by what you wrote this might well be the case. Some literature to back me up:
http://www.personal.kent.edu/~rmuhamma/OpSystems/Myos/threads.htm
Threads are cheap in the sense that
They only need a stack and storage for registers therefore, threads are cheap to create.
Threads use very little resources of an operating system in
which they are working. That is,
threads do not need new address space,
global data, program code or operating
system resources.
Context switching are fast when working with threads. The reason is
that we only have to save and/or
restore PC, SP and registers.
More of the same here.
In Operating System Concepts 8th Edition (page 155) the authors write about the benefits of threading:
Allocating memory and resources for process creation is costly. Because
threads share the resource of the
process to which they belong, it is
more economical to create and
context-switch threads. Empirically
gauging the difference in overhead can
be difficult, but in general it is
much more time consuming to create and
manage processes than threads. In
Solaris, for example, creating a
process is about thirty times slower
than is creating a thread, and context
switching is about five times slower.

There is some overhead in thread creation, but comparing it with usually slow baud rates of the serial port (19200 bits/sec being the most common), it just doesn't matter.

...sends Messages on a serial port ... for every message a pthread is created, bits are properly set up, then the thread terminates. ...how much overhead is there when actually creating a thread?
This is highly system specific. For example, the last time I used VMS, threading was nightmarishly slow (it's been years, but from memory one thread could create something like 10 more per second, and if you kept that up for a few seconds without threads exiting you'd core), whereas on Linux you can probably create thousands. If you want to know exactly, benchmark it on your system. But knowing that is not much use without knowing more about the messages: whether they average 5 bytes or 100k, whether they're sent contiguously or the line idles in between, and what the latency requirements for the app are. All of these are as relevant to the appropriateness of the code's thread use as any absolute measurement of thread creation overhead. And performance may not have needed to be the dominant design consideration.

You definitely do not want to do this. Create a single thread or a pool of threads and just signal when messages are available. Upon receiving the signal, the thread can perform any necessary message processing.
In terms of overhead, thread creation/destruction, especially on Windows, is fairly expensive. Somewhere on the order of tens of microseconds, to be specific. It should, for the most part, only be done at the start/end of an app, with the possible exception of dynamically resized thread pools.
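The single-thread-plus-signal design can be sketched roughly as follows. This is a minimal illustration, not code from the application under discussion: MessageWorker, post and the handler callback are hypothetical names, and messages are plain strings for simplicity.

```cpp
#include <cassert>
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <string>
#include <thread>
#include <utility>

// Hypothetical single-worker queue: one long-lived thread handles every message.
class MessageWorker {
public:
    explicit MessageWorker(std::function<void(const std::string&)> handler)
        : handler_(std::move(handler)), thread_([this] { run(); }) {
        assert(handler_ && "a message handler is required");
    }

    // Lets the worker drain any queued messages, then stops and joins it.
    ~MessageWorker() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            stopping_ = true;
        }
        cv_.notify_one();
        thread_.join();
    }

    void post(std::string message) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            queue_.push(std::move(message));
        }
        cv_.notify_one();  // signal that a message is available
    }

private:
    void run() {
        std::unique_lock<std::mutex> lock(mutex_);
        for (;;) {
            cv_.wait(lock, [this] { return stopping_ || !queue_.empty(); });
            if (queue_.empty()) return;  // only reached once stopping_ is set
            std::string message = std::move(queue_.front());
            queue_.pop();
            lock.unlock();
            handler_(message);  // process outside the lock
            lock.lock();
        }
    }

    std::function<void(const std::string&)> handler_;
    std::mutex mutex_;
    std::condition_variable cv_;
    std::queue<std::string> queue_;
    bool stopping_ = false;
    std::thread thread_;  // declared last so the other members exist before run() starts
};
```

Because a single thread consumes the queue, messages are handled in the order they were posted.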

I used the above "terrible" design in a VOIP app I made. It worked very well ... absolutely no latency or missed/dropped packets for locally connected computers. Each time a data packet arrived in, a thread was created and handed that data to process it to the output devices. Of course the packets were large so it caused no bottleneck. Meanwhile the main thread could loop back to wait and receive another incoming packet.
I have tried other designs where the threads I need are created in advance, but this creates its own problems. First, you need to design your code properly for threads to retrieve the incoming packets and process them in a deterministic fashion. If you use multiple (pre-allocated) threads, it's possible that the packets may be processed 'out of order'. If you use a single (pre-allocated) thread to loop and pick up the incoming packets, there is a chance that the thread might encounter a problem and terminate, leaving no threads to process any data.
Creating a thread to process each incoming data packet works very cleanly, especially on multi-core systems and where incoming packets are large. Also to answer your question more directly, the alternative to thread creation is to create a run-time process that manages the pre-allocated threads. Being able to synchronize data hand-off and processing as well as detecting errors may add just as much, if not more overhead as just simply creating a new thread. It all depends on your design and requirements.

Thread creation and computing in a thread is pretty expensive.
All data structures need to be set up, the thread must be registered with the kernel, and a thread switch must occur so that the new thread actually gets executed (at an unspecified and unpredictable time).
Executing thread.start does not mean that the thread main function is called immediately.
As the article (mentioned by typoking) points out creation of a thread is cheap only compared to the creation of a process. Overall, it is pretty expensive.
I would never use a thread
for a short computation
for a computation whose result I need in my flow of code (that is, where I start the thread and then wait for it to return the result of its computation)
In your example, it would make sense (as has already been pointed out) to create a thread that handles all of the serial communication and is eternal.
hth
Mario

For comparison, take a look at OS X:
Kernel data structures: approximately 1 KB
Stack space: 512 KB (secondary threads), 8 MB (OS X main thread), 1 MB (iOS main thread)
Creation time: approximately 90 microseconds
POSIX thread creation should also be around this figure (not far away), I guess.

On any sane implementation, the cost of thread creation should be proportional to the number of system calls it involves, and on the same order of magnitude as familiar system calls like open and read. Some casual measurements on my system showed pthread_create taking about twice as much time as open("/dev/null", O_RDWR), which is very expensive relative to pure computation but very cheap relative to any IO or other operations which would involve switching between user and kernel space.

It is indeed very system dependent. I tested @Nafnlaus's code:
#include <thread>
int main(int argc, char** argv)
{
for (volatile int i = 0; i < 500000; i++)
std::thread([](){}).detach();
return 0;
}
On my desktop Ryzen 5 2600:
Windows 10, compiled with MSVC 2019 in release mode, adding std::chrono calls around the loop to time it. Idle (only Firefox with 217 tabs):
It took around 20 seconds (20.274, 19.910, 20.608) (also ~20 seconds with Firefox closed)
Ubuntu 18.04 compiled with:
g++ main.cpp -std=c++11 -lpthread -O3 -o thread
timed with:
time ./thread
It took around 5 seconds (5.595, 5.230, 5.297)
The same code on my Raspberry Pi 3B compiled with:
g++ main.cpp -std=c++11 -lpthread -O3 -o thread
timed with:
time ./thread
took around 15 seconds (16.225, 14.689, 16.235)

Interesting.
I tested with my FreeBSD PCs and got the following results:
FreeBSD 12-STABLE, Core i3-8100T, 8GB RAM: 9.523sec
FreeBSD 12.1-RELEASE, Core i5-6600K, 16GB: 8.045sec
You need to do
sysctl kern.threads.max_threads_per_proc=500100
though.
Core i3-8100T is pretty slow but the results are not very different. Rather the CPU clocks seem to be more relevant: i3-8100T 3.1GHz vs i5-6600k 3.50GHz.

As others have mentioned, this seems to be very OS dependent. On my Core i5-8350U running Win10, it took 118 seconds, which indicates an overhead of around 237 µs per thread (I suspect that the virus scanner and all the other rubbish IT installed is really slowing it down too). A dual core Xeon E5-2667 v4 running Windows Server 2016 took 41.4 seconds (82 µs per thread), but it's also running a lot of IT garbage in the background, including the virus scanner. I think a better approach is to implement a queue with a thread that continuously processes whatever is in the queue, to avoid the overhead of creating and destroying a thread every time.
