What exactly is an std::thread? [closed]

Is it a real hardware thread?
I have a program which reads data from 30 COM devices every second, and so far I only have access to 7. It works great now that I've implemented multithreading, one thread for each device, and it doesn't block my GUI while it waits to read data (a read takes about 30 ms). I'm wondering, though, what will happen if I exceed the number of threads I have on my CPU? If this isn't possible, how should I approach it?

std::thread represents a thread, managed by the operating system. It has its own stack, registers, instruction pointer, etc. The operating system handles all the scheduling: it assigns the thread to a hardware core and then preempts it when necessary so that other work can run on that core.
In a regular program you can't really lock a core and do your work without any OS intervention; allowing that could have a negative impact on the stability of the system.
If you launch more threads than there are cores on your CPU, and they all run all the time, the OS will start swapping them in and out, effectively keeping them all running. However, this swapping is not free, and you can slow everything down if you have too many of them.
However, if your threads are halted for whatever reason -- for example, they block on a mutex, wait on a condition variable, or simply go to sleep (e.g. via std::this_thread::sleep_for) -- then they no longer consume hardware resources during that wait. In that scenario it is perfectly fine to have many more threads than there are cores on your CPU.
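To make this concrete, here is a minimal sketch of the scenario from the question, with a sleep standing in for the blocking ~30 ms COM read (read_device and the device count are illustrative, not a real serial API):

#include <chrono>
#include <thread>
#include <vector>

void read_device(int id) {
    for (;;) {
        // Stand-in for a blocking ~30 ms read; the thread consumes no
        // CPU while blocked, so 30 such threads are fine on any core count.
        std::this_thread::sleep_for(std::chrono::milliseconds(30));
        // ... process the data from device `id` ...
    }
}

int main() {
    std::vector<std::thread> readers;
    for (int id = 0; id < 30; ++id)
        readers.emplace_back(read_device, id);
    for (auto& t : readers)
        t.join();
}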

std::thread is not a "hardware thread". std::thread is a class in the C++ standard library. An instance of the std::thread class is an object that acts as a RAII container for a "native thread", i.e. a thread of execution provided by the API of the operating system.
When you create a std::thread (assuming you don't use the default constructor), the constructor uses the operating system API to create a native thread, and the new native thread calls the passed function.
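A trivial example of that lifetime (hello here is just a placeholder function):

#include <iostream>
#include <thread>

void hello() {
    std::cout << "running on the new native thread\n";
}

int main() {
    std::thread t(hello); // the native thread is created here and runs hello()
    t.join();             // must join (or detach) before t is destroyed
}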
I'm wondering, though, what will happen if I exceed the number of threads I have on my CPU?
The operating system has a subsystem called the "process scheduler", which allocates time on the hardware CPU cores (or logical cores in the case of hyper-threading, which I assume is what you mean by "hardware thread") to each of the threads running on the system. The number of (logical) cores the system's CPUs have affects how many threads can execute in parallel, but it doesn't limit how many threads the operating system can manage.
As such, nothing in particular will happen or will stop happening. If your system has more threads ready to run than the number of (logical) CPU cores, then the operating system will not be able to give CPU time to all of those threads in parallel.
Note that creating native threads has a performance penalty, and having more threads waiting to run (excluding those waiting for disk or network) than the number of cores to execute them will reduce the performance of the system.
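If you want to size your thread count to the machine, the standard library exposes the (logical) core count as a hint; a small sketch:

#include <iostream>
#include <thread>

int main() {
    // Number of concurrent threads the hardware supports (logical cores).
    // May legitimately return 0 if the value cannot be determined.
    unsigned n = std::thread::hardware_concurrency();
    std::cout << "logical cores: " << n << '\n';
}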

I'm wondering, though, what will happen if I exceed the number of threads I have on my CPU?
You might experience a lot of (unneeded) task switching, with each switch taking anywhere from 1 µs to 22 ms depending on what exactly ran before.
Depending on your OS or setup you can get the serial ports to do a lot of work for you. Also the amount of actual work on the individual receiving COM port matters.
- The OS sends a message that there is data you can read from a given buffer.
- The OS wakes your thread when there is something to read on the port.
- An interrupt wakes the thread and returns.
In all of these cases a single thread might be able to handle all 30 COM ports, as most of the time is spent waiting for the very slow serial ports to deliver their data.
Most serial ports are buffered and only need to be emptied after several chars have been received. Some serial cards have DMA so you don't even need to empty it yourself.
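As a rough illustration of the single-thread approach on a POSIX system (the device paths are placeholders, and real code would also configure the ports with termios; on Windows you would use overlapped I/O and WaitForMultipleObjects instead):

#include <fcntl.h>
#include <poll.h>
#include <unistd.h>
#include <cstdio>
#include <vector>

int main() {
    std::vector<pollfd> fds;
    for (int i = 0; i < 30; ++i) {
        char path[32];
        std::snprintf(path, sizeof path, "/dev/ttyUSB%d", i); // placeholder paths
        int fd = open(path, O_RDONLY | O_NOCTTY | O_NONBLOCK);
        if (fd >= 0)
            fds.push_back(pollfd{fd, POLLIN, 0});
    }
    char buf[256];
    for (;;) {
        // One thread sleeps here until any of the 30 ports has data.
        if (poll(fds.data(), static_cast<nfds_t>(fds.size()), -1) <= 0)
            continue;
        for (auto& p : fds)
            if (p.revents & POLLIN)
                read(p.fd, buf, sizeof buf); // then parse/dispatch the bytes
    }
}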

Related

Futex throughput on Linux

I have an async API which wraps some IO library. The library uses C-style callbacks; the API is C++, so the natural choice (IMHO) was to use std::future/std::promise to build this API. Something like std::future<void> Read(uint64_t addr, byte* buff, uint64_t buffSize). However, when I was testing the implementation I saw that the bottleneck is the future/promise, or more precisely, the futex used to implement promise/future. Since the futex, AFAIK, is user space and the fastest mechanism I know of to sync two threads, I switched to raw futexes, which somewhat improved the situation, but not drastically. The performance floats somewhere around 200k futex WAKEs per second. Then I stumbled upon this article, Futex Scaling for Multi-core Systems, which quite matches the effect I observe with futexes. My question is: since the futex is too slow for me, what is the fastest mechanism on Linux I can use to wake the waiting side? I don't need anything more sophisticated than a binary semaphore, just to signal IO operation completion. Since IO operations are very fast (tens of microseconds), switching to kernel mode is not an option. Busy waiting is not an option either, since CPU time is precious in my case.
Bottom line: a simple user-space synchronization primitive, shared between two threads only; only one thread sets the completion, and only one thread waits for it.
EDIT001:
What if... Previously I said no spinning in busy wait. But the futex already spins in busy wait, right? Its implementation just covers a more general case, which requires a global hash table to hold the futexes, queues for all waiters, etc. Is it a good idea to mimic the same behavior on some simple entity (like an int), with no locks, no atomics, no global data structures, busy-waiting on it the way the futex already does?
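For reference, the kind of two-thread, user-space binary semaphore described above looks roughly like this on Linux (a sketch, not a drop-in implementation; the FUTEX_*_PRIVATE ops are used since both threads share one address space):

#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <atomic>

// 0 = empty, 1 = signalled.
static long futex(std::atomic<int>* addr, int op, int val) {
    return syscall(SYS_futex, addr, op, val, nullptr, nullptr, 0);
}

void post(std::atomic<int>& flag) {
    if (flag.exchange(1, std::memory_order_release) == 0)
        futex(&flag, FUTEX_WAKE_PRIVATE, 1); // wake the single waiter
}

void wait(std::atomic<int>& flag) {
    // Consume the signal; sleep in the kernel only while flag is still 0.
    while (flag.exchange(0, std::memory_order_acquire) == 0)
        futex(&flag, FUTEX_WAIT_PRIVATE, 0); // returns once flag != 0
}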
In my experience, the bottleneck is due to Linux's poor support for IPC. This probably isn't a multicore scaling issue unless you have a large number of threads.
When one thread wakes another (by futex or any other mechanism), the system tries to run the 'wakee' thread immediately. But the waker thread is still running and using a core, so the system will usually put the wakee thread on a different core. If that core was previously idle, then the system will have to wake the core up from a power-down state, which takes some time. Any data shared between the threads must now be transferred between the cores.
Then, the waker thread will usually wait for a response from the wakee (it sounds like this is what you are doing). So it immediately goes to sleep, and puts its core to idle.
Then a similar thing happens again when the response comes. The continuous CPU wakes and migrations cause the slowdown. You may well discover that if you launch many instances of your process simultaneously, so that all your cores are busy, you see increased performance as the CPUs no longer have to wake up, and the threads may stop migrating between cores. You can get a similar performance increase if you pin the two threads to one core - it will do more than 1 million 'pings'/sec in this case.
So isn't there a way of saying 'put this thread to sleep and then wake that one'? Then the OS could run the wakee on the same core as the waiter? Well, Google proposed a solution to this with a FUTEX_SWAP API that does exactly this, but it has yet to be accepted into the Linux kernel. The focus now seems to be on user-space thread control via User Managed Concurrency Groups, which will hopefully be able to do something similar; however, at the time of writing this has yet to be merged into the kernel.
Without these changes to the kernel, as far as I can tell there is no way around this problem. 'You are on the fastest route'! UNIX sockets, TCP loopback, pipes all suffer from the same issue. Futexes have the lowest overhead, which is why they go faster than the others. (with TCP you get about 100k pings per sec, about half the speed of a futex impl). Fixing this issue in a general way would benefit a lot of applications/deployments - anything that uses connections to localhost could benefit.
(I did try a DIY approach where the waker thread pins the wakee thread to the same core that the waker is on, but if you don't want to pin the waker, then every time you post the futex you need to pin the wakee to the current core, and the system call to do this has too much overhead.)
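For reference, pinning two threads onto one core as described above looks roughly like this (a sketch assuming Linux/glibc; core 0 is an arbitrary choice):

#include <pthread.h>
#include <sched.h>

// Restrict thread `t` to a single core so waker and wakee share it.
void pin_to_core(pthread_t t, int core) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    pthread_setaffinity_np(t, sizeof set, &set);
}

// Usage: pin_to_core(waker, 0); pin_to_core(wakee, 0);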

Kernel threads in posix

As far as I understand, the kernel has kernel threads for each core in a computer, and threads from userspace are scheduled onto these kernel threads (the OS decides which thread from an application gets connected to which kernel thread). Let's say I want to create an application that uses X number of cores on a computer with X cores. If I use regular pthreads, I think it would be possible that the OS decides to have all the threads I created scheduled onto a single core. How can I ensure that each thread is one-on-one with the kernel threads?
You should basically trust the kernel you are using (in particular, because there could be another heavy process running; the kernel scheduler will choose tasks to be run during a quantum of time).
Perhaps you are interested in CPU affinity, with non-portable functions like pthread_attr_setaffinity_np.
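A sketch of what that creation-time affinity call looks like (assuming glibc, where this non-portable function is available; error handling omitted):

#include <pthread.h>
#include <sched.h>

void* worker(void*) { /* ... */ return nullptr; }

int main() {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(2, &set); // core 2, chosen arbitrarily

    pthread_attr_t attr;
    pthread_attr_init(&attr);
    // The new thread will only ever be scheduled on core 2.
    pthread_attr_setaffinity_np(&attr, sizeof set, &set);

    pthread_t tid;
    pthread_create(&tid, &attr, worker, nullptr);
    pthread_join(tid, nullptr);
    pthread_attr_destroy(&attr);
}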
Your understanding is a bit off. 'Kernel threads' on Linux are basically kernel tasks that are scheduled alongside other processes and threads. When the kernel's scheduler runs, the scheduling algorithm decides which process/thread, out of the pool of runnable threads, will be scheduled to run next on a given CPU core. As @Basile Starynkevitch mentioned, you can tell the kernel to pin individual threads from your application to a particular core, which means the operating system's scheduler will only consider running it on that core, along with other threads that are not pinned to a particular core.
In general with multithreading, you don't want the number of threads to equal the number of cores unless you're doing exclusively CPU-bound processing; otherwise you want the number of threads > the number of cores. When waiting for network or disk IO (i.e. when you're blocked in an accept(2), recv(2), or read(2)), your thread is not considered runnable. If N threads > N cores, the operating system may be able to schedule a different thread of yours to do work while you wait for that IO.
What you mention is one possible model for implementing threading. But such a hierarchical model may not be followed at all by a given POSIX thread implementation. Since somebody already mentioned Linux: it doesn't have one; all threads are equal from the point of view of the scheduler there. They compete for the same resources if you don't specify something extra.
The last time I saw such a hierarchical model was on a machine running IRIX, a long time ago.
So in summary, there is no general rule under POSIX for that, you'd have to look up the documentation of your particular OS or ask a more specific question about it.

c++ thread division into microprocessor

I have a question... I need to build a multi-threaded app, and my question is: if I have a 2-CPU processor, will my 2 threads automatically run one per processor?
And if I have 4 threads and my PC has 4 CPUs, is it again 1 per processor? And if I have 4 threads and only 2 CPUs, how are they divided?
thanks in advance
This is not really a question which can be answered unless you specify the operating system at a minimum.
C++ itself (prior to C++11 and std::thread) knows nothing of threads; they are a service provided by the OS to the execution environment, and depend on that OS for their implementation.
As a general observation, I'm pretty certain that Linux schedules threads independently so that multiple threads can be spread across different CPUs and/or cores. I suspect Windows would do the same.
Some OS' will allow you to specify thread affinity, the ability for threads (and sometimes groups of threads) to stick with a single CPU but, again, that's an OS issue rather than a C++ one.
For Windows (as per your comment), you may want to read this introduction. Windows provides a SetProcessAffinityMask() function for controlling affinity of all threads in a given process or SetThreadAffinityMask() for controlling threads independently.
But, usually, you'll find it's best to leave these alone and let the OS sort it out - unless you have a specific need for different behaviour, the OS will almost certainly make the right decisions.
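For illustration, restricting the current thread to the first logical processor with that API looks like this (a sketch; the mask value is an arbitrary example):

#include <windows.h>

int main() {
    // Bit 0 set => this thread may run only on logical processor 0.
    DWORD_PTR previous = SetThreadAffinityMask(GetCurrentThread(), 1);
    if (previous == 0) {
        // The call failed; affinity is unchanged.
    }
    // ... work now pinned to CPU 0 ...
}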
How threads get allocated to processors is specific to the OS your application is running on. Typically most OS's don't make any guarantees about how your threads are split across the processors, although some do have some low level APIs to allow you to specify thread affinity.
If your threads are CPU bound, then they will certainly tend to be scheduled on all available CPUs.
If your threads are IO bound, then if you only have one thread per CPU, most of the CPUs will be sitting idle. This is why, when attempting to maximize performance, it is important to measure what is happening and either find a hard-coded ratio of threads per CPU or use the operating system's thread-pooling mechanism, which has access to enough information to keep exactly as many threads active as there are CPU cores.
You generally don't want to have more active threads than CPUs (i.e. threads that aren't blocked waiting for IO to complete), as the act of switching between active threads on a CPU does incur a small cost that can add up.
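As an illustration of sizing active threads to the core count, here is a minimal fixed-size pool sketch (assumptions: C++11, independent tasks; a production pool needs more careful shutdown and error handling):

#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

class Pool {
    std::vector<std::thread> workers;
    std::queue<std::function<void()>> tasks;
    std::mutex m;
    std::condition_variable cv;
    bool done = false;

public:
    explicit Pool(unsigned n = std::thread::hardware_concurrency()) {
        for (unsigned i = 0; i < n; ++i)
            workers.emplace_back([this] {
                for (;;) {
                    std::function<void()> job;
                    {
                        std::unique_lock<std::mutex> lk(m);
                        cv.wait(lk, [this] { return done || !tasks.empty(); });
                        if (done && tasks.empty())
                            return;
                        job = std::move(tasks.front());
                        tasks.pop();
                    }
                    job(); // run the task outside the lock
                }
            });
    }
    void submit(std::function<void()> f) {
        {
            std::lock_guard<std::mutex> lk(m);
            tasks.push(std::move(f));
        }
        cv.notify_one();
    }
    ~Pool() {
        {
            std::lock_guard<std::mutex> lk(m);
            done = true;
        }
        cv.notify_all();
        for (auto& w : workers)
            w.join();
    }
};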

Allocate more processor cycles to my program

I've been working with Win32, C, and C++ for a while. I code in Visual Studio. Most of the time I see the System Idle Process using most of the CPU. Is there a way to allocate more processor cycles to my program to run it faster? I understand there might be limitations from I/O; in those cases this question doesn't make any sense.
OR
did I misunderstand the Task Manager numbers? I'm confused; please help me out.
And I want to do something in the program itself; btw, I will be happy if answers are specific to Windows.
Thanks in advance
~calvin
If your program is the only program that has something to do (i.e. it is not waiting for IO), its thread will always be assigned to a processor core.
However, if you have a multi-core processor and a single-threaded program, the CPU usage of your process displayed in the task manager will always be limited to 100/Ncores percent.
For example, if you have a quad-core machine, your process will be at 25% (using one core), and the idle process at around 75%. You can only gain additional CPU power by dividing your work into chunks that can be processed by separate threads, which will then run on the idle cores.
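A minimal sketch of that division of work, summing an array with one thread per core (the data and the sum are just stand-ins for a real workload):

#include <algorithm>
#include <numeric>
#include <thread>
#include <vector>

int main() {
    std::vector<int> data(1000000, 1);
    unsigned n = std::max(1u, std::thread::hardware_concurrency());
    std::vector<long long> partial(n, 0);
    std::vector<std::thread> threads;

    // Each thread sums its own contiguous chunk of the input.
    std::size_t chunk = data.size() / n;
    for (unsigned i = 0; i < n; ++i) {
        std::size_t begin = i * chunk;
        std::size_t end = (i + 1 == n) ? data.size() : begin + chunk;
        threads.emplace_back([&data, &partial, begin, end, i] {
            partial[i] = std::accumulate(data.begin() + begin,
                                         data.begin() + end, 0LL);
        });
    }
    for (auto& t : threads)
        t.join();
    long long total = std::accumulate(partial.begin(), partial.end(), 0LL);
    (void)total; // result computed using all cores
}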
The idle process only "runs" when no other process needs to. If you want to use more CPU cycles, then use them.
If your program is idling, it isn't doing anything, i.e. there is nothing that could be done any faster. So the CPU is probably not the bottleneck in your case.
Are you maybe waiting for data coming from the disk or network?
In case your processor has multiple cores and your program uses only one core to its full extent, making your program multi-threaded could work.
In a multitask/multithread OS, the processors' time is split among threads.
If you want a specific thread to get a bigger share of that time, you can raise its priority with the SetThreadPriority function, though it is usually not wise to do so.
Only special software should mess with those settings.
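For completeness, this is what the call looks like for the current thread (a sketch; heed the warning above about using it at all):

#include <windows.h>

int main() {
    // Ask the scheduler to favor this thread slightly.
    SetThreadPriority(GetCurrentThread(), THREAD_PRIORITY_ABOVE_NORMAL);
    // ... latency-sensitive work ...
}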
It's common for Windows applications to have a low CPU usage percentage (which we see in the Task Manager) because most of the time they just wait for messages.
Use threads to:
- abstract away all the I/O waits,
- assign work to all cores,
- and remove all sleep-wait states from the main thread.
Defer all I/O to a thread, so that wait states are confined within it. Keep the actual computations in the foreground thread, and use synchronization mechanisms that make the I/O slave thread wait for your main thread when communicating.
If your CPU is multi-core and your problem is parallelizable, create as many threads as you have cores, research "set affinity" functions to distribute them between the cores, and still keep a separate thread for all I/O.
Also pay attention not to wait in your main thread: usleep(1) doesn't put you to sleep for 1 microsecond, but for "no less than" that, which in practice may mean anything between 1 ms and 100 ms, hardly ever less, and never anything close to a microsecond.

My thread pool just makes 4~5 threads. Why?

I use the QueueUserWorkItem() function to invoke the thread pool.
And I tried lots of work items with it (about 30000).
But according to the task manager, my application only makes 4~5 threads after I push the start button.
I read the MSDN page which said that the default thread limit is about 500.
Why are only a few threads made in my application?
I'm trying to speed up my application, and I suspect this thread pool is one of the reasons it is slow.
thanks
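For context, a minimal native use of the Win32 call in question looks like this (a sketch; work() is a placeholder and the final Sleep is just a crude way to let queued items run):

#include <windows.h>
#include <cstdio>

// Callback signature required by QueueUserWorkItem.
DWORD WINAPI work(LPVOID context) {
    std::printf("item %p\n", context);
    return 0;
}

int main() {
    for (INT_PTR i = 0; i < 30000; ++i) {
        // The pool, not the caller, decides how many threads service these.
        QueueUserWorkItem(work, reinterpret_cast<PVOID>(i), WT_EXECUTEDEFAULT);
    }
    Sleep(5000);
}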
It is important to understand how the threadpool scheduler works. It was designed to fine-tune the number of running threads against the capabilities of your machine. Your machine can probably run only two threads at the same time; dual-core CPUs are the current standard. Maybe four.
So when you dump a bunch of threads in its lap, it starts out by activating only two threads. The rest of them are in a queue, waiting for CPU cores to become available. As soon as one of those two threads completes, it activates another one. Twice a second, it evaluates what's going on with active threads that didn't complete. It makes the rough assumption that those threads are blocking and thus not making progress, and allows another thread to activate. You've now got three running threads. Getting up to 500 threads, the default maximum, would take 249 seconds at that rate.
Clearly, this behavior spells out what a thread should do to be suitable to run as a threadpool thread: it should complete quickly and not block often. Note that blocking on I/O requests is dealt with separately.
If this behavior doesn't suit you then you can use a regular Thread. It will start running right away and compete with other threads in your program (and the operating system) for CPU time. Creating 30,000 of such threads is not possible, there isn't enough virtual memory available for that. A 32-bit operating system poops out somewhere south of 2000 threads, consuming all available virtual memory. You can get about 50,000 threads on a 64-bit operating system before the paging file runs out. Testing these limits in a production program is not recommended.
I think you may have misunderstood the use of the threadpool. Spawning and killing threads involves the Windows kernel and is an expensive operation. If you continuously created threads to perform asynchronous operations and then threw them away, you would be making many system calls.
So the threadpool is actually a group of threads which are created once and which, instead of exiting when they complete their task, go back to waiting for another QueueUserWorkItem item. The threadpool will then tune itself based on how many threads your process needs concurrently. If you wish to test this, write this code:
for(int i = 0; i < 30000; i++)
{
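    // myMethod must match the WaitCallback delegate: void myMethod(object state)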
ThreadPool.QueueUserWorkItem(myMethod);
}
You will see this creates a whole bunch of threads. Maybe not 30000, as some of the threads that are created will be reused as the ThreadPool works through your function calls.
The threadpool is there so you can avoid creating a thread for every asynchronous operation, for the very reason that threads are expensive. If you wanted 30,000 threads, you would use a lot of memory for the thread stacks and waste a lot of CPU time doing context switches. Creating that many threads would only be justified if you had 30,000 CPU cores...