I have a question... I need to build a multi-threaded app, and my question is: if I have a processor with 2 CPUs, will my 2 threads automatically run separately, one per CPU?
And if I have 4 threads and my PC has 4 CPUs, is it again 1 per CPU? And if I have 4 threads and only 2 CPUs, how are they divided?
thanks in advance
This is not really a question which can be answered unless you specify the operating system at a minimum.
C++ itself knows nothing of threads; they are a service provided by the OS to the execution environment and depend on that OS for their implementation.
As a general observation, I'm pretty certain that Linux schedules threads independently so that multiple threads can be spread across different CPUs and/or cores. I suspect Windows would do the same.
Some OSes will allow you to specify thread affinity, the ability for threads (and sometimes groups of threads) to stick to a single CPU, but again, that's an OS issue rather than a C++ one.
For Windows (as per your comment), you may want to read this introduction. Windows provides a SetProcessAffinityMask() function for controlling the affinity of all threads in a given process, or SetThreadAffinityMask() for controlling threads individually.
But, usually, you'll find it's best to leave these alone and let the OS sort it out - unless you have a specific need for different behaviour, the OS will almost certainly make the right decisions.
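If you do have such a need, a minimal Windows-only sketch of pinning the current thread to CPU 0 with SetThreadAffinityMask() might look like the following (error handling abbreviated; check the MSDN documentation for the exact semantics on your Windows version):

#include <windows.h>
#include <iostream>

int main() {
    // The mask is a bit field: bit N set means the thread may run on CPU N.
    DWORD_PTR mask = 1; // CPU 0 only
    DWORD_PTR previous = SetThreadAffinityMask(GetCurrentThread(), mask);
    if (previous == 0) {
        std::cerr << "SetThreadAffinityMask failed: " << GetLastError() << '\n';
        return 1;
    }
    // ... work that should stay on CPU 0 ...
    return 0;
}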
How threads get allocated to processors is specific to the OS your application is running on. Most OSes typically make no guarantees about how your threads are split across the processors, although some do have low-level APIs that allow you to specify thread affinity.
If your threads are CPU bound, then they will generally be scheduled across all available CPUs.
If your threads are IO bound, then if you only have one thread per CPU, most of the CPUs will be sitting idle. This is why, when attempting to maximize performance, it is important to measure what is happening, and either find a good hard-coded ratio of threads per CPU or use the operating system's thread-pooling mechanism, which has access to enough information to keep exactly as many threads active as there are CPU cores.
You generally don't want to have more active threads than CPUs (i.e. threads that aren't blocked waiting for IO to complete), as switching between active threads on a CPU incurs a small cost that can add up.
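As a rough illustration of matching active threads to the hardware, here is a minimal C++11 sketch using std::thread::hardware_concurrency(); the worker is a made-up CPU-bound stand-in:

#include <atomic>
#include <cstdio>
#include <thread>
#include <vector>

std::atomic<unsigned long> counter(0);

// Stand-in for a CPU-bound task.
void do_work() {
    for (int i = 0; i < 1000000; ++i)
        counter.fetch_add(1, std::memory_order_relaxed);
}

int main() {
    unsigned n = std::thread::hardware_concurrency();
    if (n == 0) n = 2; // the implementation may not know; pick a fallback

    std::vector<std::thread> workers;
    for (unsigned i = 0; i < n; ++i)
        workers.emplace_back(do_work);
    for (auto& t : workers)
        t.join();

    std::printf("done: %lu\n", counter.load());
}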
I learned that all user threads mapped to a single kernel thread are blocked if one of those threads makes a blocking system call, such as an I/O system call.
If std::thread is implemented by creating only a user thread in some environment, then in some programs a thread doing I/O could block a thread doing rendering.
So I think the user/kernel distinction is important, but the C++ standard does not make it.
Then how can I ensure that situations like the above will not occur in a particular environment (like Windows 10)?
I learned that all user threads mapped to a single kernel thread are blocked if one of those threads makes a blocking system call, such as an I/O system call.
Yes; however, it's rare for anything to use the kernel's system calls directly. Typically programs use a user-space library. For a normally blocking "system" call (e.g. the read() function in a standard C library), the library can emulate it using asynchronous functions (e.g. the aio_read() function in a standard C library) and a user-space thread switch.
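To give a sense of what such emulation builds on, here is a minimal sketch of POSIX aio_read() (hedged: data.bin is a made-up file name, the busy-wait stands in for a user-space thread switch, and older glibc needs -lrt at link time):

#include <aio.h>
#include <cerrno>
#include <cstdio>
#include <cstring>
#include <fcntl.h>
#include <unistd.h>

int main() {
    int fd = open("data.bin", O_RDONLY);
    if (fd < 0) { std::perror("open"); return 1; }

    char buf[4096];
    struct aiocb cb;
    std::memset(&cb, 0, sizeof cb);
    cb.aio_fildes = fd;
    cb.aio_buf    = buf;
    cb.aio_nbytes = sizeof buf;
    cb.aio_offset = 0;

    if (aio_read(&cb) != 0) { std::perror("aio_read"); return 1; }

    // While the read is in flight, a user-space threading library could
    // switch to another user thread here instead of spinning.
    while (aio_error(&cb) == EINPROGRESS) { /* do other work */ }

    ssize_t n = aio_return(&cb);
    std::printf("read %zd bytes\n", n);
    close(fd);
    return 0;
}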
So I think the user/kernel distinction is important, but the C++ standard does not make it.
It is important, but for a different reason.
The first problem with user-space threading is that the kernel isn't aware of thread priorities. If you imagine a computer running 2 completely separate applications (with the user using "alt+tab" to switch between them), where each application has a high-priority thread (for the user interface), a few medium-priority threads (for general work), plus a few low-priority threads (for doing things like prefetching and pre-calculating stuff in the background), you can end up with a situation where the kernel gives CPU time to one application (which uses it for low-priority threads) because it doesn't know that the other application needs CPU time for its higher-priority threads.
In other words, for a multi-process environment, user-space threading has a high risk of wasting CPU time doing irrelevant work (in one process) while important work (in another process) waits.
The second problem with user-space threading is that (for modern systems) good scheduling decisions take into account differences between different CPUs ("big.LITTLE", hyper-threading, which caches are shared by which CPUs, ...) and power management (e.g. for low-priority threads it's reasonable to reduce CPU clock speed to increase battery life and/or reduce CPU temperatures, so the CPUs can run faster for longer when higher-priority work needs to be done later); user-space has none of the information needed (and none of the ability to change CPU speeds, etc.) and cannot make good scheduling decisions.
Note that these problems could be "fixed" by having a huge amount of communication between user space and the kernel (the user-space threading informing the kernel of the priorities of waiting and currently running threads, the kernel informing user space of CPU differences and power management, etc.); but the whole point of user-space thread switching is to avoid the cost of kernel system calls, so this communication would make user-space thread switching pointless.
Then how can I ensure that situations like the above will not occur in a particular environment (like Windows 10)?
You can't. It's not your decision.
When you choose to use high level abstractions (std::thread in C++ rather than using the kernel directly from assembly language) you are deliberately delegating low level decisions to something else (the compiler and its run-time environment). The advantages (you no longer have to care about these decisions) are the disadvantages (you are no longer able to make these decisions).
Rephrasing my attempt to answer, after talking to the OP and understanding better what is really being asked.
Most I/O operations block at the thread level: if a thread starts one, only that thread is blocked, not the whole process.
The OP seems to intend to start a rendering operation in one thread and doesn't want it to be blocked by an I/O operation in that thread. Two possible solutions are:
To spawn another thread to do the blocking I/O operation, and let the rendering thread proceed independently of the I/O (a sketch follows below);
To use OS-specific resources (which don't belong to C++) to start the same I/O operation in an asynchronous, non-blocking form.
Lastly, to minimize blocking on I/O at the OS level, what an application developer can do is try to make sure there is no simultaneous access to the same I/O device.
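A minimal sketch of the first solution, using std::async to put the blocking read on its own thread while the rendering loop keeps going (load_file, render_frame, and scene.dat are all made-up stand-ins):

#include <chrono>
#include <fstream>
#include <future>
#include <iostream>
#include <iterator>
#include <string>

std::string load_file(const std::string& path) {
    std::ifstream in(path, std::ios::binary);
    return std::string(std::istreambuf_iterator<char>(in),
                       std::istreambuf_iterator<char>());
}

void render_frame() { /* pretend to draw something */ }

int main() {
    // std::launch::async forces the blocking read onto a separate thread.
    std::future<std::string> data =
        std::async(std::launch::async, load_file, std::string("scene.dat"));

    // Rendering is never blocked by the I/O.
    while (data.wait_for(std::chrono::milliseconds(0)) != std::future_status::ready)
        render_frame();

    std::cout << "loaded " << data.get().size() << " bytes\n";
    return 0;
}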
You can be assured that std::thread is not using "user threads" because that concept pretty much died around the turn of the century.
Modern hardware has multiple CPU cores, which work much better if there are sufficient kernel threads. Without enough kernel threads, CPU cores may sit idle.
The idea of "user threads" originated in an era when there was only a single CPU core, and people instead worried about having too many kernel threads.
I am using Visual Studio 2012. I have a module where I have to read a huge set of files from the hard disk, after traversing their corresponding paths through an XML file. For this I am doing
std::vector<std::thread> m_ThreadList;
In a while loop I am pushing back a new thread into this vector, something like
m_ThreadList.push_back(std::thread(&MyClass::Readfile, &MyClassObject, filepath, std::ref(polygon)));
My C++11 multithreading knowledge is limited. The question that I have here is: how do I create a thread on a specific core? I know of parallel_for and parallel_for_each in VS2012, which make optimum use of the cores. But is there a way to do this using standard C++11?
As pointed out in other comments, you cannot create a thread "on a specific core", as C++ has no knowledge of such architectural details. Moreover, in the majority of cases, the operating system will be able to manage the distribution of threads among cores/processors well enough.
That said, there exist cases in which forcing a specific distribution of threads among cores can be beneficial for performance. As an example, by forcing a thread to execute on one specific core it might be possible to minimise data movement between different processor caches (which can be critical for performance in certain memory-bound scenarios).
If you want to go down this road, you will have to look into platform-specific routines. E.g., for GNU/Linux with POSIX threads you will want pthread_setaffinity_np(), on FreeBSD cpuset_setaffinity(), on Windows SetThreadAffinityMask(), etc.
I have some relevant code snippets here if you are interested:
http://gitorious.org/piranhapp0x/mainline/blobs/master/src/thread_management.cpp
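For illustration only, a sketch of the GNU/Linux route: pinning a std::thread to core 0 through its native_handle() (Linux/glibc specific; compile with -pthread):

#include <pthread.h>
#include <sched.h>
#include <thread>
#include <cstdio>

int main() {
    // Keep the thread busy long enough for the affinity call to matter.
    std::thread t([] {
        volatile unsigned long x = 0;
        for (unsigned long i = 0; i < 100000000UL; ++i) x += i;
    });

    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(0, &set); // allow core 0 only

    int rc = pthread_setaffinity_np(t.native_handle(), sizeof(set), &set);
    if (rc != 0)
        std::fprintf(stderr, "pthread_setaffinity_np failed: %d\n", rc);

    t.join();
    return 0;
}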
I'm fairly certain that core affinity isn't included in std::thread. The assumption is that the OS is perfectly capable of making the best possible use of the cores available. In all but the most extreme cases you're not going to beat the OS's decision, so the assumption is a fair one.
If you do go down that route then you have to add some decision-making to your code to take account of the machine architecture, to ensure that your decision is better than the OS's on every machine you run on. That takes a lot of effort! For starters you'll want to limit the number of threads to match the number of cores on the computer. And you don't have any knowledge of what else is going on in the machine; the OS does!
Which is why thread pools exist. They tend by default to have as many threads as there are cores, automatically set up by the language runtime. AFAIK C++11 doesn't have one of those. So the one good thing you can do to get optimum performance is to find out how many cores there are and limit the number of threads you have to that number. Otherwise it's probably best just to trust the OS.
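To make that concrete, a minimal non-production C++11 thread-pool sketch sized to the core count (no futures, no exception propagation; just a mutex-protected queue and a stop flag):

#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

class ThreadPool {
public:
    explicit ThreadPool(unsigned n = std::thread::hardware_concurrency()) {
        if (n == 0) n = 2; // fallback when the core count is unknown
        for (unsigned i = 0; i < n; ++i)
            workers_.emplace_back([this] { run(); });
    }
    ~ThreadPool() {
        { std::lock_guard<std::mutex> lock(m_); stop_ = true; }
        cv_.notify_all();
        for (auto& w : workers_) w.join();
    }
    void submit(std::function<void()> job) {
        { std::lock_guard<std::mutex> lock(m_); jobs_.push(std::move(job)); }
        cv_.notify_one();
    }
private:
    void run() {
        for (;;) {
            std::function<void()> job;
            {
                std::unique_lock<std::mutex> lock(m_);
                cv_.wait(lock, [this] { return stop_ || !jobs_.empty(); });
                if (stop_ && jobs_.empty()) return;
                job = std::move(jobs_.front());
                jobs_.pop();
            }
            job(); // run outside the lock
        }
    }
    std::vector<std::thread> workers_;
    std::queue<std::function<void()>> jobs_;
    std::mutex m_;
    std::condition_variable cv_;
    bool stop_ = false;
};

Usage is simply ThreadPool pool; pool.submit([]{ /* work */ }); the destructor drains the remaining jobs and joins the workers.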
Joachim Pileborg's comment is well worth paying attention to, unless the work done by each thread outweighs the I/O overhead.
As a quick overview of threading in the context of dispatching threads to cores:
Most modern OSes use kernel-level threads, or a hybrid scheme. With kernel-level threading, the OS "sees" all the threads in each process; in contrast, with user-level threads (as employed, for example, by early Java "green threads") the OS sees a single process and has no knowledge of its threading. Because, with kernel-level threading, the OS can recognise the separate threads of a process and manage their dispatch onto given cores, there is the potential for true parallelism, where multiple threads of the same process run on different cores. You, as the programmer, have no control over this when employing std::thread; the OS decides. With user-level threading, all thread management is done at the user level (in early Java, a library managed the "dispatch"). In the case of hybrid threading, kernel threading is used, where each kernel thread is actually a set of user-level threads.
So I know I can increase the number of threads of a process on Linux using setrlimit and friends. According to this, the theoretical limit on the number of threads is determined by memory (somewhere around 100,000). For my use case I'm looking into using the FIFO scheduler in a cooperative style, so spurious context switches aren't a concern. I know I can limit the number of active threads to the number of cores. My question is what the practical limit on the number of threads is, after which assumptions in the scheduler start being violated. If I maintain a true cooperative style, are additional threads "free"? Any case studies or actual examples would be especially interesting.
The Apache server seems to be the program most analogous to this situation. Does anybody have numbers on how many threads they've seen Apache spawn before it becomes useless?
Related, but it has to do with Windows and pre-emptive code.
I believe the number of threads is limited:

by the available memory (each thread needs at least several pages, and often many of them, notably for its stack and thread-local storage). See the pthread_attr_setstacksize function to tune that; thread stacks of a megabyte each are not uncommon.

at least on Linux (NPTL, i.e. current Glibc) and other systems where user threads are the same as kernel threads, by the number of tasks the kernel can schedule.

I would guess that on most Linux systems the second limit is stronger than the first. Kernel threads (on Linux) are created through the clone(2) Linux system call. In old Unix or Linux kernels, the number of tasks was hardwired. It is probably tunable today, but I guess it is in the many thousands, not the millions!
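If you do need a very large number of threads, shrinking each thread's stack is the usual first lever; here is a minimal sketch with pthread_attr_setstacksize (64 KiB is an arbitrary choice, bounded below by PTHREAD_STACK_MIN):

#include <pthread.h>
#include <limits.h>  // PTHREAD_STACK_MIN
#include <cstdio>

void* tiny_worker(void*) { return nullptr; }

int main() {
    pthread_attr_t attr;
    pthread_attr_init(&attr);
    pthread_attr_setstacksize(&attr, 64 * 1024); // instead of the ~1 MiB default

    pthread_t t;
    int rc = pthread_create(&t, &attr, tiny_worker, nullptr);
    if (rc != 0)
        std::fprintf(stderr, "pthread_create failed: %d\n", rc);
    else
        pthread_join(t, nullptr);

    pthread_attr_destroy(&attr);
    return 0;
}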
And you should consider coding in the Go language, its goroutines are the feather-light threads you are dreaming of.
If you want many cooperative threads, you could look into Chicken Scheme implementation tricks.
As far as I understand, the kernel has kernel threads for each core in a computer, and threads from user space are scheduled onto these kernel threads (the OS decides which thread from an application gets connected to which kernel thread). Let's say I want to create an application that uses X cores on a computer with X cores. If I use regular pthreads, I think it would be possible for the OS to decide to schedule all the threads I created onto a single core. How can I ensure that each thread maps one-to-one onto the kernel threads?
You should basically trust the kernel you are using (in particular because there could be another heavy process running; the kernel scheduler will choose which tasks to run during each quantum of time).

Perhaps you are interested in CPU affinity, with non-portable functions like pthread_attr_setaffinity_np.
Your understanding is a bit off. 'Kernel threads' on Linux are basically kernel tasks that are scheduled alongside other processes and threads. When the kernel's scheduler runs, the scheduling algorithm decides which process/thread, out of the pool of runnable threads, will be scheduled to run next on a given CPU core. As @Basile Starynkevitch mentioned, you can tell the kernel to pin individual threads from your application to a particular core, which means the operating system's scheduler will only consider running them on that core, along with other threads that are not pinned to a particular core.
In general with multithreading, you don't want the number of threads to equal the number of cores; unless you're doing exclusively CPU-bound processing, you want the number of threads to be greater than the number of cores. When waiting for network or disk I/O (i.e. when you're blocked in accept(2), recv(2), or read(2)), your thread is not considered runnable. If N threads > N cores, the operating system may be able to schedule a different thread of yours to do work while you wait for that I/O.
What you mention is one possible model for implementing threading. But such a hierarchical model may not be followed at all by a given POSIX thread implementation. Since somebody already mentioned Linux: it doesn't have such a model; there, all threads are equal from the point of view of the scheduler. They compete for the same resources if you don't specify something extra.
The last time I saw such a hierarchical model was on a machine running the IRIX OS, a long time ago.
So, in summary, there is no general rule under POSIX for this; you'd have to look up the documentation of your particular OS or ask a more specific question about it.
I've been working with Win32, C, and C++ for a while. I code in Visual Studio. Most of the time I see the System Idle Process using most of the CPU. Is there a way to allocate more processor cycles to my program to make it run faster? I understand there might be limitations from I/O; in those cases this question doesn't make any sense.
OR
did I misunderstand the Task Manager numbers? I'm confused, please help me out.
And I want to do something in the program itself; BTW, I'll be happy if answers are specific to Windows.
Thanks in advance
~calvin
If your program is the only program that has something to do (i.e. it is not waiting for I/O), its thread will always be assigned to a processor core.

However, if you have a multi-core processor and a single-threaded program, the CPU usage of your process displayed in the Task Manager will always be limited to 100/Ncores percent.

For example, if you have a quad-core machine, your process will be at 25% (using one core) and the idle process at around 75%. You can only gain additional CPU power by dividing your tasks into chunks that can be worked on by separate threads, which will then run on the idle cores.
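A minimal sketch of that chunking idea: summing a large array with one thread per core, each handling a disjoint slice (the array of ones is a stand-in for real work):

#include <numeric>
#include <thread>
#include <vector>
#include <cstdio>

int main() {
    std::vector<int> data(10000000, 1);
    unsigned n = std::thread::hardware_concurrency();
    if (n == 0) n = 4; // fallback when the core count is unknown

    std::vector<long long> partial(n, 0);
    std::vector<std::thread> threads;
    std::size_t chunk = data.size() / n;

    for (unsigned i = 0; i < n; ++i) {
        std::size_t begin = i * chunk;
        std::size_t end = (i == n - 1) ? data.size() : begin + chunk;
        threads.emplace_back([&, i, begin, end] {
            partial[i] = std::accumulate(data.begin() + begin,
                                         data.begin() + end, 0LL);
        });
    }
    for (auto& t : threads) t.join();

    long long total = std::accumulate(partial.begin(), partial.end(), 0LL);
    std::printf("sum = %lld\n", total);
    return 0;
}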
The idle process only "runs" when no other process needs to. If you want to use more CPU cycles, then use them.
If your program is idling, it isn't doing anything, i.e. there is nothing that could be done any faster. So the CPU is probably not the bottleneck in your case.
Are you maybe waiting for data coming from the disk or network?
In case your processor has multiple cores and your program uses only one core to its full extent, making your program multi-threaded could help.
In a multitasking/multithreading OS, the processors' time is split among threads.

If you want a specific thread to get a bigger chunk of time, you can set its priority with the SetThreadPriority function, though it's not wise to do so.

Only special software should mess with those settings.
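For completeness only, and with that caveat in mind, a minimal sketch of raising the calling thread's priority one notch on Windows:

#include <windows.h>
#include <iostream>

int main() {
    if (!SetThreadPriority(GetCurrentThread(), THREAD_PRIORITY_ABOVE_NORMAL))
        std::cerr << "SetThreadPriority failed: " << GetLastError() << '\n';
    // ... time-critical work here ...
    return 0;
}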
It's common for Windows applications to have a low CPU usage percentage (which we see in the Task Manager), because most of the time they are just waiting for messages.
Use threads to:
abstract away all the I/O waits;
assign work to all the cores;
and remove all sleep-wait states from the main thread.
Defer all I/O to a thread, so that wait states are confined within it. Keep the actual computations in the foreground thread, and use synchronization mechanisms that make the I/O slave thread wait on your main thread when communicating (a sketch follows at the end of this answer).
If your CPU is multi-core and your problem is parallelizable, create as many threads as you have cores, research the "set affinity" functions to distribute them between the cores, and still keep a separate thread for all I/O.
Also pay attention not to wait in your main thread: usleep(1) doesn't send you into the background for 1 microsecond, but for "no less than" that, which in practice may mean anything between 1 ms and 100 ms, hardly ever less, and never anything close to a microsecond.
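A minimal sketch of the deferred-I/O pattern described above: the main thread computes and hands results to an I/O thread through a small queue guarded by a condition variable, so blocking writes never stall the computation (the file name and the "results" are made up):

#include <condition_variable>
#include <fstream>
#include <mutex>
#include <queue>
#include <string>
#include <thread>

std::queue<std::string> outbox;
std::mutex m;
std::condition_variable cv;
bool done = false;

void io_thread() {
    std::ofstream out("results.txt");
    for (;;) {
        std::unique_lock<std::mutex> lock(m);
        cv.wait(lock, [] { return done || !outbox.empty(); });
        if (done && outbox.empty()) return;
        std::string line = std::move(outbox.front());
        outbox.pop();
        lock.unlock();
        out << line << '\n'; // the blocking I/O happens only here
    }
}

int main() {
    std::thread io(io_thread);
    for (int i = 0; i < 100; ++i) {
        std::string result = "result " + std::to_string(i); // the "computation"
        { std::lock_guard<std::mutex> lock(m); outbox.push(std::move(result)); }
        cv.notify_one();
    }
    { std::lock_guard<std::mutex> lock(m); done = true; }
    cv.notify_one();
    io.join();
    return 0;
}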