Related
This question already has answers here:
How can multithreading speed up an application (when threads can't run concurrently)?
(9 answers)
Closed 9 years ago.
My professor causally mentioned that we should program multi-thread programs even if we are using a unicore processor however because of the lack of time , he did not elaborate on it .
I would like to know what are the benefits of a multi-thread program in a unicore processor ??
It won't be as significant as a multi-core system but it can still provide some benefits.
Mainly all the benefits that you are going to get will be regarding to the context switch that will happen after a input miss to the already executing thread. Executing thread may be waiting for anything such as a hardware resource or a branch mis-prediction or even data transfer after a cache miss.
At this point the waiting thread can be executed to benefit from this "waiting time". But of course context switch will take some time. Also managing threads inside the code rather than sequential computation can create some extra complexity to your program. And as it has been said, some applications needs to be multi-threaded so there is no escape from the context switch in some cases.
Some applications need to be multi-threaded. Multi-threading isn't just about improving performance by using more cores, it's also about performing multiple tasks at once.
Take Skype for example - The GUI needs to be able to accept the text you're entering, display it on the screen, listen for new messages coming from the user you're talking to, and display them. This wouldn't be a trivial task in a single threaded application.
Even if there's only one core available, the OS thread scheduler will give you the illusion of parallelism.
Usually it is about not blocking. Running many threads on a single core still gives the illusion of concurrency. So you can have, say, a thread doing IO while another one does user interactions. The user interaction thread is not blocked while the other does IO, so the user is free to carry on interacting.
Benefits could be different.
One of the widely used examples is the application with GUI, which supposed to perform some kind of computations. If you will have a single thread - the user will have to wait the result before dealing something else with the application, but if you start it in the separate thread - user interface could be still available for user during the computation process. So, multi-thread program could emulate multi-task environment even on a unicore system. That's one of the points.
As others have already mentioned, not blocking is one application. Another one is separation of logic for unrelated tasks that are to be executed simultaneously. Using threads for that leaves handling of scheduling these tasks to the OS.
However, note that it may also be possible to implement similar behavior using asynchronous operations in a single thread. "Future" and boost::asio provide ways of doing non-blocking stuff without necessarily resorting to multiple threads.
I think it depends a bit on how exactly you design your threads and which logic is actually in the thread. Some benefits you can even get on a single core:
A thread can wrap a blocking/long-during call you can't circumvent otherwise. For some operations there are polling mechanisms, but not for all.
A thread can wrap an almost standalone part of your application that has virtually no interaction with other code. For example background polling for updates, monitoring some resource (e.g. free storage), checking internet connectivity. If you keep them in a separate thread you can keep the code relatively simple in its own 'runtime' without caring too much about the impact on the main program, the sole communication with the main logic is usually a single 'event'.
In some environments you might get more processing time. This mainly depends on how your OS scheduling system works, but if this allocates time per thread, the more threads you have the more your app will be scheduled.
Some benefits long-term:
Where it's not hard to do you benefit if your hardware evolves. You never know what's going to happen, today your app runs on a single-core embedded device, tomorrow that embedded device gets a quad core. Programming threaded from the beginning improves your future scalability.
One example is an environment where you can deterministically assign work to a thread, e.g. based on some hash all related operations end up in the same thread. The advantage for single cores is 'small' but it's not hard to do as you need little synchronization primitives so the overhead stays small.
That said, I think there are situations where it's very ill advise:
As soon as your required synchronization mechanism with other threads becomes complex (e.g. multiple locks, lots of critical sections, ...). It might still be then that multi-threading gives you a benefit when effectively moving to multiple CPUs, but the overhead is huge both for your single core and your programming time.
For instance think about operations that block because of slow peripheral devices (harddisk access etc.). While these are waiting, even the single core can do other things asyncronously.
In a lot of applications the bottleneck is not CPU processing power. So when the program flow is waiting for completion of IO requests (user input, network/disk IO), critical resources to be available, or any sort of asynchroneously triggered events, the CPU can be scheduled to do other work instead of just blocking.
In this case you don't necessarily need multiple threads that can actually run in parallel. Cooperative multi-tasking concepts like asynchroneous IO, coroutines, or fibers come into mind.
If however the application's bottleneck is CPU processing power (constantly 100% CPU usage), then it makes sense to increase the number of CPUs available to the application. At that point it is easier to scale the application up to use more CPUs if it was designed to run in parallel upfront.
As far as I can see, one answer was not yet given:
You will have to write multithreaded applications in the future!
The average number of cores will double every 18 months in the future. People have learned single-threaded programming for 50 years now, and now they are confronted with devices that have multiple cores. The programming style in a multi-threaded environment differs significantly from single-threaded programming. This refers to low-level aspects like avoiding race conditions and proper synchronization, as well as the high-level aspects like the general algorithm design.
So in addition to the points already mentioned, it's also about writing future-proof software, scalability and the development of the skills that are required to achieve these goals.
It's difficult to tell what is being asked here. This question is ambiguous, vague, incomplete, overly broad, or rhetorical and cannot be reasonably answered in its current form. For help clarifying this question so that it can be reopened, visit the help center.
Closed 11 years ago.
Preemptive multitasking in C/C++: can a running thread be interrupted by some timer and switch between tasks?
Many VMs and other language runtimes using green-threading and such are implemented in these terms; can C/C++ apps do the same?
If so, how?
This is going to be platform dependent, so please discuss this in terms of the support particular platforms have for this; e.g. if there's some magic you can do in a SIGALRM handler on Linux to swap some kind of internal stack (perhaps using longjmp?), that'd be great!
I ask because I am curious.
I have been working for several years making async IO loops. When writing async IO loops I have to be very careful not to put expensive to compute computation into the loop as it will DOS the loop.
I therefore have an interest in the various ways an async IO loop can be made to recover or even fully support some kind of green threading or such approach. For example, sampling the active task and the number of loop iterations in a SIGALRM, and then if a task is detected to be blocking, move the whole of everything else to a new thread, or some cunning variation on this with the desired result.
There was some complaints about node.js in this regard recently, and elsewhere I've seen tantalizing comments about other runtimes such as Go and Haskell. But lets not go too far away from the basic question of whether you can do preemptive multitasking in a single thread in C/C++
Windows has fibers that are user-scheduled units of execution sharing the same thread.
http://msdn.microsoft.com/en-us/library/windows/desktop/ms682661%28v=vs.85%29.aspx
UPD: More information about user-scheduled context switching can be found in LuaJIT sources, it supports coroutines for different platforms, so looking at the sources can be useful even if you are not using lua at all. Here is the summary: http://coco.luajit.org/portability.html,
As far as i understand you are mixing things that are usually not mixed:
Asynchrounous Singals
A signal is usually delivered to the program (thus in your description one thread) on the same stack that is currently running and runs the registered signal handler... in BSD unix there is an option to let the handler run on a separate so-called "signal stack".
Threads and Stacks
The ability to run a thread on its own stack requires the ability to allocate stack space and save and restore state information (that includes all registers...) - otherwise clean "context switch" between threads/processes etc. is impossible. Usually this is implemented in the kernel and very often using some form of assembler since that is a very low-level and very time-sensitive operation.
Scheduler
AFAIK every system capable of running threads has some sort of scheduler... which is basically a piece of code running with the highest privileges. Often it has subscribed to some HW signal (clock or whatever) and makes sure that no other code ever registers directly (only indirectly) to that same signal. The scheduler has thus the ability to preemt anything on that system. Main conern is usually to give the threads enough CPU cycles on the available cores to do their job. Implementation usually includes some sort of queues (often more than one), priority handling and several other stuff. Kernel-side threads usually have a higher priority than anything else.
Modern CPUs
On modern CPUs the implementation is rather complicated since involves dealing with several cores and even some "special threads" (i.e. hyperthreads)... since modern CPUs usually have several levels of Cache etc. it is very important to deal with these appropriately to achieve high performance.
All the above means that your thread can and most probably will be preempted by OS on a regular basis.
In C you can register signal handlers which in turn preempt your thread on the same stack... BEWARE that singal handlers are problematic if reentered... you can either put the processing into the signal handler or fill some structure (for example a queue) and have that queue content consumed by your thread...
Regarding setjmp/longjmp you need to be aware that they are prone to several problems when used with C++.
For Linux there is/was a "full preemption patch" available which allows you to tell the scheduler to run your thread(s) with even higher priority than kernel thread (disk I/O...) get!
For some references see
http://www.kernel.org/doc/Documentation/scheduler/sched-rt-group.txt
http://www.kernel.org/doc/Documentation/scheduler/sched-design-CFS.txt
http://www.kernel.org/doc/Documentation/scheduler/
http://www.kernel.org/doc/Documentation/rt-mutex.txt
http://www.kernel.org/doc/Documentation/rt-mutex-design.txt
http://www.kernel.org/doc/Documentation/IRQ.txt
http://www.kernel.org/doc/Documentation/IRQ-affinity.txt
http://www.kernel.org/doc/Documentation/preempt-locking.txt
http://tldp.org/LDP/LG/issue89/vinayak2.html
http://lxr.linux.no/#linux+v3.1/kernel/sched.c#L3566
Can my thread help the OS decide when to context switch it out?
https://www.osadl.org/Realtime-Preempt-Kernel.kernel-rt.0.html
http://www.rtlinuxfree.com/
C++: Safe to use longjmp and setjmp?
Use of setjmp and longjmp in C when linking to C++ libraries
http://www.cs.utah.edu/dept/old/texinfo/glibc-manual-0.02/library_21.html
For seeing an acutal implementation of a scheduler etc. checkout the linux serouce code at https://kernel.org .
Since your question isn't very specific I am not sure whether this is a real answer but I suspect it has enough information to get you started.
REMARK:
I am not sure why you might want to implement something already present in the OS... if it for a higher performance on some async I/O then there are several options with maximum performance usually available on the kernel-level (i.e. write kernel-mode code)... perhaps you can clarify so that a more specific answer is possible.
Userspace threading libraries are usually cooperative (e.g: GNU pth, SGI's statethreads, ...). If you want preemptiveness, you'd go to kernel-level threading.
You could probably use getcontext()/setcontext()... from a SIGALARM signal handler, but if it works, it would be messy at best. I don't see what advantage has this approach over kernel threading or event-based I/O: you get all the non-determinism of preemptiveness, and you don't have your program separated into sequential control flows.
As others have outlined, preemptive is likely not very easy to do.
The usual pattern for this is using co-procedures.
Coprocedures are a very nice way to express finite state machines (e.g. text parsers, communication handlers).
You can 'emulate' the syntax of co-procedures with a modicum of preprocessor macro magic.
Regarding optimal input/output scheduling
You could have a look at Boost Asio: The Proactor Design Pattern: Concurrency Without Threads
Asio also has a co-procedure 'emulation' model based on a single (IIRC) simple preprocessor macro, combined with some amount of cunningly designed template facilities that come things eerily close to compiler support for _stack-less co procedures.
The sample HTTP Server 4 is an example of the technique.
The author of Boost Asio (Kohlhoff) explains the mechanism and the sample on his Blog here: A potted guide to stackless coroutines
Be sure to look for the other posts in that series!
What you're asking makes no sense. What would your one thread be interrupted by? Any executing code has to be in a thread. And each thread is basically a sequential execution of code. For a thread to be interrupted, it has to be interrupted by something. You can't just jump around randomly inside your existing thread as a response to an interrupt. Then it's no longer a thread in the usual sense.
What you normally do is this:
either you have multiple threads, and one of your threads is suspended until the alarm is triggered,
alternatively, you have one thread, which runs in some kind of event loop, where it receives events from (among other sources) the OS. When the alarm is triggered, it sends a message to your thread's event loop. If your thread is busy doing something else, it won't immediately see this message, but once it gets back into the event loop and processing events, it'll get it, and react.
The title is an oxymoron, a thread is an independent execution path, if you have two such paths, you have more than one thread.
You can do a kind of "poor-man's" multitasking using setjmp/longjmp, but I would not recommend it and it is cooperative rather than pre-emptive.
Neither C nor C++ intrinsically support multi-threading, but there are numerous libraries for supporting it, such as native Win32 threads, pthreads (POSIX threads), boost threads, and frameworks such as Qt and WxWidgets have support for threads also.
I am considering the use of potentially hundreds of threads to implement tasks that manage devices over a network.
This is a C++ application running on a powerpc processor with a linux kernel.
After an initial phase when each task does synchronization to copy data from the device into the task, the task becomes idle, and only wakes up when it receives an alarm, or needs to change some data (configuration), which is rare after the start phase. Once all tasks reach the "idle" phase, I expect that only a few per second will need to wake.
So, my main concern is, if I have hundreds of threads will they have a negative impact on the system once they become idle?
Thanks.
amso
edit:
I'm updating the question based on the answers that I got. Thanks guys.
So it seems that having a ton of threads idling (IO blocked, waiting, sleeping, etc), per se , will not have an impact on the system in terms of responsiveness.
Of course, they will spend extra money for each thread's stack and TLS data but that's okay as long as we throw more memory at the thing (making it more €€€)
But then, other issues have to be accounted for. Having 100s of threads waiting will likely increase memory usage on the kernel, due to the need of wait queues or other similar resources. There's also a latency issue, which looks non-deterministic. To check the responsiveness and memory usage of each solution one should measure it and compare.
Finally, the whole idea of hundreds of threads that will be mostly idling may be modeled like a thread pool. This reduces a bit of code linearity but dramatically increases the scalability of the solution and with propper care can be easily tunable to adjust the compromise between performance and resource usage.
I think that's all. Thanks everyone for their input.
--
amso
Each thread has overhead - most importantly each one has its own stack and TLS. Performance is not that much of a problem since they will not get any time slices unless they actually do anything. You may still want to consider using thread pools.
Chiefly they will use up address space and memory for stacks; once you get, say, 1000 threads, this gets quite significant as I've seen that 10M per thread is typical for stacks (on x86_64). It is changable, but only with care.
If you have a 32-bit processor, address space will be the main limitation once you hit 1000s of threads, you can easily exhaust the AS.
They use up some kernel memory, but probably not as much as userspace.
Edit: of course threads share address space with each other only if they are in the same process; I am assuming that they are.
I'm not a Linux hacker, but assuming that Linux's thread scheduling is similar to Windows'...
Yes, of course the will be some impact. Every bit of memory you consume will potentially have some impact.
However, in a time-sliced environment, threads that are in a Wait/Sleep/Join state will not consume CPU cycles until they are awoken.
I would be worried about offering 1:1 thread-connections mappings, if nothing else because it leaves you rather exposed to denial of service attacks. (pthread_create() is a fairly expensive operation compared to just a call to accept())
EboMike has already answered the question directly - provided threads are blocked and not busy-waiting then they won't consume much in the way of resources although they will occupy memory and swap for all the per-thread state.
I'm learning the basics of the kernel now. I can't give you a specific answer yet; I'm still a noob... but here are some things for you to chew on.
Linux implements each POSIX thread as a unique process. This will create overhead as others have mentioned. In addition to this, your waiting model appears flawed any way you do it. If you create one conditional variable for each thread, then I think (based off of my interpretation of the website below) that you'll actually be expending a lot of kernel memory, as each thread would be placed into its own wait queue. If instead you break your threads up for each group of X threads to share a conditional variable, then you've got problems as well because every time the variable signals, you must wake up _EVERY_DARN_PROCESS_ in that variable's wait queue.
I also assume that you will need to do some object sharing an synchronization. In this case, your code may get slower because of the need to wake up all processes waiting on a resource, as I mentioned earlier.
I know this wasn't much help, but as I said, I'm a kernel noob. Hope it helped a little.
http://book.chinaunix.net/special/ebook/PrenticeHall/PrenticeHallPTRTheLinuxKernelPrimer/0131181637/ch03lev1sec7.html
I'm not sure what "device" you are talking about, but if it's a file descriptor, I'd suggest that you look at starting to migrate to using either poll or epoll (Id suggest the latter given the description of how active you expect each file descriptor to be). That way, you could use one process which would be responsible for all the fds.
I have a multithreaded application. Each module is executed in a separate thread.
Modules are:
- network module - used to receive/send data from network
- parser module - encode/decode network data to internal presentation
- 2 application module - perform some application logic on the above data one after other
- counter module - used to gather statistics from other modules
- timer module - used to schedule timers
- and much more ...
All threads using message queues for inter thread communication (std::deque sync by conditional variable and mutex).
Some modules are used by others ones (e.g. all modules use timer and counter) and this for each message received from network wich should be handled in very high rates.
This is pretty complex application and the design looks "reasonable". From other hand, I'm not sure that such design, thread per module, is the "best" one? In particular, I'm afraid that such design "encorage" a lot of context switches.
What do you think?
Is there're any good guidelines or open source project to learn from how to do "correct" design of threaded application?
Thread-per-function designs are just naive: they assume that by separating tasks - by module - onto threads, that some kind of scalability will be achieved.
This kind of design is inefficient, as very few task breakdowns yield exactly as many tasks as there are CPUs.
Far more rational designs are to break tasks down into 'jobs' - and then use thread pooling mechanisms to dispatch those jobs.
Advantages over the thread-per-module approach:
Thread pools take advantage of all cores. with thread-per-module if you have modules < cores you have cores sitting idle.
Thread pools minimize contention and resources by maintaining a parity between active threads, and cores. with thread-per-module, if modules > cores you incur needless extra context switches and (on some platforms) each thread exhausts other limited per process resources (like virtual memory).
Thread pools let a "module" do multiple jobs at a time. thread-per-module means that the busiest module still only gets one core.
I wouldn't call myself an expert an multi-threaded design. But I've at least worked with threads enough to have run into various issues trying to design them to work together (communication, locking resources, waiting for threads to end, etc).
At this point, my general rule of thumb is that I must justify the existence of each new thread. For example, if the network layer I'm using provides both a synchronous and an asynchronous API, can I really justify making the network code use synchronous calls in a new thread instead of just using the asynchronous calls in the main thread? In your case, how many modules actually need a thread of their own for a specific reason. Are there any that could instead just be called in turn from the main thread?
If some threads have no good reason for existing, then you might be able to save yourself some trouble and complexity by just putting that module in the main thread.
Now of course, there are good justifiable reasons for putting things in threads. Such as making synchronous calls that may block for a long time, keeping a GUI thread responsive while performing a long task, or being able to take advantage of parallel processing of a large task on a multi-core system.
I don't know of any particular "correct" way to do it. A lot of it really comes down to the details of what your application is actually supposed to do.
A good guideline is to put operations that might block (such as I/O) in its own thread. Your network module is a definite candidate here. Have your network thread use select (I assume UNIX here) to block on input.
Asynchronous events are good in separate threads as well. Your timer module looks like a good candidate here.
You might want to put your other modules in one thread to decrease complexity of your application. BUT, you might want to split them up if you have a multi-processor system.
Have a good strategy for locking resources and mutex handling to prevent deadlocks. A dependency graph (using a whiteboard!) might help here to get your design correct.
Good luck! Sounds like a complex system which will cause many hours of fun development!
For what platform?
For instance a Win32 applications the best model for back-end servers (like yours seems to be) is the thread pool and IO Completion Port. This is not just some hear say and opinion, there are strong facts behind this claim. Rick Vicik of the Windows Performance team has posted a series of articles describing in greater detail why high end servers need to follow this model, see High Performance Windows Programs.
There are other factors that come into play, like for instance the typo of protocol your network module has to handle. Request-Response protocols are often handled by one-thread-per-request metaphor and they do well enough, but high-throughput high-scale protocols don't fare well in that model, specifically because of boxcaring requirements.
Ultimately, whether your design is sound or not is hard to tell just from this brief description. Personally I tend o favor an IO completion driven threading model, as opposed to logical-module driven one, but that's just me.
Just to add to the other answers, lets reason every single thread in your dessign:
network module
Accepted.
parser module + 2 application module
Are you sure that these 3 threads can't be merged into one, main data processing thread? If that were the case, you could then benefit of a thread pool like others sugested, having this processing performed by N threads.
timer module
This one probably is reasonable in most platforms, as you will need a message processing loop to dispatch timer events. Also, if you ever need a GUI that could be the place.
counter module
This is the one that most annoys me. I can't find the reason for having a separate thread for this. Depending on how much you increment it, it will be a nice bottleneck for the application.
I'll suggest keeping separate counters in each thread and poll(message queue) for them when you need it.
and much more ...
Hope not!
I am working on a threaded application on Linux in C++ which attempts to be real time, doing an action on a heartbeat, or as close to it as possible.
In practice, I find the OS is swapping out my thread and causing delays of up to a tenth of a second while it is switched out, causing the heartbeat to be irregular.
Is there a way my thread can hint to the OS that now is a good time to context switch it out? I could make this call right after doing a heartbeat, and thus minimize the delay due to an ill timed context switch.
It is hard to say what the main problem is in your case, but it is most certainly not something that can be corrected with a call to sched_yield() or pthread_yield(). The only well-defined use for yielding, in Linux, is to allow a different ready thread to preempt the currently CPU-bound running thread at the same priority on the same CPU under SCHED_FIFO scheduling policy. Which is a poor design decision in almost all cases.
If you're serious about your goal of "attempting to be real-time" in Linux, then first of all, you should be using a real-time sched_setscheduler setting (SCHED_FIFO or SCHED_RR, FIFO preferred).
Second, get the full preemption patch for Linux (from kernel.org if your distro does not supply one. It will also give you the ability to reschedule device driver threads and to execute your thread higher than, say, hard disk or ethernet driver threads.
Third, see RTWiki and other resources for more hints on how to design and set up a real-time application.
This should be enough to get you under 10 microseconds response time, regardless of system load on any decent desktop system. I have an embedded system where I only squeeze out 60 us response idle and 150 us under heavy disk/system load, but it's still orders of magnitude faster than what you're describing.
You can tell the current executing thread to pause execution with various commands such as yield.
Just telling the thread to pause is non-determanistic, 999 times it might provide good intervals and 1 time it doesn't.
You'll will probably want to look at real time scheduling for consistant results. This site http://www2.net.in.tum.de/~gregor/docs/pthread-scheduling.html seems to be a good starting spot for researching about thread scheduling.
use sched_yield
And fur threads there is an pthread_yield http://www.kernel.org/doc/man-pages/online/pages/man3/pthread_yield.3.html
I'm a bit confused by the question. If your program is just waiting on a periodic heartbeat and then doing some work, then the OS should know to schedule other things when you go back to waiting on the heartbeat.
You aren't spinning on a flag to get your "heartbeat" are you?
You are using a timer function such as setitimer(), right? RIGHT???
If not, then you are doing it all wrong.
You may need to specify a timer interval that is just a little shorter than what you really need. If you are using a real-time scheduler priority and a timer, your process will almost always be woken up on time.
I would say always on time, but Linux isn't a perfect real-time OS yet.
I'm not too sure for Linux, but on Windows it's been explained that you can't ask the system to not interrupt you for several reasons (first paragraph mostly). Off my head, one of the reasons is hardware interrupts that can occur at any time and over which you have no control.
EDIT Some guy just suggested the use of sched_yield then deleted his answer. It'll relinquish time for your whole process though. You can also use sched_setscheduler to hint the kernel about what you need.