Performance difference for multi-thread and multi-process

A few years ago, on Windows, I ran some tests in which multiple instances of an application that was CPU-intensive, memory-access-intensive and I/O-intensive ran at the same time. I developed two versions: one runs the instances as multiple processes, the other as multiple threads.
I found that the performance was much better with multi-processing. I read an explanation somewhere (I can't remember the site), which stated that under multi-threading the threads are "fighting" for a single memory pipeline and I/O pipeline, which makes performance worse than with multi-processing.
However, I can't find that article anymore. I was wondering whether the following still holds true today:
In Windows, having the algorithm code run under multi-processing, there is a high chance that the performance will be better than multi-threading.

It depends on how much the various threads or processes (I'll be using the collective term "tasks" for both of them) need to communicate, especially by sharing memory: that's easy, cheap and fast for threads, but not at all for processes, so, if a lot of it is going on, I bet processes' performance is not going to beat threads'.
Also, processes (esp. on Windows) are "heavier" to get started, so if a lot of "task starts" occur, again threads can easily beat processes in terms of performance.
Next, you can have CPUs with "hyperthreading", which can run (at least) two threads on a core very rapidly -- but, not processes (since the "hyperthreaded" threads cannot be using distinct address spaces) -- yet another case in which threads can win performance-wise.
If none of these considerations apply, then the race should be no better than a tie, anyway.

I'm not sure what the quote even means. It's very close to nonsense.
The primary thing that in-proc threads share is virtual memory address space.

I found this to be true as well, but I think it has something to do with scheduling, because if you run it long enough, multi-process is just as fast as multi-thread. The threshold is about 10 seconds: if the algorithm needs to run for 10 seconds, multi-process is as fast as multi-thread, but if it only needs to run for less than 1 second, multi-process is much, much faster than multi-thread.

Related

When is better to use CPU or I/O intensive code in child processes [ C++ ]

I got an exam question and didn't know answer.
The task was:
A programmer would like to create a very fast application, hence it organizes its software in 13 processes (a parent and 12 children), all running in parallel.
Child threads are very:
1. I/O-intensive, hence make very frequent use of system calls (read/write from files/pipes/sockets, write on standard output, etc.)
2. CPU intensive, hence make very frequent use of system calls:
Describe when this would be a good choice, when this would be a bad choice, and motivate your answer.
Number 1 and 2, are different questions. So answer should be for both. Good and bad sides of I/O intensive, good and bad sides of CPU intensive.
* Sorry for inappropriate topic, I changed it.
* "Child Threads" was on exam paper. So I copied it. I think, my professor wanted to write "process"
Thank you

Multi-threading and multi-processing work best when the problem is embarrassingly parallel, so that each thread or process does work unrelated to the others, and worst when the threads try to share the same resource.
I/O-intensive, hence make very frequent use of system calls (read/write from files/pipes/sockets, write on standard output, etc.)
Having such processes is a good idea when each process does its I/O (reads from and/or writes to) on a different medium. It is a bad idea when they all try to use the same medium and so end up waiting for each other.
On older media, where the device has to move its read/write heads between tracks, this can seriously hinder performance.
CPU intensive, hence make very frequent use of system calls
The limit here is how many processor cores the system has. The best case is one CPU-intensive process per processor core: if the processes share the same core it is worst, and if each runs on a different core it is best.
The overheads of creating processes, frequent context switching between processes, and communication between processes on the same core actually make multiprocessing on a single core perform worse than the same calculations done by a single process (I assume a skilled implementation in both cases).
Often multiprocessing is not chosen only for performance reasons. The more frequent reason is that an architecture consisting of smaller modules is cleaner, and the quality of such modules is simpler to test and ensure in isolation.

My professor helped me with the answer.
Actually, it is better to have 13 cores in the PC when it is running CPU-intensive code.
The problem is the time delay when the code is I/O-intensive, so there it is better to have fewer cores and use them fully.

Multithreading crowds out other processes

I have added multithreading to a raytracer I am writing, and while it does run much faster now, when it's running, my computer is almost unusably slow. Obviously I want to use all my PC's compute power, but I don't want it to prevent any other application from getting access to the CPUs.
I thought about having the threads sleep, but unless they all sleep at the same time, then the other threads would just eat up the extra time. Also, I don't necessarily want to give up a certain percentage of available compute power if I'm not going to use it.
Also (this is not my official question), I've noticed that for some reason the first thread launched does more work than the second, and the second more than the third, and so on, until the last 5 or so threads (out of 32) don't actually get a crack at any work, despite the fact that there's plenty to go around (there are at least 0.5M work items for them to chew through). If someone would like to venture a guess in the comments, it would be appreciated.

If you use the standard threads, you could try std::thread::hardware_concurrency to get an estimate of the maximum number of threads that are really supported by the hardware, in order not to overload your CPU.
If it returns 0 the information is not available. In other cases you could limit yourself to this number or a little bit below (thinking that other processes might use these as well).
If limiting the number of threads does not improve responsiveness, you can also consider calling from time to time this_thread::yield() to give opportunity to reschedule threads. But depending on the kind of job and synchronisation you use, this second alternative might decrease performance.
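To make the suggestion above concrete, here is a minimal sketch of picking a worker count from the hardware estimate. The helper names, the `reserve` parameter (threads left free for other programs) and the `fallback` value used when the estimate is 0 are my own assumptions for illustration, not part of the standard library.

```cpp
#include <cassert>
#include <thread>

// Hypothetical helper: how many worker threads to start.
// `reserve`  = hardware threads deliberately left free for other programs.
// `fallback` = value used when the hardware count is unknown (reported as 0).
unsigned pick_worker_count(unsigned hardware_threads, unsigned reserve = 1,
                           unsigned fallback = 2) {
    assert(fallback >= 1);
    if (hardware_threads == 0) return fallback;   // information not available
    if (hardware_threads <= reserve) return 1;    // always keep at least one worker
    return hardware_threads - reserve;
}

// Convenience wrapper that queries the standard library for the estimate.
unsigned pick_worker_count_for_this_machine(unsigned reserve = 1) {
    return pick_worker_count(std::thread::hardware_concurrency(), reserve);
}
```

A raytracer could then start `pick_worker_count_for_this_machine()` threads instead of a fixed 32, which leaves one hardware thread for the rest of the system.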

As requested, my comment as an answer:
It sounds like you've oversubscribed your poor CPU. Try reducing the number of threads?
If there's significantly more threads than hardware cores, a lot of time is going to be wasted switching between threads, scheduling them in the OS, and in contention over shared variables. It would also cause the general slowdown of the other running programs, because they have to contend with the high number of threads from your program (which by default all have the same priority as the other programs' threads in the eyes of the OS scheduler).

Multithreading efficiency in C++

I am trying to learn threading in C++, and just had a few questions about it (more specifically <thread>).
Let's say the machine this code will run on has 4 cores, should I split up an operation into 4 threads? If I were to create 8 threads instead of 4, would this run slower on a 4 core machine? What if the processor has hyperthreading, should I try and make the threads match the number of physical cores or logical cores?
Should I just not worry about the number of cores a machine has, and try to create as many threads as possible?
I apologize if these questions have already been answered; I've been looking for information about threading with <thread>, which was introduced in C++11, so I haven't been able to find too much about it.
The program in question is going to run many independent simulations.
If anybody has any insight about <thread> or just multithreading in general, I would be glad to hear it.

If you are performing pure calculations with no I/O - and those calculations are freestanding and not relying on results from other calculations happening in another thread, the maximum number of such threads should be the number of cores (possibly one or two less if the system is also loaded with other tasks).
If you are doing network I/O or similar, more threads are certainly a possibility.
If you are doing disk-I/O, a single thread reading from the disk is often best, because disk reads from multiple threads leads to moving the read/write head around on the disk, which just makes things slower.
If you're using threads to make the code simpler, then the number of threads will probably depend on what you are doing.
It also depends on how "freestanding" each thread is. If they need to share data in complex ways, the sharing/waiting for other thread/etc, may well make it slower with more threads.
And as others have said, try to make your framework for this flexible and test different options. Preferably on multiple machines (unless you only have one kind of machine that you will ever run your code on).

There is no such thing as <threads.h>, you mean <thread>, the thread support library introduced in C++11.
The only answer to your question is "test and see". You can make your code flexible enough, so that it can be run by passing an N parameter (where N is the desired number of threads).
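As a rough sketch of the "pass N and measure" idea, the code below runs the same batch of independent jobs on a configurable number of threads and times it. `simulate` is a made-up stand-in workload, and `run_simulations` / `time_run` are hypothetical names; substitute your own simulation.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>
#include <thread>
#include <vector>

// Stand-in for one independent simulation (hypothetical workload).
std::uint64_t simulate(std::uint64_t seed) {
    std::uint64_t x = seed + 1;
    for (int i = 0; i < 1000; ++i)
        x = x * 6364136223846793005ULL + 1442695040888963407ULL;  // wraps mod 2^64
    return x >> 33;
}

// Runs `jobs` simulations on `n_threads` threads and returns the sum of results.
std::uint64_t run_simulations(unsigned n_threads, std::uint64_t jobs) {
    assert(n_threads > 0);
    std::vector<std::uint64_t> partial(n_threads, 0);
    std::vector<std::thread> pool;
    for (unsigned t = 0; t < n_threads; ++t) {
        pool.emplace_back([&partial, t, n_threads, jobs] {
            std::uint64_t local = 0;  // accumulate locally, publish once at the end
            for (std::uint64_t j = t; j < jobs; j += n_threads) local += simulate(j);
            partial[t] = local;
        });
    }
    for (auto& th : pool) th.join();
    std::uint64_t total = 0;
    for (auto v : partial) total += v;
    return total;
}

// Wall-clock seconds taken by run_simulations for a given thread count.
double time_run(unsigned n_threads, std::uint64_t jobs) {
    auto start = std::chrono::steady_clock::now();
    std::uint64_t result = run_simulations(n_threads, jobs);
    (void)result;
    std::chrono::duration<double> elapsed = std::chrono::steady_clock::now() - start;
    return elapsed.count();
}
```

Calling `time_run(n, jobs)` for n = 1, 2, 4, 8, ... on the target machine shows where adding threads stops helping for this particular workload.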
If you are CPU-bound, the answer will be very different from the case when you are IO bound.
So, test and see! For your reference, this link can be helpful. And if you are serious, then go ahead and get this book. Multithreading, concurrency, and the like are hairy topics.

Let's say the machine this code will run on has 4 cores, should I split up an operation into 4 threads?
If some portions of your code can be run in parallel, then yes it can be made to go faster, but this is very tricky to do since loading threads and switching data between them takes a ton of time.
If I were to create 8 threads instead of 4, would this run slower on a 4 core machine?
It depends on the context switching it has to do. Sometimes the execution will switch between threads very often and sometimes it will not but this is very difficult to control. It will not in any case run faster than 4 threads doing the same work.
What if the processor has hyperthreading, should I try and make the threads match the number of physical cores or logical cores?
Hyperthreading works nearly the same as having more cores. When you will notice the differences between a real core and an execution core, you will have enough knowledge to work around the caveats.
Should I just not worry about the number of cores a machine has, and try to create as many threads as possible?
NO, threads are hard to manage, avoid them as much as you can.
The program in question is going to run many independent simulations.
You should look into OpenMP. It is an API for C, C++ and Fortran (compiler directives plus a runtime library) made to parallelize computation when your program can be split up. Do not confuse parallel with concurrent. Concurrent is simply multiple threads working together, while parallel is made specifically to speed up your application. Maybe OpenMP is overkill for your thing, but it is a good thing to know when you are approaching parallel computing.
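For independent simulations, an OpenMP version can be as small as the sketch below. `run_simulation` is a hypothetical placeholder for the real work. With GCC or Clang the pragma takes effect when compiling with -fopenmp; without that flag it is ignored and the loop simply runs serially.

```cpp
#include <cassert>
#include <vector>

// Stand-in for one independent simulation (hypothetical workload).
double run_simulation(int id) {
    double x = 0.0;
    for (int step = 1; step <= 1000; ++step)
        x += static_cast<double>(id % 7) / step;
    return x;
}

// Runs `count` independent simulations; each iteration writes only its own slot,
// so no locking is needed.
std::vector<double> run_all(int count) {
    assert(count >= 0);
    std::vector<double> results(count);
    #pragma omp parallel for
    for (int i = 0; i < count; ++i) {
        results[i] = run_simulation(i);
    }
    return results;
}
```

The runtime picks the number of threads (by default usually one per logical core), which avoids hand-written thread management.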

Don't think of the number of threads you need as in comparison to the machine you're running on. Threading is valuable any time you have a process that:
A: There is some very slow operation, that the rest of the process need not wait for.
B: Certain functions can run faster than one another and don't need to be executed inline.
C: There is a lot of non-order-dependent I/O going on (web servers).
These are just a few of the obvious examples of when launching a thread makes sense. The number of threads you launch is therefore more dependent on the number of these scenarios that pop up in your code than on the architecture you expect to run on. In fact, unless you're running a process that really, really needs to be optimized, it is likely that you can only eke out a few percentage points of additional performance by benchmarking the number of threads you launch against your architecture, and on modern computers this number shouldn't vary much at all.
Let's take the I/O example, as it is the scenario that will see the most benefit. Let's assume that some program needs to interact with 200 users over the network. Network I/O is very, very slow: thousands of times slower than the CPU. If we were to handle each user in turn, we would waste thousands of processor cycles just waiting for data to come from the first user. Could we not have been processing information from more than one user at a time? In this case, since we have roughly 200 users, and we know the data we're waiting for to be thousands of times slower than what we can handle (assuming we have a minimal amount of processing to do on this data), we should launch as many threads as the operating system allows. A web server that takes advantage of threading can serve hundreds more people per second than one that does not.
Now, let's consider a less I/O-intensive example, where, say, we have several functions that execute in turn but are independent of one another, and some of them might run faster, say because there is disk I/O in one and no disk I/O in another. In this case our I/O is still fairly fast, but we will certainly waste processing time waiting for the disk to catch up. As such, we can launch a few threads, just to take advantage of our processing power and minimize wasted cycles. However, if we launch as many threads as the operating system allows, we are likely to cause memory management issues for branch predictors, etc., and launching too many threads in this case is actually suboptimal and can slow the program down. Note that in this, I never mentioned how many cores the machine has! Not that optimizing for different architectures isn't valuable, but if you optimize for one architecture you are likely very close to optimal for most. Assuming, again, that you're dealing with reasonably modern processors.

I think most people would say that large-scale threading projects are better supported by languages other than C++ (Go, Scala, CUDA). Task parallelism, as opposed to data parallelism, works better in C++. I would say that you should create as many threads as you have tasks to dole out, but if data parallelism is more related to your problem, consider maybe using CUDA and linking to the rest of your project at a later time.
NOTE: if you look at some sort of system monitor, you will notice that there are likely far more than 8 threads running. I looked at my computer and it had hundreds of threads running at once, so don't worry too much about the overhead. The main reason I chose to mention the other languages is that managing many threads in C++ or C tends to be very difficult and error-prone; I did not mention them because the C++ program will run slower (which, unless you use CUDA, it probably won't).

In regards to hyper-threading let me comment on what I have found from experience.
In large dense matrix multiplication hyper-threading actually gives worse performance. For example Eigen and MKL both use OpenMP (at least the way I have used them) and get better results on my system which has four cores and hyper-threading using only four threads instead of eight. Also, in my own GEMM code which gets better performance than Eigen I also get better results using four threads instead of eight.
However, in my Mandelbrot drawing code I get a big performance increase using hyper-threading with OpenMP (eight threads instead of four). The general trend (so far) seems to be that if the code works well using schedule(static) in OpenMP then hyper-threading does not help and may even be worse. If the code works better using schedule(dynamic) then hyper-threading may help.
In other words, my observation so far is that if the run time of each thread can vary a lot hyper-threading can help. If the run time of each thread is constant then it may even make performance worse. But YOU have to test and see for each case.
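To illustrate the schedule(static) versus schedule(dynamic) distinction, here is a small sketch of a Mandelbrot-style loop where the cost per row varies a lot. The function names and the image region are assumptions chosen for the example, not the code from the answer above. Without -fopenmp the pragma is ignored and the loop runs serially.

```cpp
#include <cassert>
#include <complex>
#include <vector>

// Iterations until |z| > 2 for the point c, capped at max_iter.
int mandel_iterations(std::complex<double> c, int max_iter) {
    std::complex<double> z = 0.0;
    int n = 0;
    while (n < max_iter && std::norm(z) <= 4.0) {  // norm(z) is |z|^2
        z = z * z + c;
        ++n;
    }
    return n;
}

// Total iteration count per image row. Rows crossing the set cost far more than
// rows outside it, so rows are handed out on demand with schedule(dynamic).
std::vector<int> mandel_rows(int width, int height, int max_iter) {
    assert(width > 0 && height > 0);
    std::vector<int> row_cost(height, 0);
    #pragma omp parallel for schedule(dynamic)
    for (int y = 0; y < height; ++y) {
        int cost = 0;
        for (int x = 0; x < width; ++x) {
            std::complex<double> c(-2.0 + 3.0 * x / width, -1.0 + 2.0 * y / height);
            cost += mandel_iterations(c, max_iter);
        }
        row_cost[y] = cost;
    }
    return row_cost;
}
```

Switching the clause to schedule(static) and timing both variants with four and eight threads is a quick way to repeat the experiment described above on your own machine.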

Poor performance in multi-threaded C++ program

I have a C++ program running on Linux in which a new thread is created to do some computationally expensive work independent of the main thread (The computational work completes by writing the results to files, which end up being very large). However, I'm getting relatively poor performance.
If I implement the program straightforward (without introducing other threads), it completes the task in roughly 2 hours. With the multi-threaded program it takes around 12 hours to do the same task (this was tested with only one thread spawned).
I've tried a couple of things, including pthread_setaffinity_np to set the thread to a single CPU (out of the 24 available on the server I'm using), as well as pthread_setschedparam to set the scheduling policy (I've only tried SCHED_BATCH). But the effects of these have so far been negligible.
Are there any general causes for this kind of problem?
EDIT: I've added some example code that I'm using, which is hopefully the most relevant parts. The function process_job() is what actually does the computational work, but it would be too much to include here. Basically, it reads in two files of data, and uses these to perform queries on an in-memory graph database, in which the results are written to two large files over a period of hours.
EDIT part 2: Just to clarify, the problem is not that I want to use threads to increase the performance of an algorithm I have. But rather, I want to run many instances of my algorithm simultaneously. Therefore, I expect the algorithm would run at a similar speed when put in a thread as it would if I didn't use multi-threads at all.
EDIT part 3: Thanks for the suggestions all. I'm currently doing some unit tests (seeing which parts are slowing down) as some have suggested. As the program takes a while to load and execute, it is taking time to see any results from the tests and therefore I apologize for late responses. I think the main point I wanted to clarify is possible reasons why threading could cause a program to run slowly. From what I gather from the comments, it simply shouldn't be. I'll post when I can find a reasonable resolution, thanks again.
(FINAL) EDIT part 4: It turns out that the problem was not related to threading after all. Describing it would be too cumbersome at this point (including the use of compiler optimization levels), but the ideas posted here were very useful and appreciated.
#include <pthread.h> // pthread_create, pthread_setaffinity_np (GNU extension)
#include <sched.h>   // cpu_set_t, CPU_ZERO, CPU_SET, SCHED_BATCH (GNU extensions)
// Scheduling parameters: the minimum priority allowed for SCHED_BATCH.
struct sched_param sched_param = {
    sched_get_priority_min(SCHED_BATCH)
};
// Pin the given thread to a single core; returns 0 on success.
int set_thread_to_core(pthread_t tid, int core_id) {
    cpu_set_t mask;
    CPU_ZERO(&mask);
    CPU_SET(core_id, &mask);
    return pthread_setaffinity_np(tid, sizeof(mask), &mask);
}
void *worker_thread(void *arg) {
    job_data *temp = (job_data *)arg; // get the information for the task passed in
    ...
    pthread_t tid = pthread_self(); // pthread_t, not long: this is what the pthread_* calls expect
    int set_thread = set_thread_to_core(tid, slot_id); // assume slot_id is 1 (it is in the test case I run)
    pthread_setschedparam(tid, SCHED_BATCH, &sched_param);
    int success = process_job(...); // this is where all the work actually happens
    pthread_exit(NULL);
}
int main(int argc, char* argv[]) {
    ...
    pthread_t temp;
    pthread_create(&temp, NULL, worker_thread, (void *) &jobs[i]); // jobs is a vector of a class type containing information for the task
    ...
    return 0;
}

If you have plenty of CPU cores, and have plenty of work to do, it should not take longer to run in multithreaded than single threaded mode - the actual CPU time may be a fraction longer, but the "wall-clock time" should be shorter. I'm pretty sure that your code has some sort of bottleneck where one thread is blocking the other.
This is because of one or more of these things - I'll list them first, then go into detail below:
Some lock in a thread is blocking the second thread from running.
Sharing of data between threads (either true or "false" sharing)
Cache thrashing.
Competition for some external resource causing thrashing and/or blocking.
Badly designed code in general...
Some lock in a thread is blocking the second thread from running.
If a thread takes a lock, and another thread wants to use the resource locked by that thread, it will have to wait. This obviously means the waiting thread isn't doing anything useful. Locks should be kept to a minimum by only taking the lock for a short period. You can use some code to identify whether locks are holding up your code, such as:
// tryLock() is a placeholder for whatever "try to acquire" call your lock type offers
while (!tryLock(some_lock))
{
    tried_locking_failed[some_lock][thread_id]++;
}
total_locks[some_lock]++;
Printing some stats of the locks would help to identify where the locking is contentious - or you can try the old trick of "Press break in the debugger and see where you are" - if a thread is constantly waiting for some lock, then that's what's preventing progress...
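A runnable variant of the tryLock loop above, written with std::mutex, might look like the sketch below. `CountingLock` and `contended_increment` are hypothetical names introduced for the example, and the counters are per lock only (not per thread) to keep it short.

```cpp
#include <atomic>
#include <cassert>
#include <mutex>
#include <thread>
#include <vector>

// A mutex wrapper that counts successful acquisitions and failed attempts.
struct CountingLock {
    std::mutex m;
    std::atomic<unsigned long> acquired{0};
    std::atomic<unsigned long> failed_tries{0};

    void lock() {
        while (!m.try_lock()) {         // same idea as the tryLock loop above
            ++failed_tries;
            std::this_thread::yield();  // give the lock holder a chance to run
        }
        ++acquired;
    }
    void unlock() { m.unlock(); }
};

// Demo: n_threads each increment a shared counter `iterations` times under the lock.
long contended_increment(CountingLock& lk, unsigned n_threads, long iterations) {
    assert(n_threads > 0);
    long counter = 0;
    std::vector<std::thread> pool;
    for (unsigned t = 0; t < n_threads; ++t)
        pool.emplace_back([&] {
            for (long i = 0; i < iterations; ++i) {
                std::lock_guard<CountingLock> guard(lk);  // needs only lock()/unlock()
                ++counter;
            }
        });
    for (auto& th : pool) th.join();
    return counter;
}
```

After a run, a high `failed_tries` relative to `acquired` suggests that this lock is contended and worth a closer look.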
Sharing of data between threads (either true or "false" sharing)
If two threads use the same variable [and update its value frequently], the two threads have to swap "I've updated this" messages, and each CPU has to fetch the data from the other CPU before it can continue with its use of the variable. Since "data" is shared at a "per cache-line" level, and a cache line is typically 64 bytes on current x86 CPUs, something like:
int var[NUM_THREADS];
...
var[thread_id]++;
would classify as something called "false sharing" - the ACTUAL data updated is unique per CPU, but since the data is within the same 64-byte region, the cores will still have updated the same area of memory.
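One common way around this is to pad each per-thread counter out to its own cache line. The sketch below assumes 64-byte cache lines (typical of current x86 CPUs, but check your hardware) and a C++17 compiler, so that std::vector honours the over-aligned type. `PaddedCounter` and `count_in_parallel` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

constexpr std::size_t kCacheLine = 64;  // assumption: 64-byte cache lines

// One counter per thread, aligned and padded so no two counters share a line.
struct alignas(kCacheLine) PaddedCounter {
    long value = 0;
};
static_assert(sizeof(PaddedCounter) == kCacheLine, "one counter per cache line");

// Each thread increments only its own padded slot.
std::vector<long> count_in_parallel(unsigned n_threads, long increments) {
    assert(n_threads > 0);
    std::vector<PaddedCounter> var(n_threads);
    std::vector<std::thread> pool;
    for (unsigned id = 0; id < n_threads; ++id)
        pool.emplace_back([&var, id, increments] {
            for (long i = 0; i < increments; ++i) ++var[id].value;
        });
    for (auto& th : pool) th.join();

    std::vector<long> result;
    for (const auto& c : var) result.push_back(c.value);
    return result;
}
```

An alternative with the same effect is to accumulate into a local variable inside each thread and write the shared slot once at the end.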
Cache thrashing.
If two threads do a lot of memory reading and writing, the cache of the CPU may be constantly throwing away good data to fill it with data for the other thread. There are some techniques available to ensure that two threads don't run in "lockstep" on which part of the cache the CPU uses. If the data size is 2^n (a power of two) and fairly large (a multiple of the cache size), it's a good idea to "add an offset" for each thread - for example 1KB or 2KB. That way, when the second thread reads the same distance into the data region, it will not overwrite exactly the same area of cache that the first thread is currently using.
Competition for some external resource causing thrashing and/or blocking.
If two threads are reading or writing from/to the hard-disk, network card, or some other shared resource, this can lead to one thread blocking another thread, which in turn means lower performance. It is also possible that the code detects different threads and does some extra flushing to ensure that data is written in the correct order or similar, before starting work with the other thread.
It is also possible that there are locks internally in the code that deals with the resource (user-mode library or kernel mode drivers) that block when more than one thread is using the same resource.
Generally bad design
This is a "catchall" for "lots of other things that can be wrong". If the result from one calculation in one thread is needed to progress the other, obviously, not a lot of work can be done in that thread.
Too small a work-unit, so all the time is spent starting and stopping the thread, and not enough work is being done. Say for example that you dole out small numbers to be "calculate if this is a prime" to each thread, one number at a time, it will probably take a lot longer to give the number to the thread than the calculation of "is this actually a prime-number" - the solution is to give a set of numbers (perhaps 10, 20, 32, 64 or such) to each thread, and then report back the result for the whole lot in one go.
There are plenty of other "bad design". Without understanding your code it's quite hard to say for sure.
It is entirely possible that your problem is none of the ones I've mentioned here, but most likely it is one of these. Hopefully this answer is helpful in identifying the cause.
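The work-unit advice above (hand each thread a set of numbers and report the result for the whole lot in one go) could be sketched like this. The names and the default batch size of 64 are assumptions for illustration, and the trial-division `is_prime` is deliberately naive.

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <thread>
#include <vector>

// Naive trial division; fine for a demonstration.
bool is_prime(std::uint64_t n) {
    if (n < 2) return false;
    for (std::uint64_t d = 2; d * d <= n; ++d)
        if (n % d == 0) return false;
    return true;
}

// Counts primes in [0, limit). Threads claim `batch` numbers at a time and
// report one total per batch, instead of synchronising once per number.
std::size_t count_primes_batched(std::uint64_t limit, unsigned n_threads,
                                 std::uint64_t batch = 64) {
    assert(n_threads > 0 && batch > 0);
    std::atomic<std::uint64_t> next{0};
    std::atomic<std::size_t> total{0};
    std::vector<std::thread> pool;
    for (unsigned t = 0; t < n_threads; ++t)
        pool.emplace_back([&] {
            for (;;) {
                std::uint64_t begin = next.fetch_add(batch);  // claim a whole batch
                if (begin >= limit) break;
                std::uint64_t end = std::min(limit, begin + batch);
                std::size_t found = 0;
                for (std::uint64_t n = begin; n < end; ++n) found += is_prime(n);
                total += found;                               // one update per batch
            }
        });
    for (auto& th : pool) th.join();
    return total.load();
}
```

With `batch = 1` the same code degenerates into the "one number at a time" pattern, which makes it easy to measure how much the batching actually buys.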

Read CPU Caches and Why You Care to understand why a naive port of an algorithm from one thread to multiple threads will more often than not result in greatly reduced performance and negative scalability. Algorithms that are specifically designed for parallelism take care of overactive interlocked operations, false sharing and other causes of cache pollution.

Here are a few things you might wanna look into.
1°) Do you enter any critical section (locks, semaphores, etc.) between your worker thread and your main thread? (This should be the case if your queries modify the graph.) If so, that could be one of the sources of the multithreading overhead: threads competing for a lock usually degrade performance.
2°) You're using a 24-core machine, which I assume is NUMA (Non-Uniform Memory Access). Since you set the thread affinities during your tests, you should pay close attention to the memory topology of your hardware. Looking at the files in /sys/devices/system/cpu/cpuX/ can help you with that (beware that cpu0 and cpu1 aren't necessarily close together, and thus do not necessarily share memory). Threads heavily using memory should use local memory (allocated in the same NUMA node as the core they're executing on).
3°) You are heavily using disk I/O. Which kind of I/O is that? If every thread performs some synchronous I/O every time, you might wanna consider asynchronous system calls, so that the OS stays in charge of scheduling those requests to the disk.
4°) Some cache issues have already been mentioned in other answers. From experience, false sharing can hurt performance as much as you're observing. My last recommendation (which should have been my first) is to use a profiler tool, such as Linux Perf or OProfile. With the kind of performance degradation you're experiencing, the cause will certainly appear quite clearly.

The other answers have all addressed the general guidelines that can cause your symptoms. I will give my own, hopefully not excessively redundant version. Then I will talk a bit about how you can get to the bottom of the problem with everything discussed in mind.
In general, there's a few reasons you'd expect multiple threads to perform better:
A piece of work is dependent on some resources (disk, memory, cache, etc.) while other pieces can proceed independently of these resources or said workload.
You have multiple CPU cores that can process your workload in parallel.
The main reasons, enumerated above, you'd expect multiple threads to perform less well are all based on resource contention:
Disk contention: already explained in detail and can be a possible issue, especially if you are writing small buffers at a time instead of batching
CPU time contention if the threads are scheduled onto the same core: probably not your issue if you're setting affinity. However, you should still double check
Cache thrashing: similarly probably not your problem if you have affinity, though this can be very expensive if it is your problem.
Shared memory: again talked about in detail and doesn't seem to be your issue, but it wouldn't hurt to audit the code to check it out.
NUMA: again talked about. If your worker thread is pinned to a different core, you will want to check whether the work it needs to access is local to the main core.
Ok so far not much new. It can be any or none of the above. The question is, for your case, how can you detect where the extra time is coming from. There's a few strategies:
Audit the code and look for obvious areas. Don't spend too much time doing this as it's generally unfruitful if you wrote the program to begin with.
Refactor the single threaded code and the multi-threaded code to isolate one process() function, then profile at key checkpoints to try to account for the difference. Then narrow it down.
Refactor the resource access into batches, then profile each batch on both the control and the experiment to account for the difference. Not only will this tell you which areas (disk access vs memory access vs spending time in some tight loop) you need to focus your efforts on, doing this refactor might even improve your running time overall. Example:
First copy the graph structure to thread-local memory (perform a straight-up copy in the single-threaded case)
Then perform the query
Then setup an asynchronous write to disk
Try to find a minimally reproducible workload with the same symptoms. This means changing your algorithm to do a subset of what it already does.
Make sure there's no other noise in the system that could've caused the difference (if some other user is running a similar system on the work core).
My own intuition for your case:
Your graph structure is not NUMA friendly for your worker core.
The kernel can actually schedule your worker thread off the affinity core. This can happen if you don't have isolcpus set for the core you're pinning to.

I can't tell you what's wrong with your program because you haven't shared enough of it to do a detailed analysis.
What I can tell you is if this was my problem the first thing I would try is to run two profiler sessions on my application, one on the single threaded version and another on the dual thread configuration. The profiler report should give you a pretty good idea of where the extra time is going. Note that you may not need to profile the entire application run, depending on the problem the time difference may become obvious after you profile for a few seconds or minutes.
As far as profiler choices for Linux you may want to consider oprofile or as a second choice gprof.
If you find you need help interpreting the profiler output feel free to add that to your question.

It can be a right pain in the rear to track down why threads aren't working as planned. One can do so analytically, or one can use a tool to show what's going on. I've had very good mileage out of ftrace, Linux's clone of Solaris's dtrace (which in turn is based on what VxWorks, Greenhill's Integrity OS and Mercury Computer Systems Inc have been doing for a looong time.)
In particular I found this page very useful: http://www.omappedia.com/wiki/Installing_and_Using_Ftrace, particularly this and this section. Don't worry about it being an OMAP orientated website; I've used it on X86 Linuxes just fine (though you may have to build a kernel to include it). Also remember that the GTKWave viewer is primarily intended for looking at log traces from VHDL developments, which is why it looks 'odd'. It's just that someone realised that it would be a usable viewer for sched_switch data too, and that saved them writing one.
Using the sched_switch tracer you can see when (but not necessarily why) your threads are running, and that might be enough to give you a clue. The 'why' can be revealed by careful examination of some of the other tracers.

If you are getting slowdown from using 1 thread, it is likely due to overhead from using thread safe library functions, or from thread setup. Creating a thread for each job will cause significant overhead, but probably not as much as you refer to.
In other words, it is probably some overhead from some thread safe library function.
The best thing to do, is to profile your code to find out where time is spent. If it is in a library call, try to find a replacement library or implement it yourself. If the bottleneck is thread creation/destruction try reusing threads, for instance using OpenMP tasks or std::async in C++11.
Some libraries are really nasty wrt thread-safety overhead. For instance, many rand() implementations use a global lock rather than thread-local PRNGs. Such locking overhead is much larger than the cost of generating a number, and is hard to track down without a profiler.
The slowdown could also stem from small changes you have made, for instance declaring variables volatile, which generally should not be necessary.

I suspect you're running on a machine with one single-core processor. This problem is not parallelizable on that kind of system. Your code is constantly using the processor, which has a fixed number of cycles to offer to it. It actually runs more slowly because the additional thread adds expensive context switching to the problem.
The only kinds of problems that parallelize well on a single-processor machine are those that allow one path of execution to run while another is blocked waiting for I/O, and situations (such as keeping a responsive GUI) where allowing one thread to get some processor time is more important than executing your code as quickly as possible.
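A quick way to check the single-core suspicion from inside the program is `std::thread::hardware_concurrency()`. The sketch below (the helper name `cpu_bound_worker_count` is made up) caps the number of CPU-bound workers at what the hardware reports, since extra compute threads beyond that mostly add context switches.

```cpp
#include <algorithm>
#include <cassert>
#include <thread>

// Hypothetical helper: never use more CPU-bound worker threads than the
// machine reports hardware threads for.
unsigned cpu_bound_worker_count(unsigned requested) {
    assert(requested > 0);
    unsigned hw = std::thread::hardware_concurrency();  // may return 0 if unknown
    if (hw == 0) {
        hw = 1;  // assumption: fall back to a single worker when unknown
    }
    return std::min(requested, hw);
}
```

This cap only makes sense for compute-bound work; threads that spend most of their time blocked on I/O are a different case, as described above.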

If you only want to run many independent instances of your algorithm, can you just submit multiple jobs (with different parameters; this can be handled by a single script) to your cluster? That would eliminate the need to profile and debug your multithreaded program. I don't have much experience with multithreaded programming, but if you use MPI or OpenMP then you'd have to write less code for the bookkeeping too. For example, if some common initialization routine is needed and the processes can run independently thereafter, you can just do that by initializing in one process and doing a broadcast. No need for maintaining locks and such.

Where's the balance between thread amount and thread block times?

Elongated question:
When having more blocking threads than CPU cores, where's the balance between thread amount and thread block times to maximize CPU efficiency by reducing context switch overhead?
I have a wide variety of IO devices that I need to control on Windows 7, with an x64 multi-core processor: PCI devices, network devices, stuff being saved to hard drives, big chunks of data being copied,... The most common policy is: "Put a thread on it!". Several dozen threads later, this is starting to feel like a bad idea.
None of my cores are being used 100%, and several cores are still idling, but there are delays showing up in the range of 10 to 100ms that cannot be explained by IO blockage or CPU-intensive usage. Other processes don't seem to require resources either. I'm suspecting context switch overhead.
There's a bunch of possible solutions I have:
Reduce threads by bundling the same IO devices: This mainly goes for the hard drive, but maybe for the network as well. If I'm saving 20MB to the hard drive in one thread, and 10MB in the other, wouldn't it be better to post it all to the same thread? How would this work in case of multiple hard drives?
Reduce threads by bundling similar IO devices, and increase their priority: Dozens of threads with increased priority are probably gonna make my user interface thread stutter. But I can bundle all that functionality together in 1 or a couple of threads and increase their priority.
Any case studies tackling similar problems are much appreciated.

First, it sounds like these tasks should be performed using asynchronous I/O (IO Completion Ports, preferably), rather than with separate threads. Blocking threads are generally the wrong way to do I/O.
Second, blocked threads shouldn't affect context switching. The scheduler has to juggle all the active threads, and so, having a lot of threads running (not blocked) might slow down context switching a bit. But as long as most of your threads are blocked, they shouldn't affect the ones that aren't.

10-100ms with some cores idle: it's not context-switching overhead in itself since a switch is orders of magnitude faster than these delays, even with a core swap and cache flush.
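If you want a rough number for your own box, a ping-pong between two threads gives a feel for it. This is only a sketch: the helper name `average_handoff_ns` is made up, and what it measures is a condition-variable handoff (wake-up plus switch, with thread start-up time folded in), not a pure context switch.

```cpp
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <mutex>
#include <thread>

// Hypothetical helper: two threads pass a turn flag back and forth
// round_trips times; returns the average time per handoff in nanoseconds.
double average_handoff_ns(int round_trips) {
    assert(round_trips > 0);
    std::mutex m;
    std::condition_variable cv;
    bool ping_turn = true;

    auto start = std::chrono::steady_clock::now();
    std::thread pong([&] {
        for (int i = 0; i < round_trips; ++i) {
            std::unique_lock<std::mutex> lock(m);
            cv.wait(lock, [&] { return !ping_turn; });
            ping_turn = true;  // hand the turn back
            cv.notify_one();
        }
    });
    for (int i = 0; i < round_trips; ++i) {
        std::unique_lock<std::mutex> lock(m);
        cv.wait(lock, [&] { return ping_turn; });
        ping_turn = false;  // hand the turn over
        cv.notify_one();
    }
    pong.join();
    auto elapsed = std::chrono::steady_clock::now() - start;

    // Each round trip consists of two handoffs.
    return std::chrono::duration<double, std::nano>(elapsed).count() /
           (2.0 * round_trips);
}
```

Calling `average_handoff_ns(100000)` and comparing the result against the 10-100ms delays should show whether switching alone could plausibly account for them.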
Async I/O would not help much here. The kernel thread pools that implement async I/O also have to be scheduled/swapped, albeit this is faster than user-space threads since there are fewer Wagnerian ring-cycles. I would certainly head for async I/O if the CPU loading was becoming an issue, but it's not.
You are not short of CPU, so what is it? Is there much thrashing - RAM shortage? Excessive paging can surely result in large delays. Where is your page file? I've shoved mine off Drive C onto another fast SATA drive.
PCI bandwidth? You got a couple of TV cards in there?
Disk controller flushing activity - have you got an SSD that's approaching capacity? That's always a good one for unexplained pauses. I get the odd pause even though my 128G SSD is only 2/3 full.
I've never had a problem specifically related to context-swap time and I've been writing multithreaded apps for decades. Windows schedules and dispatches the ready threads onto cores reasonably quickly. 'Several dozen threads' in itself (i.e. not all running!) is not remotely a problem - looking now at my Task Manager/performance, I have 1213 threads loaded and no performance issues at all with ~6% CPU usage (app on test running in background, BitTorrent etc). Firefox has 30 threads, VLC media player 27, my test app 23. No problem at all writing this post.
Given your issue of 10-100ms delays, I would be amazed if fiddling with thread priorities and/or changing the way your work is loaded onto threads provides any improvement - something else is stuffing your system, (you haven't got any drivers that I coded, have you? :).
Does perfmon give any clues?
Rgds,
Martin

I don't think that there is a conclusive answer, and it probably depends
on your OS as well; some handle threads better than others. Still,
delays in the 10 to 100 ms range are not due to context switching itself
(although they could be due to characteristics of the scheduling
algorithm). My experience under Windows is that I/O is very
inefficient, and if you're doing I/O, of any type, you will block. And
that I/O by one process or thread will end up blocking other processes
or threads. (Under Windows, for example, there's probably no point in
having more than one thread handle the hard drive. You can't read or
write several sectors at the same time, and my impression is that
Windows doesn't optimize accesses like some other systems do.)
With regards to your exact questions:
"If I'm saving 20MB to the hard drive in one thread, and 10MB in the
other, wouldn't it be better to post it all to the same?": It depends on
the OS. Normally, there should be no reduction in time or latency using
separate threads, and depending on other activity and the OS, there
could be an improvement. (If there are several disk requests
pending, most OS's will optimize the accesses, reordering the requests
to reduce head movement.) The simplest solution would be to try both,
and see which works better on your system.
"How would this work in case of multiple hard drives?": The OS should
be able to do the I/O in parallel, if the requests are to different
drives.
With regards to increasing priority of one or more threads, it's very OS
dependent, but probably worth trying. Unless there's significant CPU
time used in the threads with the higher priority, it shouldn't impact
the user interface; these threads are mostly blocked for I/O,
remember.

Well, my Windows 7 is currently running 950 threads. I don't think that adding another few dozen on would make a significant difference. However, you should definitely be looking at a thread pool or other work-stealing device for this - you shouldn't make new threads just to let them block. If Windows provides asynchronous I/O by default, then use it.
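For what it's worth, here is a minimal sketch of the thread pool idea in portable C++11. The class name `ThreadPool` and its interface are made up for illustration; it is a plain shared-queue pool (no work stealing) and knows nothing about Windows' own pool or async I/O facilities.

```cpp
#include <cassert>
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical minimal pool: a fixed set of threads pulls tasks from one
// shared queue instead of a new thread being created per task.
class ThreadPool {
public:
    explicit ThreadPool(unsigned thread_count) {
        assert(thread_count > 0);
        for (unsigned i = 0; i < thread_count; ++i) {
            workers_.emplace_back([this] { work(); });
        }
    }

    // Lets the workers drain the remaining tasks, then joins them.
    ~ThreadPool() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            stopping_ = true;
        }
        cv_.notify_all();
        for (auto& t : workers_) {
            t.join();
        }
    }

    void submit(std::function<void()> task) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            tasks_.push(std::move(task));
        }
        cv_.notify_one();
    }

private:
    void work() {
        for (;;) {
            std::function<void()> task;
            {
                std::unique_lock<std::mutex> lock(mutex_);
                cv_.wait(lock, [this] { return stopping_ || !tasks_.empty(); });
                if (tasks_.empty()) {
                    return;  // only reached once stopping_ is set
                }
                task = std::move(tasks_.front());
                tasks_.pop();
            }
            task();  // run outside the lock so other workers can proceed
        }
    }

    std::mutex mutex_;
    std::condition_variable cv_;
    std::queue<std::function<void()>> tasks_;
    bool stopping_ = false;
    std::vector<std::thread> workers_;
};
```

With something like this, each device gets tasks submitted to the pool rather than a dedicated thread; tasks that block for a long time will still tie up a worker, which is where real asynchronous I/O comes in.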
