"If you could sense the operation of a computer that is switching itself every few milliseconds amount dozens of tasks you would certainly agree that the computer seems to be performing these tasks simultaneously even though we know that the computer is interleaving the computations of the various tasks"
M.Ben-Ari, Principles of Concurrent Programming, 1982.
So on a single-core CPU, would it be impossible for a single atomic operation to be carried out at the same time as another within the same system?
No, even single-core CPUs can perform multiple operations simultaneously. For example, Pentium processors have multiple pipelines that operate concurrently: one could be doing an add while another is doing a load from memory. Of course, you'd have no way to observe the effects of the simultaneous operations.
Further, architectures like the Pentium 4 are single-core yet can have hyperthreading. This means that the different pipelines in a single core can not only execute instructions concurrently, but those instructions can belong to separate threads. In other words, the CPU can issue instructions for different threads on the same clock tick.
Related
I have a calculation task on a large amount of data, so it can be parallelized quite easily. The next question is how many threads it makes sense to create. Of course I can measure the time for different numbers of threads on my machine, but the program may be run on different machines, so I can't really make manual measurements. Is simply getting the number of threads from std::thread::hardware_concurrency() good enough, or are there other ways?
That function (std::thread::hardware_concurrency()) gives you the total number of logical cores, including those added by hyperthreading.
If your program does intensive number crunching, I would say using only the physical cores and setting processor affinity is the best choice.
You can query the current processor topology with the hwloc library, which works on most platforms.
You may find a comprehensible explanation (though a bit old) here.
If there is a lot of I/O, then you may run two threads per processor to allow one to process data while the other is waiting for input, or one extra thread without affinity so it can take processor time while the others are waiting for I/O. But this is a very rough estimate: better to measure on your machine.
If you can test on other processors, you can use a different strategy for each processor.
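A minimal sketch of the first suggestion, assuming a Linux/glibc system (pthread_setaffinity_np is non-portable; hwloc would be the portable way to discover which logical cores share a physical core):

#include <pthread.h>
#include <sched.h>
#include <thread>
#include <vector>

int main() {
    // hardware_concurrency() reports logical cores and may return 0 ("unknown")
    unsigned n = std::thread::hardware_concurrency();
    if (n == 0) n = 1;

    std::vector<std::thread> workers;
    for (unsigned i = 0; i < n; ++i) {
        workers.emplace_back([] { /* number crunching here */ });
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(i, &set); // pin worker i to logical core i
        pthread_setaffinity_np(workers.back().native_handle(),
                               sizeof(set), &set);
    }
    for (auto& w : workers) w.join();
}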
I'm programming a recursive tree search with multiple branches, and it works fine. To speed it up I'm implementing simple multithreading: I distribute the search into main branches and scatter them among the threads. The threads don't have to interact with each other, and when a solution is found I add it to a common std::vector using a mutex, this way:
if (CubeTest.IsSolved())
{ // Solve algorithm found
std::lock_guard<std::mutex> guard(SearchMutex); // Thread safe code
Solves.push_back(Alg); // Add the solve
}
I don't allocate variables in dynamic storage (the heap) with new and delete, since the memory needs are small.
The maximum number of threads I use is the quantity I get from: std::thread::hardware_concurrency()
I did some tests, always running the same search but changing the number of threads used, and I found things I didn't expect.
I know that if you double the number of threads (assuming the processor has enough capacity) you can't expect to double the performance, because of context switching and things like that.
For example, I have an old Intel Xeon X5650 with 6 cores / 12 threads. If I execute my code, things are as expected up to the sixth thread, but with one additional thread the performance is worse. Using more threads increases the performance very little, to the point that using all available threads (12) barely improves on using only 6:
[Threads vs. processing time chart for Xeon X5650]
(I repeated the test several times and show the average times across all runs.)
I repeated the tests on another computer with an Intel i7-4600U (2 cores / 4 threads) and found this:
[Threads vs. processing time chart for i7-4600U]
I understand that with fewer cores the performance gain from adding threads is smaller.
I also think that when you start to use the second thread on the same core, performance is penalized in some way. Am I right? How can I improve performance in this situation?
So my question is whether these performance gains from multithreading are what I can expect in the real world, or whether these numbers are telling me that I'm doing things wrong and should learn more about multithreading programming.
What's the “real world” performance improvement for multithreading I can expect?
It depends on many factors. In general, the most optimistic improvement one can hope for is a reduction of runtime by a factor of the number of cores¹. In most cases this is unachievable because of the need for threads to synchronise with one another.
In the worst case, not only is there no improvement due to lack of parallelism, but the overhead of synchronisation and cache contention can also make the runtime much worse than the single-threaded program.
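The answer doesn't name it, but the usual way to quantify that ceiling is Amdahl's law; a one-function sketch (my addition, not part of the original answer):

// Amdahl's law: if a fraction p of the runtime parallelises perfectly
// across n cores and the rest stays serial, the best possible speedup is
//     S(n) = 1 / ((1 - p) + p / n)
// e.g. p = 0.9 on 12 cores yields only about 5.7x, never 12x.
double amdahl_speedup(double p, int n) {
    return 1.0 / ((1.0 - p) + p / n);
}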
Peak memory use often increases linearly with the number of threads, because each thread needs to operate on its own data.
Total CPU time, and therefore energy use, also increases due to the extra time spent on synchronisation. This is relevant to systems that run on battery power as well as those with poor heat management (both apply to phones and laptops).
Binary size would be marginally larger due to extra code that deals with threads.
¹ Whether you get all of the performance out of "logical" cores (i.e. "hyper-threading" or "clustered multithreading") also depends on many factors. Often one executes the same function in all threads, in which case they tend to use the same parts of the CPU, and then sharing the core between multiple threads doesn't necessarily yield a benefit.
A CPU which uses hyperthreading claims to be able to execute two threads simultaneously on one core, but it cannot run both at full speed. The two hardware threads share the core's execution resources, so much of the time it is filling gaps: execute a bit of thread A while thread B is stalled, then a bit of B while A waits, and so on.
So what's the point of hyperthreading at all?
These in-core thread switches are much faster than thread switches managed by the operating system's scheduler, so the performance gains come mostly from avoiding that overhead and from keeping the core's execution units busy during stalls. But it does not allow the CPU core to perform more operations per cycle than it could before.
Conclusion: the performance gain you can expect from concurrency depends primarily on the number of physical cores of the CPU, not logical cores.
Also keep in mind that thread synchronization methods like mutexes can become pretty expensive, so the less locking you can get away with, the better. When you have multiple threads filling the same result set, it can sometimes be better to let each thread build its own result set and then merge those sets once all threads have finished, as in the sketch below.
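A minimal sketch of that merge-at-the-end idea, using a hypothetical Search() function and Algorithm type standing in for the asker's solver:

#include <algorithm>
#include <thread>
#include <vector>

struct Algorithm { /* solve data */ };

// Hypothetical stand-in: search one main branch, appending solutions to out.
void Search(int branch, std::vector<Algorithm>& out) { /* recursive search */ }

std::vector<Algorithm> SearchAll(int branches) {
    unsigned n = std::max(1u, std::thread::hardware_concurrency());
    std::vector<std::vector<Algorithm>> partial(n); // one result set per thread
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < n; ++t)
        workers.emplace_back([&, t] {
            for (int b = static_cast<int>(t); b < branches; b += n) // static split
                Search(b, partial[t]); // no lock: each thread owns its vector
        });
    for (auto& w : workers) w.join();

    std::vector<Algorithm> solves; // merge once, single-threaded
    for (auto& p : partial)
        solves.insert(solves.end(), p.begin(), p.end());
    return solves;
}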
Currently, I am learning parallel processing on the CPU, which is a well-covered topic with plenty of tutorials and books.
However, I could not find a single tutorial or resource that talks about programming techniques for hyper-threaded CPUs. Not a single code sample.
I know that to utilize hyperthreading, the code must be implemented such that different parts of the CPU can be used at the same time (the simplest example is calculating an integer and a float at the same time), so it's not plug-and-play.
Which book or resource should I look at if I want to learn more about this topic? Thank you.
EDIT: when I said hyper threading, I meant simultaneous multithreading (SMT) in general, not Intel's Hyper-Threading specifically.
Edit 2: for example, if I have an 8-core i7 CPU, I can make a sorting algorithm that runs 8 times faster when it uses all 8 cores instead of 1. But it will run the same on a 4-core CPU as on a 4c/8t CPU, so in my case SMT does nothing.
Meanwhile, Cinebench will run much better on a 4c/8t CPU than on a 4c/4t one.
SMT is generally most effective when one thread is loading something from memory. Depending on where the data lives (L1, L2, or L3 cache, or RAM), read/write latency can span many CPU cycles that would have to be wasted doing nothing if only one thread were executing per core.
So, if you want to maximize the impact of SMT, try to interleave the memory accesses of two threads so that one of them can execute instructions while the other reads data. Theoretically, you can also use a thread just for cache warming, i.e. loading data from RAM or main storage into cache for subsequent use by other threads.
How best to apply this varies from one system to another, because the access latencies of cache, RAM and main storage, as well as their sizes, can differ by a lot.
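A rough, unmeasured sketch of that pairing (my example, not from the original answer): one thread is memory-bound and one is compute-bound. If the OS schedules them on two logical cores of the same physical core (which you would pin explicitly, e.g. with hwloc, to test reliably), the compute thread can use the execution units while the memory thread stalls on cache misses.

#include <cstdint>
#include <numeric>
#include <thread>
#include <vector>

int main() {
    std::vector<std::uint64_t> data(1u << 24); // 128 MiB: far larger than cache
    std::iota(data.begin(), data.end(), std::uint64_t{0});

    volatile std::uint64_t memory_result = 0;
    volatile double compute_result = 0.0;

    std::thread memory_bound([&] {
        std::uint64_t sum = 0;
        for (std::uint64_t v : data) sum += v; // dominated by memory latency
        memory_result = sum;
    });
    std::thread compute_bound([&] {
        double x = 1.0;
        for (int i = 0; i < 100000000; ++i)
            x = x * 1.0000001 + 0.0000001;     // dominated by FPU work
        compute_result = x;
    });

    memory_bound.join();
    compute_bound.join();
}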
I once thought the only occasion to use multiple threads was when I/O processing is needed.
But I heard they're also useful without I/O processing, because they help occupy more CPU resources.
In my understanding, this would be because
the process with more threads is given more CPU time.
Is this why multiple threads help improve performance even on a single core?
One possible reason you can see greater performance from multiple threads on a single CPU is that CPUs tend to be really good at instruction reordering and making use of instruction-level parallelism. Threads have fewer data and control dependencies with respect to one another than any two sequential instructions within a single thread, and therefore they offer more possibilities for the CPU and OS-level schedulers and re-ordering mechanisms to be very clever.
Don't forget that things like "reads and writes in memory" are still "I/O" when viewed in a particular way. These are relatively slow operations, and much of the pipelining in modern CPUs is used to hide memory latency - having multiple threads executing at once can be useful for filling up time that would otherwise have to be filled with delay slots where there are data hazards within a single thread.
That said, threads are often not a good solution to increase performance, and can have precisely the opposite effect. It can be very easy to saturate all available memory bandwidth using a single thread on some problems.
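The instruction-level parallelism mentioned above can even be seen within a single thread; a small sketch of my own to illustrate the mechanism:

#include <cstddef>
#include <vector>

// One accumulator forms a single long dependency chain: every add must
// wait for the previous one to finish.
double sum_single(const std::vector<double>& v) {
    double s = 0.0;
    for (double x : v) s += x;
    return s;
}

// Four independent accumulators give the CPU four chains it can keep in
// flight at once, typically running noticeably faster on modern cores.
double sum_unrolled(const std::vector<double>& v) {
    double s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    std::size_t i = 0;
    for (; i + 4 <= v.size(); i += 4) {
        s0 += v[i];
        s1 += v[i + 1];
        s2 += v[i + 2];
        s3 += v[i + 3];
    }
    for (; i < v.size(); ++i) s0 += v[i]; // leftover elements
    return (s0 + s1) + (s2 + s3);
}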
Are they both the same thing? Looking just at what concurrent or parallel means in geometry, I'd definitely say no:
In geometry, two or more lines are said to be concurrent if they intersect at a single point.
and
Two lines in a plane that do not intersect or meet are called parallel lines.
Again, in programming, do they have the same meaning? If yes... why?
Thanks
I agree that the geometry vocabulary is in conflict. Think of train tracks instead: Two trains which are on parallel tracks can run independently and simultaneously with little or no interaction. These trains run concurrently, in parallel.
The basic usage difficulty is that "concurrent" can mean "at the same time" (with the trains or code) or "at the same place" (with the geometric lines). For many practical purposes (trains, thread resources) these two notions are directly in conflict.
Natural language is supposed to be silly, ambiguous, and confusing. But we're programmers. We can take refuge in the clarity, simplicity, and elegance of our formal programming languages. Like perl.
From Wikipedia:
Concurrent computing is a form of computing in which programs are designed as collections of interacting computational processes that may be executed in parallel.
Basically, programs can be written as concurrent programs if they are made up of smaller interacting processes. Parallel programming is actually executing these processes at the same time.
So I suppose that concurrent programming is really a style that lends itself to processes being executed in parallel to improve performance.
No, concurrent is definitely different from parallel. Here is exactly how.
Concurrency refers to the sharing of resources in the same time frame. As an example, several processes may share the same CPU or share memory or an I/O device.
Now, by definition two processes are concurrent if and only if the second starts execution before the first has terminated (on the same CPU). If the two processes both run on the same (for now, single-core) CPU, the processes are concurrent but not parallel: in this case, parallelism is only virtual and refers to the OS doing timesharing. The OS seems to be executing several processes simultaneously, but if there is only one single-core CPU, only one instruction from only one process can be executing at any particular time. Since the human time scale is billions of times slower than that of modern computers, the OS can rapidly switch between processes to give the appearance of several processes executing at the same time.
If you instead run the two processes on two different CPUs, the processes are parallel: there is no sharing in the same time frame, because each process runs on its own CPU. The parallelism in this case is not virtual but physical. It is worth noting that running on different cores of the same multi-core CPU still cannot be classified as fully parallel, because the processes will share the same CPU caches and will even contend for them.
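A minimal illustration of the distinction (my sketch, not from the answers above): the two threads below are concurrent by definition, since their executions overlap in time; whether they are also parallel depends on the hardware they land on.

#include <iostream>
#include <thread>

void count(const char* name) {
    for (int i = 0; i < 3; ++i)
        std::cout << name << ": " << i << '\n'; // output may interleave
}

int main() {
    // On a single-core machine the OS interleaves these two threads
    // (concurrent, not parallel); on a multi-core machine they may
    // genuinely run at the same instant (concurrent and parallel).
    std::thread a(count, "A");
    std::thread b(count, "B");
    a.join();
    b.join();
}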