Does each Map have its own thread? So, when we do splitting, should we split the task into as many map functions as we have processors available? Or is there some other way, besides threads, to run map functions in parallel?
I assume you're speaking about the Hadoop MapReduce implementation, and that by parallelism you mean how the work is spread over CPU cores.
As an introduction: the number of map tasks for a given job is derived from the number of input data splits. Those tasks are then scheduled onto task nodes, where mappers are started, up to mapred.tasktracker.map.tasks.maximum per node. This configuration parameter may differ between nodes, for example when they have different computational power. (I have described this with an illustration in another answer of mine on SO.)
By default, each mapper runs in its own JVM, and multiple JVMs can be running on a node at any given instant, up to mapred.tasktracker.map.tasks.maximum. These JVMs are either recreated for each starting map task or reused for several consecutive runs. I won't dig into the details, but this setting can also affect performance because of the trade-off between memory fragmentation and JVM instantiation overhead.
Coming to your question: how the running JVMs are spread over cores is controlled by the underlying OS, which balances load and optimizes computation. You can expect different JVMs to be executed on different cores when possible, and in the general case you can expect performance degradation if the number of mappers exceeds the number of cores (though I have seen skewed use cases where the latter is not true).
An example:
Say a job is split into 100 map tasks, to be run on 2 task nodes with 2 CPU cores each and mapred.tasktracker.map.tasks.maximum equal to 2. Then, most of the time (except while waiting for mappers to start), your 100 tasks will be executed 4 at a time, resulting on average in 50 tasks completed by each node.
And last, but not least: for map tasks it is common for the bottleneck to be IO rather than CPU. In that case it's not unusual to get better results from many machines with modest CPUs than from a few CPU-heavy servers.
Related
I want to implement a multithreaded MD5 brute-force attack algorithm (in C++). I know about rainbow tables and dictionaries, but I'm not trying to implement the most efficient MD5 cracker; I'm just interested in the brute-force algorithm itself.
The problem is how to distribute all password variations of all available lengths between threads. For example, to recover a password of 4 to 6 lower-case characters we have to check N = 26^4 + 26^5 + 26^6 = 321,254,128 combinations (by the formula for variations with repetition, V(n,k) = n^k).
So, to distribute all the variations in equal parts between, say, 8 threads, we would need to know every (N/8)*t-th variation, where t = 1..7. Note also that these variations have different lengths (4, 5, 6), so variations of 4-5 symbols could end up in the same thread as some of the 6-symbol variations.
Does anybody know how this is done in real-world brute-force applications? Maybe with some kind of thread pool?
The approach I find quite flexible is to spawn threads running the following code:
void thread_fn() {
    PASSWORD_BLOCK block;
    // Keep fetching blocks of candidate passwords until the range is exhausted.
    while (get_next_password_block(&block)) {
        for (const PASSWORD& password : block) {
            if (verify_password(password)) set_password_found(password);
        }
    }
}
Typically, if the code is well optimised, you will spawn as many threads as there are active cores; in some cases launching more threads than cores gives a performance gain, but that usually points to sub-optimal code.
get_next_password_block() is where all locking and synchronisation is done. This function is responsible for keeping track of password list/range, incrementing password, etc.
Why use a PASSWORD_BLOCK and not just a single password? MD5 is a very fast algorithm, so if we called get_next_password_block() for every password, the locking/incrementing overhead would be extreme. Besides, SIMD instructions allow us to perform bulk MD5 computations (4 passwords at a time), so we want a fast, efficient way to get a sizeable chunk of passwords and reduce the overhead.
The particular block size depends on CPU speed and algorithm complexity; for MD5 I would expect it to be on the order of millions of passwords.
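To make that concrete, here is a minimal sketch of what get_next_password_block() could look like, assuming candidates are numbered by a single 64-bit index and a block is just an index range; TOTAL_PASSWORDS, BLOCK_SIZE and the index-based PASSWORD_BLOCK layout are assumptions of this sketch, not something from the answer above. The worker would then map each index in the block to its candidate string (e.g. with a hypothetical index_to_password() helper) before hashing it:
#include <algorithm>
#include <cstdint>
#include <mutex>

constexpr uint64_t TOTAL_PASSWORDS = 321254128ULL;  // 26^4 + 26^5 + 26^6
constexpr uint64_t BLOCK_SIZE      = 1ULL << 20;    // roughly a million candidates per block

struct PASSWORD_BLOCK {
    uint64_t first;   // index of the first candidate in this block
    uint64_t count;   // number of candidates in this block
};

static std::mutex g_lock;
static uint64_t   g_next = 0;                       // next unassigned candidate index

bool get_next_password_block(PASSWORD_BLOCK* block) {
    std::lock_guard<std::mutex> guard(g_lock);      // only the counter is protected
    if (g_next >= TOTAL_PASSWORDS) return false;    // everything has been handed out
    block->first = g_next;
    block->count = std::min(BLOCK_SIZE, TOTAL_PASSWORDS - g_next);
    g_next += block->count;
    return true;
}
Each thread pays for the lock only once per block, so with blocks of a million candidates the synchronization cost is negligible.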
The "correct" way of doing this would be to have a pool of workers (equal to the number of CPU cores, either not counting hyperthread cores, or counting all of them as "one") and a lockfree FIFO queue to which you submit groups of a hundred thousand or so tasks. This gives an acceptable balance between synchronization overhead and load balancing.
The trick is to divide work into relatively small groups, so the time when only one thread remains doing the last group is not too long (no parallelism there!), but at the same time not make the groups too small so you are bound by synchronization / bus contention. MD5 is pretty fast, so a few ten thousand to hundred thousand work items should be fine.
However, given the concrete problem, that's actually overkill. Way too complicated.
There are 26 times more 5-letter passwords than 4-letter passwords, 26 times more 6-letter passwords than 5-letter ones, and so on. In other words, the longest password length accounts for by far the biggest share of the total number of combinations. All 4-, 5- and 6-letter combinations together make up only about 4% of the 7-letter combinations, since (26^4 + 26^5 + 26^6) / 26^7 = 1/26 + 1/26^2 + 1/26^3 ≈ 0.04. In other words, they are practically insignificant: roughly 96% of the total runtime is spent on the 7-letter combinations, no matter what you do with the rest. It gets even more extreme if you add digits or capitalization to the character set.
Thus, you can simply fire up as many threads as you have CPU cores, and run all 4-letter combinations in one thread, all 5-letter combinations in another, and so on. That's not great, but it is good enough, since nobody will notice the difference anyway.
Then simply partition the possible 7-letter combinations into num_threads equal-sized ranges, and have each thread that has finished its initial work continue with its 7-letter range.
Work will not always be perfectly balanced, but it will be during 96% of the runtime. And it works with the absolute minimum of task management (none) and synchronization (you merely need to set a global flag to exit when a match is found).
Since you cannot expect perfect load balancing even with perfectly correct task scheduling (thread scheduling is in the hands of the operating system, not yours), this should be very close to the "perfect" approach.
Alternatively, you could fire up one extra thread that handles the entire all-but-longest range of combinations (the "insignificant 4%") and partition the rest equally. This causes a few extra context switches during startup, but on the other hand makes the program logic even simpler.
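For illustration, a minimal sketch of that range partitioning, assuming lower-case letters only and a fixed longest length of 7. The index_to_password() helper and the "zzzzzzz" target check are hypothetical stand-ins (a real cracker would compare MD5 hashes); the global found flag is the exit mechanism mentioned above:
#include <algorithm>
#include <atomic>
#include <cstdint>
#include <string>
#include <thread>
#include <vector>

// Hypothetical target check; a real cracker would hash the candidate and compare digests.
static std::atomic<bool> found{false};
static bool check_password(const std::string& candidate) { return candidate == "zzzzzzz"; }

// Map an index in [0, 26^7) to the corresponding 7-letter candidate (base-26, 'a'..'z').
static std::string index_to_password(uint64_t index, int length) {
    std::string pass(length, 'a');
    for (int pos = length - 1; pos >= 0; --pos) {
        pass[pos] = static_cast<char>('a' + index % 26);
        index /= 26;
    }
    return pass;
}

int main() {
    const uint64_t total = 8031810176ULL;                        // 26^7
    const unsigned n     = std::max(1u, std::thread::hardware_concurrency());
    std::vector<std::thread> pool;
    for (unsigned t = 0; t < n; ++t) {
        const uint64_t begin = total * t / n;                    // equal-sized ranges
        const uint64_t end   = total * (t + 1) / n;
        pool.emplace_back([begin, end] {
            for (uint64_t i = begin; i < end && !found; ++i)
                if (check_password(index_to_password(i, 7))) found = true;
        });
    }
    for (auto& th : pool) th.join();
}
Each thread gets a contiguous slice of the index space, so no synchronization is needed beyond the flag.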
Manually partitioning the work across worker threads is inefficient on both counts: the effort spent and the resulting load balance. Modern processors and OSes add imbalance even to what initially looks like a perfectly balanced workload, due to:
Cache misses: one thread can hit the cache while another suffers a miss, spending up to thousands of cycles on a memory load that would otherwise take only a few cycles.
Turbo boost, power management, and core parking: both the processor itself and the OS can vary the frequency and availability of compute units, contributing to the imbalance.
Thread preemption: other processes running on a modern multitasking operating system can temporarily interrupt the execution flow of a thread.
Modern work-stealing schedulers are quite efficient at mapping and load-balancing even imbalanced work onto worker threads: you just describe where the potential parallelism is, and the task scheduler assigns it to the available resources. Work-stealing is a distributed approach that does not funnel everything through a single piece of shared state (e.g. a shared iterator) and thus avoids that bottleneck.
Check out Cilk, TBB, or PPL for more information about implementations of such scheduling algorithms.
Moreover, they are friendly to nested and recursive parallel constructs like:
void check_from(std::string pass) {
    check_password(pass);                 // test the current prefix as a candidate
    if (pass.size() < MAX_SIZE) {
        cilk_for (int i = 0; i < syms; i++) {
            check_from(pass + sym[i]);    // recurse; iterations may be stolen by idle workers
        }
    }
}
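For comparison, roughly the same recursion expressed with TBB's task_group (a sketch only; sym, syms, MAX_SIZE and check_password are stubbed here just so the snippet compiles on its own):
#include <tbb/task_group.h>
#include <cstddef>
#include <string>

// Stubs standing in for the names used in the snippet above.
static const char   sym[]    = "abcdefghijklmnopqrstuvwxyz";
static const int    syms     = 26;
static const size_t MAX_SIZE = 4;                    // kept small; the recursion fans out fast
static void check_password(const std::string&) { /* hash and compare would go here */ }

void check_from(std::string pass) {
    check_password(pass);
    if (pass.size() < MAX_SIZE) {
        tbb::task_group g;
        for (int i = 0; i < syms; ++i)
            g.run([pass, i] { check_from(pass + sym[i]); });   // each subtask can be stolen
        g.wait();
    }
}

int main() { check_from(""); }
The scheduler decides at run time which subtrees are executed by which worker, which is exactly what gives the load balancing described above.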
I have written a parallel program using OpenMP. It uses two threads because my laptop is dual-core, and the threads do a lot of matrix operations, so they are CPU-bound. There is no data sharing among the threads. A single instance of the program runs quite fast, but when I run multiple instances of the same program simultaneously, the performance degrades. Here is a plot:
The running time for a single instance (two threads) is 0.78 seconds. The running time for two instances (four threads in total) is 2.06 seconds, which is more than double 0.78. After that, the running time increases in proportion to the number of instances (i.e. the number of threads).
Here is the timing profile of one of the instances when multiple were run in parallel:
Can someone offer insights into what could be going on? The profile shows that 50% of the time is being consumed by OpenMP. What does that mean?
Similar to what #Bort said, you made the application multithreaded (two threads) because you have two cores.
This means that when only one instance of your program is running, it (ideally) gets to use the whole CPU.
However, if two instances of the application are running, there are no spare resources: each will take roughly twice as long. The same goes for more instances.
You cannot fix this without also increasing the number of cores available to each instance (i.e. keeping it at 2 per instance, rather than a shrinking share).
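If you really must run several instances side by side on the same dual-core machine, about all you can do is keep them from oversubscribing the CPU. A minimal sketch, assuming each instance is told how many siblings are running via a hypothetical NUM_INSTANCES environment variable:
#include <omp.h>
#include <algorithm>
#include <cstdlib>

int main() {
    // NUM_INSTANCES is a hypothetical environment variable you would set yourself.
    const char* env = std::getenv("NUM_INSTANCES");
    const int instances = env ? std::max(1, std::atoi(env)) : 1;
    const int threads   = std::max(1, omp_get_num_procs() / instances);
    omp_set_num_threads(threads);   // e.g. 1 thread per instance when 2 instances share 2 cores

    #pragma omp parallel
    {
        // ... the matrix work goes here ...
    }
    return 0;
}
This does not create more CPU time, it only avoids the extra overhead of context switching between oversubscribed threads.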
I want to use multiple threads to accelerate my program, but I am not sure which way is optimal.
Say we have 10000 small tasks; it takes maybe only 0.1 s to finish one of them. Now I have a CPU with 12 cores and I want to use 12 threads to make it faster.
So far as I know, there are two ways:
1. Task Pool
There are always 12 threads running; each of them gets a new task from the task pool after it finishes its current one.
2. Separate Tasks
Split the 10000 tasks into 12 parts and have each thread work on one part.
The problem is: if I use a task pool, time is wasted on locking/unlocking when multiple threads try to access the pool. But the second way is not ideal either, because some of the threads finish early and the total time depends on the slowest thread.
I am wondering how you deal with this kind of work, and whether there is a better way to do it. Thank you.
EDIT: Please note that the number 10000 is just an example; in practice there may be 1e8 or more tasks, and 0.1 s per task is an average.
EDIT2: Thanks for all your answers :] It is good to know about the different options.
So a middle ground between the two approaches is to break the work into, say, 100 batches of 100 tasks each and let a thread pick a batch of 100 tasks at a time from the task pool.
Perhaps if you model the randomness of the per-task execution time on a single core and get an estimate of the mutex locking time, you might be able to find an optimal batch size.
But without too much work we at least have the following lemma: the slowest thread can take at most 100 * 0.1 s = 10 s longer than the others.
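A sketch of that middle ground: a single atomic counter hands out batches of 100 task indices, so each thread synchronizes only once per batch. do_task() is a placeholder for the ~0.1 s unit of work:
#include <algorithm>
#include <atomic>
#include <cstddef>
#include <thread>
#include <vector>

constexpr std::size_t kTasks = 10000;
constexpr std::size_t kBatch = 100;                   // 100 batches of 100 tasks

static void do_task(std::size_t /*i*/) { /* stand-in for the ~0.1 s unit of work */ }

int main() {
    std::atomic<std::size_t> next{0};                 // a single shared cursor, no mutex
    auto worker = [&next] {
        for (;;) {
            const std::size_t begin = next.fetch_add(kBatch);   // one sync point per batch
            if (begin >= kTasks) return;
            const std::size_t end = std::min(begin + kBatch, kTasks);
            for (std::size_t i = begin; i < end; ++i) do_task(i);
        }
    };
    const unsigned n = std::max(1u, std::thread::hardware_concurrency());
    std::vector<std::thread> pool;
    for (unsigned t = 0; t < n; ++t) pool.emplace_back(worker);
    for (auto& th : pool) th.join();
}
Because the batch is claimed with a single fetch_add, there is no lock at all, and the imbalance is bounded by one batch per thread as the lemma says.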
A task pool is always the best solution here. It's not just about optimal time, it's also about comprehensibility of the code. You should never force your tasks to conform to the completely unrelated criterion of having the same number of subtasks as cores: your tasks have (in general) nothing to do with that, and such a separation doesn't scale when you change machines, etc. It also requires overhead for collaborating on combining partial results into the final one, and generally turns an easy job into a hard one.
But you should not be worrying about the use of locks for task pools. There are lock-free queues available if you ever determine them to be necessary. But determine that first. If time is your concern, use the appropriate methods of speeding up your task, and put your effort where you will get the most benefit. Profile your code. Why do your tasks take 0.1 s? Do they use an inefficient algorithm? Can loop unrolling help? If you find the hotspots in your code through profiling, you may find that locks are the least of your worries. And if you find that everything is running as fast as possible and you still want that extra second from removing locks, search the internet for "lock-free queue" and "wait-free queue". Compare-and-swap makes atomic lists easy.
Both ways suggested in the question will perform well and similarly to one another (in simple cases with predictable and relatively long task durations). If the target system type is known and available (and if performance is really a top concern), the approach should be chosen based on prototyping and measurements.
Do not assume in advance that the optimal number of threads has to match the number of cores. If this is a regular server or desktop system, there will be various system processes kicking in now and then, and you may see your 12 threads floating between processors, which hurts memory caching.
There are also crucial factors that measurement alone won't settle: do those small tasks require any resources to execute? Do those resources introduce additional potential delays (blocking) or contention? Are there other applications competing for the CPU? Will the application have to grow to accommodate different execution environments, task types, or user-interaction models?
If the answer to all is negative, here are some additional approaches that you can measure and consider.
Use only 10 or 11 threads. You will observe a small slowdown, or even a small speedup (the extra core will serve OS processes, so thread affinity for the rest will be more stable than with 12 threads). Any concurrent interactive activity on the system will see a big boost in responsiveness.
Create exactly 12 threads but explicitly set a different processor affinity mask for each, to impose a 1-1 mapping between threads and processors (a sketch of setting such an affinity mask on Linux follows below). This is good in the simplest, near-academic case where no resources other than CPU and shared memory are involved; you will see no chronic migration of threads across processors. The drawback is an algorithm closely coupled to a particular machine; on another machine it could behave so poorly as to never finish at all (because of an unrelated real-time task that blocks one of your threads forever).
Create 12 threads and split the tasks evenly. Have each thread lower its own priority once it is past 40%, and again once it is past 80%, of its share of the work. This will improve load balancing inside your process, but it will behave poorly if your application is competing with other CPU-bound processes.
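For the affinity option above, here is a Linux/glibc-specific sketch using pthread_setaffinity_np; work() stands for a thread's pre-assigned share of the tasks, and error handling is omitted:
#include <pthread.h>    // pthread_setaffinity_np (GNU extension)
#include <sched.h>      // cpu_set_t, CPU_ZERO, CPU_SET
#include <thread>
#include <vector>

static void work(int /*part*/) { /* this thread's pre-assigned share of the tasks */ }

int main() {
    const int n = 12;                               // one thread per core, as above
    std::vector<std::thread> pool;
    for (int t = 0; t < n; ++t) {
        pool.emplace_back(work, t);
        cpu_set_t mask;
        CPU_ZERO(&mask);
        CPU_SET(t, &mask);                          // pin thread t to core t (1-1 mapping)
        pthread_setaffinity_np(pool.back().native_handle(), sizeof(mask), &mask);
    }
    for (auto& th : pool) th.join();
    return 0;
}
As warned above, hard pinning ties the program to this particular machine layout, so treat it as an experiment rather than a default.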
100ms/task - pile 'em on as they are - pool overhead will be insignificant.
OTOH..
1e8 tasks @ 0.1 s/task = 10,000,000 seconds
≈ 2,777.8 hours
≈ 115.7 days
That's much more than the interval between patch Tuesday reboots.
Even if you run this on Linux, you should batch up the output and flush it to disk in such a manner that the job is restartable.
Is there a database involved? If so, you should have told us!
Each worker thread could have its own small task queue with a capacity of no more than one or two memory pages. When the queue runs low (half of its capacity), it should signal some manager thread to populate it with more tasks. If the queue is organized in batches, then worker threads do not need to enter critical sections as long as the current batch is not empty. Avoiding critical sections gives you extra cycles for the actual job. Two batches per queue are enough; in that case one batch can take one memory page, so the queue takes two.
The point of memory pages is that a thread does not have to jump all over memory to fetch data. If all the data is in one place (one memory page), you avoid cache misses.
I have a vector<int> with 10,000,000 (10 million) elements, and my workstation has four cores. There is a function, called ThrFunc, that operates on an integer. Assume that the runtime of ThrFunc is roughly the same for each integer in the vector<int>.
How should I determine the optimal number of threads to fire off? Is the answer as simple as the number of elements divided by the number of cores? Or is there a more subtle computation?
Editing to provide extra information: no need for blocking; each function invocation needs only read-only access.
The optimal number of threads is likely to be either the number of cores in your machine or the number of cores times two.
In more abstract terms, you want the highest possible throughput. Getting the highest throughput requires the fewest contention points between the threads (since the original problem is trivially parallelizable). The number of contention points is likely to be the number of threads sharing a core, or twice that, since a core can run either one or two logical threads (two with hyperthreading).
If your workload makes use of a resource of which you have fewer than four available (ALUs on Bulldozer? Hard disk access?) then the number of threads you should create will be limited by that.
The best way to find the correct answer is, as with all hardware questions, to test and find out.
Borealid's answer includes "test and find out", which is impossible to beat as advice.
But there's perhaps more to testing this than you might think: you want your threads to avoid contention for data wherever possible. If the data is entirely read-only, then you might see best performance if your threads are accessing "similar" data -- making sure to walk through the data in small blocks at a time, so each thread is accessing data from the same pages over and over again. If the data is completely read-only, then there is no problem if each core gets its own copy of the cache lines. (Though this might not make the most use of each core's cache.)
If the data is in any way modified, then you will see significant performance enhancements if you keep the threads away from each other, by a lot. Most caches store data along cache lines, and you desperately want to keep each cache line from bouncing among CPUs for good performance. In that case, you might want to keep the different threads running on data that is actually far apart to avoid ever running into each other.
So: if you're updating the data while working on it, I'd recommend having N or 2*N threads of execution (for N cores), starting thread M at offset M*SIZE/N, for M from 0 through N-1 (0, 1000, 2000, 3000 for four threads and 4000 data objects). This gives you the best chance of feeding different cache lines to each core and allowing updates to proceed without cache-line bouncing:
+--------------+---------------+--------------+---------------+--- ...
| first thread | second thread | third thread | fourth thread | first ...
+--------------+---------------+--------------+---------------+--- ...
If you're not updating the data while working on it, you might wish to start N or 2*N threads of execution (for N cores), starting them with 0, 1, 2, 3, etc.. and moving each one forward by N or 2*N elements with each iteration. This will allow the cache system to fetch each page from memory once, populate the CPU caches with nearly identical data, and hopefully keep each core populated with fresh data.
+-----------------------------------------------------+
| 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 ... |
+-----------------------------------------------------+
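A sketch of the two access patterns for N threads over the shared vector; ThrFunc here is just a stand-in for the function from the question:
#include <algorithm>
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

void ThrFunc(int& element) { element *= 2; }   // stand-in for the real per-element work

// Contiguous blocks: preferred when the data is modified, since each thread
// stays on its own cache lines (first diagram above).
void blocked(std::vector<int>& data, unsigned t, unsigned n) {
    const std::size_t begin = data.size() * t / n;
    const std::size_t end   = data.size() * (t + 1) / n;
    for (std::size_t i = begin; i < end; ++i) ThrFunc(data[i]);
}

// Strided / interleaved: an option for read-only data, where all cores walk
// over the same pages and reuse the freshly fetched lines (second diagram above).
void strided(std::vector<int>& data, unsigned t, unsigned n) {
    for (std::size_t i = t; i < data.size(); i += n) ThrFunc(data[i]);
}

int main() {
    std::vector<int> data(10000000, 1);
    const unsigned n = std::max(1u, std::thread::hardware_concurrency());
    std::vector<std::thread> pool;
    for (unsigned t = 0; t < n; ++t)
        pool.emplace_back(blocked, std::ref(data), t, n);   // or `strided`, per the discussion above
    for (auto& th : pool) th.join();
}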
I also recommend using sched_setaffinity(2) directly in your code to force the different threads onto their own processors. In my experience, Linux aims to keep each thread on its original processor so strongly that it will not even migrate tasks to other cores that are otherwise idle.
Assuming ThrFunc is CPU-bound, you probably want one thread per core, dividing the elements between them.
If there's an I/O element to the function then the answer is more complicated, because you can have one or more threads per core waiting for I/O while another is executing. Do some tests and see what happens.
I agree with the previous comments. You should run tests to determine what number yields the best performance. However, this will only yield the best performance for the particular system you're optimizing for. In most scenarios, your program will be run on other people's machines, on the architecture of which you should not make too many assumptions.
A good way to determine the number of threads to start programmatically is to use
std::thread::hardware_concurrency()
This is part of C++11 and should yield the number of logical cores on the current system. Logical cores means either the physical number of cores, in case the processor does not support hardware threads (i.e. HyperThreading), or the number of hardware threads.
There's also a Boost function that does the same; see Programmatically find the number of cores on a machine.
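A minimal usage sketch; note that hardware_concurrency() is allowed to return 0 when the value cannot be determined, so a fallback is a good idea:
#include <algorithm>
#include <iostream>
#include <thread>

unsigned worker_count() {
    // hardware_concurrency() may return 0 if the value is not computable.
    return std::max(1u, std::thread::hardware_concurrency());
}

int main() { std::cout << worker_count() << " worker threads\n"; }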
The optimal number of threads should equal the number of cores; in that situation the computation capacity of each core is fully utilized, provided the computation on each element is independent.
The optimal number of cores (threads) will probably be determined by when you achieve saturation of the memory system (caches and RAM). Another factor that could come into play is that of inter-core locking (locking a memory area that other cores might want to access, updating it and then unlocking it) and how efficient it is (how long the lock is in place and how often it is locked/unlocked).
A single core running generic software whose code and data are not optimized for multi-core can come close to saturating memory all by itself. Adding more cores will, in such a scenario, result in a slower application.
So unless your code economizes heavily on memory accesses I'd guess the answer to your question is one (1).
I've found a real-world example that I'll put here for those who want a less technical, more intuitive answer:
Having multiple threads per core is like having two queues for each scanner at an airport (people in both queues eventually have to pass through the same scanner).
Two people at a time can put their baggage on the conveyor belt, but only one at a time can pass through the scanner. So there is obviously a contention point at the entrance of the scanner, yet in practice both queues function very well most of the time.
In this example, the queues represent threads and the scanner represents the main execution units of a core. As a general rule of thumb, running two threads on one core gives you roughly 1.25 times the throughput of a single core, i.e. the extra thread is not like having an entire new core. So if the task is CPU-bound, slightly more threads than the number of available processors is probably best.
But note that if the task is IO-bound, where threads spend most of their time waiting for external resources such as database connections, file systems, or other external sources of data, then you can assign (many) more threads than the number of available processors.
Source1, Source2
Are they both the same thing? Looking just at what concurrent or parallel means in geometry, I'd definitely say no:
In geometry, two or more lines are said to be concurrent if they intersect at a single point.
and
Two lines in a plane that do not intersect or meet are called parallel lines.
Again, in programming, do they have the same meaning? If yes, why?
Thanks
I agree that the geometry vocabulary is in conflict. Think of train tracks instead: Two trains which are on parallel tracks can run independently and simultaneously with little or no interaction. These trains run concurrently, in parallel.
The basic usage difficulty is that "concurrent" can mean "at the same time" (with the trains or code) or "at the same place" (with the geometric lines). For many practical purposes (trains, thread resources) these two notions are directly in conflict.
Natural language is supposed to be silly, ambiguous, and confusing. But we're programmers. We can take refuge in the clarity, simplicity, and elegance of our formal programming languages. Like perl.
From Wikipedia:
Concurrent computing is a form of computing in which programs are designed as collections of interacting computational processes that may be executed in parallel.
Basically, programs can be written as concurrent programs if they are made up of smaller interacting processes. Parallel programming is actually doing these processes at the same time.
So I suppose that concurrent programming is really a style that lends itself to processes being executed in parallel to improve performance.
No, concurrent is definitely different from parallel. Here is exactly how.
Concurrency refers to the sharing of resources in the same time frame. As an example, several processes may share the same CPU or share memory or an I/O device.
Now, by definition, two processes are concurrent if and only if the second starts execution before the first has terminated (on the same CPU). If both processes run on the same single-core CPU, the processes are concurrent but not parallel: in this case parallelism is only virtual and refers to the OS doing time-sharing. The OS appears to execute several processes simultaneously, but if there is only one single-core CPU, only one instruction from one process can be executing at any particular moment. Since the human time scale is billions of times slower than that of modern computers, the OS can switch between processes rapidly enough to give the appearance of several processes executing at the same time.
If you instead run the two processes on two different CPUs, the processes are parallel: there is no sharing within the same time frame, because each process runs on its own CPU. The parallelism in this case is not virtual but physical. It is worth noting that running on different cores of the same multi-core CPU still cannot be classified as fully parallel, because the processes will share the same CPU caches and will even contend for them.