What's the difference between a red black tree and a single runqueue? - scheduling

I have been trying to understand the difference between the two as they apply to different algorithms used to choose tasks to run in some CPU schedulers.
What is the difference between an RB tree that places lowest time needed process on the left and chooses nodes from the left to run, and a queue that places them in a shortest job first order?

A single queue has a time complexity of O(1) for picking the next task, because it can just pop the next process for execution. Insertion is also O(1), as the new item is placed at the end of the queue. This kind of round-robin scheduler was used, e.g., in early Linux kernels. The downside was that all tasks were executed in the same order every time.
To fix this, a simple improvement is to keep popping the head of the queue in O(1) and, on insert, to search for a suitable slot in the queue by priority and/or time requirements, which makes insertion O(n). Some schedulers keep multiple queues (or even a priority queue), whose operation times vary depending on the implementation and needs.
A red-black tree, on the other hand, has a time complexity of O(log n) both to get the next process and to insert one. The principal idea of a red-black tree is that it keeps itself balanced with every operation, thus remaining efficient without any further optimization operations. A priority queue can also be implemented using a red-black tree internally.
A good starting point on (Linux) schedulers is the CFS article on IBM's site, which has a nice set of references, as well.
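To make the trade-off above concrete, here is a minimal sketch of the two structures. The class names are hypothetical, and because Python's standard library has no red-black tree, a binary heap (heapq) is used as a stand-in with the same O(log n) insert and O(log n) extraction; it is not how any real kernel scheduler is written.

```python
import bisect
import heapq


class SortedRunQueue:
    """Single queue kept in shortest-job-first order: O(n) insert, O(1) pick."""

    def __init__(self):
        # Kept ascending by negated remaining time, so the shortest job
        # sits at the end of the list, where list.pop() is O(1).
        self._tasks = []

    def insert(self, remaining, name):
        # Finding the slot is O(log n), but shifting elements makes it O(n).
        bisect.insort(self._tasks, (-remaining, name))

    def pick_next(self):
        neg_remaining, name = self._tasks.pop()
        return -neg_remaining, name


class HeapRunQueue:
    """Stand-in for a balanced tree: O(log n) insert and O(log n) pick."""

    def __init__(self):
        self._heap = []

    def insert(self, remaining, name):
        heapq.heappush(self._heap, (remaining, name))

    def pick_next(self):
        return heapq.heappop(self._heap)
```

Both queues hand out tasks in the same shortest-first order; they differ only in where the cost is paid.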


Parallel union find algorithm

I need to parallelize Kruskal's algorithm. The serial version uses the union-find algorithm to detect cycles in the undirected graph. Is there any way to parallelize this part of the code?
Well, it can be parallelized to some extent. It is as follows:
Initially, all the edges are sorted in ascending order. There is a main thread which scans each edge from the beginning and decides whether adding the current edge forms a cycle. Our main aim in parallelizing the algorithm is to make these checks parallel.
This is where we use the worker threads. Each thread is given a certain number of edges to examine, and after every iteration (an iteration means the main thread adding a new edge) each thread checks whether its edges form a cycle with the current representation. As the main thread keeps adding edges, some threads see that certain edges already form a cycle with the current representation.
Such edges are marked as discarded. When the main thread reaches such an edge, it simply moves on to the next one without making any check on it.
Thus, we have made these checks parallel, so the main thread has less work to do and the algorithm runs faster.
In fact, there is a nice paper that uses the same idea described above.
If you are concerned about the running time of the overall algorithm, you can also use a parallel sorting algorithm for the initial sort, as @jarod42 suggested.
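The scheme described above can be sketched as follows. This is a single-threaded simulation under stated assumptions: mark_cycle_edges stands for the work a helper thread would do, the lookahead parameter is hypothetical, and find deliberately skips path compression so that helpers would only read the shared structure. It is not the implementation from the paper mentioned above.

```python
class DisjointSet:
    def __init__(self, n):
        self.parent = list(range(n))

    def find(self, x):
        # Read-only find (no path compression), so helper threads could
        # call it without writing to shared state.
        while self.parent[x] != x:
            x = self.parent[x]
        return x

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return False  # a and b are already connected: edge closes a cycle
        self.parent[rb] = ra
        return True


def mark_cycle_edges(ds, edges, discarded, start, stop):
    """Work a helper would do: flag edges that already close a cycle."""
    for i in range(start, min(stop, len(edges))):
        _, u, v = edges[i]
        if ds.find(u) == ds.find(v):
            discarded[i] = True


def kruskal(n, edges, lookahead=4):
    edges = sorted(edges)  # (weight, u, v) tuples, ascending by weight
    ds = DisjointSet(n)
    discarded = [False] * len(edges)
    tree = []
    for i, (w, u, v) in enumerate(edges):
        if discarded[i]:
            continue  # a helper already proved this edge forms a cycle
        if ds.union(u, v):
            tree.append((w, u, v))
            # Simulated helper pass over the next few edges; real helpers
            # would run this concurrently with the main loop.
            mark_cycle_edges(ds, edges, discarded, i + 1, i + 1 + lookahead)
    return tree
```

Marking an edge as discarded is safe because components only ever merge: once both endpoints share a root, they always will.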

how to apply parallelism-programming in graph problems?

Problem Description:
There are n tasks, and some of them may depend on others: if A depends on B, then B must be finished before A can be finished.
1. Find a way to finish these tasks as quickly as possible.
2. If parallelism is taken into account, how should the program be designed to finish these tasks?
Apparently, the answer to the first question is to topologically sort the tasks and then finish them in that order.
But how should this be done if parallelism is taken into consideration?
My answer was: first topologically sort the tasks, then pick the tasks that are independent and finish them first, then pick and finish the independent ones among the rest, and so on.
Am I right?
Topological sort algorithms may give you various different result orders, so you cannot just take the first few elements and assume them to be the independent ones.
Instead of topological sorting I'd suggest to sort your tasks by the number of incoming dependency edges. So, for example if your graph has A --> B, A --> C, B --> C, D-->C you would sort it as A[0], D[0], B[1], C[3] where [i] is the number of incoming edges.
With topological sorting, you could also have gotten A, B, D, C. In that case, it wouldn't be easy to find out that you can execute A and D in parallel.
Note that after a task has been completely processed you then have to update the remaining tasks, in particular the ones that depended on the finished task. However, if the number of dependencies going into a task is limited to a relatively small number (say a few hundred), you can easily rely on something like radix/bucket sort and keep the sort structure updated in constant time.
With this approach, you can also easily start new tasks, once a single parallel task has finished. Simply update the dependency counts, and start all tasks that now have 0 incoming dependencies.
Note that this approach assumes you have enough processing power to process all tasks that have no dependencies at the same time. If you have limited resources and care for an optimal solution in terms of processing time, then you'd have to invest more effort, as the problem becomes NP-hard (as arne already mentioned).
So to answer your original question: yes, you are basically right. However, you did not explain how to determine those independent tasks efficiently (see my example above).
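The dependency-count approach described above can be sketched like this. It is a simplified version that releases tasks in whole rounds rather than starting each task the moment its count reaches zero, and the function name is hypothetical. An edge (X, Y) means X must finish before Y, matching the A --> B notation of the example.

```python
from collections import defaultdict


def parallel_waves(tasks, edges):
    """Group tasks into rounds; every task in a round can run in parallel."""
    indegree = {t: 0 for t in tasks}
    dependents = defaultdict(list)
    for before, after in edges:
        dependents[before].append(after)
        indegree[after] += 1

    ready = sorted(t for t in tasks if indegree[t] == 0)
    waves = []
    while ready:
        waves.append(ready)
        next_ready = []
        for task in ready:
            # The task is done: lower the count of everything waiting on it.
            for d in dependents[task]:
                indegree[d] -= 1
                if indegree[d] == 0:
                    next_ready.append(d)
        ready = sorted(next_ready)

    if sum(len(w) for w in waves) != len(tasks):
        raise ValueError("dependency cycle detected")
    return waves
```

For the example graph this yields the rounds [A, D], then [B], then [C], which makes the A/D parallelism explicit.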
I would try sorting them in a directed forest structure with task execution time as edge weights. Order the arborescences from heaviest to lightest and start with the heaviest. Using this approach you can, at the same time, check for circular dependencies.
Using parallelism, you get the bin packing problem, which is NP-hard. Try looking up approximate solutions for that problem.
Have a look at the Critical Path Method, taken from the area of project management. It basically does what you need: given tasks with dependencies and durations, it produces how much time the whole job will take and when to activate each task.
(*) Note that this technique assumes an infinite number of resources for an optimal solution. For limited resources there are greedy heuristics such as GPRW [current + following tasks time] or MSLK [minimum total slack time].
(*) Also note that it requires knowing [or at least estimating] how long each task will take.
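A minimal sketch of the forward pass of the Critical Path Method, assuming unlimited workers and an acyclic dependency graph. The function name, the durations and the task names in the usage note are illustrative assumptions only.

```python
def critical_path(durations, depends_on):
    """Return (earliest start per task, total project time).

    durations:  {task: time the task takes}
    depends_on: {task: [tasks that must finish first]}
    Assumes the dependencies contain no cycle.
    """
    earliest_finish = {}

    def finish(task):
        if task not in earliest_finish:
            # A task can start once its slowest prerequisite has finished.
            start = max((finish(d) for d in depends_on.get(task, [])), default=0)
            earliest_finish[task] = start + durations[task]
        return earliest_finish[task]

    for task in durations:
        finish(task)

    earliest_start = {t: earliest_finish[t] - durations[t] for t in durations}
    return earliest_start, max(earliest_finish.values())
```

With durations A=3, B=2, C=4, D=1 and C depending on A, B and D while B depends on A, task C cannot start before time 5 and the whole job takes 9 time units.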

Design Problem: Thread safety of std::map

I am using std::map to implement my local hash table, which will be accessed by multiple threads at the same time.
I did some research and found that std::map is not thread safe.
So I will use a mutex for insert and delete operations on the map.
I plan to have separate mutex(es), one for each map entry so that they can be modified independently.
Do I need to put find operation also under critical section?
Will find operation be affected by insert/delete operations?
Is there any better implementation than using std::map that can take care of everything?
Binary trees are not particularly suited to multi-threading because rebalancing can degenerate into a tree-wide modification. Furthermore, a global mutex will affect performance very negatively.
I would strongly suggest using an already written thread-safe container. For example, Intel TBB contains a concurrent_hash_map.
If you wish to learn however, here are some hints on building a concurrent sorted associative container (I believe a full introduction to be not only out of my reach but also out of place, here).
Rather than a regular Mutex, you may want to use a Reader/Writer Mutex. This means parallelizing Reads, while Writes remain strictly sequential.
Own Tree
You can also build your own red-black or AVL tree, augmenting the tree structure with a Reader/Writer Mutex per node. This allows you to block only part of the tree, rather than the whole structure, even when rebalancing. For example, inserts with keys far enough apart can run in parallel.
Skip Lists
Linked lists are much more amenable to concurrent manipulations, because you can easily isolate the modified zone.
A Skip List builds on this strength, but augments the structure to provide O(log N) access by key.
The typical way to walk a list is using the hand over hand idiom, that is, you grab the mutex of the next node before releasing the one of the current node. Skip Lists add a 2nd dimension as you can dive between two nodes, thus releasing both of them (and letting other walkers go ahead of you).
Implementations are much simpler than for binary search trees.
Another interesting piece is the idea of persistent (or semi-persistent) data structures, often found in functional programming. Binary search trees are particularly amenable to it.
The basic idea is to never change a node (or its content) once it exists. You do so by sharing a mutable head that points to the latest version.
To Read: you copy the current head, then use it without worry (the information is immutable)
To Write: each node that you would modify in a regular tree is instead copied and the copy modified, therefore you rebuild part of the tree (up to the root) each time, and update the head to point to the new root. There are efficient ways to rebalance on descending the tree. Writes are sequential
The main advantage is that a version of the map is always available. That is, you can always read, even when another thread is performing an insert or delete. Furthermore, because a read access only requires a single concurrent read (when copying the root pointer), reads are nearly lock-free and thus have excellent performance.
Reference counting (intrinsic) is your friend for those nodes.
Note: copies of the tree are very cheap :)
I do not know any implementation in C++ of either a concurrent Skip List or a concurrent Semi-Persistent Binary Search Tree.
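The path-copying idea is language-neutral, so here is a minimal sketch of it in Python rather than C++. It is an assumption-laden illustration: the tree is not rebalanced, the names are hypothetical, and publishing the new root to other threads (the shared mutable head) is left out.

```python
from typing import NamedTuple, Optional


class Node(NamedTuple):
    """Immutable tree node: it is never modified once created."""
    key: int
    value: str
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def insert(root, key, value):
    """Return the root of a NEW version; the old version stays valid."""
    if root is None:
        return Node(key, value)
    if key < root.key:
        # Copy only the nodes on the path; the other subtree is shared.
        return root._replace(left=insert(root.left, key, value))
    if key > root.key:
        return root._replace(right=insert(root.right, key, value))
    return root._replace(value=value)


def find(root, key):
    while root is not None:
        if key == root.key:
            return root.value
        root = root.left if key < root.key else root.right
    return None
```

A reader that grabbed the old root keeps seeing the old contents, while a writer builds the next version next to it and swaps the head when done. Untouched subtrees are shared between versions, which is why "copies" are cheap.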
You will indeed need to put find in a critical section, but you might want to have two different locks, one for writing and one for reading. The write lock is exclusive, but if no thread holds the write lock, several threads may read concurrently with no problems.
Such an implementation would work with most STL implementations, but it would not be standards compliant. std::map is usually implemented using a red-black tree, which doesn't change when elements are read. If the map were implemented using a splay tree instead, the tree would change during lookup and only one thread could read at a time.
For most purposes I would recommend using two locks.
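The read/write locking scheme can be sketched as below. This is an illustration in Python with hypothetical class names, built on threading.Condition because the Python standard library has no reader/writer lock; in C++17 the standard std::shared_mutex fills this role. The simple policy here can starve writers if readers keep arriving.

```python
import threading


class ReadWriteLock:
    """Many concurrent readers OR one exclusive writer."""

    def __init__(self):
        self._cond = threading.Condition()
        self._readers = 0
        self._writer = False

    def acquire_read(self):
        with self._cond:
            while self._writer:
                self._cond.wait()
            self._readers += 1

    def release_read(self):
        with self._cond:
            self._readers -= 1
            if self._readers == 0:
                self._cond.notify_all()  # a waiting writer may proceed

    def acquire_write(self):
        with self._cond:
            while self._writer or self._readers > 0:
                self._cond.wait()
            self._writer = True

    def release_write(self):
        with self._cond:
            self._writer = False
            self._cond.notify_all()


class GuardedMap:
    """Map whose find takes the read lock and insert/erase the write lock."""

    def __init__(self):
        self._lock = ReadWriteLock()
        self._data = {}

    def find(self, key):
        self._lock.acquire_read()
        try:
            return self._data.get(key)
        finally:
            self._lock.release_read()

    def insert(self, key, value):
        self._lock.acquire_write()
        try:
            self._data[key] = value
        finally:
            self._lock.release_write()

    def erase(self, key):
        self._lock.acquire_write()
        try:
            self._data.pop(key, None)
        finally:
            self._lock.release_write()
```

The point of the design is that find still sits inside a critical section, but several finds can be inside it at the same time.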
Yes, if the insert or delete results in a rebalance I believe that find could be affected too.
Yes - You would need to put insert, delete and find in a critical section. There are techniques to enable multiple finds at the same time.
From what I can see, a similar question has been answered here, and the answer includes the explanation for this question also, as well as a link explaining the thread safety in more details.
Thread safety of std::map for read-only operations

Organizing a task-based scientific computation

I have a computational algebra task I need to code up. The problem is broken into well-defined individuals tasks that naturally form a tree - the task is combinatorial in nature, so there's a main task which requires a small number of sub-calculations to get its results. Those sub-calculations have sub-sub-calculations and so on. Each calculation only depends on the calculations below it in the tree (assuming the root node is the top). No data sharing needs to happen between branches. At lower levels the number of subtasks may be extremely large.
I had previously coded this up in a functional fashion, calling the functions as needed and storing everything in RAM. This was a terrible approach, but I was more concerned about the theory then.
I'm planning to rewrite the code in C++ for a variety of reasons. I have a few requirements:
Checkpointing: The calculation takes a long time, so I need to be able to stop at any point and resume later.
Separate individual tasks as objects: This helps me keep a good handle of where I am in the computations, and offers a clean way to do checkpointing via serialization.
Multi-threading: The task is clearly embarrassingly parallel, so it'd be neat to exploit that. I'd probably want to use Boost threads for this.
I would like suggestions on how to actually implement such a system. Ways I've thought of doing it:
Implement tasks as a simple stack. When you hit a task that needs subcalculations done, it checks if it has all the subcalculations it requires. If not, it creates the subtasks and throws them onto the stack. If it does, then it calculates its result and pops itself from the stack.
Store the tasks as a tree and do something like a depth-first visitor pattern. This would create all the tasks at the start and then computation would just traverse the tree.
These don't seem quite right because of the problem of the lower levels requiring a vast number of subtasks. I could approach it in an iterator fashion at this level, I guess.
I feel like I'm over-thinking it and there's already a simple, well-established way to do something like this. Is there one?
Technical details in case they matter:
The task tree has 5 levels.
Branching factor of the tree is really small (say, between 2 and 5) for all levels except the lowest which is on the order of a few million.
Each individual task would only need to store a result tens of bytes large. I don't mind using the disk as much as possible, so long as it doesn't kill performance.
For debugging, I'd have to be able to recall/recalculate any individual task.
All the calculations are discrete mathematics: calculations with integers, polynomials, and groups. No floating point at all.
there's a main task which requires a small number of sub-calculations to get its results. Those sub-calculations have sub-sub-calculations and so on. Each calculation only depends on the calculations below it in the tree (assuming the root node is the top). No data sharing needs to happen between branches. At lower levels the number of subtasks may be extremely large... blah blah resuming, multi-threading, etc.
Correct me if I'm wrong, but it seems to me that you are exactly describing a map-reduce algorithm.
Just read what Wikipedia says about map-reduce:
"Map" step: The master node takes the input, partitions it up into smaller sub-problems, and distributes those to worker nodes. A worker node may do this again in turn, leading to a multi-level tree structure. The worker node processes that smaller problem, and passes the answer back to its master node.
"Reduce" step: The master node then takes the answers to all the sub-problems and combines them in some way to get the output – the answer to the problem it was originally trying to solve.
Using an existing mapreduce framework could save you a huge amount of time.
I just googled "map reduce C++" and started to get results, notably one based on Boost: http://www.craighenderson.co.uk/mapreduce/
These don't seem quite right because of the problem of the lower levels requiring a vast number of subtasks. I could approach it in an iterator fashion at this level, I guess.
You definitely do not want millions of CPU-bound threads. You want at most N CPU-bound threads, where N is the product of the number of CPUs and the number of cores per CPU on your machine. Exceed N by a little bit and you are slowing things down a bit. Exceed N by a lot and you are slowing things down a whole lot. The machine will spend almost all its time swapping threads in and out of context, spending very little time executing the threads themselves. Exceed N by a whole lot and you will most likely crash your machine (or hit some limit on threads). If you want to farm lots and lots (and lots and lots) of parallel tasks out at once, you either need to use multiple machines or use your graphics card.
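A minimal sketch of the "at most N workers" idea, combined with a simple reduce step. The leaf calculation, the chunk size and the function names are hypothetical placeholders. Note one Python-specific caveat: CPython threads do not speed up CPU-bound work because of the global interpreter lock, so a process pool would be the usual choice there; the point illustrated is only that millions of leaf tasks are funnelled through a small, fixed pool.

```python
import os
from concurrent.futures import ThreadPoolExecutor
from itertools import islice


def leaf_task(n):
    # Hypothetical stand-in for one bottom-level integer calculation.
    return n * n % 7


def chunks(iterable, size):
    """Yield successive lists of at most `size` items."""
    it = iter(iterable)
    while True:
        block = list(islice(it, size))
        if not block:
            return
        yield block


def solve_chunk(block):
    # "Map" over the leaves of one chunk and reduce them to a partial sum.
    return sum(leaf_task(n) for n in block)


def run_leaves(inputs, chunk_size=10_000, workers=None):
    workers = workers or os.cpu_count() or 1
    total = 0
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Each pool job handles a whole chunk, so millions of leaves become
        # a modest number of jobs spread over at most `workers` threads.
        for partial in pool.map(solve_chunk, chunks(inputs, chunk_size)):
            total += partial  # "reduce" step in the parent task
    return total
```

Processing whole chunks per job also gives a natural unit for checkpointing: a finished chunk's partial result is only a few bytes.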

breadth first or depth first search

I know how these algorithms work, but I can't decide when to use which one.
Are there some guidelines on where one performs better than the other, or any other considerations?
Thanks very much.
If you want to find a solution with the fewest steps, or if your tree has infinite (or very large) height, you should use breadth first.
If you have a finite tree and want to traverse all possible solutions using the smallest amount of memory then you should use depth first.
If you are searching for the best chess move to play you could use iterative deepening which is a combination of both.
IDDFS combines depth-first search's space-efficiency and breadth-first search's completeness (when the branching factor is finite).
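A minimal sketch of iterative deepening, assuming the graph is given as a dictionary of adjacency lists. The function names and the max_depth cut-off are assumptions made for the illustration.

```python
def depth_limited(graph, node, goal, limit, path):
    """Depth-first search that refuses to go deeper than `limit` edges."""
    if node == goal:
        return path
    if limit == 0:
        return None
    for nxt in graph.get(node, []):
        if nxt not in path:  # avoid cycling along the current path
            found = depth_limited(graph, nxt, goal, limit - 1, path + [nxt])
            if found is not None:
                return found
    return None


def iddfs(graph, start, goal, max_depth):
    # Repeat depth-limited DFS with a growing limit: memory stays
    # proportional to the depth, yet shallow goals are found first.
    for limit in range(max_depth + 1):
        found = depth_limited(graph, start, goal, limit, [start])
        if found is not None:
            return found
    return None
```

On a graph where the first branch leads to the goal only after a detour, plain depth-first search returns the long route, while this version returns a route with the fewest edges.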
BFS is generally useful in cases where the graph has some meaningful "natural layering" (e.g., closer nodes represent "closer" results) and your goal result is likely to be located closer to the starting point or the starting points are "cheaper to search".
When you want to find the shortest path, BFS is a natural choice.
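For an unweighted graph, that can be sketched as follows, assuming the graph is a dictionary of adjacency lists (the function name is hypothetical). Because nodes are explored layer by layer, the first time the goal is dequeued the recorded path has the fewest edges.

```python
from collections import deque


def bfs_shortest_path(graph, start, goal):
    parent = {start: None}  # also serves as the visited set
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            # Walk the parent links back to the start.
            path = []
            while node is not None:
                path.append(node)
                node = parent[node]
            return path[::-1]
        for nxt in graph.get(node, []):
            if nxt not in parent:
                parent[nxt] = node
                queue.append(nxt)
    return None
```

For weighted edges this no longer gives the cheapest path; an algorithm such as Dijkstra's is needed there.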
If your graph is infinite or programmatically generated, you would probably want to search closer layers before venturing afield, as the cost of exploring remote nodes before getting to the closer nodes is prohibitive.
If accessing more remote nodes would be more expensive due to memory/disk/locality issues, BFS may again be better.
Which method to use usually depends on the application (i.e., the reason why you have to search a graph). For example, topological sorting is commonly done with depth-first search, whereas the Edmonds-Karp variant of the Ford-Fulkerson algorithm for finding maximum flow uses breadth-first search.
If you are traversing a tree, depth-first will use memory proportional to its depth. If the tree is reasonably balanced (or has some other limit on its depth), it may be convenient to use recursive depth-first traversal.
However, don't do this for traversing a general graph; it will likely cause a stack overflow. For unbounded trees or general graphs, you will need some kind of auxiliary storage that can expand to a size proportional to the number of input nodes. In this case, breadth-first traversal is simple and convenient.
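If depth-first order is still wanted on a general graph, the recursion can be replaced by an explicit stack, which is one form of the auxiliary storage mentioned above. A minimal sketch, assuming a dictionary of adjacency lists and a hypothetical function name:

```python
def dfs_iterative(graph, start):
    """Depth-first traversal using a heap-allocated stack instead of recursion."""
    visited = []
    seen = set()
    stack = [start]
    while stack:
        node = stack.pop()
        if node in seen:
            continue  # reached earlier through another edge
        seen.add(node)
        visited.append(node)
        # Push neighbours in reverse so they are visited in listed order.
        for nxt in reversed(graph.get(node, [])):
            if nxt not in seen:
                stack.append(nxt)
    return visited
```

The stack lives on the heap, so deep or cyclic graphs do not exhaust the call stack; swapping the stack for a FIFO queue turns the same loop into a breadth-first traversal.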
If your problem provides a reason to choose one node over another, you might consider using a priority queue, instead of a stack (for depth-first) or a FIFO (for breadth-first). A priority queue will take O(log K) time (where K is the current number of different priorities) to find the best node at each step, but the optimization may be worth it.
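A minimal sketch of that idea using Python's heapq as the priority queue. The function name and the priority callback are assumptions made for the illustration (lower values are explored first), and the heap's cost here is logarithmic in the number of queued nodes rather than in the number of distinct priorities.

```python
import heapq


def best_first(graph, start, goal, priority):
    """Always expand the queued node with the smallest priority value.

    Returns the order in which nodes were expanded, or None if the
    goal is unreachable.
    """
    frontier = [(priority(start), start)]
    seen = {start}
    order = []
    while frontier:
        _, node = heapq.heappop(frontier)
        order.append(node)
        if node == goal:
            return order
        for nxt in graph.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                heapq.heappush(frontier, (priority(nxt), nxt))
    return None
```

With a constant priority this degenerates to an arbitrary order; the benefit only appears when the priority actually reflects how promising a node is.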