CFQ scheduling algorithm - scheduling

The CFQ algorithm uses an ordered set of queues based on the I/O priority of the processes that made the requests. That means there's a queue for processes of priority 1, another for priority 2, and so on.
I understand that the algorithm takes the first request from each queue, sorts them (to avoid unnecessary head movements) and places them into a dispatch queue to be handled. But since a single request can contain many blocks to read (not necessarily contiguous), how is this sort possible? I mean, if I have:
Request1 = [1,2,345,6,423]
and
Request2 = [3,4,2344,664]
where [a,b,c] denotes a list of the blocks a, b and c, how are requests 1 and 2 placed into the dispatch queue? As you can see, their block ranges interleave (for example, block 6 of Request1 comes after blocks 3 and 4 of Request2).
Another thing I don't get: again, since a request can have multiple blocks to read, what kind of scheduling is done within it? FCFS? Or does it order the blocks?
For example, let's say we have a request that contains the following list of blocks to read:
[1,23,5,76,3]
How would the algorithm handle this?
by FCFS:
[1,23,5,76,3]
or by sorting the blocks:
[1,3,5,23,76]
Maybe I didn't understand the algorithm; I couldn't find enough documentation. If anyone has a link or paper with a more detailed explanation, please refer me to it.

In my understanding, CFQ does not schedule individual requests, but time slices during which a number of requests can be issued.
Quote from IA64wiki:
The CFQ scheduler aims to distribute disk time fairly amongst
processes competing for access to the disk. In addition, it uses
anticipation and deadlines to improve performance, and attempts to
bound latency.
Each process issuing I/O gets a time slice during which it has
exclusive access for synchronous requests. The time slice is bounded
by two parameters: slice_sync adjusted by the process's I/O priority
gives the length in milliseconds of each slice; and quantum gives the
number of requests that can be issued.
All processes share a set of 17 queues for asynchronous I/O, one for
each effective I/O priority. Each queue gets a time slice (based on
slice_async adjusted by priority), and the queues are treated
round-robin.
Within each slice, requests are merged (both front and back), and
issued according to SCAN, with backward seeks up to back_seek_max
permitted, but biased by back_seek_penalty compared with forward
seeks. CFQ will wait up to slice_idle ms within a slice for a process
to issue more I/O. This allows anticipation, and also improves
fairness when a process issues bursty I/O; but in general it reduces
performance.
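
To make the in-slice ordering concrete: the scheduler sorts and merges whole requests (by their starting sector); it does not interleave the individual blocks of different requests, and the blocks inside one request are submitted as that request lays them out. Below is a minimal, hypothetical sketch (not kernel code) of a SCAN-style "pick the next request" step with the penalized backward seek described in the quote; the structures and values are made up for illustration.

```cpp
#include <cstddef>
#include <cstdint>
#include <limits>
#include <vector>

// Simplified model: a request is dispatched as one unit, identified by its
// starting sector; the blocks inside a request are not reordered here.
struct Request {
    std::uint64_t start_sector;
};

// Hypothetical tunables mirroring the parameters named in the quote.
constexpr std::uint64_t back_seek_max     = 16 * 1024; // max backward distance allowed
constexpr std::uint64_t back_seek_penalty = 2;         // backward distance counts double

// SCAN-style choice of the next request: prefer the closest forward request;
// allow a backward one only within back_seek_max, weighted by back_seek_penalty.
std::size_t pick_next(const std::vector<Request>& pending, std::uint64_t head_pos) {
    std::size_t best = pending.size();  // pending.size() means "nothing eligible"
    std::uint64_t best_cost = std::numeric_limits<std::uint64_t>::max();
    for (std::size_t i = 0; i < pending.size(); ++i) {
        const std::uint64_t s = pending[i].start_sector;
        std::uint64_t cost;
        if (s >= head_pos) {
            cost = s - head_pos;                        // forward seek
        } else if (head_pos - s <= back_seek_max) {
            cost = (head_pos - s) * back_seek_penalty;  // penalized backward seek
        } else {
            continue;                                   // too far backwards, skip
        }
        if (cost < best_cost) {
            best_cost = cost;
            best = i;
        }
    }
    return best;
}
```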

STD Queue: Efficient, or overkill?

I'm parsing a financial market data feed that has lots of data.
I will receive a string and extract a value, which I will store in a globally declared variable.
The program works like this: When data (string) arrives, a thread is invoked. This thread checks a value passed alongside the string to understand what kind of string it is.
Assuming it is the kind of string I'm interested in, my question is this:
Is it better to pass the string into a queue for parsing, or to simply parse it directly within the invoked thread?
Conceptually, I am worried that if I ask the invoked thread to do work then it may not be available for subsequent market data events, which occur at high frequency, and I will lose data.
Were I to place the string in a queue, I would of course need another thread that popped items off the queue and parsed them.
I have a very fast PC and speed is my main concern here. Does anyone here have experience with this and know what the best approach is?
The real question is whether you care more about the latency of your system or about its throughput.
If you optimize for latency, meaning that you want to respond as soon as the event occurs (which is usually the case in HFT), you'll probably try to avoid passing data to another thread, as this generates unnecessary slowdown and increases your latency. In particular, while optimizing for latency, you want your CPU cache not to be invalidated, which happens more often with multithreading than in a single-threaded design. Besides, checking what a specific string contains is really fast compared to sending things over a network, so if that's the only operation, you shouldn't worry about too many messages coming in and waiting to be processed. Even if it happens, the message will simply wait in the kernel's queue for a microsecond or so while you're processing the previous one. To put it in perspective: if you're getting fewer than 300 thousand messages per second (and I'd bet you are) and the only thing you're doing is checking whether the string contains a pattern, you should not go for threads.
On the other hand, if you care more about throughput - meaning that you really have a lot of messages (think hundreds of thousands per second), or you're doing heavier computations than just searching for a pattern in a string - then it's probably better to use a queue and process events in another thread.
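If you do end up going the queue route, a minimal sketch of the hand-off (standard library only; ParseQueue, parse() and on_market_data() are hypothetical names for illustration) might look like this:

```cpp
#include <condition_variable>
#include <mutex>
#include <queue>
#include <string>
#include <thread>

// Minimal hand-off: the feed-handler thread pushes raw strings, a dedicated
// parser thread pops and parses them.
class ParseQueue {
public:
    void push(std::string msg) {
        {
            std::lock_guard<std::mutex> lock(m_);
            q_.push(std::move(msg));
        }
        cv_.notify_one();
    }

    std::string pop() {
        std::unique_lock<std::mutex> lock(m_);
        cv_.wait(lock, [this] { return !q_.empty(); });
        std::string msg = std::move(q_.front());
        q_.pop();
        return msg;
    }

private:
    std::queue<std::string> q_;
    std::mutex m_;
    std::condition_variable cv_;
};

// Usage sketch: the data-arrival thread only enqueues; parsing happens elsewhere.
//   ParseQueue queue;
//   std::thread parser([&] { for (;;) parse(queue.pop()); });   // parse() is yours
//   void on_market_data(std::string s) { queue.push(std::move(s)); }
```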

Thread-level parallelism for set of mutually exclusive tasks

I've encountered a problem while processing streaming data with C++. Data comes in entries; each entry is comparatively small, and the task processing an entry does not take much time. But each entry, as well as the task processing it, is assigned to a class (a concept, not a C++ class), and only one task belonging to the same class can be executed at a time.
Besides, there are billions of entries and about ten million classes, and entries arrive in random class order.
I've found it difficult to parallelize these tasks. Any suggestion on how to speed up the process would be a great help!
Thanks a lot!
Put the work entries into a set of class-specific work queues. Use the queue size as a priority, with larger sizes having priority over smaller ones. Set up a priority queue (of queues) to hold the class-specific work queues, ordered by their size.
Work entries are entered into their corresponding queue if nobody is processing that queue.
Set up a thread pool of about the same size as the number of CPUs you have.
Each thread asks the priority queue for the highest priority work queue, which is the one that arguably holds the most work. The thread removes the class queue from the priority queue. It then processes all the elements in that queue, without locking it; this keeps the overhead per unit small.
If new entries of that class show up in the meantime, they are added to a new queue for that class but not placed in the priority queue. When the worker thread has used up its current queue, it checks for a new queue of the same class and processes it if present.
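A rough, hypothetical sketch of the scheme just described, collapsed to a single lock and a linear scan for clarity (a real version would keep an actual priority structure and finer-grained locking; Entry, ClassId and the member names are made up):

```cpp
#include <cstddef>
#include <cstdint>
#include <deque>
#include <mutex>
#include <unordered_map>

struct Entry { std::uint64_t payload; };   // stand-in for your real entry type
using ClassId = std::uint64_t;

// All shared state behind one mutex for brevity. The scheme above would keep a
// real priority structure instead of the linear scan in claim(), but the shape
// is the same: per-class queues, largest queue first, at most one worker per
// class at a time.
struct Scheduler {
    std::mutex m;
    std::unordered_map<ClassId, std::deque<Entry>> pending; // entries not yet claimed
    std::unordered_map<ClassId, bool> busy;                 // class currently owned by a worker

    void submit(ClassId c, Entry e) {
        std::lock_guard<std::mutex> lock(m);
        pending[c].push_back(e);          // goes into the class queue either way
    }

    // Claim the non-busy class with the most pending entries and take its whole
    // queue; the worker then processes the batch without holding the lock.
    bool claim(ClassId& c_out, std::deque<Entry>& batch_out) {
        std::lock_guard<std::mutex> lock(m);
        std::size_t best_size = 0;
        for (auto& [c, q] : pending) {
            if (!busy[c] && q.size() > best_size) {
                best_size = q.size();
                c_out = c;
            }
        }
        if (best_size == 0) return false; // nothing claimable right now
        batch_out.swap(pending[c_out]);
        busy[c_out] = true;
        return true;
    }

    void release(ClassId c) {             // class becomes claimable again
        std::lock_guard<std::mutex> lock(m);
        busy[c] = false;
    }
};

// Worker loop (one per hardware thread):
//   ClassId c; std::deque<Entry> batch;
//   while (running)
//       if (sched.claim(c, batch)) {
//           for (auto& e : batch) process(c, e);   // process() is yours
//           batch.clear();
//           sched.release(c);
//       }
```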
You will want one task processor per hardware processor or so, assuming CPU is at least somewhat involved.
Each processor locks a set of active classes, slurps a bunch of jobs off the front of the queue that can be run given the current set state, marks those as in use in the set, unlocks the set and queue, and processes.
How many jobs per slurp varies with the rate of jobs, the number of processors, and how clumped and latency sensitive they are.
Starvation should be impossible, as the unslurped elements will be high priority once the class is free to be processed.
Work in batches to keep the contention for set and queue low: ideally dynamically change the batch size based off contention, idle, and latency information.
This makes the overhead stay proportional to number of processors, not number of task-classes.

When to use the disruptor pattern and when local storage with work stealing?

Is the following correct?
The disruptor pattern has better parallel performance and scalability if each entry has to be processed in multiple ways (io operations or annotations), since that can be parallelized using multiple consumers without contention.
Conversely, work stealing (i.e. storing entries locally and stealing entries from other threads) has better parallel performance and scalability if each entry has to be processed in a single way only, since disjointly distributing the entries onto multiple threads in the disruptor pattern causes contention.
(And is the disruptor pattern still so much faster than other lockless multi-producer multi-consumer queues (e.g. from boost) when multiple producers (i.e. CAS operations) are involved?)
My situation in detail:
Processing an entry can produce several new entries, which must eventually be processed, too. Performance has the highest priority; processing entries in FIFO order is the second priority.
In the current implementation, each thread uses a local FIFO, where it adds its new entries. Idle threads steal work from other thread's local FIFO. Dependencies between the thread's processing are resolved using a lockless, mechanically sympathetic hash table (CASs on write, with bucket granularity). This results in pretty low contention but FIFO order is sometimes broken.
Using the disruptor pattern would guarantee FIFO order. But wouldn't distributing the entries onto the threads cause much higher contention (e.g. CAS on a read cursor) than for local FIFOs with work stealing (each thread's throughput is about the same)?
References I've found
The performance tests in the standard technical paper on the disruptor (Chapter 5 + 6) do not cover disjoint work distribution.
https://groups.google.com/forum/?fromgroups=#!topic/lmax-disruptor/tt3wQthBYd0 is the only reference I've found on disruptor + work stealing. It states that a queue per thread is dramatically slower if there is any shared state, but does not go into detail or explain why. I doubt that this sentence applies to my situation with:
shared state being resolved with a lockless hash table;
having to disjointly distribute entries amongst consumers;
except for work stealing, each thread reads and writes only in its local queue.
Update - bottom line up front: for maximum performance you need to write both, each in the idiomatic style of the disruptor and of work stealing, and then benchmark.
To me, the distinction is primarily in the split between a message focus and a task focus, and therefore in the way you want to think about the problem. Try to solve your problem; if it is task-focused, then the Disruptor is a good fit. If the problem is message-focused, then you might be better suited to another technique such as work stealing.
Use work stealing when your implementation is message focused. Each thread can pick up a message and run it through to completion. For example, an HTTP server: each inbound HTTP request is allocated a thread, and that thread is focused on handling the request start to finish - logging the request, checking security controls, doing the vhost lookup, fetching the file, sending the response, and closing the connection.
Use disruptor when your implementation is task focused. Each thread can work on a particular stage of the processing. Alternative example: for a task focus, the processing would be split into stages, so you would have a thread that does logging, a thread for security controls, a thread for vhost lookup, etc; each thread focused on its task and passes the request to the next thread in the pipeline. Stages may be parallelised but the overall structure is a thread focused on a specific task and hands the message along between threads.
Of course, you can change your implementation to suit each approach better.
In your case, I would structure the problem differently if you wanted to use Disruptor. Generally you would eliminate shared state by having a single thread own the state and pass all tasks through that thread of work - look up SEDA for lots of diagrams like this. This can have lots of benefits, but again, is really down to your implementation.
Some more verbosity:
Disruptor - very useful when strict ordering of stages is required, with additional benefits when all tasks are of a consistent length, e.g. no blocking on an external system and a very similar amount of processing per task. In this scenario, you can assume that all threads will work evenly through the system, and therefore arrange N threads to process every N messages. I like to think of the Disruptor as an efficient way to implement SEDA-like systems where threads process stages. You can of course have an application with a single stage and multiple parallel units performing the same work at each stage; however, this is not really the point in my view. This would totally avoid the cost of shared state.
Work stealing - use this when tasks are of varied duration and the order of message processing is not important, as this allows threads that are free and have already consumed their own messages to continue making progress from another task queue. This way, if for example you have 10 threads and one is blocked on IO, the remainder will still complete their processing.
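For reference, here is a bare-bones sketch of the work-stealing side as described: one deque per thread, the owner works LIFO at the back, idle threads steal FIFO from the front. Real implementations use lock-free deques (e.g. Chase-Lev); this mutex-based, hypothetical version only illustrates the structure and why strict global FIFO order can be broken.

```cpp
#include <atomic>
#include <cstddef>
#include <deque>
#include <functional>
#include <mutex>
#include <optional>
#include <thread>
#include <vector>

using Task = std::function<void()>;

// One deque per worker: the owner pushes/pops at the back (LIFO, cache friendly),
// idle threads steal from the front (oldest first) - which is exactly why a
// strict global FIFO order is not guaranteed.
class WorkStealingQueue {
public:
    void push(Task t) {
        std::lock_guard<std::mutex> lock(m_);
        q_.push_back(std::move(t));
    }
    std::optional<Task> pop() {           // called by the owning thread
        std::lock_guard<std::mutex> lock(m_);
        if (q_.empty()) return std::nullopt;
        Task t = std::move(q_.back());
        q_.pop_back();
        return t;
    }
    std::optional<Task> steal() {         // called by other threads
        std::lock_guard<std::mutex> lock(m_);
        if (q_.empty()) return std::nullopt;
        Task t = std::move(q_.front());
        q_.pop_front();
        return t;
    }
private:
    std::deque<Task> q_;
    std::mutex m_;
};

// Worker i: drain its own queue first, then try to steal from the others.
void worker(std::size_t i, std::vector<WorkStealingQueue>& queues,
            std::atomic<bool>& running) {
    while (running.load()) {
        if (auto t = queues[i].pop()) { (*t)(); continue; }
        bool stole = false;
        for (std::size_t j = 0; j < queues.size() && !stole; ++j) {
            if (j == i) continue;
            if (auto t = queues[j].steal()) { (*t)(); stole = true; }
        }
        if (!stole) std::this_thread::yield();   // nothing to do anywhere, back off
    }
}
```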

MPI distribution layer

I used MPI to write a distribution layer. Let's say we have n data sources and k data consumers. In my approach, each of the n MPI processes reads data and then distributes it to one (or many) of the k data consumers (other MPI processes) according to some given logic.
So it seems to be very generic, and my question is: does something like that already exist?
It seems simple, but it might get very complicated. Let's say the distribution checks which of the data consumers is ready to work (dynamic work distribution), or distributes data according to some algorithm based on the data itself. There are plenty of possibilities, and like everyone else I don't want to reinvent the wheel.
As far as I know, there is no generic implementation for it, other than the MPI API itself. You should use the correct functions according to the problem's constraints.
If what you're trying to build is a simple n-producers-and-k-consumers synchronized job/data queue, then of course there are already many implementations out there (just google it and you should find a few).
However, the way you present it seems very general - sometimes you want the data to only be sent to one consumer, sometimes to all of them, etc. In that case, you should figure out what you want and when, and use either point-to-point communication functions, or collective communication functions, accordingly (and of course everyone has to know what to expect - you can't have a consumer waiting for data from a single source, while the producer wishes to broadcast the data...).
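To illustrate the point-to-point case, here is a minimal, hypothetical MPI sketch (rank layout, tag and payload are made up: the first n ranks produce, the remaining k ranks consume, and each producer sends one item to every consumer):

```cpp
#include <mpi.h>

// Hypothetical layout for the sketch: ranks [0, n) are producers and ranks
// [n, n + k) are consumers; each producer sends one integer to every consumer.
// A real scheme could instead pick the consumer based on the data or on
// readiness messages from the consumers.
int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank = 0, size = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int n = size / 2;        // number of producers (assumption)
    const int k = size - n;        // number of consumers
    const int kDataTag = 1;

    if (rank < n) {
        // Producer: send one item to each consumer.
        for (int c = 0; c < k; ++c) {
            int payload = rank * 100 + c;            // stand-in for real data
            MPI_Send(&payload, 1, MPI_INT, n + c, kDataTag, MPI_COMM_WORLD);
        }
    } else {
        // Consumer: expect exactly one message from each producer.
        for (int p = 0; p < n; ++p) {
            int payload = 0;
            MPI_Recv(&payload, 1, MPI_INT, MPI_ANY_SOURCE, kDataTag,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            // ... consume payload ...
        }
    }

    MPI_Finalize();
    return 0;
}
```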
All that aside, here is one implementation that comes to mind that seems to answer all of your requirements:
Make a synchronized queue, producers pushing data in one end, consumers taking it from the other (decide on all kinds of behaviors for the queue as you need - is the queue size limited, does adding an element to a full queue block or fail, does removing an element from an empty queue block or fail, etc.).
Assuming the data contains some flag that tells the consumers if this data is for everyone or just for one of them, the consumers peek and either remove the element, or leave it there and just note that they already did it (either by keeping its id locally, or by changing a flag in the data itself).
If you don't want a single piece of collective data to block until everyone has dealt with it, you can use 2 queues, one for each type of data, and the consumers would take data from one of the queues at a time (either by alternating between the queues, randomly choosing a queue, prioritizing one of the queues, or by some agreed order that is deducible from the data, e.g. lowest id first).
Sorry for the long answer, and I hope this helps :)

How many threads to create and when?

I have a networking Linux application which receives RTP streams from multiple destinations, does very simple packet modification and then forwards the streams to the final destination.
How do I decide how many threads I should have to process the data? I suppose, I cannot open a thread for each RTP stream as there could be thousands. Should I take into account the number of CPU cores? What else matters?
Thanks.
It is important to understand the purpose of using multiple threads on a server; many threads in a server serve to decrease latency rather than to increase speed. You don't make the CPU any faster by having more threads, but you make it more likely that a thread will be available within a given period to handle a request.
Having a bunch of threads which just move data in parallel is a rather inefficient shotgun approach (creating one thread per request naturally just fails completely). Using the thread pool pattern can be a more effective, focused approach to decreasing latency.
Now, in the thread pool, you want to have at least as many threads as you have CPUs/cores. You can have more than this but the extra threads will again only decrease latency and not increase speed.
Think of the problem of organizing server threads as akin to organizing a line in a supermarket. Would you like to have a lot of cashiers who work more slowly, or one cashier who works super fast? The problem with the fast cashier isn't speed, but rather that one customer with a lot of groceries might still take up a lot of their time. The need for many threads comes from the possibility that a few requests taking a lot of time could block all your threads. By this reasoning, whether you benefit from many slower cashiers depends on whether your customers have similar numbers of groceries or wildly different ones. Getting back to the basic model, this means you have to play with your thread count to figure out what is optimal given the particular characteristics of your traffic, looking at the time taken to process each request.
Classically, the reasonable number of threads depends on the number of execution units, the ratio of IO to computation, and the available memory.
Number of Execution Units (XU)
That counts how many threads can be active at the same time. Depending on your computations that might or might not count stuff like hyperthreads -- mixed instruction workloads work better.
Ratio of IO to Computation (%IO)
If the threads never wait for IO but always compute (%IO = 0), using more threads than XUs only increases the overhead of memory pressure and context switching. If the threads always wait for IO and never compute (%IO = 1), then using a variant of poll() or select() might be a good idea.
For all other situations, XU / (1 - %IO) gives an approximation of how many threads are needed to fully use the available XUs, since each thread computes only a (1 - %IO) fraction of the time.
Available Memory (Mem)
This is more of an upper limit. Each thread uses a certain amount of system resources (MemUse). Mem / MemUse gives you an approximation of how many threads can be supported by the system.
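A hypothetical worked example combining the two estimates (made-up numbers): with XU = 8 cores and threads that spend about 75% of their time waiting on IO (%IO = 0.75), the estimate is 8 / (1 - 0.75) = 32 threads. If each thread costs roughly 8 MB (MemUse) and about 4 GB of memory is available (Mem), the memory cap is around 500 threads, so the IO-based estimate is the binding one here.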
Other Factors
The performance of the whole system can still be constrained by other factors even if you can guess or (better) measure the numbers above. For example, there might be another service running on the system which uses some of the XUs and memory. Another problem is the total available IO bandwidth (IOCap). If you need fewer computing resources per transferred byte than your XUs provide, you'll obviously need to care less about using them completely and more about increasing IO throughput.
For more about this latter problem, see this Google Talk about the Roofline Model.
I'd say, try using just ONE thread; it makes programming much easier. Although you'll need to use something like libevent to multiplex the connections, you won't have any unexpected synchronisation issues.
Once you've got a working single-threaded implementation, you can do performance testing and make a decision on whether a multi-threaded one is necessary.
Even if a multithreaded implementation is necessary, it may be easier to break it into several processes instead of threads (i.e. not sharing address space; either fork() or exec multiple copies of the process from a parent) if they don't have a lot of shared data.
You could also consider using something like Python's "Twisted" to make implementation easier (this is what it's designed for).
Really there's probably not a good case for using threads over processes - but maybe there is in your case, it's difficult to say. It depends how much data you need to share between threads.
I would look into a thread pool for this application.
http://threadpool.sourceforge.net/
Allow the thread pool to manage your threads and the queue.
You can tweak the maximum and minimum number of threads used based on performance profiling later.
Listen to the people advising you to use libevent (or OS-specific utilities such as epoll/kqueue). With many connections this is an absolute must because, like you said, creating threads will be an enormous performance hit, and select() also doesn't quite cut it.
Let your program decide. Add code to it that measures throughput and increases/decreases the number of threads dynamically to maximize it.
This way, your application will always perform well, regardless of the number of execution cores and other factors.
It is a good idea to avoid trying to create one (or even N) threads per client request. This approach is classically non-scalable, and you will definitely run into problems with memory usage or context switching. You should look at using a thread pool approach instead, and treat the incoming requests as tasks for any thread in the pool to handle. The scalability of this approach is then limited by the ideal number of threads in the pool - usually this is related to the number of CPU cores. You want each thread to use exactly 100% of the CPU on a single core, so in the ideal case you would have 1 thread per core; this reduces context switching to zero. Depending on the nature of the tasks, this might not be possible - maybe the threads have to wait for external data, or read from disk, or whatever - so you may find that the number of threads needs to be increased by some scaling factor.
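
As a concrete starting point for the pool approach, a minimal fixed-size thread pool over one shared task queue (standard library only, no back-pressure or graceful-drain details; ThreadPool, modify() and forward() are hypothetical names) could look like this:

```cpp
#include <condition_variable>
#include <cstddef>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

// Fixed-size pool: worker threads pull tasks from one shared queue.
class ThreadPool {
public:
    explicit ThreadPool(std::size_t n) {
        for (std::size_t i = 0; i < n; ++i)
            workers_.emplace_back([this] { run(); });
    }
    ~ThreadPool() {
        {
            std::lock_guard<std::mutex> lock(m_);
            stop_ = true;
        }
        cv_.notify_all();
        for (auto& t : workers_) t.join();
    }
    void submit(std::function<void()> task) {
        {
            std::lock_guard<std::mutex> lock(m_);
            tasks_.push(std::move(task));
        }
        cv_.notify_one();
    }
private:
    void run() {
        for (;;) {
            std::function<void()> task;
            {
                std::unique_lock<std::mutex> lock(m_);
                cv_.wait(lock, [this] { return stop_ || !tasks_.empty(); });
                if (stop_ && tasks_.empty()) return;
                task = std::move(tasks_.front());
                tasks_.pop();
            }
            task();   // e.g. modify and forward one RTP packet
        }
    }
    std::vector<std::thread> workers_;
    std::queue<std::function<void()>> tasks_;
    std::mutex m_;
    std::condition_variable cv_;
    bool stop_ = false;
};

// Usage sketch: size the pool to the core count and tune from there.
//   ThreadPool pool(std::thread::hardware_concurrency());
//   pool.submit([pkt] { forward(modify(pkt)); });   // modify()/forward() are yours
```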