Problem
I am currently writing a stream parser that parses multiple feeds that are coming in very fast. Let's assume that it is a twitter stream of accounts' tweets where there are X accounts. I am trying to make the processing as concurrent as possible, while also making sure each account's tweets are processed sequentially.
Each tweet requires some parsing that takes time. With a naive thread pool, tweets assigned to threads that finish sooner can overtake earlier ones, so a single account's tweets may be logged out of order.
This task can be approached using a producer-consumer model, where in this case there is only one producer: the Twitter feed. The consumers are where I am uncertain.
The Approach
My idea to tackle this is fairly simple: map each account to a bucket numbered between 1 and T, where T is the number of threads available on my computer. Then process each bucket sequentially. This way all buckets can run concurrently, and no single account's tweets will be logged out of order.
Here is a crude visualization of what that looks like with two threads and three accounts:
As you can see, since we have two threads, Accounts 1 and 3 map to the same thread but maintain internal consistency. Threads 1 & 2 can run concurrently with no conflicts ever arising.
This structure is also very extensible. If I have more producers with Accounts 4 & 5, for example, I can still add them to threads 2 and 1, respectively, without losing internal account consistency.
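One way the bucket idea could look in code is sketched below: a FIFO queue and a dedicated worker thread per bucket, with tasks routed by account id modulo the bucket count. This is only a minimal sketch under assumptions; BucketedExecutor, Submit and the integer account id are hypothetical names made up for illustration, not anything from the blog post or an existing library.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <deque>
#include <functional>
#include <mutex>
#include <thread>
#include <utility>
#include <vector>

// One FIFO queue and one dedicated worker thread per bucket.
// Tasks with the same account id always land in the same bucket, so they
// run in submission order; different buckets run concurrently.
class BucketedExecutor {
public:
    explicit BucketedExecutor(std::size_t num_buckets) : buckets_(num_buckets) {
        assert(num_buckets > 0);
        for (std::size_t i = 0; i < num_buckets; ++i) {
            workers_.emplace_back([this, i] { Run(buckets_[i]); });
        }
    }

    // Lets every queue drain, then joins the workers.
    ~BucketedExecutor() {
        for (auto& b : buckets_) {
            { std::lock_guard<std::mutex> lock(b.mutex); b.closed = true; }
            b.ready.notify_one();
        }
        for (auto& w : workers_) w.join();
    }

    // Routes the task to the bucket chosen by account_id % number of buckets.
    void Submit(std::size_t account_id, std::function<void()> task) {
        Bucket& b = buckets_[account_id % buckets_.size()];
        { std::lock_guard<std::mutex> lock(b.mutex); b.tasks.push_back(std::move(task)); }
        b.ready.notify_one();
    }

private:
    struct Bucket {
        std::mutex mutex;
        std::condition_variable ready;
        std::deque<std::function<void()>> tasks;
        bool closed = false;
    };

    static void Run(Bucket& b) {
        for (;;) {
            std::function<void()> task;
            {
                std::unique_lock<std::mutex> lock(b.mutex);
                b.ready.wait(lock, [&b] { return b.closed || !b.tasks.empty(); });
                if (b.tasks.empty()) return;  // closed and fully drained
                task = std::move(b.tasks.front());
                b.tasks.pop_front();
            }
            task();  // run outside the lock so Submit never waits on parsing
        }
    }

    std::vector<Bucket> buckets_;
    std::vector<std::thread> workers_;
};
```

With two buckets, Accounts 1 and 3 share a worker and Account 2 gets the other one, which matches the picture described above. Ordering comes from the queue itself rather than from which thread wins a mutex.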
What I've Done So Far/The Code
I'm not sure how to structure this programmatically. I'm fairly new to multi-threading in C++ so I'm using modified code from this blog post as a way to structure my file.
I'd take a read-through if you have the time, but basically my code is a minimal example replicating this process. There are 5 buckets. The tweet parsing is simulated by sleeping for 1 second. I assign each task a mutex based on its bucket, using a simple mutex = mutex_map[task_id % NUM_BUCKETS].
The code is available here, although the VM is limited to 2 threads. If we scale up to 11 threads (on my machine), we run into race conditions where some threads beat the others. Essentially what happens is this:
The machine has 11 threads initially available.
It assigns Task 0, 5, and 10 to some threads in the thread pool.
Task 0 goes first, gets the mutex and locks it. Tasks 5 and 10 are waiting because the mutex for bucket 0 is locked.
Once Task 0 is finished, sometimes Task 10 goes first, and sometimes Task 5 goes first, since a mutex does not guarantee which waiter acquires it next.
Now, one fix is to just limit the thread pool size to NUM_BUCKETS, but that sidesteps the core problem I'm trying to solve here: the per-bucket ordering I want to happen in the background is not actually being enforced.
Solution?
Does anyone have any suggestions on how to approach this? How do I enforce consistency within a bucket? I basically want to assign each task to a specific thread based on the hash. I'm not sure how to do so, as the thread pool is what manages this for me...
Related
I have been working with the Akka actor model. I have a use case where more than 1000 actors will be active and I have to process those actors. I thought of controlling the thread count through the configuration defined in application.conf.
But the number of dispatcher threads created in my application leaves me unable to tune the dispatcher configuration. Each time I restart my application, I see a different number of dispatcher threads created (I checked this via a thread dump each time after starting the application).
The thread count is not even equal to the value I defined in parallelism-min. Due to this low thread count, my application is processing very slowly.
I checked the number of cores on my machine with the code below:
Runtime.getRuntime().availableProcessors();
It displays 40. But the number of dispatcher threads created is less than 300, even though I configured parallelism-min as 500.
Following is my application.conf file:
consumer-dispatcher {
  type = "Dispatcher"
  executor = "fork-join-executor"
  fork-join-executor {
    parallelism-min = 500
    parallelism-factor = 20.0
    parallelism-max = 1000
  }
  shutdown-timeout = 1s
  throughput = 1
}
May I know on what basis Akka creates dispatcher threads internally, and how I can increase my dispatcher thread count to increase parallel processing of actors?
X-Post from discuss.lightbend.com
First let me answer the question directly.
A fork-join-executor will be backed by a java.util.concurrent.ForkJoinPool with its parallelism set to the implied parallelism from the dispatcher config (parallelism-factor * processors, but no larger than parallelism-max and no smaller than parallelism-min). So, in your case, 40 * 20.0 = 800.
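To make the arithmetic concrete, here is the rule above written out as a small function. It is only a sketch of the rule as stated in this answer, not Akka's actual code: ImpliedParallelism is a hypothetical name, and rounding the product up is an assumption.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

// Implied parallelism as described above: processors * factor,
// clamped to the range [min_threads, max_threads].
// Rounding the product up is an assumption of this sketch.
int ImpliedParallelism(int processors, double factor, int min_threads, int max_threads) {
    assert(min_threads <= max_threads);
    int scaled = static_cast<int>(std::ceil(processors * factor));
    return std::min(std::max(scaled, min_threads), max_threads);
}
```

With the values from the question, ImpliedParallelism(40, 20.0, 500, 1000) gives 800. On a 4-core machine the same config would be raised to the minimum of 500, and on a 100-core machine it would be capped at 1000.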
And while I’m no expert on the implementation of ForkJoinPool, the source for the Java implementation says “All worker thread creation is on-demand, triggered by task submissions, replacement of terminated workers, and/or compensation for blocked workers.” and it has methods like getActiveThreadCount(), so it’s clear that ForkJoinPool doesn’t just naively create a giant pool of workers.
In other words, what you are seeing is expected: it’s only going to create threads as they are needed. If you really must have a gigantic pool of worker threads you could create a thread-pool-executor with a fixed-pool-size of 800. This would give you the implementation you are looking for.
But, before you do so, I think you are entirely missing the point of actors and Akka. One of the reasons that people like actors is that they are much more lightweight than threads and can give you a lot more concurrency than a thread. (Also note that concurrency != parallelism, as noted in the documentation on concepts.) So trying to create a pool of 800 threads to back 1000 actors is very wasteful. The introduction in the Akka docs highlights "Millions of actors can be efficiently scheduled on a dozen of threads".
I can’t tell you exactly how many threads you need without knowing your application (for example if you have blocking behavior) but the defaults (which would give you a parallelism factor of 20) are probably just fine. Benchmark to be certain, but I really don’t think you have a problem with too few threads. (The ForkJoinPool behavior you are observing seems to confirm this.)
I'm developing a program (using C++ running on a Linux machine) that uses SQLite as a back-end.
It has 2 threads which carry out the following tasks:
Thread 1
Waits for a piece of data to arrive (in this case, via a radio module)
Immediately inserts it into the database
Returns to waiting for new data
It is important this thread is "listening" for as much of the time as possible and isn't blocked waiting to insert into the database
Thread 2
Every 2 minutes, runs a SELECT on the database to find un-processed data
Processes the data
UPDATEs the rows fetched with a flag to show they have been processed
The key thing is to make sure that Thread 1 can always INSERT into the database, even if this means that Thread 2 is unable to SELECT or UPDATE (as this can just take place at a future point, the timing isn't critical).
I was hoping to find a way to prioritise INSERTs somehow using SQLite, but have failed to find a way so far. Another thought was for Thread 1 to push its data into a basic queue (held in memory) and then bulk INSERT it every so often (as this wouldn't block the receiving of data, and it could do a simple check to see if the database was locked and, if so, wait a few milliseconds and try again).
However, what is the "proper" way to do this with SQLite and C++ threads?
An SQLite database can be opened with or without multi-threading support. Each thread should open its own connection to the database.
If you want to do it the hard way, you can use a priority queue and process the queries from it.
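The in-memory queue idea from the question could be sketched roughly as follows. This is a hedged illustration, not a recommended "proper" design: InsertBuffer and its flush callback are hypothetical, and the callback is simply the place where a batch of INSERTs (ideally inside one transaction, with whatever busy handling you choose) would go. No SQLite calls are made here so that the sketch stays self-contained.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <deque>
#include <functional>
#include <mutex>
#include <string>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical helper: the receiving thread calls Push(), which only holds
// the lock long enough to append. A separate writer thread hands the
// buffered payloads to `flush` in batches of at most max_batch.
class InsertBuffer {
public:
    using Flush = std::function<void(const std::vector<std::string>&)>;

    InsertBuffer(std::size_t max_batch, Flush flush)
        : max_batch_(max_batch), flush_(std::move(flush)),
          writer_([this] { WriterLoop(); }) {
        assert(max_batch > 0);
    }

    // Flushes whatever is still buffered, then stops the writer thread.
    ~InsertBuffer() {
        { std::lock_guard<std::mutex> lock(mutex_); closed_ = true; }
        ready_.notify_one();
        writer_.join();
    }

    void Push(std::string payload) {
        { std::lock_guard<std::mutex> lock(mutex_); pending_.push_back(std::move(payload)); }
        ready_.notify_one();
    }

private:
    void WriterLoop() {
        for (;;) {
            std::vector<std::string> batch;
            {
                std::unique_lock<std::mutex> lock(mutex_);
                ready_.wait(lock, [this] { return closed_ || !pending_.empty(); });
                if (pending_.empty()) return;  // closed and nothing left
                while (!pending_.empty() && batch.size() < max_batch_) {
                    batch.push_back(std::move(pending_.front()));
                    pending_.pop_front();
                }
            }
            flush_(batch);  // database work happens outside the lock
        }
    }

    const std::size_t max_batch_;
    Flush flush_;
    std::mutex mutex_;
    std::condition_variable ready_;
    std::deque<std::string> pending_;
    bool closed_ = false;
    std::thread writer_;  // declared last so it starts after the other members exist
};
```

The point of the design is that a slow or locked database only delays the writer thread; the receiving thread never waits on anything longer than a queue push.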
Let's say I have a command to edit a single entry of an article, called ArticleEditCommand.
User 1 issues an ArticleEditCommand based on V1 of the article.
User 2 issues an ArticleEditCommand based on V1 of the same article.
If I can ensure that my nodes process the older ArticleEditCommand commands first, I can be sure that the command from User 2 will fail because User 1's command will have changed the version of the article to V2.
However, if I have two nodes processing ArticleEditCommand messages concurrently, even though the commands will be taken off the queue in the correct order, I cannot guarantee that the nodes will actually process the first command before the second command, due to a spike in CPU or something similar. I could use a SQL transaction to update an article where version = expectedVersion and make note of the number of records changed, but my rules are more complex and can't live solely in SQL. I would like the entire command-processing logic to be guaranteed safe under concurrency between ArticleEditCommand messages that alter the same article.
I don't want to lock the queue while I process the command, because the point of having multiple command handlers is to handle commands concurrently for scalability. With that said, I don't mind these commands being processed consecutively, but only for a single instance/id of an article. I don't expect a high volume of ArticleEditCommand messages to be sent for a single article.
With that said, here is the question.
Is there a way to handle commands consecutively across multiple nodes for a single unique object (database record), but handle all other commands (distinct database records) concurrently?
Or, is this a problem I created myself because of a lack of understanding of CQRS and concurrency?
Is this a problem that message brokers typically have solved? Such as Windows Service Bus, MSMQ/NServiceBus, etc?
EDIT: I think I know how to handle this now. When User 2 issues the ArticleEditCommand, an exception should be thrown to the user letting them know that there is a pending operation on that article that must be completed before they can queue the ArticleEditCommand. That way, there are never two ArticleEditCommand messages in the queue that affect the same article.
First let me say, if you don't expect a high volume of ArticleEditCommand messages being sent, this sounds like premature optimization.
In other solutions, this problem is usually not solved by message brokers, but by optimistic locking enforced by the persistence implementation. I don't understand why a simple version field for optimistic locking, which can be trivially handled by SQL, conflicts with complicated business logic/updates. Maybe you could elaborate more?
It's actually quite simple and I did that. Basically, it looks like this (pseudocode):
//message handler
ModelTools.TryUpdateEntity(
    () =>
    {
        var entity = _repo.Get(myId);
        entity.Do(whateverCommand);
        _repo.Save(entity);
    },
    10); //retry 10 times until giving up

//repository
long? _version;

public MyObject Get(Guid id)
{
    //query data and version
    _version = data.version;
    return data.ToMyObject();
}

public void Save(MyObject data)
{
    //update row in db where version=_version.Value
    if (rowsUpdated == 0)
    {
        //things have changed since we've retrieved the object
        throw new NewerVersionExistsException();
    }
}
ModelTools.TryUpdateEntity and NewerVersionExistsException are part of my CavemanTools general-purpose library (available on NuGet).
The idea is to try doing things normally, then if the object version (rowversion/timestamp in SQL) has changed, retry the whole operation after waiting a couple of milliseconds. That's exactly what the TryUpdateEntity() method does, and you can tweak how long to wait between tries or how many times it should retry the operation.
If you need to notify the user, then forget about retrying, just catch the exception directly and then tell the user to refresh or something.
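For readers who want something they can compile, the same retry-on-stale-version idea can be sketched in C++ with an in-memory stand-in for the database row. Everything here is an assumption made for illustration: ArticleStore, TryUpdate and NewerVersionExists are hypothetical names, not the CavemanTools API, and the version check in Save merely imitates an UPDATE ... WHERE version = expected that affects zero rows.

```cpp
#include <cassert>
#include <chrono>
#include <functional>
#include <mutex>
#include <stdexcept>
#include <string>
#include <thread>

struct Article {
    std::string body;
    long version = 0;
};

struct NewerVersionExists : std::runtime_error {
    NewerVersionExists() : std::runtime_error("row changed since it was read") {}
};

// In-memory stand-in for the repository. Save imitates
// "UPDATE ... WHERE version = expected" and throws when nothing matches.
class ArticleStore {
public:
    Article Get() const {
        std::lock_guard<std::mutex> lock(mutex_);
        return row_;
    }
    void Save(const Article& edited) {
        std::lock_guard<std::mutex> lock(mutex_);
        if (edited.version != row_.version) throw NewerVersionExists();
        row_ = edited;
        ++row_.version;
    }
private:
    mutable std::mutex mutex_;
    Article row_;
};

// Re-runs the whole read-modify-write until it saves or attempts run out.
bool TryUpdate(ArticleStore& store, const std::function<void(Article&)>& edit,
               int max_attempts) {
    assert(max_attempts > 0);
    for (int attempt = 0; attempt < max_attempts; ++attempt) {
        Article article = store.Get();
        edit(article);
        try {
            store.Save(article);
            return true;
        } catch (const NewerVersionExists&) {
            // Stale read: wait briefly, then reload and try again.
            std::this_thread::sleep_for(std::chrono::milliseconds(2));
        }
    }
    return false;
}
```

A handler would call TryUpdate(store, edit, 10) for the retrying variant, or call Get and Save directly and catch NewerVersionExists when the user should be told to refresh instead.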
Partition based solution
Achieve node stickiness by routing the incoming command based on the object's ID (e.g. articleId modulo your-number-of-nodes) to make sure the commands of User1 and User2 end up on the same node, then process the commands consecutively. You can choose to process all commands one by one or, if you want to parallelize the execution, partition the commands on something like ID, odd/even, by country or similar.
Grid based solution
Use an in-memory grid (eg. Hazelcast or Coherence) and use a distributed Executor Service (http://docs.hazelcast.org/docs/2.0/manual/html/ch09.html#DistributedExecution) or similar to coordinate the command processing across the cluster.
Regardless - before adding this kind of complexity, you should of course ask yourself if it's really a problem if User2's command would be accepted and User1 got a concurrency error back. As long as User1's changes are not lost and can be re-applied after a refresh of the article it might be perfectly fine.
(the problem is embarrassingly parallel)
Consider an array of 12 cells:
|__|__|__|__|__|__|__|__|__|__|__|__|
and four (4) CPUs.
Naively, I would run 4 parallel jobs, feeding 3 cells to each CPU.
|__|__|__|__|__|__|__|__|__|__|__|__|
=========|========|========|========|
  1 CPU     2 CPU    3 CPU    4 CPU
BUT it turns out that each cell has a different evaluation time: some cells are evaluated very quickly, and some are not.
So, instead of leaving CPUs idle, I would like to feed one cell at a time to each CPU and continue until the entire job is done.
Namely:
at the beginning:
|____|____|____|____|____|____|____|____|____|____|____|____|
 1cpu 2cpu 3cpu 4cpu
If 2cpu finishes its job at cell "2", it can jump to the first unclaimed cell "5" and continue working:
|____|done|____|____|____|____|____|____|____|____|____|____|
 1cpu      3cpu 4cpu 2cpu
      |-------------->
If 1cpu finishes, it can take the sixth cell:
|done|done|____|____|____|____|____|____|____|____|____|____|
           3cpu 4cpu 2cpu 1cpu
 |------------------------>
and so on, until the full array is done.
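To show just the control flow of this scheme, here is a sketch that uses threads as stand-ins for the CPUs and a shared atomic counter as the "first empty cell" pointer. Please note this is not MPI code and makes no claim about how to do it with MPI; EvaluateDynamically and its parameters are hypothetical names for illustration. In MPI there is no shared counter between ranks, so something has to play that role.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

// Each worker repeatedly claims the next unclaimed cell index until none
// are left, so a worker that draws quick cells simply claims more of them.
// Returns, for every cell, the id of the worker that evaluated it.
std::vector<int> EvaluateDynamically(std::size_t num_cells, int num_workers,
                                     const std::function<void(std::size_t)>& evaluate) {
    assert(num_workers > 0);
    std::atomic<std::size_t> next{0};
    std::vector<int> owner(num_cells, -1);
    std::vector<std::thread> workers;
    for (int w = 0; w < num_workers; ++w) {
        workers.emplace_back([&, w] {
            for (;;) {
                std::size_t cell = next.fetch_add(1);  // claim the first unclaimed cell
                if (cell >= num_cells) return;         // nothing left to do
                evaluate(cell);
                owner[cell] = w;  // each index is written by exactly one worker
            }
        });
    }
    for (auto& t : workers) t.join();
    return owner;
}
```

Calling EvaluateDynamically(12, 4, work) corresponds to the 12-cell, 4-CPU picture above. Which worker ends up with which cell depends on timing, so it differs from run to run.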
QUESTION:
I do not know a priori which cells are quick and which are slow, so I cannot distribute the CPUs according to the load (more CPUs for the slow cells, fewer for the quick ones).
How can one implement such an algorithm for dynamic evaluation with MPI?
Thanks!!!!!
UPDATE
I use a very simple approach to divide the entire job into chunks, with MPI-IO:
given: array[NNN] and nprocs - number of available working units:
for (int i = 0; i < NNN / nprocs; ++i)
{
    do_what_I_need(start + i);
}
MPI_File_write(...);
where "start" corresponds to the particular rank number. In simple words, I divide the entire NNN array into fixed-size chunks according to the number of available CPUs; each CPU processes its chunk, writes the result to the (common) output and then sits idle.
IS IT POSSIBLE to change the code (without completely rewriting it in terms of the Master/Slave paradigm) in such a way that each CPU gets only ONE iteration (and not NNN/nprocs) and, after it completes its job and writes its part to the file, continues to the next cell instead of sitting idle?
Thanks!
There is a well known parallel programming pattern, known under many names, some of which are: bag of tasks, master / worker, task farm, work pool, etc. The idea is to have a single master process, which distributes cells to the other processes (workers). Each worker runs an infinite loop in which it waits for a message from the master, computes something and then returns the result. The loop is terminated by having the master send a message with a special tag. The wildcard tag value MPI_ANY_TAG can be used by the worker to receive messages with different tags.
The master is more complex. It also runs a loop, but only until all cells have been processed. Initially it sends each worker a cell and then starts a loop. In this loop it receives a message from any worker using the wildcard source value MPI_ANY_SOURCE and, if there are more cells to be processed, sends one of them to the same worker that has returned the result. Otherwise it sends a message with the tag set to the termination value.
There are many readily available implementations of this model on the Internet and even some on Stack Overflow (for example this one). Mind that this scheme requires one additional MPI process that often does very little work. If this is unacceptable, one can run a worker loop in a separate thread.
You want to implement a kind of client-server architecture where you have workers asking the server for work whenever they are out of work.
Depending on the size of the chunks and the speed of your communication between workers and server, you may want to adjust the size of the chunks sent to workers.
To answer your updated question:
Under the master/slave (or worker pool, if that's how you prefer it to be labelled) model, you will basically need a task scheduler. The master should have information about what work has been done and what still needs to be done. The master will give each process some work to be done, then sit and wait until a process completes (using nonblocking receives and a wait-any such as MPI_Waitany, since a wait-all would block until every worker has finished). Once a process completes, have it send the data to the master, then wait for the master to respond with more work. Continue this until the work is done.
I need to implement a statistics reporter: an object that prints a bunch of statistics to the screen.
This info is updated by 20 threads.
The reporter must itself be a thread that wakes up every second, reads the info and prints it to the screen.
My design so far: InfoReporterElement - one element of info. It has two functions, PrintInfo and UpdateData.
InfoReporterRow - one row on screen. A row holds a vector of InfoReporterElement.
InfoReporterModule - a module composed of a header and a vector of rows.
InfoReporter - the reporter, composed of a vector of modules and a header. The reporter exports the function 'PrintData', which goes over all modules/rows/basic elements and prints the data to the screen.
I think that I should have an object responsible for receiving updates from the threads and updating the basic info elements.
The main problem is how to update the info: should I use one mutex for the object or a mutex per basic element?
Also, which object should be a thread: the reporter itself, or the one that receives updates from the threads?
I would say that first of all, the Reporter itself should be a thread. Isolating the drawing part from the active code is basic decoupling (MVC).
The structure itself is of little use here. When you reason in terms of multithreading, it's not so much the structure as the flow of information that you should check.
Here you have 20 active threads that will update the information, and 1 passive thread that will display it.
The problem here is that you risk delaying the actual work because an active thread cannot acquire the lock (held for display). Reporting (or logging) should never block (or as little as possible).
I propose to introduce an intermediate structure (and thread), to separate the GUI and the work: a queuing thread.
active threads post events to the queue
the queuing thread updates the structure above
the displaying thread shows the current state
You can avoid some synchronization issues by using the same idea that is used for Graphics. Use 2 buffers: the current one (that is displayed by the displaying thread) and the next one (updated by the queuing thread). When the queuing thread has processed a batch of events (up to you to decide what a batch is), it asks to swap the 2 buffers, so that next time the displaying thread will display fresh info.
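A minimal sketch of the two-buffer idea, under some assumptions: DoubleBufferedStats and its method names are hypothetical, statistics are modelled as named counters, and the batch is published by copying the back buffer rather than by a raw swap. The copy is a deliberate simplification so the back buffer keeps its cumulative totals; with a true swap you would have to replay the last batch into the other buffer.

```cpp
#include <cassert>
#include <map>
#include <mutex>
#include <string>

// Two copies of the statistics: the queuing thread edits next_, the display
// thread reads current_. Only Publish and Snapshot take the shared lock, so
// the display never blocks event processing for longer than one copy.
class DoubleBufferedStats {
public:
    // Queuing thread only: apply one event to the back buffer.
    void Add(const std::string& key, long delta) {
        assert(!key.empty());
        next_[key] += delta;
    }

    // Queuing thread only: make the processed batch visible to the display.
    void Publish() {
        std::lock_guard<std::mutex> lock(mutex_);
        current_ = next_;
    }

    // Display thread: copy out the last published state.
    std::map<std::string, long> Snapshot() const {
        std::lock_guard<std::mutex> lock(mutex_);
        return current_;
    }

private:
    std::map<std::string, long> next_;     // touched by the queuing thread only
    mutable std::mutex mutex_;             // guards current_
    std::map<std::string, long> current_;  // what the display thread sees
};
```

The display thread would call Snapshot() once per second and print the result, while the queuing thread calls Add() per event and Publish() after each batch. Updates made after the last Publish() are simply not visible until the next one.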
Note: On a more personal note, I don't like your structure. The working thread has to know exactly where on the screen the element it should update is displayed; this is a clear breach of encapsulation.
Once again, look up MVC.
And since I am neck deep in patterns: look up Observer too ;)
The main problem is how to update the info - should I use one mutex for the object or use mutex per basic element?
Put a mutex around the basic unit of update. If this is an InfoReporterElement object, you'd need a mutex per such object. Otherwise, if a whole row is updated at a time by any one of the threads, put the mutex around the row, and so on.
Also, which object should be a threads - the reporter itself, or the one that received updates from the threads?
You can put all of them in separate threads -- multiple writer threads that update the information and one reader thread that reads the value.
You seem to have a pretty good grasp of the basics of concurrency.
My initial thought would be a queue with a mutex that locks for writes and deletes. If you have the time, I would look at lock-free access.
For your second concern I would have just one reader thread.
A piece of code would be nice to operate on.
Attach a mutex to every InfoReporterElement. As you've written in a comment, you need not only to get and set an element's value but also to increment it and probably do other things, so what I'd do is write a mutex-protected member function for every interlocked operation I need.
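Since a piece of code was asked for, here is roughly what a mutex per element might look like. It is a sketch under assumptions: the element is reduced to a label plus a single long value, and Set, Increment, Get, Format and the IncrementFromThreads helper are hypothetical names (the question's PrintInfo and UpdateData are not shown).

```cpp
#include <cassert>
#include <mutex>
#include <string>
#include <thread>
#include <utility>
#include <vector>

// One mutex per element; every operation that must be atomic gets its own
// member function that takes the lock for its whole duration.
class InfoReporterElement {
public:
    explicit InfoReporterElement(std::string label) : label_(std::move(label)) {}

    void Set(long value) {
        std::lock_guard<std::mutex> lock(mutex_);
        value_ = value;
    }

    // Read-modify-write under a single lock, so concurrent increments are not lost.
    long Increment(long delta = 1) {
        std::lock_guard<std::mutex> lock(mutex_);
        value_ += delta;
        return value_;
    }

    long Get() const {
        std::lock_guard<std::mutex> lock(mutex_);
        return value_;
    }

    // What the reporter thread would print for this element.
    std::string Format() const { return label_ + ": " + std::to_string(Get()); }

private:
    const std::string label_;
    mutable std::mutex mutex_;
    long value_ = 0;
};

// Demo helper: num_threads writers each increment the element per_thread times.
void IncrementFromThreads(InfoReporterElement& element, int num_threads, int per_thread) {
    assert(num_threads > 0 && per_thread >= 0);
    std::vector<std::thread> writers;
    for (int t = 0; t < num_threads; ++t) {
        writers.emplace_back([&element, per_thread] {
            for (int i = 0; i < per_thread; ++i) element.Increment();
        });
    }
    for (auto& w : writers) w.join();
}
```

Doing the increment as a separate Get() followed by Set() from the caller's side would lose updates under contention, which is why each compound operation gets its own locked member function.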