Is there a way to implement a concurrent queue using memcached?

I am working on a key-value database similar to memcached.
For some time I have been wondering how one might implement a concurrent message queue using memcached, without locks and without "help from the server" - which is how Redis manages to do it.
In most implementations there are a key_to_head and a key_to_tail.
PUSH, for example, seems easy:
increment head
set data
However, what happens if a reader decides to POP between those two operations?
I checked several implementations, but either they don't really work, or I am not "clever enough" to see the trick they are using...
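For what it's worth, here is a minimal single-process sketch of the trick most such implementations seem to rely on: the atomic INCR itself hands every pusher and popper a unique slot number, and a popper that loses the race described above simply retries its GET until the pusher's SET lands. The KV class below is a hypothetical stand-in for a memcached client (just INCR, SET, and GET+DELETE), so the scheme is runnable in one file; it is a sketch of one common approach, not a hardened queue.

#include <iostream>
#include <map>
#include <mutex>
#include <optional>
#include <string>
#include <thread>

// Hypothetical stand-in for a memcached client: incr/set/take are each
// atomic, mirroring the per-command guarantees the real server gives.
class KV {
    std::mutex m;
    std::map<std::string, std::string> data;
    std::map<std::string, long> counters;
public:
    long incr(const std::string& k) {                 // like memcached INCR
        std::lock_guard<std::mutex> g(m);
        return ++counters[k];
    }
    void set(const std::string& k, const std::string& v) {
        std::lock_guard<std::mutex> g(m);
        data[k] = v;
    }
    std::optional<std::string> take(const std::string& k) {  // GET + DELETE
        std::lock_guard<std::mutex> g(m);
        auto it = data.find(k);
        if (it == data.end()) return std::nullopt;
        std::string v = std::move(it->second);
        data.erase(it);
        return v;
    }
};

void push(KV& kv, const std::string& v) {
    long slot = kv.incr("q:head");               // 1) claim a unique slot
    kv.set("q:" + std::to_string(slot), v);      // 2) publish the data
}

std::string pop(KV& kv) {
    long slot = kv.incr("q:tail");               // claim the oldest slot
    for (;;) {
        auto v = kv.take("q:" + std::to_string(slot));
        if (v) return *v;
        // A pusher has claimed this slot but its SET hasn't landed yet:
        // retry. (A real implementation must bound this loop and must not
        // pop past head on an empty queue.)
        std::this_thread::yield();
    }
}

int main() {
    KV kv;
    std::thread writer([&] {
        for (int i = 0; i < 5; ++i) push(kv, "msg" + std::to_string(i));
    });
    for (int i = 0; i < 5; ++i) std::cout << pop(kv) << "\n";
    writer.join();
}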

Related

What's the most efficient way to async send data while async receiving with 0MQ?

I've got a ROUTER/DEALER setup where both ends need to be able to receive and send data asynchronously, as soon as it's available. The model is pretty much 0MQ's async C++ server: http://zguide.zeromq.org/cpp:asyncsrv
Both the client and the server workers poll; when there's data available, they call a callback. While this happens, from another thread (!) I'm putting data into a std::deque. In each poll-forever thread, I check the deque (under lock), and if there are items there, I send them out to the specified DEALER id (the id is placed in the queue).
But I can't help thinking that this is not idiomatic 0MQ. The mutex is possibly a design problem. Plus, memory consumption can probably get quite high if enough time passes between polls (and data accumulates in the deque).
The only alternative I can think of is having another DEALER thread connect to an inproc each time I want to send out data, and just have it send it and exit. However, this implies a connect per item of data sent + construction and destruction of a socket, and it's probably not ideal.
Is there an idiomatic 0MQ way to do this, and if so, what is it?
I don't fully understand your design, but I do understand your concern about using locks.
In most cases you can redesign your code to remove the use of locks by using zeromq PAIR sockets over inproc.
Do you really need a std::deque? If not, you could just use a zeromq queue, since it's just a queue that you can read/write from different threads using sockets.
If you really need the deque, then encapsulate it into its own thread (a class would be nice) and make its API (push etc.) accessible via inproc sockets, as in the sketch below.
So, like I said before, I may be on the wrong track, but in 99% of the cases I have come across you can remove the locks completely with some ZMQ_PAIR/inproc if you need signalling.
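For illustration, a minimal sketch of that shape using cppzmq's classic send/recv API (the endpoint name inproc://outbox is my own invention; newer cppzmq releases prefer the flag-enum overloads). The producer thread writes into a PAIR socket instead of a locked std::deque, and zeromq does the queueing. Note that with inproc the bind() side must be up before the peer connects:

#include <zmq.hpp>
#include <iostream>
#include <string>
#include <thread>

int main() {
    zmq::context_t ctx(1);

    // Consumer end: bind first (inproc requires bind before connect).
    zmq::socket_t rx(ctx, ZMQ_PAIR);
    rx.bind("inproc://outbox");

    // Producer: instead of pushing into a mutex-guarded deque, write into
    // the PAIR socket and let zmq queue it.
    std::thread producer([&ctx] {
        zmq::socket_t tx(ctx, ZMQ_PAIR);
        tx.connect("inproc://outbox");
        for (int i = 0; i < 3; ++i) {
            std::string payload = "item " + std::to_string(i);
            zmq::message_t msg(payload.data(), payload.size());
            tx.send(msg);
        }
    });

    // In the real design rx would simply be one more socket in the
    // zmq::poll set next to the ROUTER; here we just drain it.
    for (int i = 0; i < 3; ++i) {
        zmq::message_t msg;
        rx.recv(&msg);
        std::cout << std::string(static_cast<char*>(msg.data()), msg.size()) << "\n";
    }
    producer.join();
}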
A 0mq queue has a limited buffer size, and that size can be controlled. So memory use will grow to some point, and then data will start being dropped. For that reason you may consider using the conflate option, which keeps only the most recent data in the queue.
For a single server communicating with many threads on one machine, I suggest a publish/subscribe model with the conflate option: you receive the newest data as soon as you read the buffer, and you don't have to worry about memory. It also removes the blocking-queue problem.
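A sketch of that conflate setup with cppzmq (endpoint name invented; ZMQ_CONFLATE needs libzmq 4.x, must be set before connecting, and does not support multipart messages). The sleeps are a crude stand-in for a real event loop:

#include <zmq.hpp>
#include <chrono>
#include <iostream>
#include <string>
#include <thread>

int main() {
    zmq::context_t ctx(1);

    zmq::socket_t pub(ctx, ZMQ_PUB);
    pub.bind("inproc://updates");

    zmq::socket_t sub(ctx, ZMQ_SUB);
    int one = 1;
    sub.setsockopt(ZMQ_CONFLATE, &one, sizeof(one)); // keep newest msg only
    sub.setsockopt(ZMQ_SUBSCRIBE, "", 0);            // subscribe to all
    sub.connect("inproc://updates");

    std::this_thread::sleep_for(std::chrono::milliseconds(100));
    for (int i = 0; i < 100; ++i) {                  // flood the subscriber
        std::string s = std::to_string(i);
        zmq::message_t m(s.data(), s.size());
        pub.send(m);
    }
    std::this_thread::sleep_for(std::chrono::milliseconds(100));

    zmq::message_t m;
    sub.recv(&m);                             // only the latest value survives
    std::cout << std::string(static_cast<char*>(m.data()), m.size()) << "\n";
}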
As for your implementation, you are quite right that it is not the best design, but it is quite unavoidable. I suggest checking the question "Access std::deque from 3 threads"; while it addresses your problem, it may not be the best approach.

Multiple db inserts with Django: performance is not increased by parallel threads

I'm doing thousands and thousands of inserts to a PostgreSQL database with Python and Django (using the CLI, so no web server at all).
The objects that are inserted are already in memory, and I'm popping them one by one from a FIFO queue (using Python's native Queue: https://docs.python.org/2/library/queue.html)
What I'm doing basically is:
args1, args2 = queue.get()
m1, _ = Model1.objects.get_or_create(args1)
Model2.objects.create(m1, args2)
I was thinking that a way to do this faster would be to spawn a few more threads that can do this in parallel. To my surprise, the performance actually decreased slightly... I was expecting an almost linear improvement in relation to the number of threads... not sure what's going on...
Is there something database specific I'm missing, are there table locks that are blocking the threads when this is running?
Or does it have something to do with that each thread can only access a single database connection atomically during runtime?
I have standard configuration for PostgreSQL (9.3) and Django (1.7.7) installed with apt-get on Debian Jessie.
Also I tried with 4 threads, which is the same number of CPUs I have available on my box.
There are a few things going on here.
Firstly, you are using very high-level ORM methods (get_or_create, create). Those are generally not a good fit for bulk operations, since methods like that tend to have a lot of overhead to provide a nice API, and they also do additional work to prevent users from shooting themselves in the foot too easily.
Secondly, your careful use of a queue is counterproductive in multiple ways:
Because Django runs in autocommit mode by default, each database operation is carried out in its own transaction. Since that is a relatively expensive operation, this causes unnecessary overhead.
Inserting each object by itself also causes a lot more back-and-forth communication between the database and Django, which again produces overhead, slowing things down.
Thirdly, the reason using multiple threads is even slower stems from the fact that Python has a GIL (Global Interpreter Lock). This prevents multiple threads from executing Python code at the same time. There is a lot of material on the web about the whys and hows of the GIL and what can be done in which circumstances to mitigate it. There is a nice summary by Dave Beazley about the GIL that should get you started if you're interested in learning more about it.
Additionally I'd generally recommend against doing large inserts from multiple threads in any language since - depending on your database and data model - this can also cause slowdowns inside the database due to possibly required locking.
Now there are many solutions to your problem but I'd recommend to start with a simple one:
Django actually provides a handy low-level interface to create models in bulk, fittingly enough called bulk_create(). I'd suggest removing all that fancy queue and thread code and using this interface as directly as possible with the data you already have.
In case this isn't sufficient for your case, a possible alternative would be to generate an INSERT INTO statement from the data and execute that directly on the database.
If all you want to achieve is insertion, could you just use the save() method instead of get_or_create()? get_or_create() queries the database first, so if the table is large, that call can be a bottleneck - and that's probably why having multiple parallel threads does not help.
The other possibility concerns the insertion itself. Postgres by default enables auto-commit on a per-insert (per-transaction) basis. The committing process involves complex mechanisms under the hood. Long story short, you may try disabling auto-commit and see whether that helps in your particular case. A relevant article is here.

C++ IOCP server container information

I have been passing a few ideas around in my head about how to contain large numbers of connections in an IO type of architecture while maintaining KISS. From examples on the web, it seems like most use a double/single linked list with CONTAINING_RECORD. And, as a newbie to IO servers (though improving every day), I too use a linked-list container for my IO architecture.
My question is: instead of using a single/double linked list for my connections, why can't I just build a large array and use CONTAINING_RECORD? Can I use an STL vector? Would that work? Also, what other types of containers work best with a massive IO server?
I'm in the process of re-writing the server architecture for my game server (after many revisions), and I would like to head in the right direction this time around, because I'd rather not have to rewrite it again in the near future.
Thank you for your time and replies.
Edit: Currently my server architecture is (in a nutshell):
Main thread listening and accepting -> Pass over the socket into a container.
Worker threads (2-3) grab IO events for the container of sockets.
Worker threads Read/Write Data on that container.
Main thread and worker threads all use a linked-list. I want to get away from this.
Your "connection list" will probably have removals from any position, not just the end. For std::vector, removing elements in the middle is an O(N) operation, but for linked lists it can be O(1). (For single-linked lists this isn't trivial and may require an inconvenient API).
std::map may be an interesting choice as it offers both O(log N) finding and removing of elements.
As with all data structures, it depends very much on what you want to do with it.
In a previous job I spent most of my time working on a hugely multithreaded C++ server which, in its Windows incarnation, used IO Completion Ports (the Solaris backend used /dev/poll, which is not that dissimilar in several essentials). That one stored connection-related data structures in a map-like structure dating from before the STL, using the file descriptors as the key values. Thus whenever we got an event on a connection we could look up its related data structures by the descriptor the IO layer handed us. New connections were easy to handle - just add an entry to the dictionary - and closed connections could also be cleaned up quite trivially.
Naturally one has to be careful about cross-thread access to these structures and about operation ordering - since IO is inherently effectful, the ordering of operations is crucial. Fortunately IOCP won't give you another event on another thread for the same socket until you put the socket back into the CP, but the Solaris implementation had to also keep a structure linking file descriptors to worker threads in order to ensure that we only processed one event per socket at a time, and in strict order, and we also tried to inject subsequent events for a socket into the same thread to avoid having to potentially switch the socket's structures onto another processor which is a disaster for cache hit rates.
The basic summary though is that we found an appropriately-designed dictionary class to be incredibly useful for this sort of thing.
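To make that concrete, a stripped-down sketch of the dictionary shape (names invented, an int descriptor standing in for the SOCKET/fd, and std::unordered_map standing in for the pre-STL map-like structure; no actual IOCP calls):

#include <memory>
#include <mutex>
#include <string>
#include <unordered_map>

struct Connection {
    int fd = -1;
    std::string read_buffer;  // per-connection state: buffers, protocol, etc.
};

class ConnectionTable {
    std::mutex m;             // guard cross-thread access from the workers
    std::unordered_map<int, std::unique_ptr<Connection>> table;
public:
    Connection* add(int fd) {                  // new connection: O(1) average
        std::lock_guard<std::mutex> g(m);
        auto c = std::make_unique<Connection>();
        c->fd = fd;
        return (table[fd] = std::move(c)).get();
    }
    Connection* find(int fd) {                 // IO event -> its state
        std::lock_guard<std::mutex> g(m);
        auto it = table.find(fd);
        return it == table.end() ? nullptr : it->second.get();
    }
    void remove(int fd) {                      // closed connection cleanup
        std::lock_guard<std::mutex> g(m);
        table.erase(fd);
    }
};

int main() {
    ConnectionTable conns;
    conns.add(42);                             // accept: register descriptor
    if (Connection* c = conns.find(42)) c->read_buffer += "hello";
    conns.remove(42);                          // close: drop its state
}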

Console output in multi-threaded applications

When developing applications, I usually print to the console to get useful debugging/tracing information. The application I am working on now is multi-threaded, and sometimes I see my printf output overlapping.
I tried synchronizing the screen using a mutex, but I ended up slowing down and blocking the app. How can I solve this issue?
I am aware of MT logging libraries, but since I log a lot, using them slows my app down (a bit).
I was thinking of the following idea: instead of logging within my application, why not log outside of it? I would like to send the logging information via a socket to a second process that actually prints it to the screen.
Are you aware of any library already doing this?
I use Linux/gcc.
thanks
afg
You have 3 options. In increasing order of complexity:
Just use a simple mutex around the console output. The mutex is shared by all threads.
Send all the output to a single thread that does nothing but the logging.
Send all the output to a separate logging application.
Under most circumstances, I would go with #2. #1 is fine as a starting point, but in all but the most trivial applications you can run into problems serializing the application. #2 is still very simple, and simple is a good thing, but it is also quite scalable. You still end up doing the processing in the main application, but for the vast majority of applications you gain nothing by spinning this off into its own dedicated application. A sketch of #2 follows below.
Number 3 is what you're going to do in performance-critical server-type applications, but the minimal performance gain you get with this approach is 1: very difficult to achieve, 2: very easy to screw up, and 3: not the only or even the most compelling reason people generally take this approach. Rather, people typically take this approach when they need the logging service to be separated from the applications using it.
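A minimal sketch of option #2 with std::thread (all names invented): producers only format and enqueue; the logger thread is the only code that touches the console, so output can't interleave and producers block only for a queue push.

#include <condition_variable>
#include <deque>
#include <iostream>
#include <mutex>
#include <string>
#include <thread>
#include <vector>

class Logger {
    std::mutex m;
    std::condition_variable cv;
    std::deque<std::string> q;
    bool done = false;
    std::thread worker;   // declared last: it uses the members above

    void run() {
        std::unique_lock<std::mutex> lk(m);
        while (!done || !q.empty()) {
            cv.wait(lk, [&] { return done || !q.empty(); });
            while (!q.empty()) {
                std::string line = std::move(q.front());
                q.pop_front();
                lk.unlock();                   // do the slow I/O unlocked
                std::cout << line << "\n";
                lk.lock();
            }
        }
    }
public:
    Logger() : worker(&Logger::run, this) {}
    ~Logger() {                                // flush, then stop the thread
        { std::lock_guard<std::mutex> g(m); done = true; }
        cv.notify_one();
        worker.join();
    }
    void log(std::string line) {               // cheap for producers
        { std::lock_guard<std::mutex> g(m); q.push_back(std::move(line)); }
        cv.notify_one();
    }
};

int main() {
    Logger log;
    std::vector<std::thread> ts;
    for (int t = 0; t < 4; ++t)
        ts.emplace_back([&, t] {
            for (int i = 0; i < 5; ++i)
                log.log("thread " + std::to_string(t) + " line " + std::to_string(i));
        });
    for (auto& t : ts) t.join();
}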
Which OS are you using?
Not sure about specific libraries, but one of the classical approaches to this sort of problem is to use a logging queue worked by a writer thread, whose job is purely to write the log file.
You need to be aware, with either a threaded or a multi-process approach, that the write queue may back up. That means it needs to be managed, either by discarding entries or by slowing down your application (which is obviously easier with the threaded approach).
It's also common to have some way of categorising your logging output, so that you can have one section of your code logging at a high level, whilst another section of your code logs at a much lower level. This makes it much easier to manage the amount of output that's being written to files and offers you the option of releasing the code with the logging in it, but turned off so that it can be used for fault diagnosis when installed.
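A tiny sketch of that kind of level gate (names invented; the actual write should still go through whatever queue or thread does the output):

#include <cstdio>

enum Level { ERROR = 0, WARN, INFO, DEBUG };
static Level g_threshold = INFO;        // raise to DEBUG for fault diagnosis

#define LOG(level, ...) \
    do { if ((level) <= g_threshold) std::printf(__VA_ARGS__); } while (0)

int main() {
    LOG(INFO, "starting up\n");             // at or below threshold: written
    LOG(DEBUG, "raw packet: %d\n", 42);     // above threshold: skipped
}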
As far as I know, a critical section has less overhead than a mutex; see the MSDN articles "Critical Section Objects" and "Using Critical Section Objects".
If you use gcc, you could use atomic accesses (gcc's atomic builtins).
Frankly, a mutex is the only way you really want to do this, and it's always going to be slow in your case because you're using so many print statements... so to solve your problem: don't use so many printf statements; that's your problem to begin with.
Okay, is your current solution using a mutex around the print itself? Perhaps you should instead have a mutex around a message queue that another thread processes and prints; that has a potential hang-up, but I think it will be faster. So: use an active logging thread that spins waiting for incoming messages to print. The networking solution could work too, but that requires more work; try this first.
What you can do is have one queue per thread, and have the logging thread routinely go through each of these and post the messages somewhere.
This is fairly easy to set up, and the amount of contention can be very low (just a pointer swap or two, which can be done without locking anything).
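A small sketch of that (names invented): the producer appends under a tiny lock, and the logging thread steals the whole batch with a swap, so the lock is held only for a pointer exchange.

#include <iostream>
#include <mutex>
#include <string>
#include <thread>
#include <vector>

struct ThreadLog {
    std::mutex m;
    std::vector<std::string> lines;

    void push(std::string s) {
        std::lock_guard<std::mutex> g(m);
        lines.push_back(std::move(s));
    }
    std::vector<std::string> drain() {     // called by the logging thread
        std::vector<std::string> batch;
        std::lock_guard<std::mutex> g(m);
        batch.swap(lines);                 // O(1): swaps internal pointers
        return batch;
    }
};

int main() {
    ThreadLog log;                         // one of these per producer thread
    std::thread producer([&] {
        for (int i = 0; i < 10; ++i) log.push("line " + std::to_string(i));
    });
    producer.join();                       // for brevity; real code polls
    for (auto& l : log.drain()) std::cout << l << "\n";
}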

Multithreaded job queue manager

I need to manage CPU-heavy multitaskable jobs in an interactive application. Just as background, my specific application is an engineering design interface. As a user tweaks different parameters and options on a model, multiple simulations are run in the background, and results are displayed as they complete, likely even while the user is still editing values. Since the simulations take variable time (some are milliseconds, some take 5 seconds, some take 10 minutes), it's basically a matter of getting feedback displayed as fast as possible, and often aborting jobs that started previously but have since been invalidated by the user's changes. Different user changes may invalidate different computations, so at any time I may have 10 different simulations running. Some simulations have multiple parts with dependencies (simulations A and B can be computed separately, but I need their results to seed simulation C, so I need to wait for both A and B to finish before starting C).
I feel pretty confident that the code-level method to handle this kind of application is some kind of multithreaded job queue. This would include features of submitting jobs for execution, setting task priorities, waiting for jobs to finish, specifying dependencies (do this job, but only after job X and job Y have finished), canceling subsets of jobs that fit some criteria, querying what jobs remain, setting worker thread counts and priorities, and so on. And multiplatform support is very useful too.
These are not new ideas or desires in software, but I'm at the early design phase of my application where I need to make a choice about what library to use for managing such tasks. I've written my own crude thread managers in the past in C (I think it's a rite of passage) but I want to use modern tools to base my work on, not my own previous hacks.
The first thought is to run to OpenMP, but I'm not sure it's what I want. OpenMP is great for parallelizing at a fine level, automatically unrolling loops and such. While multiplatform, it also invades your code with #pragmas. But mostly it's not designed for managing large tasks, especially cancelling pending jobs or specifying dependencies. Possible, yes, but not elegant.
I noticed that Google Chrome uses such a job manager for even the most trivial tasks. The design goal seems to be to keep the user interaction thread as light and nimble as possible, so anything that can get spawned off asynchronously, should be. From looking at the Chrome source this doesn't seem to be a generic library, but it still is interesting to see how the design uses asynchronous launches to keep interaction fast. This is getting to be similar to what I'm doing.
There are a still other options:
Surge.Act: a Boost-like library for defining jobs. It builds on OpenMP, but does allow chaining of dependencies, which is nice. It doesn't seem to have a manager that can be queried, jobs that can be cancelled, etc. It's a stale project, so it's scary to depend on it.
Job Queue is quite close to what I'm thinking of, but it's a 5 year old article, not a supported library.
Boost.threads does have nice platform independent synchronization but that's not a job manager. POCO has very clean designs for task launching, but again not a full manager for chaining tasks. (Maybe I'm underestimating POCO though).
So while there are options available, I'm not satisfied, and I feel the urge to roll my own library again. But I'd rather use something that already exists. Even after searching (here on SO and on the net) I haven't found anything that feels right, though I imagine this must be a kind of tool that is often needed, so surely there's some community library or at least a common design.
On SO there's been some posts about job queues, but nothing that seems to fit.
My post here is to ask you all what existing tools I've missed, and/or how you've rolled your own such multithreaded job queue.
We had to build our own job queue system to meet requirements similar to yours (the UI thread must always respond within 33 ms; jobs can run from 15 to 15000 ms), because there really was nothing out there that quite met our needs, let alone anything performant.
Unfortunately our code is about as proprietary as proprietary gets, but I can give you some of the most salient features:
We start up one thread per core at the beginning of the program. Each pulls work from a global job queue. Jobs consist of a function object and a glob of associated data (really an elaboration on a func_ptr and void *). Thread 0, the fast client loop, isn't allowed to work on jobs, but the rest grab as they can.
The job queue itself ought to be a lockless data structure, such as a lock-free singly linked list (Visual Studio comes with one). Avoid using a mutex; contention for the queue is surprisingly high, and grabbing mutexes is costly.
Pack all the necessary data for the job into the job object itself -- avoid having pointers from the job back into the main heap, where you'll have to deal with contention between jobs and locks and all that other slow, annoying stuff. For example, all the simulation parameters should go into the job's local data blob. The results structure obviously needs to be something that outlives the job: you can deal with this either by a) hanging onto the job objects even after they've finished running (so you can use their contents from the main thread), or b) allocating a results structure specially for each job and stuffing a pointer into the job's data object. Even though the results themselves won't live in the job, this effectively gives the job exclusive access to its output memory, so you needn't mess with locks.
Actually I'm simplifying a bit above, since we need to choreograph exactly which jobs run on which cores, so each core gets its own job queue, but that's probably unnecessary for you.
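To make the shape concrete, here is a rough sketch of such a pool in modern C++ (names invented, and a mutex-guarded queue for brevity where the answer above rightly prefers a lock-free one):

#include <condition_variable>
#include <functional>
#include <iostream>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

// One worker per core pulling jobs off a shared queue. A job here is a
// std::function, i.e. "function object + associated data" in one package.
class WorkerPool {
    std::mutex m;
    std::condition_variable cv;
    std::queue<std::function<void()>> jobs;
    std::vector<std::thread> workers;
    bool done = false;
public:
    explicit WorkerPool(unsigned n = std::thread::hardware_concurrency()) {
        if (n == 0) n = 2;                 // hardware_concurrency may be 0
        for (unsigned i = 0; i < n; ++i)
            workers.emplace_back([this] {
                for (;;) {
                    std::function<void()> job;
                    {
                        std::unique_lock<std::mutex> lk(m);
                        cv.wait(lk, [this] { return done || !jobs.empty(); });
                        if (done && jobs.empty()) return;
                        job = std::move(jobs.front());
                        jobs.pop();
                    }
                    job();                 // run outside the lock
                }
            });
    }
    void submit(std::function<void()> job) {
        { std::lock_guard<std::mutex> g(m); jobs.push(std::move(job)); }
        cv.notify_one();
    }
    ~WorkerPool() {                        // drain the queue, then join
        { std::lock_guard<std::mutex> g(m); done = true; }
        cv.notify_all();
        for (auto& w : workers) w.join();
    }
};

int main() {
    WorkerPool pool;
    for (int i = 0; i < 8; ++i)
        pool.submit([i] { std::cout << "job " + std::to_string(i) + "\n"; });
}   // the pool destructor waits for everything to finish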
I rolled my own, based on Boost.threads. I was quite surprised by how much bang I got from writing so little code. If you don't find something pre-made, don't be afraid to roll your own. Between Boost.threads and your experience since writing your own, it might be easier than you remember.
For premade options, don't forget that Chromium's license is very friendly, so you may be able to roll your own generic library around its code.
Microsoft is working on a set of technologies for the next version of Visual Studio (2010) called the Concurrency Runtime, the Parallel Pattern Library, and the Asynchronous Agents Library, which will probably help. The Concurrency Runtime will offer policy-based scheduling, i.e. allowing you to manage and compose multiple scheduler instances (similar to thread pools, but with affinitization and load balancing between instances). The Parallel Pattern Library will offer task-based programming and parallel loops with an STL-like programming model. The Agents library offers an actor-based programming model and has support for building concurrent data-flow pipelines, i.e. managing the dependencies described above. Unfortunately this isn't released yet, so you can read about it on our team blog or watch some of the videos on Channel 9; there is also a very large CTP available for download.
If you're looking for a solution today, Intel's Threading Building Blocks and Boost's threading library are both good libraries and available now. JustSoftwareSolutions has released an implementation of std::thread which matches the C++0x draft, and of course OpenMP is widely available if you're looking at fine-grained loop-based parallelism.
The real challenge, as other folks have alluded to, is to correctly identify and decompose the work into tasks suitable for concurrent execution (i.e. no unprotected shared state), to understand the dependencies between them, and to minimize the contention that can occur on bottlenecks (whether the bottleneck is protecting shared state or ensuring the dispatch loop of a work queue is low-contention or lock-free)... and to do this without scheduling implementation details leaking into the rest of your code.
-Rick
Would something like threadpool be useful to you? It's based on boost::threads and basically implements a simple thread task queue that passes worker functions off to the pooled threads.
I've been looking for nearly the same thing. I'm working on a game with 4X-ish mechanics, and scheduling the different parts of what gets done almost exploded my brain. I have a complex set of work that needs to be accomplished at different time resolutions, and to a different degree of actual simulation depending on what system/region the player has actively loaded. This means that as the player moves from system to system, I need to load a system into the current high-resolution simulation, offload the last system to a lower-resolution simulation, and do the same for active/inactive regions of systems. The different simulations are big lists of population, political, military, and economic actions based on profiles of each entity. I'm going to try to describe my issue and my approach so far, and I hope it's useful in describing an alternative for you or someone else. The rough outline of the structure I'm building uses the following:
cpp-taskflow (A Modern C++ Parallel Task Programming Library): I'm going to make a library of modules that will be used as job-construction parts. Each entry will have an API for initialization and destruction, as well as pointers for communication. I'm hoping to write it in a way that the modules will be nestable, using the cpp-taskflow API to set up all the dependencies at job-creation time, while providing a means of live adjustment and having a kill-switch available. Most of what I'm making will be decision trees of state machines, or state machines of behavior trees, so the job data structure will be settings and states of time-resolution-tagged data pointing to actual stats and object values.
FlatBuffers: I'm looking to use this library to build a "job list entry" as well as an "object wrapper" system. Each entry in the job queues will be a flatbuffer object describing the work to be done (settings for the module), as well as containing the data (or shared pointers to the data) for the work that needs doing. The object-storage flatbuffers will contain the data that represents entity tables. For me, most of the actual data will be arrays that need deciding/working on. I'm also looking to use flatbuffers as a communication/control channel between threads. I'm torn between making a master "router" thread that all the others communicate through, or having each thread contain its own, with some mechanism of discovery.
SQLite: Since only the active regions/systems need higher-resolution work done, some of the background job lists the game will create (for thousands of systems and their entities) will be pretty large and long-lived. Hundreds of thousands to millions of jobs (big in my mind), each requiring an unknown amount of time to complete. In my case, I don't care when they get done, as long as they all do (long campaigns). I plan on each thread getting a table of an in-memory sqlite db as a job queue. Each entry will contain a blob of flatbuffer work, a pointer to a buffer to notify upon completion, a pointer to a control buffer for updates, and other fields decorating the job item (location, data ranges, priority) that will get filled as the job entry makes new jobs, and as the items are consumed into the database. This gives me a way to create relational ties between jobs and simply construct queries if I need to re-work/update jobs, remove them and their dependencies, or update/re-order priorities or dependencies. All of this being in an sqlite db also means that at any time I can dump the whole thing to disk and reload it later, or switch to attaching to it and processing it from disk. Additionally, it gives me access to a lot of search and ordering algorithmic work I'd normally need a bunch of different types of containers for. Being able to use SQL queries gives me a lot of options for processing the jobs.
The communication queue (as a db) is where I'm torn: should access go via the corresponding thread (each thread contains its own messaging db, and the module API has locks/mutexes abstracted for access), or should all updates, adds/removes, and communication go via some master router thread into one large db? I have no idea which will give me the fewest headaches as far as mutexes and locks go. I got a few days into making a monster spaghetti beast of shared pointers to stream-buffer pools and lookup tables, so that each thread had its own in-buffer and separate out-buffers. That's when I decided to just offload the giant list-keeping to sqlite. Then I thought: why not just feed the flatbuffer objects of everything else into tables too?
Having almost everything in a db means that, from each module, I can write SQL statements that represent the view of the data I need to work on, as well as pivot on the fly as to how the data is worked on. Having the jobs themselves in a db means I can do the same for them as well. SQLite supports multi-threaded access, so using it as a multithreaded job queue manager shouldn't be too much of a stretch.
In summary: Cpp-Taskflow will allow you to set up complicated nested loops with dependency chaining and job-pool multithreading; out of the box it comes with most of the structure you need. FlatBuffers will allow you to create job declarations and object wrappers that are easy to feed into stream-buffers as one unit of work and to pass between job threads. And SQLite will allow you to tag and queue the stream-buffer jobs as blob entries in a way that should allow adding, searching, ordering, updating, and removal with minimal work on your end. It also makes saving and reloading a breeze, and snapshots and roll-backs should be doable too; you just have to keep your mind wrapped around the order and resolution of events in the db.
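The dependency chaining mentioned above looks roughly like this in cpp-taskflow; this mirrors the project's canonical example, so treat the exact API as whatever the current docs say:

#include <taskflow/taskflow.hpp>

int main() {
    tf::Executor executor;
    tf::Taskflow taskflow;

    auto [A, B, C] = taskflow.emplace(
        [] { /* simulation A */ },
        [] { /* simulation B */ },
        [] { /* simulation C: seeded by A and B */ }
    );
    A.precede(C);                  // C runs only after A...
    B.precede(C);                  // ...and after B
    executor.run(taskflow).wait(); // workers run A and B in parallel
}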
Edit: Take this with a grain of salt though; I found your question because I'm trying to accomplish what Crashworks described. I'm thinking of using affinity to open long-living threads, and having the master thread run the majority of the Cpp-Taskflow hierarchy work, feeding jobs to the others. I've yet to use the sqlite method of job-queue/control communication; that's just my plan so far.
I hope someone finds this helpful.
You might want to look at Flow-Based Programming - it is based on data chunks streaming between asynchronous components. There are Java and C# versions of the driver, plus a number of precoded components. It is intrinsically multithreaded - in fact the only single-threaded code is within the components, although you can add timing constraints to the standard scheduling rules. Although it may be at too fine-grained a level for what you need, there may be stuff here you can use.
Take a look at boost::future (but see also this discussion and proposal), which looks like a really nice foundation for parallelism (in particular it seems to offer excellent support for C-depends-on-A-and-B type situations).
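(boost::future was later standardized as std::future/std::async in C++11, so the C-depends-on-A-and-B case can be sketched like this; simA/simB/simC are placeholders:)

#include <future>
#include <iostream>

int simA() { return 2; }
int simB() { return 3; }
int simC(int a, int b) { return a + b; }

int main() {
    // A and B run concurrently; C starts once both results are ready.
    auto fa = std::async(std::launch::async, simA);
    auto fb = std::async(std::launch::async, simB);
    std::cout << simC(fa.get(), fb.get()) << "\n";
}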
I looked at OpenMP a bit but (like you) wasn't convinced it would work well for anything but Fortran/C numeric code. Intel's Threading Building Blocks looked more interesting to me.
If it comes to it, it's not too hard to roll your own on top of boost::thread.
[Explanation: a thread farm (most people would call it a pool) draws work from a thread-safe queue of functors (tasks or jobs). See the tests and benchmark for examples of use. I have some extra complication to (optionally) support tasks with priorities, and the case where executing tasks can spawn more tasks into the work queue (this makes knowing when all the work is actually completed a bit more problematic; the references to "pending" are the ones which can deal with the case). Might give you some ideas anyway.]
You may like to look at Intel Threading Building Blocks. I believe it does what you want, and with version 2 it's open source.
There are plenty of distributed resource managers out there. The software that meets nearly all of your requirements is Sun Grid Engine. SGE is used on some of the world's largest supercomputers and is in active development.
There are also similar solutions in Torque, Platform LSF, and Condor.
It sounds like you may want to roll your own, but there's plenty of functionality in all of the above.
I don't know if you're looking for a C++ library (which I think you are), but Doug Lea's Fork/Join framework for Java 7 is pretty nifty, and does exactly what you want. You'd probably be able to implement it in C++ or find a pre-implemented library.
More info here:
http://artisans-serverintellect-com.si-eioswww6.com/default.asp?W1
A little late to the punch perhaps, but take a look also at ThreadWeaver:
http://en.wikipedia.org/wiki/ThreadWeaver