How to organize parallel access to data (e.g. ETS table) from bunch of processes in erlang or elixir?
In traditional model I would create RWLock and make critical section as small as possible. So, I can access to hash-table with parallel reads at least.
In erlang first idea is implement gen_server that store table in state. But all access will be serialized. How to handle it to serve faster?
Use direct access to :ets and specify read_concurrency: true in call to :ets.new/2.
GenServer is a redundant link here, that might become a bottleneck.
Related
I'm doing thousands and thousands of inserts to a PostgreSQL database with Python and Django (using the CLI, so no web server at all).
The objects that are inserted are already in memory, and I'm poping them one by one from a FIFO queue (using Python's native https://docs.python.org/2/library/queue.html)
What I'm doing basically is:
args1, args2 = queue.get()
m1, _ = Model1.objects.get_or_create(args1)
Model2.objects.create(m1, args2)
I was thinking a way to do this faster was too spawn a few more threads that can do this in parallel. To my surprise the performance is actually slightly decreased... I was expecting almost linear improvement in relation to the number of threads.. not sure what's going on..
Is there something database specific I'm missing, are there table locks that are blocking the threads when this is running?
Or does it have something to do with that each thread can only access a single database connection atomically during runtime?
I have standard configuration for PostgreSQL (9.3) and Django (1.7.7) installed with apt-get on Debian Jessie.
Also I tried with 4 threads, which is the same number of CPUs I have available on my box.
There are a few things going on here.
Firstly you are using very high level ORM methods (get_or_create, create). Those are generally not a good fit for bulk operations since methods like that tend to have a lot of overhead to provide a nice API and also do additional work to prevent users from shooting themselves in the foot too easily.
Secondly your careful use of a queue is very counterproductive in multiple ways:
Due to django running in autocommit mode by default each database operation is carried out in its own transaction. Since that is a relatively expensive operation this also causes unnecessary overhead.
Inserting each object by itself also causes a lot more back and forth communication between the database and django, which again produces overhead, slowing things down.
Thirdly the reason using multiple threads is even slower stems from the fact that python has a GIL (Global Interpreter Lock). This prevents multiple threads from executing Python code at the same time. There is a lot of material on the web about the whys and hows of the GIL and what can be done in which circumstances to mitigate it. There is a nice summary by Dave Beazly about the GIL that should get you started if you're interested in learning more about it.
Additionally I'd generally recommend against doing large inserts from multiple threads in any language since - depending on your database and data model - this can also cause slowdowns inside the database due to possibly required locking.
Now there are many solutions to your problem but I'd recommend to start with a simple one:
Django actually provides a handy low-level interface to create models in bulk, fittingly enough called bulk_create(). I'd suggest removing all that fancy queue and thread code and using this interface as directly as possible with the data you already have.
In case this isn't sufficient for your case a possible alternative would be to generate an INSERT INTO statement from the data and executing that directly on the database.
If all you want to achieve is simply insertion, could you instead just use the save() method instead of get_or_create(). get_or_create() queries the database first. If the table is large, the call to get_or_create() can be a bottleneck. And that's probably why having multiple parallel threads do not help.
The other possibility is with the insertion itself. Postgres by default enables auto-commit on a per insert (transaction) basis. The committing process involves complex mechanisms under the hood. Long story short, you may try disabling auto-commit and see if that would help in your particular case. A relevant article is here.
Firstly, forgive me if this is a stupid question.
I would like to create a generic synchronised list (like in Java) for reuse in my Go projects.
I found the source of Go's linked list and I was wondering would it be sufficient to simply add mutex locks to the list manipulation functions?
If you're going to make a concurrent-safe container, you need to protect all access to the data, not just writes. Inspecting an element, or even calling Len() without synchronizing the read could return invalid or corrupt data.
It's probably easier to protect the entire data structure with a mutex, rather than implement your own concurrent linked list.
I'm developing a multi-threaded plugin for a single-threaded application (which has a non-thread-safe API).
My current plugin has two threads: the main one which is application's thread and another one which is used for processing data of the main thread. Long story short, the first one creates objects, gives them an ID, inserts them into a map and sometimes even access and delete them (if application says so); the second one is reading data from that map and is altering objects.
My question is: What tehniques can I use in order to make my plugin thread-safe?
First, you have to identify where race conditions may exist. Then, you will have to use some mechanism to assure that the shared data is accessed in a safe way, hence achieving Thread Safety.
For your particular case, it seems the race condition will be on the shared map and possibly the objects (map's values) it contains as well (if it's possible that both threads attempt to alter the same object simultaneously).
My suggestion is that you use a well tested thread safe map implementation, and then if needed add the extra "protection" for the map's values themselves. This way you ensure the map is always in a consistent state for both threads, and if both threads attempt to modify the same object data (map's values), the data won't be corrupted or left inconsistent.
For the map itself, you can search for "Concurrent Hash Map" or "Atomic Hash Map" data structures for C++ and see if they are of good quality and are available for your compiler/platform. Good examples are Intel's TBB concurrent_hash_map or Facebook's folly AtomicHashMap. They both have advantages and disadvantages and you will have to analyze what's best for your situation.
As for the objects the map contains, you can use plain mutexes (simple, lock, modify data, unlock), atomic operations (trickier, only for simple datatypes) or other method, once more depending on your compiler/platform and speed requirements.
Hope this helps!
what is the need for a read shared lock?
I can understand that write locks have to be exclusive only. But what is the need for many clients to access the document simultaneously and still share only read privilege? Practical applications of Shared read locks would be of great help too.
Please move the question to any other forum you'd find it appropriate to be in.
Though this is a question purely related to ABAP programming and theory I'm doing, I'm guessing the applications are generic to all languages.
Thanks!
If you do complex and time-consuming calculations based on multiple datasets (e. g. postings), you have to ensure that none of these datasets is changed while you're working - otherwise the calculations might be wrong. Most of the time, the ACID principles will ensure this, but sometimes, that's not enough - for example if the datasource is so large that you have to break it up into parallel subtasks or if you have to call some function that performs a database commit or rollback internally. In this case, the transaction isolation is no longer enough, and you need to lock the entity on a logical level.
Is it ok to check the current thread inside a function?
For example if some non-thread safe data structure is only altered by one thread, and there is a function which is called by multiple threads, it would be useful to have separate code paths depending on the current thread. If the current thread is the one that alters the data structure, it is ok to alter the data structure directly in the function. However, if the current thread is some other thread, the actual altering would have to be delayed, so that it is performed when it is safe to perform the operation.
Or, would it be better to use some boolean which is given as a parameter to the function to separate the different code paths?
Or do something totally different?
What do you think?
You are not making all too much sense. You said a non-thread safe data structure is only ever altered by one thread, but in the next sentence you talk about delaying any changes made to that data structure by other threads. Make up your mind.
In general, I'd suggest wrapping the access to the data structure up with a critical section, or mutex.
It's possible to use such animals as reader/writer locks to differentiate between readers and writers of datastructures but the performance advantage for typical cases usually wont merit the additional complexity associated with their use.
From the way your question is stated, I'm guessing you're fairly new to multithreaded development. I highly suggest sticking with the simplist and most commonly used approaches for ensuring data integrity (most books/articles you readon the issue will mention the same uses for mutexes/critical sections). Multithreaded development is extremely easy to get wrong and can be difficult to debug. Also, what seems like the "optimal" solution very often doesn't buy you the huge performance benefit you might think. It's usually best to implement the simplist approach that will work then worry about optimizing it after the fact.
There is a trick that could work in case, as you said, the other threads will only make changes only once in a while, although it is still rather hackish:
make sure your "master" thread can't be interrupted by the other ones (higher priority, non fair scheduling)
check your thread
if "master", just change
if other, put off scheduling, if needed by putting off interrupts, make change, reinstall scheduling
really test to see whether there are no issues in your setup.
As you can see, if requirements change a little bit, this could turn out worse than using normal locks.
As mentioned, the simplest solution when two threads need access to the same data is to use some synchronization mechanism (i.e. critical section or mutex).
If you already have synchronization in your design try to reuse it (if possible) instead of adding more. For example, if the main thread receives its work from a synchronized queue you might be able to have thread 2 queue the data structure update. The main thread will pick up the request and can update it without additional synchronization.
The queuing concept can be hidden from the rest of the design through the Active Object pattern. The activ object may also be able to publish the data structure changes through the Observer pattern to other interested threads.