django/celery data import fails to check for existing data

I am running Celery with Django. I import a stream of objects into my database using tasks; each task imports one object, and worker concurrency is 2. Within the stream, objects can be duplicated, but they should not be duplicated inside my database.
Code I'm running:

    if qs.exists() and qs.count() == 1:
        return qs.get()
    elif qs.exists():
        logger.exception('Multiple venues for same place')
        raise ValueError('Multiple venues for same place')
    else:
        obj = self.create(**defaults)
The problem is that if objects inside the stream are duplicates and arrive very close to each other, the app still imports the same object twice.
I assume that the DB checks are not working properly with this concurrency setup. What architecture do you recommend to resolve this issue?

You have to use a locking architecture that blocks the two tasks from executing the object-fetching part at the same time; you can use python-redis-lock to do that.
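A minimal sketch of what that could look like (the Redis connection, the lock name, and the Venue model and fields here are illustrative assumptions, not details from the question):

    import redis
    import redis_lock

    conn = redis.StrictRedis()

    def get_or_create_venue(place_id, defaults):
        # Only one task at a time may run the check-then-create sequence
        # for a given place; a concurrent task blocks until the lock is freed.
        with redis_lock.Lock(conn, 'venue-import-%s' % place_id):
            qs = Venue.objects.filter(place_id=place_id)  # Venue is hypothetical
            if qs.exists():
                return qs.get()
            return Venue.objects.create(**defaults)

Because the lock key includes the place identifier, unrelated imports still run in parallel; only duplicates of the same place are serialized.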

Related

Django Celery shared_task singleton pattern

I have a Django based site that has several background processes that are executed in Celery workers. I have one particular task that can run for a few seconds with several read/writes to the database that are subject to a race condition if a second task tries to access the same rows.
I'm trying to prevent this by ensuring the task is only ever running on a single worker at a time but I'm running into issues getting this to work correctly. I've used this Celery Task Cookbook Recipe as inspiration, trying to make my own version that works for my scenario of ensuring that this specific task is only running on one worker at a time, but it still seems to be possible to encounter situations where it's executed across more than one worker.
So far, in tasks.py I have:
    class LockedTaskInProgress(Exception):
        """The locked task is already in progress"""
        silent_variable_failure = True


    @shared_task(autoretry_for=[LockedTaskInProgress], default_retry_delay=30)
    def create_or_update_payments(things=None):
        """
        This takes a list of `things` that we want to process payments on. It will
        read the thing's status, then apply calculations to make one or more payments
        for various users that are owed money for the thing.

        `things` - The list of things we need to process payments on.
        """
        lock = cache.get('create_or_update_payments')  # Using Redis as our cache backend

        if not lock:
            logger.debug('Starting create/update payments processing. Locking task.')
            cache.set('create_or_update_payments', 'LOCKED')
            real_create_or_update_payments(things)  # Long-running function w/ lots of DB reads/writes
            cache.delete('create_or_update_payments')
            logger.debug('Completed create/update payments processing. Lock removed.')
        else:
            logger.debug('Unable to process create/update payments at this time. Lock detected.')
            raise LockedTaskInProgress
The above almost works, but there still seems to be a possible race condition between the cache.get and the cache.set, which has shown up in my testing.
I'd love to get suggestions on how to improve this to make it more robust.
I think I've found a way of doing this, inspired by an older version of the Celery Task Cookbook Recipe I was using earlier.
Here's my implementation:
    class LockedTaskInProgress(Exception):
        """The locked task is already in progress"""
        silent_variable_failure = True


    @shared_task(autoretry_for=[LockedTaskInProgress], default_retry_delay=30)
    def create_or_update_payments(things=None):
        """
        This takes a list of `things` that we want to process payments on. It will
        read the thing's status, then apply calculations to make one or more payments
        for various users that are owed money for the thing.

        `things` - The list of things we need to process payments on.
        """
        LOCK_EXPIRE = 60 * 5  # 5 mins

        lock_id = 'create_or_update_payments'

        # cache.add stores the key only if it is not already set, so the
        # test and the set happen as a single atomic operation.
        acquire_lock = lambda: cache.add(lock_id, 'LOCKED', LOCK_EXPIRE)
        release_lock = lambda: cache.delete(lock_id)

        if acquire_lock():
            try:
                logger.debug('Starting create/update payments processing. Locking task.')
                real_create_or_update_payments(things)  # Long-running function w/ lots of DB reads/writes
            finally:
                release_lock()
                logger.debug('Completed create/update payments processing. Lock removed.')
        else:
            logger.debug('Unable to process create/update payments at this time. Lock detected.')
            raise LockedTaskInProgress
It's very possible that there's a better way of doing this, but this seems to work in my tests.
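The key difference from the first version is that cache.add stores the key only if it is absent and reports whether it did, so the test and the set collapse into a single operation on the cache backend. A minimal illustration (assuming a backend such as Redis or Memcached that implements add atomically):

    from django.core.cache import cache

    # add() returns True only when the key was absent, so only one
    # caller can ever win the race for the same lock id.
    first = cache.add('demo-lock', 'LOCKED', 30)   # True: lock acquired
    second = cache.add('demo-lock', 'LOCKED', 30)  # False: already held
    cache.delete('demo-lock')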

returned array referencing the same memory spot but returning different values

I'm setting up a basic dynamic web page to display EC2 instance data, and I need to check and pass arrays of that data to be displayed with D3. I'm using multiprocessing to run the collection in the background.
Running Python 3.7 and the newest version of Flask.
app.py Code

    @app.route('/experiment')
    def experiment():
        type = request.args.get('type')
        resource = request.args.get('resource')
        action = request.args.get('action')
        if 'test' not in session:
            thread = multiprocessing.Process(target=exp.transmitTest)
            session['test'] = 'started'
            thread.start()
        print(f"Looking for Data at {hex(id(exp.getData()))} found {exp.getData()}")
        return render_template('experiment.html', data=exp.getData(), type=type, resource=resource, action=action)
Backend Code

    def transmitTest(self):
        for i in range(5):
            self.data.append(random.randint(0, 100))
            time.sleep(4)
            print(f"Data: {self.data} at {hex(id(self.data))}")

    def getData(self):
        return self.data
My JS scheduler runs '/experiment' every 5 seconds. The print statements show that the array I'm writing to and the one returned by the getter are at the same memory address, but one is empty and the other has the data. Can anyone help me understand this?
So I figured it out. When you call an object's method in a separate process, Python gives that process its own copy of the object, and the two copies diverge even though id() can report the same value, because id() is only a per-process memory address. I needed to add a backend queue through Redis Queue (https://blog.miguelgrinberg.com/post/the-flask-mega-tutorial-part-xxii-background-jobs) so that I could make asynchronous calls to my backend without disrupting Flask's routing.
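A standalone sketch of the underlying behaviour (not the app's code): the child process mutates its own copy of the list, and the parent's copy never changes, even though id() can print the same per-process address in both:

    import multiprocessing

    data = []

    def worker():
        data.append(42)
        print('child :', hex(id(data)), data)   # child's own copy: [42]

    if __name__ == '__main__':
        p = multiprocessing.Process(target=worker)
        p.start()
        p.join()
        print('parent:', hex(id(data)), data)   # parent copy unchanged: []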

Celery: dynamic queues at object level

I'm writing a Django app for making polls, which uses Celery to keep the voting system under control. Right now I have two queues, default and polls; the first has concurrency set to 8 and the second to 1.
$ celery multi start -A myproject.celery default polls -Q:default default -Q:polls polls -c:default 8 -c:polls 1
Celery routes:
    CELERY_ROUTES = {
        'polls.tasks.option_add_vote': {
            'queue': 'polls',
        },
        'polls.tasks.option_subtract_vote': {
            'queue': 'polls',
        },
    }
Task:
    @app.task(bind=True)  # bind=True is needed so that self.retry below works
    def option_add_vote(self, pk):
        """
        Updates given option id and its poll, increasing vote number by 1.
        """
        option = Option.objects.get(pk=pk)
        try:
            with transaction.atomic():
                option.vote_quantity += 1
                option.save()

                option.poll.total_votes += 1
                option.poll.save()
        except IntegrityError as exc:
            raise self.retry(exc=exc)
The option_add_vote task updates the given option's vote count, adding 1 to the previous value. So, to avoid concurrency problems, I set the polls queue concurrency to 1. This allows the system to handle thousands of vote requests successfully.
The problem will be, as I can imagine, a bottleneck when the system grows.
So, I was thinking about some kind of dynamic queues, where all vote requests for the options of a certain poll are routed to their own queue. I think this would make the system more reliable and fast.
What do you think? How can I make it?
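For reference, per-poll routing can be expressed with apply_async's queue argument (a sketch; the queue naming scheme is hypothetical, and a worker has to be consuming from each such queue for the tasks to run):

    # send each vote to a queue dedicated to its poll
    option = Option.objects.get(pk=pk)
    option_add_vote.apply_async(args=[option.pk], queue='polls-%s' % option.poll_id)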
EDIT1:
I got a new idea thanks to Paul and Plahcinski. I'm storing the votes as objects in their own model (a user-option relationship). When someone votes for an option, it creates an object from this model, allowing me to count how many votes an option has. This frees the system from the voting-concurrency problem, so it can be executed in parallel.
I'm thinking about using CELERYBEAT_SCHEDULE to cron a task that updates poll options based on a count over the Vote model (e.g. Vote.objects.filter(option=option).count()). Maybe I could execute it every hour, or do partial updates for those options that are getting new votes...
But how do I give the clients updated options in real time?
As Plahcinski says, I can keep a cached value for my options in Redis (or any other memcached system?) and use it to temporarily store these values, giving any new request the cached value.
How can I mix this with my standard values in Django models? Could anyone give me some code references or hints?
Am I on the right track, or did I make mistakes?
What I would do is remove the incrementing from the database and move it to Redis, using the database model as your cached value. Have a Celery beat task that writes recently incremented Redis keys back to your database:
http://redis.io/commands/INCR
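A minimal sketch of that idea (the key naming, the Celery app import, and the flush task are assumptions, not code from the question):

    import redis
    from myproject.celery import app   # assumed Celery app module
    from polls.models import Option    # the option model from the question

    r = redis.StrictRedis()

    def add_vote(option_pk):
        # INCR is atomic in Redis, so concurrent votes never clash.
        return r.incr('poll:option:%s:votes' % option_pk)

    @app.task
    def flush_votes_to_db(option_pk):
        # Run periodically via celery beat: copy the counter into the model.
        count = int(r.get('poll:option:%s:votes' % option_pk) or 0)
        Option.objects.filter(pk=option_pk).update(vote_quantity=count)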
What about just having a simple model that stores -1/+1 vote integers, and then a Celery task that reconciles those with the FK object in an atomic transaction and updates it?
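A rough sketch of that second suggestion (the model, field names, and app import are hypothetical):

    from django.db import models, transaction
    from django.db.models import Sum

    from myproject.celery import app  # assumed Celery app module
    from polls.models import Option   # the option model from the question

    class VoteDelta(models.Model):
        option = models.ForeignKey('polls.Option', on_delete=models.CASCADE)
        delta = models.SmallIntegerField()  # +1 for a vote, -1 for a retraction

    @app.task
    def reconcile_votes(option_pk):
        # Aggregate the deltas and write the result back in one atomic step.
        with transaction.atomic():
            option = Option.objects.select_for_update().get(pk=option_pk)
            total = VoteDelta.objects.filter(option=option).aggregate(
                s=Sum('delta'))['s'] or 0
            option.vote_quantity = total
            option.save()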

how to serialize binary files to use with a celery task

I recently integrated Celery (django-celery, to be more specific) into one of my applications. I have a model in the application as follows.
    class UserUploadedFile(models.Model):
        original_file = models.FileField(upload_to='/uploads/')
        txt = models.FileField(upload_to='/uploads/')
        pdf = models.FileField(upload_to='/uploads/')
        doc = models.FileField(upload_to='/uploads/')

        def convert_to_others(self):
            # Code to convert the original file to other formats
            ...
Now, once a user uploads a file, I want to convert the original file to txt, pdf and doc formats. Calling the convert_to_others method is a bit of an expensive process, so I plan to do it asynchronously using Celery. So I wrote a simple Celery task as follows.
    @celery.task(default_retry_delay=bdev.settings.TASK_RETRY_DELAY)
    def convert_ufile(file, request):
        """
        This task method would call a UserUploadedFile object's convert_to_others
        method to do the file conversions.

        The best way to call this task would be doing it asynchronously
        using the apply_async method.
        """
        try:
            file.convert_to_others()
        except Exception as err:
            # If the task fails, log the exception and retry in 30 secs
            log.LoggingMiddleware.log_exception(request, err)
            convert_ufile.retry(exc=err)
        return True
and then called the task as follows:

    ufile = get_object_or_404(models.UserUploadedFile, pk=id)
    tasks.convert_ufile.apply_async(args=[ufile, request])
Now when the apply_async method is called, it raises the following exception:

    PicklingError: Can't pickle <type 'cStringIO.StringO'>: attribute lookup cStringIO.StringO failed

I think this is because Celery (by default) uses the pickle library to serialize data, and pickle is not able to serialize the binary file.
Question
Are there any other serializers that can serialize a binary file on its own? If not how can i serialize a binary file using the default pickle serializer ?
You are correct that Celery tries to pickle data for which pickling is unsupported. Even if you found a way to serialize the data you want to send to the Celery task, I wouldn't do this.
It is always a good idea to send as little data as possible to Celery tasks, so in your case I would pass only the id of the UserUploadedFile instance. Having this, you can fetch the object by id inside the task and perform convert_to_others().
Please also note that the object could change its state (or it could even be deleted) before the task is executed. So it is much safer to fetch the object in your Celery task instead of sending a full copy.
To sum up, sending only an instance id and refetching it in the task gives you a few things:
You send less data to your queue.
You do not have to deal with data inconsistency issues.
It's actually possible in your case. :)
The only 'drawback' is that you need to perform an extra, inexpensive SELECT query to refetch your data, which overall looks like a good deal when compared to the above issues, doesn't it?
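A minimal sketch of the id-only variant, keeping the question's names (the request argument is dropped here as well, since a request object is generally not picklable either):

    @celery.task(default_retry_delay=bdev.settings.TASK_RETRY_DELAY)
    def convert_ufile(file_id):
        """Refetch the object by id inside the task, then convert it."""
        ufile = models.UserUploadedFile.objects.get(pk=file_id)
        ufile.convert_to_others()
        return True

    # caller: send only the primary key, which serializes trivially
    ufile = get_object_or_404(models.UserUploadedFile, pk=id)
    tasks.convert_ufile.apply_async(args=[ufile.pk])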

Multiprogramming in Django, writing to the Database

Introduction
I have the following code which checks to see if a similar model exists in the database, and if it does not it creates the new model:
    class BookProfile(models.Model):
        # ...

        def save(self, *args, **kwargs):
            uniqueConstraint = {'book_instance': self.book_instance, 'collection': self.collection}
            # Test for other objects with identical values
            profiles = BookProfile.objects.filter(Q(**uniqueConstraint) & ~Q(pk=self.pk))
            # If none are found create the object, else fail.
            if len(profiles) == 0:
                super(BookProfile, self).save(*args, **kwargs)
            else:
                raise ValidationError('A Book Profile for that book instance in that collection already exists')
I first build my constraints, then search for a model with those values, which I am enforcing must be unique: Q(**uniqueConstraint). In addition, I ensure that if the save method is updating rather than inserting, we do not find the object itself when looking for other similar objects: ~Q(pk=self.pk).
I should mention that I am implementing soft delete (with a modified objects manager which only shows non-deleted objects), which is why I must check for duplicates myself rather than relying on unique_together errors.
Problem
Right, that's the introduction out of the way. My problem is that when multiple identical objects are saved in quick (or near-simultaneous) succession, sometimes both get added, even though the first one being added should prevent the second.
I have tested the code in the shell and it succeeds every time I run it. Thus my assumption is that if, say, we have two objects being added, Object A and Object B: Object A runs its check upon save() being called. Then the process saving Object B gets some time on the processor; Object B runs that same test, but Object A has not yet been added, so Object B is added to the database. Then Object A regains control of the processor and, having already run its test, adds itself regardless, even though the identical Object B is now in the database.
My Thoughts
The reason I fear multiprogramming could be involved is that Object A and Object B are each being added through an API save view, so a request to the view is made for each save; it is not a single request with multiple sequential saves.
It might be the case that Apache is creating a process for each request, and thus causing the problems I think I am seeing. As you would expect, the problem only occurs sometimes, which is characteristic of multiprogramming or multiprocessing errors.
If this is the case, is there a way to make the test and set parts of the save() method a critical section, so that a process switch cannot happen between the test and the set?
Based on what you've described, it seems reasonable to assume that multiple Apache processes are a source of problems. Are you able to replicate the issue if you limit Apache to a single worker process?
Maybe the suggestions in this thread will help: How to lock a critical section in Django?
An alternative approach could be utilizing a queue. You'd just stick your objects to be saved into the queue and have another process do the actual save. That way you could guarantee that objects are processed sequentially. This wouldn't work well if your application depends on having the object saved by the time the response is returned, unless you also had the request process wait on the result (watching a finished queue, for example); see the sketch below.
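A minimal sketch of the queue idea using Celery, in the spirit of the other answers in this document (the app import, task name, and queue name are assumptions):

    from myproject.celery import app  # assumed Celery app module

    # Route all BookProfile saves through one queue whose worker runs with
    # concurrency 1 (-c 1), so the check-then-save in BookProfile.save()
    # never interleaves with another save.
    @app.task(queue='serial_saves')
    def save_book_profile(book_instance_pk, collection_pk):
        profile = BookProfile(book_instance_id=book_instance_pk,
                              collection_id=collection_pk)
        profile.save()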
Updated
You may find this info useful. Mr. Dumpleton does a much better job of laying out the considerations than I could attempt to summarize here:
http://code.google.com/p/modwsgi/wiki/ProcessesAndThreading
http://code.google.com/p/modwsgi/wiki/ConfigurationGuidelines especially the Defining Process Groups section.
http://code.google.com/p/modwsgi/wiki/QuickConfigurationGuide Delegation to Daemon Process section
http://code.google.com/p/modwsgi/wiki/IntegrationWithDjango
Find the section of text toward the bottom of the page that begins with:
Now, traditional wisdom in respect of Django has been that it should preferably only be used on single-threaded servers. This would mean for Apache using the single-threaded 'prefork' MPM on UNIX systems and avoiding the multithreaded 'worker' MPM.
and read until the end of the page.
I have found a solution that I think might work:
    import threading

    def save(self, *args, **kwargs):
        lock = threading.Lock()
        lock.acquire()
        try:
            # Test and Set Code
            ...
        finally:
            lock.release()
It doesn't seem to break the save method like that decorator did, and thus far I have not seen the error again.
Unless anyone can say that this is not a correct solution, I think this works.
Update
The accepted answer was the inspiration for this change.
It seems I was under the impression that locks were some sort of special voodoo, exempt from normal logic. Here the lock = threading.Lock() is run each time, thus instantiating a new, unlocked lock which can always be acquired immediately.
I needed a single central lock for the purpose, but where could that go unless I had a thread running all the time holding the lock? The answer seemed to be to use file locks, explained in this answer to the StackOverflow question mentioned in the accepted answer.
The following is that solution modified to suit my situation:
The Code
The following is my modified DjangoLock. I wished to keep locks relative to the Django root; to do this I put a custom variable into the settings.py file.
    # locks.py
    import os
    import fcntl
    from django.conf import settings

    class DjangoLock:

        def __init__(self, filename):
            self.filename = os.path.join(settings.LOCK_DIR, filename)
            self.handle = open(self.filename, 'w')

        def acquire(self):
            fcntl.flock(self.handle, fcntl.LOCK_EX)

        def release(self):
            fcntl.flock(self.handle, fcntl.LOCK_UN)

        def __del__(self):
            self.handle.close()
And now the additional LOCK_DIR settings variable:
    # settings.py
    import os

    PATH = os.path.abspath(os.path.dirname(__file__))
    # ...
    LOCK_DIR = os.path.join(PATH, 'locks')
That will now put locks in a folder named locks, relative to the root of the Django project. Just make sure you give Apache write access; in my case:

    sudo chown www-data locks
And finally the usage is much the same as before:
import locks
def save(self, *args, **kwargs):
lock = locks.DjangoLock('ClassName')
lock.acquire()
try:
# Test and Set Code
finally:
lock.release()
This is now the implementation I am using, and it seems to be working really well. Thanks to all who have contributed to the process of arriving at this end.
You need to use synchronization on the save method. I haven't tried this yet, but here's a decorator that can be used to do so.
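A minimal sketch of what such a decorator might look like (an illustration, not the decorator the answer linked; note it only serializes threads within a single process, which is why the file-lock approach above was needed across Apache workers):

    import threading
    from functools import wraps

    def synchronized(func):
        """Serialize all calls to func in this process behind one shared lock."""
        lock = threading.Lock()

        @wraps(func)
        def wrapper(*args, **kwargs):
            with lock:
                return func(*args, **kwargs)

        return wrapper

    # usage:
    # class BookProfile(models.Model):
    #     @synchronized
    #     def save(self, *args, **kwargs):
    #         ...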