Offloading extensive calculations in save method of Django custom FileField - django

I'm building a gallery web app based on Django (4.1.1) and Vue. I want to upload and display videos as well, not only images. To support formats that don't work in an HTML video tag, I'm converting them to mp4 via pyffmpeg.
For that I created a custom field for my model based on FileField. In its save method I take the file's content, convert it and save the result. This is called by the serializer through a corresponding ViewSet. It works, but the video conversion takes so long that the web request from my Vue app (executed with axios) runs into a timeout.
It is clear that I need to offload the conversion somehow, return a corresponding response directly, and save the data in the database as soon as the conversion is finished.
Is this even possible? Or do I need to write a custom view apart from the ViewSet to do the calculation? Can you give me a hint on how to offload that calculation? I only have rudimentary knowledge of things like asyncio.
TL;DR: How do I run extensive calculations asynchronously on file data before saving it to a model with a FileField, and return a response before the calculation ends?
I can provide my current code if necessary.

I've now solved my problem, though I'm still interested in other/better solutions. My solution works, but it might not be the best and feels a bit hacky in some places.
TL;DR: I installed django-q as a task queue manager with a Redis backend, connected it to Django, and then called the function for transcoding the video file from my view via
taskid = async_task("apps.myapp.services.transcode_video", data)
This gives a robust system that handles the transcode tasks in parallel and without blocking the request.
I found this tutorial about Django-Q. Django-Q manages and executes tasks from Django. It runs in parallel with Django and is connected to it via its broker (a Redis database in this case).
First I installed django-q and the Redis client module via pip:
pip install django-q redis
Then I set up a Redis database (here running in a Docker container on my machine with the official redis image). How to do that depends largely on your platform.
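If you also run Redis in Docker for local development, a single command along these lines should be enough (the container name and port mapping are just examples):
docker run -d --name gallery-redis -p 6379:6379 redis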
Then I configured Django to use Django-Q by adding the following to settings.py (note that I disabled the timeout, because the transcode tasks can take rather long; I may change that in the future):
Q_CLUSTER = {
    'name': 'django_q_django',
    'workers': 8,
    'recycle': 500,
    'timeout': None,
    'compress': True,
    'save_limit': 250,
    'queue_limit': 500,
    'cpu_affinity': 1,
    'label': 'Django Q',
    'redis': {
        'host': 'redishostname',
        'port': 6379,
        'password': 'mysecureredisdbpassword',
        'db': 0,
    }
}
and then activated Django-Q by adding it to the installed apps in settings.py:
INSTALLED_APPS = [
    ...
    'django_q',
]
Then I migrated the model definitions of Django Q via:
python manage.py migrate
and started Django Q via (the Redis database should be running at this point):
python manage.py qcluster
This runs in a separate terminal from the typical
python manage.py runserver
Note: Of course these two commands are only for development. I don't yet know how to deploy Django Q in production.
Now we need a file for our functions. As in the tutorial I added the file services.py to my app. There I simply defined the function to run:
def transcode_video(data):
    # Doing my transcoding stuff here
    return {'entryid': entry.id, 'filename': target_name}
This function can then be called inside the view code via:
taskid = async_task("apps.myapp.services.transcode_video", data)
So I can provide data to the function and get a task ID as a return value. Whatever the executed function returns will appear in the result field of the created task, so you can even pass data back from there.
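Just to sketch the idea (this is not my actual code): the transcoding function could also call the ffmpeg binary via subprocess instead of pyffmpeg. Here source_path is an assumed key in data, and _create_album_entry() stands for the model-saving code shown further down:
import subprocess
from pathlib import Path

def transcode_video(data):
    source = Path(data['source_path'])        # assumed key: path the view wrote the upload to
    target_name = source.stem + '.mp4'
    target_path = source.with_name(target_name)
    # Transcode to H.264/AAC so the result plays in an HTML video tag.
    subprocess.run(
        ['ffmpeg', '-y', '-i', str(source), '-c:v', 'libx264', '-c:a', 'aac', str(target_path)],
        check=True,
    )
    # _create_album_entry() is a placeholder for the UploadedFile/AlbumEntry code shown below.
    entry = _create_album_entry(data, target_path, target_name)
    return {'entryid': entry.id, 'filename': target_name}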
I encountered a problem at that stage: the data contains a TemporaryUploadedFile object, which resulted in a pickle error. The data apparently gets pickled before it is passed to Django Q, which didn't work for that object. There might be a way to convert the file into a picklable format, but since I already need the file on the filesystem to invoke pyffmpeg on it, in the view I just write the data to a file (in chunks, to avoid loading the whole file into memory at once) with
with open(filepath, 'wb') as f:
    for chunk in self.request.data['file'].chunks():
        f.write(chunk)
Normally in the ViewSet I would call serializer.save() at the end, but for transcoding I don't do that, since the new object gets saved inside the Django Q function after the transcoding. There I create it like this (UploadedFile being from django.core.files.uploadedfile and AlbumEntry being my own model for which I want to create an instance):
with open(target_path, 'rb') as f:
    file = UploadedFile(
        file=f,
        name=target_name,
        content_type=data['file_type'] + "/" + data['target_ext'],
    )
    entry = AlbumEntry(
        file=file,
        # ... other model fields here
    )
    entry.save()
To return a defined response from the ViewSet even when the object hasn't been created yet, I had to override the create() method in addition to the perform_create() method (where I did all the handling). For this I copied the code from the parent class and changed it slightly to return a specific response depending on the return value of perform_create() (which previously didn't return anything):
def create(self, request, *args, **kwargs):
    serializer = self.get_serializer(data=request.data)
    serializer.is_valid(raise_exception=True)
    taskid = self.perform_create(serializer)
    if taskid:
        return HttpResponse(json.dumps({'taskid': taskid, 'status': 'transcoding'}), status=status.HTTP_201_CREATED)
    headers = self.get_success_headers(serializer.data)
    return Response(serializer.data, status=status.HTTP_201_CREATED, headers=headers)
So perform_create() returns a task ID for transcode jobs and None otherwise. A corresponding response is sent here.
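For orientation, a stripped-down perform_create() along these lines would match the flow described here (needs_transcoding() and temp_path_for() are only placeholders, and the keys in data are assumptions based on the snippets above):
from django_q.tasks import async_task

def perform_create(self, serializer):
    upload = self.request.data['file']
    if needs_transcoding(upload):              # placeholder: check for formats the video tag can't play
        filepath = temp_path_for(upload.name)  # placeholder: pick a temp location on disk
        with open(filepath, 'wb') as f:
            for chunk in upload.chunks():
                f.write(chunk)
        data = {
            'source_path': str(filepath),      # assumed keys, see the transcoding sketch above
            'file_type': 'video',
            'target_ext': 'mp4',
        }
        return async_task("apps.myapp.services.transcode_video", data)
    serializer.save()
    return None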
Last but not least, there was the problem of the frontend not knowing when the transcoding was done. So I built a simple view to get a task by ID:
@api_view(['GET'])
@authentication_classes([authentication.SessionAuthentication])
@permission_classes([permissions.IsAuthenticated])
def get_task(request, task_id):
    task = Task.get_task(task_id)
    if not task:
        return HttpResponse(json.dumps({
            'success': False
        }))
    return HttpResponse(json.dumps({
        'id': task.id,
        'result': task.result,
        # ...some more data to return
    }))
You can see that I return a fixed response when the task is not found. This is my workaround, since by default the Task object only gets created once the task is finished. For my purpose it is OK to just assume that it is still running. A comment in this GitHub issue of Django Q suggests that to get an up-to-date Task object you would need to write your own Task model and implement it so that it regularly checks Django Q for the task status. I didn't want to do that.
I also put the result in the response, so that my frontend can poll the task regularly (by its task ID); when the transcode is finished, the result will contain the ID of the created model object in the database. When my frontend sees this, it loads the object's content.
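As an aside, django-q also ships small helpers for this kind of lookup, so the check can be written with fetch() instead of querying the Task model directly; roughly:
from django_q.tasks import fetch

def task_status(task_id):
    task = fetch(task_id)              # returns the Task, or None while it is still running
    if task is None:
        return {'success': False}      # same assumption as above: "not found" means "still running"
    return {'id': task.id, 'result': task.result, 'success': task.success}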

Related

How to run two requests parallel in django rest

I have two requests that are called from a React front end. One request runs in a loop and returns an image per request; the other registers a user. Both work fine on their own, but while the image request is running in a loop and I register a user from another tab, that request's status shows pending. If I stop the image requests, the user gets registered. How can I run them in parallel at the same time?
urls.py
url(r'^create_user/$', views.CreateUser.as_view(), name='CreateAbout'),
url(r'^process_image/$', views.AcceptImages.as_view(), name='AcceptImage'),
Views.py
class CreateUser(APIView):
    def get(self, request):
        return Response([UserSerializer(dat).data for dat in User.objects.all()])

    def post(self, request):
        payload = request.data
        serializer = UserSerializer(data=payload)
        if serializer.is_valid():
            instance = serializer.save()
            instance.set_password(instance.password)
            instance.save()
            return Response(serializer.data, status=status.HTTP_201_CREATED)
        return Response(serializer.errors, status=status.HTTP_400_BAD_REQUEST)

class AcceptImages(APIView):
    def post(self, request):
        global video
        stat, img = video.read()
        frame = img
        retval, buffer_img = cv2.imencode('.jpg', frame)
        resdata = base64.b64encode(buffer_img)
        return Response(resdata)
I call these endpoints from React. The second endpoint is called in a loop, and when I register a user from another tab at the same time, that request shows as pending; if I stop the image endpoint, the user gets registered. How can I make these two requests run in parallel?
I have researched a lot but can't find an appropriate solution. One suggested solution is Celery, but I don't know whether it solves my problem, and if it does, how I can implement the above scenario.
You should first determine whether the bottleneck is the frontend or the backend.
Frontend: Chrome can make at most 6 concurrent requests to the same domain (up to HTTP/1.1).
Backend: If you use python manage.py runserver, consider using gunicorn or uWSGI instead. As the Django documentation says, runserver should not be used in production. Set the process and thread counts in gunicorn or uWSGI to 2 or higher and try again.
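For example, a gunicorn invocation along these lines (the module path is a placeholder) gives the backend several workers so that a slow request no longer blocks the rest:
gunicorn myproject.wsgi:application --workers 4 --threads 2 --bind 0.0.0.0:8000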

Serving images asynchronously using django and celery?

I have a Django app that serves images when a certain page is loaded. The images are stored on S3, and I retrieve them using boto and send the image content as an HttpResponse with the appropriate content type.
The problem here is that this is a blocking call. Sometimes it takes a long time (a few seconds for an image of a few hundred KB) to retrieve the images and serve them to the client.
I tried converting this process to a Celery task (async, non-blocking), but I am not sure how I can send back the data (images) when they are done downloading. Just returning an HttpResponse from a Celery task does not work. I found docs related to HTTP callback tasks in some old Celery docs, but this is not supported in newer Celery versions.
So, should I use polling in the JS? (I have used Celery tasks in other parts of my website, but all of them are socket-based.) Or is this even the right way to approach the problem?
Code:
Django views code that fetches the images (from S3 using boto3): (in views.py)
@csrf_protect
@ensure_csrf_cookie
def getimg(request, public_hash):
    if request.user.is_authenticated:
        query = img_table.objects.filter(public_hash=public_hash)
    else:
        query = img_table.objects.filter(public_hash=public_hash, is_public=True)
    if query.exists():
        item_dir = construct_s3_path(s3_map_thumbnail_folder, public_hash)
        if check(s3, s3_bucket, item_dir):  # checks if the file exists
            item = s3.Object(s3_bucket, item_dir)
            item_content = item.get()['Body'].read()
            return HttpResponse(item_content, content_type="image/png", status=200)
        else:  # if no image is found, return a blank image
            blank = Image.new('RGB', (1000, 1000), (255, 255, 255))
            response = HttpResponse(content_type="image/jpeg")
            blank.save(response, "JPEG")
            return response
    else:  # if the image corresponding to the hash is not found in the db
        return render(request, 'core/404.html')
I call the above django view in a page like this:
<img src='/api/getimg/123abc' alt='img'>
In urls.py I have:
url(r'^api/getimg/(?P<public_hash>[a-zA-Z0-9]{6})$', views.getimg, name='getimg')
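To illustrate the polling idea raised above, one possible shape (only a sketch: the cache is used as a hand-off between worker and view, and s3/s3_bucket are assumed to be importable module-level objects as in the view) would be:
from celery import shared_task
from django.core.cache import cache
from django.http import HttpResponse, JsonResponse

@shared_task
def fetch_image_task(public_hash, item_dir):
    # Download from S3 in the background and park the bytes in the cache.
    item = s3.Object(s3_bucket, item_dir)
    cache.set('img:%s' % public_hash, item.get()['Body'].read(), timeout=300)

def poll_image(request, public_hash):
    data = cache.get('img:%s' % public_hash)
    if data is None:
        return JsonResponse({'ready': False})   # the JS keeps polling until the task has finished
    return HttpResponse(data, content_type='image/png')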

Celery+Docker+Django -- Getting tasks to work

I've been trying to learn Celery over the past week and adding it to my project that uses Django and Docker-Compose. I am having a hard time understanding how to get it to work; my issue is that I can't seem to get uploading to my database to work when using tasks. The upload function, insertIntoDatabase, was working fine before without any involvement with Celery but now uploading doesn't work. Indeed, when I try to upload, my website tells me too quickly that the upload was successful, but then nothing actually gets uploaded.
The server is started up with docker-compose up, which will make migrations, perform a migrate, collect static files, update requirements, and then start the server. This is all done using pavement.py; the command in the Dockerfile is CMD paver docker_run. At no point is a Celery worker explicitly started; should I be doing that? If so, how?
This is the way I'm calling the upload function in views.py:
insertIntoDatabase.delay(datapoints, user, description)
The upload function is defined in a file named databaseinserter.py. The following decorator was used for insertIntoDatabase:
@shared_task(bind=True, name="database_insert", base=DBTask)
Here is the definition of the DBTask class in celery.py:
class DBTask(Task):
    abstract = True

    def on_failure(self, exc, *args, **kwargs):
        raise exc
I am not really sure what to write for tasks.py. Here is what I was left with by a former co-worker just before I picked up from where he left off:
from celery.decorators import task
from celery.utils.log import get_task_logger
logger = get_task_logger(__name__)
#task(name="database_insert")
def database_insert(data):
And here are the settings I used to configure Celery (settings.py):
BROKER_TRANSPORT = 'redis'
_REDIS_LOCATION = 'redis://{}:{}'.format(os.environ.get("REDIS_PORT_6379_TCP_ADDR"), os.environ.get("REDIS_PORT_6379_TCP_PORT"))
BROKER_URL = _REDIS_LOCATION + '/0'
CELERY_RESULT_BACKEND = _REDIS_LOCATION + '/1'
CELERY_ACCEPT_CONTENT = ['application/json']
CELERY_TASK_SERIALIZER = 'json'
CELERY_RESULT_SERIALIZER = 'json'
CELERY_ENABLE_UTC = True
CELERY_TIMEZONE = "UTC"
Now, I'm guessing that database_insert in tasks.py shouldn't be empty, but what should go there instead? Also, it doesn't seem like anything in tasks.py happens anyway--when I added some logging statements to see if tasks.py was at least being run, nothing actually ended up getting logged, making me think that tasks.py isn't even being run. How do I properly make my upload function into a task?
You're not too far off from getting this working, I think.
First, I'd recommend that you do try to keep your Celery tasks and your business logic separate. So, for example, it probably makes good sense to keep the business logic involved with inserting your data into your DB in the insertIntoDatabase function, and then separately create a Celery task, perhaps named insert_into_db_task, that takes your args as plain Python objects (important) and calls the aforementioned insertIntoDatabase function with those args to actually complete the DB insertion.
Code for that example might look like this:
my_app/tasks/insert_into_db.py
from celery.decorators import task
from celery.utils.log import get_task_logger

logger = get_task_logger(__name__)

@task()
def insert_into_db_task(datapoints, user, description):
    from my_app.services import insertIntoDatabase
    insertIntoDatabase(datapoints, user, description)

my_app/services/insertIntoDatabase.py
def insertIntoDatabase(datapoints, user, description):
    """Note that this function is not a task, by design"""
    # do db insertion stuff

my_app/views/insert_view.py
from my_app.tasks import insert_into_db_task

def simple_insert_view_func(request, *args, **kwargs):
    # start handling the request, define datapoints, user, description
    # the next line creates the **task** which will later do the db insertion
    insert_into_db_task.delay(datapoints, user, description)
    return Response(201)
The app structure I'm implying is just how I would do it and isn't required. Note also that you can probably use @task() straight up and not define any args for it. That might simplify things for you.
Does that help? I like to keep my tasks light and fluffy. They mostly just do jerk proofing (make sure the involved objects exist in the DB, for instance), tweak what happens if the task fails (retry later? abort the task? etc.), do logging, and otherwise execute business logic that lives elsewhere.
Also, in case it's not obvious, you do need to be running Celery somewhere so that there are workers to actually process the tasks your view code is creating. If you don't run Celery somewhere, your tasks will just stack up in the queue and never get processed (and so your DB insertions will never happen).
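For example (the project/app name is a placeholder), a worker can be started next to the web process with:
celery -A my_app worker --loglevel=info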

Django REST framework non-model serializer and BooleanField

I seem to have hit a wall full of puzzling results when trying to deal with the following use case:
URL: '^api/event/(?P<pk>[0-9]+)/registration$'
payload: {"registered": "true"} or {"registered": "false"}
I retrieve the event object corresponding to the given pk, and then based on that I want:
in a GET request to retrieve whether the authenticated user is registered or not
in a PUT to change the registration state.
Everything works fine until the point where I want to process the incoming payload in the PUT request. I've tried creating a serializer like this:
class RegistrationSerializer(serializers.Serializer):
    registered = fields.BooleanField()
and call it from an APIView's put method with:
serializer = RegistrationSerializer(data=request.DATA)
but it doesn't work and serializer.data always contains {"registered": False}.
From a shell I tried another isolated test:
>>> rs = RegistrationSerializer(data={'registered':True})
>>> rs
<app.serializers.RegistrationSerializer object at 0x10a08cc10>
>>> rs.data
{'registered': False}
What am I doing wrong? What would be the best way to handle this use case?
You need to call rs.is_valid() first, before accessing rs.data.
Really the framework ought to raise an exception if you don't do so.
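In other words (rs.validated_data only exists on DRF 3+; on the older version used in the question, rs.data is populated after validation, as described above):
rs = RegistrationSerializer(data={'registered': True})
rs.is_valid()          # must be called before reading the data
rs.data                # now reflects the input; on DRF 3+ prefer rs.validated_data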

Celery task model instance data being stomped by web worker?

I have a task that gets called on one view. Basically the task is responsible for fetching some pdf data, and saving it into s3 via django storages.
Here is the view that kicks it off:
@login_required
@minimum_stage(STAGE_SIGN_PAGE)
def page_complete(request):
    if not request.GET['documentKey']:
        logger.error('Document Key was missing', exc_info=True, extra={
            'request': request,
        })
    user = request.user
    speaker = user.get_profile()
    speaker.readyForStage(STAGE_SIGN)
    speaker.save()
    retrieveSpeakerDocument.delay(user.id, documentKey=request.GET['documentKey'], documentType=DOCUMENT_PAGE)
    return render_to_response('speaker_registration/redirect.html', {
        'url': request.build_absolute_uri(reverse('registration_sign_profile'))
    }, context_instance=RequestContext(request))
Here is the task:
@task()
def retrieveSpeakerDocument(userID, documentKey, documentType):
    print 'starting task'
    try:
        user = User.objects.get(pk=userID)
    except User.DoesNotExist:
        logger.error('Error selecting user while grabbing document', exc_info=True)
        return
    echosign = EchoSign(user=user)
    fileData = echosign.getDocumentWithKey(documentKey)
    if not fileData:
        logger.error('Error retrieving document', exc_info=True)
    else:
        speaker = user.get_profile()
        print speaker
        filename = "%s.%s.%s.pdf" % (user.first_name, user.last_name, documentType)
        if documentType == DOCUMENT_PAGE:
            afile = speaker.page_file
        elif documentType == DOCUMENT_PROFILE:
            afile = speaker.profile_file
        content = ContentFile(fileData)
        afile.save(filename, content)
        print "saving user in task"
        speaker.save()
In the meantime, my next view hits (actually it's an AJAX call, but that doesn't matter). Basically it fetches the code for the next embedded document. Once it gets it, it updates the speaker object and saves it:
@login_required
@minimum_stage(STAGE_SIGN)
def get_profile_document(request):
    user = request.user
    e = EchoSign(request=request, user=user)
    e.createProfile()
    speaker = user.get_profile()
    speaker.profile_js = e.javascript
    speaker.profile_echosign_key = e.documentKey
    speaker.save()
    return HttpResponse(True)
My task works properly, and updates the speaker.page_file property correctly. (I can temporarily see this in the admin, and also watch it occur in the postgres logs.)
However, it soon gets stamped over, I BELIEVE by the call in the get_profile_document view after it updates and saves the profile_js property. In fact, I know this is where it happens based on the SQL statements: it's there before the profile_js is updated, then it's gone.
Now I don't really understand why. The speaker is fetched RIGHT before each update and save, and there's no real caching going on here yet, unless get_profile() does something weird. What is going on and how might I avoid this? (Also, do I need to call save on the speaker after running save on the FileField? It seems like there are duplicate calls in the postgres logs because of this.)
Update
Pretty sure this is due to Django's default view transaction handling. The view begins a transaction, takes a long time to finish, and then commits, overwriting the object I've already updated in a celery task.
I'm not exactly sure how to solve for it. If I switch the method to manual transactions and then commit right after I fetch the echosign js (takes 5-10 seconds), does it start a new transaction? Didn't seem to work.
Maybe not
I don't have TransactionMiddleware added in. So unless it's happening anyway, that's not the problem.
Solved.
So here's the issue.
Django apparently keeps a cache of objects that it doesn't think have changed anywhere. (Correct me if I'm wrong.) Since Celery was updating my object in the DB outside of Django, Django had no idea the object had changed and fed me the cached version back when I called user.get_profile().
The solution to force it to grab the object from the database is simply to re-grab it by its own id. It's a bit silly, but it works.
speaker = user.get_profile()
speaker = Speaker.objects.get(pk=speaker.id)
Apparently the Django authors don't want to add any kind of refresh() method onto objects, so this is the next best thing.
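For reference, Django 1.8 and later did add such a helper, so with a newer version the re-fetch could be written as:
speaker = user.get_profile()
speaker.refresh_from_db()   # reloads the row from the database (Django 1.8+)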
Using transactions also MIGHT solve my problem, but another day.
Update
After further digging, it's because the user model has a _profile_cache attribute on it, so that it doesn't refetch the profile every time you grab it from the same object within one request. Since I was using get_profile() on the same object in the echosign function, it was being cached.
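Based on that, another workaround (tied to this old Django version, since it relies on the _profile_cache attribute mentioned above) would be to drop the cached profile before re-reading it:
if hasattr(user, '_profile_cache'):
    del user._profile_cache     # forget the cached profile so get_profile() hits the database again
speaker = user.get_profile()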