Perform celery task after successful commit in Flask? - flask

A relatively long-running task is delegated to Celery workers, which run separately on another server.
However, the results are written back to the relational database (a table is updated using task_descr.id as the key, see below), and the worker uses ignore_result.
Task requested from Flask application:
task = app.celery.send_task('tasks.mytask', [task_descr.id, attachments])
The problem is that tasks are sent while the transaction is still open on the Flask side. This causes a race condition: sometimes the Celery worker completes the task before the transaction in the Flask app has been committed.
What is the proper way to send tasks after successful transaction only?
Or should the worker check that task_descr.id is available before attempting the conditional UPDATE, and retry the task if it is not (this feels like too complex an arrangement)?
Answer to Run function after a certain type of model is committed discusses similar situation, but here task sending is explicit, so no need to listen to the updates/inserts in some model.

One of the ways is Per-Request After-Request Callbacks, thanks to Armin Ronacher:
from flask import g
def after_this_request(func):
    if not hasattr(g, 'call_after_request'):
        g.call_after_request = []
    g.call_after_request.append(func)
    return func
@app.after_request
def per_request_callbacks(response):
    for func in getattr(g, 'call_after_request', ()):
        response = func(response)
    return response
In my case the usage is in the form of a nested function:
task_descr = ...
attachments = ...
#...
@after_this_request
def send_mytask(response):
    if response.status_code in {200, 302}:
        task = app.celery.send_task('tasks.mytask', [task_descr.id, attachments])
    return response
Not ideal, but it works. My tasks are only for successfully served requests, so I do not care about 500s or other error conditions.
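The underlying idea (queue the send, and fire it only once the commit has succeeded) can also be sketched independently of Flask. The following is a minimal illustration using the standard library's sqlite3; the CommitHooks class and the sent list (a stand-in for the Celery broker) are hypothetical names introduced here, not part of Flask, SQLAlchemy or Celery.

```python
import sqlite3


class CommitHooks:
    """Collect callbacks and run them only after a successful commit."""

    def __init__(self, conn):
        self.conn = conn
        self._pending = []

    def on_commit(self, func):
        # Queue the callback; nothing runs yet.
        self._pending.append(func)

    def commit(self):
        pending, self._pending = self._pending, []
        self.conn.commit()      # raises on failure, so the hooks below are skipped
        for func in pending:
            func()

    def rollback(self):
        self._pending.clear()   # drop callbacks queued for the aborted transaction
        self.conn.rollback()


sent = []  # hypothetical stand-in for the Celery broker

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE task_descr (id INTEGER PRIMARY KEY, state TEXT)")
hooks = CommitHooks(conn)

cur = conn.execute("INSERT INTO task_descr (state) VALUES ('new')")
task_id = cur.lastrowid
# In the real application this would wrap app.celery.send_task(...).
hooks.on_commit(lambda: sent.append(("tasks.mytask", task_id)))
hooks.commit()                  # the "task" is sent only at this point
```

With this arrangement a worker can never see the task before the row it refers to is committed, and a rolled-back transaction sends nothing.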

Related

transaction.atomic celery task

I have transaction.atomic celery task:
@app.task(
    name="create_order",
    bind=True,
    ignore_result=True,
)
@transaction.atomic
def create_order(self: Task) -> None:
    try:
        data = MyModel.objects.select(...)
        # Some actions that may take long time and only use DB for SELECT queries
        make_order(data, ...)
    except SomeException as exc:
        raise self.retry(exc=exc, countdown=5)
    else:
        data.status = DONE
        data.save()
The @transaction.atomic decorator creates a new connection to the DB and holds it until an exception or a COMMIT statement. But what if the task raises self.retry? Will the connection be closed, and will Django open a new one when the task retries?
Technically, transaction.atomic does not open a new connection, it grabs an existing connection out of the django.db.connections collection, and these connections are typically instantiated at startup. The connections stay live and are reused across the app. Typically, all of the connections do a ping before executing a query to make sure the connection is usable (otherwise a new connection will be established). Exiting the code block of the task will not close the connection, and depending on your connection settings, the same process will happen when the process is retried (transaction.atomic will grab the connection out of the connections collection and then the execution of the query will do the connectivity check).
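One detail worth adding: self.retry raises an exception, and an exception leaving a transaction.atomic block makes Django roll the transaction back rather than commit it. The same "roll back, but keep the connection" behaviour can be seen in miniature with the standard library's sqlite3 connection context manager. This is only an analogy, not Django code; the orders table and the RetryLater exception are stand-ins invented for the sketch.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT)")


class RetryLater(Exception):
    """Hypothetical stand-in for the exception raised by self.retry()."""


def create_order(conn):
    # "with conn" commits if the block succeeds and rolls back if it raises;
    # it does not close the connection in either case.
    with conn:
        conn.execute("INSERT INTO orders (status) VALUES ('DONE')")
        raise RetryLater("simulated failure after the write")


try:
    create_order(conn)
except RetryLater:
    pass

# The same connection object is still usable, and the insert was rolled back.
row_count = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
```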

How to create a queue for python-requests in Django?

A REST API service has a request limit (say, a maximum of 100 requests per minute). In Django, I am trying to let users access such an API and retrieve data in real time to update SQL tables. The problem is that if multiple users access the API at once, the request limit is likely to be exceeded.
Here is a code snippet showing how I currently perform requests: each user adds a list of objects to request and runs request_engine().start(object_list) to access the API. I use multithreading to speed up the requests. I also allow retrying failed API requests by setting a limit, upper_limit, on the number of requests for each request object.
As I understand it, there should be some kind of queue for API requests. I expect there is a more elegant solution, but I could not find any similar examples. How can one implement or rewrite this for multi-user usage with Django?
import requests
from multiprocessing.dummy import Pool as ThreadPool
N=50 # number of threads
upper_limit=1 # limit on the number of requests for a single object
class request_engine():
    def __init__(self):
        pass
    def start(self,objs):
        self.objs={obj:{'status':0,'data':None} for obj in objs}
        done=False
        while not done:
            self.parallel_requests()
            done=all(_['status']>upper_limit or _['status']==-1 for obj,_ in self.objs.items())
        return dict(self.objs)
    def single_request(self,request_obj):
        URL = f"https://reqres.in/api/users?page={request_obj}"
        r = requests.get(url = URL)
        if r.ok:
            res = r.json()
            self.objs[request_obj]['status']=-1
            self.objs[request_obj]['data']=res
        else:
            self.objs[request_obj]['status']+=1
    def parallel_requests(self):
        objs=[obj for obj,_ in self.objs.items() if _['status']!=-1 and _['status']<=upper_limit]
        pool = ThreadPool(N)
        pool.map(self.single_request, objs)
        pool.close()
        pool.join()
objs=[1,2,3,4,5,6,7,7,8,234,124,24,535,6,234,24,4,1,3,4,5,4,3,5,3,1,5,2,3,5,3]
result=request_engine().start(objs)
print([_['status'] for obj,_ in result.items()])
# status corresponds to the number of unsuccessful requests
# status=-1 implies success of the request
Thanks in advance.
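One possible building block for this is a limiter that every request must pass through before it reaches the API. Below is a minimal, thread-safe sliding-window sketch using only the standard library. The SlidingWindowLimiter class and its parameters are hypothetical names made up for illustration. It only limits threads inside a single process, so a deployment with several Django processes would need the same bookkeeping in a shared store (an assumption, not something stated in the question).

```python
import threading
import time
from collections import deque


class SlidingWindowLimiter:
    """Allow at most max_calls acquisitions in any window of `period` seconds."""

    def __init__(self, max_calls, period, clock=time.monotonic, sleep=time.sleep):
        self.max_calls = max_calls
        self.period = period
        self._clock = clock
        self._sleep = sleep
        self._stamps = deque()      # times of the most recent allowed calls
        self._lock = threading.Lock()

    def acquire(self):
        """Block until a call is allowed, then record it."""
        while True:
            with self._lock:
                now = self._clock()
                # Forget calls that have left the window.
                while self._stamps and now - self._stamps[0] >= self.period:
                    self._stamps.popleft()
                if len(self._stamps) < self.max_calls:
                    self._stamps.append(now)
                    return
                # Time until the oldest recorded call leaves the window.
                wait = self.period - (now - self._stamps[0])
            self._sleep(wait)       # sleep outside the lock


# One limiter shared by all threads: 100 requests per 60 seconds.
limiter = SlidingWindowLimiter(max_calls=100, period=60.0)
```

In the snippet above, single_request would call limiter.acquire() immediately before requests.get, so that all users' threads draw from the same budget.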

Real-time update on Django application using MySQL <> WebSocket

I need to continuously read data from a MySQL database that receives new data roughly every 200 ms, and continuously update the value shown in a text field on my dashboard. The dashboard is built on Django.
I have read a lot about Channels but all the tutorials are about chat applications. I know that I need to implement WebSockets which will basically have an open connection and get the data. With the chat application, it makes sense but I haven't come across anything which talks about MySQL database.
I also read about mysql-events. Since the data arriving in the table comes from an external sensor, I don't understand how I can monitor a table from inside Django, i.e. whenever a new row is added to the table, I need to fetch that newly inserted row based on a column value.
Any ideas on how to go about it? I have gone through a lot of articles and couldn't find anything specific to this requirement.
Thanks to Timothee Legros answer, it kinda helped me move along in the right direction.
Everywhere on the internet, it says that Django channels is/can be used for real-time applications, but nowhere it talks about the exact implementation(other than chat applications).
I used Celery, Django Channels and Celery's Beat to accomplish the task and it works as expected.
There are three parts to it: setting up Channels, creating a Celery task that is called periodically (with the help of Celery Beat), and sending that task's output to Channels so that it can pass the data on to the websocket.
Channels
I followed the original tutorial on the Channels website and built on that.
routing.py
from django.urls import re_path
from . import consumers
websocket_urlpatterns = [
    re_path(r'ws/chat/(?P<room_name>\w+)/$', consumers.ChatConsumer),
    re_path(r'ws/realtimeupdate/$', consumers.RealTimeConsumer),
]
consumers.py
import json
from channels.generic.websocket import AsyncWebsocketConsumer
class RealTimeConsumer(AsyncWebsocketConsumer):
    async def connect(self):
        self.channel_group_name = 'core-realtime-data'
        # Join room group
        await self.channel_layer.group_add(
            self.channel_group_name,
            self.channel_name
        )
        await self.accept()
    async def disconnect(self, close_code):
        # Leave room group
        await self.channel_layer.group_discard(
            self.channel_group_name,
            self.channel_name
        )
    # Receive message from WebSocket
    async def receive(self, text_data):
        print(text_data)
        pass
    async def loc_message(self, event):
        # print(event)
        message_trans = event['message_trans']
        message_tag = event['message_tag']
        # print("sending data to websocket")
        await self.send(text_data=json.dumps({
            'message_trans': message_trans,
            'message_tag': message_tag
        }))
This class sends data to the websocket as soon as it receives it. The two files above are specific to the app.
Now we will setup Celery.
In the project's base directory, where the settings file resides, we need to create three files:
celery.py: this will initialise Celery.
routing.py: this will be used to route the Channels websocket addresses.
tasks.py: this is where we will set up the task.
celery.py
import os
from celery import Celery
# set the default Django settings module for the 'celery' program.
os.environ.setdefault('DJANGO_SETTINGS_MODULE', 'proj_name.settings')
app = Celery('proj_name', backend='redis://localhost', broker='redis://localhost/')
# Using a string here means the worker doesn't have to serialize
# the configuration object to child processes.
# - namespace='CELERY' means all celery-related configuration keys
# should have a `CELERY_` prefix.
app.config_from_object('django.conf:settings', namespace='CELERY')
# Load task modules from all registered Django app configs.
app.autodiscover_tasks()
@app.task(bind=True)
def debug_task(self):
    print(f'Request: {self.request!r}')
routing.py
from channels.auth import AuthMiddlewareStack
from channels.routing import ProtocolTypeRouter, URLRouter
from app_name import routing
application = ProtocolTypeRouter({
    # (http->django views is added by default)
    'websocket': AuthMiddlewareStack(
        URLRouter(
            routing.websocket_urlpatterns
        )
    ),
})
tasks.py
import time
from asgiref.sync import async_to_sync
from celery import shared_task
from channels.layers import get_channel_layer
from django.core import serializers
# CustomModel_1 and CustomModel_2 are the app's own models
@shared_task(name='realtime_task')
def RealTimeTask():
    time_s = time.time()
    result_trans = CustomModel_1.objects.all()
    result_tag = CustomModel_2.objects.all()
    result_trans_json = serializers.serialize('json', result_trans)
    result_tag_json = serializers.serialize('json', result_tag)
    # output = {"ktr": result_transmitter_json, "ktag": result_tag_json}
    # print(output)
    channel_layer = get_channel_layer()
    # the keys must match what RealTimeConsumer.loc_message reads
    message = {'type': 'loc_message',
               'message_trans': result_trans_json,
               'message_tag': result_tag_json}
    async_to_sync(channel_layer.group_send)('core-realtime-data', message)
    print(time.time()-time_s)
Once it has finished, the task sends the result to Channels, which in turn relays it to the websocket.
Settings.py
# Channels
CHANNEL_LAYERS = {
    'default': {
        'BACKEND': 'channels_redis.core.RedisChannelLayer',
        'CONFIG': {
            "hosts": [('127.0.0.1', 6379)],
        },
    },
}
CELERY_BEAT_SCHEDULE = {
    'task-real': {
        'task': 'realtime_task',
        'schedule': 1 # this means, the task will run itself every second
    },
}
Now the only thing left is to create a websocket in the javascript file and start listening to it.
//Create web socket to receive data
const chatSocket = new WebSocket(
    'ws://'
    + window.location.host
    + '/ws/realtimeupdate'
    + '/'
);
chatSocket.onmessage = function(e) {
    const data = JSON.parse(e.data);
    console.log(e.data + '\n');
    // For trans
    var arrayOfObjects = JSON.parse(data.message_trans);
    //Do your thing
    //For tags
    var arrayOfObjects_tag = JSON.parse(data.message_tag);
    //Do your thing
};
chatSocket.onclose = function(e) {
    console.error('Chat socket closed unexpectedly');
};
As for the MySQL usage: I insert data into the MySQL database from an external sensor, and in tasks.py I query the table using the Django ORM.
Overall, it does the intended work: it populates a real-time dashboard with real-time data from MySQL. I'm sure there may be different and better approaches, so please let me know about them.
Your best bet if you need to constantly query your SQL database is to use Celery, or dramatiq (which is simpler and easier but less battle-tested), in combination with Django Channels.
Celery allows you to create workers (kind of like background processes) that you can send tasks (functions) to. When a worker receives a task it will execute. All this is done in the background. From the task that the worker is executing you can actually send data back through a websocket directly from the worker. This only works if you have django channels + channel layers enabled because when you enable channel layers, each consumer instance created when you open a channel/websocket will have a name that you can pass to the worker so that it knows which websocket to send the query data back to.
Here is what the flow of this process would look like:
1. The client requests a connection to your websocket.
2. A consumer instance is created, and with it a specific name for it.
3. The consumer instance accepts the connection.
4. The consumer triggers the Celery task and passes the name.
5. The worker begins polling your SQL database every X seconds.
6. When the worker finds a new entry, it uses the name it was given to send the new entry back through the websocket.
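The polling step in this flow can be sketched as follows. This is a minimal illustration under several assumptions: the table has an auto-incrementing id column, the table name readings and the column value are hypothetical, and the standard library's sqlite3 stands in for MySQL so that the example is self-contained. The send argument is a placeholder for whatever pushes data to the websocket (in a Channels worker, a channel layer send).

```python
import sqlite3


def fetch_new_rows(conn, last_seen_id):
    """Return rows with an id greater than last_seen_id, oldest first."""
    cur = conn.execute(
        "SELECT id, value FROM readings WHERE id > ? ORDER BY id",
        (last_seen_id,),
    )
    return cur.fetchall()


def poll_once(conn, last_seen_id, send):
    """Send every new row and return the updated high-water mark."""
    for row_id, value in fetch_new_rows(conn, last_seen_id):
        send({"id": row_id, "value": value})
        last_seen_id = row_id
    return last_seen_id


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE readings (id INTEGER PRIMARY KEY AUTOINCREMENT, value REAL)"
)
conn.executemany("INSERT INTO readings (value) VALUES (?)", [(1.5,), (2.5,)])

outbox = []                     # stand-in for the websocket / channel layer
last_id = poll_once(conn, 0, outbox.append)

# A new sensor row arrives; the next poll picks up only that row.
conn.execute("INSERT INTO readings (value) VALUES (3.5)")
last_id = poll_once(conn, last_id, outbox.append)
```

Remembering the highest id already sent means each poll transfers only the rows inserted since the previous one, instead of re-reading the whole table.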
I suggest reading django channels documentation on consumers and channel layers as well as celery or dramatiq tutorials to understand how those work. For all this to work you will also have to learn about Redis and a message queue service such as RabbitMQ. There is just too much to put in a simple answer but I can provide more information if you have specific questions.
Edit:
Get Redis Server Setup on your machine. If you are on Windows like me then you have to download WSL 2 and install Ubuntu from the Windows Store (free). This link can walk you through it.
Get RabbitMQ server setup. Follow their tutorial
Enable Django Channels and Django-Channel-layers and then setup Redis as your default Django-channels backend.
Set up Dramatiq or Celery. I prefer Dramatiq as it is basically a new and improved version of Celery, albeit less popular. It is much easier to set up and use. This is the github repo for Django-dramatiq and it will walk you through how to set it up. Note that just like when you launch your django server with python manage.py runserver, you have to launch dramatiq workers with python manage.py rundramatiq before testing your website.
Create a tasks.py file in your django app and inside that file implement your code to check the MySQL database for new entries. If you haven't figured that out already here is the link to get started with that. In your tasks file you should have a function with the dramatiq.actor decorator on top so that dramatiq knows that the function is a task.
Build a django-channels consumer to handle WebSocket connections as well as allow you to send data through the WebSocket connection. This is what the standard consumer would look like:
from asgiref.sync import sync_to_async
from channels.generic.websocket import AsyncJsonWebsocketConsumer
class AsyncDashboardConsumer(AsyncJsonWebsocketConsumer):
    async def connect(self):
        await self.accept()
    async def disconnect(self, code):
        await self.close()
    async def receive_json(self, text_data=None, bytes_data=None, **kwargs):
        someData = text_data['someData']
        someOtherData = text_data['someOtherData']
        if 'execute_getMySQLdata' in text_data['function']:
            await self.getData(someData, someOtherData)
    async def sendDataToClient(self, event):
        await self.send(text_data=event['text'])
    async def getData(self, someData, someOtherData):
        # wrap the callable itself, then await the call
        await sync_to_async(SQLData.send)(self.channel_name, someData, someOtherData)
The connect function is called when the client attempts to connect to the WebSocket URL that your routing file (in step 2) maps to this consumer.
The receive_json function is called whenever the client sends data to your django server.
The getData function is called from receive_json and sends a message to start the dramatiq task that you created earlier to check the SQL db. Note that when you send the message you must pass in self.channel_name, as you use that channel name to send data back through the WebSocket directly from the dramatiq worker/task.
The sendDataToClient function is used when you send data back to the client. When you send data from your task, the message's 'type' must name this function so that Channels dispatches the message to it.
To send data from the task you created earlier use this: async_to_sync(channel_layer.send)(channelName, {'type': 'sendDataToClient', 'text': jsonPayload}). Notice how you pass the channelName as well as the name of the sendDataToClient function from your consumer.
Finally, this is what the javascript on the client side would look like:
let socket = new WebSocket("wss://javascript.info/article/websocket/demo/hello");
socket.onopen = function(e) {
    alert("[open] Connection established");
    alert("Sending to server");
    socket.send("My name is John");
};
socket.onmessage = function(event) {
    alert(`[message] Data received from server: ${event.data}`);
};
socket.onclose = function(event) {
    if (event.wasClean) {
        alert(`[close] Connection closed cleanly, code=${event.code} reason=${event.reason}`);
    } else {
        // e.g. server process killed or network down
        // event.code is usually 1006 in this case
        alert('[close] Connection died');
    }
};
socket.onerror = function(error) {
    alert(`[error] ${error.message}`);
};
This code came directly from this JavaScript WebSocket walkthrough.
This is how a basic web application with background workers would continually update information in real-time. There are probably other ways of doing this without background workers but since you want to get information as fast as possible as soon as it arrives it is better to have a background process that is continually checking for updates. On another note, the code above means that separate connections to the database are opened for each new client that connects but you can easily take advantage of django-channels groups and have one connection to your database that then just sends to all clients in certain groups.
Build a microservice for Websockets connections
Another way to implement such a feature is to build a standalone WebSocket microservice.
A monolithic architecture isn't what you need here. Every WebSocket will open a connection to Django (which will sit behind a reverse proxy and an application server, e.g. NGINX and Gunicorn). If your client opens two tabs in the browser you will get two connections, and so on.
My recommendation is to modify the tech stack (yes, I'm a huge fan of Django, but there are many cool solutions for building WebSocket services):
Use Starlette, a production-ready framework with built-in WebSockets: https://www.starlette.io/websockets/
Use uvicorn.workers.UvicornWorker for Gunicorn to manage your ASGI application. This is only one line, like gunicorn -w 4 -k uvicorn.workers.UvicornWorker --log-level warning example:app
Handle your WebSocket connections and use the examples to request updates from the database: https://www.starlette.io/database/
Use very simple JavaScript code to open the connection on the client side and listen for updates.
So your models, templates and views will be managed by Django.
Your WebSocket connections will be managed by Starlette in a native async way.
If you're interested in such an option I can make detailed instructions.

APscheduler job not firing when run in Flask on AWS Lambda

A while back, I wrote a small Flask app (deployed as an AWS lambda via Serverless) to do some on-the-fly DynamoDB updates via Slack slash commands. A coworker suggested adding a component so that updates could be scheduled in advance.
I looked up using APscheduler and added a new component to the app. In the abbreviated example following, a Slack slash command would send a POST request to the app's "/scheduler" endpoint:
from flask import Flask, request
from apscheduler.schedulers.background import BackgroundScheduler
from pytz import timezone
[etc...]
app = Flask(__name__)
city = timezone([my timezone])
sched = BackgroundScheduler(timezone=city)
sched.start()
def success_webhook(markdown):
    webhook_url = os.environ["webhook_url"]
    data = json.dumps({"text": {"type": "mrkdwn", "text": markdown}})
    headers = {"Content-Type": "application/json"}
    r.post(webhook_url, data=data, headers=headers)
def pass_through(package):
    db = boto3.resource(
        "dynamodb",
        region_name=os.environ["region_name"],
        aws_access_key_id=os.environ["aws_access_key_id"],
        aws_secret_access_key=os.environ["aws_secret_access_key"],
    )
    table = db.Table(table_name)
    update_action = table.update_item(
        Key={"id": "[key]"},
        UpdateExpression="SET someValue = :val1",
        ExpressionAttributeValues={":val1": package["text"]},
    )
    if update_action["ResponseMetadata"]["HTTPStatusCode"] == 200:
        success_webhook("success")
@app.route("/scheduler", methods=["POST"])
def scheduler():
    incoming = (request.values).to_dict()
    sched.add_job(pass_through, "date", run_date=incoming["run_date"],
                  id=incoming["id_0"], args=[incoming])
    return "success", 200
if __name__ == "__main__":
    app.run()
I tested locally and everything worked fine -- I could schedule jobs and they would run on time; other app endpoints for checking scheduled jobs and removing scheduled jobs [not shown above] also worked as expected.
But once I spun up the AWS lambda running the Flask app, the scheduler never actually runs the pass_through() function for the jobs. Sure, the job gets added -- I can also see it in the list of jobs and remove it from the schedule -- but when the time comes for the lambda to actually run pass_through(), it doesn't. Wondering if anyone knows anything about this situation?
Lambda execution stops right after you return a value, so even when you schedule the job here:
    sched.add_job(pass_through, "date", run_date=incoming["run_date"],
                  id=incoming["id_0"], args=[incoming])
    return "success", 200
the Lambda execution stops and the job will not run later.
If you need to schedule jobs, you probably need a solution other than Lambda. However, you can use CloudWatch to trigger your Lambdas on a schedule: https://docs.aws.amazon.com/AmazonCloudWatch/latest/events/RunLambdaSchedule.html
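One way to combine the two ideas, sketched here purely as an assumption and not taken from the answer, is to keep the pending jobs in persistent storage and let a scheduled invocation run whichever jobs are due. The names run_due_jobs and handler, and the jobs and run parameters, are hypothetical; an in-memory list stands in for the storage so that the example is self-contained.

```python
from datetime import datetime, timezone


def run_due_jobs(jobs, now, run):
    """Run every job whose run_date has passed; return the jobs still pending."""
    pending = []
    for job in jobs:
        if job["run_date"] <= now:
            run(job["payload"])
        else:
            pending.append(job)
    return pending


def handler(event, context, jobs=None, run=print):
    # Hypothetical entry point for a scheduled invocation. In a real
    # deployment `jobs` would be loaded from persistent storage, and the
    # remaining jobs would be written back to it.
    now = datetime.now(timezone.utc)
    remaining = run_due_jobs(jobs or [], now, run)
    return {"pending": len(remaining)}
```

Because each invocation starts from stored state, nothing depends on a background thread surviving between requests.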

Receiving events from celery task

I have a long running celery task which iterates over an array of items and performs some actions.
The task should somehow report back which item is it currently processing so end-user is aware of the task's progress.
At the moment my Django app and Celery sit together on one server, so I am able to use Django's models to report the status, but I am planning to add more workers that are away from Django, so they can't reach the DB.
Right now I see a few solutions:
1. Store intermediate results manually in some storage, like Redis or MongoDB, making them available over the network. This worries me a little, because if I use Redis, for example, I have to keep the Django-side code that reads the status in sync with the Celery task that writes it, so that both use the same keys.
2. Report status back to Django from Celery using REST calls, like PUT http://django.com/api/task/123/items_processed
3. Maybe use the Celery event system and create events like "Item processed", on which Django updates the counter.
4. Create a separate worker, running on the server with Django, that holds a task which only increases the processed-items count, so when the main task is done with an item it issues increase_messages_proceeded_count.delay(task_id).
Are there other solutions, or hidden problems with the ones I mentioned?
There are probably many ways to achieve your goal, but here is how I would do it.
Inside your long running celery task set the progress using django's caching framework:
from django.core.cache import cache
@app.task(bind=True)
def long_running_task(self, *args, **kwargs):
    # bind=True gives the task access to its own request (and id) via self
    key = "my_task: %s" % self.request.id
    ...
    # do whatever you need to do and set the progress
    # using cache:
    cache.set(key, progress, timeout="whatever works for you")
    ...
Then all you have to do is make a recurring AJAX GET request with that key and retrieve the progress from cache. Something along those lines:
import json
from django.core.cache import cache
from django.http import HttpResponse
def task_progress_view(request, *args, **kwargs):
    key = request.GET.get('task_key')
    progress = cache.get(key)
    return HttpResponse(content=json.dumps({'progress': progress}),
                        content_type="application/json; charset=utf-8")
Here is a caveat though, if you are running your server as multiple processes, make sure that you are using something like memcached, because django's native caching will be inconsistent among the processes. Also I probably wouldn't use celery's task_id as a key, but it is sufficient for demonstration purpose.
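The concern raised in the question, that the writer and the reader must agree on the key, can be reduced by defining the key format in one small module that both the worker and the web side import. The sketch below is only an illustration: the function names are hypothetical, and a plain dict stands in for the shared cache or Redis instance.

```python
def progress_key(task_id):
    """Single place where the key format is defined."""
    return f"task-progress:{task_id}"


def report_progress(store, task_id, done, total):
    # Called from the worker after each processed item.
    store[progress_key(task_id)] = {"done": done, "total": total}


def read_progress(store, task_id):
    # Called from the web side; None means nothing has been reported yet.
    return store.get(progress_key(task_id))


store = {}                      # stand-in for Redis / memcached / Django cache
report_progress(store, "abc123", 3, 10)
progress = read_progress(store, "abc123")
```

With a real cache backend, the two dictionary operations would become the backend's set and get calls; the key helper stays the same on both sides.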
Take a look at flower - a real-time monitor and web admin for Celery distributed task queue:
https://github.com/mher/flower#api
http://flower.readthedocs.org/en/latest/api.html#get--api-tasks
You need it for presentation, right? Flower works with websockets.
For instance - receive task completion events in real-time (taken from official docs):
var ws = new WebSocket('ws://localhost:5555/api/task/events/task-succeeded/');
ws.onmessage = function (event) {
    console.log(event.data);
}
You would likely need to work with tasks ('ws://localhost:5555/api/tasks/').
I hope this helps.
Simplest:
Your tasks and Django app already share access to one or two data stores: the broker and the results backend (if you're using one that is different from the broker).
You can simply put some data into one or the other of these data stores to indicate which item the task is currently processing.
E.g. if using Redis, simply have a key 'task-currently-processing' and store the data relevant to the item currently being processed in there.
You can use something like Swampdragon to reach the user from the Celery instance (you have to be able to reach it from the client, though, so take care not to run afoul of CORS). It can be latched onto the counter, not the model itself.
lehins' solution looks good if you don't mind your clients repeatedly polling your backend. That may be fine but it gets expensive as the number of clients grows.
Artur Barseghyan's solution is suitable if you only need the task lifecycle events generated by Celery's internal machinery.
Alternatively, you can use Django Channels and WebSockets to push updates to clients in real-time. Setup is pretty straightforward.
Add channels to your INSTALLED_APPS and set up a channel layer. E.g., using a Redis backend:
CHANNEL_LAYERS = {
    "default": {
        "BACKEND": "channels_redis.core.RedisChannelLayer",
        "CONFIG": {
            "hosts": [("redis", 6379)]
        }
    }
}
Create an event consumer. This will receive events from Channels and push them via Websockets to the client. For instance:
import json
from asgiref.sync import async_to_sync
from channels.generic.websocket import WebsocketConsumer
class TaskConsumer(WebsocketConsumer):
    def connect(self):
        self.task_id = self.scope['url_route']['kwargs']['task_id'] # your task's identifier
        async_to_sync(self.channel_layer.group_add)(f"tasks-{self.task_id}", self.channel_name)
        self.accept()
    def disconnect(self, code):
        async_to_sync(self.channel_layer.group_discard)(f"tasks-{self.task_id}", self.channel_name)
    def item_processed(self, event):
        item = event['item']
        # wrap the item so the client can read it as data.item
        self.send(text_data=json.dumps({'item': item}))
Push events from your Celery tasks like this:
from asgiref.sync import async_to_sync
from channels.layers import get_channel_layer
...
# get_channel_layer must be called to obtain the layer instance
async_to_sync(get_channel_layer().group_send)(f"tasks-{task.task_id}", {
    'type': 'item_processed',
    'item': item,
})
You can also write an async consumer and/or invoke group_send asynchronously. In either case you no longer need the async_to_sync wrapper.
Add websocket_urlpatterns to your urls.py:
websocket_urlpatterns = [
    path(r'ws/tasks/<task_id>/', TaskConsumer.as_asgi()),
]
Finally, to consume events from JavaScript in your client, you can do something like this:
let task_id = 123;
let protocol = location.protocol === 'https:' ? 'wss://' : 'ws://';
let socket = new WebSocket(`${protocol}${window.location.host}/ws/tasks/${task_id}/`);
socket.onmessage = function(event) {
    let data = JSON.parse(event.data);
    let item = data.item;
    // do something with the item (e.g., push it into your state container)
}