Should I use a task queue (Celery), asyncio, or neither for an API that polls other APIs? - django

I have written an API with Django whose purpose is to operate as a bridge between a website back-end and the external services we use, so that the website doesn't have to handle many requests to external APIs (CRM, calendar events, email providers, etc.).
The API mainly polls other services, parses the results and forwards them to the website backend.
I initially went for a Celery-based task queue, as it seemed to me like the right tool to offload that processing to another instance, but I'm starting to think it doesn't really fit the purpose.
As the website expects synchronous responses, my code contains a lot of:
results = my_task.delay().get()
or
results = chain(fetch_results.s(), parse_results.s()).delay().get()
This doesn't feel like the proper way to use Celery tasks.
It is efficient when issuing dozens of requests and processing the results in parallel - a periodic refresh task for example - but it adds a lot of overhead for simple requests (fetch, parse, forward), which represent most of the traffic.
Should I go fully synchronous for those "simple requests" and keep Celery tasks for specific scenarios? Is there an alternative design (maybe involving asyncio) that would better suit the purpose of my API?
Using Django, Celery (w/ Amazon SQS) on an EBS EC2 instance.

You could consider using Gevent with your Django webserver to allow it to operate efficiently for the "simple requests" you've mentioned without being blocked. If you proceed with this approach, be sure to pool database connections with PgBouncer or Pgpool-II or a Python library, since each greenlet will make its own connection.
Once you've implemented that, you can also use Gevent instead of Celery to handle asynchronous processing, by joining on multiple greenlets that each make an external API request, rather than incurring the overhead of passing messages to an external Celery worker.
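As a rough sketch of that joined-greenlets idea (the endpoint URLs and timeouts are placeholders, and it assumes the requests library is available):

import gevent
from gevent import monkey
monkey.patch_all()  # patch sockets so blocking HTTP calls yield to other greenlets

import requests

def fetch(url):
    # Each greenlet performs one blocking HTTP call; with the monkey patch
    # applied, the calls effectively run concurrently.
    return requests.get(url, timeout=10).json()

urls = [
    'https://crm.example.com/api/contacts',     # placeholder CRM endpoint
    'https://calendar.example.com/api/events',  # placeholder calendar endpoint
]
jobs = [gevent.spawn(fetch, url) for url in urls]
gevent.joinall(jobs, timeout=15)
results = [job.value for job in jobs if job.successful()]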
Your implementation is similar to what we've done at Kloudless, which provides a single API to access multiple other APIs, including CRM, calendar, storage, etc.

Related

Subscribe to a redis channel in Django project

I have multiple applications written in Node.js or Python/Django or ...
These services are working fine, but they need to have async pub/sub communication with each other.
In Node.js there is no problem: I can easily pub/sub to any Redis channel.
Question: My question is, how can I continuously subscribe to a Redis channel and receive data published by other services?
Note: many links suggest using django-channels, but I don't think that's the way to do it. If so, can anyone help me and give details on how to do it?
Update:
Django by default is not event-based like Node.js. So if I use a Redis client, I would, for example, have to check Redis every second to see whether anything has been published or not. I don't think using just a Redis client in Python will be enough.
Really appreciate it.
There are a lot of alternatives. If you have a FIFO requirement, you have to use queues in order to connect one microservice to another. In my view, if you don't have a big-data problem you can use RabbitMQ; it is very practical and very effective. If you do have a big-data problem, you can use Kafka. There is a wide variety of services.
If you just want pub/sub, the best tool is Redis: it is very fast and easy to integrate. If you are wondering how to implement it in Python, just look at an article on the topic.
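For reference, the publishing side is a one-liner with redis-py; the channel name and payload below are just illustrative:

import json
import redis

r = redis.StrictRedis(host='localhost', port=6379, db=1)
r.publish('topic.events', json.dumps({'event': 'user_created', 'id': 42}))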
[Update]
It's possible to create a manage.py command in Django, subscribe to Redis in that management command, and run the script separately from the Django server:
import redis
from django.core.management.base import BaseCommand

class Command(BaseCommand):
    def handle(self, *args, **options):
        r = redis.StrictRedis(host='localhost', port=6379, db=1)
        p = r.pubsub()
        p.psubscribe('topic.*')
        # listen() blocks and yields every pub/sub event, including the
        # initial psubscribe confirmation, so filter on the message type.
        for message in p.listen():
            if message['type'] == 'pmessage':
                print('message received, do anything you want with it.')
In order to handle subscriptions to Redis, you will need a separate, continuously running process (server) which listens to Redis and then does something with your data. django-channels would do the same by running the code in a worker.
As pointed out above, Django provides a convenient way to run such "servers" through the management command approach. When running a Django management command, you have complete access to your code, i.e. to the ORM as well.
The only detail is that you mentioned async communication. Here you need to take into account that Django's ORM is strictly synchronous code, so you need to pay attention to how you use the ORM together with async code. You probably need to clarify what you mean by async here.
As for processing the Redis messages themselves, you can use any library that works with Redis, for example aioredis or redis-py.
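If you do go async, a minimal subscriber sketch using the asyncio support that ships with recent redis-py versions (the channel pattern and handler are illustrative) could look like this:

import asyncio
import redis.asyncio as redis

async def listen():
    r = redis.Redis(host='localhost', port=6379, db=1)
    p = r.pubsub()
    await p.psubscribe('topic.*')
    async for message in p.listen():
        if message['type'] == 'pmessage':
            # Hand the payload to your own handler here; remember that plain
            # Django ORM calls are synchronous and would block the event loop.
            print(message['data'])

if __name__ == '__main__':
    asyncio.run(listen())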

Role of Kafka consumer: separate service or Django component?

I'm designing a web log analytics system.
I found an architecture with Django (back-end & front-end) + Kafka + Spark.
I also found a similar system at this link: http://thevivekpandey.github.io/posts/2017-09-19-high-velocity-data-ingestion.html with the architecture below.
But I'm confused about the role of the Kafka consumer. It will be a service, independent of Django, right?
So if I want to plot real-time data on a front-end chart, how do I attach it to Django?
It would seem ridiculous to place both the Kafka consumer and producer in Django: a request from the SDK comes to Django, is passed to a Kafka topic (producer), and returns to Django (consumer) for processing. Why don't we go directly? That looks simpler and better.
Please help me understand the role of the Kafka consumer: where should it belong, and how do I connect it to my front-end?
Thanks & best Regards,
Jame
The article mentions the use case without Kafka:
We saw that in times of peak load, data ingestion was not working properly: it was taking too long to connect to MongoDB and requests were timing out. This was leading to data loss.
So the main point of introducing Kafka and Kafka Consumer is to avoid too much load on DB layer and handle it gracefully with a messaging layer in between. To be honest, any message queue can be used in this case, not only Kafka.
The Kafka consumer can be part of the web layer. It wouldn't be optimal, though, because you want separation of concerns (which makes the system more reliable in case of failures) and the ability to scale things independently.
It's better to implement the Kafka consumer as a separate service if the concerns mentioned above (scalability and reliability) really matter and it's operationally feasible for you (because you now need to deploy, monitor, etc. a new service). In the end it's the classic monolith vs. microservices dilemma.
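As a rough illustration of the separate-service option, a standalone consumer process can be quite small; the topic name, broker address, and MongoDB collection below are assumptions, not taken from the article:

import json
from kafka import KafkaConsumer   # kafka-python
from pymongo import MongoClient

consumer = KafkaConsumer(
    'weblogs',                                  # hypothetical topic
    bootstrap_servers='localhost:9092',
    group_id='weblog-writers',
    value_deserializer=lambda v: json.loads(v.decode('utf-8')),
)
db = MongoClient('mongodb://localhost:27017')['analytics']

for record in consumer:
    # Each record is one ingested event; write it out at a pace the DB layer
    # can sustain, independently of the web layer.
    db.events.insert_one(record.value)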

Handling long requests

I'm working on a long request to a Django app (nginx reverse proxy, MySQL db, Celery-RabbitMQ-Redis setup) and have some doubts about the solution I should apply:
Functioning: one feature of the app allows users to migrate thousands of objects from one system to another. Each migration is logged in a db, and users can retrieve the history of the migration in CSV format: which objects have been migrated, and with which status (success, errors, ...).
To get the history, a GET request is sent to a Django view, which returns the download response after serialization and rendering into CSV.
Problem: the serialization and rendering processes are quite long for a large set of objects (e.g. 160,000) and the request times out.
Some solutions I was thinking about, or found through previous searches, are:
Increasing the timeout: easy, but I saw everywhere that this is a global nginx setting and would affect every request on the server.
Using an asynchronous task handled by Celery: the concept would be to make an initial request to the server, which would launch the serializing and rendering task with Celery and give a special HTTP response to the client. Then the client would regularly ask the server if the job is done, and the server would deliver the history at the end of processing. I like this one but I'm not sure how to implement it technically.
Creating and temporarily storing the CSV file on the server, and giving the user a way to access and download it. I'm not a big fan of that one.
So my question is: has anyone already faced a similar problem? Do you have advice on the technical implementation of solution #2, or a better solution to propose?
Thanks!
Clearly you should use Celery + RabbitMQ/Redis. If you look at the docs, it's not that hard to set up.
The first question is whether to use RabbitMQ or Redis. There are many SO questions about this with good information about pros/cons.
The implementation in Django is really simple. You can just wrap Django functions with Celery tasks (with the @task decorator) and they'll become async, so this is the easy part.
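For the poll-for-result pattern you describe in solution #2, a minimal sketch could look like the following (build_history_csv is a hypothetical helper standing in for your existing serialization/rendering code):

from celery import shared_task
from celery.result import AsyncResult
from django.http import JsonResponse

@shared_task
def export_history(migration_id):
    # Do the slow serialization/rendering in the background and return
    # something small, e.g. a path or URL where the finished CSV can be fetched.
    return build_history_csv(migration_id)  # hypothetical helper

def start_export(request, migration_id):
    # Initial request: launch the task and hand the client a task id.
    result = export_history.delay(migration_id)
    return JsonResponse({'task_id': result.id}, status=202)

def export_status(request, task_id):
    # The client polls this view until the task reports it is done.
    result = AsyncResult(task_id)
    payload = {'state': result.state}
    if result.ready():
        payload['result'] = result.result
    return JsonResponse(payload)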
The problem I see in your project is that the server handling HTTP traffic is the same server running the long process. That can affect performance and user experience even if Celery is running in the background. Of course, that depends on how much traffic you are expecting on that machine and how many migrations can run at the same time.
One of the things you configure in Celery is the number of workers (concurrent processing units) available, so the number of cores in your machine will matter.
If you need to handle HTTP calls quickly, I would suggest delegating the migration process to another machine. Celery/Redis can be configured that way. Let's say you've got two servers: one would handle only normal Django calls (no Celery) and trigger Celery tasks on the other server (the one that actually runs the migration process). Both servers can connect to the same database.
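One possible way to wire that up is to route the heavy task to a dedicated queue that only the second machine's worker consumes; the broker URL, task path, and queue name below are placeholders:

# celery.py, shared by both servers
from celery import Celery

app = Celery('myproject', broker='redis://broker-host:6379/0')
app.conf.task_routes = {
    'migrations.tasks.run_migration': {'queue': 'migrations'},
}

# On the dedicated machine only, start a worker bound to that queue:
#   celery -A myproject worker -Q migrations --concurrency=4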
But this is just an infrastructure optimization and you may not need it.
I hope this answers your question. If you have specific Celery issues it would be better to create another question.

Not sure if I should use celery

I have never used celery before and I'm also a django newbie so I'm not sure if I should use celery in my project.
Brief description of my project:
There is an API for sending (via SSH) jobs to scientific computation clusters. The API is an abstraction over the different scientific job queue vendors out there. http://saga-project.github.io/saga-python/
My project is basically about doing a web GUI for this API with django.
So, my concern is that, if I use celery, I would have a queue in the local web server and another one in each of the remote clusters. I'm afraid this might complicate the implementation needlessly.
The API is still in development and some of the features aren't fully finished. There is a function for checking the state of the remote job execution (running, finished, etc.) but the callback support for state changes is not ready. Here is where I think celery might be appropriate. I would have one or several periodic task(s) monitoring the job states.
Any advice on how to proceed please? No celery at all? celery for everything? celery just for the job states?
I use Celery for a similar purpose and it works well. Basically, I have one node running Celery workers that manage the entire cluster. These workers generate input data for the cluster nodes, assign tasks, and process the results for reporting or for generating dependent tasks.
Each cluster node runs a very small Python server which takes a db id of its assigned job. It then calls into the main (HTTP) server to request the data it needs and finally posts the data back when complete. In my case, the individual nodes don't need to message each other, and the run time of each task is very long (hours). This makes the delays introduced by central management and polling insignificant.
It would be possible to run a celery worker on each node taking tasks directly from the message queue. That approach is appealing. However, I have complex dependencies that are easier to work out from a centralized control. Also, I sometimes need to segment the cluster and centralized control makes this possible to do on the fly.
Celery isn't good at managing priorities or recovering lost tasks (more reasons for central control).
Thanks for calling my attention to SAGA. I'm looking at it now to see if it's useful to me.
Celery is useful for executing tasks that are too expensive to run in the handler of an HTTP request (i.e. a Django view). Consider making an HTTP request from a Django view to some remote web server, and think about the latencies, possible timeouts, time for data transfer, etc. It also makes sense to queue computation-intensive, long-running tasks for background execution with Celery.
We can only guess what the web GUI for the API should do. However, Celery fits very well for queuing requests to scientific computation clusters. It also lets you track the state of background tasks and their results.
I do not understand your concern about having many queues on different servers. You can have Django, the Celery broker (implementing the queues for tasks), and the worker processes (consuming the queues and executing Celery tasks) all on the same server.
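For the "celery just for the job states" option, a periodic polling task is only a few lines; Job and query_remote_state below are hypothetical stand-ins for your model and your SAGA wrapper:

from celery import Celery

app = Celery('jobmonitor', broker='redis://localhost:6379/0')
app.conf.beat_schedule = {
    'poll-job-states': {
        'task': 'tasks.poll_job_states',
        'schedule': 60.0,  # poll once a minute
    },
}

@app.task(name='tasks.poll_job_states')
def poll_job_states():
    # Ask the cluster (via the SAGA API) for the current state of every job
    # we still consider active, and persist any changes.
    for job in Job.objects.filter(state__in=['PENDING', 'RUNNING']):  # hypothetical model
        job.state = query_remote_state(job.remote_id)                 # hypothetical SAGA wrapper
        job.save(update_fields=['state'])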

Django and Celery Confusion

After reading a lot of blog posts, I decided to switch from crontab to Celery for my medium-scale Django project. There are a few things I didn't understand:
1- I'm planning to start a micro EC2 instance dedicated to RabbitMQ. Would this be sufficient for small-to-medium task loads (such as dispatching periodic e-mails to Amazon SES)?
2- Does the computation of tasks occur on the Django server or the RabbitMQ server (assuming RabbitMQ is on a separate server)?
3- When I need to grow my system and have two or more application servers behind a load balancer, do these Celery machines need to connect to the same RabbitMQ vhost? Assume the application servers are carbon copies, the tasks are the same, and everything is in sync at the database level.
I don't know the answer to this question, but you can definitely configure it to be suitable (e.g. use -c 1 for a single-process worker to avoid using much memory, or use the eventlet/gevent pools); see also the --autoscale option. The choice of broker transport also matters here: the ones that do not poll are more CPU-efficient (RabbitMQ/Redis/Beanstalk).
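For illustration only, these are the kinds of worker invocations being referred to (the numbers are arbitrary and should be tuned to your load):

celery -A myproject worker -c 1               # single-process worker, minimal memory
celery -A myproject worker --autoscale=10,2   # scale between 2 and 10 processes
celery -A myproject worker -P gevent -c 100   # gevent pool for I/O-bound tasks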
Computation happens on the workers; the broker is only responsible for accepting, routing, and delivering messages (and persisting messages to disk when necessary).
To add additional workers, these should indeed connect to the same virtual host. You would only use separate virtual hosts if you wanted your applications to have separate message buses.
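For example (the broker host, credentials, and vhost name below are placeholders), every application server would point its Celery app at the same vhost:

from celery import Celery

# Both application servers use the same broker URL, so they share one set of queues.
app = Celery('myproject', broker='amqp://user:password@rabbit-host:5672/myvhost')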