Running multiple functions - python-2.7

I think I have confused myself on how I should approach this.
I have a number of functions that I use to interact with an api, for example get product ID, update product detail, update inventory. These calls need to be done one after another, and are all wrapped up in one function api.push().
Let's say I need to run api.push() 100 times, 100 product IDs
What I want to do is run many api.push at the same time, so that I can speed up the processing of my. For example, lets say I want to run 5 at a time.
I am confused to whether this is multiprocessing or threading, or neither. I tried both but they didn't seem to work, for example I have this
jobs = []
for n in range(0, 4):
print "adding a job %s" % n
p = multiprocessing.Process(target=api.push())
jobs.append(p)
# Starts threads
for job in jobs:
job.start()
for job in jobs:
job.join()
Any guidance would be appreciated
Thanks

Please read the python doc and do some research on the global interpreter lock to see whether you should use threading or multiprocessing in your situation.
I do not know the inner workings of api.push, but please note that you should pass a function reference to multiprocessing.Process.
Using p = multiprocessing.Process(target=api.push()) will pass whatever api.push() returns as the function to be called in the subprocesses.
if api.push is the function to be called in the subprocess, you should use p = multiprocessing.Process(target=api.push) instead, as it passes a reference to the function rather than a reference to the result of the function.

Related

It's possible to perform db operations asynchronously in django?

I'm writing a command to randomly create 5M orders in a database.
def constrained_sum_sample(
number_of_integers: int, total: Optional[int] = 5000000
) -> int:
"""Return a randomly chosen list of n positive integers summing to total.
Args:
number_of_integers (int): The number of integers;
total (Optional[int]): The total sum. Defaults to 5000000.
Yields:
(int): The integers whose the sum is equals to total.
"""
dividers = sorted(sample(range(1, total), number_of_integers - 1))
for i, j in zip(dividers + [total], [0] + dividers):
yield i - j
def create_orders():
customers = Customer.objects.all()
number_of_customers = Customer.objects.count()
for customer, number_of_orders in zip(
customers,
constrained_sum_sample(number_of_integers=number_of_customers),
):
for _ in range(number_of_orders):
create_order(customer=customer)
number_of_customers will be at least greater than 1k and the create_order function does at least 5 db operations (one to create the order, one to randomly get the order's store, one to create the order item (and this can go up to 30, also randomly), one to get the item's product (or higher but equals to the item) and one to create the sales note.
As you may suspect this take a LONG time to complete. I've tried, unsuccessfully, to perform these operations asynchronously. All of my attempts (dozen at least; most of them using sync_to_async) have raised the following error:
SynchronousOnlyOperation you cannot call this from an async context - use a thread or sync_to_async
Before I continue to break my head, I ask: is it possible to achieve what I desire? If so, how should I proceed?
Thank you very much!
Not yet supported but in development.
Django 3.1 has officially asynchronous support for views and middleware however if you try to call ORM within async function you will get SynchronousOnlyOperation.
if you need to call DB from async function they have provided helpers utils like:
async_to_sync and sync_to_async to change between threaded or coroutine mode as follows:
from asgiref.sync import sync_to_async
results = await sync_to_async(Blog.objects.get, thread_sensitive=True)(pk=123)
If you need to queue call to DB, we used to use tasks queues like celery or rabbitMQ.
By the way if you really know what you are doing you can call it but on your responsibility
just turn off the Async safety but watch out for data lost and integrity errors
#settings.py
DJANGO_ALLOW_ASYNC_UNSAFE=True
The reason this is needed in Django is that many libraries, specifically database adapters, require that they are accessed in the same thread that they were created in. Also a lot of existing Django code assumes it all runs in the same thread, e.g. middleware adding things to a request for later use in views.
More fun news in the release notes:
https://docs.djangoproject.com/en/3.1/topics/async/
It's possible to achieve what you desire, however you need a different perspective to solve this problem.
Try using asynchronous workers, and a simple one would be rq workers or celery.
Use one of these libraries to process async long-running tasks defined in django in different threads or processes.
you can use bulk_create() to create large number of objects , this will speed up the process , additionally put the bulk_create() under a separate thread.

Is it possible to put a function in timed loop using django-background-task

Say i want to execute a function every 5 minutes without using cron job.
What i think of doing is create a django background task which actually calls that function and at the end of that function, i again create that task with schedule = say 60*5.
this effectively puts the function in a time based loop.
I tried a few iterations, but i am getting import errors. But is it possible to do or not?
No It's not possible in any case as it will effectively create cyclic import problems in django. Because in tasks you will have to import that function and in the file for that function, you will have to import tasks.
So no whatever strategy you take, you are gonna land into the same problem.
I made something like. Are you looking for this?
import threading
import time
def worker():
"""do your stuff"""
return
threads = list()
while (true):
time.sleep(300)
t = threading.Thread(target=worker)
threads.append(t)
t.start()

How does Luigi's require work?

I am using Luigi tool of Spotify to handle dependencies between several jobs.
def require(self):
yield task1()`
info = retrieve_info()
yield task2(info=info)
In my example, I'd like to require from task1, then retrieve some information that depends on the execution of task1 in order to pass it as an argument of task2. However, my function retrieve_info won't work because task1 has not runned yet.
My question is, since I am using yield, task1 should not process before the call of retrieve_info is made? Is Luigi iterating over the required function and then launching the processing of the different task?
If this last assumption is right, how can I use execution of a required task as an input of a second required class?

Getting task_ids for all tasks created with celery chord

My goal is to retrieve all the task_ids from a django celery chord call so that I can revoke the tasks later if needed. However, I cannot figure out the correct method to retrieve the task ids. I execute the chord as:
c = chord((loadTask.s(i) for i in range(0, num_lines, CHUNK_SIZE)), finalizeTask.si())
task_result = c.delay()
# get task_ids
I examined the task_result's children variable, but it is None.
I can manual create the chord semantics by using a group and another task as follows, and retrieve the associated task_ids, but I do not like breaking up the call. When this code is run within a task as subtasks, it can cause the main task to hang when the group is revoked before the finalize task begins.
g = group((loadTask.s(i) for i in range(0, num_lines, CHUNK_SIZE)))
task_result = g.delay()
storeTaskIds(task_result.children)
task_result.get()
task_result2 = self.finalizeTask.delay()
storeTaskIds([task_result2.task_id])
Any thoughts would be appreciated!
I'm trying to do something similar, I was hoping I could just revoke the chord with one call and everything within it would be recursively revoked for me.
You could make a chord out of the group and your finalizeTask to keep from breaking up the calls.
I realize this is coming two months after you asked, but maybe it'll help someone and maybe I should just get the task ids of everything in my group.

How should I implement callback for taskset in celery

Question
I use celery to launch task sets that look like this:
I perform a batch of tasks that can be run in parallel, number of tasks in this batch varies from tens to couple thousands.
I aggregate results of these tasks into single answer, then do something with this answer --- like store to the database, save to special result file and so on. Basically after tasks done executing I have to call function that has following signature:
def callback(result_file_name, task_result_list):
#store in file
def callback(entity_key, task_result_list):
#store in db
For now step 1. is done in Celery queue and step 2 is done outside celery:
tasks = []
# add taksks to tasks list
task_group = group()
task_group.tasks = tasks
result = task_group.apply_async()
res = result.join()
# Aggregate results
# Save results to file, database whatever
This approach is cumbersome since I have to stop a single thread until all tasks are performed (which can take couple of hours).
I would like to somehow move step 2 to celery also --- esentially I would need to add a callback to entire taskset (as far as I know it is unsupported in Celery) or submit a task that is executed after all these subtasks.
Does anyone have idea how to do it? I use it in the django enviorment so I can store some state in the database.
To sum up my recent findings
Chords won't do
I'cant use chords straight forwardly because chords enable me to create callbacks that look this way:
def callback(task_result_list):
#store in file
there is no obvious way to pass additional parameters to callback (especially because these callbacks can't be local functions).
Using the database either
I can store results using TaskSetMeta but this entity has no status field --- so even if I would add a signal to TaskSetMeta i'd have to pool task results which could have siginificant overhead.
Well answer was really straightforward, and I can indeed use chords --- and additional parameters (like report file name and so on) must be passed as kwargs.
Here is chord task:
#task
def print_and_sum(to_sum, file_name):
print file_name
print sum(to_sum)
return file_name, sum(to_sum)
Here is how to instantiate it:
subtasks = [...]
result = chord(subtasks)(print_and_sum.subtask(kwargs={'file_name' : 'report_file.csv'}))