How does Luigi's require work? - python-2.7

I am using Spotify's Luigi tool to handle dependencies between several jobs.
def require(self):
    yield task1()
    info = retrieve_info()
    yield task2(info=info)
In my example, I'd like to require task1, then retrieve some information that depends on the execution of task1 and pass it as an argument to task2. However, my function retrieve_info doesn't work because task1 has not run yet.
My question is: since I am using yield, shouldn't task1 be processed before the call to retrieve_info is made? Or does Luigi iterate over the whole require function first and only then launch the processing of the different tasks?
If this last assumption is right, how can I use the result of one required task as an input to a second required task?
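The ordering problem described above can be reproduced without Luigi. The following is only a toy sketch in Python 3: toy_scheduler is a hypothetical stand-in, and it assumes the scheduler drains the whole generator before running any task, which matches the behaviour described in the question. The tasks are plain strings here, not real Luigi tasks.

```python
events = []

def retrieve_info():
    # Hypothetical helper: it only records the moment it is called.
    events.append("retrieve_info called")
    return "info"

def require():
    yield "task1"
    info = retrieve_info()
    yield "task2(info=%s)" % info

def toy_scheduler(requirements):
    # Assumption: every requirement is collected before any task runs.
    tasks = list(requirements)
    for task in tasks:
        events.append("run " + task)

toy_scheduler(require())
print(events)
```

Under that assumption, retrieve_info is recorded before task1 runs, because consuming a generator executes all the code between its yield statements.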

Related

Running multiple functions

I think I have confused myself on how I should approach this.
I have a number of functions that I use to interact with an API, for example: get product ID, update product detail, update inventory. These calls need to be made one after another, and they are all wrapped up in one function, api.push().
Let's say I need to run api.push() 100 times, for 100 product IDs.
What I want to do is run many api.push() calls at the same time to speed up the processing. For example, let's say I want to run 5 at a time.
I am confused as to whether this is multiprocessing or threading, or neither. I tried both but they didn't seem to work. For example, I have this:
jobs = []
for n in range(0, 4):
    print "adding a job %s" % n
    p = multiprocessing.Process(target=api.push())
    jobs.append(p)
# Starts threads
for job in jobs:
    job.start()
for job in jobs:
    job.join()
Any guidance would be appreciated
Thanks
Please read the Python documentation and do some research on the global interpreter lock to see whether you should use threading or multiprocessing in your situation.
I do not know the inner workings of api.push, but please note that you should pass a function reference to multiprocessing.Process.
Using p = multiprocessing.Process(target=api.push()) calls api.push() immediately in the parent process and passes whatever it returns as the function to be called in the subprocess.
If api.push is the function to be called in the subprocess, you should use p = multiprocessing.Process(target=api.push) instead, as it passes a reference to the function rather than a reference to the result of the function.
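For the "5 at a time" requirement, a worker pool is one possible approach. This is a minimal Python 3 sketch, not a drop-in solution: push is a hypothetical stand-in for api.push, and it assumes each call takes a product ID and is mostly waiting on the network, in which case threads are usually sufficient.

```python
from concurrent.futures import ThreadPoolExecutor

def push(product_id):
    # Hypothetical stand-in for api.push; the real function would call the API.
    return product_id * 2

def push_all(product_ids, max_workers=5):
    # Pass the function itself (push), not the result of calling it (push()).
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() preserves input order and runs at most max_workers calls at once.
        return list(pool.map(push, product_ids))

if __name__ == "__main__":
    results = push_all(range(100))
    print(len(results))
```

If the work turns out to be CPU-bound, multiprocessing.Pool offers a similar map() interface that uses processes instead of threads.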

Airflow task to refer to multiple previous tasks?

Is there a way I can have a task require the completion of multiple upstream tasks which are still able to finish independently?
download_fcr --> process_fcr --> load_fcr
download_survey --> process_survey --> load_survey
create_dashboard should require load_fcr and load_survey to successfully complete.
I do not want to force anything in the 'survey' task chain to require anything from the 'fcr' task chain to complete. I want them to process in parallel and still complete even if one fails. However, the dashboard task requires both to finish loading to the database before it should start.
fcr    *-->*-->*
                \
                 ---> create_dashboard
                /
survey *-->*-->*
You can pass a list of tasks to set_upstream or set_downstream. In your case, if you specifically want to use set_upstream, you could describe your dependencies as:
create_dashboard.set_upstream([load_fcr, load_survey])
load_fcr.set_upstream(process_fcr)
process_fcr.set_upstream(download_fcr)
load_survey.set_upstream(process_survey)
process_survey.set_upstream(download_survey)
Have a look at Airflow's source code: even when you pass just one task object to set_upstream, it wraps it in a list before doing anything. The same dependencies can also be written with set_downstream:
download_fcr.set_downstream(process_fcr)
process_fcr.set_downstream(load_fcr)
download_survey.set_downstream(process_survey)
process_survey.set_downstream(load_survey)
load_survey.set_downstream(create_dashboard)
load_fcr.set_downstream(create_dashboard)
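To check what ordering these dependencies imply, the same graph can be written down without Airflow. The sketch below uses the standard library's graphlib module (Python 3.9+) purely as an illustration; it is not how Airflow schedules tasks and it says nothing about failure handling. The task names are copied from the question as plain strings.

```python
from graphlib import TopologicalSorter

# Each key maps to the set of tasks that must finish before it (its upstream tasks).
upstream = {
    "process_fcr": {"download_fcr"},
    "load_fcr": {"process_fcr"},
    "process_survey": {"download_survey"},
    "load_survey": {"process_survey"},
    "create_dashboard": {"load_fcr", "load_survey"},
}

# static_order() yields the tasks in an order that respects every dependency.
order = list(TopologicalSorter(upstream).static_order())
print(order)
```

The two chains have no edges between them, so they may interleave in any way, but create_dashboard always comes after both load tasks.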

Play Framework 2.4 Sequential run of multiple Promises

I have got a Play 2.4 (Java-based) application with some background Akka tasks implemented as functions returning Promise.
Task1 downloads bank statements via the bank's REST API.
Task2 processes the statements and pairs them with customers.
Task3 does some other processing.
Task2 cannot run before Task1 finishes its work. Task3 cannot run before Task2. I tried to run them through a sequence of Promise.map() calls like this:
protected F.Promise run() throws WebServiceException {
    return bankAPI.downloadBankStatements().map(
        result -> bankProc.processBankStatements().map(
            _result -> accounting.checkCustomersBalance()));
}
I was under the impression that the first map would wait until Task1 is done and then call Task2, and so on. When I look at the application (the tasks write some debug info to the log), I can see that the tasks are running in parallel.
I also tried Promise.flatMap() and Promise.sequence() with no luck. The tasks always run in parallel.
I know that Play is non-blocking by nature, but in this situation I really need to do things in the right order.
Is there any general practice for running multiple Promises in a chosen order?
You're nesting the second call to map, which means what's happening here is
processBankStatements
checkCustomerBalance
downloadBankStatements
Instead, you need to chain them:
protected F.Promise run() throws WebServiceException {
    return bankAPI.downloadBankStatements()
        .map(statements -> bankProc.processBankStatements())
        .map(processedStatements -> accounting.checkCustomersBalance());
}
I notice you're not using result or _result (which I've renamed for clarity) - is that intentional?
All right, I found a solution. The correct answer is:
If you are chaining multiple Promises the way I do, that is, the function passed to map() itself returns another Promise, you should follow these rules:
If the mapping function returns a non-future value, just use map().
If the mapping function returns another future, use flatMap().
The correct code snippet for my case is then:
return bankAPI.downloadBankStatements().flatMap(result -> {
    return bankProc.processBankStatements().flatMap(_result -> {
        return accounting.checkCustomersBalance().map(__result -> {
            return null;
        });
    });
});
This solution was suggested to me a long time ago, but it did not work at first. The problem was that I had a hidden Promise.map() inside the function downloadBankStatements(), so the chain of flatMaps was broken.
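The difference between the two rules can be seen in a toy model. The sketch below is written in Python and is not Play's F.Promise: the Promise class, flat_map, download and process are all hypothetical names, and the promise is already resolved, so it only illustrates the shape of the result, not the asynchronous timing.

```python
class Promise:
    """Toy, already-resolved promise used only to show result shapes."""

    def __init__(self, value):
        self.value = value

    def map(self, fn):
        # Wraps whatever fn returns, even if that is itself a Promise.
        return Promise(fn(self.value))

    def flat_map(self, fn):
        # fn must return a Promise; it is returned as-is, so nothing is nested.
        return fn(self.value)

def download():
    return Promise("statements")

def process(statements):
    return Promise(statements + " processed")

nested = download().map(process)      # a Promise holding another Promise
flat = download().flat_map(process)   # a Promise holding the final value
print(type(nested.value).__name__, "/", flat.value)
```

With map, the outer promise completes as soon as the inner promise object has been created, not when the inner work is done, which is consistent with the tasks appearing to run in parallel.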

copying rather than modifying a job (APScheduler)

I'm writing a database-driven application with APScheduler (v3.0.0). Especially during development, I find myself frequently wanting to command a scheduled job to start running now without affecting its subsequent schedule.
It's possible to do this at job creation time, of course:
def dummy_job(arg):
    pass
sched.add_job(dummy_job, trigger='interval', hours=3, args=(None,))
sched.add_job(dummy_job, trigger=None, args=(None,))
However, if I already have a job scheduled with an interval or date trigger...
>>> sched.print_jobs()
Jobstore default:
    job1 (trigger: interval[3:00:00], next run at: 2014-08-19 18:56:48 PDT)
... there doesn't seem to be a good way to tell the scheduler "make a copy of this job which will start right now." I've tried sched.reschedule_job(trigger=None), which schedules the job to start right now, but removes its existing trigger.
There's also no obvious, simple way to duplicate a job object while preserving its args and any other stateful properties. The interface I'm imagining is something like this:
sched.dup_job(id='job1', new_id='job2')
sched.reschedule_job('job2', trigger=None)
Clearly, APScheduler already contains an internal mechanism to copy job objects since repeated calls to get_job don't return the same object (that is, (sched.get_job(id) is sched.get_job(id))==False).
Has anyone else come up with a solution here? I'm thinking of posting a suggestion on the developers' site if not.
As you've probably figured out by now, that phenomenon is caused by the job stores instantiating jobs on the fly based on data loaded from the back end. To run a copy of a job immediately, this should do the trick:
job = sched.get_job(id)
sched.add_job(job.func, args=job.args, kwargs=job.kwargs)

How should I implement callback for taskset in celery

Question
I use celery to launch task sets that look like this:
1. I perform a batch of tasks that can be run in parallel; the number of tasks in this batch varies from tens to a couple of thousand.
2. I aggregate the results of these tasks into a single answer, then do something with this answer, like storing it in the database, saving it to a special result file and so on. Basically, after the tasks are done executing, I have to call a function with one of the following signatures:
def callback(result_file_name, task_result_list):
    # store in file
    pass
def callback(entity_key, task_result_list):
    # store in db
    pass
For now, step 1 is done in a Celery queue and step 2 is done outside Celery:
tasks = []
# add tasks to tasks list
task_group = group()
task_group.tasks = tasks
result = task_group.apply_async()
res = result.join()
# Aggregate results
# Save results to file, database whatever
This approach is cumbersome since I have to block a single thread until all tasks are performed (which can take a couple of hours).
I would like to somehow move step 2 to Celery as well. Essentially, I would need to add a callback to the entire taskset (as far as I know this is unsupported in Celery) or submit a task that is executed after all these subtasks.
Does anyone have an idea how to do it? I use it in a Django environment, so I can store some state in the database.
To sum up my recent findings
Chords won't do
I can't use chords in a straightforward way, because chords only let me create callbacks that look like this:
def callback(task_result_list):
    # store in file
    pass
There is no obvious way to pass additional parameters to the callback (especially because these callbacks can't be local functions).
Using the database won't do either
I can store results using TaskSetMeta, but this entity has no status field, so even if I added a signal to TaskSetMeta I'd have to poll task results, which could have significant overhead.
Well, the answer was really straightforward: I can indeed use chords, and additional parameters (like the report file name and so on) must be passed as kwargs.
Here is the chord task:
#task
def print_and_sum(to_sum, file_name):
    print file_name
    print sum(to_sum)
    return file_name, sum(to_sum)
Here is how to instantiate it:
subtasks = [...]
result = chord(subtasks)(print_and_sum.subtask(kwargs={'file_name' : 'report_file.csv'}))