copying rather than modifying a job (APScheduler) - python-2.7

I'm writing a database-driven application with APScheduler (v3.0.0). Especially during development, I find myself frequently wanting to command a scheduled job to start running now without affecting its subsequent schedule.
It's possible to do this at job creation time, of course:
def dummy_job(arg):
    pass
sched.add_job(dummy_job, trigger='interval', hours=3, args=(None,))
sched.add_job(dummy_job, trigger=None, args=(None,))
However, if I already have a job scheduled with an interval or date trigger...
>>> sched.print_jobs()
Jobstore default:
job1 (trigger: interval[3:00:00], next run at: 2014-08-19 18:56:48 PDT)
... there doesn't seem to be a good way to tell the scheduler "make a copy of this job which will start right now." I've tried sched.reschedule_job(trigger=None), which schedules the job to start right now, but removes its existing trigger.
There's also no obvious, simple way to duplicate a job object while preserving its args and any other stateful properties. The interface I'm imagining is something like this:
sched.dup_job(id='job1', new_id='job2')
sched.reschedule_job('job2', trigger=None)
Clearly, APScheduler already contains an internal mechanism to copy job objects since repeated calls to get_job don't return the same object (that is, (sched.get_job(id) is sched.get_job(id))==False).
Has anyone else come up with a solution here? I'm thinking of posting a suggestion on the developers' site if not.

As you've probably figured out by now, that phenomenon is caused by the job stores instantiating jobs on the fly based on data loaded from the back end. To run a copy of a job immediately, this should do the trick:
job = sched.get_job(id)
sched.add_job(job.func, args=job.args, kwargs=job.kwargs)
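The two lines above can be wrapped in a small helper. This is only a sketch: the name run_copy_now and its new_id parameter are hypothetical, and it assumes an APScheduler 3.x scheduler where get_job returns None for an unknown id and add_job without a trigger runs the job once, right away.

```python
def run_copy_now(sched, job_id, new_id=None):
    """Queue a one-off copy of an existing job, leaving the original untouched.

    `sched` is assumed to be an APScheduler 3.x scheduler instance.
    """
    job = sched.get_job(job_id)
    if job is None:
        raise LookupError("no job with id %r" % (job_id,))
    # No trigger is passed, so the copy is scheduled to run once, immediately.
    # The original job keeps its own trigger and next run time.
    return sched.add_job(job.func, args=job.args, kwargs=job.kwargs, id=new_id)
```

Because the copy is a separate job, rescheduling or removing it later does not affect the interval trigger of the original.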

Airflow: How to create Dynamic SubDags

This is not my exact circumstance, but it does explain the circumstances of my issue.
Assume an AWS S3 bucket contains an unknown number of files. I have already written operators that are capable of performing the tasks I need on an individual file, and my goal is to parallelize the process. Ideally, I want an operator that inherits from SubDagOperator and accomplishes something similar to the following:
def fn_generate_operator_for_s3_file(s3_file):
    task_id = unique_task_id_for_s3_file(s3_file)
    return MyS3FileActionOperator(task_id=task_id, s3_file=s3_file)

class AwsS3BucketMapOperator(SubDagOperator):
    def __init__(self, aws_s3_bucket_config, fn_generate_operator_for_s3_file, **kwargs):
        # Disregard implementation, just know that it retrieves the bucket
        aws_s3_bucket = get_aws_s3_bucket(aws_s3_bucket_config)
        with DAG(subdag_name, ....) as subdag:
            for s3_file in aws_s3_bucket:
                operator_task = fn_generate_operator_for_s3_file(s3_file)
                # operator_task should be added to subdag implicitly due to the `with` context manager statement
        super(AwsS3BucketMapOperator, self).__init__(subdag=subdag, **kwargs)
In essence, I want to be able to map an arbitrary operator that is known to be able to handle an S3 file across all files (or some filtered set of files) in an S3 bucket, using some operator_generator callable that is passed to the Map Operator in order to actually instantiate the subdag operators.
Caveats: My understanding of how DAGs are discovered is that the __init__ methods of all Operator instances in a DAG are run before the actual execution phase of the DAG itself, and that this discovery process is repeated continuously.
There are cases where the process of actually gathering the configuration needed to accurately determine what set of subdag Operators need to be generated is computationally expensive.
Ideally I'd like to have the generation process of the subdag only be run once, and the only way I could see that is if the generation of the subdag occurs in the execute() method of the Map Operator class. Doing this however results in a situation where the subdag is not found in the DagBag, and thus fails to run. Is there any way around this?
If there is no way to programmatically determine the contents of the subdag at execution time, are there ways to limit how often the expensive operations needed to generate the subdag are run?
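One possible way to limit how often the expensive step runs is to cache its result on disk and reuse it while it is still fresh. The sketch below is not an Airflow feature; cached_listing, cache_path, fetch and max_age_seconds are hypothetical names, and it assumes the expensive result can be serialised as JSON.

```python
import json
import os
import time


def cached_listing(cache_path, fetch, max_age_seconds=3600):
    """Return fetch()'s result, reusing a JSON cache file while it is fresh."""
    try:
        age = time.time() - os.path.getmtime(cache_path)
        if age < max_age_seconds:
            with open(cache_path) as f:
                return json.load(f)
    except (OSError, ValueError):
        pass  # cache missing or unreadable: fall through and rebuild it
    result = list(fetch())  # the expensive call happens only here
    tmp_path = cache_path + ".tmp"
    with open(tmp_path, "w") as f:
        json.dump(result, f)
    # Swap the file into place so readers never see a half-written cache.
    os.replace(tmp_path, cache_path)
    return result
```

With something like this, code that runs on every DAG parse would pay the full cost only when the cache has expired. The trade-off is that the subdag contents can lag behind the bucket by up to max_age_seconds.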

Should I have concern about datastoreRpcErrors?

When I run Dataflow jobs that write to Google Cloud Datastore, I sometimes see the metrics report one or two datastoreRpcErrors.
Since these Datastore writes usually contain a batch of keys, I am wondering whether a retry happens automatically when an RpcError occurs. If not, what would be a good way to handle these cases?
tl;dr: By default, a commit that fails with a datastoreRpcError is retried automatically up to 5 times.
I dug into the code of datastoreio in the Beam Python SDK. It looks like the final entity mutations are flushed in batches via DatastoreWriteFn().
# Flush the current batch of mutations to Cloud Datastore.
_, latency_ms = helper.write_mutations(
    self._datastore, self._project, self._mutations,
    self._throttler, self._update_rpc_stats,
    throttle_delay=_Mutate._WRITE_BATCH_TARGET_LATENCY_MS/1000)
The RPCError is caught by this block of code in write_mutations in the helper. The commit method is decorated with @retry.with_exponential_backoff, the default number of retries is set to 5, and retry_on_rpc_error defines the concrete RPCError and SocketError reasons that trigger a retry.
for mutation in mutations:
    commit_request.mutations.add().CopyFrom(mutation)

@retry.with_exponential_backoff(num_retries=5,
                                retry_filter=retry_on_rpc_error)
def commit(request):
    # Client-side throttling.
    while throttler.throttle_request(time.time()*1000):
        ...
    try:
        response = datastore.commit(request)
        ...
    except (RPCError, SocketError):
        if rpc_stats_callback:
            rpc_stats_callback(errors=1)
        raise
    ...
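To make the retry behavior concrete, here is a simplified stand-in for a decorator of this kind. It is not Beam's implementation: the initial_delay, factor and sleep parameters are assumptions added for illustration, and only num_retries and retry_filter mirror the names used in the excerpt above.

```python
import functools
import time


def with_exponential_backoff(num_retries=5, initial_delay=1.0, factor=2.0,
                             retry_filter=lambda exc: True, sleep=time.sleep):
    """Retry the wrapped call up to num_retries times, growing the wait each time."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            delay = initial_delay
            for attempt in range(num_retries + 1):
                try:
                    return fn(*args, **kwargs)
                except Exception as exc:
                    # Give up once retries are exhausted or the error is not retryable.
                    if attempt == num_retries or not retry_filter(exc):
                        raise
                    sleep(delay)
                    delay *= factor
        return wrapper
    return decorator
```

In this sketch a call that keeps failing is attempted once plus num_retries more times before the last exception propagates, which is the point where a pipeline would surface the error instead of silently absorbing it.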
I think you should first determine which kind of error occurred, in order to see what your options are.
The official Datastore documentation has a list of all the possible errors and their error codes. Fortunately, each comes with a recommended action.
My advice is to implement their recommendations and look for alternatives if they are not effective for you.

Airflow task to refer to multiple previous tasks?

Is there a way I can have a task require the completion of multiple upstream tasks which are still able to finish independently?
download_fcr --> process_fcr --> load_fcr
download_survey --> process_survey --> load_survey
create_dashboard should require load_fcr and load_survey to successfully complete.
I do not want to force anything in the 'survey' task chain to require anything from the 'fcr' task chain to complete. I want them to process in parallel and still complete even if one fails. However, the dashboard task requires both to finish loading to the database before it should start.
fcr    *-->*-->*
                \
                 ---> create_dashboard
                /
survey *-->*-->*
You can pass a list of tasks to set_upstream or set_downstream. In your case, if you specifically want to use set_upstream, you could describe your dependencies as:
create_dashboard.set_upstream([load_fcr, load_survey])
load_fcr.set_upstream(process_fcr)
process_fcr.set_upstream(download_fcr)
load_survey.set_upstream(process_survey)
process_survey.set_upstream(download_survey)
Have a look at Airflow's source code: even when you pass a single task object to set_upstream, it wraps it in a list before doing anything else.
The same dependencies can also be expressed with set_downstream:
download_fcr.set_downstream(process_fcr)
process_fcr.set_downstream(load_fcr)
download_survey.set_downstream(process_survey)
process_survey.set_downstream(load_survey)
load_survey.set_downstream(create_dashboard)
load_fcr.set_downstream(create_dashboard)
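As a toy illustration of the list handling described above, the following sketch models tasks as plain objects. The Task class is hypothetical and is not Airflow's operator code; it only shows how accepting either one task or a list lets create_dashboard depend on both chains while the chains stay independent of each other.

```python
class Task(object):
    """Toy stand-in for an Airflow operator; it only tracks dependency ids."""

    def __init__(self, task_id):
        self.task_id = task_id
        self.upstream_ids = set()
        self.downstream_ids = set()

    def set_upstream(self, tasks):
        # Accept a single task or a list by normalising to a list first.
        if not isinstance(tasks, (list, tuple)):
            tasks = [tasks]
        for task in tasks:
            self.upstream_ids.add(task.task_id)
            task.downstream_ids.add(self.task_id)

    def set_downstream(self, tasks):
        if not isinstance(tasks, (list, tuple)):
            tasks = [tasks]
        for task in tasks:
            task.set_upstream(self)


download_fcr, process_fcr, load_fcr = (
    Task("download_fcr"), Task("process_fcr"), Task("load_fcr"))
download_survey, process_survey, load_survey = (
    Task("download_survey"), Task("process_survey"), Task("load_survey"))
create_dashboard = Task("create_dashboard")

# Two independent chains that only meet at create_dashboard.
create_dashboard.set_upstream([load_fcr, load_survey])
load_fcr.set_upstream(process_fcr)
process_fcr.set_upstream(download_fcr)
download_survey.set_downstream(process_survey)
process_survey.set_downstream(load_survey)

print(sorted(create_dashboard.upstream_ids))
```

Nothing in the survey chain lists an fcr task as upstream, so in this model a failure in one chain does not block the other; only create_dashboard waits on both.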

PBS Professional hook not updating Priority

I am trying to implement a hook to determine a job's priority upon entering the queue.
The hook is enabled, imported, and event type is "queuejob", so it is in place (like other hooks we have enabled). This hook however does not seem to alter a job's priority as I am expecting.
Here is a simplified example of how I'm trying to alter the Priority for a job:
import pbs

try:
    e = pbs.event()
    j = e.job
    if j.server == 'myserver':
        j.Priority = j.Priority + 50
    e.accept()
except SystemExit:
    pass
Whenever I submit a job after importing this hook and run 'qstat -f' on it, the Priority is always 0, whether I set it to another value in my qsub script or leave it at the default.
Thank you.
Couple of things I discovered:
It appears that PBS does not like using j.Priority in a calculation and assignment, so I had to use another internal variable (which was fine since I had one already for something else)
i.e.:
j.Priority = High_Priority
if pbs.server() == 'myserver':
    j.Priority = High_Priority + 50
Also, (as can be seen in the last example), j.server should actually be pbs.server().

How does Luigi's require work?

I am using Spotify's Luigi to handle dependencies between several jobs.
def requires(self):
    yield task1()
    info = retrieve_info()
    yield task2(info=info)
In my example, I'd like to require task1, then retrieve some information that depends on the execution of task1 and pass it as an argument to task2. However, my function retrieve_info won't work because task1 has not run yet.
My question is: since I am using yield, shouldn't task1 be processed before the call to retrieve_info is made? Or is Luigi iterating over the entire function first and only then launching the different tasks?
If this last assumption is right, how can I use the result of one required task as an input to a second required task?
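The behavior in question can be reproduced without Luigi. The sketch below is a toy model, not Luigi's scheduler: it assumes the consumer collects every yielded requirement before running any of them, and the names executed and requirements are hypothetical. The strings stand in for task objects.

```python
executed = []  # names of tasks that have actually run


def retrieve_info():
    # Stand-in for the question's helper: it only works once task1 has run.
    return "ready" if "task1" in executed else "missing"


def requires():
    yield "task1"
    # This line runs as soon as the consumer asks for the next item,
    # whether or not task1 has been executed in the meantime.
    info = retrieve_info()
    yield "task2(info=%s)" % info


# Model of "collect every requirement first, run the tasks afterwards".
requirements = list(requires())
for name in requirements:
    executed.append(name)

print(requirements)  # ['task1', 'task2(info=missing)']
```

Using yield only makes the function a generator; it does not pause until the yielded task has run. Luigi also documents dynamic dependencies, where a task yields further tasks from its run() method, which may fit this case because run() starts only after the declared requirements are complete.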