Is it possible to perform faster backfill in Airflow? - airflow-scheduler

I have searched quite a lot about this but could not find any substantial information. My problem is that I have a DAG that should backfill data from March 2017.
So I have set start_date to '01-03-2017' and schedule_interval to daily. I know that my DAG will start running from March 2017 on that schedule, but if it follows the daily schedule one day at a time, it will take more than 2 years of runs to reach the current date.
I cannot wait 2 years to get the past data. I want my DAG to complete the backfill as soon as possible so that it catches up to the current date and then keeps scheduling daily. How can I achieve this? Can I set max_active_runs to some high number to schedule several DAG runs at the same time?

During a backfill, your DAG does not wait out the schedule in real time. It still creates one run per day for the past, but those runs execute concurrently until all the backfill runs are complete; only the execution date of each run is a date in the past. Once it reaches the current date, the DAG continues forward according to the schedule.
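To make the catch-up go as fast as your resources allow, you can indeed keep catchup enabled and raise max_active_runs. A minimal sketch, assuming a recent Airflow 2.x install and hypothetical DAG/task names:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.empty import EmptyOperator

# Hypothetical DAG; the settings that matter for a fast backfill are
# catchup=True (create a run for every missed daily interval since start_date)
# and max_active_runs (how many of those runs may execute at the same time).
with DAG(
    dag_id="march_2017_backfill",
    start_date=datetime(2017, 3, 1),
    schedule_interval="@daily",
    catchup=True,
    max_active_runs=16,
) as dag:
    EmptyOperator(task_id="load_one_day")
```

You can also trigger the historical runs explicitly from the CLI, e.g. airflow dags backfill -s 2017-03-01 -e 2019-06-01 march_2017_backfill on Airflow 2 (older versions use airflow backfill); either way the concurrency is bounded by max_active_runs, parallelism, and your pool settings.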

Related

Training job runtime exceeded MaxRuntimeInSeconds provided

I would like to train my model for 30 days using an AWS SageMaker training job, but its maximum runtime is 5 days. How can I resume from the earlier job and continue training?
Follow these steps:
1. Open a support ticket to increase the "Longest run time for a training job" quota to 2419200 seconds (28 days). (This quota cannot be raised through Service Quotas in the AWS web console.)
2. Using the SageMaker Python SDK, when creating an Estimator, set max_run=2419200.
3. Implement resume-from-checkpoint logic in your training script.
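A minimal sketch of steps 2 and 3 with the SageMaker Python SDK (v2), using a hypothetical training image, role, and S3 paths:

```python
from sagemaker.estimator import Estimator

# Hypothetical names and paths; the relevant arguments are max_run and the
# two checkpoint settings, which SageMaker keeps in sync with S3.
estimator = Estimator(
    image_uri="<account>.dkr.ecr.<region>.amazonaws.com/my-training-image:latest",
    role="arn:aws:iam::<account>:role/MySageMakerExecutionRole",
    instance_count=1,
    instance_type="ml.p3.2xlarge",
    max_run=2419200,                                          # 28 days; needs the quota increase above
    checkpoint_s3_uri="s3://my-bucket/my-job/checkpoints/",   # where checkpoints are persisted
    checkpoint_local_path="/opt/ml/checkpoints",              # where the container reads/writes them
)
estimator.fit({"training": "s3://my-bucket/my-dataset/"})
```

The training script itself then has to look for an existing checkpoint under /opt/ml/checkpoints at startup and resume from it, so a restarted job picks up where the previous one left off.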
Also, the questions in #rok's answer are very relevant to consider.
According to the documentation here, the maximum allowed runtime is 28 days, not 5. Please check your configuration.
You are right; according to the documentation here, the maximum runtime for a training job is 5 days. There are multiple things you can do: use more powerful (or multiple) GPUs to reduce the training time, or save checkpoints and restart training from there. In any case, 30 days looks like a very long training time (with the associated cost); are you sure you need that?
Actually, you could ask for a service quota increase from here, but as you can see, "Longest run time for a training job" is not adjustable. So I don't think you have any choice other than using checkpoints or more powerful GPUs.

Run Athena every 15 minutes vs Kinesis Data Analytics

I am going to be using Athena for report generation on data available in S3. A lot of it is time series data coming from IoT devices.
Users can request reports over years and years' worth of data but will mostly be weekly, monthly or annual.
I am thinking of saving aggregates every 15 minutes, e.g. at 12:00, 12:15, 12:30, 12:45, 1:00, and so on. The calculated aggregates should always fall on the full 15-minute boundaries and cannot be at 12:03, 12:18, and so forth. Is this possible with Kinesis Data Analytics? If yes, how?
If not, does scheduling a Lambda to be triggered every 5-10 minutes and having Athena calculate those aggregates sound like a reasonable approach? Are there any alternatives I should consider?
Kinesis Data Analytics runs Apache Flink, which supports tumbling windows. Since tumbling windows are aligned to fixed boundaries rather than to the job's start time, setting the window size to 15 minutes gives you intervals starting at 00:00, 00:15, and so on by default.
https://nightlies.apache.org/flink/flink-docs-release-1.15/docs/dev/datastream/operators/windows/#tumbling-windows
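For example, a 15-minute tumbling window can be expressed in Flink SQL (wrapped here in PyFlink; the table and column names are made up, and in a Kinesis Data Analytics application the source table would be defined on your Kinesis stream):

```python
from pyflink.table import EnvironmentSettings, TableEnvironment

t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())

# Assumes a streaming table `iot_readings` with columns device_id, reading and a
# watermarked event-time column event_time has already been registered.
result = t_env.execute_sql("""
    SELECT
        device_id,
        TUMBLE_START(event_time, INTERVAL '15' MINUTE) AS window_start,
        AVG(reading) AS avg_reading
    FROM iot_readings
    GROUP BY device_id, TUMBLE(event_time, INTERVAL '15' MINUTE)
""")
# In a real application this would be an INSERT INTO a sink table (e.g. S3 or
# another stream) rather than a bare SELECT.
```

Because the windows are aligned to fixed boundaries, the aggregates land exactly on :00, :15, :30 and :45.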
Since 15 min is quite a slow cadence, you could also consider writing an AWS Glue job (Apache Spark) and having it triggered periodically with the built-in Glue triggers.
Or you can go with your current solution (Lambda/Athena).
One of the main decisions here is how much you need to invest in learning Spark or Flink versus the (I assume) already-known Athena query. I would reserve some limited time to test each approach before picking one; that way you can quickly see where things get complicated.
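If you stay with the Lambda/Athena route, the Lambda can be little more than a scheduled boto3 call that starts an Athena query and writes the result to S3. A rough sketch with a hypothetical database, table, and output bucket:

```python
import boto3

athena = boto3.client("athena")

# Hypothetical table and columns; in practice you would snap reading_time to the
# previous full 15-minute boundary instead of using now().
QUERY = """
    SELECT device_id, avg(reading) AS avg_reading
    FROM iot_readings
    WHERE reading_time >= date_add('minute', -15, now())
    GROUP BY device_id
"""

def handler(event, context):
    # Fire-and-forget: Athena executes the query and writes a result file to S3.
    athena.start_query_execution(
        QueryString=QUERY,
        QueryExecutionContext={"Database": "iot_db"},
        ResultConfiguration={"OutputLocation": "s3://my-bucket/aggregates/"},
    )
```

Triggering the Lambda from an EventBridge schedule such as cron(0/15 * * * ? *) keeps the runs on the :00/:15/:30/:45 boundaries.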

How to reduce the time taken by the glue etl job(spark) to actually start executing?

I want to start a Glue ETL job. The execution time itself is fine, but the time Glue takes to actually start executing the job is too long.
I looked into various documentation and answers, but none of them gave me a solution. There was some explanation of this behavior (a cold start), but no solution.
I expect the job to be up as soon as possible; it sometimes takes around 10 minutes to start a job that then executes in 2 minutes.
Unfortunately it's not possible right now. Glue uses EMR under the hood, and it takes some time to spin up a new cluster with the desired number of executors. As far as I know, they keep a pool of spare EMR clusters with the most common DPU configurations, so if you are lucky your job gets one of those and starts immediately; otherwise it has to wait.

Scheduling an Informatica workflow with a customized frequency

Hello Dear Informatica admin/platform experts,
I have a workflow that I need to schedule Monday-Friday and Sunday. On all 6 days the job should run at specific times, 10 times a day, but the times are not evenly spaced; they are predefined (9 AM, 11 AM, 1:30 PM, etc.). So we had 10 different scheduling workflows, one per run, each triggering a shell script that uses the pmcmd command.
That looked a bit odd to me, so what I did instead was create a single workflow that triggers the pmcmd shell script, with a link between the Start task and the shell script on which I specified a time condition, and I scheduled it to run Monday-Friday and Sunday every 30 minutes.
So it runs 48 times a day but triggers the "actual" workflow only 10 times; the remaining 38 times it runs and does nothing.
One of my Informatica admin colleagues says that those 38 runs (which actually do nothing) still consume Informatica resources. I was fairly sure they do not, but since I am just an Informatica developer and not an expert, I thought I would post here to check whether that is really true.
Thanks.
Regards
Raghav
Well... it does consume some resources. Each time a workflow starts, it performs quite a few operations on the Repository. It also allocates some memory on the Integration Service and creates a log file for the workflow, even if no sessions are executed at all.
So there is an impact. Multiply that by the number of workflows and the number of executions, and there might be a problem.
Not to mention there are limits on the number of workflows that can be executed at the same time.
I don't know your platform and setup, but this does look like an area for improvement. A cron scheduler should help you a lot.
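If you do move this to cron, one crontab entry per predefined time keeps the Integration Service idle in between, instead of waking it up every 30 minutes. A hypothetical sketch with made-up paths, times, and workflow names (day-of-week 0-5 is Sunday through Friday, so Saturday is skipped):

```
# min  hour  dom mon dow   command
0      9     *   *   0-5   /opt/informatica/scripts/start_wf.sh wf_daily_load
0      11    *   *   0-5   /opt/informatica/scripts/start_wf.sh wf_daily_load
30     13    *   *   0-5   /opt/informatica/scripts/start_wf.sh wf_daily_load
```

Here start_wf.sh stands for the shell script you already have that calls pmcmd startworkflow. Ten such entries cover the ten predefined times, and nothing starts (or writes logs) during the other 38 half-hour slots.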

Scheduling strategy behind AWS Batch

I am wondering what the scheduling strategy behind AWS Batch looks like. The official documentation on this topic doesn't provide many details:
The AWS Batch scheduler evaluates when, where, and how to run jobs that have been submitted to a job queue. Jobs run in approximately the order in which they are submitted as long as all dependencies on other jobs have been met.
(https://docs.aws.amazon.com/batch/latest/userguide/job_scheduling.html)
"Approximately" fifo is quite vaque. Especially as the execution order I observed when testing AWS Batch did't look like fifo.
Did I miss something? Is there a possibility to change the scheduling strategy, or configure Batch to execute the jobs in the exact order in which they were submitted?
I've been using Batch for a while now, and it has always seemed to behave in roughly a FIFO manner. Jobs that are submitted first will generally be started first, but because of limitations with distributed systems, this general rule won't work out perfectly. Jobs with dependencies are kept in the PENDING state until their dependencies have completed, and then they go into the RUNNABLE state. In my experience, whenever Batch is ready to run more jobs from the RUNNABLE state, it picks the job with the earliest time submitted.
However, there are some caveats. First, if Job A was submitted first but requires 8 cores while Job B was submitted later but only requires 4 cores, Job B might be selected first if Batch has only 4 cores available. Second, after a job leaves the RUNNABLE state, it goes into STARTING while Batch downloads the Docker image and gets the container ready to run. Depending on a number of factors, jobs that were submitted at the same time may spend more or less time in the STARTING state. Finally, if a job fails and is retried, it goes back into the PENDING state with its original submit time. When Batch decides to select more jobs to run, it will generally select the job with the earliest submit date, which will be the job that failed. If other jobs started before the first job failed, the first job will begin its second run after those other jobs.
There's no way to configure Batch to be perfectly FIFO because it's a distributed system, but generally if you submit jobs with the same compute requirements spaced a few seconds apart, they'll execute in the same order you submitted them.
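If you truly need a strict order for a handful of jobs, you can enforce it yourself with job dependencies, since a dependent job stays in PENDING until its parent finishes. A rough boto3 sketch with hypothetical queue and job definition names:

```python
import boto3

batch = boto3.client("batch")

# Hypothetical names; dependsOn keeps "step-2" PENDING until "step-1" completes.
job_a = batch.submit_job(
    jobName="step-1",
    jobQueue="my-job-queue",
    jobDefinition="my-job-definition",
)
job_b = batch.submit_job(
    jobName="step-2",
    jobQueue="my-job-queue",
    jobDefinition="my-job-definition",
    dependsOn=[{"jobId": job_a["jobId"]}],
)
```

The trade-off is that chained jobs never overlap, so this only makes sense where ordering matters more than throughput.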