Get Airflow Dag Concurrent Runs in MWAA - amazon-web-services

Can we get the Airflow DAG run count (concurrency) for DAGs running in MWAA? I set the concurrency at the DAG level, but I would like to read the current DAG concurrency and, if it is below a specific limit, trigger the DAG.
Since the DAG is running in MWAA, is there a metric available to get this count? I found the PoolRunningSlots metric, which comes very close to the concurrent DAG run count (I checked a couple of times, and the value matched the run count of the DAG).
FYI, PoolRunningSlots:
https://docs.aws.amazon.com/mwaa/latest/userguide/access-metrics-cw-202.html#access-metrics-cw-console-v202
Please let me know if you have encountered this before. TIA.

I don't know why you want to do this manually, because Airflow can do it for you: just configure max_active_runs (set it to your specific limit), which defines how many concurrently running instances of a DAG are allowed.
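As an illustration, a minimal sketch of setting max_active_runs on a DAG; the DAG id, schedule, and task here are placeholders:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.empty import EmptyOperator  # DummyOperator on older Airflow versions

# max_active_runs caps how many runs of this DAG may be active at once;
# further triggers are queued until a running instance finishes.
with DAG(
    dag_id="capped_dag",              # placeholder name
    start_date=datetime(2023, 1, 1),  # placeholder date
    schedule_interval="@hourly",      # placeholder schedule
    max_active_runs=3,                # your specific limit
    catchup=False,
) as dag:
    EmptyOperator(task_id="do_work")
```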
As for the PoolRunningSlots metric: it is the number of slots in use for a specific pool. So if the same pool is used by several DAGs, or your DAG uses several pools (a different pool per task), or the tasks occupy different numbers of slots (check the doc), then PoolRunningSlots will differ significantly from the concurrent DAG run count.
If you really need this metric, you can create a custom one: add a task at the beginning of your DAG that increments a CloudWatch metric (doc), and add on_success_callback and on_failure_callback to your DAG to decrement the same metric. This new metric will then represent the count of active runs of your DAG.
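A minimal sketch of that idea, assuming boto3 is available to the workers, the MWAA execution role allows cloudwatch:PutMetricData, and using a hypothetical namespace and metric name:

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

def bump_active_runs(delta: int, **_):
    # Publish +1 when a run starts and -1 when it ends; aggregating these
    # data points (Sum) approximates the number of active runs of the DAG.
    cloudwatch.put_metric_data(
        Namespace="Custom/Airflow",  # hypothetical namespace
        MetricData=[{"MetricName": "MyDagActiveRuns", "Value": float(delta)}],
    )

# First task of the DAG increments the metric:
# PythonOperator(task_id="increment_active_runs",
#                python_callable=bump_active_runs, op_args=[1])
# The DAG-level callbacks decrement it:
# DAG(..., on_success_callback=lambda ctx: bump_active_runs(-1),
#          on_failure_callback=lambda ctx: bump_active_runs(-1))
```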

Related

airflow health check

With the Airflow deployment I'm using, pipelines sometimes wait a long time to be scheduled. There have also been instances where a job ran for too long (presumably taking up resources needed by other jobs).
I'm trying to work out how to programmatically assess the health of the scheduler, and potentially monitor it in the future, without any additional frameworks. I started looking at the metadata database tables. All I can think of so far is checking start_date and end_date in dag_run, and the duration of the tasks. What other metrics should I be looking at? Many thanks for your help.
There is no need to go "deep" into the database.
Airflow provides metrics that you can use for this very purpose: https://airflow.apache.org/docs/apache-airflow/stable/logging-monitoring/metrics.html
If you scroll down, you will see all the useful metrics, and some of them are precisely what you are looking for (especially the Timers).
This can be done with the usual metrics integration: Airflow publishes the metrics via StatsD, and the official Airflow Helm Chart (https://airflow.apache.org/docs/helm-chart/stable/index.html) even exposes those metrics for Prometheus via a statsd exporter.
Regarding the Spark job - yeah - the current implementation of the Spark submit hook/operator works in "active poll" mode: the "worker" process of Airflow polls the status of the job. But Airflow can run multiple worker jobs in parallel. Also, if you want, you can implement your own task which will behave differently.
In "classic" Airflow you'd need to implement a Submit Operator (to submit the job) and "poke_reschedule" sensor (to wait for the job to complete) and implement your DAG in the way that sensort task will be triggered after the operator. The "Poke reschedule" mode works in the way that the sensor is only taking the worker slot for the time of "polling" and then it frees the slot for some time (until it checks again).
As of Airflow 2.2 you can also write a deferrable operator (https://airflow.apache.org/docs/apache-airflow/stable/concepts/deferring.html?highlight=deferrable), where a single operator does the submission first and then defers the status check, all in one operator. Deferrable operators efficiently handle (using asyncio) potentially many thousands of waiting/deferred operators without taking up slots or excessive resources.
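A minimal sketch of that pattern, assuming Airflow 2.2+ with a triggerer process running; the submission and status check are stubbed placeholders:

```python
from datetime import timedelta

from airflow.models.baseoperator import BaseOperator
from airflow.triggers.temporal import TimeDeltaTrigger


class SubmitAndWaitOperator(BaseOperator):
    """Submits a job, then defers instead of blocking a worker slot."""

    def execute(self, context):
        self.log.info("Submitting job...")  # placeholder for real submission
        # Hand the wait over to the triggerer and free this worker slot.
        self.defer(
            trigger=TimeDeltaTrigger(timedelta(minutes=5)),
            method_name="check_status",
        )

    def check_status(self, context, event=None):
        # Placeholder: poll the job status here; call self.defer() again
        # if the job is still running.
        self.log.info("Job finished (placeholder check).")
```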
Update: if you really cannot use StatsD (Helm is not needed; StatsD alone is enough), you should still never use the DB to get information about the DAGs. Use the stable Airflow REST API instead: https://airflow.apache.org/docs/apache-airflow/stable/stable-rest-api-ref.html
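For example, a sketch of counting a DAG's running runs via the stable REST API; the base URL, credentials, and DAG id are placeholders, and basic auth is assumed to be enabled:

```python
import requests

BASE_URL = "http://localhost:8080/api/v1"  # placeholder host
AUTH = ("admin", "admin")                  # placeholder credentials

resp = requests.get(
    f"{BASE_URL}/dags/my_dag/dagRuns",  # placeholder DAG id
    params={"state": "running"},        # only count active runs
    auth=AUTH,
)
resp.raise_for_status()
print("Active runs:", resp.json()["total_entries"])
```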

Are there any problems with running same cron job that takes 2 hours to complete every 10 minutes?

I have a script that takes two hours to run, and I want to run it every 15 minutes as a cron job on a cloud VM.
I noticed that my CPU is often at 100% usage. Should I resize memory and/or the number of cores?
Each time your cron job fires, a new process is created.
So if your job takes 120 min (2 h) to complete and you start a new one every 15 minutes, you will have 8 jobs running at the same time (120/15).
Thus, if the jobs are resource-intensive, you will observe issues such as 100% CPU usage.
So whether to up-scale or not really depends on the nature of these jobs: what do they do, and how much CPU and memory do they take? Based on your description you are already running at 100% CPU often, so an upgrade seems warranted in my view.
It depends on your cron job, but beyond resourcing for your server/application the following issues should be considered (a simple guard against overlapping runs is sketched after this list):
Is there overlap in data? I.e. do you retrieve a pool of data that will be processed multiple times?
Will duplicate critical actions happen? I.e. will a customer receive an email multiple times, or a payment be processed multiple times?
Is there a chance of a race condition that causes the script to exit early?
Will there be any collisions in the processing, i.e. duplicate bookings made, etc.?
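One common guard against overlapping runs is a lock file, so a new invocation exits immediately while the previous one is still working. A minimal sketch, assuming a Linux host (fcntl is POSIX-only) and a hypothetical lock path and job function:

```python
import fcntl
import sys

LOCK_PATH = "/tmp/myjob.lock"  # hypothetical path; any writable location works

lock_file = open(LOCK_PATH, "w")
try:
    # Non-blocking exclusive lock: fails if another run still holds it.
    fcntl.flock(lock_file, fcntl.LOCK_EX | fcntl.LOCK_NB)
except BlockingIOError:
    print("Previous run still in progress; exiting.")
    sys.exit(0)

run_job()  # hypothetical stand-in for the actual two-hour script
```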
You will need to increase the CPU and memory specification of your VM instance (in GCP) due to the high CPU load of your instance. Document [1] covers upgrading the machine type of a VM instance; to do this you need to shut down the VM instance and change its machine type.
To learn about the different machine types in GCP, please see link [2].
On the other hand, you can autoscale based on average CPU utilization if you use a managed instance group (MIG) [3]. Using this policy tells the autoscaler to collect the CPU utilization of the instances in the group and determine whether it needs to scale. You set the target CPU utilization the autoscaler should maintain, and the autoscaler works to maintain that level.
[1] https://cloud.google.com/compute/docs/instances/changing-machine-type-of-stopped-instance
[2] https://cloud.google.com/compute/docs/machine-types
[3] https://cloud.google.com/compute/docs/autoscaler/scaling-cpu-load-balancing#scaling_based_on_cpu_utilization

Suppress message - The Mapping task failed to run. Another instance of the task is currently running

I have set up multiple jobs in Informatica Cloud to sync data from Oracle to Informatica objects. The jobs are scheduled to run every 3 minutes per the business requirements. Sometimes a job runs long due to a Secure Agent resource crunch, and my team used to receive multiple emails like the one below:
The Mapping task failed to run. Another instance of the task is currently running.
Is there any way to suppress these failure emails for the mapping?
This won't be set at the mapping level but at the session or Integration Service level; see the following: https://network.informatica.com/thread/7312
This type of error occurs when the workflow/session is already running and a re-run is attempted. You can use a script to check whether it is already running and, if so, wait. If you want to run multiple instances of the same workflow:
In the workflow properties, enable 'Configure Concurrent Execution' by ticking the check box.
Once it is enabled, you have two options:
Allow concurrent runs with the same instance name
Allow concurrent runs only with a unique instance name
Notifications configured at the task level override those at the org level, so you could do this by configuring notifications at the task level and only sending warnings to the broader list. That said, some people should still receive the error-level warning, because if it recurs multiple times within a short period there may be another issue.
Another thought: batch processes that run every three minutes yet take longer than three minutes are usually an opportunity to improve the design. Often a business requirement for short batch intervals reflects a "near real time" desire. If you also have the Cloud Application Integration service, you may want to set up an event to trigger the batch run. If there is still overlap based on events, you can use the Cloud Data Integration API to create a dynamic version of the task each time. For really simple integrations you could perform the integration in CAI, which does allow multiple instances running at the same time.
HTH

Is there a way to specify a minimum number of workers for Cloud Dataflow w/ autoscaling?

I'd like to specify a minimum number of workers for my job that autoscaling will not go below (akin to how it works for max_num_workers). Is this possible? My reason is that sometimes the worker startup takes long enough that the autoscaling decides to drop the number of workers to one, even though doing so is not optimal for my job. I'd still like to use autoscaling in case the job is larger than my estimated minimum.
A minimum number of workers is not yet supported. Could you file a ticket with the job details so that support can take a look and understand why it downscales to too few workers?
According to the autoscaling documentation, you can specify the maximum number of workers in the --maxNumWorkers option and the initial number of workers in --numWorkers. You can find a description of these options in this document.
You can specify the initial number of workers using --numWorkers; this is how many workers are used when the pipeline is deployed.
You can specify the maximum number of workers using --maxNumWorkers; when Dataflow autoscales, this is the most workers it can use.
This is now supported: the argument to pass (in the Python SDK) is --num_workers, according to the documentation.
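For illustration, a minimal sketch of passing these flags from the Beam Python SDK; the project, region, and bucket values are placeholders:

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(
    runner="DataflowRunner",
    project="my-project",                # placeholder
    region="us-central1",                # placeholder
    temp_location="gs://my-bucket/tmp",  # placeholder
    num_workers=5,        # initial worker count (per the answers above)
    max_num_workers=20,   # ceiling for autoscaling
)

# Trivial pipeline just to make the sketch runnable end to end.
with beam.Pipeline(options=options) as p:
    p | beam.Create([1, 2, 3]) | beam.Map(print)
```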

MapReduce on Yarn: Control the mapper or reducer tasks running simultaneously?

My MapReduce-based Hive SQL runs on YARN, and the Hadoop version is 2.7.2. What I want is to restrict the number of mapper or reducer tasks running simultaneously when a Hive SQL job is really big. I have tried the following parameters, but in fact they are not what I want:
mapreduce.tasktracker.reduce.tasks.maximum: The maximum number of reduce tasks that will be run simultaneously by a task tracker.
mapreduce.tasktracker.map.tasks.maximum: The maximum number of map tasks that will be run simultaneously by a task tracker.
The above two parameters seem to be unavailable on my YARN cluster, because YARN has no concept of a JobTracker or TaskTracker; those are Hadoop 1.x concepts. And I have checked my application: its running mappers number more than 20, while the mapreduce.tasktracker.reduce.tasks.maximum value is just the default of 2.
Then I tried the following two parameters; they are also not what I need:
mapreduce.job.maps: The default number of map tasks per job. Ignored when mapreduce.jobtracker.address is "local".
mapreduce.job.reduces: The default number of reduce tasks per job. Typically set to 99% of the cluster's reduce capacity, so that if a node fails the reduces can still be executed in a single wave. Ignored when mapreduce.jobtracker.address is "local".
mapreduce.job.maps is just a hint for how many splits will be created for the map tasks, and mapreduce.job.reduces defines how many reducers will be generated in total.
But what I want to limit is how many mapper or reducer tasks are allowed to run simultaneously for each application.
In the screenshot I captured, a YARN application has at least 20 mapper tasks running at once, which costs too much cluster resource. I want to limit it to 10 at most.
So, what can I do?
There may be several questions here. First of all, to control whether the reducers for a particular job run at the same time as the mappers or only after all of the mappers have completed, you need to tweak mapreduce.job.reduce.slowstart.completedmaps.
This parameter defaults to 0.8 (80%) in some distributions (the stock Apache default is 0.05). At 0.8, the reducers start once 80% of the mappers have completed. If you want the reducers to wait until all of the mappers are complete, set it to 1.
As for controlling the number of mappers running at one time, you need to look at setting up either the Fair Scheduler or the Capacity Scheduler.
Using one of these schedulers, you can set resource minimums and maximums for the queue where a job runs, which controls how many containers (mappers and reducers are containers in YARN) run at one time.
There is good information out there about both schedulers.
https://hadoop.apache.org/docs/stable/hadoop-yarn/hadoop-yarn-site/FairScheduler.html
https://hadoop.apache.org/docs/stable/hadoop-yarn/hadoop-yarn-site/CapacityScheduler.html