AWS Glue Job Alerting on Long Run Time

I'm hoping to configure some form of alerting for AWS Glue jobs when they run longer than a configurable amount of time. These Glue jobs can be triggered at any time of day and usually take less than 2 hours to complete. However, if a run exceeds the 2-hour threshold, I want to be notified (via SNS).
Usually I can configure run-time alerting in CloudWatch Metrics, but I am struggling to do this for a Glue job. The only metric I can see that could be useful is glue.driver.aggregate.elapsedTime, but it doesn't appear to help. Any advice would be appreciated.

You could use the AWS SDK for that. You just need the job run ID and can then call GetJobRun to get the execution time. Based on that, you can notify someone or some other service.
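A minimal sketch of that idea with boto3, assuming the alert goes to an existing SNS topic; the job name, run ID, topic ARN, and 2-hour threshold below are placeholders:

```python
from datetime import datetime, timezone

import boto3

JOB_NAME = "my-glue-job"                                          # placeholder
RUN_ID = "jr_0123456789abcdef"                                    # placeholder
TOPIC_ARN = "arn:aws:sns:us-east-1:123456789012:glue-job-alerts"  # placeholder
THRESHOLD_SECONDS = 2 * 60 * 60                                   # 2 hours

glue = boto3.client("glue")
sns = boto3.client("sns")

run = glue.get_job_run(JobName=JOB_NAME, RunId=RUN_ID)["JobRun"]

# For a finished run, ExecutionTime (in seconds) is returned directly;
# for a run that is still in progress, derive the elapsed time from StartedOn.
elapsed = run.get("ExecutionTime") or (
    datetime.now(timezone.utc) - run["StartedOn"]
).total_seconds()

if elapsed > THRESHOLD_SECONDS:
    sns.publish(
        TopicArn=TOPIC_ARN,
        Subject=f"Glue job {JOB_NAME} ran longer than expected",
        Message=(
            f"Run {RUN_ID} is in state {run['JobRunState']} and has been "
            f"running for about {int(elapsed)} seconds."
        ),
    )
```

You would still need something to invoke this periodically, for example a scheduled Lambda function that lists recent job runs and checks each one.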

Related

How to monitor an AWS Glue Workflow

I have a Glue Workflow consisting of multiple AWS Glue jobs, and I want to be alerted when it fails. Currently I have CloudWatch alarms on each of the individual jobs that make up the workflow. The problems with my current solution are that it requires creating many alarms instead of just one, and that the alarms fire on a single failure of a job, even if the job succeeds on an automatic retry. As far as I can tell there are no CloudWatch metrics associated with the workflow like there are for the jobs, so I don't know how I can monitor for workflow failures.

How can we visualize the Dataproc job status in Google Cloud Platform?

How can we visualize (via Dashboards) the Dataproc job status in Google Cloud Platform?
We want to check whether jobs are running or not, in addition to their status (running, delayed, blocked). On top of that, we want to set up alerting (Stackdriver Alerting) as well.
On this page you have all the metrics available in Stackdriver:
https://cloud.google.com/monitoring/api/metrics_gcp#gcp-dataproc
You could use cluster/job/submitted_count, cluster/job/failed_count, and cluster/job/running_count to build the dashboard and its charts.
Also, you could use cluster/job/completion_time to warn about long-running jobs and cluster/job/duration to check if jobs are enqueued in PENDING status for a long time.
Note that cluster/job/completion_time is logged only after a job completes; i.e., if the job takes 7 hours to finish, the data point is only registered at the 7th hour.
Similarly, cluster/job/duration logs the time spent in each state only after the state is complete: if a job was in the PENDING state for 1 hour, you would only see this metric at the 60th minute.
Dataproc has an open issue to introduce more metrics that would help with this active-alerting use case: https://issuetracker.google.com/issues/211910984
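For the alerting/dashboard side, here is a minimal sketch of reading one of those metrics with the Cloud Monitoring (Stackdriver) Python client; the project ID is a placeholder, and it assumes the fully-qualified metric type is dataproc.googleapis.com/cluster/job/running_count as listed on the page above:

```python
import time

from google.cloud import monitoring_v3

PROJECT_ID = "my-project"  # placeholder

client = monitoring_v3.MetricServiceClient()

# Look at the last hour of data points.
now = int(time.time())
interval = monitoring_v3.TimeInterval(
    {
        "end_time": {"seconds": now},
        "start_time": {"seconds": now - 3600},
    }
)

results = client.list_time_series(
    request={
        "name": f"projects/{PROJECT_ID}",
        "filter": 'metric.type = "dataproc.googleapis.com/cluster/job/running_count"',
        "interval": interval,
        "view": monitoring_v3.ListTimeSeriesRequest.TimeSeriesView.FULL,
    }
)

for series in results:
    cluster = series.resource.labels.get("cluster_name", "")
    for point in series.points:
        print(cluster, point.value.int64_value)
```

The same metric filter can be dropped into a Cloud Monitoring alerting policy instead of polling it yourself.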

Can Cloud Dataflow streaming job scale to zero?

I'm using Cloud Dataflow streaming pipelines to insert events received from Pub/Sub into a BigQuery dataset. I need a few of them to keep each job simple and easy to maintain.
My concern is the overall cost. The volume of data is not very high, and during a few periods of the day there isn't any data (no messages on Pub/Sub).
I would like Dataflow to scale to 0 workers until a new message is received, but it seems the minimum is 1 worker.
So the minimum price for each job per day would be 24 vCPU-hours, i.e. at least $50 a month per job (without any discount for monthly usage).
I plan to run and drain my jobs via the API a few times per day to avoid one full-time worker, but this does not seem like the right approach for a managed service like Dataflow.
Is there something I missed?
Dataflow can't scale to 0 workers, but your alternatives would be to use cron or Cloud Functions to create a Dataflow streaming job whenever an event triggers it; for stopping the Dataflow job by itself, you can read the answers to this question.
You can find an example here for both cases (cron and Cloud Functions); note that Cloud Functions is not in alpha release anymore and has been in general availability since July.
A streaming Dataflow job must always have at least one worker. If the volume of data is very low, perhaps batch jobs fit the use case better. Using a scheduler or cron you can periodically start a batch job to drain the topic, and this will save on cost.
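As a minimal sketch of the periodic-batch approach, assuming the pipeline has been packaged as a Dataflow template (the project, region, template path, and job name below are placeholders), a cron- or Cloud Scheduler-triggered function could launch it like this:

```python
from googleapiclient.discovery import build

PROJECT = "my-project"   # placeholder
REGION = "us-central1"   # placeholder

# Dataflow REST API (v1b3) via the Google API client library.
dataflow = build("dataflow", "v1b3")

response = (
    dataflow.projects()
    .locations()
    .templates()
    .launch(
        projectId=PROJECT,
        location=REGION,
        gcsPath="gs://my-bucket/templates/my-batch-template",  # placeholder template
        body={
            "jobName": "drain-pubsub-backlog",  # placeholder job name
            "parameters": {
                # Template-specific parameters go here.
            },
        },
    )
    .execute()
)

print(response["job"]["id"])
```

Running this a few times per day instead of keeping a streaming worker alive is essentially the "run and drain" idea from the question, just automated.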

AWS Glue: How to reduce the number of DPUs for an ETL job

AWS Glue documentation regarding pricing reads:
A Glue ETL job requires a minimum of 2 DPUs. By default, AWS Glue allocates 10 DPUs to each ETL job. You are billed $0.44 per DPU-Hour in increments of 1 minute, rounded up to the nearest minute, with a 10-minute minimum duration for each ETL job.
I want to reduce the number of DPUs allocated to my ETL job. I searched for this option in the Glue console, but I didn't find it. Can you please let me know how to do that?
Thanks
To reduce the number of DPUs, go to the AWS Glue jobs console. Select the job and, under Action, edit the job. Under "Script libraries and job parameters", you should see "Concurrent DPUs per job run". You can provide an integer value to increase or reduce the number of DPUs.
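If you would rather do it through the API than the console, here is a minimal sketch with boto3 that overrides the capacity for a single run; the job name is a placeholder, and note that on newer Glue versions you would size the job with WorkerType/NumberOfWorkers instead of a raw DPU count:

```python
import boto3

glue = boto3.client("glue")

# Start one run of the job with only 2 DPUs (the minimum for a Spark ETL job).
glue.start_job_run(
    JobName="my-etl-job",  # placeholder job name
    MaxCapacity=2.0,       # number of DPUs for this run
)
```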

AWS: What can I use to run periodic tasks on RDS?

In a specific RDS column I keep, as a date, the information about when each user's trial ends.
I want to check these dates in the database every day, and when only a few days are left until the end of a trial, send an email message (with SES).
How can I run periodic tasks in AWS to check the database? I know that I can use:
Lambda
EC2 (or Elastic Beanstalk)
Is there any other solution which I've missed?
You can also use AWS Batch for this. It is a better fit if the job is heavy and takes more time to complete.
How long does it take to run your check? If it takes less than 300 seconds and is well within the limits of Lambda (AWS Lambda Limits), then schedule the task with Lambda: Schedule Expressions Using Rate or Cron.
Otherwise, the best option is to use AWS Data Pipeline. It is very easy to schedule and run your custom script periodically, and it charges for at least one hour of instance time.
Go with Lambda here.
You can create a Lambda function and direct AWS Lambda to execute it on a regular schedule. You can specify a fixed rate (for example, execute a Lambda function every hour or 15 minutes), or you can specify a Cron expression.
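As a rough sketch of what that scheduled function might look like, assuming a MySQL-compatible RDS instance queried with pymysql and a hypothetical users table with email and trial_ends_on columns (the connection settings, schema, and sender address are all placeholders):

```python
import datetime
import os

import boto3
import pymysql  # assumed driver; package it with the Lambda deployment

ses = boto3.client("ses")


def handler(event, context):
    # Connection details are assumed to come from environment variables.
    conn = pymysql.connect(
        host=os.environ["DB_HOST"],
        user=os.environ["DB_USER"],
        password=os.environ["DB_PASSWORD"],
        database=os.environ["DB_NAME"],
    )
    cutoff = datetime.date.today() + datetime.timedelta(days=3)
    with conn.cursor() as cur:
        # Hypothetical schema: users(email, trial_ends_on).
        cur.execute(
            "SELECT email FROM users WHERE trial_ends_on <= %s", (cutoff,)
        )
        rows = cur.fetchall()
    conn.close()

    for (email,) in rows:
        ses.send_email(
            Source="noreply@example.com",  # must be a verified SES sender
            Destination={"ToAddresses": [email]},
            Message={
                "Subject": {"Data": "Your trial is ending soon"},
                "Body": {"Text": {"Data": "Your trial ends in a few days."}},
            },
        )
```

The function would then be triggered by a CloudWatch Events (EventBridge) schedule such as rate(1 day).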