EMR Job Long Running Notifications - amazon-web-services

Consider that we have around 30 EMR jobs that run between 5:30 AM PST and 10:30 PST.
We receive flat files in an S3 bucket, and Lambda functions copy the received files to other target paths.
We have DynamoDB tables for data processing once the data arrives in the target path.
The problem is that, because we have multiple dependencies and parallel execution, jobs sometimes fail due to memory issues and sometimes take much longer to complete.
Sometimes a job runs for 4 or 5 hours and is finally terminated with a memory error or some other issue, such as a subnet not being available or an EC2 issue. We don't want to wait that long.
E.g.: Job_A processes the 1st to 4th files and Job_B processes the 5th to 10th files, and so on.
Here Job_B depends on Job_A through the 3rd file, so Job_B waits until Job_A completes. We have dependencies like this throughout our process.
I would like to get notifications from EMR jobs like the following:
E.g.: the average running time for Job_A is 1 hour. If it runs for more than 1 hour, I need to be notified by email or some other way.
How can I achieve this? Please help or advise.
Regards,
Karthik

Call list_steps repeatedly from a Lambda function using an AWS SDK such as boto3, and check each step's start time. When a step started more than 1 hour ago, trigger a notification, for example through Amazon SES. See the documentation.
For example, you can call list_steps for the running steps only:
    import boto3

    client = boto3.client('emr')
    response = client.list_steps(
        ClusterId='string',
        StepStates=['RUNNING']
    )
It returns a response like the following:
    {
        'Steps': [
            {
                ...
                'Status': {
                    ...
                    'Timeline': {
                        'CreationDateTime': datetime(2015, 1, 1),
                        'StartDateTime': datetime(2015, 1, 1),
                        'EndDateTime': datetime(2015, 1, 1)
                    }
                }
            },
        ],
        ...
    }
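As a rough sketch of the check described above, the following compares each running step's StartDateTime with the current time. The function names overdue_steps and check_cluster, and the 1 hour default, are illustrative assumptions, not part of boto3. Sending the notification itself (SES, SNS, etc.) is left out, and list_steps pagination is ignored here.

```python
from datetime import datetime, timedelta, timezone


def overdue_steps(steps, max_runtime, now=None):
    """Return (name, elapsed) for each step running longer than max_runtime."""
    now = now or datetime.now(timezone.utc)
    overdue = []
    for step in steps:
        started = step["Status"]["Timeline"].get("StartDateTime")
        if started is None:  # step has no start time yet
            continue
        elapsed = now - started
        if elapsed > max_runtime:
            overdue.append((step["Name"], elapsed))
    return overdue


def check_cluster(cluster_id, max_runtime=timedelta(hours=1)):
    """Query EMR for RUNNING steps and return the overdue ones."""
    import boto3  # imported lazily so overdue_steps works without boto3

    client = boto3.client("emr")
    response = client.list_steps(ClusterId=cluster_id, StepStates=["RUNNING"])
    return overdue_steps(response["Steps"], max_runtime)
```

A scheduled Lambda could call check_cluster for each cluster and send one message per overdue step. A per-job threshold (e.g. a dict keyed by step name) would fit the "average running time for Job_A" case better than a single limit.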

Related

MWAA Airflow Scaling: what do I do when I have to run frequent & time consuming scripts? (Negsignal.SIGKILL)

I have an MWAA Airflow env in my AWS account. The DAG I am setting up is supposed to read massive data from S3 bucket A, filter what I want and dump the filtered results to S3 bucket B. It needs to read every minute since the data is coming in every minute. Every run processes about 200MB of json data.
My initial setting used env class mw1.small with 10 worker machines. If I run the task only once in this setting, each run takes about 8 minutes to finish. But when I start the schedule to run every minute, most runs cannot finish, they start to take much longer (around 18 mins), and they display the error message:
[2021-09-25 20:33:16,472] {{local_task_job.py:102}} INFO - Task exited with return code Negsignal.SIGKILL
I tried expanding the env class to mw1.large with 15 workers. More jobs were able to complete before the error showed up, but it still could not keep up with data arriving every minute. The Negsignal.SIGKILL error still appeared even before reaching the maximum number of worker machines.
At this point, what should I do to scale this? I can imagine opening another Airflow env but that does not really make sense. There must be a way to do it within one env.
I've found the solution to this. For MWAA, edit the environment and, under Airflow configuration options, set these configs:
celery.sync_parallelism = 1
celery.worker_autoscale = 1,1
This makes sure each worker machine runs 1 job at a time, preventing multiple jobs from sharing a worker, which saves memory and reduces runtime.

Trigger another lambda after a week of first lambda execution

I am working on code where Lambda Function 1 (call it L1) executes on messages from an SQS queue. I want to execute another Lambda (call it L2) exactly a week after L1 completes, and I want to pass L1's output to L2.
Execution Environment: Java
For my application, we are expecting around 10k requests on L1 per day. And same number of requests for L2.
If it runs for a week, we can have around 70k active executions at peak.
Things that I have tried:
Cloudwatch events with cron: I can schedule a cron with a specified time or date which will trigger L2. But I couldn't find a way to pass input with a scheduled Cloudwatch event.
Cloudwatch events with new rules: At the end of the first Lambda I can create a new Cloudwatch rule with a specified time and specified input. But that will create as many rules as there are requests (in my case, around 10k new Cloudwatch rules every day). Not sure if that is good practice or even supported.
Step Functions: There are two types of Step Functions workflows today.
Standard: Supports waiting for up to a year, but only supports 25k active executions at any time. Won't scale, since my application will already have 70k active executions at the end of the first week.
https://docs.aws.amazon.com/step-functions/latest/dg/limits.html
Express: Doesn't have a limit on the number of active executions, but supports executions of at most 5 minutes. It will time out after that.
https://docs.aws.amazon.com/step-functions/latest/dg/express-limits.html
It would be easy to create a new Cloudwatch Rule with the "week later" Lambda as a target as the last step in the first Lambda. You would set a Rule with a cron that runs 1 time in 1 week. The Target then has an input field for the data to pass.
You didn't indicate your programming environment but you can do something similar to (pseudocode, based on Java SDK v2):
    // client is assumed to be a CloudWatchEventsClient; the target id is a placeholder.
    // Cron expressions here take six fields: minute hour day-of-month month day-of-week year.
    String lambdaArn = "the one week from today lambda arn";
    String ruleArn = client.putRule(PutRuleRequest.builder()
            .name("myRule")
            .scheduleExpression("cron(17 20 23 7 ? *)")
            .build()).ruleArn();
    Target target = Target.builder().id("weekLaterTarget").arn(lambdaArn)
            .input("{\"message\": \"blah\"}").build();
    client.putTargets(PutTargetsRequest.builder()
            .rule("myRule").targets(target).build());
This will create a Cloudwatch Event Rule that fires on the given date with the input as shown. With "*" in the year field it would fire again a year later, so delete the rule after it runs (or pin the year) to make it run one time only.
Major Edit
With your new requirements (at least 1 week later, tens of thousands of events) I would not use the method I described above, as there are just too many things happening. Instead I would have a database of events that acts as a queue. Either a DynamoDB or RDS database will suffice. At the end of each "primary" Lambda run, insert an event with the date and time of the next run. For example, today, July 18, I would insert July 25. The table would be something like (PostgreSQL syntax):
    create table event_queue (
        run_time timestamp not null,
        lambda_input varchar(8192)
    );
    create index on event_queue( run_time );
Where the lambda_input column has whatever data you want to pass to the "week later" Lambda. In PostgreSQL you would do something like:
insert into event_queue (run_time, lambda_input)
values ((current_timestamp + interval '1 week'), '{"value":"hello"}');
Every database has something similar to the date/time functions shown or the code to do this isn't terrible.
Now, in CloudWatch create a rule that runs once an hour (the resolution can be tuned). It will trigger a Lambda that "feeds" an SQS queue. The Lambda will query the database:
select * from event_queue where run_time < current_timestamp
and, for each row, put a message into an SQS queue. The last thing it does is delete these "old" messages using the same where clause.
On the other side you have your "week later" Lambdas that are getting events from the SQS queue. These Lambdas are idle until a set of messages are put into the queue. At that time they fire up and empty the queue, doing whatever the "week later" Lambda is supposed to do.
By running the "feeder" Lambda hourly you basically capture everything that is 1 week plus up to 1 hour old. The less often you run it, the more work your "week later" Lambdas have to do per batch; conversely, running it every minute adds load to the database but removes it from the "week later" Lambdas.
This should scale well, assuming that the "feeder" Lambda can keep up. 10k transactions / 24 hours is only about 416 transactions per hourly run, and reading the DB and creating the messages should be very quick. Even scaling that by 10, to 100k/day, is still only ~4000 rows and messages per hour, which, again, should be very doable.
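The "feeder" step described above could be sketched roughly as follows. This is only an illustration under stated assumptions: sqlite3 stands in for the PostgreSQL/RDS database, timestamps are stored as ISO 8601 text, and send is a hypothetical callable standing in for whatever puts a message on the SQS queue. The function name drain_due_events is also made up for this sketch. It fixes one cutoff timestamp and uses it for both the select and the delete, so rows that become due between the two statements are not deleted unsent.

```python
import sqlite3
from datetime import datetime, timedelta


def drain_due_events(conn, send, now):
    """Forward every event whose run_time has passed, then delete those rows."""
    # One cutoff for both statements, so SELECT and DELETE see the same rows.
    cutoff = now.isoformat(timespec="seconds")
    rows = conn.execute(
        "SELECT lambda_input FROM event_queue WHERE run_time < ?", (cutoff,)
    ).fetchall()
    for (lambda_input,) in rows:
        send(lambda_input)  # e.g. a wrapper around an SQS send call
    conn.execute("DELETE FROM event_queue WHERE run_time < ?", (cutoff,))
    conn.commit()
    return len(rows)
```

If a send fails partway through, nothing is deleted and the next run resends everything, so the "week later" Lambda should tolerate duplicate messages.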
Cloudwatch is more for cron jobs. To trigger something at a specific timestamp or after X amount of time, I would recommend using Step Functions instead.
You can achieve your use case with a State Machine that has a Wait State (you can tell it how long to wait based on your input) followed by your Lambda Task State.
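A minimal sketch of such a state machine definition, built as a Python dict that can be serialized to JSON. The state names, the run_at input field, and the function name delayed_lambda_definition are assumptions for illustration. The Wait state reads an ISO 8601 timestamp (e.g. "2021-07-25T20:17:00Z") from the execution input via TimestampPath.

```python
import json


def delayed_lambda_definition(lambda_arn):
    """Build a state machine definition: wait until a timestamp, then run a Lambda."""
    return {
        "Comment": "Wait until the timestamp in the input, then run the Lambda",
        "StartAt": "WaitUntilDue",
        "States": {
            "WaitUntilDue": {
                "Type": "Wait",
                "TimestampPath": "$.run_at",  # hypothetical input field
                "Next": "RunLambda",
            },
            "RunLambda": {
                "Type": "Task",
                "Resource": lambda_arn,
                "End": True,
            },
        },
    }


# json.dumps(delayed_lambda_definition(arn)) gives the definition string
# to supply when creating the state machine.
```

Each waiting execution counts as an active execution, so the Standard workflow limit mentioned in the question still applies to this approach.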

AWS Pipeline, wait for stage to complete

I have an AWS pipeline, which:
1) in the first stage, gets template.yaml and builds an EC2 Windows instance via script.
Note that when this machine boots up, user data starts a script that downloads requirements (git etc.) and code, sets up IIS and does various other things.
This happens once the CloudFormation part has completed, and takes about another 5 mins.
2) I then want to run external tests on this machine, maybe using BlazeMeter, as the second part of the pipeline.
The problem is that between stages 1 and 2 I need to wait for the website to work on the box, so I need to wait at least 5 mins. I could add a manual approval stage, but this seems cumbersome.
Does anyone have a way to add this timed wait, or a pipeline process to check that the site is up?

Alternative to Cronjob using AWS Lambda + Cloudwatch

I'm a developer at a startup and right now we are using around 30 cronjobs. Some of them run every minute, others run once per day, and others run on specific days. The problem is the ones that run every minute, when most of the time that is not necessary.
This somewhat increases our expenses because they still run during the night, when most of the time our services have nobody online (and the jobs don't need to run).
We have been talking about using AWS to replace those cronjobs with something event based. Yet I cannot find a solution. Here's an example of one of our cronjobs:
A customer starts a registration and has 8 minutes to complete it. Right now, we have a cronjob that runs every minute to check whether they completed it, and if not, to "delete" it.
I thought I could replace this with an SNS + Lambda event. Basically, when a user starts registration, send a message to SNS, which would trigger a Lambda function. Yet it should only run after 8 minutes, not instantly.
I've seen on SNS that we can delay up to 15 minutes, but we have another service that sends an email after a few hours, for which that would not work.
Does anyone have a clue how I can do this?
Thanks
You can use AWS step functions to implement the workflow and add a delay to wait before invoking the Lambda function.

Is there a way to set a walltime on AWS Batch jobs?

Is there a way to set a maximum running time for AWS Batch jobs (or queues)? This is a standard setting in most batch managers, which avoids wasting resources when a job hangs for whatever reason.
As of April, 2018, AWS Batch now supports setting a Job Timeout when submitting a Job, or in the job definition.
https://aws.amazon.com/about-aws/whats-new/2018/04/aws-batch-adds-support-for-automatic-termination-with-job-execution-timeout/
You specify an attemptDurationSeconds parameter, which must be at least 60 seconds, either in your job definition, or when you submit the job. When this number of seconds has passed following the job attempt's startedAt timestamp, AWS Batch terminates the job. On the compute resource, your job's container receives a SIGTERM signal to give your application a chance to shut down gracefully; if the container is still running after 30 seconds, a SIGKILL signal is sent to forcefully shut down the container.
Source: https://docs.aws.amazon.com/batch/latest/userguide/job_timeouts.html
POST /v1/submitjob HTTP/1.1
Content-type: application/json
{
    ...
    "timeout": {
        "attemptDurationSeconds": number
    }
}
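With boto3, the same timeout can be passed to submit_job. The sketch below is illustrative: the helper names submit_job_args and submit_with_timeout and the job, queue, and definition names in the usage note are placeholders. The 60 second minimum comes from the documentation quoted above.

```python
def submit_job_args(job_name, job_queue, job_definition, timeout_seconds):
    """Build keyword arguments for Batch submit_job with a job timeout."""
    if timeout_seconds < 60:
        raise ValueError("attemptDurationSeconds must be at least 60")
    return {
        "jobName": job_name,
        "jobQueue": job_queue,
        "jobDefinition": job_definition,
        "timeout": {"attemptDurationSeconds": timeout_seconds},
    }


def submit_with_timeout(job_name, job_queue, job_definition, timeout_seconds):
    """Submit the job; Batch terminates an attempt that exceeds the timeout."""
    import boto3  # imported lazily so submit_job_args works without boto3

    client = boto3.client("batch")
    return client.submit_job(
        **submit_job_args(job_name, job_queue, job_definition, timeout_seconds)
    )
```

For example, submit_with_timeout("nightly-report", "my-queue", "my-job-def:1", 3600) would cap each attempt at one hour.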
AFAIK there is no feature to do this. However, a workaround was suggested in the forum for a similar question.
One idea is to call Batch as an Activity from Step Functions, and ping
back on a schedule (e.g. every minute) from that job. If it stops
responding then you can detect that situation as a Timeout in the
activity and act accordingly (terminate the job etc.). Not an ideal
solution (especially if the job continues to ping back as a "zombie"),
but it's a start. You'd also likely have to store activity tokens in a
database to trace them to Batch job id.
Alternatively, you split that setup into 2 steps, and schedule a Batch
job from a Lambda in the first state, then pass the Batch job id to
the second step which then polls Batch (from another Lambda) for its
state with Retry and IntervalSeconds (e.g. once every minute, or even
with exponential backoff), and MaxAttempts calculated based on your
timeout. This way, you don't need any external state storage
mechanism, long polling or even a "ping back" from the job (it CAN be
a zombie), but the downside is more steps.
There is no option to set a timeout on a Batch job, but you can set up a Lambda function that triggers every hour or so and deletes jobs created more than, say, 24 hours ago.
I have been working with AWS for some time now and could not find a way to set a maximum running time for Batch jobs.
However, there are some alternative approaches you could use.
AWS Forum
Sadly, there is no way to set an execution time limit on AWS Batch.
One solution may be to edit the Docker entry point so that it enforces the execution time limit.