AWS Glue jobs running in parallel get "Rate exceeded" ThrottlingException, Status Code: 400 - amazon-web-services

I have a simple (just print hello) Glue 2.0 job that runs in parallel, triggered from a Step Functions Map. The Glue job's Maximum concurrency is set to 40, and so is the Step Functions Map's MaxConcurrency.
It runs fine if I kick off fewer than 20 parallel Glue jobs, but beyond that (I tried a maximum of 35 in parallel) I get intermittent errors like this:
Rate exceeded (Service: AWSGlue; Status Code: 400; Error Code:
ThrottlingException; Request ID: 0a350b23-2f75-4951-a643-20429799e8b5;
Proxy: null)
I've checked the service quotas documentation (https://docs.aws.amazon.com/general/latest/gr/glue.html) and my account settings. The maximum of 200 concurrent job runs should have handled my 35 parallel jobs happily.
There are no other Glue jobs scheduled to run at the same time in my AWS account.
Should I just blindly request a quota increase and see if that fixes it, or is there anything I can do to work around this?

Thanks to luk2302 and Robert for the suggestions. Based on their advice, I reached a solution:
Add a Retry to the Glue task. (I tried IntervalSeconds 1 and BackoffRate 1, but that was too low and didn't work.)
"Resource": "arn:aws:states:::glue:startJobRun",
"Type": "Task",
"Retry": [
{
"ErrorEquals": [
"Glue.AWSGlueException"
],
"BackoffRate": 2,
"IntervalSeconds": 2,
"MaxAttempts": 3
}
]
Hope this helps someone.

The quota that you are hitting is not the concurrent job quota of Glue, but the Start Job Run API rate quota. You basically requested too many job runs per second. If possible, just wait between successive Start Job Run calls.
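If you are starting the runs from your own code rather than from a Step Functions Map, a minimal boto3 sketch of that "wait between calls" idea could look like the one below. The job name, arguments, and one-second delay are placeholder assumptions, not values from the question.

import time

import boto3
from botocore.exceptions import ClientError

glue = boto3.client("glue")

def start_runs(job_name, runs, delay_seconds=1.0):
    """Start several runs of one Glue job, pacing the StartJobRun calls."""
    run_ids = []
    for args in runs:
        while True:
            try:
                resp = glue.start_job_run(JobName=job_name, Arguments=args)
                run_ids.append(resp["JobRunId"])
                break
            except ClientError as err:
                # Back off briefly when the StartJobRun rate quota is hit.
                if err.response["Error"]["Code"] == "ThrottlingException":
                    time.sleep(delay_seconds)
                else:
                    raise
        time.sleep(delay_seconds)  # space out successive StartJobRun calls
    return run_ids

# Hypothetical usage: 35 runs of a job named "hello-job"
start_runs("hello-job", [{"--run_index": str(i)} for i in range(35)])

When the runs are started from a Step Functions Map, the Retry block in the accepted answer above is the equivalent way to absorb the throttling.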

Related

How to improve Airflow task concurrency

I have a question about DAG and task concurrency.
Scenario:
I have two DAG files.
DAG 1 has only one task.
DAG 2 has three tasks. Of the three, one task calls a third-party API (the API response time is 900 milliseconds; it is a simple weather API for showing the current weather of a provided city, e.g. https://api.weatherapi.com/v1/current.json?key=api_key&q=London), and the other 2 tasks are just for logs (print statements).
I trigger DAG 1 with a custom payload having 1000 records
(for example:
conf: {
    [
        {
            "city": "London",
            ...
        },
        {
            ...
        }
    ]
}
)
The DAG 1 task just loops through the records and calls DAG 2 1000 times, once per record.
So first, I want to ask about this approach: is this a good way to process the list of data with 2 DAGs, or is there a better way to do this?
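For concreteness, a rough sketch of how such a fan-out might be wired in Airflow 2.x is shown below. The DAG ids, the conf key "records", and the trigger_dag import path are assumptions on my part (older 2.x releases expose the helper under airflow.api.common.experimental.trigger_dag instead).

from datetime import datetime

from airflow import DAG
from airflow.api.common.trigger_dag import trigger_dag
from airflow.operators.python import PythonOperator

def fan_out(**context):
    # Records passed in when DAG 1 is triggered, assumed under dag_run.conf["records"].
    records = context["dag_run"].conf.get("records", [])
    for i, record in enumerate(records):
        # One DAG 2 run per record; run_id must be unique per trigger.
        trigger_dag(
            dag_id="dag_2",
            run_id=f"record_{i}_{datetime.utcnow().isoformat()}",
            conf=record,
        )

with DAG("dag_1", start_date=datetime(2022, 1, 1), schedule_interval=None) as dag:
    PythonOperator(task_id="fan_out", python_callable=fan_out)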
My concern is that it takes 17 minutes for DAG 2 to process all 1000 executions.
I am using Managed Workflows for Apache Airflow (MWAA on AWS); the configuration is as below:
Environment class: mw1.large
Scheduler count: 4
Maximum worker count: 25
Minimum worker count: 20
Region: us-west-2
core.max_active_runs_per_dag: 1000
core.max_active_tasks_per_dag: 5000
Default MWAA config for task as per aws documentation
(https://docs.aws.amazon.com/mwaa/latest/userguide/best-practices-tuning.html)
core.parallelism: 10000
core.dag_concurrency: 10000
Can anyone guide me on how I can improve my AWS Managed Airflow performance to increase the parallelism of DAG runs?
I want to understand the parallelism and concurrency settings: if they are set as high as in the configs above, why does it take Airflow 17 minutes to complete the tasks?
Thanks!

how to add sharedIdentifier to aws event bridge rule for scheduled execution of aws batch job

I configured an AWS EventBridge rule (via the web GUI) for running an AWS Batch job. The rule is triggered, but I am getting the following error after invocation:
shareIdentifier must be specified. (Service: AWSBatch; Status Code: 400; Error Code: ClientException; Request ID: 07da124b-bf1d-4103-892c-2af2af4e5496; Proxy: null)
My job uses a scheduling policy and needs shareIdentifier to be set, but I don't know how to set it. Here is a screenshot of the rule configuration:
There are no additional settings for further arguments/parameters of the job; the only thing I can configure is retries. I also checked the aws-cli command for putting a rule (https://awscli.amazonaws.com/v2/documentation/api/latest/reference/events/put-rule.html), but it doesn't seem to have any additional settings. Any suggestions on how to solve this? Or working examples?
Edited:
I ended up using the Java SDK for AWS Batch: https://mvnrepository.com/artifact/com.amazonaws/aws-java-sdk-batch. I have a scheduled method that periodically spawns jobs with the following piece of code:
AWSBatch client = AWSBatchClientBuilder.standard().withRegion("eu-central-1").build();
SubmitJobRequest request = new SubmitJobRequest()
        .withJobName("example-test-job-java-sdk")
        .withJobQueue("job-queue")
        .withShareIdentifier("default")
        .withJobDefinition("job-type");
SubmitJobResult response = client.submitJob(request);
log.info("job spawn response: {}", response);
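For anyone doing the same from Python instead of the Java SDK, a roughly equivalent boto3 sketch is below; the job name, queue, definition, and share identifier are placeholders mirroring the Java example above.

import boto3

# Equivalent of the Java SDK call above, using boto3; values are placeholders.
batch = boto3.client("batch", region_name="eu-central-1")

response = batch.submit_job(
    jobName="example-test-job-python-sdk",
    jobQueue="job-queue",
    jobDefinition="job-type",
    shareIdentifier="default",  # required when the job queue uses a scheduling policy
)
print("job spawn response:", response)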
Have you tried providing additional settings to your target via the input transformer, as referenced in the AWS docs AWS Batch Jobs as EventBridge Targets?
FWIW I'm running into the same problem.
I had a similar issue, from the CLI and the GUI, I just couldn't find a way to pass ShareIdentifier from an Eventbridge rule. In the end I had to use a state machine (step function) instead:
"States": {
"Batch SubmitJob": {
"Type": "Task",
"Resource": "arn:aws:states:::batch:submitJob.sync",
"Parameters": {
"JobName": <name>,
"JobDefinition": <Arn>,
"JobQueue": <QueueName>,
"ShareIdentifier": <Share>
},
...
As you can see, it handles ShareIdentifier fine.

google-cloud-videointelligence - TimeoutError start Oct 5

I started to get:
concurrent.futures._base.TimeoutError: Operation did not complete within the designated timeout.
(even for 5-second videos, and I have timeout=1000)
It started on Oct 5 (before that it worked great for months).
What I use:
Python 3.8.7, pip install google-cloud-videointelligence==2.3.3, Google Cloud, running on Cloud Run (python:3.8.7-slim)
Code:
import json

from google.cloud import videointelligence
from google.protobuf.json_format import MessageToJson

video_client = videointelligence.VideoIntelligenceServiceClient()
context = videointelligence.VideoContext(segments=None)
features = [
    videointelligence.Feature.LABEL_DETECTION,
    videointelligence.Feature.TEXT_DETECTION,
    videointelligence.Feature.OBJECT_TRACKING,
]
request = videointelligence.AnnotateVideoRequest(
    input_uri="gs://" + path,  # path to the video object, defined elsewhere
    video_context=context,
    features=features,
)
operation = video_client.annotate_video(request)
result = operation.result(timeout=1000)
result = json.loads(MessageToJson(result._pb))
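As a defensive variation on the last two lines above, the single blocking result(timeout=1000) call can be replaced with an explicit polling loop. This is only a sketch continuing from the snippet above (operation is the object returned by annotate_video), assuming the standard google.api_core long-running operation object; the poll interval and overall deadline are arbitrary choices.

import json
import time

from google.protobuf.json_format import MessageToJson

# Poll the long-running operation instead of blocking on one timeout.
deadline_seconds = 1800  # arbitrary overall deadline for this sketch
started = time.monotonic()
while not operation.done():
    if time.monotonic() - started > deadline_seconds:
        raise TimeoutError("Annotation did not finish within the deadline")
    time.sleep(15)  # arbitrary poll interval

result = operation.result()
result = json.loads(MessageToJson(result._pb))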
One of the reasons that you encounter error concurrent.futures._base.TimeoutError: Operation did not complete within the designated timeout. is that you are sending more video material per minute than before and you might be hitting the quota "Backend time in seconds per minute" during processing. See quotas for more information.
Check if you are hitting the quota in Cloud Console > IAM & Admin > Quotas and search for "Backend time in seconds per minute". If so, increase the "Backend time in seconds per minute" quota to your desired value to increase the number of videos processed in parallel.
Just to add, I was able to reproduce the same error when I hit the quota. Below is the sample image:
See actual error message when quota was hit:

EMR Job Long Running Notifications

Consider that we have around 30 EMR jobs that run from 5:30 AM to 10:30 AM PST.
We receive flat files in an S3 bucket, and through Lambda functions the received files are copied to other target paths.
We have DynamoDB tables for data processing once data is received in the target path.
Now the problem area: since we have multiple dependencies and parallel execution, jobs sometimes fail due to memory issues and sometimes take longer to complete.
Sometimes a job will run for 4 or 5 hours and finally get terminated with a memory issue or some other problem, like a subnet not being available or an EC2 issue. So we don't want to wait that long.
E.g.: Job_A processes the 1st to 4th files and Job_B processes the 5th to 10th files, and so on.
Here Job_B has a dependency on Job_A for the 3rd file, so Job_B will wait until Job_A gets completed. We have dependencies like this throughout our process.
I would like to get a notification from the EMR jobs like below:
E.g.: the average running time for Job_A is 1 hour, but it has been running for more than 1 hour; in this case I need to get notified by email or some other way.
How can I achieve this? Please help or advise.
Regards,
Karthik
Repeatedly call the list of steps using Lambda and an AWS SDK, e.g. boto3, and check the start date. When it is more than 1 hour in the past, you can trigger a notification, for example via Amazon SES. See the documentation.
For example, you can call list_steps for the running steps only.
response = client.list_steps(
    ClusterId='string',
    StepStates=['RUNNING']
)
It will then give you the response below.
{
    'Steps': [
        {
            ...
            'Status': {
                ...
                'Timeline': {
                    'CreationDateTime': datetime(2015, 1, 1),
                    'StartDateTime': datetime(2015, 1, 1),
                    'EndDateTime': datetime(2015, 1, 1)
                }
            }
        },
    ],
    ...
}
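Putting the two pieces together, a Lambda handler along the following lines could do the check and send the notification. The cluster id, one-hour threshold, and e-mail addresses are placeholders, and SES is just one possible notification channel.

from datetime import datetime, timedelta, timezone

import boto3

emr = boto3.client("emr")
ses = boto3.client("ses")

THRESHOLD = timedelta(hours=1)     # alert when a step runs longer than this
CLUSTER_ID = "j-XXXXXXXXXXXXX"     # placeholder cluster id
SENDER = "alerts@example.com"      # placeholder, must be verified in SES
RECIPIENT = "oncall@example.com"   # placeholder

def lambda_handler(event, context):
    steps = emr.list_steps(ClusterId=CLUSTER_ID, StepStates=["RUNNING"])["Steps"]
    now = datetime.now(timezone.utc)
    for step in steps:
        started = step["Status"]["Timeline"].get("StartDateTime")
        if started and now - started > THRESHOLD:
            ses.send_email(
                Source=SENDER,
                Destination={"ToAddresses": [RECIPIENT]},
                Message={
                    "Subject": {"Data": f"EMR step {step['Name']} running longer than 1 hour"},
                    "Body": {"Text": {"Data": f"Step {step['Id']} started at {started}."}},
                },
            )

Schedule the Lambda with an EventBridge rule (e.g. every 15 minutes) so the check runs while the jobs are active.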

How much time does an AWS Step Function keep the execution running?

I am new to AWS Step Functions. I have created a basic step function with an activity worker in the back end. For how long does the Step Function keep the execution alive and not time out if the execution has still not been picked up by the activity worker?
For how long does the Step Function keep the execution alive and not time out if the execution has still not been picked up by the activity worker?
1 year
You can specify TimeoutSeconds in the activity task, which is also the recommended way:
"ActivityState": {
"Type": "Task",
"Resource": "arn:aws:states:us-east-1:123456789012:activity:HelloWorld",
"TimeoutSeconds": 300,
"HeartbeatSeconds": 60,
"Next": "NextState"
}
Step Functions can keep the task in the queue for a maximum of 1 year. You can find more info on Step Functions limits on this page.
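For completeness, the worker side that eventually picks such a task up typically long-polls GetActivityTask and, when HeartbeatSeconds is set, keeps sending heartbeats while it works. A minimal boto3 sketch is below; the activity ARN mirrors the example above, and the worker name and output are placeholders.

import boto3

sfn = boto3.client("stepfunctions")
ACTIVITY_ARN = "arn:aws:states:us-east-1:123456789012:activity:HelloWorld"

def poll_once():
    # Long-polls for up to about a minute; taskToken is empty if nothing is pending.
    task = sfn.get_activity_task(activityArn=ACTIVITY_ARN, workerName="worker-1")
    token = task.get("taskToken")
    if not token:
        return
    try:
        # Do the real work here, calling send_task_heartbeat(taskToken=token)
        # at least once every HeartbeatSeconds while it runs.
        sfn.send_task_heartbeat(taskToken=token)
        sfn.send_task_success(taskToken=token, output='{"hello": "world"}')
    except Exception as err:
        sfn.send_task_failure(taskToken=token, error="WorkerError", cause=str(err))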