Throttling while registering activities in Simple Work Flow - amazon-web-services

We have started to experience failures when our processes start up during the registration of activities. The problem is happening in GenericActivityWorker.registerActivityTypes.
The exception generate is:
Caused by: AmazonServiceException: Status Code: 400, AWS Service: AmazonSimpleWorkflow, AWS Request ID: 78726c24-47ee-11e3-8b49-534d57dc0b7f, AWS Error Code: ThrottlingException, AWS Error Message: Rate exceeded
at com.amazonaws.http.AmazonHttpClient.handleErrorResponse(AmazonHttpClient.java:686)
at com.amazonaws.http.AmazonHttpClient.executeHelper(AmazonHttpClient.java:350)
at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:202)
at com.amazonaws.services.simpleworkflow.AmazonSimpleWorkflowClient.invoke(AmazonSimpleWorkflowClient.java:3061)
at com.amazonaws.services.simpleworkflow.AmazonSimpleWorkflowClient.registerActivityType(AmazonSimpleWorkflowClient.java:2231)
at com.amazonaws.services.simpleworkflow.flow.worker.GenericActivityWorker.registerActivityType(GenericActivityWorker.java:153)
at com.amazonaws.services.simpleworkflow.flow.worker.GenericActivityWorker.registerActivityTypes(GenericActivityWorker.java:118)
at com.amazonaws.services.simpleworkflow.flow.worker.GenericActivityWorker.registerTypesToPoll(GenericActivityWorker.java:105)
at com.amazonaws.services.simpleworkflow.flow.worker.GenericWorker.start(GenericWorker.java:367)
at com.amazonaws.services.simpleworkflow.flow.ActivityWorker.start(ActivityWorker.java:248)
at com.fluid.retail.workflows.DefaultWorkflowHost.start(DefaultWorkflowHost.java:226)
... 5 more
The ActivityWorker in question has 5 activity implementation classes associated with it, and I think that this throttling is occurring because the internal Flow Framework code is looping over the activity types to register them without any delay in between them.
Because this code is internal to the framework, we can't add any sleep() calls to prevent being throttled.
Any ideas would be appreciated.

Are you sure this is happening during registering your ur activities? Or it is happening during scheduling your activities?
You would get this issue if you try to run a workflow that will schedule too many activities too fast. At this point you have 2 options.
1. Try and make the activites sequential and make them wait on the previous one.
2. Contact AWS to increase your accounts rate.

Related

How to fix CloudRun error 'The request was aborted because there was no available instance'

I'm using managed CloudRun to deploy a container with concurrency=1. Once deployed, I'm firing four long-running requests in parallel.
Most of the time, all works fine -- But occasionally, I'm facing 500's from one of the nodes within a few seconds; logs only provide the error message provided in the subject.
Using retry with exponential back-off did not improve the situation; the retries also end up with 500s. StackDriver logs also do not provide further information.
Potentially relevant gcloud beta run deploy arguments:
--memory 2Gi --concurrency 1 --timeout 8m --platform managed
What does the error message mean exactly -- and how can I solve the issue?
This error message can appear when the infrastructure didn't scale fast enough to catch up with the traffic spike. Infrastructure only keeps a request in the queue for a certain amount of time (about 10s) then aborts it.
This usually happens when:
traffic suddenly largely increase
cold start time is long
request time is long
We also faced this issue when traffic suddenly increased during business hours. The issue is usually caused by a sudden increase in traffic and a longer instance start time to accommodate incoming requests. One way to handle this is by keeping warm-up instances always running i.e. configuring --min-instances parameters in the cloud run deploy command. Another and recommended way is to reduce the service cold start time (which is difficult to achieve in some languages like Java and Python)
I also experiment the problem. Easy to reproduce. I have a fibonacci container that process in 6s fibo(45). I use Hey to perform 200 requests. And I set my Cloud Run concurrency to 1.
Over 200 requests I have 8 similar errors. In my case: sudden traffic spike and long processing time. (Short cold start for me, it's in Go)
I was able to resolve this on my service by raising the max autoscaling container count from 2 to 10. There really should be no reason that 2 would be even close to too low for the traffic, but I suspect something about the Cloud Run internals were tying up to 2 containers somehow.
Setting the Max Retry Attempts to anything but zero will remedy this, as it did for me.

Celery/SQS task retry gone haywire - how to get rid of it?

We've got Celery/SQS set up for asynchronous task management. We're running Django for our framework. We have a celery task that has a self.retry() in it. Max_retries is set to 15. The retry is happening with an exponential backoff and takes 182 hours to complete all 15 retries.
Last week, this task went haywire, I think due to a bug in our code not properly handling a service outage. It resulted in exponential creation (retrying?) of the same celery task. It eventually used up all available memory and the worker crashed. Restarting the worker results in another crash a couple hours later, since all those tasks (and their retries) keep retrying and spawning new retries until we run out of memory again. Ultimately we ended up with nearly 600k tasks created!
We need our workers to ignore all the tasks with a specific celery GUID. Ideally we could just get rid of them for good. I was going to use revoke() but, per documentation (http://docs.celeryproject.org/en/3.1/userguide/workers.html#commands), this is only implemented for Redis and RabbitMQ, not SQS. Furthermore, when I go to the SQS service in the AWS console, it's showing zero messages in flight so it's not like I can just flush it.
Is there a way to delete or revoke a specific message from SQS using the Celery task ID? Or is there another way to fix this problem? Obviously we need to fix our code so we don't get into this situation again, but first we need to get our worker up and running because without it our website has reduced functionality. Thanks!

Authentication with Cognito - where to find logs

We have 2 React Native app are using AWS Cognito for authentication. We use library react-native-aws-cognito-js in our code. The apps are working fine until these 2 days. Apps are experiencing intermittent "Internal Server Error".
How can I find more information about this error? Any tool can help us pinpoint the cause?
Update
From CloudTrail, each API call has an event "CreateNetworkInterface". Many of such API calls have error code "Client.NetworkInterfaceLimitExceeded". What is the cause and solution to this?
According to this AWS Doc (in Chinese), CloudWatch will not write to log when error is due to insufficient IP/ENI. That explains the increase in error number but no logs in CloudWatch.
Upate 2
We have found a scheduled Lambda job which may exhausted IP addresses. We stopped the batch job. But still can't have too many user login to server due to "Client.NetworkInterfaceLimitExceeded" error. I realized that there are many "CreateNetworkInterface" event and few "DeleteNetworkInterface" event. How can I "clean up / reset" all network interface in VPC?
Short answer: Cloud Trail.
Long answer with a suggestion
Assuming your application code is fine, most likely the cause of your 500 error is based on Cognito's initial limitations (e.g., number of calls per user): https://docs.aws.amazon.com/cognito/latest/developerguide/limits.html.
AWS suggests to use Cloud Trail, for logging Api calls.
However I would suggest, to prove the limitations first, add some logs around the api call yourself, and in development you could call your app/api with a high number of calls; and most likely you will see the 500 error due to the limitations.
You could do the following in the terminal:
for i in `seq 1 1000`; do curl --cookie SecureCookie=TokenValueFromAWS http://localhost:desirablePort/SecuredPath; done

When using the Admin SDK directory API to insert Org Units a dailyLimitExceeded error is returned even though that quota has not been reached

I work for an Student Information System and we're using the Admin SDK directory API to create school districts Google Org Unit structures from within our software.
POST https://www.googleapis.com/admin/directory/v1/customer/customerId/orgunits
When generating these API requests we're consistently receiving dailyLimitExceeded errors even when the district's quota has not been reached.
This error can be bypassed by ignoring the error, and implementing an exponential back-off routine, but I believe this to be acting much more like the quotaExceeded error is intended to act rather than dailyLimitExceeded, in that the request succeeds afterward on the first retry of this request.
In detail, the test I just ran successfully completed 9 of these API calls and then I received this response on the 10th:
Google.Apis.Requests.RequestError
Quota limit exceeded for the day. [403]
Errors [Message[Quota limit exceeded for the day.] Location[ - ] Reason[dailyLimitExceeded] Domain[usageLimits]
From the start of the batch of API calls it took about 10 seconds to get to the point where the error occurred.
Thanks for your help!
What I would suggest is to slow down your API requests. Don't make like 10 requests in 1 second. Give it a space in between requests. You are correct to implement exponential backoff. Also, if you can, use other accounts as well to make requests.

Approach to crashed workers in amazon swf

We're currently implementing a workflow in Amazon SWF where we submit jobs/workflow executions from our web application. Everything was fairly quick and painless to get set up using the Ruby Flow framework. As long as the deciders/activity workers don't crash we seem to be able to handle most issues/exceptions gracefully.
My question is, what is common practice for the scenario where the decider process crashes midway through a workflow execution? If the task fails in that way, is it possible to push an SNS notification (I've seen no examples) or something to indicate to another process that there's been an unexpected failure/crash?
There are various types of "decider" failures.
Workflow worker crashes while processing a decision. The decision task is automatically rescheduled after specified timeout. Make sure that workflow type defaultTaskStartToCloseTimeout is not set too high. If this crash is not related to code correctness then rescheduled task is processed and workflow execution continues normally.
Workflow worker doesn't crash but workflow execution itself fails. In this case you can use ListClosedWorkflowExecutions to count such failed workflows.
Workflow worker doesn't crash but a decision task cannot complete as RespondDecisionTaskCompleted fails due to a bug in the Flow framework. As from SWF point of view task is never completed it at some point is marked as timed out and rescheduled. As bug is still present a new task is again never completes and rescheduled, and so on. The workflow execution that is experiencing such issue has a history with a tail that consists from repeated "decision task scheduled, decision task timed out" events. If your workflow has a known execution time limit then the best way to catch this issue is to set reasonable executionStartToCloseTimeout and look for timed out workflow executions. If the decision task timeout is set too low such workflows can also hit the limit on history size before the execution timeout.
All swf metrics are not published to cloud watch. So all completed and failed workflows will send the metrics to cloudwatch where you can create alarms to send you notifications when any workflow fails.