Driftctl ThrottlingException: Rate exceeded on AWS - amazon-web-services

I'm hitting a rate-limit error from AWS. Any idea how to fix it? Is there an option to throttle the requests driftctl makes?
ThrottlingException: Rate exceeded
status code: 400
I tried driftctl in a GitHub Action and expected it to run cleanly.

AWS API rate limits aren't directly controllable, and generally can't be raised through AWS support. However, all of the AWS SDKs perform automatic backoff and retry on throttling errors. How well that works also depends on how driftctl is implemented and how it uses the AWS clients in the SDK.
I haven't used the tool myself, but from reading up on what it does, I suspect it is simply making a lot of API calls in a short period while trying to scan all of your AWS infrastructure. I would start by configuring it not to do deep scans, and try it against a smaller Terraform state file to see whether the problem persists.
It looks like it's written in Go and probably uses the Go AWS SDK. If it uses version 2.x, there are standard environment variables you can set to increase the number of retries the SDK performs, in particular AWS_MAX_ATTEMPTS, which usually defaults to 3.
https://docs.aws.amazon.com/sdkref/latest/guide/feature-retry-behavior.html
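For illustration, here is a hedged Python sketch of roughly what an SDK-level retry loop does with that setting. driftctl's real retry logic lives inside the Go AWS SDK; the names below are made up, and real SDKs also add jitter to the backoff:

```python
import os
import time

class ThrottlingError(Exception):
    """Stands in for the SDK's ThrottlingException."""

def call_with_retries(call, max_attempts=None, base_delay=0.1, sleep=time.sleep):
    """Retry a throttled call up to AWS_MAX_ATTEMPTS times,
    backing off exponentially between attempts."""
    if max_attempts is None:
        max_attempts = int(os.environ.get("AWS_MAX_ATTEMPTS", "3"))
    for attempt in range(1, max_attempts + 1):
        try:
            return call()
        except ThrottlingError:
            if attempt == max_attempts:
                raise  # out of attempts: surface the throttling error
            sleep(base_delay * 2 ** (attempt - 1))
```

In a GitHub Actions workflow, the equivalent is just exporting AWS_MAX_ATTEMPTS (and optionally AWS_RETRY_MODE) in the job's env block before running driftctl, so the SDK inside it picks them up.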
Bear in mind that when you hit these rate limits, something undesirable is often going on. It's worth turning on verbose logging for driftctl, if possible, to see exactly which AWS API calls it is making and whether they are ones you would expect to see.
If the problem persists, it's worth opening an issue on their GitHub project and getting someone who knows the code to help you debug it: https://github.com/snyk/driftctl

Related

Load testing AWS SDK client

What is the recommended way to performance-test AWS SDK clients? I'm basically just listing/describing resources and would like to see what happens when I query 10k objects. Does AWS provide some type of mock API, or do I really need to provision 10k of each resource type to do this?
I can of course mock at (at least) two levels:
SDK: I wrap the SDK with my own interfaces and create mocks. This doesn't exercise the SDK's JSON-to-object deserialization code, and my mocks affect the AppDomain with additional memory, garbage collection, etc.
REST API: As I understand it, the SDKs are just wrappers around the REST API (hence the HTTP response codes shown in the objects). It seems I can configure the SDK to point at custom endpoints.
This isolates the mocks from the main AppDomain and is more representative, but of course I'm still making assumptions about response times, limits, etc.
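For what it's worth, the SDK-level mock (the first option) can be quite light. A minimal Python sketch with a hypothetical wrapper class and a mocked client; the response shape mirrors EC2's DescribeInstances, but the wrapper name is invented:

```python
from unittest.mock import MagicMock

class InstanceRepo:
    """Hypothetical wrapper: app code talks to this interface,
    never to the SDK client directly."""
    def __init__(self, ec2_client):
        self._ec2 = ec2_client

    def instance_ids(self):
        resp = self._ec2.describe_instances()
        return [i["InstanceId"]
                for r in resp["Reservations"]
                for i in r["Instances"]]

# Mock at the wrapper's boundary: fake the raw SDK response in memory.
fake_ec2 = MagicMock()
fake_ec2.describe_instances.return_value = {
    "Reservations": [{"Instances": [{"InstanceId": f"i-{n:04x}"}
                                    for n in range(3)]}]
}
```

`InstanceRepo(fake_ec2).instance_ids()` then returns the three fake ids without any network call, which is exactly why this level is cheap but doesn't exercise the SDK's own serialization and transport code.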
Besides the above taking a long time to implement, I would like to make sure my code won't fail at scale, either locally or at AWS. The only way I see to guarantee that is creating (and paying for) the resources at AWS. Am I missing anything?
When you query 10k or more objects you'll have to deal with:
Pagination - the API usually returns only a limited number of items per call, providing NextToken for the next call.
Rate Limiting - if you hammer some AWS APIs too hard they'll rate-limit you, which the SDK will typically surface as some kind of rate-limit-exceeded exception.
Memory usage - hopefully you don't collect all the results in the memory before processing. Process them as they arrive to conserve your operating memory.
Other than that I don't see why it shouldn't work.
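The first and third points usually get handled together: write the pagination loop as a generator so items are processed as they arrive instead of being collected first. A minimal sketch with a fake client; the NextToken shape is typical of AWS list/describe APIs, but the method names here are hypothetical:

```python
def iter_resources(client, page_size=100):
    """Follow NextToken pagination, yielding items one at a time
    so the full result set is never held in memory."""
    token = None
    while True:
        kwargs = {"MaxResults": page_size}
        if token:
            kwargs["NextToken"] = token
        page = client.list_resources(**kwargs)
        yield from page["Items"]
        token = page.get("NextToken")
        if not token:
            break

class FakePagedClient:
    """Stands in for a real SDK client during a load test:
    serves `total` numbered items in NextToken-linked pages."""
    def __init__(self, total):
        self.total = total

    def list_resources(self, MaxResults=100, NextToken=None):
        start = int(NextToken or 0)
        end = min(start + MaxResults, self.total)
        page = {"Items": list(range(start, end))}
        if end < self.total:
            page["NextToken"] = str(end)
        return page
```

A fake like this is also a cheap way to exercise the 10k-object case from the question without creating (and paying for) real resources.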
Update: Also check out Moto, the AWS mocking library for Python, which can also run in a standalone server mode for use with other languages. As with any mocking, though, it may not behave 100% like the real thing, for instance around rate-limiting behaviour.

Google API Speeds Slow in Cloud Run / Functions?

Bottom Line: Cloud Run and Cloud Functions seem to have bizarrely limited bandwidth to the Google Drive API endpoints. Looking for advice on how to work around it, or, ideally, for #Google support to fix the underlying issue(s), as I won't be the only one with this use case.
Background: I have what I think is a really simple use case. We're trying to let users on our private-domain Google Drive take existing audio recordings, send them to the Speech API to generate a transcript on an ad hoc basis, and drop the transcript back into the same Drive folder with an email notification to the submitter. Easy, right? The only hard part is that the Speech API will only read from Google Cloud Storage, so the 'hard part' should just be moving the file over. 'Hard' doesn't really cover it...
Problem: Writing in nodejs and using the latest version of the official modules for Drive and GCS, the file copying was going extremely slow. When we broke things down, it became apparent that the GCS speed was acceptable (mostly -- honestly it didn't get a robust test, but was fast enough in limited testing); it was the Drive ingress which was causing the real problem. Using even the sample Google Drive Download app from the repo was slow as can be. Thinking the issue might be either my code or the library, though, I ran the same thing from the Cloud Console, and it was fast as lightning. Same with GCE. Same locally. But in Cloud Functions or Cloud Run, it's like molasses.
Request:
Has anyone in the community run into this or a like issue and found a workaround?
#Google -- Any chance that whatever the underlying performance bottleneck is, you can fix it? This is a quintessentially 'serverless' use case, and it's hard to believe that the folks who've been doing this the longest can't crack it.
Thank you all in advance!
Updated 1/4/19 -- GCS is also slow following more robust testing. Image base also makes no difference (tried nodejs10-alpine, nodejs12-slim, nodejs12-alpine without impact), and memory limits equally do not impact results locally or on GCP (256m works fine locally; 2Gi fails in GCP).
Google Issue at: https://issuetracker.google.com/147139116
Self-inflicted wound. The Google-provided sample code tries to be asynchronous and do work in the background, and Cloud Run and Cloud Functions don't support that model (for now at least): CPU is only guaranteed while a request is being handled. Moving to promise chaining made it work as it should, as long as the work finishes while the request still has the CPU's attention. That limits what we can do with Cloud Run / Cloud Functions, but hopefully that too will evolve.

DynamoDB on-demand mode suddenly stops working

I have a table that is incrementally populated with a lambda function every hour. The write capacity metric is full of predictable spikes and throttling was normally avoided by relying on the burst capacity.
The first three loads after turning on on-demand mode worked. Thereafter it stopped loading new entries into the table and began to time out (going from ~10 seconds up to the current limit of 4 minutes). The lambda function was not modified at all.
Does anyone know why this might be happening?
EDIT: I just see timeouts in the logs.
Logs before failure
Logs after failure
Errors and availability (%)
Since you are using Lambda to perform incremental writes, this issue is more than likely on the Lambda side, and that is where I would start looking. Do you have CloudWatch logs to look through? If you can't find anything there, open a case with AWS support.
Unless this was recently fixed, there is a known bug in Lambda where you can get a series of timeouts. We ran into it on a project I worked on: a lambda would start up and then just sit there doing nothing, much like yours.
So like Kirk, I'd guess the problem is with the Lambda, not DynamoDB.
At the time there was no fix. As a workaround, we had another Lambda monitoring the one that suffered from failures and rerunning its failed invocations. Not sure if there are other solutions. Maybe deleting everything and setting it back up again (with your fingers crossed :))? That should be easy enough if everything is in CloudFormation.

How to relieve a rate-limited API?

We run a website that relies heavily on the Amazon Product Advertising API (APAA). When we experience a sudden spike in users we hit the rate limit, and all functions relying on the APAA shut down for a while. What can we do to prevent that?
So, obviously we have some basic caching in place, but the APAA doesn't allow us to cache data for very long, and APAA queries vary so much that there may not be any cached data to fall back on.
I think your only option is to retry the API calls until they work, but to do so in a smart way. Unfortunately, that's what everybody who gets throttled does, and AWS expects people to handle it themselves.
You can implement exponential backoff and add jitter so that throttled clients don't all retry at the same moment. AWS has a great blog post about solutions to this kind of problem: https://www.awsarchitectureblog.com/2015/03/backoff.html
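The "full jitter" variant from that post fits in a few lines of Python; the base and cap values here are illustrative, not prescribed:

```python
import random

def full_jitter_delay(attempt, base=0.1, cap=20.0):
    """Capped exponential backoff with full jitter: sleep a random
    amount between 0 and min(cap, base * 2**attempt) before retrying."""
    return random.uniform(0, min(cap, base * 2 ** attempt))

# Before retry number `attempt`, a caller would do:
#   time.sleep(full_jitter_delay(attempt))
```

Because every client picks a random point in the backoff window, throttled callers spread out over time instead of retrying in lockstep and re-triggering the rate limit together.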

Does anyone have knowledge on context switching - logging from app to disk via FSYNC vs syslog-ng

Recently, one of our most senior engineers asked me about context switching with respect to using syslog-ng vs. writing logs out from our application to disk.
Context:
I want to use syslog-ng to collect and ship output from our C++ application to Logstash on a remote log-server host, then push it all into Elasticsearch and use Kibana as a front end for log viewing, analysis, and derivation of useful metrics (the ELK stack).
We currently use an fsync buffer of either 4K or 8K that flushes logs to the logfile on disk at intervals; that is to say, we're not forcing a write to disk for each log entry.
Like any good performance-minded engineer, he wants to understand whether we'll see more context switching, or whether we can gain performance, by moving to syslog-ng.
So the question is: will using syslog-ng reduce or increase context switches on that application's host?
That's where my expertise runs out, hence the question - I don't have sufficient knowledge to work out the answer on my own.
Long-time lurker, still new to posting. Thanks!
It depends on how many logs you have to handle and how many resources you're willing to dedicate.
We use Kibana/Elasticsearch with Logstash and the load does get pretty heavy, but then again we have over 400 servers, so YMMV. Java isn't exactly known for being light on resources either. On the plus side, though, it's fairly easy to set up.
Parsing the logs in Logstash can be done with grok. Throw up a couple of VMs and play around with it; if you have a large environment, fine-tuning the configuration is a must to make it bearable, or if you have money you can throw hardware at it until it behaves.