AWS Data Pipeline DynamoDB to S3 503 SlowDown Error

We have a Data Pipeline that does a nightly copy of our DynamoDB table to S3 buckets so we can run reports on the data with Athena. Occasionally the pipeline fails with a 503 SlowDown error. The retries usually "succeed" but create tons of duplicate records in S3. The DynamoDB table has On-Demand read capacity and the pipeline has a myDDBReadThroughputRatio of 0.5. A couple of questions here:
I assume reducing the myDDBReadThroughputRatio would probably lessen the problem; if so, does anyone have a good ratio that is still performant but does not cause these errors?
Is there a way to prevent the duplicate records in S3? I can't figure out why they are being generated (possibly the records from the failed run are not removed?).
Of course any other thoughts/solutions for the problem would be greatly appreciated.
Thanks!

Using AWS Data Pipeline for continuous backups is not recommended.
AWS recently launched new functionality that lets you export DynamoDB table data to S3, which can then be analysed with Athena. Check it out here.
You can also use AWS Glue to do the same (link).
If you still want to continue using Data Pipeline, then the issue seems to be caused by S3 request limits being reached. You might need to check whether other requests are also writing to S3 at the same time, or whether you can limit the request rate from the pipeline using some configuration.
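If you go with the native export mentioned above, it can also be triggered from code; a minimal boto3 sketch (the table ARN, bucket, and prefix are placeholders, and point-in-time recovery must be enabled on the table):

import boto3

# Placeholders: replace with your own table ARN, bucket, and prefix.
TABLE_ARN = "arn:aws:dynamodb:us-east-1:123456789012:table/MyTable"
EXPORT_BUCKET = "my-export-bucket"

dynamodb = boto3.client("dynamodb")

# Kick off a full export of the table to S3 (requires PITR on the table).
export = dynamodb.export_table_to_point_in_time(
    TableArn=TABLE_ARN,
    S3Bucket=EXPORT_BUCKET,
    S3Prefix="dynamodb-exports/",
    ExportFormat="DYNAMODB_JSON",
)
print(export["ExportDescription"]["ExportStatus"])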

Related

S3 write concurrency using AWS Glue

I have a suspicion we are hitting an S3 write concurrency issue with an AWS Glue job. I am testing 10 DPUs writing 10k objects, 1 MB each (~10 GB total), and it is taking 2+ hours for just the write stage of the job. It seems like across 10 DPUs I should be able to distribute the writes well enough to get much better throughput. I am hitting several different bucket prefixes and do not think I'm being throttled by S3 or anything.
I see that my job is using EMRFS (the default S3 FileSystem API implementation for Glue), so that is good for write throughput from my understanding. I found some suggestions that say to adjust fs.s3.maxConnections, hive.mv.files.threads and set hive.blobstore.use.blobstore.as.scratchdir = false.
Where can I see what the current settings for these are in my Glue jobs and how can I configure them? While I see many settings and configurations in the Spark UI logs I can generate from my jobs, I'm not finding these settings.
How can I see what actual S3 write concurrency I'm getting in each worker in the job? Is this something I can see in the Spark UI logs or is there another metric somewhere that would show this?
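One way to experiment with these settings from inside a Glue PySpark script is to set them on the underlying Hadoop configuration before writing; a sketch (the values are illustrative, and whether EMRFS/Glue actually honors the Hive-specific keys set this way is an assumption I have not verified):

from pyspark.context import SparkContext
from awsglue.context import GlueContext

sc = SparkContext.getOrCreate()

# Illustrative values, not recommendations; the Hive-specific keys may or may
# not be honored by Glue when set through the Hadoop configuration.
hadoop_conf = sc._jsc.hadoopConfiguration()
hadoop_conf.set("fs.s3.maxConnections", "100")
hadoop_conf.set("hive.mv.files.threads", "15")
hadoop_conf.set("hive.blobstore.use.blobstore.as.scratchdir", "false")

glue_context = GlueContext(sc)
spark = glue_context.spark_session

Printing hadoop_conf.get("fs.s3.maxConnections") into the driver log at least shows whether the key has been set explicitly on the job's configuration, though the effective defaults live in the underlying EMRFS setup.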

How to query AWS load balancer log if there are terabytes of logs?

I want to query the AWS load balancer logs and automatically send a report for me on a schedule.
I am using Amazon Athena and AWS Lambda to trigger Athena. I created the data table based on the guide here: https://docs.aws.amazon.com/athena/latest/ug/application-load-balancer-logs.html
However, I encounter the following issues:
The logs bucket increases in size day by day, and I notice that if an Athena query needs more than 5 minutes to return a result, it sometimes produces an "unknown error".
The maximum timeout for an AWS Lambda function is only 15 minutes, so I cannot keep increasing the Lambda function timeout to wait for Athena to return a result (in the case that Athena needs more than 15 minutes, for example).
Can you suggest a better solution for my problem? I am thinking of using the ELK stack, but I have no experience working with ELK. Can you show me the advantages and disadvantages of ELK compared to the combo of AWS Lambda + Amazon Athena? Thank you!
First off, you don't need to keep your Lambda running while the Athena query executes. StartQueryExecution returns a query identifier that you can then poll with GetQueryExecution to determine when the query finishes.
Of course, that doesn't work so well if you're invoking the query as part of a web request, but I recommend not doing that. And, unfortunately, I don't see that Athena is tied into CloudWatch Events, so you'll have to poll for query completion.
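With boto3, for example, that polling loop looks roughly like this (the database, query, and results bucket are placeholders):

import time
import boto3

athena = boto3.client("athena")

# Placeholders: use your own database, query, and results location.
start = athena.start_query_execution(
    QueryString="SELECT count(*) FROM alb_logs",
    QueryExecutionContext={"Database": "default"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
)
query_id = start["QueryExecutionId"]

# Poll until the query reaches a terminal state.
while True:
    status = athena.get_query_execution(QueryExecutionId=query_id)
    state = status["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(5)

print(query_id, state)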
With that out of the way, the problem with reading access logs from Athena is that it isn't easy to partition them. The example that AWS provides defines the table inside Athena, and the default partitioning scheme uses S3 paths that have segments of the form /column=value/. However, ALB access logs use a simpler yyyy/mm/dd partitioning scheme.
If you use AWS Glue, you can define a table format that uses this simpler scheme. I haven't done that so can't give you information other than what's in the docs.
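One workaround, assuming the table is declared with PARTITIONED BY (year string, month string, day string), is to add each day's partition explicitly with an ALTER TABLE ... ADD PARTITION statement that points at the matching yyyy/mm/dd prefix; a boto3 sketch (the table name, account, region, and buckets are placeholders):

from datetime import datetime, timezone
import boto3

athena = boto3.client("athena")
today = datetime.now(timezone.utc)

# Placeholders: adjust the table name, log bucket path, and results location.
ddl = f"""
    ALTER TABLE alb_logs ADD IF NOT EXISTS
    PARTITION (year = '{today:%Y}', month = '{today:%m}', day = '{today:%d}')
    LOCATION 's3://my-alb-logs/AWSLogs/123456789012/elasticloadbalancing/us-east-1/{today:%Y/%m/%d}/'
"""
athena.start_query_execution(
    QueryString=ddl,
    QueryExecutionContext={"Database": "default"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
)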
Another alternative is to limit the amount of data in your bucket. This can save on storage costs as well as reduce query times. I would do something like the following:
Bucket_A is the destination for access logs, and the source for your Athena queries. It has a life-cycle policy that deletes logs after 30 (or 45, or whatever) days.
Bucket_B is set up to replicate logs from Bucket_A (so that you retain everything, forever). It immediately transitions all replicated files to "infrequent access" storage, which cuts the cost in half.
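The life-cycle rule on Bucket_A, for instance, is a one-time setup; a boto3 sketch (the bucket name and retention window are placeholders):

import boto3

s3 = boto3.client("s3")

# Placeholders: bucket name and retention period are illustrative.
s3.put_bucket_lifecycle_configuration(
    Bucket="bucket-a-access-logs",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "expire-old-access-logs",
                "Filter": {"Prefix": ""},   # apply to every object in the bucket
                "Status": "Enabled",
                "Expiration": {"Days": 30},
            }
        ]
    },
)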
Elasticsearch is certainly a popular option. You'll need to convert the files in order to upload them. I haven't looked, but I'm sure there's a Logstash plugin that will do so. Depending on what you're looking to do for reporting, Elasticsearch may be better or worse than Athena.

AWS Lambda extract large data and upload to S3

I am trying to write a Node.js Lambda function to query data from our database cluster and upload it to S3; we require this for further analysis. My doubt is: if the data to be queried from the DB is large (9 GB), how does the Lambda function handle this, given that the memory limit is 3008 MB?
There is also a disk storage limit of 500MB.
Therefore, you would need to stream the result to Amazon S3 as it is coming in from the database.
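In practice that means feeding a multipart upload as rows arrive rather than materialising the whole result first; a rough Python sketch of the pattern (the same idea applies in Node.js, and fetch_rows(), the bucket, and the key are hypothetical placeholders):

import boto3

s3 = boto3.client("s3")

# Placeholders: the bucket, key, and fetch_rows() below are hypothetical.
BUCKET, KEY = "my-analysis-bucket", "exports/full_dump.csv"
PART_SIZE = 8 * 1024 * 1024  # multipart parts must be at least 5 MB (except the last)

def fetch_rows():
    # Stand-in for a server-side cursor / streaming query against your database.
    yield from []

upload = s3.create_multipart_upload(Bucket=BUCKET, Key=KEY)
parts, buffer, part_number = [], bytearray(), 1

def flush(buf, n):
    resp = s3.upload_part(Bucket=BUCKET, Key=KEY, PartNumber=n,
                          UploadId=upload["UploadId"], Body=bytes(buf))
    parts.append({"ETag": resp["ETag"], "PartNumber": n})

for row in fetch_rows():                 # rows stream in; the full result is never held
    buffer.extend((",".join(map(str, row)) + "\n").encode())
    if len(buffer) >= PART_SIZE:
        flush(buffer, part_number)
        part_number += 1
        buffer.clear()

if buffer:                               # upload whatever is left as the final part
    flush(buffer, part_number)

s3.complete_multipart_upload(Bucket=BUCKET, Key=KEY, UploadId=upload["UploadId"],
                             MultipartUpload={"Parts": parts})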
You might also run into a time limit of 15 minutes for a Lambda function, depending upon how fast the database can query and transfer that quantity of information.
You might consider an alternative strategy, such as having the Lambda function call Amazon Athena to query the database. The results of an Athena query are automatically saved to Amazon S3, which would avoid the need to transfer the data.
Lambda has some limitations in terms of run time and storage; it's better to use a crawler or a job in AWS Glue. It's the easy way of doing this.
For that, go to AWS Glue > Jobs > Create job, fill in the basic requirements like source and destination, and run the job. There are no such size or time constraints there.
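The same job can also be created and started from code rather than the console; a boto3 sketch (the job name, IAM role, and script location are placeholders, and the referenced script would contain the actual database-to-S3 logic):

import boto3

glue = boto3.client("glue")

# Placeholders: job name, role ARN, and script location are illustrative.
glue.create_job(
    Name="export-db-to-s3",
    Role="arn:aws:iam::123456789012:role/MyGlueServiceRole",
    Command={
        "Name": "glueetl",  # Spark ETL job
        "ScriptLocation": "s3://my-glue-scripts/export_db_to_s3.py",
        "PythonVersion": "3",
    },
    GlueVersion="2.0",
    WorkerType="G.1X",
    NumberOfWorkers=10,
)
glue.start_job_run(JobName="export-db-to-s3")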

Export S3 Bucket To Blob Storage

I'm trying to download, transform and upload an entire S3 bucket to Azure Blob Storage. The task itself, though trivial, became really annoying due to throttling issues.
The bucket itself is 4.5 TB and contains roughly 700,000,000 keys. My first approach was to create a Lambda to handle a batch of 2000 keys at a time and just attack S3. After launching all the Lambdas I came across S3 throttling for the first time:
{
  "errorMessage": "Please reduce your request rate.",
  "errorType": "SlowDown"
}
At first this was amusing, but eventually it became a blocker on the whole migration process. Transferring the entire bucket with this throttling policy will take me around 2 weeks.
Of course I implemented an exponential retry, but at this scale of 100+ concurrent Lambdas it has little effect.
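For reference, this kind of retry can also be configured at the SDK level rather than by hand; a boto3 sketch (the retry mode, limits, bucket, and key shown are illustrative):

import boto3
from botocore.config import Config

# 'adaptive' mode adds client-side rate limiting on top of the exponential
# backoff that 'standard' mode already applies; values here are illustrative.
s3 = boto3.client(
    "s3",
    config=Config(retries={"max_attempts": 10, "mode": "adaptive"}),
)

# The transfer logic is unchanged; only the client's retry behaviour differs.
obj = s3.get_object(Bucket="my-source-bucket", Key="some/key")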
Am I missing something? Is there a service I could use for that? Can I overcome the throttling somehow?
Any help would be appreciated.

AWS Firehose not delivering to Redshift - where are the logs?

I know others have had this same problem[1], and now I have as well. I have tried all the suggested troubleshooting techniques on that question. To summarize:
This is a new Firehose to Redshift
S3 objects are predictably appearing with 100% success in CloudWatch
Redshift delivery is showing as 0% success, so it must be trying
I'm seeing that Firehose is making connections to Redshift so the firewall rules must be correct
I am using JSON formatted entries with an external column mapping file.
The Firehose and Redshift cluster are in the us-west-2 region, but the bucket is in US Standard (us-east-1) so I'm using the WITH REGION option.
Following in the path of others, I have tried deleting and recreating the firehose to no avail.
I also tried doing the COPY from the Redshift cluster manually and found that it worked perfectly.
There doesn't appear to be anything in the Redshift error tables, nor in the errors section of the bucket.
I'm about to give up on this. Anyone have suggestions for where to find the error logs before I admit failure?
[1] AWS Kinesis Firehose not inserting data in Redshift