Is there any way to get DynamoDB data in batches?
I want to fetch 2000 items from DynamoDB, but in batches of about 100 records per minute, and feed them to a Lambda function.
I've tried AWS Batch, but it doesn't seem to be fruitful.
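For context, here is a minimal sketch of what "fetching in batches" could look like with a paginated Scan throttled to roughly 100 items per minute. The table name, page size, and pause are placeholders, not anything from the original setup:

```python
import time
import boto3

dynamodb = boto3.client("dynamodb")

def scan_in_batches(table_name, batch_size=100, pause_seconds=60):
    """Yield pages of items from a DynamoDB table, pausing between pages."""
    kwargs = {"TableName": table_name, "Limit": batch_size}
    while True:
        response = dynamodb.scan(**kwargs)
        yield response["Items"]
        last_key = response.get("LastEvaluatedKey")
        if not last_key:
            break  # no more pages
        kwargs["ExclusiveStartKey"] = last_key
        time.sleep(pause_seconds)  # throttle to roughly one page per minute

# "my-table" is a hypothetical table name.
for batch in scan_in_batches("my-table"):
    print(len(batch), "items in this batch")
```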
Related
I currently have 2 Athena queries in AWS Step Functions that are scheduled to run daily by EventBridge. I would like to send the number of results (i.e. the number of rows) produced by the 2 queries to SNS, so that I receive a daily email notification instead of logging in to check. How can I do this? I'm not sure how to get the row counts of the two queries and send them to SNS.
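One possible approach (a sketch, not a verified setup) is a small Lambda step after the queries that counts the result rows with GetQueryResults and publishes the counts to SNS. The query execution IDs and topic ARN below are hypothetical placeholders:

```python
import boto3

athena = boto3.client("athena")
sns = boto3.client("sns")

# Placeholders: in Step Functions these would come from the Athena task output.
QUERY_EXECUTION_IDS = ["query-execution-id-1", "query-execution-id-2"]
TOPIC_ARN = "arn:aws:sns:us-east-1:123456789012:daily-report"

def count_rows(query_execution_id):
    """Count result rows by paging through GetQueryResults (the first page includes a header row)."""
    paginator = athena.get_paginator("get_query_results")
    total = 0
    for page in paginator.paginate(QueryExecutionId=query_execution_id):
        total += len(page["ResultSet"]["Rows"])
    return max(total - 1, 0)  # drop the header row

counts = {qid: count_rows(qid) for qid in QUERY_EXECUTION_IDS}
message = "\n".join(f"{qid}: {count} rows" for qid, count in counts.items())
sns.publish(TopicArn=TOPIC_ARN, Subject="Daily Athena query counts", Message=message)
```

For very large result sets it may be cheaper to run a separate SELECT COUNT(*) query instead of paging through all of the results.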
I have a DynamoDB table, and I am querying one of my GSIs with a condition.
Now what I want is to graph everything that this query returns in CloudWatch.
Is that possible?
Also, DynamoDB only shows a maximum of 300 items at a time. Can I at least see the total number of items the query returns?
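Regarding the total: a query can be run with Select='COUNT' so DynamoDB returns only the number of matching items per page, which can then be summed across pages. This is only a sketch; the table, index, and attribute names below are made up:

```python
import boto3

dynamodb = boto3.client("dynamodb")

total = 0
kwargs = {
    "TableName": "my-table",              # hypothetical table
    "IndexName": "my-gsi",                # hypothetical GSI
    "KeyConditionExpression": "#pk = :v",
    "ExpressionAttributeNames": {"#pk": "gsi_partition_key"},
    "ExpressionAttributeValues": {":v": {"S": "some-value"}},
    "Select": "COUNT",                    # return only the count, not the items
}
while True:
    response = dynamodb.query(**kwargs)
    total += response["Count"]            # matching items in this page
    last_key = response.get("LastEvaluatedKey")
    if not last_key:
        break
    kwargs["ExclusiveStartKey"] = last_key

print("Total matching items:", total)
```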
I'm new to AWS, and I'm working on archiving data from DynamoDB to S3. This is my solution, and I have built the pipeline:
DynamoDB -> DynamoDB TTL + DynamoDB Stream -> Lambda -> Kinesis Firehose -> S3
But I found that the files in S3 have different numbers of JSON objects. Some files have 7 JSON objects, some have 6 or 4. I have done the ETL in Lambda: only REMOVE items are saved to S3, and the JSON has been unmarshalled.
I expected each file to contain a single JSON object, since the TTL value is different for each item and Lambda would deliver each item immediately when it is deleted by TTL.
Is it because Kinesis Firehose batches the items (i.e. it waits for some time to collect more items before saving them to a file)? Or is there another reason? Could I estimate how many files it will save if a DynamoDB item is deleted by TTL every 5 minutes?
Thank you in advance.
Kinesis Firehose splits your data based on buffer size or interval.
Let's say you have a buffer size of 1MB and an interval of 1 minute.
If you receive less than 1 MB within the 1-minute interval, Kinesis Firehose will still create a file out of whatever it has received, even though it is less than 1 MB of data.
This is likely what is happening when only a few records arrive. You can adjust the buffer size and interval to your needs, e.g. increase the interval to collect more items in a single batch.
You can choose a buffer size of 1–128 MiBs and a buffer interval of 60–900 seconds. The condition that is satisfied first triggers data delivery to Amazon S3.
From the AWS Kinesis Firehose Docs: https://docs.aws.amazon.com/firehose/latest/dev/create-configure.html
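As a sketch of where those settings live, the buffering hints can be set when creating (or later updating) the delivery stream, for example with boto3. The stream name, role ARN, bucket ARN, and buffer values below are only illustrative:

```python
import boto3

firehose = boto3.client("firehose")

firehose.create_delivery_stream(
    DeliveryStreamName="ddb-ttl-archive",   # hypothetical stream name
    DeliveryStreamType="DirectPut",
    ExtendedS3DestinationConfiguration={
        "RoleARN": "arn:aws:iam::123456789012:role/firehose-delivery-role",
        "BucketARN": "arn:aws:s3:::my-archive-bucket",
        "BufferingHints": {
            "SizeInMBs": 64,            # flush once 64 MiB has accumulated...
            "IntervalInSeconds": 900,   # ...or after 15 minutes, whichever comes first
        },
    },
)
```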
I want to send data from DynamoDB to AWS Lambda, and to configure fault tolerance I am looking into data retention.
As per the AWS docs, a DynamoDB stream keeps data for 24 hours. However, when we set up the trigger to AWS Lambda, we can set the maximum record age to 7 days. How is this possible?
When enabled, DynamoDB Streams captures a time-ordered sequence of item-level modifications in a DynamoDB table and durably stores the information for up to 24 hours.
DynamoDB Streams docs
Enabling Trigger and Data Retention during Error
How can the trigger have a maximum record age of 7 days when the source DynamoDB stream only keeps data for 24 hours?
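For illustration, the 7-day figure comes from the Lambda event source mapping setting MaximumRecordAgeInSeconds, which the Lambda API caps at 7 days for stream sources in general, regardless of how long the particular stream actually retains data. A sketch with hypothetical ARNs and values:

```python
import boto3

lambda_client = boto3.client("lambda")

# Hypothetical stream ARN and function name.
lambda_client.create_event_source_mapping(
    EventSourceArn="arn:aws:dynamodb:us-east-1:123456789012:table/my-table/stream/2023-01-01T00:00:00.000",
    FunctionName="my-stream-consumer",
    StartingPosition="TRIM_HORIZON",
    BatchSize=100,
    MaximumRetryAttempts=5,
    # The API accepts up to 604800 seconds (7 days), but a DynamoDB stream
    # only retains records for 24 hours, so anything older is gone anyway.
    MaximumRecordAgeInSeconds=86400,
)
```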
I have a QuickSight dashboard pointed at an Athena table. Now I want to schedule SPICE to refresh every hour. As per the documentation, refreshing imports the data into SPICE again, so the data includes any changes since the last import.
If I have a 2 TB dataset in Athena and new data is added to it every hour, will QuickSight load the full 2 TB every hour to find the delta? If so, it will increase the Athena cost. Does QuickSight query Athena to fetch the data?
As of the date of answering (11/11/2019) SPICE does in fact perform a full data set reload (i.e. no delta calculation or incremental refresh). I was able to verify this by using a MySQL data set and watching the query log while the refresh was occurring.
The implication for your question is that you would be charged every hour for Athena to query the 2TB data set.
If you do not need the robust querying that Athena provides, I would recommend pointing QuickSight to the S3 data directly.
My data is in Parquet format. I guess QuickSight does not support directly querying Parquet data in S3.
Yes, we need to use Athena to read the Parquet files.
When you say point QuickSight to S3 directly, do you mean without SPICE?
Don't do that; it will increase the Athena and S3 costs significantly.
Solution:
Collect the delta from your source.
Push it into S3 (unprocessed data).
Create a Lambda function to pre-process the data (if needed).
Set up a trigger for the Lambda.
Process the data in Lambda and convert it to Parquet format with gzip compression (see the sketch after this list).
Push the data into S3 (processed data).
Remove the unprocessed data from S3, or set up an S3 lifecycle rule to manage it.
Also create a metadata table with the primary key and the required fields.
S3 and Athena do not support updating records, so each time you push data it is appended to the old data, and the entire data set gets scanned.
Both S3 and Athena follow a scan-first approach, so even if you apply a filter, the entire data set is scanned before the filter is applied.
Use the metadata table to remove the old entry and insert the new one.
Use partitions wherever possible to avoid scanning the entire data set.
Once the data is available, configure the QuickSight data refresh to pull the data into SPICE.
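Here is the sketch referenced in the Lambda/Parquet step above. It assumes the unprocessed objects are JSON Lines, that pandas and pyarrow are packaged with the function (e.g. as a layer), and that the bucket names are placeholders:

```python
import io
import json
import boto3
import pandas as pd  # pandas + pyarrow assumed available in the Lambda environment

s3 = boto3.client("s3")
PROCESSED_BUCKET = "my-processed-bucket"  # hypothetical bucket

def handler(event, context):
    # Triggered by an S3 put of an unprocessed JSON-lines object.
    record = event["Records"][0]["s3"]
    bucket, key = record["bucket"]["name"], record["object"]["key"]

    body = s3.get_object(Bucket=bucket, Key=key)["Body"].read().decode("utf-8")
    rows = [json.loads(line) for line in body.splitlines() if line]

    # Convert to gzip-compressed Parquet and write to the processed bucket.
    df = pd.DataFrame(rows)
    buffer = io.BytesIO()
    df.to_parquet(buffer, engine="pyarrow", compression="gzip", index=False)

    processed_key = key.rsplit(".", 1)[0] + ".parquet"
    s3.put_object(Bucket=PROCESSED_BUCKET, Key=processed_key, Body=buffer.getvalue())
```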
Best practices:
Always go with SPICE (direct queries are expensive and have high latency).
Use incremental refresh wherever possible (see the sketch below).
Always use static data; do not reprocess the data for each dashboard visit/refresh.
Increase your QuickSight SPICE data refresh frequency.
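If the refresh needs to be driven programmatically rather than from the console schedule, one option (a sketch; the account ID and dataset ID are placeholders, and the dataset must already have an incremental refresh configuration) is the QuickSight CreateIngestion API:

```python
import uuid
import boto3

quicksight = boto3.client("quicksight")

quicksight.create_ingestion(
    AwsAccountId="123456789012",          # hypothetical account ID
    DataSetId="my-spice-dataset-id",      # hypothetical SPICE dataset ID
    IngestionId=str(uuid.uuid4()),        # unique ID for this refresh run
    # Falls back to "FULL_REFRESH" if the dataset has no incremental refresh configuration.
    IngestionType="INCREMENTAL_REFRESH",
)
```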