We have a service where a ~50 GB DynamoDB table is our feature repository, which we use for real-time, online applications.
We want to create a data lake from this table for historical data, model training and analytics insights. We want to guarantee a 30-minute "freshness" of data lake data w.r.t. the original table.
However, I'm confused about what a good architecture for this would be: my understanding of data lakes is that you should use a storage service (e.g., S3) to store the raw data with no processing. Then you run ETL jobs, where you transform, process and filter the data (e.g., using Glue) before using it for whatever app.
But here is my doubt: does this mean that we have to dump the DynamoDB table into S3 every 30 minutes? This can be easily done, but it sounds weird (this would result in ~876 TB/year).
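For reference, the ~876 TB/year figure can be reproduced with a quick back-of-the-envelope calculation. This is only a sketch: it assumes every dump is a full, uncompressed 50 GB copy and uses decimal units (1 TB = 1000 GB).

```python
# Rough yearly volume if the whole table is dumped every 30 minutes.
TABLE_SIZE_GB = 50        # approximate table size from the question
DUMP_INTERVAL_MIN = 30    # freshness target from the question

dumps_per_day = 24 * 60 // DUMP_INTERVAL_MIN       # full dumps per day
gb_per_year = TABLE_SIZE_GB * dumps_per_day * 365  # total GB written per year
tb_per_year = gb_per_year / 1000                   # decimal terabytes

print(f"{dumps_per_day} dumps/day, about {tb_per_year:.0f} TB/year")
```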
Am I missing something in the data lake pipeline?
You've hit a common problem, and it's one AWS is actively working on.
If you want continuous syncing from DynamoDB to S3, it's possible using existing technology, including DynamoDB Streams. I suggest checking out this project in awslabs. Frankly, it's quite a bit of effort.
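To give a feel for the effort involved, here is a minimal sketch of the transformation step such a sync needs: turning a DynamoDB Streams batch (as delivered to a Lambda function) into newline-delimited JSON that could be written to S3. It is not the awslabs project's code. The function names and output field names are made up for illustration, only a subset of DynamoDB attribute types is handled, and it assumes the stream is configured to include new images.

```python
import json
from decimal import Decimal


def from_dynamodb(value):
    """Convert one DynamoDB-typed attribute value to a plain Python value."""
    type_tag, raw = next(iter(value.items()))
    if type_tag == "S":
        return raw
    if type_tag == "N":
        number = Decimal(raw)
        return int(number) if number == number.to_integral_value() else float(number)
    if type_tag == "BOOL":
        return raw
    if type_tag == "NULL":
        return None
    if type_tag == "M":
        return {k: from_dynamodb(v) for k, v in raw.items()}
    if type_tag == "L":
        return [from_dynamodb(v) for v in raw]
    raise ValueError(f"unsupported DynamoDB type: {type_tag}")


def stream_event_to_ndjson(event):
    """Turn a stream batch into one JSON line per change (hypothetical layout)."""
    lines = []
    for record in event.get("Records", []):
        change = record["dynamodb"]
        image = change.get("NewImage")  # absent for REMOVE events
        row = {
            "event_name": record["eventName"],
            "keys": {k: from_dynamodb(v) for k, v in change["Keys"].items()},
            "item": {k: from_dynamodb(v) for k, v in image.items()} if image else None,
        }
        lines.append(json.dumps(row, sort_keys=True))
    return "\n".join(lines)


# Made-up sample batch: one update and one delete.
sample_event = {"Records": [
    {"eventName": "MODIFY", "dynamodb": {
        "Keys": {"id": {"S": "user-1"}},
        "NewImage": {"id": {"S": "user-1"}, "score": {"N": "21.5"}}}},
    {"eventName": "REMOVE", "dynamodb": {"Keys": {"id": {"S": "user-2"}}}},
]}
ndjson = stream_event_to_ndjson(sample_event)
print(ndjson)
```

In a real function the resulting string would still have to be uploaded to S3, and batching, retries and compaction of the many small files are where most of the effort goes.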
However, I believe AWS is about to release a product that will keep DynamoDB tables and S3 buckets in sync, without code, in a few clicks. It's called AWS Glue Elastic Views. The product is in preview. They announced it in December 2020, so I'm hoping it will be available soon. There is also a form you can fill in to join the trial, but there is no guarantee AWS will give you access.
I have data in Amazon DynamoDB whose rows are frequently added and updated (the table is updated by a Lambda that processes events from a Kinesis stream).
I want to provide a way for other teams to query that data through Athena.
It has to be as real-time as possible (that is, the delay between receiving an event and an Athena query reflecting the new/updated information should be minimal).
What is the best/most cost-optimized way to do that?
I know about some of the options:
Scan the table regularly and put the information where Athena can query it. This is going to be quite expensive and not real-time.
Start putting the raw events in S3 as well, not just DynamoDB, and create a Glue crawler that scans the new records only. That's going to be closer to real-time, but I don't know how to deal with duplicate events (the information in DynamoDB is updated quite frequently, and updates overwrite old records). I'm also not sure it is the best way.
Maybe update the Data Catalog directly from the Lambda? I'm not sure that is even possible; I'm still new to the AWS tech stack.
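On the duplicate-events concern in the second option: one common way to handle it is to keep every raw event and pick the latest version per key at read time. The sketch below shows that logic in plain Python. The field names `id` and `updated_at` are assumptions for illustration, not something from the setup described above.

```python
def latest_per_key(events, key_field="id", version_field="updated_at"):
    """Keep only the most recent event for each key.

    Ties go to the event seen later in the input.
    """
    latest = {}
    for event in events:
        key = event[key_field]
        current = latest.get(key)
        if current is None or event[version_field] >= current[version_field]:
            latest[key] = event
    return list(latest.values())


# Made-up raw events: record "a" is updated twice and arrives out of order.
raw_events = [
    {"id": "a", "updated_at": 1, "status": "new"},
    {"id": "b", "updated_at": 1, "status": "new"},
    {"id": "a", "updated_at": 3, "status": "done"},
    {"id": "a", "updated_at": 2, "status": "running"},
]
current_state = latest_per_key(raw_events)
print(current_state)
```

The same idea can be expressed in a query engine as a "latest row per key" view over the raw events, so the raw files never need to be rewritten.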
Any better ways to do that?
You can use Athena Federated Query for this use-case.
We are building a customer-facing app. For this app, data is captured by IoT devices owned by a third party and transferred to us from their server via API calls. We store this data in our Amazon DocumentDB cluster. The user app is connected to this cluster and has real-time data feed requirements. Note: the data is time series data.
For long-term data storage and for creating analytics dashboards to be shared with stakeholders, our data governance folks are asking us to replicate/copy the data daily from the Amazon DocumentDB cluster to their Google Cloud Platform BigQuery instance. We could then run queries directly on BigQuery to perform analysis, and send data to maybe explorer or Tableau to create dashboards.
I couldn't find any straightforward solutions for this. Any ideas, comments or suggestions are welcome. How do I achieve or plan the above replication? And how do I make sure the data is copied efficiently in terms of memory and pricing? Also, I don't want to degrade the performance of Amazon DocumentDB, since it supports our user-facing app.
This solution would need some custom implementation. You can use Change Streams and process the data changes at intervals, sending them to BigQuery, so that you have a data replication mechanism in place for running analytics. One of the documented use cases for Change Streams is analytics with Redshift, so BigQuery should serve a similar purpose.
Using Change Streams with Amazon DocumentDB:
https://docs.aws.amazon.com/documentdb/latest/developerguide/change_streams.html
This document also contains sample Python code for consuming change stream events.
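As a rough illustration of the "process the changes at intervals" step, the sketch below flattens a batch of change events into rows suitable for a warehouse load and remembers the last resume token. It is not the sample from the linked documentation. The row layout and function name are hypothetical, the events are hand-written dictionaries shaped like MongoDB-compatible change events, and the actual reading from the cluster and loading into BigQuery are left out.

```python
import json


def change_events_to_rows(events):
    """Flatten change events into load-ready rows and track the resume token."""
    rows = []
    last_token = None
    for event in events:
        doc = event.get("fullDocument")  # missing for deletes
        rows.append({
            "doc_id": str(event["documentKey"]["_id"]),
            "operation": event["operationType"],
            "payload": json.dumps(doc, default=str) if doc is not None else None,
        })
        last_token = event["_id"]  # persist this to resume after the batch
    return rows, last_token


# Made-up batch: one insert and one delete.
batch = [
    {"_id": {"_data": "token-1"}, "operationType": "insert",
     "documentKey": {"_id": "reading-1"},
     "fullDocument": {"_id": "reading-1", "device": "d1", "temp": 21.5}},
    {"_id": {"_data": "token-2"}, "operationType": "delete",
     "documentKey": {"_id": "reading-0"}},
]
rows, resume_token = change_events_to_rows(batch)
print(rows, resume_token)
```

Storing the last token after each successful load lets the next scheduled run continue where the previous one stopped, which keeps the extra read load on the cluster small.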
I am building a simple sensor which sends 5 telemetry values to AWS IoT Core. I am undecided between Amazon Timestream and Elasticsearch for storing this telemetry.
For now I am experimenting with Timestream and wanted to know is this the right choice? Any expert suggestions.
Secondly, I want to store the DB records forever, as this will feed into my machine learning predictions in the future. Does Timestream delete records after a while, or is it possible to never delete them?
I will be creating a custom web page to show this telemetry per tenant. Any help with how I can do this? Should I query the Timestream DB directly over the API, or should I back it up in another DB like DynamoDB etc.?
Your help will be greatly appreciated. Thank you.
For now I am experimenting with Timestream and wanted to know is this the right choice? Any expert suggestions.
I would not call myself an expert, but Timestream looks like a sound solution for telemetry data. I think Elasticsearch would be overkill if each of your telemetry values is numeric. If your telemetry data is more complex (e.g. JSON objects with many keys) or you would benefit from full-text search, Elasticsearch would be the better choice. Timestream is probably also easier and cheaper to manage.
Secondly, I want to store the DB records forever, as this will feed into my machine learning predictions in the future. Does Timestream delete records after a while, or is it possible to never delete them?
It looks like the retention is limited to 200 years by default. You can probably increase that by contacting AWS support, but I doubt that they will allow infinite retention.
We use Amazon Kinesis Data Firehose with AWS Glue to store our sensor data on AWS S3. When we need to access the data for analysis, we use AWS Athena to query the data on S3.
I will be creating a custom web page to show this telemetry per tenant. Any help with how I can do this? Should I query the Timestream DB directly over the API, or should I back it up in another DB like DynamoDB etc.?
It depends on how dynamic and complex the queries you want to display are. I would start by querying Timestream directly and introduce DynamoDB where it makes sense to optimize cost.
Based on your description ("simple sensor which sends out 5 telemetry data to AWS IoT Core"), Timestream is the way to go: it is a fairly simple and cheap solution for simple telemetry data.
The magnetic store retention (200 years) is more than you will ever need.
I have a use case in which I want to take data from DynamoDB and apply some transformations to it. After this I want to create 3 CSV files (there will be 3 transformations on the same data) and dump them to 3 different S3 locations.
My architecture would be roughly as follows:
Is it possible to do so? I can't seem to find any documentation regarding it. If it's not possible using Data Pipeline, are there any other services which could help me with my use case?
These dumps will be scheduled daily. My other consideration was using AWS Lambda. But according to my understanding, it is triggered by events rather than by a time-based schedule; is that correct?
Yes, it is possible, but with EmrActivity rather than HiveActivity. If you look at the Data Pipeline documentation for HiveActivity, it clearly states its purpose, which does not suit your use case:
Runs a Hive query on an EMR cluster. HiveActivity makes it easier to set up an Amazon EMR activity and automatically creates Hive tables based on input data coming in from either Amazon S3 or Amazon RDS. All you need to specify is the HiveQL to run on the source data. AWS Data Pipeline automatically creates Hive tables with ${input1}, ${input2}, and so on, based on the input fields in the HiveActivity object.
Below is how your data pipeline should look. There is also a built-in template, Export DynamoDB table to S3, in the AWS Data Pipeline UI; it creates the basic structure for you, and you can then extend/customize it to suit your requirements.
To your next question about Lambda: Lambda can of course be configured with either event-based or schedule-based triggers, but I wouldn't recommend AWS Lambda for ETL operations, because invocations are time-bound and typical ETL jobs run longer than Lambda's time limit.
AWS has offerings optimized specifically for ETL, namely AWS Data Pipeline and AWS Glue, and I would always recommend choosing one of the two. If your ETL involves data sources not managed within AWS compute and storage services, or any specialty use case which can't be served by the above two options, then AWS Batch would be my next consideration.
Thanks amith for your answer. I have been busy for quite some time now. I did some digging after you posted your answer. It turns out we can dump the data to different S3 locations using HiveActivity as well.
This is how the data pipeline would look in that case.
But I believe writing multiple Hive activities when your input source is a DynamoDB table is not a good idea, since Hive doesn't load any data into memory. It runs all the computations against the actual table, which could degrade the table's performance. Even the documentation suggests exporting the data in case you need to run multiple queries against the same data. Reference:
Enter a Hive command that maps a table in the Hive application to the data in DynamoDB. This table acts as a reference to the data stored in Amazon DynamoDB; the data is not stored locally in Hive and any queries using this table run against the live data in DynamoDB, consuming the table’s read or write capacity every time a command is run. If you expect to run multiple Hive commands against the same dataset, consider exporting it first.
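The "export first, then transform" idea can be sketched outside of Hive as follows: read the exported items once, compute the three outputs in one pass, and produce three CSV documents destined for three different locations. This is only an illustration. The item fields (`category`, `value`), the three aggregations and the output keys are invented, and writing the results to S3 is left out.

```python
import csv
import io
from collections import defaultdict


def to_csv(header, rows):
    """Serialize a header and rows into CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buffer.getvalue()


def build_reports(items):
    """Compute three aggregations over one exported snapshot of the data."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    peaks = {}
    for item in items:
        group, value = item["category"], item["value"]
        totals[group] += value
        counts[group] += 1
        peaks[group] = max(value, peaks.get(group, value))
    groups = sorted(totals)
    # Hypothetical destination keys, one per transformation.
    return {
        "totals/report.csv": to_csv(["category", "total"], [[g, totals[g]] for g in groups]),
        "counts/report.csv": to_csv(["category", "count"], [[g, counts[g]] for g in groups]),
        "peaks/report.csv": to_csv(["category", "max"], [[g, peaks[g]] for g in groups]),
    }


# Made-up exported items.
exported_items = [
    {"category": "a", "value": 1.5},
    {"category": "a", "value": 2.5},
    {"category": "b", "value": 4.0},
]
reports = build_reports(exported_items)
for key, body in reports.items():
    print(key)
    print(body)
```

Because the source table is read only once for the export, the three transformations do not consume additional read capacity on the live table.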
In my case I needed to perform different types of aggregations on the same data once a day. Since DynamoDB doesn't support aggregations, I turned to Data Pipeline with Hive. In the end we moved to Amazon Aurora, which is MySQL-based.
I'm trying to design a large IoT solution with millions of devices, starting from zero. That's why I need a highly scalable platform like AWS.
My devices are going to report data using AWS IoT, and that's the only thing I've really decided. I need to store a lot of data, such as a temperature measurement every 15 minutes from each device, so I've planned to insert those measurements directly into DynamoDB using IoT Rules. On the other hand, I need a relational structure to store companies, temperature sensors, etc., so I thought I could store that in MySQL on RDS.
After that, I need to configure a proper analysis tool, so I was thinking of Kinesis and loading the data into Redshift after ETL using Data Pipeline, since AWS Glue doesn't support DynamoDB.
I'm new to some of these services, so I don't know exactly what I'm doing, and I don't know if this approach is the best one. What do you think?
Thanks.
I would have your applications write the edge data (raw data) to an S3 bucket with this flow:
Edge (with credentials) -> API Gateway -> Lambda -> S3
Save your raw data as .json files in S3. Then you can use tools like Athena and QuickSight to query and visualize it.
The benefits to this are:
1) Your edge devices don't have to have the AWS SDK
2) S3 is cheap and insanely scalable
3) JSON format can be read by any service so you are not locked into AWS for visualization.
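The Lambda step in the flow above could look roughly like the sketch below, which turns an API Gateway proxy event into an S3 object key and a JSON body. It only prepares the object and the upload itself is left as a comment. The payload field `device_id` and the key layout (partitioned by device and date) are assumptions chosen for illustration.

```python
import json
from datetime import datetime, timezone


def build_raw_object(event, received_at):
    """Return (key, body) for storing one raw reading as a .json file."""
    payload = json.loads(event["body"])  # proxy events carry the body as a string
    device_id = payload["device_id"]     # hypothetical field name
    key = (
        f"raw/device_id={device_id}/"
        f"dt={received_at:%Y-%m-%d}/"
        f"{received_at:%H%M%S%f}.json"
    )
    return key, json.dumps(payload, sort_keys=True)


# Made-up request from an edge device.
sample_event = {"body": json.dumps({"device_id": "sensor-42", "temperature": 21.5})}
key, body = build_raw_object(
    sample_event, datetime(2024, 1, 2, 3, 4, 5, tzinfo=timezone.utc)
)
print(key)
print(body)
# In the Lambda handler, the pair would then be written to the bucket,
# for example with an S3 client's put_object(Bucket=..., Key=key, Body=body).
```

A date-based prefix like this tends to make later partitioned queries cheaper, but any layout that suits the query patterns would do.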