Transfer/replicate data periodically from AWS DocumentDB to Google Cloud BigQuery - amazon-web-services

We are building a customer-facing app. The data is captured by IoT devices owned by a third party and transferred to us from their server via API calls. We store it in our AWS DocumentDB cluster, and the user app is connected to this cluster with real-time data feed requirements. Note: the data is time series data.
For long-term storage and for building analytics dashboards to share with stakeholders, our data governance team has asked us to replicate the data daily from the AWS DocumentDB cluster to BigQuery on their Google Cloud Platform. We could then run queries directly on BigQuery for analysis and send the results to a tool such as Explorer or Tableau to create dashboards.
I couldn't find a straightforward solution for this. Any ideas, comments or suggestions are welcome. How do I plan and implement this replication? How do I make sure the data is copied efficiently in terms of memory and cost? Also, I don't want to degrade the performance of AWS DocumentDB, since it backs our user-facing app.

This would need some custom implementation. You can use Change Streams and process the data changes at intervals to send them to BigQuery, which gives you a replication mechanism to run analytics on. One of the documented use cases for Change Streams is analytics with Redshift, so BigQuery should serve a similar purpose.
Using Change Streams with Amazon DocumentDB:
https://docs.aws.amazon.com/documentdb/latest/developerguide/change_streams.html
This document also contains sample Python code for consuming change stream events.
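As a rough illustration of the batching step (not the code from the AWS document), the sketch below flattens change stream events into rows and hands them to a sink in batches. The row layout, the function names and the batch size are assumptions made for this example. In a real setup the events would typically come from pymongo's `collection.watch(...)` and the sink would wrap something like the BigQuery client's `insert_rows_json`; both are kept outside the block so it runs with the standard library only.

```python
import json
from datetime import datetime, timezone
from typing import Any, Callable, Dict, Iterable, List, Optional, Tuple

Row = Dict[str, Any]


def change_event_to_row(event: Dict[str, Any]) -> Row:
    """Flatten one change stream event into a flat row (hypothetical layout)."""
    doc = event.get("fullDocument") or {}  # delete events carry no fullDocument
    payload = {k: v for k, v in doc.items() if k != "_id"}
    return {
        "doc_id": str(event["documentKey"]["_id"]),
        "operation": event["operationType"],
        # Stored as a JSON string; default=str covers ObjectId/datetime values.
        "payload": json.dumps(payload, default=str),
        "replicated_at": datetime.now(timezone.utc).isoformat(),
    }


def replicate(
    events: Iterable[Dict[str, Any]],
    sink: Callable[[List[Row]], None],
    batch_size: int = 500,
) -> Tuple[int, Optional[Any]]:
    """Send events to `sink` in batches.

    Returns the number of rows sent and the last resume token seen, which
    the caller should persist so the next run can continue from that point.
    """
    batch: List[Row] = []
    sent = 0
    last_token = None
    for event in events:
        batch.append(change_event_to_row(event))
        last_token = event["_id"]  # the event's _id is its resume token
        if len(batch) >= batch_size:
            sink(batch)
            sent += len(batch)
            batch = []
    if batch:  # flush the final partial batch
        sink(batch)
        sent += len(batch)
    return sent, last_token
```

Running this on a schedule against a secondary or with a modest batch size is one way to keep the load on the cluster that serves the app low; the right settings depend on your traffic.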

Related

Creating a data lake from a DynamoDB table

We have a service where a DynamoDB table ~50GB is our feature repository, which we use for real-time, online applications.
We want to create a data lake from this table for historical data, model training and analytics insights. We want to guarantee a 30-minutes "freshness" of data lake data w.r.t. the original table.
However, I'm confused on what could be a good architecture for this: my understanding of data lakes is that you should use a storage service (i.e., S3) to store the raw data with no processing. Then, you perform ETL jobs, where you transform, process and filter the data (e.g., using Glue) before using for whatever app.
But here is my doubt: does this mean that we have to dump the DynamoDB table into S3 every 30 minutes? This can easily be done, but it sounds weird (it would result in ~876 TB/year).
Am I missing something in the data lake pipeline?
You've hit a common problem, and it's one AWS is actively working on.
If you want continuous syncing from DynamoDB to S3, it's possible using existing technology, including DynamoDB Streams. I suggest checking out this project in awslabs. Frankly, it's quite a bit of effort.
However, I believe AWS is about to release a product that will keep DynamoDB tables and S3 buckets in sync, without code, in a few clicks. It's called AWS Glue Elastic Views. The product is in preview. It was announced in December 2020, so I'm hoping it becomes available soon. There is also a form you can fill in to join the trial, but there is no guarantee AWS will give you access.
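The ~876 TB/year figure from the question can be checked with a few lines of arithmetic, and the same sketch shows why shipping only changes is attractive. The 5% daily change rate is a purely hypothetical number chosen for illustration, not something stated in the question.

```python
def yearly_full_dump_tb(table_gb: float, interval_minutes: float) -> float:
    """Storage written per year if the whole table is dumped every interval."""
    dumps_per_day = 24 * 60 / interval_minutes
    return table_gb * dumps_per_day * 365 / 1000  # GB -> TB (decimal)


def yearly_incremental_tb(table_gb: float, daily_change_fraction: float) -> float:
    """Storage written per year if only changed data is shipped."""
    return table_gb * daily_change_fraction * 365 / 1000


full = yearly_full_dump_tb(50, 30)           # 50 GB table, 30-minute freshness
incremental = yearly_incremental_tb(50, 0.05)  # assumed 5% of the table changes per day
print(f"full dumps: {full:.0f} TB/year, incremental: {incremental:.2f} TB/year")
```

With these inputs the full-dump approach reproduces the 876 TB/year from the question, while the incremental estimate stays under 1 TB/year under the assumed change rate.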

AWS Redshift or RDS for a Data warehouse?

Right now we have an ETL job that extracts data from an API, transforms it, and stores it in one big table in our OLTP database. We want to migrate this table to an OLAP solution. The table is only read to do some calculations, whose results we store in our OLTP database.
Which service fits the most here?
We are currently evaluating Redshift but have never used the service before. We also thought of a snowflake schema (a fact table with dimensions) in RDS, because the table is intended to store 10 GB to 100 GB, but we don't know how well this approach scales.
IMHO you could do a PoC to see which service is more feasible for you. It really depends on how much data you have, what queries you run, and what load you plan to execute.
AWS Redshift is intended for OLAP at petabyte or exabyte scale, handling heavy parallel workloads. Redshift can also aggregate data from other data sources (JDBC, S3, ...). However, Redshift is not an OLTP database; it requires more static server overhead and extra skills to manage the deployment.
So without more numbers and use cases, one cannot advise anything. The great thing about the cloud is that you can try things and see what fits you.
AWS Redshift is really great when you only want to read the data from the database. Under the hood, Redshift is a column-oriented database, which makes it more suitable for analytics. You can transfer all your existing data to Redshift using AWS DMS. AWS DMS is a service that reads the bin logs of the existing database and transfers your data automatically, with no further work on your side. From my personal experience, Redshift is really great.

AWS Timestream DB - AWS IOT

I am building a simple sensor that sends 5 telemetry values to AWS IoT Core. I am undecided between AWS Timestream and Elasticsearch for storing this telemetry.
For now I am experimenting with Timestream and wanted to know whether this is the right choice. Any expert suggestions?
Secondly, I want to store the DB records forever, as they will feed into my machine learning predictions in the future. Does Timestream delete records after a while, or is it possible to never delete them?
I will be creating a custom web page to show this telemetry per tenant. Any help with how I can do this? Should I query Timestream directly over the API, or should I back it up in another database such as DynamoDB?
Your help will be greatly appreciated. Thank you.
For now I am experimenting with Timestream and wanted to know whether this is the right choice. Any expert suggestions?
I would not call myself an expert but Timestream DB looks like a sound solution for telemetry data. I think ElasticSearch would be overkill if each of your telemetry data is some numeric value. If your telemetry data is more complex (e.g. JSON objects with many keys) or you would benefit from full-text search, ElasticSearch would be the better choice. Timestream DB is probably also easier and cheaper to manage.
Secondly, I want to store the DB records forever, as they will feed into my machine learning predictions in the future. Does Timestream delete records after a while, or is it possible to never delete them?
It looks like the retention is limited to 200 years by default. You can probably increase that by contacting AWS support, but I doubt they will allow infinite retention.
We use Amazon Kinesis Data Firehose with AWS Glue to store our sensor data on AWS S3. When we need to access the data for analysis, we use AWS Athena to query the data on S3.
I will be creating a custom web page to show this telemetry per tenant. Any help with how I can do this? Should I query Timestream directly over the API, or should I back it up in another database such as DynamoDB?
It depends on how dynamic and complex the queries you want to display are. I would start by querying Timestream directly and introduce DynamoDB where it makes sense to optimize cost.
Based on your description ("simple sensor which sends out 5 telemetry data to AWS IoT Core"), Timestream is the way to go: a fairly simple and cheap solution for simple telemetry data.
The magnetic store retention (up to 200 years) is more than you will ever need.
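As a sketch of how the retention discussed above might be set, the snippet below builds the `RetentionProperties` structure accepted by the `update_table` call of the boto3 `timestream-write` client. The limit constants reflect the 200-year figure mentioned in the answers plus an assumed memory store cap of roughly one year; check the current service quotas before relying on them. The helper names, database name and table name are hypothetical.

```python
from typing import Dict

# Assumed service limits; verify against the current Timestream quotas.
MAX_MEMORY_HOURS = 8766    # roughly one year in the memory store
MAX_MAGNETIC_DAYS = 73000  # 200 years in the magnetic store


def retention_properties(memory_hours: int, magnetic_years: int) -> Dict[str, int]:
    """Build and validate the RetentionProperties structure for a table."""
    magnetic_days = magnetic_years * 365
    if not 1 <= memory_hours <= MAX_MEMORY_HOURS:
        raise ValueError(f"memory store retention out of range: {memory_hours} h")
    if not 1 <= magnetic_days <= MAX_MAGNETIC_DAYS:
        raise ValueError(f"magnetic store retention out of range: {magnetic_days} d")
    return {
        "MemoryStoreRetentionPeriodInHours": memory_hours,
        "MagneticStoreRetentionPeriodInDays": magnetic_days,
    }


def apply_retention(database: str, table: str, props: Dict[str, int]) -> None:
    """Apply the retention settings; needs boto3 and AWS credentials."""
    import boto3  # imported lazily so the helper above works without it

    client = boto3.client("timestream-write")
    client.update_table(
        DatabaseName=database, TableName=table, RetentionProperties=props
    )


# Example: keep one day in memory and the maximum 200 years on magnetic storage.
props = retention_properties(memory_hours=24, magnetic_years=200)
# apply_retention("sensors_db", "telemetry", props)  # hypothetical names
```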

What is the right architecture/design for a website tracking system from a JavaScript client to an AWS database

We wish to build a data pipeline that tracks website interactions/events.
The goal is to track user behavior on a website, so we would like to choose the right architecture to implement it under the following two constraints:
1) the system must run on Amazon (AWS)
2) this is a budget-constrained project, so we cannot use Redshift for this purpose
Based on these two constraints, my plan is to implement the following architecture:
website-javascript --> AWS-S3 -->(AWS-Lambda)--> AWS-RDS
Website JavaScript client - tracks user interactions and sends them to Firehose.
AWS Firehose delivery stream to S3 - receives the events and eventually writes them to S3.
AWS Lambda (Python) - periodic task that pulls the daily events from S3 and loads them into RDS.
I have chosen RDS because of its cost-effectiveness for this objective.
I would appreciate any comments on the implementation above, or any other architecture you would recommend instead.
If I understand your question correctly, you are proposing the solution below to perform web analytics for your application:
WebServer --> Firehose --> AWS-S3 --> AWS-Lambda --> AWS-RDS
I see the following pros and cons with this design:
Pros:
low cost
easy to implement
Cons:
RDS may not be scalable enough to handle analytics on massive amounts of web-streaming data, which tends to grow rapidly
You need to handle load balancing, failure scenarios and other complexities for Lambda
You need to handle data transformation for RDS, as it expects structured data to be ingested into relational tables
The proposal to store the data in S3 through Firehose sounds like a good solution. But please keep in mind that the minimum interval for Firehose is one minute, so your application needs to tolerate this minor latency. You may use Kinesis Streams to get millisecond latency, but then you need to manage your own application code and instances to handle Streams.
After ingesting data in Kinesis Firehose or Streams, you may also explore the alternatives below:
Use Kinesis Analytics to track web users activity in real-time if it's available in your AWS region. It's only available in selected AWS regions currently
Within Firehose, transform your data using lambda and store it in S3 in optimized format for further analysis with AWS Athena
Use Elastic Search as a destination and perform web analytics with ELK stack instead of RDS
Though you mentioned that you cannot use Redshift, it may still be the best solution for time series analysis. Exploring Redshift, Redshift Spectrum and formatted data stored in S3 may still be a cost-effective solution with better capabilities
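For the Firehose-plus-Lambda alternative in the list above, a transformation function receives base64-encoded records and must return each one with its `recordId`, a `result` status and the re-encoded `data`. The sketch below follows that contract; the event field names (`user_id`, `event_type`, `ts`) are assumptions about what the tracking payload might contain, and newline-delimited JSON is just one format Athena can read.

```python
import base64
import json


def lambda_handler(event, context):
    """Firehose data-transformation handler: slim each click event down."""
    output = []
    for record in event["records"]:
        try:
            payload = json.loads(base64.b64decode(record["data"]))
            # Keep only the fields needed for analysis (hypothetical names).
            slim = {
                "user_id": payload.get("user_id"),
                "event_type": payload.get("event_type"),
                "ts": payload.get("ts"),
            }
            # One JSON object per line so the S3 objects are easy to query.
            data = (json.dumps(slim) + "\n").encode("utf-8")
            output.append({
                "recordId": record["recordId"],
                "result": "Ok",
                "data": base64.b64encode(data).decode("utf-8"),
            })
        except (ValueError, AttributeError):
            # Undecodable or non-object payloads are flagged, not dropped silently.
            output.append({
                "recordId": record["recordId"],
                "result": "ProcessingFailed",
                "data": record["data"],
            })
    return {"records": output}
```

Records marked `ProcessingFailed` are routed by Firehose to its error output, so malformed events do not pollute the analytics data.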
Here are a few references from AWS that you may want to go through before deciding on a solution:
Real-Time Web Analytics with Kinesis Data Analytics Solution
Near Real-time Analytics on Streaming Data with Amazon Kinesis and Amazon Elasticsearch
Schema-On-Read Analytics Pipeline Using Amazon Athena
Amazon Redshift Spectrum Extends Data Warehousing Out to Exabytes—No Loading Required
Hey folks, this is getting more and more common.
Generally the pattern is to send click events to Kinesis Streams; you can then monitor user interaction with the website in real time using Kinesis Analytics. You can connect the stream to Firehose to offload data into an S3 bucket, and incorporate Lambdas to transform the data.
There is some major complexity around handling Lambdas and Kinesis streams in parallel, so this solution might not be as scalable as using Kafka on AWS. Alternatively, run a job to move your S3 data into RDS for whatever ad hoc reporting you might need.
Here is a pattern AWS already has: real-time-web-analytics-with-kinesis

Aws: best approach to process data from S3 to RDS

I'm trying to implement, I think, a very simple process, but I don't really know what's the best approach.
I want to read a big CSV file (around 30 GB) from S3, apply some transformations and load it into RDS MySQL, and I want this process to be repeatable.
I thought that the best approach was AWS Data Pipeline, but I've found that this service is designed more for loading data from different sources into Redshift after several transformations.
I've also seen that the process of creating a pipeline is slow and a little bit messy.
Then I've found the dataduct wrapper of Coursera, but after some research, it seems that this project has been abandoned (the last commit was one year ago).
So I don't know if I should continue trying with aws data pipeline or take another approach.
I've also read about AWS Simple Workflow and Step Functions, but I don't know if it's simpler.
Then I saw a video about AWS Glue and it looks nice, but unfortunately it's not yet available and I don't know when Amazon will launch it.
As you can see, I'm a little bit confused. Can anyone enlighten me?
Thanks in advance
If you are trying to get them into RDS so you can query them, there are other options that do not require the data to be moved from S3 to RDS to do SQL like queries.
You can now use Redshift Spectrum to read and query information directly from S3.
Using Amazon Redshift Spectrum, you can efficiently query and retrieve structured and semistructured data from files in Amazon S3 without having to load the data into Amazon Redshift tables
Step 1. Create an IAM Role for Amazon Redshift
Step 2: Associate the IAM Role with Your Cluster
Step 3: Create an External Schema and an External Table
Step 4: Query Your Data in Amazon S3
Or you can use Athena to query the data in S3 if Redshift is too much horsepower for the job.
Amazon Athena is an interactive query service that makes it easy to analyze data directly in Amazon Simple Storage Service (Amazon S3) using standard SQL.
You could use an ETL tool to do the transformations on your csv data and then load it into your RDS database. There are a number of open source tools that do not require large licensing costs. That way you can pull the data into the tool, do your transformations and then the tool will load the data into your MySQL database. For example there is Talend, Apache Kafka, and Scriptella. Here's some information on them for comparison.
I think Scriptella would be an option for this situation. It can use SQL scripts (or other scripting languages), and has JDBC/ODBC compliant drivers. With this you could create a script that would perform your transformations and then load the data into your MySQL database. And you would be using familiar SQL (I'm assuming you already can create SQL scripts) so there isn't a big learning curve.
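Whichever tool is chosen, the underlying read-transform-load loop for a file of this size usually streams the CSV and inserts in batches rather than loading 30 GB into memory. The sketch below shows that loop with the standard library only; the column names (`id`, `name`), the transformation and the chunk size are assumptions for illustration. In practice the lines could come from a streamed S3 object and `insert_batch` could wrap a MySQL cursor's `executemany`.

```python
import csv
import io
from typing import Callable, Dict, Iterable, Iterator, List

Row = Dict[str, object]


def transform(row: Dict[str, str]) -> Row:
    """Example transformation on hypothetical columns: cast id, tidy name."""
    return {"id": int(row["id"]), "name": row["name"].strip().title()}


def chunked_rows(lines: Iterable[str], chunk_size: int) -> Iterator[List[Row]]:
    """Stream CSV lines and yield transformed rows in fixed-size chunks."""
    reader = csv.DictReader(lines)  # first line is treated as the header
    chunk: List[Row] = []
    for row in reader:
        chunk.append(transform(row))
        if len(chunk) >= chunk_size:
            yield chunk
            chunk = []
    if chunk:  # final partial chunk
        yield chunk


def load(
    lines: Iterable[str],
    insert_batch: Callable[[List[Row]], None],
    chunk_size: int = 10_000,
) -> int:
    """Push every chunk to `insert_batch` and return the number of rows loaded."""
    total = 0
    for chunk in chunked_rows(lines, chunk_size):
        insert_batch(chunk)
        total += len(chunk)
    return total


# Small in-memory demo standing in for the S3 file and the database.
sample = io.StringIO("id,name\n1, alice \n2,bob\n3,carol\n")
demo_batches: List[List[Row]] = []
loaded = load(sample, demo_batches.append, chunk_size=2)
```

Because only one chunk is held in memory at a time, the same loop can be rerun on new files, which addresses the repeatability requirement in the question.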