I am looking to replicate Amazon Redshift data to a different region. However, Redshift only allows cross-region replication for backup purposes. What I want is to process the data in one region and then replicate the processed data to another region, so that tools in the second region can connect directly to a Redshift cluster there with lower latency.
I know I could follow Build Multi-AZ or Multi-Region Amazon Redshift Clusters | AWS Big Data Blog using Kinesis, but since this is not a transactional app we really don't want to do that.
I'm wondering whether it is possible to easily sync an Amazon RDS PostgreSQL database to Amazon S3 in near real time so that data can be used with Amazon Athena, just as read replicas do.
We have several RDS databases and we would like to consolidate all the data in a single repository such as S3.
Thanks.
There is no capability to "export RDS to S3 in real time".
However, Amazon Athena can query Amazon RDS databases, so you could have some of your data in Amazon S3 and some in Amazon RDS.
See: Query any data source with Amazon Athena’s new federated query | AWS Big Data Blog
What you are describing sounds like a data warehouse, where information is extracted from many information sources and is stored in one place for easy querying -- often in 'wide' tables to make querying simpler. However, this is very difficult to do "in real time". It is typically updated nightly, or perhaps hourly.
You might want to consider using AWS Database Migration Service to continuously sync data between RDS and S3: https://aws.amazon.com/premiumsupport/knowledge-center/s3-bucket-dms-target/
That said, it only makes sense when you don't have a read-only replica of the data and the queries might affect the performance of the source RDS instance.
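If you go the DMS route, a minimal Boto3 sketch of a continuous (full load + CDC) replication task might look like the following; the task name, endpoint ARNs, and replication-instance ARN are placeholders you would substitute with your own.

```python
# Hypothetical sketch: create an AWS DMS task that continuously replicates
# an RDS source to an S3 target. All ARNs and names below are placeholders.
import json
import boto3

dms = boto3.client("dms", region_name="us-east-1")

table_mappings = {
    "rules": [
        {
            "rule-type": "selection",
            "rule-id": "1",
            "rule-name": "include-all",
            "object-locator": {"schema-name": "%", "table-name": "%"},
            "rule-action": "include",
        }
    ]
}

response = dms.create_replication_task(
    ReplicationTaskIdentifier="rds-to-s3-cdc",            # placeholder name
    SourceEndpointArn="arn:aws:dms:...:endpoint:source",  # assumed RDS endpoint
    TargetEndpointArn="arn:aws:dms:...:endpoint:target",  # assumed S3 endpoint
    ReplicationInstanceArn="arn:aws:dms:...:rep:instance",
    MigrationType="full-load-and-cdc",  # initial load plus ongoing changes
    TableMappings=json.dumps(table_mappings),
)
print(response["ReplicationTask"]["Status"])
```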
Do the Amazon S3 bucket, Amazon Athena, and Amazon QuickSight need to be in the same region, or can they be in different regions in order to process the data?
The Athena documentation indicates:
You can query data in regions other than the region where you run Athena. Standard inter-region data transfer rates for Amazon S3 apply in addition to standard Athena charges.
Not 100% sure about QuickSight, though I know it cannot connect to VPC resources cross-region.
As a general rule, I'd try hard to ensure that my analytics tools were in the same region as the data, for both latency and network cost reasons. That might mean considering the cost of cross-region S3 replication.
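For reference, a cross-region Athena query doesn't require anything special in the API call itself; here is a rough Boto3 sketch, where the database, table, and results-bucket names are made up for illustration.

```python
# Minimal sketch: run an Athena query from one region against a table whose
# underlying S3 data may live in another region. Cross-region S3 transfer
# charges would apply. Names below are placeholders.
import boto3

athena = boto3.client("athena", region_name="us-east-1")

response = athena.start_query_execution(
    QueryString="SELECT * FROM sales LIMIT 10",          # hypothetical table
    QueryExecutionContext={"Database": "analytics_db"},  # hypothetical database
    ResultConfiguration={
        "OutputLocation": "s3://my-athena-results-bucket/"  # placeholder bucket
    },
)
print(response["QueryExecutionId"])
```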
I am trying to setup a sync between AWS Aurora and Redshift. What is the best way to achieve this sync?
Possible ways to sync could be:

1. Query the table to find changes (since I am only doing inserts, updates don't matter), export these changes to a flat file in an S3 bucket, and use the Redshift COPY command to insert them into Redshift.
2. Use a Python publisher and Boto3 to publish changes into a Kinesis stream, and then consume this stream in Firehose, from where I can copy directly into Redshift (a rough publisher sketch follows this list).
3. Use the Kinesis Agent to detect changes in the binlog (is it possible to detect binlog changes using the Kinesis Agent?), publish them to Firehose, and from there copy into Redshift.
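For option 2, a rough Boto3 publisher might look like the following; the stream name, record shape, and partition key are assumptions for illustration only.

```python
# Rough sketch of option 2: a Python publisher using Boto3 to push changed
# rows into a Kinesis stream. Stream name and record shape are assumptions.
import json
import boto3

kinesis = boto3.client("kinesis", region_name="us-east-1")

def publish_change(row: dict) -> None:
    """Publish a single inserted row to the stream."""
    kinesis.put_record(
        StreamName="aurora-changes",           # placeholder stream name
        Data=json.dumps(row).encode("utf-8"),
        PartitionKey=str(row.get("id", "0")),  # assumes an 'id' column
    )

# Example usage with a fabricated row
publish_change({"id": 42, "amount": 19.99, "created_at": "2016-01-01T00:00:00Z"})
```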
I haven't explored AWS Data Pipeline yet.
As pointed out by @Mark B, the AWS Database Migration Service can migrate data between databases. This can be done as a one-off exercise, or it can run continuously, keeping two databases in sync.
The documentation shows that Amazon Aurora can be a source and Amazon Redshift can be a target.
AWS has just announced this new feature: Amazon Aurora zero-ETL integration with Amazon Redshift
This natively provides near-real-time (within seconds) synchronization from Aurora to Redshift.
You can also use federated queries: https://docs.aws.amazon.com/redshift/latest/dg/federated-overview.html
I would like to backup a snapshot of my Amazon Redshift cluster into Amazon Glacier.
I don't see a way to do that using the API of either Redshift or Glacier. I also don't see a way to export a Redshift snapshot to a custom S3 bucket so that I can write a script to move the files into Glacier.
Any suggestion on how I should accomplish this?
There is no function in Amazon Redshift to export data directly to Amazon Glacier.
Amazon Redshift snapshots, while stored in Amazon S3, are only accessible via the Amazon Redshift console for restoring data back to Redshift. The snapshots are not accessible for any other purpose (eg moving to Amazon Glacier).
The closest option for moving data from Redshift to Glacier would be to use the Redshift UNLOAD command to export data to files in Amazon S3, and then to lifecycle the data from S3 into Glacier.
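To illustrate that second step, here is a Boto3 sketch that adds a lifecycle rule transitioning the unloaded files to Glacier; the bucket name, prefix, and 30-day threshold are placeholders, not recommendations.

```python
# Illustrative sketch: after UNLOADing Redshift data to S3, add a lifecycle
# rule that transitions those objects to Glacier. Names are placeholders.
#
# The UNLOAD itself would look something like:
#   UNLOAD ('SELECT * FROM my_table')
#   TO 's3://my-redshift-unload-bucket/unload/'
#   IAM_ROLE 'arn:aws:iam::...:role/RedshiftUnloadRole';
import boto3

s3 = boto3.client("s3")

s3.put_bucket_lifecycle_configuration(
    Bucket="my-redshift-unload-bucket",  # placeholder bucket
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "archive-unloaded-data",
                "Filter": {"Prefix": "unload/"},  # placeholder prefix
                "Status": "Enabled",
                "Transitions": [
                    {"Days": 30, "StorageClass": "GLACIER"}  # assumed threshold
                ],
            }
        ]
    },
)
```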
Alternatively, simply keep the data in Redshift snapshots. Backup storage beyond the provisioned storage size of your cluster and backups stored after your cluster is terminated are billed at standard Amazon S3 rates. This has the benefit of being easily loadable back into a Redshift cluster. While you'd be paying slightly more for storage (compared to Glacier), the real cost saving is in the convenience of quickly restoring the data in future.
Is there any use case for taking a backup, given that Redshift automatically keeps snapshots?
We would like to stream data directly from EC2 web server to RedShift. Do I need to use Kinesis? What is the best practice? I do not plan to do any special analysis before the storage on this data. I would like a cost effective solution (it might be costly to use DynamoDB as a temporary storage before loading).
If cost is your primary concern, then the exact number of records per second, combined with the record sizes, can be important.
If you are talking very low volume of messages a custom app running on a t2.micro instance to aggregate the data is about as cheap as you can go, but it won't scale. The bigger downside is that you are responsible for monitoring, maintaining, and managing that EC2 instance.
The modern approach would be to use a combination of Kinesis + Lambda + S3 + Redshift to have the data stream in, requiring no EC2 instances to manage!
The approach is described in this blog post: A Zero-Administration Amazon Redshift Database Loader
What that blog post doesn't mention is that, with API Gateway, if you do need any kind of custom authentication or data transformation, you can do that without an EC2 instance by using Lambda to broker the data into Kinesis.
This would look like:
API Gateway -> Lambda -> Kinesis -> Lambda -> S3 -> Redshift
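A minimal sketch of the first Lambda in that chain, assuming an API Gateway proxy integration and a hypothetical stream name, could be:

```python
# Sketch of the first Lambda in the chain above: receive an API Gateway
# request and forward its body into a Kinesis stream. The stream name and
# request shape are assumptions for illustration only.
import json
import boto3

kinesis = boto3.client("kinesis")

def lambda_handler(event, context):
    # API Gateway proxy integration delivers the request body as a string
    payload = event.get("body") or "{}"

    kinesis.put_record(
        StreamName="web-events",              # placeholder stream name
        Data=payload.encode("utf-8"),
        PartitionKey=context.aws_request_id,  # spread records across shards
    )
    return {"statusCode": 200, "body": json.dumps({"status": "queued"})}
```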
Redshift is best suited for batch loading using the COPY command. A typical pattern is to load data to either DynamoDB, S3, or Kinesis, then aggregate the events before using COPY to Redshift.
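To illustrate the batch-load step, a rough sketch using the Redshift Data API could look like the following; the cluster, database, table, bucket, and IAM role names are all placeholders.

```python
# Minimal sketch of the batch-load pattern: once files have accumulated in
# S3, issue a COPY into Redshift. Uses the Redshift Data API purely for
# illustration; all identifiers below are placeholders.
import boto3

redshift_data = boto3.client("redshift-data", region_name="us-east-1")

copy_sql = """
    COPY web_events
    FROM 's3://my-staging-bucket/batch/2016-01-01/'
    IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole'
    FORMAT AS JSON 'auto';
"""

redshift_data.execute_statement(
    ClusterIdentifier="analytics-cluster",  # placeholder cluster
    Database="analytics",                   # placeholder database
    DbUser="loader",                        # placeholder DB user
    Sql=copy_sql,
)
```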
See also this useful SO Q&A.
I implemented such a system last year inside my company using Kinesis and the Kinesis connector. The Kinesis connector is just a standalone app released by AWS that we run on a bunch of Elastic Beanstalk servers as Kinesis consumers. The connector aggregates messages to S3 every so often, or after a certain number of messages, and then triggers the COPY command to load the data into Redshift periodically. Since it runs on Elastic Beanstalk, you can tune the auto-scaling conditions to make sure the cluster grows and shrinks with the volume of data from the Kinesis stream.
BTW, AWS just announced Kinesis Firehose yesterday. I haven't played with it yet, but it definitely looks like a managed version of the Kinesis connector.