Redshift COPY or snapshots? - amazon-web-services

I'm looking at using AWS Redshift to let users submit queries against old archived data that isn't available on my web page.
The total data I'm dealing with across all my users is a couple of terabytes. The data is already in an S3 bucket, split up into files by week. Most requests won't deal with more than a few files totaling 100GB.
To keep costs down, should I use snapshots and delete our cluster when it's not in use, or should I have a smaller cluster which doesn't hold all of the data and only COPY data from S3 into a temporary table when running a query?

If you are just doing occasional queries where cost is more important than speed, you could consider using Amazon Athena, which can query data stored in Amazon S3. (Only in some AWS regions at the moment.) You are only charged for the amount of data read from disk.
For ways to get even better value out of Athena, see: Analyzing Data in S3 using Amazon Athena
Amazon Redshift Spectrum can perform a similar job to Athena but requires an Amazon Redshift cluster to be running.
All other choices are really a trade-off between cost and access to your data. You could start by taking a snapshot of your Amazon Redshift database and then turning it off at night and on the weekends. Then, have a script that can restore it automatically for queries. Use fewer nodes to reduce costs -- this will make queries slower, but that doesn't seem to be an issue for you.
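The snapshot-and-restore approach could be scripted roughly as follows. This is only a sketch: it assumes `client` is a boto3 Redshift client (for example `boto3.client("redshift")`), and the cluster and snapshot identifiers are hypothetical names. Scheduling, waiting for the cluster to become available, and error handling are left out.

```python
def shut_down_with_snapshot(client, cluster_id, snapshot_id):
    """Delete the cluster, asking Redshift to take a final snapshot first."""
    return client.delete_cluster(
        ClusterIdentifier=cluster_id,
        SkipFinalClusterSnapshot=False,
        FinalClusterSnapshotIdentifier=snapshot_id,
    )


def restore_for_queries(client, cluster_id, snapshot_id):
    """Recreate the cluster from the snapshot taken at shutdown."""
    return client.restore_from_cluster_snapshot(
        ClusterIdentifier=cluster_id,
        SnapshotIdentifier=snapshot_id,
    )
```

Passing the client in, rather than creating it inside the functions, keeps the sketch easy to try against a stub before pointing it at a real account.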

Related

Why do we first need to unload data from redshift to S3

I am trying to consume some data in Redshift using SageMaker to train a model. After some research, I found that the best way to do so is to first unload the data from Redshift to an S3 bucket. I assume SageMaker has an API to interact with Redshift directly, so why do we need to unload it to an S3 bucket first?
UNLOADing is a best practice and generally the method that the docs will promote. This is due to efficiency and performance. Redshift is a cluster with a single leader and multiple compute nodes. S3 is a cluster - a distributed object store. Having multiple compute nodes connect to S3 when moving data is far faster and less of a burden to the database.
Tools that you may be using with SageMaker (like EMR) are clusters too, and will also benefit from multiple parallel connections to S3.
The larger the amount of data being moved the greater this benefit will be.
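As an illustration, an UNLOAD statement could be assembled like this. It is a sketch rather than a recommended setup: the table, bucket prefix and IAM role ARN are made-up placeholders, and Parquet is just one possible output format.

```python
def build_unload_sql(select_sql, s3_prefix, iam_role_arn):
    """Build a Redshift UNLOAD statement that writes query results to S3."""
    # UNLOAD takes the query as a quoted string, so single quotes
    # inside the query have to be doubled.
    escaped = select_sql.replace("'", "''")
    return (
        f"UNLOAD ('{escaped}')\n"
        f"TO '{s3_prefix}'\n"
        f"IAM_ROLE '{iam_role_arn}'\n"
        "FORMAT AS PARQUET;"
    )


# Hypothetical names, for illustration only.
print(build_unload_sql(
    "SELECT * FROM events WHERE kind = 'click'",
    "s3://example-bucket/training/events_",
    "arn:aws:iam::123456789012:role/ExampleUnloadRole",
))
```

By default UNLOAD writes several files in parallel, which is what lets the compute nodes talk to S3 concurrently.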

Sync Amazon RDS (PostgreSQL) to S3 in near real time

I'm wondering whether it is possible to easily sync an Amazon RDS PostgreSQL database to Amazon S3 in near real time so that data can be used with Amazon Athena, just as read replicas do.
We have several RDS databases and we would like to consolidate all the data in a single repository such as S3.
Thanks.
There is no capability to "export RDS to S3 in real time".
However, Amazon Athena can query Amazon RDS databases, so you could have some of your data in Amazon S3 and some in Amazon RDS.
See: Query any data source with Amazon Athena’s new federated query | AWS Big Data Blog
What you are describing sounds like a data warehouse, where information is extracted from many information sources and is stored in one place for easy querying -- often in 'wide' tables to make querying simpler. However, this is very difficult to do "in real time". It is typically updated nightly, or perhaps hourly.
You might want to consider using AWS Database Migration Service to continuously sync data between RDS and S3: https://aws.amazon.com/premiumsupport/knowledge-center/s3-bucket-dms-target/
That said, this is only sensible when you don't have a read-only replica of the data and the queries might affect the performance of the source RDS database.

Does Amazon Redshift have its own storage backend

I'm new to Redshift and need some clarification on how Redshift operates:
Does Amazon Redshift have its own backend storage platform, or does it depend on S3 to store the data as objects, with Redshift used only for querying, processing and transforming, and with temporary storage to pick up the specific slice from S3 and process it?
In other words, does Redshift have its own backend storage, the way Oracle or Microsoft SQL Server have their own physical servers on which data is stored?
I ask because I'm migrating from a conventional RDBMS to Redshift due to increased volume. Would Redshift alone do, or should I opt for a combination of Redshift and S3?
This question seems to be basic, but I'm unable to find answer in Amazon websites or any of the blogs related to Redshift.
Yes, Amazon Redshift uses its own storage.
The prime use-case for Amazon Redshift is running complex queries against huge quantities of data. This is the purpose of a "data warehouse".
Whereas normal databases start to lose performance when there are 1+ million rows, Amazon Redshift can handle billions of rows. This is because data is distributed across multiple nodes and is stored in a columnar format, making it suitable for handling "wide" tables (which are typical in data warehouses). It is this dedicated storage, and the way that data is stored, that gives Redshift its speed.
The trade-off, however, is that while Redshift is great for querying large quantities of data, it is not designed for frequently updating data. Thus, it should not be substituted for a normal database that is being used by an application for transactions. Rather, Redshift is often used to take that transactional data, combine it with other information (customers, orders, transactions, support tickets, sensor data, website clicks, tracking information, etc) and then run complex queries that combine all that data.
Amazon Redshift can also use Amazon Redshift Spectrum, which is very similar to Amazon Athena. Both services can read data directly from Amazon S3. Such access is not as efficient as using data stored directly in Redshift, but can be improved by using columnar storage formats (eg ORC and Parquet) and by partitioning files. This, of course, is only good for querying data, not for performing transactions (updates) against the data.
The newer Amazon Redshift RA3 nodes also have the ability to offload less-used data to Amazon S3, and use caching to run fast queries. The benefit is that this separates storage from compute.
Quick summary:
If you need a database for an application, use Amazon RDS
If you are building a data warehouse, use Amazon Redshift
If you have a lot of historical data that is rarely queried, store it in Amazon S3 and query it via Amazon Athena or Amazon Redshift Spectrum
Looking at your question, you may benefit from professional help with your architecture.
However, to get you started, Redshift:
has its own data storage, with no link to S3.
can also query data held in S3 through Amazon Redshift Spectrum (similar to AWS Athena).
is not a good alternative as a back-end database to replace a traditional RDBMS, as transactions are very slow.
is a great data warehouse tool, just use it for that!

How to control increasing data volume in Redshift?

I have a data warehouse maintained in AWS Redshift. The data volume and velocity have both increased lately. One option is to keep scaling the cluster horizontally, at the expense of a higher cost of course. I was wondering if there are any archiving options available so that I can query the entire data as usual (maybe with a compromise in the querying time) but with a low or no additional cost?
One option would be to use external tables and query data directly from S3 but the tools used for achieving this, like Athena and Glue have their own cost, that too on a per query basis.
Easy options:
Ensure all tables have compression (check with SELECT * FROM svv_table_info;).
Maximize compression by changing large tables to use ENCODE zstd.
Switch small tables < ~50k rows (depends) to DISTSTYLE ALL (yes this saves space!).
Switch from SSD based nodes (dc2) to HDD nodes (ds2), which have about 8x more storage space.
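The first three easy options could be turned into SQL along these lines. This is a sketch under assumptions: the table and column names are hypothetical, the 50k-row cutoff is only the rough figure mentioned above, and whether a given column's encoding can be altered in place depends on the table, so try it on a copy first.

```python
# Lists tables with their compression flag, distribution style, size and row count.
TABLE_INFO_SQL = (
    'SELECT "table", encoded, diststyle, size, tbl_rows '
    "FROM svv_table_info ORDER BY size DESC;"
)


def small_tables(rows, max_rows=50_000):
    """Pick candidate tables for DISTSTYLE ALL from svv_table_info rows."""
    return [r["table"] for r in rows if r["tbl_rows"] < max_rows]


def zstd_sql(table, columns):
    """One ALTER statement per column to switch its encoding to zstd."""
    return [f"ALTER TABLE {table} ALTER COLUMN {col} ENCODE zstd;" for col in columns]


def diststyle_all_sql(table):
    """Statement that changes a table's distribution style to ALL."""
    return f"ALTER TABLE {table} ALTER DISTSTYLE ALL;"
```

The `rows` argument is assumed to be the result of `TABLE_INFO_SQL` fetched as a list of dictionaries by whatever database driver you use.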
Less easy options:
UNLOAD older data from Redshift to S3 and query using Redshift Spectrum.
Convert unloaded data to Parquet or ORC format using AWS Glue or AWS EMR and then query using Redshift Spectrum.
Please experiment with Redshift Spectrum. Query performance is typically very good and gets even better if your data is in a columnar format (Parquet/ORC).
Data stored within Amazon Redshift will provide the highest performance.
However, if you have data that is less-frequently accessed, you could export (UNLOAD) it into Amazon S3, preferably as compressed, partitioned data. Storing it as Parquet or ORC would be even better!
You could then use Amazon Redshift Spectrum to query the external data in Amazon S3. You can even join external data with Redshift data, so you could query historical information and current information in the one query.
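A rough sketch of what that could look like, generating the SQL from Python. All schema, table, column and bucket names here are hypothetical, and it assumes an external schema (here `spectrum`) has already been created with CREATE EXTERNAL SCHEMA.

```python
def external_table_sql(schema, table, columns, partition_columns, s3_location):
    """Build a CREATE EXTERNAL TABLE statement for Parquet files in S3."""
    cols = ",\n    ".join(f"{name} {ctype}" for name, ctype in columns)
    parts = ", ".join(f"{name} {ctype}" for name, ctype in partition_columns)
    return (
        f"CREATE EXTERNAL TABLE {schema}.{table} (\n    {cols}\n)\n"
        f"PARTITIONED BY ({parts})\n"
        "STORED AS PARQUET\n"
        f"LOCATION '{s3_location}';"
    )


def combined_query_sql(local_table, external_table, columns):
    """Read current rows from Redshift and historical rows from S3 in one query."""
    col_list = ", ".join(columns)
    return (
        f"SELECT {col_list} FROM {local_table}\n"
        "UNION ALL\n"
        f"SELECT {col_list} FROM {external_table};"
    )


print(external_table_sql(
    "spectrum", "events_archive",
    [("event_id", "BIGINT"), ("payload", "VARCHAR(1024)")],
    [("event_week", "DATE")],
    "s3://example-bucket/archive/events/",
))
print(combined_query_sql("events", "spectrum.events_archive", ["event_id", "payload"]))
```

With a partitioned external table, each partition also has to be registered (for example with ALTER TABLE ... ADD PARTITION) before Spectrum can see its files.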
Alternatively, you could use Amazon Athena to query the data directly from Amazon S3. This is similar to Redshift Spectrum, but does not require Redshift. Amazon Athena is based on Presto, so it is super-fast, especially if data is stored as compressed, partitioned, Parquet/ORC.
See: Analyzing Data in S3 using Amazon Athena | AWS Big Data Blog
Please note that Redshift Spectrum and Amazon Athena charge based upon the amount of data read from disk. Therefore, compressed, partitioned Parquet/ORC is both cheaper and faster.
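To get a feel for the numbers, here is a back-of-the-envelope sketch. The price of 5 USD per TB scanned is an assumption (check current pricing for your region), and the 10% figure for how much of the data a query actually reads after compression and column pruning is purely illustrative.

```python
def scan_cost_usd(bytes_scanned, usd_per_tb=5.0):
    """Estimate the charge for a query that reads the given number of bytes."""
    return bytes_scanned / 1024**4 * usd_per_tb


raw_bytes = 100 * 1024**3            # a 100 GB query against uncompressed files
columnar_bytes = raw_bytes * 0.10    # assumed: only 10% is read from Parquet/ORC

print(f"raw:      ${scan_cost_usd(raw_bytes):.2f}")
print(f"columnar: ${scan_cost_usd(columnar_bytes):.2f}")
```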

Do I need to set up a backup data pipeline for AWS DynamoDB on a daily basis?

I am considering using AWS DynamoDB for an application we are building. I understand that setting a backup job that exports data from DynamoDB to S3 involves a data pipeline with EMR. But my question is do I need to worry about having a backup job set up on day 1? What are the chances that a data loss would happen?
There are multiple use-cases for copying DynamoDB table data elsewhere:
(1) Create a backup in S3 on a daily basis, in order to restore in case of accidental deletion of data or worse yet drop table (code bugs?)
(2) Create a backup in S3 to become the starting point of your analytics workflows. Once this data is backed up in S3, you can combine it with, say, your RDBMS system (RDS or on-premise) or other S3 data from log files. Data Integration workflows could involve EMR jobs to be ultimately loaded into Redshift (ETL) for BI queries. Or directly load these into Redshift to do more ELT style - so transforms happen within Redshift
(3) Copy (the whole set or a subset of) data from one table to another (either within the same region or another region) - so the old table can be garbage collected for controlled growth and cost containment. This table-to-table copy could also be used as a readily consumable backup table in case of, say region-specific availability issues. Or, use this mechanism to copy data from one region to another to serve it from an endpoint closer to the DynamoDB client application that is using it.
(4) Periodic restore of data from S3. Possibly as a way to load back post-analytics data back into DynamoDB for serving it in online applications with high-concurrency, low-latency requirements.
AWS Data Pipeline helps schedule all these scenarios with flexible data transfer solutions (using EMR underneath).
One caveat when using these solutions is that this is not a point-in-time backup: changes made to the underlying table while the backup is running may not be captured consistently.
This is really subjective. IMO you shouldn't worry about them 'now'.
You can also use simpler solutions than a pipeline. Perhaps that would be a good place to start.
After running DynamoDB as our main production database for more than a year, I can say it has been a great experience. No data loss and no downtime. The only things we have to watch are the SDK occasionally misbehaving and tweaking provisioned throughput.
Data Pipeline is only available in a limited number of regions:
https://docs.aws.amazon.com/general/latest/gr/rande.html#datapipeline_region
I would recommend setting up a Data Pipeline to back up to an S3 bucket on a daily basis, if you want to be really safe.
DynamoDB itself might be very reliable, but nobody can protect you from your own accidental deletions (what if you or a colleague deleted a table from the console by mistake?). So I would suggest setting up a daily backup. It doesn't cost much in any case.
You can tell the pipeline to consume only, say, 25% of the capacity while the backup is running so that your real users don't see any delay. Every backup is "full" (not incremental), so you can periodically delete old backups if you are concerned about storage.
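Because each backup is a full copy, pruning old ones can be as simple as a date cutoff. A minimal sketch, assuming each backup is identified by the date it was taken and that a 30-day retention window (a made-up figure) suits you:

```python
from datetime import date, timedelta


def backups_to_delete(backup_dates, today, keep_days=30):
    """Return the backup dates that fall outside the retention window."""
    cutoff = today - timedelta(days=keep_days)
    return sorted(d for d in backup_dates if d < cutoff)


backups = [date(2024, 1, 1), date(2024, 2, 20), date(2024, 3, 1)]
print(backups_to_delete(backups, today=date(2024, 3, 5)))
```

Actually removing the corresponding objects from S3 is left out here. An S3 lifecycle rule on the backup prefix is another way to get the same effect without any code.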