Stream many tables in realtime to Redshift: Bottleneck parallel concurrent load - amazon-web-services

We recently released an open source project to stream data to Redshift in near realtime.
Github: https://github.com/practo/tipoca-stream
The realtime data pipeline streams data to Redshift from RDS.
Debezium writes the RDS events to Kafka.
We wrote Redshiftsink to sink data from Kafka to Redshift.
We have thousands of tables streaming to Redshift, and we use the COPY command to load them. We wish to load every ~10 minutes to keep the data as near realtime as possible.
Problem
Parallel load becomes a bottleneck. Redshift is not good at ingesting data at such short intervals. We do understand Redshift is not a realtime database. What is the best that can be done? Does Redshift plan to solve this in the future?
Workaround that works for us!
We have 1000+ tables in Redshift, but we use no more than 400 in a day. For this reason we now throttle loads for the unused tables when needed. This ensures the tables that are in use always stay near realtime and keeps Redshift less burdened. This has been very useful.
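To make the throttling idea concrete, here is a rough Python sketch (this is not the Redshiftsink implementation; the stl_scan-based activity check, the connection string, the S3 manifest layout and the IAM role are all assumptions for illustration):

```python
import psycopg2

# Placeholders only: DSN, IAM role and manifest layout are assumptions.
REDSHIFT_DSN = "dbname=analytics host=my-cluster.xxxx.us-east-1.redshift.amazonaws.com port=5439 user=loader password=..."
IAM_ROLE = "arn:aws:iam::123456789012:role/redshift-copy-role"

def recently_queried_tables(cur, days=1):
    """Approximate the set of tables scanned by queries in the last `days` days."""
    cur.execute("""
        SELECT DISTINCT TRIM(p.name)
        FROM stl_scan s
        JOIN stv_tbl_perm p ON s.tbl = p.id
        WHERE s.starttime > DATEADD(day, -%s, GETDATE())
    """, (days,))
    return {row[0] for row in cur.fetchall()}

def run_load_cycle(pending_batches):
    """pending_batches maps table name -> S3 manifest URL accumulated since the last cycle."""
    with psycopg2.connect(REDSHIFT_DSN) as conn, conn.cursor() as cur:
        active = recently_queried_tables(cur)
        for table, manifest in pending_batches.items():
            if table not in active:
                continue  # throttled: unused tables wait for a less frequent catch-up load
            cur.execute(
                f"COPY {table} FROM '{manifest}' "
                f"IAM_ROLE '{IAM_ROLE}' FORMAT AS JSON 'auto' MANIFEST"
            )
```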
Looking for suggestions from the Redshift community!

Related

AWS Redshift or RDS for a Data warehouse?

Right now we have an ETL that extracts info from an API, transforms it, and stores it in one big table in our OLTP database. We want to migrate this table to some OLAP solution. This table is only read to do some calculations whose results we store in our OLTP database.
Which service fits best here?
We are currently evaluating Redshift but have never used the service before. Also, we thought of a snowflake schema (some kind of fact table with dimensions) in RDS because it is intended to store 10 GB to 100 GB, but we don't know how well this approach would scale.
imho you could do a PoC to see which service is more feasible for you. It really depends on how much data you have, and on what queries and what load you plan to execute.
AWS Redshift is intended for OLAP at petabyte or exabyte scale, handling heavy parallel workloads. Redshift can also aggregate data from other data sources (JDBC, S3, ...). However, Redshift is not OLTP; it requires more static server overhead and extra skills for managing the deployment.
So without more numbers and use cases one cannot advise anything. The cloud is great in that you can try things and see what fits you.
AWS Redshift is really great when you mostly just want to read data from the database. Under the hood, Redshift is a column-oriented database that is better suited for analytics. You can transfer all your existing data to Redshift using AWS DMS. AWS DMS is a service that basically reads the binlogs of your existing database and transfers the data automatically; you don't have to do anything. From my personal experience, Redshift is really great.
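As a rough illustration of the DMS route (a sketch only, assuming the source and Redshift endpoints and a replication instance already exist in DMS; every ARN and name below is a placeholder):

```python
import json
import boto3

dms = boto3.client("dms", region_name="us-east-1")

# Replicate every table in the "app" schema.
table_mappings = {
    "rules": [{
        "rule-type": "selection",
        "rule-id": "1",
        "rule-name": "include-all",
        "object-locator": {"schema-name": "app", "table-name": "%"},
        "rule-action": "include",
    }]
}

dms.create_replication_task(
    ReplicationTaskIdentifier="oltp-to-redshift",
    SourceEndpointArn="arn:aws:dms:us-east-1:123456789012:endpoint:SOURCE",
    TargetEndpointArn="arn:aws:dms:us-east-1:123456789012:endpoint:TARGET",
    ReplicationInstanceArn="arn:aws:dms:us-east-1:123456789012:rep:INSTANCE",
    MigrationType="full-load-and-cdc",  # initial copy, then ongoing binlog-based replication
    TableMappings=json.dumps(table_mappings),
)
```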

AWS: Sync data from DynamoDB to Redshift on an hourly basis

I'm storing some events into DynamoDB. I have to sync (i.e. copy incrementally) the data with Redshift. Ultimately, I want to be able to analyze the data through AWS Quicksight.
I've come across multiple solutions but those are either one-time (using the one-time COPY command) or real-time (a streaming data pipeline using Kinesis Firehose).
The real-time solution seems superior to hourly sync, but I'm worried about performance and complexity. I was wondering if there's an easier way to batch the updates on an hourly basis.
What you are looking for is DynamoDB Streams (official docs). These can flow seamlessly into Kinesis Firehose, as you have correctly pointed out.
This is the most optimal way and provides the best balance between cost, operational overhead and functionality. Allow me to explain how:
DynamoDB Streams: Streams are triggered when any activity happens on the database. This means that, unlike a process that scans the data on a periodic basis and consumes read capacity even if there is no update, you are notified only when there is new data.
Kinesis Firehose: You can configure Firehose to batch data either by the size of the data or by time. This means that if you have a good inflow, you can set the stream to batch the records received in each 2-minute interval and then issue just one COPY command to Redshift. The same goes for the size of the data in the stream buffer. Read more about it here.
The ideal way to load data into Redshift is via COPY command and Kinesis Firehose does just that. You can also configure it to automatically create backup of the data into S3.
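For the Firehose side of this, a minimal boto3 sketch (all ARNs, credentials, bucket and table names below are placeholders): buffer records for up to 2 minutes or 64 MB, stage them in S3, then let Firehose issue one COPY into Redshift.

```python
import boto3

firehose = boto3.client("firehose", region_name="us-east-1")

firehose.create_delivery_stream(
    DeliveryStreamName="events-to-redshift",
    DeliveryStreamType="DirectPut",
    RedshiftDestinationConfiguration={
        "RoleARN": "arn:aws:iam::123456789012:role/firehose-delivery-role",
        "ClusterJDBCURL": "jdbc:redshift://my-cluster.xxxx.us-east-1.redshift.amazonaws.com:5439/analytics",
        "CopyCommand": {
            "DataTableName": "public.events",
            "CopyOptions": "FORMAT AS JSON 'auto' GZIP",
        },
        "Username": "firehose_user",
        "Password": "...",
        "S3Configuration": {  # intermediate staging area Firehose COPYs from
            "RoleARN": "arn:aws:iam::123456789012:role/firehose-delivery-role",
            "BucketARN": "arn:aws:s3:::my-staging-bucket",
            "CompressionFormat": "GZIP",
            "BufferingHints": {"SizeInMBs": 64, "IntervalInSeconds": 120},
        },
    },
)
```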
Remember that a reactive or push-based system is almost always more performant and less costly than a proactive, poll-based one. You save on the compute capacity needed to run a cron process and to continuously scan for updates.

Does Amazon Redshift have its own storage backend

I'm new to Redshift and need some clarification on how Redshift operates:
Does Amazon Redshift have its own backend storage platform, or does it depend on S3 to store the data as objects, with Redshift used only for querying, processing and transforming, keeping temporary storage to pick up a specific slice from S3 and process it?
In other words, does Redshift have its own backend storage, the way Oracle or Microsoft SQL Server have their own physical servers in which data is stored?
Because if I'm migrating from a conventional RDBMS to Redshift due to increased volume, would Redshift alone do, or should I opt for a combination of Redshift and S3?
This question seems basic, but I'm unable to find the answer on Amazon's websites or in any of the blogs related to Redshift.
Yes, Amazon Redshift uses its own storage.
The prime use-case for Amazon Redshift is running complex queries against huge quantities of data. This is the purpose of a "data warehouse".
Whereas normal databases start to lose performance at a million or more rows, Amazon Redshift can handle billions of rows. This is because data is distributed across multiple nodes and is stored in a columnar format, making it suitable for handling the "wide" tables that are typical in data warehouses. It is this dedicated storage, and the way the data is stored, that gives Redshift its speed.
The trade-off, however, is that while Redshift is amazing for querying large quantities of data, it is not designed for frequently updating data. Thus, it should not be substituted for a normal database that is being used by an application for transactions. Rather, Redshift is often used to take that transactional data, combine it with other information (customers, orders, transactions, support tickets, sensor data, website clicks, tracking information, etc.) and then run complex queries that combine all of that data.
Amazon Redshift can also use Amazon Redshift Spectrum, which is very similar to Amazon Athena. Both services can read data directly from Amazon S3. Such access is not as efficient as using data stored directly in Redshift, but it can be improved by using columnar storage formats (e.g. ORC and Parquet) and by partitioning files (see the sketch after the summary below). This, of course, is only good for querying data, not for performing transactions (updates) against the data.
The newer Amazon Redshift RA3 nodes also have the ability to offload less-used data to Amazon S3 and use caching to run fast queries. The benefit is that they separate storage from compute.
Quick summary:
If you need a database for an application, use Amazon RDS
If you are building a data warehouse, use Amazon Redshift
If you have a lot of historical data that is rarely queried, store it in Amazon S3 and query it via Amazon Athena or Amazon Redshift Spectrum
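As a sketch of the Spectrum setup mentioned above (the Glue database, IAM role, bucket and table names are all placeholders): create an external schema backed by the data catalog and a partitioned Parquet table over S3.

```python
import psycopg2

conn = psycopg2.connect("dbname=analytics host=my-cluster.xxxx.us-east-1.redshift.amazonaws.com port=5439 user=admin password=...")
conn.autocommit = True  # CREATE EXTERNAL TABLE cannot run inside a transaction
cur = conn.cursor()

cur.execute("""
    CREATE EXTERNAL SCHEMA IF NOT EXISTS spectrum
    FROM DATA CATALOG DATABASE 'analytics_lake'
    IAM_ROLE 'arn:aws:iam::123456789012:role/spectrum-role'
    CREATE EXTERNAL DATABASE IF NOT EXISTS
""")

cur.execute("""
    CREATE EXTERNAL TABLE spectrum.clicks (
        user_id    BIGINT,
        url        VARCHAR(2048),
        clicked_at TIMESTAMP
    )
    PARTITIONED BY (event_date DATE)   -- partition pruning limits the S3 data scanned
    STORED AS PARQUET                  -- columnar: only referenced columns are read
    LOCATION 's3://my-data-lake/clicks/'
""")
```

Each new partition then has to be registered (ALTER TABLE ... ADD PARTITION, or via the Glue catalog) before Spectrum will scan it.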
Looking at your question, you may benefit from professional help with your architecture.
However, to get you started, Redshift:
has its own data storage, with no link to S3.
via Amazon Redshift Spectrum, also allows you to query data held in S3 (similar to AWS Athena).
is not a good alternative as a back-end database to replace a traditional RDBMS, as transactions are very slow.
is a great data warehouse tool, so just use it for that!

DynamoDB to Redshift

I am asking this in the context of loading data from DynamoDB into Redshift. Per the Redshift docs:
To avoid consuming excessive amounts of provisioned read throughput, we recommend that you not load data from Amazon DynamoDB tables that are in production environments.
My data is in Production, so how do I get it out of there?
Alternatively, is DynamoDB Streams a better overall choice to move data from DynamoDB into Redshift? (I understand this does not add to my RCU cost.)
The warning is due to the fact that the export could consume much of your read capacity for a period of time, which would impact your production environment.
Some options:
Do it at night when you don't need as much capacity
Set READRATIO to a low value so that it consumes less of the capacity (see the example COPY after this list)
Temporarily increase the Read Capacity Units of the table when performing the export (you can decrease capacity four times a day)
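Here is what the READRATIO option from the list above looks like as a COPY, sketched via psycopg2 (the table names and IAM role are placeholders); READRATIO 25 caps the load at roughly 25% of the table's provisioned read throughput.

```python
import psycopg2

conn = psycopg2.connect("dbname=analytics host=my-cluster.xxxx.us-east-1.redshift.amazonaws.com port=5439 user=loader password=...")
with conn, conn.cursor() as cur:
    cur.execute("""
        COPY events
        FROM 'dynamodb://production-events'
        IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-dynamodb-read'
        READRATIO 25
    """)
```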
DynamoDB Streams provides a stream of data representing changes to a DynamoDB table. You would need to process these streams using AWS Lambda to send the data somewhere for loading into Redshift. For example, you could populate another DynamoDB table and use it for importing into Redshift. Or, you could write the data to Amazon S3 and import from there into Redshift. However, this involves lots of moving parts.
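A minimal sketch of that Lambda step (the staging bucket is a placeholder): take the stream records, serialize the new images as JSON lines, and drop them in S3 for a later COPY.

```python
import json
import uuid
import boto3

s3 = boto3.client("s3")
BUCKET = "my-redshift-staging"  # hypothetical staging bucket

def handler(event, context):
    lines = []
    for record in event["Records"]:
        if record["eventName"] in ("INSERT", "MODIFY"):
            # NewImage values are still in DynamoDB's typed form, e.g. {"S": "abc"};
            # a real pipeline would flatten them before loading into Redshift.
            lines.append(json.dumps(record["dynamodb"]["NewImage"]))
    if lines:
        key = f"events/{uuid.uuid4()}.json"
        s3.put_object(Bucket=BUCKET, Key=key, Body="\n".join(lines).encode("utf-8"))
```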
Using AWS Data Pipeline, you can do a bulk copy of data from DynamoDB to a new or existing Redshift table.

Performance issues with Redshift Spectrum

I am using Redshift Spectrum. I created an external table and uploaded a CSV data file to S3 with around 5.5 million records. If I fire a query on this external table, it takes ~15 seconds, whereas if I run the same query on Amazon Redshift, I get the same result in ~2 seconds. What could be the reason for this performance lag, when AWS claims it to be a very high performance platform? Please suggest a solution to get the same performance using Spectrum.
For performance optimization, first have a look at how your query is being executed.
Right now, you get the best performance if you don't have a single CSV file but multiple files. Typically, you can expect good performance if the number of files per query is at least about an order of magnitude larger than the number of nodes in your cluster.
In addition, if you use Parquet files you get the advantage of a columnar format on S3, rather than CSV, where the whole file has to be read from S3 - and it decreases your cost as well.
You can use a script to convert the data to Parquet:
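(The script linked in the original answer is not reproduced here. As an illustration only, a minimal pandas sketch that splits one large CSV into several Parquet files so Spectrum can parallelize across them; it assumes s3fs and pyarrow are installed, and the bucket paths are placeholders.)

```python
import pandas as pd  # s3:// paths need s3fs; Parquet output needs pyarrow

chunks = pd.read_csv("s3://my-bucket/raw/events.csv", chunksize=500_000)
for i, chunk in enumerate(chunks):
    chunk.to_parquet(f"s3://my-bucket/parquet/events_part{i:04d}.parquet", index=False)
```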
A reply from the AWS forum follows:
I understand that you have the same query running on Redshift and Redshift Spectrum. However, the results are different: while one runs in 2 seconds, the other runs in around 15 seconds.
First of all, we must agree that Redshift and Spectrum are different services designed for different purposes. Their internal structure varies a lot from each other; while Redshift relies on EBS storage, Spectrum works directly with S3. Redshift Spectrum's queries employ massive parallelism to execute very fast against large datasets. Much of the processing occurs in the Redshift Spectrum layer, and most of the data remains in Amazon S3.
Spectrum is also designed to deal with petabytes of structured and semi-structured data from files in Amazon S3 without having to load the data into Amazon Redshift tables, while Redshift offers you the ability to store data efficiently and in a highly optimized manner by means of distribution and sort keys.
AWS does not advertise Spectrum as a faster alternative to Redshift. We offer Amazon Redshift Spectrum as an add-on solution to provide access to data stored in Amazon S3 without having to load it into Redshift (similar to Amazon Athena).
In terms of query performance, unfortunately, we can't guarantee performance improvements, since the Redshift Spectrum layer produces query plans completely different from the ones produced by Redshift's database engine interpreter. This reason alone would be enough to discourage any query performance comparison between these services, as it is not fair to either of them.
Regarding your question about on-the-fly nodes: Spectrum adds them based on the demands of your queries, and Redshift Spectrum can potentially use thousands of instances to take advantage of massively parallel processing. There aren't any specific criteria that trigger this behavior; however, bear in mind that by following the best practices on how to improve query performance [1] and how to create data files for queries [2], you can potentially improve Spectrum's overall performance.
Lastly, I would like to point to some documentation that clarifies a bit more how to achieve better performance. Please see the references at the end!
I am somewhat late to answer this. As of Feb 2018, AWS supports Spectrum queries on files in columnar formats like Parquet, ORC, etc. In your case, you are storing the file as CSV. CSV is row-based, which results in pulling out the entire row for any field queried. I suggest you convert the files from CSV to Parquet format before querying. This will certainly result in much faster performance. Details from AWS: Amazon Redshift Spectrum
These results are to be expected. The whole reason for using Amazon Redshift is that it stores data in a highly-optimized manner to provide fast queries. Data stored outside of Redshift will not run anywhere near as fast.
The intention of Amazon Redshift Spectrum is to provide access to data stored in Amazon S3 without having to load it into Redshift (similar to Amazon Athena), but it makes no performance guarantees.