When comparing 1-node, 2-node, and 3-node clusters, how do I find out which cluster performs best?
In your view, which cluster configuration gives the best performance?
Also, how do I copy a CSV file from S3 into Redshift, and can you point out where I have made a mistake?
The best way to determine the setup that provides the best performance is to perform your own tests. For example, try loading data and running queries with different numbers and sizes of nodes. This way, you will have results that are accurate for your own particular data and the way it is stored in Amazon Redshift.
When loading data into Redshift, the load will run better when you have more nodes because it is performed in parallel across all nodes.
In general, more nodes will always give better performance because each node adds additional storage and compute. However, you will have to trade off performance against cost.
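For the CSV-from-S3 part of the question, here is a minimal COPY sketch; the table name, bucket path, and IAM role are placeholders, so adjust them to your own setup:

-- Load a CSV file from S3 into an existing table (names are hypothetical)
COPY my_schema.my_table
FROM 's3://my-bucket/data/file.csv'
IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole'  -- placeholder role ARN
CSV
IGNOREHEADER 1        -- skip the header row
REGION 'us-east-1';   -- region of the S3 bucket, if different from the cluster

If the load fails, the rows recorded in the STL_LOAD_ERRORS system table usually show where the mistake is.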
According to the AWS documentation for UNLOAD, the number of files written to S3 is the "number of slices in the cluster". Our cluster has 24 nodes.
What counts as a slice? Why are we getting 64 files on S3 (see screenshot) and not 24?
Most files are around 37 MB, but some are only ~300 B and contain nothing but the column headers. Why are these files created?
You can think of a slice as a partition on each node. By default, each node has at least two slices, though the exact number varies by node size.
A compute node is partitioned into slices. Each slice is allocated a portion of the node's memory and disk space, where it processes a portion of the workload assigned to the node. The leader node manages distributing data to the slices and apportions the workload for any queries or other database operations to the slices. The slices then work in parallel to complete the operation.
The files will vary in size depending on how the data is distributed across nodes (this is decided by your distribution keys). If some files contain very little data, it means the result set of your query pulls minimal data from one slice while pulling more from another.
If you have never configured the distribution style in a table's schema, it defaults to EVEN.
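If you want to see how many slices your cluster actually has (and therefore how many files UNLOAD will typically write), you can query the STV_SLICES system table; a small sketch, assuming you have access to the system tables:

-- Count slices per node; the total is the usual number of UNLOAD output files
SELECT node, COUNT(*) AS slices
FROM stv_slices
GROUP BY node
ORDER BY node;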
The links below go into further detail on these subjects:
Data warehouse system architecture
Amazon Redshift clusters - Node type details
Distribution styles
On a 3-node Redshift cluster we plan on doing 50-100 inserts every 10 seconds. Within that 10-second window we will also try to do the equivalent of a Redshift upsert, as documented here https://docs.aws.amazon.com/redshift/latest/dg/c_best-practices-upsert.html, on about 50 to 100 rows as well.
I'm basically unsure whether a 10-second window is realistic, or whether a 10-minute window (or something else) is better for this kind of load. Should this be a daily batch? Should I try to re-architect to get rid of the upserts?
My question is essentially: can Redshift handle this load? I feel the upsert is happening too many times. We are using Structured Streaming in Spark to handle all of this. If yes, what type of nodes should we be using? Does anyone who has done this have a ballpark estimate? If not, what are alternative architectures?
Essentially what we're trying to do is load entity data to be joined with the events in Redshift. But we want the analytics to be as near real time as possible, so we want to load as fast as we can.
There's probably no exact answer for this, so any explanation that can help me estimate requirements based on load would be helpful.
I do not think you will achieve the performance you seek.
Running large numbers of INSERT statements is not an optimal way to load data into Amazon Redshift.
The best way is to run COPY from data stored in Amazon S3. This loads data in parallel across all nodes.
Unless you have a very real need to get data immediately into Redshift, it would be better to batch the data in S3 over a period of time (the larger the batch, the better), then load via COPY. This will also work well with the Staging Table approach to performing UPSERTS.
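As a rough illustration of that staging-table pattern, here is a hedged sketch; target_table, the id column, the S3 prefix, and the IAM role are all placeholders:

BEGIN;

-- Load the new batch from S3 into a temporary staging table
CREATE TEMP TABLE stage (LIKE target_table);
COPY stage
FROM 's3://my-bucket/batch/'
IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole'
CSV IGNOREHEADER 1;

-- Delete the rows being replaced, then insert the new versions
DELETE FROM target_table
USING stage
WHERE target_table.id = stage.id;

INSERT INTO target_table
SELECT * FROM stage;

DROP TABLE stage;
END;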
The best way to discover whether Redshift will handle a particular load is to try it! Spin up another cluster and try the various methods, measuring the performance each time.
I would recommend using Kinesis Firehose to insert data to Redshift. It will optimize for time / load and insert accordingly.
We tried inserting manually in batches, but that does not seem to be the cleanest way of handling it when an optimized cloud service exists for the same purpose.
https://docs.aws.amazon.com/ses/latest/DeveloperGuide/event-publishing-redshift-firehose-stream.html
Firehose collects records in batches, compresses them, and loads them into Redshift.
Upsert Process:
If you want an upsert, this is how I would do it in a scalable way:
DynamoDB Table (Update) --> DynamoDB Streams --> Lambda --> Firehose --> Redshift
Have a scheduled job to cleanup any duplicate records based on created_timestamp.
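A minimal sketch of such a cleanup job, assuming a hypothetical business key called id alongside the created_timestamp column:

-- Snapshot the latest created_timestamp per business key
CREATE TEMP TABLE latest AS
SELECT id, MAX(created_timestamp) AS max_ts
FROM target_table
GROUP BY id;

-- Remove every older duplicate
DELETE FROM target_table
USING latest
WHERE target_table.id = latest.id
  AND target_table.created_timestamp < latest.max_ts;

DROP TABLE latest;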
Hope it helps.
I have been using AWS Athena to query analytics data stored on S3 across several tables. Over a period of time I have come up with 2-3 complex SQL queries (involving several joins) for pulling relevant data. Since Athena is meant for ad-hoc queries (not predefined queries), and given the prohibitive cost of processing several TB and the 30-minute timeout, I am looking for alternatives.
Two alternatives that I can think of are:
Use a Presto-based EMR cluster and run the existing queries. This removes the 30-minute limit and (might) reduce costs ($5/TB). However, the downside is reprocessing the same data on successive runs.
Do ETL (such as through AWS Glue) and denormalise data. This should reduce repeated joins, as only incremental data is processed. Subsequently query the flattened data with some SQL interface - Athena/Hive. However, I am not sure if denormalisation is a good idea, besides the cost of storing redundant (huge) data.
Which of these is a better choice or is there a better standard technique for this issue?
I think it's best to do 2 (denormalization) and then 1 (run Presto over the optimized data layout).
Also, Presto with Cost-Based Optimizer might be worth a look: https://www.starburstdata.com/technical-blog/starburst-presto-on-aws-18x-faster-than-emr/
Whether to denormalize the data depends on your use case, but it is generally preferred for S3/HDFS layouts. You can follow this link for better Athena storage and performance:
https://aws.amazon.com/blogs/big-data/top-10-performance-tuning-tips-for-amazon-athena/
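If you go the denormalization route, one option is an Athena CTAS that writes a flattened, partitioned Parquet copy of the joined data; the tables, columns, and output location below are hypothetical:

-- Write a denormalized, partitioned Parquet table with Athena CTAS
CREATE TABLE analytics_flat
WITH (
    format = 'PARQUET',
    external_location = 's3://my-bucket/flat/',
    partitioned_by = ARRAY['event_date']   -- partition column must come last in the SELECT
) AS
SELECT e.event_id,
       e.user_id,
       u.country,
       e.event_date
FROM events e
JOIN users u ON e.user_id = u.user_id;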
I was studying Amazon Redshift using the Sybex Official Study Guide, at page 173 there are a couple of phrases:
You can configure the distribution style of a table to give Amazon RS hints as to how the data should be partitioned to best meet your query patterns. When you run a query, the optimizer shifts the rows to the compute node as needed to perform any joins and aggregates.
That leads me to some questions:
1) What is the role of the "optimizer"? Is data rearranged across compute nodes to boost performance for each new query?
2) If 1) is true and a completely different new query is run, what happens to the old data on the compute nodes?
3) Can you explain the 3 distribution styles (EVEN, KEY, ALL) in more detail, particularly the KEY style?
Extra questions:
1) Does the leader node hold any records?
To clarify a few things:
The Distribution Key is not a hint -- data is actually distributed according to the key
When running a query, the data is not "shifted" -- rather, copies of data might be sent to other nodes so that data can be joined on a particular node, but the data does not then reside on the destination node
The optimizer doesn't actually "do" anything -- it just computes the process that the nodes will follow (Redshift apparently writes C programs that are sent to each node)
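If you want to see what the optimizer produces, you can run EXPLAIN on a query; the plan shows the join strategy and any data redistribution steps (such as DS_DIST_NONE or DS_BCAST_INNER). The tables and columns below are hypothetical:

-- Show the execution plan without running the query
EXPLAIN
SELECT c.region, SUM(o.amount) AS total
FROM orders o
JOIN customers c ON o.customer_id = c.customer_id
GROUP BY c.region;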
The only thing you really need to know about the Optimizer is:
Query optimizer
The Amazon Redshift query execution engine incorporates a query optimizer that is MPP-aware and also takes advantage of the columnar-oriented data storage. The Amazon Redshift query optimizer implements significant enhancements and extensions for processing complex analytic queries that often include multi-table joins, subqueries, and aggregation.
From Data Warehouse System Architecture:
Leader node
The leader node manages communications with client programs and all communication with compute nodes. It parses and develops execution plans to carry out database operations, in particular, the series of steps necessary to obtain results for complex queries. Based on the execution plan, the leader node compiles code, distributes the compiled code to the compute nodes, and assigns a portion of the data to each compute node.
The leader node distributes SQL statements to the compute nodes only when a query references tables that are stored on the compute nodes. All other queries run exclusively on the leader node. Amazon Redshift is designed to implement certain SQL functions only on the leader node. A query that uses any of these functions will return an error if it references tables that reside on the compute nodes.
The Leader Node does not contain any data (unless you launch a single-node cluster, in which case the same server is used as the Leader Node and Compute Node).
For information on Distribution Styles, refer to: Distribution Styles
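To make the three styles concrete, here is a hedged DDL sketch (table and column names are illustrative only):

-- KEY: rows with the same customer_id land on the same slice, which helps joins on that column
CREATE TABLE orders (
    order_id    BIGINT,
    customer_id BIGINT,
    amount      DECIMAL(12,2)
)
DISTSTYLE KEY
DISTKEY (customer_id);

-- ALL: a full copy of the table is stored on every node (useful for small dimension tables)
CREATE TABLE countries (
    country_code CHAR(2),
    country_name VARCHAR(100)
)
DISTSTYLE ALL;

-- EVEN: rows are spread round-robin across slices when there is no obvious join column
CREATE TABLE raw_events (
    event_id BIGINT,
    payload  VARCHAR(65535)
)
DISTSTYLE EVEN;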
If you really want to learn about Redshift, then read the Redshift Database Developer Guide. If you are merely studying for the Solutions Architect exam, the above links will be sufficient for the level of Redshift knowledge.
I'm using AWS Redshift as a back-end for Tableau Desktop. The AWS cluster is running with two dc1.large nodes, and the database table I'm analyzing is 30 GB (with Redshift compression enabled). I chose Redshift over a Tableau extract for performance reasons, but the Redshift live connection seems much slower than an extract. Any suggestions on where I should look?
To use Redshift as a backend for a BI platform like Tableau, there are four things you can do to address latency:
1) Concurrency: Redshift is not great at running multiple queries at the same time so before you start tuning the database, make sure your query is not waiting in line behind other queries. (If you are the only one on the cluster, this shouldn't be a problem.)
2) Table size: Whenever you can, use aggregate tables for better performance. Fewer rows to scan means less IO and faster turnaround!
3) Query complexity: Ideally, you want your BI tool to issue simple, fast performing queries. Make sure your source tables are fast, and that Tableau isn't being forced to do a bunch of joins. Also, if your query does need to join multiple tables, make sure any large tables have the same distribution key.
4) "Indexing": Technically, Redshift does not support true indexing, but you can get close to the same thing by using "interleaved" sort keys. Traditional compound sort keys won't help, but an interleaved sort key can allow you to quickly access rows from multiple vectors (date and customer_id, for instance) without having to scan the entire table.
Reality Check
After all of these things are optimized, you will often find that you still can't be as fast as a Tableau extract. Simply stated, a "fast" Tableau dashboard needs to return data to its user in under 5 seconds. If you have 7 visuals on your dashboard, and each of the underlying queries takes 800 milliseconds to return (which is super fast for a database query), then you are still only just barely reaching your target performance. Now, if just one of those queries takes 5 seconds or more, your dashboard is going to feel "slow" no matter what you do.
In Summary
Redshift can be tuned using the approach above, but it may or may not be worth the effort. The best applications for using a live Redshift query instead of Tableau Extracts are in cases where the data is physically too large to create an extract of, and when you require data at a level of granularity that makes pre-aggregation infeasible. One good strategy is to create your main dashboard using an extract so that exploration/discovery is as fast as possible, and then use direct (live) Redshift queries for your drill-through reports (for instance, when you want to see exactly which customers roll up into your totals).
A few pointers below:
1) Use VACUUM and ANALYZE once your ETL completes (see the sketch after this list).
2) Make sure you have created the table with a proper distribution key and sort key.
3) Aggregate the data if that is acceptable from the point of view of data granularity, requirements, etc.
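For point 1, the post-ETL step is just two statements; the table name is a placeholder:

-- Reclaim space, re-sort rows, and refresh planner statistics after the load
VACUUM my_schema.fact_sales;
ANALYZE my_schema.fact_sales;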
1) Remove cursors: Tableau accesses data from the Redshift leader node using a cursor, which works iteratively and therefore hurts performance.
2) Perform a manual ANALYZE on the table after running heavy load operations. https://docs.aws.amazon.com/redshift/latest/dg/r_ANALYZE.html
3) Check the distribution key to avoid data skew and improve performance (see the query below).
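For point 3, a quick way to check for skew is the SVV_TABLE_INFO view; a skew_rows value well above 1 suggests a poor distribution key (a minimal sketch, assuming access to the system views):

-- skew_rows compares the largest and smallest slice; values near 1 are healthy
SELECT "table", diststyle, skew_rows, unsorted
FROM svv_table_info
ORDER BY skew_rows DESC;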