AWS Redshift Auto MV feature causing performance issues

We are seeing performance issues whenever the Auto MV feature creates or refreshes an automatic MV on a table that holds a huge volume of data (> 400 GB). This badly impacts the existing workload.
I am still investigating whether we can disable Auto MV for the specific table, or whether we have to disable it for the entire cluster.
We have tried every other option, but the issue is still there.
This table contains only 3 months of data.
Will there be any impact if we disable Auto MV? Is there any solution?

Since the discussion in the comments has grown, let me add an answer to the question with more detail.
When a query runs it consumes CPU cycles AND main memory AND disk bandwidth AND network bandwidth. Every query needs all these resources in varying amounts. Redshift WLM has the ability to contain (somewhat successfully) the amount of CPU and main memory a query uses but doesn’t have a good ability to put fences up for disk and network traffic. This means that a “greedy” query can easily impact other queries by tying up network and disk subsystems. Given the size of the table you mention I expect this is what you are seeing.
A materialized view is a way to pre-compute the results of a query and store the result for later use at reduced cost. These views need to be refreshed when the source data changes or the result will be out of date. AutoMV is based on Redshift detecting when an expensive query is being run multiple times without changing source data, and using this query to create an MV without user intervention. It then detects when a query could use the MV and rewrites the query on the fly, all without the user knowing what is happening. It then refreshes the MV automatically when database usage is “low”. All this sounds good in theory.
The concern is that if you have a really expensive, brown-out-the-cluster query that runs often, then Redshift will move it to an AutoMV. Now this painful query will run less often, but at unexpected times. This should be an improvement in overall cluster workload, but you have lost control of when this query will impact your database.
The root issue is that the underlying query is too expensive, possibly due to a poorly written query (in Redshift terms) or possibly due to data model inefficiencies. Either way, the “brown-out” events (likely) won’t stop until the query is addressed. [There are other reasons a query can be too expensive, but query and data model design are the most likely.]
Turning off AutoMV will just make Redshift rerun the query every time and not use the MV. The likely impact is increased load on the cluster, but you will regain control of when these queries run. Now, it is possible that Redshift is “generalizing” the query you issue so that the MV can be used in more cases, and it is this “generalized” query that is the issue. I doubt it, but if you can show this, submit a bug report to Amazon. So I expect things will get worse with AutoMV off.
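For completeness, if you do decide to turn it off while you investigate: as far as I know AutoMV is controlled cluster-wide by the auto_mv parameter in the cluster's parameter group (I'm not aware of a per-table switch). A minimal sketch with boto3, assuming a custom parameter group named my-redshift-params (hypothetical name):

# Sketch: turn off automated materialized views cluster-wide by setting the
# auto_mv parameter to false. "my-redshift-params" is a hypothetical custom
# parameter group name; the change may require a cluster reboot to take effect.
import boto3

redshift = boto3.client('redshift')
redshift.modify_cluster_parameter_group(
    ParameterGroupName='my-redshift-params',
    Parameters=[{
        'ParameterName': 'auto_mv',
        'ParameterValue': 'false',
    }],
)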
Generally Redshift doesn’t like queries that “generate” lots of new data during execution. These “fat in the middle” queries, when run against large tables like you describe, can easily lead to brown-out events. Inequality JOINs or poorly written JOINs are likely causes, so look for these. Often DISTINCT or GROUP BY is used to pare the results down by removing the many duplicates generated mid-query. Look for queries in your catalog tables that have large spill (svl_query_summary) or high network traffic (stl_dist).
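If you want a starting point for finding them, the system tables can be queried like any other (a sketch using psycopg2; the connection details are placeholders):

# Sketch: list queries with disk-based ("spilled") steps from svl_query_summary.
# The connection details below are placeholders.
import psycopg2

conn = psycopg2.connect(host='my-cluster.example.redshift.amazonaws.com',
                        port=5439, dbname='mydb', user='admin', password='...')
with conn.cursor() as cur:
    cur.execute("""
        SELECT query, COUNT(*) AS spilled_steps, SUM(workmem) AS total_workmem
        FROM svl_query_summary
        WHERE is_diskbased = 't'
        GROUP BY query
        ORDER BY total_workmem DESC
        LIMIT 20;
    """)
    for query_id, steps, workmem in cur.fetchall():
        print(query_id, steps, workmem)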
If you can identify the query (or class of queries) that is causing your issues, then this forum can likely provide help in addressing the performance issues you are experiencing.

Related

Alternatives to Athena for querying data on S3

I have around 300 GB of data on S3. Let's say the data looks like this:
## S3://Bucket/Country/Month/Day/1.csv
S3://Countries/Germany/06/01/1.csv
S3://Countries/Germany/06/01/2.csv
S3://Countries/Germany/06/01/3.csv
S3://Countries/Germany/06/02/1.csv
S3://Countries/Germany/06/02/2.csv
We are doing some complex aggregation on the data, and because some countries' data is big and some countries' data is small, AWS EMR doesn't make sense to use: once the small countries are finished, their resources are wasted while the big countries keep running for a long time. Therefore, we decided to use AWS Batch (Docker containers) with Athena. One job works on one day of data per country.
Now there are roughly 1000 jobs which start together, and when they query Athena to read the data, the containers fail because they hit Athena's concurrent query limits.
Therefore, I would like to know what other possible ways there are to tackle this problem. Should I use a Redshift cluster, load all the data there, and have all the containers query the Redshift cluster, since it doesn't have query limitations? But it is expensive, and takes a lot of time to warm up.
The other option would be to read the data on EMR and use Hive or Presto on top of it to query it, but again we will hit the query limitation.
It would be great if someone could suggest better options to tackle this problem.
As I understand it, you simply send queries to the AWS Athena service, and after all aggregation steps finish you retrieve the resulting CSV file from the S3 bucket where Athena saves results, so you end up with 1000 files (one for each job). The problem is the number of concurrent Athena queries, not the total execution time.
Have you considered using Apache Airflow for orchestrating and scheduling your queries? I see Airflow as an alternative to a combination of Lambda and Step Functions, but it is totally free. It is easy to set up on both local and remote machines, has a rich CLI and GUI for task monitoring, and abstracts away all the scheduling and retry logic. Airflow even has hooks to interact with AWS services. Hell, it even has a dedicated operator for sending queries to Athena, so sending queries is as easy as:
from airflow.models import DAG
from airflow.contrib.operators.aws_athena_operator import AWSAthenaOperator
from datetime import datetime

# A DAG with a single task that submits one query to Athena.
with DAG(dag_id='simple_athena_query',
         schedule_interval=None,
         start_date=datetime(2019, 5, 21)) as dag:
    run_query = AWSAthenaOperator(
        task_id='run_query',
        query='SELECT * FROM UNNEST(SEQUENCE(0, 100))',
        output_location='s3://my-bucket/my-path/',
        database='my_database'
    )
I use it for similar types of daily/weekly tasks (processing data with CTAS statements) that would otherwise exceed the limit on the number of concurrent queries.
There are plenty of blog posts and documentation pages that can help you get started. For example:
Medium post: Automate executing AWS Athena queries and moving the results around S3 with Airflow.
Complete guide to installation of Airflow, link 1 and link 2
You can even set up integration with Slack for sending notifications when your queries terminate, either in a success or a failure state.
However, the main drawback I am facing is that only 4-5 queries actually get executed at the same time, while all the others just idle.
One solution would be to not launch all jobs at the same time, but pace them to stay within the concurrency limits. I don't know if this is easy or hard with the tools you're using, but it's never going to work out well if you throw all the queries at Athena at the same time. Edit: it looks like you should be able to throttle jobs in Batch, see AWS batch - how to limit number of concurrent jobs (by default Athena allows 25 concurrent queries, so try 20 concurrent jobs to have a safety margin – but also add retry logic to the code that launches the job).
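That retry logic can be as simple as backing off when Athena throttles you (a sketch with boto3; the query and output location are placeholders):

# Sketch: start an Athena query, backing off and retrying when the concurrency
# limit is hit (TooManyRequestsException). The output location is a placeholder.
import time
import boto3
from botocore.exceptions import ClientError

athena = boto3.client('athena')

def start_with_retry(sql, max_attempts=8):
    for attempt in range(max_attempts):
        try:
            resp = athena.start_query_execution(
                QueryString=sql,
                ResultConfiguration={'OutputLocation': 's3://my-bucket/results/'})
            return resp['QueryExecutionId']
        except ClientError as err:
            if err.response['Error']['Code'] != 'TooManyRequestsException':
                raise
            time.sleep(2 ** attempt)  # exponential backoff before retrying
    raise RuntimeError('still throttled after %d attempts' % max_attempts)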
Another option would be to not do it as separate queries, but try to bake everything together into fewer, or even a single query – either by grouping on country and date, or by generating all queries and gluing them together with UNION ALL. If this is possible or not is hard to say without knowing more about the data and the query, though. You'll likely have to post-process the result anyway, and if you just sort by something meaningful it wouldn't be very hard to split the result into the necessary pieces after the query has run.
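If you try the UNION ALL route, generating the combined statement is mechanical (a sketch; the table, columns, and aggregate are placeholders for whatever your real query does):

# Sketch: glue per-country/day queries into one statement so a single Athena
# run replaces many. Table and column names are placeholders.
template = ("SELECT '{c}' AS country, DATE '{d}' AS day, count(*) AS n "
            "FROM my_table WHERE country = '{c}' AND day = DATE '{d}'")
jobs = [('Germany', '2019-06-01'), ('Germany', '2019-06-02')]  # placeholder list
sql = '\nUNION ALL\n'.join(template.format(c=c, d=d) for c, d in jobs)
print(sql)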
Using Redshift is probably not the solution, since it sounds like you're doing this only once per day, and you wouldn't use the cluster very much. Athena is a much better choice, you just have to handle the limits better.
With my limited understanding of your use case I think using Lambda and Step Functions would be a better way to go than Batch. With Step Functions you'd have one function that starts N number of queries (where N is equal to your concurrency limit, 25 if you haven't asked for it to be raised), and then a poll loop (check the examples for how to do this) that checks queries that have completed, and starts new queries to keep the number of running queries at the max. When all queries are run a final function can trigger whatever workflow you need to run after everything is done (or you can run that after each query).
The benefit of Lambda and Step Functions is that you don't pay for idle resources. With Batch, you will pay for resources that do nothing but wait for Athena to complete. Since Athena, in contrast to Redshift for example, has an asynchronous API you can run a Lambda function for 100ms to start queries, then 100ms every few seconds (or minutes) to check if any have completed, and then another 100ms or so to finish up. It's almost guaranteed to be less than the Lambda free tier.
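The poll loop itself is only a few lines with the async API (a sketch; in a real Step Functions setup the state would be passed between Lambda invocations rather than held in a local loop, and the queries and output location are placeholders):

# Sketch: keep up to LIMIT Athena queries in flight at once.
import time
import boto3

athena = boto3.client('athena')
LIMIT = 20
pending = ['SELECT 1', 'SELECT 2']  # placeholder per-country/day queries
running = []

while pending or running:
    # top up to the concurrency limit
    while pending and len(running) < LIMIT:
        resp = athena.start_query_execution(
            QueryString=pending.pop(),
            ResultConfiguration={'OutputLocation': 's3://my-bucket/results/'})
        running.append(resp['QueryExecutionId'])
    time.sleep(10)
    # drop finished queries from the running set
    still_running = []
    for qid in running:
        state = (athena.get_query_execution(QueryExecutionId=qid)
                 ['QueryExecution']['Status']['State'])
        if state in ('QUEUED', 'RUNNING'):
            still_running.append(qid)
        elif state != 'SUCCEEDED':
            print('query', qid, 'finished as', state)
    running = still_running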
As far as I know, Redshift Spectrum and Athena cost the same. You should not compare Redshift to Athena; they have different purposes. But first of all I would think about addressing your data skew issue. Since you mentioned AWS EMR, I assume you use Spark. To deal with large and small partitions, you need to repartition your dataset by month, or some other equally distributed value, or you can group by month and country. You get the idea.
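To make the repartitioning concrete, in PySpark it is roughly this (a sketch; the column names are assumed from the S3 layout above, not taken from the real dataset):

# Sketch: repartition by (month, country) so no single task owns one huge
# country. Column names are assumptions.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('skew-fix').getOrCreate()
df = spark.read.csv('s3://Countries/', header=True)
evened = df.repartition('month', 'country')
evened.groupBy('country', 'month').count().show()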
You can use Redshift Spectrum for this purpose. Yes, it is a bit costly, but it is scalable and very good for performing complex aggregations.

Is AWS Glue + Athena/Hive the right choice to replace complex SQL queries?

I have been using AWS Athena to query analytics data stored on S3 across several tables. Over a period of time I have come up with 2-3 complex SQL queries (involving several joins) for pulling relevant data. Since Athena is for ad-hoc queries (not predefined ones), and given the prohibitive cost of processing several TB and the 30-minute timeout, I am looking for alternatives.
Two alternatives that I can think of are:
Use a Presto-based EMR cluster and run the existing queries. It removes the 30-minute limit and (might) reduce costs ($5/TB). However, the con is reprocessing the same data on successive runs.
Do ETL (such as through AWS Glue) and denormalise the data. This should reduce repeated joins, as only incremental data is processed. Subsequently, query the flattened data with some SQL interface such as Athena or Hive. However, I am not sure if denormalisation is a good idea, besides the cost of storing the redundant (huge) data.
Which of these is a better choice or is there a better standard technique for this issue?
I think it's best to do 2 (denormalization) and then 1 (run Presto over the optimized data layout).
Also, Presto with Cost-Based Optimizer might be worth a look: https://www.starburstdata.com/technical-blog/starburst-presto-on-aws-18x-faster-than-emr/
Denormalization of the data depends on your use case, but it is mostly preferred for S3/HDFS structures. You can follow this link for better Athena storage and performance:
https://aws.amazon.com/blogs/big-data/top-10-performance-tuning-tips-for-amazon-athena/
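One of the biggest wins in that list is converting to partitioned Parquet, which Athena can do itself with CTAS (a sketch via boto3; the table names, locations, and partition column are placeholders):

# Sketch: one-off CTAS converting a raw table to partitioned Parquet.
# Table names, S3 locations, and the partition column are placeholders;
# CTAS requires partition columns to come last in the SELECT list.
import boto3

athena = boto3.client('athena')
athena.start_query_execution(
    QueryString="""
        CREATE TABLE analytics_parquet
        WITH (format = 'PARQUET',
              external_location = 's3://my-bucket/parquet/',
              partitioned_by = ARRAY['day'])
        AS SELECT user_id, event, event_date AS day FROM analytics_raw
    """,
    ResultConfiguration={'OutputLocation': 's3://my-bucket/athena-results/'})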

AWS Athena disable cache

I am in the process of comparing the performance of CSV and Parquet files in AWS Athena.
To ensure that I do not get a considerable reduction in the execution times of two consecutive runs of the same query, I would like to make sure that the cache is disabled.
Do we know if there is a solution for this?
Or whether Athena even has a cache enabled by default.
How Athena configures the Presto engine behind the scenes is totally out of our control. I have thoroughly tested AWS Athena, and from my findings it doesn't cache the data. I see that the same query executed consecutively takes a similar amount of time and scans a similar amount of data.
But Parquet should definitely give you better performance and less data scanned, for cost efficiency.
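If you want to verify the no-cache behaviour on your own queries, Athena reports per-execution statistics that make the comparison easy (a sketch; the execution ids are placeholders):

# Sketch: compare runtime and bytes scanned across two runs of the same query.
# If results were cached, the second run's DataScannedInBytes would be ~0.
# The query execution ids are placeholders.
import boto3

athena = boto3.client('athena')
for qid in ('first-run-id', 'second-run-id'):
    stats = (athena.get_query_execution(QueryExecutionId=qid)
             ['QueryExecution']['Statistics'])
    print(qid, stats['DataScannedInBytes'], stats['EngineExecutionTimeInMillis'])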

AWS Redshift + Tableau Performance Booster

I'm using AWS Redshift as a back-end to my Tableau Desktop. The AWS cluster is running with two dc1.large nodes, and the database table I'm analyzing is about 30 GB (with Redshift compression enabled). I chose Redshift over a Tableau extract for performance reasons, but the Redshift live connection seems much slower than an extract. Any suggestions on where I should look?
To use Redshift as a backend for a BI platform like Tableau, there are four things you can do to address latency:
1) Concurrency: Redshift is not great at running multiple queries at the same time so before you start tuning the database, make sure your query is not waiting in line behind other queries. (If you are the only one on the cluster, this shouldn't be a problem.)
2) Table size: Whenever you can, use aggregate tables for better performance. Fewer rows to scan means less IO and faster turnaround!
3) Query complexity: Ideally, you want your BI tool to issue simple, fast performing queries. Make sure your source tables are fast, and that Tableau isn't being forced to do a bunch of joins. Also, if your query does need to join multiple tables, make sure any large tables have the same distribution key.
4) "Indexing": Technically, Redshift does not support true indexing, but you can get close to the same thing by using "interleaved" sort keys. Traditional compound sort keys won't help, but an interleaved sort key can allow you to quickly access rows from multiple vectors (date and customer_id, for instance) without having to scan the entire table.
Reality Check
After all of these things are optimized, you will often find that you still can't be as fast as a Tableau extract. Simply stated, a "fast" Tableau dashboard needs to return data to its user in <5 seconds. If you have 7 visuals on your dashboard, and each of the underlying queries takes 800 milliseconds to return (which is super fast for a database query), then you are still just barely reaching your target performance. Now, if just one of those queries takes 5 seconds or more, your dashboard is going to feel "slow" no matter what you do.
In Summary
Redshift can be tuned using the approach above, but it may or may not be worth the effort. The best applications for using a live Redshift query instead of Tableau Extracts are in cases where the data is physically too large to create an extract of, and when you require data at a level of granularity that makes pre-aggregation infeasible. One good strategy is to create your main dashboard using an extract so that exploration/discovery is as fast as possible, and then use direct (live) Redshift queries for your drill-through reports (for instance, when you want to see exactly which customers roll up into your totals).
A few pointers:
1) Run VACUUM & ANALYZE once your ETL completes.
2) Make sure the table is created with a proper dist key and sort key.
3) Aggregate, if that is OK from the point of view of data granularity, requirements, etc.
1. Remove cursors: Tableau accesses data from the Redshift leader node using a cursor, and cursors work iteratively, which impacts performance.
2. Perform a manual ANALYZE on the table after running heavy load operations. https://docs.aws.amazon.com/redshift/latest/dg/r_ANALYZE.html
3. Check the dist key distribution to avoid data skew and improve performance.
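For the skew check in point 3, svv_table_info has it precomputed (a sketch; the connection details are placeholders):

# Sketch: surface distribution skew and unsorted fractions per table.
# A high skew_rows means some slices hold far more rows than others.
# The connection details are placeholders.
import psycopg2

conn = psycopg2.connect(host='my-cluster.example.redshift.amazonaws.com',
                        port=5439, dbname='mydb', user='admin', password='...')
with conn.cursor() as cur:
    cur.execute("""
        SELECT "table", diststyle, skew_rows, unsorted
        FROM svv_table_info
        ORDER BY skew_rows DESC NULLS LAST;
    """)
    for row in cur.fetchall():
        print(row)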

Suitable storage method for a huge amount of data

What kind of storage do you recommend for a very huge amount of data (≈ 50 million records per day)? Is this a proper situation for systems like Hadoop, or is an RDBMS still sufficient for this purpose?
With the amount of data you are describing, you might indeed be pushing into Big Data territory. Based on the details you provided, I would suggest loading the raw data into a Hadoop cluster, running map/reduce jobs to parse it, and loading it into date-based directories. You can then define an external Hive table, partitioned by date (daily? weekly?), mapped to the results of your map/reduce jobs.
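That external table definition is short; here issued through Spark SQL just for convenience (a sketch; the paths, columns, and storage format are illustrative):

# Sketch: a date-partitioned external Hive table over the job output
# directories. Paths, columns, and storage format are illustrative.
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()
spark.sql("""
    CREATE EXTERNAL TABLE IF NOT EXISTS events (
        user_id BIGINT,
        payload STRING
    )
    PARTITIONED BY (dt STRING)
    STORED AS PARQUET
    LOCATION 'hdfs:///data/events/'
""")
spark.sql("MSCK REPAIR TABLE events")  # register the date subdirectories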
The next step would depend on the complexity of your reports and the needed response time. If you can easily express them in SQL, you can just run queries against your Hive table. If they are more elaborate, you might have to write custom map/reduce jobs. Many suggest Pig for this, but I am personally more comfortable with straight Java.
If you don't care about the response time of the reports, you can run them on demand. If you care, but are open to waiting for the results for, say, tens of seconds or a few minutes, you can also store the report results in Hive. If you want your reports to show up fast, say, in a web-based or mobile UI, you might want to store the report data in a relational database.