Is AWS Glue + Athena/Hive right choice to replace complex SQL queries? - amazon-web-services

I have been using AWS Athena to query analytics data stored on S3 across several tables. Over time I have come up with 2-3 complex SQL queries (involving several joins) for pulling relevant data. Since Athena is meant for ad-hoc queries rather than predefined ones, and given the prohibitive cost of processing several TB and the 30-minute timeout, I am looking for alternatives.
Two alternatives that I can think of are:
Use a Presto-based EMR cluster and run the existing queries. This removes the 30-minute limit and might reduce costs compared with Athena's $5/TB scanned. However, the con is reprocessing the same data on successive runs.
Do ETL (for example through AWS Glue) and denormalise the data. This should avoid the repeated joins, as only incremental data is processed. Subsequently query the flattened data with some SQL interface such as Athena or Hive. However, I am not sure whether denormalisation is a good idea, besides the cost of storing the redundant (huge) data.
Which of these is the better choice, or is there a standard technique for this problem?

I think it's best to do 2 (denormalization) and then 1 (run Presto over the optimized data layout).
Also, Presto with Cost-Based Optimizer might be worth a look: https://www.starburstdata.com/technical-blog/starburst-presto-on-aws-18x-faster-than-emr/

Whether to denormalize the data depends on your use case, but it is generally preferred for S3/HDFS-style storage. You can follow this link for better Athena storage layout and performance:
https://aws.amazon.com/blogs/big-data/top-10-performance-tuning-tips-for-amazon-athena/
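For example, a one-off denormalisation can be done entirely inside Athena with a CTAS statement that writes the join result out as partitioned Parquet. Below is a minimal sketch using boto3; the database, table, column, and bucket names are all hypothetical placeholders rather than anything from the question.

# Hedged sketch: flatten the joined tables once with an Athena CTAS so later
# queries scan compressed, partitioned Parquet instead of re-joining raw data.
# All names (analytics DB, events/users tables, buckets) are made up.
import boto3

athena = boto3.client("athena", region_name="us-east-1")

ctas = """
CREATE TABLE analytics.events_flat
WITH (
    format = 'PARQUET',
    external_location = 's3://my-analytics-bucket/events_flat/',
    partitioned_by = ARRAY['event_date']
) AS
SELECT e.event_id,
       e.payload,
       u.country,
       u.segment,
       e.event_date          -- partition column must come last in the SELECT
FROM analytics.events e
JOIN analytics.users  u ON e.user_id = u.user_id
"""

athena.start_query_execution(
    QueryString=ctas,
    QueryExecutionContext={"Database": "analytics"},
    ResultConfiguration={"OutputLocation": "s3://my-analytics-bucket/athena-results/"},
)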

Related

Optimal Big Data solution for aggregating time-series data and storing results to DynamoDB

I am looking into different Big Data solutions and have not been able to find a clear answer or documentation on what might be the best approach and frameworks/services to use to address my Big Data use-case.
My Use-case:
I have a data producer that will be sending ~1-2 billion events to a Kinesis Data Firehose delivery stream daily.
This data needs to be stored in some data lake / data warehouse, aggregated, and then loaded into DynamoDB for our service to consume the aggregated data in its business logic.
The DynamoDB table needs to be updated hourly. (Hourly is not a hard requirement, but we would like DynamoDB to be updated as soon as possible, with daily updates as the longest acceptable interval.)
The event schema is similar to: customerId, deviceId, countryCode, timestamp
The aggregated schema is similar to: customerId, deviceId, countryCode (the aggregation is MAX(countryCode) per customerId/deviceId for each day over the last 29 days, and then the overall MAX(countryCode) across those 29 days).
Only the CustomerIds/deviceIds that had their countryCode change from the last aggregation (from an hour ago) should be written to DynamoDB to keep required write capacity units low.
The raw data stored in the data lake / data warehouse needs to be deleted after 30 days.
My proposed solution:
Kinesis Data Firehose delivers the data to a Redshift staging table (by default using S3 as intermediate storage and then using the COPY command to load to Redshift)
An hourly Glue job that:
Drops the 30 day old time-series table and creates a new time-series table for today in Redshift if this is the first job run of a new day
Loads data from staging table to the appropriate time-series table
Creates a view on top of the last 29 days of time-series tables
Aggregates by customerId, deviceId, date, and MAX(CountryCode)
Then aggregates by customerId, deviceId, MAX(countryCode) (a sketch of this two-step aggregation follows this list)
Writes the aggregated results to an S3 bucket
Checks the previous hourly Glue job run's aggregated results against the current run's aggregated results to find the customerIds/deviceIds whose countryCode changed
Writes the customerId/deviceId rows whose countryCode changed to DynamoDB
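For illustration, the two-step aggregation above might look roughly like the following SQL, built here as a Python string that the hourly job could submit; the view name and column spellings are assumptions, not the actual schema.

# Hedged sketch of the per-day MAX(countryCode) followed by the overall MAX
# over the 29-day window. "events_last_29_days" is a placeholder for the view
# over the time-series tables; column names are assumed from the question.
TWO_STEP_AGG = """
WITH daily AS (
    SELECT customerId,
           deviceId,
           CAST("timestamp" AS DATE) AS event_date,
           MAX(countryCode)          AS max_country_code
    FROM   events_last_29_days
    GROUP  BY customerId, deviceId, CAST("timestamp" AS DATE)
)
SELECT customerId,
       deviceId,
       MAX(max_country_code) AS countryCode
FROM   daily
GROUP  BY customerId, deviceId
"""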
My questions:
Is Redshift the best storage choice here? I was also considering using S3 as storage and directly querying data from S3 using a Glue job, though I like the idea of a fully-managed data warehouse.
Since our data has a fixed retention period of 30 days, the AWS documentation (https://docs.aws.amazon.com/redshift/latest/dg/c_best-practices-time-series-tables.html) suggests using time-series tables and running DROP TABLE on older data that needs to be deleted. Are there other approaches (outside of Redshift) that would make the data lifecycle management easier? Having the staging table, creating and loading into new time-series tables, dropping older time-series tables, and updating the view to include the new time-series table but not the dropped one could be error prone.
What would be an optimal way to find the rows (customerId/deviceId combinations) whose countryCode changed since the last aggregation? I was thinking the Glue job could create a table from the previous run's aggregated results file on S3 and another table from the current run's aggregated results file, then run some variation of a FULL OUTER JOIN to find the rows with different countryCodes. Is there a better approach here that I'm not aware of?
I am a newbie when it comes to Big Data and Big Data solutions so any and all input is appreciated!
tl;dr: Use Step Functions, not Glue. Use Redshift Spectrum with the data in S3. Otherwise your overall structure looks on track.
You are on the right track IMHO, but there are a few things that could be better. Redshift is great for sifting through tons of data and performing analytics on it. However, I'm not sure you want to COPY the data into Redshift if all you are doing is building aggregates to be loaded into DDB. Do you have other analytic workloads that would justify storing the data in Redshift? Are there heavy transforms being done between the staging table and the time-series event tables? If not, you may want to make the time-series tables external, reading directly from S3 using Redshift Spectrum. This could be a big win, as the initial data grouping and aggregating is done in the Spectrum layer in S3. This way the raw data doesn't have to be moved.
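To make the "external tables" idea concrete, a Spectrum setup could look roughly like the sketch below, submitted through the Redshift Data API. The IAM role ARN, cluster, schema, bucket, and column definitions are all placeholders, and the Parquet format is an assumption about how Firehose lands the data.

# Hedged sketch: expose the raw S3 events to Redshift as Spectrum external
# tables instead of COPYing them in. Every identifier/ARN here is made up,
# and STORED AS PARQUET assumes the delivery stream writes Parquet.
import boto3

rsd = boto3.client("redshift-data", region_name="us-east-1")

create_schema = """
CREATE EXTERNAL SCHEMA IF NOT EXISTS events_ext
FROM DATA CATALOG
DATABASE 'events_db'
IAM_ROLE 'arn:aws:iam::123456789012:role/my-spectrum-role'
CREATE EXTERNAL DATABASE IF NOT EXISTS
"""

create_table = """
CREATE EXTERNAL TABLE events_ext.raw_events (
    customerId  VARCHAR(64),
    deviceId    VARCHAR(64),
    countryCode VARCHAR(8),
    event_ts    TIMESTAMP
)
PARTITIONED BY (event_date DATE)
STORED AS PARQUET
LOCATION 's3://my-event-bucket/raw/'
"""

for sql in (create_schema, create_table):
    rsd.execute_statement(
        ClusterIdentifier="my-cluster",
        Database="dev",
        DbUser="awsuser",
        Sql=sql,
    )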
Next, I would advise not using Glue unless you have a need (a transform) that cannot easily be done elsewhere. I find Glue requires some expertise to get it to do what you want, and it sounds like you would just be using it as a data-movement orchestrator. If this impression is correct you will be better off with a Step Function or even Data Pipeline. (I've wasted way too much time trying to get Glue to do simple things. It's a powerful tool, but make sure you'll get value from the time you will spend on it.)
If you are only using Redshift to do these aggregations, and you go the Spectrum route above, you will want to get as small a cluster as you can get away with. Redshift can be pricey, and if you don't use its power, not cost effective. In this case you can run the cluster only as needed, but Redshift boot-up times are not fast and the smallest clusters are not expensive. So this is a possibility, but only in the right circumstances. Depending on how difficult the aggregation you are doing is, you might want to look at Athena. If you are just running a few aggregating queries per hour, this could be the most cost-effective approach.
Checking against the last hour's aggregations is just a matter of comparing the new aggregates against the old ones, which are in S3. This is easily done with Redshift Spectrum or Athena, as they can make files (or sets of files) the source for a table. Then it is just a matter of running the queries.
In my opinion Glue is an ETL tool that can do high-power transforms. It can do a lot of things, but it is not my first (or second) choice. It is touchy, requires a lot of configuration to do more than the basics, and requires expertise that many data groups don't have. If you are a Glue expert, knock yourself out; if not, I would avoid it.
As for data management, yes, you don't want to be deleting tons of rows from the beginning of tables in Redshift. It creates a lot of data-reorganization work. So storing your data in "month" tables and using a view is the right way to go in Redshift. Dropping tables doesn't create this housekeeping. That said, if you organize your data in S3 in "month" folders, then removing unneeded months of data can be as simple as deleting those folders.
As for finding changing country codes, this should be easy to do in SQL. Since you are comparing aggregate data to aggregate data, it shouldn't be expensive either. Again, Redshift Spectrum or Athena are tools that let you do this on S3 data.
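A sketch of what that comparison query could look like, assuming the previous and current hourly aggregates are exposed as two tables; agg_previous and agg_current are made-up names:

# Hedged sketch: NULL-safe hour-over-hour comparison via a FULL OUTER JOIN.
# The same SQL runs on Athena or Redshift Spectrum; table names are placeholders.
CHANGED_COUNTRY_CODES = """
SELECT COALESCE(cur.customerId, prev.customerId) AS customerId,
       COALESCE(cur.deviceId,   prev.deviceId)   AS deviceId,
       cur.countryCode                           AS new_country_code
FROM agg_current cur
FULL OUTER JOIN agg_previous prev
  ON  cur.customerId = prev.customerId
  AND cur.deviceId   = prev.deviceId
WHERE COALESCE(cur.countryCode, '') <> COALESCE(prev.countryCode, '')
"""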
As for being a big data newbie, not a worry, we all started there. The biggest difference from other areas is how important it is to move the data the fewest number of times. It sounds like you understand this when you ask "Is Redshift the best storage choice here?". You seem to be recognizing the importance of where the data resides with respect to the compute elements, which is on target. If you need the horsepower of Redshift and will be accessing the data over and over again, then Redshift is the best option: the data is moved once to the place where the analytics need to run. However, Redshift is an expensive storage solution; that's not what it is meant for. Redshift Spectrum is very interesting in that the initial aggregation of the data is done in S3 and much-reduced partial results are sent to Redshift for completion. S3 is a much cheaper storage solution, and if your workload can be pattern-matched to Spectrum's capabilities this can be a clear winner.
I want to be clear that you have only described one area where you need a solution, and I'm assuming that you don't have other needs for a Redshift cluster operating on the same data. That would change the optimization point.

Use Case for Amazon Athena

We are building a web application to give customers insight into their activity, based on events currently streaming into Elasticsearch. A customer is an organisation sending messages to people.
A concern has been raised that the requirement to host this data for three years implies a very large amount of storage and a high implementation cost given Elasticsearch.
An alternative is to process each day's data into a report CSV stored in S3 and use something like Amazon Athena to perform the queries. Is Athena something that our application can send ad-hoc queries to in response to a web browser request? It is unlikely to generate a large volume of requests all the time, but I'm uncertain what the latency could be like.
Yes, Athena would be a possible solution to this use case – and done right it could also be fairly cheap.
Athena is not a low latency query engine, but for reporting purposes it's usually good enough. There's no way to say for sure without knowing more, but done right we're talking low single digit seconds.
You can approach this in different ways. Either you do as you say: generate a CSV every day, store these for as long as you need, and run queries against them as needed. From your description it sounds like these CSVs would already be aggregates, and I assume they would be significantly less than a megabyte per customer per day. If you partition them by customer and month you should be able to run queries over arbitrary time periods in seconds.
Another approach would be to store all your data on S3 and run queries on the full data set. As you stream data into Elasticsearch, stream it to S3 too. Depending on how you do that, you probably need some ETL in the form of Lambda functions that partition the data by customer and time (day or month depending on the volume). You can then run Athena queries over the full historical data set. The downside would be slower queries (double-digit seconds for most queries, but I don't know your data volumes), but the upside would be full flexibility in what you can query.
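If it helps, here is a rough sketch of what issuing such a query from the web backend could look like with boto3. The database, table, and bucket names are invented for illustration, and real code should validate or whitelist the parameters instead of interpolating them.

# Hedged sketch of an on-demand Athena query from a web backend. Names are
# placeholders; parameters should be validated before being interpolated.
import time
import boto3

athena = boto3.client("athena", region_name="eu-west-1")

def run_report_query(customer_id: str, month: str):
    query_id = athena.start_query_execution(
        QueryString=f"""
            SELECT day, SUM(messages_sent) AS messages
            FROM reports.daily_activity
            WHERE customer_id = '{customer_id}' AND month = '{month}'
            GROUP BY day
            ORDER BY day
        """,
        QueryExecutionContext={"Database": "reports"},
        ResultConfiguration={"OutputLocation": "s3://my-report-bucket/athena-results/"},
    )["QueryExecutionId"]

    while True:  # simple poll; typically completes in low single-digit seconds
        state = athena.get_query_execution(QueryExecutionId=query_id)[
            "QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            break
        time.sleep(0.5)

    if state != "SUCCEEDED":
        raise RuntimeError(f"Athena query {query_id} finished in state {state}")
    return athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]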
With more details about the particulars of the use case I could help you with the details.
Athena is serverless. You can quickly query your data without having to set up and manage any servers or data warehouses. Just point to your data in Amazon S3, define the schema, and start querying using the built-in query editor.
Amazon Athena automatically executes queries in parallel, so most results come back within seconds/mins.

AWS Athena - Query over large external table generated from Glue crawler?

I have a large set of historical log files on AWS S3 that add up to billions of lines.
I used a Glue crawler with a grok deserializer to generate an external table in Athena, but querying it has proven to be infeasible.
My queries have timed out and I am trying to find another way of handling this data.
From what I understand, through Athena, external tables are not actual database tables, but rather, representations of the data in the files, and queries are run over the files themselves, not the database tables.
How can I turn this large dataset into a query friendly structure?
Edit 1: For clarification, I am not interested in reshaping the log files generated from here on; those are taken care of. Rather, I want a way to work with the current file base I have on S3. I need to query these old logs, and in their current state it's impossible.
I am looking for a way to either convert these files into an optimal format or to take advantage of the current external table to make my queries.
Right now, by default of the crawler, the external tables are only partitioned by day and instance; my grok pattern explodes the formatted logs into a couple more columns that I would love to repartition on, if possible, which I believe would make my queries easier to run.
Your WHERE condition should filter on partitions (at least one condition). You may be able to increase the Athena timeout by raising a support ticket. Alternatively, you could use Redshift Spectrum.
But you should seriously think about optimizing the query. The Athena query timeout is 30 minutes, which means your query ran for 30 minutes before it timed out.
By default Athena times out after 30 minutes. This timeout period can be increased by raising a support ticket with the AWS team. However, you should first optimize your data and query, as 30 minutes is enough time for executing most queries.
Here are a few tips to optimize the data that will give a major boost to Athena performance:
Use columnar formats like ORC/Parquet with compression to store your data.
Partition your data. In your case you can partition your logs by year -> month -> day (a one-off conversion is sketched after the link below).
Create fewer, larger files per partition instead of many small files.
The following AWS article gives detailed information on performance tuning in Amazon Athena:
Top 10 performance tuning tips for amazon-athena
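As a concrete (hypothetical) example of the first two tips, the crawler's table can be rewritten into partitioned Parquet with an Athena CTAS. Note that a single CTAS writes at most 100 partitions, so a long history usually has to be converted in chunks (for example, one INSERT INTO per month after the initial CTAS). All names and columns below are placeholders.

# Hedged sketch: one-off conversion of the raw (grok-parsed) log table into
# partitioned, columnar Parquet so later queries prune partitions instead of
# scanning everything. Database/table/column/bucket names are made up.
import boto3

athena = boto3.client("athena")

ctas = """
CREATE TABLE logs_db.history_logs_parquet
WITH (
    format = 'PARQUET',
    external_location = 's3://my-log-bucket/history_parquet/',
    partitioned_by = ARRAY['year', 'month', 'day']
) AS
SELECT message,
       level,
       instance,
       year, month, day      -- partition columns go last, in partition order
FROM logs_db.history_logs_raw
WHERE year = '2019' AND month = '01'   -- convert in chunks: <=100 partitions per CTAS
"""

athena.start_query_execution(
    QueryString=ctas,
    QueryExecutionContext={"Database": "logs_db"},
    ResultConfiguration={"OutputLocation": "s3://my-log-bucket/athena-results/"},
)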

Alternatives for Athena to query the data on S3

I have around 300 GB of data on S3. Let's say the data looks like:
## S3://Bucket/Country/Month/Day/1.csv
S3://Countries/Germany/06/01/1.csv
S3://Countries/Germany/06/01/2.csv
S3://Countries/Germany/06/01/3.csv
S3://Countries/Germany/06/02/1.csv
S3://Countries/Germany/06/02/2.csv
We are doing some complex aggregation on the data, and because some countries' data is big and some countries' data is small, AWS EMR doesn't make sense to use: once the small countries are finished, the resources are wasted while the big countries keep running for a long time. Therefore, we decided to use AWS Batch (Docker containers) with Athena. One job works on one day of data per country.
Now there are roughly 1000 jobs that start together, and when they query Athena to read the data, the containers fail because they hit the Athena query limits.
Therefore, I would like to know what other possible ways there are to tackle this problem. Should I use a Redshift cluster, load all the data there, and have all the containers query the Redshift cluster, since it doesn't have query limitations? But it is expensive and takes a lot of time to ramp up.
The other option would be to read the data on EMR and use Hive or Presto on top of it to query the data, but again we would hit the query limitation.
It would be great if someone can give better options to tackle this problem.
As I understand it, you simply send a query to the AWS Athena service, and after all aggregation steps finish you retrieve the resulting CSV file from the S3 bucket where Athena saves results, so you end up with 1000 files (one for each job). The problem is the number of concurrent Athena queries, not the total execution time.
Have you considered using Apache Airflow for orchestrating and scheduling your queries? I see Airflow as an alternative to a combination of Lambda and Step Functions, but it is totally free. It is easy to set up on both local and remote machines, has a rich CLI and GUI for task monitoring, and abstracts away all the scheduling and retry logic. Airflow even has hooks to interact with AWS services. Hell, it even has a dedicated operator for sending queries to Athena, so sending queries is as easy as:
from airflow.models import DAG
from airflow.contrib.operators.aws_athena_operator import AWSAthenaOperator
from datetime import datetime

with DAG(dag_id='simple_athena_query',
         schedule_interval=None,
         start_date=datetime(2019, 5, 21)) as dag:
    run_query = AWSAthenaOperator(
        task_id='run_query',
        query='SELECT * FROM UNNEST(SEQUENCE(0, 100))',
        output_location='s3://my-bucket/my-path/',
        database='my_database'
    )
I use it for similar types of daily/weekly tasks (processing data with CTAS statements) which exceed the limit on the number of concurrent queries.
There are plenty of blog posts and docs that can help you get started. For example:
Medium post: Automate executing AWS Athena queries and moving the results around S3 with Airflow.
Complete guide to installation of Airflow, link 1 and link 2
You can even set up an integration with Slack to send notifications when your queries terminate, either in a success or a failure state.
However, the main drawback I am facing is that only 4-5 queries actually get executed at the same time, whereas all the others just idle.
One solution would be to not launch all jobs at the same time, but pace them to stay within the concurrency limits. I don't know if this is easy or hard with the tools you're using, but it's never going to work out well if you throw all the queries at Athena at the same time. Edit: it looks like you should be able to throttle jobs in Batch, see AWS batch - how to limit number of concurrent jobs (by default Athena allows 25 concurrent queries, so try 20 concurrent jobs to have a safety margin – but also add retry logic to the code that launches the job).
Another option would be to not do it as separate queries, but try to bake everything together into fewer, or even a single query – either by grouping on country and date, or by generating all queries and gluing them together with UNION ALL. If this is possible or not is hard to say without knowing more about the data and the query, though. You'll likely have to post-process the result anyway, and if you just sort by something meaningful it wouldn't be very hard to split the result into the necessary pieces after the query has run.
Using Redshift is probably not the solution, since it sounds like you're doing this only once per day, and you wouldn't use the cluster very much. Athena is a much better choice, you just have to handle the limits better.
With my limited understanding of your use case I think using Lambda and Step Functions would be a better way to go than Batch. With Step Functions you'd have one function that starts N number of queries (where N is equal to your concurrency limit, 25 if you haven't asked for it to be raised), and then a poll loop (check the examples for how to do this) that checks queries that have completed, and starts new queries to keep the number of running queries at the max. When all queries are run a final function can trigger whatever workflow you need to run after everything is done (or you can run that after each query).
The benefit of Lambda and Step Functions is that you don't pay for idle resources. With Batch, you will pay for resources that do nothing but wait for Athena to complete. Since Athena, in contrast to Redshift for example, has an asynchronous API you can run a Lambda function for 100ms to start queries, then 100ms every few seconds (or minutes) to check if any have completed, and then another 100ms or so to finish up. It's almost guaranteed to be less than the Lambda free tier.
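Whichever orchestrator you pick, the pacing logic itself is small. Here is a hedged sketch (plain boto3, with a hypothetical list of queries) that keeps at most 20 Athena queries in flight and tops the pool up as they finish; the same loop maps onto a Step Functions poll state.

# Hedged sketch of pacing Athena queries below the concurrency limit.
# `queries` is a list of (label, sql) tuples; output_location is an S3 prefix.
import time
import boto3

athena = boto3.client("athena")
MAX_CONCURRENT = 20  # stay under the default limit of 25 for a safety margin

def run_paced(queries, output_location):
    pending = list(queries)
    running = {}  # query execution id -> label
    while pending or running:
        for query_id in list(running):  # harvest finished queries
            state = athena.get_query_execution(QueryExecutionId=query_id)[
                "QueryExecution"]["Status"]["State"]
            if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
                print(f"{running.pop(query_id)}: {state}")
        while pending and len(running) < MAX_CONCURRENT:  # top the pool up
            label, sql = pending.pop()
            query_id = athena.start_query_execution(
                QueryString=sql,
                ResultConfiguration={"OutputLocation": output_location},
            )["QueryExecutionId"]
            running[query_id] = label
        time.sleep(5)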
As far as I know, Redshift Spectrum and Athena cost the same. You should not compare Redshift to Athena; they have different purposes. But first of all I would think about addressing your data-skew issue. Since you mentioned AWS EMR I assume you use Spark. To deal with large and small partitions you need to repartition your dataset by month, or some other equally distributed value, or you can group by month and country. You get the idea.
You can use Redshift Spectrum for this purpose. Yes, it is a bit costly, but it is scalable and very good at performing complex aggregations.

AWS Redshift + Tableau Performance Booster

I'm using AWS Redshift as a back end for Tableau Desktop. The AWS cluster is running two dc1.large nodes, and the database table I'm analyzing is 30 GB (with Redshift compression enabled). I chose Redshift over a Tableau extract for performance reasons, but the Redshift live connection seems much slower than an extract. Any suggestions on where I should look?
To use Redshift as a backend for a BI platform like Tableau, there are four things you can do to address latency:
1) Concurrency: Redshift is not great at running multiple queries at the same time so before you start tuning the database, make sure your query is not waiting in line behind other queries. (If you are the only one on the cluster, this shouldn't be a problem.)
2) Table size: Whenever you can, use aggregate tables for better performance. Fewer rows to scan means less IO and faster turnaround!
3) Query complexity: Ideally, you want your BI tool to issue simple, fast performing queries. Make sure your source tables are fast, and that Tableau isn't being forced to do a bunch of joins. Also, if your query does need to join multiple tables, make sure any large tables have the same distribution key.
4) "Indexing": Technically, Redshift does not support true indexing, but you can get close to the same thing by using "interleaved" sort keys. Traditional compound sort keys won't help, but an interleaved sort key can allow you to quickly access rows from multiple vectors (date and customer_id, for instance) without having to scan the entire table.
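For illustration, a fact table set up along the lines of points 3) and 4) could be declared like this; every table and column name is a placeholder, not something from the question.

# Hedged sketch of a table DDL that distributes on the join key shared with a
# dimension table and uses an interleaved sort key on the common filter columns.
FACT_TABLE_DDL = """
CREATE TABLE analytics.fact_orders (
    order_id    BIGINT,
    customer_id BIGINT,
    order_date  DATE,
    amount      DECIMAL(12, 2)
)
DISTSTYLE KEY
DISTKEY (customer_id)                         -- match the dimension's dist key
INTERLEAVED SORTKEY (order_date, customer_id) -- fast filtering on either column
"""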
Reality Check
After all of these things are optimized, you will often find that you still can't be as fast as a Tableau extract. Simply stated, a "fast" Tableau dashboard needs to return data to its user in <5 seconds. If you have 7 visuals on your dashboard, and each of the underlying queries takes 800 milliseconds to return (which is super fast for a database query), then you are still only barely reaching your target performance. Now, if just one of those queries takes 5 seconds or more, your dashboard is going to feel "slow" no matter what you do.
In Summary
Redshift can be tuned using the approach above, but it may or may not be worth the effort. The best applications for using a live Redshift query instead of Tableau Extracts are in cases where the data is physically too large to create an extract of, and when you require data at a level of granularity that makes pre-aggregation infeasible. One good strategy is to create your main dashboard using an extract so that exploration/discovery is as fast as possible, and then use direct (live) Redshift queries for your drill-through reports (for instance, when you want to see exactly which customers roll up into your totals).
A few pointers:
1) Run VACUUM & ANALYZE once your ETL completes.
2) Make sure you have created the table with a proper dist key and sort key.
3) Aggregate, if that is acceptable from the point of view of data granularity, requirements, etc.
1. Remove cursors: Tableau accesses data from the Redshift leader node using a cursor, which works iteratively and thus hurts performance.
2. Perform a manual ANALYZE on the table after running heavy load operations. https://docs.aws.amazon.com/redshift/latest/dg/r_ANALYZE.html
3. Check the dist key distribution to avoid data skew and improve performance (a sketch of these checks follows).
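A hedged sketch of what points 2 and 3 could look like in practice, run through the Redshift Data API; the cluster, database, and table names are placeholders.

# Hedged sketch: refresh planner statistics and check distribution skew.
# skew_rows close to 1.0 means rows are spread evenly across slices; large
# values indicate a poorly chosen distribution key. All names are made up.
import boto3

rsd = boto3.client("redshift-data")

statements = [
    "ANALYZE analytics.fact_orders;",
    'SELECT "table", diststyle, skew_rows FROM svv_table_info ORDER BY skew_rows DESC;',
]

for sql in statements:  # VACUUM is typically scheduled separately after ETL loads
    rsd.execute_statement(
        ClusterIdentifier="my-cluster",
        Database="analytics",
        DbUser="awsuser",
        Sql=sql,
    )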