Alternatives for Athena to query the data on S3 - amazon-web-services

I have around 300 GB of data on S3. Let's say the data looks like this:
## S3://Bucket/Country/Month/Day/1.csv
S3://Countries/Germany/06/01/1.csv
S3://Countries/Germany/06/01/2.csv
S3://Countries/Germany/06/01/3.csv
S3://Countries/Germany/06/02/1.csv
S3://Countries/Germany/06/02/2.csv
We are doing some complex aggregation on the data. Because some countries have a lot of data and others very little, AWS EMR doesn't make sense: once the small countries are finished, their resources sit idle while the big countries keep running for a long time. Therefore, we decided to use AWS Batch (Docker containers) with Athena. One job works on one day of data per country.
Now there are roughly 1000 jobs which start together, and when they query Athena to read the data, the containers fail because they hit the Athena query limits.
Therefore, I would like to know what other ways there are to tackle this problem. Should I use a Redshift cluster, load all the data there, and have all the containers query the Redshift cluster, as it doesn't have these query limits? But it is expensive, and takes a lot of time to ramp up.
The other option would be to read the data on EMR and use Hive or Presto on top of it to query the data, but again it will reach the query limitation.
It would be great if someone could suggest better options to tackle this problem.

As I understand it, you simply send a query to the AWS Athena service and, after all aggregation steps finish, retrieve the resulting CSV file from the S3 bucket where Athena saves results, so you end up with 1000 files (one for each job). But the problem is the number of concurrent Athena queries, not the total execution time.
Have you considered using Apache Airflow for orchestrating and scheduling your queries? I see Airflow as an alternative to a combination of Lambda and Step Functions, but it is totally free. It is easy to set up on both local and remote machines, has a rich CLI and GUI for task monitoring, and abstracts away all scheduling and retrying logic. Airflow even has hooks to interact with AWS services. Hell, it even has a dedicated operator for sending queries to Athena, so sending queries is as easy as:
from airflow.models import DAG
from airflow.contrib.operators.aws_athena_operator import AWSAthenaOperator
from datetime import datetime

with DAG(dag_id='simple_athena_query',
         schedule_interval=None,
         start_date=datetime(2019, 5, 21)) as dag:

    run_query = AWSAthenaOperator(
        task_id='run_query',
        query='SELECT * FROM UNNEST(SEQUENCE(0, 100))',
        output_location='s3://my-bucket/my-path/',
        database='my_database'
    )
I use it for a similar type of daily/weekly task (processing data with CTAS statements) that would otherwise exceed the limit on the number of concurrent queries.
There are plenty of blog posts and documentation that can help you get started. For example:
Medium post: Automate executing AWS Athena queries and moving the results around S3 with Airflow.
Complete guide to installation of Airflow, link 1 and link 2
You can even set up an integration with Slack to send notifications when your queries terminate, whether in a success or a failure state.
However, the main drawback I am facing is that only 4-5 queries actually execute at the same time, while all the others just sit idle.

One solution would be to not launch all jobs at the same time, but pace them to stay within the concurrency limits. I don't know if this is easy or hard with the tools you're using, but it's never going to work out well if you throw all the queries at Athena at the same time. Edit: it looks like you should be able to throttle jobs in Batch, see AWS batch - how to limit number of concurrent jobs (by default Athena allows 25 concurrent queries, so try 20 concurrent jobs to have a safety margin – but also add retry logic to the code that launches the job).
Another option would be to not do it as separate queries, but try to bake everything together into fewer, or even a single query – either by grouping on country and date, or by generating all queries and gluing them together with UNION ALL. If this is possible or not is hard to say without knowing more about the data and the query, though. You'll likely have to post-process the result anyway, and if you just sort by something meaningful it wouldn't be very hard to split the result into the necessary pieces after the query has run.
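As a rough sketch of the UNION ALL idea: the function below glues one small aggregation per (country, month, day) into a single statement and sorts the output so it is easy to split afterwards. The table name `events`, the partition columns `country`, `month` and `day`, and the `COUNT(*)` placeholder aggregation are all assumptions for illustration, since the real schema and query aren't known.

```python
def build_union_query(table, partitions):
    """Combine one aggregation per (country, month, day) into a single query.

    COUNT(*) is only a placeholder for the real aggregation, and the values
    are assumed to come from a trusted source (they are interpolated as-is).
    """
    parts = []
    for country, month, day in partitions:
        parts.append(
            "SELECT '{c}' AS country, '{m}' AS month, '{d}' AS day, "
            "COUNT(*) AS row_count\n"
            "FROM {t}\n"
            "WHERE country = '{c}' AND month = '{m}' AND day = '{d}'".format(
                c=country, m=month, d=day, t=table
            )
        )
    # Sorting the combined result makes it easy to split per job afterwards.
    return "\nUNION ALL\n".join(parts) + "\nORDER BY country, month, day"


query = build_union_query(
    "events", [("Germany", "06", "01"), ("Germany", "06", "02")]
)
```

If the aggregation is the same for every country and day, a plain GROUP BY on the partition columns would be simpler than generating the pieces like this.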
Using Redshift is probably not the solution, since it sounds like you're doing this only once per day, and you wouldn't use the cluster very much. Athena is a much better choice, you just have to handle the limits better.
With my limited understanding of your use case I think using Lambda and Step Functions would be a better way to go than Batch. With Step Functions you'd have one function that starts N number of queries (where N is equal to your concurrency limit, 25 if you haven't asked for it to be raised), and then a poll loop (check the examples for how to do this) that checks queries that have completed, and starts new queries to keep the number of running queries at the max. When all queries are run a final function can trigger whatever workflow you need to run after everything is done (or you can run that after each query).
The benefit of Lambda and Step Functions is that you don't pay for idle resources. With Batch, you will pay for resources that do nothing but wait for Athena to complete. Since Athena, in contrast to Redshift for example, has an asynchronous API you can run a Lambda function for 100ms to start queries, then 100ms every few seconds (or minutes) to check if any have completed, and then another 100ms or so to finish up. It's almost guaranteed to be less than the Lambda free tier.
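The pacing loop described above (start up to N queries, poll, start new ones as slots free up) can be sketched in plain Python, independent of whether it ends up in Step Functions, Batch or a single script. This is only a minimal sketch: `start_query` and `get_state` are hypothetical callables you would supply yourself, for example thin wrappers around StartQueryExecution and GetQueryExecution, and the default of 20 concurrent queries is just the safety margin suggested earlier.

```python
import time
from collections import deque

TERMINAL_STATES = {"SUCCEEDED", "FAILED", "CANCELLED"}


def run_paced(queries, start_query, get_state,
              max_concurrent=20, poll_seconds=5.0, sleep=time.sleep):
    """Run all queries while keeping at most max_concurrent in flight.

    start_query(sql) must return a query id and get_state(query_id) must
    return the query state as a string; both are supplied by the caller.
    Returns a dict mapping each query id to its final state.
    """
    pending = deque(queries)
    running = set()
    finished = {}
    while pending or running:
        # Top up to the concurrency limit.
        while pending and len(running) < max_concurrent:
            running.add(start_query(pending.popleft()))
        # Collect the queries that have reached a terminal state.
        for query_id in list(running):
            state = get_state(query_id)
            if state in TERMINAL_STATES:
                running.remove(query_id)
                finished[query_id] = state
        # Wait before polling again if anything is still in flight.
        if running:
            sleep(poll_seconds)
    return finished
```

Failed queries are only recorded here, not retried; retry logic would go where the final state is collected.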

As far as I know, Redshift Spectrum and Athena cost the same. You should not compare Redshift to Athena; they serve different purposes. But first of all I would think about addressing your data skew issue. Since you mentioned AWS EMR, I assume you use Spark. To deal with large and small partitions you need to repartition your dataset by month, or by some other evenly distributed value. Or you can group by month and country. You get the idea.

You can use Redshift Spectrum for this purpose. Yes, it is a bit costly, but it is scalable and very good for performing complex aggregations.

Related

Use Case for Amazon Athena

We are building a web application to give customers insight into their activity, based on events currently streaming into ElasticSearch. A customer is an organisation sending messages to people.
A concern has been raised that a requirement to host this data for three years implies a very large amount of storage and a high cost of implementation given Elasticsearch.
An alternative is to process each day's data into a report CSV stored in S3 and use something like Amazon Athena to perform the queries. Is Athena something that our application can send ad-hoc queries to in response to a web browser request? It is unlikely to generate a large volume of requests all the time, but I'm uncertain what the latency could be like.
Yes, Athena would be a possible solution to this use case – and done right it could also be fairly cheap.
Athena is not a low latency query engine, but for reporting purposes it's usually good enough. There's no way to say for sure without knowing more, but done right we're talking low single digit seconds.
You can approach this in different ways. Either you do as you say and generate a CSV every day, store these for as long as you need, and run queries against them as needed. From your description it sounds like these CSVs would already be aggregates, and I assume they would be significantly less than a megabyte per customer per day. If you partition by customer and month you should be able to run queries for arbitrary time periods in seconds.
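To make "partition by customer and month" concrete, here is a small sketch of how the S3 key for each daily report could be built. The `reports` prefix and the `customer=`/`month=` naming are assumptions for illustration; the `name=value` form is the Hive-style layout that Athena can map to partition columns.

```python
from datetime import date


def report_key(customer_id, day, prefix="reports"):
    """Build the S3 key for one customer's daily report CSV.

    The prefix and the partition names are illustrative, not prescribed.
    """
    return "{prefix}/customer={customer}/month={month}/{day}.csv".format(
        prefix=prefix,
        customer=customer_id,
        month=day.strftime("%Y-%m"),
        day=day.strftime("%Y-%m-%d"),
    )


key = report_key("acme", date(2019, 5, 21))
# key == "reports/customer=acme/month=2019-05/2019-05-21.csv"
```

A query that filters on customer and month then only has to read the handful of small files under the matching prefixes.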
Another approach would be to store all your data on S3 and run queries on the full data set. As you stream data into ElasticSearch, stream it to S3 too. Depending on how you do that you probably need some ETL in the form of Lambda functions that partitions the data per customer and time (day or month depending on the volume). You can then run Athena queries on the full historical data set. The downside would be slower queries (double digit seconds for most queries, but I don't know your data volumes), but the upside would be full flexibility on what you can query.
With more details about the particulars of the use case I could help you with the details.
Athena is serverless. You can quickly query your data without having to set up and manage any servers or data warehouses. Just point to your data in Amazon S3, define the schema, and start querying using the built-in query editor.
Amazon Athena automatically executes queries in parallel, so most results come back within seconds/mins.

AWS Athena too slow for an api?

The plan was to get data from AWS Data Exchange, move it to an S3 bucket, then query it with AWS Athena for a data API. Everything works, it just feels a bit slow.
No matter the dataset or the query, I can't get below 2 seconds of response time in Athena, which is a lot for an API. I checked the best practices, but it seems that those are also above 2 seconds.
So my question:
Is 2 seconds the minimum response time for Athena?
If so then I have to switch to Postgres.
Athena is indeed not a low latency data store. You will very rarely see response times below one second, and often they will be considerably longer. In the general case Athena is not suitable as a backend for an API, but of course that depends on what kind of an API it is. If it's some kind of analytics service, perhaps users don't expect sub second response times? I have built APIs that use Athena that work really well, but those were services where response times in seconds were expected (and even considered fast), and I got help from the Athena team to tune our account to our workload.
To understand why Athena is "slow", we can dissect what happens when you submit a query to Athena:
1. Your code starts a query by using the StartQueryExecution API call
2. The Athena service receives the query, and puts it on a queue. If you're unlucky your query will sit in the queue for a while
3. When there is available capacity the Athena service takes your query from the queue and makes a query plan
4. The query plan requires loading table metadata from the Glue catalog, including the list of partitions, for all tables included in the query
5. Athena also lists all the locations on S3 it got from the tables and partitions to produce a full list of files that will be processed
6. The plan is then executed in parallel, and depending on its complexity, in multiple steps
7. The results of the parallel executions are combined and a result is serialized as CSV and written to S3
8. Meanwhile your code checks if the query has completed using the GetQueryExecution API call, until it gets a response that says that the execution has succeeded, failed, or been cancelled
9. If the execution succeeded your code uses the GetQueryResults API call to retrieve the first page of results
10. To respond to that API call, Athena reads the result CSV from S3, deserializes it, and serializes it as JSON for the API response
11. If there are more than 1000 rows the last steps will be repeated
A Presto expert could probably give more detail about steps 4-6, even though they are probably a bit modified in Athena's version of Presto. The details aren't very important for this discussion though.
If you run a query over a lot of data, tens of gigabytes or more, the total execution time will be dominated by step 6. If the result is also big, 7 will be a factor.
If your data set is small, and/or involves thousands of files on S3, then 4-5 will instead dominate.
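The client side of this flow (start the query, poll for completion, fetch the first page of results) looks roughly like the sketch below. It assumes `athena` is a boto3 Athena client, e.g. `boto3.client("athena")`; the function name and the 0.5 second poll interval are made up for illustration.

```python
import time


def run_query(athena, sql, database, output_location,
              poll_seconds=0.5, sleep=time.sleep):
    """Start a query, poll until it finishes, return the first result page.

    athena is assumed to be a boto3 Athena client.
    """
    started = athena.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_location},
    )
    query_id = started["QueryExecutionId"]

    # Poll until the execution reaches a terminal state.
    while True:
        execution = athena.get_query_execution(QueryExecutionId=query_id)
        state = execution["QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            break
        sleep(poll_seconds)

    if state != "SUCCEEDED":
        raise RuntimeError("Query %s ended in state %s" % (query_id, state))

    # First page only; further pages need the NextToken from the response.
    return athena.get_query_results(QueryExecutionId=query_id)
```

Every pass through the loop costs at least one round trip plus the sleep, which is part of why the total time has a floor no matter how fast the query itself is.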
Here are some reasons why Athena queries can never be fast, even if they wouldn't touch S3 (for example SELECT NOW()):
There will be at least three API calls before you get the response: a StartQueryExecution, a GetQueryExecution, and a GetQueryResults. Their round trip times (RTT) alone would add up to more than 100ms.
You will most likely have to call GetQueryExecution multiple times, and the delay between calls puts a bound on how quickly you can discover that the query has succeeded, e.g. if you call it every 100ms you will on average add half of 100ms + RTT to the total time because on average you'll miss the actual completion time by this much.
Athena writes the results to S3 before it marks the execution as succeeded, and since it produces a single CSV file this is not done in parallel. A big response takes time to write.
The GetQueryResults must read the CSV from S3, parse it and serialize it as JSON. Subsequent pages must skip ahead in the CSV, and may be even slower.
Athena is a multi tenant service, all customers are competing for resources, and your queries will get queued when there aren't enough resources available.
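As a back-of-the-envelope model of the first two points, the client-side overhead can be estimated from the round trip time and the polling interval. This is only one reading of the numbers above, under the assumption that successive polls are spaced one interval plus one round trip apart; it ignores queueing, execution and result writing entirely.

```python
def estimated_overhead_ms(rtt_ms, poll_interval_ms):
    """Rough floor on client-side overhead for one query, in milliseconds."""
    # StartQueryExecution, one GetQueryExecution, GetQueryResults.
    api_calls = 3 * rtt_ms
    # Successive polls are roughly poll_interval + RTT apart, and on average
    # the query finishes halfway between two polls.
    polling_miss = (poll_interval_ms + rtt_ms) / 2
    return api_calls + polling_miss
```

With a 40ms round trip and 100ms polling this gives 190ms before Athena has done any actual work.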
If you want to know what affects the performance of your queries you can use the ListQueryExecutions API call to list recent query execution IDs (I think you can go back 90 days at the most), and then use GetQueryExecution to get query statistics (see the documentation for QueryExecution.Statistics for what each property means). With this information you can figure out if your slow queries are because of queueing, execution, or the overhead of making the API calls (if it's not the first two, it's likely the last).
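For example, given the `Statistics` dict from a GetQueryExecution response, a small helper can tell you which phase took longest. The helper name is made up, and it only looks at three of the timing properties; check the QueryExecution.Statistics documentation for the full list and exact meanings.

```python
def dominant_phase(statistics):
    """Return (phase name, milliseconds) for the longest timed phase.

    statistics is the QueryExecution.Statistics dict; missing properties
    are treated as zero.
    """
    phases = {
        "queue": statistics.get("QueryQueueTimeInMillis", 0),
        "engine": statistics.get("EngineExecutionTimeInMillis", 0),
        "service": statistics.get("ServiceProcessingTimeInMillis", 0),
    }
    return max(phases.items(), key=lambda item: item[1])


phase, millis = dominant_phase({
    "QueryQueueTimeInMillis": 1500,
    "EngineExecutionTimeInMillis": 400,
    "ServiceProcessingTimeInMillis": 50,
})
# phase == "queue", millis == 1500
```

If none of these add up to the latency you observe from your own code, the difference is most likely the API call overhead.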
There are some things you can do to cut some of the delays, but these tips are unlikely to get you down to sub second latencies:
If you query a lot of data use file formats that are optimized for that kind of thing, Parquet is almost always the answer – and also make sure your file sizes are optimal, around 100 MB.
Avoid lots of files, and avoid deep hierarchies. Ideally have just one or a few files per partition, and don't organize files in "subdirectories" (S3 prefixes with slashes) except for those corresponding to partitions.
Avoid running queries at the top of the hour, this is when everyone else's scheduled jobs run, there's significant contention for resources the first minutes of every hour.
Skip GetQueryResults, download the CSV from S3 directly. The GetQueryResults call is convenient if you want to know the data types of the columns, but if you already know, or don't care, reading the data directly can save you some precious tens of milliseconds. If you need the column data types you can get the ….csv.metadata file that is written alongside the result CSV, it's undocumented Protobuf data, see here and here for more information.
Ask the Athena service team to tune your account. This might not be something you can get without higher tiers of support, I don't really know the politics of this and you need to start by talking to your account manager.
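A minimal sketch of the direct-download tip: once GetQueryExecution reports success, its response contains the result location (`QueryExecution.ResultConfiguration.OutputLocation`), and the CSV can be fetched from there with a plain S3 read. The function name is made up, and `s3` is assumed to be a boto3 S3 client.

```python
import csv
import io
from urllib.parse import urlparse


def read_result_csv(s3, output_location):
    """Download and parse an Athena result CSV straight from S3.

    s3 is assumed to be a boto3 S3 client and output_location an
    s3://bucket/key URI taken from the GetQueryExecution response.
    """
    parsed = urlparse(output_location)
    bucket = parsed.netloc
    key = parsed.path.lstrip("/")
    body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
    # The first row is the header; every value comes back as a string.
    return list(csv.reader(io.StringIO(body.decode("utf-8"))))
```

This reads the whole file into memory, which is fine for API-sized results but not for very large ones.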

AWS Athena - Query over large external table generated from Glue crawler?

I have a large set of history log files on AWS S3 that sum to billions of lines.
I used a Glue crawler with a grok deserializer to generate an external table on Athena, but querying it has proven to be unfeasible.
My queries have timed out and I am trying to find another way of handling this data.
From what I understand, through Athena, external tables are not actual database tables, but rather, representations of the data in the files, and queries are run over the files themselves, not the database tables.
How can I turn this large dataset into a query friendly structure?
Edit 1: For clarification, I am not interested in reshaping the log files from here on; those are taken care of. Rather, I want a way to work with the current file base I have on S3. I need to query these old logs, and in their current state that's impossible.
I am looking for a way to either convert these files into an optimal format or to take advantage of the current external table to make my queries.
Right now, by default of the crawler, the external tables are only partitioned by day and instance. My grok pattern explodes the formatted logs into a couple more columns that I would love to repartition on, if possible, which I believe would make my queries easier to run.
Your WHERE clause should include at least one condition on the partition columns. By sending a support ticket, you may be able to increase the Athena timeout. Alternatively, you may use Redshift Spectrum.
But you should seriously think about optimizing the query. The Athena query timeout is 30 minutes, which means your query ran for 30 minutes before it timed out.
By default Athena times out after 30 minutes. This timeout can be increased by raising a support ticket with the AWS team. However, you should first optimize your data and query, as 30 minutes is plenty of time for executing most queries.
Here are a few tips to optimize the data that will give a major boost to Athena performance:
Use columnar formats like ORC or Parquet with compression to store your data.
Partition your data. In your case you can partition your logs based on year -> month -> day.
Create fewer, larger files per partition instead of many small ones.
The following AWS article gives detailed information on performance tuning in Amazon Athena:
Top 10 performance tuning tips for amazon-athena
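One way to apply these tips to an existing external table is a CTAS (CREATE TABLE AS SELECT) statement that rewrites the data as partitioned Parquet. The helper below only builds the SQL string; the table names, column names and S3 location in the example are placeholders, not taken from the question.

```python
def build_ctas(source_table, target_table, external_location,
               columns, partition_columns):
    """Build an Athena CTAS statement that writes partitioned Parquet.

    Partition columns are placed last in the SELECT list, as CTAS expects.
    """
    select_list = ", ".join(list(columns) + list(partition_columns))
    partitions = ", ".join("'%s'" % name for name in partition_columns)
    return (
        "CREATE TABLE {target}\n"
        "WITH (\n"
        "  format = 'PARQUET',\n"
        "  external_location = '{location}',\n"
        "  partitioned_by = ARRAY[{partitions}]\n"
        ") AS\n"
        "SELECT {select_list}\n"
        "FROM {source}"
    ).format(
        target=target_table,
        location=external_location,
        partitions=partitions,
        select_list=select_list,
        source=source_table,
    )


sql = build_ctas(
    "logs_raw", "logs_parquet", "s3://my-bucket/logs-parquet/",
    ["message", "level"], ["year", "month", "day"],
)
```

Keep in mind that a single CTAS query can write at most 100 partitions, so a large history would probably have to be converted in several runs, each with a WHERE condition on a slice of the existing partitions.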

is 100-200 upserts and inserts in a 10 second window into a 3 node redshift cluster a realistic architecture?

With a 3-node Redshift cluster we plan on doing 50-100 inserts every 10 seconds. Within that 10 second window we will also try to do the equivalent of a Redshift upsert, as documented here https://docs.aws.amazon.com/redshift/latest/dg/c_best-practices-upsert.html, on about 50 to 100 rows as well.
I basically don't know whether a 10 second window is realistic, or whether a 10 minute window, etc. would be better for this kind of load. Should this be a daily batch? Should I try to re-architect to get rid of upserts?
My question is essentially: can Redshift handle this load? I feel the upsert is happening too many times. We are using structured streaming in Spark to handle all of this. If yes, what type of nodes should we be using? Does anyone who has done this have a ballpark estimate? If no, what are alternative architectures?
Essentially what we're trying to do is load entity data to be joined with the events in Redshift. But we want the analytics to be as near real time as possible, so we want to load as fast as we can.
There's probably no exact answer for this, so any explanation that can help me estimate requirements based on load will be helpful.
I do not think you will achieve the performance you seek.
Running large numbers of INSERT statements is not an optimal way to load data into Amazon Redshift.
The best way is via running COPY from data stored in Amazon S3. This loads data in parallel across all nodes.
Unless you have a very real need to get data immediately into Redshift, it would be better to batch the data in S3 over a period of time (the larger the batch, the better), then load via COPY. This will also work well with the Staging Table approach to performing UPSERTS.
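The staging table approach boils down to a handful of SQL statements run in order: load the batch into a temporary table with COPY, then replace the matching rows inside a transaction. The sketch below just builds those statements; the table names, key column, S3 path and IAM role are placeholders, and CSV is assumed as the file format.

```python
def build_upsert_statements(target, staging, key_column, s3_path, iam_role):
    """Return the SQL statements for a staging-table upsert, in order."""
    templates = [
        # Temporary staging table with the same columns as the target.
        "CREATE TEMP TABLE {staging} (LIKE {target})",
        # Bulk load the batch from S3, in parallel across the nodes.
        "COPY {staging} FROM '{s3_path}' IAM_ROLE '{iam_role}' FORMAT AS CSV",
        "BEGIN",
        # Remove the rows that are about to be replaced, then add the new ones.
        "DELETE FROM {target} USING {staging} "
        "WHERE {target}.{key} = {staging}.{key}",
        "INSERT INTO {target} SELECT * FROM {staging}",
        "COMMIT",
        "DROP TABLE {staging}",
    ]
    return [
        template.format(
            target=target, staging=staging, key=key_column,
            s3_path=s3_path, iam_role=iam_role,
        )
        for template in templates
    ]
```

The statements would be executed one by one over a single connection, since the temporary table only exists for that session.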
The best way to discover whether Redshift will handle a particular load is to try it! Spin up another cluster and try the various methods, measuring the performance each time.
I would recommend using Kinesis Firehose to insert data into Redshift. It will optimize for time / load and insert accordingly.
We tried inserting manually in batches; it does not seem to be the cleanest way of handling it when an optimized cloud service exists for the same purpose.
https://docs.aws.amazon.com/ses/latest/DeveloperGuide/event-publishing-redshift-firehose-stream.html
It collects records in batches, compresses them, and loads them into Redshift.
Upsert Process:
If you want an upsert, this is how I would do it in a scalable way:
DynamoDB Table (Update) --> DynamoDB Streams --> Lambda --> Firehose --> Redshift
Have a scheduled job to clean up any duplicate records based on created_timestamp.
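The Lambda step in that pipeline could look something like the sketch below: it takes the new image of each inserted or modified item from the DynamoDB Streams event and forwards it to Firehose as a JSON line. It assumes `firehose` is a boto3 Firehose client, that the stream is configured to include new images, and that all attributes are scalars; the function names are made up.

```python
import json

MAX_BATCH = 500  # PutRecordBatch accepts at most 500 records per call


def to_plain(attribute):
    """Unwrap a scalar DynamoDB attribute such as {"S": "x"} or {"N": "1"}.

    Numbers stay strings; nested types (maps, lists, sets) are not handled.
    """
    return next(iter(attribute.values()))


def forward_to_firehose(event, firehose, stream_name):
    """Send the new image of each inserted/modified item to Firehose."""
    records = []
    for record in event.get("Records", []):
        if record.get("eventName") not in ("INSERT", "MODIFY"):
            continue  # REMOVE events carry no new image
        image = record["dynamodb"]["NewImage"]
        row = {name: to_plain(value) for name, value in image.items()}
        records.append({"Data": (json.dumps(row) + "\n").encode("utf-8")})
    for start in range(0, len(records), MAX_BATCH):
        # A real handler should also check FailedPutCount in the response.
        firehose.put_record_batch(
            DeliveryStreamName=stream_name,
            Records=records[start:start + MAX_BATCH],
        )
    return len(records)
```

Because every update arrives as a new row, the table ends up with several versions of the same item, which is what the scheduled cleanup job is for.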
Hope it helps.

Is AWS Glue + Athena/Hive right choice to replace complex SQL queries?

I have been using AWS Athena to query analytics data stored on S3 across several tables. Over a period of time I have come up with 2-3 complex SQL queries (involving several joins) for pulling relevant data. Since Athena is meant for ad-hoc queries (and not predefined queries), and given the prohibitive costs of processing several TB and the 30 minute timeout, I am looking for alternatives.
Two alternatives that I can think of are:
Use a Presto-based EMR cluster and run the existing queries. It removes the 30 minute limit and might reduce costs ($5/TB). However, the downside is reprocessing the same data on successive runs.
Do ETL (such as through AWS Glue) and denormalise the data. This should reduce repeated joins, as only incremental data is processed. Subsequently query the flattened data with some SQL interface - Athena/Hive. However, I am not sure if denormalisation is a good idea, besides the cost of storing redundant (huge) data.
Which of these is a better choice or is there a better standard technique for this issue?
I think it's best to do 2 (denormalization) and then 1 (run Presto over the optimized data layout).
Also, Presto with Cost-Based Optimizer might be worth a look: https://www.starburstdata.com/technical-blog/starburst-presto-on-aws-18x-faster-than-emr/
Denormalization of the data depends on your use case, but it is mostly preferred for S3/HDFS structures. You can follow this link for better storage layout and performance in Athena:
https://aws.amazon.com/blogs/big-data/top-10-performance-tuning-tips-for-amazon-athena/