Athena/Presto: Getting maximum partition value, at cheapest scan cost

Athena/Presto: Getting maximum partition value, at cheapest scan cost - amazon-athena

I am wanting to get the maximum value from a partition of my Athena table.
Given that the volume of scanned data is cost, am seeking a way to do this with minimum scan.
Admittedly, I have little data in there now but will grow over time once in production.
Does anyone know about what happens under the hood for these 2 approaches, how they differ, and which would be the most efficient?
Thanks
Method (1)
SELECT max(dt)
FROM mydb.mytable
-- Console Output:
-- Time in queue:0.166 sec Run time:3.153 sec Data scanned:-
Method (2)
SELECT max(dt)
FROM mydb."mytable$partitions"
-- Console Output:
-- Time in queue:0.223 sec Run time:1.347 sec Data scanned:0.02 KB

Very very very late answer, but this question helped me a lot so I look it up, maybe it can help others:
SHOW PARTITIONS lists the partitions in metadata.
If you want to execute a SHOW PARTITIONS on a query you use:
SELECT * FROM "table_name$partitions"
The second example you posted it's faster because it doesn't look into the filesystem (S3) but only into the metadata.
AWS Documentation: https://docs.aws.amazon.com/athena/latest/ug/show-partitions.html

Related

How to Decrease Query Compile Time in Redshift

I have seen that the first time query execution taking longer time to execute but second execution takes less time, seems like query compile time is taking longer time at first, can we do anything here which will increase the performance of compile time ?
Scenario:
enable_result_cache_for_session is off
We have SLA defined to execute specific query is 15 seconds but when run for the first time it is taking 33 seconds to compile and run the query that time SLA is miss but subsequent run took 10 seconds which is SLA hit.
Q: How do I tune this part ? How do I make sure this does not happens ?
Do we have any database configuration parameter for the same?

The title of the question says compile time but I understand that you are interested in improving the execution time, right?
For sure the John Rotenstein comment makes total sense, to improve the Redshift execution query time you need to understand the RS architecture and how to distribute your data in the best way you can to improve the queries time.
You will need to understand the DISTKEY and SORTKEY
Useful links
Redshift Architecture
https://docs.aws.amazon.com/redshift/latest/dg/c_high_level_system_architecture.html
https://medium.com/#dpazetojr/redshift-architecture-basics-4aae5068b8e3
Redshift Distribuition Styles
https://docs.aws.amazon.com/redshift/latest/dg/c_choosing_dist_sort.html
https://medium.com/#dpazetojr/redshift-distkey-and-sortkey-d247b01b01f6
UPDATE 1:
In order you tune query and know how/when use DISTKEY and SORTKEY, we can start using the EXPLAIN command in the query you run and based on that act more precisely.
https://docs.aws.amazon.com/redshift/latest/dg/r_EXPLAIN.html
https://dev.to/ronsoak/the-r-a-g-redshift-analyst-guide-understanding-the-query-plan-explain-360d

Q: AWS Redshift: ANALYZE COMPRESSION 'Table Name' - How to save the result set into a table / Join to other Table

I am looking for a way to save the result set of an ANALYZE Compression to a table / Joining it to another table in order to automate compression scripts.
Is it Possible and how?

You can always run analyze compression from an external program (bash script is my go to), read the results and store them back up to Redshift with inserts. This is usually the easiest and fastest way when I run into these type of "no route from leader to compute node" issues on Redshift. These are often one-off scripts that don't need automation or support.
If I need something programatic I'll usually write a Lambda function (or possibly a python program on an ec2). Fairly easy and execution speed is high but does require an external tool and some users are not happy running things outside of the database.
If it needs to be completely Redshift internal then I make a procedure that keeps the results of the leader only query in cursor and then loops on the cursor inserting the data into a table. Basically the same as reading it out and then inserting back in but the data never leaves Redshift. This isn't too difficult but is slow to execute. Looping on a cursor and inserting 1 row at a time is not efficient. Last one of these I did took 25 sec for 1000 rows. It was fast enough for the application but if you need to do this on 100,000 rows you will be waiting a while. I've never done this with analyze compression before so there could be some issue but definitely worth a shot if this needs to be SQL initiated.

AWS Athena partition fetch all paths

Recently, I've experienced an issue with AWS Athena when there is quite high number of partitions.
The old version had a database and tables with only 1 partition level, say id=x. Let's take one table; for example, where we store payment parameters per id (product), and there are not plenty of IDs. Assume its around 1000-5000. Now while querying that table with passing id number on where clause like ".. where id = 10". The queries were returned pretty fast actually. Assume we update the data twice a day.
Lately, we've been thinking to add another partition level for day like, "../id=x/dt=yyyy-mm-dd/..". This means that partition number grows xID times per day if a month passes and if we have 3000 IDs, we'd approximately get 3000x30=90000 partitions a month. Thus, a rapid grow in number of partitions.
On, say 3 months old data (~270k partitions), we'd like to see a query like the following would return in at most 20 seconds or so.
select count(*) from db.table where id = x and dt = 'yyyy-mm-dd'
This takes like a minute.
The Real Case
It turns out Athena first fetches the all partitions (metadata) and s3 paths (regardless the usage of where clause) and then filter those s3 paths that you would like to see on where condition. The first part (fetching all s3 paths by partitions lasts long proportionally to the number of partitions)
The more partitions you have, the slower the query executed.
Intuitively, I expected that Athena fetches only s3 paths stated on where clause, I mean this would be the one way of magic of the partitioning. Maybe it fetches all paths
Does anybody know a work around, or do we use Athena in a wrong way ?
Should Athena be used only with small number of partitions ?
Edit
In order to clarify the statement above, I add a piece from support mail.
from Support
...
You mentioned that your new system has 360000 which is a huge number.
So when you are doing select * from <partitioned table>, Athena first download all partition metadata and searched S3 path mapped with
those partitions. This process of fetching data for each partition
lead to longer time in query execution.
...
Update
An issue opened on AWS forums. The linked issue raised on aws forums is here.
Thanks.

This is impossible to properly answer without knowing the amount of data, what file formats, and how many files we're talking about.
TL; DR I suspect you have partitions with thousands of files and that the bottleneck is listing and reading them all.
For any data set that grows over time you should have a temporal partitioning, on date or even time, depending on query patterns. If you should have partitioning on other properties depends on a lot of factors and in the end it often turns out that not partitioning is better. Not always, but often.
Using reasonably sized (~100 MB) Parquet can in many cases be more effective than partitioning. The reason is that partitioning increases the number of prefixes that have to be listed on S3, and the number of files that have to be read. A single 100 MB Parquet file can be more efficient than ten 10 MB files in many cases.
When Athena executes a query it will first load partitions from Glue. Glue supports limited filtering on partitions, and will help a bit in pruning the list of partitions – so to the best of my knowledge it's not true that Athena reads all partition metadata.
When it has the partitions it will issue LIST operations to the partition locations to gather the files that are involved in the query – in other words, Athena won't list every partition location, just the ones in partitions selected for the query. This may still be a large number, and these list operations are definitely a bottleneck. It becomes especially bad if there is more than 1000 files in a partition because that's the page size of S3's list operations, and multiple requests will have to be made sequentially.
With all files listed Athena will generate a list of splits, which may or may not equal the list of files – some file formats are splittable, and if files are big enough they are split and processed in parallel.
Only after all of that work is done the actual query processing starts. Depending on the total number of splits and the amount of available capacity in the Athena cluster your query will be allocated resources and start executing.
If your data was in Parquet format, and there was one or a few files per partition, the count query in your question should run in a second or less. Parquet has enough metadata in the files that a count query doesn't have to read the data, just the file footer. It's hard to get any query to run in less than a second due to the multiple steps involved, but a query hitting a single partition should run quickly.
Since it takes two minutes I suspect you have hundreds of files per partition, if not thousands, and your bottleneck is that it takes too much time to run all the list and get operations in S3.

Depth of sys.dm_pdw_exec_requests on Azure SQL Data Warehouse

I am running tests that take many hours to complete on ADW and the amount of SQL involved rolls off the 10,000 row limit of sys.dm_pdw_exec_requests (as documented at https://learn.microsoft.com/en-us/azure/sql-data-warehouse/sql-data-warehouse-service-capacity-limits ) in less than 30 minutes.
Is my only option to create a process to capture into a table in my database the data on sys.dm_pdw_exec_requests every N minutes (where N << 30 )?

I'm not sure what your use case is, but perhaps you can get the same useful information out of the audit logs?
https://learn.microsoft.com/en-us/azure/sql-data-warehouse/sql-data-warehouse-auditing-overview

You might be able to use something that was already built for that purpose, instead of reinventing the wheel:
https://github.com/andrealibero/Azure_SQL_DWH_Perf_Stats
the PowerShell script can collect output of DMVs (configured in an XML file) in a loop or for a number of specified iterations.
Given how quickly the DMVs roll out for you this might help in your scenario.

Incremental update of millions of records, indexed vs. join

I'm currently developing a strategy for an incremental update of our user data. We assume 100_000_000 records in our database of which approximately 1_000_000 records are updated per workflow.
The idea is to update records in a MapReduce job. Is it useful to use an indexed storage (eg. Cassandra) to be able to access current records randomly? Or is it preferable to retrieve data from HDFS and join new information to existing records.
The record size is O(200 Bytes). The user data has a fixed length but should be extendable. The log events have a similar but not equal structure. The number of user records is likely to grow. Near real-time updates are desirable, ie. a 3 hour time gap is not acceptable, few minutes is OK.
Have you made any experiences with either of these strategies and data of this size?
Is the pig JOIN fast enough? Is it a bottleneck always to read all records? Is Cassandra able to hold this amount of data efficiently? Which solution is scalable? What about the complexity of the system?

You need to define your requirements first. Your record volumes are not a problem, but you don't give a record length. Are they fixed length, fixed field number, likely to change format over time? Are we talking 100 byte records or 100,000 byte records? You need an index on a field/column if you wish to query by that field/column, unless you do all your work using map/reduce. Will the number of user records stay at 100mill (1 server will probably suffice) or will it grow 100% per year ( probably multiple servers adding new ones over time).
How you access records for updating depends on whether you need to update them in real-time or whether you can run a batch job. Will updates be every minute, or hour, or month?
I would strongly suggest you do some experimenting. Have you done any testing already? This will give you a context for your questions and this will lead to more objective questions and answers. It is unlikely that you can 'whiteboard' a solution based on your question.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js