Seemingly I cannot get Athena partition projection to work.
When I add partitions the "old fashioned" way and then run MSCK REPAIR TABLE testparts; I can query the data.
When I drop the table and recreate it with the partition projection properties below, it fails to return any data at all. The queries that do run take a very long time with no results, or they time out like the query below.
For the sake of argument I followed the AWS documentation and ran:
select distinct year from testparts;
I get:
HIVE_EXCEEDED_PARTITION_LIMIT: Query over table 'mydb.testparts' can potentially read more than 1000000 partitions.
I have ~7500 files in there at the moment in the file structures indicated in the table setups below.
I have:
Tried entering the date parts as a date type with the format "yyyy-MM-dd", and it still did not work (including deleting and changing my S3 structures as well). I then tried splitting the dates into different folders and setting them as integers (which you see below), and it still did not work.
Given that I can get it to operate "manually" by repairing the table and then successfully querying my structures, I must be doing something wrong at a fundamental level with partition projection.
I have also changed user from the injected type to enum (not ideal given it's a plain old string, but I did it for the purpose of testing).
Table creation:
CREATE EXTERNAL TABLE `testparts`(
`thisdata` array<struct<thistype:string,amount:float,identifiers:map<string,struct<id:string,type:string>>,selections:map<int,array<int>>>> COMMENT 'from deserializer')
PARTITIONED BY (
`year` int,
`month` int,
`day` int,
`user` string,
`thisid` int,
`account` int)
ROW FORMAT SERDE
'org.openx.data.jsonserde.JsonSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat'
LOCATION
's3://testoutputs/partitiontests/responses'
TBLPROPERTIES (
'classification'='json',
'projection.day.digits'='2',
'projection.day.range'='1,31',
'projection.day.type'='integer',
'projection.enabled'='true',
'projection.account.range'='1,50',
'projection.account.type'='integer',
'projection.month.digits'='2',
'projection.month.range'='1,12',
'projection.month.type'='integer',
'projection.thisid.range'='1,15',
'projection.thisid.type'='integer',
'projection.user.type'='enum',
'projection.user.values'='usera,userb',
'projection.year.range'='2017,2027',
'projection.year.type'='integer',
'storage.location.template'='s3://testoutputs/partitiontests/responses/year=${year}/month=${month}/day=${day}/user=${user}/thisid=${thisid}/account=${account}/',
'transient_lastDdlTime'='1653445805')
If you run a query like SELECT * FROM testparts, Athena will generate all permutations of possible values for the partition keys and list the corresponding location on S3. For your table that means 11 years × 12 months × 31 days × 2 users × 15 thisid values × 50 accounts = 6,138,000 listings.
I don't believe there is any optimization for SELECT DISTINCT year FROM testparts that would skip building the list of partition key values, so something similar would happen with that query too. Similarly, if you use "Preview table" to run SELECT * FROM testparts LIMIT 10, there is no optimization that skips building the list of partitions or skips listing the locations on S3.
Try running a query that doesn't wildcard any of the partition keys to validate that your config is correct.
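For example, a query along these lines, with an exact value for every partition key (the values are only illustrative), should require just a single S3 listing:
SELECT *
FROM testparts
WHERE year = 2022
  AND month = 5
  AND day = 24
  AND "user" = 'usera'
  AND thisid = 3
  AND account = 7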
Partition projection works differently from adding partitions to the catalog, and some care needs to be taken with wildcards. When partitions are in the catalog, non-existent partitions can be eliminated cheaply, but with partition projection S3 has to be listed for every permutation of partition keys after predicates have been applied.
Partition projection works best when there are never wildcards on partition keys, to minimize the number of S3 listings that need to happen.
Related
I use the table property below to set the range for a date column:
'projection.date.range' = 'NOW-365DAYS,NOW+1DAYS'
The table has no data going back to NOW-365DAYS as it is a new table, so querying this table from Athena results in a high volume of ListBucket requests. I don't want that to happen, so I thought to set a range like the one below so that I can avoid empty partitions, but it throws an error:
'projection.date.range' = 'MAX(2022/01/12, NOW-365DAYS),NOW+1DAYS'
Is there a way to use MAX/MIN functions in projection.date.range?
It's not possible to qualify the partition projection range like that, unfortunately. I suggest setting the lower bound to the actual first date with data, until a relative range makes sense.
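For example, assuming 2022/01/12 is the first date with data and that it matches the table's projection.date.format, a fixed lower bound would look like this:
'projection.date.range' = '2022/01/12,NOW+1DAYS'
You could switch back to a purely relative range once a full year of data exists.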
Since you say Athena is making a lot of S3 list requests, I assume you are querying the table without a filter on this partition key. That will always result in a lot of S3 listings, at least 365 of them, regardless of whether there is data or not. Why aren't your queries filtering on the date partition key?
Is the reason why you want the range to be the last 365 days that you will remove data after one year, or is there another reason?
I created an external table in Athena using the DDL script below. The table creates successfully in Athena, but when I query it, it returns 0 rows. The files in the specified S3 bucket are csv.gz files (there is one JSON file that I'm trying to exclude in TBLPROPERTIES). The S3 bucket is in a different account than the one I'm querying from. Supposedly the IAM role I'm using has access to read the data from the source S3 bucket in the other account. Is there anything else I need to specify in the TBLPROPERTIES to make this work? Thanks.
CREATE EXTERNAL TABLE default_schema.customer_all(
identity_line_item_id string,
identity_time_interval string,
payer string,
YEAR string,
MONTH string)
PARTITIONED BY (
payer string,
year string,
month string)
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
's3://path/to/data/'
TBLPROPERTIES (
'transient_lastDdlTime'='1602464902', 'classification'='csv', 'exclude'='s3://path/to/data/*.json', 'delimiter'=',','compressionType'='gzip');
The table is partitioned, and for partitioned tables you must either manually add all partitions, or configure partition projection so that Athena knows where to find the files for each partition.
Adding partitions manually can be done with Athena DDL (i.e. SQL) or via the Glue Data Catalog API. I'll show you how to do it with DDL statements, because using the API is very verbose (although synchronous and faster, which suits some situations better).
You can add one or more partitions like this:
ALTER TABLE customer_all ADD IF NOT EXISTS
PARTITION (payer = 'foo', year = '2020', month = '10') LOCATION 's3://path/to/data/foo/2020/10/'
PARTITION (payer = 'bar', year = '2020', month = '10') LOCATION 's3://path/to/data/bar/2020/10/'
PARTITION (payer = 'bar', year = '2020', month = '11') LOCATION 's3://path/to/data/bar/2020/11/'
The partition key values are given first, then a location for that partition. Note that the partition location doesn't need to correspond to the values in the partition keys; your locations could be s3://path/to/data/2020-10/foo/ if you want (but keeping them looking similar is recommended for your own sanity, of course).
Using partition projection is often more convenient than manually adding partitions, but it has some limitations. Your partition keys must be enumerable, or ranges, which your payer partition key most likely is not. There is a special partition projection key type that can use values injected from the query, though, and if all your queries contain a value for the payer partition key, this configuration could work:
…
TBLPROPERTIES (
…
'projection.enabled' = 'true',
'storage.location.template' = 's3://path/to/data/${payer}/${year}/${month}',
'projection.payer.type' = 'injected',
'projection.year.type' = 'integer',
'projection.year.range' = '2020,2030',
'projection.month.type' = 'integer',
'projection.month.range' = '1,12',
'projection.month.digits' = '2'
)
I can't promise that this will work as-is because there are things I don't know about your data. For example, I set the year range to 2020 through 2030; I can't say if that's reasonable, and you could set the end to 9999 if you want, or 2022. I also don't know if your months are zero-padded; I assumed they were, but if they are not you need to remove projection.month.digits or change it to 1. I also assumed your data is organised on S3 like the pattern in storage.location.template, which might not be the case.
And to repeat what I said above: this partition projection configuration only works if all queries contain a value for payer, otherwise you will get an error because Athena won't be able to figure out where to find the data – it can't enumerate over all possible values of payer.
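For example, a query shaped like this (the values are only illustrative) gives Athena everything it needs to construct the S3 locations to list:
SELECT *
FROM customer_all
WHERE payer = 'foo'
  AND year = '2020'
  AND month = '10'
The year and month filters aren't strictly required by this configuration, but including them keeps the number of S3 listings down.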
Some people will tell you to run MSCK REPAIR TABLE customer_all to load your partitions, but this is not something I recommend. Firstly, it will only work if your data is partitioned using Hive style partitioning (e.g. …/payer=foo/year=2020/month=3/…), and secondly, it is very slow and inefficient. It works fine if you have a couple of partitions, but not if you have hundreds, and it's not a scalable solution to managing partitions.
The main question:
I can't seem to find definitive info about how $path works when used in a where clause in Athena.
select * from <database>.<table> where "$path" = 'known/path/'
Given a table definition at the top level of a bucket, if there are no partitions specified but the bucket is organized using prefixes, does it scan the whole table? Or does it limit the scan to the specified path in a similar way to partitions? Any reference to an official statement on this?
The specific case:
I have information being stored in S3; this information needs to be counted and queried once or twice a day. The prefixes are two different IDs (s3://bucket/IDvalue1/IDvalue2/) and then the file with the relevant data. On a given day any number of new folders might be created (on busy days it could be tens of thousands), or new files might be added to existing prefixes. So maintaining the partition catalog up to date seems a little complicated.
One proposed approach to avoid partitions is using $path when getting data for a known combination of IDs, but I cannot seem to find whether such an approach would actually limit the amount of data scanned per query. I read a comment saying it does not, but I cannot find it in the documentation and was wondering if anyone knows how it works and can point to the proper reference.
So far googling and reading the docs has not clarified this.
Athena does not have any optimisation for limiting the files scanned when using "$path" in a query. You can verify this for yourself by running SELECT * FROM some_table and SELECT * FROM some_table WHERE "$path" = '…' and comparing the bytes scanned (they will be the same; if there were an optimisation they would be different – assuming there is more than one file, of course).
See Query by "$path" field and Athena: $path vs. partition
For your use case I suggest using partition projection with the injected type. This way you can limit the prefixes on S3 that Athena will scan, while at the same time not have to explicitly add partitions.
You could use something like the following table properties to set it up (use the actual column names in place of id_col_1 and id_col_2, obviously):
CREATE EXTERNAL TABLE some_table
…
TBLPROPERTIES (
"projection.enabled" = "true",
"projection.id_col_1.type" = "injected",
"projection.id_col_2.type" = "injected",
"storage.location.template" = "s3://bucket/${id_col_1}/${id_col_2}/"
)
Note that when querying a table that uses partition projection with the injected type, all queries must contain explicit values for the projected columns.
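For example, with the hypothetical column names above, every query would need to look something like this:
SELECT *
FROM some_table
WHERE id_col_1 = 'IDvalue1'
  AND id_col_2 = 'IDvalue2'
With both values present, Athena only needs to list s3://bucket/IDvalue1/IDvalue2/, so the scan is limited to that prefix.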
I have an Athena table of data in S3 that acts as a source table, with columns id, name, event. For every unique name value in this table, I would like to output a new table with all of the rows corresponding to that name value, and save to a different bucket in S3. This will result in n new files stored in S3, where n is also the number of unique name values in the source table.
I have tried single Athena queries in Lambda using PARTITION BY and CTAS queries, but can't seem to get the result that I wanted. It seems that AWS Glue may be able to get my expected result, but I've read online that it's more expensive, and that perhaps I may be able to get my expected result using Lambda.
How can I store a new file (JSON format, preferably) that contains all rows corresponding to each unique name in S3?
Preferably I would run this once a day to update the data stored by name, but the question above is the main concern for now.
When you write your Spark/Glue code you will need to partition the data using the name column. However, this will result in paths with the format below:
s3://bucketname/folder/name=value/file.json
This should give a separate set of files for each name value, but if you want to access them as separate tables you might need to get rid of that = sign from the key before you crawl the data and make it available via Athena.
If you do use a Lambda, the operation involves going through the data, similar to what Glue does, and partitioning it.
I guess it all depends on the volume of data that needs to be processed. Glue, if using Spark, may have a little bit of extra start-up time; Glue Python shells have comparatively better start-up times.
I have an ingestion time partitioned table that's getting a little large. I wanted to group by the values in one of the columns and use that to split it into multiple tables. Is there an easy way to do that while retaining the original _PARTITIONTIME values in the set of new ingestion time partitioned tables?
Also I'm hoping for something that's relatively simple/cheap. I could do something like copy my table a bunch of times and then delete the data for all but one value on each copy, but I'd get charged a huge amount for all those DELETE operations.
Also I have enough unique values in the column I want to split on that saving a "WHERE column = value" query result to a table for every value would be cost prohibitive. I'm not finding any documentation that mentions whether this approach would even preserve the partitions, so even if it weren't cost prohibitive it may not work.
The case you describe requires two-level partitioning, which is not supported yet.
You can create a column-partitioned table instead: https://cloud.google.com/bigquery/docs/creating-column-partitions
You would then populate that partitioning column with the value you previously relied on for partitioning before each insert – but in that case you lose the _PARTITIONTIME value.
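A minimal sketch of that approach in standard SQL DDL, assuming hypothetical column names (event_date is the partitioning column you populate yourself, split_key the column you want to split on):
CREATE TABLE mydataset.column_partitioned_table (
  event_date DATE,
  split_key STRING,
  payload STRING
)
PARTITION BY event_date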
Based on additional clarification: I had a similar problem, and my solution was to write a Python application that reads the source table (reading is important here, not querying, so it is free), splits the data based on your criteria, and then either streams the data into the target tables (simple, but not free) or generates JSON/CSV files and loads them into the target tables (which is also free, but with some limitations on the number of such operations); the second route requires more coding and exception handling.
You can also do it via Dataflow – it will definitely be more expensive than a custom solution, but potentially more robust.
Example using the google-cloud-bigquery Python library (source_table_ref, target_project, target_dataset, start_index, max_results and target_table are placeholders):
from google.cloud import bigquery

client = bigquery.Client(project="PROJECT_NAME")

# read the source table's metadata
t1 = client.get_table(source_table_ref)
target_schema = t1.schema[1:]  # removing first column, which is the key to split on

ds_target = client.dataset(project=target_project, dataset_id=target_dataset)

# list_rows reads the table storage directly, it does not run (and bill) a query
rows_to_process_iter = client.list_rows(t1, start_index=start_index, max_results=max_results)
# convert to list
rows_to_process = list(rows_to_process_iter)
# doing something with records: split rows_to_process by the key column into records_to_stream
# stream records to destination (insert_rows in current library versions, create_rows in old ones)
errors = client.insert_rows(target_table, records_to_stream, selected_fields=target_schema)
BigQuery now supports clustered partitioned tables, which allow you to specify additional columns that the data should be split by.
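A sketch of what that could look like in standard SQL, assuming hypothetical table names and that split_key is the column you want to split on; the original _PARTITIONTIME is carried over into a regular timestamp column used for partitioning:
CREATE TABLE mydataset.clustered_copy
PARTITION BY DATE(partition_ts)
CLUSTER BY split_key
AS
SELECT _PARTITIONTIME AS partition_ts, *
FROM mydataset.source_table
Queries that filter on split_key then only read the relevant blocks, which can give much of the benefit of separate tables without copying the data once per value.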