Does hierarchical partitioning work in AWS Athena/S3? - amazon-web-services

I am new to AWS and am trying to use S3 and Athena for a use case.
I want the data saved as JSON files in S3 to be queried from Athena. To reduce the amount of data scanned, I have created a directory structure like this:
../customerid/date/*.json (format)
../100/2020-04-29/*.json
../100/2020-04-30/*.json
.
.
../101/2020-04-29/*.json
In Athena the table structure has been created according to the data we are expecting, and two partitions have been created, namely customer (customerid) and dt (date).
I want to query all the data for customer '100' and limit the scan to its directory, for which I am trying to load the partition as follows:
alter table <table_name> add
partition (customer=100) location 's3://<location>/100/'
But I get the following error
FAILED: SemanticException partition spec {customer=100} doesn't contain all (2) partition columns
Clearly it's not loading a single partition when multiple partition columns have been created.
Giving both partitions in the alter table statement:
alter table <table_name> add
partition (customer=100, dt=2020-04-22) location 's3://<location>/100/2020-04-22/'
I get this error
missing 'column' at 'partition' (service: amazonathena; status code: 400; error code: invalidrequestexception;
Am I doing something wrong?
Does this even work?
If not, is there a way to work with hierarchical partitions?

The issue is with the hierarchical S3 structure that you have used. It should be
../customer=100/dt=2020-04-29/*.json
instead of
../100/2020-04-29/*.json
If your data in S3 is stored in the Hive-style prefix structure mentioned above, then you can add all the partitions with a simple MSCK REPAIR TABLE <table_name> command.
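For example, a minimal sketch (assuming the table was already created with customer and dt declared in its PARTITIONED BY clause, and the objects live under customer=<id>/dt=<date>/ prefixes):
-- Scans the table's LOCATION and registers every Hive-style partition prefix it finds
MSCK REPAIR TABLE <table_name>;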
Hope this clarifies things.

I figured out the mistake I was making, so I wanted to share it in case anyone finds themselves in the same situation.
This is for data not partitioned in Hive format (refer to this for Hive and non-Hive formats).
Taking the example above, the following is the alter command that works:
alter table <table_name> add
partition (customer=100, dt=date '2020-04-22') location 's3://<location>/100/2020-04-22/'
Notice the change in the syntax of the "dt" partition. My partition data type was set to "date", and not specifying the type while loading the partition was giving the error.
Not giving the data type also works; we just need to use single quotes, which defaults the partition value to string/varchar:
alter table <table_name> add
partition (customer=100, dt='2020-04-22') location 's3://<location>/100/2020-04-22/'
I prefer giving the date data type while adding the partition, as that is how I configured my partition.
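As a quick sanity check (a hypothetical query reusing the names from this example), filtering on the partition columns should then limit the scan to that single prefix:
SELECT count(*) FROM <table_name>
WHERE customer = 100 AND dt = date '2020-04-22';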
Hope this helps.

Related

Renaming an AWS Glue table column name without changing the underlying Parquet files

I have a Parquet file with the below structure:
column_name_old
First
Second
I am crawling this file into a table in AWS Glue; however, in the Glue schema I want the table structure to be as below, without changing anything in the Parquet files:
column_name_new
First
Second
I tried updating the table structure using boto3:
# js is assumed to be the response of the Glue GetTable API call,
# e.g. glue_client.get_table(DatabaseName=..., Name=...)
col_list = js['Table']['StorageDescriptor']['Columns']
for x in col_list:
    if isinstance(x, dict):
        x.update({'Name': x['Name'].replace('column_name_old', 'column_name_new')})
And it works, as I can see the table structure updated in the Glue catalog, but when I query the table using the new column name I don't get any data; it seems the mapping between the table structure and the Parquet files is lost.
Is this approach even possible, or must I change the Parquet files themselves? If it is possible, what am I doing wrong?
You can create a view that maps the column name to another value.
I believe a change in the column name will break the metadata catalogue.
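If you go the view route, a minimal sketch could look like this (the table and view names here are assumptions):
CREATE OR REPLACE VIEW my_table_renamed AS
SELECT column_name_old AS column_name_new
FROM my_table;
Queries can then use column_name_new against the view while the Parquet files and the crawled table keep the original column name.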

Is it possible to delete an entire table stored in S3 buckets from an Athena query?

I want a table to store the history of an object for a week and then replace it with the history of the next week. What would be the best way to achieve this in AWS?
The data, stored in JSON format in S3, is a weekly dump. The pipeline runs the script once a week and dumps the data into S3 for analysis. For the next run of the script I do not need the previous week-1 data, so it needs to be replaced with the new week-2 data. The schema of the table remains constant, but the data keeps changing every week.
I would recommend using data partitioning to solve your issue without deleting the underlying S3 files from previous weeks (which is not possible via an Athena query).
The idea is to use a partition key based on the date, and then use this partition key in the WHERE clause of your Athena queries, which will cause Athena to ignore previous files (those that are not under the latest partition).
For example, if you use the file dump date as the partition key (let's say we name it dump_key), your files will have to be stored in subfolders like
s3://your-bucket/subfolder/dump_key=2021-01-01-13-00/files.csv
s3://your-bucket/subfolder/dump_key=2021-01-07-13-00/files.csv
Then, during your data processing, you'll first need to create your table and specify a partition key with the PARTITIONED BY option.
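A minimal sketch of such a table could look like this (the table name, columns, and row format are assumptions; the row format should match your actual files, CSV in the example paths above):
CREATE EXTERNAL TABLE my_table (
id string,
payload string
)
PARTITIONED BY (dump_key string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 's3://your-bucket/subfolder/';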
Then, you'll have to make sure you add a new partition using the ALTER TABLE ... ADD PARTITION command every time it's necessary for your use case:
ALTER TABLE your_table ADD PARTITION (dump_key='2021-01-07-13-00') location 's3://your-bucket/subfolder/dump_key=2021-01-07-13-00/'
Then you'll be able to query your table by filtering previous data using the right WHERE clause:
SELECT * FROM my_table WHERE dump_key >= '2021-01-05-00-00'
This will cause Athena to ignore files in previous partitions when querying your table.
Documentation here:
https://docs.aws.amazon.com/athena/latest/ug/partitions.html

Moving a partitioned table across regions (from US to EU)

I'm trying to move a partitioned table over from the US to the EU region, but whenever I manage to do so, it doesn't partition the table on the correct column.
The current process that I'm following is:
Create a Storage bucket in the region that I want the partitioned table to be in
Export the partitioned table as CSV to the original bucket (within the old region)
Transfer the table across buckets (from the original bucket to the new one)
Create a new table using the CSV from the new bucket (auto-detect schema is on)
bq --location=eu load --autodetect --source_format=CSV table_test_set.test_table [project ID/test_table]
I expect the table to be partitioned on the DATE column, but instead it's partitioned on the column PARTITIONTIME.
Also, note that I'm currently doing this with CLI commands. This will need to be redone multiple times, so having reusable code is a must.
When I migrate data from one table to another, I follow this process:
I extract the data to GCS (CSV or another format)
I extract the schema of the source table with this command: bq show --schema <dataset>.<table>
I create the destination table via the GUI, using the "edit as text" schema option, and paste the schema in. I manually define the partition field that I want to use from the schema.
I load the data from GCS to the destination table.
This process has 2 advantages:
When you import CSV data, you define the REAL types that you want. Remember, with schema autodetect, BigQuery looks at about 10 or 20 lines and deduces the schema. Often, string fields are set as INTEGER because the first lines of my file don't contain letters, only numbers (serial numbers, for example).
You can define your partition field properly.
The process is quite easy to script. I use the GUI for creating the destination table, but the bq command line is great for doing the same thing.
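As an alternative to the GUI step, a BigQuery standard SQL sketch of the same idea (the dataset, table, and column names are assumptions) creates the destination table with an explicit partition column before loading:
CREATE TABLE mydataset.test_table (
id INT64,
event_date DATE
)
PARTITION BY event_date;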
After some more digging I managed to find the solution. By using "--time_partitioning_field [column name]" you are able to partition by a specific column. So the command would look like this:
bq --location=eu --schema [where your JSON schema file is] load --time_partitioning_field [column name] --source_format=NEWLINE_DELIMITED_JSON table_test_set.test_table [project ID/test_table]
I also found that using JSON files makes things easier.

Amazon Athena not able to read data from partition

I am working on partitions in Athena. I have a directory in S3 where date-wise files are placed. I am trying to create a date-partitioned table and set the location of each partition to the file for that date. Although the set-location query for the partition runs successfully, I am not able to see data in that partition through a select query.
After executing the below query I can see the data:
alter table tbl_name partition (date='2018-05-28') set location 's3://bucket_name//test/'
But not after executing this:
alter table tbl_name partition (date='2018-05-28') set location 's3://bucket_name//test/test.csv'
Thus, if I set the location to a directory it is able to pick up the data, but not when I set the location to a file.
But I need to set the location of a partition to a file name. This works perfectly in Hive. Need help for Athena.
If you have a folder structure like this,
s3://bucket/myfolder/logs/2018/04/02/file1.csv
s3://bucket/myfolder/logs/2018/04/02/file2.csv
s3://bucket/myfolder/logs/2018/04/03/file1.csv
s3://bucket/myfolder/logs/2018/04/03/file2.csv
Then you can create a partition like:
ALTER TABLE table_name ADD
PARTITION (year='2018', month='04', day='02') LOCATION 's3://bucket/myfolder/logs/2018/04/02'
In your case,
s3://bucket_name//test/test.csv is not a proper structure for creating the partition, since the partition location must point to a folder rather than a single file.
If you share your S3 folder structure, then I can try to help you with this.
For more about Athena partitions: Read Here
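For completeness, a sketch of the table declaration that the ALTER statement above assumes (the table name, columns, and row format are made up for illustration):
CREATE EXTERNAL TABLE logs (
col1 string,
col2 string
)
PARTITIONED BY (year string, month string, day string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 's3://bucket/myfolder/logs/';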

Can an Athena table be created for S3 bucket sub-directories?

Our S3 buckets generally have a number of sub-directories, so that the path to a bucket is something like s3:top-level-function-group/more-specific-folder/org-tenant-company-id/entityid/actual-data
We're looking into Athena to be able to query the data at that /actual-data level, but within a given org-tenant-company-id, so that would have to be passed as some kind of parameter.
Or would that org-tenant-company-id be a partition?
Is it possible to create an Athena table that queries against this structure? And what would the S3 location be in the create table wizard? I tried it with s3:top-level-function-group/more-specific-folder/ but when it ran, I think it said something like '0 Kb data read'.
You can create a partitioned table as follows, where the partition keys are defined only in the PARTITIONED BY clause, not in the list of table fields:
CREATE EXTERNAL TABLE mydb.mytable (
id int,
stuff string,
...
)
PARTITIONED BY (
orgtenantcompanyid string
)
LOCATION 's3://mybucket/top-level-function-group/more-specific-folder/';
After creating the table, you can then load individual partitions:
ALTER TABLE mydb.mytable ADD PARTITION (orgtenantcompanyid='org1')
LOCATION 's3://mybucket/top-level-function-group/more-specific-folder/org1';
Result rows will contain the partition fields like orgtenantcompanyid.
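For example, a query like the following (reusing the names from the example above) should scan only the objects under the org1 prefix:
SELECT id, stuff
FROM mydb.mytable
WHERE orgtenantcompanyid = 'org1';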
Yes, it is possible to create tables that only use contents of a specific subdirectory.
It's normal that after creating your table you see 0 KB read. That's because no data is read when you CREATE a table.
To check whether you can actually query the data, do something like:
SELECT * FROM <table_name> LIMIT 10
Partitioning only makes sense if the data structure is identical in all the different directories so that the table definition applies to all the data under the location.
And yes, it's possible to use the path structure to create partitions. However, not automatically if it's not in the right format /key=value/. You can use the path as an attribute, though, as explained here: How to get input file name as column in AWS Athena external tables
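For example, a minimal sketch using Athena's "$path" pseudo-column (the table name and the LIKE pattern are placeholders here):
SELECT "$path", *
FROM <table_name>
WHERE "$path" LIKE '%/some-org-tenant-company-id/%';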