I am working on partitioning in Athena. I have a directory in S3 where files are placed date-wise. I am trying to create a date-partitioned table and set the location of each partition to the file for that date. Although the set-location query for the partition runs successfully, I am not able to see data in that partition through a SELECT query.
After executing the query below I can see the data:
alter table tbl_name partition (date='2018-05-28') set location 's3://bucket_name//test/'
But not after executing this:
alter table tbl_name partition (date='2018-05-28') set location 's3://bucket_name//test/test.csv'
Thus if I set the location to a directory, Athena is able to pick up the data, but not when I set the location to a file.
But I need to set the location of a partition to a file name. This works perfectly in Hive. I need help getting it to work in Athena.
If you have a folder structure like this,
s3://bucket/myfolder/logs/2018/04/02/file1.csv
s3://bucket/myfolder/logs/2018/04/02/file2.csv
s3://bucket/myfolder/logs/2018/04/03/file1.csv
s3://bucket/myfolder/logs/2018/04/03/file2.csv
then you can create a partition like this:
ALTER TABLE table_name ADD
PARTITION (year='2018', month='04', day='02') LOCATION 's3://bucket/myfolder/logs/2018/04/02/'
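For that ALTER statement to work, the table itself must have been declared with matching partition columns. A minimal sketch of such a table definition (the table name, data columns, and CSV row format here are assumptions for illustration):
CREATE EXTERNAL TABLE logs_table (
  request_id string,
  message string
)
PARTITIONED BY (
  year string,
  month string,
  day string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 's3://bucket/myfolder/logs/';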
In your case,
s3://bucket_name//test/test.csv is not a valid partition location, because a partition location must be a prefix (a "folder"), not a single file.
If you share your S3 folder structure, I can try to help you with this.
For more about Athena partitions, see the AWS documentation: https://docs.aws.amazon.com/athena/latest/ug/partitions.html
Related
I have a Parquet file with the structure below:
column_name_old
First
Second
I am crawling this file into a table in AWS Glue; however, in the Glue table schema I want the structure below, without changing anything in the Parquet files:
column_name_new
First
Second
I tried updating the table structure using boto3:
# 'js' is the response from the Glue get_table() call for this table
col_list = js['Table']['StorageDescriptor']['Columns']
for x in col_list:
    if isinstance(x, dict):
        x.update({'Name': x['Name'].replace('column_name_old', 'column_name_new')})
And it works in the sense that I can see the updated table structure in the Glue catalog, but when I query the table using the new column name I don't get any data; it seems the mapping between the table schema and the Parquet files is lost.
Is this approach even possible, or must I change the Parquet files themselves? If it is possible, what am I doing wrong?
You can create a view that exposes the column under the new name.
I believe changing the column name in the catalogue breaks the mapping, because the Parquet files still store the old column name internally.
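A minimal sketch of that view (the table and view names here are made up):
CREATE OR REPLACE VIEW my_renamed_view AS
SELECT column_name_old AS column_name_new
FROM my_parquet_table;
Queries then use my_renamed_view and the new column name, while the underlying table keeps the name stored in the Parquet files.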
I want a table to store the history of an object for a week and then replace it with the next week's history. What would be the best way to achieve this in AWS?
The data, stored in JSON format in S3, is a weekly dump. The pipeline runs the script once a week and dumps data into S3 for analysis. For the next run of the script I do not need the previous week's (week-1) data, so it needs to be replaced with the new week's (week-2) data. The schema of the table remains constant but the data changes every week.
I would recommend using data partitioning to solve your issue without deleting the underlying S3 files from previous weeks (deleting them is not possible via an Athena query anyway).
The idea is to use a partition key based on the date, and then use this partition key in the WHERE clause of your Athena query, which will cause Athena to ignore older files (those not under the latest partition).
For example, if you use the file dump date as partition key (let's say we chose to name it dump_key), your files will have to be stored in subfolders like
s3://your-bucket/subfolder/dump_key=2021-01-01-13-00/files.csv
s3://your-bucket/subfolder/dump_key=2021-01-07-13-00/files.csv
Then, during your data processing, you'll first need to create your table and specify a partition key with the PARTITIONED BY option.
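For example, the table definition could look something like this (the table name and data columns are assumptions, since the actual schema isn't shown; the CSV row format matches the files.csv naming above, and a JSON SerDe such as org.openx.data.jsonserde.JsonSerDe would be used instead for the JSON dumps described in the question):
CREATE EXTERNAL TABLE your_table (
  id string,
  payload string
)
PARTITIONED BY (dump_key string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 's3://your-bucket/subfolder/';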
Then, you'll have to make sure you add a new partition with the ALTER TABLE ... ADD PARTITION command every time your use case requires it:
ALTER TABLE your_table ADD PARTITION (dump_key='2021-01-07-13-00') location 's3://your-bucket/subfolder/dump_key=2021-01-07-13-00/'
Then you'll be able to query your table by filtering previous data using the right WHERE clause:
SELECT * FROM your_table WHERE dump_key >= '2021-01-05-00-00'
This will cause Athena to ignore files in previous partitions when querying your table.
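If you also want the catalogue to forget a previous week entirely, you can drop its partition; this only removes the partition metadata, the S3 files themselves are untouched:
ALTER TABLE your_table DROP PARTITION (dump_key='2021-01-01-13-00')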
Documentation here:
https://docs.aws.amazon.com/athena/latest/ug/partitions.html
I am new to AWS and trying to use S3 and Athena for a use case.
I want the data saved as JSON files in S3 to be queryable from Athena. To reduce the data scanned I have created a directory structure like this:
../customerid/date/*.json (format)
../100/2020-04-29/*.json
../100/2020-04-30/*.json
.
.
../101/2020-04-29/*.json
In Athena the table has been created according to the data we are expecting, and two partition columns have been defined, namely customer (customer id) and dt (date).
I want to query all the data for customer '100' and limit the scan to its directory, for which I am trying to load the partition as follows:
alter table <table_name> add
partition (customer=100) location 's3://<location>/100/'
But I get the following error
FAILED: SemanticException partition spec {customer=100} doesn't contain all (2) partition columns
Clearly it does not load a single partition key when multiple partition keys have been defined.
Giving both partition keys in the ALTER TABLE statement,
alter table <table_name> add
partition (customer=100, dt=2020-04-22) location 's3://<location>/100/2020-04-22/'
I get this error
missing 'column' at 'partition' (service: amazonathena; status code: 400; error code: invalidrequestexception;
Am I doing something wrong?
Does this even work?
If not, is there a way to work with hierarchical partitions?
The issue is with the S3 hierarchical structure that you have used. It should be
../customer=100/dt=2020-04-29/*.json
instead of
../100/2020-04-29/*.json
If your data in S3 is stored in the correct prefix structure as mentioned, then you can add all the partitions with a simple MSCK REPAIR TABLE <table_name> command.
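For example, after restructuring the prefixes:
MSCK REPAIR TABLE <table_name>
SHOW PARTITIONS <table_name>
The second statement lets you verify which partitions were registered.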
Hope this clarifies
I figured out the mistake I was making, so I wanted to share it in case anyone finds themselves in the same situation.
This applies to data not partitioned in Hive format (see the Athena documentation for the difference between Hive and non-Hive formats).
Taking the example above, following is the alter command that works
alter table <table_name> add
partition (customer=100, dt=date '2020-04-22') location 's3://<location>/100/2020-04-22/'
Notice the change in the syntax of the "dt" partition value. Because my partition data type was set to "date", not specifying it while loading the partition was causing the error.
Not giving the data type also works; we just need to wrap the value in single quotes, which defaults the partition value to a string/varchar:
alter table <table_name> add
partition (customer=100, dt='2020-04-22') location 's3://<location>/100/2020-04-22/'
I prefer giving the date data type while adding the partition, as that is how I configured the partition column.
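Once the partition is registered, a query that filters on both partition keys scans only that one prefix, for example:
SELECT * FROM <table_name>
WHERE customer = 100 AND dt = date '2020-04-22'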
Hope this helps.
Say I have a table in Hive named T1. It is partitioned by the column dt, which is a date field. In the Hive warehouse, the directory structure has a folder named after the table T1, with subdirectories inside it, one folder for each date.
My objective is to copy the table's data into Amazon S3 while maintaining the directory structure. If I try to write the table contents directly to S3 as follows, the output is written as a single file and the directory structure is lost:
INSERT OVERWRITE DIRECTORY "s3://<DESTINATION>" SELECT * FROM T1;
Alternatively, if I try to copy the directory from the Hive warehouse on HDFS directly to S3 using the command below, the directory in its entirety is copied to S3, but the underlying files are not comma-delimited anymore... the delimiter is some unreadable character instead:
s3-dist-cp --src=hdfs://<directory location> --dest=s3://<destination>
Can anyone help me accomplish this? Any suggestions or alternatives?
A possible solution is to create a table with the same schema, set its location to the desired destination, and then load the data using Hive with dynamic partitioning:
create table T2 like T1;
Alter table T2 set location 'your destination location';
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
Insert overwrite table T2 partition (dt)
select * from T1
distribute by dt;
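If T1's storage format is not comma-delimited text (which would explain the unreadable delimiter you saw after s3-dist-cp), you could instead declare T2 explicitly as a comma-delimited text table rather than using LIKE; a sketch, with placeholder column names standing in for T1's actual columns:
create external table T2 (
  col1 string,
  col2 string
)
partitioned by (dt date)
row format delimited fields terminated by ','
stored as textfile
location 'your destination location';
The same dynamic-partitioning INSERT shown above then writes comma-delimited files into one subdirectory per dt value.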
Our S3 buckets generally have a number of subdirectories, so that the full path to the data looks something like s3://top-level-function-group/more-specific-folder/org-tenant-company-id/entityid/actual-data
We're looking into Athena to be able to query against data on that /actual-data level, but within the org-tenant-company-id, so that would have to be passed as some kind of parameter.
Or would that org-tenant-company-id be a partition?
Is it possible to create an Athena table that queries against this structure? And what would the S3 location be in the create-table wizard? I tried it with s3://top-level-function-group/more-specific-folder/ but when it ran, I think it said something like '0 KB data read'.
You can create a partitioned table as follows, where the partition keys are defined only in the PARTITIONED BY clause, not in the list of table fields:
CREATE EXTERNAL TABLE mydb.mytable (
id int,
stuff string,
...
)
PARTITIONED BY (
orgtenantcompanyid string
)
LOCATION 's3://mybucket/top-level-function-group/more-specific-folder/';
After creating the table, you can then load individual partitions:
ALTER TABLE mydb.mytable ADD PARTITION (orgtenantcompanyid='org1')
LOCATION 's3://mybucket/top-level-function-group/more-specific-folder/org1';
Result rows will contain the partition fields like orgtenantcompanyid.
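A query that filters on the partition key will then scan only that tenant's prefix, for example:
SELECT id, stuff
FROM mydb.mytable
WHERE orgtenantcompanyid = 'org1';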
Yes, it is possible to create tables that only use the contents of a specific subdirectory.
It's also normal to see 0 KB read right after creating your table; no data is read when you CREATE a table.
To check whether you can actually query the data, do something like:
SELECT * FROM <table_name> LIMIT 10
Partitioning only makes sense if the data structure is identical in all the different directories so that the table definition applies to all the data under the location.
And yes, it's possible to use the path structure to create partitions. However, this does not happen automatically if the paths are not in the /key=value/ format. You can use the path as an attribute, though, as explained here: How to get input file name as column in AWS Athena external tables
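For example, Athena's "$path" pseudo-column exposes the S3 key of each row's source object and can be selected or filtered on even when the prefixes are not in key=value form (the table name and tenant value below are placeholders):
SELECT *, "$path" AS source_file
FROM mytable
WHERE "$path" LIKE '%/more-specific-folder/org1/%';
Note that, unlike a real partition filter, filtering on "$path" does not reduce the amount of data scanned.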