I've read Partitioning data in Athena, however it is not clear how to make partitions for a table, when S3 has the following structure:
aws s3 ls s3://xxx-s3-zzz-datalake-prod/yyy/2022/09/
PRE 01/
PRE 02/
PRE 03/
PRE 04/
PRE 05/
PRE 06/
etc...
How can I create partitions for such a structure? Is it possible? Or should I rename the folders to
aws s3 ls s3://xxx-s3-yyy-datalake-prod/zzz/2022/09/
PRE day=01/
PRE day=02/
PRE day=03/
etc...
and then add PARTITIONED BY (day int) ?
Partitions can be added in both cases, but each case needs a different method. If your data is laid out like this:
aws s3 ls s3://xxx-s3-zzz-datalake-prod/yyy/2022/09/
PRE 01/
PRE 02/
PRE 03/
PRE 04/
PRE 05/
PRE 06/
etc...
Then you have to add the partition information to the table explicitly, using a query like the one below:
ALTER TABLE orders ADD
PARTITION (day = '01') LOCATION 's3://xxx-s3-zzz-datalake-prod/yyy/2022/09/01'
PARTITION (day = '02') LOCATION 's3://xxx-s3-zzz-datalake-prod/yyy/2022/09/02';
Refer to the link below for more information.
https://docs.aws.amazon.com/athena/latest/ug/alter-table-add-partition.html
Also consider adding year and month to the PARTITIONED BY clause: partitioning by day alone is of little use, because the same day value recurs in every month and year.
With the structure below it is straightforward:
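Writing one PARTITION clause per day by hand gets tedious, so here is a minimal sketch that generates the statement. The table name orders and the S3 prefix come from the example above; the helper name add_partitions_sql and the year/month/day column names are assumptions, and the table would need to be declared with PARTITIONED BY (year string, month string, day string) for it to apply.

```python
def add_partitions_sql(table, base_uri, year, month, days):
    """Build one ALTER TABLE ... ADD statement covering several day folders.

    base_uri is the prefix that contains the <year>/<month>/<day>/ folders.
    """
    clauses = []
    for day in days:
        location = f"{base_uri.rstrip('/')}/{year}/{month}/{day}/"
        clauses.append(
            f"PARTITION (year = '{year}', month = '{month}', day = '{day}') "
            f"LOCATION '{location}'"
        )
    # IF NOT EXISTS lets the statement be re-run without failing on known partitions
    return f"ALTER TABLE {table} ADD IF NOT EXISTS\n" + "\n".join(clauses) + ";"


days = [f"{d:02d}" for d in range(1, 4)]  # '01', '02', '03'
print(add_partitions_sql("orders", "s3://xxx-s3-zzz-datalake-prod/yyy", "2022", "09", days))
```

The printed statement can be pasted into the Athena console or submitted through the API.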
aws s3 ls s3://xxx-s3-yyy-datalake-prod/zzz/2022/09/
PRE day=01/
PRE day=02/
PRE day=03/
etc...
Here you can run MSCK REPAIR TABLE <table-name>, which automatically populates the table with the partition information because the structure is in the Hive key=value format. The same information can also be added by a Glue crawler.
The link below explains Hive-style and non-Hive-style partitioning formats in more detail.
https://aws.amazon.com/premiumsupport/knowledge-center/athena-create-use-partitioned-tables/
I have a problem: I have set enableUpdateCatalog=True and updateBehavior=LOG to update my Glue table, which has one partition key. After the job runs, no new partitions are added to my Glue catalog table, although the data in S3 is separated by the partition key I used. How do I get the job to partition my Glue catalog table automatically?
Currently I have to run boto3 create_partition manually to create partitions on my Glue catalog table. I want my job to create partitions automatically as it discovers them in the S3 path separated by partition keys.
Code:
additionalOptions = {
"enableUpdateCatalog": True,
"updateBehavior": "LOG"}
additionalOptions["partitionKeys"] = ["partition_key0", "partition_key1"]
my_df = glueContext.write_dynamic_frame_from_catalog(frame=last_transform, database=<dst_db_name>,
table_name=<dst_tbl_name>, transformation_ctx="DataSink1",
additional_options=additionalOptions)
job.commit()
PS: I am currently using PARQUET format
Am I missing any permissions that have to be added to my job so that it can create partitions itself?
I got it to work by adding useGlueParquetWriter: 'true' to the catalog table properties. I also added
format_options = {
'useGlueParquetWriter': True
}
in the write_dynamic_frame.from_catalog calls.
These steps got it to start working :)
I ran some tests with Redshift Spectrum.
I created two AWS Glue crawlers.
The first one, called hive-tst, scans:
s3://hive-test/type='a'/year='2021'/month='01'
s3://hive-test/type='b'/year='2021'/month='01'
s3://hive-test/type='c'/year='2021'/month='01'
s3://hive-test/type='d'/year='2021'/month='01'
s3://hive-test/type='e'/year='2021'/month='01'
The second one scans:
s3://non-hive-test/a/2021/01
s3://non-hive-test/b/2021/01
s3://non-hive-test/c/2021/01
s3://non-hive-test/d/2021/01
s3://non-hive-test/e/2021/01
Both have two files in each partition; all of them are Parquet files of about 50 MB.
Then I ran a test querying the first partition column of each Spectrum table:
select distinct event from test.hive_tst;
It took 8s 272ms
select distinct partition_0 from test.nonhive_tst;
It took 8s 66ms
So adding the = does not seem to improve performance.
I also checked that both tables report the same Hive formats for their partitions.
select *
from svv_external_partitions
where schemaname='test'
and tablename='hive_tst';
values: ["a","2021","01"]
location: s3://hive-test/event=a/year=2021/month=01/
input_format: org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat
output_format: org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat
serialization_lib: org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe
select *
from svv_external_partitions
where schemaname='test'
and tablename='nonhive_tst';
values: ["a","2021","01"]
location: s3://hive-test/a/2021/01/
input_format: org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat
output_format: org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat
serialization_lib: org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe
Maybe the data volume in the folders is not enough for a meaningful test, but everything (execution times and the partition format reported by svv_external_partitions) seems the same.
So the question is:
Does AWS Spectrum really need = in the S3 location to understand it as Hive format?
Finally, after a lot of searching and reading, I came to a conclusion:
Both S3 buckets have partitions, and since we use AWS Glue, all partitions are added automatically.
The only difference is that a prefix such as year=2020 follows the Hive naming convention, so AWS Glue knows how to handle it when it adds the partitions, and the partition columns get a meaningful name such as year instead of partition_x.
So, to answer "Does AWS Spectrum really need = in the S3 location to understand it as Hive format?":
No, you don't need it for the location to be understood as Hive format; you need it only to get the Hive naming convention.
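To make the naming difference concrete, here is a small sketch that imitates the behaviour described above: folder levels written as name=value keep their name, while plain folder levels fall back to positional names. The function partition_columns is a hypothetical helper for illustration only and is not Glue's actual implementation.

```python
def partition_columns(key_prefix):
    """Return {column_name: value} for each folder level of an S3 key prefix."""
    columns = {}
    levels = [part for part in key_prefix.split("/") if part]
    for index, part in enumerate(levels):
        if "=" in part:
            # Hive naming convention: the column name is taken from the path
            name, value = part.split("=", 1)
        else:
            # No name in the path, so use a positional placeholder
            name, value = f"partition_{index}", part
        columns[name] = value
    return columns


print(partition_columns("event=a/year=2021/month=01/"))
# {'event': 'a', 'year': '2021', 'month': '01'}
print(partition_columns("a/2021/01/"))
# {'partition_0': 'a', 'partition_1': '2021', 'partition_2': '01'}
```

Both layouts yield the same values; only the column names differ, which matches the identical values column in the two svv_external_partitions results.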
Sources
My own test above
AWS Glue partitioned data post
Hive naming convention on S3
I am trying to query an AWS S3 Inventory list using Athena. I can do this if I have only one source bucket, but I am not sure how to configure it to work with multiple source buckets.
We are using all the default configuration options, with CSV as the data format. The S3 Inventory destination path pattern for hive looks like this:
destination-prefix/source-bucket/config-ID/hive/dt=YYYY-MM-DD-HH-MM/symlink.txt
So when I create the Athena table I have to use a static hive path.
CREATE EXTERNAL TABLE your_table_name(
-- column names
)
PARTITIONED BY (dt string)
-- options ignored
LOCATION 's3://destination-prefix/source-bucket/config-ID/hive/';
So if I want to query inventory data for multiple source buckets, it seems I have to create a table for each "source-bucket".
Alternatively, without using Athena, I am trying to do this with the AWS CLI:
aws s3 ls s3://our-bucket-name/prefix/abc --recursive | awk '$1 > "2019-04-01"'
But this lists every single file first, as there is no --include or --exclude option for "s3 ls".
Finally, the questions are:
Can I configure S3 Inventory to generate the inventory for multiple S3 buckets so that it puts everything into the same "hive" directory (i.e. ignores the "source-bucket" prefix while generating the inventory)?
Is it possible to configure Athena to read from multiple hive locations? With the possibility of new buckets getting created and dropped, I guess this gets ugly.
Is there any alternative way to query the inventory list, other than Athena, the AWS CLI, or writing custom code that uses the manifest.json file to get these CSV files?
You can't make S3 Inventory create one inventory for multiple buckets, unfortunately. You can however splice the inventories together into one table.
The guide you link to says to run MSCK REPAIR TABLE … to load your inventories. I would recommend not doing that, because it creates odd tables in which each partition represents the inventory at some point in time. That is useful if you want to compare what's in a bucket from day to day or week to week, but it is probably not what you want most of the time. Most of the time you want to know what's in the bucket right now. To get multiple inventories into the same table you should also not run that command.
First you change how you create the table slightly:
CREATE EXTERNAL TABLE your_table_name(
-- column names
)
PARTITIONED BY (bucket_name string)
-- options ignored
LOCATION 's3://destination-prefix/source-bucket/config-ID/hive/';
Notice that I changed the partitioning from dt string to bucket_name string.
Next you add the partitions manually:
ALTER TABLE your_table_name ADD
PARTITION (bucket_name = 'some-bucket') LOCATION 's3://destination-prefix/source-bucket/config-ID1/hive/dt=YYYY-MM-DD/'
PARTITION (bucket_name = 'another-bucket') LOCATION 's3://destination-prefix/source-bucket/config-ID2/hive/dt=YYYY-MM-DD/';
The locations should be the S3 URIs to the latest dates under the "hive" prefix of the inventory for each bucket.
The downside of this is that when new inventories are delivered you will need to update the table to point to these new locations. You can do this by first dropping the partitions:
ALTER TABLE your_table_name DROP
PARTITION (bucket_name = 'some-bucket'),
PARTITION (bucket_name = 'another-bucket');
and then adding them again using the same SQL as above, but with new S3 URIs.
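The drop-and-re-add step lends itself to a small script. The sketch below is one way to do it under some assumptions: the list of dt=... folder names for each bucket has already been fetched (for example with an S3 listing), the folder names sort chronologically as text, and the helper name latest_partition_sql and the bucket names in the usage example are hypothetical.

```python
def latest_partition_sql(table, inventories):
    """Build the DROP and ADD statements that repoint each bucket's partition.

    inventories maps a source bucket name to a tuple of
    (URI of its "hive" prefix, list of dt=... folder names found under it).
    """
    drops, adds = [], []
    for bucket, (hive_uri, dt_folders) in sorted(inventories.items()):
        latest = max(dt_folders)  # dt=YYYY-MM-DD-HH-MM sorts chronologically as text
        drops.append(f"PARTITION (bucket_name = '{bucket}')")
        adds.append(
            f"PARTITION (bucket_name = '{bucket}') "
            f"LOCATION '{hive_uri.rstrip('/')}/{latest}/'"
        )
    drop_sql = f"ALTER TABLE {table} DROP IF EXISTS\n" + ",\n".join(drops) + ";"
    add_sql = f"ALTER TABLE {table} ADD\n" + "\n".join(adds) + ";"
    return drop_sql, add_sql


drop_sql, add_sql = latest_partition_sql(
    "your_table_name",
    {
        "some-bucket": (
            "s3://destination-prefix/some-bucket/config-ID1/hive/",
            ["dt=2019-04-01-00-00", "dt=2019-04-02-00-00"],
        ),
    },
)
print(drop_sql)
print(add_sql)
```

Run the DROP statement first and the ADD statement second, each as its own Athena query.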
I have a Spark batch job which is executed hourly. Each run generates and stores new data in S3 with the directory naming pattern DATA/YEAR=?/MONTH=?/DATE=?/datafile.
After uploading the data to S3, I want to investigate it using Athena. Also, I would like to visualize them in QuickSight by connecting to Athena as a data source.
The problem is that after each run of my Spark batch, the newly generated data stored in S3 will not be discovered by Athena, unless I manually run the query MSCK REPAIR TABLE.
Is there a way to make Athena update the data automatically, so that I can create a fully automatic data visualization pipeline?
There are a number of ways to schedule this task. How do you schedule your workflows? Do you use a system like Airflow, Luigi, Azkaban, or cron, or an AWS Data Pipeline?
From any of these, you should be able to fire off the following CLI command.
$ aws athena start-query-execution --query-string "MSCK REPAIR TABLE some_database.some_table" --result-configuration "OutputLocation=s3://SOMEPLACE"
Another option would be AWS Lambda. You could have a function that calls MSCK REPAIR TABLE some_database.some_table in response to a new upload to S3.
An example Lambda Function could be written as such:
import boto3

def lambda_handler(event, context):
    bucket_name = 'some_bucket'
    client = boto3.client('athena')

    # Where Athena writes the query results
    config = {
        'OutputLocation': 's3://' + bucket_name + '/',
        'EncryptionConfiguration': {'EncryptionOption': 'SSE_S3'}
    }

    # Query Execution Parameters
    sql = 'MSCK REPAIR TABLE some_database.some_table'
    # Named query_context so it does not shadow the handler's context argument
    query_context = {'Database': 'some_database'}

    client.start_query_execution(QueryString=sql,
                                 QueryExecutionContext=query_context,
                                 ResultConfiguration=config)
You would then configure a trigger to execute your Lambda function when new data are added under the DATA/ prefix in your bucket.
Ultimately, explicitly rebuilding the partitions after you run your Spark Job using a job scheduler has the advantage of being self documenting. On the other hand, AWS Lambda is convenient for jobs like this one.
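Note that start_query_execution only submits the query and returns a QueryExecutionId; it does not wait for the repair to finish. If a later pipeline step depends on the partitions being present, a polling helper along these lines may be useful. This is a sketch: wait_for_query is a hypothetical name, client is assumed to be a boto3 Athena client, and the polling interval and limit are arbitrary.

```python
import time


def wait_for_query(client, query_execution_id, poll_seconds=2.0, max_polls=150):
    """Poll Athena until the query reaches a final state and return that state."""
    for _ in range(max_polls):
        response = client.get_query_execution(QueryExecutionId=query_execution_id)
        state = response["QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            return state
        time.sleep(poll_seconds)  # still QUEUED or RUNNING
    raise TimeoutError(f"query {query_execution_id} not finished after {max_polls} polls")


# Usage, assuming boto3 is available and credentials are configured:
# client = boto3.client("athena")
# query_id = client.start_query_execution(...)["QueryExecutionId"]
# wait_for_query(client, query_id)
```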
You should be running ADD PARTITION instead:
aws athena start-query-execution --query-string "ALTER TABLE ADD PARTITION..."
This adds the newly created partition from your S3 location.
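As a sketch of how the elided statement could be filled in, the helper below derives it from the key of a newly written object, such as the key delivered in an S3 event. It assumes integer partition columns named year, month and date and the DATA/YEAR=?/MONTH=?/DATE=?/ layout from the question; add_partition_sql is a hypothetical name.

```python
import re

# Matches the folder part of keys such as DATA/YEAR=2022/MONTH=09/DATE=01/datafile
PARTITION_FOLDER = re.compile(r"^.*?YEAR=(\d+)/MONTH=(\d+)/DATE=(\d+)/")


def add_partition_sql(table, bucket, key):
    """Return an ALTER TABLE statement for the partition folder containing key,
    or None when the key does not follow the YEAR=/MONTH=/DATE= pattern."""
    match = PARTITION_FOLDER.match(key)
    if match is None:
        return None
    year, month, day = (int(group) for group in match.groups())
    # group(0) is the folder exactly as written in S3, including any zero padding
    location = f"s3://{bucket}/{match.group(0)}"
    return (
        f"ALTER TABLE {table} ADD IF NOT EXISTS "
        f"PARTITION (`year` = {year}, `month` = {month}, `date` = {day}) "
        f"LOCATION '{location}'"
    )


print(add_partition_sql("some_database.some_table", "some_bucket",
                        "DATA/YEAR=2022/MONTH=09/DATE=01/datafile"))
```

Because of IF NOT EXISTS, the statement is safe to submit once per uploaded file even when several files land in the same folder.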
Athena leverages Hive for partitioning data.
To create a table with partitions, you must define it during the CREATE TABLE statement. Use PARTITIONED BY to define the keys by which to partition data.
There are multiple ways to solve the issue and get the table updated:
1. Call MSCK REPAIR TABLE. This will scan ALL data. It's costly as every file is read in full (at least it's fully charged by AWS). Also it's painfully slow. In short: Don't do it!
2. Create partitions on your own by calling ALTER TABLE ADD PARTITION abc .... This is good in the sense that no data is scanned and costs are low. The query is also fast, so no problems here. It's also a good choice if you have a very cluttered file structure without any common pattern (which doesn't seem to be your case, as it's a nicely organised S3 key pattern). There are also downsides to this approach: A) It's hard to maintain. B) All partitions will be stored in the Glue catalog. This can become an issue when you have a lot of partitions, as they need to be read out and passed to Athena's and EMR's Hadoop infrastructure.
3. Use partition projection. There are two different styles you might want to evaluate. Here's the variant which creates the partitions for Hadoop at query time. This means no Glue catalog entries are sent over the network, and thus large numbers of partitions can be handled more quickly. The downside is that you might 'hit' some partitions that do not exist. These will of course be ignored, but internally all partitions that COULD match your query will be generated, no matter whether they are on S3 or not (so always add partition filters to your query!). If done correctly, this option is a fire-and-forget approach, as no updates are needed.
CREATE EXTERNAL TABLE `mydb`.`mytable`
(
...
)
PARTITIONED BY (
`YEAR` int,
`MONTH` int,
`DATE` int)
...
LOCATION
's3://DATA/'
TBLPROPERTIES(
"projection.enabled" = "true",
"projection.YEAR.type" = "integer",
"projection.YEAR.range" = "2020,2025",
"projection.MONTH.type" = "integer",
"projection.MONTH.range" = "1,12",
"projection.DATE.type" = "integer",
"projection.DATE.range" = "1,31",
"storage.location.template" = "s3://DATA/YEAR=${YEAR}/MONTH=${MONTH}/DATE=${DATE}/"
);
https://docs.aws.amazon.com/athena/latest/ug/partition-projection.html
4. Just to list all options: You can also use Glue crawlers. But that doesn't seem to be a favourable approach, as it's not as flexible as advertised.
5. You get more control over Glue by using the Glue Data Catalog API directly, which might be an alternative to approach #2 if you have a lot of automated scripts that do the preparation work to set up your table.
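For the Glue Data Catalog API route, a common pattern is to copy the table's storage descriptor and change only the location. The sketch below builds the input for one partition; partition_input is a hypothetical helper, and the database, table and path names in the usage comments are placeholders.

```python
import copy


def partition_input(table, values, location):
    """Build a PartitionInput dict from a table definition.

    table is shaped like glue.get_table(...)["Table"]; values are the partition
    key values in the order of the table's partition keys.
    """
    descriptor = copy.deepcopy(table["StorageDescriptor"])  # leave the table dict untouched
    descriptor["Location"] = location
    return {"Values": [str(v) for v in values], "StorageDescriptor": descriptor}


# Usage, assuming boto3 is available and credentials are configured:
# glue = boto3.client("glue")
# table = glue.get_table(DatabaseName="mydb", Name="mytable")["Table"]
# glue.batch_create_partition(
#     DatabaseName="mydb",
#     TableName="mytable",
#     PartitionInputList=[
#         partition_input(table, [2022, 9, 1], "s3://DATA/YEAR=2022/MONTH=9/DATE=1/"),
#     ],
# )
```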
In short:
If your application is SQL-centric and you like the leanest approach with no scripts, use partition projection
If you have many partitions, use partition projection
If you have a few partitions or partitions do not have a generic pattern, use approach #2
If you're script heavy and scripts do most of the work anyway and are easier to handle for you, consider approach #5
If you're confused and have no clue where to start - try partition projection first! It should fit 95% of the use cases.
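To illustrate why partition filters matter with partition projection, the sketch below enumerates the candidate locations that the ranges and the storage.location.template from the example table describe. It is only an illustration of the idea, not how Athena works internally, and projected_locations is a hypothetical helper.

```python
from itertools import product
from string import Template

TEMPLATE = Template("s3://DATA/YEAR=${YEAR}/MONTH=${MONTH}/DATE=${DATE}/")
RANGES = {"YEAR": range(2020, 2026), "MONTH": range(1, 13), "DATE": range(1, 32)}


def projected_locations(filters=None):
    """List every location the ranges allow, narrowed by equality filters."""
    filters = filters or {}
    names = list(RANGES)
    # A filtered column contributes one value, an unfiltered one its whole range
    candidates = [[filters[n]] if n in filters else RANGES[n] for n in names]
    return [TEMPLATE.substitute(dict(zip(names, combo))) for combo in product(*candidates)]


print(len(projected_locations()))                           # 2232
print(len(projected_locations({"YEAR": 2022, "MONTH": 9})))  # 31
```

Without a filter, all 6 x 12 x 31 combinations are candidates, including dates that cannot exist such as February 31; filtering on year and month cuts that to 31.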
As per this AWS Forum Thread, does anyone know how to use AWS Glue to create an AWS Athena table whose partitions contain different schemas (in this case different subsets of columns from the table schema)?
At the moment, when I run the crawler over this data and then make a query in Athena, I get the error 'HIVE_PARTITION_SCHEMA_MISMATCH'
My use case is:
Partitions represent days
Files represent events
Each event is a json blob in a single s3 file
An event contains a subset of columns (dependent on the type of event)
The 'schema' of the entire table is the full set of columns for all the event types (this is correctly put together by Glue crawler)
The 'schema' of each partition is the subset of columns for the event types that occurred on that day (hence in Glue each partition potentially has a different subset of columns from the table schema)
This inconsistency causes the error in Athena I think
If I were to manually write a schema I could do this fine as there would just be one table schema, and keys which are missing in the JSON file would be treated as Nulls.
Thanks in advance!
I had the same issue and solved it by configuring the crawler to update table metadata for preexisting partitions.
It also fixed my issue!
If somebody needs to provision a crawler with this configuration using Terraform, here is how I did it:
resource "aws_glue_crawler" "crawler-s3-rawdata" {
database_name = "my_glue_database"
name = "my_crawler"
role = "my_iam_role.arn"
configuration = <<EOF
{
"Version": 1.0,
"CrawlerOutput": {
"Partitions": { "AddOrUpdateBehavior": "InheritFromTable" }
}
}
EOF
s3_target {
path = "s3://mybucket"
}
}
This helped me.
Despite selecting "Update all new and existing partitions with metadata from the table" in the crawler's configuration, it still occasionally failed to set the expected parameters for all partitions (in my case, jsonPath wasn't inherited from the table's properties).
As suggested in https://docs.aws.amazon.com/athena/latest/ug/updates-and-partitions.html, dropping the partition that is causing the error and recreating it helped.
After I dropped the problematic partitions, the Glue crawler re-created them correctly on its next run.
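Finding the partitions to drop can be scripted. The sketch below assumes the mismatch shows up as a column whose type in the partition differs from its type in the table, which may not cover every cause (a missing parameter such as jsonPath would need a different check). mismatched_partitions is a hypothetical helper that works on dicts shaped like the Glue get_table and get_partitions responses.

```python
def mismatched_partitions(table_columns, partitions):
    """Return the Values of partitions whose column types disagree with the table.

    table_columns: list of {"Name": ..., "Type": ...} dicts from the table.
    partitions: list of partition dicts, each with "Values" and "StorageDescriptor".
    """
    table_types = {column["Name"]: column["Type"] for column in table_columns}
    bad = []
    for partition in partitions:
        columns = partition["StorageDescriptor"]["Columns"]
        # Columns missing from the table are ignored; only type conflicts count
        if any(
            column["Name"] in table_types and table_types[column["Name"]] != column["Type"]
            for column in columns
        ):
            bad.append(partition["Values"])
    return bad


# Each returned value list can then be dropped so the next crawler run recreates it:
# glue.delete_partition(DatabaseName="mydb", TableName="mytable", PartitionValues=values)
```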