Automate AWS Athena partition loading [duplicate]

I have a Spark batch job which is executed hourly. Each run generates and stores new data in S3 with the directory naming pattern DATA/YEAR=?/MONTH=?/DATE=?/datafile.
After uploading the data to S3, I want to investigate it using Athena. I would also like to visualize it in QuickSight by connecting to Athena as a data source.
The problem is that after each run of my Spark batch, the newly generated data stored in S3 is not discovered by Athena unless I manually run the query MSCK REPAIR TABLE.
Is there a way to make Athena update the data automatically, so that I can create a fully automatic data visualization pipeline?

There are a number of ways to schedule this task. How do you schedule your workflows? Do you use a system like Airflow, Luigi, Azkaban, or cron, or do you use AWS Data Pipeline?
From any of these, you should be able to fire off the following CLI command.
$ aws athena start-query-execution --query-string "MSCK REPAIR TABLE some_database.some_table" --result-configuration "OutputLocation=s3://SOMEPLACE"
Another option would be AWS Lambda. You could have a function that calls MSCK REPAIR TABLE some_database.some_table in response to a new upload to S3.
An example Lambda Function could be written as such:
import boto3


def lambda_handler(event, context):
    bucket_name = 'some_bucket'
    client = boto3.client('athena')

    # Where Athena writes the query results
    config = {
        'OutputLocation': 's3://' + bucket_name + '/',
        'EncryptionConfiguration': {'EncryptionOption': 'SSE_S3'}
    }

    # Query execution parameters
    sql = 'MSCK REPAIR TABLE some_database.some_table'
    # Named query_context so it does not shadow the handler's context argument
    query_context = {'Database': 'some_database'}

    client.start_query_execution(QueryString=sql,
                                 QueryExecutionContext=query_context,
                                 ResultConfiguration=config)
You would then configure a trigger to execute your Lambda function when new data are added under the DATA/ prefix in your bucket.
Ultimately, explicitly rebuilding the partitions after you run your Spark job using a job scheduler has the advantage of being self-documenting. On the other hand, AWS Lambda is convenient for jobs like this one.
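If you go the Lambda route, the S3 event already tells you which partitions were touched. Here is a minimal sketch of pulling the partition values out of the event, assuming the key layout from the question (DATA/YEAR=?/MONTH=?/DATE=?/datafile) with numeric values. The name partitions_from_event is my own, not part of any AWS API.

```python
import re
from urllib.parse import unquote_plus

# Assumed key layout from the question: DATA/YEAR=?/MONTH=?/DATE=?/datafile
KEY_PATTERN = re.compile(r"^DATA/YEAR=(\d+)/MONTH=(\d+)/DATE=(\d+)/[^/]+$")


def partitions_from_event(event):
    """Return the distinct (year, month, date) tuples touched by an S3 event."""
    found = set()
    for record in event.get("Records", []):
        # Object keys in S3 event notifications are URL-encoded ("=" becomes "%3D").
        key = unquote_plus(record["s3"]["object"]["key"])
        match = KEY_PATTERN.match(key)
        if match:
            found.add(tuple(int(part) for part in match.groups()))
    return sorted(found)
```

The handler could then repair or add only those partitions instead of reacting blindly to every upload.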

You should be running ADD PARTITION instead:
aws athena start-query-execution --query-string "ALTER TABLE ADD PARTITION..."
This adds the newly created partition from your S3 location.
Athena leverages Hive for partitioning data.
To create a table with partitions, you must define them in the CREATE TABLE statement. Use PARTITIONED BY to define the keys by which to partition data.
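As a rough sketch, the statement for one day's partition could be built like this. It assumes the YEAR/MONTH/DATE keys from the question are integer columns and that the folder names are not zero-padded; add_partition_sql and the bucket name are made up for illustration.

```python
from datetime import date


def add_partition_sql(table, day, location_root):
    """Build an ALTER TABLE ... ADD PARTITION statement for one day."""
    root = location_root.rstrip("/")
    location = f"{root}/YEAR={day.year}/MONTH={day.month}/DATE={day.day}/"
    # Backticks because DATE is a reserved word in Athena DDL.
    return (
        f"ALTER TABLE {table} ADD IF NOT EXISTS "
        f"PARTITION (`YEAR`={day.year}, `MONTH`={day.month}, `DATE`={day.day}) "
        f"LOCATION '{location}'"
    )


if __name__ == "__main__":
    # Hypothetical table and bucket names.
    print(add_partition_sql("some_database.some_table",
                            date(2021, 3, 14),
                            "s3://some_bucket/DATA/"))
```

The resulting string can be passed as the --query-string of the CLI call above. IF NOT EXISTS makes the call safe to repeat for the same day.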

There are multiple ways to solve the issue and get the table updated:
1. Call MSCK REPAIR TABLE. This will scan ALL data. It's costly, as every file is read in full (at least it's fully charged by AWS). It's also painfully slow. In short: don't do it!
2. Create partitions on your own by calling ALTER TABLE ADD PARTITION abc .... This is good in the sense that no data is scanned and costs are low. The query is also fast, so no problems here. It's also a good choice if you have a very cluttered file structure without any common pattern (which doesn't seem to be your case, as you have a nicely organised S3 key pattern). There are also downsides to this approach: A) it's hard to maintain; B) all partitions will be stored in the Glue catalog. This can become an issue when you have a lot of partitions, as they need to be read out and passed to Athena and EMR's Hadoop infrastructure.
3. Use partition projection. There are two different styles you might want to evaluate. Here's the variant which creates the partitions for Hadoop at query time. This means no Glue catalog entries are sent over the network, and thus large numbers of partitions can be handled more quickly. The downside is that you might 'hit' some partitions that do not exist. These will of course be ignored, but internally all partitions that COULD match your query will be generated, no matter whether they are on S3 or not (so always add partition filters to your query!). If done correctly, this option is a fire-and-forget approach, as no updates are needed.
CREATE EXTERNAL TABLE `mydb`.`mytable`
(
  ...
)
PARTITIONED BY (
  `YEAR` int,
  `MONTH` int,
  `DATE` int)
...
LOCATION
  's3://DATA/'
TBLPROPERTIES(
  "projection.enabled" = "true",
  "projection.YEAR.type" = "integer",
  "projection.YEAR.range" = "2020,2025",
  "projection.MONTH.type" = "integer",
  "projection.MONTH.range" = "1,12",
  "projection.DATE.type" = "integer",
  "projection.DATE.range" = "1,31",
  "storage.location.template" = "s3://DATA/YEAR=${YEAR}/MONTH=${MONTH}/DATE=${DATE}/"
);
https://docs.aws.amazon.com/athena/latest/ug/partition-projection.html
4. Just to list all options: you can also use Glue crawlers. But that doesn't seem to be a favourable approach, as it's not as flexible as advertised.
5. You get more control over Glue by using the Glue Data Catalog API directly, which might be an alternative to approach #2 if you have a lot of automated scripts that do the preparation work to set up your table.
In short:
If your application is SQL-centric and you like the leanest approach with no scripts, use partition projection
If you have many partitions, use partition projection
If you have a few partitions or partitions do not have a generic pattern, use approach #2
If you're script-heavy, and scripts do most of the work anyway and are easier for you to handle, consider approach #5
If you're confused and have no clue where to start - try partition projection first! It should fit 95% of the use cases.
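To make the "all partitions that COULD match will be generated" point concrete, here is a small Python model of the projection. It is only an illustration of the idea, not how Athena implements it; the ranges and the template are copied from the YEAR/MONTH/DATE properties of the table above, and projected_locations is a made-up helper.

```python
from itertools import product

# Copied from the TBLPROPERTIES of the example table.
TEMPLATE = "s3://DATA/YEAR=${YEAR}/MONTH=${MONTH}/DATE=${DATE}/"
RANGES = {"YEAR": (2020, 2025), "MONTH": (1, 12), "DATE": (1, 31)}


def projected_locations(filters=None):
    """List every location the projection could produce for the given filters.

    filters maps a partition key to the set of values allowed by the WHERE
    clause; keys without a filter expand to their full configured range.
    """
    filters = filters or {}
    names = list(RANGES)
    value_lists = []
    for name in names:
        low, high = RANGES[name]
        allowed = filters.get(name)
        value_lists.append([value for value in range(low, high + 1)
                            if allowed is None or value in allowed])
    locations = []
    for combination in product(*value_lists):
        location = TEMPLATE
        for name, value in zip(names, combination):
            location = location.replace("${" + name + "}", str(value))
        locations.append(location)
    return locations


if __name__ == "__main__":
    print(len(projected_locations()))                              # no filter
    print(len(projected_locations({"YEAR": {2021}, "MONTH": {3}})))  # filtered
```

Without any filter this model produces 6 x 12 x 31 = 2232 candidate locations, whether or not they exist on S3. Filtering on YEAR and MONTH cuts that down to 31, which is why partition filters matter so much with this approach.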

Related

AWS Glue enableUpdateCatalog not creating new partitions after successful job run

I am having a problem where I have set enableUpdateCatalog=True and also updateBehaviour=LOG to update my Glue table, which has 1 partition key. After the job runs, no new partitions are added to my Glue catalog table, but the data in S3 is separated by the partition key I have used. How do I get the job to automatically partition my Glue catalog table?
Currently I have to manually run boto3 create_partition to create partitions on my Glue catalog table. I want my job to be able to create partitions automatically as it discovers them in the S3 path separated by partition keys.
Code:
additionalOptions = {
    "enableUpdateCatalog": True,
    "updateBehavior": "LOG"}
additionalOptions["partitionKeys"] = ["partition_key0", "partition_key1"]
my_df = glueContext.write_dynamic_frame_from_catalog(frame=last_transform, database=<dst_db_name>,
                                                     table_name=<dst_tbl_name>, transformation_ctx="DataSink1",
                                                     additional_options=additionalOptions)
job.commit()
PS: I am currently using PARQUET format
Am I missing any permissions that have to be added to my job so that it can create partitions from the job itself?
I got it to work by adding useGlueParquetWriter: 'true' to the CATALOG table properties. I have also added
format_options = {
    'useGlueParquetWriter': True
}
to the write_dynamic_frame.from_catalog calls.
These steps got it working :)
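For reference, the manual boto3 create_partition workaround mentioned in the question could look roughly like the sketch below. It assumes Hive-style key=value folders under the table location; build_partition_input and register_partition are hypothetical helper names, and glue is expected to be a boto3 Glue client.

```python
import copy


def build_partition_input(table, values):
    """Derive a PartitionInput from a table definition and partition values.

    table is the "Table" dict returned by the Glue get_table call.
    """
    keys = [key["Name"] for key in table["PartitionKeys"]]
    if len(keys) != len(values):
        raise ValueError("expected one value per partition key")
    # Reuse the table's storage settings and only change the location.
    descriptor = copy.deepcopy(table["StorageDescriptor"])
    suffix = "/".join(f"{key}={value}" for key, value in zip(keys, values))
    descriptor["Location"] = descriptor["Location"].rstrip("/") + "/" + suffix + "/"
    return {"Values": [str(value) for value in values],
            "StorageDescriptor": descriptor}


def register_partition(glue, database, table_name, values):
    """Create one partition in the Glue catalog (glue = boto3.client('glue'))."""
    table = glue.get_table(DatabaseName=database, Name=table_name)["Table"]
    glue.create_partition(DatabaseName=database,
                          TableName=table_name,
                          PartitionInput=build_partition_input(table, values))
```

create_partition raises an error if the partition already exists, so a real script would need to handle that case.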

AWS Glue - table version increases on data load even with no schema changes

I have a lambda job which infrequently dumps a parquet file into an S3 bucket/Glue table using AWS Wrangler.
This Glue table appears to be increasing the table version number every time there is new data, even though the schema is unchanged.
I do not think the problem is with the lambda job/wrangler, since it deposits the parquet files as expected. I have also tested that code separately and it works as expected.
Something is going on with the Glue data catalogue table that makes it increase versions despite no changes to the schema.
I have checked for differences in the underlying parquet files to see if there are some schema, data type etc changes between updates, and there are none.
I have checked for differences between the Glue table versions via the console and AWS CLI (aws glue get-table-versions) and found no differences there either (only the UpdateTime and VersionId changes).
I have tried to recreate my setup with the same code and do not find this issue. I have tried to delete and recreate the Glue table in the same place, but the issue reoccurs.
Question: What could be causing my Glue table version numbers to increase when there are no schema changes?
Note:
The code in question looks like this. It's part of a bigger function (this is really just generating logs of what the main lambda function is doing). It works fine on its own and doesn't use variables etc from the rest of the code. I don't see how this could be the issue but including it here anyway.
# Other functions do some things when triggered by a new file in another S3 bucket.
# This function just logs which files were processed. The Glue table built from these log files is the one whose version number increases every time a new log file is added.
import awswrangler as wr


def log(resource, filename):
    log_df = build_log(resource, filename)  # builds the log df: just columns of date, time, file used, etc.
    wr.s3.to_parquet(
        df=log_df,
        path=log_path(),  # S3 location where the parquet logs are written
        dataset=True,
        catalog_versioning=False,
        database="MYDB",
        partition_cols=['date'],
        table='log',
        mode='append'
    )
I think this is due to partitioning. You are partitioning by date, so I guess a new partition is added for every new day. The new partitions are the reason why the table version is being incremented.
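The version comparison described in the question can also be scripted, which makes it easier to spot a nested field that changes between versions. A minimal sketch, assuming each argument is the "Table" dict of one entry returned by get_table_versions; diff_tables is a made-up helper.

```python
def diff_tables(old, new, ignore=("UpdateTime", "VersionId")):
    """Return {dotted.path: (old_value, new_value)} for every differing field."""
    changes = {}

    def walk(left, right, path):
        if isinstance(left, dict) and isinstance(right, dict):
            for key in sorted(set(left) | set(right)):
                if key in ignore:
                    continue
                walk(left.get(key), right.get(key),
                     f"{path}.{key}" if path else key)
        elif left != right:
            changes[path] = (left, right)

    walk(old, new, "")
    return changes
```

An empty result means the two versions differ only in the ignored fields.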

Does AWS Spectrum really need = in the S3 location to understand it as Hive format?

I ran some tests with Spectrum.
I created two AWS Glue crawlers.
The first one, called hive-tst, scans:
s3://hive-test/type='a'/year='2021'/month='01'
s3://hive-test/type='b'/year='2021'/month='01'
s3://hive-test/type='c'/year='2021'/month='01'
s3://hive-test/type='d'/year='2021'/month='01'
s3://hive-test/type='e'/year='2021'/month='01'
The second one scans:
s3://non-hive-test/a/2021/01
s3://non-hive-test/b/2021/01
s3://non-hive-test/c/2021/01
s3://non-hive-test/d/2021/01
s3://non-hive-test/e/2021/01
Both have two files in each partition, and each file is a 50 MB parquet file.
Then I ran a test, querying the first partition column of each Spectrum table:
select distinct event from test.hive_tst;
It took 8s 272ms
select distinct partition_0 from test.nonhive_tst;
It took 8s 66ms
So it doesn't seem that adding the = improves performance.
I also checked that both tables have Hive format in their partitions.
select *
from svv_external_partitions
where schemaname='test'
and tablename='hive_tst';
values            : ["a","2021","01"]
location          : s3://hive-test/event=a/year=2021/month=01/
input_format      : org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat
output_format     : org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat
serialization_lib : org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe
select *
from svv_external_partitions
where schemaname='test'
and tablename='nonhive_tst';
values            : ["a","2021","01"]
location          : s3://hive-test/a/2021/01/
input_format      : org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat
output_format     : org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat
serialization_lib : org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe
Maybe the data volume in the folders is not enough to test it, but everything (the execution times and the partition format shown by svv_external_partitions) seems the same.
So the question is:
Does AWS Spectrum really need = in the S3 location to understand it as Hive format?
Finally, after a lot of searching and reading, I came to a conclusion:
Both S3 buckets have partitions, and as we use AWS Glue, all partitions are added automatically.
The only difference is that a prefix such as year=2020 follows the Hive naming convention, so AWS Glue knows how to handle it when adding the partitions, and the partition columns then get a meaningful name such as year instead of partition_x.
So, to answer the question: does AWS Spectrum really need = in the S3 location to understand it as Hive format?
No, you don't need it for the data to be understood as Hive format, only for it to follow the Hive naming convention.
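The naming difference can be illustrated with a few lines of Python. This is only a model of the behaviour I observed in the test above (name=value folders keep their name, anything else becomes partition_N), not the crawler's actual implementation; partition_columns is a made-up function.

```python
def partition_columns(path_segments):
    """Map folder names to partition column names and values."""
    columns = {}
    for index, segment in enumerate(path_segments):
        name, separator, value = segment.partition("=")
        if separator and name:
            # Hive naming convention: the folder is called name=value.
            columns[name] = value
        else:
            # No "=" in the folder name, so fall back to a positional name.
            columns[f"partition_{index}"] = segment
    return columns


if __name__ == "__main__":
    print(partition_columns(["event=a", "year=2021", "month=01"]))
    print(partition_columns(["a", "2021", "01"]))
```

Both layouts end up with the same three partition values; only the column names differ.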
Sources
My own test above
AWS Glue partitioned data post
Hive naming convention on S3

Can AWS Glue write to DynamoDB?

I need to do some grouping job from a Source DynamoDB table, then write each resulting Item to another Target DynamoDB table (or a secondary index of the Source one).
Here I see that DynamoDB can be used as a Source (as well as reported in Connection Types).
However, it's not clear to me if a DynamoDB table can be used as Target as well.
Note: each resulting grouping item must be written into a separate DynamoDB Item (i.e., if there are X objects resulting from grouping, X Items must be written to Target DynamoDB table).
Glue can now read and write to DynamoDB. The option to write is not available via the console, but can be done by editing the script.
Example:
Datasink1 = glueContext.write_dynamic_frame_from_options(
    frame=ApplyMapping_Frame1,
    connection_type="dynamodb",
    connection_options={
        "dynamodb.output.tableName": "myDDBTable",
        "dynamodb.throughput.write.percent": "1.0"
    }
)
As per:
https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-etl-connect.html#etl-connect-dynamodb-as-sink
https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-etl-dynamo-db-cross-account.html
Glue job scripts can be customized to write to any data source. If you are using the auto-generated scripts, you can use the boto3 library to write to DynamoDB tables.
If you want to test the scripts easily, you can create a Dev endpoint through the AWS console and launch a Jupyter notebook to write and test your Glue job scripts.
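A minimal sketch of the boto3 route, writing every grouped result as its own item. The table name is taken from the example above; the item attributes and the helper name write_items are placeholders, and the rows are assumed to already be plain Python dicts (for example collected from the grouped frame).

```python
def write_items(table, items):
    """Write each dict as a separate DynamoDB item; return how many were sent."""
    count = 0
    # batch_writer buffers the puts and sends them in batches.
    with table.batch_writer() as batch:
        for item in items:
            batch.put_item(Item=item)
            count += 1
    return count


def main():
    import boto3  # imported here so write_items can be tried out without AWS access

    table = boto3.resource("dynamodb").Table("myDDBTable")
    # Placeholder rows; the attribute names are assumptions.
    grouped = [{"group_id": "a", "total": 3}, {"group_id": "b", "total": 5}]
    print(write_items(table, grouped))


if __name__ == "__main__":
    main()
```

Each item must contain the key attributes of the target table, and numeric values should be int or Decimal rather than float.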

AWS QuickSight can't see Athena DB in another region

My Athena DB is in the ap-south-1 region, and AWS QuickSight doesn't exist in that region.
How can I connect QuickSight with Athena in that case?
All you need to do is copy the table definitions from one region to another. There are several ways to do that.
With AWS Console
This approach is the simplest one and doesn't require additional setup, as everything is based on Athena DDL statements.
Get the table definition with
SHOW CREATE TABLE `database`.`table`;
This should output something like:
CREATE EXTERNAL TABLE `database`.`table`(
  `col_1` string,
  `col_2` bigint,
  ...
  `col_n` string)
ROW FORMAT SERDE
  'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
  's3://some/location/on/s3'
TBLPROPERTIES (
  'classification'='parquet',
  ...
  'compressionType'='gzip')
Change to the desired region.
Create a database where you want to store the table definitions, or use the default one.
Execute the statement produced by SHOW CREATE TABLE. Note that you might need to change the database name to match the previous step.
If your table is partitioned, you need to load all partitions.
If the data on S3 adheres to the Hive partitioning style, i.e.
s3://some/location/on/s3
|
├── day=01
| ├── hour=00
| └── hour=01
...
then you can use
MSCK REPAIR TABLE `database`.`table`
Alternatively, you can load partitions one by one
ALTER TABLE `database`.`table`
ADD PARTITION (day='01', hour='00')
LOCATION 's3://some/location/on/s3/01/00';
ALTER TABLE `database`.`table`
ADD PARTITION (day='01', hour='01')
LOCATION 's3://some/location/on/s3/01/01';
...
With AWS API
You can use an AWS SDK, e.g. boto3 for Python, which provides an easy-to-use, object-oriented API. Here you have two options:
Use the Athena client. As in the previous approach, you need to get the table definition statement from the AWS Console. But all other steps can be scripted using the start_query_execution method of the Athena client. There are plenty of resources online, e.g. this one
Use the AWS Glue client. This method is based solely on operations within the AWS Glue Data Catalog, which Athena uses during query execution. The main idea is to create two Glue clients, one for the source catalog and one for the destination catalog. For example
import boto3

KEY_ID = "__KEY_ID__"
SECRET = "__SECRET__"

glue_source = boto3.client(
    'glue',
    region_name="ap-south-1",
    aws_access_key_id=KEY_ID,
    aws_secret_access_key=SECRET
)
glue_destination = boto3.client(
    'glue',
    region_name="us-east-1",
    aws_access_key_id=KEY_ID,
    aws_secret_access_key=SECRET
)

# Or you can do it by creating sessions
glue_source = boto3.session.Session(profile_name="profile_for_ap_south_1").client("glue")
glue_destination = boto3.session.Session(profile_name="profile_for_us_east_1").client("glue")
Then you would need to use the get-type and create-type methods (for example get_tables and create_table). This also requires parsing the responses returned by the Glue clients.
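A rough sketch of that get/create step, assuming the database already exists in the destination catalog. The parsing mostly comes down to dropping the read-only fields (CreateTime, DatabaseName and so on) that get_tables returns but create_table does not accept. The helper names are my own, and the key list is a subset of what TableInput allows.

```python
# Subset of the fields accepted by TableInput in create_table.
TABLE_INPUT_KEYS = {
    "Name", "Description", "Owner", "Retention", "StorageDescriptor",
    "PartitionKeys", "ViewOriginalText", "ViewExpandedText",
    "TableType", "Parameters",
}


def to_table_input(table):
    """Strip a get_tables entry down to fields that create_table accepts."""
    return {key: value for key, value in table.items()
            if key in TABLE_INPUT_KEYS}


def copy_tables(glue_source, glue_destination, database):
    """Copy every table definition of one database between two Glue clients."""
    copied = []
    paginator = glue_source.get_paginator("get_tables")
    for page in paginator.paginate(DatabaseName=database):
        for table in page["TableList"]:
            glue_destination.create_table(DatabaseName=database,
                                          TableInput=to_table_input(table))
            copied.append(table["Name"])
    return copied
```

With the two clients created above, this would be called as copy_tables(glue_source, glue_destination, "database"). Partitions are not part of the table definition, so they still have to be loaded separately.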
With AWS Glue crawlers
Although you can use AWS Glue crawlers to "rediscover" data on S3, I wouldn't recommend this approach, since you already know the structure of your data.
The answer by @Ilya Kisil is correct, but I would like to add some more details and alternative solutions.
There are two different approaches you can take.
1. As suggested by Ilya, copy the table definitions from one region (source region) to another (destination region). The idea is to reference the data in the other region.
I found Glue crawlers much easier and faster. You need to create a Glue crawler in the source region and specify the S3 bucket of the destination region where the metadata is located. Once you do that, you will see all the tables of the destination region in Athena in the source region! Behind the scenes, the Glue crawler does what Ilya explained in the "With AWS Console" section. So, instead of creating the tables one by one and loading the partitions (if any), you can just create one Glue crawler.
Note that it holds a reference to your destination region tables, so it doesn't copy the data. At first glance, this seems great! Why should we copy the data if we can reference it? But when you take a deeper look, you may find that you are probably going to pay more. When you reference data, you pay for the data each query returns, and if you consume the data a lot and have TB/PB of data, it might be too expensive. If cost is a consideration for you, I would recommend you consider the second solution.
Also note that although the data is not copied to the source region and is just referenced, behind the scenes, when you execute a query, AWS saves the data temporarily in the source region. So if you need to be GDPR compliant, you might need to be aware of that.
2. Copy the data from the destination region to the source region and have a process that keeps synchronizing it. Then you will not pay for the Athena queries, but rather for the storage, which is usually cheaper. If possible, you can also copy just what you need or aggregate the data, so you have less copied storage and therefore less cost.
A convenient way to do this is to create a Glue job that is responsible for copying the data from the destination region S3 bucket to the source region S3 bucket. You can then add it to a Glue workflow that runs this job once a day, or at whatever interval suits you.
To Summarize:
There are lots of things to consider and I mentioned some of them. In each use case, you have advantages and disadvantages and you can find what is the right one for you.
(Solution 1) Advantages:
Easy. Just some clicks.
Fast.
Referencing the data and no need to have duplicated data.
(Solution 1) Disadvantages:
Might be way more expensive (depends on the data usage).
(Solution 2) Advantages:
Might be much cheaper
(Solution 2) Disadvantages:
Slow/Longer solution
Need to copy existing data and then have a process to copy new data