Assigning default table properties to tables created by a crawler - amazon-web-services

I'm trying to assign table properties to the tables that are created with a crawler.
The idea is to have all of the tables that are created with a crawler have the same default properties (plus the ones they'd usually have).
I examined the options in the crawler creation interface but didn't see such an option.
Creating a python boto3 script to alter table property values after the table creation is the only thing that comes to mind.
If this is not possible with the default crawler functionality, what is a viable approach to attach table properties to every table that is created with a certain crawler?
EDIT: One possible solution would be to create a lambda function that checks if the custom parameters exist on the glue tables and if not creates them.

Option 1
Directly adding the fields in the definition might be the best way in approaching this (using CloudFormation).
https://docs.amazonaws.cn/en_us/AWSCloudFormation/latest/UserGuide/aws-properties-glue-classifier-csvclassifier.html
Option 2
I guess there's some reason why you do not add the table fields directly. If this should be triggered by the data itself the clean way you might want to look into is writing custom classifiers:
https://docs.aws.amazon.com/glue/latest/dg/custom-classifier.html
Option 3
When you need a quick hack you could merge the schema by crawling an additional file with the schema info that's missing and let the crawler merge the fields:
If you have JSON S3 files for example (or any consistent format for your use case) you can add an additional init file and add the columns there. Set
{
"Version": 1.0,
"Grouping": {
"TableGroupingPolicy": "CombineCompatibleSchemas" }
}
Cite from AWS doc:
"To help illustrate this option, suppose that you define a crawler with an include path s3://bucket/table1/. When the crawler runs, it finds two JSON files with the following characteristics:
File 1 – S3://bucket/table1/year=2017/data1.json
File content – {“A”: 1, “B”: 2}
Schema – A:int, B:int
File 2 – S3://bucket/table1/year=2018/data2.json
File content – {“C”: 3, “D”: 4}
Schema – C: int, D: int
By default, the crawler creates two tables, named year_2017 and year_2018 because the schemas are not sufficiently similar. However, if the option Create a single schema for each S3 path is selected, and if the data is compatible, the crawler creates one table. The table has the schema
A:int,B:int,C:int,D:int and partitionKey year:string.
See https://docs.aws.amazon.com/glue/latest/dg/crawler-configuration.html

Related

AWS Glue - custom s3 partition to single table

I have multiple files stored into an S3 bucket under uniquely named folders which I would expect AWS Glue to put into a single table - instead it creates one per file. Any ideas how to configure the crawler to get a single table ?
The current tS3 structure is s3://bucket_name/YYYYMMDDUUID/data.json:
20210801123123cfec/data.json
20210808876551cedc/data.json
....
20210810112313feed/data.json
The json schema is definitely not a problem, it is similar - for example when I change the folder names from the custom names to "1", "2", ... etc I get a single table with multiple partitions.

AWS Quicksight cant see Athena DB in another region

My Athena DB is in ap-south-1 region and AWS QuickSight doesn't exist in that region.
How can I connect QuickSight with Athena in that case?
All you need to do is to copy table definitions from one region to another. There are several ways to do that
With AWS Console
This approach is the most simple one and doesn't require additional setup as everything is based on Athena DDL statements.
Get table definition with
SHOW CREATE TABLE `database`.`table`;
This should output something like:
CREATE EXTERNAL TABLE `database`.`table`(
`col_1` string,
`col_2` bigint,
...
`col_n` string)
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
's3://some/location/on/s3'
TBLPROPERTIES (
'classification'='parquet',
...
'compressionType'='gzip')
Change to a desired region
Create database where you want to store table definitions, or use default one.
Execute statement produced by SHOW CREATE TABLE. Note, you might need to change name of database with respect to previous step
If you table is partitioned then you would need to load all partitions.
If data on S3 adheres HIVE partitioning style, i.e.
s3://some/location/on/s3
|
├── day=01
| ├── hour=00
| └── hour=01
...
then you can use
MSCK REPAIR TABLE `database`.`table`
Alternatively, you can load partitions one by one
ALTER TABLE `database`.`table`
ADD PARTITION (day='01', hour='00')
LOCATION 's3://some/location/on/s3/01/00';
ALTER TABLE `database`.`table`
ADD PARTITION (day='01', hour='01')
LOCATION 's3://some/location/on/s3/01/01';
...
With AWS API
You can use AWS SDK, e.g. boto3 for python, which provide an easy to use, object-oriented API. Here you have two options:
Use Athena client. Like in a previous approach, you would need to get table definition statement from AWS Console. But all other steps, can be done in scripted manner with the use of start_query_execution method of Athena Client. There are plenty resources online, e.g. this one
Use AWS Glue client. This method is solely based on operation within AWS Glue Data Catalog, which is used by Athena during query execution. Main idea is to create two glue clients, one for source and one for destination catalog. For example
import boto3
KEY_ID = "__KEY_ID__"
SECRET = "__SECRET__"
glue_source = boto3.client(
'glue',
region_name="ap-south-1",
aws_access_key_id=KEY_ID,
aws_secret_access_key=SECRET
)
glue_destination = boto3.client(
'glue',
region_name="us-east-1",
aws_access_key_id=KEY_ID,
aws_secret_access_key=SECRET
)
# Or you can do it with creating sessions
glue_source = boto3.session.Session(profile_name="profile_for_ap_south_1").client("glue")
glue_destination = boto3.session.Session(profile_name="profile_for_us_east_1").client("glue")
Then you would need to use get and create type methods. This would also require parsing responses that would get from glue clients.
With AWS Glue crawlers
Although, you can use AWS Glue crawlers to "rediscover" data on S3, I wouldn't recommend this approach since you already know structure of you data.
The answer of #Ilya Kisil is correct but I would like to bring some more details and alternative solutions.
There are two different approaches you can take.
As suggested by Ilya, copy the table definitions from one region (source region) to another (destination region). The idea is to reference the data of the other region.
I found the Glue Crawlers much easier and faster. You need to create a Glue Crawler in the source region and specify the S3 bucket of the destination region where the metadata is located. Once you do it, you will see in the Athena source region all the tables of the destination region! Behind the scenes what Glue Crawler does is what Ilya explained in the "With AWS Console" section. So, instead of creating the table one by one and loading the partitions (if exist), you can just create one Glue Crawler.
Note, that it holds a reference to your destination region tables. So that it doesn't copy the data. At first glance, it seems to be great! Why should we copy the data if we could reference it? But when you take a deeper look, you can find that you are probably going to pay more money $$$. When you reference data, you will pay for the data each query returns and if you consume the data a lot, and you have TB/PB of data, it might be too expensive, and if cost is a consideration for you, I would recommend you consider the second solution.
Also note, that although the data is not being copied to the source region and just referenced, behind the scenes, when you execute a query, AWS saves the data temporarily in the source region. So, if you need to be GDPR compliant you might need to be aware of that.
Copy the data from the destination region to the source region and have a process that keeps synchronizing it. Then you will not pay for the Athena queries, but rather pay for the storage that is usually cheaper. If possible, you can also copy just what you need or aggregate the data, so you have less copied storage => and less cost.
A convenient way to do it is by creating a Glue Job that will be responsible for copying the data from the destination region S3 bucket to the source region S3 bucket. And then you can add it to a Glue Workflow that will run this job once a day or whatever is proper for you.
To Summarize:
There are lots of things to consider and I mentioned some of them. In each use case, you have advantages and disadvantages and you can find what is the right one for you.
(Solution 1) Advantages:
Easy. Just some clicks.
Fast.
Referencing the data and no need to have duplicated data.
(Solution 1) Disadvantages:
Might be way more expensive (depends on the data usage).
(Solution 2) Advantages:
Might be much cheaper
(Solution 2) Disadvantages:
Slow/Longer solution
Need to copy existing data and then have a process to copy new data

Use AWS Athena To Query S3 Object Tagging

Is it possible to use AWS Athena to query S3 Object Tagging? For example, if I have an S3 layout such as this
bucketName/typeFoo/object1.txt
bucketName/typeFoo/object2.txt
bucketName/typeFoo/object3.txt
bucketName/typeBar/object1.txt
bucketName/typeBar/object2.txt
bucketName/typeBar/object3.txt
And each object has an S3 Object Tag such as this
#For typeFoo/object1.txt and typeBar/object1.txt
id=A
#For typeFoo/object2.txt and typeBar/object2.txt
id=B
#For typeFoo/object3.txt and typeBar/object3.txt
id=C
Then is it possible to run an AWS Athena query to get any object with the associated tag such as this
select * from myAthenaTable where tag.id = 'A'
# returns typeFoo/object1.txt and typeBar/object1.txt
This is just an example and doesn't reflect my actual S3 bucket/object-prefix layout. Feel free to use any layout you wish in your answers/comments.
Ultimately I have a plethora of objects that could be in different buckets and folder paths but they are related to each other and my goal is to tag them so that I can query for a particular id value and get all objects related to that id. The id value would be a GUID and that GUID would map to many different types of objects that are related e.g., I could have a video file, a picture file, a meta-data file, and a json file and I want to get all of those files using their common id value; please feel free to offer suggestions too because I have the ability to structure this as I see fit.
Update - Note
S3 Object Metadata and S3 Object Tagging are two different things.
Athena does not support querying based on s3 tag
one workaround is,
you can create a meta file which contains the tag and file mapping using lambda i.e whenever new file comes to s3 and lambda would update a file in s3 with tag and name details.

AWS Glue custom crawler based on file name

So what I am trying to do is to crawl data on S3 bucket with AWS Glue. Data stored as nested json and path looks like this:
s3://my-bucket/some_id/some_subfolder/datetime.json
When running default crawler (no custom classifiers) it does partition it based on path and deserializes json as expected, however, I would like to get a timestamp from the file name as well in a separate field. For now Crawler omits it.
For example if I run crawler on:
s3://my-bucket/10001/fromage/2017-10-10.json
I get table schema like this:
Partition 1: 10001
Partition 2: fromage
Array: JSON data
I did try to add custom classifier based on Grok pattern:
%{INT:id}/%{WORD:source}/%{TIMESTAMP_ISO8601:timestamp}
However, whenever I re-run crawler it skips custom classifier and uses default JSON one. As a solution obviously I could append file name to the JSON itself before running a crawler, but was wondering if I can avoid this step?
Classifiers only analyze the data within the file, not the filename itself. What you want to do is not possible today. If you can change the path where the files land, you could add the date as another partition:
s3://my-bucket/id=10001/source=fromage/timestamp=2017-10-10/data-file-2017-10-10.json

How to create AWS Glue table where partitions have different columns? ('HIVE_PARTITION_SCHEMA_MISMATCH')

As per this AWS Forum Thread, does anyone know how to use AWS Glue to create an AWS Athena table whose partitions contain different schemas (in this case different subsets of columns from the table schema)?
At the moment, when I run the crawler over this data and then make a query in Athena, I get the error 'HIVE_PARTITION_SCHEMA_MISMATCH'
My use case is:
Partitions represent days
Files represent events
Each event is a json blob in a single s3 file
An event contains a subset of columns (dependent on the type of event)
The 'schema' of the entire table is the full set of columns for all the event types (this is correctly put together by Glue crawler)
The 'schema' of each partition is the subset of columns for the event types that occurred on that day (hence in Glue each partition potentially has a different subset of columns from the table schema)
This inconsistency causes the error in Athena I think
If I were to manually write a schema I could do this fine as there would just be one table schema, and keys which are missing in the JSON file would be treated as Nulls.
Thanks in advance!
I had the same issue, solved it by configuring crawler to update table metadata for preexisting partitions:
It also fixed my issue!
If somebody need to provision This Configuration Crawler with Terraform so here is how I did it:
resource "aws_glue_crawler" "crawler-s3-rawdata" {
database_name = "my_glue_database"
name = "my_crawler"
role = "my_iam_role.arn"
configuration = <<EOF
{
"Version": 1.0,
"CrawlerOutput": {
"Partitions": { "AddOrUpdateBehavior": "InheritFromTable" }
}
}
EOF
s3_target {
path = "s3://mybucket"
}
}
This helped me. Posting the image for others in case the link is lost
Despite selecting Update all new and existing partitions with metadata from the table. in the crawler's configuration, it still occasionally failed to set the expected parameters for all partitions (specifically jsonPath wasn't inherited from the table's properties in my case).
As suggested in https://docs.aws.amazon.com/athena/latest/ug/updates-and-partitions.html, "to drop the partition that is causing the error and recreate it" helped
After dropping the problematic partitions, glue crawler re-created them correctly on the following run