I have multiple files stored in an S3 bucket under uniquely named folders, which I would expect AWS Glue to put into a single table. Instead it creates one table per file. Any ideas how to configure the crawler to get a single table?
The current S3 structure is s3://bucket_name/YYYYMMDDUUID/data.json:
20210801123123cfec/data.json
20210808876551cedc/data.json
....
20210810112313feed/data.json
The JSON schema is definitely not the problem, since it is similar across files. For example, when I change the folder names from the custom names to "1", "2", etc., I get a single table with multiple partitions.
I'm trying to assign table properties to the tables that are created with a crawler.
The idea is to have all of the tables that are created with a crawler have the same default properties (plus the ones they'd usually have).
I examined the options in the crawler creation interface but didn't see such an option.
Writing a Python boto3 script that alters the table properties after the tables are created is the only thing that comes to mind.
If this is not possible with the default crawler functionality, what is a viable approach to attach table properties to every table that is created with a certain crawler?
EDIT: One possible solution would be to create a lambda function that checks if the custom parameters exist on the glue tables and if not creates them.
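The boto3 script (or Lambda function) mentioned above could look roughly like the sketch below. It assumes the crawler writes to its own Glue database, and the property names in DEFAULT_PROPERTIES are hypothetical placeholders. Because get_table returns read-only fields that update_table rejects, the sketch copies only the keys that TableInput accepts.

```python
# Keys that Glue's UpdateTable accepts inside TableInput. get_tables returns
# extra read-only keys (CreateTime, DatabaseName, ...) that must be dropped.
TABLE_INPUT_KEYS = {
    "Name", "Description", "Owner", "LastAccessTime", "LastAnalyzedTime",
    "Retention", "StorageDescriptor", "PartitionKeys", "ViewOriginalText",
    "ViewExpandedText", "TableType", "Parameters", "TargetTable",
}

# Hypothetical default properties, for illustration only.
DEFAULT_PROPERTIES = {"team": "analytics", "managed_by": "crawler"}


def build_table_input(table, defaults):
    """Return a TableInput dict with missing default properties filled in.

    Values already present on the table win over the defaults.
    """
    table_input = {k: v for k, v in table.items() if k in TABLE_INPUT_KEYS}
    table_input["Parameters"] = {**defaults, **table.get("Parameters", {})}
    return table_input


def apply_defaults(database, defaults=DEFAULT_PROPERTIES):
    """Add the default properties to every table in one Glue database."""
    import boto3  # imported lazily so build_table_input works without AWS access

    glue = boto3.client("glue")
    paginator = glue.get_paginator("get_tables")
    for page in paginator.paginate(DatabaseName=database):
        for table in page["TableList"]:
            table_input = build_table_input(table, defaults)
            # Skip tables that already carry every default property.
            if table_input["Parameters"] != table.get("Parameters", {}):
                glue.update_table(DatabaseName=database, TableInput=table_input)
```

Running apply_defaults after each crawler run (for example from a Lambda function) would keep newly created tables in line; how to trigger it is left open here.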
Option 1
Adding the fields directly in the definition (using CloudFormation) might be the best way to approach this.
https://docs.amazonaws.cn/en_us/AWSCloudFormation/latest/UserGuide/aws-properties-glue-classifier-csvclassifier.html
Option 2
I guess there's some reason why you do not add the table fields directly. If this should be driven by the data itself, the clean way is to write custom classifiers:
https://docs.aws.amazon.com/glue/latest/dg/custom-classifier.html
Option 3
When you need a quick hack, you could merge the schema by crawling an additional file that contains the missing schema info and letting the crawler merge the fields.
If you have JSON files in S3, for example (or any consistent format for your use case), you can add an additional init file and put the columns there. Set the following crawler configuration:
{
  "Version": 1.0,
  "Grouping": {
    "TableGroupingPolicy": "CombineCompatibleSchemas"
  }
}
Quoting the AWS documentation:
"To help illustrate this option, suppose that you define a crawler with an include path s3://bucket/table1/. When the crawler runs, it finds two JSON files with the following characteristics:
File 1 – S3://bucket/table1/year=2017/data1.json
File content – {"A": 1, "B": 2}
Schema – A:int, B:int
File 2 – S3://bucket/table1/year=2018/data2.json
File content – {"C": 3, "D": 4}
Schema – C: int, D: int
By default, the crawler creates two tables, named year_2017 and year_2018 because the schemas are not sufficiently similar. However, if the option Create a single schema for each S3 path is selected, and if the data is compatible, the crawler creates one table. The table has the schema
A:int,B:int,C:int,D:int and partitionKey year:string."
See https://docs.aws.amazon.com/glue/latest/dg/crawler-configuration.html
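For an existing crawler, the grouping policy shown above can also be applied from a script. This is a minimal sketch that assumes boto3 and valid credentials; the crawler name is whatever you pass in. The Configuration parameter of update_crawler takes the JSON as a string.

```python
import json


def grouping_configuration():
    """Build the crawler Configuration string that combines compatible schemas."""
    config = {
        "Version": 1.0,
        "Grouping": {"TableGroupingPolicy": "CombineCompatibleSchemas"},
    }
    return json.dumps(config)


def set_single_table_grouping(crawler_name):
    """Apply the grouping policy to an existing crawler."""
    import boto3  # imported lazily so the config builder works without AWS access

    glue = boto3.client("glue")
    glue.update_crawler(Name=crawler_name, Configuration=grouping_configuration())
```

The crawler has to be run again after the update for the new grouping to take effect.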
I have tried this without achieving the required results.
I have multiple CSV files in one folder of an S3 bucket, but when the crawler creates multiple tables for them, Athena returns zero results. When I made a separate folder for each file, it worked fine.
Problem:
If more folders are added in the future, I have to go to the crawler and add a new location path for each newly added folder. Is there a way to do this automatically, or some other way to handle it? I am using a Glue crawler, an S3 bucket, and Athena to run queries on multiple CSV files.
In general a table needs all of its files to be in a directory, and no other files to be in that directory.
There is, however, a mechanism that makes it possible to create tables that include just specific files. You can read more about that in the second part of this answer: Partition Athena query by S3 created date (scroll down a bit after the horizontal rule). You can also find an example in the S3 Inventory documentation: https://docs.aws.amazon.com/AmazonS3/latest/dev/storage-inventory.html
I am trying to query an AWS S3 Inventory list using Athena. I can do this if I have only one source bucket, but I am not sure how to configure it to work with multiple source buckets.
We are using all the default configuration options, with CSV as the data format. The S3 Inventory destination path pattern for Hive looks like this:
destination-prefix/source-bucket/config-ID/hive/dt=YYYY-MM-DD-HH-MM/symlink.txt
So when I am creating an Athena table I have to use static hive path.
CREATE EXTERNAL TABLE your_table_name(
  -- column names
)
PARTITIONED BY (dt string)
-- options ignored
LOCATION 's3://destination-prefix/source-bucket/config-ID/hive/';
So if I want to query inventory data for multiple source buckets, it seems like I have to create a table for each "source-bucket".
Alternatively, without using Athena, I am trying to do this with the AWS CLI:
aws s3 ls s3://our-bucket-name/prefix/abc --recursive | awk '$1 > "2019-04-01"'
But this lists every single file first, as there is no option to set --include or --exclude with "aws s3 ls".
Finally, the questions are:
Can I configure S3 Inventory to generate inventory for multiple S3 buckets so that it puts everything into the same "hive" directory (i.e. ignores the "source-bucket" prefix while generating the inventory)?
Is it possible to configure Athena to read from multiple hive locations? With the possibility of new buckets being created and dropped, I guess this gets ugly.
Is there any alternative way to query the inventory list other than Athena, the AWS CLI, or writing custom code that uses the manifest.json file to get these CSV files?
You can't make S3 Inventory create one inventory for multiple buckets, unfortunately. You can however splice the inventories together into one table.
The guide you link to says to run MSCK REPAIR TABLE … to load your inventories. I would recommend not doing that, because it creates a table whose partitions each represent the inventory at some point in time. That is useful if you want to compare what's in a bucket from day to day or week to week, but it is probably not what you want most of the time. Most of the time you want to know what's in the bucket right now. You should also skip that command if you want to get multiple inventories into the same table.
First you change how you create the table slightly:
CREATE EXTERNAL TABLE your_table_name(
  -- column names
)
PARTITIONED BY (bucket_name string)
-- options ignored
LOCATION 's3://destination-prefix/source-bucket/config-ID/hive/';
Notice that I changed the partitioning from dt string to bucket_name string.
Next you add the partitions manually:
ALTER TABLE your_table_name
ADD PARTITION (bucket_name = 'some-bucket') LOCATION 's3://destination-prefix/source-bucket/config-ID1/hive/dt=YYYY-MM-DD/'
ADD PARTITION (bucket_name = 'another-bucket') LOCATION 's3://destination-prefix/source-bucket/config-ID2/hive/dt=YYYY-MM-DD/';
The locations should be the S3 URIs to the latest dates under the "hive" prefix of the inventory for each bucket.
The downside of this is that when new inventories are delivered you will need to update the table to point to these new locations. You can do this by first dropping the partitions:
ALTER TABLE your_table_name
DROP PARTITION (bucket_name = 'some-bucket')
DROP PARTITION (bucket_name = 'another-bucket');
and then adding them again using the same SQL as above, but with new S3 URIs.
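The drop-and-re-add step can be scripted. The sketch below is one possible way to do it, not part of the original answer: refresh_partition_statements only builds the SQL, and run_statements (which assumes boto3, an Athena database name, and a query result location that you supply) executes the statements in order. It issues one statement per partition and waits for each to finish, so a DROP always completes before the matching ADD.

```python
def refresh_partition_statements(table, locations):
    """Return DROP/ADD statements that repoint each bucket's partition.

    `locations` maps a source bucket name to the S3 URI of its latest
    inventory under the "hive" prefix.
    """
    statements = []
    for bucket, uri in sorted(locations.items()):
        statements.append(
            f"ALTER TABLE {table} DROP IF EXISTS PARTITION (bucket_name = '{bucket}')"
        )
        statements.append(
            f"ALTER TABLE {table} ADD PARTITION (bucket_name = '{bucket}') LOCATION '{uri}'"
        )
    return statements


def run_statements(statements, database, output_location):
    """Run the statements one at a time, waiting for each to finish."""
    import time
    import boto3  # imported lazily so the SQL builder works without AWS access

    athena = boto3.client("athena")
    for sql in statements:
        query_id = athena.start_query_execution(
            QueryString=sql,
            QueryExecutionContext={"Database": database},
            ResultConfiguration={"OutputLocation": output_location},
        )["QueryExecutionId"]
        # Athena queries are asynchronous, so poll until this one is done.
        while True:
            status = athena.get_query_execution(QueryExecutionId=query_id)
            state = status["QueryExecution"]["Status"]["State"]
            if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
                break
            time.sleep(1)
        if state != "SUCCEEDED":
            raise RuntimeError(f"{state}: {sql}")
```

Finding the latest dt=... prefix for each bucket is left out here; it could be done by listing the "hive" prefix of each inventory.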
I have a 2 GB CSV file (pipe-separated) in S3.
I ran a Glue crawler on it, which created a new table.
When I run a query from Athena, it finds zero records (even though it returns the columns correctly).
I didn't apply any partitioning and ran the crawler with settings as close to default as possible.
Any suggestions?
Note: I used the AWS console for all actions.
One possible reason the query returns no data is that you specified the file name as part of the S3 path when adding the crawler.
Let's say your bucket name is testbucket and the CSV file is test.csv. When adding the crawler, you need to specify the path as s3://testbucket/
and not s3://testbucket/test.csv
Also, if the fields are separated by pipes, they will be displayed under a single column, because the file extension is .csv (comma-separated). So ideally the fields should be comma-separated in order to get the proper output.
Hence, try specifying the path as mentioned above. Hopefully this will return the data.
If the data is still not returned, try creating a new crawler, and while creating it do not use the existing IAM role. Create a new role instead. Sometimes IAM policies cause a glitch while fetching the data.
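To illustrate the point about the crawler path, here is a small helper (hypothetical, not part of any AWS SDK) that turns an object URI into the folder-level path a crawler target should use. It assumes that a final path component without a trailing slash is a file name.

```python
from urllib.parse import urlparse


def crawler_include_path(s3_uri):
    """Return the folder-level S3 path for a crawler target.

    A final path component without a trailing "/" is assumed to be a file
    name and is dropped; folder paths are returned with a trailing "/".
    """
    parsed = urlparse(s3_uri)
    folder, _, _ = parsed.path.rpartition("/")
    return f"s3://{parsed.netloc}{folder}/"
```

With the example above, crawler_include_path("s3://testbucket/test.csv") gives s3://testbucket/, which is the value to enter as the crawler's include path.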
I'm testing out S3 Select and as far as I understand from the examples, you can treat a single object (CSV or JSON) as a data store.
I wanted to have a single JSON document per S3 object and search the entire bucket as a 'database'. I'm saving each 'file' as <ID>.json and each file has JSON documents with the same schema.
Is it possible to search multiple objects in a single call? i.e. Find all JSON documents where customerId = 123 ?
It appears that Amazon S3 Select operates on only one object.
You can use Amazon Athena to run queries across paths, which will include all files within that path. It also supports partitioning.
Simply iterate over the keys under the folder that holds your files, and run S3 Select against each key in turn.
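The iteration described above could be sketched as follows. This assumes boto3, one JSON document per object, and a client passed in by the caller; the function name select_across_objects is made up for this example. Each object still costs one S3 Select call, so this is a loop over single-object queries rather than a true multi-object search.

```python
import json


def select_across_objects(s3, bucket, prefix, expression):
    """Run one S3 Select expression against every object under a prefix.

    Returns a list of (key, record) tuples for the matching records.
    """
    matches = []
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            response = s3.select_object_content(
                Bucket=bucket,
                Key=obj["Key"],
                ExpressionType="SQL",
                Expression=expression,
                InputSerialization={"JSON": {"Type": "DOCUMENT"}},
                OutputSerialization={"JSON": {}},
            )
            # A record can be split across events, so join the bytes first.
            raw = b"".join(
                event["Records"]["Payload"]
                for event in response["Payload"]
                if "Records" in event
            )
            # JSON output is newline-delimited, one record per line.
            for line in raw.decode("utf-8").splitlines():
                if line:
                    matches.append((obj["Key"], json.loads(line)))
    return matches
```

For the question's example, the call would be select_across_objects(boto3.client("s3"), "your-bucket", "", "SELECT * FROM S3Object s WHERE s.customerId = 123"), where the bucket name is a placeholder.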