AWS Glue Crawler creates multiple tables when reading empty files - amazon-web-services

I'm writing a Glue Crawler as a part of an ETL, and I have a very annoying problem -
The S3 bucket I'm crawling contains many different JSON files, all with the same schema. When crawling the bucket, the crawler creates a new table for every empty file and one additional table for the non-empty files.
When I manually delete the empty files and run the crawler, I get the expected behaviour: one table is created with the non-empty files' data.
Is there a way to avoid this? I'm having trouble deleting the empty files before crawling.
Many thanks.
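For reference, clearing the zero-byte objects programmatically before the crawl could look something like this minimal boto3 sketch (the bucket name and prefix are placeholders):

import boto3

s3 = boto3.client("s3")
bucket = "my-data-bucket"  # placeholder
prefix = "json-data/"      # placeholder crawl path

# List the objects under the crawl path and delete the zero-byte ones
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
    for obj in page.get("Contents", []):
        if obj["Size"] == 0:
            s3.delete_object(Bucket=bucket, Key=obj["Key"])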

Related

Is it required to have 1 table schema in 1 S3 folder, so that the crawler can pick up the data in AWS Glue?

When I have multiple files in an S3 folder (with different table schemas) and use that location to create multiple tables with a crawler and AWS Glue, Athena doesn't detect any data and returns blank results. However, if the files have only a single table schema (tables with the same column structure), it detects the data well. So the question is: is there a way Athena can create multiple tables with different structures from the same S3 folder?
I have tried creating different folders for different files, and the crawler picks up the table schemas well and gives the exact result. However, this is not feasible, as creating different folders for hundreds of files is not a solution. Hence I'm searching for another way.
When defining a table in Amazon Athena (and AWS Glue), the location parameter should point to a folder path in an Amazon S3 bucket.
When running a query, Athena will look in every file in that folder, including sub-folders.
Therefore, you should only keep files of the same format (and schema) in that directory and all of its subdirectories. All of these files will populate the one table.
Do not put multiple files in the same directory if they are meant to populate different tables or have different schemas.
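As an illustration, a table definition whose LOCATION points at a folder rather than a single file could be created like this; the bucket, database, column names, and output location below are placeholders, and the DDL is run through Athena via boto3:

import boto3

athena = boto3.client("athena")

# LOCATION points at the folder; every file under it (and its sub-folders)
# will populate the one table.
ddl = """
CREATE EXTERNAL TABLE IF NOT EXISTS my_table (
    id     string,
    name   string,
    amount double
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 's3://my-data-bucket/my-table-data/'
"""

athena.start_query_execution(
    QueryString=ddl,
    QueryExecutionContext={"Database": "my_database"},
    ResultConfiguration={"OutputLocation": "s3://my-query-results-bucket/"},
)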

Athena tables having history of records of every csv

I am uploading CSV files to an S3 bucket, creating tables through a Glue crawler, viewing the tables in Athena, connecting Athena to QuickSight, and showing the results graphically there in QuickSight.
But what I need to do now is keep a history of the files uploaded. Instead of a new CSV file being uploaded and the crawler updating the table, can I have the crawler save each record separately? Or is that even a reasonable thing to do, since I worry it would create so many tables and become a mess?
I'm just trying to figure out a way to keep a history of previous records. How can I achieve this?
When you run an Amazon Athena query, Athena will look at the location parameter defined in the table's DDL. This specifies where the data is stored in an Amazon S3 bucket.
Athena will include all files in that location when it runs the query on that table. Thus, if you wish to add more data to the table, simply add another file in that S3 location. To replace data in that table, you can overwrite the file(s) in that location. To delete data, you can delete files from that location.
There is no need to run a crawler on a regular basis. The crawler can be used to create the table definition and it can be run again to update the table definition if anything has changed. But you typically only need to use the crawler once to create the table definition.
If you wish to preserve historical data in the table while adding more data to the table, simply upload the data to new files and keep the existing data files in place. That way, any queries will include both the historical data and the new data because Athena simply looks at all the files in that location.
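For example, preserving history can be as simple as writing each upload to a new key under the table's location instead of overwriting the old one. A minimal boto3 sketch, with placeholder bucket and key names:

import boto3
from datetime import datetime, timezone

s3 = boto3.client("s3")

# Each upload gets a unique, timestamped key under the table's location,
# so the older files stay in place and remain part of every Athena query.
timestamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
s3.upload_file(
    Filename="daily_export.csv",
    Bucket="my-data-bucket",
    Key=f"sales-table/daily_export_{timestamp}.csv",
)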

AWS Glue Crawler is not updating the table after the 1st crawl

I am adding a new file in Parquet format, created by Glue DataBrew, to my S3 folder. The new file has the same schema as the previous file. But when I run the crawler for the 2nd time, it is neither updating the table nor creating a new one in the data catalog. Yet when I crawl both files together, both of them get added.
The log file gives the following information:
INFO : Created partitions with values [[New file name]] for table
BENCHMARK : Finished writing to Catalog
I have tried with and without "Create a single schema for each S3 path", but the crawler is not updating the table with the new file. Soon I will be adding new files daily for my analysis. Any solution?
In my opinion, the best way to approach this issue is to have AWS DataBrew write its output to the Data Catalog directly. The Data Catalog can be updated either by the crawler or by DataBrew, but the recommended practice is to use only one of those mechanisms, not both.
Can you try running the job with the output set to your Data Catalog and let DataBrew manage your catalog? It should update your catalog table with the right data/files.
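A rough sketch of what that could look like with boto3, assuming the DataBrew CreateRecipeJob API's DataCatalogOutputs option; every name below is a placeholder:

import boto3

databrew = boto3.client("databrew")

# Instead of crawling the job's S3 output, the job writes straight into an
# existing Glue Data Catalog table. All names here are placeholders.
databrew.create_recipe_job(
    Name="my-brew-job",
    RoleArn="arn:aws:iam::123456789012:role/MyDataBrewRole",
    DatasetName="my-dataset",
    RecipeReference={"Name": "my-recipe", "RecipeVersion": "1.0"},
    DataCatalogOutputs=[
        {
            "DatabaseName": "my_database",
            "TableName": "my_table",
            "S3Options": {
                "Location": {"Bucket": "my-output-bucket", "Key": "brew-output/"}
            },
            "Overwrite": False,
        }
    ],
)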

Glue crawler is not combining data - also no visible data in tables

I'm testing this architecture: Kinesis Firehose → S3 → Glue → Athena. For now I'm using dummy data which is generated by Kinesis, each line looks like this: {"ticker_symbol":"NFLX","sector":"TECHNOLOGY","change":-1.17,"price":97.83}
However, there are two problems. First, the Glue crawler creates a separate table per file. I've read that if the schemas match, Glue should create only one table. As you can see in the screenshots below, the schema is identical. In the crawler options, I tried ticking Create a single schema for each S3 path on and off, but nothing changed.
The files also sit in the same path, which leads me to the second problem: when those tables are queried, Athena doesn't show any data. That's likely because the files share a folder - I've read about it here (point 1) and tested it several times. If I remove all but one file from the S3 folder and crawl, Athena shows data.
Can I force Kinesis to put each file in a separate folder, or force Glue to record the data in a single table?
File1: (screenshot of the table schema)
File2: (screenshot of the table schema)
Regarding AWS Glue creating separate tables, there could be a few reasons based on the AWS documentation:
Confirm that these files use the same schema, format, and compression type as the rest of your source data. It seems this isn't your issue, but to make sure, I suggest you test with smaller files by dropping all but a few rows in each file.
Combine compatible schemas when you create the crawler by choosing Create a single schema for each S3 path. For this to work, the file schemas should be similar, the setting should be enabled, and the data should be compatible (see the sketch after this list). For more information, see How to Create a Single Schema for Each Amazon S3 Include Path.
When using CSV data, be sure that you're using headers consistently. If some of your files have headers and some don't, the crawler creates multiple tables.
Another really important point: you should have one folder at the root and partition sub-folders inside it. If you have partitions at the S3 bucket level, the crawler will not create one table (mentioned by Sandeep in this Stack Overflow question).
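For the second point, the same grouping option can also be set when the crawler is created programmatically. A minimal boto3 sketch, assuming placeholder names, role, and S3 path:

import boto3
import json

glue = boto3.client("glue")

# "CombineCompatibleSchemas" is the API equivalent of ticking
# "Create a single schema for each S3 path" in the console.
glue.create_crawler(
    Name="my-firehose-crawler",
    Role="arn:aws:iam::123456789012:role/MyGlueCrawlerRole",
    DatabaseName="my_database",
    Targets={"S3Targets": [{"Path": "s3://my-firehose-bucket/data/"}]},
    Configuration=json.dumps(
        {
            "Version": 1.0,
            "Grouping": {"TableGroupingPolicy": "CombineCompatibleSchemas"},
        }
    ),
)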
I hope this helps you resolve your problem.

AWS Glue not deleting or deprecating tables generated over now removed S3 data

Due to user error, the S3 directory over which a Glue crawler routinely ran became flooded with .csv files. When Glue ran over the S3 directory, it created a table for each of the 200,000+ CSV files. I ran a script that deleted the .csv files shortly after (the S3 bucket has versioning enabled), and re-ran the Glue crawler with the following settings:
Schema updates in the data store: Update the table definition in the data catalog.
Inherit schema from table: Update all new and existing partitions with metadata from the table.
Object deletion in the data store: Delete tables and partitions from the data catalog.
Within the CloudWatch logs, it's updating the tables matching the remaining data, but it's not deleting any of the tables generated from those .csv files. According to its configuration log on CloudWatch, it should be able to do so.
INFO : Crawler configured with Configuration
{
    "Version": 1,
    "Grouping": {
        "TableGroupingPolicy": "CombineCompatibleSchemas"
    }
}
and SchemaChangePolicy
{
    "UpdateBehavior": "UPDATE_IN_DATABASE",
    "DeleteBehavior": "DELETE_FROM_DATABASE"
}
I should mention that there is another crawler set to crawl the same S3 bucket, but it hasn't been run in over a year, so I doubt it could be a point of conflict.
I'm stumped on what the issue could be. As it stands, I can write a script to pattern-match the existing tables and drop those with csv in their name, or delete and rebuild the database by having Glue re-crawl S3. But if possible, I'd much rather Glue drop the tables itself after identifying that they point to no files in S3.
I'm currently taking the approach of writing a script to delete the tables created by Athena. All the files generated by Athena queries are 49 characters long, have five _ characters for the results file and six _ for the metadata file, and generally end in _csv for the query results and _csv_metadata for the query metadata.
I'm getting a list of all the table names in my database, filtering it to include only those that are 49 characters long, end with _csv_metadata, and contain six _ characters. I'm iterating over each string and deleting the corresponding table in the database. For the results table that ends with _csv, I'm cutting off the trailing nine characters of the _csv_metadata string, which removes _metadata.
If I were to improve on this, I'd also query each table to ensure it has no data in it and matches certain column name definitions.
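A minimal boto3 sketch of that clean-up logic, assuming a placeholder database name and the 49-character / six-underscore heuristic described above:

import boto3

glue = boto3.client("glue")
database = "my_database"  # placeholder

# Collect every table name in the database
tables = []
paginator = glue.get_paginator("get_tables")
for page in paginator.paginate(DatabaseName=database):
    tables.extend(t["Name"] for t in page["TableList"])

# Keep only the Athena-generated metadata tables: 49 characters long,
# six underscores, ending in _csv_metadata
metadata_tables = [
    name
    for name in tables
    if len(name) == 49 and name.count("_") == 6 and name.endswith("_csv_metadata")
]

for meta_name in metadata_tables:
    # Drop the metadata table, then derive the matching results table
    # by trimming the trailing "_metadata" (nine characters)
    glue.delete_table(DatabaseName=database, Name=meta_name)
    results_name = meta_name[:-9]
    if results_name in tables:
        glue.delete_table(DatabaseName=database, Name=results_name)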