AWS Glue not deleting or deprecating tables generated over now removed S3 data - amazon-web-services

Due to user error, the S3 directory that our Glue crawler routinely runs over became flooded with .csv files. When Glue crawled the directory, it created a table for each of the 200,000+ .csv files. Shortly afterwards I ran a script that deleted the .csv files (the S3 bucket has versioning enabled) and re-ran the Glue crawler with the following settings:
Schema updates in the data store: Update the table definition in the data catalog.
Inherit schema from table: Update all new and existing partitions with metadata from the table.
Object deletion in the data store: Delete tables and partitions from the data catalog.
In the CloudWatch logs, the crawler is updating the tables that match the remaining data, but it's not deleting any of the tables generated from those .csv files. According to its configuration as logged in CloudWatch, it should be able to do so.
INFO : Crawler configured with Configuration
{
"Version": 1,
"Grouping": {
"TableGroupingPolicy": "CombineCompatibleSchemas"
}
}
and SchemaChangePolicy
{
"UpdateBehavior": "UPDATE_IN_DATABASE",
"DeleteBehavior": "DELETE_FROM_DATABASE"
}
I should mention that there is another crawler set to crawl the same S3 bucket, but it hasn't been run in over a year, so I doubt it could be a point of conflict.
I'm stumped on what the issue could be. As it stands, I can write a script to pattern match the existing tables and drop those with csv in their name, or delete and rebuild the database by having Glue re-crawl S3, but if possible I'd much rather have Glue drop the tables itself after identifying that they point to no files in S3.

I'm currently taking the approach of writing a script to delete the tables created by Athena. All the generated files from Athena queries are 49 characters long, have five _ characters for the results file and six _ for the metadata, and generally follow the format of ending in _csv for the query results and _csv_metadata for the query metadata.
I'm getting a list of all the table names in my database and filtering it to include only those that are 49 characters long, end with _csv_metadata, and have six _ characters in them. I'm iterating over each string and deleting its corresponding table in the database. For the query results table that ends with _csv, I'm cutting off the trailing nine characters of the _csv_metadata string, which removes _metadata.
If I were to improve on this, I'd also query the table and ensure it has no data in it and matches certain column name definitions.
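The filtering and deletion described above could be sketched roughly as follows. This is only an illustration under stated assumptions, not the script that was actually run: the function names and the database argument are hypothetical, and the results table is deleted only when a table with that derived name actually exists in the catalog.

```python
from typing import Iterable, List

METADATA_SUFFIX = "_csv_metadata"


def is_athena_metadata_table(name: str) -> bool:
    """Match the naming pattern described above for Athena metadata tables."""
    return (
        len(name) == 49
        and name.endswith(METADATA_SUFFIX)
        and name.count("_") == 6
    )


def tables_to_delete(names: Iterable[str]) -> List[str]:
    """Return each metadata table followed by its results table, if present."""
    existing = set(names)
    selected = []
    for name in sorted(existing):
        if is_athena_metadata_table(name):
            selected.append(name)
            # Drop the trailing nine characters ("_metadata") to get the _csv name.
            results_name = name[: -len("_metadata")]
            if results_name in existing:
                selected.append(results_name)
    return selected


def delete_matching_tables(database: str) -> List[str]:
    """List every table in the Glue database and delete the matching ones."""
    import boto3  # imported here so the pure helpers above work without AWS access

    glue = boto3.client("glue")
    names = []
    for page in glue.get_paginator("get_tables").paginate(DatabaseName=database):
        names.extend(table["Name"] for table in page["TableList"])

    doomed = tables_to_delete(names)
    # batch_delete_table accepts at most 100 table names per call.
    for start in range(0, len(doomed), 100):
        glue.batch_delete_table(
            DatabaseName=database,
            TablesToDeleteBatch=doomed[start:start + 100],
        )
    return doomed
```

Running tables_to_delete on the list of names first and reviewing the result before calling delete_matching_tables is probably the safer order, since the deletion cannot be undone from the catalog side.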

Related

Athena tables having history of records of every csv

I am uploading CSV files to an S3 bucket, creating tables with a Glue crawler, viewing the tables in Athena, connecting Athena to QuickSight, and showing the results graphically in QuickSight.
What I need to do now is keep a history of the uploaded files. Instead of the crawler updating the table each time a new CSV file is uploaded, can I have the crawler save each record separately? Is that even a reasonable thing to do? I wonder whether it would create so many tables that it becomes a mess.
I'm just trying to figure out a way to keep a history of previous records. How can I achieve this?
When you run an Amazon Athena query, Athena will look at the location parameter defined in the table's DDL. This specifies where the data is stored in an Amazon S3 bucket.
Athena will include all files in that location when it runs the query on that table. Thus, if you wish to add more data to the table, simply add another file in that S3 location. To replace data in that table, you can overwrite the file(s) in that location. To delete data, you can delete files from that location.
There is no need to run a crawler on a regular basis. The crawler can be used to create the table definition and it can be run again to update the table definition if anything has changed. But you typically only need to use the crawler once to create the table definition.
If you wish to preserve historical data in the table while adding more data to the table, simply upload the data to new files and keep the existing data files in place. That way, any queries will include both the historical data and the new data because Athena simply looks at all the files in that location.
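One way to keep existing files in place is to give every upload a unique key under the table's location. The sketch below is a minimal illustration of that idea; the helper names, the timestamp format, and the bucket and prefix arguments are assumptions, not something taken from the question.

```python
import os
from datetime import datetime, timezone


def history_key(prefix: str, filename: str, uploaded_at: datetime) -> str:
    """Build a unique key under the table location so older files are never overwritten."""
    stamp = uploaded_at.strftime("%Y%m%dT%H%M%SZ")
    return f"{prefix.rstrip('/')}/{stamp}_{filename}"


def upload_keeping_history(bucket: str, prefix: str, local_path: str) -> str:
    """Upload a local CSV next to the existing files instead of replacing them."""
    import boto3  # imported here so history_key can be used without AWS access

    key = history_key(prefix, os.path.basename(local_path), datetime.now(timezone.utc))
    boto3.client("s3").upload_file(local_path, bucket, key)
    return key
```

Because every file lands under the same location, a query on the table reads the old and new files together, as described above.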

AWS Glue Crawler creates multiple tables when reading empty files

I'm writing a Glue Crawler as a part of an ETL, and I have a very annoying problem -
The S3 bucket I'm crawling contains many different JSON files, all with the same schema. When crawling the bucket, the crawler creates a new table for every empty file and one additional table for the non-empty files.
When manually deleting the empty files and running the crawler - I get the expected behaviour, one table is created with the non-empty files data.
Is there a way to avoid this? I'm having trouble deleting the empty files before crawling.
Many thanks.

AWS Athena - What happens when you add new files to S3 folder

I have a sample working where I put a file in S3.
What I'm confused about is what happens when I add new CSV files (with the same format) to that folder.
Are they instantly available in queries? Or do you have to run Glue or something to process them? For example, what if I set up a Lambda function to extract a new CSV to that same S3 directory every hour, or even every 5 minutes?
Does Athena actually load the data into some database somewhere in order to do fast performing queries?
If your table is not partitioned, or you add a file to an existing partition, the data will be available right away.
However, if you constantly add files, you may want to consider partitioning your table to optimize query performance. See:
Table Location in Amazon S3
Partitioning Data
Athena itself doesn't have any caching; every query will hit the S3 location of the table.

How Amazon Athena selecting new files/records from S3

I add files to Amazon S3 from time to time, and I use Amazon Athena to query this data and save the aggregated results to another S3 bucket in CSV format. I'm trying to find a way for Athena to select only new data (data it has not queried before), in order to reduce cost and avoid duplicated data.
I tried to update the records after Athena selected them, but UPDATE queries are not supported in Athena.
Any ideas on how to solve this?
Athena does not keep track of files on S3, it only figures out what files to read when you run a query.
When planning a query, Athena will look at the table metadata for the table location, list that location, and finally read all files that it finds during query execution. If the table is partitioned, it will list the locations of all partitions that match the query.
The only way to control which files Athena will read during query execution is to partition a table and ensure that queries match the partitions you want it to read.
One common way of reading only new data is to put data into prefixes on S3 that include the date, and create tables partitioned by date. At query time you can then filter on the last week, month, or other time period to limit the amount of data read.
You can find more information about partitioning in the Athena documentation.
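As a rough sketch of the date-prefix approach, the helpers below build a partition prefix for new files and a query that filters on recent partitions. The partition column name dt, the ISO date format, and the table and prefix names are all assumptions made for illustration; the comparison relies on dt being a string column holding ISO dates, which sort correctly as text.

```python
from datetime import date, timedelta


def partition_prefix(base_prefix: str, day: date) -> str:
    """Return the S3 prefix for one day's files, e.g. events/dt=2020-03-28/."""
    return f"{base_prefix.rstrip('/')}/dt={day.isoformat()}/"


def recent_data_query(table: str, today: date, days: int) -> str:
    """Build a query that only reads partitions from the last `days` days."""
    cutoff = today - timedelta(days=days)
    return f"SELECT * FROM {table} WHERE dt >= '{cutoff.isoformat()}'"
```

New partitions still have to be registered in the table metadata before Athena can see them, since Athena only lists the locations of partitions it already knows about.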

Updating manually created aws glue data catalog table with crawler

I'm working with AWS Glue and many files on S3, with new files added every day. I tried to create and run a crawler to infer a schema from those CSV files. Instead of just one Data Catalog table with a schema, the crawler creates many tables (even with the "Create a single schema for each S3 path" option selected), which means that the crawler recognizes different schemas and can't combine them into one. But I need just one table in the Data Catalog for all those files!
So I created a separate Data Catalog table manually, but when I use this table in a Glue job, none of the S3 CSV files are processed. I guess that is because every time the crawler runs, it checks for new files and partitions (and in the good case of a single-schema table, we can see those files and partitions by clicking the View partitions button in Tables).
There is a way, described here, to update a manually created table with a crawler. I followed it, hoping that the crawler would not change the column data types I selected but would update the list of files and partitions for the Glue job to process later:
You might want to create AWS Glue Data Catalog tables manually and then keep them updated with AWS Glue crawlers. Crawlers running on a schedule can add new partitions and update the tables with any schema changes. This also applies to tables migrated from an Apache Hive metastore.
To do this, when you define a crawler, instead of specifying one or more data stores as the source of a crawl, you specify one or more existing Data Catalog tables. The crawler then crawls the data stores specified by the catalog tables. In this case, no new tables are created; instead, your manually created tables are updated.
This doesn't happen for some reason. In the crawler log I see this:
INFO : Some files do not match the schema detected. Remove or exclude the following files from the crawler (truncated to first 200 files):
bucket1/customer/dt=2020-02-26/delta_20200226_080101.csv
INFO : Multiple tables are found under location bucket1/customer/. Table customer is skipped.
But there is no "Exclude patterns" option to exclude that file when the crawler uses an existing Data Catalog table; the documentation says that in this case "The crawler then crawls the data stores specified by the catalog tables".
And the crawler doesn't add any partitions or files to my table.
Is there a way to update my manually created table with new files from s3?
Given that your crawler is detecting different schemas, it will continue to do so no matter which option you choose. You can have it use the table definition for all the partitions and only log changes, to avoid updating the table schema. But if the files differ in schema, I'm not sure your queries will work.
Another option would be to add partitions for your S3 path using boto3. You can get the table schema with the get_table function and then create a partition in Glue with that schema.
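A minimal sketch of that boto3 option might look like the following. It assumes the partition data lives under key=value folders below the table's location (as in the dt=2020-02-26 path from the crawler log); the helper names are hypothetical.

```python
import copy


def build_partition_input(table: dict, partition_values: list) -> dict:
    """Copy the table's storage descriptor and point it at the partition's S3 path."""
    descriptor = copy.deepcopy(table["StorageDescriptor"])
    keys = [key["Name"] for key in table["PartitionKeys"]]
    suffix = "/".join(f"{k}={v}" for k, v in zip(keys, partition_values))
    descriptor["Location"] = f"{descriptor['Location'].rstrip('/')}/{suffix}/"
    return {"Values": list(partition_values), "StorageDescriptor": descriptor}


def add_partition(database: str, table_name: str, partition_values: list) -> None:
    """Register one partition in the Glue Data Catalog using the table's own schema."""
    import boto3  # imported here so build_partition_input works without AWS access

    glue = boto3.client("glue")
    table = glue.get_table(DatabaseName=database, Name=table_name)["Table"]
    glue.create_partition(
        DatabaseName=database,
        TableName=table_name,
        PartitionInput=build_partition_input(table, partition_values),
    )
```

For example, add_partition("inner_customer", "customer", ["2020-03-28"]) would register the dt=2020-03-28 folder. Note that create_partition raises an error if the partition already exists, so a real job would need to handle that case.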
I don't know why, but the crawler I created can't update the list of files and partitions for the Glue job to process later. It skips my manually created Data Catalog table, as I can see in the CloudWatch log. To solve this problem, I added a repair table query to my Glue script, before the actual ETL process, so that it does what the crawler is supposed to do. I also disabled the crawler itself, so that it doesn't change my manually created table and doesn't create many tables for individual CSV files and partitions:
import boto3
...
# Athena query part
client = boto3.client('athena', region_name='us-east-2')
data_catalog_table = "customer"
db = "inner_customer"  # Glue Data Catalog database, not the Postgres DB
# This should load all partitions for data_catalog_table, so the Glue job can upload new file data into the DB
q = "MSCK REPAIR TABLE " + data_catalog_table
# The query output is written to a file in S3
output = "s3://bucket_to_store_query_results/results/"
response = client.start_query_execution(
    QueryString=q,
    QueryExecutionContext={
        'Database': db
    },
    ResultConfiguration={
        'OutputLocation': output,
    }
)
After the query "MSCK REPAIR TABLE customer" executes, it writes an xxx-xxx-xxx.txt file to s3://bucket_to_store_query_results/results/ with content like this:
Partitions not in metastore: customer:dt=2020-03-28 customer:dt=2020-03-29 customer:dt=2020-03-30
Repair: Added partition to metastore customer:dt=2020-03-28
Repair: Added partition to metastore customer:dt=2020-03-29
Repair: Added partition to metastore customer:dt=2020-03-30
If I open Glue -> Tables, select the customer table, and click the "View partitions" button at the top right of the page, I see all my partitions from the S3 bucket. After that part, the Glue job continues as before. I understand that the "repair table" query hack is not really optimal, and I may change it to something more sophisticated, like the approach described here.
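One detail worth noting is that start_query_execution only submits the query and returns immediately, so the ETL step may start before the partitions are registered. A hedged sketch of a polling helper is shown below; the function name, polling interval, and retry limit are assumptions, and the client argument is expected to be the Athena client from the script above.

```python
import time


def wait_for_query(client, query_execution_id: str,
                   poll_seconds: float = 2.0, max_polls: int = 150) -> str:
    """Poll Athena until the query reaches a terminal state and return that state."""
    for _ in range(max_polls):
        execution = client.get_query_execution(QueryExecutionId=query_execution_id)
        state = execution["QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            return state
        time.sleep(poll_seconds)
    raise TimeoutError(f"Query {query_execution_id} did not finish in time")
```

With the script above, this could be called as wait_for_query(client, response["QueryExecutionId"]), continuing with the ETL only if the returned state is SUCCEEDED.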