Directly related to: Crawler is creating a table with weird suffix to the name
I have an AWS Glue crawler that crawls an S3 bucket.
I changed the location of the data to a different S3 bucket, updated the crawler, and completely deleted the old tables (using DROP TABLE and confirming that the tables no longer exist in the Glue Data Catalog). Even so, whenever I run the crawler again it creates tables with a hash suffix in the name.
Is there a way to prevent this behaviour?
I'm trying to crawl an S3 bucket with multiple folders, each containing CSV files extracted from Amazon RDS by a Glue job.
At the moment, this is basically the layout in S3:
s3://bucket/folder_table_x/files
s3://bucket/folder_table_y/files
The goal is to crawl this bucket and its folders to create a new database and then query it via Amazon Athena.
But I'm getting tables like the following (named after the file, not the folder):
run-unnamed-1-part-r-00000
Most of the tables are created correctly, but there are some I can't get right.
I've already set the table level to 2 (is that right?) and enabled the option "Create a single schema for each S3 path".
The files that end up as their own tables contain only the header row and no data.
Can anyone help?
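For reference, the two console options mentioned above map to the crawler's Configuration JSON. This is a minimal sketch, assuming boto3 is available; the helper names and the crawler name are made up for illustration. As far as I understand the Glue documentation, the bucket itself counts as level 1, so s3://bucket/folder_table_x would be level 2, but that is worth verifying against your own layout.

```python
import json


def build_crawler_config(table_level=2, combine_schemas=True):
    """Build the JSON string for a crawler's Configuration parameter.

    table_level corresponds to the console's "Table level" option;
    combine_schemas corresponds to "Create a single schema for each S3 path".
    """
    grouping = {"TableLevelConfiguration": table_level}
    if combine_schemas:
        grouping["TableGroupingPolicy"] = "CombineCompatibleSchemas"
    return json.dumps({"Version": 1.0, "Grouping": grouping})


def apply_crawler_config(crawler_name, table_level=2):
    """Push the configuration to an existing crawler (needs AWS credentials)."""
    import boto3  # imported lazily so the helper above works without boto3

    glue = boto3.client("glue")
    glue.update_crawler(
        Name=crawler_name,  # hypothetical name, e.g. "my-csv-crawler"
        Configuration=build_crawler_config(table_level),
    )
```

Printing build_crawler_config() before applying it is a cheap way to compare the result with what the console shows for the crawler.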
I have a few AWS Glue crawlers set up to crawl CSVs in S3 to populate my tables in Athena.
My scenario and question:
I replace the .csv files in S3 daily with updated versions. Do I have to run the existing crawlers again, perhaps on a schedule, to update the tables in Athena with the latest content? Or does the crawler only need to run when the schema changes, for example when columns are added? I just want to ensure that my tables in Athena always return all of the data in the updated CSVs. I rarely make any schema changes to the table structures. If the crawlers only need to run when the structure actually changes, I would prefer to run them a lot less frequently.
When a Glue crawler runs, the following actions take place:
It classifies data to determine the format, schema, and associated properties of the raw data.
It groups data into tables or partitions.
It writes metadata to the Data Catalog.
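Athena uses the table schema in the Data Catalog to query the S3 location, and it reads the files at that location at query time. So if you replace the CSVs in place and the schema stays the same, queries pick up the new data without another crawl, and the crawler can be scheduled to run less often. The exception is data that lands in new partition folders: those partitions still have to be registered in the catalog before Athena sees them.
You can also refer to the documentation on working with Glue crawlers and CSV files in Athena: https://docs.aws.amazon.com/athena/latest/ug/glue-best-practices.html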
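If you do decide to run the crawler less often, the schedule can be changed on the existing crawler. A rough sketch, assuming boto3; the function names and the weekly Monday 02:00 UTC default are only illustrative choices, not a recommendation from the docs.

```python
def weekly_schedule(hour_utc=2, minute=0, day="MON"):
    """Return a Glue cron expression for one run per week.

    Glue cron fields: minutes, hours, day-of-month, month, day-of-week, year.
    """
    return f"cron({minute} {hour_utc} ? * {day} *)"


def schedule_update_kwargs(crawler_name, schedule):
    """Arguments for glue.update_crawler that change only the schedule."""
    return {"Name": crawler_name, "Schedule": schedule}


def apply_schedule(crawler_name, schedule):
    """Send the new schedule to Glue (needs AWS credentials)."""
    import boto3  # imported lazily so the helpers above work without boto3

    glue = boto3.client("glue")
    glue.update_crawler(**schedule_update_kwargs(crawler_name, schedule))
```

For example, apply_schedule("my-csv-crawler", weekly_schedule()) would move a (hypothetical) crawler named my-csv-crawler to a weekly run.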
I have data coming into an S3 bucket and I would like to run a query on it every hour. The data arrives as JSON. I crawl it, run a job to transform it to ORC format, and crawl it again to create a table that is faster to query than the original JSON files (which are deeply nested). I'm trying to query the data with Athena. I have managed to link the previous steps together using Lambda and CloudWatch Events.
The problem is that the last crawler is supposed to create new tables instead of just partitions of the same table, so the table name is not known before the chain of jobs runs. I found that you can listen for the creation of a new table and for the completion of a crawler, but according to Amazon's documentation the event for the end of a crawler run doesn't contain the name of the new table. Is there a way to get this table name dynamically and query it using Lambda or Athena? Thanks
Why not invoke the Lambda from the Glue job after the crawler completes? The table name is the folder in the S3 bucket where you stored the ORC data. Since the Glue job writes that folder, I believe you already have the folder name and can pass it to the Lambda from the job.
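If passing the folder name along isn't convenient, another option (not part of the answer above, just a sketch) is to list the tables in the target database after the crawler finishes and pick the most recently created one. This assumes boto3 and that the crawler writes to a database whose name you know; the function names and sample table names are hypothetical.

```python
from datetime import datetime, timezone


def newest_table_name(tables, created_after=None):
    """Return the Name of the most recently created table, or None.

    tables is a list of dicts shaped like the entries of Glue's TableList.
    created_after, if given, must be timezone-aware like Glue's CreateTime.
    """
    candidates = [
        t for t in tables
        if "CreateTime" in t
        and (created_after is None or t["CreateTime"] > created_after)
    ]
    if not candidates:
        return None
    return max(candidates, key=lambda t: t["CreateTime"])["Name"]


def list_tables(database_name):
    """Fetch every table in a Glue database (needs AWS credentials)."""
    import boto3  # imported lazily so the helper above works without boto3

    glue = boto3.client("glue")
    tables = []
    for page in glue.get_paginator("get_tables").paginate(DatabaseName=database_name):
        tables.extend(page["TableList"])
    return tables


# Offline example with made-up table names:
sample = [
    {"Name": "orc_2020_01", "CreateTime": datetime(2020, 1, 1, tzinfo=timezone.utc)},
    {"Name": "orc_2020_02", "CreateTime": datetime(2020, 2, 1, tzinfo=timezone.utc)},
]
print(newest_table_name(sample))  # orc_2020_02
```

In a Lambda triggered by the crawler's completion event, newest_table_name(list_tables("my_database"), created_after=run_start) would then give the name to use in the Athena query, with run_start being whatever timestamp you recorded when the pipeline started.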
I am trying to use an AWS Glue crawler to create tables in Athena.
The source I am pulling from is a PostgreSQL server. The crawler is able to parse the tables, create the metadata, and show the tables and columns in the Glue Data Catalog, but the tables are not added in Athena even though I selected the Athena database as the target.
I'm not sure why this is happening.
Also, if I choose a CSV source from S3, the crawler is able to create a table in Athena, with _csv as a suffix.
Any help?
Athena doesn't recognize my Postgres tables added by Glue either. My guess is that Athena is meant for querying data stored in S3, so it doesn't work for tables that point at a database.
Also, to be able to query your CSV files in S3, the files need to be under a folder that Glue crawls. If you crawl just a single file with Glue, Athena will return 0 records from the query.
I have a data catalog managed by AWS Glue. Whenever my developers add new tables or partitions to our S3 bucket, we rely on crawlers that run every day to pick up the changes and keep the partitions up to date.
But we also need custom table properties. In Hive we keep the data source of each table as a table property, and we added the same properties to the tables in the Glue Data Catalog. However, every time we run the crawler it overwrites the custom table properties, such as Description.
Am I doing anything wrong? Or is this a bug in AWS Glue?
Have you checked the schema change policy in your crawler definition?
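To make that concrete, here is a sketch of setting the policy with boto3 so the crawler logs schema changes instead of updating existing table definitions. Whether this is enough to keep custom properties such as Description intact is an assumption you would need to verify on a test table; the helper names are made up.

```python
import json


def preserve_table_kwargs(crawler_name):
    """Arguments for glue.update_crawler that stop it rewriting existing tables.

    UpdateBehavior "LOG" tells the crawler not to update table definitions;
    new tables and partitions are still added. "InheritFromTable" makes
    partitions take their metadata from the table.
    """
    return {
        "Name": crawler_name,
        "SchemaChangePolicy": {
            "UpdateBehavior": "LOG",
            "DeleteBehavior": "LOG",
        },
        "Configuration": json.dumps({
            "Version": 1.0,
            "CrawlerOutput": {
                "Partitions": {"AddOrUpdateBehavior": "InheritFromTable"},
            },
        }),
    }


def apply_preserve_policy(crawler_name):
    """Send the policy to Glue (needs AWS credentials)."""
    import boto3  # imported lazily so the helper above works without boto3

    boto3.client("glue").update_crawler(**preserve_table_kwargs(crawler_name))
```

The trade-off is that genuine schema changes are then only logged, so you would have to apply them to the table yourself.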