I have a data catalog managed by AWS Glue. Whenever my developers add new tables or partitions to our S3 bucket, we run the crawlers daily to pick up the new partitions and keep the catalog current.
But we also need custom table properties. In our Hive metastore we store the data source of each table as a table property, and we added the same properties to the tables in the Glue Data Catalog. However, every time the crawler runs it overwrites custom table properties such as Description.
Am I doing something wrong, or is this a bug in AWS Glue?
Have you checked the schema change policy in your crawler definition?
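In case it helps, here is a minimal boto3 sketch of where that policy lives, assuming a crawler name of your own. Setting UpdateBehavior to LOG, together with the MergeNewColumns output behavior, tells the crawler to add new columns without rewriting existing table metadata such as your custom properties:

    import json
    import boto3

    glue = boto3.client("glue")

    glue.update_crawler(
        Name="daily-partition-crawler",  # hypothetical crawler name
        # Log schema changes instead of overwriting the table definition.
        SchemaChangePolicy={
            "UpdateBehavior": "LOG",
            "DeleteBehavior": "LOG",
        },
        # Merge new columns into existing tables; new partitions inherit
        # their metadata from the parent table.
        Configuration=json.dumps({
            "Version": 1.0,
            "CrawlerOutput": {
                "Tables": {"AddOrUpdateBehavior": "MergeNewColumns"},
                "Partitions": {"AddOrUpdateBehavior": "InheritFromTable"},
            },
        }),
    )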
I have a few AWS Glue crawlers set up to crawl CSVs in S3 to populate my tables in Athena.
My scenario and question:
I replace the .csv files in S3 daily with updated versions. Do I have to run the existing crawlers again, perhaps on a schedule, to update the tables in Athena with the latest content? Or is the crawler only required to run when the schema changes, such as when columns are added? I just want to ensure that my tables in Athena always output all of the data in the updated CSVs. I rarely make any schema changes to the table structures, so if the crawlers are only required when the structure actually changes, I would prefer to run them a lot less frequently.
When a Glue crawler runs, the following actions take place:
It classifies data to determine the format, schema, and associated properties of the raw data
Groups data into tables or partitions
Writes metadata to the Data Catalog
Athena references the schemas of the tables in the Data Catalog to query the specified S3 data source, but it reads the data itself from S3 at query time. So if the schema remains constant, you can reduce the frequency of scheduled crawler runs; replacing files in place with the same structure does not require a re-crawl.
You can also refer to the documentation here to understand how Glue crawlers work with CSV files in Athena: https://docs.aws.amazon.com/athena/latest/ug/glue-best-practices.html
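If you drop the schedule entirely, one option is to trigger the crawler on demand whenever you know the schema changed. A minimal boto3 sketch, with a hypothetical crawler name:

    import boto3

    glue = boto3.client("glue")

    # Run the crawler only when the schema actually changed.
    glue.start_crawler(Name="csv-schema-crawler")  # hypothetical name

    # Poll its state if you need to wait for completion.
    state = glue.get_crawler(Name="csv-schema-crawler")["Crawler"]["State"]
    print(state)  # READY, RUNNING, or STOPPING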
I am working on crawling data into the Data Catalog via AWS Glue, but I am a bit confused about the database definition. From what I can find in the AWS docs, "A database in the AWS Glue Data Catalog is a container that holds tables. You use databases to organize your tables into separate categories." I wonder what exactly a database contains. Does it load all the data from the other data sources and create a catalog on top of them, or does it only contain the catalog? How do I know the size of the tables in a Glue database? And what type of database does it use, like NoSQL or RDS?
For example, I create a crawler to load data from S3 and create a catalog table in Glue. Does the Glue table include all the data from the S3 bucket? If I delete the S3 bucket, will that affect other Glue jobs that run against the catalog table created by the crawler?
If the catalog table only includes the data schema, how can I keep it up to date when my data source is modified?
The Data Catalog is just a metadata store. Its mission is to document the data that lives elsewhere and to expose that metadata to other tools, like Athena or EMR, so they can discover the data.
Data is not replicated into the catalog; it remains at the origin. If you remove the table from the catalog, the data at the origin remains intact.
If you delete the origin data (as you described in your question), the other services will no longer have access to it, as it is deleted. If you run the crawler again, it should detect that the data is not there.
If you want to keep the crawler's schema up to date, you can either schedule automatic runs of the crawler or execute it on demand whenever your data changes. When the crawler runs again, it will update things like the record count, the partitions, and any schema changes accordingly. Please refer to the documentation to see the effect schema changes can have on your catalog.
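You can verify that the catalog holds only metadata by inspecting a table with boto3. A minimal sketch with hypothetical database and table names; recordCount and sizeKey are statistics the crawler typically writes into the table parameters, not the data itself:

    import boto3

    glue = boto3.client("glue")

    # Hypothetical database and table names.
    table = glue.get_table(DatabaseName="my_database", Name="my_table")["Table"]

    # Only schema and statistics live here; the rows stay in S3.
    print(table["StorageDescriptor"]["Location"])  # s3://... path to the data
    print(table["StorageDescriptor"]["Columns"])   # column names and types
    print(table["Parameters"].get("recordCount"))  # crawler-estimated row count
    print(table["Parameters"].get("sizeKey"))      # crawler-estimated size in bytes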
I have data coming into an S3 bucket, and I would like to run a query on it every hour. The data arrives as JSON. I crawl it, run a job on the data to transform it to ORC format, and crawl it again to create a table that is faster to query than the original JSON (which is deeply nested). I am trying to query the data with Athena, and I have managed to link the previous steps together using Lambda and CloudWatch Events.
The problem here is that the last crawler is supposed to create new tables instead of just new partitions of the same table, so the table name is not known before the chain of jobs runs. I found that you can listen for the creation of a new table and for the completion of a crawler, but the log for the end of a crawler's run doesn't contain the name of the newly created table (per Amazon's documentation). Is there a way to get this table name dynamically and query it using Lambda or Athena? Thanks
Why not invoke the Lambda from the Glue job after the crawler completes? The table name is the folder in the S3 bucket in which you stored the ORC data. Since that happens in the Glue job, I believe you already have the folder name, which you can pass to the Lambda from the Glue job.
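A rough sketch of such a Lambda, assuming the ORC output prefix is passed in the event. The name derivation relies on the crawler's usual convention of naming tables after the folder (lowercased, with non-alphanumeric characters replaced by underscores), which you should verify against your own catalog; the database name and the query results bucket are placeholders too:

    import re
    import boto3

    athena = boto3.client("athena")

    def handler(event, context):
        # e.g. event["orc_prefix"] == "s3://my-bucket/curated/hourly_logs/"
        folder = event["orc_prefix"].rstrip("/").rsplit("/", 1)[-1]
        # Crawlers usually name the table after the folder, normalized.
        table = re.sub(r"[^a-z0-9]", "_", folder.lower())

        athena.start_query_execution(
            QueryString=f'SELECT count(*) FROM "{table}"',
            QueryExecutionContext={"Database": "my_database"},  # placeholder
            ResultConfiguration={
                "OutputLocation": "s3://my-query-results/"  # placeholder
            },
        )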
I am new to AWS Glue and am having difficulty fully understanding the AWS docs, but I am working through the following use case:
We have an S3 bucket with a number of Avro files. We decided to use Avro due to its extensive support for schema changes over time, allowing new fields to be applied to old data with no problem.
With AWS Glue, I understand that a crawler creates a new table whenever there is a schema change. When our schema changed, this caused a number of new tables to be created by the crawler, as expected, but not quite as we desire...
Ultimately, we would like the crawler to detect the most recent schema and apply it to all the data we are crawling in the S3 bucket, outputting only one table. We had (perhaps incorrectly) assumed that by using Avro this would not be an issue, since the crawler could apply new schema fields, with a given default or null value, to older data (the benefit of using Avro) and output only one table that we could then query using AWS Athena.
Is there a way in AWS Glue to use a given schema for all data in the s3 bucket, enabling us to leverage the Avro benefit of schema evolution, so that all data is output into one table?
I haven't worked with Avro files specifically, but AWS Glue lets you configure the crawler in several ways.
If you create a new crawler, you'll be prompted with a few options under the "Configure the crawler's output" section.
Based on your situation, I think you'll need to tick the box that says "Update all new and existing partitions with metadata from the table".
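The same behavior can also be set programmatically. A boto3 sketch, under the assumption that the CombineCompatibleSchemas grouping policy fits your Avro layout (it asks the crawler to merge compatible schemas into a single table instead of creating one table per schema variant); the crawler name is a placeholder:

    import json
    import boto3

    glue = boto3.client("glue")

    glue.update_crawler(
        Name="avro-crawler",  # placeholder
        Configuration=json.dumps({
            "Version": 1.0,
            # Merge S3 paths with compatible schemas into a single table.
            "Grouping": {"TableGroupingPolicy": "CombineCompatibleSchemas"},
            "CrawlerOutput": {
                # New partitions inherit metadata from the parent table.
                "Partitions": {"AddOrUpdateBehavior": "InheritFromTable"},
            },
        }),
    )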
I have a service running that populates my S3 bucket with compressed log files, but the log files do not have a fixed schema, and Athena expects a fixed schema (which I wrote while creating the table).
So my question is, as in the title: is there any way I can query data with a dynamic schema? If not, is there another service like Athena that can do the same thing?
Amazon Athena can't do that by itself, but you can configure an AWS Glue crawler to automatically infer the schema of your JSON files. The crawler can run on a schedule, so your files will be indexed automatically even if the schema changes. Athena will use the Glue data catalog if AWS Glue is available in the region you're running Athena in.
See Cataloging Tables with a Crawler in the AWS Glue docs for the details on how to set that up.
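For illustration, a minimal boto3 sketch of such a scheduled crawler; the names, role, and paths are assumptions, and the cron expression runs it hourly:

    import boto3

    glue = boto3.client("glue")

    glue.create_crawler(
        Name="log-schema-crawler",  # placeholder
        Role="arn:aws:iam::123456789012:role/GlueCrawlerRole",  # placeholder
        DatabaseName="logs_db",  # placeholder
        Targets={"S3Targets": [{"Path": "s3://my-log-bucket/logs/"}]},
        # Re-infer the schema every hour so Athena picks up new fields.
        Schedule="cron(0 * * * ? *)",
    )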