Using AWS Glue with Apache Avro on schema changes

I am new to AWS Glue and am having difficulty fully understanding the AWS docs, but am struggling through the following use case:
We have an S3 bucket with a number of Avro files. We decided to use Avro because of its extensive support for schema changes over time, allowing new fields to be applied to old data with no problem.
With AWS Glue, I understand that a new table is created by a crawler whenever there is a schema change. When our schema has changed, this has caused a number of new tables to be created by the crawler, as expected, but not quite as we desire...
Ultimately, we would like the crawler to detect the most recent schema and apply this schema to all the data that we are crawling in the s3 bucket, outputting only one table. We had (perhaps incorrectly) assumed that by using Avro, this would not be an issue as the crawler could apply new schema fields with a given default or null value to older data (the benefit of using Avro), and only output one table that we then could query using AWS Athena.
Is there a way in AWS Glue to use a given schema for all data in the s3 bucket, enabling us to leverage the Avro benefit of schema evolution, so that all data is output into one table?

I haven't worked with Avro files specifically, but AWS Glue lets you configure the crawler in several ways.
If you create a new crawler, you'll be prompted with a few options under the "Configure the crawler's output" section.
Based on your situation, I think you'll need to tick the box that says Update all new and existing partitions with metadata from the table.
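If you would rather set this programmatically, the same checkbox maps to the crawler's Configuration JSON. A minimal boto3 sketch, assuming an existing crawler named avro-crawler (the crawler name and the optional grouping policy are my assumptions, not something from the question):

```python
import json
import boto3

glue = boto3.client("glue")

# "InheritFromTable" is what the console option "Update all new and existing
# partitions with metadata from the table" writes into the crawler config.
glue.update_crawler(
    Name="avro-crawler",  # placeholder crawler name
    Configuration=json.dumps({
        "Version": 1.0,
        "CrawlerOutput": {
            "Partitions": {"AddOrUpdateBehavior": "InheritFromTable"}
        },
        # Optionally ask the crawler to fold compatible schemas into one table
        "Grouping": {"TableGroupingPolicy": "CombineCompatibleSchemas"},
    }),
)
```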

Related

What does the Data Catalog contain in AWS Glue?

I am working on crawling data into the Data Catalog via AWS Glue, but I am a bit confused about the database definition. From what I can find in the AWS docs, "A database in the AWS Glue Data Catalog is a container that holds tables. You use databases to organize your tables into separate categories." I wonder what exactly a database contains. Does it load all the data from the other data sources and create a catalog on top of them, or does it only contain the catalog? How do I know the size of the tables in a Glue database? And what type of database does it use, like NoSQL or RDS?
For example, say I create a crawler to load data from S3 and create a catalog table in Glue. Does the Glue table include all the data from the S3 bucket? If I delete the S3 bucket, will it have an impact on other Glue jobs that run against the catalog table created by the crawler?
If the catalog table only includes the data schema, how can I keep it up to date when my data source is modified?
The Catalog is just a metadata store. Its mission is to document the data that lives elsewhere, and to expose that metadata to other tools, like Athena or EMR, so they can discover the data.
Data is not replicated into the catalog, but remains in the origin. If you remove the table from the catalog, the data in origin remains intact.
If you delete the origin data (as you described in your question), the other services will not have access to the data anymore, as it is deleted. If you run the crawler again it should detect it is not there.
If you want to keep the crawler schema up to date, you can either schedule automatic runs of the crawler, or execute it on demand whenever your data changes. When the crawler runs again it will update things like the number of records, the partitions, and even schema changes accordingly. Please refer to the documentation to see the effect that schema changes can have on your catalog.
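To make the metadata-only nature concrete, here is a small boto3 sketch (database, table, and crawler names are placeholders): the catalog entry holds the location and schema of the data, and re-running the crawler is what refreshes it.

```python
import boto3

glue = boto3.client("glue")

# The catalog entry only describes where the data lives and what shape it
# has; the rows themselves stay in S3 (or the original JDBC source).
table = glue.get_table(DatabaseName="my_database", Name="my_table")["Table"]
print(table["StorageDescriptor"]["Location"])                      # e.g. s3://bucket/prefix/
print([c["Name"] for c in table["StorageDescriptor"]["Columns"]])  # schema only, no rows

# To refresh the metadata after the source changes, re-run the crawler
# on demand (or give it a schedule when you create it).
glue.start_crawler(Name="my_crawler")
```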

Why is my AWS Glue crawler not creating any tables?

I'm attempting to use AWS Glue to ETL a MySQL database in RDS to S3 so that I can work with the data in services like SageMaker or Athena. At this time I don't care about transformations; this is a prototype and I simply want to dump the DB to S3 to start testing the various toolchains.
I've set up a Glue database and tested the connection to RDS successfully
I am using the AWS-provided Glue IAM service role
My S3 bucket has the correct prefix of aws-glue-*
I created a crawler using the Glue database, AWSGlue service role, and S3 bucket above with the options:
Schema updates in the data store: Update the table definition in the data catalog
Object deletion in the data store: Delete tables and partitions from the data catalog.
When I run the crawler, it completes in ~60 seconds but it does not create any tables in the database.
I've tried adding the Admin policy to the glue service role to eliminate IAM access issues and the result is the same.
Also, CloudWatch logs are empty. Log groups are created for the test connection and the crawler but neither contains any entries.
I'm not sure how to further troubleshoot this, info on AWS Glue seems pretty sparse.
Figured it out. I had a syntax error in my "include path" for the crawler. Make sure the connection is the data source (RDS in this case) and the include path lists the data target you want e.g. mydatabase/% (I forgot the /%).
You can substitute the percent (%) character for a schema or table name. For databases that support schemas, type MyDatabase/MySchema/% to match all tables in MySchema within MyDatabase. Oracle and MySQL don't support a schema in the path; instead, type MyDatabase/%. For information about which JDBC data stores support schemas, see Cataloging Tables with a Crawler.
Ryan Fisher is correct in the sense that it's an error; I wouldn't categorize it as a syntax error. When I ran into this, it was because the 'Include path' didn't include the default schema that SQL Server lovingly provides to you.
I had this: database_name/table_name
When it needed to be: database_name/dbo/table_name
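For reference, here is a hedged boto3 sketch of a JDBC crawler target that spells out the include path. The crawler, catalog database, and connection names are placeholders; which path variant you need depends on whether your engine has a schema level.

```python
import boto3

glue = boto3.client("glue")

# Sketch of a JDBC crawler whose include path names the default schema.
glue.create_crawler(
    Name="rds-crawler",                    # placeholder
    Role="AWSGlueServiceRole",             # placeholder IAM role
    DatabaseName="my_catalog_db",          # placeholder catalog database
    Targets={
        "JdbcTargets": [
            {
                "ConnectionName": "my-rds-connection",  # placeholder
                "Path": "database_name/dbo/%",          # SQL Server: include the schema
                # "Path": "database_name/%",            # MySQL / Oracle: no schema level
            }
        ]
    },
)
```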

What does an AWS Glue Crawler do?

I've read the AWS Glue docs on crawlers here: https://docs.aws.amazon.com/glue/latest/dg/add-crawler.html, but I'm still unclear on what exactly the Glue crawler does. Does a crawler go through your S3 buckets and create pointers to those buckets?
When the docs say "The output of the crawler consists of one or more metadata tables that are defined in your Data Catalog" what is the purpose of these metadata tables?
The crawler creates the metadata that allows Glue and services such as Athena to view the S3 information as a database with tables. That is, it allows you to create the Glue Data Catalog.
This way you can see the information that S3 holds as a database composed of several tables.
For example, if you want to create a crawler you must specify the following fields (see the sketch after this list):
Database --> Name of database
Service role --> service-role/AWSGlueServiceRole
Selected classifiers --> Specify Classifier
Include path --> S3 location
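A rough boto3 equivalent of those console fields might look like this; every name below is a placeholder for your own resources, and the classifier is optional.

```python
import boto3

glue = boto3.client("glue")

# Maps the console fields above onto the CreateCrawler API call.
glue.create_crawler(
    Name="s3-data-crawler",                       # placeholder
    DatabaseName="my_catalog_db",                 # Database
    Role="AWSGlueServiceRole",                    # Service role
    Classifiers=["my-custom-classifier"],         # Selected classifiers (optional)
    Targets={"S3Targets": [{"Path": "s3://my-bucket/data/"}]},  # Include path
)
```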
Crawlers are needed to analyze the data in a specified S3 location and generate/update the Glue Data Catalog, which is basically a metastore for the actual data (similar to the Hive metastore). In other words, it persists information about the physical location of the data, its schema, format, and partitions, which makes it possible to query the actual data via Athena or to load it in Glue jobs.
I would suggest reading the documentation to understand Glue crawlers better and, of course, running some experiments.
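As a small illustration of what that buys you: once the crawler has registered the S3 data as a catalog table, Athena can query it by name. A minimal boto3 sketch, with the database, table, and query-result locations as placeholders:

```python
import boto3

athena = boto3.client("athena")

# Athena reads the schema and data location from the Glue Data Catalog and
# scans the underlying S3 objects directly.
resp = athena.start_query_execution(
    QueryString="SELECT * FROM my_crawled_table LIMIT 10",             # placeholder table
    QueryExecutionContext={"Database": "my_catalog_db"},               # placeholder database
    ResultConfiguration={"OutputLocation": "s3://my-query-results/"},  # placeholder bucket
)
print(resp["QueryExecutionId"])
```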

Athena can't resolve CSV files from AWS DMS

I have DMS configured to continuously replicate data from MySQL RDS to S3. This creates two types of CSV files: a full load and change data capture (CDC). According to my tests, I have the following files:
testdb/addresses/LOAD001.csv.gz
testdb/addresses/20180405_205807186_csv.gz
After DMS is running properly, I trigger an AWS Glue crawler to build the Data Catalog for the S3 bucket that contains the MySQL replication files, so the Athena users will be able to build queries against our S3-based data lake.
Unfortunately the crawlers are not building the correct table schema for the tables stored in S3.
For the example above, it creates two tables for Athena:
addresses
20180405_205807186_csv_gz
The file 20180405_205807186_csv.gz contains a one-line update, but the crawler is not capable of merging the two pieces of information (taking the initial load from LOAD001.csv.gz and applying the update described in 20180405_205807186_csv.gz).
I also tried to create the table in the Athena console, as described in this blog post: https://aws.amazon.com/pt/blogs/database/using-aws-database-migration-service-and-amazon-athena-to-replicate-and-run-ad-hoc-queries-on-a-sql-server-database/.
But it does not yield the desired output.
From the blog post:
"When you query data using Amazon Athena (later in this post), you simply point the folder location to Athena, and the query results include existing and new data inserts by combining data from both files."
Am I missing something?
The AWS Glue crawler is not able to reconcile the different schemas in the initial LOAD CSVs and incremental CDC CSVs for each table. This blog post from AWS and its associated CloudFormation templates demonstrate how to use AWS Glue jobs to process and combine these two types of DMS target outputs.
Athena will combine the files in S3 if they have the same structure. The blog speaks only to inserts of new data in the CDC files. You'll have to build a process to merge the CDC files. Not what you wanted to hear, I'm sure.
From the blog post:
"When you query data using Amazon Athena (later in this post), due to the way AWS DMS adds a column indicating inserts, deletes and updates to the new file created as part of CDC replication, we will not be able to run the Athena query by combining data from both files (initial load and CDC files)."

AWS Glue Crawler overwrite custom table properties

I have a Data Catalog managed by AWS Glue, and for any update that my developers make in our S3 bucket with new tables or partitions, we run the crawlers every day to keep the new partitions healthy.
But we also need custom table properties. In our Hive setup we have the data source of each table as a table property, and we added it to the tables in the Glue Data Catalog, but every time we run the crawler it overwrites the custom table properties, such as the description.
Am I doing anything wrong? Or is this a bug from AWS Glue?
Have you checked Schema change policy in your crawler definition?
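If the crawler is rewriting your table definitions, one thing to try is switching its schema change policy so that it only logs schema changes instead of updating existing tables; the trade-off is that genuine schema changes then have to be applied some other way. A boto3 sketch, with the crawler name as a placeholder:

```python
import boto3

glue = boto3.client("glue")

# With UpdateBehavior set to LOG, the crawler logs schema changes instead of
# rewriting existing table definitions, so custom properties are left alone.
glue.update_crawler(
    Name="daily-partition-crawler",  # placeholder crawler name
    SchemaChangePolicy={
        "UpdateBehavior": "LOG",
        "DeleteBehavior": "DEPRECATE_IN_DATABASE",
    },
)
```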