I've read the AWS Glue docs re: the crawlers here: https://docs.aws.amazon.com/glue/latest/dg/add-crawler.html but I'm still unclear on what exactly the Glue crawler does. Does a crawler go through your S3 buckets and create pointers to those buckets?
When the docs say "The output of the crawler consists of one or more metadata tables that are defined in your Data Catalog" what is the purpose of these metadata tables?
The crawler creates the metadata that allows Glue and services such as Athena to view the S3 information as a database with tables. That is, it allows you to create the Glue Data Catalog.
This way you can see the information stored in S3 as a database composed of several tables.
For example, if you want to create a crawler you must specify the following fields:
Database --> name of the database
Service role --> e.g. service-role/AWSGlueServiceRole
Selected classifiers --> specify a classifier
Include path --> S3 location
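As a rough sketch, those same fields map onto the `create_crawler` call in boto3 (the AWS SDK for Python). The crawler name, database, role, and S3 path below are placeholder assumptions for illustration, and the live call needs AWS credentials, so it is shown commented out:

```python
def build_crawler_config(name, database, role, s3_path):
    """Assemble the create_crawler parameters for an S3-backed crawler."""
    return {
        "Name": name,
        "DatabaseName": database,  # Database --> name of the catalog database
        "Role": role,              # Service role --> e.g. service-role/AWSGlueServiceRole
        "Targets": {"S3Targets": [{"Path": s3_path}]},  # Include path --> S3 location
    }

config = build_crawler_config(
    "example-crawler", "example_db",
    "service-role/AWSGlueServiceRole", "s3://example-bucket/data/",
)
# import boto3
# boto3.client("glue").create_crawler(**config)  # requires AWS credentials
```

Classifiers, when you need them, go in a `Classifiers` list in the same parameter dict.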
Crawlers are needed to analyze data in a specified S3 location and generate/update the Glue Data Catalog, which is basically a metastore for the actual data (similar to the Hive metastore). In other words, it persists information about the physical location of the data, its schema, format, and partitions, which makes it possible to query the actual data via Athena or to load it in Glue jobs.
I would suggest reading this documentation to understand Glue crawlers better and, of course, running some experiments.
In AWS Glue jobs, in order to retrieve data from a DB or S3, we can use two approaches: 1) using a crawler, 2) using a direct connection to the DB or S3.
So my question is: how is a crawler better than connecting directly to a database and retrieving the data?
AWS Glue crawlers do not retrieve the actual data. A crawler accesses your data stores, works through a prioritized list of classifiers to extract the schema of your data and other statistics, and then populates the Glue Data Catalog with this metadata. Crawlers can be scheduled to run periodically so that they detect both the arrival of new data and changes to existing data, including table definition changes. Crawlers automatically add new tables, new partitions to existing tables, and new versions of table definitions.
The AWS Glue Data Catalog becomes a common metadata repository between Amazon Athena, Amazon Redshift Spectrum, and Amazon S3. AWS Glue crawlers help in building this metadata repository.
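To make "metadata repository" concrete, here is a sketch of reading a table definition back from the Data Catalog — the same metadata Athena and Redshift Spectrum consume. The database and table names are assumptions, and the live `get_table` call needs AWS credentials, so it is shown commented out; the truncated response shape below is for illustration only:

```python
# import boto3
# table = boto3.client("glue").get_table(
#     DatabaseName="example_db", Name="example_table")["Table"]

def summarize_table(table):
    """Extract the catalog metadata that downstream services rely on."""
    sd = table["StorageDescriptor"]
    return {
        "location": sd["Location"],                     # physical S3 path of the data
        "columns": [c["Name"] for c in sd["Columns"]],  # schema inferred by the crawler
        "partitions": [k["Name"] for k in table.get("PartitionKeys", [])],
    }

# Truncated shape of a get_table response, as an example:
example = {
    "StorageDescriptor": {
        "Location": "s3://example-bucket/data/",
        "Columns": [{"Name": "id"}, {"Name": "name"}],
    },
    "PartitionKeys": [{"Name": "dt"}],
}
summary = summarize_table(example)
```

Note that nothing here touches the rows themselves — only the pointers and schema.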
I have a few AWS Glue crawlers set up to crawl CSVs in S3 to populate my tables in Athena.
My scenario and question:
I replace the .csv files in S3 daily with updated versions. Do I have to run the existing crawlers again, perhaps on a schedule, to update the tables in Athena with the latest content? Or is the crawler only required to run if the schema changes, such as additional columns being added? I just want to ensure that my tables in Athena always output all of the data from the updated CSVs - I rarely make any schema changes to the table structures. If the crawlers are only required to run when actual structural changes take place, then I would prefer to run them a lot less frequently.
When a Glue crawler runs, the following actions take place:
It classifies data to determine the format, schema, and associated properties of the raw data
Groups data into tables or partitions
Writes metadata to the Data Catalog
The schema of the tables created in the Data Catalog is referenced by Athena to query the specified S3 data source. So, if the schema remains constant, crawler runs can be scheduled less frequently.
You can also refer to the documentation here to understand how Glue crawlers work with CSV files in Athena: https://docs.aws.amazon.com/athena/latest/ug/glue-best-practices.html
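If you do want periodic runs, one way is to attach a schedule to the crawler. Glue schedules use a six-field cron syntax (minutes, hours, day-of-month, month, day-of-week, year). The crawler name and time below are assumptions, and the live call needs AWS credentials, so it is shown commented out:

```python
def daily_schedule(hour, minute=0):
    """Build a Glue cron expression that fires once a day (UTC)."""
    # Fields: minutes hours day-of-month month day-of-week year
    return f"cron({minute} {hour} * * ? *)"

schedule = daily_schedule(6)  # 06:00 UTC, e.g. after a daily CSV refresh
# import boto3
# boto3.client("glue").update_crawler(Name="example-crawler", Schedule=schedule)
```

Since the crawler here only maintains the schema, a less frequent schedule (or on-demand runs) is also fine when the structure rarely changes.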
I am working on crawling data into the Data Catalog via AWS Glue. But I am a bit confused about the database definition. From what I can find in the AWS docs: "A database in the AWS Glue Data Catalog is a container that holds tables. You use databases to organize your tables into separate categories." I wonder what exactly a database contains. Does it load all the data from other data sources and create a catalog on top of them, or does it only contain the catalog? How do I know the size of the tables in a Glue database? And what type of database does it use - NoSQL, RDS?
For example, I create a crawler to load data from S3 and create a catalog table in Glue. Does the Glue table include all the data from the S3 bucket? If I delete the S3 bucket, will it have an impact on other Glue jobs that run against the catalog table created by the crawler?
If the catalog table only includes the data schema, how can I keep it up to date when my data source is modified?
The Catalog is just a metadata store. Its mission is to document the data that lives elsewhere and to expose it to other tools, like Athena or EMR, so they can discover the data.
Data is not replicated into the catalog; it remains at the origin. If you remove the table from the catalog, the data at the origin remains intact.
If you delete the origin data (as you described in your question), the other services will no longer have access to the data, since it is deleted. If you run the crawler again, it should detect that it is no longer there.
If you want to keep the crawler schema up to date, you can either schedule automatic runs of the crawler or execute it on demand whenever your data changes. When the crawler runs again, it will update accordingly things like the number of records, partitions, or even changes in the schema. Please refer to the documentation to see the effect changes in the schema can have on your catalog.
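For the on-demand route, a small sketch: check the crawler's state and start it only when it is idle (Glue reports `READY`, `RUNNING`, or `STOPPING`). The `glue` argument is assumed to be a boto3 Glue client, and the crawler name is a placeholder:

```python
def start_crawler_if_idle(glue, name):
    """Start the crawler unless a run is already in progress."""
    state = glue.get_crawler(Name=name)["Crawler"]["State"]
    if state == "READY":                # READY means idle and startable
        glue.start_crawler(Name=name)
        return True
    return False                        # RUNNING or STOPPING: skip this round

# Usage sketch (needs AWS credentials):
# import boto3
# start_crawler_if_idle(boto3.client("glue"), "example-crawler")
```

You could call this from, say, a Lambda function wired to S3 object-created events, so the catalog refreshes whenever new data lands.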
I have DMS configured to continuously replicate data from MySQL RDS to S3. This creates two types of CSV files: a full load and change data capture (CDC). According to my tests, I have the following files:
testdb/addresses/LOAD001.csv.gz
testdb/addresses/20180405_205807186_csv.gz
After DMS is running properly, I trigger an AWS Glue crawler to build the Data Catalog for the S3 bucket that contains the MySQL replication files, so the Athena users will be able to build queries in our S3-based data lake.
Unfortunately the crawlers are not building the correct table schema for the tables stored in S3.
For the example above It creates two tables for Athena:
addresses
20180405_205807186_csv_gz
The file 20180405_205807186_csv.gz contains a one-line update, but the crawler is not capable of merging the two files (taking the first load from LOAD001.csv.gz and applying the update described in 20180405_205807186_csv.gz).
I also tried to create the table in the Athena console, as described in this blog post: https://aws.amazon.com/pt/blogs/database/using-aws-database-migration-service-and-amazon-athena-to-replicate-and-run-ad-hoc-queries-on-a-sql-server-database/.
But it does not yield the desired output.
From the blog post:
When you query data using Amazon Athena (later in this post), you simply point the folder location to Athena, and the query results include existing and new data inserts by combining data from both files.
Am I missing something?
The AWS Glue crawler is not able to reconcile the different schemas in the initial LOAD CSVs and the incremental CDC CSVs for each table. This blog post from AWS and its associated CloudFormation templates demonstrate how to use AWS Glue jobs to process and combine these two types of DMS target outputs.
Athena will combine the files in S3 if they have the same structure. The blog speaks only to inserts of new data in the CDC files. You'll have to build a process to merge the CDC files. Not what you wanted to hear, I'm sure.
From the blog post:
"When you query data using Amazon Athena (later in this post), due to the way AWS DMS adds a column indicating inserts, deletes and updates to the new file created as part of CDC replication, we will not be able to run the Athena query by combining data from both files (initial load and CDC files)."
I have a service running that populates my S3 bucket with compressed log files, but the log files do not have a fixed schema, and Athena expects a fixed schema (which I specified while creating the table).
So my question is, as in the title: is there any way I can query a dynamic schema? If not, is there any other service like Athena that can do the same thing?
Amazon Athena can't do that by itself, but you can configure an AWS Glue crawler to automatically infer the schema of your JSON files. The crawler can run on a schedule, so your files will be indexed automatically even if the schema changes. Athena will use the Glue Data Catalog if AWS Glue is available in the region you're running Athena in.
See Cataloging Tables with a Crawler in the AWS Glue docs for the details on how to set that up.