I have a service running that populates my S3 bucket with compressed log files, but the log files do not have a fixed schema, while Athena expects the fixed schema I declared when creating the table.
So my question is, as in the title: is there any way to query a dynamic schema? If not, is there any other service like Athena that can do the same thing?
Amazon Athena can't do that by itself, but you can configure an AWS Glue crawler to automatically infer the schema of your JSON files. The crawler can run on a schedule, so your files will be indexed automatically even if the schema changes. Athena will use the Glue data catalog if AWS Glue is available in the region you're running Athena in.
See Cataloging Tables with a Crawler in the AWS Glue docs for the details on how to set that up.
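For reference, a crawler like that can also be created programmatically. Here is a minimal boto3 sketch; the crawler name, IAM role, catalog database, bucket path, and schedule below are hypothetical placeholders:

```python
import boto3

glue = boto3.client("glue")

# Create a crawler that infers the schema of the compressed JSON logs
# and refreshes the Data Catalog on a daily schedule.
glue.create_crawler(
    Name="log-files-crawler",                      # hypothetical name
    Role="AWSGlueServiceRole-logs",                # hypothetical IAM role
    DatabaseName="logs_db",                        # catalog database Athena will query
    Targets={"S3Targets": [{"Path": "s3://my-log-bucket/logs/"}]},  # hypothetical path
    Schedule="cron(0 2 * * ? *)",                  # run daily at 02:00 UTC
)

glue.start_crawler(Name="log-files-crawler")       # or just wait for the schedule
```

Once the crawler has run, the inferred tables show up in the Glue Data Catalog and are queryable from Athena.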
I'm attempting to use AWS Glue to ETL a MySQL database in RDS to S3 so that I can work with the data in services like SageMaker or Athena. At this time I don't care about transformations; this is a prototype and I simply want to dump the DB to S3 to start testing the various tool chains.
I've set up a Glue database and tested the connection to RDS successfully
I am using the AWS-provided Glue IAM service role
My S3 bucket has the correct prefix of aws-glue-*
I created a crawler using the Glue database, AWSGlue service role, and S3 bucket above with the options:
Schema updates in the data store: Update the table definition in the data catalog
Object deletion in the data store: Delete tables and partitions from the data catalog.
When I run the crawler, it completes in ~60 seconds but it does not create any tables in the database.
I've tried adding the Admin policy to the Glue service role to eliminate IAM access issues, and the result is the same.
Also, CloudWatch logs are empty. Log groups are created for the test connection and the crawler but neither contains any entries.
I'm not sure how to further troubleshoot this, info on AWS Glue seems pretty sparse.
Figured it out. I had a syntax error in my "include path" for the crawler. Make sure the connection is the data source (RDS in this case) and the include path lists the data target you want e.g. mydatabase/% (I forgot the /%).
You can substitute the percent (%) character for a schema or table. For databases that support schemas, type MyDatabase/MySchema/% to match all tables in MySchema within MyDatabase. Oracle and MySQL don't support schema in the path; instead type MyDatabase/%. For information about which JDBC data stores support schema, see Cataloging Tables with a Crawler.
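For what it's worth, the include path ends up in the crawler's JDBC target. A rough boto3 sketch, where the crawler, role, database, and connection names are hypothetical:

```python
import boto3

glue = boto3.client("glue")

# JDBC crawler target; the connection name must match the Glue connection
# that was tested against RDS. All names below are hypothetical.
glue.create_crawler(
    Name="rds-mysql-crawler",
    Role="AWSGlueServiceRole",
    DatabaseName="glue_catalog_db",
    Targets={
        "JdbcTargets": [
            {
                "ConnectionName": "rds-mysql-connection",
                "Path": "mydatabase/%",  # MySQL/Oracle: no schema segment
                # For engines with schemas (e.g. SQL Server): "mydatabase/dbo/%"
            }
        ]
    },
)
```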
Ryan Fisher is correct in the sense that it's an error; I wouldn't categorize it as a syntax error. When I ran into this it was because the 'Include path' didn't include the default schema that SQL Server lovingly provides to you.
I had this: database_name/table_name
When it needed to be: database_name/dbo/table_name
I've read the AWS Glue docs re: the crawlers here: https://docs.aws.amazon.com/glue/latest/dg/add-crawler.html but I'm still unclear on what exactly the Glue crawler does. Does a crawler go through your S3 buckets and create pointers to those buckets?
When the docs say "The output of the crawler consists of one or more metadata tables that are defined in your Data Catalog" what is the purpose of these metadata tables?
The CRAWLER creates the metadata that allows GLUE and services such as ATHENA to view the S3 information as a database with tables. That is, it allows you to create the Glue Catalog.
This way you can see the information that S3 holds as a database composed of several tables.
For example, if you want to create a crawler you must specify the following fields:
Database --> Name of the database
Service role --> service-role/AWSGlueServiceRole
Selected classifiers --> Specify classifier
Include path --> S3 location
Crawlers are needed to analyze data in a specified S3 location and generate/update the Glue Data Catalog, which is basically a metastore for the actual data (similar to the Hive metastore). In other words, it persists information about the physical location of the data, its schema, format and partitions, which makes it possible to query the actual data via Athena or to load it in Glue jobs.
I would suggest reading this documentation to understand Glue crawlers better and, of course, making some experiments.
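To make that concrete, once a crawler has populated the catalog you can query the resulting table from Athena like any SQL table. A minimal boto3 sketch, assuming a hypothetical database, table, and results bucket:

```python
import boto3

athena = boto3.client("athena")

# Run a query against a table the crawler created in the Data Catalog.
response = athena.start_query_execution(
    QueryString="SELECT * FROM my_crawled_table LIMIT 10",              # hypothetical table
    QueryExecutionContext={"Database": "glue_catalog_db"},              # hypothetical database
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},  # required results bucket
)
print(response["QueryExecutionId"])
```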
I am trying to use an AWS Glue crawler to create tables in Athena.
The source I am pulling from is a PostgreSQL server. The crawler is able to parse the tables, create metadata and show the tables and columns in the Glue Data Catalog, but the tables are not added in Athena, despite the fact that I have added the target database from Athena.
Not sure why this is happening.
Also, if I choose a CSV source from S3, then it is able to create a table in Athena with _csv as a suffix.
Any help?
Athena doesn't recognize my Postgres tables added by Glue either. My guess is that Athena is meant for querying data stored on S3, so it doesn't work for database queries.
Also, to be able to query your CSV files on S3, the files need to be under a folder crawled by Glue. If you just crawl a single file with Glue, Athena will return 0 records from the query.
Recently we started storing our backups in AWS S3. It is all CSV files that we need to query through AWS Athena.
We tried to insert the tables one by one, but it's taking too long; it is a fair amount of data. Is there any API that we can use, or something that is already set up?
We were about to do something with Spark, but maybe there is a simpler way, or something that has already been done.
Thanks
You can simply create an external table on top of CSV files with the required properties.
Reference: Create External Table on AWS Athena
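As an illustration of the external-table approach, the DDL below defines a table over a prefix of CSV files and runs it through boto3 (it can equally be pasted into the Athena console). The database, bucket, and column names are hypothetical:

```python
import boto3

athena = boto3.client("athena")

# Hypothetical DDL: an external table pointing at a prefix of CSV backups.
ddl = """
CREATE EXTERNAL TABLE IF NOT EXISTS orders (
    order_id   string,
    created_at string,
    amount     double
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 's3://my-backup-bucket/orders/'
TBLPROPERTIES ('skip.header.line.count' = '1')
"""

athena.start_query_execution(
    QueryString=ddl,
    QueryExecutionContext={"Database": "backups"},                      # hypothetical database
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},  # required results bucket
)
```

Looping a call like this over each backup prefix is one way to avoid creating the tables one by one by hand.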
You can also use Glue Crawler and configure it to automatically populate the tables for you.
Reference: Cataloging tables with a crawler
There are different AWS SDKs available (here) to automate your tasks like uploading files to S3, creating Athena tables or cataloging tables through a Glue crawler.
I have a data catalog managed by AWS Glue, and whenever my developers add new tables or partitions to our S3 bucket, we run the crawlers every day to keep those new partitions healthy.
But we also need custom table properties. In Hive we have the data source of each table as a table property, and we added it to the tables in the Glue Data Catalog, but every time we run the crawler it overwrites custom table properties like Description.
Am I doing anything wrong? Or is this a bug in AWS Glue?
Have you checked the Schema change policy in your crawler definition?
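For context, the schema change policy can be set on the crawler so that runs log changes instead of rewriting the table definition. A minimal boto3 sketch with a hypothetical crawler name (whether this preserves every custom property is worth verifying against current Glue behavior):

```python
import boto3

glue = boto3.client("glue")

# Tell the crawler to log schema changes instead of rewriting the
# table definition (and with it the custom properties) on every run.
glue.update_crawler(
    Name="daily-partition-crawler",      # hypothetical crawler name
    SchemaChangePolicy={
        "UpdateBehavior": "LOG",         # don't update the table definition
        "DeleteBehavior": "LOG",         # don't delete tables/partitions
    },
)
```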