How to avoid use of crawler in aws glue - amazon-web-services

AWS glue crawler has cost associated with it, how to avoid us of the crawler in aws glue.
Is there any way we can avoid the use of crawler and infer schema from any other option, so that cost can be reduced.

In addition to what bdcloud has said, it's also possible to add tables to the data catalogue using the 'AWS::Glue::Table' resource in CloudFormation.
https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/aws-resource-glue-table.html
It's easier to do this if you have a table schema you can use as a template (aws glue get-table --database-name <db name> --name <table name> will give you JSON that is pretty close to what CloudFormation is expecting).
Again, you need to know your schema in advance, but choose the approach that best fits the workflow you're going with.

You can use Athena to create tables in Glue catalog, but to do so you need to know the schema of the file or you can get the DDL from the existing table created by running SHOW CREATE TABLE <table-name> in Athena and then you can modified the DDL statement according to your schema.
DDL queries are free in Athena and incurs no charges.
One other way of doing it is by issuing a Glue create table API call. Please refer to this for python syntax.

Related

What is the simplest way to extract all 26 tables from a single DynamoDB db into AWS Glue Catalog

I am trying to build AWS QuickSight reports using AWS Athena that builds the specific views for said reports. however, I seem to only be able to select a single table in creating the Glue job despite being able to select all tables i need for the crawler of the entire DB from Dynamo.
What is the simplest route to get a complete extract of all tables that is queryable in Athena.
I dont want to connect the reports direct to dynamoDB as it s a production database and want to create some separation to avoid any performance degradation by a poor query etc.

replace glue with presto built-in commands

I have completed all the steps mentioned in this tutorial.
https://aws.amazon.com/blogs/big-data/improve-amazon-athena-query-performance-using-aws-glue-data-catalog-partition-indexes/
I am getting the expected results. But I will like to know if this is possible without glue.
I will like to use only Athena (as well as S3) and nothing else to achieve the same results.
Athena uses the Glue Data Catalog to store table metadata. Technically you can use Athena Federation to not use the Glue Data Catalog, but for normal usage the Glue Data Catalog is necessary, just like S3 is.
Partition indexes is a feature of Glue Data Catalog, and without implementing something like it yourself and use Federation there is no equivalent feature in Athena itself, since Athena does not store table metadata itself.
Perhaps you could explain in more detail why you don't want to use Glue Data Catalog?

Create Athena resources with Terraform

I would like to create via Terraform an Athena database including tables and views. I have already searched a lot and found some posts, e.g. here: Create AWS Athena view programmatically
I know that I can use Terraform provisioners to execute AWS CLI commands to create these resources, for example like this: AWS Athena Create table view with SQL
But I don't want to do that. I want to create everything (as far as possible) with Terraform so that I don't have to worry about lifecycle etc.
As far as I understand, an Athena database can be a Glue database, depending on the source you choose. If I choose the AWSDataCatalog (Glue) as data source in Athena, it should not matter if I create an Athena database or a Glue database with Terraform, correct?
In Glue I can also create tables, but no views. Do the Glue tables automatically correspond to Athena tables? How can I create Athena views? I would like to create everything with SQL DDL, just like you can do it in the AWS Web Console. How does this work via Terraform? If this functionality is not available, what is the best way to go? I am grateful for every tip and help!
Athena uses the Glue Data Catalog to store metadata about databases, tables, and views. All Athena tables are Glue tables. However, not all Glue tables work with Athena – you can create tables in Glue that won't be visible in Athena, and you can create tables that will be visible but won't work (for example cause runtime errors when you query them).
Athena uses Glue Data Catalog for views, but the format is very specific to Athena, unlike regular tables which can be made interoperable with for example Spark.
In an answer to the question you link to I explain in detail the anatomy of an Athena view. I have created views with CloudFormation with that information so it can be done with Terraform too. Unless you write code you will have to jump through all the hoops and repeat most of the information as Presto metadata, unfortunately.

Add location dynamically in Amazon Redshift create table statement

I am trying to create external table in Amazon Redshift using statement
mentioned at this link.
In my case I want location To be parameterized instead of static value
I am using dB Weaver for Amazon redshift
If your partitions are hive compatible(<partition_column_name>=<partition_column_value>) and your table is defined via Glue or Athena, then you can run MSCK REPAIR TABLE on the Athena table directly, which would add them. Read this thread for more info: https://forums.aws.amazon.com/thread.jspa?messageID=800945
You can also try using partition projections, if you don't use hive compatible partitions, where you define the structure of the files location in relation to the partitions and parameters.
If those don't work with you, you can use AWS Glue Crawlers which supposedly automatically detect partitions: https://docs.aws.amazon.com/glue/latest/dg/add-crawler.html
If that doesn't work for you, well then your problem is very specific. I suggest pulling up your sleeves and write some code, deploy on Lambda or AWS Glue Python Shell Job. Here's a bunch of examples where other people tried that:
https://medium.com/swlh/add-newly-created-partitions-programmatically-into-aws-athena-schema-d773722a228e
https://medium.com/#alsmola/partitioning-cloudtrail-logs-in-athena-29add93ee070

Why is my AWS Glue crawler not creating any tables?

I'm attempting to use AWS Glue to ETL a MySQL database in RDS to S3 so that I can work with the data in services like SageMaker or Athena. At this time, I don't care about transformations, this is a prototype and I simply want to dump the DB to S3 to start testing the various tool chains.
I've set up a Glue database and tested the connection to RDS successfully
I am using the AWS provide Glue IAM service role
My S3 bucket has the correct prefix of aws-glue-*
I created a crawler using the Glue database, AWSGlue service role, and S3 bucket above with the options:
Schema updates in the data store: Update the table definition in the data catalog
Object deletion in the data store: Delete tables and partitions from the data catalog.
When I run the crawler, it completes in ~60 seconds but it does not create any tables in the database.
I've tried adding the Admin policy to the glue service role to eliminate IAM access issues and the result is the same.
Also, CloudWatch logs are empty. Log groups are created for the test connection and the crawler but neither contains any entries.
I'm not sure how to further troubleshoot this, info on AWS Glue seems pretty sparse.
Figured it out. I had a syntax error in my "include path" for the crawler. Make sure the connection is the data source (RDS in this case) and the include path lists the data target you want e.g. mydatabase/% (I forgot the /%).
You can substitute the percent (%) character for a schema or table. For databases that support schemas, type MyDatabase/MySchema/% to match all tables in MySchema with MyDatabase. Oracle and MySQL don't support schema in the path, instead type MyDatabase/%. For information about which JDBC data stores support schema, see Cataloging Tables with a Crawler.
Ryan Fisher is correct in the sense that it's an error. I wouldn't categorize it as a syntax error. When I ran into this it was because the 'Include path' didn't include the default schema that sql server lovingly provides to you.
I had this: database_name/table_name
When it needed to be: database_name/dbo/table_name