replace glue with presto built-in commands - amazon-athena

I have completed all the steps mentioned in this tutorial.
https://aws.amazon.com/blogs/big-data/improve-amazon-athena-query-performance-using-aws-glue-data-catalog-partition-indexes/
I am getting the expected results, but I would like to know if this is possible without Glue.
I would like to use only Athena (and S3) and nothing else to achieve the same results.

Athena uses the Glue Data Catalog to store table metadata. Technically you can use Athena Federation to avoid the Glue Data Catalog, but for normal usage the Glue Data Catalog is necessary, just like S3 is.
Partition indexes are a feature of the Glue Data Catalog. Without implementing something like it yourself and using Federation, there is no equivalent feature in Athena itself, since Athena does not store table metadata on its own.
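For illustration, partition indexes are created and managed through the Glue API, not through Athena SQL. A minimal boto3 sketch, where the database, table, and partition key names are placeholders, not anything from your setup:

import boto3

glue = boto3.client("glue")

# Add a partition index to an existing partitioned Glue table.
# "sales", "events", "year" and "month" are placeholder names.
glue.create_partition_index(
    DatabaseName="sales",
    TableName="events",
    PartitionIndex={
        "IndexName": "year_month_index",
        "Keys": ["year", "month"],  # must be existing partition keys, in order
    },
)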
Perhaps you could explain in more detail why you don't want to use the Glue Data Catalog?

Related

Create Athena resources with Terraform

I would like to create an Athena database, including tables and views, via Terraform. I have already searched a lot and found some posts, e.g. here: Create AWS Athena view programmatically
I know that I can use Terraform provisioners to execute AWS CLI commands to create these resources, for example like this: AWS Athena Create table view with SQL
But I don't want to do that. I want to create everything (as far as possible) with Terraform so that I don't have to worry about lifecycle etc.
As far as I understand, an Athena database can be a Glue database, depending on the data source you choose. If I choose AWSDataCatalog (Glue) as the data source in Athena, it should not matter whether I create an Athena database or a Glue database with Terraform, correct?
In Glue I can also create tables, but not views. Do the Glue tables automatically correspond to Athena tables? How can I create Athena views? I would like to create everything with SQL DDL, just like you can do in the AWS Web Console. How does this work via Terraform? If this functionality is not available, what is the best way to go? I am grateful for any tips and help!
Athena uses the Glue Data Catalog to store metadata about databases, tables, and views. All Athena tables are Glue tables. However, not all Glue tables work with Athena – you can create tables in Glue that won't be visible in Athena, and you can create tables that will be visible but won't work (for example, they cause runtime errors when you query them).
Athena uses the Glue Data Catalog for views too, but the format is very specific to Athena, unlike regular tables, which can be made interoperable with, for example, Spark.
In an answer to the question you link to I explain the anatomy of an Athena view in detail. I have created views with CloudFormation using that information, so it can be done with Terraform too. Unless you write code, you will have to jump through all the hoops and repeat most of the information as Presto metadata, unfortunately.
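To give a rough idea of what that entails, here is a hedged boto3 sketch of creating an Athena view directly in the Glue Data Catalog, following the anatomy described in that answer. The database, view name, SQL, and column types are placeholders, and the exact metadata Athena expects may vary by engine version:

import base64
import json
import boto3

glue = boto3.client("glue")

original_sql = "SELECT id, name FROM my_db.my_table"  # placeholder view SQL

# Athena stores the view definition as base64-encoded Presto metadata
# wrapped in a special comment in ViewOriginalText.
presto_view = {
    "originalSql": original_sql,
    "catalog": "awsdatacatalog",
    "schema": "my_db",
    "columns": [
        {"name": "id", "type": "bigint"},
        {"name": "name", "type": "varchar"},
    ],
}
encoded = base64.b64encode(json.dumps(presto_view).encode()).decode()

glue.create_table(
    DatabaseName="my_db",
    TableInput={
        "Name": "my_view",
        "TableType": "VIRTUAL_VIEW",
        "Parameters": {"presto_view": "true", "comment": "Presto View"},
        "ViewOriginalText": f"/* Presto View: {encoded} */",
        "ViewExpandedText": "/* Presto View */",
        "StorageDescriptor": {
            # Glue columns use Hive types, while the Presto metadata uses Presto types.
            "Columns": [
                {"Name": "id", "Type": "bigint"},
                {"Name": "name", "Type": "string"},
            ],
            "SerdeInfo": {},
        },
    },
)

The same structure can also be expressed with Terraform's aws_glue_catalog_table resource, but you still have to produce the encoded Presto metadata yourself.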

Add location dynamically in Amazon Redshift create table statement

I am trying to create an external table in Amazon Redshift using the statement
mentioned at this link.
In my case I want the location to be parameterized instead of a static value.
I am using DBeaver for Amazon Redshift.
If your partitions are Hive-compatible (<partition_column_name>=<partition_column_value>) and your table is defined via Glue or Athena, then you can run MSCK REPAIR TABLE on the Athena table directly, which will add them. Read this thread for more info: https://forums.aws.amazon.com/thread.jspa?messageID=800945
If you don't use Hive-compatible partitions, you can also try partition projection, where you define how the file locations are structured in relation to the partition columns and their values.
If those don't work for you, you can use AWS Glue crawlers, which supposedly detect partitions automatically: https://docs.aws.amazon.com/glue/latest/dg/add-crawler.html
If that doesn't work either, then your problem is very specific. I suggest rolling up your sleeves and writing some code, deployed on Lambda or as an AWS Glue Python shell job (see the sketch after the links below). Here are a couple of examples where other people have tried that:
https://medium.com/swlh/add-newly-created-partitions-programmatically-into-aws-athena-schema-d773722a228e
https://medium.com/@alsmola/partitioning-cloudtrail-logs-in-athena-29add93ee070
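As a minimal sketch of that roll-up-your-sleeves option, a Lambda handler could simply run MSCK REPAIR TABLE (or ALTER TABLE ... ADD PARTITION) through the Athena API. The database, table, and output bucket below are placeholders:

import boto3

athena = boto3.client("athena")

def handler(event, context):
    # Ask Athena to scan S3 for new Hive-compatible partitions and register them.
    return athena.start_query_execution(
        QueryString="MSCK REPAIR TABLE my_table",
        QueryExecutionContext={"Database": "my_db"},
        ResultConfiguration={"OutputLocation": "s3://my-bucket/athena-results/"},
    )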

Copy data from PostgreSQL to S3 using AWS Data Pipeline

I am trying to copy all the tables from a schema (PostgreSQL, 50+ tables) to Amazon S3.
What is the best way to do this? I am able to create 50 different copy activities, but is there a simple way to copy all tables in a schema or write one pipeline and loop?
I think the old method is:
1. Unload your data from PostgreSQL to a CSV file first, using something like psql.
2. Then just copy the CSV to S3.
But AWS gives you a ready-made activity to do this, RDSToS3CopyActivity. See this link from AWS.
Since you have a large number of tables, I would recommend using AWS Glue rather than AWS Data Pipeline. Glue is easily configurable, with crawlers and other features that give you the flexibility to choose columns, define schemas, and so on. Moreover, the underlying jobs in AWS Glue are PySpark jobs that scale really well, giving you very good performance.
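As a rough sketch of what such a PySpark job could look like, assuming placeholder connection details and an explicit table list (in practice you might read the list from information_schema or from the Glue catalog):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("pg-schema-to-s3").getOrCreate()

jdbc_url = "jdbc:postgresql://my-host:5432/my_db"  # placeholder endpoint
tables = ["customers", "orders", "invoices"]       # placeholder table names

for table in tables:
    df = (
        spark.read.format("jdbc")
        .option("url", jdbc_url)
        .option("dbtable", f"my_schema.{table}")
        .option("user", "my_user")          # use a secrets store in practice
        .option("password", "my_password")
        .option("driver", "org.postgresql.Driver")
        .load()
    )
    # One S3 prefix per table, written as Parquet.
    df.write.mode("overwrite").parquet(f"s3://my-bucket/exports/{table}/")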

When to use Amazon Redshift spectrum over AWS Glue ETL to query on Amazon S3 data

As AWS Glue ETL can be a Python script, it can be used to perform SQL queries using database interfaces, and data can be loaded from Amazon S3 into a DynamicFrame. I am trying to understand when it is advantageous to use Amazon Redshift Spectrum to query S3 data.
AWS Glue is used for gathering metadata (crawling) and for ETL. It is not for reporting or analytics. It can apply highly complex transformations (ideal for complex ETL requirements).
Redshift Spectrum is primarily used to produce reports and analyses against data stored in S3, usually combined with data stored in Redshift. However, it CAN also be used for simple ETL, and it is much simpler to set up and use than Glue if simple ETL is all you need.
There is one other option that you do not mention: Amazon Athena. It is a great tool for running queries directly against S3 data. It is similar to Redshift Spectrum but usually faster and cheaper, depending on your use case; however, it cannot combine S3 data with Redshift data.

Exporting Spark Dataframe to Athena

I'm running a pyspark job which creates a dataframe and stores it to S3 as below:
df.write.saveAsTable(table_name, format="orc", mode="overwrite", path=s3_path)
I can read the ORC files without a problem, just by using spark.read.orc(s3_path), so there is schema information in the ORC files, as expected.
However, I'd really like to view the dataframe contents using Athena. Clearly, if I wrote to my Hive metastore, I could call Hive and do a show create table ${table_name}, but that's a lot of work when all I want is a simple schema.
Is there another way?
One approach would be to set up a Glue crawler for your S3 path, which would create a table in the AWS Glue Data Catalog. Alternatively, you could create the Glue table definition via the Glue API.
The AWS Glue Data Catalog is fully integrated with Athena, so you would see your Glue table in Athena, and be able to query it directly:
http://docs.aws.amazon.com/athena/latest/ug/glue-athena.html
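For example, a crawler for that S3 path could be created and started through the Glue API. The crawler name, IAM role, Glue database, and S3 path below are placeholders:

import boto3

glue = boto3.client("glue")

glue.create_crawler(
    Name="orc-output-crawler",
    Role="arn:aws:iam::123456789012:role/my-glue-crawler-role",  # placeholder role
    DatabaseName="analytics",                                    # Glue database to write the table into
    Targets={"S3Targets": [{"Path": "s3://my-bucket/path/to/orc/"}]},
)
glue.start_crawler(Name="orc-output-crawler")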