Exporting Spark Dataframe to Athena

I'm running a PySpark job that creates a DataFrame and stores it to S3 as follows:
df.write.saveAsTable(table_name, format="orc", mode="overwrite", path=s3_path)
I can read the ORC file without a problem using spark.read.orc(s3_path), so the schema information is in the ORC file, as expected.
However, I'd really like to view the DataFrame contents using Athena. Clearly, if I wrote to my Hive metastore, I could call Hive and run show create table ${table_name}, but that's a lot of work when all I want is a simple schema.
Is there another way?

One approach is to set up a Glue crawler for your S3 path, which creates a table in the AWS Glue Data Catalog. Alternatively, you could create the Glue table definition yourself via the Glue API.
The AWS Glue Data Catalog is fully integrated with Athena, so you would see your Glue table in Athena, and be able to query it directly:
http://docs.aws.amazon.com/athena/latest/ug/glue-athena.html
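Since Athena reads its table definitions from the Glue Data Catalog, another way to register the table is to run a DDL statement in Athena itself. The following is a minimal sketch, not part of the answer above, that builds such a statement from a list of (name, type) pairs like the one returned by df.dtypes in PySpark. The function name athena_orc_ddl, the table name, the bucket, and the example columns are hypothetical, and it assumes the Spark type strings are also valid Hive types, which holds for simple types such as bigint and string.

```python
def athena_orc_ddl(table_name, columns, s3_path):
    """Build a CREATE EXTERNAL TABLE statement for ORC data stored at s3_path.

    columns is a list of (name, type) pairs, for example the value of df.dtypes.
    """
    cols = ",\n  ".join(f"`{name}` {dtype}" for name, dtype in columns)
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {table_name} (\n"
        f"  {cols}\n"
        ")\n"
        "STORED AS ORC\n"
        f"LOCATION '{s3_path}'"
    )


# Hypothetical schema; in the job from the question this would be df.dtypes.
example_columns = [("id", "bigint"), ("name", "string")]
ddl = athena_orc_ddl("my_table", example_columns, "s3://my-bucket/my_table/")
print(ddl)
```

The resulting statement can be pasted into the Athena query editor. Nested or unusual column types may need manual adjustment.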

Related

AWS Athena - UPDATE table rows using SQL

I am new to the AWS ecosystem. I am creating an application that queries data using AWS Athena. The data is transformed from JSON into Parquet using AWS Glue and stored in S3.
The use case now is to update that Parquet data using SQL.
Can we update the underlying Parquet data using an AWS Athena SQL command?

No, it is not possible to use UPDATE in Amazon Athena.
Amazon Athena is a query engine, not a database. It performs queries on data that is stored in Amazon S3. It reads those files, but it does not modify or update those files. Therefore, it cannot 'update' a table.
The closest capability is using CREATE TABLE AS to create a new table. You can provide a SELECT query that uses data from other tables, so you could effectively modify information and store it in a new table, and tell it to use Parquet for that new table. In fact, this is an excellent way to convert data from other formats into Snappy-compressed Parquet files (with partitioning, if you wish).
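As a rough illustration of the CREATE TABLE AS approach, the sketch below assembles a CTAS statement that writes a modified copy of a table as Snappy-compressed Parquet. The helper name ctas_parquet and the table, column, and bucket names are hypothetical. The WITH properties used are format, parquet_compression, external_location, and partitioned_by. Note that Athena expects partition columns to come last in the SELECT list.

```python
def ctas_parquet(new_table, select_sql, s3_path, partitioned_by=()):
    """Build an Athena CTAS statement that stores the result as Parquet."""
    props = [
        "format = 'PARQUET'",
        "parquet_compression = 'SNAPPY'",
        f"external_location = '{s3_path}'",
    ]
    if partitioned_by:
        cols = ", ".join(f"'{c}'" for c in partitioned_by)
        props.append(f"partitioned_by = ARRAY[{cols}]")
    return (
        f"CREATE TABLE {new_table}\n"
        f"WITH ({', '.join(props)})\n"
        f"AS {select_sql}"
    )


# Hypothetical example: "update" the status column by rewriting the table.
query = ctas_parquet(
    "db.orders_fixed",
    "SELECT order_id, "
    "CASE WHEN status = 'new' THEN 'open' ELSE status END AS status, "
    "dt FROM db.orders",
    "s3://my-bucket/orders_fixed/",
    partitioned_by=["dt"],
)
print(query)
```

The original table and its files are left untouched; the "updated" rows exist only in the new table.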

Depending on the table format the data is stored in, you can update it using SQL UPDATE statements. See "Updating Iceberg table data" and "Using governed tables".

Unable to access data in tables configured using Partition Projection from AWS glue jobs using create_dynamic_frame.from_catalog

I've set up a table in Athena using partition projection. I haven't defined any partitions in the Glue metadata catalog, and I can view the data in Athena fine using SQL.
When I set up a Glue job using this table, Glue doesn't seem to have access to the data:
data = glueContext.create_dynamic_frame.from_catalog(
    database="db", table_name="a_table")
print(data.count())  # returns 0 :(
Is there any way to access the data without needing to define Glue metadata partitions? I was under the impression that if Athena could see the data, so could Glue.

Glue does not support partition projection; it's an Athena-only feature.
Glue ETL uses Spark, and Athena is Presto under the hood (with modifications, including Partition Projection). Glue ETL also does not support Athena views, and various other minor things.
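One possible workaround, which is not part of the answer above, is to compute the partition locations in the job and read them from S3 directly instead of going through the catalog. The sketch below assumes a hypothetical layout with one prefix per day named dt=YYYY-MM-DD; the helper daily_partition_paths, the bucket name, and the Parquet format in the commented call are all assumptions.

```python
from datetime import date, timedelta


def daily_partition_paths(base_path, start, end, key="dt"):
    """List one S3 prefix per day from start to end, inclusive."""
    base = base_path.rstrip("/")
    days = (end - start).days
    return [
        f"{base}/{key}={(start + timedelta(days=i)).isoformat()}/"
        for i in range(days + 1)
    ]


paths = daily_partition_paths(
    "s3://my-bucket/a_table/", date(2021, 1, 30), date(2021, 2, 1)
)
print(paths)

# Inside a Glue job, the prefixes could then be read without the catalog
# (the format is an assumption about how the data was written):
# data = glueContext.create_dynamic_frame.from_options(
#     connection_type="s3",
#     connection_options={"paths": paths},
#     format="parquet",
# )
```

When reading individual partition prefixes like this, the partition value encoded in the path may not show up as a column, so it may have to be added back in the job.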

Create Athena resources with Terraform

I would like to use Terraform to create an Athena database, including tables and views. I have already searched a lot and found some posts, e.g. here: Create AWS Athena view programmatically
I know that I can use Terraform provisioners to execute AWS CLI commands to create these resources, for example like this: AWS Athena Create table view with SQL
But I don't want to do that. I want to create everything (as far as possible) with Terraform so that I don't have to worry about lifecycle etc.
As far as I understand, an Athena database can be a Glue database, depending on the source you choose. If I choose the AWSDataCatalog (Glue) as data source in Athena, it should not matter if I create an Athena database or a Glue database with Terraform, correct?
In Glue I can also create tables, but no views. Do the Glue tables automatically correspond to Athena tables? How can I create Athena views? I would like to create everything with SQL DDL, just like you can do it in the AWS Web Console. How does this work via Terraform? If this functionality is not available, what is the best way to go? I am grateful for every tip and help!

Athena uses the Glue Data Catalog to store metadata about databases, tables, and views. All Athena tables are Glue tables. However, not all Glue tables work with Athena – you can create tables in Glue that won't be visible in Athena, and you can create tables that will be visible but won't work (for example cause runtime errors when you query them).
Athena also uses the Glue Data Catalog for views, but the format is very specific to Athena, unlike regular tables, which can be made interoperable with, for example, Spark.
In an answer to the question you link to I explain in detail the anatomy of an Athena view. I have created views with CloudFormation with that information so it can be done with Terraform too. Unless you write code you will have to jump through all the hoops and repeat most of the information as Presto metadata, unfortunately.
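To give a feel for what "repeating the information as Presto metadata" involves, here is a sketch of how the Glue table definition for an Athena view might be assembled. It is based on my understanding of the format and is not an official specification: the view's SQL and column list are serialized to JSON, base64-encoded, and wrapped in a "/* Presto View: ... */" comment stored as the view's original text. The function name and the example view are hypothetical, and the exact field set should be checked against a view created by Athena itself.

```python
import base64
import json


def athena_view_table_input(name, database, sql, columns):
    """Build a Glue TableInput dict describing an Athena view.

    columns is a list of (name, presto_type, hive_type) triples, because the
    encoded metadata uses Presto type names while Glue columns use Hive ones.
    """
    presto_meta = {
        "originalSql": sql,
        "catalog": "awsdatacatalog",
        "schema": database,
        "columns": [{"name": n, "type": p} for n, p, _ in columns],
    }
    encoded = base64.b64encode(
        json.dumps(presto_meta).encode("utf-8")
    ).decode("ascii")
    return {
        "Name": name,
        "TableType": "VIRTUAL_VIEW",
        "Parameters": {"presto_view": "true"},
        "ViewOriginalText": f"/* Presto View: {encoded} */",
        "ViewExpandedText": "/* Presto View */",
        "StorageDescriptor": {
            "Columns": [{"Name": n, "Type": h} for n, _, h in columns],
            "SerdeInfo": {},
        },
    }


# Hypothetical view over a table db.t with a single string column.
view = athena_view_table_input(
    "my_view", "db", "SELECT a FROM db.t", [("a", "varchar", "string")]
)
print(view["ViewOriginalText"])
```

The same fields exist as attributes of Terraform's aws_glue_catalog_table resource (table_type, parameters, view_original_text, view_expanded_text), so the encoded text could presumably be produced there with jsonencode and base64encode.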

aws glue tables created by athena are read twice by emr spark

When I create a table in Athena with CTAS syntax (example below), the table is registered in Glue in such a way that when I read it on an EMR cluster with (Py)Spark, every partition is read twice, whereas reading it with Athena works fine. When I create a table through Spark with write.saveAsTable, it is registered in Glue properly and is read correctly by both Spark and Athena.
I didn't find anything in the Spark, Athena, or Glue documentation about this. After some trial and error I found that there is a Glue table property that is set by Spark and not by Athena: spark.sql.sources.provider='parquet'. When I set this manually on tables created via Athena, Spark reads them properly. But this feels like an ugly workaround, and I would like to understand what's happening in the background. I also couldn't find any documentation on this table property.
Athena create table syntax:
CREATE TABLE {database}.{table}
WITH (format = 'Parquet',
parquet_compression = 'SNAPPY',
external_location = '{s3path}')
AS SELECT
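The manual workaround described in the question, setting spark.sql.sources.provider on the Glue table, could be scripted roughly as follows. This is a sketch under assumptions: the helper with_spark_provider and the example table are hypothetical, and it relies on the fact that the Glue update_table call accepts only a subset of the keys that get_table returns, so the read-only keys have to be dropped first.

```python
# Keys that Glue accepts in a TableInput structure; get_table returns extra
# read-only keys (CreateTime, DatabaseName, ...) that update_table rejects.
TABLE_INPUT_KEYS = {
    "Name", "Description", "Owner", "LastAccessTime", "LastAnalyzedTime",
    "Retention", "StorageDescriptor", "PartitionKeys", "ViewOriginalText",
    "ViewExpandedText", "TableType", "Parameters", "TargetTable",
}


def with_spark_provider(table, provider="parquet"):
    """Return a TableInput copy of table with the Spark provider property set."""
    table_input = {k: v for k, v in table.items() if k in TABLE_INPUT_KEYS}
    params = dict(table_input.get("Parameters", {}))
    params["spark.sql.sources.provider"] = provider
    table_input["Parameters"] = params
    return table_input


# Hypothetical table, shaped like glue.get_table(...)["Table"].
example_table = {
    "Name": "my_ctas_table",
    "DatabaseName": "db",
    "CreateTime": "2021-01-01",
    "TableType": "EXTERNAL_TABLE",
    "Parameters": {"classification": "parquet"},
}
updated = with_spark_provider(example_table)
print(updated["Parameters"])

# With boto3 and AWS credentials available, the update could be applied as:
# glue = boto3.client("glue")
# table = glue.get_table(DatabaseName="db", Name="my_ctas_table")["Table"]
# glue.update_table(DatabaseName="db", TableInput=with_spark_provider(table))
```

This only automates the workaround; it does not explain why Spark reads the partitions twice without the property.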

AWS Glue Crawler is not creating tables in schema

I am trying to use an AWS Glue crawler to create tables in Athena.
The source I am pulling from is a PostgreSQL server. The crawler is able to parse the tables, create the metadata, and show the tables and columns in the Glue Data Catalog, but the tables are not added in Athena, even though I have set the Athena database as the target.
I'm not sure why this is happening.
Also, if I choose a CSV source from S3, the crawler is able to create a table in Athena with _csv as a suffix.
Any help?

Athena doesn't recognize my Postgres tables added by Glue either. My guess is that Athena is meant for querying data stored on S3, so it doesn't work for tables that point at a database.
Also, to be able to query your CSV files on S3, the files need to be under a folder crawled by Glue. If you crawl just a single file with Glue, Athena will return 0 records from the query.
