Query CSV tables stored in S3 through Athena

Recently we started storing our backups in AWS S3. They are all CSV files that we need to query through AWS Athena.
We tried to add the tables one by one, but it is taking too long because there is a fair amount of data. Is there an API we can use, or something that is already set up for this?
We were about to build something with Spark, but maybe there is a simpler way, or something that has already been done.
Thanks

You can simply create an external table on top of the CSV files with the required properties.
Reference: Create External Table on AWS Athena
You can also use a Glue crawler and configure it to populate the tables for you automatically.
Reference: Cataloging tables with a crawler
There are various AWS SDKs available (here) to automate tasks such as uploading files to S3, creating Athena tables, or cataloging tables through a Glue crawler.
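As a rough illustration of the SDK route, the sketch below builds a CREATE EXTERNAL TABLE statement for comma-separated files and submits it through boto3. The database, table, column, and bucket names are hypothetical, and the row format assumes simple CSV without quoted fields; files with quoted values would need a different SerDe.

```python
def build_csv_table_ddl(database, table, columns, s3_location, skip_header=True):
    """Return a CREATE EXTERNAL TABLE statement for plain comma-separated files.

    columns is a list of (name, hive_type) pairs; all names here are illustrative.
    """
    # Athena expects LOCATION to point at a folder, so make sure it ends with "/".
    if not s3_location.endswith("/"):
        s3_location += "/"
    column_sql = ",\n  ".join(f"`{name}` {col_type}" for name, col_type in columns)
    ddl = (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS `{database}`.`{table}` (\n"
        f"  {column_sql}\n"
        ")\n"
        "ROW FORMAT DELIMITED FIELDS TERMINATED BY ','\n"
        f"LOCATION '{s3_location}'"
    )
    if skip_header:
        # Assumes every file starts with one header row.
        ddl += "\nTBLPROPERTIES ('skip.header.line.count'='1')"
    return ddl


def submit_ddl(ddl, output_location):
    """Send the statement to Athena and return the query execution id."""
    import boto3  # imported lazily so building the DDL needs no AWS setup

    athena = boto3.client("athena")
    response = athena.start_query_execution(
        QueryString=ddl,
        ResultConfiguration={"OutputLocation": output_location},
    )
    return response["QueryExecutionId"]
```

Looping over a list of table definitions and calling `submit_ddl` for each one would replace the manual one-by-one setup; the crawler route avoids writing the column lists at all.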

Related

Create Athena resources with Terraform

I would like to use Terraform to create an Athena database, including tables and views. I have already searched a lot and found some posts, e.g. here: Create AWS Athena view programmatically
I know that I can use Terraform provisioners to execute AWS CLI commands to create these resources, for example like this: AWS Athena Create table view with SQL
But I don't want to do that. I want to create everything (as far as possible) with Terraform so that I don't have to worry about lifecycle etc.
As far as I understand, an Athena database can be a Glue database, depending on the source you choose. If I choose AwsDataCatalog (Glue) as the data source in Athena, it should not matter whether I create an Athena database or a Glue database with Terraform, correct?
In Glue I can also create tables, but no views. Do the Glue tables automatically correspond to Athena tables? How can I create Athena views? I would like to create everything with SQL DDL, just like you can do it in the AWS Web Console. How does this work via Terraform? If this functionality is not available, what is the best way to go? I am grateful for every tip and help!

Athena uses the Glue Data Catalog to store metadata about databases, tables, and views. All Athena tables are Glue tables. However, not all Glue tables work with Athena: you can create tables in Glue that won't be visible in Athena, and you can create tables that will be visible but won't work (for example, they cause runtime errors when you query them).
Athena also uses the Glue Data Catalog for views, but the format is very specific to Athena, unlike regular tables, which can be made interoperable with, for example, Spark.
In an answer to the question you link to, I explain the anatomy of an Athena view in detail. I have created views with CloudFormation using that information, so it can be done with Terraform too. Unless you write code, you will have to jump through all the hoops and repeat most of the information as Presto metadata, unfortunately.
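To give a feel for what "repeating the information as Presto metadata" involves, here is a small sketch of how such a view definition could be encoded. It assumes the layout commonly described for Athena views: a Glue table of type VIRTUAL_VIEW whose ViewOriginalText holds base64-encoded JSON wrapped in a `/* Presto View: ... */` comment. That layout is an assumption taken from community descriptions, not from an official specification, and all names are hypothetical.

```python
import base64
import json


def encode_presto_view(sql, catalog, schema, columns):
    """Return the ViewOriginalText value for a view (assumed format, see above).

    columns is a list of (name, presto_type) pairs, e.g. ("id", "bigint").
    """
    payload = {
        "originalSql": sql,
        "catalog": catalog,
        "schema": schema,
        "columns": [{"name": name, "type": presto_type} for name, presto_type in columns],
    }
    encoded = base64.b64encode(json.dumps(payload).encode("utf-8")).decode("ascii")
    return f"/* Presto View: {encoded} */"


def view_table_input(name, sql, catalog, schema, presto_columns, hive_columns):
    """Build a Glue TableInput dict for the view; the columns are listed twice,
    once with Presto types inside the encoded text and once with Hive types."""
    return {
        "Name": name,
        "TableType": "VIRTUAL_VIEW",
        "Parameters": {"presto_view": "true"},
        "ViewOriginalText": encode_presto_view(sql, catalog, schema, presto_columns),
        "ViewExpandedText": "/* Presto View */",
        "StorageDescriptor": {
            "Columns": [{"Name": n, "Type": t} for n, t in hive_columns],
            "SerdeInfo": {},
        },
    }
```

The same encoded string could be produced by a Terraform expression and placed in a Glue table resource, which is where the duplication mentioned above comes from.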

AWS Glue Crawler query

I have a few AWS Glue crawlers set up to crawl CSVs in S3 to populate my tables in Athena.
My scenario and question:
I replace the .csv files in S3 daily with updated versions. Do I have to run the existing crawlers again, perhaps on a schedule, to update the tables in Athena with the latest content? Or does the crawler only need to run when the schema changes, such as when columns are added? I just want to make sure my tables in Athena always return all of the data in the updated CSVs. I rarely change the table structures. If the crawlers only need to run when the structure actually changes, I would prefer to run them a lot less frequently.

When a Glue crawler runs, it takes the following actions:
Classifies data to determine the format, schema, and associated properties of the raw data
Groups data into tables or partitions
Writes metadata to the Data Catalog
Athena uses the table schema stored in the Data Catalog to query the specified S3 data source, and it reads the files themselves at query time. So as long as the schema and partition layout stay the same, replaced CSVs show up in query results without a new crawl, and the crawler can be scheduled far less often.
You can also refer to the documentation here to understand working with Glue crawlers and CSV files in Athena: https://docs.aws.amazon.com/athena/latest/ug/glue-best-practices.html
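If the crawler only needs to catch occasional schema changes, its schedule can be relaxed through the SDK. The sketch below builds the arguments for boto3's Glue `update_crawler` call with a weekly cron expression; the crawler name, weekday, and hour are hypothetical choices.

```python
def weekly_crawler_schedule(crawler_name, weekday="MON", hour_utc=6):
    """Build update_crawler arguments that run the crawler once a week.

    Glue cron fields are: minutes, hours, day-of-month, month, day-of-week, year.
    """
    return {
        "Name": crawler_name,
        "Schedule": f"cron(0 {hour_utc} ? * {weekday} *)",
    }


def apply_schedule(params):
    """Apply the schedule to an existing crawler."""
    import boto3  # imported lazily so the helper above needs no AWS setup

    glue = boto3.client("glue")
    glue.update_crawler(**params)
```

For example, `apply_schedule(weekly_crawler_schedule("csv-backups"))` would move a crawler named `csv-backups` to Mondays at 06:00 UTC.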

Athena can't resolve CSV files from AWS DMS

I have DMS configured to continuously replicate data from MySQL RDS to S3. This creates two types of CSV files: a full load and change data capture (CDC). According to my tests, I have the following files:
testdb/addresses/LOAD001.csv.gz
testdb/addresses/20180405_205807186_csv.gz
Once DMS is running properly, I trigger an AWS Glue crawler to build the Data Catalog for the S3 bucket that contains the MySQL replication files, so that Athena users can build queries in our S3-based data lake.
Unfortunately, the crawlers are not building the correct table schema for the tables stored in S3.
For the example above, it creates two tables for Athena:
addresses
20180405_205807186_csv_gz
The file 20180405_205807186_csv.gz contains a one-line update, but the crawler is not capable of merging the two pieces of information (taking the first load from LOAD001.csv.gz and applying the update described in 20180405_205807186_csv.gz).
I also tried to create the table in the Athena console, as described in this blog post: https://aws.amazon.com/pt/blogs/database/using-aws-database-migration-service-and-amazon-athena-to-replicate-and-run-ad-hoc-queries-on-a-sql-server-database/.
But it does not yield the desired output.
From the blog post:
When you query data using Amazon Athena (later in this post), you
simply point the folder location to Athena, and the query results
include existing and new data inserts by combining data from both
files.
Am I missing something?

The AWS Glue crawler is not able to reconcile the different schemas in the initial LOAD CSVs and the incremental CDC CSVs for each table. This blog post from AWS and its associated CloudFormation templates demonstrate how to use AWS Glue jobs to process and combine these two types of DMS target output.
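The core of such a merge can be sketched in a few lines of plain Python. This is not the code from the blog post; it only illustrates the idea, assuming the CDC rows carry the operation (I, U, or D) in their first column, that the remaining columns match the full-load layout, and that the primary key is the first data column.

```python
def apply_cdc(full_load_rows, cdc_rows, key_index=0):
    """Apply insert/update/delete change records to the full-load rows.

    full_load_rows: rows as lists of values, without an operation column.
    cdc_rows: rows whose first value is "I", "U", or "D", followed by the
    same columns as the full load. key_index is the position of the
    primary key among the data columns (an assumption of this sketch).
    """
    current = {row[key_index]: list(row) for row in full_load_rows}
    for op, *values in cdc_rows:
        key = values[key_index]
        if op in ("I", "U"):
            current[key] = values  # insert a new row or overwrite the old one
        elif op == "D":
            current.pop(key, None)  # ignore deletes for rows we never saw
        else:
            raise ValueError(f"unknown CDC operation: {op!r}")
    return list(current.values())
```

A real job would read the gzipped files with the `gzip` and `csv` modules (or a Spark DataFrame in Glue), apply the CDC files in timestamp order, and write the merged rows to a separate S3 prefix for Athena to query.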

Athena will combine the files in an S3 location if they have the same structure. The blog post only covers inserts of new data in the CDC files. You'll have to build a process to merge the CDC files. Not what you wanted to hear, I'm sure.
From the blog post:
"When you query data using Amazon Athena (later in this post), due to the way AWS DMS adds a column indicating inserts, deletes and updates to the new file created as part of CDC replication, we will not be able to run the Athena query by combining data from both files (initial load and CDC files)."

Can Amazon Athena be used to query a dynamic schema?

I have a service running that populates my S3 bucket with compressed log files, but the log files do not have a fixed schema, and Athena expects a fixed schema (which I specified when creating the table).
So my question is as in the title: is there any way I can query a dynamic schema? If not, is there another service like Athena that can do this?

Amazon Athena can't do that by itself, but you can configure an AWS Glue crawler to automatically infer the schema of your JSON files. The crawler can run on a schedule, so your files will be indexed automatically even if the schema changes. Athena will use the Glue data catalog if AWS Glue is available in the region you're running Athena in.
See Cataloging Tables with a Crawler in the AWS Glue docs for the details on how to set that up.
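As a starting point, the sketch below assembles the arguments for boto3's Glue `create_crawler` call with a schedule and a schema change policy that updates the catalog when the inferred schema changes. The crawler name, IAM role ARN, database, S3 path, and hourly schedule are all hypothetical.

```python
def scheduled_crawler_params(name, role_arn, database, s3_path,
                             schedule="cron(0 * * * ? *)"):
    """Build create_crawler arguments for a crawler that re-infers the schema
    on a schedule (hourly by default) and updates the table when it changes."""
    return {
        "Name": name,
        "Role": role_arn,
        "DatabaseName": database,
        "Targets": {"S3Targets": [{"Path": s3_path}]},
        "Schedule": schedule,
        "SchemaChangePolicy": {
            "UpdateBehavior": "UPDATE_IN_DATABASE",  # apply schema changes to the table
            "DeleteBehavior": "LOG",  # only log objects that disappear from S3
        },
    }


def create_crawler(params):
    """Create the crawler in the account of the default credentials."""
    import boto3  # imported lazily so the helper above needs no AWS setup

    glue = boto3.client("glue")
    glue.create_crawler(**params)
```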

Creating Table As substitution

I am currently working with AWS Athena, and it does not support CREATE TABLE AS, which is fine, so I thought I would approach it by doing INSERT OVERWRITE DIRECTORY S3://PATH and then loading from S3, but apparently that doesn't work either. How would I create a table from a query if both of these options are out the window?

Amazon Athena is read-only. It cannot be used to create tables in Amazon S3.
However, the output of an Amazon Athena query is stored in Amazon S3 and could be used as input for another query, though you'd have to know the path of the output.
Amazon Athena is ideal for individual queries against data stored in Amazon S3, but is not the best tool for ETL actions, which typically involve transforming data, storing it and then sequentially processing it again.

You don't have to use INSERT. Just create an external table over the location of the previous query results:
https://aws.amazon.com/premiumsupport/knowledge-center/athena-query-results/
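Finding the path of a previous query's output can be automated. The sketch below pulls the output location out of a boto3 Athena `get_query_execution` response and splits it into bucket, key, and containing prefix; the query execution id passed in is whatever the earlier query returned. Since the result prefix usually holds other files next to the CSV, copying the CSV to a prefix of its own before creating a table over it is probably the safer route.

```python
def result_location(query_execution_response):
    """Return (bucket, key, prefix) for the result file of a finished query.

    Expects the dict returned by Athena's get_query_execution call.
    """
    execution = query_execution_response["QueryExecution"]
    uri = execution["ResultConfiguration"]["OutputLocation"]
    if not uri.startswith("s3://"):
        raise ValueError(f"unexpected output location: {uri!r}")
    bucket, _, key = uri[len("s3://"):].partition("/")
    # The prefix is the "folder" part of the key, with a trailing slash.
    prefix = key.rsplit("/", 1)[0] + "/" if "/" in key else ""
    return bucket, key, prefix


def fetch_result_location(query_execution_id):
    """Look up where Athena stored the output of the given query."""
    import boto3  # imported lazily so the parser above needs no AWS setup

    athena = boto3.client("athena")
    response = athena.get_query_execution(QueryExecutionId=query_execution_id)
    return result_location(response)
```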
