How to query on AWS Athena for a .csv file's creation or last modification date

With AWS Athena it is easy to query for a .csv file's name using this:
select "$path", * from my_table;
I wonder if it is possible to also do the same for the creation or modification timestamps:
select "$creation_date" from my_table;
select "$modification_date" from my_table;
I could not find anything regarding this topic.

There is no such option in Athena as of today. Athena uses PrestoDB, and this feature has been introduced in Trino (formerly PrestoSQL).
I have filed a feature request for the same with PrestoDB.
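For reference, Trino's Hive connector exposes the modification time as a hidden column (there is no creation-time column), so on a Trino cluster reading the same data you could write something like:

-- Trino (Hive connector), not Athena: "$file_modified_time" is a hidden column
SELECT "$path", "$file_modified_time" FROM my_table;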


How does Amazon Athena handle renaming of columns?

Hi everyone!
I'm working on a solution that intends to use Amazon Athena to run SQL queries over Parquet files on S3.
Those files will be generated from a PostgreSQL database (RDS). I'll run a query and export the data to S3 using Python's PyArrow.
My question is: since Athena is schema-on-read, adding or deleting columns in the database will not be a problem... but what will happen when a column is renamed in the database?
Day 1: COLUMNS['col_a', 'col_b', 'col_c']
Day 2: COLUMNS['col_a', 'col_beta', 'col_c']
On Athena,
SELECT col_beta FROM table;
will return only data from Day 2, right?
Is there a way for Athena to know about this schema evolution, or would I have to run a script to iterate through all my files on S3, rename the columns, and update the table schema on Athena from 'col_b' to 'col_beta'?
Would AWS Glue Data Catalog help in any way to solve this?
I'd love to discuss this further!
I recommend reading more about handling schema updates with Athena here. Generally, Athena supports multiple ways of reading Parquet files (as well as other columnar data formats such as ORC). By default, Parquet columns are read by name, but you can change that to reading by index. Each approach has its own advantages and disadvantages when dealing with schema changes. Based on your example, you might want to consider reading by index if you are sure new columns are only appended to the end.
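For illustration, here is a sketch of a table defined to read Parquet columns by index instead of by name, via the parquet.column.index.access table property described in those docs (table and location names are hypothetical):

-- Hypothetical table; columns are matched to Parquet fields by position,
-- so a rename such as col_b -> col_beta does not orphan data in older files.
CREATE EXTERNAL TABLE my_table (
  col_a string,
  col_beta string,
  col_c string
)
STORED AS PARQUET
LOCATION 's3://my-bucket/my-table/'
TBLPROPERTIES ('parquet.column.index.access' = 'true');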
A Glue crawler can help you to keep your schema updated (and versioned), but it doesn't necessarily help you to resolve schema changes (logically). And it comes at an additional cost, of course.
Another approach could be to use a schema that is a superset of all schemas over time (using columns by name) and define a view on top of it to resolve changes "manually".
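As a sketch of that superset-plus-view approach (all names hypothetical): keep both the old and the new column in the table schema, then let the view pick whichever is populated:

-- Older files fill col_b, newer files fill col_beta; the view resolves the rename.
CREATE OR REPLACE VIEW my_view AS
SELECT
  col_a,
  coalesce(col_beta, col_b) AS col_beta,
  col_c
FROM my_table;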
You can schedule the AWS Glue crawler to run 'On Demand' or 'Time Based', so every time your data on S3 is updated a new schema version will be generated (you can edit the data types of the attributes in the schema). This way your columns will stay updated and you can query the new field.
AWS Athena reads CSV and TSV data in the order of the columns in the schema and returns them in the same order. It does not use column names to map data to columns, which is why you can rename columns in CSV or TSV files without breaking Athena queries.
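To illustrate with a hypothetical CSV table: the column names in the DDL are just positional labels, so changing the second column's name changes nothing about how the files are read:

CREATE EXTERNAL TABLE my_csv_table (
  col_a string,
  col_beta string,  -- renamed from col_b; mapping is by position, data unaffected
  col_c string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 's3://my-bucket/my-csv-table/';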

Spark SQL query to get the last updated timestamp of an Athena table stored as CSV in AWS S3

Is it possible to get the last updated timestamp of an Athena table stored in CSV format in an S3 location using a Spark SQL query?
If so, can someone please provide more information on it?
There are multiple ways to do this.
Use the Athena JDBC driver and do a Spark read where the format is jdbc. In this read you provide your "select max(timestamp) from table" query. Then, as the next step, just save the Spark DataFrame to S3.
You can skip the JDBC read altogether and just use boto3 to run the above query. It would be a combination of start_query_execution and get_query_results. You can then save the result to S3 as well.

Is it possible to re-partition the data using the AWS Glue crawler?

I have inherited an S3 bucket from a former colleague, where the files inside are partitioned by id and time, such as:
s3://bucket/partition_id=0/year=2017/month=6/day=1/file
The data in all these files forms one table and can be queried through Athena. The Glue catalog also shows that partition(0) is id, partition(1) is year, and so on.
Recently I wanted to restructure this work, and figured that partitioning by id is not very straightforward. I tried to use the Glue crawler and pointed it at the S3 bucket, but there is nowhere I could choose to partition only by time, not by id, like this:
s3://bucket/year=2017/month=6/day=1/file
I am quite new to AWS and not sure if this is possible or even makes sense. Please give me some feedback. Thank you.
I don't think you can do it with the help of a crawler; however, you can create a new table manually in Athena like this (see also https://docs.aws.amazon.com/en_us/athena/latest/ug/ctas-examples.html):
CREATE TABLE new_table
WITH (
  format = 'ORC',
  external_location = 's3://...',
  partitioned_by = ARRAY['year', 'month', 'day']
)
AS SELECT *
FROM old_table;
Alternatively, write a Python shell job that uses the boto3 S3 APIs to reorganize the folder structure, and then run the crawler.

Can AWS Athena update or insert data stored in S3?

The documentation just says that it is a query service, but it does not explicitly state whether it can or cannot perform data updates.
If Athena cannot do inserts or updates, is there any other AWS service which can, like a normal DB?
Amazon Athena is, indeed, a query service -- it only allows data to be read from Amazon S3.
One exception, however, is that the results of the query are automatically written to S3. You could, therefore, use a query to generate results that could be used by something else. It's not quite updating data but it is generating data.
My previous attempts to use Athena output in another Athena query didn't work due to problems with the automatically-generated header, but there might be some workarounds available.
If you are seeking a service that can update information in S3, you could use Amazon EMR, which is basically a managed Hadoop cluster. Very powerful and capable, and can most certainly update information in S3, but it is rather complex to learn.
Amazon Athena adds support for inserting data into a table using the results of a SELECT query or using a provided set of values
Amazon Athena now supports inserting new data to an existing table using the INSERT INTO statement.
https://aws.amazon.com/about-aws/whats-new/2019/09/amazon-athena-adds-support-inserting-data-into-table-results-of-select-query/
https://docs.aws.amazon.com/athena/latest/ug/insert-into.html
Bucketed tables not supported
INSERT INTO is not supported on bucketed tables. For more information, see Bucketing vs Partitioning.
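A minimal sketch of both supported forms (SELECT-based and VALUES-based), with hypothetical table and column names:

-- Insert the results of a SELECT query
INSERT INTO destination_table
SELECT col_a, col_b
FROM source_table
WHERE col_a = 'some_value';

-- Insert a provided set of values
INSERT INTO destination_table
VALUES ('value_1', 'value_2');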
AWS S3 is an object store. Both Athena and S3 Select are for queries. The only way to modify an object (file) in S3 is to retrieve it from S3, modify it, and upload it back to S3.
As of September 20, 2019 Athena also supports INSERT INTO: https://aws.amazon.com/about-aws/whats-new/2019/09/amazon-athena-adds-support-inserting-data-into-table-results-of-select-query/
Finally there is a solution from AWS. Now you can perform CRUD (create, read, update and delete) operations on AWS Athena. The Athena Iceberg integration is now generally available. Create the table with:
TBLPROPERTIES ( 'table_type' ='ICEBERG' [, property_name=property_value])
then you can use its amazing features.
For a quick introduction, you can watch this video. (Or search for "Insert / Update / Delete on S3 With Amazon Athena and Apache Iceberg | Amazon Web Services" on YouTube.)
Also read the Considerations and Limitations.
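A minimal sketch of what this enables, assuming a hypothetical Iceberg table my_iceberg_table with columns id and attr1:

-- Update rows in place
UPDATE my_iceberg_table SET attr1 = 'new_value' WHERE id = '42';

-- Delete rows
DELETE FROM my_iceberg_table WHERE id = '42';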
Athena supports CTAS (CREATE TABLE AS SELECT) statements as of October 2018. You can specify the output location and file format, among other options.
https://docs.aws.amazon.com/athena/latest/ug/ctas.html
To INSERT into tables you can write additional files in the same format to the S3 path for a given table (this is somewhat of a hack), or preferably add partitions for the new data.
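A sketch of the partition approach, assuming hypothetical names and a Hive-style year/month layout: after writing the new files under the partition's S3 prefix, register the partition with:

ALTER TABLE my_table ADD PARTITION (year = '2019', month = '09')
LOCATION 's3://my-bucket/my-table/year=2019/month=09/';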
Like many big data systems, Athena is not capable of handling UPDATE statements.
We can use Apache Iceberg in combination with Athena to perform CRUD operations on S3 data inside AWS itself.
The only caveat is that at table creation time we need to pass the extra parameter table_type = 'ICEBERG'.
E.g.:
CREATE TABLE demo (
  id string,
  attr1 string
)
LOCATION 's3://path'
TBLPROPERTIES (
  'table_type' = 'ICEBERG'
)
For more details: https://www.youtube.com/watch?v=u1v666EXCJw

AWS Athena - Rename a column

I am trying to change a column name in an AWS Athena table.
From old_name to new_name.
Normal DDL commands do not affect the table (they cannot be executed).
Is it possible to change a column name without deleting and re-creating the table from scratch?
I was mistaken; Athena uses Hive DDL syntax, so the correct command is:
ALTER TABLE %%table-name%% CHANGE %%old-column-name%% %%new-column-name%% %%column-type%%;
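For example, with a hypothetical table my_table and a string column being renamed from old_name to new_name:

ALTER TABLE my_table CHANGE old_name new_name string;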
I based my answer on a Hive-related question.
You can find more about supported and unsupported DDL statements here.