Add upload time column to AWS S3 CSV upload - amazon-web-services

I am uploading a .CSV file into AWS S3 and then this is pulled into an AWS Athena table.
Is it possible to automatically add a column at the end of the Athena table that shows the time that the CSV file was uploaded?
The process is that I receive external data at regular intervals and will always upload it via S3. It would be great if the upload time could be included for every CSV.
Is this possible?

There is a special "$path" column that provides the "Amazon S3 file location for the data in a table row":
SELECT "$path" FROM "my_database"."my_table" WHERE year=2019;
Therefore, if the filename of the CSV file contained a date/time, you could extract it in the query (see the sketch after this answer).
However, there is no special column for showing the upload date of the file.
See: Getting the file locations for source data in Amazon S3 - Amazon Athena
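For example, if the objects were uploaded with names like data_20190315T120000.csv (a made-up naming convention, not something from your question), a query along these lines could parse the timestamp out of "$path". A minimal sketch using boto3; the database, table, and results bucket are placeholders:

import boto3

athena = boto3.client("athena")

# Parse an upload timestamp out of the "$path" pseudo-column, assuming the
# hypothetical data_YYYYMMDDTHHMMSS.csv naming convention mentioned above.
query = """
SELECT *,
       date_parse(regexp_extract("$path", '(\\d{8}T\\d{6})'), '%Y%m%dT%H%i%s') AS upload_time
FROM "my_database"."my_table"
"""

athena.start_query_execution(
    QueryString=query,
    QueryExecutionContext={"Database": "my_database"},
    ResultConfiguration={"OutputLocation": "s3://my-query-results/"},  # placeholder bucket
)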

Related

Query to multiple csv files at S3 through Athena

I exported my SQL DB into S3 in csv format. Each table is exported into a separate csv file and saved in Amazon S3. Now, can I send a query to that S3 bucket that joins multiple tables (multiple csv files in S3) and get a result set? How can I do that and save the result in a separate csv file?
The steps are:
Put all files related to one table into a separate folder (directory path) in the S3 bucket. Do not mix files from multiple tables in the same folder because Amazon Athena will assume they all belong to one table.
Use a CREATE TABLE statement to define a new table in Amazon Athena, and specify where the files are kept via the LOCATION 's3://bucket_name/[folder]/' parameter. This tells Athena which folder to use when reading the data (a sample statement is sketched below).
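For illustration, the DDL might look something like this, submitted through boto3; the table name, columns, and bucket path below are assumptions, not anything from the question:

import boto3

athena = boto3.client("athena")

# Hypothetical CSV layout: a header row plus three columns per file.
ddl = """
CREATE EXTERNAL TABLE IF NOT EXISTS my_database.orders (
  order_id    int,
  customer_id int,
  order_date  string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 's3://bucket_name/orders/'
TBLPROPERTIES ('skip.header.line.count' = '1')
"""

athena.start_query_execution(
    QueryString=ddl,
    QueryExecutionContext={"Database": "my_database"},
    ResultConfiguration={"OutputLocation": "s3://my-query-results/"},  # placeholder
)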
Or, instead of using CREATE TABLE, an easier way is:
Go to the AWS Glue management console
Select Create crawler
Select Add a data source and provide the location in S3 where the data is stored
Provide other information as prompted (you'll figure it out)
Then, run the crawler and AWS Glue will look at the data files in the specified folder and will automatically create a table for that data. The table will appear in the Amazon Athena console.
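If you would rather script this than click through the console, a rough boto3 equivalent could look like the sketch below; the crawler name, IAM role, database, and path are all placeholders:

import boto3

glue = boto3.client("glue")

# Create a crawler pointed at the folder that holds one table's CSV files.
glue.create_crawler(
    Name="csv-table-crawler",                        # placeholder name
    Role="arn:aws:iam::123456789012:role/GlueRole",  # placeholder IAM role
    DatabaseName="my_database",
    Targets={"S3Targets": [{"Path": "s3://bucket_name/orders/"}]},
)

# Run it; the resulting table appears in the Glue Data Catalog and in Athena.
glue.start_crawler(Name="csv-table-crawler")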
Once you have created the tables, you can use normal SQL to query and join the tables.
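As a sketch of that last step, here is one way to run a join and save the result set back to S3 as a new CSV, using the awswrangler (AWS SDK for pandas) library; the table, column, and bucket names are invented for the example:

import awswrangler as wr

# Join two Athena tables (each backed by CSV files in its own S3 folder)
# and pull the result into a pandas DataFrame.
df = wr.athena.read_sql_query(
    """
    SELECT o.order_id, o.order_date, c.customer_name
    FROM orders o
    JOIN customers c ON o.customer_id = c.customer_id
    """,
    database="my_database",
)

# Save the result set as a separate CSV file in S3.
wr.s3.to_csv(df, "s3://bucket_name/exports/orders_with_customers.csv", index=False)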

How to unload table data in redshift to s3 bucket in excel format

I have a table stored in Redshift. I want to share the data with my colleagues in Excel format via an S3 bucket.
I know how to share it in csv format but not Excel format. Please help.
This can be done via a Lambda function that you program. You can use the Redshift Data API client to read the data from a Redshift table. In your Lambda function, you can write the data to an Excel file using an Excel API such as Apache POI. Then use the Amazon S3 API to write the Excel file to an Amazon S3 bucket.
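A rough sketch of that flow in Python, using the Redshift Data API plus pandas/openpyxl in place of Apache POI (which is a Java library); every identifier below is a placeholder:

import time
import boto3
import pandas as pd  # .to_excel() needs openpyxl installed

rsd = boto3.client("redshift-data")
s3 = boto3.client("s3")

# Submit the query through the Redshift Data API.
resp = rsd.execute_statement(
    ClusterIdentifier="my-cluster",
    Database="mydb",
    DbUser="awsuser",
    Sql="SELECT * FROM my_schema.my_table",
)
statement_id = resp["Id"]

# Wait for the statement to finish (simplistic polling, fine for a sketch).
while rsd.describe_statement(Id=statement_id)["Status"] not in ("FINISHED", "FAILED", "ABORTED"):
    time.sleep(2)

# Fetch the result set and turn it into a DataFrame.
result = rsd.get_statement_result(Id=statement_id)
columns = [col["name"] for col in result["ColumnMetadata"]]
rows = [[list(field.values())[0] for field in record] for record in result["Records"]]
df = pd.DataFrame(rows, columns=columns)

# Write the Excel file to /tmp (writable in Lambda) and upload it to S3.
df.to_excel("/tmp/my_table.xlsx", index=False)
s3.upload_file("/tmp/my_table.xlsx", "my-bucket", "exports/my_table.xlsx")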
Amazon Redshift only UNLOADs in CSV or Parquet format.
Excel can open CSV files.
If you want the files in Excel format, you will need 'something' to do the conversion. This could be a Python program, an SQL client, or probably many other options. However, S3 will not do it for you.
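For the UNLOAD-then-convert route, a minimal sketch; the cluster, role, and bucket names are placeholders, and pandas needs s3fs and openpyxl installed:

import boto3
import pandas as pd

# Step 1: ask Redshift to UNLOAD the table to S3 as CSV.
boto3.client("redshift-data").execute_statement(
    ClusterIdentifier="my-cluster",
    Database="mydb",
    DbUser="awsuser",
    Sql="""
        UNLOAD ('SELECT * FROM my_schema.my_table')
        TO 's3://my-bucket/unload/my_table_'
        IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftUnloadRole'
        CSV HEADER PARALLEL OFF
    """,
)

# Step 2: once the UNLOAD has finished (PARALLEL OFF writes a single object;
# check the exact key it produced), convert the CSV to Excel.
df = pd.read_csv("s3://my-bucket/unload/my_table_000")
df.to_excel("my_table.xlsx", index=False)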

Partition csv data in s3 bucket for querying using Athena

I have csv log data arriving every hour in a single s3 bucket and I want to partition it to improve query performance, as well as convert it to parquet.
Also, how can I add partitions automatically for new logs that will be added?
Note:
csv file names follow a standard date format
files are written by an external source that cannot be changed to write into folders; they land only in the main bucket
I want to convert the csv files to parquet separately
It appears that your situation is:
Objects are being uploaded to an Amazon S3 bucket
You would like those objects to be placed in a path hierarchy to support Amazon Athena partitioning
You could configure an Amazon S3 event to trigger an AWS Lambda function whenever a new object is created.
The Lambda function would (see the sketch after this list):
Read the filename (or the contents of the file) to determine where it should be placed in the hierarchy
Perform a CopyObject() to put the object in the correct location (S3 does not have a 'move' command)
Delete the original object with DeleteObject()
Be careful that the above operation does not result in an event that triggers the Lambda function again (eg do it in a different folder or bucket), otherwise an infinite loop would occur.
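A minimal sketch of such a Lambda handler, again assuming a made-up data_YYYYMMDDTHHMMSS.csv naming convention, and assuming the event notification is scoped so that objects written under the partitioned/ prefix do not re-trigger it:

import urllib.parse
import boto3

s3 = boto3.client("s3")

def lambda_handler(event, context):
    # Triggered by an S3 "ObjectCreated" event on the incoming prefix.
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])

        # Derive a partition path from the filename, e.g.
        # data_20190315T120000.csv -> year=2019/month=03/day=15/
        filename = key.split("/")[-1]
        stamp = filename.split("_")[1]           # "20190315T120000.csv"
        year, month, day = stamp[0:4], stamp[4:6], stamp[6:8]
        new_key = f"partitioned/year={year}/month={month}/day={day}/{filename}"

        # S3 has no 'move': copy to the new location, then delete the original.
        s3.copy_object(Bucket=bucket, Key=new_key,
                       CopySource={"Bucket": bucket, "Key": key})
        s3.delete_object(Bucket=bucket, Key=key)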
When you wish to convert the CSV files to Parquet, see:
Converting to Columnar Formats - Amazon Athena
Using AWS Athena To Convert A CSV File To Parquet | CloudForecast Blog
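The pattern described in those links is a CREATE TABLE AS SELECT (CTAS) statement; a hedged sketch via boto3, with made-up database, table, column, and bucket names:

import boto3

athena = boto3.client("athena")

# CTAS: read the CSV-backed table and write out a partitioned Parquet copy.
ctas = """
CREATE TABLE my_database.logs_parquet
WITH (
  format = 'PARQUET',
  external_location = 's3://bucket_name/logs_parquet/',
  partitioned_by = ARRAY['year', 'month', 'day']
)
AS
SELECT col1, col2, col3, year, month, day  -- partition columns must come last
FROM my_database.logs_csv
"""

athena.start_query_execution(
    QueryString=ctas,
    QueryExecutionContext={"Database": "my_database"},
    ResultConfiguration={"OutputLocation": "s3://my-query-results/"},  # placeholder
)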

How can I download s3 bucket data?

I'm trying to find some way to export data from an s3 bucket, such as file path, filenames, metadata tags, last modified, and file size, to something like a .csv, .xml, or .json file. Is there any way to generate this without having to manually step through and hand-generate it?
Please note I'm not trying to download all the files; rather, I'm trying to find a way to export the metadata about those files that is shown in the S3 console.
Yes!
From Amazon S3 Inventory - Amazon Simple Storage Service:
Amazon S3 inventory provides comma-separated values (CSV), Apache optimized row columnar (ORC) or Apache Parquet (Parquet) output files that list your objects and their corresponding metadata on a daily or weekly basis for an S3 bucket or a shared prefix (that is, objects that have names that begin with a common string).
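The inventory can be switched on in the S3 console, or scripted; a sketch with boto3, where the bucket names and inventory ID are placeholders:

import boto3

s3 = boto3.client("s3")

# Ask S3 to publish a daily CSV listing of the bucket's objects and metadata.
s3.put_bucket_inventory_configuration(
    Bucket="my-source-bucket",
    Id="daily-inventory",
    InventoryConfiguration={
        "Id": "daily-inventory",
        "IsEnabled": True,
        "IncludedObjectVersions": "Current",
        "Schedule": {"Frequency": "Daily"},
        "OptionalFields": ["Size", "LastModifiedDate", "ETag", "StorageClass"],
        "Destination": {
            "S3BucketDestination": {
                "Bucket": "arn:aws:s3:::my-inventory-bucket",
                "Format": "CSV",
                "Prefix": "inventory",
            }
        },
    },
)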

AWS Glue ETL Job fails with AnalysisException: u'Unable to infer schema for Parquet. It must be specified manually.;'

I'm trying to create an AWS Glue ETL Job that would load data from parquet files stored in S3 into a Redshift table.
The parquet files were written using pandas with the 'simple' file schema option into multiple folders in an S3 bucket.
The layout looks like this:
s3://bucket/parquet_table/01/file_1.parquet
s3://bucket/parquet_table/01/file_2.parquet
s3://bucket/parquet_table/01/file_3.parquet
s3://bucket/parquet_table/02/file_1.parquet
s3://bucket/parquet_table/02/file_2.parquet
s3://bucket/parquet_table/02/file_3.parquet
I can use an AWS Glue Crawler to create a table in the AWS Glue Catalog, and that table can be queried from Athena, but it does not work when I try to create an ETL Job that would copy the same table to Redshift.
If I crawl a single file, or multiple files in one folder, it works; but as soon as there are multiple folders involved, I get the above-mentioned error:
AnalysisException: u'Unable to infer schema for Parquet. It must be specified manually.;'
Similar issues appear if, instead of the 'simple' schema, I use 'hive'. Then we have multiple folders, and also empty parquet files that throw:
java.io.IOException: Could not read footer: java.lang.RuntimeException: xxx is not a Parquet file (too small)
Is there some recommendation on how to read Parquet files and structure them in S3 when using AWS Glue (ETL and Data Catalog)?
Redshift doesn't support parquet format. Redshift Spectrum does. Athena also supports parquet format.
The error that you're facing occurs because, when reading the parquet files from S3, Spark/Glue expects the data to be in Hive-style partitions, i.e. the partition folder names should be key=value pairs. You'll have to lay out the S3 hierarchy in Hive-style partitions, something like below:
s3://your-bucket/parquet_table/id=1/file1.parquet
s3://your-bucket/parquet_table/id=2/file2.parquet
and so on.
Then use the path below to read all the files in the bucket:
location: s3://your-bucket/parquet_table
If the data in S3 is partitioned in the above way, you won't face any issues.
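Once the prefixes are in that key=value form, a Glue (PySpark) job can read the whole prefix in one go; a minimal sketch that only runs inside a Glue job environment, with the bucket name carried over from above:

from pyspark.context import SparkContext
from awsglue.context import GlueContext

glue_context = GlueContext(SparkContext.getOrCreate())
spark = glue_context.spark_session

# Reading the table's root prefix picks up every id=<value> partition
# and exposes 'id' as a column in the resulting DataFrame.
df = spark.read.parquet("s3://your-bucket/parquet_table/")
df.printSchema()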