How can I download s3 bucket data? - amazon-web-services

I'm trying to find some way to export data about the objects in an S3 bucket, such as file path, filename, metadata tags, last modified date, and file size, to something like a .csv, .xml, or .json file. Is there any way to generate this without having to manually step through and hand-generate it?
Please note I'm not trying to download all the files; rather, I'm trying to find a way to export the metadata about those files that is presented in the S3 console.

Yes!
From Amazon S3 Inventory - Amazon Simple Storage Service:
Amazon S3 inventory provides comma-separated values (CSV), Apache optimized row columnar (ORC) or Apache Parquet (Parquet) output files that list your objects and their corresponding metadata on a daily or weekly basis for an S3 bucket or a shared prefix (that is, objects that have names that begin with a common string).
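If you would rather set this up programmatically than through the console, here is a minimal sketch using boto3; the bucket names, inventory ID, and prefix are placeholders you would replace (note the destination bucket also needs a policy allowing S3 to deliver the reports):
import boto3

s3 = boto3.client('s3')

# Placeholder bucket names -- replace with your own.
s3.put_bucket_inventory_configuration(
    Bucket='my-source-bucket',
    Id='daily-object-listing',
    InventoryConfiguration={
        'Id': 'daily-object-listing',
        'IsEnabled': True,
        'IncludedObjectVersions': 'Current',
        'Schedule': {'Frequency': 'Daily'},
        # Ask for the extra columns such as size and last modified date.
        'OptionalFields': ['Size', 'LastModifiedDate', 'ETag', 'StorageClass'],
        'Destination': {
            'S3BucketDestination': {
                'Bucket': 'arn:aws:s3:::my-inventory-reports',  # destination given as an ARN
                'Format': 'CSV',
                'Prefix': 'inventory'
            }
        }
    }
)
The report then lands in the destination bucket on the configured schedule as CSV (or ORC/Parquet) files you can open or query directly.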

Related

Redshift COPY from AWS S3 directory full of CSV files

I am trying to perform a COPY query in Redshift in order to load different .csv files stored in an AWS S3 path (let's say s3://bucket/path/csv/). The .csv files in that path contain a date in their filenames (i.e.: s3://bucket/path/csv/file_20200605.csv, s3://bucket/path/csv/file_20200604.csv, ...) since the data inside them corresponds to a specific day. My question here is (since the order of loading the files matters), will Redshift load these files in alphabetical order?
The COPY command leverages the Amazon Redshift massively parallel processing (MPP) architecture to read and load data in parallel from files in an Amazon S3 bucket.
So, with regard to your question: the files will be loaded in parallel rather than sequentially in alphabetical order.
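For reference, here is a rough sketch of issuing that COPY through the Redshift Data API with boto3; the cluster, database, user, table, and IAM role names are placeholders:
import boto3

redshift_data = boto3.client('redshift-data')

# COPY pulls every matching object under the prefix and loads it in parallel;
# there is no guarantee the files are consumed in alphabetical order.
copy_sql = """
    COPY my_schema.my_table
    FROM 's3://bucket/path/csv/'
    IAM_ROLE 'arn:aws:iam::123456789012:role/my-redshift-role'
    CSV
    IGNOREHEADER 1;
"""

redshift_data.execute_statement(
    ClusterIdentifier='my-cluster',   # placeholder
    Database='my_database',           # placeholder
    DbUser='my_user',                 # placeholder
    Sql=copy_sql
)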

Partition csv data in s3 bucket for querying using Athena

I have csv log data coming in every hour to a single s3 bucket and I want to partition it to improve query performance, as well as convert it to parquet.
Also, how can I add partitions automatically for new logs that will be added?
Note:
csv file names follow a standard date format
files are written from an external source and cannot be changed to be written into folders; they only land in the main bucket
I want to convert the csv files to parquet separately
It appears that your situation is:
Objects are being uploaded to an Amazon S3 bucket
You would like those objects to be placed in a path hierarchy to support Amazon Athena partitioning
You could configure an Amazon S3 event to trigger an AWS Lambda function whenever a new object is created.
The Lambda function would:
Read the filename (or the contents of the file) to determine where it should be placed in the hierarchy
Perform a CopyObject() to put the object in the correct location (S3 does not have a 'move' command)
Delete the original object with DeleteObject()
Be careful that the above operation does not result in an event that triggers the Lambda function again (e.g. do it in a different folder or bucket), otherwise an infinite loop would occur. A rough sketch of such a function is shown below.
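Here is a minimal sketch of that Lambda function, assuming the date can be parsed out of the filename; the bucket names and the hive-style partition layout are assumptions you would adapt:
import urllib.parse
import boto3

s3 = boto3.client('s3')

# Assumption: write into a different bucket so the copy does not re-trigger this function.
TARGET_BUCKET = 'my-partitioned-bucket'

def lambda_handler(event, context):
    for record in event['Records']:
        source_bucket = record['s3']['bucket']['name']
        key = urllib.parse.unquote_plus(record['s3']['object']['key'])

        # Assumption: filenames end in an 8-digit date before the extension,
        # e.g. something_20200605.csv.
        date_part = key.rsplit('_', 1)[-1].split('.')[0]
        year, month, day = date_part[:4], date_part[4:6], date_part[6:8]

        new_key = f'logs/year={year}/month={month}/day={day}/{key}'

        # S3 has no 'move': copy the object, then delete the original.
        s3.copy_object(
            Bucket=TARGET_BUCKET,
            Key=new_key,
            CopySource={'Bucket': source_bucket, 'Key': key}
        )
        s3.delete_object(Bucket=source_bucket, Key=key)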
When you wish to convert the CSV files to Parquet, see:
Converting to Columnar Formats - Amazon Athena
Using AWS Athena To Convert A CSV File To Parquet | CloudForecast Blog
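When you get to that step, the conversion boils down to a CREATE TABLE AS SELECT query in Athena. A hedged sketch of kicking one off with boto3, where the database, table, column, and bucket names are placeholders:
import boto3

athena = boto3.client('athena')

# CTAS query that rewrites a CSV-backed table as partitioned Parquet.
# 'logs_csv', 'logs_parquet', the columns and bucket paths are placeholders;
# partition columns must come last in the SELECT list.
ctas_sql = """
    CREATE TABLE logs_parquet
    WITH (
        format = 'PARQUET',
        external_location = 's3://my-bucket/parquet/',
        partitioned_by = ARRAY['dt']
    ) AS
    SELECT col1, col2, dt
    FROM logs_csv;
"""

athena.start_query_execution(
    QueryString=ctas_sql,
    QueryExecutionContext={'Database': 'my_database'},
    ResultConfiguration={'OutputLocation': 's3://my-bucket/athena-results/'}
)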

Tagging objects read by spark on s3

I use pyspark to read objects in an s3 bucket on amazon s3. My bucket is composed of many json files which I read and then save as parquet files with
df = spark.read.json('s3://my-bucket/directory1/')
df.write.parquet('s3://bucket-with-parquet/', mode='append')
Every day I will upload some new files to s3://my-bucket/directory1/ and I would like to write them to s3://bucket-with-parquet/. Is there a way to ensure that I do not process the same data twice? My idea is to tag every file which I read with spark (I do not know how to do that). I could then use those tags to tell spark not to read the file again afterwards (I do not know how to do that either). If an AWS guru could help me with that I would be very grateful.
There are a couple of things you could do. One is to write a script which reads the LastModified timestamp from each object's metadata and builds the list of files added on that day; you then process only the files in that list. (https://medium.com/faun/identifying-the-modified-or-newly-added-files-in-s3-11b577774729)
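A rough sketch of that approach with boto3, keeping only objects whose LastModified falls after the previous run and then tagging them as processed; the bucket name, prefix, and the way the cutoff is persisted are assumptions:
from datetime import datetime, timedelta, timezone
import boto3

s3 = boto3.client('s3')

# Assumption: the previous run was about a day ago; in practice you would
# persist the cutoff somewhere (a parameter store entry, a small file, etc.).
cutoff = datetime.now(timezone.utc) - timedelta(days=1)

new_keys = []
paginator = s3.get_paginator('list_objects_v2')
for page in paginator.paginate(Bucket='my-bucket', Prefix='directory1/'):
    for obj in page.get('Contents', []):
        if obj['LastModified'] > cutoff:
            new_keys.append(obj['Key'])

# Feed only the new objects to Spark, e.g.
# df = spark.read.json([f's3://my-bucket/{k}' for k in new_keys])

# Optionally mark each processed object with a tag so it can be skipped later.
for key in new_keys:
    s3.put_object_tagging(
        Bucket='my-bucket',
        Key=key,
        Tagging={'TagSet': [{'Key': 'processed', 'Value': 'true'}]}
    )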
Second, you can enable versioning on the S3 bucket to make sure that if you overwrite any files you can still retrieve the old version. You can also set an ACL for read-only and write-once permissions, as mentioned here: Amazon S3 ACL for read-only and write-once access.
I hope this helps.

how to append multiple csv files from different folder in s3

I have many csv files under different sub-directories in S3.
I am trying to append all of the data into one csv file.
Is there any way to append all files using s3 or other AWS services?
Thanks
If the resulting csv is not very large (less than a couple of GB), you can use AWS Lambda to go through all the subdirectories (keys) in S3 and write the result file (for example, back into S3).
You could also use AWS Glue for this operation, but I have not used it myself.
In both cases you will need to write a small script to join the files.
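A hedged sketch of that script, which holds everything in memory (fine for a few GB at most); the bucket name, prefix, output key, and the assumption that every csv shares the same header are mine:
import boto3

s3 = boto3.client('s3')

SOURCE_BUCKET = 'my-bucket'           # placeholder
PREFIX = 'data/'                      # placeholder: parent of the sub-directories
OUTPUT_KEY = 'combined/all_data.csv'  # placeholder

pieces = []
paginator = s3.get_paginator('list_objects_v2')
for page in paginator.paginate(Bucket=SOURCE_BUCKET, Prefix=PREFIX):
    for obj in page.get('Contents', []):
        if not obj['Key'].endswith('.csv'):
            continue
        body = s3.get_object(Bucket=SOURCE_BUCKET, Key=obj['Key'])['Body'].read().decode('utf-8')
        lines = body.splitlines()
        # Assumption: every file has the same header row; keep it only once.
        pieces.extend(lines if not pieces else lines[1:])

s3.put_object(Bucket=SOURCE_BUCKET, Key=OUTPUT_KEY, Body='\n'.join(pieces).encode('utf-8'))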

AWS S3 LISTING is slow

I am trying to execute the following command using AWS CLI on an S3 bucket:
aws s3 ls s3://bucket-name/folder_name --summarize --human-readable --recursive
I am trying to get the size of the folder, but given that there are multiple levels and a huge number of files, it has been running for hours.
Is there an efficient way to quickly get the size at folder level on Amazon S3?
You can use Amazon S3 Inventory:
Amazon S3 inventory provides comma-separated values (CSV) or Apache optimized row columnar (ORC) output files that list your objects and their corresponding metadata on a daily or weekly basis for an S3 bucket or a shared prefix (that is, objects that have names that begin with a common string).
You would need to parse the file, but all information is provided.
It is only updated daily, so if you need something faster then you'd have to make the calls yourself.
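If you do need to make the calls yourself, here is a minimal sketch that sums object sizes under a prefix using the ListObjectsV2 paginator; the bucket and prefix names are placeholders:
import boto3

s3 = boto3.client('s3')

total_bytes = 0
total_objects = 0

paginator = s3.get_paginator('list_objects_v2')
for page in paginator.paginate(Bucket='my-bucket', Prefix='folder_name/'):
    for obj in page.get('Contents', []):
        total_bytes += obj['Size']
        total_objects += 1

print(f'{total_objects} objects, {total_bytes / (1024 ** 3):.2f} GiB')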