How to append multiple CSV files from different folders in S3 - amazon-web-services

I have many CSV files under different sub-directories in S3.
I am trying to append all of their data into one CSV file.
Is there any way to append all the files using S3 or another AWS service?
Thanks

If the resulting CSV is not very large (less than a couple of GB), you can use AWS Lambda to go through all sub-directories (keys) in S3 and write the result file, for example back into S3.
You could also use AWS Glue for this operation, but I haven't used it myself.
In both cases you will need to write a small script to join the files.
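A minimal sketch of such a script, written as a Lambda handler with boto3; the bucket, prefix, and result key are placeholders, and it assumes all source CSVs share the same header:

import boto3

s3 = boto3.client("s3")

SOURCE_BUCKET = "my-bucket"           # placeholder
SOURCE_PREFIX = "data/"               # placeholder: sub-directories live under this prefix
RESULT_KEY = "combined/all_data.csv"  # placeholder

def lambda_handler(event, context):
    rows = []
    header = None
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=SOURCE_BUCKET, Prefix=SOURCE_PREFIX):
        for obj in page.get("Contents", []):
            if not obj["Key"].endswith(".csv"):
                continue
            body = s3.get_object(Bucket=SOURCE_BUCKET, Key=obj["Key"])["Body"].read().decode("utf-8")
            lines = body.splitlines()
            if not lines:
                continue
            if header is None:
                header = lines[0]
                rows.append(header)
            rows.extend(lines[1:])  # skip each file's repeated header row
    s3.put_object(Bucket=SOURCE_BUCKET, Key=RESULT_KEY, Body="\n".join(rows).encode("utf-8"))
    return {"rows_written": len(rows)}

Since the whole result is built in memory, this only works while the combined file stays within the Lambda memory limit, which is why the "less than a couple of GB" caveat matters.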

Related

How to combine two manifests into one file in AWS S3?

I have train.manifest and validation.manifest in my S3 bucket. I need to combine them into one file, but I don't know how to do it.

Redshift COPY from AWS S3 directory full of CSV files

I am trying to perform a COPY query in Redshift in order to load different .csv files stored in an AWS S3 path (let's say s3://bucket/path/csv/). The .csv files in that path contain a date in their filenames (i.e.: s3://bucket/path/csv/file_20200605.csv, s3://bucket/path/csv/file_20200604.csv, ...) since the data inside each file corresponds to a specific day. My question here is (since the order of loading the files matters): will Redshift load these files in alphabetical order?
The COPY command leverages the Amazon Redshift massively parallel processing (MPP) architecture to read and load data in parallel from files in an Amazon S3 bucket.
So, with regard to your question, the files will be loaded in parallel rather than in alphabetical order.
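For reference, a hedged sketch of running such a COPY from Python with psycopg2; the cluster endpoint, credentials, table name, and IAM role ARN are all placeholders:

import psycopg2

# Placeholder connection details; swap in your own cluster endpoint and credentials.
conn = psycopg2.connect(
    host="my-cluster.abc123xyz.us-east-1.redshift.amazonaws.com",
    port=5439,
    dbname="dev",
    user="awsuser",
    password="********",
)

# COPY loads every object under the given prefix; Redshift reads them in parallel,
# so no per-file ordering is guaranteed.
copy_sql = """
    COPY my_table
    FROM 's3://bucket/path/csv/'
    IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole'
    CSV
    IGNOREHEADER 1;
"""

with conn, conn.cursor() as cur:
    cur.execute(copy_sql)

If the per-day order of loading actually matters, one option is to issue a separate COPY per file (or per date prefix) in the order you need, rather than one COPY over the whole prefix.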

Combine all csv files into one single csv inside AWS CLI

I have an S3 bucket with a folder called batches. Inside the batches folder I have 20 CSV files. Using the AWS CLI (or a bash script, to be exact), how can I combine all these CSV files into a single CSV file and move it up one folder?
Normally in terminal this is how I do it:
cd batches && cat ./*csv > combined.csv
What would be a comparable way to do this for an S3 bucket using the AWS CLI?
If you only have 20 CSV files in total, then Marcin's suggestion is probably the right way to go. If you have more, or larger, CSV files, then I would suggest taking advantage of multipart upload in S3 and processing them in an AWS Lambda function.
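A rough sketch of the multipart-upload idea in Python with boto3, concatenating the objects server-side via UploadPartCopy. It assumes every source CSV except the last is at least 5 MB (an S3 requirement for copied parts) and that repeated header rows are acceptable or absent; bucket and key names are placeholders:

import boto3

s3 = boto3.client("s3")
BUCKET = "my-bucket"       # placeholder
PREFIX = "batches/"        # placeholder
DEST_KEY = "combined.csv"  # placeholder: one level above batches/

# Collect the CSV keys to stitch together.
keys = [
    obj["Key"]
    for page in s3.get_paginator("list_objects_v2").paginate(Bucket=BUCKET, Prefix=PREFIX)
    for obj in page.get("Contents", [])
    if obj["Key"].endswith(".csv")
]

# Copy each source object as one part of a multipart upload (no download needed).
mpu = s3.create_multipart_upload(Bucket=BUCKET, Key=DEST_KEY)
parts = []
for i, key in enumerate(keys, start=1):
    result = s3.upload_part_copy(
        Bucket=BUCKET,
        Key=DEST_KEY,
        UploadId=mpu["UploadId"],
        PartNumber=i,
        CopySource={"Bucket": BUCKET, "Key": key},
    )
    parts.append({"ETag": result["CopyPartResult"]["ETag"], "PartNumber": i})

s3.complete_multipart_upload(
    Bucket=BUCKET,
    Key=DEST_KEY,
    UploadId=mpu["UploadId"],
    MultipartUpload={"Parts": parts},
)

The advantage of this approach is that the data never leaves S3, so it scales to files far larger than what you would want to cat together locally.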

Tagging objects read by spark on s3

I use pyspark to read objects in an S3 bucket on Amazon S3. My bucket is composed of many JSON files, which I read and then save as Parquet files with
df = spark.read.json('s3://my-bucket/directory1/')
df.write.parquet('s3://bucket-with-parquet/', mode='append')
Every day I will upload some new files to s3://my-bucket/directory1/ and I would like to add them to s3://bucket-with-parquet/. Is there a way to ensure that I do not load the same data twice? My idea is to tag every file that I read with Spark (I don't know how to do it). I could then use those tags to tell Spark not to read the file again (I don't know how to do that either). If an AWS guru could help me with that I would be very grateful.
There are a couple of things you could do. One is to write a script which reads the timestamps from the objects' metadata in the bucket and returns the list of files added on that day; you then process only the files in that list. (https://medium.com/faun/identifying-the-modified-or-newly-added-files-in-s3-11b577774729)
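A small sketch of that first idea, assuming boto3 is available alongside pyspark; the bucket, prefix, and cutoff timestamp are placeholders:

from datetime import datetime, timezone
import boto3
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
s3 = boto3.client("s3")

BUCKET = "my-bucket"    # placeholder, as in the question
PREFIX = "directory1/"  # placeholder
last_run = datetime(2020, 6, 4, tzinfo=timezone.utc)  # placeholder: persisted from the previous run

# List only the objects added or modified since the last run.
new_paths = [
    f"s3://{BUCKET}/{obj['Key']}"
    for page in s3.get_paginator("list_objects_v2").paginate(Bucket=BUCKET, Prefix=PREFIX)
    for obj in page.get("Contents", [])
    if obj["LastModified"] > last_run and obj["Key"].endswith(".json")
]

if new_paths:
    df = spark.read.json(new_paths)  # spark.read.json accepts a list of paths
    df.write.parquet("s3://bucket-with-parquet/", mode="append")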
Second, you can enable versioning on the S3 bucket to make sure that if you overwrite any file you can still retrieve the old version. You can also set an ACL for read-only and write-once permissions, as mentioned here: Amazon S3 ACL for read-only and write-once access.
I hope this helps.

How can I download s3 bucket data?

I'm trying to find some way to export data about an S3 bucket, such as file path, filenames, metadata tags, last modified, and file size, to something like a .csv, .xml, or .json file. Is there any way to generate this without having to step through and hand-generate it manually?
Please note I'm not trying to download all the files; rather, I'm trying to find a way to export the data about those files that is presented in the S3 console.
Yes!
From Amazon S3 Inventory - Amazon Simple Storage Service:
Amazon S3 inventory provides comma-separated values (CSV), Apache optimized row columnar (ORC) or Apache Parquet (Parquet) output files that list your objects and their corresponding metadata on a daily or weekly basis for an S3 bucket or a shared prefix (that is, objects that have names that begin with a common string).
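A sketch of enabling a daily CSV inventory report with boto3; the bucket names, configuration Id, prefix, and chosen optional fields below are placeholders you would replace with your own:

import boto3

s3 = boto3.client("s3")

s3.put_bucket_inventory_configuration(
    Bucket="my-source-bucket",    # placeholder: the bucket you want inventoried
    Id="daily-csv-inventory",     # placeholder
    InventoryConfiguration={
        "Id": "daily-csv-inventory",
        "IsEnabled": True,
        "IncludedObjectVersions": "Current",
        "Schedule": {"Frequency": "Daily"},
        "Destination": {
            "S3BucketDestination": {
                "Bucket": "arn:aws:s3:::my-inventory-bucket",  # placeholder destination bucket ARN
                "Format": "CSV",
                "Prefix": "inventory",
            }
        },
        # Extra columns to include in the report (key and bucket are always included).
        "OptionalFields": ["Size", "LastModifiedDate", "StorageClass", "ETag"],
    },
)

The report is delivered to the destination bucket on the chosen schedule, so the first CSV only appears after the next scheduled run rather than immediately.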