I have an S3 bucket with a folder called batches. Inside the batches folder, I have 20 CSV files. Using the AWS CLI (or a bash script, to be exact), how can I combine all these CSV files into a single CSV file and move it up one folder?
Normally in terminal this is how I do it:
cd batches && cat ./*csv > combined.csv
What would be a comparable way to do this for an S3 bucket inside AWS CLI?
If you only have 20 CSV files in total, then Marcin's suggestion is probably the right way to go. If you have more, larger CSV files, then I would suggest taking advantage of multipart upload in S3 and processing them in AWS Lambda.
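For a small number of files, the CLI equivalent of the local workflow in the question is simply to download, concatenate and re-upload. A minimal sketch (my-bucket is a placeholder bucket name, adjust it to yours):

# Download the CSVs from the batches/ prefix, concatenate locally, upload one level up
aws s3 cp s3://my-bucket/batches/ ./batches/ --recursive --exclude "*" --include "*.csv"
cat ./batches/*.csv > combined.csv
aws s3 cp combined.csv s3://my-bucket/combined.csv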
I have an S3 bucket with millions of files, and I want to download all of them. Since I don't have enough storage, I would like to download them, compress them on the fly and only then save them. How do I do this?
To illustrate what I mean:
aws s3 cp --recursive s3://bucket | gzip > file
If you want to compress them all into a single file, as your question seems to indicate, you can add a - to the end of the CLI command to make it write to StdOut:
aws s3 cp --recursive s3://bucket - | gzip > file
If you want to compress them as individual files, then you'll need to first get a listing of all the files, then iterate through them and download/compress one at a time.
But you'll probably find it faster as well as cheaper to spin up a public EC2 instance in the same region with enough disk space to hold the uncompressed files, download them all at once, and compress them there (data going from S3 to EC2 is free as long as it doesn't go through a NAT or cross regions). You can then download the compressed files from the instance (or upload them back to S3 first) and shut it down.
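To illustrate the per-file approach, a rough sketch that lists every object and streams it through gzip one at a time (my-bucket is a placeholder; keys containing spaces would need extra handling):

# List every key in the bucket, then stream each object to stdout and gzip it locally
aws s3 ls s3://my-bucket --recursive | awk '{print $4}' | while read -r key; do
  mkdir -p "$(dirname "$key")"
  aws s3 cp "s3://my-bucket/$key" - | gzip > "${key}.gz"
done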
I have an S3 bucket with a bunch of zip files. I want to decompress the zip files and for each decompressed item, I want to create an $file.gz and save it to another S3 bucket. I was thinking of creating a Glue job for it but I don't know how to begin with. Any leads?
Eventually, I would like to terraform my solution, and it should be triggered whenever there are new files in the S3 bucket.
Would a Lambda function or any other service be more suited for this?
From an architectural point of view, it depends on the file size of your ZIP files - if the process takes less than 15 minutes, then you can use Lambda functions.
If more, you will hit the current 15 minute Lambda timeout so you'll need to go ahead with a different solution.
However, for your use case of triggering on new files, S3 triggers will allow you to trigger a Lambda function when there are files created/deleted from the bucket.
I would recommend segregating the ZIP files into their own bucket; otherwise you'll also be paying for checking whether each uploaded file is in your specific "folder", as the Lambda will be triggered for the entire bucket (it'll be negligible, but still worth pointing out). If segregated, you'll know that any file uploaded is a ZIP file.
Your Lambda can then download the file from S3 using download_file (example provided in the Boto3 documentation), unzip it using zipfile & finally GZIP-compress each file using gzip.
You can then upload the output file to the new bucket using upload_file (example provided in the Boto3 documentation) & then delete the original file from the original bucket using delete_object.
Terraforming the above should also be relatively simple as you'll mostly be using the aws_lambda_function & aws_s3_bucket resources.
Make sure your Lambda has the correct execution role with the appropriate IAM policies to access both S3 buckets & you should be good to go.
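For illustration only, here is the same download → unzip → gzip → upload → delete flow sketched with the AWS CLI and standard tools (inside the Lambda itself you would use the Boto3 calls mentioned above; bucket names are placeholders and the archive is assumed to contain flat files):

# Download the archive, extract it, compress each extracted file, upload, then clean up
aws s3 cp s3://zip-source-bucket/archive.zip /tmp/archive.zip
unzip /tmp/archive.zip -d /tmp/extracted
gzip /tmp/extracted/*                                   # one .gz per extracted file
aws s3 cp /tmp/extracted/ s3://gz-target-bucket/ --recursive
aws s3 rm s3://zip-source-bucket/archive.zip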
I have lots of .tar.gz files (millions) stored in a bucket on Amazon S3.
I'd like to untar them and create the corresponding folders on Amazon S3 (in the same bucket or another).
Is it possible to do this without me having to download/process them locally?
It's not possible with only S3. You'll need something like EC2, ECS or Lambda, preferably running in the same region as the S3 bucket, to download the .tar.gz files, extract them, and upload every extracted file back to S3.
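As a rough illustration of that loop on an EC2/ECS worker (bucket name and prefixes are placeholders; keys containing spaces would need extra handling):

mkdir -p ./extracted
# Find every .tar.gz key, stream it through tar, then upload everything that was extracted
aws s3 ls s3://my-bucket/ --recursive | awk '{print $4}' | grep '\.tar\.gz$' | while read -r key; do
  aws s3 cp "s3://my-bucket/$key" - | tar -xzf - -C ./extracted
done
aws s3 cp ./extracted/ s3://my-bucket/extracted/ --recursive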
With AWS Lambda you can do this! You won't have the extra data-transfer costs of downloading and uploading, since everything stays within the AWS network.
You can follow this blog post, but unzip instead of zipping:
https://www.antstack.io/blog/create-zip-using-lambda-with-files-streamed-from-s3/
I want to merge a set of csv files and zip them in GCP.
I will be getting a folder containing a lot of CSV files in a GCP bucket (40 GB of data).
Once the entire data is received, I need to merge all the csv files together into 1 file and zip it.
Then store it to another location. I only need to do this once a month.
What is the best way in which I can achieve this?
I was planning to use the strategy below, but I don't know if it's a good solution:
1. A Pub/Sub to listen to the bucket folder and invoke a Cloud Function from there.
2. The Cloud Function will call Cloud Composer containing a DAG to do the activity.
It might be a lot easier to send the CSV files to a directory inside a GCP instance. Once there, you can use a cron job to zip the files and finally copy the result into your bucket with gsutil.
If sending the files to the instance is not feasible, you can download them with gsutil, zip them and upload the zip file again.
Either way, you will have to give the instance's service account the proper IAM roles to modify the content of the bucket, or give it ACL-level access. Finally, don't forget to give your instance the proper scopes.
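A rough sketch of the download-merge-compress variant with gsutil (bucket names are placeholders; for ~40 GB, make sure the instance has enough disk, and note this uses gzip rather than zip):

mkdir -p ./csvs
# Pull all CSVs down in parallel, merge them, compress the result and push it to the destination bucket
gsutil -m cp "gs://source-bucket/csv-folder/*.csv" ./csvs/
cat ./csvs/*.csv > merged.csv
gzip merged.csv                                   # produces merged.csv.gz
gsutil cp merged.csv.gz gs://destination-bucket/merged.csv.gz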
I have many CSV files under different sub-directories in S3.
I am trying to append all the data into one CSV file.
Is there any way to append all the files using S3 or other AWS services?
Thanks
If the resulting CSV is not very large (less than a couple of GB), you can use AWS Lambda to go through all the subdirectories (keys) in S3 and write the result file (for example, back into S3).
You could also use AWS Glue for this operation, but I haven't used it myself.
In both cases you will need to write some script to join the files.
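For the scripting part, a minimal sketch with the AWS CLI that appends every CSV found under a prefix (bucket and prefix names are placeholders; the same logic applies inside a Lambda with Boto3, and if each CSV has a header row you would need to strip it from all but the first file):

# Stream each CSV under the prefix and append it to one local file, then upload the result
aws s3 ls s3://my-bucket/data/ --recursive | awk '{print $4}' | grep '\.csv$' | while read -r key; do
  aws s3 cp "s3://my-bucket/$key" - >> all_data.csv
done
aws s3 cp all_data.csv s3://my-bucket/all_data.csv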