I want to merge a set of csv files and zip them in GCP.
I will be getting a folder containing a lot of csv files in GCP bucket (40 GB of data).
Once the entire data is received, I need to merge all the csv files together into 1 file and zip it.
Then store it to another location. I only need to do this once a month.
What is the best way in which I can achieve this?
I was planning to use the strategy below, but I don't know if it's a good solution:
Use Pub/Sub to listen to the bucket folder and invoke a Cloud Function from there.
The Cloud Function will trigger a Cloud Composer DAG to do the actual work.
It might be a lot easier to send the CSV files to a directory inside a GCP instance. Once there, you can use a cron job to zip the files and finally copy the result into your bucket with gsutil.
If sending the files to the instance is not feasible you can download them with gsutil, zip them and upload the zip file again.
Either way, you will have to give the instance's service account the proper IAM roles to modify the content of the bucket, or give it ACL-level access. Finally, don't forget to give your instance the proper scopes.
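As a rough illustration of the second option (download, merge, compress, re-upload) with the google-cloud-storage Python client: the bucket names and prefix below are placeholders, it assumes every CSV shares one identical header line, it needs an instance with enough disk for the merged 40 GB file, and it produces a .gz rather than a true .zip (swap in zipfile if a real .zip is required).
import gzip
import shutil
from google.cloud import storage

client = storage.Client()
src = client.bucket("monthly-csv-bucket")   # placeholder source bucket
dst = client.bucket("monthly-archive")      # placeholder destination bucket

first = True
with open("merged.csv", "wb") as merged:
    for blob in src.list_blobs(prefix="monthly-drop/"):   # placeholder folder
        if not blob.name.endswith(".csv"):
            continue
        data = blob.download_as_bytes()
        if not first:
            # drop the repeated header line of every file after the first
            data = data.split(b"\n", 1)[1]
        first = False
        merged.write(data)

# compress the merged file and upload it to the destination bucket
with open("merged.csv", "rb") as f_in, gzip.open("merged.csv.gz", "wb") as f_out:
    shutil.copyfileobj(f_in, f_out)
dst.blob("archive/merged.csv.gz").upload_from_filename("merged.csv.gz")
Run by the monthly cron job mentioned above, this replaces the manual gsutil download/zip/upload steps with a single script.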
I'm trying to implement a backup mechanism to S3 bucket in my code.
Each time a condition is met I need to upload an entire directory contents to an S3 bucket.
I am using this code example:
https://github.com/aws/aws-sdk-go/tree/c20265cfc5e05297cb245e5c7db54eed1468beb8/example/service/s3/sync
which creates an iterator over the directory contents and then uses s3manager.Uploader.UploadWithIterator to upload them.
Everything works; however, I noticed it uploads all files and overwrites existing files in the bucket even if they weren't modified since the last backup. I only want to upload the delta between backups.
I know the AWS CLI has the command aws s3 sync <dir> <bucket>, which does exactly what I need, but I couldn't find anything equivalent in the aws-sdk documentation.
Appreciate the help, thank you!
There is no such feature in the aws-sdk. You could implement it yourself by checking the hash of the local file against the object in the bucket before each upload, or use a community solution such as https://www.npmjs.com/package/s3-sync-client
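The question uses aws-sdk-go, but the hash-comparison idea is the same in any SDK. Here is a rough Python/boto3 sketch of it; the bucket and directory names are placeholders, and note that an object's ETag only equals its MD5 when the object was not uploaded via a multipart upload.
import hashlib
import os

import boto3
from botocore.exceptions import ClientError

s3 = boto3.client("s3")
BUCKET = "my-backup-bucket"   # placeholder
LOCAL_DIR = "backup-dir"      # placeholder

def needs_upload(bucket, key, local_path):
    # compare the local file's MD5 with the object's ETag
    # (only valid for objects that were not uploaded via multipart)
    try:
        head = s3.head_object(Bucket=bucket, Key=key)
    except ClientError:
        return True  # object does not exist yet
    etag = head["ETag"].strip('"')
    with open(local_path, "rb") as f:
        local_md5 = hashlib.md5(f.read()).hexdigest()
    return etag != local_md5

for root, _, files in os.walk(LOCAL_DIR):
    for name in files:
        path = os.path.join(root, name)
        key = os.path.relpath(path, LOCAL_DIR).replace(os.sep, "/")
        if needs_upload(BUCKET, key, path):
            s3.upload_file(path, BUCKET, key)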
I have an S3 bucket with a bunch of ZIP files. I want to decompress the ZIP files and, for each decompressed item, create a $file.gz and save it to another S3 bucket. I was thinking of creating a Glue job for it, but I don't know where to begin. Any leads?
Eventually, I would like to Terraform my solution, and it should be triggered whenever there are new files in the S3 bucket.
Would a Lambda function or any other service be more suited for this?
From an architectural point of view, it depends on the file size of your ZIP files - if the process takes less than 15 minutes, then you can use Lambda functions.
If more, you will hit the current 15 minute Lambda timeout so you'll need to go ahead with a different solution.
However, for your use case of triggering on new files, S3 triggers will allow you to invoke a Lambda function whenever files are created in or deleted from the bucket.
I would recommend segregating the ZIP files into their own bucket; otherwise you'll also be paying for checking whether each uploaded file is in your specific "folder", as the Lambda will be triggered for the entire bucket (it'll be negligible, but still worth pointing out). If segregated, you'll know that any uploaded file is a ZIP file.
Your Lambda can then download the file from S3 using download_file (example provided in the Boto3 documentation), unzip it using zipfile & eventually GZIP-compress each file using gzip.
You can then upload the output file to the new bucket using upload_file (example provided in the Boto3 documentation) & then delete the original file from the original bucket using delete_object.
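A rough sketch of such a handler using boto3, zipfile and gzip as described above; the destination bucket name is a placeholder, and it assumes each archive and its contents fit within Lambda's /tmp storage.
import gzip
import os
import shutil
import urllib.parse
import zipfile

import boto3

s3 = boto3.client("s3")
DEST_BUCKET = "my-gzip-output-bucket"  # placeholder

def handler(event, context):
    # invoked by an S3 "ObjectCreated" notification on the ZIP bucket
    for record in event["Records"]:
        src_bucket = record["s3"]["bucket"]["name"]
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])

        # 1. download the ZIP to Lambda's /tmp storage
        local_zip = "/tmp/" + os.path.basename(key)
        s3.download_file(src_bucket, key, local_zip)

        # 2. unzip, gzip each member, and upload it as <member>.gz
        with zipfile.ZipFile(local_zip) as zf:
            for member in zf.namelist():
                if member.endswith("/"):
                    continue  # skip directory entries
                gz_path = "/tmp/" + os.path.basename(member) + ".gz"
                with zf.open(member) as src, gzip.open(gz_path, "wb") as gz:
                    shutil.copyfileobj(src, gz)
                s3.upload_file(gz_path, DEST_BUCKET, member + ".gz")
                os.remove(gz_path)

        # 3. clean up: remove the local copy and delete the original ZIP
        os.remove(local_zip)
        s3.delete_object(Bucket=src_bucket, Key=key)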
Terraforming the above should also be relatively simple as you'll mostly be using the aws_lambda_function & aws_s3_bucket resources.
Make sure your Lambda has the correct execution role with the appropriate IAM policies to access both S3 buckets & you should be good to go.
I have a requirement to copy certain files from an S3 bucket to my local machine. Below are the important points to note about my requirement:
The files are kept in the S3 bucket in date-based folders.
The files have a .csv.gz extension, and I need to change them to .csv and copy them to my local machine.
The bucket keeps updating every minute, and I need to copy only the new files and process them. Files that have already been processed should not be copied again.
I have tried syncing the folder, but after a file is processed it is renamed, so the .csv.gz file gets synced to the local folder again.
I am planning to use some scheduled task to con.
Amazon S3 is a storage service. It cannot 'process' files for you.
If you wish to change the contents of a file (eg converting from .csv.gz to .csv), you would need to do this yourself on your local computer.
The AWS Command-Line Interface (CLI) aws s3 sync command makes it easy to copy files that have been changed/added since the previous sync. However, if you are changing the files locally (unzipping), then you will likely need to write your own program to download from Amazon S3.
There are AWS SDKs available for popular programming languages. You can also do a web search to find sample code for using Amazon S3.
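As a rough illustration of that kind of program, the Python/boto3 sketch below (bucket, prefix, and state-file names are placeholders) lists the objects under a date prefix, downloads only keys it has not seen before, and decompresses each .csv.gz to .csv, so the local renaming no longer causes re-downloads.
import gzip
import json
import os
import shutil

import boto3

s3 = boto3.client("s3")
BUCKET = "my-source-bucket"          # placeholder
PREFIX = "2024-01-15/"               # the date folder to pull from (placeholder)
STATE_FILE = "processed_keys.json"   # simple local record of what was already handled

processed = set()
if os.path.exists(STATE_FILE):
    with open(STATE_FILE) as f:
        processed = set(json.load(f))

paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=BUCKET, Prefix=PREFIX):
    for obj in page.get("Contents", []):
        key = obj["Key"]
        if key in processed or not key.endswith(".csv.gz"):
            continue
        local_gz = os.path.basename(key)
        s3.download_file(BUCKET, key, local_gz)
        # decompress .csv.gz to .csv, then discard the downloaded archive
        with gzip.open(local_gz, "rb") as f_in, open(local_gz[:-3], "wb") as f_out:
            shutil.copyfileobj(f_in, f_out)
        os.remove(local_gz)
        processed.add(key)

with open(STATE_FILE, "w") as f:
    json.dump(sorted(processed), f)
Run from the scheduled task mentioned in the question, this keeps its own record of processed keys instead of relying on aws s3 sync.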
I want my users to be able to download many files from an AWS S3 bucket (potentially over a few hundred GB when accumulated) as one large ZIP file. I would download those selected files from S3 first and upload a newly created ZIP file back to S3. This job will be invoked rarely in our service, so I decided to use Lambda for it.
But Lambda has its own limitations: 15 min of execution time, ~500 MB of /tmp storage, etc. I found several workaround solutions on Google that can beat the storage limit (streaming), but found no way around the execution time limit.
Here are what I've found so far:
https://dev.to/lineup-ninja/zip-files-on-s3-with-aws-lambda-and-node-1nm1
Create a zip file on S3 from files on S3 using Lambda Node
Note that programming language is not a concern here.
Could you please give me a suggestion?
I use pyspark to read objects in an S3 bucket on Amazon S3. My bucket is composed of many JSON files, which I read and then save as parquet files with:
df = spark.read.json('s3://my-bucket/directory1/')
df.write.parquet('s3://bucket-with-parquet/', mode='append')
Every day I upload some new files to s3://my-bucket/directory1/ and I would like to append them to s3://bucket-with-parquet/. Is there a way to ensure that I do not process the same data twice? My idea is to tag every file that I read with Spark (I don't know how to do it). I could then use those tags to tell Spark not to read the file again afterwards (don't know how to do that either). If an AWS guru could help me with that I would be very grateful.
There are a couple of things you could do. One is to write a script which reads the LastModified timestamp from the object metadata and gives the list of files added on a given day; you can then process only those files. (https://medium.com/faun/identifying-the-modified-or-newly-added-files-in-s3-11b577774729)
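A minimal sketch of that first approach with boto3, assuming the existing SparkSession named spark from the question and a one-day cutoff (the cutoff and bucket/prefix values are illustrative):
from datetime import datetime, timedelta, timezone

import boto3

s3 = boto3.client("s3")
cutoff = datetime.now(timezone.utc) - timedelta(days=1)  # "new since yesterday"; adjust as needed

new_files = []
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket="my-bucket", Prefix="directory1/"):
    for obj in page.get("Contents", []):
        if obj["LastModified"] >= cutoff:
            new_files.append("s3://my-bucket/" + obj["Key"])

# read only the newly added files and append them to the parquet dataset
# (`spark` is the existing SparkSession from the question)
if new_files:
    df = spark.read.json(new_files)
    df.write.parquet("s3://bucket-with-parquet/", mode="append")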
Second, you can enable versioning on the S3 bucket to make sure that if you overwrite any files you can retrieve the old version. You can also set an ACL for read-only and write-once permission, as mentioned here: Amazon S3 ACL for read-only and write-once access.
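Turning versioning on is a single call with boto3, for example (a sketch using the destination bucket name from the question):
import boto3

s3 = boto3.client("s3")
# keep previous versions of any objects that get overwritten in the parquet bucket
s3.put_bucket_versioning(
    Bucket="bucket-with-parquet",
    VersioningConfiguration={"Status": "Enabled"},
)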
I hope this helps.