Automating folder creation in S3 - amazon-web-services

I have an S3 bucket into which clients drop data files (CSV files) each month. I was wondering whether there is a way to automatically create a new "folder" (object) every time the files are dropped each month and put the newest files into that "folder". I need the CSV files separated by month so that AWS Glue can create new partitions when I run incremental crawlers on this bucket.
For example, let's say I have an S3 bucket called "client". On December 1st, a new CSV file ("DecClientData") will be dropped into that "client" bucket. I want to know if there is a way to automate the following two processes:
Create a "folder" (let's call it "dec") within "client".
Place the "DecClientData" file in the "dec" "folder".
Thanks in advance for any assistance you can provide!

S3 doesn't have the notion of folders commonly found in file systems; instead it has a flat structure. More details can be found in the S3 documentation.
The full path of an object is stored in its Key (filename). For example, an object can be stored in Amazon S3 with a Key of files/2020-12/data.txt regardless of whether the files and 2020-12 "directories" exist. They are not really directories, just prefixes of the Key (the "folders" you create in the console are zero-length placeholder objects).
In your case, to solve both points you mention, you should use S3 event notifications as a Lambda trigger. When the Lambda function is triggered, it receives an event containing the bucket name and the object's Key. S3 cannot rename an object in place, so "changing the Key" means copying the object to the new Key and deleting the original.
For example, an object is uploaded to s3://my_bucket/uploads/file.txt, which creates an event notification that triggers a Lambda function. The function copies the object to s3://my_bucket/files/dec/file.txt and deletes the original. Limit the trigger to the uploads/ prefix so that the copy does not trigger the function again.
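As a rough illustration of that flow, here is a minimal sketch of such a Lambda handler. It assumes the trigger is limited to an uploads/ prefix and that month "folders" are named like 2020-12 (taken from the event time); the files/ destination prefix, the monthly_key helper and the optional s3 parameter (there only so the function can be exercised without AWS) are my own choices, not anything S3 prescribes.

```python
import urllib.parse


def monthly_key(key, event_time, dest_prefix="files/"):
    """Build a destination key such as files/2020-12/file.txt (hypothetical layout)."""
    month = event_time[:7]             # "2020-12-01T09:30:00.000Z" -> "2020-12"
    filename = key.rsplit("/", 1)[-1]  # drop the upload prefix
    return f"{dest_prefix}{month}/{filename}"


def handler(event, context, s3=None):
    """Entry point for ObjectCreated events; returns the new keys."""
    if s3 is None:
        import boto3  # imported lazily so the helper above has no AWS dependency
        s3 = boto3.client("s3")
    moved = []
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded in the event payload.
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])
        new_key = monthly_key(key, record["eventTime"])
        # S3 has no rename: copy to the new key, then delete the original.
        s3.copy_object(Bucket=bucket, Key=new_key,
                       CopySource={"Bucket": bucket, "Key": key})
        s3.delete_object(Bucket=bucket, Key=key)
        moved.append(new_key)
    return moved
```

Note that copy_object handles objects up to 5 GB in a single call; larger files would need a multipart copy instead.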

Write an AWS Lambda function to create a folder in the client bucket and move the most recent .csv file (or files) into the new folder.
Then, configure the client S3 bucket to trigger the AWS Lambda function on new uploads through the event notification settings.

Related

AWS Lambda avoid recursive trigger

I'm downloading data from an API and writing it to a csv file that I store in an S3 bucket. I'm then copying my file from this input bucket into an output bucket with a Lambda function. From the output bucket I'm ingesting it into a MySQL RDS instance with another Lambda function.
The copy-to-another-bucket and upload-to-RDS lambda functions both get triggered when I create a new object in a bucket. Since I'm appending to my csv file, the upload-to-RDS function gets triggered way more than it should and I end up with ~30 rows in my database instead of 6.
I thought by copying the files between S3 buckets I could avoid this, but it doesn't help. Is there any way to only upload the csv file to the database once it has been written and not while it's being updated? Can I delay the trigger maybe?
The only other solution I can think of is to skip the copy-to-another-bucket function altogether and to schedule the upload-to-RDS function.
You need to realize that S3 doesn't support updating an existing file. If you are appending a row to an existing CSV file in S3, then that operation requires uploading the entire contents of the CSV file to S3 again, which S3 sees as a new object.
If you need to store a temporary version of the CSV file in S3 while you are updating it, then you should store it in a separate path, like s3://your_bucket/tmp and then when you have completed your updates, move it to the final path like s3://your_bucket/complete and only configure the Lambda trigger on the /complete path.
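A minimal sketch of that "write to tmp, then publish" step, assuming a boto3 S3 client is passed in. The publish helper and the tmp/ and complete/ prefix names are illustrative only; the Lambda trigger would be configured on the complete/ prefix alone.

```python
def publish(s3, bucket, name, tmp_prefix="tmp/", final_prefix="complete/"):
    """Move a finished file from the temporary prefix to the final prefix."""
    src = f"{tmp_prefix}{name}"
    dst = f"{final_prefix}{name}"
    # The copy creates a single new object under complete/, so the trigger
    # on that prefix fires once, after all updates are done.
    s3.copy_object(Bucket=bucket, Key=dst,
                   CopySource={"Bucket": bucket, "Key": src})
    s3.delete_object(Bucket=bucket, Key=src)
    return dst
```

Calling publish(boto3.client("s3"), "your_bucket", "data.csv") after the last append would then be the only write that the upload-to-RDS function sees.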

Is there any way to set up an S3 bucket to append to the existing object on each run?

We have a requirement to append to the existing S3 object when we run the Spark application every hour. I have tried this code:
df.coalesce(1).write.partitionBy("name").mode("append").option("compression", "gzip").parquet("s3n://path")
This application creates new parquet files on every run, so I am looking for a workaround to meet this requirement.
The question is:
How can we configure the S3 bucket to append to the existing object?
It is not possible to append to objects in Amazon S3. They can be overwritten, but not appended.
There is apparently a sneaky method where a file can be multi-part copied, with the 'source' of the first part set to the existing file and the next part set to some additional data. However, that cannot be accomplished with the method you show.
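For what it's worth, the multi-part copy trick could look roughly like the sketch below, using the boto3 multipart calls. The append_bytes helper is hypothetical. Two assumptions to be aware of: every part except the last must be at least 5 MB, so the existing object has to be at least that large, and a single copied part is limited to 5 GB. It also only concatenates raw bytes, so it would not produce a valid Parquet file.

```python
def append_bytes(s3, bucket, key, extra):
    """Rewrite `key` as (existing object bytes + extra) via a multipart upload."""
    upload_id = s3.create_multipart_upload(Bucket=bucket, Key=key)["UploadId"]
    try:
        # Part 1: server-side copy of the existing object (must be >= 5 MB).
        part1 = s3.upload_part_copy(
            Bucket=bucket, Key=key, UploadId=upload_id, PartNumber=1,
            CopySource={"Bucket": bucket, "Key": key})
        # Part 2: the new data; the last part may be any size.
        part2 = s3.upload_part(
            Bucket=bucket, Key=key, UploadId=upload_id, PartNumber=2, Body=extra)
        parts = [
            {"PartNumber": 1, "ETag": part1["CopyPartResult"]["ETag"]},
            {"PartNumber": 2, "ETag": part2["ETag"]},
        ]
        s3.complete_multipart_upload(
            Bucket=bucket, Key=key, UploadId=upload_id,
            MultipartUpload={"Parts": parts})
        return parts
    except Exception:
        # Abort so the unfinished parts do not linger and accrue storage cost.
        s3.abort_multipart_upload(Bucket=bucket, Key=key, UploadId=upload_id)
        raise
```

Even with this, S3 still writes a whole new object under the same Key; nothing is modified in place.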
If you wish to add additional data to an External Table (eg used by EMR or Athena), then simply add an additional file in the correct folder for the desired partition.

AWS Lambda function getting called repeatedly

I have written a Lambda function which gets invoked automatically when a file comes into my S3 bucket.
I perform certain validations on this file, modify that particular file, and put it back at the same location.
Due to this "put", my lambda is called again and the process goes on till my lambda execution times out.
Is there any way to trigger this lambda only once?
I found an approach where I can store the file name in DynamoDB and can apply a check in lambda function, but can there be any other approach where DynamoDB's use can be avoided?
You have a couple options:
You can put the file in a different location in S3 and delete the original.
You can add a metadata field to the S3 object when you update it, then check for the presence of that field so you know whether you have already processed it. This might not have worked perfectly in the past, because S3 did not always return the most recent data on reads after overwrites; since December 2020 S3 has provided strong read-after-write consistency, which removes that caveat.
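A minimal sketch of the metadata option, assuming a boto3 S3 client. The "processed" metadata key and both helper names are made up for illustration. The function is still invoked a second time by its own put, but it can return immediately once it sees the flag.

```python
PROCESSED_FLAG = "processed"  # hypothetical user-metadata key


def already_processed(s3, bucket, key):
    """Return True if the object carries the processed marker."""
    head = s3.head_object(Bucket=bucket, Key=key)
    return head.get("Metadata", {}).get(PROCESSED_FLAG) == "true"


def put_processed(s3, bucket, key, body):
    """Write the validated file back to the same key, tagged as processed."""
    s3.put_object(Bucket=bucket, Key=key, Body=body,
                  Metadata={PROCESSED_FLAG: "true"})
```

Inside the handler, the first step would be something like: if already_processed(s3, bucket, key), return without doing anything.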
AWS allows different types of S3 event triggers. You can try playing with s3:ObjectCreated:Put vs s3:ObjectCreated:Post.
You can upload your files in a folder, say
s3://bucket-name/notvalidated
and store the validated files in another folder, say
s3://bucket-name/validated.
Update your S3 event notification to invoke your Lambda function whenever there is an ObjectCreated (All) event under the notvalidated/ prefix.
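If you set this up from code rather than the console, the prefix-filtered notification could be built roughly as below. The notification_config helper and the ARN in the usage comment are placeholders; the dictionary shape follows what boto3's put_bucket_notification_configuration expects, as far as I know.

```python
def notification_config(lambda_arn, prefix):
    """Build a notification config that fires only for keys under `prefix`."""
    return {
        "LambdaFunctionConfigurations": [
            {
                "LambdaFunctionArn": lambda_arn,
                "Events": ["s3:ObjectCreated:*"],  # Put, Post, Copy, multipart
                "Filter": {
                    "Key": {
                        "FilterRules": [{"Name": "prefix", "Value": prefix}]
                    }
                },
            }
        ]
    }


# Usage (placeholder ARN; this call replaces the bucket's existing
# notification configuration, so merge with any rules already in place):
#   import boto3
#   boto3.client("s3").put_bucket_notification_configuration(
#       Bucket="bucket-name",
#       NotificationConfiguration=notification_config(
#           "arn:aws:lambda:REGION:ACCOUNT_ID:function:validate",
#           "notvalidated/"))
```

With this in place, writes to validated/ do not match the filter, so the function is not invoked again by its own output.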
The second answer does not seem to be correct (put vs post) - there is not really a concept of update in S3 in terms of POST or PUT. The request to update an object will be the same as the initial POST of the object. See here for details on the available S3 events.
I had this exact problem last year - I was doing an image resize on PUT and every time a file was overwritten, it would be triggered again. My recommended solution would be to have two folders in your S3 bucket - one for the original file and one for the finalized file. You could then create the Lambda trigger with a prefix filter so it only fires for files in the original folder.
S3 events are triggered when an object is created by PUT, POST, COPY, or CompleteMultipartUpload. All of these operations correspond to ObjectCreated events, as per the AWS documentation:
https://docs.aws.amazon.com/AmazonS3/latest/dev/NotificationHowTo.html
The best solution is to restrict your S3 object-created event to a particular location (prefix) in the bucket, so that only changes under that prefix trigger the Lambda function.
You can then write the modified file to some other location that is not configured to trigger the Lambda function when an object is created there.
Hope it helps!

How to change file upload date in Amazon S3 using AWS CLI

I need to move some files (thousands) to Amazon S3 bucket, from where they will be displayed to the end-user by another application (instead of the current one).
The problem is that these files currently have a creation/upload date (dates vary between 2012 and 2017, when they were uploaded to the current application), and when I move them they all end up with the same date. That is a problem because when you look at the files in the new application, you can't see their chronological order, which is sometimes very important.
Is there any way I can modify upload date of a file(s) in S3?
The Last Modification Date is generated by Amazon S3 and cannot be set via the API.
If dates and other information (eg user) are important to your application, you can store it as metadata on the object. Then, retrieve the metadata when displaying dates, user, etc.
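A minimal sketch of the metadata approach, assuming a boto3 S3 client. The "original-date" metadata key and the two helper names are invented for illustration; the displaying application would read that value instead of Last Modified.

```python
from datetime import datetime, timezone


def upload_with_original_date(s3, bucket, key, body, original_date):
    """Upload an object and record its original date as user metadata."""
    s3.put_object(Bucket=bucket, Key=key, Body=body,
                  Metadata={"original-date": original_date.isoformat()})


def read_original_date(s3, bucket, key):
    """Read the stored date back; head_object does not download the body."""
    head = s3.head_object(Bucket=bucket, Key=key)
    return datetime.fromisoformat(head["Metadata"]["original-date"])
```

For example, upload_with_original_date(s3, "my-bucket", "report.pdf", data, datetime(2014, 5, 17, tzinfo=timezone.utc)) keeps the 2014 date available even though S3 stamps the object with the migration time.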
What I did was renaming the file to something else and then renaming it again to its original name.
As you cannot rename directly, you have to copy the file to a new name, and then copy it back to its original name. (and delete the auxiliary file, of course)
It is not optimal, but that's the solution when using the AWS client. I hope one day AWS will have all the functions FTP used to have.
You can just copy over the same object and the timestamp will update.
This technique is also used to postpone the expiration of an object in a bucket with a lifecycle rule.
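A sketch of the copy-over-itself trick, assuming a boto3 S3 client; the touch helper is hypothetical. As far as I know, S3 rejects a copy of an object onto itself unless something about it changes, so this passes MetadataDirective="REPLACE" and re-supplies the existing metadata and content type. Keep in mind this sets the timestamp to the current time; it cannot set a date in the past.

```python
def touch(s3, bucket, key):
    """Refresh Last Modified by copying an object onto itself."""
    head = s3.head_object(Bucket=bucket, Key=key)
    s3.copy_object(
        Bucket=bucket, Key=key,
        CopySource={"Bucket": bucket, "Key": key},
        # REPLACE makes the self-copy legal; pass the old values back
        # so the object's metadata and content type are preserved.
        MetadataDirective="REPLACE",
        Metadata=head.get("Metadata", {}),
        ContentType=head.get("ContentType", "binary/octet-stream"),
    )
```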

S3 bucket script to add timestamp in filename on upload

I'm looking for a way to add a timestamp to the name of every file that is uploaded to an S3 bucket, Amazon-side. There is, of course, an option to do this client-side before the upload, but I don't think this is as nice and clean as having some script run in the bucket itself every time a new file is uploaded. I didn't find anything in the docs, though.
There is no capability within Amazon S3 to change the Key (filename) of a file based upon upload time.
Given that your desire is to avoid name conflicts, some choices are:
Use a unique GUID or a timestamp to name the file when uploading. This will avoid naming conflicts.
Upload the file to Bucket A, then use a Lambda function triggered on ObjectCreated to copy the object to Bucket B with a unique name based on a timestamp.
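The second option could be sketched as follows, assuming a boto3 S3 client. The timestamped_key and copy_with_timestamp helpers and the name-YYYYMMDDTHHMMSSZ naming pattern are just one possible choice. A timestamp with one-second resolution can still collide if the same name is uploaded twice within a second, so a GUID suffix may be safer.

```python
import posixpath
from datetime import datetime, timezone


def timestamped_key(key, when):
    """Insert a UTC timestamp before the extension: a/b.csv -> a/b-<ts>.csv."""
    stem, ext = posixpath.splitext(key)
    return f"{stem}-{when.strftime('%Y%m%dT%H%M%SZ')}{ext}"


def copy_with_timestamp(s3, src_bucket, key, dst_bucket, when=None):
    """Copy an object from Bucket A to Bucket B under a timestamped key."""
    when = when or datetime.now(timezone.utc)
    new_key = timestamped_key(key, when)
    s3.copy_object(Bucket=dst_bucket, Key=new_key,
                   CopySource={"Bucket": src_bucket, "Key": key})
    return new_key
```

Because the copy lands in a different bucket, it does not re-trigger the function that is listening on Bucket A.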
You can try with a lambda function handling the ObjectCreated event. See this tutorial.
Not sure that works though.