Is there any way to set up an S3 bucket to append to the existing object for each run? - amazon-web-services

We have a requirement to append to the existing S3 object when we run the Spark application every hour. I have tried this code:
df.coalesce(1).write.partitionBy("name").mode("append").option("compression", "gzip").parquet("s3n://path")
This application creates new Parquet files for every run, so I am looking for a workaround to achieve this requirement.
The question is:
How can we configure the S3 bucket to append to the existing object?

It is not possible to append to objects in Amazon S3. They can be overwritten, but not appended.
There is apparently a sneaky method where a multi-part copy can be used, with the first part's 'source' set to the existing object and a further part set to some additional data. However, that cannot be accomplished with the method you show.
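A minimal sketch of that multi-part copy trick using boto3 is shown below. The bucket and key names are placeholders, and note that every part except the last must be at least 5 MB, which makes it impractical for small hourly appends:

import boto3

s3 = boto3.client("s3")
bucket = "my-bucket"                      # placeholder bucket
key = "path/existing-object"              # object to "append" to
new_data = b"...additional bytes..."      # data to add at the end

# Start a multipart upload that will replace the existing object
mpu = s3.create_multipart_upload(Bucket=bucket, Key=key)

# Part 1: server-side copy of the existing object (must be >= 5 MB unless it is the last part)
part1 = s3.upload_part_copy(
    Bucket=bucket, Key=key, UploadId=mpu["UploadId"], PartNumber=1,
    CopySource={"Bucket": bucket, "Key": key},
)

# Part 2: the new bytes to append
part2 = s3.upload_part(
    Bucket=bucket, Key=key, UploadId=mpu["UploadId"], PartNumber=2, Body=new_data,
)

# Complete the upload; the result is the old content followed by the new data
s3.complete_multipart_upload(
    Bucket=bucket, Key=key, UploadId=mpu["UploadId"],
    MultipartUpload={"Parts": [
        {"PartNumber": 1, "ETag": part1["CopyPartResult"]["ETag"]},
        {"PartNumber": 2, "ETag": part2["ETag"]},
    ]},
)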
If you wish to add additional data to an External Table (e.g. one used by EMR or Athena), then simply add an additional file in the correct folder for the desired partition.

Related

Update data in a CSV table that is stored in an AWS S3 bucket

I need a solution for adding new data to a CSV file that is stored in an S3 bucket in AWS.
At the moment we download the file, edit it, and then upload it again to S3, and we would like to automate this process.
We need to add one row to a three-column CSV.
Thank you in advance!
I think you will be able to do that using Lambda functions. You will need to programmatically make the modifications you need to the CSV, but there are multiple programming languages that allow you to do that. One quick example is using Python and the csv library.
Then you can invoke that Lambda, or add more logic to the operations you want to perform, through an Amazon API Gateway.
You can access the CSV file (object) inside the S3 bucket from the Lambda code using the AWS SDK and append the new rows with data you pass as parameters to the function.
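A minimal sketch of such a Lambda handler with boto3, assuming hypothetical bucket/key names and an event payload like {"row": ["a", "b", "c"]}:

import csv
import io

import boto3

s3 = boto3.client("s3")

def lambda_handler(event, context):
    # Placeholder location of the CSV file
    bucket = "my-data-bucket"
    key = "reports/data.csv"

    # Download the current CSV (S3 objects cannot be edited in place)
    obj = s3.get_object(Bucket=bucket, Key=key)
    body = obj["Body"].read().decode("utf-8")

    # Append the new row using the csv module
    buffer = io.StringIO()
    buffer.write(body)
    if body and not body.endswith("\n"):
        buffer.write("\n")
    csv.writer(buffer).writerow(event["row"])

    # Upload the full, modified file back to S3
    s3.put_object(Bucket=bucket, Key=key, Body=buffer.getvalue().encode("utf-8"))
    return {"statusCode": 200}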
There is no way to directly modify the csv stored in S3 (if that is what you're asking). The process will always entail some version of download, modify, upload. There are many examples of how you can do this, for example here

Copy ~200,000 S3 files to new prefixes

I have ~200,000 S3 files that I need to partition, and I have made an Athena query to produce a target S3 key for each of the original S3 keys. I can clearly create a script out of this, but how do I make the process robust/reliable?
I need to partition CSV files using info inside each CSV so that each file is moved to a new prefix in the same bucket. The files are mapped 1-to-1, but the new prefix depends on the data inside the file.
The copy command for each would be something like:
aws s3 cp s3://bucket/top_prefix/file.csv s3://bucket/top_prefix/var1=X/var2=Y/file.csv
I can make a single big script to copy everything using Athena and a bit of SQL, but I am concerned about doing this reliably, so that I can be sure all files are copied across and the script doesn't fail, time out, etc. Should I "just run the script"? From my machine, or is it better to run it on an EC2 instance first? Those are the kinds of questions I have.
This is a one-off, as the application code producing the files in s3 will start outputting directly to partitions.
If each file contains data for only one partition, then you can simply move the files as you have shown. This is quite efficient because the contents of the files do not need to be processed.
If, however, lines within the files each belong to different partitions, then you can use Amazon Athena to 'select' lines from an input table and output the lines to a destination table that resides in a different path, with partitioning configured. However, Athena does not "move" the files -- it simply reads them and then stores the output. If you were to do this for new data each time, you would need to use an INSERT statement to copy the new data into an existing output table, then delete the input files from S3.
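If you did go the Athena route, the INSERT would be a plain SQL statement against a partitioned output table. A rough sketch using boto3's start_query_execution, with hypothetical table, database and output-location names:

import boto3

athena = boto3.client("athena")

# Hypothetical names: source_table reads the unpartitioned input files,
# target_table is partitioned by var1/var2 and stored under a different path
sql = """
INSERT INTO target_table
SELECT col_a, col_b, var1, var2
FROM source_table
"""

athena.start_query_execution(
    QueryString=sql,
    QueryExecutionContext={"Database": "my_database"},
    ResultConfiguration={"OutputLocation": "s3://bucket/athena-query-results/"},
)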
Since it is one-off, and each file belongs in only one partition, I would recommend you simply "run the script". It will go slightly faster from an EC2 instance, but the data is not uploaded/downloaded -- it all stays within S3.
I often create an Excel spreadsheet with a list of input locations and output locations. I create a formula to build the aws s3 cp <input> <output_path> commands, copy them to a text file and execute it as a batch. Works fine!
You mention that the destination depends on the data inside the object, so it would probably work well as a Python script that would loop through each object, 'peek' inside the object to see where it belongs, then issue a copy_object() command to send it to the right destination. (smart-open · PyPI is a great library for reading from an S3 object without having to download it first.)
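A rough sketch of that script, assuming the CSV files carry the partition values in columns named var1 and var2 (adjust the names to your data):

import csv

import boto3
from smart_open import open as s3_open   # reads S3 objects without a full download

s3 = boto3.client("s3")
bucket = "bucket"              # placeholder names matching the example above
top_prefix = "top_prefix/"

paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=bucket, Prefix=top_prefix):
    for obj in page.get("Contents", []):
        key = obj["Key"]
        if not key.endswith(".csv"):
            continue

        # 'Peek' at the first data row to decide which partition the file belongs to
        with s3_open(f"s3://{bucket}/{key}") as f:
            first_row = next(csv.DictReader(f))
        var1, var2 = first_row["var1"], first_row["var2"]   # assumed column names

        # Server-side copy to the partitioned prefix; the copy itself never leaves S3
        filename = key.rsplit("/", 1)[-1]
        new_key = f"{top_prefix}var1={var1}/var2={var2}/{filename}"
        s3.copy_object(Bucket=bucket, CopySource={"Bucket": bucket, "Key": key}, Key=new_key)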

AWS Lambda avoid recursive trigger

I'm downloading data from an API and writing it to a csv file that I store in an S3 bucket. I'm then copying my file from this input bucket into an output bucket with a Lambda function. From the output bucket I'm ingesting it into a MySQL RDS instance with another Lambda function.
The copy-to-another-bucket and upload-to-RDS lambda functions both get triggered when I create a new object in a bucket. Since I'm appending to my csv file, the upload-to-RDS function gets triggered way more than it should and I end up with ~30 rows in my database instead of 6.
I thought by copying the files between S3 buckets I could avoid this, but it doesn't help. Is there any way to only upload the csv file to the database once it has been written and not while it's being updated? Can I delay the trigger maybe?
The only other solution I can think of is to skip the copy-to-another-bucket function altogether and to schedule the upload-to-RDS function.
You need to realize that S3 doesn't support updating an existing file. If you are appending a row to an existing CSV file in S3, then that operation requires uploading the entire contents of the CSV file to S3 again, which S3 sees as a new object.
If you need to store a temporary version of the CSV file in S3 while you are updating it, then you should store it in a separate path, like s3://your_bucket/tmp, and then, when you have completed your updates, move it to the final path, like s3://your_bucket/complete, and configure the Lambda trigger only on the /complete path.
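For the trigger itself, a sketch of a prefix-filtered event notification set via boto3 (the bucket name and Lambda ARN are placeholders; the same filter can be configured in the S3 console under Event notifications):

import boto3

s3 = boto3.client("s3")

# Only objects created under the complete/ prefix will invoke the upload-to-RDS function
s3.put_bucket_notification_configuration(
    Bucket="your_bucket",
    NotificationConfiguration={
        "LambdaFunctionConfigurations": [
            {
                "LambdaFunctionArn": "arn:aws:lambda:us-east-1:123456789012:function:upload-to-rds",
                "Events": ["s3:ObjectCreated:*"],
                "Filter": {
                    "Key": {"FilterRules": [{"Name": "prefix", "Value": "complete/"}]}
                },
            }
        ]
    },
)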

Automating folder creation in S3

I have an S3 bucket into which clients drop data files (CSV files) each month. I was wondering if there was a way to automatically create a new "folder" (object) each time the files are dropped and put the newest files into that "folder". I need the CSV files separated by month so that AWS Glue can create new partitions when I run incremental crawlers on this bucket.
For example, let's say I have a S3 bucket called "client." On December 1st, a new CSV file ("DecClientData") will be dropped into that "client" bucket. I want to know if there is a way to automate the following two processes:
Create a "folder" (let's call it "dec") within "client".
Place the "DecClientData" file in the "dec" "folder".
Thanks in advance for any assistance you can provide!
S3 doesn't have the notion of folders commonly found in file systems; instead it has a flat structure. More details can be found here.
Instead, the full path of an object is stored in its Key (filename). For example, an object can be stored in Amazon S3 with a Key of files/2020-12/data.txt regardless of the existence of files and 2020-12 directories (they are not really directories but zero-length objects).
In your case, to solve both points you mention, you should leverage S3 event notifications and use them as a Lambda trigger. When the Lambda function is triggered, it is passed the name of the object (Key) as an argument; at that point you can simply change its Key.
For example, an object is uploaded to s3://my_bucket/uploads/file.txt; this creates an event notification that triggers a Lambda function. The function gets the object and re-uploads it to s3://my_bucket/files/dec/file.txt (and deletes the original one).
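A minimal sketch of such a Lambda function with boto3, assuming the month is taken from the upload timestamp (it could just as easily be parsed from the file name or its contents):

import calendar
import urllib.parse

import boto3

s3 = boto3.client("s3")

def lambda_handler(event, context):
    # Triggered by an s3:ObjectCreated:* notification, e.g. on the uploads/ prefix
    record = event["Records"][0]
    bucket = record["s3"]["bucket"]["name"]
    key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])

    # Build the month "folder" from the event time, e.g. "dec" for December
    month = calendar.month_abbr[int(record["eventTime"][5:7])].lower()
    filename = key.rsplit("/", 1)[-1]
    new_key = f"files/{month}/{filename}"

    # Server-side copy to the new prefix, then delete the original upload
    s3.copy_object(Bucket=bucket, CopySource={"Bucket": bucket, "Key": key}, Key=new_key)
    s3.delete_object(Bucket=bucket, Key=key)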
Write an AWS Lambda function to create a folder in the client bucket and move the most recent .csv file (or files) into the new folder.
Then, configure the client S3 bucket to trigger the AWS Lambda function on new uploads through the event notification settings.

How to change file upload date in Amazon S3 using AWS CLI

I need to move some files (thousands) to an Amazon S3 bucket, from where they will be displayed to the end user by another application (instead of the current one).
The problem is that these files currently have a creation/upload date (dates vary between 2012 and 2017, when they were uploaded to the current application), and when I move them they all end up with the same date. That is a problem, because when you look at the files in the new application you can no longer see the time hierarchy, which is sometimes very important.
Is there any way I can modify the upload date of a file (or files) in S3?
The Last Modification Date is generated by Amazon S3 and cannot be set via the API.
If dates and other information (e.g. user) are important to your application, you can store them as metadata on the object. Then retrieve the metadata when displaying dates, user, etc.
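A sketch of that with boto3, assuming the original dates are known at migration time (the metadata keys and values are illustrative):

import boto3

s3 = boto3.client("s3")

# Attach the original date/user as user-defined metadata while uploading
s3.upload_file(
    "local/file.pdf", "my-bucket", "docs/file.pdf",
    ExtraArgs={"Metadata": {"original-upload-date": "2014-06-03T10:15:00Z",
                            "original-user": "jsmith"}},
)

# Later, read the metadata back when displaying the file
head = s3.head_object(Bucket="my-bucket", Key="docs/file.pdf")
print(head["Metadata"]["original-upload-date"])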
What I did was rename the file to something else and then rename it back to its original name.
As you cannot rename directly, you have to copy the file to a new name, and then copy it back to its original name (and delete the auxiliary file, of course).
It is not optimal, but that's the solution when using the AWS client. I hope one day AWS will have all the functions that FTP used to have.
You can just copy the object over itself and the timestamp will update.
This technique is also used to prolong the expiration of an object in a bucket with a lifecycle rule.
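With boto3, the in-place copy looks like the sketch below; S3 insists that something change on a copy-to-self, so the metadata directive is set to REPLACE (names are placeholders):

import boto3

s3 = boto3.client("s3")

# Copy the object onto itself; Last-Modified is updated to "now" as a side effect
s3.copy_object(
    Bucket="my-bucket",
    Key="docs/file.pdf",
    CopySource={"Bucket": "my-bucket", "Key": "docs/file.pdf"},
    MetadataDirective="REPLACE",
)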