AWS Lambda - Multiple objects uploaded to the same folder at the same time - amazon-web-services

I am using a script that performs a binary merge on the files within a folder in AWS S3 whenever a new file gets uploaded. I have configured a Lambda trigger on the bucket for Object PUT events so that the new file gets merged with the existing files in the same location. I have a scenario where I upload multiple files to the same folder at the same time, and I am trying to understand how Lambda handles the merge in that case. Can anyone help me understand whether Lambda drops some of the events and invokes the script only once, or whether every file creation generates its own event so that Lambda invokes the script for each file without dropping any?
Thanks

Related

How to get an email notification when there are files in the unprocessed folder in GCP

I am a beginner with GCP. I want to have two folders, processed and unprocessed,
in a Cloud Storage bucket. Whenever a file arrives in the bucket from any source, a Cloud Function is triggered; if the file is successfully inserted into the target (such as BigQuery), the file goes into the processed folder, otherwise into the unprocessed folder.
I want to know how I can get alerts when files land in the unprocessed (error) folder.
Do I have to write code, another Cloud Function, or something else to get these alerts?
Any help will be appreciated.
Thank you
As you mentioned, using Cloud Functions is the right approach.
A simple function is all that is required; it then needs to be deployed with the proper trigger associated with the bucket.
More details, with examples can be found here:
https://cloud.google.com/functions/docs/calling/storage
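For illustration, here is a minimal sketch of such a function, assuming a 1st-gen background Cloud Function wired to the google.storage.object.finalize trigger on the bucket and an unprocessed/ prefix (both are assumptions, not taken from the question):

    import logging

    # Minimal sketch, assuming a 1st-gen background Cloud Function deployed with
    # the google.storage.object.finalize trigger. The "unprocessed/" prefix is an
    # assumption; adjust it to the folder name you actually use.
    def notify_unprocessed(event, context):
        object_name = event.get("name", "")
        bucket = event.get("bucket", "")

        if object_name.startswith("unprocessed/"):
            # Emit a log entry that a log-based alerting policy in Cloud
            # Monitoring can match and route to an email notification channel.
            logging.error(
                "File %s landed in the unprocessed folder of bucket %s",
                object_name, bucket,
            )

Alternatively, the function could publish to a Pub/Sub topic wired to an email notification; the logging route simply avoids extra infrastructure.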

AWS Lambda avoid recursive trigger

I'm downloading data from an API and writing it to a csv file that I store in an S3 bucket. I'm then copying my file from this input bucket into an output bucket with a Lambda function. From the output bucket I'm ingesting it into a MySQL RDS instance with another Lambda function.
The copy-to-another-bucket and upload-to-RDS lambda functions both get triggered when I create a new object in a bucket. Since I'm appending to my csv file, the upload-to-RDS function gets triggered way more than it should and I end up with ~30 rows in my database instead of 6.
I thought by copying the files between S3 buckets I could avoid this, but it doesn't help. Is there any way to only upload the csv file to the database once it has been written and not while it's being updated? Can I delay the trigger maybe?
The only other solution I can think of is to skip the copy-to-another-bucket function altogether and to schedule the upload-to-RDS function.
You need to realize that S3 doesn't support updating an existing object. If you are appending a row to an existing CSV file in S3, that operation requires uploading the entire contents of the CSV file to S3 again, which S3 sees as a new object.
If you need to store a temporary version of the CSV file in S3 while you are updating it, store it under a separate path, like s3://your_bucket/tmp, and then, once you have completed your updates, move it to the final path, like s3://your_bucket/complete, and configure the Lambda trigger only on the /complete path.
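As a rough sketch of that final "move" step using boto3 (the bucket name and keys below are placeholders, not from the question):

    import boto3

    s3 = boto3.client("s3")

    BUCKET = "your_bucket"          # placeholder bucket name
    SRC_KEY = "tmp/data.csv"        # object rewritten while you keep appending
    DST_KEY = "complete/data.csv"   # prefix the upload-to-RDS trigger filters on

    # Once the CSV is fully written, "move" it by copying then deleting the source.
    s3.copy_object(
        Bucket=BUCKET,
        Key=DST_KEY,
        CopySource={"Bucket": BUCKET, "Key": SRC_KEY},
    )
    s3.delete_object(Bucket=BUCKET, Key=SRC_KEY)

With the S3 event notification filtered on the complete/ prefix, only this final copy fires the upload-to-RDS function, not every intermediate append.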

Automating folder creation in S3

I have an S3 bucket into which clients drop data files (CSV files) each month. I was wondering if there was a way to automatically create a new "folder" (object) every time the files are dropped each month and put the newest files into that "folder". I need the CSV files separated by month so that AWS Glue can create new partitions when I run incremental crawlers on this bucket.
For example, let's say I have an S3 bucket called "client". On December 1st, a new CSV file ("DecClientData") will be dropped into that "client" bucket. I want to know if there is a way to automate the following two processes:
Create a "folder" (let's call it "dec") within "client".
Place the "DecClientData" file in the "dec" "folder".
Thanks in advance for any assistance you can provide!
S3 doesn't have the notion of folders commonly found in file systems; instead it has a flat structure (more details can be found here).
The full path of an object is stored in its Key (object name). For example, an object can be stored in Amazon S3 with a Key of files/2020-12/data.txt regardless of whether files and 2020-12 "directories" exist (they are not really directories; at most they are zero-length placeholder objects).
In your case, to solve both points you mention, you should leverage S3 event notifications and use them as a Lambda trigger. When the Lambda function is triggered, it is passed the name of the object (Key) as part of the event; at that point you can simply change its Key.
For example: an object is uploaded to s3://my_bucket/uploads/file.txt, which creates an event notification that triggers the Lambda function. The function then copies the object to s3://my_bucket/files/dec/file.txt (and deletes the original).
Write an AWS Lambda function to create the "folder" in the client bucket and move the most recent .csv file (or files) into it, as sketched below.
Then, configure the client S3 bucket to trigger the AWS Lambda function on new uploads through its event notification settings.
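A minimal sketch of such a function (the month-prefix naming and the assumption that client uploads land at the bucket root are illustrative choices, not part of the original answer):

    import boto3
    from datetime import datetime, timezone
    from urllib.parse import unquote_plus

    s3 = boto3.client("s3")

    def lambda_handler(event, context):
        for record in event["Records"]:
            bucket = record["s3"]["bucket"]["name"]
            key = unquote_plus(record["s3"]["object"]["key"])

            # Skip objects that already sit under a month prefix, otherwise
            # the copy below would re-trigger this function recursively.
            if not key.endswith(".csv") or "/" in key:
                continue

            # e.g. "dec/DecClientData.csv" -- the month prefix is the "folder"
            # that the Glue crawler can later treat as a partition.
            month = datetime.now(timezone.utc).strftime("%b").lower()
            new_key = f"{month}/{key}"

            s3.copy_object(
                Bucket=bucket,
                Key=new_key,
                CopySource={"Bucket": bucket, "Key": key},
            )
            s3.delete_object(Bucket=bucket, Key=key)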

Putting a TWS file dependency on an AWS S3 stored file

I have an ETL application which is supposed to migrate to AWS infrastructure. The scheduler used in my application is Tivoli Workload Scheduler, and we want to keep using it on the cloud as well, where it has file dependencies.
Now when we move to AWS, the files to be watched will land in an S3 bucket. Can we put an OPEN dependency on files in S3? If yes, what would be the hostname (HOST#Filepath)?
If not, what services could be combined to serve this purpose? I have both time and file dependencies in my SCHEDULES.
E.g., a file might get uploaded to S3 at 1 AM. At 3 AM my schedule gets triggered and looks for the file in the S3 bucket. If present, execution starts; if not, it should wait as per the other parameters on TWS.
Any help or advice would be nice to have.
If I understand this correctly, the job triggered at 3 AM should identify all files uploaded within the last, say, 24 hours.
You can list the objects in the S3 bucket to find everything uploaded within a specific period of time.
A better solution would be to create an S3 upload trigger that sends a message to SQS, and have your code inspect the queue depth (number of messages) there and process the files one by one. An additional benefit is the assurance that all items are processed, without having to worry about overlapping time windows.
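As an illustration of the simpler listing approach, here is a rough boto3 sketch (the bucket name, prefix, and 24-hour window are placeholders):

    import boto3
    from datetime import datetime, timedelta, timezone

    s3 = boto3.client("s3")

    BUCKET = "landing-bucket"   # placeholder bucket name
    PREFIX = "incoming/"        # placeholder prefix the schedule watches
    cutoff = datetime.now(timezone.utc) - timedelta(hours=24)

    # Collect every object uploaded within the time window.
    recent = []
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=BUCKET, Prefix=PREFIX):
        for obj in page.get("Contents", []):
            if obj["LastModified"] >= cutoff:
                recent.append(obj["Key"])

    if recent:
        print(f"{len(recent)} file(s) found; start processing")
    else:
        print("No new files; let the schedule wait per its other parameters")

The SQS route works the same way conceptually, except the queue depth (ApproximateNumberOfMessages) tells you how many files are pending instead of a listing.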

Upload/Download LARGE files to/from Lambda function using API Gateway without making any use of S3 Bucket

I'm implementing a serverless API, using:
API Gateway
Lambda
S3 bucket (if needed)
My flow is to :
Call a POST or PUT method with a binary zip file, uploading it to Lambda.
In Lambda: unzip the file.
In Lambda: Run a determined script over the extracted files.
In Lambda: Generate a new zip.
Return it to my desktop.
This flow is already implemented and it's working well with small files, 10MB for uploading and 6MB for downloading.
But I'm running into issues when dealing with large files, as will be the case on many occasions. To solve this, I'm considering the following flow:
Target file gets uploaded to the S3 bucket.
A new event is generated and Lambda gets triggered.
Lambda Internal Tasks:
3.1 Lambda downloads the file from the S3 bucket.
3.2 Lambda generates the corresponding WPK package.
3.3 Lambda uploads the generated WPK package to S3.
3.4 Lambda returns a signed URL for the uploaded file as the response.
But my problem with this design is that it requires more than one request to complete. I want to do the whole process in a single request, passing the target zip file in it and getting the new zip back as the response.
Any Ideas, please?
My Components and Flow Diagram would be:
Component and Flow Diagram
There are a couple of things you could do if you'd like to unzip large files while keeping a serverless approach:
Use Node.js to stream the zip file, unzipping it in a pipe and putting the content into a write stream piped back to S3.
Or deploy your code as an AWS Glue job.
Upload the file to S3; AWS Lambda gets triggered and passes the file name as the key to the Glue job, and the rest is done there.
This way you have a serverless approach and code that does not cause memory issues while unzipping large files.
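For the Glue hand-off in particular, the triggering Lambda could look roughly like this (the Glue job name and argument names are placeholders, not from the answer):

    import boto3
    from urllib.parse import unquote_plus

    glue = boto3.client("glue")

    GLUE_JOB_NAME = "unzip-and-package-job"   # placeholder job name

    def lambda_handler(event, context):
        # Fired by the S3 upload event; hand the object key to the Glue job,
        # which does the heavy unzip/repackage work outside Lambda's
        # payload and memory limits.
        record = event["Records"][0]
        bucket = record["s3"]["bucket"]["name"]
        key = unquote_plus(record["s3"]["object"]["key"])

        glue.start_job_run(
            JobName=GLUE_JOB_NAME,
            Arguments={
                "--source_bucket": bucket,   # custom job arguments; names assumed
                "--source_key": key,
            },
        )

Note that the client still needs a second request (or a presigned-URL download) to fetch the result, so this keeps the two-step flow described in the question rather than collapsing it into a single request.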