I am running a continuous code deployment with Jenkins that will automatically compile and upload binaries to S3 in parallel for multiple targets.
The final step in my deployment mechanism is to detect that all the binaries for a particular build have been uploaded, and then deploy them together.
S3 has event notifications that can trigger when objects have been pushed, but do they have anything more sophisticated that can trigger when multiple objects have been pushed?
Example:
Build machine on Windows uploads binary to S3.
Build machine on OS X uploads binary to S3.
S3 detects that both binaries are now uploaded and triggers an event.
Build machine takes both binaries and releases them together.
Right now the only solution I can think of is to set up AWS Lambda and have the event handler manually check for the existence of the other binary, which may not even be reliable if S3 has race conditions around object visibility.
Any ideas?
The short answer is no. There is no mechanism that would let you trigger an action only after all of the objects have been uploaded. There is no conditional notification, just simple per-object events.
But you can build it yourself. Create a DynamoDB table for build records, and insert a row there when a build starts on any build machine, before you upload any files. For each build target, create a separate attribute on that row. Have S3 publish its notification to a Lambda, and have that Lambda look up and update the build's row; once all the attributes are in the desired state, the same Lambda can perform the release.
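A minimal sketch of that flow. The table name ("builds"), key layout, object-key convention, and target set are all assumptions for illustration, not anything S3 or DynamoDB prescribes:

```python
# Coordinating Lambda: assumes a "builds" table with partition key "build_id",
# one boolean attribute per build target, and object keys shaped like
# "<build_id>/<target>/app.bin".

EXPECTED_TARGETS = {"windows", "osx"}

def parse_key(key):
    """Extract (build_id, target) from an S3 object key."""
    build_id, target, _ = key.split("/", 2)
    return build_id, target

def all_targets_done(item, expected=EXPECTED_TARGETS):
    """True once the row has an attribute for every expected target."""
    return expected.issubset(item.keys())

def handler(event, context):
    import boto3  # imported lazily; the helpers above need no AWS SDK
    table = boto3.resource("dynamodb").Table("builds")
    for record in event["Records"]:
        build_id, target = parse_key(record["s3"]["object"]["key"])
        # Atomically mark this target as uploaded and read back the full row.
        item = table.update_item(
            Key={"build_id": build_id},
            UpdateExpression="SET #t = :done",
            ExpressionAttributeNames={"#t": target},
            ExpressionAttributeValues={":done": True},
            ReturnValues="ALL_NEW",
        )["Attributes"]
        if all_targets_done(item):
            release(build_id)

def release(build_id):
    print(f"releasing build {build_id}")  # your release step goes here
```

One caveat: two near-simultaneous uploads could each read back a complete row, so in practice you would guard release() with a conditional update (e.g. on a "released" attribute) so only one invocation wins.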
Amazon S3 is a "base" system upon which many things can be built (e.g. Dropbox!). As such, the functionality of Amazon S3 itself is limited (but very scalable and reliable).
Thus, you'll have to build your own logic on top of Amazon S3 to implement your desired solution.
One option would be to trigger an AWS Lambda function when an object is created. This Lambda function could then implement any logic you desire, such as your step #3.
Related
Required procedure:
Someone does an upload to an S3 bucket.
This triggers a Lambda function that does some processing on the uploaded file(s).
Processed objects are now copied into a "processed" folder within the same bucket.
The copy-operation in Step 3 should never re-trigger the initial Lambda function itself.
I know that the general guidance is to use a different bucket for storing the processed objects in a situation like this (but this is not possible in this case).
So my approach was to set up the S3 trigger to listen only to the PUT and POST methods, excluding the COPY method. The Lambda function itself uses boto3 (S3_CLIENT.copy_object(..)). The approach seems to work: the Lambda function does not appear to be re-triggered by the copy operation.
However I wanted to ask if this approach is really reliable - is it?
You can filter which events trigger the S3 notification.
There are 2 ways to trigger lambda from S3 event in general: bucket notifications and EventBridge.
Notifications: https://docs.aws.amazon.com/AmazonS3/latest/userguide/notification-how-to-filtering.html
EB: https://aws.amazon.com/blogs/aws/new-use-amazon-s3-event-notifications-with-amazon-eventbridge/
In your case, a quick search doesn't show that you can set up a "negative" rule ("everything which doesn't have the processed prefix"). But you can rework your bucket structure a bit: dump unprocessed items into an unprocessed/ prefix and set up the filter based on that prefix only.
When setting up an S3 trigger for a Lambda function, you can define which specific S3 event types should be listened to (for example, s3:ObjectCreated:Put versus s3:ObjectCreated:Copy).
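A hedged sketch of such a filter; the prefix, bucket, and function ARN are placeholders. Listing only the Put and Post event types leaves Copy out, which is exactly the behavior the question relies on:

```python
# Build an S3 bucket notification configuration that fires the Lambda only on
# PUT and POST uploads (never on COPY), scoped to an assumed "unprocessed/"
# prefix.

def build_notification_config(lambda_arn, prefix="unprocessed/"):
    return {
        "LambdaFunctionConfigurations": [{
            "LambdaFunctionArn": lambda_arn,
            # Listing Put/Post explicitly excludes s3:ObjectCreated:Copy.
            "Events": ["s3:ObjectCreated:Put", "s3:ObjectCreated:Post"],
            "Filter": {"Key": {"FilterRules": [
                {"Name": "prefix", "Value": prefix},
            ]}},
        }]
    }

def apply_config(bucket, lambda_arn):
    import boto3  # lazy import: building the config above is AWS-free
    boto3.client("s3").put_bucket_notification_configuration(
        Bucket=bucket,
        NotificationConfiguration=build_notification_config(lambda_arn),
    )
```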
I have an S3 bucket with different files. I need to read those files and publish SQS msg for each row in the file.
I cannot use S3 events as the files need to be processed with a delay - put to SQS after a month.
I can write a scheduler to do this task (read and publish), but can I use AWS for this purpose? AWS Batch, AWS Data Pipeline, or Lambda?
I need to pass the date(filename) of the data to be read and published.
Edit: The data volume to be dealt with is huge.
I can think of two ways to do this entirely using AWS serverless offerings without even having to write a scheduler.
You could use S3 events to start a Step Function that waits for a month before reading the S3 file and sending messages through SQS.
With a little more work, you could use S3 events to trigger a Lambda function which writes the messages to DynamoDB with a TTL of one month in the future. When the TTL expires, you can have another Lambda that listens to the DynamoDB streams, and when there’s a delete event, it publishes the message to SQS. (A good introduction to this general strategy can be found here.)
While the second strategy might require more effort, you might find it less expensive than using Step Functions depending on the overall message throughput and whether or not the S3 uploads occur in bursts or in a smooth distribution.
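A rough sketch of the second strategy. It assumes a table with TTL enabled on an expire_at attribute, the message body stored in a body attribute, a stream configured with the OLD_IMAGE view, and the queue URL as a placeholder. (Also note TTL deletions can lag expiry by up to about 48 hours, so this gives "roughly a month", not an exact delay.)

```python
import time

QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/delayed"  # assumed

def ttl_one_month_from(now=None):
    """Epoch-seconds TTL roughly one month (30 days) in the future."""
    return int(now if now is not None else time.time()) + 30 * 24 * 3600

def expired_bodies(event):
    """Pull message bodies out of REMOVE records (the TTL deletions)."""
    return [
        r["dynamodb"]["OldImage"]["body"]["S"]
        for r in event["Records"]
        if r["eventName"] == "REMOVE"
    ]

def handler(event, context):
    import boto3  # lazy import so the helpers above run without the AWS SDK
    sqs = boto3.client("sqs")
    for body in expired_bodies(event):
        sqs.send_message(QueueUrl=QUEUE_URL, MessageBody=body)
```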
At the core, you need to do two things:
Enumerate all of the objects in an S3 bucket, and perform some action on any object uploaded more than a month ago.
Can you use Lambda or Batch to do this? Sure. A Lambda could be set to trigger once a day, enumerate the files, and post the results to SQS.
Should you? No clue. A lot depends on your scale, and what you plan to do if it takes a long time to perform this work. If your S3 bucket has hundreds of objects, it won't be a problem. If it has billions, your Lambda will need to be able to handle being interrupted and continue paging through files from a previous run.
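A sketch of that daily-scan Lambda; bucket and queue names are placeholders. It uses the list_objects_v2 paginator to page past the 1,000-key-per-response limit, but resuming after an interruption (e.g. persisting the ContinuationToken) is left out:

```python
from datetime import datetime, timedelta, timezone

def older_than(objects, days=30, now=None):
    """Filter S3 listing entries whose LastModified is before the cutoff."""
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=days)
    return [o for o in objects if o["LastModified"] < cutoff]

def handler(event, context):
    import boto3  # lazy import keeps the filter above testable without AWS
    s3 = boto3.client("s3")
    sqs = boto3.client("sqs")
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket="my-bucket"):
        for obj in older_than(page.get("Contents", [])):
            sqs.send_message(
                QueueUrl="https://sqs.us-east-1.amazonaws.com/123456789012/old-files",
                MessageBody=obj["Key"],
            )
```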
Alternatively, you could use S3 events to trigger a simple Lambda that adds a row to a database. Then, again, some Lambda could run on a cron job that asks the database for old rows, and publishes that set to SQS for others to consume. That's slightly cleaner, maybe, and can handle scaling up to pretty big bucket sizes.
Or, you could do the paging through files, deciding what to do, and processing old files all on a t2.micro if you just need to do some simple work to a few dozen files every day.
It all depends on your workload and needs.
I was wondering if there are any AWS services or projects that would let us configure a data pipeline of AWS Lambdas in code. I am looking for something like the following; assume there is a library called pipeline.
from pipeline import connect, s3, lambda, deploy
p = connect(s3('input-bucket/prefix'),
            lambda(myPythonFunc, dependencies=[list_of_dependencies]),
            s3('output-bucket/prefix'))
deploy(p)
There can be many variations of this idea, of course. This use case assumes only one S3 bucket, for example; there could be a list of input S3 buckets.
Can this be done by AWS Data Pipeline? The documentation I have (quickly) read says that Lambda is used to trigger a pipeline.
I think the closest thing available is the state machine functionality in the newly released AWS Step Functions. With these you can coordinate multiple steps that transform your data. I don't believe they support standard event sources directly, so you would have to create a standard Lambda function (potentially using the Serverless Application Model) that reads from S3 and triggers your state machine.
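A sketch of that glue Lambda; the state machine ARN and the input document's shape are assumptions:

```python
# S3-triggered Lambda that starts one Step Functions execution per new object.
import json

STATE_MACHINE_ARN = "arn:aws:states:us-east-1:123456789012:stateMachine:etl"  # assumed

def execution_input(record):
    """Turn one S3 event record into the state machine's input document."""
    return json.dumps({
        "bucket": record["s3"]["bucket"]["name"],
        "key": record["s3"]["object"]["key"],
    })

def handler(event, context):
    import boto3  # lazy import; execution_input above is AWS-free
    sfn = boto3.client("stepfunctions")
    for record in event["Records"]:
        sfn.start_execution(stateMachineArn=STATE_MACHINE_ARN,
                            input=execution_input(record))
```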
I'm new to AWS and have a feasibility question for a file management system I'm trying to build. I would like to set up a system where people will use the Amazon S3 browser and drop either a CSV or Excel file into their specific bucket. Then I would like to automate the process of taking that CSV/Excel file and inserting it into a table within RDS. This assumes that the table has already been built, and that those Excel/CSV files will always be formatted the same way and will be in the same exact place every single time. Is it possible to automate this process, or at least get it to a point where very minimal human interference is needed? I'm new to AWS, so I'm not exactly sure of the limits of S3 to RDS. Thank you in advance.
It's definitely possible. AWS supports notifications from S3 to SNS, which can be forwarded automatically to SQS: http://aws.amazon.com/blogs/aws/s3-event-notification/
S3 can also send notifications to AWS Lambda to run your own code directly.
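A hedged sketch of that Lambda path, assuming a MySQL RDS instance reachable from the function, the pymysql driver packaged with it, and a fixed two-column CSV layout; the hostname, credentials, table, and column names are all placeholders:

```python
import csv
import io

def rows_from_csv(text):
    """Parse the CSV body (header row, then name,price rows) into tuples
    ready for executemany()."""
    reader = csv.reader(io.StringIO(text))
    next(reader)  # skip the header row
    return [(name, float(price)) for name, price in reader]

def handler(event, context):
    import boto3, pymysql  # lazy imports: the parsing above is AWS-free
    record = event["Records"][0]["s3"]
    body = boto3.client("s3").get_object(
        Bucket=record["bucket"]["name"],
        Key=record["object"]["key"],
    )["Body"].read().decode("utf-8")
    conn = pymysql.connect(host="mydb.xxxx.rds.amazonaws.com", user="admin",
                           password="change-me", database="shop")
    try:
        with conn.cursor() as cur:
            cur.executemany(
                "INSERT INTO products (name, price) VALUES (%s, %s)",
                rows_from_csv(body))
        conn.commit()
    finally:
        conn.close()
```

Excel files would need an extra parsing step (e.g. a library such as openpyxl bundled into the deployment package) before the same insert.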
Trying to sync a large (millions of files) S3 bucket from the cloud to local storage seems to be a troublesome process for most S3 tools, as virtually everything I've seen so far uses the GET Bucket operation, patiently getting the whole list of files in the bucket, then diffing it against a local list of files, then performing the actual file transfer.
This looks extremely suboptimal. For example, if one could list the files in a bucket that were created/changed since a given date, this could be done quickly, as the list of files to be transferred would include just a handful, not millions.
However, given that answer to this question is still true, it's not possible to do so in S3 API.
Are there any other approaches to do periodic incremental backups of a given large S3 bucket?
On AWS S3 you can configure event notifications (e.g. s3:ObjectCreated:*) to request a notification when an object is created. Notifications can be delivered to SNS, SQS, or Lambda. So you can have an application that listens for the event and updates the statistics. You may also want to add a timestamp as part of the statistics. Then just "query" the result for a certain period of time and you will get your delta.
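A minimal sketch of that statistics idea. Each s3:ObjectCreated notification record carries the key and an ISO-8601 eventTime; recording those lets the incremental backup "query" just the keys changed in the last window. The plain list here is a stand-in for whatever store (DynamoDB, a database, a log) the listener actually writes to:

```python
def record_from_notification(record):
    """Extract (key, eventTime) from one S3 notification record."""
    return (record["s3"]["object"]["key"], record["eventTime"])

def changed_since(store, cutoff):
    """Keys changed at or after the cutoff: the delta to transfer.
    ISO-8601 UTC timestamps compare correctly as plain strings."""
    return [key for key, ts in store if ts >= cutoff]
```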