I created an EventBridge rule that triggers a SageMaker Pipeline when someone uploads a new file to an S3 bucket. As new input files become available, they will be uploaded to the bucket for processing. I'd like the pipeline to process only the uploaded file, so I thought to pass the file's S3 URL as a parameter to the pipeline. Since the full URL doesn't exist as a single field value in the S3 event, I was wondering if there is some way to concatenate multiple field values into a single parameter value that EventBridge will pass on to the target.
For example, I know the name of the uploaded file can be sent from EventBridge using $.detail.object.key and the bucket name can be sent using $.detail.bucket.name, so I'm wondering if I can combine the two and send something like s3://my-bucket/path/to/file.csv to the SageMaker Pipeline.
For what it's worth, I tried splitting the parameter into two (one being s3://bucket-name/ and the other being default_file.csv) when defining the pipeline, but when I combined the two into one I got an error saying Pipeline variables do not support concatenation.
The relevant pipeline step is
step_transform = TransformStep(name="Name", transformer=transformer, inputs=TransformInput(data=variable_of_s3_path))
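Here variable_of_s3_path is a single pipeline parameter that should hold the full S3 URL, roughly like this (the parameter name and default value are just illustrative):

from sagemaker.workflow.parameters import ParameterString

# Single pipeline parameter holding the full S3 URL (name/default are illustrative).
variable_of_s3_path = ParameterString(
    name="InputDataUrl",
    default_value="s3://my-bucket/path/to/default_file.csv",
)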
Input transformers manipulate the event payload that EventBridge sends to the target. Transforms consist of (1) an "input path" that maps substitution variable names to JSON-paths in the event and (2) a "template" that references the substitution variables.
Input path:
{
"detail-bucket-name": "$.detail.bucket.name",
"detail-object-key": "$.detail.object.key"
}
Input template that concatenates the s3 url and outputs it along with the original event payload:
{
"s3Url": "s3://<detail-bucket-name>/<detail-object-key>",
"original": "$"
}
Define the transform in the EventBridge console by editing the rule: Rule > Select Targets > Additional Settings.
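If you prefer to configure this programmatically rather than in the console, a minimal boto3 sketch of the same transform might look like the following; the rule name, target Id, and ARNs are placeholders:

import boto3

events = boto3.client("events")

# Attach the transform to an existing rule/target; the rule name, target Id,
# pipeline ARN, and role ARN below are placeholders.
events.put_targets(
    Rule="s3-upload-rule",
    Targets=[
        {
            "Id": "sagemaker-pipeline-target",
            "Arn": "arn:aws:sagemaker:us-east-1:123456789012:pipeline/my-pipeline",
            "RoleArn": "arn:aws:iam::123456789012:role/eventbridge-invoke-role",
            "InputTransformer": {
                "InputPathsMap": {
                    "detail-bucket-name": "$.detail.bucket.name",
                    "detail-object-key": "$.detail.object.key",
                },
                "InputTemplate": '{"s3Url": "s3://<detail-bucket-name>/<detail-object-key>"}',
            },
        }
    ],
)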
I'm currently struggling to configure automated input to my AWS StepFunctions state machine. Basically, I am trying to set up a state machine that is notified whenever an object create event takes place in a certain S3 bucket. When that happens, the input is passed to a choice state which checks the file size. If the file is small enough, it invokes a Lambda function to process the file contents. If the file is too large, it invokes a Lambda to split up the file into files of manageable size, and then invokes the other Lambda to process the contents of those files. The problem with this is that I cannot figure out a way to pass the file size in as input to the state machine.
I am generally aware of how input is passed to StepFunctions, and I know that S3 Lambda triggers contain file size as a parameter, but I still haven't been able to figure out a practical way of passing file size as an input parameter to a StepFunctions state machine.
I would greatly appreciate any help on this issue and am happy to clarify or answer any questions that you have to the best of my ability. Thank you!
Currently, S3 events can't trigger Step Functions directly, so one option is to create an S3 event that triggers a Lambda function. The Lambda works as a proxy: it passes the file info to the state machine and kicks off the execution, and you can select only the data you want and pass that selective data to Step Functions.
The other option is to configure a state machine as a target for a CloudWatch Events rule. This will start an execution when files are added to an Amazon S3 bucket.
The first option is more flexible.
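For the first option, a minimal sketch of the proxy Lambda might look like this; the state machine ARN is a placeholder, and the handler pulls the bucket, key, and size out of each S3 record before starting an execution:

import json
import boto3

sfn = boto3.client("stepfunctions")

# Placeholder state machine ARN.
STATE_MACHINE_ARN = "arn:aws:states:us-east-1:123456789012:stateMachine:process-file"

def lambda_handler(event, context):
    # The S3 notification may contain several records; start one execution per object.
    for record in event["Records"]:
        s3_info = record["s3"]
        sfn.start_execution(
            stateMachineArn=STATE_MACHINE_ARN,
            input=json.dumps({
                "bucket": s3_info["bucket"]["name"],
                "key": s3_info["object"]["key"],
                "size": s3_info["object"]["size"],  # bytes; used by the Choice state
            }),
        )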
I am configuring a Lambda function that should process .png and .pdf files from my bucket.
How can I provide the Lambda trigger configuration with multiple suffixes?
Please advise how to do this.
As can be seen from the tooltip description, you can't set multiple suffixes in the AWS console:
Enter a single optional suffix to limit the notifications to objects with keys that end with matching characters.
What you can do is create multiple triggers and define each suffix in a separate S3 trigger.
The documentation has sample XML configuration to support multiple, non-overlapping suffix/prefix options, but I don't think it is possible to set them in the web console.
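If the console won't cooperate, a rough boto3 sketch of the same idea is below; the bucket name and function ARN are placeholders, and note that this call replaces the bucket's entire existing notification configuration:

import boto3

s3 = boto3.client("s3")

# One Lambda configuration per suffix; bucket name and function ARN are placeholders.
s3.put_bucket_notification_configuration(
    Bucket="my-bucket",
    NotificationConfiguration={
        "LambdaFunctionConfigurations": [
            {
                "Id": "png-upload",
                "LambdaFunctionArn": "arn:aws:lambda:us-east-1:123456789012:function:process-file",
                "Events": ["s3:ObjectCreated:*"],
                "Filter": {"Key": {"FilterRules": [{"Name": "suffix", "Value": ".png"}]}},
            },
            {
                "Id": "pdf-upload",
                "LambdaFunctionArn": "arn:aws:lambda:us-east-1:123456789012:function:process-file",
                "Events": ["s3:ObjectCreated:*"],
                "Filter": {"Key": {"FilterRules": [{"Name": "suffix", "Value": ".pdf"}]}},
            },
        ]
    },
)

S3 also needs permission to invoke the function (for example via lambda add_permission), which the console normally adds for you.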
I have created a rule to send the incoming IoT messages to an S3 bucket.
The problem is that every time IoT receives a message, it is sent and stored in a new file with the same name in S3, overwriting the previous one.
I want this S3 file to keep all the data from before and not truncate each time a new message is stored.
How can I do that?
When you set up an IoT S3 rule action, you need to specify a bucket and a key. The key is what we might think of as a "path and file name". As the docs say, we can specify the key string by using a substitution template, which is just a fancy way of saying "build a path out of these pieces of information". When you are building your substitution template, you can reference fields inside the message as well as use a bunch of other functions.
In particular, look at the topic and timestamp functions, as well as some of the string manipulation functions.
Let's say your topic names are something like things/thing-id-xyz/location and you just want to store each incoming JSON message in a "folder" for the thing-id it came in from. You might specify a key like:
${topic(2)}/${timestamp()}.json
which would evaluate to something like:
thing-id-xyz/1481825251155.json
where the timestamp part is the time the message came in. That will be different for each message, and then the messages would not overwrite each other.
You can also specify parts of the message itself. Let's imagine our incoming messages look something like this:
{
"time": "2022-01-13T10:04:03Z",
"latitude": 40.803274,
"longitude": -74.237926,
"note": "Great view!"
}
Let's say you want to use the nice ISO date value you have in your data instead of the timestamp of the file. You could reference the time field no problem, like:
${topic(2)}/${time}.json
Now the file would be written as the key:
thing-id-xyz/2022-01-13T10:04:03Z.json
You should be able to find some combination of values that works for your needs and that, most importantly, is UNIQUE for each message so they don't overwrite each other in S3.
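For illustration, a rule along these lines could be created with boto3 roughly as follows; the rule name, topic filter, bucket, and role ARN are all placeholders:

import boto3

iot = boto3.client("iot")

# Placeholder rule name, topic filter, bucket, and role; the key uses the
# substitution template from above so each message lands in its own object.
iot.create_topic_rule(
    ruleName="store_location_messages",
    topicRulePayload={
        "sql": "SELECT * FROM 'things/+/location'",
        "awsIotSqlVersion": "2016-03-23",
        "actions": [
            {
                "s3": {
                    "roleArn": "arn:aws:iam::123456789012:role/iot-s3-write-role",
                    "bucketName": "my-iot-bucket",
                    "key": "${topic(2)}/${timestamp()}.json",
                }
            }
        ],
        "ruleDisabled": False,
    },
)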
You can do it using AWS IoT SQL variable expressions. For example, use ${newuuid()} as the key; this will create a new S3 object for each message received.
See more about SQL functions here: https://docs.aws.amazon.com/iot/latest/developerguide/iot-sql-functions.html
You can't do this with the S3 IoT Rule Action. You can get similar results using AWS Firehose, which will batch up several messages and write to one file. You will still end up with multiple files though.
I have written a Lambda function which gets invoked automatically when a file comes into my S3 bucket.
I perform certain validations on this file, modify it, and put the file back at the same location.
Because of this put, my Lambda is invoked again, and the process goes on until my Lambda execution times out.
Is there any way to trigger this lambda only once?
I found an approach where I can store the file name in DynamoDB and apply a check in the Lambda function, but is there any other approach where the use of DynamoDB can be avoided?
You have a couple options:
You can put the file to a different location in S3 and delete the original.
You can add a metadata field to the S3 object when you update it, then check for the presence of that field so you know whether you have already processed it (a sketch of this follows below). This might not work perfectly, since S3 does not always provide the most recent data on reads after updates.
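A minimal sketch of the metadata approach, where the processed metadata key and the validate_and_modify helper are placeholders for your own logic:

import boto3

s3 = boto3.client("s3")

PROCESSED_FLAG = "processed"  # placeholder metadata key

def lambda_handler(event, context):
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]

        # Skip objects this function has already written back.
        metadata = s3.head_object(Bucket=bucket, Key=key).get("Metadata", {})
        if PROCESSED_FLAG in metadata:
            continue

        body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
        modified = validate_and_modify(body)  # placeholder for your own logic

        # Write back with the flag so the next invocation short-circuits.
        s3.put_object(Bucket=bucket, Key=key, Body=modified,
                      Metadata={PROCESSED_FLAG: "true"})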
AWS allows different types of S3 event triggers. You can try playing with s3:ObjectCreated:Put vs s3:ObjectCreated:Post.
You can upload your files to a folder, say s3://bucket-name/notvalidated, and store the validated files in another folder, say s3://bucket-name/validated.
Update your S3 event notification to invoke your Lambda function whenever there is an ObjectCreated (All) event under the notvalidated/ prefix.
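A rough sketch of such a handler, assuming the notvalidated/ and validated/ prefixes above and a placeholder validate helper:

import boto3
from urllib.parse import unquote_plus

s3 = boto3.client("s3")

def lambda_handler(event, context):
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = unquote_plus(record["s3"]["object"]["key"])  # e.g. notvalidated/report.csv

        body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
        validated = validate(body)  # placeholder for your validation/modification logic

        # Write the result under the validated/ prefix, which has no trigger attached.
        target_key = key.replace("notvalidated/", "validated/", 1)
        s3.put_object(Bucket=bucket, Key=target_key, Body=validated)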
The second answer does not seem to be correct (Put vs Post): there is not really a concept of "update" in S3 in terms of POST or PUT, and the request to update an object will be the same as the initial upload of the object. See here for details on the available S3 events.
I had this exact problem last year: I was doing an image resize on PUT, and every time a file was overwritten, the function would be triggered again. My recommended solution would be to have two folders in your S3 bucket, one for the original file and one for the finalized file. You could then create the Lambda trigger with a prefix filter so it only picks up the files in the original folder.
Events are triggered in S3 when an object is Put, Post, Copied, or a multipart upload is completed; all of these operations correspond to ObjectCreated, as per the AWS documentation:
https://docs.aws.amazon.com/AmazonS3/latest/dev/NotificationHowTo.html
The best solution is to restrict your S3 object-created event to a particular bucket location (prefix), so that only changes in that location will trigger the Lambda function.
You can write the modified object to some other bucket location that is not configured to trigger the Lambda function when an object is created there.
Hope it helps!
I am considering moving to Lambdas, and after spending some time reading the docs and various blogs with user experiences, I am still struggling with a simple question: is there a proposed/proper way to use Lambda with existing S3 files?
I have an S3 bucket that contains archived data spanning a couple of years. These data are rather large (hundreds of GB). Each file is a simple txt file, and each line in the file represents an event as a comma-separated string.
My endgame is to consume these files, parse each one of them line by line, apply some transformation, create batches of lines, and send them to an external service. From what I've read so far, if I write a proper Lambda, it will be triggered by an S3 event (for example, the upload of a new file).
Is there a way to apply the lambda to all the existing contents of my bucket?
Thanks
For existing resources you would need to write a script that gets a listing of all your objects and sends each item to a Lambda function somehow. I'd probably look into sending the location of each of your existing S3 objects to a Kinesis stream and configuring a Lambda function to pull records from that stream and process them.
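One simple variant (invoking the function directly instead of going through a Kinesis stream) might look roughly like this; the bucket and function names are placeholders, and the payload mimics the shape of an S3 notification so the existing handler works unchanged:

import json
import boto3

s3 = boto3.client("s3")
lambda_client = boto3.client("lambda")

BUCKET = "my-archive-bucket"            # placeholder
FUNCTION_NAME = "process-events-file"   # placeholder

paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=BUCKET):
    for obj in page.get("Contents", []):
        # Mimic the S3 notification shape so the existing handler works unchanged.
        payload = {"Records": [{"s3": {"bucket": {"name": BUCKET},
                                       "object": {"key": obj["Key"]}}}]}
        lambda_client.invoke(
            FunctionName=FUNCTION_NAME,
            InvocationType="Event",  # asynchronous invocation
            Payload=json.dumps(payload).encode("utf-8"),
        )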
Try using s3cmd.
s3cmd modify --recursive --add-header="touched:touched" s3://path/to/s3/bucket-or-folder
This will modify the metadata (by copying each object in place), which fires an object-created event that invokes your Lambda.
I had a similar problem and solved it with minimal changes to my existing Lambda function. The solution involves creating an API Gateway trigger (in addition to the S3 trigger): the API Gateway trigger is used to process historical files in S3, while the regular S3 trigger processes files as they are uploaded to my S3 bucket.
Initially, I built my function to expect an S3 event as the trigger. Recall that S3 events have this structure, so I would look for the S3 bucket name and key to process, like so:
import os
import tempfile
from urllib.parse import unquote_plus
import boto3

s3_client = boto3.client("s3")

for record in event['Records']:
    bucket = record['s3']['bucket']['name']
    key = unquote_plus(record['s3']['object']['key'], encoding='utf-8')
    temp_dir = tempfile.TemporaryDirectory()
    video_filename = os.path.basename(key)
    local_video_filename = os.path.join(temp_dir.name, video_filename)
    s3_client.download_file(bucket, key, local_video_filename)
But when the function is invoked via the API Gateway trigger, there is no "Records" object in the request/event. You can use query parameters with the API Gateway trigger, so the modification required to the above snippet of code is:
if 'Records' in event:
    # this means we are working off of an S3 event
    records_to_process = event['Records']
else:
    # this is for ad-hoc posts via the API Gateway trigger for Lambda
    records_to_process = [{
        "s3": {
            "bucket": {"name": event["queryStringParameters"]["bucket"]},
            "object": {"key": event["queryStringParameters"]["file"]}
        }
    }]

for record in records_to_process:
    # the lines below are the same as in the earlier snippet
    bucket = record['s3']['bucket']['name']
    key = unquote_plus(record['s3']['object']['key'], encoding='utf-8')
    temp_dir = tempfile.TemporaryDirectory()
    video_filename = os.path.basename(key)
    local_video_filename = os.path.join(temp_dir.name, video_filename)
    s3_client.download_file(bucket, key, local_video_filename)
(Screenshot: Postman result of sending the POST request.)
Try copying your bucket contents and catching the create events with Lambda.
copy:
s3cmd sync s3://from/this/bucket/ s3://to/this/bucket
for larger buckets:
https://github.com/paultuckey/s3_bucket_to_bucket_copy_py