job is invoking two times by s3 notifications - amazon-web-services

i have created lambda function to invoke glue job on s3 bucket notifications. if i place any object in s3 bucket the lambda is invoking glue job correctly.
But, i have created SMB fileshare for s3 bucket, if i place any objects in that fileshare also job is invoking correctly. but the job is running two times if i placed any objects in s3 file share.
does anyone knows why this job will starts two times if i place any objects in s3 mapped drive ???
if requires i will share the code. but code seems perfect.

Related

Best way to move contents from one s3 object/folder to another within the same bucket?

I have a job that needs to transfer ~150GB from one folder into another. This runs once a day.
def copy_new_data_to_official_location(bucket_name):
s3 = retrieve_aws_connection('s3')
objects_to_move = s3.list_objects(
Bucket=bucket_name, Prefix='my/prefix/here')
for item in objects_to_move['Contents']:
print(item['Key'])
copy_source = {
'Bucket': bucket_name,
'Key': item['Key']
}
original_key_name = item['Key'].split('/')[2]
s3.copy(copy_source, bucket_name, original_key_name)
I have the following. This process takes a bit of time and also, if I'm reading correctly, I'm paying transfer fees moving between objects.
Is there a better way?
Flow:
Run large scale job on Spark to feed data in from folder_1 and external source
Copy output to folder_2
Delete all contents from folder_1
Copy contents of folder_2 to folder_1
Repeat above flow on daily cadence.
Spark is a bit strange, so need to copy output to folder_2, otherwise redirecting to folder_1 causes a data wipe before the job even kicks off.
There are no Data Transfer fees if the source and destination buckets are in the same Region. Since you are simply copying within the same bucket, there would be no Data Transfer fee.
150 GB is not very much data, but it can take some time to copy if there are many objects. The overhead of initiating the copy can sometimes take more time than actually copying the data. When using the copy() command, all data is transferred within Amazon S3 -- nothing is copied down to the computer where the command is issued.
There are several ways you could make the process faster:
You could issue the copy() commands in parallel. In fact, this is how the AWS Command-Line Interface (CLI) works when using aws s3 cp --recursive and aws s3 sync.
You could use the AWS CLI to copy the objects rather writing your own program.
Instead of copying objects once per day, you could configure replication within Amazon S3 so that objects are copied as soon as they are created. (Although I haven't tried this with the same source and destination bucket.)
If you need to be more selective about the objects to copy immediately, you could configure Amazon S3 to trigger an AWS Lambda function whenever a new object is created. The Lambda function could apply some business logic to determine whether to copy the object, and then it can issue the copy() command.

Copy data from S3 and post process

There is a service that generates data in S3 bucket that is used for warehouse querying. Data is inserted into S3 as daily mechanism.
I am interested in copying that data from S3 to my service account to further classify the data. The classification needs to happen in my AWS service account as it is based on information present in my service account. Classification needs to happens in my service account as it is specific to my team/service. The service generating the data in S3 is neither concerned about the classification nor has the data to make classification decision.
Each S3 file consists of json objects (record) in it. For every record, I need to look into a dynamodb table. Based on whether data exists in Dynamo table, I need to include an additional attribute to the json object and store the list into another S3 bucket in my account.
The way I am considering doing this:
Trigger a scheduled CW event periodically to invoke a Lambda that will copy the files from Source S3 bucket into a bucket (lets say Bucket A) in my account.
Then, use another scheduled CW event to invoke a Lambda to read the records in the json and compare with dynamodb table to determine classification and write to updated record to another bucket (lets say Bucket B).
I have few questions regarding this:
Are there better alternatives for achieving this?
Would using aws s3 sync in the first Lambda be a good way to achieve this? My concerns revolve around lambdas getting timed out due large amount of data, especially for the second lambda that needs to compare against DDB for every record.
Rather than setting up scheduled events, you can trigger the AWS Lambda functions in real-time.
Use Amazon S3 Events to trigger the Lambda function as soon as a file is created in the source bucket. The Lambda function can call CopyObject() to copy the object to Bucket-A for processing.
Similarly, an Event on Bucket-A could then trigger another Lambda function to process the file. Some things to note:
Lambda functions run for a maximum of 15 minutes
You can increase the memory assigned to a Lambda function, which will also increase the amount of CPU assigned. So, this might speed-up the function if it is taking longer than 15 minutes.
There is a maximum of 512MB of storage space made available for a Lambda function.
If the data is too big, or takes too long to process, then you will need to find a way to do it outside of AWS Lambda. For example, using Amazon EC2 instances.
If you can export the data from DynamoDB (perhaps on a regular basis), you might be able to use Amazon Athena to do all the processing, but that depends on what you're trying to do. If it is simple SELECT/JOIN queries, it might be suitable.

Identifying and deleting S3 Objects that are not being accessed?

I have recently joined a company that uses S3 Buckets for various different projects within AWS. I want to identify and potentially delete S3 Objects that are not being accessed (read and write), in an effort to reduce the cost of S3 in my AWS account.
I read this, which helped me to some extent.
Is there a way to find out which objects are being accessed and which are not?
There is no native way of doing this at the moment, so all the options are workarounds depending on your usecase.
You have a few options:
Tag each S3 Object (e.g. 2018-10-24). First turn on Object Level Logging for your S3 bucket. Set up CloudWatch Events for CloudTrail. The Tag could then be updated by a Lambda Function which runs on a CloudWatch Event, which is fired on a Get event. Then create a function that runs on a Scheduled CloudWatch Event to delete all objects with a date tag prior to today.
Query CloudTrail logs on, write a custom function to query the last access times from Object Level CloudTrail Logs. This could be done with Athena, or a direct query to S3.
Create a Separate Index, in something like DynamoDB, which you update in your application on read activities.
Use a Lifecycle Policy on the S3 Bucket / key prefix to archive or delete the objects after x days. This is based on upload time rather than last access time, so you could copy the object to itself to reset the timestamp and start the clock again.
No objects in Amazon S3 are required by other AWS services, but you might have configured services to use the files.
For example, you might be serving content through Amazon CloudFront, providing templates for AWS CloudFormation or transcoding videos that are stored in Amazon S3.
If you didn't create the files and you aren't knowingly using the files, can you probably delete them. But you would be the only person who would know whether they are necessary.
There is recent AWS blog post which I found very interesting and cost optimized approach to solve this problem.
Here is the description from AWS blog:
The S3 server access logs capture S3 object requests. These are generated and stored in the target S3 bucket.
An S3 inventory report is generated for the source bucket daily. It is written to the S3 inventory target bucket.
An Amazon EventBridge rule is configured that will initiate an AWS Lambda function once a day, or as desired.
The Lambda function initiates an S3 Batch Operation job to tag objects in the source bucket. These must be expired using the following logic:
Capture the number of days (x) configuration from the S3 Lifecycle configuration.
Run an Amazon Athena query that will get the list of objects from the S3 inventory report and server access logs. Create a delta list with objects that were created earlier than 'x' days, but not accessed during that time.
Write a manifest file with the list of these objects to an S3 bucket.
Create an S3 Batch operation job that will tag all objects in the manifest file with a tag of "delete=True".
The Lifecycle rule on the source S3 bucket will expire all objects that were created prior to 'x' days. They will have the tag given via the S3 batch operation of "delete=True".
Expiring Amazon S3 Objects Based on Last Accessed Date to Decrease Costs

Run AWS lambda function on existing S3 images

I write an AWS lambda function in Node.js for image resizing and trigger it when images upload.
I have already more than 1,000,000 images existing in bucket.
I want to run this lambda function on that images but not find anything till yet.
How can I run AWS lamdba function on existing images of S3 bucket?
Note:- I know this question already asked on Stack overflow, but issue is that no solution of them given till yet
Unfortunately, Lambda cannot be triggered automatically for objects that are already existing in a S3 bucket.
You will have to invoke your Lambda function manually for each image in your S3 bucket.
First, you will need to list existing objects in your S3 bucket using the ListObjectsV2 action.
For each object in your S3 bucket, you must then invoke your Lambda function and provide the S3 object's information as the Payload.
Yes , it's completely true that lambda cannot be triggered by objects already present there in your s3 bucket, but invoking your lambda manually for each object is a completely dumb idea.
With some clever techniques you can perform your tasks on those images easily :
The hard way is, make a program locally that exactly does the same thing as your lambda function but add two more things, firstly you have to iterate over each object in your bucket, then perform your code on it and then save it to destination path of s3 after resizing. i.e, for all images already stored in your s3 bucket , instead of using lambda, you are resizing the images locally in your computer and saving them back to s3 destination.
The easiest way is, first make sure that you have configured s3 notification's event type to be Object Created (All) as trigger for your lambda.
Then after this, move all your already stored images to a new temporary bucket, and then move those images back to the original bucket, this is how your lambda will get triggered for each image automatically. You can do the moving task easily by using sdk's provided by AWS. For example, for moving files using boto3 in python, you can refer this link to moving example in python using boto3
Instead of using moving , i.e cut and paste , you can use copy and paste commands too.
In addition to Mausam Sharma's comment you can run the copy between buckets using the aws cli:
aws s3 sync s3://SOURCE-BUCKET-NAME s3://DESTINATION-BUCKET-NAME --source-region SOURCE-REGION-NAME --region DESTINATION-REGION-NAME
from here:
https://medium.com/tensult/copy-s3-bucket-objects-across-aws-accounts-e46c15c4b9e1
You can simply copy back to the same bucket with the CLI which will replace the original file with itself and then run the lambda as a result.
aws s3 copy s3://SOURCE-BUCKET-NAME s3://SOURCE-BUCKET-NAME --recursive
You can also include/exclude glob patterns which can be used to selectively run against say a particular day, or specific extensions etc.
aws s3 copy s3://SOURCE-BUCKET-NAME s3://SOURCE-BUCKET-NAME --recursive --exclude "*" --include "2020-01-15*"
It's worth noting that like many of the other answers here, this will incur costs on s3 for read/write etc, so cautiously apply this in the event of buckets containing lots of files.

Does AWS S3 Have a concept of files being 'updated'?

I'd like to write a Lambda function that is triggered when files are added or modified in an s3 bucket and processes them and moves them elsewhere, clobbering older versions of the files.
I'm wondering if AWS Lambda can be configured to trigger when files are updated?
After reviewing the Boto3 documentation for s3 it looks like the only things that could happen in a s3 bucket would be creations and deletions.
Additionally, the AWS documentation seems to indicate there is no way to trigger things on 'updates' to S3.
Am I correct in thinking there is no real concept of an 'update' to a file in S3 and that an update would actually be when something was destroyed and recreated? If I'm mistaken, how can I trigger a Lambda function when an S3 file is changed in a bucket?
No, there is no concept of updating a file on S3. A file on S3 is updated the same way it is uploaded in the first place - through a PUT object request. (Relevant answer here.) An S3 bucket notification configured to trigger on a PUT object request can execute a Lambda function.
There is now a new functionality for S3 buckets. Under properties there is the possibility to enable versioning for this bucket. And if you set a trigger for creating on S3 assigned to your Lambda function - this will executed every time if you 'update' the same file as it is a new version.