Copy data from S3 and post process - amazon-web-services

There is a service that generates data in an S3 bucket that is used for warehouse querying. Data is written to S3 daily.
I am interested in copying that data from S3 to my service account to further classify it. The classification needs to happen in my AWS service account because it is based on information present in that account and is specific to my team/service. The service generating the data in S3 is neither concerned with the classification nor has the data to make classification decisions.
Each S3 file consists of JSON objects (records). For every record, I need to look into a DynamoDB table. Based on whether data exists in the DynamoDB table, I need to add an attribute to the JSON object and store the list in another S3 bucket in my account.
The way I am considering doing this:
Trigger a scheduled CW event periodically to invoke a Lambda that will copy the files from the source S3 bucket into a bucket (let's say Bucket A) in my account.
Then, use another scheduled CW event to invoke a Lambda that reads the records in the JSON, compares them with the DynamoDB table to determine the classification, and writes the updated records to another bucket (let's say Bucket B).
I have a few questions regarding this:
Are there better alternatives for achieving this?
Would using aws s3 sync in the first Lambda be a good way to achieve this? My concerns revolve around the Lambdas timing out due to the large amount of data, especially the second Lambda, which needs to compare against DDB for every record.

Rather than setting up scheduled events, you can trigger the AWS Lambda functions in real-time.
Use Amazon S3 Events to trigger the Lambda function as soon as a file is created in the source bucket. The Lambda function can call CopyObject() to copy the object to Bucket-A for processing.
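As a rough sketch of that first function (not a definitive implementation), the handler below reads the bucket and key from each S3 event record and issues a CopyObject call. The destination bucket name `bucket-a` is a made-up placeholder, and the optional `s3` parameter exists only so the function can be exercised without AWS access.

```python
import urllib.parse

DEST_BUCKET = "bucket-a"  # hypothetical name standing in for Bucket-A


def copy_requests(event, dest_bucket=DEST_BUCKET):
    """Turn an S3 event notification into CopyObject keyword arguments."""
    requests = []
    for record in event.get("Records", []):
        source_bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded in S3 event notifications.
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])
        requests.append({
            "Bucket": dest_bucket,
            "Key": key,
            "CopySource": {"Bucket": source_bucket, "Key": key},
        })
    return requests


def handler(event, context, s3=None):
    if s3 is None:
        import boto3  # provided by the Lambda Python runtime
        s3 = boto3.client("s3")
    requests = copy_requests(event)
    for request in requests:
        s3.copy_object(**request)
    return len(requests)
```

Note that a single CopyObject call only handles objects up to 5 GB; larger objects would need a multipart copy. The function's role also needs read access to the source bucket and write access to the destination.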
Similarly, an Event on Bucket-A could then trigger another Lambda function to process the file. Some things to note:
Lambda functions run for a maximum of 15 minutes
You can increase the memory assigned to a Lambda function, which will also increase the amount of CPU assigned. So, this might speed up the function if it is taking longer than 15 minutes.
By default, 512 MB of temporary storage space (/tmp) is made available to a Lambda function. This can be configured up to 10,240 MB.
If the data is too big, or takes too long to process, then you will need to find a way to do it outside of AWS Lambda. For example, using Amazon EC2 instances.
If you can export the data from DynamoDB (perhaps on a regular basis), you might be able to use Amazon Athena to do all the processing, but that depends on what you're trying to do. If it is simple SELECT/JOIN queries, it might be suitable.
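For the per-record DynamoDB lookup in the second function, one way to keep the run time down is to batch the lookups rather than issue one GetItem per record. The sketch below assumes each record has a string attribute `id` that is also the table's partition key, and it adds a boolean attribute named `classified`; both names are invented for illustration, as is the table name passed in.

```python
def chunked(items, size):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def existing_ids(dynamodb, table, ids, key_name="id"):
    """Return the subset of ids that have an item in the table."""
    found = set()
    unique_ids = sorted(set(ids))  # BatchGetItem rejects duplicate keys
    for batch in chunked(unique_ids, 100):  # at most 100 keys per call
        request = {table: {
            "Keys": [{key_name: {"S": value}} for value in batch],
            "ProjectionExpression": "#k",  # only the key is needed
            "ExpressionAttributeNames": {"#k": key_name},
        }}
        while request:
            response = dynamodb.batch_get_item(RequestItems=request)
            for item in response.get("Responses", {}).get(table, []):
                found.add(item[key_name]["S"])
            # Keys that were throttled are returned here and retried.
            request = response.get("UnprocessedKeys") or None
    return found


def classify(records, dynamodb, table, key_name="id"):
    """Return copies of the records with a 'classified' flag added."""
    ids = [record[key_name] for record in records]
    known = existing_ids(dynamodb, table, ids, key_name)
    return [dict(record, classified=(record[key_name] in known))
            for record in records]
```

Here `dynamodb` would be a low-level client such as `boto3.client("dynamodb")`. A production version would probably add a backoff between retries of unprocessed keys.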

Related

How can I Periodically Insert Data in Amazon Redshift?

I want to periodically insert data from S3 (or other sources) into Amazon Redshift, i.e., when data is added to my S3 bucket, I want an option to add it automatically to my Amazon Redshift cluster.
My preferred method for doing this is to establish a trigger that fires every time a file is created in a part of a bucket. This trigger creates an event that initiates a Lambda function that issues the desired SQL to Redshift. (Or, if the work that is needed in Redshift is complex or long running, I will use a Step Function, but this is rare.)
Example setups for this:
https://docs.aws.amazon.com/AmazonS3/latest/userguide/NotificationHowTo.html
https://64lines.medium.com/building-a-aws-lambda-function-to-run-aws-redshift-sql-scripts-in-python-7468b7c2fdea
I'd start simple if you can and work up to Redshift Data API and Step functions.
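To illustrate the trigger-plus-Lambda setup, here is a minimal sketch that issues a COPY through the Redshift Data API for each new object. The cluster, database, user, table and role values are placeholders I made up, and the JSON format clause is an assumption about the files being loaded.

```python
import urllib.parse

# Hypothetical settings; replace with your own cluster details.
CLUSTER_ID = "my-cluster"
DATABASE = "dev"
DB_USER = "loader"
TABLE = "public.events"
IAM_ROLE = "arn:aws:iam::123456789012:role/redshift-copy"


def build_copy_sql(table, bucket, key, iam_role):
    """Build a COPY statement that loads one S3 object into a table."""
    return (
        f"COPY {table} FROM 's3://{bucket}/{key}' "
        f"IAM_ROLE '{iam_role}' FORMAT AS JSON 'auto';"
    )


def handler(event, context, client=None):
    if client is None:
        import boto3  # provided by the Lambda Python runtime
        client = boto3.client("redshift-data")
    statement_ids = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])
        # execute_statement is asynchronous: it returns an Id right away
        # and the COPY keeps running inside Redshift.
        response = client.execute_statement(
            ClusterIdentifier=CLUSTER_ID,
            Database=DATABASE,
            DbUser=DB_USER,
            Sql=build_copy_sql(TABLE, bucket, key, IAM_ROLE),
        )
        statement_ids.append(response["Id"])
    return statement_ids
```

Because the call does not wait for the COPY to finish, the function returns quickly; checking whether the load succeeded would be a separate step.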
You can automate the insertion of data from S3 with a scheduled Lambda that triggers periodically. This might be a better solution than invoking a Lambda on every object upload, especially if you are receiving lots of files continuously.

AWS S3 Bucket Notifications when object changes storage class?

I'm looking for a way to be notified when an object in s3 changes storage class. I thought there would be a bucket event notification for this but I don't see it as an option. How can I know when an object moves from STANDARD to GLACIER? We have systems that depend on objects not being in GLACIER. If they change to GLACIER, we need to be made aware and handle them accordingly.
https://docs.aws.amazon.com/AmazonS3/latest/userguide/notification-how-to-event-types-and-destinations.html#supported-notification-event-types
You can use S3 access logs to capture lifecycle changes, but I think that's about it:
Amazon S3 server access logs can be enabled in an S3 bucket to capture
S3 Lifecycle-related actions such as object transition to another
storage class
Taken from AWS docs - life-cycle and other bucket config
You could certainly roll your own notifications for storage class transitions, although it might be a bit more involved than you are hoping for. You need a separate bucket to write your access logs to. Set up an S3 notification for object creation in the new logs bucket to trigger a Lambda function that processes each new log file. In the Lambda function, use Athena to query the logs and fire off an SNS alert or perform some corrective action in code.
There are some limitations to be aware of, though: server access logs are delivered on a best-effort basis, which means you might not get logs for a few hours.
Updated 28/5/21
If logging is enabled, you should see the various lifecycle operations logged as they happen, give or take a few hours. If not, are you definitely meeting the minimum criteria for transitioning objects to Glacier? (e.g. an object only transitions once it reaches the age configured in the lifecycle rule, such as 30 days).
As for:
The log record for a particular request might be delivered long after
the request was actually processed, or it might not be delivered at
all.
Consider S3's eventual consistency model and the SLA on data durability: there is a possibility of data loss for any object in S3. I think the risk of losing log records is relatively low, but it could happen.
You could also go for a more active approach: use the S3 API from a Lambda function triggered by CloudWatch Events (cron-like scheduling) to scan all the objects in the bucket and act accordingly (send an email, take corrective action, etc.). Bear in mind this might get expensive depending on how often you run the Lambda and how many objects are in your bucket, but low volumes might even fall within the free tier depending on your usage.
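A minimal sketch of that scan could look like the following. It relies on the `StorageClass` field that the ListObjectsV2 listing returns for each object; the set of storage classes to flag and the SNS topic are assumptions for illustration.

```python
def objects_in_classes(s3, bucket, classes=("GLACIER", "DEEP_ARCHIVE")):
    """Yield (key, storage_class) for objects stored in the given classes."""
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket):
        # Pages for an empty bucket or prefix have no "Contents" entry.
        for obj in page.get("Contents", []):
            if obj.get("StorageClass") in classes:
                yield obj["Key"], obj["StorageClass"]


def alert_on_archived(s3, sns, bucket, topic_arn):
    """Publish one SNS message listing every archived object found."""
    archived = list(objects_in_classes(s3, bucket))
    if archived:
        lines = [f"{key} ({storage_class})" for key, storage_class in archived]
        sns.publish(
            TopicArn=topic_arn,
            Subject="Objects in archive storage",
            Message="\n".join(lines),
        )
    return archived
```

In a scheduled Lambda, `s3` and `sns` would be `boto3.client("s3")` and `boto3.client("sns")`. For very large buckets the message body would need to be truncated or split, since SNS messages have a size limit.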
As of Nov 2021 you can now do this via AWS EventBridge.
Simply create a new EventBridge rule that matches the Object Storage Class Changed event for the S3 bucket (the bucket must have EventBridge notifications turned on).
See https://aws.amazon.com/blogs/aws/new-use-amazon-s3-event-notifications-with-amazon-eventbridge/
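As a sketch of what such a rule might look like when created from code, the snippet below builds an event pattern and registers it with a target. The rule name, target ARN and target Id are placeholders, and the `destination-storage-class` filter reflects my understanding of the event's detail fields, so please verify it against a real event before relying on it.

```python
import json


def storage_class_pattern(bucket, destination_classes=None):
    """Event pattern matching storage class changes in one bucket."""
    detail = {"bucket": {"name": [bucket]}}
    if destination_classes:
        # Assumed field name; narrows the rule to specific target classes.
        detail["destination-storage-class"] = list(destination_classes)
    return {
        "source": ["aws.s3"],
        "detail-type": ["Object Storage Class Changed"],
        "detail": detail,
    }


def create_rule(events, rule_name, bucket, target_arn, destination_classes=None):
    """Create the rule and point it at a target such as an SNS topic."""
    pattern = storage_class_pattern(bucket, destination_classes)
    events.put_rule(
        Name=rule_name,
        EventPattern=json.dumps(pattern),
        State="ENABLED",
    )
    events.put_targets(
        Rule=rule_name,
        Targets=[{"Id": "notify", "Arn": target_arn}],
    )
    return pattern
```

Here `events` would be `boto3.client("events")`. The target (for example an SNS topic or Lambda function) also needs a resource policy that allows EventBridge to deliver to it.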

How to save data from a Lambda function into a S3 when we have too much incoming per millisecond?

I have a process that publishes data to IoT-Core, which triggers a Lambda function that inserts the payload into an Amazon S3 bucket.
The process sends around 1.2 million records within a few seconds, and when I check the bucket I see that I have lost around 10% of the data. If I add a sleep in the Lambda function, it runs beyond 15 minutes.
What is the solution for this scenario?
It appears that your requirement is to capture the events coming into IoT-Core and save them to Amazon S3.
It also sounds like your Lambda functions are being throttled due to hitting concurrency limits and data is being lost. By default, there is a limit of 1,000 concurrent AWS Lambda executions per Region in an account. This could potentially be fixed by requesting an increase in the concurrency limit.
Here is a diagram from How AWS IoT works:
As shown in the diagram, the Rules engine can actually be used to send data to Amazon S3 without requiring Lambda. However, this creates a separate object in Amazon S3 for every message.
If you wish to combine messages together, you can Write to Kinesis Data Firehose Using AWS IoT. Firehose will buffer the data by time or size, and then output multiple messages to an Amazon S3 object. This could be a good way to handle large volumes of data, and it also makes it easier to work with the resulting objects in S3 because fewer objects are created. This makes them faster to query and process later (e.g. with Amazon Athena).
Going from IoT-Core rule direct to a Lambda can be fragile.
You can use Kinesis to buffer the data or Firehose to stream it directly to S3. These are standard patterns that AWS recommend for IoT in the AWS Well-Architected framework (https://d1.awsstatic.com/whitepapers/architecture/AWS-IoT-Lens.pdf).
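For illustration, an IoT topic rule that forwards messages to a Firehose delivery stream could be created roughly as follows. The topic filter, stream name, role ARN and rule name are all placeholders, and the delivery stream itself (with its S3 destination and buffering settings) is assumed to exist already.

```python
def firehose_rule_payload(topic_filter, delivery_stream, role_arn):
    """Build a topic rule payload that sends every message to Firehose."""
    return {
        "sql": f"SELECT * FROM '{topic_filter}'",
        "awsIotSqlVersion": "2016-03-23",
        "ruleDisabled": False,
        "actions": [
            {
                "firehose": {
                    "roleArn": role_arn,
                    "deliveryStreamName": delivery_stream,
                    # Newline separator so each message lands on its own line.
                    "separator": "\n",
                }
            }
        ],
    }


def create_rule(iot, rule_name, payload):
    """Register the rule; rule names may only use letters, digits and '_'."""
    iot.create_topic_rule(ruleName=rule_name, topicRulePayload=payload)
```

Here `iot` would be `boto3.client("iot")`. With a newline separator the objects Firehose writes to S3 are line-delimited, which tends to be convenient for querying later.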

How to invoke athena triggered automatically by lambda when objects are updated in the s3 bucket?

I have the following two use cases:
Case 1. I need to call the Lambda on its own to invoke Athena to query the S3 data. Question: how do I invoke a Lambda directly via an API?
Case 2. I need a Lambda function to invoke Athena whenever a file is copied to the S3 bucket that is already mapped to an Athena table.
I am referring to the following link to run Athena queries from Lambda:
Link:
https://dev.classmethod.jp/cloud/run-amazon-athenas-query-with-aws-lambda/
For case 2, here is an example of what I want to integrate:
The file in s3-1 is sales.csv, and I would update the sales details by copying data from another bucket, s3-2. The schema/columns defined for the s3-1 data would remain the same.
So when I copy a file into the S3 location that is mapped to Athena, the Lambda should call Athena to run the query.
I would appreciate any suggestions for a better way to achieve the above cases.
Thanks
Case 1
An AWS Lambda can be directly invoked via the invoke() command. This can be done via the AWS Command-Line Interface (CLI) or from a programming language using an AWS SDK.
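From Python, a direct invocation might look like the sketch below. The function name and payload shape are hypothetical; the call itself is the standard `invoke` operation of the Lambda client.

```python
import json


def invoke_function(lambda_client, function_name, payload):
    """Invoke a Lambda function synchronously and return its JSON result."""
    response = lambda_client.invoke(
        FunctionName=function_name,
        InvocationType="RequestResponse",  # wait for the function to finish
        Payload=json.dumps(payload).encode("utf-8"),
    )
    body = response["Payload"].read()
    # FunctionError is only present when the function itself raised an error.
    if "FunctionError" in response:
        raise RuntimeError(body.decode("utf-8"))
    return json.loads(body)
```

For example, `invoke_function(boto3.client("lambda"), "run-athena-query", {"query": "SELECT 1"})` would call a function with that (made-up) name. Using `InvocationType="Event"` instead would queue the invocation without waiting for a result.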
Case 2
An Amazon S3 event can be configured on a bucket to automatically trigger an AWS Lambda function when a file is uploaded. The event provides the bucket name and file name (object name) to the Lambda function.
The Lambda function can extract these details from the event record and can then use that information in an Amazon Athena command.
Please note that, if the file name is different each time, a CREATE TABLE command would be required before a SELECT command can query the data.
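A minimal sketch of such a function is shown below, assuming the new file lands in a location that an existing Athena table already covers, so a fixed query can be used. The database, table, query and output location are placeholders.

```python
# Hypothetical settings; replace with your own.
DATABASE = "salesdb"
QUERY = "SELECT COUNT(*) FROM sales"
# Results should go somewhere that does not trigger this function again.
OUTPUT_LOCATION = "s3://my-athena-results/queries/"


def uploaded_objects(event):
    """Return (bucket, key) pairs from an S3 event, e.g. for logging."""
    return [
        (record["s3"]["bucket"]["name"], record["s3"]["object"]["key"])
        for record in event.get("Records", [])
    ]


def handler(event, context, athena=None):
    if athena is None:
        import boto3  # provided by the Lambda Python runtime
        athena = boto3.client("athena")
    print("Triggered by:", uploaded_objects(event))
    # start_query_execution returns immediately; the query runs in Athena.
    response = athena.start_query_execution(
        QueryString=QUERY,
        QueryExecutionContext={"Database": DATABASE},
        ResultConfiguration={"OutputLocation": OUTPUT_LOCATION},
    )
    return response["QueryExecutionId"]
```

Since the function returns as soon as the query is submitted, it does not sit idle while Athena works; the results can be picked up later from the output location or by polling with the returned execution ID.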
General Comments
A Lambda function can run for a maximum of 15 minutes, so make sure the Athena queries do not take more than this time. This is not a particularly efficient use of an AWS Lambda function because it will be billed for the duration of the function call, even if it is just waiting for Athena to finish.
Another option would be to have the Lambda function directly process the file, assuming that the query is not particularly complex. For example, the Lambda function could download the file to temporary storage (512 MB by default, configurable up to 10,240 MB), read through the file, do some calculations (e.g. add up the total of some columns), then store the results somewhere.
The next step would be to create an endpoint for your Lambda; you can use Amazon API Gateway for that.
On the other hand, you can invoke the Lambda from the AWS console or the AWS CLI in order to test it.

Identifying and deleting S3 Objects that are not being accessed?

I have recently joined a company that uses S3 Buckets for various different projects within AWS. I want to identify and potentially delete S3 Objects that are not being accessed (read and write), in an effort to reduce the cost of S3 in my AWS account.
I read this, which helped me to some extent.
Is there a way to find out which objects are being accessed and which are not?
There is no native way of doing this at the moment, so all the options are workarounds depending on your use case.
You have a few options:
Tag each S3 object with the date it was last accessed (e.g. 2018-10-24). First turn on Object Level Logging for your S3 bucket. Set up CloudWatch Events for CloudTrail. The tag could then be updated by a Lambda function which runs on a CloudWatch Event, which is fired on a Get event. Then create a function that runs on a scheduled CloudWatch Event to delete all objects with a date tag prior to today.
Query CloudTrail logs: write a custom function to query the last access times from object-level CloudTrail logs. This could be done with Athena, or a direct query to S3.
Create a Separate Index, in something like DynamoDB, which you update in your application on read activities.
Use a Lifecycle Policy on the S3 Bucket / key prefix to archive or delete the objects after x days. This is based on upload time rather than last access time, so you could copy the object to itself to reset the timestamp and start the clock again.
No objects in Amazon S3 are required by other AWS services, but you might have configured services to use the files.
For example, you might be serving content through Amazon CloudFront, providing templates for AWS CloudFormation or transcoding videos that are stored in Amazon S3.
If you didn't create the files and you aren't knowingly using the files, you can probably delete them. But you would be the only person who would know whether they are necessary.
There is a recent AWS blog post that describes an interesting, cost-optimized approach to this problem.
Here is the description from AWS blog:
The S3 server access logs capture S3 object requests. These are generated and stored in the target S3 bucket.
An S3 inventory report is generated for the source bucket daily. It is written to the S3 inventory target bucket.
An Amazon EventBridge rule is configured that will initiate an AWS Lambda function once a day, or as desired.
The Lambda function initiates an S3 Batch Operation job to tag objects in the source bucket. These must be expired using the following logic:
Capture the number of days (x) configuration from the S3 Lifecycle configuration.
Run an Amazon Athena query that will get the list of objects from the S3 inventory report and server access logs. Create a delta list with objects that were created earlier than 'x' days, but not accessed during that time.
Write a manifest file with the list of these objects to an S3 bucket.
Create an S3 Batch operation job that will tag all objects in the manifest file with a tag of "delete=True".
The Lifecycle rule on the source S3 bucket will expire all objects that were created prior to 'x' days. They will have the tag given via the S3 batch operation of "delete=True".
Expiring Amazon S3 Objects Based on Last Accessed Date to Decrease Costs
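The final step in that description, a Lifecycle rule that only expires objects carrying the "delete=True" tag, could be expressed roughly as below. This is a sketch under the assumption that the tag key and value match what the Batch Operations job writes; the rule ID is made up.

```python
def tagged_expiration_rule(days, tag_key="delete", tag_value="True"):
    """Lifecycle rule that expires only objects carrying the given tag."""
    return {
        "ID": f"expire-{tag_key}-{tag_value}",
        "Filter": {"Tag": {"Key": tag_key, "Value": tag_value}},
        "Status": "Enabled",
        "Expiration": {"Days": days},
    }


def apply_rule(s3, bucket, days):
    """Write the rule to the bucket's lifecycle configuration."""
    # This call replaces the bucket's entire lifecycle configuration,
    # so any existing rules would need to be included in the list too.
    s3.put_bucket_lifecycle_configuration(
        Bucket=bucket,
        LifecycleConfiguration={"Rules": [tagged_expiration_rule(days)]},
    )
```

Here `s3` would be `boto3.client("s3")` and `days` corresponds to the 'x' in the description above.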