Can anyone help me with how to do an incremental scan on AWS using the Python SDK? I want to scan the S3 bucket used by an Amazon EC2 instance. The first run should scan everything, and every later run should scan only what has changed since the previous one.
It can be done using...
Use logs
Consume changes (event driven)
Use rest API
But I did not understand how.
There is no "incremental scan" capability in Amazon S3.
You would need to list all existing objects. However, you could use the LastModified date to determine which objects were created since a particular point in time.
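A minimal sketch of the LastModified approach, assuming `s3_client` is a boto3 S3 client (for example `boto3.client("s3")`) and that you keep the time of the previous scan somewhere yourself. The function name and the checkpoint idea are illustrative, not part of any AWS API.

```python
from datetime import datetime, timezone


def list_objects_modified_since(s3_client, bucket, since):
    """Yield (key, last_modified) for objects changed after `since`.

    `since` must be a timezone-aware datetime, because boto3 returns
    LastModified as a timezone-aware value. S3 cannot filter by date on
    the server side, so every object is still listed; only the filtering
    happens here.
    """
    paginator = s3_client.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket):
        # "Contents" is absent when a page (or the bucket) has no objects.
        for obj in page.get("Contents", []):
            if obj["LastModified"] > since:
                yield obj["Key"], obj["LastModified"]


# For the very first (full) scan, a checkpoint far in the past works:
FIRST_SCAN = datetime(1970, 1, 1, tzinfo=timezone.utc)
```

On later runs you would pass the start time of the previous scan instead of `FIRST_SCAN`. Note that the listing cost still grows with the total number of objects in the bucket.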
Alternatively, you could create an Amazon S3 Event to trigger an AWS Lambda function whenever a new object is created. You could then write code in the Lambda function that extracts the name of the new objects from the event parameter and does something with it. This is effectively 'incremental', but one new object at a time.
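As a rough illustration of the event-driven option, a Lambda handler might pull the bucket and key out of each record like this. The `print` call is only a placeholder for whatever you actually want to do with the object.

```python
from urllib.parse import unquote_plus


def lambda_handler(event, context):
    """Collect the bucket and key of each object named in an S3 event."""
    new_objects = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # Keys arrive URL-encoded in S3 notifications (a space becomes "+").
        key = unquote_plus(record["s3"]["object"]["key"])
        new_objects.append((bucket, key))
        print(f"New object: s3://{bucket}/{key}")  # placeholder for real work
    return {"processed": len(new_objects), "objects": new_objects}
```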
I am looking to trigger code every 1 Hour in AWS.
The code should: Parse through a list of zip codes, fetch data for each of the zip codes, store that data somewhere in AWS.
Is there a specific AWS service I would use for parsing through the list of zip codes and calling the API for each zip code? Would this be Lambda?
How could I schedule this service to run every X hours? Do I have to use another AWS Service to call my Lambda function (assuming that's the right answer to #1)?
Which AWS service could I use to store this data?
I tried looking up different approaches and services in AWS. I found I could write serverless code in Lambda, which made me think it would be the answer to my first question. Then I tried to look into how that could be run every X hours, but I was struggling to work out whether I could still use Lambda for that. Finally, I wanted to know what my options were for storing the data. I saw that Glue may be an option, but wasn't sure.
Yes, you can use Lambda to run your code (as long as the total run time is less than 15 minutes).
You can use Amazon EventBridge Scheduler to trigger the Lambda every 1 hour.
Which AWS service could I use to store this data?
That depends on the format of the data and how you will subsequently use it. Some options are:
Amazon DynamoDB for key-value, NoSQL data
Amazon Aurora for relational data
Amazon S3 for object storage
If you choose S3, you can still do SQL-like queries on the data using Amazon Athena
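The hourly trigger mentioned above could be set up roughly as follows. This is a sketch, assuming `scheduler_client` is `boto3.client("scheduler")`; the schedule name and both ARNs are placeholders you would supply, and the role must allow EventBridge Scheduler to invoke your function.

```python
import json


def build_hourly_schedule(name, lambda_arn, role_arn, payload=None):
    """Build keyword arguments for EventBridge Scheduler's create_schedule."""
    return {
        "Name": name,
        # Fires at minute 0 of every hour (the cron expression has six fields).
        "ScheduleExpression": "cron(0 * * * ? *)",
        "FlexibleTimeWindow": {"Mode": "OFF"},
        "Target": {
            "Arn": lambda_arn,
            # Role that EventBridge Scheduler assumes to invoke the function.
            "RoleArn": role_arn,
            # JSON string delivered to the Lambda function as its event.
            "Input": json.dumps(payload or {}),
        },
    }


def create_hourly_schedule(scheduler_client, name, lambda_arn, role_arn):
    """Create the schedule using a boto3 EventBridge Scheduler client."""
    return scheduler_client.create_schedule(
        **build_hourly_schedule(name, lambda_arn, role_arn)
    )
```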
I have multiple folders inside a bucket. Each folder is named with a unique GUID and will always contain a single file.
I need to fetch only those files which have never been read before. If I fetch all the objects at once and then filter on the client side, it might introduce latency in the near future, since hundreds of new folders could be added every day.
Initially I tried to list objects by specifying StartAfter, but I soon realized it only works against the alphabetically sorted list of keys.
https://docs.aws.amazon.com/AmazonS3/latest/API/API_ListObjectsV2.html
I am using the AWS SDK for C#. Can someone please give me some idea of the best approach?
Thanks
Amazon S3 does not maintain a concept of "objects that have not been accessed".
However, there is a different approach to process each object only once:
Create an Amazon S3 Event that will trigger when an object is created
The Event can then trigger:
An AWS Lambda function, or
Send a message to an Amazon SQS queue, or
Send a message to an Amazon SNS topic
You could therefore trigger your custom code via one of these methods, and you will never actually need to "search" for new objects.
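For the SQS option, a consumer could look something like the sketch below. It is written in Python for illustration (the same calls exist in the C# SDK) and assumes the bucket sends its notifications straight to the queue, not through SNS. `sqs_client` is expected to be a boto3 SQS client, and the function name is made up for this example.

```python
import json
from urllib.parse import unquote_plus


def receive_new_object_keys(sqs_client, queue_url):
    """Return (bucket, key) pairs from one batch of S3 event messages."""
    response = sqs_client.receive_message(
        QueueUrl=queue_url,
        MaxNumberOfMessages=10,  # SQS maximum per call
        WaitTimeSeconds=20,      # long polling
    )
    found = []
    for message in response.get("Messages", []):
        body = json.loads(message["Body"])
        # S3 sends an s3:TestEvent without "Records" when the
        # notification is first configured, so default to an empty list.
        for record in body.get("Records", []):
            bucket = record["s3"]["bucket"]["name"]
            key = unquote_plus(record["s3"]["object"]["key"])
            found.append((bucket, key))
        # Delete the message so it is not delivered again later.
        sqs_client.delete_message(
            QueueUrl=queue_url, ReceiptHandle=message["ReceiptHandle"]
        )
    return found
```

In a real consumer you would probably delete each message only after the file has been processed successfully. Standard SQS queues deliver at least once, so the processing should tolerate an occasional duplicate.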
There is a service that generates data in an S3 bucket that is used for warehouse querying. Data is inserted into S3 on a daily basis.
I am interested in copying that data from S3 to my service account to further classify the data. The classification needs to happen in my AWS service account because it is specific to my team/service and is based on information that exists only in my account. The service generating the data in S3 is not concerned with the classification, nor does it have the data needed to make the classification decision.
Each S3 file consists of JSON objects (records). For every record, I need to look into a DynamoDB table. Based on whether the data exists in the DynamoDB table, I need to add an attribute to the JSON object and store the list in another S3 bucket in my account.
The way I am considering doing this:
Trigger a scheduled CW event periodically to invoke a Lambda that will copy the files from the source S3 bucket into a bucket (let's say Bucket A) in my account.
Then, use another scheduled CW event to invoke a Lambda that reads the records in the JSON, compares them with the DynamoDB table to determine the classification, and writes the updated records to another bucket (let's say Bucket B).
I have a few questions regarding this:
Are there better alternatives for achieving this?
Would using aws s3 sync in the first Lambda be a good way to achieve this? My concerns revolve around the Lambdas timing out due to the large amount of data, especially the second Lambda, which needs to compare every record against DDB.
Rather than setting up scheduled events, you can trigger the AWS Lambda functions in real-time.
Use Amazon S3 Events to trigger the Lambda function as soon as a file is created in the source bucket. The Lambda function can call CopyObject() to copy the object to Bucket-A for processing.
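A sketch of that copy step, assuming `s3_client` is a boto3 S3 client and `dest_bucket` is the name of Bucket-A. The helper name is illustrative; your Lambda handler would call it with the incoming event.

```python
from urllib.parse import unquote_plus


def copy_new_objects(event, s3_client, dest_bucket):
    """Copy every object named in an S3 event into `dest_bucket`."""
    copied = []
    for record in event.get("Records", []):
        src_bucket = record["s3"]["bucket"]["name"]
        # Keys are URL-encoded in the event, so decode before using them.
        key = unquote_plus(record["s3"]["object"]["key"])
        # Server-side copy: the data does not pass through the function.
        s3_client.copy_object(
            Bucket=dest_bucket,
            Key=key,
            CopySource={"Bucket": src_bucket, "Key": key},
        )
        copied.append(key)
    return copied
```

A single CopyObject call handles objects up to 5 GB; larger objects need a multipart copy. Since the source bucket belongs to another account, the function's role would also need read access to it.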
Similarly, an Event on Bucket-A could then trigger another Lambda function to process the file. Some things to note:
Lambda functions run for a maximum of 15 minutes
You can increase the memory assigned to a Lambda function, which will also increase the amount of CPU assigned. So, this might speed up the function if it is taking longer than 15 minutes.
By default, 512 MB of temporary storage (/tmp) is available to a Lambda function. This can be configured up to 10,240 MB.
If the data is too big, or takes too long to process, then you will need to find a way to do it outside of AWS Lambda. For example, using Amazon EC2 instances.
If you can export the data from DynamoDB (perhaps on a regular basis), you might be able to use Amazon Athena to do all the processing, but that depends on what you're trying to do. If it is simple SELECT/JOIN queries, it might be suitable.
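The per-record lookup in the second Lambda function might be sketched as below. `table` is assumed to be a boto3 DynamoDB Table resource (for example `boto3.resource("dynamodb").Table("my-table")`), and `key_name` and `flag_name` are hypothetical attribute names, since the question does not give the real ones. Parsing the S3 file into a list of dicts is left out because the file layout is not specified.

```python
def classify_records(records, table, key_name="id", flag_name="classified"):
    """Add a boolean attribute to each record based on a DynamoDB lookup."""
    classified = []
    for record in records:
        response = table.get_item(Key={key_name: record[key_name]})
        # get_item omits "Item" from the response when no row matches.
        classified.append({**record, flag_name: "Item" in response})
    return classified
```

This makes one request per record, which is where the run time goes for large files. Batching the lookups (for example with BatchGetItem) would be a natural next step if the 15-minute limit becomes a problem.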
I have a key that is being shared among different services and it is currently stored in an s3 bucket inside a text file.
My goal is to read that value and pass it to my Lambda function through CloudFormation.
For an EC2 instance it was easy because I could download the file and read it, and that was easily achievable by putting the scripts inside my CloudFormation JSON file. But I don't have any idea how to do it for my Lambda functions.
I tried to put my credentials in the GitLab pipeline, but because of the access permissions it doesn't let GitLab pass them on, so my best and least expensive option right now is to do it in CloudFormation.
The easiest method would be to have the Lambda function read the information from Amazon S3.
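A minimal sketch of reading the file from inside the function, assuming `s3_client` is a boto3 S3 client and the object is a small UTF-8 text file. The bucket and key are whatever you use today.

```python
def read_shared_key(s3_client, bucket, key):
    """Return the contents of a small text object, without surrounding whitespace."""
    response = s3_client.get_object(Bucket=bucket, Key=key)
    # "Body" is a streaming object; read() returns the raw bytes.
    return response["Body"].read().decode("utf-8").strip()
```

The function's execution role needs s3:GetObject permission on that object. Calling this once outside the handler lets warm invocations reuse the value.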
The only way to get CloudFormation to "read" some information from Amazon S3 would be to create a Custom Resource, which involves writing an AWS Lambda function. However, since you already have a Lambda function, it would be easier to simply have that function read the object.
It's worth mentioning that, rather than storing such information in Amazon S3, you could use the AWS Systems Manager Parameter Store, which is a great place to store configuration information. Your various applications can then use Parameter Store to store and retrieve the configuration. CloudFormation can also access the Parameter Store.
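If you move the value into Parameter Store, reading it could look like this. It is a sketch, assuming `ssm_client` is `boto3.client("ssm")`; the parameter name is up to you.

```python
def read_parameter(ssm_client, name):
    """Fetch a single value from Systems Manager Parameter Store."""
    # WithDecryption only matters for SecureString parameters and is
    # ignored for plain String values.
    response = ssm_client.get_parameter(Name=name, WithDecryption=True)
    return response["Parameter"]["Value"]
```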
I have the following two use cases:
Case 1. I need to call the Lambda function on its own so that it invokes Athena to query S3 data. Question: how do I invoke the Lambda function on its own via an API?
Case 2. I need the Lambda function to invoke Athena whenever a file is copied to the S3 bucket that is already mapped to Athena.
I am following this link to run Athena queries from Lambda:
Link:
https://dev.classmethod.jp/cloud/run-amazon-athenas-query-with-aws-lambda/
For case 2, here is an example of what I want to integrate:
The file in s3-1 is sales.csv, and I would be updating the sales details by copying data from another bucket, s3-2. The schema/columns defined for the s3-1 data would remain the same.
So when I copy a file into the S3 location that is mapped to Athena, the Lambda function should call Athena to run the query.
I would appreciate it if you could suggest a better way to achieve the above cases.
Thanks
Case 1
An AWS Lambda function can be invoked directly via the Invoke API call. This can be done with the AWS Command-Line Interface (CLI) or from a programming language using an AWS SDK.
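From Python, a direct invocation might look like the sketch below. `lambda_client` is assumed to be `boto3.client("lambda")`, and the function name and payload are placeholders.

```python
import json


def invoke_function(lambda_client, function_name, payload):
    """Invoke a Lambda function synchronously and return its decoded result."""
    response = lambda_client.invoke(
        FunctionName=function_name,
        InvocationType="RequestResponse",  # wait for the function to finish
        Payload=json.dumps(payload).encode("utf-8"),
    )
    # The result comes back as a stream of JSON bytes.
    return json.loads(response["Payload"].read())
```

Passing `InvocationType="Event"` would queue the invocation without waiting, in which case there is no result to decode.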
Case 2
An Amazon S3 event can be configured on a bucket to automatically trigger an AWS Lambda function when a file is uploaded. The event provides the bucket name and file name (object name) to the Lambda function.
The Lambda function can extract these details from the event record and can then use that information in an Amazon Athena command.
Please note that, if the file name is different each time, a CREATE TABLE command would be required before a SELECT command can query the data.
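Starting the Athena query from the Lambda function could be sketched like this, assuming `athena_client` is `boto3.client("athena")`. The query text, database name, and results location are all placeholders.

```python
def start_athena_query(athena_client, query, database, output_location):
    """Submit a query to Athena and return its execution ID without waiting."""
    response = athena_client.start_query_execution(
        QueryString=query,
        QueryExecutionContext={"Database": database},
        # Athena writes the result files to this S3 location.
        ResultConfiguration={"OutputLocation": output_location},
    )
    return response["QueryExecutionId"]
```

For example, the handler for the S3 event might call `start_athena_query(athena, "SELECT * FROM sales LIMIT 10", "mydb", "s3://my-results/")`. The call returns as soon as the query is accepted, so the function is not billed while Athena runs.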
General Comments
A Lambda function can run for a maximum of 15 minutes, so make sure the Athena queries do not take more than this time. This is not a particularly efficient use of an AWS Lambda function because it will be billed for the duration of the function call, even if it is just waiting for Athena to finish.
Another option would be to have the Lambda function directly process the file, assuming that the query is not particularly complex. For example, the Lambda function could download the file to temporary storage (512 MB by default, configurable up to 10,240 MB), read through the file, do some calculations (e.g. add up the total of some columns), then store the results somewhere.
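As an illustration of processing the file directly, the sketch below totals one column of a CSV object. It assumes the file has a header row, fits in memory, and that `s3_client` is a boto3 S3 client. The column name is hypothetical.

```python
import csv
import io


def sum_column(csv_text, column):
    """Add up the numeric values in one named column of CSV text."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return sum(float(row[column]) for row in reader)


def total_for_object(s3_client, bucket, key, column):
    """Read a CSV object from S3 and total one of its columns."""
    body = s3_client.get_object(Bucket=bucket, Key=key)["Body"].read()
    return sum_column(body.decode("utf-8"), column)
```

This reads the object into memory instead of writing it to temporary storage, which is simpler for small files such as a typical sales.csv.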
The next step would be to create an endpoint for your Lambda function. You can use Amazon API Gateway for that.
Alternatively, you can invoke the Lambda function from the AWS console or the AWS CLI in order to test it.
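If you put API Gateway in front of the function with Lambda proxy integration, the handler has to return a response in the shape API Gateway expects. A minimal sketch, where the `query` parameter is purely hypothetical:

```python
import json


def lambda_handler(event, context):
    """Return an API Gateway (Lambda proxy integration) style response."""
    # queryStringParameters is None when the request has no query string.
    params = event.get("queryStringParameters") or {}
    query_name = params.get("query", "default")  # hypothetical parameter
    return {
        "statusCode": 200,
        "headers": {"Content-Type": "application/json"},
        # The body must be a string, so the payload is serialized here.
        "body": json.dumps({"requested": query_name}),
    }
```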