S3 efficiency: overwrite versus read - amazon-web-services

I just finished the following function for getting customer data from my Shopify store into an S3 bucket. Here is what happens now: a trigger runs this Lambda on a daily basis, and all customers are written to an S3 bucket. Every already existing entry is simply overwritten; new customers are added.
My question is: is this a scalable approach, or should I read all the files and compare timestamps so that only the new entries are added? Or is this second approach maybe worse?
import requests
import json
import boto3

s3 = boto3.client('s3')
bucket = 'testbucket'
url2 = "something.json"

def getCustomers():
    r = requests.get(url2)
    return r.json()

def lambda_handler(event, context):
    data = getCustomers()
    for customer in data["customers"]:
        # create a unique id for each customer
        customer_id = str(customer["id"])
        # create a file name to put the customer in bucket
        file_name = 'customers' + '/' + customer_id + '.json'
        # Saving .json to s3
        customer_string = str(customer)
        uploadByteStream = bytes(customer_string.encode('UTF-8'))
        s3.put_object(Bucket=bucket, Key=file_name, Body=uploadByteStream)
    return {
        'statusCode': 200,
        'body': json.dumps('Success')
    }
An example response is the following:
{
  "id": 71806090000,
  "email": "something#gmail.com",
  "accepts_marketing": false,
  "created_at": "2021-07-27T11:06:38+02:00",
  "updated_at": "2021-07-27T11:11:58+02:00",
  "first_name": "Bertje",
  "last_name": "Bertens",
  "orders_count": 0,
  "state": "disabled",
  "total_spent": "0.00",
  "last_order_id": null,
  "note": "",
  "verified_email": true,
  "multipass_identifier": null,
  "tax_exempt": false,
  "phone": "+32470000000",
  "tags": "",
  "last_order_name": null,
  "currency": "EUR",
  "addresses": [
    {
      "id": 6623179276486,
      "customer_id": 5371846099142,
      "first_name": "Bertje",
      "last_name": "Bertens",
      "company": "",
      "address1": "Somewhere",
      "address2": "",
      "city": "Somecity",
      "province": null,
      "country": "",
      "zip": "0000",
      "phone": null,
      "name": "Bertje Bertens",
      "province_code": null,
      "country_code": null,
      "country_name": "",
      "default": true
    }
  ],
  "accepts_marketing_updated_at": "2021-07-27T11:11:35+02:00",
  "marketing_opt_in_level": null,
  "tax_exemptions": [],
  "admin_graphql_api_id": "",
  "default_address": {
    "id": 6623179276486,
    "customer_id": 5371846099142,
    "first_name": "Bertje",
    "last_name": "Bertens",
    "company": "",
    "address1": "Somewhere",
    "address2": "",
    "city": "Somecity",
    "province": null,
    "country": "",
    "zip": "0000",
    "phone": null,
    "name": "Bertje Bertens",
    "province_code": null,
    "country_code": null,
    "country_name": "",
    "default": true
  }
}

Is this a scalable approach or should I read all the files and compare timestamps to only add the new entries? Or is this second approach maybe worse?
Generally speaking, you're not going to run into many scalability problems with a daily task utilizing Lambda and S3.
Some considerations:
Costs
a. Lambda execution costs. The longer your Lambda runs, the more you pay.
b. S3 transfer costs. Unless you run your Lambda in a VPC and set up a VPC endpoint for your bucket, you pay S3 transfer costs from Lambda -> internet (-> S3).
Lambda execution timeouts.
If you have many files to upload, you may eventually run into a situation where there are so many files to transfer that the work can't be completed within a single invocation.
Fault tolerance
Right now, if your lambda fails for some reason, you'll drop all the work for the day.
How do these two approaches bear on these considerations?
For (1) you simply have to calculate your costs. Technically, the approach of checking the timestamp first will help you here. However, my guess is that, if you're only running this on a daily basis within a single invocation, the costs are minimal right now and not of much concern. We're talking pennies per month at most (~$0.05/mo at a full 15-minute invocation once daily, plus transfer costs).
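To make that estimate concrete, the rough arithmetic (assuming a 128 MB function and the standard x86 Lambda rate of about $0.0000166667 per GB-second, both assumptions rather than figures from the question) is:
15 min/day × 30 days = 27,000 seconds/month
27,000 s × 0.125 GB = 3,375 GB-seconds
3,375 GB-s × $0.0000166667/GB-s ≈ $0.056/month, plus request and transfer charges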
For (2) the approach of checking timestamps is also somewhat better, but doesn't truly address the scalability issue. If you expect you may eventually reach a point where you will run out of execution time in Lambda, you may want to consider a new architecture for the solution.
For (3) neither approach has any real bearing. Either way, you have the same fault tolerance problem.
Possible alternative architecture components to address these areas may include:
use of SQS to queue file transfers (helps with decoupling, and a DLQ adds fault tolerance); a rough sketch follows this list
use of scheduled (fargate) ECS tasks instead of Lambda for compute (deal with Lambda timeout limitations) OR have lambda consume the queue in batches
S3 VPC endpoints and in-vpc compute (optimize s3 transfer; likely not cost effective until much larger scale)
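As a rough illustration of the SQS option above, the scheduled Lambda could enqueue one message per customer and let a queue consumer (a Lambda processing batches, or an ECS task) do the actual S3 writes. This is only a sketch; the queue URL environment variable and the payload shape are assumptions, not part of the original setup:
import json
import os
import boto3

sqs = boto3.client('sqs')
QUEUE_URL = os.environ['CUSTOMER_QUEUE_URL']  # assumed environment variable

def enqueue_customers(customers):
    # One message per customer; a separate consumer does the actual S3 put_object,
    # so a failed write only re-drives that one message (or lands in the DLQ).
    for start in range(0, len(customers), 10):
        entries = [
            {'Id': str(c['id']), 'MessageBody': json.dumps(c)}
            for c in customers[start:start + 10]
        ]
        # send_message_batch accepts at most 10 entries per call
        sqs.send_message_batch(QueueUrl=QUEUE_URL, Entries=entries)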
So, to answer the question directly in summary:
The current solution has some scalability concerns, namely the Lambda execution timeout, as well as fault-tolerance concerns. The second approach does introduce optimizations, but they do not address those scalability concerns, and the value you get from it may not be significant.
In any case, what you propose makes sense and shouldn't take much effort to implement.
...
# requires `import datetime` at the top of the module
customer_updated_at = datetime.datetime.fromisoformat(customer['updated_at'])
file_name = 'customers' + '/' + customer_id + '.json'
try:
    # Send a HEAD request to check the date and see if we need to update it
    response = s3.head_object(Bucket=bucket, Key=file_name)
    s3_modified = response["LastModified"]
except s3.exceptions.ClientError:
    # Object doesn't exist yet, so it always needs to be uploaded
    s3_modified = None
if s3_modified is None or customer_updated_at > s3_modified:
    # Saving .json to s3
    customer_string = str(customer)
    uploadByteStream = bytes(customer_string.encode('UTF-8'))
    s3.put_object(Bucket=bucket, Key=file_name, Body=uploadByteStream)
else:
    print('s3 version is up to date, no need to upload')

It will work as long as you manage to finish the whole process within the max 15 minute timeout of Lambda.
S3 is built to scale to much more demanding workloads ;-)
But:
It's very inefficient, as you already observed. A better implementation would be to keep track of the timestamp of the last full load somewhere, e.g. in DynamoDB or the Systems Manager Parameter Store, and only write the customers whose "created_at" or "updated_at" attributes are after the last successful full load. At the end you update the full-load timestamp.
Here is some pseudo code:
last_full_load_date = get_last_full_load() or '1900-01-01T00:00:00Z'
customers = get_customers()

for customer in customers:
    if customer.created_at >= last_full_load_date or customer.updated_at >= last_full_load_date:
        write_customer(customer)

set_last_full_load(datetime.now())
This way you only write data that has actually changed (assuming the API is reliable).
This also has the benefit that you'll be able to retry if something goes wrong during writing, since you only update the last_full_load time at the end. Alternatively, you could keep track of the last modified time per user, but that doesn't seem necessary if you do a bulk load anyway.
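If it helps, here is a minimal concrete sketch of that pattern using the Systems Manager Parameter Store mentioned above; the parameter name and the write_customer callable are placeholders for illustration, not part of the original answer:
import datetime
import boto3

ssm = boto3.client('ssm')
PARAM_NAME = '/shopify-export/last-full-load'  # placeholder parameter name

def get_last_full_load():
    try:
        return ssm.get_parameter(Name=PARAM_NAME)['Parameter']['Value']
    except ssm.exceptions.ParameterNotFound:
        return None

def set_last_full_load(timestamp_iso):
    ssm.put_parameter(Name=PARAM_NAME, Value=timestamp_iso, Type='String', Overwrite=True)

def full_load(customers, write_customer):
    last_load = datetime.datetime.fromisoformat(get_last_full_load() or '1900-01-01T00:00:00+00:00')
    started_at = datetime.datetime.now(datetime.timezone.utc)
    for customer in customers:
        created = datetime.datetime.fromisoformat(customer['created_at'])
        updated = datetime.datetime.fromisoformat(customer['updated_at'])
        if created >= last_load or updated >= last_load:
            write_customer(customer)
    # Only record success once every changed customer has been written,
    # so a failed run is simply retried from the previous timestamp.
    set_last_full_load(started_at.isoformat())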

Related

What timestamp is being used by Amazon Connect for recordings filename? Initiation timestamp or Disconnect timestamp?

As we know, Amazon Connect records the calls and stores the recordings in an S3 bucket. I am looking for which timestamp I can use to build the filename myself in my code.
There are two timestamps, the Initiation timestamp and the Disconnect timestamp, and the filenames are created as contactId_timestamp_UTC, e.g. 7bb75057-76ae-4e7e-a140-44a50cc5954b_20220418T06:44_UTC.wav.
I have used the callStartTime to create these filenames and then fetched the files from S3 using a signed URL, but in a few cases there is a difference of one second (the file is stored on S3 with the timestamp one second later) and I couldn't get the file from S3.
For example, the filename created by my application is 7bb75057-76ae-4e7e-a140-44a50cc5954b_20220418T06:44_UTC.wav, but the recording stored on S3 has the filename 7bb75057-76ae-4e7e-a140-44a50cc5954b_20220418T06:45_UTC.wav.
One last thing: is this data (the timestamp) available in the contact object, so I can use it?
Looks like the timestamp for the file name is based on ConnectedToAgentTimestamp which makes sense as the recording doesn't start until the caller is talking to an agent. ConnectedToAgentTimestamp is under AgentInfo in the contact details...
{
  "Contact": {
    "Arn": "arn:aws:connect:us-west-2:xxxxxxxxxx:instance/xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx/contact/xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx",
    "Id": "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx",
    "InitiationMethod": "INBOUND",
    "Channel": "VOICE",
    "QueueInfo": {
      "Id": "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx",
      "EnqueueTimestamp": "2022-04-13T15:05:45.334000+12:00"
    },
    "AgentInfo": {
      "Id": "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx",
      "ConnectedToAgentTimestamp": "2022-04-13T15:06:25.706000+12:00"
    },
    "InitiationTimestamp": "2022-04-13T15:05:15.869000+12:00",
    "DisconnectTimestamp": "2022-04-13T15:08:08.298000+12:00",
    "LastUpdateTimestamp": "2022-04-13T15:08:08.299000+12:00"
  }
}
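If it helps, here is a minimal boto3 sketch of building that filename from ConnectedToAgentTimestamp; the instance ID, contact ID, and the exact filename pattern are assumptions you would want to verify against your own recordings:
import datetime
import boto3

connect = boto3.client('connect')

def recording_filename(instance_id, contact_id):
    # describe_contact returns the same Contact structure shown above
    contact = connect.describe_contact(InstanceId=instance_id, ContactId=contact_id)['Contact']
    connected_at = contact['AgentInfo']['ConnectedToAgentTimestamp']  # timezone-aware datetime
    # Recordings appear to be named contactId_YYYYMMDDTHH:MM_UTC.wav (assumed pattern)
    stamp = connected_at.astimezone(datetime.timezone.utc).strftime('%Y%m%dT%H:%M')
    return f'{contact_id}_{stamp}_UTC.wav'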

What's a "cloud-native" way to convert a Location History REST API into AWS Location pings?

My use case: I've got a Spot Tracker that sends location data up every 5 minutes. I'd like to get these pings into AWS Location, so I can do geofencing, mapping, and other fun stuff with them.
Spot offers a REST API that will show the last X number of events, such as:
"messages": {
"message": [
{
"id": 1605371088,
"latitude": 41.26519,
"longitude": -95.99069,
"dateTime": "2021-06-26T23:21:24+0000",
"batteryState": "GOOD",
"altitude": -103
},
{
"id": 1605371124,
"latitude": 41.2639,
"longitude": -95.98545,
"dateTime": "2021-06-26T23:11:24+0000",
"altitude": 0
},
{
"id": 1605365385,
"latitude": 41.25448,
"longitude": -95.94189,
"dateTime": "2021-06-26T23:06:01+0000",
"altitude": -103
},
...
]
}
What's the most idiomatic, cloud-native way to turn these into pings that go into AWS Location?
My initial approach (diagram omitted here) is to use a timed Lambda to periodically hit the Spot endpoint and keep track of the latest ping I've sent out in a store like DynamoDB.
I'm not an AWS expert, but I feel like there must be a cleaner integration. Are there other tools that would help with this? Is there anything in AWS IoT, for example, that would help me not have to keep track of the last one I uploaded?
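For what it's worth, the forwarding step in that Lambda can stay small. A minimal sketch, assuming a tracker named spot-tracker, a single device ID, and the Spot message shape shown above (the DynamoDB bookkeeping for the last-sent id is omitted):
import datetime
import boto3

location = boto3.client('location')

def forward_pings(messages, last_sent_id):
    # Only forward Spot messages newer than the last id recorded in DynamoDB.
    updates = [
        {
            'DeviceId': 'spot-device-1',                    # placeholder device id
            'Position': [m['longitude'], m['latitude']],    # AWS Location expects [lon, lat]
            'SampleTime': datetime.datetime.strptime(m['dateTime'], '%Y-%m-%dT%H:%M:%S%z'),
        }
        for m in messages
        if m['id'] > last_sent_id
    ]
    # batch_update_device_position takes at most 10 updates per call, so chunk them.
    for start in range(0, len(updates), 10):
        location.batch_update_device_position(TrackerName='spot-tracker',
                                              Updates=updates[start:start + 10])
    return max((m['id'] for m in messages), default=last_sent_id)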

I want to find out the total RAM size of an AWS RDS instance through a Python Lambda. I tried the code below and got an empty result set. Is there any other way to find this?

import json
import boto3, datetime

# AWS_REGION must be defined, e.g. read from an environment variable
def lambda_handler(event, context):
    cloudwatch = boto3.client('cloudwatch', region_name=AWS_REGION)
    response = cloudwatch.get_metric_data(
        MetricDataQueries=[
            {
                'Id': 'memory',
                'MetricStat': {
                    'Metric': {
                        'Namespace': 'AWS/RDS',
                        'MetricName': 'TotalMemory',
                        'Dimensions': [
                            {
                                "Name": "DBInstanceIdentifier",
                                "Value": "mydb"
                            }
                        ]
                    },
                    'Period': 30,
                    'Stat': 'Average',
                }
            }
        ],
        StartTime=(datetime.datetime.now() - datetime.timedelta(seconds=300)).timestamp(),
        EndTime=datetime.datetime.now().timestamp()
    )
    print(response)
The result is like below:
{'MetricDataResults': [{'Id': 'memory', 'Label': 'TotalMemory', 'Timestamps': [], 'Values': [], 'StatusCode': 'Complete'}]}
If you are looking for the configured vCPU/memory, it seems you need to call the DescribeDBInstances API to get the DBInstanceClass, which determines the hardware specification of the instance.
To query CloudWatch you need to use one of the RDS metric names from https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/MonitoringOverview.html#rds-metrics; TotalMemory is not one of them, but the currently available memory can be retrieved with FreeableMemory. Using that metric name in your sample code, I was able to get data (in bytes) matching what the RDS Monitoring console shows.
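As a quick sketch, the only change needed to the query from the question is the metric name (shown here trimmed down; the instance identifier is the one from the question):
import datetime
import boto3

cloudwatch = boto3.client('cloudwatch')

response = cloudwatch.get_metric_data(
    MetricDataQueries=[{
        'Id': 'freeable_memory',
        'MetricStat': {
            'Metric': {
                'Namespace': 'AWS/RDS',
                'MetricName': 'FreeableMemory',  # bytes of memory currently available
                'Dimensions': [{'Name': 'DBInstanceIdentifier', 'Value': 'mydb'}],
            },
            'Period': 60,
            'Stat': 'Average',
        },
    }],
    StartTime=datetime.datetime.utcnow() - datetime.timedelta(minutes=10),
    EndTime=datetime.datetime.utcnow(),
)
print(response['MetricDataResults'][0]['Values'])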
You can check the total amount of memory and other useful information associated with the RDS in the CloudWatch console.
Step 1: Go to the CloudWatch console and navigate to Log groups.
Step 2: Search for RDSOSMetrics in the search bar.
Step 3: Click on the log stream. You will find all the details in the JSON; your total memory is in the field memory.total. A sample result looks like this:
{
  "engine": "MYSQL",
  "instanceID": "dbName",
  "uptime": "283 days, 21:08:36",
  "memory": {
    "writeback": 0,
    "free": 171696,
    "hugePagesTotal": 0,
    "inactive": 1652000,
    "pageTables": 19716,
    "dirty": 324,
    "active": 5850016,
    "total": 7877180,
    "buffers": 244312
  }
}
I have intentionally trimmed the JSON because of its size, but there are many other useful fields in it.
You can use the jq command-line utility to extract the fields you want from these log groups.
You can read more about this in the Enhanced Monitoring documentation.
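If you want the same number from a Lambda rather than the console, the RDSOSMetrics log group can also be read with the CloudWatch Logs API. A rough sketch; the log stream name is the DB instance's resource ID (the DbiResourceId value from DescribeDBInstances), shown here as a placeholder:
import json
import boto3

logs = boto3.client('logs')

def total_memory_kb(resource_id):
    # Enhanced Monitoring writes one JSON document per interval to RDSOSMetrics,
    # using the DB instance's resource ID as the log stream name.
    events = logs.get_log_events(
        logGroupName='RDSOSMetrics',
        logStreamName=resource_id,   # e.g. 'db-ABCDEFGHIJKLMNOP' (placeholder)
        limit=1,
        startFromHead=False,         # newest event first
    )['events']
    latest = json.loads(events[0]['message'])
    return latest['memory']['total']  # the same memory.total field as in the sample above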

How to specify attributes to return from DynamoDB through AppSync

I have an AppSync pipeline resolver. The first function queries an ElasticSearch database for the DynamoDB keys. The second function queries DynamoDB using the provided keys. This was all working well until I ran into the 1 MB limit of AppSync. Since most of the data is in a few attributes/columns I don't need, I want to limit the results to just the attributes I need.
I tried adding AttributesToGet and ProjectionExpression (from here) but both gave errors like:
{
  "data": {
    "getItems": null
  },
  "errors": [
    {
      "path": [
        "getItems"
      ],
      "data": null,
      "errorType": "MappingTemplate",
      "errorInfo": null,
      "locations": [
        {
          "line": 2,
          "column": 3,
          "sourceName": null
        }
      ],
      "message": "Unsupported element '$[tables][dev-table-name][projectionExpression]'."
    }
  ]
}
My DynamoDB function request mapping template looks like (returns results as long as data is less than 1 MB):
#set($ids = [])
#foreach($pResult in ${ctx.prev.result})
  #set($map = {})
  $util.qr($map.put("id", $util.dynamodb.toString($pResult.id)))
  $util.qr($map.put("ouId", $util.dynamodb.toString($pResult.ouId)))
  $util.qr($ids.add($map))
#end
{
  "version" : "2018-05-29",
  "operation" : "BatchGetItem",
  "tables" : {
    "dev-table-name": {
      "keys": $util.toJson($ids),
      "consistentRead": false
    }
  }
}
I contacted the AWS people who confirmed that ProjectionExpression is not supported currently and that it will be a while before they will get to it.
Instead, I created a lambda to pull the data from DynamoDB.
To limit the results from DynamoDB I used $ctx.info.selectionSetList in AppSync to get the list of requested columns, then used that list to specify the data to pull from DynamoDB. I needed to get multiple results while maintaining order, so I used BatchGetItem, then merged the results with the original list of IDs using LINQ (which put the DynamoDB results back in the correct order, since BatchGetItem in C# does not preserve sort order the way the AppSync version does).
Because I was using C# with a number of libraries, the cold start time was a little long, so I used Lambda Layers pre-JITed for Linux, which brought the cold start down from ~1.8 seconds to ~1 second (with 1024 MB of RAM for the Lambda).
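The answer above used C#, but for reference a rough Python equivalent of the same idea (selectionSetList from AppSync mapped to a ProjectionExpression on BatchGetItem) might look like this; the table name, key names, and event shape are assumptions:
import boto3

dynamodb = boto3.client('dynamodb')
TABLE = 'dev-table-name'  # table name from the question; adjust as needed

def lambda_handler(event, context):
    # Assumes the AppSync resolver passes $ctx.info.selectionSetList and the keys
    # from the previous pipeline function in the Lambda payload.
    fields = event['selectionSetList']   # e.g. ['id', 'ouId', 'title']
    keys = event['keys']                 # e.g. [{'id': '1', 'ouId': 'a'}, ...]

    # Always project the key attributes (needed to restore order) and alias
    # everything to avoid clashes with DynamoDB reserved words.
    wanted = list(dict.fromkeys(['id', 'ouId'] + fields))
    names = {f'#f{i}': attr for i, attr in enumerate(wanted)}

    result = dynamodb.batch_get_item(RequestItems={
        TABLE: {
            'Keys': [{'id': {'S': k['id']}, 'ouId': {'S': k['ouId']}} for k in keys],
            'ProjectionExpression': ', '.join(names),
            'ExpressionAttributeNames': names,
        }
    })
    items = result['Responses'][TABLE]

    # BatchGetItem does not preserve order, so re-sort to match the requested keys.
    order = {(k['id'], k['ouId']): i for i, k in enumerate(keys)}
    items.sort(key=lambda item: order[(item['id']['S'], item['ouId']['S'])])
    return items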
AppSync doesn't support projection but you can explicitly define what fields to return in the response template instead of returning the entire result set.
{
  "id": "$ctx.result.get('id')",
  "name": "$ctx.result.get('name')",
  ...
}

How to disable (or redirect) logging on an AWS Step Function that calls parallel Lambda functions

I'm running an AWS step function with parallel execution branches.
Each branch succeeds individually; however, the overall function fails with the following error:
States.DataLimitExceeded - The state/task returned a result with a size exceeding the maximum number of characters service limit.
I then found an article from AWS that describes this issue and suggests a workaround:
https://docs.aws.amazon.com/step-functions/latest/dg/connect-lambda.html
That article says:
The Lambda invoke API includes logs in the response by default. Multiple Lambda invocations in a workflow can trigger States.DataLimitExceeded errors. To avoid this, include "LogType" = "None" as a parameter when you invoke your Lambda functions.
My question is where exactly do I put it? I've tried putting it various places in the state machine definition, however I get the following error:
The field 'LogType' is not supported by Step Functions
That error seems contrary to the support article, so perhaps I'm doing it wrong!
Any advice is appreciated, thanks in advance!
Cheers
UPDATE 1:
To be clear, this is a parallel function with 26 parallel branches. Each branch has a small output, as per the example below. The biggest item in this data is the LogResult, which (when base64 decoded) is just the billing info. I think this info multiplied by 26 has led to the error, so I just want to turn this LogResult off!
{
"ExecutedVersion": "$LATEST",
"LogResult": "U1RBUlQgUmVxdWVzdElkOiBlODJjZTRkOS0zMjk2LTRlNDctYjcyZC1iYmEwMzI1YmM3MGUgVmVyc2lvbjogJExBVEVTVApFTkQgUmVxdWVzdElkOiBlODJjZTRkOS0zMjk2LTRlNDctYjcyZC1iYmEwMzI1YmM3MGUKUkVQT1JUIFJlcXVlc3RJZDogZTgyY2U0ZDktMzI5Ni00ZTQ3LWI3MmQtYmJhMDMyNWJjNzBlCUR1cmF0aW9uOiA3NzI5Ljc2IG1zCUJpbGxlZCBEdXJhdGlvbjogNzgwMCBtcwlNZW1vcnkgU2l6ZTogMTAyNCBNQglNYXggTWVtb3J5IFVzZWQ6IDEwNCBNQglJbml0IER1cmF0aW9uOiAxMTY0Ljc3IG1zCQo=",
"Payload": {
"statusCode": 200,
"body": {
"signs": 63,
"nil": ""
}
},
"SdkHttpMetadata": {
"HttpHeaders": {
"Connection": "keep-alive",
"Content-Length": "53",
"Content-Type": "application/json",
"Date": "Thu, 21 Nov 2019 04:00:42 GMT",
"X-Amz-Executed-Version": "$LATEST",
"X-Amz-Log-Result": "U1RBUlQgUmVxdWVzdElkOiBlODJjZTRkOS0zMjk2LTRlNDctYjcyZC1iYmEwMzI1YmM3MGUgVmVyc2lvbjogJExBVEVTVApFTkQgUmVxdWVzdElkOiBlODJjZTRkOS0zMjk2LTRlNDctYjcyZC1iYmEwMzI1YmM3MGUKUkVQT1JUIFJlcXVlc3RJZDogZTgyY2U0ZDktMzI5Ni00ZTQ3LWI3MmQtYmJhMDMyNWJjNzBlCUR1cmF0aW9uOiA3NzI5Ljc2IG1zCUJpbGxlZCBEdXJhdGlvbjogNzgwMCBtcwlNZW1vcnkgU2l6ZTogMTAyNCBNQglNYXggTWVtb3J5IFVzZWQ6IDEwNCBNQglJbml0IER1cmF0aW9uOiAxMTY0Ljc3IG1zCQo=",
"x-amzn-Remapped-Content-Length": "0",
"x-amzn-RequestId": "e82ce4d9-3296-4e47-b72d-bba0325bc70e",
"X-Amzn-Trace-Id": "root=1-5dd60be1-47c4669ce54d5208b92b52a4;sampled=0"
},
"HttpStatusCode": 200
},
"SdkResponseMetadata": {
"RequestId": "e82ce4d9-3296-4e47-b72d-bba0325bc70e"
},
"StatusCode": 200
}
I ran into exactly the same problem as you recently. You haven't said what your lambdas are doing or returning; however, I found that AWS documents the limits that tasks have within executions: https://docs.aws.amazon.com/step-functions/latest/dg/limits.html#service-limits-task-executions.
What I found was that my particular lambda had an extremely long response with 10s of thousands of characters. Amending that so that the response from the lambda was more reasonable got past the error in the step function.
I had the same problem a week ago.
The way I solved it is as follows:
You can define which portion of the result is transmitted to the next step.
For that you have to use:
"OutputPath": "$.part2",
In your JSON input you have:
"part1": {
"portion1": {
"procedure": "Delete_X"
},
"portion2":{
"procedure": "Load_X"
}
},
"part2": {
"portion1": {
"procedure": "Delete_Y"
},
"portion2":{
"procedure": "Load_Y"
}
}
Once part1 is processed, you make sure that part1 (and the ResultPath related to it) is not sent in the output; only part2, which is needed by the following steps, is passed on.
You do that with: "OutputPath": "$.part2"
Let me know if that helps.
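For context, one way this can look in a state definition (the state names and Lambda ARN are placeholders): the task result is tucked away under ResultPath, and OutputPath then keeps only part2 of the combined document.
"ProcessPart1": {
  "Type": "Task",
  "Resource": "arn:aws:lambda:REGION:ACCOUNT:function:YOUR-FUNCTION",
  "ResultPath": "$.part1_result",
  "OutputPath": "$.part2",
  "Next": "ProcessPart2"
}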
I got stuck on the same issue. Step function imposes a limit of 32,768 characters on the data that can be passed around between two states.
https://docs.aws.amazon.com/step-functions/latest/dg/limits.html
Maybe you need to think about and break down your problem in a different way? That's what I did, because removing the log response gives you some headroom, but the solution will not scale past a certain limit.
I handle large data in my Step Functions by storing the result in an S3 bucket, and then having my State Machine return the path to the result-file (and a brief summary of the data or a status like PASS/FAIL).
The same could be done using a DB if that's more comfortable.
This way you won't have to modify your results' current format; you can just pass the reference around instead of a huge amount of data, and the results are persisted for as long as you'd like to keep them.
The start of the Lambdas looks something like this to figure out if the input is from a file or plain data:
bucket_name = util.env('BUCKET_NAME')
if 'result_path' in input_data.keys():
    # Results are in a file that is referenced.
    try:
        result_path = input_data['result_path']
        result_data = util.get_file_content(result_path, bucket_name)
    except Exception as e:
        report.append(f'Failed to parse JSON from {result_path}: {e}')
else:
    # Results are just raw data, not a reference.
    result_data = input_data
Then at the end of the Lambda they will upload their results and return directions to that file:
import json
import boto3

def upload_results_to_s3(bucket_name, filename, result_data_to_upload):
    try:
        s3 = boto3.resource('s3')
        results_prefix = 'Path/In/S3/'
        results_suffix = '_Results.json'
        result_file_path = '' + results_prefix + filename + results_suffix
        s3.Object(bucket_name, result_file_path).put(
            Body=(bytes(json.dumps(result_data_to_upload, indent=2).encode('UTF-8')))
        )
        return result_file_path
    except Exception:
        # (error handling elided in the original answer)
        raise

# ...and at the end of the Lambda handler:
result_path = upload_results_to_s3(bucket_name, filename, result_data_to_upload)
result_obj = {
    "result_path": result_path,
    "bucket_name": bucket_name
}
return result_obj
Then the next Lambda will have the first code snippet in it, in order to get the input from the file.
The Step Function Nodes look like this, where the Result will be result_obj in the python code above:
"YOUR STATE":
{
"Comment": "Call Lambda that puts results in file",
"Type": "Task",
"Resource": "arn:aws:lambda:YOUR LAMBDA ARN",
"InputPath": "$.next_function_input",
"ResultPath": "$.next_function_input",
"Next": "YOUR-NEXT-STATE"
}
Something you can do is add "emptyOutputPath": "" to your JSON:
"emptyOutputPath": "",
"part1": {
  "portion1": { "procedure": "Delete_X" },
  "portion2": { "procedure": "Load_X" }
},
"part2": {
  "portion1": { "procedure": "Delete_Y" },
  "portion2": { "procedure": "Load_Y" }
}
That will allow you to set "OutputPath": "$.emptyOutputPath", which is empty, and will clear ResultPath.
Hope that helps
Just following up on this issue to close the loop.
I basically gave up on using parallel Lambdas in favour of SQS message queues instead.