Communication between concurrent AWS Lambdas - amazon-web-services

I want a set of concurrent AWS Lambda functions to write and read data from a common place. They will read the last post introduced and perhaps write a new (last) post. Posts will be short (less than 1 kB) and in JSON format. Once the last concurrent Lambda finishes executing, the posts will be deleted.
I am considering using AWS DynamoDB. Do you think this is the best option? Is it possible to use AWS CloudWatch?

There are definitely two options: DynamoDB and ElastiCache. This is a pretty basic answer; I'm sure there are more things to consider, but I'm just focusing on this use case.
Dynamo
Pros: serverless
Cons:
You will probably need strongly consistent reads to fetch the most recent update when writes land sub-millisecond apart.
It could also become more expensive than ElastiCache, depending on how many reads/writes you are making.
ElastiCache
Pros: faster than DynamoDB, as it is in-memory.
Cons: you need to run a cluster, so you will be paying some minimum cost.
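If you go the DynamoDB route, here is a minimal sketch of the read/write pair with boto3 (the posts table name and the single well-known key for the latest post are assumptions for illustration):

```python
import json
import boto3

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("posts")  # hypothetical table name

def read_last_post():
    # ConsistentRead avoids fetching a stale copy when another
    # Lambda wrote the item only milliseconds earlier.
    resp = table.get_item(
        Key={"pk": "last"},  # single well-known key holding the latest post
        ConsistentRead=True,
    )
    item = resp.get("Item")
    return json.loads(item["body"]) if item else None

def write_last_post(post: dict):
    table.put_item(Item={"pk": "last", "body": json.dumps(post)})
```

Note that strongly consistent reads cost twice as much as eventually consistent ones, which feeds into the cost comparison above.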

Related

Is SQS better than DynamoDB for peak loads?

A service runs on ECS and writes the requested URL to a DynamoDB table. Dynamic scaling was activated to keep the costs for DynamoDB from becoming too high. DynamoDB scales more slowly than requests come in at peak times, so some calls are not logged. My question now is whether writing to SQS would be the better way here, because the documentation says:
Standard queues support a nearly unlimited number of API calls per second, per API action (SendMessage, ReceiveMessage, or DeleteMessage).
Of course, the messages would then have to be written back to DynamoDB, but another service can then do that.
Is the throughput of messages per second to SQS really unlimited, and is it therefore cheaper to send messages to SQS instead of increasing DynamoDB's writes per second?
I don't know if this qualifies as a good answer, but remembering a discussion with my architect at the time, we concluded that having a queue for precisely this problem is good practice, regardless of load. It keeps requests even if services go down, so there is an added benefit.
SQS and DynamoDB fit two very different use cases. It's not so much which is better; it's which is right for what you need.
DynamoDB is a NoSQL document-based database. It is best when you have known access patterns to data that needs to persist over time, that you need to access quickly, but probably are not making many changes to (or at least the changes do not have to be accessible absolutely immediately, within sub-5 ms). Each document in DynamoDB is similar (but also very different) to a row in a standard SQL table, in that it has attributes (columns) and keys (partition and sort key) and is retrievable through a query (though dynamic, on-the-fly queries are NOT good for DynamoDB).
SQS is a queue system. It is not a persistent data store: payloads of JSON objects are dropped into the queue and then processed by some endpoint, either a Lambda, or something that puts them into DynamoDB, or something else entirely, depending on your product's use case. It is perfect for when you often receive bursts of data but your system needs some time to handle each individual payload, for example because it is waiting on other systems to finish before it can handle the next one, so instead of scaling horizontally (handling all the payloads in parallel) you work through them with a single worker or only a few. You cannot access the data while it is waiting in the queue, and you cannot query it; you can only wait until each message pops off the queue and into processing by whatever system you have set up to receive it.
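To make the buffering pattern concrete, here is a minimal sketch, assuming a queue URL and table name of your own (both placeholders here): the producer enqueues payloads, and an SQS-triggered Lambda drains them into DynamoDB at whatever rate the table can absorb.

```python
import json
import boto3

sqs = boto3.client("sqs")
dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("request-log")  # hypothetical table
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/request-log"  # hypothetical

def log_request(url: str):
    # Producer side (e.g. the ECS service): enqueue instead of writing
    # to DynamoDB directly, so bursts never hit the table's write limit.
    sqs.send_message(QueueUrl=QUEUE_URL, MessageBody=json.dumps({"url": url}))

def handler(event, context):
    # Consumer side: Lambda triggered by the SQS queue writes each
    # message to DynamoDB at the pace the table's capacity allows.
    for record in event["Records"]:
        table.put_item(Item=json.loads(record["body"]))
```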
The answer to your question is entirely dependent on your use case and your system - something we here at SO will never really understand or know simply because we will always be hearing about it through you and never really experiencing it. As such, to answer it, you need to understand the capabilities of both Dynamo and SQS, the pros and cons for each, and then determine which is best for your product.

Implementing backend caching in a serverless AWS architecture recommendation

I have a serverless app using AWS lambda. My lambda fetches some data from other third party APIs and I'm planning to implement backend caching to cache the data my lambda fetches. I've read some articles online and I saw some sample architectures.
In the sample architectures I've seen, DynamoDB is used to store the cache of the data fetched by the Lambda. People suggested DynamoDB because the latency it adds while fetching the cache is very minimal. But I cannot go with DynamoDB, as each of my data items to be cached is very large, even after gzipping it (approx. 5-7 MB). Hence, I'm planning to use DocumentDB instead of DynamoDB. I'm just learning how DocumentDB works and have no idea whether it is as fast and efficient as DynamoDB. Could anyone comment on this idea and suggest which options I could use in this case apart from DynamoDB, and whether DocumentDB is a good alternative?
If you need to cache 5-7 MB worth of data per item, have you considered using S3 as a cache? S3 can be used as an effective (and cheap) NoSQL store. I'm not sure it will meet your specific use case, but it may be worth exploring depending on your needs.
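As a rough illustration of the S3-as-cache idea (the bucket name and key scheme are made up here), a read-through helper might look like this:

```python
import gzip
import json
import boto3
from botocore.exceptions import ClientError

s3 = boto3.client("s3")
BUCKET = "my-api-cache"  # hypothetical bucket

def get_cached(key: str, fetch_fn):
    # Try the cache first; on a miss, call the third-party API
    # (fetch_fn) and store the gzipped result for next time.
    try:
        obj = s3.get_object(Bucket=BUCKET, Key=key)
        return json.loads(gzip.decompress(obj["Body"].read()))
    except ClientError as err:
        if err.response["Error"]["Code"] != "NoSuchKey":
            raise
    data = fetch_fn()
    s3.put_object(Bucket=BUCKET, Key=key,
                  Body=gzip.compress(json.dumps(data).encode()))
    return data
```

If staleness matters, S3 lifecycle rules can expire old cache objects automatically.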
I found a good conversation about this topic on Stack Overflow if you want to dive deeper.

Mirror Marketo Data in S3 Bucket for Visualization

I am looking to get all of the Activity and Lead data in Marketo mirrored in an AWS S3 bucket so that I can build dashboards on it in QuickSight. Preferably I'd like to stream the data from Marketo into S3 in near real time, then use Glue and Athena to connect the data to QuickSight. However, the only way to get large volumes of data out of Marketo appears to be their Bulk Extract tool (one for Leads, one for Activity data).
The problem is that these API interfaces make any attempt at near real-time streaming really clunky. Currently, I have Lambda functions triggered every hour to pull the most recent hour of Lead/Activity data and save it as a gzipped CSV in S3. But Marketo's Bulk Extract tool has a request queue, and requests often take longer than 15 minutes to process (15 minutes being Lambda's maximum timeout). So at least once a day my requests get dropped.
The solution seems to be to instead run this on an EC2 instance that can juggle multiple requests and patiently wait for Marketo's queue. But I'd rather not get into all the async and error-handling issues that that approach may entail if there is an easier way to accomplish this.
As an alternative solution, Amazon AppFlow integrates with Marketo. But last I checked, it only works with Lead data, not Activity data. And there are restrictions on the filters you have to apply to the Lead data that make it clunky to work with anyway.
On Google I have found several companies that claim to offer seamless, reliable Marketo-to-S3 ETL, but I haven't yet researched their pricing or quality.
If anyone knows of a good approach to set up reliable and cost-efficient ETL between Marketo and S3 in a short period of time, I would very much appreciate it.
In a case like this, I would be tempted to recommend using an EC2 instance to run Singer with a Marketo input and CSV output, then setting up something to move the CSVs over to S3 as needed. That would be the absolute cheapest ETL solution, but it does suppose you have some comfort and familiarity with Python.
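The "move the CSVs over to S3" step is the easy part; a minimal sketch with boto3 (the output directory and bucket name are placeholders):

```python
from pathlib import Path
import boto3

s3 = boto3.client("s3")
BUCKET = "marketo-mirror"  # hypothetical bucket

def ship_csvs(out_dir: str = "/var/singer/output"):
    # Upload each CSV that Singer produced, then remove the local
    # copy so the next run starts clean.
    for csv_path in Path(out_dir).glob("*.csv"):
        s3.upload_file(str(csv_path), BUCKET, f"marketo/{csv_path.name}")
        csv_path.unlink()
```

Run on a cron schedule after each Singer extraction, this keeps the EC2 side stateless apart from Singer's own bookmarks.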
Also worth noting is that Stitch, Singer's paid product equivalent, supports native S3 export--you could always first test with a non-Marketo data source and see if that performs the way you'd like, if you prefer spending money over time.

Counting AWS lambda calls and segmenting data per api key

Customers (around 1,000) sign up to my service and receive a customer-unique API key. They then use the key when calling an AWS Lambda function through AWS API Gateway to access data in DynamoDB.
Requirement 1: The customers get billed by the number of API calls, so I have to be able to count those. AWS only provides metrics for the total number of calls per Lambda, so I have a few options:
At every API hit, increment a counter in DynamoDB.
At every API hit, enqueue a message in SQS, receive it in a "hit counter" Lambda, and increment a counter in DynamoDB.
Deploy a separate Lambda for each customer and use AWS's built-in call counter.
Requirement 2: The data that the Lambda can access is unique to each customer and thus depends on the API key provided.
To enable this I also have a number of options:
Store the required API key together with the data that the customer has the right to access.
Deploy a separate Lambda for each customer and use API Gateway to protect it with a key.
Create a separate endpoint in API Gateway for each customer and protect it with the API key.
None of the options above seem like a good way to design the solution. Is there a canonical way of doing this? If not, which of the options above is the best? Have I missed an obvious solution due to my unfamiliarity with AWS?
I will try to break your problems down with my experience, but maybe Michael - Sqlbot or John Rotenstein may be able to give more appropriate answers.
Requirement 1
1) This sounds like a good approach. I don't see anything critical here.
2) This, IMHO, is the best of the three. It decouples data access from the billing service, which is a great thing in a microservices world. (A sketch of the consumer side follows this list.)
3) This is not scalable. Imagine your system grows and you end up with 10K Lambda functions. Not only will you have to build a very reliable mechanism to automate this process, you will also need to monitor 10K different things (imagine CloudWatch logs, API Gateway, etc.), not to mention you'll have ten thousand functions with exactly the same code (client-specific parameters aside). I wouldn't even think about this one.
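For option 2, a minimal sketch of the "hit counter" consumer (the table and attribute names are assumptions): the API Lambda enqueues the caller's key, and this function tallies it with an atomic ADD so concurrent increments don't race.

```python
import json
import boto3

dynamodb = boto3.resource("dynamodb")
counters = dynamodb.Table("ApiCallCounters")  # hypothetical table, keyed by ApiKey

def handler(event, context):
    # Triggered by the "hit" SQS queue; each message represents one API call.
    for record in event["Records"]:
        api_key = json.loads(record["body"])["apiKey"]
        counters.update_item(
            Key={"ApiKey": api_key},
            # ADD is atomic, so concurrent Lambdas cannot lose increments;
            # it also creates the counter on first use.
            UpdateExpression="ADD calls :one",
            ExpressionAttributeValues={":one": 1},
        )
```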
Requirement 2
1) It could work, and it fits nicely with the DynamoDB way of doing things: store as much data as you can in a single table, so you can fetch everything in one go. From what I see, you could even use this ApiKey as your partition key and, for the sake of simplicity in this answer, store the client's data as JSON in an attribute named data. Since your query only needs to look things up by ApiKey, storing JSON in DynamoDB won't hurt (do keep in mind, however, that if you need to query by any of its JSON attributes then you're in trouble, since DynamoDB's query capabilities are very limited). A minimal sketch of this lookup follows the list below.
2) No, because of Requirement 1.3
3) No, because of the above.
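Here is the sketch promised for option 1 (the table and attribute names are made up for illustration):

```python
import boto3
from boto3.dynamodb.conditions import Key

dynamodb = boto3.resource("dynamodb")
client_data = dynamodb.Table("ClientData")  # hypothetical table

def get_client_items(api_key: str):
    # ApiKey is the partition key, so one Query fetches everything
    # the caller is allowed to see and nothing else.
    resp = client_data.query(KeyConditionExpression=Key("ApiKey").eq(api_key))
    return resp["Items"]
```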
If you still need to store the ApiKey in a different table so you can run different analyses and keep finer-grained control over the client's calls, access, billing, etc., that's not a problem either; just make sure you duplicate the ApiKey in your ClientData table instead of creating a FK (DynamoDB doesn't support FKs, so you'd need to manage these constraints yourself). Duplication is just fine in a NoSQL world.
Your use case is clearly a multi-tenancy one, so I'd also recommend reading Multi-Tenant Storage with Amazon DynamoDB, which will give you some more insight and broaden your options a little bit. Multi-tenancy is not an easy task and can give you lots of headaches if not implemented correctly. I think this is why AWS has prepared this nice read for us :)
Happy to continue this in the comments section in case you have more info to share.
Hope this helps!

AWS Lambda - how to identify duplicate messages

Since several of the triggers for AWS Lambda can only guarantee message delivery "at least once" (SQS and IoT with QoS=1), I wonder what's the best way to identify a duplicate message and ignore it.
I can see that I currently get several duplicate messages, triggering my lambdas twice, causing noise and invalid data as a consequence.
In my client, I solve it by storing a list of message IDs that I've processed, but in the Lambdas I have nowhere to store state.
Of course I could maintain a DB table of processed message IDs, but it seems like overkill to me (and probably adds extra billed runtime to the Lambdas). A simple in-memory key/value store service would be enough.
What other solutions are you guys using?
I know you don't want to use a DB, but DynamoDB can work well for this kind of thing. If you have something you can use as a good partition key, it will still be quite performant. It will add a very small amount of time to your Lambda runtime and, of course, you will be charged for your DynamoDB capacity and data, but I use this successfully to discard duplicate messages.
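One way to implement that (a sketch, not the answerer's exact code): do a conditional put on the message ID and treat a failed condition as "already processed". The table name and attribute names are assumptions.

```python
import time
import boto3
from botocore.exceptions import ClientError

dynamodb = boto3.resource("dynamodb")
seen = dynamodb.Table("processed-messages")  # hypothetical table, partition key "MessageId"

def is_first_delivery(message_id: str, ttl_seconds: int = 3600) -> bool:
    # The conditional put succeeds only for the first delivery; a
    # duplicate trips the condition and the caller skips the message.
    try:
        seen.put_item(
            Item={"MessageId": message_id,
                  "expires": int(time.time()) + ttl_seconds},
            ConditionExpression="attribute_not_exists(MessageId)",
        )
        return True
    except ClientError as err:
        if err.response["Error"]["Code"] == "ConditionalCheckFailedException":
            return False
        raise
```

If you enable DynamoDB's TTL feature on the expires attribute, old entries are cleaned up automatically, so the table never grows without bound.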
The other thing that might be worth looking into is ElastiCache, which comes in Memcached and Redis flavors. It would be faster, if performance is a particular focus, but it is not persistent like DynamoDB.