How to get CloudFormation to respect Kinesis simultaneous stream creation limits

I have a CloudFormation stack that contains multiple Kinesis streams. If the stack creates or updates fewer than 5 streams at a time, there are no problems. With more than 5, an error occurs and the whole stack is rolled back.
The issue is compounded by streams in the template being added dynamically from config files, so order is not deterministic.
Is there a way to use wait conditions to say only do 5 of these at a time? Even this, I think, will be an issue because I won't know about streams that are being deleted.
Or is there some way to have CloudFormation back off a creation attempt, wait, and try again without rolling back the whole stack?

WaitConditions aren't really designed for this. They are more for servers that can signal back when their setup is done.
There is no creation strategy for streams at this time.
According to the AWS response in this thread, the only way is to build up a DependsOn chain. They suggest batching, but I had to do a linked list since I wouldn't know what other stacks are up to. Still not foolproof, but it won't have more than 5 streams building at once.
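For illustration only, here is a minimal sketch of generating such a linked-list DependsOn chain from a config-driven stream list (stream names, shard counts and logical IDs below are made up, not from the thread):

```python
import json

# Hypothetical list of stream names loaded from config files.
stream_names = ["orders", "clicks", "audit"]

resources = {}
previous_logical_id = None
for name in stream_names:
    logical_id = f"{name.title()}Stream"
    resource = {
        "Type": "AWS::Kinesis::Stream",
        "Properties": {"Name": name, "ShardCount": 1},
    }
    # Chain each stream onto the previous one so CloudFormation
    # creates them one at a time instead of all at once.
    if previous_logical_id:
        resource["DependsOn"] = previous_logical_id
    resources[logical_id] = resource
    previous_logical_id = logical_id

template = {"AWSTemplateFormatVersion": "2010-09-09", "Resources": resources}
print(json.dumps(template, indent=2))
```

Because the streams come from config in a non-deterministic order, the chain simply follows whatever order the generator sees; all that matters is that each stream depends on exactly one predecessor.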

How to partition AWS lambda invocations to independent processing tasks

I am looking for some best practice advice on AWS, and hoping this question won't immediately be closed as too open to opinion.
I am working on a conversion of a Windows server application to AWS Lambda.
The server runs every 5 minutes and grabs all the files that have been uploaded to various FTP locations.
These files must be processed in a specific order, which might not be the order they arrive in, so it then sorts them and processes accordingly.
It interacts with a database to validate the files against information from previous files.
It then sends the relevant information on, and records new information in the database.
Errors are flagged, and logged in the database, to be dealt with manually.
Note that currently there is no parallel processing going on. This would be difficult because of the need to sort the files and process them in the correct order.
I have therefore been assuming the lambda will have to run as a single invocation on a schedule.
However, I have realised that the files can be partitioned according to where they come from, and those locations can be processed independently.
So I could have a certain amount of parallelism.
My question is what is the correct way to manage that limited parallelism in AWS?
A clunky way of doing it would be through the database, something like this:
A lambda spins up and reads a particular table in the database
This table has a list of independent processing areas, and the columns: "Status", "StartTime".
The lambda finds the oldest one not currently being processed, registers it as "processing", and updates the "StartTime".
After processing, the status is set to "done" or some such.
I think this would work, but it doesn't feel quite right to be managing such things through the database.
Can someone suggest a pattern that my problem fits into, and the correct AWS way of doing this?
If you really want to do this with parallel lambda invocations, then yes, you should absolutely use a database to coordinate their work.
The protocol you're thinking about seems reasonable. You need to use the transactional capabilities of the database to ensure that the parallel invocations don't interfere with each other, and you need to make sure that the system is resilient to lambda invocations that don't happen.
When your lambda is invoked to handle the scheduled event, it should decide how many additional parallel invocations are required, and then make asynchronous lambda calls to run those additional instances. Those instances should recognize that they were invoked directly and skip the fan-out step.
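As a sketch of that fan-out step (the function name and payload shape are assumptions, not from the question), the asynchronous invocations could look like this with boto3:

```python
import json
import boto3

lambda_client = boto3.client("lambda")

def fan_out(extra_workers: int) -> None:
    """Asynchronously invoke additional copies of this function."""
    for _ in range(extra_workers):
        lambda_client.invoke(
            FunctionName="ftp-file-processor",       # hypothetical function name
            InvocationType="Event",                  # asynchronous invocation
            Payload=json.dumps({"role": "worker"}),  # lets workers skip the fan-out step
        )
```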
After that, all of the parallel lambda invocations should do exactly the same thing. Make sure that none of them are special in any way, so you don't need to rely on any particular one completing without error. They should each pull work from a work queue in the DB until all the work is done.
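A minimal sketch of the claim step, assuming a hypothetical DynamoDB table with one item per processing area and a pending/processing/done status attribute:

```python
import time
import boto3
from botocore.exceptions import ClientError

table = boto3.resource("dynamodb").Table("processing-areas")  # hypothetical table

def try_claim(area_id: str) -> bool:
    """Atomically mark an area as 'processing'; returns False if another invocation got it first."""
    try:
        table.update_item(
            Key={"area_id": area_id},
            UpdateExpression="SET #s = :processing, start_time = :now",
            ConditionExpression="#s = :pending",
            ExpressionAttributeNames={"#s": "status"},  # 'status' is a reserved word
            ExpressionAttributeValues={
                ":processing": "processing",
                ":pending": "pending",
                ":now": int(time.time()),
            },
        )
        return True
    except ClientError as e:
        if e.response["Error"]["Code"] == "ConditionalCheckFailedException":
            return False  # someone else claimed it; move on to the next area
        raise
```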
BUT NOTE: Usually the kind of tasks you're talking about are not CPU-bound. If that is the case then running multiple parallel tasks inside the same lambda invocation will make better use of your resources. You can do both, of course.
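For the non-CPU-bound case, a rough sketch of running the partitions concurrently inside one invocation (the partition names and the process_area helper are hypothetical):

```python
from concurrent.futures import ThreadPoolExecutor

def process_area(area_id: str) -> None:
    ...  # download, sort, validate and forward the files for one FTP location

def handler(event, context):
    areas = ["ftp-eu", "ftp-us", "ftp-apac"]  # hypothetical partition keys
    # I/O-bound work: threads keep the single invocation busy while it
    # waits on FTP, S3 and database round trips.
    with ThreadPoolExecutor(max_workers=4) as pool:
        list(pool.map(process_area, areas))
```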

A Global Variable(State) in AWS for Serverless Orchestration

I am writing a syncing/ETL app inside AWS. It works as follows:
The source of the data is outside of AWS
Whenever new data is changed/added AWS is alerted via API Gateway (REST)
The REST API triggers a lambda function that does ETL and stores the data in CSV format to S3
This works fine for small tables. However, we are dealing with larger amounts of data lately and I have to switch to Fargate (EKS/ECS) instead of Lambda. As you can imagine, these will be long-running jobs and not cheap to perform. Usually when the data is changed, it changes multiple times within a period of 5 minutes, say for example 3 times. So the REST API gets a ping 3 times in a row and triggers the ETL jobs 3 times as well. This is very inefficient as you can imagine.
I came up with the idea that every time the REST API is triggered, we wait for 5 minutes; if the API has not been invoked again during the waiting period, do the ETL, otherwise do nothing. I think I can do the waiting using Step Functions. However, I cannot find a suitable way to store the hash/id of the latest ping to the API in one single variable. I thought maybe I could store the hash in an S3 object and after 5 minutes check to see if it is the same as the variable in my step function, but apparently ordering is not guaranteed. I looked into SQS, but the fact that it is a FIFO is not very convenient and way more than what I actually need. I am pretty sure that other people have had a similar issue and there must be a standard solution for this problem. I could not find any by googling, hence my plea here.
Thanks
From what I understand, Amazon DynamoDB is the store you are looking for to save the state of your job.
Also, please note that SQS is not FIFO by default. Using SQS won't prevent you from storing your job state.
What I would do:
Trigger a job and store the state in DynamoDB. Do not launch further jobs until the job state is done.
Orchestrate the ETL from Step Functions (including the 5 minutes wait)
You can also expire your jobs so DynamoDB will automatically clean them up with time.
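A minimal sketch of that approach, assuming a hypothetical etl_jobs table keyed by dataset: the API-triggered Lambda records the latest ping, and the Lambda that runs after the 5-minute Step Functions Wait state only starts the ETL if nothing newer has arrived.

```python
import time
import boto3

table = boto3.resource("dynamodb").Table("etl_jobs")  # hypothetical table name

def record_ping(dataset: str, ping_id: str) -> None:
    """Called by the API-triggered Lambda: remember the latest ping for this dataset."""
    table.put_item(Item={
        "dataset": dataset,
        "last_ping_id": ping_id,
        "last_ping_at": int(time.time()),
        # Optional TTL attribute so DynamoDB cleans old items up automatically.
        "expires_at": int(time.time()) + 24 * 3600,
    })

def should_run_etl(dataset: str, ping_id: str) -> bool:
    """Called after the 5-minute Wait state: run only if no newer ping arrived."""
    item = table.get_item(Key={"dataset": dataset}).get("Item", {})
    return item.get("last_ping_id") == ping_id
```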

AWS Lambda - Store state of a queue

I'm currently tasked with building a serverless architecture for communication between government agencies and citizens, and a main component is some form of queue that contains some form of object/pointer to each citizen's request, sorted by priority. The government workers can then process an element when available. As Lambda is stateless, I need to save the queue outside in some manner.
For saving state I've gathered that you can use DynamoDB or S3 Buckets and use event triggers to invoke related Lambda methods. Some also suggest using Parameter Store to save some state variables. Storing things globally has also come up, though as you can't guarantee that the Lambda doesn't terminate, it doesn't seem like a good idea.
Finally, I've also read a bit about SQS, though I have no idea if it is at all applicable to this case.
What is the best-practice / suggested approach when working with Lambda in this way? I'm leaning towards S3 Buckets, due to event triggering, and not using DynamoDB as our DB.
Storing things globally has also come up, though as you can't guarantee that the Lambda doesn't terminate, it doesn't seem like a good idea.
Correct -- this is not viable at all. Note that what you are actually referring to when you say "the Lambda" is the process inside the container... and any time your Lambda function is handling more than one invocation concurrently, you are guaranteed that they will not be running in the same container -- so "global" variables are only useful for optimization, not state. Any two concurrent invocations of the same function have two entirely different global environments.
Forgetting all about Lambda for a moment -- I am not saying don't use Lambda; I'm saying that whether or not you use Lambda isn't relevant to the rest of what is written, below -- I would suggest that parallel/concurrent actions in general are perhaps one of the most important factors that many developers tend to overlook when trying to design something like you are describing.
How you will assign work from this work "queue" is extremely important to consider. You can't just "find the next item" and display it to a worker.
You must have a way to do all of these things:
finding the next item that appears to be available
verify that it is indeed available
assign it to a specific worker
mark it as unavailable for assignment
Not only that, but you have to be able to do all of these things atomically -- as a single logical action -- and without collisions.
A naïve implementation runs the risk of assigning the same work item to two or more people, with the first assignment being blindly and silently overwritten by subsequent assignments that happen at almost the same time.
DynamoDB allows conditional updates -- update a record if and only if a certain condition is true. This is a critical piece of functionality that your solution needs to accommodate -- for example, assign work item x to user y if and only if item x is currently unassigned. A conditional update will fail, and changes nothing, if the condition is not true at the instant the update happens and therein lies the power of the feature.
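As a sketch of that assignment step with boto3 (the table and attribute names are illustrative, not prescriptive):

```python
import boto3
from botocore.exceptions import ClientError

table = boto3.resource("dynamodb").Table("work-items")  # hypothetical table

def assign(item_id: str, worker_id: str) -> bool:
    """Assign item x to worker y only if it is currently unassigned."""
    try:
        table.update_item(
            Key={"item_id": item_id},
            UpdateExpression="SET assigned_to = :w",
            ConditionExpression="attribute_not_exists(assigned_to)",
            ExpressionAttributeValues={":w": worker_id},
        )
        return True
    except ClientError as e:
        if e.response["Error"]["Code"] == "ConditionalCheckFailedException":
            return False  # someone else got it first; pick the next item
        raise
```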
S3 does not support conditional updates, because unlike DynamoDB, S3 operates only on an eventual-consistency model in most cases. After an object in S3 is updated or deleted, there is no guarantee that the next request to S3 will return the most recent version or that S3 will not return an item that has recently been deleted. This is not a defect in S3 -- it's an optimization -- but it makes S3 unsuited to the "work queue" aspect.
Skip this consideration and you will have a system that appears to work, and works correctly much of the time... but at other times, it "mysteriously" behaves wrongly.
Of course, if your work items have accompanying documents (scanned images, PDF, etc.), it's quite correct to store them in S3... but S3 is the wrong tool for storing "state." SSM Parameter Store is the wrong tool, for the same reason -- there is no way for two actions to work cooperatively when they both need to modify the "state" at the same time.
"Event triggers" are useful, of course, but from your description, the most notable "event" is not from the data, or the creation of the work item, but rather it is when the worker says "I'm ready for my next work item." It is at that point -- triggered by the web site/application code -- when the steps above are executed to select an item and assign it to a worker. (In practice, this could be browser → API Gateway → Lambda). From your description, there may be no need for the creation of a new work item to trigger an "event," or if there is, it is not the most significant among the events.
You will need a proper database for this. DynamoDB is a candidate, as is RDS.
The queues provided by SQS are designed to decouple two parts of your application -- when two processes run at different speeds, SQS is used as a buffer, allowing X to safely store the work needing to be done and then continue with something else until Y is able to do the work. SQS queues are opaque -- you can't introspect what's in the queue, you just take the next message and are responsible for handling it. On its face, that seems to partially describe what you need, but it is not a clean match for this use case. Queues are limited in how long messages can be retained, and once a message is successfully processed, it is completely gone.
Note also that SQS is only a match to your use case with the FIFO queue feature enabled, which guarantees perfect in-order delivery and exactly-once delivery -- standard SQS queues, for performance optimization reasons, do not guarantee perfect in-order delivery and may under certain conditions deliver the same message more than once, to the same consumer or a different consumer. But the SQS FIFO queue feature does not coexist with event triggers, which require standard queues.
So SQS may have a role, but you need an authoritative database to store the work and the results of the business process.
If you need to store the message, then SQS is not the best tool here, because your Lambda function would then need to process the message and finally store it somewhere, making SQS nothing but a broker.
The S3 approach gives what you need out of the box, considering you can store the files (messages) in an S3 bucket and then have one Lambda consume its event. Your Lambda would then process this event and the file would remain safe and sound on S3.
If you eventually need multiple consumers for this message, then you can send the S3 event to SNS instead and finally you could subscribe N Lambda Functions to a given SNS topic.
You appear to be worrying too much about the infrastructure at this stage and not enough on the application design. The fact that it will be serverless does not change the basic functionality of the application — it will still present a UI to users, they will still choose options that must trigger some business logic and information will still be stored in a database.
The queue you describe is merely a datastore of messages that are in a particular state. The application will have some form of business logic for determining the next message to handle, which could be based on creation timestamp, priority, location, category, user (eg VIP users who get faster response), specialization of the staff member asking for the next message, etc. This is not a "queue" but rather a calculation to be performed against all 'unresolved' messages to determine the next message to assign.
If you wish to go serverless, then the back-end will certainly be using Lambda and a database (eg DynamoDB or Amazon RDS). The application should store everything in the database so that data is available for the application's business logic. There is no need to use SQS since there really isn't a "queue", and Parameter Store is merely a way of sharing parameters amongst application components — it is not meant for core data storage.
Determine the application functionality first, then determine the appropriate architecture to make it happen.

Does Terraform offer strong consistency with S3 and DynamoDB?

Terraform offers a few different backend types for saving its state. AWS S3 is probably the most popular one, but it only offers eventual read-after-write consistency for overwriting objects. This means that when two people apply a Terraform change at approximately the same time, they might create a resource twice or get errors because a resource was deleted in the meantime.
Does Terraform solve that using DynamoDB? WRITES in DynamoDB are strongly consistent. READS, by default, are only eventually consistent, though.
So the question is whether there is strong consistency when working with S3 as a backend for Terraform.
tl;dr: Using DynamoDB to lock state provides a guarantee of strongly consistent reads or at least erroring if the read is not consistent. Without state locking you have a chance of eventual consistency biting you but it's unlikely.
Terraform doesn't currently offer DynamoDB as an option for remote state backends.
When using the S3 backend it does allow for using DynamoDB to lock the state so that multiple apply operations cannot happen concurrently. Because the lock is naively attempted as a put with a condition that the lock doesn't already exist, this gives you the strongly consistent action you need to make sure that it won't write twice (while also avoiding a race condition from making a read of the table followed by the write).
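To illustrate the pattern rather than Terraform's actual implementation, a conditional put of a lock item looks roughly like this (attribute names are only indicative):

```python
import boto3
from botocore.exceptions import ClientError

dynamodb = boto3.client("dynamodb")

def acquire_lock(table: str, lock_id: str, owner: str) -> bool:
    """Create the lock item only if no lock with this ID exists yet."""
    try:
        dynamodb.put_item(
            TableName=table,
            Item={"LockID": {"S": lock_id}, "Owner": {"S": owner}},
            ConditionExpression="attribute_not_exists(LockID)",
        )
        return True
    except ClientError as e:
        if e.response["Error"]["Code"] == "ConditionalCheckFailedException":
            return False  # someone else already holds the lock
        raise
```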
Because you can't run a plan/apply while a lock is in place, this allows the first apply in a chain to complete before the second one is allowed to read the state. The lock table also holds an MD5 digest of the state file, so if the state returned by S3 at plan time hasn't yet been updated, it won't match the MD5 digest and Terraform will fail hard with the following error:
Error refreshing state: state data in S3 does not have the expected content.
This may be caused by unusually long delays in S3 processing a previous state
update. Please wait for a minute or two and try again. If this problem
persists, and neither S3 nor DynamoDB are experiencing an outage, you may need
to manually verify the remote state and update the Digest value stored in the
DynamoDB table to the following value: 9081e134e40219d67f4c63f4fef9c875
If, for some reason, you aren't using state locking then Terraform does read back the state from S3 to check that it's what it expects (currently retrying every 2 seconds for 10 seconds until they match, or failing if that timeout is exceeded), but I think it is still technically possible in an eventually consistent system for one read to show the update and a second read to miss it when it hits another node. In my experience this certainly happens in IAM, which is a global service with eventual consistency and correspondingly much longer convergence times.
All that said, I have never seen any issues caused by the eventual consistency of the S3 buckets, and I would have expected to see lots of orphaned resources if this were a real problem, particularly in a previous job where we were executing huge numbers of Terraform jobs concurrently and on a tight schedule.
If you wanted to be more certain of this, you could probably test it by having Terraform create an object with a key of a UUID/timestamp that Terraform generates, so that every apply deletes the old object and creates a new one, then running that in a tight loop, checking the number of objects in the bucket and exiting if you ever have 2 objects in the bucket.
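A rough sketch of the checking side of such an experiment (the bucket name is hypothetical, and the Terraform apply loop would run separately alongside it):

```python
import sys
import time
import boto3

s3 = boto3.client("s3")
BUCKET = "tf-consistency-test"  # hypothetical bucket written to by the Terraform loop

while True:
    response = s3.list_objects_v2(Bucket=BUCKET)
    count = response.get("KeyCount", 0)
    if count > 1:
        # Two objects at once means the old one was still visible after the apply.
        print(f"Observed {count} objects at once - possible consistency artefact")
        sys.exit(1)
    time.sleep(1)
```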

AWS services appropriate for concurrent access to a resource

I'm designing a system where a cluster of EC2 instances do some computing and then update a large file continually. What would be ideal is if I could have the file in S3, and have all the instances take turns writing to it one at a time, performing calculations while they wait.
As it stands, if 2 instances PUT to S3 at the same time, one will simply overwrite the other.
How can I solve this concurrency issue?
AWS has a preview service called EFS (http://aws.amazon.com/documentation/efs/), an NFSv4 file system that can be shared among EC2 instances. But such a service alone does not solve your problem, as you may still have concurrency issues. Consider something more sophisticated, such as exploiting "embarrassingly parallel" processing: have N processes create N file chunks, and finally have a single job join all the pieces together when everything is done.
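As a sketch of that chunk-then-join approach using plain S3 (the bucket name and key layout are made up):

```python
import boto3

s3 = boto3.client("s3")
BUCKET = "compute-results"  # hypothetical bucket

def write_chunk(worker_id: int, data: bytes) -> None:
    """Each worker writes its own chunk; no two workers ever touch the same key."""
    s3.put_object(Bucket=BUCKET, Key=f"chunks/part-{worker_id:05d}", Body=data)

def join_chunks(n_workers: int) -> None:
    """A single joiner concatenates the chunks into the final file once all are done."""
    combined = b"".join(
        s3.get_object(Bucket=BUCKET, Key=f"chunks/part-{i:05d}")["Body"].read()
        for i in range(n_workers)
    )
    s3.put_object(Bucket=BUCKET, Key="final/combined.dat", Body=combined)
```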
As it is, Amazon states that if you receive a success code then your S3 object is committed. Amazon also adds that there won't be any dirty writes or overlapping inconsistency: you would read one fully committed write or the other.
If you need more control, you might be able to do it at the application level, e.g. by implementing a critical section.
It certainly makes sense to enable versioning on the bucket so that all the writes are retained and you can later specify which version to treat as the latest.
You can also leverage lifecycle rules to keep deleting all but the last n versions to save cost.
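A minimal sketch of wiring that up with boto3, assuming a hypothetical bucket name and a 30-day retention window for old versions:

```python
import boto3

s3 = boto3.client("s3")
BUCKET = "shared-output-bucket"  # hypothetical bucket name

# Keep every write as a version so no update is silently lost.
s3.put_bucket_versioning(
    Bucket=BUCKET,
    VersioningConfiguration={"Status": "Enabled"},
)

# Expire old, non-current versions after a while to keep storage costs down.
s3.put_bucket_lifecycle_configuration(
    Bucket=BUCKET,
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "expire-old-versions",
                "Status": "Enabled",
                "Filter": {"Prefix": ""},
                "NoncurrentVersionExpiration": {"NoncurrentDays": 30},
            }
        ]
    },
)
```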