Detect updates to AWS StepFunctions State Machine definition inside a Choice state - amazon-web-services

This is a really good pattern for restarting very long-running state machine executions based on an iteration count, so that we don't breach the Standard workflow quotas of 1 year of execution time and 25,000 events in the execution history: https://docs.aws.amazon.com/step-functions/latest/dg/tutorial-continue-new.html
My question: is it possible to detect, in a Choice state, whether the state machine definition has changed since the start of the execution? For example, in the IsCountReached state from the tutorial above.
We are planning to handle creating and updating the State Machine with AWS CDK. This would let us fully automate deployments to State Machines, instead of manually killing the execution and restarting it after each change to the State Machine.

As far as I know there is no such thing. It does not really make sense either, since a state machine is run on a "version" of your state machine definition. When you change your definition (new version), you typically don't want running processes to be influenced by that, since that might have unexpected consequences.
That said, you should be able to build something like this fairly easily: write a Lambda function that finds the currently running executions of the state machine, stops them and restarts them. You invoke this Lambda function as part of your deployment process whenever your definition has changed.
This way, if your deployment contains changes to your state machine, all your currently running state machines would be restarted and then use the new definition.
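A minimal sketch of such a restart function, assuming a boto3 Step Functions client is passed in as `sfn`. The function name `restart_running_executions` and the `STATE_MACHINE_ARN` event key are made up for illustration; the client calls (`list_executions`, `describe_execution`, `stop_execution`, `start_execution`) are the standard Step Functions API operations.

```python
def list_running(sfn, state_machine_arn):
    """Collect the ARNs of all RUNNING executions, following nextToken."""
    arns, token = [], None
    while True:
        kwargs = {"stateMachineArn": state_machine_arn, "statusFilter": "RUNNING"}
        if token:
            kwargs["nextToken"] = token
        page = sfn.list_executions(**kwargs)
        arns.extend(e["executionArn"] for e in page["executions"])
        token = page.get("nextToken")
        if not token:
            return arns


def restart_running_executions(sfn, state_machine_arn,
                               cause="Restarted after a definition update"):
    """Stop each running execution and start a new one with the same input.

    Returns a list of (old_execution_arn, new_execution_arn) pairs.
    """
    restarted = []
    # Snapshot the list first so the executions started below are not
    # picked up and restarted again.
    for arn in list_running(sfn, state_machine_arn):
        original_input = sfn.describe_execution(executionArn=arn)["input"]
        sfn.stop_execution(executionArn=arn, error="Redeployed", cause=cause)
        new = sfn.start_execution(stateMachineArn=state_machine_arn,
                                  input=original_input)
        restarted.append((arn, new["executionArn"]))
    return restarted


def handler(event, context):
    # Imported here so the helpers above can be exercised without boto3.
    import boto3
    sfn = boto3.client("stepfunctions")
    # "STATE_MACHINE_ARN" is a hypothetical event key for this sketch.
    return restart_running_executions(sfn, event["STATE_MACHINE_ARN"])
```

Note that the new execution starts from the original input, so any progress such as an iteration count kept in the state is lost unless you persist it elsewhere.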

DescribeStateMachine doesn't return updateDate, but DescribeStateMachineForExecution does:
https://docs.aws.amazon.com/step-functions/latest/apireference/API_DescribeStateMachineForExecution.html
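Building on that, one possible approach (a sketch, not something Step Functions offers natively) is a small Lambda task placed before the Choice state that compares the definition the execution is running on with the current one. The function name `definition_changed` and the `executionArn` / `definitionChanged` field names are assumptions for this example; the task would receive the execution ARN from the context object, for instance via `"executionArn.$": "$$.Execution.Id"`.

```python
def definition_changed(sfn, execution_arn):
    """Return True if the state machine was updated after this execution started.

    `sfn` is expected to be a boto3 Step Functions client.
    """
    # Definition the execution is actually running on.
    pinned = sfn.describe_state_machine_for_execution(executionArn=execution_arn)
    # Definition currently deployed for the same state machine.
    current = sfn.describe_state_machine(stateMachineArn=pinned["stateMachineArn"])
    return pinned["definition"] != current["definition"]


def handler(event, context):
    # Imported here so definition_changed can be exercised without boto3.
    import boto3
    sfn = boto3.client("stepfunctions")
    # The Choice state can then branch on $.definitionChanged.
    return {"definitionChanged": definition_changed(sfn, event["executionArn"])}
```

The Lambda role needs permission for both Describe calls, and DescribeStateMachineForExecution is not supported for Express workflows.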

Related

What's Happening when an AWS Lambda Function Freezes

What's going on behind the scenes when an AWS lambda function freezes?
That is -- many of the Lambda Runtime Docs refer broadly to the concept of a function freezing or unfreezing
The runtime and each extension indicate completion by sending a Next API request. Lambda freezes the execution environment when the runtime and each extension have completed and there are no pending events.
My understanding of this is that after a Lambda function initializes (or "cold starts") and executes the first invocation request, if there are no other invocations to process, the function's execution environment will "freeze". Then, when there's another function invocation to process, the function's execution environment will "unfreeze" almost instantly without needing to initialize/cold-start again. If a frozen function goes too long without being invoked, it will shut down, and the next invocation request will need to cold start.
Does anyone know what this freezing is? It's my understanding that these execution environments are Firecracker virtual machines. Is this freezing something that Firecracker supports, or is it some extra magic that AWS keeps to itself? Put another way, if I have a Firecracker VM running, can I freeze and unfreeze it?
You can think of the freeze as AWS Lambda putting the instance to sleep after each execution. In other words, the instance freezes (similar to a laptop in hibernate mode). The virtual CPU is turned off. This frees up resources on the worker node. The overhead from waking up such a function is negligible.
To understand how Firecracker works under the hood, take a look at this AWS re:Invent 2019 video: AWS re:Invent 2019: Firecracker open-source innovation (OPN402)
Also, take a look at these posts:
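The effect of freezing and unfreezing can be observed from inside a function. The following is a small sketch, assuming a Python runtime: module-level code runs once per execution environment (the cold start), while the handler runs on every invocation, so module-level state survives between invocations served by the same environment. The names `ENVIRONMENT_ID` and `invocation_count` are made up for this example.

```python
import time
import uuid

# Module-level code runs once, when the execution environment initializes.
ENVIRONMENT_ID = str(uuid.uuid4())
INITIALIZED_AT = time.time()
invocation_count = 0


def handler(event, context):
    """Report whether this invocation was served by a reused environment."""
    global invocation_count
    invocation_count += 1
    return {
        "environment_id": ENVIRONMENT_ID,
        "invocation": invocation_count,
        # True only for the first invocation in this environment.
        "cold_start": invocation_count == 1,
        "seconds_since_init": time.time() - INITIALIZED_AT,
    }
```

If two consecutive invocations return the same `environment_id`, the second one was handled by an unfrozen environment rather than a fresh cold start.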
Understanding Container Reuse in AWS Lambda
A look behind the scenes of AWS Lambda

How does CloudFormation handle AWS Step Functions based custom resources when state machine executions are aborted?

I have several CloudFormation templates with custom resources backed by several AWS Step Functions state machines.
Sometimes, during development, they fall into an infinite loop when I try to delete the CloudFormation stacks, so the delete operation gets stuck in DELETE_IN_PROGRESS.
Although I can abort the execution of the state machines, CloudFormation remains stuck for one hour until the DELETE operation fails.
I cannot find anything in the official documentation about how CloudFormation handles this use case. It seems that the only way to go is to wait for an hour until CloudFormation reaches DELETE_FAILED.
Does anybody know any way to avoid waiting when a state machine execution is aborted?
I don't think the problem is in aborting State Machine executions.
Most probably your custom resources do not process CloudFormation Delete events correctly. So you are most probably not waiting because a state machine execution was aborted, but because CloudFormation never receives a response to its Delete request.
To accelerate things consider setting a smaller timeout in Stack creation options when you create the stack.
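For reference, CloudFormation waits until the custom resource sends a response to the pre-signed `ResponseURL` included in the request. A minimal sketch of sending that response from Python follows; the helper names `build_response` and `send_response` are made up here, and the fallback `Reason` text is only a placeholder.

```python
import json
import urllib.request


def build_response(event, status, reason=""):
    """Build the response body CloudFormation expects for a custom resource."""
    return {
        "Status": status,  # "SUCCESS" or "FAILED"
        "Reason": reason or "See the state machine execution history",
        # Reuse the existing physical ID on Update/Delete requests.
        "PhysicalResourceId": event.get("PhysicalResourceId") or event["RequestId"],
        "StackId": event["StackId"],
        "RequestId": event["RequestId"],
        "LogicalResourceId": event["LogicalResourceId"],
    }


def send_response(event, status, reason="", urlopen=urllib.request.urlopen):
    """PUT the response to the pre-signed URL from the custom resource request."""
    body = json.dumps(build_response(event, status, reason)).encode("utf-8")
    request = urllib.request.Request(
        event["ResponseURL"],
        data=body,
        method="PUT",
        # An empty content type keeps urllib from adding a form-encoded one.
        headers={"Content-Type": ""},
    )
    with urlopen(request) as response:
        return response.status
```

In a state machine, a catch-all error path that ends in a task calling `send_response(event, "FAILED", ...)` would let the stack operation fail quickly instead of waiting for the timeout. An aborted execution never reaches that path, so in that case the response has to be sent from outside the state machine.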

How to prevent concurrent runs of a state machine in AWS Step Functions?

Is there a way to prevent concurrent execution of AWS Step Functions state machines? For example, if I start a state machine and then start it again before the first execution has finished, I would like to get an exception.
You can add a step (say, with a Lambda function) which would check if the same state machine is already being executed (and in which state). If this is the case, the lambda and the step would fail.
Depending on what you want to achieve, you can additionally configure a Retry so that the execution will continue once the old state machine has finished.
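A rough sketch of such a guard step, assuming a boto3 Step Functions client and that the task input carries the state machine ARN and the execution's own ARN (for example from `$$.StateMachine.Id` and `$$.Execution.Id` in the context object). The exception name `AlreadyRunningError` and the input keys are made up for this example.

```python
class AlreadyRunningError(Exception):
    """Raised when another execution of the same state machine is running."""


def ensure_single_execution(sfn, state_machine_arn, own_execution_arn):
    """Fail if any RUNNING execution other than the caller's own exists."""
    page = sfn.list_executions(stateMachineArn=state_machine_arn,
                               statusFilter="RUNNING")
    others = [e["executionArn"] for e in page["executions"]
              if e["executionArn"] != own_execution_arn]
    if others:
        raise AlreadyRunningError("Already running: " + ", ".join(others))
    return {"concurrentExecutions": 0}


def handler(event, context):
    # Imported here so the check above can be exercised without boto3.
    import boto3
    sfn = boto3.client("stepfunctions")
    return ensure_single_execution(sfn, event["stateMachineArn"],
                                   event["executionArn"])
```

This check is not atomic: two executions started at almost the same moment will see each other and both fail, and the listing is eventually consistent. A Retry on the error name `AlreadyRunningError` would make the later execution wait for the earlier one to finish.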
I don't think it is possible, according to the StartExecution API documentation:
StartExecution is idempotent. If StartExecution is called with the
same name and input as a running execution, the call will succeed and
return the same response as the original request. If the execution is
closed or if the input is different, it will return a 400
ExecutionAlreadyExists error. Names can be reused after 90 days.

(AWS SWF) Is there a way to get a list of all activity workers listening on a particular tasklist?

In our beta stack, we have a single EC2 instance listening to a tasklist. Sometimes another developer on the team starts their own instance for testing purposes and forgets to turn it off. This creates problems for the next developer, who tries to start an activity only for it to be taken up by the last developer's machine. Is there a way to get the hostnames of all activity workers listening to a particular tasklist?
It is not currently possible to get a list of pollers waiting on a task list through the SWF API. The workaround is to look at the identity field on the ActivityTaskStarted event after the task has been picked up by the wrong worker.
One way to avoid this issue is always use a task list name that is specific to a machine or developer to avoid collisions.
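Both suggestions can be sketched in a few lines. The helper names `personal_task_list` and `worker_identities` are hypothetical. The event shape (`eventType` of `ActivityTaskStarted` with an `identity` inside `activityTaskStartedEventAttributes`) follows the SWF GetWorkflowExecutionHistory response, and the identity is only present if the worker supplied one when polling.

```python
import getpass
import socket


def personal_task_list(base_name, user=None, host=None):
    """Derive a task list name that is unique per developer and machine."""
    user = user or getpass.getuser()
    host = host or socket.gethostname()
    return f"{base_name}-{user}-{host}"


def worker_identities(history_events):
    """Return the identity of each worker that started an activity task."""
    identities = []
    for event in history_events:
        if event.get("eventType") != "ActivityTaskStarted":
            continue
        attributes = event.get("activityTaskStartedEventAttributes", {})
        if "identity" in attributes:
            identities.append(attributes["identity"])
    return identities
```

A worker that also passes a descriptive identity (for example its hostname) when polling makes it easier to spot a stray worker in the history.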

AWS Lambda: Service error

What does this error mean?
I have 5 Lambda functions written in Java that worked perfectly, but since this afternoon all of them show the same message when I execute them:
Service error.
No output, no logs, only that message in a red box.
The status page at http://status.aws.amazon.com/ says:
6:05 PM PDT We are investigating increased error rates and elevated
latencies for AWS Lambda requests in the US-EAST-1 Region. Newly
created functions and console editing are also affected.
Why does it happen and is there a way to prevent it?
From time to time, parts of Amazon's AWS service fail. Sometimes the failure is very small and short-lived, and in other cases there are larger distributed failures.
Your system design needs to take into account the possibility that the piece of AWS that you are counting on will not work at the moment, and try to route around the damage. For instance, you can run Lambda in multiple regions. (It already runs in multiple availability zones inside a single region, so you don't have to worry about that). This gives you some isolation against failures in any one region.
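As a rough illustration of routing around a regional failure, the sketch below tries to invoke the same function in several regions in order. It assumes the function is deployed under the same name in each region; the helper name `invoke_with_failover` and the region list in the comment are examples only.

```python
def invoke_with_failover(clients, function_name, payload):
    """Invoke a function in the first region that accepts the request.

    `clients` is an ordered list of (region_name, lambda_client) pairs,
    built for example as:
        [(r, boto3.client("lambda", region_name=r))
         for r in ("us-east-1", "us-west-2")]
    """
    last_error = None
    for region, client in clients:
        try:
            response = client.invoke(FunctionName=function_name, Payload=payload)
        except Exception as error:  # service or connectivity failure
            last_error = error
            continue
        return region, response
    raise RuntimeError("Invocation failed in every region") from last_error
```

This only covers the call itself. Any data or downstream services the function depends on need their own multi-region story.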
Getting distributed systems to work at small scale can be hard because the failures that you need to protect against don't happen very often. At large scale, you get systematic efforts like Netflix's "Chaos Monkey", which deliberately introduces failures so that automated processes can detect and correct those issues.
"A distributed system is one in which the failure of a computer you didn't even know existed can render your own computer unusable." -- Leslie Lamport
"When a Fail-Safe system fails, it fails by failing to fail safe." -- John Gall