Step function exceeding the maximum number of characters service limit - amazon-web-services

My state in a Step Functions flow returns the error "state/task returned a result with a size exceeding the maximum number of characters service limit." According to the Step Functions documentation, the limit for input/output is 32,768 characters. When I check the total character count of my result data, it falls below the limit. Are there any other scenarios that would throw this error? Thanks!

2020-09-29 Edit: Step Functions now supports 256KB payloads!
256KB is the maximum size of the payload that can be passed between states. You could also exceed this limit from a Map or Parallel state, whose final output is an array with the output of each iteration or branch.
https://aws.amazon.com/about-aws/whats-new/2020/09/aws-step-functions-increases-payload-size-to-256kb
The recommended solution from the Step Functions documentation is to store the data somewhere else (e.g. S3) and pass around the ARN instead of raw JSON.
https://docs.aws.amazon.com/step-functions/latest/dg/avoid-exec-failures.html
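For illustration, here is a minimal sketch of that pattern in Python with boto3 (the bucket name, key scheme, and handler logic are all hypothetical): a producer task writes its large result to S3 and returns only a small reference, which downstream states pass along and resolve.

```python
import json
import uuid

import boto3

s3 = boto3.client("s3")
BUCKET = "my-workflow-payloads"  # hypothetical bucket name


def do_expensive_work(event):
    # Placeholder for whatever produces the large result.
    return {"records": []}


def producer_handler(event, context):
    """Write the large result to S3 and return only a small reference."""
    large_result = do_expensive_work(event)
    key = f"payloads/{uuid.uuid4()}.json"
    s3.put_object(Bucket=BUCKET, Key=key, Body=json.dumps(large_result))
    # This small object is all that Step Functions passes between states.
    return {"payload_bucket": BUCKET, "payload_key": key}


def consumer_handler(event, context):
    """Fetch the full payload back from S3 using the reference."""
    obj = s3.get_object(Bucket=event["payload_bucket"], Key=event["payload_key"])
    large_result = json.loads(obj["Body"].read())
    return {"processed": len(large_result["records"])}
```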
You can also use OutputPath to reduce the output to the fields you want to pass to the next state.
https://docs.aws.amazon.com/step-functions/latest/dg/input-output-outputpath.html
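As a sketch of the OutputPath approach (the state, Lambda ARN, and field names here are made up, and the ASL is expressed as a Python dict for consistency with the other examples): a task that returns a large object with a small "summary" field can forward only that field.

```python
# Hypothetical state definition: the Lambda returns a large object, but
# OutputPath keeps only "$.summary" as the state's output.
process_file_state = {
    "Type": "Task",
    "Resource": "arn:aws:lambda:us-east-1:123456789012:function:process-file",
    "OutputPath": "$.summary",
    "Next": "NextState",
}
```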

Related

AWS Step Functions Distributed Map output exceeds maximum number of bytes service limit

I have a Map step in distributed mode in a state machine, which iterates over a large number of entries in S3 (>300,000 items).
All of the steps inside the map succeed, but the Map step itself then fails with the error The state/task 'process-s3-file' returned a result with a size exceeding the maximum number of bytes service limit.
I've followed the advice in this Stack Overflow question and set OutputPath to null, which means every single step now returns an empty object ({}), but that didn't help for some reason. I've also set ResultPath of the Map step to "Discard result and keep original input", but that didn't help either. I can confirm that each individual iteration inside the map now returns an empty object ({}).
Am I missing something?
I was able to fix this by enabling S3 export of results. Apparently, that replaces the output entirely and no longer triggers the output size error.
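For reference, S3 export is configured with a ResultWriter on the Distributed Map state. A minimal sketch (as a Python dict; the bucket, prefix, and Lambda ARN are hypothetical):

```python
# Fragment of a Distributed Map state with results exported to S3.
# With a ResultWriter configured, the Map state's output is a reference
# to the S3 result manifest instead of the (potentially huge) result array.
distributed_map_fragment = {
    "Type": "Map",
    "ItemProcessor": {
        "ProcessorConfig": {"Mode": "DISTRIBUTED", "ExecutionType": "STANDARD"},
        "StartAt": "ProcessS3File",
        "States": {
            "ProcessS3File": {
                "Type": "Task",
                "Resource": "arn:aws:lambda:us-east-1:123456789012:function:process-s3-file",
                "End": True,
            }
        },
    },
    "ResultWriter": {
        "Resource": "arn:aws:states:::s3:putObject",
        "Parameters": {
            "Bucket": "my-map-results",    # hypothetical bucket
            "Prefix": "process-s3-file/",  # hypothetical prefix
        },
    },
}
```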

TextSizeLimitExceededException when calling the DetectPiiEntities operation

I am using AWS Comprehend for PII redaction. The idea is to detect entities and then redact the PII from the text.
The problem is that this API has an input text size limit. How can I increase the limit? Maybe to 1 MB? Or is there another way to detect entities in large text?
ERROR: botocore.errorfactory.TextSizeLimitExceededException: An error occurred (TextSizeLimitExceededException) when calling the DetectPiiEntities operation: Input text size exceeds limit. Max length of request text allowed is 5000 bytes while in this request the text size is 7776 bytes
There's no way to increase this limit.
For input text greater than 5,000 bytes, you can split the text into multiple chunks of up to 5,000 bytes each and then aggregate the results.
Make sure to keep some overlap between chunks, so that context carries over from the previous chunk.
For reference, the Comprehend team exposes a similar solution: https://github.com/aws-samples/amazon-comprehend-s3-object-lambda-functions/blob/main/src/processors.py#L172
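A minimal sketch of that approach in Python with boto3 (the chunk size and overlap are illustrative; for simplicity this splits on character counts, which only equals byte counts for ASCII text):

```python
import boto3

comprehend = boto3.client("comprehend")

MAX_CHUNK = 4500  # stay safely under the 5,000-byte limit (illustrative)
OVERLAP = 200     # characters of context carried between chunks (illustrative)


def detect_pii_large_text(text, language_code="en"):
    """Run DetectPiiEntities over overlapping chunks and merge the results."""
    entities = []
    step = MAX_CHUNK - OVERLAP
    for start in range(0, len(text), step):
        chunk = text[start:start + MAX_CHUNK]
        response = comprehend.detect_pii_entities(
            Text=chunk, LanguageCode=language_code
        )
        for entity in response["Entities"]:
            # Shift offsets back into the coordinate space of the full text.
            entities.append({
                **entity,
                "BeginOffset": entity["BeginOffset"] + start,
                "EndOffset": entity["EndOffset"] + start,
            })
    # Entities inside the overlap region may be reported twice; deduplicate.
    unique = {(e["BeginOffset"], e["EndOffset"], e["Type"]): e for e in entities}
    return list(unique.values())
```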

Dividing tasks into aws step functions and then join them back when all completed

We have an AWS Step Function that processes CSV files. These CSV files can contain anywhere from 1 to 4,000 records.
Now, I want to create another, inner AWS Step Function that will process these CSV records. The problem is that for each record I need to hit another API, and I want all of the records to be processed asynchronously.
For example: a CSV is received with 2,500 records.
The outer step function calls the inner step function 2,500 times (the inner step function takes a single CSV record as input), which processes the record and then stores the result in DynamoDB or some other place.
I have learned about the callback pattern in AWS Step Functions, but in my case I would be passing 2,500 tokens, and I want the outer step function to continue only once all 2,500 records are done processing.
So my question is: is this possible using AWS Step Functions?
If you know of any article or guide I could reference, that would be great.
Thanks in advance.
It sounds like dynamic parallelism could work:
To configure a Map state, you define an Iterator, which is a complete sub-workflow. When a Step Functions execution enters a Map state, it will iterate over a JSON array in the state input. For each item, the Map state will execute one sub-workflow, potentially in parallel. When all sub-workflow executions complete, the Map state will return an array containing the output for each item processed by the Iterator.
This keeps the flow all within a single Step Function and allows for easier traceability.
The limiting factor would be the amount of concurrency available (docs):
Concurrent iterations may be limited. When this occurs, some iterations will not begin until previous iterations have completed. The likelihood of this occurring increases when your input array has more than 40 items.
One additional thing to be aware of here is cost. You'll easily blow right through the free tier and start incurring actual cost (link).
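A minimal sketch of such a Map state (expressed as a Python dict; the Lambda ARN and field names are made up). Each record in $.records is processed by one iteration of the Iterator, and the Map state completes, returning the array of all iteration results, only when every iteration has finished:

```python
# Sketch of a state machine that fans out over CSV records with a Map state
# and implicitly "joins" when every iteration is done.
definition = {
    "StartAt": "ProcessRecords",
    "States": {
        "ProcessRecords": {
            "Type": "Map",
            "ItemsPath": "$.records",  # array of parsed CSV records
            "MaxConcurrency": 40,      # tune for downstream API limits
            "Iterator": {
                "StartAt": "CallApi",
                "States": {
                    "CallApi": {
                        "Type": "Task",
                        "Resource": "arn:aws:lambda:us-east-1:123456789012:function:call-api",
                        "End": True,
                    }
                },
            },
            "End": True,
        }
    },
}
```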

Run ML pipeline using AWS step function for entire dataset?

I have a Step Functions setup which calls a preprocessing Lambda and an inference Lambda for a data item. Now, I need to run this process on the entire dataset (over 10,000 items). One way is to invoke the step function in parallel for each input. Is there a better alternative to this approach?
Another way to do it would be to use a Map state to run over an array of items. You could start with a list of item IDs and run a set of tasks for each one.
https://aws.amazon.com/blogs/aws/new-step-functions-support-for-dynamic-parallelism/
This approach has some drawbacks though:
There is a 256KB limit for input/output data, and the initial array of items could exceed it. If you pass an array of IDs only as the input to the Map state, though, 10k items would likely stay under that limit.
A Map state doesn't guarantee that all items will run concurrently; it could be as few as 40 at a time (a workaround would be nested Map states, i.e. maps of Map states). From the documentation:
Concurrent iterations may be limited. When this occurs, some iterations will not begin until previous iterations have completed. The likelihood of this occurring increases when your input array has more than 40 items.
https://docs.aws.amazon.com/step-functions/latest/dg/amazon-states-language-map-state.html
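As a sketch, you might kick off a single execution with only the item IDs as input (Python with boto3; the state machine ARN and the ID source are hypothetical):

```python
import json

import boto3

sfn = boto3.client("stepfunctions")

# Hypothetical: fetch the 10k item IDs from wherever they live (S3, DynamoDB, ...).
item_ids = [f"item-{i}" for i in range(10_000)]

# Passing only IDs keeps the input payload well below the 256KB limit;
# each Map iteration then loads its own item by ID.
response = sfn.start_execution(
    stateMachineArn="arn:aws:states:us-east-1:123456789012:stateMachine:ml-pipeline",
    input=json.dumps({"item_ids": item_ids}),
)
print(response["executionArn"])
```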

DynamoDB: When does 1MB limit for queries apply

In the docs for DynamoDB it says:
In a Query operation, DynamoDB retrieves the items in sorted order, and then processes the items using KeyConditionExpression and any FilterExpression that might be present.
And:
A single Query operation can retrieve a maximum of 1 MB of data. This limit applies before any FilterExpression is applied to the results.
Does this mean, that KeyConditionExpression is applied before this 1MB limit?
Indeed, your interpretation is correct. With KeyConditionExpression, DynamoDB can efficiently fetch only the data matching its criteria; you pay only for that matching data, and the 1MB read limit applies to it. With FilterExpression the story is different: DynamoDB has no efficient way of filtering out the non-matching items before actually fetching all of them and filtering out the items you don't want. So you pay for reading the entire unfiltered data (before FilterExpression), and the 1MB maximum also applies to the unfiltered data.
If you're still unconvinced that this is the way it should be, here's another issue to consider: imagine that you have 1 gigabyte of data in your database to be Scan'ed (or in a single key to be Query'ed), and after filtering, the result will be just 1 kilobyte. Were you to make this query and expect to get the 1 kilobyte back, Dynamo would need to read and process the entire 1 gigabyte of data before returning. This could take a very long time, you would have no idea how long, and you would likely time out while waiting for the result. So instead, Dynamo makes sure to return to you after every 1MB of data it reads from disk (and for which you pay ;-)). Control returns to you 1,000 times (= 1 gigabyte / 1MB) during the long query, so you won't get a chance to time out. Whether a 1MB limit actually makes sense here, or whether it should have been more, I don't know; maybe there should have been separate limits for the response size and the read amount. But some sort of limit on the read amount was definitely needed, even if it doesn't translate into large responses.
By the way, the Scan documentation includes a slightly differently-worded version of the explanation of the 1MB limit, maybe you will find it clearer than the version in the Query documentation:
A single Scan operation will read up to the maximum number of items set (if using the Limit parameter) or a maximum of 1 MB of data and then apply any filtering to the results using FilterExpression.
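The practical consequence is that both Query and Scan paginate: when a response hits the 1MB limit it includes a LastEvaluatedKey, which you feed back as ExclusiveStartKey to continue. A minimal sketch in Python with boto3 (the table and key names are made up):

```python
import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("my-table")  # hypothetical table


def query_all(partition_key_value):
    """Collect every matching item, following the 1MB-sized pages."""
    items = []
    kwargs = {
        # "pk" is a hypothetical partition key attribute name.
        "KeyConditionExpression": Key("pk").eq(partition_key_value),
    }
    while True:
        response = table.query(**kwargs)
        items.extend(response["Items"])
        # LastEvaluatedKey is present whenever the 1MB limit cut the page short.
        last_key = response.get("LastEvaluatedKey")
        if not last_key:
            break
        kwargs["ExclusiveStartKey"] = last_key
    return items
```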