Run an ML pipeline using AWS Step Functions for an entire dataset?

I have a Step Functions setup that calls a preprocessing Lambda and an inference Lambda for a single data item. Now I need to run this process on the entire dataset (over 10,000 items). One way is to invoke the step function in parallel, once per input. Is there a better alternative to this approach?

Another way would be to use the Map state to run over an array of items. You could start with a list of item IDs and run the set of tasks for each one (see the sketch after the documentation links below).
https://aws.amazon.com/blogs/aws/new-step-functions-support-for-dynamic-parallelism/
This approach has some drawbacks though:
There is a 256 KB limit on input/output data, and the initial array of items could be bigger than that. If you pass an array of IDs only as the input to the Map state, though, 10k items would likely not cross that limit.
The Map state doesn't guarantee that all items will run concurrently; it could be fewer than 40 at a time (a workaround would be nested Map states, i.e. maps of Map states). From the documentation:
Concurrent iterations may be limited. When this occurs, some iterations will not begin until previous iterations have completed. The likelihood of this occurring increases when your input array has more than 40 items.
https://docs.aws.amazon.com/step-functions/latest/dg/amazon-states-language-map-state.html
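If you go the Map state route, the whole dataset becomes a single execution whose input is just the list of item IDs. A minimal sketch of starting such an execution with the AWS SDK for Java v2 (the state machine ARN, the item IDs, and the "items" input field are placeholders for illustration):

import software.amazon.awssdk.services.sfn.SfnClient;
import software.amazon.awssdk.services.sfn.model.StartExecutionRequest;
import software.amazon.awssdk.services.sfn.model.StartExecutionResponse;

import java.util.List;
import java.util.stream.Collectors;

public class StartDatasetRun {
    public static void main(String[] args) {
        // Hypothetical ARN and IDs; in practice the IDs would come from your
        // dataset index (an S3 listing, a DynamoDB query, etc.).
        String stateMachineArn = "arn:aws:states:us-east-1:123456789012:stateMachine:ml-pipeline";
        List<String> itemIds = List.of("item-0001", "item-0002", "item-0003");

        // Keep the input small: pass only the IDs and let the preprocessing
        // Lambda fetch the actual data for each item.
        String inputJson = itemIds.stream()
                .map(id -> "\"" + id + "\"")
                .collect(Collectors.joining(",", "{\"items\":[", "]}"));

        try (SfnClient sfn = SfnClient.create()) {
            StartExecutionResponse response = sfn.startExecution(StartExecutionRequest.builder()
                    .stateMachineArn(stateMachineArn)
                    .input(inputJson)
                    .build());
            System.out.println("Started execution: " + response.executionArn());
        }
    }
}

The preprocessing Lambda inside the Map iterator would then fetch the actual data for each ID, which keeps the execution input well under the 256 KB limit.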

Related

Dividing tasks into AWS Step Functions and then joining them back when all are completed

We have an AWS step function that processes CSV files. These CSV files can contain anywhere from 1 to 4,000 records.
Now I want to create another, inner AWS step function that will process these CSV records. The problem is that for each record I need to hit another API, and I want all of the records to be processed asynchronously.
For example: a CSV is received containing 2,500 records.
The step function calls another step function 2,500 times (the inner step function takes a single CSV record as input), processes each record, and then stores the result in DynamoDB or some other place.
I have learned about the callback pattern in AWS Step Functions, but in my case I would be passing 2,500 tokens, and I want the outer step function to continue only once all 2,500 records are done processing.
So my question is: is this possible using AWS Step Functions?
If you know of any article or guide I could reference, that would be great.
Thanks in advance
It sounds like dynamic parallelism could work:
To configure a Map state, you define an Iterator, which is a complete sub-workflow. When a Step Functions execution enters a Map state, it will iterate over a JSON array in the state input. For each item, the Map state will execute one sub-workflow, potentially in parallel. When all sub-workflow executions complete, the Map state will return an array containing the output for each item processed by the Iterator.
This keeps the flow all within a single Step Function and allows for easier traceability.
The limiting factor would be the amount of concurrency available (docs):
Concurrent iterations may be limited. When this occurs, some iterations will not begin until previous iterations have completed. The likelihood of this occurring increases when your input array has more than 40 items.
One additional thing to be aware of here is cost. You'll easily blow right through the free tier and start incurring actual cost (link).
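To make the Iterator concrete: the task inside the Map state could be a Lambda that handles one CSV record per invocation; the Map state fans the 2,500 records out to it and collects the results into an array. A hypothetical sketch (the handler class, input shape, and return value are made up):

import com.amazonaws.services.lambda.runtime.Context;
import com.amazonaws.services.lambda.runtime.RequestHandler;

import java.util.Map;

// One invocation per CSV record: the Map state passes each array element
// (deserialized here as a generic JSON object) to this handler.
public class ProcessCsvRecordHandler implements RequestHandler<Map<String, Object>, String> {
    @Override
    public String handleRequest(Map<String, Object> csvRecord, Context context) {
        // Call the downstream API for this single record here (omitted),
        // then persist the result to DynamoDB or elsewhere.
        // Whatever is returned is collected into the Map state's output array.
        return "PROCESSED";
    }
}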

Starting a new execution of a Step Function after exceeding 25,000 events when iterating through objects in an S3 bucket

I am iterating through an S3 bucket to process the files. My solution is based on this example:
https://rubenjgarcia.es/step-function-to-iterate-s3/
The iteration is working fine, but unfortunately I exceed the 25,000 events allowed for one execution, so it eventually fails. I know you have to start a new execution of the step function, but I'm unclear how to tell it where I am in the current iteration. I have the count of how many files have been processed and, obviously, the ContinuationToken. Can I use the ContinuationToken to keep track of where I am in iterating through the S3 bucket, and the count to tell it when to start a new execution?
I have read the AWS docs https://docs.aws.amazon.com/step-functions/latest/dg/tutorial-continue-new.html but I am not sure where to start applying this to my own solution. Has anyone done this when iterating through objects in an S3 bucket? If so, can you point me in the right direction?
I can think of two options:
In your solution you iterate as long as there is a next token. You can extend that: create a counter and increase it on each iteration, then change your condition to iterate as long as there is a next token and the count is less than a threshold.
I prefer to use a nested state machine to overcome the 25,000-event limitation. Say you read 100 items from S3 at a time. If you pass that list to a nested state machine to process, then neither the top-level state machine nor the nested one will reach 25,000 events.
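Either way, the heart of the loop is a step that lists one page of keys and hands the ContinuationToken (plus a running count) back to the state machine to carry forward. A rough sketch with the AWS SDK for Java v2 (the bucket, page size, and per-object processing are placeholders):

import software.amazon.awssdk.services.s3.S3Client;
import software.amazon.awssdk.services.s3.model.ListObjectsV2Request;
import software.amazon.awssdk.services.s3.model.ListObjectsV2Response;
import software.amazon.awssdk.services.s3.model.S3Object;

public class S3PageIterator {
    // Hypothetical page size; tune it so one execution stays well under 25,000 events.
    private static final int PAGE_SIZE = 100;

    // Lists one page of keys, resuming from the continuation token carried in the
    // state machine's state. The response's nextContinuationToken() and a running
    // count of processed files are what a Choice state can use to decide whether
    // to keep looping, hand off to a new execution, or finish.
    public static ListObjectsV2Response listNextPage(S3Client s3, String bucket, String continuationToken) {
        ListObjectsV2Request.Builder request = ListObjectsV2Request.builder()
                .bucket(bucket)
                .maxKeys(PAGE_SIZE);
        if (continuationToken != null) {
            request.continuationToken(continuationToken);
        }
        ListObjectsV2Response response = s3.listObjectsV2(request.build());
        for (S3Object object : response.contents()) {
            // process or dispatch object.key() here...
        }
        return response;
    }
}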

DynamoDB Scan/Query Return x Number of Items

If I scan or query in DynamoDB it is possible to set the Limit property. The DynamoDB documentation says the following:
The maximum number of items to evaluate (not necessarily the number of matching items).
So the problem with this is that if you set filters and such, it won't necessarily return the number of items you asked for.
The goal I'm trying to achieve is to have a filter on a scan or query but have it return x items, no matter what. I'm OK with having to use LastEvaluatedKey and make multiple requests, but I would like to make it as seamless and easy as possible (so not having to do that would be best).
The only way I have thought of to do this is to set the Limit property to, say, 1, and then just keep scanning or querying using the LastEvaluatedKey until I reach the x items I'm looking for. The problem is that this seems very wasteful and inefficient: if you have a table of millions of records you might have to make thousands and thousands of requests. It doesn't seem to scale very well, although I'm sure it's no different from what DynamoDB would be doing behind the scenes.
But is there a way to do this more efficiently where I can reduce the number of requests I have to make? Or is that the only way to achieve this?
How would you achieve this goal?
A single Query operation will read up to the maximum number of items set (if using the Limit parameter) or a maximum of 1 MB of data and then apply any filtering to the results using FilterExpression.
You're 100% right that Limit is applied before FilterExpression, meaning Dynamo might return some number of documents less than the Limit while other documents that satisfy the FilterExpression still exist in the table but aren't returned.
It sounds like it would be unacceptable for your API to behave in the same manner. That is going to mean that, in some cases, a single request to your service will result in multiple requests to Dynamo. Also, keep in mind that there is no way to predict what the LastEvaluatedKey will be, which would be required to parallelize these requests. So in the case that your service makes multiple requests to Dynamo, they will be serial. To me, this is a rather heavy tradeoff, but if it is a requirement that you satisfy the Limit whenever possible, you have options.
First, Dynamo will automatically page at 1 MB. That means you could simply send your query to Dynamo without a Limit and implement the Limit on your end. You may still need to make multiple requests to ensure that you've satisfied the Limit, but this approach will result in the fewest requests to Dynamo. The trade-off here is the total data being read and transferred: chances are your Limit will not line up perfectly with the 1 MB boundary, which means the excess data being read, filtered, and transferred is wasted.
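A sketch of that first approach, assuming the AWS SDK for Java v2 (the table name, key condition, and filter expression are placeholders):

import software.amazon.awssdk.services.dynamodb.DynamoDbClient;
import software.amazon.awssdk.services.dynamodb.model.AttributeValue;
import software.amazon.awssdk.services.dynamodb.model.QueryRequest;
import software.amazon.awssdk.services.dynamodb.model.QueryResponse;

import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class LimitedQuery {
    // Keeps querying (letting Dynamo page at 1 MB) until we have collected
    // `limit` post-filter items or run out of data.
    public static List<Map<String, AttributeValue>> queryWithClientSideLimit(DynamoDbClient dynamo, int limit) {
        List<Map<String, AttributeValue>> items = new ArrayList<>();
        Map<String, AttributeValue> exclusiveStartKey = null;

        do {
            QueryRequest.Builder request = QueryRequest.builder()
                    .tableName("my-table")
                    .keyConditionExpression("pk = :pk")
                    .filterExpression("attribute_exists(active)")
                    .expressionAttributeValues(Map.of(":pk", AttributeValue.builder().s("some-partition").build()));
            if (exclusiveStartKey != null) {
                request.exclusiveStartKey(exclusiveStartKey);
            }

            QueryResponse response = dynamo.query(request.build());
            for (Map<String, AttributeValue> item : response.items()) {
                items.add(item);
                if (items.size() == limit) {
                    // Remember this item's key yourself if the caller needs to resume later;
                    // Dynamo's LastEvaluatedKey points at its own page boundary, not yours.
                    return items;
                }
            }
            exclusiveStartKey = response.hasLastEvaluatedKey() ? response.lastEvaluatedKey() : null;
        } while (exclusiveStartKey != null);

        return items;
    }
}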
You already mentioned the other extreme of sending a Limit of 1 and pointed out that it will result in the maximum number of requests to Dynamo.
Another approach along these lines is to create some sort of probabilistic function that takes the Limit given to your service by the client and computes a new Limit for Dynamo. For example, say your FilterExpression filters out about half of the documents in the table. That means you can multiply the client Limit by 2, and that would be a reasonable Limit to send to Dynamo. Of the approaches we've talked about so far, this one has the highest potential for efficiency; however, it also has the highest potential for complexity. For example, you might find that a simple linear function is not good enough and instead you need machine learning to find a multi-variate, non-linear function to calculate the new Limit. This approach also heavily depends on the uniformity of your data in Dynamo as well as your access patterns. Again, you might need machine learning to optimize for those variables.
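As a trivial illustration of that scaling, using the roughly-50% guess from above:

public class LimitScaling {
    // Illustrative only: if the FilterExpression keeps about half of the items it
    // evaluates, ask Dynamo for roughly twice what the client asked for.
    static int scaleLimit(int clientLimit, double estimatedSelectivity) {
        return (int) Math.ceil(clientLimit / estimatedSelectivity);
    }

    public static void main(String[] args) {
        System.out.println(scaleLimit(25, 0.5)); // prints 50
    }
}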
In any of the cases where you are implementing the Limit on your end, if you plan on sending back the LastEvaluatedKey to the client for subsequent calls to your service, you will also need to take care to keep track of the LastEvaluatedKey that you evaluated. You will no longer be able to rely on the LastEvaluatedKey returned from Dynamo.
The final approach would be to reorganize or regroup your data, either with a GSI, with a separate table that you keep in sync using DynamoDB Streams, or with a different schema altogether, with the goal of not requiring a FilterExpression at all.

Is it okay to set reduce_limit = false in the CouchDB configuration?

I am working on a map/reduce view and I always get a reduce_overflow_error each time I run the view. If I set reduce_limit = false in the CouchDB configuration, it works. I want to know if there is any negative effect if I change this config setting? Thank you.
The setting reduce_limit=true forces CouchDB to control the size of the reduced output at each step of the reduction. If the stringified JSON output of a reduction step is longer than 200 characters and at least twice as long as its input, CouchDB's query server throws an error. Both numbers, 2x and 200 chars, are hard-coded.
Since a reduce function runs inside SpiderMonkey instance(s) with only 64 MB of RAM available, the default limitation looks somewhat reasonable. Theoretically, reduce must fold the data it is given, not blow it up.
However, in real life it's quite hard to stay under that limit in all cases. You cannot control the number of chunks for a (re)reduction step, which means you can run into a situation where your output for a particular chunk is more than twice as long in characters, even though other reduced chunks are much shorter. In this case, even one awkward chunk breaks the entire reduction if reduce_limit is set.
So unsetting reduce_limit might be helpful if your reducer can sometimes output more data than it received.
A common case is unrolling arrays into objects. Imagine you receive a list of arrays like [[1,2,3...70], [5,6,7...], ...] as input rows, and you want to aggregate your list into something like {key0: (sum of 0th elts), key1: (sum of 1st elts), ...}.
If CouchDB decides to send you a chunk with only 1 or 2 rows, you get an error. The reason is simple: object keys are also counted when calculating the result length.
A possible (but very hard to hit) negative effect is the SpiderMonkey instance constantly restarting or failing on RAM over-quota when trying to process a reduction step or the entire reduction. Restarting SpiderMonkey is CPU- and RAM-intensive and generally costs hundreds of milliseconds.

Sharing counter values between MapReduce mappers

I have a mapper that reads input and writes to a database. I want to limit how many inputs are actually converted and written to that database, and all mappers must contribute to the limit and then stop once that limit is reached (approximately; one or two extra isn't a big deal).
I implemented a limiter function on our mapper that asks the other tasks, "How many records have you imported?" Once a given limit is reached, it will stop importing those records (although it will continue processing them for other purposes.)
The map code in question looks something like this:
@Override
public void map(ImmutableBytesWritable key, Result row, Context context)
        throws IOException, InterruptedException {
    // prepare the input
    // ...

    // Import only while the locally visible counter is below the limit.
    if (context.getCounter(Metrics.IMPORTED).getValue() < IMPORT_LIMIT) {
        importRecord();
        context.getCounter(Metrics.IMPORTED).increment(1L);
    }

    // do other things
    // ...
}
So each mapper checks to see if there is more room to import, and only if the limit hasn't been reached does it perform any importing. However, each mapper itself imports up to the limit, so with 16 mappers we get 16 * IMPORT_LIMIT records imported. It's definitely doing SOME limiting (the count is much, much lower than the normal number of imported records).
When are counter values pushed to other mappers, or are they even available to each mapper? Can I actually get somewhat real-time values from the counter, or do they only update when a mapper is finished? Is there a better way to share a value between mappers?
Okay: from what I've seen, MapReduce doesn't share counters between mappers until the job is finished (i.e. not at all). I'm not sure whether mappers that commit partway through will let later mappers see their counters, but it's not reliable enough to be used in real time.
Instead, I will run a simple Java application that iterates over the rows on its own and writes to a column, which the existing MapReduce job will use to determine whether it should import the row.
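A rough sketch of that pre-marking pass with the HBase client (the mapper signature above suggests an HBase table as the source; the table, column family, qualifier, and limit here are made up):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class MarkRowsForImport {
    public static void main(String[] args) throws Exception {
        final int importLimit = 10_000;
        Configuration conf = HBaseConfiguration.create();

        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("my_table"))) {

            int marked = 0;
            try (ResultScanner scanner = table.getScanner(new Scan())) {
                for (Result row : scanner) {
                    if (marked >= importLimit) {
                        break;
                    }
                    // Flag this row; the mapper then imports only flagged rows.
                    Put put = new Put(row.getRow());
                    put.addColumn(Bytes.toBytes("meta"), Bytes.toBytes("import"), Bytes.toBytes(true));
                    table.put(put);
                    marked++;
                }
            }
        }
    }
}

Because the flag is written in a single, serial pass before the job runs, the limit is enforced exactly once instead of once per mapper.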