Processing rather big text files on serverless AWS

I'm trying to figure out an architecture for processing rather big files (maybe a few hundred MB) on serverless AWS. This is what I've got so far:
API Gateway -> S3 -> Lambda function -> SNS -> Lambda function
In this scenario, the text file is uploaded to S3 through API Gateway. Then a Lambda function is called based on the event generated on S3. This Lambda function will open the text file and read it line by line, generating tasks to be done as messages in an SNS topic. Each message will invoke a separate Lambda function to process the task.
My only concern is the first Lambda function call. What if it times out? How can I make sure that it's not a point of failure?

You can ask S3 to only return a particular byte range of a given object, using the Range header: https://docs.aws.amazon.com/AmazonS3/latest/API/RESTObjectGET.html
for example:
Range: bytes=0-9
would return only the first 10 bytes of the S3 object.
To read a file line by line, you would have to decide on a specific chunk size (1 MB for example), read 1 chunk of the file at a time and split the chunk by line (by looking for newline characters). Once the whole chunk has been read, you could re-invoke the lambda and pass the chunk pointer as a parameter. The new invocation of the lambda will read the file from the chunk pointer given as a parameter.
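To make this concrete, here is a minimal sketch of that idea in Python with boto3 (the bucket/key parameters and the chunk-pointer handling are assumptions, not part of the answer above): read a 1 MB range, split it into lines, and compute the offset to pass to the next invocation.

import boto3

s3 = boto3.client("s3")
CHUNK_SIZE = 1024 * 1024  # 1 MB, as suggested above

def read_chunk(bucket, key, offset):
    # Ask S3 for just this byte range of the object.
    resp = s3.get_object(
        Bucket=bucket,
        Key=key,
        Range=f"bytes={offset}-{offset + CHUNK_SIZE - 1}",
    )
    data = resp["Body"].read()
    lines = data.split(b"\n")
    # The last element may be a partial line; drop it from this chunk and
    # start the next invocation just before it.
    remainder = lines.pop()
    next_offset = offset + len(data) - len(remainder)
    return lines, next_offset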

First thing to know is that the Lambda CPU available is proportional to its configured RAM size. So, double the RAM gets you double the CPU.
If scaling up the Lambda doesn't do it, then some back-of-the-napkin ideas:
One workflow might be: if the size of the CSV is less than X (to be determined), process it in a single Lambda. If the size is more than X, invoke N sub-lambdas, pointing each at 1/Nth of the input file (assuming you can split the workload like this). The sub-lambdas use the byte-range feature of S3 described above. This is a kind of map/reduce pattern.
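A rough sketch of that fan-out, assuming a hypothetical "sub-processor" worker function (the name and payload shape are illustrative): compute N byte ranges from the object size and invoke one worker per range asynchronously.

import json
import boto3

s3 = boto3.client("s3")
lambda_client = boto3.client("lambda")

def fan_out(bucket, key, n):
    # Get the object size without downloading it.
    size = s3.head_object(Bucket=bucket, Key=key)["ContentLength"]
    part = size // n
    for i in range(n):
        start = i * part
        end = size - 1 if i == n - 1 else (i + 1) * part - 1
        lambda_client.invoke(
            FunctionName="sub-processor",   # hypothetical worker function
            InvocationType="Event",         # asynchronous fan-out
            Payload=json.dumps(
                {"bucket": bucket, "key": key, "start": start, "end": end}
            ),
        )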
Or maybe use Step Functions. Have a first Lambda invocation begin to process the file, keeping track of the time remaining (available from the context object), and respond to Step Functions to indicate how far it got. Then Step Functions invokes a subsequent Lambda to process the next part of the file, and so on, until complete.
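A sketch of what that worker might look like (the event shape and the fetch_lines/process helpers are hypothetical): process until time runs low, then report the cursor back to Step Functions, which can use a Choice state to re-invoke with the new cursor.

SAFETY_MARGIN_MS = 30_000  # stop well before the configured timeout

def lambda_handler(event, context):
    cursor = event.get("cursor", 0)
    lines = fetch_lines(event["bucket"], event["key"], cursor)  # hypothetical

    for line in lines:
        if context.get_remaining_time_in_millis() < SAFETY_MARGIN_MS:
            # Tell Step Functions how far we got so it can re-invoke us.
            return {"done": False, "cursor": cursor}
        process(line)  # hypothetical per-line work
        cursor += 1

    return {"done": True, "cursor": cursor}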
Or use EC2, containers, or even EMR (obviously not serverless).
Also, note that Lambda functions have limited disk space (512 MB in /tmp), so if you need to download the file to disk in order to process it, it will need to be under 512 MB, notwithstanding any other disk space you might need. Optionally, you can work around this disk space limitation by simply reading the file into memory (and resizing the Lambda function up to 3 GB as needed).

You can use AWS Batch instead of Lambda for the heavy stuff.
Create a docker container with your code, push it to ECR, then create a job definition to run it.
Use a Lambda to submit this job with the input file as a parameter.
Option 1: create a dependent job for the 2nd-stage processing, which will launch automatically when the first job succeeds.
Option 2: use Step Functions to orchestrate the whole scenario (note that the integration between Step Functions and Batch is not ideal).
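A rough sketch of the submitting Lambda (the queue and job definition names are placeholders): submit the first job with the uploaded file as a parameter, then submit a dependent second-stage job that Batch launches only after the first one succeeds.

import boto3

batch = boto3.client("batch")

def lambda_handler(event, context):
    # Triggered by the S3 upload event.
    key = event["Records"][0]["s3"]["object"]["key"]
    first = batch.submit_job(
        jobName="process-file",
        jobQueue="my-job-queue",             # placeholder
        jobDefinition="my-job-definition",   # placeholder
        parameters={"inputFile": key},
    )
    # Option 1 above: the 2nd stage runs automatically when the 1st succeeds.
    batch.submit_job(
        jobName="second-stage",
        jobQueue="my-job-queue",
        jobDefinition="my-second-stage-definition",  # placeholder
        dependsOn=[{"jobId": first["jobId"]}],
    )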

Related

Lambda: loading a 170 MB file from disk to memory takes 20+ seconds

I'm using a container image with 5×170 MB AI models.
When I invoke the function the first time, all those models are loaded into memory for further inference.
Problem: most of the time it takes about 10-25 sec per file to load (so a cold start takes about 2 minutes).
But sometimes it loads as expected, about 1-2 sec per model, and the cold start takes only 10 secs.
After a little investigation I've found that it's all about reading/opening the file from disk into memory. A simple "read byte-file from disk to variable" takes 10-20 seconds. Insane.
P.S. I'm using 10,240 MB RAM functions, which should have the most processing power.
Is there any way I can avoid such long loading? Why does it happen?
UPDATE:
I'm using onnxruntime and Python to load the model.
All code and models are stored in the container and opened/loaded from there.
From experiment: if I open any model with open("model.onnx", "rb") as f: cont = f.read(), it takes 20 secs to open the file. But when I then open the same file with model = onnxruntime.InferenceSession("model.onnx"), it loads instantly. So I've concluded that the problem is with opening/reading the file, not with onnx.
This also happens when reading big files in a "ZIP"-type function, so it doesn't look like a container problem.
TO REPRODUCE:
If you want to see how it works on your side:
Create a Lambda function.
Configure it with 10,240 MB RAM and a 30 sec timeout.
Upload the ZIP from my S3: https://alxbtest.s3.amazonaws.com/file-open-test.zip
Run/test the event. It took me 16 seconds to open the file.
The ZIP contains "model.onnx" (168 MB) and "lambda_function.py" with this code:
import json, time

def lambda_handler(event, context):
    # TODO implement
    tt = time.time()
    with open("model.onnx", "rb") as f:
        cont = f.read()
    tt = time.time() - tt
    print(f"Open time: {tt:0.4f} s")
    return {
        'statusCode': 200,
        'body': json.dumps(f'Open time: {tt:0.4f} s')
    }
Lambda is not designed for big heavy lifting. Its design intent is small, quickly-firing, low-scope functions. You have two options.
Use an EC2 instance. This is more expensive, but it is a server and is designed for this kind of thing.
Or try Elastic File System. This is another service that can be tied to Lambda and provides a 'cross-invocation' file system that Lambdas can access almost as if it were internal, and which exists outside of a single invocation of the Lambda. This allows you to have large objects 'pre-loaded' into the file system that the Lambda can access, manipulate, and do whatever with, without first loading them into its internal memory.
I noticed you also said AI models. There are specific services for machine learning, such as SageMaker, that you may want to look into.
SHORT ANSWER: you can't control the read/load speed of AWS Lambda.
First of all, this problem is about the read/write speed of the current Lambda instance. It looks like on first invocation AWS looks for a free instance it can place the Lambda function on, and those instances have different I/O speeds.
Most of the time it's about 6-9 MB/s for reading, which is insanely slow for opening and working with big files.
Sometimes you are lucky and get an instance with 50-80 MB/s reads. But it's pretty rare; don't count on it.
So, if you want faster speed you must pay more:
Use Elastic File System, as mentioned by @lynkfox
Use S3
BONUS:
If you're not wedded to AWS, I've found Google Cloud Run much more suitable for my needs:
It uses docker containers like AWS Lambda, is also billed per 100 ms, and can scale automatically.
Read speed is pretty stable, at about 75 MB/s.
You can select RAM and vCPU separately, which can lower costs.
You can load several big files simultaneously with multiprocessing, which makes the cold start much faster (in Lambda, the multiprocessing load time was the sum of all loaded files, so it didn't work for me).
The Init phase ends when the runtime and all extensions signal that they are ready by sending a Next API request. The Init phase is limited to 10 seconds. If all three tasks do not complete within 10 seconds, Lambda retries the Init phase at the time of the first function invocation.
Refer: https://docs.aws.amazon.com/lambda/latest/dg/lambda-runtime-environment.html
Check what the model load time is on any EC2 machine (or a CPU-based localhost).
If it is close to 10 seconds, there is a high chance the model is being loaded again. The next init generally happens quickly, as Lambda already has some of the content ready and the state loaded.
To make the read faster, others have suggested trying EFS. In addition, try EFS in Elastic mode.

AWS lambda sequentially invoke same function

I have nearly 1000 items in my DB. I have to run the same operation on each item. The issue is that this is a third-party service with a 1-second rate limit for each operation. Until now, I was able to do the entire thing inside a Lambda function, but it is now getting close to the 15 minute (900 second) timeout limit.
I was wondering what the best way of splitting this up would be. Can I dump each item (or batches of items) into SQS and have a Lambda function process them sequentially? From what I understand, this isn't the recommended approach, as I can't delay invocations for long enough. Or I would have to call a Lambda within a Lambda, which also sounds weird.
Is AWS Step Functions the way to go here? I haven't used that service yet, so I was wondering if there are other options too. I am also using the serverless framework for doing this if it is of any significance.
Both methods you mentioned would work. Within Lambda you could add a delay (sleep) after one item has been processed and then trigger another Lambda invocation following the delay. You'll be paying for that dead time, of course, if you use this approach, so Step Functions may be a more elegant solution. One Lambda can certainly invoke another, even invoking itself. If you invoke the next Lambda asynchronously, the initial function will finish while the newly-invoked function starts to run. This article on Asynchronous invocation will be useful for that approach. Essentially, each Lambda invocation would be responsible for processing one item, delaying sufficiently to accommodate the service limit, and then asynchronously invoking the function for the next item.
If anything goes wrong you'd want to build appropriate exception handling so a problem with one item either halts the rest or allows the rest of the chain to continue, depending on what is appropriate for your use case.
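A minimal sketch of that chained approach (the event shape and the process helper are assumptions): handle one item, sleep out the rate limit, then asynchronously invoke the same function for the next item.

import json
import time
import boto3

lambda_client = boto3.client("lambda")

def lambda_handler(event, context):
    items = event["items"]
    index = event.get("index", 0)

    process(items[index])  # hypothetical call to the rate-limited service
    time.sleep(1)          # respect the 1-second rate limit

    if index + 1 < len(items):
        lambda_client.invoke(
            FunctionName=context.function_name,  # this same function
            InvocationType="Event",              # asynchronous
            Payload=json.dumps({"items": items, "index": index + 1}),
        )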
Step Functions would also handle this use case well. With options like Wait and a loop you could achieve the same result. For example, your Step Functions flow could invoke one Lambda that processes an item and returns the next item, then run a Wait step, then process the next item, and so on until you reach the end. You could also use a Map that runs a Lambda task followed by a Wait task:
The Map state ("Type": "Map") can be used to run a set of steps for each element of an input array. While the Parallel state executes multiple branches of steps using the same input, a Map state will execute the same steps for multiple entries of an array in the state input.
This article on Iterating a Loop Using Lambda is also useful.
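To illustrate, here is a hedged sketch of creating such a state machine from code (all ARNs and names are placeholders): a Map state with MaxConcurrency set to 1 iterates over the items, running a Lambda task followed by a 1-second Wait for each one.

import json
import boto3

definition = {
    "StartAt": "ProcessAllItems",
    "States": {
        "ProcessAllItems": {
            "Type": "Map",
            "ItemsPath": "$.items",
            "MaxConcurrency": 1,  # one item at a time for the rate limit
            "Iterator": {
                "StartAt": "ProcessItem",
                "States": {
                    "ProcessItem": {
                        "Type": "Task",
                        "Resource": "arn:aws:lambda:us-east-1:123456789012:function:process-item",
                        "Next": "RateLimitWait",
                    },
                    "RateLimitWait": {"Type": "Wait", "Seconds": 1, "End": True},
                },
            },
            "End": True,
        }
    },
}

sfn = boto3.client("stepfunctions")
sfn.create_state_machine(
    name="rate-limited-processing",
    definition=json.dumps(definition),
    roleArn="arn:aws:iam::123456789012:role/StepFunctionsRole",  # placeholder
)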
If you want the messages to be processed serially and are happy to dump them into SQS, set both the concurrency of the Lambda and the batch size property of the SQS event that triggers the function to 1.
Make it a FIFO queue so that messages don't potentially get processed more than once, if that is important.

Create a parallel step function with a lambda

I have a question about the Step Functions part of AWS.
I have a function to watch and update data in databases. But because we can have anywhere from 1,000 to 1,000,000 items to update, I would like to handle them in batches of 10,000 or 100,000 with a Lambda.
The optimal solution would be to manage them in parallel, so that all the data is updated at the same time and the batches finish together.
So for that I would like to create a Lambda function with aws-sdk which creates a parallel step function with X tasks, where each task manages 10,000 or 100,000 items of the database.
But when I read the aws-sdk documentation, it looks like there is no way to create a parallel step function, even from a template.
So my question is: is it possible to create a parallel step function from a Lambda function with aws-sdk? Or do you have a better solution to my problem?
Thanks in advance
Update: To give you more information, my problem is that I'll have to update or insert an unknown amount of data in my DB on the first day of each month, and the problem is that I need to call an API that takes 15 seconds to return the data (it's not our API, so I cannot improve its response time).
If I just use a Lambda function, it will time out after 15 minutes.
So I thought of using Step Functions to execute the Lambda function for each item, but the problem is that if we have a lot of data it may take more than 24 hours, and I would like to find a solution where I can execute my Lambda function in parallel to optimize the time. That's why I thought about the parallel tasks of Step Functions.
But because the amount of data will change every month, I don't know how to dynamically increase or decrease the number of branches of my step function, and that's why I thought of generating my step function from another Lambda.
I have a function to watch and update data in databases.
I suppose what you need to watch is some kind of user/data events? What exactly do you watch, and what do you update?
Can you provide more info before I give you some architectural suggestions?
By the way, it is Step Functions that orchestrates/invokes Lambda functions, not the other way around.
Updated answer:
So you seem to be facing the 15-minute hard limit on Lambda max execution time. There are 3 approaches I can see:
Instead of using a Lambda function, use an ECS container or EC2 instance to handle the large volume of data processing and database writing. However, this requires a substantial code rewrite and infrastructure/architectural change.
Figure out a way to break down the input data so you can fan out the handling to multiple Lambda function instances, i.e.: input data -> Lambda to break down the task -> SQS messages -> Lambda to handle each task. But my concern is that the task of breaking down the input data may itself need substantial time.
Before the Lambda execution times out, mark the current processed position and invoke the same Lambda function with the original event plus a position offset; the next Lambda instance picks up the data processing from where the previous execution stopped. https://medium.com/swlh/processing-large-s3-files-with-aws-lambda-2c5840ae5c91
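A hedged sketch of that third approach (load_items and process_item are hypothetical helpers): watch the remaining execution time and, just before the timeout, re-invoke the same function asynchronously with a position offset.

import json
import boto3

lambda_client = boto3.client("lambda")
SAFETY_MARGIN_MS = 60_000  # stop ~1 minute before the 15-minute limit

def lambda_handler(event, context):
    position = event.get("position", 0)
    items = load_items()  # hypothetical: fetch this month's items

    while position < len(items):
        if context.get_remaining_time_in_millis() < SAFETY_MARGIN_MS:
            # Out of time: hand the remaining work to a fresh invocation.
            lambda_client.invoke(
                FunctionName=context.function_name,
                InvocationType="Event",  # asynchronous
                Payload=json.dumps({**event, "position": position}),
            )
            return {"status": "continued", "position": position}
        process_item(items[position])  # hypothetical per-item work
        position += 1

    return {"status": "done", "processed": position}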

How to consolidate the output of a number of Lambda function calls

I have a large file which I want to process using Lambda functions in AWS. Since I cannot control the size of the file, I came up with the solution of distributing the processing of the file across multiple Lambda function calls to avoid timeouts. Here's how it works:
I dedicated a bucket to accept the new input files to be processed.
I set a trigger on the bucket to handle each time a new file is uploaded (let's call it uploadHandler)
uploadHandler measures the size of the file and splits it into equal chunks.
Each chunk is sent to processor lambda function to be processed.
Notes:
The uploadHandler does not read the file content.
The data sent to processor is just a { start: #, end: # }.
Multiple instances of the processor are called in parallel.
Each processor call reads its own chunk of the file individually and generates the output for it.
So far so good. The problem is how to consolidate the output of all the processor calls into one output. Does anyone have any suggestions? Also, how do I know when the execution of all the processors is done?
I recently had a similar problem. I solved it using AWS Lambda and Step Functions with this solution: https://docs.aws.amazon.com/step-functions/latest/dg/tutorial-create-iterate-pattern-section.html
In this specific example the execution doesn't happen in parallel but sequentially. However, when the state machine finishes executing, you have the guarantee that the file was processed completely and correctly. I don't know if that is exactly what you are looking for.
Option 1:
After breaking the file, make the uploadHandler function call the processor functions synchronously.
Make the calls concurrent, so that you can trigger all processors at once. Lambda functions have only one vCPU (or 2 vCPUs if RAM > 1,800 MB), but the requests are IO-bound, so you only need one processor.
The uploadHandler will wait for all processors to respond, then you can assemble all responses.
Pros: simpler to implement, no storage;
Cons: no visibility on what's going on until everything is finished;
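A minimal sketch of Option 1 (the processor function name and chunk shape are assumptions): invoke all processors synchronously but concurrently from a thread pool, then collect the responses.

import json
from concurrent.futures import ThreadPoolExecutor
import boto3

lambda_client = boto3.client("lambda")

def invoke_processor(chunk):
    # Synchronous (RequestResponse) call to the hypothetical processor.
    response = lambda_client.invoke(
        FunctionName="processor",
        InvocationType="RequestResponse",
        Payload=json.dumps(chunk),
    )
    return json.loads(response["Payload"].read())

def process_all(chunks):
    # Threads just wait on network IO, so one vCPU is plenty.
    with ThreadPoolExecutor(max_workers=len(chunks)) as pool:
        return list(pool.map(invoke_processor, chunks))

# e.g. process_all([{"start": 0, "end": 1048575}, {"start": 1048576, "end": 2097151}])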
Option 2:
Persist a processingJob in a DB (RDS, DynamoDB, whatever). The uploadHandler would create the job and save the number of parts into which the file was broken up. Save the job ID with each file part.
Each processor gets one part (with the job ID), processes it, then store in the DB the results of the processing.
Make each processor check if it's the last one delivering its results; if yes, make it trigger an assembler function to collect all results and do whatever you need.
Pros: more visibility, as you can query your storage DB at any time to check which parts were processed and which are pending; you could store all sorts of metadata from the processor for detailed analysis, if needed;
Cons: requires a storage service and a slightly more complex handling of your Lambdas;
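For the "is this the last processor?" check in Option 2, a hedged sketch using a DynamoDB atomic counter (the table, attribute, and function names are assumptions): each processor decrements the count of pending parts, and whichever one reaches zero triggers the assembler.

import json
import boto3

dynamodb = boto3.client("dynamodb")
lambda_client = boto3.client("lambda")

def finish_part(job_id):
    # Atomically decrement the pending-parts counter and read the new value.
    response = dynamodb.update_item(
        TableName="processing-jobs",
        Key={"jobId": {"S": job_id}},
        UpdateExpression="ADD pendingParts :dec",
        ExpressionAttributeValues={":dec": {"N": "-1"}},
        ReturnValues="UPDATED_NEW",
    )
    if int(response["Attributes"]["pendingParts"]["N"]) == 0:
        # This processor delivered the last result: trigger the assembler.
        lambda_client.invoke(
            FunctionName="assembler",  # hypothetical consolidation function
            InvocationType="Event",
            Payload=json.dumps({"jobId": job_id}),
        )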

What are the options to process timeseries data from a Kinesis stream

I need to process data from an AWS Kinesis stream, which collects events from devices. The processing function has to be called every second with all events received during the last 10 seconds.
Say, I have two devices A and B that write events into the stream.
My procedure has name of MyFunction and takes the following params:
DeviceId
Array of data for a period
If I start processing at 10:00:00 (and already have accumulated events for devices A and B for the last 10 seconds)
then I need to make two calls:
MyFunction(A, {Events for device A from 09:59:50 to 10:00:00})
MyFunction(B, {Events for device B from 09:59:50 to 10:00:00})
In the next second, at 10:00:01:
MyFunction(A, {Events for device A from 09:59:51 to 10:00:01})
MyFunction(B, {Events for device B from 09:59:51 to 10:00:01})
and so on.
Looks like the simplest way to accumulate all the data received from devices is just to store it in memory in a temp buffer (the last 10 seconds only, of course), so I'd like to try this first.
And the most convenient way I have found to keep such a memory-based buffer is to create a Java Kinesis Client Library (KCL) based application.
I have also considered an AWS Lambda based solution, but it looks like it's impossible to keep data in memory between Lambda invocations. Another option for Lambda is to have 2 functions: the first one writes all the data into DynamoDB, and the second one is called each second to process data fetched from the DB, not from memory (so this option is much more complicated).
So my questions is: what can be other options to implement such processing?
So, what you are doing is called a "window operation" (or "windowed computation"). There are multiple ways to achieve it; as you said, buffering is the best option.
In-memory cache systems: Ehcache, Hazelcast
Accumulate data in a cache system and choose the proper eviction policy (10 seconds in your case). Then do a grouping summation operation and calculate the output.
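A minimal sketch of that buffering idea in plain Python (not tied to a particular cache product): keep a per-device deque and evict anything older than the 10-second window before each call.

import time
from collections import defaultdict, deque

WINDOW_SECONDS = 10
buffers = defaultdict(deque)  # device_id -> deque of (timestamp, event)

def add_event(device_id, event):
    buffers[device_id].append((time.time(), event))

def window(device_id):
    # Evict expired entries, then return the events still in the window.
    cutoff = time.time() - WINDOW_SECONDS
    buf = buffers[device_id]
    while buf and buf[0][0] < cutoff:
        buf.popleft()
    return [event for _, event in buf]

# Each second: MyFunction(device_id, window(device_id)) for each device.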
In-memory databases: Redis, VoltDB
Just like with a cache system, you can use a database architecture. Redis could be helpful here and keeps the state for you. If you use VoltDB or a similar SQL system, calling a sum() or avg() operation would be easier.
Spark Streaming: http://spark.apache.org/docs/latest/streaming-programming-guide.html#window-operations
It is possible to use Spark to do that windowed computation. You can try Elastic MapReduce (EMR), so you will stay in the AWS ecosystem and integration will be easier.