I have a large file which I want to process using Lambda functions in AWS. Since I cannot control the size of the file, I came up with the solution of distributing the processing of the file across multiple Lambda function calls to avoid timeouts. Here's how it works:
I dedicated a bucket to accept the new input files to be processed.
I set a trigger on the bucket to handle each time a new file is uploaded (let's call it uploadHandler)
The uploadHandler measures the size of the file and splits it into equal chunks.
Each chunk is sent to a processor Lambda function to be processed.
Notes:
The uploadHandler does not read the file content.
The data sent to processor is just a { start: #, end: # }.
Multiple instances of the processor are called in parallel.
Each processor call reads its own chunk of the file individually and generates the output for it.
So far so good. The problem is how to consolidate the output of all the processor calls into one output. Does anyone have a suggestion? And also, how do I know when the execution of all the processors is done?
I recently had a similar problem. I solved it using AWS Lambda and Step Functions with this solution: https://docs.aws.amazon.com/step-functions/latest/dg/tutorial-create-iterate-pattern-section.html
In this specific example the execution doesn't happen in parallel but sequentially. However, when the state machine finishes executing, you have the guarantee that the file was processed completely and correctly. I don't know if this is exactly what you are looking for.
Option 1:
After breaking the file up, make the uploadHandler function call the processor functions synchronously.
Make the calls concurrent, so that you can trigger all processors at once. Lambda functions have only one vCPU (or two vCPUs if RAM > ~1,800 MB), but the requests are IO-bound, so one vCPU is enough (see the sketch after this option).
The uploadHandler will wait for all processors to respond; then you can assemble all the responses.
Pros: simpler to implement, no storage;
Cons: no visibility on what's going on until everything is finished;
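A minimal sketch of that fan-out, assuming boto3 inside the uploadHandler and a processor function literally named "processor" (the function name and chunk shape are assumptions):

import json
from concurrent.futures import ThreadPoolExecutor

import boto3

lambda_client = boto3.client("lambda")

def invoke_processor(chunk):
    # Synchronous (RequestResponse) invocation; the payload is just byte offsets.
    response = lambda_client.invoke(
        FunctionName="processor",          # assumed function name
        InvocationType="RequestResponse",
        Payload=json.dumps(chunk),
    )
    return json.loads(response["Payload"].read())

def fan_out(chunks):
    # The calls are IO-bound, so plain threads are enough to run them concurrently.
    with ThreadPoolExecutor(max_workers=max(len(chunks), 1)) as pool:
        return list(pool.map(invoke_processor, chunks))

# e.g. results = fan_out([{"start": 0, "end": 999}, {"start": 1000, "end": 1999}])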
Option 2:
Persist a processingJob in a DB (RDS, DynamoDB, whatever). The uploadHandler would create the job and save the number of parts into which the file was broken up. Save the job ID with each file part.
Each processor gets one part (with the job ID), processes it, then stores the results of the processing in the DB.
Make each processor check whether it's the last one delivering its results; if so, make it trigger an assembler function to collect all results and do whatever you need (see the sketch after this option).
Pros: more visibility, as you can query your storage DB at any time to check which parts were processed and which are pending; you could store all sorts of metadata from the processor for detailed analysis, if needed;
Cons: requires a storage service and a slightly more complex handling of your Lambdas;
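A minimal sketch of the "last one triggers the assembler" check, assuming DynamoDB and purely illustrative table/function names (the uploadHandler would have created the job item with a parts_remaining counter):

import json

import boto3

dynamodb = boto3.client("dynamodb")
lambda_client = boto3.client("lambda")

def finish_part(job_id, part_id, result):
    # Store this part's result (could also go to S3 if it is large).
    dynamodb.put_item(
        TableName="processing_results",          # assumed table
        Item={"job_id": {"S": job_id}, "part_id": {"N": str(part_id)},
              "result": {"S": json.dumps(result)}},
    )
    # Atomically decrement the counter created by uploadHandler.
    resp = dynamodb.update_item(
        TableName="processing_jobs",             # assumed table
        Key={"job_id": {"S": job_id}},
        UpdateExpression="ADD parts_remaining :dec",
        ExpressionAttributeValues={":dec": {"N": "-1"}},
        ReturnValues="UPDATED_NEW",
    )
    if resp["Attributes"]["parts_remaining"]["N"] == "0":
        # This processor was the last one: trigger the assembler asynchronously.
        lambda_client.invoke(FunctionName="assembler",   # assumed function name
                             InvocationType="Event",
                             Payload=json.dumps({"job_id": job_id}))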
I'm using a container image with 5 × 170 MB AI models.
When I invoke the function for the first time, all those models are loaded into memory for subsequent inference.
Problem: most of the time it takes about 10-25 seconds per file to load (so the cold start takes about 2 minutes).
But sometimes it loads as expected, about 1-2 seconds per model, and the cold start takes only 10 seconds.
After a little investigation I found that it's all about reading/opening the file from disk into memory. A simple "read a byte file from disk into a variable" takes 10-20 seconds. Insane.
P.S. I'm using 10,240 MB RAM functions, which should have the most processing power.
Is there any way I can avoid such long load times? Why does it happen?
UPDATE:
I'm using onnxruntime and Python to load the model
All code and models are stored in the container and opened/loaded from there.
From experiment: if I open any model with open("model.onnx", "rb") as f: cont = f.read(), it takes 20 seconds to open the file. But when I then open the same file with model = onnxruntime.InferenceSession("model.onnx"), it loads instantly. So I concluded that the problem is with opening/reading the file, not with onnx.
This also happens when reading big files in a "ZIP"-type function, so it doesn't look like a container problem.
TO REPRODUCE:
If you want to see how it works on your side.
Create lambda function
Configure it with 10,240 MB RAM and a 30-second timeout
Upload ZIP from my S3: https://alxbtest.s3.amazonaws.com/file-open-test.zip
Run/test event. It took me 16 seconds to open the file.
The zip contains "model.onnx" (168 MB) and "lambda_function.py" with this code:
import json, time

def lambda_handler(event, context):
    # TODO implement
    tt = time.time()
    with open("model.onnx", "rb") as f:
        cont = f.read()
    tt = time.time() - tt
    print(f"Open time: {tt:0.4f} s")
    return {
        'statusCode': 200,
        'body': json.dumps(f'Open time: {tt:0.4f} s')
    }
Lambda is not designed for big heavy lifting. Its design intent is small, quickly firing, narrow-scope functions. You have two options.
Use an EC2 instance. This is more expensive, but it is a server and designed for this kind of thing.
Maybe try Elastic File System - this is another service that can be tied to Lambda, providing a 'cross-invocation' file system that Lambdas can access almost as if it were internal and that exists outside any single invocation. This lets you have large objects 'pre-loaded' into the file system so the Lambda can access, manipulate, and do whatever it needs with them without first loading them into its internal memory.
I noticed you also said AI models. There are specific services for machine learning, such as SageMaker, that you may want to look into.
SHORT ANSWER: you can't control the read/load speed of AWS Lambda.
First of all, this problem is about the read/write speed of the current Lambda instance. It looks like on the first invocation AWS looks for a free instance to place the Lambda function on, and those instances have different IO speeds.
Most often it's about 6-9 MB/s for reading, which is insanely slow for opening and working with big files.
Sometimes you are lucky and get an instance with 50-80 MB/s reads, but that's pretty rare. Don't count on it.
So, if you want faster speed you must pay more:
Use Elastic File System, as mentioned by @lynkfox
Use S3
BONUS:
If you're not tied to AWS, I've found Google Cloud Run much more suitable for my needs:
It uses Docker containers like AWS Lambda, is also billed per 100 ms, and can scale automatically.
Read speed is pretty stable, about 75 MB/s.
You can select RAM and vCPU separately, which can lower costs.
You can load several big files simultaneously with multiprocessing, which makes the cold start much faster (in Lambda, the multiprocessing load time was the sum of all loaded files' times, so it didn't work for me); see the sketch below.
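A minimal sketch of that parallel load (file names are assumptions): each file is read in its own process, so the cold start is bounded by the slowest single file rather than the sum of all of them.

import time
from multiprocessing import Pool

MODEL_FILES = ["model_1.onnx", "model_2.onnx", "model_3.onnx"]  # assumed names

def load_file(path):
    # Read one model file fully into memory.
    with open(path, "rb") as f:
        return path, f.read()

if __name__ == "__main__":
    start = time.time()
    with Pool(processes=len(MODEL_FILES)) as pool:
        models = dict(pool.map(load_file, MODEL_FILES))
    print(f"Loaded {len(models)} files in {time.time() - start:0.2f} s")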
The Init phase ends when the runtime and all extensions signal that they are ready by sending a Next API request. The Init phase is limited to 10 seconds. If all three tasks do not complete within 10 seconds, Lambda retries the Init phase at the time of the first function invocation.
Refer: https://docs.aws.amazon.com/lambda/latest/dg/lambda-runtime-environment.html
Check what the model load time is on any EC2 machine (or a CPU-based localhost).
If it is close to 10 seconds, there is a high chance the model is being loaded again. The next init generally happens quickly, as Lambda already has some of the content ready and the state loaded.
To make the read faster, others have suggested trying EFS. In addition, try EFS in Elastic Throughput mode.
I have a rosbag file which contains messages on various topics; each topic has its own frequency. This data has been captured from a hardware device streaming data, and data from all topics would "arrive" at the same time to be used by different algorithms.
I wish to simulate this using the rosbag file (think of it as every topic having an associated array of data), and it is imperative that this data-streaming process start at the same time so that the data stays in sync.
I do this by launching different publishers on different threads (I am open to other approaches as well; this was the only one I could think of), but the threads do not start at the same time: by the time thread 3 starts, thread 1 is already considerably ahead.
How may I achieve this?
Edit - I understand that launching at the exact same time is not possible, but maybe I can get away with launches that are extremely close to each other. Is there any way to ensure this?
Edit 2 - Since the main aim is to get the data streams in sync, I was wondering about the warm-up effect of a thread (suppose thread 1 starts at 3.3 GHz and reaches 4.2 GHz by the time thread 2 starts at 3.2 GHz). Would this have a significant effect? (I can always warm the threads up before starting the publishing process, but I am curious whether it would have a pronounced effect.)
TIA
As others have stated in the comments, you cannot guarantee that threads launch at exactly the same time. To address your overall goal: you're going about solving this problem the wrong way, from a ROS perspective. Instead of manually publishing data and trying to get it in sync, you should be using the rosbag API. This way you can actually guarantee that messages have the same timestamp. Note that this doesn't guarantee they will be sent out at the exact same time, because they won't. You can put a message into a bag file directly like this:
import rosbag
from std_msgs.msg import Int32, String

bag = rosbag.Bag('test.bag', 'w')

try:
    s = String()
    s.data = 'foo'

    i = Int32()
    i.data = 42

    bag.write('chatter', s)
    bag.write('numbers', i)
finally:
    bag.close()
For more complex types that include a Header field, simply edit the header.stamp portion to keep timestamps consistent, as in the sketch below.
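A sketch with illustrative message types (the types, topic names, and timestamp value are assumptions), giving both messages the same header.stamp and record time:

import rosbag
import rospy
from geometry_msgs.msg import PoseStamped
from sensor_msgs.msg import Imu

stamp = rospy.Time.from_sec(1234.5)   # one shared timestamp, value assumed

pose = PoseStamped()
pose.header.stamp = stamp

imu = Imu()
imu.header.stamp = stamp

bag = rosbag.Bag('synced.bag', 'w')
try:
    # Passing t=stamp also keeps the bag's record time consistent across topics.
    bag.write('pose', pose, t=stamp)
    bag.write('imu', imu, t=stamp)
finally:
    bag.close()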
I'm trying to figure out an architecture for processing rather big files (maybe a few hundred MB) on serverless AWS. This is what I've got so far:
API Gateway -> S3 -> Lambda function -> SNS -> Lambda function
In this scenario, the text file is uploaded to S3 through API Gateway. Then a Lambda function is called based on the event generated by S3. This Lambda function opens the text file and reads it line by line, generating tasks to be done as messages in an SNS topic. Each message invokes a separate Lambda function to process the task.
My only concern is the first Lambda function call. What if it times out? How can I make sure that it's not a point of failure?
You can ask S3 to only return a particular byte range of a given object, using the Range header: https://docs.aws.amazon.com/AmazonS3/latest/API/RESTObjectGET.html
for example:
Range: bytes=0-9
would return only the first 10 bytes of the S3 object.
To read a file line by line, you would have to decide on a specific chunk size (1 MB, for example), read one chunk of the file at a time, and split the chunk into lines (by looking for newline characters). Once the whole chunk has been read, you could re-invoke the Lambda and pass the chunk pointer as a parameter. The new invocation of the Lambda then reads the file from the chunk pointer given as a parameter.
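A minimal sketch of that chunked read, assuming boto3 and an illustrative bucket/key; only complete lines are returned, and the byte offset just past the last newline is handed to the next invocation:

import boto3

s3 = boto3.client("s3")
CHUNK_SIZE = 1024 * 1024  # 1 MB

def read_chunk(bucket, key, offset):
    resp = s3.get_object(
        Bucket=bucket,
        Key=key,
        Range=f"bytes={offset}-{offset + CHUNK_SIZE - 1}",
    )
    data = resp["Body"].read()
    # Anything after the last newline belongs to the next chunk,
    # so the next offset points just past that newline.
    last_newline = data.rfind(b"\n")
    if last_newline == -1:
        return [], offset + len(data)
    lines = data[:last_newline].split(b"\n")
    return lines, offset + last_newline + 1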
The first thing to know is that the available Lambda CPU is proportional to its configured RAM size. So, doubling the RAM gets you double the CPU.
If scaling up the Lambda doesn't do it, then some back-of-a-napkin ideas:
One workflow might be: if the size of the CSV is less than X (to be determined), then process it in a single Lambda. If the size is more than X, then invoke N sub-Lambdas, pointing each at 1/Nth of the input file (assuming you can split the workload like this). The Lambdas use the get-range feature of S3. This is a kind of map/reduce pattern.
Or maybe use Step Functions. Have a first Lambda invocation begin to process the file, keeping track of the time remaining (available from the context object), and respond to Step Functions to indicate how far it got. Then Step Functions invokes a subsequent Lambda to process the next part of the file, and so on, until complete; see the sketch below.
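A rough handler sketch of that second idea (process_next_piece is a hypothetical helper, and the 30-second buffer is an assumption):

TIME_BUFFER_MS = 30_000  # stop well before the Lambda timeout (value assumed)

def lambda_handler(event, context):
    offset = event.get("offset", 0)
    # Keep working while there is comfortably more time left than the buffer.
    while context.get_remaining_time_in_millis() > TIME_BUFFER_MS:
        offset = process_next_piece(offset)   # hypothetical helper
        if offset is None:                    # nothing left to process
            return {"done": True}
    # Step Functions checks "done" and re-invokes with the saved offset.
    return {"done": False, "offset": offset}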
Or use EC2, containers, or even EMR (obviously not serverless).
Also, note that Lambda functions have limited disk space (512 MB in /tmp), so if you need to download the file to disk in order to process it, it will need to fit within that, notwithstanding any other disk space you might need. Optionally, you can work around this disk-space limitation by simply reading the file into memory (and resizing the Lambda function up to 3 GB as needed).
You can use AWS Batch instead of Lambda for the heavy stuff.
Create a Docker container with your code, push it to ECR, then create a job definition to run it.
Use Lambda to submit this job with the input file as a parameter.
Option 1: create a dependent job for the second-stage processing, which will launch automatically when the first job succeeds.
Option 2: use Step Functions to orchestrate the whole scenario (note that the integration between Step Functions and Batch is not ideal).
I came across a situation where I have to log the last 1000 events present in the queue.
What would be the best solution to handle this while reducing costly file operations?
At present we completely rewrite the file with all the queue entries.
Of the two solutions mentioned below, which one is good? Or is there another option to speed up the logging?
Making the log messages a fixed size and using a file pointer to do the read/write operations.
Creating multiple files and, when the request comes, reading the last 1000 events from the most recent files.
There are several considerations here that can't all be optimized simultaneously. Among them are:
the latency and throughput of the process emitting the logging messages
the total number of IO operations
the latency of reading log messages
There probably is no "best way". You need to find a working point that suits your requirements.
For example, Nathan Oliver basically suggested in the comments having the emitting process write to some aux file, and once it is full, renaming aux to log.
This idea has very low latency characteristics for the emitter and an essentially optimal total number of IO ops. Conversely (at least depending on the implementation), it has unbounded latency for the reader. Say the logger emits 1700 messages, then indefinitely stops logging. There is no bound on the time it will take the log reader to access the last 700 messages.
So, this idea might be excellent in some settings, but in other settings it might be considered less adequate.
A different way of doing it (with a different working point) is again to have the process emitting the messages write to some aux file. When either aux holds a number of messages exceeding some threshold (possibly less than 1000), or a certain amount of time has passed, it should rename aux to a temporarily-named file in a temp directory.
Meanwhile, a background process can periodically scan the temp directory. When it sees files there, it should read:
the log file (which is the only file viewed externally)
the files it found in tmp sorted by modification time
It should retain the last 1000 messages (at most), write them to some tmp_log file, rename it to log, and then erase the files it read in tmp.
This has reasonable latency for both emitter and reader, but more total IO accesses. YMMV. A sketch of this scheme follows.
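A rough sketch of the scheme (file names, rotation threshold, and the use of rename as the cheap hand-off are all assumptions):

import time
from pathlib import Path

LOG = Path("log")          # the only file viewed externally
AUX = Path("aux")
TMP = Path("tmp")
MAX_EVENTS = 1000
ROTATE_AT = 200            # rotate well before 1000 messages (value assumed)

_emitted = 0

def emit(line):
    # Emitter side: a cheap append plus an occasional rename, nothing else.
    global _emitted
    with AUX.open("a") as f:
        f.write(line + "\n")
    _emitted += 1
    if _emitted >= ROTATE_AT:
        TMP.mkdir(exist_ok=True)
        AUX.rename(TMP / f"{time.time_ns()}.part")
        _emitted = 0

def consolidate():
    # Background side: merge log + rotated parts, keep only the last 1000 lines.
    parts = sorted(TMP.glob("*.part"), key=lambda p: p.stat().st_mtime)
    lines = LOG.read_text().splitlines() if LOG.exists() else []
    for p in parts:
        lines.extend(p.read_text().splitlines())
    tmp_log = Path("tmp_log")
    tmp_log.write_text("\n".join(lines[-MAX_EVENTS:]) + "\n")
    tmp_log.rename(LOG)    # atomic swap into place
    for p in parts:
        p.unlink()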
I need to process data from an AWS Kinesis stream, which collects events from devices. The processing function has to be called each second with all events received during the last 10 seconds.
Say, I have two devices A and B that write events into the stream.
My procedure is named MyFunction and takes the following params:
DeviceId
Array of data for a period
If I start processing at 10:00:00 (and have already accumulated events for devices A and B for the last 10 seconds),
then I need to make two calls:
MyFunction(A, {Events for device A from 09:59:50 to 10:00:00})
MyFunction(B, {Events for device B from 09:59:50 to 10:00:00})
In the next second, at 10:00:01
MyFunction(A, {Events for device A from 09:59:51 to 10:00:01})
MyFunction(B, {Events for device B from 09:59:51 to 10:00:01})
and so on.
It looks like the simplest way to accumulate all the data received from the devices is just to store it in memory in a temp buffer (the last 10 seconds only, of course), so I'd like to try this first.
And the most convenient way I have found to keep such a memory-based buffer is to create a Java Kinesis Client Library (KCL) based application.
I have also considered an AWS Lambda based solution, but it looks like it's impossible to keep data in memory across Lambda invocations. Another option for Lambda would be to have two functions: the first writes all the data into DynamoDB, and the second is called each second to process data fetched from the DB, not from memory. (So this option is much more complicated.)
So my question is: what other options are there to implement such processing?
So, what you are doing is called a "window operation" (or "windowed computation"). There are multiple ways to achieve that and, like you said, buffering is the best option.
In memory cache systems: Ehcache, Hazelcast
Accumulate data in a cache system and choose the proper eviction policy (10 seconds in your case). Then do a grouping summation operation and calculate the output; a simple in-memory sketch follows.
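A minimal sketch of such a buffer in plain Python (structure and names are illustrative), with eviction done before each per-second call:

import time
from collections import defaultdict, deque

WINDOW_SECONDS = 10
buffers = defaultdict(deque)          # device_id -> deque of (timestamp, event)

def on_kinesis_record(device_id, event):
    # Called for every record pulled from the stream.
    buffers[device_id].append((time.time(), event))

def tick(my_function):
    # Called once per second: evict old events, then call MyFunction per device.
    cutoff = time.time() - WINDOW_SECONDS
    for device_id, buf in buffers.items():
        while buf and buf[0][0] < cutoff:
            buf.popleft()
        my_function(device_id, [event for _, event in buf])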
In memory database: Redis, VoltDB
Just like a cache system, you can use a database. Redis could be helpful here and is stateful. If you use VoltDB or a similar SQL system, calling a sum() or avg() operation would be easier.
Spark Streaming: http://spark.apache.org/docs/latest/streaming-programming-guide.html#window-operations
It is possible to use Spark to do that counting. You can try Elastic MapReduce (EMR), so you will stay within the AWS ecosystem and the integration will be easier.