Lambda with EFS performance issues - amazon-web-services

I'm using Lambda with EFS and seeing very high latencies, which make the whole solution unusable.
Writing small files (1 KB) starts at around 20 ms, and the latencies add up significantly when I write from 100 concurrent threads.
When I batch into larger files I still get 50+ ms latencies.
Reading the docs (https://docs.aws.amazon.com/efs/latest/ug/performance.html), they promise sub-millisecond latencies.
Am I doing something wrong?
The code I'm using to write is a simple python script:
with open(filepath, 'wb') as f:
    f.write(fact_to_write["data"])
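For reference, a minimal sketch of how the per-write latency could be measured from concurrent threads; the mount path, payload size, and thread count are assumptions, so adjust them to your setup:

import os
import time
from concurrent.futures import ThreadPoolExecutor

MOUNT = "/mnt/efs/test"   # assumed EFS mount path inside the Lambda
PAYLOAD = b"x" * 1024     # 1 KB of dummy data

def timed_write(i):
    # Write one small file and return the elapsed time in milliseconds.
    path = os.path.join(MOUNT, f"file_{i}.bin")
    start = time.time()
    with open(path, "wb") as f:
        f.write(PAYLOAD)
    return (time.time() - start) * 1000

os.makedirs(MOUNT, exist_ok=True)
with ThreadPoolExecutor(max_workers=100) as pool:
    latencies = list(pool.map(timed_write, range(100)))
print(f"min {min(latencies):.1f} ms, max {max(latencies):.1f} ms, avg {sum(latencies)/len(latencies):.1f} ms")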

Related

Lambda: loading 170mb file from disk to memory takes 20+ seconds

I'm using a container image with 5 × 170 MB AI models.
When I invoke the function for the first time, all of those models are loaded into memory for later inference.
Problem: most of the time it takes about 10-25 seconds per file to load (so the cold start takes about 2 minutes).
But sometimes it loads as expected, about 1-2 seconds per model, and the cold start takes only 10 seconds.
After a little investigation I found that it's all about reading/opening the file from disk into memory: a simple "read byte file from disk into a variable" takes 10-20 seconds. Insane.
P.S. I'm using 10,240 MB RAM functions, which should get the most processing power.
Is there any way to avoid such long load times? Why does it happen?
UPDATE:
I'm using onnxruntime and Python to load the model.
All code and models are stored in the container and opened/loaded from there.
From experiment: if I open any model with with open("model.onnx","rb") as f: cont = f.read() it takes 20 seconds to open the file, but when I then open the same file with model = onnxruntime.InferenceSession("model.onnx") it loads instantly. So I've concluded that the problem is with opening/reading the file, not with onnx.
This also happens when reading big files in a "ZIP" type function, so it doesn't look like a container problem.
TO REPRODUCE:
If you want to see how it behaves on your side:
Create a Lambda function
Configure it with 10,240 MB RAM and a 30-second timeout
Upload the ZIP from my S3: https://alxbtest.s3.amazonaws.com/file-open-test.zip
Run/test the event. It took me 16 seconds to open the file.
The zip contains "model.onnx" (168 MB) and "lambda_function.py" with this code:
import json, time

def lambda_handler(event, context):
    tt = time.time()
    with open("model.onnx", "rb") as f:
        cont = f.read()
    tt = time.time() - tt
    print(f"Open time: {tt:0.4f} s")
    return {
        'statusCode': 200,
        'body': json.dumps(f'Open time: {tt:0.4f} s')
    }
Lambda is not designed for heavy lifting. Its design intent is small, quick, narrowly scoped functions. You have two options.
Use an EC2 instance. This is more expensive, but it is a server and is designed for this kind of thing.
Maybe try Elastic File System. This is another service that can be tied to Lambda and provides a 'cross-invocation' file system that Lambdas can access almost as if it were internal, and which exists outside any single invocation. This lets you have large objects 'pre-loaded' onto the file system so the Lambda can access, manipulate, and work with them without first loading them into its own memory.
I noticed you also mentioned AI models. There are dedicated services for machine learning, such as SageMaker, that you may want to look into.
SHORT ANSWER: you can't control the read/load speed of an AWS Lambda instance.
First of all, this problem is about the read/write speed of the current Lambda instance. It looks like on the first invocation AWS looks for a free instance on which to place the Lambda function, and those instances have different IO speeds.
Most often it's about 6-9 MB/s for reading, which is insanely slow for opening and working with big files.
Sometimes you are lucky and get an instance with 50-80 MB/s reads, but that's pretty rare. Don't count on it.
So, if you want faster speed you must pay more:
Use Elastic File System, as mentioned by lynkfox
Use S3 (see the download sketch below)
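If you go the S3 route, here is a minimal sketch of pulling a model into /tmp at invocation time; the bucket and key names are placeholders, and boto3's multithreaded multipart download can be faster than reading the same file from the container image's disk:

import boto3
from boto3.s3.transfer import TransferConfig

s3 = boto3.client("s3")
# Parallel multipart download; bucket and key are placeholders for your own.
config = TransferConfig(max_concurrency=10, multipart_chunksize=16 * 1024 * 1024)
s3.download_file("my-model-bucket", "models/model.onnx", "/tmp/model.onnx", Config=config)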
BONUS:
If you're not tied to AWS, I've found Google Cloud Run much more suitable for my needs.
It uses Docker containers like AWS Lambda, is also billed per 100 ms, and can scale automatically.
Read speed is pretty stable, around 75 MB/s.
You can select RAM and vCPU separately, which can lower costs.
You can load several big files simultaneously with multiprocessing, which makes the cold start much faster (in Lambda, the multiprocessing load time was the sum of all the loaded files, so it didn't work for me); see the sketch below.
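A minimal sketch of loading several model files in parallel with multiprocessing; the file names are placeholders, and in a real service you would keep each loaded model inside its worker process (or use threads, since file reads release the GIL) rather than ship the bytes back to the parent:

import time
from multiprocessing import Pool

MODEL_FILES = ["model_1.onnx", "model_2.onnx", "model_3.onnx"]  # placeholder names

def load_and_time(path):
    # Read the whole file and report how long it took; return only the size
    # so huge byte blobs are not pickled back to the parent process.
    start = time.time()
    with open(path, "rb") as f:
        data = f.read()
    print(f"{path}: {time.time() - start:0.2f} s, {len(data)} bytes")
    return len(data)

if __name__ == "__main__":
    with Pool(processes=len(MODEL_FILES)) as pool:
        sizes = pool.map(load_and_time, MODEL_FILES)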
The Init phase ends when the runtime and all extensions signal that they are ready by sending a Next API request. The Init phase is limited to 10 seconds. If all three tasks do not complete within 10 seconds, Lambda retries the Init phase at the time of the first function invocation.
Refer: https://docs.aws.amazon.com/lambda/latest/dg/lambda-runtime-environment.html
Check what the model load time is on any EC2 machine (or a CPU-based localhost).
If it is close to 10 seconds, there is a high chance the model is being loaded again during the retried Init phase. The next init generally happens quickly, as Lambda already has some of the content ready and the state loaded.
To make the read faster, others have suggested trying EFS. In addition, try EFS in Elastic Throughput mode.
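A minimal sketch of switching an existing file system to Elastic Throughput with boto3; the file system ID is a placeholder:

import boto3

efs = boto3.client("efs")
# Placeholder file system ID; use your own.
efs.update_file_system(
    FileSystemId="fs-0123456789abcdef0",
    ThroughputMode="elastic",
)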

Flink RocksDB Performance issues

I have a flink job (scala) that is basically reading from a kafka-topic (1.0), aggregating data (1 minute event time tumbling window, using a fold function, which I know is deprecated, but is easier to implement than an aggregate function), and writing the result to 2 different kafka topics.
The question is this: when I'm using the FS state backend, everything runs smoothly and checkpoints take 1-2 seconds with an average state size of 200 MB; that is, until the state size increases (while closing a gap, for example).
I figured I would try RocksDB (over HDFS) for checkpoints, but the throughput is SIGNIFICANTLY lower than with the FS state backend. As I understand it, Flink does not need to serialize/deserialize on every state access when using the FS state backend, because the state is kept in memory (on the heap), while RocksDB DOES, and I guess that is what accounts for the slowdown (and the backpressure, and checkpoints taking MUCH longer, sometimes timing out after 10 minutes).
Still, there are times when the state cannot fit in memory, and I am trying to figure out how to make the RocksDB state backend perform "better".
Is it because of the deprecated fold function? Do I need to fine-tune some parameters that are not easily found in the documentation? Any tips?
Each state backend holds its working state somewhere and then durably persists its checkpoints in a distributed filesystem. The RocksDB state backend holds its working state on disk, and this can be a local disk, which should be faster than HDFS.
Try setting state.backend.rocksdb.localdir (see https://ci.apache.org/projects/flink/flink-docs-release-1.6/ops/state/state_backends.html#rocksdb-state-backend-config-options) to somewhere on the fastest local filesystem on each task manager.
Turning on incremental checkpointing could also make a large difference (see the config sketch below).
Also see Tuning RocksDB.
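Putting those suggestions together, a flink-conf.yaml sketch; the checkpoint URI and the local directory path are assumptions for a cluster with fast local disks:

state.backend: rocksdb
state.checkpoints.dir: hdfs:///flink/checkpoints
state.backend.incremental: true
state.backend.rocksdb.localdir: /mnt/local-ssd/flink/rocksdb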

How to consolidate the output of a number of Lambda function calls

I have a large file which I want to process using Lambda functions in AWS. Since I can not control the size of the file, I came up with the solution to distribute the processing of the file to multiple lambda function calls to avoid timeouts. Here's how it works:
I dedicated a bucket to accept the new input files to be processed.
I set a trigger on the bucket to handle each time a new file is uploaded (let's call it uploadHandler)
uploadHandler measures the size of the file (without reading its content) and splits it into equal chunks.
Each chunk is sent to processor lambda function to be processed.
Notes:
The uploadHandler does not read the file content.
The data sent to processor is just a { start: #, end: # }.
Multiple instances of the processor are called in parallel.
Each processor call reads its own chunk of the file individually and generates the output for it.
So far so good. The problem is how to consolidate the output of all the processor calls into one output. Does anyone have any suggestions? Also, how do I know when all the processors have finished executing?
I recently had a similar problem. I solved it using AWS Lambda and Step Functions, following this solution: https://docs.aws.amazon.com/step-functions/latest/dg/tutorial-create-iterate-pattern-section.html
In this specific example the execution doesn't happen in parallel; it's sequential. But when the state machine finishes executing, you have the guarantee that the file was processed completely and correctly. I don't know if this is exactly what you are looking for.
Option 1:
After breaking the file, make the uploadHandler function call the processor functions synchronously.
Make the calls concurrent so that you trigger all processors at once. Lambda functions have only one vCPU (or 2 vCPUs if RAM > 1,800 MB), but the invocation requests are IO-bound, so one vCPU is enough.
The uploadHandler will wait for all processors to respond, and then you can assemble all the responses (see the sketch after this list).
Pros: simpler to implement, no extra storage;
Cons: no visibility into what's going on until everything is finished;
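A minimal sketch of Option 1 inside uploadHandler; the processor function name and the byte ranges are placeholders, and the synchronous invokes are issued from a thread pool so they overlap:

import json
from concurrent.futures import ThreadPoolExecutor

import boto3

lambda_client = boto3.client("lambda")

def invoke_processor(chunk):
    # Synchronous (RequestResponse) invoke; "processor" is a placeholder name.
    resp = lambda_client.invoke(
        FunctionName="processor",
        InvocationType="RequestResponse",
        Payload=json.dumps(chunk),
    )
    return json.loads(resp["Payload"].read())

chunks = [{"start": 0, "end": 999}, {"start": 1000, "end": 1999}]  # computed by uploadHandler
with ThreadPoolExecutor(max_workers=len(chunks)) as pool:
    results = list(pool.map(invoke_processor, chunks))
# results come back in chunk order, so the final output can be assembled here.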
Option 2:
Persist a processingJob in a DB (RDS, DynamoDB, whatever). The uploadHandler would create the job and save the number of parts into which the file was broken up. Save the job ID with each file part.
Each processor gets one part (with the job ID), processes it, and then stores the results of the processing in the DB.
Make each processor check whether it is the last one delivering its results; if so, make it trigger an assembler function to collect all the results and do whatever you need (see the sketch after this list).
Pros: more visibility, as you can query your storage DB at any time to check which parts were processed and which are pending; you could store all sorts of metadata from the processor for detailed analysis, if needed;
Cons: requires a storage service and a slightly more complex handling of your Lambdas;
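A minimal sketch of the "last processor triggers the assembler" check from Option 2; the DynamoDB table name, the parts_remaining counter (initialised by uploadHandler to the number of chunks), and the assembler function name are all placeholders:

import json

import boto3

dynamodb = boto3.client("dynamodb")
lambda_client = boto3.client("lambda")

def finish_part(job_id):
    # Atomically decrement the remaining-parts counter and read the new value.
    resp = dynamodb.update_item(
        TableName="jobs",  # placeholder table name
        Key={"job_id": {"S": job_id}},
        UpdateExpression="ADD parts_remaining :dec",
        ExpressionAttributeValues={":dec": {"N": "-1"}},
        ReturnValues="UPDATED_NEW",
    )
    remaining = int(resp["Attributes"]["parts_remaining"]["N"])
    if remaining == 0:
        # This processor was the last one: fire the assembler asynchronously.
        lambda_client.invoke(
            FunctionName="assembler",  # placeholder function name
            InvocationType="Event",
            Payload=json.dumps({"job_id": job_id}),
        )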

Elastic Beanstalk high CPU load after a week of running

I am running a single-instance worker on AWS Beanstalk. It is a single-container Docker that runs some processes once every business day. Mostly, the processes sync a large number of small files from S3 and analyze those.
The setup runs fine for about a week, and then CPU load starts growing linearly in time, as in this screenshot.
The CPU load stays at a considerable level, slowing down my scheduled processes. At the same time, my top-resource tracking command running inside the container (privileged Docker mode enabled to allow it):
echo "%CPU %MEM ARGS $(date)" && ps -e -o pcpu,pmem,args --sort=pcpu | cut -d" " -f1-5 | tail
shows nearly no CPU load (which changes only during the time that my daily process runs, seemingly accurately reflecting system load at those times).
What am I missing here in terms of the origin of this "background" system load? I'm wondering whether anybody has seen similar behavior, and/or could suggest additional diagnostics to run from inside the container.
So far I have been re-starting the setup every week to remove the "background" load, but that is sub-optimal since the first run after each restart has to collect over 1 million small files from S3 (while subsequent daily runs add only a few thousand files per day).
The profile is a bit odd, especially the linear growth. It's almost as if something is accumulating and taking progressively longer to process.
I don't have enough information to point at a specific issue. A few things that you could check:
Are you collecting files anywhere, whether intentionally or in a cache or transfer folder? It could be that the system is running background processes (antivirus, indexing, defrag, dedupe, etc.) and the "large number of small files" is accumulating into something that needs to be paged or handled inefficiently.
Does any part of your process use a weekly naming convention or housekeeping process? Might you be getting conflicts, or accumulating workload, as the week rolls over, i.e. the 2nd week is actually processing both the 1st and 2nd week's data but never completing, so that each day it gets progressively worse? I saw something similar where an ill-suited bubble-sort process never completed (the slow but steady inflow of data kept resetting it before it could reach its completion condition), and the demand from that process grew steadily as the array got larger.
Do you have any logging on a weekly rollover cycle?
Are there any other key performance metrics following the trend (network, disk IO, memory, paging, etc.)?
Do consider whether it is a false positive. If it really is high CPU, there should be other metrics mirroring the CPU behaviour: cache use, disk IO, S3 transfer statistics/logging.
RL

Writing large files in c++

I am writing large files, in the range of 70-700 GB. Does anyone have experience with whether memory-mapped files would be more efficient than regular writing in chunks?
The code will be in C++ and run on Linux 2.6.
If you are writing the file from the beginning and onwards, there is nothing to be gained from memory mapping the file.
If you are writing the file in any other pattern, please update the question :)
Typical sustained hard drive transfer speeds for consumer grade drives are around 60 megabytes per second, with the sun shining, a stiff breeze in the back and the file system not too fragmented so the disk drive head doesn't have to seek too often.
So a hard lower limit on the amount of time it takes to write 700 gigabytes is 700 * 1024 / 60 = 11947 seconds, or roughly 3 hours and 20 minutes. No amount of buffering is going to fix that; any buffer will quickly be overwhelmed by the drastic mismatch between the disk write speed and the processor's ability to fill the fire hose. Start looking for a problem in your code or the disk drive's state only when it takes a couple of times longer than that.