Maximum Memory Usage of CloudML Online Prediction Instances - google-cloud-ml

What is the maximum peak memory usage for serving tf models?
For example, to process some high-res images using a deep convolution net, we might require 40GB+ in memory usage.
Currently, CloudML charges per NodeHour. Specifically, how is memory applied?

Related

Recommended minimum system requirements to try out the QuestDB binaries?

I'm trying out QuestDB using the binaries, running them in an Ubuntu container under Proxmox. The docs for the binaries don't say what resources you need, so I guesstimated. Looking at the performance metrics for the container when running some of the CRUD examples with 10,000,000 rows, I still managed to over-provision — by a lot.
Provisioned the container with 4 CPU cores, 4GB RAM & swap, and 8GB SSD. It would probably be fine with a fraction of that: CPU usage during queries is <1%, RAM usage <1.25GB, and storage is <25%.
There is some good info in the capacity planning section of the QuestDB docs (e.g. 8 GB RAM for light workloads), but my question is really about the low end of the scale — what’s the least you can get away with and still be performant when getting started with the examples from the docs?
(I don't mind creating a pull request with this and some other docs additions. Most likely, 2 cores, 2 GB of RAM and 4 GB of storage would be plenty and still give you a nice 'wow, this is quick' factor, with the proviso that this is for evaluation purposes only.)
In QuestDB ingestion and querying are separated by design, meaning if you are planning to ingest medium/high throughput data while running queries, you want to have a dedicated core for ingestion and then another for the shared pool.
The shared pool is used for queries, but also for internal tasks QuestDB needs to run. If you are just running a demo, you probably can do well with just one core for the shared pool, but for production scenarios it is likely you would want to increase that depending on your access patterns.
Regarding disk capacity and memory, it all depends on the size of the data set. QuestDB queries will be faster if the working dataset fits in memory. 2GB of RAM and 4GB of disk storage as you suggested should be more than enough for the examples, but for most production scenarios you would probably want to increase both.

What to consider to optimize the throughput for gsutil rsync/cp command

I'm currently transferring 32 of 4GB files from Google Compute Engine instance to Google Cloud Storage. And I am currently trying to maximize my throughput during this process with "-m" and "-o [GSUtil:parallel_composite_upload_threshold=150M, GSUtil:parallel_thread_count=32]". But I was wondering if there are any other things I should consider and take advantage of(especially with boto configuration) to boost the throughput.
The default options are fine.
Increasing the buffer size beyond 512 KB has little impact on network performance. Increasing the number of threads beyond 4 has little impact as well.
The size of the Compute Engine instance and the distance between Compute Engine and Cloud Storage will have the most impact on performance.

Sagemaker Endpoint throttling exception

I have created an endpoint using Sagemaker, and designed my system so that it is called about 100 times simultaneously. This seemed to cause 'Model error' and take too much time. Do I need to create an endpoint for each event, and make one call per endpoint, instead?
you can go in cloudwatch logs to diagnose your model failure.
Real-time inference traffic scaling can be addressed via working on 3 independent dimensions:
hardware: choosing larger machines or more
machines. For example you can load test your model endpoint with bigger and bigger machines and see when hardware size gives you acceptable latency. The Autoscaling feature of SageMaker helps you address this automatically. If deploying a deep neural net, you can also consider using appropriate accelerators, eg GPU (EC2 P3, EC2 G4) or Amazon Elastic Inference Accelerator to make each prediction much faster.
software: you have 2 levers to tune here:
choosing a serving stack that is lean and fast. Different servers will handle load at different levels of performance. One common trick is to batch the load - for example, instead of hitting 100 times your server can you hit it only once with a batch of 100 records? If clients cannot batch their requests, can you use micro-asynchrony so that you do the batching yourself after they issued requests? You can usually configure such micro-batching in advanced deep learning servers such as TF Serving or MXNet Model Server (both can be used in SageMaker), but otherwise you can also do it yourself by having a queue (SQS) in front of your server.
model compilation - optimizing the model graph and its runtime. This is a very smart concept, that leverages the fact that when you know where you're going to deploy (eg NVIDIA, Intel, ARM, etc), you have an insider edge and you can refine your model artifact and create a bespoke runtime application that are tailor-made for this specific target platform. This can reduce memory consumption and latency by double-digit percentage, and is an active area of ML research. In the SageMaker ecosystem, such a compilation task can be performed with SageMaker Neo, but the open source ecosystem is developing fast, with notably treelite (paper, doc) for decision tree compilation and TVM (paper, doc) for arbitrary neural net compilation. Both are dependencies of Neo by the way.
science: some models are slower or heavier than others. If speed and concurrency are your priorities over accuracy, and if you already exploited all possible tricks at level (1) and (2) above, consider using fast-throughput models, eg linear models & logistic regression for structured data, MobileNet or SqueezeNet instead of large Resnets for classification (nice benchmark here), Yolov3 instead of FasterRCNN for detection (nice benchmark here), etc. But be aware that unlike levels (1) and (2), changing model science will alter accuracy.
As mentioned above, those 3 areas of improvements really are about real-time inference; if you can afford to pre-compute all possible model inputs, then the ultimate low-latency high-throughput solution is to pre-compute offline a variety of input-predictions pairs of interest and serve them on demand from a fast database or local read-only store.

Google cloud ml-engine custom hardware

I tried running my job with BASIC_GPU scale tier but I got an out of memory error. So then I tried running it with a custom configuration but I can't find a way of just using 1 Nvidia K80 with additional memory. All examples and predefined options use a number of GPUs, CPUs and workers and my code is not optimized for that. I just want 1 GPU and additional memory. How can I do that?
GPU memory is not extensible currently (Till something like PASCAL is accessible)
Reducing the batch size solves some of the out of memory issues
Adding GPUs to workers doesn't help either, as the model is deployed on individual worker separately (No memory pooling b/n workers)

how to use two aws ec2 instances(1 gpu and 1 cpu instance) with one storage to(run code, store/share files) & reduce cost

My team is using a gpu instance to run machine learning tensorflow based, yolo,computer vision applications and use it for training machine learning models also.. It costs 7$ an hour and has 8 gpu's. Was trying to reduce costs on it. We need 8 gpu's for faster training and sometimes many people can use different gpu's at the same time.
For our use case we are not using sometimes the gpu's(8 gpus) at all for atleast 1-2 weeks of a month. But a use of the gpu may arrive during that time but maynot also. So i wanted to know is there a way to edit the code and do all cpu intensive operations when gpu not needed through a low cost cpu instance. And turn on the gpu instance only when needed use it and then stop it when work done.
I thought of using efs for putting code on the shared file system and then running from there but i read an article( https://www.jeffgeerling.com/blog/2018/getting-best-performance-out-amazon-efs ) where its written that i should never run code from network based drives because the speed can become really slow. So i dont know if its good to run machine learning application from efs file system. I was thinking of making virtual environments on folders in efs but i dont think that is a good idea.
Could anyone suggest good ways of achieving this and reduce costs. And if you are suggesting to use an instance with lower number of gpu's that i have considered but we sometimes need 8 gpu's for faster training but we dont use the gpus at all for 1-2 weeks but the costs are still incurred.
Please suggest a way on how to achieve a low cost for this use case without using spot or reserved instances.
Thanks in advance
A few thoughts:
GPU instances now allow hibernation, so when launching your GPU select the new Stop Instance behavior 'hibernate' which will let you turn it off for 2 weeks but spin it up quickly if necessary
If you only have one instance, look into using EBS for data storage with a high volume of provisioned iops to move data on/off your instance quickly
Alternately, move your model to Sagemaker to ensure you are only charged for GPU use when you are actively training your model
If you are applying your model (inferencing) move that workload to a cheap instance. A trained yolo model can run inferencing on very small CPU instances, no need for a GPU for that part of the workload at all.
To reduce inference costs, you can use Elastic Inference which supports pay-per-use functionality:
https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/elastic-inference.html