Control AWS SageMaker costs

I want to use GPU capacity for deep learning models. SageMaker is great in its flexibility to start on-demand clusters for training. However, my department wants guarantees that we won't overspend the AWS budget. Is there a way to 'cap' the costs without resorting to a dedicated machine?

Here are a couple of ideas:
You can use Service Catalog on top of SageMaker to restrict how end users consume the product (and, for example, limit instance types and permissions).
On SageMaker notebooks: you can use a lifecycle configuration to automatically shut down notebooks that are idle.
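A minimal sketch of registering such a lifecycle configuration with boto3 (the configuration name is hypothetical, and the embedded shell script is only a placeholder; AWS publishes a fuller auto-stop-on-idle sample script you would substitute here):

    import base64
    import boto3

    sm = boto3.client("sagemaker")

    # Placeholder OnStart script: installs a cron entry that stops the box
    # at 20:00 UTC. An idle-detection variant would poll the Jupyter API
    # and call StopNotebookInstance instead.
    on_start = "#!/bin/bash\necho '0 20 * * * /usr/sbin/shutdown -h now' | crontab -\n"

    sm.create_notebook_instance_lifecycle_config(
        NotebookInstanceLifecycleConfigName="auto-stop",  # hypothetical name
        OnStart=[{"Content": base64.b64encode(on_start.encode()).decode()}],
    )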
You can also create AWS Lambda functions that automate controls (e.g., shut down notebook instances at night, send a notification if big machines are used, etc.).
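For instance, a nightly Lambda (triggered by an EventBridge schedule rule, which is assumed to be set up separately) could stop every running notebook instance; a minimal boto3 sketch:

    import boto3

    sagemaker = boto3.client("sagemaker")

    def handler(event, context):
        # Stop every notebook instance that is currently running.
        # (Pagination is omitted for brevity.)
        notebooks = sagemaker.list_notebook_instances(
            StatusEquals="InService")["NotebookInstances"]
        for nb in notebooks:
            name = nb["NotebookInstanceName"]
            sagemaker.stop_notebook_instance(NotebookInstanceName=name)
            print(f"Stopped notebook instance: {name}")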
On SageMaker Training you can use Spot capacity to benefit from significant savings on training costs (up to 90% is possible). You can also apply a maximum training duration with train_max_run.
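A sketch with the SageMaker Python SDK (in SDK v2 the parameter is named max_run, train_max_run in v1; the image, role, and S3 path below are placeholders):

    from sagemaker.estimator import Estimator

    estimator = Estimator(
        image_uri="<your-training-image>",   # placeholder
        role="<your-sagemaker-role-arn>",    # placeholder
        instance_count=1,
        instance_type="ml.p3.2xlarge",
        use_spot_instances=True,  # request Spot capacity for the job
        max_run=3600,             # hard cap on billable training seconds
        max_wait=7200,            # total time incl. waiting for Spot capacity
    )
    estimator.fit("s3://my-bucket/training-data")  # placeholder S3 path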
It's a good idea to make sure your code makes good use of the GPU. For example, on P3 instances (V100 cards), you should try to use mixed-precision training so that training is faster and cheaper. Also, tune data loading, batch size, and algorithm complexity so that the GPUs have enough work to do and don't just spend their time on data reads and model updates. Also, scale training vertically first (bigger and bigger machines) before scaling horizontally: multi-machine training is generally more costly, as it is harder to write and debug and adds communication overhead. This is more on your side than on the SageMaker side, though.
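To illustrate the mixed-precision point (PyTorch is an assumption here; TensorFlow has an equivalent mixed_float16 policy), a tiny runnable sketch with a stand-in model and a random batch:

    import torch
    from torch.cuda.amp import autocast, GradScaler

    # Stand-in model and random batch; substitute your real network and loader.
    model = torch.nn.Linear(32 * 32 * 3, 10).cuda()
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
    scaler = GradScaler()  # scales the loss to avoid fp16 gradient underflow

    inputs = torch.randn(256, 32 * 32 * 3, device="cuda")
    targets = torch.randint(0, 10, (256,), device="cuda")

    optimizer.zero_grad()
    with autocast():  # forward pass runs in fp16 where safe (tensor cores)
        loss = torch.nn.functional.cross_entropy(model(inputs), targets)
    scaler.scale(loss).backward()  # backward on the scaled loss
    scaler.step(optimizer)         # unscales gradients, then steps
    scaler.update()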
Train with the SageMaker Training API, not in a SageMaker notebook. When you write code in a notebook, very little time is spent on training and much more on writing, debugging, and reading documentation; all that time, the GPU sits idle. SageMaker Training, on the other hand, provides ephemeral, fresh instances billed per second only for the duration of training, and comes with additional benefits such as Spot support, managed metrics, logs, data I/O, and metadata management.
You can create cost allocation tags to get cost reporting at custom granularity.
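Tagging via boto3 is one call per resource (the ARN and tag values below are hypothetical; note that tags only show up in Cost Explorer after you activate them as cost allocation tags in the Billing console):

    import boto3

    sm = boto3.client("sagemaker")

    # Tag a training job so its cost can be reported per team/project.
    sm.add_tags(
        ResourceArn="arn:aws:sagemaker:us-east-1:123456789012:training-job/my-job",
        Tags=[
            {"Key": "team", "Value": "ml-research"},
            {"Key": "project", "Value": "image-classification"},
        ],
    )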

Related

When do I use a Glue job or a SageMaker Processing job for an ETL?

I am currently struggling to decide in which situations a Glue job is preferable over a SageMaker Processing job, and vice versa. Some advice on this topic would be greatly appreciated.
I can do the same on both, so why should I bother with the difference?
If you want to use a specific EC2 instance type, use SageMaker.
Pricing: SageMaker is pro-rated per second, while Glue has a minimum charge (1 or 10 minutes, depending on the version). You should measure how much a workload would cost you on each platform.
Customization: in SageMaker Processing you can customize the execution environment, as you provide a Docker image (you could run more than Spark/Python, such as C++ or R).
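A minimal SageMaker Processing sketch with a custom image (the image URI, role, script, and S3 paths are placeholders):

    from sagemaker.processing import (ProcessingInput, ProcessingOutput,
                                      ScriptProcessor)

    processor = ScriptProcessor(
        image_uri="<your-custom-image>",    # any Docker image you provide
        command=["python3"],
        role="<your-sagemaker-role-arn>",   # placeholder
        instance_count=1,
        instance_type="ml.m5.xlarge",
    )
    processor.run(
        code="preprocess.py",  # hypothetical ETL script
        inputs=[ProcessingInput(source="s3://my-bucket/raw",
                                destination="/opt/ml/processing/input")],
        outputs=[ProcessingOutput(source="/opt/ml/processing/output",
                                  destination="s3://my-bucket/clean")],
    )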

What AWS service can I use to efficiently process large amounts of S3 data on a weekly basis?

I have a large number of images stored in an AWS S3 bucket.
Every week, I run a classification task on all these images. The way I'm doing it currently is by downloading all the images to my local PC, processing them, then making database changes once the process is complete.
I would like to reduce the amount of time spent downloading images to increase the overall speed of the classification task.
EDIT2:
I actually am required to process 20,000 images at a time to increase the performance of the classification engine. This means I can't use Lambda, since the maximum RAM available is 3 GB and I need 16 GB to process all 20,000 images.
The classification task uses about 16 GB of RAM. What AWS service can I use to automate this task? Is there a service that can be put on the same VLAN as the S3 bucket so that images transfer very quickly?
The entire process takes about 6 hours. If I spin up an EC2 instance with 16 GB of RAM, it would be very cost-ineffective, as it would finish after 6 hours and then spend the remainder of the week sitting there doing nothing.
Is there a service that can automate this task in a more efficient manner?
EDIT:
Each image is around 20-40 KB. The classification is a neural network, so I need to download each image to feed it through the network.
Multiple images are processed at the same time (batches of 20,000), but the processing part doesn't actually take that long. The longest part of the whole process is the downloading: for example, downloading takes about 5.7 hours, while processing takes about 0.3 hours in total. Hence I'm trying to reduce the downloading time.
For your purpose you can still use an EC2 instance, and if you have a large amount of data to download from S3, you can attach an EBS volume to the instance.
You need to set up the instance with all the tools and software required for running your job. When you don't have any process to run, you can stop the instance, and start it up again when you want to run the process.
EC2 instances are not charged for the time they are in the stopped state. You will be charged for the EBS volume and any Elastic IP attached to the instance, and also for the storage of the EC2 image on S3.
But I think these costs will be less than the cost of running the EC2 instance all the time.
You can schedule starting and stopping the instance using the AWS Instance Scheduler.
https://www.youtube.com/watch?v=PitS8RiyDv8
You can also use Auto Scaling, but that would be a more complex solution than using the Instance Scheduler.
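If you prefer to roll this yourself instead of deploying the Instance Scheduler, a minimal Lambda sketch (the instance ID is hypothetical, and the two EventBridge schedule rules that invoke it are assumed to be created separately):

    import boto3

    ec2 = boto3.client("ec2")
    INSTANCE_ID = "i-0123456789abcdef0"  # hypothetical instance ID

    def handler(event, context):
        # One schedule rule invokes this with {"action": "start"} before the
        # weekly job; another invokes it with {"action": "stop"} afterwards.
        if event.get("action") == "start":
            ec2.start_instances(InstanceIds=[INSTANCE_ID])
        else:
            ec2.stop_instances(InstanceIds=[INSTANCE_ID])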
I would look into Kinesis Streams for this, but it's hard to tell, because we don't know exactly what processing you are doing to the images.

Monitoring workers and identifying bottlenecks in a data pipeline

I am using Google Cloud Dataflow. Some of my data pipelines need to be optimized. I need to understand how workers are performing in the Dataflow cluster along these lines:
1. How much memory is being used? Currently I am logging memory usage from Java code.
2. Is there a bottleneck in disk operations, to understand whether an SSD is required?
3. Is there a bottleneck in vCPUs, so as to increase the vCPUs in worker nodes?
I know Stackdriver can be used to monitor CPU and disk usage for the cluster. However, it does not provide information on individual workers, nor on whether we are hitting a bottleneck in these.
Within the Dataflow Stackdriver UI, you are correct: you cannot view individual workers' metrics. However, you can certainly set up a Stackdriver dashboard that gives you the individual worker metrics for everything you have mentioned, e.g., CPU, memory, network, read IOPS, and write IOPS.
Since the Dataflow job name will be part of the GCE instance name, here I filter down the GCE instances being monitored by the job name I'm interested in. In this case, my Dataflow job was named "pubsub-to-bigquery", so I filtered down to instance_name ~= pubsub-to-bigquery.*. I did a regex filter to be sure I captured any job names which may be suffixed with additional data in future runs. Setting up a dashboard such as this can inform you when you'd actually benefit from SSDs, more network bandwidth, etc.
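If you also want these per-worker series programmatically (e.g., to alert on them), here is a sketch with the google-cloud-monitoring client; the project ID is a placeholder, and the filter expression is my assumption of how to match the Dataflow worker instances by name prefix:

    import time

    from google.cloud import monitoring_v3

    client = monitoring_v3.MetricServiceClient()
    now = int(time.time())
    interval = monitoring_v3.TimeInterval(
        {"end_time": {"seconds": now},
         "start_time": {"seconds": now - 3600}})  # last hour

    results = client.list_time_series(
        request={
            "name": "projects/my-gcp-project",  # placeholder project ID
            "filter": (
                'metric.type="compute.googleapis.com/instance/cpu/utilization" '
                'AND metric.labels.instance_name=starts_with("pubsub-to-bigquery")'
            ),
            "interval": interval,
            "view": monitoring_v3.ListTimeSeriesRequest.TimeSeriesView.FULL,
        }
    )
    for series in results:
        name = series.metric.labels["instance_name"]
        print(name, [p.value.double_value for p in series.points][:3])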
Also be sure to check the Dataflow job graph in the Cloud Console when looking to optimize your pipeline. The wall time shown below each step name is a good indication of which custom transforms or DoFns should be targeted for optimization.

Cost to train on AWS?

I'm coming from academia where I had HPC clusters at my disposal. Now I'm trying to deploy something on AWS.
I'm trying to budget for what it would cost, $-wise, to train some standard neural nets on standard data sets so I have an idea what other training will cost. Even ballparkish estimates are appreciated.
I know you can request faster GPUs or more of them, but I don't know the spread of speed vs. cost either; any insight here is also appreciated.
What would it cost to train ResNet-50 (or really any smallish ResNet) on CIFAR-10, a relatively small net on a small data set? (say, 100 epochs with reasonable batch size)
I do not know anything about ResNet or CIFAR, but as far as pricing for AWS EC2 goes, it depends on the instance family, type, and reservations:
On-Demand Instances: Highest cost. Ideal for prototyping and short-lived environments. Pricing
Reserved Instances: Discounts are based on the tenure of the reservation. There is also a No-Upfront option where you can reserve instances for 1 to 3 years, which does not provide the maximum savings but still cuts costs significantly. Pricing
Spot Instances: Cheapest option of all, but your application should be designed to handle interruptions, as AWS can reclaim your instance at short notice. A recent AWS announcement added support for a termination notification for certain types of Spot Instances, which you may want to investigate. Pricing
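For a very rough ballpark (all numbers here are assumptions, not quotes): a p3.2xlarge (one V100) costs on the order of $3 per hour on demand in us-east-1. If one epoch of ResNet-50 on CIFAR-10 takes roughly 30-60 seconds with mixed precision, 100 epochs is about 1-2 hours of GPU time, i.e., somewhere around $3-6 on demand, and potentially a dollar or two with Spot discounts. Check the current pricing pages, since rates vary by region and change over time.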

Resource-intensive tasks in websites

I'm a newbie in web development and am wondering how to process resource-intensive tasks on websites. For example, suppose you have a task that should be done once a day, but it requires too many resources and takes a long time on your hosting. An example of such a task could be getting data from elsewhere, making some calculations on that data, plotting some graphs based on the results, and finally inserting the results into a database. What is the best way of doing such tasks? Does AWS provide solutions for this (for example, renting a computer only for processing your task at a specified time of day), and if yes, what is the name of that service? I'd greatly appreciate your general advice and suggestions.
For AWS, I'd recommend AWS Batch for something like this.
From the documentation:
As a fully managed service, AWS Batch enables developers, scientists, and engineers to run batch computing workloads of any scale. AWS Batch automatically provisions compute resources and optimizes the workload distribution based on the quantity and scale of the workloads. With AWS Batch, there is no need to install or manage batch computing software, which allows you to focus on analyzing results and solving problems. AWS Batch reduces operational complexities, saves time, and reduces costs, which makes it easy for developers, scientists, and engineers to run their batch jobs in the AWS Cloud.
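Once a job queue and a job definition exist (the names below are hypothetical; you would create them in the console or with infrastructure-as-code), kicking off the daily run is a single boto3 call, typically triggered by an EventBridge schedule rule:

    import boto3

    batch = boto3.client("batch")

    # Submit the daily job to an existing queue/definition.
    response = batch.submit_job(
        jobName="daily-report",             # hypothetical job name
        jobQueue="my-job-queue",            # hypothetical, created beforehand
        jobDefinition="my-job-definition",  # points at your container image
    )
    print("Submitted job:", response["jobId"])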