I have thousands of training jobs that I want to run on SageMaker. Basically, I have a list of hyperparameters and I want to train the model for all of those hyperparameters in parallel (not standard hyperparameter tuning where we just want to optimize the hyperparameter; here we want to train for all of the hyperparameters). I have searched the docs quite extensively, and it surprises me that I couldn't find any info about this, even though it seems like pretty basic functionality.
For example, let's say I have 10,000 training jobs, and my quota is 20 instances, what is the best way to run these jobs utilizing all my available instances? In particular,
Is there a "queue manager" functionality that takes the list of hyperparameters and runs the training jobs in batches of 20 until they are all done (even better if it could keep track of failed/completed jobs).
Is it best practice to run a single training job per instance? If that's the case do I need to ask for a much higher quota on the number of instance?
If this functionality does not exist in sagemaker, is it worth using EC2 instead since it's a bit cheaper?
Your question is very broad and the best way forward would depend on other details of your use-case, so we will have to make some assumptions.
[Queue manager]
SageMaker does not have a queue manager. If at the end you decide you need a queue manager, I would suggest looking towards AWS Batch.
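For illustration, here is a minimal sketch of what that could look like with boto3, assuming you package your training code as a container job; the queue name, job definition, and hyperparameter list below are placeholders, not something SageMaker or Batch defines for you:

```python
import json

import boto3

batch = boto3.client("batch")

# Hypothetical list of hyperparameter dicts, one per training run.
hyperparameter_sets = [{"lr": 0.01, "depth": 4}, {"lr": 0.1, "depth": 6}]  # ...thousands more

for i, hps in enumerate(hyperparameter_sets):
    batch.submit_job(
        jobName=f"train-{i}",
        jobQueue="my-training-queue",         # placeholder queue name
        jobDefinition="my-training-job-def",  # placeholder job definition
        containerOverrides={
            # Hand the hyperparameters to the container via an environment variable.
            "environment": [{"name": "HYPERPARAMETERS", "value": json.dumps(hps)}],
        },
    )
```

Batch then drains the queue, running as many jobs in parallel as your compute environment allows, and it tracks succeeded/failed jobs (with optional retries) for you.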
[Single vs multiple training jobs]
Since you need to run tens of thousands of jobs, I assume you are training fairly lightweight models, so to save on time you would be better off reusing instances for multiple training jobs. (Otherwise, with a 20-instance limit, you need 500 rounds of training; with a ~3 min start-up time per round - depending on instance type - that is about 25 hours of waiting alone. Depending on the complexity of each individual model, these 25 hours might be significant or totally acceptable.)
[Instance limit increase]
You can always ask for a limit increase, but going from a limit of 20 to 10k at once is unlikely to be accepted by the AWS support team, unless you are part of an organisation with a track record of usage on AWS, in which case it might be fine.
[One possible option] (Assuming multiple lightweight models)
You could create a single training job with the instance count set to the number of instances available to you.
Inside the training job, your code can run a for loop and perform all the individual trainings you need.
In this case, you will need to know which instance is which so you can split the hyperparameter sets across them. SageMaker writes this information to the file /opt/ml/input/config/resourceconfig.json, so using that you can easily have each instance run a subset of the required trainings.
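A minimal sketch of that split; hyperparameter_sets and train_one_model are placeholders for your own configuration list and training code:

```python
import json

# Placeholders for your own hyperparameter list and training code.
hyperparameter_sets = [{"lr": 0.01}, {"lr": 0.05}, {"lr": 0.1}]  # ...thousands more

def train_one_model(hps):
    print("training with", hps)  # plug in your real training code here

# SageMaker writes the cluster layout of the training job to this file.
with open("/opt/ml/input/config/resourceconfig.json") as f:
    resource_config = json.load(f)

current_host = resource_config["current_host"]  # e.g. "algo-3"
hosts = sorted(resource_config["hosts"])        # e.g. ["algo-1", ..., "algo-20"]
host_index = hosts.index(current_host)

# Each instance takes every len(hosts)-th hyperparameter set, so the work is split evenly.
for hps in hyperparameter_sets[host_index::len(hosts)]:
    train_one_model(hps)
```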
Another thing to think about is whether you need to save the generated models (which you probably do). One option is to save everything in the output model directory - the standard SageMaker approach - but this zips all models into a single model.tar.gz file.
If you don't want that and prefer to have each model saved individually, I'd suggest using the checkpoints directory, which syncs anything written there to your S3 location.
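A hedged sketch of that, assuming you set checkpoint_s3_uri on the Estimator so SageMaker syncs the local checkpoint directory (/opt/ml/checkpoints by default) to S3, and that your models can be pickled:

```python
import os
import pickle

CHECKPOINT_DIR = "/opt/ml/checkpoints"  # default local path synced to checkpoint_s3_uri

def save_model(model, hp_id):
    """Write each trained model as its own file; SageMaker uploads it to S3 in the background."""
    os.makedirs(CHECKPOINT_DIR, exist_ok=True)
    with open(os.path.join(CHECKPOINT_DIR, f"model-{hp_id}.pkl"), "wb") as f:
        pickle.dump(model, f)
```

That way each model lands in S3 as a separate object instead of inside one big model.tar.gz.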
I have an interesting problem where I have a job processing architecture that has a limit on how many jobs can be processed at once. When another job is about to start, it needs to check how many jobs are being processed, and if it is at the threshold, add the job to a queue.
What has stumped me is the best way to implement a "counter" that tracks the number of jobs running at once. This counter needs to be read, incremented, and decremented from different Lambda functions.
My first thought was a CloudWatch custom high-resolution metric, but 1 second is not quick enough, as the system breaks if too many jobs are submitted. Additionally, I'm not sure the metric can be incremented or decremented purely through code. The only other thing I can think of is an entire separate DB or EC2 instance, but that seems like complete overkill for just ONE number. We are not using a DB for data storage (that lives in another cloud platform); we only use S3.
Any suggestions on what to do next? Thank you so much :)
You could use a DynamoDB table to hold your counter as an item. However, keep in mind that naive read-then-write operations in DynamoDB can lead to race conditions, so you should use an atomic, conditional update (or otherwise “lock” the item).
Depending on your load, this could potentially be free, given the Free Tier.
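For example, a minimal sketch with boto3 that increments the counter atomically and refuses to go above the limit (table name, key schema, and the limit are placeholders; the conditional write is what avoids the race):

```python
import boto3
from botocore.exceptions import ClientError

table = boto3.resource("dynamodb").Table("job-counter")  # placeholder table name
MAX_JOBS = 20  # placeholder threshold

def try_start_job() -> bool:
    """Atomically increment the running-jobs counter; return False if we are at the limit."""
    try:
        table.update_item(
            Key={"pk": "counter"},  # assumes a table with partition key "pk"
            UpdateExpression="ADD running_jobs :one",
            ConditionExpression="attribute_not_exists(running_jobs) OR running_jobs < :max",
            ExpressionAttributeValues={":one": 1, ":max": MAX_JOBS},
        )
        return True
    except ClientError as err:
        if err.response["Error"]["Code"] == "ConditionalCheckFailedException":
            return False  # at the threshold -> queue the job instead
        raise

def finish_job():
    """Decrement the counter when a job completes."""
    table.update_item(
        Key={"pk": "counter"},
        UpdateExpression="ADD running_jobs :minus_one",
        ExpressionAttributeValues={":minus_one": -1},
    )
```

Because the increment and the threshold check happen in a single conditional update, two Lambdas racing each other cannot both slip past the limit.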
I've been reading some articles on this topic and have preliminary thoughts as to what I should do, but I still want to see if anyone who has more experience running machine learning on AWS can share comments. I was doing a project for a professor at school, and we decided to use AWS. I need to find a cost-effective and efficient way to deploy a forecasting model on it.
What we want to achieve is:
read the data from S3 bucket monthly (there will be new data coming in every month),
run a few Python files (.py) for custom-built packages and install their dependencies (including the files, no more than 30 KB),
produce predicted results into a file back in S3 (JSON or CSV works), or push them to other endpoints (most likely some BI tool - Tableau, etc.) - but really this step can be flexible (not a web app, for sure)
My first thought is AWS SageMaker. However, we'll be using the "fb prophet" model to predict the results, and we built a customized package to use in the model, therefore I don't think the notebook instance is going to help us. (Please correct me if I'm wrong.) My understanding is that SageMaker is an environment to build and train a model, but we already built and trained the model. Plus, we won't be using AWS pre-built models anyway.
Another thing is that if we want to use a custom-built package, we will need to create a container image, and I've never done that before, so I'm not sure how much effort that takes.
2nd option is to create multiple Lambda functions:
one that triggers the Python scripts in the S3 bucket (2-3 .py files) to run every time a new file is imported into the S3 bucket, which will happen monthly;
one that triggers after the Python scripts are done running and producing results, and saves them back into the S3 bucket.
3rd option will combine both options:
- Use lambda function to trigger the implementation on the python scripts in S3 bucket when the new file comes in.
- Push the result using sagemaker endpoint, which means we host the model on sagemaker and deploy from there.
I am still not entirely sure how to put the pre-built model and Python scripts onto a SageMaker instance and host them from there.
I'm hoping whoever has more experience with AWS services can give me some guidance on the most cost-effective and efficient way to run the model.
Thank you!!
I would say it all depends on how heavy your model is and how much data you're running through it. You're right to identify that Lambda will likely be less work. It's quite easy to get a Lambda up and running to do the things you need, and Lambda has a very generous free tier. The problems are:
Lambda functions are fundamentally limited in their processing capacity (they timeout after max 15 minutes).
Your model might be expensive to load.
If you have a lot of data to run through your model, you will need multiple Lambdas. Multiple Lambdas mean you have to load your model multiple times, and that's wasted work. If you're working with "big data", this will get expensive once you get past the free tier.
If you don't have much data, Lambda will work just fine. I would eyeball it as follows: assuming your data processing step is dominated by your model step, and if all your model interactions (loading the model + evaluating all your data) take less than 15min, you're definitely fine. If they take more, you'll need to do a back-of-the-envelope calculation to figure out whether you'd leave the Lambda free tier.
Regarding Lambda: you can literally copy-paste code in to set up a prototype. If your execution takes more than 15 min for all your data, you'll need a way of splitting your data up between multiple Lambdas. Consider Step Functions for this.
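As a rough sketch of that route, assuming an S3-triggered function and that run_forecast wraps your own Prophet code (the function body and the output prefix are placeholders):

```python
import json

import boto3

s3 = boto3.client("s3")

def run_forecast(raw_bytes):
    # Placeholder: load your Prophet model here and return something JSON-serializable.
    return {"status": "not implemented"}

def handler(event, context):
    """Triggered by S3 when the monthly file lands; writes predictions back to S3."""
    record = event["Records"][0]["s3"]
    bucket = record["bucket"]["name"]
    key = record["object"]["key"]

    body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
    predictions = run_forecast(body)

    s3.put_object(
        Bucket=bucket,
        Key=f"predictions/{key}.json",
        Body=json.dumps(predictions),
    )
```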
SageMaker is a set of services, each responsible for a different part of the machine learning process. What you might want to use is the hosted version of Jupyter notebooks in SageMaker. You get a lot of freedom in the size of the instance you are using (CPU/GPU, memory, and disk), and you can install various packages on that instance (such as FB Prophet). If you need it once a month, you can stop and start the notebook instance between those times and "Run all" the cells in your notebooks on that instance. It will only cost you the minutes of execution.
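If you want to script the start/stop part instead of clicking in the console, here is a small sketch with boto3 (the notebook instance name is a placeholder; you could call these from a scheduled Lambda, for example):

```python
import boto3

sm = boto3.client("sagemaker")
NOTEBOOK_NAME = "forecasting-notebook"  # placeholder notebook instance name

def start():
    sm.start_notebook_instance(NotebookInstanceName=NOTEBOOK_NAME)

def stop():
    sm.stop_notebook_instance(NotebookInstanceName=NOTEBOOK_NAME)
```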
Regarding the other alternatives: it is not trivial to run FB Prophet in Lambda due to the size limit on the libraries you can install in Lambda (meant to avoid overly long cold starts). You can also use ECS (the container service), where you can have much larger images, but you need to know how to build a Docker image of your code and expose an endpoint to be able to call it.
I have to do a deep learning project at my university, where I need to work with a medical image database. This database is stored in a Google Cloud Platform bucket.
However, the database's size is over 4 TB, so I can't afford to download the data using gsutil. I can't use a Google Colab notebook either, since its disk storage is only 350 GB.
Is there any way I can access the data and use it to teach my network?
I don't think you're on the right track.
When you build your model, you only need a representative subset of your dataset to validate your layers and the expected behavior.
Then, when everything is done and packaged, you run your training job on a dedicated VM (like a Deep Learning VM). This process can be handled automatically by AI Platform. You can also set up a hyperparameter server and parallelize your training.
In the training phase, you usually work with batches: you load only a subset of your dataset, shuffle it, and perform several training steps on it (computing the loss (RMSE/cross-entropy), evaluating, and optimizing the gradients).
Because you only use subsets of your full dataset in batches, you don't need to have the 4 TB on your VM at the same time. Your training loop handles it for you (download, train, evaluate, delete).
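To make that concrete, here is a small sketch assuming TensorFlow and a dataset stored as TFRecord files in your GCS bucket (the path and the parsing logic are placeholders); the point is that batches stream from the bucket, so the full 4 TB never has to sit on the VM:

```python
import tensorflow as tf

def parse_example(serialized):
    # Placeholder: decode your image/label pair from the serialized TFRecord here.
    return serialized

files = tf.data.Dataset.list_files("gs://your-bucket/records/*.tfrecord")  # placeholder path
dataset = (
    files.interleave(tf.data.TFRecordDataset, num_parallel_calls=tf.data.AUTOTUNE)
    .map(parse_example, num_parallel_calls=tf.data.AUTOTUNE)
    .shuffle(buffer_size=1024)
    .batch(32)
    .prefetch(tf.data.AUTOTUNE)
)

# model.fit(dataset, epochs=...) then trains while batches are streamed from GCS.
```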
As I said before, because you work with subsets, you can also parallelize your training across several VMs to reduce the training duration.
I recommend you review your training loop. If you tell me which framework and version you're working with, I can point you to tutorials and examples.
So I want to dump an entire DynamoDB table to S3. This tutorial gives a good explanation of how to do so. Gave it a test, it worked...great
However, now I want to use it on my production data, which is sizeable (>100 GB), and I want it to run quickly. Obviously the read throughput on my DynamoDB table is a factor here, but is there a way to make sure the data pipeline is doing everything it can? I'm not super familiar with these; the architect view after the setup has fields for instance types and instance count, but will increasing these decrease my pipeline time? The tutorial doesn't mention anything about speed except specifying the throughput of the table you mean to use. Will it scale automatically based on that?
The template is based on the open source samples that the Data Pipeline team has on GitHub.
The template you are referring to is here.
If you take a look at the pipeline definition, you will find that the export is being done via a map-reduce job. The scalability of the export job should be handled by that.
If you need a few more details on how EMR works with DynamoDB, you will find them here. If you increase the number of instances, you will need to adjust the throughput of your table accordingly to increase the parallelism of the export.
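As a very rough back-of-the-envelope sketch of why the table's read throughput is the real ceiling (all numbers below are assumptions; one read capacity unit covers up to 4 KB per second for a strongly consistent read, twice that if the scan is eventually consistent):

```python
# Hypothetical numbers - plug in your own.
table_size_gb = 100
read_capacity_units = 1_000
throughput_ratio = 0.5  # assumed share of the table's read capacity the export may consume

bytes_per_second = read_capacity_units * throughput_ratio * 8 * 1024  # eventually consistent scan
hours = table_size_gb * 1024**3 / bytes_per_second / 3600
print(f"~{hours:.1f} hours")  # ~7.3 hours with the numbers above
```

Adding EMR instances only helps until this ceiling is reached, which is why the instance count and the table's provisioned throughput have to be raised together.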
I have a website set up on an EC2 instance which lets users view info from 4 of their social networks.
Once a user joins, the site should update their info every night, to show up-to-date and relevant information the next day.
Initially we had a cron-job which went through each user and did the necessary calls to the APIs and then stored the data on the DB (amazon rds instance).
This operation takes between 2 and 30 seconds per person, which means doing it one by one would take days to get through everyone.
I was looking at MapReduce and would like to know if it would be a suitable option for what I'm trying to do, but at the moment I can't tell for sure.
Would I be able to give an .sql file to MapReduce, with all the records I want to update + a script that tells MapReduce what to do with each record and have it process them all simultaneously?
If not, what would be the best way to go about it?
Thanks for your help in advance.
I am assuming each user's data is independent of the other users' data, which seems logical to me. If that's not the case, please ignore this answer.
Since you have mutually independent data (that is, each user's data is independent of other users'), there is no need to use MapReduce. MR is just a programming paradigm that simplifies data manipulation when the data is not independent (map prepares the data, then there is a sorting phase, then reduce pulls the results from the sorted records).
In your case, if you want to use more computers, just split the load between them - each computer should process ~10,000 users per hour (a very rough estimate). Users can either be distributed among the computers beforehand, or they can be requested in chunks of 1,000 or so, so the machines that finish sooner can process more users.
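One way to implement that chunked hand-out is with an SQS queue as the work list; this is an assumption on top of the answer (it only says "chunks"), and the queue name plus update_user are placeholders:

```python
import json

import boto3

sqs = boto3.resource("sqs")
queue = sqs.get_queue_by_name(QueueName="user-refresh-chunks")  # placeholder queue name

def update_user(user_id):
    # Placeholder for the existing per-user logic: call the social APIs and write to RDS.
    pass

def enqueue_chunks(user_ids, chunk_size=1000):
    """Run once per night: split the user list into chunks and put each chunk on the queue."""
    for i in range(0, len(user_ids), chunk_size):
        queue.send_message(MessageBody=json.dumps(user_ids[i:i + chunk_size]))

def worker():
    """Run on each EC2 instance: keep pulling chunks until the queue is drained."""
    while True:
        messages = queue.receive_messages(MaxNumberOfMessages=1, WaitTimeSeconds=10)
        if not messages:
            break
        for msg in messages:
            for user_id in json.loads(msg.body):
                update_user(user_id)
            msg.delete()
```

Machines that finish their chunk early simply pull the next one, which gives exactly the "requested in chunks" behaviour described above.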
BUT there is an added bonus to using an MR framework (such as Hadoop), even if you only use one phase (map only): it does the error handling for you (nodes failing, jobs failing, ...) and it takes care of distributing the input among the nodes.
I'm not sure if MR is worth all the trouble to set it up, depends on your previous experience - YMMV.
If my understanding is correct, then if this application were implemented as MapReduce, all the processing would be done in the map phase and reduce would simply output the map phase's results.
So if I were to implement this, I would just divide the job across multiple EC2 instances, with each instance processing a given range of records in your SQL data. This assumes you have a good idea of how to divide the data between instances.
The advantage is that you don't pay the price of Elastic MapReduce and you avoid any possible MapReduce overhead.