I started to use GCP ml-engine to train neural networks. When I was checking a particular job in TensorBoard, it showed the following graph for the loss (plotted vs. wall time):
Checking the Stackdriver log for the time slot in which the rise occurs, I found the following:
It seems to me that the job was re-initialized. The reason is completely unclear to me. Any explanations/help would be appreciated!
Additional information: the particular job where I observed this behavior ran in parallel with other jobs. The other jobs terminated as expected. The sole difference between the jobs was the number of hidden layers in the neural network: 2 in this job versus 1 and 4 in the others.
If you are using the Estimators API, what TensorFlow is doing is the following:
1. Open a session (tf.Session()), load the graph for training and checkpoints (if any), and train for X steps.
2. Save a checkpoint and summaries for TensorBoard.
3. Close the session.
4. Open a session, load the graph for evaluation, and evaluate the eval set.
5. Save summaries for TensorBoard.
6. Close the session.
7. Repeat 1-6 until the stopping criterion is met.
That is why you are seeing this kind of re-initialization.
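If you are driving training with tf.estimator.train_and_evaluate, this alternation looks roughly like the sketch below; model_fn, train_input_fn, eval_input_fn, and the bucket path are placeholders, not from the original job:

```python
import tensorflow as tf

# Placeholders: model_fn, train_input_fn and eval_input_fn stand in for your
# own code; the point is only to show the train/evaluate alternation.
estimator = tf.estimator.Estimator(
    model_fn=model_fn,
    model_dir="gs://my-bucket/model",  # checkpoints and summaries land here
)

train_spec = tf.estimator.TrainSpec(input_fn=train_input_fn, max_steps=100000)
eval_spec = tf.estimator.EvalSpec(
    input_fn=eval_input_fn,
    throttle_secs=600,  # re-open a session for evaluation at most every 10 min
)

# Train for a while, save a checkpoint, close the session, then open a fresh
# session to evaluate -- the train/evaluate loop described above.
tf.estimator.train_and_evaluate(estimator, train_spec, eval_spec)
```

Each evaluation round restores the latest checkpoint into a fresh session, which shows up in TensorBoard as the re-initialization described above.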
I am trying to use a tpu-v2-8 through a custom training job. My job runs fine on a VM, but as a custom training job it OOMs and also seems slower. It is also quite hard to schedule (pending for more than a few minutes, hitting an internal error most of the time; I tried us-central1 and asia-east1).
Furthermore, the monitoring for CPU, memory, network, etc. exists in the web UI but says unavailable. Also, I'm using TF/JAX and the log format conforms to the glog standard, yet the logging from my application all shows up as ERROR instead of at the appropriate levels in Cloud Logging.
Am I missing something or doing something wrong?
No, everything seems fine on your side. To be specific:
It makes sense that the training process is slower, as all operations are passed through Vertex AI to the TPU.
Sometimes it's hard to obtain TPUs via Vertex AI. This could be a capacity issue in Vertex AI itself. Just keep trying different regions, including europe-west4.
Yes, unfortunately, no metrics are available when using TPUs at the moment, and some log entries are marked as errors.
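For the log-severity point specifically, one workaround that may help is to route Python's standard logging through the Cloud Logging client library instead of relying on stdout/stderr capture. A minimal sketch, assuming the google-cloud-logging package is installed and the job's service account is allowed to write logs (it will not necessarily fix glog output coming from TF/JAX internals):

```python
import logging

# Assumption: google-cloud-logging is installed in the training container and
# the job's service account has the Logs Writer role.
import google.cloud.logging

client = google.cloud.logging.Client()
client.setup_logging()  # attaches a handler that maps Python levels to Cloud Logging severities

logging.info("shows up as INFO")       # instead of being forced to ERROR
logging.warning("shows up as WARNING")
```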
I would like to train my model for 30 days using an AWS SageMaker training job, but its maximum runtime is 5 days. How can I resume the earlier job to proceed further?
Follow these steps:
Open a support ticket to increase Longest run time for a training job to 2419200 seconds (28 days). (This can't be adjusted using Service Quotas in the AWS web console.)
Using the SageMaker Python SDK, when creating an Estimator, set max_run=2419200.
Implement resume-from-checkpoint logic in your training script (see the sketch below).
Also, the questions in #rok's answer are very relevant to consider.
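As a rough sketch of the max_run and checkpoint pieces (the image URI, role, bucket, and instance type below are placeholders, not from the original question):

```python
from sagemaker.estimator import Estimator

# Placeholders: replace the image URI, role ARN, bucket and instance type with your own.
estimator = Estimator(
    image_uri="<training-image-uri>",
    role="<execution-role-arn>",
    instance_count=1,
    instance_type="ml.p3.2xlarge",
    max_run=2419200,                  # 28 days; only works after the quota increase above
    checkpoint_s3_uri="s3://<bucket>/my-job/checkpoints",
    checkpoint_local_path="/opt/ml/checkpoints",
)
estimator.fit("s3://<bucket>/my-job/training-data")
```

Inside the training script, write checkpoints to /opt/ml/checkpoints and, on startup, load the latest one if it exists; SageMaker keeps that directory in sync with checkpoint_s3_uri, so a new job pointed at the same URI can resume where the previous one stopped.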
According to the documentation here, the maximum allowed runtime is 28 days, not 5; please check your configuration. You are right, according to the documentation here the maximum runtime for a training job is 5 days. There are multiple things you can do: use more powerful (or multiple) GPUs to reduce the training time, or save checkpoints and restart training from them. Anyway, 30 days looks like a very long training time (with the associated cost); are you sure you need that?
Actually, you could ask for a service quota increase from here, but as you can see, Longest run time for a training job is not adjustable there. So I don't think you have any choice other than using checkpoints or bigger GPUs.
I have set up multiple jobs in Informatica Cloud to sync data from Oracle with Informatica objects. The job is scheduled to run every 3 minutes as per the business requirements. Sometimes the job runs long due to a Secure Agent resource crunch, and my team then receives multiple emails like the one below:
The Mapping task failed to run. Another instance of the task is currently running.
Is there any way to suppress these failure emails in the mapping?
This won't get set at the mapping level but at the session or Integration Service level; see the following: https://network.informatica.com/thread/7312
This type of error comes when the workflow/session is already running and you try to re-run it. Use a script to check whether it is already running and, if so, wait (see the sketch after this list). If you want to run multiple instances of the same workflow:
In Workflow Properties, enable 'Configure Concurrent Execution' by checking the check box.
Once it's enabled, you have 2 options:
Allow Concurrent Run with same instance name
Allow Concurrent run only with unique instance name
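If you go the script route mentioned above, a rough sketch against the IICS REST API could look like the following. Everything here is an assumption to verify against your org's API documentation: the login URL, the activityMonitor endpoint and its field names are assumed, and the credentials and task name are placeholders.

```python
import time
import requests

# Placeholders / assumptions -- verify against your IICS org's REST API docs.
LOGIN_URL = "https://dm-us.informaticacloud.com/ma/api/v2/user/login"
USER, PASSWORD = "<user>", "<password>"
TASK_NAME = "<mapping-task-name>"

resp = requests.post(LOGIN_URL, json={"@type": "login",
                                      "username": USER,
                                      "password": PASSWORD}).json()
base_url = resp["serverUrl"]
headers = {"icSessionId": resp["icSessionId"]}

def task_is_running(name):
    # activityMonitor is assumed to list currently running jobs.
    running = requests.get(f"{base_url}/api/v2/activity/activityMonitor",
                           headers=headers).json()
    return any(job.get("taskName") == name for job in running)

# Wait until the previous run has finished before triggering the next one.
while task_is_running(TASK_NAME):
    time.sleep(30)
```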
Notifications configured at the task level override those at the org level, so you could do this by configuring notifications at the task level and only sending warnings out to the broader list. That said, some people should still be receiving the error-level warning, because if it recurs multiple times within a short period of time there may be another issue.
Another thought: a batch process that runs every three minutes but takes longer than three minutes is usually an opportunity to improve the design. Often a business requirement for short batch intervals is really about a "near real time" desire. If you also have the Cloud Application Integration service, you may want to set up an event to trigger the batch run. If there is still overlap based on events, you can use the Cloud Data Integration API to create a dynamic version of the task each time. For really simple integrations you could perform the integration in CAI, which does allow multiple instances running at the same time.
HTH
I am wondering what the scheduling strategy behind AWS Batch looks like. The official documentation on this topic doesn't provide much detail:
The AWS Batch scheduler evaluates when, where, and how to run jobs that have been submitted to a job queue. Jobs run in approximately the order in which they are submitted as long as all dependencies on other jobs have been met.
(https://docs.aws.amazon.com/batch/latest/userguide/job_scheduling.html)
"Approximately" fifo is quite vaque. Especially as the execution order I observed when testing AWS Batch did't look like fifo.
Did I miss something? Is there a possibility to change the scheduling strategy, or configure Batch to execute the jobs in the exact order in which they were submitted?
I've been using Batch for a while now, and it has always seemed to behave in roughly a FIFO manner. Jobs that are submitted first will generally be started first, but because of limitations with distributed systems, this general rule won't work out perfectly. Jobs with dependencies are kept in the PENDING state until their dependencies have completed, and then they go into the RUNNABLE state. In my experience, whenever Batch is ready to run more jobs from the RUNNABLE state, it picks the job with the earliest time submitted.
However, there are some caveats. First, if Job A was submitted first but requires 8 cores while Job B was submitted later but only requires 4 cores, Job B might be selected first if Batch has only 4 cores available. Second, after a job leaves the RUNNABLE state, it goes into STARTING while Batch downloads the Docker image and gets the container ready to run. Depending on a number of factors, jobs that were submitted at the same time may take longer or shorter in the STARTING state. Finally, if a job fails and is retried, it goes back into the PENDING state with its original time submitted. When Batch decides to select more jobs to run, it will generally select the job with the earliest submit date, which will be the job that failed. If other jobs have started before the first job failed, the first job will start its second run after the other jobs.
There's no way to configure Batch to be perfectly FIFO because it's a distributed system, but generally if you submit jobs with the same compute requirements spaced a few seconds apart, they'll execute in the same order you submitted them.
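If you genuinely need strict ordering for a handful of jobs, one workaround (not a scheduler setting) is to chain them with dependsOn so each job only becomes RUNNABLE after the previous one succeeds. A minimal boto3 sketch, with the queue and job definition names as placeholders:

```python
import boto3

batch = boto3.client("batch")

JOB_QUEUE = "my-queue"          # placeholder
JOB_DEFINITION = "my-job-def"   # placeholder

previous_job_id = None
for name in ["step-1", "step-2", "step-3"]:
    kwargs = {
        "jobName": name,
        "jobQueue": JOB_QUEUE,
        "jobDefinition": JOB_DEFINITION,
    }
    if previous_job_id:
        # Keeps this job in PENDING until the previous one has succeeded,
        # which enforces exact submission order at the cost of parallelism.
        kwargs["dependsOn"] = [{"jobId": previous_job_id}]
    previous_job_id = batch.submit_job(**kwargs)["jobId"]
```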
I am using Google Cloud ML for training jobs. I observe a peculiar behavior in which the time taken for the training job to complete varies for the same data. I analyzed the CPU and memory utilization in the Cloud ML console and see very similar utilization in both cases (7 min and 14 min).
Can anyone let me know what would be the reason for the service to take an inconsistent amount of time to complete the job?
I have the same parameters and data in both cases and also verified that the time spent in the PREPARING phase is pretty much the same in both cases.
Also, would it matter that I schedule multiple independent training jobs simultaneously in the same project? If so, I would like to know the rationale behind it.
Any help would be greatly appreciated.
The easiest way is to add more logging to inspect where the time was spent. You can also inspect training progress using TensorBoard. There's no VM sharing between multiple jobs, so it's unlikely caused by simultaneous jobs.
Also, the running time should be measured from the point when the job enters the RUNNING state. Job startup latency varies depending on whether it's a cold or warm start (i.e., we keep the VMs from a previous job running for a while).
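As a minimal sketch of the "add more logging" suggestion (load_data, train, and export are placeholders for your own training code):

```python
import logging
import time
from contextlib import contextmanager

logging.basicConfig(level=logging.INFO)

@contextmanager
def timed(phase):
    """Log how long each phase takes, to see where the extra minutes go."""
    start = time.time()
    yield
    logging.info("%s took %.1f s", phase, time.time() - start)

# load_data, train and export are placeholders for your own code.
with timed("read input data"):
    dataset = load_data()
with timed("train"):
    model = train(dataset)
with timed("export model"):
    export(model)
```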