Can you use different stopping conditions for schedulers versus general tune trials - ray

In Ray Tune, is there any guidance on whether using different stopping conditions for a scheduler versus a trial is fine to do?
Below, I have an async hyperband scheduler stopping based on neg_mean_loss, and tune itself stopping based on mean_f1.
Should I be using the same for both or does it not matter?
scheduler = schedulers.AsyncHyperBandScheduler(
    time_attr='training_iteration',
    reward_attr='neg_mean_loss',  # <------
    max_t=100,
    grace_period=10,
    reduction_factor=3,
    brackets=3
)
all_trials = tune.run(
    tune_trainable,
    name="tuner",
    scheduler=scheduler,
    stop={"mean_f1": 0.99},  # <------
    resources_per_trial={"cpu": 2, "gpu": 1},
    config={"lr": tune.grid_search([0.0002, 0.003, 0.007, 0.01])},
)

It doesn't matter; you can specify multiple criteria for termination, and Tune will terminate a trial as soon as any one of them is met.
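To illustrate, several stopping criteria can be combined in the stop dict; the values below are placeholders, and the trial terminates as soon as any one condition is reached:

# Sketch only: stops a trial when either mean_f1 reaches 0.99 or
# 100 training iterations have been reported, whichever comes first.
all_trials = tune.run(
    tune_trainable,
    name="tuner",
    scheduler=scheduler,
    stop={"mean_f1": 0.99, "training_iteration": 100},
    resources_per_trial={"cpu": 2, "gpu": 1},
    config={"lr": tune.grid_search([0.0002, 0.003, 0.007, 0.01])},
)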

Related

Vertex AI 504 Errors in batch job - How to fix/troubleshoot

We have a Vertex AI model that takes a relatively long time to return a prediction.
When hitting the model endpoint with a single instance, things work fine. But batch jobs of, say, 1000 instances end up with around 150 504 errors (upstream request timeout). (We actually need to send batches of 65K, but I'm troubleshooting with 1000.)
I tried increasing the number of replicas, assuming that the number of instances handed to each replica would be (1000 / number of replicas), but that doesn't seem to be the case.
I then read that the default batch size is 64, so I tried decreasing the batch size to 4, like this, in the Python code that creates the batch job:
from google.cloud import aiplatform

def run_batch_prediction_job(vertex_config):
    aiplatform.init(
        project=vertex_config.vertex_project, location=vertex_config.location
    )
    model = aiplatform.Model(vertex_config.model_resource_name)
    model_params = dict(batch_size=4)  # <------ reduced from the default of 64
    batch_params = dict(
        job_display_name=vertex_config.job_display_name,
        gcs_source=vertex_config.gcs_source,
        gcs_destination_prefix=vertex_config.gcs_destination,
        machine_type=vertex_config.machine_type,
        accelerator_count=vertex_config.accelerator_count,
        accelerator_type=vertex_config.accelerator_type,
        starting_replica_count=vertex_config.replica_count,
        max_replica_count=vertex_config.replica_count,
        sync=vertex_config.sync,
        model_parameters=model_params,
    )
    batch_prediction_job = model.batch_predict(**batch_params)
    batch_prediction_job.wait()
    return batch_prediction_job
I've also tried increasing the machine type to n1-highcpu-16 and that helped somewhat, but I'm not sure I understand how batches are distributed to the replicas.
Is there another way to decrease the number of instances sent to the model?
Or is there a way to increase the timeout?
Is there log output I can use to help figure this out?
Thanks
Answering your follow-up question above:
Is that timeout for a single instance request or a batch request? Also, is it in seconds?
This is a timeout for the batch job creation request.
The timeout is in seconds. According to create_batch_prediction_job(), timeout refers to the RPC timeout; if we trace the client code, we eventually end up in the generated GAPIC layer, where timeout is properly described:
timeout (float): The amount of time in seconds to wait for the RPC
    to complete. Note that if ``retry`` is used, this timeout
    applies to each individual attempt and the overall time it
    takes for this method to complete may be longer. If
    unspecified, the default timeout in the client
    configuration is used. If ``None``, then the RPC method will
    not time out.
What I would suggest is to stick with whatever is already working for your prediction model. If adding the timeout improves things, build on that together with your initial solution of using a machine with a higher spec. You can also try a machine with more memory, such as the n1-highmem-* family.
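If you do want to experiment with that RPC timeout, a minimal sketch at the GAPIC level might look like the following. This assumes the v1 JobServiceClient; the project, model, and bucket names are placeholders, so verify the details against your installed google-cloud-aiplatform version:

# Sketch only, not verified against a specific library version: the GAPIC
# JobServiceClient exposes the RPC timeout for the job-creation call directly.
from google.cloud import aiplatform_v1

client = aiplatform_v1.JobServiceClient(
    client_options={"api_endpoint": "us-central1-aiplatform.googleapis.com"}
)

job = aiplatform_v1.BatchPredictionJob(
    display_name="my-batch-job",
    model="projects/MY_PROJECT/locations/us-central1/models/MY_MODEL_ID",
    input_config=aiplatform_v1.BatchPredictionJob.InputConfig(
        instances_format="jsonl",
        gcs_source=aiplatform_v1.GcsSource(uris=["gs://my-bucket/input.jsonl"]),
    ),
    output_config=aiplatform_v1.BatchPredictionJob.OutputConfig(
        predictions_format="jsonl",
        gcs_destination=aiplatform_v1.GcsDestination(
            output_uri_prefix="gs://my-bucket/output/"
        ),
    ),
)

# timeout is the RPC timeout (in seconds) for *creating* the job,
# not a per-instance prediction timeout.
client.create_batch_prediction_job(
    parent="projects/MY_PROJECT/locations/us-central1",
    batch_prediction_job=job,
    timeout=600.0,
)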

Monitor task CPU utilization in VxWorks while program is running

I'm running a VxWorks 6.9 OS embedded system and I need to see when I'm starving low priority tasks. Ideally I'd like to have CPU utilization by task so I know what is eating up all my CPU time.
I know this is a built-in feature in many operating systems, but so far I have been unable to find it for VxWorks 6.9.
If I can't measure per task, I'd like to at least see what percentage of the time the CPU is idle.
To that end, I've been trying to write a lowest-priority task that runs the function below and tries to measure idle time indirectly.
float Foo::IdleTime(Foo* f)
{
    bool inIdleTask;
    float timeIdle;
    float totalTime;
    float percentIdle;
    float startTime;
    float returnTime;

    while (true)
    {
        startTime = _time();  // get time before measurement starts
        inIdleTask = true;
        timeIdle = 0;
        while (inIdleTask)  // I have no clue how to detect when the task left and set this to false
        {
            timeIdle += (amount_of_time_for_inner_loop);  // measure idle time
        }
        returnTime = _time();  // get time after control returns to the IdleTime task
        totalTime = (returnTime - startTime);
        percentIdle = (timeIdle / totalTime) * 100;  // calculate percentage of idle time
        // logic to report percentIdle
    }
}
The big problem with this approach is that I don't know how to detect when this task is preempted by a higher-priority task.
If you are looking for a one-time measurement during development, then spyLib is what you are looking for. Simply call spy from the command line to get a per-task CPU usage report at 10 s intervals. Call spyHelp to learn how to configure spy. (You might need to include spyLib in the kernel if it is not already there.)
If you want to go the extra mile, taskHookLib is what you need. Simply put, you hook a function that is called on every task switch. The call gives you the task IDs of the tasks going in and out of the CPU. You can either simply monitor the starvation of low-priority tasks, or take action and temporarily raise their priority.
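Roughly, installing such a hook could look like the sketch below; the switch-hook signature and the tick bookkeeping are my assumptions, so check them against the VxWorks 6.9 taskHookLib and tickLib references:

/* Rough sketch only: verify the switch-hook signature against the
 * VxWorks 6.9 taskHookLib documentation. The hook runs on every context
 * switch, so keep it short and never block or print inside it. */
#include <vxWorks.h>
#include <taskLib.h>
#include <taskHookLib.h>
#include <tickLib.h>

static ULONG lastSwitchTick = 0;

/* Called by the kernel with the TCBs of the task leaving the CPU and the
 * task entering it. */
static void mySwitchHook(WIND_TCB *pOldTcb, WIND_TCB *pNewTcb)
{
    ULONG now = tickGet();
    /* Credit (now - lastSwitchTick) ticks to the outgoing task here,
     * e.g. in a table indexed by task ID, and let a low-priority task
     * report that table periodically. */
    lastSwitchTick = now;
}

STATUS cpuMonStart(void)
{
    lastSwitchTick = tickGet();
    return taskSwitchHookAdd((FUNCPTR) mySwitchHook);
}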
From experience, spy adds a little performance overhead, especially if stdout goes to slow I/O (e.g. a 9600-baud serial port), but it is fairly easy to use. Task hooks add little to no overhead as long as you are not printing the results to the terminal immediately, but they take a bit of programming to get running.
Another thing that might be of interest is Wind River's remote debugger. I haven't used that one personally; I imagine it would require setting up Workbench and the target properly.

Is it possible to do parallel, rather than distributed, hyperparameter tuning in Google ML Engine?

I would like to let each ML Engine worker independently handle one trial, rather than having them cooperate with each trial (distributed training).
Is this possible?
(When setting workerCount > 0, it seems to pass each trial to every worker, independently of the value set for maxParallelTrials.)
If each trial only requires a single machine, then configure your TrainingInput with the requirements for each trial (e.g., workerCount = 0, parameterServerCount = 0) and control the number of parallel trials with maxParallelTrials. That should have the desired effect.
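For reference, a hypothetical sketch of such a job spec submitted through the Python API client (field names follow the ML Engine v1 TrainingInput schema; the project, bucket, module, and hyperparameter values are placeholders):

# Hypothetical sketch: one machine per trial, several trials in parallel.
from googleapiclient import discovery

training_inputs = {
    "scaleTier": "CUSTOM",
    "masterType": "standard_gpu",   # the single machine each trial gets
    "workerCount": 0,               # no distributed training within a trial
    "parameterServerCount": 0,
    "packageUris": ["gs://my-bucket/trainer-0.1.tar.gz"],
    "pythonModule": "trainer.task",
    "region": "us-central1",
    "hyperparameters": {
        "goal": "MAXIMIZE",
        "hyperparameterMetricTag": "accuracy",
        "maxTrials": 20,
        "maxParallelTrials": 4,     # trials running side by side
        "params": [
            {
                "parameterName": "learning_rate",
                "type": "DOUBLE",
                "minValue": 0.0001,
                "maxValue": 0.01,
                "scaleType": "UNIT_LOG_SCALE",
            }
        ],
    },
}

ml = discovery.build("ml", "v1")
ml.projects().jobs().create(
    parent="projects/my-project",
    body={"jobId": "hp_search_1", "trainingInput": training_inputs},
).execute()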

How to set internal wall clock in a Fortran program?

I use Fortran for some scientific computation on an HPC system. As we know, when we submit a job to the HPC scheduler, we also specify a wall clock time limit for it. However, when the time is up, if the job is still writing output data, it is terminated, which leaves 'NUL' values in the data and causes trouble for post-processing.
So, could we set up an internal mechanism so that the job stops itself gracefully some time before the end of the allotted HPC time?
Related Question: How to skip reading "NUL" value in MATLAB's textscan function?
After realizing what you are asking, I found that I implemented similar functionality in my program very recently (commit https://bitbucket.org/LadaF/elmm/commits/f10a1b3421a3dd14fdcbe165aa70bf5c5001413f), but I still have to set the time limit manually.
The most important part:
time_stepping%clock_time_limit is the time limit in seconds. Count the number of system clock ticks corresponding to that:
call system_clock(count_rate = timer_rate)
call system_clock(count_max = timer_max_count)
timer_count_time_limit = int( min(time_stepping%clock_time_limit &
                                    * real(timer_rate, knd), &
                                  real(timer_max_count, knd) * 0.999_dbl) &
                              , dbl)
Start the timer
call system_clock(count = time_steps_timer_count_start)
Check the timer and exit the main loop with error_exit set to .true. if the time is up
if (mod(time_step, time_stepping%check_period) == 0) then
  if (master) then
    error_exit = time_steps_timer_count_2 - time_steps_timer_count_start > timer_count_time_limit
    if (error_exit) write(*,*) "Maximum clock time exceeded."
  end if
  ! MPI_Bcast the error_exit flag to the other processes
  if (error_exit) exit
end if
Now, you may want to get the time limit from your scheduler automatically. That will vary between different job scheduling systems. There will typically be an environment variable like $PBS_WALLTIME; see Get walltime in a PBS job script, but check your scheduler's manual.
You can read this variable using GET_ENVIRONMENT_VARIABLE().
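For example, a small sketch of the reading side; the variable name PBS_WALLTIME comes from the paragraph above, and the assumption that it holds whole seconds is mine, so check what your scheduler actually exports:

! Sketch only: assumes the scheduler exports the limit as whole seconds.
subroutine get_wall_clock_limit(limit_seconds)
  implicit none
  integer, intent(out) :: limit_seconds
  character(len=32) :: buf
  integer :: stat

  call get_environment_variable("PBS_WALLTIME", value=buf, status=stat)

  if (stat /= 0) then
    ! variable not set (or truncated): fall back to a default limit
    limit_seconds = 3600
  else
    read(buf, *) limit_seconds
  end if
end subroutine get_wall_clock_limit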

Go Worker Pool doesn't seem to be processing Concurrently

Hello, I'm brand new to Go (and to concurrent programming in general :() and I'm trying to distribute a slow computation to a pool of workers.
http://play.golang.org/p/lTv4Tm75A4
func main() {
    test := []int{1, 2, 3, 4, 5, 6, 7, 8, 9, 10}
    answer := getSmallestMultiple(test)
    fmt.Println(answer)
}
I am trying to find the smallest number that is evenly divisible by all the numbers in test.
I have created a pool of workers and am sending them values until one of the goroutines finds a number that can be evenly divided by all the numbers in test
for w := 0; w < 100; w++ {
    go divisibleByAllNumbers(&numbers, jobs, answer)
}

go func() {
    for i := max; ; i += max {
        fmt.Printf("Sending # %d\n", i)
        jobs <- i
    }
}()
The program seems to run at the same speed no matter how many workers I start. I have tried many different numbers of workers and it always takes the same number of seconds to run, which suggests the work is not being done concurrently at all.
Each worker is consuming work from the queue using range:
for j := range jobs {}
And I was hoping that the more goroutines consuming from the jobs channel, the faster the program would execute.
I have also tried different buffer sizes in jobs := make(chan int).
I have stared at this all day and was hoping someone could see what the issue is. I would expect the computation to finish faster as I add more workers, but that's not what I'm seeing. I'm sure I'm missing some key concepts.
Thank you
http://golang.org/doc/effective_go.html#parallel
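In practical terms, for the older runtime the quote describes, that amounts to something like the sketch below (since Go 1.5, GOMAXPROCS defaults to the number of logical CPUs, so this call is normally no longer needed):

// Minimal sketch for the pre-Go-1.5 runtime the quote describes: allow the
// scheduler to run goroutines on all available cores.
package main

import (
    "fmt"
    "runtime"
)

func main() {
    runtime.GOMAXPROCS(runtime.NumCPU())
    // GOMAXPROCS(0) reports the current setting without changing it.
    fmt.Println("running Go code on up to", runtime.GOMAXPROCS(0), "cores")
}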
The current implementation of the Go runtime will not parallelize this code by default. It dedicates only a single core to user-level processing. An arbitrary number of goroutines can be blocked in system calls, but by default only one can be executing user-level code at any time. It should be smarter and one day it will be smarter, but until it is if you want CPU parallelism you must tell the run-time how many goroutines you want executing code simultaneously. There are two related ways to do this. Either run your job with environment variable GOMAXPROCS set to the number of cores to use or import the runtime package and call runtime.GOMAXPROCS(NCPU). A helpful value might be runtime.NumCPU(), which reports the number of logical CPUs on the local machine. Again, this requirement is expected to be retired as the scheduling and run-time improve.