Machine Repair Simulation - C++

The Problem Statement:
A system is composed of 1, 2, or 3 machines and a repairman responsible for
maintaining these machines. Normally, the machines are running and producing
a product. At random points in time, the machines fail and are fixed by the
repairman. If a second or third machine fails while the repairman is busy
fixing the first machine, these machines will wait on the services of the
repairman in a first come, first served order. When repair on a machine is
complete, the machine will begin running again and producing a product.
The repairman will then repair the next machine waiting. When all machines
are running, the repairman becomes idle.
Simulate this system for a fixed period of time and calculate the fraction of
time the machines are busy (machine utilization) and the fraction of time
the repairman is busy (repairman utilization).
The input consists of 50 running times and 50 repair times, followed by one line
per test case giving the period over which to compute the utilizations and the
number of machines to simulate.
Sample Input:
7.0 4.5 13.0 10.5 3.0 12.0 ....
9.5 2.5 4.5 12.0 5.7 1.5 ....
20.0 1
20.0 3
0.0 0
Sample Output:
        No of      Utilization
Case    Machines   Machine    Repairman
1       1          .525       .475
2       3          .558       .775
Case 2 Explanation:
Machine Utilization = ((7+3)+(4.5+6)+(13))/(3*20) = .558
Repairman Utilization = (15.5)/20 = .775
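Case 1 works out the same way (a single machine: first run time 7.0, first repair time 9.5, then run time 4.5 of which only 3.5 fits before the period ends):
Machine Utilization = (7 + 3.5)/(1*20) = .525
Repairman Utilization = (9.5)/20 = .475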
My Approach:
1) Load the machines into a minimum heap (called runHeap) and give each of them
a run time; each new run time comes, in order, from the 50 run times in the input.
2) Compute the minimum of the smallest remaining run time in the runHeap, the
remaining repair time at the head of the repair queue, and the remaining time
until the end of the simulation. Call that value "toGo".
3) Subtract toGo from the remaining run time of every machine in the runHeap,
and subtract toGo from the remaining repair time of the head of the repairQueue.
4) Push every machine whose remaining run time == 0 into the repairQueue;
if the remaining repair time of the head of the repairQueue == 0, push that
machine back into the runHeap.
5) Add toGo to the current time.
6) If the current time < simulation time, go to step 2; otherwise return the utilizations.
Now, the question: is this a good approach, or can one figure out a better one? (A rough sketch of the loop is shown below.)
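Here is a minimal sketch of steps 1 to 6 (a plain vector stands in for the run heap, the input lists are truncated to the sample values, and helpers such as nextRunTime are only illustrative):

#include <algorithm>
#include <cstddef>
#include <iostream>
#include <queue>
#include <vector>

// Illustrative helpers: hand out the next run/repair time from the input lists.
// The real program reads 50 values per list; here the truncated sample values
// are reused cyclically just to keep the sketch self-contained.
double nextRunTime(const std::vector<double>& runTimes, std::size_t& idx) {
    return runTimes[idx++ % runTimes.size()];
}
double nextRepairTime(const std::vector<double>& repairTimes, std::size_t& idx) {
    return repairTimes[idx++ % repairTimes.size()];
}

int main() {
    // Assumed sample data from the problem statement (truncated), not the full input.
    std::vector<double> runTimes    = {7.0, 4.5, 13.0, 10.5, 3.0, 12.0};
    std::vector<double> repairTimes = {9.5, 2.5, 4.5, 12.0, 5.7, 1.5};
    const double simTime = 20.0;
    const int machines = 3;

    std::size_t runIdx = 0, repIdx = 0;
    std::vector<double> running;      // remaining run time of each running machine
    std::queue<double> repairQueue;   // remaining repair time, first come first served

    for (int m = 0; m < machines; ++m)                     // step 1
        running.push_back(nextRunTime(runTimes, runIdx));

    double now = 0.0, machineBusy = 0.0, repairmanBusy = 0.0;
    const double eps = 1e-9;
    while (now < simTime) {
        // Step 2: toGo = min(smallest remaining run time, head repair time, time left).
        double toGo = simTime - now;
        if (!running.empty())
            toGo = std::min(toGo, *std::min_element(running.begin(), running.end()));
        if (!repairQueue.empty())
            toGo = std::min(toGo, repairQueue.front());

        // Accumulate busy time over this interval.
        machineBusy   += toGo * static_cast<double>(running.size());
        repairmanBusy += repairQueue.empty() ? 0.0 : toGo;

        // Step 3: advance the running machines and the machine under repair.
        for (double& r : running) r -= toGo;
        if (!repairQueue.empty()) repairQueue.front() -= toGo;

        // Step 4: failed machines join the repair queue (in failure order),
        // and a finished repair puts that machine back to running.
        for (std::size_t i = 0; i < running.size(); ) {
            if (running[i] <= eps) {
                repairQueue.push(nextRepairTime(repairTimes, repIdx));
                running.erase(running.begin() + static_cast<std::ptrdiff_t>(i));
            } else {
                ++i;
            }
        }
        if (!repairQueue.empty() && repairQueue.front() <= eps) {
            repairQueue.pop();
            running.push_back(nextRunTime(runTimes, runIdx));
        }

        now += toGo;  // step 5; step 6 is the while condition
    }

    std::cout << "Machine utilization:   " << machineBusy / (machines * simTime) << "\n";
    std::cout << "Repairman utilization: " << repairmanBusy / simTime << "\n";
}

For the sample data above (3 machines, period 20.0) this reproduces the Case 2 utilizations (.558 and .775).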

Related

How to find finish times of processes in cplex

I have a machine/batch scheduling problem. The finish time of a batch is the variable Z[b]. There are three machines (f indexes the machines). If a machine starts processing a specific batch at time t, then X[f][b][t] equals 1.
The parameter P[b] is the processing time of the batches. I need to find the ending times of the batches, and tried the constraint below, where t is the time range (for example 48 hours):
forall(p in B) Z[p] - (sum(n in F) sum(a in 1..t-P[p]+1) (a+P[p])*X[n][p][a]) == 0;
I have 3 machines, but this constraint only uses 2 machines at time 1. Also, the Z[p] values are not logical. How can I fix this?
Within CPLEX you have CP Optimizer, which is good at scheduling.
And to get the end of an interval, endOf(itvs) works fine.

What is the way to make Tune run parallel trials across multiple GPUs?

I am hoping to make Tune run each trial of a grid search in parallel across multiple GPUs. I have a 4-GPU machine with 24 vCPUs. When I run the following code, I see 3 GPUs being used in nvidia-smi, but only one trial is running.
tune.run("PPO",
config={
"env": "PongNoFrameskip-v4",
"lr": tune.grid_search([0.01, 0.001, 0.0001]),
"num_gpus": 3,
"num_workers": 3
}
)
I can see from the run that Tune is only running one trial.
== Status ==
Using FIFO scheduling algorithm.
Resources requested: 4/24 CPUs, 3/4 GPUs, 0.0/190.43 GiB heap, 0.0/12.84 GiB objects
Memory usage on this node: 5.4/220.4 GiB
Result logdir: /home//ray_results/PPO
Number of trials: 3 ({'RUNNING': 1, 'PENDING': 2})
PENDING trials:
- PPO_PongNoFrameskip-v4_1_lr=0.001: PENDING
- PPO_PongNoFrameskip-v4_2_lr=0.0001: PENDING
RUNNING trials:
- PPO_PongNoFrameskip-v4_0_lr=0.01: RUNNING
I tried setting resources_per_trial with "gpu": 1, but Ray gave an error telling me to clear resources_per_trial.
ValueError: Resources for <class 'ray.rllib.agents.trainer_template.PPO'> have been automatically set to Resources(cpu=1, gpu=3, memory=0, object_store_memory=0, extra_cpu=3, extra_gpu=0, extra_memory=0, extra_object_store_memory=0, custom_resources={}, extra_custom_resources={}) by its `default_resource_request()` method. Please clear the `resources_per_trial` option.
What is the way to tell Tune to run all 3 trials in parallel?
Thank you.
Try
tune.run("PPO",
config={
"env": "PongNoFrameskip-v4",
"lr": tune.grid_search([0.01, 0.001, 0.0001]),
"num_gpus": 1,
"num_workers": 3
}
)
Explanation of richiliaw's answer:
Note that the important bit in resources_per_trial is per trial. If, for example, you have 4 GPUs and your grid search has 4 combinations, you must set 1 GPU per trial if you want all 4 of them to run in parallel.
If you set it to 4, each trial will require 4 GPUs, i.e. only 1 trial can run at a time.
This is explained in the ray tune docs, with the following code sample:
# If you have 8 GPUs, this will run 8 trials at once.
tune.run(trainable, num_samples=10, resources_per_trial={"gpu": 1})
# If you have 4 CPUs on your machine and 1 GPU, this will run 1 trial at a time.
tune.run(trainable, num_samples=10, resources_per_trial={"cpu": 2, "gpu": 1})

Strange behaviour of Parallel Boost Graph Library example code

I have set up simple tests with the Parallel Boost Graph Library (PBGL), which I have never used before, and observed entirely unexpected behaviour that I would like to have explained.
My steps were as follows:
Dump test data in METIS format (a kind of social graph with 50 million vertices and 100 million edges);
Build a modified PBGL example from graph_parallel\example\dijkstra_shortest_paths.cpp.
The example was slightly extended to run the eager Dijkstra, Crauser et al. and delta-stepping algorithms.
Note: building the example required an obscure workaround around the MUTABLE_QUEUE define in crauser_et_al_shortest_paths.hpp (the example code is in fact incompatible with the new mutable_queue).
int lookahead = 1;
delta_stepping_shortest_paths(g, start, dummy_property_map(), get(vertex_distance, g), get(edge_weight, g), lookahead);
dijkstra_shortest_paths(g, start, distance_map(get(vertex_distance, g)).lookahead(lookahead));
dijkstra_shortest_paths(g, start, distance_map(get(vertex_distance, g)));
Run
mpiexec -n 1 mytest.exe mydata.me
mpiexec -n 2 mytest.exe mydata.me
mpiexec -n 4 mytest.exe mydata.me
mpiexec -n 8 mytest.exe mydata.me
The observed behaviour:
-n 1:
mem usage: 35 GB in 1 running process, which utilizes exactly 1 device thread (processor load 12.5%)
delta stepping time: about 1 min 20 s
eager time: about 2 min
crauser time: about 3 min 20 s.
-n 2:
crashes at the data-loading stage.
-n 4:
mem usage: 40+ GB in roughly equal parts across 4 running processes, each of which utilizes exactly 1 device thread
calculation times are unchanged within the margin of observation error.
-n 8:
mem usage: 44+ GB in roughly equal parts across 8 running processes, each of which utilizes exactly 1 device thread
calculation times are unchanged within the margin of observation error.
So, apart from the inappropriate memory usage and very low total performance, the only changes I observe with more MPI processes are a slightly increased total memory consumption and a linear rise in processor load.
The fact that the initial graph is somehow partitioned between the processes (probably by vertex number ranges) is nevertheless evident.
What is wrong with this test (and, probably, with my idea of MPI usage as a whole)?
My environment:
- one Win 10 PC with 64 GB RAM and 8 cores;
- MS MPI 10.0.12498.5;
- MSVC 2017, toolset 141;
- boost 1.71
N.B. See original example code here.

Why is there a difference between the sum (stime + utime) of all processes' CPU usage, compared to the overall CPU usage from /proc/stat in Linux?

I need to calculate the overall CPU usage of my Linux device over some time (1-5 seconds) and a list of processes with their respective CPU usage times. The program should be designed and implemented in C++. My assumption would be that the sum of all process CPU times equals the total value for the whole CPU. For now, the CPU I am using is multi-core (2 cores).
According to How to determine CPU and memory consumption from inside a process? it is possible to calculate all "jiffies" available in the system since startup using the values for "cpu" in /proc/stat. If you now sample the values at two points in time and compare the values for user, nice, system and idle at the two time points, you can calculate the average CPU usage in this interval. The formula would be
totalCPUUsage = ((user_aft - user_bef) + (nice_aft - nice_bef) + (system_aft - system_bef)) /
((user_aft - user_bef) + (nice_aft - nice_bef) + (system_aft - system_bef) + (idle_aft - idle_bef)) * 100 %
According to How to calculate the CPU usage of a process by PID in Linux from C? the used jiffies for a single process can be calculated by adding utime and stime from /proc/${PID}/stat (column 14 and 15 in this file). When I now calculate this sum and divide it by the total amount of jiffies in the analyzed interval, I would assume the formula for one process to be
processCPUUsage = ((process_utime_aft - process_utime_bef) + (process_stime_aft - process_stime_bef)) /
((user_aft - user_bef) + (nice_aft - nice_bef) + (system_aft - system_bef) + (idle_aft - idle_bef)) * 100 %
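For reference, a minimal sketch of this sampling approach (assuming the usual /proc layout; the helper names are only illustrative):

#include <chrono>
#include <fstream>
#include <iostream>
#include <sstream>
#include <string>
#include <thread>

// user, nice, system and idle jiffies from the aggregate "cpu" line of /proc/stat.
struct CpuSample { long long user = 0, nice = 0, system = 0, idle = 0; };

CpuSample readCpuSample() {
    std::ifstream stat("/proc/stat");
    std::string cpuLabel;
    CpuSample s;
    stat >> cpuLabel >> s.user >> s.nice >> s.system >> s.idle;
    return s;
}

// utime + stime (fields 14 and 15 of /proc/<pid>/stat) for one process.
long long readProcessJiffies(int pid) {
    std::ifstream stat("/proc/" + std::to_string(pid) + "/stat");
    std::string line;
    std::getline(stat, line);
    // The comm field (field 2) may contain spaces, so skip past the closing ')'.
    std::istringstream rest(line.substr(line.rfind(')') + 2));
    std::string field;
    for (int i = 3; i <= 13; ++i) rest >> field;  // skip fields 3..13
    long long utime = 0, stime = 0;
    rest >> utime >> stime;
    return utime + stime;
}

int main() {
    int pid = 1;  // example PID; replace with the process of interest

    CpuSample before = readCpuSample();
    long long procBefore = readProcessJiffies(pid);
    std::this_thread::sleep_for(std::chrono::seconds(2));
    CpuSample after = readCpuSample();
    long long procAfter = readProcessJiffies(pid);

    double busy  = (after.user - before.user) + (after.nice - before.nice) +
                   (after.system - before.system);
    double total = busy + (after.idle - before.idle);

    std::cout << "Total CPU usage:   " << 100.0 * busy / total << " %\n";
    std::cout << "Process CPU usage: " << 100.0 * (procAfter - procBefore) / total << " %\n";
}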
When I now sum up the values for all processes and compare the result to the overall calculated CPU usage, the aggregated value is slightly higher most of the time (although the two values are quite close for all the CPU loads I tried).
Can anyone explain the reason for that? Are there any CPU resources that are used by more than one process and thus counted twice or more in my accumulation? Or am I simply missing something here? I cannot find any further hint in the Linux man page for the proc file system (https://linux.die.net/man/5/proc) either.
Thanks in advance!

Why do I get such huge jitter in time measurement?

I'm trying to measure a function's performance by measuring the time for each iteration.
During the process, I found that even if I do nothing, the results still vary quite a bit.
e.g.
volatile long count = 0;
for (int i = 0; i < N; ++i) {
    measure.begin();
    ++count;
    measure.end();
}
In measure.end(), I measure the time difference and keep an unordered_map to track how often each duration occurs.
I've used clock_gettime as well as rdtsc, but about 1% of the data points always lie far away from the mean, by a factor of roughly 1000.
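Roughly, the harness is equivalent to this sketch using clock_gettime (illustrative names, not my exact wrapper):

#include <cstdio>
#include <time.h>
#include <unordered_map>

// Stand-in for the measure.begin()/measure.end() wrapper described above:
// records each iteration's duration in nanoseconds into a histogram.
struct Measure {
    timespec start{};
    std::unordered_map<long, long> histogram;  // duration (ns) -> occurrence count

    void begin() { clock_gettime(CLOCK_MONOTONIC, &start); }
    void end() {
        timespec stop;
        clock_gettime(CLOCK_MONOTONIC, &stop);
        long ns = (stop.tv_sec - start.tv_sec) * 1000000000L +
                  (stop.tv_nsec - start.tv_nsec);
        ++histogram[ns];  // bookkeeping happens after the second timestamp
    }
};

int main() {
    const int N = 1000000;
    volatile long count = 0;
    Measure measure;

    for (int i = 0; i < N; ++i) {
        measure.begin();
        ++count;
        measure.end();
    }

    for (const auto& kv : measure.histogram)
        std::printf("%ld ns: %ld\n", kv.first, kv.second);
}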
Here's what the above loop generates:
T: count percentile
18 117563 11.7563%
19 111821 22.9384%
21 201605 43.0989%
22 541095 97.2084%
23 2136 97.422%
24 2783 97.7003%
...
406 1 99.9994%
3678 1 99.9995%
6662 1 99.9996%
17945 1 99.9997%
18148 1 99.9998%
18181 1 99.9999%
22800 1 100%
mean: 21
So whether it's ticks or ns, the worst case of 22800 is about 1000 times bigger than the mean.
I set isolcpus in GRUB and ran this with taskset. The simple loop does almost nothing, and the hash table for the time-count statistics lives outside of the timed region.
What am I missing?
I'm running this on a laptop with Ubuntu installed; the CPU is an Intel(R) Core(TM) i5-2520M CPU @ 2.50GHz.
Thank you for all the answers.
The main interrupt that I couldn't stop is the local timer interrupt, and it seems the new 3.10 kernel supports tickless operation. I'll try that.