I don't understand a shell script that I refer to - bc

There is a shell script that checks whether CPU usage exceeds a CPU threshold,
but one thing I don't understand about the script below is why the 1 is there:
COMPARE=`echo "if ($CPU_Usage>=$CPU_Threshold) 1" | bc`
Does the script mean that if CPU_Usage >= CPU_Threshold, the COMPARE variable is set to 1,
and if not, COMPARE is set to 0?
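(For reference, a quick check you can run yourself; this is an addition, not part of the original script. bc prints 1 when the condition inside if (...) is true, and prints nothing at all when it is false, so in the false case COMPARE ends up empty rather than 0.)
echo "if (80 >= 75) 1" | bc    # condition true  -> bc prints 1
echo "if (50 >= 75) 1" | bc    # condition false -> bc prints nothing, COMPARE would be empty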


Error: VECTORSZ is too small

I am new to working with Promela and in particular SPIN. I have a model which I am trying to verify, and I can't understand SPIN's output well enough to resolve the problem.
Here is what I did:
spin -a untitled.pml
gcc -o pan pan.c
./pan
The output was as follows:
pan:1: VECTORSZ is too small, edit pan.h (at depth 0)
pan: wrote untitled.pml.trail
(Spin Version 6.4.5 -- 1 January 2016)
Warning: Search not completed
+ Partial Order Reduction
Full statespace search for:
never claim - (none specified)
assertion violations +
acceptance cycles - (not selected)
invalid end states +
State-vector 8172 byte, depth reached 0, errors: 1
0 states, stored
0 states, matched
0 transitions (= stored+matched)
0 atomic steps
hash conflicts: 0 (resolved)
I then ran SPIN again to try to determine the cause of the problem by examining the trail file. I used this command:
spin -t -v -p untitled.pml
This was the result:
using statement merging
spin: trail ends after -4 steps
#processes: 1
( global variable dump omitted )
-4: proc 0 (:init::1) untitled.pml:173 (state 1)
1 process created
According to this output (as I understand it), the verification is failing during the "init" procedure. The relevant code from within untitled.pml is this:
init {
    int count = 0;
    int ordinal = N;
    do    // This is line 173
    :: (count < 2 * N + 1) ->
At this point I have no idea what is causing the problem, since to me the "do" statement should execute just fine.
Can anyone please help me understand SPIN's output so I can remove this error during the verification process? For reference, the model does produce the correct output.
You can simply ignore the trail file in this case; it is not relevant at all.
The error message
pan:1: VECTORSZ is too small, edit pan.h (at depth 0)
tells you that the value of the VECTORSZ directive (the maximum size of the state vector, in bytes) is too small to successfully verify your model.
By default, VECTORSZ has size 1024.
To fix this issue, try compiling your verifier with a larger VECTORSZ size:
spin -a untitled.pml
gcc -DVECTORSZ=2048 -o run pan.c
./run
If 2048 doesn't work either, try increasingly larger values.
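As a side note (an observation based on the output quoted above, not part of this answer): the failed run already reports a state vector of 8172 bytes at depth 0, so a value of at least that size is probably needed, e.g.
gcc -DVECTORSZ=8192 -o run pan.c
Deeper states can grow further as more processes are created, so an even larger value may be required.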

How can I place a breakpoint with a logical condition in TRACE32

I want to set a break point, which stops my application when two variables contain a certain value. E.g. Stop execution when both x==10 and y==11.
How can I achieve that in Lauterbach TRACE32?
The command Var.Break.Set has an option /VarCONDition, which allows you to specify a condition under which the CPU stays stopped once it hits the corresponding breakpoint. (In the dialog for setting breakpoints you'll also find the field "Condition" for that, when you click on "advanced".)
So for your scenario the required two commands would be:
Var.Break.Set x /Write /VarCONDition (x==10 && y==11)
Var.Break.Set y /Write /VarCONDition (x==10 && y==11)
As a result the CPU is stopped on every write to x or y, but gets immediately restarted when the condition "x==10 && y==11" is not met.
Of course x and y must be located in memory. It won't work if a variable is held in a CPU register (unless you have one of those rarer CPUs that support read/write breakpoints on core registers).
In case you are using a Cortex-A or Cortex-R CPU you must also add the option "/AfterStep" since these processors have a break-before-make behavior also for address-write-breakpoints.
If your CPU supports data value breakpoints (e.g. Cortex-M4) you can also set the breakpoints like this:
Var.Break.Set x /Write /DATA 10. /VarCONDition y==11
Var.Break.Set y /Write /DATA 11. /VarCONDition x==10.
This is much better since it only stops the CPU when the correct value is written to x or y, and it stays stopped only if the other variable also has the correct value.

How to get stat information from a child process to measure resource utilization?

I feel like this must have a simple answer, but I really don't know how to approach this.
For background, the stack of things is like this:
Python script -> C++ binary -(fork)-> actual thing we want to measure.
Essentially, we have a python script that simulates an environment by using tmp directories and running multiple instances of this network software stack we're developing. The script calls a host binary (which is unimportant here), and then, after it loads, a helper binary. The helper binary can be passed a parameter to daemonize, and when it does this, it forks in the usual way.
What we need to do is measure the daemon's CPU utilization, but I don't really know how to. What I have done is read the stat file periodically, but since the process daemonizes, I can't use echo $! to get its PID. Using ps aux | grep 'thing' works fine, but I think this is giving me the parent process, because the stat information looks like this:
1472582561 9455 (nlsr) S 1 9455 9455 0 -1 4218944 394 0 0 0 13 0 0 0 20 0 2 0 909820 184770560 3851 18446744073709551615 4194304 5318592 140734694817376 140734694810512 140084250723843 0 0 16781312 0 0 0 0 17 0 0 0 0 0 0 7416544 7421528 16224256 140734694825496 140734694825524 140734694825524 140734694825962 0
I know that the parent process should not be PID1, and definitely the utime field and similar should be greater than 13 clock ticks. This is what is leading me to conclude that this process is really the parent process, and not the forked child that's doing all the work.
I can modify pretty much any file necessary, but because of code review constraints, design specs., etc., the less I have to change along many files, the better.
1. Get the PID of the child reliably - fork() returns the PID of the child to the parent.
2. Get the CPU stats from /proc/[PID]/stat - #14 utime is the CPU time spent in user code, measured in clock ticks.
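A minimal sketch of that recipe in shell (an addition, not the answerer's code): it assumes the daemon's process name is nlsr, as in the stat dump in the question, that pidof finds exactly one instance, and it also adds stime (field #15) to capture kernel time.
pid=$(pidof nlsr)        # or have the helper binary report the PID it gets back from fork()
hz=$(getconf CLK_TCK)    # clock ticks per second, typically 100

read_ticks() {
    # fields 14 and 15 of /proc/<pid>/stat are utime and stime;
    # this simple awk assumes the process name contains no spaces
    awk '{print $14 + $15}' "/proc/$1/stat"
}

t1=$(read_ticks "$pid"); sleep 5; t2=$(read_ticks "$pid")
echo "approx. CPU usage over 5s: $(( (t2 - t1) * 100 / (hz * 5) ))%"
Dividing the tick delta by CLK_TCK times the sampling interval gives an approximate CPU percentage over that interval, roughly what top reports.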

Poor man's Python data parallelism doesn't scale with the # of processors? Why?

I need to run many computational iterations, say 100 or so, of a very complicated function that accepts a number of input parameters. Though the parameters will be varied, each iteration will take nearly the same amount of time to compute. I found this post by Muhammad Alkarouri. I modified his code so as to fork the script N times, where I plan to set N at least equal to the number of cores on my desktop computer.
It's an old MacPro 1,0 running 32-bit Ubuntu 16.04 with 18GB of RAM, and in none of the runs of the test file below does the RAM usage exceed about 15% (swap is never used). That's according to the System Monitor's Resources tab, on which it is also shown that when I try running 4 iterations in parallel all four CPUs are running at 100%, whereas if I only do one iteration only one CPU is 100% utilized and the other 3 are idling. (Here are the actual specs for the computer. And cat /proc/self/status shows a Cpus_allowed of 8, twice the total number of cpu cores, which may indicate hyper-threading.)
So I expected to find that the 4 simultaneous runs would consume only a little more time than one (and going forward I generally expected run times to pretty much scale in inverse proportion to the number of cores on any computer). However, I found to the contrary that instead of the running time for 4 being not much more than the running time for one, it is instead a bit more than twice the running time for one. For example, with the scheme presented below, a sample "time(1) real" value for a single iteration is 0m7.695s whereas for 4 "simultaneous" iterations it is 0m17.733s. And when I go from running 4 iterations at once to 8, the running time scales in proportion.
So my question is, why doesn't it scale as I had supposed (and can anything be done to fix that)? This is, by the way, for deployment on a few desktops; it doesn't have to scale or run on Windows.
Also, I have for now neglected the multiprocessing.Pool() alternative as my function was rejected as not being pickleable.
Here is the modified Alkarouri script, multifork.py:
#!/usr/bin/env python
import os, cPickle, time
import numpy as np
np.seterr(all='raise')

def function(x):
    print '... the parameter is ', x
    arr = np.zeros(5000).reshape(-1,1)
    for r in range(3):
        for i in range(200):
            arr = np.concatenate( (arr, np.log( np.arange(2, 5002) ).reshape(-1,1) ), axis=1 )
    return { 'junk': arr, 'shape': arr.shape }

def run_in_separate_process(func, *args, **kwds):
    numruns = 4
    result = [ None ] * numruns
    pread = [ None ] * numruns
    pwrite = [ None ] * numruns
    pid = [ None ] * numruns
    for i in range(numruns):
        pread[i], pwrite[i] = os.pipe()
        pid[i] = os.fork()
        if pid[i] > 0:
            pass
        else:
            os.close(pread[i])
            result[i] = func(*args, **kwds)
            with os.fdopen(pwrite[i], 'wb') as f:
                cPickle.dump((0,result[i]), f, cPickle.HIGHEST_PROTOCOL)
            os._exit(0)
    #time.sleep(17)
    while True:
        for i in range(numruns):
            os.close(pwrite[i])
            with os.fdopen(pread[i], 'rb') as f:
                stat, res = cPickle.load(f)
            result[i] = res
            #os.waitpid(pid[i], 0)
        if not None in result: break
    return result

def main():
    print 'Running multifork.py.'
    print run_in_separate_process( function, 3 )

if __name__ == "__main__":
    main()
With multifork.py, uncommenting os.waitpid(pid[i], 0) has no effect. And neither does uncommenting time.sleep() if 4 iterations are being run at once and the delay is not set to more than about 17 seconds. Given that time(1) real is something like 0m17.733s for 4 iterations done at once, I take that to be an indication that the while True loop is not itself a cause of any appreciable inefficiency (since the processes all take about the same amount of time) and that the 17 seconds are indeed being consumed solely by the child processes.
Out of a profound sense of mercy I have spared you for now my other scheme, in which I employed subprocess.Popen() in lieu of os.fork(). With that one I had to send the function, via a file, to an auxiliary script (the script that defines the command that is the first argument of Popen()). I did however use the same while True loop. And the results? They were the same as with the simpler scheme that I am presenting here, almost exactly.
Why don't you use the joblib.Parallel feature?
#!/usr/bin/env python
from joblib import Parallel, delayed
import numpy as np
np.seterr(all='raise')

def function(x):
    print '... the parameter is ', x
    arr = np.zeros(5000).reshape(-1,1)
    for r in range(3):
        for i in range(200):
            arr = np.concatenate( (arr, np.log( np.arange(2, 5002) ).reshape(-1,1) ), axis=1 )
    return { 'junk': arr, 'shape': arr.shape }

def main():
    print 'Running multifork.py.'
    print Parallel(n_jobs=2)(delayed(function)(3) for _ in xrange(4))

if __name__ == "__main__":
    main()
It seems that you have some bottleneck in your computations.
In your example you pass your data over a pipe, which is not a very fast method. To avoid this performance problem you should use shared memory; this is how multiprocessing and joblib.Parallel work.
Also, remember that in the single-process case you don't have to serialize and deserialize the data, but in the multiprocess case you do.
Next, even if you see 8 cores it could be that the hyper-threading feature is enabled, which divides each core's performance between 2 hardware threads; thus if you have 4 HW cores the OS thinks there are 8 of them. There are a lot of advantages and disadvantages to using HT, but the main thing is that if you are going to load all your cores with computations for a long time, you should disable it.
For example, I have an Intel(R) Core(TM) i3-2100 CPU @ 3.10GHz with 2 HW cores and HT enabled. So in top I see 4 cores. The computation times for me are:
n_jobs=1 - 0:07.13
n_jobs=2 - 0:06.54
n_jobs=4 - 0:07.45
This is what my lscpu output looks like:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 4
On-line CPU(s) list: 0-3
Thread(s) per core: 2
Core(s) per socket: 2
Socket(s): 1
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 42
Stepping: 7
CPU MHz: 1600.000
BogoMIPS: 6186.10
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 3072K
NUMA node0 CPU(s): 0-3
Note the Thread(s) per core line.
So in your example there is not so much computation as there is data transfer. Your application doesn't have time to gain the advantages of parallelism. If you have a long computation job (about 10 minutes) I think you will see them.
ADDITION:
I've taken a more detailed look at your function. I replaced the multiple executions with just one execution of function(3) and ran it within the profiler:
$ /usr/bin/time -v python -m cProfile so.py
The output is quite long; you can view the full version here (http://pastebin.com/qLBBH5zU). But the main thing is that the program spends most of its time in the numpy.concatenate function. You can see it:
ncalls tottime percall cumtime percall filename:lineno(function)
.........
600 1.375 0.002 1.375 0.002 {numpy.core.multiarray.concatenate}
.........
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:01.64
.........
If you run multiple instances of this program, you will see that the execution time of each individual program instance increases very much. I started 2 copies simultaneously:
$ /usr/bin/time -v python -m cProfile prog.py & /usr/bin/time -v python -m cProfile prog.py &
On the other hand I wrote a small fibo function:
def fibo(x):
    arr = (0, 1)
    for _ in xrange(x):
        arr = (arr[-1], sum(arr))
    return arr[0]
And I replaced the concatenate line with fibo(10000). In this case the execution time of the single-instance program is 0:22.82, while the execution time of two instances is almost the same per instance (0:24.62).
Based on this I think that perhaps numpy uses some shared resource which leads to the parallelization problem, or it could be a numpy- or scipy-specific issue.
And the last thing about the code: you have to replace the block below:
for r in range(3):
    for i in range(200):
        arr = np.concatenate( (arr, np.log( np.arange(2, 5002) ).reshape(-1,1) ), axis=1 )
with this single line:
arr = np.concatenate( (arr, np.log(np.arange(2, 5002).repeat(3*200).reshape(-1,3*200))), axis=1 )
I am providing an answer here so as not to leave such a muddled record. First, I can report that most-serene frist's joblib code works, and note that it's shorter and that it also doesn't suffer from the limitations of the while True loop of my example, which only works efficiently with jobs that each take about the same amount of time. I see that the joblib project is actively maintained, and if you don't mind a dependency on a third-party library it may be an excellent solution; I may adopt it.
But with my test function, the time(1) real running time (that is, the running time using the time wrapper) is about the same with either joblib or my poor man's code.
To answer the question of why the scaling of the running time in inverse proportion to the number of physical cores doesn't go as hoped, I had diligently prepared my test code that I presented here so as to produce similar output and to take about as long to run as the code of my actual project, per single-run without parallelism (and my actual project is likewise CPU-bound, not I/O-bound). I did so in order to test before undertaking the not-really-easy fitting of the simple code into my rather complicated project. But I have to report that in spite of that similarity, the results were much better with my actual project. I was surprised to see that I did get the sought inverse scaling of running time with the number of physical cores, more or less.
So I'm left supposing (here's my tentative answer to my own question) that perhaps the OS scheduler is fickle and very sensitive to the type of job. And there could be effects due to other processes that may be running even if, as in my case, the other processes are hardly using any CPU time (I checked; they weren't).
Tip #1: Never name your joblib test code joblib.py (you'll get an import error). Tip #2: Never rename your joblib.py test code file and then run the renamed file without deleting the joblib.py file (you'll get an import error).

Unix Command For Benchmarking Code Running K times

Suppose I have code that is executed in Unix this way:
$ ./mycode
My question is: is there a way I can time the running time of my code executed K times, where K = 1000 for example?
I am aware of the Unix "time" command, but that only times 1 instance.
To improve on / clarify Charlie's answer:
time (for i in $(seq 10000); do ./mycode; done)
try
$ time ( your commands )
write a loop to go in the parens to repeat your command as needed.
Update
Okay, we can solve the "command line too long" issue. This is bash syntax; if you're using another shell you may have to use expr(1).
$ time (
> while ((n++ < 100)); do echo "n = $n"; done
> )
real 0m0.001s
user 0m0.000s
sys 0m0.000s
Just a word of advice: make sure this "benchmark" comes close to your real usage of the executed program. If it is a short-lived process, there could be significant overhead caused by process creation alone. Don't assume that it's the same as implementing this as a loop within your program.
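To illustrate (a rough sketch, an addition here, assuming bash and GNU seq): /bin/true does essentially no work, so nearly all of the time measured by the first line is fork+exec overhead, while the second line uses the shell builtin and spawns no processes at all.
time (for i in $(seq 1000); do /bin/true; done)   # 1000 fork+exec pairs
time (for i in $(seq 1000); do true; done)        # shell builtin, no new processes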
To enhance some of the other responses a little bit: some of them (those based on seq) may cause a "command line too long" error if you decide to test, say, one million times. The following does not have this limitation:
time ( a=0 ; while test $a -lt 10000 ; do echo $a ; a=`expr $a + 1` ; done)
Another solution to the "command line too long" problem is to use a C-style for loop within bash:
$ for ((i=0;i<10;i++)); do echo $i; done
This works in zsh as well (though I bet zsh has some niftier way of using it, I'm just still new to zsh). I can't test others, as I've never used any other.
Forget time; hyperfine will do exactly what you are looking for: https://github.com/sharkdp/hyperfine
% hyperfine 'sleep 0.3'
Benchmark 1: sleep 0.3
Time (mean ± σ): 310.2 ms ± 3.4 ms [User: 1.7 ms, System: 2.5 ms]
Range (min … max): 305.6 ms … 315.2 ms 10 runs
Linux perf stat has a -r repeat_count option. Its output only gives you the mean and standard deviation for each HW/software event, not min/max as well.
It doesn't discard the first run as a warm-up or anything either, but it's somewhat useful in a lot of cases.
Scroll to the right for the stddev results like ( +- 0.13% ) for cycles. Less variance in that than in task-clock, probably because CPU frequency was not fixed. (I intentionally picked a quite short run time, although with Skylake hardware P-state and EPP=performance, it should be ramping up to max turbo quite quickly even compared to a 34 ms run time. But for a CPU-bound task that's not memory-bound at all, its interpreter loop runs at a constant number of clock cycles per iteration, modulo only branch misprediction and interrupts. --all-user is counting CPU events like instructions and cycles only for user-space, not inside interrupt handlers and system calls / page-faults.)
$ perf stat --all-user -r5 awk 'BEGIN{for(i=0;i<1000000;i++){}}'
Performance counter stats for 'awk BEGIN{for(i=0;i<1000000;i++){}}' (5 runs):
34.10 msec task-clock # 0.984 CPUs utilized ( +- 0.40% )
0 context-switches # 0.000 /sec
0 cpu-migrations # 0.000 /sec
178 page-faults # 5.180 K/sec ( +- 0.42% )
139,277,791 cycles # 4.053 GHz ( +- 0.13% )
360,590,762 instructions # 2.58 insn per cycle ( +- 0.00% )
97,439,689 branches # 2.835 G/sec ( +- 0.00% )
16,416 branch-misses # 0.02% of all branches ( +- 8.14% )
0.034664 +- 0.000143 seconds time elapsed ( +- 0.41% )
awk here is just a busy-loop to give us something to measure. If you're using this to microbenchmark a loop or function, construct it to have minimal startup overhead as a fraction of total run time, so perf stat event counts for the whole run mostly reflect the code you wanted to time. Often this means building a repeat-loop into your own program, to loop over the initialized data multiple times.
See also Idiomatic way of performance evaluation? - timing very short things is hard due to measurement overhead. Carefully constructing a repeat loop that tells you something interesting about the throughput or latency of your code under test is important.
Run-to-run variation is often a thing, but often back-to-back runs like this will have less variation within the group than between runs separated by half a second to up-arrow/return. Perhaps something to do with transparent hugepage availability, or choice of alignment? Usually for small microbenchmarks, so not sensitive to the file getting evicted from the pagecache.
(The +- range printed by perf is just I think one standard deviation based on the small sample size, not the full range it saw.)
If you're worried about the overhead of constantly loading and unloading the executable into process space, I suggest you set up a RAM disk and time your app from there.
Back in the '70s we used to be able to set a "sticky" bit on the executable and have it remain in memory. I don't know of a single Unix which now supports this behaviour, as it made updating applications a nightmare... :o)
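A minimal sketch of the RAM-disk approach (an addition, not the answerer's; assumes Linux with tmpfs, root access, and the ./mycode binary from the question):
sudo mkdir -p /mnt/ramdisk
sudo mount -t tmpfs -o size=64m tmpfs /mnt/ramdisk   # RAM-backed filesystem
cp ./mycode /mnt/ramdisk/
time (for i in $(seq 1000); do /mnt/ramdisk/mycode; done)
sudo umount /mnt/ramdisk                             # clean up when done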