How can you get zsh's REPORTTIME variable to print the command that ran? - profiling

Setting the $REPORTTIME variable in zsh causes zsh to print timing statistics for any command whose combined user and system CPU time exceeds the value of the variable (in seconds).
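For example, putting REPORTTIME=5 in ~/.zshrc reports every command that uses more than five seconds of CPU time.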
However, all you get is the timing itself. When I set it in my .zshrc I got:
0.00s user 0.00s system 25% cpu 0.001 total
0.00s user 0.00s system 75% cpu 0.003 total
0.00s user 0.00s system 30% cpu 0.002 total
0.00s user 0.00s system 77% cpu 0.003 total
0.00s user 0.00s system 34% cpu 0.001 total
0.00s user 0.00s system 87% cpu 0.003 total
0.00s user 0.00s system 48% cpu 0.001 total
0.00s user 0.00s system 86% cpu 0.002 total
0.00s user 0.01s system 88% cpu 0.014 total
0.00s user 0.00s system 11% cpu 0.014 total
0.00s user 0.01s system 85% cpu 0.016 total
0.00s user 0.00s system 20% cpu 0.016 total
0.00s user 0.00s system 82% cpu 0.007 total
This isn't too helpful as it doesn't tell you what is being slow. Is there a way to also print the command that caused $REPORTTIME to trigger?

You can use setopt xtrace to have each command printed before it is executed. Combined with REPORTTIME, this should give you the information you need. You need to put setopt xtrace inside the script(s) you want to monitor.
For example, take this shell script named test:
#!/bin/zsh
setopt xtrace
ls foo
ls bar
ls baz
unsetopt xtrace
Running the script (with REPORTTIME set) results in:
% ./test
+./test:3> ls foo
ls: cannot access foo: No such file or directory
0.00s user 0.00s system 2% cpu 0.154 total
+./test:4> ls bar
ls: cannot access bar: No such file or directory
0.00s user 0.00s system 0% cpu 0.005 total
+./test:5> ls baz
ls: cannot access baz: No such file or directory
0.00s user 0.00s system 0% cpu 0.008 total
+./test:6> unsetopt xtrace
./test 0.01s user 0.00s system 1% cpu 0.758 total

Related

How to test the integrity of hardware on an AWS instance?

I have a cluster of consumers (50 or so instances) consuming from Kafka partitions.
I notice that one server is consistently slow. Its CPU usage is always around 80-100%, while the other instances are around 50%.
Originally I thought there was a slight chance this was traffic dependent, so I manually switched the partitions that the slow loader was consuming.
However, I did not observe an increase in processing speed.
I also don't see CPU steal in iostat, but since all consumers run the same code I suspect there is some bottleneck in the hardware.
Unfortunately, I can't just replace the server unless I can provide conclusive proof that the hardware is the problem.
So I want to write a load-testing script that pinpoints the bottleneck.
My plan is to write a while loop in Python that does n computations, and find out the maximum amount of computation the slow consumer can do versus the fast consumer (see the rough sketch below).
What other testing strategies can I try?
Perhaps I should also test for a disk bottleneck by having the script write to a text file?
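For the CPU comparison, this is the kind of loop I have in mind (a rough sketch in C++ rather than Python; the file name, constants, and work done are made up):
// cpu_bench.cpp - hypothetical sketch: time a fixed amount of pure CPU work
// and compare the numbers reported by the fast and the slow instance.
// Build with: g++ -O2 -std=c++11 cpu_bench.cpp -o cpu_bench
#include <chrono>
#include <cstdio>

int main() {
    const long iterations = 200000000L;   // fixed amount of work
    volatile double x = 1.0;              // volatile keeps the loop from being optimised away
    auto start = std::chrono::steady_clock::now();
    for (long i = 0; i < iterations; ++i)
        x = x * 1.0000001 + 0.0000001;
    auto stop = std::chrono::steady_clock::now();
    double secs = std::chrono::duration<double>(stop - start).count();
    std::printf("%ld iterations in %.2f s (%.1f M iterations/s)\n",
                iterations, secs, iterations / secs / 1e6);
    return 0;
}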
Here is fast consumer iostat
avg-cpu: %user %nice %system %iowait %steal %idle
50.01 0.00 3.96 0.13 0.12 45.77
Device: tps kB_read/s kB_wrtn/s kB_read kB_wrtn
xvda 1.06 0.16 11.46 422953 30331733
xvdb 377.63 0.01 46937.99 35897 124281808572
xvdc 373.43 0.01 46648.25 26603 123514631628
md0 762.53 0.01 93586.24 22235 247796440032
Here is slow consumer iostat
avg-cpu: %user %nice %system %iowait %steal %idle
81.58 0.00 5.28 0.11 0.06 12.98
Device: tps kB_read/s kB_wrtn/s kB_read kB_wrtn
xvda 1.02 0.40 13.74 371145 12685265
xvdb 332.85 0.02 40775.06 18229 37636091096
xvdc 327.42 0.01 40514.44 10899 37395540132
md0 676.47 0.01 81289.50 11287 75031631060

Why do the same GCC compile options behave differently on different computer architectures?

I use the following two commands (from two makefiles) to compile my Gaussian blur program:
g++ -Ofast -ffast-math -march=native -flto -fwhole-program -std=c++11 -fopenmp -o interpolateFloatImg interpolateFloatImg.cpp
g++ -O3 -std=c++11 -fopenmp -o interpolateFloatImg interpolateFloatImg.cpp
My two testing environments are:
i7 4710HQ 4 cores 8 threads
E5 2650
However, the first binary runs at 2x speed on the E5 but 0.5x speed on the i7.
The second binary is faster on the i7 but slower on the E5.
Can anyone give some explanation?
This is the source code: https://github.com/makeapp007/interpolateFloatImg
I will give more details as soon as possible.
On the i7 the program runs on 8 threads.
I don't know how many threads this program will spawn on the E5.
==== Update ====
I am the teammate of the original author on this project, and here are the results.
Arch-Lenovo-Y50 ~/project/ca/3/12 (git)-[master] % perf stat -d ./interpolateFloatImg lobby.bin out.bin 255 20
Kernel kernelSize : 255
Standard deviation : 20
Kernel maximum: 0.000397887
Kernel minimum: 1.22439e-21
Reading width 20265 height 8533 = 172921245
Micro seconds: 211199093
Performance counter stats for './interpolateFloatImg lobby.bin out.bin 255 20':
1423026.281358 task-clock:u (msec) # 6.516 CPUs utilized
0 context-switches:u # 0.000 K/sec
0 cpu-migrations:u # 0.000 K/sec
2,604 page-faults:u # 0.002 K/sec
4,167,572,543,807 cycles:u # 2.929 GHz (46.79%)
6,713,517,640,459 instructions:u # 1.61 insn per cycle (59.29%)
725,873,982,404 branches:u # 510.092 M/sec (57.28%)
23,468,237,735 branch-misses:u # 3.23% of all branches (56.99%)
544,480,682,764 L1-dcache-loads:u # 382.622 M/sec (37.00%)
545,000,783,842 L1-dcache-load-misses:u # 100.10% of all L1-dcache hits (31.44%)
38,696,703,292 LLC-loads:u # 27.193 M/sec (26.68%)
1,204,703,652 LLC-load-misses:u # 3.11% of all LL-cache hits (35.70%)
218.384387536 seconds time elapsed
And these are the results from the workstation:
workstation:~/mossCAP3/repos/liuyh1_liujzh/12$ perf stat -d ./interpolateFloatImg ../../../lobby.bin out.bin 255 20
Kernel kernelSize : 255
Standard deviation : 20
Kernel maximum: 0.000397887
Kernel minimum: 1.22439e-21
Reading width 20265 height 8533 = 172921245
Micro seconds: 133661220
Performance counter stats for './interpolateFloatImg ../../../lobby.bin out.bin 255 20':
2035379.528531 task-clock (msec) # 14.485 CPUs utilized
7,370 context-switches # 0.004 K/sec
273 cpu-migrations # 0.000 K/sec
3,123 page-faults # 0.002 K/sec
5,272,393,071,699 cycles # 2.590 GHz [49.99%]
0 stalled-cycles-frontend # 0.00% frontend cycles idle
0 stalled-cycles-backend # 0.00% backend cycles idle
7,425,570,600,025 instructions # 1.41 insns per cycle [62.50%]
370,199,835,630 branches # 181.882 M/sec [62.50%]
47,444,417,555 branch-misses # 12.82% of all branches [62.50%]
591,137,049,749 L1-dcache-loads # 290.431 M/sec [62.51%]
545,926,505,523 L1-dcache-load-misses # 92.35% of all L1-dcache hits [62.51%]
38,725,975,976 LLC-loads # 19.026 M/sec [50.00%]
1,093,840,555 LLC-load-misses # 2.82% of all LL-cache hits [49.99%]
140.520016141 seconds time elapsed
====Update====
The specification of the E5:
workstation:~$ cat /proc/cpuinfo | grep name | cut -f2 -d: | uniq -c
20 Intel(R) Xeon(R) CPU E5-2650 v3 @ 2.30GHz
workstation:~$ dmesg | grep cache
[ 0.041489] Dentry cache hash table entries: 4194304 (order: 13, 33554432 bytes)
[ 0.047512] Inode-cache hash table entries: 2097152 (order: 12, 16777216 bytes)
[ 0.050088] Mount-cache hash table entries: 65536 (order: 7, 524288 bytes)
[ 0.050121] Mountpoint-cache hash table entries: 65536 (order: 7, 524288 bytes)
[ 0.558666] PCI: pci_cache_line_size set to 64 bytes
[ 0.918203] VFS: Dquot-cache hash table entries: 512 (order 0, 4096 bytes)
[ 0.948808] xhci_hcd 0000:00:14.0: cache line size of 32 is not supported
[ 1.076303] ehci-pci 0000:00:1a.0: cache line size of 32 is not supported
[ 1.089022] ehci-pci 0000:00:1d.0: cache line size of 32 is not supported
[ 1.549796] sd 4:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
[ 1.552711] sd 5:0:0:0: [sdb] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
[ 1.552955] sd 6:0:0:0: [sdc] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
Based on the compiler flags you indicated, the first Makefile makes use of the -march=native flag, which partly explains why you are observing different performance gaps on the two CPUs with and without it.
This flag allows GCC to use instructions specific to a given CPU architecture, and that are not necessarily available on a different architecture. It also implies -mtune=native which tunes the compiled code for the specific CPU of the machine and favours instruction sequences that run faster on that CPU. Note that code compiled with -march=native may not work at all if run on a system with a different CPU, or be significantly slower.
So even though the options seem to be the same, they will act differently behind the scenes, depending on the machine you are using to compile. You can find more information about this flag in the GCC documentation.
To see what options are specifically enabled for each CPU, you can run the following command on each of your machines:
gcc -march=native -Q --help=target
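Saving that output on both machines and diffing the two files (hypothetical file names) makes the relevant differences easy to spot:
gcc -march=native -Q --help=target > i7_target.txt   # on the i7 machine
gcc -march=native -Q --help=target > e5_target.txt   # on the E5 machine
diff i7_target.txt e5_target.txt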
In addition, different versions of GCC also have an influence on how different compiler flags will optimise your code, especially the -march=native flag which doesn't have as many tweaks enabled on older versions of GCC (newer architectures weren't necessarily fully supported at the time). This can further explain the gaps you are observing.
Your program has a very high cache miss ratio. Is that good for the program or bad for it?
545,000,783,842 L1-dcache-load-misses:u # 100.10% of all L1-dcache hits
545,926,505,523 L1-dcache-load-misses # 92.35% of all L1-dcache hits
Cache sizes may differ between the i7 and the E5, so that is one source of difference. Others are different assembly code, different gcc versions, and different gcc options.
You should look inside the code, find the hot spot, analyze how many pixels are processed by each instruction, and see whether the order of processing can be made friendlier for the CPU and memory. Rewriting the hotspot (the part of the code where most of the running time is spent) is the key to solving the task (http://shtech.org/course/ca/projects/3/).
You can use the perf profiler in record / report / annotate mode to find the hot spot (it will be easier if you recompile the project with the -g option added):
# Profile the program using the cpu-cycles performance counter; write the profile to the perf.data file
perf record ./test test_arg1 test_arg2
# Read the perf.data file and report the functions where time was spent
# (do not change or recompile ./test between record and report)
perf report
# Find the hotspot in the top functions by annotating them;
# you can use the arrow keys and Enter to run the "annotate" action from report, or:
perf annotate -s top_function_name
perf annotate -s top_function_name > annotate_func1.txt
I was able to speed up the small bin file with arguments 277 10 by a factor of 7 on my mobile i5-4* (Intel Haswell) with 2 cores (4 virtual cores with HT enabled) and AVX2+FMA.
Rewriting some loops / loop nests is needed. You should understand how the CPU cache works and what is easier for it: missing often or not missing often. Also, gcc is sometimes not smart enough to detect the data-access pattern; detecting it may be needed to process several pixels in parallel.
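To illustrate why loop order matters for the cache (a generic sketch, not the actual kernel from the project): traversing the image row by row touches memory contiguously, while traversing it column by column jumps a full row of floats on every step and misses the cache far more often, even though both loops do exactly the same work.
// cache_order.cpp - generic illustration of traversal order, not the project's kernel
#include <cstddef>
#include <cstdio>
#include <vector>

// Row-major traversal: the inner loop walks contiguous memory.
float sum_row_major(const std::vector<float>& img, std::size_t w, std::size_t h) {
    float s = 0.0f;
    for (std::size_t y = 0; y < h; ++y)
        for (std::size_t x = 0; x < w; ++x)
            s += img[y * w + x];
    return s;
}

// Column-major traversal: the inner loop strides by w floats per step,
// so almost every access touches a new cache line.
float sum_col_major(const std::vector<float>& img, std::size_t w, std::size_t h) {
    float s = 0.0f;
    for (std::size_t x = 0; x < w; ++x)
        for (std::size_t y = 0; y < h; ++y)
            s += img[y * w + x];
    return s;
}

int main() {
    const std::size_t w = 8192, h = 4096;
    std::vector<float> img(w * h, 1.0f);
    std::printf("row-major sum: %f\n", sum_row_major(img, w, h));
    std::printf("col-major sum: %f\n", sum_col_major(img, w, h));
    return 0;
}
Running each traversal on its own under perf stat -d (as above) shows the difference in L1-dcache-load-misses directly.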

Meaning of very high Elapsed (wall clock) time and low System time in Linux

I have a C++ binary and I am trying to measure its worst-case performance.
I executed it with
/usr/bin/time -v < command >
And the result was:
User time (seconds): 161.07
System time (seconds): 16.64
Percent of CPU this job got: 7%
Elapsed (wall clock) time (h:mm:ss or m:ss): 39:44.46
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 19889808
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 0
Minor (reclaiming a frame) page faults: 1272786
Voluntary context switches: 233597
Involuntary context switches: 138
Swaps: 0
File system inputs: 0
File system outputs: 0
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0
How do I interpret this result? What is causing this application to take this much time?
There is no waiting for user input; it basically deals with a large text file and a database.
I am looking at it from the Linux (OS) perspective. Is it too many context switches (round-robin scheduling in Linux) that caused this?
The best thing you can do is to run it with a profiler like gprof, gperftools, callgrind (part of valgrind) or (the best in my opinion) Intel VTune. They can show you what is going on behind the code. And you'd better have debug symbols (which is not the same as compiling without optimization) to get a clear picture. Otherwise you can only make "best guesses" about what is going on under the hood...
As I said, I'm biased towards VTune as it is fast and it displays a lot of useful info. Take a look here at an example:
Vtune example
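If VTune isn't available, a callgrind run along these lines (hypothetical binary name) gives a similar function-level breakdown, though it slows the program down considerably:
g++ -O2 -g myprog.cpp -o myprog        # keep optimisation, add debug symbols
valgrind --tool=callgrind ./myprog
callgrind_annotate callgrind.out.<pid>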

How to generate ocamlprof.dump with ocamlcp or ocamloptp

I read the manual about profiling (ocamlprof): http://caml.inria.fr/pub/docs/manual-ocaml-4.01/profil.html
I am having a hard time using it. The way I tried to do an example with gprof is:
For example, I have a file named ex.ml.
I run: sudo ocamlopt -p ex.ml -o ex
Then I use: gprof ex > profile.txt
It shows me a bunch of information, but the time column is all zeros.
For instance (this taken from my real function):
Flat profile:
Each sample counts as 0.01 seconds.
no time accumulated
% cumulative self self total
time seconds seconds calls Ts/call Ts/call name
0.00 0.00 0.00 415 0.00 0.00 caml_page_table_modify
0.00 0.00 0.00 57 0.00 0.00 caml_get_exception_backtrace
I don't understand why the time column shows 0.00 for every function.
The link above mentions a file called ocamlprof.dump, but I don't know which command generates it. How can I generate ocamlprof.dump? And how can I find out where a name such as caml_page_table_modify is located?
Thank you very much for your help.
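For reference, the workflow that the linked manual chapter describes is roughly the following (a sketch; ocamlcp is the profiling front-end for the bytecode compiler and ocamloptp for the native-code compiler):
ocamlcp ex.ml -o ex     # or: ocamloptp ex.ml -o ex
./ex                    # on exit the program writes ocamlprof.dump in the current directory
ocamlprof ex.ml         # prints ex.ml annotated with the execution counts from the dump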

Effect of usleep(0) in C++ on Linux

The documentation for usleep states that calling usleep(0) has no effect. However, on my system (RHEL 5.2) running the small snippets of C++ code below, I find that it actually appears to have the same effect as usleep(1). Is this to be expected, and if so, why is there the discrepancy between the documentation and what I see in real life?
Exhibit A
Code:
#include <unistd.h>
int main()
{
for( int i = 0; i < 10000; i++ )
{
usleep(1);
}
}
Output:
$ time ./test
real 0m10.124s
user 0m0.001s
sys 0m0.000s
Exhibit B
Code:
#include <unistd.h>
int main()
{
for( int i = 0; i < 10000; i++ )
{
usleep(1);
usleep(0);
}
}
Output:
$ time ./test
real 0m20.770s
user 0m0.002s
sys 0m0.001s
Technically it should have no effect. But you must remember that the value passed is used as a minimum, not an absolute; the system is therefore free to use the smallest possible interval instead.
I just wanted to point out something about the time command used here. You should use /usr/bin/time instead of the bare time command if you want to check your program's memory, CPU, and time statistics. When you call time without the full path, the shell's built-in time command is used. Look at the difference.
without full path:
# time -v ./a.out
-bash: -v: command not found
real 0m0.001s
user 0m0.000s
sys 0m0.001s
with full path:
# /usr/bin/time -v ./a.out
Command being timed: "./a.out"
User time (seconds): 0.00
System time (seconds): 0.00
Percent of CPU this job got: 0%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:10.87
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 0
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 0
Minor (reclaiming a frame) page faults: 220
Voluntary context switches: 10001
Involuntary context switches: 1
Swaps: 0
File system inputs: 0
File system outputs: 0
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0
Use man time for the /usr/bin/time manual and help time for information on the built-in time.
I would have to look at the source to make sure, but my guess is that it's not quite "no effect", but it's probably still less than usleep(1) - there's still the function call overhead, which can be measurable in a tight loop, even if the library call simply checks its arguments and returns immediately, avoiding the more usual process of setting up a timer/callback and calling the scheduler.
usleep() and sleep() are translated to nanosleep() system calls. Try running strace on your program and you'll see it. From the nanosleep() manual:
nanosleep() suspends the execution of the calling thread until either
at least the time specified in *req has elapsed, or the delivery of a
signal that triggers the invocation of a handler in the calling
thread or that terminates the process.
So I think usleep(0) will generate an interrupt and a context switch.
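For example, an strace filter along these lines (hypothetical invocation, using the test binary from the question) shows the underlying calls:
strace -e trace=nanosleep ./test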
That documentation is from 1997; I'm not sure it applies to current RHEL 5. The man page for usleep on my Red Hat dev system does not indicate that a sleep time of 0 has no effect.
The parameter you pass is a minimum time for sleeping. There's no guarantee that the thread will wake up after exactly the time specified. Given the specific dynamics of the scheduler, it may result in longer than expected delays.
It also depends on whether udelay is implemented as a busy loop for short durations.
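A rough way to see the actual per-call cost directly, rather than timing the whole process, is something like this (a generic sketch, hypothetical file name; add -lrt when linking on older glibc):
// sleep_cost.cpp - rough per-call timing of usleep(0) vs usleep(1)
#include <unistd.h>
#include <time.h>
#include <cstdio>

static double now_us() {
    timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec * 1e6 + ts.tv_nsec / 1e3;
}

int main() {
    const int loops = 10000;
    for (int arg = 0; arg <= 1; ++arg) {
        double start = now_us();
        for (int i = 0; i < loops; ++i)
            usleep(arg);                  // measure the call itself, whatever it does internally
        std::printf("usleep(%d): %.1f us per call on average\n",
                    arg, (now_us() - start) / loops);
    }
    return 0;
}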
In my experience it has one effect: it triggers an interrupt.
This is useful for releasing the processor for the smallest possible amount of time in multithreaded programming.