How to improve libevent bufferevent_write performance - c++

For my test case, I only send a "00" message via bufferevent_write.
Case 1: 20,000 TCP connections, sending "00" to each every 10s; one round takes 0.15s.
Case 2: only 1 TCP connection, sending "00" 20,000 times every 10s; one round takes 0.015s.
Please give me some suggestions to improve bufferevent_write performance.
I just want it to be as fast as possible, and I wonder: if bufferevent_write is asynchronous, why is sending 20k messages to 1 TCP connection so much faster than sending 1 message to each of 20k TCP connections?
CPU info:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 16
On-line CPU(s) list: 0-15
Thread(s) per core: 2
Core(s) per socket: 8
Socket(s): 1
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 85
Model name: Intel(R) Xeon(R) Platinum 8269CY CPU @ 2.50GHz
Stepping: 7
CPU MHz: 2500.000
BogoMIPS: 5000.00
Hypervisor vendor: KVM
Virtualization type: full
L1d cache: 32K
L1i cache: 32K
L2 cache: 1024K
L3 cache: 36608K
NUMA node0 CPU(s): 0-15
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl cpuid tsc_known_freq pni pclmulqdq monitor ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single pti fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx avx512f avx512dq rdseed adx smap avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves arat avx512_vnni
Memory info:
32G
The whole test case:
#include <event2/buffer.h>
#include <event2/bufferevent.h>
#include <event2/event.h>
#include <event2/listener.h>
#include <event2/thread.h>
#include <netinet/tcp.h>
#include <atomic>
#include <cerrno>
#include <csignal>
#include <cstring>
#include <ctime>
#include <deque>
#include <functional>
#include <iostream>
#include <map>
#include <mutex>
#include <set>
#include <thread>
using namespace std::chrono_literals;
static event_base *kEventBase{nullptr};
static evconnlistener *kListener{nullptr};
static std::set<bufferevent *> kSessions{};
static std::mutex kSessionsMutex{};
static std::atomic_bool kRunning{false};
static void stop() {
kRunning = false;
if (kListener != nullptr) {
evconnlistener_disable(kListener);
std::cout << "normal listener stopped" << std::endl;
}
struct timeval local_timeval = {1, 0};
if (kEventBase != nullptr) { event_base_loopexit(kEventBase, &local_timeval); }
}
static void handler(int sig) {
std::cout << "get signal: " << sig << std::endl;
stop();
}
static void ReadCallback(bufferevent *event, void *) {
auto buffer = evbuffer_new();
evbuffer_add_buffer(buffer, bufferevent_get_input(event));
auto data_size = evbuffer_get_length(buffer);
char data[data_size + 1];
bzero(data, data_size + 1);
evbuffer_remove(buffer, data, data_size);
evbuffer_free(buffer);
std::cout << "get data: " << data << std::endl;
}
static void EventCallback(bufferevent *event, short events, void *) {
if (events & BEV_EVENT_EOF) {
std::cout << "socket EOF" << std::endl;
} else if (events & BEV_EVENT_ERROR) {
std::cout << "socket error: " << evutil_socket_error_to_string(EVUTIL_SOCKET_ERROR());
} else if (events & BEV_EVENT_TIMEOUT) {
std::cout << "socket read/write timeout" << std::endl;
} else {
std::cout << "unhandled socket events: " << std::to_string(events) << std::endl;
}
{
std::lock_guard<std::mutex> local_lock_guard{kSessionsMutex};
kSessions.erase(event);
bufferevent_free(event);
}
}
static void listenerCallback(evconnlistener *, evutil_socket_t socket, sockaddr *, int, void *) {
bufferevent *event =
bufferevent_socket_new(kEventBase, socket, BEV_OPT_CLOSE_ON_FREE | BEV_OPT_THREADSAFE);
if (event == nullptr) {
std::cout << "create buffer event failed" << std::endl;
return;
}
int enable = 1;
setsockopt(socket, IPPROTO_TCP, TCP_NODELAY, (void *)&enable, sizeof(enable));
setsockopt(socket, IPPROTO_TCP, TCP_QUICKACK, (void *)&enable, sizeof(enable));
bufferevent_setcb(event, ReadCallback, nullptr, EventCallback, nullptr);
bufferevent_enable(event, EV_WRITE | EV_READ);
kSessions.emplace(event);
}
int main(int argc, const char **argv) {
signal(SIGTERM, handler);
signal(SIGINT, handler);
evthread_use_pthreads();
// init
kEventBase = event_base_new();
if (kEventBase == nullptr) {
std::cout << "cannot create event_base_miner_listener_" << std::endl;
return -1;
}
sockaddr_in local_sin{};
bzero(&local_sin, sizeof(local_sin));
local_sin.sin_family = AF_INET;
local_sin.sin_port = htons(1800u);
local_sin.sin_addr.s_addr = htonl(INADDR_ANY);
kListener = evconnlistener_new_bind(kEventBase,
listenerCallback,
nullptr,
LEV_OPT_REUSEABLE | LEV_OPT_CLOSE_ON_FREE,
-1,
reinterpret_cast<sockaddr *>(&local_sin),
static_cast<int>(sizeof(local_sin)));
if (kListener == nullptr) {
std::cout << "cannot create normal listener" << std::endl;
return -1;
}
kRunning = true;
std::thread thread_send_message([]() {
while (kRunning) {
{
// case 1: send "00" to each of 20,000 TCP connections; one round takes ~0.15s.
std::lock_guard<std::mutex> local_lock_guard{kSessionsMutex};
std::clock_t clock_start = std::clock();
for (auto &it : kSessions) { bufferevent_write(it, "00", 2); }
std::cout << "send message to all done, client count: " << kSessions.size()
<< ", elapsed: " << std::clock() - clock_start << std::endl;
}
{
// case 2: send "00" 20,000 times to 1 TCP connection; one round takes ~0.015s.
// std::lock_guard<std::mutex> local_lock_guard{kSessionsMutex};
// for (auto &it : kSessions) {
// std::clock_t clock_start = std::clock();
// for (int i = 0; i < 20000; ++i) { bufferevent_write(it, "00", 2); }
// std::cout << "send message 20k times done, elapsed: " << std::clock() - clock_start
// << std::endl;
// }
}
std::this_thread::sleep_for(10s);
}
});
event_base_dispatch(kEventBase);
if (thread_send_message.joinable()) { thread_send_message.join(); }
{
std::lock_guard<std::mutex> local_lock_guard{kSessionsMutex};
for (auto &it : kSessions) { bufferevent_free(it); }
kSessions.clear();
}
if (kListener != nullptr) {
evconnlistener_free(kListener);
kListener = nullptr;
}
if (kEventBase != nullptr) {
event_base_free(kEventBase);
kEventBase = nullptr;
}
}
The minimal reproducible example:
// case 1: 20,000 TCP connections, sending "00" to each every 10s; one round takes ~0.15s.
std::clock_t clock_start = std::clock();
for (auto &it : kSessions) { bufferevent_write(it, "00", 2); }
std::cout << "send message to all done, client count: " << kSessions.size()
<< ", elapsed: " << std::clock() - clock_start << std::endl;
// case 2: only 1 TCP connection, sending "00" 20,000 times every 10s; one round takes ~0.015s.
for (auto &it : kSessions) {
std::clock_t clock_start = std::clock();
for (int i = 0; i < 20000; ++i) { bufferevent_write(it, "00", 2); }
std::cout << "send message 20k times done, elapsed: " << std::clock() - clock_start
<< std::endl;
}
strace of case 1:
% time seconds usecs/call calls errors syscall
------ ----------- ----------- --------- --------- ----------------
56.32 29.519892 9 3135415 408444 futex
20.53 10.762191 7 1490532 epoll_ctl
15.25 7.992391 11 715355 writev
3.98 2.086553 45360 46 nanosleep
1.86 0.973074 11 85273 1 epoll_wait
0.62 0.324022 8 39267 19266 accept4
0.58 0.305246 6 48721 read
0.55 0.286858 6 48762 write
0.30 0.154980 4 40004 setsockopt
0.01 0.006486 5 1216 mprotect
0.01 0.002952 21 143 madvise
0.00 0.001018 7 152 brk
0.00 0.000527 6 94 clock_gettime
0.00 0.000023 3 8 openat
0.00 0.000021 21 1 mremap
0.00 0.000010 0 22 mmap
0.00 0.000007 1 9 close
0.00 0.000000 0 8 fstat
0.00 0.000000 0 3 munmap
0.00 0.000000 0 4 rt_sigaction
0.00 0.000000 0 1 rt_sigprocmask
0.00 0.000000 0 1 ioctl
0.00 0.000000 0 1 readv
0.00 0.000000 0 8 8 access
0.00 0.000000 0 1 socket
0.00 0.000000 0 1 bind
0.00 0.000000 0 1 listen
0.00 0.000000 0 1 clone
0.00 0.000000 0 1 execve
0.00 0.000000 0 4 getuid
0.00 0.000000 0 4 getgid
0.00 0.000000 0 4 geteuid
0.00 0.000000 0 4 getegid
0.00 0.000000 0 1 arch_prctl
0.00 0.000000 0 1 set_tid_address
0.00 0.000000 0 2 set_robust_list
0.00 0.000000 0 1 eventfd2
0.00 0.000000 0 1 epoll_create1
0.00 0.000000 0 1 pipe2
0.00 0.000000 0 1 prlimit64
------ ----------- ----------- --------- --------- ----------------
100.00 52.416251 5605075 427719 total
strace of case 2:
% time seconds usecs/call calls errors syscall
------ ----------- ----------- --------- --------- ----------------
normal listener stopped
66.66 0.151105 7 22506 3469 futex
9.74 0.022084 6 3709 1 epoll_wait
9.54 0.021624 4 5105 epoll_ctl
9.47 0.021466 8 2550 writev
2.47 0.005598 4 1263 write
1.70 0.003857 3 1246 read
0.18 0.000409 18 23 nanosleep
0.09 0.000197 4 46 clock_gettime
0.03 0.000068 4 16 mprotect
0.02 0.000035 2 21 mmap
0.01 0.000024 8 3 munmap
0.01 0.000019 10 2 1 accept4
0.01 0.000018 5 4 setsockopt
0.01 0.000015 8 2 set_robust_list
0.01 0.000014 4 4 rt_sigaction
0.01 0.000014 4 4 geteuid
0.01 0.000013 3 4 getgid
0.01 0.000012 3 4 getuid
0.01 0.000012 3 4 getegid
0.00 0.000011 1 8 fstat
0.00 0.000010 10 1 socket
0.00 0.000008 8 1 clone
0.00 0.000007 2 3 brk
0.00 0.000007 7 1 pipe2
0.00 0.000006 1 7 openat
0.00 0.000006 6 1 epoll_create1
0.00 0.000005 1 8 8 access
0.00 0.000005 5 1 bind
0.00 0.000005 5 1 eventfd2
0.00 0.000005 5 1 prlimit64
0.00 0.000004 1 7 close
0.00 0.000004 4 1 listen
0.00 0.000003 3 1 rt_sigprocmask
0.00 0.000003 3 1 arch_prctl
0.00 0.000003 3 1 set_tid_address
0.00 0.000000 0 1 execve
------ ----------- ----------- --------- --------- ----------------
100.00 0.226676 36561 3479 total

How to improve libevent bufferevent_write performance
Read the documentation of libevent, study its source code, and consider other event loop libraries like libev, Qt, Wt, libonion, POCO, etc.
Be aware of several points. I assume a modern Linux/x86-64 system.
You could profile your open-source event loop library (e.g. by compiling it from source with a recent GCC and the -pg -O2 flags), then use strace(1) and/or gprof(1) and/or perf(1) and/or time(1) (and also top(1), ps(1), proc(5), netstat(8), ip(8), ifconfig(8), tcpdump(8), xosview) to observe your entire Linux system. Of course, read time(7), epoll(7) and poll(2).
TCP/IP introduces some overhead, IP routing adds more, and the typical Ethernet packet has at least hundreds of bytes (and dozens of bytes of overhead). You certainly want to send(2) or recv(2) several hundred bytes at once. Sending short "00" messages (about four bytes of useful payload each) is inefficient; ensure that your application sends messages of hundreds of bytes at once. You might consider some JSONRPC approach (and of course design your protocol at a higher level, with fewer but bigger messages, each triggering more complex behavior) or some MPI one. A way to send fewer but higher-level messages is to embed some interpreter like Guile or Lua and send higher-level script chunks or requests (like NeWS did in the past, and PostgreSQL or exim do today).
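For example, a minimal sketch of that batching idea (BatchAndSend and kBatchSize are illustrative names, not part of the original program): instead of calling bufferevent_write once per tiny "00" message, build a larger payload and hand it to libevent once per connection.
#include <event2/bufferevent.h>
#include <cstddef>
#include <string>
static constexpr std::size_t kBatchSize = 200;  // e.g. 200 small messages per write
// Coalesce many small messages into one bufferevent_write call per connection.
void BatchAndSend(bufferevent *bev) {
    std::string payload;
    payload.reserve(kBatchSize * 2);
    for (std::size_t i = 0; i < kBatchSize; ++i) {
        payload.append("00", 2);  // a real protocol would append real, framed messages
    }
    bufferevent_write(bev, payload.data(), payload.size());
}
Whether batching at this layer is possible depends on your protocol; the point is simply that each bufferevent_write (and the writev/epoll_ctl work behind it) should carry more than two bytes.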
For short and small communications, prefer running a few processes or threads on the same computer and use mq_overview(7), pipe(7), fifo(7), or unix(7), avoiding Ethernet.
Most computers in 2020 are multi-core, and with care you could use Pthreads or std::thread (with one thread running on each core, so at least 2 or 4 different threads on a laptop, or a hundred threads on a powerful Linux server). You'll need some synchronization code (e.g. std::mutex with std::lock_guard, or Pthread mutexes).
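As a sketch of how that could look with libevent (not the original program; kWorkerCount, g_bases, StartWorkers and AttachToWorker are illustrative names, and EVLOOP_NO_EXIT_ON_EMPTY assumes libevent 2.1 or newer): run one event_base per worker thread and assign each accepted connection to a worker round-robin, so 20,000 sessions are not all serviced by a single loop.
#include <event2/bufferevent.h>
#include <event2/event.h>
#include <event2/thread.h>
#include <atomic>
#include <thread>
#include <vector>
static constexpr int kWorkerCount = 4;  // illustrative; tune to your core count
static std::vector<event_base *> g_bases;
static std::atomic<unsigned> g_nextWorker{0};
// Create one event_base per worker and run each in its own thread.
void StartWorkers(std::vector<std::thread> &threads) {
    evthread_use_pthreads();
    for (int i = 0; i < kWorkerCount; ++i) { g_bases.push_back(event_base_new()); }
    for (event_base *base : g_bases) {
        threads.emplace_back([base] { event_base_loop(base, EVLOOP_NO_EXIT_ON_EMPTY); });
    }
}
// In the listener callback, attach the new socket to the next worker's base
// instead of the single global kEventBase.
bufferevent *AttachToWorker(evutil_socket_t fd) {
    event_base *base = g_bases[g_nextWorker++ % kWorkerCount];
    return bufferevent_socket_new(base, fd, BEV_OPT_CLOSE_ON_FREE | BEV_OPT_THREADSAFE);
}
The sending thread would then iterate over each worker's share of the sessions, so the per-connection bookkeeping is spread across cores instead of being serialized on one event_base.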
Be aware of the C10K problem, and take inspiration from existing open-source server programs or libraries such as lighttpd, Wt, FLTK, REDIS, Vmime, libcurl, libonion (study their source code and observe their runtime behavior with gdb(1) and/or strace(1) or ltrace(1)).
The network might be the bottleneck (in which case you won't be able to gain performance just by improving your code; you'll need some change in your software architecture). Read more about cloud computing, distributed computing, XDR, ASN.1, SOAP, REST, Web services, libssh, and the π-calculus.
Notice that:
static void handler(int sig) {
std::cout << "get signal: " << sig << std::endl;
stop();
}
if used with signal(7), is against the rules of signal-safety(7) (std::cout is not async-signal-safe), so you might use the pipe(7)-to-self trick, as Qt suggests, or consider using the Linux-specific signalfd(2) system call.
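libevent can also deliver signals as ordinary events on the loop via evsignal_new, which keeps std::cout and the loop-exit logic out of asynchronous signal context; a minimal sketch (InstallSignalEvents and SignalEventCallback are illustrative names):
#include <event2/event.h>
#include <csignal>
#include <iostream>
static void SignalEventCallback(evutil_socket_t sig, short, void *arg) {
    auto *base = static_cast<event_base *>(arg);
    std::cout << "get signal: " << sig << std::endl;  // fine here: runs in the event loop
    struct timeval delay = {1, 0};
    event_base_loopexit(base, &delay);
}
void InstallSignalEvents(event_base *base) {
    // evsignal_new() is shorthand for event_new(base, signum, EV_SIGNAL | EV_PERSIST, ...)
    event *sigint = evsignal_new(base, SIGINT, SignalEventCallback, base);
    event *sigterm = evsignal_new(base, SIGTERM, SignalEventCallback, base);
    event_add(sigint, nullptr);
    event_add(sigterm, nullptr);  // event_free() both at shutdown (omitted here)
}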
Read also Advanced Linux Programming then syscalls(2) and socket(7) and tcp(7).

Related

OpenMP vectorised code runs way slower than O3 optimized code

I have a minimal reproducible sample, as follows:
#include <iostream>
#include <chrono>
#include <immintrin.h>
#include <vector>
#include <numeric>
template<typename type>
void AddMatrixOpenMP(type* matA, type* matB, type* result, size_t size){
for(size_t i=0; i < size * size; i++){
result[i] = matA[i] + matB[i];
}
}
int main(){
size_t size = 8192;
//std::cout<<sizeof(double) * 8<<std::endl;
auto matA = (float*) aligned_alloc(sizeof(float), size * size * sizeof(float));
auto matB = (float*) aligned_alloc(sizeof(float), size * size * sizeof(float));
auto result = (float*) aligned_alloc(sizeof(float), size * size * sizeof(float));
for(int i = 0; i < size * size; i++){
*(matA + i) = i;
*(matB + i) = i;
}
auto start = std::chrono::high_resolution_clock::now();
for(int j=0; j<500; j++){
AddMatrixOpenMP<float>(matA, matB, result, size);
}
auto end = std::chrono::high_resolution_clock::now();
auto duration = std::chrono::duration_cast<std::chrono::microseconds>(end - start).count();
std::cout<<"Average Time is = "<<duration/500<<std::endl;
std::cout<<*(result + 100)<<" "<<*(result + 1343)<<std::endl;
}
I experiment as follows: I time the code with the #pragma omp for simd directive on the loop in the AddMatrixOpenMP function, and then time it without the directive. I compile the code as follows:
g++ -O3 -fopenmp example.cpp
Upon inspecting the assembly, both variants generate vector instructions, but when the OpenMP pragma is explicitly specified, the code runs 3 times slower.
I am not able to understand why.
Edit: I am running GCC 9.3 and OpenMP 4.5. This is running on an i7-9750H 6C/12T on Ubuntu 20.04. I ensured no major processes were running in the background. The CPU frequency held more or less constant during the run for both versions (minor variations from 4.0 to 4.1 GHz).
TIA
The non-OpenMP vectorizer is defeating your benchmark with loop inversion.
Make your function __attribute__((noinline, noclone)) to stop GCC from inlining it into the repeat loop. For cases like this with large enough functions that call/ret overhead is minor, and constant propagation isn't important, this is a pretty good way to make sure that the compiler doesn't hoist work out of the loop.
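For reference, a minimal sketch of that suggested change, using GCC attribute syntax (and with the #pragma omp simd directive that the next paragraph mentions adding):
#include <cstddef>
// noinline/noclone keep the call inside the repeat loop, so GCC cannot hoist or
// invert the work out of the timed region.
template <typename type>
__attribute__((noinline, noclone))
void AddMatrixOpenMP(type *matA, type *matB, type *result, std::size_t size) {
#pragma omp simd
    for (std::size_t i = 0; i < size * size; i++) {
        result[i] = matA[i] + matB[i];
    }
}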
And in future, check the asm, and/or make sure the benchmark time scales linearly with the iteration count. e.g. increasing 500 up to 1000 should give the same average time in a benchmark that's working properly, but it won't with -O3. (Although it's surprisingly close here, so that smell test doesn't definitively detect the problem!)
After adding the missing #pragma omp simd to the code, yeah I can reproduce this. On i7-6700k Skylake (3.9GHz with DDR4-2666) with GCC 10.2 -O3 (without -march=native or -fopenmp), I get 18266, but with -O3 -fopenmp I get avg time 39772.
With the OpenMP vectorized version, if I look at top while it runs, memory usage (RSS) is steady at 771 MiB. (As expected: init code faults in the two inputs, and the first iteration of the timed region writes to result, triggering page-faults for it, too.)
But with the "normal" vectorizer (not OpenMP), I see the memory usage climb from ~500 MiB until it exits just as it reaches the max 770MiB.
So it looks like gcc -O3 performed some kind of loop inversion after inlining and defeated the memory-bandwidth-intensive aspect of your benchmark loop, only touching each array element once.
The asm shows the evidence: GCC 9.3 -O3 on Godbolt doesn't vectorize, and it leaves an empty inner loop instead of repeating the work.
.L4: # outer loop
movss xmm0, DWORD PTR [rbx+rdx*4]
addss xmm0, DWORD PTR [r13+0+rdx*4] # one scalar operation
mov eax, 500
.L3: # do {
sub eax, 1 # empty inner loop after inversion
jne .L3 # }while(--i);
add rdx, 1
movss DWORD PTR [rcx], xmm0
add rcx, 4
cmp rdx, 67108864
jne .L4
This is only 2 or 3x faster than fully doing the work. Probably because it's not vectorized, and it's effectively running a delay loop instead of optimizing away the empty inner loop entirely. And because modern desktops have very good single-threaded memory bandwidth.
Bumping up the repeat count from 500 to 1000 only improved the computed "average" from 18266 to 17821 us per iter. An empty loop still takes 1 iteration per clock. Normally scaling linearly with the repeat count is a good litmus test for broken benchmarks, but this is close enough to be believable.
There's also the overhead of page faults inside the timed region, but the whole thing runs for multiple seconds so that's minor.
The OpenMP vectorized version does respect your benchmark repeat-loop. (Or to put it another way, doesn't manage to find the huge optimization that's possible in this code.)
Looking at memory bandwidth while the benchmark is running:
Running intel_gpu_top -l while the proper benchmark is running (OpenMP, or with __attribute__((noinline, noclone))) shows the following. IMC is the Integrated Memory Controller on the CPU die, shared by the IA cores and the GPU via the ring bus; that's why a GPU-monitoring program is useful here.
$ intel_gpu_top -l
Freq MHz IRQ RC6 Power IMC MiB/s RCS/0 BCS/0 VCS/0 VECS/0
req act /s % W rd wr % se wa % se wa % se wa % se wa
0 0 0 97 0.00 20421 7482 0.00 0 0 0.00 0 0 0.00 0 0 0.00 0 0
3 4 14 99 0.02 19627 6505 0.47 0 0 0.00 0 0 0.00 0 0 0.00 0 0
7 7 20 98 0.02 19625 6516 0.67 0 0 0.00 0 0 0.00 0 0 0.00 0 0
11 10 22 98 0.03 19632 6516 0.65 0 0 0.00 0 0 0.00 0 0 0.00 0 0
3 4 13 99 0.02 19609 6505 0.46 0 0 0.00 0 0 0.00 0 0 0.00 0 0
Note the ~19.6GB/s read / 6.5GB/s write. Read ~= 3x write since it's not using NT stores for the output stream.
But with -O3 defeating the benchmark, with a 1000 repeat count, we see only near-idle levels of main-memory bandwidth.
Freq MHz IRQ RC6 Power IMC MiB/s RCS/0 BCS/0 VCS/0 VECS/0
req act /s % W rd wr % se wa % se wa % se wa % se wa
...
8 8 17 99 0.03 365 85 0.62 0 0 0.00 0 0 0.00 0 0 0.00 0 0
9 9 17 99 0.02 349 90 0.62 0 0 0.00 0 0 0.00 0 0 0.00 0 0
4 4 5 100 0.01 303 63 0.25 0 0 0.00 0 0 0.00 0 0 0.00 0 0
7 7 15 100 0.02 345 69 0.43 0 0 0.00 0 0 0.00 0 0 0.00 0 0
10 10 21 99 0.03 350 74 0.64 0 0 0.00 0 0 0.00 0 0 0.00 0 0
vs. a baseline of 150 to 180 MB/s read, 35 to 50MB/s write when the benchmark isn't running at all. (I have some programs running that don't totally sleep even when I'm not touching the mouse / keyboard.)
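As an aside on the NT-store remark above, here is a rough sketch (not from the original post; AddMatrixStream is an illustrative name) of what a streaming-store version of the add loop could look like. It assumes the buffers are 32-byte aligned (e.g. allocated with aligned_alloc(32, ...)), that size*size is a multiple of 8, and that you compile with -mavx or -march=native:
#include <immintrin.h>
#include <cstddef>
// AVX add with non-temporal stores: the output stream bypasses the cache, so the
// read-for-ownership traffic on 'result' (the reason read ~= 3x write above) goes away.
void AddMatrixStream(const float *matA, const float *matB, float *result, std::size_t size) {
    for (std::size_t i = 0; i < size * size; i += 8) {
        __m256 a = _mm256_load_ps(matA + i);
        __m256 b = _mm256_load_ps(matB + i);
        _mm256_stream_ps(result + i, _mm256_add_ps(a, b));
    }
    _mm_sfence();  // order the NT stores before anything that later reads 'result'
}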

Huge latency spikes while running simple code

I have a simple benchmark that demonstrates performance of busywait threads. It runs in two modes: first one simply gets two timepoints sequentially, second one iterates through vector and measures duration of an iteration.
I see that two sequential calls of clock::now() take about 50 nanoseconds on average, and one iteration through the vector takes about 100 nanoseconds on average. But sometimes these operations are executed with a huge delay: about 50 microseconds in the first case and 10 milliseconds (!) in the second case.
The test runs on a single isolated core, so context switches do not occur. I also call mlockall at the beginning of the program, so I assume that page faults do not affect the performance.
The following additional optimizations were also applied:
kernel boot parameters: intel_idle.max_cstate=0 idle=halt
irqaffinity=0,14 isolcpus=4-13,16-27 pti=off spectre_v2=off audit=0
selinux=0 nmi_watchdog=0 nosoftlockup=0 rcu_nocb_poll rcu_nocbs=19-20
nohz_full=19-20;
rcu[^c] kernel threads moved to a housekeeping CPU core 0;
network card RxTx queues moved to a housekeeping CPU core 0;
writeback kernel workqueue moved to a housekeeping CPU core 0;
transparent_hugepage disabled;
Intel CPU HyperThreading disabled;
swap file/partition is not used.
Environment:
System details:
Default Archlinux kernel:
5.1.9-arch1-1-ARCH #1 SMP PREEMPT Tue Jun 11 16:18:09 UTC 2019 x86_64 GNU/Linux
that has following PREEMPT and HZ settings:
CONFIG_HZ_300=y
CONFIG_HZ=300
CONFIG_PREEMPT=y
Hardware details:
RAM: 256GB
CPU(s): 28
On-line CPU(s) list: 0-27
Thread(s) per core: 1
Core(s) per socket: 14
Socket(s): 2
NUMA node(s): 2
Vendor ID: GenuineIntel
CPU family: 6
Model: 79
Model name: Intel(R) Xeon(R) CPU E5-2690 v4 @ 2.60GHz
Stepping: 1
CPU MHz: 3200.011
CPU max MHz: 3500.0000
CPU min MHz: 1200.0000
BogoMIPS: 5202.68
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 35840K
NUMA node0 CPU(s): 0-13
NUMA node1 CPU(s): 14-27
Example code:
struct TData
{
std::vector<char> Data;
TData() = default;
TData(size_t aSize)
{
for (size_t i = 0; i < aSize; ++i)
{
Data.push_back(i);
}
}
};
using TBuffer = std::vector<TData>;
TData DoMemoryOperation(bool aPerform, const TBuffer& aBuffer, size_t& outBufferIndex)
{
if (!aPerform)
{
return TData {};
}
const TData& result = aBuffer[outBufferIndex];
if (++outBufferIndex == aBuffer.size())
{
outBufferIndex = 0;
}
return result;
}
void WarmUp(size_t aCyclesCount, bool aPerform, const TBuffer& aBuffer)
{
size_t bufferIndex = 0;
for (size_t i = 0; i < aCyclesCount; ++i)
{
auto data = DoMemoryOperation(aPerform, aBuffer, bufferIndex);
}
}
void TestCycle(size_t aCyclesCount, bool aPerform, const TBuffer& aBuffer, Measurings& outStatistics)
{
size_t bufferIndex = 0;
for (size_t i = 0; i < aCyclesCount; ++i)
{
auto t1 = std::chrono::steady_clock::now();
{
auto data = DoMemoryOperation(aPerform, aBuffer, bufferIndex);
}
auto t2 = std::chrono::steady_clock::now();
auto diff = std::chrono::duration_cast<std::chrono::nanoseconds>(t2 - t1).count();
outStatistics.AddMeasuring(diff, t2);
}
}
int Run(int aCpu, size_t aDataSize, size_t aBufferSize, size_t aCyclesCount, bool aAllocate, bool aPerform)
{
if (mlockall(MCL_CURRENT | MCL_FUTURE))
{
throw std::runtime_error("mlockall failed");
}
std::cout << "Test parameters"
<< ":\ndata size=" << aDataSize
<< ",\nnumber of elements=" << aBufferSize
<< ",\nbuffer size=" << aBufferSize * aDataSize
<< ",\nnumber of cycles=" << aCyclesCount
<< ",\nallocate=" << aAllocate
<< ",\nperform=" << aPerform
<< ",\nthread ";
SetCpuAffinity(aCpu);
TBuffer buffer;
if (aPerform)
{
buffer.resize(aBufferSize);
std::fill(buffer.begin(), buffer.end(), TData { aDataSize });
}
WaitForKey();
std::cout << "Running..."<< std::endl;
WarmUp(aBufferSize * 2, aPerform, buffer);
Measurings statistics;
TestCycle(aCyclesCount, aPerform, buffer, statistics);
statistics.Print(aCyclesCount);
WaitForKey();
if (munlockall())
{
throw std::runtime_error("munlockall failed");
}
return 0;
}
And the following results are received:
First:
StandaloneTests --run_test=MemoryAccessDelay --cpu=19 --data-size=280 --size=67108864 --count=1000000000 --allocate=1 --perform=0
Test parameters:
data size=280,
number of elements=67108864,
buffer size=18790481920,
number of cycles=1000000000,
allocate=1,
perform=0,
thread 14056 on cpu 19
Statistics: min: 16: max: 18985: avg: 18
0 - 10 : 0 (0 %): -
10 - 100 : 999993494 (99 %): min: 40: max: 117130: avg: 40
100 - 1000 : 946 (0 %): min: 380: max: 506236837: avg: 43056598
1000 - 10000 : 5549 (0 %): min: 56876: max: 70001739: avg: 7341862
10000 - 18985 : 11 (0 %): min: 1973150818: max: 14060001546: avg: 3644216650
Second:
StandaloneTests --run_test=MemoryAccessDelay --cpu=19 --data-size=280 --size=67108864 --count=1000000000 --allocate=1 --perform=1
Test parameters:
data size=280,
number of elements=67108864,
buffer size=18790481920,
number of cycles=1000000000,
allocate=1,
perform=1,
thread 3264 on cpu 19
Statistics: min: 36: max: 4967479: avg: 48
0 - 10 : 0 (0 %): -
10 - 100 : 964323921 (96 %): min: 60: max: 4968567: avg: 74
100 - 1000 : 35661548 (3 %): min: 122: max: 4972632: avg: 2023
1000 - 10000 : 14320 (0 %): min: 1721: max: 33335158: avg: 5039338
10000 - 100000 : 130 (0 %): min: 10010533: max: 1793333832: avg: 541179510
100000 - 1000000 : 0 (0 %): -
1000000 - 4967479 : 81 (0 %): min: 508197829: max: 2456672083: avg: 878824867
Any ideas what the reason for such huge delays is, and how it may be investigated?
In:
TData DoMemoryOperation(bool aPerform, const TBuffer& aBuffer, size_t& outBufferIndex);
it returns a TData (which holds a std::vector<char>) by value. That involves a memory allocation and data copying. The memory allocation can make a syscall (brk or mmap), and memory-mapping-related syscalls are notorious for being slow.
When timings include syscalls one cannot expect low variance.
You may like to run your application with /usr/bin/time --verbose <app> or perf stat -ddd <app> to see the number of page faults and context switches.
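For instance, here is a minimal sketch of one way to keep the allocator (and any brk/mmap it may trigger) out of the timed region, assuming the surrounding benchmark stays the same (DoMemoryOperationNoCopy and kEmpty are illustrative names):
// Return a reference into the existing buffer instead of a TData by value,
// so the timed loop performs no allocation or copy at all.
static const TData kEmpty{};
const TData &DoMemoryOperationNoCopy(bool aPerform, const TBuffer &aBuffer, size_t &outBufferIndex)
{
    if (!aPerform)
    {
        return kEmpty;
    }
    const TData &result = aBuffer[outBufferIndex];
    if (++outBufferIndex == aBuffer.size())
    {
        outBufferIndex = 0;
    }
    return result;
}
The call site would then bind the result to a const TData& (rather than auto) so that no copy is made there either.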

strace -f on own recursive c++ program won't work

So I have a simple recursive C++ program, very basic:
#include <iostream>
int fibonacciRec(int no) {
if (no == 0 || no == 1)
return no;
else
return fibonacciRec(no-1) + fibonacciRec(no-2);
}
int main(int argc, char** argv) {
int no = 42;
for (int i = 1; i <= no; i++) {
std::cout << fibonacciRec(i-1) << " ";
}
std::cout << std::endl;
return 0;
}
Now I want to run strace on this program, showing all the system calls. Basically I want to see a lot of mmaps etc., but as soon as the first loop is called, strace -f stops following the system calls and only shows the last write call. Also, strace -c gives unlikely numbers, since the program takes well more than 4-6 seconds to compute:
% time seconds usecs/call calls errors syscall
------ ----------- ----------- --------- --------- ----------------
60.47 0.000078 78 1 munmap
26.36 0.000034 11 3 brk
13.18 0.000017 3 6 fstat
0.00 0.000000 0 4 read
0.00 0.000000 0 1 write
0.00 0.000000 0 5 close
0.00 0.000000 0 14 mmap
0.00 0.000000 0 10 mprotect
0.00 0.000000 0 6 6 access
0.00 0.000000 0 1 execve
0.00 0.000000 0 1 arch_prctl
0.00 0.000000 0 5 openat
------ ----------- ----------- --------- --------- ----------------
100.00 0.000129 57 6 total
There's no need for any mmaps or any other system calls when fibonacciRec is running.
The only memory that might be allocated is stack memory for the recursive calls, and there are several reasons why those don't show up in the strace:
It's really not a lot of memory. Your maximum recursion depth is about 42, and you've only got 1 local variable, so the stack frames are small. The total stack allocated during the recursion is probably less than 1 page.
Even if it was a lot of memory, the stack allocation only grows, it never shrinks, so you'd see it grow to its maximum pretty quickly, then stay there for a long time. It wouldn't be a flood.
Stack allocation isn't done with a system call anyway. To ask the kernel for more stack, all you have to do is pretend you already have it. The kernel catches the page fault, notices that the faulting address is near your existing stack, and allocates more. It's so transparent that even strace can't see it.
Apart from calling itself and returning a value, fibonacciRec doesn't do anything but manipulate local variables. There are no system calls.
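To see that page-fault behaviour directly, here is a small sketch (not the original program; Recurse is an illustrative name; build it without optimization so the stack writes are not elided). It grows the stack deliberately and shows the growth as minor page faults via getrusage(2), while strace still reports no memory-related syscalls during the recursion:
#include <sys/resource.h>
#include <cstring>
#include <iostream>
// Each call touches a fresh ~16 KiB stack frame, forcing the stack mapping to grow.
long Recurse(int depth) {
    char frame[16384];
    std::memset(frame, depth, sizeof frame);
    return depth == 0 ? frame[0] : frame[0] + Recurse(depth - 1);
}
int main() {
    rusage before{}, after{};
    getrusage(RUSAGE_SELF, &before);
    long r = Recurse(200);  // ~3 MiB of stack, still below the default 8 MiB limit
    getrusage(RUSAGE_SELF, &after);
    std::cout << "result: " << r
              << ", minor page faults during recursion: "
              << (after.ru_minflt - before.ru_minflt) << std::endl;
}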

Why do C++ and strace disagree on how long the open() system call is taking?

I have a program which opens a large number of files. I am timing the execution of a C++ loop which literally just opens and closes the files, using both a C++ timer and strace. Strangely, the system time and the time logged by C++ (which agree with each other) are orders of magnitude larger than the time strace claims was spent in system calls. How can this be? I have put the source and output below.
This all came about because I found that my application was spending an unreasonable amount of time just to open files. To help me pin down the problem I wrote the following test code (for reference the file "files.csv" is just a list with one filepath per line):
#include <stdio.h>
#include...
using namespace std;
int main(){
timespec start, end;
ifstream fin("files.csv");
string line;
vector<string> files;
while(fin >> line){
files.push_back(line);
}
fin.close();
clock_gettime(CLOCK_MONOTONIC, &start);
for(int i=0; i<500; i++){
int filedesc = open(files[i].c_str(), O_RDONLY);  // int, not size_t, so the < 0 check below works
if(filedesc < 0) printf("error in open");
if(close(filedesc)<0) printf("error in close");
}
clock_gettime(CLOCK_MONOTONIC, &end);
printf(" %fs elapsed\n", (end.tv_sec-start.tv_sec) + ((float)(end.tv_nsec - start.tv_nsec))/1000000000);
return 0;
}
And here is what I get when I run it:
-bash$ time strace -ttT -c ./open_stuff
5.162448s elapsed <------ Output from C++ code
% time seconds usecs/call calls errors syscall
------ ----------- ----------- --------- --------- ----------------
99.72 0.043820 86 508 open <------output from strace
0.15 0.000064 0 508 close
0.14 0.000061 0 705 read
0.00 0.000000 0 1 write
0.00 0.000000 0 8 fstat
0.00 0.000000 0 25 mmap
0.00 0.000000 0 12 mprotect
0.00 0.000000 0 3 munmap
0.00 0.000000 0 52 brk
0.00 0.000000 0 2 rt_sigaction
0.00 0.000000 0 1 rt_sigprocmask
0.00 0.000000 0 1 1 access
0.00 0.000000 0 1 execve
0.00 0.000000 0 1 getrlimit
0.00 0.000000 0 1 arch_prctl
0.00 0.000000 0 3 1 futex
0.00 0.000000 0 1 set_tid_address
0.00 0.000000 0 1 set_robust_list
------ ----------- ----------- --------- --------- ----------------
100.00 0.043945 1834 2 total
real 0m5.821s <-------output from time
user 0m0.031s
sys 0m0.084s
In theory the reported "elapsed" time from C++ should be the execution time of the calls to open(2) plus the minimal overhead of executing a for loop 500 times. And yet the sum of the total time in the open(2) and close(2) calls from strace is 99% shorter. I cannot figure out what is going on.
PS: The difference between the elapsed time reported by the C++ code and the real time reported by time is due to the fact that files.csv actually contains tens of thousands of paths, which all get loaded.
Comparing elapsed time with execution time is like comparing apples with orange juice. (One of them is missing the pulp :) ) To open a file, the system has to find and read the appropriate directory entry... and if the paths are deep, it might need to read a number of directory entries. If the entries are not cached, they will need to be read from disk, which will involve a disk seek. While the disk heads are moving, and while the sector is spinning around to where the disk heads are, the wall clock keeps ticking, but the CPU can be doing other stuff (if there is work to do). So that counts as elapsed time -- the inexorable clock ticks on -- but not execution time.
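One way to see this directly is to time the same loop with both a wall clock and a CPU-time clock; the gap between the two is time spent blocked (e.g. waiting for the disk) rather than executing. A rough sketch (the file path comes from argv; everything else mirrors the question):
#include <fcntl.h>
#include <unistd.h>
#include <cstdio>
#include <ctime>
static double seconds(const timespec &a, const timespec &b) {
    return (b.tv_sec - a.tv_sec) + (b.tv_nsec - a.tv_nsec) / 1e9;
}
int main(int argc, char **argv) {
    if (argc < 2) { fprintf(stderr, "usage: %s <file>\n", argv[0]); return 1; }
    timespec wall0, wall1, cpu0, cpu1;
    clock_gettime(CLOCK_MONOTONIC, &wall0);          // elapsed (wall-clock) time
    clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &cpu0);  // time actually spent executing
    for (int i = 0; i < 500; i++) {
        int fd = open(argv[1], O_RDONLY);
        if (fd >= 0) close(fd);
    }
    clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &cpu1);
    clock_gettime(CLOCK_MONOTONIC, &wall1);
    printf("wall: %fs, cpu: %fs\n", seconds(wall0, wall1), seconds(cpu0, cpu1));
    return 0;
}
Opening the same path repeatedly will mostly hit the dentry cache; to reproduce the large gap you would loop over many cold paths, as the files.csv-driven program does.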

CPU high usage of the usleep on Cent OS 6.3

I compile the code below on CentOS 5.3 and CentOS 6.3:
#include <pthread.h>
#include <list>
#include <unistd.h>
#include <iostream>
using namespace std;
pthread_mutex_t _mutex;
pthread_spinlock_t spinlock;
list<int *> _task_list;
void * run(void*);
int main()
{
int worker_num = 3;
pthread_t pids[worker_num];
pthread_mutex_init(&_mutex, NULL);
for (int worker_i = 0; worker_i < worker_num; ++worker_i)
{
pthread_create(&(pids[worker_i]), NULL, run, NULL);
}
sleep(14);
}
void *run(void * args)
{
int *recved_info;
long long start;
while (true)
{
pthread_mutex_lock(&_mutex);
if (_task_list.empty())
{
recved_info = 0;
}
else
{
recved_info = _task_list.front();
_task_list.pop_front();
}
pthread_mutex_unlock(&_mutex);
if (recved_info == 0)
{
int f = usleep(1);
continue;
}
}
}
While running on 5.3, you can't even find the process in top; CPU usage is around 0%. But on CentOS 6.3, it's about 20% with 6 threads on a 4-core CPU.
So I checked the a.out with time and strace; the results are roughly as follows:
On 5.3:
real 0m14.003s
user 0m0.001s
sys 0m0.001s
On 6.3:
real 0m14.002s
user 0m1.484s
sys 0m1.160s
the strace:
on 5.3:
% time seconds usecs/call calls errors syscall
------ ----------- ----------- --------- --------- ----------------
91.71 0.002997 0 14965 nanosleep
8.29 0.000271 271 1 execve
0.00 0.000000 0 5 read
0.00 0.000000 0 10 4 open
0.00 0.000000 0 6 close
0.00 0.000000 0 4 4 stat
0.00 0.000000 0 6 fstat
0.00 0.000000 0 22 mmap
0.00 0.000000 0 13 mprotect
0.00 0.000000 0 1 munmap
0.00 0.000000 0 3 brk
0.00 0.000000 0 3 rt_sigaction
0.00 0.000000 0 3 rt_sigprocmask
0.00 0.000000 0 1 1 access
0.00 0.000000 0 3 clone
0.00 0.000000 0 1 uname
0.00 0.000000 0 1 getrlimit
0.00 0.000000 0 1 arch_prctl
0.00 0.000000 0 38 4 futex
0.00 0.000000 0 1 set_tid_address
0.00 0.000000 0 4 set_robust_list
------ ----------- ----------- --------- --------- ----------------
100.00 0.003268 15092 13 total
on 6.3:
% time seconds usecs/call calls errors syscall
------ ----------- ----------- --------- --------- ----------------
99.99 1.372813 36 38219 nanosleep
0.01 0.000104 0 409 43 futex
0.00 0.000000 0 5 read
0.00 0.000000 0 6 open
0.00 0.000000 0 6 close
0.00 0.000000 0 6 fstat
0.00 0.000000 0 22 mmap
0.00 0.000000 0 15 mprotect
0.00 0.000000 0 1 munmap
0.00 0.000000 0 3 brk
0.00 0.000000 0 3 rt_sigaction
0.00 0.000000 0 3 rt_sigprocmask
0.00 0.000000 0 7 7 access
0.00 0.000000 0 3 clone
0.00 0.000000 0 1 execve
0.00 0.000000 0 1 getrlimit
0.00 0.000000 0 1 arch_prctl
0.00 0.000000 0 1 set_tid_address
0.00 0.000000 0 4 set_robust_list
------ ----------- ----------- --------- --------- ----------------
100.00 1.372917 38716 50 total
The time and strace results are not from the same run, so the data differ a little. But I think they can show something.
I checked the kernel config options CONFIG_HIGH_RES_TIMERS, CONFIG_HPET and CONFIG_HZ:
On 5.3:
$ cat /boot/config-`uname -r` |grep CONFIG_HIGH_RES_TIMERS
$ cat /boot/config-`uname -r` |grep CONFIG_HPET
CONFIG_HPET_TIMER=y
CONFIG_HPET_EMULATE_RTC=y
CONFIG_HPET=y
# CONFIG_HPET_RTC_IRQ is not set
# CONFIG_HPET_MMAP is not set
$ cat /boot/config-`uname -r` |grep CONFIG_HZ
# CONFIG_HZ_100 is not set
# CONFIG_HZ_250 is not set
CONFIG_HZ_1000=y
CONFIG_HZ=1000
On 6.3:
$ cat /boot/config-`uname -r` |grep CONFIG_HIGH_RES_TIMERS
CONFIG_HIGH_RES_TIMERS=y
$ cat /boot/config-`uname -r` |grep CONFIG_HPET
CONFIG_HPET_TIMER=y
CONFIG_HPET_EMULATE_RTC=y
CONFIG_HPET=y
CONFIG_HPET_MMAP=y
$ cat /boot/config-`uname -r` |grep CONFIG_HZ
# CONFIG_HZ_100 is not set
# CONFIG_HZ_250 is not set
# CONFIG_HZ_300 is not set
CONFIG_HZ_1000=y
CONFIG_HZ=1000
In fact, I also tried the code on Arch Linux on ARM and on a Xubuntu 13.04 amd64 desktop, with the same result as CentOS 6.3.
So what can I do to figure out the reason for the different CPU usages?
Does it have anything to do with the kernel config?
You're correct, it has to do with the kernel config. usleep(1) will try to sleep for one microsecond. Before high resolution timers, it was not possible to sleep for less than a jiffy (in your case HZ=1000 so 1 jiffy == 1 millisecond).
On CentOS 5.3, which does not have these high resolution timers, you would sleep between 1ms and 2ms[1]. On CentOS 6.3, which has these timers, you're sleeping for close to one microsecond. That's why you're using more CPU on this platform: you're simply polling your task list 500-1000 times more often.
If you change the code to usleep(1000), CentOS 5.3 will behave the same, and CentOS 6.3 CPU time will decrease and be in the same ballpark as the program running on CentOS 5.3.
There is a full discussion of this in the Linux manual: run man 7 time.
Note that your code should use condition variables instead of polling your task list at a fixed time interval; that's a more efficient and cleaner way to do what you're doing (see the sketch after the footnote below).
Also, your main should really join the threads instead of just sleeping for 14 seconds.
[1] There is one exception. If your application was running under a realtime scheduling policy (SCHED_FIFO or SCHED_RR), it would busy-wait instead of sleeping to sleep close to the right amount. But by default you need root privileges
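To illustrate those last two suggestions, here is a minimal sketch (illustrative names; not the original program) where workers block on a condition variable until there is work or a shutdown request, and main joins the threads instead of sleeping:
#include <pthread.h>
#include <list>
static pthread_mutex_t g_mutex = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t g_cond = PTHREAD_COND_INITIALIZER;
static std::list<int *> g_tasks;
static bool g_stop = false;
void *worker(void *) {
    for (;;) {
        pthread_mutex_lock(&g_mutex);
        while (g_tasks.empty() && !g_stop) {
            pthread_cond_wait(&g_cond, &g_mutex);  // sleeps until signalled; no polling
        }
        if (g_tasks.empty() && g_stop) {
            pthread_mutex_unlock(&g_mutex);
            return nullptr;
        }
        int *task = g_tasks.front();
        g_tasks.pop_front();
        pthread_mutex_unlock(&g_mutex);
        // ... process task ...
        (void)task;
    }
}
int main() {
    pthread_t pids[3];
    for (pthread_t &pid : pids) { pthread_create(&pid, nullptr, worker, nullptr); }
    // Producers would lock g_mutex, push_back a task, pthread_cond_signal(&g_cond), then unlock.
    pthread_mutex_lock(&g_mutex);
    g_stop = true;  // request shutdown
    pthread_cond_broadcast(&g_cond);
    pthread_mutex_unlock(&g_mutex);
    for (pthread_t &pid : pids) { pthread_join(pid, nullptr); }  // join instead of sleep(14)
}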