Huge latency spikes while running simple code

Huge latency spikes while running simple code - c++

I have a simple benchmark that demonstrates performance of busywait threads. It runs in two modes: first one simply gets two timepoints sequentially, second one iterates through vector and measures duration of an iteration.
I see that two sequential calls of clock::now() takes about 50 nanoseconds on the average and one average iteration through vector takes about 100 nanoseconds. But sometimes these operations are executed with a huge delay: about 50 microseconds in the first case and 10 milliseconds (!) in the second case.
Test runs on single isolated core so context switches do not occur. I also call mlockall in beginning of the program so I assume that page faults do not affect the performance.
Following additional optimizations were also applied:
kernel boot parameters: intel_idle.max_cstate=0 idle=halt
irqaffinity=0,14 isolcpus=4-13,16-27 pti=off spectre_v2=off audit=0
selinux=0 nmi_watchdog=0 nosoftlockup=0 rcu_nocb_poll rcu_nocbs=19-20
nohz_full=19-20;
rcu[^c] kernel threads moved to a housekeeping CPU core 0;
network card RxTx queues moved to a housekeeping CPU core 0;
writeback kernel workqueue moved to a housekeeping CPU core 0;
transparent_hugepage disabled;
Intel CPU HyperThreading disabled;
swap file/partition is not used.
Environment:
System details:
Default Archlinux kernel:
5.1.9-arch1-1-ARCH #1 SMP PREEMPT Tue Jun 11 16:18:09 UTC 2019 x86_64 GNU/Linux
that has following PREEMPT and HZ settings:
CONFIG_HZ_300=y
CONFIG_HZ=300
CONFIG_PREEMPT=y
Hardware details:
RAM: 256GB
CPU(s): 28
On-line CPU(s) list: 0-27
Thread(s) per core: 1
Core(s) per socket: 14
Socket(s): 2
NUMA node(s): 2
Vendor ID: GenuineIntel
CPU family: 6
Model: 79
Model name: Intel(R) Xeon(R) CPU E5-2690 v4 # 2.60GHz
Stepping: 1
CPU MHz: 3200.011
CPU max MHz: 3500.0000
CPU min MHz: 1200.0000
BogoMIPS: 5202.68
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 35840K
NUMA node0 CPU(s): 0-13
NUMA node1 CPU(s): 14-27
Example code:
struct TData
{
std::vector<char> Data;
TData() = default;
TData(size_t aSize)
{
for (size_t i = 0; i < aSize; ++i)
{
Data.push_back(i);
}
}
};
using TBuffer = std::vector<TData>;
TData DoMemoryOperation(bool aPerform, const TBuffer& aBuffer, size_t& outBufferIndex)
{
if (!aPerform)
{
return TData {};
}
const TData& result = aBuffer[outBufferIndex];
if (++outBufferIndex == aBuffer.size())
{
outBufferIndex = 0;
}
return result;
}
void WarmUp(size_t aCyclesCount, bool aPerform, const TBuffer& aBuffer)
{
size_t bufferIndex = 0;
for (size_t i = 0; i < aCyclesCount; ++i)
{
auto data = DoMemoryOperation(aPerform, aBuffer, bufferIndex);
}
}
void TestCycle(size_t aCyclesCount, bool aPerform, const TBuffer& aBuffer, Measurings& outStatistics)
{
size_t bufferIndex = 0;
for (size_t i = 0; i < aCyclesCount; ++i)
{
auto t1 = std::chrono::steady_clock::now();
{
auto data = DoMemoryOperation(aPerform, aBuffer, bufferIndex);
}
auto t2 = std::chrono::steady_clock::now();
auto diff = std::chrono::duration_cast<std::chrono::nanoseconds>(t2 - t1).count();
outStatistics.AddMeasuring(diff, t2);
}
}
int Run(int aCpu, size_t aDataSize, size_t aBufferSize, size_t aCyclesCount, bool aAllocate, bool aPerform)
{
if (mlockall(MCL_CURRENT | MCL_FUTURE))
{
throw std::runtime_error("mlockall failed");
}
std::cout << "Test parameters"
<< ":\ndata size=" << aDataSize
<< ",\nnumber of elements=" << aBufferSize
<< ",\nbuffer size=" << aBufferSize * aDataSize
<< ",\nnumber of cycles=" << aCyclesCount
<< ",\nallocate=" << aAllocate
<< ",\nperform=" << aPerform
<< ",\nthread ";
SetCpuAffinity(aCpu);
TBuffer buffer;
if (aPerform)
{
buffer.resize(aBufferSize);
std::fill(buffer.begin(), buffer.end(), TData { aDataSize });
}
WaitForKey();
std::cout << "Running..."<< std::endl;
WarmUp(aBufferSize * 2, aPerform, buffer);
Measurings statistics;
TestCycle(aCyclesCount, aPerform, buffer, statistics);
statistics.Print(aCyclesCount);
WaitForKey();
if (munlockall())
{
throw std::runtime_error("munlockall failed");
}
return 0;
}
And following results are received:
First:
StandaloneTests --run_test=MemoryAccessDelay --cpu=19 --data-size=280 --size=67108864 --count=1000000000 --allocate=1 --perform=0
Test parameters:
data size=280,
number of elements=67108864,
buffer size=18790481920,
number of cycles=1000000000,
allocate=1,
perform=0,
thread 14056 on cpu 19
Statistics: min: 16: max: 18985: avg: 18
0 - 10 : 0 (0 %): -
10 - 100 : 999993494 (99 %): min: 40: max: 117130: avg: 40
100 - 1000 : 946 (0 %): min: 380: max: 506236837: avg: 43056598
1000 - 10000 : 5549 (0 %): min: 56876: max: 70001739: avg: 7341862
10000 - 18985 : 11 (0 %): min: 1973150818: max: 14060001546: avg: 3644216650
Second:
StandaloneTests --run_test=MemoryAccessDelay --cpu=19 --data-size=280 --size=67108864 --count=1000000000 --allocate=1 --perform=1
Test parameters:
data size=280,
number of elements=67108864,
buffer size=18790481920,
number of cycles=1000000000,
allocate=1,
perform=1,
thread 3264 on cpu 19
Statistics: min: 36: max: 4967479: avg: 48
0 - 10 : 0 (0 %): -
10 - 100 : 964323921 (96 %): min: 60: max: 4968567: avg: 74
100 - 1000 : 35661548 (3 %): min: 122: max: 4972632: avg: 2023
1000 - 10000 : 14320 (0 %): min: 1721: max: 33335158: avg: 5039338
10000 - 100000 : 130 (0 %): min: 10010533: max: 1793333832: avg: 541179510
100000 - 1000000 : 0 (0 %): -
1000000 - 4967479 : 81 (0 %): min: 508197829: max: 2456672083: avg: 878824867
Any ideas what is the reason of such huge delays and how it may be investigated?

In:
TData DoMemoryOperation(bool aPerform, const TBuffer& aBuffer, size_t& outBufferIndex);
It returns a std::vector<char> by value. That involves a memory allocation and data copying. The memory allocations can do a syscall (brk or mmap) and memory mappings related syscalls are notorious for being slow.
When timings include syscalls one cannot expect low variance.
You may like to run your application with /usr/bin/time --verbose <app> or perf -ddd <app> to see the number of page faults and context switches.

Related

Opencv weird speed performance of convertTo and LUT

Environment
OpenCV 4.4.0
CentOS 7.2 docker
gcc 7.3.1
What I want to do
I'm doing some CV deeplearning deployment optimization.Some of my models needs their input to be normalized. So preprocessing normalization became an optimization point. Pixel value is between [0, 255], so first thing is multiplying it with 1 / 255.0, which is my first method. After some google I found LUT which theoretically should be faster than float calculation. So I wrote code like below to test the two methods:
Test code
#include "opencv2/imgcodecs/imgcodecs.hpp"
#include "opencv2/imgproc/imgproc.hpp"
#include <chrono>
#include <dirent.h>
#include <iostream>
#include <string>
#include <vector>
int getFiles(const std::string path, std::vector<std::string>& files, std::string suffix)
{
int iFileCnt = 0;
DIR* dirptr = NULL;
struct dirent* dirp;
if ((dirptr = opendir(path.c_str())) == NULL) {
return 0;
}
while ((dirp = readdir(dirptr)) != NULL) {
if ((dirp->d_type == DT_REG) && 0 == (strcmp(strchr(dirp->d_name, '.'), suffix.c_str()))) {
files.push_back(dirp->d_name);
}
++iFileCnt;
}
closedir(dirptr);
return iFileCnt;
}
int main(int argc, char* argv[])
{
std::string pic_dir = argv[1];
int loop_count = 10;
if (argc >= 3) {
loop_count = std::stoi(argv[2]);
}
float FACTOR = 1 / 255.0;
std::vector<cv::Size> sizes = {
{299, 299},
{416, 416},
{512, 512},
{640, 640},
{960, 540},
{1920, 1080}
};
// std::vector<cv::Size> sizes = {
// {1920, 1080},
// {960, 540},
// {640, 640},
// {512, 512},
// {416, 416},
// {299, 299}
// };
cv::Mat table(1, 256, CV_32FC1);
auto ptr = table.ptr<float>(0);
for (int i = 0; i < 256; ++i) {
ptr[i] = float(i) * FACTOR;
}
std::vector<std::string> pic_files;
getFiles(pic_dir, pic_files, ".jpg");
std::vector<cv::Mat> image_mats(pic_files.size());
for (int i = 0; i < pic_files.size(); ++i) {
std::string one_pic_path = pic_dir + "/" + pic_files[i];
image_mats[i] = cv::imread(one_pic_path);
}
for (auto& one_size : sizes) {
std::cout << "size: " << one_size << std::endl;
double time_1 = 0;
double time_2 = 0;
for (auto& one_mat : image_mats) {
cv::Mat tmp_image;
cv::resize(one_mat, tmp_image, one_size);
for (int i = 0; i < loop_count; ++i) {
auto t_1_1 = std::chrono::steady_clock::now();
cv::Mat out_1;
tmp_image.convertTo(out_1, CV_32FC3, FACTOR);
auto t_1_2 = std::chrono::steady_clock::now();
time_1 += std::chrono::duration<double, std::milli>(t_1_2 - t_1_1).count();
auto t_2_1 = std::chrono::steady_clock::now();
cv::Mat out_2;
cv::LUT(tmp_image, table, out_2);
auto t_2_2 = std::chrono::steady_clock::now();
time_2 += std::chrono::duration<double, std::milli>(t_2_2 - t_2_1).count();
auto diff = cv::sum(out_1 - out_2);
if (diff[0] > 1E-3) {
std::cout << diff << std::endl;
}
}
}
size_t count = loop_count * image_mats.size();
auto average_time_1 = time_1 / count;
auto average_time_2 = time_2 / count;
auto promote_percent = (average_time_1 - average_time_2) / average_time_1 * 100;
printf("total pic num: %d, loop %d times\n", pic_files.size(), loop_count);
printf("method_1, total %f ms, average %f ms\n", time_1, average_time_1);
printf("method_2, total %f ms, average %f ms, promote: %.2f%\n", time_2, average_time_2,
promote_percent);
printf("\n");
}
return 0;
}
Weird performance
What I want to test is speed difference between two methods, with different input sizes, while the outputs of two methods should be equal. I took 128 pictures with different sizes for test. Here is the weird performance:
1. result of the code above
size: [299 x 299]
total pic num: 128, loop 10 times
method_1, total 38.872174 ms, average 0.030369 ms
method_2, total 330.688332 ms, average 0.258350 ms, promote: -750.71%
size: [416 x 416]
total pic num: 128, loop 10 times
method_1, total 103.708926 ms, average 0.081023 ms
method_2, total 689.972421 ms, average 0.539041 ms, promote: -565.30%
size: [512 x 512]
total pic num: 128, loop 10 times
method_1, total 267.989430 ms, average 0.209367 ms
method_2, total 450.809036 ms, average 0.352195 ms, promote: -68.22%
size: [640 x 640]
total pic num: 128, loop 10 times
method_1, total 757.269510 ms, average 0.591617 ms
method_2, total 551.951118 ms, average 0.431212 ms, promote: 27.11%
size: [960 x 540]
total pic num: 128, loop 10 times
method_1, total 1095.167540 ms, average 0.855600 ms
method_2, total 760.330269 ms, average 0.594008 ms, promote: 30.57%
size: [1920 x 1080]
total pic num: 128, loop 10 times
method_1, total 4944.142104 ms, average 3.862611 ms
method_2, total 3471.176202 ms, average 2.711856 ms, promote: 29.79%
2. comment the diff part:
//auto diff = cv::sum(out_1 - out_2);
//if (diff[0] > 1E-3) {
// std::cout << diff << std::endl;
//}
size: [299 x 299]
total pic num: 128, loop 10 times
method_1, total 246.356823 ms, average 0.192466 ms
method_2, total 361.859598 ms, average 0.282703 ms, promote: -46.88%
size: [416 x 416]
total pic num: 128, loop 10 times
method_1, total 516.542233 ms, average 0.403549 ms
method_2, total 719.191240 ms, average 0.561868 ms, promote: -39.23%
size: [512 x 512]
total pic num: 128, loop 10 times
method_1, total 839.599260 ms, average 0.655937 ms
method_2, total 342.608080 ms, average 0.267663 ms, promote: 59.19%
size: [640 x 640]
total pic num: 128, loop 10 times
method_1, total 1384.348467 ms, average 1.081522 ms
method_2, total 524.382672 ms, average 0.409674 ms, promote: 62.12%
size: [960 x 540]
total pic num: 128, loop 10 times
method_1, total 1796.153597 ms, average 1.403245 ms
method_2, total 688.210851 ms, average 0.537665 ms, promote: 61.68%
size: [1920 x 1080]
total pic num: 128, loop 10 times
method_1, total 7707.945924 ms, average 6.021833 ms
method_2, total 3812.262622 ms, average 2.978330 ms, promote: 50.54%
3. Uncomment the diff part but reverse the sizes vector
std::vector<cv::Size> sizes = {
{1920, 1080},
{960, 540},
{640, 640},
{512, 512},
{416, 416},
{299, 299}
};
...
auto diff = cv::sum(out_1 - out_2);
if (diff[0] > 1E-3) {
std::cout << diff << std::endl;
}
size: [1920 x 1080]
total pic num: 128, loop 10 times
method_1, total 4933.384896 ms, average 3.854207 ms
method_2, total 3563.611341 ms, average 2.784071 ms, promote: 27.77%
size: [960 x 540]
total pic num: 128, loop 10 times
method_1, total 887.353187 ms, average 0.693245 ms
method_2, total 917.995079 ms, average 0.717184 ms, promote: -3.45%
size: [640 x 640]
total pic num: 128, loop 10 times
method_1, total 492.562282 ms, average 0.384814 ms
method_2, total 525.089826 ms, average 0.410226 ms, promote: -6.60%
size: [512 x 512]
total pic num: 128, loop 10 times
method_1, total 181.900041 ms, average 0.142109 ms
method_2, total 159.691528 ms, average 0.124759 ms, promote: 12.21%
size: [416 x 416]
total pic num: 128, loop 10 times
method_1, total 77.030586 ms, average 0.060180 ms
method_2, total 221.307936 ms, average 0.172897 ms, promote: -187.30%
size: [299 x 299]
total pic num: 128, loop 10 times
method_1, total 38.139366 ms, average 0.029796 ms
method_2, total 112.203023 ms, average 0.087659 ms, promote: -194.19%
4. Comment the diff part and reverse the sizes vector
std::vector<cv::Size> sizes = {
{1920, 1080},
{960, 540},
{640, 640},
{512, 512},
{416, 416},
{299, 299}
};
...
//auto diff = cv::sum(out_1 - out_2);
//if (diff[0] > 1E-3) {
// std::cout << diff << std::endl;
//}
size: [1920 x 1080]
total pic num: 128, loop 10 times
method_1, total 8021.875493 ms, average 6.267090 ms
method_2, total 3849.222334 ms, average 3.007205 ms, promote: 52.02%
size: [960 x 540]
total pic num: 128, loop 10 times
method_1, total 605.553580 ms, average 0.473089 ms
method_2, total 477.145896 ms, average 0.372770 ms, promote: 21.21%
size: [640 x 640]
total pic num: 128, loop 10 times
method_1, total 268.076975 ms, average 0.209435 ms
method_2, total 169.015667 ms, average 0.132043 ms, promote: 36.95%
size: [512 x 512]
total pic num: 128, loop 10 times
method_1, total 117.419851 ms, average 0.091734 ms
method_2, total 94.436479 ms, average 0.073778 ms, promote: 19.57%
size: [416 x 416]
total pic num: 128, loop 10 times
method_1, total 73.963177 ms, average 0.057784 ms
method_2, total 221.397616 ms, average 0.172967 ms, promote: -199.33%
size: [299 x 299]
total pic num: 128, loop 10 times
method_1, total 38.046131 ms, average 0.029724 ms
method_2, total 113.839007 ms, average 0.088937 ms, promote: -199.21%
Question
I know cpu working state may undulate, and speed performance may not be exactly the same. But Why does code outside timekeeping have such influence on convertTo and LUT?

How to improve libevent bufferevent_write performance

For my test case, only send "00" message by bufferevent_write.
case 1: 20,000 tcp connections, and send "00" to each every 10s, it will cost 0.15s.
case 2: only 1 tcp connection, and send "00" 20,000 times every 10s, it will cost 0.015s.
Please give me some suggestions to improve bufferevent_write performance.
I just wanna as fast as possible, and wonder that, if bufferevent_write is async, why send 20k message to 1 tcp is much faster than send 1 msssage to 20k tcp.
CPU info:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 16
On-line CPU(s) list: 0-15
Thread(s) per core: 2
Core(s) per socket: 8
Socket(s): 1
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 85
Model name: Intel(R) Xeon(R) Platinum 8269CY CPU # 2.50GHz
Stepping: 7
CPU MHz: 2500.000
BogoMIPS: 5000.00
Hypervisor vendor: KVM
Virtualization type: full
L1d cache: 32K
L1i cache: 32K
L2 cache: 1024K
L3 cache: 36608K
NUMA node0 CPU(s): 0-15
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl cpuid tsc_known_freq pni pclmulqdq monitor ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single pti fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx avx512f avx512dq rdseed adx smap avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves arat avx512_vnni
Memory info:
32G
the whole test case
#include <event2/buffer.h>
#include <event2/bufferevent.h>
#include <event2/event.h>
#include <event2/listener.h>
#include <event2/thread.h>
#include <netinet/tcp.h>
#include <atomic>
#include <cerrno>
#include <csignal>
#include <cstring>
#include <ctime>
#include <deque>
#include <functional>
#include <iostream>
#include <map>
#include <mutex>
#include <set>
#include <thread>
using namespace std::chrono_literals;
static event_base *kEventBase{nullptr};
static evconnlistener *kListener{nullptr};
static std::set<bufferevent *> kSessions{};
static std::mutex kSessionsMutex{};
static std::atomic_bool kRunning{false};
static void stop() {
kRunning = false;
if (kListener != nullptr) {
evconnlistener_disable(kListener);
std::cout << "normal listener stopped" << std::endl;
}
struct timeval local_timeval = {1, 0};
if (kEventBase != nullptr) { event_base_loopexit(kEventBase, &local_timeval); }
}
static void handler(int sig) {
std::cout << "get signal: " << sig << std::endl;
stop();
}
static void ReadCallback(bufferevent *event, void *) {
auto buffer = evbuffer_new();
evbuffer_add_buffer(buffer, bufferevent_get_input(event));
auto data_size = evbuffer_get_length(buffer);
char data[data_size + 1];
bzero(data, data_size + 1);
evbuffer_remove(buffer, data, data_size);
evbuffer_free(buffer);
std::cout << "get data: " << data << std::endl;
}
static void EventCallback(bufferevent *event, short events, void *) {
if (events & BEV_EVENT_EOF) {
std::cout << "socket EOF" << std::endl;
} else if (events & BEV_EVENT_ERROR) {
std::cout << "socket error: " << evutil_socket_error_to_string(EVUTIL_SOCKET_ERROR());
} else if (events & BEV_EVENT_TIMEOUT) {
std::cout << "socket read/write timeout" << std::endl;
} else {
std::cout << "unhandled socket events: " << std::to_string(events) << std::endl;
}
{
std::lock_guard<std::mutex> local_lock_guard{kSessionsMutex};
kSessions.erase(event);
bufferevent_free(event);
}
}
static void listenerCallback(evconnlistener *, evutil_socket_t socket, sockaddr *, int, void *) {
bufferevent *event =
bufferevent_socket_new(kEventBase, socket, BEV_OPT_CLOSE_ON_FREE | BEV_OPT_THREADSAFE);
if (event == nullptr) {
std::cout << "create buffer event failed" << std::endl;
return;
}
int enable = 1;
setsockopt(socket, IPPROTO_TCP, TCP_NODELAY, (void *)&enable, sizeof(enable));
setsockopt(socket, IPPROTO_TCP, TCP_QUICKACK, (void *)&enable, sizeof(enable));
bufferevent_setcb(event, ReadCallback, nullptr, EventCallback, nullptr);
bufferevent_enable(event, EV_WRITE | EV_READ);
kSessions.emplace(event);
}
int main(int argc, const char **argv) {
signal(SIGTERM, handler);
signal(SIGINT, handler);
evthread_use_pthreads();
// init
kEventBase = event_base_new();
if (kEventBase == nullptr) {
std::cout << "cannot create event_base_miner_listener_" << std::endl;
return -1;
}
sockaddr_in local_sin{};
bzero(&local_sin, sizeof(local_sin));
local_sin.sin_family = AF_INET;
local_sin.sin_port = htons(1800u);
local_sin.sin_addr.s_addr = htonl(INADDR_ANY);
kListener = evconnlistener_new_bind(kEventBase,
listenerCallback,
nullptr,
LEV_OPT_REUSEABLE | LEV_OPT_CLOSE_ON_FREE,
-1,
reinterpret_cast<sockaddr *>(&local_sin),
static_cast<int>(sizeof(local_sin)));
if (kListener == nullptr) {
std::cout << "cannot create normal listener" << std::endl;
return -1;
}
kRunning = true;
std::thread thread_send_message([]() {
while (kRunning) {
{
// case 1: If send to 20,000 tcp connection, and send "00" for each, it will cost 0.15s.
std::lock_guard<std::mutex> local_lock_guard{kSessionsMutex};
std::clock_t clock_start = std::clock();
for (auto &it : kSessions) { bufferevent_write(it, "00", 2); }
std::cout << "send message to all done, client count: " << kSessions.size()
<< ", elapsed: " << std::clock() - clock_start << std::endl;
}
{
// case 2: If send to 1 tcp connection, and send "00" 20,000 times, it will cost 0.015s.
// std::lock_guard<std::mutex> local_lock_guard{kSessionsMutex};
// for (auto &it : kSessions) {
// std::clock_t clock_start = std::clock();
// for (int i = 0; i < 20000; ++i) { bufferevent_write(it, "00", 2); }
// std::cout << "send message 20k times done, elapsed: " << std::clock() - clock_start
// << std::endl;
// }
}
std::this_thread::sleep_for(10s);
}
});
event_base_dispatch(kEventBase);
if (thread_send_message.joinable()) { thread_send_message.join(); }
{
std::lock_guard<std::mutex> local_lock_guard{kSessionsMutex};
for (auto &it : kSessions) { bufferevent_free(it); }
kSessions.clear();
}
if (kListener != nullptr) {
evconnlistener_free(kListener);
kListener = nullptr;
}
if (kEventBase != nullptr) {
event_base_free(kEventBase);
kEventBase = nullptr;
}
}
the minimal reproducible example
// case 1: 20,000 tcp connections, and send "00" for each every 10s, it will cost 0.15s.
std::clock_t clock_start = std::clock();
for (auto &it : kSessions) { bufferevent_write(it, "00", 2); }
std::cout << "send message to all done, client count: " << kSessions.size()
<< ", elapsed: " << std::clock() - clock_start << std::endl;
// case 2: only 1 tcp connection, and send "00" 20,000 times every 10s, it will cost 0.015s.
for (auto &it : kSessions) {
std::clock_t clock_start = std::clock();
for (int i = 0; i < 20000; ++i) { bufferevent_write(it, "00", 2); }
std::cout << "send message 20k times done, elapsed: " << std::clock() - clock_start
<< std::endl;
}
strace of case 1:
% time seconds usecs/call calls errors syscall
------ ----------- ----------- --------- --------- ----------------
56.32 29.519892 9 3135415 408444 futex
20.53 10.762191 7 1490532 epoll_ctl
15.25 7.992391 11 715355 writev
3.98 2.086553 45360 46 nanosleep
1.86 0.973074 11 85273 1 epoll_wait
0.62 0.324022 8 39267 19266 accept4
0.58 0.305246 6 48721 read
0.55 0.286858 6 48762 write
0.30 0.154980 4 40004 setsockopt
0.01 0.006486 5 1216 mprotect
0.01 0.002952 21 143 madvise
0.00 0.001018 7 152 brk
0.00 0.000527 6 94 clock_gettime
0.00 0.000023 3 8 openat
0.00 0.000021 21 1 mremap
0.00 0.000010 0 22 mmap
0.00 0.000007 1 9 close
0.00 0.000000 0 8 fstat
0.00 0.000000 0 3 munmap
0.00 0.000000 0 4 rt_sigaction
0.00 0.000000 0 1 rt_sigprocmask
0.00 0.000000 0 1 ioctl
0.00 0.000000 0 1 readv
0.00 0.000000 0 8 8 access
0.00 0.000000 0 1 socket
0.00 0.000000 0 1 bind
0.00 0.000000 0 1 listen
0.00 0.000000 0 1 clone
0.00 0.000000 0 1 execve
0.00 0.000000 0 4 getuid
0.00 0.000000 0 4 getgid
0.00 0.000000 0 4 geteuid
0.00 0.000000 0 4 getegid
0.00 0.000000 0 1 arch_prctl
0.00 0.000000 0 1 set_tid_address
0.00 0.000000 0 2 set_robust_list
0.00 0.000000 0 1 eventfd2
0.00 0.000000 0 1 epoll_create1
0.00 0.000000 0 1 pipe2
0.00 0.000000 0 1 prlimit64
------ ----------- ----------- --------- --------- ----------------
100.00 52.416251 5605075 427719 total
strace of case 2:
% time seconds usecs/call calls errors syscall
------ ----------- ----------- --------- --------- ----------------
normal listener stopped
66.66 0.151105 7 22506 3469 futex
9.74 0.022084 6 3709 1 epoll_wait
9.54 0.021624 4 5105 epoll_ctl
9.47 0.021466 8 2550 writev
2.47 0.005598 4 1263 write
1.70 0.003857 3 1246 read
0.18 0.000409 18 23 nanosleep
0.09 0.000197 4 46 clock_gettime
0.03 0.000068 4 16 mprotect
0.02 0.000035 2 21 mmap
0.01 0.000024 8 3 munmap
0.01 0.000019 10 2 1 accept4
0.01 0.000018 5 4 setsockopt
0.01 0.000015 8 2 set_robust_list
0.01 0.000014 4 4 rt_sigaction
0.01 0.000014 4 4 geteuid
0.01 0.000013 3 4 getgid
0.01 0.000012 3 4 getuid
0.01 0.000012 3 4 getegid
0.00 0.000011 1 8 fstat
0.00 0.000010 10 1 socket
0.00 0.000008 8 1 clone
0.00 0.000007 2 3 brk
0.00 0.000007 7 1 pipe2
0.00 0.000006 1 7 openat
0.00 0.000006 6 1 epoll_create1
0.00 0.000005 1 8 8 access
0.00 0.000005 5 1 bind
0.00 0.000005 5 1 eventfd2
0.00 0.000005 5 1 prlimit64
0.00 0.000004 1 7 close
0.00 0.000004 4 1 listen
0.00 0.000003 3 1 rt_sigprocmask
0.00 0.000003 3 1 arch_prctl
0.00 0.000003 3 1 set_tid_address
0.00 0.000000 0 1 execve
------ ----------- ----------- --------- --------- ----------------
100.00 0.226676 36561 3479 total

How to improve libevent bufferevent_write performance
Read the documentation of libevent, study its source code, and consider other event loop libraries like libev, Qt, Wt, libonion, POCO etc....
Be aware of several points. I assume a modern Linux/x86-64 system.
you could profile your open source event loop library (e.g. by compiling it from source code with a recent GCC and using -pg -O2 flags, then strace(1) and/or gprof(1) and/or perf(1) and/or time(1) (and also top(1), ps(1), proc(5), netstat(8), ip(8), ifconfig(8), tcpdump(8), xosview to observe your entire Linux system). Of course read time(7) and epoll(7) and poll(2)
TCP/IP is introducing some overhead, IP routing adds more overhead, and the typical Ethernet packet has at least hundreds of bytes (and dozens of bytes of overhead). You certainly want to send(2) or recv(2) several hundred bytes at once. Sending short "00" messages (of about four bytes of useful payload) is inefficient. Ensure that your application send messages of hundreds of bytes at once. You might consider some JSONRPC approach (and of course design your protocol at a higher level with fewer but bigger messages triggering each more complex behavior) or some MPI one. A way to send fewer but higher-level messages is to embed some interpreter like Guile or Lua and send higher level script chunks or requests (like NeWS did in the past, and PostgreSQL or exim do today)
for short and small communications prefer running a few processes or threads on the same computer and use mq_overview(7), pipe(7), fifo(7), unix(7), avoiding Ethernet.
Most computers are in 2020 multi-core, and with care, you could use Pthreads or std::thread (with one thread running on each core, so at least 2 or 4 different threads on a laptop, or a hundred threads on a powerful Linux server). You'll need some synchronization code (e.g. std::mutex with std::lock_guard or Pẗhread mutexes....)
be aware of the C10K problem, and take inspiration from existing open source server programs or libraries such as lighttpd, Wt, FLTK, REDIS, Vmime, libcurl, libonion (and study their source code, and observe their runtime behavior with gdb(1) and/or strace(1) or ltrace(1))
the network might be the bottleneck (then you won't be able to improve your code to gain performance; you'll need some change in your software architecture). Read more about cloud computing, distributed computing, XDR, ASN.1, SOAP, REST, Web services, libssh, π-calculus
Notice that:
static void handler(int sig) {
std::cout << "get signal: " << sig << std::endl;
stop();
}
if used with signal(7) is against the rules of signal-safety(7) so you might use the pipe(7) to self-process trick as suggested by Qt or consider using the Linux specific signalfd(2) system call.
Read also Advanced Linux Programming then syscalls(2) and socket(7) and tcp(7).

How to reduce latency when access new page in linux

The page size is 4KB in linux, but I access a new page, the latency is about twice large than inside a page. How can I reduce this latency?
Here is my test code. I use clock_gettime to measure the time cost in the main function.
#define MEM_SIZE 4096 * 12
long long GetRT() {
struct timespec tp;
clock_gettime(CLOCK_REALTIME, &tp);
return (long long) tp.tv_sec * 1000000000 + tp.tv_nsec;
}
void *InitSharedMemory() {
int fd = shm_open("/test-steve", O_CREAT | O_RDWR, ACCESSPERMS);
if (fd < 0) {
perror("shm_open");
}
if (ftruncate(fd, MEM_SIZE) < 0) {
perror("ftruncate");
}
void *mmap_ptr = mmap(NULL, MEM_SIZE, PROT_READ | PROT_WRITE,
MAP_SHARED | MAP_POPULATE, fd, 0);
if (mmap_ptr == (caddr_t)-1) {
perror("mmap");
}
return mmap_ptr;
}
int main()
{
constexpr auto STEP = 512;
char arr[STEP];
for (int i = 0; i < STEP; ++i) {
arr[i] = i;
}
int ret = 0;
void *buffer = InitSharedMemory();
for (int i = 0; i < 40; ++i) {
if ((i % 0x8) == 0) {
// *(char *) buffer = 'a';
}
auto t1 = GetRT();
*(char *) buffer = 'a';
auto t2 = GetRT();
ret += t2 - t1;
printf("cost: %lld ns\n", t2 - t1);
buffer = (char *) buffer + STEP;
}
return ret;
}
The result is some like follows, is the extra latency TLB missing? how to avoid it?
cost: 272 ns
cost: 73 ns
cost: 76 ns
cost: 74 ns
cost: 74 ns
cost: 75 ns
cost: 73 ns
cost: 76 ns
cost: 281 ns
cost: 74 ns
cost: 73 ns
cost: 76 ns
cost: 76 ns
cost: 74 ns
cost: 73 ns
cost: 76 ns
cost: 334 ns
cost: 76 ns
cost: 76 ns
cost: 76 ns
cost: 76 ns
cost: 73 ns
cost: 76 ns
cost: 76 ns
cost: 267 ns
cost: 74 ns
cost: 74 ns
cost: 75 ns
cost: 73 ns
cost: 76 ns
cost: 73 ns
cost: 76 ns
cost: 260 ns

I am afraid that you cannot avoid the cost of the first hit to a page (as illustrated by your example).
May be this cost could be amortised when using huge-pages (2MB) with MAP_HUGETLB in the flags of mmap().
Depending on the applicative context, touching the first byte (or any) of each page once for all, right after the allocation, will force the commit (may be useful on numa hardware).
This will have a large initial cost but could lead to more predictable timings for the subsequent accesses.

Swap used when there is enough free RAM. Performance impacted

I wrote a simple program to study the performance when using a lot of RAM on Linux (64bit Red Hat Enterprise Linux Server release 6.4). (Please ignore the memory leak.)
#include <sys/time.h>
#include <time.h>
#include <stdio.h>
#include <string.h>
#include <iostream>
#include <vector>
using namespace std;
double getWallTime()
{
struct timeval time;
if (gettimeofday(&time, NULL))
{
return 0;
}
return (double)time.tv_sec + (double)time.tv_usec * .000001;
}
int main()
{
int *a;
int n = 1000000000;
do
{
time_t mytime = time(NULL);
char * time_str = ctime(&mytime);
time_str[strlen(time_str)-1] = '\0';
printf("Current Time : %s\n", time_str);
double start = getWallTime();
a = new int[n];
for (int i = 0; i < n; i++)
{
a[i] = 1;
}
double elapsed = getWallTime()-start;
cout << elapsed << endl;
cout << "Allocated." << endl;
}
while (1);
return 0;
}
The output is
Current Time : Tue May 8 11:46:55 2018
3.73667
Allocated.
Current Time : Tue May 8 11:46:59 2018
64.5222
Allocated.
Current Time : Tue May 8 11:48:03 2018
110.419
The top output is as below. We can see swap increased though there was enough free RAM. The consequence was the runtime soared from 3 seconds to 64 seconds.
top - 11:46:55 up 21 days, 1:14, 18 users, load average: 1.24, 1.25, 0.95
Tasks: 819 total, 3 running, 816 sleeping, 0 stopped, 0 zombie
Cpu(s): 1.6%us, 1.4%sy, 0.0%ni, 97.1%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 132110088k total, 127500344k used, 4609744k free, 262288k buffers
Swap: 10485752k total, 4112k used, 10481640k free, 45988192k cached
top - 11:47:01 up 21 days, 1:14, 18 users, load average: 1.38, 1.27, 0.96
Tasks: 819 total, 2 running, 817 sleeping, 0 stopped, 0 zombie
Cpu(s): 0.5%us, 2.1%sy, 0.0%ni, 97.4%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 132110088k total, 131620156k used, 489932k free, 262288k buffers
Swap: 10485752k total, 4112k used, 10481640k free, 45844228k cached
top - 11:47:53 up 21 days, 1:15, 18 users, load average: 1.25, 1.26, 0.97
Tasks: 819 total, 2 running, 817 sleeping, 0 stopped, 0 zombie
Cpu(s): 0.1%us, 2.5%sy, 0.0%ni, 97.4%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 132110088k total, 131626300k used, 483788k free, 262276k buffers
Swap: 10485752k total, 5464k used, 10480288k free, 43056696k cached
top - 11:47:56 up 21 days, 1:15, 18 users, load average: 1.23, 1.26, 0.97
Tasks: 819 total, 2 running, 817 sleeping, 0 stopped, 0 zombie
Cpu(s): 0.1%us, 2.5%sy, 0.0%ni, 97.4%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 132110088k total, 131627568k used, 482520k free, 262276k buffers
Swap: 10485752k total, 5792k used, 10479960k free, 42949788k cached
top - 11:47:59 up 21 days, 1:15, 18 users, load average: 1.21, 1.25, 0.97
Tasks: 819 total, 2 running, 817 sleeping, 0 stopped, 0 zombie
Cpu(s): 0.1%us, 2.5%sy, 0.0%ni, 97.4%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 132110088k total, 131623080k used, 487008k free, 262276k buffers
Swap: 10485752k total, 6312k used, 10479440k free, 42840068k cached
top - 11:48:02 up 21 days, 1:15, 18 users, load average: 1.21, 1.25, 0.97
Tasks: 819 total, 2 running, 817 sleeping, 0 stopped, 0 zombie
Cpu(s): 0.1%us, 2.5%sy, 0.0%ni, 97.4%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 132110088k total, 131620016k used, 490072k free, 262276k buffers
Swap: 10485752k total, 6772k used, 10478980k free, 42729276k cached
I read this and this. My questions are
Why would Linux sacrifice the performance rather than totally using cached RAM? Memory fragmentation? But putting data on swap will certainly create fragmentation too.
Is there a workaround to get consistent 3 seconds until reaching the physical RAM size?
Thanks.
Update 1:
Add more output from top.
Update 2:
Per David's suggestions, looking at /proc//io shows my program doesn't I/O. So David's first answer should explain this observation. Now comes to my second question. How to improve the performance as a non-root user (can't modify swappiness, etc.).
Update 3: I switched to another machine since I needed to sudo some commands. This is a real machine (no virtual machine) with Intel(R) Xeon(R) CPU E5-2680 0 # 2.70GHz. The machine has 16 physical cores.
uname -a
2.6.32-642.4.2.el6.x86_64 #1 SMP Tue Aug 23 19:58:13 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
Running osgx's modified code with more iterations gives
Iteration 451
Time to malloc: 1.81198e-05
Time to fill with data: 0.109081
Fill rate with data: **916**.75 Mints/sec, 3667Mbytes/sec
Time to second write access of data: 0.049731
Access rate of data: 2010.82 Mints/sec, 8043.27Mbytes/sec
Time to third write access of data: 0.0478709
Access rate of data: 2088.95 Mints/sec, 8355.81Mbytes/sec
Allocated 400 Mbytes, with total memory allocated 180800Mbytes
Iteration 452
Time to malloc: 1.09673e-05
Time to fill with data: 5.16316
Fill rate with data: **19**.368 Mints/sec, 77.4719Mbytes/sec
Time to second write access of data: 0.0495219
Access rate of data: 2019.31 Mints/sec, 8077.23Mbytes/sec
Time to third write access of data: 0.0439548
Access rate of data: 2275.06 Mints/sec, 9100.25Mbytes/sec
Allocated 400 Mbytes, with total memory allocated 181200Mbytes
I did see kernel switched from 2MB page to 4KB page when slowdown occurred.
vmstat 1 60
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu-----
r b swpd free buff cache si so bi bo in cs us sy id wa st
2 0 1217396 11506356 5911040 47499184 0 2 35 47 0 0 14 2 84 0 0
2 0 1217396 11305860 5911040 47499184 4 0 4 36 5163 3460 7 6 87 0 0
2 0 1217396 11112744 5911040 47499188 0 0 0 0 4326 3451 7 6 87 0 0
2 0 1217396 10980556 5911040 47499188 0 0 0 0 4801 3385 7 6 87 0 0
2 0 1217396 10845940 5911040 47499192 0 0 0 20 4650 3596 7 6 87 0 0
2 0 1217396 10712508 5911040 47499200 0 0 0 0 5743 3562 7 6 87 0 0
2 0 1217396 10583380 5911040 47499200 0 0 0 40 4531 3622 7 6 87 0 0
2 0 1217396 10449096 5911040 47499200 0 0 0 0 4516 3629 7 6 87 0 0
2 0 1217396 10187856 5911040 47499200 0 0 0 0 4499 3456 7 6 87 0 0
2 0 1217396 10053256 5911040 47499204 0 0 0 8 5334 3507 7 6 87 0 0
2 0 1217396 9921624 5911040 47499204 0 0 0 0 6310 3593 6 6 87 0 0
2 0 1217396 9788532 5911040 47499208 0 0 0 44 5794 3516 7 6 87 0 0
2 0 1217396 9660516 5911040 47499208 0 0 0 0 4894 3535 7 6 87 0 0
2 0 1217396 9527552 5911040 47499212 0 0 0 0 4686 3570 7 6 87 0 0
2 0 1217396 9396536 5911040 47499212 0 0 0 0 4805 3538 7 6 87 0 0
2 0 1217396 9238664 5911040 47499212 0 0 0 0 5940 3459 7 6 87 0 0
2 0 1217396 9000136 5911040 47499216 0 0 0 32 5239 3333 7 6 87 0 0
2 0 1217396 8861132 5911040 47499220 0 0 0 0 5579 3351 7 6 87 0 0
2 0 1217396 8733688 5911040 47499220 0 0 0 0 4910 3199 7 6 87 0 0
2 0 1217396 8596600 5911040 47499224 0 0 0 44 5075 3453 7 6 87 0 0
2 0 1217396 8338468 5911040 47499232 0 0 0 0 5328 3444 7 6 87 0 0
2 0 1217396 8207732 5911040 47499232 0 0 0 52 5474 3370 7 6 87 0 0
2 0 1217396 8071212 5911040 47499236 0 0 0 0 5442 3419 7 6 87 0 0
2 0 1217396 7807736 5911040 47499236 0 0 0 0 6139 3456 7 6 87 0 0
2 0 1217396 7676080 5911044 47499232 0 0 0 16 4533 3430 6 6 87 0 0
2 0 1217396 7545728 5911044 47499236 0 0 0 0 6712 3957 7 6 87 0 0
4 0 1217396 7412444 5911044 47499240 0 0 0 68 6110 3547 7 6 87 0 0
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu-----
r b swpd free buff cache si so bi bo in cs us sy id wa st
2 0 1217396 7280148 5911048 47499244 0 0 0 68 6140 3516 7 7 86 0 0
2 0 1217396 7147836 5911048 47499244 0 0 0 0 4434 3400 7 6 87 0 0
2 0 1217396 6886980 5911048 47499248 0 0 0 16 7354 3393 7 6 87 0 0
2 0 1217396 6752868 5911048 47499248 0 0 0 0 5286 3573 7 6 87 0 0
2 0 1217396 6621772 5911048 47499248 0 0 0 0 5353 3410 7 6 87 0 0
2 0 1217396 6489760 5911048 47499252 0 0 0 48 5172 3454 7 6 87 0 0
2 0 1217396 6248732 5911048 47499256 0 0 0 0 5266 3411 7 6 87 0 0
2 0 1217396 6092804 5911048 47499260 0 0 0 4 6345 3473 7 6 87 0 0
2 0 1217396 5962544 5911048 47499260 0 0 0 0 7399 3712 7 6 87 0 0
2 0 1217396 5828492 5911048 47499264 0 0 0 0 5804 3516 7 6 87 0 0
2 0 1217396 5566720 5911048 47499264 0 0 0 44 5800 3370 7 6 87 0 0
2 0 1217396 5434204 5911048 47499264 0 0 0 0 6716 3446 7 6 87 0 0
2 0 1217396 5240724 5911048 47499268 0 0 0 68 3948 3346 7 6 87 0 0
2 0 1217396 5051688 5911008 47484936 0 0 0 0 4743 3734 7 6 87 0 0
2 0 1217396 4925680 5910500 47478444 0 0 136 0 5978 3779 7 6 87 0 0
2 0 1217396 4801744 5908552 47471820 0 0 0 32 4573 3237 7 6 87 0 0
2 0 1217396 4675772 5908552 47463984 0 0 0 0 6594 3276 7 6 87 0 0
2 0 1217396 4486472 5908444 47455736 0 0 0 4 6096 3256 7 6 87 0 0
2 0 1217396 4299908 5908392 47446964 0 0 0 0 5569 3525 7 6 87 0 0
2 0 1217396 4175444 5906884 47440024 0 0 0 0 4975 3141 7 6 87 0 0
2 0 1217396 4063472 5905976 47423860 0 0 0 56 6255 3147 6 6 87 0 0
2 0 1217396 3939816 5905796 47415596 0 0 0 0 5396 3143 7 6 87 0 0
2 0 1217396 3686540 5905796 47407152 0 0 0 44 6471 3201 7 6 87 0 0
2 0 1217396 3557596 5905796 47398892 0 0 0 0 7581 3727 7 6 87 0 0
2 0 1217396 3445536 5905796 47381812 0 0 0 0 5560 3222 7 6 87 0 0
2 0 1217396 3250272 5905796 47373364 0 0 0 60 5594 3343 7 6 87 0 0
2 0 1217396 3065232 5903744 47367156 0 0 0 0 5595 3182 7 6 87 0 0
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu-----
r b swpd free buff cache si so bi bo in cs us sy id wa st
3 0 1217396 2951704 5903028 47350792 0 0 0 12 5210 3262 7 6 87 0 0
2 0 1217396 2829228 5902928 47342444 0 0 0 0 5724 3758 7 6 87 0 0
2 0 1217396 2575248 5902580 47334472 0 0 0 0 4377 3369 7 6 87 0 0
2 0 1217396 2527996 5897796 47322436 0 0 0 60 5550 3570 7 6 87 0 0
2 0 1217396 2398672 5893572 47322324 0 0 0 0 5603 3225 7 6 87 0 0
2 0 1217396 2272536 5889364 47322228 0 0 0 16 6924 3310 7 6 87 0 0
iostat -xyz 1 60
Linux 2.6.32-642.4.2.el6.x86_64 05/09/2018 _x86_64_ (16 CPU)
avg-cpu: %user %nice %system %iowait %steal %idle
6.64 0.00 6.26 0.00 0.00 87.10
Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await r_await w_await svctm %util
avg-cpu: %user %nice %system %iowait %steal %idle
7.00 0.06 5.69 0.00 0.00 87.24
I managed to do "sudo perf top", and saw this in the top line when slowdown occurred.
16.84% [kernel] [k] compaction_alloc
From top. There were several other processes running (not shown).
Tasks: 799 total, 5 running, 787 sleeping, 4 stopped, 3 zombie
Cpu(s): 23.1%us, 16.7%sy, 0.0%ni, 60.0%id, 0.0%wa, 0.0%hi, 0.1%si, 0.0%st
Mem: 264503640k total, 256749480k used, 7754160k free, 5830508k buffers
Swap: 409259004k total, 1217112k used, 408041892k free, 50458600k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
23559 toddwz 20 0 165g 164g 1204 R 93.0 65.4 2:05.51 a.out
Update 4
After turning off THP, I see the following. Fill rate is consistent around 550 Mints/sec (900 with THP on) until my program uses 240GB RAM (cached RAM < 1GB). And then swap kicks in, so fill rate drops.
Iteration 610
Time to malloc: 1.3113e-05
Time to fill with data: 0.181151
Fill rate with data: 552.025 Mints/sec, 2208.1Mbytes/sec
Time to second write access of data: 0.04074
Access rate of data: 2454.59 Mints/sec, 9818.36Mbytes/sec
Time to third write access of data: 0.0420492
Access rate of data: 2378.17 Mints/sec, 9512.67Mbytes/sec
Allocated 400 Mbytes, with total memory allocated 244400Mbytes
Iteration 611
Time to malloc: 1.88351e-05
Time to fill with data: 0.306215
Fill rate with data: 326.568 Mints/sec, 1306.27Mbytes/sec
Time to second write access of data: 0.045784
Access rate of data: 2184.17 Mints/sec, 8736.68Mbytes/sec
Time to third write access of data: 0.0441492
Access rate of data: 2265.05 Mints/sec, 9060.19Mbytes/sec
Allocated 400 Mbytes, with total memory allocated 244800Mbytes
Iteration 612
Time to malloc: 2.21729e-05
Time to fill with data: 1.33305
Fill rate with data: 75.016 Mints/sec, 300.064Mbytes/sec
Time to second write access of data: 0.048573
Access rate of data: 2058.76 Mints/sec, 8235.02Mbytes/sec
Time to third write access of data: 0.0495481
Access rate of data: 2018.24 Mints/sec, 8072.96Mbytes/sec
Allocated 400 Mbytes, with total memory allocated 245200Mbytes
Conclusion
The behavior of my program is more transparent to me with transparent huge page (THP) turned off so I'll continue with THP off. For my particular program, the cause is THP not swap. Thanks to all who helped.

First iterations of the test probably uses huge pages (2 MB pages) due to THP: Transparent Hugepage - https://www.kernel.org/doc/Documentation/vm/transhuge.txt -
check your /sys/kernel/mm/transparent_hugepage/enabled and grep AnonHugePages /proc/meminfo during the execution of test.
The reason applications are running faster is because of two
factors. The first factor is almost completely irrelevant and it's not
of significant interest because it'll also have the downside of
requiring larger clear-page copy-page in page faults which is a
potentially negative effect. The first factor consists in taking a
single page fault for each 2M virtual region touched by userland (so
reducing the enter/exit kernel frequency by a 512 times factor). This
only matters the first time the memory is accessed for the lifetime of
a memory mapping.
Allocation of huge amounts of memory with new or malloc is served by single syscall mmap, which usually don't "populate" the virtual memory with physical pages, check man mmap around MADV_POPULATE:
MAP_POPULATE (since Linux 2.5.46)
Populate (prefault) page tables for a mapping. ... This will help
to reduce blocking on page faults later.
This memory is just registered by mmap (without MAP_POPULATE) as virtual and write access is prohibited in page table. When your test tries to do first write to any memory page, page fault exception is generated and handled by OS kernel. Linux kernel will allocate some physical memory and map virtual page to physical (populate the page). With THP enabled (it is often enabled) kernel may allocate single huge page of 2MB, if it has some free huge physical pages. If there is no free huge pages, kernel will allocate 4KB page. So, without hugepages you will have 512 times more page faults (it can be checked by running vmstat 1 180 in another console while test is running, or by perf stat -I 1000).
Next accesses to populated pages will not have page faults, so you can extend your test with second (third) for i in (0..N-1): a[i] = 1; loop and measure time of both loops.
Your results still sounds strange. Is your system real or virtualized? Hypervisors may support 2 MB pages, and virtual systems may have much more cost for memory allocation and exception handling.
On my PC with less memory I have something like 10% slowdown when page faults switches from huge page allocation down to 4KB page allocation (check page-faults strings from perf stat - there were only around 2 thousands page faults per seconds with 2MB pages and >200 thousands page faults with 4KB pages):
$ cat /sys/kernel/mm/transparent_hugepage/enabled
[always] madvise never
$ perf stat -I1000 ./a.out
Iteration 0
Time to malloc: 8.10623e-06
Time to fill with data: 0.364378
Fill rate with data: 274.44 Mints/sec, 1097.76Mbytes/sec
Allocated 400 Mbytes, with total memory allocated 400Mbytes
Iteration 1
Time to malloc: 1.90735e-05
Time to fill with data: 0.357983
Fill rate with data: 279.343 Mints/sec, 1117.37Mbytes/sec
Allocated 400 Mbytes, with total memory allocated 800Mbytes
Iteration 2
Time to malloc: 1.69277e-05
# time counts unit events
1.000414902 999.893040 task-clock (msec)
1.000414902 1 context-switches # 0.001 K/sec
1.000414902 0 cpu-migrations # 0.000 K/sec
1.000414902 2,024 page-faults # 0.002 M/sec
1.000414902 2,664,963,857 cycles # 2.665 GHz
1.000414902 3,072,781,834 instructions # 1.15 insn per cycle
1.000414902 559,551,437 branches # 559.611 M/sec
1.000414902 25,176 branch-misses # 0.00% of all branches
Time to fill with data: 0.357014
Fill rate with data: 280.101 Mints/sec, 1120.4Mbytes/sec
Allocated 400 Mbytes, with total memory allocated 1200Mbytes
Iteration 3
Time to malloc: 1.71661e-05
Time to fill with data: 0.358964
Fill rate with data: 278.579 Mints/sec, 1114.32Mbytes/sec
Allocated 400 Mbytes, with total memory allocated 1600Mbytes
Iteration 4
Time to malloc: 1.69277e-05
Time to fill with data: 0.356918
Fill rate with data: 280.177 Mints/sec, 1120.71Mbytes/sec
Allocated 400 Mbytes, with total memory allocated 2000Mbytes
Iteration 5
Time to malloc: 1.50204e-05
2.000779126 1000.703872 task-clock (msec)
2.000779126 1 context-switches # 0.001 K/sec
2.000779126 0 cpu-migrations # 0.000 K/sec
2.000779126 2,280 page-faults # 0.002 M/sec
2.000779126 2,686,072,244 cycles # 2.685 GHz
2.000779126 3,094,777,285 instructions # 1.16 insn per cycle
2.000779126 563,593,105 branches # 563.425 M/sec
2.000779126 9,661 branch-misses # 0.00% of all branches
Time to fill with data: 0.371785
Fill rate with data: 268.973 Mints/sec, 1075.89Mbytes/sec
Allocated 400 Mbytes, with total memory allocated 2400Mbytes
Iteration 6
Time to malloc: 1.90735e-05
Time to fill with data: 0.418562
Fill rate with data: 238.913 Mints/sec, 955.653Mbytes/sec
Allocated 400 Mbytes, with total memory allocated 2800Mbytes
Iteration 7
Time to malloc: 2.09808e-05
3.001146481 1000.436128 task-clock (msec)
3.001146481 1 context-switches # 0.001 K/sec
3.001146481 0 cpu-migrations # 0.000 K/sec
3.001146481 217,415 page-faults # 0.217 M/sec
3.001146481 2,687,783,783 cycles # 2.687 GHz
3.001146481 3,100,713,038 instructions # 1.16 insn per cycle
3.001146481 560,207,049 branches # 560.014 M/sec
3.001146481 83,230 branch-misses # 0.01% of all branches
Time to fill with data: 0.416297
Fill rate with data: 240.213 Mints/sec, 960.853Mbytes/sec
Allocated 400 Mbytes, with total memory allocated 3200Mbytes
Iteration 8
Time to malloc: 1.38283e-05
Time to fill with data: 0.41672
Fill rate with data: 239.969 Mints/sec, 959.877Mbytes/sec
Allocated 400 Mbytes, with total memory allocated 3600Mbytes
Iteration 9
Time to malloc: 1.40667e-05
Time to fill with data: 0.424997
Fill rate with data: 235.296 Mints/sec, 941.183Mbytes/sec
Allocated 400 Mbytes, with total memory allocated 4000Mbytes
Iteration 10
Time to malloc: 1.28746e-05
4.001467773 1000.378604 task-clock (msec)
4.001467773 2 context-switches # 0.002 K/sec
4.001467773 0 cpu-migrations # 0.000 K/sec
4.001467773 232,690 page-faults # 0.233 M/sec
4.001467773 2,655,313,682 cycles # 2.654 GHz
4.001467773 3,087,157,016 instructions # 1.15 insn per cycle
4.001467773 557,266,313 branches # 557.070 M/sec
4.001467773 95,433 branch-misses # 0.02% of all branches
Time to fill with data: 0.413271
Fill rate with data: 241.972 Mints/sec, 967.888Mbytes/sec
Allocated 400 Mbytes, with total memory allocated 4400Mbytes
Iteration 11
Time to malloc: 1.21593e-05
Time to fill with data: 0.414624
Fill rate with data: 241.182 Mints/sec, 964.73Mbytes/sec
Allocated 400 Mbytes, with total memory allocated 4800Mbytes
Iteration 12
Time to malloc: 1.5974e-05
5.001792272 1000.372602 task-clock (msec)
5.001792272 2 context-switches # 0.002 K/sec
5.001792272 0 cpu-migrations # 0.000 K/sec
5.001792272 236,260 page-faults # 0.236 M/sec
5.001792272 2,687,340,230 cycles # 2.686 GHz
5.001792272 3,134,864,968 instructions # 1.17 insn per cycle
5.001792272 565,846,287 branches # 565.644 M/sec
5.001792272 104,634 branch-misses # 0.02% of all branches
Time to fill with data: 0.412331
Fill rate with data: 242.524 Mints/sec, 970.094Mbytes/sec
Allocated 400 Mbytes, with total memory allocated 5200Mbytes
Iteration 13
Time to malloc: 1.3113e-05
Time to fill with data: 0.414433
Fill rate with data: 241.294 Mints/sec, 965.174Mbytes/sec
Allocated 400 Mbytes, with total memory allocated 5600Mbytes
Iteration 14
Time to malloc: 1.88351e-05
Time to fill with data: 0.417277
Fill rate with data: 239.649 Mints/sec, 958.596Mbytes/sec
Allocated 400 Mbytes, with total memory allocated 6000Mbytes
6.002129544 1000.404270 task-clock (msec)
6.002129544 1 context-switches # 0.001 K/sec
6.002129544 0 cpu-migrations # 0.000 K/sec
6.002129544 215,269 page-faults # 0.215 M/sec
6.002129544 2,676,269,667 cycles # 2.675 GHz
6.002129544 3,286,469,282 instructions # 1.23 insn per cycle
6.002129544 578,367,266 branches # 578.156 M/sec
6.002129544 345,470 branch-misses # 0.06% of all branches
....
After disabling THP with root command from https://access.redhat.com/solutions/46111 I always have ~200 thousands page faults per second and around 950 MB/s:
$ cat /sys/kernel/mm/transparent_hugepage/enabled
always [madvise] never
$ perf stat -I1000 ./a.out
Iteration 0
Time to malloc: 1.50204e-05
Time to fill with data: 0.422322
Fill rate with data: 236.786 Mints/sec, 947.145Mbytes/sec
Allocated 400 Mbytes, with total memory allocated 400Mbytes
Iteration 1
Time to malloc: 1.50204e-05
Time to fill with data: 0.415068
Fill rate with data: 240.924 Mints/sec, 963.698Mbytes/sec
Allocated 400 Mbytes, with total memory allocated 800Mbytes
Iteration 2
Time to malloc: 2.19345e-05
# time counts unit events
1.000162191 999.429856 task-clock (msec)
1.000162191 14 context-switches # 0.014 K/sec
1.000162191 0 cpu-migrations # 0.000 K/sec
1.000162191 232,727 page-faults # 0.233 M/sec
1.000162191 2,664,896,604 cycles # 2.666 GHz
1.000162191 3,080,713,267 instructions # 1.16 insn per cycle
1.000162191 555,116,838 branches # 555.434 M/sec
1.000162191 102,262 branch-misses # 0.02% of all branches
Time to fill with data: 0.440695
Fill rate with data: 226.914 Mints/sec, 907.658Mbytes/sec
Allocated 400 Mbytes, with total memory allocated 1200Mbytes
Iteration 3
Time to malloc: 2.09808e-05
Time to fill with data: 0.414463
Fill rate with data: 241.276 Mints/sec, 965.104Mbytes/sec
Allocated 400 Mbytes, with total memory allocated 1600Mbytes
Iteration 4
Time to malloc: 1.81198e-05
2.000544564 1000.142465 task-clock (msec)
2.000544564 16 context-switches # 0.016 K/sec
2.000544564 0 cpu-migrations # 0.000 K/sec
2.000544564 229,697 page-faults # 0.230 M/sec
2.000544564 2,621,180,984 cycles # 2.622 GHz
2.000544564 3,041,358,811 instructions # 1.15 insn per cycle
2.000544564 547,910,242 branches # 548.027 M/sec
2.000544564 93,682 branch-misses # 0.02% of all branches
Time to fill with data: 0.428383
Fill rate with data: 233.436 Mints/sec, 933.744Mbytes/sec
Allocated 400 Mbytes, with total memory allocated 2000Mbytes
Iteration 5
Time to malloc: 1.5974e-05
Time to fill with data: 0.421986
Fill rate with data: 236.975 Mints/sec, 947.899Mbytes/sec
Allocated 400 Mbytes, with total memory allocated 2400Mbytes
Iteration 6
Time to malloc: 1.5974e-05
Time to fill with data: 0.413477
Fill rate with data: 241.851 Mints/sec, 967.406Mbytes/sec
Allocated 400 Mbytes, with total memory allocated 2800Mbytes
Iteration 7
Time to malloc: 1.88351e-05
3.000866438 999.980461 task-clock (msec)
3.000866438 20 context-switches # 0.020 K/sec
3.000866438 0 cpu-migrations # 0.000 K/sec
3.000866438 231,194 page-faults # 0.231 M/sec
3.000866438 2,622,484,960 cycles # 2.623 GHz
3.000866438 3,061,610,229 instructions # 1.16 insn per cycle
3.000866438 551,533,361 branches # 551.616 M/sec
3.000866438 104,561 branch-misses # 0.02% of all branches
Time to fill with data: 0.448333
Fill rate with data: 223.048 Mints/sec, 892.194Mbytes/sec
Allocated 400 Mbytes, with total memory allocated 3200Mbytes
Iteration 8
Time to malloc: 1.50204e-05
Time to fill with data: 0.410566
Fill rate with data: 243.566 Mints/sec, 974.265Mbytes/sec
Allocated 400 Mbytes, with total memory allocated 3600Mbytes
Iteration 9
Time to malloc: 1.3113e-05
4.001231042 1000.098860 task-clock (msec)
4.001231042 17 context-switches # 0.017 K/sec
4.001231042 0 cpu-migrations # 0.000 K/sec
4.001231042 228,532 page-faults # 0.229 M/sec
4.001231042 2,586,146,024 cycles # 2.586 GHz
4.001231042 3,026,679,955 instructions # 1.15 insn per cycle
4.001231042 545,236,541 branches # 545.284 M/sec
4.001231042 115,251 branch-misses # 0.02% of all branches
Time to fill with data: 0.441442
Fill rate with data: 226.53 Mints/sec, 906.121Mbytes/sec
Allocated 400 Mbytes, with total memory allocated 4000Mbytes
Iteration 10
Time to malloc: 1.5974e-05
Time to fill with data: 0.42898
Fill rate with data: 233.111 Mints/sec, 932.445Mbytes/sec
Allocated 400 Mbytes, with total memory allocated 4400Mbytes
Iteration 11
Time to malloc: 2.00272e-05
5.001547227 999.982415 task-clock (msec)
5.001547227 19 context-switches # 0.019 K/sec
5.001547227 0 cpu-migrations # 0.000 K/sec
5.001547227 225,796 page-faults # 0.226 M/sec
5.001547227 2,560,990,918 cycles # 2.561 GHz
5.001547227 3,005,384,743 instructions # 1.15 insn per cycle
5.001547227 542,275,580 branches # 542.315 M/sec
5.001547227 116,537 branch-misses # 0.02% of all branches
Time to fill with data: 0.414212
Fill rate with data: 241.422 Mints/sec, 965.689Mbytes/sec
Allocated 400 Mbytes, with total memory allocated 4800Mbytes
Iteration 12
Time to malloc: 1.69277e-05
Time to fill with data: 0.411084
Fill rate with data: 243.259 Mints/sec, 973.037Mbytes/sec
Allocated 400 Mbytes, with total memory allocated 5200Mbytes
Iteration 13
Time to malloc: 1.40667e-05
Time to fill with data: 0.413644
Fill rate with data: 241.754 Mints/sec, 967.015Mbytes/sec
Allocated 400 Mbytes, with total memory allocated 5600Mbytes
Iteration 14
Time to malloc: 1.28746e-05
6.001849796 999.913923 task-clock (msec)
6.001849796 18 context-switches # 0.018 K/sec
6.001849796 0 cpu-migrations # 0.000 K/sec
6.001849796 236,912 page-faults # 0.237 M/sec
6.001849796 2,685,445,660 cycles # 2.686 GHz
6.001849796 3,153,464,551 instructions # 1.20 insn per cycle
6.001849796 568,989,467 branches # 569.032 M/sec
6.001849796 125,943 branch-misses # 0.02% of all branches
Time to fill with data: 0.444891
Fill rate with data: 224.774 Mints/sec, 899.097Mbytes/sec
Allocated 400 Mbytes, with total memory allocated 6000Mbytes
Test modified for perf stat with rate printing and limited iteration count:
$ cat test.c; g++ test.c
#include <sys/time.h>
#include <time.h>
#include <stdio.h>
#include <string.h>
#include <iostream>
#include <vector>
using namespace std;
double getWallTime()
{
struct timeval time;
if (gettimeofday(&time, NULL))
{
return 0;
}
return (double)time.tv_sec + (double)time.tv_usec * .000001;
}
#define M 1000000
int main()
{
int *a;
int n = 100000000;
int j;
double total = 0;
for(j=0; j<15; j++)
{
cout << "Iteration " << j << endl;
double start = getWallTime();
a = new int[n];
cout << "Time to malloc: " << getWallTime() - start << endl;
for (int i = 0; i < n; i++)
{
a[i] = 1;
}
double elapsed = getWallTime()-start;
cout << "Time to fill with data: " << elapsed << endl;
cout << "Fill rate with data: " << n/elapsed/M << " Mints/sec, " << n*sizeof(int)/elapsed/M << "Mbytes/sec" << endl;
total += n*sizeof(int)*1./M;
cout << "Allocated " << n*sizeof(int)*1./M << " Mbytes, with total memory allocated " << total << "Mbytes" << endl;
}
return 0;
}
Test modified for second and third write access
$ g++ second.c -o second
$ cat second.c
#include <sys/time.h>
#include <time.h>
#include <stdio.h>
#include <string.h>
#include <iostream>
#include <vector>
using namespace std;
double getWallTime()
{
struct timeval time;
if (gettimeofday(&time, NULL))
{
return 0;
}
return (double)time.tv_sec + (double)time.tv_usec * .000001;
}
#define M 1000000
int main()
{
int *a;
int n = 100000000;
int j;
double total = 0;
for(j=0; j<15; j++)
{
cout << "Iteration " << j << endl;
double start = getWallTime();
a = new int[n];
cout << "Time to malloc: " << getWallTime() - start << endl;
for (int i = 0; i < n; i++)
{
a[i] = 1;
}
double elapsed = getWallTime()-start;
cout << "Time to fill with data: " << elapsed << endl;
cout << "Fill rate with data: " << n/elapsed/M << " Mints/sec, " << n*sizeof(int)/elapsed/M << "Mbytes/sec" << endl;
start = getWallTime();
for (int i = 0; i < n; i++)
{
a[i] = 2;
}
elapsed = getWallTime()-start;
cout << "Time to second write access of data: " << elapsed << endl;
cout << "Access rate of data: " << n/elapsed/M << " Mints/sec, " << n*sizeof(int)/elapsed/M << "Mbytes/sec" << endl;
start = getWallTime();
for (int i = 0; i < n; i++)
{
a[i] = 3;
}
elapsed = getWallTime()-start;
cout << "Time to third write access of data: " << elapsed << endl;
cout << "Access rate of data: " << n/elapsed/M << " Mints/sec, " << n*sizeof(int)/elapsed/M << "Mbytes/sec" << endl;
total += n*sizeof(int)*1./M;
cout << "Allocated " << n*sizeof(int)*1./M << " Mbytes, with total memory allocated " << total << "Mbytes" << endl;
}
return 0;
}
Without THP - around 1.25 GB/s for second and third access:
$ cat /sys/kernel/mm/transparent_hugepage/enabled
always [madvise] never
$ ./second
Iteration 0
Time to malloc: 9.05991e-06
Time to fill with data: 0.426387
Fill rate with data: 234.529 Mints/sec, 938.115Mbytes/sec
Time to second write access of data: 0.318292
Access rate of data: 314.177 Mints/sec, 1256.71Mbytes/sec
Time to third write access of data: 0.321722
Access rate of data: 310.827 Mints/sec, 1243.31Mbytes/sec
Allocated 400 Mbytes, with total memory allocated 400Mbytes
Iteration 1
Time to malloc: 3.50475e-05
Time to fill with data: 0.411859
Fill rate with data: 242.802 Mints/sec, 971.206Mbytes/sec
Time to second write access of data: 0.317989
Access rate of data: 314.476 Mints/sec, 1257.91Mbytes/sec
Time to third write access of data: 0.321637
Access rate of data: 310.91 Mints/sec, 1243.64Mbytes/sec
Allocated 400 Mbytes, with total memory allocated 800Mbytes
Iteration 2
Time to malloc: 2.81334e-05
Time to fill with data: 0.411918
Fill rate with data: 242.767 Mints/sec, 971.067Mbytes/sec
Time to second write access of data: 0.318647
Access rate of data: 313.827 Mints/sec, 1255.31Mbytes/sec
Time to third write access of data: 0.321041
Access rate of data: 311.487 Mints/sec, 1245.95Mbytes/sec
Allocated 400 Mbytes, with total memory allocated 1200Mbytes
Iteration 3
Time to malloc: 2.5034e-05
Time to fill with data: 0.411138
Fill rate with data: 243.227 Mints/sec, 972.909Mbytes/sec
Time to second write access of data: 0.318429
Access rate of data: 314.042 Mints/sec, 1256.17Mbytes/sec
Time to third write access of data: 0.321332
Access rate of data: 311.205 Mints/sec, 1244.82Mbytes/sec
Allocated 400 Mbytes, with total memory allocated 1600Mbytes
Iteration 4
Time to malloc: 3.71933e-05
Time to fill with data: 0.410922
Fill rate with data: 243.355 Mints/sec, 973.421Mbytes/sec
Time to second write access of data: 0.320262
Access rate of data: 312.244 Mints/sec, 1248.98Mbytes/sec
Time to third write access of data: 0.319223
Access rate of data: 313.261 Mints/sec, 1253.04Mbytes/sec
Allocated 400 Mbytes, with total memory allocated 2000Mbytes
Iteration 5
Time to malloc: 2.19345e-05
Time to fill with data: 0.418508
Fill rate with data: 238.944 Mints/sec, 955.777Mbytes/sec
Time to second write access of data: 0.320419
Access rate of data: 312.092 Mints/sec, 1248.37Mbytes/sec
Time to third write access of data: 0.319752
Access rate of data: 312.742 Mints/sec, 1250.97Mbytes/sec
Allocated 400 Mbytes, with total memory allocated 2400Mbytes
Iteration 6
Time to malloc: 3.19481e-05
Time to fill with data: 0.410054
Fill rate with data: 243.87 Mints/sec, 975.481Mbytes/sec
Time to second write access of data: 0.320244
Access rate of data: 312.262 Mints/sec, 1249.05Mbytes/sec
Time to third write access of data: 0.319546
Access rate of data: 312.944 Mints/sec, 1251.78Mbytes/sec
Allocated 400 Mbytes, with total memory allocated 2800Mbytes
Iteration 7
Time to malloc: 3.19481e-05
Time to fill with data: 0.409491
Fill rate with data: 244.206 Mints/sec, 976.822Mbytes/sec
Time to second write access of data: 0.318501
Access rate of data: 313.971 Mints/sec, 1255.88Mbytes/sec
Time to third write access of data: 0.320052
Access rate of data: 312.449 Mints/sec, 1249.8Mbytes/sec
Allocated 400 Mbytes, with total memory allocated 3200Mbytes
Iteration 8
Time to malloc: 2.5034e-05
Time to fill with data: 0.409922
Fill rate with data: 243.949 Mints/sec, 975.795Mbytes/sec
Time to second write access of data: 0.320583
Access rate of data: 311.932 Mints/sec, 1247.73Mbytes/sec
Time to third write access of data: 0.319478
Access rate of data: 313.011 Mints/sec, 1252.04Mbytes/sec
Allocated 400 Mbytes, with total memory allocated 3600Mbytes
Iteration 9
Time to malloc: 2.69413e-05
Time to fill with data: 0.41104
Fill rate with data: 243.285 Mints/sec, 973.141Mbytes/sec
Time to second write access of data: 0.320389
Access rate of data: 312.121 Mints/sec, 1248.48Mbytes/sec
Time to third write access of data: 0.319762
Access rate of data: 312.733 Mints/sec, 1250.93Mbytes/sec
Allocated 400 Mbytes, with total memory allocated 4000Mbytes
Iteration 10
Time to malloc: 2.59876e-05
Time to fill with data: 0.412612
Fill rate with data: 242.358 Mints/sec, 969.434Mbytes/sec
Time to second write access of data: 0.318304
Access rate of data: 314.165 Mints/sec, 1256.66Mbytes/sec
Time to third write access of data: 0.319453
Access rate of data: 313.035 Mints/sec, 1252.14Mbytes/sec
Allocated 400 Mbytes, with total memory allocated 4400Mbytes
Iteration 11
Time to malloc: 2.98023e-05
Time to fill with data: 0.412428
Fill rate with data: 242.467 Mints/sec, 969.866Mbytes/sec
Time to second write access of data: 0.318467
Access rate of data: 314.004 Mints/sec, 1256.02Mbytes/sec
Time to third write access of data: 0.319716
Access rate of data: 312.778 Mints/sec, 1251.11Mbytes/sec
Allocated 400 Mbytes, with total memory allocated 4800Mbytes
Iteration 12
Time to malloc: 2.69413e-05
Time to fill with data: 0.410515
Fill rate with data: 243.597 Mints/sec, 974.386Mbytes/sec
Time to second write access of data: 0.31832
Access rate of data: 314.149 Mints/sec, 1256.6Mbytes/sec
Time to third write access of data: 0.319569
Access rate of data: 312.921 Mints/sec, 1251.69Mbytes/sec
Allocated 400 Mbytes, with total memory allocated 5200Mbytes
Iteration 13
Time to malloc: 2.28882e-05
Time to fill with data: 0.412385
Fill rate with data: 242.492 Mints/sec, 969.967Mbytes/sec
Time to second write access of data: 0.318929
Access rate of data: 313.549 Mints/sec, 1254.2Mbytes/sec
Time to third write access of data: 0.31949
Access rate of data: 312.999 Mints/sec, 1252Mbytes/sec
Allocated 400 Mbytes, with total memory allocated 5600Mbytes
Iteration 14
Time to malloc: 2.90871e-05
Time to fill with data: 0.41235
Fill rate with data: 242.512 Mints/sec, 970.05Mbytes/sec
Time to second write access of data: 0.340456
Access rate of data: 293.724 Mints/sec, 1174.89Mbytes/sec
Time to third write access of data: 0.319716
Access rate of data: 312.778 Mints/sec, 1251.11Mbytes/sec
Allocated 400 Mbytes, with total memory allocated 6000Mbytes
With THP - bit faster allocation but same speed of second and third access:
$ cat /sys/kernel/mm/transparent_hugepage/enabled
[always] madvise never
$ ./second
Iteration 0
Time to malloc: 1.50204e-05
Time to fill with data: 0.365043
Fill rate with data: 273.94 Mints/sec, 1095.76Mbytes/sec
Time to second write access of data: 0.320503
Access rate of data: 312.01 Mints/sec, 1248.04Mbytes/sec
Time to third write access of data: 0.319442
Access rate of data: 313.046 Mints/sec, 1252.18Mbytes/sec
Allocated 400 Mbytes, with total memory allocated 400Mbytes
...
Iteration 14
Time to malloc: 2.7895e-05
Time to fill with data: 0.409294
Fill rate with data: 244.323 Mints/sec, 977.293Mbytes/sec
Time to second write access of data: 0.318422
Access rate of data: 314.049 Mints/sec, 1256.19Mbytes/sec
Time to third write access of data: 0.322098
Access rate of data: 310.465 Mints/sec, 1241.86Mbytes/sec
Allocated 400 Mbytes, with total memory allocated 6000Mbytes

From updates and the chat:
I did see kernel switched from 2MB page to 4KB page when slowdown occurred.
I managed to do "sudo perf top", and saw this in the top line when slowdown occurred.
16.84% [kernel] [k] compaction_alloc
perf top -g
- 31.27% 31.03% [kernel] [k] compaction_alloc \u2592
- compaction_alloc \u2592
- migrate_pages \u2592
compact_zone \u2592
compact_zone_order \u2592
try_to_compact_pages \u2592
__alloc_pages_direct_compact \u2592
__alloc_pages_nodemask \u2592
alloc_pages_vma \u2592
do_huge_pmd_anonymous_page \u2592
handle_mm_fault \u2592
__do_page_fault \u2592
do_page_fault \u2592
page_fault
Slowdown is connected with enabled THP and slow page faults of 4KB. After 4 KB switch page faults are very slow from some linux kernel internal compaction mechanisms (is kernel still trying to get some more huge pages?) - http://lwn.net/Articles/368869 and http://lwn.net/Articles/591998. More problems from THP on NUMA, both from THP and NUMA code.
The original problem is
we launch several solvers simultaneously based on memory set by user. In this case, use may want to use all 230G free RAM.
we do dynamic memory allocation/deallocation. when we reach the memory limit, in this case, could be say 150 GB (not 230 GB), we see dramatic slowdown.
I observe high system cpu usage, and swap usage. So I make up this little program, which seems to show my original problem
I can suggest globally disable THP (https://unix.stackexchange.com/questions/99154/disable-transparent-hugepage or http://www.olivierdoucet.info/blog/2012/05/19/debugging-a-mysql-stall/), or free most of "cached" (by echo 3 > /proc/sys/vm/drop_caches from root) - this is temporary (and not fast) workaround. With freed cached memory there will be less need for compaction (but it will make programs of other users slower - they will need to re-read their data from disks/nfs).
Huge swap on slow (rotating) disk can kill all performance from the moment it will be used (and swap on ssd is fast enough, and swap on NVMe is very fast).
You may also want to change huge allocations in your software from default new/delete to manual calling of anonymous mmap for allocation and munmap for deallocation to control flags (there are mmap and madvise flags for huge page and there is populate - http://man7.org/linux/man-pages/man2/mmap.2.html http://man7.org/linux/man-pages/man2/madvise.2.html).
With MAP_POPULATE you will have (very?) slow allocation, but all memory allocated will be really used from the moment of allocation (all accesses will be fast).

Periodic Latency Spikes in Writing to Shared Memory on Linux

I have the following code:
#pragma pack(4)
struct RECORD_HEADER {
uint64_t msgType;
uint64_t rdtsc;
};
struct BODY {
char content[488];
};
#pragma pack()
class SerializedRDTSC {
public:
typedef unsigned long long timeunit_t;
static timeunit_t start(void) {
unsigned cycles_high, cycles_low;
__asm__ __volatile__ ( "CPUID\n\t"
"RDTSC\n\t"
"mov %%edx, %0\n\t"
"mov %%eax, %1\n\t": "=r" (cycles_high), "=r" (cycles_low)::
"%rax", "%rbx", "%rcx", "%rdx");
return ( (unsigned long long)cycles_low)|( ((unsigned long long)cycles_high)<<32 );
}
static timeunit_t end(void) {
unsigned cycles_high, cycles_low;
__asm__ __volatile__( "RDTSCP\n\t"
"mov %%edx, %0\n\t"
"mov %%eax, %1\n\t"
"CPUID\n\t": "=r" (cycles_high), "=r" (cycles_low):: "%rax",
"%rbx", "%rcx", "%rdx");
return ( (unsigned long long)cycles_low)|( ((unsigned long long)cycles_high)<<32 );
}
};
char* createSHM() noexcept {
const auto sharedMemHandle = shm_open("testing", O_RDWR | O_CREAT, 0666);
if (-1 == sharedMemHandle) {
std::cout << "failed to open named shared memory: " << std::endl;
return nullptr;
}
constexpr int32_t size = (1 << 26);
ftruncate(sharedMemHandle, size);
char* ptr = (char*) mmap(nullptr, size, PROT_READ | PROT_WRITE,
MAP_SHARED, sharedMemHandle, 0);
if (MAP_FAILED == ptr) {
std::cout << errno << std::endl;
return nullptr;
}
const auto rc = fchmod(sharedMemHandle, 0666);
if (rc == -1) {
fprintf(stderr,
"Can't change permissions to 0666 on shared mem segment: %m\n");
fflush(stderr);
}
return ptr;
}
int main() {
BODY update;
srand(time(nullptr));
char* ptr = createSHM();
constexpr uint64_t n = 700;
constexpr uint64_t n2 = 10;
uint64_t m_data[n * n2];
memset(m_data, 0, sizeof(m_data));
uint64_t r = 0;
for (uint64_t i = 0; i < n; i++) {
for (uint64_t k = 0; k < n2; k++) {
// populate the header
const auto msgType = rand();
const auto rdtsc = rand();
// populate the struct randomly
uint32_t* tmp = reinterpret_cast<uint32_t*>(&update);
for (uint32_t j = 0; j < sizeof(BODY) / sizeof(uint32_t); j++) {
const uint32_t v = rand() % 32767;
tmp[j] = v;
}
// write the struct
const auto s = SerializedRDTSC::start();
memcpy(ptr, (char*)&msgType, sizeof(uint64_t));
ptr+= sizeof(uint64_t);
memcpy(ptr, (char*)&rdtsc, sizeof(uint64_t));
ptr+= sizeof(uint64_t);
memcpy(ptr, &update, sizeof(BODY));
ptr+= sizeof(BODY);
const auto e = SerializedRDTSC::end();
m_data[r++] = e - s;
}
usleep(249998);
}
for (uint32_t i = 0; i < r; i++) {
std::cout << i << "," << m_data[i] << std::endl;
}
}
And for some reason, there are periodic latency spike according to the output:
0 9408
1 210
2 162
3 176
4 172
5 164
6 172
7 8338
8 174
9 186
10 612
11 380
12 380
13 374
14 358
15 13610
16 190
17 186
18 164
19 168
20 246
21 196
22 170
23 5066
24 176
25 176
26 168
27 174
28 166
29 440
30 232
31 214
32 5128
33 180
34 178
35 172
36 174
37 184
38 170
39 162
40 5964
41 182
42 174
43 164
44 180
45 180
46 162
47 172
I already isolated the core and double-checked with htop to make sure no other processes were using the core.
My machine has an i7 CPU (nothing fancy).
And then I tried with an Xeon CPU. The pattern is about the same -- every 7-11 write, there was a spike.
With i7 CPU, I compiled with GCC 7.2 with c++17 and ran it on CentOS 7.3.
With Xeon CPU, I compiled with GCC 4.6 with c++0x and ran it on CentOS 6.5.
My questions are:
1. Why there were periodic latency spikes? (I checked with strace. And I don't see weird system call involved)
2. Any suggestion on how to investigate/understand the spike? More for my learning.
Thanks in advance!
P.S. Yes, some people object to use rdtsc to measure latency because temperature affects TSC. Tho, I don't see any better option as I don't have PTP, and clock_gettime() sometimes will have latency spikes too. If you have any suggestion, it is more than welcome :)

A memory page is 4K bytes. Every time you start writing on a new page, that page needs mapped into the process address space. Since the data you're writing every loop is 8 + 8 + 488 = 504 bytes, you'll get a spike every 8 or 9 time thru the loop.
Since the CPU can speculatively prefetch data from memory, the page fault for the 2nd page (which should occur on the 8th loop) occurs one loop earlier than expected, when the hardware prefetcher tries to access the page.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js