Does multithreading emphasize memory fragmentation? - c++

Description
When allocating and deallocating randomly sized memory chunks with 4 or more threads using OpenMP's parallel for construct, the program seems to start leaking considerable amounts of memory in the second half of the test program's runtime. Its consumed memory thus grows from 1050 MB to 1500 MB or more without the extra memory actually being used.
As valgrind shows no issues, I must assume that what appears to be a memory leak is actually an emphasized effect of memory fragmentation.
Interestingly, the effect does not show if 2 threads make 10000 allocations each, but it shows strongly if 4 threads make 5000 allocations each. Also, if the maximum size of the allocated chunks is reduced to 256 kB (from 1 MB), the effect gets weaker.
Can heavy concurrency emphasize fragmentation that much? Or is this more likely to be a bug in the heap?
Test Program Description
The demo program is built to obtain a total of 256 MB of randomly sized memory chunks from the heap, doing 5000 allocations. If the memory limit is hit, the chunks allocated first are deallocated until memory consumption falls below the limit. Once 5000 allocations have been performed, all memory is released and the loop ends. All of this work is done for each thread spawned by OpenMP.
This memory allocation scheme lets us expect a memory consumption of ~260 MB per thread (including some bookkeeping data).
Demo Program
As this is really something you might want to test, you can download the sample program with a simple makefile from dropbox.
When running the program as is, you should have at least 1400 MB of RAM available. Feel free to adjust the constants in the code to suit your needs.
For completeness, the actual code follows:
#include <stdlib.h>
#include <stdio.h>
#include <iostream>
#include <vector>
#include <deque>
#include <omp.h>
#include <math.h>
#include <stdint.h> // for uint64_t

void runParallelAllocTest()
{
    // constants
    const int NUM_ALLOCATIONS = 5000;  // alloc's per thread
    const int NUM_THREADS = 4;         // how many threads?
    const int NUM_ITERS = NUM_THREADS; // how many overall repetitions
    const bool USE_NEW = true;         // use new or malloc? Seems to make no difference (as it should)
    const bool DEBUG_ALLOCS = false;   // debug output

    // pre-store allocation sizes
    const int NUM_PRE_ALLOCS = 20000;
    const uint64_t MEM_LIMIT = (1024 * 1024) * 256; // x MB per thread
    const size_t MAX_CHUNK_SIZE = 1024 * 1024 * 1;

    srand(1);
    std::vector<size_t> allocations;
    allocations.resize(NUM_PRE_ALLOCS);
    for (int i = 0; i < NUM_PRE_ALLOCS; i++) {
        allocations[i] = rand() % MAX_CHUNK_SIZE; // use up to x MB chunks
    }

    #pragma omp parallel num_threads(NUM_THREADS)
    #pragma omp for
    for (int i = 0; i < NUM_ITERS; ++i) {
        uint64_t totalAllocBytes = 0;
        uint64_t currAllocBytes = 0;
        std::deque< std::pair<char*, uint64_t> > pointers;
        const int myId = omp_get_thread_num();

        for (int j = 0; j < NUM_ALLOCATIONS; ++j) {
            // new allocation
            const size_t allocSize = allocations[(myId * 100 + j) % NUM_PRE_ALLOCS];
            char* pnt = NULL;
            if (USE_NEW) {
                pnt = new char[allocSize];
            } else {
                pnt = (char*) malloc(allocSize);
            }
            pointers.push_back(std::make_pair(pnt, allocSize));
            totalAllocBytes += allocSize;
            currAllocBytes += allocSize;

            // fill with values to add "delay"
            for (int fill = 0; fill < (int) allocSize; ++fill) {
                pnt[fill] = (char)(j % 255);
            }

            if (DEBUG_ALLOCS) {
                std::cout << "Id " << myId << " New alloc " << pointers.size() << ", bytes:" << allocSize << " at " << (uint64_t) pnt << "\n";
            }

            // free all or just a bit
            if (((j % 5) == 0) || (j == (NUM_ALLOCATIONS - 1))) {
                int frees = 0;

                // keep this much allocated
                // last check, free all
                uint64_t memLimit = MEM_LIMIT;
                if (j == NUM_ALLOCATIONS - 1) {
                    std::cout << "Id " << myId << " about to release all memory: " << (currAllocBytes / (double)(1024 * 1024)) << " MB" << std::endl;
                    memLimit = 0;
                }
                //memLimit = 0; // DEBUG

                while (pointers.size() > 0 && (currAllocBytes > memLimit)) {
                    // free one of the first entries to allow previously obtained resources to 'live' longer
                    currAllocBytes -= pointers.front().second;
                    char* pnt = pointers.front().first;

                    // free memory
                    if (USE_NEW) {
                        delete[] pnt;
                    } else {
                        free(pnt);
                    }

                    // update array
                    pointers.pop_front();

                    if (DEBUG_ALLOCS) {
                        std::cout << "Id " << myId << " Free'd " << pointers.size() << " at " << (uint64_t) pnt << "\n";
                    }
                    frees++;
                }

                if (DEBUG_ALLOCS) {
                    std::cout << "Frees " << frees << ", " << currAllocBytes << "/" << MEM_LIMIT << ", " << totalAllocBytes << "\n";
                }
            }
        } // for each allocation

        if (currAllocBytes != 0) {
            std::cerr << "Not all free'd!\n";
        }
        std::cout << "Id " << myId << " done, total alloc'ed " << ((double) totalAllocBytes / (double)(1024 * 1024)) << "MB \n";
    } // for each iteration

    exit(1);
}

int main(int argc, char** argv)
{
    runParallelAllocTest();
    return 0;
}
The Test-System
From what I see so far, the hardware matters a lot. The test might need adjustments if run on a faster machine.
Intel(R) Core(TM)2 Duo CPU T7300 @ 2.00GHz
Ubuntu 10.04 LTS 64 bit
gcc 4.3, 4.4, 4.6
3988.62 Bogomips
Testing
Once you have built the program with the makefile, you should get a file named ompmemtest. To query the memory usage over time, I used the following commands:
./ompmemtest &
top -b | grep ompmemtest
This yields quite impressive fragmentation (or leaking) behaviour. The expected memory consumption with 4 threads is 1090 MB, which became 1500 MB over time:
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
11626 byron 20 0 204m 99m 1000 R 27 2.5 0:00.81 ompmemtest
11626 byron 20 0 992m 832m 1004 R 195 21.0 0:06.69 ompmemtest
11626 byron 20 0 1118m 1.0g 1004 R 189 26.1 0:12.40 ompmemtest
11626 byron 20 0 1218m 1.0g 1004 R 190 27.1 0:18.13 ompmemtest
11626 byron 20 0 1282m 1.1g 1004 R 195 29.6 0:24.06 ompmemtest
11626 byron 20 0 1471m 1.3g 1004 R 195 33.5 0:29.96 ompmemtest
11626 byron 20 0 1469m 1.3g 1004 R 194 33.5 0:35.85 ompmemtest
11626 byron 20 0 1469m 1.3g 1004 R 195 33.6 0:41.75 ompmemtest
11626 byron 20 0 1636m 1.5g 1004 R 194 37.8 0:47.62 ompmemtest
11626 byron 20 0 1660m 1.5g 1004 R 195 38.0 0:53.54 ompmemtest
11626 byron 20 0 1669m 1.5g 1004 R 195 38.2 0:59.45 ompmemtest
11626 byron 20 0 1664m 1.5g 1004 R 194 38.1 1:05.32 ompmemtest
11626 byron 20 0 1724m 1.5g 1004 R 195 40.0 1:11.21 ompmemtest
11626 byron 20 0 1724m 1.6g 1140 S 193 40.1 1:17.07 ompmemtest
Please note: I could reproduce this issue when compiling with gcc 4.3, 4.4 and 4.6 (trunk).

Ok, picked up the bait.
This is on a system with
Intel(R) Core(TM)2 Quad CPU Q9550 @ 2.83GHz
4x5666.59 bogomips
Linux meerkat 2.6.35-28-generic-pae #50-Ubuntu SMP Fri Mar 18 20:43:15 UTC 2011 i686 GNU/Linux
gcc version 4.4.5
total used free shared buffers cached
Mem: 8127172 4220560 3906612 0 374328 2748796
-/+ buffers/cache: 1097436 7029736
Swap: 0 0 0
Naive run
I just ran it
time ./ompmemtest
Id 0 about to release all memory: 258.144 MB
Id 0 done, total alloc'ed -1572.7MB
Id 3 about to release all memory: 257.854 MB
Id 3 done, total alloc'ed -1569.6MB
Id 1 about to release all memory: 257.339 MB
Id 2 about to release all memory: 257.043 MB
Id 1 done, total alloc'ed -1570.42MB
Id 2 done, total alloc'ed -1569.96MB
real 0m13.429s
user 0m44.619s
sys 0m6.000s
Nothing spectacular. Here is the simultaneous output of vmstat -S M 1
Vmstat raw data
procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa
0 0 0 3892 364 2669 0 0 24 0 701 1487 2 1 97 0
4 0 0 3421 364 2669 0 0 0 0 1317 1953 53 7 40 0
4 0 0 2858 364 2669 0 0 0 0 2715 5030 79 16 5 0
4 0 0 2861 364 2669 0 0 0 0 6164 12637 76 15 9 0
4 0 0 2853 364 2669 0 0 0 0 4845 8617 77 13 10 0
4 0 0 2848 364 2669 0 0 0 0 3782 7084 79 13 8 0
5 0 0 2842 364 2669 0 0 0 0 3723 6120 81 12 7 0
4 0 0 2835 364 2669 0 0 0 0 3477 4943 84 9 7 0
4 0 0 2834 364 2669 0 0 0 0 3273 4950 81 10 9 0
5 0 0 2828 364 2669 0 0 0 0 3226 4812 84 11 6 0
4 0 0 2823 364 2669 0 0 0 0 3250 4889 83 10 7 0
4 0 0 2826 364 2669 0 0 0 0 3023 4353 85 10 6 0
4 0 0 2817 364 2669 0 0 0 0 3176 4284 83 10 7 0
4 0 0 2823 364 2669 0 0 0 0 3008 4063 84 10 6 0
0 0 0 3893 364 2669 0 0 0 0 4023 4228 64 10 26 0
Does that information mean anything to you?
Google Thread Caching Malloc
Now for real fun, add a little spice
time LD_PRELOAD="/usr/lib/libtcmalloc.so" ./ompmemtest
Id 1 about to release all memory: 257.339 MB
Id 1 done, total alloc'ed -1570.42MB
Id 3 about to release all memory: 257.854 MB
Id 3 done, total alloc'ed -1569.6MB
Id 2 about to release all memory: 257.043 MB
Id 2 done, total alloc'ed -1569.96MB
Id 0 about to release all memory: 258.144 MB
Id 0 done, total alloc'ed -1572.7MB
real 0m11.663s
user 0m44.255s
sys 0m1.028s
Looks faster, doesn't it?
procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa
4 0 0 3562 364 2684 0 0 0 0 1041 1676 28 7 64 0
4 2 0 2806 364 2684 0 0 0 172 1641 1843 84 14 1 0
4 0 0 2758 364 2685 0 0 0 0 1520 1009 98 2 1 0
4 0 0 2747 364 2685 0 0 0 0 1504 859 98 2 0 0
5 0 0 2745 364 2685 0 0 0 0 1575 1073 98 2 0 0
5 0 0 2739 364 2685 0 0 0 0 1415 743 99 1 0 0
4 0 0 2738 364 2685 0 0 0 0 1526 981 99 2 0 0
4 0 0 2731 364 2685 0 0 0 684 1536 927 98 2 0 0
4 0 0 2730 364 2685 0 0 0 0 1584 1010 99 1 0 0
5 0 0 2730 364 2685 0 0 0 0 1461 917 99 2 0 0
4 0 0 2729 364 2685 0 0 0 0 1561 1036 99 1 0 0
4 0 0 2729 364 2685 0 0 0 0 1406 756 100 1 0 0
0 0 0 3819 364 2685 0 0 0 4 1159 1476 26 3 71 0
In case you wanted to compare vmstat outputs
Valgrind --tool massif
This is the head of output from ms_print after valgrind --tool=massif ./ompmemtest (default malloc):
--------------------------------------------------------------------------------
Command: ./ompmemtest
Massif arguments: (none)
ms_print arguments: massif.out.beforetcmalloc
--------------------------------------------------------------------------------
GB
1.009^ :
| ##::::##:::::::##::::::##::::##::#::::#::::#:::::::::#::::::#:::
| # :: :# :::: ::# : ::::# :: :# ::#::::#: ::#:::::: ::#::::::#:::
| # :: :# :::: ::# : ::::# :: :# ::#::::#: ::#:::::: ::#::::::#:::
| :# :: :# :::: ::# : ::::# :: :# ::#::::#: ::#:::::: ::#::::::#:::
| :# :: :# :::: ::# : ::::# :: :# ::#::::#: ::#:::::: ::#::::::#:::
| :# :: :# :::: ::# : ::::# :: :# ::#::::#: ::#:::::: ::#::::::#::::
| ::# :: :# :::: ::# : ::::# :: :# ::#::::#: ::#:::::: ::#::::::#::::
| ::# :: :# :::: ::# : ::::# :: :# ::#::::#: ::#:::::: ::#::::::#::::
| ::# :: :# :::: ::# : ::::# :: :# ::#::::#: ::#:::::: ::#::::::#::::
| ::# :: :# :::: ::# : ::::# :: :# ::#::::#: ::#:::::: ::#::::::#::::
| ::# :: :# :::: ::# : ::::# :: :# ::#::::#: ::#:::::: ::#::::::#::::
| ::::# :: :# :::: ::# : ::::# :: :# ::#::::#: ::#:::::: ::#::::::#::::
| : ::# :: :# :::: ::# : ::::# :: :# ::#::::#: ::#:::::: ::#::::::#::::
| : ::# :: :# :::: ::# : ::::# :: :# ::#::::#: ::#:::::: ::#::::::#::::
| :: ::# :: :# :::: ::# : ::::# :: :# ::#::::#: ::#:::::: ::#::::::#::::
| :: ::# :: :# :::: ::# : ::::# :: :# ::#::::#: ::#:::::: ::#::::::#::::
| ::: ::# :: :# :::: ::# : ::::# :: :# ::#::::#: ::#:::::: ::#::::::#::::
| ::: ::# :: :# :::: ::# : ::::# :: :# ::#::::#: ::#:::::: ::#::::::#::::
| ::: ::# :: :# :::: ::# : ::::# :: :# ::#::::#: ::#:::::: ::#::::::#::::
0 +----------------------------------------------------------------------->Gi
0 264.0
Number of snapshots: 63
Detailed snapshots: [6 (peak), 10, 17, 23, 27, 30, 35, 39, 48, 56]
Google HEAPPROFILE
Unfortunately, vanilla valgrind doesn't work with tcmalloc, so I switched horses midrace to heap profiling with google-perftools:
gcc openMpMemtest_Linux.cpp -fopenmp -lgomp -lstdc++ -ltcmalloc -o ompmemtest
time HEAPPROFILE=/tmp/heapprofile ./ompmemtest
Starting tracking the heap
Dumping heap profile to /tmp/heapprofile.0001.heap (100 MB currently in use)
Dumping heap profile to /tmp/heapprofile.0002.heap (200 MB currently in use)
Dumping heap profile to /tmp/heapprofile.0003.heap (300 MB currently in use)
Dumping heap profile to /tmp/heapprofile.0004.heap (400 MB currently in use)
Dumping heap profile to /tmp/heapprofile.0005.heap (501 MB currently in use)
Dumping heap profile to /tmp/heapprofile.0006.heap (601 MB currently in use)
Dumping heap profile to /tmp/heapprofile.0007.heap (701 MB currently in use)
Dumping heap profile to /tmp/heapprofile.0008.heap (801 MB currently in use)
Dumping heap profile to /tmp/heapprofile.0009.heap (902 MB currently in use)
Dumping heap profile to /tmp/heapprofile.0010.heap (1002 MB currently in use)
Dumping heap profile to /tmp/heapprofile.0011.heap (2029 MB allocated cumulatively, 1031 MB currently in use)
Dumping heap profile to /tmp/heapprofile.0012.heap (3053 MB allocated cumulatively, 1030 MB currently in use)
Dumping heap profile to /tmp/heapprofile.0013.heap (4078 MB allocated cumulatively, 1031 MB currently in use)
Dumping heap profile to /tmp/heapprofile.0014.heap (5102 MB allocated cumulatively, 1031 MB currently in use)
Dumping heap profile to /tmp/heapprofile.0015.heap (6126 MB allocated cumulatively, 1033 MB currently in use)
Dumping heap profile to /tmp/heapprofile.0016.heap (7151 MB allocated cumulatively, 1029 MB currently in use)
Dumping heap profile to /tmp/heapprofile.0017.heap (8175 MB allocated cumulatively, 1029 MB currently in use)
Dumping heap profile to /tmp/heapprofile.0018.heap (9199 MB allocated cumulatively, 1028 MB currently in use)
Id 0 about to release all memory: 258.144 MB
Id 0 done, total alloc'ed -1572.7MB
Id 2 about to release all memory: 257.043 MB
Id 2 done, total alloc'ed -1569.96MB
Id 3 about to release all memory: 257.854 MB
Id 3 done, total alloc'ed -1569.6MB
Id 1 about to release all memory: 257.339 MB
Id 1 done, total alloc'ed -1570.42MB
Dumping heap profile to /tmp/heapprofile.0019.heap (Exiting)
real 0m11.981s
user 0m44.455s
sys 0m1.124s
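(To dig into the dumps, google-perftools ships the pprof viewer; something along the lines of pprof --text ./ompmemtest /tmp/heapprofile.0019.heap should list the top allocation sites, though check pprof --help for the exact invocation on your version.)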
Contact me for full logs/details
Update
To the comments: I updated the program
--- omptest/openMpMemtest_Linux.cpp 2011-05-03 23:18:44.000000000 +0200
+++ q/openMpMemtest_Linux.cpp 2011-05-04 13:42:47.371726000 +0200
@@ -13,8 +13,8 @@
void runParallelAllocTest()
{
// constants
- const int NUM_ALLOCATIONS = 5000; // alloc's per thread
- const int NUM_THREADS = 4; // how many threads?
+ const int NUM_ALLOCATIONS = 55000; // alloc's per thread
+ const int NUM_THREADS = 8; // how many threads?
const int NUM_ITERS = NUM_THREADS;// how many overall repetions
const bool USE_NEW = true; // use new or malloc? , seems to make no difference (as it should)
It ran for over 5m3s. Close to the end, a screenshot of htop shows that, indeed, the resident set is slightly higher, going towards 2.3g:
1 [||||||||||||||||||||||||||||||||||||||||||||||||||96.7%] Tasks: 125 total, 2 running
2 [||||||||||||||||||||||||||||||||||||||||||||||||||96.7%] Load average: 8.09 5.24 2.37
3 [||||||||||||||||||||||||||||||||||||||||||||||||||97.4%] Uptime: 01:54:22
4 [||||||||||||||||||||||||||||||||||||||||||||||||||96.1%]
Mem[||||||||||||||||||||||||||||||| 3055/7936MB]
Swp[ 0/0MB]
PID USER NLWP PRI NI VIRT RES SHR S CPU% MEM% TIME+ Command
4330 sehe 8 20 0 2635M 2286M 908 R 368. 28.8 15:35.01 ./ompmemtest
Comparing results with a tcmalloc run: 4m12s, and similar top stats with minor differences; the big difference is in the VIRT set (but that isn't particularly useful unless you have a very limited address space per process?). The RES set is quite similar, if you ask me. The more important thing to note is that parallelism is increased; all cores are now maxed out. This is obviously due to the reduced need to lock for heap operations when using tcmalloc:
If the free list is empty: (1) We fetch a bunch of objects from a central free list for this size-class (the central free list is shared by all threads). (2) Place them in the thread-local free list. (3) Return one of the newly fetched objects to the applications.
1 [|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||100.0%] Tasks: 172 total, 2 running
2 [|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||100.0%] Load average: 7.39 2.92 1.11
3 [|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||100.0%] Uptime: 11:12:25
4 [|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||100.0%]
Mem[|||||||||||||||||||||||||||||||||||||||||||| 3278/7936MB]
Swp[ 0/0MB]
PID USER NLWP PRI NI VIRT RES SHR S CPU% MEM% TIME+ Command
14391 sehe 8 20 0 2251M 2179M 1148 R 379. 27.5 8:08.92 ./ompmemtest
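To make the quoted refill scheme concrete, here's a toy sketch of the idea (my own illustration with one size class and raw pointers; emphatically not tcmalloc's real code):
#include <mutex>
#include <vector>

static std::vector<void*> central_list; // central free list, shared by all threads
static std::mutex central_lock;         // taken only when a thread cache needs a refill

void* alloc_object() {
    thread_local std::vector<void*> local_list; // per-thread free list, accessed without locking
    if (local_list.empty()) {
        std::lock_guard<std::mutex> g(central_lock);
        // (1) fetch a bunch of objects from the central free list...
        for (int i = 0; i < 32 && !central_list.empty(); ++i) {
            local_list.push_back(central_list.back()); // (2) ...and place them in the thread-local list
            central_list.pop_back();
        }
    }
    if (local_list.empty())
        return nullptr; // central list exhausted; a real allocator would fetch a new span here
    void* p = local_list.back(); // (3) return one of the cached objects
    local_list.pop_back();
    return p;
}
Most allocations hit only the thread-local list, so the central lock is contended far less often than a single global heap lock would be.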

When linking the test program with Google's tcmalloc library, the executable not only runs ~10% faster, but also shows greatly reduced or insignificant memory fragmentation:
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
13441 byron 20 0 379m 334m 1220 R 187 8.4 0:02.63 ompmemtestgoogle
13441 byron 20 0 1085m 1.0g 1220 R 194 26.2 0:08.52 ompmemtestgoogle
13441 byron 20 0 1111m 1.0g 1220 R 195 26.9 0:14.42 ompmemtestgoogle
13441 byron 20 0 1131m 1.1g 1220 R 195 27.4 0:20.30 ompmemtestgoogle
13441 byron 20 0 1137m 1.1g 1220 R 195 27.6 0:26.19 ompmemtestgoogle
13441 byron 20 0 1137m 1.1g 1220 R 195 27.6 0:32.05 ompmemtestgoogle
13441 byron 20 0 1149m 1.1g 1220 R 191 27.9 0:37.81 ompmemtestgoogle
13441 byron 20 0 1149m 1.1g 1220 R 194 27.9 0:43.66 ompmemtestgoogle
13441 byron 20 0 1161m 1.1g 1220 R 188 28.2 0:49.32 ompmemtestgoogle
13441 byron 20 0 1161m 1.1g 1220 R 194 28.2 0:55.15 ompmemtestgoogle
13441 byron 20 0 1161m 1.1g 1220 R 191 28.2 1:00.90 ompmemtestgoogle
13441 byron 20 0 1161m 1.1g 1220 R 191 28.2 1:06.64 ompmemtestgoogle
13441 byron 20 0 1161m 1.1g 1356 R 192 28.2 1:12.42 ompmemtestgoogle
From the data I have, the answer appears to be:
Multithreaded access to the heap can emphasize fragmentation if the heap library in use does not deal well with concurrent access and if the processor fails to execute the threads truly concurrently.
The tcmalloc library shows no significant memory fragmentation when running the same program that previously caused ~400 MB to be lost to fragmentation.
But why does that happen?
The best idea I have to offer here is some sort of locking artifact within the heap.
The test program allocates randomly sized blocks of memory, freeing up blocks allocated early in the program to stay within its memory limit. When one thread is in the process of releasing old memory that lives in a heap block on the 'left', it might be halted while another thread is scheduled to run, leaving a (soft) lock on that heap block. The newly scheduled thread wants to allocate memory, but may not even read that 'left' heap block to check for free memory, as it is currently being changed. Hence it might end up unnecessarily using a new heap block from the 'right'.
This process could look like heap-block shifting, where the first blocks (on the left) remain only sparsely used and fragmented, forcing new blocks to be used on the right.
Let me restate that this fragmentation issue only occurs for me if I use 4 or more threads on a dual-core system, which can only handle two threads more or less concurrently. When only two threads are used, the (soft) locks on the heap will be held briefly enough not to block the other thread that wants to allocate memory.
Also, as a disclaimer: I didn't check the actual code of the glibc heap implementation, nor am I anything more than a novice in the field of memory allocators. All I wrote is just how it appears to me, which makes it pure speculation.
Another interesting read might be the tcmalloc documentation, which describes common problems with heaps and multi-threaded access, some of which may have played a role in the test program too.
It's worth noting that tcmalloc will never return memory to the system (see the Caveats paragraph in the tcmalloc documentation).
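For anyone who wants to poke at the arena hypothesis directly: glibc can print per-arena usage via malloc_stats(). A minimal probe (glibc-specific; my addition, not part of the original test program) would be to call it once all threads are done:
#include <malloc.h> // glibc-specific: declares malloc_stats()

// ...at the very end of runParallelAllocTest(), after the parallel loop:
malloc_stats(); // prints "system bytes" / "in use bytes" per arena to stderr
A large gap between the two numbers, spread over several arenas, would match the fragmentation picture sketched above.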

Yes, the default malloc (depending on the Linux version) does some crazy stuff which fails massively in some multi-threaded applications. Specifically, it keeps almost-per-thread heaps (arenas) to avoid locking. This is much faster than a single heap for all threads, but (sometimes) massively memory-inefficient. You can tune this with code like the following, which turns off the multiple arenas (this kills performance, so don't do it if you have lots of small allocations!):
#include <malloc.h>

int rv = mallopt(-7, 1); // M_ARENA_TEST
rv = mallopt(-8, 1);     // M_ARENA_MAX
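If your glibc is new enough (2.10 or later, if I remember correctly), the same experiment can be run without recompiling, via glibc's environment tunable:
MALLOC_ARENA_MAX=1 ./ompmemtest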
Or, as others suggested, use one of the various replacements for malloc.
Basically, it's impossible for a general-purpose malloc to always be efficient, as it doesn't know how it's going to be used.
ChrisP.

Related

Address Sanitizer invokes OOM-killer

I am trying to use Address Sanitizer, but the kernel keeps killing my process due to excessive memory usage. Without Address Sanitizer the process runs just fine.
The program is compiled for arm-v7a using gcc-8.2.1 with
-fno-omit-frame-pointer
-fsanitize=address
-fsanitize-recover=all
-fdata-sections
-ffunction-sections
-fPIC
I am starting the process as follows:
ASAN_OPTIONS=debug=1:verbosity=0:detect_leaks=0:abort_on_error=0:halt_on_error=0:check_initialization_order=1:allocator_may_return_null=1 ./Launcher
Is there a way to reduce the memory footprint of the Address Sanitizer? Unfortunately, enabling swap is not an option.
This is the kernel log as printed by dmesg:
[512792.413376] Launcher invoked oom-killer: gfp_mask=0x400dc0(GFP_KERNEL_ACCOUNT|__GFP_ZERO), order=0, oom_score_adj=0
[512792.424695] CPU: 3 PID: 7786 Comm: Launcher Tainted: G W 5.4.1 #1
[512792.432821] Hardware name: Freescale i.MX6 Quad/DualLite (Device Tree)
[512792.439455] Backtrace:
[512792.442039] [<8010eb1c>] (dump_backtrace) from [<8010eee0>] (show_stack+0x20/0x24)
[512792.449721] r7:811d32ec r6:00000000 r5:60070113 r4:811d32ec
[512792.455500] [<8010eec0>] (show_stack) from [<80ba06e8>] (dump_stack+0xbc/0xe8)
[512792.462840] [<80ba062c>] (dump_stack) from [<80257360>] (dump_header+0x64/0x440)
[512792.470343] r10:00000a24 r9:a9a4ce00 r8:00016f9c r7:80e82aac r6:a749fce0 r5:a9a4ce00
[512792.478275] r4:a749fce0 r3:6f25b167
[512792.481958] [<802572fc>] (dump_header) from [<80256364>] (oom_kill_process+0x494/0x4ac)
[512792.490066] r10:00000a24 r9:a9a4c100 r8:00016f9c r7:80e82aac r6:a749fce0 r5:a9a4ce00
[512792.497996] r4:a9a4d264
[512792.500636] [<80255ed0>] (oom_kill_process) from [<80256e8c>] (out_of_memory+0xf8/0x4ec)
[512792.508830] r10:00000a24 r9:a9a4c100 r8:00016f9c r7:8110b640 r6:8110b640 r5:811d8860
[512792.516760] r4:a749fce0
[512792.519405] [<80256d94>] (out_of_memory) from [<802a0910>] (__alloc_pages_nodemask+0xf7c/0x13a4)
[512792.528295] r9:00000000 r8:81107d30 r7:811d5588 r6:0000233c r5:00000000 r4:00000000
[512792.536153] [<8029f994>] (__alloc_pages_nodemask) from [<80285d10>] (__pte_alloc+0x34/0x1ac)
[512792.544697] r10:74b94000 r9:00000000 r8:00000000 r7:a8b9e580 r6:a8b9e580 r5:a7445d28
[512792.552628] r4:a7445d28
[512792.555271] [<80285cdc>] (__pte_alloc) from [<802869c8>] (copy_page_range+0x4ec/0x650)
[512792.563295] r9:00000000 r8:00000000 r7:a8b9e580 r6:a7174f4c r5:a8b9e580 r4:a7445d28
[512792.571148] [<802864dc>] (copy_page_range) from [<801241b8>] (dup_mm+0x470/0x4e0)
[512792.578736] r10:a7174f14 r9:a7174f10 r8:a8b9d680 r7:a7c36420 r6:a7174f4c r5:a8b9e580
[512792.586667] r4:a7835d20
[512792.589307] [<80123d48>] (dup_mm) from [<801255e0>] (copy_process+0x10bc/0x1888)
[512792.596807] r10:a749ff60 r9:ffffffff r8:00000000 r7:a749e000 r6:9d283400 r5:a825c300
[512792.604738] r4:00100000
[512792.607378] [<80124524>] (copy_process) from [<80125fb8>] (_do_fork+0x90/0x750)
[512792.614792] r10:00100000 r9:a749e000 r8:801011c4 r7:a749e000 r6:a749ff60 r5:6f25b167
[512792.622722] r4:00000001
[512792.625362] [<80125f28>] (_do_fork) from [<80126954>] (sys_clone+0x80/0x9c)
[512792.632428] r10:00000078 r9:a749e000 r8:801011c4 r7:00000078 r6:7649e000 r5:6f25b167
[512792.640358] r4:a749e000
[512792.643001] [<801268d4>] (sys_clone) from [<80101000>] (ret_fast_syscall+0x0/0x28)
[512792.650671] Exception stack(0xa749ffa8 to 0xa749fff0)
[512792.655828] ffa0: 54ad00fc 76ffe964 00100011 00000000 54ad00fc 00000000
[512792.664112] ffc0: 54ad00fc 76ffe964 7649e000 00000078 54ad0100 54ad0120 00000001 54ad0280
[512792.672391] ffe0: 00000078 54ad00e8 763d590b 763bf746
[512792.677546] r5:76ffe964 r4:54ad00fc
[512792.681484] Mem-Info:
[512792.683936] active_anon:158884 inactive_anon:15315 isolated_anon:0
active_file:1041 inactive_file:1140 isolated_file:0
unevictable:2224 dirty:8 writeback:1 unstable:0
slab_reclaimable:4553 slab_unreclaimable:4490
mapped:5064 shmem:17635 pagetables:1579 bounce:0
free:56987 free_pcp:173 free_cma:53962
[512792.718450] Node 0 active_anon:635536kB inactive_anon:61260kB active_file:4264kB inactive_file:5460kB unevictable:8896kB isolated(anon):0kB isolated(file):0kB mapped:21056kB dirty:32kB writeback:4kB shmem:70540kB writeback_tmp:0kB unstable:0kB all_unreclaimable? no
[512792.742142] Normal free:226708kB min:3312kB low:4140kB high:4968kB active_anon:635436kB inactive_anon:61260kB active_file:4584kB inactive_file:5652kB unevictable:8896kB writepending:36kB present:1048576kB managed:1015668kB mlocked:0kB kernel_stack:1216kB pagetables:6316kB bounce:0kB free_pcp:192kB local_pcp:0kB free_cma:215848kB
[512792.771461] lowmem_reserve[]: 0 0 0
[512792.775161] Normal: 1651*4kB (UMEC) 839*8kB (UMEC) 495*16kB (UMEC) 221*32kB (UMEC) 78*64kB (UEC) 29*128kB (MC) 1*256kB (U) 40*512kB (C) 35*1024kB (C) 21*2048kB (C) 10*4096kB (C) 2*8192kB (C) 0*16384kB 1*32768kB (C) = 226708kB
[512792.795442] 20243 total pagecache pages
[512792.799391] 0 pages in swap cache
[512792.802816] Swap cache stats: add 0, delete 0, find 0/0
[512792.808232] Free swap = 0kB
[512792.811225] Total swap = 0kB
[512792.814296] 262144 pages RAM
[512792.817288] 0 pages HighMem/MovableOnly
[512792.821232] 8227 pages reserved
[512792.824558] 81920 pages cma reserved
[512792.828247] Tasks state (memory values in pages):
[512792.833057] [ pid ] uid tgid total_vm rss pgtables_bytes swapents oom_score_adj name
[512792.841890] [ 211] 0 211 9965 1608 67584 0 0 systemd-journal
[512792.851149] [ 224] 0 224 3848 249 16384 0 -1000 systemd-udevd
[512792.860222] [ 317] 0 317 1559 339 12288 0 0 dhclient
[512792.868867] [ 316] 0 316 1559 348 14336 0 0 dhclient
[512792.877508] [ 333] 0 333 1810 856 14336 0 0 haveged
[512792.886061] [ 334] 101 334 4985 261 22528 0 0 systemd-timesyn
[512792.895309] [ 336] 104 336 1342 167 12288 0 0 rpcbind
[512792.903866] [ 368] 106 368 1333 218 12288 0 -900 dbus-daemon
[512792.912684] [ 369] 0 369 6193 356 22528 0 0 rsyslogd
[512792.921327] [ 370] 0 370 2681 178 18432 0 0 systemd-logind
[512792.930490] [ 372] 0 372 1625 158 14336 0 0 cron
[512792.938784] [ 431] 0 431 428 122 10240 0 0 motion_sensor
[512792.947870] [ 560] 0 560 8756 207 18432 0 0 automount
[512792.956597] [ 564] 0 564 1190 172 12288 0 0 login
[512792.964988] [ 566] 0 566 1338 98 12288 0 0 agetty
[512792.973372] [ 572] 0 572 2218 276 16384 0 -1000 sshd
[512792.981664] [ 574] 0 574 946 33 12288 0 0 inputattach
[512792.990569] [ 637] 0 637 3017 379 18432 0 0 systemd
[512792.999122] [ 640] 0 640 3504 402 20480 0 0 (sd-pam)
[512793.007768] [ 653] 0 653 1760 329 12288 0 0 bash
[512793.016057] [ 671] 0 671 2599 1116 18432 0 0 Server.
[512793.025310] [ 732] 0 732 1300 132 12288 0 0 dbus-daemon
[512793.034212] [ 31836] 0 31836 3173 980 22528 0 0 sshd
[512793.042428] [ 31847] 0 31847 422 154 8192 0 0 sftp-server
[512793.051332] [ 5350] 0 5350 2555 351 16384 0 0 sshd
[512793.059631] [ 5452] 0 5452 1793 379 16384 0 0 bash
[512793.067924] [ 5823] 0 5823 2555 350 16384 0 0 sshd
[512793.076216] [ 5833] 0 5833 1760 326 14336 0 0 bash
[512793.084509] [ 6822] 0 6822 792 31 10240 0 0 xinit
[512793.092813] [ 6823] 0 6823 29526 5386 112640 0 0 Xorg
[512793.101103] [ 6827] 0 6827 3655 866 22528 0 0 xterm
[512793.109488] [ 6829] 0 6829 1620 114 14336 0 0 bash
[512793.117784] [ 7256] 0 7256 1549 322 12288 0 0 watch
[512793.126169] [ 7363] 0 7363 127832 56725 520192 0 0 gdb
[512793.134370] [ 7368] 0 7368 281561 93707 1046528 0 0 Launcher
[512793.143613] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),task=Launcher,pid=7368,uid=0
[512793.152974] Out of memory: Killed process 7368 (Launcher) total-vm:1126244kB, anon-rss:365128kB, file-rss:5700kB, shmem-rss:4000kB, UID:0 pgtables:1046528kB oom_score_adj:0
[512793.387824] oom_reaper: reaped process 7368 (Launcher), now anon-rss:0kB, file-rss:0kB, shmem-rss:4000kB
You could reduce some ASan features (or enable them one by one in separate runs):
# Disable UAR error detection (reduces code and heap size)
CFLAGS+='-fsanitize-address-use-after-return=never -fno-sanitize-address-use-after-scope'
export ASAN_OPTIONS="$ASAN_OPTIONS:detect_stack_use_after_return=1"
# Disable inline instrumentation (slower but saves code size)
CFLAGS+='-fsanitize-address-outline-instrumentation'
# Reduce heap quarantine (reduces heap consumption but also lowers chance of UAF detection)
export ASAN_OPTIONS="$ASAN_OPTIONS:quarantine_size_mb=16"
# Do not keep full backtrace of malloc origin (slightly complicates debugging but reduces heap size)
export ASAN_OPTIONS="$ASAN_OPTIONS:malloc_context_size=5"
The compiler options above are for Clang, but GCC also has similar switches.
As for swap, we had good experience with enabling compressed swap in RAM (zram).
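For reference, a minimal zram-swap setup looks roughly like this (a sketch assuming a kernel with the zram module; run as root, and the settings do not persist across reboots):
modprobe zram
echo 256M > /sys/block/zram0/disksize    # size of the compressed swap device
mkswap /dev/zram0
swapon -p 100 /dev/zram0                 # higher priority than any disk swap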

SDL2 regular jitter when moving a simple texture - even at high fps without vsync

I'm trying to make a very simple SDL2 app that just scrolls a texture from an image across the screen.
I need to run it on a pretty old device though:
$ inxi -a
CPU: Single Core Intel Atom N270 (-MT-) speed/min/max: 1600/800/1600 MHz
Kernel: 4.9.126-antix.1-486-smp i686 Up: 1m Mem: 94.4/992.7 MiB (9.5%) Storage: 14.92 GiB (32.0% used)
Procs: 154 Shell: bash 4.4.12 inxi: 3.0.36
urve@urve:~
$ inxi -G
Graphics: Device-1: Intel Mobile 945GSE Express Integrated Graphics driver: i915 v: kernel
Display: server: X.org 1.19.2 driver: intel tty: 111x45
Message: Advanced graphics data unavailable in console. Try -G --display
$ free -h
total used free shared buff/cache available
Mem: 992M 76M 666M 60M 249M 833M
Swap: 2.0G 0B 2.0G
It's running antiX Linux:
$ uname -a
Linux urve 4.9.126-antix.1-486-smp #1 SMP Mon Sep 10 16:55:08 BST 2018 i686 GNU/Linux
$ lsb_release -a
No LSB modules are available.
Distributor ID: antiX
Description: antiX 17.2
Release: 17.2
Codename: stretch
Here's output of glxinfo:
DISPLAY=:0.0 glxinfo
name of display: :0.0
display: :0 screen: 0
direct rendering: Yes
server glx vendor string: SGI
server glx version string: 1.4
server glx extensions:
GLX_ARB_create_context, GLX_ARB_create_context_profile,
GLX_ARB_fbconfig_float, GLX_ARB_framebuffer_sRGB, GLX_ARB_multisample,
GLX_EXT_create_context_es2_profile, GLX_EXT_create_context_es_profile,
GLX_EXT_fbconfig_packed_float, GLX_EXT_framebuffer_sRGB,
GLX_EXT_import_context, GLX_EXT_libglvnd, GLX_EXT_texture_from_pixmap,
GLX_EXT_visual_info, GLX_EXT_visual_rating, GLX_INTEL_swap_event,
GLX_MESA_copy_sub_buffer, GLX_OML_swap_method, GLX_SGIS_multisample,
GLX_SGIX_fbconfig, GLX_SGIX_pbuffer, GLX_SGIX_visual_select_group,
GLX_SGI_make_current_read, GLX_SGI_swap_control
client glx vendor string: Mesa Project and SGI
client glx version string: 1.4
client glx extensions:
GLX_ARB_create_context, GLX_ARB_create_context_profile,
GLX_ARB_create_context_robustness, GLX_ARB_fbconfig_float,
GLX_ARB_framebuffer_sRGB, GLX_ARB_get_proc_address, GLX_ARB_multisample,
GLX_EXT_buffer_age, GLX_EXT_create_context_es2_profile,
GLX_EXT_create_context_es_profile, GLX_EXT_fbconfig_packed_float,
GLX_EXT_framebuffer_sRGB, GLX_EXT_import_context,
GLX_EXT_texture_from_pixmap, GLX_EXT_visual_info, GLX_EXT_visual_rating,
GLX_INTEL_swap_event, GLX_MESA_copy_sub_buffer,
GLX_MESA_multithread_makecurrent, GLX_MESA_query_renderer,
GLX_MESA_swap_control, GLX_OML_swap_method, GLX_OML_sync_control,
GLX_SGIS_multisample, GLX_SGIX_fbconfig, GLX_SGIX_pbuffer,
GLX_SGIX_visual_select_group, GLX_SGI_make_current_read,
GLX_SGI_swap_control, GLX_SGI_video_sync
GLX version: 1.4
GLX extensions:
GLX_ARB_create_context, GLX_ARB_create_context_profile,
GLX_ARB_fbconfig_float, GLX_ARB_framebuffer_sRGB,
GLX_ARB_get_proc_address, GLX_ARB_multisample,
GLX_EXT_create_context_es2_profile, GLX_EXT_create_context_es_profile,
GLX_EXT_fbconfig_packed_float, GLX_EXT_framebuffer_sRGB,
GLX_EXT_import_context, GLX_EXT_texture_from_pixmap, GLX_EXT_visual_info,
GLX_EXT_visual_rating, GLX_INTEL_swap_event, GLX_MESA_copy_sub_buffer,
GLX_MESA_multithread_makecurrent, GLX_MESA_query_renderer,
GLX_MESA_swap_control, GLX_OML_swap_method, GLX_OML_sync_control,
GLX_SGIS_multisample, GLX_SGIX_fbconfig, GLX_SGIX_pbuffer,
GLX_SGIX_visual_select_group, GLX_SGI_make_current_read,
GLX_SGI_swap_control, GLX_SGI_video_sync
Extended renderer info (GLX_MESA_query_renderer):
Vendor: Intel Open Source Technology Center (0x8086)
Device: Mesa DRI Intel(R) 945GME x86/MMX/SSE2 (0x27ae)
Version: 13.0.6
Accelerated: yes
Video memory: 192MB
Unified memory: yes
Preferred profile: compat (0x2)
Max core profile version: 0.0
Max compat profile version: 2.1
Max GLES1 profile version: 1.1
Max GLES[23] profile version: 2.0
OpenGL vendor string: Intel Open Source Technology Center
OpenGL renderer string: Mesa DRI Intel(R) 945GME x86/MMX/SSE2
OpenGL version string: 2.1 Mesa 13.0.6
OpenGL shading language version string: 1.20
OpenGL extensions:
GL_3DFX_texture_compression_FXT1, GL_AMD_shader_trinary_minmax,
GL_ANGLE_texture_compression_dxt3, GL_ANGLE_texture_compression_dxt5,
GL_APPLE_object_purgeable, GL_APPLE_packed_pixels,
GL_APPLE_vertex_array_object, GL_ARB_ES2_compatibility,
GL_ARB_clear_buffer_object, GL_ARB_compressed_texture_pixel_storage,
GL_ARB_copy_buffer, GL_ARB_debug_output, GL_ARB_depth_texture,
GL_ARB_draw_buffers, GL_ARB_draw_elements_base_vertex,
GL_ARB_explicit_attrib_location, GL_ARB_explicit_uniform_location,
GL_ARB_fragment_program, GL_ARB_fragment_shader,
GL_ARB_framebuffer_object, GL_ARB_get_program_binary,
GL_ARB_get_texture_sub_image, GL_ARB_half_float_pixel,
GL_ARB_internalformat_query, GL_ARB_invalidate_subdata,
GL_ARB_map_buffer_alignment, GL_ARB_map_buffer_range, GL_ARB_multi_bind,
GL_ARB_multisample, GL_ARB_multitexture, GL_ARB_occlusion_query,
GL_ARB_pixel_buffer_object, GL_ARB_point_parameters, GL_ARB_point_sprite,
GL_ARB_program_interface_query, GL_ARB_provoking_vertex,
GL_ARB_robustness, GL_ARB_sampler_objects, GL_ARB_separate_shader_objects,
GL_ARB_shader_objects, GL_ARB_shading_language_100, GL_ARB_shadow,
GL_ARB_sync, GL_ARB_texture_border_clamp, GL_ARB_texture_compression,
GL_ARB_texture_cube_map, GL_ARB_texture_env_add,
GL_ARB_texture_env_combine, GL_ARB_texture_env_crossbar,
GL_ARB_texture_env_dot3, GL_ARB_texture_mirrored_repeat,
GL_ARB_texture_non_power_of_two, GL_ARB_texture_rectangle,
GL_ARB_texture_storage, GL_ARB_transpose_matrix,
GL_ARB_vertex_array_object, GL_ARB_vertex_attrib_binding,
GL_ARB_vertex_buffer_object, GL_ARB_vertex_program, GL_ARB_vertex_shader,
GL_ARB_window_pos, GL_ATI_blend_equation_separate, GL_ATI_draw_buffers,
GL_ATI_separate_stencil, GL_ATI_texture_env_combine3, GL_EXT_abgr,
GL_EXT_bgra, GL_EXT_blend_color, GL_EXT_blend_equation_separate,
GL_EXT_blend_func_separate, GL_EXT_blend_minmax, GL_EXT_blend_subtract,
GL_EXT_compiled_vertex_array, GL_EXT_copy_texture,
GL_EXT_draw_range_elements, GL_EXT_fog_coord, GL_EXT_framebuffer_blit,
GL_EXT_framebuffer_object, GL_EXT_gpu_program_parameters,
GL_EXT_multi_draw_arrays, GL_EXT_packed_depth_stencil,
GL_EXT_packed_pixels, GL_EXT_pixel_buffer_object, GL_EXT_point_parameters,
GL_EXT_polygon_offset, GL_EXT_provoking_vertex, GL_EXT_rescale_normal,
GL_EXT_secondary_color, GL_EXT_separate_specular_color,
GL_EXT_shadow_funcs, GL_EXT_stencil_two_side, GL_EXT_stencil_wrap,
GL_EXT_subtexture, GL_EXT_texture, GL_EXT_texture3D,
GL_EXT_texture_compression_dxt1, GL_EXT_texture_cube_map,
GL_EXT_texture_edge_clamp, GL_EXT_texture_env_add,
GL_EXT_texture_env_combine, GL_EXT_texture_env_dot3,
GL_EXT_texture_filter_anisotropic, GL_EXT_texture_lod_bias,
GL_EXT_texture_object, GL_EXT_texture_rectangle, GL_EXT_texture_sRGB,
GL_EXT_texture_sRGB_decode, GL_EXT_vertex_array,
GL_IBM_multimode_draw_arrays, GL_IBM_rasterpos_clip,
GL_IBM_texture_mirrored_repeat, GL_INGR_blend_func_separate,
GL_KHR_context_flush_control, GL_KHR_debug, GL_MESA_pack_invert,
GL_MESA_window_pos, GL_MESA_ycbcr_texture, GL_NV_blend_square,
GL_NV_light_max_exponent, GL_NV_packed_depth_stencil,
GL_NV_texgen_reflection, GL_NV_texture_env_combine4,
GL_NV_texture_rectangle, GL_OES_EGL_image, GL_OES_read_format,
GL_S3_s3tc, GL_SGIS_generate_mipmap, GL_SGIS_texture_border_clamp,
GL_SGIS_texture_edge_clamp, GL_SGIS_texture_lod, GL_SUN_multi_draw_arrays
OpenGL ES profile version string: OpenGL ES 2.0 Mesa 13.0.6
OpenGL ES profile shading language version string: OpenGL ES GLSL ES 1.0.16
OpenGL ES profile extensions:
GL_ANGLE_texture_compression_dxt3, GL_ANGLE_texture_compression_dxt5,
GL_APPLE_texture_max_level, GL_EXT_blend_minmax,
GL_EXT_discard_framebuffer, GL_EXT_draw_buffers,
GL_EXT_draw_elements_base_vertex, GL_EXT_map_buffer_range,
GL_EXT_multi_draw_arrays, GL_EXT_read_format_bgra,
GL_EXT_separate_shader_objects, GL_EXT_texture_border_clamp,
GL_EXT_texture_compression_dxt1, GL_EXT_texture_filter_anisotropic,
GL_EXT_texture_format_BGRA8888, GL_EXT_texture_type_2_10_10_10_REV,
GL_EXT_unpack_subimage, GL_KHR_context_flush_control, GL_KHR_debug,
GL_NV_draw_buffers, GL_NV_fbo_color_attachments, GL_NV_read_buffer,
GL_NV_read_depth, GL_NV_read_depth_stencil, GL_NV_read_stencil,
GL_OES_EGL_image, GL_OES_EGL_sync, GL_OES_depth24, GL_OES_depth_texture,
GL_OES_draw_elements_base_vertex, GL_OES_element_index_uint,
GL_OES_fbo_render_mipmap, GL_OES_get_program_binary, GL_OES_mapbuffer,
GL_OES_packed_depth_stencil, GL_OES_rgb8_rgba8, GL_OES_stencil8,
GL_OES_surfaceless_context, GL_OES_texture_3D,
GL_OES_texture_border_clamp, GL_OES_texture_npot,
GL_OES_vertex_array_object
12 GLX Visuals
visual x bf lv rg d st colorbuffer sr ax dp st accumbuffer ms cav
id dep cl sp sz l ci b ro r g b a F gb bf th cl r g b a ns b eat
----------------------------------------------------------------------------
0x020 24 tc 0 32 0 r y . 8 8 8 8 . . 0 24 8 0 0 0 0 0 0 None
0x021 24 dc 0 32 0 r y . 8 8 8 8 . . 0 24 8 0 0 0 0 0 0 None
0x077 24 tc 0 32 0 r y . 8 8 8 8 . . 0 0 0 0 0 0 0 0 0 None
0x078 24 tc 0 32 0 r . . 8 8 8 8 . . 0 0 0 0 0 0 0 0 0 None
0x079 24 tc 0 32 0 r . . 8 8 8 8 . . 0 24 8 0 0 0 0 0 0 None
0x07a 24 tc 0 32 0 r y . 8 8 8 8 . . 0 24 8 16 16 16 16 0 0 Slow
0x07b 24 dc 0 32 0 r y . 8 8 8 8 . . 0 0 0 0 0 0 0 0 0 None
0x07c 24 dc 0 32 0 r . . 8 8 8 8 . . 0 0 0 0 0 0 0 0 0 None
0x07d 24 dc 0 32 0 r . . 8 8 8 8 . . 0 24 8 0 0 0 0 0 0 None
0x07e 24 dc 0 32 0 r y . 8 8 8 8 . . 0 24 8 0 0 0 0 0 0 None
0x07f 24 dc 0 32 0 r y . 8 8 8 8 . . 0 24 8 16 16 16 16 0 0 Slow
0x05e 32 tc 0 32 0 r y . 8 8 8 8 . . 0 24 8 0 0 0 0 0 0 None
24 GLXFBConfigs:
visual x bf lv rg d st colorbuffer sr ax dp st accumbuffer ms cav
id dep cl sp sz l ci b ro r g b a F gb bf th cl r g b a ns b eat
----------------------------------------------------------------------------
0x05f 0 tc 0 16 0 r y . 5 6 5 0 . . 0 0 0 0 0 0 0 0 0 None
0x060 0 tc 0 16 0 r . . 5 6 5 0 . . 0 0 0 0 0 0 0 0 0 None
0x061 0 tc 0 16 0 r y . 5 6 5 0 . . 0 16 0 0 0 0 0 0 0 None
0x062 0 tc 0 16 0 r . . 5 6 5 0 . . 0 16 0 0 0 0 0 0 0 None
0x063 24 tc 0 32 0 r y . 8 8 8 8 . . 0 0 0 0 0 0 0 0 0 None
0x064 24 tc 0 32 0 r . . 8 8 8 8 . . 0 0 0 0 0 0 0 0 0 None
0x065 24 tc 0 32 0 r y . 8 8 8 8 . . 0 24 8 0 0 0 0 0 0 None
0x066 24 tc 0 32 0 r . . 8 8 8 8 . . 0 24 8 0 0 0 0 0 0 None
0x067 0 tc 0 16 0 r y . 5 6 5 0 . . 0 16 0 0 0 0 0 0 0 None
0x068 0 tc 0 16 0 r y . 5 6 5 0 . . 0 16 0 16 16 16 0 0 0 Slow
0x069 32 tc 0 32 0 r y . 8 8 8 8 . . 0 24 8 0 0 0 0 0 0 None
0x06a 24 tc 0 32 0 r y . 8 8 8 8 . . 0 24 8 16 16 16 16 0 0 Slow
0x06b 0 dc 0 16 0 r y . 5 6 5 0 . . 0 0 0 0 0 0 0 0 0 None
0x06c 0 dc 0 16 0 r . . 5 6 5 0 . . 0 0 0 0 0 0 0 0 0 None
0x06d 0 dc 0 16 0 r y . 5 6 5 0 . . 0 16 0 0 0 0 0 0 0 None
0x06e 0 dc 0 16 0 r . . 5 6 5 0 . . 0 16 0 0 0 0 0 0 0 None
0x06f 24 dc 0 32 0 r y . 8 8 8 8 . . 0 0 0 0 0 0 0 0 0 None
0x070 24 dc 0 32 0 r . . 8 8 8 8 . . 0 0 0 0 0 0 0 0 0 None
0x071 24 dc 0 32 0 r y . 8 8 8 8 . . 0 24 8 0 0 0 0 0 0 None
0x072 24 dc 0 32 0 r . . 8 8 8 8 . . 0 24 8 0 0 0 0 0 0 None
0x073 0 dc 0 16 0 r y . 5 6 5 0 . . 0 16 0 0 0 0 0 0 0 None
0x074 0 dc 0 16 0 r y . 5 6 5 0 . . 0 16 0 16 16 16 0 0 0 Slow
0x075 24 dc 0 32 0 r y . 8 8 8 8 . . 0 24 8 0 0 0 0 0 0 None
0x076 24 dc 0 32 0 r y . 8 8 8 8 . . 0 24 8 16 16 16 16 0 0 Slow
The problem is that even though the motion appears smooth, there is a constant jitter happening every 3 seconds or so. It looks as if the moving texture jumps back instead of moving forward.
I tried many different methods of smoothing this out. I used delta-time-based movement, then disabled VSync and limited the FPS manually, and then I found this question: https://gamedev.stackexchange.com/questions/163477/how-can-i-avoid-jittery-motion-in-sdl2
I applied the same method of capping the framerate as in the answer. Still no help.
Even at 120 FPS (which I'm actually able to achieve) it's better, but there is still a visible jitter.
I printed out the difference between the last X position and the current one at 120 FPS:
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2.01
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
So the results look very OK to me.
With 60FPS, printing the frame time and X position also looks very good:
4
Frame time: 0.01667
4
Frame time: 0.01667
4
Frame time: 0.01667
4
Frame time: 0.01667
4
Frame time: 0.01667
4
Frame time: 0.01667
4
Frame time: 0.01667
4
Frame time: 0.01667
4
Frame time: 0.01667
4
Frame time: 0.01667
4
Frame time: 0.01667
4
Frame time: 0.01667
4
Frame time: 0.01667
4.004
Frame time: 0.01668
4
Frame time: 0.01667
4
Frame time: 0.01667
4.003
Frame time: 0.01668
4
Frame time: 0.01667
4
Frame time: 0.01667
4.008
Frame time: 0.0167
4
Frame time: 0.01667
4
Frame time: 0.01667
4.002
Frame time: 0.01667
4
Frame time: 0.01667
4
Frame time: 0.01667
4.003
Frame time: 0.01668
4
Frame time: 0.01667
4
Frame time: 0.01667
4.001
Frame time: 0.01667
4
Frame time: 0.01667
4
Frame time: 0.01667
4
Frame time: 0.01667
There are no visible spikes bigger than what you see here, yet the jitter is still there.
The system is running at a forced resolution of 1280x768 @ 60 Hz; I tried applying other resolutions (using both gtf and cvt modelines), but it didn't change anything.
I also tried disabling VSync for the entire system using a .drirc file, as mentioned here: https://wiki.archlinux.org/index.php/Intel_graphics
It's still the same, although better than when running with vsync and SDL_RENDERER_PRESENTVSYNC.
Surprisingly, setting AccelMethod to uxa (instead of the default sna) actually gets rid of the regular jitter, but the end result is very flickery and hurts the eyes...
I'm running out of ideas.
I checked whether anything is hogging the GPU, but it's not even 50% busy:
$ sudo intel_gpu_top
render clock: 166 Mhz display clock: 200 Mhz
render busy: 33%: ██████▋ render space: 36/131072
task percent busy
Bypass FIFO: 32%: ██████▌
Color calculator: 32%: ██████▌
Map filter: 29%: █████▉
Intermediate Z: 23%: ████▋
Windowizer: 22%: ████▌
Perspective interpolation: 3%: ▋
Pixel shader: 3%: ▋
Setup engine: 2%: ▌
Dispatcher: 2%: ▌
Strips and fans: 2%: ▌
Sampler Cache: 0%:
Map L2: 0%:
Filtering: 0%:
Texture decompression: 0%:
Projection and LOD: 0%:
Dependent address calculation: 0%:
Texture fetch: 0%:
Here's a very basic one-file code of the app:
#include <iostream>
#include <iomanip>
#include <cmath>    // for roundf
#include <SDL2/SDL.h>
#include <SDL2/SDL_image.h>
#include <time.h>
#include <unistd.h>

#define WIDTH 1280
#define HEIGHT 156

SDL_Renderer* ren = nullptr;

using namespace std;

struct Timer {
    Uint64 previous_ticks{};
    float elapsed_seconds{};

    void tick() {
        const Uint64 current_ticks{SDL_GetPerformanceCounter()};
        const Uint64 delta{ current_ticks - previous_ticks };
        previous_ticks = current_ticks;
        static const Uint64 TICKS_PER_SECOND { SDL_GetPerformanceFrequency() };
        elapsed_seconds = delta / static_cast<float>(TICKS_PER_SECOND);
    }
};

class Object {
public:
    Object(string fileName, SDL_Renderer* ren) {
        this->ren = ren;
        surf = IMG_Load(fileName.c_str());
        tx = SDL_CreateTextureFromSurface(ren, surf);
        this->setRect(0, 0, surf->w, surf->h);
        SDL_FreeSurface(surf);
    }

    ~Object() {
        SDL_DestroyTexture(tx);
    }

    SDL_Rect* getDest() { return &dest; };
    SDL_Rect* getSrc() { return &src; };
    SDL_Texture* getTexture() { return tx; };

    void setRect(int x, int y, int w, int h) {
        //src.x = x;
        //src.y = y;
        src.x = 0;
        src.y = 0;
        src.w = w;
        src.h = h;
        dest.y = y;
        dest.x = x;
        dest.w = w;
        dest.h = h;
        pos_x = static_cast<float>(x);
    }

    void setX(float x) {
        pos_x = x;
        dest.x = roundf(x);
        if(dest.x <= -dest.w) {
            dest.x = WIDTH;
            pos_x = static_cast<float>(WIDTH); // reset
        }
        cout << setprecision(4) << (last_x - pos_x) << endl;
        last_x = pos_x;
    }

    void setXY(int x, int y) {
        this->setX(static_cast<float>(x));
        dest.y = y;
    }

    void move(float timestep) {
        this->setX(pos_x - (240.0f * (timestep) ));
    }

    void draw() {
        SDL_RenderCopy(this->ren, this->tx, &this->src, &this->dest);
    }

private:
    float pos_x = 0.0f;
    float last_x = (float)WIDTH;
    SDL_Rect dest;
    SDL_Rect src;
    SDL_Texture *tx = nullptr;
    SDL_Surface *surf = nullptr;
    SDL_Renderer *ren = nullptr;
};

void loop();

int main(int argc, char *argv[]) {
    // Create the SDL window, accelerated renderer without vsync
    SDL_Init(0);
    SDL_Window* win = SDL_CreateWindow("SDLTest", 0, 0, WIDTH, HEIGHT, 0);
    ren = SDL_CreateRenderer(win, -1, SDL_RENDERER_ACCELERATED);
    SDL_SetRenderDrawColor(ren, 255, 255, 255, 255);
    loop();
    SDL_Quit();
    return 1;
}

void loop() {
    bool running = true;

    SDL_Rect bg_rect;
    bg_rect.x = 0;
    bg_rect.y = 0;
    bg_rect.w = WIDTH;
    bg_rect.h = HEIGHT;

    const int UPDATE_FREQUENCY { 60 };
    const float CYCLE_TIME { 1.0f / UPDATE_FREQUENCY };
    static Timer timer;
    float accumulated_seconds = 0.0f;

    /*struct timespec tv_sleep;
    tv_sleep.tv_sec = 0;
    tv_sleep.tv_nsec = 10;*/

    Object obj("test.png", ren);
    obj.setXY(WIDTH, 20);

    int iterCount = 0;
    while(running) {
        iterCount++;

        // cap framerate
        timer.tick();
        accumulated_seconds += timer.elapsed_seconds;
        if(accumulated_seconds >= CYCLE_TIME*2.0f || accumulated_seconds <= 0.0 )
            accumulated_seconds = CYCLE_TIME;

        if(std::isgreaterequal(accumulated_seconds, CYCLE_TIME)) {
            SDL_Event e;
            while(SDL_PollEvent(&e)) {
                if(e.type == SDL_QUIT) {
                    running = false;
                    cout << "QUIT" << endl;
                    return;
                }
            }

            SDL_RenderClear(ren);
            SDL_RenderFillRect(ren, &bg_rect);

            // Draw moving stuff here
            obj.move(accumulated_seconds);
            obj.draw();

            SDL_RenderPresent(ren);
            cout << "Frame time: " << setprecision(4) << accumulated_seconds << endl;
            accumulated_seconds -= CYCLE_TIME;
        }

        /*if(iterCount == 1000) {
            iterCount = 0;
            //sleep(0);
        }*/
    }
}
(You just need a test.png image in the same directory; SDL loads it into a texture.)
I also tried using the FPSManager from SDL2_framerate.h, but it gave very similar results, if not worse.
I wanted to apply this method of smoothing the delta: http://frankforce.com/?p=2636
But I couldn't find any implementation of the GetMonitorRefreshRate() function, and SDL's SDL_GetDisplayMode() actually returned 0 for every value in the display mode struct. And even if it did return something, refresh_rate is an integer, so I don't think it would really help.
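For reference, this is roughly what I had in mind, as a sketch that simply hard-codes a 60 Hz refresh since I can't query the real one (the 60.0f is my assumption, not a queried value):
float smoothDelta(float rawDelta) {
    const float period = 1.0f / 60.0f;                  // assumed display refresh period
    float snapped = roundf(rawDelta / period) * period; // snap to the nearest multiple of the period
    if (snapped <= 0.0f)
        snapped = period;                               // never return a zero or negative delta
    return snapped;
}
The idea would be to feed move() the snapped value instead of the raw timer delta, so small timer noise can't wiggle the per-frame step.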
I'm beginning to think it's some sort of problem with the system itself, a missing or corrupted driver or whatever, but I tried installing almost everything I could find, without luck.
Here's my /etc/X11/xorg.conf.d/20-intel-flicker-fix.conf file:
Section "Device"
Identifier "Intel Graphics"
Driver "intel"
Option "TearFree" "true"
Option "AccelMethod" "sna"
EndSection
I needed to add the TearFree option because there was huge tearing when scrolling the image.
Here's the gtf modeline I'm using:
"1280x768_60.00" 80.00 1280 1344 1480 1680 768 769 772 795 -HSync +Vsync

Swap used when there is enough free RAM. Performance impacted

I wrote a simple program to study the performance when using a lot of RAM on Linux (64bit Red Hat Enterprise Linux Server release 6.4). (Please ignore the memory leak.)
#include <sys/time.h>
#include <time.h>
#include <stdio.h>
#include <string.h>
#include <iostream>
#include <vector>

using namespace std;

double getWallTime()
{
    struct timeval time;
    if (gettimeofday(&time, NULL))
    {
        return 0;
    }
    return (double)time.tv_sec + (double)time.tv_usec * .000001;
}

int main()
{
    int *a;
    int n = 1000000000;
    do
    {
        time_t mytime = time(NULL);
        char * time_str = ctime(&mytime);
        time_str[strlen(time_str)-1] = '\0';
        printf("Current Time : %s\n", time_str);

        double start = getWallTime();
        a = new int[n];
        for (int i = 0; i < n; i++)
        {
            a[i] = 1;
        }
        double elapsed = getWallTime()-start;
        cout << elapsed << endl;
        cout << "Allocated." << endl;
    }
    while (1);

    return 0;
}
The output is
Current Time : Tue May 8 11:46:55 2018
3.73667
Allocated.
Current Time : Tue May 8 11:46:59 2018
64.5222
Allocated.
Current Time : Tue May 8 11:48:03 2018
110.419
The top output is shown below. We can see that swap usage increased even though there was enough free RAM. As a consequence, the runtime soared from 3 seconds to 64 seconds.
top - 11:46:55 up 21 days, 1:14, 18 users, load average: 1.24, 1.25, 0.95
Tasks: 819 total, 3 running, 816 sleeping, 0 stopped, 0 zombie
Cpu(s): 1.6%us, 1.4%sy, 0.0%ni, 97.1%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 132110088k total, 127500344k used, 4609744k free, 262288k buffers
Swap: 10485752k total, 4112k used, 10481640k free, 45988192k cached
top - 11:47:01 up 21 days, 1:14, 18 users, load average: 1.38, 1.27, 0.96
Tasks: 819 total, 2 running, 817 sleeping, 0 stopped, 0 zombie
Cpu(s): 0.5%us, 2.1%sy, 0.0%ni, 97.4%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 132110088k total, 131620156k used, 489932k free, 262288k buffers
Swap: 10485752k total, 4112k used, 10481640k free, 45844228k cached
top - 11:47:53 up 21 days, 1:15, 18 users, load average: 1.25, 1.26, 0.97
Tasks: 819 total, 2 running, 817 sleeping, 0 stopped, 0 zombie
Cpu(s): 0.1%us, 2.5%sy, 0.0%ni, 97.4%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 132110088k total, 131626300k used, 483788k free, 262276k buffers
Swap: 10485752k total, 5464k used, 10480288k free, 43056696k cached
top - 11:47:56 up 21 days, 1:15, 18 users, load average: 1.23, 1.26, 0.97
Tasks: 819 total, 2 running, 817 sleeping, 0 stopped, 0 zombie
Cpu(s): 0.1%us, 2.5%sy, 0.0%ni, 97.4%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 132110088k total, 131627568k used, 482520k free, 262276k buffers
Swap: 10485752k total, 5792k used, 10479960k free, 42949788k cached
top - 11:47:59 up 21 days, 1:15, 18 users, load average: 1.21, 1.25, 0.97
Tasks: 819 total, 2 running, 817 sleeping, 0 stopped, 0 zombie
Cpu(s): 0.1%us, 2.5%sy, 0.0%ni, 97.4%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 132110088k total, 131623080k used, 487008k free, 262276k buffers
Swap: 10485752k total, 6312k used, 10479440k free, 42840068k cached
top - 11:48:02 up 21 days, 1:15, 18 users, load average: 1.21, 1.25, 0.97
Tasks: 819 total, 2 running, 817 sleeping, 0 stopped, 0 zombie
Cpu(s): 0.1%us, 2.5%sy, 0.0%ni, 97.4%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 132110088k total, 131620016k used, 490072k free, 262276k buffers
Swap: 10485752k total, 6772k used, 10478980k free, 42729276k cached
I read this and this. My questions are:
Why would Linux sacrifice performance rather than fully using cached RAM? Memory fragmentation? But putting data in swap certainly creates fragmentation too.
Is there a workaround to get a consistent 3 seconds until the physical RAM size is reached?
Thanks.
Update 1:
Added more output from top.
Update 2:
Per David's suggestions, looking at /proc/<pid>/io shows that my program doesn't do I/O. So David's first answer should explain this observation. Now to my second question: how can I improve the performance as a non-root user (who can't modify swappiness, etc.)?
Update 3: I switched to another machine, since I needed to sudo some commands. This is a real machine (no virtual machine) with an Intel(R) Xeon(R) CPU E5-2680 0 @ 2.70GHz. The machine has 16 physical cores.
uname -a
2.6.32-642.4.2.el6.x86_64 #1 SMP Tue Aug 23 19:58:13 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
Running osgx's modified code with more iterations gives
Iteration 451
Time to malloc: 1.81198e-05
Time to fill with data: 0.109081
Fill rate with data: 916.75 Mints/sec, 3667Mbytes/sec
Time to second write access of data: 0.049731
Access rate of data: 2010.82 Mints/sec, 8043.27Mbytes/sec
Time to third write access of data: 0.0478709
Access rate of data: 2088.95 Mints/sec, 8355.81Mbytes/sec
Allocated 400 Mbytes, with total memory allocated 180800Mbytes
Iteration 452
Time to malloc: 1.09673e-05
Time to fill with data: 5.16316
Fill rate with data: 19.368 Mints/sec, 77.4719Mbytes/sec
Time to second write access of data: 0.0495219
Access rate of data: 2019.31 Mints/sec, 8077.23Mbytes/sec
Time to third write access of data: 0.0439548
Access rate of data: 2275.06 Mints/sec, 9100.25Mbytes/sec
Allocated 400 Mbytes, with total memory allocated 181200Mbytes
I did see the kernel switch from 2 MB pages to 4 KB pages when the slowdown occurred.
vmstat 1 60
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu-----
r b swpd free buff cache si so bi bo in cs us sy id wa st
2 0 1217396 11506356 5911040 47499184 0 2 35 47 0 0 14 2 84 0 0
2 0 1217396 11305860 5911040 47499184 4 0 4 36 5163 3460 7 6 87 0 0
2 0 1217396 11112744 5911040 47499188 0 0 0 0 4326 3451 7 6 87 0 0
2 0 1217396 10980556 5911040 47499188 0 0 0 0 4801 3385 7 6 87 0 0
2 0 1217396 10845940 5911040 47499192 0 0 0 20 4650 3596 7 6 87 0 0
2 0 1217396 10712508 5911040 47499200 0 0 0 0 5743 3562 7 6 87 0 0
2 0 1217396 10583380 5911040 47499200 0 0 0 40 4531 3622 7 6 87 0 0
2 0 1217396 10449096 5911040 47499200 0 0 0 0 4516 3629 7 6 87 0 0
2 0 1217396 10187856 5911040 47499200 0 0 0 0 4499 3456 7 6 87 0 0
2 0 1217396 10053256 5911040 47499204 0 0 0 8 5334 3507 7 6 87 0 0
2 0 1217396 9921624 5911040 47499204 0 0 0 0 6310 3593 6 6 87 0 0
2 0 1217396 9788532 5911040 47499208 0 0 0 44 5794 3516 7 6 87 0 0
2 0 1217396 9660516 5911040 47499208 0 0 0 0 4894 3535 7 6 87 0 0
2 0 1217396 9527552 5911040 47499212 0 0 0 0 4686 3570 7 6 87 0 0
2 0 1217396 9396536 5911040 47499212 0 0 0 0 4805 3538 7 6 87 0 0
2 0 1217396 9238664 5911040 47499212 0 0 0 0 5940 3459 7 6 87 0 0
2 0 1217396 9000136 5911040 47499216 0 0 0 32 5239 3333 7 6 87 0 0
2 0 1217396 8861132 5911040 47499220 0 0 0 0 5579 3351 7 6 87 0 0
2 0 1217396 8733688 5911040 47499220 0 0 0 0 4910 3199 7 6 87 0 0
2 0 1217396 8596600 5911040 47499224 0 0 0 44 5075 3453 7 6 87 0 0
2 0 1217396 8338468 5911040 47499232 0 0 0 0 5328 3444 7 6 87 0 0
2 0 1217396 8207732 5911040 47499232 0 0 0 52 5474 3370 7 6 87 0 0
2 0 1217396 8071212 5911040 47499236 0 0 0 0 5442 3419 7 6 87 0 0
2 0 1217396 7807736 5911040 47499236 0 0 0 0 6139 3456 7 6 87 0 0
2 0 1217396 7676080 5911044 47499232 0 0 0 16 4533 3430 6 6 87 0 0
2 0 1217396 7545728 5911044 47499236 0 0 0 0 6712 3957 7 6 87 0 0
4 0 1217396 7412444 5911044 47499240 0 0 0 68 6110 3547 7 6 87 0 0
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu-----
r b swpd free buff cache si so bi bo in cs us sy id wa st
2 0 1217396 7280148 5911048 47499244 0 0 0 68 6140 3516 7 7 86 0 0
2 0 1217396 7147836 5911048 47499244 0 0 0 0 4434 3400 7 6 87 0 0
2 0 1217396 6886980 5911048 47499248 0 0 0 16 7354 3393 7 6 87 0 0
2 0 1217396 6752868 5911048 47499248 0 0 0 0 5286 3573 7 6 87 0 0
2 0 1217396 6621772 5911048 47499248 0 0 0 0 5353 3410 7 6 87 0 0
2 0 1217396 6489760 5911048 47499252 0 0 0 48 5172 3454 7 6 87 0 0
2 0 1217396 6248732 5911048 47499256 0 0 0 0 5266 3411 7 6 87 0 0
2 0 1217396 6092804 5911048 47499260 0 0 0 4 6345 3473 7 6 87 0 0
2 0 1217396 5962544 5911048 47499260 0 0 0 0 7399 3712 7 6 87 0 0
2 0 1217396 5828492 5911048 47499264 0 0 0 0 5804 3516 7 6 87 0 0
2 0 1217396 5566720 5911048 47499264 0 0 0 44 5800 3370 7 6 87 0 0
2 0 1217396 5434204 5911048 47499264 0 0 0 0 6716 3446 7 6 87 0 0
2 0 1217396 5240724 5911048 47499268 0 0 0 68 3948 3346 7 6 87 0 0
2 0 1217396 5051688 5911008 47484936 0 0 0 0 4743 3734 7 6 87 0 0
2 0 1217396 4925680 5910500 47478444 0 0 136 0 5978 3779 7 6 87 0 0
2 0 1217396 4801744 5908552 47471820 0 0 0 32 4573 3237 7 6 87 0 0
2 0 1217396 4675772 5908552 47463984 0 0 0 0 6594 3276 7 6 87 0 0
2 0 1217396 4486472 5908444 47455736 0 0 0 4 6096 3256 7 6 87 0 0
2 0 1217396 4299908 5908392 47446964 0 0 0 0 5569 3525 7 6 87 0 0
2 0 1217396 4175444 5906884 47440024 0 0 0 0 4975 3141 7 6 87 0 0
2 0 1217396 4063472 5905976 47423860 0 0 0 56 6255 3147 6 6 87 0 0
2 0 1217396 3939816 5905796 47415596 0 0 0 0 5396 3143 7 6 87 0 0
2 0 1217396 3686540 5905796 47407152 0 0 0 44 6471 3201 7 6 87 0 0
2 0 1217396 3557596 5905796 47398892 0 0 0 0 7581 3727 7 6 87 0 0
2 0 1217396 3445536 5905796 47381812 0 0 0 0 5560 3222 7 6 87 0 0
2 0 1217396 3250272 5905796 47373364 0 0 0 60 5594 3343 7 6 87 0 0
2 0 1217396 3065232 5903744 47367156 0 0 0 0 5595 3182 7 6 87 0 0
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu-----
r b swpd free buff cache si so bi bo in cs us sy id wa st
3 0 1217396 2951704 5903028 47350792 0 0 0 12 5210 3262 7 6 87 0 0
2 0 1217396 2829228 5902928 47342444 0 0 0 0 5724 3758 7 6 87 0 0
2 0 1217396 2575248 5902580 47334472 0 0 0 0 4377 3369 7 6 87 0 0
2 0 1217396 2527996 5897796 47322436 0 0 0 60 5550 3570 7 6 87 0 0
2 0 1217396 2398672 5893572 47322324 0 0 0 0 5603 3225 7 6 87 0 0
2 0 1217396 2272536 5889364 47322228 0 0 0 16 6924 3310 7 6 87 0 0
iostat -xyz 1 60
Linux 2.6.32-642.4.2.el6.x86_64 05/09/2018 _x86_64_ (16 CPU)
avg-cpu: %user %nice %system %iowait %steal %idle
6.64 0.00 6.26 0.00 0.00 87.10
Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await r_await w_await svctm %util
avg-cpu: %user %nice %system %iowait %steal %idle
7.00 0.06 5.69 0.00 0.00 87.24
I managed to do "sudo perf top", and saw this in the top line when slowdown occurred.
16.84% [kernel] [k] compaction_alloc
From top. There were several other processes running (not shown).
Tasks: 799 total, 5 running, 787 sleeping, 4 stopped, 3 zombie
Cpu(s): 23.1%us, 16.7%sy, 0.0%ni, 60.0%id, 0.0%wa, 0.0%hi, 0.1%si, 0.0%st
Mem: 264503640k total, 256749480k used, 7754160k free, 5830508k buffers
Swap: 409259004k total, 1217112k used, 408041892k free, 50458600k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
23559 toddwz 20 0 165g 164g 1204 R 93.0 65.4 2:05.51 a.out
Update 4
After turning off THP, I see the following. The fill rate is consistent at around 550 Mints/sec (900 with THP on) until my program uses 240 GB of RAM (cached RAM < 1 GB). Then swap kicks in, so the fill rate drops.
Iteration 610
Time to malloc: 1.3113e-05
Time to fill with data: 0.181151
Fill rate with data: 552.025 Mints/sec, 2208.1Mbytes/sec
Time to second write access of data: 0.04074
Access rate of data: 2454.59 Mints/sec, 9818.36Mbytes/sec
Time to third write access of data: 0.0420492
Access rate of data: 2378.17 Mints/sec, 9512.67Mbytes/sec
Allocated 400 Mbytes, with total memory allocated 244400Mbytes
Iteration 611
Time to malloc: 1.88351e-05
Time to fill with data: 0.306215
Fill rate with data: 326.568 Mints/sec, 1306.27Mbytes/sec
Time to second write access of data: 0.045784
Access rate of data: 2184.17 Mints/sec, 8736.68Mbytes/sec
Time to third write access of data: 0.0441492
Access rate of data: 2265.05 Mints/sec, 9060.19Mbytes/sec
Allocated 400 Mbytes, with total memory allocated 244800Mbytes
Iteration 612
Time to malloc: 2.21729e-05
Time to fill with data: 1.33305
Fill rate with data: 75.016 Mints/sec, 300.064Mbytes/sec
Time to second write access of data: 0.048573
Access rate of data: 2058.76 Mints/sec, 8235.02Mbytes/sec
Time to third write access of data: 0.0495481
Access rate of data: 2018.24 Mints/sec, 8072.96Mbytes/sec
Allocated 400 Mbytes, with total memory allocated 245200Mbytes
Conclusion
The behavior of my program is more transparent to me with transparent huge pages (THP) turned off, so I'll continue with THP off. For my particular program, the cause is THP, not swap. Thanks to all who helped.
The first iterations of the test probably use huge pages (2 MB pages) due to THP: Transparent Hugepage - https://www.kernel.org/doc/Documentation/vm/transhuge.txt -
check your /sys/kernel/mm/transparent_hugepage/enabled and grep AnonHugePages /proc/meminfo during execution of the test.
The reason applications are running faster is because of two
factors. The first factor is almost completely irrelevant and it's not
of significant interest because it'll also have the downside of
requiring larger clear-page copy-page in page faults which is a
potentially negative effect. The first factor consists in taking a
single page fault for each 2M virtual region touched by userland (so
reducing the enter/exit kernel frequency by a 512 times factor). This
only matters the first time the memory is accessed for the lifetime of
a memory mapping.
Allocation of huge amounts of memory with new or malloc is served by a single mmap syscall, which usually doesn't "populate" the virtual memory with physical pages; check man mmap around MAP_POPULATE:
MAP_POPULATE (since Linux 2.5.46)
Populate (prefault) page tables for a mapping. ... This will help
to reduce blocking on page faults later.
This memory is just registered by mmap (without MAP_POPULATE) as virtual, and write access is prohibited in the page table. When your test tries its first write to any memory page, a page-fault exception is generated and handled by the OS kernel. The Linux kernel will allocate some physical memory and map the virtual page to it (populate the page). With THP enabled (it often is), the kernel may allocate a single huge page of 2 MB if it has a free huge physical page. If there are no free huge pages, the kernel will allocate a 4 KB page. So, without huge pages you will have 512 times more page faults (this can be checked by running vmstat 1 180 in another console while the test is running, or with perf stat -I 1000).
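As an aside, here is a minimal sketch (not part of the original test) of how those page faults can be counted from inside the process with getrusage; the 400 MB size is chosen only to match the test above:
#include <sys/resource.h>
#include <cstdio>
#include <cstdlib>

int main()
{
    const size_t size = 400u * 1024 * 1024;
    char *p = (char*) malloc(size);
    if (!p)
        return 1;

    struct rusage before, after;
    getrusage(RUSAGE_SELF, &before);
    for (size_t i = 0; i < size; i += 4096)  // touch every 4 KB page once
        p[i] = 1;
    getrusage(RUSAGE_SELF, &after);

    // With 2 MB huge pages expect roughly size/2MB minor faults,
    // with 4 KB pages roughly size/4KB.
    printf("minor page faults during first touch: %ld\n",
           after.ru_minflt - before.ru_minflt);
    free(p);
    return 0;
}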
Subsequent accesses to populated pages will not page-fault, so you can extend your test with a second (and third) for i in (0..N-1): a[i] = 1; loop and measure the time of each loop.
Your results still sound strange. Is your system real or virtualized? Hypervisors may support 2 MB pages, and virtual systems may have much higher costs for memory allocation and exception handling.
On my PC with less memory, I see something like a 10% slowdown when page faults switch from huge-page allocation down to 4 KB page allocation (check the page-faults lines from perf stat - there were only around two thousand page faults per second with 2 MB pages and >200 thousand page faults with 4 KB pages):
$ cat /sys/kernel/mm/transparent_hugepage/enabled
[always] madvise never
$ perf stat -I1000 ./a.out
Iteration 0
Time to malloc: 8.10623e-06
Time to fill with data: 0.364378
Fill rate with data: 274.44 Mints/sec, 1097.76Mbytes/sec
Allocated 400 Mbytes, with total memory allocated 400Mbytes
Iteration 1
Time to malloc: 1.90735e-05
Time to fill with data: 0.357983
Fill rate with data: 279.343 Mints/sec, 1117.37Mbytes/sec
Allocated 400 Mbytes, with total memory allocated 800Mbytes
Iteration 2
Time to malloc: 1.69277e-05
# time counts unit events
1.000414902 999.893040 task-clock (msec)
1.000414902 1 context-switches # 0.001 K/sec
1.000414902 0 cpu-migrations # 0.000 K/sec
1.000414902 2,024 page-faults # 0.002 M/sec
1.000414902 2,664,963,857 cycles # 2.665 GHz
1.000414902 3,072,781,834 instructions # 1.15 insn per cycle
1.000414902 559,551,437 branches # 559.611 M/sec
1.000414902 25,176 branch-misses # 0.00% of all branches
Time to fill with data: 0.357014
Fill rate with data: 280.101 Mints/sec, 1120.4Mbytes/sec
Allocated 400 Mbytes, with total memory allocated 1200Mbytes
Iteration 3
Time to malloc: 1.71661e-05
Time to fill with data: 0.358964
Fill rate with data: 278.579 Mints/sec, 1114.32Mbytes/sec
Allocated 400 Mbytes, with total memory allocated 1600Mbytes
Iteration 4
Time to malloc: 1.69277e-05
Time to fill with data: 0.356918
Fill rate with data: 280.177 Mints/sec, 1120.71Mbytes/sec
Allocated 400 Mbytes, with total memory allocated 2000Mbytes
Iteration 5
Time to malloc: 1.50204e-05
2.000779126 1000.703872 task-clock (msec)
2.000779126 1 context-switches # 0.001 K/sec
2.000779126 0 cpu-migrations # 0.000 K/sec
2.000779126 2,280 page-faults # 0.002 M/sec
2.000779126 2,686,072,244 cycles # 2.685 GHz
2.000779126 3,094,777,285 instructions # 1.16 insn per cycle
2.000779126 563,593,105 branches # 563.425 M/sec
2.000779126 9,661 branch-misses # 0.00% of all branches
Time to fill with data: 0.371785
Fill rate with data: 268.973 Mints/sec, 1075.89Mbytes/sec
Allocated 400 Mbytes, with total memory allocated 2400Mbytes
Iteration 6
Time to malloc: 1.90735e-05
Time to fill with data: 0.418562
Fill rate with data: 238.913 Mints/sec, 955.653Mbytes/sec
Allocated 400 Mbytes, with total memory allocated 2800Mbytes
Iteration 7
Time to malloc: 2.09808e-05
3.001146481 1000.436128 task-clock (msec)
3.001146481 1 context-switches # 0.001 K/sec
3.001146481 0 cpu-migrations # 0.000 K/sec
3.001146481 217,415 page-faults # 0.217 M/sec
3.001146481 2,687,783,783 cycles # 2.687 GHz
3.001146481 3,100,713,038 instructions # 1.16 insn per cycle
3.001146481 560,207,049 branches # 560.014 M/sec
3.001146481 83,230 branch-misses # 0.01% of all branches
Time to fill with data: 0.416297
Fill rate with data: 240.213 Mints/sec, 960.853Mbytes/sec
Allocated 400 Mbytes, with total memory allocated 3200Mbytes
Iteration 8
Time to malloc: 1.38283e-05
Time to fill with data: 0.41672
Fill rate with data: 239.969 Mints/sec, 959.877Mbytes/sec
Allocated 400 Mbytes, with total memory allocated 3600Mbytes
Iteration 9
Time to malloc: 1.40667e-05
Time to fill with data: 0.424997
Fill rate with data: 235.296 Mints/sec, 941.183Mbytes/sec
Allocated 400 Mbytes, with total memory allocated 4000Mbytes
Iteration 10
Time to malloc: 1.28746e-05
4.001467773 1000.378604 task-clock (msec)
4.001467773 2 context-switches # 0.002 K/sec
4.001467773 0 cpu-migrations # 0.000 K/sec
4.001467773 232,690 page-faults # 0.233 M/sec
4.001467773 2,655,313,682 cycles # 2.654 GHz
4.001467773 3,087,157,016 instructions # 1.15 insn per cycle
4.001467773 557,266,313 branches # 557.070 M/sec
4.001467773 95,433 branch-misses # 0.02% of all branches
Time to fill with data: 0.413271
Fill rate with data: 241.972 Mints/sec, 967.888Mbytes/sec
Allocated 400 Mbytes, with total memory allocated 4400Mbytes
Iteration 11
Time to malloc: 1.21593e-05
Time to fill with data: 0.414624
Fill rate with data: 241.182 Mints/sec, 964.73Mbytes/sec
Allocated 400 Mbytes, with total memory allocated 4800Mbytes
Iteration 12
Time to malloc: 1.5974e-05
5.001792272 1000.372602 task-clock (msec)
5.001792272 2 context-switches # 0.002 K/sec
5.001792272 0 cpu-migrations # 0.000 K/sec
5.001792272 236,260 page-faults # 0.236 M/sec
5.001792272 2,687,340,230 cycles # 2.686 GHz
5.001792272 3,134,864,968 instructions # 1.17 insn per cycle
5.001792272 565,846,287 branches # 565.644 M/sec
5.001792272 104,634 branch-misses # 0.02% of all branches
Time to fill with data: 0.412331
Fill rate with data: 242.524 Mints/sec, 970.094Mbytes/sec
Allocated 400 Mbytes, with total memory allocated 5200Mbytes
Iteration 13
Time to malloc: 1.3113e-05
Time to fill with data: 0.414433
Fill rate with data: 241.294 Mints/sec, 965.174Mbytes/sec
Allocated 400 Mbytes, with total memory allocated 5600Mbytes
Iteration 14
Time to malloc: 1.88351e-05
Time to fill with data: 0.417277
Fill rate with data: 239.649 Mints/sec, 958.596Mbytes/sec
Allocated 400 Mbytes, with total memory allocated 6000Mbytes
6.002129544 1000.404270 task-clock (msec)
6.002129544 1 context-switches # 0.001 K/sec
6.002129544 0 cpu-migrations # 0.000 K/sec
6.002129544 215,269 page-faults # 0.215 M/sec
6.002129544 2,676,269,667 cycles # 2.675 GHz
6.002129544 3,286,469,282 instructions # 1.23 insn per cycle
6.002129544 578,367,266 branches # 578.156 M/sec
6.002129544 345,470 branch-misses # 0.06% of all branches
....
After disabling THP with the root command from https://access.redhat.com/solutions/46111, I always get ~200 thousand page faults per second and around 950 MB/s:
$ cat /sys/kernel/mm/transparent_hugepage/enabled
always [madvise] never
$ perf stat -I1000 ./a.out
Iteration 0
Time to malloc: 1.50204e-05
Time to fill with data: 0.422322
Fill rate with data: 236.786 Mints/sec, 947.145Mbytes/sec
Allocated 400 Mbytes, with total memory allocated 400Mbytes
Iteration 1
Time to malloc: 1.50204e-05
Time to fill with data: 0.415068
Fill rate with data: 240.924 Mints/sec, 963.698Mbytes/sec
Allocated 400 Mbytes, with total memory allocated 800Mbytes
Iteration 2
Time to malloc: 2.19345e-05
# time counts unit events
1.000162191 999.429856 task-clock (msec)
1.000162191 14 context-switches # 0.014 K/sec
1.000162191 0 cpu-migrations # 0.000 K/sec
1.000162191 232,727 page-faults # 0.233 M/sec
1.000162191 2,664,896,604 cycles # 2.666 GHz
1.000162191 3,080,713,267 instructions # 1.16 insn per cycle
1.000162191 555,116,838 branches # 555.434 M/sec
1.000162191 102,262 branch-misses # 0.02% of all branches
Time to fill with data: 0.440695
Fill rate with data: 226.914 Mints/sec, 907.658Mbytes/sec
Allocated 400 Mbytes, with total memory allocated 1200Mbytes
Iteration 3
Time to malloc: 2.09808e-05
Time to fill with data: 0.414463
Fill rate with data: 241.276 Mints/sec, 965.104Mbytes/sec
Allocated 400 Mbytes, with total memory allocated 1600Mbytes
Iteration 4
Time to malloc: 1.81198e-05
2.000544564 1000.142465 task-clock (msec)
2.000544564 16 context-switches # 0.016 K/sec
2.000544564 0 cpu-migrations # 0.000 K/sec
2.000544564 229,697 page-faults # 0.230 M/sec
2.000544564 2,621,180,984 cycles # 2.622 GHz
2.000544564 3,041,358,811 instructions # 1.15 insn per cycle
2.000544564 547,910,242 branches # 548.027 M/sec
2.000544564 93,682 branch-misses # 0.02% of all branches
Time to fill with data: 0.428383
Fill rate with data: 233.436 Mints/sec, 933.744Mbytes/sec
Allocated 400 Mbytes, with total memory allocated 2000Mbytes
Iteration 5
Time to malloc: 1.5974e-05
Time to fill with data: 0.421986
Fill rate with data: 236.975 Mints/sec, 947.899Mbytes/sec
Allocated 400 Mbytes, with total memory allocated 2400Mbytes
Iteration 6
Time to malloc: 1.5974e-05
Time to fill with data: 0.413477
Fill rate with data: 241.851 Mints/sec, 967.406Mbytes/sec
Allocated 400 Mbytes, with total memory allocated 2800Mbytes
Iteration 7
Time to malloc: 1.88351e-05
3.000866438 999.980461 task-clock (msec)
3.000866438 20 context-switches # 0.020 K/sec
3.000866438 0 cpu-migrations # 0.000 K/sec
3.000866438 231,194 page-faults # 0.231 M/sec
3.000866438 2,622,484,960 cycles # 2.623 GHz
3.000866438 3,061,610,229 instructions # 1.16 insn per cycle
3.000866438 551,533,361 branches # 551.616 M/sec
3.000866438 104,561 branch-misses # 0.02% of all branches
Time to fill with data: 0.448333
Fill rate with data: 223.048 Mints/sec, 892.194Mbytes/sec
Allocated 400 Mbytes, with total memory allocated 3200Mbytes
Iteration 8
Time to malloc: 1.50204e-05
Time to fill with data: 0.410566
Fill rate with data: 243.566 Mints/sec, 974.265Mbytes/sec
Allocated 400 Mbytes, with total memory allocated 3600Mbytes
Iteration 9
Time to malloc: 1.3113e-05
4.001231042 1000.098860 task-clock (msec)
4.001231042 17 context-switches # 0.017 K/sec
4.001231042 0 cpu-migrations # 0.000 K/sec
4.001231042 228,532 page-faults # 0.229 M/sec
4.001231042 2,586,146,024 cycles # 2.586 GHz
4.001231042 3,026,679,955 instructions # 1.15 insn per cycle
4.001231042 545,236,541 branches # 545.284 M/sec
4.001231042 115,251 branch-misses # 0.02% of all branches
Time to fill with data: 0.441442
Fill rate with data: 226.53 Mints/sec, 906.121Mbytes/sec
Allocated 400 Mbytes, with total memory allocated 4000Mbytes
Iteration 10
Time to malloc: 1.5974e-05
Time to fill with data: 0.42898
Fill rate with data: 233.111 Mints/sec, 932.445Mbytes/sec
Allocated 400 Mbytes, with total memory allocated 4400Mbytes
Iteration 11
Time to malloc: 2.00272e-05
5.001547227 999.982415 task-clock (msec)
5.001547227 19 context-switches # 0.019 K/sec
5.001547227 0 cpu-migrations # 0.000 K/sec
5.001547227 225,796 page-faults # 0.226 M/sec
5.001547227 2,560,990,918 cycles # 2.561 GHz
5.001547227 3,005,384,743 instructions # 1.15 insn per cycle
5.001547227 542,275,580 branches # 542.315 M/sec
5.001547227 116,537 branch-misses # 0.02% of all branches
Time to fill with data: 0.414212
Fill rate with data: 241.422 Mints/sec, 965.689Mbytes/sec
Allocated 400 Mbytes, with total memory allocated 4800Mbytes
Iteration 12
Time to malloc: 1.69277e-05
Time to fill with data: 0.411084
Fill rate with data: 243.259 Mints/sec, 973.037Mbytes/sec
Allocated 400 Mbytes, with total memory allocated 5200Mbytes
Iteration 13
Time to malloc: 1.40667e-05
Time to fill with data: 0.413644
Fill rate with data: 241.754 Mints/sec, 967.015Mbytes/sec
Allocated 400 Mbytes, with total memory allocated 5600Mbytes
Iteration 14
Time to malloc: 1.28746e-05
6.001849796 999.913923 task-clock (msec)
6.001849796 18 context-switches # 0.018 K/sec
6.001849796 0 cpu-migrations # 0.000 K/sec
6.001849796 236,912 page-faults # 0.237 M/sec
6.001849796 2,685,445,660 cycles # 2.686 GHz
6.001849796 3,153,464,551 instructions # 1.20 insn per cycle
6.001849796 568,989,467 branches # 569.032 M/sec
6.001849796 125,943 branch-misses # 0.02% of all branches
Time to fill with data: 0.444891
Fill rate with data: 224.774 Mints/sec, 899.097Mbytes/sec
Allocated 400 Mbytes, with total memory allocated 6000Mbytes
Test modified for perf stat with rate printing and limited iteration count:
$ cat test.c; g++ test.c
#include <sys/time.h>
#include <time.h>
#include <stdio.h>
#include <string.h>
#include <iostream>
#include <vector>
using namespace std;

double getWallTime()
{
    struct timeval time;
    if (gettimeofday(&time, NULL))
    {
        return 0;
    }
    return (double)time.tv_sec + (double)time.tv_usec * .000001;
}

#define M 1000000

int main()
{
    int *a;
    int n = 100000000;
    int j;
    double total = 0;
    for (j = 0; j < 15; j++)
    {
        cout << "Iteration " << j << endl;
        double start = getWallTime();
        a = new int[n]; // deliberately never freed: the test grows its footprint by 400 MB per iteration
        cout << "Time to malloc: " << getWallTime() - start << endl;
        for (int i = 0; i < n; i++)
        {
            a[i] = 1; // first touch: the pages are actually faulted in here
        }
        double elapsed = getWallTime() - start;
        cout << "Time to fill with data: " << elapsed << endl;
        cout << "Fill rate with data: " << n/elapsed/M << " Mints/sec, " << n*sizeof(int)/elapsed/M << "Mbytes/sec" << endl;
        total += n*sizeof(int)*1./M;
        cout << "Allocated " << n*sizeof(int)*1./M << " Mbytes, with total memory allocated " << total << "Mbytes" << endl;
    }
    return 0;
}
Test modified for second and third write access
$ g++ second.c -o second
$ cat second.c
#include <sys/time.h>
#include <time.h>
#include <stdio.h>
#include <string.h>
#include <iostream>
#include <vector>
using namespace std;

double getWallTime()
{
    struct timeval time;
    if (gettimeofday(&time, NULL))
    {
        return 0;
    }
    return (double)time.tv_sec + (double)time.tv_usec * .000001;
}

#define M 1000000

int main()
{
    int *a;
    int n = 100000000;
    int j;
    double total = 0;
    for (j = 0; j < 15; j++)
    {
        cout << "Iteration " << j << endl;
        double start = getWallTime();
        a = new int[n];
        cout << "Time to malloc: " << getWallTime() - start << endl;
        for (int i = 0; i < n; i++)
        {
            a[i] = 1; // first write: pages are faulted in here
        }
        double elapsed = getWallTime() - start;
        cout << "Time to fill with data: " << elapsed << endl;
        cout << "Fill rate with data: " << n/elapsed/M << " Mints/sec, " << n*sizeof(int)/elapsed/M << "Mbytes/sec" << endl;
        start = getWallTime();
        for (int i = 0; i < n; i++)
        {
            a[i] = 2; // second write: pages are already populated, no faults expected
        }
        elapsed = getWallTime() - start;
        cout << "Time to second write access of data: " << elapsed << endl;
        cout << "Access rate of data: " << n/elapsed/M << " Mints/sec, " << n*sizeof(int)/elapsed/M << "Mbytes/sec" << endl;
        start = getWallTime();
        for (int i = 0; i < n; i++)
        {
            a[i] = 3; // third write, for confirmation
        }
        elapsed = getWallTime() - start;
        cout << "Time to third write access of data: " << elapsed << endl;
        cout << "Access rate of data: " << n/elapsed/M << " Mints/sec, " << n*sizeof(int)/elapsed/M << "Mbytes/sec" << endl;
        total += n*sizeof(int)*1./M;
        cout << "Allocated " << n*sizeof(int)*1./M << " Mbytes, with total memory allocated " << total << "Mbytes" << endl;
    }
    return 0;
}
Without THP - around 1.25 GB/s for second and third access:
$ cat /sys/kernel/mm/transparent_hugepage/enabled
always [madvise] never
$ ./second
Iteration 0
Time to malloc: 9.05991e-06
Time to fill with data: 0.426387
Fill rate with data: 234.529 Mints/sec, 938.115Mbytes/sec
Time to second write access of data: 0.318292
Access rate of data: 314.177 Mints/sec, 1256.71Mbytes/sec
Time to third write access of data: 0.321722
Access rate of data: 310.827 Mints/sec, 1243.31Mbytes/sec
Allocated 400 Mbytes, with total memory allocated 400Mbytes
Iteration 1
Time to malloc: 3.50475e-05
Time to fill with data: 0.411859
Fill rate with data: 242.802 Mints/sec, 971.206Mbytes/sec
Time to second write access of data: 0.317989
Access rate of data: 314.476 Mints/sec, 1257.91Mbytes/sec
Time to third write access of data: 0.321637
Access rate of data: 310.91 Mints/sec, 1243.64Mbytes/sec
Allocated 400 Mbytes, with total memory allocated 800Mbytes
Iteration 2
Time to malloc: 2.81334e-05
Time to fill with data: 0.411918
Fill rate with data: 242.767 Mints/sec, 971.067Mbytes/sec
Time to second write access of data: 0.318647
Access rate of data: 313.827 Mints/sec, 1255.31Mbytes/sec
Time to third write access of data: 0.321041
Access rate of data: 311.487 Mints/sec, 1245.95Mbytes/sec
Allocated 400 Mbytes, with total memory allocated 1200Mbytes
Iteration 3
Time to malloc: 2.5034e-05
Time to fill with data: 0.411138
Fill rate with data: 243.227 Mints/sec, 972.909Mbytes/sec
Time to second write access of data: 0.318429
Access rate of data: 314.042 Mints/sec, 1256.17Mbytes/sec
Time to third write access of data: 0.321332
Access rate of data: 311.205 Mints/sec, 1244.82Mbytes/sec
Allocated 400 Mbytes, with total memory allocated 1600Mbytes
Iteration 4
Time to malloc: 3.71933e-05
Time to fill with data: 0.410922
Fill rate with data: 243.355 Mints/sec, 973.421Mbytes/sec
Time to second write access of data: 0.320262
Access rate of data: 312.244 Mints/sec, 1248.98Mbytes/sec
Time to third write access of data: 0.319223
Access rate of data: 313.261 Mints/sec, 1253.04Mbytes/sec
Allocated 400 Mbytes, with total memory allocated 2000Mbytes
Iteration 5
Time to malloc: 2.19345e-05
Time to fill with data: 0.418508
Fill rate with data: 238.944 Mints/sec, 955.777Mbytes/sec
Time to second write access of data: 0.320419
Access rate of data: 312.092 Mints/sec, 1248.37Mbytes/sec
Time to third write access of data: 0.319752
Access rate of data: 312.742 Mints/sec, 1250.97Mbytes/sec
Allocated 400 Mbytes, with total memory allocated 2400Mbytes
Iteration 6
Time to malloc: 3.19481e-05
Time to fill with data: 0.410054
Fill rate with data: 243.87 Mints/sec, 975.481Mbytes/sec
Time to second write access of data: 0.320244
Access rate of data: 312.262 Mints/sec, 1249.05Mbytes/sec
Time to third write access of data: 0.319546
Access rate of data: 312.944 Mints/sec, 1251.78Mbytes/sec
Allocated 400 Mbytes, with total memory allocated 2800Mbytes
Iteration 7
Time to malloc: 3.19481e-05
Time to fill with data: 0.409491
Fill rate with data: 244.206 Mints/sec, 976.822Mbytes/sec
Time to second write access of data: 0.318501
Access rate of data: 313.971 Mints/sec, 1255.88Mbytes/sec
Time to third write access of data: 0.320052
Access rate of data: 312.449 Mints/sec, 1249.8Mbytes/sec
Allocated 400 Mbytes, with total memory allocated 3200Mbytes
Iteration 8
Time to malloc: 2.5034e-05
Time to fill with data: 0.409922
Fill rate with data: 243.949 Mints/sec, 975.795Mbytes/sec
Time to second write access of data: 0.320583
Access rate of data: 311.932 Mints/sec, 1247.73Mbytes/sec
Time to third write access of data: 0.319478
Access rate of data: 313.011 Mints/sec, 1252.04Mbytes/sec
Allocated 400 Mbytes, with total memory allocated 3600Mbytes
Iteration 9
Time to malloc: 2.69413e-05
Time to fill with data: 0.41104
Fill rate with data: 243.285 Mints/sec, 973.141Mbytes/sec
Time to second write access of data: 0.320389
Access rate of data: 312.121 Mints/sec, 1248.48Mbytes/sec
Time to third write access of data: 0.319762
Access rate of data: 312.733 Mints/sec, 1250.93Mbytes/sec
Allocated 400 Mbytes, with total memory allocated 4000Mbytes
Iteration 10
Time to malloc: 2.59876e-05
Time to fill with data: 0.412612
Fill rate with data: 242.358 Mints/sec, 969.434Mbytes/sec
Time to second write access of data: 0.318304
Access rate of data: 314.165 Mints/sec, 1256.66Mbytes/sec
Time to third write access of data: 0.319453
Access rate of data: 313.035 Mints/sec, 1252.14Mbytes/sec
Allocated 400 Mbytes, with total memory allocated 4400Mbytes
Iteration 11
Time to malloc: 2.98023e-05
Time to fill with data: 0.412428
Fill rate with data: 242.467 Mints/sec, 969.866Mbytes/sec
Time to second write access of data: 0.318467
Access rate of data: 314.004 Mints/sec, 1256.02Mbytes/sec
Time to third write access of data: 0.319716
Access rate of data: 312.778 Mints/sec, 1251.11Mbytes/sec
Allocated 400 Mbytes, with total memory allocated 4800Mbytes
Iteration 12
Time to malloc: 2.69413e-05
Time to fill with data: 0.410515
Fill rate with data: 243.597 Mints/sec, 974.386Mbytes/sec
Time to second write access of data: 0.31832
Access rate of data: 314.149 Mints/sec, 1256.6Mbytes/sec
Time to third write access of data: 0.319569
Access rate of data: 312.921 Mints/sec, 1251.69Mbytes/sec
Allocated 400 Mbytes, with total memory allocated 5200Mbytes
Iteration 13
Time to malloc: 2.28882e-05
Time to fill with data: 0.412385
Fill rate with data: 242.492 Mints/sec, 969.967Mbytes/sec
Time to second write access of data: 0.318929
Access rate of data: 313.549 Mints/sec, 1254.2Mbytes/sec
Time to third write access of data: 0.31949
Access rate of data: 312.999 Mints/sec, 1252Mbytes/sec
Allocated 400 Mbytes, with total memory allocated 5600Mbytes
Iteration 14
Time to malloc: 2.90871e-05
Time to fill with data: 0.41235
Fill rate with data: 242.512 Mints/sec, 970.05Mbytes/sec
Time to second write access of data: 0.340456
Access rate of data: 293.724 Mints/sec, 1174.89Mbytes/sec
Time to third write access of data: 0.319716
Access rate of data: 312.778 Mints/sec, 1251.11Mbytes/sec
Allocated 400 Mbytes, with total memory allocated 6000Mbytes
With THP - a bit faster allocation, but the same speed for the second and third accesses:
$ cat /sys/kernel/mm/transparent_hugepage/enabled
[always] madvise never
$ ./second
Iteration 0
Time to malloc: 1.50204e-05
Time to fill with data: 0.365043
Fill rate with data: 273.94 Mints/sec, 1095.76Mbytes/sec
Time to second write access of data: 0.320503
Access rate of data: 312.01 Mints/sec, 1248.04Mbytes/sec
Time to third write access of data: 0.319442
Access rate of data: 313.046 Mints/sec, 1252.18Mbytes/sec
Allocated 400 Mbytes, with total memory allocated 400Mbytes
...
Iteration 14
Time to malloc: 2.7895e-05
Time to fill with data: 0.409294
Fill rate with data: 244.323 Mints/sec, 977.293Mbytes/sec
Time to second write access of data: 0.318422
Access rate of data: 314.049 Mints/sec, 1256.19Mbytes/sec
Time to third write access of data: 0.322098
Access rate of data: 310.465 Mints/sec, 1241.86Mbytes/sec
Allocated 400 Mbytes, with total memory allocated 6000Mbytes
From updates and the chat:
I did see kernel switched from 2MB page to 4KB page when slowdown occurred.
I managed to do "sudo perf top", and saw this in the top line when slowdown occurred.
16.84% [kernel] [k] compaction_alloc
perf top -g
- 31.27% 31.03% [kernel] [k] compaction_alloc
   - compaction_alloc
      - migrate_pages
           compact_zone
           compact_zone_order
           try_to_compact_pages
           __alloc_pages_direct_compact
           __alloc_pages_nodemask
           alloc_pages_vma
           do_huge_pmd_anonymous_page
           handle_mm_fault
           __do_page_fault
           do_page_fault
           page_fault
The slowdown is connected with THP being enabled and with slow 4 KB page faults. After the switch to 4 KB pages, page faults become very slow because of internal compaction mechanisms in the Linux kernel (is the kernel still trying to get some more huge pages?) - http://lwn.net/Articles/368869 and http://lwn.net/Articles/591998. THP on NUMA brings even more problems, from both the THP and the NUMA code.
The original problem is:
we launch several solvers simultaneously, based on a memory limit set by the user. In this case, the user may want to use all 230 GB of free RAM.
we do dynamic memory allocation/deallocation. When we reach the memory limit - in this case it could be, say, 150 GB (not 230 GB) - we see a dramatic slowdown.
I observe high system CPU usage and swap usage. So I made up this little program, which seems to reproduce my original problem.
I can suggest globally disabling THP (https://unix.stackexchange.com/questions/99154/disable-transparent-hugepage or http://www.olivierdoucet.info/blog/2012/05/19/debugging-a-mysql-stall/), or freeing most of the "cached" memory (with echo 3 > /proc/sys/vm/drop_caches from root) - this is a temporary (and not fast) workaround. With the cached memory freed there will be less need for compaction (but it will make other users' programs slower - they will need to re-read their data from disks/NFS).
Heavy swapping to a slow (rotating) disk can kill all performance from the moment it is used (swap on SSD is fast enough, and swap on NVMe is very fast).
You may also want to change the huge allocations in your software from the default new/delete to manual calls to anonymous mmap for allocation and munmap for deallocation, in order to control the flags (there are mmap and madvise flags for huge pages, and there is MAP_POPULATE - http://man7.org/linux/man-pages/man2/mmap.2.html http://man7.org/linux/man-pages/man2/madvise.2.html).
With MAP_POPULATE you will have a (very?) slow allocation, but all the memory allocated will really be usable from the moment of allocation (all accesses will be fast).
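A minimal sketch of such a wrapper (the names alloc_big/free_big are made up for this example, and error handling is kept minimal):
#include <sys/mman.h>
#include <cstddef>

// Allocate 'size' anonymous bytes with explicit control over prefaulting
// and THP, instead of relying on the new/malloc heuristics.
char* alloc_big(std::size_t size, bool populate, bool want_thp)
{
    int flags = MAP_PRIVATE | MAP_ANONYMOUS;
    if (populate)
        flags |= MAP_POPULATE;  // prefault everything now: slow mmap, fast first access
    void* p = mmap(NULL, size, PROT_READ | PROT_WRITE, flags, -1, 0);
    if (p == MAP_FAILED)
        return NULL;
    // Opt this region in or out of transparent huge pages.
    madvise(p, size, want_thp ? MADV_HUGEPAGE : MADV_NOHUGEPAGE);
    return (char*) p;
}

void free_big(char* p, std::size_t size)
{
    munmap(p, size);  // unlike free(), returns the memory to the kernel immediately
}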

Assigning Variables from CSV files (or another format) in C++

Hello Stack Overflow world :3 My name is Chris, and I have a slight issue, so I am going to present it in this format:
Part 1
I will present the materials & code snippets I am currently working with that ARE working.
Part 2
I will explain, to the best of my ability, my desired new way of achieving my goal.
Part 3
So you guys know I am not having you do all the work, I will present my attempts at said goal, as well as some approaches my research dug up that I did not fully understand.
Part 1
mobDB.csv Example:
ID Sprite kName iName LV HP SP EXP JEXP Range1 ATK1 ATK2 DEF MDEF STR AGI VIT INT DEX LUK Range2 Range3 Scale Race Element Mode Speed aDelay aMotion dMotion MEXP ExpPer MVP1id MVP1per MVP2id MVP2per MVP3id MVP3per Drop1id Drop1per Drop2id Drop2per Drop3id Drop3per Drop4id Drop4per Drop5id Drop5per Drop6id Drop6per Drop7id Drop7per Drop8id Drop8per Drop9id Drop9per DropCardid DropCardper
1001 SCORPION Scorpion Scorpion 24 1109 0 287 176 1 80 135 30 0 1 24 24 5 52 5 10 12 0 4 23 12693 200 1564 864 576 0 0 0 0 0 0 0 0 990 70 904 5500 757 57 943 210 7041 100 508 200 625 20 0 0 0 0 4068 1
1002 PORING Poring Poring 1 50 0 2 1 1 7 10 0 5 1 1 1 0 6 30 10 12 1 3 21 131 400 1872 672 480 0 0 0 0 0 0 0 0 909 7000 1202 100 938 400 512 1000 713 1500 512 150 619 20 0 0 0 0 4001 1
1004 HORNET Hornet Hornet 8 169 0 19 15 1 22 27 5 5 6 20 8 10 17 5 10 12 0 4 24 4489 150 1292 792 216 0 0 0 0 0 0 0 0 992 80 939 9000 909 3500 1208 15 511 350 518 150 0 0 0 0 0 0 4019 1
1005 FARMILIAR Familiar Familiar 8 155 0 28 15 1 20 28 0 0 1 12 8 5 28 0 10 12 0 2 27 14469 150 1276 576 384 0 0 0 0 0 0 0 0 913 5500 1105 20 2209 15 601 50 514 100 507 700 645 50 0 0 0 0 4020 1
1007 FABRE Fabre Fabre 2 63 0 3 2 1 8 11 0 0 1 2 4 0 7 5 10 12 0 4 22 385 400 1672 672 480 0 0 0 0 0 0 0 0 914 6500 949 500 1502 80 721 5 511 700 705 1000 1501 200 0 0 0 0 4002 1
1008 PUPA Pupa Pupa 2 427 0 2 4 0 1 2 0 20 1 1 1 0 1 20 10 12 0 4 22 256 1000 1001 1 1 0 0 0 0 0 0 0 0 1010 80 915 5500 938 600 2102 2 935 1000 938 600 1002 200 0 0 0 0 4003 1
1009 CONDOR Condor Condor 5 92 0 6 5 1 11 14 0 0 1 13 5 0 13 10 10 12 1 2 24 4233 150 1148 648 480 0 0 0 0 0 0 0 0 917 9000 1702 150 715 80 1750 5500 517 400 916 2000 582 600 0 0 0 0 4015 1
1010 WILOW Willow Willow 4 95 0 5 4 1 9 12 5 15 1 4 8 30 9 10 10 12 1 3 22 129 200 1672 672 432 0 0 0 0 0 0 0 0 902 9000 1019 100 907 1500 516 700 1068 3500 1067 2000 1066 1000 0 0 0 0 4010 1
1011 CHONCHON Chonchon Chonchon 4 67 0 5 4 1 10 13 10 0 1 10 4 5 12 2 10 12 0 4 24 385 200 1076 576 480 0 0 0 0 0 0 0 0 998 50 935 6500 909 1500 1205 55 601 100 742 5 1002 150 0 0 0 0 4009 1
So this is an example of the spreadsheet I have. This is what I wish to be using in my ideal goal, not what I am using right now. It was made in MS Excel 2010, using columns A-BF and rows 1-993.
Currently, in my working code, I am using manually implemented arrays. For example, for iName I have:
char iName[16][25] = {"Scorpion", "Poring", "Hornet", "Familiar", "null", "null", "null", "null", "null", "null", "null", "null", "null", "null", "null", "null"};
This is defined in a header file (bSystem.h). Now, to attach, say, their health values, I have to keep another array in the same header in the corresponding order, like so:
int HP[16] = {1109, 50, 169, 155, 95, 95, 118, 118, 142, 142, 167, 167, 193, 193, 220, 220};
The issue is that there is a large amount of data to hard-code into the various files I need for monsters, items, spells, skills, etc. On the original small scale, while getting certain systems built, it was fine. I have been using various void functions in header files to transfer data from file to file when it's called. But when I am dealing with 1,000+ monsters and all of their variables, putting them in manually is kind of ridiculous, lol...
Part 2
Now, my ideal system would be to load the data from the .CSV files directly. I have hit a decent number of issues in this task, such as converting the names pulled from the file into char arrays, and actually pulling the data from the CSV file and assigning specific sections to certain arrays. The main idea I have in mind, which I cannot seem to reach, is this:
I would like to find a way to simply read these various variables from the CSV file, so that when I call upon them like this:
cout << name << "(" << health << " health) VS. " << iName[enemy] << "(" << HP[enemy] << " health)";
where [enemy] appears, it would be the ID. The enemy encounter is in another header (lSystem.h), where it basically goes like:
case 0:
enemy = 0;
Here 0 is the first entry in the arrays involving monsters. I hate that it has to be order-specific. I would want to be able to say enemy = 1002; so that when the combat system starts, it can just pull the variables it needs from the enemy with ID 1002.
I always hit a few different issues: I can't get the program to pull the data from the file; when I can, I can only get it to store int values into int arrays, and I have trouble converting the strings to char arrays. Then the next issue is recalling it, and the actual saving part... which is where Part 3 comes in :3
Part 3
I have attempted a few different things so far and have done research on how to achieve this. What I have come across so far:
I could write a function that reads the data from, let's say, mobDB, records it into arrays, then outputs it to a .dat file, so that when I need to recall variables I can do so from the .dat instead of a user-modifiable CSV. I was presented with the same issues as far as reading and converting go.
I could go the SQL route, but I have had a ton of issues understanding how to pull the data from SQL. I have a PowerEdge 2003 server box in my house which I store data on; it has Navicat SQL Premium set up. So my main two questions about the SQL route are: is it possible to hook right into the SQL server, so that as I update the database, the running client just pulls the variables and data from the DB? Or would I be stuck compiling SQL files? Since this is an online game, I know I will have to use something to transfer data from server to client, which is why I am trying to set this up early in development so I have more to build on; I am sure I can use SQL servers for that? If anyone has a good grasp of how this works, I would very much like to take the SQL route.
Among the attempts I have made: using Boost to parse the data from the CSV instead of the standard library - the same issues were presented. I did read up on converting a string to a char array, but the problem was that once I pulled the data, I couldn't convert it.
I've also tried the ADO C++ route. Dead end there.
All in all, I have spent the last week or so on this. I would very much like to set up the SQL server to actually update the variables, but I am open to any working idea that offers ease of editing and implementing large amounts of data.
I appreciate any and all help. If anyone does attempt to write working code for this, would it be too much trouble to add comments to the parts you feel need explaining? I don't want someone to just give me a quick fix; I actually want to learn and understand what I am using. Thank you all very much :)
-Chris
Let's see if I understand your problem correctly: You are writing a game and currently all the stats for your game actors are hardcoded. You already have an Excel spreadsheet with this data and you just want to use this instead of the hardcoded header files, so that you can tweak the stats without waiting for a long recompilation. You are currently storing the stats in your code in a column-store fashion, i.e. one array per attribute. The CSV file stores stuff in a row-wise fashion. Correct so far?
Now my understanding of your problem becomes a little blurry. But let's try. If I understand you correctly, you want to completely remove the arrays from your code and directly access the CSV file when you need the stats for some creature? If so, then that is exactly the problem. File I/O is incredibly slow; you need to keep this data in main memory. Just keep the arrays, but instead of manually assigning the values in the headers, have a load function that reads the CSV file when you start the game and loads its contents into the arrays. You can keep the rest of your code unchanged.
Example:
void load (std::ifstream &csv)
{
    readFirstLineAndCheckThatItIsCorrect (csv);
    int id;
    while (csv >> id)  // reading the id first also terminates cleanly at end of file, unlike testing csv.eof()
    {
        std::string spriteName;
        csv >> spriteName >> kName[id] >> iName[id] >> LV[id] >> HP[id] >> SP[id] >> ...
        Sprite[id] = getSpriteForName (spriteName);
    }
}
Using a database system is completely out of scope here. All you need to do is load some data into some arrays. If you want to be able to change the stats without restarting the program, add some hotkey for reloading the CSV file.
If you plan to write an online game, then you still have a long way ahead of you. Even then, SQL is a very bad idea for exchanging data between server and clients because a) it just introduces way too much overhead and b) it is an open invitation for cheaters and hackers because if clients have direct access to your database, you can no longer validate their inputs. See http://forums.somethingawful.com/showthread.php?noseen=0&pagenumber=258&threadid=2803713 for an actual example.
If you really want this to be an online game, you need to design your own communication protocol. But maybe you should read some books about that first, because it really is a complex issue. For instance, you need to hide the latency from the user by guessing on the client side what the server and the other players will most likely do next, and gracefully correct your guesses if they were wrong, all without the player noticing (Dead Reckoning).
Still, good luck on your game and I hope to play it some day. :-)
IMO, the simplest thing to do would be to first create a struct that holds all the data for a monster. Here's a reduced version because I don't feel like typing all those variables.
struct Mob
{
    std::string SPRITE, kName, iName;
    int ID, LV, HP, SP, EXP;
};
The loading code for your particular format is then fairly simple:
// needs <sstream>, <fstream>, <string> and <vector>
bool ParseMob(const std::string & str, Mob & m)
{
    std::stringstream iss(str);
    Mob tmp;
    if (iss >> tmp.ID >> tmp.SPRITE >> tmp.kName >> tmp.iName
            >> tmp.LV >> tmp.HP >> tmp.SP >> tmp.EXP)
    {
        m = tmp;
        return true;
    }
    return false;
}

std::vector<Mob> LoadMobs()
{
    std::vector<Mob> mobs;
    Mob tmp;
    std::ifstream fin("mobDB.csv");
    for (std::string line; std::getline(fin, line); )
    {
        if (ParseMob(line, tmp))
            mobs.emplace_back(std::move(tmp));
    }
    return mobs;
}
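And, to get the enemy = 1002 style lookup you asked for, one possibility (a sketch, assuming the IDs in mobDB.csv are unique) is to index the loaded mobs by ID:
#include <map>

std::map<int, Mob> IndexMobs(const std::vector<Mob> & mobs)
{
    std::map<int, Mob> byId;
    for (const Mob & m : mobs)
        byId[m.ID] = m;  // last entry wins if an ID repeats
    return byId;
}

// Usage:
//   std::map<int, Mob> db = IndexMobs(LoadMobs());
//   std::cout << db[1002].iName << " has " << db[1002].HP << " HP\n";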

constructing a Data Frame in Rcpp

I want to construct a data frame in an Rcpp function, but when I get it back, it doesn't really look like a data frame. I've tried pushing vectors and so on, but it leads to the same thing. Consider:
RcppExport SEXP makeDataFrame(SEXP in) {
    Rcpp::DataFrame dfin(in);
    Rcpp::DataFrame dfout;
    for (int i = 0; i < dfin.length(); i++) {
        dfout.push_back(dfin(i));
    }
    return dfout;
}
in R:
> .Call("makeDataFrame",mtcars,"myPkg")
[[1]]
[1] 21.0 21.0 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 17.8 16.4 17.3 15.2 10.4
[16] 10.4 14.7 32.4 30.4 33.9 21.5 15.5 15.2 13.3 19.2 27.3 26.0 30.4 15.8 19.7
[31] 15.0 21.4
[[2]]
[1] 6 6 4 6 8 6 8 4 4 6 6 8 8 8 8 8 8 4 4 4 4 8 8 8 8 4 4 4 8 6 8 4
[[3]]
[1] 160.0 160.0 108.0 258.0 360.0 225.0 360.0 146.7 140.8 167.6 167.6 275.8
[13] 275.8 275.8 472.0 460.0 440.0 78.7 75.7 71.1 120.1 318.0 304.0 350.0
[25] 400.0 79.0 120.3 95.1 351.0 145.0 301.0 121.0
[[4]]
[1] 110 110 93 110 175 105 245 62 95 123 123 180 180 180 205 215 230 66 52
[20] 65 97 150 150 245 175 66 91 113 264 175 335 109
[[5]]
[1] 3.90 3.90 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 3.92 3.07 3.07 3.07 2.93
[16] 3.00 3.23 4.08 4.93 4.22 3.70 2.76 3.15 3.73 3.08 4.08 4.43 3.77 4.22 3.62
[31] 3.54 4.11
[[6]]
[1] 2.620 2.875 2.320 3.215 3.440 3.460 3.570 3.190 3.150 3.440 3.440 4.070
[13] 3.730 3.780 5.250 5.424 5.345 2.200 1.615 1.835 2.465 3.520 3.435 3.840
[25] 3.845 1.935 2.140 1.513 3.170 2.770 3.570 2.780
[[7]]
[1] 16.46 17.02 18.61 19.44 17.02 20.22 15.84 20.00 22.90 18.30 18.90 17.40
[13] 17.60 18.00 17.98 17.82 17.42 19.47 18.52 19.90 20.01 16.87 17.30 15.41
[25] 17.05 18.90 16.70 16.90 14.50 15.50 14.60 18.60
[[8]]
[1] 0 0 1 1 0 1 0 1 1 1 1 0 0 0 0 0 0 1 1 1 1 0 0 0 0 1 0 1 0 0 0 1
[[9]]
[1] 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 0 0 0 0 0 1 1 1 1 1 1 1
[[10]]
[1] 4 4 4 3 3 3 3 4 4 4 4 3 3 3 3 3 3 4 4 4 3 3 3 3 3 4 5 5 5 5 5 4
[[11]]
[1] 4 4 1 1 2 1 4 2 2 4 4 3 3 3 4 4 4 1 2 1 1 2 2 4 2 1 2 2 4 6 8 2
Briefly:
DataFrames are indeed just like lists, with the added restriction of having to have a common length, so they are best constructed column by column.
The best place to look is often our unit tests. Here, inst/unitTests/runit.DataFrame.R regroups the tests for the DataFrame class.
You also found the .push_back() member function in Rcpp, which we added for convenience and by analogy with the STL. We do warn that it is not recommended: due to differences in the way R objects are constructed, we essentially always need to make full copies, so .push_back is not very efficient.
Despite me answering here frequently, the rcpp-devel list is a better place for Rcpp questions.
It seems Rcpp can return a proper data.frame, provided you supply the names explicitly. I'm not sure how to adapt this to your example with arbitrary names:
mkdf <- '
    Rcpp::DataFrame dfin(input);
    Rcpp::DataFrame dfout;
    for (int i = 0; i < dfin.length(); i++) {
        dfout.push_back(dfin(i));
    }
    return Rcpp::DataFrame::create( Named("x") = dfout(1), Named("y") = dfout(2));
'
library(inline)
test <- cxxfunction( signature(input="data.frame"),
mkdf, plugin="Rcpp")
test(input=head(iris))
Using the information from @baptiste's answer, this is what finally gives a well-formed data frame:
RcppExport SEXP makeDataFrame(SEXP in) {
    Rcpp::DataFrame dfin(in);
    Rcpp::DataFrame dfout;
    Rcpp::CharacterVector namevec;
    std::string namestem = "Column Heading ";
    for (int i = 0; i < 2; i++) {
        dfout.push_back(dfin(i));
        namevec.push_back(namestem + std::string(1, (char)(((int)'a') + i)));
    }
    dfout.attr("names") = namevec;
    Rcpp::DataFrame x;
    Rcpp::Language call("as.data.frame", dfout);
    x = call.eval();
    return x;
}
I think the point remains that this might be inefficient, due to push_back (as suggested by @Dirk) and the second Language call evaluation. I looked at the Rcpp unit tests and haven't been able to come up with anything better yet. Anybody have any ideas?
Update:
Using @Dirk's suggestions (thanks!), this seems to be a simpler, more efficient solution:
RcppExport SEXP makeDataFrame(SEXP in) {
    Rcpp::DataFrame dfin(in);
    Rcpp::List myList(dfin.length());
    Rcpp::CharacterVector namevec;
    std::string namestem = "Column Heading ";
    for (int i = 0; i < dfin.length(); i++) {
        myList[i] = dfin(i);  // adding vectors
        namevec.push_back(namestem + std::string(1, (char)(((int)'a') + i)));  // making up column names
    }
    myList.attr("names") = namevec;
    Rcpp::DataFrame dfout(myList);
    return dfout;
}
I concur with joran. The output of a C function called from within R is a list of all its arguments, both "in" and "out", so each "column" of the data frame could be represented in the C function call as an argument. Once the result of the C function call is in R, all that remains to be done is to extract those list elements using list indexing and give them the appropriate names.