Why is dummy loop consuming kernel time?

Why is dummy loop consuming kernel time? - c++

I wrote this simple, useless program:
#include <iostream>
using namespace std;
#define TABSIZE 256*1024*1024
int main(){
volatile int *tab = new int[TABSIZE];
cerr << "start" << endl;
for(int i=0;i<TABSIZE;i+=1)
tab[i] *= i;
cerr << "stop" << endl;
delete []tab;
return 0;
}
After compiling it (witch O2) and running with time ./a.out I was really surprised at the results:
real 0m1,660s
user 0m0,462s
sys 0m1,177s
Clearly, most of the time is spent in kernel space, even though main loop has nothing to do with any syscalls. I can see string start immediately after program start, so new does not consume much time. Cognately, stop shows up just before end.
So, the question is: Why so much time is spent in kernel space?
I use GCC 7.4 on Linux 4.15.

volatile int *tab = new int[TABSIZE];
This statement allocates 1 GiB of RAM. This amount is marked in the page table of the process as used, but not really initialized nor made available (the pages are marked as "not present").
tab[i] *= i
At this point, the first access to a new page (usually 4KiB) results in a Page Fault, which is handled by the kernel, which then initializes a range of memory with zeros and marks the page as present. Then, until the next page, the acesses happen without page faults.
This memory management is the reason for kernel CPU usage.

Related

Memory usage doesn't increase when allocating memory

I would like to find out the amount of bytes used by a process from within a C++ program by inspecting the operating system's memory information. The reason I would like to do this is to find a possible overhead in memory allocation when allocating memory (due to memory control blocks/nodes in free lists etc.) Currently I am on mac and am using this code:
#include <mach/mach.h>
#include <iostream>
int getResidentMemoryUsage() {
task_basic_info t_info;
mach_msg_type_number_t t_info_count = TASK_BASIC_INFO_COUNT;
if (task_info(mach_task_self(), TASK_BASIC_INFO,
reinterpret_cast<task_info_t>(&t_info),
&t_info_count) == KERN_SUCCESS) {
return t_info.resident_size;
}
return -1;
}
int getVirtualMemoryUsage() {
task_basic_info t_info;
mach_msg_type_number_t t_info_count = TASK_BASIC_INFO_COUNT;
if (task_info(mach_task_self(), TASK_BASIC_INFO,
reinterpret_cast<task_info_t>(&t_info),
&t_info_count) == KERN_SUCCESS) {
return t_info.virtual_size;
}
return -1;
}
int main(void) {
int virtualMemoryBefore = getVirtualMemoryUsage();
int residentMemoryBefore = getResidentMemoryUsage();
int* a = new int(5);
int virtualMemoryAfter = getVirtualMemoryUsage();
int residentMemoryAfter = getResidentMemoryUsage();
std::cout << virtualMemoryBefore << " " << virtualMemoryAfter << std::endl;
std::cout << residentMemoryBefore << " " << residentMemoryAfter << std::endl;
return 0;
}
When running this code I would have expected to see that the memory usage has increased after allocating an int. However when I run the above code I get the following output:
75190272 75190272
819200 819200
I have several questions because this output does not make any sense.
Why hasn't either the virtual/resident memory changed after an integer has been allocated?
How come the operating system is allocating such large amounts of memory to a running process.
When I do run the code and check activity monitor I find that 304 kb of memory is used but that number differs from the virtual/resident memory usage obtained programmatically.
My end goal is to be able to find the memory overhead when assigning data, so is there a way to do this (i.e. determine the bytes used by the OS and compare with the bytes allocated to find the difference is what I am currently thinking of)
Thank you for reading

The C++ runtime typically allocates a block of memory when a program starts up, and then parcels this out to your code when you use things like new, and adds it back to the block when you call delete. Hence, the operating system doesn't know anything about individual new or delete calls. This is also true for malloc and free in C (or C++)

First you measure number of pages, not really memory allocated. Second the runtime pre allocates few pages at startup. If you want to observe something allocate more than a single int. Try allocating several thousands and you will observe some changes.

C++ iPhone Compiler Problems

I've written a simple code to practice some stuff, but it's not running as it's supposed to.
I'm programming on my iPhone just for fun and l'm using an app called c/c++ offline compiler which seems to work really well.
Anyway, I wrote a program to display numbers, and if the number is less than 5 digits, in each empty space display a star. Then next to that, display the memory address.
There are two questions I have. First, why are the stars not displaying when I run this. Second, each time run it, the memory addresses are different. Is this because of the compiler, or because of how the iPhone's memory works?
Source Code:
//c plus plus program 1
#include <iostream>
using namespace std;
int main (void){
for(int i=0; i<150;i+=30){
cout.width(5);
cout.fill('*');
cout<<i<< "="<<&i <<endl;
}
return 0;
}

It is normal that address of i is changing each time you run your program. As I know it is because of how system works with memory. Why did you think it will place your program in the same part every time? :-)
I tried you code in http://cpp.sh only and stars were shown:
****0=0xffcde9dc
***30=0xffcde9dc
***60=0xffcde9dc
***90=0xffcde9dc
**120=0xffcde9dc
Note, that all after second << was not took into consideration when output width was determined. So first of all to investigate the problem I would try something like
int main (void) {
cout.width(5);
cout.fill('*');
cout << 1 << endl;
cout << 2 << endl;
}
to understand if it compiler problem or not.

What could cause a mutex to misbehave?

I've been busy the last couple of months debugging a rare crash caused somewhere within a very large proprietary C++ image processing library, compiled with GCC 4.7.2 for an ARM Cortex-A9 Linux target. Since a common symptom was glibc complaining about heap corruption, the first step was to employ a heap corruption checker to catch oob memory writes. I used the technique described in https://stackoverflow.com/a/17850402/3779334 to divert all calls to free/malloc to my own function, padding every allocated chunk of memory with some amount of known data to catch out-of-bounds writes - but found nothing, even when padding with as much as 1 KB before and after every single allocated block (there are hundreds of thousands of allocated blocks due to intensive use of STL containers, so I can't enlarge the padding further, plus I assume any write more than 1KB out of bounds would eventually trigger a segfault anyway). This bounds checker has found other problems in the past so I don't doubt its functionality.
(Before anyone says 'Valgrind', yes, I have tried that too with no results either.)
Now, my memory bounds checker also has a feature where it prepends every allocated block with a data struct. These structs are all linked in one long linked list, to allow me to occasionally go over all allocations and test memory integrity. For some reason, even though all manipulations of this list are mutex protected, the list was getting corrupted. When investigating the issue, it began to seem like the mutex itself was occasionally failing to do its job. Here is the pseudocode:
pthread_mutex_t alloc_mutex;
static bool boolmutex; // set to false during init. volatile has no effect.
void malloc_wrapper() {
// ...
pthread_mutex_lock(&alloc_mutex);
if (boolmutex) {
printf("mutex misbehaving\n");
__THROW_ERROR__; // this happens!
}
boolmutex = true;
// manipulate linked list here
boolmutex = false;
pthread_mutex_unlock(&alloc_mutex);
// ...
}
The code commented with "this happens!" is occasionally reached, even though this seems impossible. My first theory was that the mutex data structure was being overwritten. I placed the mutex within a struct, with large arrays before and after it, but when this problem occurred the arrays were untouched so nothing seems to be overwritten.
So.. What kind of corruption could possibly cause this to happen, and how would I find and fix the cause?
A few more notes. The test program uses 3-4 threads for processing. Running with less threads seems to make the corruptions less common, but not disappear. The test runs for about 20 seconds each time and completes successfully in the vast majority of cases (I can have 10 units repeating the test, with the first failure occurring after 5 minutes to several hours). When the problem occurs it is quite late in the test (say, 15 seconds in), so this isn't a bad initialization issue. The memory bounds checker never catches actual out of bounds writes but glibc still occasionally fails with a corrupted heap error (Can such an error be caused by something other than an oob write?). Each failure generates a core dump with plenty of trace information; there is no pattern I can see in these dumps, no particular section of code that shows up more than others. This problem seems very specific to a particular family of algorithms and does not happen in other algorithms, so I'm quite certain this isn't a sporadic hardware or memory error. I have done many more tests to check for oob heap accesses which I don't want to list to keep this post from getting any longer.
Thanks in advance for any help!

Thanks to all commenters. I've tried nearly all suggestions with no results, when I finally decided to write a simple memory allocation stress test - one that would run a thread on each of the CPU cores (my unit is a Freescale i.MX6 quad core SoC), each allocating and freeing memory in random order at high speed. The test crashed with a glibc memory corruption error within minutes or a few hours at most.
Updating the kernel from 3.0.35 to 3.0.101 solved the problem; both the stress test and the image processing algorithm now run overnight without failing. The problem does not reproduce on Intel machines with the same kernel version, so the problem is specific either to ARM in general or perhaps to some patch Freescale included with the specific BSP version that included kernel 3.0.35.
For those curious, attached is the stress test source code. Set NUM_THREADS to the number of CPU cores and build with:
<cross-compiler-prefix>g++ -O3 test_heap.cpp -lpthread -o test_heap
I hope this information helps someone. Cheers :)
// Multithreaded heap stress test. By Itay Chamiel 20151012.
#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <unistd.h>
#include <assert.h>
#include <pthread.h>
#include <sys/time.h>
#define NUM_THREADS 4 // set to number of CPU cores
#define ALIVE_INDICATOR NUM_THREADS
// Each thread constantly allocates and frees memory. In each iteration of the infinite loop, decide at random whether to
// allocate or free a block of memory. A list of 500-1000 allocated blocks is maintained by each thread. When memory is allocated
// it is added to this list; when freeing, a random block is selected from this list, freed and removed from the list.
void* thr(void* arg) {
int* alive_flag = (int*)arg;
int thread_id = *alive_flag; // this is a number between 0 and (NUM_THREADS-1) given by main()
int cnt = 0;
timeval t_pre, t_post;
gettimeofday(&t_pre, NULL);
const int ALLOCATE=1, FREE=0;
const unsigned int MINSIZE=500, MAXSIZE=1000;
const int MAX_ALLOC=10000;
char* membufs[MAXSIZE];
unsigned int membufs_size = 0;
int num_allocs = 0, num_frees = 0;
while(1)
{
int action;
// Decide whether to allocate or free a memory block.
// if we have less than MINSIZE buffers, allocate.
if (membufs_size < MINSIZE) action = ALLOCATE;
// if we have MAXSIZE, free.
else if (membufs_size >= MAXSIZE) action = FREE;
// else, decide randomly.
else {
action = ((rand() & 0x1)? ALLOCATE : FREE);
}
if (action == ALLOCATE) {
// choose size to allocate, from 1 to MAX_ALLOC bytes
size_t size = (rand() % MAX_ALLOC) + 1;
// allocate and fill memory
char* buf = (char*)malloc(size);
memset(buf, 0x77, size);
// add buffer to list
membufs[membufs_size] = buf;
membufs_size++;
assert(membufs_size <= MAXSIZE);
num_allocs++;
}
else { // action == FREE
// choose a random buffer to free
size_t pos = rand() % membufs_size;
assert (pos < membufs_size);
// free and remove from list by replacing entry with last member
free(membufs[pos]);
membufs[pos] = membufs[membufs_size-1];
membufs_size--;
assert(membufs_size >= 0);
num_frees++;
}
// once in 10 seconds print a status update
gettimeofday(&t_post, NULL);
if (t_post.tv_sec - t_pre.tv_sec >= 10) {
printf("Thread %d [%d] - %d allocs %d frees. Alloced blocks %u.\n", thread_id, cnt++, num_allocs, num_frees, membufs_size);
gettimeofday(&t_pre, NULL);
}
// indicate alive to main thread
*alive_flag = ALIVE_INDICATOR;
}
return NULL;
}
int main()
{
int alive_flag[NUM_THREADS];
printf("Memory allocation stress test running on %d threads.\n", NUM_THREADS);
// start a thread for each core
for (int i=0; i<NUM_THREADS; i++) {
alive_flag[i] = i; // tell each thread its ID.
pthread_t th;
int ret = pthread_create(&th, NULL, thr, &alive_flag[i]);
assert(ret == 0);
}
while(1) {
sleep(10);
// check that all threads are alive
bool ok = true;
for (int i=0; i<NUM_THREADS; i++) {
if (alive_flag[i] != ALIVE_INDICATOR)
{
printf("Thread %d is not responding\n", i);
ok = false;
}
}
assert(ok);
for (int i=0; i<NUM_THREADS; i++)
alive_flag[i] = 0;
}
return 0;
}

How do you calculate memory access time?

I create a large boolean 2d array (5000X5000 for a total of 25 billion elements at 23MB). Then I loop through and instantiate every element with a random true or false. Then I loop through and read every single element. All 25 million elements are read in ~100ms.
23MB is too big to fit in the CPU's cache and I think my program is too simple to benefit from any type of compiler optimization so am I right to conclude that the program is reading 25 million elements from RAM in ~100ms?
#include "stdafx.h"
#include <iostream>
#include <chrono>
using namespace std;
int _tmain(int argc, _TCHAR* argv[])
{
bool **locs;
locs = new bool*[5000];
for(int i = 0; i < 5000; i++)
locs[i] = new bool[5000];
for(int i = 0; i < 5000; i++)
for(int i2 = 0; i2 < 5000; i2++)
locs[i][i2] = rand() % 2 == 0 ? true : false;
int *idx = new int [5000*5000];
for(int i = 0; i < 5000*5000; i++)
*(idx + i) = rand() % 4999;
bool val;
int memAccesses = 0;
auto start = std::chrono::high_resolution_clock::now();
for(int i = 0; i < 5000*5000; i++) {
val = locs[*(idx + i)][*(idx + ++i)];
memAccesses += 2;
}
auto finish = std::chrono::high_resolution_clock::now();
std::cout << std::chrono::duration_cast<std::chrono::nanoseconds>(finish-start).count() << " ns\n";
std::cout << std::chrono::duration_cast<std::chrono::milliseconds>(finish-start).count() << " ms\n";
cout << "TOTAL MEMORY ACCESSES: " << memAccesses << endl;
cout << "The size of the array in memory is " << ((sizeof(bool)*5000*5000)/1048576) << "MB";
int exit; cin >> exit;
return 0;
}
/*
OUTPUT IS:
137013700 ns
137 ms
TOTAL MEMORY ACCESSES: 25000000
The size of the array in memory is 23MB
*/

As other answers have mentioned, the "speed" you are seeing (even if the CPU is executing your code and it is not stripped by the compiler) is about 250 MBps, which is very very low number for modern systems.
However, your methodology seems flawed to me (admittedly, I'm not an expert in benchmarking.) And here are the problems I see:
For any benchmark such as this, even in the simplest form, you need to distinguish random-access from sequential-access. Memory is not a random-access device (despite its name) and performs very poorly here. Your code seems to be accessing memory randomly, so you add that to your conclusion as a qualifier: that you are "reading 25 million elements from random locations from RAM in ~100ms."
Another aspect of this sort of benchmarks is the concept of latency vs. throughput. Again, if you want to conclude anything from your numbers and timings, you need to be aware what are you measuring exactly.
You are counting memory accesses incorrectly. Depending of the exact code your compiler is generating, this line:
val = locs[*(idx + i)][*(idx + ++i)];
might realistically access the memory system anywhere between 4 to 9 times.
At best, if i, idx, locs and val are all either in registers or access to them is eliminated, then you need to read *(idx + i), read locs[*(idx + i)] (remember that locs is an array of pointers to arrays, not a 2D array,) read *(idx + ++i), and finally read locs[*(idx + i)][*(idx + ++i)]. A few of these might be cached, but it's unlikely, with the cache-thrashing that's going on.
At worst, in addition to the above, you need two accesses for ++i (read, then write back,) one for idx, one for locs and one for val. I don't know, you might even need another read for the single i and/or two reads for the two idx occurrences (due to pointer aliasing and whatnot.)
You need to be aware that memory is never accessed in single bytes or even words. Memory is always read and written in units of cache-line. And cache line size can be different from system to system, although the most common size these days is 64 bytes. So, each time you read a memory location that is not in the cache, you are loading 64-bytes (or more) from RAM. If the memory locations you are reading are at the cache line boundary (some of the bytes in one cache line and some in the next) then you are loading two cache lines from RAM. Given a sane compiler and properly aligned variables in memory, this doesn't happen very often, but it might. So you have to at least multiply your calculated bandwidth used by the size of your cache line.
However, if you are accessing a memory location that is already in cache, then you don't load anything from RAM. You need to consider this in your conclusions too.
You also need to consider cache line eviction, your cache's associativity, number of levels, the fact that some cache levels are shared between instructions and data and some aren't, some are shared between cores and some aren't, and a lot of other things when evaluating the performance of caches and memory.
The DRAM chips also have a lot of weird and complex behaviors and characteristics. Some memory locations are faster to read after some others (due to the arrangements of rows and columns,) some accesses might get delayed a long time (at CPU speeds) because of the refresh cycle, other devices might be using the RAM or the bus that RAM is on, etc., etc. I'm far from familiar with the operations of modern memory chips, and even I know that it's a complete mess.
You have to consider the effects of compiler optimization on your code. This means that you have to look t your code after the compiler is done with it, in assembly form. You need to look at the generated assembly to be able to know what your code is actually doing: whether and which of your memory accesses are optimized out.
All in all, I don't think that you can conclude much useful information from your program. Sorry about that, but memory is very complex!

Portions (blocks) of memory will be stored in the processor cache at a time, which allows the processor to quickly access those items. However, that speed is perfectly reasonable for modern memory. Even the slowest DDR3 ram can transfer data at about 6 GB/s.

Cache usage is independent from program's complexity. Whenever data is read from RAM it goes into cache. Since cache has a certain size, there's always that amount of data available. If you access a memory location next to the previous, there is a good chance it will be cached already. In such case RAM is not accessed.
I would suggest reading CPU cache wikipedia entry to broaden your knowledge.
BTW: val = locs[*(idx + i)][*(idx + ++i)]; are you certain that this is evaluated from left to right? I am not. This is an undefined behavior. I'd suggest putting the ++i below the accessor line.
//EDIT:
There is nothing done with the value read from memory. It is quite possible that these instructions are not executed at all! Check the bytecode or add a (void) val; instruction which should force it to be generated.

No. The reads won't always go all the way down to the RAM. Blocks of memory get pulled into the cache when a read (or write) is performed. As long as the block from which you are reading is already in the cache, the cache is used. If you request data from a block that is not in the cache, then the RAM is accessed to fetch the block of memory and place it in the cache. Reading from the cache is significantly cheaper than reading from RAM.
EDIT
Again, write oprerations cause blocks from memory to get pulled into the cache. Because you are storing the values in your program before reading them, the data you are reading is most likely already in the cache from when you stored it. Therefore, it is likely that your loop that reads the values never needs to access RAM.

Seg faults with pthreads_mutex

I am implementing a particle interaction simulator in pthreads,and I keep getting segmentation faults in my pthreads code. The fault occurs in the following loop, which each thread does in the end of each timestep in my thread_routine:
for (int i = first; i < last; i++)
{
get_id(particles[i], box_id);
pthread_mutex_lock(&locks[box_id.x + box_no * box_id.y]);
//cout << box_id.x << "," << box_id.y << "," << thread_id << "l" << endl;
box[box_id.x][box_id.y].push_back(&particles[i]);
//cout << box_id.x << box_id.y << endl;
pthread_mutex_unlock(&locks[box_id.x + box_no * box_id.y]);
}
The strange thing is that if I uncomment one (it doesn't matter which one) or both of the couts, the program runs as expected, with no errors occurring (but this obviously kills performance, and isn't an elegant solution), giving correct output.
box is a globally declared
vector < vector < vector < particle_t*> > > box
which represents a decomposition of my (square) domain into boxes.
When the loop starts, box[i][j].size() has been set to zero for all i, j, and the loop is supposed to put particles back into the box-structure (the get_id function gives correct results, I've checked)
The array pthread_mutex_t locks is declared as a global
pthread_mutex_t * locks,
and the size is set by thread 0 and the locks initialized by thread 0 before the other threads are created:
locks = (pthread_mutex_t *) malloc( box_no*box_no * sizeof( pthread_mutex_t ) );
for (int i = 0; i < box_no*box_no; i++)
{
pthread_mutex_init(&locks[i],NULL);
}
Do you have any idea of what could cause this? The code also runs if the number of processors is set to 1, and it seems like the more processors I run on, the earlier the seg fault occurs (it has run through the entire simulation once on two processors, but this seems to be an exception)
Thanks

This is only an educated guess, but based on the problem going away if you use one lock for all the boxes: push_back has to allocate memory, which it does via the std::allocator template. I don't think allocator is guaranteed to be thread-safe and I don't think it's guaranteed to be partitioned, one for each vector, either. (The underlying operator new is thread-safe, but allocator usually does block-slicing tricks to amortize operator new's cost.)
Is it practical for you to use reserve to preallocate space for all your vectors ahead of time, using some conservative estimate of how many particles are going to wind up in each box? That's the first thing I'd try.
The other thing I'd try is using one lock for all the boxes, which we know works, but moving the lock/unlock operations outside the for loop so that each thread gets to stash all its items at once. That might actually be faster than what you're trying to do -- less lock thrashing.

Are the box and box[i] vectors initialized properly? You only say the innermost set of vectors are set. Otherwise it looks like box_id's x or y component is wrong and running off the end of one of your arrays.
What part of the look is it crashing on?

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js