I am trapped in a weird situation: my C++ code keeps consuming more memory (reaching around 70 GB) until the whole process gets killed.
I am invoking C++ code from Python; the C++ code implements the longest common subsequence (LCS) length algorithm.
The C++ code is shown below:
#define MAX(a,b) (((a)>(b))?(a):(b))
#include <stdio.h>

int LCSLength(long unsigned X[], long unsigned Y[], int m, int n)
{
    int** L = new int*[m+1];
    for(int i = 0; i < m+1; ++i)
        L[i] = new int[n+1];
    printf("i am hre\n");
    int i, j;
    for(i=0; i<=m; i++)
    {
        printf("i am hre1\n");
        for(j=0; j<=n; j++)
        {
            if(i==0 || j==0)
                L[i][j] = 0;
            else if(X[i-1]==Y[j-1])
                L[i][j] = L[i-1][j-1]+1;
            else
                L[i][j] = MAX(L[i-1][j],L[i][j-1]);
        }
    }
    int tt = L[m][n];
    printf("i am hre2\n");
    for (i = 0; i < m+1; i++)
        delete [] L[i];
    delete [] L;
    return tt;
}
And my Python code is like this:
from ctypes import cdll
import ctypes

lib = cdll.LoadLibrary('./liblcs.so')
la = 36840
lb = 833841
a = (ctypes.c_ulong * la)()
b = (ctypes.c_ulong * lb)()
for i in range(la):
    a[i] = 1
for i in range(lb):
    b[i] = 1

print "test"
lib._Z9LCSLengthPmS_ii(a, b, la, lb)
IMHO, after the new operations in the C++ code, which allocate a large amount of memory on the heap, there should be no additional memory consumption inside the loop.
However, to my surprise, I observed that the used memory keeps increasing during the loop. (I am watching it with top on Linux, and the program keeps printing i am hre1 before the process gets killed.)
This really confuses me: after the memory allocation, there are only some arithmetic operations inside the loop, so why does the code take more and more memory?
Am I clear enough? Could anyone give me some help on this issue? Thank you!
You're consuming too much memory. The reason the process does not die at allocation time is that Linux allows you to allocate more memory than it can actually back (memory overcommit):
http://serverfault.com/questions/141988/avoid-linux-out-of-memory-application-teardown
I just did the same thing on a test machine. I was able to get past the calls to new and start the loop; only when the system decided that I was eating too much of the available RAM did it kill my process.
This is what I got: a lovely OOM message in dmesg.
[287602.898843] Out of memory: Kill process 7476 (a.out) score 792 or sacrifice child
[287602.899900] Killed process 7476 (a.out) total-vm:2885212kB, anon-rss:907032kB, file-rss:0kB, shmem-rss:0kB
On Linux you would see something like this in your kernel logs or in the output of dmesg:
[287585.306678] Out of memory: Kill process 7469 (a.out) score 787 or sacrifice child
[287585.307759] Killed process 7469 (a.out) total-vm:2885208kB, anon-rss:906912kB, file-rss:4kB, shmem-rss:0kB
[287602.754624] a.out invoked oom-killer: gfp_mask=0x24201ca, order=0, oom_score_adj=0
[287602.755843] a.out cpuset=/ mems_allowed=0
[287602.756482] CPU: 0 PID: 7476 Comm: a.out Not tainted 4.5.0-x86_64-linode65 #2
[287602.757592] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.8.2-0-g33fbe13 by qemu-project.org 04/01/2014
[287602.759461] 0000000000000000 ffff88003d845780 ffffffff815abd27 0000000000000000
[287602.760689] 0000000000000282 ffff88003a377c58 ffffffff811d0e82 ffff8800397f8270
[287602.761915] 0000000000f7d192 000105902804d798 ffffffff81046a71 ffff88003d845780
[287602.763192] Call Trace:
[287602.763532] [<ffffffff815abd27>] ? dump_stack+0x63/0x84
[287602.774614] [<ffffffff811d0e82>] ? dump_header+0x59/0x1ed
[287602.775454] [<ffffffff81046a71>] ? kvm_clock_read+0x1b/0x1d
[287602.776322] [<ffffffff8112b046>] ? ktime_get+0x49/0x91
[287602.777127] [<ffffffff81156c83>] ? delayacct_end+0x3b/0x60
[287602.777970] [<ffffffff81187c11>] ? oom_kill_process+0xc0/0x367
[287602.778866] [<ffffffff811882c5>] ? out_of_memory+0x3bf/0x406
[287602.779755] [<ffffffff8118c646>] ? __alloc_pages_nodemask+0x8fc/0xa6b
[287602.780756] [<ffffffff811c095d>] ? alloc_pages_current+0xbc/0xe0
[287602.781686] [<ffffffff81186c1d>] ? filemap_fault+0x2d3/0x48b
[287602.782561] [<ffffffff8128adea>] ? ext4_filemap_fault+0x37/0x51
[287602.783511] [<ffffffff811a9d56>] ? __do_fault+0x68/0xb1
[287602.784310] [<ffffffff811adcaa>] ? handle_mm_fault+0x6a4/0xd1b
[287602.785216] [<ffffffff810496cd>] ? __do_page_fault+0x33d/0x398
[287602.786124] [<ffffffff819c6ab8>] ? async_page_fault+0x28/0x30
Take a look at what you are doing:
#include <iostream>

int main(){
    int m = 36840;
    int n = 833841;
    unsigned long total = 0;
    total += (sizeof(int*) * (m+1));     // the array of row pointers
    for(int i = 0; i < m+1; ++i){
        total += (sizeof(int) * (n+1));  // one row of ints per iteration
    }
    std::cout << total << '\n';
}
You're simply consuming too much memory.
If the size of your int is 4 bytes, you are allocating 122 GB.
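If you only need the length of the LCS (not the subsequence itself), a common fix is to keep just two rows of the DP table instead of the full (m+1) x (n+1) matrix, which brings the memory usage down from roughly 122 GB to a few megabytes. A minimal sketch of that idea (an illustration, not your original function):
#include <vector>
#include <algorithm>

// Space-optimized LCS length: keeps only the previous and current DP rows,
// so memory usage is O(n) instead of O(m*n).
int LCSLength2Rows(const unsigned long X[], const unsigned long Y[], int m, int n)
{
    std::vector<int> prev(n + 1, 0), curr(n + 1, 0);
    for (int i = 1; i <= m; ++i) {
        for (int j = 1; j <= n; ++j) {
            if (X[i - 1] == Y[j - 1])
                curr[j] = prev[j - 1] + 1;
            else
                curr[j] = std::max(prev[j], curr[j - 1]);
        }
        std::swap(prev, curr);   // the current row becomes the previous row
    }
    return prev[n];              // after the final swap, prev holds the last row
}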
I have a function which is apparently the bottleneck of my whole program, and I thought that parallelization with OpenMP could help.
Here is a working example of my computation (sorry, the function is a little bit long). In my program, some of the work before the five nested loops is done elsewhere and is not a problem at all for efficiency.
#include <vector>
#include <iostream>
#include <cmath>
#include <cstdio>
#include <chrono>
#include "boost/dynamic_bitset.hpp"

using namespace std::chrono;

void compute_mddr(unsigned Ns, unsigned int block_size, unsigned int sector)
{
    std::vector<unsigned int> basis;
    for (std::size_t s = 0; s != std::pow(2,Ns); s++) {
        boost::dynamic_bitset<> s_bin(Ns,s);
        if (s_bin.count() == Ns/2) {
            basis.push_back(s);
        }
    }

    std::vector<double> gs(basis.size());
    for (unsigned int i = 0; i != gs.size(); i++)
        gs[i] = double(std::rand())/RAND_MAX;

    unsigned int ns_A = block_size;
    unsigned int ns_B = Ns-ns_A;
    boost::dynamic_bitset<> mask_A(Ns,(1<<ns_A)-(1<<0));
    boost::dynamic_bitset<> mask_B(Ns,((1<<ns_B)-(1<<0))<<ns_A);

    // Find the basis of the A block
    unsigned int NAsec = sector;
    std::vector<double> basis_NAsec;
    for (unsigned int s = 0; s < std::pow(2,ns_A); s++) {
        boost::dynamic_bitset<> s_bin(ns_A,s);
        if (s_bin.count() == NAsec)
            basis_NAsec.push_back(s);
    }
    unsigned int bs_A = basis_NAsec.size();

    // Find the basis of the B block
    unsigned int NBsec = (Ns/2)-sector;
    std::vector<double> basis_NBsec;
    for (unsigned int s = 0; s < std::pow(2,ns_B); s++) {
        boost::dynamic_bitset<> s_bin(ns_B,s);
        if (s_bin.count() == NBsec)
            basis_NBsec.push_back(s);
    }
    unsigned int bs_B = basis_NBsec.size();

    std::vector<std::vector<double> > mddr(bs_A);
    for (unsigned int i = 0; i != mddr.size(); i++) {
        mddr[i].resize(bs_A);
        for (unsigned int j = 0; j != mddr[i].size(); j++) {
            mddr[i][j] = 0.0;
        }
    }

    // Main calculation part
    for (unsigned int mu_A = 0; mu_A != bs_A; mu_A++) { // loop 1
        boost::dynamic_bitset<> mu_A_bin(ns_A,basis_NAsec[mu_A]);
        for (unsigned int nu_A = mu_A; nu_A != bs_A; nu_A++) { // loop 2
            boost::dynamic_bitset<> nu_A_bin(ns_A,basis_NAsec[nu_A]);
            double sum = 0.0;

            #pragma omp parallel for reduction(+:sum)
            for (unsigned int mu_B = 0; mu_B < bs_B; mu_B++) { // loop 3
                boost::dynamic_bitset<> mu_B_bin(ns_B,basis_NBsec[mu_B]);
                for (unsigned int si = 0; si != basis.size(); si++) { // loop 4
                    boost::dynamic_bitset<> si_bin(Ns,basis[si]);
                    boost::dynamic_bitset<> si_A_bin = si_bin & mask_A;
                    si_A_bin.resize(ns_A);
                    if (si_A_bin != mu_A_bin)
                        continue;

                    boost::dynamic_bitset<> si_B_bin = (si_bin & mask_B)>>ns_A;
                    si_B_bin.resize(ns_B);
                    if (si_B_bin != mu_B_bin)
                        continue;

                    for (unsigned int sj = 0; sj < basis.size(); sj++) { // loop 5
                        boost::dynamic_bitset<> sj_bin(Ns,basis[sj]);
                        boost::dynamic_bitset<> sj_A_bin = sj_bin & mask_A;
                        sj_A_bin.resize(ns_A);
                        if (sj_A_bin != nu_A_bin)
                            continue;

                        boost::dynamic_bitset<> sj_B_bin = (sj_bin & mask_B)>>ns_A;
                        sj_B_bin.resize(ns_B);
                        if (sj_B_bin != mu_B_bin)
                            continue;

                        sum += gs[si]*gs[sj];
                    }
                }
            }
            mddr[nu_A][mu_A] = mddr[mu_A][nu_A] = sum;
        }
    }
}

int main()
{
    unsigned int l = 8;
    unsigned int Ns = 2*l;
    unsigned block_size = 6; // must be between 1 and l
    unsigned sector = (block_size%2 == 0) ? block_size/2 : (block_size+1)/2;

    high_resolution_clock::time_point t1 = high_resolution_clock::now();
    compute_mddr(Ns,block_size,sector);
    high_resolution_clock::time_point t2 = high_resolution_clock::now();
    duration<double> time_span = duration_cast<duration<double>>(t2 - t1);
    std::cout << "Function took " << time_span.count() << " seconds.";
    std::cout << std::endl;
}
The compute_mddr function basically just fills the matrix mddr, and this corresponds to the outermost loops 1 and 2.
I decided to parallelize loop 3, since it's essentially computing a sum. To give orders of magnitude, loop 3 runs over ~50-100 elements of the basis_NBsec vector, while the two innermost loops si and sj run over the ~10000 elements of the vector basis.
However, when running the code (compiled with -O3 -fopenmp on gcc 5.4.0, Ubuntu 16.04, and an i5-4440 CPU) I see either no speed-up (2 threads) or a very limited gain (3 and 4 threads):
time OMP_NUM_THREADS=1 ./a.out
Function took 230.435 seconds.
real 3m50.439s
user 3m50.428s
sys 0m0.000s
time OMP_NUM_THREADS=2 ./a.out
Function took 227.754 seconds.
real 3m47.758s
user 7m2.140s
sys 0m0.048s
time OMP_NUM_THREADS=3 ./a.out
Function took 181.492 seconds.
real 3m1.495s
user 7m36.056s
sys 0m0.036s
time OMP_NUM_THREADS=4 ./a.out
Function took 150.564 seconds.
real 2m30.568s
user 7m56.156s
sys 0m0.096s
If I understand the user numbers correctly, for 3 and 4 threads the CPU usage is not good (and indeed, while the code is running I get ~250% CPU usage with 3 threads and barely 300% with 4 threads).
This is my first use of OpenMP; I have only played with it quickly on simple examples. As far as I can see, I'm not modifying any of the shared vectors basis_NAsec, basis_NBsec and basis in the parallel part, only reading them (this was an aspect pointed out in several related questions I read).
So, what am I doing wrong?
Taking a quick look at the performance of your program with perf record shows that, regardless of the number of threads, most of the time is spent in malloc and free. That's generally a bad sign, and it also inhibits parallelization.
Samples: 1M of event 'cycles:pp', Event count (approx.): 743045339605
Children Self Command Shared Object Symbol
+ 17.14% 17.12% a.out a.out [.] _Z12compute_mddrjjj._omp_fn.0
+ 15.45% 15.43% a.out libc-2.23.so [.] __memcmp_sse4_1
+ 15.21% 15.19% a.out libc-2.23.so [.] __memset_avx2
+ 13.09% 13.07% a.out libc-2.23.so [.] _int_free
+ 11.66% 11.65% a.out libc-2.23.so [.] _int_malloc
+ 10.21% 10.20% a.out libc-2.23.so [.] malloc
The cause of all the malloc and free traffic is the constant creation of boost::dynamic_bitset objects, which are basically std::vectors. Note: with perf it can be challenging to find the callers of a particular function. You can simply run the program under gdb, interrupt it during the execution phase, set a breakpoint on malloc, and continue a few times to figure out the callers.
The direct approach to improving performance is to keep those objects alive as long as possible, so they are not reallocated over and over again. This goes against the usual good practice of declaring variables as locally as possible. The transformation that reuses the dynamic_bitset objects could look like the following:
#pragma omp parallel for reduction(+:sum)
for (unsigned int mu_B = 0; mu_B < bs_B; mu_B++) { // loop 3
    boost::dynamic_bitset<> mu_B_bin(ns_B,basis_NBsec[mu_B]);
    boost::dynamic_bitset<> si_bin(Ns);
    boost::dynamic_bitset<> si_A_bin(Ns);
    boost::dynamic_bitset<> si_B_bin(Ns);
    boost::dynamic_bitset<> sj_bin(Ns);
    boost::dynamic_bitset<> sj_A_bin(Ns);
    boost::dynamic_bitset<> sj_B_bin(Ns);

    for (unsigned int si = 0; si != basis.size(); si++) { // loop 4
        si_bin = basis[si];
        si_A_bin = si_bin;
        assert(si_bin.size() == Ns);
        assert(si_A_bin.size() == Ns);
        assert(mask_A.size() == Ns);
        si_A_bin &= mask_A;
        si_A_bin.resize(ns_A);
        if (si_A_bin != mu_A_bin)
            continue;

        si_B_bin = si_bin;
        assert(si_bin.size() == Ns);
        assert(si_B_bin.size() == Ns);
        assert(mask_B.size() == Ns);
        // Optimization note: dynamic_bitset::operator&
        // does create a new object, operator&= does not.
        // Same for >>.
        si_B_bin &= mask_B;
        si_B_bin >>= ns_A;
        si_B_bin.resize(ns_B);
        if (si_B_bin != mu_B_bin)
            continue;

        for (unsigned int sj = 0; sj < basis.size(); sj++) { // loop 5
            sj_bin = basis[sj];
            sj_A_bin = sj_bin;
            assert(sj_bin.size() == Ns);
            assert(sj_A_bin.size() == Ns);
            assert(mask_A.size() == Ns);
            sj_A_bin &= mask_A;
            sj_A_bin.resize(ns_A);
            if (sj_A_bin != nu_A_bin)
                continue;

            sj_B_bin = sj_bin;
            assert(sj_bin.size() == Ns);
            assert(sj_B_bin.size() == Ns);
            assert(mask_B.size() == Ns);
            sj_B_bin &= mask_B;
            sj_B_bin >>= ns_A;
            sj_B_bin.resize(ns_B);
            if (sj_B_bin != mu_B_bin)
                continue;

            sum += gs[si]*gs[sj];
        }
    }
}
This already reduces the single-threaded runtime on my system from ~289 s to ~39 s. The program also scales almost perfectly up to ~10 threads (4.1 s).
For more threads, there are load balance issues in the parallel loop, which can be mitigated a bit by adding schedule(dynamic), but I'm not sure how relevant that is for you.
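For reference, a minimal, self-contained illustration of schedule(dynamic) on a deliberately unbalanced loop (this is just a sketch of the clause, not your code):
#include <cmath>
#include <cstdio>

int main() {
    const int n = 1000;
    double sum = 0.0;
    // Later iterations are much more expensive than earlier ones, so a static
    // schedule would leave some threads idle; dynamic scheduling hands out
    // iterations in chunks as threads become free.
    #pragma omp parallel for schedule(dynamic) reduction(+:sum)
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < i * 100; j++)
            sum += std::sin(static_cast<double>(j)) * 1e-6;
    }
    std::printf("sum = %f\n", sum);
    return 0;
}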
More importantly, you should consider using std::bitset. Even apart from the extremely expensive boost::dynamic_bitset constructor, dynamic_bitset is very costly in general: most of the time is spent in superfluous dynamic_bitset/vector code and in memmove/memcmp on a single word.
+ 32.18% 32.15% ope_gcc_dyn ope_gcc_dyn [.] _ZNSt6vectorImSaImEEaSERKS1_
+ 29.13% 29.10% ope_gcc_dyn ope_gcc_dyn [.] _Z12compute_mddrjjj._omp_fn.0
+ 21.65% 0.00% ope_gcc_dyn [unknown] [.] 0000000000000000
+ 16.24% 16.23% ope_gcc_dyn ope_gcc_dyn [.] _ZN5boost14dynamic_bitsetImSaImEE6resizeEmb.constprop.102
+ 10.25% 10.23% ope_gcc_dyn libc-2.23.so [.] __memcmp_sse4_1
+ 9.61% 0.00% ope_gcc_dyn libc-2.23.so [.] 0xffffd47cb9d83b78
+ 7.74% 7.73% ope_gcc_dyn libc-2.23.so [.] __memmove_avx_unaligned
That overhead basically goes away if you just use a std::bitset spanning very few words. Maybe 64 bits will always be enough for you. If the size varies over a large range, you could turn the entire function into a template, instantiate it for a number of different bit sizes, and dynamically select the appropriate one. I suspect you would gain another order of magnitude in performance. This may in turn reduce parallel efficiency, requiring another round of performance analysis.
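A rough sketch of what that compile-time/runtime dispatch could look like (the structure and names here are illustrative assumptions, not code from your program):
#include <bitset>
#include <cstdio>

// Hypothetical templated kernel: the bitset width is fixed at compile time,
// so the bitsets live on the stack and cause no malloc/free traffic.
template <std::size_t MaxBits>
void compute_mddr_impl(unsigned Ns, unsigned block_size, unsigned sector) {
    std::bitset<MaxBits> mask_A;
    for (unsigned b = 0; b < block_size; ++b)
        mask_A.set(b);
    // ... the rest of the computation, using std::bitset<MaxBits> throughout ...
    std::printf("running with %zu-bit bitsets, Ns=%u, sector=%u\n",
                MaxBits, Ns, sector);
}

// Runtime dispatch: pick the smallest instantiation that can hold Ns bits.
void compute_mddr(unsigned Ns, unsigned block_size, unsigned sector) {
    if (Ns <= 64)
        compute_mddr_impl<64>(Ns, block_size, sector);
    else if (Ns <= 128)
        compute_mddr_impl<128>(Ns, block_size, sector);
    else
        compute_mddr_impl<256>(Ns, block_size, sector);
}

int main() {
    compute_mddr(16, 6, 3);  // example call with small illustrative parameters
}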
It's very important to use tools to understand the performance of your code. There are very simple and very good tools for all sorts of cases. In your case a simple one such as perf is sufficient.
I see an abnormal memory usage pattern while running my application program on AIX...
I have created a simple program that mallocs and frees memory to replicate the problem.
#include <stdio.h>
#include <stdlib.h>

int main()
{
    int *ptr_one;

    // Enter any value (e.g. 0). The pause gives me a few seconds to fetch the
    // PID of this standalone process and run 'ps -p <PID> -o "vsz rssize"'.
    long a;
    scanf("%ld", &a);

    for(;;)
    {
        if(a < 10000000) a = a + 100;
        ptr_one = (int *)malloc(sizeof(int)*a);
        if (ptr_one == 0){
            printf("ERROR: Out of memory\n");
            return 1;
        }
        *ptr_one = 25;
        printf("%d\n", *ptr_one);
        free(ptr_one);
    }
    return 0;
}
I captured the memory usage of this program using the command below,
ps -p $1 -o "vsz rssize" | tail -1 >> out.txt
The graph shows that the memory keeps growing and is never released.
Is this a sign of a leak, or is this normal memory behavior on AIX?
It is entirely expected that the process's memory usage does not decrease: while malloc can request additional memory for the process, free generally does not return it to the operating system. Instead, the freed memory is reused by future malloc calls.
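A minimal sketch that makes this visible (allocator behaviour varies by platform, so this is illustrative rather than guaranteed): if free really returned the block to the OS, a second allocation of the same size would be unlikely to land at the same address, but with most allocators it does, because the block is simply recycled from the allocator's free list.
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    /* Allocate, free, and allocate again with the same size. On most
       allocators both pointers are identical, because free() handed the
       block back to the allocator's free list, not to the OS. */
    void *p1 = malloc(1024 * 1024);
    printf("first  allocation: %p\n", p1);
    free(p1);

    void *p2 = malloc(1024 * 1024);
    printf("second allocation: %p\n", p2);
    free(p2);

    return 0;
}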
I'm experiencing a strange memory access performance problem; any ideas?
int* pixel_ptr = somewhereFromHeap;
int local_ptr[307200]; //local

//this is very slow
for(int i=0;i<307200;i++){
    pixel_ptr[i] = someCalculatedVal ;
}

//this is very slow
for(int i=0;i<307200;i++){
    pixel_ptr[i] = 1 ; //constant
}

//this is fast
for(int i=0;i<307200;i++){
    int val = pixel_ptr[i];
    local_ptr[i] = val;
}

//this is fast
for(int i=0;i<307200;i++){
    local_ptr[i] = someCalculatedVal ;
}
I tried consolidating the values into a local scanline first:
int scanline[640]; // local

//this is very slow
for(int i=xMin;i<xMax;i++){
    int screen_pos = sy*screen_width+i;
    int val = scanline[i];
    pixel_ptr[screen_pos] = val ;
}

//this is fast
for(int i=xMin;i<xMax;i++){
    int screen_pos = sy*screen_width+i;
    int val = scanline[i];
    pixel_ptr[screen_pos] = 1 ; //constant
}

//this is fast
for(int i=xMin;i<xMax;i++){
    int screen_pos = sy*screen_width+i;
    int val = i; //or a constant
    pixel_ptr[screen_pos] = val ;
}

//this is slow
for(int i=xMin;i<xMax;i++){
    int screen_pos = sy*screen_width+i;
    int val = scanline[0];
    pixel_ptr[screen_pos] = val ;
}
Any ideas? I'm using MinGW with cflags -O1 -std=c++11 -fpermissive.
Update 4:
I have to say that these are snippets from my program, and there are heavy code/functions running before and after. The scanline block did run at the end of the function, before exit.
Now with a proper test program, thanks to @Iwillnotexist:
#include <stdio.h>
#include <unistd.h>
#include <sys/time.h>

#define SIZE 307200
#define SAMPLES 1000

double local_test(){
    int local_array[SIZE];
    timeval start, end;
    long cpu_time_used_sec,cpu_time_used_usec;
    double cpu_time_used;

    gettimeofday(&start, NULL);
    for(int i=0;i<SIZE;i++){
        local_array[i] = i;
    }
    gettimeofday(&end, NULL);

    cpu_time_used_sec = end.tv_sec- start.tv_sec;
    cpu_time_used_usec = end.tv_usec- start.tv_usec;
    cpu_time_used = cpu_time_used_sec*1000 + cpu_time_used_usec/1000.0;
    return cpu_time_used;
}

double heap_test(){
    int* heap_array=new int[SIZE];
    timeval start, end;
    long cpu_time_used_sec,cpu_time_used_usec;
    double cpu_time_used;

    gettimeofday(&start, NULL);
    for(int i=0;i<SIZE;i++){
        heap_array[i] = i;
    }
    gettimeofday(&end, NULL);

    cpu_time_used_sec = end.tv_sec- start.tv_sec;
    cpu_time_used_usec = end.tv_usec- start.tv_usec;
    cpu_time_used = cpu_time_used_sec*1000 + cpu_time_used_usec/1000.0;
    delete[] heap_array;
    return cpu_time_used;
}

double heap_test2(){
    static int* heap_array = NULL;
    if(heap_array==NULL){
        heap_array = new int[SIZE];
    }
    timeval start, end;
    long cpu_time_used_sec,cpu_time_used_usec;
    double cpu_time_used;

    gettimeofday(&start, NULL);
    for(int i=0;i<SIZE;i++){
        heap_array[i] = i;
    }
    gettimeofday(&end, NULL);

    cpu_time_used_sec = end.tv_sec- start.tv_sec;
    cpu_time_used_usec = end.tv_usec- start.tv_usec;
    cpu_time_used = cpu_time_used_sec*1000 + cpu_time_used_usec/1000.0;
    return cpu_time_used;
}

int main (int argc, char** argv){
    double cpu_time_used = 0;
    for(int i=0;i<SAMPLES;i++)
        cpu_time_used+=local_test();
    printf("local: %f ms\n",cpu_time_used);

    cpu_time_used = 0;
    for(int i=0;i<SAMPLES;i++)
        cpu_time_used+=heap_test();
    printf("heap_: %f ms\n",cpu_time_used);

    cpu_time_used = 0;
    for(int i=0;i<SAMPLES;i++)
        cpu_time_used+=heap_test2();
    printf("heap2: %f ms\n",cpu_time_used);
}
Compiled with no optimization.
local: 577.201000 ms
heap_: 826.802000 ms
heap2: 686.401000 ms
The first heap test, with new and delete, is about 1.4x slower (paging, as suggested?).
The second heap test, with the reused heap array, is still about 1.2x slower.
But I guess the second test is not that practical, as there tends to be other code running before and after, at least in my case. My pixel_ptr is of course only allocated once, during program initialization.
But if anyone has solutions/ideas to speed things up, please reply!
I'm still perplexed why heap writes are so much slower than stack writes.
Surely there must be some tricks to make the heap more CPU/cache friendly.
Final update?:
I revisited the disassemblies again, and this time I suddenly have an idea why some of my breakpoints don't activate. The program looks suspiciously short, so I suspect the compiler might have removed the redundant dummy code I put in, which explains why the local array is magically many times faster.
I was a bit curious, so I did the test, and indeed I could measure a difference between stack and heap access.
The first guess would be that the generated assembly is different, but after taking a look it is actually identical for heap and stack (which makes sense: memory shouldn't be discriminated).
If the assembly is the same, then the difference must come from the paging mechanism. The guess is that on the stack the pages are already allocated, but on the heap the first access causes a page fault and a page allocation (invisible; it all happens at kernel level). To verify this, I did the same test, but first accessed the heap once before measuring. That test gave identical times for stack and heap.
To be sure, I also did a test in which I first touched the heap only every 4096 bytes (every 1024 ints), and then only every 8192 bytes, since a page is usually 4096 bytes long. The result: touching every 4096 bytes beforehand also gives the same time for heap and stack, while touching only every 8192 bytes gives a difference, though not as large as with no prior access at all, because only half of the pages had been touched and allocated beforehand.
So the answer is that on the stack, memory pages are already allocated, but on the heap, pages are allocated on first access. This depends on the OS paging policy, but all major PC OSes probably behave similarly.
For all the tests I used Windows, with MS compiler targeting x64.
EDIT: For the test, I measured a single, larger loop, so there was only one access at each memory location. Deleting the array and measuring the same loop multiple times should give similar times for stack and heap, because deleting memory probably doesn't deallocate the pages, and they are already allocated for the next loop (if the next new allocates in the same space).
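A minimal sketch of the pre-faulting experiment described above (the chrono timing helper is a substitution for the gettimeofday code in the question; the buffer size is borrowed from it): touching one int per page of the heap buffer before the measured loop should bring its timing in line with the stack version.
#include <chrono>
#include <cstdio>

int main() {
    const int SIZE = 307200;
    int* heap_array = new int[SIZE];

    // Pre-fault: touch one int per 4096-byte page so the kernel maps the
    // pages before we start timing. Comment this loop out to see the
    // first-touch page-fault cost show up in the measurement.
    for (int i = 0; i < SIZE; i += 4096 / sizeof(int))
        heap_array[i] = 0;

    auto start = std::chrono::high_resolution_clock::now();
    for (int i = 0; i < SIZE; i++)
        heap_array[i] = i;
    auto end = std::chrono::high_resolution_clock::now();

    std::printf("heap write: %f ms\n",
                std::chrono::duration<double, std::milli>(end - start).count());
    delete[] heap_array;
}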
The following two code examples should not differ in runtime with a good compiler setting; your compiler will probably generate the same code for both:
//this is fast
for(int i=0;i<307200;i++){
    int val = pixel_ptr[i];
    local_ptr[i] = val;
}

//this is fast
for(int i=0;i<307200;i++){
    local_ptr[i] = pixel_ptr[i];
}
Please try increasing the optimization setting.
I want to write a program in C++ that is able to stress a Windows 7 system. My intention is that this program drives the CPU usage to 100% and uses all the installed RAM.
I have tried a big for loop that runs a simple multiplication at every step: the CPU usage increases, but the RAM usage remains low.
What is the best approach to reach my goal?
In an OS-agnostic way, you could allocate heap memory with malloc(3) (or, in C++, with operator new), fill it, and do some computation over that zone. Be sure to test for malloc failure, and increase the size of your zone progressively.
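A minimal sketch of that idea (the block size and stride are arbitrary assumptions): allocate progressively larger blocks, write to every page so the memory is actually committed, and keep the CPU busy summing over everything allocated so far.
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <vector>

int main() {
    std::vector<char*> blocks;
    const size_t step = 256 * 1024 * 1024;    // grow by 256 MiB at a time
    unsigned long long checksum = 0;

    for (;;) {
        char* block = static_cast<char*>(std::malloc(step));
        if (block == nullptr) {               // stop growing once allocation fails
            std::printf("allocation failed after %zu blocks\n", blocks.size());
            break;
        }
        std::memset(block, 0xAB, step);       // touch every page so RAM is really used
        blocks.push_back(block);

        // Burn some CPU over everything allocated so far.
        for (char* b : blocks)
            for (size_t i = 0; i < step; i += 64)
                checksum += b[i];
        std::printf("allocated %zu MiB, checksum %llu\n",
                    blocks.size() * (step >> 20), checksum);
    }

    for (char* b : blocks)
        std::free(b);
    return 0;
}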
If your goal is the ability to stress CPU and RAM utilization (and not necessarily to write a program), try HeavyLoad, which is free and does just that: http://www.jam-software.com/heavyload/
I wrote a test program for this using multithreading and Mersenne primes as the algorithm for the stress test.
The code first determines the number of cores on the machine (Windows only, sorry!):
NtQuerySystemInformation(SystemBasicInformation, &BasicInformation, sizeof(SYSTEM_BASIC_INFORMATION), NULL);
It then runs a test method to spawn a thread for each core:
// make sure we've got at least as many threads as cores
for (int i = 0; i < this->BasicInformation.NumberOfProcessors; i++)
{
    threads.push_back(thread(&CpuBenchmark::MersennePrimes, CpuBenchmark()));
}
running our load
BOOL CpuBenchmark::IsPrime(ULONG64 n) // determines whether the number n is prime
{
    for (ULONG64 i = 2; i * i <= n; i++)   // <= so perfect squares are not missed
        if (n % i == 0)
            return false;
    return true;
}

ULONG64 CpuBenchmark::Power2(ULONG64 n) // returns 2 raised to the power of n
{
    ULONG64 square = 1;
    for (ULONG64 i = 0; i < n; i++)
    {
        square *= 2;
    }
    return square;
}

VOID CpuBenchmark::MersennePrimes()
{
    ULONG64 i;
    ULONG64 n;
    for (n = 2; n <= 61; n++)
    {
        if (IsPrime(n) == 1)
        {
            i = Power2(n) - 1;
            if (IsPrime(i) == 1) {
                // don't care, just want to stress the CPU
            }
        }
    }
}
and waits for all threads to complete
// wait for them all to finish
for (auto& th : threads)
    th.join();
Full article and source code on my blog site
http://malcolmswaine.com/c-cpu-stress-test-algorithm-mersenne-primes/
I'm quite new at C++ and OpenMP in general. A part of my program is causing segmentation faults under strange circumstances (strange to me, at least).
The fault doesn't occur when using the g++ compiler, but it does with the Intel compiler; however, there are no faults in serial.
It also doesn't segfault when compiled on a different system (university HPC, Intel compiler), but it does on my PC.
It also doesn't segfault when three particular cout statements are present; however, if any one of them is commented out then the segfault occurs. (This is what I find strange.)
I'm new at using the Intel debugger (idb) and I don't know how to work it properly yet, but I did manage to get this information from it:
Program received signal SIGSEGV
VLMsolver::iterateWake (this=<no value>) at /home/name/prog/src/vlmsolver.cpp:996
996 moveWakePoints();
So I'll show the moveWakePoints method below, and point out the critical cout lines:
void VLMsolver::moveWakePoints() {
    inFreeWakeStage = true;
    int iw = 0;
    std::vector<double> wV(3);
    std::vector<double> bV(3);
    for (int cl=0;cl<3;++cl) {
        wV[cl]=0;
        bV[cl]=0;
    }

    cout<<"thanks for helping"<<endl;

    for (int b = 0;b < sNumberOfBlades;++b) {
        cout<<"b: "<<b<<endl;

        #pragma omp parallel for firstprivate(iw,b,bV,wV)
        for (int i = 0;i< iteration;++i) {
            iw = iteration -i - 1;
            for (int j = 0;j<numNodesY;++j) {
                cout<<"b: "<<b<<"a: "<<"a: "<<endl;

                double xp = wakes[b].x[iw*numNodesY+j];
                double yp = wakes[b].y[iw*numNodesY+j];
                double zp = wakes[b].z[iw*numNodesY+j];

                if ( (sFreeWake ==true && sFreezeAfter == 0) || ( sFreeWake==true && iw<((sFreezeAfter*2*M_PI)/(sTimeStep*sRotationRate)) && sRotationRate != 0 ) || ( sFreeWake==true && sRotationRate == 0 && iw<((sFreezeAfter*sChord)/(sTimeStep*sFreeStream)))) {
                    if (iteration>1) {
                        getWakeVelocity(xp, yp, zp, wV);
                    }
                    getBladeVelocity(xp, yp, zp, bV);
                } else {
                    for (int cl=0;cl<3;++cl) {
                        wV[cl]=0;
                        bV[cl]=0;
                    }
                }

                if (sRotationRate != 0) {
                    double theta;
                    theta = M_PI/2;
                    double radius = sqrt(pow(yp,2) + pow(zp,2));
                    wakes[b].yTemp[(iw+1)*numNodesY+j] = cos(theta - sTimeStep*sRotationRate)*radius;
                    wakes[b].zTemp[(iw+1)*numNodesY+j] = sin(theta - sTimeStep*sRotationRate)*radius;
                    wakes[b].xTemp[(iw+1)*numNodesY+j] = xp + sFreeStream*sTimeStep;
                } else {
                    std::vector<double> fS(3);
                    getFreeStreamVelocity(xp, yp, zp, fS);
                    wakes[b].xTemp[(iw+1)*numNodesY+j] = xp + fS[0] * sTimeStep;
                    wakes[b].yTemp[(iw+1)*numNodesY+j] = yp + fS[1] * sTimeStep;
                    wakes[b].zTemp[(iw+1)*numNodesY+j] = zp + fS[2] * sTimeStep;
                }

                wakes[b].xTemp[(iw+1)*numNodesY+j] = wakes[b].xTemp[(iw+1)*numNodesY+j] + (wV[0]+bV[0])*sTimeStep;
                wakes[b].yTemp[(iw+1)*numNodesY+j] = wakes[b].yTemp[(iw+1)*numNodesY+j] + (wV[1]+bV[1])*sTimeStep;
                wakes[b].zTemp[(iw+1)*numNodesY+j] = wakes[b].zTemp[(iw+1)*numNodesY+j] + (wV[2]+bV[2])*sTimeStep;
            } // along numNodesY
        } // along the iterations i

        if (sBladeSymmetry) {
            break;
        }
    }
}
The three cout lines at the top are what I added; I found the program worked when I did.
On the third cout line, for example, if I change it to:
cout<<"b: "<<"a: "<<"a: "<<endl;
I get the segfault, or if I change it to:
cout<<"b: "<<b<<endl;
I also get the segfault.
Thanks for reading; I appreciate any ideas.
As already stated in the previous answer, you can try to use Valgrind to detect where your memory is corrupted. Just compile your binary with "-g -O0" and then run:
valgrind --tool=memcheck --leak-check=full <binary> <arguments>
If you are lucky, you will get the exact line and column in the source code where the memory violation occurred.
The fact that a segfault disappears when some output statements are added is indeed not strange. By adding these statements you are changing the layout of the memory the program owns. If, by chance, you are now writing to a wrong location that still lies inside an allowed portion of memory, the segfault will not occur.
You can refer to this PDF (section "Debugging techniques/Out of Bounds") for a broader explanation of the topic:
Summer School of Parallel Computing
Hope I've been of help :-)
Try increasing the stack size:
http://software.intel.com/sites/products/documentation/hpc/composerxe/en-us/cpp/lin/optaps/common/optaps_par_var.htm
Try valgrind.
Try a debugger.