The following code declares a global array (256 MiB) and calculates the sum of its items.
This program consumes 188 KiB when running:
#include <cstdlib>
#include <iostream>

using namespace std;

const unsigned int buffer_len = 256 * 1024 * 1024; // 256 MiB
unsigned char buffer[buffer_len];

int main()
{
    while(true)
    {
        int sum = 0;
        for(unsigned int i = 0; i < buffer_len; i++)
            sum += buffer[i];
        cout << "Sum: " << sum << endl;
    }
    return 0;
}
The following code is like the one above, but it sets the array elements to random values before calculating the sum.
This program consumes 256.1 MiB when running:
#include <cstdlib>
#include <iostream>

using namespace std;

const unsigned int buffer_len = 256 * 1024 * 1024; // 256 MiB
unsigned char buffer[buffer_len];

int main()
{
    while(true)
    {
        // ********** Changing array items.
        for(unsigned int i = 0; i < buffer_len; i++)
            buffer[i] = std::rand() % 256;
        // **********
        int sum = 0;
        for(unsigned int i = 0; i < buffer_len; i++)
            sum += buffer[i];
        cout << "Sum: " << sum << endl;
    }
    return 0;
}
Why does the global array not consume memory when it is only read from (188 KiB vs 256 MiB)?
My compiler: GCC
My OS: Ubuntu 20.04
Update:
In my real scenario I will generate the buffer with the xxd command, so its elements are not zero:
$ xxd -i buffer.dat buffer.cpp
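For reference, xxd -i emits a C array definition roughly of this shape (the symbol names are derived from the input filename; the byte values here are placeholders):

unsigned char buffer_dat[] = {
  0x3f, 0x8a, 0x00, 0x1b, /* ... one entry per byte of buffer.dat ... */
};
unsigned int buffer_dat_len = 268435456;

Note that an array with nonzero initializers like this lands in the initialized-data section of the executable rather than in .bss, so the zero-page behavior discussed below no longer applies to it.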
There's much speculation in the comments that this behavior is explained by compiler optimizations, but the OP's use of g++ without optimization flags doesn't support this, and neither does the assembly output on the same architecture, which clearly shows buffer being used:
buffer:
        .zero 268435456
        .section .rodata
...
.L3:
        movl    -4(%rbp), %eax
        cmpl    $268435455, %eax
        ja      .L2
        movl    -4(%rbp), %eax
        cltq
        leaq    buffer(%rip), %rdx
        movzbl  (%rax,%rdx), %eax
        movzbl  %al, %eax
        addl    %eax, -8(%rbp)
        addl    $1, -4(%rbp)
        jmp     .L3
The real reason you're seeing this behavior is the use of Copy On Write in the kernel's VM system. Essentially for a large buffer of zeros like you have here, the kernel will create a single "zero page" and point all pages in buffer to this page. Only when the page is written will it get allocated.
This is actually true in your second example as well (i.e. the behavior is the same), but you're touching every page with data, which forces the entire buffer to be paged in. Try writing only the first buffer_len/2 bytes, and you'll see that 128.2 MiB of memory gets allocated by the process: half of the true size of the buffer.
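Here's a minimal Linux-only sketch of that experiment, reading the resident-set size from /proc/self/statm (the second field, counted in pages) before and after touching half of the buffer:

#include <cstdio>
#include <unistd.h>

const unsigned int buffer_len = 256 * 1024 * 1024;
unsigned char buffer[buffer_len];

static long resident_kib() {
    long pages = 0, resident = 0;
    FILE *f = std::fopen("/proc/self/statm", "r");
    if (f) { std::fscanf(f, "%ld %ld", &pages, &resident); std::fclose(f); }
    return resident * (sysconf(_SC_PAGESIZE) / 1024);  // pages -> KiB
}

int main() {
    std::printf("RSS before writes: %ld KiB\n", resident_kib());
    for (unsigned int i = 0; i < buffer_len / 2; i++)
        buffer[i] = 1;                    // fault in the first half only
    std::printf("RSS after touching half: %ld KiB\n", resident_kib());
    return 0;
}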
Here's also a helpful summary from Wikipedia:
The copy-on-write technique can be extended to support efficient memory allocation by having a page of physical memory filled with zeros. When the memory is allocated, all the pages returned refer to the page of zeros and are all marked copy-on-write. This way, physical memory is not allocated for the process until data is written, allowing processes to reserve more virtual memory than physical memory and use memory sparsely, at the risk of running out of virtual address space.
Related
I'm writing this code to make a candlestick chart and I want a red box if the open price for the day is greater than the close. I also want the box to be green if the close is higher than the open price.
if(open > close) {
    boxColor = red;
} else {
    boxColor = green;
}
Pseudo code is easier than an English sentence for this.
So I wrote this code first and then tried to benchmark it, but I don't know how to get meaningful results.
for(int i = 0; i < history.get().close.size(); i++) {
    auto open = history->open[i];
    auto close = history->close[i];
    int red = ((int)close - (int)open) >> ((int)sizeof(close) * 8);
    int green = ((int)open - (int)close) >> ((int)sizeof(close) * 8);
    gl::color(red, green, 0);
    gl::drawSolidRect( Rectf(vec2(i - 1, open), vec2(i + 1, close)) );
}
This is how I tried to benchmark it. Each run just shows 2 ns. My main question to the community is this:
Can I actually make it faster by using a right shift to avoid a conditional branch?
#include <benchmark/reporter.h>

static void BM_red_noWork(benchmark::State& state) {
    double open = (double)rand() / RAND_MAX;
    double close = (double)rand() / RAND_MAX;
    while (state.KeepRunning()) {
    }
}
BENCHMARK(BM_red_noWork);

static void BM_red_fast_work(benchmark::State& state) {
    double open = (double)rand() / RAND_MAX;
    double close = (double)rand() / RAND_MAX;
    while (state.KeepRunning()) {
        int red = ((int)open - (int)close) >> (sizeof(int) - 1);
    }
}
BENCHMARK(BM_red_fast_work);

static void BM_red_slow_work(benchmark::State& state) {
    double open = (double)rand() / RAND_MAX;
    double close = (double)rand() / RAND_MAX;
    while (state.KeepRunning()) {
        int red = open > close ? 0 : 1;
    }
}
BENCHMARK(BM_red_slow_work);
Thanks!
As I stated in my comment, the compiler will do these optimizations for you. Here is a minimal compilable example:
int main() {
    volatile int a = 42;
    if (a <= 0) {
        return 0;
    } else {
        return 1;
    }
}
The volatile is simply to prevent the optimizer from "knowing" the value of a; instead, it forces the value to actually be read.
This was compiled with the command g++ -O3 -S test.cpp, which produces a file named test.s.
Inside test.s is the assembly generated by the compiler (pardon AT&T syntax):
movl $42, -4(%rsp)
movl -4(%rsp), %eax
testl %eax, %eax
setg %al
movzbl %al, %eax
ret
As you can see, it is branchless. testl sets the flags based on the value, setg sets %al to 1 if the value was greater than zero, movzbl widens that byte into the return register, and finally it returns.
It should be noted that this was adapted from your code. A much better way to write it is simply:
int main() {
    volatile int a = 42;
    return a > 0;
}
It also generates the same assembly.
This is likely to be better than anything readable you could write directly in C++. For instance your code (hopefully corrected for bit arithmetic errors):
#include <climits>

int main() {
    volatile int a = 42;
    return ~(a >> (sizeof(int) * CHAR_BIT - 1)) & 1;
}
Compiles to:
movl $42, -4(%rsp)
movl -4(%rsp), %eax
notl %eax
shrl $31, %eax
ret
Which is, indeed, very slightly smaller. But it's not significantly faster, especially not when you have a GL call right next to it. I'd rather spend 1-3 additional cycles to get readable code than scratch my head wondering what my coworker (or me from 6 months ago, which is essentially the same thing) did.
EDIT: It should be remarked that the compiler also optimized the bit arithmetic I wrote, because I wrote it less well than I could have. The assembly actually computes (~a) >> 31, which is equivalent to the ~(a >> 31) & 1 that I wrote (at least on most implementations, treating the value as unsigned; see comments for details).
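As an aside on the benchmarking attempt in the question: all three loops report ~2 ns because the computed values are never used, so the optimizer deletes the work entirely. Here is a sketch of one way to keep the work alive, assuming Google Benchmark (the benchmark name is mine):

#include <benchmark/benchmark.h>
#include <cstdlib>

static void BM_red_branch(benchmark::State& state) {
    double open = (double)rand() / RAND_MAX;
    double close = (double)rand() / RAND_MAX;
    while (state.KeepRunning()) {
        int red = open > close ? 0 : 1;
        benchmark::DoNotOptimize(red);  // forces the result to be materialized
    }
}
BENCHMARK(BM_red_branch);

BENCHMARK_MAIN();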
How can a separate loop affect the performance of an independent earlier loop?
My first loop reads some large text files and counts the lines/rows.
After a malloc, the second loop populates the allocated matrix.
If the second loop is commented out, the first loop takes 1.5 sec. However, compiling WITH the second loop slows down the first loop, which now takes 30-40 seconds!
In other words: the second loop somehow slows down the first loop. I have tried changing scope, changing compilers, changing compiler flags, changing the loop itself, bringing everything into main(), using boost::iostream, and even placing one loop in a shared library, but in every attempt the same problem persists!
The first loop is fast until the program is compiled with the second loop.
EDIT: Here is a full example of my problem:
#include <iostream>
#include <vector>
#include <string.h>   // memchr
#include "boost/chrono.hpp"
#include <sys/mman.h>
#include <sys/stat.h>
#include <fcntl.h>    // open, posix_fadvise
#include <unistd.h>   // read, close
#include <cstdlib>    // std::exit
#include <algorithm>
unsigned long int countLines(char const *fname) {
    static const auto BUFFER_SIZE = 16*1024;
    int fd = open(fname, O_RDONLY);
    if(fd == -1) {
        std::cout << "Open Error" << std::endl;
        std::exit(EXIT_FAILURE);
    }
    posix_fadvise(fd, 0, 0, 1);
    char buf[BUFFER_SIZE + 1];
    unsigned long int lines = 0;
    while(size_t bytes_read = read(fd, buf, BUFFER_SIZE)) {
        if(bytes_read == (size_t)-1) {
            std::cout << "Read Failed" << std::endl;
            std::exit(EXIT_FAILURE);
        }
        if (!bytes_read)
            break;
        int n;
        char *p;
        for(p = buf, n = bytes_read; n > 0 && (p = (char*) memchr(p, '\n', n)); n = (buf + bytes_read) - ++p)
            ++lines;
    }
    close(fd);
    return lines;
}
int main(int argc, char *argv[])
{
    // initial variables
    int offset = 55;
    unsigned long int rows = 0;
    unsigned long int cols = 0;
    std::vector<unsigned long int> dbRows = {0, 0, 0};
    std::vector<std::string> files = {"DATA/test/file1.csv",   // large files: 3Gb
                                      "DATA/test/file2.csv",   // each line is 55 chars long
                                      "DATA/test/file3.csv"};
    // find each file's number of rows
    for (int x = 0; x < files.size(); x++) {                   // <--- FIRST LOOP **
        dbRows[x] = countLines(files[x].c_str());
    }
    // define matrix row as being the largest row found
    // define matrix col as being 55 chars long for each csv file
    std::vector<unsigned long int>::iterator maxCount;
    maxCount = std::max_element(dbRows.begin(), dbRows.end());
    rows = dbRows[std::distance(dbRows.begin(), maxCount)];    // typically rows = 72716067
    cols = dbRows.size() * offset;                             // cols = 165
    // malloc required space (11998151055)
    char *syncData = (char *)malloc(rows*cols*sizeof(char));
    // fill up allocated memory with a test letter
    char t[] = "x";
    for (unsigned long int x = 0; x < (rows*cols); x++) {      // <--- SECOND LOOP **
        syncData[x] = t[0];
    }
    free(syncData);
    return 0;
}
I have also noticed that lowering the number of columns speeds up the first loop.
A profiler points the finger at this line:
while(size_t bytes_read = read(fd, buf, BUFFER_SIZE))
The program idles on this line for 30 seconds or a wait count of 230,000.
In assembly, the wait count occurs on:
Block 5:
lea 0x8(%rsp), %rsi
mov %r12d, %edi
mov $0x4000, %edx
callq 0x402fc0 <------ stalls on callq
Block 6:
mov %rax, %rbx
test %rbx, %rbx
jz 0x404480 <Block 18>
My guess is that the read call is blocking on I/O, but I don't know why.
My theory:
Allocating and touching all that memory evicts the big files from disk cache, so the next run has to read them from disk.
If you ran the version without loop2 a couple times to warm up the disk cache, then run a version with loop2, I predict it will be fast the first time, but slow for further runs without warming up the disk cache again first.
The memory consumption happens after the files have been read. This causes "memory pressure" on the page cache (aka disk cache), causing it to evict data from the cache to make room for the pages for your process to write to.
Your computer probably has just barely enough free RAM to cache your working set. Closing your web browser might free up enough to make a difference! Or not, since your 11998151055 bytes is 11.1 GiB, and you're writing every page of it. (Every byte, even. You could do it with memset for higher performance, although I assume what you've shown is just a dummy version.)
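For reference, a sketch of the memset variant mentioned above (requires <cstring>); it replaces the second loop with a single call:

memset(syncData, 'x', rows * cols);   // fill the whole allocation in one call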
BTW, another tool for investigating this would be time ./a.out. It can show you if your program is spending all its CPU time in user-space vs. kernel ("system") time.
If user+sys adds up to the real time, your process is CPU bound. If not, it's I/O bound, and your process is blocking on disk I/O (which is normal, since counting newlines should be fast).
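One way to test the cache-eviction theory (Linux, requires root): drop the page cache between runs and compare the timings of the first loop:

$ sync
$ echo 3 | sudo tee /proc/sys/vm/drop_caches

If the theory is right, the first loop should be slow after dropping the cache even in the build without the second loop.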
From THIS PAGE:
"The function close closes the file descriptor filedes. Closing a file has the following consequences:
The file descriptor is deallocated.
Any record locks owned by the process on the file are unlocked.
When all file descriptors associated with a pipe or FIFO have been closed, any unread data is discarded."
I think you might have the resources locked by the previous read. Try closing the file and let us know the result.
I am writing a program, and at this point I need to make it efficient.
I am using a Haswell microarchitecture (64-bit) and g++.
The objective is to make use of an ADC instruction until the loop ends.
// I removed all carry handling from this preview, to keep it simple
size_t anum = ap[i], bnum = bp[i];
unsigned carry;

// The carry flag is set here with a common addition
anum += bnum;
cnum[0] = anum;
carry = check_Carry(anum, bnum);

for (int i = 1; i < n; i++) {
    anum = ap[i];
    bnum = bp[i];

    // I want to remove this line and insert the __asm__ block
    anum += (bnum + carry);
    carry = check_Carry(anum, bnum);

    // This block is not working
    __asm__(
        "movq -64(%rbp), %rcx;"
        "adcq %rdx, %rcx;"
        "movq %rsi, -88(%rbp);"
    );

    cnum[i] = anum;
}
Is the CF set only in the first addition, or is it set every time I do an ADC instruction?
I think the problem is that the CF is lost each time the loop iterates. If this is the problem, how can I solve it?
You use asm like this in the gcc family of compilers:
int src = 1;
int dst;
asm ("mov %1, %0\n\t"
"add $1, %0"
: "=r" (dst)
: "r" (src));
printf("%d\n", dst);
That is, you can refer to variables, rather than guessing where they might be in memory/registers.
[Edit] On the subject of carries: it's not completely clear what you are wanting, but: ADC takes the CF as an input and produces it as an output. However, many other instructions muck with the flags (such as, likely, those used by the compiler to construct your for-loop), so you probably need some instructions to save/restore the CF (perhaps LAHF/SAHF).
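If the goal is simply correct multi-word addition rather than hand-written ADC, here is a sketch that sidesteps inline asm entirely, assuming GCC 5+ (or Clang) for __builtin_add_overflow. The function name add_bignum is mine, and check_Carry from the question disappears because the builtin reports the carry directly:

#include <stddef.h>

void add_bignum(size_t *cnum, const size_t *ap, const size_t *bp, int n)
{
    unsigned carry = 0;
    for (int i = 0; i < n; i++) {
        size_t sum;
        unsigned c1 = __builtin_add_overflow(ap[i], bp[i], &sum);
        unsigned c2 = __builtin_add_overflow(sum, (size_t)carry, &sum);
        carry = c1 | c2;   // at most one of the two additions can carry out
        cnum[i] = sum;
    }
}

With optimization enabled, compilers may turn this pattern into the ADC chain you were after.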
Here is a piece of code that I've been trying to compile:
#include <cstdio>

#define N 3

struct Data {
    int A[N][N];
    int B[N];
};

int foo(int uloc, const int A[N][N], const int B[N])
{
    for(unsigned int j = 0; j < N; j++) {
        for(int i = 0; i < N; i++) {
            for(int r = 0; r < N; r++) {
                for(int q = 0; q < N; q++) {
                    uloc += B[i]*A[r][j] + B[j];
                }
            }
        }
    }
    return uloc;
}

int apply(const Data *d)
{
    return foo(4, d->A, d->B);
}

int main(int, char **)
{
    Data d;
    for(int i = 0; i < N; ++i) {
        for(int j = 0; j < N; ++j) {
            d.A[i][j] = 0.0;
        }
        d.B[i] = 0.0;
    }
    int res = 11 + apply(&d);
    printf("%d\n", res);
    return 0;
}
Yes, it looks quite strange, and does not do anything useful at all at the moment, but it is the most concise version of a much larger program which I had the problem with initially.
It compiles and runs just fine with GCC (g++) 4.4 and 4.6, but if I use GCC 4.7 and enable third-level optimizations:
g++-4.7 -g -O3 prog.cpp -o prog
I get a segmentation fault when running it. GDB does not really give much information on what went wrong:
(gdb) run
Starting program: /home/kalle/work/code/advect_diff/c++/strunt
Program received signal SIGSEGV, Segmentation fault.
apply (d=d#entry=0x7fffffffe1a0) at src/strunt.cpp:25
25 int apply(const Data *d)
(gdb) bt
#0 apply (d=d#entry=0x7fffffffe1a0) at src/strunt.cpp:25
#1 0x00000000004004cc in main () at src/strunt.cpp:34
I've tried tweaking the code in different ways to see if the error goes away. It seems necessary to have all of the four loop levels in foo, and I have not been able to reproduce it by having a single level of function calls. Oh yeah, the outermost loop must use an unsigned loop index.
I'm starting to suspect that this is a bug in the compiler or runtime, since it is specific to version 4.7 and I cannot see what memory accesses are invalid.
Any insight into what is going on would be very much appreciated.
It is possible to get the same situation with the C-version of GCC, with a slight modification of the code.
My system is:
Debian wheezy
Linux 3.2.0-4-amd64
GCC 4.7.2-5
Okay so I looked at the disassembly offered by gdb, but I'm afraid it doesn't say much to me:
Dump of assembler code for function apply(Data const*):
0x0000000000400760 <+0>: push %r13
0x0000000000400762 <+2>: movabs $0x400000000,%r8
0x000000000040076c <+12>: push %r12
0x000000000040076e <+14>: push %rbp
0x000000000040076f <+15>: push %rbx
0x0000000000400770 <+16>: mov 0x24(%rdi),%ecx
=> 0x0000000000400773 <+19>: mov (%rdi,%r8,1),%ebp
0x0000000000400777 <+23>: mov 0x18(%rdi),%r10d
0x000000000040077b <+27>: mov $0x4,%r8b
0x000000000040077e <+30>: mov 0x28(%rdi),%edx
0x0000000000400781 <+33>: mov 0x2c(%rdi),%eax
0x0000000000400784 <+36>: mov %ecx,%ebx
0x0000000000400786 <+38>: mov (%rdi,%r8,1),%r11d
0x000000000040078a <+42>: mov 0x1c(%rdi),%r9d
0x000000000040078e <+46>: imul %ebp,%ebx
0x0000000000400791 <+49>: mov $0x8,%r8b
0x0000000000400794 <+52>: mov 0x20(%rdi),%esi
What should I see when I look at this?
Edit 2015-08-13: This seems to be fixed in g++ 4.8 and later.
You never initialized d. Its value is indeterminate, and trying to do math with its contents is undefined behavior. (Even trying to read its values without doing anything with them is undefined behavior.) Initialize d and see what happens.
Now that you've initialized d and it still fails, that looks like a real compiler bug. Try updating to 4.7.3 or 4.8.2; if the problem persists, submit a bug report. (The list of known bugs currently appears to be empty, or at least the link is going somewhere that only lists non-bugs.)
It is indeed, unfortunately, a bug in GCC. I haven't the slightest idea what it is doing there, but the generated assembly for the apply function is as follows (I compiled it without main, by the way, and foo is inlined into it):
_Z5applyPK4Data:
pushq %r13
movabsq $17179869184, %r8
pushq %r12
pushq %rbp
pushq %rbx
movl 36(%rdi), %ecx
movl (%rdi,%r8), %ebp
movl 24(%rdi), %r10d
and it crashes exactly at the movl (%rdi,%r8), %ebp, since it adds a nonsensical 0x400000000 to %rdi (the first parameter, thus the pointer to Data) and dereferences the result.
It's really a simple problem:
I'm writing a program for the board game Go. Should I represent the board with a QVector<int> or a QVector<Player>, where
enum Player
{
    EMPTY = 0,
    BLACK = 1,
    WHITE = 2
};
I guess that, of course, using Player instead of integers will be slower. But I wonder how much slower, because I believe that using an enum is better coding.
I've done a few tests comparing assignment and comparison of Player (as opposed to int):
QVector<int> vec;
vec.resize(10000000);
int size = vec.size();
for(int i = 0; i < size; ++i)
{
    vec[i] = 0;
}
for(int i = 0; i < size; ++i)
{
    bool b = (vec[i] == 1);
}

QVector<Player> vec2;
vec2.resize(10000000);
int size = vec2.size();
for(int i = 0; i < size; ++i)
{
    vec2[i] = EMPTY;
}
for(int i = 0; i < size; ++i)
{
    bool b = (vec2[i] == BLACK);
}
Basically, it's only 10% slower. Is there anything else I should know before continuing?
Thanks!
Edit : The 10% difference is not a figment of my imagination, it seems to be specific to Qt and QVector. When I use std::vector, the speed is the same
Enums are completely resolved at compile time (enum constants as integer literals, enum variables as integer variables), there's no speed penalty in using them.
In general, the average enumeration won't have an underlying type bigger than int (unless you put very big constants in it); in fact, §7.2 ¶5 of the standard explicitly says:
The underlying type of an enumeration is an integral type that can represent all the enumerator values defined in the enumeration. It is implementation-defined which integral type is used as the underlying type for an enumeration except that the underlying type shall not be larger than int unless the value of an enumerator cannot fit in an int or unsigned int.
You should use enumerations when it's appropriate because they usually make the code easier to read and to maintain (have you ever tried to debug a program full of "magic numbers"? :S).
As for your results: probably your test methodology doesn't take into account the normal speed fluctuations you get when you run code on "normal" machines[1]; have you tried running the test many (100+) times and calculating the mean and standard deviation of your times? The results should be compatible: the difference between the means shouldn't be bigger than 1 or 2 times the RSS[2] of the two standard deviations (assuming, as usual, a Gaussian distribution for the fluctuations).
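In symbols: with means m1, m2 and standard deviations s1, s2 from the two sets of runs, the timings are compatible when |m1 - m2| <= k * sqrt(s1^2 + s2^2), with k around 1 or 2.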
Another check you could do is to compare the generated assembly code (with g++ you can get it with the -S switch).
On "normal" PCs you have some indeterministic fluctuations because of other tasks running, cache/RAM/VM state, ...
Root Sum Squared, the square root of the sum of the squared standard deviations.
In general, using an enum should make absolutely no difference to performance. How did you test this?
I just ran tests myself. The differences are pure noise.
Just now, I compiled both versions to assembler. Here's the main function from each:
int version:
LFB1778:
pushl %ebp
LCFI11:
movl %esp, %ebp
LCFI12:
subl $8, %esp
LCFI13:
movl $65535, %edx
movl $1, %eax
call __Z41__static_initialization_and_destruction_0ii
leave
ret
Player version:
LFB1774:
pushl %ebp
LCFI10:
movl %esp, %ebp
LCFI11:
subl $8, %esp
LCFI12:
movl $65535, %edx
movl $1, %eax
call __Z41__static_initialization_and_destruction_0ii
leave
ret
It's hazardous to base any statement regarding performance on micro-benchmarks. There are too many extraneous factors skewing the data.
Enums should be no slower. They're implemented as integers.
If you use Visual Studio, for example, you can create a simple project where you have
a = Player::EMPTY;
and if you right-click and choose "Go To Disassembly", the code will be
mov dword ptr [a],0
So the compiler replaces the enum with its value, and normally it will not generate any overhead.
Well, I did a few tests and there wasn't much difference between the integer and enum forms. I also added a char form, which was consistently about 6% quicker (not surprising, since it uses less memory). Then I used a plain char array rather than a vector, and that was 300% faster! Since we've not been told what QVector is, it could be a wrapper for an array rather than the std::vector I've used.
Here's the code I used, compiled using standard release options in Dev Studio 2005. Note that I've changed the timed loop a small amount as the code in the question could be optimised to nothing (you'd have to check the assembly code).
#include <windows.h>
#include <cstdlib>   // rand
#include <vector>
#include <iostream>
using namespace std;
enum Player
{
EMPTY = 0,
BLACK = 1,
WHITE = 2
};
template <class T, T search>
LONGLONG TimeFunction ()
{
vector <T>
vec;
vec.resize (10000000);
size_t
size = vec.size ();
for (size_t i = 0 ; i < size ; ++i)
{
vec [i] = static_cast <T> (rand () % 3);
}
LARGE_INTEGER
start,
end;
QueryPerformanceCounter (&start);
for (size_t i = 0 ; i < size ; ++i)
{
if (vec [i] == search)
{
break;
}
}
QueryPerformanceCounter (&end);
return end.QuadPart - start.QuadPart;
}
LONGLONG TimeArrayFunction ()
{
size_t
size = 10000000;
char
*vec = new char [size];
for (size_t i = 0 ; i < size ; ++i)
{
vec [i] = static_cast <char> (rand () % 3);
}
LARGE_INTEGER
start,
end;
QueryPerformanceCounter (&start);
for (size_t i = 0 ; i < size ; ++i)
{
if (vec [i] == 10)
{
break;
}
}
QueryPerformanceCounter (&end);
delete [] vec;
return end.QuadPart - start.QuadPart;
}
int main ()
{
cout << " Char form = " << TimeFunction <char, 10> () << endl;
cout << "Integer form = " << TimeFunction <int, 10> () << endl;
cout << " Player form = " << TimeFunction <Player, static_cast <Player> (10)> () << endl;
cout << " Array form = " << TimeArrayFunction () << endl;
}
The compiler should convert enums into integers. They get resolved at compile time, so once your program is compiled, it should be exactly the same as if you had used the integers themselves.
If your testing produces different results, there could be something going on with the test itself. Either that, or your compiler is behaving oddly.
This is implementation-dependent, and it is quite possible for enums and ints to have different performance with either the same or different assembly code, although it is probably a sign of a suboptimal compiler. Some ways to get differences are:
QVector may be specialized on your enum type to do something surprising.
enum doesn't get compiled to int but to "some integral type no larger than int". QVector of int may be specialized differently from QVector of some_integral_type.
even if QVector isn't specialized, the compiler may do a better job of aligning ints in memory than of aligning some_integral_type, leading to a greater cache miss rate when you loop over the vector of enums or of some_integral_type.
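One way to check what your compiler actually picked for the underlying type is a small C++11 sketch using std::underlying_type (the checks printed here are my own additions, not from the question):

#include <cstdio>
#include <type_traits>

enum Player { EMPTY = 0, BLACK = 1, WHITE = 2 };

int main() {
    // Compare sizes directly; a smaller underlying type changes memory layout.
    std::printf("sizeof(Player) = %zu, sizeof(int) = %zu\n",
                sizeof(Player), sizeof(int));
    // Check whether the implementation-defined underlying type is exactly int.
    std::printf("underlying type is int: %d\n",
                (int)std::is_same<std::underlying_type<Player>::type, int>::value);
    return 0;
}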