Here is a piece of code that I've been trying to compile:
#include <cstdio>

#define N 3

struct Data {
    int A[N][N];
    int B[N];
};

int foo(int uloc, const int A[N][N], const int B[N])
{
    for (unsigned int j = 0; j < N; j++) {
        for (int i = 0; i < N; i++) {
            for (int r = 0; r < N; r++) {
                for (int q = 0; q < N; q++) {
                    uloc += B[i]*A[r][j] + B[j];
                }
            }
        }
    }
    return uloc;
}

int apply(const Data *d)
{
    return foo(4, d->A, d->B);
}

int main(int, char **)
{
    Data d;
    for (int i = 0; i < N; ++i) {
        for (int j = 0; j < N; ++j) {
            d.A[i][j] = 0.0;
        }
        d.B[i] = 0.0;
    }
    int res = 11 + apply(&d);
    printf("%d\n", res);
    return 0;
}
Yes, it looks quite strange and does not do anything useful at all at the moment, but it is the most concise version of a much larger program in which I first hit the problem.
It compiles and runs just fine with GCC (g++) 4.4 and 4.6, but if I use GCC 4.7 and enable third-level optimizations:
g++-4.7 -g -O3 prog.cpp -o prog
I get a segmentation fault when running it. GDB does not really give much information on what went wrong:
(gdb) run
Starting program: /home/kalle/work/code/advect_diff/c++/strunt
Program received signal SIGSEGV, Segmentation fault.
apply (d=d#entry=0x7fffffffe1a0) at src/strunt.cpp:25
25 int apply(const Data *d)
(gdb) bt
#0 apply (d=d#entry=0x7fffffffe1a0) at src/strunt.cpp:25
#1 0x00000000004004cc in main () at src/strunt.cpp:34
I've tried tweaking the code in different ways to see if the error goes away. It seems necessary to have all of the four loop levels in foo, and I have not been able to reproduce it by having a single level of function calls. Oh yeah, the outermost loop must use an unsigned loop index.
I'm starting to suspect that this is a bug in the compiler or runtime, since it is specific to version 4.7 and I cannot see what memory accesses are invalid.
Any insight into what is going on would be very much appreciated.
It is possible to get the same situation with the C front end of GCC (gcc), with a slight modification of the code.
My system is:
Debian wheezy
Linux 3.2.0-4-amd64
GCC 4.7.2-5
Okay so I looked at the disassembly offered by gdb, but I'm afraid it doesn't say much to me:
Dump of assembler code for function apply(Data const*):
0x0000000000400760 <+0>: push %r13
0x0000000000400762 <+2>: movabs $0x400000000,%r8
0x000000000040076c <+12>: push %r12
0x000000000040076e <+14>: push %rbp
0x000000000040076f <+15>: push %rbx
0x0000000000400770 <+16>: mov 0x24(%rdi),%ecx
=> 0x0000000000400773 <+19>: mov (%rdi,%r8,1),%ebp
0x0000000000400777 <+23>: mov 0x18(%rdi),%r10d
0x000000000040077b <+27>: mov $0x4,%r8b
0x000000000040077e <+30>: mov 0x28(%rdi),%edx
0x0000000000400781 <+33>: mov 0x2c(%rdi),%eax
0x0000000000400784 <+36>: mov %ecx,%ebx
0x0000000000400786 <+38>: mov (%rdi,%r8,1),%r11d
0x000000000040078a <+42>: mov 0x1c(%rdi),%r9d
0x000000000040078e <+46>: imul %ebp,%ebx
0x0000000000400791 <+49>: mov $0x8,%r8b
0x0000000000400794 <+52>: mov 0x20(%rdi),%esi
What should I see when I look at this?
Edit 2015-08-13: This seem to be fixed in g++ 4.8 and later.
You never initialized d. Its value is indeterminate, and trying to do math with its contents is undefined behavior. (Even trying to read its values without doing anything with them is undefined behavior.) Initialize d and see what happens.
Now that you've initialized d and it still fails, that looks like a real compiler bug. Try updating to 4.7.3 or 4.8.2; if the problem persists, submit a bug report. (The list of known bugs currently appears to be empty, or at least the link is going somewhere that only lists non-bugs.)
It is indeed, and unfortunately, a bug in GCC. I have not the slightest idea what it is doing there, but here is the generated assembly for the apply function (I compiled it without main, by the way, and foo is inlined into it):
_Z5applyPK4Data:
pushq %r13
movabsq $17179869184, %r8
pushq %r12
pushq %rbp
pushq %rbx
movl 36(%rdi), %ecx
movl (%rdi,%r8), %ebp
movl 24(%rdi), %r10d
and it crashes exactly at the movl (%rdi,%r8), %ebp, since it adds a nonsensical 0x400000000 to %rdi (the first parameter, i.e. the pointer to Data) and dereferences the result.
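The answer stops at the diagnosis, but since the question notes the crash requires an unsigned outer loop index, a hedged workaround sketch (my own, not from the original post) is simply to make the outer index signed like the inner ones; behavior is unchanged for these small trip counts:

#include <cstdio>

#define N 3

int foo(int uloc, const int A[N][N], const int B[N])
{
    for (int j = 0; j < N; j++) {          // was: unsigned int j
        for (int i = 0; i < N; i++) {
            for (int r = 0; r < N; r++) {
                for (int q = 0; q < N; q++) {
                    uloc += B[i]*A[r][j] + B[j];
                }
            }
        }
    }
    return uloc;
}

int main()
{
    int A[N][N] = {};   // zero-initialized, as in the original main
    int B[N] = {};
    std::printf("%d\n", 11 + foo(4, A, B));  // all terms are 0: prints 15
}

Of course the real fix is to upgrade the compiler; per the 2015 edit above, g++ 4.8 and later no longer miscompile the original code.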
Related
The following code declares a global array (256 MiB) and calculates the sum of its items.
This program consumes 188 KiB when running:
#include <cstdlib>
#include <iostream>

using namespace std;

const unsigned int buffer_len = 256 * 1024 * 1024; // 256 MB
unsigned char buffer[buffer_len];

int main()
{
    while (true)
    {
        int sum = 0;
        for (int i = 0; i < buffer_len; i++)
            sum += buffer[i];
        cout << "Sum: " << sum << endl;
    }
    return 0;
}
The following code is like the code above, but it sets the array elements to random values before calculating the sum of the array items.
This program consumes 256.1 MiB when running:
#include <cstdlib>
#include <iostream>

using namespace std;

const unsigned int buffer_len = 256 * 1024 * 1024; // 256 MB
unsigned char buffer[buffer_len];

int main()
{
    while (true)
    {
        // ********** Changing array items.
        for (int i = 0; i < buffer_len; i++)
            buffer[i] = std::rand() % 256;
        // **********
        int sum = 0;
        for (int i = 0; i < buffer_len; i++)
            sum += buffer[i];
        cout << "Sum: " << sum << endl;
    }
    return 0;
}
Why does the global array not consume memory when it is only read (188 KiB vs. 256 MiB)?
My compiler: GCC
My OS: Ubuntu 20.04
Update:
In my real scenario I will generate the buffer with the xxd command, so its elements are not zero:
$ xxd -i buffer.dat buffer.cpp
There's much speculation in the comments that this behavior is explained by compiler optimizations, but OP's use of plain g++ (i.e. without optimization flags) doesn't support this, and neither does the assembly output on the same architecture, which clearly shows buffer being used:
buffer:
.zero 268435456
.section .rodata
...
.L3:
movl -4(%rbp), %eax
cmpl $268435455, %eax
ja .L2
movl -4(%rbp), %eax
cltq
leaq buffer(%rip), %rdx
movzbl (%rax,%rdx), %eax
movzbl %al, %eax
addl %eax, -8(%rbp)
addl $1, -4(%rbp)
jmp .L3
The real reason you're seeing this behavior is the use of Copy On Write in the kernel's VM system. Essentially for a large buffer of zeros like you have here, the kernel will create a single "zero page" and point all pages in buffer to this page. Only when the page is written will it get allocated.
This is actually true in your second example as well (i.e. the behavior is the same), but you're touching every page with data, which forces the memory to be "paged in" for the entire buffer. Try writing only buffer_len/2 bytes, and you'll see that 128.2 MiB of memory gets allocated by the process: half the true size of the buffer.
Here's also a helpful summary from Wikipedia:
The copy-on-write technique can be extended to support efficient memory allocation by having a page of physical memory filled with zeros. When the memory is allocated, all the pages returned refer to the page of zeros and are all marked copy-on-write. This way, physical memory is not allocated for the process until data is written, allowing processes to reserve more virtual memory than physical memory and use memory sparsely, at the risk of running out of virtual address space.
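The effect is easy to observe directly. The following is a Linux-only sketch of my own (not the OP's program): anonymous mmap'd memory is zero-filled and lazily backed just like the zero-initialized BSS buffer in the question, and we can watch VmRSS in /proc/self/status to see that reading never pages memory in, while writing does.

#include <cstdio>
#include <cstring>
#include <cstddef>
#include <sys/mman.h>

// Read this process's resident set size (KiB) from /proc/self/status.
static long rss_kib()
{
    long kib = -1;
    FILE* f = std::fopen("/proc/self/status", "r");
    if (!f) return -1;
    char line[256];
    while (std::fgets(line, sizeof line, f))
        if (std::sscanf(line, "VmRSS: %ld kB", &kib) == 1) break;
    std::fclose(f);
    return kib;
}

int main()
{
    const std::size_t len = 256UL * 1024 * 1024;
    void* mem = mmap(nullptr, len, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (mem == MAP_FAILED) return 1;
    unsigned char* buf = static_cast<unsigned char*>(mem);

    long before = rss_kib();
    unsigned long sum = 0;
    for (std::size_t i = 0; i < len; i++)  // reads hit the shared zero page
        sum += buf[i];
    long after_read = rss_kib();

    std::memset(buf, 1, len / 2);          // writes fault real pages in
    long after_write = rss_kib();

    std::printf("sum=%lu read:+%ld KiB write:+%ld KiB\n",
                sum, after_read - before, after_write - before);
}

On my understanding of the accounting, the read delta stays tiny while the write delta jumps by roughly len/2 KiB, matching the 128.2 MiB observation above.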
I wrote an assembly function and a main in C in VS Code and compiled them with a standard command:
gcc charFull64.s file.c -gstabs -no-pie -o ex
The program linked and worked. Then I tried to add the files to a project in CLion. Unfortunately, I keep getting the same problem: the linker cannot find my assembly function. I tried adding some flags in CMake and looked for an answer on the internet, but without success.
Here is probably the significant part of the assembly file (charFull64.s):
[...]
.text
.global fullChar
.type fullChar , #function
#int fullChar (unsigned char*);
fullChar:
push %rbp
movq %rsp, %rbp
pushq %rbx
pushq %r12
pushq %r13
movq %rdi, number # unsigned char* to number
[...]
And my main in file.c:
#include <stdio.h>

extern int fullChar(unsigned char*);

int main()
{
    unsigned char a[] = "514";
    int size = fullChar(a);
    for (int i = 0; i < size; i++)
        printf(" %d", a[i]);
    printf("\n");
    return 0;
}
CLion still throws this:
CMakeFiles/program.dir/main.cpp.o: In function `main':
/home/stanley/CLionProjects/Factorization/main.cpp:17: undefined reference to `fullChar(unsigned char*)'
collect2: error: ld returned 1 exit status
CMakeFiles/program.dir/build.make:90: recipe for target 'program' failed
I don't know how to solve it anymore.
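No accepted fix is recorded here, but as a hedged sketch (the project and target names below are assumptions, not from the original post), the usual CLion/CMake recipe is to enable the ASM language and list the .s file as a source of the target:

# Hypothetical CMakeLists.txt sketch
cmake_minimum_required(VERSION 3.13)
project(Factorization C ASM)          # enables the GNU assembler for .s files

add_executable(program file.c charFull64.s)

# The original command line used -no-pie; absolute addressing such as
# "movq %rdi, number" generally requires non-PIE code.
target_link_options(program PRIVATE -no-pie)

Note also that the error output mentions main.cpp and prints the symbol as fullChar(unsigned char*), which is how a demangled C++ reference looks; if the calling file is compiled as C++, the declaration must be extern "C" int fullChar(unsigned char*); to match the unmangled assembly label.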
Edit: It seems this is a compiler bug indeed: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82803
I am writing a wrapper for writing logs that uses TLS to store a std::stringstream buffer. This code will be used by shared libraries. Looking at the code on godbolt.org, it seems that neither gcc nor clang will cache the result of a TLS lookup (the loop repeatedly calls __tls_get_addr()), even though I believe I have designed my class in a way that should allow it.
#include <sstream>

class LogStream
{
public:
    LogStream()
        : m_buffer(getBuffer())
    {
    }

    LogStream(std::stringstream& buffer)
        : m_buffer(buffer)
    {
    }

    static std::stringstream& getBuffer()
    {
        thread_local std::stringstream buffer;
        return buffer;
    }

    template <typename T>
    inline LogStream& operator<<(const T& t)
    {
        m_buffer << t;
        return *this;
    }

private:
    std::stringstream& m_buffer;
};

int main()
{
    LogStream log{};
    for (int i = 0; i < 12345678; ++i)
    {
        log << i;
    }
}
Looking at the assembly code output both gcc and clang generate pretty similar output:
clang 5.0.0:
xor ebx, ebx
.LBB0_3: # =>This Inner Loop Header: Depth=1
data16
lea rdi, [rip + LogStream::getBuffer[abi:cxx11]()::buffer[abi:cxx11]#TLSGD]
data16
data16
rex64
call __tls_get_addr#PLT // Called on every loop iteration.
lea rdi, [rax + 16]
mov esi, ebx
call std::basic_ostream<char, std::char_traits<char> >::operator<<(int)#PLT
inc ebx
cmp ebx, 12345678
jne .LBB0_3
gcc 7.2:
xor ebx, ebx
.L3:
lea rdi, guard variable for LogStream::getBuffer[abi:cxx11]()::buffer#tlsld[rip]
call __tls_get_addr#PLT // Called on every loop iteration.
mov esi, ebx
add ebx, 1
lea rdi, LogStream::getBuffer[abi:cxx11]()::buffer#dtpoff[rax+16]
call std::basic_ostream<char, std::char_traits<char> >::operator<<(int)#PLT
cmp ebx, 12345678
jne .L3
How can I convince both compilers that the lookup doesn't need to be repeatedly done?
Compiler options: -std=c++11 -O3 -fPIC
Godbolt link
This really looks like an optimization bug in both Clang and GCC.
Here's what I think happens. (I might be completely off.) The compiler completely inlines everything down to this code:
int main()
{
    // pseudo-access
    std::stringstream& m_buffer = LogStream::getBuffer::buffer;
    for (int i = 0; i < 12345678; ++i)
    {
        m_buffer << i;
    }
}
And then, not realizing that access to a thread-local is very expensive under -fPIC, it decides that the temporary reference to the global is not necessary and inlines that as well:
int main()
{
    for (int i = 0; i < 12345678; ++i)
    {
        // pseudo-access
        LogStream::getBuffer::buffer << i;
    }
}
Whatever actually happens, this is clearly a pessimization of the code you wrote. You should report this as a bug to GCC and Clang.
GCC bugtracker: https://gcc.gnu.org/bugzilla/
Clang bugtracker: https://bugs.llvm.org/
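Until such a bug is fixed, one possible workaround (my own sketch, not part of the original answer) is to launder the buffer's address through an empty asm statement, a GCC/Clang extension that makes the pointer opaque to the optimizer, so the TLS address cannot be re-derived inside the loop:

#include <sstream>
#include <iostream>

static std::stringstream& getBuffer()
{
    thread_local std::stringstream buffer;
    return buffer;
}

int main()
{
    std::stringstream* p = &getBuffer();  // one __tls_get_addr call here
    asm volatile("" : "+r"(p));           // optimizer can no longer trace p
                                          // back to the thread_local
    for (int i = 0; i < 10; ++i)
        *p << i;                          // no TLS lookup in the loop body
    std::cout << p->str() << "\n";
}

Whether the barrier is still necessary depends on the compiler version; the question's edit links the corresponding GCC Bugzilla entry.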
I watched (most of) Herb Sutter's "atomic<> Weapons" talk, and I wanted to test the "conditional lock with a loop inside" example. Apparently, although (if I understand correctly) the C++11 standard says the example below should work properly and be sequentially consistent, it does not.
Before you read on, my question is: Is this correct? Is the compiler broken? Is my code broken - do I have a race condition here which I missed? How do I bypass this?
I tried it on 3 different versions of Visual C++: VC10 professional, VC11 professional and VC12 Express (== Visual Studio 2013 Desktop Express).
Below is the code I used for the Visual Studio 2013. For the other versions I used boost instead of std, but the idea is the same.
#include <iostream>
#include <thread>
#include <mutex>
#include <chrono>

int a = 0;
std::mutex m;

void other()
{
    std::lock_guard<std::mutex> l(m);
    std::this_thread::sleep_for(std::chrono::milliseconds(2));
    a = 999999;
    std::this_thread::sleep_for(std::chrono::seconds(2));
    std::cout << a << "\n";
}

int main(int argc, char* argv[])
{
    bool work = (argc > 1);
    if (work)
    {
        m.lock();
    }
    std::thread th(other);
    for (int i = 0; i < 100000000; ++i)
    {
        if (i % 7 == 3)
        {
            if (work)
            {
                ++a;
            }
        }
    }
    if (work)
    {
        std::cout << a << "\n";
        m.unlock();
    }
    th.join();
}
To summarize the idea of the code: The global variable a is protected by the global mutex m. Assuming there are no command line arguments (argc==1) the thread which runs other() is the only one which is supposed to access the global variable a.
The correct output of the program is to print 999999.
However, because of a compiler loop optimization (using a register for the in-loop increments and copying the value back to a at the end of the loop), a is modified by the generated code even though it's not supposed to be.
This happened in all 3 VC versions, although in this code example in VC12 I had to plant some calls to sleep() to make it break.
Here's some of the assembly code (the address of a in this run is 0x00f65498):
Loop initialization - value from a is copied into edi
27: for (int i = 0; i < 100000000; ++i)
00F61543 xor esi,esi
00F61545 mov edi,dword ptr ds:[0F65498h]
00F6154B jmp main+0C0h (0F61550h)
00F6154D lea ecx,[ecx]
28: {
29: if (i % 7 == 3)
The increment happens within the condition, but after the loop the value is copied back to the location of a unconditionally:
30: {
31: if (work)
00F61572 mov al,byte ptr [esp+1Bh]
00F61576 jne main+0EDh (0F6157Dh)
00F61578 test al,al
00F6157A je main+0EDh (0F6157Dh)
32: {
33: ++a;
00F6157C inc edi
27: for (int i = 0; i < 100000000; ++i)
00F6157D inc esi
00F6157E cmp esi,5F5E100h
00F61584 jl main+0C0h (0F61550h)
32: {
33: ++a;
00F61586 mov dword ptr ds:[0F65498h],edi
34: }
And the output of the program is 0.
The 'volatile' keyword will prevent that kind of optimization. That's exactly what it's for: every use of 'a' will be read or written exactly as shown, and won't be moved in a different order to other volatile variables.
The implementation of the mutex should include compiler-specific instructions to cause a "fence" at that point, telling the optimizer not to reorder instructions across that boundary. Since the implementation is not from the compiler vendor, maybe that's left out? I've never checked.
Since 'a' is global, I would generally think the compiler would be more careful with it. But, VS10 doesn't know about threads so it won't consider that other threads will use it. Since the optimizer grasps the entire loop execution, it knows that functions called from within the loop won't touch 'a' and that's enough for it.
I'm not sure what the new standard says about thread visibility of global variables other than volatile. That is, is there a rule that would prevent that optimization (even though the function can be grasped all the way down so it knows other functions don't use the global, must it assume that other threads can) ?
I suggest trying the newer compiler with the compiler-provided std::mutex, and checking what the C++ standard and current drafts say about that. I think the above should help you know what to look for.
—John
Almost a month later, Microsoft still hasn't responded to the bug in MSDN Connect.
To summarize the above comments (and some further tests), apparently it happens in VS2013 professional as well, but the bug only happens when building for Win32, not for x64. The generated assembly code in x64 doesn't have this problem.
So it appears that it is a bug in the optimizer, and that there's no race condition in this code.
Apparently this bug also happens in GCC 4.8.1, but not in GCC 4.9.
(Thanks to Voo, nosid and Chris Dodd for all their testing).
It was suggested to mark a as volatile. This indeed prevents the bug, but only because it prevents the optimizer from performing the loop register optimization.
I found another solution: add another local variable b and, if needed (and under the lock), do the following:
- Copy a into b
- Increment b in the loop
- Copy b back into a if needed
The optimizer replaces the local variable with a register, so the code is still optimized, but the copies from and to a are done only if needed, and under lock.
Here's the new main() code, with arrows marking the changed lines.
int main(int argc, char* argv[])
{
    bool work = (argc == 1);
    int b = 0;                 // <----
    if (work)
    {
        m.lock();
        b = a;                 // <----
    }
    std::thread th(other);
    for (int i = 0; i < 100000000; ++i)
    {
        if (i % 7 == 3)
        {
            if (work)
            {
                ++b;           // <----
            }
        }
    }
    if (work)
    {
        a = b;                 // <----
        std::cout << a << "\n";
        m.unlock();
    }
    th.join();
}
And this is what the assembly code looks like (&a == 0x000744b0, b replaced with edi):
21: int b = 0;
00071473 xor edi,edi
22:
23: if (work)
00071475 test bl,bl
00071477 je main+5Bh (07149Bh)
24: {
25: m.lock();
........
00071492 add esp,4
26: b = a;
00071495 mov edi,dword ptr ds:[744B0h]
27: }
28:
........
33: {
34: if (work)
00071504 test bl,bl
00071506 je main+0C9h (071509h)
35: {
36: ++b;
00071508 inc edi
30: for (int i = 0; i < 100000000; ++i)
00071509 inc esi
0007150A cmp esi,5F5E100h
00071510 jl main+0A0h (0714E0h)
37: }
38: }
39: }
40:
41: if (work)
00071512 test bl,bl
00071514 je main+10Ch (07154Ch)
42: {
43: a = b;
44: std::cout << a << "\n";
00071516 mov ecx,dword ptr ds:[73084h]
0007151C push edi
0007151D mov dword ptr ds:[744B0h],edi
00071523 call dword ptr ds:[73070h]
00071529 mov ecx,eax
0007152B call std::operator<<<std::char_traits<char> > (071A80h)
........
This keeps the optimization and solves (or works around) the problem.
I've understood it's best to avoid _mm_set_epi*, and instead rely on _mm_load_si128 (or even _mm_loadu_si128 with a small performance hit if the data is not aligned). However, the impact this has on performance seems inconsistent to me. The following is a good example.
Consider the two following functions that utilize SSE intrinsics:
static uint32_t clmul_load(uint16_t x, uint16_t y)
{
    const __m128i c = _mm_clmulepi64_si128(
        _mm_load_si128((__m128i const*)(&x)),
        _mm_load_si128((__m128i const*)(&y)), 0);
    return _mm_extract_epi32(c, 0);
}

static uint32_t clmul_set(uint16_t x, uint16_t y)
{
    const __m128i c = _mm_clmulepi64_si128(
        _mm_set_epi16(0, 0, 0, 0, 0, 0, 0, x),
        _mm_set_epi16(0, 0, 0, 0, 0, 0, 0, y), 0);
    return _mm_extract_epi32(c, 0);
}
The following function benchmarks the performance of the two:
template <typename F>
void benchmark(int t, F f)
{
    std::mt19937 rng(static_cast<unsigned int>(std::time(0)));
    std::uniform_int_distribution<uint32_t> uint_dist10(
        0, std::numeric_limits<uint32_t>::max());
    std::vector<uint32_t> vec(t);
    auto start = std::chrono::high_resolution_clock::now();
    for (int i = 0; i < t; ++i)
    {
        vec[i] = f(uint_dist10(rng), uint_dist10(rng));
    }
    auto duration = std::chrono::duration_cast<
        std::chrono::milliseconds>(
            std::chrono::high_resolution_clock::now() - start);
    std::cout << (duration.count() / 1000.0) << " seconds.\n";
}
Finally, the following main program does some testing:
int main()
{
    const int N = 10000000;
    benchmark(N, clmul_load);
    benchmark(N, clmul_set);
}
On an i7 Haswell with MSVC 2013, a typical output is
0.208 seconds. // _mm_load_si128
0.129 seconds. // _mm_set_epi16
Using GCC with parameters -O3 -std=c++11 -march=native (with slightly older hardware), a typical output is
0.312 seconds. // _mm_load_si128
0.262 seconds. // _mm_set_epi16
What explains this? Are there actually cases where _mm_set_epi* is preferable over _mm_load_si128? There are other times where I've noticed _mm_load_si128 to perform better, but I can't really characterize those observations.
Your compiler is optimizing away the "gather" behavior of your _mm_set_epi16() call since it really isn't needed. From g++ 4.8 (-O3) and gdb:
(gdb) disas clmul_load
Dump of assembler code for function clmul_load(uint16_t, uint16_t):
0x0000000000400b80 <+0>: mov %di,-0xc(%rsp)
0x0000000000400b85 <+5>: mov %si,-0x10(%rsp)
0x0000000000400b8a <+10>: vmovdqu -0xc(%rsp),%xmm0
0x0000000000400b90 <+16>: vmovdqu -0x10(%rsp),%xmm1
0x0000000000400b96 <+22>: vpclmullqlqdq %xmm1,%xmm0,%xmm0
0x0000000000400b9c <+28>: vmovd %xmm0,%eax
0x0000000000400ba0 <+32>: retq
End of assembler dump.
(gdb) disas clmul_set
Dump of assembler code for function clmul_set(uint16_t, uint16_t):
0x0000000000400bb0 <+0>: vpxor %xmm0,%xmm0,%xmm0
0x0000000000400bb4 <+4>: vpxor %xmm1,%xmm1,%xmm1
0x0000000000400bb8 <+8>: vpinsrw $0x0,%edi,%xmm0,%xmm0
0x0000000000400bbd <+13>: vpinsrw $0x0,%esi,%xmm1,%xmm1
0x0000000000400bc2 <+18>: vpclmullqlqdq %xmm1,%xmm0,%xmm0
0x0000000000400bc8 <+24>: vmovd %xmm0,%eax
0x0000000000400bcc <+28>: retq
End of assembler dump.
The vpinsrw (insert word) is ever-so-slightly faster than the unaligned double-quadword move from clmul_load, likely due to the internal load/store unit being able to do the smaller reads simultaneously but not the 16B ones. If you were doing more arbitrary loads, this would go away, obviously.
The slowness of _mm_set_epi* comes from the need to scrape together various variables into a single vector. You'd have to examine the generated assembly to be certain, but my guess is that since most of the arguments to your _mm_set_epi16 calls are constants (and zeroes, at that), GCC is generating a fairly short and fast set of instructions for the intrinsic.
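One more hedged aside (my observation, not part of the original answer): clmul_load as written reads 16 bytes from the address of a 2-byte parameter, which is an out-of-bounds access. When only a single small integer needs to land in the low lane, _mm_cvtsi32_si128 (a single movd) is a safe alternative to both versions. The sketch below uses a hypothetical helper name and a GCC/Clang target attribute so it compiles without passing -mpclmul:

#include <immintrin.h>
#include <cstdint>
#include <cstdio>

// Hypothetical variant: zero-extend each 16-bit input into the low
// 32 bits of a vector with movd, then carry-less multiply.
__attribute__((target("pclmul")))
static uint32_t clmul_cvt(uint16_t x, uint16_t y)
{
    const __m128i c = _mm_clmulepi64_si128(
        _mm_cvtsi32_si128(x),
        _mm_cvtsi32_si128(y), 0);
    return static_cast<uint32_t>(_mm_cvtsi128_si32(c));
}

int main()
{
    // carry-less 3 x 5: (x+1)(x^2+1) = x^3+x^2+x+1 = 0b1111 = 15
    std::printf("%u\n", clmul_cvt(3, 5));
}

This needs a PCLMUL-capable CPU at runtime, which covers essentially all x86-64 hardware from the last decade.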