Clang sanitizers missing a read from uninitialized memory - c++

I have the following code, which I am confident reads from garbage (uninitialized) memory, but the clang sanitizers do not complain.
Is there something I can do to make them trigger, or should I just accept this as a limitation/bug?
#include <algorithm>
#include <iostream>
#include <vector>

struct B {
    int x;
};

struct D : public B {
    short y;
    D& operator=(const D& other) {
        y = other.y;
        return *this;
    }
};

int main() {
    D var1{4, 7}, var2;
    var2 = var1;
    std::cout << var2.x << " " << var2.y << std::endl;
}
I have tried compiling with -O0, since that sometimes helps, but this time it did not.
godbolt
I am open to using gcc as well, but I think gcc does not have a memory sanitizer, only ASan.

From the documentation
Uninitialized values occur when stack- or heap-allocated memory is read before it is written. MSan detects cases where such values affect program execution.
MSan is bit-exact: it can track uninitialized bits in a bitfield. It will tolerate copying of uninitialized memory, and also simple logic and arithmetic operations with it. In general, MSan silently tracks the spread of uninitialized data in memory, and reports a warning when a code branch is taken (or not taken) depending on an uninitialized value.
That is, in order to minimize false positives, MSan waits until it is convinced that the uninitialized memory really has an impact on program execution (a different branch is taken, a different value is returned from main, etc.) before complaining. Merely copying uninitialized memory around could be innocuous.
In your particular program, the actual use of the uninitialized value happens inside the standard library, possibly even just in the C library, which has not been instrumented with MSan, so you do not get a warning.
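If the uninitialized value is instead used in a branch inside your own (instrumented) code, MSan does complain. Here is a minimal sketch (same B and D as in the question), assuming the whole file is compiled with clang++ -std=c++17 -fsanitize=memory -g -O0:

#include <iostream>

struct B {
    int x;
};

struct D : public B {
    short y;
    D& operator=(const D& other) {
        y = other.y;          // x is never copied
        return *this;
    }
};

int main() {
    D var1{4, 7}, var2;
    var2 = var1;              // var2.x stays uninitialized
    if (var2.x > 0)           // this branch depends on uninitialized memory,
        std::cout << "positive\n";   // so MSan should report use-of-uninitialized-value
}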
It is critical that you should build all the code in your program (including libraries it uses, in particular, C++ standard library) with MSan.
This constraint is the main reason why this sanitizer is much less popular than, say, ASan or UBSan.
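In practice that means compiling and linking against an MSan-instrumented C++ standard library. A rough sketch of what the final compile line can look like, where /path/to/libcxx-msan is a placeholder for wherever you built the instrumented libc++ (see the MSan documentation for the libc++ build itself):

clang++ -fsanitize=memory -fsanitize-memory-track-origins -g -O0 \
    -stdlib=libc++ -nostdinc++ \
    -isystem /path/to/libcxx-msan/include/c++/v1 \
    -L/path/to/libcxx-msan/lib -Wl,-rpath,/path/to/libcxx-msan/lib \
    main.cpp -o main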
Coming back to this simple program: various static analysis tools can detect the issue; even just g++ -Wall -O will warn, but be aware that false positives are not rare.
x.cc: In function 'int main()':
x.cc:20:28: warning: 'var2.D::<anonymous>.B::x' is used uninitialized [-Wuninitialized]
20 | std::cout << var2.x << " " << var2.y << std::endl;
| ^~~~~

Related

Out of bounds pointer arithmetic not detected?

According to Wikipedia and this, this code is undefined behavior:
#include <iostream>

int main(int, char**) {
    int data[1] = {123};
    int* p = data + 5; // undefined behavior
    std::cout << *(p - 5) << std::endl;
}
Compiled with clang++-6.0 -fsanitize=undefined and executed, the undefined behavior is detected which is fantastic, I get this message:
ub.cpp:5:19: runtime error: index 5 out of bounds for type 'int [1]'
But when I don't use an array, the undefined behavior is not detectable:
#include <iostream>

int main(int, char**) {
    int data = 123;
    int* p = &data + 5; // undefined behavior
    std::cout << *(p - 5) << std::endl;
}
The sanitizer detects nothing, even though this is still undefined behavior. Valgrind also does not show any problem. Any way to detect this undefined behavior?
Since I am never accessing any invalid data, this is not a duplicate of Recommended way to track down array out-of-bound access/write in C program.
The standard very clearly specifies that most forms of undefined behaviour are "no diagnostic required". Meaning that your compiler is under no obligation to diagnose UB (which would also be unreasonable, since it is very hard to do so in many cases). Instead, the compiler is allowed to just assume that you "of course did not" write any UB and generate code as if you didn't. And if you did, that's on you and you get to keep the broken pieces.
Some tools (like ASan and UBSan, and turning your compiler's warning level up to 11) will detect some UB for you. But not all.
Your compiler implementors are not out to harm you. They do try to warn you of UB when they can. So, at the very least you should enable all warnings and let them help you as best they can.
One way to detect UB is to have intimate knowledge of the C++ standard and read code really carefully. You cannot really do much better than that, plus letting some tools help you find the low-hanging fruit. You just have to know (all) the rules and know what you are doing.
There are no training wheels or similar in C++.

Clang++ 6.0 Memory Sanitizer not reporting uninitialised local variable in a function whose return value dictates a conditional branch

The following code (in src.cpp) was used to experiment with Clang's Memory Sanitizer (MSan):
#include <iostream>
#include <vector>

int add(int x, int y) {
    int sum;
    sum = x + y;
    return sum;
}

int main() {
    if (add(10, 20) > 0) {
        std::cout << "Greater";
    }
    std::cout << std::endl;
    return 0;
}
We can clearly see that sum is uninitialized and would cause undefined behaviour. As per the MSan GitHub wiki:
MemorySanitizer is bit-exact: it can track uninitialized bits in a bitfield. It will tolerate copying of uninitialized memory, and also simple logic and arithmetic operations with it. In general, MemorySanitizer silently tracks the spread of uninitialized data in memory, and reports a warning when a code branch is taken (or not taken) depending on an uninitialized value.
This clearly seems to match this use case, since the if branch will be taken based on the initial value of sum. However, no error/warning is displayed when running this code compiled with
clang++ -fsanitize=memory -fsanitize-memory-track-origins -O0 -std=c++14 src.cpp -o src
Clang 6.0 is used on Linux x86_64.
sum is never read while uninitialized, because the very next statement assigns to it before it is used.
This code is the same as:
    int sum = x + y;
and that's why it is initialized by the time it is returned.
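For contrast, here is a sketch of a variant in which sum really can be read before it is written (the x > 100 condition is just an arbitrary tweak for illustration). Compiled with clang++ -fsanitize=memory -g -O0, MSan should then report a use-of-uninitialized-value when the branch in main is evaluated:

#include <iostream>

int add(int x, int y) {
    int sum;              // only conditionally assigned
    if (x > 100)
        sum = x + y;
    return sum;           // uninitialized whenever x <= 100
}

int main() {
    if (add(10, 20) > 0) {   // branch now depends on an uninitialized value
        std::cout << "Greater";
    }
    std::cout << std::endl;
    return 0;
}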

Is this compiler optimization inconsistency entirely explained by undefined behaviour?

During a discussion I had with a couple of colleagues the other day I threw together a piece of code in C++ to illustrate a memory access violation.
I am currently in the process of slowly returning to C++ after a long spell of almost exclusively using languages with garbage collection and, I guess, my loss of touch shows, since I've been quite puzzled by the behaviour my short program exhibited.
The code in question is as such:
#include <iostream>

using std::cout;
using std::endl;

struct A
{
    int value;
};

void f()
{
    A* pa;          // Uninitialized pointer
    cout << pa << endl;
    pa->value = 42; // Writing via an uninitialized pointer
}

int main(int argc, char** argv)
{
    f();
    cout << "Returned to main()" << endl;
    return 0;
}
I compiled it with GCC 4.9.2 on Ubuntu 15.04 with -O2 compiler flag set. My expectations when running it were that it would crash when the line, denoted by my comment as "writing via an uninitialized pointer", got executed.
Contrary to my expectations, however, the program ran successfully to the end, producing the following output:
0
Returned to main()
I recompiled the code with a -O0 flag (to disable all optimizations) and ran the program again. This time, the behaviour was as I expected:
0
Segmentation fault
(Well, almost: I didn't expect a pointer to be initialized to 0.) Based on this observation, I presume that when compiling with -O2 set, the fatal instruction got optimized away. This makes sense, since no further code accesses the pa->value after it's set by the offending line, so, presumably, the compiler determined that its removal would not modify the observable behaviour of the program.
I reproduced this several times and every time the program would crash when compiled without optimization and miraculously work, when compiled with -O2.
My hypothesis was further confirmed when I added a line that outputs pa->value to the end of f()'s body:
cout<< pa->value << endl;
Just as expected, with this line in place, the program consistently crashes, regardless of the optimization level, with which it was compiled.
This all makes sense, if my assumptions so far are correct.
However, where my understanding breaks somewhat is in case where I move the code from the body of f() directly to main(), like so:
int main(int argc, char** argv)
{
    A* pa;
    cout << pa << endl;
    pa->value = 42;
    cout << pa->value << endl;
    return 0;
}
With optimizations disabled, this program crashes, just as expected. With -O2, however, the program successfully runs to the end and produces the following output:
0
42
And this makes no sense to me.
This answer mentions "dereferencing a pointer that has not yet been definitely initialized", which is exactly what I'm doing, as one of the sources of undefined behaviour in C++.
So, is this difference in the way optimization affects the code in main(), compared to the code in f(), entirely explained by the fact that my program contains UB, and thus compiler is technically free to "go nuts", or is there some fundamental difference, which I don't know of, between the way code in main() is optimized, compared to code in other routines?
Your program has undefined behaviour. This means that anything may happen. The program is not covered at all by the C++ Standard. You should not go in with any expectations.
It's often said that undefined behaviour may "launch missiles" or "cause demons to fly out of your nose", to reinforce that point. The latter is more far-fetched, but the former is feasible: imagine your code is on a nuclear launch site and the wild pointer happens to write to a piece of memory that starts global thermonuclear war.
Writing through unknown pointers has always been something which could have unknown consequences. What's nastier is a currently fashionable philosophy which suggests that compilers should assume that programs will never receive inputs that cause UB, and should thus optimize out any code which would test for such inputs if such tests would not prevent UB from occurring.
Thus, for example, given:
#include <cstdint>

void launch_missiles();   // defined elsewhere

uint32_t hey(uint16_t x, uint16_t y)
{
    if (x < 60000)
        launch_missiles();
    else
        return x * y;
}

uint32_t wow(uint16_t x)
{
    return hey(x, 40000);
}
a 32-bit compiler could legitimately replace wow with an unconditional call to
launch_missiles without regard for the value of x, since x "can't possibly" be greater than 53687 (any value beyond that would cause the calculation of x*y to overflow in signed int arithmetic). Even though the authors of C89 noted that the majority of compilers of that era would calculate the correct result in a situation like the above, the Standard doesn't impose any requirements on compilers, and hyper-modern philosophy regards it as "more efficient" for compilers to assume programs will never receive inputs that would necessitate reliance upon such things.
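For what it's worth, the 53687 threshold can be checked with a couple of lines. This sketch assumes, as the example does, a platform where int is 32 bits, so the uint16_t operands are promoted to signed int before the multiplication:

#include <cstdint>

static_assert(53687LL * 40000LL <= INT32_MAX,
              "53687 * 40000 = 2'147'480'000 still fits in a 32-bit int");
static_assert(53688LL * 40000LL > INT32_MAX,
              "53688 * 40000 = 2'147'520'000 overflows, which is UB for signed int");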

Erratic behaviour with missing return in c++ and optimizations

Suppose you wrote a function in C++, but absentmindedly forgot to type the word return. What would happen in that case? I was hoping that the compiler would complain, or at least that a segmentation fault would be raised once the program got to that point. However, what actually happens is far worse: the program spews out rubbish. Not only that, but the actual output depends on the level of optimization! Here's some code that demonstrates this problem:
#include <iostream>
#include <vector>

using namespace std;

double max_1(double n1, double n2)
{
    if (n1 > n2)
        n1;
    else
        n2;
}

int max_2(const int n1, const int n2)
{
    if (n1 > n2)
        n1;
    else
        n2;
}

size_t max_length(const vector<int>& v1, const vector<int>& v2)
{
    if (v1.size() > v2.size())
        v1.size();
    else
        v2.size();
}

int main(void)
{
    cout << max_1(3, 4) << endl;
    cout << max_1(4, 3) << endl;
    cout << max_2(3, 4) << endl;
    cout << max_2(4, 3) << endl;
    cout << max_length(vector<int>(3, 1), vector<int>(4, 1)) << endl;
    cout << max_length(vector<int>(4, 1), vector<int>(3, 1)) << endl;
    return 0;
}
And here's what I get when I compile it at different optimization levels:
$ rm ./a.out; g++ -O0 ./test.cpp && ./a.out
nan
nan
134525024
134525024
4
4
$ rm ./a.out; g++ -O1 ./test.cpp && ./a.out
0
0
0
0
0
0
$ rm ./a.out; g++ -O2 ./test.cpp && ./a.out
0
0
0
0
0
0
$ rm ./a.out; g++ -O3 ./test.cpp && ./a.out
0
0
0
0
0
0
Now imagine that you're trying to debug the function max_length. In production mode you get the wrong answer, so you recompile in debug mode, and now when you run it everything works fine.
I know there are ways to avoid such cases altogether by adding the appropriate warning flags (-Wreturn-type), but I still have two questions:
Why does the compiler even agree to compile a function without a return statement? Is this feature required for legacy code?
Why does the output depend on the optimization level?
It is undefined behavior to flow off the end of a value-returning function; this is covered in the draft C++ standard, section 6.6.3 (The return statement), which says:
Flowing off the end of a function is equivalent to a return with no value; this results in undefined behavior in a value-returning function.
The compiler is not required to issue a diagnostic; we can see this from section 1.4 (Implementation compliance), which says:
The set of diagnosable rules consists of all syntactic and semantic rules in this International Standard except for those rules containing an explicit notation that “no diagnostic is required” or which are described as resulting in “undefined behavior.”
That said, compilers in general do try to catch a wide range of undefined behaviors and produce warnings, although usually you need to use the right set of flags. For gcc and clang I find the following set of flags to be useful:
-Wall -Wextra -Wconversion -pedantic
and in general I would encourage you to turn warnings into errors using -Werror.
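For the missing-return case specifically, it can also be worth promoting just that one diagnostic to a hard error with -Werror=return-type (which affects only -Wreturn-type, independently of a global -Werror), so such a function never makes it past compilation; for example:

$ g++ -Wall -Wextra -Wconversion -pedantic -Werror=return-type test.cpp -o test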
Compilers are notorious for taking advantage of undefined behavior during the optimization stages; see Finding Undefined Behavior Bugs by Finding Dead Code for some good examples, including the infamous Linux kernel null pointer check removal, where, in processing this code:
struct foo *s = ...;
int x = s->f;
if (!s) return ERROR;
gcc inferred that since s was dereferenced in s->f, and since dereferencing a null pointer is undefined behavior, s must not be null, and it therefore optimized away the if (!s) check on the next line (copied from my answer here).
Since undefined behavior is unpredictable, at more aggressive settings the compiler will in many cases perform more aggressive optimizations, many of which may not make much intuitive sense; but hey, it is undefined behavior, so you should have no expectations anyway.
Note that although there are many cases where the compiler can determine that a function does not properly return, in the general case this is the halting problem. Doing this automatically at run-time would carry a cost, which violates the "don't pay for what you don't use" philosophy. Both gcc and clang do implement sanitizers to check for things like this; for example, using the -fsanitize=undefined flag would check for undefined behavior at run-time.
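As a concrete illustration, UBSan has a dedicated check for exactly this case (the location and wording of the report below are approximate and will vary between compiler versions):

$ clang++ -g -fsanitize=return ./test.cpp && ./a.out
./test.cpp:10:1: runtime error: execution reached the end of a value-returning function without returning a value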
You may want to check out this answer here
The gist of it is that the compiler allows you to not have a return statement since there are potentially many different execution paths; ensuring each one exits with a return can be tricky at compile time, so the compiler will take care of it for you.
Things to remember:
- if main ends without a return, it will always return 0.
- if another function ends without a return, what it "returns" is whatever happens to be left in the return register (eax on x86), often the result of the last evaluated expression.
- optimization changes the code at the assembly level. This is why you are getting the weird behavior: the compiler is "fixing" your code for you, changing when things are executed, giving a different last value and thus a different return value.
Hope this helped!

Different behavior of shift operator with -O2 and without

Without -O2 this code prints 84 84; with the -O2 flag the output is 84 42. The code was compiled using gcc 4.4.3 on a 64-bit Linux platform. Why is the output of the following code different?
Note that when compiled with -Os the output is 0 42.
#include <iostream>

using namespace std;

int main() {
    long long n = 42;
    int *p = (int *)&n;
    *p <<= 1;
    cout << *p << " " << n << endl;
    return 0;
}
When you use optimization with gcc, it can use certain assumptions based on the types of expressions to avoid repeating unnecessary reads and to keep values cached in registers rather than re-reading them from memory.
Your code has undefined behaviour because you cast a pointer to a long long (which gcc allows as an extension) to a pointer to an int and then manipulate the pointed-to object as if it were an int. A pointer-to-int cannot normally point to an object of type long long, so gcc is allowed to assume that an operation that writes to an int (via a pointer) won't affect an object that has type long long.
It is therefore legitimate of it to cache the value of n between the time it was originally assigned and the time at which it is subsequently printed. No valid write operation could have changed its value.
The particular switch and documentation to read is -fstrict-aliasing.
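If the goal really is to modify the low int-sized chunk of n, one conforming route is to copy the bytes with std::memcpy instead of writing through a type-punned pointer. A minimal sketch (it assumes a little-endian machine, where the first sizeof(int) bytes of n are the low-order ones):

#include <cstring>
#include <iostream>

int main() {
    long long n = 42;
    int low;
    std::memcpy(&low, &n, sizeof low);   // read the first sizeof(int) bytes of n
    low <<= 1;
    std::memcpy(&n, &low, sizeof low);   // write them back
    std::cout << low << " " << n << std::endl;   // prints 84 84 on little-endian
    return 0;
}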
You're breaking strict aliasing. Compiling with -Wall should give you a dereferencing type-punned pointer warning. See e.g. http://cellperformance.beyond3d.com/articles/2006/06/understanding-strict-aliasing.html
I get the same results with GCC 4.4.4 on Linux/i386.
The program's behavior is undefined, since it violates the strict aliasing rule.