UB when manipulating nullptr [duplicate] - c++

This is a related question to the discussion around Example of error caused by UB of incrementing a NULL pointer
Suppose I define this data structure:
union UPtrMem
{
    void* p;
    char ach[sizeof(void*)];
};

UPtrMem u;
u.p = nullptr;
u.p++; // UB according to standards
u.ach[0]++; // why is this OK then??
p and ach share the same memory, so is merely the act of modifying a memory location (that happens to contain a pointer) UB? I would think it only becomes undefined once you try to dereference the pointer.

This is still UB because, in C++,
it's undefined behavior to read from the member of the union that wasn't most recently written.
So you have UB regardless of the value of p. To conclude:
why is this OK then??
It is not.
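For completeness: if you want to inspect a pointer's bytes, a well-defined alternative to union punning is to copy the object representation with std::memcpy. A minimal sketch (illustrative, not from the question):

#include <cstring>
#include <cstdio>

int main() {
    void* p = nullptr;
    unsigned char bytes[sizeof p];
    std::memcpy(bytes, &p, sizeof p);  // copying the representation is well-defined
    for (unsigned char b : bytes)
        std::printf("%02x ", b);       // on common platforms: all zeros
    std::printf("\n");
}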

Your example doesn't contain any UB, because you don't get that far: it's invalid code that just won't compile.
To have the kind of UB you're thinking about, the title's “UB when manipulating nullptr”, you need to have the code executed.
That doesn't happen when it doesn't compile.
Just in case the question is changed after I answer, which isn't uncommon with these kinds of apparently designed-to-trap-the-responder questions, this is the code presented as I'm writing this:
union UPtrMem
{
    void* p;
    char ach[sizeof(void*)];
};

UPtrMem u;
u.p = nullptr;
u.p++; // UB according to standards
u.ach[0]++; // why is this OK then??
Incrementing a void* is just invalid, not a supported operation, and won't compile in standard C++ (some compilers, e.g. GCC, accept it only as an extension).

The reason the standard makes incrementing a null pointer undefined is that a null pointer does not necessarily hold an arithmetically meaningful value like 0. It could hold a specific bit pattern that indicates non-addressable memory to the CPU.
Your example has other problems too.
When you increment a pointer, the size of the pointed-to type is added to its value.
So on a 32-bit computer an int* will likely advance by 4 bytes (sizeof(int)) when you add 1 to it.
The problem with void* is that the compiler has no size information, so it cannot know how far to advance the value.
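To illustrate the scaling (a minimal sketch; the printed value assumes a platform where sizeof(int) is 4):

#include <cstdio>

int main() {
    int arr[2] = {10, 20};
    int* p = arr;
    int* q = p + 1;  // advances by sizeof(int) bytes, not by 1
    std::printf("%d bytes apart\n",
                (int)((char*)q - (char*)p));  // typically prints 4
}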
In your example you then do this:
u.ach[0]++;
That doesn't increment a pointer at all; it increments whatever char value is contained in the first element of the char array. Reading that union member after writing p is itself undefined, so even though it appears to work, you cannot rely on it having any specific value.

Seems to me u.p++; isn't even valid because void has no size, so there's nothing to increment. But u.ach[0]++; is valid because you're incrementing a char.
edit: yes, the pointer takes up space in the union... but what it points to has no size... what would it increment by?


I have a local char array in a function — when I return the array name, why is the return value null? [duplicate]

I have come across a very confusing thing. I have made a local char array in a function and returned the array name, but the return value is null?
#include <iostream>
using namespace std;

char* get_string(){
    char local[] = "hello world\n";
    cout << "1" << (int)local << endl; // shows a reasonable value
    return local;
}

int main(){
    char* p = get_string();
    cout << "2" << (int)p << endl; // shows 0
    return 0;
}
I know it is not good to use a local variable this way, because when the function returns, the stack space the local variable occupies will be reused by other function calls, but I would still expect this to return the address of the first element of the array, not null. I'm very confused; any help would be appreciated.
I use the QT 32-bit version; the compiler is MSVC2015 (I am at a baby stage with compilers; not even sure MSVC is the compiler's name).
--updated: I think this question is not a duplicate of Returning an array using C. I know it is not valid to use automatic/local storage outside its scope; my question is why the return value becomes 0 despite this inappropriate use.
--ok, thank you, everyone. I think I found the answer. Looking at the assembly code of the function char* get_string(), the last part is this:
0x44bce7 mov $0x0,%eax
0x44bcec leave
0x44bced ret
I think this is implementation-defined, hard-coded in the compiler: if I return the address of a local variable, then %eax or %rax is set to 0.
The C++ standard says (quoting the latest draft):
[basic.stc]
When the end of the duration of a region of storage is reached, the values of all pointers representing the address of any part of that region of storage become invalid pointer values.
Indirection through an invalid pointer value and passing an invalid pointer value to a deallocation function have undefined behavior.
Any other use of an invalid pointer value has implementation-defined behavior.
p contains an invalid pointer value, and printing the value of the pointer is included in "any other use", and thus the behaviour is implementation defined. In the observed case, the behaviour was to output 0.
Note to readers: in the example code there is no indirection through the invalid pointer, so the behaviour is not undefined.
P.S. Converting a pointer to int is not correct: int isn't guaranteed to be large enough to represent all pointer values, and on most 64-bit systems it isn't. The standard only specifies the behaviour of conversion to a sufficiently large integer type. I would suggest converting to void* instead in this case.
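As a minimal sketch of that suggestion (illustrative code, not from the question):

#include <iostream>

int main() {
    char local[] = "hello";
    // Casting to void* prints the address; without the cast, operator<<
    // would treat the char* as a C string and print its contents instead.
    std::cout << static_cast<const void*>(local) << '\n';
}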

What are null pointers used for [closed]

I have just started using C++ and saw that there is a null value for pointers. I am curious what this is used for. It seems like it would be pointless to have a pointer that points to nothing.
Well, the null pointer value has the remarkable property that, despite being a well-defined and unique constant value, with the exact representation depending on machine architecture and ABI (on most modern ones all-bits-zero, not that it matters), it never points to (or just past the end of) an object.
This allows it to be used as a reliable error-indicator when a valid pointer is expected (functions might throw an exception or terminate execution instead), as well as a sentinel value, or to mark the absence of something optional.
On many implementations, accessing memory through a null pointer will reliably cause a hardware exception (some even trap on arithmetic with it), though on many others, especially those without paging and/or segmentation, it will not.
Generally it's a placeholder. If you just declare a pointer, int *a;, there's no guarantee about what it contains when you access it. So if your code may or may not set the pointer later, there's no way to tell whether it is valid or just pointing to garbage memory. But if you initialize it to NULL, as in int *a = NULL;, you can check later whether it was ever set, with if (a == NULL).
Most of the time during initialization we assign a null value to a pointer, so that we can later check whether it is still null or an address has been assigned to it.
It seems like it would be pointless to add a pointer to point to nothing.
No, it is not. Suppose you have a function returning an optional dynamically allocated value. When you want to return "nothing" you return null. The caller can check for null and distinguish two cases: the return value is "nothing", or the return value is some valid, usable object.
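A minimal sketch of that pattern; the find function and its data are invented for illustration:

#include <cstring>
#include <cstdio>

// Returns a pointer to the matching entry, or nullptr for "nothing found".
const char* find(const char* const* names, int count, const char* wanted) {
    for (int i = 0; i < count; ++i)
        if (std::strcmp(names[i], wanted) == 0)
            return names[i];
    return nullptr;
}

int main() {
    const char* names[] = {"alpha", "beta"};
    const char* hit = find(names, 2, "gamma");
    if (hit == nullptr)            // caller distinguishes the two cases
        std::puts("not found");
}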
The null pointer value in C and C++ compares equal to 0, but C++11's nullptr is different: it has its own type, std::nullptr_t, which converts only to pointer types. We assign a null value to a pointer variable for various reasons:
To check whether memory has been allocated to the pointer yet
To neutralize a dangling pointer so that it cannot create any side effects
To check whether a returned address is a valid address, etc.
Basically, pointers are just integers. The null pointer is a pointer with a value of 0. It doesn't strictly point to nothing, it points to absolute address 0, which generally isn't accessible to your program; dereferencing it causes a fault.
It's generally used as a flag value, so that you can, for example, use it to end a loop.
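For example, a small sketch of the classic null-terminated list, the same convention used for argv:

#include <cstdio>

int main() {
    const char* words[] = {"one", "two", "three", nullptr};
    for (const char** w = words; *w != nullptr; ++w)  // the null pointer ends the loop
        std::puts(*w);
}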
Update:
There seem to be a lot of people confused by this answer, which is, strictly speaking, completely correct. See C11 (ISO/IEC 9899:201x) §6.3.2.3 Pointers, paragraph 3:
An integer constant expression with the value 0, or such an expression cast to type void *, is called a null pointer constant. If a null pointer constant is converted to a pointer type, the resulting pointer, called a null pointer, is guaranteed to compare unequal to a pointer to any object or function.
So, what's an address? It's a number n where 0 ≤ n ≤ max_address. And how do we represent such a number? Why, it's an integer, just like the standard says.
The C11 standard makes it clear that there's never anything to reference at address 0. That wasn't always taken for granted: in some old, pathologically non-portable code for 4.2BSD, you often saw code like this:
/* DON'T TRY THIS AT HOME */
int
main(){
    char target[100] ;
    char * tp = &target ;
    char * src = "This won't do what you think." ;
    void exit(int);
    while((*tp++ = *src++))
        ;
    exit(0);
}
This still compiles as C (with a warning):
$ gcc -o dumb dumb.c
dumb.c:6:12: warning: incompatible pointer types initializing 'char *' with an
expression of type 'char (*)[100]' [-Wincompatible-pointer-types]
char * tp = &target ;
^ ~~~~~~~
1 warning generated.
$
In 4.2BSD on a VAX, you could get away with that nonsense, because address 0 reliably contained the value 0, so the assignment evaluated to 0, which is of course FALSE.
Now, to demonstrate:
/* Very simple program dereferencing a NULL pointer. */
int
main() {
int * a_pointer ;
int a_value ;
void exit(int); /* To avoid any #includes */
a_pointer = ((void*)0);
a_value = *a_pointer ;
exit(0);
}
Here's the results:
$ gcc -o null null.c
$ ./null
Segmentation fault: 11
$

C++: Memory allocation

#include <iostream>
using namespace std;

int main()
{
    int a[100], n;
    cout << &n << " " << &a[100] << endl;
    if (&n != &a[100])
    {
        cout << " What is wrong with C++?";
    }
}
It prints the same address for n and a[100]. But when I compare the two in the if statement, it says they are not equal.
What does this mean?
When I change the value of n, a[100] also changes, so doesn't that mean n and a[100] are equal?
First, let's remember that there is no a[100]. It does not exist! If you tried to access a[100]'s "value" then, according to the abstract machine called C++, anything can happen. This includes blowing up the sun or dyeing my hair purple; now, I like my hair the way it is, so please don't!
Anyway, what you're doing is playing with the array's "one-past-the-end" pointer. You are allowed to obtain this pointer, which is a fake pointer of sorts, as long as you do not dereference it.
It is available only because the "101st element" would be "one past the end" of the array. (And there is debate as to whether you are allowed to write &a[100], rather than a+100, to get such a pointer; I am in the "no" camp.)
However, that still says nothing about comparing it to the address of some entirely different object. You cannot assume anything about the relative location of local variables in memory. You simply have no idea where n will live with respect to a.
The results you're observing are unpredictable, unreliable and meaningless side effects of the undefined behaviour exhibited by your program, caused by a combination of compiler optimisations and data locality.
For the array declaration int a[100], the valid indices run from 0 to 99. When you take the address of the 101st element, which is out of range, it could overlap with the next object on the stack (in your case the variable n). However, it's undefined behaviour.
For a bit of fact, here is the relevant text from the specification.
Equality operators (==, !=)
Pointers to objects of the same type can be compared for equality with the 'intuitive' expected results. From § 5.10 of the C++11 standard:
Pointers of the same type (after pointer conversions) can be compared for equality. Two pointers of the same type compare equal if and only if they are both null, both point to the same function, or both represent the same address (3.9.2).
(Leaving out details on comparison of pointers to members and/or the null pointer constants; they continue down the same line of 'Do What I Mean':)
[...] If both operands are null, they compare equal. Otherwise if only one is null, they compare unequal. [...]
The most 'conspicuous' caveat has to do with virtuals, and it does seem to be the logical thing to expect too:
[...] if either is a pointer to a virtual member function, the result is unspecified. Otherwise they compare equal if and only if they would refer to the same member of the same most derived object (1.8) or the same subobject if they were dereferenced with a hypothetical object of the associated class type. [...]
Maybe the problem is that an array of int is not the same as an int!
By writing &a[100] you're invoking undefined behavior, since there is no element with index 100 in the array a. To get the same address safely you could instead write a + 100 (or &a[0] + 100), in which case your program would be well-defined, but whether the if condition holds cannot be predicted or relied upon on any implementation.
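To make that concrete, here is a minimal sketch of forming the one-past-the-end pointer the well-defined way; whether it compares equal to &n is not something any implementation promises:

#include <iostream>

int main() {
    int a[100];
    int n = 0;
    int* past_end = a + 100;  // valid to form; never dereference it
    // Comparing against the address of an unrelated object gives an
    // unpredictable result; neither outcome can be relied upon.
    std::cout << (past_end == &n ? "same" : "different") << '\n';
}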

Can I safely create references to possibly invalid memory as long as I don't use it?

I want to parse UTF-8 in C++. When parsing a new character, I don't know in advance whether it is an ASCII byte or the leader of a multibyte character, and I also don't know whether my input string is long enough to contain the remaining bytes.
For simplicity, I'd like to name the four next bytes a, b, c and d, and because I am in C++, I want to do it using references.
Is it valid to define those references at the beginning of a function as long as I don't access them before I know that access is safe? Example:
void parse_utf8_character(const string s) {
    for (size_t i = 0; i < s.size();) {
        const char &a = s[i];
        const char &b = s[i + 1];
        const char &c = s[i + 2];
        const char &d = s[i + 3];
        if (is_ascii(a)) {
            i += 1;
            do_something_only_with(a);
        } else if (is_twobyte_leader(a)) {
            i += 2;
            if (is_safe_to_access_b()) {
                do_something_only_with(a, b);
            }
        }
        ...
    }
}
The above example shows what I want to do semantically. It doesn't illustrate why I want to do this, but obviously the real code will be more involved, so defining b, c, d only where I know that access is safe and I need them would be too verbose.
There are three takes on this:
Formally
well, who knows. I could find out for you by using quite some time on it, but then, so could you. Or any reader. And it's not like that's very practically useful.
EDIT: OK, looking it up, since you don't seem happy about me mentioning the formal without looking it up for you. Formally you're out of luck:
N3280 (C++11) §5.7/5 “If both the pointer operand and the result point to elements of the same array object, or one past
the last element of the array object, the evaluation shall not produce an overflow; otherwise, the behavior is undefined.”
Two situations where this can produce undesired behavior: (1) computing an address beyond the end of a segment, and (2) computing an address beyond an array that the compiler knows the size of, with debug checks enabled.
Technically
you're probably OK as long as you avoid any lvalue-to-rvalue conversion: if the references are implemented as pointers, then it's as safe as pointers, and if the compiler chooses to implement them as aliases, well, that's also OK.
Economically
relying needlessly on a subtlety wastes your time, and then also the time of others dealing with the code. So, not a good idea. Instead, declare the names when it's guaranteed that what they refer to, exists.
Before going into the legality of references to inaccessible memory, there is another problem in your code: your s[i+x] may call string::operator[] with a parameter bigger than s.size(). The C++11 standard says about string::operator[] ([string.access], §21.4.5):
Requires: pos <= size().
Returns: *(begin()+pos) if pos < size(), otherwise a reference to an object of type T with value charT(); the referenced value shall not be modified.
This means that calling s[x] for x > s.size() is undefined behaviour, so the implementation could very well terminate your program, e.g. by means of an assertion, for that.
Since string is now guaranteed to be contiguous, you could get around that problem by using &s[i] + x to compute an address. In practice this will probably work.
However, strictly speaking doing this is still illegal unfortunately. The reason for this is that the standard allows pointer arithmetic only as long as the pointer stays inside the same array, or one past the end of the array. The relevant part of the (C++11) standard is in [expr.add], §5.7.5:
If both the pointer operand and the result point to elements of the same array object, or one past the last element of the array object, the evaluation shall not produce an overflow; otherwise, the behavior is undefined.
Therefore generating references or pointers to invalid memory locations might work on most implementations, but it is technically undefined behaviour, even if you never dereference the pointer or use the reference. Relying on UB is almost never a good idea, because even if it works on all targeted systems, there are no guarantees it will continue to work in the future.
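One way to sidestep the issue entirely is to name each byte only after a bounds check, along these lines (a sketch; is_ascii and is_twobyte_leader stand in for the question's unspecified helpers):

#include <string>
#include <cstddef>

static bool is_ascii(unsigned char c)          { return c < 0x80; }
static bool is_twobyte_leader(unsigned char c) { return (c & 0xE0) == 0xC0; }

void parse_utf8_character(const std::string& s) {
    for (std::size_t i = 0; i < s.size();) {
        const unsigned char a = s[i];
        if (is_ascii(a)) {
            i += 1;
            // ... use a ...
        } else if (is_twobyte_leader(a) && i + 1 < s.size()) {
            const unsigned char b = s[i + 1];  // named only after the bounds check
            i += 2;
            // ... use a and b ...
            (void)b;
        } else {
            break;  // truncated or invalid sequence
        }
    }
}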
In principle, the idea of taking a reference for a possibly illegal memory address is itself perfectly legal. The reference is only a pointer under the hood, and pointer arithmetic is legal until dereferencing occurs.
EDIT: This claim is a practical one, not one covered by the published standard. There are many corners of the published standard which are formally undefined behaviour, but don't produce any kind of unexpected behaviour in practice.
Take for example the possibility of computing a pointer to the second item after the end of an array (as #DanielTrebbien suggests). The standard says the overflow may result in undefined behaviour. In practice, the overflow would only occur if the upper end of the array is just short of the end of the space addressable by a pointer: not a likely scenario. Even if it does happen, nothing bad would happen on most architectures. What would be violated are certain guarantees about pointer differences, which don't apply here.
#JoSo If you were working with a character array, you can avoid some of the uncertainty about reference semantics by replacing the const-references with const-pointers in your code. That way you can be certain no compiler will alias the values.

Pointers assignment

What is the meaning of
*(int *)0 = 0;
It does compile successfully
It has no meaning. That's an error. It's parsed as this:
(((int)0) = 0)
Thus, it is trying to assign to an rvalue. In this case, the right side is a cast of 0 to int (it's an int already, anyway). The result of a cast to anything that is not a reference is always an rvalue, and you try to assign 0 to that. What rvalues lack is object identity. The following would work:
int a;
(int&)a = 0;
Of course, you could equally well write it as the following
int a = 0;
Update: Question was badly formatted. The actual code was this
*(int*)0 = 0
Well, now it is an lvalue. But a fundamental invariant is broken. The Standard says
An lvalue refers to an object or function
The lvalue you assign to refers to neither an object nor a function. The Standard even explicitly says that dereferencing a null pointer (which is what (int*)0 creates) is undefined behavior. A program will usually crash on an attempt to write to such a dereferenced "object"; "usually", because the act of dereferencing is itself already declared undefined by C++.
Also, note that the above is not the same as the below:
int n = 0;
*(int*)n = 0;
While the above writes to a place where certainly no object is located, this one writes to whatever results from reinterpreting n as a pointer. The mapping to the pointer value is implementation-defined, but most compilers will just create a pointer referring to address zero here. Some systems may keep data at that location, so this version may have a better chance of staying alive, depending on your system. It is not necessarily undefined behavior, but depends on the compiler and the runtime environment it is invoked in.
If you understand the difference between the earlier dereference of a null pointer (only constant expressions valued 0 converted to pointers yield null pointers!) and this dereference of a reinterpreted zero-valued integer, I think you have learned something important.
It will usually cause an access violation at runtime. The following is done: first, 0 is cast to an int*, and that yields a null pointer. Then the value 0 is written to that address (the null address); that causes undefined behaviour, usually an access violation.
Effectively it is this code:
int* address = reinterpret_cast<int*>(0);
*address = 0;
It's a compilation error. You can't modify a non-lvalue.
It puts a zero at address zero. On some systems you can do this. Most MMU-based systems will not allow it at run time. I once saw an embedded OS write to address 0 when performing time(NULL).
There is no valid lvalue in that operation, so it shouldn't compile.
The left-hand side of an assignment must be... err... assignable.