Weird output when use prefix and postfix on pointer together - c++

Given the code below
char buf[] = "asfsf";
char *a=buf;
++*a++;
cout<<*a;
I expect the result is the next character of 's' that is 't', but the result is still 's'. Why?
Why ++*a++ is not the same as
*a++;
++*a;
cout<<*a;
Is that really a duplicate question with ++i++? I know ++i++ is a undefined behavior and will cause compile error, but ++*i++ actually can run. Is my case also a undefined behavior?

According to the language grammar, the operators associate as:
++(*a++)
Note: associativity does not imply an order of operations.
*a++ evaluates to an lvalue designating the location where a was originally pointing, with side-effect of modifying a. All fine so far.
Applying prefix-++ to that lvalue increments the value stored there (changing 'a' to 'b').
Although the two increments are unsequenced, this does not cause UB because different objects are being incremented, and the lvalue designating the latter location does not depend on the increment. (It uses the old value of a).

As it stands right now, your code has undefined behavior, because it attempts to modify the contents of a string literal.
One way (probably the preferred way) to prevent the compiler from accepting such code is to define your a like:
char const *a="asfsf";
This way, the ++*a part simply won't compile.
For the sake of exposition, let's change the code a little bit, to become:
#include <iostream>
int main(){
char x[]="asfsf";
char *a = x;
++*a++;
std::cout<<x;
}
Now a points at memory we can actually write to, and get meaningful results. This prints out bsfsf. If we print out a, we'll get sfsf.
What's happening is that a++ increments a, but still yields the original value of a. That is dereferenced, giving a reference to the first element of x. Then the pre-increment is applied to that, changing it from a to b.
If you want to increment the pointer, dereference the result, then increment that, you'd use: ++*++a;. Well, no, you wouldn't use that--or at least I hope you wouldn't. It does increment a to point at the second element of the array, then increment that second element to change it from s to t--but anybody who read the code would be completely forgiven if they hated you for writing it that way.

Related

Understanding output of this code

This is an excercise in my textbook. I need to find the output of this code.
#include<iostream>
using namespace std;
int main()
{
int x[]={10,20,30,40,50};
int *p,**q,*t;
p=x;
t=x+1;
q=&t;
cout<<*p<<","<<**q<<","<<*t++;
return 0;
}
The output is
10,30,20
Here I dont understand the declaration of **q, and also how its value comes out to be 30. I also noticed that changing the last statement to
cout<<*p<<","<<**q<<","<<*t;
changes the output to
10,20,20
Could somebody explain what goes on behind the scenes here? Thanks a lot in advance.
Here, q is a pointer to a pointer to int, and it was set to point to t. So *q is identical to t, and **q is *t. Which means the cout expression can be rewritten as:
cout<<*p<<","<<*t<<","<<*t++;
Here you can see that t is read and modified in different parts of the expression, and the standard says that the order in which these parts are executed is not specified. So t may be modified before or after (or even while) it is read. When this kind of thing (unsequenced read and write to a variable) happens, we get undefined behavior: Anything can happen as a result. A specific compiler may give a specific result on a specific computer, but there is no guarantee that you will always get this result.
So this exercise is invalid, and there is no point in trying to figure out why you saw a specific output.
On the other hand, the second line you attempted:
cout<<*p<<","<<**q<<","<<*t;
is perfectly valid, because it doesn't modify t anywhere.
p and t are both of the type pointer to int, q is of the type pointer to (pointer to int)
The * operator makes a pointer to a reference.
So *p is of the type int&, so is *t.
*q is of the type int*& (read reference to a pointer to int)
You want to print an int value here and must therefore use the * operator a second time.
So the **q is just making a pointer to a pointer to int to a reference to int
I forgot to mention it: The process is called dereferencing pointers.
Maybe the descirption on this side will give you a better insight:
http://www.cplusplus.com/doc/tutorial/pointers/
++ operator has higher precedence than <<
When program is executed this are events:
int x[]={10,20,30,40,50};
int *p,**q,*t;
p=x;
t=x+1;
q=&t;
cout<<*p<<","<<**q<<","<<*t++; //1st change value of t to t+1,
//but return old t in place ^
//then to output stream 'p'=10, then 'q'=new 't'=old 't'+1=30,
//then old 't'=20 which is returned by sufix ++ operator

C++: Memory allocation

#include <iostream>
using namespace std;
int main()
{
int a[100],n;
cout<<&n<<" "<<&a[100]<<endl;
if(&n!=&a[100])
{
cout<<" What is wrong with C++?";
}
}
It prints the address of n and a[100] as same. But when I compare the two values in the if loop It says that they both are not equal.
What does this mean?
When I change the value of n, a[100] also changes so doesn't that mean n and a[100] are equal.
First, let's remember that there is no a[100]. It does not exist! If you tried to access a[100]'s "value" then, according to the abstract machine called C++, anything can happen. This includes blowing up the sun or dying my hair purple; now, I like my hair the way it is, so please don't!
Anyway, what you're doing is playing with the array's "one-past-the-end" pointer. You are allowed to obtain this pointer, which is a fake pointer of sorts, as long as you do not dereference it.
It is available only because the "101st element" would be "one past the end" of the array. (And there is debate as to whether you are allowed to write &a[100], rather than a+100, to get such a pointer; I am in the "no" camp.)
However, that still says nothing about comparing it to the address of some entirely different object. You cannot assume anything about the relative location of local variables in memory. You simply have no idea where n will live with respect to a.
The results you're observing are unpredictable, unreliable and meaningless side effects of the undefined behaviour exhibited by your program, caused by a combination of compiler optimisations and data locality.
For the array declaration int a[100], array index starts from 0 to 99 and when you try access the address of 101th element which is out of its range could overlap with next member (in your case its variable n) on the stack. However its undefined behavior.
For a bit if facts here is the relevant text from the specifications
Equality operator (==,!=)
Pointers to objects of the same type can be compared for equality with the 'intuitive' expected results:
From § 5.10 of the C++11 standard:
Pointers of the same type (after pointer conversions) can be compared for equality. Two pointers of the same type compare equal if > and only if they are both null, both point to the same function, or > both represent the same address (3.9.2).
(leaving out details on comparison of pointers to member and or the
null pointer constants - they continue down the same line of 'Do >
What I Mean':)
[...] If both operands are null, they compare equal. Otherwise if only one is null, they compare unequal.[...]
The most 'conspicuous' caveat has to do with virtuals, and it does seem to be the logical thing to expect too:
[...] if either is a pointer to a virtual member function, the result is unspecified. Otherwise they compare equal if and only if
they would refer to the same member of the same most derived object
(1.8) or the same subobject if they were dereferenced with a
hypothetical object of the associated class type. [...]
Maybe the problem could be that array of int is not the same as int!
by writing &a[100] you're invoking undefined behavior, since there is no element with index 100 in the array a. To get the same address safely, you could instead write a+100 (or &a[0]+100) in which case your program would be well-defined, but whether or not the if condition will hold cannot be predicted or relied upon on any implementation.

Can I safely create references to possibly invalid memory as long as I don't use it?

I want to parse UTF-8 in C++. When parsing a new character, I don't know in advance if it is an ASCII byte or the leader of a multibyte character, and also I don't know if my input string is sufficiently long to contain the remaining characters.
For simplicity, I'd like to name the four next bytes a, b, c and d, and because I am in C++, I want to do it using references.
Is it valid to define those references at the beginning of a function as long as I don't access them before I know that access is safe? Example:
void parse_utf8_character(const string s) {
for (size_t i = 0; i < s.size();) {
const char &a = s[i];
const char &b = s[i + 1];
const char &c = s[i + 2];
const char &d = s[i + 3];
if (is_ascii(a)) {
i += 1;
do_something_only_with(a);
} else if (is_twobyte_leader(a)) {
i += 2;
if (is_safe_to_access_b()) {
do_something_only_with(a, b);
}
}
...
}
}
The above example shows what I want to do semantically. It doesn't illustrate why I want to do this, but obviously real code will be more involved, so defining b,c,d only when I know that access is safe and I need them would be too verbose.
There are three takes on this:
Formally
well, who knows. I could find out for you by using quite some time on it, but then, so could you. Or any reader. And it's not like that's very practically useful.
EDIT: OK, looking it up, since you don't seem happy about me mentioning the formal without looking it up for you. Formally you're out of luck:
N3280 (C++11) §5.7/5 “If both the pointer operand and the result point to elements of the same array object, or one past
the last element of the array object, the evaluation shall not produce an overflow; otherwise, the behavior is undefined.”
Two situations where this can produce undesired behavior: (1) computing an address beyond the end of a segment, and (2) computing an address beyond an array that the compiler knows the size of, with debug checks enabled.
Technically
you're probably OK as long as you avoid any lvalue-to-rvalue conversion, because if the references are implemented as pointers, then it's as safe as pointers, and if the compiler chooses to implement them as aliases, well, that's also ok.
Economically
relying needlessly on a subtlety wastes your time, and then also the time of others dealing with the code. So, not a good idea. Instead, declare the names when it's guaranteed that what they refer to, exists.
Before going into the legality of references to unaccessible memory, you have another problem in your code. Your call to s[i+x] might call string::operator[] with a parameter bigger then s.size(). The C++11 standard says about string::operator[] ([string.access], §21.4.5):
Requires: pos <= size().
Returns: *(begin()+pos) if pos < size(), otherwise a reference to an object of type T with value charT(); the referenced value shall not be modified.
This means that calling s[x] for x > s.size() is undefined behaviour, so the implementation could very well terminate your program, e.g. by means of an assertion, for that.
Since string is now guaranteed to be continous, you could go around that problem using &s[i]+x to get an address. In praxis this will probably work.
However, strictly speaking doing this is still illegal unfortunately. The reason for this is that the standard allows pointer arithmetic only as long as the pointer stays inside the same array, or one past the end of the array. The relevant part of the (C++11) standard is in [expr.add], §5.7.5:
If both the pointer operand and the result point to elements of the same array object, or one past the last element of the array object, the evaluation shall not produce an overflow; otherwise, the behavior is undefined.
Therefore generating references or pointers to invalid memory locations might work on most implementations, but it is technically undefined behaviour, even if you never dereference the pointer/use the reference. Relying on UB is almost never a good idea , because even if it works for all targeted systems, there are no guarantees about it continuing to work in the future.
In principle, the idea of taking a reference for a possibly illegal memory address is itself perfectly legal. The reference is only a pointer under the hood, and pointer arithmetic is legal until dereferencing occurs.
EDIT: This claim is a practical one, not one covered by the published standard. There are many corners of the published standard which are formally undefined behaviour, but don't produce any kind of unexpected behaviour in practice.
Take for example to possibility of computing a pointer to the second item after the end of an array (as #DanielTrebbien suggests). The standard says overflow may result in undefined behaviour. In practice, the overflow would only occur if the upper end of the array is just short of the space addressable by a pointer. Not a likely scenario. Even when if it does happen, nothing bad would happen on most architectures. What is violated are certain guarantees about pointer differences, which don't apply here.
#JoSo If you were working with a character array, you can avoid some of the uncertainty about reference semantics by replacing the const-references with const-pointers in your code. That way you can be certain no compiler will alias the values.

Why is it allowed to cast a pointer to a reference?

Originally being the topic of this question, it emerged that the OP just overlooked the dereference. Meanwhile, this answer got me and some others thinking - why is it allowed to cast a pointer to a reference with a C-style cast or reinterpret_cast?
int main() {
char c = 'A';
char* pc = &c;
char& c1 = (char&)pc;
char& c2 = reinterpret_cast<char&>(pc);
}
The above code compiles without any warning or error (regarding the cast) on Visual Studio while GCC will only give you a warning, as shown here.
My first thought was that the pointer somehow automagically gets dereferenced (I work with MSVC normally, so I didn't get the warning GCC shows), and tried the following:
#include <iostream>
int main() {
char c = 'A';
char* pc = &c;
char& c1 = (char&)pc;
std::cout << *pc << "\n";
c1 = 'B';
std::cout << *pc << "\n";
}
With the very interesting output shown here. So it seems that you are accessing the pointed-to variable, but at the same time, you are not.
Ideas? Explanations? Standard quotes?
Well, that's the purpose of reinterpret_cast! As the name suggests, the purpose of that cast is to reinterpret a memory region as a value of another type. For this reason, using reinterpret_cast you can always cast an lvalue of one type to a reference of another type.
This is described in 5.2.10/10 of the language specification. It also says there that reinterpret_cast<T&>(x) is the same thing as *reinterpret_cast<T*>(&x).
The fact that you are casting a pointer in this case is totally and completely unimportant. No, the pointer does not get automatically dereferenced (taking into account the *reinterpret_cast<T*>(&x) interpretation, one might even say that the opposite is true: the address of that pointer is automatically taken). The pointer in this case serves as just "some variable that occupies some region in memory". The type of that variable makes no difference whatsoever. It can be a double, a pointer, an int or any other lvalue. The variable is simply treated as memory region that you reinterpret as another type.
As for the C-style cast - it just gets interpreted as reinterpret_cast in this context, so the above immediately applies to it.
In your second example you attached reference c to the memory occupied by pointer variable pc. When you did c = 'B', you forcefully wrote the value 'B' into that memory, thus completely destroying the original pointer value (by overwriting one byte of that value). Now the destroyed pointer points to some unpredictable location. Later you tried to dereference that destroyed pointer. What happens in such case is a matter of pure luck. The program might crash, since the pointer is generally non-defererencable. Or you might get lucky and make your pointer to point to some unpredictable yet valid location. In that case you program will output something. No one knows what it will output and there's no meaning in it whatsoever.
One can rewrite your second program into an equivalent program without references
int main(){
char* pc = new char('A');
char* c = (char *) &pc;
std::cout << *pc << "\n";
*c = 'B';
std::cout << *pc << "\n";
}
From the practical point of view, on a little-endian platform your code would overwrite the least-significant byte of the pointer. Such a modification will not make the pointer to point too far away from its original location. So, the code is more likely to print something instead of crashing. On a big-endian platform your code would destroy the most-significant byte of the pointer, thus throwing it wildly to point to a totally different location, thus making your program more likely to crash.
It took me a while to grok it, but I think I finally got it.
The C++ standard specifies that a cast reinterpret_cast<U&>(t) is equivalent to *reinterpret_cast<U*>(&t).
In our case, U is char, and t is char*.
Expanding those, we see that the following happens:
we take the address of the argument to the cast, yielding a value of type char**.
we reinterpret_cast this value to char*
we dereference the result, yielding a char lvalue.
reinterpret_cast allows you to cast from any pointer type to any other pointer type. And so, a cast from char** to char* is well-formed.
I'll try to explain this using my ingrained intuition about references and pointers rather than relying on the language of the standard.
C didn't have reference types, it only had values and pointer types (addresses) - since, physically in memory, we only have values and addresses.
In C++ we've added references to the syntax, but you can think of them as a kind of syntactic sugar - there is no special data structure or memory layout scheme for holding references.
Well, what "is" a reference from that perspective? Or rather, how would you "implement" a reference? With a pointer, of course. So whenever you see a reference in some code you can pretend it's really just a pointer that's been used in a special way: if int x; and int& y{x}; then we really have a int* y_ptr = &x; and if we say y = 123; we merely mean *(y_ptr) = 123;. This is not dissimilar from how, when we use C array subscripts (a[1] = 2;) what actually happens is that a is "decayed" to mean pointer to its first element, and then what gets executed is *(a + 1) = 2.
(Side note: Compilers don't actually always hold pointers behind every reference; for example, the compiler might use a register for the referred-to variable, and then a pointer can't point to it. But the metaphor is still pretty safe.)
Having accepted the "reference is really just a pointer in disguise" metaphor, it should now not be surprising that we can ignore this disguise with a reinterpret_cast<>().
PS - std::ref is also really just a pointer when you drill down into it.
Its allowed because C++ allows pretty much anything when you cast.
But as for the behavior:
pc is a 4 byte pointer
(char)pc tries to interpret the pointer as a byte, in particular the last of the four bytes
(char&)pc is the same, but returns a reference to that byte
When you first print pc, nothing has happened and you see the letter you stored
c = 'B' modifies the last byte of the 4 byte pointer, so it now points to something else
When you print again, you are now pointing to a different location which explains your result.
Since the last byte of the pointer is modified the new memory address is nearby, making it unlikely to be in a piece of memory your program isn't allowed to access. That's why you don't get a seg-fault. The actual value obtained is undefined, but is highly likely to be a zero, which explains the blank output when its interpreted as a char.
when you're casting, with a C-style cast or with a reinterpret_cast, you're basically telling the compiler to look the other way ("don't you mind, I know what I'm doing").
C++ allows you to tell the compiler to do that. That doesn't mean it's a good idea...

Pointers assignment

What is the meaning of
*(int *)0 = 0;
It does compile successfully
It has no meaning. That's an error. It's parsed as this
(((int)0) = 0)
Thus, trying to assign to an rvalue. In this case, the right side is a cast of 0 to int (it's an int already, anyway). The result of a cast to something not a reference is always an rvalue. And you try to assign 0 to that. What Rvalues miss is an object identity. The following would work:
int a;
(int&)a = 0;
Of course, you could equally well write it as the following
int a = 0;
Update: Question was badly formatted. The actual code was this
*(int*)0 = 0
Well, now it is an lvalue. But a fundamental invariant is broken. The Standard says
An lvalue refers to an object or function
The lvalue you assign to is neither an object nor a function. The Standard even explicitly says that dereferencing a null-pointer ((int*)0 creates such a null pointer) is undefined behavior. A program usually will crash on an attempt to write to such a dereferenced "object". "Usually", because the act of dereferencing is already declared undefined by C++.
Also, note that the above is not the same as the below:
int n = 0;
*(int*)n = 0;
While the above writes to something where certainly no object is located, this one will write to something that results from reinterpreting n to a pointer. The mapping to the pointer value is implementation defined, but most compilers will just create a pointer referring to address zero here. Some systems may keep data on that location, so this one may have more chances to stay alive - depending on your system. This one is not undefined behavior necessarily, but depends on the compiler and runtime-environment it is invoked in.
If you understand the difference between the above dereference of a null pointer (only constant expressions valued 0 converted to pointers yield null pointers!) and the below dereference of a reinterpreted zero value integer, i think you have learned something important.
It will usually cause an access violation at runtime. The following is done: first 0 is cast to an int * and that yields a null pointer. Then a value 0 is written to that address (null address) - that causes undefined behaviour, usually an access violation.
Effectively it is this code:
int* address = reinterpret_cast<int*>( 0 );
*address = 0;
Its a compilation error. You cant modify a non-lvalue.
It puts a zero on address zero. On some systems you can do this. Most MMU-based systems will not allow this in run-time. I once saw an embedded OS writing to address 0 when performing time(NULL).
there is no valid lvalue in that operation so it shouldn't compile.
the left hand side of an assignment must be... err... assignable