Differences in C and C++ with sequence points and UB - c++

I used this post, Undefined Behavior and Sequence Points, to document undefined behavior (UB) in a C program, and it was pointed out to me that C and C++ have their own divergent rules for this [sequence points]. So what are the differences between C and C++ when it comes to sequence points and related UB? Can't I use a post about C++ sequence points to analyze what is happening in C code?
* Of Course I am not talking about features of C++ not applicable to C.

There are two parts to this question. We can tackle a comparison of sequence point rules without much trouble. This does not get us too far though: C and C++ are different languages with different standards (the latest C++ standard is almost twice as large as the latest C standard), and even though C++ uses C as a normative reference, it would be incorrect to quote the C++ standard for C and vice versa, regardless of how similar certain sections may be. The C++ standard does explicitly reference the C standard, but only for small sections.
The second part is a comparison of undefined behavior between C and C++. There can be some big differences, and enumerating all of them may not be possible, but we can give some indicative examples.
Sequence Points
Since we are talking about sequence points, this covers pre-C++11 and pre-C11. As far as I can tell, the sequence point rules do not differ greatly between the C99 and pre-C++11 draft standards. As we will see, the sequence point rules play no part in some of the examples of differing undefined behavior that I give.
The sequence point rules are covered in section 1.9 Program execution of the draft C++ standard closest to C++03, which says:
There is a sequence point at the completion of evaluation of each full-expression12).
When calling a function (whether or not the function is inline), there is a sequence point after the evaluation of all
function arguments (if any) which takes place before execution of any expressions or statements in the function body.
There is also a sequence point after the copying of a returned value and before the execution of any expressions outside
the function13). Several contexts in C++ cause evaluation of a function call, even though no corresponding function call
syntax appears in the translation unit. [ Example: evaluation of a new expression invokes one or more allocation and
constructor functions; see 5.3.4. For another example, invocation of a conversion function (12.3.2) can arise in contexts
in which no function call syntax appears. —end example ] The sequence points at function-entry and function-exit
(as described above) are features of the function calls as evaluated, whatever the syntax of the expression that calls the
function might be.
In the evaluation of each of the expressions
a && b
a || b
a ? b : c
a , b
using the built-in meaning of the operators in these expressions (5.14, 5.15, 5.16, 5.18), there is a sequence point after
the evaluation of the first expression14).
I will use the sequence point list from Annex C of the draft C99 standard. Although the annex is not normative, I can find no disagreement with the normative sections it references. It says:
The following are the sequence points described in 5.1.2.3:
The call to a function, after the arguments have been evaluated (6.5.2.2).
The end of the first operand of the following operators: logical AND && (6.5.13);
logical OR || (6.5.14); conditional ? (6.5.15); comma , (6.5.17).
The end of a full declarator: declarators (6.7.5);
The end of a full expression: an initializer (6.7.8); the expression in an expression
statement (6.8.3); the controlling expression of a selection statement (if or switch)
(6.8.4); the controlling expression of a while or do statement (6.8.5); each of the
expressions of a for statement (6.8.5.3); the expression in a return statement
(6.8.6.4).
The following entries do not seem to have equivalents in the draft C++ standard but these come from the C standard library which C++ incorporates by reference:
Immediately before a library function returns (7.1.4).
After the actions associated with each formatted input/output function conversion
specifier (7.19.6, 7.24.2).
Immediately before and immediately after each call to a comparison function, and
also between any call to a comparison function and any movement of the objects
passed as arguments to that call (7.20.5).
So there is not much of a difference between C and C++ here.
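As a minimal sketch of the operators listed above (the function names are made up for illustration), the following C functions rely only on the sequence point after the first operand of &&, the comma operator, and ?:. Each one modifies i and then reads it again, which is allowed here because the modification is complete before the second operand is evaluated.

```c
#include <assert.h>

/* &&: the increment in the left operand is complete before the right runs. */
int logical_and_demo(void)
{
    int i = 1;
    return (i++ && (i == 2));   /* left yields 1, then i is 2: result 1 */
}

/* Comma: each left operand, including its side effect, finishes first. */
int comma_demo(void)
{
    int i = 0;
    return (i++, i++, i);       /* two completed increments: result 2 */
}

/* ?: the condition is fully evaluated before the selected operand. */
int conditional_demo(void)
{
    int i = 0;
    return i++ ? -1 : i;        /* condition yields 0, then i is 1: result 1 */
}

/* Hypothetical self-check collecting the three results. */
void check_sequence_point_demos(void)
{
    assert(logical_and_demo() == 1);
    assert(comma_demo() == 2);
    assert(conditional_demo() == 1);
}
```

The same functions should behave identically when compiled as C++, since the quoted rules for these built-in operators match.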
Undefined Behavior
When it comes to the typical examples of sequence points and undefined behavior, for example those covered in Section 5 Expressions dealing with modifying a variable more than once between sequence points, I cannot come up with an example that is undefined in one language but not the other. In C99 it says:
Between the previous and next sequence point an object shall have its
stored value modified at most once by the evaluation of an
expression.72) Furthermore, the prior value shall be read only to
determine the value to be stored.73)
and it provides these examples:
i = ++i + 1;
a[i++] = i;
and in C++ it says:
Except where noted, the order of evaluation of operands of individual
operators and subexpressions of individual expressions, and the order
in which side effects take place, is unspecified.57) Between the
previous and next sequence point a scalar object shall have its stored
value modified at most once by the evaluation of an expression.
Furthermore, the prior value shall be accessed only to determine the
value to be stored. The requirements of this paragraph shall be met
for each allowable ordering of the subexpressions of a full
expression; otherwise the behavior is undefined
and provides these examples:
i = v[i++]; // the behavior is undefined
i = ++i + 1; // the behavior is undefined
In C++11 and C11 we do have one major difference which is covered in Assignment operator sequencing in C11 expressions which is the following:
i = ++i + 1;
This expression is well defined in C++11 but undefined in C11. This is due to the result of pre-increment being an lvalue in C++11 but not in C11, even though the sequencing rules are the same.
We do have major differences in areas that have nothing to do with sequence points:
In C, which uses of an indeterminate value are undefined has always been well specified, while in C++ this was not well specified until the recent draft C++1y standard. This is covered in my answer to Has C++ standard changed with respect to the use of indeterminate values and undefined behavior in C++1y?
Type punning through a union has always been well defined in C but not in C++ or at least it is hotly debatable whether it is undefined behavior or not. I have several references to this in my answer to Why does optimisation kill this function?
In C++ simply falling off the end of a value-returning function is undefined behavior, while in C it is only undefined behavior if you use the value.
There are probably plenty more examples but these are ones I have written about before.
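To make the union point concrete, here is a small C sketch (the function names are hypothetical, and it assumes float is 32 bits wide, which the static assertion checks). The union version relies on C's rule that reading a member other than the one last stored reinterprets the bytes; the memcpy version is the form usually suggested when the same code must also be valid C++.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Assumption of this sketch: float and uint32_t have the same size. */
static_assert(sizeof(float) == sizeof(uint32_t), "sketch assumes a 32-bit float");

/* Type punning through a union: fine in C, contested in C++. */
uint32_t float_bits_union(float f)
{
    union { float f; uint32_t u; } pun;
    pun.f = f;
    return pun.u;   /* reads the bytes of f reinterpreted as uint32_t */
}

/* Copying the bytes instead: accepted in both languages. */
uint32_t float_bits_memcpy(float f)
{
    uint32_t u;
    memcpy(&u, &f, sizeof u);
    return u;
}
```

In C both functions return the same bit pattern for any input, so the choice between them mostly matters when the code may later be compiled as C++.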

Related

Definition of an "expression" in the C and C++ standards

I'm asking this question because I'm updating my C and C++ course materials and I've had past students ask about it...
From ISO/IEC 9899:2017 section 6.5 Expressions ¶1 (and similar in the C++ standard):
"An expression is a sequence of operators and operands that specifies computation of a value, or that designates an object or a function, or that generates side effects, or that performs a combination thereof. …"
Because the standards writers obviously choose their words carefully, the use of the phrase "sequence of operators and operands" seems potentially misleading to me. It seems to indicate that to be considered an expression there must be more than one operator and also more than one operand. Thus, literals like 123 or variables like XYZ would not be considered expressions because there is no operator, and they certainly can't be considered operands if there is no operator.
However, if 123 and XYZ actually are expressions, wouldn't replacing the phrase "sequence of operators and operands" with "sequence of one or more characters" or something similar be more accurate?
Please tell me what I am misinterpreting about what the standard is stating.
and similar in the C++ standard
I don't know about the C standard, but the C++ standard puts this statement in a non-normative notation. It has no normative value to C++, so it should be read as colloquial.
You forgot about primary expressions, which have a separate definition in (6.5.1).
You just confused different entities; the definition you provided describes exactly what it should describe.
6.5.1 Primary expressions
Syntax:
primary-expression:
identifier
constant
string-literal
(expression)
Yes, the definition of "expression" in the C standard is incomplete -- but not in a way that causes any actual problems (other than to picky people like me).
The word "expression" in the text you quoted is in italics, which means that that is the official definition of the term. It's clear from other parts of the standard that 123, for example, is an expression: it's a decimal-constant, which is an integer-constant, which is a constant, which is a primary-expression, which is a postfix-expression, which (skipping multiple steps) is an expression.
It is not "a sequence of operators and operands". There is no operator, which implies that 123 is not an operand (this can be demonstrated by referring to the definitions of operator and operand elsewhere in the standard).
In practice, I've never heard of anyone, either a compiler implementer or a C programmer, having any real difficulty because of this incomplete definition. Compiler implementers refer to the language grammar. C programmers probably get a pretty good idea of what an "expression" is before reading the standard.
I'd like to see the definition of expression updated in a new edition of the standard. A definition that refers to the grammar rather than attempting an English description would IMHO be an improvement.
But if it isn't updated, we'll all keep using expressions without any problems.
As for C++, Nicol Bolas's answer correctly points out that the C++ standard doesn't have a formal definition of "expression" like the C standard does. It does have similar wording at the top of Clause 8: "An expression is a sequence of operators and operands that specifies a computation." -- but the word "expression" is not in italics and that sentence is part of a "Note", and is therefore non-normative. In C++, the standard defines expressions syntactically.

Does this post-increment statement result in undefined behaviour? [duplicate]

When building a program using a newer version of GCC, I found a problem in the code.
count[i] = count[i]++;
This code worked with an older version of GCC (2.95), but doesn't work with a newer version (4.8).
So I suspect this statement causes undefined behaviour, am I correct? Or is there a better term for this problem?
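This is indeed undefined behavior: count[i] is modified twice in one expression without any required ordering between the two modifications, and the order in which operands are evaluated is left unspecified, as stated on: https://en.cppreference.com/w/cpp/language/eval_order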
Order of evaluation of the operands of almost all C++ operators (including the order of evaluation of function arguments in a function-call expression and the order of evaluation of the subexpressions within any expression) is unspecified. The compiler can evaluate operands in any order, and may choose another order when the same expression is evaluated again.
There is actually a warning on the increment/decrement page in the cppreference: https://en.cppreference.com/w/cpp/language/operator_incdec
Because of the side-effects involved, built-in increment and decrement operators must be used with care to avoid undefined behavior due to violations of sequencing rules.
Indeed, this is undefined behavior.
int i = 2;
i = i++; // is i assigned to be 2 or 3?
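A minimal sketch of how such a statement can be rewritten so that each statement modifies the object only once. The helper name increment_slot is hypothetical, and it assumes the intent of the original line was to increment the element; if the intent was only to increment, count[i]++; on its own does the same job.

```c
#include <assert.h>
#include <stddef.h>

/* Increments count[i] with one write per statement and returns the old value. */
int increment_slot(int *count, size_t i)
{
    assert(count != NULL);

    int old = count[i];    /* read the current value */
    count[i] = old + 1;    /* single write, sequenced after the read */
    return old;
}
```

Splitting the read and the write into separate statements puts a sequence point between them, so the result no longer depends on the compiler version.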

In the comma operator, is the left operand guaranteed not to be actually executed if it hasn't side effects?

To show the topic I'm going to use C, but the same macro can be used also in C++ (with or without struct), raising the same question.
I came up with this macro
#define STR_MEMBER(S,X) (((struct S*)NULL)->X, #X)
Its purpose is to have strings (const char*) of an existing member of a struct, so that if the member doesn't exist, the compilation fails. A minimal usage example:
#include <stdio.h>
struct a
{
int value;
};
int main(void)
{
printf("a.%s member really exists\n", STR_MEMBER(a, value));
return 0;
}
If value weren't a member of struct a, the code wouldn't compile, and this is what I wanted.
The comma operator should evaluate the left operand and then discard the result of the expression (if there is one), so that my understanding is that usually this operator is used when the evaluation of the left operand has side effects.
In this case, however, there are no (intended) side effects, but of course it works only if the compiler doesn't actually produce the code which evaluates the expression, for otherwise it would access a struct located at NULL and a segmentation fault would occur.
Gcc/g++ 6.3 and 4.9.2 never produced that dangerous code, even with -O0, as if they were always able to “see” that the evaluation hasn't side effects and so it can be skipped.
Adding volatile in the macro (e.g. because accessing that memory address is the desired side effect) was so far the only way to trigger the segmentation fault.
So the question: is there anything in the C and C++ languages standard which guarantees that compilers will always avoid actual evaluation of the left operand of the comma operator when the compiler can be sure that the evaluation hasn't side effects?
Notes and fixing
I am not asking for a judgment about the macro as it is and the opportunity to use it or make it better. For the purpose of this question, the macro is bad if and only if it evokes undefined behaviour — i.e., if and only if it is risky because compilers are allowed to generate the “evaluation code” even when this hasn't side effects.
I have already two obvious fixes in mind: “reifying” the struct and using offsetof. The former needs an accessible memory area as big as the biggest struct we use as first argument of STR_MEMBER (e.g. maybe a static union could do…). The latter should work flawlessly: it gives an offset we aren't interested in and avoids the access problem — indeed I'm assuming gcc, because it's the compiler I use (hence the tag), and that its offsetof built-in behaves.
With the offsetof fix the macro becomes
#define STR_MEMBER(S,X) (offsetof(struct S,X), #X)
Writing volatile struct S instead of struct S doesn't cause the segfault.
Suggestions about other possible “fixes” are welcome, too.
Added note
Actually, the real use case was in C++, in a struct with static storage. This seems to be fine in C++, but as soon as I tried C with code closer to the original instead of the version boiled down for this question, I realized that C isn't happy at all with that:
error: initializer element is not constant
C wants the struct to be initializable at compile time, whereas C++ is fine with that.
Is there anything in the C and C++ languages standard which guarantees that compilers will always avoid actual evaluation of the left operand of the comma operator ?
It's the opposite. The standard guarantees that the left operand IS evaluated (really it does, there aren't any exceptions). The result is discarded.
Note: for lvalue expressions, "evaluate" does not mean "access the stored value". Instead, it means to work out where the designated memory location is. The other code encompassing the lvalue expression may or may not then go on to access the memory location. The process of reading from the memory location is known as "lvalue conversion" in C, or "lvalue to rvalue conversion" in C++.
In C++ a discarded-value expression (such as the left operand of the comma operator) only has lvalue to rvalue conversion performed on it if it is volatile and also meets some other criteria (see C++14 [expr]/11 for detail). In C lvalue conversion does occur for expressions whose result is not used (C11 6.3.2.1/2).
In your example, it is moot whether or not lvalue conversion happens. In both languages X->Y, where X is a pointer, is defined as (*X).Y; in C the act of applying * to a null pointer already causes undefined behaviour (C11 6.5.3/3), and in C++ the . operator is only defined for the case when the left operand actually designates an object (C++14 [expr.ref]/4.2).
The comma operator (the C documentation says something very similar) has no such guarantee.
In a comma expression E1, E2, the expression E1 is evaluated, its result is discarded ..., and its side effects are completed before evaluation of the expression E2 begins
irrelevant information omitted
To put it simply, E1 will be evaluated, although the compiler might optimize it away by the as-if rule if it is able to determine that there are no side-effects.
Gcc/g++ 6.3 and 4.9.2 never produced that dangerous code, even with -O0, as if they were always able to “see” that the evaluation hasn't side effects and so it can be skipped.
clang will produce code which raises an error if you pass it the -fsanitize=undefined option. Which should answer your question: at least one major implementation's developers clearly consider the code as having undefined behaviour. And they are correct.
Suggestions about other possible “fixes” are welcome, too.
I would look for something which is guaranteed not to evaluate the expression. Your suggestion of offsetof does the job, but may occasionally cause code to be rejected that would otherwise be accepted, such as when X is a.b. If you want that to be accepted, my thought would be to use sizeof to force an expression to remain unevaluated.
You ask,
is there anything in the C and C++ languages standard which guarantees
that compilers will always avoid actual evaluation of the left operand
of the comma operator when the compiler can be sure that the
evaluation hasn't side effects?
As others have remarked, the answer is "no". On the contrary, the standards both unconditionally state that the left-hand operand of the comma operator is evaluated, and that the result is discarded.
This is of course a description of the execution model of an abstract machine; implementations are permitted to work differently, so long as the observable behavior is the same as the abstract machine behavior would produce. If indeed evaluation of the left-hand expression produces no side effects, then that would permit skipping it altogether, but there is nothing in either standard that provides for requiring that it be skipped.
As for fixing it, you have various options, some of which apply only to one or the other of the two languages you have named. I tend to like your offsetof() alternative, but others have noted that in C++, there are types to which offsetof cannot be applied. In C, on the other hand, the standard specifically describes its application to structure types, but says nothing about union types. Its behavior on union types, though very likely to be consistent and natural, is technically undefined.
In C only, you could use a compound literal to avoid the undefined behavior in your approach:
#define HAS_MEMBER(T,X) (((T){0}).X, #X)
That works equally well on structure and union types (though you need to provide a full type name for this version, not just a tag). Its behavior is well defined when the given type does have such a member. The expansion violates a language constraint -- thus requiring a diagnostic to be emitted -- when the type does not have such a member, including when it is neither a structure type nor a union type.
You might also use sizeof, as @alain suggested, because although the sizeof expression will be evaluated, its operand will not be evaluated (except, in C, when its operand has variably-modified type, which will not apply to your use). I think this variation will work in both C and C++ without introducing any undefined behavior:
#define HAS_MEMBER(T,X) (sizeof(((T *)NULL)->X), #X)
I have again written it so that it works for both structs and unions.
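For completeness, here is a small C sketch of the sizeof approach applied to the struct a from the question. It keeps the question's STR_MEMBER(S,X) calling convention (struct tag plus member name); the (void) cast is an addition of this sketch, intended only to quiet "value computed is not used" style warnings, and the helper functions are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct a
{
    int value;
};

/* The operand of sizeof is not evaluated, so the null pointer is never
   dereferenced; the member name is still checked at compile time. */
#define STR_MEMBER(S, X) ((void)sizeof(((struct S *)NULL)->X), #X)

/* Hypothetical helper returning the checked member name as a string. */
const char *value_member_name(void)
{
    return STR_MEMBER(a, value);
    /* STR_MEMBER(a, missing) would be rejected: struct a has no such member. */
}

/* Hypothetical self-check. */
void check_member_name(void)
{
    assert(strcmp(value_member_name(), "value") == 0);
}
```

Because the macro still expands to a comma expression, it keeps the limitation mentioned in the question's added note: C does not accept it in an initializer for an object with static storage duration.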
The left operand of the comma operator is a discarded-value expression
5 Expressions
11 In some contexts, an expression only appears for its side effects. Such an expression is called a discarded-value
expression. The expression is evaluated and its value is discarded.
[...]
There are also unevaluated operands which, as the name implies, are not evaluated.
8 In some contexts, unevaluated operands appear (5.2.8, 5.3.3, 5.3.7,
7.1.6.2). An unevaluated operand is not evaluated. An unevaluated operand is considered a full-expression. [...]
Using a discarded-value expression in your use case is undefined behavior, but using an unevaluated operand is not.
Using sizeof for example would not cause UB because it takes an unevaluated operand.
#define STR_MEMBER(S,X) (sizeof(S::X), #X)
sizeof is preferable to offsetof, because offsetof can't be used for static members and classes that are not standard-layout:
18 Language support library
4 The macro offsetof(type, member-designator) accepts a restricted
set of type arguments in this International Standard. If type is not a
standard-layout class (Clause 9), the results are undefined. [...] The result of applying the offsetof macro to a field that
is a static data member or a function member is undefined. [...]
The language doesn't need to say anything about "actual execution" because of the as-if rule. After all, with no side effects how could you tell whether the expression is evaluated? (Looking at the assembly or setting breakpoints doesn't count; that's not part of execution of the program, which is all the language describes.)
On the other hand, dereferencing a null pointer is undefined behavior, so the language says nothing at all about what happens. You can't expect as-if to save you: as-if is a relaxation of otherwise-plausible restrictions on the implementation, and undefined behavior is a relaxation of all restrictions on the implementation. There is therefore no "conflict" between "this doesn't have side effects, so we can ignore it" and "this is undefined behavior, so nasal demons"; they're on the same side!

Does this expression invoke undefined behavior? [duplicate]

This started from a joke:
Interviewer: What is the difference between C and C++?
Candidate: ONE
My question is whether the expressions abs(C++ - C) and abs(C - C++) invoke undefined behavior or not?
It depends on the type of C, but at best (a user-defined type, where ++ is a function), it is unspecified whether the second C is evaluated before or after the evaluation of C.operator++.
Of course, for a built-in type, the expression is undefined behavior, and for a user-defined type, the final results will also depend on how the user defined operator++, as well as on the compiler-dependent order of evaluation.
Yes, this is undefined behaviour. The compiler will not make any promises on when the increment will happen if you reuse the same variable in the statement.
yes this is UB. From C99, Section 6.5
An expression is a sequence of operators and operands that specifies
computation of a value
Except as specified later (for the function-call (), &&, ||, ?:, and
comma operators), the order of evaluation of subexpressions and the
order in which side effects take place are both unspecified
Therefore there is no guarantee in the expression C++ - C as to when the post-increment is executed.

What is the value of an undefined constant used in #if?

My preprocessor appears to assume that undefined constants are 0 for the purpose of evaluating #if conditions.
Can this be relied upon, or do undefined constants give undefined behaviour?
Yes, it can be relied upon. The C99 standard specifies at §6.10.1 ¶3:
After all replacements due to macro expansion and the defined unary
operator have been performed, all remaining identifiers are replaced with the pp-number 0
Edit
Sorry, I thought it was a C question; still, no big deal, the equivalent section in the C++ standard (§16.1 ¶4) states:
After all replacements due to macro expansion and the defined unary operator
have been performed, all remaining identifiers and keywords, except for true and false, are replaced with the pp-number 0
The only difference is the different handling of true and false, which in C do not need special handling, while in C++ they have a special meaning even in the preprocessing phase.
An identifier that is not defined as a macro is converted to 0 before the expression is evaluated.
The exception is the identifier true, which is converted to 1. This is specific to the C++ preprocessor; in C, this doesn't happen and you would need to include <stdbool.h> to use true this way, in which case it will be defined as a macro and no special handling is required.
The OP was asking specifically about the C preprocessor and the first answer was correctly referring to the C preprocessor specification. But some of the other comments seem to blur the distinction between the C preprocessor and the C compiler. Just to be clear, those are two different things with separate rules and they are applied in two separate passes.
#if 0 == NAME_UNDEFINED
int foo = NAME_UNDEFINED;
#endif
In this example the preprocessor keeps the foo definition, because it evaluates NAME_UNDEFINED to 0 inside the #if condition. The compiler then reports an error, because the initializer is not part of a preprocessor conditional expression: there NAME_UNDEFINED is left as an ordinary identifier, which the C compiler sees as an undeclared symbol.
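A compilable sketch of the same rule, with made-up macro and function names: the undefined identifier compares equal to 0 inside #if, while defined() still reports that it has no definition. The results are recorded in macros so they can be inspected at run time.

```c
#include <assert.h>

/* NAME_UNDEFINED is deliberately never defined anywhere in this file. */
#if NAME_UNDEFINED == 0
#define UNDEFINED_COMPARES_EQUAL_TO_ZERO 1   /* this branch is taken */
#else
#define UNDEFINED_COMPARES_EQUAL_TO_ZERO 0
#endif

#if defined(NAME_UNDEFINED)
#define NAME_IS_DEFINED 1
#else
#define NAME_IS_DEFINED 0                    /* this branch is taken */
#endif

int undefined_compares_equal_to_zero(void) { return UNDEFINED_COMPARES_EQUAL_TO_ZERO; }
int name_is_defined(void) { return NAME_IS_DEFINED; }

/* Hypothetical self-check. */
void check_preprocessor_rule(void)
{
    assert(undefined_compares_equal_to_zero() == 1);
    assert(name_is_defined() == 0);
}
```

Since the silent replacement by 0 can hide a misspelled macro name, using defined() for existence checks is the more explicit choice; GCC also offers the -Wundef warning for undefined identifiers evaluated in #if.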