I'm asking this question because I'm updating my C and C++ course materials and I've had past students ask about it...
From ISO/IEC 9899:2017 section 6.5 Expressions ¶1 (and similar in the C++ standard):
"An expression is a sequence of operators and operands that specifies computation of a value, or that designates an object or a function, or that generates side effects, or that performs a combination thereof. …"
Because the standards writers obviously choose their words carefully, the use of the phrase "sequence of operators and operands" seems potentially misleading to me. It seems to indicate that to be considered an expression there must be more than one operator and also more than one operand. Thus, literals like 123 or variables like XYZ would not be considered expressions because there is no operator, and they certainly can't be considered operands if there is no operator.
However, if 123 and XYZ actually are expressions, wouldn't replacing the phrase "sequence of operators and operands" with "sequence of one or more characters" or something similar be more accurate?
Please tell me what I am misinterpreting about what the standard is stating.
and similar in the C++ standard
I don't know about the C standard, but the C++ standard puts this statement in a non-normative notation. It has no normative value to C++, so it should be read as colloquial.
You forgot of Primary expressions that have a separate definition in (6.5.1).
You just confused different entities; the definition you provided describes exactly what it should describe.
6.5.1 Primary expressions
Syntax:
primary-expression:
identifier
constant
string-literal
(expression)
Yes, the definition of "expression" in the C standard is incomplete -- but not in a way that causes any actual problems (other than to picky people like me).
The word "expression" in the text you quoted is in italics, which means that that is the official definition of the term. It's clear from other parts of the standard that 123, for example, is an expression: it's a decimal-constant, which is an integer-constant, which is a constant, which is a primary-expression`, which is a postfix-expression, which (skipping multiple steps) is an expression.
It is not "a sequence of operators and operands". There is no operator, which implies that 123 is not an operand (this can be demonstrated by referring to the definitions of operator and operand elsewhere in the standard).
In practice, I've never heard of anyone, either a compiler implementer or a C programmer, having any real difficulty because of this incomplete definition. Compiler implementers refer to the language grammar. C programmers probably get a pretty good idea of what an "expression" is before reading the standard.
I'd like to see the definition of expression updated in a new edition of the standard. A definition that refers to the grammar rather than attempting an English description would IMHO be an improvement.
But if it isn't updated, we'll all keep using expressions without any problems.
As for C++, Nicol Bolas's answer correctly points out that the C++ standard doesn't have a formal definition of "expression" like the C standard does. It does have similar wording at the top of Clause 8: "An expression is a
sequence of operators and operands that specifies a computation." -- but the word "expression" is not in italics and that sentence is part of a "Note", and is therefore non-normative. In C++, the standard defines expressions syntactically.
This question already has answers here:
Undefined behavior and sequence points
(5 answers)
Closed 4 years ago.
When building a program using a newer version of GCC, I found a problem in the code.
count[i] = count[i]++;
This code worked with an older version of GCC (2.95), but doesn't work with a newer version (4.8).
So I suspect this statement causes undefined behaviour, am I correct? Or is there a better term for this problem?
This is actually specified as undefined behavior as each compiler defines its own order of operation as stated on: https://en.cppreference.com/w/cpp/language/eval_order
Order of evaluation of the operands of almost all C++ operators (including the order of evaluation of function arguments in a function-call expression and the order of evaluation of the subexpressions within any expression) is unspecified. The compiler can evaluate operands in any order, and may choose another order when the same expression is evaluated again.
There is actually a warning on the increment/decrement page in the cppreference: https://en.cppreference.com/w/cpp/language/operator_incdec
Because of the side-effects involved, built-in increment and decrement operators must be used with care to avoid undefined behavior due to violations of sequencing rules.
Indeed, this is undefined behavior.
int i = 2;
i = i++; // is i assigned to be 2 or 3?
To show the topic I'm going to use C, but the same macro can be used also in C++ (with or without struct), raising the same question.
I came up with this macro
#define STR_MEMBER(S,X) (((struct S*)NULL)->X, #X)
Its purpose is to have strings (const char*) of an existing member of a struct, so that if the member doesn't exist, the compilation fails. A minimal usage example:
#include <stdio.h>
struct a
{
int value;
};
int main(void)
{
printf("a.%s member really exists\n", STR_MEMBER(a, value));
return 0;
}
If value weren't a member of struct a, the code wouldn't compile, and this is what I wanted.
The comma operator should evaluate the left operand and then discard the result of the expression (if there is one), so that my understanding is that usually this operator is used when the evaluation of the left operand has side effects.
In this case, however, there aren't (intended) side effects, but of course it works iff the compiler doesn't actually produce the code which evaluates the expression, for otherwise it would access to a struct located at NULL and a segmentation fault would occur.
Gcc/g++ 6.3 and 4.9.2 never produced that dangerous code, even with -O0, as if they were always able to “see” that the evaluation hasn't side effects and so it can be skipped.
Adding volatile in the macro (e.g. because accessing that memory address is the desired side effect) was so far the only way to trigger the segmentation fault.
So the question: is there anything in the C and C++ languages standard which guarantees that compilers will always avoid actual evaluation of the left operand of the comma operator when the compiler can be sure that the evaluation hasn't side effects?
Notes and fixing
I am not asking for a judgment about the macro as it is and the opportunity to use it or make it better. For the purpose of this question, the macro is bad if and only if it evokes undefined behaviour — i.e., if and only if it is risky because compilers are allowed to generate the “evaluation code” even when this hasn't side effects.
I have already two obvious fixes in mind: “reifying” the struct and using offsetof. The former needs an accessible memory area as big as the biggest struct we use as first argument of STR_MEMBER (e.g. maybe a static union could do…). The latter should work flawlessly: it gives an offset we aren't interested in and avoids the access problem — indeed I'm assuming gcc, because it's the compiler I use (hence the tag), and that its offsetof built-in behaves.
With the offsetof fix the macro becomes
#define STR_MEMBER(S,X) (offsetof(struct S,X), #X)
Writing volatile struct S instead of struct S doesn't cause the segfault.
Suggestions about other possible “fixes” are welcome, too.
Added note
Actually, the real usage case was in C++ in a static storage struct. This seems to be fine in C++, but as soon as I tried C with a code closer to the original instead of the one boiled for this question, I realized that C isn't happy at all with that:
error: initializer element is not constant
C wants the struct to be initializable at compile time, instead C++ it's fine with that.
Is there anything in the C and C++ languages standard which guarantees that compilers will always avoid actual evaluation of the left operand of the comma operator ?
It's the opposite. The standard guarantees that the left operand IS evaluated (really it does, there aren't any exceptions). The result is discarded.
Note: for lvalue expressions, "evaluate" does not mean "access the stored value". Instead, it means to work out where the designated memory location is. The other code encompassing the lvalue expression may or may not then go on to access the memory location. The process of reading from the memory location is known as "lvalue conversion" in C, or "lvalue to rvalue conversion" in C++.
In C++ a discarded-value expression (such as the left operand of the comma operator) only has lvalue to rvalue conversion performed on it if it is volatile and also meets some other criteria (see C++14 [expr]/11 for detail). In C lvalue conversion does occur for expressions whose result is not used (C11 6.3.2.1/2).
In your example, it is moot whether or not lvalue conversion happens. In both languages X->Y, where X is a pointer, is defined as (*X).Y; in C the act of applying * to a null pointer already causes undefined behaviour (C11 6.5.3/3), and in C++ the . operator is only defined for the case when the left operand actually designates an object (C++14 [expr.ref]/4.2).
The comma operator (C documentation, says something very similar) has no such guarantees.
In a comma expression E1, E2, the expression E1 is evaluated, its result is discarded ..., and its side effects are completed before evaluation of the expression E2 begins
irrelevant information omitted
To put it simply, E1 will be evaluated, although the compiler might optimize it away by the as-if rule if it is able to determine that there are no side-effects.
Gcc/g++ 6.3 and 4.9.2 never produced that dangerous code, even with -O0, as if they were always able to “see” that the evaluation hasn't side effects and so it can be skipped.
clang will produce code which raises an error if you pass it the -fsanitize=undefined option. Which should answer your question: at least one major implementation's developers clearly consider the code as having undefined behaviour. And they are correct.
Suggestions about other possible “fixes” are welcome, too.
I would look for something which is guaranteed not to evaluate the expression. Your suggestion of offsetof does the job, but may occasionally cause code to be rejected that would otherwise be accepted, such as when X is a.b. If you want that to be accepted, my thought would be to use sizeof to force an expression to remain unevaluated.
You ask,
is there anything in the C and C++ languages standard which guarantees
that compilers will always avoid actual evaluation of the left operand
of the comma operator when the compiler can be sure that the
evaluation hasn't side effects?
As others have remarked, the answer is "no". On the contrary, the standards both unconditionally state that the left-hand operand of the comma operator is evaluated, and that the result is discarded.
This is of course a description of the execution model of an abstract machine; implementations are permitted to work differently, so long as the observable behavior is the same as the abstract machine behavior would produce. If indeed evaluation of the left-hand expression produces no side effects, then that would permit skipping it altogether, but there is nothing in either standard that provides for requiring that it be skipped.
As for fixing it, you have various options, some of which apply only to one or the other of the two languages you have named. I tend to like your offsetof() alternative, but others have noted that in C++, there are types to which offsetof cannot be applied. In C, on the other hand, the standard specifically describes its application to structure types, but says nothing about union types. Its behavior on union types, though very likely to be consistent and natural, as technically undefined.
In C only, you could use a compound literal to avoid the undefined behavior in your approach:
#define HAS_MEMBER(T,X) (((T){0}).X, #X)
That works equally well on structure and union types (though you need to provide a full type name for this version, not just a tag). Its behavior is well defined when the given type does have such a member. The expansion violates a language constraint -- thus requiring a diagnostic to be emitted -- when the type does not have such a member, including when it is neither a structure type nor a union type.
You might also use sizeof, as #alain suggested, because although the sizeof expression will be evaluated, its operand will not be evaluated (except, in C, when its operand has variably-modified type, which will not apply to your use). I think this variation will work in both C and C++ without introducing any undefined behavior:
#define HAS_MEMBER(T,X) (sizeof(((T *)NULL)->X), #X)
I have again written it so that it works for both structs and unions.
The left operand of the comma operator is a discarded-value expression
5 Expressions
11 In some contexts, an expression only appears for its side effects. Such an expression is called a discarded-value
expression. The expression is evaluated and its value is discarded.
[...]
There are also unevaluated operands which, as the name implies, are not evaluated.
8 In some contexts, unevaluated operands appear (5.2.8, 5.3.3, 5.3.7,
7.1.6.2). An unevaluated operand is not evaluated. An unevaluated operand is considered a full-expression. [...]
Using a discarded-value expression in your use case is undefined behavior, but using an unevaluated operand is not.
Using sizeof for example would not cause UB because it takes an unevaluated operand.
#define STR_MEMBER(S,X) (sizeof(S::X), #X)
sizeof is preferable to offsetof, because offsetof can't be used for static members and classes that are not standard-layout:
18 Language support library
4 The macro offsetof(type, member-designator) accepts a restricted
set of type arguments in this International Standard. If type is not a
standard-layout class (Clause 9), the results are undefined. [...] The result of applying the offsetof macro to a field that
is a static data member or a function member is undefined. [...]
The language doesn't need to say anything about "actual execution" because of the as-if rule. After all, with no side effects how could you tell whether the expression is evaluated? (Looking at the assembly or setting breakpoints doesn't count; that's not part of execution of the program, which is all the language describes.)
On the other hand, dereferencing a null pointer is undefined behavior, so the language says nothing at all about what happens. You can't expect as-if to save you: as-if is a relaxation of otherwise-plausible restrictions on the implementation, and undefined behavior is a relaxation of all restrictions on the implementation. There is therefore no "conflict" between "this doesn't have side effects, so we can ignore it" and "this is undefined behavior, so nasal demons"; they're on the same side!
This question already has answers here:
Why are these constructs using pre and post-increment undefined behavior?
(14 answers)
Undefined behavior and sequence points
(5 answers)
Closed 9 years ago.
This started from a joke:
Interviewer: What is the difference between C and C++?
Candidate: ONE
My question is whether the expressions abs(C++ - C) and abs(C - C++) invokes undefined behavior or not?
It depends on the type of C, but at the best (a user defined
type, where ++ is a function), it is unspecified whether the
second C is evaluated before or after the evaluation of
C.operator++.
Of course, for a built-in type, the expression is undefined
behavior, and for a user defined type, the final results will
also depend on how the user defined operator++, as well as the
compiler dependent order of evaluation.
Yes, this is undefined behaviour. The compiler will not make any promises on when the increment will happen if you reuse the same variable in the statement.
yes this is UB. From C99, Section 6.5
An expression is a sequence of operators and operands that specifies
computation of a value
Except as specified later (for the function-call (), &&, ||, ?:, and
comma operators), the order of evaluation of subexpressions and the
order in which side effects take place are both unspecified
Therefore the is no guarantee in the express C++ - C when the post increment is executed.
My preprocessor appears to assume that undefined constants are 0 for the purpose of evaluating #if conditions.
Can this be relied upon, or do undefined constants give undefined behaviour?
Yes, it can be relied upon. The C99 standard specifies at §6.10.1 ¶3:
After all replacements due to macro expansion and the defined unary
operator have been performed, all remaining identifiers are replaced with the pp-number
0
Edit
Sorry, I thought it was a C question; still, no big deal, the equivalent section in the C++ standard (§16.1 ¶4) states:
After all replacements due to macro expansion and the defined unary operator
have been performed, all remaining identifiers and keywords, except for true and false, are replaced with the pp-number 0
The only difference is the different handling of true and false, which in C do not need special handling, while in C++ they have a special meaning even in the preprocessing phase.
An identifier that is not defined as a macro is converted to 0 before the expression is evaluated.
The exception is the identifier true, which is converted to 1. This is specific to the C++ preprocessor; in C, this doesn't happen and you would need to include <stdbool.h> to use true this way, in which case it will be defined as a macro and no special handling is required.
The OP was asking specifically about the C preprocessor and the first answer was correctly referring to the C preprocessor specification. But some of the other comments seem to blur the distinction between the C preprocessor and the C compiler. Just to be clear, those are two different things with separate rules and they are applied in two separate passes.
#if 0 == NAME_UNDEFINED
int foo = NAME_UNDEFINED;
#endif
This example will successfully output the foo definition because the C preprocessor evaluates NAME_UNDEFINED to 0 as part of a conditional expression, but a compiler error is generated because the initializer is not evaluated as a conditional expression and then the C compiler evaluates it as an undefined symbol.