From: Is it safe to overload char* and std::string?
#include <string>
#include <iostream>
void foo(std::string str) {
std::cout << "std::string\n";
}
void foo(char* str) {
std::cout << "char*\n";
}
int main(int argc, char *argv[]) {
foo("Hello");
}
The above code prints "char*" when compiled with g++-4.9.0 -ansi -pedantic -std=c++11.
I feel that this is incorrect, because the type of a string literal is "array of n const char", and it shouldn't be possible to initialize a non-const char* with it, so the std::string overload should be selected instead. Is gcc violating the standard here?
First, the type of string literals: They are all constant arrays of their character type.
2.14.5 String literals [lex.string]
7 A string literal that begins with u8, such as u8"asdf", is a UTF-8 string literal and is initialized with the given characters as encoded in UTF-8.
8 Ordinary string literals and UTF-8 string literals are also referred to as narrow string literals. A narrow string literal has type “array of n const char”, where n is the size of the string as defined below, and has static storage duration (3.7).
9 A string literal that begins with u, such as u"asdf", is a char16_t string literal. A char16_t string literal has type “array of n const char16_t”, where n is the size of the string as defined below; it has static storage duration and is initialized with the given characters. A single c-char may produce more than one char16_t character in the form of surrogate pairs.
10 A string literal that begins with U, such as U"asdf", is a char32_t string literal. A char32_t string literal has type “array of n const char32_t”, where n is the size of the string as defined below; it has static storage duration and is initialized with the given characters.
11 A string literal that begins with L, such as L"asdf", is a wide string literal. A wide string literal has type “array of n const wchar_t”, where n is the size of the string as defined below; it has static storage duration and is initialized with the given characters.
Next, let's see that only the standard array decay applies, that is, from T[N] to T*:
4.2 Array-to-pointer conversion [conv.array]
1 An lvalue or rvalue of type “array of N T” or “array of unknown bound of T” can be converted to a prvalue of type “pointer to T”. The result is a pointer to the first element of the array.
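The decay rule quoted above can be checked at compile time. The following is only a sketch: `LiteralRef` and `decays_to_const_pointer` are names invented for this illustration, and the checks use the standard `<type_traits>` header rather than anything from the original post.

```cpp
#include <type_traits>

// Type of the expression "Hello": an lvalue of type const char[6].
using LiteralRef = decltype("Hello");

// Array-to-pointer conversion (decay) keeps the const qualifier.
static_assert(std::is_same<std::decay<LiteralRef>::type, const char*>::value,
              "a narrow string literal decays to const char*");

// The array converts to const char*, but not to char*.
static_assert(std::is_convertible<LiteralRef, const char*>::value,
              "const char[6] -> const char* is allowed");
static_assert(!std::is_convertible<LiteralRef, char*>::value,
              "const char[6] -> char* would drop const");

// Hypothetical helper: true if T decays to a pointer-to-const.
template <typename T>
constexpr bool decays_to_const_pointer() {
    return std::is_const<typename std::remove_pointer<
        typename std::decay<T>::type>::type>::value;
}
```

Note that `std::is_convertible` tests an arbitrary expression of the array type, not a literal token, so it does not exercise the special string-literal conversion that C++03 allowed.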
And last, let's see that a conforming extension must not change the meaning of a correct program:
1.4 Implementation compliance [intro.compliance]
1 The set of diagnosable rules consists of all syntactic and semantic rules in this International Standard except for those rules containing an explicit notation that “no diagnostic is required” or which are described as resulting in “undefined behavior.”
2 Although this International Standard states only requirements on C++ implementations, those requirements are often easier to understand if they are phrased as requirements on programs, parts of programs, or execution of programs. Such requirements have the following meaning:
If a program contains no violations of the rules in this International Standard, a conforming implementation shall, within its resource limits, accept and correctly execute that program.
If a program contains a violation of any diagnosable rule or an occurrence of a construct described in this Standard as “conditionally-supported” when the implementation does not support that construct, a conforming implementation shall issue at least one diagnostic message.
If a program contains a violation of a rule for which no diagnostic is required, this International Standard places no requirement on implementations with respect to that program.
So, in summary, it's a compiler bug.
(Before C++11, i.e. in C++03, the conversion was allowed but deprecated, so selecting the char* overload would have been correct there. A diagnostic was not required, though a compiler could provide one as a quality-of-implementation matter.)
It's a GCC bug (bug-report not found yet), and also a clang bug (found by T.C.).
The test-case from the clang bug-report, which is much shorter:
void f(char*);
int &f(...);
int &r = f("foo");
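One way to stay clear of the disputed overload is to provide a const char* overload, which matches a string literal with nothing more than array-to-pointer decay. The sketch below uses a hypothetical overload set named `which` (not the `foo` from the question) that returns a label instead of printing.

```cpp
#include <string>

// Hypothetical overload set for illustration only.
inline const char* which(const std::string&) { return "std::string"; }
inline const char* which(const char*)        { return "const char*"; }
inline const char* which(char*)              { return "char*"; }

// A string literal only needs array-to-pointer decay to match const char*.
inline const char* call_with_literal() {
    return which("Hello");
}

// A mutable array decays to char*, so the char* overload is the best match.
inline const char* call_with_buffer() {
    char buf[] = "Hello";   // writable copy of the literal's characters
    return which(buf);
}
```

With this set, `call_with_literal()` yields "const char*" and `call_with_buffer()` yields "char*", regardless of whether the compiler still accepts the removed literal-to-char* conversion.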
Related
Is the pointer returned by the following function valid?
const char * bool2str( bool flg )
{
return flg ? "Yes" : "No";
}
It works well in Visual C++ and g++. What does C++ standard say about this?
On storage duration:
2.13.4
Ordinary string literals and UTF-8 string literals are also referred to as narrow string literals. A narrow
string literal has type “array of n const char”, where n is the size of the string as defined below, and has
static storage duration
read in conjunction with 3.7.1
3.7.1.
All objects which do not have dynamic storage duration, do not have thread storage duration, and are
not local have static storage duration. The storage for these objects shall last for the duration of the
program (3.6.2, 3.6.3).
On type:
Annex C
Subclause 2.13.4:
Change: String literals made const
The type of a string literal is changed from “array of char ” to “array of const char.” The type of a
char16_t string literal is changed from “array of some-integer-type ” to “array of const char16_t.” The
type of a char32_t string literal is changed from “array of some-integer-type ” to “array of const char32_-
t.” The type of a wide string literal is changed from “array of wchar_t ” to “array of const wchar_t.”
Rationale: This avoids calling an inappropriate overloaded function, which might expect to be able to
modify its argument.
Effect on original feature: Change to semantics of well-defined feature.
Difficulty of converting: Simple syntactic transformation, because string literals can be converted to
char*; (4.2). The most common cases are handled by a new but deprecated standard conversion:
char* p = "abc"; // valid in C, deprecated in C++
char* q = expr ? "abc" : "de"; // valid in C, invalid in C++
How widely used: Programs that have a legitimate reason to treat string literals as pointers to potentially
modifiable memory are probably rare.
Dynamically allocated memory (AFAIK the standard never uses the word 'heap' for an area of memory) requires a function call, which can happen as early as main, well after the static storage has been allocated.
This code is perfectly valid and conformant. The only "gotcha" would be to ensure that the caller doesn't try to free the string.
This code is valid and standard compliant.
String literals are typically stored in read-only memory, and the function just returns the address of the chosen string.
C++ standard (2.13.4) says :
An ordinary string literal has type
“array of n const char” and static
storage duration
The key to understanding your problem here is the static storage duration: string literals are allocated when your program launches and last for the duration of the program. Your function just gets the address and returns it.
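As a small sketch of what static storage duration means for the caller: the pointer can be used long after bool2str has returned, and nothing has to be freed. `length_of_answer` is a hypothetical helper added only for this illustration.

```cpp
#include <cstddef>
#include <cstring>

const char* bool2str(bool flg) {
    return flg ? "Yes" : "No";
}

// Hypothetical helper: reads the text after bool2str has returned.
std::size_t length_of_answer(bool flg) {
    const char* p = bool2str(flg);   // p points into static storage
    return std::strlen(p);           // still valid here; nothing to free
}
```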
Technically, yes, it is valid.
The strings have static storage duration.
But that is not the whole story.
These are C-strings. The convention in C libraries and functions is to return a dynamically allocated string that should be freed, i.e. a returned pointer implicitly passes ownership back to the caller (as usual in C, there are also exceptions).
If you do not follow these conventions you will confuse a lot of experienced C developers who expect them. If you deviate from this expectation, it should be well documented in the code.
Also, this is C++ (as per your tags), so it is more conventional to return a std::string. The reason is that passing ownership via pointers is only implied (and this led to a lot of errors in C code where the above expectation was broken but documented; unfortunately the documentation was never read by the user of the code). By using a std::string you are passing an object, and there is no longer any question of ownership (the result is passed back as a value and is thus yours); and because it is an object, there are no questions or issues with resource allocation.
If you are worried about efficiency, I think that is a false concern.
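A minimal sketch of the value-returning variant suggested here; the name `bool_to_string` is made up for the example.

```cpp
#include <string>

// Hypothetical value-returning variant of bool2str.
// The caller receives its own std::string, so ownership is never in question.
std::string bool_to_string(bool flg) {
    return flg ? "Yes" : "No";
}
```

The returned object can be modified, stored, or discarded by the caller without any effect on the literals themselves.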
If you want this for printing via streams there is already a standard convention to do that:
std::cout << std::boolalpha << false << std::endl;
std::cout << std::boolalpha << true << std::endl;
I know it's perfectly possible to initialise a char array with a string literal:
char arr[] = "foo";
C++11 8.5.2/1 says so:
A char array (whether plain char, signed char, or unsigned char), char16_t array, char32_t array, or
wchar_t array can be initialized by a narrow character literal, char16_t string literal, char32_t string
literal, or wide string literal, respectively, or by an appropriately-typed string literal enclosed in braces.
Successive characters of the value of the string literal initialize the elements of the array. ...
However, can you do the same with two string literals in a conditional expression? For example like this:
char arr[] = MY_BOOLEAN_MACRO() ? "foo" : "bar";
(Where MY_BOOLEAN_MACRO() expands to a 1 or 0).
The relevant parts of C++11 5.16 (Conditional operator) are as follows:
1 ... The first expression is contextually converted to bool (Clause 4).
It is evaluated and if it is true, the result of the conditional expression is the value of the second expression,
otherwise that of the third expression. ...
4 If the second and third operands are glvalues of the same value category and have the same type, the result
is of that type and value category and it is a bit-field if the second or the third operand is a bit-field, or if
both are bit-fields.
Notice that the literals are of the same length and thus they're both lvalues of type const char[4].
GCC on ideone accepts the construct. But from reading the standard, I am simply not sure whether it's legal or not. Does anyone have better insight?
On the other hand, clang does not accept such code (see it live), and I believe clang is correct on this (MSVC also rejects this code).
A string literal is defined by the grammar in section 2.14.5:
string-literal:
encoding-prefix(opt) " s-char-sequence(opt) "
encoding-prefix(opt) R raw-string
and the first paragraph from this section says (emphasis mine):
A string literal is a sequence of characters (as defined in 2.14.3)
surrounded by double quotes, optionally prefixed by R, u8, u8R, u, uR,
U, UR, L, or LR, as in "...", R"(...)", u8"...", u8R"(...)",
u"...", uR"~(...)~", U"...", UR"zzz(...)zzz", L"...", or LR"(...)",
respectively
and it further says that the type of a narrow string literal is:
“array of n const char”,
as well as:
has static storage duration
but an “array of n const char”, with static storage duration is not a string literal since it does not fit the grammar nor does it fit paragraph 1.
We can make this fail on gcc if we use a non-constant expression (see it live):
bool x = true ;
char arr[] = x ? "foo" : "bar";
which means it is probably an extension, but it is non-conforming since it does not produce a warning in strict conformance mode i.e. using -std=c++11 -pedantic. From section 1.4 [intro.compliance]:
[...]Implementations are required to diagnose programs that use such
extensions that are ill-formed according to this International
Standard. Having done so, however, they can compile and execute such
programs.
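If the goal is simply a writable array (or string) holding one of two literals, a portable route that avoids the disputed initialization is to select a pointer first and copy afterwards. This is a sketch with hypothetical helpers `fill` and `pick`, and it assumes both literals fit the destination buffer.

```cpp
#include <cstring>
#include <string>

// Hypothetical helper: fills a caller-provided 4-char buffer.
void fill(char (&arr)[4], bool flag) {
    const char* src = flag ? "foo" : "bar";  // the selected array decays to const char*
    std::strcpy(arr, src);                   // both literals are 3 chars + '\0'
}

// Hypothetical helper: the same choice, returned as an owning string.
std::string pick(bool flag) {
    return flag ? "foo" : "bar";
}
```

Neither function relies on initializing an array from a conditional expression, so the code does not depend on which compiler's reading of the standard is correct.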
This works in GCC in C++11 or newer because the condition you're providing is known at compile time (e.g., it is a constant expression). Since the compiler can figure out which branch is taken, it can figure out which literal to use.
To remove the constexpr ability, try something like this:
#include <iostream>
#include <cstdlib>
int main() {
bool _bool = rand();
char arr[] = (_bool) ? "asdf" : "ffff";
std::cout << arr << std::endl;
}
GCC then errors out with:
g++ test.cpp -std=c++11
test.cpp: In function ‘int main()’:
test.cpp:6:34: error: initializer fails to determine size of ‘arr’
char arr[] = (_bool) ? "asdf" : "ffff";
^
test.cpp:6:34: error: array must be initialized with a brace-enclosed initializer
I don't know the standard's text definition well enough to know where or why this is valid, but I feel that it is valid.
For further reading on constexpr and how it can impact compilability, see the answer by @ShafikYaghmour in another question.
Out of curiosity, I'm wondering what the real underlying type of a C++ string literal is.
Depending on what I observe, I get different results.
A typeid test like the following:
std::cout << typeid("test").name() << std::endl;
shows me char const[5].
Trying to assign a string literal to an incompatible type like so (to see the given error):
wchar_t* s = "hello";
I get a value of type "const char *" cannot be used to initialize an entity of type "wchar_t *" from VS12's IntelliSense.
But I don't see how it could be const char * as the following line is accepted by VS12:
char* s = "Hello";
I have read that this was allowed in pre-C++11 standards for backward compatibility with C, although modifying the string through s would result in undefined behavior. I assume that VS12 has simply not yet implemented all of the C++11 standard and that this line would normally result in an error.
Reading the C99 standard (from here, 6.4.5.5) suggests that it should be an array:
The multibyte character
sequence is then used to initialize an array of static storage duration and length just
sufficient to contain the sequence.
So, what is the type underneath a C++ string literal?
Thank you very much for your precious time.
The type of a string literal is indeed const char[SIZE] where SIZE is the length of the string plus the null terminating character.
The fact that you're sometimes seeing const char* is because of the usual array-to-pointer decay.
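The array type can be observed directly as long as nothing forces the decay. A sketch using `decltype`, `sizeof`, and a hypothetical function template `literal_size` that takes the literal by reference:

```cpp
#include <cstddef>
#include <type_traits>

// A string literal is an lvalue, so decltype yields a reference to the array type.
static_assert(std::is_same<decltype("test"), const char (&)[5]>::value,
              "\"test\" has type const char[5]");
static_assert(sizeof("test") == 5, "4 characters plus the terminating null");

// Hypothetical helper: binding by reference prevents decay, so N is deduced.
template <std::size_t N>
constexpr std::size_t literal_size(const char (&)[N]) {
    return N;
}

static_assert(literal_size("hello") == 6, "5 characters plus the terminating null");
```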
But I don't see how it could be const char * as the following line is accepted by VS12:
char* s = "Hello";
This was allowed in C++03 (as a deprecated exception to the usual const-correctness rules), but the conversion was removed in C++11. A C++11-compliant compiler should diagnose that code.
The type of a string literal is char const[N], where N is the number of characters including the terminating null character. Although this type does not convert to char*, the C++ standard up to C++03 included a clause allowing a string literal to be assigned to a char*. This clause was added for compatibility, especially with C code written before C had const; it was deprecated from the start and removed in C++11.
The relevant clause for the type in the standard is 2.14.5 [lex.string] paragraph 8:
Ordinary string literals and UTF-8 string literals are also referred to as narrow string literals. A narrow string literal has type “array of n const char”, where n is the size of the string as defined below, and has static storage duration (3.7).
First off, the type of a C++ string literal is an array of n const char. Secondly, if you want to initialise a wchar_t pointer with a string literal you have to use a wide literal and a pointer to const:
const wchar_t* s = L"hello";
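The same rules hold for wide literals: the array elements are const wchar_t, so the pointer should be a pointer to const. A brief sketch, with `wide_length` as a hypothetical helper:

```cpp
#include <cstddef>
#include <cwchar>
#include <type_traits>

// L"hello" is an lvalue of type const wchar_t[6].
static_assert(std::is_same<decltype(L"hello"), const wchar_t (&)[6]>::value,
              "wide literal has type const wchar_t[6]");

// Hypothetical helper: uses the wide literal through a pointer to const.
inline std::size_t wide_length() {
    const wchar_t* s = L"hello";   // note the const on the pointee
    return std::wcslen(s);
}
```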
This is valid, because a constexpr expression is allowed to take the value of "a glvalue of literal type that refers to a non-volatile object defined with constexpr, or that refers to a sub-object of such an object" (§5.19/2):
constexpr char str[] = "hello, world";
constexpr char e = str[1];
However, it would seem that string literals do not fit this description:
constexpr char e = "hello, world"[1]; // error: literal is not constexpr
2.14.5/8 describes the type of string literals:
Ordinary string literals and UTF-8 string literals are also referred to as narrow string literals. A narrow string literal has type “array of n const char”, where n is the size of the string as defined below, and has static storage duration.
It would seem that an object of this type could be indexed, if only it were temporary and not of static storage duration (5.19/2, right after the above snippet):
[constexpr allows lvalue-to-rvalue conversion of] … a glvalue of literal type that refers to a non-volatile temporary object whose lifetime has not ended, initialized with a constant expression
This is particularly odd since taking the lvalue of a temporary object is usually "cheating." I suppose this rule applies to function arguments of reference type, such as in
constexpr char get_1( char const (&str)[ 6 ] )
{ return str[ 1 ]; }
constexpr char i = get_1( { 'y', 'i', 'k', 'e', 's', '\0' } ); // OK
constexpr char e = get_1( "hello" ); // error: string literal not temporary
For what it's worth, GCC 4.7 accepts get_1( "hello" ), but rejects "hello"[1] because "the value of ‘._0’ is not usable in a constant expression"… yet "hello"[1] is acceptable as a case label or an array bound.
I'm splitting some Standardese hairs here… is the analysis correct, and was there some design intent for this feature?
EDIT: Oh… there is some motivation for this. It seems that this sort of expression is the only way to use a lookup table in the preprocessor. For example, this introduces a block of code which is ignored unless SOME_INTEGER_FLAG is 1 or 5, and causes a diagnostic if greater than 6:
#if "\0\1\0\0\0\1"[ SOME_INTEGER_FLAG ]
This construct would be new to C++11.
The intent is that this works and the paragraphs that state when an lvalue to rvalue conversion is valid will be amended with a note that states that an lvalue that refers to a subobject of a string literal is a constant integer object initialized with a constant expression (which is described as one of the allowed cases) in a post-C++11 draft.
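Under that intended reading, both ways of indexing should be usable in constant expressions. The sketch below assumes a compiler that implements this intent (current GCC and Clang releases do, as far as I know); the variable names are made up for the illustration.

```cpp
// Indexing a constexpr array, as in the question.
constexpr char str[] = "hello, world";
constexpr char from_array = str[1];

// Indexing the literal directly, which the intended wording permits.
constexpr char from_literal = "hello, world"[1];

// Passing a literal to a constexpr function taking an array reference.
constexpr char get_1(char const (&s)[6]) { return s[1]; }
constexpr char from_call = get_1("hello");

static_assert(from_array == 'e', "second character of the array");
static_assert(from_literal == 'e', "second character of the literal");
static_assert(from_call == 'e', "second character via get_1");
```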
Your comment about the use within the preprocessor looks interesting, but I'm unsure whether that is intended to work. This is the first I have heard of it.
Regarding your question about #if, it was not the intent of the standards committee to increase the set of expressions which can be used in the preprocessor, and the current wording is considered to be a defect. This will be listed as core issue 1436 in the post-Kona WG21 mailing. Thanks for bringing this to our attention!
Can you do this?
char* func()
{
char * c = "String";
return c;
}
Is "String" here data that the compiler allocates globally?
You can do that. But it would be even more correct to say:
const char* func(){
return "String";
}
The C++ spec says that string literals are given static storage duration. I can't link to it because there are precious few versions of the C++ spec online.
This page on const correctness is the best reference I can find.
Section 2.13.4 of ISO/IEC 14882 (Programming languages - C++) says:
A string literal is a sequence of characters (as defined in 2.13.2) surrounded by double quotes, optionally
beginning with the letter L, as in "..." or L"...". A string literal that does not begin with L is an ordinary
string literal, also referred to as a narrow string literal. An ordinary string literal has type “array of n
const char” and static storage duration (3.7), where n is the size of the string as defined below, and is
initialized with the given characters. ...
Whether all string literals are distinct (that is, are stored in nonoverlapping objects) is implementation defined.
The effect of attempting to modify a string literal is undefined.
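Since modifying the literal itself is undefined, the usual approach when a writable string is needed is to modify a copy. A sketch, with `lowercase_first` as a hypothetical helper built on the const-correct `func`:

```cpp
#include <string>

const char* func() {
    return "String";
}

// Hypothetical helper: changes a copy, never the literal itself.
std::string lowercase_first() {
    std::string copy = func();   // copies the characters out of static storage
    copy[0] = 's';               // safe: this writes to the copy's own buffer
    return copy;
}
```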
You can do this currently (there is no reason to, though). But you cannot do this anymore with C++0x. They removed the deprecated conversion of a string literal (which has the type const char[N]) to a char *.
Note that this conversion is only for string literals. Thus the following two things are illegal, the first of which specifies an array and the second of which specifies a pointer for initialization
char *x = (0 ? "123" : "345"); // illegal: const char[N] -> char*
char *x = +"123"; // illegal: const char * -> char*
GCC incorrectly accepts both, Clang correctly rejects both.
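Why the second line has no string literal left to convert can be seen from the type of the expression: unary plus forces the array-to-pointer conversion, so the result is an ordinary const char*. The sketch below checks this with `<type_traits>`; the conditional example deliberately uses literals of different lengths, where the operands are converted to pointers before a common type is found.

```cpp
#include <type_traits>

// Unary plus applies array-to-pointer conversion; the result is a prvalue pointer.
static_assert(std::is_same<decltype(+"123"), const char*>::value,
              "+\"123\" has type const char*");

// Literals of different lengths have different array types, so both operands
// are converted to const char* and the conditional yields a const char*.
static_assert(std::is_same<decltype(true ? "abc" : "de"), const char*>::value,
              "conditional over differently sized literals yields const char*");

// A plain const char* never converts implicitly to char*.
static_assert(!std::is_convertible<const char*, char*>::value,
              "dropping const requires a cast");
```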
The constant is not allocated on the heap but it is a constant. You don't need to destroy it.
Not in a modern compiler. In modern C++, the type of "String" is const char[7], which decays to const char *, and that can't be assigned to a char * due to the const mismatch.
If you made c a const char * (and changed the return type of the function), the code would be legal. Typically the string literal "String" would be placed in the executable's data section by the linker, and in many cases, in a special section for read-only data.