I don't know if I've missed something or if it really doesn't exist. The C++11 standard added raw string literals:
string s = "\\w\\\\\\w"; // I hope I got that right
string s = R"(\w\\\w)"; // I'm pretty sure I got that right
But all my attempts to use a raw character literal have failed:
constexpr char bslash = R('\'); // error: missing terminating ' character
constexpr char bslash = R'(\)'; // error: 'R' was not declared in this scope
The second attempt is considered a multi-character constant! The only way I've found to get something similar to a raw character literal is:
constexpr char slash = *R"(\)"; // All Ok.
But I don't like this notation (dereferencing a string literal in order to store a copy of its first element) because it is kind of confusing.
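For comparison, here is a minimal sketch of the alternatives, assuming a C++11 compiler. The variable names are made up for illustration. Indexing the raw string literal reads a little more clearly than dereferencing it, and both give the same value as the ordinary escaped character literal:

```cpp
// Ordinary character literal: the backslash has to be escaped.
constexpr char escaped_backslash = '\\';

// First element of a raw string literal, taken by index instead of by dereference.
constexpr char indexed_backslash = R"(\)"[0];

// The dereference form from the question, for comparison.
constexpr char dereferenced_backslash = *R"(\)";

static_assert(escaped_backslash == indexed_backslash,
              "both spell a single backslash");
static_assert(escaped_backslash == dereferenced_backslash,
              "both spell a single backslash");
```

The static_assert lines are checked at compile time, so a mismatch would stop the build rather than show up at run time.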
Well, what's the question?
Do raw character literals exist? (I have not found anything about them, so I'm nearly sure that they don't.)
If they exist: how should I write a raw character literal?
If they don't exist: why? Was there a reason to add raw string literals but NOT raw character literals?
The proposal that introduced raw string literals in C++11 is, as far as I can tell, N2442 - Raw and Unicode String Literals; Unified Proposal. It is based upon N2146 - Raw String Literals (Revision 1) by Beman Dawes, which contains a section about raw character literals:
As a deliberate design choice, raw character (as opposed to string)
literals are not proposed because there is no apparent need; escape
sequences do not pose the same practical problems in character
literals that they do in string literals.
The arguments in favor of raw character literals are symmetry and
error-reduction. Knowing that raw string-literals are allowed,
programmers are likely to assume raw character-literals are also
available. Indeed, a committee member inadvertently made that
assumption when reading a draft of this paper. Although the resulting
error is easy to fix, there is the argument that it is better to
eliminate the possibility of the error by providing raw
character-literals in the first place.
I will be happy to provide proposed wording if the committee desires
to add raw character literals.
Unfortunately, I cannot find any discussion in the meeting minutes that mentions any of the related proposals. It is likely, though, that the reason given in the first paragraph led to the current situation.
I know that strlen() does not count the terminating NUL character. I know this for a fact. So this question is NOT about why strlen() might "presumably" return the wrong string length, which has already been asked and answered well here on Stack Overflow, e.g. in this thread, or this one.
So let's get to my question:
ISO/IEC 9899:1990 (E), 7.1.1, states:
A string is a contiguous sequence of characters terminated by and including the first null character.
Why does strlen() deviate from this definition in the standard and not "want" to count the terminating NUL character as part of the string?
Why?
Because you would expect this pseudocode's assertion to hold true:
str1 = "foo"
str2 = "bar"
str3 = concatenate(str1, str2)
Assert strlen(str1) + strlen(str2) == strlen(str3)
If strlen counted the terminating '\0', the assertion above would not hold, which would be much more of a headache overall than the current C string behavior. More importantly, it would in my opinion be quite unintuitive and illogical.
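The pseudocode can be turned into a small runnable check. This is only a sketch: strlen_is_additive is a hypothetical helper name, and std::string is used for the concatenation purely to avoid manual buffer management.

```cpp
#include <cstring>
#include <string>

// Hypothetical helper: returns true when the lengths of the two parts
// add up to the length of their concatenation.
bool strlen_is_additive(const char* first, const char* second) {
    // Build the concatenation; c_str() yields a null-terminated C string.
    const std::string joined = std::string(first) + second;
    return std::strlen(first) + std::strlen(second)
           == std::strlen(joined.c_str());
}
```

If strlen counted the terminator, the left-hand side would count two terminators while the right-hand side counts only one, and the check would fail for every pair of strings.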
Taking your doubt as a reasonable point, we can state that a C string consists of two parts:
1. the string's useful content ("the text");
2. the null terminating character.
The null terminating character is purely a technical measure for determination of the end of the string by the C-originated library functions. Still, if one types a declaration:
char * str = "some string";
they would logically expect its length to be 11, which is the number of characters they can see in the statement. Hence strlen() yields only the length of part 1 of the string.
Not really an answer to your question, but consider this example:
char string[] = "string";
printf("sizeof: %zu\n", sizeof(string));
printf("strlen: %zu\n", strlen(string));
This prints
sizeof: 7
strlen: 6
So sizeof counts the \0, but strlen doesn't.
Questions like this, that ask why a certain age-old decision was made one way and not another way, are hard to answer. I can say that it's perfectly obvious to me, anyway, that strlen should count just the real, "interesting" characters that are in the string, and ignore the \0 at the end that merely terminates it. I'm used to accounting for the \0 separately. I imagine it would have been considerably more of a nuisance overall if strlen had been defined the other way. But I can't prove this with convincing arguments, and I've been using strlen with its current definition for so long that I'm probably hopelessly biased; I might be saying "it's perfectly obvious to me that..." even if strlen's definition were quite wrong.
There is a difference between the physical, stored representation of a C style string and the logical representation of a C style string.
The physical representation, that is, how the string is actually stored in memory or other media, includes the null character. The null character is included when discussing the physical representation because it takes up an additional piece of storage. In order to be a C style string, the null character must be stored.
However, the logical representation of a string does not include the null character. The logical representation of a string includes only the text characters that the programmer wants to manipulate.
I suspect that the null character, a value of binary zero, was chosen because the original ASCII character set defined the character value zero as the NUL character. As one of the lowest values among the various teletype control codes, it seems to be the ASCII character least likely to appear in text. See ASCII Character Codes.
Another nice quality of using binary zero as the string terminator is that it is the value that represents logical false, so iterating over a string is often a matter of incrementing an array index or a pointer while the current character is logically true, since every character other than the end-of-string indicator has a non-zero, or logically true, value.
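The iteration idiom described above might look like this. It is a sketch with a hypothetical function name, not the actual library implementation of strlen():

```cpp
#include <cstddef>

// Hypothetical helper: counts characters by walking the pointer
// until it reaches the terminator, which tests as logical false.
std::size_t count_characters(const char* text) {
    std::size_t count = 0;
    while (*text) {   // true for every character except '\0'
        ++text;
        ++count;
    }
    return count;
}
```

The loop condition needs no explicit comparison against '\0' because the terminator is the only character whose value converts to false.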
Because the C programming language is so close to the hardware, the programmer needs to be concerned with both representations: the physical representation, which includes the null character, when allocating memory to store a string, and the logical representation, which is the string without the null character.
The various C style string manipulation functions in the Standard Library (strlen(), strcpy(), etc.) are all designed around the logical representation of a C style string. They perform their actions by treating the null character not as part of the text but as a special indicator that marks the end of the string. However, as part of their operations they need to be aware of the null character and its use as a special symbol. For instance, when strcpy() or strcat() is used to copy strings, it must also copy the null character that indicates the end of the string, even though it is not part of the actual text of the logical representation.
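A short sketch of how the two representations interact when copying a string. The helper name duplicate_c_string is an assumption for illustration, and std::unique_ptr is used only so the example does not leak:

```cpp
#include <cstddef>
#include <cstring>
#include <memory>

// Hypothetical helper: copies a C string into freshly allocated storage.
std::unique_ptr<char[]> duplicate_c_string(const char* source) {
    // Logical length: the text only, as reported by strlen().
    const std::size_t logical_length = std::strlen(source);
    // Physical size: the text plus one byte for the terminator.
    const std::size_t physical_size = logical_length + 1;

    std::unique_ptr<char[]> copy(new char[physical_size]);
    // Copy the terminator too, otherwise the result is not a C string.
    std::memcpy(copy.get(), source, physical_size);
    return copy;
}
```

Forgetting the "+ 1" is the classic off-by-one error that comes from mixing up the two representations.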
This choice allows text strings to be stored as arrays of characters, as befits the hardware orientation and efficiency characteristics of C. There is no need to create an additional built in type for text strings and it fits well with the lean character of the C programming language.
C++ is able to provide the std::string because of being object oriented and having the additional facilities of the language that allows for objects to be created and managed. The C programming language, due to its simple syntax and lack of object oriented facilities does not have this convenience.
The problem with this approach is that the programmer needs to be aware of both the physical representation and the logical representation of text strings and be able to accommodate the needs of both when writing programs.
What exactly is the point of u8 character literals as proposed by N4267?
Their only function seems to be to prevent extended ASCII characters or partial UTF-8 code points from being specified. They still store in a fixed-width 8-bit char (which, as I understand it, is the correct and best way to handle UTF-8 anyway for almost all use cases), so they don't support non-ASCII characters at all. What is going on?
(Actually I'm not entirely sure I understand the need for UTF-8 string literals either. I guess it's the worry of compilers doing weird/ambiguous things with Unicode strings coupled with validation of the Unicode?)
The rationale is covered by Evolution Working Group issue 119: N4197 Adding u8 character literals, [tiny] Why no u8 character literals?, which tracked the proposal and says:
We have five encoding-prefixes for string-literals (none, L, u8, u, U)
but only four for character literals -- the missing one is u8 for
character literals.
This matters for implementations where the narrow execution character
set is not ASCII. In such a case, u8 character literals would provide
an ideal way to write character literals with guaranteed ASCII
encoding (the single-code-unit u8 encodings are exactly ASCII), but...
we don't provide them. Instead, the best one can do is something like this:
char x_ascii = { u'x' };
... where we'll get a narrowing error if the codepoint doesn't fit in
a 'char'. (Note that this is not quite the same as u8'x', which would
give us an error if the codepoint was not representable as a single
code unit in UTF-8.)
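To make the difference concrete, here is a small sketch. The u8 character literal is guarded by a language-version check because it assumes C++17 or later. The comparison against 0x78 works whether the literal has type char (C++17) or char8_t (C++20).

```cpp
// Workaround quoted above: a char16_t literal brace-initializing a char.
// The compiler reports a narrowing error if the code point does not fit.
constexpr char x_ascii = { u'x' };
static_assert(x_ascii == 0x78, "LATIN SMALL LETTER X is U+0078");

#if __cplusplus >= 201703L
// u8 character literal: ill-formed unless the character is
// representable as a single UTF-8 code unit.
static_assert(u8'x' == 0x78, "single-code-unit UTF-8 is exactly ASCII");
#endif
```

On a platform whose narrow execution character set is not ASCII, a plain 'x' could have a different value, which is the case the quoted issue is concerned with.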
Suppose I've following initialization of a char array:
char charArray[]={'h','e','l','l','o',' ','w','o','r','l','d'};
and I also have following initialization of a string literal:
char stringLiteral[]="hello world";
The only difference between contents of first array and second string is that second string's got a null character at its end.
When it's the matter of initializing a char array, is there a macro or something that allows us to put our initializing text between two double quotation marks but where the array doesn't get an extra null terminating character?
It just doesn't make sense to me that, when a terminating null character is not needed, we should have to use the syntax of the first initialization and write two single quotation marks around each character in the initializer text, as well as commas to separate the characters.
I should add that when I want a plain char array, it should be obvious that I don't intend to pass it to functions that rely on null-terminated strings, and that I don't need any of the other properties that come with string literals.
I'm thankful for your answers.
It's allowed in C to declare the array as follows, which will initialize it without copying the terminating '\0':
char c[3] = "foo";
But it's illegal in C++. I'm not aware of a trick that would allow it in C++. The C++ Standard further says:
Rationale: When these non-terminated arrays are manipulated by standard string routines, there is potential for major catastrophe.
Effect on original feature: Deletion of semantically well-defined feature.
Difficulty of converting: Semantic transformation. The arrays must be declared one element bigger to contain the string terminating ’\0’.
How widely used: Seldom. This style of array initialization is seen as poor coding style.
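The sizes involved can be checked at compile time in C++. This is only an illustrative sketch with made-up array names. The C-only form is left as a comment because a C++ compiler rejects it.

```cpp
// Initialized from a string literal: the terminator is included.
const char with_terminator[] = "foo";

// Initialized element by element: no terminator is added.
const char without_terminator[] = {'f', 'o', 'o'};

// const char too_small[3] = "foo";  // accepted by C, ill-formed in C++

static_assert(sizeof(with_terminator) == 4, "three characters plus '\\0'");
static_assert(sizeof(without_terminator) == 3, "three characters only");
```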
There is no way of doing what you want. The first way of initializing the array specifies a separate initializer for each character, which lets you explicitly leave off the '\0'. The second initializes a character array from a string literal, which in C/C++ is always terminated by a null character.
EDIT: corrected: 'character pointer' --> 'character array'
litb has the technically correct answer.
As for an opinion - I say just live with the 'waste' of the extra '\0'. So many bugs are the result of code expecting a terminating null where one isn't (this advice may seem to go directly against some other advice I gave just a day or two ago about not bothering to zero an entire buffer. I claim there's no contradiction - I still advocated null terminating the string in the buffer).
If you really can't live with the '\0' terminator because of some semantics in the data structure you're dealing with, such as it might be part of some larger packed structure, you can always init the array yourself (which I think should be no less efficient than what the compiler might have done for you):
#define MY_STRING_LITERAL "hello world"
char stringLiteral[sizeof(MY_STRING_LITERAL) - 1];
memcpy( stringLiteral, MY_STRING_LITERAL, sizeof(stringLiteral));
The basic answer is that the vast majority of char arrays are strings - in C, strings are null terminated. C++ inherited that convention. Even when that null isn't needed, most of the time it isn't a problem just to leave it there anyway.
Macros aren't powerful enough to do what you want. Templates would be, except they don't have any compile-time string handling.
Usually, when people want to mix numeric bytes and string literals in the same char-array sequence, they use a string literal but use hex character escapes such as \xFF.
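With newer language versions, templates can handle this case after all. The following is a sketch assuming C++14 or later (for std::index_sequence). strip_terminator and its helper are hypothetical names, not standard library functions. It copies every character of the literal except the terminator into a std::array:

```cpp
#include <array>
#include <cstddef>
#include <utility>

// Helper: expands the index pack to pick out text[0] .. text[N-2].
template <std::size_t N, std::size_t... I>
constexpr std::array<char, N - 1> strip_terminator_impl(
    const char (&text)[N], std::index_sequence<I...>) {
    return {{text[I]...}};
}

// Hypothetical helper: array of the literal's characters without '\0'.
template <std::size_t N>
constexpr std::array<char, N - 1> strip_terminator(const char (&text)[N]) {
    return strip_terminator_impl(text, std::make_index_sequence<N - 1>{});
}

constexpr auto hello = strip_terminator("hello world");
static_assert(hello.size() == 11, "the terminator is dropped");
static_assert(hello[10] == 'd', "the last element is the last letter");
```

The result is a std::array rather than a built-in array, so it has to be accessed through data() or operator[] where a plain char array is expected.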
I might have found a way to do what I want. It isn't directly what I wanted, but it likely has the same effect.
First consider two following classes:
template <size_t size>
class Cont{
public:
char charArray[size];
};
template <size_t size>
class ArrayToUse{
public:
Cont<size> container;
inline ArrayToUse(const Cont<size+1> & input):container(reinterpret_cast<const Cont<size> &>(input)){}
};
Before proceeding, you might want to go here and take a look at constant expression constructors and initialization types.
Now look at following code:
const Cont<12> container={"hello world"};
ArrayToUse<11> temp(container);
char (&charArray)[11]=temp.container.charArray;
Finally, the initializer text is written between double quotation marks.
What is the type of string literal in C? Is it char * or const char * or const char * const?
What about C++?
In C the type of a string literal is a char[] - it's not const according to the type, but it is undefined behavior to modify the contents. Also, 2 different string literals that have the same content (or enough of the same content) might or might not share the same array elements.
From the C99 standard 6.4.5/5 "String Literals - Semantics":
In translation phase 7, a byte or code of value zero is appended to each multibyte character sequence that results from a string literal or literals. The multibyte character sequence is then used to initialize an array of static storage duration and length just sufficient to contain the sequence. For character string literals, the array elements have type char, and are initialized with the individual bytes of the multibyte character sequence; for wide string literals, the array elements have type wchar_t, and are initialized with the sequence of wide characters...
It is unspecified whether these arrays are distinct provided their elements have the appropriate values. If the program attempts to modify such an array, the behavior is undefined.
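The practical consequence is the same in C and C++: to modify the text, initialize an array from the literal rather than pointing at the literal itself. A minimal sketch, written in C++ with a made-up function name:

```cpp
#include <cstring>

// Hypothetical helper: edits a private copy of the literal's characters.
bool modify_array_copy() {
    // The array is initialized from the literal and owns its own storage.
    char editable[] = "hello";
    editable[0] = 'H';  // well-defined: we modify our copy

    // Writing through a pointer to the literal itself would be
    // undefined behavior, so that line is deliberately left out.
    return std::strcmp(editable, "Hello") == 0;
}
```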
In C++, "An ordinary string literal has type 'array of n const char'" (from 2.13.4/1 "String literals"). But there's a special case in the C++ standard that makes pointer to string literals convert easily to non-const-qualified pointers (4.2/2 "Array-to-pointer conversion"):
A string literal (2.13.4) that is not a wide string literal can be converted to an rvalue of type “pointer to char”; a wide string literal can be converted to an rvalue of type “pointer to wchar_t”.
As a side note - because arrays in C/C++ convert so readily to pointers, a string literal can often be used in a pointer context, much as any array in C/C++.
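The C++ types can be confirmed with a few compile-time checks. This is a sketch assuming C++11 or later. Because a string literal is an lvalue, decltype yields a reference to the array type.

```cpp
#include <type_traits>

// "abc" is an lvalue of type "array of 4 const char".
static_assert(std::is_same<decltype("abc"), const char (&)[4]>::value,
              "narrow literal: array of const char");

// L"abc" is an lvalue of type "array of 4 const wchar_t".
static_assert(std::is_same<decltype(L"abc"), const wchar_t (&)[4]>::value,
              "wide literal: array of const wchar_t");

// Array-to-pointer conversion: the literal decays to const char*.
const char* pointer_view = "abc";

// char* legacy_view = "abc";  // deprecated in C++03, ill-formed since C++11
```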
Additional editorializing: what follows is really mostly speculation on my part about the rationale for the choices the C and C++ standards made regarding string literal types. So take it with a grain of salt (but please comment if you have corrections or additional details):
I think that the C standard chose to make string literals non-const because there was (and is) so much code that expects to be able to use non-const-qualified char pointers that point to literals. When the const qualifier was added (which, if I'm not mistaken, happened around ANSI standardization time, long after K&R C had been around to accumulate a ton of existing code), making pointers to string literals assignable only to char const* types without a cast would have required changes to nearly every program in existence. Not a good way to get a standard accepted...
I believe the change to C++ that string literals are const qualified was done mainly to support allowing a literal string to more appropriately match an overload that takes a "char const*" argument. I think that there was also a desire to close a perceived hole in the type system, but the hole was largely opened back up by the special case in array-to-pointer conversions.
Annex D of the standard indicates that the "implicit conversion from const to non-const qualification for string literals (4.2) is deprecated", but I think so much code would still break that it'll be a long time before compiler implementers or the standards committee are willing to actually pull the plug (unless some other clever technique can be devised - but then the hole would be back, wouldn't it?).
A C string literal has type char [n] where n equals number of characters + 1 to account for the implicit zero at the end of the string.
The array will be statically allocated; it is not const, but modifying it is undefined behaviour.
If it had pointer type char * or incomplete type char [], sizeof could not work as expected.
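The sizeof point can be illustrated as follows. The sketch is written in C++, where the element type is const char, but the array length works the same way as in C. The variable names are made up.

```cpp
#include <cstddef>

// sizeof applied to the literal sees the whole array: 11 characters + '\0'.
constexpr std::size_t literal_size = sizeof("some string");

// Once the literal has decayed to a pointer, sizeof only sees the pointer.
const char* const as_pointer = "some string";
constexpr std::size_t pointer_size = sizeof(as_pointer);

static_assert(literal_size == 12, "array size includes the terminator");
static_assert(pointer_size == sizeof(const char*), "size of a pointer");
```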
Making string literals const is a C++ idiom and not part of any C standard.
They used to be of type char[]. Now they are of type const char[].
For various historical reasons, string literals were always of type char[] in C.
Early on (in C90), it was stated that modifying a string literal invokes undefined behavior.
They didn't ban such modifications though, nor did they make string literals const char[] which would have made more sense. This was for backwards-compatibility reasons with old code. Some old OS (most notably DOS) didn't protest if you modified string literals, so there was plenty of such code around.
C still has this defect today, even in the most recent C standard.
C++ inherited the very same defect from C, but the C++ standards eventually closed it: string literals have type const char[], and the implicit conversion from a string literal to char* was deprecated in C++03 and finally removed in C++11.
If I call a function like
myObj.setType("fluid");
many times in a program, how many copies of the literal "fluid" are saved in memory? Can the compiler recognize that this literal is already defined and just reference it again?
This has nothing to do with C++ (the language). Instead, it is an "optimization" that a compiler can do. So the answer is yes and no, depending on the compiler/platform you are using.
@David This is from the latest draft of the language:
§ 2.14.6 (page 28)
Whether all string literals are
distinct (that is, are stored in
non overlapping objects) is
implementation defined. The effect of
attempting to modify a string literal
is undefined.
The emphasis is mine.
In other words, string literals in C++ are immutable because modifying a string literal is undefined behavior. So the compiler is free to eliminate redundant copies.
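Whether a particular compiler merges the copies can be probed by comparing addresses. This is only a sketch with hypothetical function names. The result of the address comparison depends on the compiler and its settings, so no particular outcome is claimed here.

```cpp
#include <cstring>

// Two separate occurrences of the same literal text.
const char* first_literal() { return "fluid"; }
const char* second_literal() { return "fluid"; }

// May be true or false: the standard leaves the storage layout
// of identical literals up to the implementation.
bool literals_share_storage() {
    return first_literal() == second_literal();
}

// Always true: the contents are the same either way.
bool literals_have_equal_contents() {
    return std::strcmp(first_literal(), second_literal()) == 0;
}
```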
BTW, I am talking about C++ only ;)
Yes, it can. Of course, it depends on the compiler. For VC++, it's even configurable:
http://msdn.microsoft.com/en-us/library/s0s0asdt(VS.80).aspx
Yes it can, but there's no guarantee that it will. Define a constant if you want to be sure.
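A sketch of what "define a constant" could look like. The namespace and names are made up for illustration. Every use refers to the same named array, so within a translation unit there is exactly one object regardless of compiler options:

```cpp
#include <cstring>

namespace material {
// One named array object; all uses below refer to this storage.
const char kFluidType[] = "fluid";
}  // namespace material

// Hypothetical call sites that would otherwise each spell out "fluid".
const char* first_use() { return material::kFluidType; }
const char* second_use() { return material::kFluidType; }
```

A call such as myObj.setType(material::kFluidType) would then pass the same address every time.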
This is a compiler implementation issue. Many compilers that I have used have an option to share or merge duplicate string literals. Allowing duplicate string literals speeds up the compilation process but produces larger executables.
I believe that C/C++ does not specify how that case is handled, but in most cases the compiler would emit multiple copies of that string.
2.13.4/2: "whether all string literals are distinct (that is, are stored in nonoverlapping objects) is implementation-defined".
This permits the optimisation you're asking about.
As an aside, there may be a slight ambiguity, at least locally within that section of the standard. The definition of string literal doesn't quite make clear to me whether the following code uses one string literal twice, or two string literals once each:
const char *a = "";
const char *b = "";
But the next paragraph says "In translation phase 6 adjacent narrow string literals are concatenated". Unless it means to say that something can be adjacent to itself, I think the intention is pretty clear that this code uses two string literals, which are concatenated in phase 6. So it's not one string literal twice:
const char *c = "a" "a";
Still, if you did read that "a" and "a" are the same string literal, then the standard requires the optimisation you're talking about. But I don't think they are the same literal, I think they're different literals that happen to consist of the same characters. This is perhaps made clear elsewhere in the standard, for instance in the general information on grammar and parsing.
Whether it's made clear or not, many compiler-writers have interpreted the standard the way I think it is, so I might as well be right ;-)
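The concatenation of adjacent literals mentioned above can be checked directly. This is a small sketch with made-up names. It shows that "a" "a" ends up as the single string "aa":

```cpp
#include <cstring>

// Adjacent literals are concatenated during translation,
// so this points at the three-element array {'a', 'a', '\0'}.
constexpr const char* concatenated = "a" "a";

static_assert(sizeof("a" "a") == 3, "two characters plus the terminator");

// Hypothetical helper: compares the result with the expected text.
bool is_double_a() {
    return std::strcmp(concatenated, "aa") == 0;
}
```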