Pointer-to-char-array assignments in C/C++

In the past I've been using Visual Studio 2010/2013/2015, and this syntax was possible:
char* szString = "This works!";
I've decided to move on and change my coding lifestyle towards Linux: I have installed g++ and use SlickEdit as my IDE.
It seems that this statement doesn't work anymore.
Can anyone please state why?
This works however :
char strString[] = "This works!";
The error mentions C++11.
Does anyone have an idea why this happens? Not how to fix it, because in my workspace there isn't any way to install a C++11 compiler; I'm just curious whether it has to do with something in the background of how the compiler works.
What I do know about the first line of code is that it creates a constant variable on the stack and creates a new pointer set to that stack address (ESP's value), while the second one counts the number of letters in the constant variable and then sets a null terminator at the end.
Oh, and one more thing: there also seems to be a difference in how the first one is typed in GCC/G++, as the first type is {char*&} and the second one is {char(*)[12]}. Any explanation on that too? Thanks!

When you write the literal "text", that text will be included somewhere in the compiled program image. The program image is usually placed in write-protected memory on modern operating systems for security reasons.
char* someString = "text"; declares a pointer to the string in your probably-write-protected program image. The ability to declare this pointer as non-const was a deprecated feature retained until C++11 to maintain source compatibility with C. Note that even though someString is not a pointer-to-const, any attempt to modify the value it points to still results in undefined behavior. This backwards-compatibility feature was removed in C++11, so someString must now be declared as a pointer-to-const: const char* someString = "text";. You can cast the const away, but attempting to write through the resulting pointer is still undefined behavior, the same as for any const object you cast to non-const.
char someString[] = "text"; works differently. This copies the string "text" from your program's code memory into an array located in data memory. It's similar to
char someString[5];
strcpy(someString, "text");
Since someString is an array in your program's data memory, it's fine to write to it, and it doesn't need to be const-qualified.

According to the standard: Annex C (Compatibility)
C1.1 Subclause 2.14.5:
Change: String literals made const
The type of a string literal is changed from “array of char” to “array of const char.” The type of a char16_t string literal is changed from “array of some-integer-type” to “array of const char16_t.” The type of a char32_t string literal is changed from “array of some-integer-type” to “array of const char32_t.”
The type of a wide string literal is changed from “array of wchar_t” to “array of const wchar_t.”
Rationale: This avoids calling an inappropriate overloaded function, which might expect to be able to modify its argument.
And indeed, the C++ standard 2.14.5.8 says:
Ordinary string literals and UTF-8 string literals are also referred to as narrow string literals. A narrow string literal has type “array of n const char”, where n is the size of the string as defined below, and has static storage duration
This also allows such strings to get various special treatment: the compiler/linker can choose to eliminate duplicates of a string across compilation units ("string pooling" in MSVC terms), and it can store them in the data section, in read-only memory.
2.14.5.12
Whether all string literals are distinct (that is, are stored in nonoverlapping objects) is implementation-defined. The effect of attempting to modify a string literal is undefined.
char* x = "hello world"; is a throwback to C++'s inheritance from early C. Visual Studio supports it because of the heavy baggage in its own libraries.

Related

What is the difference in using CStringW/CStringA and CT2W/CT2A to convert strings?

In the past, I used CT2W and CT2A to convert strings between Unicode and ANSI. Now it seems that CStringW and CStringA can also do the same task.
I write the following code snippet:
CString str = _T("Abc");
CStringW str1;
CStringA str2;
CT2W str3(str);
CT2A str4(str);
str1 = str;
str2 = str;
It seems CStringW and CStringA also perform conversions, using WideCharToMultiByte, when I assign str to them.
So, what are the advantages of using CT2W/CT2A instead of CStringW/CStringA? I have never heard of the latter pair, and Microsoft doesn't recommend the latter pair for the conversion either.
CString offers a number of conversion constructors to convert between ANSI and Unicode encoding. They are as convenient as they are dangerous, often masking bugs. MFC allows you to disable implicit conversion by defining the _CSTRING_DISABLE_NARROW_WIDE_CONVERSION preprocessor symbol (which you probably should). Conversions always involve creating a new CString object with heap-allocated storage (ignoring the short string optimization).
By contrast, the Cs2d macros (where s = source, d = destination) work on raw C-style strings; no CString instances are created in the process of converting between character encodings. A temporary buffer of 128 code units is always allocated on the stack, in addition to a heap-allocated buffer in case the conversion requires more space.
Both of the above perform a conversion with an implied ANSI codepage (either CP_THREAD_ACP or CP_ACP in case the _CONVERSION_DONT_USE_THREAD_LOCALE preprocessor symbol is defined). CP_ACP is particularly troublesome, as it's a process-global setting, that any thread can change at any time.
Which one should you choose for your conversions? Neither of the above. Use the EX versions instead (see string and text classes for a full list). Those are implemented as class templates that give you the control you need to reliably perform your conversions. The template non-type parameter lets you control the static buffer. More importantly, those class templates have constructors with an explicit codepage parameter, so you can perform the conversion you want (including from and to UTF-8) and document your intent in code.

MISRA C++ 2008 Rule 5-2-7 violation: An object with pointer type shall not be converted to an unrelated pointer type, either directly or indirectly

In the following example:
void bad_function()
{
char_t * ptr = 0;
// MISRA doesn't complain here; it allows a cast of char* to a void* pointer
void* p2 = ptr;
// the following 2 MISRA violations are reported in each of the casts below (two per code line)
// (1) Event misra_violation: [Required] MISRA C++-2008 Rule 5-2-7 violation: An object with pointer type shall not be converted to an unrelated pointer type, either directly or indirectly
// (1) Event misra_violation: [Required] MISRA C++-2008 Rule 5-2-8 violation: An object with integer type or pointer to void type shall not be converted to an object with pointer type
ptr = (char_t*) (p2);
ptr = static_cast<char_t*> (p2);
ptr = reinterpret_cast<char_t*> (p2);
}
MISRA 5-2-8 and 5-2-7 violations are reported.
How can I remove this violation?
I need someone experienced with C++ static analysis to help me. I have been banging my head against these rules for a few days.
According to the MISRA C++ standard (MISRA-Cpp-2008.pdf): Rule 5-2-7 (required): An object with pointer type shall not be converted to an unrelated pointer type, either directly or indirectly.
OK, but we have a lot of code which, for example, needs to convert an address to char* and then use it with std::ifstream, whose read(char* buffer, int length) function requires casting the address to (char_t*). So how, according to the MISRA guys, can someone program in C++ without using any casts at all? The standard doesn't say HOW pointer conversion must be done instead.
In my production code my problems are in file reading operations using read with std::ifstream from files in predefined data structures:
if (file.read((char_t*)&info, (int32_t)sizeof(INFO)).gcount() != (int32_t)sizeof(INFO))
{
LOG("ERROR: Couldn't read the file info header\n");
res = GENERAL_FAILURE;
}
How is supposed to do it according to MISRA?
So are there any solutions at all?
EDIT: Peter's and Q.Q.'s answers are both correct; it seems that MISRA really wants everything done without any casts, which is hard if the project is in its final stage. There are two options:
1 - document the MISRA deviations one by one and explain why the casts are OK, and explain how this has been tested (Q.Q.'s suggestion)
2 - use a byte array of char type for file.read(), then after safely reading the file content convert the byte array into the header's contents. This must be done member by member, because casting char* to int32_t* is again a Rule 5-2-7 violation. Sometimes it is too much work.
The basic reason for the MISRA rule is that converting any pointer/address to any non-void pointer allows using that address as if it were a different object than it actually is. The compiler would complain about an implicit conversion in those cases. Using a typecast (or the C++ _cast operators) essentially stops the compiler complaining, and - in too many circumstances to count - dereferencing that pointer gives undefined behaviour.
In other words, by forcing a type conversion, you are introducing potential undefined behaviour, and turning off all possibility of the compiler alerting you to it. MISRA think that is a bad idea, notwithstanding the fact that a lot of programmers who think in terms of ease of coding think it is a good idea in some cases.
You have to realise that the philosophy of MISRA checks is less concerned with ease of programming than typical programmers are, and more concerned with preventing circumstances where undefined (or implementation-defined or unspecified, etc.) behaviours get past all checks and result in code "in the wild" that can cause harm.
The thing is, in your actual use case, you are relying on file.read() correctly populating the (presumably) data structure named info.
if (file.read((char_t*)&info, (int32_t)sizeof(INFO)).gcount() != (int32_t)sizeof(INFO))
{
LOG("ERROR: Couldn't read the file info header\n");
res = GENERAL_FAILURE;
}
What you need to do is work a bit harder to provide valid code that will pass the MISRA checker. Something like
std::streamsize size_to_read = whatever();
std::vector<char> buffer(size_to_read);
if (file.read(&buffer[0], size_to_read))  // read() returns the stream; test its state
{
// use some rules to interpret contents of buffer (i.e. a protocol) and populate info
// generally these rules will check that the data is in a valid form
// but not rely on doing any pointer type conversions
}
else
{
LOG("ERROR: Couldn't read the file info header\n");
res = GENERAL_FAILURE;
}
Yes, I realise it is more work than simply using a type conversion and allowing binary saves and reads of structs. But them's the breaks. Apart from getting past the MISRA checker, this approach has other advantages if you do it right, such as the file format being completely independent of which compiler is used to build your code. Your current code depends on implementation-defined quantities (the layout of members in a struct, the results of sizeof), so your code - if built with compiler A - may be unable to read a file generated by code built with compiler B. And one of the common themes of MISRA requirements is reducing or eliminating any code whose behaviour may be sensitive to implementation-defined quantities.
Note: you are also passing char_t* to std::istream::read() as the first argument and an int32_t as the second argument. Both are actually incorrect. The actual parameter types are char* and std::streamsize (which may be, but is not required to be, int32_t).
Converting unrelated pointers to char* is not a good practice.
But, if you have a large legacy codebase doing this frequently, you can suppress rules by adding special comments.
fread is a perfectly good C++ function for file input and it uses void*, which MISRA allows.
It is also good at reading binary data, unlike fstream, which processes all data through localized character-conversion logic (this is the "facet" on an iostream, which is configurable, but the Standard doesn't define any portable way to achieve a no-op conversion).
The C style of fopen/fclose is unfortunate in a C++ program, since you might forget to clean up your files. Luckily we have std::unique_ptr, which can add RAII functionality to an arbitrary pointer type. Use std::unique_ptr<FILE, decltype(&fclose)> and have fast, exception-safe binary file I/O in C++.
NB: A common misconception is that std::ios::binary gives binary file I/O. It does not. All it affects is newline conversion and (on some systems) end-of-file marker processing; there is no effect on facet-driven character conversion.

Why no standard-defined literal suffix for std::string?

A quick question: why doesn't C++11 offer a "user-" (really, standard library) defined literal for creating a std::string, such as
auto str = "hello world"s; // str is a std::string
Like C++, Objective-C supports both C-style strings and a more user-friendly library type NSString; however, unlike C++, nobody ever gets confused between the two, because it's trivial to create an NSString by prefixing a string literal with @. In fact, it's rare to see Objective-C code where a literal doesn't have the @ prefix. Everybody just knows that's the way to do it, uses NSString and gets on with it.
C++11 user-defined literals would allow this, and in fact the UDL section on Stroustrup's C++11 FAQ uses exactly this example. Furthermore, UDLs without a leading underscore are reserved, so there would be no problem in allowing a plain s as above -- there's no possibility of it clashing with anything else.
Perhaps I'm missing something, but it seems like this would be a very valuable and risk-free addition to the language, so does anybody know why C++11 doesn't offer it? Is it likely to appear in C++14?
User-defined literals were new enough in C++11 that this wasn't included, but it is in C++14 (in §21.7, in case anybody cares). And yes, it uses s as the suffix. The precise result type depends on the type of the string literal itself: a "narrow" literal gives a std::string, a u16 literal gives a u16string, and a u32 literal gives a u32string. Oh, and yes, if you insist on using a wide string literal it'll produce a wstring.
Note that the s suffix is also used to mean seconds (defined in <chrono>), but there's no real conflict between the two: s on a string literal means a string, and s on a number means seconds.

Why are string literals const?

It is known that in C++ string literals are immutable, and the result of modifying a string literal is undefined. For example:
char * str = "Hello!";
str[1] = 'a';
This leads to undefined behavior.
Besides that, string literals are placed in static memory, so they exist for the whole lifetime of the program. I would like to know why string literals have these properties.
There are a couple of different reasons.
One is to allow storing string literals in read-only memory (as others have already mentioned).
Another is to allow merging of string literals. If one program uses the same string literal in several different places, it's nice to allow (but not necessarily require) the compiler to merge them, so you get multiple pointers to the same memory, instead of each occupying a separate chunk of memory. This can also apply when two string literals aren't necessarily identical, but do have the same ending:
char *foo = "long string";
char *bar = "string";
In a case like this, it's possible for bar to be foo+5 (if I'd counted correctly).
In either of these cases, if you allow modifying a string literal, it could modify the other string literal that happens to have the same contents. At the same time, there's honestly not much point in mandating merging either: it's uncommon to have enough overlapping string literals that most people would want the compiler to run slower just to save (maybe) a few dozen bytes of memory.
By the time the first standard was written, there were already compilers that used all three of these techniques (and probably a few others besides). Since there was no way to describe the one behavior you'd get from modifying a string literal, and nobody apparently thought it was an important capability to support, they did the obvious: they said even attempting to do so led to undefined behavior.
It's undefined behaviour to modify a literal because the standard says so. And the standard says so to allow compilers to put literals in read only memory. And it does this for a number of reasons. One of which is to allow compilers to make the optimisation of storing only one instance of a literal that is repeated many times in the source.
I believe you ask about the reason why literals are placed in
read-only memory, not about technical details of linker doing this and
that or legal details of a standard forbidding such and such.
When modification of string literals works, it leads to subtle bugs
even in the absence of string merging (which we have reasons to
disallow if we decided to permit modification). When you see code like
char *str="Hello";
.../* some code, but str and str[...] are not modified */
printf("%s world\n", str);
it's a natural conclusion that you know what's going to be printed,
because str (and its content) were not modified in a particular
place, between initialization and use.
However, if string literals are writable, you don't know it any
more: str[0] could be overwritten later, in this code or inside a
deeply nested function call, and when the code is run again,
char *str="Hello";
won't guarantee anything about the content of str anymore. As we
expect, this initialization is implemented as moving the address known
in link time into a place for str. It does not check that str
contains "Hello" and it does not allocate a new copy of it. However,
we understand this code as resetting str to "Hello". It's hard to
overcome this natural understanding, and it's hard to reason about the
code where it is not guaranteed. When you see an expression like
x+14, what if you had to think about 14 being possibly overwritten
in other code, so it became 42? The same problem with strings.
That's the reason to disallow modification of string literals, both in
the standard (with no requirement to detect the failure early) and in
actual target platforms (providing the bonus of detecting potential
bugs).
I believe that many attempts to explain this thing suffer from the
worst kind of circular reasoning. The standard forbids writing to
literals because the compiler can merge strings, or they can be placed
in read-only memory. They are placed in read-only memory to catch the
violation of the standard. And it's valid to merge literals because
the standard forbids... is it a kind of explanation you asked for?
Let's look at other
languages. Common Lisp standard
makes modification of literals undefined behaviour, even though the
history of preceding Lisps is very different from the history of C
implementations. That's because writable literals are logically
dangerous. Language standards and memory layouts only reflect that
fact.
Python language has exactly one place where something resembling
"writing to literals" can happen: parameter default values, and this
fact confuses people all the time.
Your question is tagged C++, and I'm unsure of its current state
with respect to implicit conversion to non-const char*: if it's a
conversion, is it deprecated? I expect other answers to provide a
complete enlightenment on this point. As we talk of other languages
here, let me mention plain C. Here, string literals are not const,
and an equivalent question to ask would be why can't I modify string
literals (and people with more experience ask instead, why are
string literals non-const if I can't modify them?). However, the
reasoning above is fully applicable to C, despite this difference.
Because in K&R C, there was no such thing as "const". And similarly in pre-ANSI C++. Hence there was a lot of code which had things like char* str = "Hello!";. If the standards committee had made text literals const, all those programs would no longer have compiled. So they made a compromise. Text literals are officially const char[], but they have a silent implicit conversion to char*.
In C++, string literals are const because you aren't allowed
to modify them. In standard C, they would have been const as
well, except that when const was introduced into C, there was
so much code along the lines of char* p = "somethin"; that
making them const would have broken, that it was deemed
unacceptable. (The C++ committee chose a different solution to
this problem, with a deprecated implicit conversion which allows
the above.)
In the original C, string literals were not const, and were
mutable, and it was guaranteed that no two string literals shared
any memory. This was quickly realized to be a serious error,
allowing things like:
void
mutate(char* p)
{
static char c = 'a';
*p = c ++;
}
and in another module:
mutate( "hello" ); // Can't trust what is written, can you.
(Some early implementations of Fortran had a similar issue,
where F(4) might call F with just about any integral value.
The Fortran committee fixed this, just like the C committee
fixed string literals in C.)

Why a segmentation fault for changing a non-const char*?

With this code, I get a segmentation fault:
char* inputStr = "abcde";
*(inputStr+1)='f';
If the code was:
const char* inputStr = "abcde";
*(inputStr+1)='f';
I get a compile error about "assignment of read-only location".
However, for the first case, there is no compile error; just the segmentation fault when the assign operation actually happened.
Can anyone explain this?
Here is what the standard says about string literals in section [2.13.4/2]:
A string literal that does not begin with u, U, or L is an ordinary string literal, also referred to as a narrow string literal. An ordinary string literal has type “array of n const char”, where n is the size of the string as defined below; it has static storage duration (3.7) and is initialized with the given characters.
So, strictly speaking, "abcde" has type
const char[6]
Now what happens in your code is an implicit cast to
char*
so that the assignment is allowed. The reason why it is so is, likely, compatibility with C. Have a look also at the discussion here: http://learningcppisfun.blogspot.com/2009/07/string-literals-in-c.html
Once the cast is done, you are syntactically free to modify the literal, but it fails because the compiler stores the literal in a non-writable segment of memory, as the standard itself allows.
This gets created in the code segment:
char *a = "abcde";
Essentially it's const.
If you wish to edit it, try:
char a[] = "abcde";
The standard states that you are not allowed to modify string literals directly, regardless of whether you mark them const or not:
Whether all string literals are distinct (that is, are stored in nonoverlapping
objects) is implementation-defined. The effect of attempting to modify a string literal is undefined.
In fact, in C (unlike C++), string literals are not const but you're still not allowed to write to them.
This restriction on writing allows certain optimisations to take place, such as sharing of literals along the lines of:
char *ermsg = "invalid option";
char *okmsg = "valid option";
where okmsg can actually point to the 'v' character in ermsg, rather than being a distinct string.
String literals are typically stored in read-only memory. Trying to change this memory will kill your program.
Here's a good explanation: Is a string literal in c++ created in static memory?
It is mostly ancient history; once upon a long time ago, string literals were not constant.
However, most modern compilers place string literals into read-only memory (typically, the text segment of your program, where your code also lives), and any attempt to change a string literal will yield a core dump or equivalent.
With G++, you can most certainly get the compilation warning (-Wall if it is not enabled by default). For example, G++ 4.6.0 compiled on MacOS X 10.6.7 (but running on 10.7) yields:
$ cat xx.cpp
int main()
{
char* inputStr = "abcde";
*(inputStr+1)='f';
}
$ g++ -c xx.cpp
xx.cpp: In function ‘int main()’:
xx.cpp:3:22: warning: deprecated conversion from string constant to ‘char*’ [-Wwrite-strings]
$
So the warning is enabled by default.
What happened is that the compiler put the constant "abcde" in some read-only memory segment. You pointed your (non-const) char* inputStr at that constant, and kaboom, segfault.
Lesson to be learned: Don't invoke undefined behavior.
Edit (elaboration)
However, for the first case, there is no compile error, just segmentation fault when the assign operation actually happened.
You need to enable your compiler warnings. Always set your compiler warnings as high as possible.
Even though "abcde" is a string literal, which should not be modified, you've told the compiler that you don't care about that by having a non-const char* point to it.
The compiler will happily assume that you know what you're doing, and not throw an error. However, there's a good chance that the code will fail at runtime when you do indeed try to modify the string literal.
String literals, while officially non-const, are almost always stored in read-only memory. In your setup, this is apparently only the case if it is declared as const char array.
Note that the standard forbids you to modify any string literal.
A little bit of history of string literals, in Ritchie's words, mostly about the origin and the evolution of string literals from K&R 1.
Hope this might clarify a thing or two about const and string literals.
"From: Dennis Ritchie
Subject: Re: History question: String literals.
Date: 02 Jun 1998
Newsgroups: comp.std.c
At the time that the C89 committee was working, writable
string literals weren't "legacy code" (Margolin) and what standard
there existed (K&R 1) was quite explicit (A.2.5) that
strings were just a way of initializing a static array.
And as Barry pointed out there were some (mktemp) routines
that used this fact.
I wasn't around for the committee's deliberations on the
point, but I suspect that the BSD utility for fiddling
the assembler code to move the initialization of strings
to text instead of data, and the realization that most
literal strings were not in fact overwritten, was more
important than some very early version of gcc.
Where I think the committee might have missed something
is in failure to find a formulation that explained
the behavior of string literals in terms of const.
That is, if "abc" is an anonymous literal of type
const char [4]
then just about all of its properties (including the
ability to make read-only, and even to share its storage
with other occurrences of the same literal) are nearly
explained.
The problem with this was not only the relatively few
places that string literals were actually written on, but much
more important, working out feasible rules for assignments
to pointers-to-const, in particular for function's actual
arguments. Realistically the committee knew that whatever
rules they formulated could not require a mandatory
diagnostic for every func("string") in the existing world.
So they decided to leave "..." of ordinary char array
type, but say one was required not to write over it.
This note, BTW, isn't intended to be read as a snipe
at the formulation in C89. It is very hard to get things
both right (coherent and correct) and usable (consistent
enough, attractive enough).
Dennis
"