Function returning std::string returns zero - c++

In maintaining a large legacy code base I came across this function which serves as an accessor to an XML tree.
std::string DvCfgNode::getStringValue() const
{
xmlChar *val = xmlNodeGetContent(mpNode);
if (val == 0)
return 0;
std::string value = (const char *)val;
xmlFree(val);
return value;
}
How can this function return '0'? In C you could return a zero pointer as char * indicating no data was found, but is this possible in C++ with std::string? I wouldn't think so, but I am not knowledgeable enough about C++.
The (10 year old) code compiles and runs under C++ 98.
EDIT I
Thanks to everyone for the comments. To answer a few questions:
a) Code is actually 18 years old and about, umm, 500K-1M lines (?)
b) exception handling is turned on, but there are no try-catch blocks anywhere except for a few in main(), which result in immediate program termination.
c) Poor design in the calling code which seems to "trust" getStringValue() to return a proper value, so lots of something like:
std::string s = pTheNode->getStringValue()
Probably just lucky it never returned zero (or if it did, nobody found the bug until now).

Your intuition about the "zero pointer as char*" is correct. What is happening is that 0 is being interpreted as the null pointer, resulting in the returned string being initialized from a const char* null pointer.
However, that is undefined behaviour in C++. The std::string(const char*) constructor requires a pointer to a null-terminated string. So you have found a bug. The fix really depends on the requirements of the function, but throwing an exception would be an improvement over undefined behaviour*.
* That is a massive understatement. Code should not have undefined behaviour

It depends on how you want to signal that there was no data. If no data means that there is an empty string value in the xml tree you can just return an empty string.
In case you want to model e.g. that there is no data item and thus no data in the tree, you have several options depending on your data semantics.
If the data is mandatory and shall be present, you have an object with a violated invariant, i.e. an object in an illegal state. Using that object for anything is illegal. I would either std::terminate the program (or use some other termination mechanism that is suitable, e.g. an error reporter) or throw something that is guaranteed not to be caught and handled.
If the data is optional you can return something that models this. In C, you would probably go with a pointer to an object which can be null, but this introduces ownership issues. In C++, you can return an std::optional<std::string> which exactly describes this.
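A minimal sketch of the optional-returning variant, assuming C++17 is available (the original code base is C++98, so this is not a drop-in change). To stay self-contained it does not call libxml2: toOptionalString is a hypothetical helper that stands in for the conversion step inside getStringValue().

```cpp
#include <optional>
#include <string>

// Hypothetical helper: wraps a possibly null C string in an optional.
// A null pointer becomes std::nullopt instead of reaching the
// std::string(const char*) constructor, which must not receive null.
std::optional<std::string> toOptionalString(const char* raw)
{
    if (raw == nullptr)
        return std::nullopt;
    return std::string(raw);
}
```

A caller that is happy to treat "no data" as an empty string could then write toOptionalString(p).value_or(""), while a caller that needs to distinguish the two cases checks has_value() first.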

Related

Segmentation fault during a function call from a non-reference std::vector parameter [closed]

At a certain point in my program, I am getting a segmentation fault at a line like the one in g() below:
// a data member type in MyType_t below
enum class DataType {
BYTE, SHORT, INT, VCHAR, FLOAT, DOUBLE, BOOL, UNKNOWN
};
// a data member type in MyType_t below
enum class TriStateBool {
crNULL, crTRUE, crFALSE
};
// this is the underlying type for std::vector used when
// calling function f() from g(), which causes the segmentation fault
struct MyType_t {
DataType Type;
bool isNull;
std::string sVal;
union {
TriStateBool bVal;
unsigned int uVal;
int nVal;
double dVal;
};
// ctors
explicit MyType_t(DataType type = DataType::UNKNOWN) :
Type{type}, isNull{true}, dVal{0}
{ }
explicit MyType_t(std::string sVal_) :
Type{DataType::VCHAR}, isNull{sVal.empty()}, sVal{sVal_}, dVal{0}
{ }
explicit MyType_t(const char* psz) :
Type{DataType::VCHAR}, isNull{psz ? true : false}, sVal{psz}, dVal{0}
{ }
explicit MyType_t(TriStateBool bVal_) :
Type{DataType::BOOL}, isNull{false}, bVal{bVal_}
{ }
explicit MyType_t(bool bVal_) :
Type{DataType::BOOL}, isNull{false}, bVal{bVal_ ? TriStateBool::crTRUE : TriStateBool::crFALSE}
{ }
MyType_t(double dVal_, DataType type, bool bAllowTruncate = false) {
// sets data members in a switch-block, no function calls
//...
}
};
void f(std::vector<MyType_t> v) { /* intentionally empty */ } // parameter by-val
void g(const std::vector<MyType_t>& v) {
//...
f(v); // <-- this line raises segmentation fault!!!
// (works fine if parameter of f() is by-ref instead of by-val)
//...
}
When I inspect the debugger, I see that it originates from the sVal data member of MyType_t.
The code above does not reproduce the problem, so I don't expect any specific answers. I do however appreciate suggestions to get me closer to finding the source of it.
Here is a bit of context:
I have a logical expression parser. The expressions can contain arithmetical expressions, late-bind variables and function calls. Given such an expression, parser creates a tree with nodes of several types. Parse stage is always successful. I have the problem in evaluation stage.
Evaluation stage is destructive to the tree structure, so its nodes are "restored" using a backup copy before each evaluation. The 1st evaluation is also always successful. I'm having the problem in the following evaluations, with certain expressions. Even then, the error is not consistent.
Please assume that I have no memory leaks and double releases. (I am using a global new() overload to track allocations/deallocations.)
Any ideas on how I should tackle this?
f(v); // <-- this line raises segmentation fault!!!
There are only two ways this line could trigger SIGSEGV:
You've run out of stack (this is somewhat common for deeply recursive procedures). You can do ulimit -s unlimited, and see if the problem goes away.
You have corrupted v and so its copy constructor triggers SIGSEGV.
Please assume that I have no memory leaks and double releases.
There are many more ways to corrupt heap (e.g. by overflowing heap block) than double release. It is very likely that (if you do not have stack overflow) valgrind or AddressSanitizer will point you straight at the problem.
When I inspect the debugger, I see that it [the segfault] originates from the sVal data member of MyType_t.
Well, now that you've actually added most of the relevant code, we see why asking for help with no/insufficient code, because you just know the rest is A-OK, is a terrible idea:
Problem 1
explicit MyType_t(std::string sVal_) :
Type{DataType::VCHAR}, isNull{sVal.empty()}, sVal{sVal_}, dVal{0}
{ }
AWOOGA! Your initialiser for isNull is checking whether the member, sVal, is empty(), before you have initialised sVal from the parameter, sVal_. (N.B. Although they're the same in this case, the real order of initialisation is that of declaration in the class, not that of the initialiser list.) You might think isNull will always be set true as it's always checking the initially default-constructed, empty member sVal, rather than the input sVal_...
But it's worse than that! The member isn't default-constructed first: each member is initialised exactly once, by its own initialiser, and sVal's has not run yet. So, at the point of calling empty(), you are acting on an uninitialised object, and that's undefined behaviour.
(Aside: This vindicates my preferred naming convention, m_sVal. It's much more difficult to accidentally type 2 characters than to forget a trailing underscore. But if anything, people normally include the trailing underscore in the member, not the argument. Anyway!)
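As a hedged sketch of the fix for this first problem, the initialiser can test the parameter instead of the member. StringValue is a hypothetical, cut-down stand-in for MyType_t that keeps only the two members involved; the std::move is an optional extra, not something the original code does.

```cpp
#include <string>
#include <utility>

// Cut-down, hypothetical version of MyType_t with only the members
// relevant to the first problem.
struct StringValue {
    bool isNull;
    std::string sVal;

    // isNull is computed from the parameter, which is fully constructed,
    // not from the member sVal, which is initialised after isNull.
    explicit StringValue(std::string sVal_) :
        isNull{sVal_.empty()}, sVal{std::move(sVal_)}
    { }
};
```

Because isNull is declared before sVal, sVal_.empty() is evaluated before the parameter is moved from, so the order of the two initialisers matters here as well.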
But wait!
Problem 2
While your first constructor for the VCHAR variant of MyType_t guarantees UB, your second strongly invites it:
explicit MyType_t(const char* psz) :
Type{DataType::VCHAR}, isNull{psz ? true : false}, sVal{psz}, dVal{0}
{ }
In this version, you allow the possibility that sVal will be constructed from a null pointer. But let's have a look at http://en.cppreference.com/w/cpp/string/basic_string/basic_string - where the std::string constructors taking char const *, (4) and (5) with the latter being the one called here, are annotated as follows:
5) Constructs the string with the contents initialized with a copy of the null-terminated character string pointed to by s. The length of the string is determined by the first null character. The behavior is undefined if s does not point at an array of at least Traits::length(s)+1 elements of CharT, including the case when s is a null pointer.
This is formalised in the C++11 Standard at 21.4.2.9: https://stackoverflow.com/a/10771938/2757035
It's worth noting that GCC's libstdc++ as used by g++ does throw an exception when passed a nullptr: "what(): basic_string::_S_construct null not valid". I might infer from your mention of gdb that you're using g++ - in which case, this probably isn't your problem, unless you're using an older version than me. Still, the mere possibility should still be avoided. So, guard your initialisation as follows:
sVal{psz ? psz : ""}
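Put into a complete constructor, the guard might look like the sketch below. CStringValue is again a hypothetical, cut-down stand-in for MyType_t. The sketch also assumes that isNull is meant to be true for a null pointer; the original's psz ? true : false appears to have that test inverted, but only the author can confirm the intent.

```cpp
#include <string>

// Cut-down, hypothetical version of MyType_t for the const char* case.
struct CStringValue {
    bool isNull;
    std::string sVal;

    // A null pointer never reaches std::string's constructor: it is
    // replaced by an empty literal, and recorded in isNull instead.
    explicit CStringValue(const char* psz) :
        isNull{psz == nullptr}, sVal{psz ? psz : ""}
    { }
};
```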
Result
From this point on, due to UB, anything can happen - at any point in your program. But a common 'default behaviour' for UB is a segfault.
With the constructor taking std::string, even if the invalid call to .empty() completes, isNull won't just get a 'random' value based on the uninitialised .size(): rather, because it's constructed from UB, isNull is undefined, and tests of it might be removed or inlined as some arbitrary constant, potentially leading to wrong code paths being taken, or right ones not. So although sVal gets a valid value eventually, isNull doesn't.
With the constructor taking char const *, your member std::string sVal itself will be followed everywhere by UB if the passed pointer is null. If so, isNull would be OK, but any attempts to use sVal would be undefined.
In both cases, any number of unknowable things might happen, because UB is involved. If you want to know exactly what's happened in this case, disassemble your program.
The reason for the segfault not occurring when the vector is passed by reference is nebulous and of little discursive value; in either case, you're reading or copy-constructing from MyType_t elements whose construction involved UB in one way or another.
Solution
The point is that you need to fix both of these erroneous, UB-generating pieces of code and determine whether it resolves the error. It's very likely that it will because what the first and possibly second constructor are currently doing is so UB that, in a practical sense, your program is nearly guaranteed to crash somewhere.
You must now comb over your program for any such coding errors and eliminate every single one, or they will catch you out eventually, "on a long enough timeline".

returning const char* to char* and then changing the data

I am confused about the following code:
string _str = "SDFDFSD";
char* pStr = (char*)_str.data();
for (int i = 0; i < iSize; i++)
pStr[i] = ::tolower(pStr[i]);
Here _str.data() returns const char*, but we are assigning it to a char*. My question is:
_str.data() returns a pointer to constant data. How is it possible to store it in a pointer to non-constant data? The data was constant, right? If we assign it to a char pointer, then we can change it, as we are doing inside the for statement, which should not be possible for constant data.
Don't do that. It may be fine in this case, but as the documentation for data() says:
The pointer returned may be invalidated by further calls to other
member functions that modify the object.
A program shall not alter any of the characters in this sequence.
So you could very accidentally write to invalid memory if you keep that pointer around. Or, in fact, ruin the implementation of std::string. I would almost go as far as to say that this function shouldn't be exposed.
std::string offers a non-const operator[] for that purpose.
string _str = "SDFDFSD";
for (int i = 0; i < iSize; i++)
_str[i] = ::tolower(_str[i]);
What you are doing is not valid at the standard library level (you're violating std::string contract) but valid at the C++ core language level.
The char * returned from data should not be written to because for example it could be in theory(*) shared between different strings with the same value.
If you want to modify a string just use std::string::operator[] that will inform the object of the intention and will take care of creating a private buffer for the specific instance in case the string was originally shared instead.
Technically you are allowed to cast away const-ness from a pointer or a reference, but whether it's a valid operation depends on the semantics of the specific case. The reason the operation is allowed is that the main philosophy of C++ is that programmers make no mistakes and know what they are doing. For example, it is technically legal from a C++ language point of view to do memcpy(&x, "hello", 5) where x is a class instance, but the result is most probably "undefined behavior".
If you think that your code "works", it's because you have the wrong understanding of what "works" really should mean (hint: "works" doesn't mean that someone once observed the code doing what seemed reasonable, but that it will work in all cases). A valid C++ implementation is free to do anything it wants if you run that program: that you observed something you think is fine doesn't really mean anything; maybe you didn't look closely enough, or maybe you were just lucky (unfortunate, actually) that no crash happened right away.
(*) In modern times the COW (copy-on-write) implementations of std::string are low in popularity because they pose a lot of problems (e.g. with multithreading) and memory is a lot cheaper now. Still std::string contract says you're not allowed to change the memory pointed by the return value of data(); if you do anything may happen.
You must never change the data returned from std::string::data() or std::string::c_str() directly.
To create a copy of a std::string:
std::string str1 = "test";
std::string str2 = str1; // copy.
Change characters in a string:
std::string str1 = "test";
str1[0] = 'T';
The "correct" way would be to use std::transform instead:
std::transform(_str.begin(), _str.end(), _str.begin(), ::tolower);
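A slightly more defensive sketch of the same idea: std::tolower expects a value representable as unsigned char (or EOF), so the lambda below converts each character first. The function name toLower is an assumption made for this example.

```cpp
#include <algorithm>
#include <cctype>
#include <string>

// Returns a lower-cased copy of text. The string is modified through its
// own iterators, so no const is cast away anywhere.
std::string toLower(std::string text)
{
    std::transform(text.begin(), text.end(), text.begin(),
                   [](unsigned char ch) {
                       return static_cast<char>(std::tolower(ch));
                   });
    return text;
}
```

Taking the parameter by value leaves the caller's string untouched; to modify in place, the same std::transform call can be applied directly to the original string.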
The simple answer to your question is that in C++ you can cast away the 'const' of a variable.
You probably shouldn't though.
See this for const correctness in C++
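The string's character buffer is ordinary writable memory owned by the string (usually on the heap, though short strings may live inside the object itself), so this is not actually const data; it is just marked so (in the signature of data()) to prevent modification.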
But nothing is impossible in C++, so with a simple cast, though unsafe, you can now treat the same memory space as modifiable.
Constants in a C/C++ program (like "SDFDFSD" below) are typically stored in a separate section such as .rodata. This section is mapped as read-only when the binary is loaded into memory during execution.
int main()
{
char* ptr = "SDFDFSD";
ptr[0]='x'; //segmentation fault!!
return 0;
}
Hence any attempt to modify the data at that location will result in a run-time error i.e. a segmentation fault.
Coming to the above question, when creating a string and assigning a string to it, a new copy in memory now exists (memory used to hold the properties of the string object _str). This is on the heap and NOT mapped to a read-only section. The member function _str.data() points to the location in memory which is mapped read/write.
The const qualifier on the return type ensures that the returned pointer is NOT accidentally passed to string manipulation functions which expect a non-const char* pointer.
In your current iteration there was no limitation on the memory location itself that was holding the string object's data; i.e. it was mapped with both read/write permissions. Hence modifying the location using another non-const pointer worked i.e. pStr[i] on the left hand side of an assignment did NOT result in a run-time error as there were no inherent restrictions on the memory location itself.
Again, this is NOT guaranteed to work; it is just implementation-specific behaviour that you have observed (i.e. it simply happens to work for you), and you cannot depend on it.

Can I safely create references to possibly invalid memory as long as I don't use it?

I want to parse UTF-8 in C++. When parsing a new character, I don't know in advance if it is an ASCII byte or the leader of a multibyte character, and also I don't know if my input string is sufficiently long to contain the remaining characters.
For simplicity, I'd like to name the four next bytes a, b, c and d, and because I am in C++, I want to do it using references.
Is it valid to define those references at the beginning of a function as long as I don't access them before I know that access is safe? Example:
void parse_utf8_character(const string s) {
for (size_t i = 0; i < s.size();) {
const char &a = s[i];
const char &b = s[i + 1];
const char &c = s[i + 2];
const char &d = s[i + 3];
if (is_ascii(a)) {
i += 1;
do_something_only_with(a);
} else if (is_twobyte_leader(a)) {
i += 2;
if (is_safe_to_access_b()) {
do_something_only_with(a, b);
}
}
...
}
}
The above example shows what I want to do semantically. It doesn't illustrate why I want to do this, but obviously real code will be more involved, so defining b,c,d only when I know that access is safe and I need them would be too verbose.
There are three takes on this:
Formally
well, who knows. I could find out for you by using quite some time on it, but then, so could you. Or any reader. And it's not like that's very practically useful.
EDIT: OK, looking it up, since you don't seem happy about me mentioning the formal without looking it up for you. Formally you're out of luck:
N3280 (C++11) §5.7/5 “If both the pointer operand and the result point to elements of the same array object, or one past the last element of the array object, the evaluation shall not produce an overflow; otherwise, the behavior is undefined.”
Two situations where this can produce undesired behavior: (1) computing an address beyond the end of a segment, and (2) computing an address beyond an array that the compiler knows the size of, with debug checks enabled.
Technically
you're probably OK as long as you avoid any lvalue-to-rvalue conversion, because if the references are implemented as pointers, then it's as safe as pointers, and if the compiler chooses to implement them as aliases, well, that's also ok.
Economically
relying needlessly on a subtlety wastes your time, and then also the time of others dealing with the code. So, not a good idea. Instead, declare the names when it's guaranteed that what they refer to, exists.
Before going into the legality of references to inaccessible memory, you have another problem in your code. Your call to s[i+x] might call string::operator[] with a parameter greater than s.size(). The C++11 standard says about string::operator[] ([string.access], §21.4.5):
Requires: pos <= size().
Returns: *(begin()+pos) if pos < size(), otherwise a reference to an object of type T with value charT(); the referenced value shall not be modified.
This means that calling s[x] for x > s.size() is undefined behaviour, so the implementation could very well terminate your program, e.g. by means of an assertion, for that.
Since string is now guaranteed to be contiguous, you could work around that problem using &s[i]+x to get an address. In practice this will probably work.
However, strictly speaking doing this is still illegal unfortunately. The reason for this is that the standard allows pointer arithmetic only as long as the pointer stays inside the same array, or one past the end of the array. The relevant part of the (C++11) standard is in [expr.add], §5.7/5:
If both the pointer operand and the result point to elements of the same array object, or one past the last element of the array object, the evaluation shall not produce an overflow; otherwise, the behavior is undefined.
Therefore generating references or pointers to invalid memory locations might work on most implementations, but it is technically undefined behaviour, even if you never dereference the pointer/use the reference. Relying on UB is almost never a good idea, because even if it works for all targeted systems, there are no guarantees about it continuing to work in the future.
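One way to stay within the rules is to work out how many bytes the lead byte announces, compare that with what is left in the string, and only then name the trailing bytes. The sketch below does just that; sequence_length and count_complete_sequences are hypothetical names, and the code deliberately checks lengths only, not the validity of continuation bytes.

```cpp
#include <cstddef>
#include <string>

// Length of a UTF-8 sequence as announced by its lead byte, or 0 if the
// byte cannot start a sequence. (Continuation bytes are not validated.)
std::size_t sequence_length(unsigned char lead)
{
    if (lead < 0x80) return 1;            // 0xxxxxxx: ASCII
    if ((lead & 0xE0) == 0xC0) return 2;  // 110xxxxx
    if ((lead & 0xF0) == 0xE0) return 3;  // 1110xxxx
    if ((lead & 0xF8) == 0xF0) return 4;  // 11110xxx
    return 0;
}

// Counts the sequences that fit completely inside s. Trailing bytes are
// only named after the bounds check has succeeded.
std::size_t count_complete_sequences(const std::string& s)
{
    std::size_t count = 0;
    for (std::size_t i = 0; i < s.size();) {
        const std::size_t len =
            sequence_length(static_cast<unsigned char>(s[i]));
        if (len == 0 || len > s.size() - i)
            break;  // invalid lead byte or truncated input
        if (len >= 2) {
            const char& b = s[i + 1];  // safe: i + 1 < s.size()
            (void)b;                   // real code would decode here
        }
        ++count;
        i += len;
    }
    return count;
}
```

The references still give the bytes short names, but each one is created inside the branch where its index is known to be in range.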
In principle, the idea of taking a reference for a possibly illegal memory address is itself perfectly legal. The reference is only a pointer under the hood, and pointer arithmetic is legal until dereferencing occurs.
EDIT: This claim is a practical one, not one covered by the published standard. There are many corners of the published standard which are formally undefined behaviour, but don't produce any kind of unexpected behaviour in practice.
Take for example the possibility of computing a pointer to the second item after the end of an array (as @DanielTrebbien suggests). The standard says overflow may result in undefined behaviour. In practice, the overflow would only occur if the upper end of the array is just short of the space addressable by a pointer. Not a likely scenario. Even if it does happen, nothing bad would happen on most architectures. What is violated are certain guarantees about pointer differences, which don't apply here.
@JoSo If you were working with a character array, you can avoid some of the uncertainty about reference semantics by replacing the const-references with const-pointers in your code. That way you can be certain no compiler will alias the values.

C++ const cast, unsure if this is secure

It may seem to be a silly question, but I really need to clarify this:
Will this bring any danger to my program?
Is the const_cast even needed?
If I change the input pointer's values in place, will it work safely with std::string or will it create undefined behaviour?
So far the only concern is that this could affect the string "some_text" whenever I modify the input pointer and make it unusable.
std::string some_text = "Text with some input";
char * input = const_cast<char*>(some_text.c_str());
Thanks for giving me some hints; I would like to avoid shooting myself in the foot.
As an example of evil behavior: the interaction with gcc's Copy On Write implementation.
#include <string>
#include <iostream>
int main() {
std::string const original = "Hello, World!";
std::string copy = original;
char* c = const_cast<char*>(copy.c_str());
c[0] = 'J';
std::cout << original << "\n";
}
In action at ideone.
Jello, World!
The issue? As the name implies, gcc's implementation of std::string uses a ref-counted shared buffer under the covers. When a string is modified, the implementation will neatly check if the buffer is shared at the moment, and if it is, copy it before modifying it, ensuring that other strings sharing this buffer are not affected by the new write (thus the name, copy on write).
Now, with your evil program, you access the shared buffer via a const-method (promising not to modify anything), but you do modify it!
Note that with MSVC's implementation, which does not use Copy On Write, the behavior would be different ("Hello, World!" would be correctly printed).
This is exactly the essence of Undefined Behavior.
Modifying an inherently const object by casting away its constness using const_cast is Undefined Behavior.
string::c_str() returns a const char *, i.e. a pointer to a constant C-style string. Technically, modifying this will result in Undefined Behavior.
Note that the use of const_cast is for when you have a const pointer to non-const data and you wish to modify the non-constant data.
Simply casting will not bring forth undefined behavior. Modifying the data pointed at, however, will. (Also see ISO 14882:98 5.2.11-7).
If you want a pointer to modifiable data, you can have a
std::vector<char> wtf(str.begin(), str.end());
char* lol= &wtf[0];
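A slightly fuller sketch of that approach, with a terminating '\0' appended so the buffer can also be handed to C APIs that expect a null-terminated string. The function name writableCopy is an assumption made for this example.

```cpp
#include <string>
#include <vector>

// Copies the characters of str into a separately owned, writable buffer
// and appends the terminating '\0' that C APIs expect.
std::vector<char> writableCopy(const std::string& str)
{
    std::vector<char> buffer(str.begin(), str.end());
    buffer.push_back('\0');
    return buffer;
}
```

Writing through buffer.data() changes only the copy; the original std::string is never touched, so no const_cast is needed.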
std::string manages its own memory internally, which is why it returns a pointer to that memory directly, as it does with the c_str() function. The pointer is to const so that your compiler will complain if you try to modify it.
Using const_cast in that way literally casts away such safety and is only an arguably acceptable practice if you are absolutely sure that memory will not be modified.
If you can't guarantee this then you must copy the string and use the copy; it's certainly a lot safer to do this in any event (you can use strcpy).
See the C++ reference website:
const char* c_str ( ) const;
"Generates a null-terminated sequence of characters (c-string) with the same content as the string object and returns it as a pointer to an array of characters.
A terminating null character is automatically appended.
The returned array points to an internal location with the required storage space for this sequence of characters plus its terminating null-character, but the values in this array should not be modified in the program and are only guaranteed to remain unchanged until the next call to a non-constant member function of the string object."
Yes, it will bring danger, because
input points to whatever c_str happens to be right now, but if some_text ever changes or goes away, you'll be left with a pointer that points to garbage. The value of c_str is guaranteed to be valid only as long as the string doesn't change. And even, formally, only if you don't call c_str() on other strings too.
Why do you need to cast away the const? You're not planning on writing to *input, are you? That is a no-no!
This is a very bad thing to do. Check out what std::string::c_str() does and agree with me.
Second, consider why you want a non-const access to the internals of the std::string. Apparently you want to modify the contents, because otherwise you would use a const char pointer. Also you are concerned that you don't want to change the original string. Why not write
std::string input( some_text );
Then you have a std::string that you can mess with without affecting the original, and you have std::string functionality instead of having to work with a raw C++ pointer...
Another spin on this is that it makes code extremely difficult to maintain. Case in point: a few years ago I had to refactor some code containing long functions. The author had written the function signatures to accept const parameters but then was const_casting them within the function to remove the constness. This broke the implied guarantee given by the function and made it very difficult to know whether the parameter has changed or not within the rest of the body of the code.
In short, if you have control over the string and you think you'll need to change it, make it non-const in the first place. If you don't then you'll have to take a copy and work with that.
it is UB.
For example, you can do something like this:
size_t const size = (sizeof(int) == 4 ? 1024 : 2048);
int arr[size];
without any cast and the compiler will not report an error. But this code is illegal.
The moral is that you need to consider each action every time.

Deprecated conversion from string const. to wchar_t*

Hello, I have a pump class that requires using a member variable that is a pointer to a wchar_t array containing the port address, i.e. "com9".
The problem is that when I initialise this variable in the constructor my compiler flags up a deprecated conversion warning.
pump::pump(){
this->portNumber = L"com9";}
This works fine but the warning every time I compile is annoying and makes me feel like I'm doing something wrong.
I tried creating an array and then setting the member variable like this:
pump::pump(){
wchar_t port[] = L"com9";
this->portNumber = port;}
But for some reason this makes my portNumber point at 'F'.
Clearly another conceptual problem on my part.
Thanks for help with my noobish questions.
EDIT:
As requested, the definition of portNumber was:
class pump
{
private:
wchar_t* portNumber;
};
Thanks to answers it has now been changed to:
class pump
{
private:
const wchar_t* portNumber;
};
If portNumber is a wchar_t*, it should be a const wchar_t*.
String literals are immutable, so the elements are const. There exists a deprecated conversion from string literal to non-const pointer, but that's dangerous. Make the change so you're keeping type safety and not using the unsafe conversion.
The second one fails because you point to the contents of a local variable. When the constructor finishes, the variable goes away and you're pointing at an invalid location. Using it results in undefined behavior.
Lastly, use an initialization list:
pump::pump() :
portNumber(L"com9")
{}
The initialization list is to initialize, the constructor is to finish construction. (Also, this-> is ugly to almost all C++ people; it's not nice and redundant.)
Use const wchar_t* to point at a literal.
The reason the conversion exists is because it has been valid from early versions of C to assign a string literal to a non-const pointer[*]. The reason it's deprecated is that it's invalid to modify a literal, and it's risky to use a non-const pointer to refer to something that must not be modified.
[*] C didn't originally have const. When const was added, clearly it should apply to string literals, but there was already code out there, written before const existed, that would break if suddenly you had to sprinkle const everywhere. We're still paying today for that breaking change to the language. Since it's C++ you're using, it wasn't even a breaking change to this language.
Apparently, portNumber is a wchar_t * (non-const), correct? If so:
the first one is wrong, because string literals are read-only (they are arrays of const characters, usually stored in the string table of the executable, which is mapped in memory somewhere, often in a read-only page).
The ugly, implicit conversion to non-const chars/wchar_ts was approved, IIRC, to achieve compatibility with old code written when const didn't even exist; sadly, it let a lot of morons who do not know what const correctness means get away with writing code that asks for non-const pointers even when const pointers would be the right choice.
The second one is wrong because you're making portNumber point to a variable allocated on the stack, which is deleted when the constructor returns. After the constructor returns, the pointer stored in portNumber points to random garbage.
The correct approach is to declare portNumber as const wchar_t * if it doesn't need to be modified. If, instead, it does need to be modified during the lifetime of the class, usually the best approach is to avoid C-style strings at all and just throw in a std::wstring, that will take care of all the bookkeeping associated with the string.
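A minimal sketch of the std::wstring variant, assuming the port name may need to change later. The accessor port() is hypothetical and only there so the example can be exercised; the question does not show the rest of the class.

```cpp
#include <string>

class pump
{
public:
    // The wstring copies the literal, so nothing points at a temporary
    // or at a local array once the constructor returns.
    pump() : portNumber(L"com9") {}

    // Hypothetical accessor, added for illustration only.
    const std::wstring& port() const { return portNumber; }

private:
    std::wstring portNumber;
};
```

Where an API needs a const wchar_t*, port().c_str() provides one that stays valid as long as the pump object exists and portNumber is not modified.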