How to check if `strcmp` has failed? - c++

Well this question is about C and C++ as strcmp is present in both of them.
I came across this link: C library function - strcmp().
That page explains the return values of strcmp. I know that every function, however safe it is, can fail, so I assumed that even strcmp can fail at some point.
Also, I came across this question which also explained the return values of strcmp. After searching a lot, I could not find a website which explained how to check if strcmp could fail.
I first thought that it would return -1, but it turns out that it returns a number < 0 whenever the first string is smaller. So can someone tell me how to check whether strcmp has failed?
EDIT: well, I do not understand the point about strcmp not failing. There are many ways in which a function can fail. For example, one comment said that if the stack cannot grow, the call might cause a stack overflow. No program in any language is absolutely safe!

There is no defined situation where strcmp can fail.
You can trigger undefined behavior by passing an invalid argument, such as:
a null pointer
an uninitialized pointer
a pointer to memory that's been freed, or to a stack location from a function that's since been exited, or similar
a perfectly valid pointer, but with no null byte between the location that the pointer points to and the end of the allocated memory range that it points into
but in such cases strcmp makes absolutely no guarantees about its behavior — it's allowed to just start wiping your hard drive, or sending spam from your e-mail account, or whathaveyou — so it doesn't need to explicitly indicate the error. (And indeed, the only kind of invalid argument that strcmp really could detect and handle, in a typical C or C++ implementation, is the null-pointer case, which you could just as easily check before calling strcmp. So an error indicator would not be useful.)
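Since the null-pointer case is the only one a caller can cheaply rule out, a pre-check along those lines might look like the sketch below. The name checked_strcmp and the bool-plus-out-parameter shape are assumptions made for illustration, not part of any standard library:

```cpp
#include <cstring>

// Hypothetical helper: returns false (leaving `result` untouched) when either
// argument is null; otherwise stores the ordinary strcmp result and returns true.
// It cannot detect dangling or unterminated pointers; those remain undefined.
bool checked_strcmp(const char* lhs, const char* rhs, int& result) {
    if (lhs == nullptr || rhs == nullptr) {
        return false;  // the one invalid input that can actually be detected
    }
    result = std::strcmp(lhs, rhs);
    return true;
}
```

A caller would test the returned bool first and only then look at the sign of result.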

If you look up strcmp in some documentation, you will find something like
The behavior is undefined if lhs or rhs are not pointers to null-terminated strings.
This means that it can crash, or cause small red daemons to fly out of your nose, or whatever. Undefined behavior can also look exactly like what you innocently expected to happen (nothing wrong, as far as you can see), so there's no sure-fire way to recognize it. Although you'll recognize a crash.
So, you as a programmer can't check for undefined behavior after the fact.
But the compiler can add such checks for you, for many kinds of undefined behavior including this, because it's in full charge of that behavior. Whether it will do so depends on the compiler. Also, you can add checks in e.g. a wrapper function, that minimizes the chance of undefined behavior, although with non-terminated strings there's no practical way of checking that I know of that won't itself possibly invoke undefined behavior.
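One narrow case where a wrapper can help is when the caller happens to know how many bytes are readable at the pointer. Under that assumption (which the function itself cannot verify), a search for the terminator stays inside the buffer. The helper name is_terminated_within is hypothetical:

```cpp
#include <cstddef>
#include <cstring>

// Hypothetical helper: true if a null byte occurs within the first `capacity`
// bytes of `buf`. This is only safe when `capacity` really is the number of
// readable bytes at `buf`; passing a wrong capacity is itself undefined behavior.
bool is_terminated_within(const char* buf, std::size_t capacity) {
    return buf != nullptr && std::memchr(buf, '\0', capacity) != nullptr;
}
```

This does nothing for a bare pointer of unknown provenance, which is the situation the answer above describes.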

Possible failure is undefined behaviour, and may happen if you give wrong arguments (not pointers to null-terminated strings).

strcmp returns 0 if and only if both strings are properly zero-terminated and contain exactly the same characters.
If they are not exactly the same, a non-zero value is returned. If the return value is negative, the first string comes first in lexicographical ordering; if it is positive, the second string comes first in lexicographical ordering.
There is no "failure", except possible undefined behaviour, if invalid input is provided - then, anything could happen, including a program crash, or compiler generating invalid code.
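As a small illustration of reading the return value by sign only, here is a sketch that maps it onto three named outcomes. The enum Order and the function compare_c_strings are hypothetical names introduced for this example:

```cpp
#include <cstring>

enum class Order { Less, Equal, Greater };

// Hypothetical helper: only the sign of strcmp's result is meaningful,
// so the magnitude is deliberately ignored. Both arguments must be valid,
// null-terminated strings.
Order compare_c_strings(const char* lhs, const char* rhs) {
    const int r = std::strcmp(lhs, rhs);
    if (r < 0) return Order::Less;
    if (r > 0) return Order::Greater;
    return Order::Equal;
}
```

None of the three outcomes signals an error; there is no fourth "failed" value to test for.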

Related

Is accessing parts of a string after an embedded null terminator UB?

If I have an embedded null terminator [aside: is that UB?], is it well-defined for me to access the values after it?
#include <stdio.h>
const char foo[] = "abc\0def";
int main() {
printf("%s", foo+4);
return sizeof(foo);
}
For the record, it prints what you might expect:
def
An embedded null is NOT Undefined Behavior. It could be a logic error, if you work with functions that expect Strings to be null-terminated. But there's nothing wrong or evil or undefined about accessing the full breadth of an array you've successfully allocated, regardless of its contents.
One thing to observe though: if you attempt to store this data in a std::string (which is how you should handle all strings, TBH), how you store the string can be important.
std::string str1 = foo; // contents of str1 become "abc"
std::string str2 = std::string(foo, sizeof(foo)); // str2 holds all 8 bytes: "abc\0def" plus the array's terminating null
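The difference between the two constructions can be checked through the resulting sizes. The sketch below wraps each construction in a function; the function names are made up for this example, and foo is the array from the question:

```cpp
#include <string>

const char foo[] = "abc\0def";  // 8 bytes: 7 characters plus the final '\0'

// Stops at the first null byte, like any constructor taking a C string.
std::string up_to_first_null() { return std::string(foo); }

// Takes an explicit length, so embedded nulls are kept. sizeof(foo) also
// counts the terminating null, hence the result has 8 characters.
std::string whole_array() { return std::string(foo, sizeof(foo)); }
```

If the trailing null is not wanted in the std::string, passing sizeof(foo) - 1 as the length would leave it out.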
[dcl.init.string] states
An array of narrow character type (3.9.1), char16_t array, char32_t array, or wchar_t array can be initialized by a narrow string literal, char16_t string literal, char32_t string literal, or wide string literal, respectively, or by an appropriately-typed string literal enclosed in braces (2.14.5). Successive characters of the value of the string literal initialize the elements of the array.
emphasis mine
So the embedded null is not a problem; it just becomes an element of the array. Since the array is sized to contain all the characters and escape sequences, we know there are elements after that embedded null and it is safe to access those.
Really the only issue with the embedded null is that any C function is going to stop when it hits that null and won't fully process the string. You might consider using a std::string instead, which doesn't have those issues.
Accessing a C string beyond the terminating null character per se is never undefined behaviour. Still, we can end up with undefined behaviour this way, but for a totally different reason:
If the terminating null character happens to reside at the last position in the char array reserved for the string, then we access this underlying array out of its bounds if we access the string beyond its end. And this out-of-bounds-access is what really yields the undefined behaviour...
Edit:
[aside: is that UB?]
UB, undefined behaviour, is behaviour for which the standard imposes no requirements, because no meaningful behaviour can be defined for it. Relying on undefined behaviour can result in anything, including getting the expected results, but it can fail miserably at any other time (e.g. on another platform, after switching compiler version, after simply recompiling, even after just restarting one and the same program). Thus a program relying on undefined behaviour is considered not to be well defined.
Example: Dereferencing a pointer pointing to an object that has already been deleted (a "dangling pointer"), or, closer to the question, accessing an array out of bounds. The latter could result in trying to access memory not assigned to the current process or not even existing, but it could also read or (bad!!!) overwrite memory of a totally different object that happens to be located at the given address. It does not even have to be the same object every time your program runs, not even within one single program run.
Undefined behaviour is not to be confused with unspecified or implementation-defined behaviour: in those cases, the behaviour for a given input is well defined, but the choice within some given reasonable limitations is left to the compiler vendor (and, for implementation-defined behaviour, has to be documented).
Example: right shift of negative integers - it can occur with or without sign extension (so can be an arithmetic or a logical shift). Which one applies is not specified by the standard, though, but using right shift on negative integers is well defined.

Is it bad to depend on index 0 of an empty std::string?

std::string my_string = "";
char test = my_string[0];
I've noticed that this doesn't crash, and every time I've tested it, test is 0.
Can I depend on it always being 0? or is it arbitrary?
Is this bad programming?
Edit:
From some comments, I gather that there is some misunderstanding about the usefulness of this.
The purpose of this is NOT to check to see if the string is empty.
It is to not need to check whether the string is empty.
The situation is that there is a string that may or may not be empty.
I only care about the first character of this string (if it is not empty).
It seems to me, it would be less efficient to check to see if the string is empty, and then, if it isn't empty, look at the first character.
if (! my_string.empty())
test = my_string[0];
else
test = 0;
Instead, I can just look at the first character without needing to check to see if the string is empty.
test = my_string[0];
C++14
No; you can depend on it.
In 21.4.5.2 (or [string.access]) we can find:
Returns: *(begin() + pos) if pos < size(). Otherwise, returns a reference to an object of type charT with value charT(), where modifying the object leads to undefined behavior.
In other words, when pos == size() (which is true when both are 0), the operator will return a reference to a default-constructed character type which you are forbidden to modify.
It is not special-cased for the empty (or 0-sized) strings and works the same for every length.
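The guarantee quoted above can be exercised with a one-line read at index size(). This is only a sketch of the C++11-and-later rule; the function name char_at_end is hypothetical:

```cpp
#include <string>

// Reads the element at index size(). Per the wording quoted above, this yields
// a reference to charT() (i.e. '\0' for std::string). Writing through that
// reference is what leads to undefined behavior; reading is fine.
char char_at_end(const std::string& s) {
    return s[s.size()];
}
```

For an empty string, s.size() is 0, so this is exactly the my_string[0] access from the question.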
C++03
And most certainly C++98 as well.
It depends.
Here's 21.3.4.1 from the official ISO/IEC 14882:
Returns: If pos < size(), returns data()[pos]. Otherwise, if pos == size(), the const version returns charT(). Otherwise, the behavior is undefined.
@Bartek Banachewicz's answer explains which circumstances allow you to make your assumption. I would like to add that
this is bad programming.
Why? For several reasons:
You have to be a language lawyer just to be sure this isn't a bug. I wouldn't know the answer if not for this page, and frankly - I don't think you should really bother to know either.
People without the intuition of a string being a null-terminated sequence of characters will have no idea what you're trying to do until they read the standard or ask their friends.
Breaks the principle of least surprise in a bad way.
Goes against the principle of "writing what you mean", i.e. having the code express problem-domain concepts.
Sort-of-a use of a magic number (it's arguable whether 0 actually constitutes a magic number in this case).
Shall I continue? ... I'm almost certain you have an alternative superior in almost every respect. I'll even venture a guess that you've done something else that's "bad" to manipulate yourself into wanting to do this.
Always remember: Other people, who will not be consulting you, will sooner-or-later need to maintain this code. Think of them, not just of yourself, who can figure it out. Plus, in a decade from now, who's to say you're going to remember your own trick? You might be that confounded maintainer...
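One possible alternative that states the intent directly is a tiny named helper. This is only a sketch; first_char_or is a made-up name, not a standard facility:

```cpp
#include <string>

// Hypothetical helper: the first character of `s`, or `fallback` when the
// string is empty. The empty check is spelled out so a reader does not need
// to know what operator[] guarantees at index size().
char first_char_or(const std::string& s, char fallback = '\0') {
    return s.empty() ? fallback : s.front();
}
```

With this, test = first_char_or(my_string); reads as what it does, and a non-zero fallback can be chosen where 0 would be ambiguous.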

Why is it wrong to modify the contents of a pointer to a string literal?

If I write:
char *aPtr = "blue"; //would be better const char *aPtr = "blue"
aPtr[0]='A';
I get a warning. The code above may appear to work, but it isn't standard: it has undefined behavior, because the pointer refers to a string literal stored in read-only memory. The question is:
Why is it like this?
Whereas this code:
char a[]="blue";
char *aPtr=a;
aPtr[0]='A';
is OK. I want to understand what happens under the hood.
The first is a pointer to a read-only value created by the compiler and placed in a read-only section of the program. You cannot modify the characters at that address because they are read-only.
The second creates an array and copies each element from the initializer (see this answer for more details on that). You can modify the contents of the array, because it's a simple variable.
The first one works the way it does because doing anything else would require dynamically-allocating a new variable, and would require garbage collection to free it. That is not how C and C++ work.
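The distinction between the two declarations can be sketched as follows. The function name is hypothetical, and the literal is only read, never written, so nothing here is undefined:

```cpp
#include <cstring>

// Hypothetical demonstration: returns true if modifying an array initialized
// from a literal leaves the literal itself untouched.
bool array_copy_is_independent() {
    const char* literal = "blue";  // points at the literal itself; read-only
    char copy[] = "blue";          // a separate, writable 5-byte array
    copy[0] = 'A';                 // fine: modifies only the array
    return std::strcmp(copy, "Alue") == 0 && std::strcmp(literal, "blue") == 0;
}
```

Declaring the pointer as const char* lets the compiler reject an accidental write such as literal[0] = 'A' at compile time.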
The primary reason that string literals can't be modified (without undefined behavior) is to support string literal merging.
Long ago, when memory was much tighter than today, compiler authors noticed that many programs had the same string literals repeated many times--especially things like mode strings being passed to fopen (e.g., f = fopen("filename", "r");) and simple format strings being passed to printf (e.g., printf("%d\n", a);).
To save memory, they'd avoid allocating separate memory for each instance of these strings. Instead, they'd allocate one piece of memory, and point all the pointers at it.
In a few cases, they got even trickier than that, to merge literals that weren't even entirely identical. For example consider code like this:
printf("%s\t%d\n", a);
/* ... */
printf("%d\n", b);
In this case, the string literals aren't entirely identical, but the second one is identical to the tail of the first. In this case, they'd still allocate one piece of memory. One pointer would point to the beginning of the memory, and the other to the position of the %d in that same block of memory.
With a possibility (but no requirement for) string literal merging, it's essentially impossible to say what behavior you'll get when you modify a string literal. If string literals are merged, modifying one string literal might modify others that are identical, or end identically. If string literals are not merged, modifying one will have no effect on any other.
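Whether merging happens can be observed by comparing the addresses of two identical literals. In the sketch below (function names are made up), the first function may return either value depending on the compiler and its options, which is exactly why no behavior can be promised for writes:

```cpp
#include <cstring>

// May return true or false: whether identical literals share one address
// is left to the implementation.
bool identical_literals_share_storage() {
    const char* a = "shared text";
    const char* b = "shared text";
    return a == b;
}

// What is always guaranteed: the two literals have the same contents.
bool identical_literals_compare_equal() {
    return std::strcmp("shared text", "shared text") == 0;
}
```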
MMUs added another dimension: they allowed memory to be marked as read-only, so attempting to modify a string literal would result in a signal of some sort--but only if the system had an MMU (which was often optional at one time) and also depending on whether the compiler/linker decided to put the string literals in memory they'd marked constant or not.
Since they couldn't define what the behavior would be when you modified a string literal, they decided that modifying a string literal would produce undefined behavior.
The second case is entirely different. Here you've defined an array of char. It's clear that if you define two separate arrays, they're still separate, regardless of content, so modifying one can't possibly affect the other. The behavior is clear and always has been, so doing so gives defined behavior. The fact that the array in question might be initialized from a string literal doesn't change that.

What type of input check can be performed against binary data in C++?

let's say I have a function like this in C++, which I wish to publish to third parties. I want to make it so that the user will know what happened, should he/she feeds invalid data in and the library crashes.
Let's say that, if it helps, I can change the interface as well.
int doStuff(unsigned char *in_someData, int in_data_length);
Apart from application specific input validation (e.g. see if the binary begins with a known identifier etc.), what can be done? E.g. can I let the user know, if he/she passes in in_someData that has only 1 byte of data but passes in 512 as in_data_length?
Note: I already asked a similar question here, but let me ask from another angle..
It cannot be checked whether the parameter in_data_length passed to the function has the correct value. If this were possible, the parameter would be redundant and thus needless.
But a vector from the standard template library solves this:
int doStuff(const std::vector<unsigned char>& in_someData);
So, there is no possibility of a "NULL buffer" or an invalid data length parameter.
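A sketch of that interface is shown below. The body is a stand-in (it just reports the byte count), and the negative return for empty input is an assumed convention, not something from the question:

```cpp
#include <cstddef>
#include <vector>

// Hypothetical variant of doStuff: the vector carries its own length, so
// there is no separate size argument for the caller to get wrong.
int doStuff(const std::vector<unsigned char>& in_someData) {
    if (in_someData.empty()) {
        return -1;  // assumed convention: negative means "nothing to process"
    }
    // Stand-in for real processing: report how many bytes were received.
    return static_cast<int>(in_someData.size());
}
```

The caller can still put garbage bytes into the vector, but the length can no longer disagree with the buffer.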
If you would know how many bytes passed by in_someData why would you need in_data_length at all?
Actually, you can only check in_someData for NULL and in_data_length for positive value. Then return some error code if needed. If a user passed some garbage to your function, this problem is obviously not yours.
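Those two checks, with an error code per case, might be sketched like this. The enum DoStuffStatus and the name doStuffChecked are hypothetical, and the processing step is left as a placeholder:

```cpp
// Hypothetical status codes for the two conditions that can be checked.
enum class DoStuffStatus { Ok, NullBuffer, BadLength };

// Validates only what is checkable: a null pointer and a non-positive length.
// Whether in_data_length matches the real buffer size cannot be verified here.
DoStuffStatus doStuffChecked(const unsigned char* in_someData, int in_data_length) {
    if (in_someData == nullptr) return DoStuffStatus::NullBuffer;
    if (in_data_length <= 0) return DoStuffStatus::BadLength;
    // ... application-specific processing would go here ...
    return DoStuffStatus::Ok;
}
```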
In C++, the magic word you're looking for is "exception". That gives you a way to tell the caller something went wrong. You'll end up with code something like
#include <stdexcept>

int doStuff(unsigned char *inSomeData, int inDataLength) {
    // do a test
    if (inDataLength == 0)
        throw std::invalid_argument("Length can't be 0");
    // only gets here if it passed the test
    // do other good stuff
    int theResult = 0; // placeholder for the real result
    return theResult;
}
Now, there's another problem with your specific example, because there's no universal way in C or C++ to tell how long an array of primitives really is. It's all just bits, with inSomeData being the address of the first bits. Strings are a special case, because there's a general convention that a zero byte ends a string, but you can't depend on that for binary data -- a zero byte is just a zero byte.
Update
This has picked up some downvotes, apparently prompted by a comment that exception specifications had been deprecated. To be precise: dynamic exception specifications (a throw(...) clause on the function declaration) were deprecated in C++11 and removed in C++17, and C++ never had a Java-style throws clause, so the code above simply omits the specification and throws an exception object by value.
Also note that the original questioner says "I want to make it so that the user will know what happened, should he/she feeds [sic] invalid data in and the library crashes." Thus the question is not just what can I do to validate the input data (answer: not much unless you know more about the inputs than was stated), but then how do I tell the caller they screwed up? And the answer to that is "use the exception mechanism" which has certainly not been deprecated.

Why is strncpy insecure?

I am looking to find out why strncpy is considered insecure. Does anybody have any sort of documentation on this or examples of an exploit using it?
Take a look at this site; it's a fairly detailed explanation. Basically, strncpy() doesn't guarantee NUL termination of the destination, and is therefore susceptible to a variety of exploits.
The original problem is obviously that strcpy(3) was not a memory-safe operation, so an attacker could supply a string longer than the buffer, which would overwrite data on the stack (such as a return address) and, if carefully arranged, could cause arbitrary code from the attacker to be executed.
But strncpy(3) has another problem in that it doesn't supply null termination in every case at the destination. (Imagine a source string longer than the destination buffer.) Future operations may expect conforming C nul-terminated strings between equally sized buffers and malfunction downstream when the result is copied to yet a third buffer.
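The missing terminator is easy to observe. The sketch below (the function name is hypothetical) copies with strncpy and then searches the destination for a null byte without reading past its end:

```cpp
#include <cstddef>
#include <cstring>

// Returns true if strncpy left `dst` without a null terminator, which happens
// whenever strlen(src) >= dst_size. `dst` must have room for dst_size bytes.
bool strncpy_left_unterminated(char* dst, std::size_t dst_size, const char* src) {
    std::strncpy(dst, src, dst_size);
    return std::memchr(dst, '\0', dst_size) == nullptr;
}
```

Any later strlen, strcmp or strcpy on such a destination would read past the buffer, which is where the downstream malfunction comes from.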
Using strncpy(3) is better than strcpy(3) but things like strlcpy(3) are better still.
To safely use strncpy, one must either (1) manually stick a null character onto the result buffer, (2) know that the buffer ends with a null beforehand, and pass (length-1) to strncpy, or (3) know that the buffer will never be copied using any method that won't bound its length to the buffer length.
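Options (1) and (2) are commonly combined into one small helper: copy at most size - 1 bytes, then write the terminator unconditionally. The name copy_truncated is an assumption for this sketch:

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical helper: truncating copy that always terminates the result.
// `dst` must point to at least dst_size writable bytes, and dst_size must be > 0.
void copy_truncated(char* dst, std::size_t dst_size, const char* src) {
    assert(dst != nullptr && src != nullptr && dst_size > 0);
    std::strncpy(dst, src, dst_size - 1);  // leaves the last byte for the terminator
    dst[dst_size - 1] = '\0';              // terminate even when src was too long
}
```

Note that silent truncation can be a bug in its own right; a caller that cares should compare strlen(src) against dst_size first.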
It's important to note that strncpy will zero-fill everything in the buffer past the copied string, while other length-limited strcpy variants will not. This may in some cases be a performance drain, but in other cases be a security advantage. For example, if one used strlcpy to copy "supercalifragilisticexpialidocious" into a buffer and then to copy "it", the buffer would hold "it^ercalifragilisticexpialidocious^" (using "^" to represent a zero byte). If the buffer gets copied to a fixed-sized format, the extra data might tag along with it.
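The padding difference can be made visible by counting leftover non-zero bytes after the terminator. Since strlcpy is not part of standard C or C++, the sketch uses a hand-written stand-in, copy_without_padding, that terminates but does not pad; both helper names are hypothetical:

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Counts bytes after the first terminator that are not zero, i.e. stale data.
std::size_t stale_bytes_after_terminator(const char* buf, std::size_t size) {
    const char* end = static_cast<const char*>(std::memchr(buf, '\0', size));
    if (end == nullptr) return 0;  // no terminator, nothing "after" it
    std::size_t stale = 0;
    for (const char* p = end + 1; p < buf + size; ++p) {
        if (*p != '\0') ++stale;
    }
    return stale;
}

// Stand-in for an strlcpy-style copy: truncates and terminates, never pads.
void copy_without_padding(char* dst, std::size_t size, const char* src) {
    assert(size > 0);
    std::size_t n = std::strlen(src);
    if (n >= size) n = size - 1;
    std::memcpy(dst, src, n);
    dst[n] = '\0';
}
```

Filling a buffer with old data and then copying "it" with each approach shows strncpy leaving no stale bytes, while the non-padding copy leaves everything past the new terminator in place.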
The question is based on a "loaded" premise, which makes the question itself invalid.
The bottom line here is that strncpy is not considered insecure and has never been considered insecure. The only claims of "insecurity" that can be attached to that function are the broad claims of general insecurity of C memory model and C language itself. (But that is obviously a completely different topic).
Within the realm of C language the misguided belief of some kind of "insecurity" inherent in strncpy is derived from the widespread dubious pattern of using strncpy for "safe string copying", i.e. something this function does not do and has never been intended for. Such usage is indeed highly error prone. But even if you put an equality sign between "highly error prone" and "insecure", it is still a usage problem (i.e. a lack of education problem) not a strncpy problem.
Basically, one can say that the only problem with strncpy is its unfortunate naming, which makes newbie programmers assume that they understand what this function does instead of actually reading the specification. Looking at the function name an incompetent programmer assumes that strncpy is a "safe version" of strcpy, while in reality these two functions are completely unrelated.
Exactly the same claim can be made against the division operator, for one example. As most of you know, one of the most frequently-asked questions about C language goes as "I assumed that 1/2 will evaluate to 0.5 but I got 0 instead. Why?" Yet, we don't claim that the division operator is insecure just because language beginners tend to misinterpret its behavior.
For another example, we don't call pseudo-random number generator functions "insecure" just because incompetent programmers are often unpleasantly surprised by the fact that their output is not truly random.
That is exactly how it is with strncpy function. Just like it takes time for beginner programmers to learn what pseudo-random number generators actually do, it takes them time to learn what strncpy actually does. It takes time to learn that strncpy is a conversion function, intended for converting zero-terminated strings to fixed-width strings. It takes time to learn that strncpy has absolutely nothing to do with "safe string copying" and can't be meaningfully used for that purpose.
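The "conversion to a fixed-width string" reading can be sketched with a fixed-size name field, as might appear in an on-disk record. The width kNameWidth and both function names are assumptions for illustration:

```cpp
#include <algorithm>
#include <cstddef>
#include <cstring>
#include <string>

constexpr std::size_t kNameWidth = 8;  // assumed field width for illustration

// Fills a fixed-width field: shorter names are zero-padded, names of
// kNameWidth or more characters fill the field with no terminator at all.
void store_name(char (&field)[kNameWidth], const char* name) {
    std::strncpy(field, name, kNameWidth);
}

// Reads a field back, stopping at the first padding byte or at the field end,
// so it never relies on a terminator being present.
std::string load_name(const char (&field)[kNameWidth]) {
    const char* end = std::find(field, field + kNameWidth, '\0');
    return std::string(field, end);
}
```

In this usage the absent terminator and the zero padding are both features of the format, not defects.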
Granted, it usually takes much longer for a language student to learn the purpose of strncpy than to sort things out with the division operator. However, this is not a basis for any "insecurity" claims against strncpy.
P.S. The CERT document linked in the accepted answer is dedicated to exactly that: to demonstrate the insecurities of the typical incompetent abuse of strncpy function as a "safe" version of strcpy. It is not in any way intended to claim that strncpy itself is somehow insecure.
A patch in Git 2.19 (Q3 2018) finds that it is too easy to misuse system API functions such as strcat(), strncpy(), ... and forbids those functions in the Git codebase.
See commit e488b7a, commit cc8fdae, commit 1b11b64 (24 Jul 2018), and commit c8af66a (26 Jul 2018) by Jeff King (peff).
(Merged by Junio C Hamano -- gitster -- in commit e28daf2, 15 Aug 2018)
banned.h: mark strcat() as banned
The strcat() function has all of the same overflow problems as strcpy().
And as a bonus, it's easy to end up accidentally quadratic, as each subsequent call has to walk through the existing string.
The last strcat() call went away in f063d38 (daemon: use
cld->env_array when re-spawning, 2015-09-24, Git 2.7.0).
In general, strcat() can be replaced either with a dynamic string
(strbuf or xstrfmt), or with xsnprintf if you know the length is bounded.
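strbuf and xstrfmt are Git-internal C helpers. In C++ code, the closest everyday counterpart to "a dynamic string" is std::string, so a rough analogue of replacing repeated strcat calls might look like the sketch below. This is a swapped-in C++ illustration, not Git's code, and join is a hypothetical name:

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Joins parts with a separator using a growable std::string, so there is no
// fixed destination buffer to overflow and no repeated scan for the end of
// the string as with chained strcat calls.
std::string join(const std::vector<std::string>& parts, const std::string& sep) {
    std::string out;
    for (std::size_t i = 0; i < parts.size(); ++i) {
        if (i != 0) out += sep;
        out += parts[i];
    }
    return out;
}
```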