What type of input check can be performed against binary data in C++? - c++

Let's say I have a function like this in C++, which I wish to publish to third parties. I want to make it so that the user will know what happened if he/she feeds invalid data in and the library crashes.
Let's say that, if it helps, I can change the interface as well.
int doStuff(unsigned char *in_someData, int in_data_length);
Apart from application specific input validation (e.g. see if the binary begins with a known identifier etc.), what can be done? E.g. can I let the user know, if he/she passes in in_someData that has only 1 byte of data but passes in 512 as in_data_length?
Note: I already asked a similar question here, but let me ask it from another angle.

It cannot be checked whether the parameter in_data_length passed to the function has the correct value. If this were possible, the parameter would be redundant and thus needless.
But a vector from the standard template library solves this:
int doStuff(const std::vector<unsigned char>& in_someData);
So, there is no possibility of a "NULL buffer" or an invalid data length parameter.

If you could know how many bytes were passed via in_someData, why would you need in_data_length at all?
In practice, you can only check in_someData for NULL and in_data_length for a positive value, and return an error code if either check fails. If a user passes garbage beyond that, the problem is obviously theirs, not yours.

In C++, the magic word you're looking for is "exception". That gives you a way to tell the caller something went wrong. You'll end up with code something like
int
doStuff(unsigned char * inSomeData, int inDataLength) {
// do a test
if(inDataLength == 0)
throw std::invalid_argument("Length can't be 0");
// only gets here if it passed the test
// do other good stuff
return theResult;
}
Now, there's another problem with your specific example, because there's no universal way in C or C++ to tell how long an array of primitives really is. It's all just bits, with inSomeData being the address of the first bits. Strings are a special case, because there's a general convention that a zero byte ends a string, but you can't depend on that for binary data -- a zero byte is just a zero byte.
Update
This has picked up some downvotes, apparently from people who thought exceptions had been deprecated. They have not: what C++11 deprecates are dynamic exception specifications (the throw(...) clause that could follow a function's parameter list), not the throw expression itself. C++ never had a Java-style throws clause, and unlike Java you throw an exception object by value (std::invalid_argument comes from <stdexcept>), not a pointer made with new. Throwing as shown above remains the correct way to report this kind of error in C++.
Also note that the original questioner says "I want to make it so that the user will know what happened, should he/she feeds [sic] invalid data in and the library crashes." Thus the question is not just what can I do to validate the input data (answer: not much unless you know more about the inputs than was stated), but then how do I tell the caller they screwed up? And the answer to that is "use the exception mechanism" which has certainly not been deprecated.

Related

`ncurses` function `wgetstr` is modifying my variables

SOLUTION Apparently, the wgetstr function does not allocate a new buffer. If the second argument is called data and has size n, and you give an input of more than n characters, it will access and overwrite parts of memory that do not belong to data, such as the place in memory where cursorY is stored. To make everything work, I declared data with char data[] = "        "; (eight spaces) and wrote wgetnstr(inputWin, data, 8);.
--------------------------------------------------------------------------------------------------------------
It seems that the ncurses function wgetstr is literally changing the values of my variables. In a function called playGame, I have a variable called cursorY (of type int) which is adjusted whenever I press the up- or down-arrow on my keyboard (this works fine).
Please take a look at this code (inputWin is of type WINDOW*):
mvprintw(0, 0, (to_string(cursorY)).c_str());
refresh();
usleep(500000);
wgetstr(inputWin, data);
mvprintw(0, 0, (to_string(cursorY)).c_str());
refresh();
usleep(500000);
Suppose I move the cursor to the 6th row and then press Enter (which causes this piece of code to be executed). There are two things I can do:
Input just 1 character. After both refresh calls, the value 6 is shown on the screen (at position (0, 0)).
Input 2 or more characters. In this case, after the first refresh call I simply get 6, but after the second, I magically get 0.
The first two lines after the code above are
noecho();
_theView -> _theActualSheet -> putData(cursorY-1, cursorX/9 - 1, data);
(don't worry about the actual parameters: the math regarding them checks out). While I'm in putData, I get a Segmentation fault, and gdb says that the first argument of putData was -1, so cursorY must have been 0 (the first two arguments of putData are used to access a two-dimensional array via SheetCells[row][column], where row and column are, respectively, the first and second formal parameters of putData).
Clearly, wgetstr modifies the value of cursorY. The name of the variable doesn't matter: changing it to cursorrY or something weird like monkeyBusiness (yes, I've tried that) doesn't help. What sort of works is replacing the piece of code above with
mvprintw(0, 0, (to_string(cursorY)).c_str());
refresh();
usleep(500000);
int a = cursorY;
wgetstr(inputWin, data);
cursorY = a;
mvprintw(0, 0, (to_string(cursorY)).c_str());
refresh();
usleep(500000);
In both cases I see 6 at the top-left corner of my screen. However, now the string is acting all weird: when I type in asdf as my string, then move to the right (i.e., I press the right arrow key on my keyboard), then type in asdf again, I get as^a.
So basically, I would like to know three things:
Why the HELL is wgetstr changing my variables?
Why is it only happening when I input more than 1 character?
What seems to be wrong with wgetstr in general? It seems terrible at handling input.
I could try other things (like manually reading in characters and then concatenating data with them), but wgetstr seems perfect for what I want to do, and there is no reason I should switch here.
Any help is much appreciated. (Keep in mind: I specifically want to know why the value of cursorY is being changed. If you would recommend not using wgetstr and have a good alternative, please tell me, but I'm most interested in knowing why cursorY is being altered.)
EDIT The variable data is of type char[] and declared like so: char data[] = "". I don't "clear" this variable (i.e., remove all "letters"), but I don't think this makes any difference, as I think wgetstr just overwrites the whole variable (or am I terribly wrong here?).
The buffer you provide for the data, data, is defined as being a single character long (only the null-terminator will be there). This means that if you enter any input of one or more characters, you will be writing outside the space provided by data, and thus overwrite something else. It looks like cursorY is the lucky variable that got hit.
You need to make sure that data is at least big enough to handle all inputs. And preferably, you should switch to some input function (like wgetnstr) that will let you pass the size of the buffer, otherwise it will always be possible to crash your application by typing enough characters.
wgetstr expects to write the received characters to a preallocated buffer, which should be at least as long as the expected input string. It does not allocate a new buffer for you!
What you've done is provide it with a single byte buffer, and are writing multiple bytes to it. This will stomp over the other variables you've defined in your function after data, such as cursorY, regardless of what it is called. Any changes to variables will in turn change the string that was read in:
int a = cursorY;
wgetstr(inputWin, data);
cursorY = a;
will write an int value into your string, which is why it is apparently getting corrupted.
What you should actually do is to make data actually long enough for the anticipated input, and ideally use something like wgetnstr to ensure you don't walk off the end of the buffer and cause damage.

How to check if `strcmp` has failed?

Well this question is about C and C++ as strcmp is present in both of them.
I came across this link: C library function - strcmp().
That page explains the return values of strcmp. I know that every function, however safe it is, can fail at some point, so I figured that even strcmp can sometimes fail.
Also, I came across this question, which also explained the return values of strcmp. But after searching a lot, I could not find anything that explained how to check whether strcmp had failed.
I first thought it would return -1, but it turns out that it returns any number < 0 if the first string is smaller. So can someone tell me how to check if strcmp has failed?
EDIT: Well, I do not understand the point that strcmp cannot fail. There are many ways in which a function can fail. For example, in one comment it was written that if the stack can't extend, a call might cause a stack overflow. No program in any language is absolutely safe!
There is no defined situation where strcmp can fail.
You can trigger undefined behavior by passing an invalid argument, such as:
a null pointer
an uninitialized pointer
a pointer to memory that's been freed, or to a stack location from a function that's since been exited, or similar
a perfectly valid pointer, but with no null byte between the location that the pointer points to and the end of the allocated memory range that it points into
but in such cases strcmp makes absolutely no guarantees about its behavior — it's allowed to just start wiping your hard drive, or sending spam from your e-mail account, or whathaveyou — so it doesn't need to explicitly indicate the error. (And indeed, the only kind of invalid argument that strcmp really could detect and handle, in a typical C or C++ implementation, is the null-pointer case, which you could just as easily check before calling strcmp. So an error indicator would not be useful.)
If you look up strcmp in some documentation, you will find something like
“The behavior is undefined if lhs or rhs are not pointers to null-terminated strings.”
This means that it can crash, or cause small red daemons to fly out of your nose, or whatever. And since undefined behavior can also be exactly what you innocently expected to happen (nothing visibly wrong, as you see it), there's no sure-fire way to recognize it. Although you'll certainly recognize a crash.
So, you as a programmer can't check for undefined behavior after the fact.
But the compiler can add such checks for you, for many kinds of undefined behavior including this, because it's in full charge of that behavior. Whether it will do so depends on the compiler. Also, you can add checks in e.g. a wrapper function, that minimizes the chance of undefined behavior, although with non-terminated strings there's no practical way of checking that I know of that won't itself possibly invoke undefined behavior.
Possible failure is undefined behaviour, and may happen if you give wrong arguments (not pointers to null-terminated strings).
strcmp returns 0 if and only if both strings are properly zero-terminated and have exactly the same characters.
If they are not exactly the same, a non-zero value is returned: a negative value means that the first string comes first in lexicographical ordering; a positive value means that the second string comes first.
There is no "failure", except possible undefined behaviour, if invalid input is provided - then, anything could happen, including a program crash, or compiler generating invalid code.

How to print elements from tcl_obj in gdb?

I am debugging a C++/Tcl interface application and I need to see the elements of the Tcl_Obj array objv.
I tried doing print *(objv[1]) and so on, but that doesn't seem to help.
Is there any way to see Tcl_Obj elements in gdb?
It's not particularly easy to understand a Tcl_Obj * from GDB as the data structure uses polymorphic pointers with shrouded types. (Yeah, this is tricky C magic.) However, there are definitely some things you can try. (I'll pretend that the pointer is called objPtr below, and that it is of type Tcl_Obj *.)
Firstly, check out what the objPtr->typePtr points to, if anything. A NULL objPtr->typePtr means that the object just has something in the objPtr->bytes field, which is a UTF-8 string containing objPtr->length bytes with a \0 at objPtr->bytes[objPtr->length]. A Tcl_Obj * should never have both its objPtr->bytes and objPtr->typePtr being NULL at the same time.
If the objPtr->typePtr is not NULL, it points to a static constant structure that defines the basic polymorphic type operations on the Tcl_Obj * (think of it as being like a vtable). Of initial interest to you is going to be the name field though; that's a human-readable const char * string, and it will probably help you a lot. The other things in that structure include a definition of how to duplicate the object and how to serialize the object. (The objPtr->bytes field really holds the serialization.)
The objPtr->typePtr defines the interpretation of the objPtr->internalRep, which is a C union that is big enough to hold two generic pointers (and a few other things besides, like a long and double; you'll also see a Tcl_WideInt, which is probably a long long but that depends on the compiler). How this happens is up to the implementation of the type so it's difficult to be all-encompassing here, but it's basically the case that small integers have the objPtr->internalRep.longValue field as meaningful, floating point numbers have the objPtr->internalRep.doubleValue as meaningful, and more complex types hang a structure off the side.
With a list, the structure actually hangs off the objPtr->internalRep.twoPtrValue.ptr1 and is really a struct List (which is declared in tclInt.h and is not part of Tcl's public API). The struct List in turn has a variable-length array in it, the elements field; don't modify inside there or you'll break things. Dictionaries are similar, but use a struct Dict instead (which contains a variation on the theme of hash tables) and which is declared just inside tclDictObj.c; even the rest of Tcl's implementation can't see how they work internally. That's deliberate.
If you want to debug into a Tcl_Obj *, you'll have to proceed carefully, look at the typePtr, apply relevant casts where necessary, and make sure you're using a debug build of Tcl with all the symbol and type information preserved.
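Putting the steps above together, a session might look like this (field names as described above; a debug build of Tcl is assumed, and the struct List cast only makes sense after the name field has told you the value really is a list):

```gdb
(gdb) print objv[1]->bytes          # serialized form, if any (may be 0x0)
(gdb) print objv[1]->typePtr        # 0x0 means pure string: use bytes/length
(gdb) print objv[1]->typePtr->name  # human-readable type name, e.g. "list"
(gdb) print *(struct List *) objv[1]->internalRep.twoPtrValue.ptr1
```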
There's nothing about this that makes debugging a whole array of values particularly easy. The simplest approach is to print the string view of the object, like this:
print Tcl_GetString(objv[1])
Be aware that this does potentially trigger the serialization of the object (including memory allocation) so it's definitely not perfect. It is, however, really easy to do. (Tcl_GetString generates the serialization if necessary — storing it in the objPtr->bytes field of course — and returns a pointer to it. This means that the value returned is definitely UTF-8. Well, Tcl's internal variation on UTF-8 that's slightly denormalized in a couple of places that probably don't matter to you right now.)
Note that you can read some of this information from scripts in Tcl 8.6 (the current recommended release) with the ::tcl::unsupported::representation command. As you can guess from the name, it's not supported (because it violates a good number of Tcl's basic semantic model rules) but it can help with debugging before you break out the big guns of attaching gdb.

Should I unit-test with data that should not be passed in a function (invalid input)?

I am trying to use TDD in my coding practice. I would like to ask: should I test a function with data that should never be passed in, but which could break the program if it were?
Here is an easy example to illustrate what I am asking:
a ROBOT function that takes one int parameter. I know that the valid range is only 0-100. If -1 or 101 is passed, the function will break.
function ROBOT (int num){
...
...
...
return result;
}
So I decided some automated test cases for this function...
1. function ROBOT with input argument 0
2. function ROBOT with input argument 1
3. function ROBOT with input argument 10
4. function ROBOT with input argument 100
But should I write test cases with input arguments -1 or 101 for this ROBOT function if I already guard against them in the other functions that call ROBOT?
5. function ROBOT with input argument -1
6. function ROBOT with input argument 101
I don't know if it is necessary, because I think it is redundant to test -1 and 101. And if it really is necessary to cover all the cases, I have to write more code to guard against -1 and 101.
So, in common TDD practice, would you write test cases for -1 and 101 as well?
Yes, you should test those invalid inputs. BUT, if your language has access modifiers and ROBOT() is private, you shouldn't be testing it; you should only test public functions/methods.
The functional testing technique is called Boundary Value Analysis.
If your range is 0-100, your boundary values are 0 and 100. You should test, at least:
below the boundary value
the boundary value
above the boundary value
In this case:
-1, 0, 1
99, 100, 101
You assume that everything from -1 down to negative infinity behaves the same, everything between 1 and 99 behaves the same, and everything from 101 up behaves the same. This is called Equivalence Partitioning. The ranges outside and between the boundary values are called partitions, and you assume that they will have equivalent behaviour.
You should always consider using -1 as a test case to make sure nothing funny happens with negative numbers, and a text string as a test case if the parameter is not strongly typed.
If the expected outcome is that an exception is thrown with invalid input values, then a test that the exceptions get properly thrown would be appropriate.
Edit:
As I noted in my comment below, if these cases will break your application, you should throw an exception. If it really is logically impossible for these cases to occur, then I would say no, you don't need to throw an exception, and you don't need test cases to cover it.
Note that if your system is well componentized, and this function is one component, the fact that it is logically impossible now doesn't mean it will always be logically impossible. It may be used differently down the road.
In short, if it can break, then you should test it. Also validate data at the earliest point possible.
The answer depends on whether you control the inputs passed to Robot. If Robot is an internal class (in C# terms) and values flow in only from RobotClientX, which is a public type, then I'd put the guard checks in RobotClientX and write tests for it. I'd not write tests for Robot, because invalid values cannot materialize in between.
e.g. if I put my validations in the GUI such that all invalid values are filtered off at the source, then I don't check for invalid values in all classes below the GUI (Unless I've also exposed a public API which bypasses the GUI).
On the other hand, if Robot is publicly visible, i.e. anyone can call Robot with any value they please, then I need tests that document its behavior given specific kinds of input, invalid being one of them. E.g. if you pass an out-of-range value, it'd throw an ArgumentException.
You said your method will raise an exception if the argument is not valid.
So, yes you should, because you should test that the exception gets raised.
If other code guards against calling that method incorrectly, and no one else will be writing code to call that method, then I don't see a reason to test with invalid values. To me, it would seem a waste of time.
The programming by contract style of design and implementation draws attention to the fact that a single function (method) should be responsible for only some things, not for everything. The other functions that it calls (delegates to) and which call it also have responsibilities. This partition of responsibilities is at the heart of dividing the task of programming into smaller tasks that can be performed separately. The contract part of programming by contract is that the specification of a function says what a function must do if and only if the caller of the function fulfills the responsibilities placed on the caller by that specification. The requirement that the input integer is within the range [0,100] is that kind of requirement.
Now, unit tests should not test implementation details. They should test that the function conforms to its specification. This enables the implementation to change without the tests breaking. It makes refactoring possible.
Combining those two ideas, how can we write a test for a function that is given some particular invalid input? We should check that the function behaves according to the specification. But the specification does not say what the function must do in this case. So we can not write any checks of the program state after the invalid function call; the behaviour is undefined. So we can not write such a test at all.
My answer is that, no, you don't want exceptions, you don't want to have to have ROBOT() check for out of range input. The clients should be so well behaved that they don't pass garbage values in.
You might want to document this - Just say that clients must be careful about the values they pass in.
Besides, where are you going to get invalid values from? User input, or from converting strings to numbers. But in those cases it should be the conversion routines that perform the checks and give feedback about whether the values are valid. The values should be guaranteed to be valid long before they get anywhere near ROBOT()!

Initializing a char array in C. Which way is better?

The following are the two ways of initializing a char array:
char charArray1[] = "foo";
char charArray2[] = {'f','o','o','\0'};
If both are equivalent, one would expect everyone to use the first option above (since it requires fewer key strokes). But I've seen code where the author takes the pain to always use the second method.
My guess is that in the first case the string "foo" is stored in the data segment and copied into the array at runtime, whereas in the second case the characters are stored in the code segment and copied into the array at runtime. And for some reason, the author is allergic to having anything in the data segment.
Edit: Assume the arrays are declared local to a function.
Questions: Is my reasoning correct? Which is your preferred style and why?
What about another possibility:
char charArray3[] = {102, 111, 111, 0};
You shouldn't forget that the C char type is a numeric type; it just happens that the value is often used as a character code. If I use an array for something not related to text at all, I would definitely prefer to initialize it with the above syntax rather than encode it as letters and put them between quotes.
If you don't want the terminating 0, you also have to use the second form, or in C use:
char charArray3[3] = "foo";
It is a C feature that nearly nobody knows, but if the compiler does not have room to hold the final 0 when initializing a char array, it does not add it, and the code is still legal. However, this should be avoided, because the feature was removed from C++, and a C++ compiler will report an error.
I checked the assembly code generated by gcc, and all the different forms are equivalent. The only difference is that it uses either the .string or .byte pseudo-instruction to declare the data. But that's just a readability issue and does not make a bit of difference in the resulting program.
I think the second method is used mostly in legacy code where compilers didn't support the first method. Both methods should store the data in the data segments. I prefer the first method due to readability. Also, I needed to patch a program once (can't remember which, it was a standard UNIX tool) to not use /etc (it was for an embedded system). I had a very hard time finding the correct place because they used the second method and my grep couldn't find "etc" anywhere :-)