How does using arrays in C++ result in security problems?

I was told that the optimal way to program in C++ is to use the STL and std::string rather than raw arrays and character arrays.
i.e.,
vector<int> myInt;
rather than
int myInt[20];
However, I don't understand the rationale behind why it would result in security problems.

I suggest you read up on buffer overruns, then. It's much more likely that a programmer creates or risks buffer overruns when using raw arrays, since they give you less protection and don't offer an API. Sure, it's possible to shoot yourself in the foot using STL too, but at least it's harder.

There appears to be some confusion here about what safety guarantees vectors can and cannot provide. Ignoring the use of iterators, there are three main ways of accessing elements in a vector:
the operator[] member function of vector - this provides no bounds checking and will result in undefined behaviour on a bounds error, in the same way an array would if you used an invalid index.
the at() member function - this provides bounds checking and will raise an exception if an invalid index is used, at a small performance cost.
the built-in operator[] on the vector's underlying array (via data()) - this provides no bounds checking, but gives the highest possible access speed.
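To make the three options concrete, here is a minimal sketch (the values are arbitrary):
#include <iostream>
#include <vector>

int main() {
    std::vector<int> v{10, 20, 30};
    int a = v[1];        // operator[]: no bounds checking, undefined behaviour on a bad index
    int b = v.at(1);     // at(): throws std::out_of_range on a bad index
    int c = v.data()[1]; // raw access to the underlying array: no checks, fastest
    std::cout << a << ' ' << b << ' ' << c << '\n'; // prints "20 20 20"
}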

Arrays don't perform bounds checking. Hence they are very vulnerable to out-of-bounds errors, which can be hard to detect.
Note: the following code has a programming error.
int Data[] = { 1, 2, 3, 4 };
int Sum = 0;
for (int i = 0; i <= 4; ++i) Sum += Data[i]; // reads Data[4], one past the end
Using arrays like this, you won't get an exception that helps you find the error; only an incorrect result.
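For contrast, here is a minimal sketch of the same off-by-one bug written against std::vector::at; the bad index is reported instead of silently producing a wrong sum:
#include <iostream>
#include <stdexcept>
#include <vector>

int main() {
    std::vector<int> data{1, 2, 3, 4};
    int sum = 0;
    try {
        for (int i = 0; i <= 4; ++i) sum += data.at(i); // same off-by-one bug
    } catch (const std::out_of_range& e) {
        std::cout << "bad index caught: " << e.what() << '\n';
    }
}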
Arrays don't know their own size, whereas a vector defines begin and end methods to access its elements. With arrays you'll always have to rely on pointer arithmetic (and since they decay to nothing but pointers, you can accidentally cast them).

C++ arrays do not perform bounds checking, on either writes or reads, and it is quite easy to accidentally access items outside the array bounds.
From an OO perspective, the vector also has more knowledge about itself and so can take care of its own housekeeping.

Your example has a static array with a fixed number of items; depending on your algorithm, this may be just as safe as a vector with a fixed number of items.
However, as a rule of thumb, when you want to dynamically allocate an array of items, a vector is much easier and also lets you make fewer mistakes. Any time you have to think, there's a possibility for a bug, which might be exploited.
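As a minimal sketch of that rule of thumb (the function names are mine), compare manual allocation with a vector:
#include <cstddef>
#include <vector>

void raw_way(std::size_t n) {
    int* p = new int[n];   // you must remember both the size and the delete[]
    // ... use p[0] .. p[n-1] ...
    delete[] p;            // forget this (or write plain delete) and you have a bug
}

void vector_way(std::size_t n) {
    std::vector<int> v(n); // the size travels with the object
    // ... use v[0] .. v[n-1], or v.at(i) for checked access ...
}                          // memory is released automatically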

Related

Partially sort a C-style 2D array with std::sort

I came across this question regarding sorting the first two rows of a 2D array of integers; the obvious way that came to mind was to use std::sort, so I proposed a solution like:
int mat[][3] = { {4, 5, 3},
                 {6, 8, 7},
                 {9, 5, 4},
                 {2, 1, 3} };
std::sort(std::begin(mat[0]), std::end(mat[1])); // sorting the first two rows
As you can see, it works without errors or warnings.
Meanwhile @Jarod42 pointed out that this is pedantically undefined behaviour in C++, because these are pointers into two different arrays.
I was inclined to agree, given that in C this would be a good way to do it (without the std::sort, std::begin and std::end, of course), accessing the 2D array linearly, given the way 2D arrays are stored in C.
We agreed that it would be undefined behaviour, but as @SergeBallesta remarked, pretty much all compilers accept this method, so should it be used?
And what about if one uses an int(*mat)[3] pointer-to-array, would it still be pedantic UB to use std::sort this way?
//...
srand(time(0));
int(*mat)[3] = (int(*)[3])malloc(sizeof(int) * 4 * 3);
//or int(*mat)[3] = new int[4][3];
for(int i = 0; i < 4; i++)
    for(int j = 0; j < 3; j++)
        mat[i][j] = rand() % 9 + 1;
std::sort(std::begin(mat[0]), std::end(mat[1])); // sorting the first two rows
//...
The problem comes from the way the standard defines an array type (8.3.4 [dcl.array]):
An object of array type contains a contiguously allocated non-empty set of N subobjects of type T.
but it does not explicitly say that a contiguously allocated set of objects of the same type can be used as an array.
For compatibility reasons all compilers I know accept that reciprocity, but from a pedantic point of view, it is not explicitly defined in the standard and is Undefined Behaviour.
The rationale behind the non-reciprocity is that a program is expected to represent a model, and in the model, an object has no reason to be a member of more than one array at the same time. So the standard does not allow it. In fact, all the (real-world) use cases I have ever encountered for handling a 2D array as if it were a 1D one were just low-level optimizations. And in modern C++, the programmer should not care about low-level optimization but let the compiler handle it.
The following is only my opinion.
When you find yourself processing a 2D array as if it were a 1D one, you should ask yourself why. If you are dealing with legacy code, do not worry about it: compilers currently accept this and will probably continue to in the future, even if at the price of special options.
But if you are writing new code, you should try to move one step higher (or back) and ask what it represents at the model level. Most of the time, you will find that the array is intrinsically 1D or 2D, but not both. Once this is done, if performance is not critical, try to always handle it the conformant way. Or even better, try to use containers from the standard library instead of raw arrays.
If you are in performance-critical code where saying that any contiguously allocated set of objects is an array provides an important benefit, do it and document it for future maintainers. But only do that after profiling...
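For illustration, one conformant alternative is to keep the storage intrinsically 1D and compute row offsets yourself; a minimal sketch, where kCols and row are names of my own choosing:
#include <algorithm>
#include <cstddef>
#include <vector>

constexpr std::size_t kCols = 3;   // hypothetical column count, matching the example

std::vector<int> mat = {4, 5, 3,
                        6, 8, 7,
                        9, 5, 4,
                        2, 1, 3};

int* row(std::size_t r) { return mat.data() + r * kCols; }

int main() {
    std::sort(row(0), row(2));     // first two rows sorted as one range within one array
}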

Why aren't built-in arrays safe?

The book C++ Primer, 5th edition by Stanley B. Lippman (ISBN 0-321-71411-3/978-0-321-71411-4) mentions:
An [std::]array is a safer, easier-to-use alternative to built-in arrays.
What's wrong with built-in arrays?
A built-in array is a contiguous block of memory, usually on the stack. You really have no decent way to keep useful information about the array, its boundaries or its state. std::array keeps this information.
Built-in arrays decay into pointers when passed to or returned from functions. This causes several problems:
When passing a built-in array, you pass a raw pointer. A pointer doesn't keep any information about the size of the array. You will have to pass the size of the array along as well, and thus uglify the code. std::array can be passed by reference, by copy or by move.
There is no way of returning a built-in array; you will eventually return a pointer to a local variable if the array was declared in that function's scope.
std::array can be returned safely, because it's an object and its lifetime is managed automatically.
You can't really do useful things with built-in arrays such as assigning, moving or copying them. You'll end up writing a customized function for each built-in array (possibly using templates). std::array can be assigned.
By accessing an element which is out of the array's boundaries, you are triggering undefined behaviour. std::array::at will perform boundary checking and throw a regular C++ exception if the check fails.
Better readability: built-in arrays involve pointer arithmetic. std::array implements useful functions like front, back, begin and end to avoid that.
Let's say I want to sort a built-in array, the code could look like:
int arr[7] = {/*...*/};
std::sort(arr, arr+7);
This is not the most robust code ever. By changing 7 to a different number, the code breaks.
With std::array:
std::array<int,7> arr{/*...*/};
std::sort(arr.begin(), arr.end());
The code is much more robust and flexible.
Just to make things clear, built-in arrays can sometimes be easier. For example, many Windows as well as UNIX API functions/syscalls require some (small) buffers to fill with data. I wouldn't go with the overhead of std::array instead of a simple char[MAX_PATH] that I may be using.
It's hard to gauge what the author meant, but I would guess they are referring to the following facts about native arrays:
they are raw
There is no .at member function you can use for element access with bounds checking, though I'd counter that you usually don't want that anyway. Either you're accessing an element you know exists, or you're iterating (which you can do equally well with std::array and native arrays); if you don't know the element exists, a bounds-checking accessor is already a pretty poor way to ascertain that, as it is using the wrong tool for code flow and it comes with a substantial performance penalty.
they can be confusing
Newbies tend to forget about array name decay, passing arrays into functions "by value" then performing sizeof on the ensuing pointer; this is not generally "unsafe", but it will create bugs.
they can't be assigned
Again, not inherently unsafe, but it leads to silly people writing silly code with multiple levels of pointers and lots of dynamic allocation, then losing track of their memory and committing all sorts of UB crimes.
Assuming the author is recommending std::array, that would be because it "fixes" all of the above things, leading to generally better code by default.
But are native arrays somehow inherently "unsafe" by comparison? No, I wouldn't say so.
How is std::array safer and easier-to-use than a built-in array?
It's easy to mess up with built-in arrays, especially for programmers who aren't C++ experts and programmers who sometimes make mistakes. This causes many bugs and security vulnerabilities.
With a std::array a1, you can access an element with bounds checking, a1.at(i), or without bounds checking, a1[i]. With a built-in array, it's always your responsibility to diligently avoid out-of-bounds accesses. Otherwise the code can smash some memory that goes unnoticed for a long time and becomes very difficult to debug. Even just reading outside an array's bounds can be exploited for security holes like the Heartbleed bug, which divulged private encryption keys.
C++ tutorials may pretend that array-of-T and pointer-to-T are the same thing, then later tell you about various exceptions where they are not the same thing. E.g. an array-of-T in a struct is embedded in the struct, while a pointer-to-T in a struct is a pointer to memory that you'd better allocate. Or consider an array of arrays (such as a raster image): does incrementing a pointer step to the next pixel or to the next row? Or consider an array of objects, where an object pointer coerces to its base-class pointer. All this is complicated and the compiler doesn't catch mistakes.
With a std::array a1, you can get its size a1.size(), compare its contents to another std::array a1 == a2, and use other standard container methods like a1.swap(a2). With built-in arrays, these operations take more programming work and are easier to mess up. E.g. given int b1[] = {10, 20, 30}; to get its size without hard-coding 3, you must do sizeof(b1) / sizeof(b1[0]). To compare its contents, you must loop over those elements.
You can pass a std::array to a function by reference (with a std::array& parameter) or by value [i.e. by copy]. Passing a built-in array only goes "by reference", conflating it with a pointer to the first element. That's not the same thing, and the compiler doesn't pass the array size.
You can return a std::array from a function by value, return a1. Returning a built-in array return b1 returns a dangling pointer, which is broken.
You can copy a std::array in the usual way, a1 = a2, even if it contains objects with constructors. If you try that with built-in arrays, b1 = b2, it'll fail to compile (or, if b1 is a pointer, just copy the pointer, depending on how b2 is declared). You can get around that using memcpy(b1, b2, sizeof(b1)), but this is broken if the arrays have different sizes or if they contain elements with constructors.
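A minimal sketch pulling the passing, returning and copying points together (the helper functions are mine):
#include <array>

std::array<int, 3> make_triple() {     // returned by value: safe, no dangling pointer
    return {1, 2, 3};
}

int sum(const std::array<int, 3>& a) { // passed by reference: the size is part of the type
    int s = 0;
    for (int x : a) s += x;
    return s;
}

int main() {
    std::array<int, 3> a1 = make_triple();
    std::array<int, 3> a2 = a1;        // element-wise copy, no memcpy needed
    return sum(a2) == 6 ? 0 : 1;       // returns 0
}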
You can easily change code that uses std::array to use another container like std::vector or std::map.
See the C++ FAQ Why should I use container classes rather than simple arrays? to learn more, e.g. the perils of built-in arrays containing C++ objects with destructors (like std::string) or inheritance.
Don't Freak Out About Performance
Bounds-checking access a1.at(i) requires a few more instructions each time you fetch or store an array element. In some inner loop code that jams through a large array (e.g. an image processing routine that you call on every video frame), this cost might add up enough to matter. In that rare case it makes sense to use unchecked access a[i] and carefully ensure that the loop code takes care with bounds.
In most code you're either offloading the image processing code to the GPU, or the bounds-checking cost is a tiny fraction of the overall run time, or the overall run time is not at issue. Meanwhile the risk of array access bugs is high, starting with the hours it takes you to debug it.
The only benefit of a built-in array would be slightly more concise declaration syntax. But the functional benefits of std::array blow that out of the water.
I would also add that it really doesn't matter that much. If you have to support older compilers, then you don't have a choice, of course, since std::array is only available from C++11. Otherwise, you can use whichever you like, but unless you make only trivial use of the array, you should prefer std::array just to keep things in line with other STL containers. For example, if you later decide to make the size dynamic and use std::vector instead, you will be happy that you used std::array, because all you will probably have to change is the array declaration itself; the rest will stay the same, especially if you use auto and other type-inference features of C++11.
std::array is a template class that encapsulate a statically-sized array, stored inside the object itself, which means that, if you instantiate the class on the stack, the array itself will be on the stack. Its size has to be known at compile time (it's passed as a template parameter), and it cannot grow or shrink.
Arrays are used to store a sequence of objects
Check the tutorial: http://www.cplusplus.com/doc/tutorial/arrays/
A std::vector does the same but it's better than built-in arrays (e.g., in general, a vector has little efficiency difference from a built-in array when accessing elements via operator[]): http://www.cplusplus.com/reference/stl/vector/
The built-in arrays are a major source of errors, especially when they are used to build multidimensional arrays.
For novices, they are also a major source of confusion. Wherever possible, use vector, list, valarray, string, etc.
STL containers don't have the same problems as built-in arrays.
So, there is no reason in C++ to persist in using built-in arrays. Built-in arrays are in C++ mainly for backwards compatibility with C.
If the OP really wants an array, C++11 provides a wrapper for the built-in array, std::array. Using a std::array is very similar to using a built-in array and has no effect on run-time performance, while offering many more features.
Unlike with the other containers in the Standard Library, swapping two array containers is a linear operation that involves swapping all the elements in the ranges individually, which generally is a considerably less efficient operation. On the other hand, this allows the iterators to elements in both containers to keep their original container association.
Another unique feature of array containers is that they can be treated as tuple objects: the <array> header overloads the get function to access the elements of the array as if it were a tuple, and also specializes the tuple_size and tuple_element types.
Anyway, built-in arrays are always (effectively) passed by reference. The reason for this is that when you pass an array to a function as an argument, a pointer to its first element is passed instead.
When you write void f(T array[]), the compiler turns it into void f(T* array).
When it comes to strings: C-style strings (i.e. null-terminated character sequences) are always passed "by reference" too, since they are char arrays.
STL strings are not passed by reference by default. They act like normal variables.
There are no special rules needed to make the parameter pass by reference; arrays are passed this way automatically.
vector<vector<double>> G1=connectivity( current_combination,M,q2+1,P );
vector<vector<double>> G2=connectivity( circshift_1_dexia(current_combination),M,q1+1,P );
This could also be copying vectors since connectivity returns a vector by value. In some cases, the compiler will optimize this out. To avoid this for sure though, you can pass the vector as non-const reference to connectivity rather than returning them. The return value of maxweight is a 3-dimensional vector returned by value (which may make a copy of it).
Vectors are only efficient for insert or erase at the end, and it is best to call reserve() if you are going to push_back a lot of values. You may be able to re-write it using list if you don't really need random access; with list you lose the subscript operator, but you can still make linear passes through, and save iterators to elements, rather than subscripts.
With some compilers, it can be faster to use pre-increment, rather than post-increment. Prefer ++i to i++ unless you actually need to use the post-increment. They are not the same.
Anyway, vector is going to be horribly slow if you are not compiling with optimization on. With optimization, it is close to built-in arrays. Built-in arrays can be quite slow without optimization on also, but not as bad as vector.
std::array has the at member function, which is safe. It also has begin, end and size, which you can use to make your code safer.
Raw arrays don't have that. (In particular, when raw arrays decay to pointers, e.g. when passed as arguments, you lose any size information; the size is kept in the std::array type, since it is a template with the size as an argument.)
And a good optimizing C++11 compiler will handle std::array (or references to them) as efficiently as raw arrays.
Built-in arrays are not inherently unsafe - if used correctly. But it is easier to use built-in arrays incorrectly than it is to use alternatives, such as std::array, incorrectly and these alternatives usually offer better debugging features to help you detect when they have been used incorrectly.
Built-in arrays are subtle. There are lots of aspects that behave unexpectedly, even to experienced programmers.
std::array<T, N> is truly a wrapper around a T[N], but many of the aspects already mentioned are sorted out essentially for free, which is exactly what you want.
These are some I haven't read:
Size: for both, N must be a constant expression; it cannot be a variable. With built-in arrays, however, there are VLAs (Variable Length Arrays), which allow exactly that.
Officially, only C99 supports them. Still, many compilers allow them in earlier versions of C, and in C++, as an extension. Therefore, you may have
int n; std::cin >> n;
int array[n]; // ill-formed but often accepted
that compiles fine. Had you used std::array, this could never work because N is required and checked to be an actual constant expression!
Expectations for arrays: A common flawed expectation is that the size of the array is carried along with the array itself when the array isn't really an array anymore, due to the misleading C syntax that was kept, as in:
void foo(int array[])
{
    // iterate over the elements
    for (int i = 0; i < sizeof(array); ++i) // sizeof(array) is sizeof(int*) here!
        array[i] = 0;
}
but this is wrong, because array has already decayed to a pointer, which carries no information about the size of the pointed-to area. That misconception triggers undefined behavior if array has fewer than sizeof(int*) elements (typically 8), apart from being logically erroneous.
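Two common fixes, sketched under the assumption that the caller can supply or encode the size (the size parameter is my addition):
#include <cstddef>

// Pass the element count explicitly, since the parameter really is a pointer:
void foo(int* array, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
        array[i] = 0;
}

// Or keep the size in the type with a reference to the array:
template <std::size_t N>
void foo(int (&array)[N]) {
    for (std::size_t i = 0; i < N; ++i)
        array[i] = 0;
}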
Crazy uses: Going even further, arrays have some quirks:
Whether you write array[i] or i[array], there is no difference. This is not true for std::array, because calling an overloaded operator is effectively a function call, and the order of the parameters matters.
Zero-sized arrays: N shall be greater than zero, but zero is still allowed as an extension and, as before, often not warned about unless more pedantic diagnostics are requested.
std::array has different semantics:
There is a special case for a zero-length array (N == 0). In that case, array.begin() == array.end(), which is some unique value. The effect of calling front() or back() on a zero-sized array is undefined.
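A tiny sketch of that zero-length special case:
#include <array>
#include <cassert>

int main() {
    std::array<int, 0> a;         // legal, unlike a built-in int[0]
    assert(a.begin() == a.end()); // guaranteed empty range
    assert(a.empty());
    // a.front() or a.back() here would be undefined behaviour
}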

Why can I access array indexes greater than array's size in C++? [duplicate]

I was writing code and realized that I can "access" elements of an array that are at the same or greater index than the size of the array. Why doesn't this produce an error?
For example,
#include <iostream>
using namespace std;
int main ()
{
    int b_array[5] = {1, 2, 3, 4, 5};
    cout << b_array[5] << endl   // prints 0 (by chance)
         << b_array[66] << endl; // prints some apparently random value
    return 0;
}
The only technical answer is "because the C++ language specification says so". Accessing an out-of-bounds value is undefined behavior. Your personal taste is irrelevant.
Behind the "undefined behaviors" (there are many in the C++ spec) there is the need to let compiler developers implement different optimizations depending on the platform they target.
If you consider that indexes are often used in loops, checking the bounds would mean a check on each iteration that always succeeds (thus wasting processor time).
C++ does not implement bounds checking due to the performance penalty it incurs.
For example, the vector template contains an at() function which checks bounds, but it is ~5 times slower than the [] operator.
Low-level languages tend to force the programmer to produce safe and error-free code in return for high performance.
Although there are simple cases like yours where compilers and/or static analyzers could detect that an access is out of bounds, doing it in general at compile time is not feasible. For example, if you pass your array off to a function, it immediately decays into a pointer and the compiler has no chance to do bounds checking at compile time.
The alternative, run-time bounds checking, is comparatively expensive: doing a check upon each access would turn a simple memory dereference into a potentially stalling branch. To make things even harder, you can use the dereference operator on pointers, i.e., you can't even know easily where to locate the size of the actual array object.
As a result, the behavior of out of bounds array accesses is deliberately made undefined: a system can track these accesses but it doesn't have to. Also, what the system actually does upon an out of bounds array access is not specified, i.e., it can do different things depending on the context. In many cases, it will just return junk which isn't really too useful. However, especially with suitable debug settings the system may instead assert() upon detecting a violation.
C++ allows direct memory access to your program. There are no boundary checks done for you. This can be the cause of very nasty bugs, but it's also very efficient as compared to other "safer" languages.
At runtime, an array access is nothing but arithmetic on the array's starting address. The index that you are trying to access, such as index 66 in array[66], is resolved by adding 66 * sizeof(int) to the starting address of the array. Whether the resulting address is within some bounds or not is beyond the things checked by the compiler.
In other words, array[i] is the same as *(array + i) in C++. In fact, you might be surprised that array[i] can also be written as i[array]!
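A quick sketch demonstrating the equivalence:
#include <cassert>

int main() {
    int array[4] = {10, 20, 30, 40};
    assert(array[2] == *(array + 2)); // subscripting is just pointer arithmetic
    assert(array[2] == 2[array]);     // ...which is why this is legal, if unreadable
}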

C/C++ overwriting array bounds

What is a good way to detect bugs where I overwrite an array bound?
int a[100];
for (int i = 0; i < 1000; i++) a[i] = i; // writes far past the end of a
It would be helpful to collect a list of different strategies that people have used in their experience to uncover bugs of this type.
For example, doing a backtrace from the point of the memory fault (for me this often doesn't work because the stack has been corrupted).
Valgrind will spot this sort of thing pretty reliably!
Use a std::vector, and either use .at(), which always checks ranges, or use [] and turn on range checking in your compiler.
Edit: if you have a C++ compiler there is NO reason not to use std::vector. It is no slower than an array (if you turn off bounds checking) and you can use exactly the same loops with .size() and []; you don't need to be scared off by complex iterators.
Static code analysis (e.g. lint)
Runtime memory analysis (e.g. valgrind)
Avoid fixed-size buffers, prefer dynamically sized containers
Use sizeof() instead of magic numbers whenever you can
Write unit tests and run them under valgrind. Such bugs are relatively easily caught at the unit-test level.
Overwriting the end of an array is undefined behaviour, and as such the compiler is not required to issue a diagnostic.
Some static analysis tools might help, but sometimes they give false alarms.
Some good suggestions here.
Here's some more, especially for C-style code rather than C++:
Avoid certain unsafe string and memory functions. In particular, if a function writes to a buffer and doesn't let you specify a size, don't use it. Examples of functions to avoid: strcpy, strcat, sprintf, gets, scanf("%s", ptr). Anywhere these are used is a red flag. Instead use things like memcpy, strncpy (or better yet, strlcpy, though it's not available everywhere), snprintf, fgets.
When writing your own interfaces, you should always be able to answer the question: how big are the buffers I'm using? Usually this means keeping a parameter to track the size, for example as memcpy does.
While using STL containers like vector is best, there are some handy idioms for controlling this kind of thing, such as this one that I've used quite a bit:
int a[100];
const size_t A_SIZE = sizeof(a) / sizeof(*a);
for (size_t i = 0; i < A_SIZE; ++i) ...
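For what it's worth, C++17 offers the same idiom as std::size, which, unlike the sizeof trick, refuses to compile if you accidentally hand it a pointer; a minimal sketch:
#include <cstddef>
#include <iterator> // std::size (C++17)

int a[100];

void fill() {
    for (std::size_t i = 0; i < std::size(a); ++i) // the size is derived from the type
        a[i] = 0;
    // or sidestep indexing entirely:
    for (int& x : a) x = 0;
}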
Just dynamically allocate memory for your arrays and use exception handling to figure out if you have enough room.

C++ Performance of structs used as a safe packaging of arrays

In C or C++, there is no out-of-bounds checking of arrays. One way to work around this is to package the array in a struct:
struct array_of_foo {
    int length;
    foo *arr; // array with variable length
};
Then, it can be initialized:
array_of_foo *ar(int length){
    array_of_foo *out = (array_of_foo*) malloc(sizeof(array_of_foo));
    out->arr = (foo*) malloc(length*sizeof(foo));
    return out;
}
And then accessed:
foo I(array_of_foo *ar, int ix){ // may need to be foo* I(...
    if(ix > ar->length - 1){ printf("out of range!\n"); } // error
    return ar->arr[ix];
}
And finally freed:
void freeFoo(array_of_foo *ar){ // is it necessary to free both ar->arr and ar?
    free(ar->arr); free(ar);
}
This way it can warn programmers about out-of-bounds accesses. But will this packaging slow down performance substantially?
I agree on the std::vector recommendation. Additionally you might try the boost::array library, which includes a complete (and tested) implementation of fixed-size array containers:
http://svn.boost.org/svn/boost/trunk/boost/array.hpp
In C++, there's no need to come up with your own incomplete version of vector. (To get bounds checking on vector, use .at() instead of []. It'll throw an exception if you get out of bounds.)
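For the C++ case, a minimal sketch of what the struct is reinventing (int stands in for the question's foo):
#include <iostream>
#include <stdexcept>
#include <vector>

using foo = int; // stand-in for the question's element type

int main() {
    std::vector<foo> arr(10);  // the length is stored for you, memory is freed automatically
    try {
        foo x = arr.at(42);    // checked access: throws instead of reading garbage
        (void)x;
    } catch (const std::out_of_range&) {
        std::cout << "out of range!\n";
    }
}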
In C, this isn't necessarily a bad idea, but I'd drop the pointer in your initialization function, and just return the struct. It's got an int and a pointer, and won't be very big, typically no more than twice the size of a pointer. You probably don't want to have random printfs in your access functions anyway, as if you do go out of bounds you'll get random messages that won't be very helpful even if you look for them.
Most likely the major performance hit will come from checking the index for every access, thus breaking pipelining in the processor, rather than the extra indirection. It seems to me unlikely that an optimizer would find a way to optimize away the check when it's definitely not necessary.
For example, this will be very noticeable in long loops traversing the entire array, which is a relatively common pattern.
And just for the sake of it:
- You should initialize the length field too in ar()
- You should check for ix < 0 in I()
I don't have any formal studies to cite, but echoes I've had from languages where array bounds checking is optional are that turning it off rarely speeds up a program perceptibly.
If you have C code that you'd like to make safer, you may be interested in Cyclone.
You can test it yourself, but on certain machines you may have serious performance issues under different scenarios. If you are looping over millions of elements, then checking the bounds every time will lead to numerous cache misses. How much of an impact that will have depends on what your code is doing. Again, you could test this pretty quickly.