C/C++ overwriting array bounds

What is a good way to detect bugs where I overwrite an array bound?
int a[100];
for (int i = 0; i<1000; i++) a[i] = i;
It would be helpful to collect a list of different strategies that people have used in their experience to uncover bugs of this type.
For example, getting a backtrace from the point of the memory fault (for me this often doesn't work because the stack has been corrupted).

Valgrind will spot this sort of thing pretty reliably!

Use a std::vector, and either use .at() - which always checks ranges - or use [] and turn on range checking in your compiler or standard library.
Edit - if you have a C++ compiler there is NO reason not to use std::vector. It is no slower than an array (when bounds checking is off), and you can use exactly the same loops with .size() and [] - you don't need to be scared off by complex iterators.
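As a minimal sketch of the difference (whether an out-of-range [] is caught depends on your standard library's debug checks, e.g. -D_GLIBCXX_DEBUG for libstdc++ or MSVC's debug iterator checks):
#include <cstddef>
#include <iostream>
#include <stdexcept>
#include <vector>

int main() {
    std::vector<int> a(100);

    for (std::size_t i = 0; i < a.size(); ++i)
        a[i] = static_cast<int>(i);     // same syntax as a raw array

    try {
        a.at(1000) = 0;                 // .at() always checks the index
    } catch (const std::out_of_range& e) {
        std::cerr << "caught: " << e.what() << '\n';
    }

    // a[1000] = 0;  // unchecked: undefined behaviour unless your library's
    //               // debug checks are enabled
}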

Static code analysis (e.g. lint)
Runtime memory analysis (e.g. valgrind)
Avoid fixed-size buffers, prefer dynamically sized containers
Use sizeof() instead of magic numbers whenever you can

Write unit tests and run them under valgrind. Such bugs are relatively easily caught at the unit-test level.
Overwriting the end of an array is undefined behaviour, and as such the compiler is not required to issue a diagnostic.
Some static analysis tools might help, but they sometimes give false alarms.

Some good suggestions here.
Here's some more, especially for C-style code rather than C++:
Avoid certain unsafe string and memory functions. In particular, if a function writes to a buffer and doesn't let you specify a size, don't use it. Examples of functions to avoid: strcpy, strcat, sprintf, gets, scanf("%s", ptr). Anywhere these are used is a red flag. Instead use things like memcpy, strncpy (or better yet, strlcpy, though it's not available everywhere), snprintf, fgets.
When writing your own interfaces, you should always be able to answer the question: how big are the buffers I'm using? Usually this means keeping a parameter to track the size, for example as memcpy does.
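A small sketch of that rule using the sized C standard library variants (the function and buffer names here are just for illustration):
#include <cstdio>
#include <cstring>

void copy_name(char* dest, std::size_t dest_size, const char* src) {
    // snprintf never writes more than dest_size bytes and always NUL-terminates
    std::snprintf(dest, dest_size, "%s", src);
}

int main() {
    char name[16];
    copy_name(name, sizeof(name), "a string that is longer than the buffer");

    char line[64];
    if (std::fgets(line, sizeof(line), stdin)) {   // fgets takes the buffer size
        std::printf("read: %s", line);
    }
}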

While using STL containers like vector is best, there are some handy idioms for controlling this kind of thing, such as this one that I've used quite a bit.
int a[100];
const size_t A_SIZE = sizeof(a) / sizeof(*a);
for ( int i = 0; i < A_SIZE; ++i )...

Just dynamically allocate memory for your arrays and use exception handling to figure out whether you have enough room.

Related

Is there ever a valid reason to use C-style arrays in C++?

Between std::vector and std::array in TR1 and C++11, there are safe alternatives for both dynamic and fixed-size arrays which know their own length and don't exhibit horrible pointer/array duality.
So my question is, are there any circumstances in C++ when C arrays must be used (other than calling C library code), or is it reasonable to "ban" them altogether?
EDIT:
Thanks for the responses everybody, but it turns out this question is a duplicate of
Now that we have std::array what uses are left for C-style arrays?
so I'll direct everybody to look there instead.
[I'm not sure how to close my own question, but if a moderator (or a few more people with votes) wander past, please feel free to mark this as a dup and delete this sentence.]
I didn't want to answer this at first, but I'm already getting worried that this question is going to be swamped with C programmers, or people who write C++ as object-oriented C.
The real answer is that in idiomatic C++ there is almost never a reason to use a C-style array. Even when working against a C-style code base, I usually use vectors. How is that possible, you say? Well, if you have a vector v and a C-style function requires a pointer to be passed in, you can pass &v[0] (or better yet, v.data(), which is the same thing).
Even for performance, it's very rare that you can make a case for a C-style array. A std::vector does involve a double indirection, but I believe this is generally optimized away. If you don't trust the compiler (which is almost always a terrible move), you can always use the same technique as above with v.data() to grab a pointer for your tight loop. For std::array, the wrapper is even thinner.
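As an illustrative sketch of that point (the C-style function below, fill_with_ramp, is hypothetical; the only thing that matters is that a contiguous vector can be handed to any pointer-plus-length interface):
#include <cstddef>
#include <vector>

// hypothetical C-style interface expecting a raw pointer and a length
extern "C" void fill_with_ramp(int* buf, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) buf[i] = static_cast<int>(i);
}

int main() {
    std::vector<int> v(100);
    fill_with_ramp(v.data(), v.size());   // &v[0] would work equally well
}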
You should only use one if you are an awesome programmer and you know exactly why you are doing it, or if an awesome programmer looks at your problem and tells you to. If you aren't awesome and you are using C-style arrays, the chances are high (but not 100%) that you are making a mistake.
Foo data[] = {
is a pretty common pattern. Elements can be added to it easily, and the size of the data array grows based on the elements added.
With C++11 you can replicate this with a std::array:
#include <array>
#include <utility>

template<class T, class... Args>
auto make_array( Args&&... args )
    -> std::array< T, sizeof...(Args) >
{
    return { std::forward<Args>(args)... };
}
but even this isn't as good as one might like, as it does not support nested brackets like a C array does.
Suppose Foo was struct Foo { int x; double y; };. Then with C style arrays we can:
Foo arr[] = {
{1,2.2},
{3,4.5},
};
meanwhile
auto arr = make_array<Foo>(
    {1,2.2},
    {3,4.5}
);
does not compile. You'd have to repeat Foo for each line:
auto arr = make_array<Foo>(
    Foo{1,2.2},
    Foo{3,4.5}
);
which is copy-paste noise that can get in the way of the code being expressive.
Finally, note that "hello" is a const char array of size 6 (five characters plus the terminating null). Code needs to know how to consume C-style arrays.
My typical response to this situation is to convert C-style arrays and C++ std::arrays into array_views, a range that consists of two pointers, and operate on them. This means I do not care if I was fed an array based on C or C++ syntax: I just care I was fed a packed sequence of data elements. These can also consume std::dynarrays and std::vectors with little work.
It did require writing an array_view, or stealing one from boost, or waiting for it to be added to the standard.
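A minimal sketch of what such an array_view might look like (an illustration only, not the interface of any particular library; std::span in C++20 and gsl::span eventually filled this role):
#include <array>
#include <cstddef>
#include <vector>

// Minimal, non-owning view over a contiguous sequence: just two pointers.
template <class T>
struct array_view {
    T* first = nullptr;
    T* last  = nullptr;

    template <std::size_t N>
    array_view(T (&arr)[N]) : first(arr), last(arr + N) {}              // C array

    template <std::size_t N>
    array_view(std::array<T, N>& a) : first(a.data()), last(a.data() + N) {}

    array_view(std::vector<T>& v) : first(v.data()), last(v.data() + v.size()) {}

    T* begin() const { return first; }
    T* end()   const { return last; }
    std::size_t size() const { return static_cast<std::size_t>(last - first); }
};

// Code written against the view no longer cares how the sequence was declared.
int sum(array_view<int> v) {
    int total = 0;
    for (int x : v) total += x;
    return total;
}

int main() {
    int c_arr[] = {1, 2, 3};
    std::array<int, 3> std_arr = {4, 5, 6};
    std::vector<int> vec = {7, 8, 9};
    return sum(c_arr) + sum(std_arr) + sum(vec);   // all three just work
}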
Sometimes an existing code base can force you to use them.
The last time I needed to use them in new code was when I was doing embedded work and the standard library just didn't have an implementation of std::vector or std::array. In some older code bases you have to use arrays because of design decisions made by the previous developers.
In most cases if you are starting a new project with C++11 the old C style arrays are a fairly poor choice. This is because relative to std::array they are difficult to get correct and this difficulty is a direct expense when developing. This C++ FAQ entry sums up my thoughts on the matter fairly well: http://www.parashift.com/c++-faq/arrays-are-evil.html
Pre-C++14: In some (rare) cases, skipping the initialization of types like int can improve execution speed noticeably, especially if an algorithm needs many short-lived arrays during its execution, the machine doesn't have enough memory for pre-allocation to make sense, and/or the sizes can't be known in advance.
C-style arrays are very useful in embedded system where memory is constrained (and severely limited).
The arrays allow for programming without dynamic memory allocation. Dynamic memory allocation generates fragmented memory and at some point in run-time, the memory has to be defragmented. In safety critical systems, defragmentation cannot occur during the periods that have critical timing.
The const arrays allow for data to be put into Read Only Memory or Flash memory, out of the precious RAM area. The data can be directly accessed and does not require any additional initialization time, as with std::vector or std::array.
The C-style array is a convenient tool to place raw data into a program, for example bitmap data for images or fonts. In smaller embedded systems with no hard drives or flash drives, the data must be accessed directly. C-style arrays allow for this.
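For instance, a small lookup table like the (made-up) glyph below is typically placed by an embedded toolchain in a read-only/flash section and read in place, with no start-up copying or allocation:
#include <cstdint>

// Hypothetical 8x8 bitmap glyph, one byte per row. Because it is const and
// statically initialized, a typical embedded toolchain puts it in ROM/flash.
static const std::uint8_t glyph_A[8] = {
    0x18, 0x24, 0x42, 0x7E, 0x42, 0x42, 0x42, 0x00
};

bool pixel_set(unsigned row, unsigned col) {
    return (glyph_A[row] >> (7 - col)) & 1u;   // read directly from the table
}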
Edit 1:
Also, std::array cannot be used with compilers that don't support C++11 or later.
Many companies do not want to switch compilers once a project has started. They may also need to keep the compiler version around for maintenance fixes, or because agencies require the company to reproduce an issue with a specific software version of the product.
I found just one reason today: when you want to know the size of a data block precisely and control its alignment within a giant data block.
This is useful when you are dealing with stream processors or streaming extensions like AVX or SSE.
Controlling the allocation of data as one huge, aligned block in memory is useful: your objects can manipulate the segments they are responsible for, and when they are finished you can move and/or process the huge block in an aligned way.
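A small sketch of the idea (32-byte alignment matches the AVX register width; the array name and sizes are illustrative only, and the code assumes AVX is enabled at compile time, e.g. -mavx):
#include <immintrin.h>

// A fixed-size, 32-byte-aligned block: the compiler knows its exact size and
// alignment, so aligned AVX loads/stores are safe on every 8-float segment.
alignas(32) static float block[1024];

void scale_block(float factor) {
    const __m256 f = _mm256_set1_ps(factor);
    for (int i = 0; i < 1024; i += 8) {
        __m256 v = _mm256_load_ps(&block[i]);              // aligned load
        _mm256_store_ps(&block[i], _mm256_mul_ps(v, f));   // aligned store
    }
}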

Overhead to using std::vector?

I know that manual dynamic memory allocation is a bad idea in general, but is it sometimes a better solution than using, say, std::vector?
To give a crude example, if I had to store an array of n integers, where n <= 16, say. I could implement it using
int* data = new int[n]; //assuming n is set beforehand
or using a vector:
std::vector<int> data;
Is it absolutely always a better idea to use a std::vector or could there be practical situations where manually allocating the dynamic memory would be a better idea, to increase efficiency?
It is always better to use std::vector/std::array, at least until you can conclusively prove (through profiling) that the T* a = new T[100]; solution is considerably faster in your specific situation. This is unlikely to happen: vector/array is an extremely thin layer around a plain old array. There is some overhead to bounds checking with vector::at, but you can circumvent that by using operator[].
I can't think of any case where dynamically allocating a C-style array makes sense. (I've been working in C++ for over 25 years, and I've yet to use new[].) Usually, if I know the size up front, I'll use something like:
std::vector<int> data( n );
to get an already-sized vector, rather than using push_back.
Of course, if n is very small and is known at compile time, I'll use std::array (if I have access to C++11), or even a C-style array, and just create the object on the stack, with no dynamic allocation. (Such cases seem to be rare in the code I work on; small fixed-size arrays tend to be members of classes, where I do occasionally use a C-style array.)
If you know the size in advance (especially at compile time), and don't need the dynamic re-sizing abilities of std::vector, then using something simpler is fine.
However, that something should preferably be std::array if you have C++11, or something like boost::scoped_array otherwise.
I doubt there'll be much efficiency gain unless it significantly reduces code size or something, but it's more expressive, which is worthwhile anyway.
You should try to avoid C-style arrays in C++ whenever possible. The STL provides containers which usually suffice for every need. Just imagine reallocating an array or deleting elements from its middle. The container shields you from handling this, while you would have to take care of it yourself, and unless you have done it a hundred times it is quite error-prone.
An exception, of course, is if you are addressing low-level issues which might not be able to cope with STL containers.
There has already been some discussion about this topic. See here on SO.
Is it absolutely always a better idea to use a std::vector or could there be practical situations where manually allocating the dynamic memory would be a better idea, to increase efficiency?
Call me a simpleton, but 99.9999...% of the time I would just use a standard container. The default choice should be std::vector, but std::deque<> can also be a reasonable option sometimes. If the size is known at compile-time, opt for std::array<>, which is a lightweight, safe wrapper around C-style arrays that introduces zero overhead.
Standard containers expose member functions to specify the initial reserved amount of memory, so you won't have trouble with reallocations, and you won't have to remember to delete[] your array. I honestly do not see why one should use manual memory management.
Efficiency shouldn't be an issue, since you have throwing and non-throwing member functions to access the contained elements, so you have a choice whether to favor safety or performance.
std::vector can be constructed with a size_type parameter that instantiates the vector with the specified number of elements and performs a single dynamic allocation (same as your array); you can also use reserve to decrease the number of reallocations over the lifetime of the vector.
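A brief sketch of the two options just mentioned:
#include <cstddef>
#include <vector>

int main() {
    const std::size_t n = 16;

    // One allocation, n value-initialized elements.
    std::vector<int> a(n);

    // One allocation up front, elements added later without reallocating.
    std::vector<int> b;
    b.reserve(n);
    for (std::size_t i = 0; i < n; ++i)
        b.push_back(static_cast<int>(i));
}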
If n is known at compile-time, then you should choose std::array as:
std::array<int, n> data; //n is compile-time constant
and if n is not known at compile-time, OR the array might grow at runtime, then go for std::vector:
std::vector<int> data(n); //n may be known at runtime
Or in some cases you may prefer std::deque, which is faster than std::vector in some scenarios. See these:
C++ benchmark – std::vector VS std::list VS std::deque
Using Vector and Deque by Herb Sutter
Hope that helps.
From the perspective of someone who often works with low-level code in C++, std::vector is really just a set of helper methods with a safety net around a classic C-style array. The only overheads you'd realistically experience are memory allocations and safety checks on the boundaries. If you're writing a program which needs performance and you are going to be using vectors as regular arrays, I'd recommend just using C-style arrays instead of vectors. You should realistically be vetting the data that comes into the application and checking the boundaries yourself, to avoid checks on every memory access to the array.
It's good to see that others are checking the differences between the C ways and the C++ ways. More often than not, C++ standard methods have significantly worse performance and uglier syntax than their C counterparts, which is generally the reason people call C++ bloated. I think C++ focuses more on safety and on making the language more like JavaScript/C#, even though the language fundamentally lacks the foundation to be one.

C++ Performance of structs used as a safe packaging of arrays

In C or C++, there is no bounds checking on arrays. One way to work around this is to package the array in a struct:
struct array_of_foo{
int length;
foo *arr; //array with variable length.
};
Then, it can be initialized:
array_of_foo *ar(int length){
    array_of_foo *out = (array_of_foo*) malloc(sizeof(array_of_foo));
    out->arr = (foo*) malloc(length*sizeof(foo));
    return out;
}
And then accessed:
foo I(array_of_foo *ar, int ix){ //may need to be foo* I(...
    if(ix > ar->length - 1){ printf("out of range!\n"); } //error
    return ar->arr[ix];
}
And finally freed:
void freeFoo(array_of_foo *ar){ //is it necessary to free both ar->arr and ar?
    free(ar->arr); free(ar);
}
This way it can warn programmers about out-of-bounds access. But will this packaging slow down performance substantially?
I agree with the std::vector recommendation. Additionally you might try the boost::array library, which includes a complete (and tested) implementation of a fixed-size array container:
http://svn.boost.org/svn/boost/trunk/boost/array.hpp
In C++, there's no need to come up with your own incomplete version of vector. (To get bounds checking on vector, use .at() instead of []. It'll throw an exception if you get out of bounds.)
In C, this isn't necessarily a bad idea, but I'd drop the pointer in your initialization function, and just return the struct. It's got an int and a pointer, and won't be very big, typically no more than twice the size of a pointer. You probably don't want to have random printfs in your access functions anyway, as if you do go out of bounds you'll get random messages that won't be very helpful even if you look for them.
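A sketch of that suggestion (the struct mirrors the question's, make_array_of_foo is a made-up name, and the trivial foo is a placeholder):
#include <cstdlib>

struct foo { int x; };          // placeholder element type

struct array_of_foo {
    int length;
    foo *arr;                   // array with variable length
};

// Return the small struct by value instead of malloc'ing it:
// it is just an int and a pointer, so copying it is cheap.
array_of_foo make_array_of_foo(int length) {
    array_of_foo out;
    out.length = length;
    out.arr = (foo*) std::malloc(length * sizeof(foo));
    return out;
}

void free_array_of_foo(array_of_foo *ar) {
    std::free(ar->arr);         // only the element buffer is heap-allocated now
}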
Most likely the major performance hit will come from checking the index on every access, thus breaking pipelining in the processor, rather than from the extra indirection. It seems unlikely to me that an optimizer would find a way to optimize away the check even in cases where it's clearly not necessary.
For example, this will be very noticeable in long loops traversing the entire array - which is a relatively common pattern.
And just for the sake of it:
- You should initialize the length field too in ar()
- You should check for ix < 0 in I()
I don't have any formal studies to cite, but the echoes I've had from languages where array bounds checking is optional are that turning it off rarely speeds a program up perceptibly.
If you have C code that you'd like to make safer, you may be interested in Cyclone.
You can test it yourself, but on certain machines you may have serious performance issues under different scenarios. If you are looping over millions of elements, then checking the bounds every time will lead to numerous cache misses. How much of an impact that will have depends on what your code is doing. Again, you could test this pretty quickly.

How does using arrays in C++ result in security problems

I was told that the optimal way to program in C++ is to use STL and string rather than arrays and character arrays.
i.e.,
vector<int> myInt;
rather than
int myInt[20]
However, I don't understand the rationale for why using arrays would result in security problems.
I suggest you read up on buffer overruns, then. It's much more likely that a programmer creates or risks buffer overruns when using raw arrays, since they give you less protection and don't offer an API. Sure, it's possible to shoot yourself in the foot using STL too, but at least it's harder.
There appears to be some confusion here about what security vectors can and cannot provide. Ignoring the use of iterators, there are three main ways of accessing elements in a vector:
the operator[] function of vector - this provides no bounds checking and will result in undefined behaviour on a bounds error, in the same way as an array would if you use an invalid index.
the at() vector member function - this provides bounds checking and will raise an exception if an invalid index is used, at a small performance cost.
the [] operator applied to the vector's underlying array (obtained via data() or &v[0]) - this provides no bounds checking, but gives the highest possible access speed.
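A short sketch of those three access paths side by side:
#include <vector>

int main() {
    std::vector<int> v = {10, 20, 30};

    int a = v[1];          // unchecked: undefined behaviour on a bad index
    int b = v.at(1);       // checked: throws std::out_of_range on a bad index
    int c = v.data()[1];   // raw pointer into the underlying array, unchecked

    return a + b + c;
}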
Arrays don't perform bounds checking. Hence they are very vulnerable to out-of-bounds errors, which can be hard to detect.
Note: the following code has a programming error.
int Data[] = { 1, 2, 3, 4 };
int Sum = 0;
for (int i = 0; i <= 4; ++i) Sum += Data[i];
Using arrays like this, you won't get an exception that helps you find the error; only an incorrect result.
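For contrast, here is a sketch of the same off-by-one mistake written against std::vector using at(), where the error announces itself instead of silently producing a wrong sum:
#include <iostream>
#include <stdexcept>
#include <vector>

int main() {
    std::vector<int> data = {1, 2, 3, 4};
    int sum = 0;
    try {
        for (int i = 0; i <= 4; ++i) sum += data.at(i);  // same off-by-one bug
    } catch (const std::out_of_range& e) {
        std::cerr << "bug found: " << e.what() << '\n';  // thrown at i == 4
    }
}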
Arrays don't know their own size, whereas a vector defines begin and end methods to access its elements. With arrays you'll always have to rely on pointer arithmetic (and since they decay to nothing but pointers, you can accidentally cast them).
C++ arrays do not perform bounds checking, on either writes or reads, and it is quite easy to accidentally access items outside of the array bounds.
From an OO perspective, the vector also has more knowledge about itself and so can take care of its own housekeeping.
Your example has a static array with a fixed number of items; depending on your algorithm, this may be just as safe as a vector with a fixed number of items.
However, as a rule of thumb, when you want to dynamically allocate an array of items, a vector is much easier and also lets you make fewer mistakes. Any time you have to think, there's a possibility for a bug, which might be exploited.

Are there any practical limitations to only using std::string instead of char arrays and std::vector/list instead of arrays in c++?

I use vectors, lists, strings and wstrings obsessively in my code. Are there any catch-22s involved that should make me more interested in using arrays, chars and wchars from time to time instead?
Basically, if working in an environment which supports the standard template library is there any case using the primitive types is actually better?
For 99% of the time and for 99% of Standard Library implementations, you will find that std::vectors will be fast enough, and the convenience and safety you get from using them will more than outweigh any small performance cost.
For those very rare cases when you really need bare-metal code, you can treat a vector like a C-style array:
vector <int> v( 100 );
int * p = &v[0];
p[3] = 42;
The C++ standard guarantees that vectors are allocated contiguously, so this is guaranteed to work.
Regarding strings, the convenience factor becomes almost overwhelming, and the performance issues tend to go away. If you go back to C-style strings, you are also going back to the use of functions like strlen(), which are inherently very inefficient themselves.
As for lists, you should think twice, and probably thrice, before using them at all, whether your own implementation or the standard one. The vast majority of computing problems are better solved using a vector/array. The reason lists appear so often in the literature is in large part because they are a convenient data structure for textbook and training-course writers to use to explain pointers and dynamic allocation in one go. I speak here as an ex-training-course writer.
I would stick to the STL classes (vectors, strings, etc.). They are safer and easier to use, make you more productive, and are less likely to leak memory; AFAIK, they also perform some additional run-time checking of boundaries, at least in DEBUG builds (Visual C++).
Then, measure the performance. If you identify that the bottleneck(s) are in the STL classes, then move to C-style strings and arrays.
From my experience, the chances that the bottleneck is in vector or string usage are very low.
One problem is the overhead when accessing elements. Even with vector and string, when you access an element by index you need to first retrieve the buffer address and then add the offset (you don't do it manually, but the compiler emits such code). With a raw array you already have the buffer address. This extra indirection can lead to significant overhead in certain cases and is worth profiling when you want to improve performance.
If you don't need real time responses, stick with your approach. They are safer than chars.
You can occasionally encounter scenarios where you'll get better performance or memory usage from doing some stuff yourself (for example, std::string typically has about 24 bytes of overhead: 12 bytes for the pointers in the std::string itself, plus a header block on its dynamically allocated piece).
I have worked on projects where converting from std::string to const char* saved noticeable memory (10's of MB). I don't believe these projects are what you would call typical.
Oh, using STL will hurt your compile times, and at some point that may be an issue. When your project results in over a GB of object files being passed to the linker, you might want to consider how much of that is template bloat.
I've worked on several projects where the memory overhead for strings has become problematic.
It's worth considering in advance how your application needs to scale. If you need to be storing an unbounded number of strings, using const char*s into a globally managed string table can save you huge amounts of memory.
But generally, definitely use STL types unless there's a very good reason to do otherwise.
I believe the default allocation strategy for vectors and strings is to allocate double the current capacity each time the currently allocated memory gets used up. This can be wasteful. You can provide a custom allocator, of course...
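The growth factor is implementation-defined (commonly 1.5x or 2x); a quick way to see what your library does is to watch capacity() as you push_back:
#include <iostream>
#include <vector>

int main() {
    std::vector<int> v;
    std::size_t last_capacity = 0;
    for (int i = 0; i < 1000; ++i) {
        v.push_back(i);
        if (v.capacity() != last_capacity) {        // a reallocation happened
            last_capacity = v.capacity();
            std::cout << "size " << v.size()
                      << " -> capacity " << last_capacity << '\n';
        }
    }
}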
The other thing to consider is stack vs. heap. Statically sized arrays and strings can sit on the stack, or at least the compiler handles the memory management for you. Newer compilers will handle dynamically sized arrays for you too if they provide the relevant C99 feature (variable-length arrays). Vectors and strings will always use the heap, and this can introduce performance issues if you have really tight constraints.
As a rule of thumb, use what's already there unless it hurts your project with its speed/memory overhead... you'll probably find that for 99% of stuff, the STL-provided classes save you time and effort with little to no impact on your application's performance (i.e. "avoid premature optimisation").