Stack overflow in C++ with big array

Well, I am writing a program for the university, where I have to put a data dump into HDF format. The data dump looks like this:
"1444394028","1","5339","M","873"
"1444394028","1","7045","V","0.34902"
"1444394028","1","7042","M","2"
"1444394028","1","7077","V","0.0470588"
"1444394028","1","5415","M","40"
"1444394028","1","7077","V","0.462745"
"1444394028","1","7076","B","10001101"
"1444394028","1","7074","M","19"
"1444394028","1","7142","M","16"
"1444394028","1","7141","V","0.866667"
For the HDF5 API I need an array. So my method at the moment is, to write the data dump into an array like this:
int count = 0;
std::ifstream countInput("share/ObservationDump.txt");
std::string line;
if(!countInput) cout << "File not found" << endl;
while( std::getline( countInput, line ) ) {
count++;
}
cout << count << endl;
struct_t writedata[count];
int i = 0;
std::ifstream dataInput("share/ObservationDump.txt");
std::string line2;
char delimeter(',');
std::string timestampTemp, millisecondsSinceStartTemp, deviceTemp, typeTemp, valueTemp;
while (std::getline(dataInput, timestampTemp, delimeter) )
{
std::getline(dataInput, millisecondsSinceStartTemp, delimeter);
std::getline(dataInput, deviceTemp, delimeter);
std::getline(dataInput, typeTemp, delimeter);
std::getline(dataInput, valueTemp);
writedata[i].timestamp = atoi(timestampTemp.substr(1, timestampTemp.size()-2).c_str());
writedata[i].millisecondsSinceStart = atoi(millisecondsSinceStartTemp.substr(1, millisecondsSinceStartTemp.size()-2).c_str());
writedata[i].device = atoi(deviceTemp.substr(1, deviceTemp.size()-2).c_str());
writedata[i].value = atof(valueTemp.substr(1, valueTemp.size()-2).c_str());
writedata[i].type = *(typeTemp.substr(1, typeTemp.size()-2).c_str());
i++;
}
with struct_t defined as
struct struct_t
{
int timestamp;
int millisecondsSinceStart;
int device;
char type;
double value;
};
As some of you might see, with big data dumps (at about 60 thousand lines) the array writedata tends to generate a stack overflow (segmentation error). I need an array to pass it to my HDF adapter. How can I prevent the overflow? I was not able to find answers by extensive googling. Thanks in advance!

The example code you are following is in C, while the code you are writing is in C++. In most cases, valid C code is valid C++ code, although not necessarily good style; this is one of the times where it is not, although since that isn’t your real problem I’ll leave the explanation for the end of my answer.
When you declare struct_t writedata[count];, you are creating an array on the stack. The stack is often artificially limited in size, and so creating a large array on the stack could lead to a problem where you run out of stack space. This is what you are seeing. The typical solution is to create large data structures in the heap (although the primary use of the heap is to make data that lasts past the return of the function that creates it).
The most C++-idiomatic way to access the heap is to not do it directly, but to use a helper container class. In this case what you want is an std::vector, which lets you push data onto the end and will automatically grow as you push on more data. Since it automatically grows, you don’t need to specify the size in advance; just declare it as a std::vector<struct_t> writedata; (read “std::vector of struct_t”). Again, since it doesn’t need to know the size in advance, you can also ignore the whole first loop.
The vector is initially empty; to put data into it, you usually want to use writedata.push_back() or writedata.emplace_back(). The first of these takes an existing struct_t; the second takes the parameters you’d use to create one. All of the elements are stored contiguously in memory, like in a C array, which you can access directly with writedata.data().
At the end of the function, when the vector goes out of scope and is no longer accessible, its destructor will be called and automatically clean up the memory it used.
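To make the std::vector approach concrete, here is a minimal sketch of reading the dump into a vector. The names readDump and unquote are made up for illustration, and it assumes every line has exactly five quoted, comma-separated fields as in the sample data; it is not the asker's actual code.

```cpp
#include <cassert>
#include <cstdlib>
#include <sstream>
#include <string>
#include <vector>

struct struct_t
{
    int timestamp;
    int millisecondsSinceStart;
    int device;
    char type;
    double value;
};

// Hypothetical helper: strips the surrounding double quotes from one field.
static std::string unquote(const std::string& field)
{
    if (field.size() >= 2 && field.front() == '"' && field.back() == '"') {
        return field.substr(1, field.size() - 2);
    }
    return field;
}

// Hypothetical helper: reads every record into a vector that grows on demand,
// so no counting pass and no fixed-size stack array are needed.
std::vector<struct_t> readDump(std::istream& input)
{
    std::vector<struct_t> writedata;
    std::string line;
    while (std::getline(input, line)) {
        std::istringstream fields(line);
        std::string timestamp, milliseconds, device, type, value;
        if (!std::getline(fields, timestamp, ',') ||
            !std::getline(fields, milliseconds, ',') ||
            !std::getline(fields, device, ',') ||
            !std::getline(fields, type, ',') ||
            !std::getline(fields, value)) {
            continue; // skip empty or malformed lines
        }
        const std::string typeField = unquote(type);
        struct_t record;
        record.timestamp = std::atoi(unquote(timestamp).c_str());
        record.millisecondsSinceStart = std::atoi(unquote(milliseconds).c_str());
        record.device = std::atoi(unquote(device).c_str());
        record.type = typeField.empty() ? '\0' : typeField[0];
        record.value = std::atof(unquote(value).c_str());
        writedata.push_back(record); // the vector reallocates on the heap as needed
    }
    return writedata;
}
```

You would open the file with std::ifstream, pass it to readDump, and then hand writedata.data() together with writedata.size() to the HDF5 call that expects a plain array.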
Another option, instead of using std::vector, is to manage the memory yourself. The C++ way of doing that is with new and delete. The easiest way to handle that is to still calculate count, as you do, but instead of creating the array on the stack by just declaring it as a count-sized array, you do struct_t* writedata = new struct_t[count];. This will create an array of count struct_ts in the heap, and set writedata as a pointer to the first element of this array. Then you can use it as you use the array in your program, but since it’s on the heap you won’t run out of stack space.
The downsides to this are that you need to know the size in advance, and you need to clean up the memory you used yourself. To do this, when you no longer need the data, you should run delete[] writedata. After that, writedata will still point to the same place in memory, but your program no longer owns that data, so you need to make sure to never use that value again; the standard way is to, immediately after deletion, set writedata to nullptr.
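As a rough illustration of the manual route, the sketch below allocates the records with new[], uses them, and releases them with delete[]. The function name sumValuesOnHeap and the dummy values it stores are assumptions made up for this example.

```cpp
#include <cassert>
#include <cstddef>

struct struct_t
{
    int timestamp;
    int millisecondsSinceStart;
    int device;
    char type;
    double value;
};

// Hypothetical example: fills count heap-allocated records with dummy data
// and returns the sum of their value fields.
double sumValuesOnHeap(std::size_t count)
{
    assert(count > 0); // this sketch expects at least one record

    struct_t* writedata = new struct_t[count]; // lives on the heap, not the stack
    for (std::size_t i = 0; i < count; ++i) {
        writedata[i].timestamp = 0;
        writedata[i].millisecondsSinceStart = 0;
        writedata[i].device = 0;
        writedata[i].type = 'M';
        writedata[i].value = 0.5;
    }

    double sum = 0.0;
    for (std::size_t i = 0; i < count; ++i) {
        sum += writedata[i].value;
    }

    delete[] writedata;  // must match new[]
    writedata = nullptr; // guards against accidental reuse
    return sum;
}
```

Note that if anything between new[] and delete[] throws or returns early, the array leaks, which is the main reason to prefer std::vector.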
You can also use the C equivalents to new and delete, which are malloc and free. They are mostly equivalent in your case, but for more complicated examples you should keep in mind that these leave the memory uninitialized, while new and delete will run the constructors/destructors of what you create to make sure the objects are in a sane state at the beginning and don’t leave resources lying around at the end.
Now for why your original code isn’t actually valid C++ for any size of file: Your line struct_t writedata[count]; tries to create an array of count struct_ts. Since count is a variable, this is called a variable-length array (VLA). Such things are legal in newer versions of C, but not in C++. This alone is just worth a warning as long as you only want to compile the code on the same system you’re currently using, since your compiler seems to support VLAs as an extension. However, if you want to compile your code on any other system (make it more portable), you shouldn’t use compiler extensions like this.

struct_t writedata[count];
This array is allocated on the stack, which is normally quite small. Once the array grows past the available stack space (the exact limit is semi-arbitrary), it will overflow the stack.
You'd be better off allocating on the heap by doing something like:
struct_t* writedata = (struct_t*)malloc(sizeof(struct_t) * count);
And then add a corresponding call to free once you're finished with the memory, e.g.
free(writedata);
writedata = nullptr;
It's best practice to check that i < count in your while loop, as if you write off the end of your array Bad Things may happen.

Related

practical explanation of c++ functions with pointers

I am relatively new to C++...
I am learning and coding but I am finding the idea of pointers to be somewhat fuzzy. As I understand it * points to a value and & points to an address...great but why? Which is byval and which is byref and again why?
And while I feel like I am learning and understanding the idea of stack vs heap, runtime vs design time etc, I don't feel like I'm fully understanding what is going on. I don't like using coding techniques that I don't fully understand.
Could anyone please elaborate on exactly what and why the pointers in this fairly "simple" function below are used, esp the pointer to the function itself.. [got it]
Just asking how to clean up (delete[]) the str... or if it just goes out of scope.. Thanks.
char *char_out(AnsiString ansi_in)
{
// allocate memory for char array
char *str = new char[ansi_in.Length() + 1];
// copy contents of string into char array
strcpy(str, ansi_in.c_str());
return str;
}

Revision 3
TL;DR:
AnsiString appears to be an object which is passed by value to that function.
char* str is on the stack.
A new array is created on the heap with (ansi_in.Length() + 1) elements. A pointer to the array is stored in str. +1 is used because strings in C/C++ typically use a null terminator, which is a special character used to identify the end of the string when scanning through it.
ansi_in.c_str() is called, copying a pointer to its string buffer into an unnamed local variable on the stack.
str and the temporary pointer are pushed onto the stack and strcpy is called. This has the effect of copying the string (including the null terminator) pointed at by the temporary into the memory str points to.
str is returned to the caller
Long answer:
You appear to be struggling to understand stack vs heap, and pointers vs non-pointers. I'll break them down for you and then answer your question.
The stack is a concept where a fixed region of memory is allocated for each thread before it starts and before any user code runs.
Ignoring lower level details such as calling conventions and compiler optimizations, you can reason that the following happens when you call a function:
Arguments are pushed onto the stack. This reserves part of the stack for use of the arguments.
The function performs some job, using and copying the arguments as needed.
The function pops the arguments off the stack and returns. This frees the space reserved for the arguments.
This isn't limited to function calls. When you declare objects and primitives in a function's body, space for them is reserved via pushing. When they're out of scope, they're automatically cleaned up by calling destructors and popping.
When your program runs out of stack space and starts using the space outside of it, you'll typically encounter an error. Regardless of what the actual error is, it's known as a stack overflow because you're going past it and therefore "overflowing".
The heap is a different concept where the remaining unused memory of the system is available for you to manually allocate and deallocate from. This is primarily used when you have a large data set that's too big for the stack, or when you need data to persist across arbitrary functions.
C++ is a difficult beast to master, but if you can wrap your head around the core concepts it becomes easier to understand.
Suppose we wanted to model a human:
struct Human
{
const char* Name;
int Age;
};
int main(int argc, char** argv)
{
Human human;
human.Name = "Edward";
human.Age = 30;
return 0;
}
This allocates at least sizeof(Human) bytes on the stack for storing the 'human' object. Right before main() returns, the space for 'human' is freed.
Now, suppose we wanted an array of 10 humans:
int main(int argc, char** argv)
{
Human humans[10];
humans[0].Name = "Edward";
humans[0].Age = 30;
// ...
return 0;
}
This allocates at least (sizeof(Human) * 10) bytes on the stack for storing the 'humans' array. This too is automatically cleaned up.
Note uses of ".". When using anything that's not a pointer, you access their contents using a period. This is direct memory access if you're not using a reference.
Here's the single object version using the heap:
int main(int argc, char** argv)
{
Human* human = new Human();
human->Name = "Edward";
human->Age = 30;
delete human;
return 0;
}
This allocates sizeof(Human*) bytes on the stack for the pointer 'human', and at least sizeof(Human) bytes on the heap for storing the object it points to. The object that 'human' points to is not automatically cleaned up; you must call delete to free it. Note uses of "a->b". When using pointers, you access their contents using the "->" operator. This is indirect memory access, because you're accessing memory through an address stored in a variable.
It's sort of like mail. When someone wants to mail you something they write an address on an envelope and submit it through the mail system. A mailman takes the mail and moves it to your mailbox. For comparison the pointer is the address written on the envelope, the memory management unit(mmu) is the mail system, the electrical signals being passed down the wire are the mailman, and the memory location the address refers to is the mailbox.
Here's the array version using the heap:
int main(int argc, char** argv)
{
Human* humans = new Human[10];
humans[0].Name = "Edward";
humans[0].Age = 30;
// ...
delete[] humans;
return 0;
}
This allocates sizeof(Human*) bytes on the stack for pointer 'humans', and (sizeof(Human) * 10) bytes on the heap for storing the array it points to. 'humans' is also not automatically cleaned up; you must call delete[] to free it.
Note uses of "a[i].b" rather than "a[i]->b". The "[]" operator(indexer) is really just syntactic sugar for "*(a + i)", which really just means treat it as a normal variable in a sequence so I can type less.
In both of the above heap examples, if you didn't write delete/delete[], the memory that the pointers point to would leak. (This is not the same as a dangling pointer, which is a pointer to memory that has already been freed.) A leak is bad because if left unchecked it could eat through all your available memory, eventually crashing when there isn't enough or the OS decides other apps are more important than yours.
Using the stack is usually the wiser choice as you get automatic lifetime management via scope(aka RAII) and better data locality. The only "drawback" to this approach is that because of scoped lifetime you can't directly access your stack variables once the scope has exited. In other words you can only use stack variables within the scope they're declared. Despite this, C++ allows you to copy pointers and references to stack variables, and indirectly use them outside the scope they're declared in. Do note however that this is almost always a very bad idea, don't do it unless you really know what you're doing, I can't stress this enough.
Passing an argument by-ref means pushing a copy of a pointer or reference to the data on the stack. Compilers typically implement references the same way as pointers, so at the machine level the two usually look identical. This is a very lightweight concept, but you typically need to check for null in functions receiving pointers.
Pointer variant of an integer adding function:
int add(const int* firstIntPtr, const int* secondIntPtr)
{
if (firstIntPtr == nullptr) {
throw std::invalid_argument("firstIntPtr cannot be null.");
}
if (secondIntPtr == nullptr) {
throw std::invalid_argument("secondIntPtr cannot be null.");
}
return *firstIntPtr + *secondIntPtr;
}
Note the null checks. If the function didn't verify that its arguments are valid, they may very well be null or point to memory the app doesn't have access to. Attempting to read such values via dereferencing (*firstIntPtr/*secondIntPtr) is undefined behavior and, if you're lucky, results in a segmentation fault (aka access violation on Windows), crashing the program. When this happens and your program doesn't crash, there are deeper issues with your code that are out of the scope of this answer.
Reference variant of an integer adding function:
int add(const int& firstInt, const int& secondInt)
{
return firstInt + secondInt;
}
Note the lack of null checks. By design C++ limits how you can acquire references, so you're not supposed to be able to pass a null reference, and therefore no null checks are required. That said, it's still possible to get a null reference through converting a pointer to a reference, but if you're doing that and not checking for null before converting you have a bug in your code.
Passing an argument by-val means pushing a copy of it on the stack. You almost always want to pass small data structures by value. You don't have to check for null when passing values because you're passing the actual data itself and not a pointer to it.
i.e.
int add(int firstInt, int secondInt)
{
return firstInt + secondInt;
}
No null checks are required because values, not pointers are used. Values can't be null.
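To see the practical difference between the three ways of passing, it may help to look at what the caller observes after a function modifies its parameter. The function names below are made up for this sketch.

```cpp
#include <cassert>

// By value: the function works on its own copy, so the caller's variable is untouched.
void incrementByValue(int number)
{
    ++number;
}

// By reference: the parameter is another name for the caller's variable.
void incrementByReference(int& number)
{
    ++number;
}

// By pointer: the function receives an address and must dereference it.
void incrementByPointer(int* number)
{
    assert(number != nullptr); // pointers can be null, so check before dereferencing
    ++*number;
}
```

Starting from int n = 1, calling incrementByValue(n) leaves n at 1, incrementByReference(n) changes it to 2, and incrementByPointer(&n) changes it to 3.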
Assuming you're interested in learning about all this, I highly suggest you use std::string for all your string needs and std::unique_ptr for managing pointers.
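For example (these are two alternatives; they cannot both exist under the same name, because they differ only in return type):
std::string char_out(AnsiString ansi_in)
{
return std::string(ansi_in.c_str());
}
// Alternative: keep a char array, but let std::unique_ptr free it automatically.
std::unique_ptr<char[]> char_out(AnsiString ansi_in)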
{
std::unique_ptr<char[]> str(new char[ansi_in.Length() + 1]);
strcpy(str.get(), ansi_in.c_str());
return str; // std::move(str) if you're using an older C++11 compiler.
}

What if I delete an array once in C++, but allocate it multiple times?

Suppose I have the following snippet.
int main()
{
int num;
int* cost;
while(cin >> num)
{
int sum = 0;
if (num == 0)
break;
// Dynamically allocate the array and set to all zeros
cost = new int [num];
memset(cost, 0, num);
for (int i = 0; i < num; i++)
{
cin >> cost[i];
sum += cost[i];
}
cout << sum/num;
}
delete[] cost;
return 0;
}
Although I can move the delete statement inside the while loop for my code, for understanding purposes I want to know what happens with the code as it's written. Does C++ allocate different memory spaces each time I use operator new?
Does operator delete only delete the last allocated cost array?

Does C++ allocate different memory spaces each time I use operator new?
Yes.
Does operator delete only delete the last allocated cost array?
Yes.
You've lost the only pointers to the others, so they are irrevocably leaked. To avoid this problem, don't juggle pointers, but use RAII to manage dynamic resources automatically. std::vector would be perfect here (if you actually needed an array at all; your example could just keep reading and re-using a single int).
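For what it's worth, here is roughly how the loop could look with std::vector doing the memory management. This is a sketch, not a drop-in replacement: the function name readAverages is invented, it reads from any std::istream instead of cin, it collects the averages instead of printing them, and it stops on any non-positive count rather than only on 0.

```cpp
#include <cassert>
#include <cstddef>
#include <numeric>
#include <sstream>
#include <vector>

// Hypothetical rewrite of the loop: reads groups of "num cost_1 ... cost_num"
// until num is not positive or the input ends, and returns each group's
// integer average.
std::vector<int> readAverages(std::istream& in)
{
    std::vector<int> averages;
    int num = 0;
    while (in >> num && num > 0) {
        // num zero-initialized ints; the vector frees them at the end of
        // every iteration, so nothing leaks and no delete[] is needed.
        std::vector<int> cost(static_cast<std::size_t>(num));
        for (int& c : cost) {
            in >> c;
        }
        const int sum = std::accumulate(cost.begin(), cost.end(), 0);
        averages.push_back(sum / num);
    }
    return averages;
}
```

Passing std::cin gives the same interactive behaviour as the original; a std::istringstream is handy for trying it out.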

I strongly advise you not to use "C idioms" in a C++ program. Let the std library work for you: that's why it's there. If you want "an array (vector) of n integers," then that's what std::vector is all about, and it "comes with batteries included." You don't have to monkey-around with things such as "setting a maximum size" or "setting it to zero." You simply work with "this thing," whose inner workings you do not [have to ...] care about, knowing that it has already been thoroughly designed and tested.
Furthermore, when you do this, you're working within C++'s existing framework for memory-management. In particular, you're not doing anything "out-of-band" within your own application "that the standard library doesn't know about, and which might (!!) it up."
C++ gives you a very comprehensive library of fast, efficient, robust, well-tested functionality. Leverage it.

There is no cost array in your code. In your code cost is a pointer, not an array.
The actual arrays in your code are created by repetitive new int [num] calls. Each call to new creates a new, independent, nameless array object that lives somewhere in dynamic memory. The new array, once created by new[], is accessible through cost pointer. Since the array is nameless, that cost pointer is the only link you have that leads to that nameless array created by new[]. You have no other means to access that nameless array.
And every time you do that cost = new int [num] in your cycle, you are creating a completely new, different array, breaking the link from cost to the previous array and making cost point to the new one.
Since cost was your only link to the old array, that old array becomes inaccessible. Access to that old array is lost forever. It becomes a memory leak.
As you correctly stated it yourself, your delete[] expression only deallocates the last array - the one cost ends up pointing to in the end. Of course, this is only true if your code ever executes the cost = new int [num] line. Note that your cycle might terminate without doing a single allocation, in which case you will apply delete[] to an uninitialized (garbage) pointer.

Yes. So you get a memory leak for each iteration of the loop except the last one.
When you use new, you allocate a new chunk of memory. Assigning the result of the new to a pointer just changes what this pointer points at. It doesn't automatically release the memory this pointer was referencing before (if there was any).

First off this line is wrong:
memset(cost, 0, num);
It assumes an int is only one char long. More typically it's four. You should use something like this if you want to use memset to initialise the array:
memset(cost, 0, num*sizeof(*cost));
Or better yet dump the memset and use this when you allocate the memory:
cost = new int[num]();
As others have pointed out the delete is incorrectly placed and will leak all memory allocated by its corresponding new except for the last. Move it into the loop.
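Here is a small sketch comparing the two ways of zeroing mentioned above. The helper names are made up; both functions allocate, zero, check and free an array of count ints.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical helper: returns true if every element is zero.
static bool allZero(const int* values, std::size_t count)
{
    for (std::size_t i = 0; i < count; ++i) {
        if (values[i] != 0) {
            return false;
        }
    }
    return true;
}

// memset takes a size in BYTES, so the element count is multiplied by sizeof.
bool zeroWithMemset(std::size_t count)
{
    assert(count > 0);
    int* cost = new int[count];
    std::memset(cost, 0, count * sizeof(*cost));
    const bool ok = allZero(cost, count);
    delete[] cost;
    return ok;
}

// The trailing () value-initializes the array, which zeroes every int.
bool zeroWithValueInit(std::size_t count)
{
    assert(count > 0);
    int* cost = new int[count]();
    const bool ok = allZero(cost, count);
    delete[] cost;
    return ok;
}
```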

Every time you allocate new memory for the array, the memory that has been previously allocated is leaked. As a rule of thumb you need to free memory as many times as you have allocated.

Different categories of memory

static const int MAX_SIZE = 256; //I assume this is static Data
bool initialiseArray(int* arrayParam, int sizeParam) //where does this lie?
{
if(sizeParam > MAX_SIZE)
{
return false;
}
for(int i=0; i<sizeParam; i++)
{
arrayParam[i] = 9;
}
return true;
}
void main()
{
int* myArray = new int[30]; //I assume this is allocated on heap memory
bool res = initialiseArray(myArray, 30); //Where does this lie?
delete myArray;
}
We're currently learning the different categories of memory. I know that there are
-Code Memory
-Static Data
-Run-Time Stack
-Free Store(Heap)
I have commented the places I'm unsure about, and am just wondering if anyone could help me out. My definition for the Run-Time stack describes that this is used for functions, but my definition of code memory says that it contains all instructions for the methods/functions, so I'm just a bit confused.
Can anyone lend a hand?

static const int MAX_SIZE = 256; //I assume this is static Data
Yes indeed. In fact, because it's const, this value might not be kept in your final executable at all, because the compiler can just substitute "256" anywhere it sees MAX_SIZE.
bool initialiseArray(int* arrayParam, int sizeParam) //where does this lie?
The code for the initialiseArray() function will be in the code (text) section of your executable, not in the data section. You can get a pointer to the memory address, and call the function via that address, but other than that there's not much else you can do with it.
The arrayParam and sizeParam arguments will be passed to the function by value, on the stack. Likewise, the bool return value will be placed into the calling function's stack area.
int* myArray = new int[30]; //I assume this is allocated on heap memory
Correct.
bool res = initialiseArray(myArray, 30); //Where does this lie?
Effectively, the myArray pointer and the literal 30 are copied into the stack area of initialiseArray(), which then operates on them, and then the resulting bool is copied into the stack area of the calling function.
The actual details of argument passing are a lot more grisly and depend on calling conventions (of which there are several, particularly on Windows), but unless you're doing something really specialised then they're not really important :-)

The stack is used for automatic variables - that is, variables declared within functions, or as function parameters. These variables are destroyed automatically when the program leaves the block of code they were declared in.
You're correct that MAX_SIZE has a static lifetime - it is destroyed automatically at the end of the program. You're also correct that the array allocated with new[] is on the heap (having a dynamic lifetime) - it won't be destroyed automatically, so it needs to be deleted. By the way, you need delete [] myArray; to match the use of new [].
The pointer to it (myArray) is an automatic variable, on the stack, as are res and the function arguments.
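Putting the above together, here is the question's code again with each object's storage noted in comments. This is a sketch: the wrapper function runExample is invented so the snippet can be called from anywhere, the condition uses sizeParam, and delete[] is used to match new[].

```cpp
#include <cassert>

static const int MAX_SIZE = 256; // static storage: exists for the whole program

// The machine instructions of this function live in code memory.
// arrayParam and sizeParam are automatic variables (stack), as is i.
bool initialiseArray(int* arrayParam, int sizeParam)
{
    if (sizeParam > MAX_SIZE)
    {
        return false;
    }
    for (int i = 0; i < sizeParam; i++)
    {
        arrayParam[i] = 9; // writes into whatever memory the caller provided
    }
    return true;
}

// Hypothetical wrapper standing in for main(); returns the first element.
int runExample()
{
    int* myArray = new int[30];              // pointer: automatic; the 30 ints: free store (heap)
    bool res = initialiseArray(myArray, 30); // res: automatic
    assert(res);                             // 30 <= MAX_SIZE, so initialisation should succeed
    int first = myArray[0];                  // copy the value out before freeing
    delete[] myArray;                        // new[] must be paired with delete[]
    return first;
}
```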

There is just one type of memory... it is a memory :D
What is different is where it is and how you access it.
If you go deep into the exe loader in Windows (or in any kind of OS, actually), what it really does is store the information about your sections (the parts of your exe), and at run time it lays them out properly in memory and applies access rights. So generally the code section, where your "program" is, lives in the same memory (your RAM) as your data section. The difference is that the access rights are different: the code section usually has only read + execute, while the data section has just read + write (and no execute).
The stack is still memory. It is special in the sense that it is again controlled by the OS; the stack size is how big your stack is in bytes. Its purpose is to hold values passed between function calls (as per stdcall) and local variables (how exactly depends on the compiler). Because it is memory, you CAN use it yourself, for example to allocate, let's say, a 10000-byte string on the stack. In assembly you have direct access through the stack pointer (ESP on x86, with EBP commonly used as the frame pointer), and in C/C++ you can use alloca.
The new and the delete operators are built ins for the C++ language but as far as I know they use the same system allocators as you do, in fact you can override them and use malloc/free and it should work which means that again this is the same memory.
The difference between using new/delete and an os specific function is that you let the language handle the allocation but in the end you will get a pointer just like you would with any other function.
On top of this there are special mechanisms that change the way the memory is handled. In Windows this is virtual memory, for example: VirtualAlloc and VirtualFree let you specify what you will do with the memory you want to use, which allows the OS to optimize better. For instance, you can tell it "I want 2 GB of memory, BUT it doesn't have to be in RAM", so it may keep it on disk, yet you STILL access it through memory pointers.
And about your questions :
static const int MAX_SIZE = 256; //I assume this is static Data
It usually depends on the compiler but mostly they will treat this as const ( static is something else ) which means that it will be in the const section of the exe which in turn means that this memory block will be read-only.
int* myArray = new int[30]; //I assume this is allocated on heap memory
Yes, this will be on the heap, but how it is allocated depends on the implementation and on whether you override the new operator. If you do, you can for example force it to be in virtual memory, so in fact it could be on the disk or in RAM, but this is a silly thing to do, so yes, it will be on the heap.
bool res = initialiseArray(myArray, 30); //Where does this lie?
Multiple things happen here. Because the compiler knows that the first parameter of initialiseArray must be a pointer to an int, it will pass the myArray pointer, so both the pointer and the value 30 go on the stack, and then the function is called.
The function, which is in memory (the code section), runs and gets the parameters (int* arrayParam, int sizeParam) from the stack. It knows that you want to write through arrayParam and that it is a pointer, so it writes into the location arrayParam points to. Where exactly is specified by arrayParam[i]: i offsets the pointer to the correct element. Again C++ does some magic by adjusting the pointer for you: since the adjustment in machine code is in bytes, it moves the pointer by 4 bytes per element, since (usually) an int is 4 bytes.
To get a better overview of where goes what and how it works, use a debugger or a disassembler ( like OllyDbg ) and see it for yourself, if you want know more about how the stack is used look up the stdcall calling convention.

Heap corruption

Why is it a problem if we have a huge piece of code between new and delete of a char array?
Example
void this_is_bad() /* You wouldn't believe how often this kind of code can be found */
{
char *p = new char[5]; /* spend some cycles in the memory manager */
/* do some stuff with p */
delete[] p; /* spend some more cycles, and create an opportunity for a leak */
}

Because somebody may throw an exception.
Because somebody may add a return.
If you have lots of code between the new and delete you may not spot that you need to deallocate the memory before the throw/return?
Why do you have a RAW pointer in your code.
Use a std::vector.

The article you reference is making the point that
char p[5];
would be just as effective in this case and have no danger of a leak.
In general, you avoid leaks by making the life cycle of allocated memory very clear, the new and the delete can be seen to be related.
A large separation between the two is harder to check: you need to consider carefully whether there are any ways out of the code that might dodge the delete.

The link (and source) of that code is lamenting the unnecessary use of the heap in that code. For a constant, and small, amount of memory there's no reason not to allocate it on the stack.
Instead:
void this_is_good()
{
/* Avoid allocation of small temporary objects on the heap*/
char p[5]; /* Use the stack instead */
/* do some stuff */
}
There's nothing inherently wrong with the original code though, it's just less than optimal.

Next to all interesting answers about the heap, and about having new and delete occur close to each other, I might add that the sheer fact of having a huge amount of code in one function is to be avoided. If the huge amount of code separates two related lines of code, it's even worse.
I would differentiate between 'amount of work' and 'amount of code':
void do_stuff( char* const p );
void formerly_huge_function() {
char* p = new char[5];
CoInitialize( NULL );
do_stuff( p );
CoUninitialize();
delete[] p;
}
Now do_stuff can do a lot of things without interfering with the allocation problem. But also other symmetrical stuff stays together this way.
It's all about the guy who's going to maintain your code. It might be you, in a month.

That particular example isn't stating that having a bunch of code in between a new and delete is necessarily bad; it is stating that if there are ways to write code that don't use the heap, you might want to prefer that to avoid heap corruption.
It's decent enough advice; if you reduce the amount you use the heap, you know where to look when the heap is corrupted.

I would argue that it's not a problem to have huge amounts of code between a new and delete of any variable. The idea of using new is to place a value on the heap and hence keep it alive for long periods of time. Having code execute on these values is an expected operation.
What can get you into trouble when you have huge amounts of code within the same method between a new and delete is the chance of accidentally leaking the variable due to ...
An exception being thrown
Methods that get so long you can't see the beginning or end and hence people start arbitrarily returning from the middle without realizing they skipped a delete call
Both of these can be fixed by using an RAII type such as std::vector<char> instead of a new / delete pair.
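A quick sketch of why the RAII version survives an exception: the vector's destructor runs during stack unwinding, so there is no delete[] to forget. The names doStuff and thisIsSafe, and the fail flag used to simulate an error, are made up for this example.

```cpp
#include <cassert>
#include <stdexcept>
#include <vector>

// Hypothetical worker that may fail half-way through.
void doStuff(std::vector<char>& buffer, bool fail)
{
    assert(!buffer.empty());
    buffer[0] = 'x';
    if (fail) {
        throw std::runtime_error("something went wrong");
    }
}

// Returns true when the work succeeded. The buffer is released on every path:
// normal return, early return, or exception.
bool thisIsSafe(bool fail)
{
    try {
        std::vector<char> p(5); // replaces char* p = new char[5];
        doStuff(p, fail);
        return true;            // p is destroyed here
    } catch (const std::runtime_error&) {
        return false;           // p was already destroyed during unwinding
    }
}
```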

It isn't, assuming that the "huge piece of code":
Always runs "delete [] p;" before calling "p = new char[size];" or "throw exception;"
Always runs "p = 0;" after calling "delete [] p;"
Failure to meet the first condition, will cause the contents of p to be leaked. Failure to meet the second condition may result in a double-delete. In general, it is best to use std::vector, so as to avoid any problems.

Are you asking if this would be better?
void this_is_great()
{
char* p = new char[5];
delete[] p;
return;
}
It's not.

C++ string manipulation

My lack of C++ experience, or rather my early learning in garbage collected languages is really stinging me at the moment and I have a problem working with strings in C++.
To make it very clear, using std::string or equivalents is not an option - this is char* 's all the way.
So: what I need to do is very simple and basically boils down to concatenating strings. At runtime I have 2 classes.
One class contains "type" information in the form of a base filename.
in the header:
char* mBaseName;
and later, in the .cpp it is loaded with info passed in from elsewhere.
mBaseName = attributes->BaseName;
The 2nd class provides version information in the form of a suffix to the base file name, it's a static class and implemented like this at present:
static const char* const suffixes[] = {"Version1", "Version", "Version3"}; //etc.
static const char* GetSuffix()
{
int i = 0;
//perform checks on some data structures
i = somevalue;
return suffixes[i];
}
Then, at runtime the base class creates the filename it needs:
void LoadStuff()
{
const char* suffix = GetSuffix();
char* nameToUse = new char[50];
sprintf(nameToUse, "%s%s",mBaseName,suffix);
LoadAndSetupData(nameToUse);
}
And you can see the problem immediately. nameToUse never gets deleted, memory leak.
The suffixes are a fixed list, but the base filenames are arbitrary. The name that is created needs to persist beyond the end of "LoadStuff()", as it's not clear when, if, and how it is used subsequently.
I am probably worrying too much, or being very stupid, but similar code to LoadStuff() happens in other places too, so it needs solving. It's frustrating as I don't quite know enough about the way things work to see a safe and "un-hacky" solution. In C# I'd just write:
LoadAndSetupData(mBaseName + GetSuffix());
and wouldn't need to worry.
Any comments, suggestions, or advice much appreciated.
Update
The issue with LoadAndSetupData(), the code I am calling, is that at some point it probably does copy the filename and keep it locally, but the actual instantiation is asynchronous: LoadAndSetupData actually puts things into a queue, and at that point at least, it expects that the string passed in still exists.
I do not control this code, so I can't update its function.

I see now that the issue is how to clean up the string that you created and passed to LoadAndSetupData().
I am assuming that:
LoadAndSetupData() does not make its own copy
You can't change LoadAndSetupData() to do that
You need the string to still exist for some time after LoadAndSetupData() returns
Here are suggestions:
Can you make your own queue objects to be called? Are they guaranteed to be called after the ones that use your string? If so, create cleanup queue events with the same string that call delete[] on it.
Is there a maximum number you can count on? If you created a large array of strings, could you use them in a cycle and be assured that when you got back to the beginning, it would be OK to reuse that string?
Is there an amount of time you can count on? If so, register the strings for deletion somewhere and check that list after some time.
The best thing would be for functions that take char* to take ownership or copy. Shared ownership is the hardest thing to do without reference counting or garbage collection.

EDIT: This answer doesn't address his problem completely -- I made other suggestions here:
C++ string manipulation
His problem is that he needs to extend the scope of the char* he created to outside the function, and until an asynchronous job is finished.
Original Answer:
In C++, if I can't use the standard library or Boost, I still have a class like this:
template<class T>
class ArrayGuard {
public:
explicit ArrayGuard(T* ptr) : _ptr(ptr) {}
~ArrayGuard() { delete[] _ptr; }
private:
T* _ptr;
// not copyable: declared but never defined
ArrayGuard(const ArrayGuard&);
ArrayGuard& operator=(const ArrayGuard&);
};
You use it like:
char* buffer = new char[50];
ArrayGuard<char> bufferGuard(buffer);
The buffer will be deleted at the end of the scope (on return or throw).
I use it only for simple cases: dynamically sized arrays that I want to treat like statically sized arrays and have released at the end of the scope.
Keep it simple -- if you need fancier smart pointers, use Boost.
This is useful if the 50 in your example is variable.

The thing to remember with C++ memory management is ownership. If the LoadAndSetupData data is not going to take ownership of the string, then it's still your responsibility. Since you can't delete it immediately (because of the asynchronicity issue), you're going to have to hold on to those pointers until such time as you know you can delete them.
Maintain a pool of strings that you have created:
If you have some point in time where you know that the queue has been completely dealt with, you can simply delete all the strings in the pool.
If you know that all strings created after a certain point in time have been dealt with, then keep track of when the strings were created, and you can delete that subset.
If you can somehow find out when an individual string has been dealt with, then just delete that string.
#include <cstddef>
#include <ctime>
class StringPool
{
struct StringReference {
char *buffer;
time_t created;
} *Pool;
size_t PoolSize;
size_t Allocated;
static const size_t INITIAL_SIZE = 100;
void GrowBuffer()
{
StringReference *newPool = new StringReference[PoolSize * 2];
for (size_t i = 0; i < Allocated; ++i)
newPool[i] = Pool[i];
StringReference *oldPool = Pool;
Pool = newPool;
PoolSize *= 2; // record the new capacity
delete[] oldPool;
}
public:
StringPool() : Pool(new StringReference[INITIAL_SIZE]), PoolSize(INITIAL_SIZE), Allocated(0)
{
}
~StringPool()
{
ClearPool();
delete[] Pool;
}
char *GetBuffer(size_t size)
{
if (Allocated == PoolSize)
GrowBuffer();
Pool[Allocated].buffer = new char[size];
Pool[Allocated].created = time(NULL); // remember when the buffer was handed out
return Pool[Allocated++].buffer;
}
void ClearPool()
{
for (size_t i = 0; i < Allocated; ++i)
delete[] Pool[i].buffer;
Allocated = 0;
}
void ClearBefore(time_t knownCleared)
{
size_t newAllocated = 0;
for (size_t i = 0; i < Allocated; ++i)
{
if (Pool[i].created < knownCleared)
{
delete[] Pool[i].buffer;
}
else
{
Pool[newAllocated] = Pool[i];
++newAllocated;
}
}
Allocated = newAllocated;
}
// This compares pointers, not strings!
void ReleaseBuffer(char *knownCleared)
{
size_t newAllocated = 0;
for (size_t i = 0; i < Allocated; ++i)
{
if (Pool[i].buffer == knownCleared)
{
delete[] Pool[i].buffer;
}
else
{
Pool[newAllocated] = Pool[i];
++newAllocated;
}
}
Allocated = newAllocated;
}
};

Since std::string is not an option, for whatever reason, have you looked into smart pointers? See boost
But I can only encourage you to use std::string.
Christian

If you must use char*'s, then LoadAndSetupData() should explicitly document who owns the memory for the char* after the call. You can do one of two things:
Copy the string. This is probably the simplest thing. LoadAndSetupData copies the string into some internal buffer, and the caller is always responsible for the memory.
Transfer ownership. LoadAndSetupData() documents that it will be responsible for eventually freeing the memory for the char*. The caller doesn't need to worry about freeing the memory.
I generally prefer safe copying as in #1, because the allocator of the string is also responsible for freeing it. If you go with #2, the allocator has to remember NOT to free things, and memory management happens in two places, which I find harder to maintain. In either case, it's a matter of explicitly documenting the policy so that the caller knows what to expect.
If you go with #1, take a look at Lou Franco's answer to see how you might allocate a char[] in an exception-safe, sure to be freed way using a guard class. Note that you can't (safely) use std::auto_ptr for arrays.
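A minimal sketch of option #1, assuming a hypothetical consumer class called `Loader` (the real LoadAndSetupData() is not shown in the question, so this is not its actual behaviour). The consumer copies the string on entry, which lets the caller free or reuse its own buffer straight away.

```cpp
#include <cstddef>
#include <cstring>

// Hypothetical consumer that keeps its own copy of the name (option 1).
class Loader {
public:
    explicit Loader(const char* name) : name_(copy(name)) {}
    ~Loader() { delete[] name_; }

    const char* name() const { return name_; }

private:
    static char* copy(const char* src)
    {
        std::size_t len = std::strlen(src) + 1;  // include the terminator
        char* dst = new char[len];
        std::memcpy(dst, src, len);
        return dst;
    }

    char* name_;

    // Copying is disabled so the buffer is never freed twice.
    Loader(const Loader&);
    Loader& operator=(const Loader&);
};
```

With this policy the allocator of each buffer is also the one that frees it, which is the property the answer prefers.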

Since you need nameToUse to still exist after the function returns, you are stuck using new. What I would do is return a pointer to it, so the caller can delete it later, when it is no longer needed.
char * LoadStuff()
{
const char* suffix = GetSuffix();
char* nameToUse = new char[50];
sprintf(nameToUse, "%s%s",mBaseName,suffix); // the destination buffer comes first
LoadAndSetupData(nameToUse);
return nameToUse;
}
then:
char *name = LoadStuff();
// do whatever you need to do:
delete [] name;

There is no need to allocate on heap in this case. And always use snprintf:
char nameToUse[50];
snprintf(nameToUse, sizeof(nameToUse), "%s%s",mBaseName,suffix);

Where exactly is nameToUse used beyond the scope of LoadStuff? If someone needs it after LoadStuff, LoadStuff needs to pass it on, along with the responsibility for deallocating the memory.
If you had done it in C# as you suggested
LoadAndSetupData(mBaseName + GetSuffix());
then nothing would reference LoadAndSetupData's parameter, therefore you can safely change it to
char nameToUse[50];
as Martin suggested.

You're going to have to manage the lifetime of the memory you allocate for nameToUse. Wrapping it up in a class such as std::string makes your life a bit simpler.
I guess this is a minor outrage, but since I can't think of any better solution to your problem, I'll point out another potential problem. You need to be very careful to check the size of the buffer you're writing into when copying or concatenating strings. Functions such as strcat, strcpy and sprintf can easily overwrite the end of their target buffers, leading to spurious runtime errors and security vulnerabilities.
Apologies, my own experience is mostly on the Windows platform, where they introduced "safe" versions of these functions, called strcat_s, strcpy_s, and sprintf_s. The same goes for all their many related functions.
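A portable way to avoid the overrun altogether is to measure the inputs first and allocate exactly what is needed. This is only a sketch; `concat_new` is a hypothetical helper, and the caller still has to decide when it is safe to delete[] the result.

```cpp
#include <cstddef>
#include <cstring>

// Allocates exactly enough room for base+suffix plus the terminator.
// The caller owns the result and must release it with delete[].
char* concat_new(const char* base, const char* suffix)
{
    std::size_t base_len = std::strlen(base);
    std::size_t suffix_len = std::strlen(suffix);
    char* result = new char[base_len + suffix_len + 1];
    std::memcpy(result, base, base_len);
    std::memcpy(result + base_len, suffix, suffix_len + 1);  // copies '\0' too
    return result;
}
```

Because the size is computed from the actual strings, a long base name can no longer overflow a fixed 50-byte buffer.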

First: Why do you need the allocated string to persist beyond the end of LoadStuff()? Is there a way you can refactor to remove that requirement?
Since C++ doesn't provide a straightforward way to do this kind of stuff, most programming environments use a set of guidelines about pointers to prevent delete/free problems. Since things can only be allocated/freed once, it needs to be very clear who "owns" the pointer. Some sample guidelines:
1) Usually the person that allocates the string is the owner, and is also responsible for freeing the string.
2) If you need to free in a different function/class than you allocated in, there must be an explicit hand-off of ownership to another class/function.
3) Unless explicitly stated otherwise, pointers (including strings) belong to the caller. A function, constructor, etc. cannot assume that the string pointer it gets will persist beyond the end of the function call. If they need a persistent copy of the pointer, they should make a local copy with strdup().
What this boils down to in your specific case is that LoadStuff() should delete[] nameToUse, and the function that it calls should make a local copy.
One alternate solution: if nameToUse is going to be passed lots of places and needs to persist for the lifetime of the program, you could make it a global variable. (This saves the trouble of making lots of copies of it.) If you don't want to pollute your global namespace, you could just declare it static local to the function:
static char *nameToUse = new char[50];

Thank you everyone for your answers. I have not selected one as "the answer" as there isn't a concrete solution to this problem and the best discussions on it are all upvoted by me and others anyway.
Your suggestions are all good, and you have been very patient with the clunkiness of my question. As I am sure you can see, this is a simplification of a more complicated problem and there is a lot more going on which is connected with the example I gave, hence the way that bits of it may not have entirely made sense.
For your interest I have decided to "cheat" my way out of the difficulty for now. I said that the base names were arbitrary, but this isn't quite true. In fact they are a limited set of names too, just a limited set that could change at some point, so I was attempting to solve a more general problem.
For now I will extend the "static" solution to suffixes and build a table of possible names. This is very "hacky", but will work and moreover avoids refactoring a large amount of complex code which I am not able to.
Feedback has been fantastic, many thanks.

You can combine some of the ideas here.
Depending on how you have modularized your application, there may be a method (main?) whose execution determines the scope in which nameToUse is definable as a fixed size local variable. You can pass the pointer (&nameToUse[0] or simply nameToUse) to those other methods that need to fill it (so pass the size too) or use it, knowing that the storage will disappear when the function having the local variable exits or your program terminates by any other means.
There is little difference between this and using dynamic allocation and deletion (since the pointer holding the location will have to be managed more-or-less the same way). The local allocation is more direct in many cases and is very inexpensive when there is no problem with associating the maximum-required lifetime with the duration of a particular function's execution.
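As a sketch of that arrangement, assume a hypothetical outer function `RunJob` whose execution covers every use of the name. `FillName` is likewise a made-up helper; the point is only that the size travels with the pointer and the storage ends with the outer function.

```cpp
#include <cstddef>
#include <cstdio>
#include <cstring>

// Fills a caller-provided buffer; the size travels with the pointer.
void FillName(char* out, std::size_t out_size,
              const char* base, const char* suffix)
{
    std::snprintf(out, out_size, "%s%s", base, suffix);
}

// Hypothetical outer function whose lifetime covers every use of the name.
// The buffer is a plain local, so it disappears when RunJob returns.
std::size_t RunJob(const char* base, const char* suffix)
{
    char nameToUse[50];
    FillName(nameToUse, sizeof(nameToUse), base, suffix);
    // ... pass nameToUse to anything that finishes before RunJob returns ...
    return std::strlen(nameToUse);
}
```

This is not safe if any consumer keeps the pointer after `RunJob` has returned, which is the situation the asynchronous queue in the question creates.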

I'm not totally clear on where LoadAndSetupData is defined, but it looks like it's keeping its own copy of the string. So then you should delete your locally allocated copy after the call to LoadAndSetupData and let it manage its own copy.
Or, make sure LoadAndSetupData cleans up the allocated char[] that you give it.
My preference would be to let the other function keep its own copy and manage it so that you don't allocate an object for another class.
Edit: since you use new with a fixed size [50], you might as well make it local as has been suggested and then let LoadAndSetupData make its own copy.
