I have an assignment where I need to take a 7-digit input (a phone number) and check whether it appears in the digits of pi. The digits of pi are stored in a supplied space-separated text file. It seems reasonably straightforward: break the input into an array, read the digits of pi into an array, and check whether a match is found. Long story short, I got the program working to my satisfaction. We were supplied text files with the digits of pi in multiples of 10, 100, and so on, up to 1 million digits. My program works up to 100,000 digits. But for whatever reason, on the 1 million digit file, it crashes with a generic Windows error. I have no information on why it crashes and no error message is given (other than the generic "a problem caused this program to stop working" message).
Note that the limits on the assignment state I cannot use any object-oriented code except for cin, cout, and the file stream objects (this limitation exists because we haven't yet covered classes and they don't want us using features without knowing how they work).
Anyway, I'm looking for insight into why the program is crashing. I'm using long ints for every variable that should need them (including counters and function returns), which should be sufficient, since they can hold values up to roughly 2 billion and nothing here should exceed a million.
Thanks for any help. I've been at this the past few hours with no success.
const long int numberOfDigits = 1000000;
int digitsOfPi[numberOfDigits];
The stack does not have enough room to hold such a large array. The stack is where automatic variables (AKA local variables) are stored. Memory is automatically allocated for local variables when execution enters a function and is freed when the function returns. The stack is great because of this automatic memory management, but one restriction is that its size is limited.
Large objects should go on the heap.1 The heap is a gigantic pool of memory from which you can allocate pieces dynamically whenever you like. The difference between the heap and the stack is that you're responsible for allocating and freeing heap memory. It does not get automatically freed for you.
To allocate memory on the heap in C++, use the new operator, with each new having a corresponding delete to free the memory once it's no longer needed. (Or in our case, we use new[] and delete[] since we're dealing with an array.)
// Allocate memory on the heap.
int *digitsOfPi = new int[numberOfDigits];
// Use it.
// Then free it.
delete[] digitsOfPi;
// Or better yet, once you're allowed to use the STL...
std::vector<int> digitsOfPi;
The larger question, though, is why you need to read all the digits of π into memory at once. A better design, though trickier to code, would only need a fixed O(1) amount of memory—say, 7 digits at a time.
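For illustration, here is a minimal sketch of that streaming idea: it reads the space-separated digits one at a time and keeps only the last 7 in a small rolling window (the file name and phone number below are made up):

#include <fstream>
#include <iostream>

int main()
{
    std::ifstream piFile("pi_1000000.txt");     // hypothetical file name
    int phone[7] = {8, 6, 7, 5, 3, 0, 9};       // example phone number 867-5309
    int window[7] = {0};                        // last 7 digits of pi read so far
    long int count = 0;                         // how many digits have been read
    int digit;
    bool found = false;

    while (!found && piFile >> digit)           // >> skips the separating spaces
    {
        // Shift the window left by one and append the new digit.
        for (int i = 0; i < 6; ++i)
            window[i] = window[i + 1];
        window[6] = digit;
        ++count;

        // Once at least 7 digits have been read, compare the window.
        if (count >= 7)
        {
            bool match = true;
            for (int i = 0; i < 7; ++i)
                if (window[i] != phone[i])
                    match = false;
            found = match;
        }
    }

    if (found)
        std::cout << "Found starting at digit " << (count - 6) << "\n";
    else
        std::cout << "Not found\n";
    return 0;
}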
See also
What and where are the stack and heap?
1 You could explore your compiler's options to increase the stack size, but that's not the right solution.
Related
Runtime error: pointer index expression with base 0x000000000000 overflowed to 0xffffffffffffffff for frequency sort
In the first answer at that link, it says that appending a char to a string can cause a memory issue.
string s = "";
char c = 'a';
int max = INT_MAX;
for (int j = 0; j < max; j++)
    s = s + c;
The answer explains that [s = s + c in the above code copies the same string again and again, so it will cause a memory issue]. But I don't understand why that code copies the same string again and again.
Could someone help me understand that part? :)
I don't understand why that code copies the same string again and again.
Okay, let's look at what happens each time the loop is iterated:
s = s + c;
There are three things the program has to do in order to execute that line of code:
1. Compute the temporary value s + c -- to do that, the program has to create a temporary, anonymous std::string object, and allocate for it (from the heap) an internal byte-buffer that is at least one byte larger than the number of chars currently in s (so that it can hold all of s's old contents, plus the additional char provided by c)
2. Set s equal to the temporary-string. In C++03 and earlier, this would be done by reallocating s's internal byte-buffer to be larger, then copying all of the bytes from the temporary-string into s's new/larger buffer. C++11 optimizes this a bit via the new move-assignment operator, so that all the bytes don't have to be copied; rather, s can simply take ownership of the temporary-string's byte-buffer.
3. Free the temporary string's resources, now that we're done using it. In practice, this takes the form of the std::string class's destructor calling delete[] on the old (no-longer-large-enough) byte-buffer.
Given that the above is going to be performed at least 2 billion times in a loop, it's already quite inefficient.
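For comparison, here is a minimal sketch of the usual way to avoid those temporaries: append with += (and optionally reserve up front) so the string grows in place instead of being rebuilt on every iteration:

#include <string>

int main()
{
    const int n = 1000000;   // a smaller count than INT_MAX, just for illustration
    std::string s;
    s.reserve(n);            // one up-front allocation of n bytes

    for (int j = 0; j < n; ++j)
        s += 'a';            // appends in place; amortized O(1), no temporary string

    return 0;
}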
However, what I think the answer you referred to was particularly concerned about was heap fragmentation. Keep in mind that heap allocation doesn't work by magic; when you (or the std::string class, or anyone) asks to allocate N bytes of memory from the heap, the heap implementation's job is to find N bytes of contiguous memory and return it. And since there is no provision in C++ for moving blocks of memory around (as doing so would invalidate any pointers that the program might have pointing into those blocks of memory), the heap can't create an N-byte contiguous-memory-chunk out of smaller chunks; instead, there has to be a range of contiguous-memory-space already available. For example, it does the heap no good to have a total of 1GB of memory available, if that 1GB of memory is made up of thousands of nonconsecutive 1KB chunks and the caller is asking for a 2KB allocation.
Therefore, the heap's job is to efficiently allocate chunks of memory of the sizes the program requests, and when they are freed again, it will try to glue them back together into larger chunks again if it can, but it may not always be able to. Certain patterns of allocating and freeing memory may result in heap fragmentation, which is simply a large number of discontinuous memory allocations that render the small regions of free memory between them unusable for large allocations.
Whether or not this particular allocate/free pattern would cause that, I'm not sure; given that only one or two buffers are being allocated at a time, the heap may be able to reabsorb them back into adjacent free-memory chunks as they get freed again -- it probably depends on the particular heap algorithm the system is using, as well as on whether any other threads are allocating/freeing heap memory while this is going on. But I wouldn't be too surprised if there are systems out there where it would cause problems (particularly on 16-bit or 32-bit systems where virtual address space is limited, or on embedded systems that don't use virtual memory).
I'm a student taking a class on Data Structures in C++ this semester and I came across something that I don't quite understand tonight. Say I were to create a pointer to an array on the heap:
int* arrayPtr = new int [4];
I can access this array using pointer syntax
int value = *(arrayPtr + index);
But if I were to add another value to the memory position immediately after the end of the space allocated for the array, I would then be able to access it
*(arrayPtr + 4) = 0;
int nextPos = *(arrayPtr + 4);
//the value of nextPos will be 0, or whatever value I previously filled that space with
The position in memory of *(arrayPtr + 4) is past the end of the space allocated for the array. But as far as I understand, the above still would not cause any problems. So aside from it being a requirement of C++, why even give arrays a specific size when declaring them?
When you go past the end of allocated memory, you are actually accessing memory of some other object (or memory that is free right now, but that could change later). So, it will cause you problems. Especially if you'll try to write something to it.
I can access this array using pointer syntax
int value = *(arrayPtr + index);
Yeah, but don't. Use arrayPtr[index] instead.
The position in memory of *(arrayPtr + 4) is past the end of the space allocated for the array. But as far as I understand, the above still would not cause any problems.
You understand wrong. Oh so very wrong. You're invoking undefined behavior and undefined behavior is undefined. It may work for a week, then break one day next week and you'll be left wondering why. If you don't know the collection size in advance use something dynamic like a vector instead of an array.
Yes, in C/C++ you can access memory outside of the space you claim to have allocated. Sometimes. This is what is referred to as undefined behavior.
Basically, you have told the compiler and the memory management system that you want space to store four integers, and the memory management system allocated space for you to store four integers. It gave you a pointer to that space. In the memory manager's internal accounting, those bytes of ram are now occupied, until you call delete[] arrayPtr;.
However, the memory manager has not allocated that next byte for you. You don't have any way of knowing, in general, what that next byte is, or who it belongs to.
In a simple example program like your example, which just allocates a few bytes, and doesn't allocate anything else, chances are, that next byte belongs to your program, and isn't occupied. If that array is the only dynamically allocated memory in your program, then it's probably, maybe safe to run over the end.
But in a more complex program, with multiple dynamic memory allocations and deallocations, especially near the edges of memory pages, you really have no good way of knowing what any bytes outside of the memory you asked for contain. So when you write to bytes outside of the memory you asked for in new you could be writing to basically anything.
This is where undefined behavior comes in. Because you don't know what's in that space you wrote to, you don't know what will happen as a result. Here's some examples of things that could happen:
The memory was not allocated when you wrote to it. In that case, the data is fine, and nothing bad seems to happen. However, if a later memory allocation uses that space, anything you tried to put there will be lost.
The memory was allocated when you wrote to it. In that case, congratulations, you just overwrote some random bytes from some other data structure somewhere else in your program. Imagine replacing a variable somewhere in one of your objects with random data, and consider what that would mean for your program. Maybe a list somewhere else now has the wrong count. Maybe a string now has some random values for the first few characters, or is now empty because you replaced those characters with zeroes.
The array was allocated at the edge of a page, so the next bytes don't belong to your program. The address is outside your program's allocation. In this case, the OS detects you accessing random memory that isn't yours, and terminates your program immediately with SIGSEGV.
Basically, undefined behavior means that you are doing something illegal, but because C/C++ is designed to be fast, the language designers don't include an explicit check to make sure you don't break the rules, like other languages (e.g. Java, C#). They just list the behavior of breaking the rules as undefined, and then the people who make the compilers can have the output be simpler, faster code, since no array bounds checks are made, and if you break the rules, it's your own problem.
So yes, this sometimes works, but don't ever rely on it.
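To make the contrast concrete, here is a small sketch: the raw out-of-bounds write is undefined behavior, while the same mistake on a std::vector accessed through at() becomes a catchable exception instead of silent memory corruption:

#include <iostream>
#include <stdexcept>
#include <vector>

int main()
{
    int* arrayPtr = new int[4];
    // arrayPtr[4] = 0;              // out of bounds: undefined behavior, may "work" today

    std::vector<int> v(4);
    v[2] = 7;                        // unchecked access, same cost as a raw array
    try {
        v.at(4) = 0;                 // checked access: throws instead of corrupting memory
    } catch (const std::out_of_range& e) {
        std::cout << "caught: " << e.what() << "\n";
    }

    delete[] arrayPtr;
    return 0;
}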
It would not cause any problems in a purely abstract setting, where you only worry about whether the logic of the algorithm is sound. In that case there's no reason to declare the size of an array at all. However, your computer exists in the physical world, and only has a limited amount of memory. When you're allocating memory, you're asking the operating system to let you use some of the computer's finite memory. If you go beyond that, the operating system should stop you, usually by killing your process/program.
Writing it as arrayPtr[index] is equivalent to *(arrayPtr + index); the real problem is that the position in memory of *(arrayPtr + 4) is past the end of the space you allocated for the array. It's a limitation of raw C++ arrays that their size can't be extended once allocated.
In my project, there are one million inputs, and I am supposed to take different numbers of inputs in order to compare sort/search algorithms. Everything was fine until I tried to take five hundred thousand inputs. That's when I realized that I can't create an array of five hundred thousand pointers to my class (or even to int). However, I can create five arrays of one hundred thousand each.
In case I didn't explain it very well, just look at these two pieces of code:
int *ptr[500000]; // it crashes
int *ptr1[100000]; // it runs well
int *ptr2[100000];
int *ptr3[100000];
int *ptr4[100000];
int *ptr5[100000];
What is the reason for the crash? Is there some limit, or is it about memory? And of course, how can I fix it?
You are trying to allocate a 500,000-entry array on the stack. The stack is not really designed for holding large amounts of data like this. In your case, the stack just happens to be big enough to hold 100,000 entries (or even several different lots of 100,000 entries) but not 500,000 in a single block. If you overflow the stack, behaviour is undefined but a crash is likely.
You will get much better results by allocating your array on the heap instead.
int **ptr = (int**)malloc(500000*sizeof(int*)); // in C++ the result of malloc must be cast
Remember to check for a NULL return value from malloc, and free the memory when you're finished with it.
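Since the rest of the question is C++, the same allocation could also be written with new[] or std::vector rather than malloc; a minimal sketch:

#include <vector>

int main()
{
    // new[]/delete[] version: the array of pointers lives on the heap,
    // so the stack-size limit no longer applies.
    int** ptr = new int*[500000];
    // ... use ptr[0] through ptr[499999] ...
    delete[] ptr;

    // Or let std::vector manage both the allocation and the cleanup:
    std::vector<int*> vec(500000, nullptr);
    return 0;
}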
I've got a very basic application that boils down to the following code:
char* gBigArray[200][200][200];
unsigned int Initialise(){
    for(int ta=0;ta<200;ta++)
        for(int tb=0;tb<200;tb++)
            for(int tc=0;tc<200;tc++)
                gBigArray[ta][tb][tc]=new char;
    return sizeof(gBigArray);
}
The function returns the expected value of 32000000 bytes, which is approximately 30MB, yet the Windows Task Manager (granted, it's not 100% accurate) gives a Memory (Private Working Set) value of around 157MB. I've loaded the application into VMMap by SysInternals and got the following values:
I'm unsure what Image means (listed under Type), although irrespective of that, its value is around what I'm expecting. What is really throwing things off for me is the Heap value, which is where the apparent enormous size is coming from.
What I don't understand is why this is? According to this answer if I've understood it correctly, gBigArray would be placed in the data or bss segment - however I'm guessing as each element is an uninitialised pointer it would be placed in the bss segment. Why then would the heap value be larger by a silly amount than what is required?
It doesn't sound silly if you know how memory allocators work. They keep track of the allocated blocks so there's a field storing the size and also a pointer to the next block, perhaps even some padding. Some compilers place guarding space around the allocated area in debug builds so if you write beyond or before the allocated area the program can detect it at runtime when you try to free the allocated space.
You are allocating one char at a time, and there is typically a space overhead per allocation.
Allocate the memory in one big chunk (or at least in a few chunks) instead.
Do not forget that char* gBigArray[200][200][200]; allocates space for 200*200*200 = 8,000,000 pointers, each one word in size. That is 32 MB on a 32-bit system.
Add another 8,000,000 chars to that for another 8 MB. Since you are allocating them one by one, the allocator probably can't hand them out at one byte per item, so they'll likely take a word each as well, resulting in another 32 MB (32-bit system).
The rest is probably overhead, which is also significant because the C++ system must remember how many elements an array allocated with new contains for delete [].
Owww! My embedded systems stuff would roll over and die if faced with that code. Each allocation has quite a bit of extra info associated with it and either is spaced to a fixed size, or is managed via a linked list type object. On my system, that 1 char new would become a 64 byte allocation out of a small object allocator such that management would be in O(1) time. But in other systems, this could easily fragment your memory horribly, make subsequent new and deletes run extremely slowly O(n) where n is number of things it tracks, and in general bring doom upon an app over time as each char would become at least a 32 byte allocation and be placed in all sorts of cubby holes in memory, thus pushing your allocation heap out much further than you might expect.
Do a single large allocation and map your 3D array over it if you need to with a placement new or other pointer trickery.
Allocating 1 char at a time is probably more expensive: there are metadata headers per allocation, and 1 byte for a character is smaller than the header metadata itself. You might actually save space by doing one large allocation (if possible); that way you mitigate the overhead of each individual allocation carrying its own metadata.
Perhaps this is an issue of memory stride? What size of gaps are between values?
The ~30 MB is for the pointers. The rest is for the storage you allocated with the new calls that the pointers point to. The allocator is allowed to hand out more than one byte per allocation for various reasons, such as aligning on word boundaries or leaving some room to grow in case you want it later. If you want 8 MB worth of characters, leave the * off your declaration of gBigArray.
Edited out of the above post into a community wiki post:
As the answers below say, the issue here is I am creating a new char 200^3 times, and although each char is only 1 byte, there is overhead for every object on the heap. It seems creating a char array for all chars knocks the memory down to a more believable level:
char* gBigArray[200][200][200];
char* gCharBlock=new char[200*200*200];

unsigned int Initialise(){
    unsigned int mIndex=0;
    for(int ta=0;ta<200;ta++)
        for(int tb=0;tb<200;tb++)
            for(int tc=0;tc<200;tc++)
                gBigArray[ta][tb][tc]=&gCharBlock[mIndex++];
    return sizeof(gBigArray);
}
I don't know what to think here...
We have a component that runs as a service. It runs perfectly well on my local machine, but on some other machine (both machines have 2GB of RAM) it starts to generate bad_alloc exceptions on the second and subsequent days. The thing is that the memory usage of the process stays the same, at approximately the 50MB level. The other weird thing is that, by means of tracing messages, we have localized the exception to be thrown from a stringstream object which does nothing but insert no more than 1-2 KB of data into the stream. We're using STLport, if that matters.
Now, when you get a bad_alloc exception, you think it's a memory leak. But all our manual allocations are wrapped in smart pointers. Also, I can't understand how a stringstream object could run out of memory when the whole process uses only ~50MB (the memory usage stays approximately constant, and certainly doesn't rise, from day to day).
I can't provide you with code, because the project is really big, and the part which throws the exception really does nothing else but create a stringstream and << some data and then log it.
So, my question is... How can a memory leak/bad_alloc occur when the process uses only 50Mb memory out of 2GB ? What other wild guesses do you have as to what could possibly be wrong?
Thanks in advance, I know the question is vague etc., I'm just sort of desperate and I tried my best to explain the problem.
One likely reason, given your description, is that you try to allocate a block of some unreasonably big size because of an error in your code. Something like this:
size_t numberOfElements;//uninitialized
if( .... ) {
    numberOfElements = obtain();
}
elements = new Element[numberOfElements];
Now, if numberOfElements is left uninitialized, it can contain some unreasonably big number, so you effectively try to allocate a block of, say, 3GB, which the memory manager refuses to do.
So it may not be that your program is short on memory, but rather that it tries to allocate more memory than it could possibly be allowed to, even under the best conditions.
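As an illustration of the fix, here is a small self-contained sketch; Element, obtain(), and the sanity bound are hypothetical stand-ins for the code above:

#include <cstddef>
#include <iostream>

struct Element { int value; };                     // stand-in for the real element type
std::size_t obtain() { return 42; }                // stand-in for the real source of the count

int main()
{
    std::size_t numberOfElements = 0;              // initialized, so a skipped branch can't leave garbage
    bool haveInput = true;                         // stands in for the if( .... ) condition above

    if (haveInput)
        numberOfElements = obtain();

    const std::size_t kMaxReasonable = 1000000;    // sanity bound; the value is made up
    if (numberOfElements == 0 || numberOfElements > kMaxReasonable) {
        std::cout << "refusing to allocate " << numberOfElements << " elements\n";
        return 1;
    }

    Element* elements = new Element[numberOfElements];
    delete[] elements;
    return 0;
}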
bad_alloc doesn't necessarily mean there is not enough memory. The allocation functions might also fail because the heap is corrupted. You might have some buffer overrun or code writing into deleted memory, etc.
You could also use Valgrind or one of its Windows replacements to find the leak/overrun.
Just a hunch, but I have had trouble in the past when allocating arrays like so:
int array1[SIZE]; // SIZE limited by COMPILER to the size of the stack frame
when SIZE is a large number.
The solution was to allocate with the new operator
int* array2 = new int[SIZE]; // SIZE limited only by OS/Hardware
I found this very confusing; the reason turned out to be the stack frame, as discussed in the solution by Martin York here:
Is there a max array length limit in C++?
All the best,
Tom
Check the profile of other processes on the machine using Process Explorer from sysinternals - you will get bad_alloc if memory is short, even if it's not you that's causing memory pressure.
Check your own memory usage using umdh to get snapshots and compare usage profile over time. You'll have to do this early in the cycle to avoid blowing up the tool, but if your process's behaviour is not degrading over time (ie. no sudden pathological behaviour) you should get accurate info on its memory usage at time T vs time T+t.
Another long shot: you don't say in which of the three operations the error occurs (construction, << or log), but the problem may be memory fragmentation, rather than memory consumption. Maybe stringstream can't find a contiguous memory block long enough to hold a couple of Kb.
If this is the case, and if you exercise that function on the first day (without mishap), then you could make the stringstream a static variable and reuse it. As far as I know, stringstream does not deallocate its buffer space during its lifetime, so if it establishes a big buffer on the first day it will continue to have it from then on (for added safety, you could run a 5KB dummy string through it when it is first constructed).
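Roughly what that workaround could look like (the function name is made up, and whether str("") really keeps the grown buffer is implementation-specific, so treat this as a sketch):

#include <sstream>
#include <string>

void logValue(int value)
{
    // Reuse one stream so a buffer grown on an earlier call can be kept around.
    static std::stringstream ss;
    ss.str("");                        // clear previous contents
    ss.clear();                        // reset any error flags
    ss << "value = " << value;
    std::string line = ss.str();       // ... hand `line` to the real logging facility ...
}

int main()
{
    logValue(0);                       // optional "warm-up" call, as the answer suggests
    for (int i = 1; i < 100; ++i)
        logValue(i);
    return 0;
}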
I fail to see why a stream would throw. Don't you have a dump of the failed process? Or perhaps attach a debugger to it to see what the allocator is trying to allocate?
But if you did overload the operator <<, then perhaps your code does have a bug.
Just my 2 (euro) cts...
1. Fragmentation?
The memory could be fragmented.
At some point, you try to allocate SIZE bytes, but the allocator finds no contiguous chunk of SIZE bytes in memory, and then throws a bad_alloc.
Note: This answer was written before I read this possibility was ruled out.
2. Signed vs. unsigned?
Another possibility would be the use of a signed value for the size to be allocated:
char * p = new char[i] ;
If the value of i is negative (e.g. -1), the cast into the unsigned integral size_t will make it go beyond what is available to the memory allocator.
As the use of signed integral types is quite common in user code, if only so that a negative value can mark an invalid result (e.g. -1 for a failed search), this is a real possibility.
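A tiny sketch of what that conversion does; the printed value is for a typical 64-bit system:

#include <cstddef>
#include <iostream>

int main()
{
    int i = -1;                                   // e.g. the result of a failed search
    std::size_t n = static_cast<std::size_t>(i);  // the same conversion new char[i] would perform
    std::cout << n << "\n";                       // 18446744073709551615 on a typical 64-bit system

    // char* p = new char[i];                     // would request that enormous size and throw
    //                                            // (bad_alloc, or bad_array_new_length since C++11)
    return 0;
}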
~className(){
//delete stuff in here
}
By way of example, memory leaks can occur when you use the new operator in C++ and forget to use the delete operator.
Or, in other words, when you allocate a block of memory and forget to deallocate it.
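A minimal illustration of both points: a function that leaks because delete[] is never called, and a class whose destructor does the "delete stuff in here" from the snippet above (the names are made up):

class Buffer {
public:
    Buffer(int n) : data_(new int[n]) {}
    ~Buffer() { delete[] data_; }      // frees what the constructor allocated
private:
    int* data_;
    // (a real class would also deal with copying, per the rule of three)
};

void leaky()
{
    int* p = new int[100];
    p[0] = 1;
    // no delete[] p: this allocation is lost when the function returns
}

int main()
{
    leaky();                           // leaks 100 ints
    Buffer b(100);                     // freed automatically when b goes out of scope
    return 0;
}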