Potential memory leaks in reading binary files - c++

I would like to ask whether this part of the code suffers from memory leaks (I'm quite sure it does, but how severely?).
The "input1" variable is a pointer to double, i.e. double* input1. The reason I didn't use float (which would match the on-disk data better) is that I wanted to maintain compatibility with other parts of the code.
else if (filetype == "BinaryFile")
{
    char* memblock;
    std::ifstream file(filename1, std::ios::binary | std::ios::in);
    file.seekg(0, std::ios::end);
    int size = file.tellg();
    file.seekg(0, std::ios::beg);
    std::cout << "Size=" << size << " [in bytes]" << "\n";
    std::cout << "There are overall " << grid_points << "^3 = " << std::setprecision(10) << pow(grid_points, 3) << " values of the field1, written as float type.\n";
    memblock = new char[size];
    file.seekg(0, std::ios::beg);
    file.read(memblock, size);
    file.close();
    float* values = (float*)memblock; // reinterpret as float, because the file was saved as float
    for (int i = 0; i < grid_points * grid_points * grid_points; i++) {
        input1[i] = (double)values[i]; // cast to double, since input1 is an array of doubles
    }
    file.close();
    delete[] memblock;
}
The files that I need to work on are quite big, coming from cosmological simulations; for example, one file is 4 GB and another could be 20 GB. I'm using supercomputer infrastructure for that reason.
This kind of reading works for files that have 512^3 float values (e.g. a density evaluated on points in a cube of side 512), but memory problems appear for a file with 1024^3 entries.
I had thought I should delete[] the "values" array, but when I do that, I get even worse memory problems, crashing the program even in the case that previously worked correctly (512^3).
How could I improve on this code? I would have used the std::vector container but I had to use the FFTW library.
EDIT:
Following the suggestions in the comments, I have rewritten the reading part of the code as:
std::ifstream file(filename1, std::ios::binary);
std::vector<float> buf(pow(grid_points, 3));
file.read(reinterpret_cast<char*>(buf.data()), buf.size() * sizeof(float));
std::copy_n(buf.begin(), pow(grid_points, 3), input1);
Where I explicitly make use of the knowledge of how many elements there will be in the input1 array. No memory leaks occur now.
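For completeness, here is a slightly more defensive version of the same read (only a sketch; filename1, grid_points, and input1 are assumed to be as above) that verifies the on-disk size before copying. Note that 1024^3 floats is 4 GiB, which no longer fits in the int used for the size in the original snippet:

// needs <cstdint>, <vector>, <algorithm>, <iostream>
std::ifstream file(filename1, std::ios::binary);
const std::size_t n = static_cast<std::size_t>(grid_points) * grid_points * grid_points;

file.seekg(0, std::ios::end);
const std::int64_t bytes = file.tellg();      // 64-bit; a 32-bit int overflows at 4 GiB
file.seekg(0, std::ios::beg);
if (bytes != static_cast<std::int64_t>(n * sizeof(float)))
    std::cerr << "Unexpected file size: " << bytes << " bytes\n";

std::vector<float> buf(n);
if (!file.read(reinterpret_cast<char*>(buf.data()), n * sizeof(float)))
    std::cerr << "Read failed\n";
std::copy(buf.begin(), buf.end(), input1);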

Related

C++ zlib: Trouble reading a zlib buffer with <fstream>:

So I'm trying to make use of zlib in C++ using Visual Studio 2019, to extract the contents of a specific file format. According to the documentation that I'm following, this file format consists mainly of "32-bit (4-byte) little-endian signed integers", and within several sections of the file there are also blocks of data compressed with zlib to save space.
But I believe that's not relevant to my problem; I'm having trouble just using zlib.
I should note that I'm unfamiliar with fstream and, more specifically, zlib. I'm guessing uncompress() is the function I'm looking for, since the number of compressed bytes is read before I even call it. It's also possible the issue is related to fstream itself.
But I do believe I'm not filling the buffer properly (or maybe not even reading it from the file properly): I'm getting either compile errors for incorrect types, the program crashing, or, most importantly, no blocks of uncompressed data. I can tell it's not working because the call returns Z_STREAM_ERROR (-2) or Z_DATA_ERROR (-3), not Z_OK (0). The program at least reads the 32-bit values correctly.
#include <iostream>
#include <fstream>
#include <cstdint>
#include "zlib.h"
using namespace std;

// Basically it works like this.
int main()
{
    streampos size;
    unsigned char memblock;
    char* memblock2;
    Bytef memblock3;
    Bytef memblock_res;
    int ret;
    int res = 0;
    uint32_t a;
    ifstream file("not_a_real_file.lol", ios::in | ios::binary | ios::ate);
    if (file.is_open())
    {
        size = file.tellg();
        file.seekg(0, ios::beg);
        file.read(reinterpret_cast<char*>(&a), sizeof(a));
        std::cout << "Format Identifier: " << a << "\n";
        file.read(reinterpret_cast<char*>(&a), sizeof(a));
        std::cout << "File Version: " << a << "\n";
        // A bunch of other 32-bit values go here; it would be redundant to list them all.
        file.read(reinterpret_cast<char*>(&a), sizeof(a));
        std::cout << "Length of Zlib Block: " << a << "\n";
        // Anyways, this is where things get really weird. I'm using 'a' to determine the length
        // in bytes, and I know it should be stored in its own variable.
        char* membuffer = new char[a];
        file.read(membuffer, a);
        uLongf zaz;
        res = uncompress(&memblock_res, &zaz, (unsigned char*)(&membuffer), a);
        if (res == Z_OK)
            std::cout << "Good!\n";
        std::cout << "This resulted in " << (int)res << ", it's got this many bytes: " << zaz << "\n";
        // It comes out as Z_DATA_ERROR with 0 bytes returned; obviously not the desired result.
        file.read(reinterpret_cast<char*>(&a), sizeof(a));
        std::cout << "Value after Block: " << a << "\n";
        // At least it seems the 32-bit value that comes after the block is read correctly.
        file.close();
    }
}
Either I'm using read() incorrectly, or I don't know how to properly convert the data for use with uncompress(). Or maybe I'm using the wrong functions altogether; I honestly have no clue. I've spent hours trying to figure this out, to no avail.
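For reference, zlib's one-shot uncompress() expects a caller-allocated destination buffer, and its length argument is in/out: you set it to the buffer's capacity before the call, and zlib overwrites it with the number of bytes actually produced. In the snippet above that means passing membuffer rather than &membuffer, initializing zaz, and pointing the first argument at a real buffer. A minimal sketch, under the assumption that an upper bound on the uncompressed size is known (the real format presumably records it somewhere):

#include <cstddef>
#include <iostream>
#include <vector>
#include "zlib.h"

// compressed: the zlib-wrapped block read from the file
// maxOut:     assumed upper bound on the uncompressed size
std::vector<unsigned char> inflateBlock(const std::vector<unsigned char>& compressed,
                                        std::size_t maxOut)
{
    std::vector<unsigned char> out(maxOut);
    uLongf destLen = static_cast<uLongf>(out.size());  // in: capacity, out: bytes written
    int ret = uncompress(out.data(), &destLen,
                         compressed.data(), static_cast<uLong>(compressed.size()));
    if (ret != Z_OK) {                                  // Z_BUF_ERROR / Z_DATA_ERROR / Z_MEM_ERROR
        std::cerr << "uncompress failed: " << ret << "\n";
        return {};
    }
    out.resize(destLen);                                // trim to the actual uncompressed size
    return out;
}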

Segmentation fault from pointer error (simple...)

I have a function which returns a pointer to an array of doubles:
double* centerOfMass(System &system) {
    long unsigned int size = system.atoms.size();
    double x_mass_sum = 0.0; double y_mass_sum = 0.0; double z_mass_sum = 0.0; double mass_sum = 0.0;
    for (int i = 0; i <= size; i++) {
        double atom_mass = system.atoms[i].m;
        mass_sum += atom_mass;
        x_mass_sum += system.atoms[i].pos["x"] * atom_mass;
        y_mass_sum += system.atoms[i].pos["y"] * atom_mass;
        z_mass_sum += system.atoms[i].pos["z"] * atom_mass;
    }
    double comx = x_mass_sum / mass_sum;
    double comy = y_mass_sum / mass_sum;
    double comz = z_mass_sum / mass_sum;
    double* output = new double[3]; // <-------- here is output
    output[0] = comx * 1e10; // convert all to A for writing xyz
    output[1] = comy * 1e10;
    output[2] = comz * 1e10;
    return output;
}
When I try to access the output by saving the array to a variable (in a different function), I get a segmentation fault when the program runs (but it compiles fine):
void writeXYZ(System &system, string filename, int step) {
    ofstream myfile;
    myfile.open(filename, ios_base::app);
    long unsigned int size = system.atoms.size();
    myfile << to_string(size) + "\nStep count: " + to_string(step) + "\n";
    for (int i = 0; i < size; i++) {
        myfile << system.atoms[i].name;
        myfile << " ";
        myfile << system.atoms[i].pos["x"] * 1e10;
        myfile << " ";
        myfile << system.atoms[i].pos["y"] * 1e10;
        myfile << " ";
        myfile << system.atoms[i].pos["z"] * 1e10;
        myfile << "\n";
    }
    // get center of mass
    double* comfinal = new double[3]; // goes fine
    comfinal = centerOfMass(system); // does NOT go fine..
    myfile << "COM " << to_string(comfinal[0]) << " " << to_string(comfinal[1]) << " " << to_string(comfinal[2]) << "\n";
    myfile.close();
}
The program runs normally until it calls centerOfMass.
I've checked most of the likely causes; I think I just lack understanding of pointers and their scope in C++. I'm seasoned in PHP, so dealing with memory explicitly is new to me.
Thank you kindly
I'm not sure about the type of system.atoms. If it's an STL container like std::vector, the condition part of the for loop inside centerOfMass is wrong.
long unsigned int size = system.atoms.size();
for (int i=0; i<=size; i++) {
should be
long unsigned int size = system.atoms.size();
for (int i=0; i<size; i++) {
PS1: You can use a range-based for loop (since C++11) to avoid this kind of problem (see the sketch after PS2).
PS2: You didn't delete[] the dynamically allocated arrays; consider using std::vector, std::array, or std::unique_ptr instead; they're designed to help you avoid this kind of issue.
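To illustrate PS1, the accumulation loop can drop the index entirely (a sketch, assuming system.atoms is a standard container holding the same members used above):

for (auto& atom : system.atoms) {        // no index, so no off-by-one to get wrong
    double atom_mass = atom.m;
    mass_sum   += atom_mass;
    x_mass_sum += atom.pos["x"] * atom_mass;
    y_mass_sum += atom.pos["y"] * atom_mass;
    z_mass_sum += atom.pos["z"] * atom_mass;
}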
In addition to the concerns pointed out by songyuanyao, the usage of the function in writeXYZ() causes a memory leak.
To see this, note that centerOfMass() does (with extraneous details removed)
double* output = new double[3]; // <-------- here is output
// assign values to output
return output;
and writeXYZ() does (note that I've changed comments to reflect what is actually happening, as distinct from your comments on what you thought was happening)
double* comfinal = new double[3]; // allocate three doubles
comfinal = centerOfMass(system); // lose reference to them
// output to myfile
If writeXYZ() is being called multiple times, then the three doubles will be leaked every time, EVEN IF, somewhere, delete [] comfinal is subsequently performed. If this function is called numerous times (for example, in a loop) eventually the amount of memory leaked can exceed what is available, and subsequent allocations will fail.
One fix of this problem is to change the relevant part of writeXYZ() to
double* comfinal = centerOfMass(system);
// output to myfile
delete [] comfinal; // eventually
Introducing std::unique_ptr in the above will alleviate the symptoms, but that is more a happy accident than good logic in the code (allocating memory only to discard it immediately without having used it is rarely good technique).
In practice, you are better off using standard containers (std::vector, etc.) and avoiding operator new altogether. But they still require you to stay within bounds.
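For instance, returning a std::array removes both the leak and the manual delete[] (a sketch; only the signature and return change, while the mass-weighted sums stay exactly as in the question):

#include <array>

std::array<double, 3> centerOfMass(System& system) {
    // ... same mass-weighted sums as in the original ...
    return { comx * 1e10, comy * 1e10, comz * 1e10 };   // convert to Angstroms, as before
}

// in writeXYZ():
std::array<double, 3> comfinal = centerOfMass(system);  // value semantics: nothing to delete
myfile << "COM " << comfinal[0] << " " << comfinal[1] << " " << comfinal[2] << "\n";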

Writing/reading large vectors of data to binary file in c++

I have a C++ program that computes populations within a given radius by reading gridded population data from an ASCII file into a large 8640x3432-element vector of doubles. Reading the ASCII data into the vector takes ~30 seconds (looping over each column and each row), while the rest of the program only takes a few seconds. I was asked to speed up this process by writing the population data to a binary file, which would supposedly read in faster.
The ASCII data file has a few header rows that give some data specs, like the number of columns and rows, followed by population data for each grid cell, formatted as 3432 rows of 8640 numbers separated by spaces. The population values are in mixed formats and can be just 0, a decimal value (0.000685648), or a value in scientific notation (2.687768e-05).
I found a few examples of reading/writing structs containing vectors to binary and tried to implement something similar, but am running into problems. When I both write and read the vector to/from the binary file in the same program, it seems to work and gives me all the correct values, but then it ends with either a "Segmentation fault: 11" or a memory-allocation error saying a "pointer being freed was not allocated". And if I try to just read the data in from the previously written binary file (without re-writing it in the same program run), it gives me the header variables just fine but then segfaults before giving me the vector data.
Any advice on what I might have done wrong, or on a better way to do this, would be greatly appreciated! I am compiling and running on a Mac, and I don't have Boost or other non-standard libraries at present. (Note: I am extremely new at coding and am having to learn by jumping in the deep end, so I may be missing a lot of basic concepts and terminology -- sorry!).
Here is the code I came up with:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <fstream>
#include <iostream>
#include <vector>
using namespace std;

// Define struct for population file data and initialize one struct variable
// for reading in ascii (A) and one for reading in binary (B)
struct popFileData
{
    int nRows, nCol;
    vector< vector<double> > popCount; // this will end up having 3432x8640 elements
} popDataA, popDataB;

int main() {
    string gridFname = "sample";
    double dum;
    vector<double> tempVector;

    // open ascii population grid file to stream
    ifstream gridFile;
    gridFile.open(gridFname + ".asc");
    int i = 0, j = 0;
    if (gridFile.is_open())
    {
        // read in header data from file
        string fileLine;
        gridFile >> fileLine >> popDataA.nCol;
        gridFile >> fileLine >> popDataA.nRows;
        popDataA.popCount.clear();
        // read in vector data, point-by-point
        for (i = 0; i < popDataA.nRows; i++)
        {
            tempVector.clear();
            for (j = 0; j < popDataA.nCol; j++)
            {
                gridFile >> dum;
                tempVector.push_back(dum);
            }
            popDataA.popCount.push_back(tempVector);
        }
        // close ascii grid file
        gridFile.close();
    }
    else
    {
        cout << "Population file read failed!" << endl;
    }

    // create/open binary file
    ofstream ofs(gridFname + ".bin", ios::trunc | ios::binary);
    if (ofs.is_open())
    {
        // write struct to binary file then close binary file
        ofs.write((char *)&popDataA, sizeof(popDataA));
        ofs.close();
    }
    else cout << "error writing to binary file" << endl;

    // read data from binary file into popDataB struct
    ifstream ifs(gridFname + ".bin", ios::binary);
    if (ifs.is_open())
    {
        ifs.read((char *)&popDataB, sizeof(popDataB));
        ifs.close();
    }
    else cout << "error reading from binary file" << endl;

    // compare results of reading in from the ascii file and reading in from the binary file
    cout << "File Header Values:\n";
    cout << "Columns (ascii vs binary): " << popDataA.nCol << " vs. " << popDataB.nCol << endl;
    cout << "Rows (ascii vs binary):" << popDataA.nRows << " vs." << popDataB.nRows << endl;
    cout << "Spot Check Vector Values: " << endl;
    cout << "Index 0,0: " << popDataA.popCount[0][0] << " vs. " << popDataB.popCount[0][0] << endl;
    cout << "Index 3431,8639: " << popDataA.popCount[3431][8639] << " vs. " << popDataB.popCount[3431][8639] << endl;
    cout << "Index 1600,4320: " << popDataA.popCount[1600][4320] << " vs. " << popDataB.popCount[1600][4320] << endl;
    return 0;
}
Here is the output when I both write and read the binary file in the same run:
File Header Values:
Columns (ascii vs binary): 8640 vs. 8640
Rows (ascii vs binary):3432 vs.3432
Spot Check Vector Values:
Index 0,0: 0 vs. 0
Index 3431,8639: 0 vs. 0
Index 1600,4320: 25.2184 vs. 25.2184
a.out(11402,0x7fff77c25310) malloc: *** error for object 0x7fde9821c000: pointer being freed was not allocated
*** set a breakpoint in malloc_error_break to debug
Abort trap: 6
And here is the output I get if I just try to read from the pre-existing binary file:
File Header Values:
Columns (binary): 8640
Rows (binary):3432
Spot Check Vector Values:
Segmentation fault: 11
Thanks in advance for any help!
When you write popDataA to the file, you are writing the binary representation of the vector of vectors. However this really is quite a small object, consisting of a pointer to the actual data (itself a series of vectors, in this case) and some size information.
When it's read back in to popDataB, it kinda works! But only because the raw pointer that was in popDataA is now in popDataB, and it points to the same stuff in memory. Things go crazy at the end, because when the memory for the vectors is freed, the code tries to free the data referenced by popDataA twice (once for popDataA, and once again for popDataB.)
The short version is, it's not a reasonable thing to write a vector to a file in this fashion.
So what to do? The best approach is to first decide on your data representation. It will, like the ASCII format, specify what value gets written where, and will include information about the matrix size, so that you know how large a vector you will need to allocate when reading them in.
In semi-pseudo code, writing will look something like:
int nrow = ...;
int ncol = ...;
ofs.write((char *)&nrow, sizeof(nrow));
ofs.write((char *)&ncol, sizeof(ncol));
for (int i = 0; i < nrow; ++i) {
    for (int j = 0; j < ncol; ++j) {
        double val = data[i][j];
        ofs.write((char *)&val, sizeof(val));
    }
}
And reading will be the reverse:
ifs.read((char *)&nrow, sizeof(nrow));
ifs.read((char *)&ncol, sizeof(ncol));
// allocate data-structure of size nrow x ncol
// ...
for (int i = 0; i < nrow; ++i) {
    for (int j = 0; j < ncol; ++j) {
        double val;
        ifs.read((char *)&val, sizeof(val));
        data[i][j] = val;
    }
}
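With a vector of vectors, the allocation step marked by the comment above can be as simple as (a sketch, using the nrow and ncol just read):

std::vector<std::vector<double>> data(nrow, std::vector<double>(ncol));  // nrow rows of ncol zeros

after which data[i][j] = val; works exactly as written.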
All that said though, you should consider not writing things into a binary file like this. These sorts of ad hoc binary formats tend to live on, long past their anticipated utility, and tend to suffer from:
Lack of documentation
Lack of extensibility
Format changes without versioning information
Issues when using saved data across different machines, including endianness problems, different default sizes for integers, etc.
Instead, I would strongly recommend using a third-party library. For scientific data, HDF5 and netcdf4 are good choices which address all of the above issues for you, and come with tools that can inspect the data without knowing anything about your particular program.
Lighter-weight options include the Boost serialization library and Google's protocol buffers, but these address only some of the issues listed above.
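For illustration, with the Boost serialization library the whole nested vector, sizes included, round-trips in a few lines. This is only a sketch assuming Boost is installed; the file name pop.bin is arbitrary, and Boost's binary archives are themselves not portable across platforms:

#include <fstream>
#include <vector>
#include <boost/archive/binary_oarchive.hpp>
#include <boost/archive/binary_iarchive.hpp>
#include <boost/serialization/vector.hpp>

int main() {
    std::vector<std::vector<double>> data(3432, std::vector<double>(8640, 0.0));

    {   // write: the archive records sizes as well as elements
        std::ofstream ofs("pop.bin", std::ios::binary);
        boost::archive::binary_oarchive oa(ofs);
        oa << data;
    }
    {   // read: the archive resizes the vectors for us
        std::vector<std::vector<double>> restored;
        std::ifstream ifs("pop.bin", std::ios::binary);
        boost::archive::binary_iarchive ia(ifs);
        ia >> restored;
    }
}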

istringstream() (possibly) memory leak

Suppose you want to read the data from a large text file (~300 MB) into an array of vectors, vector<string> *Data (assume that the number of columns is known).
// file is opened with ifstream; initial value of s is set up, etc...
Data = new vector<string>[col];
string u;
int i = 0;
do
{
    istringstream iLine = istringstream(s);
    i = 0;
    while (iLine >> u)
    {
        Data[i].push_back(u);
        i++;
    }
}
while (getline(file, s));
This code works fine for small files (<50 MB), but memory usage grows explosively when reading a large file. I'm pretty sure the problem is in creating the istringstream objects each time through the loop. However, defining istringstream iLine; outside both loops, feeding each string into the stream with iLine.str(s);, and clearing the stream after the inner while loop (iLine.str(""); iLine.clear();) causes the same order of memory explosion.
The questions that arise:
Why does istringstream behave this way?
If this is the intended behavior, how can the above task be accomplished?
Thank you
EDIT: In regard to the first answer: I do free the memory allocated for the array later in the code:
for(long i=0;i<col;i++)
Data[i].clear();
delete []Data;
FULL COMPILE-READY CODE (add headers):
int _tmain(int argc, _TCHAR* argv[])
{
    ofstream testfile;
    testfile.open("testdata.txt");
    srand(time(NULL));
    for (int i = 1; i < 1000000; i++)
    {
        for (int j = 1; j < 100; j++)
        {
            testfile << rand() % 100 << " ";
        }
        testfile << endl;
    }
    testfile.close();

    vector<string> *Data;
    clock_t begin = clock();
    ifstream file("testdata.txt");
    string s;
    getline(file, s);
    istringstream iss = istringstream(s);
    string nums;
    int col = 0;
    while (iss >> nums)
    {
        col++;
    }
    cout << "Columns #: " << col << endl;

    Data = new vector<string>[col];
    string u;
    int i = 0;
    do
    {
        istringstream iLine = istringstream(s);
        i = 0;
        while (iLine >> u)
        {
            Data[i].push_back(u);
            i++;
        }
    }
    while (getline(file, s));
    cout << "Rows #: " << Data[0].size() << endl;

    for (long i = 0; i < col; i++)
        Data[i].clear();
    delete[] Data;

    clock_t end = clock();
    double elapsed_secs = double(end - begin) / CLOCKS_PER_SEC;
    cout << elapsed_secs << endl;
    getchar();
    return 0;
}
vector<> grows memory geometrically. A typical pattern would be that it doubles the capacity whenever it needs to grow. That may leave a lot of extra space allocated but unused, if your loop ends right after such a threshold. You could try calling shrink_to_fit() on each vector when you are done.
Additionally, memory allocated by the C++ allocators (or even plain malloc()) is often not returned to the OS, but left in a process-internal free-memory pool. This may lead to further apparent growth, and it may make the results of shrink_to_fit() invisible from outside the process.
Finally, if you have lots of small strings (two-digit numbers), the overhead of a string object may be considerable. Even if the implementation uses a small-string optimization, I'd assume a typical string uses no less than 16 or 24 bytes (size, capacity, data pointer or small-string buffer), probably more on a platform where size_type is 64 bits. That is a lot of memory for 3 bytes of payload.
So I assume you are seeing the normal behaviour of vector<>.
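For example, to see how much of the footprint is just spare capacity, the read loop in the posted code could be followed by (a sketch; shrink_to_fit() is only a non-binding request in the standard):

for (long k = 0; k < col; ++k)
    Data[k].shrink_to_fit();   // ask each column's vector to give back unused capacity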
I seriously suspect this is not an istringstream problem (especially given that you get the same result with the iLine construction outside the loop).
Possibly, this is normal behavior of std::vector. To test that, run the exact same lines but comment out Data[i].push_back(u); and see whether your memory still grows this way. If it doesn't, then you know where the problem is.
Depending on your library, vector::push_back will expand the capacity by a factor of 1.5 (Microsoft) or 2 (GCC's libstdc++) every time it needs more room.

Several ifstreams access-violation

I am trying to implement an external merge sort (wiki), and I want to open 2048 ifstreams and read data into per-file buffers.
ifstream *file;
file = (ifstream *)malloc(2048 * sizeof(ifstream));
for (short i = 0; i < 2048; i++) {
    itoa(i, fileName + 5, 10);
    file[i].open(fileName, ios::in | ios::binary); // Access violation Error
    if (!file[i]) {
        cout << i << ".Bad open file" << endl;
    }
    if (!file[i].read((char*)perfile[i], 128*4)) {
        cout << i << ". Bad read source file" << endl;
    }
}
But it crashes with
Unhandled exception at 0x58f3a5fd (msvcp100d.dll) in sorting.exe: 0xC0000005: Access violation reading location 0xcdcdcdfd.
Is it possible to use so many open ifstreams?
Or is it a very bad idea to have 2048 open ifstreams, and is there a better way to implement this algorithm?
Arrays of non-POD objects are allocated with new, not with malloc, otherwise the constructors aren't run.
Your code is getting uninitialized memory and "interpreting" it as ifstreams, which obviously results in a crash (because the class's constructor hasn't been run, not even the virtual table pointers are in place).
You can either allocate all your objects on the stack:
ifstream file[2048];
or allocate them on the heap if stack occupation is a concern;
ifstream *file=new ifstream[2048];
// ...
delete[] file; // frees the array
(although you should use a smart pointer here to avoid memory leaks in case of exceptions)
or, better, use a vector of ifstream (requires header <vector>):
vector<ifstream> file(2048);
which does not require explicit deallocation of its elements.
(in theory, you could use malloc and then use placement new, but I wouldn't recommend it at all)
... besides, opening 2048 files at the same time doesn't feel like a great idea...
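For reference, the std::vector route from above would look roughly like this (a sketch; it needs C++11, since streams are movable but not copyable, and the file-naming scheme here is purely hypothetical):

#include <fstream>
#include <iostream>
#include <string>
#include <vector>

std::vector<std::ifstream> files(2048);              // default-constructs 2048 closed streams
for (std::size_t i = 0; i < files.size(); ++i) {
    std::string fileName = "chunk" + std::to_string(i) + ".bin";  // hypothetical naming scheme
    files[i].open(fileName, std::ios::in | std::ios::binary);
    if (!files[i])
        std::cout << i << ". Bad open file\n";
}
// every stream is closed and destroyed automatically when 'files' goes out of scope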
This is C++. ifstream is non-POD, so you can't just malloc it: the instances need to be constructed.
ifstream file[2048];
for (short i = 0; i < 2048; i++) {
    itoa(i, fileName + 5, 10);
    file[i].open(fileName, ios::in | ios::binary);
    if (!file[i]) {
        cout << i << ".Bad open file" << endl;
    }
    if (!file[i].read((char*)perfile[i], 128*4)) {
        cout << i << ". Bad read source file" << endl;
    }
}
Besides that, opening 2048 files doesn't sound like a good plan, but you can figure that out later.
The value 0xcdcdcdcd is used by VS in debug mode to represent uninitialized memory (also keep an eye out for 0xbaadf00d).
You are using malloc which is of C heritage and does not call constructors, it simply gives you a pointer to a chunk of data. An ifstream is not a POD (Plain Old Data) type; it needs you to call its constructor in order to initialize properly. This is C++; use new and delete.
Better yet, don't use either; just construct the thing on the stack and let it handle dynamic memory allocation as it was meant to be used.
Of course, this doesn't even touch on the horrible idea to open 2048 files, but you should probably learn that one the hard way...
You cannot open 2048 files; there is an operating-system limit on the number of open files.
As far as I can see, you don't really need an array of 2048 separate ifstreams here at all. You only need one ifstream at any given time, so each iteration you close one file and open another. Destroying an ifstream closes the file automatically, so you can do something like this:
for (short i = 0; i < 2048; i++) {
    itoa(i, fileName + 5, 10);
    ifstream file(fileName, ios::in | ios::binary);
    if (!file) {
        cout << i << ".Bad open file" << endl;
    }
    if (!file.read((char*)perfile[i], 128*4)) {
        cout << i << ". Bad read source file" << endl;
    }
}