I have a C++ program that computes populations within a given radius by reading gridded population data from an ASCII file into a large 8640x3432-element vector of doubles. Reading the ASCII data into the vector takes ~30 seconds (looping over each column and each row), while the rest of the program only takes a few seconds. I was asked to speed up this process by writing the population data to a binary file, which would supposedly read in faster.
The ascii data file has a few header rows that give some data specs like the number of columns and rows, followed by population data for each grid cell, which is formatted as 3432 rows of 8640 numbers, separated by spaces. The population data numbers are mixed formats and can be just 0, a decimal value (0.000685648), or a value in scientific notation (2.687768e-05).
I found a few examples of reading/writing structs containing vectors to binary, and tried to implement something similar, but am running into problems. When I both write and read the vector to/from the binary file in the same program, it seems to work and gives me all the correct values, but then it ends with either a "segment fault: 11" or a memory allocation error that a "pointer being freed was not allocated". And if I try to just read the data in from the previously written binary file (without re-writing it in the same program run), then it gives me the header variables just fine but gives me a segfault before giving me the vector data.
Any advice on what I might have done wrong, or on a better way to do this would be greatly appreciated! I am compiling and running on a mac, and I don't have boost or other non-standard libraries at present. (Note: I am extremely new at coding and am having to learn by jumping in the deep end, so I may be missing a lot of basic concepts and terminology -- sorry!).
Here is the code I came up with:
# include <stdio.h>
# include <stdlib.h>
# include <string.h>
# include <fstream>
# include <iostream>
# include <vector>
# include <string.h>
using namespace std;
//Define struct for population file data and initialize one struct variable for reading in ascii (A) and one for reading in binary (B)
struct popFileData
{
int nRows, nCol;
vector< vector<double> > popCount; //this will end up having 3432x8640 elements
} popDataA, popDataB;
int main() {
string gridFname = "sample";
double dum;
vector<double> tempVector;
//open ascii population grid file to stream
ifstream gridFile;
gridFile.open(gridFname + ".asc");
int i = 0, j = 0;
if (gridFile.is_open())
{
//read in header data from file
string fileLine;
gridFile >> fileLine >> popDataA.nCol;
gridFile >> fileLine >> popDataA.nRows;
popDataA.popCount.clear();
//read in vector data, point-by-point
for (i = 0; i < popDataA.nRows; i++)
{
tempVector.clear();
for (j = 0; j<popDataA.nCol; j++)
{
gridFile >> dum;
tempVector.push_back(dum);
}
popDataA.popCount.push_back(tempVector);
}
//close ascii grid file
gridFile.close();
}
else
{
cout << "Population file read failed!" << endl;
}
//create/open binary file
ofstream ofs(gridFname + ".bin", ios::trunc | ios::binary);
if (ofs.is_open())
{
//write struct to binary file then close binary file
ofs.write((char *)&popDataA, sizeof(popDataA));
ofs.close();
}
else cout << "error writing to binary file" << endl;
//read data from binary file into popDataB struct
ifstream ifs(gridFname + ".bin", ios::binary);
if (ifs.is_open())
{
ifs.read((char *)&popDataB, sizeof(popDataB));
ifs.close();
}
else cout << "error reading from binary file" << endl;
//compare results of reading in from the ascii file and reading in from the binary file
cout << "File Header Values:\n";
cout << "Columns (ascii vs binary): " << popDataA.nCol << " vs. " << popDataB.nCol << endl;
cout << "Rows (ascii vs binary):" << popDataA.nRows << " vs." << popDataB.nRows << endl;
cout << "Spot Check Vector Values: " << endl;
cout << "Index 0,0: " << popDataA.popCount[0][0] << " vs. " << popDataB.popCount[0][0] << endl;
cout << "Index 3431,8639: " << popDataA.popCount[3431][8639] << " vs. " << popDataB.popCount[3431][8639] << endl;
cout << "Index 1600,4320: " << popDataA.popCount[1600][4320] << " vs. " << popDataB.popCount[1600][4320] << endl;
return 0;
}
Here is the output when I both write and read the binary file in the same run:
File Header Values:
Columns (ascii vs binary): 8640 vs. 8640
Rows (ascii vs binary):3432 vs.3432
Spot Check Vector Values:
Index 0,0: 0 vs. 0
Index 3431,8639: 0 vs. 0
Index 1600,4320: 25.2184 vs. 25.2184
a.out(11402,0x7fff77c25310) malloc: *** error for object 0x7fde9821c000: pointer being freed was not allocated
*** set a breakpoint in malloc_error_break to debug
Abort trap: 6
And here is the output I get if I just try to read from the pre-existing binary file:
File Header Values:
Columns (binary): 8640
Rows (binary):3432
Spot Check Vector Values:
Segmentation fault: 11
Thanks in advance for any help!
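When you write popDataA to the file, you are writing the binary representation of the struct itself, including that of the vector of vectors. However, the vector object really is quite small: it consists of a pointer to the actual data (itself a series of vectors, in this case) and some size information. The population values live elsewhere on the heap and never reach the file.
When it's read back in to popDataB, it kinda works! But only because the raw pointer that was in popDataA is now in popDataB, and it points to the same stuff in memory. Things go crazy at the end, because when the memory for the vectors is freed, the code tries to free the data referenced by popDataA twice (once for popDataA, and once again for popDataB). In a separate run that only reads the file, the saved pointer refers to memory that was never allocated in that process, which is why the header values come back fine but the first access to the vector data segfaults.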
The short version is, it's not a reasonable thing to write a vector to a file in this fashion.
So what to do? The best approach is to first decide on your data representation. It will, like the ASCII format, specify what value gets written where, and will include information about the matrix size, so that you know how large a vector you will need to allocate when reading them in.
In semi-pseudo code, writing will look something like:
int nrow=...;
int ncol=...;
ofs.write((char *)&nrow,sizeof(nrow));
ofs.write((char *)&ncol,sizeof(ncol));
for (int i=0;i<nrow;++i) {
for (int j=0;j<ncol;++j) {
double val=data[i][j];
ofs.write((char *)&val,sizeof(val));
}
}
And reading will be the reverse:
ifs.read((char *)&nrow,sizeof(nrow));
ifs.read((char *)&ncol,sizeof(ncol));
// allocate data-structure of size nrow x ncol
// ...
for (int i=0;i<nrow;++i) {
for (int j=0;j<ncol;++j) {
double val;
ifs.read((char *)&val,sizeof(val));
data[i][j]=val;
}
}
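To make the outline above concrete, here is a minimal sketch under a few assumptions that are mine rather than the original post's: the grid is kept in one contiguous row-major std::vector<double> instead of a vector of vectors (the hypothetical Grid struct below), the dimensions are written as fixed-width std::int32_t, and the file is read on the same kind of machine that wrote it (no endianness handling). With contiguous storage the whole payload can go through a single write() or read() call rather than one call per value.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <istream>
#include <ostream>
#include <sstream>
#include <vector>

// Hypothetical container: one contiguous row-major buffer.
struct Grid {
    std::int32_t nRows = 0;
    std::int32_t nCols = 0;
    std::vector<double> cells;  // element (r, c) lives at index r * nCols + c

    double at(int r, int c) const {
        return cells[static_cast<std::size_t>(r) * static_cast<std::size_t>(nCols) +
                     static_cast<std::size_t>(c)];
    }
};

// Writes the two dimensions, then the whole buffer in one call.
bool writeGrid(std::ostream& out, const Grid& g) {
    assert(g.cells.size() ==
           static_cast<std::size_t>(g.nRows) * static_cast<std::size_t>(g.nCols));
    out.write(reinterpret_cast<const char*>(&g.nRows), sizeof g.nRows);
    out.write(reinterpret_cast<const char*>(&g.nCols), sizeof g.nCols);
    out.write(reinterpret_cast<const char*>(g.cells.data()),
              static_cast<std::streamsize>(g.cells.size() * sizeof(double)));
    return static_cast<bool>(out);
}

// Reads the dimensions first, sizes the buffer, then reads the values into it.
bool readGrid(std::istream& in, Grid& g) {
    in.read(reinterpret_cast<char*>(&g.nRows), sizeof g.nRows);
    in.read(reinterpret_cast<char*>(&g.nCols), sizeof g.nCols);
    if (!in || g.nRows < 0 || g.nCols < 0) return false;
    g.cells.resize(static_cast<std::size_t>(g.nRows) * static_cast<std::size_t>(g.nCols));
    in.read(reinterpret_cast<char*>(g.cells.data()),
            static_cast<std::streamsize>(g.cells.size() * sizeof(double)));
    return static_cast<bool>(in);
}

// Self-check on an in-memory stream; a std::ofstream / std::ifstream opened
// with std::ios::binary would be passed to the same functions.
bool roundTripsCorrectly() {
    Grid original;
    original.nRows = 2;
    original.nCols = 3;
    original.cells = {0.0, 0.000685648, 2.687768e-05, 25.2184, 1.0, 2.0};

    std::stringstream buffer(std::ios::in | std::ios::out | std::ios::binary);
    Grid copy;
    return writeGrid(buffer, original) && readGrid(buffer, copy) &&
           copy.nRows == 2 && copy.nCols == 3 && copy.cells == original.cells;
}
```

For the real data you would fill a Grid from the ASCII file once, call writeGrid on an ofstream, and from then on only call readGrid. The caveats about ad hoc formats still apply to this layout.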
All that said though, you should consider not writing things into a binary file like this. These sorts of ad hoc binary formats tend to live on, long past their anticipated utility, and tend to suffer from:
Lack of documentation
Lack of extensibility
Format changes without versioning information
Issues when using saved data across different machines, including endianness problems, different default sizes for integers, etc.
Instead, I would strongly recommend using a third-party library. For scientific data, HDF5 and NetCDF-4 are good choices which address all of the above issues for you, and come with tools that can inspect the data without knowing anything about your particular program.
Lighter-weight options include the Boost serialization library and Google's Protocol Buffers, but these address only some of the issues listed above.
So I'm trying to make use of zlib in C++ using Visual Studio 2019, to extract the contents from a specific file format. According to the documentation that I'm following, this file format consists mainly of values that are "32-bit (4-byte) little-endian signed integers", and within several sections of the file there are also blocks of data that are compressed by zlib to save space.
But I believe that's not relevant to my problem; I'm having trouble with simply using zlib.
I should note that I'm unfamiliar with fstream and, more specifically, with zlib. I'm guessing uncompress() may be the function I'm looking for, since the number of compressed bytes is read before I call it. It's not unlikely that the issue is related to the former library.
But I do believe I'm not passing in the buffer properly (or maybe not even reading it from the file properly), as I'm getting syntax errors for incorrect types or the program crashing and, most importantly, I'm unable to get the blocks of uncompressed data. I can tell it's not working properly as it's returning Z_STREAM_ERROR (-2) or Z_DATA_ERROR (-3) from the call, not Z_OK (0). The program does at least read the 32-bit data correctly.
#include <iostream>
#include <fstream>
#include "zlib.h"
using namespace std;
//Basically it works like this.
int main()
{
streampos size;
unsigned char memblock;
char* memblock2;
Bytef memblock3;
Bytef memblock_res;
int ret;
int res=0;
uint32_t a;
ifstream file("not_a_real_file.lol", ios::in | ios::binary | ios::ate);
if (file.is_open())
{
size = file.tellg();
file.seekg(0, ios::beg);
file.read(reinterpret_cast<char*>(&a), sizeof(a));
std::cout << "Format Identifier: " << a << "\n";
file.read(reinterpret_cast<char*>(&a), sizeof(a));
std::cout << "File Version: " << a << "\n";
//A bunch of other 32-bit values be here, would be redudent to put them all.
file.read(reinterpret_cast<char*>(&a), sizeof(a));
std::cout << "Length of Zlib Block: " << a << "\n";
//Anyways, this is where things get really weird. I'm using 'a' to determine the length of bytes, and I know it should be stored into it's own variable.
char* membuffer = new char[a];
file.read(membuffer, a);
uLongf zaz;
res=uncompress(&memblock_res, &zaz, (unsigned char*)(&membuffer), a);
if (res==Z_OK)
std::cout << "Good!\n";
std::cout << "This resulted in " << (int)res << ", it's got this many bytes: " << zaz << "\n";
//It should be Z_DATA_ERROR with 0 bytes returned; it's obviously not the desired results.
file.read(reinterpret_cast<char*>(&a), sizeof(a));
std::cout << "Value after Block: " << a << "\n";
//At least it seems the 32-bit value that comes after the block is correctly read.
file.close();
}
}
Either I'm using read() incorrectly, or I don't know how to properly convert the data for use with uncompress(). Or maybe I'm using the wrong functions; I honestly have no clue. I spent hours looking things up to figure this out, but to no avail.
I am reading a file header using ifstream.
Edit: I was asked to put the full minimal program, so here it is.
#include <iostream>
#include <fstream>
using namespace std;
#pragma pack(push,2)
struct Header
{
char label[20];
char st[11];
char co[7];
char plusXExtends[9];
char minusXExtends[9];
char plusYExtends[9];
};
#pragma pack(pop)
int main(int argc,char* argv[])
{
string fileName;
fileName = "test";
string fileInName = fileName + ".dst";
ifstream fileIn(fileInName.c_str(), ios_base::binary|ios_base::in);
if (!fileIn)
{
cout << "File Not Found" << endl;
return 0;
}
Header h={};
if (fileIn.is_open()) {
cout << "\n" << endl;
fileIn.read(reinterpret_cast<char *>(&h.label), sizeof(h.label));
cout << "Label: " << h.label << endl;
fileIn.read(reinterpret_cast<char *>(&h.st), sizeof(h.st));
cout << "Stitches: " << h.st << endl;
fileIn.read(reinterpret_cast<char *>(&h.co), sizeof(h.co));
cout << "Colour Count: " << h.co << endl;
fileIn.read(reinterpret_cast<char *>(&h.plusXExtends),sizeof(h.plusXExtends));
cout << "Extends: " << h.plusXExtends << endl;
fileIn.read(reinterpret_cast<char *>(&h.minusXExtends),sizeof(h.minusXExtends));
cout << "Extends: " << h.minusXExtends << endl;
fileIn.read(reinterpret_cast<char *>(&h.plusYExtends),sizeof(h.plusYExtends));
cout << "Extends: " << h.plusYExtends << endl;
// This will output corrupted
cout << endl << endl;
cout << "Label: " << h.label << endl;
cout << "Stitches: " << h.st << endl;
cout << "Colour Count: " << h.co << endl;
cout << "Extends: " << h.plusXExtends << endl;
cout << "Extends: " << h.minusXExtends << endl;
cout << "Extends: " << h.plusYExtends << endl;
}
fileIn.close();
cout << "\n";
//cin.get();
return 0;
}
ifstream fileIn(fileInName.c_str(), ios_base::binary|ios_base::in);
Then I use a struct to store the header items
The actual struct is longer than this. I shortened it because I didn't need the whole struct for the question.
Anyway as I read the struct I do a cout to see what I am getting. This part is fine.
As expected my cout shows the Label, Stitches, Colour Count no problem.
The problem is that if I want to do another cout after it has read the header I am getting corruption in the output. For instance if I put the following lines right after the above code eg
Instead of seeing Label, Stitches and Colour Count I get strange symbols and corrupt output. Sometimes you can see the output of h.label, with some corruption, but the labels and Stitches are written over, sometimes with strange symbols and sometimes with text from the previous cout. I think either the data in the struct is getting corrupted, or the cout output is getting corrupted, and I don't know why. The longer the header, the more apparent the problem becomes. I would really like to do all the couts at the end of the header, but if I do that I see a big mess instead of what should be output.
My question is why is my cout becoming corrupted?
Using arrays to store strings is dangerous because if you allocate 20 characters to store the label and the label happens to be 20 characters long, then there is no room to store a NUL (0) terminating character. Once the bytes are stored in the array there's nothing to tell functions that are expecting null-terminated strings (like the operator<< that cout uses for char arrays) where the end of the string is.
Your label has 20 chars. That's enough to store the first 20 letters of the alphabet:
ABCDEFGHIJKLMNOPQRST
But this is not a null-terminated string. This is just an array of characters. In fact, in memory, the byte right after the T will be the first byte of the next field, which happens to be your 11-character st array. Let's say those 11 characters are: abcdefghijk.
Now the bytes in memory look like this:
ABCDEFGHIJKLMNOPQRSTabcdefghijk
There's no way to tell where label ends and st begins. When you pass a pointer to the first byte of the array that is intended to be interpreted as a null-terminated string by convention, the implementation will happily start scanning until it finds a null terminating character (0). Which, on subsequent reuses of the structure, it may not! There's a serious risk of overrunning the buffer (reading past the end of the buffer), and potentially even the end of your virtual memory block, ultimately causing an access violation / segmentation fault.
When your program first ran, the memory of the header structure was all zeros (because you initialized with {}) and so after reading the label field from disk, the bytes after the T were already zero, so your first cout worked correctly. There happened to be a terminating null character at st[0]. You then overwrite this when you read the st field from disk. When you come back to output label again, the terminator is gone, and some characters of st will get interpreted as belonging to the string.
To fix the problem you probably want to use a different, more practical data structure to store your strings that allows for convenient string functions. And use your raw header structure just to represent the file format.
You can still read the data from disk into memory using fixed sized buffers, this is just for staging purposes (to get it into memory) but then store the data into a different structure that uses std::string variables for convenience and later use by your program.
For this you'll want these two structures:
#pragma pack(push,2)
struct RawHeader // only for file IO
{
char label[20];
char st[11];
char co[7];
char plusXExtends[9];
char minusXExtends[9];
char plusYExtends[9];
};
#pragma pack(pop)
struct Header // A much more practical Header struct than the raw one
{
std::string label;
std::string st;
std::string co;
std::string plusXExtends;
std::string minusXExtends;
std::string plusYExtends;
};
After you read the first structure, you'll transfer the fields by assigning the variables. Here's a helper function to do it.
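#include <algorithm>
#include <cstddef>
#include <string>
// Copies at most n characters, stopping early at the first NUL if there is one.
// (std::find is used here because strnlen_s is an optional C library extension.)
template <std::size_t n> std::string arrayToString(const char(&raw)[n]) {
return std::string(raw, std::find(raw, raw + n, '\0'));
}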
In your function:
Header h;
RawHeader raw;
fileIn.read((char*)&raw, sizeof(raw));
// Now marshal all the fields from the raw header over to the practical header.
h.label = arrayToString(raw.label);
h.st = arrayToString(raw.st);
h.co = arrayToString(raw.co);
h.plusXExtends = arrayToString(raw.plusXExtends);
h.minusXExtends = arrayToString(raw.minusXExtends);
h.plusYExtends = arrayToString(raw.plusYExtends);
It's worth mentioning that you also have the option of keeping the raw structure around and not copying your raw char arrays to std::strings when you read the file. But you must then be certain that when you want to use the data, you always compute and pass the lengths of the strings to functions that will deal with those buffers as string data. (Similar to what my arrayToString helper does anyway.)
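As a quick way to see the length-bounded conversion at work, here is a self-contained sketch. It uses a trimmed, hypothetical RawHeader with only three of the fields, and it finds the end of each field with std::find, which is my substitution for strnlen_s (an optional Annex K function that not every toolchain provides). The sample label fills all 20 bytes, so there is no terminating NUL inside the field, which is exactly the case that broke the second round of couts.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>

// Trimmed, hypothetical version of the raw on-disk header.
struct RawHeader {
    char label[20];
    char st[11];
    char co[7];
};

// Converts a fixed-size field to std::string: stops at the first NUL,
// or at the end of the array if the field is completely full.
template <std::size_t N>
std::string arrayToString(const char (&raw)[N]) {
    return std::string(raw, std::find(raw, raw + N, '\0'));
}

// Builds a header whose label and st fields are full (no NUL inside them).
RawHeader makeSampleHeader() {
    RawHeader raw{};                                      // all bytes zero
    std::memcpy(raw.label, "ABCDEFGHIJKLMNOPQRST", 20);   // exactly 20 bytes
    std::memcpy(raw.st, "abcdefghijk", 11);               // exactly 11 bytes
    std::memcpy(raw.co, "12", 2);                         // short field, rest stays zero
    return raw;
}

// The conversion never runs past the end of a field.
void checkSample() {
    const RawHeader raw = makeSampleHeader();
    assert(arrayToString(raw.label).size() == 20);
    assert(arrayToString(raw.st).size() == 11);
    assert(arrayToString(raw.co) == "12");
}
```

Printing raw.label directly with cout on this sample would keep reading into st, whereas the converted strings carry their own lengths.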
I am assigned a task where I have to explain why PrintStream and DataOutputStream produce two different kinds of output files (which I know - the first writes a string representation byte-by-byte, whilst the second writes the raw binary data). In order to elaborate on the background of this, I wanted to write a small C++ file to demonstrate reading the written data off the file back to stdout.
The idea is simple: Write short values from 20,000 to 32,000 to a file using DataOutputStream and its writeShort(int) method. According to the Java documentation, those values are written in two bytes.
Now... I did try to implement this with std::ifstream on the C++ side, and I believe I ran into some endianness-related issues. According to what I have gathered from various SO questions, Java will write in "network format", which is another name for big-endian. But as far as I am aware, my Mac (MacBook, mid-2014) is little-endian, so the bytes are in the wrong order.
This is what I have come up with so far:
#include <iostream>
#include <fstream>
using namespace std;
int main(int argc, char** argv) {
ifstream fh("./out.DataOutputStream.dat", ios::in|ios::binary);
if(!fh.is_open()) {
cerr << "Error while opening file." << endl;
cerr << "Are you in the same directory as <out.DataOutputStream.dat>?" << endl;
return 1;
}
cout << "--- Begin of data ---" << endl;
char num1, num2;
#define SWAP(b) ( (b >> 8) | (b << 8) )
while(!fh.eof()) {
fh.read(&num1, 1); // read one byte
fh.read(&num2, 1); // read the next byte
cout << (unsigned short)SWAP(num2) << (unsigned short)SWAP(num1);
}
cout << flush;
cout << "--- End of data ---" << endl;
return 0;
}
This result does print 32000 at the (very) end...but it prints that twice, and everything else is completely off... Any idea on how I can get this to work with the STL only?
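One possible way to approach this with the standard library only, as a sketch rather than a definitive answer: DataOutputStream.writeShort writes the high byte first, so each value can be rebuilt as (first byte << 8) | second byte, and no separate swap is needed. The sketch also loops on the read itself instead of on eof(), since testing eof() before reading processes the last value a second time after the final read has failed. The function name readBigEndianShorts is made up for this example.

```cpp
#include <cassert>
#include <cstdint>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Reads 16-bit big-endian values (high byte first) until the stream runs out.
std::vector<std::int16_t> readBigEndianShorts(std::istream& in) {
    std::vector<std::int16_t> values;
    unsigned char bytes[2];
    // The loop condition is the read itself, so a failed read at end of file
    // is never turned into a value. A trailing single byte is ignored.
    while (in.read(reinterpret_cast<char*>(bytes), 2)) {
        const std::uint16_t u =
            static_cast<std::uint16_t>((bytes[0] << 8) | bytes[1]);
        values.push_back(static_cast<std::int16_t>(u));
    }
    return values;
}

// Self-check on an in-memory stream holding 20000 (0x4E20) and 32000 (0x7D00).
void checkDecoding() {
    const unsigned char raw[] = {0x4E, 0x20, 0x7D, 0x00};
    std::istringstream in(std::string(reinterpret_cast<const char*>(raw), sizeof raw));
    const std::vector<std::int16_t> values = readBigEndianShorts(in);
    assert(values.size() == 2);
    assert(values[0] == 20000);
    assert(values[1] == 32000);
}
```

With a file, the same function would take an std::ifstream opened with ios::binary, and the values would be printed with a separator between them so that consecutive numbers do not run together.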
I need to read in an mp3 file so that I can run the hash(). I do not need to parse the mp3 tag data out of this so I can just read the whole thing all together.
Currently I am using ifstream() to open the file in binary mode. I then get the size of the file, allocate enough space with a char* and read it all at once.
I know that when I run cout on this data I can only see "ID3 and some gibberish." I opened the mp3 file up in a hex editor and ID3 and the gibberish was what was at the beginning of the file. The next binary data I believe is being interpreted as end of line/string and does not print.
This is okay because I don't need to print it. I need to get the data in a format that I can run the Hash function on. Any ideas on a type I can convert it to that will not interpret the end of the file being a couple bytes in?
Here is code of what I have so far.
bool Sender::openSoundFile(){
streampos size;
soundSampleStream.open(soundFilePath.c_str(), ios::in|ios::binary|ios::ate);
if(!soundSampleStream.is_open()){
return false;
}
size = soundSampleStream.tellg();
cout << "Size of MP3: " << size << endl;
soundFileInMemory = new char [size];
soundSampleStream.seekg (0, ios::beg);
soundSampleStream.read(soundFileInMemory, size);
cout << "Error is: " << strerror(errno) << endl;
cout << "gcount: " << soundSampleStream.gcount() << endl;
soundSampleStream.close();
cout << soundFileInMemory << endl;
return true;
}
I get no error on reading the file and gcount() comes back with the correct numbers of bytes for the file.
Edit 1:
To add some more on this. The hash() seems to hash the char* and not the data being pointed at because the hash value changes on different program runs. This is why I need to convert to some other thing. I also don't think that a vector is supported by the c++11 hash().
std::string has a constructor that takes a char * and a size_t. See the fourth item in http://en.cppreference.com/w/cpp/string/basic_string/basic_string.
std::string file_contents(soundFileInMemory, size);
That will convert your char array to a string.
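For illustration, a minimal sketch of how that could fit together, assuming std::hash<std::string> is the hash in question (the helper names readAll and hashBytes are made up here). Because the string is built with an explicit length, bytes after an embedded zero are kept and take part in the hash. Keep in mind that std::hash is not a cryptographic hash and its values are not guaranteed to match across different standard library implementations.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <istream>
#include <iterator>
#include <sstream>
#include <string>

// Reads every remaining byte of the stream into a std::string.
// Embedded zero bytes are preserved because the string tracks its own length.
std::string readAll(std::istream& in) {
    return std::string(std::istreambuf_iterator<char>(in),
                       std::istreambuf_iterator<char>());
}

// Hashes the bytes themselves, not the address of the buffer.
std::size_t hashBytes(const std::string& bytes) {
    return std::hash<std::string>{}(bytes);
}

// Self-check: data with a zero byte in the middle survives intact.
void checkEmbeddedZero() {
    std::string data("ID3", 3);
    data.push_back('\0');
    data += "tail";
    std::istringstream in(data);
    const std::string contents = readAll(in);
    assert(contents.size() == 8);
    assert(hashBytes(contents) == hashBytes(data));
}
```

For the mp3 case the stream would be an std::ifstream opened with ios::binary, and the result of hashBytes depends only on the file contents within one build of the program.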
I want to write my array structure in to a binary file.
My structure
typedef struct student{
char name[15];
vector<int> grade;
}arr_stu;
I can write and read back my data if I write and read in the same program; but if I create another program for read data only and put the binary file, it does not work because the vector grade is null.
size = 0;
unable to read from memory
Program to write array structure to file
int main()
{
arr_stu stu[100];
for (size_t i = 0; i < 100; i++)
{
strcpy(stu[i].name, randomName());
for (size_t j = 0; j < 10; j++)
{
stu[i].grade.push_back(randomGrade());
}
}
ofstream outbal("class", ios::out | ios::binary);
if (!outbal) {
cout << "Cannot open file.\n";
return 1;
}
outbal.write((char *)&stu, sizeof(stu));
outbal.close();
}
Program to read array structure from file
int main(){
arr_stu stu[100];
ifstream inbal("class", ios::in | ios::binary);
if (!inbal) {
cout << "Cannot open file.\n";
return 1;
}
inbal.read((char *)&stu, sizeof(stu));
for (size_t idx = 0; idx < 100; idx++)
{
cout << "Name : " << stu[idx].name << endl;
for (size_t index = 0; index < 10; index++)
{
cout << endl << "test: " << stu[idx].grade[index] << endl;
}
}
inbal.close();
return 0;
}
To me it seems like the use of vector poses the problem.
I think that if we combine the two in one program it works because the vector's data is still in memory, so it is still accessible.
Any suggestions?
You cannot serialize a vector like that. The write and read functions access the memory at the given address directly. Since vector is a complex class type only parts of its data content are stored sequentially at its base address. Other parts (heap allocated memory etc) are located elsewhere. The simplest solution would be to write the length of the vector to the file followed by each of the values. You have to loop over the vector elements to accomplish that.
outbal.write((char *)&stu, sizeof(stu));
The sizeof is a compile-time constant. In other words, it never changes. If the vector contained 1, 10, 1000, or 1,000,000 items, you're writing the same number of bytes to the file. So this way of writing to the file is totally wrong.
The struct that you're writing is non-POD due to the vector being a non-POD type. This means you can't just treat it as a set of bytes that you can copy from or to. If you want further proof, open the file you created in any editor. Can you see the data from the vector in that file? What you will see is more than likely, gibberish.
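Both points can be checked directly. The sketch below uses a hypothetical Student struct with the same members as the one in the question: the static_assert confirms the type is not trivially copyable, and the helper shows that sizeof gives the same number no matter how many grades the vector holds.

```cpp
#include <cassert>
#include <cstddef>
#include <type_traits>
#include <vector>

// Same members as the struct in the question (name chosen for this example).
struct Student {
    char name[15];
    std::vector<int> grade;
};

// A type holding a std::vector cannot be copied safely as raw bytes.
static_assert(!std::is_trivially_copyable<Student>::value,
              "Student must not be written with a raw byte copy");

// Returns how many bytes write((char*)&s, sizeof(s)) would copy for a
// student holding gradeCount grades.
std::size_t bytesRawWriteWouldCopy(std::size_t gradeCount) {
    Student s{};
    s.grade.assign(gradeCount, 90);
    return sizeof(s);  // fixed at compile time, independent of gradeCount
}

// The byte count does not grow with the vector's contents.
void checkSizeIsConstant() {
    assert(bytesRawWriteWouldCopy(0) == bytesRawWriteWouldCopy(100000));
}
```

So the grades themselves, which live on the heap, are never part of what the raw write copies.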
To write the data to the file, you have to properly serialize the data, meaning you have to write the data to a file, not the struct itself. You write the data in a way so that when you read the data back, you can recreate the struct. Ultimately, this means you have to
Write the name to the file, and possibly the number of bytes the name consists of.
Write the number of items in the vector
Write each vector item to the file.
If not this, then some way where you can distinctly figure out the name and the vector's data from the file so that your code to read the data parses the file correctly and recreates the struct.
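The three steps listed above could be sketched like this. The layout is an assumption made for the example, not a standard format: the fixed 15-byte name, then a 32-bit count, then the grades as raw ints. It also assumes the file is read back on the same kind of machine that wrote it, and the function names are made up.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <istream>
#include <ostream>
#include <sstream>
#include <vector>

struct Student {
    char name[15];
    std::vector<int> grade;
};

// Writes the name bytes, the number of grades, then the grades themselves.
bool writeStudent(std::ostream& out, const Student& s) {
    const std::uint32_t count = static_cast<std::uint32_t>(s.grade.size());
    out.write(s.name, sizeof s.name);
    out.write(reinterpret_cast<const char*>(&count), sizeof count);
    out.write(reinterpret_cast<const char*>(s.grade.data()),
              static_cast<std::streamsize>(count * sizeof(int)));
    return static_cast<bool>(out);
}

// Reads the fields back in the same order, sizing the vector before filling it.
// Note: a corrupted count would make resize() request a huge allocation.
bool readStudent(std::istream& in, Student& s) {
    std::uint32_t count = 0;
    in.read(s.name, sizeof s.name);
    in.read(reinterpret_cast<char*>(&count), sizeof count);
    if (!in) return false;
    s.grade.resize(count);
    in.read(reinterpret_cast<char*>(s.grade.data()),
            static_cast<std::streamsize>(count * sizeof(int)));
    return static_cast<bool>(in);
}

// Self-check on an in-memory stream; binary file streams work the same way.
bool studentRoundTrips() {
    Student original{};
    std::strcpy(original.name, "Ada");
    original.grade = {90, 85, 77};

    std::stringstream buffer(std::ios::in | std::ios::out | std::ios::binary);
    Student copy{};
    return writeStudent(buffer, original) && readStudent(buffer, copy) &&
           std::strcmp(copy.name, "Ada") == 0 && copy.grade == original.grade;
}
```

For the array of 100 students you would call writeStudent in a loop in the writing program and readStudent in a loop in the reading program.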
What is the format of the binary file? Basically, you have to
define the format, and then convert each element to and from
that format. You can never just dump the bits of an internal
representation to disk and expect to be able to read it back.
(The fact that you need a reinterpret_cast to call
ostream::write on your object should tell you something.)