C++: read int from binary file

I have pixels from an image which are stored in a binary file.
I would like to use a function to quickly read this file.
For the moment I have this:
std::vector<int> _data;
std::ifstream file(_rgbFile.string(), std::ios_base::binary);
while (!file.eof())
{
char singleByte[1];
file.read(singleByte, 1);
int b = singleByte[0];
_data.push_back(b);
}
std::cout << "end" << std::endl;
file.close();
But on a 4096 * 4096 * 3 image it already takes a noticeable amount of time.
Is it possible to optimize this function?

You could make this faster by reading the whole file in one go, and preallocating the necessary storage in the vector beforehand:
std::ifstream file(_rgbFile.string(), std::ios_base::binary);
// Determine the file size by seeking to the end and back.
std::streampos posStart = file.tellg();
file.seekg(0, std::ios::end);
std::streampos posEnd = file.tellg();
file.seekg(posStart);
// Allocate the whole buffer once, then read everything in a single call.
std::vector<char> _data;
_data.resize(posEnd - posStart, 0);
file.read(&_data[0], posEnd - posStart);
std::cout << "end" << std::endl;
file.close();
Avoiding unnecessary I/O
By reading the whole file in one read() call you avoid many small read calls and the ifstream's internal buffering overhead. If the file is very large and you don't want to load it all into memory at once, you can instead load smaller chunks of maybe a few MB each.
Also, you avoid lots of function calls: by reading it byte-by-byte you need to issue ifstream::read 50'331'648 times!
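If you do go the chunked route, a minimal sketch could look like this (the 4MB chunk size and the readInChunks name are arbitrary choices):
#include <fstream>
#include <string>
#include <vector>

std::vector<char> readInChunks(const std::string& path)
{
    std::ifstream file(path, std::ios_base::binary);
    std::vector<char> data;
    const std::size_t chunkSize = 4 * 1024 * 1024; // 4MB per read(); adjust to taste
    std::vector<char> chunk(chunkSize);
    while (file)
    {
        file.read(chunk.data(), static_cast<std::streamsize>(chunkSize));
        std::streamsize got = file.gcount();        // bytes actually read (may be short at EOF)
        if (got <= 0)
            break;
        data.insert(data.end(), chunk.begin(), chunk.begin() + got);
    }
    return data;
}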
vector preallocation
std::vector grows dynamically when you try to insert new elements but no space is left. Each time the vector resizes, it needs to allocate a new, larger, memory area and copy all current elements in the vector over to the new location.
Most vector implementations choose a growth factor between 1.5 and 2, so each time the vector needs to resize it'll be a 1.5-2x larger allocation.
This can be completely avoided by calling std::vector::reserve or std::vector::resize.
With these functions the vector memory only needs to be allocated once, with at least as many elements as you requested.
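For illustration, here is a small standalone sketch (separate from the answer code above) that prints how the capacity jumps repeatedly without reserve and stays put with it:
#include <iostream>
#include <vector>

int main()
{
    std::vector<int> grown;
    std::size_t lastCap = grown.capacity();
    for (int i = 0; i < 1000; ++i)
    {
        grown.push_back(i);
        if (grown.capacity() != lastCap)            // a reallocation (plus element copy/move) happened
        {
            lastCap = grown.capacity();
            std::cout << "capacity grew to " << lastCap << '\n';
        }
    }

    std::vector<int> reserved;
    reserved.reserve(1000);                         // one allocation up front
    for (int i = 0; i < 1000; ++i)
        reserved.push_back(i);                      // never reallocates
    std::cout << "reserved capacity stayed at " << reserved.capacity() << '\n';
}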
Godbolt example
Here's a godbolt example that shows the performance improvement.
Testing a ~50MB file (4096*4096*3 bytes):
gcc 11.2, optimizations disabled: Old: 1300ms | New: 16ms
gcc 11.2, -O3: Old: 878ms | New: 13ms
Small bug in the code
As @TedLyngmo has pointed out, your code also contains a small bug.
The EOF flag is only set once you have tried to read past the end of the file; see this question.
So the last read, the one that sets the EOF bit, didn't actually read a byte, and you push one extra element into your array that contains uninitialized garbage.
You could fix this by checking for EOF directly after the read:
while(true) {
char singleByte[1];
file.read(singleByte, 1);
if(file.eof()) break;
int b = singleByte[0];
_data.push_back(b);
}
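Equivalently, you can loop on the stream state returned by read() itself, which stops on any read error rather than only at EOF:
char singleByte[1];
while (file.read(singleByte, 1))
{
    _data.push_back(singleByte[0]);
}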

Dynamic allocation of a vector inside a struct while reading a large txt file

I am currently learning the C++ language and need to read a file containing more than 5000 numbers of type double. Since push_back will make a copy while allocating new data, I was trying to figure out a way to decrease the computational work. Note that the file may contain a random number of doubles, so allocating memory by specifying a large enough vector up front is not the solution I am looking for.
My idea is to quickly read the whole file and get an approximate size of the array. In Save & read double vector from file C++? I found an interesting idea, which can be seen in the code below.
Basically, the vector containing the file data is inserted into a structure named PathStruct. Bear in mind that PathStruct contains more than this vector, but for the sake of simplicity I deleted all the rest. The function receives a reference to the PathStruct pointer and reads the file.
struct PathStruct
{
std::vector<double> trivial_vector;
};
bool getFileContent(PathStruct *&path)
{
std::ifstream filename("simplePath.txt", std::ios::in | std::ifstream::binary);
if (!filename.good())
return false;
std::vector<char> buffer{};
std::istreambuf_iterator<char> iter(filename);
std::istreambuf_iterator<char> end{};
std::copy(iter, end, std::back_inserter(buffer));
path->trivial_vector.reserve(buffer.size() / sizeof(double));
memcpy(&path->trivial_vector[0], &buffer[0], buffer.size());
return true;
};
int main(int argc, char **argv)
{
PathStruct *path = new PathStruct;
const int result = getFileContent(path);
return 0;
}
When I run the code, it aborts with the following error:
corrupted size vs. prev_size, Aborted (core dumped).
I believe my problem is an incorrect use of pointers. That is definitely not my strongest point, but I cannot find the problem. I hope someone can help out this poor soul.
If your file contains only consecutive double values, you can check the file size and divide it by sizeof(double). To determine the file size you can use std::filesystem::file_size, but this function is only available from C++17. If you cannot use C++17, you can find other methods for determining the file size here.
auto fileName = "file.bin";
auto fileSize = std::filesystem::file_size(fileName);
std::ifstream inputFile("file.bin", std::ios::binary);
std::vector<double> values;
values.reserve(fileSize / sizeof(double));
double val;
while(inputFile.read(reinterpret_cast<char*>(&val), sizeof(double)))
{
values.push_back(val);
}
or using pointers:
auto numberOfValues = fileSize / sizeof(double);
std::vector<double> values(numberOfValues);
// Notice that I pass numberOfValues * sizeof(double) as the number of bytes to read instead of fileSize
// because fileSize may not be divisible by sizeof(double)
inputFile.read(reinterpret_cast<char*>(values.data()), numberOfValues * sizeof(double));
Alternative
If you can modify the file structure, you can add a number of double values at the beginning of the file and read this number before reading double values. This way you will always know the number of values to read, without checking file size.
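A minimal sketch of that layout (the values.bin file name and the std::uint64_t count type are just assumptions):
#include <cstdint>
#include <fstream>
#include <vector>

// Writing: store the element count first, then the raw doubles.
void writeWithHeader(const std::vector<double>& values)
{
    std::ofstream out("values.bin", std::ios::binary);
    std::uint64_t count = values.size();
    out.write(reinterpret_cast<const char*>(&count), sizeof(count));
    out.write(reinterpret_cast<const char*>(values.data()), count * sizeof(double));
}

// Reading: read the count, size the vector, then read the payload in one call.
std::vector<double> readWithHeader()
{
    std::ifstream in("values.bin", std::ios::binary);
    std::uint64_t count = 0;
    in.read(reinterpret_cast<char*>(&count), sizeof(count));
    std::vector<double> values(count);
    in.read(reinterpret_cast<char*>(values.data()), count * sizeof(double));
    return values;
}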
Alternative 2
You can also change the container from std::vector to std::deque. This container is similar to std::vector, but instead of keeping a single buffer for its data, it keeps many smaller arrays. If you are inserting data and the current array is full, an additional array is allocated and linked in without copying the previous data.
This has, however, a small price: data access requires two pointer dereferences instead of one.
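For completeness, the read loop with std::deque needs no size estimate at all (a sketch, reusing the hypothetical file.bin from above):
#include <deque>
#include <fstream>

std::deque<double> readIntoDeque()
{
    std::ifstream in("file.bin", std::ios::binary);
    std::deque<double> values;                       // grows block by block, never moves old elements
    double val;
    while (in.read(reinterpret_cast<char*>(&val), sizeof(double)))
        values.push_back(val);
    return values;
}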

Rewrite file with 0's. What am I doing wrong?

I want to rewrite a file with 0's. It only writes a few bytes.
My code:
int fileSize = boost::filesystem::file_size(filePath);
int zeros[fileSize] = { 0 };
boost::filesystem::path rewriteFilePath{filePath};
boost::filesystem::ofstream rewriteFile{rewriteFilePath, std::ios::trunc};
rewriteFile << zeros;
Also... Is this enough to shred the file? What should I do next to make the file unrecoverable?
EDIT: Ok. I rewrote my code to this. Is this code ok for the job?
int fileSize = boost::filesystem::file_size(filePath);
boost::filesystem::path rewriteFilePath{filePath};
boost::filesystem::ofstream rewriteFile{rewriteFilePath, std::ios::trunc};
for(int i = 0; i < fileSize; i++) {
rewriteFile << 0;
}
There are several problems with your code.
int zeros[fileSize] = { 0 };
You are creating an array that is sizeof(int) * fileSize bytes in size. For what you are attempting, you need an array that is fileSize bytes in size instead. So you need to use a 1-byte data type, like (unsigned) char or uint8_t.
But, more importantly, since the value of fileSize is not known until runtime, this type of array is known as a "Variable Length Array" (VLA), which is a non-standard feature in C++. Use std::vector instead if you need a dynamically allocated array.
boost::filesystem::ofstream rewriteFile{rewriteFilePath, std::ios::trunc};
The trunc flag truncates the size of an existing file to 0. What that entails is to update the file's metadata to reset its tracked byte size, and to mark all of the file's used disk sectors as available for reuse. The actual file bytes stored in those sectors are not wiped out until overwritten as sectors get reused over time. But any bytes you subsequently write to the truncated file are not guaranteed to (and likely will not) overwrite the old bytes on disk. So, do not truncate the file at all.
rewriteFile << zeros;
ofstream does not have an operator<< that takes an int[], or even an int*, as input. But it does have an operator<< that takes a void* as input (to output the value of the memory address being pointed at). An array decays into a pointer to the first element, and void* accepts any pointer. This is why only a few bytes are being written. You need to use ofstream::write() instead to write the array to file, and be sure to open the file with the binary flag.
Try this instead:
int fileSize = boost::filesystem::file_size(filePath);
std::vector<char> zeros(fileSize, 0);
boost::filesystem::path rewriteFilePath(filePath);
boost::filesystem::ofstream rewriteFile(rewriteFilePath, std::ios::binary);
rewriteFile.write(zeros.data()/*&zeros[0]*/, fileSize);
That being said, you don't need a dynamically allocated array at all, let alone one that is allocated to the full size of the file. That is just a waste of heap memory, especially for large files. You can do this instead:
int fileSize = boost::filesystem::file_size(filePath);
const char zeros[1024] = {0}; // adjust size as desired...
boost::filesystem::path rewriteFilePath(filePath);
boost::filesystem::ofstream rewriteFile(rewriteFilePath, std::ios::binary);
int loops = fileSize / sizeof(zeros);
for(int i = 0; i < loops; ++i) {
rewriteFile.write(zeros, sizeof(zeros));
}
rewriteFile.write(zeros, fileSize % sizeof(zeros));
Alternatively, if you open a memory-mapped view of the file (MapViewOfFile() on Windows, mmap() on Linux, etc) then you can simply use std::copy() or std::memset() to zero out the bytes of the entire file directly on disk without using an array at all.
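As a rough illustration, here is a POSIX-only sketch of that approach (error handling omitted); the Windows equivalent would go through CreateFileMapping()/MapViewOfFile():
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>
#include <cstring>

// Zero a file in place through a memory-mapped view.
void zeroFileMapped(const char* path)
{
    int fd = open(path, O_RDWR);
    struct stat st;
    fstat(fd, &st);
    void* view = mmap(nullptr, st.st_size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    std::memset(view, 0, st.st_size);   // overwrite the mapped bytes with zeros
    msync(view, st.st_size, MS_SYNC);   // flush the changes back to the file
    munmap(view, st.st_size);
    close(fd);
}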
Also... Is this enough to shred the file?
Not really, no. At the physical hardware layer, overwriting the file just one time with zeros can still leave behind remnant signals in the disk sectors, which can be recovered with sufficient tools. You should overwrite the file multiple times, with varying types of random data, not just zeros. That will more thoroughly scramble the signals in the sectors.
I cannot stress strongly enough the point made in the comments: overwriting a file's contents does not guarantee that any of the original data is overwritten. ALL OTHER ANSWERS TO THIS QUESTION ARE THEREFORE IRRELEVANT ON ANY RECENT OPERATING SYSTEM.
Modern filing systems are extent-based, meaning that files are stored as a linked list of allocated chunks. When updating a chunk, it may be faster for the filing system to write a whole new chunk and simply adjust the linked list, so that's what they do. Indeed, copy-on-write filing systems always write a copy of any modified chunk and update their B-tree of currently valid extents.
Furthermore, even if your filing system doesn't do this, your hard drive may use the exact same technique for performance, and any SSD almost certainly uses it due to how flash memory works. So overwriting data to "erase" it is meaningless on modern systems. It can't be done. The only safe way to keep old data hidden is full-disk encryption; with anything else you are deceiving yourself and your users.
Just for fun, overwriting with random data:
Live On Coliru
#include <boost/iostreams/device/mapped_file.hpp>
#include <algorithm>
#include <climits>
#include <random>
namespace bio = boost::iostreams;
int main() {
    bio::mapped_file dst("main.cpp");
    std::mt19937 rng { std::random_device{} () };
    // uniform_int_distribution is not specified for char, so draw ints and narrow
    std::uniform_int_distribution<int> dist(CHAR_MIN, CHAR_MAX);
    std::generate_n(dst.data(), dst.size(), [&] { return static_cast<char>(dist(rng)); });
}
Note that it scrambles its own source file after compilation :)

Data read from file takes way more memory than file size

I've written some data into a file the following way:
result = new QHash<QPair<int, int>, QVector<double> >;
QFile resfile("result.txt");
resfile.open(QIODevice::WriteOnly | QIODevice::Append);
QDataStream out(&resfile);
while(condition)
{
QString s=" something";
out<<s;
result->insert(QPair<int, int>(arange,trange),coeffs);
out<<*result;
}
The file ended up being 484MB.
After that i read it in a loop:
QString s;
QVector<QHash<QPair<int, int>, QVector <double> > > thickness_result;
QFile resfile("result.txt");
resfile.open(QIODevice::ReadOnly);
QDataStream out(&resfile);
while (!out.atEnd())
{
thickness_result.resize(thickness_result.size()+1);
out>>s>>thickness_result.last();
}
While this read loop is running, I see in the task manager that my program takes about 1300MB of memory, and then I receive an "In file text\qharfbuzzng.cpp, line 626: Out of memory" error.
My question is: is it normal that the program takes more than twice the file size in memory, and should I read the file in chunks, or am I doing something wrong?
WARNING All the following assumes that QVector behaves like std::vector
Yes, it's normal. What's happening is that when you have 1024 elements, and want to read another one, the call to resize is allocating capacity for 2048 elements, moving the first 1024 elements in, and then constructing the 1025th element. It destroys the old array, and returns the memory to the heap (but not to the operating system). Then when you come to read the 2049th element, it does it all again, only this time allocating 4096 elements. The heap has a chunk of 1024 elements space, but that's no use when you want 4096. Now you have chunks of 1024, 2048, and 4096 elements in the heap (two of which are free and available for reuse).
Repeat until you have read the file. You will see that you end up with (about) twice the file size.
The first rule is "don't worry about it" - it usually isn't a problem. However, for you it clearly is.
Can you switch to a 64-bit program? That should make the problem go away.
The other option is to guess how many elements you have (from the file size) and call .reserve on the vector at the start.
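As a sketch of that second option applied to the Qt code above, estimate the element count from the file size and a guessed average entry size (the bytesPerEntry value below is purely hypothetical), then reserve before the read loop:
#include <QFileInfo>

const qint64 bytesPerEntry = 4096; // hypothetical average serialized size of one (QString, QHash) record
QFileInfo info("result.txt");
thickness_result.reserve(static_cast<int>(info.size() / bytesPerEntry) + 1);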

C++ Storing vector<std::string> from a file using reserve and possibly emplace_back

I am looking for a quick way to store strings from a file into a vector of strings such that I can reserve the number of lines ahead of time. What is the best way to do this? Should I count the newline characters first, or just get the total size of the file and reserve, say, size / 80 to give a rough estimate of what to reserve? Ideally I don't want the vector to have to realloc each time, which would slow things down tremendously for a large file. Ideally I would count the number of items ahead of time, but should I do this by opening in binary mode, counting the newlines, and then reopening? That seems wasteful; I am curious to hear some thoughts on this. Also, is there a way to use emplace_back to get rid of the temporary somestring in the getline code below? I did see the following two implementations for counting the number of lines ahead of time: Fastest way to find the number of lines in a text (C++)
std::vector<std::string> vs;
std::string somestring;
std::ifstream somefile("somefilename");
while (std::getline(somefile, somestring))
vs.push_back(somestring);
Alternatively, I could get the total size ahead of time; can I just transform the char* in this case into the vector directly? This goes back to my reserve hint of size / 80 or some constant to give an estimated size to reserve upfront.
#include <iostream>
#include <fstream>
int main () {
char* contents;
std::ifstream istr ("test.txt");
if (istr)
{
std::streambuf * pbuf = istr.rdbuf();
//which I can use as a reserve hint say size / 80
std::streamsize size = pbuf->pubseekoff(0,istr.end);
//maybe I can construct the vector from the char buf directly?
pbuf->pubseekoff(0,istr.beg);
contents = new char [size];
pbuf->sgetn (contents,size);
}
return 0;
}
Rather than waste time counting the lines ahead of time, I would just reserve() an initial value, then start pushing the actual lines; if you happen to push the reserved number of items, just reserve() some more space before continuing with more pushing, repeating as needed.
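A minimal sketch of that incremental approach (the step of 4096 lines is an arbitrary starting guess):
#include <fstream>
#include <string>
#include <vector>

std::vector<std::string> readLines(const std::string& path)
{
    std::vector<std::string> vs;
    const std::size_t step = 4096;                 // arbitrary initial guess
    vs.reserve(step);
    std::ifstream somefile(path);
    std::string line;
    while (std::getline(somefile, line))
    {
        if (vs.size() == vs.capacity())
            vs.reserve(vs.capacity() + step);      // grow in fixed steps instead of the default factor
        vs.push_back(std::move(line));
    }
    return vs;
}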
The strategy for reserving space in a std::vector is designed to "grow on demand". That is, you will not allocate one string at a time; you will first allocate one, then, say, ten, then one hundred, and so on (not exactly those numbers, but that's the idea). In other words, the implementation of std::vector::push_back already manages this for you.
Consider the following example: I am reading the entire text of War and Peace (65007 lines) using two versions, one which preallocates and one which does not (i.e., one reserves zero space, and the other reserves the full 65007 lines; text from: http://www.gutenberg.org/cache/epub/2600/pg2600.txt)
#include<iostream>
#include<fstream>
#include<vector>
#include<string>
#include<boost/timer/timer.hpp>
void reader(size_t N=0) {
std::string line;
std::vector<std::string> lines;
lines.reserve(N);
std::ifstream fp("wp.txt");
while(std::getline(fp, line)) {
lines.push_back(line);
}
std::cout<<"read "<<lines.size()<<" lines"<<std::endl;
}
int main() {
{
std::cout<<"no pre-allocation"<<std::endl;
boost::timer::auto_cpu_timer t;
reader();
}
{
std::cout<<"full pre-allocation"<<std::endl;
boost::timer::auto_cpu_timer t;
reader(65007);
}
return 0;
}
Results:
no pre-allocation
read 65007 lines
0.027796s wall, 0.020000s user + 0.000000s system = 0.020000s CPU (72.0%)
full pre-allocation
read 65007 lines
0.023914s wall, 0.020000s user + 0.010000s system = 0.030000s CPU (125.4%)
You see, for a non-trivial amount of text I have a difference of milliseconds.
Do you really need to know the lines beforehand? Is it really a bottleneck? Are you saving, say, one second of Wall time but complicating your code ten-fold by preallocating the lines?

Fastest way to write large STL vector to file using STL

I have a large vector (10^9 elements) of chars, and I was wondering what is the fastest way to write such vector to a file. So far I've been using next code:
vector<char> vs;
// ... Fill vector with data
ofstream outfile("nanocube.txt", ios::out | ios::binary);
ostream_iterator<char> oi(outfile, '\0');
copy(vs.begin(), vs.end(), oi);
For this code it takes approximately two minutes to write all data to file. The actual question is: "Can I make it faster using STL and how"?
With such a large amount of data to be written (~1GB), you should write to the output stream directly, rather than using an output iterator. Since the data in a vector is stored contiguously, this will work and should be much faster.
ofstream outfile("nanocube.txt", ios::out | ios::binary);
outfile.write(&vs[0], vs.size());
There is a slight conceptual error with the second argument to ostream_iterator's constructor: it should be a null pointer if you don't want a delimiter (although, luckily for you, '\0' will be treated as such implicitly), or the second argument should be omitted.
However, this means that after writing each character, the code needs to check for the pointer designating the delimiter (which might be somewhat inefficient).
I think, if you want to go with iterators, perhaps you could try ostreambuf_iterator.
Other options might include using the write() method (if it can handle output this large, or perhaps output it in chunks), and perhaps OS-specific output functions.
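For reference, the ostreambuf_iterator variant mentioned above writes straight to the stream buffer and skips the per-character delimiter check; a sketch:
#include <algorithm>
#include <fstream>
#include <iterator>
#include <vector>

void writeWithStreambufIterator(const std::vector<char>& vs)
{
    std::ofstream outfile("nanocube.txt", std::ios::out | std::ios::binary);
    std::copy(vs.begin(), vs.end(), std::ostreambuf_iterator<char>(outfile));
}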
Since your data is contiguous in memory (as Charles said), you can use low level I/O. On Unix or Linux, you can do your write to a file descriptor. On Windows XP, use file handles. (It's a little trickier on XP, but well documented in MSDN.)
XP is a little funny about buffering. If you write a 1GB block to a handle, it will be slower than if you break the write up into smaller transfer sizes (in a loop). I've found that 256KB writes are the most efficient. Once you've written the loop, you can play around with this and see what the fastest transfer size is.
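A sketch of the descriptor-based version on Linux (error handling trimmed; the 256KB block size follows the suggestion above, and the open() flags are one reasonable choice):
#include <algorithm>
#include <fcntl.h>
#include <unistd.h>
#include <vector>

void writeWithDescriptor(const std::vector<char>& vs)
{
    const size_t blockSize = 256 * 1024;
    int fd = open("nanocube.txt", O_WRONLY | O_CREAT | O_TRUNC, 0644);
    size_t offset = 0;
    while (offset < vs.size())
    {
        size_t chunk = std::min(blockSize, vs.size() - offset);
        ssize_t written = ::write(fd, vs.data() + offset, chunk);
        if (written <= 0)
            break;                                 // real code should inspect errno here
        offset += static_cast<size_t>(written);
    }
    close(fd);
}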
OK, I wrote a method implementation with a for loop that writes 256KB blocks of data at each iteration (as Rob suggested), and the result is 16 seconds, so problem solved. This is my humble implementation, so feel free to comment:
void writeCubeToFile(const vector<char> &vs)
{
const unsigned int blocksize = 262144;
unsigned long blocks = distance(vs.begin(), vs.end()) / blocksize;
ofstream outfile("nanocube.txt", ios::out | ios::binary);
for(unsigned long i = 0; i <= blocks; i++)
{
unsigned long position = blocksize * i;
if(blocksize > distance(vs.begin() + position, vs.end())) outfile.write(&*(vs.begin() + position), distance(vs.begin() + position, vs.end()));
else outfile.write(&*(vs.begin() + position), blocksize);
}
outfile.write("\0", 1);
outfile.close();
}
Thanks to all of you.
If you have another structure, this method is still valid.
For example:
typedef std::pair<int,int> STL_Edge;
vector<STL_Edge> v;
void write_file(const char * path){
ofstream outfile(path, ios::out | ios::binary);
outfile.write((const char *)&v.front(), v.size()*sizeof(STL_Edge));
}
void read_file(const char * path,int reserveSpaceForEntries){
ifstream infile(path, ios::in | ios::binary);
v.resize(reserveSpaceForEntries);
infile.read((char *)&v.front(), v.size()*sizeof(STL_Edge));
}
Instead of writing via the file i/o methods, you could try to create a memory-mapped file, and then copy the vector to the memory-mapped file using memcpy.
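A sketch of that idea using boost::iostreams (assuming Boost is available; the mapping is created at exactly the vector's size):
#include <boost/iostreams/device/mapped_file.hpp>
#include <cstring>
#include <vector>

void writeViaMapping(const std::vector<char>& vs)
{
    boost::iostreams::mapped_file_params params;
    params.path = "nanocube.txt";
    params.new_file_size = static_cast<boost::iostreams::stream_offset>(vs.size());
    boost::iostreams::mapped_file_sink sink(params); // creates and maps the output file
    std::memcpy(sink.data(), vs.data(), vs.size());  // plain memory copy into the mapping
    sink.close();
}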
Use the write method on it; the data is in RAM after all and you have contiguous memory. Fastest, while keeping flexibility for later? Lose the built-in buffering, hint sequential I/O, lose the hidden overhead of iterators and utilities, avoid streambuf when you can, but do get your hands dirty with boost::asio.