How to do file I/O with std::vector<bool>? - c++

I need to implement a Boolean data container that will store a fairly large number of values. I suppose I could just use char* and implement C-style macro accessors, but I would prefer to have it wrapped in a std:: structure. std::bitset does not seem practical, as its size is fixed at compile time.
So that leaves me with std::vector<bool>, which is optimized for space and has a nice bool-like accessor.
Is there a way to do something like directly feed a pointer from it to fwrite()?
And how would one do file input into such a vector?
And lastly, is it a good data structure when a lot of file I/O is needed?
What about random file access (fseek etc)?
EDIT: I've decided to wrap a std::vector<unsigned int> in a new class that has functionality demanded by my requirements.
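For illustration, a minimal sketch of what such a wrapper might look like (all names hypothetical):
#include <cstddef>
#include <vector>

class BitBuffer {
    static const unsigned BITS = 8 * sizeof(unsigned int);
    std::vector<unsigned int> words_;
public:
    explicit BitBuffer(std::size_t n) : words_((n + BITS - 1) / BITS, 0u) {}
    bool get(std::size_t i) const { return (words_[i / BITS] >> (i % BITS)) & 1u; }
    void set(std::size_t i, bool v) {
        if (v) words_[i / BITS] |=  (1u << (i % BITS));
        else   words_[i / BITS] &= ~(1u << (i % BITS));
    }
    // Contiguous storage, suitable for fwrite()/fread():
    unsigned int* data() { return words_.data(); }
    std::size_t word_count() const { return words_.size(); }
};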

Is there a way to do something like directly feed a pointer from it to fwrite()?
No, the packed storage of std::vector<bool> isn't directly accessible, but you can write the elements as text with std::fstream:
std::ofstream f("output.file");
std::copy(vb.begin(), vb.end(), std::ostream_iterator<bool>(f, " "));
And how would one do file input into such a vector?
Using std::fstream
std::ifstream f("input.file");
std::copy(std::istream_iterator<bool>(f), {}, std::back_inserter(vb));
And lastly, is it a good data structure when a lot of file I/O is needed?
No, vector<bool> is rarely a good data structure for any purpose. See http://howardhinnant.github.io/onvectorbool.html
What about random file access (fseek etc)?
What about it?

You could use a std::vector<char>, resize it to the size of the file (or to some other size, say if you want to process fixed-length blocks), and then pass its contents to a function like fread() or fwrite() in the following way:
std::vector<char> fileContents;
fileContents.resize(100);
fread(&fileContents[0], 1, 100, theFileStream);
This really just allows you to have a resizable char array, in C++ style. Perhaps it is a useful starting point? The point being that you can directly access the memory behind the vector, since it is guaranteed to be laid out sequentially, just like an array.
The same concept would seem to work for a std::vector<bool>, but be careful when freading into it: sizeof(bool) is implementation-defined and varies by platform (it might be 8, 16 or 32 bits wide if you were working on a microcontroller, for example).
More importantly, std::vector<bool> is a specialization that typically stores each bool in a single bit, so definitely do not attempt to use the memory behind a vector<bool> directly unless you know your implementation makes that work!
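If you do need fwrite() with a vector<bool>, a hedged sketch is to pack the bits into whole bytes yourself first (function name hypothetical):
#include <cstdint>
#include <cstdio>
#include <vector>

void write_bits(const std::vector<bool>& bits, std::FILE* f) {
    // Pack 8 bools per byte, since vector<bool>'s own storage is inaccessible.
    std::vector<std::uint8_t> bytes((bits.size() + 7) / 8, 0);
    for (std::size_t i = 0; i < bits.size(); ++i)
        if (bits[i])
            bytes[i / 8] |= static_cast<std::uint8_t>(1u << (i % 8));
    std::fwrite(bytes.data(), 1, bytes.size(), f);
}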

Related

How can I read/write a fixed number of ints from/to file in one operation (as fast as I can, file can be assumed to be binary)?

I have a big file (let's assume I can make it binary) that cannot fit in RAM, and I want to sort the numbers in it. In the process I need to read/write a large quantity of numbers from/to the file (from/to a vector<int> or int[]) quickly, so I'd like to read/write not one at a time but in blocks of a fixed size. How can I do it?
I have a big file (let's assume I can make it binary) that cannot fit in RAM, and I want to sort the numbers in it.
Given that the file is binary, perhaps the simplest, and presumably an efficient, solution is to memory-map the file. Unfortunately there is no standard interface for memory mapping; on POSIX systems, there is the mmap function.
Now, the memory-mapped file is simply an array of raw bytes. Treating it as an array of integers is technically not allowed until C++20, which introduces C-style "implicit creation of low-level objects". In practice, it already works on most current language implementations [1].
For this reinterpretation to work, the representation of the integers in the file must match the representation of integers used by the CPU. The file will not be portable to the same program running on other, incompatible systems.
We can simply use std::sort on this array. The operating system should take care of paging the file in and out of memory. The algorithm used by std::sort isn't necessarily optimised for this use case however. To find the optimal algorithm, you may need to do some research.
[1] If pre-C++20 standard conformance is a concern, it is possible to iterate over the array, copy the underlying bytes of each element into an integer, and placement-new an integer object into that memory using the copied value. A compiler can optimise these operations down to zero instructions, and this makes the program's behaviour well defined.
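A minimal POSIX sketch of the approach (error handling omitted; the file name is hypothetical, and the file is assumed to hold raw native-endian ints):
#include <algorithm>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main() {
    int fd = open("numbers.bin", O_RDWR);
    struct stat st;
    fstat(fd, &st);
    // Map the whole file read/write; changes are written back to the file.
    void* p = mmap(nullptr, st.st_size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    int* first = static_cast<int*>(p);
    int* last  = first + st.st_size / sizeof(int);
    std::sort(first, last);   // the OS pages the file in and out as needed
    munmap(p, st.st_size);
    close(fd);
}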
You can use ostream::write to write into a file and istream::read to read from a file.
To make the process clean, it is good to store the number of items in the file as well.
Let's say you have a vector<int>.
You can use the following code to write its contents to a file.
std::vector<int> myData;
// .. Fill up myData;
// Open a file to write to, in binary mode.
std::ofstream out("myData.bin", std::ofstream::binary);
// Write the size first.
auto size = myData.size();
out.write(reinterpret_cast<char const*>(&size), sizeof(size));
// Write the data.
out.write(reinterpret_cast<char const*>(myData.data()), sizeof(int)*size);
You can read the contents of such a file using the following code.
std::vector<int> myData;
// Open the file to read from, in binary mode.
std::ifstream in("myData.bin", std::ifstream::binary);
// Read the size first. (Initialising from size() is just a convenient way
// to give 'size' the same type the writing code used.)
auto size = myData.size();
in.read(reinterpret_cast<char*>(&size), sizeof(size));
// Resize myData so it has enough space to read into.
myData.resize(size);
// Read the data.
in.read(reinterpret_cast<char*>(myData.data()), sizeof(int)*size);
If not all of the data can fit into RAM, you can read and write the data in smaller chunks. However, if you read/write them in smaller chunks, I don't know how you would sort them.
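A minimal sketch of chunked reading (the chunk size is an arbitrary assumption):
#include <fstream>
#include <vector>

int main() {
    std::ifstream in("myData.bin", std::ifstream::binary);
    std::vector<int> chunk(1 << 20);   // 1M ints per chunk -- tune as needed
    while (in) {
        in.read(reinterpret_cast<char*>(chunk.data()), chunk.size() * sizeof(int));
        std::size_t got = static_cast<std::size_t>(in.gcount()) / sizeof(int);
        if (got == 0) break;
        // Process chunk[0 .. got), e.g. sort it and write out a sorted run.
    }
}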

Reading large strings in C++ -- is there a safe fast way?

http://insanecoding.blogspot.co.uk/2011/11/how-to-read-in-file-in-c.html reviews a number of ways of reading an entire file into a string in C++. The key code for the fastest option looks like this:
std::string contents;
in.seekg(0, std::ios::end);
contents.resize(in.tellg());
in.seekg(0, std::ios::beg);
in.read(&contents[0], contents.size());
Unfortunately, this is not safe as it relies on the string being implemented in a particular way. If, for example, the implementation was sharing strings then modifying the data at &contents[0] could affect strings other than the one being read. (More generally, there's no guarantee that this won't trash arbitrary memory -- it's unlikely to happen in practice, but it's not good practice to rely on that.)
C++ and the STL are designed to provide features that are as efficient as C, so one would expect there to be a version of the above that was just as fast but guaranteed to be safe.
In the case of vector<T>, there are functions which can be used to access the raw data, which can be used to read a vector efficiently:
T* vector::data();
const T* vector::data() const;
The first of these can be used to read a vector<T> efficiently. Unfortunately, the string equivalent only provides the const variant:
const char* string::data() const noexcept;
So this cannot be used to read a string efficiently. (Presumably the non-const variant was originally omitted to support shared string implementations; C++17 later added a non-const string::data() overload.)
I have also checked the string constructors, but the ones that accept a char* copy the data -- there's no option to move it.
Is there a safe and fast way of reading the whole contents of a file into a string?
It may be worth noting that I want to read into a string rather than a vector<char> so that I can access the resulting data using an istringstream. There's no equivalent of that for vector<char>.
If you really want to avoid copies, you can slurp the file into a std::vector<char>, and then roll your own std::basic_stringbuf to pull data from the vector.
You can then declare a std::istringstream and use std::basic_ios::rdbuf to replace the input buffer with your own one.
The caveat is that if you choose to call istringstream::str it will invoke std::basic_stringbuf::str and will require a copy. But then, it sounds like you won't be needing that function, and can actually stub it out.
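A minimal sketch of such a streambuf (all names hypothetical; it simply points the get area at the vector's storage, and constructing a plain std::istream over it is the simplest variant of the rdbuf idea):
#include <fstream>
#include <istream>
#include <iterator>
#include <streambuf>
#include <vector>

struct vector_streambuf : std::streambuf {
    explicit vector_streambuf(std::vector<char>& v) {
        setg(v.data(), v.data(), v.data() + v.size());   // read-only get area, no copy
    }
};

int main() {
    std::ifstream file("input.file", std::ios::binary);
    std::vector<char> contents((std::istreambuf_iterator<char>(file)),
                               std::istreambuf_iterator<char>());
    vector_streambuf buf(contents);
    std::istream in(&buf);   // stream that reads straight from the vector
    int value;
    while (in >> value) { /* parse as usual */ }
}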
Whether you get better performance this way would require actual measurement. But at least you avoid having to have two large contiguous memory blocks during the copy. Additionally, you could use something like std::deque as your underlying structure if you want to cope with truly huge files that cannot be allocated in contiguous memory.
It's also worth mentioning that if you're really just streaming that data you are essentially double-buffering by reading it into a string first. Unless you also require the contents in memory for some other purpose, the buffering inside std::ifstream is likely to be sufficient. If you do slurp the file, you may get a boost by turning buffering off.
I think using &string[0] is just fine, and it should work with the widely used standard library implementations (it was technically UB under C++03, but C++11 guarantees contiguous storage, making it well defined).
But since you mention that you want to put the data into an istringstream, here's an alternative:
1. Read the data into a char array (new char[in.tellg()])
2. Construct a stringstream (without the leading 'i')
3. Insert the data with stringstream::write
The istringstream would have to copy the data anyway, because a std::stringstream doesn't store a std::string internally as far as I'm aware, so you can skip the std::string and put the data into the stream directly.
EDIT: Instead of the manual allocation (or make_unique), you could also use the vector<char> you mentioned this way.
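A minimal sketch of those steps (file name hypothetical; make_unique requires C++14):
#include <fstream>
#include <memory>
#include <sstream>

int main() {
    std::ifstream in("input.file", std::ios::binary);
    in.seekg(0, std::ios::end);
    std::streamsize size = in.tellg();
    in.seekg(0, std::ios::beg);

    auto buffer = std::make_unique<char[]>(size);
    in.read(buffer.get(), size);

    std::stringstream ss;
    ss.write(buffer.get(), size);   // copies the data into the stream
    int value;
    while (ss >> value) { /* extract as usual */ }
}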

Why does std::fstream use char*?

I'm writing a small program that reads the bytes from a binary file in groups of 16 bytes (please don't ask why), modifies them, and then writes them to another file.
The fstream::read function reads into a char * buffer, which I was initially passing to a function that looks like this:
char* modify (char block[16], std::string key)
The modification was done on block which was then returned. On roaming the posts of SO, I realized that it might be a better idea to use std::vector<char>. My immediate next worry was how to convert a char * to a std::vector<char>. Once again, SO gave me an answer.
But now what I'm wondering is: if it's such a good idea to use std::vector<char> instead of char*, why do the fstream functions use char* at all?
Also, is it a good idea to convert the char* from fstream to std::vector<char> in the first place?
EDIT: I now realize that since fstream::read is used to write data into objects directly, char * is necessary. I must now modify my question. Firstly, why are there no overloaded functions for fstream::read? And secondly, in the program that I've written about, which is a better option?
To use it with a vector, do not pass a pointer to the vector. Instead, pass a pointer to the vector content:
vector<char> v(size);
stream.read(&v[0], size);
The fstream functions accept char* so you can point them at arbitrary pre-allocated buffers. std::vector<char> can be sized to provide an appropriate buffer, but it will be on the heap and there are allocation costs involved with that. Sometimes, too, you may want to read or write data at a specific location in memory - even in shared memory - rather than accepting whatever heap memory the vector happens to have allocated. Further, you may want to use fstream without having included the vector header... it's nice to be able to avoid unnecessary includes, as it reduces compilation time.
As your buffers are always 16 bytes in size, it's probably best to allocate them as char [16] data members in an appropriate owning object (if any exists), or on the stack (i.e. some function's local variable).
vector<> is more useful when the alternative is heap allocation - whether because the size is unknown at compile time, or is particularly large, or because you want more flexible control of the memory's lifetime. It's also useful when you specifically want some of the other vector functionality, such as the ability to change the number of elements afterwards, to sort the bytes, etc. - it seems very unlikely you'll want to do any of that, so a vector would raise questions in the mind of someone reading your code about what you intend to do with it, for no good purpose. Still, the choice of char[16] vs. vector appears (based on your stated requirements) more a matter of taste than objective benefit.
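For illustration, a hedged sketch of the char[16] approach (the modify() body and file names are placeholders; the final partial block, if any, is ignored for brevity):
#include <cstddef>
#include <fstream>
#include <string>

// Placeholder 16-byte transformation, standing in for the question's modify().
void modify(char (&block)[16], const std::string& key) {
    for (std::size_t i = 0; i < sizeof block; ++i)
        block[i] ^= key[i % key.size()];
}

int main() {
    std::ifstream in("input.bin", std::ios::binary);
    std::ofstream out("output.bin", std::ios::binary);
    char block[16];
    while (in.read(block, sizeof block)) {   // processes full blocks only
        modify(block, "secret");
        out.write(block, sizeof block);
    }
}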

C++ ifstream::read() and C arrays

It seems to be a general consensus that C arrays are bad and that using the smarter alternatives, like vectors or C++ strings is the way to go. No problem here.
Having said that, why does the read() member of ifstream input data to a char*...
The question is: can I input to a vector of bytes somehow using the STL alone?
A related bonus question: do you often check for ios::badbit and ios::failbit especially if you work with a dynamically allocated C string within that scope? Do you do deallocation of the C string in the catch()'s?
Thank you for reading.
You can read directly into an allocated vector (I have no ability to compile this from here so there might be typos or transposed parameters etc...) but the idea is correct.
vector<char> data;
data.resize(100);
// Read 100 characters into the storage of data
thing.read(&data[0], 100);
Having said that, why does the read() member of ifstream input data to a char*...
It was a design choice. Not necessarily the smartest one.
The question is: can I input to a vector of bytes somehow
The underlying storage of a std::vector<char> is guaranteed to be contiguous (i.e., a single block of memory), so yes. To use ifstream::read, you must (a) ensure that the vector's size is big enough, using .resize() - not .reserve()! (the vector has no way of knowing about data you read into its unused capacity, so it cannot update its size) - and then (b) obtain a pointer to the first element of the vector (e.g. with &v.front() or &v[0]).
If you don't want to pre-size the vector, then there are more advanced techniques you can use that involve iterator types and standard library algorithms. These have the advantage that you can easily read an entire file into a vector without having to check the file length first (and the standard tricks for checking a file's length might not be as reliable as you think!).
It looks something like:
#include <iterator>
#include <algorithm> // in addition to what you already have.
// ...
std::ifstream ifs;
std::vector<char> v;
// ...
std::istreambuf_iterator<char> begin(ifs), end;
std::copy(begin, end, std::back_inserter(v));
A vector of bytes is not a string, of course. But you can do the exact same thing with std::string - the std::back_inserter function is smart enough to create the appropriate iterator type for any supplied type that provides a .push_back(), which both string and vector do. That's the magic of templates. :)
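For illustration, the same idiom with std::string (file name hypothetical):
#include <algorithm>
#include <fstream>
#include <iterator>
#include <string>

int main() {
    std::ifstream ifs("input.file", std::ios::binary);
    std::string s;
    std::istreambuf_iterator<char> begin(ifs), end;
    std::copy(begin, end, std::back_inserter(s));   // back_inserter works on string too
}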
using the STL alone?
I'm confused. I thought we were talking about the C++ standard library. What is this STL of which you speak?
A related bonus question: do you often check for ios::badbit and ios::failbit
No; I usually write code in such a way that I can just check the result of the reading operation directly. See http://www.parashift.com/c++-faq-lite/input-output.html#faq-15.4 for an example/discussion.
especially if you work with a dynamically allocated C string within that scope?
This is a bad idea in general.
Do you do deallocation of the C string in the catch()'s?
And having to deal with this sort of thing is one of the main reasons why. This is not easy to get right. But I hope you realize that catching an exception and checking for ios::bits are two totally different things. Although you can configure the stream object to throw exceptions instead of just setting the flag bits :)
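For illustration, configuring a stream to throw looks like this (file name hypothetical):
#include <fstream>
#include <iostream>

int main() {
    std::ifstream in;
    // Throw std::ios_base::failure instead of silently setting failbit/badbit.
    in.exceptions(std::ifstream::failbit | std::ifstream::badbit);
    try {
        in.open("input.file");
        // ... reads here throw on failure ...
    } catch (const std::ios_base::failure& e) {
        std::cerr << "stream error: " << e.what() << '\n';
    }
}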

Performance of ostream_iterator for writing numeric data to a file?

I've got various std::vector instances with numeric data in them, primarily int16_t, int32_t, etc. I'd like to dump this data to a file in as fast a manner as possible. If I use an ostream_iterator, will it write the entire block of memory in a single operation, or will it iterate over the elements of the vector, issuing a write operation for each one?
A stream iterator and a vector will definitely not use a block copy in any implementation I'm familiar with. If the vector's item type were a class rather than a POD, for example, a direct copy would be a bad thing. I suspect the ostream will format the output as well, rather than writing the values directly (i.e., ASCII instead of binary output).
You might have better luck with boost::copy, as it's specifically optimized to do block writes when possible, but the most practical solution is to operate on the vector memory directly using &v[0].
Most ofstream implementations I know of do buffer data, so you probably will not end up doing an excessive number of writes. The buffer in the ofstream() has to fill up before an actual write is done, and most OS's buffer file data underneath this, too. The interplay of these is not at all transparent from the C++ application level; selection of buffer sizes, etc. is left up to the implementation.
C++ does provide a way to supply your own buffer to an ostream's streambuf. You can try calling pubsetbuf like this:
char *mybuffer = new char[bufsize];
os.rdbuf()->pubsetbuf(mybuffer, bufsize);
The downside is that this doesn't necessarily do anything: some implementations just ignore it, and to have any effect it generally must be called before the first I/O operation on the stream.
The other option you have if you want to buffer things and still use ostream_iterator is to use an ostringstream, e.g.:
ostringstream buffered_chars;
copy(data.begin(), data.end(), ostream_iterator<char>(buffered_chars, " "));
string buffer(buffered_chars.str());
Then once all your data is buffered, you can write the entire buffer using one big ostream::write(), POSIX I/O, etc.
This can still be slow, though, since you're doing formatted output, and you have to have two copies of your data in memory at once: the raw data and the formatted, buffered data. If your application pushes the limits of memory already, this isn't the greatest way to go, and you're probably better off using the built-in buffering that ofstream gives you.
Finally, if you really want performance, the fastest way to do this is to dump the raw memory to disk using ostream::write() as Neil suggests, or to use your OS's I/O functions. The disadvantage here is that your data isn't formatted, your file probably isn't human-readable, and it isn't easily readable on architectures with a different endianness than the one you wrote from. But it will get your data to disk fast and without adding memory requirements to your application.
The quickest (but most horrible) way to dump a vector will be to write it in one operation with ostream::write:
os.write( (char *) &v[0], v.size() * sizeof( v[0] ) );
You can make this a bit nicer with a template function:
template <typename T>
std::ostream & DumpVec( std::ostream & os, const std::vector<T> & v ) {
    // &v[0] is a const T*, so a cast to const char* is needed for write().
    return os.write( reinterpret_cast<const char *>( v.data() ), v.size() * sizeof( T ) );
}
which allows you to say things like:
vector <unsigned int> v;
ofstream f( "file.dat" );
...
DumpVec( f, v );
Reading it back in will be a bit problematic unless you prefix the write with the size of the vector somehow (or the vectors are fixed-size), and even then you will have problems across different-endian and/or 32- vs. 64-bit architectures, as several people have pointed out.
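For the read side, a hedged sketch assuming the writer was changed to prefix an element count first (and that reader and writer agree on endianness and sizeof(T)):
#include <cstddef>
#include <istream>
#include <vector>

template <typename T>
std::istream & ReadVec( std::istream & is, std::vector<T> & v ) {
    std::size_t n = 0;
    is.read( reinterpret_cast<char *>( &n ), sizeof n );
    v.resize( n );
    return is.read( reinterpret_cast<char *>( v.data() ), n * sizeof( T ) );
}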
I guess that's implementation-dependent. If you don't get the performance you want, you can always memory-map the result file and memcpy the std::vector's data into the mapped file.
If you construct the ostream_iterator with an ofstream, that will make sure the output is buffered:
ofstream ofs("file.txt");
ostream_iterator<int> osi(ofs, ", ");
copy(v.begin(), v.end(), osi);
The ofstream object is buffered, so anything written to the stream will get buffered before being written to disk.
You haven't written how you want to use the iterators (I'll presume std::copy) and whether you want to write the data binary or as strings.
I would expect a decent implementation of std::copy to dispatch to std::memcpy for PODs with raw pointers as iterators (Dinkumware, for example, does so). However, with ostream iterators, I don't think any implementation of std::copy can do this, as it doesn't have direct access to the ostream's buffer to write into.
The streams themselves, though, buffer, too.
In the end, I would write the simplest possible code first, and measure this. If it's fast enough, move on to the next problem. If this is code of the sort that cannot be fast enough, you'll have to resort to OS-specific tricks anyway.
It will iterate over the elements. Iterators don't let you mess with more than one item at a time. Also, IIRC, it will convert your integers to their ASCII representations.
If you want to write everything in the vector, in binary, to the file in one step via an ostream, you want something like:
template <class T>
void WriteArray(std::ostream& os, const std::vector<T>& v)
{
    // reinterpret_cast is required here; static_cast cannot convert T* to const char*.
    os.write(reinterpret_cast<const char*>(v.data()), v.size() * sizeof(T));
}