How to store / load big C++ containers

I was wondering how I can store C++ containers for efficient loading - for example, very large vectors of integers. I know I can write them to a file and build a new vector from that data:
#include <fstream>
#include <vector>

int main()
{
    std::vector<int> data = {1, 2, 3, 4, 5}; // some elements
    std::ofstream file("data.txt");          // ofstream, not ifstream: we are writing
    for (const auto& c : data)
        file << c << " ";
    return 0;
}
but if I want to save 1 gigabyte of data, loading it from a text file every time takes a very long time. So is there a way to store this kind of data for fast loading? If possible, I would like to store my own classes this way as well.

std::vector is stored in a contiguous memory block.
If you want to store/load a vector's data to/from a file, you should be able to do something like this:
std::string filename{ "test.dat" };
std::vector<int> vec_source = { 1, 2, 3, 4, 5 }; // some elements
// Save to file
std::ofstream OutFile;
OutFile.open(filename, std::ofstream::out | std::ofstream::binary);
OutFile.write(reinterpret_cast<char*>(vec_source.data()), vec_source.size() * sizeof(int));
OutFile.close();
// Prepare
std::vector<int> vec_target;
vec_target.resize(vec_source.size());
// Load from file
std::ifstream InFile;
InFile.open(filename, std::ifstream::in | std::ifstream::binary);
InFile.read(reinterpret_cast<char*>(vec_target.data()), vec_target.size() * sizeof(int));
InFile.close();
See working example here:
https://wandbox.org/permlink/oQuwXxU8q230FaJC
[EDIT]
A few notes and limitations:
Note 1: If you plan to do more than just save/load the whole array (for example, changing the data and storing only the changes), you should consider a better method, such as splitting the data into chunks and saving each chunk separately.
Note 2: This method is correct only for containers that use a contiguous memory block, like std::vector, std::array and std::string. It will certainly not work for std::list or std::map.
Note 3: Following an interesting discussion between @DavidSchwartz and @Acorn in the comments of this post: this code example works correctly only if the endianness is the same when storing and loading the data. It will certainly not work if the file is written and read on platforms with different endianness! (See the portable sketch below.)
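If you do need the file to survive a move between platforms, one portable (but slower) approach is to write each value in a fixed byte order instead of dumping raw memory. A minimal sketch, assuming 32-bit values and a little-endian file format (the helper name write_le32 is made up for illustration):

#include <cstdint>
#include <fstream>

// Write one 32-bit value in little-endian byte order,
// regardless of the host platform's native endianness.
void write_le32(std::ofstream& out, std::uint32_t v)
{
    unsigned char bytes[4] = {
        static_cast<unsigned char>(v & 0xFF),
        static_cast<unsigned char>((v >> 8) & 0xFF),
        static_cast<unsigned char>((v >> 16) & 0xFF),
        static_cast<unsigned char>((v >> 24) & 0xFF)
    };
    out.write(reinterpret_cast<const char*>(bytes), 4);
}

The matching reader reassembles each value from its four bytes, so both sides agree on the layout no matter what the CPU stores natively.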

Related

Getting input from a text file and storing it into an array, but the text file contains more than 20,000 strings

I'm trying to read strings from a text file and store them into a huge array; the text file contains more than 20,000 strings. How can I do that?
I can not use vectors.
Is it possible to do it without using a hash table?
Afterward, I will try to find the most frequently used words using sorting.
Your requirement is to NOT use any standard container, like for example a std::vector or a std::unordered_map.
In this case we need to create a dynamic container ourselves. That is not complicated, and we can use it even for storing strings, so I will not even use std::string in my example. I created a demo for you with ~700 lines of code.
We will first define the term "capacity". This is the number of elements that could be stored in the container; it is the currently available space. It says nothing about how many elements are actually stored in the container.
The single most important piece of functionality of a dynamic container is that it must be able to grow. This is necessary whenever we want to add more elements to the container than its capacity allows.
So, if we want to add an additional element at the end of the container, and the number of elements is >= its capacity, then we need to allocate bigger memory and copy all the old elements into the new memory space. On such an event we will usually double the capacity; this prevents frequent reallocations and copying.
Let me show you an example of a push_back function, which could be implemented like this:
template <typename T>
void DynamicArray<T>::push_back(const T& d) {    // Add a new element at the end
    if (numberOfElements >= capacity) {          // Check if the capacity of this dynamic array is big enough
        capacity *= 2;                           // Obviously not, so double the capacity
        T* temp = new T[capacity];               // Allocate new, bigger memory
        for (unsigned int k = 0; k < numberOfElements; ++k)
            temp[k] = data[k];                   // Copy data from old memory to new memory
        delete[] data;                           // Release the old memory
        data = temp;                             // And assign the newly allocated memory to the old pointer
    }
    data[numberOfElements++] = d;                // Finally, store the given data at the end of the container
}
This is a basic approach. I use templates in order to be able to store any type in the dynamic array.
You could get rid of the templates by deleting all the template stuff and replacing "T" with your intended data type.
But I would not do that. See how easily we can create a "String" class: we just alias a dynamic array of chars as "String".
using String = DynamicArray<char>;
will give us basic string functionality. And if we want to have a dynamic array of strings later, we can write:
using StringArray = DynamicArray<String>;
and this gives us a DynamicArray<DynamicArray<char>>. Cool.
For these special classes we can overload some operators, which will make handling them, and our life, even simpler.
Please look at the provided code.
And, to be able to use the container in a typical C++ environment, we can add full iterator capability. This takes some typing effort but is not complicated, and it makes the container work with range-based for loops and the standard algorithms; a minimal sketch follows.
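Here is a minimal sketch of what that iterator support can look like: because the elements live in one contiguous block, plain pointers already satisfy the iterator requirements. Member names follow the push_back example above; this is an illustration, not the full 700-line code:

template <typename T>
class DynamicArray {
public:
    // Plain pointers into the contiguous block work as iterators
    T* begin() { return data; }                      // first element
    T* end() { return data + numberOfElements; }     // one past the last element
    const T* begin() const { return data; }
    const T* end() const { return data + numberOfElements; }
    // ... push_back and the rest as shown above ...
private:
    T* data{ nullptr };
    unsigned int numberOfElements{ 0 };
    unsigned int capacity{ 0 };
};

With begin()/end() in place, range-based for loops and algorithms like std::sort work out of the box.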
You also wanted to create a hash map, for counting words.
For that we will create a key/value pair: the key is the String that we defined above, and the value is the frequency counter.
We implement a hash function, which should be carefully selected to work well with strings, have high entropy and distribute keys evenly over the hash map's buckets; a sketch is shown below.
The hash map itself is a dynamic container, and we will add iterator functionality to it as well.
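As an illustration of such a hash function, here is a sketch of FNV-1a, a simple byte-oriented hash with good distribution. It assumes the String defined above exposes begin()/end() as sketched earlier, and is not necessarily what the linked code uses:

// FNV-1a: hash the string byte by byte
unsigned long long hashString(const String& s)
{
    unsigned long long h = 14695981039346656037ULL;  // FNV offset basis
    for (char c : s) {
        h ^= static_cast<unsigned char>(c);
        h *= 1099511628211ULL;                       // FNV prime
    }
    return h;
}
// The bucket index is then hashString(s) % numberOfBuckets.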
For all this I drafted some 700 lines of code for you. You can take it as an example for your further studies, and it can easily be enhanced with additional functionality.
But caveat: I did only some basic tests, and I even used raw pointers for owned memory. This can be done in a school project to learn some dynamic memory management, but not in production code.
Additionally, you could replace all of this code by simply using std::string, std::vector and std::unordered_map; nobody would reinvent the wheel like this in real code. But it may give you some ideas on how to implement similar things.
Because Stack Overflow limits the answer size to 32,000 characters, I have put the main part on GitHub.
I will just show you main, so that you can see how easily the solution can be used:
int main() {
    // Open the file and check if it could be opened
    std::ifstream ifs{ "r:\\test.txt" };
    if (ifs) {
        // Define a dynamic array for strings
        StringArray stringArray{};
        // Use the overloaded extraction operator to read all strings from the file into the dynamic array
        ifs >> stringArray;
        // Create a dynamic hash map
        HashMap hm{};
        // Now count the frequency of words
        for (const String& s : stringArray)
            hm[s]++;
        // Put the resulting key/value pairs into a dynamic array
        DynamicArray<Item> items(hm.begin(), hm.end());
        // Sort in descending order by frequency
        std::sort(items.begin(), items.end(), [](const Item& i1, const Item& i2) { return i1.count > i2.count; });
        // Show the result on screen
        for (const auto& [string, count] : items)
            std::cout << std::left << std::setw(20) << string << '\t' << count << '\n';
    }
    else std::cerr << "\n\nError: Could not open source file\n\n";
}
You do not need to keep the whole file in memory to count the frequency of words. You only need to keep a single entry at a time, plus some data structure to count the frequencies, for example a std::unordered_map<std::string, unsigned>.
Not tested:
std::unordered_map<std::string, unsigned> processFileEntries(std::ifstream& file) {
    std::unordered_map<std::string, unsigned> freq;
    std::string word;
    while (file >> word) {
        ++freq[word];
    }
    return freq;
}
For more efficient reading, or more elaborate processing, you could also read chunks of the file (e.g. 100 words), process each chunk, and then continue with the next one; a sketch follows.
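A sketch of that chunked variant, building on the same std::unordered_map counter (the chunk size of 100 is arbitrary):

#include <cstddef>
#include <fstream>
#include <string>
#include <unordered_map>
#include <vector>

void processInChunks(std::ifstream& file,
                     std::unordered_map<std::string, unsigned>& freq,
                     std::size_t chunkSize = 100)
{
    std::vector<std::string> chunk;
    chunk.reserve(chunkSize);
    std::string word;
    while (file >> word) {
        chunk.push_back(word);
        if (chunk.size() == chunkSize) {            // a full chunk: process it
            for (const auto& w : chunk) ++freq[w];
            chunk.clear();
        }
    }
    for (const auto& w : chunk) ++freq[w];          // leftover partial chunk
}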
Assuming you're using C-Style / raw arrays you could do something like:
const size_t number_of_entries = count_entries_in_file();
//Make sure we actually have entries
assert(number_of_entries > 0);
std::string* file_entries = new std::string[number_of_entries];
//fill file_entries with the file's entries
//...
//release heap memory again, so we don't create a leak
delete[] file_entries;
file_entries = nullptr;
You can use a std::map to get the frequency of each word in your text file. One example for reference is given below:
#include <iostream>
#include <map>
#include <string>
#include <sstream>
#include <fstream>

int main()
{
    std::ifstream inputFile("input.txt");
    std::map<std::string, unsigned> freqMap;
    std::string line, word;
    if (inputFile)
    {
        while (std::getline(inputFile, line)) // go line by line
        {
            std::istringstream ss(line);
            while (ss >> word) // go word by word
            {
                ++freqMap[word]; // increment the count value corresponding to the word
            }
        }
    }
    else
    {
        std::cout << "input file cannot be opened" << std::endl;
    }
    // print the frequency of each word in the file
    for (const auto& myPair : freqMap)
    {
        std::cout << myPair.first << ": " << myPair.second << std::endl;
    }
    return 0;
}

How do I serialise/deserialise a std::vector<bool> most efficiently?

I'm trying to write the contents of a std::vector<bool> to disk into a binary file. As the write() method of many of the STL output streams takes in a pointer to the array itself, as well as the number of bytes to write, for a 'normal' vector I'd end up doing something like this:
std::vector<unsigned int> dataVector = {0, 1, 2, 3, 4};
std::fstream outStream = std::fstream("vectordump.bin", std::ios::out | std::ios::binary);
outStream.write((char*) dataVector.data(), dataVector.size() * sizeof(unsigned int));
outStream.close();
However, the std::vector<bool> is a special case, as the STL implementation is allowed to pack the bools into single bits. The above approach will therefore technically not consistently work, because it's unspecified how the data is precisely laid out in memory.
Is there any way of serialising/deserialising my bool vector without having to pack/unpack the data?
I think you're better off just translating that vector into a std::vector<std::byte> / std::vector<unsigned char>.
std::vector<bool> isn't even required to use contiguous memory, so there is no portable way to write it out starting from a data() pointer in the first place (the specialization doesn't provide one).
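A minimal sketch of that translation (one byte per bool, deliberately trading space for simplicity; the file name is illustrative):

#include <fstream>
#include <vector>

int main()
{
    std::vector<bool> flags = {true, false, true, true};

    // Unpack into a byte-per-bool buffer with contiguous storage
    std::vector<unsigned char> bytes(flags.begin(), flags.end());

    std::ofstream out("flags.bin", std::ios::binary);
    out.write(reinterpret_cast<const char*>(bytes.data()), bytes.size());

    // To load: read the file into a std::vector<unsigned char>,
    // then rebuild with std::vector<bool> restored(bytes.begin(), bytes.end());
}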
No, there isn't.
Sorry.
A good reason to avoid this container!

creating large number of object pointers

I have defined a class like this:
class myClass {
private:
    int count;
    string name;
public:
    myClass (int, string);
    ...
    ...
};

myClass::myClass(int c, string n)
{
    count = c;
    name = n;
}
...
...
I also have a *.txt file with a name on each line:
David
Jack
Peter
...
...
Now I read the file line by line and create a new object pointer for each line and store all objects in a vector. The function is like this:
vector<myClass*> myFunction (string fileName)
{
    vector<myClass*> r;
    myClass* obj;
    ifstream infile(fileName);
    string line;
    int count = 0;
    while (getline(infile, line))
    {
        obj = new myClass (count, line);
        r.push_back(obj);
        count++;
    }
    return r;
}
For small *.txt files I have no problem. However, sometimes my *.txt files contain more than 1 million lines, and in those cases the program is dramatically slow. Do you have any suggestions to make it faster?
First, find faster I/O than the standard streams.
Second, can you use string views instead of strings? std::string_view is C++17, but there are backports for C++11 and earlier everywhere.
Third,
myClass::myClass(int c, string n) {
    count = c;
    name = n;
}
should read
myClass::myClass(int c, std::string n) :
    count(c),
    name(std::move(n))
{}
which would make a difference for long names. None for short ones due to "small string optimization".
Fourth, stop making vectors of pointers; create vectors of values (see the sketch after this list).
Fifth, failing that, find a more efficient way to allocate/deallocate the objects.
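As a sketch of that fourth point, here is a vector-of-values version of myFunction, using the move-friendly constructor shown above (myClass is restated here so the sketch is self-contained):

#include <fstream>
#include <string>
#include <utility>
#include <vector>

// myClass as in the question, with the move-friendly constructor
struct myClass {
    int count;
    std::string name;
    myClass(int c, std::string n) : count{c}, name{std::move(n)} {}
};

std::vector<myClass> myFunction(const std::string& fileName)
{
    std::vector<myClass> r;
    std::ifstream infile(fileName);
    std::string line;
    int count = 0;
    while (std::getline(infile, line))
        r.emplace_back(count++, std::move(line));  // construct in place; reuse line's buffer
    return r;
}

This removes one heap allocation per line and keeps the objects contiguous, which also helps later traversal.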
One thing you can do is directly move the string you've read from the file into the objects you're creating:
myClass::myClass(int c, string n)
: count{c}, name{std::move(n)}
{ }
You could also benchmark:
myClass::myClass(int c, string&& n)
: count{c}, name{std::move(n)}
{ }
The first version above will make a copy of line as the function is called, then let the myClass object take over the dynamically allocated buffer used for that copy. The second version (with the string&& n argument) will let the myClass object rip out line's buffer directly: that means less copying of textual data, but it also means line is likely to be stripped of any buffer as each line of the file is read in. Hopefully your implementation will normally be able to see from the input buffer how large a capacity line needs for the next line, and avoid extra allocations/copying. As always, measure when you've reason to care.
You'd likely get a small win by reserving space for your vector up front, though the fact that you're storing pointers in the vector instead of storing myClass objects by value makes any vector resizing relatively cheap. Countering that, storing pointers does mean you're doing an extra dynamic allocation.
Another thing you can do is increase the stream buffer size: see pubsetbuf and the example therein.
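A sketch of that buffer enlargement (note that pubsetbuf must be called before open() to take effect portably, and the buffer must outlive the stream; the 1 MiB size is arbitrary):

#include <fstream>
#include <string>

std::ifstream openWithLargeBuffer(const std::string& fileName,
                                  char* buffer, std::streamsize size)
{
    std::ifstream in;
    in.rdbuf()->pubsetbuf(buffer, size);  // install the caller's buffer first
    in.open(fileName);                    // then open the file
    return in;
}

// Usage:
// static char buf[1 << 20];  // 1 MiB
// auto infile = openWithLargeBuffer("data.txt", buf, sizeof buf);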
If speed is extremely important, you should memory map the file and store pointers into the memory-mapped region, instead of copying from the file stream buffer into distinct dynamically-allocated memory regions inside distinct strings. This could easily make a dramatic difference, perhaps as much as an order of magnitude, but a lot depends on the speed of your disk etc., so benchmark both if you've reason to care.

Permanently storing a buffer value

I'm trying to parse the lines of a text file and then store them inside a vector<string>. I'm coming from a Java background and am confused about how C++ handles assigning the value of a buffer. Here is my code:
string line;
vector<string> adsList;
ifstream inputFile;
inputFile.open("test.txt");
while (getline(inputFile, line))
{
    adsList.push_back(line);
}
In Java, when adding to a data structure a copy of the object is made and then that copy is inserted. In C++, my understanding is that the data structures only hold references so that any operation is very fast. What is the proper way to achieve what I want to do in C++? I have also tried the following code:
vector<string> adsList;
string line;
ifstream inputFile;
inputFile.open("test.txt");
while (getline(inputFile, line))
{
    string* temp = new string;
    *temp = line;
    adsList.push_back(*temp);
}
With my reasoning here being that creating a new string object and storing that would preserve it after being destroyed each loop iteration. C++ seems to handle this completely opposite of Java and I am having a hard time wrapping my head around it.
edit: here is what test.txt looks like:
item1 item1 item1
item2 item2 item2
item3 item3 item3
item4 item4 item4
I'm trying to store each line as a string and then store the string inside my vector. So the front of the vector would have a string with value "item1 item1 item1".
push_back() makes a copy, so your first code sample does exactly what you want it to do. In fact, all C++ containers store copies by default; you'd have to have a container of pointers to not get copies.
Your understanding regarding references is incorrect: Java stores references, while C++ stores whatever you ask it to, be it pointers or copies (note that you can't store references in STL containers; the equivalent is pointers).
vector::push_back stores a copy of the item being stored in the vector, so you don't have to create a pointer and new some memory on the heap in order to store the string.
(Internally, there is some heap allocation going on, but that's an implementation detail of std::string.)
The option we do have in C++ is to store pointers instead, and those you do have to heap-allocate, otherwise when the current stack frame is popped the pointers will point to defunct memory... but that is another topic.
See here for a simple working example of your code:
#include <iostream>
#include <vector>
#include <fstream>
#include <string>

int main()
{
    std::vector<std::string> adsList;
    std::string line;
    std::ifstream inputFile;
    inputFile.open("test.txt");
    // read a line from the file - store it in 'line'
    while (getline(inputFile, line))
    {
        // store a *copy* of line in the vector
        adsList.push_back(line);
    }
    // for each element in the adsList vector, get a *reference* (note the '&')
    for (std::string& s : adsList)
    {
        std::cout << s << std::endl;
    }
    return 0;
}
Since you didn't post the entire code, I suggest you try this to see if it is reading the file:
#include <iostream>
#include <fstream>
#include <vector>
#include <string>
using namespace std;

int main() {
    fstream inputFile("test.txt", fstream::in);
    string l;
    vector<string> v;
    while (getline(inputFile, l)) v.push_back(l);
    // Display the content of the vector:
    for (size_t i = 0; i < v.size(); ++i) {
        cout << v[i] << endl;
    }
    return 0;
}
Your initial assumption is incorrect. A copy is (generally) stored in the vector (ignoring move operations, which were brought in with C++11). Generally, this is the way you want to be operating.
If you are truly worried about speed and want to store references (pointers, actually) to things, you'll want to utilize something like std::unique_ptr or std::shared_ptr. For example:
std::vector<std::unique_ptr<std::string>> adsList;
std::string line;
inputFile.open("test.txt");
while (std::getline(inputFile, line)) {
    adsList.push_back(std::unique_ptr<std::string>(new std::string(line)));
}
Generally this is only done if you must be able to modify the values in the container and have the modifications reflected to the original object - in this case you'd use a std::shared_ptr. By far the most common scenario is your first code example.
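A minimal sketch of that shared_ptr scenario, in which two containers own the same string, so a modification through one is visible through the other:

#include <iostream>
#include <memory>
#include <string>
#include <vector>

int main()
{
    auto s = std::make_shared<std::string>("item1");
    std::vector<std::shared_ptr<std::string>> a{s};
    std::vector<std::shared_ptr<std::string>> b{s};

    *a[0] += " (edited)";              // modify through one container...
    std::cout << *b[0] << '\n';        // ...the other sees "item1 (edited)"
}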

Reading and writing C++ vector to a file

For some graphics work I need to read in a large amount of data as quickly as possible and would ideally like to directly read and write the data structures to disk. Basically I have a load of 3d models in various file formats which take too long to load so I want to write them out in their "prepared" format as a cache that will load much faster on subsequent runs of the program.
Is it safe to do it like this?
My worries are around directly reading into the data of the vector. I've removed error checking, hard-coded 4 as the size of the int, and so on, so that I can give a short working example; I know it's bad code. My question really is whether it is safe in C++ to read a whole array of structures directly into a vector like this. I believe it to be so, but C++ has so many traps and so much undefined behaviour once you start going low-level and dealing directly with raw memory like this.
I realise that number formats and sizes may change across platforms and compilers but this will only even be read and written by the same compiler program to cache data that may be needed on a later run of the same program.
#include <fstream>
#include <vector>

using namespace std;

struct Vertex
{
    float x, y, z;
};

typedef vector<Vertex> VertexList;

int main()
{
    // Create a list for testing
    VertexList list;
    Vertex v1 = {1.0f, 2.0f, 3.0f};   list.push_back(v1);
    Vertex v2 = {2.0f, 100.0f, 3.0f}; list.push_back(v2);
    Vertex v3 = {3.0f, 200.0f, 3.0f}; list.push_back(v3);
    Vertex v4 = {4.0f, 300.0f, 3.0f}; list.push_back(v4);

    // Write out the list to a disk file
    ofstream os("data.dat", ios::binary);
    int size1 = list.size();
    os.write((const char*)&size1, 4);
    os.write((const char*)&list[0], size1 * sizeof(Vertex));
    os.close();

    // Read it back in
    VertexList list2;
    ifstream is("data.dat", ios::binary);
    int size2;
    is.read((char*)&size2, 4);
    list2.resize(size2);
    // Is it safe to read a whole array of structures directly into the vector?
    is.read((char*)&list2[0], size2 * sizeof(Vertex));
}
As Laurynas says, std::vector is guaranteed to be contiguous, so that should work, but it is potentially non-portable.
On most systems, sizeof(Vertex) will be 12, but it's not uncommon for the struct to be padded, so that sizeof(Vertex) == 16. If you were to write the data on one system and then read that file in on another, there's no guarantee that it will work correctly.
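One way to catch such layout surprises at compile time, before trusting raw byte I/O, is a pair of static_asserts (a sketch, using the Vertex struct from the question):

#include <type_traits>

struct Vertex { float x, y, z; };  // as in the question

static_assert(sizeof(Vertex) == 3 * sizeof(float),
              "Vertex is padded; the raw file layout will differ");
static_assert(std::is_trivially_copyable<Vertex>::value,
              "Vertex must be trivially copyable for raw byte I/O");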
You might be interested in the Boost.Serialization library. It knows how to save/load STL containers to/from disk, among other things. It might be overkill for your simple example, but it might become more useful if you do other types of serialization in your program.
Here's some sample code that does what you're looking for:
#include <algorithm>
#include <fstream>
#include <vector>
#include <boost/archive/binary_oarchive.hpp>
#include <boost/archive/binary_iarchive.hpp>
#include <boost/serialization/vector.hpp>
using namespace std;
struct Vertex
{
float x, y, z;
};
bool operator==(const Vertex& lhs, const Vertex& rhs)
{
return lhs.x==rhs.x && lhs.y==rhs.y && lhs.z==rhs.z;
}
namespace boost { namespace serialization {
template<class Archive>
void serialize(Archive & ar, Vertex& v, const unsigned int version)
{
ar & v.x; ar & v.y; ar & v.z;
}
} }
typedef vector<Vertex> VertexList;
int main()
{
// Create a list for testing
const Vertex v[] = {
{1.0f, 2.0f, 3.0f},
{2.0f, 100.0f, 3.0f},
{3.0f, 200.0f, 3.0f},
{4.0f, 300.0f, 3.0f}
};
VertexList list(v, v + (sizeof(v) / sizeof(v[0])));
// Write out a list to a disk file
{
ofstream os("data.dat", ios::binary);
boost::archive::binary_oarchive oar(os);
oar << list;
}
// Read it back in
VertexList list2;
{
ifstream is("data.dat", ios::binary);
boost::archive::binary_iarchive iar(is);
iar >> list2;
}
// Check if vertex lists are equal
assert(list == list2);
return 0;
}
Note that I had to implement a serialize function for your Vertex in the boost::serialization namespace. This lets the serialization library know how to serialize Vertex members.
I've browsed through the boost::binary_oarchive source code and it seems that it reads/writes the raw vector array data directly from/to the stream buffer. So it should be pretty fast.
std::vector is guaranteed to be contiguous in memory, so, yes.
Another alternative to explicitly reading and writing your vector<> from and to a file is to replace the underlying allocator with one that allocates memory from a memory mapped file. This would allow you to avoid an intermediate read/write related copy. However, this approach does have some overhead. Unless your file is very large it may not make sense for your particular case. Profile as usual to determine if this approach is a good fit.
There are also some caveats to this approach that are handled very well by the Boost.Interprocess library. Of particular interest to you may be its allocators and containers.
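As a rough sketch of what that can look like with Boost.Interprocess (the file name, segment size and object name here are illustrative; Vertex is the struct from the question):

#include <boost/interprocess/managed_mapped_file.hpp>
#include <boost/interprocess/allocators/allocator.hpp>
#include <boost/interprocess/containers/vector.hpp>

namespace bip = boost::interprocess;

struct Vertex { float x, y, z; };

using VertexAllocator = bip::allocator<Vertex, bip::managed_mapped_file::segment_manager>;
using MappedVertexList = bip::vector<Vertex, VertexAllocator>;

int main()
{
    // Open (or create) a 1 MiB memory-mapped file
    bip::managed_mapped_file file(bip::open_or_create, "vertices.dat", 1 << 20);

    // Find the vector from a previous run, or construct it on first use
    MappedVertexList* list = file.find_or_construct<MappedVertexList>("vertices")
                                 (file.get_segment_manager());

    list->push_back(Vertex{1.0f, 2.0f, 3.0f});
}   // the data lives in vertices.dat; the next run finds it without any explicit load

Because the elements already live in the mapped file, there is no separate load step at startup; the OS pages the data in on demand.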
I just ran into this exact same problem.
First off, a warning: do not be tempted to write the vector object itself, as in
os.write((const char*)&list, sizeof(list)); // WRONG
A std::vector contains internal bookkeeping (pointers into the heap) in addition to the element data, so this writes those pointers rather than the elements, and the vector you read back will be filled with garbage. The statements in the question, which write starting from &list[0], are the correct ones.
One change you can make: instead of storing the element count in the file, you can derive it from the file size and read everything back in at once:
is.seekg(0, ifstream::end);
long byteCount = is.tellg();
is.seekg(0, ifstream::beg);
list2.resize(byteCount / sizeof(Vertex));
is.read((char*)&list2[0], byteCount);
If this is used for caching by the same code, I don't see any problem with this. I've used this same technique on multiple systems without a problem (all Unix based). As an extra precaution, you might want to write a struct with known values at the beginning of the file, and check that it reads ok. You might also want to record the size of the struct in the file. This will save a lot of debugging time in the future if the padding ever changes.
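A sketch of such a header (the magic value and field set are just an example):

#include <cstdint>

struct CacheHeader
{
    std::uint32_t magic;       // known constant, e.g. 0x56455254 ("VERT")
    std::uint32_t version;     // bump this when the format changes
    std::uint32_t vertexSize;  // sizeof(Vertex) at write time: detects padding changes
    std::uint32_t count;       // number of vertices that follow
};

// On load, read the header first and refuse the cache if magic, version
// or vertexSize don't match what the current build expects.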