10 fold size increase when reading file into struct - c++

I am trying to read a csv file into a struct containing a vector of vectors of strings. The file contains ~2 million lines and its size on disk is ~350 MB. Once the full file has been read, top shows the program using almost 3.5 GB of my memory. I have used vector::reserve to try to limit the capacity growth on push_back.
#include<iomanip>
#include<stdio.h>
#include<stdlib.h>
#include<iostream>
#include<fstream>
#include<string.h>
#include<sstream>
#include<math.h>
#include<vector>
#include<algorithm>
#include<array>
#include<ctime>
#include<boost/algorithm/string.hpp>
using namespace std;
struct datStr{
vector<string> colNames;
vector<vector<string>> data;
};
datStr readBoost(string fileName)
{
datStr ds;
ifstream inFile;
inFile.open(fileName);
string line;
getline(inFile, line);
vector<string> colNames;
stringstream ss(line);
string item;
int i = 0;
vector<int> colTypeInt;
while(getline(ss, item, ','))
{
item.erase( remove( item.begin(), item.end(), ' ' ), item.end() );
colNames.push_back(item);
vector<string> colVec;
ds.data.push_back(colVec);
ds.data[i].reserve(3000000);
i++;
}
int itr = 0;
while(getline(inFile, line))
{
vector<string> rowStr;
boost::split(rowStr, line, boost::is_any_of(","));
for(int ktr = 0; ktr < rowStr.size(); ktr++)
{
rowStr[ktr].erase( remove( rowStr[ktr].begin(), rowStr[ktr].end(), ' ' ), rowStr[ktr].end() );
ds.data[ktr].push_back(rowStr[ktr]);
}
itr++;
}
return ds;
}
int main()
{
datStr ds = readBoost("file.csv");
while(true)
{
}
}
PS: The last while is just so I can monitor the memory usage on completion of the program.
Is this something expected when using vectors or am I missing something here?
Another interesting fact: I started adding up size and capacity for each string in the read loop. Surprisingly, it adds up to just about 1/10 of what top shows on Ubuntu. Could it be that top is misreporting, or is my compiler allocating too much space?

I tested your code with an input file that has 1886850 lines of text, with a size of 105M.
With your code, the memory consumption was about 2.5G.
Then, I started modifying how data is stored.
First test:
Change datStr to:
struct datStr{
vector<string> colNames;
vector<string> lines;
};
This reduced the memory consumption to 206M. That's more than a 10-fold reduction in size. It's clear that the penalty for using
vector<vector<string>> data;
is rather stiff.
Second test:
Change datStr to:
struct datStr{
vector<string> colNames;
vector<string> lines;
vector<vector<string::size_type>> indices;
};
with indices keeping track of where the tokens in lines start. You can extract the tokens from each line by using lines and indices.
With this change, the memory consumption went up to 543MB, but that is still five times smaller than the original.
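For example, pulling a single token back out of the second-test layout could look like this (a sketch of my own, assuming indices[r] holds the start offset of every comma-separated field of lines[r], using the datStr definition above):
// Sketch: token k of row r, given the lines + indices layout of the second test.
std::string get_token(const datStr& ds, std::size_t r, std::size_t k)
{
    const std::string& line = ds.lines[r];
    std::string::size_type begin = ds.indices[r][k];
    std::string::size_type end = (k + 1 < ds.indices[r].size())
        ? ds.indices[r][k + 1] - 1   // stop just before the comma that precedes the next field
        : line.size();
    return line.substr(begin, end - begin);
}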
Third test
Change datStr to:
struct datStr{
vector<string> colNames;
vector<string> lines;
vector<vector<unsigned int>> indices;
};
With this change, the memory consumption came down to 455MB. This should work if you don't expect your lines to be longer than or equal to UINT_MAX characters.
Fourth Test
Change datStr to:
struct datStr{
vector<string> colNames;
vector<string> lines;
vector<vector<unsigned short>> indices;
};
With this change, the memory consumption came down to 278MB. This should work if you don't expect your lines to be longer than or equal to USHRT_MAX characters. For this case, the overhead of indices is really small, only 72MB.
Here's the modified code I used for my tests.
#include<iomanip>
#include<stdio.h>
#include<stdlib.h>
#include<iostream>
#include<fstream>
#include<string.h>
#include<sstream>
#include<math.h>
#include<vector>
#include<algorithm>
#include<array>
#include<ctime>
// #include<boost/algorithm/string.hpp>
using namespace std;
struct datStr{
vector<string> colNames;
vector<string> lines;
vector<vector<unsigned short>> data;
};
void split(vector<unsigned short>& rowStr, string const& line)
{
string::size_type begin = 0;
string::size_type end = line.size();
string::size_type iter = begin;
while ( iter != end)
{
if ( line[iter] == ',' )
{
rowStr.push_back(static_cast<unsigned short>(begin));
begin = iter + 1;
}
++iter;
}
if (begin != end )
{
rowStr.push_back(static_cast<unsigned short>(begin));
}
}
datStr readBoost(string fileName)
{
datStr ds;
ifstream inFile;
inFile.open(fileName);
string line;
getline(inFile, line);
vector<string> colNames;
stringstream ss(line);
string item;
int i = 0;
vector<int> colTypeInt;
while(getline(ss, item, ','))
{
item.erase( remove( item.begin(), item.end(), ' ' ), item.end() );
ds.colNames.push_back(item);
}
int itr = 0;
while(getline(inFile, line))
{
ds.lines.push_back(line);
vector<unsigned short> rowStr;
split(rowStr, line);
ds.data.push_back(rowStr);
}
return ds;
}
int main(int argc, char** argv)
{
datStr ds = readBoost(argv[1]);
while(true)
{
}
}

Your vector<vector<string>> suffers from the costs of indirection (pointer members to dynamically allocated memory), housekeeping (members supporting size()/end()/capacity()), and the housekeeping and rounding-up done internally by the dynamic memory allocation functions. If you look at the first graph, titled Real memory consumption for different string lengths, here, it suggests total overheads of around 40-45 bytes per string for a 32-bit app built with G++ 4.6.2, though an implementation could potentially get this as low as 4 bytes for strings of up to ~4 characters. Then there's the waste for the vector overheads on top of that.
You can address the issue in any of several ways, depending on your input data and efficiency needs:
store vector<std::pair<string, Column_Index>>, where Column_Index is a class you write that records the offsets in the string where each field appears (see the sketch after this list)
store vector<std::string> where column values are padded to known maximum widths, which will help most if the lengths are small, fixed and/or similar (e.g. date/times, small monetary amounts, ages)
memory map the file, then store offsets (but unquoting/unescaping is an issue - you could do that in-place, e.g. abc""def or abc\"def (whichever you support) -> abc"def, with the field simply getting one character shorter in place)
with the last two approaches, you can potentially overwrite the character after each field with a NUL if that's useful to you, so you can treat the fields as "C"-style ASCIIZ NUL-terminated strings
if some/all fields contain values like say 1.23456789012345678... where the textual representation may be longer than a binary inbuilt type (float, double, int64_t), doing a conversion before storage makes sense
similarly, if there's a set of repeating values - like a field of what are logically enumeration identifiers, you can encode them as integers, or if the values are repetitive but not known until runtime, you can create a bi-directional mapping from incrementing indices to values
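For the first bullet, a rough sketch of what such a Column_Index might look like (Column_Index is not an existing type, just a name the answer suggests you write yourself; it assumes lines shorter than USHRT_MAX, as elsewhere in this thread):
#include <string>
#include <utility>
#include <vector>

// Records where each field starts within one CSV line.
struct Column_Index
{
    std::vector<unsigned short> offsets; // field k spans [offsets[k], offsets[k+1] - 1)
};

Column_Index index_fields(const std::string& line)
{
    Column_Index idx;
    idx.offsets.push_back(0);
    for (std::string::size_type i = 0; i < line.size(); ++i)
        if (line[i] == ',')
            idx.offsets.push_back(static_cast<unsigned short>(i + 1)); // next field starts after the comma
    return idx;
}

// One entry per line: the raw text plus the offsets of its fields.
using Row = std::pair<std::string, Column_Index>;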

A couple things come to mind:
You say your file has about 2 million lines, but you reserve space for 3 million strings for each column. Even if you only have one column, that's a lot of wasted space. If you have a bunch of columns, that's a ton of wasted space. It might be informative to see how much space difference it makes if you don't reserve at all.
string has a small* but nonzero amount of overhead that you're paying for every single field in your 2-million line file. If you really need to hold all the data in memory at once and it's causing problems to do so, this may actually be a case where you're better off just using char* instead of string. But I'd only resort to this if adjusting reserve doesn't help.
* The overhead due to metadata is small, but if the strings are allocating extra capacity for their internal buffers, that could really add up. See this recent question.
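If you want to measure this in your own program, a small diagnostic along the lines of what the question describes (summing size() and capacity()) might look like the following sketch; note that it still does not count the allocator's own per-allocation bookkeeping:
#include <iostream>
#include <string>
#include <vector>

// Compare the bytes you "see" (sizes) with what the strings actually reserve
// (capacities) plus the fixed per-object overhead of std::string itself.
void report_string_memory(const std::vector<std::vector<std::string>>& data)
{
    std::size_t total_size = 0, total_capacity = 0, count = 0;
    for (const auto& col : data)
        for (const auto& s : col)
        {
            total_size += s.size();
            total_capacity += s.capacity();
            ++count;
        }
    std::cout << "strings: " << count
              << "  payload bytes: " << total_size
              << "  reserved bytes: " << total_capacity
              << "  per-object overhead: " << count * sizeof(std::string) << '\n';
}
// e.g. report_string_memory(ds.data);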
Update: The problem with your update is that you are storing pointers to temporary std::string objects in datStr. By the time you get around to printing, those strings have been destroyed and your pointers are wild.
If you want a simple, safe way to store your strings in datStr that doesn't allocate more space than it needs, you could use something like this:
#include <cstring> // std::strlen, std::strcpy
#include <string>

class TrivialReadOnlyString
{
private:
    char* m_buffer;
public:
    TrivialReadOnlyString(const std::string& src)
    {
        InitFrom(src.c_str());
    }
    TrivialReadOnlyString(const TrivialReadOnlyString& src)
    {
        InitFrom(src.m_buffer);
    }
    // Rule of three: with a copy constructor and destructor,
    // copy assignment has to be handled as well.
    TrivialReadOnlyString& operator=(const TrivialReadOnlyString& src)
    {
        if (this != &src)
        {
            delete[] m_buffer;
            InitFrom(src.m_buffer);
        }
        return *this;
    }
    ~TrivialReadOnlyString()
    {
        delete[] m_buffer;
    }
    const char* Get() const
    {
        return m_buffer;
    }
private:
    void InitFrom(const char* src)
    {
        // Can switch to the safe(r) versions of these functions
        // if you're using vc++ and it complains.
        std::size_t length = std::strlen(src);
        m_buffer = new char[length + 1];
        std::strcpy(m_buffer, src);
    }
};
There are a lot of further enhancements that could be made to this class, but I think it is sufficient for your program's needs as shown. This will fragment memory more than Blastfurnace's idea of storing the whole file in a single buffer, however.
If there is a lot of repetition in your data, you might also consider "folding" the repeats into a single object to avoid redundantly storing the same strings in memory over and over (flyweight pattern).
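If you go that route, a minimal interning sketch (my own illustration of the "folding" idea, not code from this answer) could look like this:
#include <string>
#include <unordered_map>
#include <vector>

// Store each distinct string once; columns then hold small integer indices instead of strings.
struct StringPool
{
    std::vector<std::string> values;                     // one copy per distinct string
    std::unordered_map<std::string, std::size_t> lookup; // string -> index in values

    std::size_t intern(const std::string& s)
    {
        auto it = lookup.find(s);
        if (it != lookup.end())
            return it->second;          // already stored, reuse the index
        values.push_back(s);
        lookup.emplace(s, values.size() - 1);
        return values.size() - 1;
    }
};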

Indulge me while I take a very different approach in answering your question. Others have already answered your direct question quite well, so let me provide another perspective entirely.
Do you realize that you could store that data in memory with a single allocation, plus one pointer for each line, or perhaps one pointer per cell?
On a 32 bit machine, that's 350MB + 8MB (or 8MB * number columns).
Did you know that it's easy to parallelize CSV parsing?
The problem you have is layers and layers of bloat. ifstream, stringstream, vector<vector<string>>, and boost::split are wonderful if you don't care about size or speed. All of that can be done more directly and at lower cost.
In situations like this, where size and speed do matter, you should consider doing things the manual way. Read the file using an API from your OS. Read it into a single memory location, and modify the memory in place by replacing commas or EOLs with '\0'. Store pointers to those C strings in your datStr and you're done.
You can write similar solutions for variants of the problem. If the file is too large for memory, you can process it in pieces. If you need to convert data to other formats like floating point, that's easy to do. If you'd like to parallelize parsing, that's far easier without the extra layers between you and your data.
Every programmer should be able to choose to use convenience layers or to use simpler methods. If you lack that choice, you won't be able to solve some problems.
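To make that concrete, here is a minimal sketch of the single-buffer idea using only standard iostreams (the answer suggests going further with OS-level APIs such as memory mapping, which this sketch does not do):
#include <fstream>
#include <iterator>
#include <vector>

struct FlatCsv
{
    std::vector<char> buffer;       // the whole file, separators turned into '\0'
    std::vector<const char*> cells; // one pointer per cell, row-major
};

FlatCsv load_csv(const char* path)
{
    FlatCsv csv;
    std::ifstream in(path, std::ios::binary);
    csv.buffer.assign(std::istreambuf_iterator<char>(in),
                      std::istreambuf_iterator<char>());
    // Terminate the last cell even if the file lacks a final newline
    // (a file that does end with '\n' just gains one empty trailing cell).
    csv.buffer.push_back('\n');

    char* start = csv.buffer.data();
    for (char& c : csv.buffer)
    {
        if (c == ',' || c == '\n')
        {
            c = '\0';               // terminate the current cell in place
            csv.cells.push_back(start);
            start = &c + 1;         // next cell begins after the separator
        }
    }
    return csv;
}
Each entry of cells is then a plain C string pointing into the one big buffer, which you can keep, convert, or copy as needed.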

Related

Weird behavior writing/reading simple binary file

I'm writing and reading on a binary file. I'm getting small errors when outputting the reads.
The strings are there but with little snippets like: (I"�U) (�U) appended to the end of ~30% of them
I'm using g++ compiler on Ubuntu
Simplified code:
struct Db_connection
{
public:
string name;
};
int Db_connection::write_config()
{
ofstream config_f("config.dat", std::ios_base::binary | std::ios_base::out); //open file
string str = name;
int size = str.length();
config_f.write(reinterpret_cast<char *>(&size), sizeof(int)); // write size of string in int size chunk
config_f.write(str.c_str(), size); //write string
config_f.close();
return 0;
}
Db_connection read_config()
{
ifstream config_f("config.dat", std::ios_base::binary | std::ios_base::in);
Db_connection return_obj;
int size;
string data;
config_f.read(reinterpret_cast<char *>(&size), sizeof(int)); // read string size
char buffer[size];
config_f.read(buffer, size); // read string
data.assign(buffer);
return_obj.name = data;
return return_obj;
}
Is there anything obvious I am messing up? Does this have to do with endianness? I tried to minimize the code to its absolute essentials.
The actual code is more complex. I have a class holding vectors of 2 structs. One struct has four string members and the other has a string and a bool. These functions are actually members of, and return (respectively), that class. The functions loop through the vectors writing struct members sequentially.
Two oddities:
To debug, I added outputs of the size and data variables on each iteration in both the read and write functions. size comes out accurate and consistent on both sides. data is accurate on the write side but with the weird special characters on the read side. I'm looking at outputs like:
Read Size: 12
Data: random addy2�U //the 12 human readable chars are there but with 2 extra symbols
The final chunk of data (a bool) comes out fine every time, so I don't think there is a file pointer issue. If its relevant: every bool and int is fine. Its just a portion of the strings.
Hopefully I'm making a bonehead mistake and this minimized code can be critiqued. The actual example would be too long.
Big thanks to WhozCraig,
The following edit did, indeed, work:
Db_connection read_config()
{
ifstream config_f("config.dat", std::ios_base::binary | std::ios_base::in);
Db_connection return_obj;
int size;
string data;
config_f.read(reinterpret_cast<char *>(&size), sizeof(int)); // read string size
vector<char> buff(size);
config_f.read(buff.data(), size);
data = string(buff.begin(), buff.end());
return_obj.name = data;
return return_obj;
}
As paddy pointed out directly and WhozCraig alluded to, this code still needs a fixed-size, portable integer type for recording the string length in binary, and the write function needs to be rethought as well.
Thank you very much to the both of you. I read like 5-8 top search results for "cpp binary i/o" before writing my code and still ended up with that mess. You guys saved me hours/days of my life.
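For what it's worth, one way to do that (a sketch of my own, not necessarily what paddy and WhozCraig had in mind) is to write the length as a fixed-width 32-bit value byte by byte, so the record layout no longer depends on the platform's int size or endianness:
#include <cstdint>
#include <fstream>
#include <string>

void write_string(std::ofstream& out, const std::string& s)
{
    std::uint32_t size = static_cast<std::uint32_t>(s.size());
    char bytes[4] = {
        static_cast<char>(size & 0xFF),
        static_cast<char>((size >> 8) & 0xFF),
        static_cast<char>((size >> 16) & 0xFF),
        static_cast<char>((size >> 24) & 0xFF)
    };
    out.write(bytes, 4);          // length, little-endian
    out.write(s.data(), size);    // then the characters, no terminator
}

std::string read_string(std::ifstream& in)
{
    unsigned char bytes[4];
    in.read(reinterpret_cast<char*>(bytes), 4);
    std::uint32_t size = bytes[0] | (bytes[1] << 8) | (bytes[2] << 16)
                       | (static_cast<std::uint32_t>(bytes[3]) << 24);
    std::string s(size, '\0');    // read exactly 'size' bytes, no reliance on a terminator
    in.read(&s[0], size);
    return s;
}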

c++: problem while copying string into character array

I am trying to copy contents of a file into fields in a class courseInfo.
this is the code im using:
#include<iostream>
#include<fstream>
#include<vector>
#include<sstream>
#include <bits/stdc++.h>
using namespace std;
class courseInfo
{
public:
char courseCode[8];
char courseName[80];
int ECTS;
};
int main()
{
ifstream fin("courses.txt");
if(!fin.is_open())
{
cout<<"file doesn't exist";
return 0;
}
string line;
vector<courseInfo> courses;
while(getline(fin,line))
{
stringstream linestream(line);
string segment;
vector<string> segmentlist;
while(getline(linestream, segment, ';'))
{
segmentlist.push_back(segment);
}
//cout<<segmentlist.at(0).c_str();
courseInfo c;
//segmentlist.at(0).copy(c.courseCode, segmentlist.at(0).size()+1);
//c.courseCode[segmentlist.at(0).size()] = '\0';
strcpy(c.courseCode, segmentlist.at(0).c_str());
cout<<c.courseCode<<"\n;
strcpy(c.courseName, segmentlist.at(1).c_str());
cout<<c.courseCode;
}
return 0;
}
content of courses.txt file:
TURK 101;Turkish l;3.
output i get:
TURK 101
TURK 101Turkish l
The contents of courseCode change when I copy something into courseName.
Why does this happen?
How do I rectify this?
Note how TURK 101 is exactly 8 bytes.
When you cout << c.courseCode, your program prints characters until it encounters a NUL byte. By accident, the first byte of c.courseName is NUL.
After you read into it, it is no longer NUL and thus printing c.courseCode happily continues into c.courseName.
Some options:
The most obvious (and recommended) solution is to use std::string in your struct instead of fixed-size char arrays (see the sketch after this list).
However, this looks like a homework question, so you probably are not allowed to use std::string.
Use std::vector<char> instead, but that is probably also not allowed.
Make courseCode large enough to contain any possible course code, plus one character for the NUL-terminator. In this case, make courseCode 9 chars large.
Use heap-allocated memory (new char[str.size()+1] to allocate a char *, delete[] ptr to free it afterwards). And then change courseInfo to take regular pointers. Ideally all the memory management is done in constructors/destructors. See the rule of three/five/zero.
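A sketch of the first option, assuming std::string is allowed (the field names mirror the original class):
#include <string>

class courseInfo
{
public:
    std::string courseCode; // grows to whatever length the file contains
    std::string courseName; // no fixed buffer, no manual NUL terminator
    int ECTS;
};

// ...inside the read loop, the copies become plain assignments:
// courseInfo c;
// c.courseCode = segmentlist.at(0);
// c.courseName = segmentlist.at(1);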

C++ Fast way to load large txt file in vector<string>

I have a file with ~12,000,000 hex lines and a size of 1.6 GB.
Example of file:
999CBA166262923D53D3EFA72F5C4E8EE1E1FF1E7E33C42D0CE8B73604034580F2
889CBA166262923D53D3EFA72F5C4E8EE1E1FF1E7E33C42D0CE8B73604034580F2
Example of code:
vector<string> buffer;
ifstream fe1("strings.txt");
string line1;
while (getline(fe1, line1)) {
buffer.push_back(line1);
}
Now the loading takes about 20 minutes. Any suggestions on how to speed it up? Thanks a lot in advance.
Loading a large text file into std::vector<std::string> is rather inefficient and wasteful because it allocates heap memory for each std::string and re-allocates the vector multiple times. Each of these heap allocations requires heap book-keeping information under the hood (normally 8 bytes per allocation on a 64-bit system), and each line requires an std::string object (8-32 bytes depending on the standard library), so that a file loaded this way takes a lot more space in RAM than on disk.
One fast way is to map the file into memory and implement iterators to walk over lines in it. This sidesteps the issues mentioned above.
Working example:
#include <boost/interprocess/file_mapping.hpp>
#include <boost/interprocess/mapped_region.hpp>
#include <boost/iterator/iterator_facade.hpp>
#include <boost/range/iterator_range_core.hpp>
#include <algorithm>
#include <iostream>
class LineIterator
: public boost::iterator_facade<
LineIterator,
boost::iterator_range<char const*>,
boost::iterators::forward_traversal_tag,
boost::iterator_range<char const*>
>
{
char const *p_, *q_;
boost::iterator_range<char const*> dereference() const { return {p_, this->next()}; }
bool equal(LineIterator b) const { return p_ == b.p_; }
void increment() { p_ = this->next(); }
char const* next() const { auto p = std::find(p_, q_, '\n'); return p + (p != q_); }
friend class boost::iterator_core_access;
public:
LineIterator(char const* begin, char const* end) : p_(begin), q_(end) {}
};
inline boost::iterator_range<LineIterator> crange(boost::interprocess::mapped_region const& r) {
auto p = static_cast<char const*>(r.get_address());
auto q = p + r.get_size();
return {LineIterator{p, q}, LineIterator{q, q}};
}
inline std::ostream& operator<<(std::ostream& s, boost::iterator_range<char const*> const& line) {
return s.write(line.begin(), line.size());
}
int main() {
boost::interprocess::file_mapping file("/usr/include/gnu-versions.h", boost::interprocess::read_only);
boost::interprocess::mapped_region memory(file, boost::interprocess::read_only);
unsigned n = 0;
for(auto line : crange(memory))
std::cout << n++ << ' ' << line;
}
You can read the entire file into memory. This can be done with C++ streams, or you may be able to get even more performance by using platform specific API's, such as memory mapped files or their own file reading API's.
Once you have this block of data, for performance you want to avoid any further copies and use it in-place. In C++17 you have std::string_view which is similar to std::string but uses existing string data, avoiding the copy. Otherwise you might just work with C style char* strings, either by replacing the newline with a null (\0), using a pair of pointers (begin/end) or a pointer and size.
Here I used string_view, I also assumed newlines are always \n and that there is a newline at the end. You may need to adjust the loop if this is not the case. Guessing the size of the vector will also gain a little performance, you could maybe do so from the file length. I also skipped some error handling.
std::fstream is("data.txt", std::ios::in | std::ios::binary);
is.seekg(0, std::ios::end);
size_t data_size = is.tellg();
is.seekg(0, std::ios::beg);
std::unique_ptr<char[]> data(new char[data_size]);
is.read(data.get(), data_size);
std::vector<std::string_view> strings;
strings.reserve(data_size / 40); // If you have some idea, avoid re-allocations as general practice with vector etc.
for (size_t i = 0, start = 0; i < data_size; ++i)
{
if (data[i] == '\n') // End of line, got string
{
strings.emplace_back(data.get() + start, i - start);
start = i + 1;
}
}
To get a little more performance, you might run the loop to do CPU work in parallel of the file IO. This can be done with threads or using platform-specific async file IO. However in this case the loop will be very fast, so there would not be much to gain.
You can simply allocate enough memory and read the whole text file almost at once. Then you can access the data in RAM through a pointer. I read a whole 4GB text file in about 3 seconds.

Unexpected Huge Memory Consumption for vector of vectors

Processing a dict file with variable-length ASCII words.
constexpr int MAXLINE = 1024 * 1024 * 10; // total number of words, one word per line.
Goal: read the whole file into memory and be able to access each word by index.
I want quick access to each word by index. We could use a two-dimensional array to achieve that; however, a MAXLENGTH needs to be set, and the maximum length is not known ahead of time.
constexpr int MAXLENGTH= 1024; // since I do not have the maximum length of the word
char* aray = new char[MAXLINE * MAXLENGTH];
The code above would NOT be memory friendly if most words are shorter than MAXLENGTH, and some words could be longer than MAXLENGTH, causing errors.
For variable-length objects, I think vector might be the best fit for this problem, so I came up with a vector of vectors to store them.
vector<vector<char>> array(MAXLINE);
This looks promising, until I realized that is not the case.
I tested both approaches on a dict file with MAXLINE 4-ASCII-character words (here all words are 4-char words).
constexpr int MAXLINE = 1024 * 1024 * 10;
If I use operator new to allocate the array (here MAXLENGTH is just 4):
char* aray = new char[MAXLINE * 4];
The memory consumption is roughly 40MB. However, if I try to use a vector to store the words (I changed char to int32_t to fit exactly four chars):
vector<vector<int32_t>> array(MAXLINE);
You can also use a char vector and reserve space for 4 chars:
vector<vector<char>> array(MAXLINE);
for (auto & c : array) {
c.reserve(4);
}
The memory consumption jumps to about 720MB (debug mode) or 280MB (release mode), which is unexpectedly high. Can someone explain why?
Observation: the size of a vector is implementation dependent and depends on whether you are compiling in debug mode.
On my system:
sizeof(vector<int32_t>) = 16 // debug mode
and
sizeof(vector<int32_t>) = 12 // release mode
In debug mode the memory consumption is 720MB for vector<vector<int32_t>> array(MAXLINE);, while the outer vector's elements alone only take sizeof(vector<int32_t>) * MAXLINE = 16 * 10M = 160 MB.
In release mode, the memory consumption is 280MB, while the expected value is sizeof(vector<int32_t>) * MAXLINE = 12 * 10M = 120 MB.
Can someone explain the big difference between the real memory consumption and the expected consumption (calculated from the sub-vector size)?
Appreciated, and happy new year!
For your case:
"so, does it mean vector of vectors is not a good idea to store small objects?"
Generally no. A nested sub-vector isn't such a good solution for storing a boatload of teeny variable-sized sequences. You don't want to represent an indexed mesh that allows variable-polygons (triangles, quads, pentagons, hexagons, n-gons) using a separate std::vector instance per polygon, for example, or else you'll tend to blow up memory usage and have a really slow solution: slow because there's a heap allocation involved for every single freaking polygon, and explosive in memory because vector often preallocates some memory for the elements in addition to storing size and capacity in ways that are often larger than needed if you have a boatload of teeny sequences.
vector is an excellent data structure for storing a million things contiguously, but not so excellent for storing a million teeny vectors.
In such cases even a singly-linked indexed list can work better, with the indices pointing into a bigger vector; it performs much faster and sometimes even uses less memory in spite of the 32-bit link overhead, along the lines of the sketch below.
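This is my own rough reconstruction of that idea, not the original author's code; it assumes 32-bit indices and that sequences are built by prepending:
#include <cstdint>
#include <vector>

template <class T>
struct IndexedLists
{
    struct Node
    {
        T value;
        std::int32_t next;   // index of the next node in the same sequence, -1 = end
    };
    std::vector<Node> pool;          // one big contiguous allocation for all nodes
    std::vector<std::int32_t> heads; // one head index per sequence, -1 = empty;
                                     // size this up front, e.g. heads.assign(num_sequences, -1)

    // Prepend a value to sequence s (order is reversed, which is often acceptable).
    void push(std::size_t s, const T& value)
    {
        pool.push_back({value, heads[s]});
        heads[s] = static_cast<std::int32_t>(pool.size() - 1);
    }
};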
That said, for your particular case with a big ol' random-access sequence of variable-length strings, this is what I recommend:
// stores the starting index of each null-terminated string
std::vector<int> string_start;
// stores the characters for *all* the strings in one single vector.
std::vector<char> strings;
That will reduce the overhead down to closer to 32-bits per string entry (assuming int is 32-bits) and you will no longer require a separate heap allocation for every single string entry you add.
After you finish reading everything in, you can minimize memory use with a compaction to truncate the array (eliminating any excess reserve capacity):
// Compact memory use using copy-and-swap.
vector<int>(string_start).swap(string_start);
vector<char>(strings).swap(strings);
Now to retrieve the nth string, you can do this:
const char* str = strings.data() + string_start[n];
If you have need for search capabilities as well, you can actually store a boatload of strings and search for them quickly (including things like prefix-based searches) storing less memory than even the above solution would take using a compressed trie. It's a bit more of an involved solution though but might be worthwhile if your software revolves around dictionaries of strings and searching through them and you might just be able to find some third party library that already provides one for ya.
std::string
Just for completeness I thought I'd throw in a mention of std::string. Recent implementations often optimize for small strings by storing a buffer in advance which is not separately heap-allocated. However, in your case that can lead to even more explosive memory usage since that makes sizeof(string) bigger in ways that can consume far more memory than needed for really short strings. It does make std::string more useful though for temporary strings, making it so you might get something that performs perfectly fine if you fetched std::string out of that big vector of chars on demand like so:
std::string str = strings.data() + string_start[n];
... as opposed to:
const char* str = strings.data() + string_start[n];
That said, the big vector of chars would do much better performance and memory wise for storing all the strings. Just generally speaking, generalized containers of any sort tend to cease to perform so well if you want to store millions of teeny ones.
The main conceptual problem is that when the desire is for like a million variable-sized sequences, the variable-sized nature of the requirement combined with the generalized nature of the container will imply that you have a million teeny memory managers, all having to potentially allocate on the heap or, if not, allocate more data than is needed, along with keeping track of its size/capacity if it's contiguous, and so on. Inevitably a million+ managers of their own memory gets quite expensive.
So in these cases, it often pays to forego the convenience of a "complete, independent" container and instead use one giant buffer, or one giant container storing the element data as with the case of vector<char> strings, along with another big container that indexes or points to it, as in the case of vector<int> string_start. With that you can represent the analogical one million variable-length strings just using two big containers instead of a million small ones.
removing nth string
Your case doesn't sound like you need to ever remove a string entry, just load and access, but if you ever need to remove a string, that can get tricky when all the strings and indices to their starting positions are stored in two giant buffers.
Here I recommend, if you wanna do this, to not actually remove the string immediately from the buffer. Instead you can simply do this:
// Indicate that the nth string has been removed.
string_start[n] = -1;
When iterating over available strings, just skip the ones where string_start[n] is -1. Then, every now and then to compact memory use after a number of strings have been removed, do this:
void compact_buffers(vector<char>& strings, vector<int>& string_start)
{
// Create new buffers to hold the new data excluding removed strings.
vector<char> new_strings;
vector<int> new_string_start;
new_strings.reserve(strings.size());
new_string_start.reserve(string_start.size());
// Store a write position into the 'new_strings' buffer.
int write_pos = 0;
// Copy strings to new buffers, skipping over removed ones.
for (int start: string_start)
{
// If the string has not been removed:
if (start != -1)
{
// Fetch the string from the old buffer.
const char* str = strings.data() + start;
// Fetch the size of the string including the null terminator.
const size_t len = strlen(str) + 1;
// Insert the string to the new buffer.
new_strings.insert(new_strings.end(), str, str + len);
// Append the current write position to the starting positions
// of the new strings.
new_string_start.push_back(write_pos);
// Increment the write position by the string size.
write_pos += static_cast<int>(len);
}
}
// Swap compacted new buffers with old ones.
vector<char>(new_strings).swap(strings);
vector<int>(new_string_start).swap(string_start);
}
You can call the above periodically to compact memory use after removing a number of strings.
String Sequence
Here's some code throwing all this stuff together that you can freely use and modify however you like.
////////////////////////////////////////////////////////
// StringSequence.hpp:
////////////////////////////////////////////////////////
#ifndef STRING_SEQUENCE_HPP
#define STRING_SEQUENCE_HPP
#include <vector>
/// Stores a sequence of strings.
class StringSequence
{
public:
/// Creates a new sequence of strings.
StringSequence();
/// Inserts a new string to the back of the sequence.
void insert(const char str[]);
/// Inserts a new string to the back of the sequence.
void insert(size_t len, const char str[]);
/// Removes the nth string.
void erase(size_t n);
/// @return The nth string.
const char* operator[](size_t n) const;
/// @return The range of indexable strings.
size_t range() const;
/// @return True if the nth index is occupied by a string.
bool occupied(size_t n) const;
/// Compacts the memory use of the sequence.
void compact();
/// Swaps the contents of this sequence with the other.
void swap(StringSequence& other);
private:
std::vector<char> buffer;
std::vector<size_t> start;
size_t write_pos;
size_t num_removed;
};
#endif
////////////////////////////////////////////////////////
// StringSequence.cpp:
////////////////////////////////////////////////////////
#include "StringSequence.hpp"
#include <cassert>
#include <cstring>
StringSequence::StringSequence(): write_pos(1), num_removed(0)
{
// Reserve the front of the buffer for empty strings.
// We'll point removed strings here.
buffer.push_back('\0');
}
void StringSequence::insert(const char str[])
{
assert(str && "Trying to insert a null string!");
insert(strlen(str), str);
}
void StringSequence::insert(size_t len, const char str[])
{
const size_t str_size = len + 1;
buffer.insert(buffer.end(), str, str + str_size);
start.push_back(write_pos);
write_pos += str_size;
}
void StringSequence::erase(size_t n)
{
assert(occupied(n) && "The nth string has already been removed!");
start[n] = 0;
++num_removed;
}
const char* StringSequence::operator[](size_t n) const
{
return &buffer[0] + start[n];
}
size_t StringSequence::range() const
{
return start.size();
}
bool StringSequence::occupied(size_t n) const
{
return start[n] != 0;
}
void StringSequence::compact()
{
if (num_removed > 0)
{
// Create a new sequence excluding removed strings.
StringSequence new_seq;
new_seq.buffer.reserve(buffer.size());
new_seq.start.reserve(start.size());
for (size_t j=0; j < range(); ++j)
{
const char* str = (*this)[j];
if (occupied(j))
new_seq.insert(str);
}
// Swap the new sequence with this one.
new_seq.swap(*this);
}
// Remove excess capacity.
if (buffer.capacity() > buffer.size())
std::vector<char>(buffer).swap(buffer);
if (start.capacity() > start.size())
std::vector<size_t>(start).swap(start);
}
void StringSequence::swap(StringSequence& other)
{
buffer.swap(other.buffer);
start.swap(other.start);
std::swap(write_pos, other.write_pos);
std::swap(num_removed, other.num_removed);
}
////////////////////////////////////////////////////////
// Quick demo:
////////////////////////////////////////////////////////
#include "StringSequence.hpp"
#include <iostream>
using namespace std;
int main()
{
StringSequence seq;
seq.insert("foo");
seq.insert("bar");
seq.insert("baz");
seq.insert("hello");
seq.insert("world");
seq.erase(2);
seq.erase(3);
cout << "Before compaction:" << endl;
for (size_t j=0; j < seq.range(); ++j)
{
if (seq.occupied(j))
cout << j << ": " << seq[j] << endl;
}
cout << endl;
cout << "After compaction:" << endl;
seq.compact();
for (size_t j=0; j < seq.range(); ++j)
{
if (seq.occupied(j))
cout << j << ": " << seq[j] << endl;
}
cout << endl;
}
Output:
Before compaction:
0: foo
1: bar
4: world
After compaction:
0: foo
1: bar
2: world
I didn't bother to make it standard-compliant (too lazy, and the result isn't necessarily that much more useful for this particular situation), but hopefully that's not a strong need here.
The size of a vector is implementation dependent and also depends on whether you are compiling in debug mode. Normally it's at least the size of a few internal pointers (begin, end of storage, and end of reserved memory). On my Linux system sizeof(vector<int32_t>) is 24 bytes (probably 3 x 8 bytes, one per pointer). That means that for your 10000000 items it should be at least ca. 240 MB.
How much memory does a vector<uint32_t> with a length of 1 need? Here are some estimates:
4 bytes for the uint32_t. That's what you expected.
ca. 8/16 bytes dynamic memory allocation overhead. People always forget that the new implementation must remember the size of the allocation, plus some additional housekeeping data. Typically, you can expect an overhead of two pointers, so 8 bytes on a 32 bit system, 16 bytes on a 64 bit system.
ca. 4/12 bytes for alignment padding. Dynamic allocations must be aligned for any data type. How much is required depends on the CPU, typical alignment requirements are 8 bytes (fully aligned double) or 16 bytes (for the CPUs vector instructions). So, your new implementation will add 4/12 padding bytes to the 4 bytes of payload.
ca. 12/24 bytes for the vector<> object itself. The vector<> object needs to store three things of pointer size: the pointer to the dynamic memory, its size, and the number of actually used objects. Multiply with the pointer size 4/8, and you get its sizeof.
Summing this all up, I arrive at 4 + 8/16 + 4/12 + 12/24 = 28/48 bytes that are used to store 4 bytes.
From your numbers, I guess that you compile in 32 bit mode, and that your max alignment is 8 bytes. In debug mode, your new implementation seems to add additional allocation overhead to catch common programming mistakes.
You're creating 10485760 instances of vector<int32_t>, stored inside another vector. I'm sure that 720MB is a reasonable amount of memory for all the instances' internal data members plus the outer vector's buffer.
As others have pointed out, sizeof(vector<int32_t>) is big enough to produce such numbers when you initialize 10485760 instances.
What you may want is a C++ dictionary implementation - a map:
https://www.moderncplusplus.com/map/
It will still be big (even bigger), but less awkward stylistically. Now, if memory is a concern, then don't use it.
sizeof(std::map<std::string, std::string>) == 48 on my system.

C++ Storing vector<std::string> from a file using reserve and possibly emplace_back

I am looking for a quick way to store strings from a file into a vector of strings such that I can reserve the number of lines ahead of time. What is the best way to do this? Should I count the newline characters first, or just get the total size of the file and reserve, say, size / 80 to give a rough estimate of what to reserve? Ideally I don't want the vector to have to realloc each time, which would slow things down tremendously for a large file. Ideally I would count the number of items ahead of time, but should I do this by opening in binary mode, counting the newlines, and then reopening? That seems wasteful; I'm curious about your thoughts on this. Also, is there a way to use emplace_back to get rid of the temporary somestring in the getline code below? I did see the following 2 implementations for counting the number of lines ahead of time: Fastest way to find the number of lines in a text (C++)
std::vector<std::string> vs;
std::string somestring;
std::ifstream somefile("somefilename");
while (std::getline(somefile, somestring))
vs.push_back(somestring);
Also, I could do something to get the total size ahead of time; can I just transform the char* in this case into the vector directly? This goes back to my reserve hint of size / 80 or some constant to give an estimated size to reserve upfront.
#include <iostream>
#include <fstream>
int main () {
char* contents;
std::ifstream istr ("test.txt");
if (istr)
{
std::streambuf * pbuf = istr.rdbuf();
//which I can use as a reserve hint say size / 80
std::streamsize size = pbuf->pubseekoff(0,istr.end);
//maybe I can construct the vector from the char buf directly?
pbuf->pubseekoff(0,istr.beg);
contents = new char [size];
pbuf->sgetn (contents,size);
}
return 0;
}
Rather than waste time counting the lines ahead of time, I would just reserve() an initial value, then start pushing the actual lines, and if you happen to push the reserved number of items then just reserve() some more space before continuing with more pushing, repeating as needed.
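A sketch of that strategy (the chunk size is arbitrary; whether it actually beats vector's built-in growth is debatable, as the next answer's measurements suggest):
#include <fstream>
#include <string>
#include <vector>

std::vector<std::string> read_lines(const char* path)
{
    const std::size_t chunk = 100000;   // how many lines to reserve at a time
    std::vector<std::string> vs;
    vs.reserve(chunk);

    std::ifstream file(path);
    std::string line;
    while (std::getline(file, line))
    {
        if (vs.size() == vs.capacity())         // used up what we reserved
            vs.reserve(vs.capacity() + chunk);  // reserve another block before pushing
        vs.push_back(std::move(line));
    }
    return vs;
}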
The strategy for reserving space in a std::vector is designed to "grow on demand". That is, it will not allocate capacity for one string at a time; it will first allocate one, then, say, ten, then one hundred, and so on (not exactly those numbers, but that's the idea). In other words, the implementation of std::vector::push_back already manages this for you.
Consider the following example: I am reading the entire text of War and Peace (65007 lines) using two versions: one that does no pre-allocation and one that does (i.e., one reserves zero space, and the other reserves the full 65007 lines; text from: http://www.gutenberg.org/cache/epub/2600/pg2600.txt)
#include<iostream>
#include<fstream>
#include<vector>
#include<string>
#include<boost/timer/timer.hpp>
void reader(size_t N=0) {
std::string line;
std::vector<std::string> lines;
lines.reserve(N);
std::ifstream fp("wp.txt");
while(std::getline(fp, line)) {
lines.push_back(line);
}
std::cout<<"read "<<lines.size()<<" lines"<<std::endl;
}
int main() {
{
std::cout<<"no pre-allocation"<<std::endl;
boost::timer::auto_cpu_timer t;
reader();
}
{
std::cout<<"full pre-allocation"<<std::endl;
boost::timer::auto_cpu_timer t;
reader(65007);
}
return 0;
}
Results:
no pre-allocation
read 65007 lines
0.027796s wall, 0.020000s user + 0.000000s system = 0.020000s CPU (72.0%)
full pre-allocation
read 65007 lines
0.023914s wall, 0.020000s user + 0.010000s system = 0.030000s CPU (125.4%)
You see, for a non-trivial amount of text I have a difference of milliseconds.
Do you really need to know the lines beforehand? Is it really a bottleneck? Are you saving, say, one second of Wall time but complicating your code ten-fold by preallocating the lines?