C++ Fast way to load large txt file in vector<string>

I have a file with ~12,000,000 hex lines, about 1.6 GB in size.
Example of file:
999CBA166262923D53D3EFA72F5C4E8EE1E1FF1E7E33C42D0CE8B73604034580F2
889CBA166262923D53D3EFA72F5C4E8EE1E1FF1E7E33C42D0CE8B73604034580F2
Example of code:
vector<string> buffer;
ifstream fe1("strings.txt");
string line1;
while (getline(fe1, line1)) {
    buffer.push_back(line1);
}
Right now the loading takes about 20 minutes. Any suggestions on how to speed it up? Thanks a lot in advance.

Loading a large text file into std::vector<std::string> is rather inefficient and wasteful because it allocates heap memory for each std::string and re-allocates the vector multiple times. Each of these heap allocations requires heap book-keeping information under the hood (normally 8 bytes per allocation on a 64-bit system), and each line requires an std::string object (8-32 bytes depending on the standard library), so that a file loaded this way takes a lot more space in RAM than on disk.
One fast way is to map the file into memory and implement iterators to walk over lines in it. This sidesteps the issues mentioned above.
Working example:
#include <boost/interprocess/file_mapping.hpp>
#include <boost/interprocess/mapped_region.hpp>
#include <boost/iterator/iterator_facade.hpp>
#include <boost/range/iterator_range_core.hpp>
#include <algorithm> // for std::find
#include <iostream>

class LineIterator
    : public boost::iterator_facade<
          LineIterator,
          boost::iterator_range<char const*>,
          boost::iterators::forward_traversal_tag,
          boost::iterator_range<char const*>
      >
{
    char const *p_, *q_;

    boost::iterator_range<char const*> dereference() const { return {p_, this->next()}; }
    bool equal(LineIterator b) const { return p_ == b.p_; }
    void increment() { p_ = this->next(); }
    char const* next() const { auto p = std::find(p_, q_, '\n'); return p + (p != q_); }

    friend class boost::iterator_core_access;

public:
    LineIterator(char const* begin, char const* end) : p_(begin), q_(end) {}
};

inline boost::iterator_range<LineIterator> crange(boost::interprocess::mapped_region const& r) {
    auto p = static_cast<char const*>(r.get_address());
    auto q = p + r.get_size();
    return {LineIterator{p, q}, LineIterator{q, q}};
}

inline std::ostream& operator<<(std::ostream& s, boost::iterator_range<char const*> const& line) {
    return s.write(line.begin(), line.size());
}

int main() {
    boost::interprocess::file_mapping file("/usr/include/gnu-versions.h", boost::interprocess::read_only);
    boost::interprocess::mapped_region memory(file, boost::interprocess::read_only);
    unsigned n = 0;
    for (auto line : crange(memory))
        std::cout << n++ << ' ' << line;
}

You can read the entire file into memory. This can be done with C++ streams, or you may be able to get even more performance by using platform-specific APIs, such as memory-mapped files or native file-reading APIs.
Once you have this block of data, for performance you want to avoid any further copies and use it in-place. In C++17 you have std::string_view, which is similar to std::string but refers to existing string data, avoiding the copy. Otherwise you might just work with C-style char* strings, either by replacing the newline with a null (\0), using a pair of pointers (begin/end), or using a pointer and a size.
Here I used string_view; I also assumed newlines are always \n and that there is a newline at the end. You may need to adjust the loop if this is not the case. Reserving a guessed size for the vector will also gain a little performance; you could estimate it from the file length. I also skipped some error handling.
#include <fstream>
#include <memory>
#include <string_view>
#include <vector>

std::fstream is("data.txt", std::ios::in | std::ios::binary);
is.seekg(0, std::ios::end);
size_t data_size = is.tellg();
is.seekg(0, std::ios::beg);

std::unique_ptr<char[]> data(new char[data_size]);
is.read(data.get(), data_size);

std::vector<std::string_view> strings;
strings.reserve(data_size / 40); // if you have some idea of the count, avoid re-allocations as general practice with vector etc.

for (size_t i = 0, start = 0; i < data_size; ++i)
{
    if (data[i] == '\n') // end of line, got string
    {
        strings.emplace_back(data.get() + start, i - start);
        start = i + 1;
    }
}
To get a little more performance, you might run the loop to do CPU work in parallel with the file IO. This can be done with threads or using platform-specific async file IO. However, in this case the loop will be very fast, so there would not be much to gain.
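If you do want to experiment with that, here is a minimal sketch of moving the read onto a background thread with std::async (the file name and the whole-file-at-once read are illustrative assumptions, not part of the answer above):

#include <cstddef>
#include <fstream>
#include <future>
#include <string>

int main()
{
    // Kick off the file read on a worker thread; the caller can do other
    // setup work until the future is needed.
    std::future<std::string> pending = std::async(std::launch::async, [] {
        std::ifstream is("data.txt", std::ios::in | std::ios::binary);
        is.seekg(0, std::ios::end);
        std::string buf(static_cast<std::size_t>(is.tellg()), '\0');
        is.seekg(0, std::ios::beg);
        is.read(&buf[0], static_cast<std::streamsize>(buf.size()));
        return buf;
    });

    // ... other initialization could run here, overlapping the IO ...

    std::string data = pending.get(); // blocks until the read completes
    // run the newline-scanning loop from above over `data`
}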

You can simply allocate a large enough block of RAM and read the whole text file almost at once. Then you can access the data in RAM by pointer. I read a whole 4 GB text file in about 3 seconds this way.
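A minimal sketch of what that can look like with the C stdio API (the file name and the fixed line width are assumptions for illustration):

#include <cstddef>
#include <cstdio>
#include <vector>

int main()
{
    std::FILE* fp = std::fopen("strings.txt", "rb");
    if (!fp) return 1;
    std::fseek(fp, 0, SEEK_END);
    long size = std::ftell(fp);
    std::rewind(fp);

    std::vector<char> data(static_cast<std::size_t>(size)); // one big allocation
    std::fread(data.data(), 1, data.size(), fp);
    std::fclose(fp);

    // Access lines through pointers into the block; with fixed-width lines
    // of W characters plus '\n', line i starts at data.data() + i * (W + 1).
    const char* first_line = data.data();
    (void)first_line;
}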


Read undefined number of variables from file line

I am making a simple database system in C++.
The table data is stored in a file, where each line represents a table row, where all data is separated by spaces.
I want to read ncols elements from the same line, where ncols is not always the same, and store each read value in data[x].
The data variable is declared as char** data.
void Table::LoadTableRows(Table::TableStruct *table, char *dbname) {
    ifstream fp;
    Table::RowStruct *p = (Table::RowStruct*) malloc(sizeof(Table::RowStruct));
    char *filename;
    int x;

    filename = (char*) malloc((strlen(table->tablename)+strlen(dbname)+strlen("Data"))*sizeof(char));
    strcpy(filename,dbname);
    strcat(filename,table->tablename);
    strcat(filename,"Data");

    fp.open(filename);
    while(!fp.eof()) { //goes through all file lines
        Table::RowStruct *newrow = (Table::RowStruct*) malloc(sizeof(Table::RowStruct)); //allocates space for a new row

        //initializes element
        newrow->prev = NULL;
        newrow->next = NULL;
        newrow->data = (char**) malloc(table->ncols*30*sizeof(char)); //allocates space to store the row data

        for(x=0;x<table->ncols;x++) {
            newrow->data[x] = (char*) malloc(30*sizeof(char)); //allocates space for individual data element
            fp >> newrow->data[x];
        }

        for(p=table->rows;p->next!=NULL;p=p->next) {}
        newrow->prev = p;
        p->next = newrow;
    }
    fp.close();
}
I've tried this code, but it crashed as I expected.
I do not fully understand what you want to do; there is some missing information. Anyway, I will try to help.
I guess that you are new to C++. You are using a lot of C functions, and your program looks almost entirely like C with some additional C++ features, which you should avoid. In particular, you are using malloc and raw pointers; avoid these entirely.
Try to learn C++ step by step.
Let me first show you what I mean with C-Style programming. I took your program and added comments with hints.
// Do not pass arguments by pointer; pass by reference.
// For invariants, pass as const T&.
// Do not use "char *". It should be at least const, but better not used at all.
// Use std::string (so pass "const std::string& dbname") as argument.
void Table::LoadTableRows(Table::TableStruct *table, char *dbname) {
    // Always initialize variables. Use uniform initialization, with {}.
    ifstream fp;
    // Never use malloc. Use new.
    // Do not use raw pointers. Use std::unique_ptr, initialized with std::make_unique.
    // Do not use C-style casts. Use static_cast.
    Table::RowStruct *p = (Table::RowStruct*) malloc(sizeof(Table::RowStruct));
    // Use std::string.
    char *filename;
    int x;
    // Again: no malloc, no C-style cast.
    // Do not use C-style string functions.
    filename = (char*) malloc((strlen(table->tablename)+strlen(dbname)+strlen("Data"))*sizeof(char));
    // Do not use C-style string functions.
    strcpy(filename,dbname);
    // Do not use C-style string functions.
    strcat(filename,table->tablename);
    // Do not use C-style string functions.
    strcat(filename,"Data");
    // Check if open works. Open the file through the constructor; then it will be closed by the destructor.
    fp.open(filename);
    while(!fp.eof()) { //goes through all file lines
        // Do not use malloc and C-style casts.
        Table::RowStruct *newrow = (Table::RowStruct*) malloc(sizeof(Table::RowStruct)); //allocates space for a new row
        //initializes element
        // Do not use NULL, but nullptr.
        newrow->prev = NULL;
        newrow->next = NULL;
        // Do not use malloc and C-style casts.
        newrow->data = (char**) malloc(table->ncols*30*sizeof(char)); //allocates space to store the row data
        // Do not use x++, but ++x.
        for(x=0;x<table->ncols;x++) {
            // Do not use malloc and C-style casts.
            newrow->data[x] = (char*) malloc(30*sizeof(char)); //allocates space for individual data element
            // Check for out-of-bounds reads.
            fp >> newrow->data[x];
        }
        // Do not use a self-made linked list. Use an STL container.
        for(p=table->rows;p->next!=NULL;p=p->next) {}
        newrow->prev = p;
        p->next = newrow;
    }
    fp.close();
}
You see, there is a lot of C in it and not so much C++.
Modern C++ makes much use of containers and algorithms.
A full-fledged C++ example is below. It is hard for beginners to understand, but try to analyze it and you will get the hang of it.
#include <vector>
#include <string>
#include <iostream>
#include <fstream>
#include <iterator>
#include <algorithm>
#include <sstream>

using AllWordsInOneLine = std::vector<std::string>;
using AllLines = std::vector<AllWordsInOneLine>;

struct Line // ! This is a proxy for the input_iterator !
{
    // Input function. Read one line of the text file and split it into words.
    friend std::istream& operator>>(std::istream& is, Line& line) {
        std::string wholeLine;
        std::getline(is, wholeLine);
        std::istringstream iss{ wholeLine };
        line.allWordsInOneLine.clear();
        std::copy(std::istream_iterator<std::string>(iss), std::istream_iterator<std::string>(),
                  std::back_inserter(line.allWordsInOneLine));
        return is;
    }
    operator AllWordsInOneLine() const { return allWordsInOneLine; } // cast to needed result
    AllWordsInOneLine allWordsInOneLine{}; // local storage for all words in a line
};

int main()
{
    std::ifstream inFileStream{ "r:\\input.txt" }; // open input file; will be closed by the destructor
    if (!inFileStream) { // ! operator is overloaded
        std::cerr << "Could not open input file\n";
    }
    else {
        // Read the complete input file into memory and organize it in words by lines.
        AllLines allLines{ std::istream_iterator<Line>(inFileStream), std::istream_iterator<Line>() };

        // Make exactly ncols entries.
        const size_t ncols = 6; // whatever ncols may be. Empty cols will be filled with ___ (or whatever you like)
        std::for_each(allLines.begin(), allLines.end(), [ncols](AllWordsInOneLine& awil) { awil.resize(ncols, "___"); });

        // Copy the result to std::cout.
        std::for_each(allLines.begin(), allLines.end(), [](AllWordsInOneLine& awil) {
            std::copy(awil.begin(), awil.end(), std::ostream_iterator<std::string>(std::cout, " "));
            std::cout << '\n';
        });
    }
    return 0;
}
Please note especially that the whole file, with all lines split into words, is read with one line of code in main.
An additional one-liner converts each line into a vector with exactly ncols elements (words), regardless of whether there were more or fewer than ncols words per line in the source file.
Hope I could help at least a little bit.
char *filename;
filename = (char*) malloc((strlen(table->tablename)+strlen(dbname)+strlen("Data"))*sizeof(char));
strcpy(filename,dbname);
strcat(filename,table->tablename);
strcat(filename,"Data");
Here's your first problem: you haven't allocated space for the terminating nul byte. I'm not sure why you're using C-style strings instead of std::string, but C-style strings use a zero byte to mark the end of the string.
fp.open(filename);
while(!fp.eof()) { //goes through all file lines
You are misusing eof. It can't predict whether a future read will succeed; it's not a future-predicting function but a past-reporting one.
newrow->data = (char**) malloc(table->ncols*30*sizeof(char)); //allocates space to store the row data
This is puzzling. The type is char **, so data should point to an array of table->ncols pointers, yet you size the allocation in characters (table->ncols*30*sizeof(char)) rather than in pointers (table->ncols*sizeof(char*)). Why allocate 30 characters per pointer?
fp >> newrow->data[x];
You don't check if this read succeeds. That's never a good thing and makes your program impossible to debug.
These are the major issues that immediately stand out.
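For contrast, a rough sketch of the same loop with std::string doing the memory management and the reads themselves controlling the loop (an illustration, not a drop-in replacement for the poster's row structures):

#include <cstddef>
#include <fstream>
#include <string>
#include <utility>
#include <vector>

// Read whitespace-separated fields, ncols per row; the loop is driven by
// whether each extraction succeeded, not by eof().
std::vector<std::vector<std::string>> load_rows(const std::string& filename, std::size_t ncols)
{
    std::vector<std::vector<std::string>> rows;
    std::ifstream fp(filename); // closed automatically by the destructor

    std::vector<std::string> row;
    std::string field;
    while (fp >> field) {
        row.push_back(field);
        if (row.size() == ncols) {
            rows.push_back(std::move(row));
            row.clear();
        }
    }
    return rows;
}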

What is the fastest way to resize std::string?

I'm working on a C++ project (using VS2008) where I need to load a very large XML file into a std::wstring. Presently the following line reserves memory before the data is loaded:
//std::wstring str;
//size_t ncbDataSz = file size in bytes
str.resize(ncbDataSz / sizeof(WCHAR));
But my current issue is that the resize method takes a somewhat long time for larger string sizes (I just tested it with 3GB of data, in an x64 project, on a desktop PC with 12GB of free RAM, and it took about 4-5 seconds to complete).
So I'm curious, is there's a faster (more optimized) method to resize std::string? I'm asking for Windows only.
You can instantiate basic_string with char_traits which does nothing on assign(count):
#include <string>

struct noinit_char_traits : std::char_traits<char> {
    using std::char_traits<char>::assign;
    static char_type* assign(char_type* p, std::size_t count, char_type a) { return p; }
};

using noinit_string = std::basic_string<char, noinit_char_traits>;
Note that it will also affect functions like basic_string::fill() etc.
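Usage could look like the sketch below; it assumes the library routes the resize fill through the bulk char_traits::assign overridden above (which is what this trick relies on), and a wchar_t analogue for the questioner's std::wstring would be built the same way:

#include <cstddef>
#include <fstream>

noinit_string load(const char* filename, std::size_t size)
{
    noinit_string s;
    s.resize(size); // sets the length without zero-filling the new characters
    std::ifstream is(filename, std::ios::binary);
    is.read(&s[0], static_cast<std::streamsize>(size));
    return s;
}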
Instead of resizing your input string you could just allocate it using std::string::reserve because resizing also initializes every element.
You could try something like this to see if it improves performance for you:
#include <cerrno>
#include <cstring>
#include <fstream>
#include <stdexcept>
#include <string>

std::wstring load_file(std::string const& filename)
{
    std::wifstream ifs(filename, std::ios::ate);
    // errno works on POSIX systems, not sure about Windows
    if(!ifs)
        throw std::runtime_error(std::strerror(errno));
    std::wstring s;
    s.reserve(ifs.tellg()); // allocate but don't initialize
    ifs.seekg(0);
    wchar_t buf[4096];
    while(ifs.read(buf, sizeof(buf)/sizeof(buf[0])))
        s.append(buf, buf + ifs.gcount()); // this will never reallocate
    s.append(buf, buf + ifs.gcount());     // don't drop the final partial chunk
    return s;
}

10 fold size increase when reading file into struct

I am trying to read a csv file into a struct containing a vector of vectors of strings. The file contains ~2 million lines and its size on disk is ~350 MB. When I read the file into the struct, top shows that the program is using almost 3.5 GB of my memory. I have used vector reserve to try to limit vector capacity growth on push_back.
#include<iomanip>
#include<stdio.h>
#include<stdlib.h>
#include<iostream>
#include<fstream>
#include<string.h>
#include<sstream>
#include<math.h>
#include<vector>
#include<algorithm>
#include<array>
#include<ctime>
#include<boost/algorithm/string.hpp>
using namespace std;

struct datStr{
    vector<string> colNames;
    vector<vector<string>> data;
};

datStr readBoost(string fileName)
{
    datStr ds;
    ifstream inFile;
    inFile.open(fileName);
    string line;
    getline(inFile, line);
    vector<string> colNames;
    stringstream ss(line);
    string item;
    int i = 0;
    vector<int> colTypeInt;
    while(getline(ss, item, ','))
    {
        item.erase( remove( item.begin(), item.end(), ' ' ), item.end() );
        colNames.push_back(item);
        vector<string> colVec;
        ds.data.push_back(colVec);
        ds.data[i].reserve(3000000);
        i++;
    }
    int itr = 0;
    while(getline(inFile, line))
    {
        vector<string> rowStr;
        boost::split(rowStr, line, boost::is_any_of(","));
        for(int ktr = 0; ktr < rowStr.size(); ktr++)
        {
            rowStr[ktr].erase( remove( rowStr[ktr].begin(), rowStr[ktr].end(), ' ' ), rowStr[ktr].end() );
            ds.data[ktr].push_back(rowStr[ktr]);
        }
        itr++;
    }
    return ds;
}

int main()
{
    datStr ds = readBoost("file.csv");
    while(true)
    {
    }
}
PS: The last while is just so I can monitor the memory usage on completion of the program.
Is this something expected when using vectors or am I missing something here?
Another interesting fact: I started adding up size and capacity for each string in the read loop. Surprisingly, it adds up to just 1/10 of what top shows me on Ubuntu. Could it be that top is misreporting, or is my compiler allocating too much space?
I tested your code with an input file that has 1886850 lines of text, with a size of 105M.
With your code, the memory consumption was about 2.5G.
Then, I started modifying how data is stored.
First test:
Change datStr to:
struct datStr{
    vector<string> colNames;
    vector<string> lines;
};
This reduced the memory consumption to 206M, more than a 10-fold reduction in size. It's clear that the penalty for using
vector<vector<string>> data;
is rather stiff.
Second test:
Change datStr to:
struct datStr{
    vector<string> colNames;
    vector<string> lines;
    vector<vector<string::size_type>> indices;
};
with indices keeping track of where the tokens in lines start. You can extract the tokens from each line by using lines and indices.
With this change, the memory consumption went to 543MB, but it is still five times smaller than the original.
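To make that concrete, extracting a single token from lines plus indices could look like this hypothetical helper (not part of the measured code):

#include <cstddef>
#include <string>
#include <vector>

std::string token(const std::vector<std::string>& lines,
                  const std::vector<std::vector<std::string::size_type>>& indices,
                  std::size_t row, std::size_t col)
{
    const std::string& line = lines[row];
    std::string::size_type begin = indices[row][col];
    // The token runs up to the character before the next token's offset
    // (the comma), or to the end of the line for the last column.
    std::string::size_type end = (col + 1 < indices[row].size())
        ? indices[row][col + 1] - 1
        : line.size();
    return line.substr(begin, end - begin);
}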
Third test
Change dataStr to:
struct datStr{
    vector<string> colNames;
    vector<string> lines;
    vector<vector<unsigned int>> indices;
};
With this change, the memory consumption came down to 455MB. This should work if you don't expect your lines to be longer than or equal to UINT_MAX characters.
Fourth Test
Change dataStr to:
struct datStr{
    vector<string> colNames;
    vector<string> lines;
    vector<vector<unsigned short>> indices;
};
With this change, the memory consumption came down to 278MB. This should work if you don't expect your lines to be longer than or equal to USHRT_MAX characters. In this case, the overhead of indices is really small, only 72MB.
Here's the modified code I used for my tests.
#include<iomanip>
#include<stdio.h>
#include<stdlib.h>
#include<iostream>
#include<fstream>
#include<string.h>
#include<sstream>
#include<math.h>
#include<vector>
#include<algorithm>
#include<array>
#include<ctime>
// #include<boost/algorithm/string.hpp>
using namespace std;

struct datStr{
    vector<string> colNames;
    vector<string> lines;
    vector<vector<unsigned short>> data;
};

void split(vector<unsigned short>& rowStr, string const& line)
{
    string::size_type begin = 0;
    string::size_type end = line.size();
    string::size_type iter = begin;
    while ( iter != end)
    {
        ++iter;
        if ( line[iter] == ',' )
        {
            rowStr.push_back(static_cast<unsigned short>(begin));
            ++iter;
            begin = iter;
        }
    }
    if (begin != end )
    {
        rowStr.push_back(static_cast<unsigned short>(begin));
    }
}

datStr readBoost(string fileName)
{
    datStr ds;
    ifstream inFile;
    inFile.open(fileName);
    string line;
    getline(inFile, line);
    vector<string> colNames;
    stringstream ss(line);
    string item;
    int i = 0;
    vector<int> colTypeInt;
    while(getline(ss, item, ','))
    {
        item.erase( remove( item.begin(), item.end(), ' ' ), item.end() );
        ds.colNames.push_back(item);
    }
    int itr = 0;
    while(getline(inFile, line))
    {
        ds.lines.push_back(line);
        vector<unsigned short> rowStr;
        split(rowStr, line);
        ds.data.push_back(rowStr);
    }
    return ds;
}

int main(int argc, char** argv)
{
    datStr ds = readBoost(argv[1]);
    while(true)
    {
    }
}
Your vector<vector<string>> suffers from the costs of indirection (pointer members to dynamically allocated memory), housekeeping (members supporting size()/end()/capacity()), and the housekeeping and rounding-up internal to the dynamic memory allocation functions. If you look at the first graph, titled "Real memory consumption for different string lengths", here, it suggests total overheads of around 40-45 bytes per string for a 32-bit app built with G++ 4.6.2, though an implementation could potentially get this as low as 4 bytes for strings of up to ~4 characters. Then there's the waste from vector overheads.
You can address the issue in any of several ways, depending on your input data and efficiency needs:
store vector<std::pair<string, Column_Index>>, where Column_Index is a class you write that records the offsets in the string where each field appears
store vector<std::string> where column values are padded to known maximum widths, which will help most if the lengths are small, fixed and/or similar (e.g. date/times, small monetary amounts, ages)
memory map the file, then store offsets (but unquoting/unescaping is an issue - you could do that in-place, e.g. abc""def or abc\"def (whichever you support) -> abc"def)
with the last two approaches, you can potentially overwrite the character after each field with a NUL if that's useful to you, so you can treat the fields as "C"-style ASCIIZ NUL-terminated strings
if some/all fields contain values like, say, 1.23456789012345678... where the textual representation may be longer than a built-in binary type (float, double, int64_t), doing a conversion before storage makes sense
similarly, if there's a set of repeating values - like a field of what are logically enumeration identifiers - you can encode them as integers, or if the values are repetitive but not known until runtime, you can create a bi-directional mapping from incrementing indices to values
A couple things come to mind:
You say your file has about 2 million lines, but you reserve space for 3 million strings for each column. Even if you only have one column, that's a lot of wasted space. If you have a bunch of columns, that's a ton of wasted space. It might be informative to see how much space difference it makes if you don't reserve at all.
string has a small* but nonzero amount of overhead that you're paying for every single field in your 2-million line file. If you really need to hold all the data in memory at once and it's causing problems to do so, this may actually be a case where you're better off just using char* instead of string. But I'd only resort to this if adjusting reserve doesn't help.
* The overhead due to metadata is small, but if the strings are allocating extra capacity for their internal buffers, that could really add up. See this recent question.
Update: The problem with your update is that you are storing pointers to temporary std::string objects in datStr. By the time you get around to printing, those strings have been destroyed and your pointers are wild.
If you want a simple, safe way to store your strings in datStr that doesn't allocate more space than it needs, you could use something like this:
#include <cstring>
#include <string>

class TrivialReadOnlyString
{
private:
    char* m_buffer;

public:
    TrivialReadOnlyString(const std::string& src)
    {
        InitFrom(src.c_str());
    }
    TrivialReadOnlyString(const TrivialReadOnlyString& src)
    {
        InitFrom(src.m_buffer);
    }
    ~TrivialReadOnlyString()
    {
        delete[] m_buffer;
    }
    const char* Get() const
    {
        return m_buffer;
    }

private:
    void InitFrom(const char* src)
    {
        // Can switch to the safe(r) versions of these functions
        // if you're using VC++ and it complains.
        size_t length = strlen(src);
        m_buffer = new char[length + 1];
        strcpy(m_buffer, src);
    }
};
There are a lot of further enhancements that could be made to this class, but I think it is sufficient for your program's needs as shown. This will fragment memory more than Blastfurnace's idea of storing the whole file in a single buffer, however.
If there is a lot of repetition in your data, you might also consider "folding" the repeats into a single object to avoid redundantly storing the same strings in memory over and over (flyweight pattern).
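A minimal sketch of that folding via an interning table (the class name and the std::unordered_map choice are mine, for illustration):

#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Each distinct string is stored once; rows hold small integer ids
// instead of repeated copies of the same text.
class StringPool
{
    std::unordered_map<std::string, std::size_t> ids_;
    std::vector<std::string> values_;
public:
    std::size_t intern(const std::string& s)
    {
        auto it = ids_.find(s);
        if (it != ids_.end()) return it->second;
        values_.push_back(s);
        ids_.emplace(s, values_.size() - 1);
        return values_.size() - 1;
    }
    const std::string& value(std::size_t id) const { return values_[id]; }
};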
Indulge me while I take a very different approach in answering your question. Others have already answered your direct question quite well, so let me provide another perspective entirely.
Do you realize that you could store that data in memory with a single allocation, plus one pointer for each line, or perhaps one pointer per cell?
On a 32 bit machine, that's 350MB + 8MB (or 8MB * number columns).
Did you know that it's easy to parallelize CSV parsing?
The problem you have is layers and layers of bloat. ifstream, stringstream, vector<vector<string>>, and boost::split are wonderful if you don't care about size or speed. All of that can be done more directly and at lower cost.
In situations like this, where size and speed do matter, you should consider doing things the manual way. Read the file using an API from your OS. Read it into a single memory location, and modify the memory in place by replacing commas or EOLs with '\0'. Store pointers to those C strings in your datStr and you're done.
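A rough sketch of that manual approach, assuming POSIX open/read and a file that fits in memory (error handling and short-read looping trimmed for brevity):

#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>
#include <vector>

int main()
{
    int fd = open("file.csv", O_RDONLY);
    struct stat st;
    fstat(fd, &st);

    std::vector<char> buf(st.st_size + 1);
    read(fd, buf.data(), st.st_size); // a production version would loop here
    close(fd);
    buf[st.st_size] = '\0';           // sentinel so the scan below can stop

    // Replace separators in place and keep one pointer per field.
    std::vector<char*> fields;
    char* start = buf.data();
    for (char* p = buf.data(); *p; ++p) {
        if (*p == ',' || *p == '\n') {
            *p = '\0';
            fields.push_back(start);
            start = p + 1;
        }
    }
    if (*start) fields.push_back(start); // last field if the file lacks a final newline
}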
You can write similar solutions for variants of the problem. If the file is too large for memory, you can process it in pieces. If you need to convert data to other formats like floating point, that's easy to do. If you'd like to parallelize parsing, that's far easier without the extra layers between you and your data.
Every programmer should be able to choose to use convenience layers or to use simpler methods. If you lack that choice, you won't be able to solve some problems.

C++ read()-ing from a socket to an ofstream

Is there a C/C++ way to read data from a socket using read() and have the receiving buffer be a file (ofstream) or a similar self-extending object (e.g. a vector)?
EDIT: The question arose while I contemplated how to read a stream socket that may receive the contents of a, say, 10000+ byte file. I just never liked putting 20000 or 50000 bytes (large enough for now) on the stack as a buffer where the file could be stored temporarily till I could stick it into a file. Why not just stream it directly into the file to start with?
Much like you can get at the char* inside a std::string, I thought of something like
read( int fd, outFile.front(), std::npos ); // npos = INT_MAX
or something like that.
end edit
Thanks.
This is simplistic, and off the top of my fingers, but I think something along these lines would work out:
#include <memory>
#include <vector>
#include <unistd.h> // for ::read

template <unsigned BUF_SIZE>
struct Buffer {
    char buf_[BUF_SIZE];
    int len_;
    Buffer () : buf_(), len_(0) {}
    int read (int fd) {
        int r = ::read(fd, buf_ + len_, BUF_SIZE - len_); // qualified, to call the system read
        if (r > 0) len_ += r;
        return r;
    }
    int capacity () const { return BUF_SIZE - len_; }
};

template <unsigned BUF_SIZE>
struct BufferStream {
    typedef std::unique_ptr< Buffer<BUF_SIZE> > BufferPtr;
    std::vector<BufferPtr> stream_;
    BufferStream () : stream_(1, BufferPtr(new Buffer<BUF_SIZE>)) {}
    int read (int fd) {
        if ((*stream_.rbegin())->capacity() == 0)
            stream_.push_back(BufferPtr(new Buffer<BUF_SIZE>));
        return (*stream_.rbegin())->read(fd);
    }
};
In a comment, you mentioned you wanted to avoid creating a big char buffer. When using the read system call, it is generally more efficient to perform a few large reads rather than many small ones. So most implementations will opt for large input buffers to gain that efficiency. You could implement something like:
std::vector<char> input;
char in;
int r;
while ((r = read(fd, &in, 1)) == 1) input.push_back(in);
But that would involve a system call and at least one byte copied for every byte of input. In contrast, the code I put forth avoids extra data copies.
I don't really expect the code I put out to be the solution you would adopt. I just wanted to provide you with an illustration of how to create a self-extending object that was fairly space and time efficient. Depending on your purposes, you may want to extend it, or write your own. Off the top of my head, some improvements may be:
use std::list instead, to avoid vector resizing
allow API a parameter to specify how many bytes to read
use readv to always allow at least BUF_SIZE bytes (or more than BUF_SIZE bytes) to be read at a time (see the sketch below)
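A sketch of that readv variant, building on the Buffer type above and assuming POSIX sys/uio.h (the helper name is mine):

#include <sys/uio.h>

// Fill the tail of the current buffer and the head of a fresh one in a
// single system call, so each call can transfer at least BUF_SIZE bytes.
template <unsigned BUF_SIZE>
int read_into_two(int fd, Buffer<BUF_SIZE>& cur, Buffer<BUF_SIZE>& next)
{
    iovec iov[2];
    iov[0].iov_base = cur.buf_ + cur.len_;
    iov[0].iov_len  = BUF_SIZE - cur.len_;
    iov[1].iov_base = next.buf_;
    iov[1].iov_len  = BUF_SIZE;
    ssize_t r = readv(fd, iov, 2);
    if (r > 0) {
        ssize_t first = static_cast<ssize_t>(iov[0].iov_len);
        cur.len_  += static_cast<int>(r < first ? r : first);
        next.len_ += static_cast<int>(r > first ? r - first : 0);
    }
    return static_cast<int>(r);
}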
Take a look at stream support in boost::asio.

What's the most efficient way to read a file into a std::string?

I currently do this, and the conversion to std::string at the end takes 98% of the execution time. There must be a better way!
std::string
file2string(std::string filename)
{
    std::ifstream file(filename.c_str());
    if(!file.is_open()){
        // If they passed a bad file name, or one we have no read access to,
        // we pass back an empty string.
        return "";
    }
    // find out how much data there is
    file.seekg(0,std::ios::end);
    std::streampos length = file.tellg();
    file.seekg(0,std::ios::beg);
    // make a vector that size and
    std::vector<char> buf(length);
    // fill it from the file
    file.read(&buf[0],length);
    file.close();
    // return buffer as string
    std::string s(buf.begin(),buf.end());
    return s;
}
Being a big fan of the C++ iterator abstraction and the algorithms, I would love the following to be the fastest way to read a file (or any other input stream) into a std::string (and then print the content):
#include <algorithm>
#include <fstream>
#include <iostream>
#include <iterator>
#include <string>

int main()
{
    std::string s(std::istreambuf_iterator<char>(std::ifstream("file")
                                                 >> std::skipws),
                  std::istreambuf_iterator<char>());
    std::cout << "file='" << s << "'\n";
}
This certainly is fast for my own implementation of IOStreams, but it requires a lot of trickery to actually get it fast. Primarily, it requires optimizing algorithms to cope with segmented sequences: a stream can be seen as a sequence of input buffers. I'm not aware of any STL implementation consistently doing this optimization. The odd use of std::skipws is just to get a reference to the just-created stream: the std::istreambuf_iterator<char> expects a reference, to which the temporary file stream wouldn't bind.
Since this probably isn't the fastest approach, I would be inclined to use std::getline() with a particular "newline" character, i.e. one which isn't in the file:
std::string s;
// optionally reserve space although I wouldn't be too fussed about the
// reallocations because the reads probably dominate the performance
std::getline(std::ifstream("file") >> std::skipws, s, 0);
This assumes that the file doesn't contain a null character. Any other character would do as well. Unfortunately, std::getline() takes a char_type as delimiting argument, rather than an int_type which is what the member std::istream::getline() takes for the delimiter: in this case you could use eof() for a character which never occurs (char_type, int_type, and eof() refer to the respective member of char_traits<char>). The member version, in turn, can't be used because you would need to know ahead of time how many characters are in the file.
BTW, I saw some attempts to use seeking to determine the size of the file. This is bound not to work too well. The problem is that the code conversion done in std::ifstream (well, actually in std::filebuf) can create a different number of characters than there are bytes in the file. Admittedly, this isn't the case when using the default C locale, and it is possible to detect that this locale does no conversion. Otherwise the best bet for the stream would be to run over the file and determine the number of characters being produced. I actually think that this is what would need to be done when the code conversion could do something interesting, although I don't think it actually is done. However, none of the examples explicitly set up the C locale, using e.g. std::locale::global(std::locale("C"));. Even with this, it is also necessary to open the file in std::ios_base::binary mode, because otherwise end-of-line sequences may be replaced by a single character when reading. Admittedly, this would only make the result shorter, never longer.
The other approaches using the extraction from std::streambuf* (i.e. those involving rdbuf()) all require that the resulting content is copied at some point. Given that the file may actually be very large this may not be an option. Without the copy this could very well be the fastest approach, however. To avoid the copy, it would be possible to create a simple custom stream buffer which takes a reference to a std::string as constructor argument and directly appends to this std::string:
#include <fstream>
#include <iostream>
#include <string>

class custombuf
    : public std::streambuf
{
public:
    custombuf(std::string& target): target_(target) {
        this->setp(this->buffer_, this->buffer_ + bufsize - 1);
    }

private:
    std::string& target_;
    enum { bufsize = 8192 };
    char buffer_[bufsize];

    int overflow(int c) {
        if (!traits_type::eq_int_type(c, traits_type::eof()))
        {
            *this->pptr() = traits_type::to_char_type(c);
            this->pbump(1);
        }
        this->target_.append(this->pbase(), this->pptr() - this->pbase());
        this->setp(this->buffer_, this->buffer_ + bufsize - 1);
        return traits_type::not_eof(c);
    }
    int sync() { this->overflow(traits_type::eof()); return 0; }
};

int main()
{
    std::string s;
    custombuf sbuf(s);
    if (std::ostream(&sbuf)
        << std::ifstream("readfile.cpp").rdbuf()
        << std::flush) {
        std::cout << "file='" << s << "'\n";
    }
    else {
        std::cout << "failed to read file\n";
    }
}
At least with a suitably chosen buffer I would expect this version to be fairly fast. Which version is the fastest will certainly depend on the system, the standard C++ library being used, and probably a number of other factors, i.e. you want to measure the performance.
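For the measuring itself, a simple std::chrono harness along these lines (illustrative only) is usually enough to compare the variants on a given system:

#include <chrono>
#include <iostream>
#include <string>

// Time one invocation of a file-to-string function; run it several times
// on a warm cache to get stable numbers.
template <typename F>
std::string timed(F f, char const* label)
{
    auto t0 = std::chrono::steady_clock::now();
    std::string s = f();
    auto t1 = std::chrono::steady_clock::now();
    auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(t1 - t0).count();
    std::cout << label << ": " << ms << " ms, " << s.size() << " bytes\n";
    return s;
}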
You can try this:
#include <fstream>
#include <sstream>
#include <string>

int main()
{
    std::ostringstream oss;
    std::string s;
    std::string filename = get_file_name();
    if (oss << std::ifstream(filename, std::ios::binary).rdbuf())
    {
        s = oss.str();
    }
    else
    {
        // error
    }
    // now s contains your file
}
You can also just use oss.str() directly if you like; just make sure you have some sort of error check somewhere.
No guarantee that it's the most efficient; you probably can't beat <cstdio> and fread. As @Benjamin pointed out, the string stream only exposes the data by copy, so you could instead read directly into the target string:
#include <string>
#include <cstdio>

std::FILE * fp = std::fopen("file.bin", "rb");
std::fseek(fp, 0L, SEEK_END);
unsigned int fsize = std::ftell(fp);
std::rewind(fp);

std::string s(fsize, 0);
if (fsize != std::fread(static_cast<void*>(&s[0]), 1, fsize, fp))
{
    // error
}
std::fclose(fp);
(You might like to use a RAII wrapper for the FILE*.)
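Such a wrapper can be as small as a std::unique_ptr with a custom deleter (a sketch):

#include <cstdio>
#include <memory>

// std::fclose runs automatically when the handle goes out of scope,
// including on early returns.
struct FileCloser {
    void operator()(std::FILE* f) const { if (f) std::fclose(f); }
};
using unique_file = std::unique_ptr<std::FILE, FileCloser>;

// usage: unique_file fp(std::fopen("file.bin", "rb"));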
Edit: The fstream-analogue of the second version goes like this:
#include <string>
#include <fstream>

std::ifstream infile("file.bin", std::ios::binary);
infile.seekg(0, std::ios::end);
unsigned int fsize = infile.tellg();
infile.seekg(0, std::ios::beg);

std::string s(fsize, 0);
if (!infile.read(&s[0], fsize))
{
    // error
}
Edit: Yet another version, using streambuf-iterators:
std::ifstream thefile(filename, std::ios::binary);
std::string s((std::istreambuf_iterator<char>(thefile)), std::istreambuf_iterator<char>());
(Mind the additional parentheses to get the correct parsing.)
Ironically, the example for string::reserve is reading a file into a string. You don't want to read the file into one buffer and then have to allocate/copy into another one.
Here's the example code:
// string::reserve
#include <iostream>
#include <fstream>
#include <string>
using namespace std;

int main ()
{
    string str;
    size_t filesize;
    ifstream file ("test.txt", ios::in|ios::ate);
    filesize = file.tellg();
    str.reserve(filesize); // allocate space in the string
    file.seekg(0);
    for (char c; file.get(c); )
    {
        str += c;
    }
    cout << str;
    return 0;
}
I don't know how efficient it is, but here is a simple (to read) way, by just setting the EOF character (0x1A) as the delimiter:
string buffer;
ifstream fin;
fin.open("filename.txt");
if (fin.is_open()) {
    getline(fin, buffer, '\x1A');
    fin.close();
}
The efficiency of this obviously depends on what's going on internally in the getline algorithm, so you could take a look at the code in the standard libraries to see how it works.