What is the fastest way to resize std::string? - c++

I'm working on a C++ project (using VS2008) where I need to load a very large XML file from disk into a std::wstring. Presently the following line allocates the memory before the data is loaded:
//std::wstring str;
//size_t ncbDataSz = file size in bytes
str.resize(ncbDataSz / sizeof(WCHAR));
But my current issue is that the resize method takes a rather long time for larger string sizes. (I just tested it with 3 GB of data, in an x64 build, on a desktop PC with 12 GB of free RAM, and it took about 4-5 seconds to complete.)
So I'm curious: is there a faster (more optimized) way to resize std::string? I'm asking about Windows only.

You can instantiate basic_string with a char_traits whose assign(count) overload does nothing:
#include <string>
struct noinit_char_traits : std::char_traits<char> {
    using std::char_traits<char>::assign;
    // the bulk-fill overload becomes a no-op, so resize() allocates without initializing
    static char_type* assign(char_type* p, std::size_t count, char_type a) { return p; }
};
using noinit_string = std::basic_string<char, noinit_char_traits>;
Note that it will also affect other fill operations, such as basic_string::assign(count, ch) and resize(count, ch).
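A minimal usage sketch, assuming the noinit_string definition above (the size is the 3 GB figure from the question):
int main() {
    noinit_string s;
    s.resize(3000000000ull); // allocates, but the no-op assign() skips the per-character fill
    // file data can now be read directly into &s[0]
}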

Instead of resizing your input string you could just allocate it using std::string::reserve, because resizing also initializes every element.
You could try something like this to see if it improves performance for you:
#include <cerrno>
#include <cstring>
#include <fstream>
#include <stdexcept>
#include <string>

std::wstring load_file(std::string const& filename)
{
    std::wifstream ifs(filename, std::ios::ate);

    // errno works on POSIX systems, not sure about Windows
    if(!ifs)
        throw std::runtime_error(std::strerror(errno));

    std::wstring s;
    s.reserve(ifs.tellg()); // allocate but don't initialize
    ifs.seekg(0);

    wchar_t buf[4096];
    for(;;)
    {
        ifs.read(buf, sizeof(buf) / sizeof(buf[0]));
        if(ifs.gcount() == 0)
            break;
        s.append(buf, buf + ifs.gcount()); // this will never reallocate
    }
    return s;
}

Related

C++ Fast way to load large txt file in vector<string>

I have a file with ~12,000,000 hex lines, about 1.6 GB.
Example of file:
999CBA166262923D53D3EFA72F5C4E8EE1E1FF1E7E33C42D0CE8B73604034580F2
889CBA166262923D53D3EFA72F5C4E8EE1E1FF1E7E33C42D0CE8B73604034580F2
Example of code:
vector<string> buffer;
ifstream fe1("strings.txt");
string line1;
while (getline(fe1, line1)) {
    buffer.push_back(line1);
}
Now the loading takes about 20 minutes. Any suggestions on how to speed it up? Thanks a lot in advance.
Loading a large text file into std::vector<std::string> is rather inefficient and wasteful because it allocates heap memory for each std::string and re-allocates the vector multiple times. Each of these heap allocations requires heap book-keeping information under the hood (normally 8 bytes per allocation on a 64-bit system), and each line requires an std::string object (8-32 bytes depending on the standard library), so that a file loaded this way takes a lot more space in RAM than on disk.
One fast way is to map the file into memory and implement iterators to walk over lines in it. This sidesteps the issues mentioned above.
Working example:
#include <boost/interprocess/file_mapping.hpp>
#include <boost/interprocess/mapped_region.hpp>
#include <boost/iterator/iterator_facade.hpp>
#include <boost/range/iterator_range_core.hpp>
#include <algorithm>
#include <iostream>

class LineIterator
    : public boost::iterator_facade<
          LineIterator,
          boost::iterator_range<char const*>,
          boost::iterators::forward_traversal_tag,
          boost::iterator_range<char const*>
      >
{
    char const *p_, *q_;

    boost::iterator_range<char const*> dereference() const { return {p_, this->next()}; }
    bool equal(LineIterator b) const { return p_ == b.p_; }
    void increment() { p_ = this->next(); }

    // find the start of the next line: one past the '\n', or the end of the buffer
    char const* next() const { auto p = std::find(p_, q_, '\n'); return p + (p != q_); }

    friend class boost::iterator_core_access;

public:
    LineIterator(char const* begin, char const* end) : p_(begin), q_(end) {}
};

inline boost::iterator_range<LineIterator> crange(boost::interprocess::mapped_region const& r) {
    auto p = static_cast<char const*>(r.get_address());
    auto q = p + r.get_size();
    return {LineIterator{p, q}, LineIterator{q, q}};
}

inline std::ostream& operator<<(std::ostream& s, boost::iterator_range<char const*> const& line) {
    return s.write(line.begin(), line.size());
}

int main() {
    boost::interprocess::file_mapping file("/usr/include/gnu-versions.h", boost::interprocess::read_only);
    boost::interprocess::mapped_region memory(file, boost::interprocess::read_only);

    unsigned n = 0;
    for(auto line : crange(memory))
        std::cout << n++ << ' ' << line;
}
You can read the entire file into memory. This can be done with C++ streams, or you may be able to get even more performance by using platform specific API's, such as memory mapped files or their own file reading API's.
Once you have this block of data, for performance you want to avoid any further copies and use it in place. In C++17 you have std::string_view, which is similar to std::string but uses existing string data, avoiding the copy. Otherwise you might just work with C-style char* strings, either by replacing each newline with a null terminator ('\0'), or by using a pair of pointers (begin/end) or a pointer and a size.
Here I used string_view; I also assumed newlines are always \n and that the file ends with a newline. You may need to adjust the loop if that is not the case. Guessing the size of the vector will also gain a little performance; you could perhaps derive it from the file length. I also skipped some error handling.
#include <fstream>
#include <memory>
#include <string_view>
#include <vector>

std::fstream is("data.txt", std::ios::in | std::ios::binary);
is.seekg(0, std::ios::end);
size_t data_size = is.tellg();
is.seekg(0, std::ios::beg);

std::unique_ptr<char[]> data(new char[data_size]);
is.read(data.get(), data_size);

std::vector<std::string_view> strings;
strings.reserve(data_size / 40); // if you have some idea of the count, avoid re-allocations as general practice with vector etc.

for (size_t i = 0, start = 0; i < data_size; ++i)
{
    if (data[i] == '\n') // end of line, got a string
    {
        strings.emplace_back(data.get() + start, i - start);
        start = i + 1;
    }
}
To get a little more performance, you might run the parsing loop in parallel with the file IO. This can be done with threads or with platform-specific async file IO. However, in this case the loop will be very fast, so there would not be much to gain.
You can simply allocate enough RAM and read the whole text file almost at once. Then you can access the data in RAM by memory pointer. I read a whole 4 GB text file in about 3 seconds.
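A minimal sketch of that approach with C stdio (the file name and lack of error handling are placeholders):
#include <cstdio>
#include <memory>

int main() {
    std::FILE* f = std::fopen("strings.txt", "rb");
    std::fseek(f, 0, SEEK_END);
    long size = std::ftell(f); // use _ftelli64/ftello for files over 2 GB
    std::fseek(f, 0, SEEK_SET);

    std::unique_ptr<char[]> data(new char[size]);
    std::fread(data.get(), 1, size, f); // one bulk read of the whole file
    std::fclose(f);
    // data.get() .. data.get() + size now holds the entire file in RAM
}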

Stack around the variable 'folderPath' was corrupted

Hi, I'm using Visual Studio and trying to make a program that replicates itself to a disk. When I run it, it does just that, but then I get the message:
"*Run-Time Check Failure #2 - Stack around the variable 'folderPath' was corrupted*."
the code is as follows:
void copyToDrive(char driveLetter) {
    char folderPath[10] = { driveLetter };
    strcat(folderPath, ":\\");
    strcat(folderPath, FILE_NAME);

    char filename[MAX_PATH];
    DWORD size = GetModuleFileNameA(NULL, filename, MAX_PATH);

    std::ifstream src(filename, std::ios::binary);
    std::ofstream dest(folderPath, std::ios::binary);
    dest << src.rdbuf();
    return;
}
What is causing it, and how can I fix it?
The string "app.exe" is seven characters long, so the full string you construct (drive letter, ":\\", and the file name) will be ten characters long.
Unfortunately you seem to have forgotten that char strings in C++ are really null-terminated byte strings, and that the null terminator also needs space.
Since there is no room for the null terminator (the character '\0'), the last strcat call writes out of bounds of your folderPath array, leading to undefined behavior (and the error you get).
The simple solution is to add one element to the array to make space for the terminator as well:
char folderPath[11];
A more robust solution is to use std::string instead, and not have to worry about the length at all.
And since you are working with paths I would suggest you use std::filesystem::path (or Boost filesystem path if you don't have C++17 available).
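A minimal sketch of that approach (assumes C++17 and the FILE_NAME macro from the question):
#include <filesystem>
#include <fstream>
#include <string>
#include <windows.h>

void copyToDrive(char driveLetter) {
    // build "X:\<FILE_NAME>" without any manual buffer arithmetic
    std::filesystem::path folderPath = std::string(1, driveLetter) + ":\\";
    folderPath /= FILE_NAME;

    char filename[MAX_PATH];
    GetModuleFileNameA(NULL, filename, MAX_PATH);

    std::ifstream src(filename, std::ios::binary);
    std::ofstream dest(folderPath, std::ios::binary);
    dest << src.rdbuf();
}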

Rewrite file with 0's. What am I doing wrong?

I want to overwrite a file with 0's, but it only writes a few bytes.
My code:
int fileSize = boost::filesystem::file_size(filePath);
int zeros[fileSize] = { 0 };
boost::filesystem::path rewriteFilePath{filePath};
boost::filesystem::ofstream rewriteFile{rewriteFilePath, std::ios::trunc};
rewriteFile << zeros;
Also... Is this enough to shred the file? What should I do next to make the file unrecoverable?
EDIT: OK, I rewrote my code to this. Is this the right way to do it?
int fileSize = boost::filesystem::file_size(filePath);
boost::filesystem::path rewriteFilePath{filePath};
boost::filesystem::ofstream rewriteFile{rewriteFilePath, std::ios::trunc};
for(int i = 0; i < fileSize; i++) {
    rewriteFile << 0;
}
There are several problems with your code.
int zeros[fileSize] = { 0 };
You are creating an array that is sizeof(int) * fileSize bytes in size. For what you are attempting, you need an array that is fileSize bytes in size instead. So you need to use a 1-byte data type, like (unsigned) char or uint8_t.
But, more importantly, since the value of fileSize is not known until runtime, this type of array is known as a "Variable Length Array" (VLA), which is a non-standard feature in C++. Use std::vector instead if you need a dynamically allocated array.
boost::filesystem::ofstream rewriteFile{rewriteFilePath, std::ios::trunc};
The trunc flag truncates the size of an existing file to 0. What that entails is to update the file's metadata to reset its tracked byte size, and to mark all of the file's used disk sectors as available for reuse. The actual file bytes stored in those sectors are not wiped out until overwritten as sectors get reused over time. But any bytes you subsequently write to the truncated file are not guaranteed to (and likely will not) overwrite the old bytes on disk. So, do not truncate the file at all.
rewriteFile << zeros;
ofstream does not have an operator<< that takes an int[], or even an int*, as input. But it does have an operator<< that takes a void* as input (to output the value of the memory address being pointed at). An array decays into a pointer to the first element, and void* accepts any pointer. This is why only a few bytes are being written. You need to use ofstream::write() instead to write the array to file, and be sure to open the file with the binary flag.
Try this instead:
int fileSize = boost::filesystem::file_size(filePath);
std::vector<char> zeros(fileSize, 0);
boost::filesystem::path rewriteFilePath(filePath);
// open for in-place overwrite (in | out); a plain ofstream would truncate the file
boost::filesystem::ofstream rewriteFile(rewriteFilePath, std::ios::binary | std::ios::in | std::ios::out);
rewriteFile.write(zeros.data() /* or &zeros[0] before C++11 */, fileSize);
That being said, you don't need a dynamically allocated array at all, let alone one that is allocated to the full size of the file. That is just a waste of heap memory, especially for large files. You can do this instead:
int fileSize = boost::filesystem::file_size(filePath);
const char zeros[1024] = {0}; // adjust size as desired...
boost::filesystem::path rewriteFilePath(filePath);
// again, open for in-place overwrite rather than truncating
boost::filesystem::ofstream rewriteFile(rewriteFilePath, std::ios::binary | std::ios::in | std::ios::out);

int loops = fileSize / sizeof(zeros);
for(int i = 0; i < loops; ++i) {
    rewriteFile.write(zeros, sizeof(zeros));
}
rewriteFile.write(zeros, fileSize % sizeof(zeros));
Alternatively, if you open a memory-mapped view of the file (MapViewOfFile() on Windows, mmap() on Linux, etc) then you can simply use std::copy() or std::memset() to zero out the bytes of the entire file directly on disk without using an array at all.
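For instance, a minimal sketch with a Boost.Iostreams mapping (the file name is a placeholder; the file must already exist and error handling is omitted):
#include <boost/iostreams/device/mapped_file.hpp>
#include <cstring>

int main() {
    boost::iostreams::mapped_file file("file.dat"); // opens a read/write mapping by default
    std::memset(file.data(), 0, file.size());       // zero every byte in place
    // the changes are flushed back to disk when the mapping is destroyed
}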
Also... Is this enough to shred the file?
Not really, no. At the physical hardware layer, overwriting the file just one time with zeros can still leave behind remnant signals in the disk sectors, which can be recovered with sufficient tools. You should overwrite the file multiple times, with varying types of random data, not just zeros. That will more thoroughly scramble the signals in the sectors.
I cannot stress strongly enough the point made in the comments: overwriting a file's contents does not guarantee that any of the original data is overwritten. ALL OTHER ANSWERS TO THIS QUESTION ARE THEREFORE IRRELEVANT ON ANY RECENT OPERATING SYSTEM.
Modern filing systems are extents based, meaning that files are stored as a linked list of allocated chunks. It may be faster for the filing system to write a whole new chunk and simply adjust the linked list than to update a chunk in place, so that's what they do. Indeed, copy-on-write filing systems always write a copy of any modified chunk and update their B-tree of currently valid extents.
Furthermore, even if your filing system doesn't do this, your hard drive may use the exact same technique for performance, and any SSD almost certainly does because of how flash memory works. So overwriting data to "erase" it is meaningless on modern systems; it can't be done. The only safe way to keep old data hidden is full disk encryption. With anything else you are deceiving yourself and your users.
Just for fun, overwriting with random data:
#include <boost/iostreams/device/mapped_file.hpp>
#include <algorithm>
#include <random>

namespace bio = boost::iostreams;

int main() {
    bio::mapped_file dst("main.cpp"); // read/write mapping
    std::mt19937 rng { std::random_device{}() };
    // note: uniform_int_distribution<char> is not allowed by the standard,
    // so draw ints and narrow to char
    std::uniform_int_distribution<int> dist(0, 255);
    std::generate_n(dst.data(), dst.size(), [&] { return static_cast<char>(dist(rng)); });
}
Note that it scrambles its own source file after compilation :)

How to use inplace const char* as std::string content

I am working on an embedded SW project. A lot of strings are stored inside flash memory, and I would like to use these strings (usually const char* or const wchar*) as a std::string's data. That means I want to avoid creating a copy of the original data because of memory restrictions.
An extended use might be to read the flash data directly out of the flash memory via a stringstream.
Example which unfortunately is not working in place:
const char* flash_adr = reinterpret_cast<const char*>(0x00300000); // a cast is needed for this to compile
size_t length = 3000;
std::string str(flash_adr, length);
Any ideas will be appreciated!
If you are willing to go with compiler and library specific implementations, here is an example that works in MSVC 2013.
#include <iostream>
#include <string>

int main() {
    // long enough to defeat the small-string optimization, so the heap
    // pointer _Bx._Ptr (an MSVC 2013 internal, not portable) is active
    std::string str("A std::string with a larger length than yours");
    char flash_adr[] = "Your source from the flash";

    char* reset_adr = str._Bx._Ptr; // keep the old address around

    // change the inner buffer
    str._Bx._Ptr = flash_adr;
    std::cout << str << std::endl;

    // reset the pointer or the program will crash
    str._Bx._Ptr = reset_adr;
    return 0;
}
It will print Your source from the flash.
The idea is to reserve a std::string capable of fitting the strings in your flash and keep on changing its inner buffer pointer.
You need to customize this for your compiler and as always, you need to be very very careful.
I have now used the string_span described in the C++ Core Guidelines (https://github.com/isocpp/CppCoreGuidelines/blob/master/CppCoreGuidelines.md). The GSL (Guidelines Support Library, https://github.com/Microsoft/GSL) provides a complete implementation.
If you know the address of your string inside flash memory, you can pass it directly to the following constructor to create a string_span:
constexpr basic_string_span(pointer ptr, size_type length) noexcept
    : span_(ptr, length)
{}
std::string_view would have done the same job, as Captain Obvlious (https://stackoverflow.com/users/845568/captain-obvlious) pointed out in my favourite comment on the question.
I am quite happy with the solution. It performs well and keeps the code readable.
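For comparison, a minimal std::string_view sketch (assumes C++17; the address and length are the ones from the question):
#include <string_view>

int main() {
    const char* flash_adr = reinterpret_cast<const char*>(0x00300000);
    std::string_view str(flash_adr, 3000); // refers to the flash bytes, no allocation, no copy
}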

Fastest way to write large STL vector to file using STL

I have a large vector (10^9 elements) of chars, and I was wondering what the fastest way is to write such a vector to a file. So far I've been using the following code:
vector<char> vs;
// ... Fill vector with data
ofstream outfile("nanocube.txt", ios::out | ios::binary);
ostream_iterator<char> oi(outfile, '\0');
copy(vs.begin(), vs.end(), oi);
For this code it takes approximately two minutes to write all data to file. The actual question is: "Can I make it faster using STL and how"?
With such a large amount of data to be written (~1GB), you should write to the output stream directly, rather than using an output iterator. Since the data in a vector is stored contiguously, this will work and should be much faster.
ofstream outfile("nanocube.txt", ios::out | ios::binary);
outfile.write(&vs[0], vs.size());
There is a slight conceptual error with your second argument to ostream_iterator's constructor. It should be a null pointer if you don't want a delimiter (although, luckily for you, '\0' is implicitly converted to one), or the second argument should be omitted entirely.
However, this means that after writing each character, the code needs to check for the pointer designating the delimiter (which might be somewhat inefficient).
I think, if you want to go with iterators, perhaps you could try ostreambuf_iterator.
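Something along these lines (a sketch; the vector contents are a stand-in):
#include <algorithm>
#include <fstream>
#include <iterator>
#include <vector>

int main() {
    std::vector<char> vs(1000000, 'x'); // stand-in for the real data
    std::ofstream outfile("nanocube.txt", std::ios::binary);
    // ostreambuf_iterator writes straight to the stream buffer, with no delimiter checks
    std::copy(vs.begin(), vs.end(), std::ostreambuf_iterator<char>(outfile));
}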
Other options might include using the write() method (if it can handle output this large, or perhaps output it in chunks), and perhaps OS-specific output functions.
Since your data is contiguous in memory (as Charles said), you can use low level I/O. On Unix or Linux, you can do your write to a file descriptor. On Windows XP, use file handles. (It's a little trickier on XP, but well documented in MSDN.)
XP is a little funny about buffering. If you write a 1GB block to a handle, it will be slower than if you break the write up into smaller transfer sizes (in a loop). I've found the 256KB writes are most efficient. Once you've written the loop, you can play around with this and see what's the fastest transfer size.
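A minimal sketch of that approach (Win32 file handles, 256 KB chunks; error handling omitted):
#include <algorithm>
#include <vector>
#include <windows.h>

void writeRaw(const std::vector<char>& vs)
{
    HANDLE h = CreateFileA("nanocube.txt", GENERIC_WRITE, 0, NULL,
                           CREATE_ALWAYS, FILE_ATTRIBUTE_NORMAL, NULL);
    const size_t chunk = 256 * 1024; // the transfer size found efficient above
    for(size_t pos = 0; pos < vs.size(); pos += chunk)
    {
        DWORD toWrite = static_cast<DWORD>(std::min(chunk, vs.size() - pos));
        DWORD written = 0;
        WriteFile(h, &vs[pos], toWrite, &written, NULL);
    }
    CloseHandle(h);
}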
OK, I wrote a method implementation with a for loop that writes a 256 KB block of data on each iteration (as Rob suggested), and the result is 16 seconds, so problem solved. This is my humble implementation, so feel free to comment:
void writeCubeToFile(const vector<char> &vs)
{
    const unsigned int blocksize = 262144;
    unsigned long blocks = distance(vs.begin(), vs.end()) / blocksize;
    ofstream outfile("nanocube.txt", ios::out | ios::binary);
    for(unsigned long i = 0; i <= blocks; i++)
    {
        unsigned long position = blocksize * i;
        if(blocksize > distance(vs.begin() + position, vs.end()))
            outfile.write(&*(vs.begin() + position), distance(vs.begin() + position, vs.end()));
        else
            outfile.write(&*(vs.begin() + position), blocksize);
    }
    outfile.write("\0", 1);
    outfile.close();
}
Thanks to all of you.
If you have some other structure, this method is still valid.
For example:
typedef std::pair<int,int> STL_Edge;
vector<STL_Edge> v;

void write_file(const char * path){
    ofstream outfile(path, ios::out | ios::binary);
    outfile.write((const char *)&v.front(), v.size()*sizeof(STL_Edge));
}

void read_file(const char * path, int reserveSpaceForEntries){
    ifstream infile(path, ios::in | ios::binary);
    v.resize(reserveSpaceForEntries);
    infile.read((char *)&v.front(), v.size()*sizeof(STL_Edge));
}
Instead of writing via the file i/o methods, you could try to create a memory-mapped file, and then copy the vector to the memory-mapped file using memcpy.
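A minimal sketch of that idea with Boost.Iostreams (the file name is a placeholder; error handling omitted):
#include <boost/iostreams/device/mapped_file.hpp>
#include <cstring>
#include <vector>

void writeMapped(const std::vector<char>& vs)
{
    boost::iostreams::mapped_file_params params;
    params.path = "nanocube.bin";
    params.new_file_size = vs.size(); // create the file at its final size up front
    boost::iostreams::mapped_file_sink out(params);
    std::memcpy(out.data(), vs.data(), vs.size()); // one bulk copy into the mapping
}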
Use the write method on it; the data is in RAM after all, and you have contiguous memory. Fastest, while keeping flexibility for later? Lose the built-in buffering, hint sequential I/O, lose the hidden machinery of iterators and utilities, avoid streambuf when you can, but do get your hands dirty with boost::asio.