C# coder just wrote this simple C++ method to get text from a file:
static std::vector<std::string> readTextFile(const std::string &filePath) {
std::string line;
std::vector<std::string> lines;
std::ifstream theFile(filePath.c_str());
while (theFile.good()) {
getline (theFile, line);
lines.push_back(line);
}
theFile.close();
return lines;
}
I know this code is not efficient; the text lines are copied once as they are read and a second time when returned-by-value.
Two questions:
(1) can this code leak memory ?
(2) more generally can returning containers of objects by value ever leak memory ? (assuming the objects themselves don't leak)
while (theFile.good()) {
getline (theFile, line);
lines.push_back(line);
}
Forget about efficiency, this code is not correct. It will not read the file correctly. See the following topic to know why:
What's preferred pattern for reading lines from a file in C++?
So the loop should be written as:
while (getline (theFile, line)) {
lines.push_back(line);
}
Now this is correct. If you want to make it efficient, profile your application first. Try see the part which takes most CPU cycles.
(1) can this code leak memory ?
No.
(2) more generally can returning containers of objects by value ever leak memory ?
Depends on the type of the object in the containers. In your case, the type of the object in std::vector is std::string which makes sure that no memory will be leaked.
No and no. Returning by value never leaks memory (assuming the containers and the contained objects are well written). It would be fairly useless if it was any other way.
And I second what Nawaz says, your while loop is wrong. It's frankly incredible how many times we see that, there must be an awful lot of bad advice out there.
(1) can this code leak memory ?
No
(2) more generally can returning containers of objects by value ever leak memory ?
No. You might leak memory that is stored in a container by pointer or through objects that leak. But that would not be caused by returning by value.
I know this code is not efficient; the text lines are copied once as they are read and a second time when returned-by-value.
Most probably not. There are two copies of the string, but not the ones that you are thinking about. The return copy will most likely be optimized in C++03, and will either be optimized away or transformed into a move (cheap) in C++11.
The two copes are rather:
getline (theFile, line);
lines.push_back(line);
The first line copies from the file to line, and the second copies from line to the container. If you are using a C++11 compiler you can change the second line to:
lines.push_back(std::move(line));
to move the contents of the string to the container. Alternatively (and also valid in C++03), you can change the two lines with:
lines.push_back(std::string()); // In most implementations this is *cheap*
// (i.e. no memory allocation)
getline(theFile, lines.back());
And you should test the result of the read (if the read fails, in the last alternative, make sure to resize to one less element to remove the last empty string.
In C++11, you can do:
std::vector<std::string>
read_text_file(const std::string& path)
{
std::string line;
std::vector<std::string> ans;
std::ifstream file(path.c_str());
while (std::getline(file, line))
ans.push_back(std::move(line));
return ans;
}
and no extra copies are made.
In C++03, you accept the extra copies, and painfully remove them only if profiling dictates so.
Note: you don't need to close the file manually, the destructor of std::ifstream does it for you.
Note2: You can template on the char type, this may be useful in some circumstances:
template <typename C, typename T>
std::vector<std::basic_string<C, T>>
read_text_file(const char* path)
{
std::basic_string<C, T> line;
std::vector<std::basic_string<C, T>> ans;
std::basic_ifstream<C, T> file(path);
// Rest as above
}
No, returning container by value should not leak memory. The standard library is designed not to leak memory itself in any case. It can only leak a memory if there is a bug in its implementation. At least there used to be a bug in vector of strings in an old MSVC.
Related
I have a large .txt file that I need to load and store in a vector. The file is around 5MB in size, 500 000 lines, and every line is around 10-20 characters, delimited with '\n'. I was doing some benchmarking on the time needed to read the whole file using a sample code below:
#include<iostream>
#include<vector>
#include<fstream>
int main()
{
std::fstream input("words.txt");
std::vector<std::string> vector;
std::string line;
while(input >> line){
vector.push_back(line);
}
}
I was curious if passing the string as an rvalue reference would prove faster but instead it was slower by about 10ms.
#include<iostream>
#include<vector>
#include<fstream>
int main()
{
std::fstream input("words.txt");
std::vector<std::string> vector;
std::string line;
while(input >> line){
vector.push_back(std::move(line));
}
}
The average load time for the first code example was about 58ms, and for the second one 68-70ms. I was thinking that moving is always faster or equal to copying, and that's why this doesn't seem right to me.
Does anyone know the reason to why this is happening?
The benchmarks were done using:
perf stats -r 100 ./a.out
on Arch Linux, and the code has been compiled using GCC 10.2, C++17 std.
Also if anyone knows a more optimal way to do this would be much appreciated.
If you invoke g++ -E you can ogle the relevant code:
Copy construction:
basic_string(const basic_string& __str)
: _M_dataplus(_M_local_data(),
_Alloc_traits::_S_select_on_copy(__str._M_get_allocator()))
{
_M_construct(__str._M_data(), __str._M_data() + __str.length());
}
Move construction:
# 552 "/usr/local/include/c++/10.2.0/bits/basic_string.h" 3
basic_string(basic_string&& __str) noexcept
: _M_dataplus(_M_local_data(), std::move(__str._M_get_allocator()))
{
if (__str._M_is_local())
{
traits_type::copy(_M_local_buf, __str._M_local_buf,
_S_local_capacity + 1);
}
else
{
_M_data(__str._M_data());
_M_capacity(__str._M_allocated_capacity);
}
_M_length(__str.length());
__str._M_data(__str._M_local_data());
__str._M_set_length(0);
}
Notably, (to support the Short-String-Optimisation) the move constructor needs to look at ._M_is_local() to work out whether to copy or move (so has a branch to predict), and it clears out the moved-from string / sets its length to 0. Extra work = extra time.
#Manuel made an interesting comment:
When you move line it gets empty, so the next iteration needs to allocate space. std::string has a small buffer as an optimization (most of the implementations, if not all) and the copy just copies chars, no memory allocation. That could be the difference.
That doesn't quite add up as stated, but there are subtle differences here. For input space-separated words long enough to need dynamic allocation, the:
a) move version may have cleared out line's dynamic buffer after the last word, and therefore need to reallocate; if the input is long enough it may have to reallocate one or more times to increase capacity.
b) copy version will likely have a large-enough buffer (its capacity will be increased as necessary, effectively being a high-water-mark for the words seen), but then a dynamic allocation will be needed when copy constructing inside push_back. The exact size of that allocation is known up front though - it won't need to be resized to grow capacity.
That does suggest copy could be faster when there's a lot of variation in the input word lengths.
Also if anyone knows a more optimal way to do this would be much appreciated.
If you really care about performance, I'd recommend benchmarking memory mapping the file and creating a vector of string_views into it: that's likely to be considerably faster.
Suppose we've read the content of a text file into a stringstream via
std::ifstream file(filepath);
std::stringstream ss;
ss << file.rdbuf();
and now want to process the file line-by-line. This can be done via
for (std::string line; std::getline(ss, line);)
{
}
However, having in mind that ss contains the whole content of the file in an internal buffer (and that we can obtain this content as a string via ss.str()), the code above is highly inefficient. For each line of the file, a memory allocation and a copy operation is performed.
Is it possible to come up with a solution that provides the lines in form of a std::string_view? (Feel free to use an other mechanism to load the whole file; I don't need it to be accessed via a stringstream.)
the code above is highly inefficient. For each line of the file, a memory allocation and a copy operation is performed.
Is it possible to come up with a solution that provides the lines in form of a std::string_view?
No, as far I know.
But you can use the getline() method in stringstream that receive a pointer to a char.
Something as (caution: code not tested)
constexpr std::size_t dim { 100U }; // choose you the dim
char buff[dim];
while ( ss.getline(buff, dim) )
{
// do something
}
The copy remain but, at least, this way you should avoid the -- for each line -- allocation.
I'm trying to store this string into a vector of pointers. This is just the starting point of the code. Eventually the vector will store words the user types in and either add it to the vector if the vector doesn't have it or show the word is already in the vector.
I tried to use strcpy() but then it told me to use strcpy_s(). So I did now and now it just crashes every time with no error. Can someone give me some insight into what is going on here.
vector<char*> wordBank;
string str;
cin >> str;
wordBank[0] = new char[str.length() + 1];
strcpy_s(wordBank[0], str.length() + 1 , str.c_str() );
cout << wordBank[0];
delete[] wordBank[0];
I would not consider vector<char*> wordBank; this c++ code, but rather C code that happens to use some C++ features.
The standard library in C++ can make your life easier. You should use std::vector<std::string> instead.
vector<string> wordBank;
wordBank.push_back(str);
It avoid all the pointer stuff, so you need not to do memory management (which is a good reason get rid of the pointers).
For strcpy_s, it's a safer version of strcpy, because you have to explicitly specify the size of the target buffer, which can avoid buffer overflows during copies.
However, strcpy_s is non-standard and MS specific, NEVER use strcpy_s unless your just want your code compiled on MSVS. Use std::copy instead.
The default size of a vector is 0
Hence this line
vector<char*> wordBank;
just defines a 0 sized vector of character pointers
As people have mentioned in comments you can go with either of these 2 options:-
vector<char*> wordBank(1);
OR
wordBank.push_back(...);
You don't need char* elements in the vector. You can use string instead, and append your strings to the vector with push_back(), which will allocate the needed space:
vector<string> wordBank;
string str;
cin >> str;
wordBank.push_back(str);
cout << wordBank[0];
This will free you from the burden of having to use delete every time you want to remove a string from the vector. Basically, you should never have to use delete on anything, and to accomplish that, you should avoid allocating memory with new. In this case, that means avoiding the use of new char[/*...*/], which in turn means you should use string to store strings.
vector<char*> wordBank; constructs an empty vector. As such, wordBank[0], which uses operator[], has undefined behavior since you're accessing out of bounds.
Returns a reference to the element at specified location pos. No bounds checking is performed.
You can use push_back to add a new element, as such:
wordBank.push_back(new char[str.length + 1]);
Of course the most sensible thing is to just use use a vector of strings
vector<string> wordBank;
wordBank.push_back(str);
You're trying to manually manage memory for your strings, while the std::string class was designed to do that for you.
Also, from what you are describing your use case to be, you may wish to check out std::map and/or std::set. Here's a tutorial.
I would like to know what's the performance overhead of
string line, word;
while (std::getline(cin, line))
{
istringstream istream(line);
while (istream >> word)
// parse word here
}
I think this is the standard c++ way to tokenize input.
To be specific:
Does each line copied three times, first via getline, then via istream constructor, last via operator>> for each word?
Would frequent construction & destruction of istream be an issue? What's the equivalent implementation if I define istream before the outer while loop?
Thanks!
Update:
An equivalent implementation
string line, word;
stringstream stream;
while (std::getline(cin, line))
{
stream.clear();
stream << line;
while (stream >> word)
// parse word here
}
uses a stream as a local stack, that pushes lines, and pops out words.
This would get rid of possible frequent constructor & destructor call in the previous version, and utilize stream internal buffering effect (Is this point correct?).
Alternative solutions, might be extends std::string to support operator<< and operator>>, or extends iostream to support sth. like locate_new_line. Just brainstorming here.
Unfortunately, iostreams is not for performance-intensive work. The problem is not copying things in memory (copying strings is fast), it's virtual function dispatches, potentially to the tune of several indirect function calls per character.
As for your question about copying, yes, as written everything gets copied when you initialize a new stringstream. (Characters also get copied from the stream to the output string by getline or >>, but that obviously can't be prevented.)
Using C++11's move facility, you can eliminate the extraneous copies:
string line, word;
while (std::getline(cin, line)) // initialize line
{ // move data from line into istream (so it's no longer in line):
istringstream istream( std::move( line ) );
while (istream >> word)
// parse word here
}
All that said, performance is only an issue if a measurement tool tells you it is. Iostreams is flexible and robust, and filebuf is basically fast enough, so you can prototype the code so it works and then optimize the bottlenecks without rewriting everything.
When you define a variable inside a block, it will be allocated on the stack. When you are leaving the block it will get popped from the stack. Using this code you have a lot of operation on the stack. This goes for 'word' too. You can use pointers and operate on pointers instead of variables. Pointers are stored on the stack too but where they are pointing to is a place inside the heap memory.
Such operations can have overhead for making the variables, pushing it on the stack and popping it from the stack again. But using pointers you allocate the space once and you work with the allocated space in the heap. As well pointers can be much smaller than real objects so their allocation will be faster.
As you see getLine() method accepts a reference(some kind of pointers) to line object which make it work with it without creating a string object again.
In your code , line and word variables are made once and their references are used. The only object you are making in each iteration is ss variable. If you want to not to make it in each iteration you can make it before loop and initialize it using its relates methods. You can search to find a suitable method to reassign it not using the constructor.
You can use this :
string line, word ;
istringstream ss ;
while (std::getline(cin, line))
{
ss.clear() ;
ss.str(line) ;
while (ss >> word) {
// parse word here
}
}
Also you can use this reference istringstream
EDIT : Thanks for comment #jrok. Yes, you should clear error flags before assigning new string. This is the reference for str() istringstream::str
Is it possible to use an std::string for read() ?
Example :
std::string data;
read(fd, data, 42);
Normaly, we have to use char* but is it possible to directly use a std::string ? (I prefer don't create a char* for store the result)
Thank's
Well, you'll need to create a char* somehow, since that's what the
function requires. (BTW: you are talking about the Posix function
read, aren't you, and not std::istream::read?) The problem isn't
the char*, it's what the char* points to (which I suspect is what
you actually meant).
The simplest and usual solution here would be to use a local array:
char buffer[43];
int len = read(fd, buffer, 42);
if ( len < 0 ) {
// read error...
} else if ( len == 0 ) {
// eof...
} else {
std::string data(buffer, len);
}
If you want to capture directly into an std::string, however, this is
possible (although not necessarily a good idea):
std::string data;
data.resize( 42 );
int len = read( fd, &data[0], data.size() );
// error handling as above...
data.resize( len ); // If no error...
This avoids the copy, but quite frankly... The copy is insignificant
compared to the time necessary for the actual read and for the
allocation of the memory in the string. This also has the (probably
negligible) disadvantage of the resulting string having an actual buffer
of 42 bytes (rounded up to whatever), rather than just the minimum
necessary for the characters actually read.
(And since people sometimes raise the issue, with regards to the
contiguity of the memory in std:;string: this was an issue ten or more
years ago. The original specifications for std::string were designed
expressedly to allow non-contiguous implementations, along the lines of
the then popular rope class. In practice, no implementor found this
to be useful, and people did start assuming contiguity. At which point,
the standards committee decided to align the standard with existing
practice, and require contiguity. So... no implementation has ever not
been contiguous, and no future implementation will forego contiguity,
given the requirements in C++11.)
No, you cannot and you should not. Usually, std::string implementations internally store other information such as the size of the allocated memory and the length of the actual string. C++ documentation explicitly states that modifying values returned by c_str() or data() results in undefined behaviour.
If the read function requires a char *, then no. You could use the address of the first element of a std::vector of char as long as it's been resized first. I don't think old (pre C++11) strings are guarenteed to have contiguous memory otherwise you could do something similar with the string.
No, but
std::string data;
cin >> data;
works just fine. If you really want the behaviour of read(2), then you need to allocate and manage your own buffer of chars.
Because read() is intended for raw data input, std::string is actually a bad choice, because std::string handles text. std::vector seems like the right choice to handle raw data.
Using std::getline from the strings library - see cplusplus.com - can read from an stream and write directly into a string object. Example (again ripped from cplusplus.com - 1st hit on google for getline):
int main () {
string str;
cout << "Please enter full name: ";
getline (cin,str);
cout << "Thank you, " << str << ".\n";
}
So will work when reading from stdin (cin) and from a file (ifstream).