C++ appending to vector of strings efficiently (and idiomatically) - c++

If I want to fill a vector of strings with lines from a file in C++, is it a good idea to use push_back with std::move?
{
std::ifstream file("E:\\Temp\\test.txt");
std::vector<std::string> strings;
// read
while (!file.eof())
{
std::string s;
std::getline(file, s);
strings.push_back(std::move(s));
}
// dump to cout
for (const auto &s : strings)
std::cout << s << std::endl;
}
Or is there some other variant where I would simply append a new string instance to the vector and get its reference?
E.g. I can do
std::vector<std::string> strings;
strings.push_back("");
std::string &s = strings.back();
But I feel like there must be a better way, e.g.
// this doesn't exist
std::vector<std::string> strings;
std::string &s = strings.create_and_push_back();
// s is now a reference to the last item in the vector,
// no copying needed

Except for the eof misuse, this is pretty much the idiomatic way to do it, yes.
Below is the correct code:
std::string s;
while(std::getline(file, s))
{
strings.push_back(std::move(s));
s.clear();
}
Note the explicit s.clear() call: the only guarantee you have for a moved-from std::string object is that you can call member functions that have no preconditions. The move is not guaranteed to leave the string empty, so clearing it resets it to a "fresh" state, and you don't have to rely on getline coping with whatever the move left behind.
There are some other ways to spell this out (you can probably achieve something similar with istream_iterator and a proper whitespace setting), but I do think this is the clearest.

Related

Extracting certain columns from a CSV file in C++

I would like to know how I can extract / skip certain columns such as age and weight from a CSV file in C++.
Does it make more sense to extract the desired information after I loaded the entire csv file (if memory is not a problem)?
EDIT: If possible, I would like to have a reading, printing and modification part.
If possible, I want to use only the STL. The content of my test csv file looks as follows:
*test.csv*
name;age;weight;height;test
Bla;32;1.2;4.3;True
Foo;43;2.2;5.3;False
Bar;None;3.8;2.4;True
Ufo;32;1.5;5.4;True
I load the test.csv file with the following C++ program that prints the file's content on the screen:
#include <iostream>
#include <vector>
#include <string>
#include <iomanip>
#include <fstream>
#include <sstream>
void readCSV(std::vector<std::vector<std::string> > &data, std::string filename);
void printCSV(const std::vector<std::vector<std::string>> &data);
int main(int argc, char** argv) {
std::string file_path = "./test.csv";
std::vector<std::vector<std::string> > data;
readCSV(data, file_path);
printCSV(data);
return 0;
}
void readCSV(std::vector<std::vector<std::string> > &data, std::string filename) {
char delimiter = ';';
std::string line;
std::string item;
std::ifstream file(filename);
while (std::getline(file, line)) {
std::vector<std::string> row;
std::stringstream string_stream(line);
while (std::getline(string_stream, item, delimiter)) {
row.push_back(item);
}
data.push_back(row);
}
file.close();
}
void printCSV(const std::vector<std::vector<std::string> > &data) {
for (std::vector<std::string> row: data) {
for (std::string item: row) {
std::cout << item << ' ';
}
std::cout << std::endl;
}
}
Any assistance you can provide would be greatly appreciated.
Basically I answered this question already in a similar thread. But anyway, I will show a ready to use solution with a different approach and some explanation here.
One hint: You should make yourself more familiar with object-oriented programming, and think over your design. Your read and write functions create an unnecessary dependency on a file or on std::cout. So you should not hand over a file name and then open the file in the function, but use streams instead. In the function that I created, which uses the C++ IO facilities, it doesn't matter whether we read from a file or a std::istringstream, or write to std::cout or a file stream.
All will be handled via the (overloaded) extractor and inserter operators.
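To make the stream-based interface concrete, here is a minimal sketch that keeps the question's getline-based splitting and only changes the signature to take a std::istream&. The name read_csv and the Table alias are assumptions for illustration; they are not part of the solution shown further down.

```cpp
#include <istream>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Hypothetical alias and function name, chosen for this sketch only.
using Table = std::vector<std::vector<std::string>>;

// Reads delimiter-separated rows from any input stream:
// a std::ifstream, a std::istringstream, or std::cin all work.
Table read_csv(std::istream& in, char delimiter = ';') {
    Table data;
    std::string line;
    while (std::getline(in, line)) {
        std::vector<std::string> row;
        std::istringstream fields(line);
        std::string item;
        while (std::getline(fields, item, delimiter)) {
            row.push_back(item);
        }
        data.push_back(std::move(row));
    }
    return data;
}
```

The caller decides where the data comes from, for example `std::ifstream file("./test.csv"); Table t = read_csv(file);`, and the same function can be exercised in a test with a std::istringstream.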
Because I wanted the code to be a little more flexible, I made my struct a template, so that the selected columns can be passed in and the same struct can be reused for other column combinations.
If you want to have fixed selected columns then you can delete the line with template and can replace std::vector<size_t> selectedFields{ {Colums...} }; with std::vector<size_t> selectedFields{ {1,2} };
Later we use a using for the template to allow easier handling and understanding:
// Define datatype for selected columns age and weight
using AgeAndWeight = SelectedColumns<1, 2>;
OK, let's first see the source code and then try to understand.
#include <iostream>
#include <string>
#include <vector>
#include <regex>
#include <fstream>
#include <initializer_list>
#include <iterator>
#include <algorithm>
std::regex re{ ";" };
// Proxy for reading and splitting a line, extracting certain fields, and some simple output
template<size_t ... Colums>
struct SelectedColumns {
std::vector<std::string> data{};
std::vector<size_t> selectedFields{ {Colums...} };
// Overload extractor operator
friend std::istream& operator >> (std::istream& is, SelectedColumns& sl) {
// Read a complete line and check, if it could be read
if (std::string line{}; std::getline(is, line)) {
// Now split the line into tokens
std::vector tokens(std::sregex_token_iterator(line.begin(), line.end(), re, -1), {});
// Clear old data
sl.data.clear();
// So, and now copy the selected columns into our data vector
for (const size_t& column : sl.selectedFields)
if (column < tokens.size()) sl.data.push_back(tokens[column]);
}
return is;
}
// Simple inserter
friend std::ostream& operator << (std::ostream & os, const SelectedColumns & sl) {
std::copy(sl.data.begin(), sl.data.end(), std::ostream_iterator<std::string>(os, "\t"));
return os;
}
};
// Define datatype for selected columns age and weight
using AgeAndWeight = SelectedColumns<1U, 2U>;
const std::string fileName{ "./test.csv" };
int main() {
// Open the csv file and check, if it is open
if (std::ifstream csvFileStream{ fileName }; csvFileStream) {
// Read complete csv file and extract age and weight columns
std::vector sc(std::istream_iterator<AgeAndWeight>(csvFileStream), {});
// Now all data is available in this vector sc. Do something with it.
sc[3].data[0] = "77";
// Show some debug output
std::copy(sc.begin(), sc.end(), std::ostream_iterator<AgeAndWeight>(std::cout, "\n"));
// By the way, you could also write the 2 lines above in one line.
//std::copy(std::istream_iterator<AgeAndWeight>(csvFileStream), {}, std::ostream_iterator<AgeAndWeight>(std::cout, "\n"));
}
else std::cerr << "\n*** Error: Could not open source file\n\n";
return 0;
}
One major task here is to split a line with CSV Data into its tokens. Let us have a look at this.
Splitting a string into tokens:
What do people expect from the function, when they read
getline ?
Most people would say, Hm, I guess it will read a complete line from somewhere. And guess what, that was the basic intention for this function. Read a line from a stream and put it into a string.
But, as you can see here std::getline has some additional functionality.
And this led to a major misuse of this function for splitting up std::strings into tokens.
Splitting strings into tokens is a very old task. Very early C already had the function strtok, which still exists, even in C++, as std::strtok. Please see the std::strtok example:
std::vector<std::string> data{};
for (char* token = std::strtok(const_cast<char *>(line.data()), ","); token != nullptr; token = std::strtok(nullptr, ","))
data.push_back(token);
Simple, right?
But because of the additional functionality of std::getline, it has been heavily misused for tokenizing strings. If you look at the top question/answer regarding how to parse a CSV file (please see here), you will see what I mean.
People use std::getline to read a text line, a string, from the original stream, then stuff it into a std::istringstream and use std::getline with a delimiter again to parse the string into tokens. Weird.
But for many years now we have had a dedicated function for tokenizing strings, explicitly designed for that purpose. It is the
std::sregex_token_iterator
And since we have such a dedicated function, we should simply use it.
This thing is an iterator for iterating over a string, hence the name starts with an s. The begin iterator is constructed from the input range to operate on, a std::regex for what should be matched (or not matched) in the input string, and a last parameter that selects the matching strategy. The end iterator is default constructed. The last parameter can be:
0 --> give me the text that matches the regex (this is the default, so the parameter is optional)
-1 --> give me the text that is NOT matched by the regex, i.e. the parts between the matches.
We can use this iterator for storing the tokens in a std::vector. The std::vector has a range constructor, which takes 2 iterators as parameter, and copies the data between the first iterator and 2nd iterator to the std::vector. The statement
std::vector tokens(std::sregex_token_iterator(s.begin(), s.end(), re, -1), {});
defines a variable “tokens” as a std::vector and uses the so called range-constructor of the std::vector. Please note: I am using C++17 and can define the std::vector without template argument. The compiler can deduce the argument from the given function parameters. This feature is called CTAD ("class template argument deduction").
Additionally, you can see that I do not use the "end()"-iterator explicitly.
This iterator will be constructed from the empty brace-enclosed default initializer with the correct type, because it will be deduced to be the same as the type of the first argument due to the std::vector constructor requiring that.
You can read any number of tokens in a line and put it into the std::vector
But you can do even more. You can validate your input. If you use 0 as last parameter, you define a std::regex that even validates your input. And you get only valid tokens.
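The difference between the two values of the last parameter can be sketched as follows. The function names and the number pattern are assumptions made for illustration; only std::sregex_token_iterator and its submatch argument come from the text above.

```cpp
#include <regex>
#include <string>
#include <vector>

// Submatch -1: return the text between delimiter matches (classic splitting).
std::vector<std::string> split_fields(const std::string& line) {
    static const std::regex delimiter{";"};
    return std::vector<std::string>(
        std::sregex_token_iterator(line.begin(), line.end(), delimiter, -1),
        std::sregex_token_iterator());
}

// Submatch 0: return only the text that matches the pattern itself,
// so fields that do not look like a number are skipped.
std::vector<std::string> extract_numbers(const std::string& line) {
    static const std::regex number{R"(\d+(\.\d+)?)"};
    return std::vector<std::string>(
        std::sregex_token_iterator(line.begin(), line.end(), number, 0),
        std::sregex_token_iterator());
}
```

For the line "Bla;32;1.2;4.3;True", split_fields yields all five fields, while extract_numbers yields only "32", "1.2" and "4.3".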
Overall, using the dedicated functionality is superior to the misused std::getline, and people should simply use it.
Some people complain about the function overhead, and they are right, but how many of them are processing big data? And even then, the approach would probably be to use std::string::find and std::string::substr, or std::string_view, or whatever.
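For completeness, a minimal sketch of the find/substr approach mentioned above, using std::string_view (C++17) so that no characters are copied. The function name split_view is an assumption for illustration.

```cpp
#include <cstddef>
#include <string_view>
#include <vector>

// Hypothetical helper: splits `line` at every `delimiter` without copying.
// The returned views point into `line`, so the underlying buffer must
// outlive the result.
std::vector<std::string_view> split_view(std::string_view line, char delimiter) {
    std::vector<std::string_view> tokens;
    std::size_t start = 0;
    while (true) {
        const std::size_t pos = line.find(delimiter, start);
        if (pos == std::string_view::npos) {
            tokens.push_back(line.substr(start)); // last (possibly empty) token
            break;
        }
        tokens.push_back(line.substr(start, pos - start));
        start = pos + 1;
    }
    return tokens;
}
```

Unlike the regex version, this keeps empty fields (two delimiters in a row produce an empty token), which is usually what you want for CSV data.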
So, now to further topics.
In the extractor, we first read a complete line from the source stream and check whether that worked, or whether we hit the end of the file or some other error.
Then we tokenize the string we just read, as described above.
And then we copy only the selected columns from the tokens into our resulting data. This is done in a simple for loop. Here we also check the boundaries, because somebody could specify invalid selected columns, or a line could have fewer tokens than expected.
So the body of the extractor is very simple. Just 5 lines of code . . .
Then, again: you should start using object-oriented features in C++. In C++ you can put data and the methods that operate on that data into one object, so that the outside world does not need to care about the object's internals. For example, your readCSV and printCSV functions should be part of a struct (or class).
As the next step, we will not use your “read” and “print” functions. We will use the dedicated functions for stream IO, the extractor operator >> and the inserter operator <<, and we will overload them for our struct.
In function main we open the source file and check whether the open was successful. By the way, all input and output operations should be checked for success.
Then, we use the next iterator, the std::istream_iterator. And this together with our “AgeAndWeight”-type and the input file stream. Also here we use CTAD and the default constructed end-iterator. The std::istream_iterator will repeatedly call the AgeAndWeight extractor operator, until all lines of the source file are read.
For output, we will use the std::ostream_iterator. This will call the inserter operator for "AgeAndWeight" until all data are written.

Most efficient way to read lines from text file to std::vector<string>

The common way to add lines extracted from a text file into a std::vector< std::string >, where every element of the vector corresponds to a line of the file, is something like these examples:
https://stackoverflow.com/a/8365024/7030542
std::string line;
std::vector<std::string> myLines;
while (std::getline(myfile, line))
{
myLines.push_back(line);
}
or also
https://stackoverflow.com/a/12506764/7030542
std::vector<std::string> lines;
for (std::string line; std::getline( ifs, line ); /**/ )
lines.push_back( line );
Is there a more efficient way to do that, for example one that avoids the auxiliary string?
Don't overthink it:
std::vector<std::string> lines;
std::string line;
while(std::getline( ifs, line ))
lines.push_back(std::move(line));
Note that the moved-from line is in a valid but unspecified state, so calling std::getline is fine, because it replaces the std::string's contents (whatever they may be) completely, eradicating anything that was left behind by the move.
@rubenvb's answer is great.
As an alternative:
bool get_line_into_vector( std::istream& is, std::vector<std::string>& v ) {
std::string tmp;
if (!std::getline(is, tmp))
return false;
v.push_back(std::move(tmp));
return true;
}
std::vector<std::string> lines;
while(get_line_into_vector( ifs, lines ))
{} // do nothing
This is rubenvb's solution with the temporary moved into a helper function.
We can avoid the small buffer optimization sized copies of characters with this:
bool get_line_into_vector( std::istream& is, std::vector<std::string>& v ) {
v.emplace_back();
if (std::getline(is, v.back()))
return true;
v.pop_back();
return false;
}
This can (in an edge case) cause an extra massive reallocation, but that is asymptotically rare.
Unlike @pschill's answer, here the invalid states are isolated within a helper function, and all the flow control is centered on keeping those invalid states from leaking.
The nice thing is that
std::vector<std::string> lines;
while(get_line_into_vector( ifs, lines ))
{} // do nothing
is how you use it; which of these two implementations you use is isolated to within the get_line_into_vector function. That lets you swap between them and determine which is better.
If you want to avoid temporary variables, you can use the last vector element as buffer:
std::vector<std::string> lines(1);
while (std::getline(ifs, lines.back()))
lines.emplace_back();
lines.erase(--lines.end()); // remove the buffer element

Vector push_back on duplicate strings with the help of delimiter

I am trying to read the PATH environment variable and remove any duplicates in it using vector functionality such as sort, erase and unique. But as far as I can see, the vector delimits each element by newline by default. When I get a path such as C:\Program Files(x86)\..., it breaks at C:\Program. This is my code so far:
char *path = getenv("PATH");
char str[10012] = "";
strcpy(str,path);
string strr(str);
vector<string> vec;
stringstream ss(strr);
string s;
while(ss >> s)
{
vec.push_back(s);
}
sort(vec.begin(),vec.end());
vec.erase(unique(vec.begin(),vec.end()),vec.end());
for(unsigned i=0;i<vec.size();i++)
{
cout<<vec[i]<<endl;
}
Is it a delimiter problem? I need to push_back at every ; and search for duplicates. Can anyone help me with this?
I would use a stringstream to chop it up, and then use a set to ensure there are no duplicates.
std::string p { std::getenv("PATH") };
std::set<std::string> set;
std::stringstream ss { p };
std::string s;
while(std::getline(ss, s, ':')) //this might need to be ';' for windows
{
set.insert(s);
}
for(const auto& elem : set)
std::cout << elem << std::endl;
Should you need to use a vector for some reason, you'd want to sort it with std::sort then remove duplicates with std::unique then erase the slack with erase.
std::sort(begin(vec), end(vec));
auto it=std::unique(begin(vec), end(vec));
vec.erase(it, end(vec));
EDIT: link to docs
http://en.cppreference.com/w/cpp/container/set
http://en.cppreference.com/w/cpp/algorithm/unique
http://en.cppreference.com/w/cpp/algorithm/sort
For this task it is better to use std::set<std::string> which will eliminate duplicates automatically. To read in PATH, use strtok to split it into substrings.
You need to use a different delimiter (':' or ';', depending on the system) to split the directories in PATH. For instance, you can have a look at the std::getline() function to replace your current while () / push_back loop. This function allows you to specify a custom delimiter and would be a drop-in replacement in your code.
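A minimal sketch of that replacement, combined with the sort/unique/erase steps from the question. The function name unique_path_entries is an assumption for illustration, and the delimiter is passed in because it differs between systems:

```cpp
#include <algorithm>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: splits `path` at `delimiter`, then sorts the entries
// and removes duplicates. Spaces inside an entry are preserved.
std::vector<std::string> unique_path_entries(const std::string& path, char delimiter) {
    std::vector<std::string> entries;
    std::istringstream ss(path);
    std::string entry;
    while (std::getline(ss, entry, delimiter)) {
        if (!entry.empty()) {          // skip empty components such as ";;"
            entries.push_back(entry);
        }
    }
    std::sort(entries.begin(), entries.end());
    entries.erase(std::unique(entries.begin(), entries.end()), entries.end());
    return entries;
}
```

Note that sorting changes the order of the directories, which matters if the result is meant to be written back to PATH rather than just inspected.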
It isn't so much that std::vector<T> is delimiting anything but that the formatted input operator (operator>>()) for strings uses whitespace as delimiters. Others have already posted about using std::getline() and the like. There are two other approaches:
Change what is considered to be whitespace for the stream! The std::string input operator uses the stream's std::locale object to obtain a std::ctype<char> facet which can be replaced. The std::ctype<char> facet has functions to do character classification and it can be used to consider, e.g., the character ';' as a space. It is a bit involved but a more solid approach than the next one.
I don't think path components can include newlines, i.e., a simple approach could be to replace all semicolons by newlines before reading the components:
std::string path(std::getenv("PATH"));
std::replace(path.begin(), path.end(), ';', '\n');
std::istringstream pin(path);
std::istream_iterator<std::string> pbegin(pin), pend;
std::vector<std::string> vec(pbegin, pend);
This approach may have the problem that PATH contains components with spaces: these would be split into individual objects. You might want to replace spaces with another character (e.g., the now unused ';') and restore them to spaces at an appropriate point.
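The first approach above (replacing the std::ctype<char> facet) could be sketched as follows. The names path_ctype and split_on_semicolons are assumptions for illustration. The table marks ';' as whitespace and, as an extra assumption to cope with directories such as "Program Files", no longer treats ' ' as whitespace.

```cpp
#include <cstddef>
#include <iterator>
#include <locale>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical facet: ';' separates tokens, ' ' is an ordinary character.
struct path_ctype : std::ctype<char> {
    static const mask* make_table() {
        // Start from a copy of the classic "C" locale classification table.
        static std::vector<mask> table(classic_table(), classic_table() + table_size);
        table[';'] |= space;                      // ';' now counts as whitespace
        table[' '] &= static_cast<mask>(~space);  // ' ' no longer does
        return table.data();
    }
    explicit path_ctype(std::size_t refs = 0)
        : std::ctype<char>(make_table(), false, refs) {}
};

std::vector<std::string> split_on_semicolons(const std::string& path) {
    std::istringstream in(path);
    // The locale takes ownership of the facet and deletes it when done.
    in.imbue(std::locale(in.getloc(), new path_ctype));
    return std::vector<std::string>(std::istream_iterator<std::string>(in),
                                    std::istream_iterator<std::string>());
}
```

Because operator>> skips runs of whitespace, consecutive semicolons collapse and never produce empty components.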

sequentially reading a text file in C++

In C++, I want to sequentially read words from a text file and store each word in an array. After that, I will perform some operations on this array. But I do not know how to handle the first phase: sequentially reading words from a text file and storing each word in an array.
I should skip punctuation marks, which include ".", ",", "?"
You need to use streams for this. Take a look at the examples here:
Input/Output with files
This sounds like homework. If it is, please be forthright.
First of all, it's almost always a bad idea in C++ to use a raw array -- using a vector is a much better idea. As for your question about punctuation -- that's up to your customer, but my inclination is to separate on whitespace.
Anyway, here's an easy way to do it that takes advantage of operator>>(istream&, string&) separating on whitespace by default.
ifstream infile("/path/to/file.txt");
vector<string> words;
copy(istream_iterator<string>(infile), istream_iterator<string>(), back_inserter(words));
Here's a complete program that reads words from a file named "filename", stores them in a std::vector and removes punctuation from the words.
#include <algorithm>
#include <fstream>
#include <iostream>
#include <iterator>
#include <string>
#include <vector>
struct is_punct {
bool operator()(char c) const {
static const std::string punct(",.:;!?");
return punct.find(c) != std::string::npos;
}
};
int main(int argc, char* argv[])
{
std::ifstream in("filename");
std::vector<std::string> vec((std::istream_iterator<std::string>(in)),
std::istream_iterator<std::string>());
std::transform(vec.begin(), vec.end(),
vec.begin(),
[](std::string s) {
s.erase(std::remove_if(s.begin(), s.end(), is_punct()),
s.end());
return s;
});
// manipulate vec
}
Do you know how many words you'll be reading? If not, you'll need to grow the array as you read more and more words. The easiest way to do that is to use a standard container that does it for you: std::vector. Reading words separated by whitespace is easy as it's the default behavior of std::ifstream::operator>>. Removing punctuation marks requires some extra work, so we'll get to that later.
The basic workflow for reading words from a file goes like this:
#include <fstream>
#include <string>
#include <vector>
int main()
{
std::vector<std::string> words;
std::string w;
std::ifstream file("words.txt"); // opens the file for reading
while (file >> w) // read one word from the file, stops at end-of-file
{
// do some work here to remove punctuation marks
words.push_back(w);
}
return 0;
}
Assuming you're doing homework here, the real key is learning how to remove the punctuation marks from w before adding it to the vector. I would look into the following concepts to help you:
The erase-remove idiom. Note that a std::string behaves like a container of char.
std::remove_if
The ispunct function in the cctype library
Feel free to post more questions if you run into trouble.
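If you get stuck, the three concepts listed above fit together roughly like this. It is only a sketch; the function name strip_punctuation is an assumption, and it removes every character that std::ispunct classifies as punctuation, which may be more than the three marks named in the question.

```cpp
#include <algorithm>
#include <cctype>
#include <string>

// Hypothetical helper: returns a copy of `word` without punctuation.
std::string strip_punctuation(std::string word) {
    word.erase(std::remove_if(word.begin(), word.end(),
                              // unsigned char avoids undefined behavior for
                              // negative char values passed to std::ispunct
                              [](unsigned char c) { return std::ispunct(c) != 0; }),
               word.end());
    return word;
}
```

In the loop above you would then write `words.push_back(strip_punctuation(w));`, possibly skipping words that end up empty.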
Yet another possibility is to use a special ctype facet (my usual approach):
class my_ctype : public std::ctype<char> {
public:
static mask const *get_table() {
// this copies the "classic" table used by <ctype.h>:
static std::vector<std::ctype<char>::mask>
table(classic_table(), classic_table()+table_size);
// Anything we want to separate tokens, we mark its spot in the table as 'space'.
table[','] = (mask)space;
table['.'] = (mask)space;
table['?'] = (mask)space;
// and return a pointer to the table:
return &table[0];
}
my_ctype(size_t refs=0) : std::ctype<char>(get_table(), false, refs) { }
};
Using that, reading the words is pretty easy:
int main(int argc, char **argv) {
std::ifstream infile(argv[1]); // open the file.
infile.imbue(std::locale(std::locale(), new my_ctype())); // use our classifier
// Create a vector containing the words from the file:
std::vector<std::string> words(
(std::istream_iterator<std::string>(infile)),
std::istream_iterator<std::string>());
// and now we're ready to process the words in the vector
// though it might be worth considering using `std::transform`, to take
// the input from the file and process it directly.
return 0;
}

using iterators, ifstream, ofstream the way it's meant to be done

I have a txt file containing a bunch of words, one per line.
I need to read this file and put each word in a list
then the user will be able to modify this list
when done editing, the program will write the modified list in a new file.
Since it's object-oriented C++, I'm gonna have two classes, one to read/write to file, and one to edit/mess with the list and the user.
with this approach in mind, here's my read function from the first class:
bool FileMgr::readToList(list<string> &l)
{
if(!input.is_open())
return false;
string line;
while(!input.eof())
{
getline(input, line);
l.push_back(line);
}
return true;
}
keep in mind: input is opened at constructor.
question: is there a less redundant way of getting that damn line from istream and pushing it back to l? (without the 'string' in between).
questions aside, this function seems to work properly.
now the output function:
bool FileMgr::writeFromList(list<string>::iterator begin, list<string>::iterator end)
{
ofstream output;
output.open("output.txt");
while(begin != end)
{
output << *begin << "\n";
begin++;
}
output.close();
return true;
}
this is a portion of my main:
FileMgr file;
list<string> words;
file.readToList(words);
cout << words.front() << words.back();
list<string>::iterator begin = words.begin();
list<string>::iterator end = words.end();
file.writeFromList(begin, end);
thanks for the help, both functions now work.
Now regarding style, is this a good way to implement these two functions?
also the getline(input,line) part I really don't like, anyone got a better idea?
As written, your input loop is incorrect. The eof flag is set after a read operation that reaches eof, so you could end up going through the loop one too many times. In addition, you fail to check the bad and fail flags. For more information on the flags, what they mean, and how to properly write an input loop, see the Stack Overflow C++ FAQ question Semantics of flags on basic_ios.
Your loop in readToList should look like this:
std::string line;
while (std::getline(input, line))
{
l.push_back(line);
}
For a more C++ way to do this, see Jerry Coffin's answer to How do I iterate over cin line-by-line in C++? His first solution is quite straightforward and should give you a good idea of how this sort of thing looks in idiomatic, STL-style C++.
For your writeFromList function, as Tomalak explains, you need to take two iterators. When using iterators in C++, you almost always have to use them in pairs: one pointing to the beginning of the range and one pointing to the end of the range. It is often preferable as well to use a template parameter for the iterator type so that you can pass different types of iterators to the function; this allows you to swap in different containers as needed.
You don't need to explicitly call output.close(): it is called automatically by the std::ofstream destructor.
You can use std::copy with std::ostream_iterator to turn the output loop into a single line:
template <typename ForwardIterator>
bool FileMgr::writeFromList(ForwardIterator first, ForwardIterator last)
{
std::ofstream output("output.txt");
// write one element per line, as the original loop did
std::copy(first, last, std::ostream_iterator<std::string>(output, "\n"));
return true;
}
writeFromList should take a start iterator and an end iterator. This range-based approach is what the iterators are designed for, and it's how the stdlib uses them.
So:
bool FileMgr::writeFromList(list<string>::iterator it, list<string>::iterator end)
{
ofstream output("output.txt");
for (; it != end; ++it)
output << *it << "\n";
return true;
}
and
file.writeFromList(words.begin(), words.end());
You'll notice I improved your stream usage a little, too.
Ideally writeFromList would be generic and take any iterator type. But that's future work.
Also note that your use of .eof() is wrong. Do this instead:
string line;
while (getline(input, line))
l.push_back(line);