Trouble implementing a line-by-line file parser in C++ - c++

I have trouble implementing a simple file parser in C++11 which reads a file line by line and tokenizes the line. It should properly manage its resources. Usage of the parser should be like:
Parser parser;
parser.open("/path/to/file");
std::pair<int, int> header = parser.getHeader();
while (parser.hasNext()) {
std::vector<int> tokens = parser.getNext();
}
parser.close();
So the Parser class needs one member std::ifstream file (or std::ifstream* file?)
1) How should the constructor initialize this->file?
2) How should the open method set this->file to the input file?
3) How should the next line from the file get loaded into a string?
(Is this what you would use: std::getline(this->file, line)) ?
Can you give some advice? Ideally, could you sketch out the class as a code example.

Since the Parser is probably in a pretty useless state once you've constructed it and before you've opened the file, I would suggest having your use case look something like this:
Parser parser("/path/to/file");
std::pair<int, int> header = parser.getHeader();
while (parser.hasNext()) {
std::vector<int> tokens = parser.getNext();
}
parser.close();
In which case, you should use the constructor's member initialization list to initialise the file member (which, yes, should be of type std::ifstream):
Parser::Parser(std::string file_name)
: file(file_name)
{
// ...
}
If you kept the constructor and open member function separate, you could just leave the constructor as default because the file member will be default constructed giving you a file stream that is not associated with any file. You would then get Parser::open to forward the file name to std::ifstream::open, like so:
void Parser::open(std::string file_name)
{
file.open(file_name);
}
Then, yes, to read lines from the file, you want to use something similar to this:
std::string line;
while (std::getline(file, line)) {
// Do something with line
}
Good job on not falling into the trap of doing while (!file.eof()), which only notices the end of the file after a read has already failed.
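To tie the pieces above together, here is a minimal sketch of what the whole class might look like. It assumes the header is two integers on the first line and that every other line is a list of whitespace-separated integers; neither is stated in the question, and returning std::pair<int, int> is likewise a guess. hasNext() works by reading one line ahead, and there is no close() because the std::ifstream member closes the file when the Parser is destroyed.

```cpp
#include <cassert>
#include <fstream>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Sketch only: assumes the first line holds two integers (the header) and
// every following line holds whitespace-separated integers.
class Parser {
public:
    explicit Parser(const std::string& file_name) : file(file_name.c_str()) {
        readAhead();  // load the header line, if there is one
    }

    // Call once, before the first getNext().
    std::pair<int, int> getHeader() {
        std::istringstream in(line);
        std::pair<int, int> header(0, 0);
        in >> header.first >> header.second;
        readAhead();  // move on to the first data line
        return header;
    }

    bool hasNext() const { return has_line; }

    std::vector<int> getNext() {
        assert(has_line && "getNext() called without a line available");
        std::istringstream in(line);
        std::vector<int> tokens;
        for (int value; in >> value; ) tokens.push_back(value);
        readAhead();
        return tokens;
    }

private:
    // Reads one line ahead so hasNext() can answer without touching the file.
    void readAhead() { has_line = static_cast<bool>(std::getline(file, line)); }

    std::ifstream file;     // closed automatically by its destructor
    std::string line;       // the line that the next call will consume
    bool has_line = false;
};
```

Usage follows the loop shown earlier: construct the Parser with the path, call getHeader() once, then loop on hasNext() and getNext().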

It can be designed in many ways.
You may ask the user to provide a stream instead of specifying a filename.
That is more generic and works with any input stream.
In that case you should have a std::istream& member variable. A pointer member works as well, but then you need to dereference it (e.g. *_stream >> value) to invoke any operator.
If you take a file name instead, you may construct the stream in your constructor and close it, if open, in the destructor.

Actually, there is an alternative to feeding the name of the file to Parser: you could feed it a std::istream. What's interesting in this is that this way any derived class of std::istream can be used, and thus you could feed it, for example, a std::istringstream, which makes it easier to write unit-tests.
class Parser {
public:
explicit Parser(std::istream& is);
/**/
private:
std::istream& _stream;
/**/
};
Next comes iteration. It is not idiomatic in C++ to have a has followed by a get. std::istream supports iteration (with an input iterator), and you could perfectly well design your parser so that it does too. This way you get compatibility with many STL algorithms.
class ParserIterator:
public std::iterator< std::input_iterator_tag, std::vector<int> >
{
public:
ParserIterator(): _stream(nullptr) {} // end
ParserIterator(std::istream& is): _stream(&is) { this->advance(); }
// Accessors
std::vector<int> const& operator*() const { return _vec; }
std::vector<int> const* operator->() const { return &_vec; }
bool equals(ParserIterator const& other) const {
if (_stream != other._stream) { return false; }
if (_stream == nullptr) { return true; }
return false;
}
// Modifiers
ParserIterator& operator++() { this->advance(); return *this; }
ParserIterator operator++(int) {
ParserIterator tmp(*this);
this->advance();
return tmp;
}
private:
void advance() {
assert(_stream && "cannot advance an end iterator");
_vec.clear();
std::string buffer;
if (not getline(*_stream, buffer)) {
_stream = nullptr; // end of story
return; // nothing to parse once the stream is exhausted
}
// parse here
}
std::istream* _stream;
std::vector<int> _vec;
}; // class ParserIterator
inline bool operator==(ParserIterator const& left, ParserIterator const& right) {
return left.equals(right);
}
inline bool operator!=(ParserIterator const& left, ParserIterator const& right) {
return not left.equals(right);
}
And with that we can augment our parser:
ParserIterator Parser::begin() const {
return ParserIterator(_stream);
}
ParserIterator Parser::end() const {
return ParserIterator();
}
I'll leave the getHeader method and the actual parsing content to you ;)
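As an illustration of the STL compatibility mentioned above, here is a compact, self-contained variant with the parsing step filled in. The names IntLineIterator and parseLine are made up for this sketch, and it assumes each line holds whitespace-separated integers. It spells out the iterator typedefs by hand instead of inheriting from std::iterator, which is deprecated since C++17.

```cpp
#include <cassert>
#include <cstddef>
#include <istream>
#include <iterator>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical stand-in for the "parse here" step: split one line into ints.
inline std::vector<int> parseLine(const std::string& line) {
    std::istringstream in(line);
    std::vector<int> tokens;
    for (int value; in >> value; ) tokens.push_back(value);
    return tokens;
}

// Compact input iterator over lines of integers, in the spirit of ParserIterator.
class IntLineIterator {
public:
    typedef std::input_iterator_tag iterator_category;
    typedef std::vector<int> value_type;
    typedef std::ptrdiff_t difference_type;
    typedef const std::vector<int>* pointer;
    typedef const std::vector<int>& reference;

    IntLineIterator() : _stream(nullptr) {}  // end iterator
    explicit IntLineIterator(std::istream& is) : _stream(&is) { advance(); }

    reference operator*() const { return _vec; }
    pointer operator->() const { return &_vec; }
    IntLineIterator& operator++() { advance(); return *this; }
    IntLineIterator operator++(int) {
        IntLineIterator tmp(*this);
        advance();
        return tmp;
    }

    friend bool operator==(const IntLineIterator& a, const IntLineIterator& b) {
        return a._stream == b._stream;
    }
    friend bool operator!=(const IntLineIterator& a, const IntLineIterator& b) {
        return !(a == b);
    }

private:
    void advance() {
        assert(_stream && "cannot advance an end iterator");
        std::string buffer;
        if (!std::getline(*_stream, buffer)) {
            _stream = nullptr;  // end of input: become the end iterator
            _vec.clear();
            return;
        }
        _vec = parseLine(buffer);
    }

    std::istream* _stream;  // not owned; null marks the end iterator
    std::vector<int> _vec;  // tokens of the current line
};
```

With that in place, a pair of iterators can be handed straight to a container or algorithm: declare IntLineIterator first(stream), last; and then construct std::vector<std::vector<int> > rows(first, last); to read every line.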

Related

C++ custom lazy iterator

I have a somewhat simple text file parser. The text I parse is split into blocks denoted by { block data }.
My parser has a string read() function, which gets tokens back, such that in the example above the first token is { followed by block followed by data followed by }.
To make things less repetitive, I want to write a generator-like iterator that will allow me to write something similar to this JavaScript code:
* readBlock() {
this.read(); // {
let token = this.read();
while (token !== '}') {
yield token;
token = this.read();
}
}
which in turn allows me to use simple for-of syntax:
for (let token of parser.readBlock()) {
// block
// data
}
For C++ I would like something similar:
for (string token : reader.read_block())
{
// block
// data
}
I googled around to see if this can be done with an iterator, but I couldn't figure if I can have a lazy iterator like this which has no defined beginning or end. That is, its beginning is the current position of the reader (an integer offset into a vector of characters), and its end is when the token } is found.
I don't need to construct arbitrary iterators, or to iterate in reverse, or to see if two iterators are equal, since it's purely to make linear iteration less repetitive.
Currently every time I want to read a block, I need to re-write the following:
stream.skip(); // {
while ((token = stream.read()) != "}")
{
// block
// data
}
This becomes very messy, especially when I have blocks inside blocks. To support blocks inside blocks, the iterators would have to all reference the same reader's offset, such that an inner block will advance the offset, and the outer block will re-start iterating (after the inner is finished) from that advanced offset.
Is this possible to achieve in C++?
In order to be usable in a for-range loop, a class has to have member functions begin() and end() which return iterators.
What is an iterator? Any object fulfilling a set of requirements. There are several kinds of iterators, depending on which operations they allow. I suggest implementing an input iterator, which is the simplest: https://en.cppreference.com/w/cpp/named_req/InputIterator
class Stream
{
public:
std::string read() { /**/ }
bool valid() const { /* return true while more tokens are available */ }
};
class FileParser
{
std::string current_;
Stream* stream_;
public:
class iterator
{
FileParser* obj_;
public:
using value_type = std::string;
using reference = const std::string&;
using pointer = const std::string*;
using iterator_category = std::input_iterator_tag;
iterator(FileParser* obj=nullptr): obj_ {obj} {}
reference operator*() const { return obj_->current_; }
iterator& operator++() { increment(); return *this; }
iterator operator++(int) { iterator old = *this; increment(); return old; }
bool operator==(iterator rhs) const { return obj_ == rhs.obj_; }
bool operator!=(iterator rhs) const { return !(rhs==*this); }
protected:
void increment()
{
obj_->next();
if (!obj_->valid())
obj_ = nullptr;
}
};
FileParser(Stream& stream): stream_ {&stream} {};
iterator begin() { return iterator{this}; }
iterator end() { return iterator{}; }
void next() { current_ = stream_->read(); }
bool valid() const { return stream_->valid(); }
};
So your end-of-file iterator is represented by an iterator pointing to no object.
Then you can use it like this:
int main()
{
Stream s; // Initialize it as needed
FileParser parser {s};
for (const std::string& token: parser)
{
std::cout << token << std::endl;
}
}
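To get closer to the for (string token : reader.read_block()) syntax from the question, including nested blocks, something along these lines may work. This is only a sketch: Reader here is a hypothetical stand-in that hands out tokens from a vector, and Block and BlockIterator are invented names. Every iterator pulls from the same Reader, so an inner loop advances the shared offset and the outer loop resumes after the inner block's closing brace. The iterator implements only what a range-based for loop needs.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical token reader: hands out pre-split tokens one at a time.
class Reader {
public:
    explicit Reader(std::vector<std::string> tokens)
        : tokens_(std::move(tokens)), pos_(0) {}

    std::string read() {
        assert(pos_ < tokens_.size() && "read past the end of input");
        return tokens_[pos_++];
    }

    // Lazy iterator: each increment pulls the next token from the shared reader.
    class BlockIterator {
    public:
        BlockIterator() : reader_(nullptr) {}  // end of block
        explicit BlockIterator(Reader& r) : reader_(&r) { advance(); }
        const std::string& operator*() const { return token_; }
        BlockIterator& operator++() { advance(); return *this; }
        bool operator!=(const BlockIterator& other) const {
            return reader_ != other.reader_;
        }
    private:
        void advance() {
            assert(reader_ && "cannot advance past the end of a block");
            token_ = reader_->read();
            if (token_ == "}") reader_ = nullptr;  // closing brace ends this block
        }
        Reader* reader_;
        std::string token_;
    };

    // Range object returned by read_block(); begin() consumes the opening brace.
    class Block {
    public:
        explicit Block(Reader& r) : reader_(r) {}
        BlockIterator begin() {
            reader_.read();  // skip "{"
            return BlockIterator(reader_);
        }
        BlockIterator end() { return BlockIterator(); }
    private:
        Reader& reader_;
    };

    Block read_block() { return Block(*this); }

private:
    std::vector<std::string> tokens_;
    std::size_t pos_;  // shared offset, advanced by every block iterator
};
```

A nested block is then just another for (const std::string& token : reader.read_block()) loop started from inside the outer loop body, at the point where the outer loop has read the token that precedes the inner opening brace.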

Can I get the name of file used from ifstream/ofstream?

I need to know if there exists a method in ifstream so I can get the name of the file tied to it.
For instance
void some_function(ifstream& fin) {
// here I need get name of file
}
Is there a method in ifstream/ofstream that allows to get that?
As mentioned, there's no such method provided by std::fstream and its relatives. std::basic_filebuf doesn't provide such a feature either.
For simplicity, I'm using std::fstream instead of std::ifstream/std::ofstream in the following code samples.
I would recommend managing the underlying file name yourself in a little helper class:
class MyFstream {
public:
MyFstream(const std::string& filename)
: filename_(filename), fs_(filename) {
}
std::fstream& fs() { return fs_; }
const std::string& filename() const { return filename_; }
private:
std::string filename_;
std::fstream fs_;
};
void some_function(MyFstream& fin) {
// here I need get name of file
std::string filename = fin.filename();
}
int main() {
MyFstream fs("MyTextFile.txt");
some_function(fs);
}
Another alternative, if you can't pass a different class to some_function() as shown above, may be to use an associative map from fstream* pointers to their associated filenames:
class FileMgr {
public:
std::unique_ptr<std::fstream> createFstream(const std::string& filename) {
std::unique_ptr<std::fstream> newStream(new std::fstream(filename));
fstreamToFilenameMap[newStream.get()] = filename;
return newStream;
}
std::string getFilename(std::fstream* fs) const {
FstreamToFilenameMap::const_iterator found =
fstreamToFilenameMap.find(fs);
if(found != fstreamToFilenameMap.end()) {
return (*found).second;
}
return "";
}
private:
typedef std::map<std::fstream*,std::string> FstreamToFilenameMap;
FstreamToFilenameMap fstreamToFilenameMap;
};
FileMgr fileMgr; // Global instance or singleton
void some_function(std::fstream& fin) {
std::string filename = fileMgr.getFilename(&fin);
}
int main() {
std::unique_ptr<std::fstream> fs = fileMgr.createFstream("MyFile.txt");
some_function(*(fs.get()));
}
No. C++ streams do not save the name or the path of the file.
But since you need some string to initialize the stream anyway, you can just save it for future use.
No, such a method does not exist.

Trying to implement an iterator without an explicit container

After my recent question, I am trying to implement my own contrived example.
I have a basic structure in place but even after reading this, which is probably the best tutorial I've seen, I'm still very confused. I think I should probably convert the Chapter._text into a stream and for the increment operator do something like
string p = "";
string line;
while ( getline(stream, line) ) {
p += line;
}
return *p;
but I'm not sure which of the "boilerplate" typedefs to use and how to put all these things together. Thanks much for your help
#include <stdio.h>
#include <stdlib.h>
#include <vector>
#include <iostream>
#include <fstream>
#include <string>
using namespace std;
class Paragraph {
public:
string _text;
Paragraph (string text) {
_text = text;
}
};
class Chapter {
public:
string _text;
/* // I guess here I should do something like:
class Iterator : public iterator<input_iterator_tag, Paragraph> {
}
// OR should I be somehow delegating from istream_iterator ? */
Chapter (string txt_file) {
string line;
ifstream infile(txt_file.c_str());
if (!infile.is_open()) {
cout << "Error opening file " << txt_file << endl;
exit(0);
}
while ( getline(infile, line) ) {
_text += (line + "\n");
}
}
};
int main(int argc, char** argv) {
Chapter c(argv[1]);
// want to do something like:
// for (<Paragraph>::iterator pIt = c.begin(); pIt != c.end(); pIt++) {
// Paragraph p(*pIt);
// // Do something interesting with p
// }
return 0;
}
As you weren't planning on loading a whole chapter at a time (merely a paragraph), and your Paragraph class is essentially empty, I think this might be best done with a single paragraph_iterator class:
#include <fstream>
#include <iterator>
#include <memory>
#include <stdexcept>
#include <string>
class paragraph_iterator :
public std::iterator<std::input_iterator_tag, std::string>
{
std::shared_ptr<std::ifstream> _infile; //current file, null for end()
std::string _text; //current paragraph
paragraph_iterator(const paragraph_iterator &b); //not defined, so no copy
paragraph_iterator &operator=(const paragraph_iterator &b); //not defined, so no copy
// don't allow copies, because streams aren't copiable.
// make sure to always pass by reference
// unfortunately, this means no stl algorithms either
void read() {
if (!std::getline(*_infile, _text))
_infile.reset(); //nothing left to read, so become end()
}
public:
paragraph_iterator(const std::string &txt_file)
: _infile(std::make_shared<std::ifstream>(txt_file.c_str())) {
if (!_infile->is_open())
throw std::runtime_error("Error opening file " + txt_file);
read();
}
paragraph_iterator() {}
paragraph_iterator &operator++() {
if (_infile)
read();
return *this;
}
// normally you'd want operator++(int) as well, but that requires making a copy
// and std::ifstream isn't copiable.
bool operator==(const paragraph_iterator &b) const {
if (!_infile && !b._infile)
return true; //all end() and uninitialized iterators equal
// so we can use paragraph_iterator() as end()
return false; //since they all are separate strings, never equal
}
bool operator!=(const paragraph_iterator &b) const {return !operator==(b);}
const std::string &operator*() const { return _text;}
};
int main() {
paragraph_iterator iter("myfile.txt");
while(iter != paragraph_iterator()) {
// dostuff with *iter
++iter; // advance, otherwise the loop never ends
}
}
The stream is encapsulated in the iterator, so that if we have two iterators to the same file, both will get every line. If you have a separate Chapter class with two iterators, you may run into "threading" problems. This is pretty bare code, and completely untested. I'm sure there's a way to do it with copyable iterators, but it is far trickier.
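For what it's worth, one way to get a copyable iterator is to let it borrow the stream through a plain pointer instead of owning it. The following is a rough sketch under assumptions of my own: ParagraphIterator and countParagraphs are invented names, a paragraph is taken to be a run of non-empty lines separated by blank lines, and the caller must keep the stream alive. Like any input iterator, copies share the underlying stream, so only a single pass is meaningful.

```cpp
#include <cassert>
#include <cstddef>
#include <istream>
#include <iterator>
#include <sstream>
#include <string>

// Sketch: copyable input iterator that yields one paragraph at a time.
// Assumption: paragraphs are runs of non-empty lines separated by blank lines.
class ParagraphIterator {
public:
    typedef std::input_iterator_tag iterator_category;
    typedef std::string value_type;
    typedef std::ptrdiff_t difference_type;
    typedef const std::string* pointer;
    typedef const std::string& reference;

    ParagraphIterator() : _in(nullptr) {}  // end iterator
    explicit ParagraphIterator(std::istream& in) : _in(&in) { readParagraph(); }

    reference operator*() const { return _text; }
    pointer operator->() const { return &_text; }
    ParagraphIterator& operator++() { readParagraph(); return *this; }
    ParagraphIterator operator++(int) {
        ParagraphIterator old(*this);
        readParagraph();
        return old;
    }

    bool operator==(const ParagraphIterator& other) const { return _in == other._in; }
    bool operator!=(const ParagraphIterator& other) const { return !(*this == other); }

private:
    void readParagraph() {
        assert(_in && "cannot advance an end iterator");
        _text.clear();
        std::string line;
        while (std::getline(*_in, line)) {
            if (line.empty()) {
                if (_text.empty()) continue;  // skip leading blank lines
                return;                       // a blank line closes the paragraph
            }
            if (!_text.empty()) _text += '\n';
            _text += line;
        }
        if (_text.empty()) _in = nullptr;     // nothing left: become end()
    }

    std::istream* _in;  // not owned; the stream must outlive the iterator
    std::string _text;  // current paragraph
};

// Example use with a standard algorithm: count the paragraphs in a piece of text.
inline std::size_t countParagraphs(const std::string& text) {
    std::istringstream in(text);
    return static_cast<std::size_t>(
        std::distance(ParagraphIterator(in), ParagraphIterator()));
}
```

Because the iterator can be copied, operator++(int) and the standard algorithms become available, at the price of the caller owning the stream.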
In general, an iterator class implementation is closely tied with the data structure it iterates over. Otherwise, we'd just have a few generic iterator classes.

Avoiding Error Flags when Reading Files

This is how I usually read files with std::ifstream:
while (InFile.peek() != EOF)
{
char Character = InFile.get();
// Do stuff with Character...
}
This avoids the need of an if statement inside the loop. However, it seems that even peek() causes eofbit to be set, which makes calling clear() necessary if I plan on using that same stream later.
Is there a cleaner way to do this?
Typically, you would just use
char x;
while(file >> x) {
// do something with x
}
// now clear file if you want
If you forget to clear(), then use an RAII scope-based class.
Edit: Given a little more information, I'd just say
class FileReader {
std::stringstream str;
public:
FileReader(std::string filename) {
std::ifstream file(filename);
file >> str.rdbuf();
}
std::stringstream Contents() const {
return std::stringstream(str.str()); // streams can't be copied, so build a fresh one from the text
}
};
Now you can just get a copy and not have to clear() the stream every time. Or you could have a self-clearing reference.
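template<typename T> class SelfClearingReference {
T* t;
public:
SelfClearingReference(T& tref)
: t(&tref) {}
~SelfClearingReference() {
t->clear(); // reset the error flags when the reference goes out of scope
}
template<typename Operand> SelfClearingReference& operator>>(Operand& op) {
*t >> op;
return *this;
}
explicit operator bool() const { return !t->fail(); } // allows while (ref >> x)
};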
I'm not sure I understand. InFile.peek() only sets eofbit
when it returns EOF. And if it returns EOF, any later read
is bound to fail; the fact that it sets eofbit is an
optimization, more than anything else.
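If the goal is just to read character by character without an if inside the loop, one option is the get(char&) overload, whose result can be tested directly in the loop condition. The sketch below is only an illustration: countChars is a made-up helper, and it assumes the loop ends because of end of input and not an I/O error. After the loop both eofbit and failbit are set, so a single clear() makes the stream usable again.

```cpp
#include <cassert>
#include <cstddef>
#include <istream>
#include <sstream>

// Counts occurrences of `wanted` by reading the stream one character at a time.
inline std::size_t countChars(std::istream& in, char wanted) {
    std::size_t count = 0;
    char c;
    while (in.get(c)) {  // stops as soon as get() fails, i.e. at end of input
        if (c == wanted) ++count;
    }
    assert(in.eof());    // assumption: the loop ended at EOF, not on an I/O error
    in.clear();          // reset eofbit/failbit so the stream can be reused
    return count;
}
```

For a seekable stream such as std::istringstream or std::ifstream, calling seekg(0) after the function returns rewinds it for another pass.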

How do I iterate over cin line by line in C++?

I want to iterate over std::cin, line by line, addressing each line as a std::string. Which is better:
string line;
while (getline(cin, line))
{
// process line
}
or
for (string line; getline(cin, line); )
{
// process line
}
? What is the normal way to do this?
Since UncleBen brought up his LineInputIterator, I thought I'd add a couple more alternative methods. First up, a really simple class that acts as a string proxy:
class line {
std::string data;
public:
friend std::istream &operator>>(std::istream &is, line &l) {
std::getline(is, l.data);
return is;
}
operator std::string() const { return data; }
};
With this, you'd still read using a normal istream_iterator. For example, to read all the lines in a file into a vector of strings, you could use something like:
std::vector<std::string> lines;
std::copy(std::istream_iterator<line>(std::cin),
std::istream_iterator<line>(),
std::back_inserter(lines));
The crucial point is that when you're reading something, you specify a line -- but otherwise, you just have strings.
Another possibility uses a part of the standard library most people barely even know exists, let alone find much real use for. When you read a string using operator>>, the stream returns a string of characters up to whatever that stream's locale says is a white space character. Especially if you're doing a lot of work that's all line-oriented, it can be convenient to create a locale with a ctype facet that only classifies new-line as white-space:
struct line_reader: std::ctype<char> {
line_reader(): std::ctype<char>(get_table()) {}
static std::ctype_base::mask const* get_table() {
static std::vector<std::ctype_base::mask>
rc(table_size, std::ctype_base::mask());
rc['\n'] = std::ctype_base::space;
return &rc[0];
}
};
To use this, you imbue the stream you're going to read from with a locale using that facet, then just read strings normally, and operator>> for a string always reads a whole line. For example, if we wanted to read in lines, and write out unique lines in sorted order, we could use code like this:
int main() {
std::set<std::string> lines;
// Tell the stream to use our facet, so only '\n' is treated as a space.
std::cin.imbue(std::locale(std::locale(), new line_reader()));
std::copy(std::istream_iterator<std::string>(std::cin),
std::istream_iterator<std::string>(),
std::inserter(lines, lines.end()));
std::copy(lines.begin(), lines.end(),
std::ostream_iterator<std::string>(std::cout, "\n"));
return 0;
}
Keep in mind that this affects all input from the stream. Using this pretty much rules out mixing line-oriented input with other input (e.g. reading a number from the stream using stream>>my_integer would normally fail).
What I have (written as an exercise, but perhaps turns out useful one day), is LineInputIterator:
#ifndef UB_LINEINPUT_ITERATOR_H
#define UB_LINEINPUT_ITERATOR_H
#include <iterator>
#include <istream>
#include <string>
#include <cassert>
namespace ub {
template <class StringT = std::string>
class LineInputIterator :
public std::iterator<std::input_iterator_tag, StringT, std::ptrdiff_t, const StringT*, const StringT&>
{
public:
typedef typename StringT::value_type char_type;
typedef typename StringT::traits_type traits_type;
typedef std::basic_istream<char_type, traits_type> istream_type;
LineInputIterator(): is(0) {}
LineInputIterator(istream_type& is): is(&is) { ++*this; } // read the first line up front so *it is valid
const StringT& operator*() const { return value; }
const StringT* operator->() const { return &value; }
LineInputIterator<StringT>& operator++()
{
assert(is != NULL);
if (is && !getline(*is, value)) {
is = NULL;
}
return *this;
}
LineInputIterator<StringT> operator++(int)
{
LineInputIterator<StringT> prev(*this);
++*this;
return prev;
}
bool operator!=(const LineInputIterator<StringT>& other) const
{
return is != other.is;
}
bool operator==(const LineInputIterator<StringT>& other) const
{
return !(*this != other);
}
private:
istream_type* is;
StringT value;
};
} // end ub
#endif
So your loop could be replaced with an algorithm (another recommended practice in C++):
for_each(LineInputIterator<>(cin), LineInputIterator<>(), do_stuff);
Perhaps a common task is to store every line in a container:
vector<string> lines((LineInputIterator<>(stream)), LineInputIterator<>());
The first one.
Both do the same, but the first one is much more readable, plus you get to keep the string variable after the loop is done (in the 2nd option, it's enclosed in the for loop scope).
Go with the while statement.
See Chapter 16.2 (specifically pages 374 and 375) of Code Complete 2 by Steve McConnell.
To quote:
Don't use a for loop when a while loop is more appropriate. A common abuse of the flexible for loop structure in C++, C# and Java is haphazardly cramming the contents of a while loop into a for loop header.
.
C++ Example of a while loop abusively Crammed into a for Loop Header
for (inputFile.MoveToStart(), recordCount = 0; !inputFile.EndOfFile(); recordCount++) {
inputFile.GetRecord();
}
C++ Example of appropriate use of a while loop
inputFile.MoveToStart();
recordCount = 0;
while (!inputFile.EndOfFile()) {
inputFile.GetRecord();
recordCount++;
}
I've omitted some parts in the middle but hopefully that gives you a good idea.
This is based on Jerry Coffin's answer. I wanted to show C++20's std::ranges::istream_view. I also added a line number to the class. I did this on godbolt, so I could see what happened. This version of the line class still works with std::input_iterator.
https://en.cppreference.com/w/cpp/ranges/basic_istream_view
https://www.godbolt.org/z/94Khjz
class line {
std::string data{};
std::intmax_t line_number{-1};
public:
friend std::istream &operator>>(std::istream &is, line &l) {
std::getline(is, l.data);
++l.line_number;
return is;
}
explicit operator std::string() const { return data; }
explicit operator std::string_view() const noexcept { return data; }
constexpr explicit operator std::intmax_t() const noexcept { return line_number; }
};
int main()
{
std::string l("a\nb\nc\nd\ne\nf\ng");
std::stringstream ss(l);
for(const auto & x : std::ranges::istream_view<line>(ss))
{
std::cout << std::intmax_t(x) << " " << std::string_view(x) << std::endl;
}
}
prints out:
0 a
1 b
2 c
3 d
4 e
5 f
6 g