Differentiating between delimiter and newline in getline - c++

ifstream file;
file.open("file.csv");
string str;
while(file.good())
{
getline(file,str,',');
if (___) // string was split from delimiter
{
[do this]
}
else // string was split from eol
{
[do that]
}
}
file.close();
I'd like to read from a csv file, and differentiate between what happens when a string is split off due to a new line and what happens when it is split off due to the desired delimiter -- i.e. filling in the ___ in the sample code above.
The approaches I can think of are:
(1) manually adding a character to the end of each line in the original file,
(2) automatically adding a character to the end of each line by writing to another file,
(3) using getline without the delimiter and then making a function to split the resulting string by ','.
But is there a simpler or direct solution?
(I see that similar questions have been asked before, but I didn't see any solutions.)

My preference for clarity of the code would be to use your option 3) - use getline() with the standard '\n' delimiter to read the file into a buffer line by line and then use a tokenizer like strtok() (if you want to work on the C level) or boost::tokenizer to parse the string you read from the file.
You're really dealing with two distinct steps here: first read the line into the buffer, then take the buffer apart to extract the components you're after. Your code should reflect that, and by doing so you also avoid having to deal with odd states like the ones you describe, where you end up having to do additional parsing anyway.
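A minimal sketch of that two-step structure, using only the standard library instead of strtok() or boost::tokenizer. The function names splitLine and readRows are hypothetical, and the sketch assumes fields never contain quoted commas:

```cpp
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Step 2: take one already-read line apart at every delimiter.
// Assumption: no quoted fields. A trailing empty field ("a,b,") is not
// reported, because getline fails on the empty remainder.
std::vector<std::string> splitLine(const std::string& line, char delim = ',')
{
    std::vector<std::string> fields;
    std::istringstream lineStream(line);
    std::string field;
    while (std::getline(lineStream, field, delim))
        fields.push_back(field);
    return fields;
}

// Step 1: read the input line by line. The outer loop is where "end of
// line" is known; the inner split is where "delimiter" is known.
std::vector<std::vector<std::string>> readRows(std::istream& in)
{
    std::vector<std::vector<std::string>> rows;
    std::string line;
    while (std::getline(in, line))
        rows.push_back(splitLine(line));
    return rows;
}
```

With this shape, the "[do this]" work belongs inside the loop over fields and the "[do that]" work after each call to splitLine.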

There is no easy way to determine "which delimiter terminated the string", and it gets "consumed" by getline, so it's lost to you.
Read the line, and split it on commas yourself. You can use std::string::find() to find commas. However, if your file contains quoted strings that themselves contain commas, you will have to parse the line character by character, since you need to distinguish between commas in quoted text and commas in unquoted text.
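A rough sketch of such a character-by-character split. The name splitCsvLine is made up for illustration, and the quoting rules are an assumption (fields wrapped in double quotes, with a doubled "" standing for a literal quote); your file may use different rules:

```cpp
#include <string>
#include <vector>

// Split one line at commas that are outside double quotes.
// Assumed quoting convention: "..." wraps a field, "" inside it is a literal quote.
std::vector<std::string> splitCsvLine(const std::string& line)
{
    std::vector<std::string> fields;
    std::string current;
    bool inQuotes = false;

    for (std::size_t i = 0; i < line.size(); ++i) {
        const char c = line[i];
        if (inQuotes) {
            if (c == '"') {
                if (i + 1 < line.size() && line[i + 1] == '"') {
                    current += '"';   // doubled quote: keep one, skip the other
                    ++i;
                } else {
                    inQuotes = false; // closing quote
                }
            } else {
                current += c;         // commas in here are ordinary text
            }
        } else if (c == '"') {
            inQuotes = true;          // opening quote
        } else if (c == ',') {
            fields.push_back(current);
            current.clear();
        } else {
            current += c;
        }
    }
    fields.push_back(current);        // last field (possibly empty)
    return fields;
}
```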

Your big problem is that your code does not do what you think it does.
From my reading of the docs, getline with a delimiter treats \n as just another character. It does not split on both the delimiter and the newline.
The efficient way to do this is to write your own custom splitting getline: cppreference has a pretty clear description of what getline does, and mimicking it should be easy (and safer than shooting from the hip; files are tricky).
Then return both the string and, in a second channel, the information about why the read stopped.
Now, using getline naively and then splitting is also viable, will be much faster to write, and is probably less error prone to boot.

Related

Difficulty parsing comments that come after ; in a line

I'm parsing a file that has definitions of functions. Since functions may be written in multiple lines, I'm parsing until encountering a ;:
#include <fstream>
#include <iostream>
#include <string>
void removeLineBreaks(std::string &str)
{
auto pos = str.find('\n');
while (pos != std::string::npos)
{
str.replace(pos, 1, "");
pos = str.find('\n', pos);
}
}
int main()
{
std::ifstream ifStream("a.pr");
std::string sLine;
const char sDelim(';');
while (std::getline(ifStream, sLine, sDelim))
{
sLine += sDelim;
removeLineBreaks(sLine);
// process further
}
}
The text can be something like this:
a=f(b,c); // comment
d=f(e,f);
Since I'm reading until ;, here I get two pieces:
a=f(b,c); and
// comment
\n d=f(e,f);.
If I call removeLineBreaks on the second piece, it'll become // comment d=f(e,f); so it'd be treated as a comment by my parser.
What options do I have to make this work correctly? I could think of this - before calling removeLineBreaks on the line, get the string until \n, and if it starts with //, cut that part from the line, and only then call removeLineBreaks.
Any other ideas?
You first need to remove the //comments from your input and only then you can split at the semicolons.
Consider the following input:
a=f(b,c); // Functions comments are not functions; a=F(b,c);
If you first split on semicolons and then remove the comments, then you would end up with two functions:
a=f(b,c);
a=F(b,c);
But you only want to have the first one.
The solution is to:
(1) Read the file line by line (lines delimited by LF).
(2) While doing that, remove the line-based // comments and also all line breaks.
(3) Combine all input back into a single string.
(4) Split the string on semicolons to extract all functions which are outside comments.
The steps do not have to be done sequentially. You can do all these steps simultaneously on the input stream of characters, and emit a stream of functions. In fact, this is what real parsers would do.
You are essentially writing a simple parser. As your language gets more and more complicated you will find it more and more difficult to parse a file in such a way. For example with the approach above it will be difficult to emit error messages with line number information.
If you want to write a proper parser I would recommend a recursive descent parser together with a PEG (parsing expression grammar). This approach is easy to learn, has fewer pitfalls than other approaches, and is still very powerful for computer languages. See here: https://en.wikipedia.org/wiki/Parsing_expression_grammar
Warning: If you hear people suggesting flex and bison (or lex and yacc), I strongly recommend against using them. They are complicated to use and are very limited in what they can parse and how it needs to be specified. I would rather suggest using a lightweight and modern parsing framework like PEGTL: https://github.com/taocpp/PEGTL.

When to use a blank cin.get()?

As the title says - when should I use a blank cin.get() ?
I encountered situations when the program acted strange until I added a few blank cin.get()s between reading lines. (e.g. in a struct when reading its fields I had to enter a cin.get() between each non-blank cin.get())
So what does a blank cin.get() do and when should I use it?
Thanks.
There are two broad categories of stream input operations: formatted and unformatted. Formatted operations expect input in a particular form; they start out by skipping whitespace, then look for text that matches what they expect. They typically are written as extractors; that's the >> that you see so often:
int i;
std::cin >> i; // formatted input operation
Here, the extractor is looking for digits, and will translate the digits that it sees into an integer value. When it sees something that isn't a digit it stops looking.
Unformatted input operations just do something, without regard to any rules about what the input should look like. basic_istream::get() is one of those: it simply reads a character or a sequence of characters. If you ask it to read a sequence it doesn't care what's in that sequence, except that the form that takes a delimiter looks for that delimiter. Other than that, it just copies text.
When you mix formatted and unformatted operations they fight with each other.
int i;
std::cin >> i;
If std::cin is reading from the console (that is, you haven't redirected it at the command line), you'll typically type in some digits followed by the "Enter" key. The extractor reads the digits, and when it hits the newline character (that's what the "Enter" key looks like on input) it stops reading, and leaves the newline character alone. That's fine, if the next operation on that stream is also a formatted extractor: it skips the newline character and any other whitespace until it hits something that isn't whitespace, and then it starts translating the text into the appropriate value.
There's a problem, though, if you use a formatted operation followed by an unformatted operation. This is a common problem when folks mix extractors (>>) with getline(): the extractor reads up to the newline, and the call to getline() reads the newline character, says "Hey, I've got an empty line", and returns an empty string.
Same thing for the version of basic_istream::get() that reads a sequence of characters: when it hits the delimiter (newline if you haven't specified something else) it stops reading. If that newline was a leftover from an immediately preceding formatted extractor, it's probably not what you're looking for.
One (really really ugly) solution is the brute force cin.ignore(256, '\n');, which discards up to 256 characters, stopping early once it has consumed a newline character.
A more delicate solution is to not create the problem in the first place. If you need to read lines, read lines. If you need to read lines and sometimes extract values from the text in a line, read the line, then create a std::stringstream object and extract from that.
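As an illustration of that last suggestion, here is a small sketch. The Person struct and the readPerson function are made up for this example, and the input layout (a name on one line, an age on the next) is an assumption:

```cpp
#include <istream>
#include <sstream>
#include <string>

// Hypothetical record: a free-text name followed by a number.
struct Person {
    std::string name;
    int age = 0;
};

// Every read from the real stream is a whole line, so no stray newline is
// ever left behind. Values are extracted from a per-line string stream.
bool readPerson(std::istream& in, Person& p)
{
    std::string line;
    if (!std::getline(in, p.name))
        return false;                 // no name line
    if (!std::getline(in, line))
        return false;                 // no age line
    std::istringstream fields(line);
    return static_cast<bool>(fields >> p.age);
}
```

No blank cin.get() is needed between the fields here, since formatted extraction only ever touches the temporary std::istringstream.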

Getline breaks when reading special characters into a wstring

As an exercise, I am making a simple vocabulary trainer. The file I am reading contains the vocabulary, which also includes special characters like äöü for example.
I have been struggling to read this file, however, without getting mangled characters instead of the appropriate special characters.
I understand why this is happening but not how to correctly solve it.
Here is my attempt:
Unit(const char* file)
:unitName(getFileName(file),false){
std::wifstream infile(file);
std::wstring line;
infile.imbue(std::locale(infile.getloc(), new std::codecvt_utf8<wchar_t, 0x10ffff, std::consume_header>()));
while (std::getline(infile, line))
{
std::wcout<<line.c_str()<<"\n";
this->vocabulary.insert(parseLine(line.c_str(),Language::EN_UK,Language::DE));
}
}
The reading process stops as soon as an entry is reached that contains a special character.
I have even been able to change the code slightly to see where exactly it stops reading:
while (infile.eof()==false)
{
std::getline(infile, line);
std::wcout<<line.c_str()<<"\n";
this->vocabulary.insert(parseLine(line.c_str(),Language::EN_UK,Language::DE));
}
If I do it like this, the output loops the entry with the special character but stops outputting it right before the special character would appear like so:
Instead of:
cross-class|klassenübergreifend
It says:
cross-class|klassen
cross-class|klassen
cross-class|klassen
cross-class|klassen
.
.
.
This leads me to believe that the special character gets misinterpreted as a line end by getline.
I do not care if I have to use getline or something else, but in order for my parse function to work, the string it gets needs to represent a line in the file. Therefore reading the entire buffer into a string won't work, unless I do the separation myself.
How can I properly and neatly read a utf-8 file line by line?
Note: I looked for other articles on here but most of them either use getline or just explain why but not how to solve it.

How do I loop through input line by line and also go through each of those lines in C++?

So I'm trying to go through input line by line. Each line in the input is formatted like this:
Words_More words_even more words_What I Need to Go Through_ Random_ Random_Etc.
With a random amount of word clusters (The words separated by '_')
I want, for each line, to be able to ignore all the words until I get to the fourth word cluster which in the example I gave would be: "What I Need To Go Through" and then store those separate words in some data structure that I haven't decided upon yet.
My first thought would be to use
getline(cin, trash, '_');
three times and deal with the data that follows, but then how would I loop line by line until the end of the input?
You basically have two options:
(1) Use getline for each line, then parse it.
Use getline(stream, string) to get a line from your stream and store it in a string. Then construct an istringstream to parse that string again (with the getline call you thought of).
(2) Get what you need, and then ignore() everything until the next newline.
Do your getline() calls, and then call ignore() to read and discard the rest of the line, so you can start again with the next line.
Which one you use is up to you, but the second one has slightly better performance, if you care about that.
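A sketch of the first option, with hypothetical function names (fourthClusterWords, allFourthClusters) and std::vector<std::string> picked arbitrarily as the data structure:

```cpp
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Returns the words of the fourth '_'-separated cluster of one line,
// or an empty vector if the line has fewer than four clusters.
std::vector<std::string> fourthClusterWords(const std::string& line)
{
    std::istringstream lineStream(line);
    std::string cluster;
    // Three reads are thrown away; the fourth one stays in `cluster`.
    for (int i = 0; i < 4; ++i) {
        if (!std::getline(lineStream, cluster, '_'))
            return {};
    }
    // Split the cluster into whitespace-separated words.
    std::istringstream wordStream(cluster);
    std::vector<std::string> words;
    std::string word;
    while (wordStream >> word)
        words.push_back(word);
    return words;
}

// Outer loop: one std::getline call per input line until the input ends.
std::vector<std::vector<std::string>> allFourthClusters(std::istream& in)
{
    std::vector<std::vector<std::string>> result;
    std::string line;
    while (std::getline(in, line))
        result.push_back(fourthClusterWords(line));
    return result;
}
```

Passing std::cin to allFourthClusters loops until the end of the input, which answers the "how would I loop line by line" part.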

How to read a message from a file, modifying only words?

Suppose I have the following text:
My name is myName. I love
stackoverflow .
Hi, Guys! There is more than one space after "Guys!" 123
And also after "123" there are 2 spaces and newline.
Now I need to read this text file exactly as it is and apply some changes only to the alphanumeric words. Afterwards I have to print it with the changed words, but with spaces, newlines and punctuation unchanged and in the same positions. The length of each alphanumeric word stays the same when it is changed. I have tried this with library checking for alphanumeric values, but the code gets very messy. Is there any other way?
You can read your file line by line with the fgets() function. It fills a char array that you can then work with, e.g. iterate over the array, split it into alnum words, change the words, and then write the fixed string into a new file with the fwrite() function.
If you prefer the C++ way of working with files (iostream), you can use istream::getline. It preserves spaces, but it consumes the "\n". If you need to keep even the "\n" (it can be '\r' and '\r\n' sometimes), you can use istream::get.
Maybe you should look at Boost Tokenizer. It can break a string into a series of tokens and iterate through them. The following sample breaks up a phrase into words:
#include <iostream>
#include <string>
#include <boost/tokenizer.hpp>
int main()
{
std::string s = "Hi, Guys! There is more...";
boost::tokenizer<> tok(s);
for(boost::tokenizer<>::iterator beg = tok.begin(); beg != tok.end(); ++beg)
{
std::cout << *beg << "\n";
}
return 0;
}
But in your case you need to provide a TokenizerFunc that will break up a string at alphanumeric/non-alphanumeric boundaries.
For more information see Boost Tokenizer documentation and implementation of an already provided char_separator, offset_separator and escaped_list_separator.
Code usually gets messy because the problem has not been broken down into clear functions and classes. If you do that, you will have a few functions that each do precisely one thing (not messy). Your main function will then just call these simple functions. If the function names are well chosen, the main function will become short and clear, too.
In this case, your main function needs to do:
Loop: Read every line of a file
On every line, check if and where a "special" word occurs.
If a special word occurs, replace it
Extra hints: a line of text can be stored as a std::string and can be read by std::getline(std::cin, line)
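One way to keep that decomposition small is to separate "find the words" from "change a word". The sketch below is only an illustration: forEachWord is an invented name, and it relies on the question's statement that a changed word keeps its length, so everything else stays in place:

```cpp
#include <algorithm>
#include <cctype>
#include <string>

// Calls change(first, last) for every maximal run of alphanumeric
// characters in `text`. Spaces, newlines and punctuation are never touched.
// Assumption: `change` modifies characters in place and keeps the length.
template <typename Change>
void forEachWord(std::string& text, Change change)
{
    const auto isAlnum = [](char c) {
        return std::isalnum(static_cast<unsigned char>(c)) != 0;
    };
    auto it = text.begin();
    while (it != text.end()) {
        if (isAlnum(*it)) {
            // The word ends at the first non-alphanumeric character.
            const auto wordEnd = std::find_if_not(it, text.end(), isAlnum);
            change(it, wordEnd);
            it = wordEnd;
        } else {
            ++it;   // separator: leave it exactly where it is
        }
    }
}
```

For example, passing a lambda that calls std::reverse on the range reverses every word while leaving separators untouched. Applied to each line read by std::getline (with the '\n' written back afterwards), it keeps the layout of the file.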