Getline breaks when reading special characters into a wstring - c++

As an exercise, I am making a simple vocabulary trainer. The file I am reading contains the vocabulary, which includes special characters such as äöü.
However, I have been struggling to read this file without getting mangled characters instead of the appropriate special characters.
I understand why this is happening but not how to correctly solve it.
Here is my attempt:
Unit(const char* file)
    : unitName(getFileName(file), false)
{
    std::wifstream infile(file);
    std::wstring line;
    infile.imbue(std::locale(infile.getloc(), new std::codecvt_utf8<wchar_t, 0x10ffff, std::consume_header>()));
    while (std::getline(infile, line))
    {
        std::wcout << line.c_str() << "\n";
        this->vocabulary.insert(parseLine(line.c_str(), Language::EN_UK, Language::DE));
    }
}
The reading process stops as soon as an entry is reached that contains a special character.
I have even been able to change the code slightly to see where exactly it stops reading:
while (infile.eof() == false)
{
    std::getline(infile, line);
    std::wcout << line.c_str() << "\n";
    this->vocabulary.insert(parseLine(line.c_str(), Language::EN_UK, Language::DE));
}
If I do it like this, the output keeps repeating the entry with the special character, but cuts it off right before the special character would appear, like so:
Instead of:
cross-class|klassenübergreifend
It says:
cross-class|klassen
cross-class|klassen
cross-class|klassen
cross-class|klassen
.
.
.
This leads me to believe that the special character gets misinterpreted as a line end by getline.
I do not care whether I use getline or something else, but in order for my parse function to work, the string it gets needs to represent one line of the file. Therefore, reading the entire buffer into a single string won't work unless I do the separation myself.
How can I properly and neatly read a UTF-8 file line by line?
Note: I looked for other articles on here, but most of them either use getline or only explain why this happens, not how to solve it.
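For comparison, a byte-oriented sketch that avoids wide streams entirely: '\n' and '|' are single bytes in UTF-8 and never occur inside a multi-byte sequence, so reading into a plain std::string leaves characters such as ä intact as their UTF-8 bytes (the file name below is a placeholder, and decoding/display is left to the caller):
#include <fstream>
#include <iostream>
#include <string>

int main()
{
    std::ifstream infile("unit1.txt", std::ios::binary); // placeholder file name
    std::string line;
    while (std::getline(infile, line))
    {
        std::string::size_type bar = line.find('|');
        if (bar == std::string::npos)
            continue;                                    // skip malformed lines
        std::string english = line.substr(0, bar);
        std::string german  = line.substr(bar + 1);      // still raw UTF-8 bytes
        std::cout << english << " -> " << german << "\n";
    }
}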

Related

C++ getline in for loop

I am learning to program in C++, with previous experience in Python and R. I'd say I understood for loops well, but now I have found out that I know nothing about them. Here is a piece of code.
for (int i = 0; i != 1; ){
    string name;
    getline(infile, name);
    if (name == end_input){
        i = 1;
    }
    else{
        names.push_back(name);
    }
}
The whole program should (and does) read names (name) from the file infile and store them in the names string vector. Then I want to store them in another file. When I look at the code, I would think C++ performs the following instructions:
create integer i and set it to 0
create string name
read a line from infile and store it in the names string vector
repeat this until name == end_input
From this I would say that C++ should store the first line of the input file again and again, because I didn't tell it to jump to the next line after getline reads the first one. But the program reads all the names from the file, line by line, as the author intended. How is that possible?
Thank you.
getline automatically moves to the next line after reading a line.
Also, a do-while loop might serve your purposes better here.
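A sketch of how that do-while version could look (the file name and the end_input sentinel value are placeholders):
#include <fstream>
#include <string>
#include <vector>

int main()
{
    std::ifstream infile("names.txt");          // placeholder file name
    std::vector<std::string> names;
    const std::string end_input = "END";        // placeholder sentinel value

    std::string name;
    do
    {
        if (!std::getline(infile, name))
            break;                              // stop on end of file or a read error
        if (name != end_input)
            names.push_back(name);              // keep this name
    } while (name != end_input);
}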
When a built-in function does not behave as you expected, the logical next step is to check the documentation. If you do, you will see the following:
Extracts characters from is and stores them into str until the delimitation character delim is found (or the newline character, '\n', for (2)).
The extraction also stops if the end of file is reached in is or if some other error occurs during the input operation.
If the delimiter is found, it is extracted and discarded (i.e. it is not stored and the next input operation will begin after it).
This answers your question.
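A tiny sketch of that documented behaviour: each getline call extracts and discards the terminating '\n', so the next call automatically starts on the following line.
#include <iostream>
#include <sstream>
#include <string>

int main()
{
    std::istringstream infile("Alice\nBob\nEND\n");
    std::string name;
    std::getline(infile, name);   // reads "Alice"; the '\n' is extracted and discarded
    std::getline(infile, name);   // reads "Bob", starting right after that '\n'
    std::cout << name << "\n";    // prints: Bob
}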

Differentiating between delimiter and newline in getline

ifstream file;
file.open("file.csv");
string str;
while (file.good())
{
    getline(file, str, ',');
    if (___) // string was split from delimiter
    {
        [do this]
    }
    else // string was split from eol
    {
        [do that]
    }
}
file.close();
I'd like to read from a csv file, and differentiate between what happens when a string is split off due to a new line and what happens when it is split off due to the desired delimiter -- i.e. filling in the ___ in the sample code above.
The approaches I can think of are:
(1) manually adding a character to the end of each line in the original file,
(2) automatically adding a character to the end of each line by writing to another file,
(3) using getline without the delimiter and then making a function to split the resulting string by ','.
But is there a simpler or direct solution?
(I see that similar questions have been asked before, but I didn't see any solutions.)
My preference for clarity of the code would be to use your option 3) - use getline() with the standard '\n' delimiter to read the file into a buffer line by line and then use a tokenizer like strtok() (if you want to work on the C level) or boost::tokenizer to parse the string you read from the file.
You're really dealing with two distinct steps here, first read the line into the buffer, then take the buffer apart to extract the components you're after. Your code should reflect that and by doing so, you're also avoiding having to deal with odd states like the ones you describe where you end up having to do additional parsing anyway.
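As a minimal illustration of those two distinct steps, here is a sketch that uses only the standard library (std::getline twice) instead of strtok() or boost::tokenizer; the file name is taken from the question:
#include <fstream>
#include <iostream>
#include <sstream>
#include <string>

int main()
{
    std::ifstream file("file.csv");
    std::string line;
    while (std::getline(file, line))                 // step 1: read one full line
    {
        std::istringstream fields(line);
        std::string cell;
        while (std::getline(fields, cell, ','))      // step 2: split that line on ','
            std::cout << "[" << cell << "] ";
        std::cout << "\n";                           // the line boundary is known here
    }
}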
There is no easy way to determine "which delimiter terminated the string", and it gets "consumed" by getline, so it's lost to you.
Read the line, and split it on commas yourself. You can use std::string::find() to find commas - however, if your file contains strings that themselves contain commas, you will have to parse the string character by character, since you need to distinguish between commas in quoted text and commas in unquoted text.
Your big problem is your code does not do what you think it does.
getline with a delimiter treats \n as just another character from my reading of the docs. It does not split on both the delimiter and newline.
The efficient way to do this is to write your own custom splitting getline: cppreference has a pretty clear description of what getline does, and mimicking it should be easy (and safer than shooting from the hip; files are tricky).
Then return both the string and, in a second channel, information about why you finished your parse.
Now, using getline naively and then splitting is also viable, and will be much faster to write, and probably less error-prone to boot.
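A sketch of what such a custom, getline-like helper could look like (the names read_field and Ending are made up for illustration):
#include <istream>
#include <string>

enum class Ending { Comma, Newline, Eof };

// Reads one field into 'out' and reports which character ended it.
Ending read_field(std::istream& in, std::string& out)
{
    out.clear();
    for (char c; in.get(c); )
    {
        if (c == ',')  return Ending::Comma;    // split by the delimiter
        if (c == '\n') return Ending::Newline;  // split by the end of line
        out += c;                               // ordinary character: keep it
    }
    return Ending::Eof;                         // the stream ran out of data
}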

Stop carriage return from appearing in stringstream

I have some text parsing that I'd like to behave identically whether it reads from a file or from a stringstream. As such, I'm trying to use a std::istream to perform all the work. In the string version, I'm trying to get it to read from a static memory byte array I've created (which was originally generated from a text file). Let's say the original file looked like this:
4
The corresponding byte array is this:
const char byte_array[] = { 52, 13, 10 };
where 52 is the ASCII code for the character 4, followed by the carriage return and the linefeed.
When I read directly from the file, the parsing works fine.
When I try to read it in "string mode" like this:
std::istringstream iss(byte_array);
std::istream& is = iss;
I end up getting the carriage returns stuck on the end of the strings I retrieve from the stringstream with this method:
std::string line;
std::getline(is, line);
This screws up my parsing because string.empty() no longer returns true on "blank" lines -- every line contains at least the 13 for the carriage return, even if the line is empty in the original file that generated the binary data.
Why is the ifstream behaving differently from the istringstream in this respect? How can I have the istringstream version discard the carriage return just like the ifstream version does?
std::ifstream operates in text mode by default, which means it will convert non-LF line endings to a single LF. In this case, std::ifstream is removing the CR character before std::getline() ever sees it.
std::istringstream does not do any interpretation of the source string, and passes through all bytes as they are in the string.
It's important to note that std::string represents a sequence of bytes, not characters. Typically one uses std::string to store ASCII-encoded text, but they can also be used to store arbitrary binary data. The assumption is that if you have read text from a file into memory, you have already done any text transformations such as standardization of line endings.
The correct course of action here would be to convert line endings when the file is being read. In this case, it looks like you are generating code from a file. The program that reads the file and converts it to code should be eliminating the CR characters.
An alternative approach would be to write a stream wrapper that takes an std::istream and delegates read operations to it, converting line endings on the fly. This approach is viable, though can be tricky to get right. (Efficiently handling seeking, in particular, will be difficult.)
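A simpler alternative to the wrapper, if changing how the data is generated is not an option: keep using std::getline and strip a trailing '\r' from each line afterwards. A sketch using the byte array from the question:
#include <iostream>
#include <sstream>
#include <string>

int main()
{
    const char byte_array[] = { 52, 13, 10 };                // "4\r\n"
    std::istringstream iss(std::string(byte_array, sizeof byte_array));
    std::string line;
    while (std::getline(iss, line))
    {
        if (!line.empty() && line.back() == '\r')
            line.pop_back();                                 // drop the stray CR
        std::cout << (line.empty() ? "<blank>" : line) << "\n";
    }
}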

getline() text with UNIX formatting characters

I am writing a C++ program which reads lines of text from a .txt file. Unfortunately the text file is generated by a twenty-something year old UNIX program and it contains a lot of bizarre formatting characters.
The first few lines of the file are plain English text and these are read with no problems. However, whenever a line contains one or more of these strange characters mixed in with the text, that entire line is read as garbled characters and the data is lost.
The really confusing part is that if I manually delete the first couple of lines, so that the very first character in the file is one of these unusual characters, then everything in the file is read perfectly. The unusual characters just display as little ASCII squiggles (arrows, smiley faces, etc.), which is fine. It seems as though a decision is being made automatically, without my knowledge or consent, based on the first line read.
Based on some googling, I suspected that the issue might be with the locale, but according to the visual studio debugger, the locale property of the ifstream object is "C" in both scenarios.
The code which reads the data is as follows:
//Function to open file at location specified by inFilePath, load and process data
int OpenFile(const char* inFilePath)
{
    string line;
    ifstream codeFile;
    //open text file
    codeFile.open(inFilePath, ios::in);
    //read file line by line
    while ( codeFile.good() )
    {
        getline(codeFile, line);
        //check non-zero length
        if (line != "")
            ProcessLine(&line[0]);
    }
    //close file
    codeFile.close();
    return 1;
}
If anyone has any suggestions as to what might be going on or how to fix it, they would be very welcome.
From reading about your issues it sounds like you are reading in binary data, which will cause getline() to throw out content or simply skip over the line.
You have a couple of choices:
If you simply need lines from the data file, you can first sanitise them by removing all non-printable characters (that is the "official" name for those weird ASCII characters). On UNIX, a tool such as strings would help you with that process.
You can of course also do this programmatically in your code by simply reading in X amount of data, storing it in a string, and then removing those characters that fall outside of the standard ASCII character range (a sketch of this is shown a little further down). This will most likely cause you to lose any Unicode that may be stored in the file.
You can change your program to understand the format, and basically write a parser that allows you to parse the document in a more sane way.
If you can, I would suggest trying solution number 1, simply to see if the results are sane and can still be used. You mention that this is medical data; do you by any chance know what file format this is? If you are trying to find out and have access to a UNIX/Linux machine, you can use the utility file and maybe it can give you a clue (worst case, it will tell you it is simply data).
If possible try getting a "clean" file that you can post the hex dump of so that we can try to provide better help than that what we are currently providing. With clean I mean that there is no personally identifying information in the file.
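A sketch of the programmatic clean-up mentioned above (the helper name is made up; it keeps printable ASCII plus tab and newline and discards everything else):
#include <algorithm>
#include <cctype>
#include <string>

std::string strip_unprintable(std::string text)
{
    text.erase(std::remove_if(text.begin(), text.end(),
                              [](unsigned char c)
                              {
                                  // keep printable ASCII plus '\n' and '\t'
                                  return !(std::isprint(c) || c == '\n' || c == '\t');
                              }),
               text.end());
    return text;
}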
For number 2, open the file in binary mode. You mentioned using Windows; binary and non-binary files are handled differently by std::fstream objects there, whereas on UNIX systems this is not the case (on most systems, anyway -- I'm sure I'll get a comment regarding the one system that doesn't match this description).
codeFile.open(inFilePath,ios::in);
would become
codeFile.open(inFilePath, ios::in | ios::binary);
Instead of getline(), you will want to become intimately familiar with .read(), which allows unformatted operations on the ifstream.
Reading will be like this:
// This code has not been tested!
char input[1024];
codeFile.read(input, 1024);
int actual_read = codeFile.gcount();
// Here you can process input, up to a maximum of actual_read characters.
//ProcessLine() // We didn't necessarily read a line!
ProcessData(input, actual_read);
The other thing, as mentioned, is that you can change the locale for the current stream and change the separator it considers a newline; maybe this will fix your issue without requiring the unformatted operations:
imbue the stream with a new locale that only knows about the newline. This method may or may not let your getline() call work without issues.

How to read a message from a file, modifying only words?

Suppose I have the following text:
My name is myName. I love
stackoverflow .
Hi, Guys! There is more than one space after "Guys!" 123
And also after "123" there are 2 spaces and newline.
Now I need to read this text file as it is. I need to perform some actions only on the alphanumeric words. Afterwards, I have to print it with the changed words, but with spaces, newlines and punctuation unchanged and in the same positions. When an alphanumeric word is changed, its length remains the same. I have tried this using library functions that check for alphanumeric values, but the code got very messy. Is there any other way?
You can read your file line by line with the fgets() function. It will fill a char array, and you can work with this array: e.g. iterate over it, split it into alphanumeric words, change the words, and then write the fixed string into a new file with the fwrite() function.
If you prefer the C++ way of working with files (iostream), you can use istream::getline. It will preserve spaces, but it will consume the "\n". If you need to preserve even the "\n" (it can sometimes be '\r' or '\r\n'), you can use istream::get.
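A sketch of that istream::get() route, which passes every byte (including '\r' and '\n') through unchanged; the file name is a placeholder, and upper-casing each alphanumeric character merely stands in for the real word processing:
#include <cctype>
#include <fstream>
#include <iostream>

int main()
{
    std::ifstream in("message.txt");       // placeholder input file name
    char c;
    while (in.get(c))
    {
        if (std::isalnum(static_cast<unsigned char>(c)))
            c = static_cast<char>(std::toupper(static_cast<unsigned char>(c)));
        std::cout << c;                    // spaces, newlines and punctuation pass through unchanged
    }
}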
Maybe you should look at Boost Tokenizer. It can break a string into a series of tokens and iterate over them. The following sample breaks up a phrase into words:
#include <iostream>
#include <string>
#include <boost/tokenizer.hpp>

int main()
{
    std::string s = "Hi, Guys! There is more...";
    boost::tokenizer<> tok(s);
    for (boost::tokenizer<>::iterator beg = tok.begin(); beg != tok.end(); ++beg)
    {
        std::cout << *beg << "\n";
    }
    return 0;
}
But in your case you need to provide a TokenizerFunc that will break up a string at alphanumeric/non-alphanumeric boundaries.
For more information, see the Boost Tokenizer documentation and the implementations of the already provided char_separator, offset_separator, and escaped_list_separator.
The reason your code got messy is usually that you didn't break your problem down into clear functions and classes. If you do, you will have a few functions that each do precisely one thing (not messy). Your main function will then just call these simple functions. If the function names are well chosen, the main function will become short and clear, too.
In this case, your main function needs to do:
Loop: Read every line of a file
On every line, check if and where a "special" word occurs.
If a special word occurs, replace it
Extra hint: a line of text can be stored as a std::string and can be read with std::getline(std::cin, line).
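A sketch of that structure (the word pair here is made up, and the replacement has the same length as the original, as the question requires):
#include <iostream>
#include <string>

int main()
{
    const std::string special     = "myName";
    const std::string replacement = "Alice!";              // same length as "myName"

    std::string line;
    while (std::getline(std::cin, line))                    // read every line
    {
        std::string::size_type pos = line.find(special);    // check if and where it occurs
        if (pos != std::string::npos)
            line.replace(pos, special.size(), replacement); // replace it in place
        std::cout << line << "\n";                          // spaces and punctuation untouched
    }
}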