Stop carriage return from appearing in stringstream - c++

I have some text parsing that I'd like to behave identically whether it reads from a file or from a stringstream. As such, I'm trying to use an std::istream to perform all the work. In the string version, I'm trying to get it to read from a static in-memory byte array I've created (which was originally generated from a text file). Let's say the original file looked like this:
4
The corresponding byte array is this:
const char byte_array[] = { 52, 13, 10 };
Where 52 is ASCII for the character 4, then the carriage return, then the linefeed.
When I read directly from the file, the parsing works fine.
When I try to read it in "string mode" like this:
std::istringstream iss(byte_array);
std::istream& is = iss;
I end up getting the carriage returns stuck on the end of the strings I retrieve from the stringstream with this method:
std::string line;
std::getline(is, line);
This screws up my parsing because the string.empty() method no longer gets triggered on "blank" lines -- every line contains at least a 13 for the carriage return even if it's empty in the original file that generated the binary data.
Why is the ifstream behaving differently from the istringstream in this respect? How can I have the istringstream version discard the carriage return just like the ifstream version does?

std::ifstream opens files in text mode by default, which means the platform's native line ending (CR LF on Windows) is converted to a single LF ('\n') as it is read. In this case, std::ifstream is removing the CR character before std::getline() ever sees it.
std::istringstream does not do any interpretation of the source string, and passes through all bytes as they are in the string.
It's important to note that std::string represents a sequence of bytes, not characters. Typically one uses std::string to store ASCII-encoded text, but they can also be used to store arbitrary binary data. The assumption is that if you have read text from a file into memory, you have already done any text transformations such as standardization of line endings.
The correct course of action here would be to convert line endings when the file is being read. In this case, it looks like you are generating code from a file. The program that reads the file and converts it to code should be eliminating the CR characters.
An alternative approach would be to write a stream wrapper that takes an std::istream and delegates read operations to it, converting line endings on the fly. This approach is viable, though it can be tricky to get right. (Efficiently handling seeking, in particular, will be difficult.)
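If fixing the data at its source isn't practical, a simpler workaround is to strip a trailing CR after each std::getline() call, so the parsing behaves the same no matter where the stream came from. A minimal sketch (the helper name getline_no_cr is mine, and I've added a terminating 0 to the array so it can construct a std::string):

#include <iostream>
#include <sstream>
#include <string>

// Read one line and drop a trailing '\r', if present, so that lines read
// from an untranslated istringstream look the same as lines read from an
// ifstream opened in text mode.
bool getline_no_cr(std::istream& is, std::string& line)
{
    if (!std::getline(is, line))
        return false;
    if (!line.empty() && line.back() == '\r')
        line.pop_back();
    return true;
}

int main()
{
    const char byte_array[] = { 52, 13, 10, 0 };  // "4\r\n" plus a terminator
    std::istringstream iss(byte_array);
    std::string line;
    while (getline_no_cr(iss, line))
        std::cout << '[' << line << "]\n";        // prints [4] with no stray CR
}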

Related

Ignoring remaining newlines and white space when reading input file (C++)

I have a function that reads a text file as input and stores the data in a vector.
It works, as long as the text file doesn't contain any extra new lines or white space.
Here is the code I currently have:
std::ifstream dataStream;
dataStream.open(inputFileName, std::ios_base::in);
std::string pushThis;
while (dataStream >> pushThis) {
    dataVector.push_back(pushThis);
}
For example:
safe mace
bait mate
The above works as an input text file.
This does not work (the same words, but with extra blank lines and trailing whitespace at the end of the file, which don't show up here):
safe mace
bait mate
Is there any way to stop the stream once you reach the final character in the file, while still maintaining separation via white space between words in order to add them to something like a vector, stack, whatever?
i.e. a vector would contain ['safe', 'mace', 'bait', 'mate']
Answer:
The problem came from having two streams: one read loop used !dataStream.eof() as its condition and the other used dataStream >> pushThis.
Fixed so that both use dataStream >> pushThis.
For future reference for myself and others who may find this:
Don't use eof() as a loop condition unless you really want to grab the ending bit(s) of a file (whitespace inclusive); the flag is only set after a read has already failed.
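For reference, a minimal sketch of the working pattern (the function name readWords is mine, not from the question). Extraction with >> skips all whitespace, including blank lines at the end of the file, so the loop stops cleanly at end-of-file:

#include <fstream>
#include <string>
#include <vector>

// Read every whitespace-separated word from a file into a vector.
// operator>> skips spaces, tabs, and newlines, so extra blank lines
// at the end of the file are ignored automatically.
std::vector<std::string> readWords(const std::string& inputFileName)
{
    std::vector<std::string> dataVector;
    std::ifstream dataStream(inputFileName);
    std::string pushThis;
    while (dataStream >> pushThis) {   // extraction fails (and ends the loop) at EOF
        dataVector.push_back(pushThis);
    }
    return dataVector;                 // e.g. {"safe", "mace", "bait", "mate"}
}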

Getline breaks when reading special characters into a wstring

As an exercise, I am making a simple vocabulary trainer. The file I am reading contains the vocabulary, which also includes special characters such as äöü.
However, I have been struggling to read this file without getting mangled characters instead of the appropriate special characters.
I understand why this is happening but not how to correctly solve it.
Here is my attempt:
Unit(const char* file)
    : unitName(getFileName(file), false)
{
    std::wifstream infile(file);
    std::wstring line;
    infile.imbue(std::locale(infile.getloc(),
                 new std::codecvt_utf8<wchar_t, 0x10ffff, std::consume_header>()));
    while (std::getline(infile, line))
    {
        std::wcout << line.c_str() << "\n";
        this->vocabulary.insert(parseLine(line.c_str(), Language::EN_UK, Language::DE));
    }
}
The reading process stops as soon as an entry is reached that contains a special character.
I have even been able to change the code slightly to see where exactly it stops reading:
while (infile.eof() == false)
{
    std::getline(infile, line);
    std::wcout << line.c_str() << "\n";
    this->vocabulary.insert(parseLine(line.c_str(), Language::EN_UK, Language::DE));
}
If I do it like this, the output loops the entry with the special character but stops outputting it right before the special character would appear like so:
Instead of:
cross-class|klassenübergreifend
It says:
cross-class|klassen
cross-class|klassen
cross-class|klassen
cross-class|klassen
.
.
.
This leads me to believe that the special character gets misinterpreted as a line end by getline.
I do not care if I have to use getline or something else, but in order for my parse function to work, the string it gets needs to represent a line in the file. Therefore reading the entire buffer into one string won't work, unless I do the separation myself.
How can I properly and neatly read a utf-8 file line by line?
Note: I looked for other articles on here, but most of them either just use getline or explain why this happens, not how to solve it.
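One approach that sidesteps the wide-stream locale machinery entirely (a sketch under my own assumptions, not necessarily the fix the poster ended up with) is to read the file as raw bytes line by line and convert each UTF-8 line to a std::wstring afterwards, for example with std::wstring_convert (deprecated since C++17 but still available). The file name below is a placeholder:

#include <codecvt>
#include <fstream>
#include <iostream>
#include <locale>
#include <string>

int main()
{
    std::ifstream infile("vocabulary.txt");               // placeholder name; read raw bytes
    std::wstring_convert<std::codecvt_utf8<wchar_t>> utf8conv;

    std::string rawLine;                                   // one UTF-8 encoded line
    while (std::getline(infile, rawLine))
    {
        // from_bytes() throws std::range_error on malformed UTF-8
        std::wstring line = utf8conv.from_bytes(rawLine);
        std::wcout << line << L"\n";                       // each wstring still maps 1:1 to a file line
    }
}

Displaying the result correctly may still require imbuing std::wcout with a suitable locale, but the strings themselves are decoded line by line, so the parse function still receives exactly one file line at a time.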

Differentiating between delimiter and newline in getline

ifstream file;
file.open("file.csv");
string str;
while (file.good())
{
    getline(file, str, ',');
    if (___) // string was split from delimiter
    {
        [do this]
    }
    else // string was split from eol
    {
        [do that]
    }
}
file.close();
I'd like to read from a csv file, and differentiate between what happens when a string is split off due to a new line and what happens when it is split off due to the desired delimiter -- i.e. filling in the ___ in the sample code above.
The approaches I can think of are:
(1) manually adding a character to the end of each line in the original file,
(2) automatically adding a character to the end of each line by writing to another file,
(3) using getline without the delimiter and then making a function to split the resulting string by ','.
But is there a simpler or direct solution?
(I see that similar questions have been asked before, but I didn't see any solutions.)
My preference for clarity of the code would be to use your option 3) - use getline() with the standard '\n' delimiter to read the file into a buffer line by line and then use a tokenizer like strtok() (if you want to work on the C level) or boost::tokenizer to parse the string you read from the file.
You're really dealing with two distinct steps here, first read the line into the buffer, then take the buffer apart to extract the components you're after. Your code should reflect that and by doing so, you're also avoiding having to deal with odd states like the ones you describe where you end up having to do additional parsing anyway.
There is no easy way to determine "which delimiter terminated the string", and it gets "consumed" by getline, so it's lost to you.
Read the line, and split it on commas yourself. You can use std::string::find() to find the commas - however, if your file contains strings that themselves contain commas, you will have to parse the string character by character, since you need to distinguish between commas in quoted text and commas in unquoted text.
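A minimal sketch of that approach, splitting each line on commas with std::string::find() (quoted fields containing commas are deliberately not handled, and the file name comes from the question's snippet):

#include <fstream>
#include <iostream>
#include <string>
#include <vector>

// Split one line on ','. Each call to getline() below consumes exactly one
// line, so the delimiter/newline ambiguity never arises.
std::vector<std::string> splitOnCommas(const std::string& line)
{
    std::vector<std::string> fields;
    std::string::size_type start = 0;
    for (;;) {
        std::string::size_type comma = line.find(',', start);
        if (comma == std::string::npos) {
            fields.push_back(line.substr(start));   // last field on this line
            return fields;
        }
        fields.push_back(line.substr(start, comma - start));
        start = comma + 1;
    }
}

int main()
{
    std::ifstream file("file.csv");
    std::string line;
    while (std::getline(file, line)) {              // one line per iteration
        std::vector<std::string> fields = splitOnCommas(line);
        // fields[0], fields[1], ... all came from the same line; the end of
        // the vector marks where the newline was.
        std::cout << "read " << fields.size() << " fields\n";
    }
}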
Your big problem is that your code does not do what you think it does.
From my reading of the docs, getline with a delimiter treats \n as just another character. It does not split on both the delimiter and the newline.
The efficient way to do this is to write your own custom splitting getline: cppreference has a pretty clear description of what getline does, and mimicking it should be easy (and safer than shooting from the hip; files are tricky).
Then return both the string, and information about why you finished your parse in a second channel.
Now, using getline naively and then splitting is also viable, and will be much faster to write, and probably less error prone to boot.
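For completeness, here is one way such a custom getline could look; the name getfield and its interface are my own, not something from a library:

#include <istream>
#include <string>

// Read characters into 'out' until ',' or '\n' (or end of file). The
// character that ended the field is written to 'ended_by' ('\0' at EOF),
// so the caller can tell a comma apart from an end of line.
std::istream& getfield(std::istream& is, std::string& out, char& ended_by)
{
    out.clear();
    ended_by = '\0';
    char c;
    while (is.get(c)) {
        if (c == ',' || c == '\n') {
            ended_by = c;       // field terminated by a delimiter
            return is;
        }
        out += c;
    }
    return is;                  // EOF: the stream reports failure, but 'out'
                                // may still hold a final unterminated field
}

// Usage sketch, which also processes a final unterminated field:
//   std::string field; char delim;
//   while (getfield(file, field, delim) || !field.empty()) {
//       if (delim == ',')  { /* field ended at a comma */ }
//       else               { /* field ended at end of line (or end of file) */ }
//   }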

rdbuf() reading junk

Using this code I'm reading a string from a file.
pbuf = infile.rdbuf();
size = pbuf->pubseekoff(0, ios::end, ios::in);
pbuf->pubseekpos (0,ios::in);
buf = new char[size];
pbuf->sgetn(buf, size);
str.assign(buf, buf+size);
I have to read the data into a temporary char* buf since sgetn needs a char*, not a string.
So at this point, before asking my actual question: if anyone knows a better way of reading a whole file that may contain whitespace characters into a string (without looping till eof), please tell me.
The content of the file is:
blah blah blah
blah blah in a new line
But what I get is:
blah blah blah
blah blah in a new line═
Playing around with the code, I noticed that the number of strange characters increases as I add more \n characters. It seems that when I get the size of the file, each \n takes 2 bytes of space, but once it is read into the string it only takes 1 byte, and thus my string looks strange. How do I avoid this?
On Windows, the representation of end-of-line in a text file is two bytes: 0x0d, 0x0a. When you use text mode to read from such a file, those two bytes get translated into the single character '\n'. When you use binary mode you're reading raw bytes, and they don't get translated for you. If you don't want them, you'll have to do the translation yourself.
This is due to the standard library implementation turning the standard Windows line ending \r\n into the standard C++ line ending \n.
As #ipc says, you can use this answer to do what you want. (Note: According to the comments, the accepted answer on that question is not actually the best way to do it.)
Alternatively, you can disable the line ending translation by opening the stream in binary mode, like so:
std::ifstream t(fileName, std::ios_base::in | std::ios_base::binary);
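As a sketch of the "read the whole file into a string" part (one widely used idiom, not the only one): open the file in binary mode so no line-ending translation happens, then let the stream buffer do the copying:

#include <fstream>
#include <sstream>
#include <string>

// Read an entire file into a std::string. Binary mode means no \r\n -> \n
// translation, so the string holds exactly the bytes that are in the file.
std::string readWholeFile(const std::string& fileName)
{
    std::ifstream t(fileName, std::ios_base::in | std::ios_base::binary);
    std::ostringstream contents;
    contents << t.rdbuf();      // stream-buffer-to-stream-buffer copy
    return contents.str();
}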

getline() text with UNIX formatting characters

I am writing a C++ program which reads lines of text from a .txt file. Unfortunately the text file is generated by a twenty-something year old UNIX program and it contains a lot of bizarre formatting characters.
The first few lines of the file are plain English text and these are read with no problems. However, whenever a line contains one or more of these strange characters mixed in with the text, that entire line is read as garbage and the data is lost.
The really confusing part is that if I manually delete the first couple of lines, so that the very first character in the file is one of these unusual characters, then everything in the file is read perfectly. The unusual characters just display as little ASCII squiggles (arrows, smiley faces, etc.), which is fine. It seems as though a decision is being made automatically, without my knowledge or consent, based on the first line read.
Based on some googling, I suspected that the issue might be with the locale, but according to the visual studio debugger, the locale property of the ifstream object is "C" in both scenarios.
The code which reads the data is as follows:
// Function to open file at location specified by inFilePath, load and process data
int OpenFile(const char* inFilePath)
{
    string line;
    ifstream codeFile;

    // open text file
    codeFile.open(inFilePath, ios::in);

    // read file line by line
    while (codeFile.good())
    {
        getline(codeFile, line);
        // check non-zero length
        if (line != "")
            ProcessLine(&line[0]);
    }

    // close file
    codeFile.close();
    return 1;
}
If anyone has any suggestions as to what might be going on or how to fix it, they would be very welcome.
From reading about your issues it sounds like you are reading in binary data, which will cause getline() to throw out content or simply skip over the line.
You have a couple of choices:
1. If you simply need lines from the data file, you can first sanitise them by removing all non-printable characters (that is the "official" name for those weird ASCII characters). On UNIX a tool such as strings would help you with that process. You can of course also do this programmatically in your code, by simply reading in X amount of data, storing it in a string, and then removing those characters that fall outside of the standard ASCII character range (see the sketch at the end of this answer). This will most likely cause you to lose any Unicode that may be stored in the file.
2. You change your program to understand the format and basically write a parser that allows you to parse the document in a more sane way.
If you can, I would suggest trying solution number 1, simply to see if the results are sane and can still be used. You mention that this is medical data; do you perchance know what file format this is? If you are trying to find out and have access to a UNIX/Linux machine, you can use the file utility and maybe it can give you a clue (worst case it will tell you it is simply data).
If possible, try getting a "clean" file whose hex dump you can post, so that we can provide better help than what we are currently giving. By "clean" I mean a file with no personally identifying information in it.
For number 2, open the file in binary mode. You mentioned using Windows; binary and non-binary files are handled differently by std::fstream objects there, whereas on UNIX systems this is not the case (on most systems, anyway; I'm sure I'll get a comment about the one system that doesn't match this description).
codeFile.open(inFilePath,ios::in);
would become
codeFile.open(inFilePath, ios::in | ios::binary);
Instead of getline() you will want to become intimately familiar with .read() which will allow unformatted operations on the ifstream.
Reading will be like this:
// This code has not been tested!
char input[1024];
codeFile.read(input, 1024);
int actual_read = codeFile.gcount();
// Here you can process input, up to a maximum of actual_read characters.
//ProcessLine() // We didn't necessarily read a line!
ProcessData(input, actual_read);
The other thing, as mentioned, is that you can change the locale for the stream and change which character it considers a newline; maybe this will fix your issue without requiring the unformatted operations: imbue the stream with a new locale that only knows about the newline. This method may or may not let your getline() work without issues.
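Finally, a sketch of option 1 above (programmatic sanitising); the printable range I keep (0x20 to 0x7E plus tab) and the file name are my assumptions:

#include <algorithm>
#include <fstream>
#include <iostream>
#include <string>

// Strip everything outside printable ASCII (0x20..0x7E) except tab, so the
// "squiggle" bytes never reach the line-based parser. Any non-ASCII text in
// the file is lost, as noted above.
void sanitise(std::string& line)
{
    line.erase(std::remove_if(line.begin(), line.end(),
                              [](unsigned char c) {
                                  return (c < 0x20 || c > 0x7E) && c != '\t';
                              }),
               line.end());
}

int main()
{
    std::ifstream codeFile("data.txt", std::ios::in | std::ios::binary);  // placeholder name
    std::string line;
    while (std::getline(codeFile, line))
    {
        sanitise(line);                  // drops non-printable characters, including any stray '\r'
        if (!line.empty())
            std::cout << line << "\n";   // or hand it to ProcessLine(&line[0]);
    }
}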