rdbuf() reading junk - c++

Using this code I'm reading a string from a file:
pbuf = infile.rdbuf();
size = pbuf->pubseekoff(0, ios::end, ios::in);
pbuf->pubseekpos (0,ios::in);
buf = new char[size];
pbuf->sgetn(buf, size);
str.assign(buf, buf+size);
I have to read the data into a temporary char* buf, since sgetn() needs a char*, not a std::string.
So, before asking my actual question: if anyone knows a better way of reading a string from a file that may contain whitespace characters, please tell me (not looping until EOF).
The content of the file is:
blah blah blah
blah blah in a new line
But what I get is:
blah blah blah
blah blah in a new line═
Playing around with the code, I noticed the number of strange characters increases as I add more \n characters. It seems that when I get the size of the file, each \n character takes 2 bytes of space, but in a string it only takes 1 byte, and thus my string looks strange. How do I avoid this?

On Windows, the representation of end-of-line in a text file is two bytes: 0x0d, 0x0a. When you use text mode to read from such a file, those two bytes get translated into the single character '\n'. When you use binary mode you're reading raw bytes, and they don't get translated for you. If you don't want them, you'll have to do the translation yourself.

This is due to the standard library implementation turning the standard Windows line ending \r\n into the standard C++ line ending \n.
As @ipc says, you can use this answer to do what you want. (Note: according to the comments, the accepted answer on that question is not actually the best way to do it.)
Alternatively, you can disable the line ending translation by opening the stream in binary mode, like so:
std::ifstream t(fileName, std::ios_base::in | std::ios_base::binary);

Related

C++ get the size (in bytes) of EOL

I am reading an ASCII text file. It is defined by the size of each field, in bytes. E.g. each row consists of 10 bytes for some string, 8 bytes for a floating-point value, 5 bytes for an integer, and so on.
My problem is reading the newline character, which has a variable size depending on the OS (usually 2 bytes on Windows and 1 byte on Linux, I believe).
How can I get the size of the EOL character in C++?
For example, in python I can do:
len(os.linesep)
The time honored way to do this is to read a line.
Now, the last char should be \n. Strip it. Then, look at the previous character. It will either be \r or something else. If it's \r, strip it.
For Windows [ASCII] text files, there aren't any other possibilities.
This works even if the file is mixed (e.g. some lines are \r\n and some are just \n).
You can tentatively do this on a few lines, just to be sure you're not dealing with something weird.
After that, you now know what to expect for most of the file. But, the strip method is the general reliable way. On Windows, you could have a file imported from Unix (or vice versa).
I'm not sure that the translation occurs where you think it is. Look at the following code:
ostringstream buf;
buf << std::endl;
string s = buf.str();
int i = strlen(s.c_str());
After this, running on Windows, i == 1. So the end of line definition in std is 1 character. As others have commented, this is the "\n" character.

C++ file reading and string printing

Why do these two print different things? The first prints abcd but the second prints \x61\x62\x63\x64. What do I need to do to make the line from the file to be read as abcd?
std::string line("\x61\x62\x63\x64");
ifstream myfile ("myfile.txt"); //<-- the file contains \x61\x62\x63\x64
std::string line_file;
getline(myfile,line_file);
cout << line << endl;
cout << line_file << endl;
In C++, the backslash is an escape character, which can be used to represent special characters such as newlines (\n) and tabs (\t), or in your case, hexadecimal representations of ASCII characters in string literals. If you actually want to store a backslash in C++ you have to escape it: char c = '\\';. When you read a backslash from a file, it's not treated as an escape character, but as an actual backslash.
It has to do with the input file stream character interpretation:
File streams opened in binary mode perform input and output operations independently of any format considerations. Non-binary files are known as text files, and some translations may occur due to formatting of some special characters (like newline and carriage return characters).
Text file streams are those where the ios::binary flag is not included in their opening mode. These files are designed to store text and thus all values that are input or output from/to them can suffer some formatting transformations, which do not necessarily correspond to their literal binary value.
So the backslashes ('\') are the most probable reason your ifstream is reading and interpreting the bytes from the file differently (as separate characters), as opposed to the string literal, whose escape sequences are interpreted at compile time and are therefore unambiguous.
For further reading, see how fstreams work and learn about backslash escape sequences in character and string literals.

C++ in Windows I can't put the Enter character into a .txt file

I made a program which uses Huffman coding to compress and decompress .txt files (ANSI, Unicode, UTF-8, Big Endian Unicode...).
In the decompression I take characters from a binary tree and I put them into a .txt in binary mode:
ofstream F;
F.open("example.txt", ios::binary);
I have to write to the .txt file in binary mode because I need to decompress every type of .txt file (not only ANSI), so my symbols are single bytes.
On Windows it writes every symbol but doesn't care about the Enter character!
For example, if I have this example.txt file:
Hello
World!
=)
I compress it into example.dat file and I save the Huffman tree into another file (exampletree.dat).
Now to decompress example.dat I take characters from the tree saved in exampletree.dat and I put them into a new .txt file through put() or fwrite(), but on Windows it will be like this:
HelloWorld!=)
On Ubuntu it works perfectly and saves also the Enter character!
It isn't a code error because if I print in the console the decompressed .txt file, it also prints the enter characters! So there is a problem in Windows! Could someone help me?
Did you try opening the file in WordPad or another advanced text editor (e.g. Notepad++) that recognizes a lone LF as a newline character? The default editor, Notepad, would put everything on a single line like you described.
This may not be the solution you are looking for, but the problem looks to be due to having LF as the line break instead of the Windows default CR/LF.
It looks like the difference in end-of-line handling on Linux vs. Windows. The EOL can be just "\n" or "\r\n"; i.e. Windows usually puts 0x0d,0x0a at the end of lines.
On Windows there's a difference between:
fopen( "filename", "wb" );
fopen( "filename", "wt" );
quote:
In text mode, carriage return–linefeed combinations are translated into single linefeeds on input, and linefeed characters are translated to carriage return–linefeed combinations on output

Stop carriage return from appearing in stringstream

I have some text parsing that I'd like to behave identically whether reading from a file or from a stringstream. As such, I'm trying to use a std::istream to perform all the work. In the string version, I'm trying to get it to read from a static in-memory byte array I've created (which was originally from a text file). Let's say the original file looked like this:
4
The corresponding byte array is this:
const char byte_array[] = { 52, 13, 10 };
Where 52 is ASCII for the character 4, then the carriage return, then the linefeed.
When I read directly from the file, the parsing works fine.
When I try to read it in "string mode" like this:
std::istringstream iss(byte_array);
std::istream& is = iss;
I end up getting the carriage returns stuck on the end of the strings I retrieve from the stringstream with this method:
std::string line;
std::getline(is, line);
This screws up my parsing because the string.empty() method no longer gets triggered on "blank" lines -- every line contains at least a 13 for the carriage return even if it's empty in the original file that generated the binary data.
Why is the ifstream behaving differently from the istringstream in this respect? How can I have the istringstream version discard the carriage return just like the ifstream version does?
std::ifstream operates in text mode by default, which means it converts the platform's native line ending (CR LF on Windows) into a single LF. In this case, std::ifstream is removing the CR character before std::getline() ever sees it.
std::istringstream does not do any interpretation of the source string, and passes through all bytes as they are in the string.
It's important to note that std::string represents a sequence of bytes, not characters. Typically one uses std::string to store ASCII-encoded text, but they can also be used to store arbitrary binary data. The assumption is that if you have read text from a file into memory, you have already done any text transformations such as standardization of line endings.
The correct course of action here would be to convert line endings when the file is being read. In this case, it looks like you are generating code from a file. The program that reads the file and converts it to code should be eliminating the CR characters.
An alternative approach would be to write a stream wrapper that takes an std::istream and delegates read operations to it, converting line endings on the fly. This approach is viable, though can be tricky to get right. (Efficiently handling seeking, in particular, will be difficult.)

getline() text with UNIX formatting characters

I am writing a C++ program which reads lines of text from a .txt file. Unfortunately the text file is generated by a twenty-something year old UNIX program and it contains a lot of bizarre formatting characters.
The first few lines of the file are plain, English text and these are read with no problems. However, whenever a line contains one or more of these strange characters mixed in with the text, that entire line is read as characters and the data is lost.
The really confusing part is that if I manually delete the first couple of lines, so that the very first character in the file is one of these unusual characters, then everything in the file is read perfectly. The unusual characters obviously just display as little ASCII squiggles (arrows, smiley faces, etc.), which is fine. It seems as though a decision is being made automatically, without my knowledge or consent, based on the first line read.
Based on some googling, I suspected that the issue might be with the locale, but according to the visual studio debugger, the locale property of the ifstream object is "C" in both scenarios.
The code which reads the data is as follows:
//Function to open file at location specified by inFilePath, load and process data
int OpenFile(const char* inFilePath)
{
    string line;
    ifstream codeFile;
    //open text file
    codeFile.open(inFilePath,ios::in);
    //read file line by line
    while ( codeFile.good() )
    {
        getline(codeFile,line);
        //check non-zero length
        if (line != "")
            ProcessLine(&line[0]);
    }
    //close file
    codeFile.close();
    return 1;
}
If anyone has any suggestions as to what might be going on or how to fix it, they would be very welcome.
From reading about your issues it sounds like you are reading in binary data, which will cause getline() to throw out content or simply skip over the line.
You have a couple of choices:
If you simply need lines from the data file, you can first sanitise them by removing all non-printable characters (that is the "official" name for those weird ASCII characters). On UNIX a tool such as strings would help you with that process.
You can of course also do this programmatically in your code by simply reading in X amount of data, storing it in a string, and then removing those characters that fall outside the standard ASCII range. This will most likely cause you to lose any Unicode that may be stored in the file.
You change your program to understand the format and basically write a parser that allows you to parse the document in a more sane way.
If you can, I would suggest trying solution number 1, simply to see if the results are sane and can still be used. You mention that this is medical data; do you perchance know what file format it is? If you are trying to find out and have access to a UNIX/Linux machine, you can use the utility file, and maybe it can give you a clue (worst case, it will tell you it is simply data).
If possible, try getting a "clean" file whose hex dump you can post, so that we can provide better help than we currently can. By clean I mean that there is no personally identifying information in the file.
For number 2, open the file in binary mode. You mentioned using Windows; binary and non-binary files are handled differently there by std::fstream objects, whereas on UNIX systems they are not (on most systems, anyway; I'm sure I'll get a comment about the one system that doesn't match this description).
codeFile.open(inFilePath,ios::in);
would become
codeFile.open(inFilePath, ios::in | ios::binary);
Instead of getline() you will want to become intimately familiar with .read() which will allow unformatted operations on the ifstream.
Reading will be like this:
// This code has not been tested!
char input[1024];
codeFile.read(input, 1024);
int actual_read = codeFile.gcount();
// Here you can process input, up to a maximum of actual_read characters.
//ProcessLine() // We didn't necessarily read a line!
ProcessData(input, actual_read);
The other thing, as mentioned, is that you can change the locale for the current stream and change the separator it considers a newline; maybe this will fix your issue without requiring the unformatted operations:
Imbue the stream with a new locale that only knows about the newline. This method may or may not let your getline() work without issues.