C++ file reading and string printing

Why do these two print different things? The first prints abcd but the second prints \x61\x62\x63\x64. What do I need to do to make the line read from the file come out as abcd?
std::string line("\x61\x62\x63\x64");
std::ifstream myfile("myfile.txt"); // <-- the file contains \x61\x62\x63\x64
std::string line_file;
std::getline(myfile, line_file);
std::cout << line << std::endl;
std::cout << line_file << std::endl;

In C++, the backslash is an escape character in string and character literals. It can introduce special characters such as newlines (\n) and tabs (\t), or, as in your case, hexadecimal representations of bytes (\x61). These escapes are decoded by the compiler, at compile time. If you actually want to store a backslash in C++ you have to escape it: char c = '\\';. When you read a backslash from a file at runtime, it is not treated as an escape character but as an actual backslash, followed by the ordinary characters x, 6, 1, and so on.
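If you need the file's literal \x61\x62\x63\x64 to come out as abcd, you therefore have to decode the escapes yourself after reading. A minimal sketch, assuming the file only ever contains two-digit \xNN escapes (the helper name decode_hex_escapes is invented for illustration):

#include <cctype>
#include <cstddef>
#include <fstream>
#include <iostream>
#include <string>

// Decode two-digit "\xNN" escapes in a string read at runtime.
// Anything that is not a well-formed escape is copied through as-is.
std::string decode_hex_escapes(const std::string& in)
{
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        if (in[i] == '\\' && i + 3 < in.size() && in[i + 1] == 'x'
            && std::isxdigit(static_cast<unsigned char>(in[i + 2]))
            && std::isxdigit(static_cast<unsigned char>(in[i + 3]))) {
            out += static_cast<char>(std::stoi(in.substr(i + 2, 2), nullptr, 16));
            i += 3; // consume "xNN"; the loop's ++i then moves past the escape
        } else {
            out += in[i];
        }
    }
    return out;
}

int main()
{
    std::ifstream myfile("myfile.txt"); // the file contains \x61\x62\x63\x64
    std::string line_file;
    std::getline(myfile, line_file);
    std::cout << decode_hex_escapes(line_file) << std::endl; // prints abcd
}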

It has to do with how the input file stream interprets the characters it reads:
File streams opened in binary mode perform input and output operations independently of any format considerations. Non-binary files are known as text files, and some translations may occur due to formatting of some special characters (like newline and carriage return characters).
Text file streams are those where the ios::binary flag is not included in their opening mode. These files are designed to store text and thus all values that are input or output from/to them can suffer some formatting transformations, which do not necessarily correspond to their literal binary value.
So the difference comes down to where the interpretation happens. The string literal was decoded by the compiler, so line already holds the four bytes abcd, while your ifstream reads the bytes from the file literally, as separate characters, so line_file holds the backslashes, the x's and the digits with no escape processing applied.
For further reading, see how fstreams work and read up on backslash escape sequences in character and string literals.

Related

C++ Reading and printing newline characters from a file

I am keeping a large repository of strings in a character-delimited file. Currently, I am reading the strings into string variables, and then later printing them.
The problem I'm facing is how to store and print new line characters. In the file, if the string, for example, is:
"Hello this is \n\n a new line"
then the literal characters \n are printed to my program's terminal when I print the string; however, I would like actual new lines to be printed.
Is this a matter of processing the strings character by character, or is there a proper way to read the strings into the string variables that will allow this to work?
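Post-processing after reading is one straightforward way to handle this. A minimal sketch, assuming the file really stores the two literal characters \ and n (the helper name expand_newlines is invented for illustration):

#include <string>

// Replace each literal two-character sequence \n with a real newline.
// Extend the same pattern for other escapes (\t, \\, ...) as your
// file format requires.
std::string expand_newlines(std::string s)
{
    std::string::size_type pos = 0;
    while ((pos = s.find("\\n", pos)) != std::string::npos) {
        s.replace(pos, 2, "\n");
        ++pos; // continue searching after the inserted newline
    }
    return s;
}

Reading character by character is not required; reading whole lines or delimited fields and then expanding the escapes works just as well.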

Read text-file in C++ with fopen without linefeed conversion

I'm working with text-files (UTF-8) on Windows and want to read them using C++.
To open the file correctly, I use fopen. As described here, there are two options for opening the file:
Text mode "rt" (Carriage return + Linefeed will automatically be converted into Linefeed; Short "\r\n" becomes "\n").
Binary mode "rb" (The file will be read byte by byte).
Now it becomes tricky. I don't want to open the file in binary mode, since I would lose the correct handling of my UTF-8 characters (and there are special characters in my text files, which are corrupted when interpreted as ANSI characters). But I also don't want fopen to convert all my CR+LF into LF.
Is there a way to combine the two modes, to read a text-file into a string without tampering with the linefeeds, while still being able to read UTF-8 correctly?
I am aware that the reverse conversion would happen if I wrote the string back out the same way, but it is sent to another application that expects Windows-style line endings.
The difference between opening files in text mode and binary mode is exactly this: line-end sequences are translated in text mode and left untouched in binary mode. Nothing more, nothing less. Since ASCII characters use the same code points in Unicode, and UTF-8 retains the encoding of ASCII characters (i.e., every ASCII file happens to be a UTF-8-encoded Unicode file), whether you use binary or text mode won't affect the other bytes.
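To illustrate, here is a minimal sketch of reading a file byte-for-byte with fopen in "rb" mode. Both the CR+LF pairs and the multi-byte UTF-8 sequences arrive in the string exactly as they are on disk (read_file_raw is an invented name, and error handling is kept minimal):

#include <cstdio>
#include <string>

// Read the whole file without any translation: no CRLF -> LF
// conversion, and UTF-8 byte sequences pass through unchanged.
std::string read_file_raw(const char* path)
{
    std::string contents;
    if (FILE* f = std::fopen(path, "rb")) {
        char buf[4096];
        std::size_t n;
        while ((n = std::fread(buf, 1, sizeof buf, f)) > 0)
            contents.append(buf, n);
        std::fclose(f);
    }
    return contents;
}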
It may be worth having a look at James McNellis's "Unicode in C++" presentation from C++Now 2014.

Stop carriage return from appearing in stringstream

I have some text parsing that I'd like to behave identically whether the input is read from a file or from a stringstream. As such, I'm trying to use a std::istream to perform all the work. In the string version, I'm trying to get it to read from a static in-memory byte array I've created (which was originally generated from a text file). Let's say the original file looked like this:
4
The corresponding byte array is this:
const char byte_array[] = { 52, 13, 10, 0 };
Where 52 is ASCII for the character 4, followed by the carriage return, the linefeed, and a terminating NUL (needed because the array is handed to std::istringstream as a C string).
When I read directly from the file, the parsing works fine.
When I try to read it in "string mode" like this:
std::istringstream iss(byte_array);
std::istream& is = iss;
I end up getting the carriage returns stuck on the end of the strings I retrieve from the stringstream with this method:
std::string line;
std::getline(is, line);
This screws up my parsing because the string.empty() method no longer gets triggered on "blank" lines -- every line contains at least a 13 for the carriage return even if it's empty in the original file that generated the binary data.
Why is the ifstream behaving differently from the istringstream in this respect? How can I have the istringstream version discard the carriage return just like the ifstream version does?
std::ifstream operates in text mode by default, which on Windows means that CR+LF line endings are converted to a single LF. In this case, std::ifstream removes the CR character before std::getline() ever sees it.
std::istringstream does not do any interpretation of the source string, and passes through all bytes as they are in the string.
It's important to note that std::string represents a sequence of bytes, not characters. Typically one uses std::string to store ASCII-encoded text, but they can also be used to store arbitrary binary data. The assumption is that if you have read text from a file into memory, you have already done any text transformations such as standardization of line endings.
The correct course of action here would be to convert line endings when the file is being read. In this case, it looks like you are generating code from a file. The program that reads the file and converts it to code should be eliminating the CR characters.
An alternative approach would be to write a stream wrapper that takes a std::istream and delegates read operations to it, converting line endings on the fly. This approach is viable, though it can be tricky to get right. (Efficiently handling seeking, in particular, is difficult.)
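For the common case, though, a much simpler workaround is to normalize immediately after reading each line: treat a trailing '\r' as part of the line terminator and drop it. A sketch (the wrapper name getline_any is invented):

#include <istream>
#include <string>

// getline that accepts both LF and CRLF line endings, whether the
// source is a text-mode ifstream or an istringstream over raw bytes.
std::istream& getline_any(std::istream& is, std::string& line)
{
    std::getline(is, line);
    if (!line.empty() && line.back() == '\r')
        line.pop_back(); // drop the CR left over from a CRLF pair
    return is;
}

With this, an empty line in the original file yields an empty string again, so the string.empty() check works for both stream types.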

QString::split() and "\r", "\n" and "\r\n" convention

I understand that QString::split should be used to get a QStringList from a multiline QString. But if I have a file and I don't know whether it comes from Mac, Windows, or Unix, I'm not sure that splitting on "\n" would work well in all cases. What is the best way to handle this situation?
If it's acceptable to remove blank lines, you can try:
str.split(QRegExp("[\r\n]"), QString::SkipEmptyParts);
This splits the string wherever any newline character (either line feed or carriage return) is found. Consecutive line breaks (e.g. \r\n\r\n or \n\n) are treated as multiple delimiters with empty parts between them, which are skipped.
Emanuele Bezzi's answer misses a couple of points.
In most cases, a string read from a text file will have been read using a text stream, which automatically translates the OS's end-of-line representation to a single '\n' character. So if you're dealing with native text files, '\n' should be the only delimiter you need to worry about. For example, if your program is running on a Windows system, reading input in text mode, line endings will be marked in memory with single \n characters; you'll never see the "\r\n" pairs that exist in the file.
But sometimes you do need to deal with "foreign" text files.
Ideally, you should probably translate any such files to the local format before reading them, which avoids the issue. Only the translation utility needs to be aware of variant line endings; everything else just deals with text.
But that's not always possible; sometimes you might want your program to handle Windows text files when running on a POSIX system (Linux, UNIX, etc.), or vice versa.
A Windows-format text file on a POSIX system will appear to have an extra '\r' character at the end of each line.
A POSIX-format text file on a Windows system will appear to consist of one very long line with embedded '\n' characters.
The most general approach is to read the file in binary mode and deal with the line endings explicitly.
I'm not familiar with QString::split, but I suspect that this:
str.split(QRegExp("[\r\n]"), QString::SkipEmptyParts);
will ignore empty lines, which will appear either as "\n\n" or as "\r\n\r\n", depending on the format. Empty lines are perfectly valid text data; you shouldn't ignore them unless you're certain that it makes sense to do so.
If you need to deal with text input delimited either by "\n", "\r\n", or "\r", then I think something like this:
str.split(QRegExp("\n|\r\n|\r"));
would do the job. (Thanks to parsley72's comment for helping me with the regular expression syntax.)
Another point: you're not likely to encounter text files that use just '\r' to delimit lines. That's the format used by classic Mac OS, up to version 9. Mac OS X is based on UNIX, and it uses standard UNIX-style '\n' line endings (though it probably tolerates '\r' line endings as well).
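Putting the above together, a sketch using the Qt 4/5-era QRegExp API from the answers (Qt 6 removed QRegExp in favour of QRegularExpression; the helper name splitLines is invented):

#include <QRegExp>
#include <QString>
#include <QStringList>

// Split on any of the three line-ending conventions while keeping
// empty lines. "\r\n" is tried before "\r" so a CRLF pair is consumed
// as one delimiter instead of producing a spurious empty line.
QStringList splitLines(const QString& text)
{
    return text.split(QRegExp("\r\n|\r|\n"));
}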

C++ change newline from CR+LF to LF

I am writing code that runs in Windows and outputs a text file that later becomes the input to a program in Linux. This program behaves incorrectly when given files that have newlines that are CR+LF rather than just LF.
I know that I can use tools like dos2unix, but I'd like to skip the extra step. Is it possible to get a C++ program in Windows to use the Linux newline instead of the Windows one?
Yes, you have to open the file in "binary" mode to stop the newline translation.
How you do it depends on how you are opening the file.
Using fopen:
FILE* outfile = fopen( "filename", "wb" );
Using ofstream:
std::ofstream outfile( "filename", std::ios_base::binary | std::ios_base::out );
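For example, a complete sketch with std::ofstream: in binary mode each '\n' reaches the file as a single LF byte, even on Windows (the output file name is arbitrary):

#include <fstream>

int main()
{
    // Binary mode disables the newline translation of Windows text
    // streams, so "\n" is written to the file as a lone LF (0x0A).
    std::ofstream outfile("output.txt",
                          std::ios_base::binary | std::ios_base::out);
    outfile << "first line\n" << "second line\n";
}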
OK, so this is probably not what you want to hear, but here's my $0.02 based on my experience with this:
If you need to pass data between different platforms, in the long run you're probably better off using a format that doesn't care what line breaks look like. If it's text files, users will sometimes mess with them, and if messed-up line endings can make your application fail, it is going to be a support-intensive application.
Been there, done that, switched to XML. Made the support guys a lot happier.
A much cleaner solution is to use the ASCII escape sequence for the LF character (decimal 10): '\012' or '\x0A' represents an explicit single line feed regardless of platform.
Note that on at least some compilers this does not work: with MSVC 2019 16.11.6, for example, both '\012' and '\x0A' are still translated to carriage return plus line feed when written through a text-mode stream. Nor does it matter there whether a string literal ("\012") or a char literal ('\012') is used.
This method also avoids string-length surprises, as '\n' can expand to two bytes on output. (Multi-byte Unicode characters can cause similar surprises when written directly into a UTF-8 string literal in the source code.)
Note also that '\r' is the platform-independent escape for a single carriage return (decimal 13). The '\f' character is not the line feed but the form feed (decimal 12), which is not a newline on any platform I am aware of. C offers no single-character backslash escape that means the line feed specifically ('\n' denotes the implementation's newline), hence the need for the longer octal or hexadecimal escapes.
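To see where the translation actually happens: on ASCII-based platforms the escape spellings all denote the same in-memory value, and it is the text-mode stream that rewrites that value on output. A quick check:

#include <cassert>

int main()
{
    // All three spell the line feed byte, decimal 10, on ASCII-based
    // platforms; none of the literals is ever two characters long.
    static_assert('\n' == '\012', "newline is the LF byte here");
    assert('\012' == '\x0A');
}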