I am writing code that runs in Windows and outputs a text file that later becomes the input to a program in Linux. This program behaves incorrectly when given files that have newlines that are CR+LF rather than just LF.
I know that I can use tools like dos2unix, but I'd like to skip the extra step. Is it possible to get a C++ program in Windows to use the Linux newline instead of the Windows one?
Yes, you have to open the file in "binary" mode to stop the newline translation.
How you do it depends on how you are opening the file.
Using fopen:
FILE* outfile = fopen( "filename", "wb" );
Using ofstream:
std::ofstream outfile( "filename", std::ios_base::binary | std::ios_base::out );
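To make the ofstream variant concrete, here is a minimal sketch. The helper name write_lf_file and the choice to pass the lines as a vector are illustrative assumptions, not part of the question. Because the stream is opened with std::ios_base::binary, each '\n' should reach the file as a single LF byte on Windows as well as on Linux.

```cpp
#include <fstream>
#include <string>
#include <vector>

// Hypothetical helper: writes each line followed by a single LF byte.
// Binary mode disables the text-mode '\n' -> CR+LF translation on Windows.
bool write_lf_file(const std::string& path, const std::vector<std::string>& lines)
{
    std::ofstream out(path, std::ios_base::binary | std::ios_base::out);
    if (!out)
        return false;
    for (const std::string& line : lines)
        out << line << '\n';   // '\n' rather than std::endl avoids a flush per line
    return static_cast<bool>(out);
}
```

A call such as write_lf_file("filename", {"first", "second"}) would then produce a file that the Linux program can read without running dos2unix first.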
OK, so this is probably not what you want to hear, but here's my $0.02 based on my experience with this:
If you need to pass data between different platforms, in the long run you're probably better off using a format that doesn't care what line breaks look like. If it's text files, users will sometimes mess with them. If by messing the line endings up they cause your application to fail, this is going to be a support intensive application.
Been there, done that, switched to XML. Made the support guys a lot happier.
A much cleaner solution is to use the ASCII escape sequence for the LF character (decimal 10): '\012' or '\x0A' represents an explicit single line feed regardless of platform.
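Note that this does not work, at least on some compilers; for example, on MSVC 2019 16.11.6, both '\012' and '\x0A' are still written out as carriage return plus line feed. It also does not matter there whether a string literal ("\012") or a char literal ('\012') is used. This is expected: on an ASCII system '\n' has the same value (10) as '\012' and '\x0A', and the translation is done by the text-mode stream when the character is written, not by the compiler.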
This method also avoids string length surprises, as a '\n' written to a text-mode stream can expand to two bytes in the file. But so can multibyte Unicode characters in UTF-8, when written directly into a string literal in the source code.
Note also that '\r' is the platform-independent code for a single carriage return (decimal 13). The '\f' character is not the line feed, but rather the form feed (decimal 12), which is not a newline on any platform I am aware of. C does not offer a single-character backslash escape for the line feed, thus the need for the longer octal or hexadecimal escapes.
We know that Windows uses a CR + LF pair as its new line, Unix (including Linux and OS X) uses a single LF, while classic Mac OS (before OS X) used a single CR.
Does that mean that the interpretation of a newline in C and C++ depends upon the execution environment, even though K&R (section 1.5.3 Line Counting) states the following very categorically?
so '\n' stands for the value of the newline character, which is 10 in ASCII.
We know that Windows uses a CR + LF pair as its new line,…
The page you link to does not say Windows uses “CR + LF” as its new line character. It says Windows marks the end of a line in a text file with a carriage-return character and a line-feed character. That does not mean those characters are a new-line character or vice-versa.
Does that mean that the interpretation of a newline…
The new-line character is a new-line character. In C, it is intended to mark a new line. When ASCII is used, ASCII’s line-feed character (code 10) is typically used as C’s new-line character ('\n').
If a C program reads a Windows-style text file using a binary stream, it will see a carriage-return character and a line-feed marking the ends of lines. If a C program reads a Windows-style text file using a text stream (in an environment that supports this), the Windows line-ending indications (carriage-return character and line-feed character) will be automatically translated to C new-line characters.
Conversely, if a C program writes to a Windows-style text file using a text stream, the new-line characters it writes will be translated to Windows line-ending indications. If it writes using a binary stream, it must write the carriage-return characters and the line-feed characters itself.
Does that mean that the interpretation of a newline in C and C++ depends upon the execution environment
No, it does not. The interpretation depends on the tool that reads the file; the platform suggests a convention, but tools can differ. A robust text tool will tolerate various line-ending encodings and handle the differences.
Further, text files originating on one system are accessed and edited on other platforms with different rules.
No, \n always means LF.
On Windows there is LF <-> CR-LF conversion that's performed by the IO streams (FILE *, std::??stream), if the stream is opened in text mode (as opposed to binary mode).
Does that mean that the interpretation of a newline in C and C++ depends upon the execution environment?
The interpretation of the file contents does indeed depend on the execution environment, so that the C programmer does not have to handle the different conventions explicitly:
if the stream is open as binary "rb", no translation is performed and each byte of the file contents is returned directly by getc(). Unix systems handle text files and binary files identically, so no translation occurs for text files either.
on other systems, streams open in text mode "rt" or just "r" are handled in a system-specific way to translate line ending patterns to the single byte '\n', which in ASCII has the value 10. On Windows and MS-DOS systems, this translation converts CR/LF pairs to single bytes '\n', which can be implemented as simply removing CR bytes. This convention was inherited from previous microcomputer operating systems such as Gary Kildall's CP/M, whose APIs were emulated in QDOS, Seattle Computer Products' original 8086 OS that later became MS-DOS.
older Mac systems (before OS X) used to represent line endings with a single CR byte, but Apple changed this when they adopted a Unix kernel for OS X. No translation is performed anymore on macOS.
Antique systems used even more cumbersome representations for text files, such as fixed-length records, and the stream implementation inserted extra '\n' bytes to simulate Unix line endings when such streams were read in text mode.
It is important to understand that this translation process is system specific and is not designed to handle files copied from other systems that use a different convention. Advanced Text tools such as the QEmacs programmers' editor can detect different line endings and perform the appropriate translation regardless of the current execution environment, preserving the convention used in the file, or converting it to another convention under user control.
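As a rough illustration of what such detection can look like (this is not how QEmacs does it; the enum and the function name detect_line_ending are made up for this sketch), the following assumes the whole file has already been read in binary mode into a std::string, so no translation has been applied yet:

```cpp
#include <cstddef>
#include <string>

enum class LineEnding { None, LF, CRLF, CR, Mixed };

// Hypothetical helper: classifies the line endings found in a buffer that
// was read in binary mode (untranslated bytes).
LineEnding detect_line_ending(const std::string& text)
{
    std::size_t lf = 0, crlf = 0, cr = 0;
    for (std::size_t i = 0; i < text.size(); ++i) {
        if (text[i] == '\r') {
            if (i + 1 < text.size() && text[i + 1] == '\n') {
                ++crlf;
                ++i;            // skip the LF that belongs to this CR
            } else {
                ++cr;
            }
        } else if (text[i] == '\n') {
            ++lf;
        }
    }
    const int kinds = (lf > 0) + (crlf > 0) + (cr > 0);
    if (kinds == 0) return LineEnding::None;
    if (kinds > 1)  return LineEnding::Mixed;
    if (crlf > 0)   return LineEnding::CRLF;
    return lf > 0 ? LineEnding::LF : LineEnding::CR;
}
```

A tool could use the result to preserve the file's convention when writing it back, or to convert it on request.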
When newline in string is necessary, I use the \n character
#include <string>

int main()
{
    std::string str = "Hello world\n";
}
Is \n cross-platform? Or do I need to use a macro that adapts its value to the platform?
Especially when str is going to be written to a file or stdout.
As long as you read/write text streams, or files in text mode, \n will be translated into the correct sequence for the platform.
http://en.cppreference.com/w/c/io
In addition to the previous answer: if you need to read a file on Unix that was saved on Windows, or vice versa, you can use this (inputFile is the open input stream):
std::getline(inputFile, inputStr);
inputStr.erase( std::remove( inputStr.begin(), inputStr.end(), '\r' ), inputStr.end() );
inputStr.erase( std::remove( inputStr.begin(), inputStr.end(), '\n' ), inputStr.end() );
It will delete all \r and \n.
Another way to put it is that \n is cross-platform for the compiler. It will compile on all platforms and generate the correct output for each platform. But the output itself is not really cross-platform, since a new line in a text file is represented differently on different platforms. So reading needs extra handling to be platform independent.
Part of the confusion here is that a string with an \n in it is literally just that - a string with an LF byte (0x0A).
The cross-platformyness comes into the equation when considering the reading and writing of streams in normal, i.e. not binary, mode.
Stream objects translate \n to \n, \r or \r\n depending on the platform the executable code has been compiled for.
At least this is my understanding of the situation, please correct me if I am wrong about this. It isn't something I have had to worry about much in the past, since I usually exclusively write code for Linux systems.
Thought I should add this since the question doesn't really make sense, although I get what you are asking.
I'm working with text-files (UTF-8) on Windows and want to read them using C++.
To open the file correctly, I use fopen. As described here, there are two options for opening the file:
Text mode "rt" (Carriage return + Linefeed will automatically be converted into Linefeed; Short "\r\n" becomes "\n").
Binary mode "rb" (The file will be read byte by byte).
Now it becomes tricky. I don't want to open the file in binary mode, since I would lose the correct handling of my UTF-8 characters (and there are special characters in my text-files, which are corrupted when interpreted as ANSI-character). But I also don't want fopen to convert all my CR+LF into LF.
Is there a way to combine the two modes, to read a text-file into a string without tampering with the linefeeds, while still being able to read UTF-8 correctly?
I am aware, that the reverse conversion would happen, if I write it through the same file, but the string is sent to another application that expects Windows-style line-endings.
The difference between opening files in text and binary mode is exactly the handling of line end sequences in text mode or not touching them in binary mode. Nothing more nothing less. Since the ASCII characters use the same code points in Unicode and UTF-8 retains the encoding of ASCII characters (i.e., every ASCII file happens to be a UTF-8 encoded Unicode file) whether you use binary or text mode won't affect the other bytes.
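A minimal sketch of that approach with std::ifstream rather than fopen (the helper name read_file_bytes is an assumption for illustration): the file is read in binary mode, so the UTF-8 byte sequences and the CR+LF pairs both arrive in the string exactly as they are stored on disk.

```cpp
#include <fstream>
#include <iterator>
#include <string>

// Hypothetical helper: returns the file's bytes exactly as stored on disk.
// Yields an empty string if the file cannot be opened.
std::string read_file_bytes(const std::string& path)
{
    std::ifstream in(path, std::ios_base::binary);
    return std::string(std::istreambuf_iterator<char>(in),
                       std::istreambuf_iterator<char>());
}
```

The resulting std::string holds UTF-8 encoded text; whether it displays correctly afterwards depends on how the receiving application interprets those bytes, not on the mode the file was opened in.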
It may be worth having a look at James McNellis's "Unicode in C++" presentation at C++Now 2014.
I understand that QString::split should be used to get a QStringList from a multiline QString. But if I have a file and I don't know if it comes from Mac, Windows or Unix, I'm not sure if QString.split("\n") would work well in all the cases. What is the best way to handle this situation?
If it's acceptable to remove blank lines, you can try:
QString.split(QRegExp("[\r\n]"),QString::SkipEmptyParts);
This splits the string wherever either newline character (line feed or carriage return) is found. Any consecutive line breaks (e.g. \r\n\r\n or \n\n) will be considered multiple delimiters with empty parts between them, which will be skipped.
Emanuele Bezzi's answer misses a couple of points.
In most cases, a string read from a text file will have been read using a text stream, which automatically translates the OS's end-of-line representation to a single '\n' character. So if you're dealing with native text files, '\n' should be the only delimiter you need to worry about. For example, if your program is running on a Windows system, reading input in text mode, line endings will be marked in memory with single \n characters; you'll never see the "\r\n" pairs that exist in the file.
But sometimes you do need to deal with "foreign" text files.
Ideally, you should probably translate any such files to the local format before reading them, which avoids the issue. Only the translation utility needs to be aware of variant line endings; everything else just deals with text.
But that's not always possible; sometimes you might want your program to handle Windows text files when running on a POSIX system (Linux, UNIX, etc.), or vice versa.
A Windows-format text file on a POSIX system will appear to have an extra '\r' character at the end of each line.
A POSIX-format text file on a Windows system will appear to consist of one very long line with embedded '\n' characters.
The most general approach is to read the file in binary mode and deal with the line endings explicitly.
I'm not familiar with QString.split, but I suspect that this:
QString.split(QRegExp("[\r\n]"),QString::SkipEmptyParts);
will ignore empty lines, which will appear either as "\n\n" or as "\r\n\r\n", depending on the format. Empty lines are perfectly valid text data; you shouldn't ignore them unless you're certain that it makes sense to do so.
If you need to deal with text input delimited either by "\n", "\r\n", or "\r", then I think something like this:
QString.split(QRegExp("\n|\r\n|\r"));
would do the job. (Thanks to parsley72's comment for helping me with the regular expression syntax.)
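For comparison, the same splitting rule can be sketched without Qt, using only the standard library. The function name split_lines is made up for this example. It treats "\r\n", "\n", and "\r" each as one delimiter and keeps empty lines.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: splits on "\r\n", "\n", or "\r", keeping empty lines.
std::vector<std::string> split_lines(const std::string& text)
{
    std::vector<std::string> lines;
    std::size_t start = 0;
    for (std::size_t i = 0; i < text.size(); ++i) {
        if (text[i] != '\r' && text[i] != '\n')
            continue;
        lines.push_back(text.substr(start, i - start));
        if (text[i] == '\r' && i + 1 < text.size() && text[i + 1] == '\n')
            ++i;                // treat CR+LF as one delimiter
        start = i + 1;
    }
    lines.push_back(text.substr(start));  // text after the last delimiter
    return lines;
}
```

As with a plain split, input that ends with a line terminator produces a final empty element, which the caller may want to drop.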
Another point: you're not likely to encounter text files that use just '\r' to delimit lines. That's the format used by Mac OS up to version 9. Mac OS X is based on UNIX, and it uses standard UNIX-style '\n' line endings (though it probably tolerates '\r' line endings as well).
I have a multi-line ASCII string coming from some (Windows/UNIX/...) system. Now, I know about differences in newline character in Windows and UNIX (CR-LF / LF) and I want to parse this string on both (CR and LF) characters to detect which newline character(s) is used in this string, so I need to know what "\n" in VS6 C++ means.
My question is, if I write a piece of code in Visual Studio 6 for Windows:
bool FindNewline (string & inputString) {
size_t found;
found = inputString.find ("\n");
return (found != string::npos ? true : false);
}
does this search for CR+LF or only LF? Should I put "\r\n", or does the compiler interpret "\n" as CR+LF?
inputString.find ("\n");
will search for the LF character (alone).
Library routines may 'translate' between CR/LF and '\n' when I/O is performed on a text stream, but inside the realm of your program code, '\n' is just a line-feed.
"\n" means "\n". Nothing else. So you search for LF only. However Microsoft CRT does some conversions for you when you read a file in text mode, so you can write simpler code, sometimes.
All translation between "\n" and "\r\n" happens during I/O. At all other times, "\n" is just that and nothing more.
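A small sketch of that point, assuming an ASCII-compatible execution character set (the case for MSVC, GCC and Clang on common platforms). The helper count_lf is hypothetical. Inside the program '\n' is one character with the value 10, spelled '\012' or '\x0A' if you prefer, and no carriage return exists in memory unless the data itself contains one.

```cpp
#include <cstddef>
#include <string>

// Assumes an ASCII-compatible execution character set, where '\n' is 10.
static_assert('\n' == 10, "'\\n' is LF in this character set");
static_assert('\n' == '\012' && '\n' == '\x0A', "same character, different spelling");

// Hypothetical helper: number of LF characters held in memory.
std::size_t count_lf(const std::string& s)
{
    std::size_t n = 0;
    for (char c : s)
        if (c == '\n')
            ++n;
    return n;
}
```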
Somehow: return (found != string::npos ? true : false); reminds me of another answer I wrote a while back.
Apart from the VS6 part (you really, really want to upgrade this, the compiler is way out of date and Microsoft doesn't really support it anymore), the answer to the question depends on how you are getting the string.
For example, if you read it from a file in text mode, the runtime library will translate \r\n into \n. So if all your text strings are read in text mode via the usual file-based APIs, your search for \n (i.e., newline only) would be sufficient.
If the strings originate in files that are read in binary mode on Windows and are known to contain the DOS/Windows line separator \r\n, then you're better off searching for that character sequence.
EDIT: If you do get it in binary form, yes, ideally you'd have to check for both \r\n and \n. However I would expect that they aren't mixed within one string and still carry the same meaning unless it's a really messed up data format. I would probably check for \r\n first and then \n second if the strings are short enough and scanning them twice doesn't make that much of a difference. If it does, I'd write some code that checks for both \r\n and single \n in a single pass.
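One possible shape for such a single-pass check is sketched here. The struct LineBreak and the function find_first_line_break are invented for this example. It looks for the first '\n' and then checks whether a '\r' sits directly in front of it, so it does not recognise a lone '\r' as a line break.

```cpp
#include <cstddef>
#include <string>

struct LineBreak {
    std::size_t pos;     // index of the first break, or std::string::npos
    std::size_t length;  // 2 for "\r\n", 1 for a lone "\n", 0 if none
};

// Hypothetical helper: finds the first "\r\n" or lone "\n" in one pass.
LineBreak find_first_line_break(const std::string& s)
{
    const std::size_t lf = s.find('\n');
    if (lf == std::string::npos)
        return {std::string::npos, 0};
    if (lf > 0 && s[lf - 1] == '\r')
        return {lf - 1, 2};
    return {lf, 1};
}
```

The length field tells the caller which convention the string uses, and pos + length is where the next line starts.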