Portable end of line (newline) - c++

It's been an unpleasant surprise that '\n' is replaced with "\r\n" on Windows; I did not know that. (I am guessing it is also replaced on Mac...)
Is there an easy way to ensure that Linux, Mac and Windows users can easily exchange text files?
By easy way I mean: without writing the file in binary mode or testing and replacing the end-of-line chars myself (or with some third-party program/code). This issue affects my C++ program doing the text file I/O.

Apologies for the partial overlap with other answers, but for the sake of completeness:
Myth: endl is 'more portable' since it writes the line ending depending on the platform convention.
Truth: endl is defined to write \n to the stream and then call flush. So in fact you almost never want to use it. All \n that are written to a text-mode stream are implicitly converted to \r\n by the CRT behind the scenes, whether you use os<<endl, os<<'\n', or fputs("\n",file).
Myth: You should open files in text mode to write text and in binary mode to write binary data.
Truth: Text mode exists in the first place because some time ago there were file systems that distinguished between text files and binary files. That's no longer true on any sane platform I know of. You can write text to binary-opened files just as well; you just lose the automatic \n -> \r\n conversion on Windows. However, this conversion causes more harm than good. Among other things, it makes your code behave differently on different platforms, and tell/seek become problematic to use. Therefore it's best to avoid this automatic conversion. Note that POSIX does not distinguish between binary and text mode.
How to do text: Open everything in binary mode and use the plain-old \n. You'll also need to worry about the encoding. Standardize on UTF-8 for Unicode-correctness. Use UTF-8 encoded narrow-strings internally, instead of wchar_t which is different on different platforms. Your code will become easier to port.
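A minimal sketch of that recipe (the file name and contents here are just placeholders):

#include <fstream>
#include <string>

int main() {
    // Open in binary mode so '\n' is written out unchanged on every platform.
    std::ofstream out("notes.txt", std::ios::binary);

    // UTF-8 encoded bytes pass through untouched as well.
    std::string line = "hello, portable world";
    out << line << '\n';   // plain '\n'; no std::endl, so no forced flush
}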
Tip: You can force MSVC to open all files in binary mode by default. It should work as follows:
#include <fcntl.h>   // _O_BINARY
#include <stdlib.h>  // _fmode
#include <fstream>

int main() {
    _fmode = _O_BINARY;       // make binary the CRT's default translation mode
    std::ofstream f("a.txt"); // opens in binary mode
}
EDIT: As of 2021, Windows 10 Notepad understands UNIX line endings.

The issue isn’t with endl at all, it’s that text streams reformat line breaks depending on the system’s standard.
If you don’t want that, simply don’t use text streams – use binary streams. That is, open your files with the ios::binary flag.
That said, if the only issue is that users can exchange files, I wouldn't bother with the output mode at all; I'd rather make sure that your program can read different formats without choking. That is, it should accept different line endings.
This is by the way what any decent text editor does (but then again, the default notepad.exe on Windows is not a decent text editor, and won’t correctly handle Unix line breaks).
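A tolerant line reader along those lines might look like this sketch; it assumes the stream was opened in binary mode (so no CRT translation happens behind the scenes):

#include <istream>
#include <string>

// Read one line, accepting both "\n" and "\r\n" endings.
std::istream& read_line(std::istream& in, std::string& line) {
    std::getline(in, line);                     // consumes up to and including '\n'
    if (!line.empty() && line.back() == '\r')   // tolerate a Windows "\r\n" ending
        line.pop_back();
    return in;
}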

If you really just want an ASCII LF, the easiest way is to open the file in binary mode: in non-binary mode \n is replaced by a platform-specific end-of-line sequence (e.g. it may be replaced by an LF/CR or a CR/LF sequence; on UNIXes it typically is just LF). In binary mode this is not done. Turning off the replacement is also the only effect of binary mode.
BTW, using endl is equivalent to writing a \n followed by flushing the stream. The typically unintended flush can become a major performance problem. Thus, endl should be used rarely and only when the flush is intended.

Related

How to get the platform-specific text line terminator string in C++20?

If there is a binary-mode std::ostream, how could one output a platform-specific line-terminator string (e.g. \r\n on Windows)? Alternatively, is there a standard way to get that terminator as a string?
So far, I have considered the following options:
Just use std::endl: this would just output \n regardless of the platform, and the implied std::flush hurts performance quite severely.
The old C-style #ifdef _WIN32 ... approach, with one branch per possible terminator string. Fragile and exhausting.
A more modern branching alternative with constexpr conditionals (a rough sketch follows this list). Slightly less fragile, still exhausting.
Output std::endl to an std::stringstream and get the string: This does not seem to work when using mingw to compile a 64-bit Windows executable on Linux and testing using WINE - I still only get the single \n. That might be an environment problem, though.
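For illustration, the branching approach from the last two options might look roughly like this sketch, covering only the common LF/CRLF cases:

#include <string_view>

// Pick a terminator at compile time; only the common cases are covered.
constexpr std::string_view line_terminator() {
#if defined(_WIN32)
    return "\r\n";
#else
    return "\n";
#endif
}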
Is there a somewhat portable way that would work?
UPDATE:
It was pointed out in the comments that having a line terminator is not a given; there are platforms that delimit text lines in other ways, such as length-prefixed string records. So a truly generic solution would be one that answers the question:
How to write a single platform-specific text line to a binary-mode output stream?
Is there, for example, a way to wrap a text-mode stream around an existing binary stream?
It turns out that something like this may not actually be possible, intentionally. C++ needs to support a multitude of platforms, and some of them have fundamentally different models for text files.
E.g. on OpenVMS text files can be stored in a record-per-line format using the Record Management Services subsystem. This makes the concept of line terminator strings irrelevant, and it also means that text-mode files are not just binary files with some additional assumptions on top of them.

C++ read and write UTF-32 files

I want to write a language learning app for myself using Visual Studio 2017, C++ and the Windows API (formerly known as Win32). The operating system is the latest Windows 10 insider build, and backwards compatibility is a non-issue. Since I assume English to be the mother tongue of the user and the language I am currently interested in is another European language, ASCII might suffice. But I want to future-proof it (more exotic languages) and I also want to try my hands on UTF-32. I have previously used both UTF-8 and UTF-16, though I have more experience with the latter.
Thanks to std::basic_string, it was easy to figure out how to get an UTF-32 string:
typedef std::basic_string<char32_t> stringUTF32;
Since I am using the WinAPI for all GUI stuff, I need to do some conversion between UTF-32 and UTF-16.
Now to my problem: Since UTF-32 is not widely used because of its inefficiencies, there is hardly any material about it on the web. To avoid unnecessary conversions, I want to save my vocabulary lists and other data as UTF-32 (for all UTF-8 advocates/evangelists, the alternative would be UTF-16). The problem is, I cannot find how to write and open files in UTF-32.
So my question is: How to write/open files in UTF-32? I would prefer if no third-party libraries are needed unless they are a part of Windows or are usually shipped with that OS.
If you have a char32_t sequence, you can write it to a file using a std::basic_ofstream<char32_t> (which I will refer to as u32_ofstream, but this typedef does not exist). This works exactly like std::ofstream, except that it writes char32_ts instead of chars. But there are limitations.
Most standard library types that have an operator<< overload are templated on the character type. So they will work with u32_ofstream just fine. The problem you will encounter is for user types. These almost always assume that you're writing char, and thus are defined as ostream &operator<<(ostream &os, ...);. Such stream output can't work with u32_ofstream without a conversion layer.
But the big issue you're going to face is endian issues. u32_ofstream will write char32_t as your platform's native endian. If your application reads them back through a u32_ifstream, that's fine. But if other applications read them, or if your application needs to read something written in UTF-32 by someone else, that becomes a problem.
The typical solution is to use a "byte order mark" as the first character of the file. Unicode even has a specific codepoint set aside for this: \U0000FEFF.
The way a BOM works is like this. When writing a file, you write the BOM before any other codepoints.
When reading a file of an unknown encoding, you read the first codepoint as normal. If it comes out equal to the BOM in your native encoding, then you can read the rest of the file as normal. If it doesn't, then you need to read the file and endian-convert it before you can process it. That process would look a bit like this:
#include <fstream>
#include <sstream>

// These typedefs do not exist in the standard library; define them yourself.
typedef std::basic_ifstream<char32_t> u32_ifstream;

constexpr char32_t native_bom = U'\U0000FEFF';

u32_ifstream is(...);
char32_t bom;
is >> bom;
if (native_bom == bom)
{
    process_stream(is);
}
else
{
    std::basic_stringstream<char32_t> char_stream;
    // Load the rest of `is` and endian-convert it into `char_stream`.
    process_stream(char_stream);
}
I am currently interested in is another European language, [so] ASCII might suffice
No. Even in plain English. You know how Microsoft Word creates “curly quotes”? Those are non-ASCII characters. All those letters with accents and umlauts in e.g. French or English are non-ASCII characters.
I want to future-proof it
UTF-8, UTF-16 and UTF-32 all can encode every Unicode code point. They’re all future-proof. UTF-32 does not have an advantage over the other two.
Also for future proofing: I’m quite sure some scripts use characters (the technical term is ‘grapheme clusters’) consisting of more than one code point. A cursory search turns up Playing around with Devanagari characters.
A downside of UTF-32 is support in other tools. Notepad won’t open your files. Beyond Compare won’t. Visual Studio Code… nope. Visual Studio will, but it won’t let you create such files.
And the Win32 API: it has a function MultiByteToWideChar which can convert UTF-8 to UTF-16 (which you need to pass in to all Win32 calls) but it doesn’t accept UTF-32.
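For reference, a UTF-8 to UTF-16 conversion through that API might look roughly like this (a sketch with minimal error handling; the helper name is made up):

#include <windows.h>
#include <string>

// Convert a UTF-8 string to UTF-16 for passing to Win32 calls.
std::wstring utf8_to_utf16(const std::string& utf8) {
    if (utf8.empty()) return std::wstring();
    int len = MultiByteToWideChar(CP_UTF8, 0, utf8.data(),
                                  static_cast<int>(utf8.size()), nullptr, 0);
    std::wstring utf16(static_cast<size_t>(len), L'\0');
    MultiByteToWideChar(CP_UTF8, 0, utf8.data(),
                        static_cast<int>(utf8.size()), &utf16[0], len);
    return utf16;
}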
So my honest answer to this question is, don’t. Otherwise follow Nicol’s answer.

What should I do if I want to use functions that are not standard and come from a compiler-specific API?

I have been using the very very old Turbo C++ 3.0 compiler.
During the usage of this compiler, I have become used to functions like getch(), getche() and most importantly clrscr().
Now I have started using Visual C++ 2010 Express. This is causing a lot of problems, as most of these functions (I found this out now) are non-standard and are not available in Visual C++.
What am I to do now?
Always try to avoid them if possible, or try their alternatives:
for getch() --- cin.get()
for clrscr() --- system("cls") // but try to avoid system() calls
And for any others you can search for them.
The real question is what you are trying to do, globally. getch and clrscr have never been portable. If you're trying to create masks or menus in a console window, you should look into curses or ncurses: these offer a portable solution for such things. If it's just paging, you can probably get away with simply outputting a large number of '\n' (for clrscr), and std::cin.get() for getch. (But beware that this will only return once the user has entered a new line, and will only read one character of the line, leaving the rest in the buffer. It is definitely not a direct replacement for getch. In fact, std::getline or std::cin.ignore might be better choices; a sketch follows.)
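A minimal "press Enter to continue" helper in that spirit might look like this (a sketch, not a drop-in replacement for getch):

#include <iostream>
#include <limits>

// Wait for the user to press Enter, then discard the rest of the line.
void wait_for_enter() {
    std::cout << "Press Enter to continue..." << std::flush;
    std::cin.ignore(std::numeric_limits<std::streamsize>::max(), '\n');
}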
Edit:
Adding some more possibilities:
First, as Joachim Pileborg suggested in his comment, if portability is an issue, there may be platform-specific functions for much of what you are trying to do. If all you're concerned about is Windows (and it probably is, since system("cls") and getch() don't work elsewhere), then his comment may be a sufficient answer.
Second, for many consoles (including xterm and a console window under Windows), the escape sequence "\x1b[2J" should clear the screen. (Since '[' is not a hexadecimal digit, the sequence can be written as a single string literal without the \x escape swallowing the following characters.) Don't forget about possible issues of redirection and flushing, however; a small helper is sketched below.
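Such a helper might be sketched like this; it only does anything useful on terminals that actually honour ANSI escape sequences:

#include <iostream>

// Clear the screen on terminals that understand ANSI escape sequences.
void clear_screen() {
    std::cout << "\x1b[2J"    // erase the whole display
              << "\x1b[H";    // move the cursor back to the top-left corner
    std::cout.flush();        // make sure the sequence reaches the terminal
}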
Finally, if you're doing anything non-trivial, you should look into curses (or ncurses; they're the same thing, but with different implementations). It's a bit more effort to put into action (you need explicit initialization, etc.), but it has a getch function which does exactly what you want, and it also has functions for explicitly positioning the cursor, etc., which may also make your code simpler. (The original curses was developed to support the original vi editor, at UCB. Any editor-like task not being developed in its own window would benefit enormously from it.)
Well, people, I have found the one best solution that can be used everywhere.
I simply googled the definitions of clrscr() and gotoxy(), created a header file, and added these definitions to it. Thus, I can include this file and do everything that I was doing before.
But I have a query too.
windows.h is there in the definition. Suppose I compile the file and make an exe file. Then will I be able to run it on a Linux machine?
According to me the answer has to be yes. But please tell me if I am wrong, and also tell me why I am wrong.

Nasty unicode and C++: Easy way to read ASCII/UTF-8/UTF-16 BE/LE text file

Sorry if the question is stupid and has been asked thousands of times, but I spent a few hours googling it and could not find an answer.
I want to read in text file which can be any of these: ASCII/UTF-8/UTF-16 BE/LE
I assume that if the file is Unicode then a BOM is always present.
Is there any automatic way (STL, Boost or something else) to use a file stream or anything to read in the file line by line without checking BOMs and always getting UTF-8 to put into std::string?
In this project I am using Windows only. It would also be good to know how to solve it for other platforms.
Thanks in advance!
libiconv
BOMs are often not present in UTF-8 files. As a consequence, you can't know if a file is ASCII or UTF-8 until after you have read the data and found a byte which isn't ASCII.
Furthermore, as you are on Windows, do you intend to handle ISO-8859-1 and Windows-1252 as well? The latter is often the default for files from things like Notepad and WordPad. In these cases, things are even worse: one can only distinguish heuristically between such encodings, other encodings and UTF-8.
The ICU library has a character set detection system that you can use to guess the likely character encoding of a file. I do not believe that iconv has such a function.
ICU is generally available and already installed on Mac and Linux, but, alas, not on Windows. Such a routine might be available in the Win32 API as well.
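If you do pull in ICU, its charset detector is used roughly like this (a sketch against ICU's C API; the helper name is made up and error handling is minimal):

#include <unicode/ucsdet.h>
#include <string>

// Guess the encoding of a raw byte buffer with ICU's charset detector.
// Returns a name such as "UTF-8" or "UTF-16BE"; the detection is heuristic.
std::string detect_encoding(const std::string& bytes) {
    UErrorCode status = U_ZERO_ERROR;
    UCharsetDetector* detector = ucsdet_open(&status);
    ucsdet_setText(detector, bytes.data(), static_cast<int32_t>(bytes.size()), &status);
    const UCharsetMatch* match = ucsdet_detect(detector, &status);
    std::string name = (U_SUCCESS(status) && match) ? ucsdet_getName(match, &status) : "";
    ucsdet_close(detector);
    return name;
}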

Windows Codepage Interactions with Standard C/C++ filenames?

A customer is complaining that our code used to write files with Japanese characters in the filename but no longer works in all cases. We have always just used good old char * strings to represent filenames, so it came as a bit of a shock to me that it ever worked, and we haven't done anything I am aware of that should have made it stop working. I had them send me a file with an embedded filename in it exported from our software, and it looks like the strings use hex characters 82 and 83 as the first character of a double-byte sequence to represent the Japanese characters. Poking around online leads me to believe this is probably SHIFT_JIS and/or Windows codepage 932.
It looks to me like what is happening is previously both fopen and ofstream::open accepted filenames using this codepage; now only fopen does. I've checked the Visual Studio fopen docs, and I see no hint of what makes an acceptable string to pass to fopen.
In the short run, I'm hoping someone can shed some light on the specific Windows fopen versus ofstream::open issue for me. In the long run, I'd really like to know the accepted way of opening Unicode (and other?) filenames in C++, on Windows, Linux, and OS X.
Edited to add: I believe that the opens that work are done in the "C" locale, whereas the ones that do not work are done in whatever the customer's default locale is. However, that has been the case for years now, and the old version of the program still works today on their system, so this seems a long shot for explaining the issue we are seeing.
Update: I sent off a small test program to the customer. It has verified that fopen works fine with the SHIFT_JIS filename, and std::ofstream does not. This is in Visual Studio 2005, and happened regardless of whether I used the default locale or the "C" locale.
I'm still interested if anyone has an explanation for this behavior (and why it mysteriously changed -- perhaps a service pack of VS2005?) and hoping to put together a comprehensive "best practices" for handling Unicode filenames in portable C++ code.
Functions like fopen or ofstream::open take the file name as char *, but that is interpreted as being in the system code page.
It means that it can be a Japanese character represented as Shift-JIS (cp932), or Chinese Simplified (GBK/cp936), Korean, Arabic, Russian, you name it (as long as it matches the OS system code page).
It also means that it can use Japanese file names on a Japanese system only.
Change the system code page and the application "stops working".
I suspect this is what happens here (no big changes in Windows since Win 2000, in this area).
This is how you change the system code page: http://www.mihai-nita.net/article.php?artID=20050611a
In the long run you might consider moving to Unicode (and using _wfopen, wofstream).
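A rough sketch of that direction (Windows-only; the wide-filename std::ofstream constructor is a Microsoft extension, and the file name here is just an example):

#include <stdio.h>
#include <fstream>

int main() {
    // Wide-character (UTF-16) filename, not limited by the system code page.
    const wchar_t* name = L"\u65E5\u672C\u8A9E.txt";   // a Japanese file name

    FILE* f = _wfopen(name, L"w");   // CRT wide-character open
    if (f) fclose(f);

    std::ofstream os(name);          // wchar_t* overload: a Microsoft extension
}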
I'm not aware of any portable way of handling Unicode file names with only the default system libraries. But there are some frameworks that provide portable functions, for example:
for C: glib uses filenames in UTF-8;
for C++: glibmm also uses filenames in UTF-8, requires glib;
for C++: boost can use wstring for filenames.
I'm pretty sure .NET/mono frameworks also do contain portable filesystem functions, but I don't know them.
Is somebody still watching this? I've just researched this question and found no answers anywhere, so I can try to explain my findings here.
In VS2005 the fstream filename handling is the odd man out: it doesn't use the system default encoding, the one you get with GetACP and set in Control Panel/Region and Language/Administrative, but always CP 1252, I believe.
This can cause big confusion, and Microsoft has removed this quirk in later VS versions.
All workarounds for VS2005 have their drawbacks:
Convert your code to use Unicode everywhere.
Never open fstreams using narrow-character filenames; always convert them to Unicode using the system default encoding yourself, then use the wide-character filename open/ctor.
Retrieve the codepage using GetACP(), then do a matching setlocale:
setlocale(LC_ALL, ("." + lexical_cast<string>(GetACP())).c_str());
I'm nearly certain that on Linux, the filename string is a UTF-8 string (on the EXT3 filesystem, for example, the only disallowed chars are slash and NULL), stored in a normal char *. The man page doesn't seem to mention character encoding, which is what leads me to believe it is the system standard of UTF-8. OS X likely uses the same, since it comes from similar roots, but I am less sure about this.
You may have to set the thread locale to the system default locale.
See here for a possible reason for your problems:
http://connect.microsoft.com/VisualStudio/feedback/ViewFeedback.aspx?FeedbackID=100887
Mac OS X uses Unicode as its native character encoding. The basic string objects are CFString and NSString. They store arrays of characters as Unicode.