I tried using fopen in C; the second parameter is the open mode. The two modes "r" and "rb" confuse me a lot. They seem to be the same, but sometimes it is said to be better to use "rb". So, why does "r" exist?
Explain it to me in detail or with examples.
Thank You.
You should use "r" for opening text files. Different operating systems have slightly different ways of storing text, and this will perform the correct translations so that you don't need to know about the idiosyncracies of the local operating system. For example, you will know that newlines will always appear as a simple "\n", regardless of where the code runs.
You should use "rb" if you're opening non-text files, because in this case, the translations are not appropriate.
On Linux, and Unix in general, "r" and "rb" are the same. More specifically, a FILE pointer obtained by fopen()ing a file in text mode and in binary mode behaves the same way on Unixes. On Windows, and in general on systems that use more than one character to represent "newlines", a file opened in text mode behaves as if all those characters are just one character, '\n'.
If you want to portably read/write text files on any system, use "r", and "w" in fopen(). That will guarantee that the files are written and read properly. If you are opening a binary file, use "rb" and "wb", so that an unfortunate newline-translation doesn't mess your data.
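To make the difference concrete, here is a minimal sketch; the file names are made up for illustration:

#include <stdio.h>

int main(void)
{
    /* Write a single newline through a text stream and through a binary stream. */
    FILE *ftext = fopen("text.txt", "w");   /* text mode: '\n' may be translated */
    FILE *fbin  = fopen("bin.txt", "wb");   /* binary mode: bytes pass through untouched */
    if (!ftext || !fbin)
        return 1;

    fputc('\n', ftext);
    fputc('\n', fbin);
    fclose(ftext);
    fclose(fbin);

    /* On Windows, text.txt now holds the two bytes 0x0D 0x0A, while bin.txt
       holds the single byte 0x0A. On POSIX systems both files are identical. */
    return 0;
}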
Note that a consequence of the underlying system doing the newline translation for you is that you can't determine the number of characters you will be able to read from a text-mode file by seeking to its end with fseek(file, 0, SEEK_END) and asking ftell(): the reported position counts the bytes on disk, not the translated characters the stream will actually deliver.
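A small sketch of that pitfall, assuming demo.txt is a Windows text file with CR+LF line endings (the name is made up):

#include <stdio.h>

int main(void)
{
    /* Open a text file in text mode and compare two "sizes". */
    FILE *f = fopen("demo.txt", "r");
    if (!f)
        return 1;

    fseek(f, 0, SEEK_END);
    long reported = ftell(f);      /* on Windows: raw byte count on disk */
    rewind(f);

    long delivered = 0;            /* characters the stream actually yields */
    while (fgetc(f) != EOF)
        delivered++;
    fclose(f);

    /* On Windows, delivered < reported whenever the file contains "\r\n"
       pairs; on POSIX systems the two numbers match. */
    printf("reported=%ld delivered=%ld\n", reported, delivered);
    return 0;
}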
Finally, see What's the difference between text and binary I/O? on comp.lang.c FAQs.
use "rb" to open a binary file. Then the bytes of the file won't be encoded when you read them
"r" is the same as "rt" for Translated mode
"rb" is
non-translated mode.
This makes a difference on Windows, at least. See that link for details.
On most POSIX systems the 'b' is ignored, but check your system to be sure.
XNU
The mode string can also include the letter 'b' either as last character or as a character between the characters in any of the two-character strings described above. This is strictly for compatibility with ISO/IEC 9899:1990 ('ISO C90') and has no effect; the 'b' is ignored.
Linux
The mode string can also include the letter 'b' either as a last character or as a character between the characters in any of the two-character strings described above. This is strictly for compatibility with C89 and has no effect; the 'b' is ignored on all POSIX conforming systems, including Linux. (Other systems may treat text files and binary files differently, and adding the 'b' may be a good idea if you do I/O to a binary file and expect that your program may be ported to non-UNIX environments.)
Related
I was wondering: why does writing to a file with the standard library convert your \n into \r\n? Is it standard behaviour according to C99, or a "commodity" added by MS? If it is standard, then with the old Apple convention (\r), would writing "toto\n" to a file write "toto\r"? Is the current behaviour there so that you could read UNIX files but UNIXes could not read yours? (I love conspiracy theories)
why does writing to a file with the standard library convert your \n into \r\n?
It's to make code more portable; you can just use \n in your program and have it work on UNIX, Windows, Macs, and (supposedly) everything else.
Is it standard behaviour according to C99, or a "commodity" added by MS?
Yes, it's standard.
If it is standard, with the old Apple convention (\r), would writing "toto\n" to a file write "toto\r"?
Yes, translating end-of-line characters is expected.
Is the current behaviour there so that you could read UNIX files but UNIXes could not read yours?
No, there's no conspiracy.
From the C11 spec, §7.21.2 Streams, ¶2:
… Characters may have to be added, altered, or deleted on input and output to conform to differing conventions for representing text in the host environment. Thus, there need not be a one-to-one correspondence between the characters in a stream and those in the external representation. …
If you don't want this behaviour, open your file as a binary stream rather than as a text stream.
My question seems to have confused folks. Here's something concrete:
Our code does the following:
FILE * fout = _tfsopen(_T("丸穴種類.txt"), _T("w"), _SH_DENYNO);
_fputts(W2T(L"刃物種類\n"), fout);
fclose(fout);
Under MBCS build target, the above produces a properly encoded file for code page 932 (assuming that 932 was the system default code page when this was run).
Under UNICODE build target, the above produces a garbage file full of ????.
I want to define a symbol, or use a compiler switch, or include a special header, or link to a given library, to make the above continue to work when the build target is UNICODE without changing the source code.
Here's the question as it used to exist:
FILE* streams can be opened in t(ranslated) or b(inary) modes.
Desktop applications can be compiled for UNICODE or MBCS (under Windows).

If my application is compiled for MBCS, then writing MBCS strings to a "wt" stream results in a well-formed text file containing MBCS text for the system code page (i.e. the code page "for non-Unicode software").

Because our software generally uses the _t versions of most string & stream functions, in MBCS builds output is handled primarily by puts(pszMBString) or something similar (putc etc.). Since pszMBString is already in the system code page (e.g. 932 when running on a Japanese machine), the string is written out verbatim (although line terminators are massaged by puts and gets automatically).

However, if my application is compiled for UNICODE, then writing MBCS strings to a "wt" stream results in garbage (lots of "?????" characters); i.e. I convert the UNICODE to the system's default code page and then write that to the stream using, for example, fwrite(pszNarrow, 1, length, stream).

I can open my streams in binary mode, in which case I'll get the correct MBCS text... but the line terminators will no longer be PC-style CR+LF, but instead UNIX-style LF only. This is because in binary (non-translated) mode, the file stream doesn't handle the LF->CR+LF translation.

But what I really need is to be able to produce the exact same files I used to be able to produce when compiling for MBCS: correct line terminators and MBCS text files using the system's code page.

Obviously I can manually adjust the line terminators myself and use binary streams. However, this is a very invasive approach, as I now have to find every bit of code throughout the system that writes text files, and alter it so that it does all of this correctly. What blows my mind is that the UNICODE target is stupider / less capable than the MBCS target we used to use! Surely there is a way to tell the C library to "output narrow strings as-is but handle line terminators properly, exactly as you'd do in MBCS builds"?!
Sadly, this is a huge topic that deserves a small book devoted to it. And that book would basically need a specialized chapter for every target platform one wished to build for (Linux, Windows [flavor], Mac, etc.).
My answer is only going to cover Windows desktop applications, compiled for C++ with or without MFC.
Please note: this pertains to reading in and writing out MBCS (narrow) files from a UNICODE build using the system default code page (i.e. the code page for non-Unicode software). If you want to read and write Unicode files from a UNICODE build, you must open the files in binary mode and handle BOM and line-feed conversions manually. On input, you must skip the BOM (if any), convert the external encoding to Windows Unicode (i.e. UTF-16LE), and convert any CR+LF sequences to LF only. On output, you must write the BOM (if any), convert from UTF-16LE to whatever target encoding you want, and convert LF to CR+LF sequences for the result to be a properly formatted PC text file.
BEWARE of MS's standard C library's puts and gets and fwrite and so on: if the stream is opened in text/translated mode, they will convert any 0x0A to a 0x0D 0x0A sequence on write, and vice versa on read, regardless of whether you're reading or writing a single byte, a wide character, or a stream of random binary data. They don't care; all of these functions boil down to blind byte conversions in text/translated mode!
Also be aware that many of the Windows API functions use CP_ACP internally, with no external control over their behavior (e.g. WritePrivateProfileString()). Hence one might want to ensure that all libraries operate with the same character locale, CP_ACP and not some other one: since you can't control those functions' behavior, you're forced to conform to their choice or not use them at all.
If using MFC, one needs to:
// force CP_ACP *not* CP_THREAD_ACP for MFC CString auto-converters!!!
// this makes MFC's CString and CStdioFile and other interfaces use the
// system default code page, instead of the thread default code page (which is normally "c")
#define _CONVERSION_DONT_USE_THREAD_LOCALE
For C++ and C libraries, one must tell the libraries to use the system code page:
// force C++ and C libraries based on setlocale() to use system locale for narrow strings
// (this automatically calls setlocale() which makes the C library do the same thing as C++ std lib)
// we only change the LC_CTYPE, not collation or date/time formatting
std::locale::global(std::locale(str(boost::format(".%||") % GetACP()).c_str(), LC_CTYPE));
I do the #define in all of my precompiled headers, before including any other headers. I set the global locale in main (or its moral equivalent), once for the entire program (you may need to call this for every thread that is going to do I/O or string conversions).
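If you are working in plain C, or want to avoid the Boost dependency, the same effect can likely be had with setlocale() directly; a minimal sketch, assuming the Microsoft CRT (which documents ".ACP" as a special code-page name meaning the system default ANSI code page):

#include <locale.h>

/* A plain-C alternative to the std::locale call above: the Microsoft CRT
   accepts the special code-page name ".ACP", which selects the system
   default ANSI code page. As above, only LC_CTYPE is changed, so collation
   and date/time formatting keep their defaults. */
void use_system_code_page(void)
{
    setlocale(LC_CTYPE, ".ACP");
}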
The build target is UNICODE, and for most of our I/O, we use explicit string conversions before outputting via CStringA(my_wide_string).
One other thing to be aware of: there are two different sets of multibyte functions in the C standard library under VS C++ - those which use the thread's locale for their operations, and another set which uses the code page set by _setmbcp() (and queried via _getmbcp()). That is an actual code page (not a locale) used for all narrow-string interpretation by those functions (NOTE: it is always initialized to CP_ACP, i.e. GetACP(), by the VS C++ startup code).
Useful reference materials:
- the-secret-family-split-in-windows-code-page-functions
- Sorting it all out (explains that there are four different locales in effect in Windows)
- MS offers some functions that allow you to set the encoding to use directly, but I didn't explore them
- An important note about a change to MFC that caused it to no longer respect CP_ACP, but rather CP_THREAD_ACP by default starting in MFC 7.0
- Exploration of why console apps in Windows are extreme FAIL when it comes to Unicode I/O
- MFC/ATL narrow/wide string conversion macros (which I don't use, but you may find useful)
- Byte order marker, which you need to write out for Unicode files of any encoding to be understood by other Windows software
The C library has support for both narrow (char) and wide (wchar_t) strings. In Windows these two types of strings are called MBCS (or ANSI) and Unicode respectively.
It is fully possible to use the narrow functions even though you have defined _UNICODE. The following code should produce the same output, regardless of whether _UNICODE is defined or not:
FILE* f = fopen("foo.txt", "wt");
fputs("foo\nbar\n", f);
fclose(f);
In your question you wrote: "I convert the UNICODE to the system's default code page and write that to the stream". This leads me to believe that your wide string contains characters that cannot be converted to the current code page, so each of them is replaced with a question mark.
Perhaps you could use some other encoding than the current code page. I recommend using the UTF-8 encoding wherever possible.
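For instance, one way to do that is to convert with WideCharToMultiByte() and write the bytes yourself through a stream opened in binary mode. A sketch (write_utf8 is a made-up helper, not a library function; error handling is minimal):

#include <windows.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: convert a wide string to UTF-8 and write the bytes
   to a stream that was opened in binary mode, so the CRT does not touch them. */
static int write_utf8(FILE *f, const wchar_t *ws)
{
    int bytes = WideCharToMultiByte(CP_UTF8, 0, ws, -1, NULL, 0, NULL, NULL);
    if (bytes <= 0)
        return -1;

    char *buf = (char *)malloc((size_t)bytes);
    if (buf == NULL)
        return -1;
    WideCharToMultiByte(CP_UTF8, 0, ws, -1, buf, bytes, NULL, NULL);

    /* 'bytes' includes the terminating NUL, which we don't write. */
    size_t n = (size_t)bytes - 1;
    int rc = (fwrite(buf, 1, n, f) == n) ? 0 : -1;
    free(buf);
    return rc;
}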
Update: Testing your example code on a Windows machine running on code page 1252, the call to _fputts returns -1, indicating an error. errno was set to EILSEQ, which means "Illegal byte sequence". The MSDN documentation for fopen states that:
When a Unicode stream-I/O function operates in text mode (the default), the source or destination stream is assumed to be a sequence of multibyte characters. Therefore, the Unicode stream-input functions convert multibyte characters to wide characters (as if by a call to the mbtowc function). For the same reason, the Unicode stream-output functions convert wide characters to multibyte characters (as if by a call to the wctomb function).
This is key information for this error. wctomb will use the locale for the C standard library. By explicitly setting the locale for the C standard library to code page 932 (Shift JIS), the code ran perfectly and the output was correctly encoded in Shift JIS in the output file.
#include <locale.h>
#include <share.h>
#include <stdio.h>
#include <wchar.h>

int main(void)
{
    /* Point the C library's wide-to-narrow conversion at code page 932. */
    setlocale(LC_ALL, ".932");
    FILE *fout = _wfsopen(L"丸穴種類.txt", L"w", _SH_DENYNO);
    fputws(L"刃物種類\n", fout);  /* converted via wctomb using CP 932 */
    fclose(fout);
    return 0;
}
An alternative (and perhaps preferable) solution to this would be to handle the conversions yourself before calling the narrow string functions of the C standard library.
When you compile for UNICODE, the C++ library knows nothing about MBCS. If you say you are opening the file for text output, it will attempt to treat the buffers you pass to it as UNICODE buffers.

Also, MBCS is a variable-length encoding. To parse it, the C++ library would need to iterate over characters, which is of course impossible when it knows nothing about MBCS. Hence it's impossible for it to "just handle line terminators correctly".
I would suggest that you either prepare your strings beforehand, or make your own function that writes a string to a file, as sketched below. Not sure if writing characters one by one would be efficient (measurements required), but if not, you can handle strings piecewise, writing everything that doesn't contain \n in one go.
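A sketch of that second approach (write_text_mbcs is a made-up name; it assumes an MBCS encoding in which the byte 0x0A never occurs inside a multibyte character, which holds for code page 932 and the other common DBCS code pages, whose trail bytes are all >= 0x40):

#include <stdio.h>

/* Write an MBCS string to a stream opened in binary mode ("wb"), expanding
   each '\n' to "\r\n" ourselves so the multibyte text is otherwise written
   verbatim. */
static int write_text_mbcs(FILE *f, const char *s)
{
    const char *run = s;           /* start of the current '\n'-free run */
    const char *p = s;

    for (;;) {
        if (*p == '\n' || *p == '\0') {
            size_t len = (size_t)(p - run);
            /* flush everything up to the '\n' (or the end) in one go */
            if (len > 0 && fwrite(run, 1, len, f) != len)
                return -1;
            if (*p == '\0')
                break;
            if (fwrite("\r\n", 1, 2, f) != 2)
                return -1;
            run = p + 1;
        }
        p++;
    }
    return 0;
}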
In Windows, when you read the characters \r\n from a file (or stdin) in text mode, the \r is dropped and you only read \n.
Is there a standard according to which it should be so?
Can I be sure that it will be true for any compiler on Windows? Will other platform-specific character combinations be replaced by \n on those platforms too?
I use this code to generate the input and this code to read it. The results are here. You may note a few missing \r's.
Yes, this comes from compatibility with C. In C text streams, lines are terminated by a newline character. This is the internal representation of the text stream as seen by the program. The I/O library converts between the internal representation and some external one.
The internal representation is platform-independent, whereas there are different platform-specific conventions for text. That's the point of having a text mode in the stream library; portable text manipulating programs can be written which do not have to contain a pile of #ifdef directives to work on different platforms, or build their own platform-independent text abstraction.
It so happens that the internal representation for C text streams matches the native Unix representation of text files, since the C language and its library originated on Unix. For portability of C programs to other platforms, the text stream abstraction was added which makes text files on non-Unix system look like Unix text files.
In the ISO/IEC 9899:1999 standard ("C99"), we have this:
7.19.2 Streams
[...]
A text stream is an ordered sequence of characters composed into lines, each line consisting of zero or more characters plus a terminating new-line character. Whether the last line requires a terminating new-line character is implementation-defined. **Characters may have to be added, altered, or deleted on input and output to conform to differing conventions for representing text in the host environment.** Thus, there need not be a one-to-one correspondence between the characters in a stream and those in the external representation.
Bold emphasis mine. C++ streams are defined in terms of C streams. There is no explanation of text versus binary mode in the C++ standard, except for a table which maps various stream mode flag combinations to strings suitable as mode arguments to fopen.
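To see the abstraction at work, here is a minimal sketch (the file name is made up): write the external representation byte-for-byte in binary mode, then read it back through a text stream:

#include <stdio.h>

int main(void)
{
    /* Write a DOS-style line ending byte-for-byte in binary mode... */
    FILE *f = fopen("demo.txt", "wb");
    if (!f)
        return 1;
    fwrite("hi\r\n", 1, 4, f);
    fclose(f);

    /* ...then read it back through a text stream. */
    f = fopen("demo.txt", "r");
    if (!f)
        return 1;
    int c, count = 0;
    while ((c = fgetc(f)) != EOF)
        count++;
    fclose(f);

    /* On Windows count is 3 ('h', 'i', '\n'): the "\r\n" pair collapsed to
       the internal '\n'. On POSIX systems count is 4, since text and binary
       mode are the same there. */
    printf("%d\n", count);
    return 0;
}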
std::fstream has the option to consider streams as binary, rather than textual. What's the difference?
As far as I know, it all depends on how the file is opened in other programs. If I write A to a binary stream, it'll just get converted to 01000001 (65, A's ASCII code) - the exact same representation. That can be read as the letter "A" in text editors, or the binary sequence 01000001 for other programs.
Am I missing something, or does it not make any difference whether a stream is considered binary or not?
In text streams, newline characters may be translated to and from the \n character; with binary streams, this doesn't happen. The reason is that different OS's have different conventions for storing newlines; Unix uses \n, Windows \r\n, and old-school Macs used \r. To a C++ program using text streams, these all appear as \n.
On Linux/Unix/Android there is no difference.
On Mac OS X there is no difference, but I think the older classic Mac OS might change a '\n' to a '\r' on reading, and the reverse on writing (only for a text stream).
On Windows, for a text stream some characters get treated specially. A '\n' character is written as "\r\n", and a "\r\n" pair is read as '\n'. A '\x1A' (Ctrl-Z) character is treated as "end of file" and terminates reading.
I think Symbian and PalmOS/WebOS behave the same as Windows.
A binary stream just writes bytes and won't do any transformation on any platform.
You got it the other way around: it's text streams that are special, specifically because of the \n translation to either \n or \r\n (or even \r) depending on your system.
The practical difference is the treatment of line-ending sequences on Microsoft operating systems.
Binary streams return the data in the file precisely as it is stored. Text streams normalize line-ending sequences, replacing them with '\n'.
If you open it as text, then the C or C++ runtime will perform newline conversions appropriate to the host (Windows or Linux).
I have a file on Windows. I'm writing in C++. I have a problem where I need to remove some bytes from the end of the file. I am using ifstream, but I don't know how to remove those chars. Do I simply put '\0' in the file, or what?
On Linux machines, use truncate():
http://linux.die.net/man/2/truncate
On the Windows machines, use SetEndOfFile():
http://msdn.microsoft.com/en-us/library/aa365531%28v=vs.85%29.aspx
Both are OS dependent calls.
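If you want one call that works on both, a sketch wrapping them behind a helper (truncate_file is a made-up name; error handling is minimal):

#ifdef _WIN32
#include <windows.h>

int truncate_file(const char *path, long long new_size)
{
    HANDLE h = CreateFileA(path, GENERIC_WRITE, 0, NULL,
                           OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, NULL);
    if (h == INVALID_HANDLE_VALUE)
        return -1;

    LARGE_INTEGER pos;
    pos.QuadPart = new_size;
    /* Seek to the new end, then cut everything after the file pointer. */
    BOOL ok = SetFilePointerEx(h, pos, NULL, FILE_BEGIN) && SetEndOfFile(h);
    CloseHandle(h);
    return ok ? 0 : -1;
}
#else
#include <unistd.h>

int truncate_file(const char *path, long long new_size)
{
    return truncate(path, (off_t)new_size);
}
#endif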
You can't portably change the size of a file; the only way to do it is to copy the file to a temporary, then delete the original and rename the temporary.
If it's just a case of truncating the file, both Windows and Unix (but not necessarily other systems) have system level functions which can do this, but there's nothing in the standard which supports it. And if you ever end up having to remove bytes other than at the end, neither Windows nor Unix allow it (although some other systems do, at least in specific cases).
Why not truncate the file? Have a look at the chsize() function (spelled _chsize() in newer Microsoft C runtimes).