What is the difference between:
\r\n - Carriage return followed by line feed.
\n - Line feed.
\r - Carriage Return.
They are the line terminators used by different systems:
\r\n = Windows
\n = UNIX and Mac OS X
\r = Old Mac
If you want to write one out to a file, you can use std::endl to abstract it:
std::cout << "Hello World" << std::endl;
In general, an \r character moves the cursor to the beginning of the line, while an \n moves the cursor down one line. However, different platforms interpret this in different ways, leading to annoying compatibility issues, especially between Windows and UNIX. This is because Windows requires an \r\n to move down one line and move the cursor to the start of the line, whereas on UNIX a single \n suffices.
Also, obligatory Jeff Atwood link: http://www.codinghorror.com/blog/2010/01/the-great-newline-schism.html
Historical info
The terminology comes from typewriters. Back in the day, when people wrote on typewriters, at the end of a line you'd press a key that would mechanically return the carriage to the left side of the page and feed the page up a line. The terminology was adopted by computers and represented as the ASCII control codes 0x0a for the line feed and 0x0d for the carriage return. Different operating systems use them differently, which leads to problems when editing a text file written on a Unix machine on a Windows machine and vice versa.
Pragmatic info
On Unix-based machines, a newline in a text file is represented by the line feed character, 0x0a. On Windows, both a carriage return and a line feed are used. For example, when code on Linux writes the following to a file opened in text mode:
fprintf(f, "\n");
the underlying runtime will insert only a line feed character, 0x0a, into the file. On Windows it will translate the \n and insert 0x0d 0x0a. Same code, but different results depending on the operating system used. However, this changes if the file is opened in binary mode on Windows: in that case the insertion is done literally and only a line feed character is inserted. This means that the sequence you asked about, \r\n, will have a different representation on Windows depending on whether it is output to a binary or a text stream. If it is a text stream you'll see the following in the file: 0x0d 0x0d 0x0a. If the file was opened in binary mode you'll see: 0x0d 0x0a.
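To make that difference concrete, here is a minimal sketch (the file names are made up for illustration) that writes the same characters through a text-mode and a binary-mode stdio stream and then dumps the resulting bytes. On Windows the text-mode file will contain 0x0d 0x0a for each \n written, while the binary-mode file contains exactly the bytes that were written; on Unix the two files come out identical.

#include <cstdio>

// Dump the raw bytes of a file; binary mode on the read side avoids any translation.
static void dump(const char *path)
{
    std::FILE *f = std::fopen(path, "rb");
    if (!f) return;
    std::printf("%s:", path);
    int c;
    while ((c = std::fgetc(f)) != EOF)
        std::printf(" 0x%02x", c);
    std::printf("\n");
    std::fclose(f);
}

int main()
{
    std::FILE *text = std::fopen("text_mode.txt", "w");    // text mode
    std::FILE *bin  = std::fopen("binary_mode.txt", "wb"); // binary mode
    if (!text || !bin) return 1;
    std::fprintf(text, "a\n");  // on Windows this writes 0x61 0x0d 0x0a
    std::fprintf(bin,  "a\n");  // this always writes 0x61 0x0a
    std::fclose(text);
    std::fclose(bin);
    dump("text_mode.txt");
    dump("binary_mode.txt");
}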
Because of the differences in how operating systems represent newlines in text files, text editors have had to evolve to deal with them, although some, like Notepad, still don't know what to do. So, in general, if you're working on Windows and you're given a text file that was originally written on a Unix machine, it's not a good idea to edit it in Notepad, because it will insert Windows-style carriage return line feed pairs (0x0d 0x0a) into the file when you really want a 0x0a. This can cause problems for programs running on old Unix machines that read text files as input.
See http://en.wikipedia.org/wiki/Newline
Different operating systems have different conventions; Windows uses \r\n, classic Mac OS (before OS X) used \r, and UNIX uses \n.
\r
This sends the cursor to the beginning column of the display
\n
This moves the cursor to the new line of the display, but the cursor stays in the same column as the previous line.
\r\n
Combines the two: the cursor is moved to the new line, and it is also moved to the first column of the display.
Some runtime environments print both a new line and a carriage return when you specify only \n (text-mode streams on Windows do this).
Related
We know that Windows uses a CR + LF pair as its new line, Unix (including Linux and OS X) uses a single LF, while MacOS uses a single CR.
Does that mean that the interpretation of a newline in C and C++ depends upon the execution environment, even though K&R (section 1.5.3 Line Counting) states the following very categorically?
so '\n' stands for the value of the newline character, which is 10 in ASCII.
We know that Windows uses a CR + LF pair as its new line,…
The page you link to does not say Windows uses “CR + LF” as its new line character. It says Windows marks the end of a line in a text file with a carriage-return character and a line-feed character. That does not mean those characters are a new-line character or vice-versa.
Does that mean that the interpretation of a newline…
The new-line character is a new-line character. In C, it is intended to mark a new line. When ASCII is used, ASCII’s line-feed character (code 10) is typically used as C’s new-line character ('\n').
If a C program reads a Windows-style text file using a binary stream, it will see a carriage-return character and a line-feed marking the ends of lines. If a C program reads a Windows-style text file using a text stream (in an environment that supports this), the Windows line-ending indications (carriage-return character and line-feed character) will be automatically translated to C new-line characters.
Conversely, if a C program writes to a Windows-style text file using a text stream, the new-line characters it writes will be translated to Windows line-ending indications. If it writes using a binary stream, it must write the carriage-return characters and the line-feed characters itself.
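A minimal sketch of the reading side, assuming a Windows-style file named dos.txt (a made-up name) whose lines end in CR LF: read through a text stream on Windows, the CR LF pairs arrive as single '\n' characters; read through a binary stream, the '\r' bytes are visible to the program.

#include <fstream>
#include <iostream>
#include <string>

int main()
{
    std::string line;

    // Text stream: on Windows, each CR LF in the file is delivered as a single '\n'.
    std::ifstream text("dos.txt");
    while (std::getline(text, line))
        std::cout << "text mode, length " << line.size() << "\n";

    // Binary stream: bytes arrive as-is, so each line keeps its trailing '\r'
    // (std::getline still splits on '\n').
    std::ifstream bin("dos.txt", std::ios::binary);
    while (std::getline(bin, line))
        std::cout << "binary mode, length " << line.size()
                  << (!line.empty() && line.back() == '\r' ? " (ends in CR)" : "") << "\n";
}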
Does that mean that the interpretation of a newline in C and C++ depends upon the execution environment
No, the character itself does not change. The interpretation depends on the tool that reads the file, which is usually what the platform suggests but can differ. A robust text tool will tolerate the various line-ending conventions and handle them gracefully.
Further, text files originating on one system are often accessed and edited on other platforms with different rules.
No, \n always means LF.
On Windows there is LF <-> CR-LF conversion performed by the I/O streams (FILE *, std::fstream and friends) if the stream is opened in text mode (as opposed to binary mode).
Does that mean that the interpretation of a newline in C and C++ depends upon the execution environment?
The interpretation of the file contents does indeed depend on the execution environment, so that the C programmer does not have to handle the different conventions explicitly:
if the stream is opened in binary mode ("rb"), no translation is performed and each byte of the file contents is returned directly by getchar(). Unix systems handle text files and binary files identically, so no translation occurs for text files either.
on other systems, streams opened in text mode ("rt" or just "r") are handled in a system-specific way to translate line-ending patterns to the single byte '\n', which in ASCII has the value 10. On Windows and MS/DOS systems, this translation converts CR/LF pairs to a single '\n' byte, which can be implemented as simply removing the CR bytes. This convention was inherited from previous microcomputer operating systems such as Gary Kildall's CP/M, whose APIs were emulated in QDOS, Seattle Computer Products' original 8086 OS that later became MS/DOS.
older Mac systems (before OS X) used to represent line endings with a single CR byte, but Apple changed this when they adopted a Unix kernel for OS X. No translation is performed anymore on macOS.
Antique systems used to have even more cumbersome representations for text files, such as fixed-length records, and the stream implementation would insert extra '\n' bytes to simulate Unix line endings when such streams were read in text mode.
It is important to understand that this translation process is system specific and is not designed to handle files copied from other systems that use a different convention. Advanced text tools such as the QEmacs programmers' editor can detect different line endings and perform the appropriate translation regardless of the current execution environment, preserving the convention used in the file or converting it to another convention under user control.
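As a rough illustration of how a tool can detect the convention a file uses, here is a hedged sketch (the function name and file name are my own) that reads the file in binary mode and counts CR LF, bare LF, and bare CR occurrences:

#include <fstream>
#include <iostream>
#include <string>

// Count the three common line-ending sequences by scanning the file byte by byte.
void count_line_endings(const std::string &path)
{
    std::ifstream in(path, std::ios::binary);
    std::size_t crlf = 0, lf = 0, cr = 0;
    char c, prev = '\0';
    while (in.get(c)) {
        if (c == '\n') {
            if (prev == '\r') ++crlf; else ++lf;
        } else if (prev == '\r') {
            ++cr;               // CR not followed by LF: old Mac style
        }
        prev = c;
    }
    if (prev == '\r')
        ++cr;                   // file ended with a bare CR
    std::cout << path << ": CRLF=" << crlf << " LF=" << lf << " CR=" << cr << "\n";
}

int main()
{
    count_line_endings("example.txt"); // hypothetical file name
}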
When a newline in a string is necessary, I use the \n character:

#include <string>

int main()
{
    std::string str = "Hello world\n";
}

Is \n cross-platform? Or do I need to use a macro that adapts its value to the platform?
Especially when str is going to be written to a file or stdout.
As long as you read/write text streams, or files in text mode, \n will be translated into the correct sequence for the platform.
http://en.cppreference.com/w/c/io
In addition to the previous answer: if you need to read a file on Unix that was saved on Windows, or vice versa, you can use this:
std::getline(file, inputStr);  // file is an open std::ifstream
inputStr.erase(std::remove(inputStr.begin(), inputStr.end(), '\r'), inputStr.end());  // std::remove needs <algorithm>
inputStr.erase(std::remove(inputStr.begin(), inputStr.end(), '\n'), inputStr.end());
It will delete all \r and \n characters from the line.
Another way to put it is that \n is cross-platform for the compiler. It will compile on all platforms and generate correct output for the platform. But the output is not really cross-platform, since a newline in text is represented differently on different platforms. So reading needs extra handling to be platform-independent.
Part of the confusion here is that a string with an \n in it is literally just that - a string with an LF byte (0x0A).
The cross-platform-ness comes into the equation when considering the reading and writing of streams in normal, i.e. not binary, mode.
Stream objects translate \n to \n, \r or \r\n depending on the platform the executable code has been compiled for.
At least this is my understanding of the situation, please correct me if I am wrong about this. It isn't something I have had to worry about much in the past, since I usually exclusively write code for Linux systems.
Thought I should add this since the question doesn't really make sense, although I get what you are asking.
Now I am quite confused about the end of line character. I am working with C++ and I know that text files have an end of line marker which sets the limit for reading a line with a single shifting operator (>>). Data is read continuously until the eol character does not appear, and while opening a file in text mode carriage return (CR) is converted into CRLF which is the eol marker, so if I add white spaces in my text then would it act as an eol marker? Because it does.
Now I created a normal file, i.e. a file without .txt, e.g.
ifstream("test"); // No .txt
Now what is the eol marker in this case?
The ".txt" at the end of the filename is just a convention. It's just part of the filename.
It does not signify any magical property of the file, and it certainly doesn't change how the file is handled by your operating system kernel or file system driver.
So, in short, what difference is there? None.
I know that text files have an end of line marker which sets the limit for reading a line with a single shifting operator (>>)
That is incorrect.
Data is read continuously until the eol character does not appear
Also incorrect. Some operating systems (e.g. Windows IIRC) inject an EOF (not EOL!) character into the stream to signify to calling applications that there is no more data to read. Other operating systems don't even do that. But in neither case is there an actual EOF character at the end of the actual file.
while opening a file in text mode carriage return (CR) is converted into CRLF which is the eol marker
That conversion may or may not happen and, either way, EOL is not EOF.
if I add white spaces in my text then would it act as an eol marker? Because it does.
That's a negative, star command.
I'm not sure where you're getting all this stuff from, but you've been heavily mistaught. I suggest a good, peer-reviewed, well-recommended book from Amazon about how computer operating systems work.
When reading strings in C++ using the extraction operator >>, the default is to skip spaces.
If you want the entire line verbatim, use std::getline.
A typical input loop is:
#include <cstdlib>
#include <fstream>
#include <iostream>
#include <string>

int main(void)
{
    std::string text_from_file;
    std::ifstream input_file("My_data.txt");
    if (!input_file)
    {
        std::cerr << "Error opening My_data.txt for reading.\n";
        return EXIT_FAILURE;
    }
    while (input_file >> text_from_file)
    {
        // Process the variable text_from_file.
    }
    return EXIT_SUCCESS;
}
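If you want whole lines rather than whitespace-delimited tokens, a minimal std::getline-based variant of the same loop (same hypothetical file name) looks like this:

#include <cstdlib>
#include <fstream>
#include <iostream>
#include <string>

int main()
{
    std::ifstream input_file("My_data.txt");
    if (!input_file)
    {
        std::cerr << "Error opening My_data.txt for reading.\n";
        return EXIT_FAILURE;
    }
    std::string line;
    while (std::getline(input_file, line))
    {
        // Process the whole line, including any embedded spaces.
    }
    return EXIT_SUCCESS;
}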
A lot of old and mainframe operating systems required a record structure for all data files which, for text files, originated with a Hollerith (punch) card of 80 columns and was faithfully preserved through disk file records, magnetic tapes, output punch card decks, and line printer lines. No line ending was used because the record structure required that every record have 80 columns (records were typically padded with spaces). In later years (1960s+), having variable-length records with an 80-column maximum became popular. Today, even OpenVMS still requires the file creator to specify a file format (sequential, indexed, or "stream") and record size (fixed, variable), where the maximum record size must be specified in advance.
In the modern era of computing (which effectively began with Unix) it is widely considered a bad idea to force a structure on data files. Any programmer is free to do that to themselves and there are plenty of record-oriented data formats like compiler/linker object files (.obj, .so, .o, .lib, .exe, etc.), and most media formats (.gif, .tiff, .flv, .mov, mp3, etc.)
For communicating text lines, the paradigm is to target a terminal or printer, and for that, line endings should be indicated. Most operating system environments (except MSDOS and Windows) use the \n character, which is encoded in ASCII as a line feed (ASCII 10). MSDOS and its ilk use '\r\n', which is encoded as a carriage return followed by a line feed (ASCII 13, 10). There are advantages and disadvantages to both schemes. But text files may also contain other controls, most commonly the ANSI escape sequences, which control devices in specific ways:
clear the screen, either in part or all of it
eject a printer page, skip some lines, reverse feed, and other little-used features
establish a scrolling region
change the text color
selecting a font, text weight, page size, etc.
For these operations, line endings are not a concern.
Also, data files encoded in ASCII such as JSON and XML (especially HTML with embedded Javascript), might not have any line endings, especially when the data is obfuscated or compressed.
To answer your questions:
I am quite confused about the end of line character I am working with c++ and I know that text files have a end of line marker
Maybe. Maybe not. From a C or C++ program's viewpoint, writing \n indicates to the runtime environment the end of a line. What the system does with that varies by runtime operating environment. For Unix and Linux, no translation occurs (though writing to a terminal-like device converts to \r\n). In MSDOS, '\n' is translated to \r\n. In OpenVMS, '\n' is removed and that record's size is set. Reading does the inverse translation.
which sets the limit for reading a line with a single shifting operator (>>).
There is no such limit: A program can choose to read data byte-by-byte if it wants as well as ignore the line boundaries.
The "shifting operators" are overloaded for filestreams to input or output data but are not related to bit twiddling shifts. These operators were chosen for visual approximation of input/output and due to their low operator precedence.
Data is read continuously until the eol character does not appear
This bit is confusing: I think you meant until eol character appears, which is indeed how the line-oriented functions gets() and fgets() work.
and while opening a file in text mode carriage return (CR) is converted into CRLF which is the eol marker, so if I add white spaces in my text then would it act as an eol marker? Because it does.
Opening the file does not convert anything, but reading from a file might. However, no environment (that I know of) converts input to CR LF. MSDOS converts CR LF on input to \n.
Adding spaces has no effect on end of lines, end of file, or anything. Spaces are just data. However, the C++ streaming operations reading/writing numbers and some other datatypes use whitespace (a sequence of spaces, horizontal tabs, vertical tabs, form feed, and maybe some others) as a delimiter. This convenience feature may cause some confusion.
Now I created a normal file, i.e. a file without .txt, e.g.
ifstream("test"); // No .txt
Now what is the eol marker in this case?
The filename does not determine the file type. In fact, file.txt may not be a text file at all. Using a particular file extension is convenient for humans to communicate a file's purpose, but it is not obligatory.
I'm looking at some Linux-specific code which is outputting the likes of:
\r\x1b[J>
to standard output.
I understand that <ESC>[J represents deleting the contents of the screen from the current line down, but what does \r do here?
I'm also seeing the following:
>user_input\n\r>
where user_input is the text entered by the user. But what is the purpose of the \r here?
The character '\r' is carriage return. It returns the cursor to the start of the line.
It is often used in Internet protocols in conjunction with newline ('\n') to mark the end of a line (most standards specify it as "\r\n", but some allow the reverse order). On Windows the carriage-return newline pair is also used as the end-of-line marker. On the old Macintosh operating system (before OS X) a single carriage return was used instead of a newline as the end-of-line marker, while UNIX and UNIX-like systems (like Linux and OS X) use a single newline.
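Because those protocols mandate the exact byte sequence, it is common to spell the terminator out as "\r\n" and keep text-mode translation out of the way (sockets and binary streams transmit the bytes as-is). A minimal sketch, with a made-up request line, that shows the exact bytes produced:

#include <cstdio>
#include <string>

int main()
{
    // Protocols such as HTTP and SMTP terminate lines with an explicit CR LF.
    std::string request = "GET / HTTP/1.1\r\n";

    // Print the bytes so the trailing 0x0d 0x0a is visible.
    for (unsigned char c : request)
        std::printf("0x%02x ", c);
    std::printf("\n");
}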
Control character \r moves the caret (a.k.a. the text cursor) to the leftmost position within the current line.
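A minimal sketch of that caret movement, using \r to redraw the same line as a simple progress display (on a terminal; redirected to a file you would just see the raw '\r' bytes):

#include <chrono>
#include <iostream>
#include <thread>

int main()
{
    for (int percent = 0; percent <= 100; percent += 10)
    {
        // '\r' moves the cursor back to column 0; the next write overdraws the line.
        std::cout << "\rprogress: " << percent << "%  " << std::flush;
        std::this_thread::sleep_for(std::chrono::milliseconds(200));
    }
    std::cout << "\n";
}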
From Wikipedia:
Systems based on ASCII or a compatible character set use either LF (Line feed, '\n', 0x0A, 10 in decimal) or CR (Carriage return, '\r', 0x0D, 13 in decimal) individually, or CR followed by LF (CR+LF, '\r\n', 0x0D0A). These characters are based on printer commands: the line feed indicated that one line of paper should feed out of the printer, thus instructing the printer to advance the paper one line, and a carriage return indicated that the printer carriage should return to the beginning of the current line. Some rare systems, such as QNX before version 4, used the ASCII RS (record separator, 0x1E, 30 in decimal) character as the newline character.
FWIW - this is a part of carriage control - from mainframe control words to Windows/UNIX/FORTRAN carriage control. Carriage control can be implemented at a language level like FORTRAN does, or system-wide like UNIX and Windows do.
\n arose from limitations of early PDP user "interfaces" - the tty terminal. Go to a museum if you want to see one.
A very simple point: the difference between \n and \r is explained above. But all of these explanations are really saying that carriage control is implementation dependent.
The [J is part of the ANSI escape sequences, which describe what a "standards conforming tty terminal" should do.
DOS used to have ANSI.SYS to provide: colors, underline, bold using those sequences.
http://ascii-table.com/ansi-escape-sequences.php is a good reference for the question: what does some odd-looking string in the output do?
\r is carriage return. Similarly \n is linefeed.
I understand that QString::split should be used to get a QStringList from a multiline QString. But if I have a file and I don't know if it comes from Mac, Windows or Unix, I'm not sure if QString.split("\n") would work well in all the cases. What is the best way to handle this situation?
If it's acceptable to remove blank lines, you can try:
str.split(QRegExp("[\r\n]"), QString::SkipEmptyParts);  // str is the QString being split
This splits the string wherever either newline character (a line feed or a carriage return) is found. Any consecutive line-break characters (e.g. \r\n\r\n or \n\n) will be treated as multiple delimiters with empty parts between them, which are skipped.
Emanuele Bezzi's answer misses a couple of points.
In most cases, a string read from a text file will have been read using a text stream, which automatically translates the OS's end-of-line representation to a single '\n' character. So if you're dealing with native text files, '\n' should be the only delimiter you need to worry about. For example, if your program is running on a Windows system, reading input in text mode, line endings will be marked in memory with single \n characters; you'll never see the "\r\n" pairs that exist in the file.
But sometimes you do need to deal with "foreign" text files.
Ideally, you should probably translate any such files to the local format before reading them, which avoids the issue. Only the translation utility needs to be aware of variant line endings; everything else just deals with text.
But that's not always possible; sometimes you might want your program to handle Windows text files when running on a POSIX system (Linux, UNIX, etc.), or vice versa.
A Windows-format text file on a POSIX system will appear to have an extra '\r' character at the end of each line.
A POSIX-format text file on a Windows system will appear to consist of one very long line with embedded '\n' characters.
The most general approach is to read the file in binary mode and deal with the line endings explicitly.
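For example, here is a hedged sketch of such a reader (the function name is my own): it takes a stream opened in binary mode and accepts \n, \r\n, or a lone \r as the line terminator.

#include <fstream>
#include <iostream>
#include <string>

// Read one line from a binary-mode stream, treating \n, \r\n, or \r as the terminator.
std::istream &getline_any(std::istream &in, std::string &line)
{
    line.clear();
    char c;
    while (in.get(c)) {
        if (c == '\n')
            return in;              // Unix ending
        if (c == '\r') {
            if (in.peek() == '\n')
                in.get();           // Windows ending: consume the LF as well
            return in;              // otherwise an old-Mac ending
        }
        line += c;
    }
    // End of file: report success if a final unterminated line was collected.
    if (!line.empty())
        in.clear(in.rdstate() & ~std::ios::failbit);
    return in;
}

int main()
{
    std::ifstream file("mixed.txt", std::ios::binary); // hypothetical file name
    std::string line;
    while (getline_any(file, line))
        std::cout << line << "\n";
}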
I'm not familiar with QString.split, but I suspect that this:
str.split(QRegExp("[\r\n]"), QString::SkipEmptyParts);
will ignore empty lines, which will appear either as "\n\n" or as "\r\n\r\n", depending on the format. Empty lines are perfectly valid text data; you shouldn't ignore them unless you're certain that it makes sense to do so.
If you need to deal with text input delimited either by "\n", "\r\n", or "\r", then I think something like this:
str.split(QRegExp("\n|\r\n|\r"));
would do the job. (Thanks to parsley72's comment for helping me with the regular expression syntax.)
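Here is a minimal Qt sketch of that approach (written against the Qt 5 QString/QRegExp API; the sample text is made up), which keeps empty lines intact:

#include <cstdio>
#include <QRegExp>
#include <QString>
#include <QStringList>
#include <QTextStream>

int main()
{
    // Mixed line endings: Windows, Unix, and old-Mac style, plus one empty line.
    QString text = "first\r\nsecond\n\nthird\rfourth";

    // Split on "\n", "\r\n", or "\r"; empty parts are kept, so blank lines survive.
    QStringList lines = text.split(QRegExp("\n|\r\n|\r"));

    QTextStream out(stdout);
    for (const QString &line : lines)
        out << '[' << line << "]\n";   // prints [first] [second] [] [third] [fourth]
}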
Another point: you're probably not likely to encounter text files that use just '\r' to delimit lines. That's the format used by Mac OS up to version 9. Mac OS X is based on UNIX, and it uses standard UNIX-style '\n' line endings (though it probably tolerates '\r' line endings as well).