End of the line delimiter conversion between Windows and Linux - c++

I need to read a text file created on Windows in a C++ program compiled on Debian Linux. Unfortunately I have a problem with the end-of-line delimiters. I know that the end-of-line indicators are different on Linux and Windows. Consequently, on Linux my C++ program reads something like "correct_line^M".
My question: how can I correctly read, on Linux, a file created on Windows?
Do I need to convert it manually to Linux representation (I would like to avoid it)?
Thank you.

You'll have to do it yourself. (IMHO, a good library would do it
automatically, in filebuf, if opened in text mode. But the libraries
I'm familiar with don't.)
Depending on what you're doing, it may not matter. Any line oriented
input should accept trailing white space anyway, and the extra 0x0D
character is white space. So except for editors, it usually won't
matter.
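As a small illustration of the point about trailing white space (a sketch only; read_numbers and is_space_char are names made up for this example): in the default "C" locale, '\r' is classified as white space, so formatted extraction with >> skips it along with the '\n'.

```cpp
#include <cctype>
#include <sstream>
#include <string>
#include <vector>

// Extracts every integer from text using operator>>.
// Leading white space, including '\r', is skipped before each number.
std::vector<int> read_numbers(const std::string& text)
{
    std::istringstream in(text);
    std::vector<int> values;
    int value = 0;
    while (in >> value)
        values.push_back(value);
    return values;
}

// True if the current C locale classifies c as white space.
bool is_space_char(char c)
{
    return std::isspace(static_cast<unsigned char>(c)) != 0;
}
```

Calling read_numbers("12\r\n34\r\n") yields the same two values as it would for "12\n34\n". The stray '\r' only becomes visible with unformatted reads such as std::getline.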
If you want to suppress the extra 0x0D when writing the file (under
Windows), just open the file in binary mode. For that matter, when
portability of the file is a concern, it's often a good idea to open the
file in binary mode, and write whichever convention the protocol
requires. (Using the two character sequence 0x0D, 0x0A is more or less
standard, and is what most Internet protocols specify.) When reading,
again, open in binary mode, and write your code to accept any of the
usual conventions: the two character sequence 0x0D, 0x0A, or either a
single 0x0D or a single 0x0A. (This could be done with a filtering
streambuf.)
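A minimal sketch of the "accept any of the usual conventions" idea, written as a free function rather than a filtering streambuf (get_line_any_eol and split_lines are hypothetical names for this example). It assumes the stream was opened with std::ios::binary, so no translation has already happened.

```cpp
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Reads one line from in, accepting "\r\n", a lone '\r', or a lone '\n'
// as the terminator. The terminator is not stored in line.
// Returns false only when there was nothing left to read.
bool get_line_any_eol(std::istream& in, std::string& line)
{
    line.clear();
    char c;
    bool read_any = false;
    while (in.get(c)) {
        read_any = true;
        if (c == '\n')
            return true;
        if (c == '\r') {
            // Swallow the '\n' of a "\r\n" pair, if present.
            if (in.peek() == '\n')
                in.get(c);
            return true;
        }
        line.push_back(c);
    }
    return read_any;  // last line without a terminator, or end of input
}

// Convenience wrapper for in-memory text.
std::vector<std::string> split_lines(const std::string& text)
{
    std::istringstream in(text);
    std::vector<std::string> lines;
    std::string line;
    while (get_line_any_eol(in, line))
        lines.push_back(line);
    return lines;
}
```

With a file, the same function would be used as: std::ifstream f(name, std::ios::binary); while (get_line_any_eol(f, line)) { ... }. Empty lines are preserved.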

You can run dos2unix. It will convert your file.

Related

What is Eol in text file and normal file?

Now I am quite confused about the end of line character. I am working with C++ and I know that text files have an end of line marker which sets the limit for reading a line with a single shifting operator (>>). Data is read continuously until eol character does not appear and while opening a file in text mode carriage return(CR) is converted into CRLF which is eol marker so if I add white spaces in my text then would it act as eol marker cause it does.
Now I created a normal file, i.e. a file without .txt,
e.g.
ifstream("test"); // No .txt
Now what is eol marker in this case
The ".txt" at the end of the filename is just a convention. It's just part of the filename.
It does not signify any magical property of the file, and it certainly doesn't change how the file is handled by your operating system kernel or file system driver.
So, in short, what difference is there? None.
I know that text files have an end of line marker which sets the limit for reading a line with a single shifting operator (>>)
That is incorrect.
Data is read continuously until eol character does not appear
Also incorrect. Some operating systems (e.g. Windows IIRC) inject an EOF (not EOL!) character into the stream to signify to calling applications that there is no more data to read. Other operating systems don't even do that. But in neither case is there an actual EOF character at the end of the actual file.
while opening a file in text mode carriage return(CR) is converted into CRLF which is eol marker
That conversion may or may not happen and, either way, EOL is not EOF.
if I add white spaces in my text then would it act as eol marker cause it does.
That's a negative, star command.
I'm not sure where you're getting all this stuff from, but you've been heavily mistaught. I suggest a good, peer-reviewed, well-recommended book from Amazon about how computer operating systems work.
When reading strings in C++ using the extraction operator >>, the default is to skip spaces.
If you want the entire line verbatim, use std::getline.
A typical input loop is:
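#include <cstdlib>
#include <fstream>
#include <iostream>
#include <string>

int main()
{
    std::string text_from_file;
    std::ifstream input_file("My_data.txt");
    if (!input_file)
    {
        std::cerr << "Error opening My_data.txt for reading.\n";
        return EXIT_FAILURE;
    }
    // Each extraction skips leading white space and reads one
    // white-space-delimited token.
    while (input_file >> text_from_file)
    {
        // Process the variable text_from_file.
    }
    return EXIT_SUCCESS;
}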
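To make the difference between >> and std::getline concrete, here is a short sketch that reads the same text both ways (read_tokens and read_lines are illustrative names, and the input comes from a string instead of a file to keep the example self-contained).

```cpp
#include <sstream>
#include <string>
#include <vector>

// Reads white-space-delimited tokens, the way operator>> does for strings.
std::vector<std::string> read_tokens(const std::string& text)
{
    std::istringstream in(text);
    std::vector<std::string> tokens;
    std::string token;
    while (in >> token)
        tokens.push_back(token);
    return tokens;
}

// Reads whole lines verbatim (minus the '\n') with std::getline.
std::vector<std::string> read_lines(const std::string& text)
{
    std::istringstream in(text);
    std::vector<std::string> lines;
    std::string line;
    while (std::getline(in, line))
        lines.push_back(line);
    return lines;
}
```

For the input "hello world\nsecond line\n", read_tokens produces four tokens and read_lines produces two lines, the first being "hello world" with its embedded space intact.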
A lot of old and mainframe operating systems required a record structure of all data files which, for text files, originated with a Hollerith (punch) card of 80 columns and was faithfully preserved through disk file records, magnetic tapes, output punch card decks, and line printer lines. No line ending was used because the record structure required that every record have 80 columns (and were typically filled with spaces). In later years (1960s+), having variable length records with an 80 column maximum became popular. Today, even OpenVMS still requires the file creator to specify a file format (sequential, indexed, or "stream") and record size (fixed, variable) where the maximum record size must be specified in advance.
In the modern era of computing (which effectively began with Unix) it is widely considered a bad idea to force a structure on data files. Any programmer is free to do that to themselves and there are plenty of record-oriented data formats like compiler/linker object files (.obj, .so, .o, .lib, .exe, etc.), and most media formats (.gif, .tiff, .flv, .mov, mp3, etc.)
For communicating text lines, the paradigm is to target a terminal or printer and for that, line endings should be indicated. Most operating systems environments (except MSDOS and Windows) use the \n character which is encoded in ASCII as a linefeed (ASCII 10) code. MSDOS and ilk use '\r\n' which are encoded as carriage return then linefeed (ASCII 13, 10). There are advantages and disadvantages to both schemes. But text files may also contain other controls, most commonly the ANSI escape sequences which control devices in specific ways:
clear the screen, either in part or all of it
eject a printer page, skip some lines, reverse feed, and other little-used features
establish a scrolling region
change the text color
selecting a font, text weight, page size, etc.
For these operations, line endings are not a concern.
Also, data files encoded in ASCII such as JSON and XML (especially HTML with embedded Javascript), might not have any line endings, especially when the data is obfuscated or compressed.
To answer your questions:
I am quite confused about the end of line character. I am working with C++ and I know that text files have an end of line marker
Maybe. Maybe not. From a C or C++ program's viewpoint, writing \n indicates to the runtime environment the end of a line. What the system does with that varies by runtime operating environment. For Unix and Linux, no translation occurs (though writing to a terminal-like device converts to \r\n). In MSDOS, '\n' is translated to \r\n. In OpenVMS, '\n' is removed and that record's size is set. Reading does the inverse translation.
which sets the limit for reading a line with a single shifting operator (>>).
There is no such limit: A program can choose to read data byte-by-byte if it wants as well as ignore the line boundaries.
The "shifting operators" are overloaded for filestreams to input or output data but are not related to bit twiddling shifts. These operators were chosen for visual approximation of input/output and due to their low operator precedence.
Data is read continuously until eol character does not appear
This bit is confusing: I think you meant until eol character appears, which is indeed how the line-oriented functions gets() and fgets() work.
and while opening a file in text mode carriage return(CR) is converted into CRLF which is eol marker so if I add white spaces in my text then would it act as eol marker cause it does.
Opening the file does not convert anything, but reading from a file might. However, no environment (that I know of) converts input to CR LF. MSDOS converts CR LF on input to \n.
Adding spaces has no effect on end of lines, end of file, or anything. Spaces are just data. However, the C++ streaming operations reading/writing numbers and some other datatypes use whitespace (a sequence of spaces, horizontal tabs, vertical tabs, form feed, and maybe some others) as a delimiter. This convenience feature may cause some confusion.
Now I created a normal file, i.e. a file without .txt, e.g.
ifstream("test"); // No .txt
Now what is eol marker in this case
The filename does not determine the file type. In fact, file.txt may not be a text file at all. Using a particular file extension is convenient for humans to communicate a file's purpose, but it is not obligatory.

Read text-file in C++ with fopen without linefeed conversion

I'm working with text-files (UTF-8) on Windows and want to read them using C++.
To open the file correctly, I use fopen. As described here, there are two options for opening the file:
Text mode "rt" (Carriage return + Linefeed will automatically be converted into Linefeed; Short "\r\n" becomes "\n").
Binary mode "rb" (The file will be read byte by byte).
Now it becomes tricky. I don't want to open the file in binary mode, since I would lose the correct handling of my UTF-8 characters (and there are special characters in my text-files, which are corrupted when interpreted as ANSI-character). But I also don't want fopen to convert all my CR+LF into LF.
Is there a way to combine the two modes, to read a text-file into a string without tampering with the linefeeds, while still being able to read UTF-8 correctly?
I am aware, that the reverse conversion would happen, if I write it through the same file, but the string is sent to another application that expects Windows-style line-endings.
The difference between opening files in text and binary mode is exactly the handling of line-end sequences: text mode translates them, binary mode does not touch them. Nothing more, nothing less. ASCII characters use the same code points in Unicode, and UTF-8 retains the encoding of ASCII characters (i.e., every ASCII file happens to be a UTF-8 encoded Unicode file), so whether you use binary or text mode won't affect the other bytes.
It may be worth having a look at James McNellis's "Unicode in C++" presentation at C++Now 2014.
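A rough sketch of what this means in practice: reading with "rb" returns the bytes exactly as stored, so both the multi-byte UTF-8 sequences and the CR+LF pairs arrive untouched. The helper names write_bytes and read_bytes and the file name used in the usage note are assumptions made for this example only.

```cpp
#include <cstdio>
#include <string>

// Writes data to path without any line-ending translation.
bool write_bytes(const char* path, const std::string& data)
{
    std::FILE* f = std::fopen(path, "wb");
    if (!f)
        return false;
    const std::size_t written = std::fwrite(data.data(), 1, data.size(), f);
    const bool closed = (std::fclose(f) == 0);
    return closed && written == data.size();
}

// Reads the whole file at path byte for byte; returns an empty string
// if the file cannot be opened.
std::string read_bytes(const char* path)
{
    std::string out;
    std::FILE* f = std::fopen(path, "rb");
    if (!f)
        return out;
    char buf[4096];
    std::size_t n;
    while ((n = std::fread(buf, 1, sizeof buf, f)) > 0)
        out.append(buf, n);
    std::fclose(f);
    return out;
}
```

For instance, after write_bytes("utf8_crlf_demo.txt", text) with text containing the UTF-8 bytes 0xC3 0xBC followed by "\r\n", read_bytes returns an identical string. Whether those bytes are then displayed correctly depends on how the receiving code interprets them, not on the fopen mode.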

Choosing line ending with libxml2

I try to generate some xml files (TMX) on our servers.
The servers are Solaris SPARC servers, but the destination of the files are some legacy Windows CAT Tools.
The CAT-Tool requires CR+LF line endings, as is the default on Windows. Writing the files with libxml2, using xmlWriter, is easy and works quite well. But I haven't figured out a way to force the lib to emit CR+LF instead of the Unix standard LF. The lib only seems to support the line ending of the platform it runs on.
Has somebody found a way to generate files with a line ending other than the default of the platform it runs on? Currently my workaround is to open the written file and write a new file with the changed line ending using a simple C loop. That works, but it is annoying to have such an unnecessary step in our chain.
I haven't tried this myself, but from xmlsave, I can see two possibilities
xmlSaveToBuffer: save to a buffer, convert to CR/LF and write it out yourself.
xmlSaveToIO: register an iowrite callback and convert to CR/LF while writing in your callback function
Maybe, there are other options, but I haven't found them.
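For the "convert it yourself" step in either approach, the conversion itself is small. The following is only a sketch of that step on an in-memory string; lf_to_crlf is a made-up helper, and how the bytes get out of libxml2 (buffer or callback) is left out deliberately.

```cpp
#include <string>

// Returns a copy of text in which every '\n' that is not already
// preceded by '\r' becomes "\r\n". Existing "\r\n" pairs are kept as-is.
std::string lf_to_crlf(const std::string& text)
{
    std::string out;
    out.reserve(text.size() + text.size() / 16);
    char prev = '\0';
    for (char c : text) {
        if (c == '\n' && prev != '\r')
            out.push_back('\r');
        out.push_back(c);
        prev = c;
    }
    return out;
}
```

When used from a write callback that receives the output in chunks, the last character of the previous chunk would have to be remembered between calls, since a "\r\n" pair could in principle be split across two chunks.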
The CAT-Tool requires CR+LF line endings as is the default on Windows.
FWIW, that means the CAT-Tool has a broken XML parser. It shouldn't care about this, as the XML spec says:
To simplify the tasks of applications, the XML processor must behave as if it normalized all line breaks ... by translating both the two-character sequence #xD #xA and any #xD that is not followed by #xA to a single #xA character.
I know often these things are out of our control, but if you can lean on the CAT-Tool vendor to fix their software, it could become a more future-proof solution.
According to the source code (as of April 2013), libxml2 just puts "\n" into the output stream, at least when writing the DTD part of a document. Therefore, re-encoding the stream on the fly is the only option to get "\r\n" as a result.
If you are lucky (as I was) and your tool runs on Windows, you can open the file in text mode, and the OS will do the recoding for you.

QString::split() and "\r", "\n" and "\r\n" convention

I understand that QString::split should be used to get a QStringList from a multiline QString. But if I have a file and I don't know if it comes from Mac, Windows or Unix, I'm not sure if QString.split("\n") would work well in all the cases. What is the best way to handle this situation?
If it's acceptable to remove blank lines, you can try:
QString.split(QRegExp("[\r\n]"),QString::SkipEmptyParts);
This splits the string wherever one of the newline characters (either line feed or carriage return) is found. Consecutive line breaks (e.g. \r\n\r\n or \n\n) are treated as multiple delimiters with empty parts between them, and those empty parts are skipped.
Emanuele Bezzi's answer misses a couple of points.
In most cases, a string read from a text file will have been read using a text stream, which automatically translates the OS's end-of-line representation to a single '\n' character. So if you're dealing with native text files, '\n' should be the only delimiter you need to worry about. For example, if your program is running on a Windows system, reading input in text mode, line endings will be marked in memory with single \n characters; you'll never see the "\r\n" pairs that exist in the file.
But sometimes you do need to deal with "foreign" text files.
Ideally, you should probably translate any such files to the local format before reading them, which avoids the issue. Only the translation utility needs to be aware of variant line endings; everything else just deals with text.
But that's not always possible; sometimes you might want your program to handle Windows text files when running on a POSIX system (Linux, UNIX, etc.), or vice versa.
A Windows-format text file on a POSIX system will appear to have an extra '\r' character at the end of each line.
A POSIX-format text file on a Windows system will appear to consist of one very long line with embedded '\n' characters.
The most general approach is to read the file in binary mode and deal with the line endings explicitly.
I'm not familiar with QString.split, but I suspect that this:
QString.split(QRegExp("[\r\n]"),QString::SkipEmptyParts);
will ignore empty lines, which will appear either as "\n\n" or as "\r\n\r\n", depending on the format. Empty lines are perfectly valid text data; you shouldn't ignore them unless you're certain that it makes sense to do so.
If you need to deal with text input delimited either by "\n", "\r\n", or "\r", then I think something like this:
QString.split(QRegExp("\n|\r\n|\r"));
would do the job. (Thanks to parsley72's comment for helping me with the regular expression syntax.)
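For readers without Qt at hand, the same idea can be tried with the standard library alone. This is an analogue of the regular expression above using std::regex, not Qt code, and split_lines_regex is a name invented for the example. The alternatives are ordered so that "\r\n" is tried before a lone '\r'.

```cpp
#include <regex>
#include <string>
#include <vector>

// Splits text at "\r\n", '\n', or '\r', keeping empty lines.
std::vector<std::string> split_lines_regex(const std::string& text)
{
    static const std::regex eol("\r\n|\n|\r");
    // Submatch index -1 selects the pieces between the matches.
    std::sregex_token_iterator first(text.begin(), text.end(), eol, -1);
    std::sregex_token_iterator last;
    return std::vector<std::string>(first, last);
}
```

Unlike the SkipEmptyParts variant, an empty line between two line breaks shows up as an empty string in the result, so line numbers stay meaningful.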
Another point: you're not likely to encounter text files that use just '\r' to delimit lines. That's the format used by Mac OS up to version 9. Mac OS X is based on UNIX, and it uses standard UNIX-style '\n' line endings (though it probably tolerates '\r' line endings as well).

Encoding issue using XZIP

I wrote a C++ program that needs to zip files as part of its work. For creating these zip files I used the XZip library. During development the program ran on a Win7 machine, where it works fine.
Now the program should be used on a WindowsXP machine. The issue I run into is:
If I let XZip create the zip archive "ü.zip" and add the file "ü.txt" to it on Win7 it is working as intended. On WindowsXP however I end up having the "ü.zip" file with "³.txt" as file in it.
The "³" => "ü" thing is of course an encoding issue between UTF8 and Ascii (ü = 252 in UTF8 and 252 = ³ in Ascii) BUT I can't really imagine how this could affect the creating of the internal zip structure in different ways depending on the OS.
//EDIT to clear it up:
the problem is that I run a test with XZip on Win7 and get the archive "ü.zip" containing the file with name "ü.txt".
When I run that test on an XP machine I get the archive "ü.zip" containing the file "³.txt".
//Edit2:
The thing that makes me wonder about that is, what exactly causes the zip to change between XP and Win7. The fact that it does change means that either a windows function behaves differently or XZip has specific behavior for different OS built in.
When having a quick look at XZip I can't see that it changes the encoding flag on the zip archives. The question of course only can be answered by people who did have a closer look into this exact problem before.
As a general rule, if you want any sort of portability between locales, OS's (including different versions) and what have you, you should limit your filenames to the usual 26 letters, the 10 digits, and perhaps '_' and '-' (and I'm not even sure about the latter), and one '.', no more than three characters from the end. Once you start using letters beyond the original ASCII character set, you're at the mercy of the various programs which interpret the character set.
Also, 252 isn't anything in ASCII, since ASCII only uses character codes in the range 0...127. And in UTF-8, 252 would be the first byte of a six byte character, something that doesn't exist in Unicode: in UTF-8, LATIN SMALL LETTER U WITH DIAERESIS would be the two byte sequence 0xC3, 0xBC. 252 is the encoding of LATIN SMALL LETTER U WITH DIAERESIS in ISO 8859-1, otherwise known as Latin-1; it's also the encoding in UTF-16 and UTF-32.
None of this, of course, should affect what is in the file.
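The byte values mentioned above can be checked with a few lines of code. This is a bare-bones UTF-8 encoder written for illustration (encode_utf8 is not part of any library discussed here, and it does no validation of surrogates or out-of-range values).

```cpp
#include <string>

// Encodes one Unicode code point (assumed valid, at most U+10FFFF) as UTF-8.
std::string encode_utf8(char32_t cp)
{
    std::string out;
    if (cp < 0x80) {                    // 1 byte: plain ASCII
        out.push_back(static_cast<char>(cp));
    } else if (cp < 0x800) {            // 2 bytes: 110xxxxx 10xxxxxx
        out.push_back(static_cast<char>(0xC0 | (cp >> 6)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    } else if (cp < 0x10000) {          // 3 bytes
        out.push_back(static_cast<char>(0xE0 | (cp >> 12)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    } else {                            // 4 bytes
        out.push_back(static_cast<char>(0xF0 | (cp >> 18)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 12) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    }
    return out;
}
```

encode_utf8(0xFC) gives the two bytes 0xC3 0xBC, whereas Latin-1 stores the same character as the single byte 0xFC (252). A program that writes one and reads the other will show exactly the kind of mismatch described in the question.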
Maybe you are building your Win32 program (or the library) as ASCII (not as UNICODE). It may help to build your Win32 applications with the UNICODE configuration setting (you can change it in your Visual Studio project settings).
It is impossible to say what happened in your program without seeing your code. Maybe your library or the archive format is not UNICODE-aware, maybe your program's code is not UNICODE-aware, maybe you don't handle strings carefully enough, or maybe you just have to change your project setting to UNICODE. Your "8-bit encoding for non-Unicode programs" Windows OS setting also matters if you don't use UNICODE strings.
As for 252, UTF-8 and ASCII, see the post by James Kanze. It is more or less safe to use ASCII file names with no ':', '?', '*', '/', '\' characters. Using non-ASCII characters may lead to encoding problems if you are not using UNICODE-based programs and file systems.